Integrating PySpark with Salesforce

To connect Spark to Salesforce, the recommended approach is the spark-salesforce library. Several dependencies need to be added to make this work; make sure the core libraries that support XML are downloaded as well.

$ wget 
$ wget
$ wget
$ wget
$ wget
$ wget

The configuration is saved in config.ini. Note the [salesforce] section header; the lookups later on refer to it (the values below are placeholders):

[salesforce]
username = user@example.com
password = securePassw0rd
token = sal3sforceT0ken

Loading the configuration is done using configparser:

from configparser import ConfigParser

config = ConfigParser()
config.read('config.ini')
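ConfigParser returns every value as a string, keyed by section. A minimal sketch of the lookups used later, with the file contents inlined via read_string so it runs without a file on disk (values are placeholders):

```python
from configparser import ConfigParser

# inline a config equivalent to config.ini (placeholder values)
ini = """
[salesforce]
username = user@example.com
password = securePassw0rd
token = sal3sforceT0ken
"""

config = ConfigParser()
config.read_string(ini)  # read_string() parses a literal instead of a file

print(config.sections())                     # ['salesforce']
print(config.get('salesforce', 'username'))  # user@example.com
# get() accepts a fallback for keys that may be absent
print(config.get('salesforce', 'api_version', fallback='53.0'))
```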

When creating the SparkSession, make sure the paths to the different JARs are correctly set:

from pyspark.sql import SparkSession

jars = [
    # absolute paths to the JARs downloaded above
]

spark = (SparkSession
    .builder
    .appName("PySpark with Salesforce")
    .config("spark.driver.extraClassPath", ":".join(jars))
    .getOrCreate())

With the session created, we are ready to pull some data:

soql = "SELECT Name, Industry, Type, BillingAddress, Sic FROM Account"
df = spark \
     .read \
     .format("com.springml.spark.salesforce") \
     .option("username", config.get('salesforce', 'username')) \
     .option("password", f"{config.get('salesforce', 'password')}{config.get('salesforce', 'token')}") \
     .option("soql", soql) \
     .load()
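Note the password option: Salesforce API logins expect the password and the security token concatenated into a single string, which is why the two config values are joined. A quick stdlib illustration of that step, using an in-memory config mirroring config.ini (values are placeholders):

```python
from configparser import ConfigParser

config = ConfigParser()
config.read_string("""
[salesforce]
username = user@example.com
password = securePassw0rd
token = sal3sforceT0ken
""")

# Salesforce requires password + security token as one credential string
secret = f"{config.get('salesforce', 'password')}{config.get('salesforce', 'token')}"
print(secret)  # securePassw0rdsal3sforceT0ken
```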