Integrating PySpark with SQL server using JDBC

First of all I need the JDBC driver for Spark in order to make the connection to a Microsoft SQL server.

$ wget -P /opt/notebooks/

The configuration is saved in config.ini with the following fields:

database = mydb
host =
username = readonly
password = mypassword
port = 1433

Loading the configuration is simple with configparser:

from configparser import ConfigParser

config = ConfigParser()'config.ini')

Set the values for the connection:

jdbc_url = f"jdbc:sqlserver://{config.get('mydb', 'host')}:{config.get('mydb', 'port')};database={config.get('mydb', 'database')}"
connection_properties = {
    "user": config.get('mydb', 'username'),
    "password": config.get('mydb', 'password')

When creating the SparkSession make sure the path to the JAR is correctly set:

from pyspark.sql import SparkSession

jars = [
spark = (SparkSession
  .appName("PySpark with SQL server")
  .config("spark.driver.extraClassPath", ":".join(jars))

The session is created and we can query the actual database:

schema = 'dbo'
table = 'users'

The reading is done using the jdbc read option and specifying the connection details:

df = spark \
    .read \

An alternative approach is to use the same syntax as for the Redshift article by omitting the connection_properties and use a more explicit notation.

df = spark
    .read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", f"{schema}.{table}") \
    .option("user", config.get('mydb', 'username')) \
    .option("password", config.get('mydb', 'password')) \