Using PySpark with S3 (Updated)

Install Apache Spark

Install Apache Spark (3.3.1 at the time of writing) on macOS with Homebrew:

$ brew install apache-spark
$ brew info apache-spark
==> apache-spark: stable 3.3.1 (bottled), HEAD
Engine for large-scale data processing
/Users/jitsejan/homebrew/Cellar/apache-spark/3.3.1 (1,512 files, 605.3MB) *
  Poured from bottle on 2022-11-28 at 19:34:56
License: Apache-2.0
==> Dependencies
Required: openjdk ✔
==> Options
        Install HEAD version
==> Analytics
install: 6,463 (30 days), 16,623 (90 days), 59,684 (365 days)
install-on-request: 6,459 (30 days), 16,606 (90 days), 59,625 (365 days)
build-error: 0 (30 days)

Note: I installed wget to easily download the JAR files.

$ brew install wget
$ wget --version
GNU Wget 1.21.3 built on darwin21.6.0.

Installing JAR files

Download the JAR files that enable Spark to work with AWS S3:

aws-java-sdk-bundle version 1.12.349

$ wget

hadoop-aws version 3.3.1

$ wget
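
After downloading, it can help to confirm that both JARs are where Spark will look for them. The sketch below assumes the files keep their standard Maven names (artifact, dash, version) and that you know which directory Spark loads its JARs from; the `missing_jars` helper and the `jar_dir` argument are hypothetical and not part of Spark.

```python
from pathlib import Path

# File names follow the Maven "<artifact>-<version>.jar" convention,
# using the two versions listed above.
REQUIRED_JARS = (
    "aws-java-sdk-bundle-1.12.349.jar",
    "hadoop-aws-3.3.1.jar",
)


def missing_jars(jar_dir, required=REQUIRED_JARS):
    """Return the required JAR file names that are not present in jar_dir."""
    directory = Path(jar_dir)
    return [name for name in required if not (directory / name).is_file()]
```

Calling `missing_jars(...)` with your own JAR directory returns an empty list when both files are in place, and the names of the missing files otherwise.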

Set up AWS profile

To use gimme-aws-creds with PySpark, add the following to your ~/.aws/credentials file:

[profile local]
source_profile = org-sso
role_arn = arn:aws:iam::123456789:role/my-dev-role

where org-sso is the profile used by gimme-aws-creds, and role_arn is the role you want Spark to use. That role needs the AWS permissions for whatever read or write actions you plan to perform.
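
A typo in the profile section tends to surface only as an opaque authentication error inside Spark, so a quick check up front can save time. The sketch below is one way to do it with Python's standard `configparser`, assuming the file is simple enough to parse as plain INI; `missing_profile_keys` is a hypothetical helper, and the example text is the profile shown above.

```python
import configparser

# Keys the profile above relies on.
REQUIRED_KEYS = ("source_profile", "role_arn")


def missing_profile_keys(ini_text, section="profile local"):
    """Return the required keys that are missing from the given section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(section):
        # The whole section is absent, so every required key is missing.
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(section, key)]


example = """
[profile local]
source_profile = org-sso
role_arn = arn:aws:iam::123456789:role/my-dev-role
"""
print(missing_profile_keys(example))  # prints []
```

To check the real file, pass its contents instead of `example`, for instance `Path("~/.aws/credentials").expanduser().read_text()`.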

Set the AWS_PROFILE environment variable to the profile defined in the previous step (local in my case). Then create a Spark session and configure it to authenticate through the AWS ProfileCredentialsProvider.

import os

from pyspark.sql import SparkSession

# Set profile to be used by the credentials provider
os.environ["AWS_PROFILE"] = "local"
# Create Spark Session
spark = SparkSession.builder.getOrCreate()
# Make sure the ProfileCredentialsProvider is used to authenticate in Spark
spark._jsc.hadoopConfiguration().set("", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
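
The property name passed to `set()` is blank in the snippet above. Hadoop's S3A connector reads its credentials provider class from `fs.s3a.aws.credentials.provider`, so a minimal sketch, assuming that is the intended key, could look like this. `S3A_SETTINGS` and `apply_hadoop_settings` are hypothetical names introduced for illustration.

```python
# Assumption: this is the key the blank string stands for. It is the S3A
# property that selects the credentials provider class; the value is taken
# from the snippet above.
S3A_SETTINGS = {
    "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.profile.ProfileCredentialsProvider",
}


def apply_hadoop_settings(hadoop_conf, settings=S3A_SETTINGS):
    """Call hadoop_conf.set(key, value) for each setting and return the keys applied."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
    return list(settings)
```

With a live session this would be called as `apply_hadoop_settings(spark._jsc.hadoopConfiguration())`. Keeping the settings in a dictionary makes it easy to add further S3A options in one place.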

Validate the code

S3_URI = "s3a://some-bucket-with-parquet-files/"
df =
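
One way to finish this check, assuming the bucket contains Parquet files as its name suggests, is to read the data, print the schema, and count the rows. The sketch below wraps those three standard DataFrame calls in a hypothetical helper, `validate_parquet_read`, which takes the session and URI as arguments; it is not necessarily the read the original snippet performed.

```python
def validate_parquet_read(spark, uri):
    """Read Parquet data from uri, print its schema, and return the row count."""
    df = spark.read.parquet(uri)  # fails here if credentials or JARs are wrong
    df.printSchema()              # shows the column names and types that were read
    return df.count()             # forces a full read of the data
```

With the session and `S3_URI` defined above, `validate_parquet_read(spark, S3_URI)` should return the number of rows if authentication and the S3A JARs are set up correctly.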