Install Apache Spark
Install Apache Spark (currently 3.3.1) on macOS through brew:
$ brew install apache-spark
$ brew info apache-spark
==> apache-spark: stable 3.3.1 (bottled), HEAD
Engine for large-scale data processing
https://spark.apache.org/
/Users/jitsejan/homebrew/Cellar/apache-spark/3.3.1 (1,512 files, 605.3MB) *
  Poured from bottle on 2022-11-28 at 19:34:56
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/apache-spark.rb
License: Apache-2.0
==> Dependencies
Required: openjdk ✔
==> Options
--HEAD
  Install HEAD version
==> Analytics
install: 6,463 (30 days), 16,623 (90 days), 59,684 (365 days)
install-on-request: 6,459 (30 days), 16,606 (90 days), 59,625 (365 days)
build-error: 0 (30 days)
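The formula should also put the Spark launch scripts on your PATH; a quick sanity check is to print the version:

$ spark-submit --version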
Note: I installed wget to easily download the JAR files.
$ brew install wget
$ wget --version
GNU Wget 1.21.3 built on darwin21.6.0.
Installing JAR files
Download the JAR files that enable Spark to work with AWS S3 (a sketch for wiring them into Spark follows the downloads):
aws-java-sdk-bundle version 1.12.349
$ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.349/aws-java-sdk-bundle-1.12.349.jar
hadoop-aws version 3.3.1
$ wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
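Spark needs to be able to find these JARs on its classpath. A minimal sketch of one way to do that, assuming the Homebrew layout shown in the brew info output above (Spark's bundled JARs live under libexec/jars), is to copy them next to Spark's own JARs:

$ cp aws-java-sdk-bundle-1.12.349.jar hadoop-aws-3.3.1.jar \
    /Users/jitsejan/homebrew/Cellar/apache-spark/3.3.1/libexec/jars/

Alternatively, the paths to the downloaded JARs can be passed at session creation time through the spark.jars configuration option.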
Set up AWS profile
In order to use gimme-aws-creds and PySpark together, add the following profile to your AWS config file (~/.aws/config):
[profile local]
source_profile = org-sso
role_arn = arn:aws:iam::123456789:role/my-dev-role
Here org-sso refers to the profile used by gimme-aws-creds. The role_arn is the role you want to use with Spark; it must have the AWS permissions for the read or write actions you intend to perform.
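Before starting Spark, make sure the org-sso profile actually has fresh credentials. A quick sanity check, assuming the AWS CLI is installed (it is not otherwise used in this post), is to run gimme-aws-creds and then resolve the local profile:

$ gimme-aws-creds
$ aws sts get-caller-identity --profile local

If the role assumption is configured correctly, the returned ARN should reference my-dev-role.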
Set the AWS_PROFILE environment variable to the profile you defined in the previous step; in my case this is local. Next, create a Spark session and set the credentials provider to the AWS ProfileCredentialsProvider:
import os

from pyspark.sql import SparkSession

# Set profile to be used by the credentials provider
os.environ["AWS_PROFILE"] = "local"

# Create Spark Session
spark = SparkSession.builder.getOrCreate()

# Make sure the ProfileCredentialsProvider is used to authenticate in Spark
spark._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.profile.ProfileCredentialsProvider",
)

Validate the code by reading some Parquet files from S3:

S3_URI = "s3a://some-bucket-with-parquet-files/"

df = spark.read.parquet(S3_URI)
df.take(5)
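As an aside, the same Hadoop setting can also be passed through the builder using the spark.hadoop. prefix, which avoids reaching into the private _jsc handle; a minimal sketch:

import os

from pyspark.sql import SparkSession

os.environ["AWS_PROFILE"] = "local"

# Any spark.hadoop.* option set on the builder is forwarded to the
# underlying Hadoop configuration of the session.
spark = (
    SparkSession.builder
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.profile.ProfileCredentialsProvider",
    )
    .getOrCreate()
)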