Using SFTP with Spark
A simple showcase of how to use SFTP together with Spark. With Spark I can read the files directly into a Spark RDD; reading big files failed for me when using plain Python with pysftp.
In this article I will create an abstract class and several concrete classes to be used within an AWS Lambda function deployed with Terraform.
# Find the columns where every value is null
empty_cols = [col for col in df.columns if df[col].isnull().all()]

# Drop these columns from the dataframe
df.drop(empty_cols, axis=1, inplace=True)
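A minimal, self-contained demonstration of the snippet above, using a hypothetical toy dataframe in which column "b" is entirely null:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe: column "b" contains only nulls
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": ["x", None, "z"],
})

# Columns where every value is null
empty_cols = [col for col in df.columns if df[col].isnull().all()]
print(empty_cols)        # ['b']

# Drop these columns from the dataframe
df.drop(empty_cols, axis=1, inplace=True)
print(list(df.columns))  # ['a', 'c']
```

Note that column "c" survives: it has one missing value, but `all()` only flags columns that are null from top to bottom.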
To set up one of my data projects, I need (object) storage to save my data. Using Spark, I want to be able to read and write Parquet, CSV and other file formats.
My first experiment with Ansible to automate the provisioning of my server.
A simple Terraform deployment of a Lambda function that exports a Looker view to S3.
In this notebook I interact with AWS Glue using boto3.
In this notebook I create a date range with a precision of days and a date range with a precision of months using datetime with timedelta. This has helped me automate filtering tasks where I had to query data each day for a certain period and write the results to timestamped files.
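A sketch of the two range generators using only the standard library; the function names `daily_range` and `monthly_range` and the file-naming pattern are my own, hypothetical choices:

```python
from datetime import date, timedelta

def daily_range(start, end):
    """Yield every date from start to end (inclusive), one day apart."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

def monthly_range(start, end):
    """Yield the first day of each month from start to end (inclusive)."""
    current = date(start.year, start.month, 1)
    while current <= end:
        yield current
        # timedelta has no "months" unit, so roll over the month by hand
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)

days = list(daily_range(date(2020, 1, 1), date(2020, 1, 5)))
months = list(monthly_range(date(2020, 1, 15), date(2020, 4, 1)))
print(len(days))   # 5
print(months[-1])  # 2020-04-01

# Hypothetical timestamped file names for the daily exports
filenames = [f"export_{d:%Y%m%d}.csv" for d in days]
```

The monthly case needs the manual rollover because `timedelta` only supports day-level precision, which is exactly why it is worth showing separately.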
My attempt to interact with Parquet files on Azure Blob Storage. Reading and writing Pandas dataframes is straightforward, but only reading works with Spark 2.4.0.
This notebook contains a small example that interpolates the values of a sparse dataframe and calculates the difference with a smaller dataframe.
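A minimal sketch of that idea, with made-up data: linear interpolation fills the gaps in a sparse series, and the difference is taken only on the index the smaller dataframe shares with it.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse series: values known only at some timestamps
sparse = pd.DataFrame(
    {"value": [1.0, np.nan, np.nan, 4.0, np.nan, 6.0]},
    index=pd.date_range("2021-01-01", periods=6, freq="D"),
)

# Linear interpolation fills the gaps between the known points
interpolated = sparse.interpolate(method="linear")

# A smaller dataframe covering only part of the date range
small = pd.DataFrame(
    {"value": [2.0, 3.5]},
    index=pd.to_datetime(["2021-01-02", "2021-01-04"]),
)

# Compare on the overlapping index only
diff = interpolated.loc[small.index, "value"] - small["value"]
print(diff.tolist())  # [0.0, 0.5]
```

Restricting the subtraction to `small.index` avoids the NaNs that a plain aligned subtraction would produce for dates missing from the smaller dataframe.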