Integrating PySpark with SQL Server using JDBC
In my series on connecting different sources to Spark I have explained how to connect to S3 and Redshift. To further extend those trials, here is a quick demo on how to connect to SQL Server using JDBC.
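The connection itself can be sketched as follows. This is a minimal sketch, not the exact demo code: the host, database, table and credentials are placeholders, and it assumes the Microsoft `mssql-jdbc` driver jar is available locally so Spark can load it.

```python
from pyspark.sql import SparkSession

# Hypothetical host, database and credentials -- replace with your own.
jdbc_url = "jdbc:sqlserver://myhost:1433;databaseName=mydb"

spark = (
    SparkSession.builder
    .appName("sqlserver-demo")
    # Assumes the Microsoft JDBC driver jar has been downloaded locally.
    .config("spark.jars", "/path/to/mssql-jdbc.jar")
    .getOrCreate()
)

# Read a table over JDBC into a Spark DataFrame.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
df.show()
```

From here the DataFrame behaves like any other Spark source, so it can be joined or written out as usual.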
Previously I was using powerlevel9k as the theme for my iTerm2 Zsh configuration. Recently I had to set up a new MacBook and found an easier way to make the terminal look fancier. powerlevel10k is the better version of powerlevel9k, especially since it has a configuration prompt where the installer guides you through all the changes you can make to the style.
On a Mac it is as simple as the following few lines, assuming you have Homebrew installed.
$ brew install romkatv/powerlevel10k/powerlevel10k
$ echo 'source /usr/local/opt/powerlevel10k/powerlevel10k.zsh-theme' >>! ~/.zshrc
$ p10k configure
Using pytest's parametrize, writing tests becomes significantly easier. Instead of writing a test for each combination of parameters I can write one test with a list of different sets of parameters. A short example:
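A minimal sketch of the idea, using a made-up `multiply` function; each tuple in the list becomes its own test case:

```python
import pytest


def multiply(a, b):
    return a * b


# One parametrized test replaces three near-identical test functions.
@pytest.mark.parametrize(
    "a, b, expected",
    [
        (2, 3, 6),
        (0, 5, 0),
        (-1, 4, -4),
    ],
)
def test_multiply(a, b, expected):
    assert multiply(a, b) == expected
```

Running `pytest` reports each parameter set as a separate test, so a failure points directly at the offending combination.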
In my article on how to connect to S3 from PySpark I showed how to set up Spark with the right libraries to be able to read from and write to AWS S3. In the following article I show a quick example of how I connect to Redshift and use the S3 setup to write the table to file.
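A sketch of that flow; the cluster endpoint, table and credentials are placeholders, and it assumes both the Redshift JDBC driver and the S3 setup from the earlier article are already in place:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-demo").getOrCreate()

# Hypothetical cluster endpoint and credentials -- replace with your own.
redshift_url = (
    "jdbc:redshift://my-cluster.abc123.eu-west-2.redshift.amazonaws.com"
    ":5439/mydb?user=my_user&password=my_password"
)

# Read the Redshift table over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", redshift_url)
    .option("dbtable", "public.my_table")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .load()
)

# Reuse the S3 setup from the earlier article to write the table out.
df.write.parquet("s3a://my-bucket/my_table/")
```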
In my post Using Spark to read from S3 I explained how I was able to connect Spark to AWS S3 on an Ubuntu machine. Last week I was trying to connect to S3 again using Spark on my local machine, but I wasn't able to read data from our datalake. Our datalake is hosted in the eu-west-2 region, which apparently requires you to specify the authentication signature version. Instead of setting up the right environment on my machine and reconfiguring everything, I chose to update the Docker image from my notebook repo so I could test on my Mac before pushing it to my server. Instead of configuring both my local and remote environments, I can simply spin up the Docker container and have two identical environments.
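For reference, the signature-version fix inside the container can be sketched as a couple of Spark configuration settings; the exact keys depend on your Hadoop and AWS SDK versions, so treat this as a starting point rather than a drop-in answer:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-eu-west-2")
    # eu-west-2 only accepts Signature Version 4, so point s3a at the
    # regional endpoint and enable V4 signing in the JVM.
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-2.amazonaws.com")
    .config("spark.driver.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.executor.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .getOrCreate()
)
```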
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# Cast the count column to an integer; withColumn returns a new
# DataFrame, so the result has to be assigned back.
dataframe = dataframe.withColumn("count", F.col("count").cast(IntegerType()))
In this project I want to verify the availability of the APIs that we use to ingest data into our data platform. In the example I will use Jira, Workable and HiBob, since they offer clean APIs without too much configuration. First I will create a test suite to verify the availability, and once this works, move it to a Lambda function that CloudWatch can trigger on a fixed schedule.
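The core of such a check can be sketched with only the standard library. The endpoint URLs below are placeholders, not the real project configuration, and the HTTP call is injectable so the function can be exercised in tests without network access:

```python
import urllib.request


def is_available(url, opener=urllib.request.urlopen):
    """Return True when the endpoint answers with an HTTP 2xx status."""
    try:
        with opener(url, timeout=10) as response:
            return 200 <= response.status < 300
    except Exception:
        # Connection errors, timeouts and HTTP errors all count as "down".
        return False


# Placeholder endpoints for the APIs mentioned above.
ENDPOINTS = {
    "jira": "https://example.atlassian.net/rest/api/2/myself",
    "workable": "https://www.workable.com/spi/v3/accounts",
    "hibob": "https://api.hibob.com/v1/company/people",
}
```

Wrapping `is_available` in a loop over `ENDPOINTS` gives the test suite, and the same function can later be dropped into the Lambda handler unchanged.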
This project contains a simple example on how to build a resume with Python using Jinja, HTML, Bootstrap and a data file. In the past I have always created my resume with LaTeX, but to make life a little easier I chose to switch to a combination of Python and HTML. Maintaining a LaTeX document is cumbersome and it is difficult to separate the data from the style. By using Jinja it is straightforward to separate the resume data from the actual layout. And the most important part: I can stick to Python!
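A minimal sketch of that separation, assuming Jinja2 is installed; the data dict and the inline template stand in for the project's data file and HTML layout:

```python
from jinja2 import Template

# The resume data lives separately from the layout.
resume = {
    "name": "Jane Doe",
    "jobs": [
        {"title": "Data Engineer", "company": "Acme"},
        {"title": "Analyst", "company": "Initech"},
    ],
}

# The layout only describes how the data is presented.
template = Template(
    "<h1>{{ name }}</h1>\n"
    "<ul>\n"
    "{% for job in jobs %}  <li>{{ job.title }} at {{ job.company }}</li>\n"
    "{% endfor %}</ul>"
)

html = template.render(**resume)
print(html)
```

Updating the resume then only means editing the data file, while the HTML and Bootstrap styling stay untouched.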
A simple introduction to create fake data using the Faker tool in Python. Very convenient if you need to generate dummy data for an experiment.
In my two previous articles Unittesting in a Jupyter notebook and Mocking in unittests in Python I have discussed the use of unittest and mock to run tests for a simple Castle and Character class. For the code behind this article, please check GitHub.
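As a refresher, the pattern those articles use can be condensed as follows. The `Castle` class here is a simplified stand-in, not the exact code from the repository; the point is that `unittest.mock` lets the test supply a fake Character instead of constructing a real one:

```python
import unittest
from unittest.mock import MagicMock


class Castle:
    """Simplified stand-in for the Castle class from the articles."""

    def __init__(self, name):
        self.name = name

    def has_access(self, character):
        # Only characters holding the right powerup may enter.
        return character.powerup == "fire flower"


class TestCastle(unittest.TestCase):
    def test_has_access_with_mocked_character(self):
        castle = Castle("Bowser's Castle")
        # Mock a Character instead of building a real instance.
        character = MagicMock()
        character.powerup = "fire flower"
        self.assertTrue(castle.has_access(character))

    def test_no_access_without_powerup(self):
        castle = Castle("Bowser's Castle")
        character = MagicMock()
        character.powerup = "mushroom"
        self.assertFalse(castle.has_access(character))
```

Because the mock records every attribute access, the tests stay focused on `Castle` without depending on how `Character` is implemented.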