JJ's World

Using SFTP with Spark

Fri 02 August 2019

A simple showcase of how to use SFTP together with Spark. With Spark I can read the files directly into an RDD; reading big files kept failing for me with plain Python and pysftp.

Read more →
notebook Spark PySpark Python Jupyter SFTP Data Engineering

Creating abstract classes with Lambda and Terraform

Fri 26 July 2019

In this article I create an abstract class and several concrete classes for use within an AWS Lambda function deployed with Terraform.

Read more →
DevOps AWS data engineer Terraform Lambda IaC OOP
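The abstract/concrete split described above can be sketched with Python's `abc` module. This is a minimal illustration of the pattern; the class and method names here (`Exporter`, `S3Exporter`, `handler`) are my own and not necessarily the ones from the article:

```python
from abc import ABC, abstractmethod


class Exporter(ABC):
    """Abstract base class: every concrete exporter must implement export()."""

    @abstractmethod
    def export(self, payload: dict) -> str:
        ...


class S3Exporter(Exporter):
    def export(self, payload: dict) -> str:
        # In a real Lambda this would call boto3; here we just format a message.
        return f"uploaded {len(payload)} keys to S3"


class LocalExporter(Exporter):
    def export(self, payload: dict) -> str:
        return f"wrote {len(payload)} keys to local disk"


def handler(exporter: Exporter, payload: dict) -> str:
    """Lambda-style entry point that depends only on the abstract interface."""
    return exporter.export(payload)
```

Because `handler` is written against `Exporter`, swapping the storage backend is a one-line change at the call site, which keeps the Lambda body free of backend-specific logic.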

Find and delete empty columns in a Pandas dataframe

Sun 07 July 2019
# Find the columns where each value is null
empty_cols = [col for col in df.columns if df[col].isnull().all()]
# Drop these columns from the dataframe
df.drop(empty_cols,
        axis=1,
        inplace=True)
Python pandas
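The list comprehension above works fine; pandas also has a built-in one-liner that does the same thing. A small sketch with illustrative data (not the original dataframe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],  # entirely empty column
    "c": ["x", "y", "z"],
})

# Drop columns where every value is NaN; how="all" keeps partially filled columns.
df = df.dropna(axis=1, how="all")
```

`how="any"` would instead drop every column containing at least one NaN, so the distinction matters for sparse data.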

Setting up Spark with minIO as object storage

Sun 30 June 2019

To set up one of my data projects, I need (object) storage to save my data. Using Spark I want to be able to read and write Parquet, CSV and other file formats.

Read more →
DevOps Ansible data engineer VPS Ubuntu Spark minIO object storage
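Pointing Spark at an S3-compatible store like minIO mostly comes down to a handful of `fs.s3a` settings from the hadoop-aws module. A configuration sketch (endpoint, credentials and bucket name are placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("minio-demo")
    # Point the s3a connector at the minIO endpoint instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    # minIO serves buckets path-style, not via virtual-host DNS names.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# With the connector configured, reads and writes use plain s3a:// paths.
df = spark.read.parquet("s3a://my-bucket/some/data")
```

This is a config fragment rather than a runnable example: it assumes a running minIO instance and the hadoop-aws jar on Spark's classpath.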

Creating an Ansible playbook to provision my Ubuntu VPS

Wed 26 June 2019

My first experiment with Ansible to automate the provisioning of my server.

Read more →
DevOps Ansible data engineer VPS Ubuntu Spark Jupyter

Creating a Lambda function with Terraform to upload a Looker view

Mon 13 May 2019

A simple Terraform deployment of a Lambda function that exports a Looker view to S3.

Read more →
DevOps AWS data engineer Terraform Lambda IaC

Interacting with AWS Glue

Tue 02 April 2019

In this notebook I interact with AWS Glue using boto3.

Read more →
notebook AWS Python Jupyter Glue

Creating a date range with Python

Fri 08 March 2019

In this notebook I create a date range with daily precision and one with monthly precision, using datetime with timedelta. This has helped me automate filtering tasks where I had to query data for each day in a certain period and write the results to timestamped files.

Read more →
notebook Python Jupyter date generation daterange
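The technique from the notebook can be sketched in a few lines; this is a minimal version and the helper names are my own, not necessarily the notebook's:

```python
from datetime import date, timedelta


def daily_range(start: date, end: date):
    """Yield every day from start up to and including end."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def monthly_range(start: date, end: date):
    """Yield the first day of each month from start's month through end."""
    current = date(start.year, start.month, 1)
    while current <= end:
        yield current
        # timedelta has no 'months', so jump past the month end and snap back to day 1.
        current = (current + timedelta(days=32)).replace(day=1)
```

The `days=32` jump works for any month length (28 to 31 days), which avoids pulling in `dateutil.relativedelta` for this simple case.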

Using Azure Blob Storage and Parquet

Tue 26 February 2019

My attempt to interact with Parquet files on Azure Blob Storage. Reading and writing Pandas dataframes is straightforward, but only reading works with Spark 2.4.0.

Read more →
notebook Python Jupyter Spark Azure Blob PySpark parquet data

Calculate differences with sparse date/value dataframes

Mon 07 January 2019

This notebook contains a small example that interpolates the values for a sparse dataframe and calculates the difference with a smaller dataframe.

Read more →
notebook Python Jupyter interpolation Pandas dataframes
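The idea from the notebook can be sketched with pandas (the data here is illustrative, not the notebook's):

```python
import pandas as pd

# Sparse series: values are only known on some dates.
sparse = pd.Series(
    [10.0, None, None, 40.0],
    index=pd.date_range("2019-01-01", periods=4, freq="D"),
)

# Fill the gaps linearly.
interpolated = sparse.interpolate(method="linear")

# A smaller reference series covering only some of those dates.
reference = pd.Series(
    [12.0, 28.0],
    index=pd.to_datetime(["2019-01-02", "2019-01-03"]),
)

# Align on the shared dates and take the difference.
diff = interpolated.reindex(reference.index) - reference
```

`reindex` is what makes the subtraction safe: it restricts the interpolated series to exactly the reference dates before the element-wise difference.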
← Older
Newer →

I am a lead data engineer with over 15 years of experience working with data. I have a passion for machine learning, pattern recognition, big data, blockchain and ubiquitous computing.

While I mainly work in Python, I try to experiment with different languages and frameworks when I can. Lately I have been experimenting with AWS and Terraform, since apart from data skills I want to stay on top of new developments in DevOps.

I use this page as a portfolio, showcase and cheat sheet, but mainly as a historical record. That is why you will mostly find shell commands, short scripts and notebooks, kept here so that I don't have to reinvent the wheel.


  • dataframe
  • postgres
  • API
  • Docker
  • PySpark
  • AWS
  • Spark
  • Flask
  • VueJS
  • notebook
  • Jupyter
  • Python
  • Ethereum
  • MongoDB
  • DevOps
  • Pandas
  • shell
  • S3
  • data engineer
  • Ubuntu
  • javascript
  • testing
  • blockchain

© JJ's World | Powered by Pelican | Hosted on Cloudflare Pages | 2008 - 2022