JJ's World

Using SFTP with Spark

Fri 02 August 2019

A simple showcase of how to use SFTP together with Spark. With Spark I can read the files directly into an RDD; reading big files kept failing for me with plain Python and pysftp.

Read more →
notebook Spark PySpark Python Jupyter SFTP Data Engineering

Creating abstract classes with Lambda and Terraform

Fri 26 July 2019

In this article I create an abstract class and several concrete classes for use within an AWS Lambda function deployed with Terraform.

Read more →
DevOps AWS data engineer Terraform Lambda IaC OOP
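The abstract/concrete split described above can be sketched with Python's `abc` module. This is a minimal illustration of the pattern; the class and method names here (`Exporter`, `S3Exporter`, `handler`) are my own and not necessarily the ones from the article:

```python
from abc import ABC, abstractmethod


class Exporter(ABC):
    """Abstract base class: every concrete exporter must implement export()."""

    @abstractmethod
    def export(self, payload: dict) -> str:
        ...


class S3Exporter(Exporter):
    def export(self, payload: dict) -> str:
        # In a real Lambda this would call boto3; here we just format a message.
        return f"uploaded {len(payload)} keys to S3"


class LocalExporter(Exporter):
    def export(self, payload: dict) -> str:
        return f"wrote {len(payload)} keys to local disk"


def handler(exporter: Exporter, payload: dict) -> str:
    """Lambda-style entry point that depends only on the abstract interface."""
    return exporter.export(payload)
```

Because `handler` is written against `Exporter`, swapping the storage backend is a one-line change at the call site, which keeps the Lambda body free of backend-specific logic.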

Find and delete empty columns in a Pandas dataframe

Sun 07 July 2019
# Find the columns where each value is null
empty_cols = [col for col in df.columns if df[col].isnull().all()]
# Drop these columns from the dataframe
df.drop(empty_cols,
        axis=1,
        inplace=True)
Python pandas
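The list comprehension above works fine; pandas also has a built-in one-liner that does the same thing. A small sketch with illustrative data (not the original dataframe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],  # entirely empty column
    "c": ["x", "y", "z"],
})

# Drop columns where every value is NaN; how="all" keeps partially filled columns.
df = df.dropna(axis=1, how="all")
```

`how="any"` would instead drop every column containing at least one NaN, so the distinction matters for sparse data.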

Setting up Spark with minIO as object storage

Sun 30 June 2019

To set up one of my data projects, I need (object) storage to save my data. Using Spark I want to be able to read and write Parquet, CSV and other file formats.

Read more →
DevOps Ansible data engineer VPS Ubuntu Spark minIO object storage
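Pointing Spark at an S3-compatible store like minIO mostly comes down to a handful of `fs.s3a` settings from the hadoop-aws module. A configuration sketch (endpoint, credentials and bucket name are placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("minio-demo")
    # Point the s3a connector at the minIO endpoint instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    # minIO serves buckets path-style, not via virtual-host DNS names.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# With the connector configured, reads and writes use plain s3a:// paths.
df = spark.read.parquet("s3a://my-bucket/some/data")
```

This is a config fragment rather than a runnable example: it assumes a running minIO instance and the hadoop-aws jar on Spark's classpath.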

Creating an Ansible playbook to provision my Ubuntu VPS

Wed 26 June 2019

My first experiment with Ansible to automate the provisioning of my server.

Read more →
DevOps Ansible data engineer VPS Ubuntu Spark Jupyter

Creating a Lambda function with Terraform to upload a Looker view

Mon 13 May 2019

A simple Terraform deployment of a Lambda function that exports a Looker view to S3.

Read more →
DevOps AWS data engineer Terraform Lambda IaC

Interacting with AWS Glue

Tue 02 April 2019

In this notebook I interact with AWS Glue using boto3.

Read more →
notebook AWS Python Jupyter Glue

Creating a date range with Python

Fri 08 March 2019

In this notebook I create a date range with daily precision and one with monthly precision, using datetime with timedelta. This has helped me automate filtering tasks where I had to query data for each day in a certain period and write the results to timestamped files.

Read more →
notebook Python Jupyter date generation daterange
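The technique from the notebook can be sketched in a few lines; this is a minimal version and the helper names are my own, not necessarily the notebook's:

```python
from datetime import date, timedelta


def daily_range(start: date, end: date):
    """Yield every day from start up to and including end."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def monthly_range(start: date, end: date):
    """Yield the first day of each month from start's month through end."""
    current = date(start.year, start.month, 1)
    while current <= end:
        yield current
        # timedelta has no 'months', so jump past the month end and snap back to day 1.
        current = (current + timedelta(days=32)).replace(day=1)
```

The `days=32` jump works for any month length (28 to 31 days), which avoids pulling in `dateutil.relativedelta` for this simple case.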

Using Azure Blob Storage and Parquet

Tue 26 February 2019

My attempt to interact with Parquet files on Azure Blob Storage. Reading and writing Pandas dataframes is straightforward, but only reading works with Spark 2.4.0.

Read more →
notebook Python Jupyter Spark Azure Blob PySpark parquet data

Calculate differences with sparse date/value dataframes

Mon 07 January 2019

This notebook contains a small example that interpolates the values for a sparse dataframe and calculates the difference with a smaller dataframe.

Read more →
notebook Python Jupyter interpolation Pandas dataframes
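The idea from the notebook can be sketched with pandas (the data here is illustrative, not the notebook's):

```python
import pandas as pd

# Sparse series: values are only known on some dates.
sparse = pd.Series(
    [10.0, None, None, 40.0],
    index=pd.date_range("2019-01-01", periods=4, freq="D"),
)

# Fill the gaps linearly.
interpolated = sparse.interpolate(method="linear")

# A smaller reference series covering only some of those dates.
reference = pd.Series(
    [12.0, 28.0],
    index=pd.to_datetime(["2019-01-02", "2019-01-03"]),
)

# Align on the shared dates and take the difference.
diff = interpolated.reindex(reference.index) - reference
```

`reindex` is what makes the subtraction safe: it restricts the interpolated series to exactly the reference dates before the element-wise difference.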
← Older
Newer →

I am a lead data engineer with over 15 years of experience working with data. I have a passion for machine learning, pattern recognition, big data, blockchain and ubiquitous computing.

While I mainly work in Python, I try to experiment with different languages and frameworks when I can. Lately I have been experimenting with AWS and Terraform, since apart from data skills I want to stay on top of new developments in DevOps.

I use this page as a portfolio, showcase and cheat sheet, but mainly as a historical record. That is why you will mostly find shell commands, short scripts and notebooks, kept here so that I don't have to reinvent the wheel.


  • dataframe
  • postgres
  • API
  • Docker
  • PySpark
  • AWS
  • Spark
  • Flask
  • VueJS
  • notebook
  • Jupyter
  • Python
  • Ethereum
  • MongoDB
  • DevOps
  • Pandas
  • shell
  • S3
  • data engineer
  • Ubuntu
  • javascript
  • testing
  • blockchain

© JJ's World | Powered by Pelican | Hosted on Cloudflare Pages | 2008 - 2022