Write a Pandas dataframe to CSV on S3

Write a pandas dataframe to a single CSV file on S3.

import boto3
from io import StringIO

DESTINATION = 'my-bucket'

def _write_dataframe_to_csv_on_s3(dataframe, filename):
    """ Write a dataframe to a CSV on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    # Create buffer
    csv_buffer = StringIO()
    # Write dataframe to buffer
    dataframe.to_csv(csv_buffer, sep="|", index=False)
    # Create S3 object
    s3_resource = boto3.resource("s3")
    # Write buffer to S3 object
    s3_resource.Object(DESTINATION, filename).put(Body=csv_buffer.getvalue())

_write_dataframe_to_csv_on_s3(my_df, 'my-folder/data.csv')

Gist

Write a Pandas dataframe to Parquet on S3

Write a pandas dataframe to a single Parquet file on S3.

# Note: make sure `s3fs` is installed so pandas can write to S3, along with a
#       Parquet engine (`pyarrow` or `fastparquet`).
#       AWS credentials are expected in the usual location: ~/.aws/credentials
DESTINATION = 'my-bucket'

def _write_dataframe_to_parquet_on_s3(dataframe, filename):
    """ Write a dataframe to a Parquet on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    output_file = f"s3://{DESTINATION}/{filename}/data.parquet"
    dataframe.to_parquet(output_file)

_write_dataframe_to_parquet_on_s3(my_df, 'my-folder')

Gist

Notes on Kubernetes

After numerous restarts and resets of my VPS and working through over 20 different tutorials, I finally have my recipe for getting a single-node Kubernetes cluster with a proper load balancer working. Please note that the latest version of my Kubernetes experiment can be found in my Git repo. Remember that this repository mainly contains the variety of YAML files that got the system working; what worked for me might not work for anyone else.

Using executemany to increase PyODBC performance

I recently had to insert data from a Pandas dataframe into an Azure SQL database using pandas.to_sql(). This was performing very poorly and seemed to take ages, but since PyODBC introduced fast_executemany it is easy to improve the performance: simply add an event listener that activates fast_executemany on the cursor. For 2300 records I did a small comparison: 8.67s and 7.22s with the listener versus 5min 57s and 5min 26s without, so roughly 50 times faster for this small example dataset.

import pandas as pd
import pyodbc
from sqlalchemy import create_engine, event
from sqlalchemy.pool import StaticPool

# `config[name]` is assumed to hold the connection settings for the target database
wh_conn = pyodbc.connect(
    f"DRIVER={config[name]['driver']};"
    f"SERVER={config[name]['server']},{config[name]['port']};"
    f"DATABASE={config[name]['database']};"
    f"UID={config[name]['username']};"
    f"PWD={config[name]['password']}"
)
engine = create_engine("mssql+pyodbc://", poolclass=StaticPool, creator=lambda: wh_conn)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

df.to_sql(name='Table',
          con=engine,
          schema='Schema',
          index=False,
          if_exists='replace')

Using ButterCMS in a Jupyter notebook with VueJS

The goal of this notebook is to get myself familiar with ButterCMS, a headless CMS that can be easily integrated with any platform in any language. In the past I've experimented with Joomla, Drupal and WordPress (and Blogger a long time ago), where I lost a lot of time simply getting the CMS in place, extending the CMS with more functionality and migrating data between servers. Since I am investigating different ways of using microservices, becoming less platform dependent and generally learning new tools, I use this notebook to show a simple example of ButterCMS in Jupyter.
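As a rough sketch of what such a notebook cell could look like (this is not taken from the original notebook), the snippet below fetches a page of blog posts from the ButterCMS v2 REST API with requests. The API token is a placeholder, and the response handling assumes the "data" list returned by the posts endpoint.

import requests

BUTTER_API_TOKEN = "YOUR_API_TOKEN"  # placeholder, use your own ButterCMS read token
BUTTER_POSTS_URL = "https://api.buttercms.com/v2/posts/"

def get_posts(page=1, page_size=10):
    """ Retrieve a page of blog posts from ButterCMS """
    response = requests.get(
        BUTTER_POSTS_URL,
        params={"auth_token": BUTTER_API_TOKEN,
                "page": page,
                "page_size": page_size},
    )
    response.raise_for_status()
    # The posts endpoint returns a JSON document with a "data" list of posts
    return response.json()["data"]

for post in get_posts():
    print(post["title"])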

Creating Pandas dataframe from Azure Table Storage

import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

CONNECTION_STRING = "DUMMYSTRING"
SOURCE_TABLE = "DUMMYTABLE"

def set_table_service():
    """ Set the Azure Table Storage service """
    return TableService(connection_string=CONNECTION_STRING)

def get_dataframe_from_table_storage_table(table_service, filter_query):
    """ Create a dataframe from table storage data """
    return pd.DataFrame(get_data_from_table_storage_table(table_service,
                                                          filter_query))

def get_data_from_table_storage_table(table_service, filter_query):
    """ Retrieve data from Table Storage """
    for record in table_service.query_entities(
        SOURCE_TABLE, filter=filter_query
    ):
        yield record

fq = "PartitionKey eq '12345'"
ts = set_table_service()
df = get_dataframe_from_table_storage_table(table_service=ts,
                                            filter_query=fq)