Write a Pandas dataframe to CSV on S3

Write a pandas dataframe to a single CSV file on S3.

import boto3
from io import StringIO

DESTINATION = 'my-bucket'

def _write_dataframe_to_csv_on_s3(dataframe, filename):
    """ Write a dataframe to a CSV on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    # Create buffer
    csv_buffer = StringIO()
    # Write dataframe to buffer
    dataframe.to_csv(csv_buffer, sep="|", index=False)
    # Create S3 object
    s3_resource = boto3.resource("s3")
    # Write buffer to S3 object
    s3_resource.Object(DESTINATION, filename).put(Body=csv_buffer.getvalue())

_write_dataframe_to_csv_on_s3(my_df, 'my-folder/data.csv')

Gist

Write a Pandas dataframe to Parquet on S3

Write a pandas dataframe to a single Parquet file on S3.

# Note: make sure `s3fs` is installed so pandas can write to S3, along with a
#       Parquet engine (`pyarrow` or `fastparquet`).
#       AWS credentials are expected in the usual location: ~/.aws/credentials
DESTINATION = 'my-bucket'

def _write_dataframe_to_parquet_on_s3(dataframe, filename):
    """ Write a dataframe to a Parquet on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    output_file = f"s3://{DESTINATION}/{filename}/data.parquet"
    dataframe.to_parquet(output_file)

_write_dataframe_to_parquet_on_s3(my_df, 'my-folder')

Gist

Notes on Kubernetes

After numerous restarts and resets of my VPS and working through over 20 different tutorials, I finally have my recipe for getting a single-node Kubernetes cluster with a proper load balancer working. Please note that the latest version of my Kubernetes experiment can be found in my Git repo. Remember that this repository mainly contains the variety of YAML files that got the system working; what worked for me might not work for anyone else.

Using executemany to increase PyODBC performance

I recently had to insert data from a Pandas dataframe into an Azure SQL database using pandas.to_sql(). This was performing very poorly and seemed to take ages, but since PyODBC introduced fast_executemany it is easy to improve the performance: simply add an event listener that activates fast_executemany on the cursor. For 2300 records I did a small comparison: 8.67s and 7.22s with the listener versus 5min 57s and 5min 26s without, so roughly 50 times faster for this small example dataset.

import pandas as pd
import pyodbc
from sqlalchemy import create_engine, event
from sqlalchemy.pool import StaticPool

# `config[name]` is assumed to hold the connection settings for the target database
wh_conn = pyodbc.connect(
    f"DRIVER={config[name]['driver']};"
    f"SERVER={config[name]['server']},{config[name]['port']};"
    f"DATABASE={config[name]['database']};"
    f"UID={config[name]['username']};"
    f"PWD={config[name]['password']}"
)
engine = create_engine("mssql+pyodbc://", poolclass=StaticPool, creator=lambda: wh_conn)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

df.to_sql(name='Table',
          con=engine,
          schema='Schema',
          index=False,
          if_exists='replace')

Using ButterCMS in a Jupyter notebook with VueJS

The goal of this notebook is to get myself familiar with ButterCMS, a headless CMS that can be easily integrated with any platform in any language. In the past I've experimented with Joomla, Drupal and WordPress (and Blogger a long time ago), where I lost a lot of time simply getting the CMS in place, extending the CMS with more functionality and migrating data between servers. Since I am investigating different ways of using microservices, becoming less platform dependent and generally learning new tools, I use this notebook to show a simple example of ButterCMS in Jupyter.
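As a rough sketch of what such a notebook cell could look like (this is not taken from the original notebook), the snippet below fetches a page of blog posts from the ButterCMS v2 REST API with requests. The API token is a placeholder, and the response handling assumes the "data" list returned by the posts endpoint.

import requests

BUTTER_API_TOKEN = "YOUR_API_TOKEN"  # placeholder, use your own ButterCMS read token
BUTTER_POSTS_URL = "https://api.buttercms.com/v2/posts/"

def get_posts(page=1, page_size=10):
    """ Retrieve a page of blog posts from ButterCMS """
    response = requests.get(
        BUTTER_POSTS_URL,
        params={"auth_token": BUTTER_API_TOKEN,
                "page": page,
                "page_size": page_size},
    )
    response.raise_for_status()
    # The posts endpoint returns a JSON document with a "data" list of posts
    return response.json()["data"]

for post in get_posts():
    print(post["title"])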

Creating Pandas dataframe from Azure Table Storage

import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

CONNECTION_STRING = "DUMMYSTRING"
SOURCE_TABLE = "DUMMYTABLE"

def set_table_service():
    """ Set the Azure Table Storage service """
    return TableService(connection_string=CONNECTION_STRING)

def get_dataframe_from_table_storage_table(table_service, filter_query):
    """ Create a dataframe from table storage data """
    return pd.DataFrame(get_data_from_table_storage_table(table_service,
                                                          filter_query))

def get_data_from_table_storage_table(table_service, filter_query):
    """ Retrieve data from Table Storage """
    for record in table_service.query_entities(
        SOURCE_TABLE, filter=filter_query
    ):
        yield record

fq = "PartitionKey eq '12345'"
ts = set_table_service()
df = get_dataframe_from_table_storage_table(table_service=ts,
                                            filter_query=fq)