Write a Pandas dataframe to CSV on S3

Write a pandas dataframe to a single CSV file on S3.

import boto3
from io import StringIO

DESTINATION = 'my-bucket'

def _write_dataframe_to_csv_on_s3(dataframe, filename):
    """ Write a dataframe to a CSV on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    # Create buffer
    csv_buffer = StringIO()
    # Write dataframe to buffer
    dataframe.to_csv(csv_buffer, sep="|", index=False)
    # Create S3 object
    s3_resource = boto3.resource("s3")
    # Write buffer to S3 object
    s3_resource.Object(DESTINATION, filename).put(Body=csv_buffer.getvalue())

_write_dataframe_to_csv_on_s3(my_df, 'my-folder/data.csv')
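The buffer half of this technique can be checked locally before any S3 upload. A minimal sketch (the dataframe and column names are made up for illustration) that writes to a StringIO buffer and inspects the pipe-delimited result:

```python
import pandas as pd
from io import StringIO

# Small made-up dataframe to exercise the buffer step
df = pd.DataFrame({"name": ["alice", "bob"], "score": [1, 2]})

csv_buffer = StringIO()
df.to_csv(csv_buffer, sep="|", index=False)

# The buffer now holds exactly what would be uploaded with put(Body=...)
print(csv_buffer.getvalue())
# name|score
# alice|1
# bob|2
```

The `getvalue()` string is what ends up as the S3 object's body, so any delimiter or quoting issues show up here before the upload.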


Write a Pandas dataframe to Parquet on S3

Write a pandas dataframe to a single Parquet file on S3.

# Note: make sure `s3fs` is installed so Pandas can write directly to S3.
#       AWS credentials are read from the usual location (~/.aws/credentials).
DESTINATION = 'my-bucket'

def _write_dataframe_to_parquet_on_s3(dataframe, filename):
    """ Write a dataframe to a Parquet file on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    output_file = f"s3://{DESTINATION}/{filename}/data.parquet"
    # With s3fs installed, Pandas writes straight to the S3 path
    dataframe.to_parquet(output_file)

_write_dataframe_to_parquet_on_s3(my_df, 'my-folder')


Notes on Kubernetes

After numerous restarts and resets of my VPS and working through more than 20 different tutorials, I finally have my recipe for a single-node Kubernetes cluster with a properly working load balancer. The latest version of my Kubernetes experiment can be found in my Git repo. That repository mainly contains the collection of YAML files that got the system working; what worked for me might not work for anyone else.

Using fast_executemany to speed up PyODBC inserts

I recently had to insert data from a Pandas dataframe into an Azure SQL database using pandas.DataFrame.to_sql(). This was performing very poorly, but since PyODBC introduced fast_executemany it is easy to improve the performance: simply add an event listener that activates fast_executemany on the cursor. For 2300 records I did a small comparison: 8.67 s and 7.22 s with the listener versus 5 min 57 s and 5 min 26 s without, so roughly 40 times faster for this small example dataset.

import pandas as pd
import pyodbc
from sqlalchemy import create_engine, event
from sqlalchemy.pool import StaticPool

# `config` is assumed to be a dict of connection settings (e.g. parsed from a config file)
wh_conn = pyodbc.connect(f"DRIVER={config[name]['driver']};SERVER={config[name]['server']},{config[name]['port']};DATABASE={config[name]['database']};UID={config[name]['username']};PWD={config[name]['password']}")
engine = create_engine("mssql+pyodbc://", poolclass=StaticPool, creator=lambda: wh_conn)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
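With the listener registered, the actual insert is a regular to_sql call. A usage sketch with illustrative table name and chunksize; an in-memory SQLite engine stands in here so the snippet runs without a SQL Server, even though fast_executemany itself only kicks in on the pyodbc/MSSQL engine above:

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Stand-in engine so the example is self-contained; in practice use the
# mssql+pyodbc engine defined above, where the event listener applies.
demo_engine = create_engine("sqlite://")

# Inserts happen in chunks; on the pyodbc engine each chunk goes through
# the fast_executemany path enabled by the listener.
df.to_sql("my_table", demo_engine, if_exists="append", index=False, chunksize=1000)

print(pd.read_sql("SELECT COUNT(*) AS n FROM my_table", demo_engine)["n"][0])
# 3
```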