Interacting with Parquet on S3 with PyArrow and s3fs

Prerequisites

Create the hidden directory that will hold the AWS credentials (`-p` makes this a no-op if it already exists):

In [1]:
!mkdir -p ~/.aws

Write the credentials to the credentials file:

In [2]:
%%file ~/.aws/credentials
[default]
aws_access_key_id=AKIAJAAAAAAAAAJ4ZMIQ
aws_secret_access_key=fVAAAAAAAALuLBvYQZ/5G+zxSe7wwJy+AAA
Writing /Users/j.waterschoot/.aws/credentials

Alternatively, the key and secret can be passed directly to the `S3FileSystem` constructor, or picked up from environment variables or other standard AWS credential locations.

Write to Parquet on S3

Create the input data:

In [3]:
%%file inputdata.csv
name,description,color,occupation,picture
Luigi,This is Luigi,green,plumber,https://upload.wikimedia.org/wikipedia/en/f/f1/LuigiNSMBW.png
Mario,This is Mario,red,plumber,https://upload.wikimedia.org/wikipedia/en/9/99/MarioSMBW.png
Peach,My name is Peach,pink,princess,https://s-media-cache-ak0.pinimg.com/originals/d2/4d/77/d24d77cfbba789256c9c1afa1f69b385.png
Toad,I like funghi,red,,https://upload.wikimedia.org/wikipedia/en/d/d1/Toad_3D_Land.png
Overwriting inputdata.csv

Read the data into a dataframe with Pandas:

In [4]:
import pandas as pd
dataframe = pd.read_csv('inputdata.csv')
dataframe
Out[4]:
name description color occupation picture
0 Luigi This is Luigi green plumber https://upload.wikimedia.org/wikipedia/en/f/f1...
1 Mario This is Mario red plumber https://upload.wikimedia.org/wikipedia/en/9/99...
2 Peach My name is Peach pink princess https://s-media-cache-ak0.pinimg.com/originals...
3 Toad I like funghi red NaN https://upload.wikimedia.org/wikipedia/en/d/d1...

Convert to a PyArrow table:

In [5]:
import pyarrow as pa
table = pa.Table.from_pandas(dataframe)
table
Out[5]:
pyarrow.Table
name: string
description: string
color: string
occupation: string
picture: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "name", "field_name": "name", "pandas_type": "unicode'
            b'", "numpy_type": "object", "metadata": null}, {"name": "descript'
            b'ion", "field_name": "description", "pandas_type": "unicode", "nu'
            b'mpy_type": "object", "metadata": null}, {"name": "color", "field'
            b'_name": "color", "pandas_type": "unicode", "numpy_type": "object'
            b'", "metadata": null}, {"name": "occupation", "field_name": "occu'
            b'pation", "pandas_type": "unicode", "numpy_type": "object", "meta'
            b'data": null}, {"name": "picture", "field_name": "picture", "pand'
            b'as_type": "unicode", "numpy_type": "object", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.23.3"}'}

Create the output path for S3 (note that `write_to_dataset` treats this path as a dataset directory, so the actual data lands in part files underneath it):

In [6]:
BUCKET_NAME = 'my-game-bucket-for-demo'
CONTAINER_NAME = 'nintendo-container'
TABLE_NAME = 'character-table'

output_file = f"s3://{BUCKET_NAME}/{CONTAINER_NAME}/{TABLE_NAME}.parquet"
output_file
Out[6]:
's3://my-game-bucket-for-demo/nintendo-container/character-table.parquet'

Set up the connection to S3:

In [7]:
from s3fs import S3FileSystem
s3 = S3FileSystem() # or s3fs.S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY)
s3
Out[7]:
<s3fs.core.S3FileSystem at 0x1030f6eb8>

Create the bucket if it does not exist yet:

In [8]:
BUCKET_EXISTS = False
try:
    s3.ls(BUCKET_NAME)
    BUCKET_EXISTS = True
except FileNotFoundError:
    # a bare except would also swallow credential and network errors
    print("Create bucket first!")
Create bucket first!
In [9]:
if not BUCKET_EXISTS:
    s3.mkdir(BUCKET_NAME)

Write the table to the S3 output:

In [10]:
import pyarrow.parquet as pq
pq.write_to_dataset(table=table, 
                    root_path=output_file,
                    filesystem=s3) 

Check the files:

In [11]:
s3.ls(BUCKET_NAME)
Out[11]:
['my-game-bucket-for-demo/nintendo-container']
In [12]:
s3.ls(f"{BUCKET_NAME}/{CONTAINER_NAME}")
Out[12]:
['my-game-bucket-for-demo/nintendo-container/character-table.parquet']

Read the data back from the Parquet dataset

In [13]:
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(output_file, filesystem=s3)
df = dataset.read_pandas().to_pandas()
df
Out[13]:
name description color occupation picture
0 Luigi This is Luigi green plumber https://upload.wikimedia.org/wikipedia/en/f/f1...
1 Mario This is Mario red plumber https://upload.wikimedia.org/wikipedia/en/9/99...
2 Peach My name is Peach pink princess https://s-media-cache-ak0.pinimg.com/originals...
3 Toad I like funghi red None https://upload.wikimedia.org/wikipedia/en/d/d1...