Getting started with Great Expectations

Great Expectations

This notebook is a small experiment to get to know Great Expectations better. The approach below defines the expectations through the core Python API rather than through configuration files.

Create events with the data generator

I will reuse some code I have written before to generate events.

!pip install mimesis

The next cell defines the EventGenerator, which creates five keys per event: character, world, level, lives, and time.

In [1]:
from mimesis.random import Random
from mimesis import Datetime
import json


class EventGenerator:
    """ Defines the EventGenerator """

    MIN_LIVES = 1
    MAX_LIVES = 99
    CHARACTERS = ["Mario", "Luigi", "Peach", "Toad"]

    def __init__(self, start_date, end_date, num_events=10, output_type=None, output_file=None):
        """ Initialize the EventGenerator """
        self.datetime = Datetime()
        self.random = Random()
        self.num_events = num_events
        self.output_type = output_type
        self.output_file = output_file
        self.start_date = start_date
        self.end_date = end_date

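    def _get_date_between(self, date_start, date_end):
        """ Get a random datetime between date_start and date_end, in one-day steps """
        # Use the arguments instead of silently falling back to self.start_date / self.end_date
        return self.random.choice(self.datetime.bulk_create_datetimes(date_start, date_end, days=1))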

    def _generate_events(self):
        """ Generate the metric data """
        for _ in range(self.num_events):
            yield {
                "character": self.random.choice(self.CHARACTERS),
                "world": self.random.randint(1, 8),
                "level": self.random.randint(1, 4),
                "lives": self.random.randint(self.MIN_LIVES, self.MAX_LIVES),
                "time": str(self._get_date_between(self.start_date, self.end_date)),
            }

    def store_events(self):
        """ Write events to a JSON Lines file, or return them as a list or a generator """
        if self.output_type == "jl":
            with open(self.output_file, "w") as outputfile:
                for event in self._generate_events():
                    outputfile.write(f"{json.dumps(event)}\n")
        elif self.output_type == "list":
            return list(self._generate_events())
        else:
            return self._generate_events()
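
To make the shape of a single event concrete without installing mimesis, here is a minimal sketch that uses only the standard library. The function name `sketch_event` and the fixed dates are my own assumptions for illustration; this is not the EventGenerator itself, only an approximation of the dictionary it yields.

```python
import datetime
import random

CHARACTERS = ["Mario", "Luigi", "Peach", "Toad"]


def sketch_event(start, end, rng=random):
    """Build one event dict with the same five keys as EventGenerator (hypothetical helper)."""
    # Pick a whole-day offset inside [start, end], mirroring the days=1 step.
    span_days = (end - start).days
    moment = start + datetime.timedelta(days=rng.randint(0, span_days))
    return {
        "character": rng.choice(CHARACTERS),
        "world": rng.randint(1, 8),
        "level": rng.randint(1, 4),
        "lives": rng.randint(1, 99),
        "time": str(moment),
    }


# Fixed dates and a seeded generator keep the sketch reproducible.
end = datetime.datetime(2022, 2, 16)
start = end - datetime.timedelta(days=30)
event = sketch_event(start, end, random.Random(0))
print(event)
```

The value ranges are the same ones the expectations further down will check.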

The next step is to create the generator before calling the event generator's main function.

In [2]:
import datetime
from dateutil.relativedelta import relativedelta


DATE_END = datetime.datetime.now()
DATE_START = DATE_END + relativedelta(months=-1)

params = {
    "num_events": 10,
    "start_date": DATE_START,
    "end_date": DATE_END,
}
# Create the event generator
generator = EventGenerator(**params)

Create the dataframe with Pandas.

In [3]:
import pandas as pd

df = pd.DataFrame(generator._generate_events())
In [ ]:
df.head(10)

Data validation

To check the static data I will use Great Expectations with a minimal set of tests.

!pip install great_expectations
In [5]:
import great_expectations as ge

To run Great Expectations against your data, you need to load the data into a GE dataframe, which is simply a Pandas dataframe wrapped with GE functionality.

In [6]:
gedf = ge.from_pandas(df)
In [7]:
gedf
Out[7]:
character world level lives time
0 Mario 2 3 39 2022-01-23 02:21:12.724933
1 Peach 8 2 7 2022-02-02 02:21:12.724933
2 Mario 5 2 69 2022-01-31 02:21:12.724933
3 Toad 6 4 9 2022-02-10 02:21:12.724933
4 Peach 2 4 94 2022-02-15 02:21:12.724933
5 Toad 6 3 11 2022-02-09 02:21:12.724933
6 Toad 4 3 28 2022-01-29 02:21:12.724933
7 Luigi 7 1 96 2022-02-16 02:21:12.724933
8 Peach 2 3 72 2022-02-02 02:21:12.724933
9 Luigi 3 1 46 2022-02-14 02:21:12.724933

The world column should have values from 1 to 8, and the character column should only contain the four known characters.

In [8]:
gedf.expect_column_values_to_be_between(column="world", min_value=1, max_value=8)
In [9]:
gedf.expect_column_values_to_be_in_set(column="character", value_set=["Mario", "Luigi", "Peach", "Toad"])
Out[9]:
{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 10,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "success": true
}
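
To get a feel for what the result fields in the output above mean, the sketch below recomputes a few of them by hand for an "in set" check. The function `check_values_in_set` is a hypothetical helper of my own, not Great Expectations code, and I am assuming here that the unexpected percentage is taken over the non-missing values.

```python
def check_values_in_set(values, value_set):
    """Hand-rolled stand-in for an 'in set' check; not Great Expectations code."""
    allowed = set(value_set)
    present = [v for v in values if v is not None]
    unexpected = [v for v in present if v not in allowed]
    # Assumption: the percentage is computed over non-missing values only.
    percent = 100.0 * len(unexpected) / len(present) if present else 0.0
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(present),
            "unexpected_count": len(unexpected),
            "unexpected_percent": percent,
            "partial_unexpected_list": unexpected[:20],
        },
    }


# The character column from the dataframe shown earlier.
characters = ["Mario", "Peach", "Mario", "Toad", "Peach",
              "Toad", "Toad", "Luigi", "Peach", "Luigi"]
allowed = ["Mario", "Luigi", "Peach", "Toad"]

ok = check_values_in_set(characters, allowed)
bad = check_values_in_set(characters + ["Bowser"], allowed)
print(ok["success"], bad["success"], bad["result"]["partial_unexpected_list"])
```

Adding a single value outside the set is enough to flip `success`, and that value then shows up in `partial_unexpected_list`.
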
In [10]:
gedf.get_expectation_suite()
Out[10]:
{
  "expectation_suite_name": "default",
  "meta": {
    "great_expectations_version": "0.14.7"
  },
  "ge_cloud_id": null,
  "expectations": [
    {
      "kwargs": {
        "column": "world",
        "min_value": 1,
        "max_value": 8
      },
      "meta": {},
      "expectation_type": "expect_column_values_to_be_between"
    },
    {
      "kwargs": {
        "column": "character",
        "value_set": [
          "Mario",
          "Luigi",
          "Peach",
          "Toad"
        ]
      },
      "meta": {},
      "expectation_type": "expect_column_values_to_be_in_set"
    }
  ],
  "data_asset_type": "Dataset"
}

Write the final expectations to a file so they can be used later in the pipeline.

In [11]:
import json

with open("ge_expectation_file.json", "w") as fh:
    json.dump(gedf.get_expectation_suite().to_json_dict(), fh)

We can quickly check the content of the configuration file that has been created. This file can now be used when calling Great Expectations from the command line.

In [17]:
!cat ge_expectation_file.json | python -m json.tool
{
    "expectation_suite_name": "default",
    "meta": {
        "great_expectations_version": "0.14.7"
    },
    "ge_cloud_id": null,
    "expectations": [
        {
            "kwargs": {
                "column": "world",
                "min_value": 1,
                "max_value": 8
            },
            "meta": {},
            "expectation_type": "expect_column_values_to_be_between"
        },
        {
            "kwargs": {
                "column": "character",
                "value_set": [
                    "Mario",
                    "Luigi",
                    "Peach",
                    "Toad"
                ]
            },
            "meta": {},
            "expectation_type": "expect_column_values_to_be_in_set"
        }
    ],
    "data_asset_type": "Dataset"
}
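
Because the suite is plain JSON, it can also be read back and interpreted without Great Expectations at all. The sketch below is a rough illustration of that idea, not the way Great Expectations validates data: the `CHECKS` table and the `apply_suite` function are hypothetical names of my own, and the suite is embedded as a string (trimmed to the fields used) so the example runs on its own. In the notebook you would load `ge_expectation_file.json` with `json.load` instead.

```python
import json

# Same structure as the file above, trimmed to the fields this sketch uses.
SUITE_JSON = """
{
  "expectation_suite_name": "default",
  "expectations": [
    {"kwargs": {"column": "world", "min_value": 1, "max_value": 8},
     "expectation_type": "expect_column_values_to_be_between"},
    {"kwargs": {"column": "character",
                "value_set": ["Mario", "Luigi", "Peach", "Toad"]},
     "expectation_type": "expect_column_values_to_be_in_set"}
  ]
}
"""
suite = json.loads(SUITE_JSON)

# Hypothetical per-value checks, keyed by expectation type.
CHECKS = {
    "expect_column_values_to_be_between":
        lambda value, min_value, max_value: min_value <= value <= max_value,
    "expect_column_values_to_be_in_set":
        lambda value, value_set: value in value_set,
}


def apply_suite(rows, suite):
    """Return (expectation_type, column, success) for every expectation in the suite."""
    outcomes = []
    for expectation in suite["expectations"]:
        kwargs = dict(expectation["kwargs"])
        column = kwargs.pop("column")
        check = CHECKS[expectation["expectation_type"]]
        success = all(check(row[column], **kwargs) for row in rows)
        outcomes.append((expectation["expectation_type"], column, success))
    return outcomes


rows = [{"character": "Mario", "world": 2}, {"character": "Peach", "world": 8}]
bad_rows = rows + [{"character": "Bowser", "world": 9}]

outcomes = apply_suite(rows, suite)
bad_outcomes = apply_suite(bad_rows, suite)
for line in outcomes + bad_outcomes:
    print(line)
```

Each expectation is just a type name plus keyword arguments, which is what makes the file easy to reuse later in a pipeline.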