From 1bf4dac7f5dfda4fefd56d1a2a1df489928a592a Mon Sep 17 00:00:00 2001 From: stijnvanhoey Date: Wed, 23 Mar 2022 22:33:54 +0100 Subject: [PATCH] Add notebook on data validation frameworks --- notebooks/data_validation.ipynb | 1701 +++++++++++++++++++++++++++++++ 1 file changed, 1701 insertions(+) create mode 100644 notebooks/data_validation.ipynb diff --git a/notebooks/data_validation.ipynb b/notebooks/data_validation.ipynb new file mode 100644 index 0000000..fd91589 --- /dev/null +++ b/notebooks/data_validation.ipynb @@ -0,0 +1,1701 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "88a0e9df-2780-417b-8002-7dc0401cee76", + "metadata": {}, + "source": [ + "

Automate data validation - describe data

\n", + "\n", + "> *© 2021, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "60d811e9-59ca-487b-9c50-603a6db34f63", + "metadata": {}, + "source": [ + "## Introduction" + ] + }, + { + "cell_type": "markdown", + "id": "251a31de-cdde-479e-8098-5f213170983d", + "metadata": {}, + "source": [ + "Running the same analysis (code) on newly incoming data that is _expected to have_ the same characteristics (format, data types,...) can become tedious, as errors occur due to unexpected changes in the data.\n", + "\n", + "Creating quarterly reports, processing data from a repeated experiment, comparing scenarios... are just a few examples of repeated analysis on data. To overcome manual checks of the data quality, setting up an __automated validation__ can be worthwhile. In this notebook, some frameworks targeting Pandas DataFrames are highlighted:\n", + "\n", + "- [Great expectations](https://greatexpectations.io/)\n", + "- [Pandera](https://pandera.readthedocs.io)\n", + "- [Frictionless data](https://frictionlessdata.io/)" + ] + }, + { + "cell_type": "markdown", + "id": "95df9341-a9cc-4f09-bbc1-93ac8e48c7b4", + "metadata": {}, + "source": [ + "__Note__ Imports of the packages are grouped per section/framework." + ] + }, + { + "cell_type": "markdown", + "id": "abbfadcd-3338-46e8-935b-ebcd27c7a5c8", + "metadata": {}, + "source": [ + "## Great expectations" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "f0dea0e3-64ca-44e7-9e15-cd30c1ee1cc3", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from ruamel import yaml\n", + "\n", + "import pandas as pd\n", + "\n", + "import great_expectations as 
ge\n", + "from great_expectations import DataContext\n", + "from great_expectations.core import ExpectationSuite\n", + "from great_expectations.core.batch import RuntimeBatchRequest, BatchRequest\n", + "from great_expectations.validator.validator import Validator\n", + "from great_expectations.checkpoint.checkpoint import SimpleCheckpoint" + ] + }, + { + "cell_type": "markdown", + "id": "412637e9-6ead-4b54-b4a9-55185188f2d2", + "metadata": {}, + "source": [ + "[Great expectations](https://greatexpectations.io/) provides an entire framework for data quality checks, a.k.a. 'production-ready' data validation: connecting to different data sources, notifications, automated triggers,... This provides a powerful ecosystem when working with continuously incoming data in a corporate environment, but the large set of functionalities can be overwhelming. \n", + "\n", + "In this introduction the focus is on setting up a set of _expectations_ (data validation rules) and applying these rules to a Pandas DataFrame loaded in memory." + ] + }, + { + "cell_type": "markdown", + "id": "56d372aa-cce6-4c3f-a534-8cd718989836", + "metadata": {}, + "source": [ + "### Start using great expectations" + ] + }, + { + "cell_type": "markdown", + "id": "62673bc9-89c6-4267-b74f-1ea14b6e46db", + "metadata": {}, + "source": [ + "_Great expectations_ requires a specific folder structure, which can be set up using the `init` command:" + ] + }, + { + "cell_type": "markdown", + "id": "ca075c2e-c813-49f9-8b1a-fa1d6792bc98", + "metadata": {}, + "source": [ + "```\n", + "great_expectations init\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "b0d80dd7-bf67-4a8f-a944-934e2b395033", + "metadata": {}, + "source": [ + "See also the [getting started documentation](https://docs.greatexpectations.io/docs/tutorials/getting_started/tutorial_overview)." 
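Before diving into the framework itself, it can help to see what an _expectation_ amounts to conceptually: a validation rule stored as data, which reports a result instead of raising immediately. The following pure-Python sketch only imitates the general shape of such a rule and its result dictionary; it is an illustration, not how _great expectations_ is implemented:

```python
# A framework-free sketch of an "expectation": a validation rule that returns
# a result dictionary with a success flag instead of raising an exception.
# The result layout loosely imitates great expectations' output style.

def expect_values_to_not_be_null(values):
    """Check that no value in a column is missing (None)."""
    missing = [v for v in values if v is None]
    return {
        "expectation_type": "expect_column_values_to_not_be_null",
        "result": {
            "element_count": len(values),
            "unexpected_count": len(missing),
        },
        "success": not missing,
    }

report = expect_values_to_not_be_null(["2020-09-24", "2020-10-25", None])
print(report["success"])  # one value is missing, so this prints False
```

Collecting such rule results, rather than failing on the first error, is what makes an automated validation report possible.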
+ ] + }, + { + "cell_type": "markdown", + "id": "5bb8ce90-ee84-405d-939a-8cc3dad5c6a7", + "metadata": {}, + "source": [ + "### Define _expectations_ interactively within a Jupyter notebook" + ] + }, + { + "cell_type": "markdown", + "id": "3a7ba0a4-2c2d-441c-955f-28a77f330800", + "metadata": {}, + "source": [ + "Defining a set of _expectations_ can be done using an existing data set. Let's start from the casualties data set used earlier:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "d3ede0f2-012e-4e87-9af5-8ee5f677c5b6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DT_DAYDT_HOURCD_DAY_OF_WEEKTX_DAY_OF_WEEK_DESCR_FRTX_DAY_OF_WEEK_DESCR_NLMS_VICTMS_VIC_OKMS_SLY_INJMS_SERLY_INJMS_DEAD_30_DAYS...TX_ADM_DSTR_DESCR_NLCD_PROV_REFNISTX_PROV_DESCR_FRTX_PROV_DESCR_NLCD_RGN_REFNISTX_RGN_DESCR_FRTX_RGN_DESCR_NLCD_SEXTX_SEX_DESCR_FRTX_SEX_DESCR_NL
02020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest2FémininVrouwelijk
12020-10-25147Dimanchezondag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
22020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
32020-12-01152Mardidinsdag11000...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
42020-12-16173Mercrediwoensdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
\n", + "

5 rows × 43 columns

\n", + "
" + ], + "text/plain": [ + " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n", + "0 2020-09-24 15 4 Jeudi \n", + "1 2020-10-25 14 7 Dimanche \n", + "2 2020-09-24 15 4 Jeudi \n", + "3 2020-12-01 15 2 Mardi \n", + "4 2020-12-16 17 3 Mercredi \n", + "\n", + " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n", + "0 donderdag 1 0 1 0 \n", + "1 zondag 1 0 1 0 \n", + "2 donderdag 1 0 1 0 \n", + "3 dinsdag 1 1 0 0 \n", + "4 woensdag 1 0 1 0 \n", + "\n", + " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n", + "0 0 ... Arrondissement Antwerpen 10000 \n", + "1 0 ... Arrondissement Antwerpen 10000 \n", + "2 0 ... Arrondissement Antwerpen 10000 \n", + "3 0 ... Arrondissement Antwerpen 10000 \n", + "4 0 ... Arrondissement Antwerpen 10000 \n", + "\n", + " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n", + "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "\n", + " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n", + "0 Vlaams Gewest 2 Féminin Vrouwelijk \n", + "1 Vlaams Gewest 1 Masculin Mannelijk \n", + "2 Vlaams Gewest 1 Masculin Mannelijk \n", + "3 Vlaams Gewest 1 Masculin Mannelijk \n", + "4 Vlaams Gewest 1 Masculin Mannelijk \n", + "\n", + "[5 rows x 43 columns]" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n", + " compression='zip', \n", + " sep=\"|\", \n", + " low_memory=False)\n", + "casualties_raw.head()" + ] + }, + { + "cell_type": "markdown", + "id": "15c72e88-70f6-4eb9-9610-2c4d70f50f0a", + "metadata": {}, + "source": [ + "Different options are provided to create new 
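The first rule defined below, `expect_column_distinct_values_to_be_in_set`, essentially boils down to a set comparison between the observed distinct values of a column and an allowed set. A pure-Python sketch of that logic (an illustration of the idea, not the framework's actual implementation):

```python
def distinct_values_in_set(values, value_set):
    """Report whether the distinct values of a column stay within an allowed set."""
    observed = sorted(set(values))
    return {
        "result": {"observed_value": observed, "element_count": len(values)},
        "success": set(observed).issubset(value_set),
    }

# The gender column also contains "Onbekend" (unknown), so this check fails:
check = distinct_values_in_set(
    ["Mannelijk", "Vrouwelijk", "Onbekend", "Mannelijk"],
    {"Vrouwelijk", "Mannelijk"},
)
print(check["success"])  # prints False
```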
expectations (_great expectations_ provides automatically generated notebooks, or have a look at the [automated _profilers_](https://docs.greatexpectations.io/docs/guides/expectations/advanced/how_to_create_a_new_expectation_suite_using_rule_based_profilers) as well). \n", + "\n", + "The following approach starts from our Pandas DataFrame, connecting it to _great expectations_ and adding new rules interactively in this session:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "8ae50ba5-5156-4041-9173-52dc3ceb46e4", + "metadata": {}, + "outputs": [], + "source": [ + "# https://greatexpectations.io/expectations/\n", + "my_df = ge.from_pandas(casualties_raw)" + ] + }, + { + "cell_type": "markdown", + "id": "a09d24f1-c016-434a-bee4-7bac50a9ac6d", + "metadata": {}, + "source": [ + "The `my_df` variable tracks the set of rules that have been defined and run in _great expectations_ (duplicate runs are ignored):" + ] + }, + { + "cell_type": "markdown", + "id": "87b11c35-b13e-4e1f-b668-cb4ef641b2b5", + "metadata": {}, + "source": [ + "_check some rule options yourself using `my_df.expect_` + the TAB key_" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "7a261014-0bd9-45c3-be94-9f274bf3a564", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"meta\": {},\n", + " \"result\": {\n", + " \"observed_value\": [\n", + " \"Mannelijk\",\n", + " \"Onbekend\",\n", + " \"Vrouwelijk\"\n", + " ],\n", + " \"element_count\": 66130,\n", + " \"missing_count\": null,\n", + " \"missing_percent\": null\n", + " },\n", + " \"exception_info\": {\n", + " \"raised_exception\": false,\n", + " \"exception_traceback\": null,\n", + " \"exception_message\": null\n", + " },\n", + " \"success\": false\n", + "}" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"my_df.expect_column_distinct_values_to_be_in_set(column=\"TX_SEX_DESCR_NL\",\n", + " value_set=[\"Vrouwelijk\", \"Mannelijk\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "2b4182a5-ffb1-4837-a73a-9ddf3f33c317", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"meta\": {},\n", + " \"result\": {\n", + " \"observed_value\": 1,\n", + " \"element_count\": 66130,\n", + " \"missing_count\": null,\n", + " \"missing_percent\": null\n", + " },\n", + " \"exception_info\": {\n", + " \"raised_exception\": false,\n", + " \"exception_traceback\": null,\n", + " \"exception_message\": null\n", + " },\n", + " \"success\": true\n", + "}" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_df.expect_column_min_to_be_between(column=\"MS_VICT\", min_value=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "2448ad81-7b04-4ec9-a51a-8b430fcd5b83", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"meta\": {},\n", + " \"result\": {\n", + " \"element_count\": 66130,\n", + " \"unexpected_count\": 0,\n", + " \"unexpected_percent\": 0.0,\n", + " \"unexpected_percent_total\": 0.0,\n", + " \"partial_unexpected_list\": []\n", + " },\n", + " \"exception_info\": {\n", + " \"raised_exception\": false,\n", + " \"exception_traceback\": null,\n", + " \"exception_message\": null\n", + " },\n", + " \"success\": true\n", + "}" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_df.expect_column_values_to_not_be_null(column=\"DT_DAY\")" + ] + }, + { + "cell_type": "markdown", + "id": "faad3ce2-2362-4936-bbbf-bd149f696020", + "metadata": {}, + "source": [ + "Let's check the _expectations_ defined up to this point:" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "4aa9df78-e1bf-4f87-b209-ef7fabad8744", + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/plain": [ + "{\n", + " \"meta\": {\n", + " \"great_expectations_version\": \"0.14.10\"\n", + " },\n", + " \"expectation_suite_name\": \"default\",\n", + " \"ge_cloud_id\": null,\n", + " \"expectations\": [\n", + " {\n", + " \"meta\": {},\n", + " \"expectation_type\": \"expect_column_distinct_values_to_be_in_set\",\n", + " \"kwargs\": {\n", + " \"column\": \"TX_SEX_DESCR_NL\",\n", + " \"value_set\": [\n", + " \"Vrouwelijk\",\n", + " \"Mannelijk\"\n", + " ]\n", + " }\n", + " },\n", + " {\n", + " \"meta\": {},\n", + " \"expectation_type\": \"expect_column_min_to_be_between\",\n", + " \"kwargs\": {\n", + " \"column\": \"MS_VICT\",\n", + " \"min_value\": 0\n", + " }\n", + " },\n", + " {\n", + " \"meta\": {},\n", + " \"expectation_type\": \"expect_column_values_to_not_be_null\",\n", + " \"kwargs\": {\n", + " \"column\": \"DT_DAY\"\n", + " }\n", + " }\n", + " ],\n", + " \"data_asset_type\": \"Dataset\"\n", + "}" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_df.get_expectation_suite(discard_failed_expectations=False)" + ] + }, + { + "cell_type": "markdown", + "id": "9c339da7-9e8c-4324-8375-51e296f2b6dd", + "metadata": {}, + "source": [ + "Store them to reuse the set of expectations later on. We store the `json` output in the `expectations` subfolder within the folder structure. The name of the file, i.e. 
`be_casualties`, will be used later on to refer to this set of expectations:" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "69ba9eea-891c-4bd8-b402-35fff6b87dd2", + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"./great_expectations/expectations/be_casualties.json\", \"w\") as my_file: \n", + " my_file.write(json.dumps(my_df.get_expectation_suite(discard_failed_expectations=False).to_json_dict()) )" + ] + }, + { + "cell_type": "markdown", + "id": "7cf0b648-71ca-4ec2-8ea7-47da5223d57b", + "metadata": {}, + "source": [ + "### Setup to run the great expectations framework on an in-memory DataFrame" + ] + }, + { + "cell_type": "markdown", + "id": "15928374-c288-476a-9fe9-e1a1ddf8c44a", + "metadata": {}, + "source": [ + "These steps only need to be done __once for a new project__. The aim is to create the necessary configuration to apply a set of _expectations_ to an _in-memory_ Pandas DataFrame.\n", + "\n", + "1. First, let `great expectations` get the context from all the configuration in the different subfolders: " + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "96726aa9-41a5-40a0-83ab-8cc42dfe886d", + "metadata": {}, + "outputs": [], + "source": [ + "context = ge.get_context()" + ] + }, + { + "cell_type": "markdown", + "id": "8a771543-21b7-4812-a679-3e5e576d0a9e", + "metadata": {}, + "source": [ + "_Note: this also loads the `be_casualties` expectations we already defined in the previous step._" + ] + }, + { + "cell_type": "markdown", + "id": "974dfc82-4644-4cd7-b978-0d656db2201a", + "metadata": {}, + "source": [ + "2. Define an in-memory Pandas 'data source'\n", + "\n", + "The framework provides a [wide range of connectors](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_choose_which_dataconnector_to_use) (e.g. csv-file, database, cloud storage,...). 
This example provides the configuration to run it within the context of a Python session (Jupyter notebook) on an in-memory DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "a472e4be-bf19-484b-ad7a-81bd751adf52", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "datasource_yaml = f\"\"\"\n", + "name: pandas_casualties\n", + "class_name: Datasource\n", + "module_name: great_expectations.datasource\n", + "execution_engine:\n", + " module_name: great_expectations.execution_engine\n", + " class_name: PandasExecutionEngine\n", + "data_connectors:\n", + " casualties_memory:\n", + " class_name: RuntimeDataConnector\n", + " batch_identifiers:\n", + " - year\n", + "\"\"\"\n", + "context.add_datasource(**yaml.safe_load(datasource_yaml))" + ] + }, + { + "cell_type": "markdown", + "id": "d2f7066a-f8cc-4747-bc68-6b0bec12e281", + "metadata": {}, + "source": [ + "3. 
A _checkpoint_ links a data set with a set of _expectations_:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "c2005d71-3485-4e8f-abca-3f052ed61d49", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"action_list\": [\n", + " {\n", + " \"name\": \"store_validation_result\",\n", + " \"action\": {\n", + " \"class_name\": \"StoreValidationResultAction\"\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"store_evaluation_params\",\n", + " \"action\": {\n", + " \"class_name\": \"StoreEvaluationParametersAction\"\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"update_data_docs\",\n", + " \"action\": {\n", + " \"class_name\": \"UpdateDataDocsAction\",\n", + " \"site_names\": []\n", + " }\n", + " }\n", + " ],\n", + " \"batch_request\": {},\n", + " \"class_name\": \"Checkpoint\",\n", + " \"config_version\": 1.0,\n", + " \"evaluation_parameters\": {},\n", + " \"module_name\": \"great_expectations.checkpoint\",\n", + " \"name\": \"casualties_check\",\n", + " \"profilers\": [],\n", + " \"runtime_configuration\": {},\n", + " \"validations\": [\n", + " {\n", + " \"batch_request\": {\n", + " \"datasource_name\": \"pandas_casualties\",\n", + " \"data_connector_name\": \"casualties_memory\",\n", + " \"data_asset_name\": \"casualties\"\n", + " },\n", + " \"expectation_suite_name\": \"be_casualties\"\n", + " }\n", + " ]\n", + "}" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "checkpoint_config = {\n", + " \"name\": \"casualties_check\",\n", + " \"config_version\": 1,\n", + " \"class_name\": \"SimpleCheckpoint\",\n", + " \"validations\": [\n", + " {\n", + " \"batch_request\": {\n", + " \"datasource_name\": \"pandas_casualties\",\n", + " \"data_connector_name\": \"casualties_memory\",\n", + " \"data_asset_name\": \"casualties\",\n", + " },\n", + " \"expectation_suite_name\": \"be_casualties\"\n", + " }\n", + " ],\n", + "}\n", + 
"context.add_checkpoint(**checkpoint_config)" + ] + }, + { + "cell_type": "markdown", + "id": "aeb8b668-5ee8-4ccc-a33b-72771f777a33", + "metadata": {}, + "source": [ + "This is the necessary configuration to be able to run the data validation. Check the subfolder [./great_expectations](./great_expectations) to see the created configuration files stored on disk." + ] + }, + { + "cell_type": "markdown", + "id": "2549ef9d-036f-418d-9ed4-3ed48f32def1", + "metadata": {}, + "source": [ + "### Apply a set of _expectations_ to a (new version of a) data set" + ] + }, + { + "cell_type": "markdown", + "id": "09daaa0a-3e82-43c6-81bf-ceb333e30706", + "metadata": {}, + "source": [ + "We now have all elements together to validate a (new) data set against the defined expectations and to check the created data validation report." + ] + }, + { + "cell_type": "markdown", + "id": "97ed8c1a-4f6e-4cc5-8a37-9e22a084899d", + "metadata": {}, + "source": [ + "The code in this section is required to run an evaluation on __any new version of the data__." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "438e556c-9efc-4a79-a744-e7137057809c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DT_DAYDT_HOURCD_DAY_OF_WEEKTX_DAY_OF_WEEK_DESCR_FRTX_DAY_OF_WEEK_DESCR_NLMS_VICTMS_VIC_OKMS_SLY_INJMS_SERLY_INJMS_DEAD_30_DAYS...TX_ADM_DSTR_DESCR_NLCD_PROV_REFNISTX_PROV_DESCR_FRTX_PROV_DESCR_NLCD_RGN_REFNISTX_RGN_DESCR_FRTX_RGN_DESCR_NLCD_SEXTX_SEX_DESCR_FRTX_SEX_DESCR_NL
02020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest2FémininVrouwelijk
12020-10-25147Dimanchezondag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
22020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
32020-12-01152Mardidinsdag11000...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
42020-12-16173Mercrediwoensdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
\n", + "

5 rows × 43 columns

\n", + "
" + ], + "text/plain": [ + " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n", + "0 2020-09-24 15 4 Jeudi \n", + "1 2020-10-25 14 7 Dimanche \n", + "2 2020-09-24 15 4 Jeudi \n", + "3 2020-12-01 15 2 Mardi \n", + "4 2020-12-16 17 3 Mercredi \n", + "\n", + " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n", + "0 donderdag 1 0 1 0 \n", + "1 zondag 1 0 1 0 \n", + "2 donderdag 1 0 1 0 \n", + "3 dinsdag 1 1 0 0 \n", + "4 woensdag 1 0 1 0 \n", + "\n", + " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n", + "0 0 ... Arrondissement Antwerpen 10000 \n", + "1 0 ... Arrondissement Antwerpen 10000 \n", + "2 0 ... Arrondissement Antwerpen 10000 \n", + "3 0 ... Arrondissement Antwerpen 10000 \n", + "4 0 ... Arrondissement Antwerpen 10000 \n", + "\n", + " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n", + "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "\n", + " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n", + "0 Vlaams Gewest 2 Féminin Vrouwelijk \n", + "1 Vlaams Gewest 1 Masculin Mannelijk \n", + "2 Vlaams Gewest 1 Masculin Mannelijk \n", + "3 Vlaams Gewest 1 Masculin Mannelijk \n", + "4 Vlaams Gewest 1 Masculin Mannelijk \n", + "\n", + "[5 rows x 43 columns]" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n", + " compression='zip', \n", + " sep=\"|\", \n", + " low_memory=False)\n", + "casualties_raw.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "46cfd158-155d-4171-925b-e55cbce9bedf", + "metadata": {}, + "outputs": [], + "source": [ + 
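Conceptually, running the checkpoint below replays the stored suite of expectations against the freshly loaded data and aggregates the outcomes. A rough pure-Python sketch of that idea (the rule names in the dispatch table are simplified stand-ins, not real _great expectations_ expectation types):

```python
import json

# A stored suite in the same spirit as be_casualties.json, heavily simplified.
suite = json.loads("""[
  {"expectation_type": "not_null", "kwargs": {"column": "DT_DAY"}},
  {"expectation_type": "min_at_least", "kwargs": {"column": "MS_VICT", "min_value": 0}}
]""")

# Map each rule name to the check it performs on a column of values.
CHECKS = {
    "not_null": lambda column, kwargs: all(v is not None for v in column),
    "min_at_least": lambda column, kwargs: min(column) >= kwargs["min_value"],
}

def run_suite(data, suite):
    """Apply each stored rule to its column; the run succeeds only if all pass."""
    results = [
        CHECKS[exp["expectation_type"]](data[exp["kwargs"]["column"]], exp["kwargs"])
        for exp in suite
    ]
    return {"success": all(results), "results": results}

data = {"DT_DAY": ["2020-09-24", "2020-10-25"], "MS_VICT": [1, 1]}
print(run_suite(data, suite)["success"])  # prints True
```

The value of the real framework is that this replay (plus reporting) is driven entirely by stored configuration, so a new data delivery can be validated without changing any code.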
"context = ge.get_context()" + ] + }, + { + "cell_type": "markdown", + "id": "c8b7f0f2-6b68-4a94-861c-48fd32d806bc", + "metadata": {}, + "source": [ + "We can now _run_ a checkpoint on our DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "830fd3d1-7798-4bf6-9425-cb368cc16a65", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "633857360c074aaaa88afa1ba52c575d", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/8 [00:00\n", + "\n", + "**Great expectations**\n", + "\n", + "- An entire ecosystem with lots of integrations and configuration options. \n", + "- Once a set of _expectations_ is defined and the Context/DataSource/Checkpoint has been set up, a very comprehensive __data validation report__ is provided.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "7a1bc9d6-e4b4-4c30-a53c-7cb41b0950a7", + "metadata": {}, + "source": [ + "## Pandera" + ] + }, + { + "cell_type": "markdown", + "id": "01d6e309-0d6e-4974-9baa-5ce91b17b1e1", + "metadata": {}, + "source": [ + "[Pandera](https://pandera.readthedocs.io) provides similar functionality to Great expectations, but requires less configuration to set up:" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "be527cf8-78d4-4315-8305-e967321a1cf6", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import pandera as pa" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "2f33718d-dd4b-4855-b7f3-3bd2770fc2b9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DT_DAYDT_HOURCD_DAY_OF_WEEKTX_DAY_OF_WEEK_DESCR_FRTX_DAY_OF_WEEK_DESCR_NLMS_VICTMS_VIC_OKMS_SLY_INJMS_SERLY_INJMS_DEAD_30_DAYS...TX_ADM_DSTR_DESCR_NLCD_PROV_REFNISTX_PROV_DESCR_FRTX_PROV_DESCR_NLCD_RGN_REFNISTX_RGN_DESCR_FRTX_RGN_DESCR_NLCD_SEXTX_SEX_DESCR_FRTX_SEX_DESCR_NL
02020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest2FémininVrouwelijk
12020-10-25147Dimanchezondag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
22020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
32020-12-01152Mardidinsdag11000...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
42020-12-16173Mercrediwoensdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
\n", + "

5 rows × 43 columns

\n", + "
" + ], + "text/plain": [ + " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n", + "0 2020-09-24 15 4 Jeudi \n", + "1 2020-10-25 14 7 Dimanche \n", + "2 2020-09-24 15 4 Jeudi \n", + "3 2020-12-01 15 2 Mardi \n", + "4 2020-12-16 17 3 Mercredi \n", + "\n", + " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n", + "0 donderdag 1 0 1 0 \n", + "1 zondag 1 0 1 0 \n", + "2 donderdag 1 0 1 0 \n", + "3 dinsdag 1 1 0 0 \n", + "4 woensdag 1 0 1 0 \n", + "\n", + " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n", + "0 0 ... Arrondissement Antwerpen 10000 \n", + "1 0 ... Arrondissement Antwerpen 10000 \n", + "2 0 ... Arrondissement Antwerpen 10000 \n", + "3 0 ... Arrondissement Antwerpen 10000 \n", + "4 0 ... Arrondissement Antwerpen 10000 \n", + "\n", + " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n", + "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "\n", + " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n", + "0 Vlaams Gewest 2 Féminin Vrouwelijk \n", + "1 Vlaams Gewest 1 Masculin Mannelijk \n", + "2 Vlaams Gewest 1 Masculin Mannelijk \n", + "3 Vlaams Gewest 1 Masculin Mannelijk \n", + "4 Vlaams Gewest 1 Masculin Mannelijk \n", + "\n", + "[5 rows x 43 columns]" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n", + " compression='zip', \n", + " sep=\"|\", \n", + " low_memory=False,\n", + " parse_dates=[\"DT_DAY\"])\n", + "casualties_raw.head()" + ] + }, + { + "cell_type": "markdown", + "id": "607c096a-75bc-48fb-96a4-0cdc2f53b304", + "metadata": {}, + "source": [ + "Define a 
set of rules as a `DataFrameSchema`:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "909bfb9a-2042-48ce-9ce4-d51c9df2175a", + "metadata": {}, + "outputs": [], + "source": [ + "# define schema\n", + "schema = pa.DataFrameSchema({\n", + " \"MS_VICT\": pa.Column(int, checks=[\n", + " pa.Check.greater_than_or_equal_to(0)\n", + " ]),\n", + " \"DT_DAY\": pa.Column(\"datetime64\"),\n", + " \"TX_SEX_DESCR_NL\": pa.Column(\n", + " str, \n", + " checks=pa.Check.isin([\"Mannelijk\", \"Vrouwelijk\"])\n", + " ),\n", + " \"CD_.+\": pa.Column(\n", + " int,\n", + " regex=True\n", + " ), \n", + "})" + ] + }, + { + "cell_type": "markdown", + "id": "d1b81558-f611-49ee-9bb2-22af7e925020", + "metadata": {}, + "source": [ + "Apply a schema to a DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "77847f3a-b636-4405-94bb-9c34561aa18d", + "metadata": {}, + "outputs": [], + "source": [ + "# validated_df = schema(casualties_raw, lazy=True) # RUN to see report" + ] + }, + { + "cell_type": "markdown", + "id": "81670e50-5f64-48b6-a4a6-5bcee271b013", + "metadata": {}, + "source": [ + "__Note__ Use the `lazy=True` option to see all errors in a single Exception." + ] + }, + { + "cell_type": "markdown", + "id": "8d32c1ef-e95a-4ea7-ab18-da9dd45b95fa", + "metadata": {}, + "source": [ + "Integrate it with existing code and packages using the `check_input` decorator. This can overcome repeated checks at the start of your data processing functions." 
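To see what such an input-checking decorator boils down to, here is a stdlib-only sketch, independent of pandera; the `check_first_arg` and `require_ms_vict` names are invented for illustration:

```python
from functools import wraps

def check_first_arg(validator):
    """Decorator factory: run `validator` on the first argument before the call."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            validator(args[0])  # raises if the input does not satisfy the rules
            return func(*args, **kwargs)
        return wrapper
    return decorate

def require_ms_vict(record):
    # Invented stand-in for a schema: the input must carry an "MS_VICT" key.
    if "MS_VICT" not in record:
        raise ValueError("missing required column: MS_VICT")

@check_first_arg(require_ms_vict)
def dataprocessor(record):
    """My data analysis functionality..."""
    return record["MS_VICT"] + 1

print(dataprocessor({"MS_VICT": 1}))  # prints 2
```

The decorated function body can then assume valid input, which is exactly the convenience `check_input` provides with a full `DataFrameSchema` behind it.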
+ ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "e975e088-4bdb-4ae8-b48c-2319adf5d099", + "metadata": {}, + "outputs": [], + "source": [ + "from pandera import check_input\n", + "\n", + "# by default, `check_input` assumes that the first argument is dataframe/series.\n", + "@check_input(schema)\n", + "def dataprocessor(df):\n", + " \"\"\"My data analysis functionality...\n", + " # ...\n", + " \"\"\"\n", + " return df\n", + "\n", + "#dataprocessor(casualties_raw)" + ] + }, + { + "cell_type": "markdown", + "id": "ac094bfd-c4fb-40b9-a550-fe036ae2f897", + "metadata": {}, + "source": [ + "This overcomes repeated checks at the start of a processing function, e.g.\n", + "\n", + "```python\n", + "def dataprocessor(df):\n", + " \"\"\"My data analysis functionality...\n", + " # ...\n", + " \"\"\"\n", + " if \"name\" not in df.columns:\n", + " raise Exception(\"...\")\n", + " # ...\n", + " return df\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "18a418a4-c573-4a1b-b471-711e49d27f4a", + "metadata": {}, + "source": [ + "
\n",
+    "\n",
+    "**Pandera**\n",
+    "\n",
+    "- Easy to set up and __integrate with existing workflow/code__.\n",
+    "- No detailed reporting included (only an error message summary).\n",
+    "    \n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "669ee9e8-c574-4c76-a6ae-abe3af60e681",
+   "metadata": {},
+   "source": [
+    "## Frictionless data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 94,
+   "id": "1c475bbc-79a0-4905-b755-d3186905399d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "from frictionless import validate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 148,
+   "id": "5076eb70-8d38-4bae-b188-8fcd384feb46",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a small subset of cleaned casualties to illustrate frictionless setup\n",
+    "#!head -n 100 ./data/casualties.csv > ./data/casualties_example.csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d693127-05a3-432f-a372-709ae2d042cc",
+   "metadata": {},
+   "source": [
+    "[Frictionless data](https://framework.frictionlessdata.io/) provides a data management framework for Python to describe, extract, __validate__, and transform tabular data.\n",
+    "\n",
+    "It can be used both as a command-line utility and from Python."
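+    ,
+    "\n",
+    "\n",
+    "From Python, a quick check of a tabular source can be run with `validate` (a minimal sketch using inline data; `report.valid` is `True` when no errors are found):\n",
+    "\n",
+    "```python\n",
+    "from frictionless import validate\n",
+    "\n",
+    "# validate inline data: a header row followed by data rows\n",
+    "report = validate([[\"gender\", \"n_victims\"], [\"male\", 1], [\"female\", 2]])\n",
+    "print(report.valid)\n",
+    "```"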
+ ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "29ba2ae7-bb69-4adb-81a3-fbe03be27f8d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Usage: frictionless [OPTIONS] COMMAND [ARGS]...\n", + "\n", + " Describe, extract, validate and transform tabular data.\n", + "\n", + "Options:\n", + " --version\n", + " --install-completion Install completion for the current shell.\n", + " --show-completion Show completion for the current shell, to copy it or\n", + " customize the installation.\n", + " --help Show this message and exit.\n", + "\n", + "Commands:\n", + " api Start API server\n", + " describe Describe a data source.\n", + " extract Extract a data source.\n", + " transform Transform data using a provided pipeline.\n", + " validate Validate a data source.\n" + ] + } + ], + "source": [ + "!frictionless --help" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "491b923a-2d4a-494a-8117-bbb4ef71b52d", + "metadata": {}, + "outputs": [], + "source": [ + "# CLI examples\n", + "#!frictionless describe ./data/casualties_example.csv > casualties_example.resource.yaml\n", + "#!frictionless validate ./data/casualties_example.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 145, + "id": "2cec8fa3-5e89-4137-aad9-967fa80b769c", + "metadata": {}, + "outputs": [], + "source": [ + "from frictionless import describe, validate\n", + "\n", + "resource = describe(\"./data/casualties_example.csv\") # create automated initial version\n", + "\n", + "# Overwrite an data type\n", + "resource.schema.get_field(\"n_victims_ok\").type = 'integer'\n", + "\n", + "# Save the specification to a yaml-file\n", + "resource.to_yaml(\"casualties_example.resource.yaml\");" + ] + }, + { + "cell_type": "markdown", + "id": "ba7351bf-d542-4477-aadd-2038f9742745", + "metadata": {}, + "source": [ + "The data `Schema` can be made part of a [Data 
package](https://framework.frictionlessdata.io/docs/guides/describing-data#describing-a-package) (i.e. a csv with metadata) to communicate the data specification.\n",
+    "\n",
+    "The Frictionless framework provides the following ways to check data consistency:\n",
+    "\n",
+    "- [Constraints](https://specs.frictionlessdata.io/table-schema/#constraints) as part of the description configuration, e.g. minimum/maximum, required, unique, pattern,...\n",
+    "- [Validation checks](https://framework.frictionlessdata.io/docs/guides/validation-checks/) included in the framework, which can be passed to `validate`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 146,
+   "id": "67caa263-6512-48f0-8de2-fdef50306df3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "resource = describe(\"./data/casualties_example.csv\")\n",
+    "\n",
+    "# Add additional rules to the data set Schema:\n",
+    "resource.schema.get_field(\"n_victims_ok\").type = 'integer'\n",
+    "resource.schema.get_field(\"n_victims_ok\").constraints[\"minimum\"] = 0\n",
+    "resource.schema.get_field(\"gender\").constraints[\"pattern\"] = \"male|female\"\n",
+    "resource.schema.get_field(\"gender\").constraints[\"required\"] = True\n",
+    "\n",
+    "# Save the specification to a yaml-file\n",
+    "resource.to_yaml(\"casualties_example.resource.yaml\");"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 147,
+   "id": "871d0942-6672-4223-9532-6f0697fb90e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "report = validate(\"casualties_example.resource.yaml\",\n",
+    "                  checks=[{\"code\": \"table-dimensions\", # additional validation check\n",
+    "                           \"minFields\": 10, \n",
+    "                           \"maxRows\": 200}])\n",
+    "#report[\"tasks\"][0][\"errors\"] # uncomment to see report"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1106b928-64a9-482b-9926-b174219b3fb3",
+   "metadata": {},
+   "source": [
+    "
\n",
+    "\n",
+    "**Frictionless data**\n",
+    "\n",
+    "- Targets csv (excel, json and sql) files; it is not specific to Pandas DataFrames.\n",
+    "- Provides a set of tools to exchange data with appropriate documentation (metadata). For example, it is used to share [camera trap data](https://tdwg.github.io/camtrap-dp/) in a structured and standardized format.\n",
+    "\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d752cc23-12d4-47f0-a6db-09bd309fca72",
+   "metadata": {},
+   "source": [
+    "## Conclusion"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cc46a45a-a830-45e8-a90e-28c8e72922e6",
+   "metadata": {},
+   "source": [
+    "Different tools exist in the Python landscape to validate data sets. Each of these frameworks has its own strengths and use cases:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6eb9309a-fd2f-44e2-a01e-e7a5c3bd38d1",
+   "metadata": {},
+   "source": [
+    "
\n", + "\n", + "**Conclusion**\n", + "\n", + "- __Great Expectations__: When external data integration (databases, cloud storage,...) or _advanced reporting_ (e.g. to provide detailed/automated feedback) is essential.\n", + "- __Pandera__: Ideal for _personal_ (or small team) usage when doing data analysis in Pandas. _Minimal effort_ to get started.\n", + "- __Frictionless data__: Provides the tools to _share_ data in a documented and well-structured workflow, while keeping technical burden low. Does not expect collaborators to use Pandas (e.g. [frictionless-r](https://github.com/frictionlessdata/frictionless-r) for R users).\n", + " \n", + "__Note:__ \n", + "\n", + "- Each framework supports the extension with _custom_ or _user-defined_ rules.\n", + "- Pandera can convert and use a [Frictionless data schema](https://pandera.readthedocs.io/en/stable/frictionless.html#frictionless-integration).\n", + " \n", + "
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}