"
+ ]
+ },
+ "execution_count": 39,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "datasource_yaml = f\"\"\"\n",
+ "name: pandas_casualties\n",
+ "class_name: Datasource\n",
+ "module_name: great_expectations.datasource\n",
+ "execution_engine:\n",
+ " module_name: great_expectations.execution_engine\n",
+ " class_name: PandasExecutionEngine\n",
+ "data_connectors:\n",
+ " casualties_memory:\n",
+ " class_name: RuntimeDataConnector\n",
+ " batch_identifiers:\n",
+ " - year\n",
+ "\"\"\"\n",
+ "context.add_datasource(**yaml.safe_load(datasource_yaml))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d2f7066a-f8cc-4747-bc68-6b0bec12e281",
+ "metadata": {},
+ "source": [
+ "3. A _checkpoint_ links a data set with a set of _expectations_:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "id": "c2005d71-3485-4e8f-abca-3f052ed61d49",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{\n",
+ " \"action_list\": [\n",
+ " {\n",
+ " \"name\": \"store_validation_result\",\n",
+ " \"action\": {\n",
+ " \"class_name\": \"StoreValidationResultAction\"\n",
+ " }\n",
+ " },\n",
+ " {\n",
+ " \"name\": \"store_evaluation_params\",\n",
+ " \"action\": {\n",
+ " \"class_name\": \"StoreEvaluationParametersAction\"\n",
+ " }\n",
+ " },\n",
+ " {\n",
+ " \"name\": \"update_data_docs\",\n",
+ " \"action\": {\n",
+ " \"class_name\": \"UpdateDataDocsAction\",\n",
+ " \"site_names\": []\n",
+ " }\n",
+ " }\n",
+ " ],\n",
+ " \"batch_request\": {},\n",
+ " \"class_name\": \"Checkpoint\",\n",
+ " \"config_version\": 1.0,\n",
+ " \"evaluation_parameters\": {},\n",
+ " \"module_name\": \"great_expectations.checkpoint\",\n",
+ " \"name\": \"casualties_check\",\n",
+ " \"profilers\": [],\n",
+ " \"runtime_configuration\": {},\n",
+ " \"validations\": [\n",
+ " {\n",
+ " \"batch_request\": {\n",
+ " \"datasource_name\": \"pandas_casualties\",\n",
+ " \"data_connector_name\": \"casualties_memory\",\n",
+ " \"data_asset_name\": \"casualties\"\n",
+ " },\n",
+ " \"expectation_suite_name\": \"be_casualties\"\n",
+ " }\n",
+ " ]\n",
+ "}"
+ ]
+ },
+ "execution_count": 40,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "checkpoint_config = {\n",
+ " \"name\": \"casualties_check\",\n",
+ " \"config_version\": 1,\n",
+ " \"class_name\": \"SimpleCheckpoint\",\n",
+ " \"validations\": [\n",
+ " {\n",
+ " \"batch_request\": {\n",
+ " \"datasource_name\": \"pandas_casualties\",\n",
+ " \"data_connector_name\": \"casualties_memory\",\n",
+ " \"data_asset_name\": \"casualties\",\n",
+ " },\n",
+ " \"expectation_suite_name\": \"be_casualties\"\n",
+ " }\n",
+ " ],\n",
+ "}\n",
+ "context.add_checkpoint(**checkpoint_config)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aeb8b668-5ee8-4ccc-a33b-72771f777a33",
+ "metadata": {},
+ "source": [
+ "This is all the configuration needed to run the data validation. Check the subfolder [./great_expectations](./great_expectations) to see the created configuration files stored on disk."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2549ef9d-036f-418d-9ed4-3ed48f32def1",
+ "metadata": {},
+ "source": [
+ "### Apply a set of _expectations_ on a (new version of) data set"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "09daaa0a-3e82-43c6-81bf-ceb333e30706",
+ "metadata": {},
+ "source": [
+ "We now have all the elements needed to check a (new) data set against the defined expectations and to inspect the resulting data validation report."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "97ed8c1a-4f6e-4cc5-8a37-9e22a084899d",
+ "metadata": {},
+ "source": [
+ "The code in this section is required to run an evaluation on __any new version of the data__."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "id": "438e556c-9efc-4a79-a744-e7137057809c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n",
+ "0 2020-09-24 15 4 Jeudi \n",
+ "1 2020-10-25 14 7 Dimanche \n",
+ "2 2020-09-24 15 4 Jeudi \n",
+ "3 2020-12-01 15 2 Mardi \n",
+ "4 2020-12-16 17 3 Mercredi \n",
+ "\n",
+ " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n",
+ "0 donderdag 1 0 1 0 \n",
+ "1 zondag 1 0 1 0 \n",
+ "2 donderdag 1 0 1 0 \n",
+ "3 dinsdag 1 1 0 0 \n",
+ "4 woensdag 1 0 1 0 \n",
+ "\n",
+ " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n",
+ "0 0 ... Arrondissement Antwerpen 10000 \n",
+ "1 0 ... Arrondissement Antwerpen 10000 \n",
+ "2 0 ... Arrondissement Antwerpen 10000 \n",
+ "3 0 ... Arrondissement Antwerpen 10000 \n",
+ "4 0 ... Arrondissement Antwerpen 10000 \n",
+ "\n",
+ " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n",
+ "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "\n",
+ " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n",
+ "0 Vlaams Gewest 2 Féminin Vrouwelijk \n",
+ "1 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "2 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "3 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "4 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "\n",
+ "[5 rows x 43 columns]"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n",
+ " compression='zip', \n",
+ " sep=\"|\", \n",
+ " low_memory=False)\n",
+ "casualties_raw.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "id": "46cfd158-155d-4171-925b-e55cbce9bedf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "context = ge.get_context()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8b7f0f2-6b68-4a94-861c-48fd32d806bc",
+ "metadata": {},
+ "source": [
+ "We can now _run_ a checkpoint on our DataFrame:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "id": "830fd3d1-7798-4bf6-9425-cb368cc16a65",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "633857360c074aaaa88afa1ba52c575d",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Calculating Metrics: 0%| | 0/8 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "results = context.run_checkpoint(\n",
+ " checkpoint_name=\"casualties_check\",\n",
+ " batch_request={\n",
+ " \"runtime_parameters\": {\"batch_data\": casualties_raw},\n",
+ " \"batch_identifiers\": {\n",
+ " \"year\": 2020\n",
+ " },\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "36cd9206-2642-4f3e-ba62-2aa434c5967e",
+ "metadata": {},
+ "source": [
+ "View the _expectation_ results in an interactive page:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "id": "3df7c69d-fa05-4c1b-8b41-09623e4f6b41",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "context.build_data_docs()\n",
+ "context.open_data_docs()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c073e731-05df-44e8-9d9a-3e8bd4cd92fa",
+ "metadata": {},
+ "source": [
+ "**Great expectations**\n",
+ "\n",
+ "- An entire ecosystem with lots of integrations and configuration options.\n",
+ "- Once a set of _expectations_ and the Context/DataSource/Checkpoint have been set up, a very comprehensive __data validation report__ is provided."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a1bc9d6-e4b4-4c30-a53c-7cb41b0950a7",
+ "metadata": {},
+ "source": [
+ "## Pandera"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "01d6e309-0d6e-4974-9baa-5ce91b17b1e1",
+ "metadata": {},
+ "source": [
+ "[Pandera](https://pandera.readthedocs.io) provides functionality similar to Great Expectations, but requires less configuration:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "id": "be527cf8-78d4-4315-8305-e967321a1cf6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import pandera as pa"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "id": "2f33718d-dd4b-4855-b7f3-3bd2770fc2b9",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n",
+ "0 2020-09-24 15 4 Jeudi \n",
+ "1 2020-10-25 14 7 Dimanche \n",
+ "2 2020-09-24 15 4 Jeudi \n",
+ "3 2020-12-01 15 2 Mardi \n",
+ "4 2020-12-16 17 3 Mercredi \n",
+ "\n",
+ " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n",
+ "0 donderdag 1 0 1 0 \n",
+ "1 zondag 1 0 1 0 \n",
+ "2 donderdag 1 0 1 0 \n",
+ "3 dinsdag 1 1 0 0 \n",
+ "4 woensdag 1 0 1 0 \n",
+ "\n",
+ " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n",
+ "0 0 ... Arrondissement Antwerpen 10000 \n",
+ "1 0 ... Arrondissement Antwerpen 10000 \n",
+ "2 0 ... Arrondissement Antwerpen 10000 \n",
+ "3 0 ... Arrondissement Antwerpen 10000 \n",
+ "4 0 ... Arrondissement Antwerpen 10000 \n",
+ "\n",
+ " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n",
+ "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n",
+ "\n",
+ " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n",
+ "0 Vlaams Gewest 2 Féminin Vrouwelijk \n",
+ "1 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "2 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "3 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "4 Vlaams Gewest 1 Masculin Mannelijk \n",
+ "\n",
+ "[5 rows x 43 columns]"
+ ]
+ },
+ "execution_count": 47,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n",
+ " compression='zip', \n",
+ " sep=\"|\", \n",
+ " low_memory=False,\n",
+ " parse_dates=[\"DT_DAY\"])\n",
+ "casualties_raw.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "607c096a-75bc-48fb-96a4-0cdc2f53b304",
+ "metadata": {},
+ "source": [
+ "Define a set of rules as a `DataFrameSchema`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "id": "909bfb9a-2042-48ce-9ce4-d51c9df2175a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# define schema\n",
+ "schema = pa.DataFrameSchema({\n",
+ " \"MS_VICT\": pa.Column(int, checks=[\n",
+ " pa.Check.greater_than_or_equal_to(0)\n",
+ " ]),\n",
+ " \"DT_DAY\": pa.Column(\"datetime64\"),\n",
+ " \"TX_SEX_DESCR_NL\": pa.Column(\n",
+ " str, \n",
+ " checks=pa.Check.isin([\"Mannelijk\", \"Vrouwelijk\"])\n",
+ " ),\n",
+ " \"CD_.+\": pa.Column(\n",
+ " int,\n",
+ " regex=True\n",
+ " ), \n",
+ "})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d1b81558-f611-49ee-9bb2-22af7e925020",
+ "metadata": {},
+ "source": [
+ "Apply a schema to a DataFrame:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 81,
+ "id": "77847f3a-b636-4405-94bb-9c34561aa18d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# validated_df = schema(casualties_raw, lazy=True) # RUN to see report"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "81670e50-5f64-48b6-a4a6-5bcee271b013",
+ "metadata": {},
+ "source": [
+ "__Note:__ Use the `lazy=True` option to collect all errors in a single exception."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d32c1ef-e95a-4ea7-ab18-da9dd45b95fa",
+ "metadata": {},
+ "source": [
+ "Integrate it with existing code and packages using the `check_input` decorator, which validates the input DataFrame each time the decorated function is called."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "id": "e975e088-4bdb-4ae8-b48c-2319adf5d099",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pandera import check_input\n",
+ "\n",
+ "# by default, `check_input` assumes that the first argument is a DataFrame/Series.\n",
+ "@check_input(schema)\n",
+ "def dataprocessor(df):\n",
+ "    \"\"\"My data analysis functionality...\"\"\"\n",
+ "    # ...\n",
+ "    return df\n",
+ "\n",
+ "#dataprocessor(casualties_raw)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ac094bfd-c4fb-40b9-a550-fe036ae2f897",
+ "metadata": {},
+ "source": [
+ "This avoids repeating checks at the start of each processing function, e.g.\n",
+ "\n",
+ "```python\n",
+ "def dataprocessor(df):\n",
+ "    \"\"\"My data analysis functionality...\"\"\"\n",
+ "    if \"name\" not in df.columns:\n",
+ "        raise Exception(\"...\")\n",
+ "    # ...\n",
+ "    return df\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "18a418a4-c573-4a1b-b471-711e49d27f4a",
+ "metadata": {},
+ "source": [
+ "**Pandera**\n",
+ "\n",
+ "- Easy to set up and __integrate with existing workflows/code__.\n",
+ "- No detailed reporting included (only an error message summary)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "669ee9e8-c574-4c76-a6ae-abe3af60e681",
+ "metadata": {},
+ "source": [
+ "## Frictionless data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 94,
+ "id": "1c475bbc-79a0-4905-b755-d3186905399d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pprint import pprint\n",
+ "from frictionless import validate"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 148,
+ "id": "5076eb70-8d38-4bae-b188-8fcd384feb46",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create a small subset of cleaned casualties to illustrate frictionless setup\n",
+ "#!head -n 100 ./data/casualties.csv > ./data/casualties_example.csv"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5d693127-05a3-432f-a372-709ae2d042cc",
+ "metadata": {},
+ "source": [
+ "[Frictionless data](https://framework.frictionlessdata.io/) provides a data management framework for Python to describe, extract, __validate__, and transform tabular data.\n",
+ "\n",
+ "It can be used both as a command-line utility and from Python."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 92,
+ "id": "29ba2ae7-bb69-4adb-81a3-fbe03be27f8d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Usage: frictionless [OPTIONS] COMMAND [ARGS]...\n",
+ "\n",
+ " Describe, extract, validate and transform tabular data.\n",
+ "\n",
+ "Options:\n",
+ " --version\n",
+ " --install-completion Install completion for the current shell.\n",
+ " --show-completion Show completion for the current shell, to copy it or\n",
+ " customize the installation.\n",
+ " --help Show this message and exit.\n",
+ "\n",
+ "Commands:\n",
+ " api Start API server\n",
+ " describe Describe a data source.\n",
+ " extract Extract a data source.\n",
+ " transform Transform data using a provided pipeline.\n",
+ " validate Validate a data source.\n"
+ ]
+ }
+ ],
+ "source": [
+ "!frictionless --help"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 93,
+ "id": "491b923a-2d4a-494a-8117-bbb4ef71b52d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# CLI examples\n",
+ "#!frictionless describe ./data/casualties_example.csv > casualties_example.resource.yaml\n",
+ "#!frictionless validate ./data/casualties_example.csv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 145,
+ "id": "2cec8fa3-5e89-4137-aad9-967fa80b769c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from frictionless import describe, validate\n",
+ "\n",
+ "resource = describe(\"./data/casualties_example.csv\") # create automated initial version\n",
+ "\n",
+ "# Overwrite a data type\n",
+ "resource.schema.get_field(\"n_victims_ok\").type = 'integer'\n",
+ "\n",
+ "# Save the specification to a yaml-file\n",
+ "resource.to_yaml(\"casualties_example.resource.yaml\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ba7351bf-d542-4477-aadd-2038f9742745",
+ "metadata": {},
+ "source": [
+ "The data `Schema` can be made part of a [Data package](https://framework.frictionlessdata.io/docs/guides/describing-data#describing-a-package) (i.e. a csv with metadata) to communicate the data specification.\n",
+ "\n",
+ "The Frictionless framework provides the following ways to check data consistency:\n",
+ "\n",
+ "- [Constraints](https://specs.frictionlessdata.io/table-schema/#constraints) as part of the description configuration, e.g. minimum/maximum, required, unique, pattern, ...\n",
+ "- [Validation checks](https://framework.frictionlessdata.io/docs/guides/validation-checks/) included in the framework, which can be specified when calling `validate`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 146,
+ "id": "67caa263-6512-48f0-8de2-fdef50306df3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "resource = describe(\"./data/casualties_example.csv\")\n",
+ "\n",
+ "# Add additional rules to the data set Schema:\n",
+ "resource.schema.get_field(\"n_victims_ok\").type = 'integer'\n",
+ "resource.schema.get_field(\"n_victims_ok\").constraints[\"minimum\"] = 0\n",
+ "resource.schema.get_field(\"gender\").constraints[\"pattern\"] = \"male|female\"\n",
+ "resource.schema.get_field(\"gender\").constraints[\"required\"] = True\n",
+ "\n",
+ "# Save the specification to a yaml-file\n",
+ "resource.to_yaml(\"casualties_example.resource.yaml\");"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 147,
+ "id": "871d0942-6672-4223-9532-6f0697fb90e2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "report = validate(\"casualties_example.resource.yaml\",\n",
+ " checks=[{\"code\": \"table-dimensions\", # additional validation check\n",
+ " \"minFields\": 10, \n",
+ " \"maxRows\": 200}])\n",
+ "#report[\"tasks\"][0][\"errors\"] # uncomment to see report"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1106b928-64a9-482b-9926-b174219b3fb3",
+ "metadata": {},
+ "source": [
+ "**Frictionless data**\n",
+ "\n",
+ "- Targets CSV (as well as Excel, JSON and SQL) data; not specific to Pandas DataFrames.\n",
+ "- Provides a set of tools to exchange data with appropriate documentation (metadata). For example, it is used to share [camera trap data](https://tdwg.github.io/camtrap-dp/) in a structured and standardized format."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d752cc23-12d4-47f0-a6db-09bd309fca72",
+ "metadata": {},
+ "source": [
+ "## Conclusion"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cc46a45a-a830-45e8-a90e-28c8e72922e6",
+ "metadata": {},
+ "source": [
+ "Different tools exist in the Python landscape to validate data sets. Each framework has its own strengths and use cases:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6eb9309a-fd2f-44e2-a01e-e7a5c3bd38d1",
+ "metadata": {},
+ "source": [
+ "**Conclusion**\n",
+ "\n",
+ "- __Great Expectations__: When external data integration (databases, cloud storage, ...) or _advanced reporting_ (e.g. to provide detailed/automated feedback) is essential.\n",
+ "- __Pandera__: Ideal for _personal_ (or small team) usage when doing data analysis in Pandas. _Minimal effort_ to get started.\n",
+ "- __Frictionless data__: Provides the tools to _share_ data in a documented and well-structured workflow, while keeping the technical burden low. Does not expect collaborators to use Pandas (e.g. [frictionless-r](https://github.com/frictionlessdata/frictionless-r) for R users).\n",
+ "\n",
+ "__Note:__\n",
+ "\n",
+ "- Each framework can be extended with _custom_ or _user-defined_ rules.\n",
+ "- Pandera can convert and use a [Frictionless data schema](https://pandera.readthedocs.io/en/stable/frictionless.html#frictionless-integration)."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}