From 1bf4dac7f5dfda4fefd56d1a2a1df489928a592a Mon Sep 17 00:00:00 2001 From: stijnvanhoey Date: Wed, 23 Mar 2022 22:33:54 +0100 Subject: [PATCH] Add notebook on data validation frameworks --- notebooks/data_validation.ipynb | 1701 +++++++++++++++++++++++++++++++ 1 file changed, 1701 insertions(+) create mode 100644 notebooks/data_validation.ipynb diff --git a/notebooks/data_validation.ipynb b/notebooks/data_validation.ipynb new file mode 100644 index 0000000..fd91589 --- /dev/null +++ b/notebooks/data_validation.ipynb @@ -0,0 +1,1701 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "88a0e9df-2780-417b-8002-7dc0401cee76", + "metadata": {}, + "source": [ + "

Automate data validation - describe data

\n", + "\n", + "> *© 2021, Joris Van den Bossche and Stijn Van Hoey (, ). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "60d811e9-59ca-487b-9c50-603a6db34f63", + "metadata": {}, + "source": [ + "## Introduction" + ] + }, + { + "cell_type": "markdown", + "id": "251a31de-cdde-479e-8098-5f213170983d", + "metadata": {}, + "source": [ + "Running the same analysis (code) on newly incoming data that is _expected to have_ the same characteristics (format, data types,...) can become tedious, as errors occur due to unexpected changes in the data.\n", + "\n", + "Creating quarterly reports, processing data from a repeated experiment, comparing scenarios... are just a few examples of repeated analysis on data. To overcome manual checks of the data quality, setting up an __automated validation__ can be worthwhile. In this notebook, some frameworks targeting Pandas DataFrames are highlighted:\n", + "\n", + "- [Great expectations](https://greatexpectations.io/)\n", + "- [Pandera](https://pandera.readthedocs.io)\n", + "- [Frictionless data](https://frictionlessdata.io/)" + ] + }, + { + "cell_type": "markdown", + "id": "95df9341-a9cc-4f09-bbc1-93ac8e48c7b4", + "metadata": {}, + "source": [ + "__Note__ Imports of the packages are grouped per section/framework." + ] + }, + { + "cell_type": "markdown", + "id": "abbfadcd-3338-46e8-935b-ebcd27c7a5c8", + "metadata": {}, + "source": [ + "## Great expectations" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "f0dea0e3-64ca-44e7-9e15-cd30c1ee1cc3", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from ruamel import yaml\n", + "\n", + "import pandas as pd\n", + "\n", + "import great_expectations as 
ge\n", + "from great_expectations import DataContext\n", + "from great_expectations.core import ExpectationSuite\n", + "from great_expectations.core.batch import RuntimeBatchRequest, BatchRequest\n", + "from great_expectations.validator.validator import Validator\n", + "from great_expectations.checkpoint.checkpoint import SimpleCheckpoint" + ] + }, + { + "cell_type": "markdown", + "id": "412637e9-6ead-4b54-b4a9-55185188f2d2", + "metadata": {}, + "source": [ + "[Great expectations](https://greatexpectations.io/) provides an entire framework for data quality checks, a.k.a. 'production-ready' data validation: connecting to different data sources, notifications, automated triggers,... This provides a powerful ecosystem when working with continuously incoming data in a corporate environment, but the large set of functionalities can be overwhelming. \n", + "\n", + "In this introduction the focus is on setting up a set of _expectations_ (data validation rules) and applying these rules to a Pandas DataFrame loaded in memory." + ] + }, + { + "cell_type": "markdown", + "id": "56d372aa-cce6-4c3f-a534-8cd718989836", + "metadata": {}, + "source": [ + "### Start using great expectations" + ] + }, + { + "cell_type": "markdown", + "id": "62673bc9-89c6-4267-b74f-1ea14b6e46db", + "metadata": {}, + "source": [ + "_Great expectations_ requires a specific folder structure, which can be set up using the `init` command:" + ] + }, + { + "cell_type": "markdown", + "id": "ca075c2e-c813-49f9-8b1a-fa1d6792bc98", + "metadata": {}, + "source": [ + "```\n", + "great_expectations init\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "b0d80dd7-bf67-4a8f-a944-934e2b395033", + "metadata": {}, + "source": [ + "See also the [getting started documentation](https://docs.greatexpectations.io/docs/tutorials/getting_started/tutorial_overview)." 
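Before diving into the framework itself, it can help to see what an _expectation_ amounts to conceptually: a validation rule stored as data, which reports a result instead of raising immediately. The following pure-Python sketch only imitates the general shape of such a rule and its result dictionary; it is an illustration, not how _great expectations_ is implemented:

```python
# A framework-free sketch of an "expectation": a validation rule that returns
# a result dictionary with a success flag instead of raising an exception.
# The result layout loosely imitates great expectations' output style.

def expect_values_to_not_be_null(values):
    """Check that no value in a column is missing (None)."""
    missing = [v for v in values if v is None]
    return {
        "expectation_type": "expect_column_values_to_not_be_null",
        "result": {
            "element_count": len(values),
            "unexpected_count": len(missing),
        },
        "success": not missing,
    }

report = expect_values_to_not_be_null(["2020-09-24", "2020-10-25", None])
print(report["success"])  # one value is missing, so this prints False
```

Collecting such rule results, rather than failing on the first error, is what makes an automated validation report possible.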
+ ] + }, + { + "cell_type": "markdown", + "id": "5bb8ce90-ee84-405d-939a-8cc3dad5c6a7", + "metadata": {}, + "source": [ + "### Define _expectations_ interactively within a Jupyter notebook" + ] + }, + { + "cell_type": "markdown", + "id": "3a7ba0a4-2c2d-441c-955f-28a77f330800", + "metadata": {}, + "source": [ + "Defining a set of _expectations_ can be done using an existing data set. Let's start from the casualties data set used earlier:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "d3ede0f2-012e-4e87-9af5-8ee5f677c5b6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DT_DAYDT_HOURCD_DAY_OF_WEEKTX_DAY_OF_WEEK_DESCR_FRTX_DAY_OF_WEEK_DESCR_NLMS_VICTMS_VIC_OKMS_SLY_INJMS_SERLY_INJMS_DEAD_30_DAYS...TX_ADM_DSTR_DESCR_NLCD_PROV_REFNISTX_PROV_DESCR_FRTX_PROV_DESCR_NLCD_RGN_REFNISTX_RGN_DESCR_FRTX_RGN_DESCR_NLCD_SEXTX_SEX_DESCR_FRTX_SEX_DESCR_NL
02020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest2FémininVrouwelijk
12020-10-25147Dimanchezondag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
22020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
32020-12-01152Mardidinsdag11000...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
42020-12-16173Mercrediwoensdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
\n", + "

5 rows × 43 columns

\n", + "
" + ], + "text/plain": [ + " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n", + "0 2020-09-24 15 4 Jeudi \n", + "1 2020-10-25 14 7 Dimanche \n", + "2 2020-09-24 15 4 Jeudi \n", + "3 2020-12-01 15 2 Mardi \n", + "4 2020-12-16 17 3 Mercredi \n", + "\n", + " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n", + "0 donderdag 1 0 1 0 \n", + "1 zondag 1 0 1 0 \n", + "2 donderdag 1 0 1 0 \n", + "3 dinsdag 1 1 0 0 \n", + "4 woensdag 1 0 1 0 \n", + "\n", + " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n", + "0 0 ... Arrondissement Antwerpen 10000 \n", + "1 0 ... Arrondissement Antwerpen 10000 \n", + "2 0 ... Arrondissement Antwerpen 10000 \n", + "3 0 ... Arrondissement Antwerpen 10000 \n", + "4 0 ... Arrondissement Antwerpen 10000 \n", + "\n", + " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n", + "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "\n", + " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n", + "0 Vlaams Gewest 2 Féminin Vrouwelijk \n", + "1 Vlaams Gewest 1 Masculin Mannelijk \n", + "2 Vlaams Gewest 1 Masculin Mannelijk \n", + "3 Vlaams Gewest 1 Masculin Mannelijk \n", + "4 Vlaams Gewest 1 Masculin Mannelijk \n", + "\n", + "[5 rows x 43 columns]" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n", + " compression='zip', \n", + " sep=\"|\", \n", + " low_memory=False)\n", + "casualties_raw.head()" + ] + }, + { + "cell_type": "markdown", + "id": "15c72e88-70f6-4eb9-9610-2c4d70f50f0a", + "metadata": {}, + "source": [ + "Different options are provided to create new 
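The first rule defined below, `expect_column_distinct_values_to_be_in_set`, essentially boils down to a set comparison between the observed distinct values of a column and an allowed set. A pure-Python sketch of that logic (an illustration of the idea, not the framework's actual implementation):

```python
def distinct_values_in_set(values, value_set):
    """Report whether the distinct values of a column stay within an allowed set."""
    observed = sorted(set(values))
    return {
        "result": {"observed_value": observed, "element_count": len(values)},
        "success": set(observed).issubset(value_set),
    }

# The gender column also contains "Onbekend" (unknown), so this check fails:
check = distinct_values_in_set(
    ["Mannelijk", "Vrouwelijk", "Onbekend", "Mannelijk"],
    {"Vrouwelijk", "Mannelijk"},
)
print(check["success"])  # prints False
```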
expectations (_great expectations_ provides automatically generated notebooks, or have a look at the [automated _profilers_](https://docs.greatexpectations.io/docs/guides/expectations/advanced/how_to_create_a_new_expectation_suite_using_rule_based_profilers) as well). \n", + "\n", + "The following approach starts from our Pandas DataFrame, connecting it to _great expectations_ and adding new rules interactively in this session:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "8ae50ba5-5156-4041-9173-52dc3ceb46e4", + "metadata": {}, + "outputs": [], + "source": [ + "# https://greatexpectations.io/expectations/\n", + "my_df = ge.from_pandas(casualties_raw)" + ] + }, + { + "cell_type": "markdown", + "id": "a09d24f1-c016-434a-bee4-7bac50a9ac6d", + "metadata": {}, + "source": [ + "The `my_df` variable tracks the set of rules that have been defined and run in _great expectations_ (duplicate runs are ignored):" + ] + }, + { + "cell_type": "markdown", + "id": "87b11c35-b13e-4e1f-b668-cb4ef641b2b5", + "metadata": {}, + "source": [ + "_check some rule options yourself using `my_df.expect_` + the TAB key_" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "7a261014-0bd9-45c3-be94-9f274bf3a564", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"meta\": {},\n", + " \"result\": {\n", + " \"observed_value\": [\n", + " \"Mannelijk\",\n", + " \"Onbekend\",\n", + " \"Vrouwelijk\"\n", + " ],\n", + " \"element_count\": 66130,\n", + " \"missing_count\": null,\n", + " \"missing_percent\": null\n", + " },\n", + " \"exception_info\": {\n", + " \"raised_exception\": false,\n", + " \"exception_traceback\": null,\n", + " \"exception_message\": null\n", + " },\n", + " \"success\": false\n", + "}" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"my_df.expect_column_distinct_values_to_be_in_set(column=\"TX_SEX_DESCR_NL\",\n", + " value_set=[\"Vrouwelijk\", \"Mannelijk\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "2b4182a5-ffb1-4837-a73a-9ddf3f33c317", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"meta\": {},\n", + " \"result\": {\n", + " \"observed_value\": 1,\n", + " \"element_count\": 66130,\n", + " \"missing_count\": null,\n", + " \"missing_percent\": null\n", + " },\n", + " \"exception_info\": {\n", + " \"raised_exception\": false,\n", + " \"exception_traceback\": null,\n", + " \"exception_message\": null\n", + " },\n", + " \"success\": true\n", + "}" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_df.expect_column_min_to_be_between(column=\"MS_VICT\", min_value=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "2448ad81-7b04-4ec9-a51a-8b430fcd5b83", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"meta\": {},\n", + " \"result\": {\n", + " \"element_count\": 66130,\n", + " \"unexpected_count\": 0,\n", + " \"unexpected_percent\": 0.0,\n", + " \"unexpected_percent_total\": 0.0,\n", + " \"partial_unexpected_list\": []\n", + " },\n", + " \"exception_info\": {\n", + " \"raised_exception\": false,\n", + " \"exception_traceback\": null,\n", + " \"exception_message\": null\n", + " },\n", + " \"success\": true\n", + "}" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_df.expect_column_values_to_not_be_null(column=\"DT_DAY\")" + ] + }, + { + "cell_type": "markdown", + "id": "faad3ce2-2362-4936-bbbf-bd149f696020", + "metadata": {}, + "source": [ + "Let's check the _expectations_ defined up to this point:" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "4aa9df78-e1bf-4f87-b209-ef7fabad8744", + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/plain": [ + "{\n", + " \"meta\": {\n", + " \"great_expectations_version\": \"0.14.10\"\n", + " },\n", + " \"expectation_suite_name\": \"default\",\n", + " \"ge_cloud_id\": null,\n", + " \"expectations\": [\n", + " {\n", + " \"meta\": {},\n", + " \"expectation_type\": \"expect_column_distinct_values_to_be_in_set\",\n", + " \"kwargs\": {\n", + " \"column\": \"TX_SEX_DESCR_NL\",\n", + " \"value_set\": [\n", + " \"Vrouwelijk\",\n", + " \"Mannelijk\"\n", + " ]\n", + " }\n", + " },\n", + " {\n", + " \"meta\": {},\n", + " \"expectation_type\": \"expect_column_min_to_be_between\",\n", + " \"kwargs\": {\n", + " \"column\": \"MS_VICT\",\n", + " \"min_value\": 0\n", + " }\n", + " },\n", + " {\n", + " \"meta\": {},\n", + " \"expectation_type\": \"expect_column_values_to_not_be_null\",\n", + " \"kwargs\": {\n", + " \"column\": \"DT_DAY\"\n", + " }\n", + " }\n", + " ],\n", + " \"data_asset_type\": \"Dataset\"\n", + "}" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_df.get_expectation_suite(discard_failed_expectations=False)" + ] + }, + { + "cell_type": "markdown", + "id": "9c339da7-9e8c-4324-8375-51e296f2b6dd", + "metadata": {}, + "source": [ + "Store them to reuse the set of expectations later on. We store the `json` output in the `expectations` subfolder within the folder structure. The name of the file, i.e. 
`be_casualties`, will be used later on to refer to this set of expectations:" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "69ba9eea-891c-4bd8-b402-35fff6b87dd2", + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"./great_expectations/expectations/be_casualties.json\", \"w\") as my_file: \n", + " my_file.write(json.dumps(my_df.get_expectation_suite(discard_failed_expectations=False).to_json_dict()) )" + ] + }, + { + "cell_type": "markdown", + "id": "7cf0b648-71ca-4ec2-8ea7-47da5223d57b", + "metadata": {}, + "source": [ + "### Setup to run the great expectations framework on an in-memory DataFrame" + ] + }, + { + "cell_type": "markdown", + "id": "15928374-c288-476a-9fe9-e1a1ddf8c44a", + "metadata": {}, + "source": [ + "These steps only need to be done __once for a new project__. The aim is to create the necessary configuration to apply a set of _expectations_ to an _in-memory_ Pandas DataFrame.\n", + "\n", + "1. First, let `great expectations` get the context from all the configuration in the different subfolders: " + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "96726aa9-41a5-40a0-83ab-8cc42dfe886d", + "metadata": {}, + "outputs": [], + "source": [ + "context = ge.get_context()" + ] + }, + { + "cell_type": "markdown", + "id": "8a771543-21b7-4812-a679-3e5e576d0a9e", + "metadata": {}, + "source": [ + "_Note: this also loads the `be_casualties` expectations we already defined in the previous step._" + ] + }, + { + "cell_type": "markdown", + "id": "974dfc82-4644-4cd7-b978-0d656db2201a", + "metadata": {}, + "source": [ + "2. Define an in-memory Pandas 'data source'\n", + "\n", + "The framework provides a [wide range of connectors](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_choose_which_dataconnector_to_use) (e.g. csv-file, database, cloud storage,...). 
This example provides the configuration to run it within the context of a Python session (Jupyter notebook) on an in-memory DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "a472e4be-bf19-484b-ad7a-81bd751adf52", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "datasource_yaml = f\"\"\"\n", + "name: pandas_casualties\n", + "class_name: Datasource\n", + "module_name: great_expectations.datasource\n", + "execution_engine:\n", + " module_name: great_expectations.execution_engine\n", + " class_name: PandasExecutionEngine\n", + "data_connectors:\n", + " casualties_memory:\n", + " class_name: RuntimeDataConnector\n", + " batch_identifiers:\n", + " - year\n", + "\"\"\"\n", + "context.add_datasource(**yaml.safe_load(datasource_yaml))" + ] + }, + { + "cell_type": "markdown", + "id": "d2f7066a-f8cc-4747-bc68-6b0bec12e281", + "metadata": {}, + "source": [ + "3. 
A _checkpoint_ links a data set with a set of _expectations_:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "c2005d71-3485-4e8f-abca-3f052ed61d49", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\n", + " \"action_list\": [\n", + " {\n", + " \"name\": \"store_validation_result\",\n", + " \"action\": {\n", + " \"class_name\": \"StoreValidationResultAction\"\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"store_evaluation_params\",\n", + " \"action\": {\n", + " \"class_name\": \"StoreEvaluationParametersAction\"\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"update_data_docs\",\n", + " \"action\": {\n", + " \"class_name\": \"UpdateDataDocsAction\",\n", + " \"site_names\": []\n", + " }\n", + " }\n", + " ],\n", + " \"batch_request\": {},\n", + " \"class_name\": \"Checkpoint\",\n", + " \"config_version\": 1.0,\n", + " \"evaluation_parameters\": {},\n", + " \"module_name\": \"great_expectations.checkpoint\",\n", + " \"name\": \"casualties_check\",\n", + " \"profilers\": [],\n", + " \"runtime_configuration\": {},\n", + " \"validations\": [\n", + " {\n", + " \"batch_request\": {\n", + " \"datasource_name\": \"pandas_casualties\",\n", + " \"data_connector_name\": \"casualties_memory\",\n", + " \"data_asset_name\": \"casualties\"\n", + " },\n", + " \"expectation_suite_name\": \"be_casualties\"\n", + " }\n", + " ]\n", + "}" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "checkpoint_config = {\n", + " \"name\": \"casualties_check\",\n", + " \"config_version\": 1,\n", + " \"class_name\": \"SimpleCheckpoint\",\n", + " \"validations\": [\n", + " {\n", + " \"batch_request\": {\n", + " \"datasource_name\": \"pandas_casualties\",\n", + " \"data_connector_name\": \"casualties_memory\",\n", + " \"data_asset_name\": \"casualties\",\n", + " },\n", + " \"expectation_suite_name\": \"be_casualties\"\n", + " }\n", + " ],\n", + "}\n", + 
"context.add_checkpoint(**checkpoint_config)" + ] + }, + { + "cell_type": "markdown", + "id": "aeb8b668-5ee8-4ccc-a33b-72771f777a33", + "metadata": {}, + "source": [ + "This is the necessary configuration to be able to run the data validation. Check the subfolder [./great_expectations](./great_expectations) to see the created configuration files stored on disk." + ] + }, + { + "cell_type": "markdown", + "id": "2549ef9d-036f-418d-9ed4-3ed48f32def1", + "metadata": {}, + "source": [ + "### Apply a set of _expectations_ to a (new version of a) data set" + ] + }, + { + "cell_type": "markdown", + "id": "09daaa0a-3e82-43c6-81bf-ceb333e30706", + "metadata": {}, + "source": [ + "We now have all elements together to validate a (new) data set against the defined expectations and to check the created data validation report." + ] + }, + { + "cell_type": "markdown", + "id": "97ed8c1a-4f6e-4cc5-8a37-9e22a084899d", + "metadata": {}, + "source": [ + "The code in this section is required to run an evaluation on __any new version of the data__." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "438e556c-9efc-4a79-a744-e7137057809c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DT_DAYDT_HOURCD_DAY_OF_WEEKTX_DAY_OF_WEEK_DESCR_FRTX_DAY_OF_WEEK_DESCR_NLMS_VICTMS_VIC_OKMS_SLY_INJMS_SERLY_INJMS_DEAD_30_DAYS...TX_ADM_DSTR_DESCR_NLCD_PROV_REFNISTX_PROV_DESCR_FRTX_PROV_DESCR_NLCD_RGN_REFNISTX_RGN_DESCR_FRTX_RGN_DESCR_NLCD_SEXTX_SEX_DESCR_FRTX_SEX_DESCR_NL
02020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest2FémininVrouwelijk
12020-10-25147Dimanchezondag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
22020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
32020-12-01152Mardidinsdag11000...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
42020-12-16173Mercrediwoensdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
\n", + "

5 rows × 43 columns

\n", + "
" + ], + "text/plain": [ + " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n", + "0 2020-09-24 15 4 Jeudi \n", + "1 2020-10-25 14 7 Dimanche \n", + "2 2020-09-24 15 4 Jeudi \n", + "3 2020-12-01 15 2 Mardi \n", + "4 2020-12-16 17 3 Mercredi \n", + "\n", + " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n", + "0 donderdag 1 0 1 0 \n", + "1 zondag 1 0 1 0 \n", + "2 donderdag 1 0 1 0 \n", + "3 dinsdag 1 1 0 0 \n", + "4 woensdag 1 0 1 0 \n", + "\n", + " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n", + "0 0 ... Arrondissement Antwerpen 10000 \n", + "1 0 ... Arrondissement Antwerpen 10000 \n", + "2 0 ... Arrondissement Antwerpen 10000 \n", + "3 0 ... Arrondissement Antwerpen 10000 \n", + "4 0 ... Arrondissement Antwerpen 10000 \n", + "\n", + " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n", + "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "\n", + " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n", + "0 Vlaams Gewest 2 Féminin Vrouwelijk \n", + "1 Vlaams Gewest 1 Masculin Mannelijk \n", + "2 Vlaams Gewest 1 Masculin Mannelijk \n", + "3 Vlaams Gewest 1 Masculin Mannelijk \n", + "4 Vlaams Gewest 1 Masculin Mannelijk \n", + "\n", + "[5 rows x 43 columns]" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n", + " compression='zip', \n", + " sep=\"|\", \n", + " low_memory=False)\n", + "casualties_raw.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "46cfd158-155d-4171-925b-e55cbce9bedf", + "metadata": {}, + "outputs": [], + "source": [ + 
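Conceptually, running the checkpoint below replays the stored suite of expectations against the freshly loaded data and aggregates the outcomes. A rough pure-Python sketch of that idea (the rule names in the dispatch table are simplified stand-ins, not real _great expectations_ expectation types):

```python
import json

# A stored suite in the same spirit as be_casualties.json, heavily simplified.
suite = json.loads("""[
  {"expectation_type": "not_null", "kwargs": {"column": "DT_DAY"}},
  {"expectation_type": "min_at_least", "kwargs": {"column": "MS_VICT", "min_value": 0}}
]""")

# Map each rule name to the check it performs on a column of values.
CHECKS = {
    "not_null": lambda column, kwargs: all(v is not None for v in column),
    "min_at_least": lambda column, kwargs: min(column) >= kwargs["min_value"],
}

def run_suite(data, suite):
    """Apply each stored rule to its column; the run succeeds only if all pass."""
    results = [
        CHECKS[exp["expectation_type"]](data[exp["kwargs"]["column"]], exp["kwargs"])
        for exp in suite
    ]
    return {"success": all(results), "results": results}

data = {"DT_DAY": ["2020-09-24", "2020-10-25"], "MS_VICT": [1, 1]}
print(run_suite(data, suite)["success"])  # prints True
```

The value of the real framework is that this replay (plus reporting) is driven entirely by stored configuration, so a new data delivery can be validated without changing any code.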
"context = ge.get_context()" + ] + }, + { + "cell_type": "markdown", + "id": "c8b7f0f2-6b68-4a94-861c-48fd32d806bc", + "metadata": {}, + "source": [ + "We can now _run_ a checkpoint on our DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "830fd3d1-7798-4bf6-9425-cb368cc16a65", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "633857360c074aaaa88afa1ba52c575d", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/8 [00:00\n", + "\n", + "**Great expectations**\n", + "\n", + "- An entire ecosystem with lots of integrations and configuration options. \n", + "- Once a set of _expectations_ is defined and the Context/DataSource/Checkpoint has been set up, a very comprehensive __data validation report__ is provided.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "7a1bc9d6-e4b4-4c30-a53c-7cb41b0950a7", + "metadata": {}, + "source": [ + "## Pandera" + ] + }, + { + "cell_type": "markdown", + "id": "01d6e309-0d6e-4974-9baa-5ce91b17b1e1", + "metadata": {}, + "source": [ + "[Pandera](https://pandera.readthedocs.io) provides similar functionality to Great expectations, but requires less configuration to set up:" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "be527cf8-78d4-4315-8305-e967321a1cf6", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import pandera as pa" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "2f33718d-dd4b-4855-b7f3-3bd2770fc2b9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DT_DAYDT_HOURCD_DAY_OF_WEEKTX_DAY_OF_WEEK_DESCR_FRTX_DAY_OF_WEEK_DESCR_NLMS_VICTMS_VIC_OKMS_SLY_INJMS_SERLY_INJMS_DEAD_30_DAYS...TX_ADM_DSTR_DESCR_NLCD_PROV_REFNISTX_PROV_DESCR_FRTX_PROV_DESCR_NLCD_RGN_REFNISTX_RGN_DESCR_FRTX_RGN_DESCR_NLCD_SEXTX_SEX_DESCR_FRTX_SEX_DESCR_NL
02020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest2FémininVrouwelijk
12020-10-25147Dimanchezondag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
22020-09-24154Jeudidonderdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
32020-12-01152Mardidinsdag11000...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
42020-12-16173Mercrediwoensdag10100...Arrondissement Antwerpen10000Province d’AnversProvincie Antwerpen2000Région flamandeVlaams Gewest1MasculinMannelijk
\n", + "

5 rows × 43 columns

\n", + "
" + ], + "text/plain": [ + " DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_FR \\\n", + "0 2020-09-24 15 4 Jeudi \n", + "1 2020-10-25 14 7 Dimanche \n", + "2 2020-09-24 15 4 Jeudi \n", + "3 2020-12-01 15 2 Mardi \n", + "4 2020-12-16 17 3 Mercredi \n", + "\n", + " TX_DAY_OF_WEEK_DESCR_NL MS_VICT MS_VIC_OK MS_SLY_INJ MS_SERLY_INJ \\\n", + "0 donderdag 1 0 1 0 \n", + "1 zondag 1 0 1 0 \n", + "2 donderdag 1 0 1 0 \n", + "3 dinsdag 1 1 0 0 \n", + "4 woensdag 1 0 1 0 \n", + "\n", + " MS_DEAD_30_DAYS ... TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS \\\n", + "0 0 ... Arrondissement Antwerpen 10000 \n", + "1 0 ... Arrondissement Antwerpen 10000 \n", + "2 0 ... Arrondissement Antwerpen 10000 \n", + "3 0 ... Arrondissement Antwerpen 10000 \n", + "4 0 ... Arrondissement Antwerpen 10000 \n", + "\n", + " TX_PROV_DESCR_FR TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_FR \\\n", + "0 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "1 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "2 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "3 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "4 Province d’Anvers Provincie Antwerpen 2000 Région flamande \n", + "\n", + " TX_RGN_DESCR_NL CD_SEX TX_SEX_DESCR_FR TX_SEX_DESCR_NL \n", + "0 Vlaams Gewest 2 Féminin Vrouwelijk \n", + "1 Vlaams Gewest 1 Masculin Mannelijk \n", + "2 Vlaams Gewest 1 Masculin Mannelijk \n", + "3 Vlaams Gewest 1 Masculin Mannelijk \n", + "4 Vlaams Gewest 1 Masculin Mannelijk \n", + "\n", + "[5 rows x 43 columns]" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "casualties_raw = pd.read_csv(\"./data/TF_ACCIDENTS_VICTIMS_2020.zip\", \n", + " compression='zip', \n", + " sep=\"|\", \n", + " low_memory=False,\n", + " parse_dates=[\"DT_DAY\"])\n", + "casualties_raw.head()" + ] + }, + { + "cell_type": "markdown", + "id": "607c096a-75bc-48fb-96a4-0cdc2f53b304", + "metadata": {}, + "source": [ + "Define a 
set of rules as a `DataFrameSchema`:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "909bfb9a-2042-48ce-9ce4-d51c9df2175a", + "metadata": {}, + "outputs": [], + "source": [ + "# define schema\n", + "schema = pa.DataFrameSchema({\n", + " \"MS_VICT\": pa.Column(int, checks=[\n", + " pa.Check.greater_than_or_equal_to(0)\n", + " ]),\n", + " \"DT_DAY\": pa.Column(\"datetime64\"),\n", + " \"TX_SEX_DESCR_NL\": pa.Column(\n", + " str, \n", + " checks=pa.Check.isin([\"Mannelijk\", \"Vrouwelijk\"])\n", + " ),\n", + " \"CD_.+\": pa.Column(\n", + " int,\n", + " regex=True\n", + " ), \n", + "})" + ] + }, + { + "cell_type": "markdown", + "id": "d1b81558-f611-49ee-9bb2-22af7e925020", + "metadata": {}, + "source": [ + "Apply a schema to a DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "77847f3a-b636-4405-94bb-9c34561aa18d", + "metadata": {}, + "outputs": [], + "source": [ + "# validated_df = schema(casualties_raw, lazy=True) # RUN to see report" + ] + }, + { + "cell_type": "markdown", + "id": "81670e50-5f64-48b6-a4a6-5bcee271b013", + "metadata": {}, + "source": [ + "__Note__ Use the `lazy=True` option to see all errors in a single Exception." + ] + }, + { + "cell_type": "markdown", + "id": "8d32c1ef-e95a-4ea7-ab18-da9dd45b95fa", + "metadata": {}, + "source": [ + "Integrate it with existing code and packages using the `check_input` decorator. This can overcome repeated checks at the start of your data processing functions." 
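To see what such an input-checking decorator boils down to, here is a stdlib-only sketch, independent of pandera; the `check_first_arg` and `require_ms_vict` names are invented for illustration:

```python
from functools import wraps

def check_first_arg(validator):
    """Decorator factory: run `validator` on the first argument before the call."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            validator(args[0])  # raises if the input does not satisfy the rules
            return func(*args, **kwargs)
        return wrapper
    return decorate

def require_ms_vict(record):
    # Invented stand-in for a schema: the input must carry an "MS_VICT" key.
    if "MS_VICT" not in record:
        raise ValueError("missing required column: MS_VICT")

@check_first_arg(require_ms_vict)
def dataprocessor(record):
    """My data analysis functionality..."""
    return record["MS_VICT"] + 1

print(dataprocessor({"MS_VICT": 1}))  # prints 2
```

The decorated function body can then assume valid input, which is exactly the convenience `check_input` provides with a full `DataFrameSchema` behind it.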
+ ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "e975e088-4bdb-4ae8-b48c-2319adf5d099", + "metadata": {}, + "outputs": [], + "source": [ + "from pandera import check_input\n", + "\n", + "# by default, `check_input` assumes that the first argument is dataframe/series.\n", + "@check_input(schema)\n", + "def dataprocessor(df):\n", + " \"\"\"My data analysis functionality...\n", + " # ...\n", + " \"\"\"\n", + " return df\n", + "\n", + "#dataprocessor(casualties_raw)" + ] + }, + { + "cell_type": "markdown", + "id": "ac094bfd-c4fb-40b9-a550-fe036ae2f897", + "metadata": {}, + "source": [ + "This overcomes repeated checks at the start of a processing function, e.g.\n", + "\n", + "```python\n", + "def dataprocessor(df):\n", + " \"\"\"My data analysis functionality...\n", + " # ...\n", + " \"\"\"\n", + " if \"name\" not in df.columns:\n", + " raise Exception(\"...\")\n", + " # ...\n", + " return df\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "18a418a4-c573-4a1b-b471-711e49d27f4a", + "metadata": {}, + "source": [ + "
\n",
+    "\n",
+    "**Pandera**\n",
+    "\n",
+    "- Easy to set up and __integrate with existing workflow/code__.\n",
+    "- No detailed reporting included (only an error message summary).\n",
+    "    \n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "669ee9e8-c574-4c76-a6ae-abe3af60e681",
+   "metadata": {},
+   "source": [
+    "## Frictionless data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 94,
+   "id": "1c475bbc-79a0-4905-b755-d3186905399d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "from frictionless import validate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 148,
+   "id": "5076eb70-8d38-4bae-b188-8fcd384feb46",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a small subset of cleaned casualties to illustrate frictionless setup\n",
+    "#!head -n 100 ./data/casualties.csv > ./data/casualties_example.csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d693127-05a3-432f-a372-709ae2d042cc",
+   "metadata": {},
+   "source": [
+    "[Frictionless data](https://framework.frictionlessdata.io/) provides a data management framework for Python to describe, extract, __validate__, and transform tabular data.\n",
+    "\n",
+    "It can be used both as a command-line utility and from Python."
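+    ,
+    "\n",
+    "\n",
+    "From Python, a quick check of a tabular source can be run with `validate` (a minimal sketch using inline data; `report.valid` is `True` when no errors are found):\n",
+    "\n",
+    "```python\n",
+    "from frictionless import validate\n",
+    "\n",
+    "# validate inline data: a header row followed by data rows\n",
+    "report = validate([[\"gender\", \"n_victims\"], [\"male\", 1], [\"female\", 2]])\n",
+    "print(report.valid)\n",
+    "```"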
+ ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "29ba2ae7-bb69-4adb-81a3-fbe03be27f8d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Usage: frictionless [OPTIONS] COMMAND [ARGS]...\n", + "\n", + " Describe, extract, validate and transform tabular data.\n", + "\n", + "Options:\n", + " --version\n", + " --install-completion Install completion for the current shell.\n", + " --show-completion Show completion for the current shell, to copy it or\n", + " customize the installation.\n", + " --help Show this message and exit.\n", + "\n", + "Commands:\n", + " api Start API server\n", + " describe Describe a data source.\n", + " extract Extract a data source.\n", + " transform Transform data using a provided pipeline.\n", + " validate Validate a data source.\n" + ] + } + ], + "source": [ + "!frictionless --help" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "491b923a-2d4a-494a-8117-bbb4ef71b52d", + "metadata": {}, + "outputs": [], + "source": [ + "# CLI examples\n", + "#!frictionless describe ./data/casualties_example.csv > casualties_example.resource.yaml\n", + "#!frictionless validate ./data/casualties_example.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 145, + "id": "2cec8fa3-5e89-4137-aad9-967fa80b769c", + "metadata": {}, + "outputs": [], + "source": [ + "from frictionless import describe, validate\n", + "\n", + "resource = describe(\"./data/casualties_example.csv\") # create automated initial version\n", + "\n", + "# Overwrite an data type\n", + "resource.schema.get_field(\"n_victims_ok\").type = 'integer'\n", + "\n", + "# Save the specification to a yaml-file\n", + "resource.to_yaml(\"casualties_example.resource.yaml\");" + ] + }, + { + "cell_type": "markdown", + "id": "ba7351bf-d542-4477-aadd-2038f9742745", + "metadata": {}, + "source": [ + "The data `Schema` can be made part of a [Data 
package](https://framework.frictionlessdata.io/docs/guides/describing-data#describing-a-package) (i.e. a csv with metadata) to communicate the data specification.\n",
+    "\n",
+    "The Frictionless framework provides the following ways to check data consistency:\n",
+    "\n",
+    "- [Constraints](https://specs.frictionlessdata.io/table-schema/#constraints) as part of the description configuration, e.g. minimum/maximum, required, unique, pattern,...\n",
+    "- [Validation checks](https://framework.frictionlessdata.io/docs/guides/validation-checks/) included in the framework, which can be passed to `validate`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 146,
+   "id": "67caa263-6512-48f0-8de2-fdef50306df3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "resource = describe(\"./data/casualties_example.csv\")\n",
+    "\n",
+    "# Add additional rules to the data set Schema:\n",
+    "resource.schema.get_field(\"n_victims_ok\").type = 'integer'\n",
+    "resource.schema.get_field(\"n_victims_ok\").constraints[\"minimum\"] = 0\n",
+    "resource.schema.get_field(\"gender\").constraints[\"pattern\"] = \"male|female\"\n",
+    "resource.schema.get_field(\"gender\").constraints[\"required\"] = True\n",
+    "\n",
+    "# Save the specification to a yaml-file\n",
+    "resource.to_yaml(\"casualties_example.resource.yaml\");"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 147,
+   "id": "871d0942-6672-4223-9532-6f0697fb90e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "report = validate(\"casualties_example.resource.yaml\",\n",
+    "                  checks=[{\"code\": \"table-dimensions\", # additional validation check\n",
+    "                           \"minFields\": 10, \n",
+    "                           \"maxRows\": 200}])\n",
+    "#report[\"tasks\"][0][\"errors\"] # uncomment to see report"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1106b928-64a9-482b-9926-b174219b3fb3",
+   "metadata": {},
+   "source": [
+    "
\n",
+    "\n",
+    "**Frictionless data**\n",
+    "\n",
+    "- Targets csv (excel, json and sql) files; it is not specific to Pandas DataFrames.\n",
+    "- Provides a set of tools to exchange data with appropriate documentation (metadata). For example, it is used to share [camera trap data](https://tdwg.github.io/camtrap-dp/) in a structured and standardized format.\n",
+    "\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d752cc23-12d4-47f0-a6db-09bd309fca72",
+   "metadata": {},
+   "source": [
+    "## Conclusion"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cc46a45a-a830-45e8-a90e-28c8e72922e6",
+   "metadata": {},
+   "source": [
+    "Different tools exist in the Python landscape to validate data sets. Each of these frameworks has its own strengths and use cases:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6eb9309a-fd2f-44e2-a01e-e7a5c3bd38d1",
+   "metadata": {},
+   "source": [
+    "
\n", + "\n", + "**Conclusion**\n", + "\n", + "- __Great Expectations__: When external data integration (databases, cloud storage,...) or _advanced reporting_ (e.g. to provide detailed/automated feedback) is essential.\n", + "- __Pandera__: Ideal for _personal_ (or small team) usage when doing data analysis in Pandas. _Minimal effort_ to get started.\n", + "- __Frictionless data__: Provides the tools to _share_ data in a documented and well-structured workflow, while keeping technical burden low. Does not expect collaborators to use Pandas (e.g. [frictionless-r](https://github.com/frictionlessdata/frictionless-r) for R users).\n", + " \n", + "__Note:__ \n", + "\n", + "- Each framework supports the extension with _custom_ or _user-defined_ rules.\n", + "- Pandera can convert and use a [Frictionless data schema](https://pandera.readthedocs.io/en/stable/frictionless.html#frictionless-integration).\n", + " \n", + "
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}