diff --git a/Arvato_Project_Workbook_zh.ipynb b/Arvato_Project_Workbook_zh.ipynb new file mode 100644 index 00000000..96b95a4b --- /dev/null +++ b/Arvato_Project_Workbook_zh.ipynb @@ -0,0 +1,3359 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.3" + }, + "colab": { + "name": "Arvato-Project-Workbook-zh.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true, + "machine_shape": "hm", + "include_colab_link": true + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0fOa2fgHUUd1", + "colab_type": "text" + }, + "source": [ + "# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services\n", + "\n", + "In this project you analyze demographics data for the customers of a mail-order company in Germany and compare it against demographics data for the general population. You will use unsupervised learning techniques to segment customers and identify which parts of the population form the company's core customer base. You will then apply what you have learned to a third dataset, which holds demographics data for the targets of one of the company's mail-order marketing campaigns, and use the model you build to predict which individuals are most likely to become customers. The data is provided by our partner Bertelsmann Arvato Analytics and represents a real-life data science task.\n", + "\n", + "If you completed the first term of this Nanodegree and did its unsupervised learning project, the first part of this project will be familiar. The datasets differ in version, though: the ones used here include many more features and have not been pre-cleaned. You are also free to choose your own approach to analyzing the data instead of following predetermined steps. If you choose this project, document your steps and decisions carefully, because your main deliverable is a blog post reporting your findings." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Uz6KTQwTVicl", + "colab_type": "code", + "outputId": "59095343-4d06-4f96-c3cb-a7d574ecef86", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "drive_path = '/content/drive/My Drive/UdacityDataScience/data/'\n", + "workInCoLab = 
True" + ], + "execution_count": 1, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5lKwworjUUd5", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# import libraries here; add more as necessary\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import re\n", + "\n", + "# magic word for producing visualizations in notebook\n", + "%matplotlib inline" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HcBQkudEUUd_", + "colab_type": "text" + }, + "source": [ + "## Part 0: Get to Know the Data\n", + "\n", + "The project data consists of four data files:\n", + "\n", + "- `Udacity_AZDIAS_052018.csv`: demographics data for the general population of Germany; 891211 persons (rows) x 366 features (columns)\n", + "- `Udacity_CUSTOMERS_052018.csv`: demographics data for customers of the mail-order company; 191652 persons (rows) x 369 features (columns)\n", + "- `Udacity_MAILOUT_052018_TRAIN.csv`: demographics data for the targets of a marketing campaign; 42982 persons (rows) x 367 features (columns)\n", + "- `Udacity_MAILOUT_052018_TEST.csv`: demographics data for the targets of a marketing campaign; 42833 persons (rows) x 366 features (columns)\n", + "\n", + "Each row of the demographics files represents a single person, but it also carries information beyond the individual, such as their household, their building, and their neighborhood. Use the first two files to work out how customers (\"CUSTOMERS\") resemble or differ from the general population (\"AZDIAS\"), then use that analysis to make predictions on the other two files (\"MAILOUT\"), identifying the recipients most likely to become customers of the mail-order company.\n", + "\n", + "The \"CUSTOMERS\" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which give additional information about the customers in the file. The original \"MAILOUT\" file included one extra column, \"RESPONSE\", which indicates whether each recipient became a customer of the company. That column is kept in the \"TRAIN\" subset but removed from the \"TEST\" subset; the withheld column is what your final predictions are scored against in the Kaggle competition.\n", + "\n", + "All other columns are the same across the three data files. For more information about the columns, refer to the two Excel spreadsheets in the Workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) lists all attributes with their descriptions, organized by information category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of the data values of each feature, in alphabetical order.\n", + "\n", + "The cell below provides some simple code to load the first two datasets. Note that all `.csv` data files in this project are semicolon (`;`) delimited, so [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) needs an additional argument to read the data correctly. Given the size of the datasets, loading them in full may also take some time.\n", + "\n", + "You will notice a warning when the data is loaded. Before you start modeling and analysis, you need to clean the data. Look through the structure of the datasets and consult the spreadsheets to understand the data values. Decide which features to keep, which to drop, and whether any data formats need to be revised. We recommend writing a preprocessing function, because every dataset has to be cleaned before it is used to train a model." + ] + },
+ { + "cell_type": "code", + "metadata": { + "id": "LEOLMmESUUeA", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + }, + "outputId": "669cb66d-506f-4c2d-8bff-8b8bb5c77958" + }, + "source": [ + "%%time\n", + "if not workInCoLab:\n", + "    # load the data; these relative paths only exist in the Udacity workspace\n", + "    azdias = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_AZDIAS_052018.csv', sep=';')\n", + "    customers = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_CUSTOMERS_052018.csv', sep=';')\n", + "else:\n", + "    # load the data from the mounted Google Drive folder\n", + "    azdias = pd.read_csv(drive_path + 'Udacity_AZDIAS_052018.csv', sep=';')\n", + "    customers = pd.read_csv(drive_path + 'Udacity_CUSTOMERS_052018.csv', sep=';')" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "text": [ + ":2: DtypeWarning: Columns (18,19) have mixed types. 
Specify dtype option on import or set low_memory=False.\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "CPU times: user 25.2 s, sys: 4.6 s, total: 29.8 s\n", + "Wall time: 30.5 s\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3OuZq_Y9FsBM", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# use column LNR as index\n", + "azdias.set_index('LNR', inplace=True)\n", + "customers.set_index('LNR', inplace=True)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QqPh9M-gLE0T", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + }, + "outputId": "8f9b7160-2fe3-4e42-c24e-09274a11fa55" + }, + "source": [ + "azdias.head()" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
[HTML table preview of azdias.head(): 5 rows × 365 columns, indexed by LNR]
" + ], + "text/plain": [ + " AGER_TYP AKT_DAT_KL ... ANREDE_KZ ALTERSKATEGORIE_GROB\n", + "LNR ... \n", + "910215 -1 NaN ... 1 2\n", + "910220 -1 9.0 ... 2 1\n", + "910225 -1 9.0 ... 2 3\n", + "910226 2 1.0 ... 2 4\n", + "910241 -1 1.0 ... 1 3\n", + "\n", + "[5 rows x 365 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dO-oU0KLLId9", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + }, + "outputId": "3288ca23-6e04-4ccb-9ead-f66dc05f4cf8" + }, + "source": [ + "customers.head()" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
[HTML table preview of customers.head(): 5 rows × 368 columns, indexed by LNR, including the extra columns PRODUCT_GROUP, CUSTOMER_GROUP and ONLINE_PURCHASE]
" + ], + "text/plain": [ + " AGER_TYP AKT_DAT_KL ... ANREDE_KZ ALTERSKATEGORIE_GROB\n", + "LNR ... \n", + "9626 2 1.0 ... 1 4\n", + "9628 -1 9.0 ... 1 4\n", + "143872 -1 1.0 ... 2 4\n", + "143873 1 1.0 ... 1 4\n", + "143874 -1 1.0 ... 1 3\n", + "\n", + "[5 rows x 368 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "98XxwxDALRs1", + "colab_type": "text" + }, + "source": [ + "## Part 0: Cleaning the Data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5dY2IYCL862", + "colab_type": "text" + }, + "source": [ + "### Data problems behind the warning\n", + "\n", + "First, let's look at what the warning is pointing at: what is actually wrong with the data in columns 18 and 19. We find the values X and XX here; they carry no usable value and can be treated as -1, meaning unknown.\n", + "\n", + "Note that the warning most likely counts columns as they appear in the raw CSV, where LNR is still a regular column. Assuming LNR is the first CSV column, every position shifts down by one once LNR becomes the index, so the mixed-type columns are CAMEO_DEUG_2015 and CAMEO_INTL_2015 (positions 17 and 18), which is why CJT_GESAMTTYP looks clean in the output.\n", + "\n", + "Based on the warning above, and after comparing against the Values and Attributes spreadsheets we have, this data needs cleaning." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0q5GBA948y7l", + "colab_type": "code", + "outputId": "e94c914f-237f-4707-eb86-03eaad0bd049", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + } + }, + "source": [ + "print('column 18 label is', azdias.columns[18])\n", + "print('column 19 label is', azdias.columns[19])\n", + "print(azdias[azdias.columns[18]].unique())\n", + "print(azdias[azdias.columns[19]].unique())" + ], + "execution_count": 7, + "outputs": [ + { + "output_type": "stream", + "text": [ + "column 18 label is CAMEO_INTL_2015\n", + "column 19 label is CJT_GESAMTTYP\n", + "[nan 51.0 24.0 12.0 43.0 54.0 22.0 14.0 13.0 15.0 33.0 41.0 34.0 55.0 25.0\n", + " 23.0 31.0 52.0 35.0 45.0 44.0 32.0 '22' '24' '41' '12' '54' '51' '44'\n", + " '35' '23' '25' '14' '34' '52' '55' '31' '32' '15' '13' '43' '33' '45'\n", + " 'XX']\n", + "[ 2. 5. 3. 4. 1. 6. 
nan]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hcy6pmAJxGFY", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def findObjectAttributs(dataframe):\n", + "    '''\n", + "    Find the columns of a dataframe whose dtype is object.\n", + "    Args:\n", + "        dataframe {DataFrame} -- either customers or azdias\n", + "    Returns:\n", + "        {set} -- names of the columns whose dtype is object\n", + "    '''\n", + "    object_columns = set()\n", + "    for attr in dataframe.columns:\n", + "        if dataframe[attr].dtype == \"object\":\n", + "            object_columns.add(attr)\n", + "            # compute the unique values only for the columns that get reported\n", + "            print(f'{attr} has value {dataframe[attr].unique()}')\n", + "    return object_columns" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FPgW0yfixJ2S", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 493 + }, + "outputId": "3d7d823c-7cdc-466b-cdee-a5e5804cc2dd" + }, + "source": [ + "findObjectAttributs(azdias)" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "stream", + "text": [ + "CAMEO_DEU_2015 has value [nan '8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D'\n", + " '9E' '9B' '1B' '3D' '4E' '4B' '3C' '5A' '7B' '9A' '6D' '6E' '2C' '7C'\n", + " '9C' '7D' '5E' '1D' '8D' '6C' '6A' '5B' '4D' '3A' '2B' '7E' '3B' '6F'\n", + " '5F' '1C' 'XX']\n", + "CAMEO_DEUG_2015 has value [nan 8.0 4.0 2.0 6.0 1.0 9.0 5.0 7.0 3.0 '4' '3' '7' '2' '8' '9' '6' '5'\n", + " '1' 'X']\n", + "CAMEO_INTL_2015 has value [nan 51.0 24.0 12.0 43.0 54.0 22.0 14.0 13.0 15.0 33.0 41.0 34.0 55.0 25.0\n", + " 23.0 31.0 52.0 35.0 45.0 44.0 32.0 '22' '24' '41' '12' '54' '51' '44'\n", + " '35' '23' '25' '14' '34' '52' '55' '31' '32' '15' '13' '43' '33' '45'\n", + " 'XX']\n", + "D19_LETZTER_KAUF_BRANCHE has value [nan 'D19_UNBEKANNT' 'D19_SCHUHE' 
'D19_ENERGIE' 'D19_KOSMETIK'\n", + " 'D19_VOLLSORTIMENT' 'D19_SONSTIGE' 'D19_BANKEN_GROSS'\n", + " 'D19_DROGERIEARTIKEL' 'D19_HANDWERK' 'D19_BUCH_CD' 'D19_VERSICHERUNGEN'\n", + " 'D19_VERSAND_REST' 'D19_TELKO_REST' 'D19_BANKEN_DIREKT' 'D19_BANKEN_REST'\n", + " 'D19_FREIZEIT' 'D19_LEBENSMITTEL' 'D19_HAUS_DEKO' 'D19_BEKLEIDUNG_REST'\n", + " 'D19_SAMMELARTIKEL' 'D19_TELKO_MOBILE' 'D19_REISEN' 'D19_BEKLEIDUNG_GEH'\n", + " 'D19_TECHNIK' 'D19_NAHRUNGSERGAENZUNG' 'D19_DIGIT_SERV' 'D19_LOTTO'\n", + " 'D19_RATGEBER' 'D19_TIERARTIKEL' 'D19_KINDERARTIKEL' 'D19_BIO_OEKO'\n", + " 'D19_WEIN_FEINKOST' 'D19_GARTEN' 'D19_BILDUNG' 'D19_BANKEN_LOKAL']\n", + "EINGEFUEGT_AM has value [nan '1992-02-10 00:00:00' '1992-02-12 00:00:00' ... '2010-12-02 00:00:00'\n", + " '2005-03-19 00:00:00' '2011-11-18 00:00:00']\n", + "OST_WEST_KZ has value [nan 'W' 'O']\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'CAMEO_DEUG_2015',\n", + " 'CAMEO_DEU_2015',\n", + " 'CAMEO_INTL_2015',\n", + " 'D19_LETZTER_KAUF_BRANCHE',\n", + " 'EINGEFUEGT_AM',\n", + " 'OST_WEST_KZ'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qO6tIHGR73k7", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 578 + }, + "outputId": "af6fcb97-7775-41c1-91e5-e5e8c070b487" + }, + "source": [ + "findObjectAttributs(customers)" + ], + "execution_count": 10, + "outputs": [ + { + "output_type": "stream", + "text": [ + "CAMEO_DEU_2015 has value ['1A' nan '5D' '4C' '7B' '3B' '1D' '9E' '2D' '4A' '6B' '9D' '8B' '5C' '9C'\n", + " '4E' '6C' '8C' '8A' '5B' '9B' '3D' '2A' '3C' '5F' '7A' '1E' '2C' '7C'\n", + " '5A' '2B' '6D' '7E' '5E' '6E' '3A' '9A' '4B' '1C' '1B' '6A' '8D' '7D'\n", + " '6F' '4D' 'XX']\n", + "CAMEO_DEUG_2015 has value [1.0 nan 5.0 4.0 7.0 3.0 9.0 2.0 6.0 8.0 '6' '3' '8' '9' '2' '4' '1' '7'\n", + " '5' 'X']\n", + 
"CAMEO_INTL_2015 has value [13.0 nan 34.0 24.0 41.0 23.0 15.0 55.0 14.0 22.0 43.0 51.0 33.0 25.0 44.0\n", + " 54.0 32.0 12.0 35.0 31.0 45.0 52.0 '45' '25' '55' '51' '14' '54' '43'\n", + " '22' '15' '24' '35' '23' '12' '44' '41' '52' '31' '13' '34' '32' '33'\n", + " 'XX']\n", + "D19_LETZTER_KAUF_BRANCHE has value ['D19_UNBEKANNT' 'D19_BANKEN_GROSS' 'D19_NAHRUNGSERGAENZUNG' 'D19_SCHUHE'\n", + " 'D19_BUCH_CD' 'D19_DROGERIEARTIKEL' 'D19_SONSTIGE' 'D19_TECHNIK'\n", + " 'D19_VERSICHERUNGEN' 'D19_TELKO_MOBILE' 'D19_VOLLSORTIMENT' nan\n", + " 'D19_HAUS_DEKO' 'D19_ENERGIE' 'D19_REISEN' 'D19_BANKEN_LOKAL'\n", + " 'D19_VERSAND_REST' 'D19_BEKLEIDUNG_REST' 'D19_FREIZEIT'\n", + " 'D19_BEKLEIDUNG_GEH' 'D19_TELKO_REST' 'D19_SAMMELARTIKEL'\n", + " 'D19_BANKEN_DIREKT' 'D19_KINDERARTIKEL' 'D19_BANKEN_REST'\n", + " 'D19_LEBENSMITTEL' 'D19_GARTEN' 'D19_HANDWERK' 'D19_RATGEBER'\n", + " 'D19_DIGIT_SERV' 'D19_BIO_OEKO' 'D19_BILDUNG' 'D19_WEIN_FEINKOST'\n", + " 'D19_TIERARTIKEL' 'D19_LOTTO' 'D19_KOSMETIK']\n", + "EINGEFUEGT_AM has value ['1992-02-12 00:00:00' nan '1992-02-10 00:00:00' ... 
'2008-04-25 00:00:00'\n", + " '2005-03-30 00:00:00' '2008-07-14 00:00:00']\n", + "OST_WEST_KZ has value ['W' nan 'O']\n", + "PRODUCT_GROUP has value ['COSMETIC_AND_FOOD' 'FOOD' 'COSMETIC']\n", + "CUSTOMER_GROUP has value ['MULTI_BUYER' 'SINGLE_BUYER']\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'CAMEO_DEUG_2015',\n", + " 'CAMEO_DEU_2015',\n", + " 'CAMEO_INTL_2015',\n", + " 'CUSTOMER_GROUP',\n", + " 'D19_LETZTER_KAUF_BRANCHE',\n", + " 'EINGEFUEGT_AM',\n", + " 'OST_WEST_KZ',\n", + " 'PRODUCT_GROUP'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "y6MTvA77xkuo", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def cleanupCameoDeu2015(dataframe):\n", + "    # clean the three CAMEO columns in place and print the values left afterwards\n", + "    # 'XX' stands for unknown: turn it into NaN\n", + "    dataframe['CAMEO_DEU_2015'] = dataframe['CAMEO_DEU_2015'].replace('XX', np.nan)\n", + "    print(f'after cleanup column CAMEO_DEU_2015 has values: {dataframe[\"CAMEO_DEU_2015\"].unique()}\\n')\n", + "\n", + "    # mixed floats (8.0) and strings ('8'): keep the first character,\n", + "    # then map 'n' (the first character of 'nan') back to NaN\n", + "    dataframe['CAMEO_DEUG_2015'] = dataframe['CAMEO_DEUG_2015']\\\n", + "        .replace('X', np.nan)\\\n", + "        .map(lambda x: str(x)[0])\\\n", + "        .map(lambda x: np.nan if x in ['n'] else x)\\\n", + "        .astype(float)\n", + "    print(f'after cleanup column CAMEO_DEUG_2015 has values: {dataframe[\"CAMEO_DEUG_2015\"].unique()}\\n')\n", + "\n", + "    # NOTE: float entries such as 51.0 are cut to their first digit ('5'), while\n", + "    # string entries such as '51' keep both digits, so the result mixes one-digit\n", + "    # and two-digit codes; the 'na' check never matches because NaN stays a float\n", + "    dataframe['CAMEO_INTL_2015'] = dataframe['CAMEO_INTL_2015']\\\n", + "        .replace('XX', np.nan)\\\n", + "        .map(lambda x: str(x)[0:1] if str(x)[-2:]=='.0' else x)\\\n", + "        .map(lambda y: np.nan if y == 'na' else y)\\\n", + "        .astype(float)\n", + "    print(f'after cleanup column CAMEO_INTL_2015 has values: {dataframe[\"CAMEO_INTL_2015\"].unique()}\\n')\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IyNuVoAM8Eyw", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 391 + }, + "outputId": 
"aa274751-6a73-4be3-e48c-a5d94e330eeb" + }, + "source": [ + "print(\"== cleanupCameoDeu2015 in azdias ==\")\n", + "cleanupCameoDeu2015(azdias)\n", + "\n", + "print(\"== cleanupCameoDeu2015 in customers ==\")\n", + "cleanupCameoDeu2015(customers)" + ], + "execution_count": 12, + "outputs": [ + { + "output_type": "stream", + "text": [ + "== cleanupCameoDeu2015 in azdias ==\n", + "after cleanup column CAMEO_DEU_2015 has values: [nan '8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D'\n", + " '9E' '9B' '1B' '3D' '4E' '4B' '3C' '5A' '7B' '9A' '6D' '6E' '2C' '7C'\n", + " '9C' '7D' '5E' '1D' '8D' '6C' '6A' '5B' '4D' '3A' '2B' '7E' '3B' '6F'\n", + " '5F' '1C']\n", + "\n", + "after cleanup column CAMEO_DEUG_2015 has values: [nan 8. 4. 2. 6. 1. 9. 5. 7. 3.]\n", + "\n", + "after cleanup column CAMEO_INTL_2015 has values: [nan 5. 2. 1. 4. 3. 22. 24. 41. 12. 54. 51. 44. 35. 23. 25. 14. 34.\n", + " 52. 55. 31. 32. 15. 13. 43. 33. 45.]\n", + "\n", + "== cleanupCameoDeu2015 in customers ==\n", + "after cleanup column CAMEO_DEU_2015 has values: ['1A' nan '5D' '4C' '7B' '3B' '1D' '9E' '2D' '4A' '6B' '9D' '8B' '5C' '9C'\n", + " '4E' '6C' '8C' '8A' '5B' '9B' '3D' '2A' '3C' '5F' '7A' '1E' '2C' '7C'\n", + " '5A' '2B' '6D' '7E' '5E' '6E' '3A' '9A' '4B' '1C' '1B' '6A' '8D' '7D'\n", + " '6F' '4D']\n", + "\n", + "after cleanup column CAMEO_DEUG_2015 has values: [ 1. nan 5. 4. 7. 3. 9. 2. 6. 8.]\n", + "\n", + "after cleanup column CAMEO_INTL_2015 has values: [ 1. nan 3. 2. 4. 5. 45. 25. 55. 51. 14. 54. 43. 22. 15. 24. 35. 23.\n", + " 12. 44. 41. 52. 31. 13. 34. 32. 
33.]\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYiZztOfQOXe", + "colab_type": "text" + }, + "source": [ + "With pandas' get_dummies we can one-hot encode an attribute into a matrix of indicator columns, one per value, which can later replace the original attribute column. For example, the column CAMEO_INTL_2015 would be replaced by the dummies below." + ] + }, + { + "cell_type": "code", + "metadata": { + "colab_type": "code", + "id": "Q3BW4HgqTyf7", + "colab": {} + }, + "source": [ + "# dummies = pd.get_dummies(azdias['CAMEO_INTL_2015'], prefix='CAMEO_INTL_2015', prefix_sep='__')\n", + "# dummies" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ebEP5hPtO9NM", + "colab_type": "text" + }, + "source": [ + "### Attributes and Values\n", + "\n", + "Arvato provides matching metadata files. Attributes describes each attribute. Values lists the possible values of each attribute together with their Meaning. Here we load both files and derive the variables needed for the data cleaning that follows. In particular, for unknown data we can use the Meaning descriptions to find the corresponding values so that they can be cleaned out." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KtYzrxPAzfYY", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + }, + "outputId": "80e95e90-b0d6-433e-f2bc-3716dd284bd1" + }, + "source": [ + "attributes_df = pd.read_excel(drive_path + 'DIAS Information Levels - Attributes 2017.xlsx', index_col=None, header=1)\n", + "values_df = pd.read_excel(drive_path + 'DIAS Attributes - Values 2017.xlsx', index_col=None, header=1)\n", + "\n", + "del attributes_df['Unnamed: 0']\n", + "del values_df['Unnamed: 0']\n", + "\n", + "values_df = values_df.ffill(axis=0) # fill merged cells with the value from the row above\n", + "values_df.head()" + ], + "execution_count": 14, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr><th></th><th>Attribute</th><th>Description</th><th>Value</th><th>Meaning</th></tr>\n", + " </thead>\n", + " <tbody>\n", + 
" <tr><th>0</th><td>AGER_TYP</td><td>best-ager typology</td><td>-1</td><td>unknown</td></tr>\n", + 
" <tr><th>1</th><td>AGER_TYP</td><td>best-ager typology</td><td>0</td><td>no classification possible</td></tr>\n", + 
" <tr><th>2</th><td>AGER_TYP</td><td>best-ager typology</td><td>1</td><td>passive elderly</td></tr>\n", + 
" <tr><th>3</th><td>AGER_TYP</td><td>best-ager typology</td><td>2</td><td>cultural elderly</td></tr>\n", + 
" <tr><th>4</th><td>AGER_TYP</td><td>best-ager typology</td><td>3</td><td>experience-driven elderly</td></tr>\n", + 
" </tbody>\n", + "</table>
" + ], + "text/plain": [ + " Attribute Description Value Meaning\n", + "0 AGER_TYP best-ager typology -1 unknown\n", + "1 AGER_TYP best-ager typology 0 no classification possible\n", + "2 AGER_TYP best-ager typology 1 passive elderly\n", + "3 AGER_TYP best-ager typology 2 cultural elderly\n", + "4 AGER_TYP best-ager typology 3 experience-driven elderly" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rVMHXoTK0x5Q", + "colab_type": "code", + "outputId": "3b8cefd7-d152-4ceb-8c09-02397690bc6e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "attributes_df.head()" + ], + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr><th></th><th>Information level</th><th>Attribute</th><th>Description</th><th>Additional notes</th></tr>\n", + " </thead>\n", + " <tbody>\n", + 
" <tr><th>0</th><td>NaN</td><td>AGER_TYP</td><td>best-ager typology</td><td>in cooperation with Kantar TNS; the informatio...</td></tr>\n", + 
" <tr><th>1</th><td>Person</td><td>ALTERSKATEGORIE_GROB</td><td>age through prename analysis</td><td>modelled on millions of first name-age-referen...</td></tr>\n", + 
" <tr><th>2</th><td>NaN</td><td>ANREDE_KZ</td><td>gender</td><td>NaN</td></tr>\n", + 
" <tr><th>3</th><td>NaN</td><td>CJT_GESAMTTYP</td><td>Customer-Journey-Typology relating to the pref...</td><td>relating to the preferred information, marketi...</td></tr>\n", + 
" <tr><th>4</th><td>NaN</td><td>FINANZ_MINIMALIST</td><td>financial typology: low financial interest</td><td>Gfk-Typology based on a representative househo...</td></tr>\n", + 
" </tbody>\n", + "</table>
" + ], + "text/plain": [ + " Information level ... Additional notes\n", + "0 NaN ... in cooperation with Kantar TNS; the informatio...\n", + "1 Person ... modelled on millions of first name-age-referen...\n", + "2 NaN ... NaN\n", + "3 NaN ... relating to the preferred information, marketi...\n", + "4 NaN ... Gfk-Typology based on a representative househo...\n", + "\n", + "[5 rows x 4 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "EaBQYVnWptiM", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + }, + "outputId": "fbf32036-252f-4918-cbfa-a8a104993dd2" + }, + "source": [ + "# display all unique meanings that contain 'known'\n", + "meaning_se = pd.Series(values_df['Meaning'].unique())\n", + "meaning_se[meaning_se.str.contains('known', flags=re.IGNORECASE, regex=True)]" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 unknown\n", + "10 unknown / no main age detectable\n", + "129 no transactions known\n", + "145 no transaction known\n", + "199 residental building buildings without actually...\n", + "201 mixed building without actually known househol...\n", + "202 company building w/o known company \n", + "203 mixed building without actually known household \n", + "205 mixed building without actually known company \n", + "dtype: object" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GpysWLdxt1yu", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "outputId": "22c6a601-f819-4b7a-dccc-c163787c979a" + }, + "source": [ + "def getUnknownColumnsSet():\n", + " ''' \n", + " Collect the column names for values whose meaning is 'unknown' or similar.\n", + " The 
unknown value -1 is replaced separately by a general rule, so it is skipped here.\n", + "\n", + " Args:\n", + " None\n", + " Returns:\n", + " {set} of all possible attribute columns, e.g. \"D19_VERSI_ANZ_12__0\" \n", + " '''\n", + " unknown = ['unknown', \n", + " 'unknown / no main age detectable',\n", + " 'no transactions known',\n", + " 'no transaction known']\n", + "\n", + " attribute_unknown = values_df[values_df['Meaning'].isin(unknown)]\n", + " unknown_columns = set()\n", + "\n", + " for index, row in attribute_unknown.iterrows():\n", + " attribute_name = row['Attribute']\n", + " attribute_values = str(row['Value']).split(', ')\n", + " for unknown_val in attribute_values:\n", + " if unknown_val == '-1':\n", + " continue\n", + " unknown_columns.add(f'{attribute_name}__{unknown_val}')\n", + "\n", + " return unknown_columns\n", + "\n", + "unknown_columns = getUnknownColumnsSet()\n", + "unknown_columns" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'ALTERSKATEGORIE_GROB__0',\n", + " 'ALTER_HH__0',\n", + " 'ANREDE_KZ__0',\n", + " 'CJT_GESAMTTYP__0',\n", + " 'D19_BANKEN_ANZ_12__0',\n", + " 'D19_BANKEN_ANZ_24__0',\n", + " 'D19_BANKEN_DATUM__10',\n", + " 'D19_BANKEN_DIREKT_RZ__0',\n", + " 'D19_BANKEN_GROSS_RZ__0',\n", + " 'D19_BANKEN_LOKAL_RZ__0',\n", + " 'D19_BANKEN_OFFLINE_DATUM__10',\n", + " 'D19_BANKEN_ONLINE_DATUM__10',\n", + " 'D19_BANKEN_REST_RZ__0',\n", + " 'D19_BEKLEIDUNG_GEH_RZ__0',\n", + " 'D19_BEKLEIDUNG_REST_RZ__0',\n", + " 'D19_BILDUNG_RZ__0',\n", + " 'D19_BIO_OEKO_RZ__0',\n", + " 'D19_BUCH_RZ__0',\n", + " 'D19_DIGIT_SERV_RZ__0',\n", + " 'D19_DROGERIEARTIKEL_RZ__0',\n", + " 'D19_ENERGIE_RZ__0',\n", + " 'D19_FREIZEIT_RZ__0',\n", + " 'D19_GARTEN_RZ__0',\n", + " 'D19_GESAMT_ANZ_12__0',\n", + " 'D19_GESAMT_ANZ_24__0',\n", + " 'D19_GESAMT_DATUM__10',\n", + " 'D19_GESAMT_OFFLINE_DATUM__10',\n", + " 
'D19_GESAMT_ONLINE_DATUM__10',\n", + " 'D19_HANDWERK_RZ__0',\n", + " 'D19_HAUS_DEKO_RZ__0',\n", + " 'D19_KINDERARTIKEL_RZ__0',\n", + " 'D19_KOSMETIK_RZ__0',\n", + " 'D19_LEBENSMITTEL_RZ__0',\n", + " 'D19_LOTTO_RZ__0',\n", + " 'D19_NAHRUNGSERGAENZUNG_RZ__0',\n", + " 'D19_RATGEBER_RZ__0',\n", + " 'D19_REISEN_RZ__0',\n", + " 'D19_SAMMELARTIKEL_RZ__0',\n", + " 'D19_SCHUHE_RZ__0',\n", + " 'D19_SONSTIGE_RZ__0',\n", + " 'D19_TECHNIK_RZ__0',\n", + " 'D19_TELKO_ANZ_12__0',\n", + " 'D19_TELKO_ANZ_24__0',\n", + " 'D19_TELKO_DATUM__10',\n", + " 'D19_TELKO_MOBILE_RZ__0',\n", + " 'D19_TELKO_OFFLINE_DATUM__10',\n", + " 'D19_TELKO_ONLINE_DATUM__10',\n", + " 'D19_TELKO_REST_RZ__0',\n", + " 'D19_TIERARTIKEL_RZ__0',\n", + " 'D19_VERSAND_ANZ_12__0',\n", + " 'D19_VERSAND_ANZ_24__0',\n", + " 'D19_VERSAND_DATUM__10',\n", + " 'D19_VERSAND_OFFLINE_DATUM__10',\n", + " 'D19_VERSAND_ONLINE_DATUM__10',\n", + " 'D19_VERSAND_REST_RZ__0',\n", + " 'D19_VERSICHERUNGEN_RZ__0',\n", + " 'D19_VERSI_ANZ_12__0',\n", + " 'D19_VERSI_ANZ_24__0',\n", + " 'D19_VOLLSORTIMENT_RZ__0',\n", + " 'D19_WEIN_FEINKOST_RZ__0',\n", + " 'GEBAEUDETYP__0',\n", + " 'GEOSCORE_KLS7__0',\n", + " 'HAUSHALTSSTRUKTUR__0',\n", + " 'HH_EINKOMMEN_SCORE__0',\n", + " 'KBA05_ALTER1__9',\n", + " 'KBA05_ALTER2__9',\n", + " 'KBA05_ALTER3__9',\n", + " 'KBA05_ALTER4__9',\n", + " 'KBA05_ANHANG__9',\n", + " 'KBA05_AUTOQUOT__9',\n", + " 'KBA05_BAUMAX__0',\n", + " 'KBA05_CCM1__9',\n", + " 'KBA05_CCM2__9',\n", + " 'KBA05_CCM3__9',\n", + " 'KBA05_CCM4__9',\n", + " 'KBA05_DIESEL__9',\n", + " 'KBA05_FRAU__9',\n", + " 'KBA05_GBZ__0',\n", + " 'KBA05_HERST1__9',\n", + " 'KBA05_HERST2__9',\n", + " 'KBA05_HERST3__9',\n", + " 'KBA05_HERST4__9',\n", + " 'KBA05_HERST5__9',\n", + " 'KBA05_HERSTTEMP__9',\n", + " 'KBA05_KRSAQUOT__9',\n", + " 'KBA05_KRSHERST1__9',\n", + " 'KBA05_KRSHERST2__9',\n", + " 'KBA05_KRSHERST3__9',\n", + " 'KBA05_KRSKLEIN__9',\n", + " 'KBA05_KRSOBER__9',\n", + " 'KBA05_KRSVAN__9',\n", + " 'KBA05_KRSZUL__9',\n", + " 'KBA05_KW1__9',\n", + 
" 'KBA05_KW2__9',\n", + " 'KBA05_KW3__9',\n", + " 'KBA05_MAXAH__9',\n", + " 'KBA05_MAXBJ__9',\n", + " 'KBA05_MAXHERST__9',\n", + " 'KBA05_MAXSEG__9',\n", + " 'KBA05_MAXVORB__9',\n", + " 'KBA05_MOD1__9',\n", + " 'KBA05_MOD2__9',\n", + " 'KBA05_MOD3__9',\n", + " 'KBA05_MOD4__9',\n", + " 'KBA05_MOD8__9',\n", + " 'KBA05_MODTEMP__9',\n", + " 'KBA05_MOTOR__9',\n", + " 'KBA05_MOTRAD__9',\n", + " 'KBA05_SEG10__9',\n", + " 'KBA05_SEG1__9',\n", + " 'KBA05_SEG2__9',\n", + " 'KBA05_SEG3__9',\n", + " 'KBA05_SEG4__9',\n", + " 'KBA05_SEG5__9',\n", + " 'KBA05_SEG6__9',\n", + " 'KBA05_SEG7__9',\n", + " 'KBA05_SEG8__9',\n", + " 'KBA05_SEG9__9',\n", + " 'KBA05_VORB0__9',\n", + " 'KBA05_VORB1__9',\n", + " 'KBA05_VORB2__9',\n", + " 'KBA05_ZUL1__9',\n", + " 'KBA05_ZUL2__9',\n", + " 'KBA05_ZUL3__9',\n", + " 'KBA05_ZUL4__9',\n", + " 'KKK__0',\n", + " 'NATIONALITAET_KZ__0',\n", + " 'PRAEGENDE_JUGENDJAHRE__0',\n", + " 'REGIOTYP__0',\n", + " 'RELAT_AB__9',\n", + " 'RETOURTYP_BK_S__0',\n", + " 'SEMIO_DOM__9',\n", + " 'SEMIO_ERL__9',\n", + " 'SEMIO_FAM__9',\n", + " 'SEMIO_KAEM__9',\n", + " 'SEMIO_KRIT__9',\n", + " 'SEMIO_KULT__9',\n", + " 'SEMIO_LUST__9',\n", + " 'SEMIO_MAT__9',\n", + " 'SEMIO_PFLICHT__9',\n", + " 'SEMIO_RAT__9',\n", + " 'SEMIO_REL__9',\n", + " 'SEMIO_SOZ__9',\n", + " 'SEMIO_TRADV__9',\n", + " 'SEMIO_VERT__9',\n", + " 'TITEL_KZ__0',\n", + " 'WACHSTUMSGEBIET_NB__0',\n", + " 'WOHNDAUER_2008__0',\n", + " 'W_KEIT_KIND_HH__0',\n", + " 'ZABEOTYP__9'}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OI0PO52ZoVlQ", + "colab_type": "text" + }, + "source": [ + "#### Gaps between the attributes/values definitions and the actual data" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5iX8f5kl8n5u", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + }, + "outputId": "7923ffc7-f0df-4563-f0a3-13973808019f" + }, + "source": [ + 
"azdias_attributes = azdias.columns\n", + "customer_attributes = customers.columns\n", + "attributes_in_valuesDf = values_df['Attribute'].unique()\n", + "attributes_in_attributesDf = attributes_df['Attribute'].unique()\n", + "\n", + "print(f'there are {azdias_attributes.size} in azdias')\n", + "print(f'there are {customer_attributes.size} in customers')\n", + "print(f'there are {attributes_in_valuesDf.shape[0]} in values_df')\n", + "print(f'there are {attributes_in_attributesDf.shape[0]} in attributes_df')" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "stream", + "text": [ + "there are 365 in azdias\n", + "there are 368 in customers\n", + "there are 314 in values_df\n", + "there are 313 in attributes_df\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wEqwclg5SrqJ", + "colab_type": "text" + }, + "source": [ + "Why does the number of feature columns in azdias and customers not match the metadata in values and attributes? We print the data to analyse this further." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q_i0ZUkBeUPw", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "outputId": "470b5c67-4ae9-449f-98a8-7705426eaa68" + }, + "source": [ + "attributes_without_meta = set()\n", + "ind = 0\n", + "for attr in azdias_attributes:\n", + " ind += 1\n", + " if attr not in attributes_in_valuesDf:\n", + " attributes_without_meta.add(attr)\n", + " print(f'Attribute No.{ind} {attr} is not in attributes_df, but in azdias values are: {azdias[attr].unique()}')" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Attribute No.2 AKT_DAT_KL is not in attributes_df, but in azdias values are: [nan 9. 1. 5. 8. 7. 6. 4. 3. 2.]\n", + "Attribute No.4 ALTER_KIND1 is not in attributes_df, but in azdias values are: [nan 17. 10. 18. 13. 16. 11. 6. 8. 9. 15. 14. 7. 12. 4. 3. 5. 
2.]\n", + "Attribute No.5 ALTER_KIND2 is not in attributes_df, but in azdias values are: [nan 13. 8. 12. 10. 7. 16. 15. 14. 17. 5. 9. 18. 11. 6. 4. 3. 2.]\n", + "Attribute No.6 ALTER_KIND3 is not in attributes_df, but in azdias values are: [nan 10. 18. 17. 16. 8. 15. 9. 12. 13. 14. 11. 7. 5. 6. 4.]\n", + "Attribute No.7 ALTER_KIND4 is not in attributes_df, but in azdias values are: [nan 10. 9. 16. 14. 13. 11. 18. 17. 15. 8. 12. 7.]\n", + "Attribute No.8 ALTERSKATEGORIE_FEIN is not in attributes_df, but in azdias values are: [nan 21. 17. 13. 14. 10. 16. 20. 11. 19. 15. 18. 9. 22. 12. 0. 8. 7.\n", + " 23. 4. 24. 6. 3. 2. 5. 25. 1.]\n", + "Attribute No.11 ANZ_KINDER is not in attributes_df, but in azdias values are: [nan 0. 1. 2. 3. 4. 5. 6. 9. 7. 11. 8.]\n", + "Attribute No.13 ANZ_STATISTISCHE_HAUSHALTE is not in attributes_df, but in azdias values are: [ nan 12. 7. 2. 3. 5. 6. 1. 14. 4. 11. 13. 30. 22.\n", + " 36. 244. 10. 32. 8. 9. 18. 17. 16. 67. 19. 15. 26. 20.\n", + " 23. 33. 34. 68. 53. 21. 42. 57. 28. 25. 60. 35. 29. 43.\n", + " 64. 27. 46. 24. 48. 31. 56. 37. 243. 157. 39. 40. 71. 63.\n", + " 38. 44. 50. 101. 66. 41. 81. 47. 192. 131. 149. 74. 84. 80.\n", + " 137. 45. 94. 65. 54. 87. 69. 125. 61. 82. 73. 72. 86. 292.\n", + " 70. 83. 91. 112. 58. 51. 75. 52. 90. 140. 49. 212. 79. 152.\n", + " 142. 166. 251. 99. 107. 76. 173. 89. 138. 92. 154. 115. 100. 55.\n", + " 116. 88. 113. 162. 95. 168. 62. 97. 110. 127. 102. 93. 103. 78.\n", + " 111. 114. 77. 98. 365. 146. 109. 59. 108. 289. 130. 85. 119. 159.\n", + " 183. 117. 303. 96. 124. 163. 123. 122. 156. 155. 0. 319. 104. 223.\n", + " 105. 128. 217. 129. 170. 218. 367. 253. 132. 328. 126. 193. 133. 153.\n", + " 118. 121. 233. 143. 229. 134. 256. 180. 106. 202. 120. 136. 148. 204.\n", + " 245. 135. 164. 248. 150. 160. 147. 139. 189. 195. 187. 269. 179. 174.\n", + " 161. 225. 314. 158. 141. 167. 353. 354. 284. 258. 234. 194. 264. 145.\n", + " 239. 238. 144. 235. 369. 257. 227. 184. 296. 317. 240. 213. 169. 
199.\n", + " 222. 230. 274. 181. 197. 237. 252. 309. 339. 262. 366. 177. 241. 175.\n", + " 304. 186. 178. 176. 172. 200. 216. 182. 205. 228. 336. 445. 242. 299.\n", + " 151. 203. 185. 268. 214. 286. 322. 198. 449. 165. 190. 297. 209. 342.\n", + " 375. 371. 171.]\n", + "Attribute No.15 ARBEIT is not in attributes_df, but in azdias values are: [nan 3. 2. 4. 1. 5. 9.]\n", + "Attribute No.19 CAMEO_INTL_2015 is not in attributes_df, but in azdias values are: [nan 5. 2. 1. 4. 3. 22. 24. 41. 12. 54. 51. 44. 35. 23. 25. 14. 34.\n", + " 52. 55. 31. 32. 15. 13. 43. 33. 45.]\n", + "Attribute No.21 CJT_KATALOGNUTZER is not in attributes_df, but in azdias values are: [ 5. 1. 2. 3. 4. nan]\n", + "Attribute No.22 CJT_TYP_1 is not in attributes_df, but in azdias values are: [ 1. 5. 4. 2. 3. nan]\n", + "Attribute No.23 CJT_TYP_2 is not in attributes_df, but in azdias values are: [ 1. 5. 4. 2. 3. nan]\n", + "Attribute No.24 CJT_TYP_3 is not in attributes_df, but in azdias values are: [ 5. 2. 1. 4. 3. nan]\n", + "Attribute No.25 CJT_TYP_4 is not in attributes_df, but in azdias values are: [ 5. 3. 4. 1. 2. nan]\n", + "Attribute No.26 CJT_TYP_5 is not in attributes_df, but in azdias values are: [ 5. 1. 2. 3. 4. nan]\n", + "Attribute No.27 CJT_TYP_6 is not in attributes_df, but in azdias values are: [ 5. 1. 2. 3. 4. 
nan]\n", + "Attribute No.31 D19_BANKEN_DIREKT is not in attributes_df, but in azdias values are: [0 1 6 5 4 3 7 2]\n", + "Attribute No.32 D19_BANKEN_GROSS is not in attributes_df, but in azdias values are: [0 2 6 3 5 1 4]\n", + "Attribute No.33 D19_BANKEN_LOKAL is not in attributes_df, but in azdias values are: [0 7 3 6 5 2 1 4]\n", + "Attribute No.37 D19_BANKEN_REST is not in attributes_df, but in azdias values are: [0 6 5 4 3 7 2 1]\n", + "Attribute No.38 D19_BEKLEIDUNG_GEH is not in attributes_df, but in azdias values are: [0 6 5 7 3 2 4 1]\n", + "Attribute No.39 D19_BEKLEIDUNG_REST is not in attributes_df, but in azdias values are: [0 1 6 7 5 3 4 2]\n", + "Attribute No.40 D19_BILDUNG is not in attributes_df, but in azdias values are: [0 6 3 7 2 4 5 1]\n", + "Attribute No.41 D19_BIO_OEKO is not in attributes_df, but in azdias values are: [0 6 7 3 5 2 4 1]\n", + "Attribute No.42 D19_BUCH_CD is not in attributes_df, but in azdias values are: [0 6 5 3 1 7 4 2]\n", + "Attribute No.43 D19_DIGIT_SERV is not in attributes_df, but in azdias values are: [0 6 7 3 5 2 4 1]\n", + "Attribute No.44 D19_DROGERIEARTIKEL is not in attributes_df, but in azdias values are: [0 1 6 3 7 4 5 2]\n", + "Attribute No.45 D19_ENERGIE is not in attributes_df, but in azdias values are: [0 5 3 6 7 2 1 4]\n", + "Attribute No.46 D19_FREIZEIT is not in attributes_df, but in azdias values are: [0 7 3 6 5 4 1 2]\n", + "Attribute No.47 D19_GARTEN is not in attributes_df, but in azdias values are: [0 3 6 7 5 4 2 1]\n", + "Attribute No.54 D19_HANDWERK is not in attributes_df, but in azdias values are: [0 6 3 5 7 4 2 1]\n", + "Attribute No.55 D19_HAUS_DEKO is not in attributes_df, but in azdias values are: [0 5 6 7 1 3 4 2]\n", + "Attribute No.56 D19_KINDERARTIKEL is not in attributes_df, but in azdias values are: [0 6 7 3 5 2 1 4]\n", + "Attribute No.58 D19_KONSUMTYP_MAX is not in attributes_df, but in azdias values are: [9 8 1 2 3 4]\n", + "Attribute No.59 D19_KOSMETIK is not in attributes_df, but 
in azdias values are: [0 6 3 5 7 2 4 1]\n", + "Attribute No.60 D19_LEBENSMITTEL is not in attributes_df, but in azdias values are: [0 6 7 5 3 4 1 2]\n", + "Attribute No.61 D19_LETZTER_KAUF_BRANCHE is not in attributes_df, but in azdias values are: [nan 'D19_UNBEKANNT' 'D19_SCHUHE' 'D19_ENERGIE' 'D19_KOSMETIK'\n", + " 'D19_VOLLSORTIMENT' 'D19_SONSTIGE' 'D19_BANKEN_GROSS'\n", + " 'D19_DROGERIEARTIKEL' 'D19_HANDWERK' 'D19_BUCH_CD' 'D19_VERSICHERUNGEN'\n", + " 'D19_VERSAND_REST' 'D19_TELKO_REST' 'D19_BANKEN_DIREKT' 'D19_BANKEN_REST'\n", + " 'D19_FREIZEIT' 'D19_LEBENSMITTEL' 'D19_HAUS_DEKO' 'D19_BEKLEIDUNG_REST'\n", + " 'D19_SAMMELARTIKEL' 'D19_TELKO_MOBILE' 'D19_REISEN' 'D19_BEKLEIDUNG_GEH'\n", + " 'D19_TECHNIK' 'D19_NAHRUNGSERGAENZUNG' 'D19_DIGIT_SERV' 'D19_LOTTO'\n", + " 'D19_RATGEBER' 'D19_TIERARTIKEL' 'D19_KINDERARTIKEL' 'D19_BIO_OEKO'\n", + " 'D19_WEIN_FEINKOST' 'D19_GARTEN' 'D19_BILDUNG' 'D19_BANKEN_LOKAL']\n", + "Attribute No.62 D19_LOTTO is not in attributes_df, but in azdias values are: [nan 0. 6. 7. 5. 3. 4. 2. 1.]\n", + "Attribute No.63 D19_NAHRUNGSERGAENZUNG is not in attributes_df, but in azdias values are: [0 5 6 7 4 1 2 3]\n", + "Attribute No.64 D19_RATGEBER is not in attributes_df, but in azdias values are: [0 7 6 3 5 2 4 1]\n", + "Attribute No.65 D19_REISEN is not in attributes_df, but in azdias values are: [0 6 7 3 5 2 4 1]\n", + "Attribute No.66 D19_SAMMELARTIKEL is not in attributes_df, but in azdias values are: [0 6 1 7 5 3 2 4]\n", + "Attribute No.67 D19_SCHUHE is not in attributes_df, but in azdias values are: [0 1 3 5 6 2 7 4]\n", + "Attribute No.68 D19_SONSTIGE is not in attributes_df, but in azdias values are: [0 6 4 7 5 3 2 1]\n", + "Attribute No.69 D19_SOZIALES is not in attributes_df, but in azdias values are: [nan 0. 4. 5. 3. 1. 
2.]\n", + "Attribute No.70 D19_TECHNIK is not in attributes_df, but in azdias values are: [0 6 5 7 3 1 4 2]\n", + "Attribute No.74 D19_TELKO_MOBILE is not in attributes_df, but in azdias values are: [0 6 3 7 4 5 2 1]\n", + "Attribute No.77 D19_TELKO_ONLINE_QUOTE_12 is not in attributes_df, but in azdias values are: [nan 0. 10. 5. 7. 3.]\n", + "Attribute No.78 D19_TELKO_REST is not in attributes_df, but in azdias values are: [0 5 6 4 3 7 2 1]\n", + "Attribute No.79 D19_TIERARTIKEL is not in attributes_df, but in azdias values are: [0 6 5 3 4 7 2 1]\n", + "Attribute No.86 D19_VERSAND_REST is not in attributes_df, but in azdias values are: [0 2 6 5 3 1 4 7]\n", + "Attribute No.89 D19_VERSI_DATUM is not in attributes_df, but in azdias values are: [10 2 8 9 6 7 5 1 4 3]\n", + "Attribute No.90 D19_VERSI_OFFLINE_DATUM is not in attributes_df, but in azdias values are: [10 7 9 6 4 8 5 2 3 1]\n", + "Attribute No.91 D19_VERSI_ONLINE_DATUM is not in attributes_df, but in azdias values are: [10 8 9 5 6 7 4 1 2 3]\n", + "Attribute No.92 D19_VERSI_ONLINE_QUOTE_12 is not in attributes_df, but in azdias values are: [nan 0. 10. 5. 7. 8. 6. 3. 9.]\n", + "Attribute No.93 D19_VERSICHERUNGEN is not in attributes_df, but in azdias values are: [0 3 6 4 5 7 2 1]\n", + "Attribute No.94 D19_VOLLSORTIMENT is not in attributes_df, but in azdias values are: [0 7 6 3 2 5 4 1]\n", + "Attribute No.95 D19_WEIN_FEINKOST is not in attributes_df, but in azdias values are: [0 6 7 5 3 2 4 1]\n", + "Attribute No.96 DSL_FLAG is not in attributes_df, but in azdias values are: [nan 1. 0.]\n", + "Attribute No.97 EINGEFUEGT_AM is not in attributes_df, but in azdias values are: [nan '1992-02-10 00:00:00' '1992-02-12 00:00:00' ... '2010-12-02 00:00:00'\n", + " '2005-03-19 00:00:00' '2011-11-18 00:00:00']\n", + "Attribute No.98 EINGEZOGENAM_HH_JAHR is not in attributes_df, but in azdias values are: [ nan 2004. 2000. 1998. 1994. 2005. 2007. 2009. 2016. 2014. 2015. 2013.\n", + " 2008. 2010. 2001. 2002. 1997. 
2012. 1992. 1999. 1996. 1995. 2011. 2003.\n", + " 2006. 1991. 2017. 1993. 2018. 1989. 1990. 1987. 1986. 1988. 1900. 1904.\n", + " 1971. 1984.]\n", + "Attribute No.100 EXTSEL992 is not in attributes_df, but in azdias values are: [nan 14. 31. 20. 56. 53. 27. 54. 6. 25. 48. 55. 36. 34. 35. 18. 38. 32.\n", + " 29. 41. 43. 22. 19. 24. 23. 8. 21. 37. 7. 3. 39. 12. 15. 17. 4. 9.\n", + " 44. 50. 13. 33. 42. 49. 1. 30. 10. 45. 26. 16. 28. 47. 2. 46. 11. 40.\n", + " 51. 52. 5.]\n", + "Attribute No.108 FIRMENDICHTE is not in attributes_df, but in azdias values are: [nan 2. 4. 5. 3. 1.]\n", + "Attribute No.112 GEMEINDETYP is not in attributes_df, but in azdias values are: [nan 22. 40. 21. 12. 30. 11. 50.]\n", + "Attribute No.116 HH_DELTA_FLAG is not in attributes_df, but in azdias values are: [nan 0. 1.]\n", + "Attribute No.188 KBA13_ANTG1 is not in attributes_df, but in azdias values are: [nan 2. 1. 4. 3. 0.]\n", + "Attribute No.189 KBA13_ANTG2 is not in attributes_df, but in azdias values are: [nan 4. 3. 2. 1. 0.]\n", + "Attribute No.190 KBA13_ANTG3 is not in attributes_df, but in azdias values are: [nan 2. 1. 0. 3.]\n", + "Attribute No.191 KBA13_ANTG4 is not in attributes_df, but in azdias values are: [nan 1. 0. 2.]\n", + "Attribute No.195 KBA13_BAUMAX is not in attributes_df, but in azdias values are: [nan 2. 1. 4. 5. 3.]\n", + "Attribute No.207 KBA13_CCM_1401_2500 is not in attributes_df, but in azdias values are: [nan 3. 2. 1. 4. 5.]\n", + "Attribute No.220 KBA13_GBZ is not in attributes_df, but in azdias values are: [nan 4. 3. 5. 2. 1.]\n", + "Attribute No.238 KBA13_HHZ is not in attributes_df, but in azdias values are: [nan 5. 4. 3. 2. 1.]\n", + "Attribute No.244 KBA13_KMH_210 is not in attributes_df, but in azdias values are: [nan 4. 2. 3. 5. 1.]\n", + "Attribute No.300 KK_KUNDENTYP is not in attributes_df, but in azdias values are: [nan 1. 3. 6. 4. 2. 
5.]\n", + "Attribute No.302 KOMBIALTER is not in attributes_df, but in azdias values are: [9 1 2 4 3]\n", + "Attribute No.304 KONSUMZELLE is not in attributes_df, but in azdias values are: [nan 1. 0.]\n", + "Attribute No.312 MOBI_RASTER is not in attributes_df, but in azdias values are: [nan 1. 2. 4. 3. 5. 6.]\n", + "Attribute No.329 RT_KEIN_ANREIZ is not in attributes_df, but in azdias values are: [ 1. 5. 3. 4. 2. nan]\n", + "Attribute No.330 RT_SCHNAEPPCHEN is not in attributes_df, but in azdias values are: [ 4. 3. 2. 5. 1. nan]\n", + "Attribute No.331 RT_UEBERGROESSE is not in attributes_df, but in azdias values are: [ 1. 5. 3. 4. nan 2. 0.]\n", + "Attribute No.347 SOHO_KZ is not in attributes_df, but in azdias values are: [nan 1. 0.]\n", + "Attribute No.348 STRUKTURTYP is not in attributes_df, but in azdias values are: [nan 2. 3. 1.]\n", + "Attribute No.350 UMFELD_ALT is not in attributes_df, but in azdias values are: [nan 3. 2. 4. 5. 1.]\n", + "Attribute No.351 UMFELD_JUNG is not in attributes_df, but in azdias values are: [nan 3. 5. 4. 2. 1.]\n", + "Attribute No.352 UNGLEICHENN_FLAG is not in attributes_df, but in azdias values are: [nan 1. 0.]\n", + "Attribute No.353 VERDICHTUNGSRAUM is not in attributes_df, but in azdias values are: [nan 0. 1. 35. 3. 7. 23. 4. 8. 13. 16. 25. 5. 21. 6. 15. 32. 42.\n", + " 31. 11. 33. 22. 30. 18. 12. 27. 2. 9. 28. 10. 14. 20. 17. 43. 19. 24.\n", + " 34. 40. 39. 29. 26. 44. 45. 37. 36. 41. 38.]\n", + "Attribute No.355 VHA is not in attributes_df, but in azdias values are: [nan 0. 1. 5. 2. 4. 3.]\n", + "Attribute No.356 VHN is not in attributes_df, but in azdias values are: [nan 4. 2. 0. 1. 3.]\n", + "Attribute No.357 VK_DHT4A is not in attributes_df, but in azdias values are: [nan 8. 9. 7. 3. 10. 1. 6. 4. 2. 5. 11.]\n", + "Attribute No.358 VK_DISTANZ is not in attributes_df, but in azdias values are: [nan 11. 9. 10. 5. 7. 12. 1. 6. 13. 8. 4. 3. 
2.]\n", + "Attribute No.359 VK_ZG11 is not in attributes_df, but in azdias values are: [nan 10. 6. 11. 4. 9. 8. 1. 3. 7. 5. 2.]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mj2Dia9ROmOH", + "colab_type": "text" + }, + "source": [ + "Looking at the output, we see that the values of D19_LETZTER_KAUF_BRANCHE correspond exactly to the other D19 columns, so the data looks partly redundant. CJT_KATALOGNUTZER is similar: it is duplicated by the other CJT columns. ANZ_STATISTISCHE_HAUSHALTE and EXTSEL992 have a large number of distinct values, but we lack specific metadata for them, so we decided not to keep them. EINGEZOGENAM_HH_JAHR is dropped as well. GEBURTSJAHR is the year of birth; we already have another age-related column, ALTER_HH, so we decided to ignore it too." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_6twSg0w1hpR", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + }, + "outputId": "eadc0632-6a7c-42fc-ffd9-9f776a106644" + }, + "source": [ + "azdias_shapeBefore = azdias.shape[1]\n", + "customers_shapeBefore = customers.shape[1]\n", + "columns_to_drop = {'D19_LETZTER_KAUF_BRANCHE', \n", + " 'CJT_KATALOGNUTZER', \n", + " 'EINGEZOGENAM_HH_JAHR', \n", + " 'ANZ_STATISTISCHE_HAUSHALTE',\n", + " 'ANZ_HAUSHALTE_AKTIV',\n", + " 'VERDICHTUNGSRAUM', \n", + " 'EXTSEL992',\n", + " 'GEBURTSJAHR'}\n", + "azdias.drop(columns=columns_to_drop, axis=1, errors='ignore', inplace=True)\n", + "customers.drop(columns=columns_to_drop, axis=1, errors='ignore', inplace=True)\n", + "columns_to_drop.clear()\n", + "\n", + "print(f'before azdias drop {azdias_shapeBefore}, and after {azdias.shape}')\n", + "print(f'before customers drop {customers_shapeBefore}, and customers {customers.shape}')" + ], + "execution_count": 20, + "outputs": [ + { + "output_type": "stream", + "text": [ + "before azdias drop 365, and after (891221, 357)\n", + "before customers drop 368, and customers (191652, 360)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mgIRtHsf1kkf", + "colab_type": "text" + }, + "source": [ + "EINGEFUEGT_AM is the time at which a record was added. It has 5163 distinct values in azdias, so we use only the year as the feature. After the replacement, EINGEFUEGT_AM holds just the year in which the record was entered." + ] + }, + 
{ + "cell_type": "code", + "metadata": { + "id": "XRDJwJuMk1CM", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def pickYearValue(dataframe, attributeName):\n", + " '''\n", + " Replace the timestamp values of an attribute with their year only.\n", + " Args:\n", + " dataframe {DataFrame} -- customers or azdias\n", + " attributeName {string} -- name of an attribute whose values are timestamps \n", + " like 1992-02-10 00:00:00\n", + " Returns:\n", + " None\n", + " '''\n", + " attr_values = dataframe[attributeName].unique()\n", + " print(f'Attribute {attributeName} has {attr_values.shape[0]} values')\n", + " print('Before change:\\n', dataframe[attributeName].head())\n", + " # x != np.nan is always True, so missing values are detected with pd.notna\n", + " dataframe[attributeName] = dataframe[attributeName]\\\n", + " .map(lambda x: str(x)[:4] if pd.notna(x) else x)\\\n", + " .astype(float)\n", + " print(f'After change:\\n', dataframe[attributeName].head())\n", + " print('We replaced dataframe[attributeName] with only the year values:',\n", + " dataframe[attributeName].unique())\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lFGXWZffnW4Z", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 717 + }, + "outputId": "ea314f90-3eae-409c-8121-017e2c871dbd" + }, + "source": [ + "pickYearValue(customers, 'EINGEFUEGT_AM')\n", + "pickYearValue(azdias, 'EINGEFUEGT_AM')" + ], + "execution_count": 22, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Attribute EINGEFUEGT_AM has 3035 values\n", + "Before change:\n", + " LNR\n", + "9626 1992-02-12 00:00:00\n", + "9628 NaN\n", + "143872 1992-02-10 00:00:00\n", + "143873 1992-02-10 00:00:00\n", + "143874 1992-02-12 00:00:00\n", + "Name: EINGEFUEGT_AM, dtype: object\n", + "After change:\n", + " LNR\n", + "9626 1992.0\n", + "9628 NaN\n", + "143872 1992.0\n", + "143873 1992.0\n", + "143874 1992.0\n", + "Name: EINGEFUEGT_AM, 
dtype: float64\n", + "We replaced dataframe[attributeName] with only the year values: [1992. nan 2004. 1997. 1995. 2007. 2005. 1996. 2012. 1994. 2008. 2003.\n", + " 2006. 1993. 1998. 2015. 2011. 2000. 1999. 2009. 2010. 2002. 2014. 2001.\n", + " 2013. 2016.]\n", + "Attribute EINGEFUEGT_AM has 5163 values\n", + "Before change:\n", + " LNR\n", + "910215 NaN\n", + "910220 1992-02-10 00:00:00\n", + "910225 1992-02-12 00:00:00\n", + "910226 1997-04-21 00:00:00\n", + "910241 1992-02-12 00:00:00\n", + "Name: EINGEFUEGT_AM, dtype: object\n", + "After change:\n", + " LNR\n", + "910215 NaN\n", + "910220 1992.0\n", + "910225 1992.0\n", + "910226 1997.0\n", + "910241 1992.0\n", + "Name: EINGEFUEGT_AM, dtype: float64\n", + "We replaced dataframe[attributeName] with only the year values: [ nan 1992. 1997. 2005. 2009. 1995. 1996. 2002. 2015. 2004. 2000. 2008.\n", + " 1994. 1993. 2003. 2014. 2016. 2007. 1999. 2010. 2001. 1998. 2006. 2013.\n", + " 2012. 2011. 1991.]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9VxkIUCkDOnj", + "colab_type": "text" + }, + "source": [ + "#### Choosing between the FEIN and GROB data\n", + "\n", + "The attribute descriptions show that many attributes come as both a fine-grained (FEIN) and a coarse (GROB) feature. In this early stage of the experiment we decided to use the coarse features." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "pwq9p61ED9Fj", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + }, + "outputId": "90f8f78a-7890-416e-c72e-11d294317725" + }, + "source": [ + "print(azdias['LP_FAMILIE_FEIN'].unique())\n", + "print(azdias['LP_FAMILIE_GROB'].unique())" + ], + "execution_count": 23, + "outputs": [ + { + "output_type": "stream", + "text": [ + "[ 2. 5. 1. 0. 10. 7. 11. 3. 8. 4. 6. nan 9.]\n", + "[ 2. 3. 1. 0. 5. 4. 
nan]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tcmDXzAYGlb-", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "2f70f1e0-0c52-4b23-c6dd-78b15bfec618" + }, + "source": [ + "for attr in attributes_in_attributesDf:\n", + "    if attr.endswith('_FEIN'):\n", + "        columns_to_drop.add(attr)\n", + "\n", + "print(f'All _FEIN columns will be delete {columns_to_drop}')\n", + "azdias.drop(columns=columns_to_drop, axis=1, errors='ignore', inplace=True)\n", + "customers.drop(columns=columns_to_drop, axis=1, errors='ignore', inplace=True)\n", + "columns_to_drop.clear()" + ], + "execution_count": 24, + "outputs": [ + { + "output_type": "stream", + "text": [ + "All _FEIN columns will be delete {'LP_STATUS_FEIN', 'LP_LEBENSPHASE_FEIN', 'LP_FAMILIE_FEIN'}\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "-JClQmtOUZ_k", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def findObjectAttributs(dataframe):\n", + "    '''\n", + "    find which column in dataframe has object as dtype. 
\n", + "    Args:\n", + "        dataframe {DataFrame} -- customers or azdias\n", + "    Returns:\n", + "        None -- each object-typed column is printed with its unique values\n", + "    '''\n", + "    object_columns = set()\n", + "    for (columnName, columnData) in dataframe.items():\n", + "        attr_unique_values = columnData.unique()\n", + "        if columnData.dtypes == \"object\":\n", + "            object_columns.add(columnName)\n", + "            print(f'{columnName} has value {attr_unique_values}')\n", + "    # return object_columns\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "GQxQ92n3eMOI", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + }, + "outputId": "d7a7c24d-c8ab-469a-9963-c5e915e1270c" + }, + "source": [ + "print('== findObjectAttributs(azdias)==')\n", + "findObjectAttributs(azdias)\n", + "print('== findObjectAttributs(customers)==')\n", + "findObjectAttributs(customers)" + ], + "execution_count": 26, + "outputs": [ + { + "output_type": "stream", + "text": [ + "== findObjectAttributs(azdias)==\n", + "CAMEO_DEU_2015 has value [nan '8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D'\n", + " '9E' '9B' '1B' '3D' '4E' '4B' '3C' '5A' '7B' '9A' '6D' '6E' '2C' '7C'\n", + " '9C' '7D' '5E' '1D' '8D' '6C' '6A' '5B' '4D' '3A' '2B' '7E' '3B' '6F'\n", + " '5F' '1C']\n", + "OST_WEST_KZ has value [nan 'W' 'O']\n", + "== findObjectAttributs(customers)==\n", + "CAMEO_DEU_2015 has value ['1A' nan '5D' '4C' '7B' '3B' '1D' '9E' '2D' '4A' '6B' '9D' '8B' '5C' '9C'\n", + " '4E' '6C' '8C' '8A' '5B' '9B' '3D' '2A' '3C' '5F' '7A' '1E' '2C' '7C'\n", + " '5A' '2B' '6D' '7E' '5E' '6E' '3A' '9A' '4B' '1C' '1B' '6A' '8D' '7D'\n", + " '6F' '4D']\n", + "OST_WEST_KZ has value ['W' nan 'O']\n", + "PRODUCT_GROUP has value ['COSMETIC_AND_FOOD' 'FOOD' 'COSMETIC']\n", + "CUSTOMER_GROUP has value ['MULTI_BUYER' 'SINGLE_BUYER']\n" + ], + "name": 
"stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Fk7EZdileWiP", + "colab_type": "code", + "colab": {} + }, + "source": [ + "def findAttributesWithMuchValues(dataframe, valueSizeLimit = 10):\n", + " '''\n", + " find attributes/columns in dataframe, they have an oversize of the values. \n", + "\n", + " Args:\n", + " dataframe {DataFrame} -- it could be customers and azdias\n", + " valueSizeLimit {int} -- the limitation of value size you want to check, \n", + " default 10\n", + " Returns:\n", + " {set} -- all attributs, in dataframe has more than 10 values\n", + " '''\n", + " value_oversize_columns = set()\n", + " for (columnName, columnData) in dataframe.iteritems():\n", + " attr_unique_values = columnData.unique()\n", + " if attr_unique_values.size >= valueSizeLimit: \n", + " value_oversize_columns.add(columnName)\n", + " print(f'{columnName} has value {attr_unique_values}')\n", + " #return value_oversize_columns" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5_6ucBa4U42h", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 935 + }, + "outputId": "b5e1a3bc-a792-40cd-9132-df131f5c0822" + }, + "source": [ + "findAttributesWithMuchValues(customers)" + ], + "execution_count": 28, + "outputs": [ + { + "output_type": "stream", + "text": [ + "AKT_DAT_KL has value [ 1. 9. 3. 7. 5. 2. nan 4. 6. 8.]\n", + "ALTER_HH has value [10. 11. 6. 8. 20. 5. 14. 21. 15. 17. 0. 19. 9. 12. 13. nan 18. 7.\n", + " 16. 4. 2. 3.]\n", + "ALTER_KIND1 has value [nan 8. 12. 9. 7. 13. 17. 14. 18. 11. 16. 6. 10. 15. 5. 3. 4. 2.]\n", + "ALTER_KIND2 has value [nan 9. 17. 10. 14. 13. 12. 11. 16. 18. 15. 7. 5. 8. 6. 3. 2. 4.]\n", + "ALTER_KIND3 has value [nan 13. 16. 18. 15. 17. 14. 12. 11. 10. 8. 7. 9. 6. 5.]\n", + "ALTER_KIND4 has value [nan 18. 12. 16. 13. 17. 11. 14. 15. 10. 8.]\n", + "ALTERSKATEGORIE_FEIN has value [10. nan 0. 8. 14. 9. 
4. 13. 6. 12. 19. 17. 15. 11. 16. 7. 18. 21.\n", + " 25. 20. 24. 5. 2. 22. 3. 23.]\n", + "ANZ_HH_TITEL has value [ 0. nan 2. 4. 1. 13. 6. 5. 3. 20. 9. 11. 8. 14. 18. 23. 7. 12.\n", + " 17. 15. 10.]\n", + "ANZ_KINDER has value [ 0. 1. nan 3. 2. 4. 5. 6. 8. 7.]\n", + "ANZ_PERSONEN has value [ 2. 3. 1. 0. 4. 5. 6. nan 8. 7. 9. 12. 11. 10. 14. 16. 15. 21.\n", + " 13.]\n", + "CAMEO_DEU_2015 has value ['1A' nan '5D' '4C' '7B' '3B' '1D' '9E' '2D' '4A' '6B' '9D' '8B' '5C' '9C'\n", + " '4E' '6C' '8C' '8A' '5B' '9B' '3D' '2A' '3C' '5F' '7A' '1E' '2C' '7C'\n", + " '5A' '2B' '6D' '7E' '5E' '6E' '3A' '9A' '4B' '1C' '1B' '6A' '8D' '7D'\n", + " '6F' '4D']\n", + "CAMEO_DEUG_2015 has value [ 1. nan 5. 4. 7. 3. 9. 2. 6. 8.]\n", + "CAMEO_INTL_2015 has value [ 1. nan 3. 2. 4. 5. 45. 25. 55. 51. 14. 54. 43. 22. 15. 24. 35. 23.\n", + " 12. 44. 41. 52. 31. 13. 34. 32. 33.]\n", + "D19_BANKEN_DATUM has value [10 6 3 8 7 5 2 9 1 4]\n", + "D19_BANKEN_OFFLINE_DATUM has value [10 8 5 9 2 6 3 1 4 7]\n", + "D19_BANKEN_ONLINE_DATUM has value [10 7 3 5 9 4 8 6 1 2]\n", + "D19_BANKEN_ONLINE_QUOTE_12 has value [ 0. 10. nan 8. 5. 7. 6. 3. 2. 9. 4.]\n", + "D19_GESAMT_DATUM has value [ 9 6 10 1 5 4 2 3 8 7]\n", + "D19_GESAMT_OFFLINE_DATUM has value [ 9 10 6 8 5 4 3 7 1 2]\n", + "D19_GESAMT_ONLINE_DATUM has value [10 9 1 5 4 2 3 6 8 7]\n", + "D19_GESAMT_ONLINE_QUOTE_12 has value [ 0. 10. 7. 6. nan 8. 5. 9. 3. 4. 2. 1.]\n", + "D19_TELKO_DATUM has value [10 7 9 8 5 6 2 1 3 4]\n", + "D19_TELKO_OFFLINE_DATUM has value [10 9 8 5 1 6 4 3 7 2]\n", + "D19_TELKO_ONLINE_DATUM has value [10 9 5 8 6 7 4 2 1 3]\n", + "D19_VERSAND_DATUM has value [ 9 10 6 1 5 8 4 2 3 7]\n", + "D19_VERSAND_OFFLINE_DATUM has value [ 9 10 6 8 5 3 4 1 7 2]\n", + "D19_VERSAND_ONLINE_DATUM has value [10 9 1 5 4 2 8 6 3 7]\n", + "D19_VERSAND_ONLINE_QUOTE_12 has value [ 0. 10. 7. 6. 8. nan 5. 9. 3. 2. 1. 
4.]\n", + "D19_VERSI_DATUM has value [10 9 4 5 1 8 2 7 6 3]\n", + "D19_VERSI_OFFLINE_DATUM has value [10 4 5 9 8 7 6 3 2 1]\n", + "D19_VERSI_ONLINE_DATUM has value [10 5 6 8 9 4 7 2 1 3]\n", + "EINGEFUEGT_AM has value [1992. nan 2004. 1997. 1995. 2007. 2005. 1996. 2012. 1994. 2008. 2003.\n", + " 2006. 1993. 1998. 2015. 2011. 2000. 1999. 2009. 2010. 2002. 2014. 2001.\n", + " 2013. 2016.]\n", + "GFK_URLAUBERTYP has value [ 4. nan 3. 10. 2. 11. 8. 1. 5. 9. 12. 7. 6.]\n", + "KBA13_ANZAHL_PKW has value [1201. nan 433. ... 77. 13. 34.]\n", + "LP_LEBENSPHASE_GROB has value [ 5. nan 3. 0. 10. 2. 8. 12. 11. 1. 4. 6. 7. 9.]\n", + "MIN_GEBAEUDEJAHR has value [1992. nan 1994. 1997. 1995. 1996. 2011. 2008. 1991. 1990. 2007. 2005.\n", + " 1993. 2001. 1999. 2003. 1998. 2000. 2015. 2006. 2004. 1989. 2010. 2002.\n", + " 2014. 2012. 2009. 2013. 1987. 1988. 1986. 1985. 2016.]\n", + "ORTSGR_KLS9 has value [ 2. nan 5. 3. 7. 4. 8. 6. 1. 9.]\n", + "PRAEGENDE_JUGENDJAHRE has value [ 4 0 1 8 9 6 15 14 11 5 3 2 10 13 12 7]\n", + "VK_DHT4A has value [ 5. 6. 10. 3. 1. 8. 4. 2. 9. 7. nan 11.]\n", + "VK_DISTANZ has value [ 3. 6. 13. 4. 5. 2. 11. 8. 7. 1. 10. nan 9. 12.]\n", + "VK_ZG11 has value [ 2. 3. 11. 4. 1. 9. 7. 5. 6. nan 8. 10.]\n", + "WOHNDAUER_2008 has value [ 9. 3. 7. 8. 5. nan 6. 4. 2. 1.]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AqiRhzfrfMjQ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 935 + }, + "outputId": "b7493d81-438d-4b51-b1ee-0b9e5b9303c5" + }, + "source": [ + "findAttributesWithMuchValues(azdias)" + ], + "execution_count": 29, + "outputs": [ + { + "output_type": "stream", + "text": [ + "AKT_DAT_KL has value [nan 9. 1. 5. 8. 7. 6. 4. 3. 2.]\n", + "ALTER_HH has value [nan 0. 17. 13. 20. 10. 14. 16. 21. 11. 19. 15. 9. 18. 8. 7. 12. 4.\n", + " 3. 6. 5. 2. 1.]\n", + "ALTER_KIND1 has value [nan 17. 10. 18. 13. 16. 11. 6. 8. 9. 15. 14. 7. 12. 
4. 3. 5. 2.]\n", + "ALTER_KIND2 has value [nan 13. 8. 12. 10. 7. 16. 15. 14. 17. 5. 9. 18. 11. 6. 4. 3. 2.]\n", + "ALTER_KIND3 has value [nan 10. 18. 17. 16. 8. 15. 9. 12. 13. 14. 11. 7. 5. 6. 4.]\n", + "ALTER_KIND4 has value [nan 10. 9. 16. 14. 13. 11. 18. 17. 15. 8. 12. 7.]\n", + "ALTERSKATEGORIE_FEIN has value [nan 21. 17. 13. 14. 10. 16. 20. 11. 19. 15. 18. 9. 22. 12. 0. 8. 7.\n", + " 23. 4. 24. 6. 3. 2. 5. 25. 1.]\n", + "ANZ_HH_TITEL has value [nan 0. 1. 5. 2. 3. 7. 4. 6. 9. 15. 14. 8. 11. 10. 12. 13. 20.\n", + " 16. 17. 23. 18.]\n", + "ANZ_KINDER has value [nan 0. 1. 2. 3. 4. 5. 6. 9. 7. 11. 8.]\n", + "ANZ_PERSONEN has value [nan 2. 1. 0. 4. 3. 5. 7. 6. 8. 12. 9. 21. 10. 13. 11. 14. 45.\n", + " 20. 31. 29. 37. 16. 22. 15. 23. 18. 35. 17. 40. 38.]\n", + "CAMEO_DEU_2015 has value [nan '8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D'\n", + " '9E' '9B' '1B' '3D' '4E' '4B' '3C' '5A' '7B' '9A' '6D' '6E' '2C' '7C'\n", + " '9C' '7D' '5E' '1D' '8D' '6C' '6A' '5B' '4D' '3A' '2B' '7E' '3B' '6F'\n", + " '5F' '1C']\n", + "CAMEO_DEUG_2015 has value [nan 8. 4. 2. 6. 1. 9. 5. 7. 3.]\n", + "CAMEO_INTL_2015 has value [nan 5. 2. 1. 4. 3. 22. 24. 41. 12. 54. 51. 44. 35. 23. 25. 14. 34.\n", + " 52. 55. 31. 32. 15. 13. 43. 33. 45.]\n", + "D19_BANKEN_DATUM has value [10 5 8 6 9 1 7 4 2 3]\n", + "D19_BANKEN_OFFLINE_DATUM has value [10 9 8 2 5 4 1 6 7 3]\n", + "D19_BANKEN_ONLINE_DATUM has value [10 5 8 6 9 1 4 7 2 3]\n", + "D19_BANKEN_ONLINE_QUOTE_12 has value [nan 0. 10. 8. 5. 9. 7. 6. 3. 4. 2. 1.]\n", + "D19_GESAMT_DATUM has value [10 1 3 5 9 4 7 6 8 2]\n", + "D19_GESAMT_OFFLINE_DATUM has value [10 6 8 9 5 2 4 1 7 3]\n", + "D19_GESAMT_ONLINE_DATUM has value [10 1 3 5 9 4 7 6 8 2]\n", + "D19_GESAMT_ONLINE_QUOTE_12 has value [nan 0. 10. 7. 9. 5. 8. 6. 3. 4. 2. 
1.]\n", + "D19_TELKO_DATUM has value [10 6 9 8 7 5 4 2 1 3]\n", + "D19_TELKO_OFFLINE_DATUM has value [10 8 9 5 6 7 4 2 3 1]\n", + "D19_TELKO_ONLINE_DATUM has value [10 9 7 8 6 5 4 1 2 3]\n", + "D19_VERSAND_DATUM has value [10 1 5 9 4 8 7 6 3 2]\n", + "D19_VERSAND_OFFLINE_DATUM has value [10 9 6 8 5 2 1 4 7 3]\n", + "D19_VERSAND_ONLINE_DATUM has value [10 1 5 9 4 8 7 6 3 2]\n", + "D19_VERSAND_ONLINE_QUOTE_12 has value [nan 0. 10. 7. 5. 9. 3. 8. 6. 4. 2. 1.]\n", + "D19_VERSI_DATUM has value [10 2 8 9 6 7 5 1 4 3]\n", + "D19_VERSI_OFFLINE_DATUM has value [10 7 9 6 4 8 5 2 3 1]\n", + "D19_VERSI_ONLINE_DATUM has value [10 8 9 5 6 7 4 1 2 3]\n", + "EINGEFUEGT_AM has value [ nan 1992. 1997. 2005. 2009. 1995. 1996. 2002. 2015. 2004. 2000. 2008.\n", + " 1994. 1993. 2003. 2014. 2016. 2007. 1999. 2010. 2001. 1998. 2006. 2013.\n", + " 2012. 2011. 1991.]\n", + "GFK_URLAUBERTYP has value [10. 1. 5. 12. 9. 3. 8. 11. 4. 2. 7. 6. nan]\n", + "KBA13_ANZAHL_PKW has value [ nan 963. 712. ... 2. 30. 7.]\n", + "LP_LEBENSPHASE_GROB has value [ 4. 6. 1. 0. 10. 2. 3. 5. 7. 12. 11. 9. 8. nan]\n", + "MIN_GEBAEUDEJAHR has value [ nan 1992. 1997. 2005. 2009. 1994. 1996. 2002. 2015. 1991. 1993. 1995.\n", + " 2003. 2014. 2008. 2006. 2000. 1990. 2004. 1999. 1998. 2001. 2007. 2013.\n", + " 1989. 2011. 2012. 2010. 1987. 1988. 1985. 2016. 1986.]\n", + "ORTSGR_KLS9 has value [nan 5. 3. 6. 4. 8. 2. 7. 9. 1. 0.]\n", + "PRAEGENDE_JUGENDJAHRE has value [ 0 14 15 8 3 10 11 5 9 6 4 2 1 12 13 7]\n", + "VK_DHT4A has value [nan 8. 9. 7. 3. 10. 1. 6. 4. 2. 5. 11.]\n", + "VK_DISTANZ has value [nan 11. 9. 10. 5. 7. 12. 1. 6. 13. 8. 4. 3. 2.]\n", + "VK_ZG11 has value [nan 10. 6. 11. 4. 9. 8. 1. 3. 7. 5. 2.]\n", + "WOHNDAUER_2008 has value [nan 9. 8. 3. 4. 5. 6. 2. 7. 
1.]\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "j2sELMpeL1D3", + "colab_type": "code", + "colab": {} + }, + "source": [ + "azdias.replace(-1, value=np.nan, inplace=True)\n", + "azdias.astype(str)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QAjEPUawrhwV", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 450 + }, + "outputId": "2dd0b5d6-07fb-4841-e9f8-8478de2790af" + }, + "source": [ + "pd.get_dummies(azdias[['AGER_TYP', 'AKT_DAT_KL']], prefix_sep='__')" + ], + "execution_count": 36, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " AGER_TYP AKT_DAT_KL\n", + "LNR \n", + "910215 NaN NaN\n", + "910220 NaN 9.0\n", + "910225 NaN 9.0\n", + "910226 2.0 1.0\n", + "910241 NaN 1.0\n", + "... ... ...\n", + "825761 NaN 5.0\n", + "825771 NaN 9.0\n", + "825772 NaN 1.0\n", + "825776 NaN 9.0\n", + "825787 NaN 1.0\n", + "\n", + "[891221 rows x 2 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yKP9cikoafB3", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 130 + }, + "outputId": "44126282-733e-4be6-9f38-cc219094ce32" + }, + "source": [ + "azdias.join(pd.get_dummies(azdias['AGER_TYP'], prefix='AGER_TYP', prefix_sep='__'))\n", + "azdias.drop(['AGER_TYP'], axis=1, errors='ignore', inplace=True)\n", + "azdias['AGER_TYP__1']" + ], + "execution_count": 88, + "outputs": [ + { + "output_type": "error", + "ename": "SyntaxError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m3\u001b[0m\n\u001b[0;31m azdias.['AGER_TYP__1']\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aDx4DFIppLxS", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 698 + }, + "outputId": "aae8477a-1931-4ce4-d64f-955b4e153d89" + }, + "source": [ + "azdias['AGER_TYP__1']" + ], + "execution_count": 89, + "outputs": [ + { + "output_type": "error", + "ename": "KeyError", + "evalue": "ignored", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in 
\u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2896\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2897\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2898\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'AGER_TYP__1'", + "\nDuring handling of the above exception, another exception occurred:\n", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mazdias\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'AGER_TYP__1'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2993\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m 
\u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2994\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2995\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2996\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2997\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2897\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2898\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2899\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_maybe_cast_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2900\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtolerance\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtolerance\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mindexer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndim\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mindexer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msize\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mKeyError\u001b[0m: 'AGER_TYP__1'" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "w3uOqw0OXlfh", + "colab_type": "code", + "colab": {} + }, 
+ "source": [ + "for attr in attributesInValues:\n", + "    if attr not in attributesInAzidas:\n", + "        print(f'Attribute {attr} is not in attributes_df')" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "FGItsJUZQMJf", + "colab_type": "code", + "colab": {} + }, + "source": [ + "last_attribute_name = ''\n", + "possible_values = set()\n", + "\n", + "attribute_values = {}\n", + "for ind, row in values_df.iterrows():\n", + "\n", + "    # a non-empty 'Attribute' cell starts a new group of value rows\n", + "    if str(row['Attribute']) != 'nan' and last_attribute_name == '':\n", + "        last_attribute_name = row['Attribute']\n", + "    possible_values.add(str(row['Value']))\n", + "\n", + "    next_ind = ind + 1\n", + "    # close the group before the next attribute starts, and also on the last row\n", + "    if next_ind >= values_df.shape[0] or str(values_df.iloc[next_ind]['Attribute']) != 'nan':\n", + "        # print(last_attribute_name, possible_values)\n", + "        attribute_values[last_attribute_name] = possible_values\n", + "        last_attribute_name = ''\n", + "        possible_values = set()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "lMPfk49DNExJ", + "colab_type": "code", + "colab": {} + }, + "source": [ + "dummies = pd.get_dummies(azdias['MIN_GEBAEUDEJAHR'], prefix='MIN_GEBAEUDEJAHR', prefix_sep=\"__\")\n", + "dummies" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Xx3u3qTUUeJ", + "colab_type": "text" + }, + "source": [ + "## Part 1: Customer Segmentation Report\n", + "\n", + "This part should form the main body of your project report. Here you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and those of the general population of Germany. By the end of this part, you should be able to describe which groups in the general population are more likely to belong to the mail-order company's core customer base, and which groups are unlikely to." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DHQGzOvDUUeK", + "colab_type": "text" + }, + "source": [ + "## Part 2: Supervised Learning Model\n", + "\n", + "Now that you have found which groups of people are more likely to become customers of the mail-order company, it is time to build a prediction model. Each row of the \"MAILOUT\" data files represents a potential customer targeted by a mailout campaign. Ideally, we should be able to use each person's demographic data to decide whether to target that person in the campaign.\n", + "\n", + "The \"MAILOUT\" data has been split into two roughly equal parts, each with about 43,000 rows. In this part, you can validate your model on the \"TRAIN\" partition, which includes a \"RESPONSE\" column stating whether the person responded to the company's mailout campaign. In the next part, you will need to make predictions on the \"TEST\" partition, where the \"RESPONSE\" column has been withheld." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NH2I5NW7UUeM", + "colab_type": "code", + "colab": {} + }, + "source": [ + "mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BuljvWQoVDEU", + "colab_type": "code", + "colab": {} + }, + "source": [ + "mailout_train = pd.read_csv(drive_path+'Udacity_MAILOUT_052018_TRAIN.csv', sep=';')\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0HVb5z4lVJbO", + "colab_type": "code", + "colab": {} + }, + "source": [ + "mailout_train.head()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zP7KcZ79UUep", + "colab_type": "text" + }, + "source": [ + "## Part 3: Kaggle Competition\n", + "\n", + "Now that you have built a model to predict how likely people are to respond to a mailout campaign, it is time to test that model on Kaggle. If you click on this [link](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you will be taken to the competition page (provided you already have a Kaggle account). If you do particularly well, you may receive an interview invitation from a hiring manager at Arvato or Bertelsmann!\n", + "\n", + "Your competition entry is a CSV file with two columns. The first column, \"LNR\", is the ID of each person in the \"TEST\" partition. The second column, \"RESPONSE\", is some measure of how likely that person is to respond to the campaign; it does not have to be a probability. As you should have found in Part 2, the dataset has a large output class imbalance: most people did not respond to the mailout. Predicting a class for each person and scoring by accuracy is therefore not an appropriate way to evaluate performance. Instead, the competition is scored by AUC (area under the ROC curve). The absolute values in the \"RESPONSE\" column do not matter; what matters is that higher values go to the people most likely to respond, so that actual responders are captured early in the ROC curve sweep." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "cOx-WrtnUUeq", + "colab_type": "code", + "colab": {} + }, + "source": [ + "mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "edl5iRH3UUeu", + "colab_type": "text" + }, + "source": [ + "```python\n", 
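+ "# A minimal sketch of the submission step described in Part 3, not a definitive implementation.\n",
+ "# Assumptions (hypothetical, not defined in this notebook): `scores` holds one score per row of\n",
+ "# mailout_test, for example from a fitted classifier; `make_submission` and the file name\n",
+ "# 'submission.csv' are illustrative names only.\n",
+ "import pandas as pd\n",
+ "\n",
+ "def make_submission(lnr, scores, path='submission.csv'):\n",
+ "    '''Write the two-column CSV (LNR, RESPONSE) and return it as a DataFrame.'''\n",
+ "    submission = pd.DataFrame({'LNR': list(lnr), 'RESPONSE': list(scores)})\n",
+ "    submission.to_csv(path, index=False)\n",
+ "    return submission\n",
+ "\n",
+ "# Hypothetical usage once a model and a cleaned feature matrix X_test exist:\n",
+ "# make_submission(mailout_test['LNR'], model.predict_proba(X_test)[:, 1])\n",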
+ "\n", + "```" + ] + } + ] +} \ No newline at end of file