Section1 Exercise1 Perform Data Engineering Tasks
Exercise
Perform data engineering tasks
Section 1 Exercise 1
02/2020
Spatial Data Science MOOC
Time to complete
90 minutes
Introduction
Data engineering is a fundamental part of every analysis. The term refers to the planning,
preparation, and processing of data to make it more useful for analysis. It can include simple
tasks like identifying and correcting imperfections in your data and calculating new fields. It
can also include more complex tasks like reducing the dimensions of a multivariate dataset.
Data engineering also involves the process of geoenriching your data. Geoenrichment can
include various tasks:
In this exercise, you will use ArcGIS Pro and ArcGIS Notebooks to perform data engineering
tasks. These tasks will use the built-in tools available with these products as well as tools
available by integrating open source libraries.
Exercise scenario
Because voting is voluntary in the United States, the level of voter participation (referred to as
"voter turnout") has a significant impact on the election results and resulting public policy.
Modeling voter turnout, and understanding where low turnout is prevalent, can inform
outreach efforts to increase voter participation. With the ultimate goal of predicting voter
turnout, this exercise will focus on performing various data engineering tasks to prepare
election result data for predictive analysis.
c Extract the files to a folder on your local computer, saving them in a location that you will
remember.
d If your computer does not meet these requirements, use the provided links to complete
the recommended updates, and then run the test again.
d Under My Esri / My Organizations / Overview, confirm that MOOC Program is the listed
organization.
Note: If you do not see Downloads, click the My Profile tab, and then under Connected
Organizations, select MOOC Program.
Note: If you do not see Downloads, you may be signed in to My Esri with the wrong account.
Sign out of the site and sign back in with your ArcGIS account.
Note: You can run ArcGIS Pro in a different language by installing a language pack. Keep in
mind that this course is taught in English, which means that all screen shots and exercises will
use the English version of ArcGIS Pro.
g From the Download Components tab, to the right of ArcGIS Pro, click Download.
If the default download location does not have enough space, you can change the location by
following the steps in this link.
a Near the top of this page, under Spatial Data Science: The New Frontier in Analytics, click
Lessons.
The course ArcGIS account user name and password are listed under Lessons. The user name
for this account ends with _sds (for example, jdoe_sds). You may want to write down the user
name and password for quick reference.
Note: If you registered in the last few hours, your account may not be ready. Refresh the page
in an hour or so to determine if your account is available.
b In the Open Project dialog box, browse to the Data Engineering and Visualization folder
that you saved on your computer.
d Click OK.
Your ArcGIS Pro project opens to a gray reference map, called a basemap. You can zoom and
pan this map to different areas of the world. Because you are preparing United States election
data, it is currently focused on the contiguous United States.
To the left of the map is the Contents pane; the Contents pane lists the layers that have been
added to the map. To the right of the map is the Catalog pane; the Catalog pane lists the
items associated with this ArcGIS Pro package—Maps, Toolboxes, Notebooks, Databases,
Styles, Folders, and Locations. To learn more about the ArcGIS Pro interface, see ArcGIS Pro
Help: ArcGIS Pro user interface, and to learn more about ArcGIS Pro projects, see ArcGIS Pro
Help: Projects in ArcGIS Pro.
A notebook will open in the ArcGIS Pro project. The first few cells in this notebook are
markdown cells used to explain the exercise.
a In the notebook, double-click the first markdown cell titled Data Engineering.
Markdown cells use hashtags (the # character) to determine the size and format of the explanatory text.
c Add a space between the hashtag and the text Data Engineering.
The text font style and size change to make it appear more like a heading.
Note: Adding more hashtags will decrease the size of the font. If you are familiar with
HTML, you can think of this as switching between heading tags (<h1>, <h2>, <h3>). Be sure to
maintain a space between the hashtag and your text; otherwise, the line will
appear as regular text.
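The heading rule in the note above can be sketched in Python. This is a simplified illustration only, not the renderer that ArcGIS Notebooks uses; the function name markdown_heading_to_html is hypothetical.

```python
import re

def markdown_heading_to_html(line):
    """Convert one markdown line to HTML, treating leading # marks as a heading level."""
    # One to six # characters, then at least one space, then the heading text.
    match = re.match(r"^(#{1,6})\s+(.*)$", line)
    if match is None:
        # No space after the # marks: the line stays regular text.
        return "<p>" + line + "</p>"
    level = len(match.group(1))
    text = match.group(2).strip()
    return "<h{0}>{1}</h{0}>".format(level, text)

print(markdown_heading_to_html("# Data Engineering"))   # <h1>Data Engineering</h1>
print(markdown_heading_to_html("## Data Engineering"))  # <h2>Data Engineering</h2>
print(markdown_heading_to_html("#Data Engineering"))    # <p>#Data Engineering</p>
```

Each extra hashtag moves the text one heading level down, and a missing space leaves the line as regular text.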
Running a markdown cell will apply the formatting that you have indicated in the cell.
Similarly, running a code cell will execute the code that you have written in the cell.
b From the ArcGIS Notebook Toolbar, click the Insert Cell Below button.
A code cell is added under the markdown cell. You will use this cell to import the Python
modules required to complete this exercise.
• arcgis
• pandas
• os
• arcpy
This code cell will call the modules from the ArcGIS Pro conda environment. To the left of the
code cell is blue text with brackets. When you run a code cell, an asterisk appears in the
brackets to indicate that the cell is running. When the cell is complete, the asterisk is replaced
with a number.
The number 1 appears in the brackets to indicate that the cell has been executed, which
means that the modules were successfully loaded.
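If the import cell raises an error instead of showing a number, one possible troubleshooting step is to check which of the four modules can be found in the active Python environment. The sketch below is an optional check, not part of the exercise notebook; the variable names are hypothetical.

```python
import importlib.util

# The four modules that the exercise imports.
required_modules = ["arcgis", "pandas", "os", "arcpy"]

# find_spec returns None when a top-level module cannot be located.
missing_modules = [
    name for name in required_modules
    if importlib.util.find_spec(name) is None
]

if missing_modules:
    print("Not found in this environment:", ", ".join(missing_modules))
else:
    print("All required modules are available.")
```

Outside the ArcGIS Pro conda environment, arcpy and arcgis will typically be reported as not found.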
You will use the pandas module quite often in this exercise. Instead of typing pandas each
time, you will shorten pandas to pd.
e Modify the line of code that says import pandas to say import pandas as pd.
You used pd as a variable. A variable is a name that references an object. The object could be
a dataset or, in this case, a Python module. You could have shortened pandas to any variable
name. You used pd because it is the most common local name for pandas. The remaining
code cells will use pd when using pandas functionality.
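As a small illustration of the point above, the following sketch shows that an alias is simply another name for the same module object. The alias any_other_name is hypothetical and used only for demonstration.

```python
import pandas
import pandas as pd
import pandas as any_other_name  # hypothetical alias, for illustration only

# Every name refers to the same module object.
print(pd is pandas)
print(any_other_name is pandas)
```

Both comparisons are true, which is why pd.read_csv and pandas.read_csv call the same function.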
b From the ArcGIS Notebook Toolbar, click the Insert Cell Below button.
By defining this variable, you can use table_csv_path throughout the script to refer to the
county election dataset (countypres2016.csv).
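A minimal sketch of defining such a path variable follows. The folder location is an assumption (the exercise files may be stored elsewhere on your computer), and data_folder is a hypothetical helper name; only table_csv_path and countypres2016.csv come from the exercise.

```python
import os

# Assumed location: replace with the folder where you extracted the exercise files.
data_folder = os.path.join(os.path.expanduser("~"), "Data Engineering and Visualization")

# Join the folder and file name so the path separator matches the operating system.
table_csv_path = os.path.join(data_folder, "countypres2016.csv")

print(table_csv_path)
# Confirm that the file is where you expect before reading it.
print(os.path.exists(table_csv_path))
```

Building the path with os.path.join avoids hand-typing separators, which is a common source of errors in Windows paths.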
You want to specify that the FIPS attribute field in this data frame will be a text, or string, value.
You will use the dtype parameter to specify this field type.
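One likely reason to read FIPS as text is that these codes can begin with a zero, which is lost if the column is parsed as a number. The sketch below illustrates the dtype parameter using a few made-up rows in place of countypres2016.csv; only the FIPS column name comes from the exercise, and the county column and its values are placeholders.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "FIPS,county\n"
    "01001,County A\n"
    "48201,County B\n"
)

# Read FIPS as a string so that leading zeros are preserved.
df_as_text = pd.read_csv(sample_csv, dtype={"FIPS": str})

# Read the same rows again without dtype for comparison.
sample_csv.seek(0)
df_default = pd.read_csv(sample_csv)

print(df_as_text["FIPS"].tolist())  # ['01001', '48201']
print(df_default["FIPS"].tolist())  # [1001, 48201]
```

Without dtype, the first code loses its leading zero and would no longer match a text FIPS field in another table.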
You created a data frame for the county elections dataset that you will use to prepare,
reformat, and geoenable your data.
k In ArcGIS Pro, from the Notebook tab, in the Notebook group, click Save.
l Execute the rest of the notebook and review each step as you execute each cell.
You must execute each cell in the notebook before proceeding to the next step.
Note: Although you are not writing all the Python code, it is recommended that you carefully
look at the Python syntax and logic in each cell. Reviewing each cell can help you to
familiarize yourself with the ArcGIS Notebook interface and learn Python syntax. The
notebook can also act as sample code that you can reference for data engineering tasks.
The Geoprocessing pane opens. The Geoprocessing pane is used to browse or search for
geoprocessing tools available with ArcGIS Pro.
e Click Enrich.
The Enrich tool opens. In the Geoprocessing pane, the Enrich tool lists the parameters
required to run the tool. Parameters define the values used to run the tool and its underlying
algorithms. To run the Enrich tool, you will need to define the input feature class, a name for
the output feature class, and the variables that will be added to the output feature class.
The Enrich tool uses credits. To learn more about credits, see ArcGIS Pro Help: Understand
credits.
b In the Input Features dialog box, double-click Data Engineering and Visualization.gdb.
Note: This parameter represents a file path that leads to the ArcGIS Pro project's file
geodatabase (Data Engineering and Visualization.gdb). Ensure that the output name that you
enter is at the end of this pathname to indicate that the feature class should be stored inside
the geodatabase.
Esri provides various demographic variables that you can add to your data. You can also add
variables that you created or that were shared with you.
f In the Add Variable dialog box, in the search field, type 2019 Median Age and press
Enter.
g If necessary, expand 2019 Age: 5 Year Increments (Esri) and click 2019 Median Age.
To the right of 2019 Median Age are a hashtag and the word Index. These icons, along with a
percent sign icon, are used to specify if you want a total count (hashtag), index, or percentage
(percent sign) of the variable.
j Click OK.
k Click Run.
Note: It may take a few minutes for this tool to run.
The attribute table includes the fields added in the initial data engineering steps as well as the
fields added using the Enrich tool.
By completing various data engineering tasks, you cleaned and prepared the election
data. Geoenabling and geoenriching the data provides demographic variables that you can
use to model or predict voter turnout. You will use various visualization techniques to explore
relationships between voter turnout and these variables. You will use this information to
identify potential variables to use in your prediction model.
n If you would like to perform additional data engineering tasks, proceed to the optional
stretch goal; otherwise, save the project and exit ArcGIS Pro.
If you would like to continue engineering your data, you can modify the ArcGIS Notebook to
include the following tasks:
1. Identify and remove records with null candidatevotes values in the election data.
2. Apply a symbology layer (default.lyrx) to the 2016 election turnout feature class
(out_2016_fc_name).
The default.lyrx file is located in the Data Engineering and Visualization folder. ArcGIS Pro
Help: Apply Symbology From Layer (Data Management) describes the process of applying
a symbology layer and includes syntax to use in your script.
3. Determine how to incorporate Alaska into this analysis.
Hint: Alaska does not have counties. Research its administrative and political subdivisions
to determine how the data would need to be engineered to address this issue.
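For the first stretch task, one possible approach is sketched below with pandas. The rows are made up for illustration; only the candidatevotes column name comes from the exercise, and election_df and cleaned_df are hypothetical variable names rather than names from the notebook.

```python
import pandas as pd

# Made-up sample rows; replace with the data frame created in the notebook.
election_df = pd.DataFrame({
    "FIPS": ["00001", "00002", "00003"],
    "candidatevotes": [100.0, None, 250.0],
})

# Identify the records with a null candidatevotes value.
null_mask = election_df["candidatevotes"].isnull()
print(int(null_mask.sum()))

# Remove those records and renumber the remaining rows.
cleaned_df = election_df.dropna(subset=["candidatevotes"]).reset_index(drop=True)
print(len(cleaned_df))
```

Counting the null records before dropping them makes it easier to confirm that only the expected rows were removed.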
Use the Lesson Forum to post your questions, observations, and syntax examples. Be sure to
include the #stretch hashtag in the posting title.