Week13 2 Data Analysis 2
and Visualizing
15-110 – Wednesday 4/15
Learning Goals
• Perform basic analyses on data to answer simple questions
Last Time
Last week, we discussed the data analysis process, and went over
several methods for reading, representing, and organizing data.
This time, we'll talk more about what we can do with that data once
we've processed it.
Analysis
Basic Data Analyses – Mean/Median/Mode
There are many basic analyses we can run on features in data to get a
sense of what the data means. You've learned about some of them
already in math or statistics classes.
Median: sorted(lst)[len(lst) // 2] (exact when len(lst) is odd; for an even-length list, average the two middle values)
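As a minimal sketch of all three statistics on a plain list of numbers, the functions below compute the mean, median, and mode by hand. The function names and the sample list are illustrative choices, not part of the course code. The median handles even-length lists by averaging the two middle values.

```python
def mean(lst):
    # Average: the sum of the values divided by how many there are
    return sum(lst) / len(lst)

def median(lst):
    # Middle value of the sorted list
    ordered = sorted(lst)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Even length: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

def mode(lst):
    # Most common value: count each value, then pick the largest count
    counts = { }
    for value in lst:
        if value not in counts:
            counts[value] = 0
        counts[value] += 1
    best = None
    for value in counts:
        if best is None or counts[value] > counts[best]:
            best = value
    return best

scores = [ 4, 1, 3, 3, 9, 4, 3, 5 ]  # made-up sample data
print(mean(scores), median(scores), mode(scores))
```

If several values tie for most common, this version of mode returns the one that appears first in the list.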
Calculating Probabilities
You'll also often want to calculate probabilities based on your data.
In general, the probability that a certain kind of data point occurs in a dataset is the
number of times it occurs, divided by the total number of data points.
count = 0
for i in range(len(data)):
    if meetsCondition1(data[i]) and meetsCondition2(data[i]):
        count += 1
print(count / len(data))
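To make the pattern above concrete, here is a small sketch with a single condition passed in as a function. The names probability, isLongDay, and hoursWorked are hypothetical, and the data is made up for illustration.

```python
def probability(data, condition):
    # Fraction of entries in data for which condition(entry) is True
    count = 0
    for entry in data:
        if condition(entry):
            count += 1
    return count / len(data)

def isLongDay(hours):
    # Example condition: more than 8 hours reported
    return hours > 8

hoursWorked = [ 5, 9, 12, 7, 8, 10, 6, 11 ]  # made-up sample data
print(probability(hoursWorked, isLongDay))
```

Four of the eight entries are greater than 8, so the printed probability is 0.5.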
Messy Data – Duplicates
You'll also sometimes need to clean up messy data to get a proper analysis.
Some of this is done in the data cleaning stage, but even cleaned data can
have problems.
One potential issue is duplicate data, when the same data entry is included
in the dataset multiple times. To detect duplicate data, check if your data has
a unique ID per data point; then you can count how often each ID occurs to
find the duplicates.
for dataId in set(dataIds): # check each distinct ID once
    if dataIds.count(dataId) > 1:
        print("Duplicate:", dataId)
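Calling count once per ID rescans the whole list each time, which gets slow on large datasets. One possible alternative, sketched here with a hypothetical findDuplicates function and made-up IDs, is to tally every ID in a dictionary in a single pass and then report the IDs seen more than once.

```python
def findDuplicates(dataIds):
    # Tally how often each ID occurs
    counts = { }
    for dataId in dataIds:
        if dataId not in counts:
            counts[dataId] = 0
        counts[dataId] += 1
    # Collect the IDs that occur more than once
    duplicates = [ ]
    for dataId in counts:
        if counts[dataId] > 1:
            duplicates.append(dataId)
    return duplicates

print(findDuplicates([ "a17", "b02", "a17", "c44", "b02", "a17" ]))
```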
Messy Data – Missing Values
Analyses can also run into problems when there are missing values in some data entries.
Data can be missing if some entries were not collected; this is especially likely in surveys
with optional questions.
for i in range(len(data)):
    if data[i] == "": # can also check for 'n/a' or 'none'
        print("Missing row:", i)
To deal with missing data, ask yourself: how crucial is that data? If it's an important part of
the analysis, all entries that are missing the needed data point should be removed, and the
final report should include how much data was thrown out.
If it's less important, you can substitute in an 'N/A' category for categorical data, or skip the
entry for numerical data. But be careful about how missing data affects the analysis.
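A rough sketch of the first strategy, removing entries that lack a crucial value and reporting how many were dropped, might look like the following. The function removeMissing, the list of strings treated as missing, and the survey rows are all assumptions made for illustration.

```python
def removeMissing(rows, col):
    # Keep only the rows whose value in column col is present
    missing = [ "", "n/a", "none" ]
    kept = [ ]
    for row in rows:
        if row[col].strip().lower() not in missing:
            kept.append(row)
    removed = len(rows) - len(kept)
    return kept, removed

# Made-up survey rows: [name, favorite flavor]
survey = [ ["Ann", "vanilla"], ["Bo", ""], ["Cy", "N/A"], ["Di", "mint"] ]
kept, removed = removeMissing(survey, 1)
print("Removed", removed, "of", len(survey), "entries")
```

Returning the number of removed rows alongside the cleaned data makes it easy to include that number in the final report.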
Messy Data – Outliers
Finally, be careful about how outliers can affect the results of data analysis.
Outliers are data points that are extremely different from the rest of the
dataset. For example, in a dataset of daily time reports, most people might
report 5-15 hours, but one person might report 100 hours.
The easiest way to detect outliers is to use visualizations, which we'll discuss
later in the lecture. Outliers should be removed from some calculations
(especially means) to avoid skewing the results. Be careful, though: some data
that looks like an outlier may be legitimate and may reveal important information.
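To see how much a single outlier can skew a mean, consider this small sketch based on the time report example. The data is made up, and the cutoff of 24 hours is an assumption chosen because a daily report cannot exceed it; a real analysis would need its own justified rule.

```python
def mean(lst):
    return sum(lst) / len(lst)

def removeAbove(lst, cutoff):
    # Keep only the values that are at most cutoff
    result = [ ]
    for value in lst:
        if value <= cutoff:
            result.append(value)
    return result

hours = [ 5, 8, 12, 15, 10, 100 ]  # made-up daily time reports
trimmed = removeAbove(hours, 24)   # assumed cutoff: hours in a day

print(mean(hours))    # mean with the outlier included
print(mean(trimmed))  # mean with the outlier removed
```

With the 100-hour report included the mean is 25.0, which is higher than every other report; without it the mean is 10.0.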
Other Common Statistics
Python has already implemented some of these statistics for you!
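One place to find them is the statistics module in Python's standard library. The short sketch below calls a few of its functions on a made-up list of scores.

```python
import statistics

scores = [ 4, 1, 3, 3, 9, 4, 3, 5 ]  # made-up sample data

print(statistics.mean(scores))    # average
print(statistics.median(scores))  # middle value (averages the middle two here)
print(statistics.mode(scores))    # most common value
print(statistics.stdev(scores))   # sample standard deviation
```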
Example: Analyzing Ice Cream Data
We've now cleaned the ice cream dataset from last week. Let's analyze the data to answer
this question: which ice cream flavors do people like most?
Here's a bit of code from last time to load and represent the dataset:
import csv

def readData(filename):
    f = open(filename, "r")
    reader = csv.reader(f)
    data = [ ]
    for row in reader:
        data.append(row)
    f.close() # close the file once every row has been read
    return data
Example: Total Preferences
First: how many times does each flavor occur in any of a person's
preferences?
def getIceCreamCounts(data):
    iceCreamDict = { }
    for i in range(1, len(data)): # skip header
        for flavor in data[i]:
            if flavor not in iceCreamDict:
                iceCreamDict[flavor] = 0
            iceCreamDict[flavor] += 1
    return iceCreamDict
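Once the counts are in a dictionary, answering "which flavor is most common?" is a search for the key with the largest value. The sketch below assumes a dictionary shaped like the one getIceCreamCounts returns; the function name mostCommonFlavor and the counts are made up for illustration.

```python
def mostCommonFlavor(iceCreamDict):
    # Find the flavor with the highest count
    best = None
    for flavor in iceCreamDict:
        if best is None or iceCreamDict[flavor] > iceCreamDict[best]:
            best = flavor
    return best

# Made-up counts, in the same shape as the dictionary built above
exampleCounts = { "vanilla": 12, "chocolate": 20, "strawberry": 7 }
print(mostCommonFlavor(exampleCounts))
```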
Activity: Count Top Flavors
Second: how often does each flavor occur as the top preference a
person has?
Modify the code from before to count only the top preference
("Flavor 1")
Visualization
Exploration vs. Presentation
Data Visualization is the process of taking a set of data and
representing it in a visual format. Whenever you've made charts or
graphs in past math or science classes, you've visualized data!
Data Exploration
In data exploration, charts created
from data can provide information
about that data beyond what is
found in simple analyses alone.
Visual Variables Show Differences
In visualization, we use different visual variables to demonstrate the
differences between categories or data points.
Which visual variable you use depends on the type of the data you're
representing – categorical, ordinal, or numerical.
Visual Variable Options – Numerical
If you want to encode numerical
data, you basically have two options:
position and size.
Visual Variable Options – Ordinal
For ordinal data, you can use position and size, but you can also use
value.
Value: the lightness or darkness of a color in the chart (from 0 to 255 in RGB).
Values are ordered from light to dark based on the ordinal comparison.
Visual Variable Options – Categorical
Categorical data can be presented
using position, size, and value, but it
also adds two other options: color
and shape.
Choosing a Visualization
There are dozens of different visualizations you can use on data.
In order to choose the best visualization for the job, consider how
many dimensions of data you need to visualize.
One-Dimensional Data
A one-dimensional visualization only visualizes a single feature of the
dataset. For example:
"I want to know how many of each product type are in my data"
"I want to know the proportion of people who have cats in my data"
Histograms Show Counts
For categorical or ordinal data,
show counts for each type of data
using bars (length = count).
Pie Charts Show Percentages
A pie chart shows the proportion
of the data that falls into each
category of the feature. The
proportions should always add up
to 100%.
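As a preview using the Matplotlib library introduced later in this lecture, a pie chart can be sketched as follows. The function makePieChart and the pet ownership counts are made up for illustration; ax.pie converts the raw counts to proportions itself.

```python
import matplotlib.pyplot as plt

def makePieChart(labels, counts, title):
    fig, ax = plt.subplots()
    # autopct prints each wedge's percentage of the total on the chart
    ax.pie(counts, labels=labels, autopct="%1.1f%%")
    ax.set_title(title)
    return fig, ax

# Made-up survey results
fig, ax = makePieChart([ "cats", "dogs", "neither" ], [ 30, 45, 25 ],
                       "Pet Ownership")
```

Calling plt.show() afterwards displays the chart.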
Two-Dimensional Data
A two-dimensional visualization shows how two features in the dataset
relate to each other. For example:
"I want to know the cost of each product category that we have"
"I want to know the weight of the animals that people own, by pet species"
"I want to know how the size of the product affects the cost of shipping"
To visualize two-dimensional data, use a bar chart, a scatter plot, a line plot,
or a box-and-whiskers plot.
Bar Charts Compare Averages
A bar chart compares the average
results of a numerical feature
across the categories of a
categorical feature.
Box-and-Whisker Plots Show Ranges
A box-and-whisker plot also
compares a numerical feature
across the categories of a
categorical feature, but instead of
a single average it shows summary
statistics (median, quartiles, and
range) for each category.
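As a preview using the Matplotlib library introduced later in this lecture, a box-and-whisker plot takes one list of numbers per category. The function makeBoxPlot and the pet weights below are made up for illustration.

```python
import matplotlib.pyplot as plt

def makeBoxPlot(names, groups, yLabel):
    fig, ax = plt.subplots()
    # groups is a list of lists: one list of numbers per category
    result = ax.boxplot(groups)
    ax.set_xticklabels(names)
    ax.set_ylabel(yLabel)
    return fig, ax, result

# Made-up pet weights by species
catWeights = [ 8, 9, 10, 11, 12 ]
dogWeights = [ 30, 45, 50, 60, 70 ]
fig, ax, result = makeBoxPlot([ "cats", "dogs" ], [ catWeights, dogWeights ],
                              "Weight (lbs)")
```

Calling plt.show() afterwards displays the chart, with one box per species.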
Scatter Plots Show Trends
A scatter plot compares two
numerical features by plotting
every data point as a dot on the
graph (with the first feature as the
x axis and the second as the y
axis).
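As a preview using the Matplotlib library introduced later in this lecture, a scatter plot takes two equal-length lists, one per feature. The function makeScatterPlot and the product data are made up for illustration.

```python
import matplotlib.pyplot as plt

def makeScatterPlot(xs, ys, xLabel, yLabel):
    fig, ax = plt.subplots()
    # Each (xs[i], ys[i]) pair becomes one dot
    ax.scatter(xs, ys)
    ax.set_xlabel(xLabel)
    ax.set_ylabel(yLabel)
    return fig, ax

# Made-up product sizes and shipping costs
sizes = [ 1, 2, 4, 5, 8 ]
costs = [ 3.0, 4.5, 7.0, 8.5, 13.0 ]
fig, ax = makeScatterPlot(sizes, costs, "Size", "Shipping cost")
```

Calling plt.show() afterwards displays the chart.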
Line Plots Show Trends Over Time
A line plot uses a numerical
feature that specifically measures
time on the x axis, and a different
numerical feature on the y axis.
Three-Dimensional Data
A three-dimensional visualization tries to show the relationship between
three different features at the same time. For example:
"I want to know the cost and the development time by product category"
"I want to know the weight of the animals that people own and how much
they cost, by pet species"
"I want to know how the size of the product and the manufacturing location
affects the cost of shipping"
Bubble Plots Show Size
A bubble plot can be used to
compare three numerical
features. One feature is the x axis,
and another is the y axis. The third
feature is used to specify the size
of the data points.
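As a preview using the Matplotlib library introduced later in this lecture, a bubble plot can be sketched as a scatter plot whose s parameter sets each dot's size. The function makeBubblePlot and the data are made up for illustration.

```python
import matplotlib.pyplot as plt

def makeBubblePlot(xs, ys, sizes):
    fig, ax = plt.subplots()
    # s sets the area of each dot; alpha makes overlapping dots visible
    ax.scatter(xs, ys, s=sizes, alpha=0.5)
    return fig, ax

# Made-up data: three numerical features for three products
devTimes = [ 2, 5, 9 ]
costs = [ 10, 40, 25 ]
bubbleSizes = [ 50, 200, 400 ]
fig, ax = makeBubblePlot(devTimes, costs, bubbleSizes)
```

Calling plt.show() afterwards displays the chart. In practice the third feature usually needs to be rescaled so that the dots are neither invisible nor overwhelming.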
Scatter Plot Matrices Compare Multiple Features
A scatter plot matrix can be used to
compare three (or more) numerical
features. Each column corresponds
to one of the three features, and each
row corresponds to one of the three
features. The graph shown in each
position is then the scatter plot
between the row's feature and the
column's feature.
Note that graphs on the diagonal are
histograms, as they compare a
feature to itself.
Coding Visualizations with
Matplotlib
Matplotlib Makes Visualizations
The matplotlib library can be used to generate interesting visualizations
in Python.
Matplotlib Core Ideas
For every visualization you make in Matplotlib, you'll need to set up a figure
and axis. This is generally done with the code:
fig, ax = plt.subplots()
You can then directly add visualizations to the axis by calling different
methods on ax, with the data as the parameter. Let's look at histograms and
bar charts specifically.
When the visualization is set up, display it by calling:
plt.show()
Bar Chart Example - Categorical
Let's use our ice cream data to make a nice categorical histogram (which will be
formed using bar charts). We'll graph the counts of the three classic flavors: vanilla,
chocolate, and strawberry.
data = readData("icecream.csv")
d = getIceCreamCounts(data)
flavors = [ "vanilla", "chocolate", "strawberry" ]
counts = [ d["vanilla"], d["chocolate"], d["strawberry"] ]
Bar Chart Example - Categorical
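import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ind = list(range(len(flavors))) # x position of each bar
ax.bar(ind, counts)

# Add labels
ax.set_ylabel('Counts')
ax.set_title('Counts of Three Flavors')
ax.set_xticks(ind)
ax.set_xticklabels(flavors)
plt.show()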
Advanced Bar Chart
We can make our visualizations more advanced by adding side-by-side bars,
and using the other matplotlib features to add data to the chart.
For example, let's write a bit of matplotlib code to compare averages and
standard deviations across an arbitrary data set.
Bar Chart Matplotlib
# From matplotlib website
import matplotlib.pyplot as plt
import numpy as np

# Sample data (illustrative values): mean and standard deviation per group
menMeans = [20, 35, 30, 35, 27]
menStd = [2, 3, 4, 1, 2]
womenMeans = [25, 32, 34, 20, 25]
womenStd = [3, 5, 2, 3, 3]

fig, ax = plt.subplots()

# Using numpy arrays lets us do useful operations
mensInd = np.arange(5)
width = 0.35 # the width of the bars
womensInd = mensInd + width # shift the second set of bars to the right

rects1 = ax.bar(mensInd, menMeans, width,
                color='r', yerr=menStd)
rects2 = ax.bar(womensInd, womenMeans, width,
                color='b', yerr=womenStd)

# Labels and titles
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(mensInd + width / 2) # center each tick between the paired bars
ax.set_xticklabels(['G1', 'G2', 'G3', 'G4', 'G5'])
ax.legend([rects1[0], rects2[0]], ['Men', 'Women'])
plt.show()