CS109a Lecture1
On the first day of class you were introduced to the “data science”
process.
■ Ask questions
■ Data Collection
■ Data Exploration
■ Data Modeling
■ Data Analysis
■ Visualization and Presentation of Results
lecture outline
What Is Data?
Exploring Data
Descriptive Statistics
Data Visualization
An Example
What Next?
what is data?
where does it come from?
web scraping
■ Why do it? Older government sites or smaller news sites might not
offer APIs for accessing data, publish RSS feeds, or provide databases
for download. Or you may not want to pay to use the API or the database.
■ How do you do it? See HW0.
■ Should you do it?
⇒ You just want to explore: Are you violating their terms of service?
Are there privacy concerns for the website and its clients?
⇒ You want to publish your analysis or product: Do they have an API or
fee that you’re bypassing? Are they willing to share this data? Are you
violating their terms of service? Are there privacy concerns?
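One machine-readable signal that bears on the "should you do it?" question is a site's robots.txt file. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt; the rules, the user-agent name cs109-student, and the example.org URLs are all assumptions for illustration. It is only a first check and does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, user_agent="cs109-student"):
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one outside and one inside the disallowed path.
print(is_allowed("https://example.org/public/index.html"))
print(is_allowed("https://example.org/private/records.csv"))
```

For a live site, the same parser can load the real file with set_url() followed by read() instead of parse().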
what does it look like?
Simple or atomic:
what does it look like?
■ Textual Data
■ Temporal Data
■ Geolocation Data
more on tabular data
is the data any good?
handling messy data
exploring data
basic terms
Biases in samples:
Given some large dataset, we’d like to compute a few quantities that
intuitively summarize the data. To begin with we’d like to know
centrality
1. The Mean
The mean of a set of $n$ samples of a variable is denoted $\bar{x}$. To
calculate it, add the observed values and divide by the number of
observations:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Example:
\[ \text{Median} = \frac{22 + 23}{2} = 22.5 \]
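The mean and the median can be computed side by side in a short sketch. The sample below is hypothetical (it is not the data behind the slide's example); it was chosen so that the two middle values are 22 and 23, which gives the median of 22.5 shown in the example, while a single large value pulls the mean higher.

```python
import statistics

# Hypothetical sample (not from the lecture); the two middle values are 22 and 23.
observations = [18, 22, 23, 40]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


sample_mean = mean(observations)
sample_median = median(observations)
print("mean:", sample_mean, "median:", sample_median)

# Cross-check the hand-written versions against the standard library.
assert sample_mean == statistics.mean(observations)
assert sample_median == statistics.median(observations)
```

The gap between the two numbers is the point: the mean reacts to the one large observation, while the median does not.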
centrality
■ the mean
■ the median
spread
why data visualization?
The following four data sets make up Anscombe’s Quartet; all four
have nearly identical simple summary statistics.
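The claim about Anscombe's Quartet can be checked numerically. The sketch below hard-codes the quartet's values as they are commonly published (they are not copied from the slides) and computes the mean, sample variance, and Pearson correlation of each set using only the standard library; the helper names are assumptions for illustration.

```python
from math import sqrt

# Anscombe's quartet as commonly published; sets I to III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


summary = {
    name: {
        "mean_x": round(mean(xs), 2),
        "var_x": round(sample_variance(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_y": round(sample_variance(ys), 2),
        "r": round(pearson_r(xs, ys), 3),
    }
    for name, (xs, ys) in QUARTET.items()
}

for name, stats in summary.items():
    print(name, stats)
```

The printed rows agree to roughly two decimal places, yet scatter plots of the four sets look entirely different, which is the case for plotting data before trusting summaries.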
Suppose I tell you that the average score for Homework 0 Part A is 7.64/15.
what is data visualization good for?
Analyze:
Communicate:
visualization design principles
types of data visualizations
distribution
A histogram is a way to visualize how 1-dimensional data is
distributed across certain values.
[Figure: Effect of Bin Size on Histogram. Simulated 1000 N(0,1) and 500 N(1,1) samples; vertical axis: Frequency.]
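The effect of bin size can be reproduced with a small simulation. The sketch below draws the same mixture named on the slide (1000 samples from N(0,1) and 500 from N(1,1)) and prints a text histogram at two bin widths; the seed, the widths 2.0 and 0.5, and the function name are assumptions chosen for illustration, and a plotting library would normally replace the text bars.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Mixture matching the slide: 1000 draws from N(0,1) and 500 draws from N(1,1).
samples = ([random.gauss(0, 1) for _ in range(1000)]
           + [random.gauss(1, 1) for _ in range(500)])


def histogram(values, bin_width):
    """Count how many values fall in each bin [k * w, (k + 1) * w)."""
    counts = {}
    for v in values:
        k = math.floor(v / bin_width)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


for width in (2.0, 0.5):  # assumed bin widths: one coarse, one fine
    print(f"bin width = {width}")
    for k, count in histogram(samples, width).items():
        left = k * width
        # One '#' per 10 samples keeps the bars readable in a terminal.
        print(f"  [{left:6.2f}, {left + width:6.2f}) {'#' * (count // 10)}")
```

Wide bins smooth the data into a single hump, while narrow bins show more detail along with more sampling noise.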
relationships
composition
comparisons
visualizing the impossible
reducing the dimension
adding extra dimensions
Bacteria Name            Group No.   Res. to Drug 1   Res. to Drug 2   Res. to Drug 3
Brucella abortus         1           0.1              3                49
Diplococcus pneumoniae   2           4.75             0.007            0.125
Aerobacter aerogenes     1           0.3              1                47.2
Streptococcus viridans   2           4.9              0.03             -1.45
effectiveness of drugs
Any patterns?
predict
Can we predict the type of iris given petal and sepal lengths?
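One simple way to approach this question is a nearest-neighbor rule: label a new flower with the species of the most similar flower already measured. The sketch below is a minimal illustration under stated assumptions: the nine (sepal length, petal length) measurements in cm are illustrative values chosen to resemble typical flowers, not data from the lecture, and predict_species is a hypothetical helper name.

```python
# Illustrative (sepal_length_cm, petal_length_cm) measurements; assumed values, not lecture data.
TRAINING = [
    ((5.1, 1.4), "setosa"),
    ((4.9, 1.4), "setosa"),
    ((4.7, 1.3), "setosa"),
    ((7.0, 4.7), "versicolor"),
    ((6.4, 4.5), "versicolor"),
    ((6.9, 4.9), "versicolor"),
    ((6.3, 6.0), "virginica"),
    ((5.8, 5.1), "virginica"),
    ((7.1, 5.9), "virginica"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two measurement tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predict_species(sepal_length, petal_length):
    """1-nearest-neighbor: return the species of the closest training flower."""
    query = (sepal_length, petal_length)
    _, species = min(TRAINING, key=lambda item: squared_distance(item[0], query))
    return species


print(predict_species(5.0, 1.5))
print(predict_species(6.0, 4.4))
print(predict_species(6.5, 5.8))
```

With so few training points this is only a toy; a real answer would use the full data set and hold out some flowers to measure how often the prediction is right.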