8.3x Week 1

The document covers introductory concepts in data science and machine learning including averages, median, standard deviation, distributions, histograms, and correlation. It discusses quantifying concepts, examining distributions, understanding why many distributions are bell-shaped, and examining variability around the mean. Key terms like average, median, standard deviation, normal distribution, central limit theorem, and correlation are defined.

Uploaded by

Arooj Shahbaz

Data Science: Machine Learning and Predictions

Notes: 8.3x Lab 1 Videos Notes

Lec 1.1 Introduction

Goals
● Quantify natural concepts like “____________” and “____________”
● Examine ________-shaped distributions
● Understand why many of the _________________ distributions are bell shaped

The Average (or Mean)
Data: 2, 3, 3, 9
Average = (2 + 3 + 3 + 9)/4 = 4.25


● Need not be a value in the collection
● Need not be an integer even if the data are integers
● Somewhere between min and max, but not necessarily halfway in between
● Same units as the data
● Smoothing _______________: collect all the contributions in one big pot, then split evenly
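
The arithmetic above can be sketched in plain Python. The list name `values` is an assumption chosen to match the question that follows; the lab's own code may look different.

```python
# The four data values used in the notes; the name `values` is assumed.
values = [2, 3, 3, 9]

# sum(values) is the "one big pot"; len(values) is how many ways it is split.
average = sum(values) / len(values)

print(average)  # 4.25
```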

What does len(values) mean?

What does this code mean?

The _________________ is a property of the histogram.

Relation to the Histogram


● The average of a list depends only on the __________________ in which the distinct values appear, not on the number of entries in the list
● The ____________ is the center of gravity of the histogram
Balance Point shows average (in red)
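
A quick numerical check of the first bullet, as a sketch only: repeating every entry the same number of times makes the list longer but leaves the histogram unchanged, so the average should not move. The helper name `average` is hypothetical.

```python
def average(xs):
    """Plain arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

values = [2, 3, 3, 9]
repeated = values * 10   # 40 entries, same histogram as the 4-entry list

print(len(values), average(values))
print(len(repeated), average(repeated))
```

Both lines report the same average even though the lists differ in length.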

Lec 1.2 Average and Median

Create a sample data set:

Median? Average?

Are the medians (halfway points of the data) of these two distributions the same or different?

Are the means of these two distributions the same or different?

Tail pulls average away from the ______________.

Which is bigger? Mean or median?
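
One way to explore these questions is to compare two small made-up data sets that differ only in their largest value. Both lists below are hypothetical and are not the data from the lecture.

```python
import statistics

# Hypothetical data: identical except that the last value is pushed far right.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_tail = [1, 2, 2, 3, 3, 3, 4, 4, 10]

for name, data in [("symmetric", symmetric), ("right tail", right_tail)]:
    print(name,
          "median =", statistics.median(data),
          "mean =", round(statistics.fmean(data), 3))
```

The halfway point is unchanged by stretching the tail, while the mean moves toward the tail.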

What is the code for finding the median height?

Lec 1.3 Standard Deviation


Defining Variability
Plan A: “biggest value - smallest value”
● Doesn’t tell us much about the shape of the distribution

Plan B:
● Measure variability around the mean
● Need to figure out a way to quantify this

What we are interested in is how far off these numbers are from their _________________ (which is 4.25).

________________ are obtained by just taking the values from each one and subtracting off the average.

Add up all deviations: You get ____.

We are going to square the deviations and augment the table with a column of squared deviations.

What does variance mean?

What are the units of squared deviation?

Standard deviation (SD) is the _______________ of the variance.

How Far from the Average?


● The SD measures roughly how far the data are from their average
● SD = ____________________________________________
● The SD has the same units as the data
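
The table-building steps from this lecture can be sketched for the data 2, 3, 3, 9. This assumes the variance is the plain average of the squared deviations (dividing by the number of entries); the variable names are illustrative.

```python
import math

values = [2, 3, 3, 9]
average = sum(values) / len(values)            # 4.25

# Deviation = value minus average.
deviations = [v - average for v in values]     # [-2.25, -1.25, -1.25, 4.75]
print("sum of deviations:", sum(deviations))

# Squaring removes the signs so positive and negative deviations cannot cancel.
squared_deviations = [d ** 2 for d in deviations]

# Variance here is the average of the squared deviations (units: data units squared).
variance = sum(squared_deviations) / len(squared_deviations)

# Taking the square root returns to the units of the data.
sd = math.sqrt(variance)
print("variance:", variance, "SD:", round(sd, 4))
```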

Why Use the SD?


● 1st reason: No matter what the shape of the distribution, the bulk of the data are in the range “average ± a few SDs”
● 2nd reason: Relation with bell-shaped curves: coming later

Lec 1.4 Chebyshev's Bounds


The bulk of the data are in the range “average ± a few SDs”.

Chebyshev’s Inequality: No matter what the shape of the distribution, the proportion of values in the range “average ± z SDs” is at least 1 - 1/z²
● Bounds are most helpful when z is on the _______________ side.
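
As a sketch, the bound can be compared with the actual proportion for a deliberately lopsided data set. The data are invented for illustration, and `proportion_within` is a hypothetical helper.

```python
import statistics

# Hypothetical, strongly right-skewed data (not from the lecture).
data = [1] * 50 + [2] * 30 + [5] * 15 + [40] * 5

average = statistics.fmean(data)
sd = statistics.pstdev(data)   # SD computed by dividing by the number of entries

def proportion_within(z):
    """Proportion of values in the range average ± z SDs."""
    low, high = average - z * sd, average + z * sd
    return sum(low <= x <= high for x in data) / len(data)

for z in (2, 3, 4, 5):
    print(f"z = {z}: actual {proportion_within(z):.2f}, "
          f"Chebyshev bound {1 - 1 / z**2:.4f}")
```

The actual proportion can be well above the bound; the inequality only promises a floor.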

What does this distribution show?


Here you find the ____________ and the ______ .

Do Chebyshev’s bounds work for this shape? Yes or No

So the average and the SD together tell you, to a large extent, where the data are situated.

Reading and Practice for Section 1- Great quiz questions

Lec 2.1 Standard Units


The Normal Curve
Goals:
● Describe what is meant by “bell-shaped curve”
● Explain how bell-shaped curves arise in inference

Standard Units
● How many SDs above average?
● z = (value - average)/SD
○ Negative z: value below average
○ Positive z: value above average
○ z = 0: value = average
● When values are in standard units: average = 0, SD = 1
● Chebyshev: At least 96% of the values of z are between -5 and 5
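
A minimal sketch of the conversion, using the formula z = (value - average)/SD from the list above. The function name `standard_units` is an assumption, and the SD divides by the number of entries.

```python
import statistics

def standard_units(values):
    """Convert each value to the number of SDs it sits above the average."""
    average = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - average) / sd for v in values]

z = standard_units([2, 3, 3, 9])
print([round(v, 3) for v in z])

# In standard units the average is 0 and the SD is 1 (up to rounding error).
print(round(statistics.fmean(z), 10), round(statistics.pstdev(z), 10))
```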

In simple terms, how is an array of numbers converted to standard units?

What is the average age?

What is the SD of the ages?

How did you figure this?

Does this code confirm the above results?

Lec 2.2 SD and Bell Curves


The SD and the Histogram
● Usually it’s not easy to estimate the SD by looking at a histogram.
● If histogram has a bell shape, then you can estimate SD:
The SD is the distance between the average (at center) and the points of inflection on either side

What does the red arrow indicate?

Distance between the center (64) and the point of inflection = the SD.

SD = ______

Does this code confirm the above results?

Lec 2.3 Normal Distribution


Bell Curve:
Total area under the curve is _______% (think of it as a histogram or distribution).

If a histogram is bell-shaped, then almost all the data are in the range “average ± _____ SDs”

Bounds and Normal Approximations:

(ignore “bootstrap” references)
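
The comparison between Chebyshev's bounds and normal approximations can be sketched with the standard normal curve from Python's `statistics` module. The helper `normal_area` is a hypothetical name; the lecture's own table is not reproduced here.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # average 0, SD 1

def normal_area(z):
    """Area under the standard normal curve between -z and z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

for z in (1, 2, 3):
    chebyshev_bound = 1 - 1 / z**2
    print(f"average ± {z} SDs: Chebyshev at least {chebyshev_bound:.4f}, "
          f"normal about {normal_area(z):.4f}")
```

Chebyshev's bound holds for every distribution, so it is necessarily looser than the figure for a bell-shaped one.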

Lec 2.4 Central Limit Theorem


First Reason for Using the SD: If you know the average and the SD, you have a pretty good sense of where the data lie.

Second Reason for Using the SD: If the sample is large, and you draw it at ____________ with replacement…
Then, regardless of the distribution of the population, the probability distribution of the sample sum (or of the sample average) is roughly normal.

We are going to look at delays as our _________________.

We are going to draw randomly from that population, and see what happens to the sample average.

Is this distribution normal? Yes or No

We drew ________ flights and computed the average.
We did this __________ times.

This histogram shows the distribution of the sample averages (looks like a ____________ curve).

Sample Averages
● Often, we only have a sample; we don’t know much about the population from which it was drawn
● The Central Limit Theorem states that the probability distribution of the average of a large random sample is roughly normal, regardless of the ___________________ of the population.
● This allows us to make ___________________ based on averages of large random samples.
(ignore “bootstrap” references)
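
The experiment described in this lecture can be imitated with a small simulation. Everything specific below is an assumption: the skewed population is generated artificially rather than taken from the flight-delay data, and the sample size and number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)   # fixed seed so the sketch is repeatable

# Hypothetical skewed population standing in for the delays.
population = [random.expovariate(1 / 15) for _ in range(10_000)]

sample_size = 400     # assumed
repetitions = 2_000   # assumed

# random.choices draws with replacement.
sample_averages = [
    statistics.fmean(random.choices(population, k=sample_size))
    for _ in range(repetitions)
]

center = statistics.fmean(sample_averages)
spread = statistics.pstdev(sample_averages)

# For a roughly normal histogram, about 95% of values fall within 2 SDs of center.
within_2_sds = sum(abs(a - center) <= 2 * spread for a in sample_averages) / repetitions

print("population average:", round(statistics.fmean(population), 2))
print("average of sample averages:", round(center, 2))
print("proportion within 2 SDs:", within_2_sds)
```

Plotting `sample_averages` as a histogram would show the shape directly; the proportion within 2 SDs is only a rough numerical stand-in for that picture.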

Reading and Practice for Section 2

Lec 3.1 Visualization


Correlation: measure of a certain kind of association between two numerical variables

Prediction
● To predict the value of a variable,
○ Identify attributes that are associated with that variable that you can ________________
○ Describe the relation between the attributes and the variable you want to predict
○ Use the relation to make your prediction

Visualization: Two Numerical Variables


● T_______________
○ Positive or negative association

● P_______________
○ Any discernible “shape” in the scatter
○ Linear or nonlinear

Visualize, then quantify

This scatter plot is generally linear with a _____________ association.

This scatter plot is nonlinear with a ________________ association.

The more miles/gallon, the cheaper the price.
(or the fewer miles/gallon, the more expensive the car)

Why is this surprising?

Describe the association between acceleration and price.

What happens to the scatterplot when the variables are drawn in standard units?

Lec 3.2 Calculation

The Correlation Coefficient r
● Measures linear association
● Based on standard units
● -1 ≤ r ≤ 1
○ r = _____: scatter is perfect straight line sloping _______
○ r = _____: scatter is perfect straight line sloping _______
○ r = _____: no linear association; uncorrelated

This scatterplot has a correlation of ______.

What are the units of the axes? ___________________

How is this scatterplot with a correlation of 0.9 different from the plot above?

What correlation number creates this “blob” scatterplot?

No linear association

Describe how to calculate r from this table:

Step 1: Find the _____________ of standard units

Step 2: Find the ________________ of the products

This function does all of this. This function is helpful because it is hard to judge correlation by eye.

Definition of r: Take x and y in standard units → multiply → then take the average.
This measures the degree of clustering around a straight line.
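
The definition of r translates almost word for word into code. This is a sketch rather than the function used in the lecture: the names `standard_units` and `correlation` and the small data set are assumptions.

```python
import statistics

def standard_units(values):
    """(value - average) / SD for every value."""
    average = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - average) / sd for v in values]

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    products = [a * b for a, b in zip(standard_units(x), standard_units(y))]
    return statistics.fmean(products)

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 1, 5, 2, 7]
r = correlation(x, y)
print(round(r, 4))

# Points that lie exactly on an upward-sloping line give r = 1 (up to rounding).
print(round(correlation(x, [2 * v + 1 for v in x]), 10))
```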

Lec 3.3 Interpretation


Causal Conclusions- Be careful…
● Correlation measures linear association
● Association does NOT imply ________________
● Just because two variables are correlated does NOT mean that one ____________ the other

Nonlinearity and Outliers


● Draw a scatterplot before you decide to compute r

How are X and Y related?

Y is the square of X

Correlation = ______
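
To check a relation like this numerically, one can compute r for x values placed symmetrically around 0 and y equal to the square of x. The x values below are hypothetical, and `correlation` is an illustrative helper, not the lecture's function.

```python
import statistics

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)
    products = [((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                for a, b in zip(x, y)]
    return statistics.fmean(products)

x = [-3, -2, -1, 0, 1, 2, 3]   # symmetric around 0 (assumed)
y = [v ** 2 for v in x]        # y is completely determined by x

r = correlation(x, y)
print(round(r, 10))
```

The result depends on the symmetry of x: with x values on only one side of 0, the same relation would give a large r.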

No need to compute ______ since there is no linear association.

The outlier point on the right drops the correlation to 0.

If you compute the _______________ of this scatterplot, you have 0.985 (extremely high).

This is a plot of state scores, not individuals.

Be cautious in interpreting correlation!

What has happened in this plot is that, by taking everybody in a state and lumping them all into one point, we have artificially introduced ______________.
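
The effect of lumping can be imitated with simulated data. This is a sketch under invented assumptions: ten groups stand in for states, each individual's x and y share only a group-level component, and none of the numbers come from the lecture.

```python
import random
import statistics

random.seed(1)   # fixed seed so the sketch is repeatable

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)
    return statistics.fmean(
        [((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)]
    )

groups = range(10)   # hypothetical "states"
people = []          # (group, x, y) for each simulated individual
for g in groups:
    for _ in range(200):
        people.append((g, g + random.gauss(0, 5), g + random.gauss(0, 5)))

r_individuals = correlation([p[1] for p in people], [p[2] for p in people])

# Lump each group into a single point: its average x and average y.
group_x = [statistics.fmean([p[1] for p in people if p[0] == g]) for g in groups]
group_y = [statistics.fmean([p[2] for p in people if p[0] == g]) for g in groups]
r_groups = correlation(group_x, group_y)

print("individuals:", round(r_individuals, 3))
print("group averages:", round(r_groups, 3))
```

Averaging within groups removes most of the individual scatter, so the ten lumped points cluster much more tightly around a line than the individuals do.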

Reading and Practice for Section 3- Great quiz graphs

Lab 1: Sample Means and Correlation
