Data Science
Objectives
Describe what supervised machine learning is
Contrast quantitative and categorical variables
Describe what classification is
Demo supervised classification of iris species data
Supervised Learning
Take known input data
Train a model on that data using known output labels
Then predict outputs for previously unseen inputs (see the sketch after this list)
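In scikit-learn terms, this whole workflow is just fit followed by predict. A minimal sketch of the pattern (the names X_known, y_known, and X_new are placeholders, not data from this notebook):

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()        # any scikit-learn classifier follows this API
    model.fit(X_known, y_known)         # train on inputs whose labels we know
    predictions = model.predict(X_new)  # predict labels for previously unseen inputs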
Modules to import
NumPy: An exceptional numeric computation system. (we have already seen this)
matplotlib: A system to make data visualizations.
Pandas: A spreadsheet-like way of manipulating data in Python.
sklearn: A machine learning system.
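A typical import cell for this notebook might look like the following (a sketch; the exact aliases and import list are assumptions):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split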
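The dataset description below is printed straight from scikit-learn's built-in copy of the iris data. A sketch of the loading cell (assuming load_iris from the imports above):

    iris_data = load_iris()  # a Bunch with data, target, target_names, feature_names, DESCR
    print(iris_data.DESCR)   # the description shown below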
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
References
----------
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments". IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...
In[5]: iris_data.target
Out[5]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
I hear that most humans prefer words though. So let's look at the translation of these numbers.
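One way to do that translation, using the dataset's target_names attribute (a minimal sketch of the lookup):

    iris_data.target_names                    # ['setosa', 'versicolor', 'virginica']
    iris_data.target_names[iris_data.target]  # the species name for every observation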
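[Figures: matplotlib scatter plots exploring the four iris features, not reproduced here.] The exploration in this part of the notebook uses a Pandas DataFrame, iris_df. A minimal sketch of that setup (assuming the column names come from feature_names, which matches the cells below):

    iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
    iris_df['label'] = iris_data.target

    # e.g. two of the four features plotted against each other, colored by species
    plt.scatter(iris_df['sepal length (cm)'], iris_df['petal length (cm)'],
                c=iris_df['label'])
    plt.xlabel('sepal length (cm)')
    plt.ylabel('petal length (cm)')
    plt.show()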
Yes, but we are doing something called supervised learning, which means we start with data whose labels we already know (like the iris species here) and train the computer from that known source of truth.
Logistic Regression
Fit a logistic regression model that predicts iris species.
Test the model on a single observation, where it should predict setosa.
Run the model on all known observations and score its accuracy.
In[10]: X = iris_df[['sepal length (cm)', 'sepal width (cm)',
                     'petal length (cm)', 'petal width (cm)']].values
        y = iris_df['label'].values
        logistic_model = LogisticRegression()
        logistic_model.fit(X, y)
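The single-observation test and the accuracy score described above might look like this (a sketch; X[:1] is the first row of the data, a setosa):

    logistic_model.predict(X[:1])  # should come back as 0, i.e. setosa
    logistic_model.score(X, y)     # mean accuracy over all 150 observations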
Wow, that is a high number! But since we tested the model on our own training data, a high number is expected. That isn't a good test, though: you should always test on data other than the training set. But that is the gist of the basics.
What we did
When we first encountered the dataset we explored it with matplotlib.
Then we trained our classification model, logistic regression.
Finally we tested our model and discussed ways it could be made better.
BONUS!
Remember how I said it was a good idea to test the model on data other than the training data? Here's how to do it.
trials = 10
for i in range(trials):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    logistic_model.fit(X_train, y_train)
    print 'Score {} of {} is {}'.format(i + 1, trials, logistic_model.score(X_test, y_test))
Score 1 of 10 is 0.921052631579
Score 2 of 10 is 0.868421052632
Score 3 of 10 is 0.947368421053
Score 4 of 10 is 0.921052631579
Score 5 of 10 is 0.842105263158
Score 6 of 10 is 0.894736842105
Score 7 of 10 is 0.921052631579
Score 8 of 10 is 0.973684210526
Score 9 of 10 is 0.973684210526
Score 10 of 10 is 0.947368421053
Each run splits the data into training and test sets at a different random point, which affects the score.
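A common next step is k-fold cross-validation, which systematizes this idea so that every observation appears in a test set exactly once. A minimal sketch using scikit-learn's cross_val_score (an extra beyond what we covered):

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(LogisticRegression(), X, y, cv=5)  # five train/test folds
    print(scores.mean())  # average accuracy across the folds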