
Data Science
Objectives
Describe what supervised machine learning is
Contrast the difference between quantitative and categorical variables
Describe what classification is
Demo supervised classification of Iris type data.

Supervised Learning
Take known input data
Create a model of that data and train it based on known output data
After that, predict outputs from previously unseen inputs (see the sketch below).
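
A minimal sketch of that loop, using scikit-learn (the library this notebook relies on); the measurements and labels here are invented for illustration:

from sklearn.linear_model import LogisticRegression

# Known inputs (two made-up features per flower) and their known outputs.
X_known = [[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]]
y_known = ['setosa', 'setosa', 'versicolor', 'versicolor']

model = LogisticRegression()
model.fit(X_known, y_known)  # train the model on the known data

# Predict the output for a previously unseen input.
print(model.predict([[5.0, 3.4]]))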

Quantitative vs Qualitative variables

Quantitative variables take continuous numeric values. These are the petal and sepal lengths and widths in centimeters in this demo.
Categorical variables separate things into a controlled number of categories. These are the target Iris types in our data.
Classification is used to take quantitative variables and place them into categories. For us, this means figuring out what type of Iris an observation is.

Categorical variables on Wikipedia (https://en.wikipedia.org/wiki/Categorical_variable)
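
As a quick illustration of the two kinds of variables in Pandas (a toy frame with invented values, not the demo data):

import pandas as pd

df = pd.DataFrame({
    'petal length (cm)': [1.4, 4.7, 6.0],                # quantitative: a continuous measurement
    'iris type': ['setosa', 'versicolor', 'virginica'],  # categorical: one of a small set of labels
})
print(df.dtypes)  # float64 for the measurement, object for the category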

Modules to import
NumPy: An exceptional numeric computation system. (we have already seen this)
matplotlib: A system to make data visualizations.
Pandas: A spreadsheet-like way of manipulating data in Python.
sklearn: A machine learning system.

Exploration and Training


When we first encounter a dataset it is useful to explore it to find out about its properties.
So we will explore the data with matplotlib.
Then we will train our classification model. In our case that is logistic regression.
Finally we will test our model, and discuss ways it could be made better.

Preparing the Iris data


Import Pandas and NumPy
Get the data science logistic regression code


Get the iris data set


Prepare to do visualizations.

In[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
%matplotlib inline
iris_data = load_iris()

Now explore what is there.

A long-winded description of the dataset.


In[2]: print(iris_data.DESCR)

Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%[email protected])
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A. Fisher

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments". IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...

What are the types of irises in the data set?

In[3]: print(iris_data.target_names)

['setosa' 'versicolor' 'virginica']

What are the features of the irises?

In[4]: print(iris_data.data)

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 ...
What are the known types of Irises corresponding to each row of the data? First, note that the data set assigns the integers 0, 1, and 2 to the iris types.


In[5]: iris_data.target

Out[5]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2])

I hear that most humans prefer words though. So let's look at the translation of these numbers.

In[6]: print(iris_data.target_names[iris_data.target])

['setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
'setosa' 'setosa' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
'versicolor' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
'virginica' 'virginica' 'virginica']

Pandas to the rescue


Pandas will put all the data together in one place.
The label is the type of Iris and what we aim to predict. Each label has an integer target
that corresponds to the label name.
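
The integer-to-name mapping can be spelled out directly (a quick check, not in the original notebook):

print(dict(enumerate(iris_data.target_names)))
# {0: 'setosa', 1: 'versicolor', 2: 'virginica'}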


In[12]: target_names = iris_data.target_names[iris_data.target]
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['target'] = iris_data.target
iris_df['label'] = target_names
# iris_df.head()
iris_df.tail()

Out[12]:      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target      label
         145                6.7               3.0                5.2               2.3       2  virginica
         146                6.3               2.5                5.0               1.9       2  virginica
         147                6.5               3.0                5.2               2.0       2  virginica
         148                6.2               3.4                5.4               2.3       2  virginica
         149                5.9               3.0                5.1               1.8       2  virginica

Now some pretty pictures

Basically this is making a scatter plot of the sepal measurement pairs for each iris type. First, a plot of sepal length vs. sepal width.


In[8]: fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
for target, color in zip(iris_df.target.unique(), ['r', 'g', 'b']):
    sub_df = iris_df.query('target == @target')
    ax.scatter(x=sub_df['sepal length (cm)'].values, y=sub_df['sepal width (cm)'].values,
               color=color, label=iris_data.target_names[target], s=30)
ax.legend(loc='best')
ax.set_xlabel('Sepal Length', size=15)
ax.set_ylabel('Sepal Width', size=15)

Out[8]: <matplotlib.text.Text at 0x11a131890>

Petal length vs. petal width


In[9]: fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
for target, color in zip(iris_df.target.unique(), ['r', 'g', 'b']):
    sub_df = iris_df.query('target == @target')
    ax.scatter(x=sub_df['petal length (cm)'].values, y=sub_df['petal width (cm)'].values,
               color=color, label=iris_data.target_names[target], s=30)
ax.legend(loc='best')
ax.set_xlabel('Petal Length', size=15)
ax.set_ylabel('Petal Width', size=15)

Out[9]: <matplotlib.text.Text at 0x11a489390>

Now for some visual observations

Notice how Setosa is always in its own clump?

Wait... but aren't we going to have the computer do this?


Yes, but we are doing something called supervised learning, which means we start with data whose labels we already know (like the iris type here) and train the computer from that known source of truth.

Logistic Regression
Fit a logistic regression model that classifies irises.
Test the model on a single observation where it should predict Setosa as our iris type.
Run the model on all known observations and score its accuracy.

In[10]: X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].values
y = iris_df['label'].values
logistic_model = LogisticRegression()
logistic_model.fit(X, y)

Out[10]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
         intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
         penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
         verbose=0, warm_start=False)

How well does it do?


In[13]: # Measurements of the first iris in the data set, a known setosa.
print('Model guess for a setosa: {}'.format(logistic_model.predict(np.array([[5.1, 3.5, 1.4, 0.2]]))[0]))
# logistic_model.predict(np.array([[7.2, 2.8, 6.6, 2], [6.2, 2.5, 3.6, 2], [4.7, ...]]))
print('Score {}'.format(logistic_model.score(X, y)))

Model guess for a setosa: setosa
Score 0.96

Wow, that is a high number! But since we scored the model on its own training data, we should expect a high number. That isn't a good test though: you should always test on data other than the training set. But that is the gist of the basics.

What we did
When we first encountered the dataset we explored it with matplotlib.
Then we trained our classification model, which was logistic regression.
Finally we tested our model, and discussed ways it could be made better.

BONUS!
Remember how I said it was a good idea to test the model on data other than training data? Here's
how to do it.

Splitting test and train data

Split the data into training and testing data
Train the model on the training data
Test the model on the test data, i.e. on data it has not seen before. That is called out-of-sample data.
Run the score multiple times.
Watch the scores change as different splits are made in the dataset.

In[15]: from sklearn.model_selection import train_test_split

logistic_model = LogisticRegression()
trials = 10

for i in range(trials):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    logistic_model.fit(X_train, y_train)
    print('Score {} of {} is {}'.format(i + 1, trials, logistic_model.score(X_test, y_test)))

Score 1 of 10 is 0.921052631579
Score 2 of 10 is 0.868421052632
Score 3 of 10 is 0.947368421053
Score 4 of 10 is 0.921052631579
Score 5 of 10 is 0.842105263158
Score 6 of 10 is 0.894736842105
Score 7 of 10 is 0.921052631579
Score 8 of 10 is 0.973684210526
Score 9 of 10 is 0.973684210526
Score 10 of 10 is 0.947368421053

We cut the training and test data at a different point each time, and that affects the score.
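
A more systematic version of this repeated splitting is k-fold cross-validation, which scikit-learn automates. Here is a sketch using cross_val_score (not part of the original notebook): it partitions the data into 5 folds, holds out each fold in turn, and reports one out-of-sample score per fold.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One fit-and-score per fold; every observation is used for testing exactly once.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)
print('Mean score: {}'.format(scores.mean()))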

