
Introduction to

Data Processing I

Prof. Denio Duarte


[email protected]
● Machine Learning
○ Build a model that describes the input data (dataset)
○ The model can be called a program or a hypothesis
Introduction
● Traditional programming

  Input (data) + Program → Computer → Output
Introduction
● Machine learning

  Input (data) + Output → Computer → Program
Introduction
● Comments
○ Data is the raw material for machine learning
algorithms
○ Algorithms build a model that describes the input data
○ The data quality affects the model quality

Source: https://www.r-bloggers.com/2019/08/new-course-learn-advanced-data-cleaning-in-r/
Introduction

Source: 7wData
Dataset
● Store the examples from the domain to be modeled
● Definitions
○ X={(x(1), y(1)), …, (x(m), y(m))}
■ m is the number of examples
■ x(i) is a tuple that represents the i-th example
● x(i)=(x1, x2, …, xn), n is the number of attributes (features)
of a given example (tuple)
■ y(i) is the label of example i
■ X is called the input and y is the output
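As a concrete sketch of these definitions, a toy dataset can be held as a NumPy array; the values below are illustrative, not from the slides:

```python
import numpy as np

# Toy dataset with m=3 examples and n=3 features each (illustrative values)
X = np.array([[6.0, 7.0, 4],
              [9.0, 8.4, 6],
              [4.0, 3.4, 1]])           # each row is one example x(i)
y = np.array(['Pass', 'Pass', 'Exam'])  # y(i) is the label of example i

m, n = X.shape  # m examples, n features per example
print(m, n)
```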
Dataset
● Supervised
○ y is non-empty, i.e., every example is associated with a label
● Unsupervised
○ y is empty, i.e., examples are not associated with any label
Dataset – example (supervised)

       Student    Grade 1   Grade 2   Study Hours   Result
x(1)   Angelina   6.0       7.0       4             Pass
x(2)   Meryl      9.0       8.4       6             Pass
x(3)   Tom        4.0       3.4       1             Exam
x(4)   Arnold     5.0       4.4       2             Pass
x(5)   Brad       5.0       4         1             Fail
x(6)   Sandra     3.4       2.0       0             Fail

The columns Student … Study Hours form X; the column Result is y.
For example x(3): x1(3)=4.0, x2(3)=3.4, x3(3)=1.

Dataset – example (unsupervised)

       Student    Grade 1   Grade 2   Study Hours
x(1)   Angelina   6.0       7.0       4
x(2)   Meryl      9.0       8.4       6
x(3)   Tom        4.0       3.4       1
x(4)   Arnold     5.0       4.4       2
x(5)   Brad       5.0       4         1
x(6)   Sandra     3.4       2.0       0

All columns form X; there is no label y.

Supervised Algorithms
● Rely on the labels to build the model
● Generalize the dataset based on the label values
– Regression
– Classification
Supervised Algorithms
● If the y domain is continuous (y ∈ ℝ), the problem is a regression problem

Student    Grade 1   Grade 2   Study Hours   Result (y)
Angelina   6.0       7.0       4             7.2
Meryl      9.0       8.4       6             8.9
Tom        4.0       3.4       1             6.3
Arnold     5.0       4.4       2             7.0
Brad       5.0       4         1             4.9
Sandra     3.4       2.0       0             2.2

Note: every regression problem can be transformed into a classification problem.


Supervised Algorithms
● If the y domain is discrete (classes), the problem is a classification problem, with y ∈ {Pass, Exam, Fail}

Student    Grade 1   Grade 2   Study Hours   Result (y)
Angelina   6.0       7.0       4             Pass
Meryl      9.0       8.4       6             Pass
Tom        4.0       3.4       1             Exam
Arnold     5.0       4.4       2             Pass
Brad       5.0       4         1             Fail
Sandra     3.4       2.0       0             Fail

Transformation from the continuous labels:
  >= 7       → Pass
  < 5        → Fail
  otherwise  → Exam


Overall
Regression Intuition
● Given the wind speed (x1) and the number of people in a room (x2), how much energy is necessary to cool the room (y)?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               5
50                42              25
45                31              22
60                35              18
Regression Intuition
● Let’s model the problem mathematically
○ Every feature is multiplied by a weight, and we add a bias (intercept) term:
  ŷ = ϴ0 + ϴ1x1 + ϴ2x2
○ What are the best values for the ϴ’s?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               5
50                42              25
45                31              22
60                35              18
Regression Intuition
● Let’s model the problem mathematically
○ Suppose ϴ0 = 0.5, ϴ1 = 0.2, and ϴ2 = 0.3
○ ŷ(1) = 0.5 + 0.2×100 + 0.3×2 = 21.1 (21.1 − 5 = 16.1)
  ■ Not so close to the real value 5
  ■ Residual error: (1/4) × Σ |energy − ŷ| = 6.55
  ■ Which are the best ϴ’s?

x1 (wind speed)   x2 (# people)   y (energy)   ŷ (y_hat)
100               2               5            21.1
50                42              25           23.1
45                31              22           18.8
60                35              18           23.0
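The computation on this slide can be reproduced with NumPy; a minimal sketch, assuming the ϴ values given above:

```python
import numpy as np

# Feature matrix (wind speed, # people) and true energy values
X = np.array([[100, 2], [50, 42], [45, 31], [60, 35]], dtype=float)
y = np.array([5, 25, 22, 18], dtype=float)

theta0 = 0.5
theta = np.array([0.2, 0.3])

y_hat = theta0 + X @ theta             # predictions: 21.1, 23.1, 18.8, 23.0
residual = np.mean(np.abs(y - y_hat))  # (1/4) * sum |energy - y_hat|
print(y_hat, residual)                 # residual is 6.55
```

Minimizing this kind of error over all possible ϴ values is exactly what a regression algorithm does when it fits the data.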
Classification Intuition
● Given the wind speed (x1) and the number of people in a room (x2), which energy level is necessary to cool the room (y)?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
Classification Intuition
● Approach: build a set of rules that maps each example to a class
○ if attr1 > n then class1
  else if attr2 < 5 then class2
  else class3

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
Classification Intuition
● Approach: build a set of rules that maps each example to a class
○ if x1 > 100 then Low
  else if x2 > 40 then High
  else if x1 > 50 then Medium
  else High
○ Is there a better set of rules?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
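The rule set above can be written as a plain Python function (the function name is illustrative):

```python
def classify(x1, x2):
    # Rule set from the slide: x1 = wind speed, x2 = # people
    if x1 > 100:
        return 'Low'
    elif x2 > 40:
        return 'High'
    elif x1 > 50:
        return 'Medium'
    return 'High'

rows = [(100, 2), (50, 42), (45, 31), (60, 35)]
print([classify(x1, x2) for x1, x2 in rows])
```

Applied to the table, the rules recover the last three rows correctly, but the first row (wind speed exactly 100) falls through to Medium instead of Low, so a better rule set does exist.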
Be Aware
● The model must not specialize too much to the input data
○ Overfitting: the model memorizes the training examples
● The model must not generalize too much over the input data
○ Underfitting: the model fails to capture the data’s structure

Source: https://abracd.org/overfitting-e-underfitting-em-machine-learning/
Assess the Model
● How to know if a built model is good?
○ Classification
■ Accuracy, precision, recall, F-score, ...
○ Regression
■ R² score, Mean Squared Error (MSE), Mean Absolute Error (MAE), ...
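A minimal sketch of these metrics with scikit-learn; the label and prediction values below are made up for illustration:

```python
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Classification: compare true classes against a model's predictions
y_true = ['Pass', 'Fail', 'Exam', 'Pass']
y_pred = ['Pass', 'Fail', 'Pass', 'Pass']
acc = accuracy_score(y_true, y_pred)   # 3 of 4 correct -> 0.75

# Regression: compare true values against predicted values
yr_true = [5.0, 25.0, 22.0, 18.0]
yr_pred = [21.1, 23.1, 18.8, 23.0]
mse = mean_squared_error(yr_true, yr_pred)
r2 = r2_score(yr_true, yr_pred)
print(acc, mse, r2)
```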
Dataset I
● We are interested in data
○ Features (attributes/variables) represent the properties of a given example
○ Features (attributes/variables) belong to a domain
  ■ Qualitative
  ■ Quantitative
import seaborn as sb
data=sb.load_dataset('tips')
data.head()
Dataset I
● The domain is associated with a type

total_bill: float   tip: float   sex: string   smoker: string   day: string   time: string   size: int


Dataset I
● Most machine learning algorithms need features as numbers
○ Generally, non-numeric features are qualitative
Dataset I
● If a non-numeric attribute is qualitative, we can encode it
○ sex, smoker, day, and time are qualitative (or discrete)
■ We can encode them numerically
● no = 0
● yes = 1
● female = 0
● male = 1
● :
Dataset I
● Option 1:
○ Use the method LabelEncoder from preprocessing
(sklearn)
from sklearn import preprocessing as pp
laben = pp.LabelEncoder()
laben.fit(data['sex'])
print(laben.classes_)
['Female' 'Male']
laben.fit(data['day'])
print(laben.classes_)
['Fri' 'Sat' 'Sun' 'Thur']
Dataset I

● Option 1:
○ Use the method LabelEncoder from preprocessing
(sklearn)
from sklearn import preprocessing as pp
laben = pp.LabelEncoder()
laben.fit(data['sex'])
data['sex'] = laben.transform(data['sex'])
Dataset I
● Option 2:
○ If data is a dataframe (pandas) – and it is.

# verify the unique values of the attribute
data['sex'].unique()
['Male', 'Female']

# build a dictionary to encode sex and smoker
mapping = {'sex': {'Male': 0, 'Female': 1},
           'smoker': {'No': 0, 'Yes': 1}}
data.replace(mapping, inplace=True)
# inplace=True guarantees the change is applied to data itself
Exercise
import numpy as np
import seaborn as sb
import pandas as pd
data=sb.load_dataset('tips')
print(data.columns)
y=data['tip'] # the amount of the tip is the label
X=data.drop(['tip'],axis=1) # the rest compose X
# Prepare the dataset to have only numeric attributes
Keep Pushing
X['day'].unique()
## ['Fri','Thur', 'Sat', 'Sun']
# Instead of encoding a feature, we can create
# new features based on the values of the original one
# first we create a new df with the values of the feature
days=pd.get_dummies(X['day'])
# days is composed of four columns Thur Fri Sat and Sun
# now we can replace the column day by days
X = pd.concat([X,days],axis=1) # add new columns
X.drop(['day'],inplace=True,axis=1) #delete day
# this approach can help the estimator learn
# better models
Keep Pushing
# We can delete one of the columns that represents a day
# in this case, if the other columns are 0, it means that
# the week day is the removed one, i.e., Thur
X['day'].unique()
## ['Fri','Thur', 'Sat', 'Sun']
days=pd.get_dummies(X['day'],drop_first=True)
# days, now, is composed of Fri Sat and Sun
# now we can replace the column day by days
X = pd.concat([X,days],axis=1) # add new columns
X.drop(['day'],inplace=True,axis=1) #delete day
It is your time (again)
● Transform the values of time into new features (as day)
● The label of the tips dataset indicates that we have a regression problem
○ Build a new dataset from tips.
○ In this new dataset, you are going to transform the values
of tip (label) into discrete ones (classes)

small, average, and big.
● DataFrame.to_csv(file_name, index=False)
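One way to discretize the label is pd.cut; a sketch with hypothetical bin edges (a small stand-in frame replaces the real tips data, and the cut points are assumptions you should adapt):

```python
import pandas as pd

# Stand-in for the tips dataframe; only the tip column matters here
df = pd.DataFrame({'tip': [1.01, 3.50, 5.00, 2.00, 7.58]})

# Hypothetical bin edges: (0, 2] -> small, (2, 4] -> average, above -> big
df['tip_class'] = pd.cut(df['tip'],
                         bins=[0, 2, 4, df['tip'].max()],
                         labels=['small', 'average', 'big'])
print(df['tip_class'].tolist())  # ['small', 'average', 'big', 'small', 'big']

# Save the new dataset, as suggested on the slide
df.to_csv('tips_classes.csv', index=False)
```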
