Slides on DataI
Data Processing I
[Diagram: Input (data) and Program enter the Computer, which produces the Output]
Introduction
● Machine learning
[Diagram: Input (data) and Output enter the Computer, which produces the Program]
Introduction
● Comments
○ Data is the raw material for machine learning
algorithms
○ Algorithms build a model that describes the input data
○ The data quality affects the model quality
Source: https://www.r-bloggers.com/2019/08/new-course-learn-advanced-data-cleaning-in-r/
Introduction
Source: 7wData
Dataset
● Store the examples from the domain to be modeled
● Definitions
○ X={(x(1), y(1)), …, (x(m), y(m))}
■ m is the number of examples
■ x(i) is a tuple that represents the i-th example
● x(i)=(x1, x2, …, xn), n is the number of attributes (features)
of a given example (tuple)
■ y(i) is the label of example i
■ X is called the input and y is the output
Dataset
● Supervised
○ y is not empty; that is, every example is associated
with a label
● Unsupervised
○ y is empty; that is, examples are not associated with
any label
Dataset – example (supervised)
X            y
100   2      5
50    42     25
45    31     22
60    35     18
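As a minimal sketch, the example above can be written in plain Python following the definition X={(x(1), y(1)), …, (x(m), y(m))}. The variable names (dataset, inputs, labels) are illustrative choices, not part of the slides.

```python
# Each example is a pair (x, y): x is a tuple of n attribute values
# and y is the label of that example.
dataset = [
    ((100, 2), 5),
    ((50, 42), 25),
    ((45, 31), 22),
    ((60, 35), 18),
]

m = len(dataset)        # number of examples
n = len(dataset[0][0])  # number of attributes (features) per example

# Split the pairs into the input part and the output part.
inputs = [x for x, _ in dataset]
labels = [y for _, y in dataset]

print(m, n)  # 4 2
```

In an unsupervised setting the labels list would simply be absent and only the inputs would be kept.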
Regression Intuition
● Let’s model the problem mathematically
○ All features will be multiplied by a given weight, and
we will add a bias, also known as the intercept
○ y_hat = ϴ0 + ϴ1·x1 + ϴ2·x2
○ What are the best values for the ϴ’s?
wind speed   # people   energy
100          2          5
50           42         25
45           31         22
60           35         18
Regression Intuition
● Let’s model the problem mathematically
○ ϴ0 = 0.5, ϴ1 = 0.2 and ϴ2 = 0.3
○ y_hat(1) = 0.5 + 0.2×100 + 0.3×2 = 21.1 (error: 21.1 − 5 = 16.1)
■ Not so close to the real value, 5
– Residual error: (1/4) × ∑ |energy − y_hat| = 6.55
■ Which are the best ϴ’s?
wind speed   # people   energy   y_hat
100          2          5        21.1
50           42         25       23.1
45           31         22       18.8
60           35         18       23
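The arithmetic on this slide can be reproduced with a short plain-Python sketch. It assumes the residual error is the mean of the absolute differences between energy and y_hat, which is the reading that yields the stated value of 6.55. The names predict and mae are illustrative.

```python
# Rows of the example table: (wind speed, number of people, energy).
rows = [(100, 2, 5), (50, 42, 25), (45, 31, 22), (60, 35, 18)]

# Weights from the slide: bias, wind-speed weight, people weight.
theta0, theta1, theta2 = 0.5, 0.2, 0.3


def predict(x1, x2):
    """Linear model: bias plus a weighted sum of the two features."""
    return theta0 + theta1 * x1 + theta2 * x2


y_hat = [predict(x1, x2) for x1, x2, _ in rows]

# Mean absolute difference between the real and the predicted energy.
mae = sum(abs(energy - p) for (_, _, energy), p in zip(rows, y_hat)) / len(rows)

print([round(p, 1) for p in y_hat])  # [21.1, 23.1, 18.8, 23.0]
print(round(mae, 2))                 # 6.55
```

Changing the three theta values and rerunning the script is a simple way to explore which weights bring the predictions closer to the real energy values.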
Classification Intuition
● Given the wind speed (x1) and the number of people in a
room (x2), how much energy is necessary to cool the
room (y)?
x1           x2         y
wind speed   # people   energy
100          2          Low
50           42         High
45           31         High
60           35         Medium
Classification Intuition
● Approach: build a set of rules that maps each example to a class
○ if attr1 > n then class1
else if attr2 < 5 then class2 else class3
wind speed   # people   energy
100          2          Low
50           42         High
45           31         High
60           35         Medium
Classification Intuition
● Approach: build a set of rules that maps each example to a class
○ if x1 > 100 then Low
else if x2 > 40 then High
else if x1 > 50 then Medium
else High
○ Is there a better set of rules?
wind speed   # people   energy
100          2          Low
50           42         High
45           31         High
60           35         Medium
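The rule set above can be checked against the four examples with a small sketch. The class labels are written in English (Low, Medium, High) and the function name classify is an illustrative choice. The thresholds are copied from the slide as written.

```python
def classify(x1, x2):
    """Apply the slide's rules in order; the first matching rule wins."""
    if x1 > 100:
        return "Low"
    elif x2 > 40:
        return "High"
    elif x1 > 50:
        return "Medium"
    return "High"


# (wind speed, number of people) paired with the expected class.
examples = [
    ((100, 2), "Low"),
    ((50, 42), "High"),
    ((45, 31), "High"),
    ((60, 35), "Medium"),
]

predictions = [classify(x1, x2) for (x1, x2), _ in examples]
correct = sum(p == label for p, (_, label) in zip(predictions, examples))

print(predictions)              # ['Medium', 'High', 'High', 'Medium']
print(correct / len(examples))  # 0.75
```

With these thresholds the first example (x1 = 100) is not strictly greater than 100, so it falls through to the Medium rule. This is one concrete reason to ask whether a better set of rules exists. Whether the slide intended a non-strict comparison is not stated.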
Be Aware
● The model must not specialize too much on the input data (memorizing it)
○ Overfitting
● The model must not generalize too much, failing to capture the patterns in the input data
○ Underfitting
Source: https://abracd.org/overfitting-e-underfitting-em-machine-learning/
Assess the Model
● How do we know whether a model is good?
○ Classification
■ Accuracy, precision, recall, F-score, ...
○ Regression
■ R² score, Mean Squared Error (MSE), Mean Absolute Error
(MAE), ...
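To make a few of these metrics concrete, here is a sketch in plain Python using their standard textbook definitions. The input values are reused from the regression and classification examples earlier in the slides. The function names are illustrative, and libraries such as scikit-learn offer equivalents in sklearn.metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the real class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Regression: real energy values and the y_hat column from the slides.
energy = [5, 25, 22, 18]
y_hat = [21.1, 23.1, 18.8, 23.0]
print(mean_absolute_error(energy, y_hat))
print(mean_squared_error(energy, y_hat))
print(r2_score(energy, y_hat))

# Classification: real classes and the output of the rule set from the slides.
real = ["Low", "High", "High", "Medium"]
predicted = ["Medium", "High", "High", "Medium"]
print(accuracy(real, predicted))  # 0.75
```

For these weights the R² score comes out negative, meaning the model does worse than always predicting the mean energy, which agrees with the slide's remark that the prediction is not close to the real value.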
Dataset I
● We are interested in data
○ Features (attributes/variables) represent a property of
a given example
○ Features (attributes/variables) belong to a domain
● Qualitative
● Quantitative
import seaborn as sb
# Load the example 'tips' dataset as a pandas DataFrame
data = sb.load_dataset('tips')
# Show the first five rows
data.head()
Dataset I
● The domain is associated with a type
● Option 1:
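○ Use the class LabelEncoder from preprocessing
(sklearn)
from sklearn import preprocessing as pp
# Learn the mapping from category labels to integer codes
laben = pp.LabelEncoder()
laben.fit(data['sex'])
# Replace the labels with their integer codes
data['sex'] = laben.transform(data['sex'])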
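To show what the encoding produces, the following plain-Python sketch mimics the behaviour of LabelEncoder: the distinct labels are sorted and each one receives an integer code from 0 upwards. The helper fit_label_mapping and the sample values are hypothetical and are not part of scikit-learn or of the tips dataset.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# A few illustrative values for a qualitative attribute.
sex = ["Female", "Male", "Male", "Female"]

mapping = fit_label_mapping(sex)
encoded = [mapping[value] for value in sex]

print(mapping)  # {'Female': 0, 'Male': 1}
print(encoded)  # [0, 1, 1, 0]
```

The integer codes carry no meaningful order for a qualitative attribute. They only give the algorithm a numeric representation of the categories.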
Dataset I
● Option 2:
○ If data is a pandas DataFrame (and it is)
● Check the unique values of the attribute
data['sex'].unique()
['Male','Female']