
UNIT-II
Know the Data and Data Preprocessing

Know the Data and Data Preprocessing: Data Objects and attribute types, Basic statistical description of data, Data preprocessing, Data cleaning, Data integration and Data reduction. Main approaches for Dimensionality Reduction: Projection, Manifold Learning, PCA. Insufficient Quantity of Training Data, Nonrepresentative Training Data, Poor-Quality Data, Irrelevant Features, Overfitting the Training Data, Underfitting the Training Data, Stepping Back, Testing and Validating.
Data Object
❑Data sets are made up of data objects.
❑A data object represents an entity.
❑Examples:
❑ sales database: customers, store items, sales
❑ medical database: patients, treatments
❑ university database: students, professors, courses
❑Also called samples, examples, instances, data points, objects, or data tuples.
❑Data objects are described by attributes.
Data Objects and Attributes
❑An attribute is a property, characteristic, or feature of a data object.
❑Examples: eye color of a person, temperature, etc.
❑An attribute is also known as a variable, field, characteristic, or feature.
❑A collection of attributes describes an object.
❑Attribute values are numbers or symbols assigned to an attribute.
❑Database rows → data objects; database columns → attributes.
Attributes
❑ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object, e.g., customer_ID, name, address.
❑ Distinction between attributes and attribute values:
❑ The same attribute can be mapped to different attribute values.
❑ Example: height can be measured in feet or meters.
❑ Different attributes can be mapped to the same set of values.
❑ Example: attribute values for ID and age are both integers.
❑ However, the properties of the attribute values can differ: ID has no limit, but age has a maximum and minimum value.
Attribute Types

❖NOMINAL (“relating to names”)

❖BINARY (only two categories or states)

❖ORDINAL (Order or Ranking)

❖NUMERIC (Measurable quantity)

❖DISCRETE

❖CONTINUOUS
Attribute Types

❑Categorical (Qualitative)
❑ Nominal and Ordinal attributes are collectively referred to as
categorical or qualitative attributes.

❑Numeric (Quantitative)
❑ Interval and Ratio are collectively referred to as quantitative
or numeric attributes.

❑Discrete vs Continuous attributes


Attribute Types
Nominal: categories, states, or “names of things” (symbols).
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ Marital status, occupation, ID numbers, zip codes
Binary: a nominal attribute with only 2 states (0 and 1).
◼ Symmetric binary: both outcomes equally important, e.g., gender.
◼ Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative).
Convention: assign 1 to the more important outcome (e.g., HIV positive).
Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known.
◼ Size = {small, medium, large}, grades, army rankings
Numeric: a measurable quantity (integer or real-valued).
Interval-scaled:
◼ Measured on a scale of equal-sized units; values have order.
◼ E.g., temperature in °C or °F, calendar dates.
◼ No true zero-point.
Ratio-scaled:
◼ Inherent zero-point.
◼ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
◼ E.g., temperature in Kelvin, length, counts, monetary quantities.
Attribute Types: Discrete vs. Continuous
Discrete attribute
◼ Has only a finite or countably infinite set of values.
◼ E.g., zip codes, profession, or the set of words in a collection of documents.
◼ Sometimes represented as integer variables.
◼ Note: binary attributes are a special case of discrete attributes.
◼ Binary attributes where only non-zero values are important are called asymmetric binary attributes.
Continuous attribute
◼ Has real numbers as attribute values.
◼ E.g., temperature, height, or weight.
◼ In practice, real values can only be measured and represented using a finite number of digits.
◼ Continuous attributes are typically represented as floating-point variables.
Basic Statistical Description of Data
❑Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
❑For data preprocessing tasks, we want to learn about data characteristics regarding both the central tendency and the dispersion of the data.
Measures of central tendency include mean, median, mode, and midrange.
Measures of data dispersion include quartiles, interquartile range (IQR), and variance.
These descriptive statistics are of great help in understanding the distribution of the data.
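To make these measures concrete, here is a small NumPy sketch (the income values are made-up illustration data, not from the slides):

```python
import numpy as np

# Hypothetical income sample (in $1000s)
income = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = income.mean()
median = np.median(income)
midrange = (income.min() + income.max()) / 2

# Mode: the most frequent value
values, counts = np.unique(income, return_counts=True)
mode = values[counts.argmax()]

q1, q3 = np.percentile(income, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range
variance = income.var(ddof=1)             # sample variance

# Common rule of thumb: values beyond 1.5 * IQR from the quartiles are outlier candidates
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
print(mean, median, mode, midrange, iqr, variance, outliers)
```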
Symmetric vs. Skewed Data

[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed distributions. In positively skewed data the mean lies to the right of the median; in negatively skewed data it lies to the left.]
Dispersion
Dispersion measures the extent to which the items vary from the central value. It is also called spread, scatter, or variability.
Data Preprocessing
Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Quality: Why Preprocess the Data?
Measures for data quality (a multidimensional view):
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable
• Consistency: some entries modified but others not, dangling references, …
• Timeliness: is the data updated in a timely manner?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: integration of multiple databases, data cubes, or files.
Data reduction: dimensionality reduction, numerosity reduction, data compression.
Data transformation and data discretization: normalization, concept hierarchy generation.
[Figure: forms of data preprocessing.]
Data Cleaning
Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., due to instrument faults, human or computer error, or transmission errors.
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., Occupation=“ ” (missing data).
• Noisy: containing noise, errors, or outliers; e.g., Salary=“−10” (an error).
• Inconsistent: containing discrepancies in codes or names; e.g., Age=“42” but Birthday=“20/03/2010”; a rating that was “1, 2, 3” is now “A, B, C”; discrepancies between duplicate records.
• Intentional (e.g., disguised missing data): Jan. 1 recorded as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available.
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
• Equipment malfunction
• Inconsistency with other recorded data, leading to deletion
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
• Failure to register the history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with:
• A global constant, e.g., “unknown” (which may act as a new class!)
• The attribute mean
• The attribute mean for all samples belonging to the same class: smarter
• The most probable value: inference-based, such as a Bayesian formula or a decision tree
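A minimal pandas sketch of the automatic fill-in strategies listed above (the DataFrame, its column names, and the constant are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing income values
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 42_000, np.nan, 48_000],
})

# Strategy 1: fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# Strategy 2: fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 3: fill with the mean of samples in the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```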
Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions
Other data problems which require data cleaning:
• Duplicate records
• Incomplete data
• Inconsistent data
How to Handle Noisy Data?
Binning:
• First sort the data and partition it into (equal-frequency) bins.
• Then smooth by bin means, by bin medians, or by bin boundaries, etc. (see the sketch below).
Regression:
• Data smoothing can also be done by fitting regression functions (linear or multiple linear regression).
Clustering:
• Place data elements into groups of similar values (clusters); detect and remove outliers.
Combined computer and human inspection:
• Detect suspicious values automatically and have a human check them (e.g., deal with possible outliers).
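A small sketch of smoothing by equal-frequency bins using pandas (the price values are illustrative):

```python
import pandas as pd

# Illustrative sorted price data
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed_mean = prices.groupby(bins).transform("mean")

# Smoothing by bin boundaries: snap each value to the nearer bin edge
lo = prices.groupby(bins).transform("min")
hi = prices.groupby(bins).transform("max")
smoothed_bound = lo.where((prices - lo) <= (hi - prices), hi)

print(pd.DataFrame({"price": prices, "bin": bins,
                    "by_mean": smoothed_mean, "by_boundary": smoothed_bound}))
```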
Data Integration
Data integration: combines data from multiple sources into a coherent store.
Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons include different representations and different scales, e.g., metric vs. British units.
Redundancy: object identification, derivable data.
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases:
• Object identification: the same attribute or object may have different names in different databases.
• Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
χ² Correlation Test (Nominal Data)
The χ² (chi-square) test:

    χ² = Σ [ (Observed − Expected)² / Expected ]

The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Note that correlation does not imply causality:
◼ The number of hospitals and the number of car thefts in a city are correlated.
◼ Both are causally linked to a third variable: population.
Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)      200 (360)         450
Not like science fiction    50 (210)    1000 (840)        1050
Sum (col.)                 300          1200              1500

χ² (chi-square) calculation (numbers in parentheses are the expected counts, calculated from the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

This large value shows that like_science_fiction and play_chess are correlated in the group.
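The same calculation can be reproduced with SciPy as a quick check; `correction=False` disables Yates' continuity correction so that the result matches the hand calculation above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: like / don't like science fiction; columns: play / don't play chess
table = np.array([[250, 200],
                  [50, 1000]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
print(expected)  # [[ 90. 360.], [210. 840.]]
```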
Data Reduction Strategies
Why data reduction? Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume. It increases storage efficiency and improves performance (complex data analysis may take a very long time to run on the complete data set).
Data reduction strategies:
• Dimensionality reduction (remove unimportant attributes):
  • Wavelet transforms
  • Principal Components Analysis (PCA)
  • Feature subset selection, feature creation
• Numerosity reduction (some simply call it data reduction):
  • Regression and log-linear models
  • Histograms, clustering, sampling
  • Data cube aggregation
• Data compression
Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the principal components.
It is one of the popular tools for exploratory data analysis and predictive modeling. It draws strong patterns out of a dataset by keeping the directions of highest variance and discarding the low-variance directions.
PCA tries to find a lower-dimensional surface onto which to project the high-dimensional data.

[Figure: 2-D data on axes x1 and x2, with the first principal component along the direction of greatest variance.]
Steps of PCA (a NumPy sketch of all six steps follows):
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors of the covariance matrix.
Step 4: Sort the eigenvalues and their corresponding eigenvectors.
Step 5: Pick the top k eigenvalues and form a matrix of their eigenvectors.
Step 6: Transform the original matrix.
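A minimal NumPy sketch of the six steps (illustrative only; `X` is a randomly generated data matrix with rows as samples):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical data: 100 samples, 5 features
k = 2                                  # number of principal components to keep

# Step 1: standardize the dataset
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh suits symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort eigenvalues (and eigenvectors) in descending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: pick the top-k eigenvectors
W = eigvecs[:, :k]

# Step 6: project the (standardized) data onto the principal components
X_pca = X_std @ W                      # shape (100, k)
```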
Regression Analysis
◼ Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables.
◼ More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed.
◼ It predicts continuous/real values such as temperature, age, salary, house price, etc.
◼ The parameters are estimated so as to give a "best fit" of the data.
Regression is used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.

[Figure: scatter plot with regression line y = x + 1; the observed point (X1, Y1) has fitted value Y1′ on the line.]
Regression Analysis and Log-Linear Models
Linear regression: Y = wX + b
◼ The two regression coefficients, w and b, specify the line and are estimated from the data at hand.
◼ They are fitted using the least-squares criterion on the known values of Y1, Y2, … and X1, X2, … (a short least-squares sketch follows).
Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above.
Log-linear models:
◼ Approximate discrete multidimensional probability distributions.
◼ Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
◼ Useful for dimensionality reduction and data smoothing.
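A short sketch of fitting Y = wX + b by least squares on synthetic data (the true line and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # noisy samples of y = 2x + 1

# np.polyfit solves the least-squares problem for a degree-1 polynomial
w, b = np.polyfit(x, y, deg=1)
y_fit = w * x + b  # fitted values on the regression line
print(f"estimated w = {w:.2f}, b = {b:.2f}")  # should be close to 2 and 1
```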
Histogram Analysis

[Figure: histogram of prices; x-axis buckets from 10,000 to 100,000; y-axis counts from 0 to 40.]

Divide the data into buckets and store the average (sum) for each bucket.
Partitioning rules:
◼ Equal-width: equal bucket range.
◼ Equal-frequency (or equal-depth): each bucket contains roughly the same number of samples.
Clustering
Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter).
Can be very effective if the data is clustered, but not if the data is “smeared”.
Clustering can be hierarchical and can be stored in multi-dimensional index tree structures.
There are many choices of clustering definitions and clustering algorithms.
Sampling
Sampling: obtaining a small sample s to represent the whole data set N.
It allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data.
Key principle: choose a representative subset of the data.
◼ Simple random sampling may have very poor performance in the presence of skew.
◼ Adaptive sampling methods, e.g., stratified sampling, address this.
Note: sampling may not reduce database I/Os (data is read a page at a time).
Types of Sampling
• Simple random sampling: there is an equal probability of selecting any particular item.
• Sampling with replacement: once an object is selected, it is not removed from the population, so it may be drawn again.
• Sampling without replacement: a selected object is removed from the population.
• Stratified sampling: partition the data set and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data); used in conjunction with skewed data. A sketch of these schemes follows.
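A small sketch of these sampling schemes in pandas/NumPy (the data set and the `stratum` column are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame({
    "value":   rng.normal(size=1000),
    "stratum": rng.choice(["young", "middle", "senior"], size=1000, p=[0.5, 0.3, 0.2]),
})

# Simple random sampling without replacement (each row drawn at most once)
srswor = data.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (rows may repeat)
srswr = data.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each stratum
stratified = data.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["stratum"].value_counts())
```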
Sampling: With or Without Replacement

[Figure: raw data with samples drawn with and without replacement.]

Sampling: Cluster or Stratified Sampling

[Figure: raw data on the left and the corresponding cluster/stratified sample on the right.]
What Is Wavelet Transform?
Decomposes a signal into different frequency subbands.
◼ Applicable to n-dimensional signals.
Data are transformed to preserve the relative distance between objects at different levels of resolution.
Allows natural clusters to become more distinguishable.
Used for image compression.

Wavelet Transformation
Discrete wavelet transform (DWT) for linear signal processing and multi-resolution analysis (e.g., Haar-2 and Daubechies-4 wavelets).
Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
Method (a from-scratch sketch follows):
▪ The length, L, must be an integer power of 2 (pad with 0s when necessary).
▪ Each transform has 2 functions: smoothing and difference.
▪ It applies to pairs of data, resulting in two sets of data of length L/2.
▪ It applies the two functions recursively until reaching the desired length.
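As an illustration of the smoothing/difference method above, here is a from-scratch one-level Haar transform in NumPy (a sketch of the averaging variant, not a full multi-resolution pipeline):

```python
import numpy as np

def haar_dwt(signal):
    """One level of a Haar-style transform: pairwise smoothing and difference.

    Returns (approximation, detail), each half the input length.
    The input length must be even (pad with zeros beforehand if needed).
    """
    x = np.asarray(signal, dtype=float)
    pairs = x.reshape(-1, 2)
    approx = pairs.mean(axis=1)               # smoothing: pairwise averages
    detail = (pairs[:, 0] - pairs[:, 1]) / 2  # difference coefficients
    return approx, detail

x = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)  # length 8 = 2^3
a1, d1 = haar_dwt(x)   # length-4 approximation and detail
a2, d2 = haar_dwt(a1)  # apply recursively for a coarser resolution
print(a1, d1, a2, d2)
```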
Why Wavelet Transform?
• Uses hat-shaped filters: emphasizes regions where points cluster and suppresses weaker information at their boundaries.
• Effective removal of outliers: insensitive to noise and to input order.
• Multi-resolution: detects arbitrarily shaped clusters at different scales.
• Efficient: complexity O(N).
• Limitation: only applicable to low-dimensional data.
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values, i.e., each old value can be identified with one of the new values.
Data transformation is a process of converting data from one format or structure into another format or structure.
Methods:
• Smoothing: remove noise from data.
• Attribute/feature construction: new attributes constructed from the given ones.
• Aggregation: summarization, data cube construction.
• Normalization: scale values to fall within a smaller, specified range (min-max normalization, z-score normalization, normalization by decimal scaling).
• Discretization: concept hierarchy climbing.
Normalization
Min-max normalization: to [new_minA, new_maxA]

    v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

◼ Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

Z-score normalization (μA: mean, σA: standard deviation):

    v′ = (v − μA) / σA

◼ Ex. Let μ = 54,000 and σ = 16,000. Then
    (73,600 − 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

    v′ = v / 10^j,  where j is the smallest integer such that max(|v′|) < 1
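The three normalization formulas in a small NumPy sketch (the income array combines the slide's example values with made-up companions):

```python
import numpy as np

# The slide's example income values plus made-up companions
v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min
# 73,600 maps to 0.716, matching the slide

# Z-score normalization (the slide's example plugs in given mu = 54,000, sigma = 16,000)
mu, sigma = 54_000.0, 16_000.0
v_zscore = (v - mu) / sigma  # 73,600 maps to 1.225

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / 10**j  # here j = 5, so all values fall in (-1, 1)
```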
Discretization
Three types of attributes:
• Nominal: values from an unordered set, e.g., color, profession.
• Ordinal: values from an ordered set, e.g., military or academic rank.
• Numeric: real numbers, e.g., integer or real values.
Discretization: divide the range of a continuous attribute into intervals.
• Interval labels can then be used to replace the actual data values.
• Reduces data size.
• Prepares the data for further analysis, e.g., classification.
• Methods can be supervised vs. unsupervised, and split (top-down) vs. merge (bottom-up).
• Discretization can be performed recursively on an attribute.
Data Discretization Methods
All of the following methods can be applied recursively:
• Binning: unsupervised, top-down split.
• Histogram analysis: unsupervised, top-down split.
• Clustering analysis: unsupervised, top-down split or bottom-up merge.
• Decision-tree analysis: supervised, top-down split.
• Correlation (e.g., χ²) analysis: unsupervised, bottom-up merge.
Simple Discretization: Binning
Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size: a uniform grid.
• If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N.
• The most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately the same number of samples.
• Gives good data scaling.
• Managing categorical attributes can be tricky.
[Figure: equal-width vs. equal-depth binning of the same data; see the pandas sketch below.]
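Both partitioning rules are one-liners in pandas (the price values are illustrative):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of equal size, W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals, each holding roughly the same number of samples
equal_depth = pd.qcut(prices, q=3)

print(pd.DataFrame({"price": prices,
                    "equal_width": equal_width,
                    "equal_depth": equal_depth}))
```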
Challenges of ML
• Insufficient quantity of training data
• Nonrepresentative training data
• Poor-quality data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data
• Data mismatch
• Hyperparameter tuning and model selection
• Stepping back
• Testing and validating

BY PUNNA RAO
