3. Data Preprocessing
Prodi Informatika 2021
Anna Baita, M. Kom.
Fakultas Ilmu Komputer
Outline
SCPMK 1683903: The students can apply pre-processing techniques [CPMK39]
• Outline:
• What & Why preprocess the data?
• Data Cleaning
• Data Integration
• Data Transformation
• Data reduction
Data Preprocessing
It is a data mining technique that involves transforming raw data into an understandable format.
Why Preprocess the Data?
Data in the real world is:
✓ incomplete: lacking attribute values or certain attributes of interest
✓ noisy: containing errors or outliers
✓ inconsistent: lacking compatibility or similarity between two or more facts
No quality data, no quality mining results!
✓ Quality decisions must be based on quality data
✓ A data warehouse needs consistent integration of quality data
Measures of Data Quality
❑ Accuracy
❑ Completeness
❑ Consistency
❑ Timeliness
❑ Believability
❑ Value Added
❑ Interpretability
❑ Accessibility
Data Preprocessing Techniques
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
Data Cleaning
Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in real-world data.
Fill in the Missing Values
Data Cleaning- Missing Value
1. Ignore the Tuple
• Usually done when the class label is missing; not effective when the fraction of missing values is large.
Data Cleaning- Missing Value
2. Fill in the Missing Value Manually (feasible only for small data sets)
3. Use a Global Constant
ex: “-”, “unknown”
Data Cleaning- Missing Value
4. Use the Attribute Mean or Median
Example: replace each missing value with its attribute mean, Mean(X2) = 66.1, Mean(X4) = 0.22, Mean(Y) = 69.44 (see the sketch below).
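A minimal pandas sketch of mean imputation; the column names X2, X4, Y follow the example above, and the raw values are chosen only so that the column means match the slide:

import numpy as np
import pandas as pd

# Toy data with missing entries (NaN); values are illustrative,
# picked so the column means equal 66.1, 0.22, and 69.44
df = pd.DataFrame({
    "X2": [60.0, np.nan, 72.2],
    "X4": [0.20, 0.24, np.nan],
    "Y":  [np.nan, 70.0, 68.88],
})

# Replace each missing entry with the mean of its column
df_filled = df.fillna(df.mean())
print(df_filled)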
Data Cleaning- Missing Value
5. Use the Most Probable Value
Predict it using KNN, regression, a decision tree, etc. (see the sketch below)
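As one sketch of this idea, scikit-learn's KNNImputer fills each missing entry from the k nearest complete rows (the matrix below is made up):

import numpy as np
from sklearn.impute import KNNImputer

# Illustrative matrix with one missing entry (np.nan)
X = np.array([
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature
# over the 2 nearest neighbors (distance on the known features)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))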
Smooth Out Noise
Data Cleaning- Noisy
Noisy data: small random errors in the data.
Causes:
1. Faulty data-collection instruments
2. Data entry problems
3. Data transmission problems
4. Technology limitations
5. Inconsistent naming, e.g. “yogya” vs “jogja”
To handle noise, smoothing must be applied (taking the neighboring values into account).
Data Cleaning- Noisy Data
✓ Binning
✓ Clustering
✓ Combined Computer and Human Inspection
Detect suspicious values automatically, then have a human handle them
✓ Regression
Data Cleaning- Noisy Data
Binning
Binning is the process of grouping data into smaller parts, called bins, based on certain criteria.
Steps:
1. Sort the data
2. Partition the data into bins
3. Choose a smoothing technique:
- by mean
- by boundaries
Data Cleaning- Noisy Data
1. Sort the data:
70, 100, 150, 200, 250, 270, 300, 380, 400
2. Suppose the number of bins is 3:
Bin 1: 70, 100, 150
Bin 2: 200, 250, 270
Bin 3: 300, 380, 400
Data Cleaning- Noisy Data
Smoothing by Mean
Bin 1: 70, 100, 150 → 107, 107, 107
Bin 2: 200, 250, 270 → 240, 240, 240
Bin 3: 300, 380, 400 → 360, 360, 360
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
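The same steps in plain Python (a minimal sketch; equal-depth bins are assumed):

# Smoothing by bin means, following the worked example above
data = [200, 300, 100, 70, 150, 250, 270, 380, 400]

# Step 1: sort the data
data.sort()

# Step 2: partition into 3 equal-depth bins
n_bins = 3
size = len(data) // n_bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# Step 3: replace every value in a bin with the (rounded) bin mean
smoothed = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(smoothed)  # [[107, 107, 107], [240, 240, 240], [360, 360, 360]]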
Data Cleaning- Noisy Data
Smoothing by Boundaries
Bin 1: 70, 100, 150 → 70, 70, 150
Bin 2: 200, 250, 270 → 200, 270, 270
Bin 3: 300, 380, 400 → 300, 400, 400
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
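And the boundary variant, continuing the same sketch:

# Smoothing by bin boundaries: replace each value with the
# closest of the bin's minimum and maximum
bins = [[70, 100, 150], [200, 250, 270], [300, 380, 400]]

smoothed = []
for b in bins:
    lo, hi = min(b), max(b)
    smoothed.append([lo if v - lo <= hi - v else hi for v in b])

print(smoothed)  # [[70, 70, 150], [200, 270, 270], [300, 400, 400]]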
Data Cleaning- Noisy Data
Clustering
An outlier is a data point that deviates from the rest of the data; in statistics such points are called “outliers”.
Clustering organizes similar values into groups; values that fall outside the clusters (or into very small ones) are outlier candidates.
Outliers may be discarded or ignored; their number is generally small, only around 2% of the data.
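A minimal clustering-based sketch using scikit-learn's KMeans; the data and the 20% small-cluster threshold are assumptions for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Illustrative 1-D data; 90 deviates from the rest
X = np.array([[21.0], [22.0], [23.0], [25.0], [26.0], [90.0]])

# Cluster the data; points that end up in very small clusters
# are treated as outlier candidates
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels, counts = np.unique(km.labels_, return_counts=True)

small = labels[counts < 0.2 * len(X)]  # clusters holding < 20% of the data
print(X[np.isin(km.labels_, small)])   # -> [[90.]]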
Data Cleaning- Noisy Data
Regression: smooth the data by fitting it to a regression function, for example a linear regression line.
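A minimal sketch with NumPy: fit a straight line and replace each value with the fitted one (the series itself is made up):

import numpy as np

# Illustrative noisy series: a linear trend plus random noise
x = np.arange(10)
y = 3 * x + 5 + np.random.default_rng(0).normal(0, 2, 10)

# Fit y = slope * x + intercept, then use the fitted values
slope, intercept = np.polyfit(x, y, deg=1)
y_smooth = slope * x + intercept
print(np.round(y_smooth, 1))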
Correct Inconsistencies
Data Cleaning- Inconsistent Data
• Manually, using external references
• Knowledge engineering tools
Data Integration
Data integration means combining data from multiple sources into a coherent data store (e.g., a data warehouse).
Data Integration - Issue
• Entity identification problem
• Redundancy
• Tuple Duplication
• Detecting data value conflicts
Handling Redundant Data in Data Integration
• Redundant data often occur when integrating multiple databases:
- the same attribute may have different names in different databases
- one attribute may be a "derived" attribute in another table
• Redundant data may be detected by correlation analysis (see the sketch below)
• Careful integration of multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
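A minimal pandas sketch of such a correlation check; the column names and values are made up, with height_in simply being height_cm converted to inches:

import pandas as pd

# Two attributes that encode the same information under
# different names, plus a third, genuinely distinct attribute
df = pd.DataFrame({
    "height_cm": [160, 170, 180, 150, 175],
    "height_in": [63.0, 66.9, 70.9, 59.1, 68.9],
    "weight_kg": [55, 70, 80, 48, 72],
})

# A correlation coefficient close to +1 or -1 suggests
# one of the two attributes is redundant
print(df.corr())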
Data Integration
Data Source 1 vs. Data Source 2: what are the differences? Can the data be combined into one database?
Data Transformation
Transforming or consolidating data into forms suitable for mining is known as data transformation.
Smoothing
Aggregation
Generalization
Normalization
Attribute Construction
Data Transformation
Smoothing: remove noise from the data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale attribute values to fall within a small, specified range, e.g. [0.0, 1.0] (see the sketch below)
Attribute construction: construct new attributes from the given set of attributes
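A minimal sketch of min-max normalization to [0, 1], reusing the values from the binning example:

import numpy as np

# Min-max normalization: rescale values into the range [0, 1]
x = np.array([70.0, 100.0, 150.0, 200.0, 250.0, 270.0, 300.0, 380.0, 400.0])
x_norm = (x - x.min()) / (x.max() - x.min())
print(np.round(x_norm, 2))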
Data Reduction
Data reduction techniques are applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
Data Reduction- Strategies
• Data cube aggregation
• Dimensionality reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
Text Preprocessing?
Image Preprocessing?
Any Questions?