data-mining-notes
Data Processing
Data processing forms the foundation of successful data mining. It involves transforming
raw data into a format suitable for analysis. The process follows several key stages:
Data Pre-processing
Pre-processing is crucial as real-world data is often incomplete, noisy, and inconsistent.
The main forms include:
1. Data Cleaning
Handling Missing Values:
Ignore the tuple (record)
Fill manually
Use global constant
Use attribute mean/median
Use most probable value through prediction
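The simpler strategies above can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed implementation: the function name fill_missing, the use of None to mark a missing value, and the sample ages are all assumptions made for the example. Prediction-based filling is omitted because it needs a trained model.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean", constant=0):
    """Return a copy of `values` with None entries replaced.

    `None` as the missing-value marker is an assumption of this sketch.
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "constant":
        fill = constant  # the "global constant" option
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [25, None, 35, 40, None]  # hypothetical attribute values

filled_mean = fill_missing(ages, "mean")
filled_median = fill_missing(ages, "median")
filled_const = fill_missing(ages, "constant", constant=-1)

# "Ignore the tuple": drop records whose value is missing.
dropped = [v for v in ages if v is not None]
```

Dropping tuples is the simplest option but discards information, so it tends to be reasonable only when few records are affected.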
Dealing with Noisy Data through:
Binning:
Sort data and partition into equal-frequency (equal-depth) bins
Smooth by bin means, median, or boundaries
Clustering:
Group similar data points
Detect and remove outliers
Regression:
Fit data to a function
Smooth by predicting values
Human and Computer Inspection:
Combined approach using automated tools and expert knowledge
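As a worked illustration of the binning method listed above, the following sketch sorts values into equal-frequency bins and smooths them by bin means and by bin boundaries. The function names and the sample price list are assumptions chosen for the example; ties in boundary smoothing are resolved toward the lower boundary, which is one possible convention.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size`."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min or max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        # Ties go to the lower boundary (a convention of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these prices the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]; smoothing by means replaces them with 9, 22, and 29 respectively.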
2. Data Integration
Merging data from multiple sources
Resolving conflicts in attribute names
Handling redundancy
Ensuring consistent measurement units
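The integration steps above can be sketched with two small record sets. Everything named here is hypothetical: the sources (crm_rows, web_rows), the attribute names, and the rule that the first source wins when both hold a value for the same customer.

```python
# Two hypothetical sources describing the same kind of entity.
crm_rows = [
    {"cust_id": 1, "height_cm": 180.0},
    {"cust_id": 2, "height_cm": 165.0},
]
web_rows = [
    {"customer_id": 2, "height_in": 65.0},
    {"customer_id": 3, "height_in": 70.0},
]

RENAME = {"customer_id": "cust_id"}  # resolve the attribute-name conflict
INCH_TO_CM = 2.54                    # bring units in line with the CRM data

def normalize_web_row(row):
    """Rename attributes and convert inches to centimetres."""
    out = {RENAME.get(key, key): value for key, value in row.items()}
    out["height_cm"] = round(out.pop("height_in") * INCH_TO_CM, 1)
    return out

def integrate(primary, secondary):
    """Merge on cust_id; on redundant values the primary source is kept."""
    merged = {row["cust_id"]: dict(row) for row in primary}
    for row in secondary:
        target = merged.setdefault(row["cust_id"], {})
        for key, value in row.items():
            target.setdefault(key, value)  # only fills attributes not yet set
    return [merged[key] for key in sorted(merged)]

integrated = integrate(crm_rows, [normalize_web_row(r) for r in web_rows])
```

Customer 2 appears in both sources, so the merged record keeps the CRM height rather than storing the same fact twice.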
3. Data Transformation
Normalization: Scaling values to specific ranges
Aggregation: Summarizing data, e.g., combining daily sales into monthly totals
Feature construction: Creating new attributes
Smoothing: Removing noise from data
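Normalization is the transformation most often computed by hand, so a short sketch may help. It shows min-max scaling and z-score standardization; the function names are assumptions, and the z-score version uses the population standard deviation, which is one of two common conventions.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: nothing to scale
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

scaled = min_max([10, 20, 30])
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, since a single extreme value stretches the range; z-scores are less affected by that.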
Data Reduction
Data reduction techniques help manage large datasets by reducing volume while
maintaining integrity:
2. Dimensionality Reduction
Reducing the number of attributes (random variables) under consideration
Methods include:
Principal Component Analysis (PCA)
Feature selection
Feature extraction
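To make PCA concrete, the sketch below finds only the first principal component, using power iteration on the covariance matrix. This is a simplification chosen to stay within the standard library; full PCA is normally computed with an eigendecomposition or SVD from a numerical library. The function names and sample points are assumptions.

```python
import math

def first_principal_component(rows, iterations=50):
    """Return (column means, unit vector of the first principal component).

    Uses power iteration, which assumes the data has non-zero variance and
    that the start vector is not orthogonal to the dominant direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec

def project(rows, means, component):
    """Reduce each row to a single score along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # lie on y = 2x
means, component = first_principal_component(points)
scores = project(points, means, component)
```

Because these points lie exactly on a line, one component captures all of the variance and the two attributes reduce to one score per point without loss.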
3. Data Compression
Transforming data into compact representations
Lossless vs. lossy compression techniques
Trade-off between size and information preservation
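The difference between lossless and lossy compression can be shown with Python's zlib module. In this sketch the sensor readings are made up, and "lossy" is modeled simply as rounding to one decimal place before compressing, which is an assumption standing in for real lossy schemes such as wavelet transforms.

```python
import zlib

readings = [20.113, 20.127, 20.131, 20.109] * 250  # hypothetical sensor data
raw = ",".join(f"{r:.3f}" for r in readings).encode("ascii")

# Lossless: the original bytes are recovered exactly.
lossless = zlib.compress(raw, 9)
round_trip_ok = zlib.decompress(lossless) == raw

# Lossy (modeled as rounding): fewer bytes to store, but precision is gone.
lossy_raw = ",".join(f"{r:.1f}" for r in readings).encode("ascii")
lossy = zlib.compress(lossy_raw, 9)
restored = [float(x) for x in zlib.decompress(lossy).decode("ascii").split(",")]
max_error = max(abs(a - b) for a, b in zip(readings, restored))
```

The value max_error quantifies the information given up in exchange for the smaller representation, which is the trade-off named above.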
4. Numerosity Reduction
Storing reduced representations of data
Methods:
Parametric (regression, log-linear models)
Non-parametric (histograms, clustering, sampling)
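The following sketch contrasts one parametric and two non-parametric reductions. The parametric case stores only a fitted slope and intercept instead of the points; the non-parametric cases store bucket counts or a random sample. Function names, data, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter

def fit_line(xs, ys):
    """Least-squares line: keep (slope, intercept) instead of the raw points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def equal_width_histogram(values, width):
    """Map each bucket's lower bound to the number of values it holds."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

def simple_random_sample(values, k, seed=0):
    """Sample k values without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(values, k)

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

prices = [1, 1, 5, 8, 10, 12, 14, 15, 18, 21, 25, 30]  # hypothetical data
histogram = equal_width_histogram(prices, 10)
sample = simple_random_sample(prices, 4)
```

Here twelve prices shrink to four bucket counts, and four (x, y) points shrink to two parameters; the histogram keeps the distribution's shape but not the individual values.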
Decision Trees
Decision trees are powerful classification tools in data mining:
1. Structure
Root node: Starting point
Internal nodes: Test conditions
Branches: Outcomes of tests
Leaf nodes: Class labels
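The four structural parts can be mirrored directly in a small data structure. This is a sketch under assumptions: the Node class, its field names, and the weather-style attributes are all invented for illustration, and the tree is built by hand rather than learned.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    attribute: Optional[str] = None  # test condition (internal nodes only)
    branches: Dict[str, "Node"] = field(default_factory=dict)  # outcome -> child
    label: Optional[str] = None      # class label (leaf nodes only)

    def is_leaf(self) -> bool:
        return self.label is not None

def classify(node: Node, record: dict) -> str:
    """Follow branches from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.branches[record[node.attribute]]
    return node.label

# Root node tests "outlook"; each dict key is a branch.
tree = Node(attribute="outlook", branches={
    "sunny": Node(attribute="humidity", branches={
        "high": Node(label="no"),
        "normal": Node(label="yes"),
    }),
    "overcast": Node(label="yes"),
    "rain": Node(attribute="windy", branches={
        "true": Node(label="no"),
        "false": Node(label="yes"),
    }),
})

prediction = classify(tree, {"outlook": "sunny", "humidity": "high"})
```

Classifying a record is a single walk from root to leaf, which is also why the path can be read off as an if-then rule.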
2. Construction Process
Select best attribute for splitting
Create a branch for each outcome of the test (one per attribute value in ID3; binary splits in CART)
Repeat process recursively
Stop when meeting termination criteria
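The construction steps above can be sketched as a small ID3-style procedure that picks the attribute with the highest information gain. This is a simplified illustration: it handles categorical attributes only, has no pruning, and represents the tree as nested dicts, all of which are choices made for the example. The training rows are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:      # termination: node is pure
        return labels[0]
    if not attributes:             # termination: nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    # Select the best attribute for splitting.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # One branch per attribute value, built recursively.
    return {best: {
        value: build_tree([r for r in rows if r[best] == value],
                          remaining, target)
        for value in sorted({r[best] for r in rows})
    }}

rows = [  # hypothetical training data
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rain", "windy": "false", "play": "yes"},
    {"outlook": "rain", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
```

On this data "outlook" is chosen at the root because it leaves two of its three branches pure, while "windy" barely separates the classes.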
3. Advantages
Easy to understand and interpret
Handles both numerical and categorical data
Requires little data preparation
Can handle missing values (in some algorithms, e.g., C4.5 and CART)
4. Key Algorithms
ID3
C4.5
CART
Random Forest (an ensemble of many decision trees rather than a single-tree algorithm)
Exam Tips
1. Focus Areas
Understand the complete data preprocessing workflow
Know different types of data reduction techniques
Master decision tree concepts
Practice identifying scenarios for different cleaning methods
2. Common Question Types
Definition and explanation of key concepts
Comparing different techniques
Step-by-step problem solving
Real-world applications
3. Important Formulas/Calculations
Information gain for decision trees
Distance metrics for clustering
Normalization formulas
Sampling calculations
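For revision, the standard forms of the first three items are collected below. The notation is an assumption of this summary and may differ from a given textbook: D is a data set with m classes, p_i is the fraction of tuples in class i, A is an attribute that splits D into partitions D_1 to D_v, and x and y are n-dimensional points. Sampling calculations depend on the sampling scheme and are not listed.

```latex
% Entropy of data set D
H(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

% Information gain from splitting D on attribute A
\mathrm{Gain}(A) = H(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, H(D_j)

% Euclidean and Manhattan distance
d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
d_{\mathrm{Manhattan}}(x, y) = \sum_{k=1}^{n} |x_k - y_k|

% Min-max normalization of value v of attribute A
v' = \frac{v - \min_A}{\max_A - \min_A}
     \left(\mathit{new\_max}_A - \mathit{new\_min}_A\right) + \mathit{new\_min}_A

% z-score normalization
v' = \frac{v - \mu_A}{\sigma_A}
```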