Data transformation in data mining refers to converting raw data into suitable formats for pattern discovery, machine learning, and analytical algorithms. Since real-world datasets often contain inconsistencies, noise, skewed distributions and incompatible formats, transformation ensures that data becomes clean, standardized, and ready for mining tasks such as classification, clustering, association rule mining, and prediction.
These transformations help improve model accuracy, reduce computational overhead, and ensure meaningful interpretation of results. The methods commonly used in data transformation are discussed below:
1. Smoothing
Smoothing reduces noise or random variation in data to highlight important patterns or trends. It is especially useful in time-series and continuous numerical data.
Example: Suppose you have noisy data like this → [5, 7, 6, 20, 7, 8, 6].
Here, the value 20 is an outlier that makes the data look jagged. One simple smoothing method is replacing each value with the average of its neighbors to reduce sharp jumps. For example, replacing 20 with the average of its neighbors (6 + 7) / 2 = 6.5 gives smoother data:
[5, 7, 6, 6.5, 7, 8, 6]
Some of the common smoothing methods are:
- Moving averages
- Binning
- Regression-based smoothing
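As a quick illustration of the first method, here is a minimal moving-average sketch in Python applied to the toy series above (the 3-point window size is an arbitrary choice):

```python
import numpy as np

# Noisy series from the example above
values = np.array([5, 7, 6, 20, 7, 8, 6], dtype=float)

# 3-point moving average; mode="valid" keeps only fully covered windows
window = np.ones(3) / 3
smoothed = np.convolve(values, window, mode="valid")
print(smoothed.round(2))  # [ 6.   11.   11.   11.67  7.  ]
```

Note that a moving average dampens the spike at 20 by spreading it across neighboring windows; binning and regression-based smoothing handle such outliers differently.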
2. Aggregation
Aggregation combines data values to produce summary information. It can merge data from multiple sources or summarize large datasets.
Example: Sales data may be aggregated to compute monthly, quarterly, and annual totals.
January: 1000, February: 1200, March: 1300
Quarterly Sales = 1000 + 1200 + 1300 = 3500
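The same aggregation as a minimal pandas sketch (the column names are illustrative):

```python
import pandas as pd

# Monthly sales from the example above
sales = pd.DataFrame({
    "month": ["January", "February", "March"],
    "amount": [1000, 1200, 1300],
})

quarterly_total = sales["amount"].sum()
print(quarterly_total)  # 3500
```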
3. Discretization
It is the process of transforming continuous data into a set of small intervals. Many real-world datasets contain continuous attributes, yet many existing data mining frameworks cannot handle these attributes directly. Even when a data mining task can manage a continuous attribute, its efficiency can often be improved significantly by replacing the continuous values with discrete ones.
Example: Continuous age values: 22, 25, 37, 60.
Discretized into categories:
- 0–25 → "Young"
- 26–50 → "Middle-aged"
- 51+ → "Senior"
So: 22 → Young, 37 → Middle-aged, 60 → Senior
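The same discretization can be done with pandas' `pd.cut` (the bin edges are chosen to match the categories above):

```python
import pandas as pd

ages = pd.Series([22, 25, 37, 60])

# Intervals (0, 25], (25, 50], (50, inf) map to Young / Middle-aged / Senior
groups = pd.cut(ages,
                bins=[0, 25, 50, float("inf")],
                labels=["Young", "Middle-aged", "Senior"])
print(groups.tolist())  # ['Young', 'Young', 'Middle-aged', 'Senior']
```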
4. Attribute Construction
New attributes are constructed from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
Example: Original attributes:
- Height (in cm) = 175
- Weight (in kg) = 70
Constructed attribute:
BMI = Weight / (Height in meters)² = 70 / (1.75)² = 22.86
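A minimal sketch of constructing the BMI attribute (the column names are illustrative):

```python
import pandas as pd

people = pd.DataFrame({"height_cm": [175], "weight_kg": [70]})

# BMI = weight / (height in meters)^2
people["bmi"] = people["weight_kg"] / (people["height_cm"] / 100) ** 2
print(people["bmi"].round(2).iloc[0])  # 22.86
```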
5. Generalization
It converts low-level data attributes into high-level attributes using a concept hierarchy. For example, an age attribute initially in numerical form (22, 25) is converted into a categorical value (young, old). Likewise, categorical attributes such as house addresses may be generalized to higher-level concepts, such as town or country.
Example:
- Age 24 → Age group: Young
- City: San Francisco → State: California
This reduces complexity while retaining meaning.
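A toy sketch of climbing a concept hierarchy; the lookup table and age thresholds below are hypothetical stand-ins for a real hierarchy:

```python
# Hypothetical concept hierarchy: city -> state
city_to_state = {"San Francisco": "California", "Austin": "Texas"}

def age_group(age: int) -> str:
    # Same age hierarchy as the discretization example above
    if age <= 25:
        return "Young"
    if age <= 50:
        return "Middle-aged"
    return "Senior"

print(age_group(24))                   # Young
print(city_to_state["San Francisco"])  # California
```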
6. Normalization
Normalization scales numeric data into a uniform range to avoid bias arising from differing value magnitudes. It is common in clustering and in neural networks.
1. Min-Max Normalization: Transforms the original data linearly into a fixed range (usually 0 to 1):
v' = (v - min_A) / (max_A - min_A)
where:
- min_A and max_A are the minimum and maximum values of attribute A,
- v is the original value to be mapped into the new range,
- v' is the normalized value.
2. Z-Score Normalization: Scales values based on the mean and standard deviation of the attribute:
v' = (v - mean_A) / std_A
where mean_A and std_A are the mean and standard deviation of attribute A.
3. Decimal Scaling: Moves the decimal point based on the maximum absolute value:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
Example: Values from -99 to 99 → divide by 100 to bring every value into (-1, 1).
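All three normalizations in a minimal NumPy sketch (the sample values are arbitrary):

```python
import numpy as np

v = np.array([-99.0, -20.0, 0.0, 45.0, 99.0])

# Min-max: v' = (v - min) / (max - min), mapped into [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score: v' = (v - mean) / std
z_score = (v - v.mean()) / v.std()

# Decimal scaling: v' = v / 10^j with the smallest j giving max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / 10 ** j  # here j = 2, so every value is divided by 100
```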
7. Data Reduction
Data reduction transforms the dataset into a smaller representation that preserves essential information. This reduces storage and speeds up mining.
Techniques include:
- Dimensionality reduction, e.g., Principal Component Analysis (PCA)
- Sampling
- Wavelet transforms
Example: A dataset with 50 features may be reduced to 10 principal components using PCA.
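A minimal scikit-learn sketch of that 50-to-10 reduction (the data here is random, just to show the shapes):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a dataset with 200 samples and 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Project onto the 10 leading principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (200, 10)
```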
8. Encoding Techniques
Encoding converts categorical values into numerical formats required for machine learning algorithms.
Common encodings:
- One-Hot Encoding
- Label Encoding
- Binary Encoding
- Frequency Encoding
Example: Color: Red, Green, Blue. One-hot encoding → Red = [1,0,0], Green = [0,1,0], Blue = [0,0,1]
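The same one-hot encoding as a minimal pandas sketch (note that `get_dummies` orders the indicator columns alphabetically):

```python
import pandas as pd

colors = pd.DataFrame({"color": ["Red", "Green", "Blue"]})

# One 0/1 indicator column per category
one_hot = pd.get_dummies(colors["color"], dtype=int)
print(one_hot)
#    Blue  Green  Red
# 0     0      0    1
# 1     0      1    0
# 2     1      0    0
```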
9. Feature Scaling
Feature scaling standardizes values when multiple variables use different units or orders of magnitude.
Example: Height (cm) vs Salary (₹). Scaling ensures neither variable dominates the model.
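A minimal scikit-learn sketch; the height and salary numbers are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Height in cm and salary in ₹: wildly different magnitudes
X = np.array([[170.0, 400_000.0],
              [160.0, 650_000.0],
              [180.0, 500_000.0]])

# After scaling, each column has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # [0. 0.]
print(X_scaled.std(axis=0))            # [1. 1.]
```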
10. Data Integration
Combines data from different sources (databases, files, APIs) into a unified structure prior to mining.
Tasks involved:
- Handling schema mismatch
- Resolving naming conflicts
- Merging tables
This prevents duplication and inconsistency.
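A minimal pandas sketch of resolving a naming conflict and merging two hypothetical sources:

```python
import pandas as pd

# Two sources that name the same key differently (schema mismatch)
crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Lin"]})
billing = pd.DataFrame({"customer_id": [1, 2], "total": [250, 90]})

# Resolve the naming conflict, then merge into one unified table
billing = billing.rename(columns={"customer_id": "cust_id"})
unified = crm.merge(billing, on="cust_id", how="inner")
print(unified)
```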
11. Data Encoding for Text (Text Transformation)
Used in text mining and NLP tasks.
Techniques:
- Tokenization
- Stemming / Lemmatization
- TF-IDF
- Word embeddings
Transforms text into machine-readable numeric vectors.
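A minimal TF-IDF sketch with scikit-learn (the two documents are toy examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data mining finds patterns in data",
        "text mining transforms text into vectors"]

# Tokenization and TF-IDF weighting in one step
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                             # (2, vocabulary size)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
```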
12. Data Binarization
Converts numeric data into binary values based on a threshold.
Example:
- Age > 18 → 1
- Age ≤ 18 → 0
Useful in classification models.
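The age threshold above as a short NumPy sketch:

```python
import numpy as np

ages = np.array([15, 18, 21, 40])

# 1 if age > 18, else 0 (matching the threshold above)
is_adult = (ages > 18).astype(int)
print(is_adult)  # [0 0 1 1]
```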
13. Data Scaling and Standardization for Outliers
For skewed data distributions or data dominated by extreme values, variance-stabilizing transformations help:
- Log transformation
- Square root transformation
- Reciprocal transformation
Example: Income data typically has a heavy right skew → a log transformation makes the distribution much closer to symmetric.
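A minimal sketch of a log transformation on made-up right-skewed incomes (`log1p` is used so zero values would stay defined):

```python
import numpy as np

# Made-up right-skewed income values
income = np.array([20_000, 35_000, 50_000, 120_000, 900_000], dtype=float)

# log1p(x) = log(1 + x); compresses the long right tail
log_income = np.log1p(income)
print(log_income.round(2))  # [ 9.9  10.46 10.82 11.7  13.71]
```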