Data Quality
Noisy data: data containing errors or outlier values that deviate from
the expected.
e.g., Salary=“-10”
Noisy data (incorrect values) may come from:
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then smooth each bin by its mean, by its median, or by its
boundaries (replace each value with the nearer bin boundary), etc.
• Regression
– smooth by fitting the data to a regression function
• Clustering
– detect and remove outliers
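
The binning steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample price list are made up for the example, and ties in boundary smoothing are assumed to go to the lower boundary.

```python
from statistics import mean, median


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    if not 1 <= n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's min and max.

    Assumption: a value equally close to both goes to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative data only (hypothetical prices).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

With three equal-frequency bins, the sample data splits into (4, 8, 15), (21, 21, 24) and (25, 28, 34). The bin means are 9, 22 and 29, and boundary smoothing turns the first bin into (4, 4, 15).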