Data Mining and Data Profiling - Nargis Hamid Monami
Background
Data mining and data profiling are both essential to modern data management. Data mining is a
valuable business intelligence tool for knowledge discovery, while data processing and analysis cannot
be done reliably without data profiling. The two are often confused, but they are not the same. Data
mining has been in use for quite some time, whereas data profiling is a relatively newer and less
common practice. In this report, I’ll discuss the differences between data mining and data profiling.
Data Mining
The main objective of data mining is to discover new, previously unknown information in a large
database. It is the process of identifying patterns in that data. For example, a company may have a
database full of data, but not all of it is useful to the organization. In that case, data mining is used to
find valid, novel and potentially useful patterns and trends in the data, which helps in solving problems
and making decisions. One of its objectives is to generate new market opportunities.
The common techniques of data mining are association learning, clustering, classification, prediction,
sequential patterns, regression and more.
Association learning is also called the relation technique. It is one of the most commonly used
techniques. Here, the relationships between items that occur together are used to identify patterns.
As the name suggests, the classification technique assigns items or variables in a data set to
predefined groups or classes. It mainly uses linear programming and statistics.
Clustering groups objects with similar characteristics into classes. Unlike classification, the
classes are not predefined but are derived from the data.
As the name suggests, the prediction technique uses the relationship between independent and
dependent variables in the data to predict unknown values.
The sequential patterns technique is used to identify similar trends, patterns, and events that
recur in the data over a period of time.
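To make the association technique more concrete, the following is a minimal sketch in Python. The transactions, the item names and the helper functions `pair_support` and `confidence` are all hypothetical and chosen only for illustration; real projects would normally rely on a dedicated algorithm such as Apriori rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def pair_support(transactions):
    """Count how many transactions contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(transactions, antecedent, consequent):
    """Share of transactions with `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


pair_counts = pair_support(transactions)
most_common_pair, count = pair_counts.most_common(1)[0]
support = count / len(transactions)

print(most_common_pair, support)                    # ('bread', 'butter') 0.6
print(confidence(transactions, "bread", "butter"))  # 0.75
```

In this made-up data, bread and butter appear together in three of five transactions (support 0.6), and three of the four baskets that contain bread also contain butter (confidence 0.75). This is the kind of relationship between items that association learning looks for.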
Data mining attracts all kinds of companies because it provides valuable information that helps them
stay competitive. It is popular and successful in many areas, and database marketing and credit card
fraud detection are among the most popular. For example:
In response modeling: By using demographic, geographic and lifestyle data, data mining predicts
which prospects are likely to buy.
In cross-selling: By observing purchase patterns and frequently purchased items, data mining helps to
increase sales by offering relevant products and services to existing customers.
In customer retention: By analyzing customers’ buying habits, as well as competitors’ policies, data
mining helps in making strategies for customer retention.
In segmentation and profiling: Through classification and clustering, data mining helps in segmenting
and profiling customers.
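As an illustration of how clustering can support customer segmentation, here is a small sketch of k-means on one-dimensional data. The spending figures, the function name `kmeans_1d` and the choice of starting centres are assumptions made for this example only; they do not come from the source article.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for one-dimensional data.

    `centroids` is the list of starting centres.
    Returns the final centroids and the cluster label of each value.
    """
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# Hypothetical yearly spend per customer (arbitrary currency units).
yearly_spend = [120, 150, 130, 900, 950, 1000, 140, 980]

# Assumption: start with the smallest and largest value as the two centres.
centroids, labels = kmeans_1d(
    yearly_spend, centroids=[min(yearly_spend), max(yearly_spend)]
)

print(centroids)  # [135.0, 957.5]
print(labels)     # [0, 0, 0, 1, 1, 1, 0, 1]
```

The two groups (low spenders and high spenders) are not defined in advance; they emerge from the data, which is what separates clustering from classification.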
Data mining is valuable in almost any industry or business sector, with applications such as fraud
detection, loan approval, investment analysis, portfolio trading, and others.
Data Profiling
Data profiling is also called data archaeology. It assesses the data itself. Normally it deals with data
quality, but it also helps to evaluate data sets for consistency. Profiling tools evaluate the actual content,
structure and quality of the data by exploring relationships between value collections in data sets. In
short, data profiling derives information about data quality in order to discover anomalies in the dataset.
It is commonly described in terms of three types of discovery:
Structure discovery, or structure analysis, makes sure that the data is consistent and
formatted correctly.
Content discovery helps in identifying null values or values that are incorrect or ambiguous.
Relationship discovery analyzes metadata to find how data sets relate to each other, such as key
relationships between tables, and narrows down the data so that it does not overlap.
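The three kinds of discovery can be sketched with a few simple checks in Python. Everything in this example is hypothetical: the rows, the column names, the expected date format and the helper functions are invented for illustration, and dedicated profiling tools perform far more extensive checks.

```python
import re

# Hypothetical source rows; column names and values are invented.
customers = [
    {"customer_id": "C001", "email": "ana@example.com", "signup_date": "2020-07-07"},
    {"customer_id": "C002", "email": "", "signup_date": "07/08/2020"},
    {"customer_id": "C003", "email": None, "signup_date": "2020-07-09"},
    {"customer_id": "C004", "email": "raj@example.com", "signup_date": "2020-07-10"},
]
orders = [
    {"order_id": 1, "customer_id": "C001"},
    {"order_id": 2, "customer_id": "C009"},
]

# Assumption: dates are expected in YYYY-MM-DD form.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def null_count(rows, column):
    """Content discovery: count missing values (None or empty string)."""
    return sum(1 for r in rows if r.get(column) in (None, ""))


def format_violations(rows, column, pattern):
    """Structure discovery: values that do not match the expected format."""
    return [
        r[column]
        for r in rows
        if r.get(column) not in (None, "") and not pattern.fullmatch(r[column])
    ]


def orphan_keys(child_rows, child_column, parent_rows, parent_column):
    """Relationship discovery: child values with no matching parent value."""
    parents = {r[parent_column] for r in parent_rows}
    return sorted({r[child_column] for r in child_rows} - parents)


print(null_count(customers, "email"))                          # 2
print(format_violations(customers, "signup_date", ISO_DATE))   # ['07/08/2020']
print(orphan_keys(orders, "customer_id", customers, "customer_id"))  # ['C009']
```

Each function returns the anomalies rather than fixing them, which matches the role of profiling as an assessment step that comes before any cleaning.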
Because data profiling is the process of reviewing source data, it is in demand in industry. For
example:
In data warehouse and business intelligence projects, data profiling can uncover data quality issues in
data sources.
In data conversion and migration projects, data profiling can uncover new requirements for the target
system.
Some best practices of data profiling:
Distinct count and percent: identifies natural keys and the distinct values in each column, which
can help in processing inserts and updates.
Minimum / maximum / average string length: helps select appropriate data types and sizes in the
target database.
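The two measures above are simple to compute. The sketch below is one possible way to do it in plain Python; the function name `profile_column` and the sample columns are hypothetical and used only to show the idea.

```python
def profile_column(values):
    """Return basic profiling statistics for one column of values."""
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    lengths = [len(str(v)) for v in non_null]
    return {
        "count": len(values),
        "distinct_count": distinct,
        # Percentage of rows that hold a distinct value.
        "distinct_percent": 100.0 * distinct / len(values) if values else 0.0,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else None,
    }


# Hypothetical columns from a source table.
order_ids = ["A100", "A101", "A102", "A103"]
cities = ["Dhaka", "Dhaka", "Sylhet", "Dhaka"]

print(profile_column(order_ids)["distinct_percent"])  # 100.0
print(profile_column(cities)["distinct_percent"])     # 50.0
print(profile_column(cities)["max_length"])           # 6
```

A column that is 100 percent distinct, like `order_ids` here, is a candidate natural key, while the maximum string length suggests how wide the corresponding column in the target database needs to be.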
Data Profiling Tools: Quadient DataCleaner, Aggregate Profiler, Talend Open Studio, Data Profiling in
Informatica, Oracle Enterprise Data Quality, etc.
Conclusion
After a brief analysis of the two concepts, it can be said that some data mining techniques are also
used for data profiling, so data profiling can be seen as a part of data mining. Nevertheless, data mining
is a much broader concept than data profiling. Data mining is driven by the need to analyse massive
volumes of data, and data profiling adds value to that analysis.
References:
Deoras, S. (2020, July 7). Data Mining Vs Data Profiling: What Makes Them Different. Analytics India
Magazine. analyticsindiamag.com/data-mining-vs-data-profiling-what-makes-them-different/