Data Mining Lecture One - Docx1
Confronted with huge collections of data, organisations now need new tools to help them make
better managerial choices. These needs are automatic summarization of data, extraction of the
"essence" of the information stored, and the discovery of patterns in raw data. Data mining is a
powerful technology with great potential to help companies focus on the most important
information in their data warehouses.
It has been defined as: The automated analysis of large or complex data sets in order to discover
significant patterns or trends that would otherwise go unrecognised.
Data mining refers to extracting or mining knowledge from large amounts of data.
Data mining is ready for application because it is supported by three technologies that are now
sufficiently mature:
The key to understanding the different facets of data mining is to distinguish between data
mining applications, operations, techniques and algorithms.
Applications
Database marketing
customer segmentation
customer retention
fraud detection
credit checking
web site analysis
Operations
Neural networks
decision trees
K-nearest neighbour algorithms
naive Bayesian
cluster analysis
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to
the nontrivial extraction of implicit, previously unknown and potentially useful information
from data in databases.
While data mining and knowledge discovery in databases (or KDD) are frequently treated as
synonyms, data mining is actually one step in the knowledge discovery process.
The Knowledge Discovery in Databases process comprises several steps leading from raw
data collections to some form of new knowledge.
Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant
data are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined in a common source.
Data selection: at this step, the data relevant to the analysis is decided on and retrieved from
the data collection.
Data transformation: also known as data consolidation, it is a phase in which the selected
data is transformed into forms appropriate for the mining procedure.
Data mining: it is the crucial step in which clever techniques are applied to extract
potentially useful patterns.
Pattern evaluation: in this step, truly interesting patterns representing knowledge are
identified based on given measures.
Knowledge representation: the final phase, in which the discovered knowledge is visually
represented to the user. This essential step uses visualization techniques to help users
understand and interpret the data mining results.
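The steps above can be sketched as a chain of small functions. This is only a toy illustration, not a real KDD system: the two data sources, the field names, the "mean scaled amount per region" pattern and the 0.5 threshold are all invented for the example.

```python
from collections import defaultdict

# Hypothetical toy sources; the field names are invented for illustration.
store_sales = [
    {"customer": "a", "region": "north", "amount": 120.0},
    {"customer": "b", "region": "north", "amount": None},  # noisy record
    {"customer": "c", "region": "south", "amount": 40.0},
]
web_sales = [
    {"customer": "d", "region": "south", "amount": 60.0},
    {"customer": "e", "region": "north", "amount": 80.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: rescale amounts to the range [0, 1]."""
    top = max(r["amount"] for r in records)
    return [{**r, "amount": r["amount"] / top} for r in records]

def mine(records):
    """Data mining: here, simply the mean scaled amount per region."""
    groups = defaultdict(list)
    for r in records:
        groups[r["region"]].append(r["amount"])
    return {region: sum(v) / len(v) for region, v in groups.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep patterns that meet an interest threshold."""
    return {k: v for k, v in patterns.items() if v >= threshold}

def represent(patterns):
    """Knowledge representation: a plain-text report for the user."""
    return [f"{region}: {score:.2f}" for region, score in sorted(patterns.items())]

data = integrate(clean(store_sales), clean(web_sales))
data = transform(select(data, ["region", "amount"]))
patterns = evaluate(mine(data), threshold=0.5)
report = represent(patterns)
print(report)
```

In practice each step is far more involved, and the steps are often revisited several times rather than run once in sequence.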
KDD Process
Six common classes of Data mining tasks
Anomaly detection (Outlier/change/deviation detection) – The identification of
unusual data records that might be interesting, or of data errors that require further
investigation.
Association rule learning (Dependency modelling) – Searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can determine
which products are frequently bought together and use this information for marketing
purposes. This is sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in
some way or another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
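Regression – attempts to find a function which models the data with the least error.
Summarization – provides a more compact representation of the data set, including
visualization and report generation.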
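Two of the tasks listed above can be illustrated together in a few lines. The sketch below fits a straight line by ordinary least squares (regression) and then flags new records that lie far from that line (one simple form of anomaly detection). The data values, the function names and the tolerance of 1.0 are assumptions made for the example; real applications would choose the model and threshold from the data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def flag_anomalies(points, slope, intercept, tolerance):
    """Return the points whose distance from the fitted line exceeds tolerance."""
    return [
        (x, y) for x, y in points
        if abs(y - (slope * x + intercept)) > tolerance
    ]

# Hypothetical historical data lying on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# New records: the second one is far from the fitted line.
new_points = [(5.0, 11.2), (6.0, 25.0)]
unusual = flag_anomalies(new_points, slope, intercept, tolerance=1.0)
print(slope, intercept, unusual)
```

Whether a flagged record is an interesting event or a data error still has to be decided by further investigation, as the definition above notes.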
Association rule mining is a popular and well-researched method for
discovering interesting relations between variables in large databases. It is
intended to identify strong rules discovered in databases using different
measures of interestingness.
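Two standard measures of interestingness are support (the fraction of transactions containing an itemset) and confidence (of the transactions containing the left-hand side of a rule, the fraction that also contain the right-hand side). The sketch below computes both for single-item rules over a small, made-up set of shopping baskets. The baskets and the thresholds are assumptions for illustration, and the brute-force search is not an efficient algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def strong_rules(transactions, min_support, min_confidence):
    """Brute-force search for rules lhs -> rhs between single items."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules

rules = strong_rules(transactions, min_support=0.5, min_confidence=0.7)
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={c:.2f}")
```

With these baskets, only the rule "butter -> bread" passes both thresholds: butter and bread appear together in half of the baskets, and every basket with butter also contains bread.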