Data Mining Lecture One - Docx1

Data mining Lecture One

Confronted with huge collections of data, we now have new needs that must be met in order to
make better managerial choices: automatic summarization of data, extraction of the "essence" of
the stored information, and the discovery of patterns in raw data. Data mining is a powerful
technology with great potential to help companies focus on the most important information in
their data warehouses.

It has been defined as the automated analysis of large or complex data sets in order to discover
significant patterns or trends that would otherwise go unrecognised.

What is data mining?

Data mining refers to extracting or mining knowledge from large amounts of data.

Data mining is ready for application because it is supported by three technologies that are now
sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

The key to understanding the different facets of data mining is to distinguish between data
mining applications, operations, techniques and algorithms.

Applications

• Database marketing
• Customer segmentation
• Customer retention
• Fraud detection
• Credit checking
• Web site analysis

Operations

• Classification and prediction
• Clustering
• Association analysis
• Forecasting

Techniques

• Neural networks
• Decision trees
• K-nearest neighbour algorithms
• Naive Bayesian
• Cluster analysis

Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to
the nontrivial extraction of implicit, previously unknown and potentially useful information
from data in databases.

While data mining and knowledge discovery in databases (or KDD) are frequently treated as
synonyms, data mining is actually one step of the knowledge discovery process.

The Knowledge Discovery in Databases process comprises several steps leading from raw
data collections to some form of new knowledge.

The iterative process consists of the following steps:

Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant
data are removed from the collection.

Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined into a common source.

Data selection: at this step, the data relevant to the analysis is decided on and retrieved from
the data collection.

Data transformation: also known as data consolidation, it is a phase in which the selected
data is transformed into forms appropriate for the mining procedure.

Data mining: it is the crucial step in which clever techniques are applied to extract
potentially useful patterns.

Pattern evaluation: in this step, the truly interesting patterns representing knowledge are
identified based on given measures.

Knowledge representation: the final phase, in which the discovered knowledge is visually
represented to the user. This essential step uses visualization techniques to help users
understand and interpret the data mining results.
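
The steps above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the two record sources, the function names and the minimum count of 2 are all hypothetical, and the "mining" step is reduced to simple frequency counting rather than any particular algorithm from the lecture.

```python
from collections import Counter

# Two hypothetical sources of (customer, item) purchase records.
# None marks a missing value; "Milk " is deliberately inconsistent.
store_a = [("ann", "bread"), ("ann", "milk"), ("bob", None), ("bob", "bread")]
store_b = [("cat", "Milk "), ("cat", "bread"), (None, "eggs"), ("dan", "milk")]


def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(field is not None for field in r)]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def select(records):
    """Data selection: keep only the field relevant to this analysis (the item)."""
    return [item for _customer, item in records]


def transform(items):
    """Data transformation: normalise item names into one consistent form."""
    return [item.strip().lower() for item in items]


def mine(items):
    """Data mining: here simply count how often each item occurs."""
    return Counter(items)


def evaluate(counts, min_count=2):
    """Pattern evaluation: keep only items that reach a minimum count."""
    return {item: n for item, n in counts.items() if n >= min_count}


def present(patterns):
    """Knowledge representation: render the result as a text bar chart."""
    return [f"{item:<6} {'#' * n}" for item, n in sorted(patterns.items())]


cleaned = [clean(source) for source in (store_a, store_b)]
patterns = evaluate(mine(transform(select(integrate(*cleaned)))))
for line in present(patterns):
    print(line)
```

Because the process is iterative, in practice the output of the evaluation step would often send the analyst back to an earlier step, for example to clean or select the data differently.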

Figure: KDD process

Six common classes of data mining tasks

Anomaly detection (outlier/change/deviation detection) – the identification of
unusual data records that might be interesting, or of data errors that require further
investigation.

Association rule learning (dependency modelling) – searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can determine
which products are frequently bought together and use this information for marketing
purposes. This is sometimes referred to as market basket analysis.

Clustering – the task of discovering groups and structures in the data that are in
some way "similar", without using known structures in the data.

Classification – the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".

Regression – attempts to find a function that models the data with the least error.

Summarization – provides a more compact representation of the data set, including
visualization and report generation.
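
Four of the tasks above can be illustrated with a few lines of Python each. The sketch below is not taken from the lecture: the function names, the toy data and the threshold of two standard deviations are assumptions, and each function shows just one common way of approaching its task (a z-score rule, one-dimensional k-means, k-nearest neighbours, and a least-squares line).

```python
import math
from collections import Counter
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Anomaly detection: values more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def kmeans_1d(values, centres, iterations=10):
    """Clustering: group numbers around the nearest centre, then re-centre."""
    centres = list(centres)
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            idx = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[idx].append(v)
        # An empty group keeps its previous centre.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return groups


def knn_classify(train, point, k=3):
    """Classification: majority label among the k nearest labelled examples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


def fit_line(xs, ys):
    """Regression: least-squares (slope, intercept) for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical e-mail features: (links per message, exclamation marks per message).
emails = [
    ((1.0, 1.0), "legitimate"), ((1.2, 0.8), "legitimate"), ((0.9, 1.1), "legitimate"),
    ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam"),
]

print(find_anomalies([10, 11, 9, 10, 12, 10, 11, 9, 10, 50]))  # [50]
print(kmeans_1d([1, 2, 3, 10, 11, 12], centres=[0, 5]))        # [[1, 2, 3], [10, 11, 12]]
print(knn_classify(emails, (4.9, 5.0)))                        # spam
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))                    # (2.0, 1.0)
```

Note the difference between the two middle functions: the clustering function receives no labels and finds the groups itself, while the classifier generalizes from examples whose labels are already known.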

Association rule mining is a popular and well researched method for
discovering interesting relations between variables in large databases. It is
intended to identify strong rules discovered in databases using different
measures of interestingness.
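
As a rough sketch of what "strong rules" and "measures of interestingness" can mean in practice, the Python below computes three widely used measures (support, confidence and lift) for single-item rules over a handful of shopping baskets. The lecture does not say which measures it has in mind, so the choice of these three, the toy transactions and the thresholds of 0.3 and 0.7 are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return count(set(lhs) | set(rhs), transactions) / count(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence relative to how common rhs is overall (above 1: positive link)."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


def strong_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Single-item rules lhs -> rhs that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        if support({a, b}, transactions) < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            if confidence({lhs}, {rhs}, transactions) >= min_confidence:
                rules.append((lhs, rhs))
    return rules


for lhs, rhs in strong_rules(transactions):
    print(
        f"{lhs} -> {rhs}: "
        f"support={support({lhs, rhs}, transactions):.2f} "
        f"confidence={confidence({lhs}, {rhs}, transactions):.2f} "
        f"lift={lift({lhs}, {rhs}, transactions):.2f}"
    )
```

On this toy data the rule butter -> bread has the highest confidence, since every basket with butter also contains bread. A full association rule miner such as Apriori would also consider rules with several items on each side; this sketch checks item pairs only.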
