Unit 3 Data Mining PDF
Introduction
Data Mining is the process of discovering patterns in large data sets involving methods
at the intersection of machine learning, statistics, and database systems. Data mining is
an interdisciplinary subfield of computer science and statistics with an overall goal to extract
information from a data set and transform the information into a comprehensible structure for
further use. Data mining is the analysis step of the "knowledge discovery in databases"
process or KDD. Aside from the raw analysis step, it also involves database and data
management aspects, data, model and inference considerations, interestingness metrics,
complexity considerations, post-processing of discovered structures, visualization, and online
updating.
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data
Mining techniques. It is a field of interest to researchers in various fields, including artificial
intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition
for expert systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.
KDD process
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from
the collection.
• Cleaning in case of Missing values.
• Cleaning noisy data, where noise is a random or variance error.
• Cleaning with Data discrepancy detection and Data transformation tools.
2. Data Integration: Data integration is defined as heterogeneous data from multiple
sources combined into a common source (Data Warehouse).
• Data integration using Data Migration tools.
• Data integration using Data Synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the data collection.
• Data selection using Neural network.
• Data selection using Decision Trees.
• Data selection using Naive bayes.
• Data selection using Clustering, Regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming
data into the appropriate form required by the mining procedure.
Data Transformation is a two step process:
• Data Mapping: Assigning elements from source base to destination to capture
transformations.
• Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques to
extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model, using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns
representing knowledge, based on given interestingness measures.
• Find interestingness score of each pattern.
• Uses summarization and Visualization to make data understandable by user.
7. Knowledge Representation: Knowledge representation is defined as a technique that
utilizes visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
Note:
• KDD is an iterative process in which evaluation measures can be enhanced, mining can be
refined, and new data can be integrated and transformed in order to get different and more
appropriate results.
• Preprocessing of databases consists of Data cleaning and Data Integration.
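The seven KDD steps above can be sketched end to end on a toy record set. All records, field names, and thresholds below are made up for illustration; a real pipeline would use proper tools at each step.

```python
# A toy end-to-end run of the KDD steps.
raw_a = [{"id": 1, "age": 25}, {"id": 2, "age": None},
         {"id": 3, "age": 45}, {"id": 4, "age": 230}]
raw_b = [{"id": 1, "city": "Pune"}, {"id": 3, "city": "Delhi"}]

# 1. Data Cleaning: drop records with missing or implausibly noisy ages.
cleaned = [r for r in raw_a if r["age"] is not None and 0 < r["age"] < 120]

# 2. Data Integration: merge the two sources on the common "id" key.
by_id = {r["id"]: dict(r) for r in cleaned}
for r in raw_b:
    if r["id"] in by_id:
        by_id[r["id"]].update(r)

# 3. Data Selection: retrieve only the attribute relevant to the analysis.
selected = [r["age"] for r in by_id.values()]

# 4. Data Transformation: scale ages into [0, 1] for the mining step.
max_age = max(selected)
transformed = [a / max_age for a in selected]

# 5. Data Mining: extract a (deliberately trivial) pattern.
pattern = sum(transformed) / len(transformed)

# 6-7. Pattern Evaluation / Knowledge representation: report the result.
print(f"mean scaled age over {len(transformed)} records: {pattern:.2f}")
```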
Kinds of Data
The following are the different sources of data that are used in the data mining process.
1. Flat Files
o Flat files are defined as data files in text or binary form with a structure that
can be easily extracted by data mining algorithms.
o Data stored in flat files have no relationships or paths among themselves; if a
relational database is stored in flat files, there will be no relations between
the tables.
2. Relational Databases
3. Data Warehouse
o A data warehouse is defined as a collection of data integrated from multiple
sources that can be queried and used in decision making.
4. Transactional Databases
o Highly flexible system where users can modify information without changing any
sensitive information.
5. Multimedia Databases
6. Spatial Database
o Stores geographical information.
o Stores the data in the form of coordinates, topology, lines, polygons, etc.
o Application: Maps, Global positioning, etc.
7. Time-series Databases
o A time-series database contains time-dependent data such as stock exchange data
and user-logged activities.
o Handles arrays of numbers indexed by time, date, etc.
o It requires real-time analysis.
o Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
o WWW refers to the World Wide Web, a collection of documents and resources like
audio, video, text, etc., which are identified by Uniform Resource Locators
(URLs) through web browsers, linked by HTML pages, and accessible via the
Internet.
Data Mining Applications
Listed below are the various areas of the market where data mining is used −
• Customer Profiling − Data mining helps determine what kind of people buy what kind
of products.
• Identifying Customer Requirements − Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may attract
new customers.
• Cross Market Analysis − Data mining performs Association/correlations between
product sales.
• Target Marketing − Data mining helps to find clusters of model customers who share
the same characteristics such as interests, spending habits, income, etc.
• Determining Customer Purchasing Patterns − Data mining helps in determining
customer purchasing patterns.
• Providing Summary Information − Data mining provides us various multidimensional
summary reports.
Financial Data Analysis
The financial data in banking and financial industry is generally reliable and of high quality
which facilitates systematic data analysis and data mining. Some of the typical cases are as
follows −
• Design and construction of data warehouses for multidimensional data analysis and data
mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because retail collects large amounts
of data on sales, customer purchasing history, goods transportation, consumption, and services.
It is natural that the quantity of data collected will continue to expand rapidly because of the
increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends,
which leads to improved quality of customer service and good customer retention and
satisfaction. Here is a list of examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web data
transmission. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has become very
important in helping to understand the business.
Data mining in the telecommunication industry helps in identifying telecommunication patterns,
catching fraudulent activities, making better use of resources, and improving quality of service.
Here is a list of examples for which data mining improves telecommunication services −
• Multidimensional Analysis of Telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.
Intrusion Detection
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the availability
of network resources. In this world of connectivity, security has become the major issue. With
increased usage of internet and availability of the tools and tricks for intruding and attacking
network prompted intrusion detection to become a critical component of network
administration. Here is the list of areas in which data mining technology may be applied for
intrusion detection −
• Development of data mining algorithm for intrusion detection.
• Association and correlation analysis, aggregation to help select and build discriminating
attributes.
• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.
Kinds of Patterns
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of
patterns to be mined, there are two categories of functions involved in data mining −
• Descriptive
• Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the list
of descriptive functions −
• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a
company, the classes of items for sale include computers and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived in the following two
ways −
• Data Characterization − This refers to summarizing data of the class under study. The
class under study is called the Target Class.
• Data Discrimination − This refers to the mapping or classification of a class with some
predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. The kinds of
frequent patterns include −
• Frequent Item Set − A set of items that frequently appear together, for example, milk
and bread.
• Frequent Subsequence − A sequence of patterns that occur frequently, such as purchasing
a camera followed by a memory card.
• Frequent Sub Structure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item-sets or subsequences.
Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased
together. This refers to the process of uncovering relationships among data and determining
association rules.
For example, a retailer may generate an association rule showing that 70% of the time milk is
sold with bread and only 30% of the time biscuits are sold with bread.
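The percentages above are confidence values. As a sketch, support and confidence for a rule such as {milk} → {bread} can be computed from toy baskets like this (the baskets and numbers are made up):

```python
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    both = sum((antecedent | consequent) <= b for b in baskets)
    ante = sum(antecedent <= b for b in baskets)
    return both / ante

print(support({"milk", "bread"}, baskets))       # 3 of 5 baskets -> 0.6
print(confidence({"milk"}, {"bread"}, baskets))  # 3 of 4 milk baskets -> 0.75
```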
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two item sets, to analyze whether they have
a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of
objects that are very similar to each other but highly different from the objects in other
clusters.
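As a sketch of cluster analysis, a minimal 1-D k-means with two clusters is shown below; a real application would use a library such as scikit-learn, and the values are made up.

```python
def kmeans_1d(values, iters=10):
    # Initialize the two centroids at the extremes of the data.
    centroids = [min(values), max(values)]
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

values = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
clusters, centroids = kmeans_1d(values)
print(clusters)  # low values end up in one cluster, high values in the other
```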
Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data, and can be
presented in forms such as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.
Data Mining Task Primitives
• We can specify a data mining task in the form of a data mining query.
• This query is input to the system.
• A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data mining
system. Here is the list of Data Mining Task Primitives −
• Database Attributes
• Data Warehouse dimensions of interest
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For example,
concept hierarchies are one form of background knowledge that allows data to be mined at
multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns that are discovered by the knowledge discovery process.
There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
The discovered patterns can be presented in forms such as:
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
Data Mining – Issues
Data mining is not an easy task, as the algorithms used can get very complex, and data is not
always available in one place; it needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here we discuss the major issues regarding −
Mining Methodology and User Interaction Issues
Performance Issues
• Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not
possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on LAN or WAN. These data
source may be structured, semi structured or unstructured. Therefore mining the
knowledge from them adds challenges to data mining.
Technologies Used in Data Mining
Data mining has incorporated many techniques from other domains such as statistics,
machine learning, pattern recognition, database and data warehouse systems, information
retrieval, visualization, algorithms, high performance computing, and many application domains.
The interdisciplinary nature of data mining research and development contributes significantly to
the success of data mining and its extensive applications.
Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation
of data. Data mining has an inherent connection with statistics. A statistical model is a set of
mathematical functions that describe the behavior of the objects in a target class in terms of
random variables and their associated probability distributions. Statistical models are widely
used to model data and data classes. For example, in data mining tasks like data characterization
and classification, statistical models of target classes can be built. In other words, such statistical
models can be the outcome of a data mining task. Alternatively, data mining tasks can be built on
top of statistical models. For example, we can use statistics to model noise and missing data
values. Then, when mining patterns in a large data set, the data mining process can use the model
to help identify and handle noisy or missing values in the data.
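For example, modeling noise with a simple statistical model (mean and standard deviation) might look like the following sketch; the two-sigma threshold and the readings are illustrative choices, not a standard.

```python
import statistics

readings = [20.1, 19.8, 20.3, 20.0, 95.0, 19.9]  # 95.0 is the likely noise

mu = statistics.mean(readings)
sigma = statistics.stdev(readings)

# Flag values more than two standard deviations from the mean.
flagged = [x for x in readings if abs(x - mu) > 2 * sigma]
print(flagged)
```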
Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data. Machine learning is a fast-growing
discipline. Here, we illustrate classic problems in machine learning that are highly
related to data mining.
Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which supervise the learning of the
classification model.
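A minimal supervised-learning sketch in the same spirit: a 1-nearest-neighbour classifier whose predictions are "supervised" by labelled training examples. The features and labels are made up for illustration.

```python
# Labelled (feature, label) training examples.
training = [((1.0, 1.0), "spam"), ((1.2, 0.9), "spam"),
            ((8.0, 8.5), "ham"), ((7.9, 8.1), "ham")]

def predict(x):
    # Pick the label of the closest training point
    # (squared Euclidean distance).
    def dist2(example):
        p, _ = example
        return (p[0] - x[0]) ** 2 + (p[1] - x[1]) ** 2
    return min(training, key=dist2)[1]

print(predict((1.1, 1.1)))  # "spam"
print(predict((8.2, 8.0)))  # "ham"
```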
Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. In one approach, labeled examples are
used to learn class models and unlabeled examples are used to refine the boundaries between
classes. For a two-class problem, we can think of the set of examples belonging to one class as
the positive examples and those belonging to the other class as the negative examples.
Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabeled examples or synthesized by the learning program. The
goal is to optimize the model quality by actively acquiring knowledge from human users, given a
constraint on how many examples they can be asked to label.
For classification and clustering tasks, machine learning research often focuses on the accuracy
of the model. In addition to accuracy, data mining research places strong emphasis on the
efficiency and scalability of mining methods on large data sets, as well as on ways to handle
complex types of data and explore new, alternative methods.
Database Systems and Data Warehouses
Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. In particular, database systems researchers have established highly
recognized principles in data models, query languages, query processing and optimization
methods, data storage, and indexing and accessing methods. Database systems are often well
known for their high scalability in processing very large, relatively structured data sets. Many
data mining tasks need to handle large data sets or even real-time, fast streaming data. Therefore,
data mining can make good use of scalable database technologies to achieve high efficiency and
scalability on large data sets. Moreover, data mining tasks can be used to extend the capability of
existing database systems to satisfy advanced users' sophisticated data analysis requirements.
Recent database systems have built systematic data analysis capabilities on database data using
data warehousing and data mining facilities. A data warehouse integrates data originating from
multiple sources and various timeframes. It consolidates data in multidimensional space to form
partially materialized data cubes. The data cube model not only facilitates OLAP in
multidimensional databases but also promotes multidimensional data mining.
Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done.
It involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various
ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by the attribute mean, or by the most probable value.
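Both strategies can be sketched on a toy attribute, with None marking a missing entry (the values are made up):

```python
ages = [25, None, 40, 35, None, 20]

# 1. Ignore the tuples: drop entries with missing values.
dropped = [a for a in ages if a is not None]

# 2. Fill the missing values, here with the attribute mean.
mean_age = sum(dropped) / len(dropped)
filled = [a if a is not None else mean_age for a in ages]

print(dropped)  # [25, 40, 35, 20]
print(filled)   # [25, 30.0, 40, 35, 30.0, 20]
```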
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated
due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and then various methods are performed to complete the task.
Each segment is handled separately: one can replace all data in a segment by its mean,
or boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
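Smoothing by bin means (method 1 above) can be sketched as follows; the nine sorted toy values and the bin size are illustrative.

```python
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_ = data[i:i + bin_size]
    mean = sum(bin_) / len(bin_)
    # Every value in the segment is replaced by the segment's mean.
    smoothed.extend([mean] * len(bin_))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```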
2. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder when working
with such volumes. To deal with this, we use data reduction techniques, which aim to increase
storage efficiency and reduce data storage and analysis costs.
3. Data Integration:
Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting
data set. This can help improve the accuracy and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.
4. Data Transformation:
In this preprocessing step, the data are transformed or consolidated so that the resulting
mining process may be more efficient, and the patterns found may be easier to understand.
Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total amounts. This
step is typically used in constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn,
can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the
numeric attribute. More than one concept hierarchy can be defined for the same attribute to
accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes
are implicit within the database schema and can be automatically defined at the schema
definition level.
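As a sketch, normalization (strategy 4) and discretization (strategy 5) might look like this; the value ranges and age cut-off points are illustrative, not standard.

```python
# Min-max normalization: rescale values into [0.0, 1.0] using
# v' = (v - min) / (max - min). The input numbers are made up.
values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]

# Discretization: replace raw ages with conceptual labels.
def age_label(age):
    if age <= 20:
        return "youth"
    if age <= 59:
        return "adult"
    return "senior"

ages = [12, 25, 47, 63, 19]
labels = [age_label(a) for a in ages]
print(labels)  # ['youth', 'adult', 'adult', 'senior', 'youth']
```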