
Data Mining

Introduction

Data Mining is the process of discovering patterns in large data sets involving methods
at the intersection of machine learning, statistics, and database systems. Data mining is
an interdisciplinary subfield of computer science and statistics with an overall goal to extract
information from a data set and transform the information into a comprehensible structure for
further use. Data mining is the analysis step of the "knowledge discovery in databases"
process or KDD. Aside from the raw analysis step, it also involves database and data
management aspects, data, model and inference considerations, interestingness metrics,
complexity considerations, post-processing of discovered structures, visualization, and online
updating.

Data mining is also known as knowledge discovery, knowledge extraction, data/pattern analysis, information harvesting, etc.

KDD- Knowledge Discovery in Databases

The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data
Mining techniques. It is a field of interest to researchers in various fields, including artificial
intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition
for expert systems, and data visualization.

The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.

Knowledge Discovery in Databases can be viewed as an automated, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns in huge and complex data sets. Data mining is the root of the KDD procedure and involves applying algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is then used to extract knowledge from the data, analyze it, and make predictions.

Steps Involved in KDD Process:

KDD process
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random or variance error.
• Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.

3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using Naive Bayes.
• Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data into the form required by the mining procedure.
Data transformation is a two-step process:
• Data Mapping: Assigning elements from the source base to the destination to capture transformations.
• Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model, e.g., classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns representing knowledge, based on given measures.
• Finds an interestingness score for each pattern.
• Uses summarization and visualization to make the data understandable by the user.

7. Knowledge Representation: Knowledge representation is defined as the technique that uses visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
Note:
• KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, new data can be integrated and transformed in order to get different and more
appropriate results.
• Preprocessing of databases consists of Data cleaning and Data Integration.
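The steps above can be pictured as a small pipeline. Below is a minimal sketch in Python using pandas, under the assumption that the library is available; the file names, column names, and the trivial "pattern" computed at the end are hypothetical placeholders, not part of these notes.

```python
# A minimal sketch of the KDD pipeline using pandas. File and column names
# (e.g., "sales.csv", "region") are hypothetical placeholders.
import pandas as pd

# 1. Data cleaning: drop irrelevant columns, handle missing values.
df = pd.read_csv("sales.csv")
df = df.drop(columns=["internal_id"])     # remove irrelevant data
df = df.dropna(subset=["amount"])         # remove tuples missing a key value

# 2. Data integration: combine heterogeneous sources into one frame.
regions = pd.read_csv("regions.csv")
df = df.merge(regions, on="region", how="left")

# 3. Data selection: keep only data relevant to the analysis.
df = df[df["year"] >= 2020]

# 4. Data transformation: put values in a form the mining step expects.
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)

# 5. Data mining: here, a trivial "pattern" -- average spend per region.
patterns = df.groupby("region")["amount_scaled"].mean()

# 6-7. Pattern evaluation and knowledge representation: rank and report.
print(patterns.sort_values(ascending=False).head())
```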

Kinds of Data
The following are the different sources of data that are used in data mining process.

1. Flat Files

o Flat files are defined as data files in text form or binary form with a structure that
can be easily extracted by data mining algorithms.

o Data stored in flat files has no relationships or paths among itself; for example, if a relational database is exported to flat files, the relations between the tables are lost.

o Flat files are described by a data dictionary. E.g., CSV files.

o Application: Used in data warehousing to store data, used in carrying data to and from servers, etc.

2. Relational Databases

o A relational database is defined as a collection of data organized in tables with rows and columns.

o The physical schema in relational databases is a schema which defines the structure of tables.

o The logical schema in relational databases is a schema which defines the relationships among tables.

o The standard API of relational databases is SQL.

o Application: Data mining, ROLAP model, etc.

3. Data Warehouse

o A data warehouse is defined as a collection of data integrated from multiple sources that can be queried and used in decision making.

o There are three types of data warehouse: Enterprise data warehouse, Data Mart, and Virtual Warehouse.

o Two approaches can be used to update data in a data warehouse: the Query-driven approach and the Update-driven approach.

o Application: Business decision making, data mining, etc.

4. Transactional Databases

o Transactional databases are a collection of data organized by time stamps, dates, etc., to represent transactions in databases.

o This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed.

o Highly flexible system where users can modify information without changing any sensitive information.

o Follows the ACID properties of a DBMS.

o Application: Banking, distributed systems, object databases, etc.

5. Multimedia Databases

o Multimedia databases consist of audio, video, image, and text media.

o They can be stored in object-oriented databases.

o They are used to store complex information in pre-specified formats.

o Application: Digital libraries, video-on-demand, news-on-demand, musical databases, etc.

6. Spatial Database
o Stores geographical information.
o Stores the data in the form of coordinates, topology, lines, polygons, etc.
o Application: Maps, Global positioning, etc.

7. Time-series Databases
o A time-series database contains data such as stock exchange prices and user-logged activities.
o Handles arrays of numbers indexed by time, date, etc.
o It requires real-time analysis.
o Application: eXtremeDB, Graphite, InfluxDB, etc.

8. WWW

o WWW refers to the World Wide Web, a collection of documents and resources like audio, video, text, etc., which are identified by Uniform Resource Locators (URLs), accessed through web browsers, linked by HTML pages, and available over the Internet.

o It is the most heterogeneous repository, as it collects data from multiple sources.

o It is dynamic in nature, as the volume of data is continuously increasing and changing.

o Application: Online shopping, job search, research, studying, etc.

Data Mining Applications

Here is the list of areas where data mining is widely used −

• Market Analysis and Management
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection

Market Analysis and Management

Listed below are the various fields of market where data mining is used −
• Customer Profiling − Data mining helps determine what kind of people buy what kind
of products.
• Identifying Customer Requirements − Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may attract
new customers.
• Cross Market Analysis − Data mining performs Association/correlations between
product sales.
• Target Marketing − Data mining helps to find clusters of model customers who share
the same characteristics such as interests, spending habits, income, etc.
• Determining Customer Purchasing Patterns − Data mining helps in determining customer purchasing patterns.
• Providing Summary Information − Data mining provides us various multidimensional
summary reports.

Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis and data
mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.

Retail Industry
Data mining has great application in the retail industry because it collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends, leading to improved quality of customer service and better customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.

Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business.
Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is a list of examples for which data mining improves telecommunication services −
• Multidimensional Analysis of Telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.

• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.

Biological Data Analysis


In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis −
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.

Other Scientific Applications


The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as the geosciences, astronomy, etc. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. The following are the applications of data mining in the field of scientific applications −

• Data Warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain specific knowledge.

Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is a list of areas in which data mining technology may be applied for intrusion detection −
• Development of data mining algorithm for intrusion detection.
• Association and correlation analysis, aggregation to help select and build discriminating
attributes.

• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.
Kinds of Patterns
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of patterns to be mined, there are two categories of functions involved in data mining −

• Descriptive
• Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the list
of descriptive functions −

• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters

Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in a
company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived by the following two
ways −
• Data Characterization − This refers to summarizing data of class under study. This
class under study is called as Target Class.
• Data Discrimination − It refers to the mapping or classification of a class with some
predefined group or class.

Mining of Frequent Patterns


Frequent patterns are those patterns that occur frequently in transactional data. Here is a list of the kinds of frequent patterns −
• Frequent Item Set − This refers to a set of items that frequently appear together, for example, milk and bread.
• Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing a camera being followed by purchasing a memory card.
• Frequent Sub-Structure − Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.

Mining of Association
Associations are used in retail sales to identify items that are frequently purchased together. Association mining refers to the process of uncovering relationships among data and determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread, while only 30% of the time biscuits are sold with bread.
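A minimal sketch of how such a rule's support and confidence would be computed, in plain Python; the five transactions below are invented for illustration.

```python
# Support and confidence for an association rule such as "bread => milk".
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"bread", "milk", "biscuits"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A u C) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```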

Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to determine whether they have a positive, negative, or no effect on each other.

Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −

• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
The list of functions involved in these processes is as follows −
• Classification − This predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are well known.
• Prediction − This is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
• Outlier Analysis − Outliers may be defined as data objects that do not comply with the general behavior or model of the available data.
• Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.
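As a concrete illustration of classification, the following minimal sketch derives a decision-tree model from a tiny, invented training set and uses it to predict the class of an unlabeled object. It assumes scikit-learn is installed; the attributes (age, income) and labels are hypothetical.

```python
# Derive a decision-tree model from labeled training data, present it as
# IF-THEN rules, then predict the class of an unlabeled object.
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [age, income]; labels: 1 = buys computer, 0 = does not.
X_train = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y_train = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# The derived model can be presented as a decision tree / IF-THEN rules.
print(export_text(model, feature_names=["age", "income"]))

# Predict the class of an object whose class label is unknown.
print(model.predict([[40, 70]]))
```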

Data Mining Task Primitives

• We can specify a data mining task in the form of a data mining query.
• This query is input to the system.
• A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data mining
system. Here is the list of Data Mining Task Primitives −

• Set of task relevant data to be mined.
• Kind of knowledge to be mined.
• Background knowledge to be used in discovery process.
• Interestingness measures and thresholds for pattern evaluation.
• Representation for visualizing the discovered patterns.

Set of task relevant data to be mined


This is the portion of the database in which the user is interested. This portion includes the following −

• Database Attributes
• Data Warehouse dimensions of interest

Kind of knowledge to be mined


It refers to the kind of functions to be performed. These functions are −

• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis

Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.

Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.

Representation for visualizing the discovered patterns


This refers to the form in which discovered patterns are to be displayed. These representations may include the following −

• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
Data Mining – Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we will discuss the major issues regarding −

• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −


• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − To guide discovery process and to express
the discovered patterns, the background knowledge can be used. Background knowledge
may be used to express the discovered patterns not only in concise terms but at multiple
levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered patterns will be poor.
• Pattern evaluation − Discovered patterns may be uninteresting because they represent common knowledge or lack novelty, so interestingness measures are needed to evaluate them.

Performance Issues

There can be performance-related issues, such as the following −


• Efficiency and scalability of data mining algorithms − In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update the mined knowledge as the database changes, without mining the data again from scratch.
Diverse Data Types Issues

• Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
Technologies Used in Data Mining
Data mining has incorporated many techniques from other domains such as statistics,
machine learning, pattern recognition, database and data warehouse systems, information
retrieval, visualization, algorithms, high performance computing, and many application domains.
The interdisciplinary nature of data mining research and development contributes significantly to
the success of data mining and its extensive applications.

Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation
of data. Data mining has an inherent connection with statistics. A statistical model is a set of
mathematical functions that describe the behavior of the objects in a target class in terms of
random variables and their associated probability distributions. Statistical models are widely
used to model data and data classes. For example, in data mining tasks like data characterization
and classification, statistical models of target classes can be built. In other words, such statistical
models can be the outcome of a data mining task. Alternatively, data mining tasks can be built on
top of statistical models. For example, we can use statistics to model noise and missing data
values. Then, when mining patterns in a large data set, the data mining process can use the model
to help identify and handle noisy or missing values in the data.

Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data. Machine learning is a fast-growing discipline. Here, we illustrate classic problems in machine learning that are highly related to data mining.

Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which supervise the learning of the
classification model.

Unsupervised learning is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to
discover classes within the data. For example, an unsupervised learning method can take, as
input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These
clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the
training data are not labeled, the learned model cannot tell us the semantic meaning of the
clusters found.
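A minimal sketch of unsupervised learning as clustering, assuming scikit-learn is available; the six 2-D points are invented. As the text notes, the learner finds the clusters but cannot name their semantic meaning.

```python
# k-means clustering on unlabeled points: the algorithm groups the data
# without any class labels being provided.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index per point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # learned (but unnamed) cluster centers
```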

Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. In one approach, labeled examples are
used to learn class models and unlabeled examples are used to refine the boundaries between
classes. For a two-class problem, we can think of the set of examples belonging to one class as
the positive examples and those belonging to the other class as the negative examples.

Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabeled examples or synthesized by the learning program. The
goal is to optimize the model quality by actively acquiring knowledge from human users, given a
constraint on how many examples they can be asked to label.

For classification and clustering tasks, machine learning research often focuses on the accuracy
of the model. In addition to accuracy, data mining research places strong emphasis on the
efficiency and scalability of mining methods on large data sets, as well as on ways to handle
complex types of data and explore new, alternative methods.

Database Systems and Data Warehouses

Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Particularly, database systems researchers have established highly
recognized principles in data models, query languages, query processing and optimization
methods, data storage, and indexing and accessing methods. Database systems are often well
known for their high scalability in processing very large, relatively structured data sets. Many
data mining tasks need to handle large data sets or even real-time, fast streaming data. Therefore,
data mining can make good use of scalable database technologies to achieve high efficiency and
scalability on large data sets. Moreover, data mining tasks can be used to extend the capability of
existing database systems to satisfy advanced users’ sophisticated data analysis requirements.
Recent database systems have built systematic data analysis capabilities on database data using
data warehousing and data mining facilities. A data warehouse integrates data originating from
multiple sources and various timeframes. It consolidates data in multidimensional space to form
partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.

Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. The differences
between traditional information retrieval and database systems are twofold: Information retrieval
assumes that (1) the data under search are unstructured; and (2) the queries are formed mainly by
keywords, which do not have complex structures (unlike SQL queries in database systems). The
typical approaches in information retrieval adopt probabilistic models. For example, a text
document can be regarded as a bag of words, that is, a multiset of words appearing in the
document. The document’s language model is the probability density function that generates the
bag of words in the document. The similarity between two documents can be measured by the
similarity between their corresponding language models. Furthermore, a topic in a set of text
documents can be modeled as a probability distribution over the vocabulary, which is called a
topic model. A text document, which may involve one or multiple topics, can be regarded as a
mixture of multiple topic models. By integrating information retrieval models and data mining
techniques, we can find the major topics in a collection of documents and, for each document in
the collection, the major topics involved. Increasingly large amounts of text and multimedia data
have been accumulated and made available online due to the fast growth of the Web and
applications such as digital libraries, digital governments, and health care information systems.
Their effective search and analysis have raised many challenging issues in data mining.
Therefore, text mining and multimedia data mining, integrated with information retrieval
methods, have become increasingly important.
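The bag-of-words idea above can be made concrete in a few lines of standard-library Python. This sketch uses simple term counts and cosine similarity rather than a full probabilistic language model; the two documents are invented.

```python
# Represent each document as a multiset (bag) of words and measure the
# similarity of two documents with cosine similarity over their counts.
from collections import Counter
from math import sqrt

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

d1 = bag_of_words("data mining finds patterns in large data sets")
d2 = bag_of_words("mining large data sets reveals hidden patterns")
print(cosine_similarity(d1, d2))
```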

Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues.
Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done.
It involves handling of missing data, noisy data etc.


(a). Missing Data:

This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by the attribute mean, or by the most probable value (see the sketch below).
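A minimal sketch of both strategies using pandas (assumed available); the column names and values are hypothetical.

```python
# Handling missing data: drop incomplete tuples, or fill missing values
# with the attribute mean (numeric) or the most probable value (categorical).
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "city": ["Pune", "Pune", None, "Delhi", "Pune"]})

# 1. Ignore the tuples: drop rows with missing values (suits large data sets).
dropped = df.dropna()

# 2. Fill the missing values.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())    # attribute mean
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most probable
print(filled)
```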
(b). Noisy Data:

Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and various methods are then used to complete the task. Each segment is handled separately: one can replace all data in a segment by its mean, or boundary values can be used to complete the task (a sketch follows this list).

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters may be considered outliers.
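A minimal sketch of the binning method described above, smoothing sorted values by replacing each equal-size segment with its mean; the values are invented.

```python
# Smoothing by bin means: sort the data, split it into equal-size segments,
# and replace every value in a segment by the segment's mean.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    segment = values[i:i + bin_size]
    mean = sum(segment) / len(segment)
    smoothed.extend([mean] * len(segment))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```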

2. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.

The various steps to data reduction are:


1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute with a p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a sketch follows this list.
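A minimal dimensionality-reduction sketch using PCA, assuming scikit-learn is available; the 3-D points are invented, and the reconstruction step shows why this particular reduction is lossy.

```python
# Dimensionality reduction with PCA: encode 3 attributes into 2 components.
from sklearn.decomposition import PCA

X = [[2.5, 2.4, 0.5], [0.5, 0.7, 0.1], [2.2, 2.9, 0.6],
     [1.9, 2.2, 0.4], [3.1, 3.0, 0.7], [2.3, 2.7, 0.5]]

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): smaller representation
print(pca.explained_variance_ratio_)  # variance retained per component

# Lossy: reconstruction from the compressed data is only approximate.
X_approx = pca.inverse_transform(X_reduced)
```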

3.Data Integration

Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting
data set. This can help improve the accuracy and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.
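A minimal integration sketch with pandas, merging two hypothetical data stores into one coherent frame after aligning an inconsistently named key; all names are invented.

```python
# Data integration: merge records for the same entities from two stores,
# resolving an attribute-naming inconsistency (cust_id vs. customer_id).
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [250, 120, 300]})
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Asha", "Ravi", "Meena"]})

crm = crm.rename(columns={"customer_id": "cust_id"})   # align schemas
merged = orders.merge(crm, on="cust_id", how="inner")  # coherent data store
print(merged)
```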

3.1 Entity Identification Problem


It is likely that your data analysis task will involve data integration, which combines data from
multiple sources into a coherent data store, as in data warehousing. These sources may include
multiple databases, data cubes, or flat files. There are a number of issues to consider during data
integration. Schema integration and object matching can be tricky. How can equivalent real-
world entities from multiple data sources be matched up? This is referred to as the entity
identification problem.

3.2 Redundancy and Correlation Analysis


Redundancy is another important issue in data integration. An attribute (such as annual revenue,
for instance) may be redundant if it can be “derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting
data set. Some redundancies can be detected by correlation analysis. Given two attributes, such
analysis can measure how strongly one attribute implies the other, based on the available data.
For nominal data, we use the χ2 (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another.
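A minimal sketch of the χ2 test for nominal attributes, assuming scipy is available; the 2x2 contingency table is invented for illustration.

```python
# Chi-square test of independence between two nominal attributes.
from scipy.stats import chi2_contingency

# Rows: male, female; columns: fiction, non-fiction (invented counts).
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # a small p-value suggests the attributes are correlated
```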

3.3 Tuple Duplication


In addition to detecting redundancies between attributes, duplication should also be detected at
the tuple level (e.g., where there are two or more identical tuples for a given unique data entry
case). The use of denormalized tables (often done to improve performance by avoiding joins) is
another source of data redundancy. Inconsistencies often arise between various duplicates, due to
inaccurate data entry or updating some but not all data occurrences. For example, if a purchase
order database contains attributes for the purchaser’s name and address instead of a key to this
information in a purchaser database, discrepancies can occur, such as the same purchaser’s name
appearing with different addresses within the purchase order database.
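A minimal sketch of tuple-level duplicate handling with pandas; the purchaser records are invented, including one near-duplicate of the kind described above.

```python
# Detect and remove identical tuples, then surface near-duplicates where the
# same purchaser appears with inconsistent addresses.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["A. Rao", "A. Rao", "B. Iyer", "A. Rao"],
    "address":   ["12 MG Rd", "12 MG Rd", "4 Park St", "12 M.G. Road"],
})

deduped = orders.drop_duplicates()  # removes the identical tuple
print(deduped)

# Same purchaser, differing addresses: a candidate inconsistency to resolve.
conflicts = deduped.groupby("purchaser")["address"].nunique()
print(conflicts[conflicts > 1])
```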

3.4 Data Value Conflict Detection and Resolution


Data integration also involves the detection and resolution of data value conflicts. For example,
for the same real-world entity, attribute values from different sources may differ. This may be
due to differences in representation, scaling, or encoding. For instance, a weight attribute may be stored in metric units in one system and British imperial units in another.

4. Data Transformation and Data Discretization

In this preprocessing step, the data are transformed or consolidated so that the resulting mining
process may be more efficient, and the patterns found may be easier to understand.

Data Transformation Strategies Overview


In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total amounts. This
step is typically used in constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes
are implicit within the database schema and can be automatically defined at the schema
definition level.
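A minimal sketch of strategies 4 and 5, min-max normalization and discretization, using pandas; the ages, bin edges, and conceptual labels are invented.

```python
# Normalization into [0.0, 1.0] and discretization of a numeric attribute
# (age) into interval labels and conceptual labels.
import pandas as pd

ages = pd.Series([5, 13, 22, 35, 47, 58, 66, 72])

# 4. Normalization: min-max scaling into the range 0.0 to 1.0.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# 5. Discretization: interval labels, then conceptual labels.
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])
concepts = pd.cut(ages, bins=[0, 20, 60, 100],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "normalized": normalized,
                    "interval": intervals, "concept": concepts}))
```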

