
Data Mining

Introduction

Data Mining is the process of discovering patterns in large data sets involving methods
at the intersection of machine learning, statistics, and database systems. Data mining is
an interdisciplinary subfield of computer science and statistics with an overall goal to extract
information from a data set and transform the information into a comprehensible structure for
further use. Data mining is the analysis step of the "knowledge discovery in databases"
process or KDD. Aside from the raw analysis step, it also involves database and data
management aspects, data, model and inference considerations, interestingness metrics,
complexity considerations, post-processing of discovered structures, visualization, and online
updating.

Data mining is also known as knowledge discovery, knowledge extraction, data/pattern analysis, information harvesting, etc.

KDD- Knowledge Discovery in Databases

The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data
Mining techniques. It is a field of interest to researchers in various fields, including artificial
intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition
for expert systems, and data visualization.

The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.

Knowledge Discovery in Databases can be viewed as an automated, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns in huge and complex data sets. Data mining is the root of the KDD procedure and involves applying algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is then used to extract knowledge from the data, analyze it, and make predictions.

Steps Involved in KDD Process:

KDD process
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random or variance error.
• Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.

3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using Naive Bayes.
• Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data into the form required by the mining procedure.
Data transformation is a two-step process:
• Data Mapping: Assigning elements from the source base to the destination to capture transformations.
• Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model, e.g., classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns representing knowledge, based on given measures.
• Finds an interestingness score for each pattern.
• Uses summarization and visualization to make the data understandable by the user.

7. Knowledge Representation: Knowledge representation is defined as the technique that uses visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
Note:
• KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, new data can be integrated and transformed in order to get different and more
appropriate results.
• Preprocessing of databases consists of Data cleaning and Data Integration.
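The steps above can be pictured as a small pipeline. Below is a minimal sketch in Python using pandas, under the assumption that the library is available; the file names, column names, and the trivial "pattern" computed at the end are hypothetical placeholders, not part of these notes.

```python
# A minimal sketch of the KDD pipeline using pandas. File and column names
# (e.g., "sales.csv", "region") are hypothetical placeholders.
import pandas as pd

# 1. Data cleaning: drop irrelevant columns, handle missing values.
df = pd.read_csv("sales.csv")
df = df.drop(columns=["internal_id"])     # remove irrelevant data
df = df.dropna(subset=["amount"])         # remove tuples missing a key value

# 2. Data integration: combine heterogeneous sources into one frame.
regions = pd.read_csv("regions.csv")
df = df.merge(regions, on="region", how="left")

# 3. Data selection: keep only data relevant to the analysis.
df = df[df["year"] >= 2020]

# 4. Data transformation: put values in a form the mining step expects.
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)

# 5. Data mining: here, a trivial "pattern" -- average spend per region.
patterns = df.groupby("region")["amount_scaled"].mean()

# 6-7. Pattern evaluation and knowledge representation: rank and report.
print(patterns.sort_values(ascending=False).head())
```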

Kinds of Data
The following are the different sources of data that are used in data mining process.

1. Flat Files

o Flat files are defined as data files in text form or binary form with a structure that
can be easily extracted by data mining algorithms.

o Data stored in flat files has no relationships or paths among itself; for example, if a relational database is exported to flat files, the relations between the tables are lost.

o Flat files are described by a data dictionary. E.g., CSV files.

o Application: Used in data warehousing to store data, used in carrying data to and from servers, etc.

2. Relational Databases

o A relational database is defined as a collection of data organized in tables with rows and columns.

o The physical schema in relational databases is a schema which defines the structure of tables.

o The logical schema in relational databases is a schema which defines the relationships among tables.

o The standard API of relational databases is SQL.

o Application: Data mining, ROLAP model, etc.

3. Data Warehouse

o A data warehouse is defined as a collection of data integrated from multiple sources that can be queried and used in decision making.

o There are three types of data warehouse: Enterprise data warehouse, Data Mart, and Virtual Warehouse.

o Two approaches can be used to update data in a data warehouse: the Query-driven approach and the Update-driven approach.

o Application: Business decision making, data mining, etc.

4. Transactional Databases

o Transactional databases are a collection of data organized by time stamps, dates, etc., to represent transactions in databases.

o This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed.

o Highly flexible system where users can modify information without changing any sensitive information.

o Follows the ACID properties of a DBMS.

o Application: Banking, distributed systems, object databases, etc.

5. Multimedia Databases

o Multimedia databases consist of audio, video, image, and text media.

o They can be stored in object-oriented databases.

o They are used to store complex information in pre-specified formats.

o Application: Digital libraries, video-on-demand, news-on-demand, musical databases, etc.

6. Spatial Database
o Stores geographical information.
o Stores the data in the form of coordinates, topology, lines, polygons, etc.
o Application: Maps, Global positioning, etc.

7. Time-series Databases
o A time-series database contains data such as stock exchange prices and user-logged activities.
o Handles arrays of numbers indexed by time, date, etc.
o It requires real-time analysis.
o Application: eXtremeDB, Graphite, InfluxDB, etc.

8. WWW

o WWW refers to the World Wide Web, a collection of documents and resources like audio, video, text, etc., which are identified by Uniform Resource Locators (URLs), accessed through web browsers, linked by HTML pages, and available over the Internet.

o It is the most heterogeneous repository, as it collects data from multiple sources.

o It is dynamic in nature, as the volume of data is continuously increasing and changing.

o Application: Online shopping, job search, research, studying, etc.

Data Mining Applications

Here is the list of areas where data mining is widely used −

• Market Analysis and Management
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection

Market Analysis and Management

Listed below are the various fields of market where data mining is used −
• Customer Profiling − Data mining helps determine what kind of people buy what kind
of products.
• Identifying Customer Requirements − Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may attract
new customers.
• Cross Market Analysis − Data mining performs Association/correlations between
product sales.
• Target Marketing − Data mining helps to find clusters of model customers who share
the same characteristics such as interests, spending habits, income, etc.
• Determining Customer Purchasing Patterns − Data mining helps in determining customer purchasing patterns.
• Providing Summary Information − Data mining provides us various multidimensional
summary reports.

Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis and data
mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.

Retail Industry
Data mining has great application in the retail industry because it collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends, leading to improved quality of customer service and better customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.

Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business.
Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is a list of examples for which data mining improves telecommunication services −
• Multidimensional Analysis of Telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.

• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.

Biological Data Analysis


In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis −
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.

Other Scientific Applications


The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as the geosciences, astronomy, etc. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. The following are the applications of data mining in the field of scientific applications −

• Data Warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain specific knowledge.

Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is a list of areas in which data mining technology may be applied for intrusion detection −
• Development of data mining algorithm for intrusion detection.
• Association and correlation analysis, aggregation to help select and build discriminating
attributes.

• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.
Kinds of Patterns
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of patterns to be mined, there are two categories of functions involved in data mining −

• Descriptive
• Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the list
of descriptive functions −

• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters

Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in a
company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived by the following two
ways −
• Data Characterization − This refers to summarizing data of class under study. This
class under study is called as Target Class.
• Data Discrimination − It refers to the mapping or classification of a class with some
predefined group or class.

Mining of Frequent Patterns


Frequent patterns are those patterns that occur frequently in transactional data. Here is a list of the kinds of frequent patterns −
• Frequent Item Set − This refers to a set of items that frequently appear together, for example, milk and bread.
• Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing a camera being followed by purchasing a memory card.
• Frequent Sub-Structure − Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.

Mining of Association
Associations are used in retail sales to identify items that are frequently purchased together. Association mining refers to the process of uncovering relationships among data and determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread, while only 30% of the time biscuits are sold with bread.
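A minimal sketch of how such a rule's support and confidence would be computed, in plain Python; the five transactions below are invented for illustration.

```python
# Support and confidence for an association rule such as "bread => milk".
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"bread", "milk", "biscuits"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A u C) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```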

Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to determine whether they have a positive, negative, or no effect on each other.

Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −

• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
The list of functions involved in these processes is as follows −
• Classification − This predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are well known.
• Prediction − This is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
• Outlier Analysis − Outliers may be defined as data objects that do not comply with the general behavior or model of the available data.
• Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.
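As a concrete illustration of classification, the following minimal sketch derives a decision-tree model from a tiny, invented training set and uses it to predict the class of an unlabeled object. It assumes scikit-learn is installed; the attributes (age, income) and labels are hypothetical.

```python
# Derive a decision-tree model from labeled training data, present it as
# IF-THEN rules, then predict the class of an unlabeled object.
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [age, income]; labels: 1 = buys computer, 0 = does not.
X_train = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y_train = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# The derived model can be presented as a decision tree / IF-THEN rules.
print(export_text(model, feature_names=["age", "income"]))

# Predict the class of an object whose class label is unknown.
print(model.predict([[40, 70]]))
```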

Data Mining Task Primitives

• We can specify a data mining task in the form of a data mining query.
• This query is input to the system.
• A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data mining
system. Here is the list of Data Mining Task Primitives −

• Set of task relevant data to be mined.
• Kind of knowledge to be mined.
• Background knowledge to be used in discovery process.
• Interestingness measures and thresholds for pattern evaluation.
• Representation for visualizing the discovered patterns.

Set of task relevant data to be mined


This is the portion of the database in which the user is interested. This portion includes the following −

• Database Attributes
• Data Warehouse dimensions of interest

Kind of knowledge to be mined


It refers to the kind of functions to be performed. These functions are −

• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis

Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.

Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.

Representation for visualizing the discovered patterns


This refers to the form in which discovered patterns are to be displayed. These representations may include the following −

• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
Data Mining – Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we will discuss the major issues regarding −

• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −


• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − To guide discovery process and to express
the discovered patterns, the background knowledge can be used. Background knowledge
may be used to express the discovered patterns not only in concise terms but at multiple
levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered patterns will be poor.
• Pattern evaluation − Discovered patterns may be uninteresting because they represent common knowledge or lack novelty, so interestingness measures are needed to evaluate them.

Performance Issues

There can be performance-related issues, such as the following −


• Efficiency and scalability of data mining algorithms − In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update the mined knowledge as the database changes, without mining the data again from scratch.
Diverse Data Types Issues

• Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
Technologies Used in Data Mining
Data mining has incorporated many techniques from other domains such as statistics,
machine learning, pattern recognition, database and data warehouse systems, information
retrieval, visualization, algorithms, high performance computing, and many application domains.
The interdisciplinary nature of data mining research and development contributes significantly to
the success of data mining and its extensive applications.

Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation
of data. Data mining has an inherent connection with statistics. A statistical model is a set of
mathematical functions that describe the behavior of the objects in a target class in terms of
random variables and their associated probability distributions. Statistical models are widely
used to model data and data classes. For example, in data mining tasks like data characterization
and classification, statistical models of target classes can be built. In other words, such statistical
models can be the outcome of a data mining task. Alternatively, data mining tasks can be built on
top of statistical models. For example, we can use statistics to model noise and missing data
values. Then, when mining patterns in a large data set, the data mining process can use the model
to help identify and handle noisy or missing values in the data.

Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data. Machine learning is a fast-growing discipline. Here, we illustrate classic problems in machine learning that are highly related to data mining.

Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which supervise the learning of the
classification model.

Unsupervised learning is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to
discover classes within the data. For example, an unsupervised learning method can take, as
input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These
clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the
training data are not labeled, the learned model cannot tell us the semantic meaning of the
clusters found.
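A minimal sketch of unsupervised learning as clustering, assuming scikit-learn is available; the six 2-D points are invented. As the text notes, the learner finds the clusters but cannot name their semantic meaning.

```python
# k-means clustering on unlabeled points: the algorithm groups the data
# without any class labels being provided.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index per point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # learned (but unnamed) cluster centers
```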

Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. In one approach, labeled examples are
used to learn class models and unlabeled examples are used to refine the boundaries between
classes. For a two-class problem, we can think of the set of examples belonging to one class as
the positive examples and those belonging to the other class as the negative examples.

Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabeled examples or synthesized by the learning program. The
goal is to optimize the model quality by actively acquiring knowledge from human users, given a
constraint on how many examples they can be asked to label.

For classification and clustering tasks, machine learning research often focuses on the accuracy
of the model. In addition to accuracy, data mining research places strong emphasis on the
efficiency and scalability of mining methods on large data sets, as well as on ways to handle
complex types of data and explore new, alternative methods.

Database Systems and Data Warehouses

Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Particularly, database systems researchers have established highly
recognized principles in data models, query languages, query processing and optimization
methods, data storage, and indexing and accessing methods. Database systems are often well
known for their high scalability in processing very large, relatively structured data sets. Many
data mining tasks need to handle large data sets or even real-time, fast streaming data. Therefore,
data mining can make good use of scalable database technologies to achieve high efficiency and
scalability on large data sets. Moreover, data mining tasks can be used to extend the capability of
existing database systems to satisfy advanced users’ sophisticated data analysis requirements.
Recent database systems have built systematic data analysis capabilities on database data using
data warehousing and data mining facilities. A data warehouse integrates data originating from
multiple sources and various timeframes. It consolidates data in multidimensional space to form
partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.

Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. The differences
between traditional information retrieval and database systems are twofold: Information retrieval
assumes that (1) the data under search are unstructured; and (2) the queries are formed mainly by
keywords, which do not have complex structures (unlike SQL queries in database systems). The
typical approaches in information retrieval adopt probabilistic models. For example, a text
document can be regarded as a bag of words, that is, a multiset of words appearing in the
document. The document’s language model is the probability density function that generates the
bag of words in the document. The similarity between two documents can be measured by the
similarity between their corresponding language models. Furthermore, a topic in a set of text
documents can be modeled as a probability distribution over the vocabulary, which is called a
topic model. A text document, which may involve one or multiple topics, can be regarded as a
mixture of multiple topic models. By integrating information retrieval models and data mining
techniques, we can find the major topics in a collection of documents and, for each document in
the collection, the major topics involved. Increasingly large amounts of text and multimedia data
have been accumulated and made available online due to the fast growth of the Web and
applications such as digital libraries, digital governments, and health care information systems.
Their effective search and analysis have raised many challenging issues in data mining.
Therefore, text mining and multimedia data mining, integrated with information retrieval
methods, have become increasingly important.
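The bag-of-words idea above can be made concrete in a few lines of standard-library Python. This sketch uses simple term counts and cosine similarity rather than a full probabilistic language model; the two documents are invented.

```python
# Represent each document as a multiset (bag) of words and measure the
# similarity of two documents with cosine similarity over their counts.
from collections import Counter
from math import sqrt

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

d1 = bag_of_words("data mining finds patterns in large data sets")
d2 = bag_of_words("mining large data sets reveals hidden patterns")
print(cosine_similarity(d1, d2))
```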

Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues.
Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done.
It involves handling of missing data, noisy data etc.


(a). Missing Data:

This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by the attribute mean, or by the most probable value (see the sketch below).
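A minimal sketch of both strategies using pandas (assumed available); the column names and values are hypothetical.

```python
# Handling missing data: drop incomplete tuples, or fill missing values
# with the attribute mean (numeric) or the most probable value (categorical).
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "city": ["Pune", "Pune", None, "Delhi", "Pune"]})

# 1. Ignore the tuples: drop rows with missing values (suits large data sets).
dropped = df.dropna()

# 2. Fill the missing values.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())    # attribute mean
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most probable
print(filled)
```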
(b). Noisy Data:

Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and various methods are then used to complete the task. Each segment is handled separately: one can replace all data in a segment by its mean, or boundary values can be used to complete the task (a sketch follows this list).

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters may be considered outliers.
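A minimal sketch of the binning method described above, smoothing sorted values by replacing each equal-size segment with its mean; the values are invented.

```python
# Smoothing by bin means: sort the data, split it into equal-size segments,
# and replace every value in a segment by the segment's mean.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    segment = values[i:i + bin_size]
    mean = sum(segment) / len(segment)
    smoothed.extend([mean] * len(segment))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```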

2. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.

The various steps to data reduction are:


1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute with a p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a sketch follows this list.
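A minimal dimensionality-reduction sketch using PCA, assuming scikit-learn is available; the 3-D points are invented, and the reconstruction step shows why this particular reduction is lossy.

```python
# Dimensionality reduction with PCA: encode 3 attributes into 2 components.
from sklearn.decomposition import PCA

X = [[2.5, 2.4, 0.5], [0.5, 0.7, 0.1], [2.2, 2.9, 0.6],
     [1.9, 2.2, 0.4], [3.1, 3.0, 0.7], [2.3, 2.7, 0.5]]

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): smaller representation
print(pca.explained_variance_ratio_)  # variance retained per component

# Lossy: reconstruction from the compressed data is only approximate.
X_approx = pca.inverse_transform(X_reduced)
```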

3.Data Integration

Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting
data set. This can help improve the accuracy and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.
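A minimal integration sketch with pandas, merging two hypothetical data stores into one coherent frame after aligning an inconsistently named key; all names are invented.

```python
# Data integration: merge records for the same entities from two stores,
# resolving an attribute-naming inconsistency (cust_id vs. customer_id).
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [250, 120, 300]})
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Asha", "Ravi", "Meena"]})

crm = crm.rename(columns={"customer_id": "cust_id"})   # align schemas
merged = orders.merge(crm, on="cust_id", how="inner")  # coherent data store
print(merged)
```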

3.1 Entity Identification Problem


It is likely that your data analysis task will involve data integration, which combines data from
multiple sources into a coherent data store, as in data warehousing. These sources may include
multiple databases, data cubes, or flat files. There are a number of issues to consider during data
integration. Schema integration and object matching can be tricky. How can equivalent real-
world entities from multiple data sources be matched up? This is referred to as the entity
identification problem.

3.2 Redundancy and Correlation Analysis


Redundancy is another important issue in data integration. An attribute (such as annual revenue,
for instance) may be redundant if it can be “derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting
data set. Some redundancies can be detected by correlation analysis. Given two attributes, such
analysis can measure how strongly one attribute implies the other, based on the available data.
For nominal data, we use the χ2 (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another.
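A minimal sketch of the χ2 test for nominal attributes, assuming scipy is available; the 2x2 contingency table is invented for illustration.

```python
# Chi-square test of independence between two nominal attributes.
from scipy.stats import chi2_contingency

# Rows: male, female; columns: fiction, non-fiction (invented counts).
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # a small p-value suggests the attributes are correlated
```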

3.3 Tuple Duplication


In addition to detecting redundancies between attributes, duplication should also be detected at
the tuple level (e.g., where there are two or more identical tuples for a given unique data entry
case). The use of denormalized tables (often done to improve performance by avoiding joins) is
another source of data redundancy. Inconsistencies often arise between various duplicates, due to
inaccurate data entry or updating some but not all data occurrences. For example, if a purchase
order database contains attributes for the purchaser’s name and address instead of a key to this
information in a purchaser database, discrepancies can occur, such as the same purchaser’s name
appearing with different addresses within the purchase order database.
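A minimal sketch of tuple-level duplicate handling with pandas; the purchaser records are invented, including one near-duplicate of the kind described above.

```python
# Detect and remove identical tuples, then surface near-duplicates where the
# same purchaser appears with inconsistent addresses.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["A. Rao", "A. Rao", "B. Iyer", "A. Rao"],
    "address":   ["12 MG Rd", "12 MG Rd", "4 Park St", "12 M.G. Road"],
})

deduped = orders.drop_duplicates()  # removes the identical tuple
print(deduped)

# Same purchaser, differing addresses: a candidate inconsistency to resolve.
conflicts = deduped.groupby("purchaser")["address"].nunique()
print(conflicts[conflicts > 1])
```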

3.4 Data Value Conflict Detection and Resolution


Data integration also involves the detection and resolution of data value conflicts. For example,
for the same real-world entity, attribute values from different sources may differ. This may be
due to differences in representation, scaling, or encoding. For instance, a weight attribute may be stored in metric units in one system and British imperial units in another.

4. Data Transformation and Data Discretization

In this preprocessing step, the data are transformed or consolidated so that the resulting mining
process may be more efficient, and the patterns found may be easier to understand.

Data Transformation Strategies Overview


In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total amounts. This
step is typically used in constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes
are implicit within the database schema and can be automatically defined at the schema
definition level.
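A minimal sketch of strategies 4 and 5, min-max normalization and discretization, using pandas; the ages, bin edges, and conceptual labels are invented.

```python
# Normalization into [0.0, 1.0] and discretization of a numeric attribute
# (age) into interval labels and conceptual labels.
import pandas as pd

ages = pd.Series([5, 13, 22, 35, 47, 58, 66, 72])

# 4. Normalization: min-max scaling into the range 0.0 to 1.0.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# 5. Discretization: interval labels, then conceptual labels.
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])
concepts = pd.cut(ages, bins=[0, 20, 60, 100],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "normalized": normalized,
                    "interval": intervals, "concept": concepts}))
```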

