
Data Mining and Warehousing
Data, Information & Knowledge
• Data are the raw facts (alphanumeric values) obtained through different acquisition methods. Data in their simplest form consist of raw alphanumeric values.
• Information is created when data are processed, organized, or structured to provide context and meaning. Information is essentially processed data.
• Knowledge is what we know. Knowledge is unique to each individual and is the accumulation of past experience and insight that shapes the lens through which we interpret, and assign meaning to, information.
• Wisdom is the ability to make sensible decisions and judgments because of your knowledge or experience.
Types of Knowledge
• Explicit knowledge

Explicit knowledge is knowledge covering topics that are easy to systematically document (in writing) and share at scale: what we think of as structured information. When explicit knowledge is well managed, it can help a company make better decisions, save time, and sustain an increase in performance.

• Implicit knowledge

Implicit knowledge is, essentially, learned skills or know-how. It is gained by taking explicit knowledge and applying it to a specific situation. If explicit knowledge is a book on the mechanics of flight and a layout diagram of an airplane cockpit, implicit knowledge is what happens when you apply that information in order to fly the plane.
• Declarative knowledge

Declarative knowledge, which can also be understood as propositional knowledge, refers to static information and facts that are specific to a given topic and can be easily accessed and retrieved. It is a type of knowledge where the individual is consciously aware of their understanding of the subject matter.

• Procedural knowledge

Procedural knowledge focuses on the 'how' behind which things operate and is demonstrated through one's ability to do something. Where declarative knowledge focuses more on the 'who, what, where, or when', procedural knowledge is less articulated and is shown through action or documented through manuals.
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society

• Major sources of abundant data


• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!


• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
Evolution of Database Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
What is Data Mining?
• Data mining refers to extracting or mining knowledge from large amounts of data.

• Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems.

• It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.

• Data mining is one of the most useful techniques that help entrepreneurs, researchers, and individuals to extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).

• The overall goal of the data mining process is to extract information from a data set and transform it into
an understandable structure for further use.
Advantages of Data Mining
• The Data Mining technique enables organizations to obtain knowledge-based data.

• Data mining enables organizations to make modifications in operation and production.

• Compared with other statistical data applications, data mining is cost-efficient.

• Data Mining helps the decision-making process of an organization.

• It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.

• It is a quick process that makes it easy for new users to analyze data in a short time.
Data Mining Applications
Classification of Data Mining Systems
• Data mining refers to the process of extracting important data from raw data. It analyses data patterns in huge sets of data with the help of software tools. Ever since its development, data mining has been adopted by researchers in the research and development field.

• With data mining, businesses can gain more profit. It has helped in determining business objectives for making clear decisions.

• To understand the system and meet the desired requirements, data mining can be classified into the following systems:
Challenges of Data Mining
• Security and Social Challenges: Decision-Making strategies are done through data collection-sharing, so it
requires considerable security.

• User Interface: The knowledge discovered using data mining tools is useful only if it is interesting and
above all understandable by the user.

• Mining Methodology Challenges: These challenges are related to data mining approaches and their
limitations.

(i) Versatility of the mining approaches,

(ii) Diversity of data available,

(iii) Dimensionality of the domain,

(iv) Control and handling of noise in data, etc.


Challenges of Data Mining
• Complex Data: Real-world data is heterogeneous and it could be multimedia data
containing images, audio and video, complex data, temporal data, spatial data,
time series, natural language text etc. It is difficult to handle these various kinds of
data and extract the required information.

• Performance: The performance of a data mining system depends on the efficiency of the algorithms and techniques being used. Algorithms and techniques that are not well designed degrade the performance of the data mining process.
What is Knowledge Discovery?
• The following diagram shows the process of knowledge discovery
The list of steps involved in the knowledge discovery process −

• Data Cleaning − In this step, the noise and inconsistent data is removed.

• Data Integration − In this step, multiple data sources are combined.

• Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.

• Data Transformation − In this step, data is transformed or consolidated into appropriate forms for mining by performing summary or aggregation operations.

• Data Mining − In this step, intelligent methods are applied in order to extract data patterns.

• Pattern Evaluation − In this step, data patterns are evaluated.

• Knowledge Presentation − In this step, knowledge is represented.


Architecture of Data Mining
• Knowledge Base: This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction.

• Data Mining Engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as association and correlation analysis, classification, prediction,
cluster analysis and evolution analysis.

• Pattern Evaluation Module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.

• User interface: This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing information to
help focus the search.
Data Preprocessing:

• Need of Data Preprocessing

• Data Cleaning Process

• Data Integration Process

• Data Reduction Process

• Data Transformation Process


Need of Data Preprocessing

• Data preprocessing refers to the set of techniques applied to databases to remove noisy, missing, and inconsistent data. The data preprocessing techniques involved in data mining are data cleaning, data integration, data reduction, and data transformation.
Data Cleaning Process

• Data in the real world is usually incomplete, inconsistent, and noisy. The data cleaning process includes procedures that aim at filling in missing values, smoothing out noise by detecting outliers, and rectifying inconsistencies in the data.

• Let us discuss the basic methods of data cleaning


1. Missing Values

Assume that you are dealing with data such as sales and customer data and you observe that several attributes have missing values. One cannot compute on data with missing values, so the problem has to be handled. There are some methods which sort out this problem. Let us go through them one by one.

1.1 Ignore the tuple: This is usually done when the class label is missing. It is not effective when the percentage of missing values per attribute varies considerably.

1.2 Enter the missing value manually or fill it with a global constant: When the database contains many missing values, filling them in manually is not feasible because it is time-consuming. Another method is to fill them with some global constant.

1.3 Fill the missing value with the attribute mean or the most probable value: Filling the missing value with the attribute mean is another option (see the sketch below). Filling with the most probable value uses regression or decision tree induction.
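As an illustration of option 1.3, the sketch below fills the missing values of a numeric attribute with the attribute mean using pandas; the customer/income table is made up for this example.

```python
# A minimal sketch of handling missing values by filling a numeric attribute
# with its mean (the customer/income data below is hypothetical).
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income":      [50000, None, 62000, None, 48000],
})

# Replace the missing income values with the attribute mean.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```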
2. Noisy Data
• Noise refers to any error in a measured variable. If a numerical attribute is given
you need to smooth out the data by eliminating noise. Some data smoothing
techniques are as follows,
2.1. Binning:
• Smoothing by bin means: In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin.
• Smoothing by bin median: In this method, each bin value is replaced by its bin
median value.
• Smoothing by bin boundaries: In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced with the closest boundary value.
• Let us understand with an example,
• Data for price: 15, 8, 21, 26, 21, 9, 25, 4, 34, 28, 24, 29
• Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency bins (depth of 4): Bin 1: 4, 8, 9, 15; Bin 2: 21, 21, 24, 25; Bin 3: 26, 28, 29, 34

Smoothing by bin means: Bin 1: 9, 9, 9, 9; Bin 2: 23, 23, 23, 23; Bin 3: 29, 29, 29, 29

Smoothing by bin medians: Bin 1: 9, 9, 9, 9; Bin 2: 24, 24, 24, 24; Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries: Bin 1: 4, 4, 4, 15; Bin 2: 21, 21, 25, 25; Bin 3: 26, 26, 26, 34
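A small sketch of the example above, assuming equal-frequency bins of depth 4 and rounding the bin means to whole numbers as in the slide:

```python
# Equal-frequency binning of the price data, with smoothing by bin means
# and by bin boundaries (means are rounded, matching the example).
prices = [15, 8, 21, 26, 21, 9, 25, 4, 34, 28, 24, 29]

def make_bins(values, depth):
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = make_bins(prices, depth=4)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```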
2.2. Regression

• Regression is used to predict values. Linear regression uses the formula of a straight line to predict the value of y for a given value of x, whereas multiple linear regression predicts the value of a variable from the given values of two or more other variables.
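A minimal sketch of linear regression as a prediction/smoothing tool, using NumPy's least-squares polynomial fit on made-up x/y values:

```python
# Fit a straight line y = slope * x + intercept to hypothetical data points.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit of degree 1
print(f"y = {slope:.2f} * x + {intercept:.2f}")
print("prediction at x = 6:", slope * 6 + intercept)
```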
3) Data Integration Process
• Data integration is a data preprocessing technique that involves combining data from multiple heterogeneous data sources into a coherent data store and providing a unified view of the data. These sources may include multiple data cubes, databases, or flat files.
• 3.1 Issues in Data Integration
• There are a number of issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained briefly below.
• 3.1.1. Entity identification:
• Integrate metadata from different sources.
• Matching real-world entities from multiple data sources is referred to as the entity identification problem.
• For example, how can the data analyst or the computer be certain that customer id in one database and customer number in another refer to the same attribute?
• 3.1.2. Redundancy problem:
• An attribute may be redundant if it can be derived from another attribute or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
• 3.1.3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration. Attribute values from different sources may differ for the same real-world entity. An attribute in one system may be recorded at a lower level of abstraction than the "same" attribute in another.
4) Data Transformation Process
• In the data transformation process, data are transformed from one format to another format that is more appropriate for data mining. Some data transformation strategies are:
• Smoothing: Smoothing is a process of removing noise from the data.
• Aggregation: Aggregation is a process where summary or aggregation operations are applied to the data.
• Generalization: In generalization, low-level data are replaced with high-level data by climbing concept hierarchies.
• Normalization: Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to 1.0 (see the sketch below).
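For example, min-max normalization rescales an attribute into the range 0.0 to 1.0; the income values below are made up for illustration.

```python
# A minimal sketch of min-max normalization into [0.0, 1.0].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [12000, 35000, 58000, 73000, 98000]
print(min_max_normalize(incomes))   # 12000 -> 0.0, 98000 -> 1.0
```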
5) Data Reduction Process
• Data warehouses usually store large amounts of data, so data mining operations take a long time to process it. Data reduction techniques help to minimize the size of the dataset without affecting the result. The following methods are commonly used for data reduction:
• Data cube aggregation:- Refers to a method where aggregation operations are performed on data to
create a data cube, which helps to analyze business trends and performance.
• Attribute subset selection:- Refers to a method where redundant attributes or dimensions or irrelevant
data may be identified and removed.
• Data Compression:- Refers to a method where encoding techniques are used to minimize the size of the
data set.
• Numerosity reduction:- Refers to a method where smaller data representation replaces the data.
• Discretization and concept hierarchy generation:- Refers to methods where higher conceptual values
replace raw data values for attributes. Data discretization is a type of numerosity reduction for the
automatic generation of concept hierarchies.
Data Cube
• A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
• A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, like sales or transactions. A fact table represents this theme. Facts are numerical measures.
• Dimensions are the perspectives or entities with respect to which a data cube is defined. Facts are generally quantities, which are used for analyzing the relationships between dimensions.
The figure below shows the data cube for All Electronics sales.

Each dimension has a dimension table which contains a further description of that dimension. For example, a branch dimension may have branch_name, branch_code, branch_address, etc.
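As a rough illustration (not the AllElectronics data, which is not reproduced here), the sketch below aggregates a hypothetical sales fact table along the branch and quarter dimensions with pandas, which is the essence of data cube aggregation:

```python
# Aggregate the measure (amount) of a made-up fact table along two dimensions.
import pandas as pd

sales = pd.DataFrame({
    "branch":  ["B1", "B1", "B2", "B2", "B1", "B2"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "amount":  [100, 150, 200, 120, 80, 60],
})

cube = sales.pivot_table(index="branch", columns="quarter",
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)
```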
Attribute Subset Selection in Data Mining
• Attribute subset selection is a technique used for data reduction in the data mining process. Data reduction reduces the size of the data so that it can be used for analysis purposes more efficiently.
• This is a kind of greedy approach in which a significance level is decided (a commonly used significance level is 5%) and models are tested repeatedly until the p-value (probability value) of every selected attribute is less than or equal to the chosen significance level.
• Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
• Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (having the minimum p-value) is chosen and added to the minimal set. In each iteration, one attribute is added to the reduced set.
• Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set of attributes.
• Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most commonly used technique for attribute selection.
• Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure with nodes denoting a test on an attribute. Each branch corresponds to an outcome of the test, and leaf nodes denote a class prediction. Attributes that are not part of the tree are considered irrelevant and are discarded.
Step-Wise Forward Selection

• Suppose there are the following attributes in the data set, of which a few attributes are redundant (a code sketch follows the steps).
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
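The sketch below mimics this walkthrough. The p_value() scorer is a hypothetical placeholder with hand-picked scores; in practice it would come from refitting a statistical model with each candidate attribute added.

```python
# A minimal sketch of stepwise forward selection driven by p-values.
def p_value(selected, candidate):
    # Hypothetical significance scores, for illustration only.
    scores = {"X1": 0.01, "X2": 0.02, "X3": 0.30, "X4": 0.40, "X5": 0.03, "X6": 0.25}
    return scores[candidate]

def forward_selection(attributes, alpha=0.05):
    selected, remaining = [], list(attributes)
    while remaining:
        best = min(remaining, key=lambda a: p_value(selected, a))
        if p_value(selected, best) > alpha:
            break                      # no remaining attribute is significant
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(["X1", "X2", "X3", "X4", "X5", "X6"]))
# -> ['X1', 'X2', 'X5'] with the hypothetical scores above
```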
Data Compression
• Data compression is a technique used in data mining to reduce the size of a dataset while still preserving the most important information. This can be beneficial in situations where the dataset is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
• Types of data compression techniques
• There are various types of data compression techniques, but the two common types that always stand out are:
1. Lossy
2. Lossless
• Lossy compression

To understand the lossy compression technique, we must first understand the difference between data and information. Data is a raw, often unorganized collection of facts or values and can mean numbers, text, symbols, etc. Information, on the other hand, brings context by carefully organizing those facts. Lossy compression keeps the essential information but discards some of the raw detail, so the original data cannot be reconstructed exactly.

• Lossless Compression

Lossless compression, unlike lossy compression, does not remove any data; instead, it transforms the data to reduce its size, so the original can be reconstructed exactly.
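As a simple illustration of a lossless scheme (not one the slides name), run-length encoding transforms the data into (value, count) pairs and can restore the original exactly:

```python
# A minimal run-length encoding sketch: compress, then reconstruct losslessly.
def rle_encode(text):
    encoded, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        encoded.append((text[i], j - i))
        i = j
    return encoded

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

original = "AAAABBBCCDAA"
packed = rle_encode(original)
print(packed)                          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(packed) == original  # lossless: exact reconstruction
```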
Numerosity Reduction
• In numerosity reduction, the data volume is decreased by choosing an alternative, smaller form of data representation.
• These techniques can be parametric or non-parametric.
• In parametric methods, a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data (for example, log-linear models).
• Non-parametric methods store a reduced representation of the data using techniques such as histograms, clustering, and sampling (see the sketch below).
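For instance, simple random sampling without replacement is a non-parametric way to keep a smaller representation of the data; the dataset below is made up.

```python
# A minimal sketch of numerosity reduction by simple random sampling.
import random

data = list(range(1, 10001))          # 10,000 hypothetical records
sample = random.sample(data, k=500)   # keep a 5% sample as the reduced representation
print(len(sample), sample[:10])
```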
Discretization
• Data discretization refers to a method of converting a large number of data values into a smaller number of intervals so that the evaluation and management of the data become easier.
• In other words, data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum data loss.
• Example (see also the sketch below):
Age: 5, 7, 8, 12, 15, 18, 25, 35, 45, 60, 72, 82
Discretization: child (5, 7, 8), young (12, 15, 18, 25), mature (35, 45), old (60, 72, 82)
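A minimal sketch of the age example, assuming cut points at 10, 25, and 50 (the slide does not state the exact interval boundaries):

```python
# Map each continuous age value to a labelled interval.
def discretize_age(age):
    if age <= 10:          # assumed boundary for "child"
        return "child"
    elif age <= 25:        # assumed boundary for "young"
        return "young"
    elif age <= 50:        # assumed boundary for "mature"
        return "mature"
    return "old"

ages = [5, 7, 8, 12, 15, 18, 25, 35, 45, 60, 72, 82]
print([(a, discretize_age(a)) for a in ages])
```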
Association Rule
• Association rule mining is a procedure which aims to observe
frequently occurring patterns, correlations, or associations from
datasets found in various kinds of databases such as relational
databases, transactional databases, and other forms of repositories.
• Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
• The Association rule is a learning technique that helps identify the
dependencies between two data items. Based on the dependency, it
then maps accordingly so that it can be more profitable.
• Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are observed:

• Support: Support indicates how frequently the if/then relationship appears in the database.

• Confidence: Confidence indicates the number of times these relationships have been found to be true.
Market Basket Analysis
• A data mining technique that is used to uncover purchase patterns in
any retail setting is known as Market Basket Analysis.
• In simple terms, market basket analysis in data mining analyzes the combinations of products that are bought together.
• This technique gives a careful study of the purchases made by a customer in a supermarket.
• This concept identifies the pattern of items frequently purchased together by customers. The analysis can help companies promote deals, offers, and sales, and data mining techniques help to achieve this analysis task.
How does Association Rule Learning work?
• Association rule learning works on the concept of If and Else statements, such as if A then B.
• To measure the associations between thousands of data items, there are several metrics.
• These metrics are given below:
Support
Confidence
Lift
• Support: Support is the frequency of A, or how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. If T is the total number of transactions, it can be written as:
Supp(X) = freq(X) / T
• Confidence: Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset when the occurrence of X is already given. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X => Y) = freq(X, Y) / freq(X)
• Lift: Lift measures the strength of a rule and is defined by the formula below (a worked sketch follows):
Lift(X => Y) = Supp(X, Y) / (Supp(X) * Supp(Y))
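The sketch below computes support, confidence, and lift for a rule X => Y over a small, made-up set of market-basket transactions:

```python
# Compute support, confidence and lift for the rule {bread} => {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    count = sum(1 for t in transactions if itemset <= t)  # subset test
    return count / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

def lift(x, y):
    return support(x | y) / (support(x) * support(y))

x, y = {"bread"}, {"milk"}
print("support:", support(x | y))      # 3/5 = 0.6
print("confidence:", confidence(x, y)) # 0.6 / 0.8 = 0.75
print("lift:", lift(x, y))             # 0.6 / (0.8 * 0.8) = 0.9375
```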
