Data Mining and Warehousing-1
Data, Information & Knowledge
• Data are the raw facts (in their simplest form, alphanumeric values) obtained
through different acquisition methods.
• Information is created when data are processed, organized, or structured to
provide context and meaning. Information is essentially processed data.
• Knowledge is what we know. Knowledge is unique to each individual and is
the accumulation of past experience and insight that shapes the lens by which
we interpret, and assign meaning to, information.
• Wisdom is the ability to make sensible decisions and judgments based on
knowledge and experience.
Types of Knowledge
• Explicit knowledge
• Implicit knowledge
• Procedural knowledge
• Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems.
• It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
• Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals
extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in
Databases (KDD).
• The overall goal of the data mining process is to extract information from a data set and transform it into
an understandable structure for further use.
Advantages of Data Mining
• Data mining enables organizations to derive knowledge from their data.
• It lets new users analyze large volumes of data in a short time.
Data Mining Applications
Classification of Data Mining Systems
• Data mining refers to the process of extracting useful information from raw data.
It analyses patterns in huge sets of data with the help of software tools.
Since its development, data mining has been widely adopted by researchers in
research and development.
• Data mining can help businesses increase profit. It supports clear
decision-making by helping to determine business objectives.
• To match a system to the desired requirements, data mining systems can be
classified as follows:
Challenges of Data Mining
• Security and Social Challenges: Decision-making strategies rely on collecting and sharing data, so
considerable security is required.
• User Interface: The knowledge discovered using data mining tools is useful only if it is interesting and
above all understandable by the user.
• Mining Methodology Challenges: These challenges are related to data mining approaches and their
limitations.
• Data Cleaning − In this step, noise and inconsistent data are removed.
• Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
• Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
• Data Mining Engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as association and correlation analysis, classification, prediction,
cluster analysis and evolution analysis.
• Pattern Evaluation Module: This component typically employs interestingness measures and interacts
with the data mining modules so as to focus the search toward interesting patterns.
• User interface: This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing information to
help focus the search.
Data Preprocessing:
1. Missing Data
Suppose you are working with data such as sales and customer records and you find that several
attributes have missing values. Most analyses cannot be run directly on data with missing values, so
these must be handled first. The common methods are described below, one by one.
1.1. Ignore the tuple: This method is usually used when the class label is missing. It is not
effective when the percentage of missing values per attribute varies considerably.
1.2. Enter the missing value manually or fill it with a global constant: Filling in values manually is
time-consuming and is not feasible when the database contains a large number of missing values.
The alternative is to replace every missing value with the same global constant.
1.3. Fill the missing value with the attribute mean or the most probable value: A missing value can
be replaced with the mean of the attribute. Alternatively, the most probable value can be predicted
using regression or a decision tree.
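The first three strategies above can be sketched in a few lines of Python. This is only a minimal illustration: the records, the attribute names (customer, income, label) and the constant "Unknown" are made-up assumptions, and predicting the most probable value with regression or a decision tree is not shown.

```python
from statistics import mean

# Hypothetical sales/customer records; None marks a missing value.
records = [
    {"customer": "A", "income": 40000, "label": "yes"},
    {"customer": "B", "income": None, "label": "no"},
    {"customer": "C", "income": 60000, "label": None},
    {"customer": "D", "income": 50000, "label": "yes"},
]

def ignore_tuples(rows, class_attr):
    """1.1: drop every tuple whose class label is missing."""
    return [r for r in rows if r[class_attr] is not None]

def fill_constant(rows, attr, constant="Unknown"):
    """1.2: replace missing values of attr with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """1.3: replace missing values of attr with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    attr_mean = mean(observed)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]

labelled = ignore_tuples(records, "label")       # customer C is dropped
with_constant = fill_constant(records, "income")  # B's income becomes "Unknown"
with_mean = fill_mean(records, "income")          # B's income becomes the mean
```

Each helper returns a new list and leaves the original records untouched, so the three strategies can be compared side by side on the same data.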
2. Noisy Data
• Noise refers to random error or variance in a measured variable. For a
numerical attribute, the data can be smoothed to reduce the noise. Some data
smoothing techniques are as follows:
2.1. Binning:
• Smoothing by bin means: In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin.
• Smoothing by bin median: In this method, each value in a bin is replaced by
the median value of the bin.
• Smoothing by bin boundary: In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Every value
in the bin is then replaced with the closest boundary value.
• Let us understand this with an example:
• Data for price: 15, 8, 21, 26, 21, 9, 25, 4, 34, 28, 24, 29
• Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
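The three smoothing methods can be tried on the price data above with the short Python sketch below. The bin size of 4 (three equal-frequency bins) and the rule that a tie between the two boundaries goes to the lower one are assumptions made for this illustration; the function names are also illustrative.

```python
from statistics import mean, median

prices = [15, 8, 21, 26, 21, 9, 25, 4, 34, 28, 24, 29]

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_mean(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_median(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    """Replace every value with the nearer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so the ends are the boundaries
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = equal_frequency_bins(prices, bin_size=4)
print(bins)
print(smooth_by_mean(bins))
print(smooth_by_median(bins))
print(smooth_by_boundary(bins))
```

With bins of four values, the first bin (4, 8, 9, 15) becomes 9, 9, 9, 9 under mean smoothing, 8.5, 8.5, 8.5, 8.5 under median smoothing, and 4, 4, 4, 15 under boundary smoothing.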
Each dimension has a dimension table which contains a further description of that dimension.
For example, a branch dimension table may have the attributes branch_name, branch_code, branch_address, etc.
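As a rough illustration, a branch dimension table can be modelled as a lookup from a key to its descriptive attributes. In the sketch below the branch_key, the sample rows, the sales_facts list and the units_sold measure are all hypothetical; only the attribute names branch_name, branch_code and branch_address come from the text.

```python
# Hypothetical branch dimension table, keyed by an assumed surrogate branch_key.
branch_dim = {
    1: {"branch_name": "Central", "branch_code": "B001", "branch_address": "12 Main St"},
    2: {"branch_name": "North", "branch_code": "B002", "branch_address": "48 Hill Rd"},
}

# Hypothetical fact rows: each stores only the key of its branch plus a measure.
sales_facts = [
    {"branch_key": 1, "units_sold": 30},
    {"branch_key": 2, "units_sold": 12},
    {"branch_key": 1, "units_sold": 8},
]

def units_by_branch_name(facts, dim):
    """Look up each fact row's branch in the dimension table and total the units."""
    totals = {}
    for row in facts:
        name = dim[row["branch_key"]]["branch_name"]
        totals[name] = totals.get(name, 0) + row["units_sold"]
    return totals

totals = units_by_branch_name(sales_facts, branch_dim)
print(totals)
```

The descriptive attributes are stored once per branch in the dimension table, and every other row refers to them by key.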
Attribute Subset Selection in Data Mining
• Attribute subset selection is a technique used for data
reduction in the data mining process. Data reduction reduces the size of
the data so that it can be analyzed more efficiently.
• This is a greedy approach in which a significance level is
chosen (5% is a commonly used value) and the
models are refitted repeatedly until the p-value (probability value) of
every remaining attribute is less than or equal to the selected significance level.
• Methods of Attribute Subset Selection:
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
• Stepwise Forward Selection: This procedure starts with an empty set of
attributes as the minimal set. In each iteration, the most relevant remaining
attribute (the one with the minimum p-value) is added to the reduced
set.
• Stepwise Backward Elimination: Here the initial set contains all the
attributes. In each iteration, one attribute whose p-value is higher than
the significance level is eliminated from the set.
• Combination of Forward Selection and Backward Elimination: Stepwise
forward selection and backward elimination are combined so as to
select the relevant attributes most efficiently. This is the technique
most commonly used for attribute selection.
• Decision Tree Induction: This approach uses a decision tree for attribute
selection. It constructs a flowchart-like structure in which each internal node
denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each leaf node is a class prediction. Attributes that do not appear
in the tree are considered irrelevant and are discarded.
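The two stepwise procedures can be sketched as greedy loops in Python. This is a simplified illustration, not a full statistical implementation: the p_value argument is an assumed caller-supplied function that would normally refit a model and return an attribute's p-value, and toy_p_value is a made-up stand-in with fixed values so that the example runs on its own.

```python
def forward_selection(attributes, p_value, alpha=0.05):
    """Start empty; keep adding the attribute with the smallest p-value <= alpha."""
    selected = []
    remaining = list(attributes)
    while remaining:
        # p_value(attr, selected): p-value of attr when added to the current set.
        best = min(remaining, key=lambda a: p_value(a, selected))
        if p_value(best, selected) > alpha:
            break  # no remaining attribute is significant
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_elimination(attributes, p_value, alpha=0.05):
    """Start with all attributes; keep dropping the one with the largest p-value > alpha."""
    selected = list(attributes)
    while selected:
        def p_of(attr):
            # p-value of attr in the model that also contains the other attributes.
            return p_value(attr, [s for s in selected if s != attr])
        worst = max(selected, key=p_of)
        if p_of(worst) <= alpha:
            break  # every remaining attribute is significant
        selected.remove(worst)
    return selected

# Toy stand-in for a real model fit: fixed p-values that ignore the other attributes.
toy_p = {"age": 0.01, "income": 0.03, "zip_code": 0.40, "shoe_size": 0.75}

def toy_p_value(attr, others):
    return toy_p[attr]

forward_result = forward_selection(toy_p, toy_p_value)
backward_result = backward_elimination(toy_p, toy_p_value)
print(forward_result)
print(backward_result)
```

With the toy p-values both procedures keep age and income. With a real model the p-values change each time the attribute set changes, so the two procedures can end with different subsets.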
Step-Wise Forward Selection
• Lossless Compression