Chapter 3: Data Mining
• To some extent, query and reporting tools can assist in answering questions like
"Where did the largest number of students come from last year?"
• But these tools cannot provide any intelligence about why it happened.
Taking the example of a university database system:
o The OLTP system can quickly answer a query like "How many students are enrolled in the university?"
o The OLAP system, using the data warehouse, can show trends in student enrollment (ex: how many students prefer BCA).
o Data mining can answer where the university should market.
The input data is stored in various formats (flat files, spreadsheets, or relational tables).
The purpose of preprocessing is to transform the raw input data into a format appropriate for subsequent analysis.
3.3 Motivating Challenges:
• Traditional analysis techniques often face practical difficulties posed by new data sets.
1) Scalability: Data sets now reach terabyte and petabyte sizes. A data mining
algorithm that handles such massive data must be scalable, which typically requires
parallel and distributed algorithms.
2) High Dimensionality:
Data sets often contain thousands of attributes, and computational complexity increases
as the number of dimensions grows, so data mining algorithms must be able to
deal with data containing many dimensions.
Traditional data analysis techniques can only deal with low-dimensional data.
3) Data Ownership and Distribution: Data is not always stored at one location
and may be scattered across different places in different organizations, so distributed data
mining techniques are required. The challenges faced here are:
1) How to reduce the amount of communication needed to perform
distributed computation
2) How to combine the data mining results obtained from multiple sources
3) How to address data security issues
• Data mining draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to:
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
Data mining tasks are generally divided into two major categories:
• Predictive tasks:
Use some variables to predict unknown or future values of other variables.
Ex: by observing the behaviour of some variables, we can predict the value of another variable.
The attribute to be predicted is called the target or dependent variable.
The attributes used for making the prediction are called the explanatory or independent
variables.
• Descriptive tasks:
Here the objective is to derive patterns (correlations, anomalies, clusters, etc.) that
summarize the underlying relationships in the data.
Descriptive tasks are often exploratory in nature and frequently require postprocessing to validate and explain the results.
The four core data mining tasks are:
1) Predictive Modeling
2) Association analysis
3) Cluster analysis
4) Anomaly detection
1) Predictive Modeling: refers to the task of building a model for the target variable as a
function of the explanatory variables.
There are two types of predictive modeling tasks:
1) Classification: used for discrete target variables. Ex: predicting whether a web
user will make a purchase at an online bookstore is a classification task, because
the target variable is binary-valued.
2) Regression: used for continuous target variables.
Ex: forecasting the future price of a stock is a regression task because price is a
continuous-valued attribute.
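To make the distinction concrete, here is a minimal Python sketch using scikit-learn; the feature names, values, and model choices are invented for illustration and are not from these notes:
```python
# Minimal sketch: classification vs. regression on toy data (scikit-learn).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: discrete (binary) target -- will the web user purchase?
X_clf = [[5, 1], [1, 0], [7, 1], [2, 0]]   # hypothetical: pages viewed, has account
y_clf = [1, 0, 1, 0]                       # 1 = purchase, 0 = no purchase
clf = DecisionTreeClassifier().fit(X_clf, y_clf)
print(clf.predict([[6, 1]]))               # predicted class label

# Regression: continuous target -- a future stock price.
X_reg = [[1], [2], [3], [4]]               # hypothetical day index
y_reg = [10.0, 10.5, 11.1, 11.4]           # observed prices
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))                  # predicted continuous value
```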
2) Association Analysis: a useful application of association analysis is to find groups of data that have
related functionality.
The goal of association analysis is to extract the most interesting patterns in an efficient
manner.
Ex: market basket analysis:
We may discover the rule {Diapers} → {Milk}, which suggests that customers who buy
diapers also tend to buy milk.
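A minimal Python sketch of how the support and confidence of such a rule can be computed; the transactions below are hypothetical baskets invented for illustration:
```python
# Minimal sketch: support and confidence for the rule {diapers} -> {milk}.
transactions = [
    {"diapers", "milk", "bread"},
    {"diapers", "milk"},
    {"bread", "butter"},
    {"diapers", "beer", "milk"},
    {"milk", "bread"},
]

antecedent, consequent = {"diapers"}, {"milk"}
n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n        # fraction of baskets containing diapers AND milk
confidence = both / ante  # of baskets with diapers, fraction that also have milk
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.60, 1.00
```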
3) Cluster Analysis: clustering has been used, for example, to group sets of related customers.
Ex: in the collection of news articles below, the first three articles are about the economy and the
last three are about the health sector. A good clustering algorithm should be able to
identify these two clusters based on the similarity between the words that appear in the articles.
Example:
Article  Words
1        dollar:1, industry:4, country:2, loan:3, government:2
2        machinery:2, labor:3, market:4, industry:2, work:3, country:1
3        job:5, inflation:3, rise:2, jobless:2, market:3, country:2
4        patient:4, symptoms:2, drug:3, health:2, clinic:2, doctor:2
5        death:2, cancer:4, drug:3, public:4, health:4, director:1
6        medical:2, cost:3, increase:2, patient:2, health:3, care:2
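A minimal Python sketch, assuming scikit-learn is available, that clusters these six articles by their word counts; a k-means run (one possible clustering algorithm, not prescribed by these notes) should separate the economy articles from the health articles:
```python
# Minimal sketch: clustering the six articles above by their word counts.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

articles = [
    {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "government": 2},
    {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2},
    {"patient": 4, "symptoms": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 4, "director": 1},
    {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 2},
]

X = DictVectorizer(sparse=False).fit_transform(articles)  # word-count vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # articles 1-3 (economy) vs. 4-6 (health) should separate
```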
3.6 Data:
What is Data?
Data is a collection of data objects and their attributes.
What is an Attribute?
An attribute is a property or characteristic of a data object.
Note: the meaningful operations on attribute values differ by attribute type; for example, you can find
the average age of persons, but you cannot meaningfully average their IDs.
Types of Attributes:
Nominal (values are just different names; they provide identity)
Examples: ID numbers, eye color, zip codes
Ordinal (values have a meaningful order)
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in
{tall, medium, short}
Interval (differences between values are meaningful)
Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio (both differences and ratios of values are meaningful)
Examples: temperature in Kelvin, length, time, counts
3) Ordered data
– Sequential Data
– Sequence Data or Genetic Sequence Data
– Time series or Temporal Data
– Spatial Data
i) Dimensionality
• The dimensionality of a data set is the number of attributes that the objects in
the data set possess. Data with a small number of dimensions tends to be
qualitatively different from moderate- or high-dimensional data.
• The difficulties associated with analyzing high-dimensional data are sometimes
referred to as the curse of dimensionality.
ii) Sparsity
• For data with asymmetric features, most of the attribute values are zero. In
practical terms, sparsity is an advantage because usually only the non-zero values
need to be stored and manipulated. This results in significant savings in
computation time and storage.
• Some data mining algorithms work especially well for sparse data.
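A minimal Python sketch (using scipy, which is not mentioned in these notes but is one common tool) illustrating the storage savings from keeping only the non-zero values:
```python
# Minimal sketch: a sparse format stores only non-zero entries.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))
dense[0, 5] = 3.0
dense[42, 7] = 1.0            # only 2 non-zero values out of 1,000,000 cells

sparse = csr_matrix(dense)    # compressed sparse row format
print(sparse.nnz)             # 2 stored values
print(dense.nbytes)           # 8,000,000 bytes for the dense version
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)
```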
iii) Resolution
• It is frequently possible to obtain data at different levels of resolution, and the
properties of the data often differ at different resolutions.
• Ex: the surface of the earth seems very uneven at a resolution of a few meters,
but is relatively smooth at a resolution of tens of kilometers.
• Ex: photo pixels (the higher the pixel resolution, the clearer the image; the lower the
resolution, the more blurred the image).
1) Record data:
A record data set is a collection of data objects, each of which consists of a fixed set of data fields
(attributes).
Record data is usually stored in flat files or relational tables.
2) Graph-based data:
– If objects have structure, that is, the objects contain sub-objects that have
relationships, then such objects are frequently represented as graphs. Ex: a benzene
molecule.
3) Ordered Data:
a. Sequential Data:
Sequential data, also referred to as temporal data, can be thought of as an extension of
record data, where each record has a time associated with it.
b. Spatial Data:
Some objects have spatial attributes, such as positions or areas, as well as other types of
attributes. An example of spatial data is weather data that is collected for a variety
of geographical locations.
1. Aggregation:
• Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
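A minimal pandas sketch of aggregation as a change of scale; the city names and sales figures are hypothetical:
```python
# Minimal sketch: aggregating city-level sales into state-level sales.
import pandas as pd

df = pd.DataFrame({
    "city":  ["Mysore", "Bangalore", "Mumbai", "Pune"],
    "state": ["Karnataka", "Karnataka", "Maharashtra", "Maharashtra"],
    "sales": [120, 450, 600, 310],
})

# Change of scale (cities -> states), data reduction (4 rows -> 2 rows),
# and the summed values tend to be more stable than the raw ones.
by_state = df.groupby("state", as_index=False)["sales"].sum()
print(by_state)
```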
2. Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data
analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or
time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
• ANALOGY: (Rice: to check whether rice is cooked, we examine only a few grains, not
all of them)
Types of Sampling:
• Simple Random Sampling
– There is an equal probability of selecting any particular item.
– There are two variations on random sampling:
1) Sampling without replacement
• As each item is selected, it is removed from the population.
2) Sampling with replacement
• Objects are not removed from the population as they are selected for the sample.
• In sampling with replacement, the same object can be picked more than once.
• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition
• Progressive Sampling:
– If selecting the proper sample size is difficult, adaptive or progressive sampling is
used. These approaches start with a small sample and then increase the sample
size until a sample of sufficient size has been obtained.
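A minimal Python sketch of the sampling variants described above, using numpy; the population and the stratum split are hypothetical:
```python
# Minimal sketch: sampling with/without replacement and stratified sampling.
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(100)        # hypothetical population of 100 objects

# Simple random sampling without replacement: each item is removed once picked.
s_without = rng.choice(population, size=10, replace=False)

# Simple random sampling with replacement: the same object may appear twice.
s_with = rng.choice(population, size=10, replace=True)

# Stratified sampling: split into partitions (strata), sample from each.
strata = [population[:50], population[50:]]
s_strat = np.concatenate([rng.choice(s, size=5, replace=False) for s in strata])
print(s_without, s_with, s_strat, sep="\n")
```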
3. Dimensionality Reduction:
The complexity of data increases as the number of dimensions grows.
Curse of Dimensionality:
• When dimensionality increases, data becomes increasingly sparse in the space that it
occupies.
• The definitions of density and distance between points, which are critical for
clustering and outlier detection, become less meaningful.
• As a result, many clustering and classification algorithms have trouble with high-dimensional
data, resulting in reduced classification accuracy and poor-quality
clusters.
Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
Feature Subset Selection: another way to reduce dimensionality is to use only a subset of the attributes.
Redundant features
– duplicate much or all of the information contained in one or more other attributes
– Example: the purchase price of a product and the amount of sales tax paid
Irrelevant features
– contain no information that is useful for the data mining task at hand
– Example: students' IDs are often irrelevant to the task of predicting students' Grade Point
Averages
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches:
• Features are selected before the data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find the best subset of attributes
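As one concrete illustration, here is a minimal sketch of dimensionality reduction with PCA in scikit-learn; PCA itself is not discussed in these notes, and the data is randomly generated:
```python
# Minimal sketch: projecting 50-dimensional data down to 2 dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 objects with 50 attributes

pca = PCA(n_components=2)         # keep 2 principal components
X2 = pca.fit_transform(X)         # easier to visualize, cheaper to mine
print(X.shape, "->", X2.shape)    # (200, 50) -> (200, 2)
```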
4. Feature Creation:
• Create new attributes that can capture the important information in a data set
much more efficiently than the original attributes. Three general
methodologies:
– Feature Extraction
• The creation of a new set of features from the original raw data is known as feature
extraction.
5. Discretization and Binarization:
– Transforming continuous and discrete attributes into binary attributes is called
binarization.
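A minimal Python sketch of binarization; the attribute values and the age threshold below are hypothetical:
```python
# Minimal sketch: binarizing a discrete attribute (one-hot encoding)
# and a continuous attribute (thresholding).
import numpy as np
import pandas as pd

grades = pd.Series(["A", "B", "A", "C"])
print(pd.get_dummies(grades, prefix="grade"))  # one binary column per value

ages = np.array([12, 35, 60, 27])
is_adult = (ages >= 18).astype(int)            # continuous -> binary at threshold 18
print(is_adult)                                # [0 1 1 1]
```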
6. Variable Transformation:
– A variable transformation refers to a transformation that is applied to all the values of
a variable (attribute).
– For each object, the transformation is applied to the value of the variable for that object.
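A minimal Python sketch of one common variable transformation, a log transform applied to every value; the values are hypothetical:
```python
# Minimal sketch: a log transformation applied to all values of a variable.
import numpy as np

data_sizes = np.array([1, 10, 100, 1000, 10000])  # highly skewed values
log_sizes = np.log10(data_sizes)                   # applied to every value
print(log_sizes)                                   # [0. 1. 2. 3. 4.]
```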
Basics:
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are.
– Is lower when objects are more alike.
– Minimum dissimilarity is often 0.
• It would seem reasonable that a product P1, which is rated wonderful, would be closer to a
product P2, which is rated good, than it would be to a product P3, which is rated OK.
• To make this observation quantitative, the values of the ordinal attribute are often mapped to
successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}.
• Then d(P1, P2) = 3 − 2 = 1, or, if we want the dissimilarity to fall between 0 and 1,
d(P1, P2) = (3 − 2)/(5 − 1) = 0.25. A similarity for ordinal attributes can
then be defined as s = 1 − d.
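A minimal Python sketch of this ordinal dissimilarity and similarity computation:
```python
# Minimal sketch: dissimilarity and similarity for an ordinal attribute.
scale = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}
n = len(scale)

def dissimilarity(a, b):
    # |rank(a) - rank(b)| normalized by (n - 1) so d falls in [0, 1]
    return abs(scale[a] - scale[b]) / (n - 1)

def similarity(a, b):
    return 1 - dissimilarity(a, b)

print(dissimilarity("good", "OK"))  # (3 - 2)/(5 - 1) = 0.25
print(similarity("good", "OK"))     # 0.75
```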
Review Questions:
1. What is data mining and why do we need it?
2. Write a note on data mining and knowledge discovery.
3. Explain the challenges that motivated the development of data mining.
4. Explain the data mining tasks in detail.
5. Write a note on data, attributes, and objects.
6. Explain the types of attributes in detail.
7. Explain the different types of data sets in detail, with proper examples and figures.
8. Explain data preprocessing in detail, with examples.
9. Write a note on measures of similarity and dissimilarity.