Unit-2
Introduction to Data Mining
Data Mining
Data mining is the process of discovering interesting patterns and knowledge from huge
amounts of data. Data mining is one of the essential steps in the process of KDD (Knowledge
Discovery in Databases).
Why Data Mining? (Motivation)
Data mining helps to turn the huge amount of data into useful information and knowledge
that can have different applications.
Data mining helps in
a. Automatic discovery of patterns
b. Prediction of likely outcomes
c. Creation of actionable information
Data mining can answer questions that cannot be addressed through simple query and
reporting techniques.
Class/Concept Description: Data can be associated with classes or concepts that can be
described in summarized, concise, and yet precise terms. Such descriptions of a concept or
class are called class/concept descriptions. These descriptions can be derived via:
- Data Characterization: Characterization is a summarization of the general
characteristics or features of a target class of data which creates what is called a
characteristic rule.
- Data Discrimination: Data discrimination is a comparison of the general features of
target class data objects with the general features of objects from one or a set of
contrasting classes.
Association analysis on frequent patterns: Frequent patterns are patterns that occur
frequently in data. Association analysis aims to discover associations between items
occurring together frequently.
E.g. buys(X, “computer”) => buys(X, “software”) [support=1%, confidence=50%]
where X is a variable representing a customer. Support=1% means that 1% of all the
transactions under analysis show computer and software purchased together.
Confidence=50% means that if a customer buys a computer, there is a 50% chance that
she will buy software as well.
Classification and Prediction: Classification is the process of finding a model (or function)
that describes and distinguishes data classes or concepts. This model is derived based on
the analysis of a set of training data and used to predict the class label of objects for which
the class label is unknown.
Prediction is used to predict missing or unavailable numeric data values rather than class
labels. Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well.
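As a minimal sketch of classification (one possible method among many), a nearest-neighbor rule assigns an unlabeled object the class of its closest training object. The feature vectors and labels below are hypothetical.

```python
import math

# Hypothetical training data: (feature vector, class label).
training = [
    ((1.0, 1.2), "low_risk"),
    ((1.1, 0.9), "low_risk"),
    ((5.0, 4.8), "high_risk"),
    ((4.7, 5.1), "high_risk"),
]

def classify(x):
    """1-nearest-neighbor: return the label of the closest training point."""
    _, label = min(training, key=lambda pair: math.dist(x, pair[0]))
    return label

print(classify((0.9, 1.0)))  # -> low_risk
print(classify((5.2, 4.9)))  # -> high_risk
```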
Cluster Analysis / Clustering: Clustering analyzes data objects without consulting class
labels. It can be used to generate class labels for a group of data where such labels did not
exist at the beginning. The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass similarity. That is,
clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects in other clusters.
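A minimal k-means sketch illustrates this grouping principle: objects are assigned to the nearest cluster center, and the centers are recomputed until assignments stabilize. The points and the choice of k below are made up for illustration.

```python
import math

def kmeans(points, k, iters=20):
    """Naive k-means on 2-D points; initial centers are the first k points."""
    centers = points[:k]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return clusters

points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
for cluster in kmeans(points, k=2):
    print(cluster)
```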
Outlier Analysis: Outliers are objects that do not comply with the general behavior or
model of the data. Most data mining methods discard outliers as noise or exceptions.
However, in some applications such rare events can be more interesting than the regularly
occurring ones. The analysis of outlier data is referred to as outlier analysis. E.g. fraud
detection.
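One simple (by no means the only) way to flag outliers is the z-score rule: values more than some threshold of standard deviations from the mean are flagged. The threshold of 2 is illustrative; the data reuse the example set from the central tendency section below.

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population standard deviation

# Flag values more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) / sd > 2]
print(outliers)  # -> [110]
```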
Evolution Analysis: Data evolution analysis describes and models regularities or trends for
objects whose behavior changes over time. This may include characterization,
discrimination, association and correlation analysis, classification, prediction, or clustering
of time-related data. Distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
Data Integration: In this step, data from various sources such as databases, data warehouses,
and transactional records are combined.
Data Selection: Data required for the data mining process can be extracted from multiple,
heterogeneous data sources such as databases, files, etc. Data selection is the process of
fetching the data relevant to the analysis from these sources.
Data Transformation: In the transformation stage, data extracted from multiple data
sources are converted into a format appropriate for the data mining process. Data reduction
or summarization may be used to decrease the number of possible data values without
affecting the integrity of the data.
Data Mining: This is the most essential step of the KDD process, where intelligent methods
are applied to extract hidden patterns from the data stored in databases.
Pattern Evaluation: This step identifies the truly interesting patterns representing
knowledge on the basis of some interestingness measures. Support and confidence are two
widely used interestingness measures. These patterns are helpful for decision support
systems.
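These steps can be pictured as a pipeline. The sketch below is only a skeleton: every function is a hypothetical placeholder standing in for a real implementation of that step, not an actual KDD library.

```python
# Skeleton of the KDD process; all stage functions are hypothetical placeholders.
def integrate(sources):   # combine data from multiple sources
    return [row for src in sources for row in src]

def select(data):         # keep only the records relevant to the task
    return [row for row in data if row.get("relevant")]

def transform(data):      # convert records into a mining-ready format
    return [(row["item"], row["amount"]) for row in data]

def mine(data):           # apply an (omitted) pattern-mining method
    return {"patterns": data}

def evaluate(patterns):   # keep patterns passing interestingness tests
    return patterns

sources = [[{"item": "computer", "amount": 1, "relevant": True}],
           [{"item": "software", "amount": 2, "relevant": False}]]
print(evaluate(mine(transform(select(integrate(sources))))))
```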
Data Objects
Data sets are made up of data objects. A data object represents an entity; in a sales database,
the objects may be customers, store items, and sales. Data objects are typically described by
attributes. If the data objects are stored in a database, they are called data tuples.
Attribute
An attribute is a data field, representing a characteristic or feature of a data object. Attributes
describing a customer object can include, for example, customer_ID, name, and address.
On the basis of their sets of possible values, attributes can be divided into the following types:
1) Nominal Attributes: Nominal means “relating to names.” The values of a nominal attribute
are symbols or names of things. Each value represents some kind of category, code, or state,
and so nominal attributes are also referred to as categorical. The values do not have any
meaningful order. E.g.
- Hair_color: possible values are: {black, brown, red, grey, white}
- Marital_status: possible values are: {Married, Single, Divorced, Widowed}
2) Binary Attributes: A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it
is present. E.g. Given the attribute smoker describing a patient object, 1 indicates that
the patient smokes, while 0 indicates that the patient does not.
- A binary attribute is symmetric if both of its states are equally valuable. E.g.
attribute gender having the states male and female.
- A binary attribute is asymmetric if the outcomes of the states are not equally important,
such as the positive (1) and negative (0) outcomes of a medical test for HIV.
3) Ordinal Attributes: An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between successive values is
not known. E.g. Height: possible values are: {Tall, Medium, Short}. The values have a
meaningful sequence (which corresponds to increasing height); however, we cannot tell
from the values how much bigger, say, a medium is than a short. Other examples of ordinal
attributes include grade (e.g., A+, A, A−, B+, and so on).
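These distinctions matter when encoding data: nominal values get arbitrary codes, binary values 0/1, and ordinal values codes that preserve their ranking. The mappings below are illustrative, not a standard scheme.

```python
# Illustrative encodings for the three attribute types described above.
hair_color = {"black": 0, "brown": 1, "red": 2, "grey": 3, "white": 4}  # nominal: codes carry no order
smoker = {"no": 0, "yes": 1}                                            # binary: absent/present
height = {"Short": 0, "Medium": 1, "Tall": 2}                           # ordinal: order is meaningful

# Comparing ordinal codes respects the ranking; comparing nominal codes would not.
print(height["Tall"] > height["Short"])  # True
```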
Median: A better measure of the center of data is the median, which is the middle value in
a set of ordered data values. It is the value that separates the higher half of a data set from
the lower half.
Suppose that a given data set of 𝑁 values for an attribute 𝑋 is sorted in increasing order. If
𝑁 is odd, then the median is the middle value of the ordered set. If 𝑁 is even, then the
median is not unique; it is the two middlemost values and any value in between. If 𝑋 is a
numeric attribute, the median is then taken, by convention, as the average of the two
middlemost values.
Mode: The mode for a set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative attributes. Data sets with
one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In
general, a data set with two or more modes is multimodal. At the other extreme, if each
data value occurs only once, then there is no mode.
For unimodal numeric data, we have the following empirical relation:
mean − mode ≈ 3 × (mean − median)
Midrange: The midrange can also be used to assess the central tendency of a numeric data
set. It is the average of the largest and smallest values in the set.
Example:
Let the values be 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Mean (x̄) = (30+36+47+50+52+52+56+60+63+70+70+110) / 12 = 58
Median = (52+56) / 2 = 54
Mode: The given data are bimodal. The two modes are 52 and 70.
Midrange = (30+110) / 2 = 70
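These results can be checked with Python's standard statistics module (available since Python 3.8 for multimode); midrange has no built-in, so it is computed directly:

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(values))          # 58
print(statistics.median(values))        # 54.0 (average of 52 and 56)
print(statistics.multimode(values))     # [52, 70] -> bimodal
print((min(values) + max(values)) / 2)  # midrange: 70.0
```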
Measure of Dispersion
Measures of dispersion indicate how much the observed data is spread out around a measure
of central tendency. The measures include range, quantiles, quartiles, percentiles, and the
interquartile range. Variance and standard deviation also indicate the spread of a data
distribution.
Range: The range of the set is the difference between the largest (max()) and smallest
(min()) values.
Quantiles: Suppose that the data for attribute X are sorted in increasing numeric order.
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
- The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
Quartiles: The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
Percentiles: The 100-quantiles are more commonly referred to as percentiles; they divide
the data distribution into 100 equal-sized consecutive sets.
Interquartile Range: The distance between the first quartile (25th percentile) and the third
quartile (75th percentile) is called the interquartile range (IQR): IQR = Q3 − Q1.
Standard Deviation: The standard deviation, 𝜎, of the observations is the square root of
the variance, σ2. A low standard deviation means that the data observations tend to be very
close to the mean, while a high standard deviation indicates that the data are spread out
over a large range of values.
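A minimal sketch of these dispersion measures on the same example data, using NumPy (np.percentile computes the quartiles; NumPy's var and std default to the population versions):

```python
import numpy as np

values = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

data_range = values.max() - values.min()  # range = 110 - 30 = 80
q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range
variance = values.var()                   # population variance
std = values.std()                        # population standard deviation (~19.47)

print(data_range, q1, q3, iqr, round(std, 2))
```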
f) Handling noisy or incomplete data: Data cleaning methods are required to handle noise
and incomplete objects while mining data regularities. Without such cleaning methods, the
accuracy of the discovered patterns will be poor.
g) Pattern evaluation: The patterns discovered may be uninteresting because they represent
common knowledge or lack novelty. The challenge is to evaluate the interestingness of
discovered patterns using suitable measures.
Performance Issues:
a) Efficiency and scalability of data mining algorithms: In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient
and scalable.
b) Parallel, distributed, and incremental mining algorithms: Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel; the results from
the partitions are then merged. Incremental mining algorithms incorporate database
updates without having to mine the entire data again from scratch.
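The partition-process-merge idea can be sketched as follows. The per-partition work here is a simple item count standing in for any local mining step, and the partitions are processed sequentially for simplicity; a real system would run them in parallel across processes or machines.

```python
from collections import Counter

def mine_partition(partition):
    """Per-partition work: count item occurrences (a stand-in for local mining)."""
    return Counter(item for transaction in partition for item in transaction)

transactions = [["computer", "software"], ["computer"],
                ["software", "printer"], ["computer", "software"]]

# Partition the data (two partitions here); each could be mined in parallel.
partitions = [transactions[:2], transactions[2:]]
partial_results = [mine_partition(p) for p in partitions]

# Merge the per-partition results into a global count.
merged = sum(partial_results, Counter())
print(merged)  # Counter({'computer': 3, 'software': 3, 'printer': 1})
```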
3. Fraud Detection
Data mining is also used in credit card services and telecommunications to detect fraud.
For fraudulent telephone calls, it helps analyze the destination of the call, its duration, the
time of day or week, etc. It also detects patterns that deviate from expected norms.
4. Intrusion Detection
Data mining can help improve intrusion detection by adding a level of focus to anomaly
detection. It helps an analyst distinguish unusual activity from common everyday network
activity.
7. Space Science
Data mining can be used to automate the analysis of image data collected from sky surveys
with better accuracy.