
UNIT-1: Introduction to Data Mining

Definition & Overview


Data mining is the computational process of discovering patterns in large data sets using methods from machine learning, statistics, and database systems. It is a key step in the Knowledge Discovery in Databases (KDD) process, the overall process of converting raw data into useful information. Data mining aims to find useful information that was not previously known, which can then be acted upon for decision-making, prediction, or analysis.

About Data Mining


• Definition: Data mining is the process of extracting useful information and knowledge from large datasets.
• Goal: To discover patterns, trends, and relationships that are not easily apparent.
• Applications: Marketing, finance, healthcare, retail, and more.

There are a number of data mining functionalities. These include class/concept characterization and discrimination (1), the mining of frequent patterns, associations, and correlations (2), classification and regression (3), cluster analysis (4), and outlier analysis (5). Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize properties of the data in a target data set. Predictive mining tasks perform induction on the current data in order to make predictions.

Data Mining Functionality:


1. Class/Concept Descriptions: Data entries can be associated with classes or concepts, and it is often helpful to describe individual classes and concepts in summarized, concise, and yet accurate terms. Such class or concept definitions are referred to as class/concept descriptions.
• Data Characterization: This refers to summarizing the general characteristics or features of the class under study. The output of data characterization can be presented in various forms, including pie charts, bar charts, curves, and multidimensional data cubes.
Example: To summarize the characteristics of customers who spend more than $5000 a year at AllElectronics, the result is a general profile of those customers, such as that they are 40-50 years old, employed, and have an excellent credit rating. Similarly, we might characterize software products whose sales increased by 10% in the previous year.
• Data Discrimination: This compares the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
Example: We may want to compare two groups of customers: those who shop for computer products regularly and those who rarely shop for such products (less than three times a year). The resulting description provides a general comparative profile of those customers, such as: 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university degree, while 60% of the customers who infrequently buy such products are either seniors or youths and have no university degree.
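A minimal sketch of data characterization and discrimination using pandas; the customer table, column names, and thresholds below are invented for illustration:

import pandas as pd

# Hypothetical customer table; values are made up.
customers = pd.DataFrame({
    "age": [45, 28, 52, 41, 33, 47],
    "annual_spend": [6200, 1800, 7400, 5300, 900, 5100],
    "purchases_per_year": [12, 2, 9, 8, 1, 11],
    "credit_rating": ["excellent", "fair", "excellent", "excellent", "good", "excellent"],
})

# Characterization: summarize the general features of the target class
# (customers spending more than $5000 a year).
target = customers[customers["annual_spend"] > 5000]
print(target["age"].describe())
print(target["credit_rating"].value_counts())

# Discrimination: contrast frequent shoppers with an infrequent contrasting class.
frequent = customers[customers["purchases_per_year"] >= 4]
infrequent = customers[customers["purchases_per_year"] < 4]
print(frequent["age"].mean(), infrequent["age"].mean())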

2. Mining Frequent Patterns, Associations, and Correlations: Frequent patterns are simply patterns that occur most frequently in the data. Several kinds of frequent patterns can be observed in a dataset.
• Frequent itemset: A set of items that are frequently seen together, e.g., milk and sugar.
• Frequent subsequence: A series of events that often occur in order, such as purchasing a phone followed by a back cover.
• Frequent substructure: Structured forms such as trees or graphs that occur frequently, possibly combined with itemsets or subsequences.
Association Analysis: This is the process of uncovering relationships in the data and deriving association rules, i.e., discovering which items tend to occur together.
Example: Suppose we want to know which items are frequently purchased
together. An example for such a rule mined from a transactional database is,
buys (X, “computer”) ⇒ buys (X, “software”) [support = 1%, confidence =
50%],
where X is a variable representing a customer. A confidence, or certainty, of
50% means that if a customer buys a computer, there is a 50% chance that she
will buy software as well. A 1% support means that 1% of all the transactions
under analysis show that computer and software are purchased together. This
association rule involves a single attribute or predicate (i.e., buys) that repeats.
Association rules that contain a single predicate are referred to as single-
dimensional association rules.
age (X, “20…29”) ∧ income (X, “40K..49K”) ⇒ buys (X, “laptop”)
[support = 2%, confidence = 60%].
The rule says that 2% of the customers under study are 20 to 29 years old with an income of $40,000 to $49,000 and have purchased a laptop, and that there is a 60% probability that a customer in this age and income group will purchase a laptop. An association rule involving more than one attribute or predicate is referred to as a multidimensional association rule.
Typically, association rules are discarded as uninteresting if they do not satisfy
both a minimum support threshold and a minimum confidence threshold.
Additional analysis can be performed to uncover interesting statistical
correlations between associated attribute–value pairs.
Correlation Analysis: Correlation is a mathematical technique that shows whether and how strongly a pair of attributes is related. For example, taller people tend to weigh more.
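As a rough illustration of how support and confidence are computed, the following sketch evaluates the rule buys(X, "computer") ⇒ buys(X, "software") on a tiny, made-up transaction list (real systems mine far larger databases with algorithms such as Apriori):

# Toy transactional data; item names are illustrative only.
transactions = [
    {"computer", "software", "mouse"},
    {"computer", "printer"},
    {"computer", "software"},
    {"milk", "sugar"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transaction set."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

print(support({"computer", "software"}, transactions))      # 0.5 -> 50% support
print(confidence({"computer"}, {"software"}, transactions))  # about 0.67 -> 67% confidence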

3. Classification and Regression for Predictive Analysis:

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived from the analysis of a set of training data (i.e., data objects for which the class labels are known) and is then used to predict the class label of objects for which the class label is unknown. "How is the derived model presented?" The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks. A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. Regression, in contrast, models continuous-valued functions and is typically used to predict missing or unavailable numerical values rather than discrete class labels.
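A minimal sketch of classification with a decision tree, assuming scikit-learn is available; the training data and the buys-computer labels below are invented for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [age, income] -> buys_computer (1 = yes, 0 = no).
# Values and labels are made up for illustration.
X_train = [[25, 30000], [35, 60000], [45, 80000], [22, 20000], [50, 90000], [30, 40000]]
y_train = [0, 1, 1, 0, 1, 0]

# Learn a flowchart-like tree: internal nodes test attribute values,
# leaves hold the predicted class label.
model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(export_text(model, feature_names=["age", "income"]))

# Predict the class label of a previously unseen object.
print(model.predict([[40, 70000]]))  # e.g. [1]

The printed tree maps directly onto IF-THEN classification rules of the kind described above.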

4. Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data
sets, clustering analyzes data objects without consulting class labels. In many
cases, class labeled data may simply not exist at the beginning. Clustering can be
used to generate class labels for a group of data. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity. That is, clusters of objects are formed so that
objects within a cluster have high similarity in comparison to one another, but are
rather dissimilar to objects in other clusters. Each cluster so formed can be viewed
as a class of objects, from which rules can be derived. Clustering can also
facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together.
Example: Cluster analysis can be performed on AllElectronics customer data to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. A 2-D plot of customers with respect to their locations in a city might, for instance, reveal three distinct clusters of data points.
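A minimal sketch of cluster analysis with K-Means, assuming scikit-learn; the customer locations are invented to mimic the three-cluster scenario described above:

from sklearn.cluster import KMeans

# Hypothetical 2-D customer locations (x, y) in a city; values are made up.
locations = [
    [1.0, 1.2], [1.1, 0.9], [0.8, 1.0],
    [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],
    [9.0, 1.0], [9.2, 1.1], [8.8, 0.9],
]

# No class labels are given; K-Means groups points so that intra-cluster
# similarity is high and inter-cluster similarity is low.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)
print(km.labels_)           # cluster label generated for each customer
print(km.cluster_centers_)  # one centroid per discovered cluster
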
5. Outlier Analysis
A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Many data mining methods
discard outliers as noise or exceptions. However, in some applications (e.g., fraud
detection) the rare events can be more interesting than the more regularly
occurring ones. The analysis of outlier data is referred to as outlier analysis or
anomaly mining. Outliers may be detected using statistical tests that assume a
distribution or probability model for the data, or using distance measures where
objects that are remote from any other cluster are considered outliers. Rather than
using statistical or distance measures, density-based methods may identify
outliers in a local region, although they look normal from a global statistical
distribution view
Example: Outlier analysis may uncover fraudulent usage of
credit cards by detecting purchases of unusually large amounts for a given
account number in comparison to regular charges incurred by the same account.
Outlier values may also be detected with respect to the locations and types of
purchase, or the purchase frequency.
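A minimal sketch of statistical outlier detection on hypothetical credit-card charges, flagging values more than two standard deviations from the account's mean (both the threshold and the data are illustrative assumptions, not a fixed standard):

from statistics import mean, stdev

# Hypothetical charge amounts for one credit-card account; values are made up.
charges = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0, 980.0, 44.0]

mu, sigma = mean(charges), stdev(charges)

# Flag charges that lie far from the account's usual behaviour.
outliers = [x for x in charges if abs(x - mu) / sigma > 2]
print(outliers)  # the unusually large purchase stands out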

Data Mining Functionalities


1. Descriptive Functions:
o Clustering: Grouping a set of objects in such a way that objects in
the same group (cluster) are more similar to each other than to
those in other groups. This can be applied to customer
segmentation, market research, etc.
o Association Rule Learning: Used to identify relationships among
a set of items in large databases. The well-known Apriori
Algorithm is often used here. E.g., in retail, discovering that
"customers who buy bread also tend to buy butter" is an association
rule.
o Summarization: Produces a compact representation of the data
set. For example, calculating averages or identifying common
properties.
2. Predictive Functions:
o Classification: The process of finding a model or function that
describes and distinguishes data classes or concepts. For example,
in email classification, the model may classify emails as either
"spam" or "not spam." Algorithms used for classification include
Decision Trees, Naive Bayes, and Support Vector Machines.
o Regression: A statistical method used to predict a continuous target variable based on one or more independent variables, e.g., predicting house prices based on location, size, and features (see the sketch after this list).
o Anomaly Detection: Finds instances or patterns in the data that
significantly differ from the rest. This can be applied to fraud
detection or network security.
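A minimal sketch of the regression function described above, assuming scikit-learn; the house features, prices, and their relationship are invented for illustration:

from sklearn.linear_model import LinearRegression

# Hypothetical training data: [size_sqft, bedrooms] -> price; values are made up.
X_train = [[800, 2], [1200, 3], [1500, 3], [2000, 4], [2500, 4]]
y_train = [150000, 210000, 250000, 320000, 390000]

# Regression predicts a continuous target rather than a discrete class label.
reg = LinearRegression().fit(X_train, y_train)
print(reg.predict([[1800, 3]]))  # estimated price for an unseen house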

Steps in the Data Mining Process


1. Data Cleaning:
This step is about removing noise, correcting inconsistencies, and
handling missing data. Tools for this can include data imputation
techniques (mean substitution, regression imputation) or outlier detection
algorithms.
2. Data Integration:
Often, data from multiple sources need to be combined into a unified
dataset. The challenge here is ensuring schema integration, where
different tables are mapped correctly, and conflicts like unit mismatch or
name redundancy are resolved.
3. Data Selection:
Selecting only the relevant data that is required for the analysis. This can
involve feature selection techniques like backward elimination or
forward selection to reduce the number of variables.
4. Data Transformation:
o Feature Scaling: Bringing all features into the same range,
typically required for machine learning algorithms.
o Data Smoothing: Using techniques like moving averages or
binning to reduce noise.
o Data Aggregation: Summarizing or rolling up data into higher-
level forms (e.g., monthly to yearly data).
5. Data Mining:
This is where the core algorithms are applied to extract patterns. For
example, clustering techniques like K-Means or classification techniques
like Random Forests.
6. Pattern Evaluation:
Measures such as support, confidence, and lift are used to evaluate association rules. For classification models, evaluation metrics might include accuracy, precision, recall, and F1 score (see the sketch after this list).
7. Knowledge Presentation:
Results are presented using data visualization tools, graphs, and reports.
Effective visualization techniques (like heatmaps, decision trees, etc.)
help users interpret the data insights.
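A minimal sketch of model evaluation for a classifier, assuming scikit-learn; the true labels and predictions below are made up for illustration:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall   :", recall_score(y_true, y_pred))     # 0.75
print("f1       :", f1_score(y_true, y_pred))         # 0.75
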
Classification of Data Mining Systems
1. Based on Data Type:
o Relational Databases: Organized in tables with rows and columns,
allowing the use of SQL for querying data.
o Data Warehouses: Central repositories that integrate and store
large volumes of data from multiple sources, typically historical
data.
o Transactional Databases: Focused on transaction management
and typically store event or log data.
o Spatial Databases: Store spatial coordinates, allowing analysis
based on location (e.g., geographic information systems).
o Multimedia Databases: Deal with non-traditional data like
images, videos, and audio files.
o Text and Web Mining: Focuses on mining large collections of text
data, such as documents, emails, or web pages.
2. Based on Mining Techniques:
o Supervised Learning: Includes classification and regression,
where the algorithm learns from labelled training data. E.g.,
training a model on labelled medical records to predict patient
outcomes.
o Unsupervised Learning: Includes clustering and association rule
mining, where there are no predefined categories.
o Semi-Supervised Learning: Involves a small labelled dataset and
a large unlabelled dataset. It improves accuracy when labelling a
large dataset is expensive or time-consuming.
3. Based on Applications:
o Web Mining: Used to analyze and extract information from web
data, including web structure mining, web content mining, and web
usage mining.
o Text Mining: Extracts useful information from unstructured text
data.
o Bioinformatics: Applies data mining to biological data, such as
genetic information.

Major Issues in Data Mining


1. Data Quality:
Incomplete or noisy data can significantly affect the quality of the data
mining results. Data cleaning techniques and robust models help address
this issue.
2. Scalability:
As the size of datasets increases, ensuring that the data mining algorithms
can handle such large-scale data becomes challenging. Parallel and
distributed computing frameworks (like Hadoop and Spark) are often
used for scalable mining.
3. Data Privacy and Security:
Data mining can pose risks to privacy, particularly when it involves
sensitive information (like health or financial data). Privacy-preserving
data mining techniques, such as k-anonymity or differential privacy,
are used to protect personal information.
4. High Dimensionality:
Datasets with many features or attributes pose challenges for traditional mining algorithms. Dimensionality reduction techniques (such as PCA or LDA) are used to reduce the number of attributes without losing meaningful information (see the sketch after this list).
5. Interpretability:
Results of some data mining models, especially those involving deep
learning or neural networks, may not be easily interpretable by humans.
Efforts to make models explainable (e.g., using LIME or SHAP for
interpretability) are critical.
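A minimal sketch of dimensionality reduction with PCA, assuming scikit-learn and NumPy; the data is randomly generated so that four correlated attributes carry only two underlying dimensions:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: 6 records, 4 correlated attributes.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 2))])  # 4 columns, 2 "real" dimensions

# Project onto 2 principal components while keeping most of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (6, 2)
print(pca.explained_variance_ratio_)  # sums to (almost exactly) 1.0 here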

Data Wrangling and Preprocessing


Data Preprocessing: An Overview
Data preprocessing is the first step in the data mining pipeline and is essential
for ensuring that the data is clean, consistent, and properly formatted for the
mining algorithms. Garbage In, Garbage Out is a common principle in data
mining — poor-quality input data will lead to poor-quality results.

Steps in Data Preprocessing


1. Data Cleaning:
o Handling Missing Data: Various approaches can be used, such as
ignoring missing values, filling in missing values using mean,
median, or mode, or using algorithms to predict missing values
(e.g., using k-nearest neighbors (k-NN) or regression
imputation).
o Noise Removal: Noise in data can be detected using clustering or
regression techniques, where outliers (noisy data) are identified and
treated.
o Correcting Inconsistencies: Inconsistent data, such as typos,
formatting errors, or out-of-range values, must be corrected or
removed.
2. Data Integration:
When data comes from multiple sources, there may be issues with format
or schema. Entity Resolution (ER) techniques are used to resolve
inconsistencies and conflicts in merging data from various sources.
3. Data Transformation:
o Normalization: Data normalization ensures that features have the same scale (min-max normalization, z-score normalization, etc.); see the combined sketch after this list.
o Data Aggregation: Combining multiple data points into a
summary (e.g., summing monthly sales data to produce yearly
sales data).
o Generalization: Generalizing low-level data into higher-level concepts, for instance replacing age values with age ranges (20-30, 30-40, etc.).
4. Data Reduction:
o Dimensionality Reduction: Reducing the number of attributes
while maintaining the key properties of the data, using methods
like PCA or t-SNE for visualization.
o Data Cube Aggregation: Summarizing data into multi-
dimensional structures (used heavily in OLAP systems).
5. Data Discretization:
o Binning: Splitting continuous values into bins or intervals. This
can be equal-width binning (same interval range) or equal-
frequency binning (same number of values in each bin).
o Histogram Analysis: Discretizing the data using histograms to
understand the distribution and patterns.
o Clustering: Organizing data into groups, where data points in the
same group are more similar to each other.
o Decision Trees: Decision trees like ID3 or CART can be used to
discretize continuous variables by recursively partitioning the data.
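A minimal combined sketch of data cleaning, normalization, and discretization, assuming pandas; the table, column names, and bin count are invented for illustration:

import pandas as pd

# Hypothetical raw data with a missing value; column names are assumptions.
df = pd.DataFrame({"age": [23, 45, None, 36, 52],
                   "income": [28000, 52000, 61000, 40000, 75000]})

# Data cleaning: fill the missing age with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Data transformation: min-max normalization of income into [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Data discretization / generalization: equal-width binning of age into 3 ranges.
df["age_range"] = pd.cut(df["age"], bins=3)

print(df)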
