Data Mining

The document provides an overview of data mining, including its definition, key patterns such as characterization, discrimination, frequent patterns, and association, as well as various tasks and techniques like classification, regression, clustering, and anomaly detection. It also outlines the Knowledge Discovery in Databases (KDD) process, emphasizing its iterative nature and essential activities, and describes the architecture of data mining systems, including components like data sources, mining engines, and evaluation modules. Additionally, it highlights the importance of data visualization and deployment in making data-driven decisions.
Bernada Sianga ([email protected])

What is Data Mining?
• Data mining refers to the process of discovering meaningful patterns, trends, and insights from large datasets using various techniques, including statistical analysis, machine learning, and artificial intelligence.
• It involves extracting valuable knowledge from vast amounts of data to support decision-making and solve complex problems.

Patterns in Data Mining - Characterization
• Description:
  • Characterization provides a summarization of the general characteristics or features of a target class of data.
  • It involves generating descriptive summaries about the data.
• Example Techniques:
  • Data aggregation
  • Summarization tools (e.g., SQL aggregation functions)
• Examples:
  • Summarizing the characteristics of software products that had a sales increase of 10% in the previous year.
  • Describing the common features of customers who spend more than $5000 a year at a specific company.
• Objective:
  • To provide an overview or summary of data attributes for a specific class or subset of data.

Patterns in Data Mining - Discrimination
• Description:
  • Discrimination is the process of comparing the general features of target-class data objects against the general features of objects from one or more contrasting classes.
  • It aims to identify differences between classes.
• Example Techniques:
  • Contrast sets
  • Discriminant analysis
• Examples:
  • Comparing the characteristics of regular computer-product shoppers (e.g., more than twice a month) with those who shop rarely (e.g., less than three times a year).
  • Distinguishing features between high-value and low-value customers in a retail dataset.
• Objective:
  • To find attributes or features that distinguish one class from another, often used for decision-making and classification.

Patterns in Data Mining – Frequent Patterns
• Description:
  • Frequent patterns identify items, sequences, or substructures that appear frequently in a dataset.
  • This includes frequent itemsets, subsequences, and subgraphs.
• Example Techniques:
  • Apriori algorithm
  • FP-Growth algorithm
• Examples:
  • Finding itemsets that frequently appear together in transaction data, such as "milk and bread".
  • Identifying common sequences of events in weblogs or customer-behavior data.
• Objective:
  • To uncover regularities and commonalities in the data, which can be useful for market basket analysis, sequence mining, and other applications.

Patterns in Data Mining – Association
• Description:
  • Association rule mining discovers interesting relationships (associations) among data items.
  • It is used to find rules implying that certain item combinations occur together with a certain probability.
• Example Techniques:
  • Association rule learning (e.g., Apriori, Eclat)
  • Measures like support, confidence, and lift
• Examples:
  • Identifying that customers who buy diapers also frequently buy baby formula.
  • Discovering that if a customer purchases a laptop, there is a high probability they will also buy a laptop bag.
• Objective:
  • To identify and quantify relationships between items in large datasets, often used for recommendation systems and market basket analysis.

Summary of Key Patterns
1. Characterization: Summarizes general characteristics of a target class.
   • Example: Summarize customers spending over $5000/year.
2. Discrimination: Compares features between target and contrasting classes.
   • Example: Compare frequent vs. infrequent computer-product shoppers.
3. Frequent Patterns: Identifies items or events that occur frequently together.
   • Example: Milk and bread bought together frequently.
4. Association: Discovers co-occurrence of items with certain probabilities.
   • Example: Diapers and baby formula bought together.

Data Mining Tasks and Techniques
• Data mining involves various tasks and techniques aimed at discovering patterns, associations, changes, anomalies, and statistically significant structures and events in data.
• Data Mining Tasks:
  • Data mining tasks refer to the specific objectives or problems that data mining aims to solve.
  • They represent the goals or outcomes that the data mining process is intended to achieve.
• Data Mining Techniques:
  • Data mining techniques are the specific methods or algorithms used to perform data mining tasks.
  • These techniques are the tools that implement the strategies necessary to achieve the objectives set by the data mining tasks.

Data Mining Tasks and Techniques
• Data mining tasks are generally divided into two major categories:
  • Predictive tasks (use some attributes to predict unknown or future values of other attributes):
    • Classification
    • Regression
    • Anomaly Detection
  • Descriptive tasks (find human-interpretable patterns that describe the data):
    • Association Rule Mining
    • Clustering
    • Summarization
    • Text Mining

Data Mining Tasks - Classification
• Task Description:
  • Classification is the process of identifying the category or class label of new observations based on a training dataset containing observations whose category membership is known.
• Techniques:
  • Decision Trees: Split data into branches to build a model of decisions.
  • Naive Bayes: Uses Bayes' theorem for probabilistic classification.
  • Support Vector Machines (SVM): Finds the hyperplane that best separates classes.
  • k-Nearest Neighbors (k-NN): Classifies based on the majority class among the k closest observations.
  • Neural Networks: Use layers of interconnected nodes to model complex patterns.
• Example: Classifying emails as spam or non-spam.

Data Mining Tasks - Regression
• Task Description:
  • Regression is used to predict a continuous value.
  • It models the relationship between a dependent (target) variable and one or more independent (predictor) variables.
• Techniques:
  • Linear Regression: Models the relationship between two variables by fitting a linear equation.
  • Polynomial Regression: Models the relationship between variables as an nth-degree polynomial.
  • Ridge and Lasso Regression: Add regularization terms to linear regression to prevent overfitting.
  • Support Vector Regression (SVR): Uses SVM principles for regression tasks.
  • Neural Networks: Use backpropagation to fit complex regression models.
• Example: Predicting house prices based on features like area, number of bedrooms, and age of the house.

Data Mining Tasks - Clustering
• Task Description:
  • Clustering involves grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups.
• Techniques:
  • K-Means: Partitions data into k clusters by minimizing within-cluster variance.
  • Hierarchical Clustering: Builds a tree of clusters using either a bottom-up or top-down approach.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Finds clusters based on density and can handle noise.
  • Gaussian Mixture Models (GMM): Use probabilistic models to represent normally distributed subpopulations.
• Example: Segmenting customers based on purchasing behavior.

Data Mining Tasks - Association Rule Mining
• Task Description:
  • Association rule mining identifies interesting relationships (associations) between variables in large datasets.
• Techniques:
  • Apriori Algorithm: Generates frequent itemsets and association rules.
  • FP-Growth (Frequent Pattern Growth): Builds a compact data structure to find frequent itemsets without candidate generation.
  • Eclat Algorithm: Uses a depth-first search strategy for mining frequent itemsets.
• Example: Identifying items that are frequently purchased together in a supermarket.

Data Mining Tasks – Anomaly Detection
• Task Description:
  • Anomaly detection aims to identify rare items, events, or observations that raise suspicion by differing significantly from the majority of the data.
• Techniques:
  • Statistical Methods: Detect anomalies based on statistical deviations.
  • Isolation Forest: Isolates anomalies by randomly selecting features and splitting data.
  • One-Class SVM: Models the normal data distribution and identifies deviations.
  • Autoencoders: Use neural networks to reconstruct input data and flag anomalies based on reconstruction error.
• Example: Detecting fraudulent transactions in credit card data.

Data Mining Tasks – Sequence Mining
• Task Description:
  • Sequence mining discovers frequent sequences of events or items over time.
• Techniques:
  • GSP (Generalized Sequential Pattern): Finds frequent sequences based on user-specified constraints.
  • PrefixSpan: Uses a prefix-based approach to find sequential patterns.
• Example: Identifying common sequences of web page visits on a website.

Data Mining Tasks – Summarization
• Task Description:
  • Summarization involves creating a compact representation of the dataset, often providing a high-level overview or summary of the data.
• Techniques:
  • Descriptive Statistics: Use measures like mean, median, and standard deviation.
  • OLAP (Online Analytical Processing): Provides multidimensional data analysis and summarization.
• Example: Generating a summary report of sales performance by region and product category.

Data Mining Tasks – Text Mining
• Task Description:
  • Text mining involves extracting useful information from unstructured text data.
• Techniques:
  • Natural Language Processing (NLP): Techniques for processing and analyzing text.
  • Topic Modeling (e.g., LDA - Latent Dirichlet Allocation): Identifies topics present in a collection of documents.
  • Sentiment Analysis: Determines the sentiment expressed in text (positive, negative, or neutral).
• Example: Analyzing customer reviews to determine overall sentiment towards a product.
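The association measures named earlier (support, confidence, and lift) can be made concrete with a short, self-contained sketch. The five transactions below are hypothetical toy data; `Fraction` is used so the measures come out exact:

```python
from fractions import Fraction

# Hypothetical market-basket data: each transaction is a set of items
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return Fraction(sum(itemset <= t for t in transactions), N)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support; lift > 1 suggests a positive association."""
    return confidence(antecedent, consequent) / support(consequent)

# Evaluate the rule {milk} -> {bread}
print(support({"milk", "bread"}))       # 3/5
print(confidence({"milk"}, {"bread"}))  # 3/4
print(lift({"milk"}, {"bread"}))        # 15/16
```

Note the lift below 1: bread is so common here (support 4/5) that buying milk actually makes bread slightly *less* likely than its baseline, which is exactly the kind of nuance support and confidence alone would miss.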
Knowledge Discovery in Databases
• Knowledge Discovery in Databases (KDD) is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• It encompasses several steps, from the initial data selection to the final knowledge representation.

The KDD Process
1. Data Selection
   • Identify and gather relevant data from various sources.
2. Data Preprocessing (or Data Cleaning and Integration)
   • Data Cleaning: Handle missing values, remove duplicates, and correct errors.
   • Data Integration: Combine data from multiple sources to create a unified dataset.
3. Data Transformation (or Data Preparation)
   • Convert data into suitable formats for mining, including normalization, aggregation, and feature selection.
4. Data Mining
   • Apply algorithms to extract patterns and models from the data.
5. Pattern Evaluation
   • Assess and validate the discovered patterns to ensure they are significant and useful.
6. Knowledge Representation
   • Present the mined knowledge in an understandable and actionable form, such as through visualization or reports.

KDD Iterative Nature
• It is important to understand that KDD is not a strictly linear process.
• It is iterative and interactive, meaning steps may be revisited based on insights gained during later stages.
• For example:
  • Discovering that additional data cleaning is needed after some initial data mining.
  • Revisiting feature selection based on the patterns found.

Common Variations in KDD Approaches
• Combined Preprocessing Steps:
  • Some approaches combine data cleaning, integration, and transformation into a single preprocessing step.
  • This emphasizes the continuous preparation and refinement of data before mining.
• Fewer or More Steps:
  • Some methodologies break the steps down further or combine them to emphasize certain aspects.
  • Five-Step Approach: Selection, preprocessing, transformation, data mining, evaluation.
  • Six-Step Approach: Explicitly separates preprocessing into cleaning and integration, followed by transformation.

Consensus on the KDD Process
• Despite the differences in terminology and step delineation, there is consensus on the essential activities:
  • Preparing the data: selection, cleaning, integration, and transformation.
  • Mining the data: applying algorithms to extract patterns.
  • Evaluating and interpreting the results: ensuring the patterns are meaningful and useful.
  • Presenting the findings: making the knowledge understandable and actionable.

Data Mining Architecture

Components of a Data Mining System
• The main components of a data mining system architecture are:
  • Sources of Data
  • Database or Data Warehouse Server
  • Data Mining Engine
  • Modules for Pattern Evaluation
  • Graphical User Interface
  • Knowledge Base

Data Sources
• Databases: Structured collections of data, typically organized in tables.
• Data Warehouses: Large, centralized repositories of integrated data from multiple sources, optimized for querying and analysis.
• Data Lakes: Storage systems that hold vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
• External Data Sources: Data from external systems, such as web data, social media, and other third-party sources.

Database or Data Warehouse Server
• Data Storage: Manages the storage and retrieval of data.
• Data Processing: Handles operations like data extraction, transformation, and loading (ETL).

Data Mining Engine
• Core Data Mining Algorithms: Implements various algorithms for tasks such as classification, regression, clustering, association rule mining, and anomaly detection.
• Pattern Evaluation Modules: Evaluate the discovered patterns to determine their significance and relevance.
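The pattern-evaluation work of the mining engine typically rests on a handful of standard metrics. A minimal hand-computed sketch, using hypothetical labels and predictions, shows how precision, recall, F1-score, and accuracy are derived from the confusion-matrix counts:

```python
# Hypothetical ground-truth labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix cells
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # share of flagged items that are truly positive
recall = tp / (tp + fn)     # share of truly positive items that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(precision, recall, f1, accuracy)  # all 0.75 for this toy data
```

Libraries such as scikit-learn provide these metrics ready-made, but the arithmetic above is all they compute for the binary case.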
Knowledge Base
• Domain Knowledge: Contains domain-specific knowledge that can help guide the data mining process, such as taxonomies, ontologies, and user preferences.
• Meta-knowledge: Stores metadata about the data, algorithms, and patterns discovered.

User Interface
• Graphical User Interface (GUI): Allows users to interact with the data mining system, providing functionalities like data selection, preprocessing options, algorithm selection, and visualization of results.
• Query Interface: Enables users to input specific queries to guide the data mining process.

Pattern Evaluation Module
• Evaluation Metrics: Uses metrics like accuracy, precision, recall, and F1-score to assess the quality of the discovered patterns.
• Validation Methods: Employs techniques such as cross-validation, holdout validation, and bootstrapping to validate the patterns.

Data Preprocessing Component
• Data Cleaning: Handles missing values, removes duplicates, and corrects errors.
• Data Integration: Combines data from multiple sources into a coherent dataset.
• Data Transformation: Normalizes, aggregates, and transforms data into the appropriate format for mining.
• Data Reduction: Reduces the volume of data through methods like sampling, dimensionality reduction, and feature selection.

Data Mining Tools
• Software and Libraries: Provide tools and libraries for implementing data mining algorithms and processes.
• Examples include scikit-learn, Weka, RapidMiner, and KNIME.

Visualization Tools
• Data visualization is the graphical representation of information and data.
• It uses visual elements like charts, graphs, and maps to provide an accessible way to see and understand trends, outliers, and patterns in data.
• In the world of Big Data, data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions.
• Visualization is an increasingly important tool for making sense of the trillions of rows of data generated every day.
• It helps tell stories by curating data into a form that is easier to understand, highlighting the trends and outliers.

Deployment and Integration Modules
• Model Deployment: Manages the deployment of data mining models into production environments.
• APIs and Connectors: Facilitate integration with other systems and applications, allowing the mined knowledge to be used in decision-making processes.
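To close, the KDD steps described earlier (selection, cleaning, transformation, mining, evaluation, representation) can be sketched end-to-end in a few lines using only the standard library. All records and thresholds here are hypothetical toy values:

```python
import statistics

# 1. Data selection: raw records from a hypothetical source (None = missing value)
raw = [
    {"area": 50, "price": 150},
    {"area": 80, "price": 260},
    {"area": None, "price": 300},  # corrupt record, will be dropped
    {"area": 120, "price": 330},
    {"area": 70, "price": 210},
]

# 2. Data cleaning: drop records with missing values
clean = [r for r in raw if None not in r.values()]

# 3. Data transformation: derive a comparable feature, price per square metre
for r in clean:
    r["price_per_m2"] = r["price"] / r["area"]

# 4. Data mining: summarization via descriptive statistics
mean_ppm2 = statistics.mean(r["price_per_m2"] for r in clean)

# 5. Pattern evaluation: flag records deviating from the mean by more than 2 standard deviations
stdev_ppm2 = statistics.stdev(r["price_per_m2"] for r in clean)
anomalies = [r for r in clean if abs(r["price_per_m2"] - mean_ppm2) > 2 * stdev_ppm2]

# 6. Knowledge representation: report the findings
print(f"{len(clean)} usable records, mean price/m2 = {mean_ppm2:.2f}")
print(f"anomalies: {anomalies}")
```

A real pipeline would replace each step with heavier machinery (ETL tooling, a mining library, a dashboard), but the iterative KDD structure, including looping back when the evaluation step reveals problems, stays the same.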