Introduction-to-Data-Mining

Uploaded by

Aya

Data Mining

Chapter 1 . Introduction
SASSI Abdessamed
Motivation
Why do we need data mining?
● Nowadays, the total worldwide volume of data is very large
■ Hundreds of zettabytes (1 ZB = 10^21 bytes ≈ 2^70 bytes)
● Data types and formats can be complex
■ Video, Image, Audio, etc.
● Most data formats are not human readable
■ Binary formats
● Humans cannot deal with such an amount and complexity of data
● We need concise insights and patterns to make decisions
Is data mining a misnomer?
● Taken literally, "data mining" suggests gathering or collecting data
● In practice, data mining means extracting knowledge from data
● This knowledge is like gold nuggets hidden in a large volume of data
● Hence the word "mining" in the name
● So,
● What is Data?
● What is Knowledge?
● And what does Data Mining really mean?
Data
What is data?
● Data are collected observations or measurements represented as Text,
Numbers, or Multimedia [3].
● Data can be quantitative (represent quantities or numerical values)
■ Sensory data (Temperature, Light, Pixel Intensities, Voltage, …)
■ Time Durations (Age, Travel Length, …)
■ Size & Length Measurements (Area, Volume, Distance, Length, …)
■ Health Measurements (Blood Pressure, Sugar Level, O2 Saturation, …)
● Data can also be qualitative (categorical)
■ Text (words, letters, digits, …)
■ Age Classes (e.g. Football Age categories)
■ Blood Types
● Data can also be a complex mixture of the two types
■ E.g. Maps (Graphs)
Data vs Knowledge
● A book does not know its own content
● Knowing = being aware of the information we possess
■ Understanding
■ Being able to act and make decisions
■ Producing new thoughts
■ Discovering patterns
● Unlike merely having information, knowing is an active process
● How can we make computers discover knowledge on their own?
Data Sources
● In our daily lives we produce tons of data (information)
■ Social Networks, Emails, Blogs, …
■ E-Commerce, Banking, Stores, …
■ Hospitals & Health reports
■ Administrative records
● Hence, data can be supplied by a variety of technologies:
■ Relational databases
■ Data warehouses
■ Transaction databases
■ Text databases
■ Social network data
■ World-Wide Web
■ Time-series data
Data Formats
● The data we want to analyse using data mining methods have various
formats
■ Transactions
■ N-dimensional Vectors (data points)
■ Graphs
■ Tables
■ etc.
● The format of the data determines the data mining algorithm we can use
● We may also change the format of the data in order to be able to use a
certain type of algorithm
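As a small illustration of changing the data format to suit an algorithm, the sketch below converts a list of transactions into N-dimensional binary vectors, one dimension per item. The transactions and the helper name `to_binary_vectors` are hypothetical and only serve the example.

```python
# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]

def to_binary_vectors(transactions):
    """Return (sorted item vocabulary, one 0/1 vector per transaction)."""
    vocabulary = sorted(set().union(*transactions))
    vectors = [[1 if item in t else 0 for item in vocabulary] for t in transactions]
    return vocabulary, vectors

vocabulary, vectors = to_binary_vectors(transactions)
print(vocabulary)  # ['bread', 'butter', 'eggs', 'milk']
for v in vectors:
    print(v)
```

Once the transactions are vectors, any algorithm that expects data points (for example a distance-based clustering method) can be applied to them.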
Data Preparation & Preprocessing
● Data integration. Combining data from multiple sources
■ Joining multiple tables.
■ Resolving data inconsistencies from different sources.
● Data selection. Selecting domain relevant data.
■ Selecting a specific subset of attributes (columns)
● Data cleaning.
■ Noise Reduction : Removing or correcting noisy data
■ Outlier Detection : Identifying and handling outliers
■ Handling Missing Values : Removing or filling in missing data
● Data Reduction.
■ Dimensionality Reduction: to reduce the number of attributes while retaining
important information.
■ Sampling: Selecting a subset of the data that represents the whole dataset to reduce
computation time.
● Data Transformation.
■ Normalization: Scaling numerical data to a common range
■ Data Discretization: Converting continuous attributes into discrete bins or categories
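A minimal sketch of three of the steps above, using only the Python standard library: filling in a missing value, min-max normalization, and discretization into bins. The sample ages, the bin edges, the labels, and all function names are assumptions made for this example. Mean imputation is only one of several possible ways to handle missing values.

```python
from statistics import mean

# Hypothetical ages with one missing value (None).
ages = [22, 35, None, 58, 41]

def fill_missing_with_mean(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Normalization: scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, edges, labels):
    """Discretization: give each value the label of the first bin whose
    upper edge is greater than the value. Expects len(labels) == len(edges) + 1."""
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                result.append(label)
                break
        else:
            # No edge exceeded the value, so it falls in the last bin.
            result.append(labels[-1])
    return result

filled = fill_missing_with_mean(ages)
scaled = min_max_normalize(filled)
groups = discretize(filled, edges=[30, 50], labels=["young", "middle", "senior"])
print(filled)
print(scaled)
print(groups)  # ['young', 'middle', 'middle', 'senior', 'middle']
```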
Data Mining
What is data mining?
● Extracting or “mining” knowledge from large amounts of data [1].
● A set of software techniques for identifying / discovering useful
patterns and trends from large amounts of data through automated
analysis.
● Obtaining a simplified view of data to help with decision making.
● Extracting Knowledge from data.
What is knowledge in this context?
● For data mining, knowledge is in the form of Patterns and Insights:
■ (If .. Then) Rules
■ Associations
■ Anomalies
■ Recommendations
■ Groups & Classes (Clusters)
■ Predictions
■ Correlations
Intersection with other fields & technologies
● Statistics
■ A variety of data mining algorithms involve some methods from the field of statistics
■ The methods of statistics themselves can be used as low-level data mining methods
● Databases
■ Most of the data sources will be stored using database technology
● Data warehouses
■ Data mining is generally applied to data integrated in a data warehouse
● Machine Learning
■ Some machine learning techniques can be used to learn patterns
● Data visualization
■ To become familiar with the data, detect outliers, and decide what preprocessing is needed
■ To display the extracted patterns and make decisions after data mining
Why Data Mining?
● Large quantities of data to be analysed
■ Algorithms must be highly scalable
● High dimensionality of the data to be analysed
■ Each record of data is a vector with a large number of dimensions (attributes)
● Some data types are complex by nature
■ Web pages
■ Multimedia
■ Sensor data
■ Graphs
■ Social networks
■ …
Data mining process
[Figure: Data Collection → Databases → Data Integration → Data Warehouse → Data Mining → Patterns]
Data mining as a step in KDD
KDD = Knowledge Discovery from Data
1. Data selection.
■ Identifying relevant datasets and selecting the data that matter for the task at hand
2. Data preprocessing.
■ Cleaning the data by handling missing values, noise, and inconsistencies
3. Data transformation.
■ Changing the form of the data to suit the data mining algorithms to be used
4. Data mining.
■ Applying a set of intelligent data analysis techniques
5. Pattern evaluation.
■ Interpreting the discovered patterns and evaluating their interestingness
6. Knowledge presentation.
■ Visualizing the discovered knowledge (patterns)
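The six KDD steps can be sketched as a chain of small functions, each feeding the next. This is a toy illustration only: the records, the store names, the 60% threshold, and every function name are hypothetical, and the "mining" step is reduced to counting how often each item occurs.

```python
from collections import Counter

# Hypothetical raw records: (customer id, store, items bought or None if missing).
raw = [
    (1, "north", ["milk", "bread"]),
    (2, "north", None),
    (3, "south", ["milk"]),
    (4, "north", ["bread", "milk", "eggs"]),
    (5, "north", ["bread"]),
]

def select(records, store):
    """1. Data selection: keep only the records of one store."""
    return [r for r in records if r[1] == store]

def preprocess(records):
    """2. Data preprocessing: drop records whose item list is missing."""
    return [r for r in records if r[2] is not None]

def transform(records):
    """3. Data transformation: keep only the item sets (transactions)."""
    return [set(r[2]) for r in records]

def mine(transactions):
    """4. Data mining: fraction of transactions containing each item."""
    counts = Counter(item for t in transactions for item in t)
    return {item: n / len(transactions) for item, n in counts.items()}

def evaluate(patterns, min_support):
    """5. Pattern evaluation: keep items that are frequent enough."""
    return {item: s for item, s in patterns.items() if s >= min_support}

def present(patterns):
    """6. Knowledge presentation: format the result for a human reader."""
    return [f"{item}: {s:.0%}" for item, s in sorted(patterns.items())]

frequent = evaluate(mine(transform(preprocess(select(raw, "north")))), min_support=0.6)
report = present(frequent)
print(report)  # ['bread: 100%', 'milk: 67%']
```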
Architecture of a typical data mining system [1]
[Figure: data sources (Database, Data Warehouse, World Wide Web, other types of repositories such as spreadsheets, NoSQL, …) → Data Cleaning, Integration, and Selection → Database / Data Warehouse Server]
Data Mining Tasks
Categories of Data Mining Tasks
● Data mining tasks fall into one of two categories

● Descriptive Mining Tasks (Unsupervised learning)

- Clustering: finding groups of similar items
- Association rules: finding relations between items

● Predictive Mining Tasks (Supervised learning)

- Classification: assigning data to predefined classes
- Regression: fitting a function that predicts a continuous value from the data
- Time series analysis: analysing how data evolve over time
Association Rules Mining
● Frequent Patterns, Associations, and Correlations Mining
● Frequent Itemsets. Unordered sets of items that appear together very
often.
■ Milk and Bread are frequently bought together.
● Frequent Subsequences. Ordered sequences of items that appear together
very often.
■ PC → Camera → Memory Card
● Association Analysis can uncover:
■ Single-dimensional Association Rules
■ BUY(X, “COMPUTER”) ⇒ BUY(X, “SOFTWARE”) [Support=1%, Confidence=50%]
■ Multi-dimensional Association Rules
■ AGE(X, “20..29”) ∧ INCOME(X, “20K..29K”) ⇒ BUY(X, “CD Player”) [Support=1%, Confidence=50%]
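The support and confidence figures attached to a rule A ⇒ B can be computed directly from the transactions: support is the fraction of all transactions that contain both A and B, and confidence is the fraction of the transactions containing A that also contain B. The sketch below uses a hypothetical list of four baskets, so its numbers differ from the 1% / 50% quoted above.

```python
# Hypothetical baskets; each one is the set of items bought together.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

rule_support = support({"computer", "software"}, baskets)
rule_confidence = confidence({"computer"}, {"software"}, baskets)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
# support=0.50, confidence=0.67
```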
Classification and Prediction
● Classification. Describes a class/concept as a function (model) that can
be used later to predict the classes of new objects.
● Prediction. Finds a function (model) that can predict missing
continuous numerical values.
● In both cases, we need a set of objects with known labels (classes /
outputs) to train the model
■ Training Dataset
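To make the role of the training dataset concrete, here is a minimal sketch of both tasks. The slides do not name a specific method, so two simple stand-ins are assumed: a 1-nearest-neighbour classifier for classification and a least-squares straight line for numerical prediction. All data points and function names are hypothetical.

```python
# Hypothetical training data: (feature vector, known class label).
train_points = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

def classify_1nn(training, query):
    """Classification: return the label of the closest training object."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(training, key=lambda item: sq_dist(item[0], query))
    return label

def fit_line(xs, ys):
    """Prediction: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

predicted_class = classify_1nn(train_points, (5.5, 5.0))
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_value = slope * 5 + intercept
print(predicted_class)   # B
print(predicted_value)   # 11.0
```

In both cases the model is built only from objects whose label or output is already known, and is then applied to a new object.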
Cluster Analysis (Clustering)
● Unsupervised classification
● We group objects into clusters (classes) that are initially unknown
● We use the concept of similarity between objects.
● Minimize the inter-class similarity (similarity of objects from different
clusters)
● Maximize the intra-class similarity (similarity of objects of the same
cluster)
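One common way to realise this idea is k-means, which is not named in the slides and is used here only as an example. It measures similarity through distance: each point joins the cluster with the nearest centroid, then each centroid moves to the mean of its points. The points, the starting centroids, and the fixed iteration count below are assumptions chosen to keep the sketch deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance; a smaller distance means more similar."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Alternate cluster assignment and centroid update for a fixed number of rounds."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```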
Outlier Analysis
● Detect objects in the data that are irregular with respect to other objects
● Can be used for:
■ Anomaly detection
■ Fraudulent Credit Card Transactions
■ …
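A simple way to flag irregular objects in numeric data is the z-score: how many standard deviations a value lies from the mean. This is only one of many outlier detection techniques. The transaction amounts and the threshold of 2.5 below are assumptions for the example.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one suspicious value.
amounts = [48, 52, 50, 49, 51, 50, 47, 53, 50, 50, 500]
print(find_outliers(amounts))  # [500]
```

Note that a single extreme value inflates both the mean and the standard deviation, so on very small samples this rule can miss outliers.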
Pattern Evaluation
Pattern Interestingness
● A pattern is considered interesting if [1]:
1. It is easily understood by humans.
2. It can be generalized to new, unseen (test) data with some degree of certainty.
3. It is useful.
4. It is novel (adds something new to our knowledge).
● Various performance (quality) metrics can be used to evaluate (assess)
the usefulness or interestingness of discovered patterns.
● The definition of these performance metrics depends highly on the
nature and structure of the patterns.
● We can prune away uninteresting patterns by comparing their quality to
a threshold defined by the user.
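Pruning against a user-defined threshold amounts to a simple filter. In the sketch below the patterns are association rules scored by confidence. The rules, their scores, and the 0.5 threshold are all hypothetical.

```python
# Hypothetical mined rules, each paired with a quality score (here: confidence).
rules = [
    ("computer => software", 0.50),
    ("bread => milk", 0.80),
    ("printer => camera", 0.10),
]

def prune(patterns, min_quality):
    """Keep only the patterns whose quality reaches the user-defined threshold."""
    return [(name, quality) for name, quality in patterns if quality >= min_quality]

interesting = prune(rules, min_quality=0.5)
for name, quality in interesting:
    print(f"{name} (quality={quality:.2f})")
```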
Data Mining Applications
Some Applications
● Healthcare
■ Diagnosis and Treatment: Identifying patterns in patient data to help diagnose diseases
and recommend treatments.
■ Medical Research: Analyzing clinical data to discover new medical knowledge and assess
drug efficacy.
● Finance and Banking
■ Fraud Detection: Identifying unusual transactions or behavior that could indicate fraud.
■ Risk Management: Assessing loan applicants' risk levels and predicting credit scores.
■ Customer Segmentation: Classifying customers based on spending habits, transaction
frequency, and investment preferences.
● Telecommunications
■ Churn Prediction: Analyzing user behavior to predict when customers may leave the
service
■ Customer Service: Using data mining to offer more personalized and efficient support.
● Social Media and Web Analytics
■ Sentiment Analysis: Analyzing social media posts to gauge public opinion on products,
services, or events.
● Government and Public Services
■ Crime Prevention: Predicting criminal behavior and identifying hotspots based on
historical data.
■ Tax Fraud Detection: Detecting anomalies in tax records to identify potential fraud
cases.
● Marketing
■ Customer Segmentation: Grouping customers into segments based on purchasing
behavior and preferences.
■ Targeted Advertising: Analyzing data to create more effective marketing campaigns and
personalized ads.
References
1. Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann, 2006.
2. IBM Technologies on Youtube
3. University of Houston Libraries on Youtube
