
Data Mining Intro

This document provides an introduction to a course on data mining techniques. The main objectives of the course are to introduce the field of data mining and cover the main data mining techniques. Specific topics that will be covered include why perform data mining, popular data mining techniques like clustering, classification, and outlier detection, and how the techniques work. The course will provide an overview in a short 20 hours to help understand the main types of techniques.

Data Mining (Techniques)

Course Instructor
Umaru Mohammed Yussif
Department of Computer Science and Engineering

Main objective
The main objective is to provide an introduction to the field of data mining.

Data mining is sometimes also called data science or big data, with a similar meaning.

Specific objectives
⚫ Why perform data mining?
⚫ How to do data mining?
⚫ What are the main data mining techniques?
◦ Clustering techniques: K-Means
◦ Pattern mining: discovering interesting patterns in databases
◦ Classification
◦ Outlier detection
◦ Other popular topics in data mining
◦ Challenges and research directions in data mining
⚫ How do data mining techniques work?

Perspective
⚫ This course is quite short (20 hours).
⚫ We will focus on understanding the main types of data mining techniques:
◦ how they work,
◦ their advantages and limitations.

Lecture slides
⚫ Slides for each lecture will be provided (as PDFs) after each lecture.
⚫ Reference materials:
◦ Han and Kamber (2011), Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers.
◦ Tan, Steinbach & Kumar (2006), Introduction to Data Mining, Pearson Education, ISBN-10: 0321321367.
◦ Zaki & Meira (2014), Data Mining and Analysis: Fundamental Concepts and Algorithms.
◦ Aggarwal (2015), Data Mining: The Textbook.
◦ Hastie, Tibshirani & Friedman (2009), The Elements of Statistical Learning.

Lecture slides
⚫ Slides for this course are modified from the slides of:
1. Philippe Fournier-Viger, School of Natural Sciences and Humanities, HITZ
2. Jiawei Han, Univ. Illinois at Urbana-Champaign

1 – INTRODUCTION

WHAT IS DATA MINING?

Introduction
⚫ Nowadays,
◦ storing data on computers is cheap,
◦ transferring data between computers is fast.

Introduction
Small and cheap devices such as smartphones and other sensors can collect a lot of data.

Pictures from bellenews.com

Data collected from vehicles

Pictures from eletimes.com and ebay.co.uk

Internet of things
⚫ Different objects can communicate and exchange data.
⚫ ~30 billion connected objects in 2020 (Nordrum 2016)

Picture from teamarin.net

Data collected from humans
⚫ Movements,
⚫ Brain signals,
⚫ Skin conductivity,
⚫ Heart rate,
⚫ Blood pressure,
⚫ Eye movements,
⚫ Spatial locations,
⚫ …

Data collected from the industry
⚫ Internal data
◦ Data about employees, customers, market, etc.
⚫ Banking:
◦ Spending patterns, income, social media
⚫ Retail industry:

Introduction
As a result, a huge amount of data is collected and stored in databases.
« Big Data »

Picture: servers for storing data

Introduction
⚫ Having a lot of data is great.
⚫ But we want to be able to understand the data.
⚫ We also want to discover new knowledge that can help us understand the data and support decision making.
⚫ If we cannot do that, the data is useless…

How can I analyze my data?
Analyzing data by hand?
• time-consuming
• may miss important information
• not suitable for “big data”

« Data rich but information poor »
Illustration: Han & Kamber (2006)

Data Rich but Information Poor!
The Goldfields Story:
• I (a data scientist) and a colleague (a metallurgist) were at Goldfields (17/09/2017) to discuss possible data analysis and knowledge discovery with the CIL metallurgical data for decision making.
• The person in charge said one of the following:
1. You know Prof XYZ, when I was a student, he was my TA. Go and generate your own data and do the analysis.
2. You know Prof XYZ, when I was a TA, he was a student. Go and generate your own data and do the analysis.
• He saw us (two Doctors) as too young to be trusted with the data; he was probably scared of data leakage.
• We are researchers; his fears could have been removed if he had made us sign an NDA.

« Data rich but information poor »

Data Rich but Information Poor!
The Goldfields Story vs a Mine willing to Change!

You can read more about this here:
https://www.mckinsey.com/industries/metals-and-mining/how-we-help-clients/inside-a-mining-companys-ai-transformation

« Data rich but information poor »

What is data mining?
⚫ Data mining consists of techniques to automatically discover interesting patterns in data (discover knowledge).

⚫ Two goals:
I. Understand the past
e.g. Why was there an earthquake last year?
II. Predict the future
e.g. Will there be an earthquake tomorrow?
e.g. Will this customer pay back their debt?

How to do data mining?
⚫ To do data mining, a process consisting of seven steps is followed.
⚫ This process is often called « Knowledge Discovery ».
⚫ Data mining is only one step of this process.

The Knowledge Discovery process
Preparing data
1. Data cleaning (remove noisy data and fix inconsistencies)
2. Data integration (integrate data from multiple sources)
3. Data selection (select relevant data)
4. Data transformation
Discovering patterns
5. Discovering patterns (data mining)
Evaluating patterns and using them
6. Evaluate the patterns found using interestingness measures
7. Visualize the discovered …
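
As a rough illustration, the steps listed above can be sketched on a toy dataset. Everything in this sketch is an assumption made for illustration: the records, the field names, the helper functions (clean, integrate, select, transform, mine, evaluate) and the 0.5 support threshold are not part of the course material, and each step is far simpler than what a real project would need.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
source_a = [
    {"id": 1, "age": 23, "city": "Accra", "bought": "bread"},
    {"id": 2, "age": None, "city": "Kumasi", "bought": "milk"},  # noisy: missing age
    {"id": 3, "age": 41, "city": "accra ", "bought": "bread"},   # inconsistent spelling
]
source_b = [
    {"id": 4, "age": 35, "city": "Accra", "bought": "bread"},
    {"id": 5, "age": 19, "city": "Tamale", "bought": "eggs"},
]

def clean(records):
    """Step 1: drop records with missing values and normalize city names."""
    cleaned = []
    for r in records:
        if any(v is None for v in r.values()):
            continue
        cleaned.append({**r, "city": r["city"].strip().title()})
    return cleaned

def integrate(*sources):
    """Step 2: merge several sources into one list."""
    return [r for s in sources for r in s]

def select(records, fields):
    """Step 3: keep only the relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: turn the numeric age into a coarse category."""
    return [{**r, "age": "young" if r["age"] < 30 else "adult"} for r in records]

def mine(records):
    """Step 5: count how often each (attribute, value) pair occurs."""
    return Counter((k, v) for r in records for k, v in r.items())

def evaluate(counts, n_records, min_support):
    """Step 6: keep pairs whose relative frequency reaches min_support."""
    return {p: c / n_records for p, c in counts.items() if c / n_records >= min_support}

data = transform(select(integrate(clean(source_a), clean(source_b)), ["age", "city", "bought"]))
patterns = evaluate(mine(data), len(data), min_support=0.5)

# Step 7: a plain-text "visualization" of the retained patterns.
for (attribute, value), support in sorted(patterns.items()):
    print(f"{attribute}={value}: {support:.2f}")
```

Chaining the steps as function calls makes the point of the slide visible: the mining step is a single call in the middle, and most of the code prepares the data or filters the results.
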

Data Mining techniques
⚫ In general, there are many techniques for analyzing data.
⚫ Data mining techniques are generally applicable to large volumes of data.
⚫ Many different techniques:
◦ to analyze different types of data,
◦ to discover different types of knowledge, to be used in different ways.

What are the applications of data mining?
A few examples:
◦ Fraud detection
◦ Analyzing trends on the stock market
◦ Analyzing the behavior of customers in terms of what they buy
◦ Recommending products to customers in online retail stores
◦ Identifying people in a crowd or in a store

Data Mining is an interdisciplinary research field
⚫ Database systems,
⚫ Algorithmics,
⚫ Computer Science,
⚫ Machine Learning,
⚫ Data visualization,
⚫ Image and signal processing,
⚫ Statistics,
⚫ etc.

Picture from simplileearn.com

Data mining vs Statistics
What is the difference between data mining and statistics?

⚫ Descriptive statistics is about describing data.
⚫ Inferential statistics is about testing hypotheses:
◦ the goal is to draw significant conclusions.

Data mining vs Statistics
⚫ Data mining focuses on the automated discovery of unknown properties of the data (trends, anomalies, correlations…).
◦ the end result is what is important.
⚫ Statistical learning: this term is sometimes used to describe data mining techniques.

Why use data mining?
⚫ To make decisions based on facts rather than intuition.
⚫ To avoid analyzing data by hand, as it is time-consuming and may result in errors.

Data mining software
Some popular software programs are:
◦ Orange: free, open-source
◦ Weka: free, open-source
◦ Knime: free/commercial, open-source
◦ R: a language widely used for data mining and statistics
◦ SPMF: free, open-source (my software)
◦ SAS: commercial software for statistics
◦ … and many others

Data mining software
Typical features of data mining software:
⚫ User interface
⚫ Reading different types of data (files, databases…)
⚫ Preparing the data for analysis
⚫ Providing several algorithms to analyze the data
⚫ Data visualization

Pictures: Weka, Knime, RapidMiner

Data mining software
⚫ Several data mining techniques are designed to be applied to huge databases.
⚫ But they can also be applied to small databases.
⚫ Data mining techniques can be applied to various types of data.

VARIOUS TYPES OF DATA

Relational database
In a typical database system, data is organized as tables.

Traditional database systems allow searching for information in databases (e.g. finding all patients who are male and more than 20 years old).

Data mining allows us to do more. It allows finding correlations, trends, and other types of complex knowledge in data (e.g. finding that young patients are more likely to be cured by a given treatment…).
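
The difference between a database query and a data mining question can be sketched in a few lines. The patient table below is made up for illustration, and the cut-off of 40 years for "young" as well as the helper cure_rate are assumptions of this sketch, not something defined in the course.

```python
# Hypothetical patient table (illustrative values only).
patients = [
    {"name": "A", "sex": "M", "age": 25, "cured": True},
    {"name": "B", "sex": "F", "age": 62, "cured": False},
    {"name": "C", "sex": "M", "age": 18, "cured": True},
    {"name": "D", "sex": "M", "age": 70, "cured": False},
    {"name": "E", "sex": "F", "age": 29, "cured": True},
    {"name": "F", "sex": "F", "age": 55, "cured": True},
]

# Database-style query: the user states the exact condition to search for.
males_over_20 = [p["name"] for p in patients if p["sex"] == "M" and p["age"] > 20]

def cure_rate(group):
    """Fraction of patients in the group who were cured."""
    return sum(p["cured"] for p in group) / len(group)

# Mining-style question: does the cure rate differ between age groups?
young = [p for p in patients if p["age"] < 40]
older = [p for p in patients if p["age"] >= 40]

print(males_over_20)
print(cure_rate(young), cure_rate(older))
```

The query returns rows that match a known condition, while the second part compares groups to surface a relationship the user did not specify in advance. A real data mining tool would search over many such groupings automatically.
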

Transactional data
⚫ A database of customer transactions.
⚫ A transaction is a list of items bought by a customer.
⚫ Example:
TID Bread Milk Noodles Eggs …
1 X X
2 X X
3 X X X
⚫ May contain additional information
◦ e.g. purchase quantities, unit price, time, location…
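
One common way to hold transactional data in a program is a mapping from transaction ID to the set of items bought. The sketch below uses made-up item sets (the exact cells of the slide's table are not reproduced here) and counts how often single items and pairs of items occur.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: TID -> set of items (contents are assumptions).
transactions = {
    1: {"bread", "milk"},
    2: {"bread", "eggs"},
    3: {"bread", "milk", "noodles"},
}

# Number of transactions containing each single item.
item_counts = Counter(item for items in transactions.values() for item in items)

# Number of transactions containing each pair of items.
# Sorting the items first gives every pair a single canonical order.
pair_counts = Counter(
    pair for items in transactions.values() for pair in combinations(sorted(items), 2)
)

print(item_counts["bread"])
print(pair_counts[("bread", "milk")])
```

Using sets ignores purchase quantities; additional information such as quantity or price could be stored by replacing each set with a dictionary from item to its details.
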

Temporal data
Sequences: series of symbols, e.g. a, b, c, b, a, c, d, a
⚫ Sequences of clicks on a website: Page1, Page2, Page4, Page1…
⚫ Protein sequences
⚫ Sequences of moves when playing chess
⚫ Sequences of GPS locations

Temporal data
Time series:
◦ a series of numeric values,
◦ usually obtained at a regular interval,
◦ e.g. stock market data, EEG data, temperature data, student grades over time…

Spatial data
⚫ Spatial or geographic data
◦ e.g. forestry, ecology, infrastructure management
⚫ Spatio-temporal data
◦ spatial and temporal data
◦ e.g. meteorological data, crowd movement, bird migration

Text data
Text documents:
A type of unstructured data: documents that have no clear structure, or are not organized in a predefined manner.
Examples:
🞄 Predicting if someone will like a movie or product
🞄 Analyzing an anonymous text to find the likely author. How old is the author? What is the author's profile?
🞄 Sentiment analysis
🞄 Automatic summarization of a document

Web data
⚫ Web:
◦ a set of documents (webpages)
◦ links between documents
⚫ Examples:
🞄 Predicting the next webpage that someone will visit
🞄 Automatically grouping webpages by topic into categories
🞄 Analyzing the time spent on webpages
🞄 Analyzing data from attacks by hackers on a website
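
Predicting the next webpage can be approached in many ways. One very simple possibility, shown here only as a sketch, is to count which page most often follows each page in past click sequences. The sessions and the helper predict_next are hypothetical, and this first-order counting is just one of many possible models.

```python
from collections import Counter, defaultdict

# Hypothetical click sequences, one list per visitor session.
sessions = [
    ["Page1", "Page2", "Page4", "Page1"],
    ["Page1", "Page2", "Page3"],
    ["Page2", "Page4"],
]

# transitions[page] counts the pages observed immediately after `page`.
transitions = defaultdict(Counter)
for pages in sessions:
    for current, following in zip(pages, pages[1:]):
        transitions[current][following] += 1

def predict_next(page):
    """Most frequently observed successor of `page`, or None if never seen."""
    followers = transitions.get(page)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("Page1"))
print(predict_next("Page2"))
```

The prediction only looks at the current page. Taking longer histories into account is possible but needs more data to give reliable counts.
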

Graphs
⚫ Social networks,
◦ Finding communities
◦ Analyzing the relationships between people
◦ Predicting who will be your friend
◦ Observing how communities evolve
◦ Finding who has the most influence
◦ Finding the location of a person

Heterogeneous data
⚫ Sometimes, we need to analyze data combining multiple types of data (e.g. spatial, temporal, time series, text, GPS, etc.)
⚫ We may also need to analyze data stored using different technologies and file formats (e.g. Excel files, text files, Word documents, pictures, videos, GPS data, audio).

Data streams
⚫ Data stream: a high-speed and non-stop stream of data that is potentially infinite
⚫ E.g. satellite data, video cameras, environmental data
⚫ Challenge: must be analyzed in real time
⚫ Needs:
◦ extract summaries of data,
◦ detect changes (e.g. in trends),
◦ evaluate the state of a stream.
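
Because a stream is potentially infinite, a summary has to be updated one value at a time without storing the stream. A minimal sketch of such a summary is a running count, mean and variance, computed here with Welford's online method. The class name RunningSummary and the sensor readings are assumptions for illustration.

```python
class RunningSummary:
    """Keeps count, mean and variance of a stream without storing it (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new value into the summary in constant time and memory."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far (0.0 for an empty stream)."""
        return self._m2 / self.count if self.count else 0.0


summary = RunningSummary()
for reading in [10.0, 12.0, 11.0, 13.0, 14.0]:  # hypothetical sensor readings
    summary.update(reading)

print(summary.count, summary.mean, summary.variance)
```

A change detector could be built on top of this, for example by comparing each new value with the running mean, but that choice of rule would be a further assumption.
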

We may want to extract different types of “patterns” from data.

TYPES OF PATTERNS

Clusters
Clustering: consists of automatically grouping similar objects/instances into groups (clusters) of similar instances.
Examples:
• Hospital patients having a similar profile
• Individuals who are likely to develop a gambling dependency
• Taxonomy of animals
• Students with a similar learning profile
Used to summarize data, for decision making…
We want to discover « natural » clusters.
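
As an illustration of clustering, here is a minimal sketch of K-Means (the clustering technique named in the course objectives) on 2-D points. The points are made up, and two simplifications are assumptions of this sketch: the first k points are used as initial centers, and the loop runs for a fixed number of iterations instead of testing for convergence.

```python
import math

def k_means(points, k, iterations=10):
    """Plain K-Means on 2-D points; the first k points serve as initial centers."""
    centers = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, clusters

# Hypothetical data with two visibly separate groups.
points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
centers, clusters = k_means(points, k=2)
print(clusters)
```

K-Means needs the number of clusters k to be chosen beforehand, and its result can depend on the initial centers, which is why practical implementations usually try several initializations.
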

Classification
⚫ Classification: build a model that can automatically classify instances into different categories/classes.
⚫ Several applications:
◦ predicting who will pay back their debt and who will not,
◦ predicting who will fail/pass a course,
◦ handwritten character recognition.
⚫ Several techniques:
◦ Neural networks, SVM, decision trees, Naïve Bayes classifier, etc.

e.g. ID3 decision tree
Training data

A decision tree to predict the « play? » attribute

A prediction:
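
ID3 builds a decision tree by repeatedly choosing the attribute with the highest information gain. The sketch below shows only that selection step for the root of the tree, not the full recursive algorithm. The training rows are hypothetical and are not the table shown on the slide; the attribute names outlook and windy are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training data with a "play" class attribute.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

# The attribute with the highest gain becomes the root of the tree.
root = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(root)
```

In this toy data, outlook separates the two classes perfectly while windy carries no information, so outlook is selected. The full algorithm would then repeat the same choice inside each branch.
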

Discovering patterns
⚫ Discovering values that frequently appear together in the data:
🞄 30% of the tourists visiting Ghana are less than 30 years old and have a university degree.
⚫ Discovering strong associations in data:
🞄 There is a 60% conditional probability that tourists visiting the Central Region will also visit the Western Region.
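
The two kinds of statements above correspond to two standard measures: support (how often values appear together) and confidence (a conditional probability). The sketch below computes both on hypothetical visit records that were constructed so that the confidence comes out at 60%; the data and the inclusion of an "Ashanti" region are assumptions for illustration.

```python
# Hypothetical visit records: the set of regions each tourist visited.
visits = [
    {"Central", "Western"},
    {"Central", "Western", "Ashanti"},
    {"Central", "Western"},
    {"Central"},
    {"Central", "Ashanti"},
    {"Ashanti"},
    {"Western"},
    {"Ashanti", "Western"},
    {"Ashanti"},
    {"Ashanti"},
]

def support(itemset, records):
    """Fraction of records that contain every item of `itemset`."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, records) / support(antecedent, records)

print(support({"Central", "Western"}, visits))
print(confidence({"Central"}, {"Western"}, visits))
```

Here 5 of the 10 tourists visited the Central Region and 3 of those also visited the Western Region, which gives a support of 0.3 for the pair and a confidence of 0.6 for the association.
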

Anomalies, outliers
Detecting what is abnormal (anomalies, outliers) is interesting and has many applications, e.g.
◦ detecting hackers attacking a computer system,
◦ identifying potential terrorists based on suspicious behavior,
◦ detecting fraud on the stock market.
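
One of the simplest ways to flag abnormal values is the z-score rule: a value is suspicious if it lies many standard deviations away from the mean. This is only one basic technique among many, and the threshold of 2.0, the function name and the login counts below are assumptions for illustration.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical number of login attempts per hour on a server.
logins = [12, 15, 11, 14, 13, 12, 16, 95, 13, 14]
print(z_score_outliers(logins))
```

A known weakness of this rule is that a large outlier inflates the mean and standard deviation it is judged against, so robust alternatives based on the median are often preferred in practice.
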

Trends, regularities, periodic patterns…
Several applications:
◦ studying patterns in the stock market
🞄 to predict stock prices and make investment decisions,
🞄 to understand the past.
◦ discovering regularities to predict earthquake aftershocks,
◦ finding cycles in the behavior of a system,
◦ discovering the sequence of events that leads to a system failure.
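
Finding a cycle in the behavior of a system can be sketched with autocorrelation: the series is compared with a shifted copy of itself, and the shift that gives the strongest agreement suggests the period. The readings below are made up to repeat every 4 steps, and the search range of lags 1 to 8 is an assumption of this sketch.

```python
def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Hypothetical readings that repeat every 4 steps.
readings = [1, 3, 5, 3] * 6

# The lag with the highest autocorrelation is taken as the estimated period.
best_lag = max(range(1, 9), key=lambda lag: autocorrelation(readings, lag))
print(best_lag)
```

On real, noisy data the peak is less sharp, and multiples of the true period also score highly, so the result should be read as a hint rather than a definitive answer.
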

FINDING INTERESTING PATTERNS IN DATA

Data mining techniques are designed to extract interesting patterns and knowledge from data.

How can we evaluate the patterns or knowledge found in data to ensure that they are interesting and useful?

Finding interesting patterns
⚫ Data mining techniques can find millions of patterns in data.
⚫ As humans, we do not want to analyze millions of patterns.
⚫ Thus, we need to filter patterns to obtain a set of patterns that is interesting or useful.
⚫ To evaluate patterns, different measures are used in data mining.
⚫ Evaluating patterns can be done during data mining or afterwards (as post-processing).

What is an interesting pattern?
⚫ A pattern is interesting if:
◦ it is easy to understand,
◦ it is still valid for new data,
◦ it is useful,
◦ it is novel or unexpected.
⚫ Several measures:
◦ objective measures:
🞄 e.g. how frequently a pattern appears
◦ subjective measures:
🞄 e.g. how interesting a pattern is for a person

Opportunities for research
⚫ Analyzing how users think about or react to products
◦ Sensor data (EEG, etc.), text, feedback forms, etc.
◦ Studying the influence of emotions on customer satisfaction for products/designs
◦ Not just satisfaction but also other reactions such as confusion and motivation, and why they occur

Opportunities for research
⚫ Analyzing how users utilize a product
◦ using data mining,
◦ using cognitive models to explain their behavior,
◦ e.g. aspects related to spatial cognition such as spatial reasoning.
⚫ Analyzing users' wants and needs
◦ Using data mining techniques to analyze customer reviews from websites
◦ Using data mining techniques to analyze/understand sales data of similar products or characteristics

Opportunities for research
⚫ Predicting how users will behave as customers
◦ Predicting which customers will buy or not, churn prediction, ...
◦ Modeling customer buying behavior (e.g. to predict return on investment)
⚫ Analyzing data from the manufacturing processes (e.g. equipment management, fault detection, inspection data, quality monitoring, etc.)
⚫ Analyzing data about suppliers such as reputation, performance, etc.

Conclusion
In this part, I have introduced:
⚫ the topic of data mining,
⚫ different types of data,
⚫ different types of patterns,
⚫ some opportunities for research.

References
⚫ Han and Kamber (2011), Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers.
⚫ Tan, Steinbach & Kumar (2006), Introduction to Data Mining, Pearson Education, ISBN-10: 0321321367.
⚫ Zhao, Y. (2012), R and Data Mining: Examples and Case Studies, Academic Press, Elsevier.
⚫ And other sources.
