Data Mining Intro
(Techniques)
Course Instructor
Umaru Mohammed Yussif
Department of Computer Science and Engineering
Main objective
The main objective is to provide an introduction to the field of data mining.
Specific objectives
⚫ Why perform data mining?
⚫ How to do data mining?
⚫ What are the main data mining techniques?
◦ Clustering techniques: K-Means
◦ Pattern mining: discover interesting patterns in databases
◦ Classification
◦ Outlier detection
◦ Other popular topics in data mining
◦ Challenges and research directions in data mining
⚫ How do data mining techniques work?
Perspective
This course is quite short (20 hours).
⚫ We will focus on understanding the main types of data mining techniques,
◦ how they work,
◦ their advantages and limitations.
Lecture slides
⚫ Slides for each lecture will be provided (PDFs) after each lecture.
⚫ Reference materials:
◦ Han and Kamber (2011), Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers.
◦ Tan, Steinbach & Kumar (2006), Introduction to Data Mining, Pearson Education, ISBN-10: 0321321367.
◦ Zaki & Meira (2014), Data Mining and Analysis: Fundamental Concepts and Algorithms.
◦ Aggarwal (2015), Data Mining: The Textbook.
◦ Friedman et al. (2009), The Elements of Statistical Learning.
Lecture slides
⚫ Slides for this course are modified from the slides of:
1. Philippe Fournier-Viger, School of Natural Sciences and Humanities, HITZ
2. Jiawei Han, Univ. Illinois at Urbana-Champaign
1 – INTRODUCTION
WHAT IS DATA MINING?
Introduction
⚫ Nowadays,
◦ storing data on computers is cheap,
◦ transferring data between computers is fast.
Introduction
Small and cheap devices, such as smartphones and other sensors, can collect a lot of data.
Pictures from bellenews.com
Data collected from vehicles
Picture from eletimes.com
Picture from ebay.co.uk
Internet of things
⚫ Different objects can communicate and exchange data.
⚫ ~30 billion connected objects projected for 2020 (Nordrum 2016)
Picture from teamarin.net
Data collected from humans
⚫ Movements,
⚫ Brain signals,
⚫ Skin conductivity,
⚫ Heart rate,
⚫ Blood pressure,
⚫ Eye movements,
⚫ Spatial locations,
⚫ …
Data collected from the industry
⚫ Internal data
◦ Data about employees, customers, market, etc.
⚫ Banking:
◦ Spending patterns, income, social media, …
⚫ Retail industry:
Introduction
As a result, a huge amount of data is collected and stored in databases.
« Big Data »
How can I analyze my data?
Analyzing data by hand?
• time-consuming
• may miss important information
• not suitable for “big data”.
Data Rich but Information Poor!
The Goldfields Story:
• I (a data scientist) and a colleague (a metallurgist) were at Goldfields (17/09/2017) to discuss possible data analysis and knowledge discovery with the CIL metallurgical data for decision making.
• The person in charge said one of the following:
1. You know Prof XYZ, when I was a student, he was my TA. Go and generate your own data and do the analysis.
2. You know Prof XYZ, when I was a TA, he was a student. Go and generate your own data and do the analysis.
• He saw us (two Doctors) as too young to be trusted with the data; he was probably scared of data leakage.
• We are researchers; his fears could have been removed if he had made us sign an NDA.
« Data rich but information poor »
Data Rich but Information Poor!
The Goldfields Story vs a Mine willing to Change!
and-mining/how-we-help-clients/inside-a-mining-companys-ai-transformation
« Data rich but information poor »
What is data mining?
⚫ Data mining consists of techniques to automatically discover interesting patterns in data (discover knowledge).
⚫ Two goals:
I. Understand the past
e.g. Why was there an earthquake last year?
II. Predict the future
e.g. Will there be an earthquake tomorrow?
e.g. Will this customer pay back their debt?
How to do data mining?
⚫ To do data mining, a process consisting of seven steps is followed.
The Knowledge Discovery process
The seven steps fall into three phases: preparing data, discovering patterns, and evaluating patterns and using them.
1. Data cleaning (remove noisy data and fix inconsistencies)
2. Data integration (integrate data from multiple sources)
3. Data selection (select relevant data)
4. Data transformation
5. Discovering patterns (data mining)
6. Evaluate the patterns found using interestingness measures
7. Visualize the discovered patterns
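As a rough illustration, the seven steps can be sketched in Python on a tiny in-memory dataset. Everything here is an assumption made for the example: the records, the field names (`id`, `age`, `items`), the function names, and the 0.5 minimum support threshold. Real projects would use a database and a data mining library instead.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are made up).
source_a = [
    {"id": 1, "age": "25", "items": "bread,milk"},
    {"id": 2, "age": "", "items": "bread"},        # missing age: noisy record
    {"id": 3, "age": "41", "items": "milk,eggs"},
]
source_b = [
    {"id": 4, "age": "33", "items": "bread,milk,eggs"},
    {"id": 3, "age": "41", "items": "milk,eggs"},  # duplicate of id 3
]

def clean(records):
    """Step 1: drop records that have an empty field."""
    return [r for r in records if all(str(v).strip() for v in r.values())]

def integrate(*sources):
    """Step 2: merge the sources, keeping one record per id."""
    merged = {}
    for source in sources:
        for r in source:
            merged.setdefault(r["id"], r)
    return list(merged.values())

def select(records):
    """Step 3: keep only the attribute relevant to this analysis."""
    return [r["items"] for r in records]

def transform(item_strings):
    """Step 4: turn comma-separated strings into sets of items."""
    return [frozenset(s.split(",")) for s in item_strings]

def mine(transactions):
    """Step 5: count how often each single item appears."""
    return Counter(item for t in transactions for item in t)

def evaluate(counts, n, min_support=0.5):
    """Step 6: keep items whose relative frequency reaches min_support."""
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

def visualize(patterns):
    """Step 7: render the patterns as a plain-text bar chart."""
    return [f"{item:<6}{'#' * round(s * 10)} {s:.2f}"
            for item, s in sorted(patterns.items())]

transactions = transform(select(integrate(clean(source_a), clean(source_b))))
patterns = evaluate(mine(transactions), len(transactions))
for line in visualize(patterns):
    print(line)
```

Each function maps to one step, so a step can be replaced (for example, a different cleaning rule) without touching the others.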
Data Mining techniques
⚫ In general, there are many techniques for analyzing data.
⚫ Data mining techniques are generally applicable to large volumes of data.
⚫ Many different techniques:
◦ to analyze different types of data,
◦ to discover different types of knowledge, to be used in different ways.
What are the applications of data mining?
A few examples:
◦ Fraud detection
◦ Analyzing trends on the stock market
◦ Analyzing the behavior of customers in terms of what they buy
◦ Recommending products to customers on online retail stores
◦ Identifying people in a crowd or at a store
Data Mining is an interdisciplinary research field
⚫ Database systems,
⚫ Algorithmics,
⚫ Computer Science,
⚫ Machine Learning,
⚫ Data visualization,
⚫ Image and signal processing,
⚫ Statistics,
⚫ etc.
simplileearn.com
Data mining vs Statistics
What is the difference between data mining and statistics?
Why use data mining?
⚫ To make decisions based on facts rather than on intuition.
⚫ To avoid analyzing data by hand, as it is time-consuming and may result in errors.
Data mining software
Some popular software programs are:
◦ Orange: free, open-source
◦ Weka: free, open-source
◦ KNIME: free/commercial, open-source
◦ R: a language widely used for data mining and statistics
◦ SPMF: free, open-source (developed by Philippe Fournier-Viger)
◦ SAS: commercial software for statistics
◦ … and many others
Data mining software
Typical features of data mining software:
⚫ User interface
⚫ Read different types of data (files, databases…)
⚫ Prepare the data for analysis
⚫ Provide several algorithms to analyze the data
⚫ Data visualization
(Screenshots: Weka, KNIME, RapidMiner)
Data mining software
⚫ Several data mining techniques are designed to be applied to huge databases.
⚫ But they can also be applied to small databases.
Relational database
In a typical database system, data is organized as tables:
Data mining allows us to do more than query these tables. It can find correlations, trends, and other types of complex knowledge in data (e.g. finding that young patients are more likely to be cured by a given treatment…).
Transactional data
⚫ A database of customer transactions.
⚫ A transaction is a list of items bought by a customer.
⚫ Example:
TID Bread Milk Noodles Eggs …
1 X X
2 X X
3 X X X
Temporal data
Time series:
◦ a series of numeric values,
◦ usually obtained at regular intervals.
◦ e.g. stock market data, EEG data, temperature data, student grades over time…
Spatial data
⚫ Spatial or geographic data
◦ e.g. forestry, ecology, infrastructure management
⚫ Spatio-temporal data
◦ spatial and temporal data
◦ e.g. meteorological data, crowd movement, bird migration
Text data
Text documents:
A type of unstructured data: documents that have no clear structure, or are not organized in a predefined manner.
Examples:
🞄 Predicting if someone will like a movie or product
🞄 Analyzing an anonymous text to find the likely author. How old is the author? What is the author's profile?
🞄 Sentiment analysis
🞄 Automatic summarization of a document
Web data
⚫ Web:
◦ a set of documents (webpages)
◦ links between documents
⚫ Examples:
🞄 Predicting the next webpage that someone will visit
🞄 Automatically grouping webpages by topic into categories
🞄 Analyzing the time spent on webpages
🞄 Analyzing data from attacks by hackers on a website
Graphs
⚫ Social networks,
◦ Finding communities
◦ Analyzing the relationships between people
◦ Predicting who will be your friend
◦ Observing how communities evolve
◦ Finding who has the most influence
◦ Finding the location of a person
Heterogeneous data
⚫ Sometimes, we need to analyze data combining multiple types of data (e.g. spatial, temporal, time series, text, GPS, etc.)
⚫ We may also need to analyze data stored using different technologies and file formats (e.g. Excel files, text files, Word documents, pictures, videos, GPS data, audio).
Data streams
⚫ Data stream: a high-speed and non-stop stream of data that is potentially infinite
⚫ E.g.: satellite data, video cameras, environmental data.
⚫ Challenge: must be analyzed in real time
⚫ Needs:
◦ extract summaries of data,
◦ detect changes (e.g. trends),
◦ evaluate the state of a stream
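Because a stream is potentially infinite, a summary has to be updated one value at a time without storing the whole stream. The sketch below is one simple way to do this, not a standard stream mining algorithm from the course: it keeps a running mean and standard deviation (Welford's method) and flags values that lie far from the summary built so far. The class and function names, the threshold of 3 standard deviations, the warm-up length, and the sensor readings are all assumptions made for the example.

```python
import math

class RunningSummary:
    """Count, mean and variance of a stream in constant memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; 0.0 until at least two values are seen.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

def detect_changes(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for values far from the summary built so far."""
    summary = RunningSummary()
    for i, x in enumerate(stream):
        if summary.n >= warmup and summary.std > 0:
            if abs(x - summary.mean) > threshold * summary.std:
                yield i, x
                continue  # keep the flagged value out of the summary
        summary.update(x)

# Hypothetical temperature readings arriving one at a time.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1, 19.7]
changes = list(detect_changes(readings))
print(changes)
```

`detect_changes` is a generator, so it can consume an endless source of readings and report changes as they occur.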
TYPES OF PATTERNS
We may want to extract different types of “patterns” from data.
Clusters
Clustering consists of automatically grouping similar objects/instances into groups (clusters).
Examples:
• Hospital patients having a similar profile
• Individuals who are likely to develop a dependency on gambling
• Taxonomy of animals
• Students with similar learning profiles
Used to summarize data, for decision making…
We want to discover « natural » clusters.
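K-Means, the clustering technique named in the course objectives, can be sketched in a few lines. This is a minimal version under stated assumptions: points are 2D tuples, the first k points serve as initial centroids (a naive choice; real implementations usually pick them randomly or with smarter seeding), and the sample points are made up. `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means (Lloyd's algorithm) on a list of coordinate tuples."""
    centroids = list(points[:k])  # naive initialisation: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D points forming two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 1.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The number of clusters k must be chosen in advance, which is one of the known limitations of K-Means when looking for « natural » clusters.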
Classification
⚫ Classification: build a model that can automatically classify instances into different categories/classes.
⚫ Several applications:
◦ predicting who will pay back their debt and who will not,
◦ predicting who will pass or fail a course,
◦ handwritten character recognition.
⚫ Several techniques:
◦ neural networks, SVM, decision trees, Naïve Bayes classifier, etc.
e.g. ID3 decision tree
(Figure: training data and a prediction made by the tree)
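ID3 builds a decision tree by repeatedly splitting on the attribute with the highest information gain, that is, the attribute that most reduces the entropy of the class labels. The sketch below only computes that gain for each attribute of a small training set; it does not build the full tree. The training records, attribute names (`income`, `student`) and class label (`repays`) are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training data: will a customer repay a debt?
training = [
    {"income": "high", "student": "no",  "repays": "yes"},
    {"income": "high", "student": "yes", "repays": "yes"},
    {"income": "low",  "student": "no",  "repays": "no"},
    {"income": "low",  "student": "yes", "repays": "no"},
    {"income": "low",  "student": "yes", "repays": "yes"},
    {"income": "high", "student": "no",  "repays": "yes"},
]

gains = {a: information_gain(training, a, "repays") for a in ("income", "student")}
best = max(gains, key=gains.get)
print(gains, best)
```

On this data `income` has the larger gain, so ID3 would place it at the root of the tree and then repeat the same computation inside each branch.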
Discovering patterns
⚫ Discovering values that appear frequently together in the data:
🞄 30 % of the tourists visiting Ghana are less than 30 years old and have a university degree.
⚫ Discovering strong associations in data:
🞄 There is a 60 % conditional probability that tourists visiting the Central Region will also visit Western Region.
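The two kinds of statements above correspond to the usual measures of support (how often values appear together) and confidence (a conditional probability). A minimal sketch follows; the function names and the six tourist records are hypothetical and were chosen only so that the confidence comes out at about 0.60, as in the example above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical data: the regions visited by each of six tourists.
visits = [
    {"Central", "Western"},
    {"Central", "Western", "Ashanti"},
    {"Central"},
    {"Central", "Western"},
    {"Central", "Volta"},
    {"Ashanti"},
]

print(support(visits, {"Central", "Western"}))       # share visiting both
print(confidence(visits, {"Central"}, {"Western"}))  # about 0.60
```

Pattern mining algorithms search over all possible itemsets; this sketch only evaluates itemsets that are given to it.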
Anomalies, outliers
Detecting what is abnormal (anomalies, outliers) is interesting and has many applications.
e.g.
◦ detecting hackers attacking a computer system,
◦ identifying potential terrorists based on suspicious behavior,
◦ detecting fraud on the stock market
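One simple way to flag abnormal numeric values, among many outlier detection techniques, is the modified z-score based on the median absolute deviation (MAD). It is less distorted by the outliers themselves than a mean-based score. The function name, the conventional 3.5 cut-off, and the transaction amounts below are assumptions made for this sketch.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread around the median: nothing can be scored
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical transaction amounts with one suspicious value.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 400]
print(mad_outliers(amounts))
```

This only handles a single numeric attribute; detecting hackers or fraud in practice usually requires combining many attributes.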
Trends, regularities, periodic patterns…
Several applications:
◦ studying patterns in the stock market
🞄 to predict stock prices and make investment decisions,
🞄 to understand the past,
◦ discovering regularities to predict earthquake aftershocks,
◦ finding cycles in the behavior of a system,
◦ discovering the sequence of events that leads to a system failure.
FINDING INTERESTING PATTERNS IN DATA
Finding interesting patterns
⚫ Data mining techniques can find millions of patterns in data.
⚫ As humans, we do not want to analyze millions of patterns.
⚫ Thus, we need to filter patterns to obtain a set of patterns that is interesting or useful.
⚫ To evaluate patterns, different measures are used in data mining.
⚫ Patterns can be evaluated during data mining or afterwards (as post-processing).
What is an interesting pattern?
⚫ A pattern is interesting if:
◦ it is easy to understand,
◦ it is still valid for new data,
◦ it is useful,
◦ it is novel or unexpected.
⚫ Several measures:
◦ objective measures:
🞄 e.g. how frequently a pattern appears
◦ subjective measures:
🞄 e.g. how interesting a pattern is for a person
Opportunities for research
⚫ Analyzing how users think about or react to products
◦ Sensor data (EEG, etc.), text, feedback forms, etc.
◦ Studying the influence of emotions on customer satisfaction for products/designs
◦ Not just satisfaction but also other reactions such as confusion and motivation, and why they occur
Opportunities for research
⚫ Analyzing how users use a product
◦ using data mining,
◦ using cognitive models to explain their behavior,
◦ e.g. aspects related to spatial cognition such as spatial reasoning.
⚫ Analyzing users' wants and needs
◦ Using data mining techniques to analyze customer reviews from websites,
◦ Using data mining techniques to analyze/understand sales data of products with similar characteristics
Opportunities for research
⚫ Predicting how users will behave as customers
◦ Predicting which customers will buy or not, churn prediction, ...
◦ Modeling customer buying behavior (e.g. to predict return on investment)
⚫ Analyzing data from manufacturing processes (e.g. equipment management, fault detection, inspection data, quality monitoring, etc.)
⚫ Analyzing data about suppliers such as reputation, performance, etc.
Conclusion
In this part, I have introduced:
⚫ the topic of data mining,
⚫ different types of data,
⚫ different types of patterns,
⚫ some opportunities for research.
References
⚫ Han and Kamber (2011), Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers.
⚫ Tan, Steinbach & Kumar (2006), Introduction to Data Mining, Pearson Education, ISBN-10: 0321321367.
⚫ Zhao, Y. (2012), R and Data Mining: Examples and Case Studies, Academic Press, Elsevier.
⚫ And other sources.