Introduction-to-Data-Mining

Uploaded by

Aya
Copyright
© © All Rights Reserved

Data Mining

Chapter 1. Introduction
SASSI Abdessamed
Motivation
Why do we need data mining?
● Nowadays, the total worldwide volume of data is very large
■ Hundreds of zettabytes (1 ZB = 10^21 bytes ≈ 2^70 bytes)
● Data types and formats can be complex
■ Video, Image, Audio, etc.
● Most data formats are not human readable
■ Binary formats
● Humans cannot deal with such an amount and complexity of data
● We need concise insights and patterns to make decisions
Is data mining a misnomer?
● Taken literally, “data mining” suggests gathering or collecting data
● In practice, data mining means extracting knowledge from data
● This knowledge is like gold nuggets hidden in a large volume of data
● Hence the word “mining” in the name
● So,
● What is data?
● What is knowledge?
● And what does data mining really mean?
Data
What is data?
● Data are collected observations or measurements represented as Text,
Numbers, or Multimedia [3].
● Data can be quantitative (represent quantities or numerical values)
■ Sensory data (Temperature, Light, Pixel Intensities, Voltage, …)
■ Time Durations (Age, Travel Length, …)
■ Size & Length Measurements (Area, Volume, Distance, Length, …)
■ Health Measurements (Blood Pressure, Sugar Level, O2 Saturation, …)
● Data can also be qualitative (categorical)
■ Text (words, letters, digits, …)
■ Age Classes (e.g. Football Age categories)
■ Blood Types
● Data can also be a complex mixture of the two types
■ E.g. Maps (Graphs)
Data vs Knowledge
● A book does not know its own content
● Knowing = being aware of the information we possess
■ Understanding
■ Being able to act and make decisions
■ Produce new thoughts
■ Discover Patterns
● Unlike merely having information, knowing is an active process
● How can we make computers discover knowledge on their own?
Data Sources
● In our daily lives we produce tons of data (information)
■ Social Networks, Emails, Blogs, …
■ E-Commerce, Banking, Stores, …
■ Hospitals & Health reports
■ Administrative records
● Hence, data can be supplied by a variety of technologies:
■ Relational databases
■ Data warehouses
■ Transaction databases
■ Text databases
■ Social network data
■ World-Wide Web
■ Time-series data
Data Formats
● The data we want to analyse using data mining methods have various
formats
■ Transactions
■ N-dimensional Vectors (data points)
■ Graphs
■ Tables
■ etc.
● The format of the data determines the data mining algorithm we can use
● We may also change the format of the data in order to be able to use a
certain type of algorithm
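As an illustration of changing the data format, the sketch below converts transactions into N-dimensional binary vectors so that a vector-based algorithm could be applied to them. The function name and the sample items are assumptions made for this example, not part of the course material.

```python
def transactions_to_vectors(transactions):
    """Encode each transaction as a 0/1 vector over the sorted item vocabulary."""
    # The vocabulary is every distinct item seen in any transaction.
    items = sorted({item for t in transactions for item in t})
    # One dimension per item: 1 if the transaction contains it, else 0.
    vectors = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, vectors


# Hypothetical shopping baskets.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

items, vectors = transactions_to_vectors(transactions)
print(items)    # ['bread', 'butter', 'milk']
for v in vectors:
    print(v)    # [1, 0, 1] then [1, 1, 0] then [1, 1, 1]
```

The reverse conversion (vectors back to item sets) is equally simple, which is why this encoding is a common bridge between transaction data and vector-based methods.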
Data Preparation & Preprocessing
● Data integration. Combining data from multiple sources
■ Joining multiple tables.
■ Resolving data inconsistencies from different sources.
● Data selection. Selecting domain relevant data.
■ Selecting a specific subset of attributes (columns)
● Data cleaning.
■ Noise Reduction: Removing or correcting noisy data
■ Outlier Detection: Identifying and handling outliers
■ Handling Missing Values: Removing or filling in missing data
● Data Reduction.
■ Dimensionality Reduction: to reduce the number of attributes while retaining
important information.
■ Sampling: Selecting a subset of the data that represents the whole dataset to reduce
computation time.
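A minimal sketch of two of the steps above, assuming a single numerical attribute: filling in missing values with the mean of the observed values (one possible strategy among several) and drawing a simple random sample. The function names, the seed, and the temperature readings are hypothetical.

```python
import random
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def sample_rows(rows, k, seed=0):
    """Draw k rows without replacement (simple random sampling)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(rows, k)


# Hypothetical sensor readings with two missing measurements.
temperatures = [21.0, None, 23.0, 22.0, None]

cleaned = fill_missing(temperatures)
print(cleaned)  # [21.0, 22.0, 23.0, 22.0, 22.0]

subset = sample_rows(cleaned, 3)
print(len(subset))  # 3
```

Mean imputation is only one choice; removing incomplete records or using the median are equally valid depending on the data.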
Data Preparation & Preprocessing (continued)
● Data Transformation.
■ Normalization: Scaling numerical data to a common range
■ Data Discretization: Converting continuous attributes into discrete bins or categories
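The two transformations can be sketched as follows, assuming min-max scaling to the range [0, 1] for normalization and equal-width bins for discretization. Both are common choices, but the slides do not prescribe a specific method, and the function names and ages are illustrative.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, n_bins):
    """Equal-width binning: return a bin index in 0..n_bins-1 for each value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [10, 20, 30, 40, 50]

print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(discretize(ages, 2))      # [0, 0, 1, 1, 1]
```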
Data Mining
What is data mining?
● Extracting or “mining” knowledge from large amounts of data [1].
● A set of software techniques for identifying / discovering useful
patterns and trends from large amounts of data through automated
analysis.
● Obtaining a simplified view of data to help with decision making.
● Extracting Knowledge from data.
What is knowledge in this context?
● For data mining, knowledge is in the form of Patterns and Insights:
■ (If .. Then) Rules
■ Associations
■ Anomalies
■ Recommendations
■ Groups & Classes (Clusters)
■ Predictions
■ Correlations
Intersection with other fields & technologies
● Statistics
■ A variety of data mining algorithms involve some methods from the field of statistics
■ The methods of statistics themselves can be used as low-level data mining methods
● Databases
■ Most of the data sources will be stored using database technology
● Data warehouses
■ Data mining is generally applied to data integrated in a data warehouse
● Machine Learning
■ We can use some of these techniques to learn patterns
● Data visualization
■ To become familiar with the data, detect outliers, and decide what preprocessing we need
■ To display the extracted patterns and make decisions after data mining
Why Data Mining?
● Large quantities of data to be analysed
■ Algorithms must be highly scalable
● High dimensionality of the data to be analysed
■ Each record of data is a vector with a large number of dimensions (attributes)
● Some data types are complex by nature
■ Web pages
■ Multimedia
■ Sensor data
■ Graphs
■ Social networks
■ …
Data mining process
[Figure: data mining process, with stages labelled Data Collection, Databases, Data Integration, Data warehouse, Data mining, and Patterns]
Data mining as a step in KDD
KDD = Knowledge Discovery from Data
1. Data selection.
■ Identifying relevant datasets and selecting data that is important for our need / task
2. Data Preprocessing.
■ Cleaning the data by handling missing values, noise, and inconsistencies.
3. Data transformation.
■ Changing the form of the data to suit the data mining algorithms to be used
4. Data mining.
■ A set of intelligent data analysis techniques
5. Pattern evaluation
■ Interpreting the discovered patterns and evaluating their interestingness.
6. Knowledge presentation.
■ Visualizing the discovered knowledge (patterns)
Data mining as a step in KDD (continued)
Architecture of a typical data mining system [1]
[Figure: layered architecture. Data sources (Database, Data Warehouse, World Wide Web, and other types of repositories such as spreadsheets, NoSQL, …) pass through Data Cleaning, Integration, and Selection into the Database / Data Warehouse Server]
Data Mining Tasks
Categories of Data Mining Tasks
● Data mining tasks fall into one of two categories
● Descriptive Mining Tasks (Unsupervised learning)
■ Clustering: find groups of similar items
■ Association rules: find relations between items
● Predictive Mining Tasks (Supervised learning)
■ Classification: assign data to their predefined classes
■ Regression: fit a function that maps data to continuous numerical values
■ Time series analysis: analyse how data evolve over time
Association Rules Mining
● Frequent Patterns, Associations, and Correlations Mining
● Frequent Itemsets. Unordered sets of items that appear together very often.
■ Milk and Bread are frequently bought together.
● Frequent Subsequences. Ordered sets of items that appear together very often.
■ PC → Camera → Memory Card
● Association Analysis can uncover:
■ Single-dimensional Association Rules
BUY(X, “COMPUTER”) ⇒ BUY(X, “SOFTWARE”) [Support=1%, Confidence=50%]
■ Multi-dimensional Association Rules
AGE(X, “20..29”) ∧ INCOME(X, “20K..29K”) ⇒ BUY(X, “CD Player”) [Support=1%, Confidence=50%]
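The support and confidence attached to a rule A ⇒ B can be computed directly from the transactions: support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B. The sketch below uses a small hypothetical set of transactions; the helper names are assumptions for this example.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / n_antecedent


# Hypothetical purchase records.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"printer"},
]

s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})
print(f"support={s:.0%}, confidence={c:.0%}")  # support=40%, confidence=50%
```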
Classification and Prediction
● Classification. Describes a class/concept as a function (model) that can
be used later to predict the classes of new objects.
● Prediction. Finds a function (model) that can predict missing
continuous numerical values.
● In both cases, we need a set of objects with known labels (classes /
outputs) to train the model
■ Training Dataset
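To make the role of the training dataset concrete, here is a sketch with two deliberately simple models chosen for illustration (the slides do not name specific algorithms): a 1-nearest-neighbour classifier for classification and a least-squares line for numerical prediction. All data values are hypothetical.

```python
import math


def nearest_neighbor_predict(training, query):
    """Return the label of the training object closest to query (1-NN)."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], query))
    return label


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Classification: objects with known class labels form the training dataset.
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]
print(nearest_neighbor_predict(training, (2.0, 1.0)))  # low
print(nearest_neighbor_predict(training, (7.0, 9.0)))  # high

# Prediction: objects with known numerical outputs form the training dataset.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)       # 2.0 1.0
print(a * 5 + b)  # predicted value for x = 5: 11.0
```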
Cluster Analysis (Clustering)
● Unsupervised classification
● We group objects into clusters (classes) that are initially unknown
● We use the concept of similarity between objects.
● Minimize the inter-class similarity (similarity of objects from different
clusters)
● Maximize the intra-class similarity (similarity of objects of the same
cluster)
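One way to realise these ideas is k-means, shown here as a minimal sketch. It is an assumed choice, not a method named in the slides. Similarity is measured by Euclidean distance (closer means more similar), the initial centroids are naively taken to be the first k points, and the sample points are hypothetical.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and centroid update."""
    centroids = [points[i] for i in range(k)]  # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
for cluster in clusters:
    print(cluster)
```

On this toy data the two visually obvious groups are recovered. In general the result of k-means depends on the initial centroids, so practical implementations use better initialisation and a convergence test.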
Outlier Analysis
● Detect objects in the data that are irregular with respect to other objects
● Can be used for:
■ Anomaly detection
■ Fraudulent Credit Card Transactions
■ …
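A simple way to flag irregular objects in a single numerical attribute is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes a threshold of 2, which is an arbitrary choice for this small example, and uses hypothetical transaction amounts.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing is irregular
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one is clearly unusual.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]

outliers = z_score_outliers(amounts)
print(outliers)  # [500]
```

Because the mean and standard deviation are themselves distorted by extreme values, more robust variants (for example, based on the median) are often preferred in practice.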
Pattern Evaluation
Pattern Interestingness
● A pattern is considered interesting if [1]:
1. It is easily understood by humans.
2. It can be generalized to new, unseen (test) data with some degree of certainty.
3. It is useful.
4. It is novel (adds something new to our knowledge).
● Various performance (quality) metrics can be used to evaluate (assess)
the usefulness or interestingness of discovered patterns.
● The definition of these performance metrics depends highly on the
nature and structure of the patterns.
● We can prune away uninteresting patterns by comparing their quality to
a threshold defined by the user.
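Threshold-based pruning can be sketched as a simple filter. Here the patterns are association rules and the quality metrics are support and confidence. The rules, their metric values, and the thresholds are all hypothetical.

```python
def prune(patterns, min_support, min_confidence):
    """Keep only the patterns whose quality metrics reach the user-defined thresholds."""
    return [
        p for p in patterns
        if p["support"] >= min_support and p["confidence"] >= min_confidence
    ]


# Hypothetical discovered rules with their quality metrics.
patterns = [
    {"rule": "computer => software", "support": 0.10, "confidence": 0.50},
    {"rule": "milk => bread", "support": 0.30, "confidence": 0.80},
    {"rule": "printer => camera", "support": 0.01, "confidence": 0.90},
]

interesting = prune(patterns, min_support=0.05, min_confidence=0.60)
for p in interesting:
    print(p["rule"])  # milk => bread
```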
Data Mining Applications
Some Applications
● Healthcare
■ Diagnosis and Treatment: Identifying patterns in patient data to help diagnose diseases
and recommend treatments.
■ Medical Research: Analyzing clinical data to discover new medical knowledge and drug
efficacy.
● Finance and Banking
■ Fraud Detection: Identifying unusual transactions or behavior that could indicate fraud.
■ Risk Management: Assessing loan applicants' risk levels and predicting credit scores.
■ Customer Segmentation: Classifying customers based on spending habits, transaction
frequency, and investment preferences.
● Telecommunications
■ Churn Prediction: Analyzing user behavior to predict when customers may leave the
service
■ Customer Service: Using data mining to offer more personalized and efficient support.
Some Applications (continued)
● Social Media and Web Analytics
■ Sentiment Analysis: Analyzing social media posts to gauge public opinion on products,
services, or events.
● Government and Public Services
■ Crime Prevention: Predicting criminal behavior and identifying hotspots based on
historical data.
■ Tax Fraud Detection: Detecting anomalies in tax records to identify potential fraud
cases.
● Marketing
■ Customer Segmentation: Grouping customers into segments based on purchasing
behavior and preferences.
■ Targeted Advertising: Analyzing data to create more effective marketing campaigns and
personalized ads.
References
1. Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
2. IBM Technologies on YouTube
3. University of Houston Libraries on YouTube