Introduction-to-Data-Mining
Chapter 1. Introduction
SASSI Abdessamed
Motivation
Why do we need data mining?
● Nowadays, the total worldwide volume of data is very large
■ Hundreds of zettabytes (1 ZB = 10^21 bytes ≈ 2^70 bytes)
● Data types and formats can be complex
■ Video, Image, Audio, etc.
● Most data formats are not human readable
■ Binary formats
● Humans cannot deal with such an amount and complexity of data
● We need concise insights and patterns to make decisions
Is data mining a misnomer?
● Taken literally, data mining means gathering or collecting data
● In practice, data mining means extracting knowledge from data
● This knowledge is like gold nuggets hidden in a large volume of data
● Hence the word mining in the name
● So,
● What is Data?
● What is Knowledge?
● And what does Data Mining really mean?
Data
What is data?
● Data are collected observations or measurements represented as Text,
Numbers, or Multimedia [3].
● Data can be quantitative (represent quantities or numerical values)
■ Sensory data (Temperature, Light, Pixel Intensities, Voltage, …)
■ Time Durations (Age, Travel Length, …)
■ Size & Length Measurements (Area, Volume, Distance, Length, …)
■ Health Measurements (Blood Pressure, Sugar Level, O2 Saturation, …)
● Data can also be qualitative (categorical)
■ Text (words, letters, digits, …)
■ Age Classes (e.g. Football Age categories)
■ Blood Types
● Data can also be a complex mixture of the two types
■ E.g. Maps (Graphs)
Data vs Knowledge
● A book does not know its own content
● Knowing means being aware of the information we possess
■ Understanding
■ Being able to act and make decisions
■ Produce new thoughts
■ Discover Patterns
● Unlike merely having information, knowing is an active process
● How can we make computers discover knowledge on their own?
Data Sources
● In our daily lives we produce tons of data (information)
■ Social Networks, Emails, Blogs, …
■ E-Commerce, Banking, Stores, …
■ Hospitals & Health reports
■ Administrative records
● Hence, data can be supplied by a variety of technologies:
■ Relational databases
■ Data warehouses
■ Transaction databases
■ Text databases
■ Social network data
■ World-Wide Web
■ Time-series data
Data Formats
● The data we want to analyse using data mining methods have various
formats
■ Transactions
■ N-dimensional Vectors (data points)
■ Graphs
■ Tables
■ etc.
● The format of the data determines the data mining algorithm we can use
● We may also change the format of the data in order to be able to use a
certain type of algorithm
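As an illustration of changing the data format, the sketch below converts transactions into N-dimensional binary vectors, one dimension per distinct item. The item names and the helper `transactions_to_vectors` are hypothetical and used only for this example.

```python
# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
]


def transactions_to_vectors(transactions):
    """Encode each transaction as a binary vector over the sorted item vocabulary."""
    # One dimension per distinct item, in alphabetical order.
    vocabulary = sorted(set().union(*transactions))
    vectors = [
        [1 if item in transaction else 0 for item in vocabulary]
        for transaction in transactions
    ]
    return vocabulary, vectors


vocabulary, vectors = transactions_to_vectors(transactions)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Once the transactions are vectors, algorithms that expect data points (for example distance-based methods) can in principle be applied to them.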
Data Preparation & Preprocessing
● Data integration. Combining data from multiple sources
■ Joining multiple tables.
■ Resolving data inconsistencies from different sources.
● Data selection. Selecting domain relevant data.
■ Selecting a specific subset of attributes (columns)
● Data cleaning.
■ Noise Reduction : Removing or correcting noisy data
■ Outlier Detection : Identifying and handling outliers
■ Handling Missing Values : Removing or filling in missing data
● Data Reduction.
■ Dimensionality Reduction: to reduce the number of attributes while retaining
important information.
■ Sampling: Selecting a subset of the data that represents the whole dataset to reduce
computation time.
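The data cleaning steps above can be sketched on a small list of readings. This is a minimal example under stated assumptions: the readings are made up, missing values are filled with the median (one possible strategy among several), and outliers are flagged with the common 1.5 × IQR rule, which the slides do not prescribe.

```python
import statistics

# Hypothetical temperature readings; None marks a missing value.
readings = [21.0, 22.5, None, 23.0, 21.5, 95.0, 22.0, None, 22.5]


def fill_missing(values):
    """Handling missing values: replace None with the median of observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Outlier detection: return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


filled = fill_missing(readings)
outliers = iqr_outliers(filled)
# Handle the outliers by removing them (correcting them is another option).
cleaned = [v for v in filled if v not in outliers]

print("filled:  ", filled)
print("outliers:", outliers)
print("cleaned: ", cleaned)
```

The median is used here instead of the mean because a single extreme reading would otherwise distort the fill value.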
Data Preparation & Preprocessing
● Data Transformation.
■ Normalization: Scaling numerical data to a common range
■ Data Discretization: Converting continuous attributes into discrete bins or categories
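Both transformations can be sketched in a few lines. The example below assumes min-max scaling for normalization and equal-width binning for discretization; these are common choices, not the only ones, and the function names and sample ages are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: linearly rescale values to the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map everything to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def equal_width_bins(values, n_bins):
    """Discretization: assign each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clip it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 35, 47, 60, 78]  # hypothetical sample
scaled = min_max_normalize(ages)
bins = equal_width_bins(ages, n_bins=3)

print(scaled)
print(bins)
```

With three bins over the range 18 to 78, each bin is 20 years wide, so the ages fall into bins 0, 0, 0, 1, 2, 2.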
Data Mining
What is data mining?
● Extracting or “mining” knowledge from large amounts of data [1].
● A set of software techniques for identifying / discovering useful
patterns and trends from large amounts of data through automated
analysis.
● Obtaining a simplified view of data to help with decision making.
● Extracting Knowledge from data.
What is knowledge in this context?
● For data mining, knowledge is in the form of Patterns and Insights:
■ (If .. Then) Rules
■ Associations
■ Anomalies
■ Recommendations
■ Groups & Classes (Clusters)
■ Predictions
■ Correlations
Intersection with other fields & technologies
● Statistics
■ A variety of data mining algorithms involve some methods from the field of statistics
■ The methods of statistics themselves can be used as low-level data mining methods
● Databases
■ Most of the data sources will be stored using database technology
● Data warehouses
■ Data mining is generally applied to data integrated in a data warehouse
● Machine Learning
■ We can use some of these techniques to learn patterns
● Data visualization
■ To become familiar with the data, detect outliers, and decide what preprocessing is needed
■ To display the extracted patterns and make decisions after data mining
Why Data Mining?
● Large quantities of data to be analysed
■ Algorithms must be highly scalable
● High dimensionality of the data to be analysed
■ Each record of data is a vector with a large number of dimensions (attributes)
● Some data types are complex by nature
■ Web pages
■ Multimedia
■ Sensor data
■ Graphs
■ Social networks
■ …
Data mining process
[Figure: data mining process. Labels: Data Collection, Databases, Data warehouse, Patterns]
Data mining as a step in KDD
KDD = Knowledge Discovery from Data
1. Data selection.
■ Identifying relevant datasets and selecting data that is important for our need / task
2. Data Preprocessing.
■ Cleaning the data by handling missing values, noise, and inconsistencies.
3. Data transformation.
■ Change the form of the data depending on the data mining algorithms to be used
4. Data mining.
■ A set of intelligent data analysis techniques
5. Pattern evaluation
■ Interpreting the discovered patterns and evaluating their interestingness.
6. Knowledge presentation.
■ Visualize the discovered knowledge (patterns)
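The six steps can be chained into a toy pipeline. The sketch below is illustrative only: the records and field names are made up, counting co-occurring item pairs stands in for "a set of intelligent data analysis techniques", and a minimum count is used as a simple, assumed measure of interestingness.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"customer": "a", "items": "bread, milk", "store": "north"},
    {"customer": "b", "items": "bread, milk, eggs", "store": "north"},
    {"customer": "c", "items": None, "store": "south"},
    {"customer": "d", "items": "milk, eggs", "store": "south"},
    {"customer": "e", "items": "Bread, Milk", "store": "north"},
]


def select(records):
    """1. Data selection: keep only the attribute relevant to the task."""
    return [record["items"] for record in records]


def preprocess(item_strings):
    """2. Data preprocessing: drop records with missing values."""
    return [s for s in item_strings if s is not None]


def transform(item_strings):
    """3. Data transformation: turn each string into a set of lowercase items."""
    return [{item.strip().lower() for item in s.split(",")} for s in item_strings]


def mine(transactions):
    """4. Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for transaction in transactions:
        counts.update(combinations(sorted(transaction), 2))
    return counts


def evaluate(pair_counts, min_count=2):
    """5. Pattern evaluation: keep pairs seen at least min_count times."""
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


def present(patterns):
    """6. Knowledge presentation: print patterns, most frequent first."""
    for pair, n in sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0])):
        print(f"{pair[0]} + {pair[1]}: {n} transactions")


patterns = evaluate(mine(transform(preprocess(select(raw_records)))))
present(patterns)
```

Each function maps to one KDD step, so a step can be swapped (for example, a different mining technique in `mine`) without touching the others.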
Architecture of a typical data mining system [1]
[Figure: architecture diagram. Labels: Database / Data Warehouse Server; Database; Data Warehouse; World Wide Web; Other types of Repositories (spreadsheets, NoSQL, …)]
Data Mining Tasks
Categories of Data Mining Tasks
● Data mining tasks can be one of two categories