Unit 4 Intro DM

This document provides an introduction to data mining. It defines data mining as finding hidden patterns from large data sets. It discusses how data is growing rapidly and users expect more sophisticated information. It then covers the basic concepts of data mining including common tasks like classification, clustering, association rule mining. It also discusses the relationship between data mining and knowledge discovery in databases (KDD) and highlights some issues in data mining like handling large data sets, high dimensionality, outliers and missing data.

Uploaded by

Juee Jamsandekar

Introduction to Data Mining
Introduction

 Data is growing at a phenomenal rate
 Users expect more sophisticated information
 How?

UNCOVERING HIDDEN INFORMATION - DATA MINING
Data Mining Definition

 Finding hidden information in a database.


 Similar terms
⚫ Exploratory data analysis
⚫ Data driven discovery
⚫ Deductive learning
Query Examples

Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
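
The contrast above can be sketched in a few lines of Python. The transactions, item names, and the 0.5 cutoff are made up for illustration. The database-style query returns exactly the rows that match a stated condition, while the mining-style query has to discover which items tend to accompany milk.

```python
from collections import Counter

# Hypothetical market-basket data: each set is one customer's purchase.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "jam"},
    {"milk", "bread", "cereal"},
]

# Database-style query: an exact filter with a well-defined answer.
milk_buyers = [t for t in transactions if "milk" in t]

# Mining-style query: count what co-occurs with milk, then keep the items
# that appear in at least half of the milk purchases (an assumed cutoff).
co_counts = Counter(item for t in milk_buyers for item in t if item != "milk")
frequent_with_milk = [
    item
    for item, count in co_counts.most_common()
    if count / len(milk_buyers) >= 0.5
]

print(len(milk_buyers), frequent_with_milk)
```

The ratio `count / len(milk_buyers)` is the confidence of the rule "milk implies item", which is the quantity association rule mining typically thresholds.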
Data Mining Algorithm
Characterized as consisting of 3 parts:
⚫ Model: Fit a model to the data
• Descriptive
• Predictive
⚫ Preference: Technique to choose the best model
⚫ Search: Technique to search the data
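
A minimal sketch of how the three parts fit together, assuming a toy credit data set and a one-parameter threshold rule as the model family. The data, the function names, and the choice of exhaustive search are all illustrative assumptions, not a prescribed algorithm.

```python
# Hypothetical training data: (income in $1000s, 1 = good credit risk, 0 = poor).
data = [(15, 0), (22, 0), (30, 0), (41, 1), (48, 1), (60, 1)]


def predict(threshold, income):
    """Model: a one-parameter rule, 'good risk' if income >= threshold."""
    return 1 if income >= threshold else 0


def errors(threshold):
    """Preference: fewer misclassified training records is better."""
    return sum(predict(threshold, income) != label for income, label in data)


# Search: try every observed income as a candidate threshold.
candidates = sorted(income for income, _ in data)
best_threshold = min(candidates, key=errors)

print(best_threshold, errors(best_threshold))
```

Here the model is predictive; a descriptive model would be chosen and searched for in the same way, only with a different preference criterion.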
Data Mining Models and Tasks
Predictive Data Mining
 Training set
⚫ Honest: Tridas, Vickie, Mike
⚫ Crooked: Wally, Waldo, Barney
 Prediction
⚫ Tridas, Vickie, Mike: Honest = has round eyes and a smile
Basic Data Mining Tasks
 Classification: Maps data into predefined groups or classes.
⚫ Example: Airport security screening used to determine whether passengers are terrorists or criminals.
 Regression: Maps a data item to a real-valued prediction variable.
⚫ Example: Forecasting returns on investments.
 Time Series Analysis: Examines the value of an attribute as it varies over time.
⚫ Example: Stock market.
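
As one concrete case, the regression task above can be sketched with an ordinary least-squares line. The investment figures are invented, and `fit_line` is a hypothetical helper written for this example.

```python
# Hypothetical data: years invested vs. observed return (%).
years = [1, 2, 3, 4, 5]
returns = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, returns)

# Map a new data item (year 6) to a real-valued prediction.
forecast_year_6 = slope * 6 + intercept
print(round(slope, 2), round(intercept, 2), round(forecast_year_6, 2))
```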
Basic Data Mining Tasks
(cont’d)
 Clustering: Groups similar data together into clusters.
⚫ Example: Catalogs designed by department stores for potential buyers based on attributes such as income, location, and age.
 Summarization: Extracts representative information about the database.
⚫ Example: Computing grades for different colleges based on students' performance or college infrastructure.
 Association Rules: Uncovers relationships among data.
⚫ Example: Market basket analysis.
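
The clustering task above can be illustrated with a one-dimensional k-means sketch. The income values, the starting centers, and the fixed iteration count are assumptions chosen to keep the example small. Unlike classification, no group labels are given in advance.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer incomes in $1000s.
incomes = [18, 20, 22, 55, 60, 65]
centers, clusters = k_means_1d(incomes, centers=[18, 65])
print(centers, clusters)
```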
Basic Data Mining Tasks
(cont’d)
 Sequence Discovery: Determines sequential patterns.
⚫ Example: Sequence of pages frequently visited in a Web site.
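
A rough sketch of the Web example, assuming each session is an ordered list of page names and counting only consecutive page pairs. The sessions and the `min_count` threshold are hypothetical. Full sequence mining would also consider longer and non-adjacent patterns.

```python
from collections import Counter

# Hypothetical click sessions, each in visit order.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "reviews"],
    ["home", "search", "products", "cart"],
]

# Count every pair of pages visited one directly after the other.
pair_counts = Counter(
    (first, second)
    for session in sessions
    for first, second in zip(session, session[1:])
)

# Keep the ordered pairs seen at least min_count times (assumed threshold).
min_count = 2
frequent_pairs = sorted(
    pair for pair, count in pair_counts.items() if count >= min_count
)
print(frequent_pairs)
```

Order matters here: the pair (home, products) is counted separately from (products, home), which is what distinguishes sequence discovery from plain association rules.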
Data Mining vs. KDD

 Knowledge Discovery in Databases (KDD): The process of finding useful information and patterns in data.
 Data Mining: The use of algorithms to extract the information and patterns derived by the KDD process.
KDD Process

 Selection: Obtain data from various sources.
 Preprocessing: Cleanse data.
 Transformation: Convert to a common format. Transform to a new format.
 Data Mining: Obtain desired results.
 Interpretation/Evaluation: Present results to the user in a meaningful manner.
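
The five steps can be read as a pipeline in which each stage feeds the next. The sketch below assumes two toy data sources and uses a simple spending threshold as a stand-in for the data mining step. All record fields and function names are hypothetical.

```python
# Hypothetical raw records from two sources; spend is stored as text.
store_a = [{"customer": "Ann", "spend": "120.50"}, {"customer": "Bob", "spend": None}]
store_b = [{"customer": "Cy", "spend": "980"}, {"customer": "Di", "spend": "1015.25"}]


def select(*sources):
    """Selection: gather records from the various sources."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Preprocessing: cleanse by dropping records with no spend value."""
    return [r for r in records if r["spend"] is not None]


def transform(records):
    """Transformation: convert spend to a common numeric format."""
    return [{"customer": r["customer"], "spend": float(r["spend"])} for r in records]


def mine(records, threshold=500.0):
    """Data mining (stand-in): pick out customers with high spend."""
    return [r["customer"] for r in records if r["spend"] >= threshold]


def interpret(customers):
    """Interpretation/Evaluation: present the result in readable form."""
    return "High spenders: " + ", ".join(customers)


report = interpret(mine(transform(preprocess(select(store_a, store_b)))))
print(report)
```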
Data Mining Development
•Relational Data Model •SQL •Association Rule Algorithms •Data Warehousing •Scalability Techniques
•Similarity Measures •Hierarchical Clustering •IR Systems •Imprecise Queries •Textual Data •Web Search Engines
•Bayes Theorem •Regression Analysis •EM Algorithm •K-Means Clustering •Time Series Analysis
•Algorithm Design Techniques •Algorithm Analysis •Data Structures
•Neural Networks •Decision Tree Algorithms
Data Mining Issues
 Human Interaction
 Overfitting
 Outliers
 Interpretation
 Visualization
 Large Datasets: computational and time complexity
 High Dimensionality
[Figures: Overfitting; Outliers; High Dimensionality]
Data Mining Issues (cont’d)
 Multimedia Data: audio and video information retrieval
 Missing Data
 Irrelevant Data
 Noisy Data
 Changing Data
 Integration
 Application
Social Implications of DM

 Privacy

 Profiling

 Unauthorized use
Database Perspective on Data Mining
Implementation issues of concern:
◦ Scalability (massive datasets)
◦ Real-world data (noisier)
◦ Updates (data is dynamic at times)
◦ Ease of use (quite complex to understand)
Data Pre-processing
 Data cleaning
⚫ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
⚫ Integration of multiple databases, data cubes, or files
 Data transformation
⚫ Normalization and aggregation
 Data reduction
⚫ Obtains reduced representation in volume but produces the same or
similar analytical results
 Data discretization
⚫ Part of data reduction but with particular importance, especially for
numerical data
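
Two of the steps above, normalization and discretization, are easy to show on a single numeric attribute. This sketch assumes min-max normalization and equal-width binning, which are only one common choice for each step. The ages are invented and both helpers assume the attribute has at least two distinct values.

```python
def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def equal_width_bins(values, n_bins):
    """Data discretization: replace each value by the index of its
    equal-width interval; the maximum is clamped into the last bin."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# Hypothetical customer ages.
ages = [20, 25, 30, 40, 60]
normalized = min_max_normalize(ages)
bins = equal_width_bins(ages, 4)
print(normalized, bins)
```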
Why Data Preprocessing?
 Data in the real world is dirty
⚫ incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
⚫ noisy: containing errors or outliers
⚫ inconsistent: containing discrepancies in codes or names
 No quality data, no quality mining results!
⚫ Quality decisions must be based on quality data
⚫ Data warehouse needs consistent integration of quality
data
Forms of data preprocessing
Data Cleaning

 Data cleaning tasks


⚫ Fill in missing values
⚫ Identify outliers and smooth out noisy data
⚫ Correct inconsistent data
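
One simple way to smooth noisy data is binning: sort the values, split them into equal-size bins, and replace each value by its bin mean. The sketch below assumes that technique, a bin size of 3, and a small illustrative price list; other smoothing methods exist.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, partition them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Illustrative (made-up) prices.
prices = [21, 4, 34, 8, 25, 15, 24, 28, 21]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

Note that the result is in sorted order, not the original record order; a real cleaning step would map the smoothed values back to their records.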
Missing Data

 Data is not always available
⚫ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
⚫ equipment malfunction
⚫ data that was inconsistent with other recorded data and thus deleted
⚫ data not entered due to misunderstanding
⚫ certain data not being considered important at the time of entry
⚫ history or changes of the data not being registered
 Missing data may need to be inferred.
How to Handle Missing Data?

 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value:
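
The two mean-based strategies in the list above can be compared on a small example. The records, the field names `cls` and `income`, and both helper functions are hypothetical. The sketch assumes every class has at least one known value.

```python
# Hypothetical customer records; None marks a missing income (in $1000s).
records = [
    {"cls": "gold", "income": 90.0},
    {"cls": "gold", "income": None},
    {"cls": "gold", "income": 110.0},
    {"cls": "basic", "income": 30.0},
    {"cls": "basic", "income": None},
    {"cls": "basic", "income": 50.0},
]


def mean(values):
    return sum(values) / len(values)


def fill_with_attribute_mean(rows, key):
    """Replace each missing value with the mean over all known values."""
    fill = mean([r[key] for r in rows if r[key] is not None])
    return [dict(r, **{key: fill if r[key] is None else r[key]}) for r in rows]


def fill_with_class_mean(rows, key, class_key):
    """Replace each missing value with the mean of its own class."""
    known = {}
    for r in rows:
        if r[key] is not None:
            known.setdefault(r[class_key], []).append(r[key])
    class_means = {c: mean(v) for c, v in known.items()}
    return [
        dict(r, **{key: class_means[r[class_key]] if r[key] is None else r[key]})
        for r in rows
    ]


by_attribute = fill_with_attribute_mean(records, "income")
by_class = fill_with_class_mean(records, "income", "cls")
print(by_attribute[1]["income"], by_class[1]["income"], by_class[4]["income"])
```

The overall mean gives both missing customers the same income, while the class mean keeps the gold and basic customers apart, which is why the list calls it the smarter option.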
Thank You
