
CS423 Data Warehousing and Data Mining: Dr. Hammad Afzal

The document discusses an introduction to data warehousing and data mining. It covers topics like the need for data mining due to large amounts of data, definitions of data mining, related disciplines, applications, and steps in the knowledge discovery process. It also discusses functions of data mining like generalization and association analysis. The document is the syllabus for a course on data warehousing and data mining.

Uploaded by

Zafar Iqbal

CS423

DATA WAREHOUSING AND DATA MINING

Chapter 1
Introduction

Dr. Hammad Afzal


Department of Computer Software Engineering


National University of Sciences and Technology (NUST)
RESOURCES
 Lecture Slides will be available on LMS

 Additional references shall be provided (if any)

 OHT 1: 15%
 OHT 2: 15%

 Quizzes: 10%
 Total: 4
 All announced

 Assignment: 10%
 Semester Project
 Syndicate Members: 1-3
 Will be announced after 1st OHT
RESOURCES

Text Book:
 Data Mining: Concepts and Techniques
 By Jiawei Han.
 3rd Edition

Reference:
 Will be provided.
RESOURCES
 Grading Scheme:

 PEC - Washington Accord

 Outcome based Learning (OBE)

 Course Learning Objectives (CLOs)

WHY DATA MINING?

 The Explosive Growth of Data: from terabytes to petabytes

 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society

 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube

 We are drowning in data, but starving for knowledge!
WHAT IS DATA MINING?

 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES

[Figure: Data Mining at the center, surrounded by the disciplines it draws on: Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
ALTERNATIVE NAMES

Information Harvesting
Knowledge Mining
Data Mining
Knowledge Discovery in Databases
Data Dredging
Data Archaeology
Data Pattern Processing
Database Mining
Knowledge Extraction
Siftware

The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques.
APPLICATIONS OF DATA MINING
 Web page analysis
 From web page classification, clustering to PageRank

 Recommender systems

 Basket data analysis to targeted marketing

 Biological and medical data analysis

MARKET ANALYSIS AND MANAGEMENT
 Where does the data come from?
 Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

 Target marketing
 Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time

 Cross-market analysis
 Associations/correlations between product sales, and prediction based on such associations
REAL EXAMPLE FROM THE NBA
 Play-by-play information recorded by teams
 Who is on the court
 Who shoots
 Results

 Coaches want to know what works best
 Plays that work well against a given team
 Good/bad player matchups

 Advanced Scout (from IBM Research) is a data mining tool to answer these questions
NEW TRENDS IN MARKETING: TARGET CUSTOMERS, INTEGRATE DIFFERENT RESOURCES

25/05/2022
Dr: HammaD AfzaL - Data Mining
FRAUD DETECTION & MINING UNUSUAL PATTERNS
 Clustering & model construction for frauds, Outlier analysis

 Applications: Health care, retail, credit card service, telecommunications

 Money laundering: suspicious monetary transactions

 Medical Insurance
 Professional patients, Ring of doctors, and Ring of references
FRAUD DETECTION & MINING UNUSUAL PATTERNS
 Clustering & model construction for frauds, Outlier analysis

 Banking Industry
 Fraudulent transactions

 Retail industry
 Analysts estimate that 38% of retail shrink is due to dishonest employees

 Anti-terrorism
KNOWLEDGE DISCOVERY (KDD) PROCESS

 This is a view from typical database systems and data warehousing communities

[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
STEPS OF A KDD PROCESS

 Data cleaning and preprocessing: (may take 60% of effort!)
 To remove noise and inconsistent data.

 Data Integration:
 Multiple data sources may be combined.

 Data Selection:
 Where data relevant to the analysis task are retrieved.

 Data reduction and transformation
 Find useful features, dimensionality/variable reduction, invariant representation.
STEPS OF A KDD PROCESS
 Data mining: search for patterns of interest
 Choosing functions of data mining
 Summarization, classification, regression, association, clustering.

 Pattern evaluation and knowledge presentation
 Visualization, Removing redundant patterns, etc.
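The KDD steps listed on these two slides can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the two "stores", their records, the cleaning rule, and the min_count threshold are all invented here and are not part of the course material.

```python
from collections import Counter

# Two hypothetical data sources (invented for illustration).
store_a = [
    {"customer": "c1", "item": "milk", "amount": 3.0},
    {"customer": "c2", "item": None, "amount": 2.5},    # noisy: missing item
    {"customer": "c3", "item": "bread", "amount": 1.5},
]
store_b = [
    {"customer": "c4", "item": "milk", "amount": 3.2},
    {"customer": "c5", "item": "milk", "amount": -1.0},  # inconsistent: negative amount
    {"customer": "c6", "item": "eggs", "amount": 4.0},
]

def clean(records):
    """Data cleaning: drop records with a missing item or a non-positive amount."""
    return [r for r in records if r["item"] is not None and r["amount"] > 0]

# Data integration: combine the cleaned sources.
integrated = clean(store_a) + clean(store_b)

# Data selection: keep only the field relevant to the analysis task.
selected = [r["item"] for r in integrated]

# Data mining: count how often each item occurs.
counts = Counter(selected)

# Pattern evaluation: keep only items seen at least min_count times.
min_count = 2
patterns = {item: n for item, n in counts.items() if n >= min_count}
print(patterns)  # {'milk': 2}
```

Real pipelines spend far more effort on the cleaning step than this sketch suggests, which is the point of the "60% of effort" remark above.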
DATA MINING IN BUSINESS INTELLIGENCE

[Figure: pyramid of layers, listed here from top to bottom, with increasing potential to support business decisions toward the top. Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA]
 Decision Making
 Data Presentation: Visualization Techniques
 Data Mining: Information Discovery
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
DATA MINING: ON WHAT KINDS OF DATA?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database

 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structure data, graphs, social networks and multi-linked data
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
For Details: See Book
DATA MINING FUNCTION: (1) GENERALIZATION
 Class/Concept Description: Characterization and Discrimination

 Data mining can be used to describe individual classes and concepts in summarized, precise form.

 Two techniques are used for this purpose: characterization and discrimination
DATA MINING FUNCTION: (1) GENERALIZATION
 Data Characterization:
 Summarization of general characteristics or features of the target class of data.
 Data collected by query.
 Output can be pie charts, bar charts, data cubes.

 Data Discrimination:
 Comparison of general features of the target class with other classes.
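A minimal sketch of characterization and discrimination, assuming a tiny set of customer records. The records, the segment names, and the choice of mean age and mean income as the "general features" are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical customer records (invented for illustration).
customers = [
    {"segment": "big_spender", "age": 40, "income": 90000},
    {"segment": "big_spender", "age": 50, "income": 110000},
    {"segment": "budget", "age": 20, "income": 30000},
    {"segment": "budget", "age": 30, "income": 50000},
]

def characterize(rows):
    """Characterization: summarize the general features of one class."""
    return {
        "count": len(rows),
        "mean_age": mean([r["age"] for r in rows]),
        "mean_income": mean([r["income"] for r in rows]),
    }

# "Data collected by query": select the target class and the contrasting class.
target = [r for r in customers if r["segment"] == "big_spender"]
contrast = [r for r in customers if r["segment"] != "big_spender"]

target_summary = characterize(target)
contrast_summary = characterize(contrast)

# Discrimination: compare the target class against the other class.
differences = {
    key: target_summary[key] - contrast_summary[key]
    for key in ("mean_age", "mean_income")
}
print(target_summary)
print(differences)
```

In practice the summaries would feed the pie charts, bar charts, or data cubes mentioned above rather than being printed.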
DATA MINING FUNCTION: (2) ASSOCIATION ANALYSIS
 Frequent patterns (or frequent itemsets).
 Patterns that appear frequently in data.
 Many kinds of patterns, e.g., frequent itemsets, frequent sequences

 Association
 A typical association rule
 buys(X, computer) -> buys(X, software)
 computer -> software [0.5%, 75%] (support, confidence)

 Strength of a rule is measured through support and confidence
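Support and confidence for a rule such as computer -> software can be computed directly from transaction data. The sketch below assumes the usual definitions: support is the fraction of all transactions containing both sides of the rule, and confidence is the fraction of transactions containing the left side that also contain the right side. The eight transactions are invented for illustration, so the support value differs from the 0.5% on the slide.

```python
# Toy market-basket data (invented for illustration).
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
    {"printer", "paper"},
    {"paper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer -> software [{rule_support:.1%}, {rule_confidence:.1%}]")
# computer -> software [37.5%, 75.0%]
```

Three of the eight transactions contain both items (support 37.5%), and three of the four computer purchases also include software (confidence 75%).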
DATA MINING FUNCTION: (3) CLASSIFICATION
 Classification: a process that describes and distinguishes data classes or concepts.
 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future prediction
 Predict some unknown class labels

 Typical methods
 Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
DATA MINING FUNCTION: (3) CLASSIFICATION

 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
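To make "construct a model from training examples, then predict unknown labels" concrete, here is a sketch of a decision stump, i.e. a decision tree with a single split. It is the simplest relative of the decision trees listed above and not a method prescribed by the course. The transaction amounts, the labels, and the assumption that fraud shows up as unusually large amounts are all hypothetical.

```python
def train_stump(examples):
    """Pick the threshold with the fewest training errors.

    examples: list of (amount, label) pairs with label 'fraud' or 'ok'.
    Assumption: amounts above the threshold are predicted as 'fraud'.
    """
    values = sorted({x for x, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != (label == "fraud") for x, label in examples)

    return min(candidates, key=errors)

def predict(threshold, x):
    """Predict the class label of an unseen amount."""
    return "fraud" if x > threshold else "ok"

# Hypothetical labelled training examples.
training = [(20, "ok"), (35, "ok"), (50, "ok"), (900, "fraud"), (1200, "fraud")]
threshold = train_stump(training)

print(threshold)                # 475.0
print(predict(threshold, 700))  # fraud
print(predict(threshold, 40))   # ok
```

A full decision tree repeats this kind of split recursively over several attributes.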
DATA MINING FUNCTION: (3) REGRESSION

 Similar to classification,
 but is applied on ordered data (often numeric data).
 Usually in the form:
 Y = mX + c
 where Y and X are variables, m is the slope, and c is the intercept.

 Example: Geological surveys
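A straight line of the form Y = mX + c can be fitted to data with ordinary least squares. The sketch below uses the standard closed-form estimates for the slope and intercept; the sample points are invented and lie exactly on Y = 2X + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for Y = mX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c

# Invented sample points lying on Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, c = fit_line(xs, ys)
print(m, c)        # 2.0 1.0
print(m * 5 + c)   # prediction for X = 5: 11.0
```

Unlike classification, the prediction here is a number rather than a class label.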
DATA MINING FUNCTION: (4) CLUSTER ANALYSIS
 Unsupervised learning (i.e., Class label is unknown)

 Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns

 Principle: Maximizing intra-class similarity & minimizing interclass similarity
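One common way to group unlabelled data is k-means, sketched here in one dimension. The slides do not name a specific clustering algorithm, so k-means is a swapped-in example. The house prices, the two starting centers, and the fixed iteration count are assumptions made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers by repeatedly assigning each to its nearest center."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assignment step: each value joins the cluster of its nearest center.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical house prices (in thousands); no class labels are given.
prices = [100, 110, 120, 400, 420, 440]
centers, clusters = kmeans_1d(prices, centers=[100, 440])

print(centers)   # [110.0, 420.0]
print(clusters)  # [[100, 110, 120], [400, 420, 440]]
```

Values inside each cluster are close to one another and far from the other cluster, which is the principle stated above.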
DATA MINING FUNCTION: (5) OUTLIER ANALYSIS
 Outlier analysis
 Outlier: A data object that does not comply with the general behavior of the data

 Noise or exception? One person’s garbage could be another person’s treasure

 Methods: by-product of clustering.

 Useful in fraud detection, rare events analysis
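A small sketch of outlier detection for fraud-style data. Note that it uses a simple median-based rule rather than the clustering by-product mentioned above: a value is flagged when its distance from the median exceeds a multiple of the median absolute deviation. The transaction amounts and the threshold of 3 are assumptions chosen for illustration.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Flag values that lie far from the median, measured in MAD units."""
    med = statistics.median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: more than half of the values equal the median.
        return [v for v in values if v != med]
    return [v for v in values if abs(v - med) / mad > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 950]
outliers = find_outliers(amounts)
print(outliers)  # [950]
```

Whether 950 is noise or a genuine exception (such as fraud) is a judgment the rule cannot make on its own.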
COURTESY

 Slides are prepared using material from the website of Jiawei Han, Micheline Kamber, and Jian Pei (University of Illinois at Urbana-Champaign & Simon Fraser University), Data Mining: Concepts and Techniques

 Course Slides: Infolabs Stanford University

 Course Slides: Purdue University
