CAS CS 565, Data Mining

The document provides an overview of the CAS CS 565 Data Mining course, including logistics such as schedule, instructor details, topics to be covered including frequent pattern mining, clustering, classification, and recommendation systems. The course workload consists of programming assignments, problem sets, a midterm exam, and final exam, and aims to teach students how to learn, enjoy, and apply data mining concepts and techniques to extract useful knowledge from large datasets.

Uploaded by

Kutale Tukura

CAS CS 565, Data Mining

Course logistics
• Course webpage:
– http://www.cs.bu.edu/~evimaria/cs565-11.html
• Schedule: Mon – Wed, 2:30-4:00
• Instructor: Evimaria Terzi, [email protected]
• Office hours: Tues 11:00am-12:30pm, Wed
4:00pm-5:30pm (or by appointment)
• Mailing list: [email protected]
Topics to be covered (tentative)
• Introduction to data mining and prototype problems
• Frequent pattern mining
– Frequent itemsets and association rules
• Clustering
• Dimensionality reduction
• Classification
• Link analysis ranking
• Recommendation systems
• Time-series data
• Privacy-preserving data mining
Course workload
• Three programming assignments (30%)
• Three problem sets (20%)
• Midterm exam (20%)
• Final exam (30%)
• Late assignment policy: 10% per day up to
three days; no credit will be given after that
• Incompletes will not be given
Textbooks
• D. Hand, H. Mannila and P. Smyth: Principles of
Data Mining. MIT Press, 2001

• Jiawei Han and Micheline Kamber: Data Mining:
Concepts and Techniques. Second Edition.
Morgan Kaufmann Publishers, March 2006

• Toby Segaran: Programming Collective Intelligence:

Building Smart Web 2.0 Applications. O’Reilly

• Research papers (pointers will be provided)

Prerequisites
• Basic algorithms: sorting, set manipulation, hashing

• Analysis of algorithms: O-notation and its variants, perhaps

some recursion equations, NP-hardness

• Programming: some programming language, ability to do

small experiments reasonably quickly

• Probability: concepts of probability and conditional probability,

expectations, binomial and other simple distributions

• Some linear algebra: e.g., eigenvector and eigenvalue computations

Above all
• The goal of the course is to learn and enjoy

• The basic principle is to ask questions when

you don’t understand

• Say when things are unclear; not everything

can be clear from the beginning

• Participate in the class as much as possible

Introduction to data mining
• Why do we need data analysis?

• What is data mining?

• Examples where data mining has been useful

• Data mining and other areas of computer

science and statistics

• Some (basic) data-mining tasks

Why do we need data analysis?

• Really, really lots of raw data!

– Moore’s law: more efficient processors, larger memories

– Communications have improved too

– Measurement technologies have improved dramatically

– It is possible to collect and store lots of raw data

– The data-analysis methods are lagging behind

• Need to analyze the raw data to

extract knowledge
The data is also very complex
• Multiple types of data: tables, time series,
images, graphs, etc

• Spatial and temporal aspects

• Large number of different variables


• Lots of observations → large datasets
Example: transaction data
• Billions of real-life customers: e.g.,
Walmart, Safeway customers, etc.

• Billions of online customers: e.g.,
Amazon, Expedia, etc.
Example: document data
• Web as a document repository: billions of
web pages

• Wikipedia: 4 million articles (and counting)

• Online collections of scientific articles

Example: network data
• Web: 50 billion pages linked via hyperlinks

• Facebook: 400 million users

• MySpace: 300 million users

• Instant messenger: ~1 billion users

• Blogs: 250 million blogs worldwide,

presidential candidates run blogs
Example: genomic sequences
• http://www.1000genomes.org/page.php

• Full sequence of 1000 individuals


• 3×10^9 nucleotides per person → 3×10^12
nucleotides

• Lots more data in fact: medical history of

the persons, gene expression data
Example: environmental data
• Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php

• “a database of temperature, precipitation and

pressure records managed by the National
Climatic Data Center, Arizona State University and
the Carbon Dioxide Information Analysis Center”

• “6000 temperature stations, 7500

precipitation stations, 2000 pressure stations”
We have large datasets…so what?
• Goal: obtain useful knowledge from large masses of data

• “Data mining is the analysis of (often large)

observational data sets to find unsuspected relationships
and to summarize the data in novel ways that are both
understandable and useful to the data analyst”

• Tell me something interesting about the data; describe

the data

• Exploratory analysis on large datasets

What can data-mining methods do?
• Extract frequent patterns
– There are lots of documents that contain the phrases
“association rules”, “data mining” and “efficient
algorithm”

• Extract association rules

– 80% of the Walmart customers that buy beer
and sausage also buy mustard

• Extract rules
– If occupation=PhD student then income < 20K
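
A statement such as "80% of the customers that buy beer and sausage also buy mustard" is usually quantified by the support of an itemset (the fraction of transactions containing it) and the confidence of a rule (among transactions containing the left-hand side, the fraction that also contain the right-hand side). The sketch below is only an illustration: the transactions are invented toy data, and the names `support` and `confidence` are hypothetical helpers, not part of the course material.

```python
from typing import Iterable, List, Set

# Hypothetical toy transactions, invented for illustration only.
TRANSACTIONS: List[Set[str]] = [
    {"beer", "sausage", "mustard"},
    {"beer", "sausage", "mustard"},
    {"beer", "sausage", "mustard", "bread"},
    {"beer", "sausage", "mustard", "chips"},
    {"beer", "sausage"},
    {"beer", "chips"},
    {"bread", "mustard"},
    {"milk"},
]


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    lhs = set(antecedent)
    both = lhs | set(consequent)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    if n_lhs == 0:
        raise ValueError("antecedent never occurs in the transactions")
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_lhs


if __name__ == "__main__":
    print("support(beer, sausage):", support({"beer", "sausage"}, TRANSACTIONS))
    print("confidence(beer, sausage -> mustard):",
          confidence({"beer", "sausage"}, {"mustard"}, TRANSACTIONS))
```

In this toy data, five of the eight transactions contain both beer and sausage, and four of those five also contain mustard, which gives the rule a confidence of 80%.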
What can data-mining methods do?
• Rank web-query results
– What are the most relevant web-pages to the query: “Student
housing BU”?

• Find good recommendations for users

– Recommend Amazon customers new books
– Recommend Facebook users new friends/groups

• Find groups of entities that are similar (clustering)

– Find groups of Facebook users that have similar friends/interests
– Find groups of Amazon users that buy similar products
– Find groups of Walmart customers that buy similar products
Goal of this course
• Describe some problems that can be solved using
data-mining methods

• Discuss the intuition behind data-mining methods

that solve these problems

• Illustrate the theoretical underpinnings of

these methods

• Show how these methods can be useful in practice

Data mining and related areas
• How does data mining relate to
machine learning?

• How does data mining relate to statistics?

• Other related areas?

Data mining vs machine learning
• Machine learning methods are used for data mining
– Classification, clustering

• Amount of data makes the difference

– Data mining deals with much larger datasets and
scalability becomes an issue

• Data mining has more modest goals

– Automating tedious discovery tasks, not aiming at
human performance in real discovery
– Helping users, not replacing them
Data mining vs. statistics
• “Tell me something interesting about this data” – what
else is this but statistics?

– The goal is similar

– Different types of methods

– In data mining one investigates a lot of possible hypotheses

– Data mining is more exploratory data analysis


– In data mining there are much larger datasets →
algorithmics/scalability is an issue
Data mining and databases
• Ordinary database usage: deductive

• Knowledge discovery: inductive

– Inductive reasoning is exploratory

• New requirements for database

management systems

• Novel data structures, algorithms

and architectures are needed
Data mining and algorithms
• Lots of nice connections

• A wealth of interesting research questions

• We will focus on some of these questions

later in the course
Some simple data-analysis tasks
• Given a stream or set of numbers (identifiers, etc)

• How many numbers are there?

• How many distinct numbers are there?

• What are the most frequent numbers?

• How many numbers appear at least K times?

• How many numbers appear only once?

• etc
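
When the data fits in memory, all of the questions above can be answered from a single table of per-value counts. The following is a minimal sketch under that assumption; the function name `summarize` and its parameters `k` and `top` are hypothetical. Note that it uses memory proportional to the number of distinct values, which is exactly what becomes a problem for very large streams.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def summarize(numbers: Iterable[Hashable], k: int, top: int = 3) -> Dict[str, object]:
    """Answer the simple counting questions with one pass over the data."""
    counts = Counter(numbers)  # value -> number of occurrences
    return {
        "total": sum(counts.values()),            # how many numbers are there?
        "distinct": len(counts),                  # how many distinct numbers?
        "most_frequent": counts.most_common(top), # the most frequent numbers
        "at_least_k": sum(1 for c in counts.values() if c >= k),
        "only_once": sum(1 for c in counts.values() if c == 1),
    }


if __name__ == "__main__":
    print(summarize([3, 1, 3, 2, 3, 1, 7], k=2, top=2))
```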
Finding the majority element
• A neat problem

• A stream of identifiers; one of them occurs

more than 50% of the time

• How can you find it using no more than a

few memory locations?

• Suggestions?
Finding the majority element
(solution)
• A = first item you see; count = 1
• for each subsequent item B
    if (A == B) count = count + 1
    else {
      count = count - 1
      if (count == 0) {A = B; count = 1}
    }
  endfor
  return A
• Why does this work correctly?
Finding the majority element (solution
and correctness proof)
• A = first item you see; count = 1
• for each subsequent item B
    if (A == B) count = count + 1
    else {
      count = count - 1
      if (count == 0) {A = B; count = 1}
    }
  endfor
  return A
• Basic observation: whenever we discard an element u, we also
discard a unique element v different from u
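
The pseudocode can be turned into a runnable sketch, for example in Python. The function name `majority_candidate` is hypothetical, and returning `None` for an empty stream is an added assumption. As in the slide, the result is only guaranteed to be the majority element if one actually occurs more than 50% of the time; otherwise the returned value is arbitrary and would need a second pass to verify.

```python
from typing import Hashable, Iterable, Optional


def majority_candidate(stream: Iterable[Hashable]) -> Optional[Hashable]:
    """One pass over the stream, keeping only a candidate and a counter."""
    it = iter(stream)
    try:
        candidate = next(it)  # A = first item you see
    except StopIteration:
        return None           # assumption: empty stream has no candidate
    count = 1
    for item in it:           # for each subsequent item B
        if item == candidate:
            count += 1
        else:
            count -= 1
            if count == 0:
                # The old candidate is used up; B takes over with count 1.
                candidate, count = item, 1
    return candidate


if __name__ == "__main__":
    print(majority_candidate(["a", "b", "a", "c", "a"]))
```

Only two values are stored regardless of the stream length, which matches the requirement of using no more than a few memory locations.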
Finding a number in the top half
• Given a set of N numbers (N is very large)

• Find a number x such that x is likely to

be larger than the median of the numbers

• Simple solution
– Sort the numbers and store them in sorted array A
– Any value larger than A[N/2] is a solution

• Other solutions?
Finding a number in the top half
efficiently
• A solution that uses a small number of operations
– Randomly sample K numbers from the file
– Output their maximum

[Figure: the sorted numbers split at the median into N/2 items below and N/2 items above]

• Failure probability: (1/2)^K
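
The bound follows because the output fails only if the maximum of the sample lies in the lower half, which requires every one of the K samples to fall there; each independent uniform sample does so with probability about 1/2. A minimal sketch of the sampling idea follows, assuming the numbers are available as an in-memory sequence and that sampling is uniform with replacement. The name `top_half_candidate` and the optional `rng` parameter are hypothetical.

```python
import random
from typing import Optional, Sequence


def top_half_candidate(numbers: Sequence[float], k: int,
                       rng: Optional[random.Random] = None) -> float:
    """Return the maximum of k uniform samples (with replacement).

    The result is above the median unless all k samples land in the
    lower half, which happens with probability about (1/2)**k.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not numbers:
        raise ValueError("numbers must be non-empty")
    rng = rng or random.Random()
    return max(rng.choice(numbers) for _ in range(k))


if __name__ == "__main__":
    data = list(range(1_000_000))
    # With k = 20 the failure probability is roughly one in a million.
    print(top_half_candidate(data, k=20))
```

This uses K random reads and K comparisons, independent of N, whereas the sorting approach costs O(N log N).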
