CAS CS 565, Data Mining
Course logistics
• Course webpage:
– http://www.cs.bu.edu/~evimaria/cs565-11.html
• Schedule: Mon – Wed, 2:30-4:00
• Instructor: Evimaria Terzi, [email protected]
• Office hours: Tues 11:00am-12:30pm, Wed
4:00pm-5:30pm (or by appointment)
• Mailing list : [email protected]
Topics to be covered (tentative)
• Introduction to data mining and prototype problems
• Frequent pattern mining
– Frequent itemsets and association rules
• Clustering
• Dimensionality reduction
• Classification
• Link analysis ranking
• Recommendation systems
• Time-series data
• Privacy-preserving data mining
Course workload
• Three programming assignments (30%)
• Three problem sets (20%)
• Midterm exam (20%)
• Final exam (30%)
• Late assignment policy: 10% penalty per day for up to
three days; no credit will be given after that
• Incompletes will not be given
Textbooks
• D. Hand, H. Mannila and P. Smyth: Principles of
Data Mining. MIT Press, 2001
• Lots of observations → large datasets
Example: transaction data
• Billions of real-life customers, e.g.,
Walmart or Safeway customers
• 3×10^9 nucleotides per person → 3×10^12
nucleotides
• Extract rules
– If occupation=PhD student then income < 20K
What can data-mining methods do?
• Rank web-query results
– What are the most relevant web-pages to the query: “Student
housing BU”?
• etc
Finding the majority element
• A neat problem
• Suggestions?
Finding the majority element
(solution)
• A = first item you see; count = 1
• for each subsequent item B
    if (A==B) count = count + 1
    else {
        count = count - 1
        if (count == 0) {A=B; count = 1}
    }
  endfor
  return A
• Why does this work correctly?
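A minimal Python sketch of the pseudocode on this slide, assuming a majority element means one that occurs in more than half of the positions. The names majority_candidate and is_majority are illustrative and do not come from the course material. The single pass only produces a candidate; if the input is not guaranteed to contain a majority element, a second counting pass is needed to confirm it.

```python
from typing import Hashable, Iterable, Optional, Sequence


def majority_candidate(items: Iterable[Hashable]) -> Optional[Hashable]:
    """One pass, constant memory: mirrors the slide's pseudocode."""
    it = iter(items)
    try:
        a = next(it)          # A = first item you see
    except StopIteration:
        return None           # empty input: no candidate (assumption)
    count = 1
    for b in it:              # for each subsequent item B
        if a == b:
            count += 1
        else:
            count -= 1
            if count == 0:    # current candidate is used up; switch to B
                a, count = b, 1
    return a


def is_majority(items: Sequence[Hashable], candidate: Hashable) -> bool:
    """Second pass: check that the candidate really occurs > N/2 times."""
    return sum(1 for x in items if x == candidate) > len(items) / 2


if __name__ == "__main__":
    data = [1, 2, 1, 3, 1]
    cand = majority_candidate(data)
    print(cand, is_majority(data, cand))
```

When a majority element is known to exist, the candidate returned by the first pass is that element, so the verification pass can be skipped.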
Finding the majority element (solution
and correctness proof)
• A = first item you see; count = 1
• for each subsequent item B
    if (A==B) count = count + 1
    else {
        count = count - 1
        if (count == 0) {A=B; count = 1}
    }
  endfor
  return A
• Basic observation: whenever we discard an element u, we also
discard a unique element v different from u
Finding a number in the top half
• Given a set of N numbers (N is very large)
• Simple solution
– Sort the numbers and store them in sorted array A
– Any value larger than A[N/2] is a solution
• Other solutions?
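The simple solution can be sketched in a few lines of Python. The function name top_half_values is hypothetical, and the sketch assumes the numbers fit in memory; it costs O(N log N) time for the sort.

```python
from typing import List, Sequence


def top_half_values(numbers: Sequence[float]) -> List[float]:
    """Sort, then return every value strictly larger than A[N/2]."""
    a = sorted(numbers)                 # sorted array A
    pivot = a[len(a) // 2]              # A[N/2], using integer division
    return [x for x in a if x > pivot]  # any of these is a valid answer


if __name__ == "__main__":
    print(top_half_values([5, 1, 4, 2, 3]))
```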
Finding a number in the top half
efficiently
• A solution that uses a small number of operations
– Randomly sample K numbers from the file
– Output their maximum
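A rough Python sketch of the sampling idea, assuming the numbers are available as an in-memory sequence and that samples are drawn uniformly with replacement. The name top_half_by_sampling and the rng parameter are illustrative only. Under these assumptions the output falls outside the top half only if all K samples land in the bottom half, which happens with probability about (1/2)^K, so a modest K already makes failure very unlikely.

```python
import random
from typing import Optional, Sequence


def top_half_by_sampling(
    numbers: Sequence[float],
    k: int,
    rng: Optional[random.Random] = None,
) -> float:
    """Draw k uniform samples (with replacement) and return their maximum."""
    if not numbers or k < 1:
        raise ValueError("need a non-empty input and k >= 1")
    rng = rng or random.Random()
    return max(rng.choice(numbers) for _ in range(k))


if __name__ == "__main__":
    data = list(range(1000))
    # With k = 64 the failure probability is roughly 2**-64.
    print(top_half_by_sampling(data, k=64))
```

The cost is K random draws and K - 1 comparisons, independent of N, in contrast to sorting the whole file.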