ch01 Intro

This document provides information about a data mining course, including instructor details, teaching assistants, lecture and exam schedule, textbook recommendations, and an overview of topics to be covered. The instructor is Dr. Chenhao Ma and lectures will be on Tuesdays and Thursdays from 3:30-4:50pm. There will be 4 assignments, a midterm, and a final exam. Key topics include data mining algorithms, programming with big data, and applications in domains like social media, ecommerce, and healthcare.

Uploaded by

Ezekiel Loh

Modified based on materials from: http://www.mmds.org

Chenhao Ma
[email protected]
• Course Information
• What is data mining? - Introduction

¡ Instructor:
§ Dr. Chenhao Ma
§ [email protected], 319b Daoyuan Bldg
§ Office hour: 5:00-6:00PM Tuesday
§ Research interest: large-scale data management
and data mining

¡ Teaching assistants:
§ Miss. Yue Zhang
§ [email protected]
§ 109, 4F SDS research space Zhixin Bldg
§ Office hour: 3:00-4:00PM Friday

§ Mr. Yuyang Liang
§ [email protected]
§ 4F SDS research space Zhixin Bldg
§ Office hour: 11:00AM-12:00PM Friday

¡ USTFs
§ Mr Xinyang Gao
§ [email protected]

§ Miss Mengzhen Zhang
§ [email protected]

¡ Lecture
§ 3:30-4:50pm Tue/Thu
§ Teaching A Bldg 101

¡ Tutorial
§ 6:00-6:50pm Thu (starting from next week)
§ Teaching C Bldg 308

¡ Working language: English
§ After-class discussion can be in English/Chinese

¡ Knowledge:
§ Understand the key models and concepts of
contemporary data mining
§ Understand the strengths and limitations of
popular data mining techniques
§ Understand popular data mining algorithms for
solving real-world problems
§ Be able to identify promising applications of data
mining using biomedical, textual and graph data

¡ Skills:
§ Utilize a programming language to learn, visualize,
and mine new insights from big data
§ Utilize existing software to analyze available data
to inform critical decisions
§ Study data scientifically, and use it to prove
hypotheses
§ Be able to actively manage and participate in data
mining projects executed by consultants or
specialists in data mining

¡ Values/Attitude:
§ Be equipped with theoretical data mining knowledge
and be able to communicate with domain experts to
solve their data science problems
§ Be more vigilant about data science issues and
aware of data science developments in society
§ Be aware of the impact of data mining in social,
industrial, environmental and technological contexts
§ Be more literate in data science, and develop enough
knowledge to disseminate new data science developments

¡ 4 assignments (40%)
§ Theoretical and programming questions
§ A1 will be given in around week #4
§ Roughly one assignment every three weeks after the
first
§ Late submission: up to 2 days with 20% penalty.
§ Not every API used in the assignments will be
introduced in the lectures
§ Start early!

¡ Midterm (20%)
§ In week #7 or #8
§ Probably a 1.5-hour closed-book written exam
§ Details will be announced later

¡ Final exam (40%)
§ Details will be announced later

¡ Content: Programming practice and lecture
review
¡ Colab 0 (the tutorial for Spark) will be
released soon

¡ Leskovec, J., Rajaraman, A., and Ullman, J., Mining of
Massive Datasets (3rd edition). Cambridge
University Press. ISBN-13: 978-1108476348
§ We mainly follow this one in lectures
§ Available online: http://www.mmds.org/

¡ Tan, P., Steinbach, M., Karpatne, A., and Kumar, V.,
Introduction to Data Mining (2nd edition).
Pearson. ISBN 978-0321321367

¡ Han, J., Pei, J., and Kamber, M., 2011. Data Mining:
Concepts and Techniques. Elsevier. ISBN 978-0-12-381479-1

¡ Official prerequisites of this course are
§ CSC1001 or CSC1003 (programming skills)
§ STA2001 or MAT3280 or STA2003 (probability)
§ CSC3100 (data structures)
¡ The following would be helpful:
§ Discrete math
§ Database systems (SQL, relational algebra)
§ Common Linux commands

¡ Each of the topics listed is important for a part
of the course:
§ If you are missing an item of background, you can
consider just-in-time learning of the needed
material.

¡ Colab 0 can also help you decide whether to
take this course.
§ Programming skill is important.
§ You need to be comfortable with writing code in
Python or Java.

¡ ACM-ICPC
§ ACM: Association for Computing Machinery
§ ICPC: International Collegiate Programming Contest

¡ Online Judge (OJ) systems
§ Many programming problems
§ CUHKSZ OJ
§ http://oj.cuhk.edu.cn/ (accessible on campus)
§ SJTU OJ
§ https://acm.sjtu.edu.cn/OnlineJudge/
§ PKU OJ
§ http://poj.org/

Week  Content / topic / activity
1     Introduction to data mining
2     Map-Reduce
3     Spark
4     Frequent items and association rules
5     Finding similar items
6     Mining data streams
7     Mid-term
8     Mining data streams, Link analysis
9     Link analysis
10    Clustering
11    Advertising on the Web
12    Recommender systems
13    Graph mining
14    Recap

¡ Feedback is important and highly
appreciated!
§ Talk to course instructors and TAs
§ Send us emails
§…

[Figure labels: Social; User Tracking & Engagement; Government; eCommerce; Financial Services; Real Time Search]

¡ The key characteristics of big data are often
referred to as the "three Vs"
¡ Volume: In a big data environment, the
amounts of data collected and processed are
much larger than those stored in typical
relational databases.

¡ Variety: Big data consists of a rich variety of
data types.
¡ Velocity: Big data arrives at the organization
at high speed and from multiple sources
simultaneously.

¡ In the big data era, a huge amount of data is
being generated every day
[Figure: Recent Twitter statistics. Source: https://www.omnicoreagency.com/twitter-statistics/]
¡ Data volume is increasing exponentially (40%
increase per year)

[Figure: Data amount in zettabytes from 2010 to 2025. A forecast by IDC & Seagate. Image by Sven Balnojan.]
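To get a feel for what "40% increase per year" means, the compound growth can be worked out directly. The sketch below assumes the rate stays constant over the whole 2010 to 2025 period, which the forecast does not necessarily claim; the function names are illustrative.

```python
import math

ANNUAL_GROWTH = 0.40  # 40% increase per year, as stated on the slide


def growth_factor(years: int, rate: float = ANNUAL_GROWTH) -> float:
    """Factor by which the data volume multiplies after `years` of compound growth."""
    return (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Number of years needed for the volume to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + rate)


factor_2010_to_2025 = growth_factor(2025 - 2010)
years_to_double = doubling_time()

print(f"growth factor over 15 years: {factor_2010_to_2025:.1f}")
print(f"doubling time: {years_to_double:.2f} years")
```

Under this assumption the volume doubles roughly every two years and grows by a factor of more than 150 over the 15 years.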


¡ Different Types:
§ Relational Data (Tables/Transaction/Legacy Data)
§ Text Data (Web)
§ Semi-structured Data (XML)
§ Spatial Data
§ Temporal Data
§ Graph Data
§ Social Network, Semantic Web (RDF), …
§ One application can be generating/collecting many
types of data
¡ Different Sources:
§ Movie reviews from IMDB and Rotten Tomatoes
§ Product reviews from different provider websites

To extract knowledge -> all these types of
data need to be linked together

[Figure labels: Social Media; Banking/Finance; Gaming; Customer; Search Engine; Entertain; Purchase]

¡ Velocity essentially measures how fast the
data is coming in.
¡ Data is being generated fast and needs to be
processed fast
§ Late decisions -> missed opportunities

¡ High velocity is common in online data analytics, for
example
§ E-Promotions: based on your current location,
your purchase history, what you like -> send
promotions right now for the store next to you
§ Healthcare monitoring: sensors monitoring your
activities and body -> any abnormal
measurements require immediate reaction

The Owner of This iPhone Was in a Severe Car Crash'—or Just on a
Roller Coaster - WSJ
The statistics for 1 second in many applications.
http://www.internetlivestats.com/one-second/

Data contains value and knowledge
¡ But to extract the knowledge
data needs to be
§ Stored (systems)
§ Managed (databases)
§ And ANALYZED <- this class

Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science

¡ Given lots of data
¡ Discover patterns and models that are:
§ Valid: hold on new data with some certainty
§ Useful: should be possible to act on the item
§ Unexpected: non-obvious to the system
§ Understandable: humans should be able to
interpret the pattern

¡ Descriptive methods
§ Find human-interpretable patterns that
describe the data
§ Example: Clustering

¡ Predictive methods
§ Use some variables to predict unknown
or future values of other variables
§ Example: Recommender systems

¡ A risk with “Data mining” is that an analyst
can “discover” patterns that are meaningless
¡ Statisticians call it Bonferroni’s principle:
§ Roughly, if you look in more places for interesting
patterns than your amount of data will support,
you are bound to find crap

Example:
¡ We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
§ 10^9 people being tracked
§ 1,000 days
§ Each person stays in a hotel 1% of the time (1 day out of 100)
§ Hotels hold 100 people (so 10^5 hotels)
§ If everyone behaves randomly (i.e., no terrorists) will the
data mining detect anything suspicious?
¡ Expected number of “suspicious” pairs of people:
§ ~250,000
§ … too many combinations to check – we need to have some
additional evidence to find “suspicious” pairs of people in
some more efficient way
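
The figure of about 250,000 can be reproduced with a short calculation. The sketch below assumes, as the example does, that every person picks days and hotels independently and uniformly at random, with 10^9 people and 10^5 hotels; the variable names are illustrative.

```python
from math import comb

NUM_PEOPLE = 10**9   # people being tracked
NUM_DAYS = 1_000     # length of the observation window in days
P_IN_HOTEL = 0.01    # each person is in some hotel on 1% of the days
NUM_HOTELS = 10**5   # 10^9 people * 1% / 100 guests per hotel

# Probability that two given people are in the SAME hotel on one given day:
# both must be in a hotel, and the second must pick the first one's hotel.
p_same_hotel_one_day = P_IN_HOTEL * P_IN_HOTEL / NUM_HOTELS  # about 1e-9

# Probability that this coincidence happens on two given distinct days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2  # about 1e-18

pairs_of_people = comb(NUM_PEOPLE, 2)  # about 5e17
pairs_of_days = comb(NUM_DAYS, 2)      # about 5e5

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print(f"expected suspicious pairs: {expected_suspicious_pairs:,.0f}")
```

So even with purely random behaviour, roughly a quarter of a million pairs would look "suspicious", which is the point of Bonferroni's principle.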
[Figure labels: Usage; Quality; Context; Streaming; Scalability]

¡ Data mining overlaps with:
§ Databases: Large-scale data, simple queries
§ Machine learning: Complex models
§ CS Theory: (Randomized) Algorithms
¡ Different cultures:
§ To a DB person, data mining is an extreme form of
analytic processing: queries that
examine large amounts of data
§ Result is the query answer
§ To an ML person, data mining
is the inference of models
§ Result is the parameters of the model
¡ In this class we will do both!
[Figure: Venn diagram in which Data Mining overlaps CS Theory, Machine Learning, and Database systems]

¡ Data mining combines the best of machine learning,
statistics, artificial intelligence and databases, but
with more stress on
§ Scalability (big data)
§ Algorithms
§ Computing architectures
§ Automation for handling
large data
[Figure: Venn diagram in which Data Mining overlaps Statistics, Machine Learning, and Database systems]

¡ We will learn to mine different types of data:
§ Data is high dimensional
§ Data is a graph
§ Data is infinite/never-ending
§ Data is labeled
¡ We will learn to use different models of
computation:
§ MapReduce
§ Streams and online algorithms
§ Single machine in-memory
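
As a first taste of the MapReduce model, the word-count sketch below runs the map, shuffle (group by key), and reduce phases on a single machine in plain Python. It only illustrates the programming model under that simplification; it is not how Hadoop or Spark are invoked, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(document: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce phase: sum all the values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents: Iterable[str]) -> dict[str, int]:
    # Shuffle phase: group every mapped value under its key.
    groups: dict[str, list[int]] = defaultdict(list)
    for document in documents:
        for key, value in map_words(document):
            groups[key].append(value)
    # Apply the reducer once per key.
    return dict(reduce_counts(key, values) for key, values in groups.items())


docs = ["big data is big", "data mining mines data"]
word_counts = run_mapreduce(docs)
print(word_counts)
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework performs the shuffle over the network.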
¡ We will learn to solve real-world problems:
§ Recommender systems
§ Market Basket Analysis
§ Spam detection
§ Duplicate document detection
¡ We will learn various “tools”:
§ Linear algebra (Rec. Sys., Communities)
§ Optimization (stochastic gradient descent)
§ Dynamic programming (frequent itemsets)
§ Hashing (LSH, Bloom filters)
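
To make one of the hashing tools above concrete, here is a minimal Bloom filter sketch. The bit-array size, the number of hash functions, and the use of salted SHA-256 digests are assumptions chosen for readability, not the construction used later in the course.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        """Set every bit position that the item hashes to."""
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item: str) -> bool:
        """False means definitely never added; True means possibly added."""
        return all(self.bits[position] for position in self._positions(item))


seen = BloomFilter()
for url in ["a.example/1", "a.example/2"]:
    seen.add(url)

print(seen.might_contain("a.example/1"))
```

The filter stores only bits, never the items themselves, which is why it suits data that is too large to keep in memory.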
¡ Please work on Colab 0

¡ Extra materials:
§ A Systematic View of Data Science
§ By M. Tamer Özsu (https://cs.uwaterloo.ca/~tozsu/)
§ https://cs.uwaterloo.ca/~tozsu/presentations/DataScience-2022-04.pdf

§ OLTP and OLAP: a practical comparison
§ https://www.stitchdata.com/resources/oltp-vs-olap/

I ♥ data

How do you want that data?

