ch01 Intro

This document provides information about a data mining course, including instructor details, teaching assistants, lecture and exam schedule, textbook recommendations, and an overview of topics to be covered. The instructor is Dr. Chenhao Ma and lectures will be on Tuesdays and Thursdays from 3:30-4:50pm. There will be 4 assignments, a midterm, and a final exam. Key topics include data mining algorithms, programming with big data, and applications in domains like social media, ecommerce, and healthcare.

Uploaded by

Ezekiel Loh


Modified based on materials from: http://www.mmds.org

Chenhao Ma
[email protected]
• Course Information
• What is data mining? - Introduction

¡ Instructor:
§ Dr. Chenhao Ma
§ [email protected], 319b Daoyuan Bldg
§ Office hour: 5:00-6:00PM Tuesday
§ Research interest: large-scale data management
and data mining

¡ Teaching assistants:
§ Miss Yue Zhang
§ [email protected]
§ 109, 4F SDS research space Zhixin Bldg
§ Office hour: 3:00-4:00PM Friday

§ Mr. Yuyang Liang

§ [email protected]
§ 4F SDS research space Zhixin Bldg
§ Office hour: 11:00AM-12:00PM Friday
¡ USTFs
§ Mr Xinyang Gao
§ [email protected]

§ Miss Mengzhen Zhang

§ [email protected]

¡ Lecture
§ 3:30-4:50pm Tue/Thu
§ Teaching A Bldg 101

¡ Tutorial
§ 6:00-6:50pm Thu (starting from next week)
§ Teaching C Bldg 308

¡ Working language: English

§ After-class discussion can be in English/Chinese
¡ Knowledge:
§ understand the key models and concepts of
contemporary data mining
§ understand the strengths and limitations of
popular data mining techniques
§ understand popular data mining algorithms for
solving real-world problems
§ be able to identify promising applications of data
mining using biomedical, textual and graph data

¡ Skills:
§ Utilize a programming language to learn, visualize,
and mine new insights from big data
§ Utilize existing software to analyze available data
to inform critical decisions
§ Study data scientifically, and use it to prove
hypotheses
§ Be able to actively manage and participate in data
mining projects executed by consultants or
specialists in data mining

¡ Values/Attitude:
§ Be equipped with theoretical data mining knowledge
and be able to communicate with domain experts to
solve their data science problems
§ Be more vigilant about data science issues and
aware of data science developments in
society
§ Have awareness of the impact of data mining in social,
industrial, environmental and technological contexts
§ Be more literate in data science, and develop enough
knowledge of the field to disseminate
new data science developments

¡ 4 assignments (40%)
§ Theoretical and programming questions
§ A1 will be released around week #4
§ Roughly one assignment every three weeks after the
first
§ Late submission: up to 2 days with 20% penalty.
§ Not every API used in the assignments will be
introduced in the lectures
§ Start early ~

¡ Midterm (20%)
§ In week #7 or #8
§ Probably a 1.5-hour closed-book written exam
§ Details will be announced later

¡ Final exam (40%)

§ Details will be announced later

¡ Content: Programming practice and lecture
review
¡ Colab 0 (the tutorial for Spark) will be
released soon

¡ Leskovec, J., Rajaraman, A., and Ullman, J., Mining of
Massive Datasets (3rd edition). Cambridge
University Press. ISBN-13: 978-1108476348
§ We mainly follow this one in lectures
§ Available online: http://www.mmds.org/

¡ Tan, P., Steinbach, M., Karpatne, A., and Kumar, V.,
Introduction to Data Mining (second edition).
Pearson. ISBN 978-0321321367

¡ Han, J., Kamber, M., and Pei, J., 2011. Data Mining:
Concepts and Techniques. Elsevier. ISBN 978-0-12-381479-1
¡ Official prerequisites of this course are
§ CSC1001 or CSC1003 (programming skills)
§ STA2001 or MAT3280 or STA2003 (probability)
§ CSC3100 (data structures)
¡ The following would be helpful:
§ Discrete math
§ Database systems (SQL, relational algebra)
§ Common Linux Commands

¡ Each of the topics listed is important for a part
of the course:
§ If you are missing an item of background, you can
consider just-in-time learning of the needed
material.

¡ Colab 0 can also help you decide whether to
take this course.
§ Programming skill is important.
§ You need to be comfortable with writing code in
Python or Java.

¡ ACM-ICPC
§ ACM: Association for Computing Machinery
§ ICPC: International Collegiate Programming Contest

¡ Online Judge (OJ) systems
§ Many programming problems
§ CUHKSZ OJ
§ http://oj.cuhk.edu.cn/ (accessible on campus)
§ SJTU OJ
§ https://acm.sjtu.edu.cn/OnlineJudge/
§ PKU OJ
§ http://poj.org/
Week  Content / topic / activity
1     Introduction to data mining
2     Map-Reduce
3     Spark
4     Frequent items and association rules
5     Finding similar items
6     Mining data streams
7     Mid-term
8     Mining data streams, Link analysis
9     Link analysis
10    Clustering
11    Advertising on the Web
12    Recommender systems
13    Graph mining
14    Recap
¡ Feedback is important and highly
appreciated!
§ Talk to course instructors and TAs
§ Send us emails
§…

[Figure: big data application domains. Labels: Social, User Tracking & Engagement, Government, eCommerce, Financial Services, Real Time Search]

¡ The defining characteristics of big data are often
referred to as the "three Vs"
¡ Volume: In a big data environment, the
amounts of data collected and processed are
much larger than those stored in typical
relational databases.
¡ Variety: Big data consists of a rich variety of
data types.
¡ Velocity: Big data arrives at the organization
at high speed and from multiple sources
simultaneously.

¡ In the big data era, a huge amount of data is
being generated every day
Recent Twitter statistics:
https://www.omnicoreagency.com/twitter-statistics/
¡ Data volume is increasing exponentially (40%
increase per year)

[Figure: data volume in zettabytes from 2010 to 2025. A forecast by IDC & Seagate. Image by Sven Balnojan.]
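
To get a feel for what a 40% annual increase means, the compound growth can be worked out directly. The sketch below is only an illustration: it assumes the growth rate stays constant at 40% per year, and the function names `projected_volume` and `doubling_time` are made up for this example rather than taken from the course materials.

```python
import math


def projected_volume(initial: float, annual_growth: float, years: int) -> float:
    """Volume after `years` of compound growth (annual_growth=0.40 means 40%)."""
    return initial * (1.0 + annual_growth) ** years


def doubling_time(annual_growth: float) -> float:
    """Years needed for the volume to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    growth = 0.40  # the 40% per year figure quoted above (assumed constant)
    print(f"doubling time: {doubling_time(growth):.2f} years")
    print(f"growth factor over 15 years: {projected_volume(1.0, growth, 15):.1f}x")
```

Under this assumption the data volume doubles roughly every two years, so over the 15 years from 2010 to 2025 it grows by a factor of more than one hundred.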

¡ Different Types:
§ Relational Data (Tables/Transaction/Legacy Data)
§ Text Data (Web)
§ Semi-structured Data (XML)
§ Spatial Data
§ Temporal Data
§ Graph Data
§ Social Network, Semantic Web (RDF), …
§ One application can be generating/collecting many
types of data
¡ Different Sources:
§ Movie reviews from IMDB and Rotten Tomatoes
§ Product reviews from different provider websites

To extract knowledge → all these types of
data need to be linked together
[Figure: a single customer generating data across many sources. Labels: Social Media, Banking, Finance, Gaming, Search Engine, Entertain, Purchase]
¡ Velocity essentially measures how fast the
data is coming in.
¡ Data is being generated fast and needs to be
processed fast
§ Late decisions -> missing opportunities
¡ High velocity is typical of online data analytics, for
example
§ E-Promotions: based on your current location,
your purchase history, what you like -> send
promotions right now for the store next to you
§ Healthcare monitoring: sensors monitoring your
activities and body -> any abnormal
measurements require immediate reaction
The Owner of This iPhone Was in a Severe Car Crash'—or Just on a
Roller Coaster - WSJ
Statistics for one second of activity in many applications:
http://www.internetlivestats.com/one-second/
Data contains value and knowledge
¡ But to extract the knowledge,
data needs to be
§ Stored (systems)
§ Managed (databases)
§ And ANALYZED ← this class

Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science
¡ Given lots of data
¡ Discover patterns and models that are:
§ Valid: hold on new data with some certainty
§ Useful: should be possible to act on the item
§ Unexpected: non-obvious to the system
§ Understandable: humans should be able to
interpret the pattern

¡ Descriptive methods
§ Find human-interpretable patterns that
describe the data
§ Example: Clustering

¡ Predictive methods
§ Use some variables to predict unknown
or future values of other variables
§ Example: Recommender systems

¡ A risk with “Data mining” is that an analyst
can “discover” patterns that are meaningless
¡ Statisticians call it Bonferroni’s principle:
§ Roughly, if you look in more places for interesting
patterns than your amount of data will support,
you are bound to find crap

Example:
¡ We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
§ 10^9 people being tracked
§ 1,000 days
§ Each person stays in a hotel 1% of time (1 day out of 100)
§ Hotels hold 100 people (so 10^5 hotels)
§ If everyone behaves randomly (i.e., no terrorists) will the
data mining detect anything suspicious?
¡ Expected number of “suspicious” pairs of people:
§ ~250,000
§ … too many combinations to check – we need to have some
additional evidence to find “suspicious” pairs of people in
some more efficient way
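
The ~250,000 figure can be reproduced with a short calculation. The sketch below is one possible way to do it, under assumptions: people choose days and hotels independently and uniformly at random, and "at least twice" is approximated by counting every (pair of people, pair of days) combination. The function name `expected_suspicious_pairs` is invented for this example.

```python
from math import comb


def expected_suspicious_pairs(people: int, days: int,
                              p_hotel: float, hotels: int) -> float:
    """Expected number of (pair of people, pair of days) coincidences
    when everyone behaves randomly and independently."""
    # Two given people are both in some hotel on a given day with
    # probability p_hotel * p_hotel; they chose the same one of the
    # `hotels` hotels with probability 1 / hotels.
    p_same_hotel_one_day = p_hotel * p_hotel / hotels
    # The coincidence has to happen on two different days.
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    # Multiply by the number of pairs of people and pairs of days.
    return comb(people, 2) * comb(days, 2) * p_same_hotel_two_days


if __name__ == "__main__":
    n = expected_suspicious_pairs(people=10**9, days=1000,
                                  p_hotel=0.01, hotels=10**5)
    print(f"expected suspicious pairs: {n:,.0f}")
```

With about 5 x 10^17 pairs of people, about 5 x 10^5 pairs of days, and a probability of 10^-18 per combination, the product comes out close to 250,000 even though nobody is doing anything suspicious.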
[Figure labels: Usage, Quality, Context, Streaming, Scalability]
¡ Data mining overlaps with:
§ Databases: Large-scale data, simple queries
§ Machine learning: Complex models
§ CS Theory: (Randomized) Algorithms
¡ Different cultures:
§ To a DB person, data mining is an extreme form of
analytic processing: queries that
examine large amounts of data
§ Result is the query answer
§ To an ML person, data mining
is the inference of models
§ Result is the parameters of the model
¡ In this class we will do both!
[Figure: Venn diagram with Data Mining at the overlap of CS Theory, Machine Learning, and Database systems]
¡ This combines the best of machine learning,
statistics, artificial intelligence, and databases, but
with more stress on
§ Scalability (big data)
§ Algorithms
§ Computing architectures
§ Automation for handling
large data
[Figure: Venn diagram with Data Mining at the overlap of Statistics, Machine Learning, and Database systems]
¡ We will learn to mine different types of data:
§ Data is high dimensional
§ Data is a graph
§ Data is infinite/never-ending
§ Data is labeled
¡ We will learn to use different models of
computation:
§ MapReduce
§ Streams and online algorithms
§ Single machine in-memory
¡ We will learn to solve real-world problems:
§ Recommender systems
§ Market Basket Analysis
§ Spam detection
§ Duplicate document detection
¡ We will learn various “tools”:
§ Linear algebra (Rec. Sys., Communities)
§ Optimization (stochastic gradient descent)
§ Dynamic programming (frequent itemsets)
§ Hashing (LSH, Bloom filters)
¡ Please work on Colab 0

¡ Extra materials:
§ A Systematic View of Data Science
§ By M. Tamer Özsu (https://cs.uwaterloo.ca/~tozsu/)
§ https://cs.uwaterloo.ca/~tozsu/presentations/DataScience-2022-04.pdf
§ OLTP and OLAP: a practical comparison
§ https://www.stitchdata.com/resources/oltp-vs-olap/

I ♥ data

How do you want that data?
