Data Science
Module 1: Introduction to Data Science
1
Course Plan
Module Titles
Current Focus: Module 1 – Introduction to Data Science
Module 2 – Introduction to Python
Module 3 – NumPy
Module 4 – Pandas
Module 5 – Data Collection and Cleaning
Module 6 – Descriptive Statistics and Visualization
Module 7 – Workshop (no content)
Module 8 – Time Series
Module 9 – Introduction to Regression and Classification
Module 10 – Databases and SQL
Module 11 – Data Privacy and Security
Module 12 – Term Project Presentations (no content)
2
Learning Outcomes for this Module
3
Topics for this Module
4
Module 1 – Section 1
5
Certified Analytics Professional
• Industry Certification
• Operated by INFORMS, the world’s largest professional
society for those in the field of analytics, operations
research (O.R.), and the management sciences
• Requires experience doing analytics and a related degree
(or equivalent additional experience)
• Code of ethics
6
Certificate in Data Science
7
Certificate in Data Science (cont’d)
8
Certificate in Data Science (cont’d)
Courses
• SCS 3250 – Foundations of Data Science
• SCS 3251 – Statistics for Data Science
• SCS 3252 – Big Data Management Systems & Tools
• SCS 3253 – Machine Learning
9
Certificate in Data Science (cont’d)
3250 – Foundations
• Python
• Pandas
• SQL
3251 – Statistics
• Sampling
• Modelling
• Inference
• Regression
• Bayesian Stats
3252 – Big Data Systems & Tools
• NoSQL
• Hadoop
• Spark
• Distributed Systems
3253 – Machine Learning
• Methodology
• Algorithms
• Evaluation/Testing
• Ensemble Methods
10
The CAP Domains
Legend: introductory content, substantial coverage, major focus
11
About this Course and the Certificate Program
• Not a course in general Python programming
• But a course that introduces the use of Python in data
analytics
• Subsequent courses in the certificate program
– Teach the various disciplines of data science and Big Data
technologies
– Overall content more technical than the “Management of Enterprise
Data Analytics” certificate program
– Not purely mathematical (i.e. analytical solutions); they also rely on
programming to understand data
12
How to Benefit the Most from this Course?
• Come to class
• Working with classmates is encouraged
• Use Quercus to share questions and insights (10%
participation mark)
– If you come across an interesting article on the subject matter, share
it with the class
– If you have problems with the homework, post a question there
• Do the readings and homework
13
Quick Poll
Why would you like to become a Data Scientist?
14
Module 1 – Section 2
15
What is Data Science?
“Data Science” is a fairly new term, for a new profession that is
trying to make sense of Big Data.
16
A Brief History of Data Science
The term "Data Science" is attributed to William S. Cleveland, who in 2001
wrote "Data Science: An Action Plan for Expanding the Technical Areas of the
Field of Statistics."
• 1960s-1970s: Advances in Statistics and Computer Science
• Late 1990s: Google invented a new search engine combining math,
statistics, data engineering and computation (which replaced AltaVista)
• 1998-2000: Hard drives become cheap; Dot-Com “boom”; Cloud computing
and Hadoop
• 2001: Data Science term gets “coined”
• 2002: CODATA Data Science Journal
• 2003: Columbia University began publishing The Journal of Data Science
• 2010: “What is Data Science?” article is published; Big Data
17
Evolution of Analytics
19
From Data Analysts to Data Scientists
20
Data Science Is Multidisciplinary
• Mathematics
• Statistics
• Machine Learning
• Data Mining
• Programming
• Software Engineering
• Data Engineering
• Business
• Subject Area Expertise
• Story Telling and Visualization
21
Big Data
Big data is defined as a large volume of data (structured and
unstructured) that “floods” a business on a day-to-day basis.
“Data is growing faster than ever before and by the year 2020,
about 1.7 MB of new information will be created every second
for every human being on the planet” (Marr, 2015)
22
The Three+ Vs of Big Data
• “Big Data” is a relatively new term; however, collecting, storing
and analysing data is centuries old. The concept gained momentum in
the early 2000s when industry analyst Doug Laney articulated the
now-mainstream definition of Big Data as the three Vs: Velocity,
Variety and Volume.
• Diagram labels: Volume, Velocity, Variety, Variability
23
Questions:
24
Module 1 – Section 3
25
Predictive Modelling
Predictive modelling is a process used in analytics to build a statistical
model that forecasts future behaviour from historical data.
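As a minimal sketch of the idea, the Python below fits a straight line to past observations and uses it to forecast the next period. The data, the function names `fit_line` and `predict`, and the choice of ordinary least squares are illustrative assumptions, not part of the course material.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) pair to a new x value."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical history: month number -> units sold (made-up numbers,
# chosen to lie exactly on a line so the result is easy to check by hand).
months = [1, 2, 3, 4, 5]
units = [12, 14, 16, 18, 20]

model = fit_line(months, units)
print("forecast for month 6:", predict(model, 6))
```

Real data is noisier than this, so a fitted model is normally checked against observations it was not trained on before its forecasts are trusted.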
26
What is Data Mining?
• Data Mining is the examination of data to uncover patterns
and generate new information
27
Data Mining and Predictive Analytics
28
Examples of Predictive Modelling Techniques
• Decision Trees
– Classification and Regression Trees (CART), CHAID, C4.5, C5.0, etc.
– Random Forests (work by constructing many decision trees)
– Boosted Trees
• Regression
– OLS, GLM (Logistic Regression is a special case of GLM; others include Poisson,
Gamma and Multinomial regression), MARS (multivariate adaptive regression
splines), Semi-parametric regression
• Neural Networks
• Support Vector Machines
• k-Nearest Neighbour algorithm (k-NN) is a non-parametric method used for classification
and regression
• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying
Bayes' theorem with strong (naive) independence assumptions between the features
• k-means algorithm is a distance-based clustering algorithm that partitions the data into a
predetermined number of clusters
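To make one of the techniques above concrete, here is a small sketch of k-NN classification using only the Python standard library. The function name `knn_predict`, the toy data and the feature choice are hypothetical; in practice a library implementation would normally be used instead.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (monthly_spend, visits_per_month) -> label.
points = [(10, 1), (12, 2), (11, 1), (80, 9), (85, 10), (90, 8)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (15, 2)))
print(knn_predict(points, labels, (82, 9)))
```

k-NN is called non-parametric because it stores the training data rather than fitting coefficients; the only choices are k and the distance measure.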
29
Module 1 – Section 4
30
Data Science - Every Day
• How is data science impacting our daily lives?
– Do you shop online?
– Do you receive coupons and offers by email/mail?
– How many credit cards do you have?
– Why did you receive an offer to go watch a movie this weekend?
– How does Facebook always “know” what ads you would like to see?
– Do you watch Netflix, and follow their “recommendations”?
31
Social Media and the Data “Explosion”
32
Banking and Finance
• Data Science Applications include:
– Customer acquisition (acquire new credit card customers, investors,
traders, etc.)
– Churn models (prevent customer attrition)
– Risk models (to assess who is qualified for a mortgage, credit line,
etc.)
– Next best product model
– Customer Satisfaction
– Drive revenue, reduce cost
33
Data Science in Healthcare
34
Product Recommendation Engine
35
Example: Customer Segmentation
Why Segment?
• Customers may differ in:
– What they want to buy
– Amount willing to pay
– Quantity they buy
– Time, place, frequency of purchase
– Personal taste (likes and dislikes)
• Media, telephone plan, newspapers, magazines, movies, social media
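One common way to form such segments is clustering. The sketch below applies k-means (listed earlier in this module as a distance-based clustering algorithm) to made-up customer data. The function name `kmeans`, the seeding rule and the two features are assumptions for illustration only.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, clusters


# Hypothetical customers: (annual_spend, purchases_per_year).
customers = [(200, 2), (5000, 40), (250, 3),
             (220, 2), (5200, 45), (4800, 38)]

centroids, clusters = kmeans(customers, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Annual spend dominates the distance here because it is on a much larger scale than purchase count; features are usually rescaled before clustering real data.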
36
Example: Customer Segmentation (cont’d)
37
Customize Marketing Strategy for Each
Customer Segment
38
Example: Customer Churn (Attrition)
Data Scientists may build a predictive model to flag early signs of customer
churn, helping the business develop a strategy to prevent it.
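A churn model of this kind is often a classifier that outputs a probability. The sketch below fits a one-feature logistic regression by gradient descent in plain Python. The feature (number of complaints), the data, the learning rate and the function names are all hypothetical choices made for illustration.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(churn | x) = sigmoid(b0 + b1 * x) by batch gradient descent
    on the average log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # prediction minus label
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def churn_probability(model, x):
    b0, b1 = model
    return sigmoid(b0 + b1 * x)


# Made-up history: complaints last quarter -> churned (1) or stayed (0).
complaints = [0, 0, 1, 1, 2, 3, 4, 5, 5, 6]
churned = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

model = fit_logistic(complaints, churned)
for c in (0, 3, 6):
    print(c, "complaints ->", round(churn_probability(model, c), 2))
```

Customers whose predicted probability exceeds a chosen threshold could then be flagged for a retention offer; picking that threshold is a business decision as much as a statistical one.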
39
Example: Fraud Detection
40
Market Basket Analysis
Market Basket Analysis is a modelling technique based upon the theory
that if you buy a certain group of items, you are more (or less) likely
to buy another group of items.
Example: a customer who bought milk and eggs may also buy oil.
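"More (or less) likely" is usually quantified with three measures: support, confidence and lift. The sketch below computes them for the rule "milk and eggs, then oil" over a handful of made-up baskets. The function names and data are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent.

    Assumes the antecedent and the consequent each occur at least once,
    otherwise the divisions below would fail.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    confidence = both / support(transactions, antecedent)
    lift = confidence / support(transactions, consequent)
    return both, confidence, lift


# Made-up shopping baskets.
baskets = [
    {"milk", "eggs", "oil"},
    {"milk", "eggs", "oil"},
    {"milk", "eggs"},
    {"bread", "oil"},
    {"bread"},
]

s, c, l = rule_metrics(baskets, {"milk", "eggs"}, {"oil"})
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift above 1 means buyers of the first group are more likely than average to buy the second; a lift below 1 means they are less likely.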
41
Quick Poll
A. Yes
B. No
C. Don’t Know
D. What is Big Data??
42
43
Module 1 – Section 5
44
How to Become a Data Scientist
• To become a data scientist, one needs
– a background in statistics, math and programming
– soft skills (communication, scientific curiosity)
– business understanding and gut instinct
– strong technical skills (databases and coding)
• Formal education
– a Master’s or PhD is no longer a requirement in Data Science; one could
supplement a bachelor’s degree with experience and relevant
certifications
– (the Master of Information and Data Science, MIDS, at UC Berkeley costs
about $60,000!)
• Certifications and non-degree programs (such as continuing
education)
• Python, R and/or SAS, SQL
• Strong background in analytics
45
Skills Required
• Technical Skills: R, Python, SAS, SQL (Scala, Julia, etc. are nice to
have)
• Analytical Skills and Education: Stats, Math, Computer Science,
Physics, etc.
46
Python #1 for Data Science
47
Non Technical Skills
• Intellectual curiosity – This is a key skill, as one needs to think
critically about the problem and ask the right questions in order to
formulate and eventually solve the business problem at hand.
48
Module 1 – Section 6
49
Demand for Data Science
50
Data Scientists are “Sexy”
51
Data Scientists are #1 in the US
52
Data Scientists are in Demand
• Forbes published the article “The 10 Toughest Jobs to Fill in 2016”,
which included Data Scientist
53
Job Titles for Data Scientists
54
Salary Overview - USA
55
Data Scientist Salary - Canada
56
Salary Overview – Python
57
Salary Overview - Statistician
58
Salary Overview – Data Mining
59
What Determines Salary?
60
Who is Hiring Data Scientists?
Any company that holds large volumes of data is likely to seek out Data
Scientists
61
62
Certifications
63
Where there is data… there is Data Science
64
Quick Poll
Would you like to be a Data Scientist?
A. Yes
B. No
C. Don’t Know
D. Still Thinking About It!
65
Summary
• Analytics has broad applications in business and
science
• Data Science brings together ideas from computer
science, statistics and engineering to solve new
problems
• Business skills (formulating questions, gathering
information, building consensus) are essential to
applying data science to solving business problems
66
Module 1 – Section 7
Homework
67
Next Class
• In preparation:
– If you are reading Think Python, continue with Ch. 8 – 14
– Install Anaconda Python according to the instructions provided
• Introduction to Python
– The core syntax of Python
– Hands-on
68
Any questions?
70
Thank You
Thank you for choosing the University of Toronto
School of Continuing Studies
71