3250+module+1+ +Intro+to+Data+Science

Uploaded by

NamdeoSakina
Copyright
© All Rights Reserved

3250 Foundations of Data Science
Module 1: Introduction to Data Science

1
Course Plan
Module Titles
Current Focus: Module 1 – Introduction to Data Science
Module 2 – Introduction to Python
Module 3 – NumPy
Module 4 – Pandas
Module 5 – Data Collection and Cleaning
Module 6 – Descriptive Statistics and Visualization
Module 7 – Workshop (No Content)
Module 8 – Time Series
Module 9 – Introduction to Regression and Classification
Module 10 – Databases and SQL
Module 11 – Data Privacy and Security
Module 12 – Term Project Presentations (no content)

2
Learning Outcomes for this Module

• Outline the course logistics
• Discuss the history of Data Science
• Introduce applications of Predictive Modeling
• Identify the skills and knowledge a Data Scientist needs
• Review the job market
• Describe relevant certifications

3
Topics for this Module

• 1.1 Introductions and course overview
• 1.2 History of data science
• 1.3 Define predictive modeling and data mining
• 1.4 Examples of applications of predictive modeling
• 1.5 What it takes to become a data scientist
• 1.6 Job market overview
• 1.7 Homework

4
Module 1 – Section 1

Introductions and Course Overview

5
Certified Analytics Professional
• Industry Certification
• Operated by INFORMS, the world’s largest professional
society for those in the field of analytics, operations
research (O.R.), and the management sciences
• Requires experience doing analytics and a related degree
(or equivalent additional experience)
• Code of ethics

6
Certificate in Data Science

• Understand the techniques and methods of predictive and
Big Data analytics
• Learn how to use tools such as Python and Hadoop to
tackle data analysis challenges
• Develop and use models and tools to solve business
problems and mine data for fresh insights

7
Certificate in Data Science (cont’d)

What You’ll Learn


• Explore the evolution of data science and predictive
analytics
• Know statistical concepts and techniques including
regression, correlation and clustering
• Apply data management systems and technologies that
reflect concern for security and privacy
• Adopt techniques and technologies including data mining,
neural network mapping and machine learning
• Represent big data findings visually to aid decision-makers

8
Certificate in Data Science (cont’d)

Courses
• SCS 3250 – Foundations of Data Science
• SCS 3251 – Statistics for Data Science
• SCS 3252 – Big Data Management Systems & Tools
• SCS 3253 – Machine Learning

9
Certificate in Data Science (cont’d)

3250 Foundations
• Python
• Pandas
• SQL

3251 Statistics
• Sampling
• Modelling
• Inference
• Regression
• Bayesian Stats

3252 Big Data Systems & Tools
• NoSQL
• Hadoop
• Spark
• Distributed Systems

3253 Machine Learning
• Methodology
• Algorithms
• Evaluation/Testing
• Ensemble Methods
10
The CAP Domains

Coverage in this certificate program


3250 3251 3252 3253
I. Business Problem (Question) Framing    
II. Analytics Problem Framing    
III. Data    
IV. Methodology (Approach) Selection   
V. Model Building   
VI. Deployment  
VII. Model Life Cycle Management  

 = Introductory content
  = Substantial coverage
   = Major focus

11
About this Course and the Certificate Program
• Not a course in general Python programming
• But a course that introduces the use of Python in data
analytics
• Subsequent courses in the certificate program
– Teach the various disciplines of data science and Big Data
technologies
– Overall content more technical than the “Management of Enterprise
Data Analytics” certificate program
– Not all mathematics (i.e. analytical solutions) but also relying on the
use of programming to understand data

12
How to Benefit the Most from this Course?
• Come to class
• Working with classmates is encouraged
• Use Quercus to share questions and insights (10%
participation mark)
– If you come across an interesting article on the subject matter, share
with the class
– If you have problems with the homework, post a question there
• Do the readings and homework

13
Quick Poll
Why would you like to become a Data Scientist?

A. Enjoy making sense of data


B. Good pay
C. Interesting work
D. Data Scientists are in high demand
E. All of the above

14
Module 1 – Section 2

History of Data Science

15
What is Data Science?
“Data Science” is a fairly new term, for a new profession that is
trying to make sense of Big Data.

Collecting, storing, and making sense of Big Data (another
fairly new term) is quickly becoming part of every business
and everyone’s life.

16
A Brief History of Data Science
The term "Data Science" is attributed to William S. Cleveland who, in 2001,
wrote "Data Science: An Action Plan for Expanding the Technical Areas of the
Field of Statistics."

1960s-1970s
• Advances in Statistics and Computer Science

Late 1990s
• Google invented a new search engine combining math, statistics,
data engineering and computation (which replaced AltaVista)

1998-2000
• Hard drives become cheap
• Dot-Com “boom”
• Cloud computing and Hadoop

2001
• Data Science term gets “coined”

2002
• CODATA Data Science Journal

2003
• Columbia University began publishing The Journal of Data Science

2010
• “What is Data Science?” article is published
• Big Data

17
Evolution of Analytics

1.0 Traditional Analytics
• Primarily Descriptive and Reporting
• Internally sourced, relatively small, structured data
• “Backroom” teams of analysts
• Internal Systems of Support

2.0 Big Data
• Complex, large, unstructured data sources
• Starting mid 2000s (the term Big Data was coined in 2010)
• Stored and processed rapidly, with new analytical and
computational technologies like Hadoop
• “Data Scientists” emerge
• Online firms create data-based products and services
18
Analytics 3.0
• What defines Analytics 3.0
– An environment that combines Analytics 1.0 and 2.0 and yields
insights with speed and impact
– Analytics becomes integral to running a business and part of
strategy and operations
– Predictive and Prescriptive Models
– Artificial Intelligence techniques
• Sources
– Analytics 3.0 FAQ
– Analytics 3.0

19
From Data Analysts to Data Scientists

Traditional Analysts
• Tend to use tools like SAS and SQL
• Use Relational DBMS

Data Scientists
• Tend to use tools like Python and R (often in addition to SQL
and SAS)
• Use Hadoop environment as well as in-memory databases and
in-memory computing

IMPORTANT: Once you learn skills and tools in one
environment you can easily transition to the other. The
underlying skills are the same.

20
Data Science Is Multidisciplinary

Data Science (centre of diagram) draws on:
• Mathematics
• Statistics
• Subject Area Expertise
• Machine Learning
• Data Mining
• Story Telling and Visualization
• Business
• Programming
• Software Engineering
• Data Engineering

21
Big Data
Big data is defined as a large volume of data (structured and
unstructured) that “floods” a business on a day-to-day basis.

“Data is growing faster than ever before and by the year 2020,
about 1.7 MB of new information will be created every second
for every human being on the planet” (Marr, 2015)
Source
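
To get a feel for the scale in the quote above, the sketch below converts the quoted rate of 1.7 MB per second per person into a daily figure. The decimal unit convention (1 GB = 1,000 MB) and the variable names are assumptions made for illustration; they are not part of the quoted source.

```python
# Back-of-the-envelope scale check for the quoted rate (illustrative only).
MB_PER_SECOND_PER_PERSON = 1.7      # figure quoted on the slide (Marr, 2015)
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds

mb_per_day = MB_PER_SECOND_PER_PERSON * SECONDS_PER_DAY
gb_per_day = mb_per_day / 1000      # assumption: decimal units, 1 GB = 1,000 MB

print(f"{gb_per_day:.1f} GB per person per day")
```

Under these assumptions the quoted rate works out to roughly 147 GB per person per day.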

22
The Three+ Vs of Big Data
• “Big Data” is a relatively new term; however, collecting, storing
and analysing data is centuries old. The concept gained
momentum in the early 2000s when industry analyst Doug
Laney articulated the now-mainstream definition of Big
Data as the three Vs: Velocity, Variety and Volume.
(Source: What is Big Data)
• SAS Institute also considers Variability and Complexity
• Some also include Veracity

Diagram: Big Data, surrounded by Velocity, Variety, Volume,
Variability and Complexity

23
Questions:

Where would you find Big Data?

Can you provide an example of Big Data?

24
Module 1 – Section 3

Defining Predictive Modeling and Data Mining

25
Predictive Modelling
Predictive modeling is a process used in analytics to create a statistical
model of future behaviour.

Predictive analytics is the area of data science concerned with forecasting
probabilities and trends.

The business process of Predictive Modelling often consists of the
following steps:

1. Define Problem
2. Prepare the Data
3. Create the Model
4. Test the Model
5. Validate the Model
6. Evaluate the Model
7. Deploy the Model
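
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended workflow: the data set, the choice of a straight-line model, and every function name (`fit_line`, `predict`, `mean_absolute_error`) are hypothetical and chosen only to make each step visible.

```python
# Minimal, illustrative walk through the modelling steps.
# The data set and all names are hypothetical.

# 1. Define the problem: predict monthly spend ($) from number of store visits.
# 2. Prepare the data: (visits, spend) pairs, split into training and hold-out.
#    A random split is more common; a fixed split keeps the sketch deterministic.
data = [(1, 12.0), (2, 19.0), (3, 33.0), (4, 41.0),
        (5, 48.0), (6, 62.0), (7, 69.0), (8, 81.0)]
train, holdout = data[:6], data[6:]


def fit_line(points):
    """3. Create the model: ordinary least squares for y = a + b * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """7. Deploy: in practice this call would sit behind an application."""
    intercept, slope = model
    return intercept + slope * x


def mean_absolute_error(model, points):
    """4-6. Test, validate and evaluate: average error on unseen data."""
    return sum(abs(predict(model, x) - y) for x, y in points) / len(points)


model = fit_line(train)
print("intercept, slope:", model)
print("hold-out MAE:", mean_absolute_error(model, holdout))
```

Keeping the hold-out data separate from the training data is the point of the test and validation steps: the error is measured on records the model has not seen.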

26
What is Data Mining?
• Data Mining is defined as examining data to uncover
patterns in the data to generate new information

• Data Mining is comprised of:
– Massive data collection
– Powerful multiprocessor computers
– Data mining algorithms

27
Data Mining and Predictive Analytics

• Both branches are grounded in a huge amount of mathematical theory
dating back several decades.

• Predictive analytics and data mining both apply complex mathematics
to data in order to solve business problems. However, when we talk
about data mining, we are usually referring to an analytic toolset that
automatically searches for useful patterns in large data sets.

• Data mining is often one stage in developing a predictive model.

28
Examples of Predictive Modelling Techniques
• Decision Trees
– Classification and Regression Trees (CART), CHAID, C4.5, C5.0, etc.
– Random Forests (work by constructing many decision trees)
– Boosted Trees

• Regression
– OLS, GLM (Logistic Regression is a special case of GLM; others include Poisson,
Gamma and Multinomial regression), MARS (multivariate adaptive regression
splines), Semi-parametric regression

• Neural Network
• Support Vector Machines
• k-Nearest Neighbour algorithm (k-NN) is a non-parametric method used for classification
and regression
• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying
Bayes' theorem with strong (naive) independence assumptions between the features
• k-means algorithm is a distance-based clustering algorithm that partitions the data into a
predetermined number of clusters
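
As an illustration of one technique from the list above, here is a minimal k-NN classifier in plain Python. It is a sketch under stated assumptions: Euclidean distance, majority voting, and a made-up toy data set. The function name `knn_predict` and the "low"/"high" labels are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (monthly visits, average basket in $) -> spend level.
training_data = [
    ((1, 10), "low"), ((2, 12), "low"), ((1, 15), "low"),
    ((8, 90), "high"), ((9, 85), "high"), ((7, 95), "high"),
]

print(knn_predict(training_data, (2, 11)))
print(knn_predict(training_data, (8, 88)))
```

The method is called non-parametric because nothing is fitted in advance: the training points themselves are the model, and all the work happens at prediction time.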

29
Module 1 – Section 4

Applications of Predictive Modeling - Examples

30
Data Science - Every Day
• How is data science impacting our daily lives:
– Do you shop online?
– Do you receive coupons and offers by email/mail?
– How many credit cards do you have?
– Why did you receive an offer to go watch a movie this weekend?
– How does Facebook always “know” what ads you would like to see?
– Do you watch Netflix, and follow their “recommendations”?

31
Social Media and the Data “Explosion”

32
Banking and Finance
• Data Science Applications include:
– Customer acquisition (acquire new credit card customers, investors,
traders, etc.)
– Churn models (prevent customer attrition)
– Risk models (to assess who is qualified for a mortgage, credit line,
etc.)
– Next best product model
– Customer Satisfaction
– Drive revenue, reduce cost

33
Data Science in Healthcare

34
Product Recommendation Engine

35
Example: Customer Segmentation
Why Segment?
• Customers may differ in:
– What they want to buy
– Amount willing to pay
– Quantity they buy
– Time, place, frequency of purchase
– Personal taste (likes and dislikes)
• Media, telephone plan, newspapers, magazines, movies, social media

36
Example: Customer Segmentation (cont’d)

Premium: $$$
Valuable: $$
Potential: $
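
One common way to arrive at segments like these is clustering. The sketch below runs a one-dimensional k-means (Lloyd's algorithm) on hypothetical annual-spend figures. Using k-means here, the choice of three clusters, the starting centroids, and the name `kmeans_1d` are all assumptions for illustration; the slides do not say how the segments were derived.

```python
def kmeans_1d(values, start_centroids, iterations=20):
    """Plain Lloyd's algorithm in one dimension.

    Repeats two steps: assign each value to its nearest centroid, then move
    each centroid to the mean of the values assigned to it.
    """
    centroids = list(start_centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            distances = [abs(value - c) for c in centroids]
            clusters[distances.index(min(distances))].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(members) / len(members) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical annual spend per customer, in dollars.
annual_spend = [40, 55, 60, 300, 320, 350, 1500, 1700, 1650]

# Assumption: three segments, with hand-picked starting centroids in
# ascending order so the clusters line up with the segment names below.
centroids, clusters = kmeans_1d(annual_spend, start_centroids=[50, 500, 1500])

for name, members in zip(["Potential", "Valuable", "Premium"], clusters):
    print(name, members)
```

In practice segmentation usually uses several features (frequency, recency, product mix), not spend alone, and the number of segments is itself a business decision.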

37
Customize Marketing Strategy for Each
Customer Segment

Premium: Say “Thank you” through personalized communication
Valuable: Maintain and Grow
Potential: Grow these into Valuable customers through offers/promotions

38
Example: Customer Churn (Attrition)

Customer churn, or attrition, is defined as the number of customers who
discontinue a service or employees who leave a company during a
specified time period.

Why do customers leave?


Better price? Better service? Convenient location? Etc.

Data Scientists may build a predictive model to flag early signs of customer
churn, to help the business develop a strategy to prevent churn.
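
Churn is often reported as a rate rather than a raw count. The sketch below uses one common convention, customers lost during the period divided by customers at the start of the period. That convention, the function name `churn_rate`, and the numbers are assumptions for illustration and do not come from the slides.

```python
def churn_rate(customers_at_start, customers_lost):
    """Share of the starting customer base that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical quarter: 2,000 customers at the start, 150 discontinued service.
rate = churn_rate(customers_at_start=2000, customers_lost=150)
print(f"Quarterly churn rate: {rate:.1%}")
```

A predictive churn model goes one step further and estimates, per customer, the likelihood of leaving in the next period.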

39
Example: Fraud Detection

40
Market Basket Analysis
Market Basket Analysis is a modelling technique based upon
the theory that if you buy a certain group of items, you are
more (or less) likely to buy another group of items.

Example: Bought Milk and Eggs → Bought Oil

Business strategy could include:
a. Offer coupon on Eggs with a purchase of Milk
b. Place Milk and Eggs close on the shelf
c. Place Oil near Milk and Eggs
d. Place Oil far from Milk and Eggs (to force customer to “shop
through the store”)
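
The "more (or less) likely" idea can be quantified with the standard association-rule measures of support, confidence and lift. These measures are not named on the slide; the sketch below applies them to the Milk and Eggs, then Oil, example using made-up transactions. The function name `rule_metrics` and the baskets are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes at least one basket contains the antecedent and at least one
    contains the consequent; otherwise the ratios are undefined.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_consequent = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)

    support = n_both / n                    # share of baskets with all items
    confidence = n_both / n_antecedent      # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)  # above 1: more likely than usual
    return support, confidence, lift


# Hypothetical transactions, each a set of purchased items.
baskets = [
    {"Milk", "Eggs", "Oil"},
    {"Milk", "Eggs", "Oil", "Bread"},
    {"Milk", "Eggs"},
    {"Bread", "Oil"},
    {"Milk", "Bread"},
    {"Eggs", "Oil", "Milk"},
    {"Bread"},
    {"Milk", "Eggs", "Bread"},
]

support, confidence, lift = rule_metrics(baskets, {"Milk", "Eggs"}, {"Oil"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests customers who buy Milk and Eggs buy Oil more often than customers in general; a lift below 1 corresponds to the "less likely" case.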

41
Quick Poll

Does your organization have a Big Data road map?

A. Yes
B. No
C. Don’t Know
D. What is Big Data??

42
43
Module 1 – Section 5

Becoming a Data Scientist

44
How to Become a Data Scientist
• To become a data scientist, one would need to have
– background in statistics, math and programming
– soft skills (communication, scientific curiosity)
– business understanding—and gut instinct
– strong technical skills (databases and coding)

• Formal education
– These days a Master’s or PhD isn’t a requirement in Data
Science; one could supplement a bachelor’s degree with experience and
relevant certifications
– (Master’s in Information and Data Science, MIDS, at UC Berkeley costs
~$60,000!)
• Certifications and non-degree programs (such as continuing
education)
• Python, R and/or SAS, SQL
• Strong background in analytics
45
Skills Required

Technical Skills
R, Python, SAS, SQL (Scala, Julia, etc. are nice to have)

Analytical Skills and Education
Stats, Math, Computer Science, Physics, etc.

Business Domain Knowledge and Soft Skills
Non-Technical Skills include a strong business acumen, solid
communication skills, and ability to tell a “story”

(Centre of diagram: Data Scientist)

46
Python #1 for Data Science

47
Non Technical Skills
• Intellectual curiosity – This is a key skill, as one needs to think about
the problem critically and ask the right questions to be able to formulate
and eventually answer the business problem at hand.

• Business acumen – One needs a good understanding of the industry
they are working in, and a grasp of the problems the company is trying
to solve.

• Communication skills – A data scientist must be able to clearly and
fluently translate their findings to a non-technical team (Marketing or
Sales departments), as well as communicate with the
business to understand objectives and the business problem.

48
Module 1 – Section 6

Data Science: Job Market Overview

49
Demand for Data Science

• The statistics listed below represent this significant and
growing demand for data scientists.
– #16 Highest Paying Job in Demand
– 3,433 Number of Job Openings
– $105,395 Average Base Salary
– #1 Best Job in America for 2016

• Sources: 25 Best Jobs in America and 25 Highest
Paying Jobs in America for 2016

50
Data Scientists are “Sexy”

• The Harvard Business Review, a noted authority on “things
that are sexy,” has declared “Data Scientist” to be the
sexiest career of the 21st century, publishing an article
titled:
“Data Scientist: The Sexiest Job of the 21st Century”
(Thomas H. Davenport, D.J. Patil, October 2012 issue)


Source

51
Data Scientists are #1 in the US

For 2016, Glassdoor has identified the 25 Best Jobs in America
(based on highest overall Glassdoor Job Score, determined by
combining three key factors – number of job openings, salary and
career opportunities rating). Source: Glassdoor rankings

In #1 spot: Data Scientist
Job Openings (1,736) in the US
Median Base Salary ($116,840) in the US

52
Data Scientists are in Demand
• Forbes published an article, “The 10 Toughest Jobs to Fill in 2016”,
with Data Scientist in the top 10
Source

• In another article, “Where Big Data Jobs Will Be In 2015”, published
in 2014, Columbus states:
“Demand for big data expertise across a range of occupations
saw significant growth over the last twelve months”
Source

53
Job Titles for Data Scientists

54
Salary Overview - USA

“The average data scientist today earns $123,000 a year,
according to Indeed.com” (2016, USA)

Sources: “Why Data Scientists Get Paid So Much”; Data Scientist Salaries

55
Data Scientist Salary - Canada

Source
56
Salary Overview – Python

Source

57
Salary Overview - Statistician

Source

58
Salary Overview – Data Mining

Source

59
What Determines Salary?

• Experience – people with more experience get paid more
• Managerial roles – managers and directors in this field do
get paid more
• Academic achievement – more degrees = more $
• Company size – start-ups may not be able to pay top $;
however, many start-ups love to hire data scientists

60
Who is Hiring Data Scientists?
Any company that has a great deal of “Big Data” would seek out a Data
Scientist

• Banking and Finance
• Insurance
• Healthcare
• Biotechnology
• Pharmaceutical
• Retail
• Marketing
• Social Media
• Energy Sector
• Engineering
• Information Technology
• Telecommunication
• Media
• Transportation

61
Source

62
Certifications

The following website provides an extensive overview of
certifications:
www.kdnuggets.com

63
Where there is data… there is Data Science

64
Quick Poll
Would you like to be a Data Scientist?

A. Yes
B. No
C. Don’t Know
D. Still Thinking About It!

65
Summary
• Analytics have broad application in business and
science
• Data Science brings together ideas from computer
science, statistics and engineering to solve new
problems
• Business skills (formulating questions, gathering
information, building consensus) are essential to
applying data science to solving business problems

66
Module 1 – Section 7

Homework

67
Next Class
• In preparation:
– If you are reading Think Python, continue with Ch. 8 – 14
– Install Anaconda Python according to the instructions provided

• Introduction to Python
– The core syntax of Python
– Hands-on

68
Follow us on social

Join the conversation with us online:

facebook.com/uoftscs

@uoftscs

linkedin.com/company/university-of-toronto-school-of-continuing-studies

@uoftscs

69
Any questions?

70
Thank You
Thank you for choosing the University of Toronto
School of Continuing Studies

71
