CDSS - Day 1

Uploaded by

junitasari

CDSS

Day 1
Tarun Sukhani

© Abundent Sdn. Bhd., All Rights Reserved


Trainer Profile

Tarun Sukhani is the Founder & CTO of Abundent, a Digital Transformation and Big Data Analytics company based in
Malaysia with offices in Singapore, Indonesia, and the USA. He is a business and IT professional with more than 20 years
of experience working in multinational companies in the US, Europe, Asia, South America, and the Middle East. He has
held a number of different senior development and executive positions including that of CIO/CTO, director, and board
member and is experienced in developing and managing large IT operations. As a consultant, Tarun was involved in
improving business operations in companies such as Dell, AMD, and Experian, as well as regional conglomerates such as
Indra in Asia Pacific.

Tarun has conducted training as well as facilitated seminars and workshops in Malaysia, Indonesia, Philippines, Thailand,
Singapore and various other Asia Pacific countries, focusing on project management, consulting, leadership and strategic
management, security, teamwork and other soft skills in addition to Enterprise Computing/Programming, Software
Architecture, Big Data, Data Science, and Machine Learning. He is highly energetic and has a strong passion for
developing people.



Course Exam
2-hour exam on Day 5, from 3 to 5 pm
70% to pass (35/50 questions)
● 50 questions (randomly selected from test bank)
● Multiple-Choice (some questions have multiple answers)
● You can go back to any previous questions
● Completely online and you get your results right away
● NOT OPEN BOOK!!!
● Don’t worry, we will go over 70% of the questions



Course Material

First things first…download course material:

tinyurl.com/cdssmaterial



Introduction to
Data Science



What is Data?
Data is a symbolic representation of a physical or abstract concept

Symbols: Cat, Bili, Kucing

Physical concept: [image]
What is Data Representation?
Data representation is how you encode the symbols you use

Symbols: Cat (English), बिल्ली (Hindi), 猫 (Chinese)

Physical concept: [image]
Some Data Representations Are Better...

By changing the representation, you can often solve seemingly intractable problems.

For example, how would you draw a straight line through the data for the diagram on the right?
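
The diagram for this slide is not reproduced here, so the following is only a sketch under an assumption: the data are two concentric rings, which no straight line can separate in x/y coordinates. Converting each point to polar form (radius, angle) turns the problem into a simple threshold on the radius. The ring radii and the threshold of 2 are made up for illustration.

```python
import math

# Hypothetical data: points on two concentric rings (radius 1 and radius 3).
def ring(radius, n=8):
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner, outer = ring(1.0), ring(3.0)

def to_polar(point):
    """Change of representation: (x, y) -> (radius, angle)."""
    x, y = point
    return math.hypot(x, y), math.atan2(y, x)

inner_r = [to_polar(p)[0] for p in inner]
outer_r = [to_polar(p)[0] for p in outer]

# In the polar representation the straight line "radius = 2" splits the rings.
threshold = 2.0
separable = max(inner_r) < threshold < min(outer_r)
print(separable)
```

The data did not change, only the way it is encoded, and that is what makes the straight line possible.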



What is Data Structure?
Data structure is how you store and retrieve the data. Usually, this means choosing one of the standard computer science structures:

List: includes linked lists, sets, arrays, queues, stacks

Map: includes dictionaries, key-value stores, caches

Graph: includes trees
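
As a rough illustration of the three families on this slide, here is a minimal Python sketch. The variable names and sample values are invented for the example; Python's built-in `list`, `set`, `dict` and `collections.deque` stand in for the structures named above.

```python
from collections import deque

# List family: stacks, queues and sets.
stack = [1, 2, 3]
stack.append(4)
top = stack.pop()            # last in, first out

queue = deque(["a", "b", "c"])
first = queue.popleft()      # first in, first out

unique = set([1, 1, 2, 3])   # duplicates are dropped

# Map family: look a value up by its key.
language_of = {"Cat": "English", "Kucing": "Malay"}
language = language_of["Kucing"]

# Graph family: a tree stored as an adjacency dictionary (node -> children).
tree = {"root": ["left", "right"], "left": ["leaf"], "right": [], "leaf": []}

def depth_first(graph, start):
    """Visit every node reachable from start, going deep before wide."""
    order, pending = [], [start]
    while pending:
        node = pending.pop()
        order.append(node)
        pending.extend(reversed(graph[node]))
    return order

visit_order = depth_first(tree, "root")
print(top, first, sorted(unique), language, visit_order)
```

Which structure you pick depends on how you need to retrieve the data: by position, by key, or by following relationships.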



Data Types


Data Types - Hierarchy


Data Dictionary


Data Formats
Data Formats are how you store the data in your data structures.

JSON:

    {"widget": {
        "debug": "on",
        "window": {
            "title": "Sample Konfabulator Widget",
            "name": "main_window",
            "width": 500,
            "height": 500
        },
        "image": {
            "src": "Images/Sun.png",
            "name": "sun1",
            "hOffset": 250,
            "vOffset": 250,
            "alignment": "center"
        },
        "text": {
            "data": "Click Here",
            "size": 36,
            "style": "bold",
            "name": "text1",
            "hOffset": 250,
            "vOffset": 100,
            "alignment": "center",
            "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
        }
    }}

XML:

    <widget>
        <debug>on</debug>
        <window title="Sample Konfabulator Widget">
            <name>main_window</name>
            <width>500</width>
            <height>500</height>
        </window>
        <image src="Images/Sun.png" name="sun1">
            <hOffset>250</hOffset>
            <vOffset>250</vOffset>
            <alignment>center</alignment>
        </image>
        <text data="Click Here" size="36" style="bold">
            <name>text1</name>
            <hOffset>250</hOffset>
            <vOffset>100</vOffset>
            <alignment>center</alignment>
            <onMouseUp>
                sun1.opacity = (sun1.opacity / 100) * 90;
            </onMouseUp>
        </text>
    </widget>
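
To see how the two formats differ in practice, the sketch below reads a trimmed-down version of the widget sample with Python's standard `json` and `xml.etree.ElementTree` modules. The shortened strings are an assumption made to keep the example small; they are not the full sample from the slide.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed-down copies of the widget sample (shortened for illustration).
json_text = '{"widget": {"debug": "on", "window": {"name": "main_window", "width": 500}}}'
xml_text = """
<widget>
    <debug>on</debug>
    <window title="Sample Konfabulator Widget">
        <name>main_window</name>
        <width>500</width>
    </window>
</widget>
""".strip()

# JSON maps directly onto dictionaries, and numbers keep their type.
from_json = json.loads(json_text)
json_width = from_json["widget"]["window"]["width"]

# XML gives a tree of elements; element text is always a string,
# and attributes such as title are read separately from child elements.
root = ET.fromstring(xml_text)
xml_width = int(root.find("window/width").text)
xml_title = root.find("window").get("title")

print(json_width, xml_width, xml_title)
```

The same data comes out of both formats, but JSON keeps the number as a number while XML hands back text that you convert yourself.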


What is Data Science?


Data Science Core Components


Data Architecture


Data Science Tools


Data Science History

[Timeline figure: John Tukey; milestones at 1935, 1939, 1958, 1977, 1989, 1996, 1997, 2007, 2009 and 2010]
Data Science is NOT Databases
Data Science is concerned with finding patterns in large amounts of data, whereas Databases are concerned with querying large amounts of data.


Data Science is NOT Business Intelligence

Business Intelligence: Querying the Past

Data Science: Querying the Future


What is Hard about Data Science?


Overcoming Assumptions - Video
How many babies do women have on
average in Bangladesh?

A. 2.5
B. 3.5
C. 4.5
D. 5.5

Source: Gapminder



Making Ad Hoc Explanations - Video
Why was the financial crisis of 2008 not predicted by financial analysts?

Domain Expertise and Programming Skills are insufficient to explain reality.

Nobel laureate George Akerlof on excessive specialization in economics and lack of statistical foundation in finance.


Overgeneralizing
Does Atmospheric CO2 cause
temperature increase?

“Correlation does not equal causation”

Source: An Inconvenient Truth
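
A small sketch of why correlation alone cannot settle a causal question. The numbers are hypothetical: a hidden variable (daily temperature) drives two other series, which then correlate perfectly even though neither causes the other. The `pearson` helper is written out by hand so the example needs only the standard library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical data: temperature is the hidden common cause.
temperature = [20, 22, 25, 27, 30, 33, 35]
ice_cream_sales = [2 * t + 5 for t in temperature]
sunburn_cases = [3 * t - 10 for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 3))
```

A strong r here says the two series move together. It says nothing about ice cream causing sunburn.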



Lab Activity - Finding Relevant Strong Correlations
File: Country Data.R
Dataset: Country Data.csv

Column types, left to right:
Categorical (Nominal); Numerical (Continuous); Numerical (Continuous); Numerical (Discrete); Numerical (Discrete); Numerical (Continuous); Numerical (Continuous)


Data Science
Workflow


Data Science Workflow Diagram


Data Gathering and Ingestion
Data gathering is extracting data from the world. Raw data gathered this way must
be cleaned in order to become usable.

We call raw data “dirty data”, and clean data “tidy data”.

We normally gather data as part of an experiment, in order to test a hypothesis.

The goal is to move from data to wisdom:

Data → Information → Knowledge → Wisdom

What? How to? Why? What is best?

Fit for Purpose, Fit for Use
Data Extraction, Preparation and Cleansing
This stage of the data science workflow uses data warehousing techniques to extract and prepare the raw data for later processing.

The process of “cleansing” dirty data often involves a great deal of tedious data wrangling and correction, but tools like OpenRefine can help.

Examples include:

1. Fixed Firm Price and FFP belong to the same category and can be combined (Entity mismatch)
2. The range of prices can be converted to a logarithmic scale if the values span several orders of magnitude (Feature scaling)
3. Addresses and other out-of-date data need to be updated (Bit rot)
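
The first two examples can be sketched in a few lines of Python. The records, the `ALIASES` table and the field names below are hypothetical; a real project would more likely use OpenRefine or a data-frame library for the same steps.

```python
import math

# Hypothetical raw records with inconsistent labels and widely spread prices.
raw = [
    {"contract_type": "Fixed Firm Price", "price": 1_000},
    {"contract_type": "FFP", "price": 1_000_000},
    {"contract_type": "ffp ", "price": 100},
]

# Entity mismatch: map every known spelling to one canonical label.
ALIASES = {"ffp": "Fixed Firm Price", "fixed firm price": "Fixed Firm Price"}

def clean(record):
    key = record["contract_type"].strip().lower()
    return {
        "contract_type": ALIASES.get(key, record["contract_type"].strip()),
        # Feature scaling: a base-10 log compresses the range of prices.
        "log10_price": math.log10(record["price"]),
    }

tidy = [clean(r) for r in raw]
for row in tidy:
    print(row)
```

After cleaning, all three rows share one category and the prices sit on a scale from 2 to 6 instead of from 100 to 1,000,000.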



Data Analysis - Descriptive, Predictive and Prescriptive


Data Visualization
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.


Data Visualization
Visualizing biology

Source: TED


Data Visualization
GDP By Country Over Time

Source: LinkedIn


Model Deployment


Lab Activity - Building Machine Learning Models
File: Iris.ipynb
Dataset: Iris.csv

Column types, left to right:
Numerical (Continuous); Numerical (Continuous); Numerical (Continuous); Numerical (Continuous); Numerical (Continuous); Categorical (Nominal)


Life of A Data Scientist


What is a Data Scientist?


Data Scientist Roles


What Does a Good Data Scientist Look Like?


T-Shaped Skillset


Data Scientist Roadmap

1. Fundamentals
2. Math & Statistics
3. Programming
4. Machine Learning
5. Text Mining/NLP
6. Visualization
7. Big Data
8. Data Ingestion
9. Data Munging/Wrangling
10. Toolbox

Data Scientist Education Framework


Thinking Like A Data Scientist


Knowns and Unknowns


DIKUW


Demand and Opportunity


Labor Market


Famous Data Scientists


Applications of Data Science


Applications of Data Science - Media Industry


Applications of Data Science - Retail Industry


Applications of Data Science - Medical Industry


Applications of Data Science - Banking Industry


Data Science Principles


Data-Driven Organization


Developing Data Products


Data Products

Descriptive

Predictive

Prescriptive

Data Gathering


Data and Data Collection: 2 types of data

Qualitative: Words, images, observations, conversations, photographs

Quantitative: Numbers, tests, counting, measuring


Data Collection Techniques


Observations vs. Simulations
Freeman Dyson on the importance of gathering data through observations.

Computer simulations should not replace scientific data gathering.


Quantitative Methods

Experiment:
● A procedure designed to test a hypothesis as part of
the scientific method.
● The 2 key variables are the independent and
dependent variables. The independent variable is
controlled or changed to test its effects on the
dependent variable.
● 3 key types of experiments are controlled experiments,
field experiments, and natural experiments.



Quantitative Methods

Dependent Variable: The outcome variable under consideration in the study; the quantity we want to predict or explain (y).

Independent Variables: The variables affecting the dependent variable (x1, x2, …, xn).

y = f(x1, x2, …, xn)

Which is which here?
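
To make y = f(x1, x2, …, xn) concrete, here is a minimal sketch with a single independent variable. The data (hours studied and exam score) are invented, and a least-squares straight line is only one possible choice for f.

```python
# Hypothetical data: x is controlled by the experimenter, y is measured.
hours_studied = [1, 2, 3, 4, 5]       # independent variable (x)
exam_score = [52, 54, 56, 58, 60]     # dependent variable (y)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(hours_studied, exam_score)

def f(x):
    """The fitted relationship between the two variables."""
    return intercept + slope * x

print(slope, intercept, f(6))
```

The variable you change is x; the variable you read off and predict is y.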



Key Factors for High Quality Experimental Design

Data should not be contaminated by poor measurement or errors in procedure.

Eliminate confounding or hidden variables from the study, or minimize their effects on other variables (control group).

Representativeness:

Does your sample represent the population you are studying?

You must use random sampling techniques.


What Makes a Good Quantitative Research Design?

4 Key Elements

Freedom from Bias

Freedom from Confounding

Control of Extraneous Variables

Statistical Precision to Test Hypothesis - t-test, chi-square, ANOVA, etc.


What To Avoid in Experimental Design?

Bias: When observations favor some individuals in the population over others.

Confounding: A variable that influences both the dependent variable and the independent variable, causing a spurious association. Confounding is a causal concept and, as such, cannot be described in terms of correlations or associations.

Extraneous Variables: Undesirable variables that influence the relationship between the variables that an experimenter is examining. Another way to think of this is that these are variables that influence the outcome of an experiment, though they are not the variables that are actually of interest (Customer ID, etc.).


Precision versus Accuracy

“Precise” means sharply defined or measured.

“Accurate” means truthful or correct.
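
One way to picture the difference is with repeated measurements of a known quantity. The readings below are hypothetical: two scales weigh a 100 g reference weight four times each. Using the mean error as a stand-in for accuracy and the standard deviation as a stand-in for precision is a simplification chosen for this sketch.

```python
import statistics

true_value = 100.0  # hypothetical reference weight, in grams

scale_a = [104.9, 105.0, 105.1, 105.0]  # tightly grouped but off target
scale_b = [97.0, 103.0, 99.0, 101.0]    # centered on target but scattered

def bias(readings):
    """Distance of the average reading from the truth (accuracy)."""
    return statistics.mean(readings) - true_value

def spread(readings):
    """Sample standard deviation of the readings (precision)."""
    return statistics.stdev(readings)

for name, readings in (("A", scale_a), ("B", scale_b)):
    print(name, round(bias(readings), 2), round(spread(readings), 2))
```

Scale A is precise but not accurate; scale B is accurate on average but not precise.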



Confusion Matrix
What is a Confusion Matrix and why do you need it?

It is a performance measurement for machine learning classification problems where the output can be two or more classes. For 2 classes, it is a table with the 4 different combinations of predicted and actual values.

It is extremely useful for measuring Recall, Precision, Specificity and Accuracy, and it underlies the AUC-ROC curve, which is built from the confusion matrix at each threshold.


AUC-ROC
AUC (Area Under the Curve)

ROC (Receiver Operating Characteristic Curve)

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without it.



AUC-ROC plot

The ROC curve is plotted with the TPR (true positive rate) against the FPR (false positive rate), where TPR is on the y-axis and FPR is on the x-axis.
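
A hand-rolled sketch of how the ROC points and the AUC can be computed, using TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at each threshold. The labels and scores are hypothetical, and the trapezoid rule is one common way to get the area. In practice a library routine would normally be used instead.

```python
# Hypothetical true labels and model scores (probability of class 1).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def roc_points(labels, scores):
    """One (FPR, TPR) point per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

points = roc_points(labels, scores)
area = auc(points)
print(points, area)
```

An area of 0.5 corresponds to guessing at random, and 1.0 to perfect separation of the two classes.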



Let’s understand TN, TP, FN, FP in the Confusion Matrix

True Negative (TN): Case was Negative and predicted Negative

True Positive (TP): Case was Positive and predicted Positive

False Negative (FN): Case was Positive but predicted Negative

False Positive (FP): Case was Negative but predicted Positive
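
The four counts and the metrics derived from them can be worked out directly. The `actual` and `predicted` lists below are made-up labels (1 = Positive, 0 = Negative) for ten cases.

```python
# Hypothetical labels for ten cases: 1 = Positive, 0 = Negative.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # Positive, predicted Positive
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # Positive, predicted Negative
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # Negative, predicted Negative
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # Negative, predicted Positive

recall = tp / (tp + fn)             # share of actual positives that were found
precision = tp / (tp + fp)          # share of predicted positives that were right
specificity = tn / (tn + fp)        # share of actual negatives that were found
accuracy = (tp + tn) / len(actual)  # share of all cases predicted correctly

print(tp, fn, tn, fp)
print(recall, precision, round(specificity, 3), accuracy)
```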



Interpreting Result of Experiments

The goal of research is to draw conclusions.

What did the study mean?

What, if any, is the cause and effect of the outcome?


Introduction to Sampling

Sampling is the problem of accurately acquiring the necessary data in order to form a representative view of the problem.

This is much more difficult to do than is generally realized.


Overall Methodology:
❖ State the objectives of the experiment
❖ Define the target population
❖ Define the data to be collected
❖ Define the variables to be determined
❖ Define the required precision and accuracy
❖ Define the measurement ‘instrument’
❖ Define the sample size & sampling method, then select the sample.
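
The last step, selecting the sample, can be sketched with Python's standard `random` module. The population of 1,000 customer IDs and the sample size of 50 are assumptions for illustration, and simple random sampling is only one of several sampling methods.

```python
import random

# Hypothetical target population: customer IDs 1 to 1000.
population = list(range(1, 1001))

# A fixed seed makes the draw repeatable, which helps when auditing results.
rng = random.Random(42)

# Simple random sample without replacement: every member is equally likely.
sample = rng.sample(population, k=50)

print(len(sample), min(sample), max(sample))
```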



Sampling

Distribution:
When you form a sample, you often show it as a plotted distribution known as a histogram.

A Histogram:
The distribution of the frequency of occurrence of a variable's values within specified ranges (bins).
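
A text-only sketch of the idea: count how many values fall into each range, then draw one mark per count. The values and the bin width of 5 are arbitrary choices for the example.

```python
from collections import Counter

# Hypothetical sample values and an arbitrary bin width.
values = [1, 1, 2, 2, 3, 6, 7, 11, 11, 13, 14, 16, 19]
bin_width = 5

def bin_label(value):
    """Name of the range that a value falls into, e.g. 7 -> '5-9'."""
    low = (value // bin_width) * bin_width
    return f"{low}-{low + bin_width - 1}"

histogram = Counter(bin_label(v) for v in values)

# Print the bins in numeric order with one '#' per occurrence.
for label in sorted(histogram, key=lambda s: int(s.split("-")[0])):
    print(f"{label:>5} {'#' * histogram[label]}")
```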



Interpreting quantitative findings

Descriptive Statistics: Mean, Median, Mode, Frequencies



Mean

● In science the term mean is really the arithmetic mean

● Given by the equation

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

where x_1, …, x_n are the observed values and n is the number of values.


Median
Consider the set
1,1,2,2,3,6,7,11,11,13,14,16,19

In this case there are 13 values, so the median is the middle value, at position (n+1)/2:

(13+1)/2 = 7, and the 7th value is 7

Consider the set
1,1,2,2,3,6,7,11,11,13,14,16

In the second case there are 12 values, so the median is the mean of the two middle values, on either side of position (n+1)/2:

(12+1)/2 = 6.5, so the median is the mean of the 6th and 7th values: (6+7)/2 = 6.5


Mode
The most frequent value in a data set.
Consider the set
1,1,1,1,2,2,3,6,11,11,11,13,14,16,19

In this case the mode is 1 because it is the most common value.

There may be more than one mode, as in this case.
Consider the set
1,1,1,1,2,2,3,6,11,11,11,11,13,14,16,19

In this case there are two modes (bimodal): 1 and 11, because both occur 4 times in the data set.
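
The mean, median and mode examples from these slides can be checked with Python's standard `statistics` module. The variable names are invented; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

odd_set = [1, 1, 2, 2, 3, 6, 7, 11, 11, 13, 14, 16, 19]     # 13 values
even_set = [1, 1, 2, 2, 3, 6, 7, 11, 11, 13, 14, 16]        # 12 values
bimodal_set = [1, 1, 1, 1, 2, 2, 3, 6, 11, 11, 11, 11, 13, 14, 16, 19]

mean_value = statistics.mean(odd_set)         # sum of values / number of values
median_odd = statistics.median(odd_set)       # the single middle value
median_even = statistics.median(even_set)     # mean of the two middle values
modes = statistics.multimode(bimodal_set)     # every value tied for most frequent

print(round(mean_value, 2), median_odd, median_even, modes)
```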



T-test
The test statistic that a t-test produces is a t-value. Conceptually, t-values are an extension of z-scores. In a way, the t-value represents how many standard units apart the means of the two groups are.

The formula for computing the t-value and degrees of freedom for paired T test is:

mean1 and mean2 are the average values of each of the sample sets, while var1 and var2 represent the variance of each of the sample sets.
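
The formula itself did not survive in this text, so the sketch below is an assumption rather than the slide's exact equation. It shows two standard t-statistics: one built from mean1, mean2, var1 and var2 (the unequal-variance, or Welch, form for two independent samples), and the paired form, which works on the differences between matched observations and has n - 1 degrees of freedom. All data are hypothetical.

```python
import math
import statistics

def welch_t(sample1, sample2):
    """t-value for two independent samples, using each sample's variance."""
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

def paired_t(before, after):
    """t-value for matched pairs, computed on the pairwise differences."""
    diffs = [a - b for a, b in zip(after, before)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Hypothetical measurements.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]
t_welch = welch_t(group_a, group_b)

before = [1, 2, 3, 4, 5]
after = [2, 4, 5, 7, 9]
t_paired = paired_t(before, after)

print(round(t_welch, 3), round(t_paired, 3))
```

The larger the absolute t-value, the further apart the two means are relative to the variability in the data.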



Chi-square
In a chi-square test, we test whether or not there is a difference in the rates of outcomes on a nominal scale (like sex, eye color, first name, etc.). The test statistic of a chi-square test is χ² and can range from 0 to infinity. The null hypothesis of a chi-square test is that there is no difference between the groups; χ² = 0 means the observed counts match the expected counts exactly.
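
A minimal sketch of the χ² statistic for a table of counts, computed as the sum of (observed - expected)² / expected over every cell. The 2x2 table is hypothetical, and no continuity correction is applied.

```python
# Hypothetical 2x2 table of counts: rows are groups, columns are outcomes.
observed = [[30, 20],
            [20, 30]]

def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if the outcome rate were the same in every group.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic

chi2 = chi_square(observed)
no_difference = chi_square([[25, 25], [25, 25]])
print(chi2, no_difference)
```

The further χ² is from 0, the harder it is to believe the groups share the same rates.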



ANOVA

ANOVA stands for “Analysis of variance.” At first glance, this sounds like a strange name to give to a test that you use to find differences in means, not differences in variances. However, ANOVA actually uses variances to determine whether or not there are ‘real’ differences in the means of groups.
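
The way variances reveal differences in means can be sketched with a one-way ANOVA F-statistic: the variance between the group means divided by the variance within the groups. The three groups below are hypothetical, and the function name is invented for this example.

```python
import statistics

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # How far each group's mean sits from the overall mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # How far each value sits from its own group's mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)
print(round(f_statistic, 3))
```

A large F means the group means differ by more than the scatter inside the groups would explain.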

T-test vs ANOVA


Lab Activity - Predicting Housing Prices
File: Housing Prices.ipynb
Dataset: Housing Prices.csv



Q&A
