CDSS - Day 1
Tarun Sukhani
Tarun Sukhani is the Founder & CTO of Abundent, a Digital Transformation and Big Data Analytics company based in
Malaysia with offices in Singapore, Indonesia, and the USA. He is a business and IT professional with more than 20 years
of experience working in multinational companies in the US, Europe, Asia, South America, and the Middle East. He has
held a number of different senior development and executive positions including that of CIO/CTO, director, and board
member and is experienced in developing and managing large IT operations. As a consultant, Tarun was involved in
improving business operations in companies such as Dell, AMD, and Experian, as well as regional conglomerates such as
Indra in Asia Pacific.
Tarun has conducted training as well as facilitated seminars and workshops in Malaysia, Indonesia, Philippines, Thailand,
Singapore and various other Asia Pacific countries, focusing on project management, consulting, leadership and strategic
management, security, teamwork and other soft skills in addition to Enterprise Computing/Programming, Software
Architecture, Big Data, Data Science, and Machine Learning. He is highly energetic and has a strong passion for
developing people.
tinyurl.com/cdssmaterial
[Figure: the symbols "Cat", "Bili", and "Kucing" all refer to the same physical concept]
© Abundent Sdn. Bhd., All Rights Reserved
What is Data Representation?
Data representation is how you encode the symbols you use
[Figure: the symbols "Cat" (English), "बिल्ली" (Hindi), and "猫" (Chinese) all refer to the same physical concept]
Some Data Representations Are Better...
"By changing the representation, you can often solve seemingly intractable problems."
John Tukey
Data Science is NOT Databases
Data Science is concerned with finding patterns in large amounts of data, whereas Databases are concerned with querying large amounts of data.
A. 2.5
B. 3.5
C. 4.5
D. 5.5
Source: Gapminder
We call raw data “dirty data”, and clean data “tidy data”.
[Figure: Data → Knowledge → Wisdom]
The process of “cleansing” dirty data often involves a great deal of tedious data wrangling and correction, but tools like OpenRefine can help.
Examples include:
1. “Fixed Firm Price” and “FFP” refer to the same category and can be combined (Entity mismatch)
2. Prices can be converted to a logarithmic scale if the values span several orders of magnitude (Feature scaling)
3. Addresses and other out-of-date data need to be updated (Bit rot)
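The first two cleansing examples above can be sketched in a few lines of Python. This is a minimal illustration, not the OpenRefine workflow: the `CANONICAL` mapping, the function names, and the sample values are all hypothetical.

```python
import math

# Hypothetical mapping from label variants to one canonical category name.
CANONICAL = {
    "ffp": "Fixed Firm Price",
    "fixed firm price": "Fixed Firm Price",
}

def normalize_category(label):
    """Merge variants of the same category (entity mismatch)."""
    key = label.strip().lower()
    return CANONICAL.get(key, label.strip())

def log_scale(prices):
    """Convert positive prices to a base-10 logarithmic scale (feature scaling)."""
    return [math.log10(p) for p in prices]

# Illustrative values only.
labels = ["Fixed Firm Price", "FFP", " ffp "]
cleaned = [normalize_category(x) for x in labels]

prices = [10, 1000, 100000]
scaled = log_scale(prices)
```

After the log transform, prices that differ by factors of 100 sit at evenly spaced values, which is usually easier to plot and to feed into a model.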
Source: Ted
Source: LinkedIn
1. Fundamentals
2. Math & Statistics
3. Programming
4. Machine Learning
5. Text Mining/NLP
6. Visualization
7. Big Data
8. Data Ingestion
9. Data Munging/Wrangling
10. Toolbox
Descriptive
Predictive
Prescriptive
Experiment:
● A procedure designed to test a hypothesis as part of
the scientific method.
● The 2 key variables are the independent and
dependent variables. The independent variable is
controlled or changed to test its effects on the
dependent variable.
● 3 key types of experiments are controlled experiments,
field experiments, and natural experiments.
Representativeness:
4 Key Elements
The ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis.
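As a rough sketch of how the points on such a curve can be computed, the function below sweeps a threshold over the classifier scores and records one (FPR, TPR) pair per threshold. The function name, labels, and scores are illustrative assumptions, and the code assumes both classes are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold.

    y_true holds 1 for positive and 0 for negative examples;
    an example is predicted positive when its score >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Illustrative labels and scores only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the ROC curve, which runs from (0, 0) to (1, 1).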
Distribution:
A sample is often displayed as a plotted distribution known as a histogram.
A histogram shows how frequently the values of a variable occur within each specified range (bin).
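To make the idea of counting occurrences per range concrete, here is a minimal sketch that bins a sample into equal-width ranges. The function name, the sample, and the choice of five bins are assumptions made for illustration.

```python
def histogram(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width ranges.

    Values outside [low, high] are ignored; a value equal to `high`
    is counted in the last bin.
    """
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

# Illustrative sample only.
sample = [1, 2, 2, 3, 5, 6, 6, 6, 9, 10]
counts = histogram(sample, 0, 10, 5)  # bins: [0,2), [2,4), [4,6), [6,8), [8,10]
```

Each entry of `counts` is the height of one bar of the histogram.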
In this case there are 13 values, so the median is the middle value, at position (n+1)/2 in the sorted list:
(13+1)/2 = 7
In the second case the number of values is even, so the median is the mean of the two middle values, at positions n/2 and n/2 + 1 in the sorted list
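Both median cases can be sketched in one small function. The data sets below are hypothetical and are not the ones shown on the original slides.

```python
def median(values):
    """Return the median: the middle value for an odd count,
    or the mean of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # position (n+1)/2 when counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1

# Illustrative data only.
odd_case = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15, 18, 21]  # 13 values
even_case = [2, 4, 6, 8]                                   # 4 values

odd_median = median(odd_case)    # 7th value of the sorted list
even_median = median(even_case)  # mean of the 2nd and 3rd values
```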
There may be more than one mode, as in this case.
Consider the set
1,1,1,1,2,2,3,6,11,11,11,11,13,14,16,19
Both 1 and 11 occur four times, so the set has two modes (it is bimodal).
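A short sketch of finding every mode of the set above by counting occurrences; the function name `modes` is an assumption for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [1, 1, 1, 1, 2, 2, 3, 6, 11, 11, 11, 11, 13, 14, 16, 19]
result = modes(data)  # → [1, 11]
```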
The formula for computing the t-value and degrees of freedom for a paired t-test is:
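The slide's formula is not reproduced in this text. The standard textbook form is t = d̄ / (s_d / √n) with n − 1 degrees of freedom, where d̄ is the mean of the paired differences, s_d is their sample standard deviation, and n is the number of pairs. The sketch below computes both quantities; the function name and the before/after measurements are hypothetical.

```python
import math

def paired_t(before, after):
    """Return (t-value, degrees of freedom) for a paired t-test.

    Assumes at least two pairs and differences that are not all identical.
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (divide by n - 1).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    standard_error = math.sqrt(var_d) / math.sqrt(n)
    return mean_d / standard_error, n - 1

# Illustrative measurements only.
before = [10, 12, 9, 11]
after = [12, 14, 10, 13]
t_value, dof = paired_t(before, after)
```

The t-value is then compared against the t distribution with `dof` degrees of freedom to obtain a p-value.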