Introduction To Predictive Analytics
A good student has to learn many concepts, perform in examinations, and be loyal to his/her
teacher and others.”
Why this course?
Course Prerequisites
Basic Statistics:
• Descriptive Statistics
Programming Skills:
• Basics of R & RStudio
Probability Theory:
• Introductory Probability Theory
Course Syllabus
What is Data?
“Data is the new oil. It’s valuable, but if unrefined it cannot really
be used. It has to be changed into gas, plastic, chemicals, etc to
create a valuable entity that drives profitable activity; so must data
be broken down, analyzed for it to have value.”
Example:
Role of data: Present
The World is Data Rich
Recent Trends and Buzzwords
TOPIC 1: STATISTICS
Statistics: View Points
“Statistics is the universal tool of inductive inference, research in natural and
social sciences, and technological applications. Statistics must have a clearly defined
purpose, one aspect of which is scientific advance and the other, human welfare and
national development”
- Professor P C Mahalanobis.
“All knowledge is, in final analysis, History.
All sciences are, in the abstract, Mathematics.
All judgements are, in their rationale, Statistics.”
- Professor C R Rao.
• Role of Statistics:
1 Making inference from samples
2 Development of new methods for complex data sets
3 Quantification of uncertainty and variability
• Two Views of Statistics:
1 Statistics as a Mathematical Science
2 Statistics as a Data Science
History of Statistics : 1900 - 1980
Statistics : 1980 - Present
• Data : Large bodies of data with complex data structures are generated
from computers, sensors, manufacturing industries, etc.
• Data are necessary and at the core of Statistical Learning, Data Science
& Machine Learning.
• Statistics : Has strong interactions not only with Probability but also with
other parts of Data Science (Machine Learning, Artificial Intelligence,
etc.).
Statistics : Present (2021)
Statistics: Past Vs. Present
What is the Difference?
• Data characteristics:
- Size
- Dimensionality
- Complexity
- Messy
- Secondary sources
• Computational considerations :
- Large scale and complex systems
TOPIC 2: DATA SCIENCE
What is Data Science?
Types of Data Science?
"When you’re fundraising, it’s AI.
When you’re hiring, it’s ML.
When you’re implementing, it’s Linear Regression.
When you’re debugging, it’s printf()."
- Baron Schwartz, Founder and CEO of VividCortex, 2017.
Data in Data Analytics
Basic Definitions:
Entity: A particular thing is called an entity or object.
Attribute: An attribute is a measurable or observable property of an entity.
Data: A measurement of an attribute is called data.
Note: Data define an entity, and a computer can manage all types of data (e.g.,
audio, video, text, etc.). In general, there are many types of data that can be used
to measure the properties of an entity.
Scale: A good understanding of data scales (also called scales of measurement) is
important. Depending on the scales of measurement, different techniques are
followed to derive hitherto unknown knowledge in the form of patterns, associations,
anomalies or similarities from a volume of data.
NOIR: Scales of Measurement
• The NOIR scale (Nominal, Ordinal, Interval, Ratio) is the fundamental building
block on which the extended data types are built.
• Nominal (blood groups, attendance) and ordinal (shirt size) data are
collectively referred to as categorical or qualitative data, whereas interval
(temperature in °C) and ratio (weight, sound intensity) data are collectively
referred to as quantitative or numeric data.
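As a rough illustration of why the scale matters, the sketch below picks a central-tendency summary that is meaningful for each scale: the mode for nominal data, the middle level for ordinal data, and the mean for interval or ratio data. The sample values, the level ordering `SIZE_ORDER`, and the helper `summarize` are made up for illustration.

```python
from statistics import mean, mode

# Hypothetical sample values, one list per scale of measurement.
blood_groups = ["A", "O", "B", "O", "AB", "O"]   # nominal
shirt_sizes = ["S", "M", "M", "L", "XL"]         # ordinal
temperatures_c = [21.5, 23.0, 19.5, 22.0]        # interval
weights_kg = [55.0, 62.5, 70.0, 81.5]            # ratio

SIZE_ORDER = ["S", "M", "L", "XL"]  # assumed ordering of the ordinal levels


def summarize(values, scale):
    """Return a central-tendency summary that is meaningful for the scale."""
    if scale == "nominal":
        return mode(values)                        # only counting is meaningful
    if scale == "ordinal":
        ranks = sorted(SIZE_ORDER.index(v) for v in values)
        return SIZE_ORDER[ranks[len(ranks) // 2]]  # middle level by rank
    if scale in ("interval", "ratio"):
        return mean(values)                        # differences are meaningful
    raise ValueError(f"unknown scale: {scale}")


print(summarize(blood_groups, "nominal"))
print(summarize(shirt_sizes, "ordinal"))
print(summarize(temperatures_c, "interval"))
print(summarize(weights_kg, "ratio"))
```

Averaging blood groups or shirt sizes would have no meaning, which is why the scale should be checked before a technique is chosen.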
Data Cube
2-D view of rainfall data
• In this 2-D representation, the rainfall for the “North-East” region is shown with
respect to different months over a period of years...
View of 3-D rainfall data
Figure: 2-D view of rainfall data
Figure: 3-D view of rainfall data
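A minimal sketch of the data cube idea, assuming a cube with the dimensions region, year, and month. Fixing one dimension (the region) yields the 2-D view, and summing over another (the months) rolls the data up. All rainfall numbers and the helper names are hypothetical.

```python
# Hypothetical rainfall cube: rainfall[region][year][month] in millimetres.
rainfall = {
    "North-East": {
        2019: {"Jun": 410, "Jul": 520, "Aug": 480},
        2020: {"Jun": 390, "Jul": 560, "Aug": 450},
    },
    "South": {
        2019: {"Jun": 180, "Jul": 210, "Aug": 190},
        2020: {"Jun": 200, "Jul": 230, "Aug": 170},
    },
}


def slice_region(cube, region):
    """Fix the region dimension, leaving a 2-D (year x month) table."""
    return cube[region]


def yearly_totals(table):
    """Roll up the month dimension of a 2-D table into per-year totals."""
    return {year: sum(months.values()) for year, months in table.items()}


north_east = slice_region(rainfall, "North-East")
for year, months in north_east.items():
    print(year, months)
print(yearly_totals(north_east))
```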
Data Mining Process
1. Prior Knowledge
Gaining information on
• Objective of the problem.
• Subject area of the problem.
• Data.
2. Data Preparation
Gaining information on
• Data Exploration and Data quality.
• Handling missing values and Outliers.
• Data type conversion.
• Transformation, Feature selection and Sampling.
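Two of the preparation steps listed above, handling missing values and detecting outliers, can be sketched as follows. Median imputation and the 1.5 x IQR rule are common choices assumed here for illustration, not methods prescribed by the slides. The `ages` data and the helper names are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical attribute with missing entries (None) and one extreme value.
ages = [23, 25, None, 31, 29, 27, None, 24, 95, 26]


def impute_median(values):
    """Replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


clean = impute_median(ages)
print(clean)
print(iqr_outliers(clean))
```

Whether a flagged value is an error or a genuine observation still has to be decided with knowledge of the subject area.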
3. Modeling
Figure: Splitting data into training and test data sets (right).
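The split into training and test sets can be sketched in a few lines. The 25% test fraction, the fixed seed, and the helper name `train_test_split` are assumptions made for this example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set: 20 (feature, label) pairs.
data = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is held back to estimate how well the model performs on examples it has not seen.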
Application and Knowledge
4. Application:
• Product readiness.
• Technical integration.
• Model response time.
• Remodeling.
• Assimilation.
5. Knowledge:
• Posterior knowledge.
Data exploration
Roadmap:
• Organize the data set.
• Find the central point for each attribute (central tendency).
• Understand the spread of the attributes (dispersion).
• Visualize the distribution of each attribute (shape).
• Pivot the data.
• Watch out for outliers.
• Understand the relationship between attributes.
• Visualize the relationship between attributes.
• Visualize high-dimensional data sets.
• For more details, read Kotu, V., & Deshpande, B. (2014). Predictive Analytics and
Data Mining: Concepts and Practice with RapidMiner. Morgan Kaufmann.
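The first numeric steps of the roadmap (central tendency, dispersion, and the relationship between attributes) might look like the sketch below. The two attributes and their values are hypothetical, and the Pearson correlation is used here as one possible measure of relationship.

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical data set with two numeric attributes.
height_cm = [150, 160, 165, 170, 180]
weight_kg = [52, 58, 63, 68, 79]


def pearson(xs, ys):
    """Pearson correlation coefficient between two attributes."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


for name, values in [("height_cm", height_cm), ("weight_kg", weight_kg)]:
    # Central tendency (mean, median) and dispersion (std. deviation, range).
    print(name, mean(values), median(values), round(stdev(values), 2),
          max(values) - min(values))

print(round(pearson(height_cm, weight_kg), 3))
```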
Overview of Data Science Tools
TOPIC 3: MACHINE LEARNING
What is Machine Learning?
Machine learning is the field of study that gives computers the ability to
learn without being explicitly programmed.
ML techniques are impacting our life
Introduction to Machine Learning
• Designing algorithms that ingest data and learn a model of the data.
• The learned model can be used to
• Unsupervised Learning:
- Uncover structure hidden in ‘unlabelled’ data.
- Given a network of social interactions, find communities.
- Given shopping habits for people using loyalty cards: find groups of
‘similar’ shoppers.
- Given expression measurements of 1000s of genes for 1000s of
patients, find groups of functionally similar genes.
- Goal: Hypothesis generation, visualization.
• Supervised Learning:
- A database of examples along with ‘labels’ (task-specific).
- Given expression measurements of 1000s of genes for 1000s of patients
along with an indicator of absence or presence of a specific cancer,
predict if the cancer is present for a new patient.
- Given a network of social interactions along with users' browsing habits,
predict which news users might find interesting.
- Goal: Prediction on new examples.
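The contrast between the two settings can be sketched on one small data set. The supervised part uses the labels to predict for a new example (a 1-nearest-neighbour rule), while the unsupervised part ignores the labels and looks for two groups (k-means with k = 2). Both algorithms, the 1-D values, and the labels are assumptions chosen for brevity.

```python
# Hypothetical 1-D measurements (e.g., one expression value per patient).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["absent", "absent", "absent", "present", "present", "present"]


def nearest_neighbour_predict(train_x, train_y, x):
    """Supervised: predict the label of the closest labelled example."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def two_means(xs, iterations=10):
    """Unsupervised: split unlabelled values into two groups (k-means, k=2)."""
    centres = [min(xs), max(xs)]  # simple initialisation
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
            groups[nearest].append(x)
        # Keep the old centre if a group happens to be empty.
        centres = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centres)]
    return groups


print(nearest_neighbour_predict(values, labels, 4.6))  # uses the labels
print(two_means(values))                               # ignores the labels
```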
Types of Machine Learning
• Semi-supervised Learning:
- A database of examples, only a small subset of which are labelled.
• Multi-task Learning:
- A database of examples, each of which has multiple labels
corresponding to different prediction tasks.
• Reinforcement Learning:
- An agent acting in an environment, given rewards for performing
appropriate actions, learns to maximize its reward.
A Typical Supervised Learning Workflow
A Typical Unsupervised Learning Workflow
A Typical Reinforcement Learning Workflow
Reinforcement Learning: Learning a “policy” by performing
actions and getting rewards (e.g., robot controls, beating games)
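A minimal sketch of this loop, assuming the simplest possible environment: two actions with fixed rewards. The agent follows an epsilon-greedy rule (one common strategy, chosen here for illustration) and keeps a running average of the reward of each action. The reward values, `epsilon`, and `run_agent` are hypothetical.

```python
import random

# Hypothetical environment: two actions with fixed rewards (an assumption
# made for illustration; real environments are usually stochastic).
REWARDS = [0.2, 1.0]


def run_agent(steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action,
    occasionally explore a random one, and update value estimates."""
    rng = random.Random(seed)
    estimates = [0.0, 0.0]   # estimated reward per action
    counts = [0, 0]          # how often each action was tried
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(REWARDS))   # explore
        else:
            action = max(range(len(REWARDS)),
                         key=lambda a: estimates[a])  # exploit
        reward = REWARDS[action]
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    policy = max(range(len(REWARDS)), key=lambda a: estimates[a])
    return policy, estimates, total


policy, estimates, total = run_agent()
print(policy, estimates, total)
```

Here the learned "policy" is just the index of the action with the highest estimated reward; in realistic problems the policy also depends on the state of the environment.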
Machine Learning: A Brief Timeline
Machine Learning in the real-world
Broadly applicable in many domains (e.g., internet, robotics, healthcare and
biology, computer vision, NLP, databases, computer systems, finance, etc.).
Machine Learning helps NLP
Machine Learning helps Computer Vision
ML helps Recommendation systems
• A recommendation system is a machine-learning system that is based on data
that indicate links between a set of users (e.g., people) and a set of items (e.g.,
products).
• A link between a user and a product means that the user has indicated an interest
in the product in some fashion (perhaps by purchasing that item in the past).
• The machine-learning problem is to suggest other items to a given user that he
or she may also be interested in, based on the data across all users.
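The idea in the bullets above can be sketched with a simple overlap-based scorer: items held by users who share many items with the target user score highest. This scoring rule is only one of many possible choices and is assumed here for illustration. The `links` data and the function `recommend` are hypothetical.

```python
from collections import Counter

# Hypothetical user -> items links (e.g., past purchases).
links = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"mouse", "monitor", "webcam"},
    "dave": {"novel"},
}


def recommend(user, links, top_n=2):
    """Suggest items liked by users who share items with `user`.

    Each candidate item is scored by the summed overlap between `user`
    and every other user who has that item.
    """
    own = links[user]
    scores = Counter()
    for other, items in links.items():
        if other == user:
            continue
        overlap = len(own & items)
        for item in items - own:
            scores[item] += overlap
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


print(recommend("alice", links))
```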
Machine Learning helps Chemistry
ML algorithms can understand properties of molecules and learn to synthesize new
molecules.
Figure: Inverse molecular design using machine learning: Generative models for
matter engineering (Science, 2018)
Machine Learning helps Image Recognition
ML helps Many Other Areas...
Taxonomy of Machine Learning
Cheatsheet
Development of Stat & ML Models
• Linear Regression (Galton, 1875).
• Linear Discriminant Analysis (R.A. Fisher, 1936).
• Logistic Regression (Berkson, JASA, 1944).
• k-Nearest Neighbor (Fix and Hodges, 1951).
• Parzen’s Density Estimation (E Parzen, AMS, 1962).
• ARIMA Model (Box and Jenkins, 1970).
• Classification and Regression Tree (Breiman et al., 1984).
• Artificial Neural Network (Rumelhart et al., 1985).
• MARS (Friedman, 1991, Annals of Statistics).
• SVM (Cortes and Vapnik, Machine Learning, 1995).
• Random forest (Breiman, 2001).
• Deep Convolutional Neural Nets (Krizhevsky, Sutskever, Hinton, NIPS 2012).
• Generative Adversarial Nets (Ian Goodfellow et al., NIPS 2014).
• Deep Learning (LeCun, Bengio, Hinton, Nature 2015).
• Bayesian Deep Neural Network (Yarin Gal, Islam, Zoubin Ghahramani, ICML 2017).
Developments of Neural Nets
Supervised Learning: Classification
• Example: Credit scoring.
Supervised Learning: Regression
Unsupervised Learning: Clustering
Unsupervised Learning: Dimensionality Reduction
• Dimensionality Reduction: Learn a low-dimensional representation for a given
set of high-dimensional inputs.
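As a small worked case, the sketch below reduces 2-D inputs to one coordinate by projecting them onto the direction of largest variance, which is what principal component analysis (PCA) does. PCA is assumed here as a representative technique; the slides do not name one. The sample points and helper names are hypothetical.

```python
from math import atan2, cos, sin

# Hypothetical 2-D inputs that lie close to a line.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]


def principal_axis(data):
    """Return the mean and the unit direction of largest variance (2-D PCA)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    # Angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return (mx, my), (cos(theta), sin(theta))


def project(data):
    """Map each 2-D point to one coordinate along the principal axis."""
    (mx, my), (ux, uy) = principal_axis(data)
    return [(x - mx) * ux + (y - my) * uy for x, y in data]


print([round(z, 2) for z in project(points)])
```

Each point is now described by one number instead of two, and because the points lie close to a line, little information is lost.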
Recipe for Learning
TOPIC 4: ARTIFICIAL INTELLIGENCE
A first look at Artificial Intelligence
Genesis of AI
Misconception of AI
• "In from three to eight years we will have a machine with the general
intelligence of an average human being." - Marvin Minsky (1970, Life
Magazine).
AI is not human intelligence
"What often happens is that an engineer has an idea of how the brain
works (in his opinion) and then designs a machine that behaves that way.
This new machine may in fact work very well. But, I must warn you that
it does not tell us anything about how the brain actually works, nor is
it necessary to ever really know that, in order to make a computer very
capable. It is not necessary to understand the way birds flap their wings
and how the feathers are designed in order to make a flying machine [...]
It is therefore not necessary to imitate the behavior of Nature in
detail in order to engineer a device which can in many respects
surpass Nature’s abilities."
- Richard Feynman (1999).
AI technology - Autonomous cars
AI technology - virtual assistant / chatbot
Exploring the beauty of pure mathematics
TOPIC 5: BIG DATA
Storage and Processing capacities
• Kryder’s Law is the assumption that disk drive density, also known as areal
density, will double every thirteen months. The implication of Kryder’s Law is
that as areal density improves, storage will become cheaper.
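The arithmetic behind this statement is short: with a doubling period of 13 months, density after t months is the starting density times 2^(t/13). The sketch below computes that factor. The inverse relation between cost per bit and density is a simplifying assumption made for illustration, and the function names are hypothetical.

```python
DOUBLING_MONTHS = 13  # doubling period stated above


def density_growth(months):
    """Factor by which areal density grows after `months` months."""
    return 2 ** (months / DOUBLING_MONTHS)


def relative_cost_per_bit(months):
    """Cost per bit relative to today, assuming (as a simplification)
    that cost per bit falls in inverse proportion to areal density."""
    return 1 / density_growth(months)


for years in (1, 5, 10):
    print(years, round(density_growth(12 * years), 1))
```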
Now data is Big data!
“Big data is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden
knowledge from it” - Standard definition.
Characteristics of Big data: 3Vs
Big data: V for Volume
Volume:
• The volume of data that needs to be processed is increasing rapidly.
• More computation facilities are needed.
Big data: V for Variety
Variety:
• Various formats, types, and structures.
Big data: V for Velocity
Velocity:
• Data is being generated fast and needs to be processed fast.
Big data vs. Small data
Challenges ahead. . .
Big data landscape
Still far from terminator
Emerging Research Areas
Now we are stepping into risk-sensitive areas
Shifting from Performance Driven to Risk Sensitive (Explainable AI)...
Take Home
• Statistical Thinking:
- Accepting randomness in formulating models and ideas.
- Realizing and Analyzing dependence of conclusions on assumptions.
- Measuring uncertainty in some ways without forgetting dependence
on assumptions.
Textbooks
Questions of the day
• Weather forecasting
• Mobile usage of all customers of a service provider
• Anomaly (e.g., fraud) detection in a banking organization
• Person categorization, that is, identifying a human
• Air traffic control at an airport
• Streaming data from all flying Boeing aircraft
End of Session