Introduction to Data Science
Lecture - 1
Sumita Narang
Objectives
• Fundamentals of Data Science
• Real World applications
• Data Science vs BI
• Data Science vs Statistics
• Roles and responsibilities of a Data Scientist
• Software Engineering for Data Science
• Data Scientists Toolbox
• Data Science Challenges
What is Data?
• Data are raw facts and figures that on their own have no meaning
• These can be any alphanumeric characters, i.e. text, numbers, symbols
Note the “are” bit above? Why?
The Latin word data is the plural of datum, "(thing) given," neuter past
participle of dare "to give". Data may be used as a plural noun in this sense,
with some writers—usually scientific writers—in the 20th century
using datum in the singular and data for plural.
Types of Data
Traditionally, the data that we had was mostly structured and small in size, and could be analyzed using simple BI tools.
Unlike data in the traditional systems which was mostly
structured, today most of the data is unstructured or semi-
structured.
Information Ladder
The information ladder was created by education professor
Norman Longworth to describe the stages in human learning.
According to the ladder, a learner moves through the following
progression to construct “wisdom” from “data”.
Data → Information → Knowledge → Understanding → Insight → Wisdom
Data Science : Definition
Data science refers to areas of work concerned with various fields related to large collections of data –
Data Acquisition / Collection
Data Management, preservation and archival
Data Preparation
Data Modelling
Data Analytics & Visualization
[Figure: Data Science cycle – Data Acquisition, Data Preparation, Data Modelling, Data Management/Storage, Data Analysis/Visualization – shown as a prerequisite for Artificial Intelligence]
Evolution of Data Storage &
Processing
EDP Era – Electronic Data Processing
- Expensive storage on local physical servers
- Local storage of data; limited access to data
- Low processing power

Cloud Era – Cloud Data Processing
- Cloud storage of data; free/easy access to data
- Cheap storage on the Cloud
- Extensive Cloud (GPU) processing power
NFV – Network Function Virtualization
• Network functions (routing, load balancing, firewalls) packaged on virtual machines (VMs) – leads to statistical multiplexing of HW resources
• Considerably reduced cost, better utilization of HW

SDN – Software Defined Networking
• The physical separation of the network control plane from the forwarding plane, where one control plane controls several devices
• Proprietary control plane separated from generic forwarding plane – eliminates the need for specialized HW -> further cost reduction

IoT – Internet of Things
• Each device connected to the internet, collecting data, transferring data to central Cloud-based storage
• Abundance of data, easily accessible data
Applications/Examples
Travelling Salesman Problem (TSP),
Vehicle Routing Problem (VRP)
Applications/Examples
• Smart grids in an IoT network predict the shelf-life of transformers
• Smart meters in an IoT network predict the shelf-life of meters
• Prediction of HW faults in the network
Applications/Examples
Banking fraud detection
• Loan frauds/defaults
• Credit card frauds/defaults
• Insurance premium prediction
Applications/Examples
Preventive maintenance
• Prediction of faults in the network
• Geo-location-wise workforce management for fault support
• Prediction of preventive maintenance life-cycle for network devices
• Field Support Org cost reduction by improving First Time Right (FTR), reducing site visits
Applications/Examples
Google Maps
• Free community using Google APIs for geo-location, distance APIs etc. – a source of data for Google Maps
Common Applications of Data
Science
1. Fraud & Risk Detection
2. Healthcare – Medical image analysis, genetics & genomics, drug
development etc.
3. Internet Search – Knowledge Graphs (Semantic Networks) based on the Resource Description Framework (RDF)
4. Targeted Advertising – Facebook friends referral ads, Amazon past
purchase based products ads, Google search based ads
5. Image Recognition – unlocking phone based on face; App which
identifies object you point to; identifying celebrities in pictures;
Invoice processing etc.
6. Speech recognition – voice recorders, human-speech conversation bots – “OK Google!”
7. Airline routing – Auto-pilot
8. Natural Language processing- document translation , chat bots
9. Augmented Reality – VR games, simulators/ training centers etc.
Motivational Examples of Data
Science and AI
• New AI Algorithm Can Now Guess What You Look Like Based Just On Your Voice
https://www.analyticsindiamag.com/new-ai-algorithm-can-now-guess-what-you-look-
like-based-just-on-your-voice/
• How Machine Learning Can Eliminate The Emotional Aspect Of Investing Money
https://www.analyticsindiamag.com/how-machine-learning-can-eliminate-the-
emotional-aspect-of-investing-money/
• How artificial intelligence changed the face of banking in India
https://yourstory.com/2019/05/how-artificial-intelligence-changed-banking-sector
• Machine Learning may help capture fusion energy on Earth
https://odishatv.in/science/machine-learning-may-help-capture-fusion-energy-on-
earth-371697
• How artificial intelligence can slowly change the healthcare landscape
https://yourstory.com/2019/05/how-artificial-intelligence-change-the-healthcare
• Google's AI system better than humans at spotting lung cancer: Study https://www.business-standard.com/article/pti-
stories/google-s-ai-system-better-than-humans-at-spotting-lung-cancer-study-119052100509_1.html
• Machine learning overtakes humans in predicting death or heart attack https://medicalxpress.com/news/2019-05-machine-
humans-death-heart.html
• Machine learning could predict death or heart attack with over 90% accuracy: Study
https://www.timesnownews.com/health/article/machine-learning-could-predict-death-or-heart-attack-with-over-90-accuracy-
study/417593
• Why machine learning could be the key to the future of gambling
https://www.techradar.com/news/why-machine-learning-could-be-the-key-to-the-
future-of-gambling
Data Science in Various Domains
Evolution of Data Science to
Artificial Intelligence
[Figure: Data Science shown in relation to Artificial Intelligence, Machine Learning and Deep Learning]
Machine Learning vs Data
Science?
Machine learning and statistics are both part of data science. The
word learning in machine learning means that the algorithms depend on some
data, used as a training set, to fine-tune some model or algorithm parameters.
This encompasses many techniques such as regression, naive Bayes or
supervised clustering. But not all techniques fit in this category.
For instance, unsupervised clustering - a statistical and data science technique -
aims at detecting clusters and cluster structures without any a priori knowledge or
training set to help the classification algorithm. A human being is needed to label
the clusters found. Some techniques are hybrid, such as semi-supervised
classification. Some pattern detection or density estimation techniques fit in this
category.
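A minimal Python sketch of the contrast described above, assuming scikit-learn and NumPy are available and using synthetic data (this is illustrative, not part of the lecture's own material):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Supervised: a training set of (X, y) pairs is used to fine-tune model parameters.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y)
print("learned coefficients:", reg.coef_)

# Unsupervised: no labels; the algorithm looks for cluster structure on its own,
# and a human still has to interpret/label the clusters it finds.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```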
Machine Learning vs Data
Science?
Data science is much more than machine learning though. Data, in data science,
may or may not come from a machine or mechanical process (survey data could
be manually collected, clinical trials involve a specific type of small data) and it
might have nothing to do with learning as I have just discussed. But the main
difference is the fact that data science covers the whole spectrum of data
processing, not just the algorithmic or statistical aspects. In particular, data
science also covers
• data integration
• distributed architecture
• automating machine learning
• data visualization
• dashboards and BI
• data engineering
• deployment in production mode
• automated, data-driven decisions
Data Science Vs. Statistics
The fields differ in their modeling processes, the size of their
data, the types of problems studied, the background of
the people in the field, and the language used. However,
the fields are closely related. Ultimately, both statistics and
data science aim to extract knowledge from data.
Reference Links –
• https://www.educba.com/data-science-vs-statistics/
• https://www.displayr.com/statistics-vs-data-science-whats-the-
difference/
• https://link.springer.com/article/10.1007/s42081-018-0009-3
Data Science Vs. Statistics
1# Meaning
Data Science: Data science is the business of learning from data, which is traditionally the business of statistics. Data science is often understood as a broader, task-driven and computationally-oriented version of statistics, and is an inter-disciplinary area of scientific techniques.
Statistics: Statistics is a branch of mathematics which also aims to provide methods for data representation, analysis & further evaluation.

2# Concept of models
Data Science: The modelling process is done by comparing the predictive accuracy of different machine learning methods and choosing the model which is most accurate – data science focuses on comparing many methods to create the best machine learning model.
Statistics: Statisticians take a different approach to building and testing their models. The starting point is usually a simple model (e.g., linear regression), and the data is checked to see if it is consistent with the assumptions of that model; the model is improved by addressing any assumptions that are violated.

3# Big Data
Data Science: Data scientists often deal with huge databases – so big that they cannot be stored on a single computer.
Statistics: Historically, the focus of statistics has been much more on what can be learned from very small quantities of data.

4# Types of problems
Data Science: Problems often relate to making predictions and optimizing search over large databases. The end-goal of a data science analysis is more often to do with a specific database or predictive model.
Statistics: In contrast, the problems studied by statistics are more often focused on drawing conclusions about the world at large. The end-goal of statistical analysis is often to draw a conclusion about what causes what, based on the quantification of uncertainty.

5# Skillset & Language
Data Science: Data scientists tend to come from engineering backgrounds. Language used – machine learning, feature engineering, supervised learning, labeling, one-hot encoding.
Statistics: Statisticians are usually trained by math departments. Language used – estimating, classification; covariate/predictor/independent variable; response/output/dependent variable; dummy variable/indicator coding.
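A hedged sketch of row 2# above: a data-science style comparison of several models by predictive accuracy versus a statistics style single linear model whose assumptions are inspected. It assumes scikit-learn and statsmodels are installed and uses synthetic data only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

# Data-science approach: compare methods, keep the most accurate one.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(type(model).__name__, round(score, 3))

# Statistics approach: start from a simple model and check it against its
# assumptions (coefficients, p-values, residual diagnostics in the summary).
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())
```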
Data Science Vs. Business Intelligence (BI)
[Comparison figure: Data Science vs Business Intelligence – not reproduced in text]
Data Science Vs. BI
References –
• https://www.dataversity.net/data-science-vs-business-
intelligence/#
• https://towardsdatascience.com/data-science-vs-
business-intelligence-same-but-completely-different-
1d5900c9cc95
• https://data-flair.training/blogs/business-intelligence-vs-
data-science/
• https://analyticsindiamag.com/business-intelligence-
different-data-science/
Roles of Data Scientist
The key role of a Data Scientist is to define the data architecture, i.e. the rules, policies or standards that govern which data is collected, stored, arranged, integrated, and put to use in data systems.
1. Data Collection techniques – Interviews,
Questionnaire/surveys, Observations/Transactions,
Documents/records, Case studies
2. Data Management – DBMS -> Relational DBMS, NoSQL ( In
memory, Columnar, key-value based, document based etc.)
3. Data Preparation – cleaning, organizing, combining, pruning, missing-value handling and labelling data for visualization and for applying machine learning
4. Data Analytics – this stream involves automating insights into a given data set and visualizing the results; it uses queries & data aggregation procedures (a small sketch of steps 3 and 4 follows below)
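A minimal pandas sketch of steps 3 and 4 above; the column names and values are invented for illustration, not taken from any real data set.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "region": ["N", "S", "N", None, "S"],
    "usage_gb": [10.2, np.nan, 8.1, 5.0, 7.3],
    "churned": [0, 1, 0, 0, 1],
})

# Data preparation: handle missing values and label/encode categorical data.
df["region"] = df["region"].fillna("unknown")
df["usage_gb"] = df["usage_gb"].fillna(df["usage_gb"].median())
df["region_code"] = df["region"].astype("category").cat.codes

# Data analytics: a simple query/aggregation to automate an insight.
print(df.groupby("region")["churned"].mean())
```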
Skills Required – Data Scientist
Software Engineering for Data
Science 1/
• For data scientists, software is the generalization of a specific aspect of a data
analysis. Software allows for the systematizing and the standardizing of a procedure,
so that different people can use it and understand what it's doing, at any given time.
• Specific parts of a data analysis may require implementing or applying a number of procedures or tools together.
• Software encompasses all of these tools in a specific module or procedure that can be repeatedly applied in a variety of settings.
• Software formalizes and abstracts the functionality of a set of procedures or tools, by
developing a well defined interface to the analysis. So the software will have an
interface, or a set of inputs and a set of outputs that are well understood.
• So for example, most statistical packages will have a linear regression, a function which has a very well defined interface
• There are basically three levels of software (see the sketch after this list) –
1. 1st level - Code that encapsulates the automation of a set of procedures: for example, a loop of some sort that repeats an operation multiple times.
2. 2nd level- Writing a function - which is used to encapsulate a set of instructions. And the key thing about a function is that you'll
have to define some sort of interface, which will be the inputs to the function. And then the function may have a set of outputs or it
may have some side effect for example, if it's a plotting function. Data Cleaning function, Missing Value handler etc.
3. Highest level is the actual software package - Which will often be a collection of functions or functions in other things. That will be
a little bit more formal because there'll be a very specific interface or API that a user has to understand. And often for a software
package there'll be a level of convenience features for users, like documentation, or examples, or tutorials that may come with it,
to help the user, to kind of apply the software to many different settings
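A small, self-contained Python illustration of the first two levels described above (the data and helper name are made up); level 3 would bundle such functions into a documented package with a stable API.

```python
import pandas as pd

monthly_tables = [                      # stand-in for files that would be read from disk
    pd.DataFrame({"sales": [10, None, 12]}),
    pd.DataFrame({"sales": [8, 9, None]}),
]

# Level 1: ad-hoc code automating a repeated operation with a loop.
cleaned = []
for table in monthly_tables:
    cleaned.append(table.fillna(0))
level1_result = pd.concat(cleaned, ignore_index=True)

# Level 2: a function encapsulating the same steps behind a defined interface
# (inputs: a list of DataFrames and a fill value; output: one cleaned DataFrame).
def prepare(tables, na_value=0):
    return pd.concat([t.fillna(na_value) for t in tables], ignore_index=True)

level2_result = prepare(monthly_tables)
print(level1_result.equals(level2_result))   # True: same behaviour, clearer interface
```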
References –
https://towardsdatascience.com/data-science-is-becoming-software-engineering-53e31314939a
https://www.coursera.org/lecture/data-science-course/what-is-software-engineering-for-data-science-cmnBH
https://careerkarma.com/blog/data-science-vs-software-engineering/
Software Engineering for Data
Science 2/
• Software Engineering was historically considered a distinct, highly technical field compared to Data Science; while both draw on similar skill sets, there are big differences in the ways in which these skills are usually applied.
• Differ in Methodologies – SDLC, Data Science Project Process (will study in session
2/3)
• Differ in Objectives - Data science tends to be much more about analysis of data in
practice to solve a business objective, with some aspects of programming and
development thrown in. Software engineering, on the other hand, tends to focus on
creating systems and software that is user-friendly and that serves a specific purpose.
• Differ in Approaches - Data science is a very process-oriented field. Its practitioners
ingest and analyze data sets in order to better understand a problem and arrive at a
solution. Software engineering, on the other hand, is more likely to approach tasks
with existing frameworks and methodologies.
• Differ in Tools - A data scientist’s wheelhouse contains tools for data analytics, data visualization, working with databases, machine learning, and predictive modeling – for data storage, Amazon S3, MongoDB, Hadoop, MySQL, PostgreSQL, or something similar. For model building, there’s a good chance they’ll be working with Statsmodels or Scikit-learn, while distributed processing of big data requires Apache Spark. A software engineer utilizes tools for software design and analysis, software testing, programming languages, web application tools, and much more.
Software Engineering for Data
Science 3/
Data science is becoming software engineering
• In data science over the last few years, perhaps the most significant and
hardest to ignore has been the increased focus on deployment and
productionization of models.
• The skill sets of software engineers and data scientists are converging, at least when it comes to product-
facing data science applications, like building recommender systems. Data scientists are being asked to take
care of deployment and productionization, and software engineers are being asked to expand their skill set
to include modeling.
• Historically, data science and software engineering weren’t nearly as closely
integrated as they are today.
• Feature engineering, which was done manually by data scientists in classical machine learning approaches, is now being done automatically by deep learning models coded in software (TensorFlow, PyTorch libraries)
• One aspect of data science that’s often over-emphasized is model tuning. It’s very rare that your focus as a
data scientist will be on making a model 1% better; typically it’s much more important to get a “good enough”
model out the door and in front of users, which is why software engineering and deployment skills are
increasingly growing in importance over model tuning.
• 2 Types of Data Science Projects
• Building Data science process libraries, models, frameworks, creating algorithms
• Building Services & products that use Data science libraries& solve a business problem
Skills Venn diagram
Data Scientist Toolbox
1. Programming Languages – R, Python, SQL, MATLAB
2. Data Science Libraries/Frameworks – TensorFlow, PyTorch, Scikit-learn, Matplotlib
3. Data Handling Tools – RDBMS (MySQL, PostgreSQL, Microsoft SQL Server, Oracle DB), NoSQL DBs (MongoDB, Cassandra, HBase), Big Data Technologies (Hadoop, Spark, Hive)
4. Analytics Modelling Tools – SAS, SPSS, Cloud-Based AI/ML Interfaces (AWS, GCP, Azure)
Reference
https://www.coursera.org/learn/data-scientists-tools
Ethics for Data Science
Ethics for different stages in Data Science projects –
1. Project Selection & Scoping - Is data science the right tool for the
job/problem being solved?
– Pursuing a data science project can mean pulling resources away from other initiatives that could have a
greater impact. It’s important to recognize when this is not the case and resources are better directed
elsewhere.
2. Building the Team - Does the team include and/or consider individuals and
institutions who will ultimately be affected by the tool?
– They should engage members of their target communities along with data scientists
3. Data Collection
– Does collecting data infringe on anyone’s privacy?
– Were the systems and processes used to collect the data biased against any groups?
4. Data analytics
– What bias is the team introducing? (dependent variable selection)
– Should the team include features that could be discriminatory? Features such as race, gender, or income
– Is the analysis sufficiently transparent? Machine learning models can be opaque, particularly around what
variables are driving predictions
5. Implementation
– Are the people using models aware of its shortcomings?
– What are the consequences of not acting on false negatives (and acting on false positives)?
Data Science Challenges/
Concerns
1. Identifying the problem
One of the major steps in analyzing a problem and designing a solution is to first figure out the problem properly
and define each aspect of it. Many times Data scientists opt for a mechanical approach and start working on data
sets and tools without a clear definition of the business problem or the client requirement.
2. Access to right data – Data quantity
For the right analysis, it is very important to get hold of the right kind of data. Gaining access to a variety of data
in the most appropriate format is quite difficult as well as time-consuming. There could be issues ranging from
hidden data, insufficient volume of data or less variety in the kind of data. Data could be spread unevenly across
various lines of business so getting permission to access that data can also pose a challenge.
3. Data Cleansing – Data quality
According to a study by MIT, bad data has begun to cost companies up to 25% of possible revenue, because cleansing bad data inflates operating expenses.
4. Lack of domain expertise
Data Scientists also need to have sound domain knowledge and gain subject matter expertise. One of the biggest
challenges faced by data scientists is to apply domain knowledge to business solutions. Data scientists are a
bridge between the IT department and the top management.
5. Data security issues
Since data is extracted through a lot of interconnected channels, social media as well as other nodes, there is
increased vulnerability of hacker attacks. Due to the confidentiality element of data, Data scientists are facing
obstacles in data extraction, usage, building models or algorithms. The process of obtaining consent from users is
causing a major delay in turnaround time and cost overruns.
References
Data collection
https://www.slideshare.net/priyansakthi/methods-of-data-collection-16037781;
https://research-methodology.net/research-methods/quantitative-research/;
https://cyfar.org/data-collection-techniques
Data Management
https://searchsqlserver.techtarget.com/definition/database-management-system
https://www.stoodnt.com/blog/data-engineer-vs-data-scientist/
https://www.dataspace.com/big-data-strategy/so-what-is-the-difference-between-a-programmer-and-a-data-scientist/
Data Preparation
https://searchbusinessanalytics.techtarget.com/definition/data-preparation
https://www.dataquest.io/blog/python-vs-r/
https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html
Data Analytics
https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article
https://searchdatamanagement.techtarget.com/definition/data-analytics
Data analytics Ethics
https://www.coursera.org/learn/data-science-ethics
https://dssg.uchicago.edu/2015/09/18/an-ethical-checklist-for-data-science/
Data analytics Challenges/ Concerns
https://www.forbes.com/sites/laurencebradford/2018/09/06/8-real-challenges-data-scientists-face/#7105dced6d99
https://www.proschoolonline.com/blog/challenges-faced-by-data-scientists
https://www.analyticsinsight.net/what-are-the-major-challenges-faced-by-data-scientists/
Additional Slides
APPENDIX
Network Function Virtualization: Background
SDN Architecture / Layers
Image Source: Open Networking Foundation
Era of Data Science
Data Analytics
Lecture – 2
Sumita Narang
Objectives
• Defining Analytics
• Types of data analytics
– Descriptive, Diagnostic
– Predictive, Prescriptive
• Data Analytics – methodologies
– CRISP-DM Methodology
– SEMMA
– BIG DATA LIFE CYCLE
– SMAM
• Analytics Capacity Building
• Challenges in Data-driven decision making
What is Analytics?
Good? Or Evil?
Doctor: What medical treatment should I recommend?
What route should I follow today? What grades can I expect in this course?
Analytics is all around us….
Source: Avasant Research
What is data analytics
Data or information is in raw format. The increase in the size of data has led to a rise in the need for carrying out inspection, data cleaning, transformation as well as data modeling to gain insights from the data, in order to derive conclusions for a better decision-making process. This process is known as data analysis.
• Data Mining is a popular type of data analysis technique to carry
out data modeling as well as knowledge discovery that is geared
towards predictive purposes.
• Business Intelligence operations provide various data analysis
capabilities that rely on data aggregation as well as focus on the
domain expertise of businesses. In Statistical applications,
business analytics can be divided into Exploratory Data Analysis
(EDA) and Confirmatory Data Analysis (CDA).
Data Analytics – Sub Streams
• Business Analytics – BI (business intelligence), OLAP (online analytical processing) – offline analytic techniques with well-defined business objectives
• Data Mining - process of sorting through large data sets to identify patterns and
establish relationships to solve problems through data analysis.
• Real-time Analytics - use of, or the capacity to use, data and related resources as
soon as the data enters the system
• PIM – Processing in memory: A chip architecture in which the processor is integrated into a memory
chip to reduce latency
• In-database analytics - a technology that allows data processing to be conducted within the database by
building analytic logic into the database itself
• In-memory analytics - an approach to querying data when it resides in random access memory (RAM),
as opposed to querying data that is stored on physical disks (e.g. Apache Spark)
• Massively parallel processing (MPP) – the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory.
• Predictive Analytics - a form of advanced analytics that uses both new and
historical data to forecast activity, behavior and trends. It involves applying statistical
analysis techniques, analytical queries and automated machine learning algorithms
to data sets to create predictive models that place a numerical value -- or score -- on
the likelihood of a particular event happening
• Text analytics - statistical, linguistic and structural techniques are applied to extract
and classify information from textual sources, a species of unstructured data.
Data Analytics: Business Analytics Types
• EDA (Exploratory Data analytics) - aims to find
patterns and relationships in data
• Business Analytics
• Real-time Analytics
• Data Mining
• CDA (Confirmatory Data analytics) - applies statistical
techniques to determine whether hypotheses about
a data set are true or false
• Predictive Analytics
Data Analytics vs. Data
Reporting
Data Analytics: the interactive process of a person tackling a problem, finding the data required to get an answer, analyzing that data, and interpreting the results in order to provide a recommendation for action.
Data Reporting: the process of organizing and summarizing data in an easily readable format to communicate important information.
1. A report will show the user what had happened in the past to avoid inferences and help to get a
feel for the data while analysis provides answers to any question or issue. An analysis process takes
any steps needed to get the answers to those questions.
2. Reporting just provides the data that is asked for while analysis provides the information or the
answer that is needed actually.
3. We perform the reporting in a standardized way, but we can customize the analysis. There are
fixed standard formats for reporting while we perform the analysis as per the requirement; we
customize it as needed.
4. We can perform reporting using a tool and it generally does not involve any person in the analysis.
Whereas, a person is there for doing analysis and leading the complete analysis process.
5. Reporting is inflexible while analysis is flexible. Reporting provides no or limited context about
what’s happening in the data and hence is inflexible while analysis emphasizes data points that are
significant, unique, or special, and it explains why they are important to the business.
Characteristics of Data
Analysis
1. Programmatic: There might be a need to write a
program for data analysis by using code to manipulate it
or do any kind of exploration because of the scale of the
data.
2. Data-driven: A lot of data scientists depend on a
hypothesis-driven approach to data analysis. For
appropriate data analysis, one can also let the data itself drive the analysis. This can be of significant advantage when
there is a large amount of data. For example – machine
learning approaches can be used in place of hypothetical
analysis.
3. Attributes usage: For proper and accurate analysis of
data, it can use a lot of attributes. In the past, analysts
dealt with hundreds of attributes or characteristics of the
data source. With Big Data, there are now thousands of
attributes and millions of observations.
4. Iterative: Since the whole data set is broken into samples and the samples are then analyzed, data analytics can be iterative in nature. Better compute power enables iteration of the models until
analytics can be iterative in nature. Better compute power enables iteration of the models until
data analysts are satisfied. This has led to the development of new applications designed for
addressing analysis requirements and time frames.
Types of Data Analytics
Descriptive Analytics
• Answers “what happened”
• Juggles raw data from multiple data sources to give valuable insights into the past
• Signals that something is wrong or right, without explaining why

Diagnostic Analytics
• Answers “why something happened”
• Finds out dependencies and identifies patterns
• Tries to find the root cause of why something went right or wrong

Predictive Analytics
• Tells “what’s likely to happen”
• Tool for forecasting
• Predicts or forecasts what is likely to happen in the future

Prescriptive Analytics
• Prescribes what action to take to eliminate a future problem or take full advantage of a promising trend
• Uses advanced tools and technologies, like machine learning, business rules and algorithms, which makes it sophisticated to implement and manage.
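A hedged sketch mapping the four types onto a toy monthly-sales series with pandas and scikit-learn; the numbers and the prescriptive rule are purely illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.DataFrame({"month": range(1, 13),
                      "units": [100, 110, 95, 120, 130, 125, 140, 150, 135, 160, 170, 165]})

# Descriptive: what happened?
print("average monthly units:", sales["units"].mean())

# Diagnostic: why? inspect the months that fell below the running average.
dips = sales[sales["units"] < sales["units"].expanding().mean()]
print("months that dipped:", list(dips["month"]))

# Predictive: what is likely to happen? A simple trend forecast for month 13.
model = LinearRegression().fit(sales[["month"]], sales["units"])
forecast = model.predict(pd.DataFrame({"month": [13]}))[0]
print("forecast for month 13:", round(forecast))

# Prescriptive: what should we do? A toy rule applied on top of the forecast.
print("action:", "increase stock" if forecast > sales["units"].iloc[-1] else "hold stock")
```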
Applications – Descriptive
Analytics
1. Telecom company finding current trend of MLP customers
2. Telecom company finding number of site visits for tickets
3. Healthcare provider learns how many patients were
hospitalized last month
4. Retailer – the average weekly sales volume
5. Manufacturer – a rate of the products returned for a past
month, etc.
6. A placement agency – number of candidates being placed to
top companies, candidates actually joining, candidates who
accept but do not join etc.
7. Consultancy firm – Doing salary analysis of different work
streams, stages, industries
Application - Diagnostic
Analytics
FSO analytics example –
Business Objective for FSO Department: targets for 5 top clients in the India, Europe, South Africa and Costa Rica markets
• FTR (First Time Right) improvement by 2%
• Improve incoming WO quality by 5%
Industry Use Cases for Predictive/Prescriptive
Analytics: Summary View
[Summary matrix: domains (Transportation, Legal, Finance, Energy, Telecom, Logistics, Healthcare, Sales, Security) mapped against techniques/approaches (Knowledge Graphs; Neural Networks/Deep Learning; Natural Language Processing (NLP); Image Processing/Computer Vision/Object Recognition; Optimization Algorithms; Social Network Analysis) – individual mappings not reproduced in text]
<Transportation/ Logistics><Optimization Algorithms>
Problem Statement
• ABC Organization allows employees to request cab service for scheduled time slots,
having well-defined drop/pick up locations
• Services team manually creates routes for the given requests, which are often sub-
optimal
Solution Approach
• Used Large Neighborhood Search (LNS) – AI Search Metaheuristic
• Constraints built into LNS as Rules
Solution Benefits
• Overall Cost Optimization (Fuel, Security) and Travel Time Optimization
• No tedious manual planning required; more time window for user requests
Time: 7:00 PM; Date: 1/19/2018; Savings: 56%
Total no. of customers: 28; No. of manual routes: 12; No. of proposed routes: 7
Vehicles used in manual routes: 1 Amaze D, 2 Dzire D, 3 Etios D, 1 Indica D, 1 Innova D, 4 Tempo Traveller
Vehicles used in optimized routes: 6 Amaze D 4-seater, 1 Tavera D 9-seater
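A minimal, illustrative Python sketch of the destroy-and-repair idea behind Large Neighborhood Search (LNS) on a toy single-vehicle route; the actual solution encoded capacity/time-slot constraints as rules, which are omitted here, and the coordinates are random stand-ins.

```python
import random, math

random.seed(0)
stops = [(random.random(), random.random()) for _ in range(15)]   # pickup/drop points

def length(route):
    return sum(math.dist(stops[route[i]], stops[route[i + 1]]) for i in range(len(route) - 1))

def greedy_insert(route, removed):
    # Repair step: reinsert each removed stop at its cheapest position.
    for s in removed:
        best_pos = min(range(1, len(route) + 1),
                       key=lambda i: length(route[:i] + [s] + route[i:]))
        route = route[:best_pos] + [s] + route[best_pos:]
    return route

best = list(range(len(stops)))
for _ in range(500):
    removed = random.sample(best[1:], 3)              # destroy: drop a few stops
    partial = [s for s in best if s not in removed]
    candidate = greedy_insert(partial, removed)       # repair
    if length(candidate) < length(best):
        best = candidate
print("route length after LNS-style search:", round(length(best), 3))
```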
<Transportation><Computer Vision/ Object Recognition>
Problem Statement
• 94% or more of road fatalities happen due to human error while driving
Solution Approach
• Computer Vision/ Object Recognition methods to detect pedestrians, cyclists,
vehicles, road construction/repair and other events (like road signs, indicators, arm-
signals)
• Huge volumes of Training data from real world traffic for Predictive Algorithm tuning
Solution Benefits
• Reduced fatalities due to human error, drunken driving and distractions (e.g. mobile
phone calls)
Image Source: Waymo
<Sales><Knowledge Graphs + NLP>
Problem Statement
• Group company formed via multiple acquisitions (Holding Organization)
• Information about Clients, Customer Projects, Skills, Expertise distributed
across multiple repositories, not easily accessible by the Client/Sales Teams
• Lack of knowledge about relevant sales material leading to weaker sales
pitch
Solution Approach
• NLP-powered Knowledge Management System for easy access of
information across divisions of the group company
• Information stored as Knowledge Graphs
• Efficient Search Engine based on techniques used for Internet search
Solution Benefits
• Stronger sales pitch with information relevant for prospective customer/
project leading to better conversion rates
<Telecom><Social Network Analysis>
Problem Statement
• With Telecom saturation in many markets, focus changing from customer acquisition to
customer retention. Telecom service providers need to control churn to maintain profitability
• Existing churn prediction systems based on individual calling patterns not efficient
Solution Approach
• Proposed approach uses Social Network Analysis for churn prediction (predict future churners
based on direct & indirect interactions with past churners)
• Diffusion-based Spreading Activation Algorithm
• Start with the churners and their social relationships from the call graph
• Initiate a diffusion process with the churners as seeds
• Churners influence their neighbors to churn. Influence determined by strength of relationship
• At end of the diffusion process examine influence on the non churners
Solution Benefits
• Early prediction on possible churners who need to be retained before they churn
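A hedged sketch of the diffusion/spreading-activation idea on a tiny, made-up call graph: known churners seed an influence score that spreads to neighbours in proportion to relationship strength (call minutes here), and high-scoring non-churners are flagged. This is an illustration of the approach, not IBM's actual algorithm.

```python
call_graph = {("a", "b"): 30, ("b", "c"): 5, ("c", "d"): 50, ("a", "d"): 2}
churners = {"a"}

# Build an undirected adjacency map with edge weights.
neighbours = {}
for (u, v), w in call_graph.items():
    neighbours.setdefault(u, {})[v] = w
    neighbours.setdefault(v, {})[u] = w

influence = {n: (1.0 if n in churners else 0.0) for n in neighbours}
for _ in range(3):                                    # a few diffusion rounds
    spread = {n: 0.0 for n in neighbours}
    for n, score in influence.items():
        total = sum(neighbours[n].values())
        for m, w in neighbours[n].items():
            spread[m] += 0.5 * score * w / total      # keep half, pass half along
    influence = {n: max(influence[n], spread[n]) for n in neighbours}

at_risk = sorted((n for n in influence if n not in churners),
                 key=influence.get, reverse=True)
print("non-churners ranked by churn influence:", at_risk)
```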
Source: IBM India Research Lab
<Healthcare><Deep Learning>
Problem Statement
• Living in an age of stress, running against time all the while
• WHO estimates 18.1% of adults in the U.S. have experienced an anxiety disorder
• In India, about 450 million people suffer from mental or behavioral disorder.
• In India, the total number of people living with depression is over 56 million, about
4.5% of total population (2015), and another 38 million suffers from anxiety disorders
Solution Approach
• A wearable device that extracts data about your mind in meditative, moving, resting or sleeping states
• 64 pieces of information extracted and processed to provide useful insights about the
state of your mind and overall mental health
• Trained model developed by Data Scientists and Neurologists using medical data of
existing patients
• Real-time tips and guidance to reduce stress and anxiety offered
Solution Benefits
• Better tracking of mental health and cultivation of practices that lead to reduced
anxiety and stress
<Telecom><SVM + Learning Algorithms>
Problem Statement
• Requirement to reduce Field Support costs for a Tier-1 Telecom OEM’s operations/support team
• First Time Right (FTR) found to be low (30% to 60% across regions), leading to a higher number of field visits

Solution Approach
• Categorization of historical data on field visits (Duplicate Issues, Wrong Assignments, Site Visit Not Required, Site Access Restrictions, Fix Quality) – technology used: R, SQL
• Algorithms and rule-sets to create a learning model; SVM-based prediction to categorize new incoming faults
1. Data Sampling – creation of multiple smaller chunks of data sets (technology used: SQL)
2. Data Churning – identification of dependent and independent variables in the data; relevance of data fields & variation of values; completeness & correctness of values in data fields (technology used: R, SQL, Excel)
3. Build Business Logic – data modelling using statistical techniques; identify relationships between variables; build an algorithm with business logic to derive dependent variables in terms of independent variables; implement the business logic to run on the complete data set (technology used: SQL, C#)

Solution Benefits
• Advisory inputs for each incoming fault reported, to avoid repeated field visits and increase FTR %
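A hedged sketch of the SVM-based categorization step using scikit-learn on synthetic work-order data; the feature values and the three categories are invented stand-ins, not the project's real data or code.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))            # e.g. encoded work-order / fault attributes
y = rng.integers(0, 3, size=300)         # e.g. Duplicate / Wrong Assignment / Visit Needed

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))   # scale features, then SVM
clf.fit(X_train, y_train)
print("held-out accuracy:", round(clf.score(X_test, y_test), 2))
```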
<Energy><Computer Vision/ Image Processing/ Object Recognition>
Problem Statement
• Focus on the Renewable Energy market – solar energy farms being set up behind the meter
• Higher cost of infrastructure needs to be offset by higher energy production to reduce per unit
cost of energy
Solution Approach
• Better design for solar plant leading to higher production efficiencies
• Drone-based site surveys (video and image feeds) to avoid manual site visits
• Computer vision based Design Software to design solar plant layout (Solar PVs)
• SVM / Linear Regression based Predictive Analytics using Weather Forecast Data
Solution Benefits
• Up to 25% higher energy production from the same infrastructure, reducing per unit cost of energy
• Early indication of Energy Production for efficiency in Energy Trading & Storage
<Sales><NLP+ Knowledge Graphs>
Problem Statement
• E-Commerce firm spending a lot of cost on post-
order customer support/ queries
• Requirement to maintain round-the-clock
customer support center with Multi-lingual
executives
Solution Approach
• NLP-powered Chatbots and Voicebots to provide
on-chat and on-call customer support
• Support for English and multiple other Indian
languages
• Handover to human executive only for the cases
that cannot be handled by Bots
Solution Benefits
• Reduction in customer support costs by up to 30%
Source: Litifer
<Legal><NLP, Computer Vision & Deep Learning>
Problem Statement
• Long and error-prone process for Contract Review and Due Diligence
• Requirement of Legal experts and a significantly large Legal Team
Solution Approach
• Using LegalTech, an AI-based solution. Two categories of solutions:
• Systems that apply predetermined rules
• Systems that learn by training themselves
Solution Benefits
• Less overhead costs of Business
• Faster contract processing time
<Finance><Neural Networks / Deep Learning>
Problem Statement
• Stock market portfolio selection to maximize expected return, while minimizing risk
Solution Approach
• Implement Markowitz’s Modern Portfolio Theory:
• Stock Price prediction using Deep Learning models using TensorFlow
• Portfolio weights for different stocks calculated using Logistic Regression technique
• Risk computed by calculating Standard Deviation on historical data
• Use of Long Short-Term Memory (LSTM) networks, which are a special kind of Recurrent Neural Network (RNN) capable of learning long-term dependencies
Solution Benefits
• Optimized Risk-Reward Trade-off & Effective Portfolio Management
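A hedged sketch of an LSTM next-step price model in TensorFlow/Keras on a synthetic series; the described solution additionally layered portfolio weighting and risk estimates on top, which are not shown here.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=400)) + 100        # synthetic price series

window = 20
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X[..., np.newaxis]                                # (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),                         # learns longer-range dependencies
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print("next-step prediction:", float(model.predict(X[-1:], verbose=0)[0, 0]))
```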
<Security><Computer Vision/ Image Processing/ Object Recognition>
Problem Statement
• Security staff required round-the-clock to monitor CCTV footage from multiple cameras
• Incidents / Security lapses getting missed and not reported
Solution Approach
• Automated monitoring of all live CCTV feeds from multiple cameras
• Configurable list of events/ incidents that need to be watched and/or reported
• Conversion of detailed feeds into highlights for quick “end-of-day” watch
Solution Benefits
• Better incident reporting and reduced manual monitoring required
Analytics Terminology
https://marketing.adobe.com/resources/help/en_US/referen
ce/glossary.html
https://analyticstraining.com/analytics-terminology/
https://blog.hubspot.com/marketing/hubspot-google-
analytics-glossary
Skills Required for Data
Analyst
Technical skills for data analytics:
• Packages and Statistical methods
• BI Platform and Data Warehousing
• Base design of data
• Data Visualization and munging
• Reporting methods
• Knowledge of Hadoop and MapReduce
• Data Mining
Business Skills Data analytics:
• Effective communication skills
• Creative thinking
• Industry knowledge
• Analytic problem solving
Tools in Data Analytics – Paid
Tools
1. SAS: SAS is a software suite that can mine, alter, manage and retrieve
data from a variety of sources and perform statistical analysis on it. SAS
provides a graphical point-and-click user interface for non-technical users
and more advanced options through the SAS programming language.
SAS programs have a DATA step, which retrieves and manipulates data,
usually creating a SAS data set, and a PROC step, which analyses the
data.
2. WPS: WPS can use programs written in the language of SAS without the
need for translating them into any other language. In this regard WPS is
compatible with the SAS system. WPS is a language interpreter able to
process the language of SAS and produce similar results. It is sometimes
used as an alternative to SAS as it is relatively cheaper.
3. MS Excel: Microsoft Excel is a spreadsheet application developed by
Microsoft for Microsoft Windows and Mac OS. It features calculation,
graphing tools, pivot tables, and a macro programming language called
Visual Basic for Applications. It has been a very widely applied
spreadsheet for these platforms, especially since version 5 in 1993, and it
has replaced Lotus 1-2-3 as the industry standard for spreadsheets.
Excel forms part of Microsoft Office
Tools in Data Analytics – Paid
Tools contd..
4. Tableau: Tableau Software is an American computer software company
headquartered in Seattle, Washington. It produces a family of interactive
data visualization products focused on business intelligence. Tableau
offers five main products: Tableau Desktop, Tableau Server, Tableau
Online, Tableau Reader and Tableau Public.
5. Pentaho: Pentaho is a company that offers Pentaho Business Analytics, a
suite of open source Business Intelligence (BI) products which provide
data integration, OLAP services, reporting, dash boarding, data mining
and ETL capabilities. Pentaho was founded in 2004 by five founders and
is headquartered in Orlando, FL, USA. The Pentaho suite consists of two
offerings, an enterprise and community edition. The enterprise edition
contains extra features not found in the community edition.
6. Statistica: STATISTICA is a statistics and analytics software package
developed by StatSoft. STATISTICA provides data analysis, data
management, statistics, data mining, and data visualization procedures.
STATISTICA product categories include Enterprise (for use across a site
or organization), Web-Based (for use with a server and web browser),
Concurrent Network Desktop, and Single-User Desktop.
Tools in Data Analytics – Paid
Tools contd..
7. Qlikview : Qlikview is a business intelligence software
from Qlik. It helps its users understand the business in a
better way by providing them features like consolidating
relevant data from multiple sources, exploring the
various associations in the data, enabling social
decision making through secure, real-time collaboration
etc.
8. KISSmetrics : KISSmetrics is a person-based analytics
product that helps users identify, understand, and
improve the metrics that drive their online business.
They make it simple to get the information users need
to make better product and marketing decisions.
Tools in Data Analytics – Paid
Tools contd..
9. Weka: The Weka workbench contains a collection of visualization
tools and algorithms for data analysis and predictive modelling,
together with graphical user interfaces for easy access to this
functionality. The original non-Java version of Weka was a TCL/TK
front-end to (mostly third-party) modelling algorithms implemented
in other programming languages, plus data pre-processing utilities
in C, and a Makefile-based system for running machine learning
experiments. This original version was primarily designed as a tool
for analysing data from agricultural domains, but the more recent
fully Java-based version (Weka 3), for which development started
in 1997, is now used in many different application areas, in
particular for educational purposes and research.
10. BigML: BigML is a Corvallis, Oregon-based startup with a SaaS-based
machine learning platform that allows everyday business users to
create actionable predictive models within minutes.
Tools in Data Analytics – Free
Tools
1. R : R provides a wide variety of statistical and graphical techniques,
including linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, and others. R is easily
extensible through functions and extensions, and the R community is
noted for its active contributions in terms of packages. There are
some important differences, but much code written for S runs
unaltered. Many of R's standard functions are written in R itself,
which makes it easy for users to follow the algorithmic choices
made. For computationally intensive tasks, C, C++, and FORTRAN
code can be linked and called at run time. Advanced users can write
C, C++, Java or Python code to manipulate R objects directly.
2. Google Analytics: Google Analytics is a service offered by Google
that generates detailed statistics about a website's traffic and traffic
sources and measures conversions and sales. The product is aimed
at marketers as opposed to webmasters and technologists from
which the industry of web analytics originally grew. It's the most
widely used website statistics service
Tools in Data Analytics – Free
Tools contd..
3. Python Pandas (Python data analysis library) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It provides tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format. It also offers intelligent data alignment and integrated handling of missing data: computations gain automatic label-based alignment, and messy data can easily be manipulated into an orderly form using Python.
4. Spotfire TIBCO Spotfire is an analytics and business intelligence
platform for analysis of data by predictive and complex statistics.
During the 2010 World Cup, FIFA used this software to give
viewers analytics on country teams' past performances.
Data Analytics
Lecture –3
Sumita Narang
Objectives
• Defining Analytics
• Types of data analytics
– Descriptive, Diagnostic
– Predictive, Prescriptive
• Data Analytics – methodologies
– CRISP-DM Methodology
– SEMMA
– BIG DATA LIFE CYCLE
– SMAM
• Analytics Capacity Building
• Challenges in Data-driven decision making
Data Analytics Methodologies
Data analytics - methodologies
– KDD – Knowledge Discovery in Database
• (https://www.youtube.com/watch?v=a4M3GdI5UFY
• http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html)
– CRISP-DM Methodology
– SEMMA
– BIG DATA LIFE CYCLE
(https://www.tutorialspoint.com/big_data_analytics/big_data_anal
ytics_lifecycle.htm)
– SMAM
– ASUM-DM
CRISP-DM
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-
projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
SEMMA
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVer
sion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
SMAM
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
KDD
http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
https://data-flair.training/blogs/data-mining-and-knowledge-discovery/
https://pdfs.semanticscholar.org/7dfe/3bc6035da527deaa72007a27cef94047a7f9.pdf?_ga=2.146116535.22663
2156.1578656398-640734799.1575775195
Big Data Lifecycle
http://www.informit.com/articles/article.aspx?p=2473128&seqNum=11
https://www.linkedin.com/pulse/four-keys-big-data-life-cycle-kurt-cagle/
ASUS-DM
https://www.researchgate.net/publication/326307750_Towards_an_Improved_ASUM-
DM_Process_Methodology_for_Cross-Disciplinary_Multi-
organization_Big_Data_Analytics_Projects_13th_International_Conference_KMO_2018_Zilina_Slovakia_Augus
t_6-10_2018_Proceeding
KDD – Knowledge Discovery
in Database Process
SEMMA Process
Acronym SEMMA stands for Sample, Explore, Modify, Model, Assess.
Sample —Identify, merge, partition, and sample input data sets, among other
tasks.
Explore —Explore data sets statistically and graphically: plot the data, obtain
descriptive statistics, identify important variables, and perform association
analysis, among other tasks.
Modify —prepare the data for analysis. Examples of the tasks that you can
complete for these nodes are creating additional variables, transforming
existing variables, identifying outliers, replacing missing values, performing
cluster analysis, and analyzing data with self-organizing maps (SOMs) or
Kohonen networks.
Model —fit a predictive model to a target variable. Available models include
decision trees, neural networks, least angle regressions, support vector
machines, linear regressions, and logistic regressions.
Assess —Compare competing predictive models by building charts that plot the percentage of respondents, percentage of respondents captured, lift, and profit.
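A hedged sketch mapping the SEMMA stages onto pandas/scikit-learn with a synthetic data set (the methodology itself is tied to SAS Enterprise Miner; this is only an illustrative translation):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["target"] = (df["x1"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Sample: partition the input data.
train, test = train_test_split(df, test_size=0.3, random_state=0)
# Explore: descriptive statistics / important variables.
print(train.describe())
# Modify: create/transform variables for analysis.
train = train.assign(x1_sq=train["x1"] ** 2)
test = test.assign(x1_sq=test["x1"] ** 2)
# Model: fit a predictive model to the target variable.
clf = DecisionTreeClassifier(max_depth=3).fit(train[["x1", "x2", "x1_sq"]], train["target"])
# Assess: score/compare the model (lift and profit charts in SAS; a simple AUC here).
probs = clf.predict_proba(test[["x1", "x2", "x1_sq"]])[:, 1]
print("test AUC:", round(roc_auc_score(test["target"], probs), 3))
```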
Comparison Study – KDD vs
SEMMA
By doing a comparison of the KDD and SEMMA stages we
would, on a first approach, affirm that they are equivalent:
- Sample can be identified with Selection;
- Explore can be identified with Pre processing;
- Modify can be identified with Transformation;
- Model can be identified with Data Mining;
- Assess can be identified with Interpretation/Evaluation.
Examining it thoroughly, we may affirm that the five stages of
the SEMMA process can be seen as a practical
implementation of the five stages of the KDD process, since it
is directly linked to the SAS Enterprise Miner software.
CRISP-DM Process
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It consists of a cycle that comprises six stages.
Comparison Study – KDD vs
CRISP-DM
Comparing the KDD stages with the CRISP-DM stages is not as
straightforward as in the SEMMA situation. CRISP-DM methodology
incorporates the steps that, as referred above, must precede and
follow the KDD process that is to say:
- The Business Understanding phase can be identified with the
development of an understanding of the application domain, the
relevant prior knowledge and the goals of the end-user;
- The Deployment phase can be identified with the consolidation by
incorporating this knowledge into the system.
- The Data Understanding phase can be identified as the
combination of Selection and Pre processing;
- The Data Preparation phase can be identified with
Transformation;
- The Modeling phase can be identified with Data Mining;
- The Evaluation phase can be identified with
Interpretation/Evaluation.
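A hypothetical Python skeleton showing how a project script might be organised around the six CRISP-DM phases named above; the function bodies are placeholders chosen for illustration, not an implementation prescribed by CRISP-DM.

```python
def business_understanding():
    # Pin down the business problem and how success will be measured.
    return {"objective": "reduce churn", "success_metric": "recall on churners"}

def data_understanding(goals):
    # Selection and pre-processing: what data exists, what quality issues it has.
    return {"sources": ["crm", "billing"], "quality_note": "10% missing tenure"}

def data_preparation(profile):
    return "clean feature table"          # transformation into a modelling-ready form

def modeling(features):
    return "trained model"                # the data-mining step: try candidate algorithms

def evaluation(model, goals):
    return True                           # does it actually meet the business success metric?

def deployment(model):
    print("publish model / schedule scoring job")

goals = business_understanding()
profile = data_understanding(goals)
features = data_preparation(profile)
model = modeling(features)
if evaluation(model, goals):
    deployment(model)                     # then iterate as the data and the business change
```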
KDD vs SEMMA vs CRISP-DM
Drawbacks of CRISP-DM 1/
• One major drawback is that the model no longer seems to be
actively maintained. The official site, CRISP-DM.org, is no
longer being maintained. Furthermore, the framework itself
has not been updated on issues on working with new
technologies, such as big data.
• Big data technologies mean that there can be additional effort spent in the data understanding phase, for example, as the business grapples with the additional complexities involved in the shape of big data sources.
• CRISP-DM is a great framework and its use on projects helps
focus them on delivering real business value. CRISP-DM has
been around a long time so many projects that are using
CRISP-DM are taking shortcuts. Some of these shortcuts
make sense but too often they result in projects using a
corrupted version of the approach like the one shown in
Figure .
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Drawbacks of CRISP-DM 2/
A lack of clarity: Rather than drilling down into the details and really
getting clarity on both the business problem and exactly how an analytic
might help, the project team makes do with the business goals and some
metrics to measure success.
Mindless rework: Most teams try to find new data or new modeling techniques
rather than working with their business partners to re-evaluate the business
problem.
Blind hand-off to IT: Some analytic teams don't think about deployment and
operationalization of their models at all. They fail to recognize that the
models they build will have to be applied to live data in operational data
stores or embedded in operational systems.
Failure to iterate: Analytic professionals know that models age and that
models need to be kept up to date if they are to continue to be valuable.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Standard Methodology for
Analytical Models (SMAM)
https://www.kdnuggets.com/2015/08/new-standard-
methodology-analytical-models.html
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
SMAM / 2
Phase – Description
Use-case identification – Selection of the ideal approach from a list of candidates
Model requirements gathering – Understanding the conditions required for the model to function
Data preparation – Getting the data ready for the modeling
Modeling experiments – Scientific experimentation to solve the business question
Insight creation – Visualization and dash-boarding to provide insight
Proof of Value: ROI – Running the model in a small-scale setting to prove the value
Operationalization – Embedding the analytical model in operational systems
Model lifecycle – Governance around model lifetime and refresh
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Big Data Analytics
Big Data analytics platforms are designed to serve the following needs of
today's businesses:
1. Massive amounts of data – it is big, typically in terabytes or even
petabytes.
2. Varied data – it could be a traditional database, video data, log data,
text data or even voice data.
3. Data that arrives at varying frequency, from days to minutes – it keeps
increasing as new data keeps flowing in.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Big Data Technologies
• MapReduce To understand the beginning of Big Data technology,
we will need to go back to 2004 when 2 Googlers – Sanjay
Ghemawat and Jeffrey Dean wrote a paper that described how
Google used the ‘Divide and Conquer’ approach to deal with its
gigantic databases. This approach involves breaking a task into
smaller sub-tasks and then working on sub-tasks in parallel, and
results in huge efficiencies. Open source software enthusiast ‘Doug
Cutting’ was one of the guys deeply inspired by the Google paper.
He was able to scale his engine to process a couple of hundred
million web pages but the requirement was for something 10,000
times faster than this. This is the computing power Google
generates when it processes the trillions of webpages in existence.
• Hadoop: Doug and his partner built an open-source distributed file system
and processing framework for their search engine "Nutch"; it later came to
be known as Hadoop. While the original Google file system was written in
C++, Doug's Hadoop was written in Java.
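As a hedged illustration of the "Divide and Conquer" idea behind MapReduce (a toy sketch, not Google's or Hadoop's actual implementation), a word count can be expressed as a map step run in parallel over chunks of documents followed by a reduce step that merges the partial counts:

# Toy map/reduce word count illustrating divide-and-conquer (illustrative only).
from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    # Map step: count words in one chunk of documents.
    return Counter(word for doc in chunk for word in doc.lower().split())

def reduce_counts(partial_counts):
    # Reduce step: merge the partial counts from all chunks.
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    docs = ["big data needs parallel processing",
            "hadoop brought mapreduce to open source",
            "mapreduce splits big jobs into small tasks"]
    chunks = [docs[0:2], docs[2:]]          # divide the work
    with Pool(processes=2) as pool:         # conquer the sub-tasks in parallel
        partials = pool.map(map_count, chunks)
    print(reduce_counts(partials).most_common(3))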
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Big Data Technologies contd..
• Pig: As Hadoop began to be implemented on a larger scale, Big Data
specialists soon realized that they were wasting far too much time on
writing MapReduce jobs rather than actually analyzing data. MapReduce code
was long and time-consuming to write. Developers at Yahoo soon came out with
a workaround – Pig. Pig is essentially an easier way to write MapReduce
jobs. Its scripting language, Pig Latin, is often compared to scripting
languages such as Python and allows shorter and more efficient code to be
written, which is then translated to MapReduce before execution.
• Hive: While this solved the problem for a number of people, many still
found Pig difficult to learn. SQL is a language that most developers are
familiar with, and hence people at Facebook decided to create Hive – an
alternative to Pig. Hive enables code to be written in Hive Query Language,
or HQL, which, as the name suggests, is very similar to SQL. Thus, we now
have an option: if we are comfortable with a scripting language, we can pick
up Pig; if we have knowledge of SQL, we can go for Hive. In either case, we
get away from the time-consuming job of writing MapReduce jobs. So far we
have covered four of the most popular Big Data technologies – MapReduce,
Hadoop, Pig and Hive.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Big Data Technologies contd..
• NoSQL: NoSQL refers to databases that do not follow the traditional
tabular structure. This means that the data is not organized in the
traditional rows and columns structure. An example of such data is the text
from social media sites which can be analyzed to reveal trends and
preferences. Another example is video data or sensor data. There are a
number of NoSQL database technologies that work well for specific data
problems. Hbase, CouchDB, MongoDB and Cassandra are some
examples of NoSQL databases. Database technologies enable efficient
storage and processing of data; however, in order to analyze this data, Big
Data specialists require other specialized technologies.
• Mahout /Impala - Mahout is a collection of algorithms that enable machine
learning to be performed on hadoop databases. If you were looking to
perform clustering, classification or collaborative filtering on your data,
Mahout will help you do that. E-commerce companies and retailers have a
frequent need to perform tasks like clustering and collaborative filtering on
their data and Mahout is a great choice for this. Impala is another
technology that enables analytics on Big Data. Impala is a query engine
that allows Big Data specialists to perform analytics on data stored on
hadoop via SQL or other BI tools. Impala as been developed and promoted
by cloudera.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Big Data Life Cycle 1/
A big data analytics cycle can be described by the following
stages:
• Business Problem Definition – CRISP DM : Business
understanding
• Research -- new step
• Human Resources Assessment – new step
• Data Acquisition – new step
• Data Munging – new step
• Data Storage – new step
• Exploratory Data Analysis – CRISP DM : Data Understanding
• Data Preparation for Modeling and Assessment – CRISP DM :
Data Preparation
• Modeling – CRISP DM : Data Modelling
• Implementation – CRISP DM : Evaluation & Deployment
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Big Data Life Cycle 2/
Research: Analyze what other companies have done in the same situation. This involves looking for
solutions that are reasonable for your company, even though it involves adapting other solutions to the
resources and requirements that your company has. In this stage, a methodology for the future stages
should be defined.
Human Resources Assessment: Once the problem is defined, it’s reasonable to continue analyzing if the
current staff is able to complete the project successfully. Traditional BI teams might not be capable of
delivering an optimal solution for all the stages, so it should be considered before starting the project whether
there is a need to outsource a part of the project or hire more people.
Data Acquisition: This section is key in a big data life cycle; it defines which type of profiles would be
needed to deliver the resultant data product. Data gathering is a non-trivial step of the process; it
normally involves gathering unstructured data from different sources. To give an example, it could
involve writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in
different languages normally requiring a significant amount of time to be completed.
Data Munging: Once the data is retrieved, for example, from the web, it needs to be stored in an easy to-
use format. To continue with the reviews examples, let’s assume the data is retrieved from different sites
where each has a different display of the data. Suppose one data source gives reviews in terms of
rating in stars, therefore it is possible to read this as a mapping for the response variable y ∈ {1, 2, 3, 4,
5}. Another data source gives reviews using two arrows system, one for up voting and the other for
down voting. This would imply a response variable of the form y ∈ {positive, negative}.
In order to combine both the data sources, a decision has to be made in order to make these two
response representations equivalent. This can involve converting the first data source response
representation to the second form, considering one star as negative and five stars as positive. This
process often requires a large time allocation to be delivered with good quality.
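As a minimal sketch of the harmonization decision described above (the column and label names are hypothetical), star ratings and up/down votes can be mapped to a common positive/negative response with pandas:

import pandas as pd

# Source A: reviews rated 1-5 stars; source B: up/down votes (hypothetical column names).
source_a = pd.DataFrame({"review_id": [1, 2, 3], "stars": [5, 1, 4]})
source_b = pd.DataFrame({"review_id": [4, 5], "vote": ["up", "down"]})

# Map both representations onto a common y in {positive, negative};
# here 4-5 stars count as positive and 1-3 stars as negative (a modeling choice).
source_a["label"] = source_a["stars"].apply(lambda s: "positive" if s >= 4 else "negative")
source_b["label"] = source_b["vote"].map({"up": "positive", "down": "negative"})

reviews = pd.concat([source_a[["review_id", "label"]],
                     source_b[["review_id", "label"]]], ignore_index=True)
print(reviews)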
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Big Data Life Cycle 3/
Data Storage: Once the data is processed, it sometimes needs to be stored in a database.
Big data technologies offer plenty of alternatives regarding this point. The most common
alternative is using the Hadoop Distributed File System for storage, together with Hive,
which provides users a limited version of SQL known as Hive Query Language (HQL). This
allows most analytics tasks to be done, from the user's perspective, in similar ways as they
would be done in traditional BI data warehouses. Other storage options to be considered are
MongoDB, Redis, and Spark.
This stage of the cycle is related to the human resources knowledge in terms of their abilities
to implement different architectures. Modified versions of traditional data warehouses are
still being used in large-scale applications. For example, Teradata and IBM offer SQL
databases that can handle terabytes of data; open-source solutions such as PostgreSQL
and MySQL are still being used for large-scale applications.
Even though there are differences in how the different storages work in the background, from
the client side, most solutions provide a SQL API. Hence having a good understanding of
SQL is still a key skill to have for big data analytics.
Although a priori this stage seems to be the most important, in practice it is not; it is
not even an essential stage. It is possible to implement a big data solution that works with
real-time data, in which case we only need to gather data to develop the model and then
implement it in real time, so there would be no need to formally store the data at all.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
ASUM-DM
ftp://ftp.software.ibm.com/software/data/sw-
library/services/ASUM.pdf
Analytics Solutions Unified Method - Implementations with
Agile principles
The Analytics Solutions Unified Method (ASUM) is a step-
by-step guide to conducting a complete implementation
lifecycle for IBM Analytics solutions.
ASUM is designed to create successful and repeatable IBM
Analytics deployments. The method can be utilized by
IBM clients and business partners to successfully
implement IBM Analytics solutions.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
ASUM
ASUM is a methodology to implement Agile principles in the
project-management way of working. So think of ASUM as being
applied in some phases of CRISP-DM or SMAM, i.e. Data Preparation
and Modelling.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
ASUM
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Some More Variations - TDSP
Microsoft Team Data Science Process
The TDSP process model provides a dynamic framework for building machine learning solutions through a robust process of
planning, producing, constructing, testing, and deploying models. Here is an example of the TDSP process:
https://hub.packtpub.com/two-popular-data-analytics-methodologies-every-data-professional-should-know-tdsp-crisp-dm/
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Analytics Capacity Building
Reference – Example : https://chhsdata.github.io/dataplaybook/documents/APHSA-
Roadmap-to-Capacity-Building-in-Analytics-White-Paper.pdf
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data driven decision making
• Definition – When it is data, and not instinct, that drives
business decisions.
• Examples – Fraud detection in Loans, Credit Cards (Cibil
scores); Insurance, Six sigma projects to improve efficiency;
Target advertising in e-commerce; Product Roadmap
planning, Team planning
• 6 Steps to Data Driven decision making-
1. Strategy – Define clear Business goals
2. Identifying key data focus areas – Data is everywhere, flowing from multiple sources.
Based on domain knowledge define key focus data sources which seem to impact the
most, easier to access, reliable and clean
3. Data Collection & Storage – Defining data architecture to collect, store, archive i.e.
manage data. Connect multiple data sources, clean, prepare and organize
4. Data Analytics – Analyze the data and derive key insights
5. Turning insights to Actions – business actions to be taken based on the findings from
key insights from data
6. Operationalize and Deploy – Using IT systems, automate the data collection, storage,
analysis and presenting the key highlights
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Business Model
Industry Example – FSO
Analytics Tool
Business Objective for FSO Department: Target for 5 top clients in India,
Europe, South Africa and Costa Rica Markets
• FTR (First Time Right) Improvement by 2%
• Improve incoming WO Quality by 5%
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References
1. https://www.scnsoft.com/blog/4-types-of-data-analytics
2. https://pdfs.semanticscholar.org/7dfe/3bc6035da527dea
a72007a27cef94047a7f9.pdf
3. https://www.nap.edu/read/23670/chapter/6#57
4. https://www.kdnuggets.com/2017/01/four-problems-
crisp-dm-fix.html
5. https://www.jigsawacademy.com/em/Beginners_Guide_t
o_Analytics.pdf
6. https://data-flair.training/blogs/data-analytics-tutorial/
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Challenges/ Concerns
1. Identifying the problem
One of the major steps in analyzing a problem and designing a solution is to first figure out the problem properly
and define each aspect of it. Many times Data scientists opt for a mechanical approach and start working on data
sets and tools without a clear definition of the business problem or the client requirement.
2. Access to right data – Data quantity
For the right analysis, it is very important to get hold of the right kind of data. Gaining access to a variety of data
in the most appropriate format is quite difficult as well as time-consuming. There could be issues ranging from
hidden data, insufficient volume of data or less variety in the kind of data. Data could be spread unevenly across
various lines of business so getting permission to access that data can also pose a challenge.
3. Data Cleansing – Data quality
According to a study by MIT, bad data has begun to cost companies up to 25% of possible revenue, because
cleansing bad data inflates operating expenses.
4. Lack of domain expertise
Data Scientists also need to have sound domain knowledge and gain subject matter expertise. One of the biggest
challenges faced by data scientists is to apply domain knowledge to business solutions. Data scientists are a
bridge between the IT department and the top management.
5. Data security issues
Since data is extracted through a lot of interconnected channels, social media as well as other nodes, there is
increased vulnerability of hacker attacks. Due to the confidentiality element of data, Data scientists are facing
obstacles in data extraction, usage, building models or algorithms. The process of obtaining consent from users is
causing a major delay in turnaround time and cost overruns.
https://www.eaie.org/blog/7-challenges-becoming-data-
driven.html#:~:text=Data%20can%20play%20an%20important,and%20wrong%20or%20missing%20information.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Process
Lecture – 4
Sumita Narang
Objectives
Data Science methodology
– Business understanding
– Data Requirements
– Data Acquisition
– Data Understanding
– Data preparation
– Modelling
– Model Evaluation
– Deployment and feedback
Case Study
Data Science Proposal
– Samples
– Evaluation
– Review Guide
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Industry Example – FSO
Analytics Tool
Business Objective for FSO Department: Target for 5 top clients in India,
Europe, South Africa and Costa Rica Markets
• FTR (First Time Right) Improvement by 2%
• Improve incoming WO Quality by 5%
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Industry Example - FSO
Analytics Tool
Highlighting Key Insights based on Analysis
• Aids in identifying key improvement areas in order to
improve FTR & Incoming WO Quality
• Key highlights & Lowlights from data depicted as textual &
tabular insights in the tool
[Tool data flow: WFM Closure Dump → FSO Analytics History DB → Data Aggregation → Analysis Procedures Execution → Analytics GUI Representation]
Repeated Analysis
• Ease of performing repeated analysis weekly/monthly, on the same parameters, for the same customer
Pre & Post Analysis
• Based on actions identified from analysis, results to be monitored
• Aids in doing comparison of same parameters month on month
• Aids in doing comparison of same parameters across different customers/regions
Easy option to fetch reports
• To support performing the RCA for problematic areas identified through analysis, the tool provides the feature of creating & downloading reports
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Process
1. Business Requirement Understanding
2. Data Acquisition & Storage
3. Data Preparation
4. Data Model Creation
5. Evaluate and Prove Model
6. Interpret and Present Results
7. Deployment & Operational Support
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Process
The Team Data Science Process (TDSP) is an agile, iterative
data science methodology to deliver predictive analytics
solutions and intelligent applications efficiently. The process can be
implemented with a variety of tools.
Features:
- Improve Team Collaboration and learning
- Contains a distillation of the best practices and structures.
- Helps in successful implementation of data science initiatives
Source Reference
4
BITS Pilani, Pilani Campus
Key components of the TDSP
TDSP comprises the following key components:
• A data science lifecycle definition
• A standardized project structure
• Infrastructure and resources for data science projects
• Tools and utilities for project execution
5
BITS Pilani, Pilani Campus
Data Science Lifecycle
The Team Data Science Process (TDSP) provides a lifecycle to structure the
development of your data science projects. The lifecycle outlines the steps,
from start to finish, that projects usually follow when they are executed.
• Designed for data science projects that ship as part of intelligent applications
• Machine learning or artificial intelligence models for predictive analytics
• Exploratory data science projects or ad hoc analytics projects can also
benefit
6
BITS Pilani, Pilani Campus
Data Science Lifecycle
7
BITS Pilani, Pilani Campus
Data Science Project Lifecycle
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Detailed stages
Details
Stage 1- Scoping Phase : Business
Requirements Understanding
1. Product Need
• Understand project sponsor needs and limitations
• Understand Project sponsor vision to deploy and present the results.
2. Initial solution ideation – Data Requirements
• Collaborate with SME to understand data sources, data fields, and computational resources
• Collaborate with Data Engineer for possible solutions, data sources & data architecture
• Decide on general algorithmic approach (e.g. unsupervised clustering vs boosted-tree-based
classification vs probabilistic inference)
3. Scope & KPI
• To define a measurable and quantifiable goal
• E.g. predicting the expected CTR of an ad with an approximation of at least X% in at least Y% of
the cases, for any ad that runs for at least a week, and for any client with more than two
months of historic data
4. Scope & KPI Approval
• Product sponsor approves the KPIs defined
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 2 -Data Acquisition &
Management
Data Acquisition: Stage of collecting or acquiring data from various sources.
2 Types of Data Collection techniques - Primary Data & Secondary
Data Collection.
Primary Data collection – Data is collected from direct sources. Observation,
Interview, Questionnaire, Audit Data, Case Study, Survey Method
Secondary Data collection – Data collected and research analyzed
by other agencies, universities
Develop a solution architecture of the data pipeline that refreshes
and scores the data regularly
Ingest the data into the target analytic environment
Set up a data pipeline to score new or regularly refreshed data.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Acquisition from
Sources & Variety of data
https://datafloq.com/read/understanding-
sources-big-data-infographic/338
21
BITS Pilani, Pilani Campus
Data Management /
Warehousing
Data Management / Warehousing: Administrative process that includes acquiring, validating, storing,
protecting, and processing required data to ensure the accessibility,
reliability, and timeliness of the data for its users.
Relational Databases – RDBMS: Structured information & adherence
to strong schema, ACID (Atomicity, Consistency, Isolation and
Durability) properties, SQL based real time querying
NoSQL Databases – 4 types: 1. Key-Value (Amazon S3, Riak) 2.
Document based store ( CouchDB, MongoDB ) 3. Column-Based Store
(Hbase, Cassandra ) 4. Graph-Based ( Neo4J)
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 3: Data Preparation 1/
Data Preparation is the process of collecting, cleaning, and consolidating
data into one file or data table, primarily for use in analysis.
1. Handling messy, inconsistent, or un-standardized data
2. Trying to combine data from multiple sources
3. Handling of missing values, boundary values, deleting duplicate
values
4. Validation of data, reliability and correctness checks
5. Dealing with data that was scraped from an unstructured source such
as PDF documents, images etc.
6. Feature engineering, feature reduction/scaling
Data Understanding & Exploration/ Data Preparation
• Produce a clean, high-quality data set whose relationship to the target variables is
understood. Locate the data set in the appropriate analytics environment so you are
ready to model.
• Explore dimensions; find deficiencies in data
• Add more data sources if required with support from Data Engineer
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 3: Data Preparation 2/
The key steps to your data preparation:
Data analysis – The data is audited for errors; anomalies are to be corrected. For large
datasets, data preparation proves helpful in producing metadata & uncovering problems.
Creating an intuitive workflow – A workflow consisting of a sequence of data prep
operations for addressing the data errors is then formulated.
Validation – The correctness of the workflow is next evaluated against a representative
sample of the dataset, leading to adjustments to the workflow as previously undetected
errors are found.
Transformation – Once convinced of the effectiveness of the workflow, transformation may
now be carried out, and the actual data prep process takes place.
Backflow of cleaned data – Finally, steps must also be taken for the clean data to replace
the original dirty data sources.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Preparation /Data
Pre-processing
Data Understanding / Analysis
- Metadata, columns/attributes, SME inputs
- Data types of attributes
- Distribution of attributes
- Data quality of attributes – missing values, duplicate values
- Categorical attributes – uniqueness/classes/value counts
- Outliers, noise in data
- Correlation + association analysis
Data Transformation (Cleaning / Converting into different formats)
- Data cleaning – handling of missing values, removing outliers
- Standardization, Normalization
- Date extraction – month, year, etc.
Feature Engineering
1. Feature Reduction
2. Feature Selection
3. Feature Creation
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 3- Data Pre-processing contd..
Algorithms require features with some specific characteristics to work properly. Here,
the need for feature engineering arises
• Preparing the proper input dataset, compatible with the machine learning
algorithm requirements.
• Improving the performance of machine learning models.
Data scientists spend 80% of their time on data preparation:
25
BITS Pilani, Pilani Campus
Stage 3:Data Pre-processing contd..
Data scientists spend 80% of their time on data preparation:
Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-
enjoyable-data-science-task-survey-says/
26
BITS Pilani, Pilani Campus
Stage 3: Data Pre-processing steps
Data Cleaning: Data is cleansed through processes such as filling in missing
values, smoothing the noisy data, or resolving the inconsistencies in the data.
Techniques like- dropping missing values beyond a threshold, filling with mean/
Median; Imputation; Binning etc.
Data Integration: Data with different representations are put together and
conflicts within the data are resolved.
Data Transformation: Data is normalized, aggregated and generalized.
Data Reduction: This step aims to present a reduced representation of the data
in a data warehouse.
Data Discretization: Involves the reduction of a number of values of a
continuous attribute by dividing the range of attribute intervals.
27
BITS Pilani, Pilani Campus
Stage 3:Data Pre-processing
Techniques – Data Transformations
• Imputation
• Handling Outliers
• Binning
• Log Transform
• One-Hot Encoding
• Grouping Operations
• Feature Split
• Scaling
• Extracting Date
https://towardsdatascience.com/feature-engineering-for-machine-learning-
3a5e293a5114
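A brief, hedged pandas sketch of a few of the transformations listed above (binning, log transform, one-hot encoding, and date extraction) applied to a made-up data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 58, 41],
                   "income": [18000, 52000, 230000, 75000],
                   "city": ["Pilani", "Delhi", "Mumbai", "Delhi"],
                   "joined": pd.to_datetime(["2021-01-05", "2020-06-30",
                                             "2019-11-12", "2022-03-01"])})

df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "senior"])   # binning
df["log_income"] = np.log1p(df["income"])                      # log transform for skewed values
df = pd.get_dummies(df, columns=["city"])                      # one-hot encoding
df["join_year"] = df["joined"].dt.year                         # extracting date parts
df["join_month"] = df["joined"].dt.month
print(df.head())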
28
BITS Pilani, Pilani Campus
Preparing The Data
Dealing with missing data
It is quite common in real-world problems to be missing some values in our data
samples. It may be due to errors in data collection, blank spaces on surveys,
measurements that are not applicable, etc.
Missing values are typically represented with the "NaN" or "Null" indicators. The
problem is that most algorithms can't handle those missing values, so we need
to take care of them before feeding data to our models. Once they are
identified, there are several ways to deal with them:
• Eliminating the samples or features with missing values (we risk deleting
relevant information or too many samples).
• Imputing the missing values with pre-built estimators such as the
Imputer class from scikit-learn. We fit it to our data and then transform the
data to estimate the missing values. One common approach is to set the missing
values to the mean value of the rest of the samples.
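A minimal sketch of both options, assuming scikit-learn is available (the current class is SimpleImputer; older versions exposed it as Imputer):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({"height": [170, np.nan, 182, 175],
                  "weight": [65, 80, np.nan, 72]})

# Option 1: drop rows (or columns) containing missing values.
X_dropped = X.dropna()

# Option 2: impute missing values with the column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(X_imputed)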
12
BITS Pilani, Pilani Campus
Feature Scaling
This is a crucial step in the preprocessing phase as the majority of machine
learning algorithms perform much better when dealing with features that are on
the same scale. In most cases, the numerical features of the dataset do not
share a common range and they differ from each other. In real life, it makes no
sense to expect age and income columns to have the same range. But from the
machine learning point of view, how can these two columns be compared?
Scaling solves this problem: after a scaling process, the continuous features
become identical in terms of range. This process is not mandatory for many
algorithms, but it can still be beneficial to apply. However, algorithms based
on distance calculations, such as k-NN or k-Means, need scaled continuous
features as model input. The most common techniques are:
• Normalization: it refers to rescaling the features to a range of [0,1], which is
a special case of min-max scaling. To normalize our data we’ll simply need
to apply the min-max scaling method to each feature column.
• Standardization: it consists in centering the feature columns at mean 0 with
standard deviation 1 so that the feature columns have the same parameters
as a standard normal distribution (zero mean and unit variance).
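A short sketch of both techniques with scikit-learn, assuming numeric features such as age and income:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 20000.0],
              [40, 60000.0],
              [58, 240000.0]])   # columns: age, income

X_norm = MinMaxScaler().fit_transform(X)      # normalization: each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)     # standardization: zero mean, unit variance per column
print(X_norm)
print(X_std)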
14
BITS Pilani, Pilani Campus
Preparing The Data -Selecting
Meaningful Features
One of the most common solutions to avoid overfitting is to reduce the data's
dimensionality. This is frequently done by reducing the number of features of
our dataset via Principal Component Analysis (PCA).
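A hedged scikit-learn sketch: project a feature matrix onto its first two principal components (scaling first, since PCA is sensitive to feature scale):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA works best on centered/scaled data
pca = PCA(n_components=2)                          # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                             # (150, 2)
print(pca.explained_variance_ratio_)               # share of variance kept by each component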
15
BITS Pilani, Pilani Campus
Central Limit Theorem
The Central Limit Theorem (CLT) is a statistical theory which states that,
given a sufficiently large sample size from a population with a finite level
of variance, the mean of all samples drawn from that population will be
approximately equal to the mean of the population.
• The larger the sample size, the closer the distribution of the sample means
is to a normal distribution
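A small NumPy simulation (illustrative only, with arbitrary sample sizes) showing that means of larger samples from a skewed population cluster more tightly around the population mean:

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # a skewed, non-normal population

for n in (5, 30, 200):                                   # increasing sample sizes
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of sample means={sample_means.mean():.3f}  "
          f"std of sample means={sample_means.std():.3f}")
# The mean of the sample means stays near the population mean (2.0) and the spread
# shrinks roughly as 1/sqrt(n), in line with the Central Limit Theorem.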
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Preparing The Data-Splitting
Data Into Subsets
In general, we will split our data in three parts: training, testing and validating
sets. We train our model with training data, evaluate it on validation data and
finally, once it is ready to use, test it one last time on test data.
The ultimate goal is that the model can generalize well on unseen data, in other
words, predict accurate results from new data, based on its internal parameters
adjusted while it was trained and validated.
a) Learning Process
b) Over-fitting & Under-fitting
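A minimal sketch of a three-way split using scikit-learn's train_test_split applied twice (the 60/20/20 proportions are an assumption, not a rule):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out 20% as the final test set, then split the rest into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25,
                                                  random_state=42)  # 0.25 x 0.8 = 0.2
print(len(X_train), len(X_val), len(X_test))   # 90 30 30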
16
BITS Pilani, Pilani Campus
What is Bias & Variance in
data
https://towardsdatascience.com/what-is-ai-bias-
6606a3bcb814
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Tools for Data Preparation
1. R – Data-tables and related libraries
1. http://www.milanor.net/blog/preparing-the-data-for-modelling-with-r/
2. https://www.udacity.com/course/data-analysis-with-r--ud651
3. https://www.datacamp.com/home (Course: Introduction to R )
4. https://courses.edx.org/courses/course-v1:Microsoft+DAT209x+5T2016/course/
2. Python – Pandas libraries, Numpy, scipy, EDA libraries
1. https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-
python.html
2. https://www.analyticsvidhya.com/blog/2016/01/top-certification-courses-sas-r-
python-machine-learning-big-data-spark-2015-16/#five
3. Self-service data preparation tools – Examples:
ClearStory Data, Datameer, Microsoft Power Query for
Excel, Paxata, Tamr, Big Data analyzer, etc.
1. (https://www.predictiveanalyticstoday.com/data-preparation-tools-and-
platforms/)
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Extra Reading
How to depict Dependent and independent variables in
data? How to depict data correlation?
http://www.sthda.com/english/wiki/visualize-correlation-matrix-
using-correlogram
https://datavizcatalogue.com/search/relationships.html
https://www.khanacademy.org/math/pre-algebra/pre-algebra-
equations-expressions/pre-algebra-dependent-
independent/v/dependent-and-independent-variables-
exercise-example-2
How to compensate for missing values ?
https://towardsdatascience.com/6-different-ways-to-compensate-
for-missing-values-data-imputation-with-examples-
6022d9ca0779
https://measuringu.com/handle-missing-data/
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References
• https://www.bouvet.no/bouvet-deler/roles-in-a-data-science-project
• https://www.altexsoft.com/blog/datascience/how-to-structure-data-
science-team-key-models-and-roles/
• https://www.quora.com/What-is-the-life-cycle-of-a-data-science-project
• https://towardsdatascience.com/5-steps-of-a-data-science-project-
lifecycle-26c50372b492
• https://www.dezyre.com/article/life-cycle-of-a-data-science-project/270
• https://www.slideshare.net/priyansakthi/methods-of-data-collection-
16037781
• https://www.questionpro.com/blog/qualitative-data/
• https://surfstat.anu.edu.au/surfstat-home/1-1-1.html
• https://www.mymarketresearchmethods.com/types-of-data-nominal-
ordinal-interval-ratio/
• https://www.coursera.org/learn/decision-making
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Process
Lecture – 4, 5 (Part-2)
Sumita Narang
Objectives
Data Science methodology
– Business understanding
– Data Requirements
– Data Acquisition
– Data Understanding
– Data preparation
– Modelling
– Model Evaluation
– Deployment and feedback
Case Study
Data Science Proposal
– Samples
– Evaluation
– Review Guide
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 4: Data Modelling
• Split correctly the data
• Choose a measure of success. Set an evaluation
protocol and the different protocols available.
• Develop a benchmark model
• Choose an adequate model and tune it to get the best
performance possible. An overview of how a model
learns.
• What is regularization and when it is appropriate to use it.
• Differentiate between over and under fitting, defining
what they are and explaining the best ways to avoid
them
4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Modelling & Evaluation
1. Data Model Development & experiment framework setup
• Data Modelling based on training sets
• Framework to feed in new data and test the models
• Framework to change training data and retrain model based on new data sets
as sliding window
• 3 main tasks involved -
• Feature Engineering: Create data features from the raw data to facilitate
model training
• Model Training: Find the model that answers the question most accurately
by comparing their success metrics
• Determine if your model is suitable for production
2. Data Model Evaluation & KPI Checks
• Read papers, research material to finalize the algorithmic approaches
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Define Appropriately the
Problem
The first, and one of the most critical things to do, is to find out what are the
inputs and the expected outputs. The following questions must be answered:
• What is the main objective? What are we trying to predict?
• What are the target features?
• What is the input data? Is it available?
• What kind of problem are we facing? Binary classification? Clustering?
• What is the expected improvement?
• What is the current status of the target feature?
• How is the target feature going to be measured?
5
BITS Pilani, Pilani Campus
Choose a Measure of Success
“If you can’t measure it you can’t improve it”.
If you want to control something it should be observable, and in order to
achieve success, it is essential to define what is considered success: Maybe
precision? accuracy? Customer-retention rate?
This measure should be directly aligned with the higher-level goals of the
business at hand. It is also directly related to the kind of problem we are
facing:
• Regression problems use certain evaluation metrics such as mean squared
error (MSE).
• Classification problems use evaluation metrics such as precision, accuracy and
recall.
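A short sketch of computing these metrics with scikit-learn on hypothetical predictions:

from sklearn.metrics import accuracy_score, mean_squared_error, precision_score, recall_score

# Classification example (binary labels).
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
print("accuracy :", accuracy_score(y_true_cls, y_pred_cls))
print("precision:", precision_score(y_true_cls, y_pred_cls))
print("recall   :", recall_score(y_true_cls, y_pred_cls))

# Regression example (continuous targets).
y_true_reg = [3.2, 1.8, 4.5]
y_pred_reg = [3.0, 2.1, 4.4]
print("MSE      :", mean_squared_error(y_true_reg, y_pred_reg))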
8
BITS Pilani, Pilani Campus
Setting an Evaluation Protocol 1/
Maintaining a Hold Out Validation Set
This method consists of setting apart some portion of the data as the test set.
The process is to train the model with the remaining fraction of the data,
tune its parameters with the validation set and finally evaluate its
performance on the test set.
The reason to split data into three parts is to avoid information leaks. The main
inconvenience of this method is that if there is little data available, the validation
and test sets will contain so few samples that the tuning and evaluation
processes of the model will not be effective.
This is a simple kind of cross-validation technique, also known as the holdout
method. Although this method doesn't take any overhead to compute and is
better than traditional validation, it still suffers from issues of high
variance: it is not certain which data points will end up in the validation
set, and the result might be entirely different for different sets.
9
BITS Pilani, Pilani Campus
Setting an Evaluation Protocol 2/
K-Fold Validation Method
K-Fold consists of splitting the data into K partitions of equal size. For each
partition i, the model is trained with the remaining K-1 partitions and is
evaluated on partition i.
The final score is the average of the K scores obtained. This technique is
especially helpful when the performance of the model differs significantly
across train-test splits.
As there is never enough data to train your model, removing a part of it for
validation poses a risk of under-fitting. By reducing the training data, we
risk losing important patterns/trends in the data set, which in turn increases
the error induced by bias. So, what we require is a method that provides ample
data for training the model and also leaves ample data for validation. K-Fold
cross-validation does exactly that.
K-Fold cross-validation significantly reduces bias, as we are using most of
the data for fitting, and also significantly reduces variance, as most of the
data is also being used in the validation set.
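A hedged scikit-learn sketch of 5-fold cross-validation; cross_val_score handles the train/evaluate loop over the K partitions and returns one score per fold:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)   # one accuracy score per fold
print(scores)
print("mean CV accuracy:", scores.mean())         # final score = average over the K folds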
10
BITS Pilani, Pilani Campus
Cross-Validation (1/3)
• Cross-Validation is a very useful technique for assessing the performance
of machine learning models.
• Helps in knowing how the machine learning model would generalize to an
independent data set.
• Helps in estimating how accurate the predictions will be in practice.
• We are given two type of data sets: known data set (training data set) and
unknown data set (test data set).
• There are different types of Cross-Validation techniques but the overall
concept remains the same:
o To partition the overall dataset into a number of subsets
o Hold out a subset at a time and train the model on remaining subsets
o Test model on hold out subset
Sources: https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
https://magoosh.com/data-science/k-fold-cross-validation/
19 BAZG523(Introduction to Data Science)
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
Cross-Validation (2/3)
K-Fold Cross-Validation: If k=5 the dataset will be divided into 5 equal parts
and the below process will run 5 times, each time with a different holdout set.
1. Take a group as a test data set
2. Take the remaining groups as a training data set
3. Fit a model on the training set and evaluate it on the test data set
4. Retain the evaluation score and discard the model
At the end of the above process summarize the skill of the model using the
average of model evaluation scores.
Sources: https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
https://magoosh.com/data-science/k-fold-cross-validation/
20 BAZG523(Introduction to Data Science)
Cross-Validation (3/3)
Leave One Out Cross-Validation : It is K-fold cross validation taken to its
logical extreme, with K equal to N, the number of data points in the dataset.
Sources: https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
https://magoosh.com/data-science/k-fold-cross-validation/
21 BAZG523(Introduction to Data Science)
Setting an Evaluation Protocol 3/
Stratified K-Fold Cross validation
In some cases, there may be a large imbalance in the response
variables. For example, in dataset concerning price of houses, there
might be large number of houses having high price. Or in case of
classification, there might be several times more negative samples
than positive samples.
For such problems, a slight variation in the K Fold cross validation
technique is made, such that each fold contains approximately
the same percentage of samples of each target class as the
complete set, or in case of prediction problems, the mean
response value is approximately equal in all the folds. This
variation is also known as Stratified K Fold.
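A brief sketch of the stratified variant in scikit-learn (the data set here is synthetic); each fold preserves the class proportions of the full data set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)  # imbalanced classes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps roughly the same 90/10 class ratio as the complete set.
    print(f"fold {fold}: positive share in test = {y[test_idx].mean():.2f}")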
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Setting an Evaluation Protocol 4/
Iterated K-Fold Validation with Shuffling (Random Sub-sampling)
It consists of applying K-Fold validation several times, shuffling the data
each time before splitting it into K partitions. The final score is the average of
the scores obtained at the end of each run of K-Fold validation.
This method can be very computationally expensive, as the number of models
trained and evaluated would be I x K, where I is the number of iterations
and K the number of partitions.
Above explained validation techniques are also referred to as Non-
exhaustive cross validation methods. These do not compute all ways of
splitting the original sample, i.e. you just have to decide how many subsets need
to be made.
11
BITS Pilani, Pilani Campus
Non-Exhaustive & Exhaustive
Cross-Validation Techniques
References-
1. https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
2. https://blog.contactsunny.com/data-science/different-types-of-validations-in-
machine-learning-cross-validation
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exhaustive Cross Validation
Methods 1/
Exhaustive Methods, that computes all possible ways the data
can be split into training and test sets.
Leave-P-Out Cross Validation
This approach leaves p data points out of training data, i.e. if
there are n data points in the original sample then, n-p
samples are used to train the model and p points are used as
the validation set.
This is repeated for all combinations in which original sample
can be separated this way, and then the error is averaged for
all trials, to give overall effectiveness.
This method is exhaustive in the sense that it needs to train and
validate the model for all possible combinations, and for
moderately large p, it can become computationally infeasible.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exhaustive Cross Validation
Methods 2/
Leave-1-Out Cross Validation
A particular case of this method is when p = 1. This is
known as leave-one-out cross-validation. This
method is generally preferred over the previous one
because it does not suffer from the same intensive
computation, as the number of possible combinations is
equal to the number of data points in the original sample,
n.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Predictive Modelling
Predictive modeling refers to the task of building a model for the
target variable as a function of the explanatory variables. There are
two types of predictive modeling tasks:
• Classification, which is used for discrete target variables. For example,
predicting whether a Web user will make a purchase at an online bookstore is
a classification task because the target variable is binary-valued.
• Regression, which is used for continuous target variables. For example,
forecasting the future price of a stock is a regression task because price is
a continuous-valued attribute.
The goal of both tasks is to learn a model that minimizes the error between
the predicted and true values of the target variable.
Predictive modeling can be used to identify customers that will respond
to a marketing campaign, predict disturbances in the Earth's
ecosystem, or judge whether a patient has a particular disease based
on the results of medical tests.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification Predictive
Modelling Techniques
In classification problems, we use two types of algorithms (dependent on
the kind of output it creates):
• Class output: Algorithms like SVM and KNN create a class output. For
instance, in a binary classification problem, the outputs will be either 0
or 1. However, today we have algorithms which can convert these class
outputs to probability. But these algorithms are not well accepted by the
statistics community.
• Probability output: Algorithms like Logistic Regression, Random
Forest, Gradient Boosting, Adaboost etc. give probability outputs.
Converting probability outputs to class output is just a matter of creating
a threshold probability.
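A small sketch contrasting the two output types with scikit-learn on synthetic data: an SVM gives class labels directly, while logistic regression gives probabilities that can be turned into classes with a threshold:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_new = X[:5]

svm = SVC().fit(X, y)
print(svm.predict(X_new))                      # class output: 0 or 1 directly

logreg = LogisticRegression(max_iter=1000).fit(X, y)
proba = logreg.predict_proba(X_new)[:, 1]      # probability output for class 1
print(proba)
print((proba >= 0.5).astype(int))              # thresholding converts probabilities to classes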
Classification Algorithms vs Clustering Algorithms
In clustering, the idea is not to predict a target class as in classification;
instead, the aim is to group similar items together so that all the items in
the same group are similar to each other and items in different groups are
dissimilar.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Developing a Benchmark
model
Benchmarking is the process of comparing your result to existing
methods. You may compare to published results using another
paper, for example. If there is no other obvious methodology against
which you can benchmark, you might compare to the best naive
solution (guessing the mean, guessing the majority class, etc.) or a
very simple model (a simple regression, K Nearest Neighbors). If the
field is well studied, you should probably benchmark against the
current published state of the art (and possibly against human
performance when relevant).
Common simple benchmarks include the null model, the Bayes rate model, and
single-variable models (e.g. pivot tables).
https://towardsdatascience.com/creating-benchmark-models-the-scikit-learn-way-
af227f6ea977
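A hedged sketch of the simplest scikit-learn benchmarks: a "guess the mean" regressor and a "guess the majority class" classifier, against which any real model should be compared:

from sklearn.datasets import load_diabetes, load_iris
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split

# Null regression benchmark: always predict the training mean.
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print("mean-only R^2:", DummyRegressor(strategy="mean").fit(X_tr, y_tr).score(X_te, y_te))

# Null classification benchmark: always predict the majority class.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print("majority-class accuracy:",
      DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr).score(X_te, y_te))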
17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Null Model – Null Hypothesis
Null model for Univariate regression model –
Y=α + β1X+ ϵ
Null hypothesis would normally be that β1 is statistically no different from zero.
– H0: β1=0 (null hypothesis)
– HA: β1≠0 (alternative hypothesis)
For a univariate linear model such as the above, if we fail to reject the null hypothesis
(i.e. we cannot conclude that β1 differs from zero), then we can drop β1X from the linear
model and we are left with
Y = α + ϵ
This is the null model, and its prediction is simply the mean of Y.
Null model (single-variable models) for a bi-variate regression model –
Y = α + β1x1 + β2x2 + ϵ
where x1 contains the predictors you know are affecting the outcome, so do not want to
test, while x2 contains the predictors you are testing.
So the null hypothesis will be H0: β2 = 0, and the null model would be
Y = α + β1x1 + ϵ
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naive Bayes Classifier
In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
Bayes theorem: This theorem helps to calculate conditional probability of occurrence of
a hypothesis H, if the Evidence E is true, given the prior probability of H, and
probability of occurrence of Evidence E.
P(H|E) = P(E|H) . P(H) / P (E)
https://www.khanacademy.org/partner-content/wi-phi/wiphi-critical-thinking/wiphi-
fundamentals/v/bayes-theorem
Example –
P(H|E) = The conditional probability that I have Dengue, given the evidence
that I show symptoms of headache.
Given -
P(H) = Prior probability that I have a disease like Dengue
P(E) = Probability that I have a headache (due to any reason – Dengue or not)
P(E|H) = Probability that I have a headache, given the condition that I have
Dengue
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Linear Classifiers
- Naive Bayes classifier
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How Naive Bayes algorithm
works?
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
It is easy and fast to predict the class of a test data set. It also performs
well in multi-class prediction.
Step 1: Convert the data set into a frequency table
Step 2: Create Likelihood table by finding the probabilities like Overcast
probability = 0.29 and probability of playing is 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior
probability for each class. The class with the highest posterior
probability is the outcome of prediction.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How Naive Bayes algorithm
works?
Problem: Players will play if the weather is sunny. Is this statement
correct?
We can solve it using above discussed method of posterior
probability.
P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 =
0.36, P( Yes)= 9/14 = 0.64
Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has
higher probability.
Naive Bayes uses a similar method to predict the probability of
different class based on various attributes. This algorithm is
mostly used in text classification and with problems having
multiple classes.
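A tiny sketch reproducing the calculation above from the frequency counts used in the worked example:

# Counts from the weather/play frequency table used above.
n_total = 14            # total observations
n_yes = 9               # days on which players played
n_sunny = 5             # sunny days overall
n_sunny_given_yes = 3   # sunny days among the "play = yes" days

p_yes = n_yes / n_total                         # P(Yes) ≈ 0.64
p_sunny = n_sunny / n_total                     # P(Sunny) ≈ 0.36
p_sunny_given_yes = n_sunny_given_yes / n_yes   # P(Sunny | Yes) ≈ 0.33

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # Bayes theorem
print(round(p_yes_given_sunny, 2))              # ≈ 0.6, so "play" is the more likely class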
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 5: Model Evaluation
Metrics
Performance Metrics vary based on type of models i.e.
Classification Models, Clustering Models, Regression
Models.
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Validating models
Identifying common model problems
Bias - Systematic error.
Variance - Undesirable (but non-systematic) distance between predictions and
actual values.
Overfit - A model that fits the training data too closely and fails to
generalize to new data.
Nonsignificance - A model that appears to show an important relation when in
fact the relation may not hold in the general population, or equally good
predictions can be made without the relation.
29
BITS Pilani, Pilani Campus
Validating models
30
BITS Pilani, Pilani Campus
Ensuring model quality
Testing On Hold Out Data
k-fold cross-validation
The idea behind k-fold cross-validation is to repeat the construction
of the model on different subsets of the available training data and
then evaluate the model only on data not seen during construction.
This is an attempt to simulate the performance of the model on
unseen future data.
Significance Testing
“What is your p-value?”
31
BITS Pilani, Pilani Campus
Balancing Bias & Variance to
Control Errors in Machine Learning
https://towardsdatascience.com/balancing-bias-and-variance-to-control-errors-in-machine-learning-16ced95724db
Y = f(X) + e
Estimation of this relation, f(X), is known as statistical learning. In general, we won't be able to make a perfect estimate
of f(X), and this gives rise to an error term, known as reducible error. The accuracy of the model can be improved by
making a more accurate estimate of f(X) and therefore reducing the reducible error. But, even if we make a 100%
accurate estimate of f(X), our model won’t be error free, this is known as irreducible error(e in the above
equation). The quantity e may contain unmeasured variables that are useful in predicting Y : since we don’t
measure them, f cannot use them for its prediction. The quantity e may also contain unmeasurable variation.
Bias
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely
complicated, by a much simpler model. So, if the true relation is complex and you try to use linear regression, then
it will undoubtedly result in some bias in the estimation of f(X). No matter how many observations you have, it is
impossible to produce an accurate prediction if you are using a restrictive/ simple algorithm, when the true relation is
highly complex.
Variance
Variance refers to the amount by which your estimate of f(X) would change if we estimated it using a different
training data set. Since the training data is used to fit the statistical learning method, different training data sets will
result in a different estimation. But ideally the estimate for f(X) should not vary too much between training sets.
However, if a method has high variance then small changes in the training data can result in large changes in f(X).
A general rule is that, as a statistical method tries to match data points more closely or when a more flexible
method is used, the bias reduces, but variance increases.
In order to minimize the expected test error, we need to select a statistical learning method that simultaneously
achieves low variance and low bias.
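A compact way to summarize the above (a standard textbook identity, not derived in the source): the expected squared test error at a point x_0 decomposes as

E\big[(Y - \hat{f}(x_0))^2\big] = \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2}_{\text{reducible error}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible error}}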
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regularization 1/
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
This is a form of regression, that constrains/ regularizes or shrinks the coefficient
estimates towards zero. In other words, this technique discourages learning a
more complex or flexible model, so as to avoid the risk of overfitting.
A simple relation for linear regression looks like this. Here Y represents the learned
relation and β represents the coefficient estimates for different variables or
predictors(X).
Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
The fitting procedure involves a loss function, known as residual sum of squares or RSS.
The coefficients are chosen, such that they minimize this loss function. Now,
this will adjust the coefficients based on your training data. If there is noise in the
training data, then the estimated coefficients won’t generalize well to the future data.
This is where regularization comes in and shrinks or regularizes these learned
estimates towards zero.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regularization 2/
Ridge Regression
In ridge regression, the RSS is modified by adding a shrinkage quantity (penalty), and the
coefficients are estimated by minimizing this penalized function. Here, λ is the tuning
parameter that decides how much we want to penalize the flexibility of our model. The
increase in flexibility of a model is represented by an increase in its coefficients, and if
we want to minimize the penalized function, then these coefficients need to be small. This is
how the ridge regression technique prevents coefficients from rising too high.
Lasso Regression
Lasso regression works in the same way, but penalizes the absolute size of the coefficients instead: it minimizes
RSS + λ Σj |βj|. Because this penalty can shrink some coefficients exactly to zero, lasso also performs variable selection.
What does Regularization achieve?
A standard least squares model tends to have some variance in it, i.e. the model won't generalize well
for a data set different from its training data. Regularization significantly reduces the variance
of the model without a substantial increase in its bias. The tuning parameter λ, used in the
regularization techniques described above, controls this trade-off between bias and variance. As the value
of λ rises, the coefficients shrink, reducing the variance. Up to a point, this
increase in λ is beneficial, since it only reduces the variance (hence avoiding overfitting)
without losing any important properties in the data. But after a certain value, the model starts
losing important properties, giving rise to bias and thus underfitting. Therefore, the
value of λ should be carefully selected.
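A minimal sketch of both penalties with scikit-learn (assuming scikit-learn and NumPy are installed; the alpha argument plays the role of λ, and the data below is synthetic and only illustrative):

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 10 predictors, only the first three actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for lam in (0.01, 1.0, 100.0):                  # small, medium, large lambda
    ridge = Ridge(alpha=lam).fit(X_train, y_train)
    lasso = Lasso(alpha=lam).fit(X_train, y_train)
    # Larger lambda shrinks the coefficients (lasso drives some exactly to zero),
    # trading a little bias for lower variance on unseen data.
    print(lam, ridge.score(X_test, y_test), lasso.score(X_test, y_test))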
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 6 - Data Visualization &
Interpret Results
• Record information – blueprints, photographs, seismographs, …
• Analyze data to support reasoning (visual exploration) – understand your data better and act upon that understanding; develop and assess hypotheses; find patterns and discover errors in data
• Communicate information more effectively (visual explanation) – share and persuade
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification of Visualization 1/
5 Types of Data Visualization Categories
Temporal
• Two conditions: the data are linear and one-dimensional.
• Temporal visualizations normally feature lines that either stand alone or overlap with each other, with a start and finish time.
• Easy-to-read graphs.
Hierarchical
• Those that order groups within larger groups. Hierarchical visualizations are best suited if you're looking to display clusters of information, especially if they flow from a single origin point.
• More complex and difficult to read.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification of Visualization 2/
Network
• Datasets connect deeply with other datasets. Network data visualizations show how they relate to one another within a network – in other words, they demonstrate relationships between datasets without wordy explanations.
Multidimensional
• There are always two or more variables in the mix, creating a 3D data visualization.
• Because of the many concurrent layers and datasets, these types of visualizations tend to be the most vibrant and eye-catching visuals. Another plus: they can break a ton of data down to key takeaways.
Geospatial
• Relate to real-life physical locations, overlaying familiar maps with different data points.
• These types of data visualizations are commonly used to display sales or acquisitions over time, and are most recognizable for their use in political campaigns or to display market penetration in multinational corporations.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification of Visualization 3/
• Temporal: scatter plots, polar area diagrams, time series sequences, timelines, line graphs
• Hierarchical: tree diagrams, ring charts, sunburst diagrams
• Network: matrix charts, node-link diagrams, word clouds, alluvial diagrams
• Multidimensional: box plots, pie charts, Venn diagrams, stacked bar graphs, histograms
• Geospatial: flow maps, density maps, cartograms, heat maps
https://www.klipfolio.com/resources/articles/what-is-data-visualization
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stage 7 - Deployment &
Iterative Lifecycle
Standard Methodology for Analytical Models (SMAM)
Operationalization –
Implementing the model as a deployable software solution
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Deployment Phase
1. Solution Productionizing & Monitoring Setup
• The research language may or may not be usable for productionizing.
• If the research and production languages are different: port the algorithm, find related libraries, write custom code, and write wrapper APIs.
• If both are the same: decide how to make the models more scalable and achieve the required efficiency.
• Devise a mechanism to continuously monitor the performance of the deployed model.
2. Solution Deployment
• Host the solution in the company's data centers or on the cloud, based on the company's policies, infrastructure and cost.
3. KPI Check
• Validate that the target KPIs are met.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
<Transportation/ Logistics><Optimization Algorithms>
Problem Statement
• ABC Organization allows employees to request cab service for scheduled time slots,
having well-defined drop/pick up locations
• Services team manually creates routes for the given requests, which are often sub-
optimal
Solution Approach
• Used Large Neighborhood Search (LNS) – AI Search Metaheuristic
• Constraints built into LNS as Rules
Solution Benefits
• Overall Cost Optimization (Fuel, Security) and Travel Time Optimization
• No tedious manual planning required; more time window for user requests
Time: 7:00 PM; Date: 1/19/2018; Savings: 56%; Total no. of customers: 28; No. of manual routes: 12; No. of proposed optimized routes: 7
Vehicles used in manual routes: 1 Amaze D, 2 Dzire D, 3 Etios D, 1 Indica D, 1 Innova D, 4 Tempo Traveller
Vehicles used in optimized routes: 6 Amaze D 4-seater, 1 Tavera D 9-seater
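A rough sketch of how such a routing problem can be set up with Google OR-Tools' routing solver (assuming the ortools package is installed; the distance matrix, vehicle count and metaheuristic choice below are illustrative and stand in for the LNS setup described above, not a reproduction of it):

from ortools.constraint_solver import pywrapcp, routing_enums_pb2

dist = [[0, 9, 7, 6],      # toy travel-time matrix: depot + 3 pickup points
        [9, 0, 4, 8],
        [7, 4, 0, 5],
        [6, 8, 5, 0]]
num_vehicles, depot = 2, 0

manager = pywrapcp.RoutingIndexManager(len(dist), num_vehicles, depot)
routing = pywrapcp.RoutingModel(manager)

def travel_time(from_index, to_index):
    return dist[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit = routing.RegisterTransitCallback(travel_time)
routing.SetArcCostEvaluatorOfAllVehicles(transit)

params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
params.local_search_metaheuristic = routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH
params.time_limit.FromSeconds(2)

solution = routing.SolveWithParameters(params)
for v in range(num_vehicles):
    index, route = routing.Start(v), []
    while not routing.IsEnd(index):
        route.append(manager.IndexToNode(index))     # node visited by this vehicle
        index = solution.Value(routing.NextVar(index))
    print('vehicle', v, 'route', route)

Real deployments would add the business constraints (time slots, seat capacities, pickup/drop pairs) as dimensions and rules on top of this skeleton.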
175
Project Example of Data Science
- Diagnostic Analytics
FSO analytics example –
Business Objective for the FSO Department: target the top 5 clients in the India,
Europe, South Africa and Costa Rica markets
• FTR (First Time Right) improvement by 2%
• Improve incoming WO quality by 5%
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References
• https://www.bouvet.no/bouvet-deler/roles-in-a-data-science-project
• https://www.altexsoft.com/blog/datascience/how-to-structure-data-
science-team-key-models-and-roles/
• https://www.quora.com/What-is-the-life-cycle-of-a-data-science-project
• https://towardsdatascience.com/5-steps-of-a-data-science-project-
lifecycle-26c50372b492
• https://www.dezyre.com/article/life-cycle-of-a-data-science-project/270
• https://www.slideshare.net/priyansakthi/methods-of-data-collection-
16037781
• https://www.questionpro.com/blog/qualitative-data/
• https://surfstat.anu.edu.au/surfstat-home/1-1-1.html
• https://www.mymarketresearchmethods.com/types-of-data-nominal-
ordinal-interval-ratio/
• https://www.coursera.org/learn/decision-making
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Teams
Lecture – 6
Sumita Narang
Objectives
Data Science Teams
• Defining Data Team
• Roles in a Data Science Team
– Data Scientists
– Data Engineers
• Managing Data Team
– On boarding and evaluating the success of team
– Working with other teams
• Common difficulties
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Project – Team
structure
(Org chart) The Client / Project Sponsor sits at the top, supported by a Subject Matter Expert, an IT/Operations Team, and the Data Science Team Manager. The data science team itself comprises Data Scientist(s), Data Architect(s)/Engineer(s), and Data Analyst(s)/Visualization Experts, drawing on Computer Science/IT and SW development, machine learning and research, mathematics & statistics, and domain/business knowledge.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Roles in Data Science Project
Project Sponsor
• Represents business interest, maintains vision
• Person who wants data science results
• Decides if the project is a success or failure
• Manages stakeholder expectations
• Project management, planning/tracking & facilitation
Subject Matter Expert (SME)
• Understands the business objective; business process domain expert
• Defines the roadmap
• Defines sources of input data, input data formats, important/non-important data
• Helps in defining data dependencies and input vs. output data relations
IT/Operations Team Engineer
• Builds the application around the logic/algorithm devised by the data team
• Responsible for deploying the project in the IT infrastructure
• Implements the overall framework for continuous data collection/feed, data cleaning/preparation, data modelling and prediction
• Depicts results visually
Data Scientist
• Designs the project steps; picks the data sources and the tools to be used
• Data preparation; performs statistical tests and procedures
• Applies machine learning models and evaluates results
• Key collaborator among multiple entities – SME & project sponsor, IT team and data team
• Enables project vision and goals to be aligned
Data Architect/Engineer
• Builds the data-driven platform, i.e. data warehousing, data lakes, streaming platforms or other solutions to aggregate data
• Operationalizes algorithms and machine learning models with data changing over time
• Data integration – from different data sources identified by the scientist
Data Analyst/Visualization Expert
• Business-level storytelling
• Builds dashboards and other visualizations
• Provides insights through visual means
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Skills Required – Data
Scientist
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Skills Required (roles around the Data Scientist)
Project Manager
• Project management
• Business acumen & vision
• Team management
• Stakeholder management
• Tools – Jira, Azure DevOps, Agile
IT/Operations Team
• Software development – C/C++, Java, Python etc.
• SQL/NoSQL database querying & scripting skills
• Hands-on programming knowledge of data science frameworks – TensorFlow, Keras etc.
• Cloud / in-house data center deployment skills
• Operational skills to monitor the deployment platform
Subject Matter Expert
• Hands-on domain expert on the business process
• Tools: domain specific
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Engineer vs Data Analyst
vs Data Scientist
Data Scientist
• Designs the project steps; picks the data sources and the tools to be used
• Data preparation; performs statistical procedures
• Applies machine learning models; evaluates results
• Key collaborator among multiple entities – SME & project sponsor, IT team and data team
• Enables project vision and goals to be aligned
Skillset: Data management – SQL, NoSQL; Data preparation – Python, SAS, R, Scala; Data analytics – Power BI, SSRS, Tableau; Data modelling – machine learning algorithms, statistical learning, neural networks (Tools – IBM Watson, SPSS, Deep Learning)
Data Architect/Engineer
• Builds the data-driven platform, i.e. data warehousing, data lakes, streaming platforms or other solutions to aggregate data
• Develops data set processes for data modelling, mining & production, with data changing over time
• Data integration – from different data sources identified by the scientist
Skillset: Data management – SQL, NoSQL; data warehousing tools; data purging, archiving and scripting; tools e.g. Oracle, Cassandra, MySQL, MongoDB, Hive, neo4j, Postgres, sqoop, Hadoop/HDFS etc.
Data Analyst/Visualization Expert
• Business-level storytelling
• Builds dashboards and other visualizations
• Provides insights through visual means
Skillset: Data analytics tools – Power BI, SSRS, Tableau
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Scientist Vs. IT SW
Programmer
IT SW Programmer: Software engineers work on front-end/back-end development, build web and mobile apps, develop operating systems and design software to be used by organizations.
Data Scientist: Data scientists focus on building predictive models and developing machine learning capabilities to analyze the data captured by that software. They specialize in finding methods for solving business problems that require statistical analysis.

IT SW Programmer: A software engineer creates deterministic algorithms – every program a software engineer writes should produce the exact same result every time it runs.
Data Scientist: Data scientists create probabilistic algorithms; because they are dealing with statistics, they can't always guarantee an outcome.

IT SW Programmer: Programmers typically work with SQL databases and programming languages like Java, JavaScript, and Python.
Data Scientist: Data scientists typically also work with SQL databases as well as Hadoop data stores. They are more likely to work in Excel and frequently program with statistical software like SAS and R. There is also a big trend toward Python, but with different libraries (NumPy, Pandas, etc.) than those used by programmers.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science -> AI Skillset
Data Science
• Data management: SQL, NoSQL
• Data preparation: Python, SAS, R, Scala
• Data analytics: Power BI, SSRS, Tableau etc.
Artificial Intelligence
• Statistical learning, neural networks
• In-depth programming skills in Python/C/C++/Java
A Data Scientist needs software programming skills, but more focused on data preparation, modelling and presentation.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Skills Venn diagram
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Managing Data Science Team
2/
- Attracting Top Talent
What’s your differentiated value proposition for candidate data
scientists? List three things that make the opportunity unique, that
you think will resonate with your target candidate pool. Test your
pitch on your group. Get feedback.
- Hiring Process
What are the three most important attributes for your candidates?
What is your assessment plan for each?
- Onboarding
What activities and outcomes need to have been achieved in the first
30, 60, and 90 days?
What are the most important pieces of “tribal knowledge” your new hire
needs to know, and how will she learn them? Examples include data
sources, project methodologies, stakeholder dynamics, notable wins
and losses, etc.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Managing Data Science Team
3/
Retention and Management
What skills do you hope this candidate develops over the first year?
What metrics will determine success of this candidate after a year? Examples include certain
business metrics, community contributions, number of insights produced, or project iteration
velocity.
How to Retain Your Talent
• Don’t oversell the role. Half of data scientists stay at their jobs for two years or less. To
reduce turnover, be truthful about the position you’re hiring for.
• Understand team members' motivations. Take time to discover each employee's goals, interests and personal incentives. Then you can pair them with rewarding projects and recognize accomplishments in a meaningful way.
• Offer support. Data science can be a discipline of failures: models fail, processes fail, data sources turn out to be terrible. Offer positive reinforcement and remind team members that it can take years to see an impact. Help the team break problems into manageable chunks so employees aren't intimidated by an overwhelming project.
• Create learning opportunities. Data scientists often leave their jobs because they're bored. If core projects aren't cutting-edge, create opportunities for team members to learn new things, such as a weekly lunch to discuss the latest research or occasional hackathons to test a new software framework or computational technique.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Habits of Successful Data
Science Managers
The following are the seven habits we have observed in many successful data science managers, in no
particular order.
1. Build bridges to other stakeholders. Avoid friction and crossed wires by opening communication
channels with other teams. Consider putting a data scientist and product manager in a room for an hour
before each new project to ensure they’re on the same page. Making data scientists attend meetings
without their laptops can force them to communicate with other stakeholders. Giving data scientists
opportunities to explain their work to engineers, product managers, and others can also improve
communication.
2. Track performance. Use a template to keep track of what you discussed, the objectives you set, and the
feedback you gave during one-on-one meetings with your reports. Relying on memory won’t work.
3. Aim to take projects to production. Preparing teams to deploy their own API services and to
productionalize code helps you move faster, and you don’t get blocked on engineering resources that
might not be available.
4. Start on-call rotation. As teams get bigger, set up a weekly rotation of data scientists on call to fix models
that break. That encourages better documentation and gives those not on-call time to focus on core
projects.
5. Ask the dumb questions. Seemingly simple questions can open the door to identifying and solving
fundamental problems.
6. Always be learning. Read prolifically to keep up with developments in this quickly evolving field.
Consume not only technical material, but also insights about management and organizational psychology.
7. Get out of the way, but not forever. If you’re a new manager, consider stepping away from coding for
three to six months. Otherwise, you risk never truly embracing the manager role, and might under-serve
the team. After that, feel free to tackle non-critical projects or those nobody else wants to do.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Common Challenges 1/
1. Hiring a balanced data science team - build a cross-functional data science
team that enables your organization to get insights from data and build
production ready models.( Data Scientist, machine learning Engineer, Data
Architect/Engineer, SW Developer, Research Scientist)
2. Retaining the team and Growing the team
3. Translating the business goals to smaller chunks of tasks, and defining
measurable KPIs for the Data Science Team to work on achieving these KPIs.
4. Transforming the data science team's output/deliverables into a business-understandable form, with a key focus on data visualization, to bridge the gap between the relatively less technical or non-technical business teams and the very technical data science team.
5. Engaging and keeping the team motivated through failures, and keeping senior leadership aligned with the fact that data science projects are not like typical SW engineering projects, which can be very Agile and deliver results every 7 days.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Common Challenges 2/
6. Prioritize work: the team is flooded with requests for analytics reports or other data-crunching tasks. These ad hoc requests consume a lot of time and impact long-term projects and other key deliverables. It is important to prioritize work and assign the right priority to these ad hoc tasks. It is better to maintain an ad hoc request backlog and add priorities to those tasks; the team can then handle urgent requests without sacrificing time for important work.
7. Data quality: are you getting the right data? You may have plenty of data available, but the quality of that data isn't a given. To create, validate, and maintain high-performing machine learning models in production, you have to train and validate them using trusted, reliable data. You need to check both the accuracy and the quality of the data. Accuracy in data labeling measures how close the labeling is to ground truth; quality in data labeling is about accuracy across the overall dataset. Make sure that the work of all of your annotators looks the same and labeling is consistently accurate across your datasets.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Common Challenges 3/
8. Tools: tools play an important role because they allow you to automate. Use relevant tools to do the heavy-lifting jobs – running scripts to automate queries and processing data – to save time and make the team more productive. Choose the right set of tools for your needs.
9. Processes: data science projects are research oriented, or start with a lot of research activities, so it is difficult to predict how long they will take to finish. Also, many activities such as model building and data crunching are usually done by a single person, so traditional collaborative workflows don't fit. You have to identify an approach that works best for your team. For example, you might run a mix of Kanban and Scrum boards in JIRA: research activities, data exploration/analysis and exploring ML models go for Kanban, while productization of the models can run as a Scrum team. So your data scientists, research scientists and ML engineers work mostly in Kanban mode, whereas data engineers and software engineers work in Scrum mode. Evaluate various options and see what works best for your team and projects.
10. Knowledge management for the data science team – data and models.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Knowledge Management for
Data Science Teams 1/
What is knowledge management?
• The goal of knowledge management is to capture insight, which can be defined as “better
understanding.” Insight is thus relative—it’s about constantly improving upon previous
ideas.
• Creating that kind of “compounding machine” requires a way to capture knowledge, a
framework for users to follow and mechanisms to improve through feedback.
Why is knowledge management difficult?
1. Organizing knowledge in advance is difficult. Classifications are often too rigid, since you
don’t know what will matter in the future.
2. There are few incentives to participate. As one data scientist said, “I get paid for what I
build this year, not maintaining what I built last year.”
3. It’s a classic collective action problem. No one wants to be the first to spend time on
documentation. When knowledge is being captured, it can be hard to know how to act on
it.
4. Systems always lag behind reality. If knowledge management takes extra time and is
done in a different system from the core work, its quality will suffer.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Knowledge Management for
Data Science Teams 2/
5. Other obstacles are unique to data science teams:
• People use different tools. Knowledge management is tougher when some team members
work in R and others in Python, and when some store code on GitHub and others in email.
Training people to use the same systems is difficult because of high turnover.
• The components of a single project are scattered. Artifacts and insights may be spread
across a Docker store, a wiki, a PowerPoint presentation, etc.
• If you have code, that doesn’t mean you can rerun it. A meta-analysis of 600 computational
research papers found that only 20 percent of the code could be re-run; of that share, many
second attempts yielded slightly different results.
How to improve knowledge management
There are four steps that can help data science leaders improve knowledge
management in their enterprise organizations:
1. Capture as much knowledge as possible in one place.
The more things are in there, the more connections you have across them, and the value grows that
way. You don’t want people operating on the fringes. A common platform that encompasses both the
core work and knowledge management is key to ensure it gets done and minimizes the burden. If you
can’t capture everything, start with the most valuable model or knowledge, and build a system around
that.
2. Choose a knowledge management system that allows for discovery,
provenance, reuse, and modularity
• Discovery: Data scientists spend much of their time searching for information, cutting into productivity.
Teams have to decide whether to curate knowledge (the Yahoo approach) or index it (the Google approach).
Curation makes sense when the domain is relatively stable. Indexing and searching is best when the domain
is fluid, and you can’t possibly know beforehand what the taxonomy should look like.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Knowledge Management for
Data Science Teams 3/
• Provenance: Let people focus on the aspects of knowledge management that matter. Use a platform that
allows people to synthesize their work, not have to track which software version they used.
• Reuse: If it won’t run, it won’t get reused. That requires access to not only code, but also historical
versions of datasets.
• Decompose and Modularize: Ensure that people have the incentives and tools to create building blocks
that can be reused and built upon.
3. Identify the right unit of knowledge
Compounding systems rely on units of knowledge. In academia, those are books and papers; in software,
it’s code. In data science, the model is the right thing to organize around, because it’s the thing data
scientists make. The model includes the data, code, parameters and results.
Data Annotation and Quality Assurance(QA)- Data is Key to success of any data science team. Having
a well trained Data labeling team can provide significant value particularly during the iterative machine
learning model testing and validation stages. Machine learning is an iterative process. You need to
validate the model predictions and also need to prepare new datasets and enrich existing datasets to
improve your algorithm’s results. You can either hire or outsource data annotation but challenge remains
to have a consistency in labeling data and validating results.
4. Think beyond technology
Changes at the people and process levels are also important. Reframe how people see their jobs: They
should spend less time doing and more time codifying and learning. Make collaboration a priority in hiring
and compensation. Finally, while knowledge management should be seen as everyone’s job, some
organizations create new roles for curating or facilitating knowledge.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Managing the Team
Focus –
- On boarding and evaluating the success of team
- Working with other teams
References –
https://hbr.org/2018/10/managing-a-data-science-team
https://www.dominodatalab.com/resources/field-
guide/managing-data-science-teams/
https://towardsdatascience.com/building-and-managing-
data-science-teams-77ba4f43bc58
https://towardsdatascience.com/so-youre-going-to-
manage-a-data-science-team-31f075809ffd
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data driven decision making
• Definition – when it is data, and not instinct, that drives the business decisions.
• Examples – Fraud detection in Loans, Credit Cards (Cibil
scores); Insurance, Six sigma projects to improve efficiency;
Target advertising in e-commerce; Product Roadmap
planning, Team planning
• 6 Steps to Data Driven decision making-
1. Strategy – Define clear Business goals
2. Identifying key data focus areas – Data is everywhere, flowing from multiple sources.
Based on domain knowledge define key focus data sources which seem to impact the
most, easier to access, reliable and clean
3. Data Collection & Storage – Defining data architecture to collect, store, archive i.e.
manage data. Connect multiple data sources, clean, prepare and organize
4. Data Analytics – Analyzing the data and derive key insights
5. Turning insights to Actions – business actions to be taken based on the findings from
key insights from data
6. Operationalize and Deploy – Using IT systems, automate the data collection, storage,
analysis and presenting the key highlights
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Business Model
Industry Example – FSO
Analytics Tool
Business Objective for the FSO Department: target the top 5 clients in the India,
Europe, South Africa and Costa Rica markets
• FTR (First Time Right) improvement by 2%
• Improve incoming WO quality by 5%
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Science Project – Team
structure
(Org chart) Project Sponsor – FSO Head, supported by a Subject Matter Expert (Telecom Expert), an IT/Operations Team (4 members: SW Architect – 1, WebApp Developer – 1, DotNet Developers – 2), and the Data Science Team Manager. The data science team comprises Data Scientists (Researcher – 1, Machine Learning Engineers – 2), Data Architects/Engineers – 2, and Data Analyst/Visualization Expert – 1.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data and Data Models
Lecture – 7
Sumita Narang
Objectives
• Types of Data, and Datasets
• Data Format
• High dimensional data
• Data representation – Graphs and networks, Matrices, Vectors, Data Frames, Lists; libraries for graphs, matrices and vectors
and vectors
• Data quality
• Epicycles of Data Analysis
• Data Models
– Model as expectation
– Comparing models to reality
– Reactions to Data
– Refining our expectations
• Six Types of the Questions
• Characteristics of Good Question
• Formal modelling
– General Framework
– Associational Analyses
• Prediction Analyses
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Behavior
Static data is data whose size/value is defined at initialization; the value/size cannot be changed further.
Dynamic data also has its size/value defined at initialization, but unlike static data its value or size can be changed as needed.
22
BITS Pilani, Pilani Campus
Data Types & Scales 1/
TYPES OF DATA
• Qualitative data: binary, discrete
• Quantitative data: binary, discrete, continuous
SCALES OF DATA MEASUREMENT
• Qualitative data: nominal, ordinal
• Quantitative data: interval, ratio
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Data 2/
Qualitative Data - Data that approximates and characterizes.
• Data type is non-numerical in nature.
• Statistically also known as categorical data as can be arranged
categorically/grouped based on the attributes and properties of a
thing or a phenomenon.
e.g. Females have brown, black, blonde, and red hair (qualitative).
e.g. Feedback to training is good, bad, average (qualitative)
Scales
Nominal sub-type- if there is no natural order between the categories
(e.g. eye colour, gender, direction); used for labelling the data
Ordinal sub-type - if an ordering exists (e.g. exam results, socio-economic status, satisfaction score, happiness index). The order of the values is what is important and significant, but the differences between them are not really known.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Data 3/
Quantitative data – Data which is numeric in nature
• Discrete if the measurements are integers (e.g. number of people in a
household, number of cigarettes smoked per day) and
• Continuous if the measurements can take on any value, usually within
some range (e.g. weight).
• Binary, if measurement can take just two values 0 or 1
Scales
• Interval – Interval scales are numeric scales in which we know both the
order and the exact differences between the values. E.g. Temperature
in Degree Celsius etc.
• Note- In the case of interval scales, zero doesn’t mean the absence of value, but is actually another number
used on the scale, like 0 degrees celsius. So there is no true zero in value
• Ratio – Ratio scales tell us about the order and the exact value between units, AND they have an absolute zero – which allows a wide range of both descriptive and inferential statistics to be applied.
• Examples of ratio variables include height, weight, and duration.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Formats
24
BITS Pilani, Pilani Campus
Exercise
Data – Type, Subtype, Scale
• Age in years – Quantitative, Discrete, Ratio
• Time in AM/PM – Qualitative, Binary, Ordinal
• Brightness as measured by a light meter – Quantitative, Continuous, Ratio
• Brightness as measured by people's judgement – Qualitative, Discrete, Ordinal
• Angles measured in degrees 0 to 360 – Quantitative, Continuous, Ratio
• Bronze, Silver, Gold medals as awarded – Qualitative, Discrete, Ordinal
• Height above sea level – Quantitative, Continuous, Ratio/Interval
• Number of patients in a hospital – Quantitative, Discrete, Ratio
• ISBN numbers for books – Qualitative, Discrete, Ordinal
• Ability to pass light (opaque, translucent, transparent) – Qualitative, Discrete, Ordinal
• Military rank – Qualitative, Discrete, Ordinal
• Distance from the center of campus – Quantitative, Continuous, Ratio/Interval
• Density of a substance in grams per cubic centimeter – Quantitative, Discrete, Ratio/Interval
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Data-set Descriptions
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Types of Data Sets
• Record
  – Relational records
  – Data matrix, e.g., numerical matrix, crosstabs
  – Document data: text documents represented as term–frequency vectors (e.g., Document 1: 3 0 5 0 2 6 0 2 0 2; Document 2: 0 7 0 2 1 0 0 3 0 0; Document 3: 0 1 0 0 1 2 2 0 3 0, over terms such as team, coach, play, ball, score, game, win, lost, timeout, season)
  – Transaction data, e.g. TID 1: Bread, Coke, Milk; TID 2: Beer, Bread; TID 3: Beer, Coke, Diaper, Milk; TID 4: Beer, Bread, Diaper, Milk; TID 5: Coke, Diaper, Milk
• Graph and network
  – World Wide Web
  – Social or information networks
  – Molecular structures
• Ordered
  – For some types of data, the attributes have relationships that involve order in time or space, e.g. video data: a sequence of images
  – Temporal data: time-series
  – Sequential data: transaction sequences
  – Genetic sequence data
• Spatial, image and multimedia: spatial auto-correlation
  – Spatial data: maps
  – Image data
212
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Important Characteristics of
Structured Data 1/
• Dimensionality
The dimensionality of a data set is the number of attributes that the objects in the data set
possess. Data with a small number of dimensions tends to be qualitatively different
from moderate or high-dimensional data. Indeed, the difficulties associated with
analyzing high-dimensional data are sometimes referred to as the curse of
dimensionality. Because of this, an important motivation in preprocessing the data is
dimensionality reduction.
• Sparsity
For some data sets, such as those with asymmetric features, most attributes of an object
have values of 0; in many cases fewer than 1% of the entries are non-zero. In practical
terms, sparsity is an advantage because usually only the non-zero values need to be
stored and manipulated. This results in significant savings with respect to computation
time and storage. Furthermore, some data mining algorithms work well only for sparse
data.
213
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Important Characteristics of
Structured Data 2/
• Resolution
– It is frequently possible to obtain data at different levels of resolution, and often the
properties of the data are different at different resolutions. For instance, the
surface of the Earth seems very uneven at a resolution of a few meters, but is
relatively smooth at a resolution of tens of kilometers. The patterns in the data also
depend on the level of resolution. If the resolution is too fine, a pattern may not be
visible or may be buried in noise; if the resolution is too coarse, the pattern may
disappear. For example, variations in atmospheric pressure on a scale of hours
reflect the movement of storms and other weather systems. On a scale of months,
such phenomena are not detectable.
• Distribution
– Centrality and dispersion
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
Also called samples , examples, instances, data points,
objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns ->attributes.
215
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes
Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
– E.g., customer _ID, name, address
Types:
– Nominal
– Binary
– Numeric: quantitative
• Interval-scaled
• Ratio-scaled
216
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Types
Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
217
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in C˚or F˚, calendar dates
• No true zero-point
Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude larger than the unit of
measurement (10 K˚ is twice as high as 5 K˚).
• e.g., temperature in Kelvin, length, counts, monetary quantities
218
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Discrete vs. Continuous
Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and
represented using a finite number of digits
– Continuous attributes are typically represented as
floating-point variables
219
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Statistical Descriptions
of Data
Motivation
– To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities
of precision
– Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
220
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
  sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, population mean $\mu = \frac{\sum x}{N}$
  Note: n is the sample size and N is the population size.
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Trimmed mean: chopping extreme values
Median:
• Middle value if odd number of values, or average of the middle two values otherwise
• Estimated by interpolation (for grouped data): $\text{median} = L_1 + \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \cdot \text{width}$, where $L_1$ is the lower boundary of the median interval
Mode:
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
221
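These measures are easy to compute in Python; for example, with the standard library's statistics module (the values below are illustrative):

import statistics

x = [23, 25, 25, 27, 30, 31, 31, 31, 45]
print(statistics.mean(x))    # arithmetic mean
print(statistics.median(x))  # middle value (or average of the middle two)
print(statistics.mode(x))    # most frequent value -> 31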
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
222
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Dispersion of
Data
Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers,
and plot outliers individually
– Outlier: usually, a value higher/lower than 1.5 x IQR
Variance and standard deviation (sample:s, population: σ)
– Variance (algebraic, scalable computation):
  population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
– Standard deviation s (or σ) is the square root of variance s2 (or σ2)
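A small illustration with NumPy (the data values are made up for the example):

import numpy as np

x = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 105])
q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # usual outlier fences
print(x.min(), q1, med, q3, x.max())            # five-number summary
print(x.var(ddof=1), x.std(ddof=1))             # sample variance and standard deviation
print(x[(x < low) | (x > high)])                # points flagged as outliers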
223
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Boxplot Analysis
Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height
of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to Minimum and
Maximum
– Outliers: points beyond a specified outlier threshold, plotted
individually
224
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Graphic Displays of Basic
Statistical Descriptions
Boxplot: graphic display of five-number summary
Histogram: x-axis shows values, y-axis represents frequencies
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi% of the data are ≤ xi
Quantile–quantile (q–q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
225
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Histogram Analysis
Histogram: Graph display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts – a
crucial distinction when the categories are not of uniform width
The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must
be adjacent
226
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Histograms Often Tell More than Boxplots
The two histograms shown
in the left may have the
same boxplot
representation
The same values for:
min, Q1, median, Q3,
max
But they have rather
different data distributions
227
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Handling Non-Record Data
• Unstructured data, which is not available as records.
• Most data mining algorithms are designed for record
data or its variations, such as transaction data and data
matrices. Record-oriented techniques can be applied
to non-record data by extracting features from data
objects and using these features to create a record
corresponding to each object.
• Consider spatio-temporal data consisting of a time series from each point on a
spatial grid. This data is often stored in a data matrix, where each row represents
a location and each column represents a particular point in time.
• However, such a representation does not explicitly capture the time relationships
that are present among attributes and the spatial relationships that exist among
objects. This does not mean that such a representation is inappropriate, but
rather that these relationships must be taken into consideration during the
analysis.
• For example, it would not be a good idea to use a data mining technique that
assumes the attributes are statistically independent of one another.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
Author(s), Title, Edition, Publishing House
T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Data Representation
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Vector in Data Science
A vector is a tuple of one or more values called scalars.
Vectors can be represented in Python in different ways; known as
Sequence Data Types
• Lists
• Tuples
• Ranges
• Arrays
Vector Arithmetic
1. Vector addition: c = a + b, where a + b = (a1 + b1, a2 + b2, a3 + b3)
2. Vector subtraction
3. Vector multiplication
4. Vector division
5. Vector dot product
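For example, with NumPy arrays, where arithmetic is element-wise by default:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a + b)         # element-wise addition       -> [5. 7. 9.]
print(a - b)         # element-wise subtraction
print(a * b)         # element-wise multiplication
print(a / b)         # element-wise division
print(np.dot(a, b))  # dot product: 1*4 + 2*5 + 3*6 = 32.0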
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sequence Types in Python -
Lists, tuples , Range 1/
https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range
• A list is simply an ordered collection of items.
– Lists, Tuples, and Strings are all data types that, in python, are called sequence data types. All of them are
containers of items and all of them have items that are ordered.
• A Tuple is also an ordered collection of items.
Similarity –
1. Both lists and tuples are sequence data types that can store a collection of
items.
2. Each item stored in a list or a tuple can be of any data type.
3. And you can also access any item by its index.
Differences –
1. The main difference between lists and tuples is that lists are mutable whereas tuples are immutable (see the short example below).
2. Lists are created with the list() constructor or square brackets, e.g. [l1, l2, l3]; tuples are created with round brackets, e.g. (l1, l2, l3).
3. Lists can grow dynamically as they are mutable objects.
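A minimal illustration of the mutability difference:

nums_list = [1, 2, 3]              # list: mutable
nums_tuple = (1, 2, 3)             # tuple: immutable

nums_list.append(4)                # fine: the list grows dynamically
print(nums_list[0], nums_tuple[0]) # both support indexing

try:
    nums_tuple[0] = 99             # raises TypeError: tuples cannot be modified
except TypeError as err:
    print(err)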
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sequence Types in Python -
Lists, tuples , Range 2/
The range type represents an immutable sequence of
numbers and is commonly used for looping a specific
number of times in for loops.
• list(range(10)) is same as [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
• list(range(1, 11)) is same as [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
• list(range(0, 30, 5)) is same as [0, 5, 10, 15, 20, 25]
• list(range(0, 10, 3)) is same as [0, 3, 6, 9]
• list(range(0, -10, -1)) is same as [0, -1, -2, -3, -4, -5, -
6, -7, -8, -9]
• list(range(0)) is same as []
• list(range(1, 0)) is same as []
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Lists and Arrays
https://medium.com/backticks-tildes/list-vs-array-python-data-type-40ac4f294551
https://www.pythoncentral.io/the-difference-between-a-list-and-an-array/
Apparently, an Array is a data type in Python also, meaning we have the array type and list type (the list
type being more popular).
The most popular type of array used in Python is the numpy array.
Similarities-
• Both can be used to store any data type (real numbers, strings, etc)
• Both can be indexed and iterated
• Both can be sliced
• Both are mutable
Differences-
• The main difference between these two data types is the operations you can perform on them. Arrays are specially optimized for arithmetic computations, so if you're going to perform such operations you should consider using an array instead of a list. E.g.
  import numpy as np
  x = np.array([3, 6, 9, 12])
  print(x / 3.0)   # -> array([1., 2., 3., 4.])
• Also lists are containers for elements having differing data types but arrays are used as containers
for elements of the same data type.
• It does take an extra step to use arrays because they have to be declared (imported and constructed), while lists don't because they are part of Python's syntax.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise – Lists, Arrays, tuples
https://www.w3resource.com/python-exercises/list/
https://www.w3resource.com/python-exercises/tuple/
https://www.w3resource.com/python-exercises/array/
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
2-Dimensional Data
Data can be stored in python in 3 ways –
1. 2D array or 2D List
2. Matrix
3. Pandas Data frame
Details –
1. 2D Arrays -
https://www.tutorialspoint.com/python/python_2darray.htm
A two-dimensional array is an array within an array – an array of arrays. In this type of array, the position of a data element is referred to by two indices instead of one, so it represents a table with rows and columns of data. In the example below of a two-dimensional array, observe that each array element is itself also an array.
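For instance, a plain Python 2D list (the values are illustrative):

timetable = [
    ['Mon', 18, 20, 22, 17],
    ['Tue', 11, 18, 21, 18],
    ['Wed', 15, 21, 20, 19],
]
print(timetable[1])     # the whole second row -> ['Tue', 11, 18, 21, 18]
print(timetable[1][3])  # row index 1, column index 3 -> 21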
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Matrix
https://www.tutorialspoint.com/python/python_matrix.htm
Matrix is a special case of two dimensional array where
each data element is of strictly same size. So every
matrix is also a two dimensional array but not vice versa.
For example, the following ragged 2-D array is allowed, but this data cannot go into a matrix:
  2  3  4
  1 12 14
  1  2
The data below (days with four readings each) can be represented as a two-dimensional array / matrix:

from numpy import *
a = array([['Mon',18,20,22,17],['Tue',11,18,21,18],
           ['Wed',15,21,20,19],['Thu',11,20,22,21],
           ['Fri',18,17,23,22],['Sat',12,22,20,18],
           ['Sun',13,15,19,16]])
m = reshape(a,(7,5))
print(m)

Output:
[['Mon' '18' '20' '22' '17']
 ['Tue' '11' '18' '21' '18']
 ['Wed' '15' '21' '20' '19']
 ['Thu' '11' '20' '22' '21']
 ['Fri' '18' '17' '23' '22']
 ['Sat' '12' '22' '20' '18']
 ['Sun' '13' '15' '19' '16']]
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Frames
https://www.geeksforgeeks.org/python-pandas-dataframe/
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A data frame is a two-dimensional data structure, i.e. data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
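A small illustrative DataFrame (the column names and row labels are made up for the example):

import pandas as pd

df = pd.DataFrame(
    {'name': ['Asha', 'Ravi', 'Meena'],
     'age': [24, 31, 28],
     'city': ['Pilani', 'Goa', 'Hyderabad']},
    index=['r1', 'r2', 'r3'])

print(df.shape)          # (3, 3): rows x columns
print(df['age'].mean())  # column access and aggregation
print(df.loc['r2'])      # row access by its label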
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Graphs Representation
https://www.geeksforgeeks.org/graph-and-its-representations/
https://www.scribd.com/doc/104543388/Graph-Representation
G=(V,E)
•A graph is a set of vertices and edges. A vertex may represent a
state or a condition while the edge may represent a relation
between two vertices.
Following two are the most commonly used representations of a
graph.
1. Adjacency Matrix
2. Adjacency List
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Adjacency Matrix
Adjacency Matrix is a 2D array of size V x V where V is the number of vertices in a
graph. Let the 2D array be adj[][], a slot adj[i][j] = 1 indicates that there is an edge
from vertex i to vertex j. Adjacency matrix for undirected graph is always symmetric.
Adjacency Matrix is also used to represent weighted graphs. If adj[i][j] = w, then there
is an edge from vertex i to vertex j with weight w.
Pros: Representation is easier to implement and follow. Removing an edge takes O(1)
time. Queries like whether there is an edge from vertex ‘u’ to vertex ‘v’ are efficient
and can be done O(1).
Cons: Consumes more space, O(V^2). Even if the graph is sparse (contains few edges), it consumes the same space. Adding a vertex takes O(V^2) time.
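A small sketch with NumPy, using an illustrative 4-vertex undirected graph:

import numpy as np

# Undirected graph on 4 vertices with edges (0,1), (0,2), (1,2), (2,3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
V = 4
adj = np.zeros((V, V), dtype=int)
for u, v in edges:
    adj[u, v] = 1
    adj[v, u] = 1          # symmetric for an undirected graph

print(adj)
print(adj[0, 2] == 1)      # O(1) check: is there an edge between 0 and 2?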
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Adjacency Matrix
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Adjacency List
An array of lists is used. Size of the array is equal to the number of vertices.
Let the array be array[]. An entry array[i] represents the list of vertices
adjacent to the ith vertex. This representation can also be used to
represent a weighted graph. The weights of edges can be represented as
lists of pairs. Following is adjacency list representation of the above graph.
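The same illustrative graph as an adjacency list, built with a dictionary of lists:

from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj_list = defaultdict(list)
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)   # undirected: store both directions

print(dict(adj_list))       # {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}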
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Where to use what
• Adjacency matrix is good for dense graphs.
• Adjacency lists are good for sparse graphs and also for changing the number of nodes.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Incidence Matrix Representation
The Incidence matrix of a graph G with N vertices and
E edges is N x E.
Mij = 1 if ej is incident on vi
Mij = 0 otherwise
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sparse Matrix (1/3)
• Sparse Matrix is a matrix which contains very few non-zero elements.
• When a sparse matrix is represented with a 2-dimensional array, we waste a
lot of space to represent that matrix.
• For example, consider a matrix of size 100 X 100 containing only 10 non-
zero elements. In this matrix, only 10 spaces are filled with non-zero values
and remaining spaces of the matrix are filled with zero. That means, we
allocate 100 X 100 X 2 = 20000 bytes of space to store this integer matrix.
• Sparse Matrix Representations can be done in many ways following are two
common representations:
o Array Representation
o Linked List representation
Sources: http://btechsmartclass.com/data_structures/sparse-matrix.html
https://www.geeksforgeeks.org/sparse-matrix-representation/
10 BAZG523(Introduction to Data Science)
Sparse Matrix (2/3)
Array Representation
2D array is used to represent a sparse matrix in which there are three rows
named as follows:
• Row: Index of row, where non-zero element is located
• Column: Index of column, where non-zero element is located
• Value: Value of the non zero element located at index – (Row, Column)
Sources: http://btechsmartclass.com/data_structures/sparse-matrix.html
https://www.geeksforgeeks.org/sparse-matrix-representation/
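A sketch of this (row, column, value) triplet idea using SciPy's coo_matrix (assuming SciPy is available; the matrix is illustrative):

import numpy as np
from scipy.sparse import coo_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]])

sparse = coo_matrix(dense)   # keeps only the non-zero entries
print(sparse.row)            # row indices of the non-zero elements    -> [0 1 2]
print(sparse.col)            # column indices of the non-zero elements -> [2 0 1]
print(sparse.data)           # the non-zero values themselves          -> [3 4 5]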
11 BAZG523(Introduction to Data Science)
Sparse Matrix (3/3)
Linked List Representation
In linked list, each node has four fields, which are defined as follows:
• Row: Index of row, where non-zero element is located
• Column: Index of column, where non-zero element is located
• Value: Value of the non zero element located at index – (row, column)
• Next node: Address of the next node.
Sources: http://btechsmartclass.com/data_structures/sparse-matrix.html
https://www.geeksforgeeks.org/sparse-matrix-representation/
12 BAZG523(Introduction to Data Science)
Libraries – Graphs and
Networks in Python
https://www.python-course.eu/graphs_python.php
https://www.python.org/doc/essays/graphs/
graph = { "a" : ["c"],
"b" : ["c", "e"],
"c" : ["a", "b", "d", "e"],
"d" : ["c"],
"e" : ["c", "b"],
"f" : [] }
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Coding Formats
Data Format defines the format in which data is coded.
• The information is coded in such a way that a program or application
can recognize, read and use the data.
• If software/hardware is no longer used, data can become
unreadable. In order to prevent this, it is vital to choose an open
format: that is a software format that is not attached to a certain
software supplier (proprietary software).
• Data formats are often indicated by their MIME type. MIME stands for Multipurpose Internet Mail Extensions. MIME provides the web browser with information on how to deal with a file.
• A MIME type is noted as two indications separated by a slash (MIME type/subtype). Example: text/plain is the MIME type for plain text.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Coding Formats
Application subtypes
• application/pdf: In transferring a file from one program to another, layout has a tendency to shift or disappear. To prevent this, applications exist that ensure a universally identical rendering of the document. One example is the PDF (Portable Document Format) type, a universal file format for the electronic exchange of documents while maintaining their layout.
• application/gml+xml: GML stands for Geographic Markup Language, a standard way of describing geographic information. Geographic data describe the world in spatial terms, in regular plain text. As such, this is a language independent of any sort of visualization.
• application/vnd.google-earth.kml+xml: geographic data is encoded in such a way that it can be read by so-called earth browsers, such as Google Earth, Google Maps, and Google Maps for mobile. Unlike the standard GML format, earth browsers do visualize data.
• application/x-java-archive: the dataset is related to the Java programming language.
• application/octet-stream: this MIME type covers a general kind of binary data not otherwise defined. It is a residual category used for datasets of an unclear or uncertain nature.
Text subtypes
• text/plain: the text has not been given any layout.
• text/html: HTML (HyperText Markup Language) is a format that indicates how the information should be visualized on a website. Its code allows you to indicate what text should look like, for example bold or in italics. This kind of layout is not present in plain text format.
• text/xml: the XML (eXtensible Markup Language) format does not include layout either, but allows you to provide additional information about the contents of the file. Metadata can be added, such as <title> for a title and <creator> for the person who created the document.
Numerical data formats
• application/x-matlab-data: numerical data type; MATLAB is an advanced scientific computation program.
• HDF5 (application/x-hdf5) & NetCDF (application/x-netcdf): store large amounts of numerical data (information in the form of numbers); a file containing numerical data is also called a binary file. They represent datasets with multidimensional arrays. HDF5 and NetCDF allow you to add metadata to the dataset as a whole, but also to the variables and dimensions within the set. Both have standard definitions for the values plotted on specific axes.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Formats
• Still image: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG (digital negative), BMP, GIF.
• Geospatial: Shapefile (SHP, DBF, SHX), GeoTIFF, NetCDF.
• Graphic image: raster formats (TIFF, JPEG2000, PNG, JPEG/JFIF, DNG, BMP, GIF); vector formats (Scalable Vector Graphics, AutoCAD Drawing Interchange Format, Encapsulated Postscripts, Shape files); cartographic, the most complete data (GeoTIFF, GeoPDF, GeoJPEG2000, Shapefile).
• Audio: WAVE, AIFF, MP3, MXF, FLAC.
• Video: MOV, MPEG-4, AVI, MXF.
• Database: XML, CSV, TAB, RDBMS, NoSQL.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Data Quality
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Quality 1/
Data quality is a perception or an assessment of data's fitness to serve
its purpose in a given context.
• Accuracy: measure of correctness of measured values; precision of values (32-bit, 64-bit).
• Completeness: measure of data availability for all real-time scenarios; handling of missing values.
• Reliability/Comparability: measure of consistency in data; degree to which data can be compared over time and across domains.
• Relevance: data collected for the same purpose, relevant for the current use case under observation.
• Timeliness and Punctuality: defines whether data is latest/current; with changes in environmental factors, older data may become outdated and hence unusable.
• Accessibility/Clarity: ease with which data can be accessed; formats in which data is available and the supporting information defining those formats; clarity implies sufficiency of metadata and supportive information.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
– Measurement and Collection Errors
– Accuracy, Precision and Bias in data
– Noise and Outliers
– Missing values
– Duplicate data
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Quality – Accuracy
Measure Statistically
• Accuracy – Measure of how close the current value is to the “True/Reference value”.
• The term measurement error refers to any problem resulting from the measurement process. A common problem is that the value
recorded differs from the true value to some extent. For continuous attributes, the numerical difference of the measured and true value
is called the error.
• The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data
object. For example, a study of animals of a certain species might include animals of a related species that are similar in appearance to
the species of interest.
• Both measurement errors and data collection errors can be either systematic or random.
• Precision refers to unreproducible, random measurement error that leads to variation among multiple measurements of the same quantity; it describes how close the measurements are to each other. Definition 2.3 (Precision): the closeness of repeated measurements (of the same quantity) to one another.
• Definition 2.4 (Bias): a systematic variation of measurements from the quantity being measured.
Variance (s²) = average squared deviation of values from the mean:
s² = (sum of squared deviations from the mean) ÷ (number of observations)
Standard deviation (s) = square root of the variance.
The standard deviation of this random noise characterizes the precision of the measurement instrument (a larger standard deviation corresponds to poorer precision).
Variance is a measure of heterogeneity in a given data set: the higher the variance, the more heterogeneous the data; the smaller the variance, the more homogeneous it is.
https://www.youtube.com/watch?v=0IiHPKAvo7g
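A minimal sketch of these statistics, assuming NumPy is available; the reference value and the repeated measurements are made-up numbers:

import numpy as np

true_value = 10.0                                             # hypothetical "true/reference" value
measurements = np.array([10.2, 9.8, 10.1, 9.9, 10.3, 9.7])   # repeated readings of the same quantity

mean = measurements.mean()
variance = ((measurements - mean) ** 2).mean()   # average squared deviation from the mean
std_dev = np.sqrt(variance)                      # precision: spread of the repeated measurements
bias = mean - true_value                         # systematic deviation from the true value

print("variance:", variance)
print("precision (std dev):", std_dev)
print("bias:", bias)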
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What is Bias & Variance in
data
https://towardsdatascience.com/what-is-ai-bias-
6606a3bcb814
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Noise
Noise is the random component of a measurement error. It may involve the
distortion of a value or the addition of spurious objects.
The term noise is often used in connection with data that has a spatial or
temporal component.
In such cases, techniques from signal or image processing can frequently be
used to reduce noise and thus help to discover patterns (signals) that might
be "lost in the noise." Nonetheless, the elimination of noise is frequently
difficult, and much work in data mining focuses on devising robust
algorithms that produce acceptable results even when noise is present.
(a) Time series. (b) Time series with noise.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outliers
• Outliers are either
• (1) data objects that, in some sense, have characteristics that are different from
most of the other data objects in the data set, or
• (2) values of an attribute that are unusual with respect to the typical values for
that attribute. Alternatively, we can speak of anomalous objects or values.
• Furthermore, it is important to distinguish between the
notions of noise and outliers. Outliers can be legitimate
data objects or values. Thus, unlike noise, outliers may
sometimes be of interest.
• In fraud and network intrusion detection, for example, the
goal is to find unusual objects or events from among a
large number of normal ones.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Dimensions
A Data Dimension is a set of data attributes pertaining to something of interest
to a business.
Dimensions store the textual descriptions of the business. Without the dimensions, we cannot measure the facts. The different types of dimension tables are explained in detail below.
For example, a business might want to know how many blue widgets were sold at a specific store in India last month.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
High Dimensionality of Data
High dimensional means that the number of dimensions is staggeringly high, so high that calculations become extremely difficult. With high dimensional data, the number of features can exceed the number of observations.
For example, microarrays, which measure gene expression, can contain hundreds of samples, and each sample can contain tens of thousands of genes.
So if "p" is the number of dimensions (features) and "n" is the number of observations, then a dataset with p > n is a high dimensional dataset, i.e. the number of features available in the dataset is greater than the number of observations.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Curse of Dimensionality
The curse of dimensionality usually refers to what happens when you add more and more variables to a multivariate model: each added variable can result in an exponential decrease in predictive power.
The statistical curse of dimensionality refers to a related fact: the required sample size n grows exponentially with the number of dimensions d. In simple terms, adding more dimensions can mean that the sample size you need quickly becomes unmanageable.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Epicycle of Data Analysis
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Epicycle of Data Analysis
https://bookdown.org/rdpeng/artofdatascience/epicycles-of-
analysis.html
There are 5 core activities of data analysis:
1. Stating and refining the question
2. Exploring the data
3. Building formal statistical models
4. Interpreting the results
5. Communicating the results
More specifically, for each of the five core activities, it is critical that you engage in the following three steps:
1. Setting expectations;
2. Collecting information (data) and comparing the data to your expectations; and
3. If the expectations don't match, revising your expectations or fixing the data so that your data and your expectations match.
Iterating through this 3-step process is what we call the
“epicycle of data analysis.” As you go through every
stage of an analysis, you will need to go through the
epicycle to continuously refine your question, your
exploratory data analysis, your formal models, your
interpretation, and your communication.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References
• https://unstats.un.org/unsd/dnss/docs-nqaf/UK-Guidelines_Subject.pdf
• https://www.folkstalk.com/2010/01/types-of-dimensions.html?m=1
• https://www.folkstalk.com/2010/01/data-warehouse-dimensional-modelling.html
• https://www.metrology-journal.org/articles/ijmqe/full_html/2017/01/ijmqe160046/ijmqe160046.html
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Appendix
Extra Read
Types of Data Dimensions
• Conformed Dimensions: conformed dimensions mean the exact same thing with every possible fact table to which they are joined; e.g. the date dimension connected to the sales facts is identical to the date dimension connected to the inventory facts.
• Junk Dimensions: a junk dimension is a collection of random transactional codes or text attributes used to correlate some unrelated attributes; e.g. the gender and marital status attributes can be combined into one junk dimension which stores all possible combinations of these two attribute values.
• Degenerated Dimensions: a degenerated dimension is a dimension which is derived from the fact table and does not have its own dimension table; e.g. transactional IDs in a fact table.
• Role-playing Dimensions: role-playing dimensions are dimensions which are often used for multiple purposes within the same database; for example, a date dimension can be used for "date of sale" as well as "date of delivery" or "date of hire".
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Dimensions- Storage
Schemas
1. Star Schema: a star schema is one in which a central fact table is surrounded by de-normalized dimension tables. A star schema can be simple or complex: a simple star schema consists of one fact table, whereas a complex star schema has more than one fact table.
2. Snowflake Schema: a snowflake schema is an enhancement of the star schema obtained by adding further (normalized) dimension tables. Snowflake schemas are useful when there are low-cardinality attributes in the dimensions.
3. Galaxy Schema: a galaxy schema contains many fact tables with some common (conformed) dimensions. This schema is a combination of many data marts.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data and Data Models
Lecture – 8
Sumita Narang
Objectives
Types of Data and Datasets
Data Quality
Epicycles of Data Analysis
Data Models
– Model as expectation
– Comparing models to reality
– Reactions to Data
– Refining our expectations
Six Types of the Questions
Characteristics of Good Question
Formal modelling
– General Framework
– Associational Analyses
Prediction Analyses
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Modelling & Evaluation
1. Data Model Development & experiment framework setup
• Data Modelling based on training sets - At its core, a statistical model provides a
description of how the world works and how the data were generated.
• Framework to feed in new data and test the models
• Framework to change training data and retrain model based on new data sets as sliding
window
• 3 main tasks involved -
• Feature Engineering: Create data features from the raw data to facilitate model
training
• Model Training: Find the model that answers the question most accurately by
comparing their success metrics
• Determine if your model is suitable for production
2. Data Model Evaluation & KPI Checks
• Read papers, research material to finalize the algorithmic approaches
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
1. Define the Problem Appropriately
The first, and one of the most critical, things to do is to find out what the inputs and the expected outputs are. The following questions must be answered:
• What is the main objective? What are we trying to predict?
• What are the target features?
• What is the input data? Is it available?
• What kind of problem are we facing? Binary classification? Clustering?
• What is the expected improvement?
• What is the current status of the target feature?
• How is the target feature going to be measured?
5
BITS Pilani, Pilani Campus
2. Developing a Benchmark
Model
Benchmark Model
The goal in this step of the process is to develop a benchmark model that serves as a baseline, against which we'll measure the performance of a better and more attuned algorithm.
Benchmarking requires experiments to be comparable, measurable, and reproducible. It is important to emphasize the reproducible part of that statement. Nowadays, data science libraries perform random splits of data, and this randomness must be consistent across all runs. Most random generators support setting a seed for this purpose. In Python we will use the random.seed method from the random package (a short sketch is shown below).
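A minimal sketch of fixing seeds so that splits are reproducible across runs, assuming NumPy and scikit-learn are available; X and y here are placeholder data:

import random
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42                       # any fixed value, reused across all runs
random.seed(SEED)               # Python's random module, as mentioned above
np.random.seed(SEED)            # NumPy's global random generator

# Placeholder data: 100 samples with 5 features and a binary label
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# With the same random_state, this split is identical on every run,
# which keeps benchmark experiments comparable and reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)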
Ideal models to calibrate against:
• Null model
• Bayes rate model
• Normal model
17
BITS Pilani, Pilani Campus
Ideal models to calibrate against
Null Model
A null model is the best model of a very simple form
you’re trying to outperform. The two most typical null
model choices are a model that is a single constant
(returns the same answer for all situations) or a model
that is independent (doesn’t record any important
relation or interaction between inputs and outputs).
We use null models to lower-bound desired
performance, so we usually compare to a best null
model.
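One way to obtain such a lower bound in practice (a sketch, assuming scikit-learn; the data is a random placeholder) is a dummy model that always returns a constant answer, such as the most frequent class:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: 200 samples, 4 features, binary label
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = rng.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Null model: always predicts the most frequent class seen in training
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_train, y_train)
print("null-model accuracy:", null_model.score(X_test, y_test))
# Any candidate model should beat this score to be worth keeping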
17
BITS Pilani, Pilani Campus
Ideal models to calibrate against
P(A | B) = (P(B | A) * P(A)) / P(B)
Bayes rate model
• The Bayes Optimal Classifier is a probabilistic model that makes the most
probable prediction for a new example. A Bayes rate model (also sometimes
called a saturated model) is a best possible model given the data at hand.
• The Bayes rate model is the perfect model and it only makes mistakes when
there are multiple examples with the exact same set of known facts (same
xs) but different outcomes (different ys).
• It isn’t always practical to construct the Bayes rate model, but we invoke it
as an upper bound on a model evaluation score.
P(vj | D) = Σ over hi in H of [ P(vj | hi) * P(hi | D) ]
where vj is a new instance to be classified, H is the set of hypotheses for classifying the instance, hi is a given hypothesis, P(vj | hi) is the posterior probability of vj given hypothesis hi, and P(hi | D) is the posterior probability of hypothesis hi given the
data D. https://machinelearningmastery.com/bayes-optimal-classifier/ 18
BITS Pilani, Pilani Campus
Ideal models to calibrate
against
Normal Model
This model says that the randomness in a set of data can
be explained by the Normal distribution, or a bell-shaped
curve. The Normal distribution is fully specified by two
parameters—the mean and the standard deviation.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
3. Comparing Model
Expectation to Reality
We may be very proud of developing our statistical model,
but ultimately its usefulness will depend on how closely it
mirrors the data we collect in the real world.
(Figure: price survey data with a Normal distribution.)
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Choose a Measure of Success
“If you can’t measure it you can’t improve it”.
If you want to control something, it should be observable, and in order to achieve success it is essential to define what is considered success: maybe precision? Accuracy? Customer-retention rate?
This measure should be directly aligned with the higher-level goals of the business at hand. It is also directly related to the kind of problem we are facing:
• Regression problems use evaluation metrics such as mean squared error (MSE).
• Classification problems use evaluation metrics such as precision, accuracy and recall.
A short sketch of computing these metrics is shown below.
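A minimal sketch of these measures using scikit-learn's metrics module; the true/predicted values below are made up for illustration:

from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score

# Regression: compare true and predicted continuous values
y_true_reg = [3.0, 2.5, 4.0, 5.1]
y_pred_reg = [2.8, 2.7, 4.2, 4.9]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))

# Classification: compare true and predicted class labels
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
print("accuracy: ", accuracy_score(y_true_cls, y_pred_cls))
print("precision:", precision_score(y_true_cls, y_pred_cls))
print("recall:   ", recall_score(y_true_cls, y_pred_cls))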
8
BITS Pilani, Pilani Campus
Model Evaluation Metrics
Performance Metrics vary based on type of models i.e.
Classification Models, Clustering Models, Regression
Models.
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Setting an Evaluation Protocol
Maintaining a Hold Out Validation Set
This method consists of setting apart some portion of the data as the test set. The process is to train the model with the remaining fraction of the data, tune its parameters with the validation set, and finally evaluate its performance on the test set.
The reason to split the data into three parts is to avoid information leaks. The main drawback of this method is that if little data is available, the validation and test sets will contain so few samples that the tuning and evaluation of the model will not be effective. A sketch of this split is shown below.
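A sketch of this three-way split, assuming scikit-learn; two chained train_test_split calls give train/validation/test partitions of placeholder data:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples, 10 features, binary label
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First set aside the final test set (20%), then carve a validation set
# (20% of the remainder) out of what is left for training
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42)

print(len(X_train), "train /", len(X_val), "validation /", len(X_test), "test samples")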
9
BITS Pilani, Pilani Campus
Setting an Evaluation Protocol
K-Fold Validation
K-Fold consists of splitting the data into K partitions of equal size. For each partition i, the model is trained on the remaining K-1 partitions and evaluated on partition i.
The final score is the average of the K scores obtained. This technique is especially helpful when the performance of the model varies significantly depending on the particular train-test split.
10
BITS Pilani, Pilani Campus
Setting an Evaluation Protocol
Iterated K-Fold Validation with Shuffling
It consists of applying K-Fold validation several times, shuffling the data each time before splitting it into K partitions. The final score is the average of the scores obtained at the end of each run of K-Fold validation.
This method can be very computationally expensive, as the number of models trained and evaluated is I x K, where I is the number of iterations and K is the number of partitions. A short sketch is shown below.
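A sketch of iterated K-fold with shuffling, assuming scikit-learn; RepeatedKFold reshuffles the data before each of the I repetitions, so I x K models are fitted in total:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder regression data
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ rng.rand(5) + 0.1 * rng.randn(200)

K, I = 5, 3                                       # K partitions, I iterations
cv = RepeatedKFold(n_splits=K, n_repeats=I, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("models trained and evaluated:", len(scores))   # I x K = 15
print("final score (average):", scores.mean())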
11
BITS Pilani, Pilani Campus
4. Reacting to Data: Refining Our
Expectations
Okay, so the model and the data don't match very well, as was indicated by the histogram above. So what do we do? Well, we can either
1. Get a different model; or
2. Get different data.
(Figure: price survey data with a Gamma distribution.)
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Validating models
Identifying common model problems
Bias - Systematic Error
Variance - Undesirable (but non-systematic) distance
between predictions and actual values.
Overfit
Nonsignificance: A model that appears to show an
important relation when in fact the relation may not hold
in the general population, or equally good predictions
can be made without the relation.
29
BITS Pilani, Pilani Campus
Validating models
30
BITS Pilani, Pilani Campus
Ensuring model quality
Testing on Held-Out Data
k-fold cross-validation
The idea behind k-fold cross-validation is to repeat the construction
of the model on different subsets of the available training data and
then evaluate the model only on data not seen during construction.
This is an attempt to simulate the performance of the model on
unseen future data.
Significance Testing
“What is your p-value?”
31
BITS Pilani, Pilani Campus
Balancing Bias & Variance to
Control Errors in Machine Learning
https://towardsdatascience.com/balancing-bias-and-variance-to-control-errors-in-machine-learning-16ced95724db
Y = f(X) + e
Estimation of this relation, or f(X), is known as statistical learning. In general, we won't be able to make a perfect estimate of f(X), and this gives rise to an error term known as the reducible error. The accuracy of the model can be improved by making a more accurate estimate of f(X) and therefore reducing the reducible error. But even if we make a 100% accurate estimate of f(X), our model won't be error-free; the remaining error is known as the irreducible error (e in the above equation).
The quantity e may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction. The quantity e may also contain unmeasurable variation.
Bias
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a
much simpler model. So, if the true relation is complex and you try to use linear regression, then it will undoubtedly
result in some bias in the estimation of f(X). No matter how many observations you have, it is impossible to produce
an accurate prediction if you are using a restrictive/ simple algorithm, when the true relation is highly complex.
Variance
Variance refers to the amount by which your estimate of f(X) would change if we estimated it using a different training data
set. Since the training data is used to fit the statistical learning method, different training data sets will result in a
different estimation. But ideally the estimate for f(X) should not vary too much between training sets. However, if a
method has high variance then small changes in the training data can result in large changes in f(X).
A general rule is that, as a statistical method tries to match data points more closely or when a more flexible
method is used, the bias reduces, but variance increases.
In order to minimize the expected test error, we need to select a statistical learning method that simultaneously
achieves low variance and low bias.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
When Do We Stop the Training & Model Fitment Process?
Six Types of the Questions
1. Are you out of data?
• Iterative data analysis will eventually begin to raise questions that simply cannot be
answered with the data at hand.
• Another situation in which you may find yourself seeking out more data is when
you’ve actually completed the data analysis and come to satisfactory results, usually
some interesting finding. Then, it can be very important to try to replicate whatever
you’ve found using a different, possibly independent, dataset.
2. Do you have enough evidence to make a decision?
• It’s important to always keep in mind the purpose of the data analysis as you go
along because you may over or under-invest resources in the analysis if the
analysis is not attuned to the ultimate goal.
• The question of whether you have enough evidence depends on factors specific to
the application at hand and your personal situation with respect to costs and
benefits.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Six Types of the Questions
3. Can you place your results in any larger context?
• Do the results hold on a larger dataset and in a bigger context, and do they match past observations?
• Does the model work well on multiple variations of the data, or on different data sets?
4. Are you out of time?
• Project timelines are approaching.
• Is the model able to give results within the defined application SLAs, or is it running for hours to produce results?
5. Is your model overfitting?
6. Is your model able to handle noisy, messy data and outliers?
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
7.2 General Framework
Formal Modelling
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
General Framework
1. Setting expectations. Setting expectations comes in the form of developing a primary model that represents your best sense of what provides the answer to your question. This model is chosen based on whatever information you have currently available.
2. Collecting Information. Once the primary model is set, we
will want to create a set of secondary models that challenge
the primary model in some way. We will discuss examples of
what this means below.
3. Revising expectations. If our secondary models are
successful in challenging our primary model and put the
primary model’s conclusions in some doubt, then we may
need to adjust or modify the primary model to better reflect
what we have learned from the secondary models.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Primary Model
• It’s often useful to start with a primary model. This model will
likely be derived from any exploratory analyses that you have
already conducted and will serve as the lead candidate for
something that succinctly summarizes your results and
matches your expectations. It’s important to realize that at any
given moment in a data analysis, the primary model is not
necessarily the final model.
• Through the iterative process of formal modeling, you may
decide that a different model is better suited as the primary
model. This is okay, and is all part of the process of setting
expectations, collecting information, and refining expectations
based on the data.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Secondary models
• Once you have decided on a primary model, you will
then typically develop a series of secondary models. The
purpose of these models is to test the legitimacy and
robustness of your primary model and potentially
generate evidence against your primary model. If the
secondary models are successful in generating evidence
that refutes the conclusions of your primary model, then
you may need to revisit the primary model and whether
its conclusions are still reasonable.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Analysis
There are three classes of variables that are important to think about in an
associational analysis.
1. Outcome. The outcome is the feature of your dataset that is thought to change
along with your key predictor. Even if you are not asking a causal or
mechanistic question, so you don’t necessarily believe that the outcome
responds to changes in the key predictor, an outcome still needs to be defined
for most formal modeling approaches.
2. Key predictor. Often for associational analyses there is one key predictor of
interest (there may be a few of them). We want to know how the outcome
changes with this key predictor. However, our understanding of that relationship
may be challenged by the presence of potential confounders.
3. Potential confounders. This is a large class of predictors that are both related to
the key predictor and the outcome. It’s important to have a good understanding
what these are and whether they are available in your dataset. If a key
confounder is not available in the dataset, sometimes there will be a proxy that
is related to that key confounder that can be substituted instead.
The basic form of a model in an associational analysis is a regression of the outcome (y) on the key predictor (x) and the potential confounder (z), e.g. y = β0 + β1·x + γ·z + ε, where ε represents random error. A sketch follows below.
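A hedged sketch of a model of this form using the statsmodels formula interface; the simulated data and the column names (outcome, key_predictor, confounder) are placeholders of my own:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: the outcome depends on a key predictor and a confounder,
# and the confounder is also related to the key predictor
rng = np.random.RandomState(0)
n = 500
confounder = rng.normal(size=n)
key_predictor = 0.5 * confounder + rng.normal(size=n)
outcome = 2.0 * key_predictor + 1.5 * confounder + rng.normal(size=n)
df = pd.DataFrame({"outcome": outcome,
                   "key_predictor": key_predictor,
                   "confounder": confounder})

# Regress the outcome on the key predictor while adjusting for the confounder
model = smf.ols("outcome ~ key_predictor + confounder", data=df).fit()
print(model.params)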
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Linear Regression
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
Regularization is a form of regression that constrains/regularizes or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
A simple relation for linear regression looks like this. Here Y represents the learned
relation and β represents the coefficient estimates for different variables or
predictors(X).
Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
The fitting procedure involves a loss function, known as residual sum of squares or RSS.
The coefficients are chosen, such that they minimize this loss function. Now, this will
adjust the coefficients based on your training data. If there is noise in the training
data, then the estimated coefficients won’t generalize well to the future data. This is
where regularization comes in and shrinks or regularizes these learned estimates
towards zero.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regularization
Ridge Regression
In ridge regression, the RSS is modified by adding a shrinkage quantity: a penalty proportional to the sum of the squared coefficients (λ · Σ βj²). The coefficients are then estimated by minimizing this modified function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. The increase in the flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the penalized function, these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high.
Lasso Regression
Lasso regression modifies the RSS in the same way, but the penalty is proportional to the sum of the absolute values of the coefficients (λ · Σ |βj|); unlike ridge, this can shrink some coefficients exactly to zero, effectively performing variable selection.
What does Regularization achieve?
A standard least squares model tends to have some variance in it, i.e. the model won't generalize well to a data set different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias. The tuning parameter λ, used in the regularization techniques described above, controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the value of λ should be carefully selected. A small sketch with scikit-learn follows below.
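A sketch comparing ordinary least squares with ridge and lasso in scikit-learn; the alpha parameter plays the role of the tuning parameter λ described above, and the simulated data is a placeholder:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Simulated data: only the first three predictors actually matter
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.randn(100)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (alpha=10)", Ridge(alpha=10.0)),
                    ("Lasso (alpha=0.5)", Lasso(alpha=0.5))]:
    model.fit(X, y)
    # Larger alpha (lambda) shrinks the coefficients towards zero;
    # lasso can set some of them exactly to zero
    print(name, np.round(model.coef_, 2))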
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Predictive Modelling
Predictive modeling refers to the task of building a model for the
target variable as a function of the explanatory variables. There are
two types of predictive modeling tasks:
• Classification, which is used for discrete target variables, and
regression, which is used for continuous target variables. For
example, predicting whether a Web user will make a purchase at an
online bookstore is a classification task because the target variable
is binary-valued.
• On the other hand, forecasting the future price of a stock is a
regression task because price is a continuous-valued attribute. The
goal of both tasks is to learn a model that minimizes the error
between the predicted and true values of the target variable.
Predictive modeling can be used to identify customers that will respond
to a marketing campaign, predict disturbances in the Earth's
ecosystem, or judge whether a patient has a particular disease based
on the results of medical tests.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification Predictive
Modelling Techniques
In classification problems, we use two types of algorithms (dependent on
the kind of output it creates):
• Class output: Algorithms like SVM and KNN create a class output. For
instance, in a binary classification problem, the outputs will be either 0
or 1. However, today we have algorithms which can convert these class
outputs to probability. But these algorithms are not well accepted by the
statistics community.
• Probability output: Algorithms like Logistic Regression, Random
Forest, Gradient Boosting, Adaboost etc. give probability outputs.
Converting probability outputs to class output is just a matter of creating
a threshold probability.
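A sketch of converting a probability output to a class output with a threshold, using scikit-learn's LogisticRegression on placeholder data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder binary-classification data
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.randn(200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]          # probability output for class 1
threshold = 0.5                             # choosing a threshold gives a class output
class_output = (proba >= threshold).astype(int)

# For binary problems, predict() effectively applies the same 0.5 threshold
print(np.array_equal(class_output, clf.predict(X)))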
Classification Algorithms vs Clustering Algorithms
In clustering, the idea is not to predict the target class as in classification; rather, it is to group similar kinds of items together, so that all the items in the same group are similar to each other and items from two different groups are not similar.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
ML Models
(Figure: summary of ML models.)
BITS Pilani, Pilani Campus
Algorithms
BITS Pilani, Pilani Campus
Model - Tuning its
Hyperparameters
Finding a Good Model
One of the most common methods for finding a good model is cross validation.
In cross validation we will set:
• A number of folds in which we will split our data.
• A scoring method (that will vary depending on the problem’s nature —
regression, classification…).
• Some appropriate algorithms that we want to check.
We'll pass our dataset to our cross-validation score function and get the model that yields the best score. That will be the one we will optimize, tuning its hyperparameters accordingly.
18
BITS Pilani, Pilani Campus
Model - Tuning its
Hyperparameters
# Test options and evaluation metrics
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import KFold, cross_val_score

num_folds = 10
seed = 7                             # fixed seed so the K-fold splits are reproducible
scoring = "neg_mean_squared_error"

# Spot-check algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

# X_train and y_train are assumed to come from an earlier train/test split
results = []
names = []
for name, model in models:
    # shuffle=True is required when a random_state is passed to KFold
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
19
BITS Pilani, Pilani Campus
Model - Tuning its
Hyperparameters
# Compare algorithms with box plots of their cross-validation scores
from matplotlib import pyplot

fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)           # 'results' and 'names' come from the previous slide
ax.set_xticklabels(names)
pyplot.show()
20
BITS Pilani, Pilani Campus
Exercise towards
Applications
Implementation of Modeling for Customer Churn in Telecom Industry
Data Set: https://www.kaggle.com/becksddf/churn-in-telecoms-dataset
Implementation of Modeling for Health Industry – Heart Diseases
Data Set: https://www.kaggle.com/ronitf/heart-disease-uci
BITS Pilani, Pilani Campus
References
• https://www.bouvet.no/bouvet-deler/roles-in-a-data-science-project
• https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
• https://www.quora.com/What-is-the-life-cycle-of-a-data-science-project
• https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492
• https://www.dezyre.com/article/life-cycle-of-a-data-science-project/270
• https://www.slideshare.net/priyansakthi/methods-of-data-collection-16037781
• https://www.questionpro.com/blog/qualitative-data/
• https://surfstat.anu.edu.au/surfstat-home/1-1-1.html
• https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
• https://www.coursera.org/learn/decision-making
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956