CH1 Introduction To Data Science BS
Dr Setturu Bharath
Syllabus
Unit 1:
Fundamentals of Data Science, Real World applications, Data Science vs BI, Data Science vs Statistics, Roles and
responsibilities of a Data Scientist, Software Engineering for Data Science, Data Scientists Toolbox, Data Science
Challenges.
Defining Analytics: Types of data analytics: Descriptive, Diagnostic, Predictive, Prescriptive Data Analytics; methodologies: CRISP-DM Methodology, SEMMA, Big Data Life Cycle, SMAM, Effort Estimate, Analytics Capacity Building, Challenges in Data-driven decision making. Data Science Process: Data Science methodology, Business understanding, Data Requirements, Data Acquisition, Data Understanding, Data preparation, Modelling, Model Evaluation, Deployment and feedback, Case Study, Data Science Proposal, Samples, Evaluation, Review Guide.
Unit 2:
Defining Data Team: Roles in a Data Science Team: Data Scientists, Data Engineers; Managing a Data Team: Onboarding and evaluating the success of the team, Working with other teams, Common difficulties; Data and Data Models: Types of Data and Datasets, Data Quality, Epicycles of Data Analysis, Data Models, Model as expectation, Comparing models to reality, Reactions to Data, Refining our expectations, Six Types of Questions, Characteristics of a Good Question; Formal modelling: General Framework, Associational Analyses, Prediction Analyses; Introduction to OLTP, OLAP.
Unit 3:
Properties of functions, Data wrangling and Feature Engineering, Data cleaning, Data Aggregation, Sampling, Handling Numeric Data: Discretization, Binarization, Normalization, Data Smoothening, Dealing with textual Data, Managing Categorical Attributes, Transforming Categorical to Numerical Values: Encoding techniques, Feature Engineering, Feature Extraction (Dimensionality Reduction), Feature Construction, Feature Subset selection, Filter methods, Wrapper methods, Embedded methods, Feature Learning, Case Study involving FE tasks. Lab: Feature Extraction and Subset selection.
Unit 4:
Need for Data visualization, Exploratory vs Explanatory Analysis, Tables, Axis-based Visualization and Statistical Plots, Lessons in Data Visualization Design, The Data Visualization Design Process, Stories and Dashboards, Storytelling with Data: The final deliverable, The Narrative: report / presentation structure, Building narrative with Data, Effective Storytelling.
Ethics for Data Science: Bias and Fairness, Types of Bias, Identifying Bias, Evaluating Bias, Being a data skeptic: examples of misuse of Data, Doing Good Science: the Five C's, Ethical guidelines for Data Scientists, Ethics of data scraping and storage, Case Study: IBM AI Fairness 360.
Textbooks
1. Introducing Data Science by Cielen, Meysman and Ali.
2. Grus, J., 2019. Data science from scratch: first principles with python. O'Reilly Media.
3. Storytelling with Data: A data visualization guide for business professionals by Cole Nussbaumer Knaflic.
4. Introduction to Data Mining by Tan, Steinbach, and Vipin Kumar.
5. The Art of Data Science by Roger D. Peng and Elizabeth Matsui.
6. Ethics and Data Science by DJ Patil, Hilary Mason, Mike Loukides.
7. Python Data Science Handbook: Essential tools for working with data by Jake VanderPlas.
8. KDD, SEMMA, and CRISP-DM: A Parallel Overview, Ana Azevedo and M.F. Santos, IADS-DM, 2008.
Web Resources
1. https://www.tableau.com/sites/default/files/whitepapers/752750_core_why_visual_analytics_whitepaper_0.pdf
2. https://nbviewer.org/github/IBM/AIF360/blob/master/examples/tutorial_medical_expenditure.ipynb
3. https://nbviewer.org/github/IBM/AIF360/blob/master/examples/tutorial_credit_scoring.ipynb
4. https://data-flair.training/blogs/data-analytics-tutorial/
5. http://cdnlarge.tableausoftware.com/sites/default/files/whitepapers/visual_analysis_for-everyone.pdf
Data science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today.
A data scientist is someone who knows more statistics than a computer scientist and more computer
science than a statistician.
Data science
In 2012, the Obama campaign employed dozens of data scientists who data-mined and experimented their
way to identifying voters who needed extra attention, choosing optimal donor-specific fundraising appeals
and programs, and focusing get-out-the-vote efforts where they were most likely to be useful.
In 2016 the Trump campaign tested a staggering variety of online ads and analyzed the data to find what
worked and what didn’t.
(Diagram: data scientist skill areas: Data Management, Analytics Modeling, Business Analysis, and Soft Skills.)
The main things that set a data scientist apart from a statistician are the ability to work with big data and experience in
machine learning, computing, and algorithm building. Their tools tend to differ too, with data scientist job descriptions more
frequently mentioning the ability to use Hadoop, Pig, Spark, R, Python, and Java, among others.
Data types and data sources
(Diagram: the main data types: Structured, Semi-Structured, and Unstructured.)
Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record.
As such, it’s often easy to store structured data in tables within databases or Excel files.
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.
Not all data fits this model, however; other types of data give you a hard time when you try to store them in a traditional relational database.
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific
or varying.
One example of unstructured data is a regular email. Although email contains structured elements
such as the sender, title, and body text, analyzing its free-form content is a challenge, and the
thousands of different languages and dialects out there further complicate this.
Graph-based or network data → Semi Structured/Unstructured
“Graph data” can be a confusing term because any data can be shown in a graph.
“Graph” in this case points to mathematical graph theory. In graph theory, a graph
is a mathematical structure to model pair-wise relationships between objects.
The graph structures use nodes, edges, and properties to represent and store
graphical data.
Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and the
shortest path between two people
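To make this concrete, here is a minimal sketch in plain Python with a hypothetical five-person network: breadth-first search finds the shortest chain of acquaintances between two people, and a simple degree count serves as a crude influence metric.

```python
from collections import deque

# A tiny social network as an adjacency list (hypothetical people).
network = {
    "Ann":  ["Bob", "Cleo"],
    "Bob":  ["Ann", "Dave"],
    "Cleo": ["Ann", "Dave"],
    "Dave": ["Bob", "Cleo", "Eve"],
    "Eve":  ["Dave"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns the shortest chain of friends."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection exists

print(shortest_path(network, "Ann", "Eve"))  # e.g. ['Ann', 'Bob', 'Dave', 'Eve']

# A crude influence proxy: degree centrality (number of direct connections).
influence = {person: len(friends) for person, friends in network.items()}
print(influence)
```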
Audio, image, and video → Unstructured
Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
Graph or network data focuses on the relationship or adjacency of objects to simplify the data storage
and retrieval.
Machine-generated data
Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. It is becoming a major data resource and will continue to be one.
Retrieving data
The second step is to collect data.
You should be clear about which data you need and where you can find it.
In this step you ensure that you can use the data in your program, which
means checking the existence of, quality, and access to the data.
Data can also be delivered by third-party companies and takes many forms
ranging from Excel spreadsheets to different types of databases.
The data science process
Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data and prepare it for use in
subsequent steps.
This phase consists of three subphases: data cleansing removes false values from a data source and inconsistencies
across data sources, data integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
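As an illustration only (the table names and quality issues below are invented), the three subphases might look like this with pandas:

```python
import pandas as pd
import numpy as np

# Hypothetical customer records from two sources, with typical quality issues.
crm = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                    "country": ["US", "us", "us", None]})
orders = pd.DataFrame({"customer_id": [1, 2, 3],
                       "amount": [100.0, -5.0, 250.0]})

# Data cleansing: drop duplicates, fix inconsistent casing, remove impossible values.
crm = crm.drop_duplicates(subset="customer_id")
crm["country"] = crm["country"].str.upper()
orders = orders[orders["amount"] >= 0]

# Data integration: enrich one source with information from the other.
df = orders.merge(crm, on="customer_id", how="left")

# Data transformation: put the data in a format suitable for modeling.
df["log_amount"] = np.log1p(df["amount"])
print(df)
```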
The data science process
Data exploration
Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the data, and whether there are
outliers. To achieve this you mainly use descriptive statistics, visual techniques, and simple modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
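A short sketch of EDA with an invented dataset: descriptive statistics summarize the distribution, a correlation matrix shows how the variables interact, and the 1.5 × IQR rule flags outliers.

```python
import pandas as pd

# A hypothetical numeric dataset.
df = pd.DataFrame({"age":    [23, 25, 31, 35, 38, 41, 44, 52, 58, 120],
                   "income": [28, 32, 40, 48, 52, 61, 66, 80, 95, 99]})

print(df.describe())   # distribution summary: mean, quartiles, etc.
print(df.corr())       # how the variables interact

# Flag outliers with the 1.5 * IQR rule on 'age'.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)        # the 120-year-old record stands out
```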
The data science process
Data modeling
You select a technique from the fields of statistics, machine learning, operations research, and so on. Building a model
is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics.
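One way to picture that iteration, sketched with scikit-learn's bundled diabetes dataset (the candidate variable subsets are arbitrary choices): fit each subset, diagnose it with cross-validation, and keep the best.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Iterate on variable selection: score candidate feature subsets and compare.
X, y = load_diabetes(return_X_y=True, as_frame=True)
candidates = [["bmi"], ["bmi", "bp"], ["bmi", "bp", "s5"]]

for features in candidates:
    model = LinearRegression()
    score = cross_val_score(model, X[features], y, cv=5, scoring="r2").mean()
    print(features, round(score, 3))  # keep the subset that diagnoses best
```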
Cross-Industry Standard Process for Data Mining (CRISP-DM): phases, generic tasks, and outputs

Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation (outputs: Data Set; Data Set Description)
• Select Data: Rationale for Inclusion/Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
The data science process
Presentation and automation
You present your results to stakeholders; these results can take many forms, ranging from presentations to research reports.
Sometimes you’ll need to automate the execution of the process because the business will want to use the insights
you gained in another project or enable an operational process to use the outcome from your model.
Mathematical Aspects
• Mathematics is fundamental for understanding and solving complex data-related problems.
• Key areas include statistics, linear algebra, calculus, probability, and optimization.
• Linear and nonlinear programming are techniques for finding the best outcome in a mathematical model with constraints.
(Related fields: Computational Geometry, Optimization, Stochastics, Scientific Computing, Machine Learning.)
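For example, a linear program can be solved with SciPy; the profit and capacity numbers below are invented for illustration.

```python
from scipy.optimize import linprog

# Hypothetical product mix: maximize profit 40x + 30y subject to
# machine time x + y <= 40 and labor 2x + y <= 60, with x, y >= 0.
# linprog minimizes, so negate the objective coefficients.
result = linprog(c=[-40, -30],
                 A_ub=[[1, 1], [2, 1]],
                 b_ub=[40, 60],
                 bounds=[(0, None), (0, None)])
print(result.x)     # optimal quantities: x = 20, y = 20
print(-result.fun)  # maximum profit: 1400
```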
Statistical Aspects
• Descriptive Statistics: summarizing and describing the features of a dataset. Includes mean, median, mode, standard deviation, and variance.
• Inferential Statistics: drawing conclusions and making predictions about a population based on a sample. Includes hypothesis testing, confidence intervals, and regression analysis.
(Related topics: Linear Models, Statistical Tests, Inference.)

Computational Aspects
• Complexity Analysis: understanding the time and space complexity of algorithms to ensure they are scalable and efficient.
• Data Mining: extracting patterns from large data sets.
(Related fields: Software Engineering, Artificial Intelligence, Machine Learning.)
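A brief illustration of both kinds of statistics, with invented samples:

```python
import numpy as np
from scipy import stats

# Hypothetical samples: task completion times for two page designs.
a = np.array([12.1, 11.8, 12.6, 12.3, 11.9, 12.4])
b = np.array([11.2, 11.5, 10.9, 11.4, 11.1, 11.6])

# Descriptive statistics: summarize each sample.
print(a.mean(), np.median(a), a.std(ddof=1))

# Inferential statistics: two-sample t-test for a difference in means.
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)  # a small p-value suggests a real difference
```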
Not mathematicians
• But should know about optimization, stochastics, etc.
Not statisticians
• But should know about regression, statistical tests, etc.
(Diagram: a data scientist is quantitative (maths, statistics), technical (programming, infrastructures), collaborative (communication skills, teamwork), and skeptical.)
Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First).
Different types of Data Scientists
Data Science is a process of extracting, manipulating, visualizing, and maintaining data, as well as generating predictions towards reaching the business goal.
(Chart: Depth of Insights (Low to High) vs. Time (Past, Present, Future): Business Intelligence covers the past and present; Data Science extends into the future with deeper insights.)
Data Science vs. Business Intelligence
Techniques: BI uses dashboards, alerts, and queries; DS uses optimization, predictive modelling, and forecasting.
Data Types: BI uses structured data in data warehouses; DS uses any kind of data, often unstructured and dynamic.
Common questions: BI asks What happened…? How much did…? When did…?; DS asks What if…? What will…? How can we…?
Data Storage: in BI, data is mostly stored in data warehouses; in DS, data is distributed in real-time clusters.
Concept: BI deals with data analysis on the business platform; DS consists of several data operations in various domains.
Data Science vs. Statistics
• Statistics is a field of study rooted in mathematics, providing tools and methods, such as variance analysis, mean, median, and frequency analysis, to collect data, design experiments, and perform analysis on a given set of figures to measure an attribute or determine values for a particular question. Statistical methods are used in all fields that require decision-making.
• Strong mathematical skills are the foundation for statistics.
• The field of data science can be described as the crossroads between machine learning, traditional research, and software
development. A more wide-ranging multi-disciplinary field, data science goes beyond exploratory analysis, using scientific
methods, algorithms and mathematical formulas to extract, evaluate, and visualize structured and unstructured data.
• Data science can be broken down further into data mining, machine learning, and big data.
Data Science vs. Statistics
• Data scientists should have in-depth knowledge of descriptive statistics (measures of frequency, measures of central tendency,
measures of dispersion or variation, and measures of position), probability theory, as well as probability distribution, dimensionality
reduction, over- and under-sampling, and Bayesian analysis.
• The fields of data science and statistics have many similarities. Both focus on extracting data and using it to analyze and solve real-
world problems.
• Data scientists use statistical analysis. However, data scientists need to be familiar with statistics, among other areas.
• The science of statistics enables data science (aiding its reliability and validity), and data science expands the application of statistics
to Big Data.
• Data scientists should accept the contribution and importance of statistics and statisticians must humbly acknowledge the novel
capabilities made possible through data science and support this field of study with their theoretical and pragmatic expertise.
Data Science vs. Software Engineering
• Data science is more exploratory.
• Software engineering is more focused on systems building.
• Data science project management should be more open to changes.
• Data scientists collect, reformat, model, and interpret data. While some data scientists have a strong theoretical bent (the best are also storytellers), most data scientists working in for-profit enterprises are pragmatic and use their skills to derive practical information from data.
• Software engineers write and test software code that meets the needs of end users. Depending on the environment they
work in, software engineers might also be responsible for the complete software product life cycle.
Data Science vs. Software Engineering
• Software engineering, in contrast, is a process that focuses on planning what to build, designing the best way to build, then
writing the code to build what was planned.
• Software engineers are also called software developers or simply developers. Software engineers are often divided into
front-end, back-end, or full-stack engineers. Front-end engineers focus on the part of the product that the end user sees
and interacts with. Back-end engineers write and maintain code that processes data from web applications.
• The expected outcome of the project is known at the start of the project. Software engineering practices emphasize
standardization and automation.
• Data scientists can use aspects of the engineering mindset to improve the quality of their code.
Big Data: More Data → More Opportunities
• Big Data refers to data sets so large or so complex that traditional methods of storing, accessing, and analyzing them break down or become too expensive. However, there is a lot of potential value hidden in this data, so organizations are eager to harness it to drive innovation and competitive advantage.
• Big Data technologies and approaches are used to drive value out of data-rich environments in ways that traditional analytics tools and methods cannot.
• Big Data has given rise to Data Science.
• Data science is rooted in solid foundations of mathematics and statistics, computer science, and domain knowledge.
Big Data: More Data → More Opportunities
(Chart: volume of information from small to large: terabytes, petabytes, exabytes, zettabytes (ZB), yottabytes (YB), brontobytes (BB), geopbytes (GPB).)
Data analytics types:
Descriptive Analytics
Descriptive analytics is the analysis of historical data using two key methods – data aggregation and data
mining - which are used to uncover trends and patterns.
Descriptive analytics is not used to draw inferences or make predictions about the future from its findings;
rather it is concerned with representing what has happened in the past.
Descriptive analytics findings are often displayed using visual representations like line, bar, and pie charts
and, although they give useful insights on their own, often act as a foundation for future analysis.
Because descriptive analytics uses fairly simple analysis techniques, any findings should be easy for the
wider business audience to understand.
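For instance, data aggregation over a hypothetical sales table in pandas might look like this:

```python
import pandas as pd

# Hypothetical sales history: descriptive analytics summarizes what happened.
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "revenue": [120, 95, 130, 90, 150, 110],
})

# Data aggregation: total and average revenue per month.
print(sales.groupby("month", sort=False)["revenue"].agg(["sum", "mean"]))

# Month-by-region view, ready for a bar or line chart.
print(sales.pivot_table(index="month", columns="region", values="revenue"))
```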
Data analytics types:
Predictive analytics
Predictive analytics is a more advanced method of data analysis that uses probabilities to make
assessments of what could happen in the future.
It uses statistical modelling and machine learning techniques to identify the likelihood of future outcomes
based on historical data.
To make predictions, machine learning algorithms take existing data and attempt to fill in the missing
data with the best possible guesses.
These predictions can then be used to solve problems and identify opportunities for growth.
For example, organisations are using predictive analytics to prevent fraud by looking for patterns in
criminal behaviour, optimising their marketing campaigns by spotting opportunities for cross selling and
reducing risk by using past behaviours to predict which customers are most likely to default on
payments.
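A hedged sketch of the default-prediction idea with scikit-learn; the features and labels are synthetic, not real customer data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical history: [missed_payments, debt_ratio] -> default (1) or not (0).
rng = np.random.default_rng(0)
X = rng.uniform([0, 0.0], [6, 1.0], size=(200, 2))
y = ((X[:, 0] > 3) & (X[:, 1] > 0.5)).astype(int)  # toy ground truth

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Probability that a new customer (4 missed payments, 70% debt ratio) defaults.
print(model.predict_proba([[4, 0.7]])[0, 1])
print(model.score(X_test, y_test))  # holdout accuracy
```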
Data analytics types:
Prescriptive analytics
Where predictive analytics shows companies the raw results of their potential actions, prescriptive analytics shows
them which option is best.
The field of prescriptive analytics borrows heavily from mathematics and computer science, using a variety of
statistical methods.
Although closely related to both descriptive and predictive analytics, prescriptive analytics emphasises actionable
insights instead of data monitoring. This is achieved through gathering data from a range of descriptive and
predictive sources and applying them to the decision-making process.
Algorithms then create and re-create possible decision patterns that could affect an organisation in different
ways.
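A toy sketch of a prescriptive step: score a few candidate actions with a simple expected-profit model (all numbers are hypothetical) and recommend the best one.

```python
# Candidate actions with assumed conversion rates and per-sale margins.
actions = {
    "discount_10pct": {"conversion": 0.08, "margin": 18.0},
    "discount_20pct": {"conversion": 0.12, "margin": 10.0},
    "free_shipping":  {"conversion": 0.10, "margin": 14.0},
}
audience = 10_000  # hypothetical campaign reach

def expected_profit(action):
    return audience * action["conversion"] * action["margin"]

best = max(actions, key=lambda name: expected_profit(actions[name]))
for name, action in actions.items():
    print(name, expected_profit(action))
print("recommended action:", best)  # discount_10pct: 14,400 expected profit
```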
Data analytics types:
Diagnostic Analytics
Diagnostic analytics can reveal the full spectrum of causes, ensuring you see the complete picture.
You try to quantify which factors are most impactful and zero in on them.
For diagnostic analytics, you use some of the same techniques as descriptive analytics, but you dive deeper
with drill-downs and correlations.
You may also need to bring in outside datasets to more fully inform the analysis; tools such as Sigma,
especially when connected with Snowflake, make this easy.
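A small sketch of such a drill-down in pandas, with invented daily metrics:

```python
import pandas as pd

# Hypothetical daily metrics around a drop in sign-ups.
df = pd.DataFrame({
    "signups":     [320, 310, 150, 140, 300, 145],
    "page_load_s": [1.2, 1.3, 4.8, 5.1, 1.2, 4.9],
    "ad_spend":    [500, 520, 510, 505, 515, 500],
})

# Correlations point at candidate causes: page load time stands out.
print(df.corr()["signups"])

# Drill down: compare slow-load days with the rest.
slow = df["page_load_s"] > 3
print(df.groupby(slow)["signups"].mean())
```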
CRISP-DM: CRoss-Industry Standard Process for Data Mining
(Diagram: the CRISP-DM cycle, linking Business/Application Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, with the generic tasks listed in the table above.)
CRISP-DM phase summary:
Business Understanding: understanding project objectives and requirements; data mining problem definition.
Data Understanding: initial data collection and familiarization; identify data quality issues; initial, obvious results.
Data Preparation: record and attribute selection; data cleansing.
Modeling: run the data mining tools.
Evaluation: determine whether results meet business objectives; identify business issues that should have been addressed earlier.
Deployment: put the resulting models into practice; set up for repeated/continuous mining of the data.
Phases in the CRISP-DM Process (1 & 2)
Business Understanding:
• Statement of Business Objective
• Statement of Data Mining objective
• Statement of Success Criteria
Data Understanding
• Explore the data and verify the quality
• Find outliers
Phases in the CRISP-DM Process (3 & 4)
Data preparation (usually takes over 90% of the time):
• Collection
• Assessment
• Consolidation and cleaning: table links, aggregation level, missing values, etc.
• Data selection: an active role in ignoring non-contributory data; outliers; use of samples; visualization tools
• Transformations: create new variables
Model building:
• Selection of the modeling techniques is based upon the data mining objective
• Modeling is an iterative process, different for supervised and unsupervised learning
• May model for either description or prediction
Phases in the CRISP-DM Process (5 & 6)
Model Evaluation:
• Evaluation of the model: how well it performed on test data
• Methods and criteria depend on model type: e.g., coincidence matrix with classification models, mean error rate with regression models
• Interpretation of the model: important or not, easy or hard, depends on the algorithm
Deployment:
• Determine how the results need to be utilized
• Who needs to use them? How often do they need to be used?
• Deploy data mining results by scoring a database, utilizing results as business rules, or interactive on-line scoring
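For instance, the coincidence matrix for a classification model is what scikit-learn calls a confusion matrix; the holdout labels below are made up.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical holdout labels vs. a classifier's predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# The coincidence (confusion) matrix: rows = actual, columns = predicted.
print(confusion_matrix(y_true, y_pred))   # [[4 1], [1 4]]
print(accuracy_score(y_true, y_pred))     # 8 of 10 correct -> 0.8
```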
SEMMA
Sharda, R., Delen, D., Turban, E. (2018). Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th ed. Pearson Education, New Jersey. ISBN: 9780134633282.
Sample – A portion of a large dataset is extracted: big enough to contain the significant information, yet small enough to manipulate quickly.
Explore – Data is explored for anomalies and outliers to gain understanding, and relationships among variables are examined, often visually.
Modify – Data is modified by creating, selecting, and transforming variables to focus the model selection process; additional information or variables can be added as needed to make the output more significant. Whenever new information is available, data mining methods can be updated or modified.
Model – Data is modeled by permitting software to search automatically for combinations of data that reliably predict a desired outcome, for example with statistical models such as time series analysis or memory-based reasoning.
Assess – Findings from the data are assessed for usefulness and reliability. In this phase the model is also gauged on how well it performs: if the model is valid, it should work just as well on a reserved (holdout) sample as on the sample used to construct it.
Big Data & Life Cycle
In almost every aspect, Big Data can bring "big value" to our lives.
Technologically, Big Data allows diverse and heterogeneous data to be fully integrated and analyzed to help us make
decisions.
Big Data comes from myriad sources, including social media, sensors, the Internet of Things, video surveillance, and many others.
Characteristics of Big Data:
Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows and millions of
columns.
Complexity of data types and structures: Big Data reflects the variety of new data sources, formats, and structures,
including digital traces being left on the web and other digital repositories for subsequent analysis.
Speed of new data creation and growth: Big Data can describe high-velocity data, with rapid data ingestion and near
real-time analysis.
Big Data & Life cycle
Areas of Applications:
Health and well-being
Policy making and public opinion
Smart cities and a more efficient society
New online educational models: MOOCs and student-teacher modeling
Robotics and human-robot interaction
Much of this power hinges on research in analytics.
Big Data & Life cycle
The Data Analytics Lifecycle is designed
specifically for Big Data problems and data
science projects.
The lifecycle has six phases, and project
work can occur in several phases at once.
Movement through the lifecycle can be either forward or backward.
This iterative depiction of the lifecycle is
intended to closely portray a real project,
in which aspects of the project move
forward and may return to earlier stages
as new information is uncovered and team
members learn more about various stages
of the project.
Big Data & Life cycle
The Data Analytics Lifecycle defines analytics process best practices spanning
discovery to project completion.
The lifecycle draws from established methods in the realm of data analytics and
decision science. This synthesis was developed after gathering input from data
scientists and consulting established approaches that provided input on pieces of
the process.
KEY ROLES:
1. Business User: Someone who understands the domain area and usually
benefits from the results. This person can consult and advise the project
team on the context of the project, the value of the results, and how the
outputs will be operationalized. Usually, a business analyst, line manager, or
deep subject matter expert in the project domain fulfils this role.
2. Project Sponsor: Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines the core business
problem. Generally provides the funding and sets the priorities for the
project and clarifies the desired outputs.
Big Data & Life cycle
3. Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality.
4. Business Intelligence Analyst: Provides business domain expertise based on a deep understanding of the
data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.
Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds
and sources.
5. Database Administrator (DBA): Provisions and configures the database environment to support the analytics
needs of the working team. These responsibilities may include providing access to key databases or tables and
ensuring the appropriate security levels are in place related to the data repositories.
Big Data & Life cycle
6. Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and
data extraction, and provides support for data ingestion into the analytic sandbox. Whereas the DBA sets up
and configures the databases to be used, the data engineer executes the actual data extractions and performs
substantial data manipulation to facilitate the analytics. The data engineer works closely with the data scientist
to help shape data in the right ways for analyses.
7. Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid
analytical techniques to given business problems. Ensures overall analytics objectives are met. Designs and
executes analytical methods and approaches with the data available to the project.
Big Data & Life cycle
Data Analytics Lifecycle Phases:
Phase 1—Discovery: In Phase 1, the team learns the business
domain, including relevant history, such as whether the
organization or business unit has attempted similar projects in the
past from which it can learn. Important activities in this phase
include framing the business problem as an analytics challenge
that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.