
Introduction to Data Science

Dr Setturu Bharath
Syllabus
Unit 1:

Fundamentals of Data Science, Real World applications, Data Science vs BI, Data Science vs Statistics, Roles and
responsibilities of a Data Scientist, Software Engineering for Data Science, Data Scientists Toolbox, Data Science
Challenges.

Defining Analytics: Types of data analytics: Descriptive, Diagnostic, Predictive, Prescriptive Data Analytics –
methodologies, CRISP-DM Methodology, SEMMA, BIG DATA LIFE CYCLE, SMAM, Effort Estimate, Analytics Capacity
Building, Challenges in Data-driven decision making Data Science Process, Data Science methodology, Business
understanding, Data Requirements, Data Acquisition, Data Understanding, Data preparation Modelling, Model
Evaluation, Deployment and feedback, Case Study, Data Science Proposal, Samples, Evaluation, Review Guide

Unit 2:

Defining Data Team: Roles in a Data Science Team: Data Scientists, Data Engineers; Managing a Data Team: Onboarding
and evaluating the success of the team, Working with other teams, Common difficulties, Data and Data Models, Types of
Data and Datasets, Data Quality, Epicycles of Data Analysis, Data Models, Model as expectation, comparing models to
reality, Reactions to Data, Refining our expectations, Six Types of Questions, Characteristics of a Good Question,
Formal modelling, General Framework, Associational Analyses, Prediction Analyses Introduction to OLTP, OLAP.
Syllabus
Unit 3:

Properties of functions, Data wrangling and Feature Engineering, Data cleaning, Data Aggregation, Sampling, Handling
Numeric Data Discretization, Binarization, Normalization, Data Smoothening, Dealing with textual Data Managing
Categorical Attributes, Transforming Categorical to Numerical Values: Encoding techniques, Feature Engineering,
Feature Extraction (Dimensionality Reduction) Feature Construction, Feature Subset selection, Filter methods,
Wrapper methods, Embedded methods, Feature Learning, Case Study involving FE tasks. Lab: Feature Extraction, and
Subset selection.

Unit 4:

Need for Data visualization, Exploratory vs Explanatory Analysis, Tables, Axis based Visualization and Statistical
Plots, Lessons in Data Visualization Design, The Data Visualization Design Process, Stories and Dashboards,
Storytelling with Data: The final deliverable, The Narrative - report / presentation structure, building narrative
with Data, Effective Story Telling.

Ethics for Data Science Bias and Fairness, Types of Bias, Identifying Bias, Evaluating Bias, Being a data skeptic –
examples of misuse of Data, Doing Good Science Five C’s, Ethical guidelines for Data Scientist, Ethics of data scraping
and storage, Case Study: IBM AI Fairness 360.
Textbooks
1. Introducing Data Science by Cielen, Meysman and Ali.
2. Grus, J., 2019. Data science from scratch: first principles with python. O'Reilly Media.
3. Storytelling with Data: A data visualization guide for business professionals by Cole Nussbaumer Knaflic.
4. Introduction to Data Mining by Tan, Steinbach, and Vipin Kumar.
5. The Art of Data Science by Roger D Peng and Elizabeth Matsui.
6. Ethics and Data Science by DJ Patil, Hilary Mason, Mike Loukides.
7. Python Data Science Handbook: Essential tools for working with data by Jake VanderPlas.
8. KDD, SEMMA, and CRISP-DM: A Parallel Overview, Ana Azevedo and M.F. Santos, IADS-DM, 2008.

Reference Books (Online & Digital)

1. https://www.tableau.com/sites/default/files/whitepapers/752750_core_why_visual_analytics_whitepaper_0.pdf
2. https://nbviewer.org/github/IBM/AIF360/blob/master/examples/tutorial_medical_expenditure.ipynb
3. https://nbviewer.org/github/IBM/AIF360/blob/master/examples/tutorial_credit_scoring.ipynb
4. https://data-flair.training/blogs/data-analytics-tutorial/
5. http://cdnlarge.tableausoftware.com/sites/default/files/whitepapers/visual_analysis_for-everyone.pdf
Data science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today.

It adds methods from computer science to the repertoire of statistics.

A data scientist is someone who knows more statistics than a computer scientist and more computer
science than a statistician.
Data science
In 2012, the Obama campaign employed dozens of data scientists who data-mined and experimented their
way to identifying voters who needed extra attention, choosing optimal donor-specific fundraising appeals
and programs, and focusing get-out-the-vote efforts where they were most likely to be useful.

In 2016 the Trump campaign tested a staggering variety of online ads and analyzed the data to find what
worked and what didn’t.

Examples like these show why organizations are eager to build data science capabilities of their own.
Data science
Data Science Skills
• Data Management
• Analytics Modeling
• Business Analysis
• Soft Skills
The main things that set a data scientist apart from a statistician are the ability to work with big data and experience in
machine learning, computing, and algorithm building. Their tools tend to differ too, with data scientist job descriptions more
frequently mentioning the ability to use Hadoop, Pig, Spark, R, Python, and Java, among others.
Data types and data sources

Structured
• Data with defined types and structure
• Example: comma-separated values

Semi-Structured
• Textual data with a parseable pattern
• Example: XML files with a schema

Quasi-Structured
• Textual data with erratic formats that can be formatted with effort
• Example: Clickstream data

Unstructured
• Data that has no inherent structure, often with multiple formats
• Example: Web sites, videos
Structured data

Structured data is data that depends on a data model and resides in a fixed field within a record.

As such, it’s often easy to store structured data in tables within databases or Excel files.

SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.

Unstructured data, by contrast, gives you a hard time when you try to store it in a traditional relational database.
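To make the idea concrete, here is a minimal sketch of storing and querying structured data with SQL, using Python's built-in sqlite3 module (the customers table and its values are invented for illustration):

```python
import sqlite3

# Create an in-memory database with one structured table: fixed fields per record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meera", "Pune")],
)

# SQL is the preferred way to query data residing in fixed fields.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Delhi', 1), ('Pune', 2)]
```

Because every record shares the same fields, aggregate queries like the GROUP BY above are trivial, which is exactly what unstructured data does not allow.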
Unstructured data

Unstructured data is data that isn't easy to fit into a data model because the content is context-specific
or varying.

One example of unstructured data is your regular email. Although email contains structured elements
such as the sender, title, and body text, the body itself is free text, and the thousands of different
languages and dialects out there complicate its analysis further.
Graph-based or network data → Semi Structured/Unstructured

“Graph data” can be a confusing term because any data can be shown in a graph.

“Graph” in this case points to mathematical graph theory. In graph theory, a graph
is a mathematical structure to model pair-wise relationships between objects.

The graph structures use nodes, edges, and properties to represent and store
graphical data.

Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and the
shortest path between two people
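The shortest-path idea can be sketched with nothing more than the standard library: a breadth-first search over a small, invented friendship network (names and edges are made up for illustration):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search: shortest path (fewest edges) between two people."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no path exists

# A tiny social network as an adjacency list (invented).
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dan"],
    "Cara": ["Ann", "Dan"],
    "Dan": ["Bob", "Cara", "Eve"],
    "Eve": ["Dan"],
}
print(shortest_path(friends, "Ann", "Eve"))  # ['Ann', 'Bob', 'Dan', 'Eve']
```

Dedicated graph libraries offer the same operation plus metrics such as centrality (the "influence of a person" mentioned above), but the node-and-edge structure is the same.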
Graph or network data focuses on the relationships or adjacency of objects, which simplifies data storage
and retrieval.

Audio, image, and video

Audio, image, and video are data types that pose specific challenges to a data scientist.

Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
Machine-generated data

Machine-generated data is information that's automatically created by a computer, process, application, or
other machine without human intervention.

Machine-generated data is becoming a major data resource and will continue to do so.
Data Science Competence

Three competence groups were identified in the NIST document and confirmed by the analysis of collected data:
• Data Analytics, including statistical methods, Machine Learning, and Business Analytics
• Engineering: software and infrastructure
• Subject/Scientific Domain competencies and knowledge

THE DATA SCIENCE PROCESS
Setting the research goal
Data science is mostly applied in the context of an organization.

When the business asks you to perform a data science project, you'll first prepare a project charter.

This charter contains information such as what you're going to research, how the company benefits from
that, what data and resources you need, a timetable, and deliverables.
The data science process

Retrieving data
The second step is to collect data.

You should be clear about which data you need and where you can find it.
In this step you ensure that you can use the data in your program, which
means checking the existence of, quality, and access to the data.

Data can also be delivered by third-party companies and takes many forms
ranging from Excel spreadsheets to different types of databases.
The data science process

Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data and prepare it for use in
subsequent steps.

This phase consists of three subphases: data cleansing removes false values from a data source and inconsistencies
across data sources, data integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
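The three subphases can be sketched with standard-library Python on invented records (the sales and regions data are made up for illustration):

```python
# Invented example records from two sources.
sales = [
    {"id": 1, "amount": "120"},
    {"id": 2, "amount": ""},   # missing value from an error-prone collection step
    {"id": 3, "amount": "85"},
]
regions = {1: "North", 2: "South", 3: "North"}

# Data cleansing: drop records with missing amounts.
clean = [r for r in sales if r["amount"]]

# Data integration: enrich each sale with its region from the second source.
for r in clean:
    r["region"] = regions[r["id"]]

# Data transformation: cast amounts to a numeric type suitable for modelling.
for r in clean:
    r["amount"] = float(r["amount"])

print(clean)
# [{'id': 1, 'amount': 120.0, 'region': 'North'}, {'id': 3, 'amount': 85.0, 'region': 'North'}]
```

Real projects use the same three steps at much larger scale, typically with a dataframe library rather than plain dicts.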
The data science process

Data exploration
Data exploration is concerned with building a deeper understanding of your data.

You try to understand how variables interact with each other, the distribution of the data, and whether there are
outliers. To achieve this you mainly use descriptive statistics, visual techniques, and simple modeling.

This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
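A tiny EDA sketch with Python's standard statistics module, on an invented sample (note how a single outlier pulls the mean far away from the median):

```python
import statistics

# Invented sample: daily sales figures, with one suspicious value.
values = [12, 15, 11, 14, 95, 13, 12]

print(statistics.mean(values))    # ≈ 24.57, dragged up by the outlier
print(statistics.median(values))  # 13, robust to the outlier
print(statistics.stdev(values))   # large spread hints that something is off
print(max(values))                # 95 stands out as a likely outlier to investigate
```

Comparing mean against median like this is one of the simplest descriptive checks for skew and outliers before any modeling starts.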
The data science process

Data modeling or model building


In this phase you use models, domain knowledge, and insights about the data you found in the previous steps to
answer the research question.

You select a technique from the fields of statistics, machine learning, operations research, and so on. Building a model
is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics.
Cross-Industry Standard Process for Data Mining (CRISP-DM) – Phases, tasks, outputs

Business Understanding
• Determine Business Objectives → Background, Business Objectives, Business Success Criteria
• Situation Assessment → Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goals → Data Mining Goals, Data Mining Success Criteria
• Produce Project Plan → Project Plan, Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data → Initial Data Collection Report
• Describe Data → Data Description Report
• Explore Data → Data Exploration Report
• Verify Data Quality → Data Quality Report

Data Preparation
• Data Set → Data Set Description
• Select Data → Rationale for Inclusion/Exclusion
• Clean Data → Data Cleaning Report
• Construct Data → Derived Attributes, Generated Records
• Integrate Data → Merged Data
• Format Data → Reformatted Data

Modeling
• Select Modeling Technique → Modeling Technique, Modeling Assumptions
• Generate Test Design → Test Design
• Build Model → Parameter Settings, Models, Model Description
• Assess Model → Model Assessment, Revised Parameter Settings

Evaluation
• Evaluate Results → Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
• Review Process → Review of Process
• Determine Next Steps → List of Possible Actions, Decision

Deployment
• Plan Deployment → Deployment Plan
• Plan Monitoring and Maintenance → Monitoring & Maintenance Plan
• Produce Final Report → Final Report, Final Presentation
• Review Project → Experience Documentation
The data science process

Presentation and automation


Finally, you present the results to your business.

These results can take many forms, ranging from presentations to research reports.

Sometimes you’ll need to automate the execution of the process because the business will want to use the insights
you gained in another project or enable an operational process to use the outcome from your model.
Mathematical Aspects
• Mathematics is fundamental for understanding and solving complex data-related problems.
• Key areas include statistics, linear algebra, calculus, probability, and optimization.
• Linear and nonlinear programming are techniques for finding the best outcome in a mathematical model with constraints.

Related fields: Computational Geometry, Optimization, Stochastics, Scientific Computing, Machine Learning.
Statistical Aspects
Descriptive Statistics: Summarizing and describing the features of a dataset. Includes mean, median, mode,
standard deviation, and variance.

Inferential Statistics: Drawing conclusions and making predictions about a population based on a sample.
Includes hypothesis testing, confidence intervals, and regression analysis.

Probability Distributions: Understanding and working with different types of data distributions, such as
normal, binomial, and Poisson distributions.

Related fields: Linear Models, Statistical Tests, Inference, Time Series Analysis, Machine Learning.
Computer Science Aspects
• Data Structures: Efficient data handling and storage using arrays, linked lists, trees, graphs, hash tables, etc.
• Algorithms: Sorting, searching, and optimization algorithms are crucial for data manipulation and analysis.
• Complexity Analysis: Understanding the time and space complexity of algorithms to ensure they are scalable and efficient.
• Data Mining: Extracting patterns from large data sets using methods like clustering, association rule mining, and anomaly detection.

Related fields: Data Structures and Algorithms, Databases, Distributed Computing, Software Engineering, Artificial Intelligence, Machine Learning.
Applications
• Companies learn your secrets, shopping patterns, and preferences.
• They find stories and extract knowledge.

Application areas: Intelligent Systems, Robotics, Marketing, Medicine, Autonomous Driving, Social Networks.

Who are Data Scientists? → Skill set for data scientists
Not computer scientists
• But should know about databases, data structures, algorithms, etc.

Not mathematicians
• But should know about optimization, stochastics, etc.

Not statisticians
• But should know about regression, statistical tests, etc.

Not domain experts
• But must work together with them
Skills of Data Scientists
• Quantitative: maths, statistics, algorithms
• Technical: programming, infrastructures
• Collaborative: communication skills, teamwork
• Skeptical: create hypotheses, but be skeptical about them
Different types of Data Scientists
According to Microsoft Research:
• Polymath – „do it all“
• Data Analyzer – analyzing data
• Data Evangelist – data analysis, disseminating and acting on insights
• Platform Builder – collecting data and creating infrastructures
• Data Preparer – querying existing data, preparing data for analysis
• Moonlighter (50%/20%) – „spare time“ data scientists
• Data Shaper – analyzing and preparing data
• Insight Actor – using the outcome and acting on insights

Related job titles include: Machine Learning Scientist, Statistician, Software Programming Analyst, Data Engineer, Actuarial Scientist, Business Analytic Practitioner, Quality Analyst, Spatial Data Scientist, Mathematician, Digital Analytic Consultant.

Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)
Different types of Data Scientists
• Mathematics and Applied Mathematics
• Applied Statistics/Data Analysis
• Solid Programming Skills (R, Python, Julia, SQL)
• Data Mining
• Database Storage and Management
• Machine Learning and Discovery
Data Science vs. Business Intelligence
Business Intelligence (Gartner IT Glossary)
• […] best practices that enable access to and analysis of information to improve and optimize decisions and performance.
• Business Intelligence offers solutions to the problems of the present, while Data Science provides avenues for the future.
• BI deals with measuring performance and quantifying the progress towards reaching the business goal.
• Data Science is a process of extracting, manipulating, visualizing, and maintaining data, as well as generating predictions.

In terms of depth of insights over time, BI sits at the lower end and focuses on the past and present, while Data Science extends into the future with deeper insights.
Data Science vs. Business Intelligence
• Techniques: BI uses dashboards, alerts, and queries; Data Science uses optimization, predictive modelling, and forecasting.
• Data types: BI works on structured data in data warehouses; Data Science handles any kind of data, often unstructured and dynamic.
• Common questions: BI asks "What happened…?", "How much did…?", "When did…?"; Data Science asks "What if…?", "What will…?", "How can we…?".
• Data storage: BI data is mostly stored in data warehouses; Data Science data is distributed in real-time clusters.
• Concept: BI deals with data analysis on the business platform; Data Science consists of several data operations in various domains.
Data Science vs. Statistics
• Statistics is a field of study rooted in mathematics, providing programmatic tools and methods (such as variance analysis,
mean, median, and frequency analysis) to collect data, design experiments, and perform analysis on a given set of figures
to measure an attribute or determine values for a particular question. Statistical methods are used in all fields that require
decision-making.
• Strong mathematical skills are the foundation for statistics.
• The field of data science can be described as the crossroads between machine learning, traditional research, and software
development. A more wide-ranging multi-disciplinary field, data science goes beyond exploratory analysis, using scientific
methods, algorithms and mathematical formulas to extract, evaluate, and visualize structured and unstructured data.
• Data science can be broken down further into data mining, machine learning, and big data.
Data Science vs. Statistics
• Data scientists should have in-depth knowledge of descriptive statistics (measures of frequency, measures of central tendency,
measures of dispersion or variation, and measures of position), probability theory, as well as probability distribution, dimensionality
reduction, over- and under-sampling, and Bayesian analysis.
• The fields of data science and statistics have many similarities. Both focus on extracting data and using it to analyze and solve real-
world problems.
• Data scientists use statistical analysis. However, data scientists need to be familiar with statistics, among other areas.
• The science of statistics enables data science (aiding its reliability and validity), and data science expands the application of statistics
to Big Data.
• Data scientists should accept the contribution and importance of statistics and statisticians must humbly acknowledge the novel
capabilities made possible through data science and support this field of study with their theoretical and pragmatic expertise.
Data Science vs. Software Engineering
• Data science is more exploratory.
• Software engineering is more focused on systems building.
• Data science project management should be more open to changes.
• Data scientists collect, reformat, model, and interpret data. While some data scientists have a strong theoretical bent (the
best are also storytellers), most data scientists working in for-profit enterprises are pragmatic and use their skills to derive
practical information from data.
• Software engineers write and test software code that meets the needs of end users. Depending on the environment they
work in, software engineers might also be responsible for the complete software product life cycle.
Data Science vs. Software Engineering
• Software engineering, in contrast, is a process that focuses on planning what to build, designing the best way to build, then
writing the code to build what was planned.
• Software engineers are also called software developers or simply developers. Software engineers are often divided into
front-end, back-end, or full-stack engineers. Front-end engineers focus on the part of the product that the end user sees
and interacts with. Back-end engineers write and maintain code that processes data from web applications
• The expected outcome of the project is known at the start of the project. Software engineering practices emphasize
standardization and automation.
• Data scientists can use aspects of the engineering mindset to improve the quality of their code.
BigData: More Data → More Opportunities
• Big Data are data sets so large or so complex that traditional methods of storing, accessing, and analyzing them break down
or become too expensive. However, there is a lot of potential value hidden in this data, so organizations are eager to harness
it to drive innovation and competitive advantage.
• Big Data technologies and approaches are used to drive value out of data-rich environments in ways that traditional analytics
tools and methods cannot.
• Big Data has given rise to Data Science.
• Data science is rooted in solid foundations of mathematics and statistics, computer science, and domain knowledge.
BigData: More Data → More Opportunities
The volume of information has grown from terabytes in the 1990s (relational databases and data warehouses),
to petabytes and exabytes in the 2000s (content management and unstructured data), and on towards zettabytes,
yottabytes, brontobytes, and geopbytes in the 2010s and beyond (key-value storages).
Data Science Challenges
Data Cleansing
Removing unwanted data from your datasets is one of the key challenges. Bad data is costly to businesses, with
some organizations losing up to $12.1 million per year. It can lead to incorrect conclusions, resulting in wrong decisions.
Handling Multiple Data Sources
Getting the right data for analysis is a daunting task, especially when you’re accessing data from various sources.
That’s why, for effective data science, consolidating data from multiple sources is a must.
Not Enough Skilled Workers
A staggering 59% of businesses use data science in different ways to improve their performance. This has resulted in a high
demand for skilled data science professionals that outweighs supply.
Data Science Challenges
Data Privacy and Security
In 2020 alone, the FBI received over 2,000 cybercrime complaints daily (USA). A total of 1.13 million cases of
financial cyber fraud were reported in 2023, according to a Lok Sabha reply on February 6. Ransomware, attacks on data
systems, and data theft are some common forms of data security breaches.

Reporting to Non-Technical Stakeholders


Some organizations don’t have clearly defined business terms and KPIs (Key Performance Indicator). That can be a
challenge for your data scientists when it comes to reporting. If each department interprets business terms differently and
uses different measures to calculate KPIs, then your data scientists will have a lot to do.
Data analytics types:
Descriptive Analytics: tells you what happened in the past (theories/scenarios)
Diagnostic Analytics: helps you understand why something happened in the past
Predictive Analytics: predicts what's most likely to happen in the future
Prescriptive Analytics: recommends actions you can take to affect those likely outcomes
Data analytics types:

Descriptive Analytics:
Descriptive analytics is the analysis of historical data using two key methods, data aggregation and data
mining, which are used to uncover trends and patterns.
Descriptive analytics is not used to draw inferences or make predictions about the future from its findings;
rather, it is concerned with representing what has happened in the past.
Descriptive analytics results are often displayed using visual data representations like line, bar, and pie
charts and, although they give useful insights on their own, they often act as a foundation for future analysis.
Because descriptive analytics uses fairly simple analysis techniques, any findings should be easy for the
wider business audience to understand.
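Data aggregation, the first of the two key methods, can be sketched in a few lines of standard-library Python (the orders data are invented for illustration):

```python
from collections import defaultdict

# Invented historical orders: (month, revenue).
orders = [("Jan", 100), ("Jan", 150), ("Feb", 90), ("Feb", 60), ("Mar", 120)]

# Data aggregation: summarize what happened per month.
totals = defaultdict(int)
for month, revenue in orders:
    totals[month] += revenue

print(dict(totals))  # {'Jan': 250, 'Feb': 150, 'Mar': 120}
```

Summaries like these are exactly what a line or bar chart of "revenue per month" would be drawn from.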
Data analytics types:
Predictive analytics
Predictive analytics is a more advanced method of data analysis that uses probabilities to make
assessments of what could happen in the future.
It uses statistical modelling and machine learning techniques to identify the likelihood of future outcomes
based on historical data.
To make predictions, machine learning algorithms take existing data and attempt to fill in the missing
data with the best possible guesses.
These predictions can then be used to solve problems and identify opportunities for growth.
For example, organisations are using predictive analytics to prevent fraud by looking for patterns in
criminal behaviour, to optimise their marketing campaigns by spotting opportunities for cross-selling, and
to reduce risk by using past behaviours to predict which customers are most likely to default on
payments.
Data analytics types:
Prescriptive analytics
While predictive analytics shows companies the raw results of their potential actions, prescriptive analytics
shows companies which option is the best.

The field of prescriptive analytics borrows heavily from mathematics and computer science, using a variety of
statistical methods.

Although closely related to both descriptive and predictive analytics, prescriptive analytics emphasises actionable
insights instead of data monitoring. This is achieved through gathering data from a range of descriptive and
predictive sources and applying them to the decision-making process.
Algorithms then create and re-create possible decision patterns that could affect an organisation in different
ways.
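A toy sketch of the prescriptive step, scoring every possible action and recommending the one with the best expected outcome (the options, probabilities, and profits are invented for illustration):

```python
# Invented scenario: choose the discount level with the best expected profit.
# Each option: (action, probability_of_sale, profit_if_sold)
options = [
    ("no discount", 0.30, 100),
    ("10% off",     0.50, 90),
    ("20% off",     0.70, 80),
]

# Prescriptive step: score every decision pattern and recommend the best one.
best = max(options, key=lambda o: o[1] * o[2])
print(best[0])  # '20% off' (expected profit 56 beats 30 and 45)
```

Production prescriptive systems replace this brute-force scoring with optimization solvers and simulation, but the structure, enumerating actions and ranking them by a predicted outcome, is the same.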
Data analytics types:
Diagnostic Analytics
Diagnostic analytics can reveal the full spectrum of causes, ensuring we see the complete picture.
We try to quantify which factors are most impactful and zero in on them.
For diagnostic analytics, we use some of the same techniques as descriptive analytics, but dive deeper
with drill-downs and correlations.
We may also need to bring in outside datasets to more fully inform the analysis.
CRISP-DM:
CRoss-Industry Standard Process for Data Mining

The CRISP-DM cycle iterates through six phases: Business/Application Understanding, Data Understanding,
Data Preparation, Modeling, Evaluation, and Deployment, with feedback loops between the phases.

Phases and Tasks
Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
Data Understanding: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
Data Preparation: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
Modeling: Select Modeling Technique, Generate Test Design, Build Model, Assess Model
Evaluation: Evaluate Results, Review Process, Determine Next Steps
Deployment: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
CRISP-DM:
Business Understanding
• Understanding project objectives and requirements
• Data mining problem definition
Data Understanding
• Initial data collection and familiarization
• Identify data quality issues
• Initial, obvious results
Data Preparation
• Record and attribute selection
• Data cleansing
Modeling
• Run the data mining tools
Evaluation
• Determine if results meet business objectives
• Identify business issues that should have been addressed earlier
Deployment
• Put the resulting models into practice
• Set up for repeated/continuous mining of the data
Phases in the CRISP-DM Process (1 & 2)
Business Understanding:
• Statement of Business Objective
• Statement of Data Mining Objective
• Statement of Success Criteria

Data Understanding:
• Explore the data and verify the quality
• Find outliers
Phases in the CRISP-DM Process (3 & 4)
Data preparation (usually takes over 90% of the time):
• Collection
• Assessment
• Consolidation and cleaning: table links, aggregation level, missing values, etc.
• Data selection: an active role in ignoring non-contributory data, handling outliers, use of samples, visualization tools
• Transformations: create new variables

Model building:
• Selection of the modeling techniques is based upon the data mining objective
• Modeling is an iterative process, different for supervised and unsupervised learning
• May model for either description or prediction
Phases in the CRISP-DM Process (5 & 6)
Model Evaluation:
• Evaluation of the model: how well it performed on test data
• Methods and criteria depend on the model type: e.g., coincidence matrix with classification models, mean error rate with regression models
• Interpretation of the model: important or not, easy or hard depends on the algorithm

Deployment:
• Determine how the results need to be utilized
• Who needs to use them?
• How often do they need to be used?
Deploy Data Mining results by:
• Scoring a database
• Utilizing results as business rules
• Interactive scoring on-line
SEMMA
Sharda, R., Delen, D., Turban, E. (2018). Big Data Intelligence, Analytics, and Data Science: A Managerial
Perspective. 4th ed. Pearson Education, New Jersey. ISBN: 9780134633282.

Sample – Generating data in this phase can be optional. It involves extracting a large dataset so that a
significant piece of information can be deduced from the pattern. As a way to optimize cost and
performance, the SAS Institute applies a dependable and statistically representative sample of the complete
detailed information sources instead of mining the whole volume of data.

Explore – Data is explored by looking for unforeseen patterns and oddities. This can increase comprehension
of and ideas about the data. It also refines the discovery process: if there is no visualization, or the
visual itself is unclear, exploration can be done through statistical techniques (clustering, factor
analysis, etc.).
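The Sample phase can be sketched with the standard library's random.sample, which draws a representative subset without replacement (population and sample sizes are invented for illustration):

```python
import random

random.seed(42)  # make the sample reproducible

# Invented "complete detailed information source": a million record IDs.
population = range(1_000_000)

# Draw a statistically representative subset instead of mining the whole volume.
sample = random.sample(population, k=1_000)

print(len(sample))       # 1000 records to analyze instead of a million
print(len(set(sample)))  # 1000: sampling without replacement yields no duplicates
```

Simple random sampling like this assumes the population is homogeneous; stratified sampling would be used when subgroups must each be represented.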
SEMMA

Modify – Data is modified by creating, selecting, and transforming variables to center the model selection,
and any additional information or variables can be added as necessary to make the information output more
significant. Whenever new information is available, data mining methods can be updated or modified.
Model – Data is modeled by permitting software to search for a combination of data that dependably predicts
an ideal result in an automatic way, for example using statistical models such as time series analysis,
memory-based reasoning, etc.
Assess – Data is assessed by evaluating whether the findings from the data are valuable enough (useful) and
reliable. In this phase, the model can also be gauged on how well it performs. If the data model is valid,
it should work just fine on both the reserved sample and the constructed sample.
Big Data & Life Cycle
In almost every aspect, Big Data can bring “big values” to our lives.
Technologically, Big Data allows diverse and heterogeneous data to be fully integrated and analyzed to help us make
decisions.
Big Data comes from myriad sources, including social media, sensors, the Internet of Things, and video surveillance.
Characteristics of Big Data:
Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows and millions of
columns.
Complexity of data types and structures: Big Data reflects the variety of new data sources, formats, and structures,
including digital traces being left on the web and other digital repositories for subsequent analysis.
Speed of new data creation and growth: Big Data can describe high-velocity data, with rapid data ingestion and near
real-time analysis.
Big Data & Life cycle
Areas of Application:
Health and well-being
Policy making and public opinion
Smart cities and a more efficient society
New online educational models: MOOCs and student-teacher modeling
Robotics and human-robot interaction
Much of this power hinges on research in analytics.

Big Data & Life cycle
The Data Analytics Lifecycle is designed
specifically for Big Data problems and data
science projects.
The lifecycle has six phases, and project
work can occur in several phases at once.
The movement can be either forward or
backward.
This iterative depiction of the lifecycle is
intended to closely portray a real project,
in which aspects of the project move
forward and may return to earlier stages
as new information is uncovered and team
members learn more about various stages
of the project.
Big Data & Life cycle
The Data Analytics Lifecycle defines analytics process best practices spanning
discovery to project completion.
The lifecycle draws from established methods in the realm of data analytics and
decision science. This synthesis was developed after gathering input from data
scientists and consulting established approaches that provided input on pieces of
the process.
KEY ROLES:
1. Business User: Someone who understands the domain area and usually
benefits from the results. This person can consult and advise the project
team on the context of the project, the value of the results, and how the
outputs will be operationalized. Usually, a business analyst, line manager, or
deep subject matter expert in the project domain fulfils this role.
2. Project Sponsor: Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines the core business
problem. Generally provides the funding and sets the priorities for the
project and clarifies the desired outputs.
Big Data & Life cycle
3. Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality.
4. Business Intelligence Analyst: Provides business domain expertise based on a deep understanding of the
data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.
Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds
and sources.
5. Database Administrator (DBA): Provisions and configures the database environment to support the analytics
needs of the working team. These responsibilities may include providing access to key databases or tables and
ensuring the appropriate security levels are in place related to the data repositories.
Big Data & Life cycle
6. Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and
data extraction, and provides support for data ingestion into the analytic sandbox. Whereas the DBA sets up
and configures the databases to be used, the data engineer executes the actual data extractions and performs
substantial data manipulation to facilitate the analytics. The data engineer works closely with the data scientist
to help shape data in the right ways for analyses.

7. Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid
analytical techniques to given business problems. Ensures overall analytics objectives are met. Designs and
executes analytical methods and approaches with the data available to the project.

Big Data & Life cycle
Data Analytics Lifecycle Phases:
Phase 1—Discovery: In Phase 1, the team learns the business
domain, including relevant history, such as whether the
organization or business unit has attempted similar projects in the
past from which it can learn. Important activities in this phase
include framing the business problem as an analytics challenge
that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.

Phase 2—Data preparation: Phase 2 requires the presence of an
analytic sandbox, in which the team can work with data and
perform analytics for the duration of the project. The team needs
to execute extract, load, and transform (ELT) or extract, transform,
and load (ETL) to get data into the sandbox. In this phase, the
team also needs to familiarize itself with the data thoroughly and
take steps to condition the data.
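A toy version of the ETL step might look as follows; the table and column names are hypothetical, and an in-memory SQLite database stands in for the analytic sandbox:

```python
import sqlite3
import pandas as pd

# Extract: raw data as it might arrive from a source system.
raw = pd.DataFrame({
    "customer": ["Alice", "bob", "CAROL", None],
    "spend":    ["100",   "250", "n/a",  "75"],
})

# Transform: condition the data -- normalize names, coerce types, drop bad rows.
clean = raw.dropna(subset=["customer"]).copy()
clean["customer"] = clean["customer"].str.title()
clean["spend"] = pd.to_numeric(clean["spend"], errors="coerce")
clean = clean.dropna(subset=["spend"])

# Load: land the conditioned table in the analytic sandbox (SQLite here).
conn = sqlite3.connect(":memory:")
clean.to_sql("customers", conn, index=False)

rows = conn.execute("SELECT customer, spend FROM customers").fetchall()
print(rows)  # [('Alice', 100.0), ('Bob', 250.0)]
```

In an ELT variant the raw rows would be loaded into the sandbox first and conditioned there, using the database engine rather than pandas.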
Big Data & Life cycle
Data Analytics Lifecycle Phases:
Phase 3—Model planning: Phase 3 is model planning, where the team
determines the methods, techniques, and workflow it intends to follow
for the subsequent model-building phase and selects key variables
and the most suitable models.
Phase 4—Model building: In Phase 4, the team develops datasets for
testing, training, and production purposes. In addition, in this phase,
the team builds and executes models based on the work done in the
model planning phase. The team also considers whether its existing
tools will suffice for running the models, or if it will need a more
robust environment for executing models and workflows (for example,
fast hardware and parallel processing, if applicable).
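The testing, training, and production datasets mentioned for Phase 4 can be sketched as a three-way split; the 60/20/20 proportions below are a common convention, not a prescription from the text:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve off 20% as a final, untouched test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
# ... then split the remainder 75/25 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```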
Big Data & Life cycle
Data Analytics Lifecycle Phases:
Phase 5—Communicate results: In Phase 5, the team, in collaboration
with major stakeholders, determines if the results of the project are a
success or a failure based on the criteria developed in Phase 1. The
team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
Phase 6—Operationalize: In Phase 6, the team delivers final reports,
briefings, code, and technical documents. In addition, the team may
run a pilot project to implement the models in a production
environment.
Data model and tools
[Table: Sample Dataset Inventory – Research on Model Planning in Industry Verticals]
Data model and tools
Commercial Tools:
• SAS Enterprise Miner allows users to run predictive and descriptive models based on large volumes of data from across the enterprise. It interoperates with
other large data stores, has many partnerships, and is built for enterprise-level computing and analytics.
• SPSS Modeler (provided by IBM and now called IBM SPSS Modeler) offers methods to explore and analyze data through a GUI.
• Matlab provides a high-level language for performing a variety of data analytics, algorithms, and data exploration.
• Alpine Miner [11] provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
• STATISTICA and Mathematica are also popular and well-regarded data mining and analytics tools.
Data model and tools
Free or Open Source Tools:
• R and PL/R: R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that
R commands can be executed in-database. This technique provides higher performance and is more scalable than running R in memory.
• Octave, a free software programming language for computational modeling, has some of the functionality of Matlab. Because it is freely available, Octave is
used in major universities when teaching machine learning.
• WEKA is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
• Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, numpy, scipy, and pandas, and related data
visualization using matplotlib.
• SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools. MADlib provides an open-source machine
learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.
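As a small illustration of the Python toolkits named above working together (synthetic data, hypothetical column names):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# numpy generates the synthetic values, pandas holds the data frame.
rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 3.0 * df["x"] + rng.normal(0, 0.1, 200)

# scikit-learn fits the model and recovers the underlying slope.
model = LinearRegression().fit(df[["x"]], df["y"])
print(round(model.coef_[0], 1))  # ~3.0
```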
Standard Methodology for Analytical Models (SMAM)
Assignment:
Data Science Case Study Health Care: Stanford Medicine, Google team
Case Study Elections: Obama to Biden

http://www.datascienceassn.org/content/standard-methodology-analytical-models
