BIA 5000 Introduction To Analytics - Lesson 6

This document discusses the data science life cycle and related concepts. It describes key steps in data acquisition and preparation including gathering, understanding, reformatting, consolidating, transforming, cleansing and storing data. Common challenges in data preparation are also reviewed. The document then discusses data wrangling techniques including gathering, filtering, subsetting, profiling and transforming data. Finally, it introduces the concept of data profiling and provides example questions to profile individual data.


INTRODUCTION TO ANALYTICS
2022 - 2023
LESSON 6. DATA SCIENCE LIFE CYCLE (PART II)
Learning Objectives

1. Describe data preparation steps and challenges

2. Distinguish data cleansing techniques

3. Describe data wrangling process and methods

4. Understand the concept of data profiling

5. Explain the purpose of the activities in each phase of the data science life cycle

6. Explain the analytics maturity model

7. Understand factors that impact analytics maturity

8. Recognize reasons for analytics project failures


Agenda

1. Data acquisition and preparation


2. Data wrangling and profiling
3. Predictive modeling
4. The rest of the data science life cycle
5. Analytics maturity levels & factors
6. Analytics projects failure reasons
DATA ACQUISITION
AND PREPARATION
The time we spend on data
preparation and data cleansing…

How many ways can you misspell


“Philadelphia”?

Go to menti.com
Data Science Life Cycle
Business problem → Data acquisition → Data preparation → Data wrangling → Predictive modeling → Visualization & Communication → Monitoring & Maintenance (and back to the business problem)
Data Acquisition
Gather & extract data: Gather, extract and mine data from the enterprise's source systems, cloud-based applications and external sources.

Understand the data: Understand the data and its business definitions. Acquisition questions:
• Where did the data come from?
• Is the data complete? What may be missing? Why?
• What were the data collection points?
• Who touched and processed the data?
• What are the quality issues of the data?
Data Preparation
Data preparation: a set of processes that gather data from diverse source systems, transform it according to business and technical rules, and stage it to be transformed into useful information.

Ensuring data quality

Textbook Chapter 5 Figure 5.2


Data Preparation
Reformat data: Convert data from multiple systems into a common format and schema. Requires schema and column definitions (data dictionary).

Consolidate & validate data: Consolidate data using standardized definitions; validate data by querying; determine whether data conforms to pre-defined business rules.

Transform data: Transform data into business information. Includes using business rules, algorithms, filters and creating associations.

Cleanse data: Analyze data for quality and inconsistency and clean up data issues.

Store data: Store the resulting data for further processing.

(All of these steps contribute to ensuring data quality.)
Data Preparation Challenges
Project delays and cost overruns are frequently tied to underestimating time & resources
required for data preparation
Volume, variety and veracity of Data is in different formats, follows different rules and is collected at
data different rates

Data quality at source Bad data is introduced by operational system defects, human errors,
manual steps, environmental instability (sensors, IoT)

Introducing data quality issues Information is skewed during transformation and aggregation (defects,
errors, incorrect algorithms)

Inconsistent data models Poor or incomplete understanding of source data leads to suboptimal
data models and missed or misinterpreted relationships between data,
which may result in misleading analytics

Lack of master data Data from multiple sources is not brought to a consistent common
management definition – cannot be joined or compared accurately
Data Cleansing Techniques

Validity checks: Check and correct invalid formats and invalid values (e.g. outside of range, syntax errors, typos, white space)

Relevance checks: Detect and remove irrelevant data (corrupted, inaccurate or irrelevant for the goals of the analytics)

Duplicate removal: Find, resolve and remove duplicate information (e.g. the same event recorded by two sources, the same event processed twice, a customer address captured in multiple systems)

Consistency checks: Detect values that contradict each other or are incompatible (e.g. is the year of birth consistent with the age). Validation may be based on constraints or business rules.
Data Cleansing Techniques

Data profiling: Use summary statistics about the data to assess quality (range, mean, distribution, unique values, outliers)

Visualization: Visualize data using statistical methods to detect unexpected or erroneous values (e.g. outliers)

Missing values:
1. Discard observations with missing values
2. Impute (calculate the missing value using other observations/data). May use statistical methods or copy values from similar records (hot deck method).
3. Flag records with missing values for special processing

Data normalization: Rescale data values into a range from 0 to 1 (min-max scaling); for normally distributed data, standardization to z-scores may be used instead
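As an illustration only (not from the textbook), a minimal pandas sketch of the three missing-value options and min-max normalization, using hypothetical age and income columns:

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 41, 33], "income": [52000, 61000, None, 48000]})

    # 1. Discard observations with missing values
    dropped = df.dropna()

    # 2. Impute: calculate the missing value from other observations
    imputed = df.fillna({"age": df["age"].median(), "income": df["income"].mean()})

    # 3. Keep & flag records with missing values for special processing
    df["has_missing"] = df.isna().any(axis=1)

    # Min-max normalization: rescale a column into the range 0 to 1
    rng = df["income"].max() - df["income"].min()
    df["income_norm"] = (df["income"] - df["income"].min()) / rng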
Data Cleansing Examples

Name and address cleansing: Matching and standardization of names and addresses

Customer householding: Linking personal and business accounts of family members under a household grouping
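A hedged sketch of name/address standardization (the city variants and mapping below are made-up examples, echoing the "Philadelphia" question earlier):

    import pandas as pd

    addresses = pd.DataFrame({"city": ["Philadelphia", "Philadephia", " philadelphia ", "PHILLY"]})

    # Standardize case and whitespace, then map known misspellings/aliases to one canonical value
    variants = {"philadephia": "philadelphia", "philly": "philadelphia"}
    addresses["city_std"] = (addresses["city"].str.strip().str.lower()
                             .replace(variants)
                             .str.title())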
DATA WRANGLING &
PROFILING
Data Science Life Cycle

From Data Preparation to Data Wrangling
(Preparation vs. Wrangling – Textbook Chapter 5 Figure 5.3)
Data Wrangling (Franchising)

Data wrangling (franchising): aggregation, summarization and enrichment of data for use with BI tools. A.k.a. “data munging”.

* Data wrangling processes and activities will be influenced by the selected BI tools

Textbook Chapter 5 Figure 5.3
Data wrangling – so many terms!
Data franchising
Data munging
Advanced data preparation
Data Wrangling - Iterative

Profiling:
- Guides data transformations
- Validates data transformations

https://www.bankingtech.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf
Data Wrangling: Gather, Filter, Subset
Gather: Gather the data from sources (which may have different formats and structures)

Filter: Choose a smaller part of the dataset relevant for the purpose.
Filtering can be done by tables, rows and columns:
• Select a subset that satisfies certain criteria
• Discard unwanted fields (attributes) that are irrelevant for the analytical purposes

Subset: Create subsets relevant to the analytics problem – a result of filtering
Data Wrangling: Profile & Transform
Profile data: Examine and evaluate the data content, quality, and relationships to better understand the data. Generate statistics and summaries from the data.

Restructure & de-normalize for target schema: Transform data from the source schema to the target BI tool schema. For example, restructure from the relational schema at the source (e.g. data warehouse) to a non-relational schema of the target (e.g. data mart). Transform unstructured data into structured form (e.g. numeric or categorical) for processing by analytics tools.

Advanced cleaning & validation: In addition to the cleaning performed during data preparation, more advanced cleaning and validation techniques may be provided by specific tools.

Enrich data: Perform business transformations and calculations required for business purposes; join and combine multiple datasets; create groupings, aggregations and summaries to improve performance and reduce the need for redundant calculations.
Data Profiling Questions

What’s in your data?

What is the quality of your data?

Is the data complete? Are there missing values?

Is the data unique? Are there duplications?

Are there problem records in the data set? Are there anomalies (outliers)?

What is the distribution of data? Does the distribution look right (as expected from this business
data)?

What are the ranges of values, minimums, maximums and averages? Are they as expected from
this business data?
Data Profiling – Individual

Individual values profiling: Understanding the validity of individual record fields.
Syntax checks:
• Formatting: is the data field in the correct format?
• Value range: does the value fall within the permissible set of values?
Semantic checks:
• Focus on the meaning of data in context (interpretation of data). For example, if New Year’s Day is a holiday, no orders should have a delivery date of Jan 1.
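A small illustrative sketch of individual-value profiling in pandas (the postal code pattern and the orders table are hypothetical, not from the course data):

    import pandas as pd

    orders = pd.DataFrame({
        "postal_code": ["M5V 2T6", "90210", "ABC", None],
        "delivery_date": ["2023-01-01", "2023-02-14", "2023-03-05", "2023-04-11"],
    })

    # Syntax check: does each postal code match the expected (here, Canadian) format?
    pattern = r"^[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d$"
    orders["postal_ok"] = orders["postal_code"].str.match(pattern, na=False)

    # Semantic check: no order should have a delivery date on New Year's Day
    dates = pd.to_datetime(orders["delivery_date"])
    orders["delivered_on_holiday"] = (dates.dt.month == 1) & (dates.dt.day == 1)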
Data Profiling – Set-based
Profiling type Description and variations

Set-based Understanding the distribution of values for a given field across multiple records.
profiling Checks the validity of distribution – is it as expected for this type of business data?
Numeric fields:
- Build a histogram and compare to the known distribution (e.g. Poisson or Gaussian)
- Determine summary statistics (min, max, median, mean) and identify outliers
Categorical fields:
- Count occurrences of unique values or clusters of values
Geospatial data:
- Plot data on a map
Date-time data:
- Plot date-time values across daily, weekly, monthly scales
Distribution of multiple values:
- Build scatter plots
- Check for duplication
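A minimal sketch of set-based profiling for a numeric field, assuming a hypothetical daily_sales series (summary statistics, a histogram, and a simple 1.5 x IQR outlier rule):

    import pandas as pd
    import matplotlib.pyplot as plt

    daily_sales = pd.Series([120, 135, 128, 131, 950, 127, 133])

    # Summary statistics: count, mean, min, max, quartiles
    print(daily_sales.describe())

    # Histogram to compare the shape against the expected distribution
    daily_sales.plot.hist(bins=10)
    plt.show()

    # Flag values outside 1.5 * IQR as potential outliers
    q1, q3 = daily_sales.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = daily_sales[(daily_sales < q1 - 1.5 * iqr) | (daily_sales > q3 + 1.5 * iqr)]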
Set-Based Profiling

Levels: column profiling, cross-column profiling, cross-table profiling

Key analysis: Scan collections of values within a table to locate a potential primary key, OR scan collections of values across tables to locate a potential foreign key

Dependency analysis: Determine dependent relationships within a dataset or across tables; identify redundant data and correlations
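A rough sketch of key and dependency analysis with pandas, using hypothetical customers and orders tables:

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
    orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 1, 4], "amount": [20, 35, 50]})

    # Key analysis: is customer_id unique enough to be a primary key?
    is_candidate_key = customers["customer_id"].is_unique

    # Cross-table check: orders whose customer_id has no match in customers (broken foreign key)
    orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]

    # Dependency analysis: a simple correlation scan across numeric columns
    correlations = orders.select_dtypes("number").corr()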
Individual vs. Set-Based Profiling

Mini-quiz

Go to menti.com
Data Wrangling: Transformations
Transformation type Description and variations

Structuring Actions that change the form or schema of the dataset:


Intra-record:
changing the order of fields within a record
breaking record fields into smaller components
combining fields into complex structures
Inter-record:
remove subsets of records
aggregations and pivots of the data (see granularity)

Granularity: Aggregations change the granularity of the dataset (e.g., moving from individual
customers to segments of customers, or from individual sales transactions to monthly
or quarterly net revenue calculations).
Pivoting shifts records into fields or fields into records.
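To make the granularity transformations concrete, a short pandas sketch (the transactions table below is hypothetical):

    import pandas as pd

    tx = pd.DataFrame({
        "month": ["2023-01", "2023-01", "2023-02"],
        "region": ["East", "West", "East"],
        "revenue": [100.0, 150.0, 120.0],
    })

    # Aggregation: move from individual transactions to monthly net revenue
    monthly = tx.groupby("month", as_index=False)["revenue"].sum()

    # Pivoting: shift region values from records into fields (columns)
    pivoted = tx.pivot_table(index="month", columns="region", values="revenue", aggfunc="sum")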
Data Wrangling Methods
Transformation type Description and variations

Cleansing of missing values: Actions that fix irregularities in the dataset (quality and consistency issues).
Cleansing predominantly involves manipulating individual field values within records.
The most common variant fixes missing (or NULL) values. Methods:
Discard:
Records with a missing value are discarded (not used for analytics)
Impute:
Calculate the missing value using other observations/data. Methods:
- Statistical methods (use the average or median value)
- Copy values from similar records (hot deck method)
- Interpolation for time series data, using last observation carried forward (LOCF)
or next observation carried backward (NOCB)
Keep & flag:
Keep records with missing values; usually involves flagging them for special
processing
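A brief illustrative sketch of LOCF, NOCB and interpolation for a hypothetical daily time series with gaps:

    import pandas as pd

    ts = pd.Series([10.0, None, None, 14.0],
                   index=pd.date_range("2023-01-01", periods=4, freq="D"))

    locf = ts.ffill()                # last observation carried forward
    nocb = ts.bfill()                # next observation carried backward
    interpolated = ts.interpolate()  # linear interpolation between known points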
Data Wrangling: Transformations
Transformation type Description and variations

Cleansing of invalid or inconsistent data: Invalid scenarios:
Data is inconsistent with other fields (e.g., a customer's age compared with their date of birth)
Data is ambiguous (e.g. an abbreviation that can have multiple interpretations)
Data is incorrectly coded (e.g. a categorical value does not match standards)
Methods:
Calculate the correct or consistent value for the field and overwrite the original value in the
dataset.
Keep both the original (incorrect) and derived (correct) value.
Mark values as invalid.

De-duplication Removal of duplicate records, reconciliation of inconsistencies in duplicate records

Data standardization Replacing different values, codes or spelling with a standard value.
Bringing values to a similar scale
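A minimal sketch of de-duplication and standardization (the country-code mapping is an assumed example):

    import pandas as pd

    df = pd.DataFrame({
        "customer": ["Ann", "Ann", "Bo"],
        "country": ["US", "USA", "United States"],
    })

    # De-duplication: keep one record per customer
    deduped = df.drop_duplicates(subset="customer", keep="first")

    # Standardization: replace different values, codes or spellings with a standard value
    df["country_std"] = df["country"].replace({"USA": "US", "United States": "US"})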
Data Wrangling: Transformations
Transformation Description and variations
type

Subsetting Split data sets into subsets to wrangle them separately:


- Subset by structure (e.g. split heterogeneous set of records)
- Subset by granularity
- Subset into smaller sized sets

Sampling: Using samples of big data to iteratively refine data wrangling steps.
Sampling requires statistical approaches to determine “representative samples”.
Usually requires some extreme values that represent the range, and a random
representative sample that reflects distribution trends.
Stratified samples: include representatives from all groups (“strata”), even
though they might misrepresent the trends of the overall dataset
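A small sketch of simple random vs. stratified sampling, assuming a hypothetical segment column that defines the strata:

    import pandas as pd

    df = pd.DataFrame({"segment": ["A"] * 90 + ["B"] * 10, "value": range(100)})

    # Simple random sample: 10% of records
    random_sample = df.sample(frac=0.1, random_state=42)

    # Stratified sample: the same number of records from every stratum,
    # even though this misrepresents the overall A/B proportions
    stratified = df.groupby("segment", group_keys=False).sample(n=5, random_state=42)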
Dataset structure types

Rectangular dataset: database table, matrix
“Jagged” dataset: varied record length, e.g. JSON, XML
Heterogeneous dataset: different entities with varied structure in one dataset
Data Wrangling: Transformations
Transformation type Description and variations

Enriching: Actions that add new values to the dataset from multiple datasets.
Joins: combine datasets by linking records, concatenating them “horizontally” into a wider
table that includes attributes from both sides of the match.
Unions: blend multiple datasets together “vertically” by appending the records of one
dataset after another (the datasets must share a compatible set of fields).
Metadata enrichment: add metadata (information about the data) into the
dataset. Can be dataset independent (e.g. the current time or the username of the
person transforming the data) or specific to the dataset (e.g. filenames or locations
of each record within the dataset).
Computation of new data values: Calculate or derive new data from existing data
(e.g. convert time based on geo-location; calculate a sentiment score from a chat
bot transcript).
Categorization: Reduce number of categories for categorical values, or to create
ranges (bins) for continuous variables (e.g. age ranges or income ranges). A.k.a.
coarse classification, classing, grouping, binning.
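The enrichment actions above can be sketched in a few pandas lines (the tables, bin edges and labels are hypothetical):

    import pandas as pd

    orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11], "amount": [250.0, 40.0]})
    customers = pd.DataFrame({"customer_id": [10, 11], "region": ["East", "West"]})
    more_orders = pd.DataFrame({"order_id": [3], "customer_id": [10], "amount": [99.0]})

    # Join: combine datasets horizontally by linking records on a shared key
    enriched = orders.merge(customers, on="customer_id", how="left")

    # Union: stack datasets with compatible fields into one longer table
    all_orders = pd.concat([orders, more_orders], ignore_index=True)

    # Metadata enrichment: record when the transformation was run
    all_orders["loaded_at"] = pd.Timestamp.now()

    # Categorization (binning): create ranges for a continuous variable
    all_orders["amount_band"] = pd.cut(all_orders["amount"], bins=[0, 50, 100, 500],
                                       labels=["low", "mid", "high"])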
Metadata: describing data
Metadata type Description and variations

Structure Format and encoding of records and fields

Granularity Level of depth or the number of entities represented by a data record.


(a.k.a. resolution) Fine granularity: each record represents a single entity (e.g. one order)
Coarse granularity: each record represents a collection of entities (e.g. total order
per region per month)

Accuracy Quality, accuracy and consistency of data

Temporality Time sensitivity of the dataset; how time impacts accuracy of the dataset.
Timestamps may be used to identify record creation or the last known date this
record was considered accurate

Scope of data The number of distinct attributes represented in a dataset (dimensionality)


The attribute-by-attribute population coverage: are “all” the attributes for each
record present in the dataset? (sparsity)
Dataset Structure Questions

Do all records in the dataset contain the same fields?

How can you access the same fields across records? By position? By name?

How are the records delimited/separated in the dataset? Do you need to parse records?

How are the record fields delimited from one another? Do you need to parse them?

How are the fields encoded? Human readable strings? Binary numbers? Hash keys? Codes?
Compressed?

What are the relationship types between records and the record fields:
- Singular (record should have one and only one value for a field, like customer date of birth)
- Set-based (record could have many values for the field, like customer shipping addresses)
Data Temporality Questions

When was the dataset collected?

Were all the records and record fields collected/measured at the same time?

Are the timestamps associated with the collection of the data known and available?

Have some records or record field values been modified after the time of creation? Are the
timestamps of these modifications available?

How can you determine if the data is “stale” (no longer accurate)? Can you forecast when the
data will become stale?

If there are conflicting values in the data (e.g., multiple mailing addresses for a person), can you
use timestamps to determine which value is “correct”?
Data Scope Questions

What characteristics of the things (represented by the records) are captured or not?

Are the same record fields available for all records? Are they accessible via the same specification
(position, name, etc.)?

Do the records in the dataset represent the entire population of associated things? Are there
missing records? Are the missing records randomly or systematically missing?

Are there multiple records for the same thing? If so, does this change the granularity of the dataset
(e.g., from customers to contacts) or require some amount of deduplication before analysis?

Does the dataset contain a heterogeneous set of records (representing different kinds of entities)?
If so, what is the relationship between the different kinds of records?
Publishing
Publish: Store data in the target analytics platform

What is published?

Dataset: A transformed version of the input datasets – “refined datasets” are published to the analytical tool

Transformation logic: Logic and scripts that generate the refined datasets; scripts that generate data wrangling statistics and insights

Profiling metadata: Profiling reports required for managing automated data services and products

https://www.bankingtech.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf
Data Science Life Cycle: Business problem → Data acquisition → Data preparation → Data wrangling → Predictive modeling → Visualization & Communication → Monitoring & Maintenance
PREDICTIVE
MODELING
Predictive Modeling Process
Explore data: Examine data and its properties; compute descriptive statistics; discover data anomalies; test significant variables; use visualization to identify patterns and trends

Build & train machine learning models: Form a hypothesis about the analytics problem; select candidate machine learning models with selected predictor variables; train the models using the training data set

Evaluate model performance: Test and evaluate models using test (hold-out) data sets; repeat the build, train and evaluate steps to optimize the model

Deploy models: Deploy the best performing model to production
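A minimal scikit-learn sketch of the build-train-evaluate loop above (the input file, feature names and model choice are assumptions for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("customers.csv")                     # hypothetical prepared dataset
    X, y = df[["tenure", "monthly_spend"]], df["churned"]

    # Hold out a test set so the model is evaluated on data it was not trained on
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = LogisticRegression()      # one candidate model
    model.fit(X_train, y_train)       # train on the training set

    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
    # Repeat with other candidate models/predictor variables, then deploy the best performer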


THE REST OF THE
DATA SCIENCE LIFE CYCLE
Data Science Life Cycle: Business problem → Data acquisition → Data preparation → Data wrangling → Predictive modeling → Visualization & Communication → Monitoring & Maintenance
Visualization & Communication
Visualize data: Present findings and insights to business users

Publish & communicate to stakeholders: Share the insights with business stakeholders in an easy to understand and consume format

Incorporate analytics into business process: Use the predictions and insights to make business decisions at specific points of a business process; create a feedback mechanism to assess the accuracy of predictions by collecting data and outcomes of the business process
Predictive Modeling

What is “decay”?
Predictive Model Decay
Decay:
• to decrease usually gradually in size, quantity, activity, or force
• to decline from a sound or prosperous condition
• to fall into ruin

Model decay reasons

The relationship between predictor variables and behaviour is changing

New/better data becomes available

Data scarcity - when data that the model needs becomes unavailable

Organization's objectives change


Monitoring & Maintenance
Monitor model performance: Models need to adapt to changing business conditions and data. Track results of the predictions and measure the effectiveness of the predictive models. Alert the business of model decay and modify models as their effectiveness starts to decline.

Maintain model design documentation:
• Create and maintain documentation of the model design
• Maintain the model monitoring process
• Update documentation as the model is enhanced or modified

Manage the models:
• Monitor the business value of the models
• Prune models with little business value
• Tune, improve and optimize the models
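One hedged way to operationalize decay monitoring: track a performance metric over time and alert when it falls below an agreed threshold (the log file, column names and 0.80 threshold below are illustrative assumptions):

    import pandas as pd
    from sklearn.metrics import accuracy_score

    # Hypothetical log of model predictions joined with actual business outcomes
    log = pd.read_csv("prediction_log.csv", parse_dates=["scored_at"])

    monthly_accuracy = (log.groupby(log["scored_at"].dt.to_period("M"))
                           .apply(lambda g: accuracy_score(g["actual"], g["predicted"])))

    THRESHOLD = 0.80
    decayed_months = monthly_accuracy[monthly_accuracy < THRESHOLD]
    if not decayed_months.empty:
        print("Possible model decay in:", list(decayed_months.index.astype(str)))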
Model Documentation Questions

When was the model designed, and by whom?

What is the perimeter of the model (ranges, entity types, geographical region, industry sectors)?

What are the strengths and the weaknesses of the model?

What data were used to build the model? How was the sample constructed? What is the time
horizon in the sample?

Is human judgement used, and how?

Bart Baesens (2014), Analytics in a Big Data World: The Essential Guide to Data Science and its Applications, Wiley
ANALYTICS
MATURITY
BI and Analytics Maturity Model

BI – Business Intelligence
BICC – Business Intelligence Competency Centre
ACE - Analytics Centre of Excellence
CAO – Chief Analytics Officer
CDO – Chief Data Officer

Source: Gartner (October 2016)


BI and Analytics Maturity Model
Monitor what has occurred → Infer why it has occurred → Predict and forecast the future → Determine the best course of action

Source: Gartner (October 2016)


Data Analytics Maturity

Brent Dykes, Data Analytics Marathon: Why Your Organization Must Focus on the Finish
Analytics & Data Science Success Rates
Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not
scale in the organization.
Jan 2019: Gartner

Through 2022, only 20% of analytic insights will deliver business outcomes.
Jan 2019: Gartner

77% of businesses report that "business adoption" of big data and AI initiatives is a big
challenge.
Jan 2019: NewVantage survey

87% of data science projects never make it into production.


July 2019: VentureBeat AI
https://blogs.gartner.com/andrew_white/2019/01/03/our-top-data-and-analytics-predicts-for-2019/
https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/
https://newvantage.com/wp-content/uploads/2018/12/Big-Data-Executive-Survey-2019-Findings.pdf
Analytics Maturity Factors
What are the factors that impact analytics maturity in an organization?
What impacts the success of analytics projects?

People, Culture,
Organization

Process

Data quality

Technology
Analytics Maturity Factors

People, Culture, Executive support – commitment to support the use of analytics


Organization Siloed vs. enterprise approach to managing data
Support for data governance
Skilled analytics team

Process Maturity of the analytics process; data-driven decision-making


BI development process – from identifying needs to sustained
repeatable application

Data quality Managing quality of data across organization

Technology Enterprise and data architecture


Disparate vs integrated systems
Analytics technology and expertise
ANALYTICS PROJECTS
FAILURE REASONS
Analytics Projects Failure Reasons

Failure reason: Not focused on business value, unclear requirements
Mitigation: Clear business case; understand the purpose of analytics; capture accurate and clear requirements

Failure reason: Relying on software to be the solution
Mitigation: Background research; select the right software for the job; rely on proper analysis and design; focus on quality of data

Failure reason: Treating big data as traditional structured data
Mitigation: Expertise and architecture must be adequate for handling big data; apply appropriate methods and tools, in particular for semi-structured and unstructured data

Source: Textbook Chapter 15
Analytics Projects Failure Reasons (cont'd)

Failure reason: Lack of expertise
Mitigation: Ensure a deep understanding of the business; acquire or grow data science expertise (including statistical, actuarial and specialized programming skills)

Failure reason: Analytics not integrated into the business process
Mitigation: Don't stop at collecting and analyzing data; incorporate analytics into the business process; measure the results (success of analytics) and adjust models as needed

Failure reason: Lack of solid project management
Mitigation: Keep scope realistic (don't promise too much); ensure enough support and resources from the business; manage business analytics efforts as projects

Source: Textbook Chapter 15
