BIA 5000 Introduction To Analytics - Lesson 6
2022 - 2023
LESSON 6.
DATA SCIENCE LIFE CYCLE (PART II)
Learning Objectives
5. Explain the purpose of the activities in each phase of the data science life cycle
Go to menti.com
Data Science Life Cycle
[Cycle diagram: Business problem → Data wrangling → Predictive modeling]
Data Acquisition
Gather & extract data: Gather, extract, and mine data from the enterprise’s source systems, cloud-based applications and external sources.
Understand the data: Understand the data and its business definitions. Acquisition questions:
• Where did the data come from?
• Is data complete? What may be missing? Why?
• What were the data collection points?
• Who touched and processed data?
• What are the quality issues of the data?
Data Preparation
Data preparation: a set of processes that gather data from diverse source systems, transform it according to business and technical rules, and stage it for transformation into useful information.
Data quality at source: Bad data is introduced by operational system defects, human errors, manual steps, and environmental instability (sensors, IoT).
Introducing data quality issues during processing: Information is skewed during transformation and aggregation (defects, errors, incorrect algorithms).
Inconsistent data models: Poor or incomplete understanding of source data leads to suboptimal data models and missed or misinterpreted relationships between data, and may result in misleading analytics.
Lack of master data management: Data from multiple sources is not brought to a consistent common definition, so it cannot be joined or compared accurately.
Data Cleansing Techniques
Validity checks: Check and correct invalid formats and invalid values (e.g. outside of range, syntax errors, typos, white space).
Relevance checks: Detect and remove irrelevant data (corrupted, inaccurate, or irrelevant for the goals of the analytics).
Duplicate removal: Find, resolve and remove duplicate information (e.g. the same event recorded by two sources, the same event processed twice, a customer address captured in multiple systems).
Consistency checks: Detect values that contradict each other or are incompatible (e.g. is year of birth consistent with age?). Validation may be based on constraints or business rules.
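A minimal pandas sketch of these cleansing techniques; the DataFrame, column names and the age / birth-year business rule are hypothetical:

```python
import pandas as pd

# Tiny illustrative dataset (hypothetical values and column names)
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "name": [" Ann ", "Ann", "Bob", "Cal"],
    "age": ["34", "34", "200", "41"],
    "birth_year": [1989, 1989, 1990, 1982],
})

# Validity checks: trim white space, coerce bad values, keep plausible ages
df["name"] = df["name"].str.strip()
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df = df[df["age"].between(0, 120)]

# Duplicate removal: the same customer captured by two systems
df = df.drop_duplicates(subset=["customer_id"])

# Consistency check (business rule): year of birth must agree with age
violations = df[((2023 - df["birth_year"]) - df["age"]).abs() > 1]
print(df)
print(len(violations), "records fail the age / birth-year consistency rule")
```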
Data Cleansing Techniques
Data profiling: Use summary statistics about the data to assess quality (range, mean, distribution, unique values, outliers).
Data normalization: Rescale data values into a range from 0 to 1 for normally distributed data.
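A sketch of simple profiling statistics and min-max rescaling in pandas; the column name and values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [120.0, 80.0, 200.0, 150.0, 95.0]})  # hypothetical

# Data profiling: summary statistics used to assess quality
print(df["revenue"].describe())            # count, mean, std, min, quartiles, max
print(df["revenue"].nunique(), "unique values")

# Data normalization: rescale values into the 0-1 range (min-max scaling)
col = df["revenue"]
df["revenue_norm"] = (col - col.min()) / (col.max() - col.min())
print(df)
```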
Data Cleansing Examples
Name and address cleansing: Matching and standardization of names and addresses.
Customer householding: Link personal and business accounts of family members under a household grouping.
DATA WRANGLING & PROFILING
Data Science Life Cycle
[Cycle diagram: the Data preparation and Data wrangling phases highlighted]
From Data Preparation to Data Wrangling
Data wrangling (franchising): aggregation, summarization and enrichment of data for use with BI tools; a.k.a. “data munging”.
* Data wrangling processes and activities will be influenced by the selected BI tools.
Textbook Chapter 5 Figure 5.3
Data wrangling – so many terms!
Data franchising
Data munging
Advanced data preparation
Data Wrangling - Iterative
Profiling:
- Guides data transformations
- Validates data transformations
https://www.bankingtech.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf
Data Wrangling: Gather, Filter, Subset
Gather: Gather the data from sources (they may have different formats and structures).
Are there problem records in the data set? Are there anomalies (outliers)?
What is the distribution of the data? Does the distribution look right (as expected from this business data)?
What are the ranges of values, minimums, maximums and averages? Are they as expected from this business data?
Data Profiling
Set-based profiling: Understanding the distribution of values for a given field across multiple records. Checks the validity of the distribution – is it as expected for this type of business data?
Numeric fields:
- Build a histogram and compare to the known distribution (e.g. Poisson or Gaussian)
- Determine summary statistics (min, max, median, mean) and identify outliers
Categorical fields:
- Count occurrences of unique values or clusters of values
Geospatial data:
- Plot data on a map
Date-time data:
- Plot date-time values across daily, weekly, monthly scales
Distribution of multiple values:
- Build scatter plots
- Check for duplication
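A minimal set-based profiling sketch in pandas; the sales data and the 2-standard-deviation outlier rule are illustrative assumptions:

```python
import pandas as pd

sales = pd.DataFrame({
    "amount": [10.5, 12.0, 11.2, 250.0, 9.8, 10.9],
    "region": ["East", "West", "East", "East", "North", "West"],
})

# Numeric field: summary statistics and a simple outlier rule (2 std devs)
print(sales["amount"].describe()[["min", "max", "50%", "mean"]])
outliers = sales[(sales["amount"] - sales["amount"].mean()).abs()
                 > 2 * sales["amount"].std()]
print("potential outliers:\n", outliers)

# Numeric field: histogram to compare against the expected distribution
sales["amount"].plot.hist(bins=10)          # needs matplotlib installed

# Categorical field: count occurrences of unique values
print(sales["region"].value_counts())
```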
Set-Based Profiling
Key analysis: Scan collections of values within a table to locate a potential primary key, OR scan collections of values across tables to locate a potential foreign key.
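A sketch of key analysis with pandas: scan the columns of one table for a candidate primary key, and check a candidate foreign key against another table. The tables and column names are hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": [1, 2, 1, 3],
    "status": ["new", "shipped", "new", "new"],
})
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})

# A candidate primary key has a unique, non-null value for every record
for col in orders.columns:
    is_candidate = orders[col].is_unique and orders[col].notna().all()
    print(f"{col}: candidate primary key = {is_candidate}")

# Candidate foreign key: every customer_id should exist in the customers table
orphans = ~orders["customer_id"].isin(customers["customer_id"])
print("orphan foreign-key values:", orphans.sum())
```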
Mini-quiz
Go to menti.com
Data Wrangling: Transformations
Transformation type Description and variations
Granularity: Aggregations change the granularity of the dataset (e.g., moving from individual customers to segments of customers, or from individual sales transactions to monthly or quarterly net revenue calculations). Pivoting shifts records into fields or shifts fields into records.
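A sketch of both variations with pandas; the transaction data is hypothetical:

```python
import pandas as pd

tx = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "segment": ["Retail", "Corporate", "Retail", "Corporate"],
    "revenue": [100.0, 250.0, 120.0, 300.0],
})

# Aggregation: move from individual transactions to monthly revenue
monthly = tx.groupby("month", as_index=False)["revenue"].sum()
print(monthly)

# Pivoting: shift records into fields (one column per customer segment)
wide = tx.pivot_table(index="month", columns="segment",
                      values="revenue", aggfunc="sum")
print(wide)
```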
Data Wrangling Methods
Transformation type Description and variations
Cleansing of missing values: Actions that fix irregularities in the dataset (quality and consistency issues). Cleansing predominantly involves manipulating individual field values within records. The most common variant fixes missing (or NULL) values. Methods:
Discard:
Records with a missing value are discarded (not used for analytics)
Impute:
Calculate the missing value using other observations/data. Methods:
- Statistical methods (use the average or median value)
- Copy values from similar records (hot deck method)
- Interpolation for time series data, using Last Observation Carried Forward (LOCF)
or Next Observation Carried Backward (NOCB)
Keep & flag:
Keep records with missing values; usually involves flagging them for special
processing
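A sketch of the missing-value methods above in pandas; the readings column is hypothetical:

```python
import pandas as pd
import numpy as np

s = pd.DataFrame({"reading": [10.0, np.nan, 12.0, np.nan, 15.0]})

# Discard: drop records that contain a missing value
dropped = s.dropna()

# Impute with a statistical method: fill gaps with the median of observed values
median_filled = s.fillna(s["reading"].median())

# Interpolation for time series: LOCF (forward fill) and NOCB (backward fill)
locf = s.ffill()   # last observation carried forward
nocb = s.bfill()   # next observation carried backward

# Keep & flag: keep the record but mark it for special processing
s["reading_missing"] = s["reading"].isna()
print(dropped, median_filled, locf, nocb, s, sep="\n\n")
```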
Data Wrangling: Transformations
Transformation type Description and variations
Data standardization: Replacing different values, codes or spelling with a standard value. Bringing values to a similar scale.
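A small pandas sketch of both ideas: mapping variant spellings to one standard code, and bringing numeric values to a similar scale. The values and the mapping are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "U.S.A.", "United States", "Canada", "CAN"],
                   "spend": [10.0, 200.0, 35.0, 80.0, 55.0]})

# Replace different values, codes or spelling with one standard value
standard = {"USA": "US", "U.S.A.": "US", "United States": "US",
            "Canada": "CA", "CAN": "CA"}
df["country_std"] = df["country"].map(standard)

# Bring values to a similar scale (z-score standardization)
df["spend_scaled"] = (df["spend"] - df["spend"].mean()) / df["spend"].std()
print(df)
```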
Data Wrangling: Transformations
Transformation Description and variations
type
Sampling: Using samples of big data to iteratively refine data wrangling steps. Sampling requires statistical approaches to determine “representative samples”. Usually requires some extreme values that represent the range, and a random representative sample that reflects distribution trends.
Stratified samples: include representatives from all groups (“strata”), even though they might misrepresent the trends of the overall dataset.
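A sketch of random and stratified sampling with pandas; the segment groups and sample sizes are illustrative assumptions:

```python
import pandas as pd

big = pd.DataFrame({
    "segment": ["A"] * 80 + ["B"] * 20,   # one small group ("stratum")
    "value": range(100),
})

# Random sample that reflects overall distribution trends
random_sample = big.sample(frac=0.1, random_state=42)

# Stratified sample: the same number of records from every group, so the
# small group is represented even if overall trends are distorted
stratified = big.groupby("segment", group_keys=False).apply(
    lambda g: g.sample(n=5, random_state=42))

print(random_sample["segment"].value_counts())
print(stratified["segment"].value_counts())
```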
Data Wrangling: Transformations
Enriching: Actions that add new values to the dataset from multiple datasets.
Joins: combine datasets by linking records – matching up records from two different datasets and concatenating them “horizontally” into a wider table that includes attributes from both sides of the match.
Unions: blend multiple datasets together by concatenating their records “vertically” into a longer table (the datasets contribute the same kinds of fields).
Metadata enrichment: add metadata (information about the data) into the dataset. Can be dataset independent (e.g. the current time or the username of the person transforming the data) or specific to the dataset (e.g. filenames or locations of each record within the dataset).
Computation of new data values: Calculate or derive new data from existing data (e.g. convert time based on geo-location; calculate a sentiment score from a chat bot transcript).
Categorization: Reduce the number of categories for categorical values, or create ranges (bins) for continuous variables (e.g. age ranges or income ranges). A.k.a. coarse classification, classing, grouping, binning.
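A pandas sketch of the enriching actions above: join, union, metadata enrichment and binning. The tables and bin edges are hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2], "amount": [120.0, 80.0]})
more_orders = pd.DataFrame({"customer_id": [3], "amount": [55.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 67]})

# Join: widen the table with attributes from both sides of the match
enriched = orders.merge(customers, on="customer_id", how="left")

# Union: stack records from datasets that share the same fields
all_orders = pd.concat([orders, more_orders], ignore_index=True)

# Metadata enrichment: record when the data was transformed
enriched["loaded_at"] = pd.Timestamp.now()

# Categorization / binning: turn a continuous variable into ranges
enriched["age_band"] = pd.cut(enriched["age"], bins=[0, 30, 50, 120],
                              labels=["<30", "30-50", "50+"])
print(enriched)
print(all_orders)
```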
Metadata: describing data
Temporality: Time sensitivity of the dataset; how time impacts the accuracy of the dataset. Timestamps may be used to identify record creation or the last known date the record was considered accurate.
Dataset Structure Questions
How can you access the same fields across records? By position? By name?
How are the records delimited/separated in the dataset? Do you need to parse records?
How are the record fields delimited from one another? Do you need to parse them?
How are the fields encoded? Human readable strings? Binary numbers? Hash keys? Codes?
Compressed?
What are the relationship types between records and the record fields:
- Singular (record should have one and only one value for a field, like customer date of birth)
- Set-based (record could have many values for the field, like customer shipping addresses)
Data Temporality Questions
Were all the records and record fields collected/measured at the same time?
Are the timestamps associated with collection of the data known and available?
Have some records or record field values been modified after the time of creation? Are the
timestamps of these modifications available?
How can you determine if the data is “stale” (no longer accurate)? Can you forecast when the
data will become stale?
If there are conflicting values in the data (e.g., multiple mailing addresses for a person), can you
use timestamps to determine which value is “correct”?
Data Scope Questions
What characteristics of the things (represented by the records) are captured or not?
Are the same record fields available for all records? Are they accessible via the same specification
(position, name, etc.)?
Do the records in the dataset represent the entire population of associated things? Are there
missing records? Are the missing records randomly or systematically missing?
Are there multiple records for the same thing? If so, does this change the granularity of the dataset
(e.g., from customers to contacts) or require some amount of deduplication before analysis?
Does the dataset contain a heterogeneous set of records (representing different kinds of entities)?
If so, what is the relationship between the different kinds of records?
Publishing
Publish: Store data into the target analytics platform.
What is published?
Transformation logic: Logic and scripts that generate the refined datasets; scripts that generate data wrangling statistics and insights.
Profiling metadata: Profiling reports required for managing automated data services and products.
https://www.bankingtech.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf
[Cycle diagram: Business problem → Data wrangling → Predictive modeling]
PREDICTIVE MODELING
Predictive Modeling Process
Explore data: Examine data and its properties; compute descriptive statistics; discover data anomalies; test significant variables; use visualization to identify patterns and trends.
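A minimal data-exploration sketch in pandas; the churn dataset and column names are hypothetical:

```python
import pandas as pd

data = pd.DataFrame({
    "tenure_months": [3, 24, 36, 6, 48, 12],
    "monthly_spend": [20.0, 55.0, 60.0, 25.0, 80.0, 30.0],
    "churned": [1, 0, 0, 1, 0, 1],
})

# Descriptive statistics and a first scan for anomalies
print(data.describe())

# Which variables look significant? Check correlation with the target
print(data.corr()["churned"].sort_values())

# Visualization to spot patterns and trends (needs matplotlib installed)
data.plot.scatter(x="tenure_months", y="monthly_spend", c="churned",
                  colormap="viridis")
```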
Visualization & Communication
Visualize data: Present findings and insights to business users.
Publish & communicate to stakeholders: Share the insights with business stakeholders in an easy-to-understand and easy-to-consume format.
Incorporate analytics into the business process: Use the predictions and insights to make business decisions at specific points of a business process; create a feedback mechanism to assess the accuracy of predictions by collecting data and outcomes of the business process.
Predictive Modeling
What is “decay”?
Predictive Model Decay
Decay:
• to decrease usually gradually in size, quantity, activity, or force
• to decline from a sound or prosperous condition
• to fall into ruin
Data scarcity - when data that the model needs becomes unavailable
What is the perimeter of the model (ranges, entity types, geographical region, industry sectors)?
What data were used to build the model? How was the sample constructed? What is the time
horizon in the sample?
BI – Business Intelligence
BICC – Business Intelligence Competency Centre
ACE - Analytics Centre of Excellence
CAO – Chief Analytics Officer
CDO – Chief Data Officer
Brent Dykes, Data Analytics Marathon: Why Your Organization Must Focus on the Finish
Analytics & Data Science Success Rates
Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not
scale in the organization.
Jan 2019: Gartner
Through 2022, only 20% of analytic insights will deliver business outcomes.
Jan 2019: Gartner
77% of businesses report that "business adoption" of big data and AI initiatives is a big
challenge.
Jan 2019: NewVantage survey
Analytics Maturity Factors:
- People, Culture, Organization
- Process
- Data quality
- Technology
Analytics Projects Failure Reasons
Treating big data as traditional structured data: Expertise and architecture must be adequate for handling big data. Apply appropriate methods and tools, in particular for semi-structured and unstructured data.
Source: Textbook Chapter 15
Analytics Projects Failure Reasons (cont’d)
Lack of solid project management: Keep scope realistic (don’t promise too much). Ensure enough support and resources from the business. Manage business analytics efforts as projects.
Source: Textbook Chapter 15