BIA 5000 Introduction To Analytics - Lesson 6

This document discusses the data science life cycle and related concepts. It describes key steps in data acquisition and preparation including gathering, understanding, reformatting, consolidating, transforming, cleansing and storing data. Common challenges in data preparation are also reviewed. The document then discusses data wrangling techniques including gathering, filtering, subsetting, profiling and transforming data. Finally, it introduces the concept of data profiling and provides example questions to profile individual data.


INTRODUCTION TO ANALYTICS
2022 - 2023
LESSON 6. DATA SCIENCE LIFE CYCLE (PART II)
Learning Objectives

1. Describe data preparation steps and challenges

2. Distinguish data cleansing techniques

3. Describe data wrangling process and methods

4. Understand the concept of data profiling

5. Explain the purpose of the activities in each phase of the data science life cycle

6. Explain the analytics maturity model

7. Understand factors that impact analytics maturity

8. Recognize reasons for analytics project failures


Agenda

1. Data acquisition and preparation


2. Data wrangling and profiling
3. Predictive modeling
4. The rest of the data science life cycle
5. Analytics maturity levels & factors
6. Analytics projects failure reasons
DATA ACQUISITION
AND PREPARATION
The time we spend on data
preparation and data cleansing…

How many ways can you misspell


“Philadelphia”?

Go to menti.com
Data Science Life Cycle
Business problem → Data acquisition → Data preparation → Data wrangling → Predictive modeling → Visualization & Communication → Monitoring & Maintenance (and back to the business problem)
Data Acquisition
Gather & extract data: Gather, extract and mine data from the enterprise's source systems, cloud-based applications and external sources.

Understand the data: Understand the data and its business definitions. Acquisition questions:
• Where did the data come from?
• Is the data complete? What may be missing? Why?
• What were the data collection points?
• Who touched and processed the data?
• What are the quality issues of the data?
Data Preparation
Data preparation: a set of processes that gather data from diverse source systems, transform it according to business and technical rules, and stage it to be transformed into useful information.

Ensuring data quality

Textbook Chapter 5 Figure 5.2


Data Preparation
Reformat data: Convert data from multiple systems into a common format and schema. Requires schema and column definitions (data dictionary).

Consolidate & validate data: Consolidate data using standardized definitions; validate data by querying; determine whether data conforms to pre-defined business rules.

Transform data: Transform data into business information. Includes using business rules, algorithms, filters and creating associations.

Cleanse data: Analyze data for quality and inconsistency and clean up data issues.

Store data: Store the resulting data for further processing.

(All of these steps contribute to ensuring data quality.)
Data Preparation Challenges
Project delays and cost overruns are frequently tied to underestimating time & resources
required for data preparation
Volume, variety and veracity of Data is in different formats, follows different rules and is collected at
data different rates

Data quality at source Bad data is introduced by operational system defects, human errors,
manual steps, environmental instability (sensors, IoT)

Introducing data quality issues Information is skewed during transformation and aggregation (defects,
errors, incorrect algorithms)

Inconsistent data models Poor or incomplete understanding of source data leads to suboptimal
data models and missed or misinterpreted relationships between data,
which may result in misleading analytics

Lack of master data Data from multiple sources is not brought to a consistent common
management definition – cannot be joined or compared accurately
Data Cleansing Techniques

Validity checks: Check and correct invalid formats and invalid values (e.g. outside of range, syntax errors, typos, white space)

Relevance checks: Detect and remove irrelevant data (corrupted, inaccurate or irrelevant for the goals of the analytics)

Duplicate removal: Find, resolve and remove duplicate information (e.g. the same event recorded by two sources, the same event processed twice, a customer address captured in multiple systems)

Consistency checks: Detect values that contradict each other or are incompatible (e.g. is the year of birth consistent with the age). Validation may be based on constraints or business rules.
Data Cleansing Techniques

Data profiling: Use summary statistics about the data to assess quality (range, mean, distribution, unique values, outliers)

Visualization: Visualize data using statistical methods to detect unexpected or erroneous values (e.g. outliers)

Missing values:
1. Discard observations with missing values
2. Impute (calculate the missing value using other observations/data). May use statistical methods or copy values from similar records (hot deck method).
3. Flag records with missing values for special processing

Data normalization: Rescale data values into a range from 0 to 1 (min-max scaling); for normally distributed data, standardization to z-scores may be used instead
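As an illustration only (not from the textbook), a minimal pandas sketch of the three missing-value options and min-max normalization, using hypothetical age and income columns:

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 41, 33], "income": [52000, 61000, None, 48000]})

    # 1. Discard observations with missing values
    dropped = df.dropna()

    # 2. Impute: calculate the missing value from other observations
    imputed = df.fillna({"age": df["age"].median(), "income": df["income"].mean()})

    # 3. Keep & flag records with missing values for special processing
    df["has_missing"] = df.isna().any(axis=1)

    # Min-max normalization: rescale a column into the range 0 to 1
    rng = df["income"].max() - df["income"].min()
    df["income_norm"] = (df["income"] - df["income"].min()) / rng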
Data Cleansing Examples

Name and address cleansing: Matching and standardization of names and addresses

Customer householding: Linking personal and business accounts of family members under a household grouping
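A hedged sketch of name/address standardization (the city variants and mapping below are made-up examples, echoing the "Philadelphia" question earlier):

    import pandas as pd

    addresses = pd.DataFrame({"city": ["Philadelphia", "Philadephia", " philadelphia ", "PHILLY"]})

    # Standardize case and whitespace, then map known misspellings/aliases to one canonical value
    variants = {"philadephia": "philadelphia", "philly": "philadelphia"}
    addresses["city_std"] = (addresses["city"].str.strip().str.lower()
                             .replace(variants)
                             .str.title())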
DATA WRANGLING &
PROFILING
Data Science Life Cycle

From Data Preparation to Data Wrangling
(Preparation vs. Wrangling – Textbook Chapter 5 Figure 5.3)
Data Wrangling (Franchising)

Data wrangling (franchising): aggregation, summarization and enrichment of data for use with BI tools. A.k.a. “data munging”.

* Data wrangling processes and activities will be influenced by the selected BI tools

Textbook Chapter 5 Figure 5.3
Data wrangling – so many terms!
Data franchising
Data munging
Advanced data preparation
Data Wrangling - Iterative

Profiling:
- Guides data transformations
- Validates data transformations

https://www.bankingtech.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf
Data Wrangling: Gather, Filter, Subset
Gather: Gather the data from sources (which may have different formats and structures)

Filter: Choose a smaller part of the dataset relevant for the purpose.
Filtering can be done by tables, rows and columns:
• Select a subset that satisfies certain criteria
• Discard unwanted fields (attributes) that are irrelevant for the analytical purposes

Subset: Create subsets relevant to the analytics problem – a result of filtering
Data Wrangling: Profile & Transform
Profile data: Examine and evaluate the data content, quality, and relationships to better understand the data. Generate statistics and summaries from the data.

Restructure & de-normalize for target schema: Transform data from the source schema to the target BI tool schema. For example, restructure from the relational schema at the source (e.g. data warehouse) to a non-relational schema of the target (e.g. data mart). Transform unstructured data into structured form (e.g. numeric or categorical) for processing by analytics tools.

Advanced cleaning & validation: In addition to the cleaning performed during data preparation, more advanced cleaning and validation techniques may be provided by specific tools.

Enrich data: Perform business transformations and calculations required for business purposes; join and combine multiple datasets; create groupings, aggregations and summaries to improve performance and reduce the need for redundant calculations.
Data Profiling Questions

What’s in your data?

What is the quality of your data?

Is the data complete? Are there missing values?

Is the data unique? Are there duplications?

Are there problem records in the data set? Are there anomalies (outliers)?

What is the distribution of data? Does the distribution look right (as expected from this business
data)?

What are the ranges of values, minimums, maximums and averages? Are they as expected from
this business data?
Data Profiling – Individual

Individual values profiling: Understanding the validity of individual record fields.
Syntax checks:
• Formatting: is the data field in the correct format?
• Value range: does the value fall within the permissible set of values?
Semantic checks:
• Focus on the meaning of data in context (interpretation of data). For example, if New Year’s Day is a holiday, no orders should have a delivery date of Jan 1.
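A small illustrative sketch of individual-value profiling in pandas (the postal code pattern and the orders table are hypothetical, not from the course data):

    import pandas as pd

    orders = pd.DataFrame({
        "postal_code": ["M5V 2T6", "90210", "ABC", None],
        "delivery_date": ["2023-01-01", "2023-02-14", "2023-03-05", "2023-04-11"],
    })

    # Syntax check: does each postal code match the expected (here, Canadian) format?
    pattern = r"^[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d$"
    orders["postal_ok"] = orders["postal_code"].str.match(pattern, na=False)

    # Semantic check: no order should have a delivery date on New Year's Day
    dates = pd.to_datetime(orders["delivery_date"])
    orders["delivered_on_holiday"] = (dates.dt.month == 1) & (dates.dt.day == 1)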
Data Profiling – Set-based
Profiling type Description and variations

Set-based Understanding the distribution of values for a given field across multiple records.
profiling Checks the validity of distribution – is it as expected for this type of business data?
Numeric fields:
- Build a histogram and compare to the known distribution (e.g. Poisson or Gaussian)
- Determine summary statistics (min, max, median, mean) and identify outliers
Categorical fields:
- Count occurrences of unique values or clusters of values
Geospatial data:
- Plot data on a map
Date-time data:
- Plot date-time values across daily, weekly, monthly scales
Distribution of multiple values:
- Build scatter plots
- Check for duplication
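A minimal sketch of set-based profiling for a numeric field, assuming a hypothetical daily_sales series (summary statistics, a histogram, and a simple 1.5 x IQR outlier rule):

    import pandas as pd
    import matplotlib.pyplot as plt

    daily_sales = pd.Series([120, 135, 128, 131, 950, 127, 133])

    # Summary statistics: count, mean, min, max, quartiles
    print(daily_sales.describe())

    # Histogram to compare the shape against the expected distribution
    daily_sales.plot.hist(bins=10)
    plt.show()

    # Flag values outside 1.5 * IQR as potential outliers
    q1, q3 = daily_sales.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = daily_sales[(daily_sales < q1 - 1.5 * iqr) | (daily_sales > q3 + 1.5 * iqr)]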
Set-Based Profiling

Levels: column profiling, cross-column profiling, cross-table profiling

Key analysis: Scan collections of values within a table to locate a potential primary key, OR scan collections of values across tables to locate a potential foreign key

Dependency analysis: Determine dependent relationships within a dataset or across tables; identify redundant data and correlations
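A rough sketch of key and dependency analysis with pandas, using hypothetical customers and orders tables:

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
    orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 1, 4], "amount": [20, 35, 50]})

    # Key analysis: is customer_id unique enough to be a primary key?
    is_candidate_key = customers["customer_id"].is_unique

    # Cross-table check: orders whose customer_id has no match in customers (broken foreign key)
    orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]

    # Dependency analysis: a simple correlation scan across numeric columns
    correlations = orders.select_dtypes("number").corr()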
Individual vs. Set-Based Profiling

Mini-quiz

Go to menti.com
Data Wrangling: Transformations
Transformation type Description and variations

Structuring Actions that change the form or schema of the dataset:


Intra-record:
changing the order of fields within a record
breaking record fields into smaller components
combining fields into complex structures
Inter-record:
remove subsets of records
aggregations and pivots of the data (see granularity)

Granularity: Aggregations change the granularity of the dataset (e.g., moving from individual
customers to segments of customers, or from individual sales transactions to monthly
or quarterly net revenue calculations).
Pivoting shifts records into fields or fields into records.
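To make the granularity transformations concrete, a short pandas sketch (the transactions table below is hypothetical):

    import pandas as pd

    tx = pd.DataFrame({
        "month": ["2023-01", "2023-01", "2023-02"],
        "region": ["East", "West", "East"],
        "revenue": [100.0, 150.0, 120.0],
    })

    # Aggregation: move from individual transactions to monthly net revenue
    monthly = tx.groupby("month", as_index=False)["revenue"].sum()

    # Pivoting: shift region values from records into fields (columns)
    pivoted = tx.pivot_table(index="month", columns="region", values="revenue", aggfunc="sum")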
Data Wrangling Methods
Transformation type Description and variations

Cleansing of missing values: Actions that fix irregularities in the dataset (quality and consistency issues).
Cleansing predominantly involves manipulating individual field values within records.
The most common variant fixes missing (or NULL) values. Methods:
Discard:
Records with a missing value are discarded (not used for analytics)
Impute:
Calculate the missing value using other observations/data. Methods:
- Statistical methods (use the average or median value)
- Copy values from similar records (hot deck method)
- Interpolation for time series data, using last observation carried forward (LOCF)
or next observation carried backward (NOCB)
Keep & flag:
Keep records with missing values; usually involves flagging them for special
processing
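A brief illustrative sketch of LOCF, NOCB and interpolation for a hypothetical daily time series with gaps:

    import pandas as pd

    ts = pd.Series([10.0, None, None, 14.0],
                   index=pd.date_range("2023-01-01", periods=4, freq="D"))

    locf = ts.ffill()                # last observation carried forward
    nocb = ts.bfill()                # next observation carried backward
    interpolated = ts.interpolate()  # linear interpolation between known points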
Data Wrangling: Transformations
Transformation type Description and variations

Cleansing of invalid or inconsistent data: Invalid scenarios:
Data is inconsistent with other fields (e.g., a customer's age compared with their date of birth)
Data is ambiguous (e.g. an abbreviation that can have multiple interpretations)
Data is incorrectly coded (e.g. a categorical value does not match standards)
Methods:
Calculate the correct or consistent value for the field and overwrite the original value in the
dataset.
Keep both the original (incorrect) and derived (correct) value.
Mark values as invalid.

De-duplication Removal of duplicate records, reconciliation of inconsistencies in duplicate records

Data standardization Replacing different values, codes or spelling with a standard value.
Bringing values to a similar scale
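A minimal sketch of de-duplication and standardization (the country-code mapping is an assumed example):

    import pandas as pd

    df = pd.DataFrame({
        "customer": ["Ann", "Ann", "Bo"],
        "country": ["US", "USA", "United States"],
    })

    # De-duplication: keep one record per customer
    deduped = df.drop_duplicates(subset="customer", keep="first")

    # Standardization: replace different values, codes or spellings with a standard value
    df["country_std"] = df["country"].replace({"USA": "US", "United States": "US"})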
Data Wrangling: Transformations
Transformation Description and variations
type

Subsetting Split data sets into subsets to wrangle them separately:


- Subset by structure (e.g. split heterogeneous set of records)
- Subset by granularity
- Subset into smaller sized sets

Sampling: Using samples of big data to iteratively refine data wrangling steps.
Sampling requires statistical approaches to determine “representative samples”.
Usually requires some extreme values that represent the range, and a random
representative sample that reflects distribution trends.
Stratified samples: include representatives from all groups (“strata”), even
though they might misrepresent the trends of the overall dataset
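A small sketch of simple random vs. stratified sampling, assuming a hypothetical segment column that defines the strata:

    import pandas as pd

    df = pd.DataFrame({"segment": ["A"] * 90 + ["B"] * 10, "value": range(100)})

    # Simple random sample: 10% of records
    random_sample = df.sample(frac=0.1, random_state=42)

    # Stratified sample: the same number of records from every stratum,
    # even though this misrepresents the overall A/B proportions
    stratified = df.groupby("segment", group_keys=False).sample(n=5, random_state=42)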
Dataset structure types

Rectangular dataset: database table, matrix
“Jagged” dataset: varied record length, e.g. JSON, XML
Heterogeneous dataset: different entities with varied structure in one dataset
Data Wrangling: Transformations
Transformation type Description and variations

Enriching: Actions that add new values to the dataset from multiple datasets.
Joins: combine datasets by linking records, concatenating them “horizontally” into a wider
table that includes attributes from both sides of the match.
Unions: blend multiple datasets together “vertically” by appending the records of one
dataset after another (the datasets must share a compatible set of fields).
Metadata enrichment: add metadata (information about the data) into the
dataset. Can be dataset independent (e.g. the current time or the username of the
person transforming the data) or specific to the dataset (e.g. filenames or locations
of each record within the dataset).
Computation of new data values: Calculate or derive new data from existing data
(e.g. convert time based on geo-location; calculate a sentiment score from a chat
bot transcript).
Categorization: Reduce number of categories for categorical values, or to create
ranges (bins) for continuous variables (e.g. age ranges or income ranges). A.k.a.
coarse classification, classing, grouping, binning.
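The enrichment actions above can be sketched in a few pandas lines (the tables, bin edges and labels are hypothetical):

    import pandas as pd

    orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11], "amount": [250.0, 40.0]})
    customers = pd.DataFrame({"customer_id": [10, 11], "region": ["East", "West"]})
    more_orders = pd.DataFrame({"order_id": [3], "customer_id": [10], "amount": [99.0]})

    # Join: combine datasets horizontally by linking records on a shared key
    enriched = orders.merge(customers, on="customer_id", how="left")

    # Union: stack datasets with compatible fields into one longer table
    all_orders = pd.concat([orders, more_orders], ignore_index=True)

    # Metadata enrichment: record when the transformation was run
    all_orders["loaded_at"] = pd.Timestamp.now()

    # Categorization (binning): create ranges for a continuous variable
    all_orders["amount_band"] = pd.cut(all_orders["amount"], bins=[0, 50, 100, 500],
                                       labels=["low", "mid", "high"])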
Metadata: describing data
Metadata type Description and variations

Structure Format and encoding of records and fields

Granularity Level of depth or the number of entities represented by a data record.


(a.k.a. resolution) Fine granularity: each record represents a single entity (e.g. one order)
Coarse granularity: each record represents a collection of entities (e.g. total order
per region per month)

Accuracy Quality, accuracy and consistency of data

Temporality Time sensitivity of the dataset; how time impacts accuracy of the dataset.
Timestamps may be used to identify record creation or the last known date this
record was considered accurate

Scope of data The number of distinct attributes represented in a dataset (dimensionality)


The attribute-by-attribute population coverage: are “all” the attributes for each
record present in the dataset? (sparsity)
Dataset Structure Questions

Do all records in the dataset contain the same fields?

How can you access the same fields across records? By position? By name?

How are the records delimited/separated in the dataset? Do you need to parse records?

How are the record fields delimited from one another? Do you need to parse them?

How are the fields encoded? Human readable strings? Binary numbers? Hash keys? Codes?
Compressed?

What are the relationship types between records and the record fields:
- Singular (record should have one and only one value for a field, like customer date of birth)
- Set-based (record could have many values for the field, like customer shipping addresses)
Data Temporality Questions

When was the dataset collected?

Were all the records and record fields collected/measured at the same time?

Are the timestamps associated with the collection of the data known and available?

Have some records or record field values been modified after the time of creation? Are the
timestamps of these modifications available?

How can you determine if the data is “stale” (no longer accurate)? Can you forecast when the
data will become stale?

If there are conflicting values in the data (e.g., multiple mailing addresses for a person), can you
use timestamps to determine which value is “correct”?
Data Scope Questions

What characteristics of the things (represented by the records) are captured or not?

Are the same record fields available for all records? Are they accessible via the same specification
(position, name, etc.)?

Do the records in the dataset represent the entire population of associated things? Are there
missing records? Are the missing records randomly or systematically missing?

Are there multiple records for the same thing? If so, does this change the granularity of the dataset
(e.g., from customers to contacts) or require some amount of deduplication before analysis?

Does the dataset contain a heterogeneous set of records (representing different kinds of entities)?
If so, what is the relationship between the different kinds of records?
Publishing
Publish: Store data in the target analytics platform

What is published?

Dataset: A transformed version of the input datasets – “refined datasets” are published to the analytical tool

Transformation logic: Logic and scripts that generate the refined datasets; scripts that generate data wrangling statistics and insights

Profiling metadata: Profiling reports required for managing automated data services and products

https://www.bankingtech.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf
Data Science Life Cycle: Business problem → Data acquisition → Data preparation → Data wrangling → Predictive modeling → Visualization & Communication → Monitoring & Maintenance
PREDICTIVE
MODELING
Predictive Modeling Process
Explore data: Examine data and its properties; compute descriptive statistics; discover data anomalies; test significant variables; use visualization to identify patterns and trends

Build & train machine learning models: Form a hypothesis about the analytics problem; select candidate machine learning models with selected predictor variables; train the models using the training data set

Evaluate model performance: Test and evaluate models using test (hold-out) data sets; repeat the build, train and evaluate steps to optimize the model

Deploy models: Deploy the best performing model to production
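A minimal scikit-learn sketch of the build-train-evaluate loop above (the input file, feature names and model choice are assumptions for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("customers.csv")                     # hypothetical prepared dataset
    X, y = df[["tenure", "monthly_spend"]], df["churned"]

    # Hold out a test set so the model is evaluated on data it was not trained on
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = LogisticRegression()      # one candidate model
    model.fit(X_train, y_train)       # train on the training set

    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
    # Repeat with other candidate models/predictor variables, then deploy the best performer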


THE REST OF THE
DATA SCIENCE LIFE CYCLE
Data Science Life Cycle: Business problem → Data acquisition → Data preparation → Data wrangling → Predictive modeling → Visualization & Communication → Monitoring & Maintenance
Visualization & Communication
Visualize data: Present findings and insights to business users

Publish & communicate to stakeholders: Share the insights with business stakeholders in an easy to understand and consume format

Incorporate analytics into business process: Use the predictions and insights to make business decisions at specific points of a business process; create a feedback mechanism to assess the accuracy of predictions by collecting data and outcomes of the business process
Predictive Modeling

What is “decay”?
Predictive Model Decay
Decay:
• to decrease usually gradually in size, quantity, activity, or force
• to decline from a sound or prosperous condition
• to fall into ruin

Model decay reasons

The relationship between predictor variables and behaviour is changing

New/better data becomes available

Data scarcity - when data that the model needs becomes unavailable

Organization's objectives change


Monitoring & Maintenance
Monitor model performance: Models need to adapt to changing business conditions and data. Track results of the predictions and measure the effectiveness of the predictive models. Alert the business of model decay and modify models as their effectiveness starts to decline.

Maintain model design documentation:
• Create and maintain documentation of the model design
• Maintain the model monitoring process
• Update documentation as the model is enhanced or modified

Manage the models:
• Monitor the business value of the models
• Prune models with little business value
• Tune, improve and optimize the models
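One hedged way to operationalize decay monitoring: track a performance metric over time and alert when it falls below an agreed threshold (the log file, column names and 0.80 threshold below are illustrative assumptions):

    import pandas as pd
    from sklearn.metrics import accuracy_score

    # Hypothetical log of model predictions joined with actual business outcomes
    log = pd.read_csv("prediction_log.csv", parse_dates=["scored_at"])

    monthly_accuracy = (log.groupby(log["scored_at"].dt.to_period("M"))
                           .apply(lambda g: accuracy_score(g["actual"], g["predicted"])))

    THRESHOLD = 0.80
    decayed_months = monthly_accuracy[monthly_accuracy < THRESHOLD]
    if not decayed_months.empty:
        print("Possible model decay in:", list(decayed_months.index.astype(str)))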
Model Documentation Questions

When was the model designed, and by whom?

What is the perimeter of the model (ranges, entity types, geographical region, industry sectors)?

What are the strengths and the weaknesses of the model?

What data were used to build the model? How was the sample constructed? What is the time
horizon in the sample?

Is human judgement used, and how?

Bart Baesens (2014), Analytics in a Big Data World: The Essential Guide to Data Science and its Applications, Wiley
ANALYTICS
MATURITY
BI and Analytics Maturity Model

BI – Business Intelligence
BICC – Business Intelligence Competency Centre
ACE - Analytics Centre of Excellence
CAO – Chief Analytics Officer
CDO – Chief Data Officer

Source: Gartner (October 2016)


BI and Analytics Maturity Model
Monitor what has occurred → Infer why it has occurred → Predict and forecast the future → Determine the best course of action

Source: Gartner (October 2016)


Data Analytics Maturity

Brent Dykes, Data Analytics Marathon: Why Your Organization Must Focus on the Finish
Analytics & Data Science Success Rates
Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not
scale in the organization.
Jan 2019: Gartner

Through 2022, only 20% of analytic insights will deliver business outcomes.
Jan 2019: Gartner

77% of businesses report that "business adoption" of big data and AI initiatives is a big
challenge.
Jan 2019: NewVantage survey

87% of data science projects never make it into production.


July 2019: VentureBeat AI
https://blogs.gartner.com/andrew_white/2019/01/03/our-top-data-and-analytics-predicts-for-2019/
https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/
https://newvantage.com/wp-content/uploads/2018/12/Big-Data-Executive-Survey-2019-Findings.pdf
Analytics Maturity Factors
What are the factors that impact analytics maturity in an organization?
What impacts the success of analytics projects?

People, Culture,
Organization

Process

Data quality

Technology
Analytics Maturity Factors

People, Culture, Executive support – commitment to support the use of analytics


Organization Siloed vs. enterprise approach to managing data
Support for data governance
Skilled analytics team

Process Maturity of the analytics process; data-driven decision-making


BI development process – from identifying needs to sustained
repeatable application

Data quality Managing quality of data across organization

Technology Enterprise and data architecture


Disparate vs integrated systems
Analytics technology and expertise
ANALYTICS PROJECTS
FAILURE REASONS
Analytics Projects Failure Reasons

Failure reason: Not focused on business value, unclear requirements
Mitigation: Clear business case; understand the purpose of analytics; capture accurate and clear requirements

Failure reason: Relying on software to be the solution
Mitigation: Background research; select the right software for the job; rely on proper analysis and design; focus on quality of data

Failure reason: Treating big data as traditional structured data
Mitigation: Expertise and architecture must be adequate for handling big data; apply appropriate methods and tools, in particular for semi-structured and unstructured data

Source: Textbook Chapter 15
Analytics Projects Failure Reasons (cont'd)

Failure reason: Lack of expertise
Mitigation: Ensure a deep understanding of the business; acquire or grow data science expertise (including statistical, actuarial and specialized programming skills)

Failure reason: Analytics not integrated into the business process
Mitigation: Don't stop at collecting and analyzing data; incorporate analytics into the business process; measure the results (success of analytics) and adjust models as needed

Failure reason: Lack of solid project management
Mitigation: Keep scope realistic (don't promise too much); ensure enough support and resources from the business; manage business analytics efforts as projects

Source: Textbook Chapter 15
