Business Data Analytics Part 3

Part 3. Source data

Tasks:

1/ Plan Data Collection
2/ Determine the Data Sets
3/ Collect Data
4/ Validate Data
Plan data collection

Planning considerations:
❖ what data is needed
❖ the availability of the data
❖ the need for historical data
❖ determining when and how the data will be collected
❖ how the data will be validated once collected
What is the difference between structured and unstructured data?

Structured data is data that is organized, well-thought-out, and formatted, such as data residing in a database management system (DBMS). Structured data is easily accessed by initiating a query in a query language such as SQL (Structured Query Language).

Unstructured data is the exact opposite of structured data, as it exists outside of any organized repository like a database. Unstructured data takes on many forms and sources, such as text from word-processing documents, emails, social media sites, and image, audio, or video files.
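As a minimal sketch of how structured data is queried, the snippet below creates a small in-memory SQLite table and retrieves rows with SQL; the table and column names are hypothetical, chosen only for illustration.

```python
import sqlite3

# Structured data lives in a DBMS and is accessed with SQL queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers (name, city) VALUES ('Ada', 'Melbourne'), ('Bob', 'Sydney')")

# A simple query: which customers are in Sydney?
rows = conn.execute("SELECT name FROM customers WHERE city = 'Sydney'").fetchall()
print(rows)  # [('Bob',)]
```

Unstructured data (free text, images, audio) has no such schema, which is why it cannot be queried this way without prior processing.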
Case study: sourcing data
Determine the data sets

A five Vs assessment helps to determine which datasets to consider:
❖ Volume
❖ Velocity
❖ Variety
❖ Veracity
❖ Value
Technique: Data modelling

Data models describe business entities and the relationships between them.

Customers: Name & Surname, City, Postal code, Phone number
Orders: Customer name, Customer address, Product names, Product prices, Quantities, Delivery price, Total price
Products: Product name, Product description, Product price
Normalization is a technique for organizing
data in a database. It is important that a
database is normalized to minimize
redundancy (duplicate data) and to ensure
only related data is stored in each table. It
also prevents any issues resulting from
database modifications such as insertions,
deletions, and updates.
First Normal Form

● Data is stored in tables with rows uniquely identified by a primary key
● Data within each table is stored in individual columns in its most reduced form
● There are no repeating groups
First Normal Form

Customers: Name & Surname, City, Postal code, Phone number
Orders: Customer name, Product names, Product prices, Quantities, Shipping address, Shipping price, Total price
Products: Product name, Product description, Product price
First Normal Form

Customers: Customer ID, Name & Surname, City, Postal code, Phone number
Orders: Order ID, Customer name, Product names, Product prices, Quantities, Shipping address, Shipping price, Total price
Products: Product ID, Product name, Product description, Product price
First Normal Form

Customers: Customer ID, Name, Surname, City, Postal code, Phone number
Orders: Order ID, Customer name, Product names, Product prices, Quantities, Shipping address, Shipping price, Total price
Products: Product ID, Product name, Product description, Product price
Relationships

Person (1) : (*) Property (one person can own many properties)

Person: ID, Name
Property: Owner_ID, Address
First Normal Form

Customers: Customer ID, Name, Surname, City, Postal code, Phone number
Orders: Order ID, Customer name, Shipping address, Shipping price, Total price
Products: Product ID, Product name, Product description, Product price
Order-to-products: Order ID, Product ID, Quantity

Relationships: Orders (1) : (*) Order-to-products (*) : (1) Products
Second Normal Form

● Everything from 1NF
● Only data that relates to a table’s primary key is stored in each table
Second Normal Form

Customers: Customer ID, Name, Surname, City, Postal code, Phone number
Orders: Order ID, Customer name, Shipping address, Shipping price, Total price
Products: Product ID, Product name, Product description, Product price
Order-to-products: Order ID, Product ID, Quantity

Relationships: Orders (1) : (*) Order-to-products (*) : (1) Products
Second Normal Form

Customers: Customer ID, Name, Surname, City, Postal code, Phone number
Orders: Order ID, Customer ID, Shipping address, Shipping price, Total price
Products: Product ID, Product name, Product description, Product price
Order-to-products: Order ID, Product ID, Quantity

Relationships: Customers (1) : (*) Orders; Orders (1) : (*) Order-to-products (*) : (1) Products
Third Normal Form

● Everything from 2NF
● There are no in-table dependencies between the columns in each table
Third Normal Form

Customers: Customer ID, Name, Surname, City, Postal code, Phone number
Orders: Order ID, Customer ID, Shipping address, Shipping price, Total price
Products: Product ID, Product name, Product description, Product price
Order-to-products: Order ID, Product ID, Quantity

Relationships: Customers (1) : (*) Orders; Orders (1) : (*) Order-to-products (*) : (1) Products
Third Normal Form

Customers: Customer ID, Name, Surname, Postal code, Phone number
Cities: Postal code, City
Orders: Order ID, Customer ID, Shipping address, Shipping price, Total price
Products: Product ID, Product name, Product description, Product price
Order-to-products: Order ID, Product ID, Quantity

Relationships: Cities (1) : (*) Customers; Customers (1) : (*) Orders; Orders (1) : (*) Order-to-products (*) : (1) Products
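The normalized model above can be sketched as a SQLite schema. This is one possible rendering of the tables and relationships, not a definitive implementation; all identifiers and types are assumptions.

```python
import sqlite3

# A sketch of the 3NF schema: each table holds only data related to its
# primary key, and the junction table resolves the many-to-many between
# orders and products.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (
    postal_code TEXT PRIMARY KEY,
    city        TEXT
);
CREATE TABLE customers (
    customer_id  INTEGER PRIMARY KEY,
    name         TEXT,
    surname      TEXT,
    phone_number TEXT,
    postal_code  TEXT REFERENCES cities(postal_code)
);
CREATE TABLE products (
    product_id          INTEGER PRIMARY KEY,
    product_name        TEXT,
    product_description TEXT,
    product_price       REAL
);
CREATE TABLE orders (
    order_id         INTEGER PRIMARY KEY,
    customer_id      INTEGER REFERENCES customers(customer_id),
    shipping_address TEXT,
    shipping_price   REAL,
    total_price      REAL
);
CREATE TABLE order_to_products (
    order_id   INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES products(product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
""")
```

Because City depends on Postal code rather than on Customer ID, it lives in its own table and is reached via a join, which is exactly the in-table dependency 3NF removes.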
Technique: Data mapping
Data mapping is used to consolidate data
from one or more sources to a destination to
create a meaningful set of data with a
standardized format.
— Guide to Business Data Analytics, IIBA
Data mapping is used to support:

● Data migration
● Data integration

Example source-to-target mapping:

Source (Users): First name, Last name, Address, Phone number
Target (Customers): Name, Address, Mobile number, Home number
Analysing source and target

The repository providing the original data is referred to as the source. When analyzing the source, consider the:
● format
● attributes of interest or potential interest
● data type and data size of the attributes

The repository receiving the data is referred to as the target. When analyzing the target, consider the:
● format
● new attributes that need to be created
● source attributes that need to be transformed
● creation of new custom fields
Mapping considerations:
❖ Which attributes will be migrated
❖ Which new attributes need to be created in the target repository
❖ Data size
❖ New custom-defined attributes that need manipulation or calculation

Target: Customers.Name | Source: Users.First_Name, Users.Last_Name | Rule: Concatenate with space
Target: Customers.Address | Source: Users.Address | Rule: N/A
Target: Customers.Mobile | Source: Users.Phone | Rule: If starts with 04
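The mapping rules above can be sketched as a small transformation function. The record layout and the "04 means mobile" rule follow the hypothetical Users-to-Customers example, not any real system.

```python
# A minimal sketch of applying a source-to-target data mapping.
def map_user_to_customer(user):
    customer = {
        # Rule: concatenate first and last name with a space.
        "Name": f"{user['First_Name']} {user['Last_Name']}",
        # Rule: copied as-is (N/A).
        "Address": user["Address"],
    }
    phone = user["Phone"]
    if phone.startswith("04"):
        # Rule: numbers starting with 04 map to the mobile field.
        customer["Mobile"] = phone
    else:
        customer["Home"] = phone
    return customer

print(map_user_to_customer(
    {"First_Name": "Ada", "Last_Name": "Smith",
     "Address": "1 Main St", "Phone": "0412345678"}
))
```

Real mapping tools express the same idea declaratively, but each row of a mapping table ultimately compiles down to a transformation like this.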
Usage considerations

Strengths:
● Provides a meticulous approach
● Provides data traceability
● Enables creation of a standardized, business-focused data repository
● Helps identify data quality issues

Limitations:
● Requires careful attention to detail
● Data mapping can be time-consuming
● Needs updating as soon as changes are made in either the source or the target
Technique: Data dictionary
The data dictionary is used to collate and
standardize references to data elements
across initiatives or at an organizational level.

— Guide to Business Data Analytics, IIBA


Owner | Term | Data type | Example | Source of truth
Marketing team | Total marketing budget | Currency per time | $40000 per month | Martech system
Digital team | Conversion rate | Percentage | 4.5% | Google Analytics
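A data dictionary entry is just a small structured record. The sketch below mirrors the example table with a Python dataclass; the field set is an assumption based on that table's columns.

```python
from dataclasses import dataclass

# One entry per defined term, mirroring the example table's columns.
@dataclass
class DictionaryEntry:
    owner: str
    term: str
    data_type: str
    example: str
    source_of_truth: str

entries = [
    DictionaryEntry("Marketing team", "Total marketing budget",
                    "Currency per time", "$40000 per month", "Martech system"),
    DictionaryEntry("Digital team", "Conversion rate",
                    "Percentage", "4.5%", "Google Analytics"),
]
```

Keeping the owner and source of truth with each term is what makes the dictionary usable for resolving conflicts later in the creation process.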
Data dictionary creation process:
❖ Collect terms
❖ Define terms
❖ Identify conflicts
❖ Get alignment
❖ Get sign-off
❖ Publish
❖ Maintain
Collect data

Collecting data involves the activities performed to help with data setup, preparation, and collection.

Passive data collection: unobtrusive data collection from users in their day-to-day transactions with the organization. This type of data is available without an analytics objective in mind, and a large portion of such data may already exist within the organization.

Active data collection: actively seeking information from stakeholders for a specific goal. This type of data is not readily available within the organization (and requires e.g. surveys or self-reports).
Before data professionals begin collecting
large amounts of data, it may be necessary to
test the data collection approach by using a
small number of observations.
Technique: Extract, Transform, and Load (ETL)

Core principles of ETL:

1/ Identify high-quality data from a variety of sources
2/ Transition this data to a target repository, creating a "single source of truth"
3/ Provide easy access
Extract

1) Identify data sources and types
2) Create universal classification
3) Verify data integrity

Transform

Translate extracted data to a usable and accurate format. Ensure the data follows sound business logic.
Load

Transition the transformed data to a target repository.

1) Review target data format
2) Perform data load
3) Generate audit trails
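The extract, transform, and load steps can be sketched end-to-end in a few lines. This assumes an in-memory list of raw records as the source and SQLite as the target repository; all names and the business rule are illustrative.

```python
import sqlite3

# Hypothetical raw source records, as they might arrive from a file or API.
raw_rows = [
    {"name": " Ada Smith ", "amount": "120.50"},
    {"name": "Bob Jones", "amount": "80"},
]

def extract():
    # Extract: in a real pipeline this would read files, APIs, or databases.
    return raw_rows

def transform(rows):
    # Transform: normalize formats and enforce simple business logic.
    out = []
    for row in rows:
        amount = float(row["amount"])      # usable, accurate format
        if amount < 0:                     # illustrative rule: no negative amounts
            continue
        out.append({"name": row["name"].strip(), "amount": amount})
    return out

def load(rows, conn):
    # Load: transition the transformed data to the target repository.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())  # (2, 200.5)
```

Production ETL tools add scheduling, audit trails, and connectors, but the pipeline shape is the same three stages.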
Usage considerations

Strengths:
● Provides a well-established process
● Many of the ETL tools provide a graphical view of the operation and connections to enterprise data sources
● Easy maintenance of data by maintaining automatic traces to data sources and audit trails
● Serves as a robust pre-analysis prior to modelling activities

Limitations:
● Heterogeneous and streaming data are difficult to process through traditional ETL tools and technologies
● High-volume data movement through ETL requires a lot of planning and effort to maintain consistency
● ETL is not suitable for near real-time interventions through algorithmic decision-making
Technique: Data flows

Data flow diagrams show where data comes from, which activities process the data, and whether the output results are stored or utilized by another activity or external entity.

— A Guide to the Business Analysis Body of Knowledge®, IIBA
Notations: Yourdon; Gane and Sarson

Externals
a person, organization, automated system, or any device capable of producing or receiving data

Data store
a collection of data where data may be read repeatedly and where it can be stored for future use

Process
a manual or automated activity performed for a business reason

Data flow
the movement of data between externals, processes, and data stores
Example data flow diagram: a Customer sends an Order to the Ordering process, which returns a Bill. Ordering reads Inventory details from the Inventory data store and writes Order details to the Order storage data store. A Reporting process reads Inventory details and Order details and delivers Aggregated data to the Manager.
Validate data

Validating data involves assessing that the planned data sources can and should be used and, when accessed, that the data obtained provide the types of results expected.
Characteristics of high-quality data:

1/ Accuracy

2/ Completeness

3/ Consistency

4/ Uniqueness

5/ Timeliness
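Several of these characteristics can be checked mechanically. The sketch below runs simple checks against a hypothetical record set; timeliness and consistency across sources need business context, so only the structural checks are shown.

```python
# Hypothetical records to validate; field names are illustrative.
records = [
    {"id": 1, "email": "ada@example.com", "age": 36},
    {"id": 2, "email": "bob@example.com", "age": 41},
]

# Uniqueness: no duplicate primary identifiers.
ids = [r["id"] for r in records]
assert len(ids) == len(set(ids)), "duplicate IDs found"

# Completeness: no missing required values.
assert all(r["email"] and r["age"] is not None for r in records), "missing values"

# Accuracy (plausibility): values fall within an expected range.
assert all(0 <= r["age"] <= 130 for r in records), "implausible age"

print("checks passed")
```

Checks like these are a natural first pass during data validation, before stakeholders perform business validation of the sources themselves.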
Two types of validation

Data validation may be performed by a data analyst, data scientist, or business analysis practitioner with sufficient skills to use the necessary tools to access data and the competencies to analyze the results.

Business validation is performed by key stakeholders with the authority to approve data sources for use in analytics initiatives and the underlying knowledge to assess data accuracy.