Unit 2


Data Quality Tool - Introduction

• In the digital era, the availability of big data enables new-generation industries to design novel business models and automate their business operations.
• It also helps them invent new technological solutions, which generate new business opportunities. Big data is generated from sensors, machines, social media, websites, and e-commerce portals.
• Data is the new oil and an asset to any organization, and there are attempts to monetize it. There are bound to be variations and inconsistencies in data collected from many heterogeneous sources. A mechanism should be in place to correct anomalies at the source or after collection and to ensure high data quality.
What is a Data Quality Tool?
• Data Quality:
• The success of any organization depends on the quality of the data collected, stored, and used for deriving insights. Quality data forms the core of any business and sits at the bottom layer of the information hierarchy. The information, analytics (knowledge), and insights (wisdom) layers are on top of the data layer, in that order.
• Data quality can be defined as the characteristic that makes the data fit for its intended use, and it can also be defined as the characteristic that makes the data represent the true picture it is supposed to project.

• The two definitions contrast in scope: the first insists on completing the day-to-day transaction, while the other aims to achieve the end-to-end purpose for which the attributes are designed.

• For example, the Employee Master in payroll contains many attributes, only a few of which are mandatory for calculating the monthly payment. If all such fields are present and correct, that is sufficient to run payroll, and this meets the first definition of data quality.
• For manpower planning, skill planning, dynamic work allocation, and effective utilization of manpower, most of the attributes must have the right quality of data, and this meets the second definition of data quality (a minimal sketch contrasting the two checks follows).
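The contrast between the two definitions can be made concrete with a small completeness check. This is only a sketch, assuming pandas; the table, its column names, and the list of payroll-mandatory fields are hypothetical, not taken from any particular system.

```python
import pandas as pd

# Hypothetical employee master; column names are illustrative only.
emp = pd.DataFrame({
    "emp_id":    [1, 2, 3],
    "basic_pay": [50000, 60000, 45000],    # mandatory for payroll
    "bank_acct": ["A1", "B2", "C3"],       # mandatory for payroll
    "skills":    ["SQL", None, None],      # needed only for planning
    "location":  ["Pune", "Delhi", None],  # needed only for planning
})

payroll_fields = ["emp_id", "basic_pay", "bank_acct"]

# Definition 1 (fit for intended use): are the payroll-critical fields complete?
fit_for_payroll = emp[payroll_fields].notna().all(axis=None)

# Definition 2 (true picture): how complete are all attributes, for planning use?
overall_completeness = emp.notna().mean().mean()

print(fit_for_payroll, round(overall_completeness, 2))
```

Here the data passes the first, transactional definition (payroll can run) while falling short of the second, end-to-end one (several planning attributes are missing).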
Importance of Data Quality
• Accurate data produces accurate analysis and dependable results, avoids wastage, and enhances the productivity and profitability of the organization.
• Reliable data gives the business an edge in competitive markets.
• It helps the system stay compliant with all local and international regulations.
• Company-wide digital transformation and cost-saving programs can be implemented when backed by adequate, reliable data.
Steps to Improve Data Quality:
• Having the right mix of people, processes, and technology, with adequate support from top management, is the first step to improving data quality.
• Install a system to measure and improve a set of quality dimensions such as uniqueness, precision, conformity, consistency, completeness, timeliness, and relevance (a minimal measurement sketch follows this list).
• Data accuracy, data validity, and data integrity are other aspects of good data quality management.
• There should always be a single source of truth for data; avoid pulling the same data from multiple sources.
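A minimal sketch of how a few of these dimensions could be measured, assuming pandas; the customers table, its columns, and the email pattern are hypothetical examples, not a prescribed rule set.

```python
import pandas as pd

# Hypothetical customer records; column names are illustrative only.
customers = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "email":   ["a@x.com", "b@x", "b@x.com", "c@x.com"],
    "state":   ["MH", "KA", "KA", None],
})

# Uniqueness: share of rows whose key is not a repeat of an earlier row.
uniqueness = 1 - customers["cust_id"].duplicated().mean()

# Conformity: share of emails matching a simple expected pattern.
conformity = customers["email"].str.match(r"^\S+@\S+\.\S+$").mean()

# Completeness: share of non-null values in each column.
completeness = customers.notna().mean()

print(uniqueness, conformity, completeness, sep="\n")
```

In practice, such scores would be computed per dimension and tracked over time as part of the measurement system described above.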
Data Quality Tools
• Any DQ tool typically performs data cleansing, data integration, and the management of master data and metadata, adopting the guidelines of the various DQM disciplines given below:
• Data Governance
• Data Matching
• Data Profiling
• Data Quality Monitoring and Reporting
• Master Data Management (MDM)
• Customer and Product Data Management
• Data Asset Management
DQ Tools & Features
• Informatica Data Quality and MDM Solutions
  Key features: Data standardization, deduplication, validation, consolidation, and a robust MDM solution.
  Value to the users: Supports structured and unstructured data; AI features enabled.

• SAS Data Management
  Key features: Data integration and cleansing; uses the data governance and metadata management disciplines of DQ management.
  Value to the users: Handles unstructured data; AI features; graphical interfaces and a powerful wizard for effective data management.

• Experian Aperture Data Studio
  Key features: Data discovery and profiling, data monitoring, and data cleansing; works with any data.
  Value to the users: Easy-to-use DQ management tool; a workflow designer enables easy data quality monitoring.

• IBM InfoSphere QualityStage
  Key features: Data cleansing and data management; data profiling helps deep analysis of data.
  Value to the users: Machine learning enables high data accuracy.

• Cloudingo
  Key features: Data integrity and cleansing; removes duplicates and human errors.
  Value to the users: Used extensively with Salesforce; has a drag-and-drop interface.

• Talend Data Quality
  Key features: Data standardization, deduplication, and validation.
  Value to the users: Uses ML features to maintain clean data.

• Data Ladder
  Key features: Data cleansing; uses data matching and deduplication techniques for cleansing.
  Value to the users: Very high data accuracy; manages multiple databases and big data.

• SAP Data Services
  Key features: Data integration, transformation, and master data management; uses text analysis, auditing, and data profiling techniques.
  Value to the users: Handles data from multiple sources and provides reliable data for analytics.

• OpenRefine
  Key features: Data cleansing, including big data.
  Value to the users: Open-source tool; supports multiple languages.
Advantages
A data quality tool enhances the accuracy of the data:
a. while it is generated at the source,
b. as it is extracted before storage, and
c. during transformation after storage.
Its main benefits are:
• Builds confidence in the business to venture into transformation
exercise.
• Scales up revenue, profits, new business, and productivity for the
business.
• Reduces wastage, saves cost, shrinks time to market, and makes the business agile.
• Makes business digital-ready and builds a vibrant brand.
Data Cleaning

• Data cleaning is the process of editing, correcting, and structuring data within a data set so that it is generally uniform and prepared for analysis.
• This includes removing corrupt or irrelevant data and formatting it into a language that computers can understand for optimal analysis.
• There is an often repeated saying in data analysis: “Garbage in, garbage
out,” which means that, if you start with bad data (garbage), you’ll only
get “garbage” results.
• Data cleaning is often a tedious process, but it’s absolutely essential to
get top results and powerful insights from your data.
• This is powerfully elucidated with the 1-10-100 principle: It costs $1 to
prevent bad data, $10 to correct bad data, and $100 to fix a downstream
problem created by bad data.
Data Cleaning Steps & Techniques

• Remove irrelevant data:
• Take a good look at your data and get an idea of what is relevant and what you may not need. Filter out data or observations that aren’t relevant to your downstream needs.
• If you’re doing an analysis of SUV owners, for example, but your data
set contains data on Sedan owners, this information is irrelevant to
your needs and would only skew your results.
• You should also consider removing things like hashtags, URLs, emojis,
HTML tags, etc., unless they are necessarily a part of your analysis.
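A sketch of the idea, assuming pandas; the DataFrame, the vehicle_type filter, and the comment column are hypothetical stand-ins for whatever is irrelevant in your own data.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "vehicle_type": ["SUV", "Sedan", "SUV"],
    "comment": ["Love my #SUV https://example.com", "Too small", "Great mileage!"],
})

# Keep only the observations relevant to the analysis (SUV owners).
suv_owners = df[df["vehicle_type"] == "SUV"].copy()

# Strip hashtags and URLs from free text, since they are not part of this analysis.
noise = re.compile(r"(#\w+|https?://\S+)")
suv_owners["comment"] = suv_owners["comment"].str.replace(noise, "", regex=True).str.strip()

print(suv_owners)
```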
Deduplicate your data

• If you’re collecting data from multiple sources or departments, using scraped data for analysis, or receiving multiple survey or client responses, you will often end up with duplicate records.

• Duplicate records slow down analysis and require more storage. Even
more importantly, however, if you train a machine learning model on a
dataset with duplicate results, the model will likely give more weight to
the duplicates, depending on how many times they’ve been duplicated.
So they need to be removed for well-balanced results.
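A minimal deduplication sketch, assuming pandas; the responses table and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical survey responses collected from two departments.
responses = pd.DataFrame({
    "email": ["a@x.com", "A@X.COM", "b@x.com"],
    "score": [4, 4, 5],
})

# Normalize the key first so that case differences do not hide duplicates.
responses["email"] = responses["email"].str.lower()

# Drop exact duplicates, keeping the first occurrence of each record.
deduped = responses.drop_duplicates(subset=["email", "score"], keep="first")

print(deduped)
```

Near-duplicates (typos, reordered names) usually need fuzzy matching on top of this exact-match step.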
Fix structural errors

• Structural errors include things like misspellings, incongruent naming conventions, improper capitalization, incorrect word use, etc.
• These can affect analysis because, while they may be obvious to
humans, most machine learning applications wouldn’t recognize the
mistakes and your analyses would be skewed.

• For example, if you’re running an analysis on different data sets – one with a ‘women’ column and another with a ‘female’ column – you would have to standardize the column title. Similarly, things like dates, addresses, phone numbers, etc. need to be standardized so that computers can understand them.
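A sketch of such standardization, assuming pandas; the gender labels, date strings, and the mapping are hypothetical, and format="mixed" needs pandas 2.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "Women", "FEMALE", "women"],
    "joined": ["2021/03/01", "2021-04-01", "2021-05-10", "2021.06.15"],
})

# Map inconsistent capitalization and naming conventions onto one standard label.
df["gender"] = (df["gender"].str.strip().str.lower()
                  .replace({"women": "female", "woman": "female"}))

# Parse dates written in different styles into one canonical datetime type.
df["joined"] = pd.to_datetime(df["joined"], format="mixed", errors="coerce")

print(df)
```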
Deal with missing data

• Scan your data or run it through a cleaning program to locate missing cells, blank spaces in text, unanswered survey responses, etc. This could be due to incomplete data or human error.

• You’ll need to determine whether everything connected to this missing data – an entire column or row, a whole survey, etc. – should be completely discarded, individual cells entered manually, or left as is.
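A sketch of locating and handling gaps, assuming pandas; the table and the choice of a median fill are purely illustrative, and the right treatment depends on the analysis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "respondent": ["r1", "r2", "r3", "r4"],
    "age": [34, np.nan, 29, np.nan],
    "city": ["Pune", "", "Chennai", None],
})

# Treat empty strings as missing so they are counted as gaps too.
df = df.replace("", np.nan)

# Locate the gaps before deciding what to do with them.
print(df.isna().sum())

# Option 1: discard rows that are missing a critical field.
complete_rows = df.dropna(subset=["city"])

# Option 2: fill a numeric gap with a simple placeholder such as the median.
df["age"] = df["age"].fillna(df["age"].median())
```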
Filter out data outliers

• Outliers are data points that fall far outside of the norm and may skew
your analysis too far in a certain direction.
• For example, if you’re averaging a class’s test scores and one student
refuses to answer any of the questions, his/her 0% would have a big
impact on the overall average.
• In this case, you should consider deleting this data point altogether. This may give results that are actually much closer to the average.
• However, just because a number is much smaller or larger than the other
numbers you’re analyzing, doesn’t mean that the ultimate analysis will be
inaccurate. Just because an outlier exists, doesn’t mean that it shouldn’t
be considered. You’ll have to consider what kind of analysis you’re running
and what effect removing or keeping an outlier will have on your results.
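One common way to flag such points is the interquartile-range rule, sketched below with pandas; the scores and the 1.5 x IQR threshold are illustrative, and whether to drop a flagged point remains a judgment call.

```python
import pandas as pd

scores = pd.Series([78, 85, 90, 66, 0, 72, 88], name="test_score")

# Flag points far outside the interquartile range (a common rule of thumb).
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)

print("Average with outliers:   ", round(scores.mean(), 2))
print("Average without outliers:", round(scores[~is_outlier].mean(), 2))
```

With these numbers, the single 0 drags the average from roughly 80 down to about 68, which is the effect described in the test-score example above.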
Validate your data

• Data validation is the final data cleaning technique used to authenticate your data and confirm that it’s high quality, consistent, and properly formatted for downstream processes.
• Do you have enough data for your needs?
• Is it uniformly formatted in a design or language that your analysis
tools can work with?
• Does your clean data immediately prove or disprove your theory
before analysis?
• Validate that your data is regularly structured and sufficiently clean for your needs. Cross-check corresponding data points and make sure nothing is missing or inaccurate.
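A minimal validation sketch, assuming pandas; the function, the thresholds, and the column names (order_date, amount) are hypothetical examples of such checks, not a standard API.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Run a few illustrative checks and return a list of the problems found."""
    problems = []
    if len(df) < 100:                                # enough data for the analysis?
        problems.append("fewer than 100 rows")
    if not pd.api.types.is_datetime64_any_dtype(df["order_date"]):  # uniform format?
        problems.append("order_date is not a datetime column")
    if df["amount"].isna().any():                    # nothing missing?
        problems.append("amount has missing values")
    if (df["amount"] < 0).any():                     # values in a sensible range?
        problems.append("amount contains negative values")
    return problems
```

An empty list means the data passed these checks; anything else points back to an earlier cleaning step.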
Data Pollution
• Digital information assembles every possible compilation of facts, but
perhaps the most treasured content is personal data.
• Digital platforms are learning who and where people are at any given
time, what they did in the past and how they plan their future, what and
who they like, and how their decisions could be influenced.
• The widespread aggregation of such personal data creates new
personalized, social environments with enormous private and social
benefits.
• The concept of data pollution invites us to expand the focus and
examine the ways that the collection of personal data affects institutions
and groups of people—beyond those whose data are taken, and apart
from the harm to their privacy. Facebook’s data practices lucidly
illustrated the impact of data sharing on an ecosystem as a whole.
• Data sharing also pollutes in other, more concrete, ways. When
people allow websites to collect information about their emails, social
networks, and even DNA, they provide information about other
individuals who are not party to these transactions. In personalized
environments, the experience of each individual depends in part on
the data shared about others.

• Data Velocity: data velocity refers to the speed with which data is generated. High-velocity data is generated at such a pace that it requires distinct (distributed) processing techniques. Examples of data generated with high velocity are Twitter messages and Facebook posts.
Cyclicity of data
Data Quality
• As an IT professional, you have heard of data accuracy quite often.
Accuracy is associated with a data element. Consider an entity such as
customer.
• The customer entity has attributes such as customer name, customer
address, customer state, customer lifestyle and so on.
• Each occurrence of the customer entity refers to a single customer. Data accuracy, as it relates to the attributes of the customer entity, means that the values of the attributes of a single occurrence accurately describe that particular customer. The value of the customer name for a single occurrence of the customer entity is actually the name of that customer. Data quality implies data accuracy, but it is much more than that.
• Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole. It is more than the data edits on individual fields. For example, while entering data about customers in an order entry application, you may also collect the demographics of each customer. The customer demographics are not germane to the order entry application and, therefore, they are not given too much attention.
