Unit 2


Data Quality Tool - Introduction

• In the digital era, the availability of big data enables new-generation industries to design novel business models and automate their business operations.
• It also helps them invent new technological solutions, which generate new business opportunities. Big data is generated from sensors, machines, social media, websites, and e-commerce portals.
• Data is the new oil and an asset to any organization, and there are attempts to monetize it. There are bound to be variations and inconsistencies in data collected from many heterogeneous sources. A mechanism should be in place to correct anomalies at the source or after collection and to ensure high data quality.
What is a Data Quality Tool?
• Data Quality:
• The success of any organization depends on the quality of the data collected, stored, and used for deriving insights. Quality data forms the core of any business and sits at the bottom layer of the information hierarchy. The information, analytics (knowledge), and insights (wisdom) layers are on top of the data layer, in that order.
• Data quality can be defined as the characteristic that makes the data fit for its intended use, and it can also be defined as the characteristic that makes the data represent the true picture it is supposed to project.

• The two definitions contrast in scope: the first insists on completing the day-to-day transaction, while the other aims to achieve the end-to-end purpose for which the attributes are designed.

• For example, the Employee Master in payroll contains many attributes, only a few of which are mandatory for calculating the monthly payment. If all such fields are present and correct, that is sufficient to run payroll, and this meets the first definition of data quality.
• For manpower planning, skill planning, dynamic work allocation, and effective utilization of manpower, most of the attributes must have the right quality of data, and this meets the second definition of data quality (a minimal sketch contrasting the two checks follows).
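The contrast between the two definitions can be made concrete with a small completeness check. This is only a sketch, assuming pandas; the table, its column names, and the list of payroll-mandatory fields are hypothetical, not taken from any particular system.

```python
import pandas as pd

# Hypothetical employee master; column names are illustrative only.
emp = pd.DataFrame({
    "emp_id":    [1, 2, 3],
    "basic_pay": [50000, 60000, 45000],    # mandatory for payroll
    "bank_acct": ["A1", "B2", "C3"],       # mandatory for payroll
    "skills":    ["SQL", None, None],      # needed only for planning
    "location":  ["Pune", "Delhi", None],  # needed only for planning
})

payroll_fields = ["emp_id", "basic_pay", "bank_acct"]

# Definition 1 (fit for intended use): are the payroll-critical fields complete?
fit_for_payroll = emp[payroll_fields].notna().all(axis=None)

# Definition 2 (true picture): how complete are all attributes, for planning use?
overall_completeness = emp.notna().mean().mean()

print(fit_for_payroll, round(overall_completeness, 2))
```

Here the data passes the first, transactional definition (payroll can run) while falling short of the second, end-to-end one (several planning attributes are missing).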
Importance of Data Quality
• Accurate data produces accurate analysis and dependable results, avoids wastage, and enhances the productivity and profitability of the organization.
• Reliable data gives the business an edge in competitive markets.
• It helps the system stay compliant with all local and international regulations.
• Company-wide digital transformation and cost-saving programs can be implemented when backed by adequate, reliable data.
Steps to Improve Data Quality:
• Having the right mix of people, processes, and technology, with adequate support from top management, is the first step to improving data quality.
• Install a system to measure and improve a set of quality dimensions such as uniqueness, precision, conformity, consistency, completeness, timeliness, and relevance (a minimal measurement sketch follows this list).
• Data accuracy, data validity, and data integrity are other aspects of good data quality management.
• There should always be a single source of truth for data; avoid pulling the same data from multiple sources.
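A minimal sketch of how a few of these dimensions could be measured, assuming pandas; the customers table, its columns, and the email pattern are hypothetical examples, not a prescribed rule set.

```python
import pandas as pd

# Hypothetical customer records; column names are illustrative only.
customers = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "email":   ["a@x.com", "b@x", "b@x.com", "c@x.com"],
    "state":   ["MH", "KA", "KA", None],
})

# Uniqueness: share of rows whose key is not a repeat of an earlier row.
uniqueness = 1 - customers["cust_id"].duplicated().mean()

# Conformity: share of emails matching a simple expected pattern.
conformity = customers["email"].str.match(r"^\S+@\S+\.\S+$").mean()

# Completeness: share of non-null values in each column.
completeness = customers.notna().mean()

print(uniqueness, conformity, completeness, sep="\n")
```

In practice, such scores would be computed per dimension and tracked over time as part of the measurement system described above.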
Data Quality Tools
• Any DQ tool typically performs data cleansing, data integration, and the management of master data and metadata, adopting the guidelines of the various DQM disciplines given below:
• Data Governance
• Data Matching
• Data Profiling
• Data Quality Monitoring and Reporting
• Master Data Management (MDM)
• Customer and Product Data Management
• Data Asset Management
DQ Tools & Features
• Informatica Data Quality and MDM Solutions
  Key features: Data standardization, deduplication, validation, consolidation, and a robust MDM solution.
  Value to the users: Supports structured and unstructured data; AI features enabled.

• SAS Data Management
  Key features: Data integration and cleansing; uses the data governance and metadata management disciplines of DQ management.
  Value to the users: Handles unstructured data; AI features; graphical interfaces and a powerful wizard for effective data management.

• Experian Aperture Data Studio
  Key features: Data discovery and profiling, data monitoring, and data cleansing; works with any data.
  Value to the users: Easy-to-use DQ management tool; a workflow designer enables easy data quality monitoring.

• IBM InfoSphere QualityStage
  Key features: Data cleansing and data management; data profiling helps deep analysis of data.
  Value to the users: Machine learning enables high data accuracy.

• Cloudingo
  Key features: Data integrity and cleansing; removes duplicates and human errors.
  Value to the users: Used extensively with Salesforce; has a drag-and-drop interface.

• Talend Data Quality
  Key features: Data standardization, deduplication, and validation.
  Value to the users: Uses ML features to maintain clean data.

• Data Ladder
  Key features: Data cleansing; uses data matching and deduplication techniques for cleansing.
  Value to the users: Very high data accuracy; manages multiple databases and big data.

• SAP Data Services
  Key features: Data integration, transformation, and master data management; uses text analysis, auditing, and data profiling techniques.
  Value to the users: Handles data from multiple sources and provides reliable data for analytics.

• OpenRefine
  Key features: Data cleansing, including big data.
  Value to the users: Open-source tool; supports multiple languages.
Advantages
A data quality tool enhances the accuracy of the data:
a. while it is generated at the source,
b. as it is extracted before storage, and
c. during transformation after storage.
Its main benefits are:
• Builds confidence in the business to venture into transformation
exercise.
• Scales up revenue, profits, new business, and productivity for the
business.
• Reduces wastage, saves cost, shrinks time to market, and makes the business agile.
• Makes business digital-ready and builds a vibrant brand.
Data Cleaning

• Data cleaning is the process of editing, correcting, and structuring data within a data set so that it is generally uniform and prepared for analysis.
• This includes removing corrupt or irrelevant data and formatting it into a language that computers can understand for optimal analysis.
• There is an often repeated saying in data analysis: “Garbage in, garbage
out,” which means that, if you start with bad data (garbage), you’ll only
get “garbage” results.
• Data cleaning is often a tedious process, but it’s absolutely essential to
get top results and powerful insights from your data.
• This is powerfully elucidated with the 1-10-100 principle: It costs $1 to
prevent bad data, $10 to correct bad data, and $100 to fix a downstream
problem created by bad data.
Data Cleaning Steps & Techniques

• Remove irrelevant data:
• Take a good look at your data and get an idea of what is relevant and what you may not need. Filter out data or observations that aren’t relevant to your downstream needs.
• If you’re doing an analysis of SUV owners, for example, but your data
set contains data on Sedan owners, this information is irrelevant to
your needs and would only skew your results.
• You should also consider removing things like hashtags, URLs, emojis,
HTML tags, etc., unless they are necessarily a part of your analysis.
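A sketch of the idea, assuming pandas; the DataFrame, the vehicle_type filter, and the comment column are hypothetical stand-ins for whatever is irrelevant in your own data.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "vehicle_type": ["SUV", "Sedan", "SUV"],
    "comment": ["Love my #SUV https://example.com", "Too small", "Great mileage!"],
})

# Keep only the observations relevant to the analysis (SUV owners).
suv_owners = df[df["vehicle_type"] == "SUV"].copy()

# Strip hashtags and URLs from free text, since they are not part of this analysis.
noise = re.compile(r"(#\w+|https?://\S+)")
suv_owners["comment"] = suv_owners["comment"].str.replace(noise, "", regex=True).str.strip()

print(suv_owners)
```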
Deduplicate your data

• If you’re collecting data from multiple sources or departments, using scraped data for analysis, or receiving multiple survey or client responses, you will often end up with duplicate records.

• Duplicate records slow down analysis and require more storage. Even
more importantly, however, if you train a machine learning model on a
dataset with duplicate results, the model will likely give more weight to
the duplicates, depending on how many times they’ve been duplicated.
So they need to be removed for well-balanced results.
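A minimal deduplication sketch, assuming pandas; the responses table and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical survey responses collected from two departments.
responses = pd.DataFrame({
    "email": ["a@x.com", "A@X.COM", "b@x.com"],
    "score": [4, 4, 5],
})

# Normalize the key first so that case differences do not hide duplicates.
responses["email"] = responses["email"].str.lower()

# Drop exact duplicates, keeping the first occurrence of each record.
deduped = responses.drop_duplicates(subset=["email", "score"], keep="first")

print(deduped)
```

Near-duplicates (typos, reordered names) usually need fuzzy matching on top of this exact-match step.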
Fix structural errors

• Structural errors include things like misspellings, incongruent naming conventions, improper capitalization, incorrect word use, etc.
• These can affect analysis because, while they may be obvious to
humans, most machine learning applications wouldn’t recognize the
mistakes and your analyses would be skewed.

• For example, if you’re running an analysis on different data sets – one with a ‘women’ column and another with a ‘female’ column – you would have to standardize the column title. Similarly, things like dates, addresses, phone numbers, etc. need to be standardized so that computers can understand them.
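A sketch of such standardization, assuming pandas; the gender labels, date strings, and the mapping are hypothetical, and format="mixed" needs pandas 2.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "Women", "FEMALE", "women"],
    "joined": ["2021/03/01", "2021-04-01", "2021-05-10", "2021.06.15"],
})

# Map inconsistent capitalization and naming conventions onto one standard label.
df["gender"] = (df["gender"].str.strip().str.lower()
                  .replace({"women": "female", "woman": "female"}))

# Parse dates written in different styles into one canonical datetime type.
df["joined"] = pd.to_datetime(df["joined"], format="mixed", errors="coerce")

print(df)
```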
Deal with missing data

• Scan your data or run it through a cleaning program to locate missing cells, blank spaces in text, unanswered survey responses, etc. This could be due to incomplete data or human error.

• You’ll need to determine whether everything connected to this missing data – an entire column or row, a whole survey, etc. – should be completely discarded, individual cells entered manually, or left as is.
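A sketch of locating and handling gaps, assuming pandas; the table and the choice of a median fill are purely illustrative, and the right treatment depends on the analysis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "respondent": ["r1", "r2", "r3", "r4"],
    "age": [34, np.nan, 29, np.nan],
    "city": ["Pune", "", "Chennai", None],
})

# Treat empty strings as missing so they are counted as gaps too.
df = df.replace("", np.nan)

# Locate the gaps before deciding what to do with them.
print(df.isna().sum())

# Option 1: discard rows that are missing a critical field.
complete_rows = df.dropna(subset=["city"])

# Option 2: fill a numeric gap with a simple placeholder such as the median.
df["age"] = df["age"].fillna(df["age"].median())
```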
Filter out data outliers

• Outliers are data points that fall far outside of the norm and may skew
your analysis too far in a certain direction.
• For example, if you’re averaging a class’s test scores and one student
refuses to answer any of the questions, his/her 0% would have a big
impact on the overall average.
• In this case, you should consider deleting this data point altogether. This may give results that are actually much closer to the average.
• However, just because a number is much smaller or larger than the other
numbers you’re analyzing, doesn’t mean that the ultimate analysis will be
inaccurate. Just because an outlier exists, doesn’t mean that it shouldn’t
be considered. You’ll have to consider what kind of analysis you’re running
and what effect removing or keeping an outlier will have on your results.
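One common way to flag such points is the interquartile-range rule, sketched below with pandas; the scores and the 1.5 x IQR threshold are illustrative, and whether to drop a flagged point remains a judgment call.

```python
import pandas as pd

scores = pd.Series([78, 85, 90, 66, 0, 72, 88], name="test_score")

# Flag points far outside the interquartile range (a common rule of thumb).
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)

print("Average with outliers:   ", round(scores.mean(), 2))
print("Average without outliers:", round(scores[~is_outlier].mean(), 2))
```

With these numbers, the single 0 drags the average from roughly 80 down to about 68, which is the effect described in the test-score example above.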
Validate your data

• Data validation is the final data cleaning technique used to authenticate your data and confirm that it’s high quality, consistent, and properly formatted for downstream processes.
• Do you have enough data for your needs?
• Is it uniformly formatted in a design or language that your analysis
tools can work with?
• Does your clean data immediately prove or disprove your theory
before analysis?
• Validate that your data is regularly structured and sufficiently clean for your needs. Cross-check corresponding data points and make sure nothing is missing or inaccurate.
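A minimal validation sketch, assuming pandas; the function, the thresholds, and the column names (order_date, amount) are hypothetical examples of such checks, not a standard API.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Run a few illustrative checks and return a list of the problems found."""
    problems = []
    if len(df) < 100:                                # enough data for the analysis?
        problems.append("fewer than 100 rows")
    if not pd.api.types.is_datetime64_any_dtype(df["order_date"]):  # uniform format?
        problems.append("order_date is not a datetime column")
    if df["amount"].isna().any():                    # nothing missing?
        problems.append("amount has missing values")
    if (df["amount"] < 0).any():                     # values in a sensible range?
        problems.append("amount contains negative values")
    return problems
```

An empty list means the data passed these checks; anything else points back to an earlier cleaning step.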
Data Pollution
• Digital information assembles every possible compilation of facts, but
perhaps the most treasured content is personal data.
• Digital platforms are learning who and where people are at any given
time, what they did in the past and how they plan their future, what and
who they like, and how their decisions could be influenced.
• The widespread aggregation of such personal data creates new
personalized, social environments with enormous private and social
benefits.
• The concept of data pollution invites us to expand the focus and
examine the ways that the collection of personal data affects institutions
and groups of people—beyond those whose data are taken, and apart
from the harm to their privacy. Facebook’s data practices lucidly
illustrated the impact of data sharing on an ecosystem as a whole.
• Data sharing also pollutes in other, more concrete, ways. When
people allow websites to collect information about their emails, social
networks, and even DNA, they provide information about other
individuals who are not party to these transactions. In personalized
environments, the experience of each individual depends in part on
the data shared about others.

• Data Velocity: data velocity refers to the speed with which data is generated. High-velocity data is generated at such a pace that it requires distinct (distributed) processing techniques. Examples of data generated with high velocity are Twitter messages and Facebook posts.
Cyclicity of data
Data Quality
• As an IT professional, you have heard of data accuracy quite often.
Accuracy is associated with a data element. Consider an entity such as
customer.
• The customer entity has attributes such as customer name, customer
address, customer state, customer lifestyle and so on.
• Each occurrence of the customer entity refers to a single customer. Data accuracy, as it relates to the attributes of the customer entity, means that the values of the attributes of a single occurrence accurately describe that particular customer. The value of the customer name for a single occurrence of the customer entity is actually the name of that customer. Data quality implies data accuracy, but it is much more than that.
• Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole. It is more than the data edits on individual fields. For example, while entering data about customers in an order entry application, you may also collect the demographics of each customer. The customer demographics are not germane to the order entry application and, therefore, they are not given too much attention.
