
11/5/15

Data Profiling
Helena Galhardas
DEI/IST

References

Slides from the Data Profiling course, Felix Naumann, Trento, July 2015
Z. Abedjan, L. Golab, F. Naumann: Profiling Relational Data: A Survey. VLDB Journal, 2015
T. Papenbrock et al.: Data Profiling with Metanome. Demo paper, VLDB 2015


Definition: Data Profiling

"Data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data." (Wikipedia, 09/2013)

"Data profiling refers to the activity of creating small but informative summaries of a database." (Ted Johnson, Encyclopedia of Database Systems)

Data profiling is the set of activities and


processes to determine the metadata about a
given dataset.

Profiling in Spreadsheets

Felix Naumann | Data Profiling | Trento 2015


[Spreadsheet screenshots with callouts: column labels, number of rows]


Many interesting questions remain

What are the possible primary keys and foreign keys?
  Phone
  firstname, lastname, street
Are there any functional dependencies?
  zip -> city
  race -> voting behavior
Which columns correlate?
  Date-of-Birth and first name
  State and last name
What are frequent patterns in a column?
  ddddd
  dd aaaa St
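A minimal sketch of how such column patterns can be found: abstract every character of a value (digits to 'd', letters to 'a') and count the resulting patterns. The function names are illustrative, not taken from any specific tool.

```python
from collections import Counter

def value_pattern(value: str) -> str:
    # Abstract each character: digit -> 'd', letter -> 'a', keep punctuation/spaces
    return "".join("d" if c.isdigit() else "a" if c.isalpha() else c
                   for c in value)

def frequent_patterns(values, top=3):
    # Count the abstracted patterns and return the most frequent ones
    return Counter(value_pattern(v) for v in values).most_common(top)

zips = ["12345", "98765", "1049-001"]
print(frequent_patterns(zips))  # [('ddddd', 2), ('dddd-ddd', 1)]
```

Real tools additionally keep frequent literal tokens (such as "St") instead of abstracting them away.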


Results of data profiling


Encompasses several methods to examine
datasets and produce metadata
Simple results to compute:
Number of null and distinct values in a column
Data type of a column
Most frequent patterns of data values in a column
More difficult results to compute involve several
columns:
Inclusion dependencies
Functional dependencies, etc
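The simple single-column results above can be computed in one pass over the data; a sketch (the type inference here is deliberately crude, and the function name is illustrative):

```python
def profile_column(values):
    # Single-column metadata: row count, nulls, distinct values,
    # and a crude basic-type guess (int / float / string)
    non_null = [v for v in values if v is not None]

    def basic_type(v):
        for cast, name in ((int, "int"), (float, "float")):
            try:
                cast(v)
                return name
            except ValueError:
                pass
        return "string"

    types = {basic_type(v) for v in non_null}
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "type": types.pop() if len(types) == 1 else "string",
    }

print(profile_column(["1", "2", "2", None]))
# {'rows': 4, 'nulls': 1, 'distinct': 2, 'type': 'int'}
```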


Challenges
Managing the input
Decide which profiling tasks to execute on which
parts of the data
Performing the computation
Computational complexity depends on the
number of rows, and the number of columns;
sorting is a typical operation
Managing the output
Meaningfully interpret the profiling results; usually
performed by database and domain experts

Existing technology
SQL queries and spreadsheet browsing
Dedicated tools or components
E.g., IBM Information Analyzer, Microsoft SQL Server
Integration Services, Informatica Data Explorer
Innovative ways to handle the challenges
E.g., using indexes, parallel processing
Methods to deliver approximate results
E.g., by profiling samples
Narrowing the discovery process to certain
columns or tables
E.g., verifying inclusion dependencies on user-
suggested pairs of columns



Typical data profiling procedure


1. User specifies data to be profiled and
chooses type of metadata to be generated
2. Tool computes the metadata in batch mode
(using SQL queries or specialized
algorithms)
Can last minutes or hours
3. Tool displays results in a vast collection of
tabs, tables, charts, and other visualizations
Discovered results can be translated into rules
or constraints to be enforced in a subsequent
data cleaning step


Use Cases for Data Profiling


Data cleaning
Data profiling results can be used to measure/monitor the quality of a dataset
Data exploration
To gain insight into new datasets: simple ad-hoc SQL queries return simple statistics (e.g., number of
distinct values)
Automated data profiling is required
Database management
Basic statistics gathered by a DBMS: number of values, number of non-null values, etc
Optimizer uses these statistics to estimate selectivity of operators and perform query
optimization
Database reverse engineering
To identify relations and attributes, domain semantics, foreign keys and cardinalities
Result: ER model or logical schema to assist experts in maintaining, integrating and querying
the DB
Data integration
For finding semantically correct correspondences between elements of two schemata (schema
matching)
Cross-DB inclusion dependencies suggest which tables may be combined with a join operation
Big Data analytics
Profiling as preparation and for initial insights
Important to determine which data to mine, how to import it into various tools and how to
interpret the results

Data profiling as preparation for any other data management task


Types of storage of input data


Relational database
So data profiling methods make use of SQL
queries and indexes
CSV file
Data profiling methods need to create their own
data structures in memory or on disk
Mixed approach
Data originally in the database are read once and
processed further outside the database

The type of storage for input data has an


impact on the performance of the data
profiling algorithms and tools

Data profiling vs. data mining


Data profiling gathers technical metadata to support
data management
Data mining and data analytics discover non-obvious
results to support business management with new
insights

Data profiling results: information about columns and


column sets
Data mining results: information about rows or row
sets
Clustering, summarization, association rules,
recommendation, or classification are not related to data
profiling



Outline
Data profiling tasks
Data profiling tools
Visualization




Classification of Traditional Data Profiling Tasks

Data profiling
  Single column
    Cardinalities
    Patterns and data types
    Value distributions
  Multiple columns
    Key discovery
      Uniqueness: conditional, partial
    Foreign key discovery
      Inclusion dependencies: conditional, partial
    Functional dependencies: conditional, partial

Data profiling tasks and their primary uses



Single column profiling


Analysis of individual columns in a given
table
Most basic form of data profiling
Assumption: All values are of same type
Assumption: All values have some common
properties to be discovered
Discover data types
Often part of the basic statistics gathered by DBMS
Complexity: Number of values/rows


Cardinalities
Number of values (number of rows)
Length of values in terms of characters
Number of distinct values
Number of NULLs
MIN and MAX value

Useful for
Query optimization
Categorization of attribute
Relevance of attribute



Data completeness
Finding disguised missing values (e.g., when
using web forms with fields whose values
must be chosen from pull-down lists)
  9999-999 for the zip code
  Alabama for the USA state
Methods: determine the distribution of values
and observe that disguised missing values
occur much more often than expected
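One way to implement this frequency heuristic: flag values whose count exceeds the average count of the remaining values by a large factor. The factor and function name below are illustrative choices, not from a specific tool.

```python
from collections import Counter

def suspected_disguised_missing(values, factor=5):
    # A value occurring far more often than the others may be a default
    # chosen just to get past a mandatory form field (e.g., "9999-999")
    counts = Counter(values)
    suspects = []
    for value, count in counts.items():
        others = [c for v, c in counts.items() if v != value]
        if others and count >= factor * (sum(others) / len(others)):
            suspects.append(value)
    return suspects

codes = ["9999-999"] * 20 + ["1000-001", "2000-002", "3000-003"]
print(suspected_disguised_missing(codes))  # ['9999-999']
```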


Data types and value patterns


Discovering the basic type of a column (increasing difficulty):
  String vs. number
  String vs. number vs. date
  SQL data types (CHAR, INT, DECIMAL, …)
Extracting frequent patterns observed in the
data of a column:
  Regular expressions, e.g., (\d{3})-(\d{3})-(\d{4})-(\d+)
Finding the meaning of a column (semantic domain):
  Address, phone, email, first name



Value distributions
Probability distribution for numeric values
Detect whether data follows some well-known distribution
Determine that distribution function for data values
If no specific/useful function detectable: histograms

[Figure: normal and Laplace distributions]

Histograms
Determine (and display) value frequencies for value intervals or for
individual values
Estimation of probability distribution for continuous variables

[Figure: grade distribution histogram]

Useful for
Query optimization
Outlier detection
Visualize distribution
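An equi-width histogram (fixed-size intervals) is the simplest variant; a sketch with illustrative data:

```python
def equi_width_histogram(values, num_bins=5):
    # Count how many values fall into each of num_bins equal-width intervals
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1  # avoid zero width for constant columns
    counts = [0] * num_bins
    for v in values:
        # clamp the maximum value into the last bin
        counts[min(int((v - lo) / width), num_bins - 1)] += 1
    return counts

grades = [10, 11, 12, 12, 13, 14, 15, 17, 18, 20]
print(equi_width_histogram(grades, num_bins=5))  # [2, 3, 2, 1, 2]
```

Query optimizers more often use equi-depth histograms, where the bin boundaries are chosen so each bin holds roughly the same number of values.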


Multi-column data profiling


Covers multiple columns simultaneously
Identifies inter-value dependencies and column
similarities
Identifies correlations between values through
frequent patterns or association rules
Complexity: Number of columns and number of
values


Correlations and association rules


Correlation analysis reveals related numeric
columns (e.g., salary and age in relation
Employees)
Naïve method: compute pairwise correlations
among all pairs of columns
Association rules: denote relationships or patterns
between attribute values among columns
  Ex: Employees(emp-nb, dept, position, allowance)
  {dept=finance, position=manager} -> {allowance=$1000}
Algorithms: Apriori, FP-growth
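The naïve pairwise method reduces to computing a correlation coefficient per column pair; a sketch of Pearson's r for two numeric columns (the toy data is illustrative):

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric columns
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# E.g., salary increasing linearly with age (toy values)
age = [25, 30, 35, 40, 45]
salary = [1500, 1800, 2100, 2400, 2700]
print(round(pearson(age, salary), 3))  # 1.0
```

With m columns this naïve approach computes m*(m-1)/2 coefficients, which is the scalability concern the slide alludes to.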



Clustering
To segment the records into homogeneous
groups using a clustering algorithm
Records that do not fit any cluster flagged
as outliers
May indicate data quality problems
Algorithms: K-means, for example
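A minimal illustration of the idea, assuming one numeric column: run a tiny 1-D k-means and flag records that end up in very small clusters as outlier candidates. This is a sketch; a real profiler would use a library clustering implementation and multi-dimensional data.

```python
def kmeans_1d(values, k=3, iters=20):
    # Tiny 1-D k-means; returns a cluster index per value
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda i: abs(v - centroids[i]))
                  for v in values]
        for i in range(k):
            members = [v for v, a in zip(values, assign) if a == i]
            if members:
                centroids[i] = sum(members) / len(members)
    return assign

def cluster_outliers(values, k=3, min_size=2):
    # Records in very small clusters fit no large group: outlier
    # candidates, which may indicate data quality problems
    assign = kmeans_1d(values, k)
    sizes = {i: assign.count(i) for i in set(assign)}
    return [v for v, a in zip(values, assign) if sizes[a] < min_size]

print(cluster_outliers([1, 2, 3, 4, 10, 11, 12, 13, 1000]))  # [1000]
```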


Dependencies
Metadata that describe relationships
among columns
Discovery of primary keys with the help of unique
column combinations
Discovery of foreign keys with the help of inclusion
dependencies
Functional dependencies
Complexity: Number of columns and number of
values
Several algorithms for detecting dependencies


Uniqueness and keys


Set of columns R.X that contain only unique
value combinations
(Primary) key candidate
No null values
Uniqueness and non-null in one instance do not
imply a key: only a human can specify keys
Algorithms: Gordian, DUCC, SWAN
Useful for
Schema design, data integration, indexing,
optimization
Inverse: non-uniques are duplicates
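Verifying (rather than discovering) a unique column combination is straightforward; the discovery algorithms named above search the lattice of all combinations efficiently. A verification sketch over rows represented as dicts, with illustrative data:

```python
def is_unique(rows, columns):
    # A column combination is a key candidate if no value combination
    # repeats and no NULL (None) occurs
    seen = set()
    for row in rows:
        combo = tuple(row[c] for c in columns)
        if None in combo or combo in seen:
            return False
        seen.add(combo)
    return True

people = [
    {"firstname": "Ana", "lastname": "Silva"},
    {"firstname": "Ana", "lastname": "Costa"},
]
print(is_unique(people, ["firstname"]))              # False
print(is_unique(people, ["firstname", "lastname"]))  # True
```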


Inclusion dependencies (INDs) and foreign keys (FKs)

R.A ⊆ S.B
  All values in R.A are also present in S.B
R.A1,…,R.Ai ⊆ S.B1,…,S.Bi:
  All value combinations in R.A1,…,R.Ai are also present
  in S.B1,…,S.Bi
Prerequisite for foreign keys:
  Used across relations
  Used across databases
  But again: discovery on a given instance; only the user can specify
  FKs in the schema
Algorithms for IND detection: SPIDER, BINDER
INDs useful for
  suggesting how to join two relations
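On a given instance, verifying a unary IND is a set-containment test; the expensive part is discovery over all column pairs, which algorithms such as SPIDER and BINDER address. The table and column names below are illustrative:

```python
def satisfies_ind(lhs_values, rhs_values):
    # R.A ⊆ S.B holds if every value of R.A also appears in S.B
    return set(lhs_values) <= set(rhs_values)

# Illustrative columns: Orders.customer_id vs. Customers.id
orders_customer_id = [1, 2, 2, 3]
customers_id = [1, 2, 3, 4]
print(satisfies_ind(orders_customer_id, customers_id))  # True
print(satisfies_ind([1, 5], customers_id))              # False
```

If the IND holds, a join between the two relations on these columns loses no tuples from R, which is what makes INDs foreign-key candidates.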


Functional dependencies

X -> A
  Whenever two records have the same X values, they
  also have the same A values, where X is a set of
  attributes
  E.g., street, number -> zip-code

Algorithms for detecting FDs: TANE, FUN, FD_Mine, etc.

Useful for
Schema design
Normalization
Keys
Data cleansing
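Checking whether one given FD holds on an instance (as opposed to discovering all FDs, which TANE and the other algorithms do) takes a single scan; a sketch with illustrative data:

```python
def satisfies_fd(rows, lhs, rhs):
    # X -> A holds if two records agreeing on X never disagree on A
    mapping = {}
    for row in rows:
        x = tuple(row[c] for c in lhs)
        if x in mapping and mapping[x] != row[rhs]:
            return False
        mapping[x] = row[rhs]
    return True

addresses = [
    {"street": "Rua A", "number": 1, "zip": "1000-001", "city": "Lisbon"},
    {"street": "Rua A", "number": 1, "zip": "1000-001", "city": "Lisbon"},
    {"street": "Rua B", "number": 2, "zip": "1000-002", "city": "Lisbon"},
]
print(satisfies_fd(addresses, ["street", "number"], "zip"))  # True
print(satisfies_fd(addresses, ["city"], "zip"))              # False
```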


Partial dependencies
Real datasets contain exceptions to the rule so dependencies
can be relaxed
Aka approximate dependencies: hold for a subset of records
INDs and FDs that do not perfectly hold
For all but 10 of the tuples
Only for 80% of the tuples
Only for 1% of the tuples

Also for patterns, types, uniques, and other constraints

Useful for
Data cleansing
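A common way to quantify how "partial" an FD is (sometimes called the g3 measure) is the minimum fraction of tuples that must be removed for the FD to hold exactly; within each X group, only the most frequent A value is kept. A sketch with illustrative data:

```python
from collections import Counter, defaultdict

def fd_violation_ratio(rows, lhs, rhs):
    # Minimum fraction of tuples to remove so that X -> A holds:
    # within each X group, keep only the most frequent A value
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[c] for c in lhs)][row[rhs]] += 1
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return 1 - kept / len(rows)

rows = [
    {"zip": "1000", "city": "Lisbon"},
    {"zip": "1000", "city": "Lisbon"},
    {"zip": "1000", "city": "Lisboa"},  # exception to the rule
    {"zip": "2000", "city": "Porto"},
]
print(fd_violation_ratio(rows, ["zip"], "city"))  # 0.25
```

An FD with a ratio of 0 holds exactly; "holds for 80% of the tuples" corresponds to a ratio of at most 0.2.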



Conditional dependencies
Given a partial IND or FD: for which part does it hold?
Example: conditional unique column combination
street is unique for all records with city = Lisbon
Expressed as a condition over the attributes of the
relation
Problems:
Infinite possibilities of conditions
Interestingness:
Many distinct values: less interesting
Few distinct values: surprising condition with high coverage

Useful for
Integration: cross-source cINDs
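Checking the example above (street unique for all records with city = Lisbon) is a filtered uniqueness test; a sketch with illustrative data:

```python
def conditionally_unique(rows, column, cond_col, cond_val):
    # column is conditionally unique if its values never repeat among
    # the records that satisfy the condition cond_col = cond_val
    selected = [row[column] for row in rows if row[cond_col] == cond_val]
    return len(selected) == len(set(selected))

rows = [
    {"street": "Rua A", "city": "Lisbon"},
    {"street": "Rua B", "city": "Lisbon"},
    {"street": "Rua A", "city": "Porto"},
    {"street": "Rua A", "city": "Porto"},
]
print(conditionally_unique(rows, "street", "city", "Lisbon"))  # True
print(conditionally_unique(rows, "street", "city", "Porto"))   # False
```

The discovery problem is much harder: the space of candidate conditions is unbounded, which is exactly the problem noted above.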


Outline
Data profiling tasks
Data profiling tools



Research data profiling tools


Bellman: Column statistics, column similarity, candidate
key discovery
Potter's Wheel: Column statistics (including value
patterns)
Data Auditor: CFD and CIND discovery
RuleMiner: Denial constraint discovery
MADlib: Simple column statistics
Profiler: visual data profiler tool
Metanome: in a few slides


Commercial data profiling tools


IBM InfoSphere Information Analyzer
http://www.ibm.com/software/data/infosphere/information-analyzer/
Oracle Enterprise Data Quality
http://www.oracle.com/us/products/middleware/data-integration/enterprise-data-quality/overview/index.html
Talend Data Quality
http://www.talend.com/products/data-quality
Ataccama DQ Analyzer
http://www.ataccama.com/en/products/dq-analyzer.html
SAP BusinessObjects Data Insight and SAP BusinessObjects Information Steward
http://www.sap.com/germany/solutions/sapbusinessobjects/large/eim/datainsight/index.epx
http://www.sap.com/germany/solutions/sapbusinessobjects/large/eim/information-steward/index.epx
Informatica Data Explorer
http://www.informatica.com/us/products/data-quality/data-explorer/
Microsoft SQL Server Integration Services Data Profiling Task and Viewer
http://msdn.microsoft.com/en-us/library/bb895310.aspx
Trillium Software Data Profiling
http://www.trilliumsoftware.com/home/products/data-profiling.aspx
CloverETL Data Profiler
http://www.cloveretl.com/products/profiler
OpenRefine
http://www.openrefine.org
Often packaged with data quality / data cleansing software
… and many more


Very long feature lists

Num rows
Min value length
Median value length
Max value length
Avg value length
Precision of numeric values
Scale of numeric values
Quartiles
Basic data types
Num distinct values ("cardinality")
Percentage null values
Data class and data type
Uniqueness and constancy
Single-column frequency histogram
Multi-column frequency histogram
Pattern discovery (Aa9)
Soundex frequencies
Benford Law frequency
Single-column primary key discovery
Multi-column primary key discovery
Single-column IND discovery
Multi-column IND discovery
Inclusion percentage
Single-column FK discovery
Multi-column FK discovery
Value overlap (cross domain analysis)
Single-column FD discovery
Multi-column FD discovery
Text profiling

Screenshots from Talend Data Quality



Screenshots from Talend


Screenshots from Talend



Screenshots for
IBM Information Analyzer


Screenshots for
IBM Information Analyzer



Typical Shortcomings of Tools (and research methods)
Usability
Complex to configure
Results complex to view and interpret
Scalability
Main-memory based
SQL based DBMS
Efficiency
Coffee, Lunch, Overnight
Functionality
Restricted to simplest tasks
Restricted to individual columns or small column sets
Realistic key candidates vs. further use-cases
SAP R3 schema has many tables with up to 16 columns as key
Interpretation of profiling results: that's the big one


Metanome

Extensible profiling platform that incorporates
several state-of-the-art metadata discovery
algorithms
Goals:
To provide novel profiling algorithms from research
To perform comparative evaluations
To support developers in building/testing new algorithms
Typical users:
Database administrators and IT professionals
Developers and researchers
See: https://hpi.de/naumann/projects/data-profiling-and-analytics/metanome-data-profiling.html



Design Goals
Simplicity
Should be easy to set up and use
Extensibility
New algorithms and datasets should be easily
addable to the system
Standardization
All common tasks, tooling, input parsing, result
handling should be provided
Flexibility
Make as few restrictions as possible to the
algorithms


Metanome architecture

[Architecture diagram: algorithm execution and configuration; result management and presentation; configuration, measurements, and results stored in backend databases (DB2, MySQL); inputs as txt, csv, or xml; profiling algorithms plugged in as jars (SWAN, SPIDER, DUCC)]


Most important tasks


Input parsing
Build an abstraction around input sources; specific formats are
irrelevant to profiling algorithms
Handles relational databases/files/tables, JSON/RDF/XML files
Output processing
Standardize the output formats depending on the type of metadata the
algorithm discovers
Most important metadata supported: unique column combinations,
INDs, FDs, order dependencies, basic statistics
Parameterization handling
Defines the parameterization of algorithms through the configuration
variables exposed by the profiling algorithms (set by the user)
Temporary data management
Provides dedicated temp-files for storing temporary data written by
profiling algorithms


Profiling algorithms
A profiling algorithm needs to implement a given
set of light-weight interfaces
Work autonomously: they are treated as foreign
code modules that manage themselves,
providing maximum flexibility for their design
Algorithms supported:
UCCs: DUCC
INDs: MIND, SPIDER, BINDER
FDs: TANE, FUN, FD_MINE, etc
ODs: ORDER


Snapshot: visualization of results


Snapshot: different visualization techniques



Outline
Data profiling tasks
Data profiling tools
Visualization


Motivation
Human in the loop for data profiling and data
cleansing.

Advanced visualization techniques


Beyond bar-charts and pie-charts

Interactive visualization
Support users in visualizing data, profiling results
Support any action taken upon the results
Cleansing, sorting, …
Re-profile and visualize immediately


Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment

http://vis.stanford.edu/files/2012-Profiler-AVI.pdf


Gapminder.org

http://www.gapminder.org/GapminderMedia/wp-uploads/Gapminder-World-2012.pdf


Next Lecture
Introduction to Data Warehouse
