Data Profiling: References
Data Profiling
Helena Galhardas
DEI/IST
References
11/5/15
Profiling in Spreadsheets
[Figure: spreadsheet view annotated with "Column labels" and "Number of rows". Source: Felix Naumann | Data Profiling | 6 | Trento 2015]
Challenges
Managing the input
Decide which profiling tasks to execute on which
parts of the data
Performing the computation
Computational complexity depends on the number of rows
and the number of columns;
sorting is a typical operation
Managing the output
Meaningfully interpret the profiling results; usually
performed by database and domain experts
Existing technology
SQL queries and spreadsheet browsing
Dedicated tools or components
E.g., IBM Information Analyzer, Microsoft SQL Server
Integration Services, Informatica Data Explorer
Innovative ways to handle the challenges
E.g., using indexes, parallel processing
Methods to deliver approximate results
E.g., by profiling samples
Narrowing the discovery process to certain
columns or tables
E.g., verifying inclusion dependencies on user-
suggested pairs of columns
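Verifying an inclusion dependency on one user-suggested pair of columns can be sketched as follows. This is a minimal illustration, not a tool's implementation: the function name and the sample columns (order customer ids, customer ids) are hypothetical, and NULLs in the dependent column are assumed to be ignored.

```python
def holds_inclusion(dependent_values, referenced_values):
    """True if every non-NULL dependent value also occurs in the referenced column."""
    referenced = set(referenced_values)
    return all(v in referenced for v in dependent_values if v is not None)


# Hypothetical user-suggested pair: orders.customer_id within customers.id
customer_ids = [1, 2, 3, 4]
order_customer_ids = [2, 2, 4, None]

print(holds_inclusion(order_customer_ids, customer_ids))  # True
print(holds_inclusion([2, 5], customer_ids))              # False, 5 is unmatched
```

Checking only suggested pairs costs one set construction and one scan per pair, instead of testing every pair of columns in the database.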
Outline
Data profiling tasks
Data profiling tools
Visualization
Classification of Traditional Data Profiling Tasks
Data profiling
  Single column
    Cardinalities
    Patterns and data types
    Value distributions
  Multiple columns
    Uniqueness
      Key discovery
      Conditional
      Partial
    Inclusion dependencies
      Foreign key discovery
      Conditional
      Partial
    Functional dependencies
      Conditional
      Partial
Cardinalities
Number of values (number of rows)
Length of values in terms of characters
Number of distinct values
Number of NULLs
MIN and MAX value
Useful for
Query optimization
Categorization of attribute
Relevance of attribute
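The single-column cardinalities listed above can be computed in one pass over a column. The sketch below is an illustration under assumptions: the function name, the result keys, and the sample zip codes are hypothetical, and NULL is represented by Python's None.

```python
def column_cardinalities(values):
    """Basic single-column statistics; None stands for NULL."""
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
    }


zip_codes = ["1000-001", "1049-001", None, "1000-001", "2780"]
stats = column_cardinalities(zip_codes)
print(stats["rows"], stats["nulls"], stats["distinct"])  # 5 1 3
print(stats["min_length"], stats["max_length"])          # 4 8
```

Here the differing minimum and maximum lengths already hint that one zip code does not follow the usual pattern.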
Data completeness
Finding disguised missing values (e.g., when
using web forms that include fields whose values
must be chosen from pull-down lists)
9999-999 for the zip code
Alabama for the US state
Method: determine the distribution of values;
disguised missing values occur much more often
than expected
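The frequency-based method can be sketched as follows. This is one possible heuristic, not the method of a specific tool: it flags values whose count exceeds a multiple of the median count. The function name, the factor of 3, and the sample data are assumptions; a flagged value is only a candidate that a domain expert still has to confirm.

```python
from collections import Counter
from statistics import median


def suspicious_values(values, factor=3.0):
    """Values occurring more than `factor` times as often as the median value."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(v for v, c in counts.items() if c > factor * typical)


# Hypothetical state column filled from a pull-down list whose first entry is Alabama
states = ["Alabama"] * 40 + ["Texas"] * 6 + ["Ohio"] * 5 + ["Utah"] * 4 + ["Iowa"] * 5
print(suspicious_values(states))  # ['Alabama']
```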
Value distributions
Probability distribution for numeric values
Detect whether data follows some well-known distribution
Determine that distribution function for data values
If no specific/useful function detectable: histograms
Histograms
Determine (and display) value frequencies for value intervals or for
individual values
Estimation of probability distribution for continuous variables
[Figure: histogram of the grade distribution]
Useful for
Query optimization
Outlier detection
Visualize distribution
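An equi-width histogram over value intervals can be computed as in the sketch below. The function name, the bin count, and the sample grades (assumed to lie on a 0 to 20 scale) are illustrative assumptions.

```python
def equi_width_histogram(values, bins, low, high):
    """Count values per interval of equal width; `high` falls into the last bin."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


grades = [3, 7, 9.5, 10, 12, 12.5, 14, 15, 17, 20]
print(equi_width_histogram(grades, bins=4, low=0, high=20))  # [1, 2, 4, 3]
```

Values outside the range from `low` to `high` are skipped here; counting them separately is one simple way to surface outliers.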
Clustering
To segment the records into homogeneous
groups using a clustering algorithm
Records that do not fit any cluster are flagged
as outliers
Outliers may indicate data quality problems
Algorithms: e.g., k-means
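A minimal sketch of clustering-based outlier flagging, assuming a single numeric attribute, hand-picked initial centroids, and a fixed distance threshold (all three are illustrative assumptions). It runs plain k-means and then flags values that lie far from every centroid.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers; `centroids` holds the initial centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


def outliers(points, centroids, max_distance):
    """Points farther than `max_distance` from every centroid."""
    return [p for p in points if min(abs(p - c) for c in centroids) > max_distance]


ages = [21, 22, 23, 24, 61, 62, 63, 64, 140]  # hypothetical age column
centers = kmeans_1d(ages, centroids=[20, 60])
print(centers)                                    # [22.5, 78.0]
print(outliers(ages, centers, max_distance=20))   # [140]
```

Note that the outlier itself pulls the second centroid upward, so the threshold has to be chosen with care.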
Dependencies
Metadata that describe relationships
among columns
Discovery of primary keys with the help of unique
column combinations
Discovery of foreign keys with the help of inclusion
dependencies
Functional dependencies
Complexity: depends on the number of columns and
the number of values
Several algorithms for detecting dependencies
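Discovering key candidates through unique column combinations can be sketched with a brute-force search. This naive enumeration is not one of the dedicated discovery algorithms; it only shows why the cost grows with the number of columns (all combinations) and the number of values (one scan per combination). Function names and sample rows are hypothetical.

```python
from itertools import combinations


def is_unique(rows, columns):
    """True if no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_unique_combinations(rows, columns):
    """Enumerate combinations by size, keeping only minimal unique ones."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if any(set(u) <= set(combo) for u in found):
                continue  # a subset is already unique, so combo is not minimal
            if is_unique(rows, combo):
                found.append(combo)
    return found


rows = [
    {"first": "Ana", "last": "Silva", "city": "Lisbon"},
    {"first": "Ana", "last": "Costa", "city": "Porto"},
    {"first": "Rui", "last": "Silva", "city": "Porto"},
    {"first": "Rui", "last": "Costa", "city": "Porto"},
]
print(minimal_unique_combinations(rows, ["first", "last", "city"]))
# [('first', 'last')]
```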
Functional dependencies
X → A
whenever two records have the same X values, they
also have the same A values, where X is a set of
attributes
E.g., street, number → zip-code
Useful for
Schema design
Normalization
Keys
Data cleansing
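The definition translates directly into a check: group records by their X values and verify that each group has a single A value. The sketch below assumes records are dictionaries; the function name and the address data are hypothetical.

```python
def holds_fd(rows, lhs, rhs):
    """True if records that agree on all `lhs` columns also agree on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # Remember the first rhs value per key; any later mismatch violates the FD.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


addresses = [
    {"street": "Rua A", "number": 1, "zip": "1000-001"},
    {"street": "Rua A", "number": 1, "zip": "1000-001"},
    {"street": "Rua A", "number": 2, "zip": "1000-002"},
    {"street": "Rua B", "number": 1, "zip": "2000-001"},
]
print(holds_fd(addresses, ["street", "number"], "zip"))  # True
print(holds_fd(addresses, ["street"], "zip"))            # False
```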
Partial dependencies
Real datasets contain exceptions to the rule so dependencies
can be relaxed
Aka approximate dependencies: hold for a subset of records
INDs and FDs that do not perfectly hold
For all but 10 of the tuples
Only for 80% of the tuples
Only for 1% of the tuples
Useful for
Data cleansing
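One way to quantify how far a functional dependency is from holding is the minimum fraction of records that would have to be removed for it to hold exactly (a measure often called g3 in the literature). The sketch below is an illustration; the function name and the sample data are assumptions.

```python
from collections import Counter, defaultdict


def fd_violation_fraction(rows, lhs, rhs):
    """Minimum fraction of records to remove so that lhs -> rhs holds."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[c] for c in lhs)][row[rhs]] += 1
    # Within each lhs group, keep only the records with the most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return (len(rows) - kept) / len(rows) if rows else 0.0


records = [
    {"zip": "1000", "city": "Lisbon"},
    {"zip": "1000", "city": "Lisbon"},
    {"zip": "1000", "city": "Lisbon"},
    {"zip": "1000", "city": "Lisboa"},  # inconsistent spelling
    {"zip": "4000", "city": "Porto"},
]
print(fd_violation_fraction(records, ["zip"], "city"))  # 0.2
```

A partial dependency zip → city that holds for 80% of the records, as here, points to the deviating records as candidates for cleansing.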
Conditional dependencies
Given a partial IND or FD: for which part of the data does it hold?
Example: conditional unique column combination
street is unique for all records with city = Lisbon
Expressed as a condition over the attributes of the
relation
Problems:
Infinite possibilities of conditions
Interestingness:
Many distinct values: less interesting
Few distinct values: surprising condition, high coverage
Useful for
Integration: cross-source cINDs
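Testing one candidate condition for a conditional unique column combination can be sketched as follows. The function returns whether the column is unique among the records selected by the condition, together with the condition's coverage (the fraction of records it selects). The function name and the sample data are hypothetical; searching the space of possible conditions is the hard part and is not attempted here.

```python
def unique_under_condition(rows, column, cond_column, cond_value):
    """Check uniqueness of `column` among rows where cond_column == cond_value."""
    selected = [row[column] for row in rows if row[cond_column] == cond_value]
    coverage = len(selected) / len(rows) if rows else 0.0
    return len(selected) == len(set(selected)), coverage


rows = [
    {"street": "Rua Augusta", "city": "Lisbon"},
    {"street": "Rua do Ouro", "city": "Lisbon"},
    {"street": "Rua Augusta", "city": "Porto"},
    {"street": "Rua Augusta", "city": "Porto"},
]
print(unique_under_condition(rows, "street", "city", "Lisbon"))  # (True, 0.5)
print(unique_under_condition(rows, "street", "city", "Porto"))   # (False, 0.5)
```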
Outline
Data profiling tasks
Data profiling tools
Screenshots for
IBM Information Analyzer
Metanome
Design Goals
Simplicity
Should be easy to set up and use
Extensibility
New algorithms and datasets should be easy to
add to the system
Standardization
All common tasks, tooling, input parsing, result
handling should be provided
Flexibility
Impose as few restrictions as possible on the
algorithms
Metanome architecture
[Figure: Metanome architecture diagram. Recoverable labels: Configuration, Measurements, Results; inputs DB2, MySQL, txt, csv, xml; algorithm jars SWAN, SPIDER, DUCC]
Profiling algorithms
A profiling algorithm needs to implement a given
set of light-weight interfaces
Algorithms work autonomously: they are treated as foreign
code modules that manage themselves,
which provides maximum flexibility for their design
Algorithms supported:
UCCs: DUCC
INDs: MIND, SPIDER, BINDER
FDs: TANE, FUN, FD_MINE, etc.
ODs: ORDER
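The idea of a light-weight interface that lets the framework run algorithms as self-managing modules can be illustrated with the sketch below. Everything in it is hypothetical: the class and method names are invented for illustration and are not Metanome's actual interfaces, which these slides do not show.

```python
from abc import ABC, abstractmethod


class ProfilingAlgorithm(ABC):
    """Hypothetical plug-in contract (not Metanome's real interface)."""

    @abstractmethod
    def execute(self, rows, columns):
        """Return a list of discovered metadata items."""


class SingleColumnUniques(ProfilingAlgorithm):
    """Toy algorithm: report columns whose values are all distinct."""

    def execute(self, rows, columns):
        return [c for c in columns if len({r[c] for r in rows}) == len(rows)]


def run(algorithm, rows, columns):
    # The framework relies only on the interface, not on algorithm internals.
    return algorithm.execute(rows, columns)


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "a"}]
print(run(SingleColumnUniques(), rows, ["id", "name"]))  # ['id']
```

Keeping the contract this small is what leaves the algorithm free to choose its own data structures and execution strategy.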
Outline
Data profiling tasks
Data profiling tools
Visualization
Motivation
Human in the loop for data profiling and data
cleansing.
Interactive visualization
Support users in visualizing data, profiling results
Support any action taken upon the results
Cleansing, sorting,
Re-profile and visualize immediately
Assessment
Gapminder.org
http://www.gapminder.org/GapminderMedia/wp-uploads/Gapminder-World-2012.pdf
Next Lecture
Introduction to Data Warehouse