
SimpleETL: ETL Processing by Simple Specifications∗

Ove Andersen (Aalborg University & FlexDanmark, Denmark), Christian Thomsen (Aalborg University, Denmark), Kristian Torp (Aalborg University, Denmark)
ABSTRACT

Massive quantities of data are today collected from many sources. However, it is often labor-intensive to handle and integrate these data sources into a data warehouse. Further, the complexity is increased when specific requirements exist. One such new requirement is the right to be forgotten, where an organization upon request must delete all data about an individual. Another requirement is that facts are updated retrospectively. In this paper, we present the general framework SimpleETL, which is currently used for Extract-Transform-Load (ETL) processing in a company with such requirements. SimpleETL automatically handles all database interactions such as creating fact tables, dimensions, and foreign keys. The framework also has features for handling version management of facts and implements four different methods for handling deleted facts. The framework enables, e.g., data scientists to program complete and complex ETL solutions very efficiently with only a few lines of code, which is demonstrated with a real-world example.

1 INTRODUCTION

Data is being collected at unprecedented speed, partly due to cheaper sensor technology and inexpensive communication. Companies have realized that detailed data is valuable because it can provide up-to-date and accurate information on how the business is doing. These changes have in recent years coined terms such as "Big Data", "The five V's", and "Data Scientist". It is, however, not enough to collect data; it should also be possible for the data scientist¹ to integrate it with existing data and to analyze it.

A data warehouse is often used for storing large quantities of data, possibly integrated from many sources. A wide range of Extract-Transform-Load (ETL) tools support cleaning, structuring, and integration of data. The available ETL tools offer many advanced features, which make them very powerful but also both overwhelming and sometimes rigid in their use. It can thus be challenging for a data scientist to quickly add a new data source. Further, many of these products mainly focus on data processing and less on aspects such as database schema handling. Other important topics are the privacy and anonymity concerns of citizens, which have caused the EU (and others) to introduce regulations where citizens have a right to be forgotten [9]. Violating these regulations can lead to large penalties, and it is thus important to enable easy removal of an individual citizen's data from a data warehouse.

Figure 1: Example Case Star Schema

A simplified real-world example use case is presented by a star schema in Figure 1, where passenger travels carried out by a taxi company are stored. Each travel is a fact stored in a fact table, connected with a vehicle, a customer, and a date dimension. It is common practice that facts are deleted, e.g., if it is discovered that an ordered trip from two days ago was not executed after all, then the fact will be removed, or a fact gets updated due to late-arriving accounting information. Further, for audit reasons, it is required that changes must be tracked, e.g., if a price is updated.

The presented SimpleETL framework enables data scientists to program an ETL solution in a very efficient and convenient way with only a few lines of code, mainly consisting of specifications of metadata. The framework manages everything behind the scenes, from structuring the data warehouse schema, fact tables, dimensions, references, and indexes to data version tracking. This also includes handling of changes to facts in line with Kimball's slowly changing dimensions [12]. Processing data using SimpleETL is automatically highly parallelized such that every dimension is handled in its own process and fact table processing is spread across multiple processes.

The rest of the paper is structured as follows: First, related work is discussed in Section 2. Then a simple use case is introduced in Section 3, followed by an example implementation in Section 4 showing how a user efficiently programs an ETL flow. In Section 5, the support for fact version management and deletion of facts is described. Then, in Section 6, it is described how a data scientist configures and initializes an ETL run, including how the framework operates, along with a real-world use case example. Section 7 concludes the paper and points to directions for future work.

∗ Produces the permission block, and copyright information
¹ By "data scientist" we in this paper refer to someone focused on analyzing data and less on the technical aspects of DBMSs, e.g., ETL tools and data warehousing.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

2 RELATED WORK

A survey of ETL processes and technologies is given by [16]. A plethora of ETL tools exist from commercial vendors such as IBM, Informatica, Microsoft, Oracle, and SAP [2-5, 7]. Open source ETL tools also exist, such as Pentaho Data Integration and Talend [6, 8]. Gartner presents the widely used tools in its Magic Quadrant [10]. With most ETL tools, the user designs the ETL flow in a graphical user interface by means of connecting boxes (representing transformations or operations) with arrows (representing data flows).
Another approach is taken by the tool pygrametl [14], for which it is argued that programmatic ETL, i.e., creating ETL programs by writing code, can be beneficial. With pygrametl, the user programs Python objects for dimension and fact tables to handle insert/update operations on the target data warehouse. SimpleETL, however, hides complexity from the user and conveniently handles all schema management. Based on the specification of metadata, SimpleETL creates 1) the required SQL to generate or alter the target data warehouse schema; 2) the necessary database actions and pygrametl objects to modify the tables; and 3) processes for parallel execution. SimpleETL provides template code for its supported functionality, e.g., history tracking of changing facts. It is therefore simple and fast for a data scientist to define an ETL flow or add new sources and dimensions, because she does not have to write the code for this, but only specify the metadata.

Tomingas et al. [15] propose an approach where Apache Velocity templates and user-specified mappings are used and transformed into SQL statements. In contrast, SimpleETL is based on Python, which makes it easy for data scientists to exploit their existing knowledge and to use third-party libraries.
BIAccelerator [13] is another template-based approach for creating ETL flows with Microsoft's SSIS [4], enabling properties to be defined as parameters at runtime. Business Intelligence Markup Language (Biml) [1] is a domain-specific XML-based language to define SSIS packages (as well as SQL scripts, OLAP cube definitions, and more). The focus of BIAccelerator and Biml/BimlScript is to let the user define templates generating SSIS packages for repetitive tasks, while SimpleETL makes it easy to create and load a data warehouse based on the templating provided by the framework.

3 USE-CASE

In this section, we describe a simplified use-case scenario that serves as a running example throughout the paper and is used to explain the distinctive features of the SimpleETL framework. The simplified use case is heavily inspired by a real-world example.

In Figure 1, a star schema is presented that connects information on passenger travels with a dimension for passengers, a dimension for the vehicle carrying out the travel, and a date dimension. The data is loaded from a CSV file with all the information available on each line. Both the references and measures consist of a combination of integer values, numeric values for monetary amounts, string values, and date and time values.

Every night this set of data is exported from a source system (an accounting system) and a complete data dump is available, including all historic, earlier dumped data. The nightly dump has some distinctive characteristics, which make handling the data non-trivial. The characteristics are that the data contain duplicates of existing facts, contain updated measures of existing facts, and lack deleted facts, which must be detected. These three characteristics place some special demands on the ETL solution.

Two types of requirements exist for the functionality of the final data warehouse after the data has been processed. First, a set of business-oriented demands exists, such as tracking updates of facts, e.g., when and what corrections were made. Second, updated legislation on people's rights, e.g., the General Data Protection Regulation [9], creates new requirements for data to be deleted completely if a customer requests to be forgotten.

4 FRAMEWORK COMPONENTS

This section provides an overview of the components in SimpleETL which a user customizes to create a data warehouse and corresponding ETL process. First, a class diagram is presented that shows all components in the framework. Next, each component is described in more detail with examples from the use case in Section 3.

Figure 2: UML Class Diagram for SimpleETL

4.1 Class Diagram

Figure 2 shows the UML class diagram for the SimpleETL framework, which consists of three classes. The class Datatype is used to define the type of a column in both dimension tables and (measures) in a fact table. The parse method transforms a value (e.g., from a string to an integer) and ensures that the data type is correct (e.g., that it is a signed 32-bit integer) and that any constraints on the values in the column (e.g., that it must be positive) are satisfied. The sqltype method returns the SQL data type recognizable by a DBMS. The SimpleETL framework comes with most standard data types, e.g., 2, 4, and 8-byte integer, numeric, date, and string (varchar) types.

The class Dimension models a dimension. It is an aggregation of a number of Datatype objects. The Dimension class contains two methods, one for adding lookup attributes, add_lookupatt, and one for adding regular attributes, add_att. The combined set of lookup attributes uniquely identifies a record, which a Dimension key refers to. Regular attributes simply describe the record. The SimpleETL framework comes with a standard date and time dimension.

The class FactTable models a fact table. It is an aggregation of a number of Dimension objects and Datatype objects. Four methods are available on the class: first, a method for connecting a Dimension with the FactTable, add_dim_mapping; second, a method for adding a measure mapping, add_column_mapping; third, a method for defining how deleted rows should be handled, handle_deleted_rows; and finally a method for defining additional indexes over a set of columns, add_index. Note that the SimpleETL framework automatically adds indexes on all dimension mappings and on the lookup attribute set.
4.2 Data Type

A data type defines how a specific value is stored in the database and how a value from the data source is parsed and processed during ETL. An example of how a user can specify a data type for storing a year is shown in Listing 1.

Listing 1: Defining Year Data Type
1 from simpleetl import Datatype
2 def _validate_year(val):
3     if str(val).isnumeric() and 1900 <= int(val) <= 2099:
4         return int(val)
5     return -1
6 yeartype = Datatype('smallint', _validate_year)

The data type is defined at line 6 and named yeartype. The first parameter specifies the SQL data type, a 2-byte integer. The second parameter is a Python function, _validate_year, which both handles the diversity of the data, e.g., NULL values and conversion of string representations, and also enforces constraints like 1900 <= year <= 2099 (line 3). If the input fails to be parsed, -1 is returned (line 5).

A number of standard data types are pre-defined, e.g., SMALLINT (2-byte integer), NUMERIC(precision, scale), and VARCHAR(n), where the lengths of the two latter can be defined using arguments. Floating point data types are not supported by the SimpleETL framework since it depends on equality comparisons for version management and for determining updates/deletes, and comparing two floats can yield unpredictable results. It is encouraged to use NUMERIC(precision, scale) when decimal values are used.
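Other domain-specific types can be declared with the same two ingredients: an SQL type name and a validation function. The following sketch reuses only the Datatype constructor shown in Listing 1; the type name percenttype and the helper _validate_percent are illustrative and not part of SimpleETL.

from simpleetl import Datatype

def _validate_percent(val):
    # Accept integer-like input in the range 0-100; anything else maps to -1,
    # following the same fallback convention as _validate_year in Listing 1.
    if str(val).isnumeric() and 0 <= int(val) <= 100:
        return int(val)
    return -1

percenttype = Datatype('smallint', _validate_percent)

Because the validation function is plain Python, the same pattern can also trim strings, normalize units, or map NULLs to defaults before a value reaches the database.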
4.3 Dimension

The Dimension class describes how a single dimension table in the database is modeled. An example implementation of the vehicle dimension from Figure 1 is shown in Listing 2.

Listing 2: Defining Vehicle Dimension
1 from simpleetl import Dimension, datatypes as dt
2 def handle_model(row, namemapping):
3     row["make"] = row["make"][0:20]
4     row["model"] = row["model"][0:20]
5 vehicledim = Dimension(schema="dims", table="vehicle",
      key="vehiclekey", rowexpander=handle_model)
6 vehicledim.add_lookupatt(name="vehicleid",
      dtype=dt.varchar(20), default_value='missing')
7 vehicledim.add_att(name="make", dtype=dt.varchar(20))
8 vehicledim.add_att(name="model", dtype=dt.varchar(20))
9 vehicledim.add_att(name="vehicleyear", dtype=yeartype)

The dimension is defined in line 5, where the first and second parameters are the schema and table name, respectively. The third parameter is the name of the primary key. The fourth parameter, rowexpander, known from pygrametl [14], allows for a user-defined function, here handle_model, which is called on every row, in this case (lines 2-4) truncating make and model to 20 characters, preventing overflow of the database varchar columns, which are limited to 20 characters (lines 7-8).

When the dimension has been defined, two types of attributes can be added. The first type is mandatory and is called the lookup attribute set. In the example, a vehicle id, vehicleid, is defined as a single lookup attribute in line 6. Lookup attributes are not allowed to be NULL as these must be comparable for lookups; hence a default value is given for the vehicle id, the string 'missing'. Adding the primary key of the Dimension as a single lookup attribute makes the primary key a smart key instead of a surrogate key [12]. Smart keys can optimize the performance of dimension handling because a smart key can be computed, e.g., the date 2017-07-21 can become the smart key 20170721. The second set of attributes is optional and is called member attributes. Member attributes provide additional information for a dimension entry. Three member attributes are added in Listing 2 (lines 7-9), adding make and model attributes as varchars of size 20 and the vehicle year utilizing the yeartype data type defined in Listing 1.

4.4 Fact Table

The FactTable class defines a fact table and all aspects of this, including database schema descriptions, data processing, and data version management. A set of lookup attributes can be defined to uniquely identify a row. If the lookup attributes are set, they enforce that duplicate facts with the same set of lookup attributes cannot exist. If no lookup attributes are defined, version management cannot be enabled and duplicate facts can exist. Lookup attributes are not allowed to have NULL values. The implementation of the fact table Travels from Figure 1 is shown in Listing 3.

Listing 3: Defining Travels Fact Table
1 from simpleetl import FactTable, datatypes as dt
2 travels = FactTable(schema="facts", table="travels",
      lookupatts=["travelid"], store_history=True, key="id")
3 travels.add_dim_mapping(dimension=vehicledim, dstcol="vehiclekey")
4 travels.add_dim_mapping(dimension=datedim, dstcol="datekey")
5 travels.add_dim_mapping(dimension=customerdim, dstcol="customerkey")
6 travels.add_column_mapping(srccol="id", datatype=dt.integer, dstcol="travelid")
7 travels.add_column_mapping(srccol="price", datatype=dt.numeric(6,2), dstcol="price")
8 travels.add_index(["price"])
9 travels.handle_deleted_rows(method="mark")

In line 2, the FactTable object is instantiated, given a schema and table name as the first two parameters. The third parameter defines the lookup attributes, the fourth parameter specifies that full history should be retained, and the fifth parameter defines the primary key of the table, id. The lookupatts attribute defines that no two facts with identical travelid can exist and is used when determining new/updated/deleted facts.

The vehicle dimension defined in Listing 2 is attached as a dimension using a single line of code in line 3. In lines 4 and 5, two additional dimensions are added, one handling the date of the travel and another handling customer information, introduced in Figure 1. In lines 6 and 7, two measures are added, first the lookup attribute, id, and second the price of a travel, implemented as a numeric data type.

The framework automatically creates primary keys, foreign keys, and indexes, including a unique index on the lookup attributes and the primary key. It is possible for the user to add additional indexes (line 8). In line 9 it is defined that when a row is determined to have been deleted from the data source, the row should be marked in the table as having been removed (method D4 from Section 5.2), thus keeping the fact in the data warehouse.

Overall, SimpleETL is designed to optimize productivity, ensure consistency, reduce programming errors, and help the data scientist in loading and activating data for analysis. This is realized by reuse of data types and dimensions, as shown in the code examples, and by keeping the number of methods and parameters to a minimum.
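To further illustrate the smart-key concept from Section 4.3 and the reuse of data types, the sketch below declares a hypothetical user-defined date dimension whose primary key doubles as its single lookup attribute, so the key 20170721 can be computed directly from the date 2017-07-21. It only uses the constructor and methods shown in Listing 2; the table and column names and the reuse of yeartype from Listing 1 are illustrative, and SimpleETL already ships with a standard date and time dimension, so this is a sketch rather than required user code.

from simpleetl import Dimension, datatypes as dt

datedim = Dimension(schema="dims", table="date", key="datekey")
# Making the primary key the single lookup attribute turns it into a
# smart key (e.g., 20170721) instead of a surrogate key.
datedim.add_lookupatt(name="datekey", dtype=dt.integer, default_value=-1)
datedim.add_att(name="year", dtype=yeartype)  # yeartype as defined in Listing 1
datedim.add_att(name="month", dtype=dt.integer)
datedim.add_att(name="day", dtype=dt.integer)

Because such a key can be computed from the source value, a dimension worker can return it immediately without a database lookup (cf. Section 6.3).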
5 MODIFICATIONS OF FACTS

In some applications it is a business requirement that facts can be updated and that full history is maintained to enable tracking of changes to facts. Simultaneously, it is common practice to remove data if it is no longer valid, e.g., if a passenger travel was not carried out, it is later deleted from the accounting system. Another motivation for deleting data is legal demands such as the concept called the right to be forgotten [9]. This section shows how these requirements are handled automatically by the SimpleETL framework.

5.1 Slowly Changing Fact Tables

To handle updates of facts we introduce the slowly changing fact table. When a user enables version tracking of facts (store_history=True in Listing 3, line 2), a second fact table is created.

The main fact table, illustrated in Table 1, acts similarly to a type-1 slowly changing dimension such that facts get updated (overwritten) when changes are detected in the source data. In these examples the type-1 fact table consists of an id, a travelid, shortened tid, and a price. This table is referred to as the type-1 fact table in the rest of the paper.
The second table, illustrated in Table 2, acts in a similar way as a type-2 version-managed slowly changing dimension, where version management of data is tracked using four additional columns. A pair of columns _validfrom and _validto, shortened _vfrom and _vto, stores the validity period of a fact using 32-bit Unix timestamps, t1 through t3. A version number, _ver, keeps track of fact changes, and a column, _fact_id, shortened _fid, references the primary key of the type-1 fact table, bridging the type-1 and the version-managed fact tables together, e.g., for tracing historic changes of facts in the type-1 fact table. This table is referred to as the version-managed fact table in the rest of the paper.

We now illustrate what happens when a data set is loaded by the SimpleETL framework. Table 1 and Table 2 show the type-1 and the version-managed fact tables with two rows of data loaded. The _vfrom is set to t1 and the _vto defaults to -1 while a fact is still live.

Table 1: T1 Facts
id  tid  price
1   100  40
2   109  25

Table 2: Version Managed Fact Table
id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1
2   109  25     t1      -1    1     2

When an update happens at the data source, it is propagated to SimpleETL at the next ETL batch run. For example, if the price for tid=109 is updated from 25 to 35, the measure of the type-1 fact table is overwritten, as shown in Table 3, while in the version-managed fact table, Table 4, the _vto is set for id=2 and a new version of the fact is inserted with id=3.

Table 3: Upd. T1
id  tid  price
1   100  40
2   109  35

Table 4: Updated Ver. Managed Facts
id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1
2   109  25     t1      t2    1     2
3   109  35     t2      -1    2     2

The advantage of this two-table approach is that despite many updates the type-1 fact table does not grow in size. The downside is increased storage cost from representing facts in both tables.
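The version-managed table can be queried directly when the history of a fact is needed. The following sketch lists every recorded price of one travel together with its validity interval; it assumes psycopg2 and a hypothetical table name facts.travels_historic for the version-managed table, since the paper does not prescribe its name.

import psycopg2

conn = psycopg2.connect("dbname=dw")  # illustrative connection string
with conn.cursor() as cur:
    cur.execute("""
        SELECT _ver, price, _validfrom, _validto
        FROM facts.travels_historic  -- hypothetical version-managed table name
        WHERE travelid = %s
        ORDER BY _ver""", (109,))
    for version, price, valid_from, valid_to in cur.fetchall():
        # _validto = -1 marks the currently valid version of the fact
        print(version, price, valid_from, valid_to)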
5.2 Deleting Facts

The motivation for deleting facts can be to reflect production, e.g., if a passenger travel was not carried out, it is deleted in hindsight. Second, legal demands, such as the right to be forgotten [9], can require data about individuals to be deleted.

The SimpleETL framework enables the user to choose between four methods for handling deleted data. These are described using Table 3 and Table 4 as the outset. The fact with tid=109 is deleted.

The first method, D1, ignores that facts are deleted at the source system, i.e., if the fact with tid=109 is deleted it will still persist in the data warehouse, as in Table 3 and Table 4. This method enables keeping facts regardless of what happens at the data source and is useful if facts cannot be altered or data is loaded incrementally.

The second method, D2, completely deletes facts from the data warehouse if they are removed at the source system. Table 5 shows the type-1 fact table and Table 7 shows the version-managed fact table after the fact with tid=109 has been deleted. This method is useful if facts must be enforced to be removed, e.g., due to legal reasons: when data is removed at the data source it will automatically be removed from the fact tables too.

Table 5: Del. T1 using D2/D3
id  tid  price
1   100  40

Table 7: Deleted Version Managed Facts using D2
id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1

The third method, D3, removes the fact in the type-1 fact table, like method D2 shown in Table 5, while in the version-managed fact table the deleted fact is marked with a time stamp, _vto=t3, as shown in Table 8. This method is useful if the type-1 fact table must mirror the source system, while deleted data must be tracked.

The fourth method, D4, adds an extra attribute, _deleted, shortened _del, to both fact tables, with default value -1. When a fact is removed, the _del measure is set to the relevant time stamp for the fact in both the type-1 and the version-managed fact tables, Table 6 and Table 8 respectively. This method is useful if easy filtering of deleted facts is required, e.g., for bookkeeping on the type-1 fact table.

Table 6: Deleted T1 using D4
id  tid  price  _del
1   100  40     -1
2   109  35     t3

Table 8: Deleted Version Managed Facts using D3 and D4
id  tid  price  _vfrom  _vto  _ver  _fid  _del (D4 only)
1   100  40     t1      -1    1     1     -1
2   109  25     t1      t2    1     2     t3
3   109  35     t2      t3    2     2     t3

Having four different methods for handling deleted facts makes the SimpleETL framework very versatile and matches most business and legal needs with respect to the balance between preserving data and privacy regulations.

6 DATA AND PROCESS FLOW

This section first introduces how the ETL process is configured and initiated; then the process flow implementation is visualized in Figure 3, separating the process flow into three stages, Initialization (1.1-1.4 in Figure 3), Processing (2.1-2.5), and Data Migration (3.1-3.6). White boxes in Figure 3 indicate steps processed sequentially while gray boxes indicate parallel execution.

Facts are first loaded from a data source to a data staging area, and dimensional integrity is maintained with all related dimensions. Next, the data is moved from the data staging area to the fact tables in three steps, first migrating updated data, then porting new data, and finally handling deleted data, according to the user specifications in Section 5. Finally, a real-world use case is presented along with an implementation and runtime statistics.

6.1 Configuration

The SimpleETL framework supports that data is loaded from multiple data sources. Each data source is defined using a data feeder, which is a user-defined Python function that yields key/value Python dictionaries of data for every fact, e.g., one dictionary for each row in a CSV file. These dictionaries are consumed by the ETL process described in Section 6.3. The data-feeder functions are not an integrated part of the SimpleETL framework, which allows the user to load data from various sources, e.g., CSV, ODBC, or REST APIs, only requiring that they can present a fact as a Python dictionary.

When the data warehouse structure, using the components from Section 4, and a data source are defined, the ETL process can be configured and initiated. All functionality related to database schema management and data management is handled automatically. When the ETL process has completed, the data is available in the data warehouse for querying. The ETL process is started as shown in Listing 4. In line 11, a file is prepared for loading, using Python's CSV-to-dictionary function. The ETL process is started in line 12, where the FactTable and CSV file are given as input. Listing 4 also shows how two optional functions are used to customize the ETL process. The argument filterfunc=dupfilter defines a function for filtering rows before data is distributed to parallel workers, and the argument transformfunc=parsevehicle defines a function distributed to all background worker processes.

Listing 4: Processing SimpleETL
1 prev_id = None
2 def dupfilter(row):
3     global prev_id
4     if prev_id == row["id"]:
5         return False  # Ignore duplicate "id" values
6     prev_id = row["id"]
7     return True
8 def parsevehicle(row, dbcon):
9     # Split mk_mdl into two variables
10    row["make"], row["model"] = row["mk_mdl"].split("|")
11 csvfile = csv.DictReader(open("/path/to/file"))
12 processETL(facttable=fact, datafeeder=csvfile,
      filterfunc=dupfilter, transformfunc=parsevehicle,
      [database connection details])
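Any iterable of dictionaries can play the role of the datafeeder in Listing 4. As a sketch of a non-CSV source, the generator below pages through a hypothetical REST endpoint; the URL and field names are invented for illustration and are not part of SimpleETL.

import requests

def rest_feeder(url="https://example.org/api/travels"):  # illustrative endpoint
    page = 1
    while True:
        batch = requests.get(url, params={"page": page}).json()
        if not batch:
            return
        for record in batch:
            # Each yielded dict describes one fact, just like a csv.DictReader row.
            yield {"id": record["id"], "price": record["price"],
                   "vehicleid": record["vehicleid"], "mk_mdl": record["mk_mdl"]}
        page += 1

The generator could then be passed as datafeeder=rest_feeder() instead of the csv.DictReader object.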
We have now shown all the code that the user needs to implement in various Python functions to use the SimpleETL framework. In the following subsections, it is described what is done internally in the framework to build the data warehouse schema and efficiently load the data.

Figure 3: Main Execution Flow of SimpleETL

6.2 Initialization

Before starting the ETL data processing, SimpleETL initializes database connections and validates the FactTable object (processETL, (1.1) in Figure 3). Schema, constraints, and indexes are created and verified for all attached dimensions (1.2) and the fact tables (1.3). A temporary data staging table is initialized for later handling of updated and deleted facts (1.4).

6.3 Processing

The main ETL process extracts data from the data source given by the datafeeder argument, Figure 3 (2.1). A filterfunc, introduced in Section 6.1, can be applied for filtering data (2.2). Then data is distributed to the background workers (2.3) in batches of 1000 rows (a user-configurable size). Background fact workers (2.4) read from and write to the dimensions (2.5), and when all data has been processed, the fact and dimension workers commit data to the data warehouse dimensions and the data staging table.

Dimension and fact handling are separated from the main process into parallel background workers for performance reasons. The background workers, (2.4) and (2.5) in Figure 3, are implemented using Python's multiprocessing.Process, and communication is handled through Inter-Process Communication (IPC) queues. Several caching layers, using Python's functools.lru_cache, reduce the IPC and dimension database communication.

Parallel Fact Workers. The parallel fact workers, (2.4) in Figure 3, process rows distributed in batches from (2.3). If the parameter transformfunc is provided (Section 6.1), it is executed first. Such a function can contain advanced user-defined transformations. Second, all dimension referencing is handled using the dimension workers (2.5). Then each measure is processed and finally the data is inserted into a data staging table. n parallel fact workers are spawned, where n equals the number of CPU cores available to the framework.

Decoupled Dimension Workers. Each dimension is handled in its own separate process (2.5), i.e., having three attached dimensions will run three separate processes. Utilizing the same dimension more than once will only spawn one instance, e.g., utilizing a date dimension three times will only use one parallel worker process. If the dimension key is a smart key, see Section 4.3, the smart key can immediately be returned from the dimension worker, while surrogate keys must be coordinated with the dimension table, potentially with database lookups. m parallel dimension workers are spawned, where m is the number of distinct dimensions attached to a FactTable, see Section 4.4.

6.4 Data Migration

The data migration is split into three steps for handling updated facts, new facts, and deleted facts. The main driver for determining updates, new data, and deleted data is the lookup attributes, see Section 4.4, which uniquely define a fact and whose values are mandatory (not NULL). Lookup attributes can be both fact measures and dimension-referencing keys. If the lookup attribute set is not defined, then no updating, deletion, or version management can be performed and all data will be appended.

Migrating Updated Facts. Updated facts are defined as facts where the set of lookup attributes already exists in the existing fact tables and where at least one of the measures has changed. This is handled by (3.1) and (3.2) in Figure 3, and the type-1 and version-managed tables are processed in parallel, as handling updates does not change relationships between these two tables.

Migrating New Facts. New facts are facts whose set of lookup attributes does not exist in the type-1 and version-managed fact tables. This is handled in (3.3) and (3.4) in Figure 3, where data is first migrated to the type-1 fact table and next to the version-managed fact table. This sequential step is necessary as the version-managed fact table needs the id of the type-1 fact table for referencing it. This step also ensures that no duplicate sets of lookup attributes are loaded, if the lookup attribute set of the FactTable is defined.

Migrating Deleted Facts. If migration of deleted facts is enabled, it is determined which facts exist in the type-1 and version-managed fact tables while they do not exist in the staging table. How such facts are handled, when removed at the data source, depends on the method chosen among those described in Section 5.2.
This migration of deleted facts is handled in (3.5) and (3.6) in Figure 3.
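Conceptually, the classification of staged facts into new, updated, and deleted sets is driven purely by the lookup-attribute values. The framework performs this classification set-based against the staging table in the database; the following standalone Python sketch (illustrative only, not SimpleETL code) spells out the same logic with the lookup-attribute values as dictionary keys.

def classify(staging, existing, measures):
    # staging and existing map a tuple of lookup-attribute values to the
    # remaining measure values of a fact.
    new = [key for key in staging if key not in existing]
    deleted = [key for key in existing if key not in staging]
    updated = [key for key in staging if key in existing
               and any(staging[key][m] != existing[key][m] for m in measures)]
    return new, updated, deleted

# The travels fact with travelid as the single lookup attribute:
staging = {(109,): {"price": 35}}
existing = {(100,): {"price": 40}, (109,): {"price": 25}}
print(classify(staging, existing, ["price"]))  # -> ([], [(109,)], [(100,)])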
6.5 Real-World Use

SimpleETL is designed to be a convenient and easy tool for data scientists to quickly load their data and start working with it. To show that SimpleETL also performs well, a real-world use case has been implemented. One fact table is configured with version tracking enabled and deleted facts propagated by method D3 from Section 5.2. The fact table consists of 153 columns, including 1 primary key, 41 foreign keys to 18 dimensions, and 111 measures. An index is automatically generated covering the lookup attributes and the primary key, and 41 indexes are automatically generated on all the foreign keys. The data contains information on passenger travels from a fleet system. 1.2 million rows are available in a 1.67 GB CSV data file and each row has 147 columns. The final sizes of the type-1 and version-managed fact tables are 732 and 882 MB of data and 1193 and 1422 MB of indexes, respectively.

The initial data load takes 34 minutes, including creating the schema, while an incremental batch providing 17,678 updated, 16,381 new, and 3 deleted facts is performed in 8 minutes on a single Ubuntu Linux server running PostgreSQL 9.6 with 16 GB of RAM and a 6-core Intel Xeon E5-2695V3 CPU clocked at 2.3 GHz. The SimpleETL framework and the PostgreSQL DBMS both run on the same host.

The performance of SimpleETL scales with the number of CPUs, and a large part of the execution time is spent in underlying DBMS transactions. A different DBMS or configuration will yield other performance results.
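The 153-column specification does not have to be written out call by call; because the specification is ordinary Python, column and dimension mappings can be added in loops over metadata lists. The fragment below is purely illustrative (table name, column names, and types are invented) and only uses the FactTable methods shown in Listing 3.

from simpleetl import FactTable, datatypes as dt

bigfact = FactTable(schema="facts", table="fleet_travels",
                    lookupatts=["travelid"], store_history=True, key="id")

# Hypothetical metadata describing a few of the many measures; the real
# specification would simply list more (srccol, datatype, dstcol) entries.
measure_spec = [
    ("price", dt.numeric(6, 2), "price"),
    ("distance_km", dt.numeric(8, 3), "distance_km"),
    ("waiting_minutes", dt.integer, "waiting_minutes"),
]
for srccol, datatype, dstcol in measure_spec:
    bigfact.add_column_mapping(srccol=srccol, datatype=datatype, dstcol=dstcol)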

7 CONCLUSION

This paper presents the SimpleETL framework that enables simple and efficient programming of ETL for data warehouse solutions without the user needing database management or ETL experience. This makes the framework particularly well suited for data scientists because they can quickly integrate and explore new data sources.

The framework enables advanced fact handling such as handling slowly changing facts using version management and enables the users to decide how deleted facts should be handled. Four different methods for handling deleted facts are presented.

The framework is simple and contains only three classes for data types, dimensions, and fact tables, respectively. Each class has two to four methods. The ETL process is directed by metadata specifications and the framework handles everything else, including version management and tracking of deleted facts. The entire internal process flow extensively utilizes parallelization and IPC for processing facts, and every dimension is handled in a separate process.

The main contribution of SimpleETL is to provide a convenient and simple ETL framework for data scientists. Despite this, a performance benchmark using a real-world data scenario, where facts are inserted, updated, and deleted, shows that the framework is lightweight and that executing ETL batches while maintaining versioned data and deletions is performed efficiently.

There are a number of relevant directions for future work, including automatic table partitioning to handle very large data sets. Snowflake dimension support is another commonly used technique from data warehousing, which would be relevant to support in the SimpleETL framework.

REFERENCES
[1] BimlScript. http://www.bimlscript.com/. Accessed 2017-10-24.
[2] IBM InfoSphere DataStage. https://www.ibm.com/ms-en/marketplace/datastage. Accessed 2017-10-13.
[3] Informatica. https://www.informatica.com/. Accessed 2017-10-13.
[4] Microsoft SQL Server Integration Services. https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services. Accessed 2017-10-13.
[5] Oracle Data Integrator. http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html. Accessed 2017-10-13.
[6] Pentaho Data Integration - Kettle. http://kettle.pentaho.org. Accessed 2017-10-13.
[7] SAP Data Services. https://www.sap.com/products/data-services.html. Accessed 2017-10-13.
[8] Talend. https://www.talend.com/products/big-data/. Accessed 2017-10-24.
[9] EU Regulation 2016/679: General Data Protection Regulation. Official Journal of the European Union L119 (2016), 1-88. http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=OJ:L:2016:119:TOC
[10] Mark A. Beyer, Eric Thoo, Mei Yang Selvage, and Ethisham Zaidi. 2017. Gartner Magic Quadrant for Data Integration Tools.
[11] Scott Curie. What is Biml. http://www.bimlscript.com/walkthrough/Details/3105. Accessed 2017-10-24.
[12] Ralph Kimball and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons.
[13] Reinhard Stumptner, Bernhard Freudenthaler, and Markus Krenn. 2012. BIAccelerator - A Template-Based Approach for Rapid ETL Development. Springer Berlin Heidelberg, 435-444.
[14] Christian Thomsen and Torben Bach Pedersen. 2009. pygrametl: A Powerful Programming Framework for Extract-Transform-Load Programmers. In DOLAP, Il-Yeol Song and Esteban Zimányi (Eds.). ACM, 49-56.
[15] Kalle Tomingas, Margus Kliimask, and Tanel Tammet. 2014. Mappings, Rules and Patterns in Template Based ETL Construction. In The 11th International Baltic DB & IS 2014 Conference.
[16] Panos Vassiliadis. 2009. A Survey of Extract-Transform-Load Technology. International Journal of Data Warehousing and Mining 5, 1-27.
