
SimpleETL: ETL Processing by Simple Specifications∗

Ove Andersen (Aalborg University & FlexDanmark, Denmark), Christian Thomsen (Aalborg University, Denmark), Kristian Torp (Aalborg University, Denmark)
ABSTRACT

Massive quantities of data are today collected from many sources. However, it is often labor-intensive to handle and integrate these data sources into a data warehouse. Further, the complexity is increased when specific requirements exist. One such new requirement is the right to be forgotten, where an organization upon request must delete all data about an individual. Another requirement is that facts are updated retrospectively. In this paper, we present the general framework SimpleETL, which is currently used for Extract-Transform-Load (ETL) processing in a company with such requirements. SimpleETL automatically handles all database interactions such as creating fact tables, dimensions, and foreign keys. The framework also has features for handling version management of facts and implements four different methods for handling deleted facts. The framework enables, e.g., data scientists to program complete and complex ETL solutions very efficiently with only a few lines of code, which is demonstrated with a real-world example.

1 INTRODUCTION

Data is being collected at unprecedented speed, partly due to cheaper sensor technology and inexpensive communication. Companies have realized that detailed data is valuable because it can provide up-to-date and accurate information on how the business is doing. These changes have in recent years coined terms such as "Big Data", "The five V's", and "Data Scientist". It is, however, not enough to collect data; it should also be possible for the data scientist¹ to integrate it with existing data and to analyze it.

A data warehouse is often used for storing large quantities of data, possibly integrated from many sources. A wide range of Extract-Transform-Load (ETL) tools support cleaning, structuring, and integration of data. The available ETL tools offer many advanced features, which make them very powerful but also both overwhelming and sometimes rigid in their use. It can thus be challenging for a data scientist to quickly add a new data source. Further, many of these products mainly focus on data processing and less on aspects such as database schema handling. Other important topics are the privacy and anonymity concerns of citizens, which have caused the EU (and others) to introduce regulations where citizens have a right to be forgotten [9]. Violating these regulations can lead to large penalties, and it is thus important to enable easy removal of an individual citizen's data from a data warehouse.

Figure 1: Example Case Star Schema

A simplified real-world example use case is presented by a star schema in Figure 1, where passenger travels carried out by a taxi company are stored. Each travel is a fact stored in a fact table, connected with a vehicle, a customer, and a date dimension. It is common practice that facts are deleted, e.g., if it is discovered that an ordered trip from two days ago was not executed after all, then the fact will be removed, or a fact gets updated due to late-arriving accounting information. Further, for audit reasons, it is required that changes must be tracked, e.g., if a price is updated.

The presented SimpleETL framework enables data scientists to program an ETL solution in a very efficient and convenient way with only a few lines of code, mainly consisting of specifications of metadata. The framework manages everything behind the scenes, from structuring the data warehouse schema, fact tables, dimensions, references, and indexes to data version tracking. This also includes handling of changes to facts in line with Kimball's slowly changing dimensions [12]. Processing data using SimpleETL is automatically highly parallelized such that every dimension is handled in its own process and fact table processing is spread across multiple processes.

The rest of the paper is structured as follows: First, related work is discussed in Section 2. Then a simple use case is introduced in Section 3, followed by an example implementation in Section 4 showing how a user efficiently programs an ETL flow. In Section 5, the support for fact version management and deletion of facts is described. Then, in Section 6, it is described how a data scientist configures and initializes an ETL run, including how the framework operates, along with a real-world use case example. Section 7 concludes the paper and points to directions for future work.

∗ Produces the permission block, and copyright information
¹ By "data scientist" we in this paper refer to someone focused on analyzing data and less on the technical aspects of DBMSs, e.g., ETL tools and data warehousing.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

2 RELATED WORK

A survey of ETL processes and technologies is given by [16]. A plethora of ETL tools exist from commercial vendors such as IBM, Informatica, Microsoft, Oracle, and SAP [2-5, 7]. Open source ETL tools also exist, such as Pentaho Data Integration and Talend [6, 8]. Gartner presents the widely used tools in its Magic Quadrant [10]. With most ETL tools, the user designs the ETL flow in a graphical user interface by means of connecting boxes (representing transformations or operations) with arrows (representing data flows).
Another approach is taken by the tool pygrametl [14], for which it is argued that programmatic ETL, i.e., creating ETL programs by writing code, can be beneficial. With pygrametl, the user programs Python objects for dimension and fact tables to handle insert/update operations on the target data warehouse. SimpleETL, however, hides complexity from the user and conveniently handles all schema management. Based on the specification of metadata, SimpleETL creates 1) the required SQL to generate or alter the target data warehouse schema; 2) the necessary database actions and pygrametl objects to modify the tables; and 3) processes for parallel execution. SimpleETL provides template code for its supported functionality, e.g., history tracking of changing facts. It is therefore simple and fast for a data scientist to define an ETL flow or add new sources and dimensions, because she does not have to write the code for this, but only specify the metadata.

Tomingas et al. [15] propose an approach where Apache Velocity templates and user-specified mappings are used and transformed into SQL statements. In contrast, SimpleETL is based on Python, which makes it easy for data scientists to exploit their existing knowledge and to use third-party libraries.
BIAccelerator [13] is another template-based approach for creating ETL flows with Microsoft's SSIS [4], enabling properties to be defined as parameters at runtime. Business Intelligence Markup Language (Biml) [1] is a domain-specific XML-based language to define SSIS packages (as well as SQL scripts, OLAP cube definitions, and more). The focus of BIAccelerator and Biml/BimlScript is to let the user define templates generating SSIS packages for repetitive tasks, while SimpleETL makes it easy to create and load a data warehouse based on the templating provided by the framework.

3 USE-CASE

In this section, we describe a simplified use-case scenario that serves as a running example throughout the paper and is used to explain the distinctive features of the SimpleETL framework. The simplified use case is heavily inspired by a real-world example.

In Figure 1, a star schema is presented that connects information on passenger travels with a dimension for passengers, a dimension for the vehicle carrying out the travel, and a date dimension. The data is loaded from a CSV file with all the information available on each line. Both the references and measures consist of a combination of integer values, numeric values for monetary amounts, string values, and date and time values.

Every night this set of data is exported from a source system (an accounting system) and a complete data dump is available, including all historic, earlier dumped data. The nightly dump has some distinctive characteristics, which make handling the data non-trivial. The characteristics are that the data contain duplicates of existing facts, contain updated measures of existing facts, and lack deleted facts, which must be detected. These three characteristics place some special demands on the ETL solution.

Two types of requirements exist for the functionality of the final data warehouse after the data has been processed. First, a set of business-oriented demands exists, such as tracking updates of facts, e.g., when and what corrections were made. Second, updated legislation on people's rights, e.g., the General Data Protection Regulation [9], creates new requirements for data to be deleted completely if a customer requests to be forgotten.

4 FRAMEWORK COMPONENTS

This section provides an overview of the components in SimpleETL which a user customizes to create a data warehouse and corresponding ETL process. First, a class diagram is presented that shows all components in the framework. Next, each component is described in more detail with examples from the use case in Section 3.

Figure 2: UML Class Diagram for SimpleETL

4.1 Class Diagram

Figure 2 shows the UML class diagram for the SimpleETL framework, which consists of three classes. The class Datatype is used to define the type of a column in both dimension tables and (measures) in a fact table. The parse method transforms a value (e.g., from a string to an integer) and ensures that the data type is correct (e.g., that it is a signed 32-bit integer) and that any constraints on the values in the column (e.g., that it must be positive) are satisfied. The sqltype method returns the SQL data type recognizable by a DBMS. The SimpleETL framework comes with most standard data types, e.g., 2, 4, and 8-byte integer, numeric, date, and string (varchar) types.

The class Dimension models a dimension. It is an aggregation of a number of Datatype objects. The Dimension class contains two methods, one for adding lookup attributes, add_lookupatt, and one for adding regular attributes, add_att. The combined set of lookup attributes uniquely identifies a record, which a Dimension key refers to. Regular attributes simply describe the record. The SimpleETL framework comes with a standard date and time dimension.

The class FactTable models a fact table. It is an aggregation of a number of Dimension objects and Datatype objects. Four methods are available on the class: first, a method for connecting a Dimension with the FactTable, add_dim_mapping; second, a method for adding a measure mapping, add_column_mapping; third, a method for defining how deleted rows should be handled, handle_deleted_rows; and finally a method for defining additional indexes over a set of columns, add_index. Note that the SimpleETL framework automatically adds indexes on all dimension mappings and on the lookup attribute set.
4.2 Data Type

A data type defines how a specific value is stored in the database and how a value from the data source is parsed and processed during ETL. An example of how a user can specify a data type for storing a year is shown in Listing 1.

Listing 1: Defining Year Data Type
1 from simpleetl import Datatype
2 def _validate_year(val):
3     if str(val).isnumeric() and 1900 <= int(val) <= 2099:
4         return int(val)
5     return -1
6 yeartype = Datatype('smallint', _validate_year)

The data type is defined at line 6 and named yeartype. The first parameter specifies the SQL data type, a 2-byte integer. The second parameter is a Python function, _validate_year, which both handles the diversity of the data, e.g., NULL values and conversion of string representations, and also enforces constraints like 1900 <= year <= 2099 (line 3). If the input fails to be parsed, -1 is returned (line 5).

A number of standard data types are pre-defined, e.g., SMALLINT (2-byte integer), NUMERIC(precision, scale), and VARCHAR(n), where the lengths of the two latter can be defined using arguments. Floating point data types are not supported by the SimpleETL framework since it depends on equality comparisons for version management and for determining updates/deletes, and comparing two floats can yield unpredictable results. It is encouraged to use NUMERIC(precision, scale) when decimal values are used.
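Other domain-specific types can be declared with the same two ingredients: an SQL type name and a validation function. The following sketch reuses only the Datatype constructor shown in Listing 1; the type name percenttype and the helper _validate_percent are illustrative and not part of SimpleETL.

from simpleetl import Datatype

def _validate_percent(val):
    # Accept integer-like input in the range 0-100; anything else maps to -1,
    # following the same fallback convention as _validate_year in Listing 1.
    if str(val).isnumeric() and 0 <= int(val) <= 100:
        return int(val)
    return -1

percenttype = Datatype('smallint', _validate_percent)

Because the validation function is plain Python, the same pattern can also trim strings, normalize units, or map NULLs to defaults before a value reaches the database.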
4.3 Dimension

The Dimension class describes how a single dimension table in the database is modeled. An example implementation of the vehicle dimension from Figure 1 is shown in Listing 2.

Listing 2: Defining Vehicle Dimension
1 from simpleetl import Dimension, datatypes as dt
2 def handle_model(row, namemapping):
3     row["make"] = row["make"][0:20]
4     row["model"] = row["model"][0:20]
5 vehicledim = Dimension(schema="dims", table="vehicle",
      key="vehiclekey", rowexpander=handle_model)
6 vehicledim.add_lookupatt(name="vehicleid",
      dtype=dt.varchar(20), default_value='missing')
7 vehicledim.add_att(name="make", dtype=dt.varchar(20))
8 vehicledim.add_att(name="model", dtype=dt.varchar(20))
9 vehicledim.add_att(name="vehicleyear", dtype=yeartype)

The dimension is defined in line 5, where the first and second parameters are the schema and table name, respectively. The third parameter is the name of the primary key. The fourth parameter, rowexpander, known from pygrametl [14], allows for a user-defined function, here handle_model, which is called on every row, in this case (lines 2-4) truncating make and model to 20 characters, preventing overflow of the database varchar columns, which are limited to 20 characters (lines 7-8).

When the dimension has been defined, two types of attributes can be added. The first type is mandatory and is called the lookup attribute set. In the example, a vehicle id, vehicleid, is defined as a single lookup attribute in line 6. Lookup attributes are not allowed to be NULL as these must be comparable for lookups; hence a default value is given for the vehicle id, the string 'missing'. Adding the primary key of the Dimension as a single lookup attribute makes the primary key a smart key instead of a surrogate key [12]. Smart keys can optimize the performance of dimension handling because a smart key can be computed, e.g., the date 2017-07-21 can become the smart key 20170721. The second set of attributes is optional and is called member attributes. Member attributes provide additional information for a dimension entry. Three member attributes are added in Listing 2 (lines 7-9), adding make and model attributes as varchars of size 20 and the vehicle year utilizing the yeartype data type defined in Listing 1.

4.4 Fact Table

The FactTable class defines a fact table and all aspects of this, including database schema descriptions, data processing, and data version management. A set of lookup attributes can be defined to uniquely identify a row. If the lookup attributes are set, they enforce that duplicate facts with the same set of lookup attributes cannot exist. If no lookup attributes are defined, version management cannot be enabled and duplicate facts can exist. Lookup attributes are not allowed to have NULL values. The implementation of the fact table Travels from Figure 1 is shown in Listing 3.

Listing 3: Defining Travels Fact Table
1 from simpleetl import FactTable, datatypes as dt
2 travels = FactTable(schema="facts", table="travels",
      lookupatts=["travelid"], store_history=True, key="id")
3 travels.add_dim_mapping(dimension=vehicledim, dstcol="vehiclekey")
4 travels.add_dim_mapping(dimension=datedim, dstcol="datekey")
5 travels.add_dim_mapping(dimension=customerdim, dstcol="customerkey")
6 travels.add_column_mapping(srccol="id", datatype=dt.integer, dstcol="travelid")
7 travels.add_column_mapping(srccol="price", datatype=dt.numeric(6,2), dstcol="price")
8 travels.add_index(["price"])
9 travels.handle_deleted_rows(method="mark")

In line 2, the FactTable object is instantiated, given a schema and table name as the first two parameters. The third parameter defines the lookup attributes, the fourth parameter specifies that full history should be retained, and the fifth parameter defines the primary key of the table, id. The lookupatts attribute defines that no two facts with identical travelid can exist and is used when determining new/updated/deleted facts.

The vehicle dimension defined in Listing 2 is attached as a dimension using a single line of code in line 3. In lines 4 and 5, two additional dimensions are added, one handling the date of the travel and another handling customer information, introduced in Figure 1. In lines 6 and 7, two measures are added, first the lookup attribute, id, and second the price of a travel, implemented as a numeric data type.

The framework automatically creates primary keys, foreign keys, and indexes, including a unique index on the lookup attributes and the primary key. It is possible for the user to add additional indexes (line 8). In line 9 it is defined that when a row is determined to have been deleted from the data source, the row should be marked in the table as having been removed (method D4 from Section 5.2), thus keeping the fact in the data warehouse.

Overall, SimpleETL is designed to optimize productivity, ensure consistency, reduce programming errors, and help the data scientist in loading and activating data for analysis. This is realized by reuse of data types and dimensions, as shown in the code examples, and by keeping the number of methods and parameters to a minimum.
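To further illustrate the smart-key concept from Section 4.3 and the reuse of data types, the sketch below declares a hypothetical user-defined date dimension whose primary key doubles as its single lookup attribute, so the key 20170721 can be computed directly from the date 2017-07-21. It only uses the constructor and methods shown in Listing 2; the table and column names and the reuse of yeartype from Listing 1 are illustrative, and SimpleETL already ships with a standard date and time dimension, so this is a sketch rather than required user code.

from simpleetl import Dimension, datatypes as dt

datedim = Dimension(schema="dims", table="date", key="datekey")
# Making the primary key the single lookup attribute turns it into a
# smart key (e.g., 20170721) instead of a surrogate key.
datedim.add_lookupatt(name="datekey", dtype=dt.integer, default_value=-1)
datedim.add_att(name="year", dtype=yeartype)  # yeartype as defined in Listing 1
datedim.add_att(name="month", dtype=dt.integer)
datedim.add_att(name="day", dtype=dt.integer)

Because such a key can be computed from the source value, a dimension worker can return it immediately without a database lookup (cf. Section 6.3).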
5 MODIFICATIONS OF FACTS

In some applications it is a business requirement that facts can be updated and that full history is maintained to enable tracking of changes to facts. Simultaneously, it is common practice to remove data if it is no longer valid, e.g., if a passenger travel was not carried out, it is later deleted from the accounting system. Another motivation for deleting data is legal demands such as the concept called the right to be forgotten [9]. This section shows how these requirements are handled automatically by the SimpleETL framework.

5.1 Slowly Changing Fact Tables

To handle updates of facts we introduce the slowly changing fact table. When a user enables version tracking of facts (store_history=True in Listing 3, line 2), a second fact table is created.

The main fact table, illustrated in Table 1, acts similarly to a type-1 slowly changing dimension such that facts get updated (overwritten) when changes are detected in the source data. In these examples the type-1 fact table consists of an id, a travelid, shortened tid, and a price. This table is referred to as the type-1 fact table in the rest of the paper.
The second table, illustrated in Table 2, acts in a similar way as a type-2 version-managed slowly changing dimension, where version management of data is tracked using four additional columns. A pair of columns _validfrom and _validto, shortened _vfrom and _vto, stores the validity period of a fact using 32-bit Unix timestamps, t1 through t3. A version number, _ver, keeps track of fact changes, and a column, _fact_id, shortened _fid, references the primary key of the type-1 fact table, bridging the type-1 and the version-managed fact tables together, e.g., for tracing historic changes of facts in the type-1 fact table. This table is referred to as the version-managed fact table in the rest of the paper.

We now illustrate what happens when a data set is loaded by the SimpleETL framework. Table 1 and Table 2 show the type-1 and the version-managed fact tables with two rows of data loaded. The _vfrom is set to t1 and the _vto defaults to -1 while a fact is still live.

Table 1: T1 Facts
id  tid  price
1   100  40
2   109  25

Table 2: Version Managed Fact Table
id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1
2   109  25     t1      -1    1     2

When an update happens at the data source, it is propagated to SimpleETL at the next ETL batch run. For example, if the price for tid=109 is updated from 25 to 35, the measure of the type-1 fact table is overwritten, as shown in Table 3, while in the version-managed fact table, Table 4, the _vto is set for id=2 and a new version of the fact is inserted with id=3.

Table 3: Upd. T1
id  tid  price
1   100  40
2   109  35

Table 4: Updated Ver. Managed Facts
id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1
2   109  25     t1      t2    1     2
3   109  35     t2      -1    2     2

The advantage of this two-table approach is that despite many updates the type-1 fact table does not grow in size. The downside is increased storage cost from representing facts in both tables.
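The version-managed table can be queried directly when the history of a fact is needed. The following sketch lists every recorded price of one travel together with its validity interval; it assumes psycopg2 and a hypothetical table name facts.travels_historic for the version-managed table, since the paper does not prescribe its name.

import psycopg2

conn = psycopg2.connect("dbname=dw")  # illustrative connection string
with conn.cursor() as cur:
    cur.execute("""
        SELECT _ver, price, _validfrom, _validto
        FROM facts.travels_historic  -- hypothetical version-managed table name
        WHERE travelid = %s
        ORDER BY _ver""", (109,))
    for version, price, valid_from, valid_to in cur.fetchall():
        # _validto = -1 marks the currently valid version of the fact
        print(version, price, valid_from, valid_to)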
5.2 Deleting Facts

The motivation for deleting facts can be to reflect production, e.g., if a passenger travel was not carried out, it is deleted in hindsight. Second, legal demands, such as the right to be forgotten [9], can require data about individuals to be deleted.

The SimpleETL framework enables the user to choose between four methods for handling deleted data. These are described using Table 3 and Table 4 as the outset. The fact with tid=109 is deleted.

The first method, D1, ignores that facts are deleted at the source system, i.e., if the fact with tid=109 is deleted it will still persist in the data warehouse, as in Table 3 and Table 4. This method enables keeping facts regardless of what happens at the data source and is useful if facts cannot be altered or data is loaded incrementally.

The second method, D2, completely deletes facts from the data warehouse if they are removed at the source system. Table 5 shows the type-1 fact table and Table 7 shows the version-managed fact table after the fact with tid=109 has been deleted. This method is useful if facts must be enforced to be removed, e.g., due to legal reasons: when data is removed at the data source it will automatically be removed from the fact tables too.

Table 5: Del. T1 using D2/D3
id  tid  price
1   100  40

Table 7: Deleted Version Managed Facts using D2
id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1

The third method, D3, removes the fact in the type-1 fact table, like method D2 shown in Table 5, while in the version-managed fact table the deleted fact is marked with a time stamp, _vto=t3, as shown in Table 8. This method is useful if the type-1 fact table must mirror the source system, while deleted data must be tracked.

The fourth method, D4, adds an extra attribute, _deleted, shortened _del, to both fact tables, with default value -1. When a fact is removed, the _del measure is set to the relevant time stamp for the fact in both the type-1 and the version-managed fact tables, Table 6 and Table 8 respectively. This method is useful if easy filtering of deleted facts is required, e.g., for bookkeeping on the type-1 fact table.

Table 6: Deleted T1 using D4
id  tid  price  _del
1   100  40     -1
2   109  35     t3

Table 8: Deleted Version Managed Facts using D3 and D4
id  tid  price  _vfrom  _vto  _ver  _fid  _del (D4 only)
1   100  40     t1      -1    1     1     -1
2   109  25     t1      t2    1     2     t3
3   109  35     t2      t3    2     2     t3

Having four different methods for handling deleted facts makes the SimpleETL framework very versatile and matches most business and legal needs with respect to the balance between preserving data and privacy regulations.

6 DATA AND PROCESS FLOW

This section first introduces how the ETL process is configured and initiated; then the process flow implementation is visualized in Figure 3, separating the process flow into three stages, Initialization (1.1-1.4 in Figure 3), Processing (2.1-2.5), and Data Migration (3.1-3.6). White boxes in Figure 3 indicate steps processed sequentially while gray boxes indicate parallel execution.

Facts are first loaded from a data source to a data staging area, and dimensional integrity is maintained with all related dimensions. Next, the data is moved from the data staging area to the fact tables in three steps, first migrating updated data, then porting new data, and finally handling deleted data, according to the user specifications in Section 5. Finally, a real-world use case is presented along with an implementation and runtime statistics.

6.1 Configuration

The SimpleETL framework supports that data is loaded from multiple data sources. Each data source is defined using a data feeder, which is a user-defined Python function that yields key/value Python dictionaries of data for every fact, e.g., one dictionary for each row in a CSV file. These dictionaries are consumed by the ETL process described in Section 6.3. The data-feeder functions are not an integrated part of the SimpleETL framework, which allows the user to load data from various sources, e.g., CSV, ODBC, or REST APIs, only requiring that they can present a fact as a Python dictionary.

When the data warehouse structure, using the components from Section 4, and a data source are defined, the ETL process can be configured and initiated. All functionality related to database schema management and data management is handled automatically. When the ETL process has completed, the data is available in the data warehouse for querying. The ETL process is started as shown in Listing 4. In line 11, a file is prepared for loading, using Python's CSV-to-dictionary function. The ETL process is started in line 12, where the FactTable and CSV file are given as input. Listing 4 also shows how two optional functions are used to customize the ETL process. The argument filterfunc=dupfilter defines a function for filtering rows before data is distributed to parallel workers, and the argument transformfunc=parsevehicle defines a function distributed to all background worker processes.

Listing 4: Processing SimpleETL
1 prev_id = None
2 def dupfilter(row):
3     global prev_id
4     if prev_id == row["id"]:
5         return False  # Ignore duplicate "id" values
6     prev_id = row["id"]
7     return True
8 def parsevehicle(row, dbcon):
9     # Split mk_mdl into two variables
10    row["make"], row["model"] = row["mk_mdl"].split("|")
11 csvfile = csv.DictReader(open("/path/to/file"))
12 processETL(facttable=fact, datafeeder=csvfile,
      filterfunc=dupfilter, transformfunc=parsevehicle,
      [database connection details])
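Any iterable of dictionaries can play the role of the datafeeder in Listing 4. As a sketch of a non-CSV source, the generator below pages through a hypothetical REST endpoint; the URL and field names are invented for illustration and are not part of SimpleETL.

import requests

def rest_feeder(url="https://example.org/api/travels"):  # illustrative endpoint
    page = 1
    while True:
        batch = requests.get(url, params={"page": page}).json()
        if not batch:
            return
        for record in batch:
            # Each yielded dict describes one fact, just like a csv.DictReader row.
            yield {"id": record["id"], "price": record["price"],
                   "vehicleid": record["vehicleid"], "mk_mdl": record["mk_mdl"]}
        page += 1

The generator could then be passed as datafeeder=rest_feeder() instead of the csv.DictReader object.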
We have now shown all the code that the user needs to implement in various Python functions to use the SimpleETL framework. In the following subsections, it is described what is done internally in the framework to build the data warehouse schema and efficiently load the data.

Figure 3: Main Execution Flow of SimpleETL

6.2 Initialization

Before starting the ETL data processing, SimpleETL initializes database connections and validates the FactTable object (processETL, (1.1) in Figure 3). Schema, constraints, and indexes are created and verified for all attached dimensions (1.2) and the fact tables (1.3). A temporary data staging table is initialized for later handling of updated and deleted facts (1.4).

6.3 Processing

The main ETL process extracts data from the data source given by the datafeeder argument, Figure 3 (2.1). A filterfunc, introduced in Section 6.1, can be applied for filtering data (2.2). Then data is distributed to the background workers (2.3) in batches of 1000 rows (a user-configurable size). Background fact workers (2.4) read from and write to the dimensions (2.5), and when all data has been processed, the fact and dimension workers commit data to the data warehouse dimensions and the data staging table.

Dimension and fact handling are separated from the main process into parallel background workers for performance reasons. The background workers, (2.4) and (2.5) in Figure 3, are implemented using Python's multiprocessing.Process, and communication is handled through Inter-Process Communication (IPC) queues. Several caching layers, using Python's functools.lru_cache, reduce the IPC and dimension database communication.

Parallel Fact Workers. The parallel fact workers, (2.4) in Figure 3, process rows distributed in batches from (2.3). If the parameter transformfunc is provided (Section 6.1), it is executed first. Such a function can contain advanced user-defined transformations. Second, all dimension referencing is handled using the dimension workers (2.5). Then each measure is processed and finally the data is inserted into a data staging table. n parallel fact workers are spawned, where n equals the number of CPU cores available to the framework.

Decoupled Dimension Workers. Each dimension is handled in its own separate process (2.5), i.e., having three attached dimensions will run three separate processes. Utilizing the same dimension more than once will only spawn one instance, e.g., utilizing a date dimension three times will only use one parallel worker process. If the dimension key is a smart key, see Section 4.3, the smart key can immediately be returned from the dimension worker, while surrogate keys must be coordinated with the dimension table, potentially with database lookups. m parallel dimension workers are spawned, where m is the number of distinct dimensions attached to a FactTable, see Section 4.4.

6.4 Data Migration

The data migration is split into three steps for handling updated facts, new facts, and deleted facts. The main driver for determining updates, new data, and deleted data is the lookup attributes, see Section 4.4, which uniquely define a fact and whose values are mandatory (not NULL). Lookup attributes can be both fact measures and dimension-referencing keys. If the lookup attribute set is not defined, then no updating, deletion, or version management can be performed and all data will be appended.

Migrating Updated Facts. Updated facts are defined as facts where the set of lookup attributes already exists in the existing fact tables and where at least one of the measures has changed. This is handled by (3.1) and (3.2) in Figure 3, and the type-1 and version-managed tables are processed in parallel, as handling updates does not change relationships between these two tables.

Migrating New Facts. New facts are facts whose set of lookup attributes does not exist in the type-1 and version-managed fact tables. This is handled in (3.3) and (3.4) in Figure 3, where data is first migrated to the type-1 fact table and next to the version-managed fact table. This sequential step is necessary as the version-managed fact table needs the id of the type-1 fact table for referencing it. This step also ensures that no duplicate sets of lookup attributes are loaded, if the lookup attribute set of the FactTable is defined.

Migrating Deleted Facts. If migration of deleted facts is enabled, it is determined which facts exist in the type-1 and version-managed fact tables while they do not exist in the staging table. How such facts are handled, when removed at the data source, depends on the method chosen among those described in Section 5.2.
This migration of deleted facts is handled in (3.5) and (3.6) in Figure 3.
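Conceptually, the classification of staged facts into new, updated, and deleted sets is driven purely by the lookup-attribute values. The framework performs this classification set-based against the staging table in the database; the following standalone Python sketch (illustrative only, not SimpleETL code) spells out the same logic with the lookup-attribute values as dictionary keys.

def classify(staging, existing, measures):
    # staging and existing map a tuple of lookup-attribute values to the
    # remaining measure values of a fact.
    new = [key for key in staging if key not in existing]
    deleted = [key for key in existing if key not in staging]
    updated = [key for key in staging if key in existing
               and any(staging[key][m] != existing[key][m] for m in measures)]
    return new, updated, deleted

# The travels fact with travelid as the single lookup attribute:
staging = {(109,): {"price": 35}}
existing = {(100,): {"price": 40}, (109,): {"price": 25}}
print(classify(staging, existing, ["price"]))  # -> ([], [(109,)], [(100,)])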
6.5 Real-World Use

SimpleETL is designed to be a convenient and easy tool for data scientists to quickly load their data and start working with it. To show that SimpleETL also performs well, a real-world use case has been implemented. One fact table is configured with version tracking enabled and deleted facts propagated by method D3 from Section 5.2. The fact table consists of 153 columns, including 1 primary key, 41 foreign keys to 18 dimensions, and 111 measures. An index is automatically generated covering the lookup attributes and the primary key, and 41 indexes are automatically generated on all the foreign keys. The data contains information on passenger travels from a fleet system. 1.2 million rows are available in a 1.67 GB CSV data file and each row has 147 columns. The final sizes of the type-1 and version-managed fact tables are 732 and 882 MB of data and 1193 and 1422 MB of indexes, respectively.

The initial data load takes 34 minutes, including creating the schema, while an incremental batch providing 17,678 updated, 16,381 new, and 3 deleted facts is performed in 8 minutes on a single Ubuntu Linux server running PostgreSQL 9.6 with 16 GB of RAM and a 6-core Intel Xeon E5-2695V3 CPU clocked at 2.3 GHz. The SimpleETL framework and the PostgreSQL DBMS both run on the same host.

The performance of SimpleETL scales with the number of CPUs, and a large part of the execution time is spent in underlying DBMS transactions. A different DBMS or configuration will yield other performance results.
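The 153-column specification does not have to be written out call by call; because the specification is ordinary Python, column and dimension mappings can be added in loops over metadata lists. The fragment below is purely illustrative (table name, column names, and types are invented) and only uses the FactTable methods shown in Listing 3.

from simpleetl import FactTable, datatypes as dt

bigfact = FactTable(schema="facts", table="fleet_travels",
                    lookupatts=["travelid"], store_history=True, key="id")

# Hypothetical metadata describing a few of the many measures; the real
# specification would simply list more (srccol, datatype, dstcol) entries.
measure_spec = [
    ("price", dt.numeric(6, 2), "price"),
    ("distance_km", dt.numeric(8, 3), "distance_km"),
    ("waiting_minutes", dt.integer, "waiting_minutes"),
]
for srccol, datatype, dstcol in measure_spec:
    bigfact.add_column_mapping(srccol=srccol, datatype=datatype, dstcol=dstcol)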

7 CONCLUSION

This paper presents the SimpleETL framework that enables simple and efficient programming of ETL for data warehouse solutions without the user needing database management or ETL experience. This makes the framework particularly well suited for data scientists because they can quickly integrate and explore new data sources.

The framework enables advanced fact handling such as handling slowly changing facts using version management and enables the users to decide how deleted facts should be handled. Four different methods for handling deleted facts are presented.

The framework is simple and contains only three classes for data types, dimensions, and fact tables, respectively. Each class has two to four methods. The ETL process is directed by metadata specifications and the framework handles everything else, including version management and tracking of deleted facts. The entire internal process flow extensively utilizes parallelization and IPC for processing facts, and every dimension is handled in a separate process.

The main contribution of SimpleETL is to provide a convenient and simple ETL framework for data scientists. Despite this, a performance benchmark using a real-world data scenario, where facts are inserted, updated, and deleted, shows that the framework is lightweight and that executing ETL batches while maintaining versioned data and deletions is performed efficiently.

There are a number of relevant directions for future work, including automatic table partitioning to handle very large data sets. Snowflake dimension support is another commonly used technique from data warehousing, which would be relevant to support in the SimpleETL framework.

REFERENCES
[1] BimlScript. http://www.bimlscript.com/. Accessed 2017-10-24.
[2] IBM InfoSphere DataStage. https://www.ibm.com/ms-en/marketplace/datastage. Accessed 2017-10-13.
[3] Informatica. https://www.informatica.com/. Accessed 2017-10-13.
[4] Microsoft SQL Server Integration Services. https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services. Accessed 2017-10-13.
[5] Oracle Data Integrator. http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html. Accessed 2017-10-13.
[6] Pentaho Data Integration - Kettle. http://kettle.pentaho.org. Accessed 2017-10-13.
[7] SAP Data Services. https://www.sap.com/products/data-services.html. Accessed 2017-10-13.
[8] Talend. https://www.talend.com/products/big-data/. Accessed 2017-10-24.
[9] EU Regulation 2016/679: General Data Protection Regulation. Official Journal of the European Union L119 (2016), 1-88. http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=OJ:L:2016:119:TOC
[10] Mark A. Beyer, Eric Thoo, Mei Yang Selvage, and Ethisham Zaidi. 2017. Gartner Magic Quadrant for Data Integration Tools.
[11] Scott Curie. What is Biml. http://www.bimlscript.com/walkthrough/Details/3105. Accessed 2017-10-24.
[12] Ralph Kimball and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons.
[13] Reinhard Stumptner, Bernhard Freudenthaler, and Markus Krenn. 2012. BIAccelerator - A Template-Based Approach for Rapid ETL Development. Springer Berlin Heidelberg, 435-444.
[14] Christian Thomsen and Torben Bach Pedersen. 2009. pygrametl: A Powerful Programming Framework for Extract-Transform-Load Programmers. In DOLAP, Il-Yeol Song and Esteban Zimányi (Eds.). ACM, 49-56.
[15] Kalle Tomingas, Margus Kliimask, and Tanel Tammet. 2014. Mappings, Rules and Patterns in Template Based ETL Construction. In The 11th International Baltic DB & IS 2014 Conference.
[16] Panos Vassiliadis. 2009. A Survey of Extract-Transform-Load Technology. International Journal of Data Warehousing and Mining 5, 1-27.
