SimpleETL: ETL Processing by Simple Specifications
id  tid  price
1   100  40
2   109  25

id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1
2   109  25     t1      -1    1     2

id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1
Table 3: Updated T1
id  tid  price
1   100  40
2   109  35

Table 4: Updated Version Managed Facts
id  tid  price  _vfrom  _vto  _ver  _fid
1   100  40     t1      -1    1     1
2   109  25     t1      t2    1     2
3   109  35     t2      -1    2     2

Table 8: Deleted Version Managed Facts using D3 and D4
id  tid  price  _vfrom  _vto  _ver  _fid  [D4 _del]
1   100  40     t1      -1    1     1     -1
2   109  25     t1      t2    1     2     t3
3   109  35     t2      t3    2     2     t3
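The step from the initial version-managed facts to Table 4 can be replayed with a minimal Python sketch. It is an illustration under stated assumptions, not SimpleETL's implementation: the function name apply_batch and the in-memory list of dicts are hypothetical, and the sketch assumes that _fid records the source id of a fact and that -1 in _vto marks the currently valid version.

```python
OPEN = -1  # assumed marker for "still valid", as in the _vto column


def apply_batch(facts, batch, now):
    """Merge a source batch into version-managed facts (hypothetical helper).

    facts: list of dicts with id, tid, price, _vfrom, _vto, _ver, _fid
    batch: list of source rows with id, tid, price
    now:   label of the load time, e.g. "t2"
    """
    next_id = max((f["id"] for f in facts), default=0) + 1
    # Open (currently valid) version of each fact, keyed by source id.
    current = {f["_fid"]: f for f in facts if f["_vto"] == OPEN}
    for row in batch:
        old = current.get(row["id"])
        if old is not None and (old["tid"], old["price"]) == (row["tid"], row["price"]):
            continue  # unchanged fact: keep the open version as it is
        if old is not None:
            old["_vto"] = now  # close the previous version
        new = {
            "id": next_id,
            "tid": row["tid"],
            "price": row["price"],
            "_vfrom": now,
            "_vto": OPEN,
            "_ver": old["_ver"] + 1 if old is not None else 1,
            "_fid": row["id"],
        }
        facts.append(new)
        current[row["id"]] = new
        next_id += 1
    return facts


# Initial load at t1, then the updated source rows of Table 3 at t2.
facts = apply_batch([], [{"id": 1, "tid": 100, "price": 40},
                         {"id": 2, "tid": 109, "price": 25}], "t1")
facts = apply_batch(facts, [{"id": 1, "tid": 100, "price": 40},
                            {"id": 2, "tid": 109, "price": 35}], "t2")
```

After the second call, facts holds the three rows of Table 4: the unchanged fact, the closed first version of fact 2, and its new open version with _ver 2.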
7 CONCLUSION
This paper presents the SimpleETL framework, which enables simple
and efficient programming of ETL for data warehouse solutions
without requiring the user to have database management or ETL
experience. This makes the framework particularly well suited for
data scientists because they can quickly integrate and explore
new data sources.
The framework supports advanced fact handling, such as handling
slowly changing facts through version management, and lets users
decide how deleted facts should be handled. Four different methods
for handling deleted facts are presented.
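As an illustration of deleted-fact handling, the following Python sketch reproduces the outcome shown in Table 8 for methods D3 and D4. It reflects one reading of that table, not the framework's code: D3 is taken to close the open version of a deleted fact by setting _vto, and D4 is taken to stamp every version of the deleted fact in an extra _del column. The function name handle_deletes and its parameters are hypothetical, and the other two methods are not sketched.

```python
OPEN = -1  # assumed marker for "still valid" / "not deleted"


def handle_deletes(facts, source_ids, now, d3=True, d4=True):
    """Mark facts whose source id is no longer present (hypothetical helper)."""
    for f in facts:
        deleted = f["_fid"] not in source_ids
        if d3 and deleted and f["_vto"] == OPEN:
            f["_vto"] = now  # D3 (as read from Table 8): close the open version
        if d4:
            f.setdefault("_del", OPEN)  # D4: extra column, -1 while not deleted
            if deleted and f["_del"] == OPEN:
                f["_del"] = now  # stamp every version of the deleted fact
    return facts


# State of Table 4, followed by a load at t3 in which source id 2 is gone.
facts = [
    {"id": 1, "tid": 100, "price": 40, "_vfrom": "t1", "_vto": OPEN, "_ver": 1, "_fid": 1},
    {"id": 2, "tid": 109, "price": 25, "_vfrom": "t1", "_vto": "t2", "_ver": 1, "_fid": 2},
    {"id": 3, "tid": 109, "price": 35, "_vfrom": "t2", "_vto": OPEN, "_ver": 2, "_fid": 2},
]
facts = handle_deletes(facts, source_ids={1}, now="t3")
```

The design keeps history intact: no rows are removed, so earlier versions of a deleted fact remain queryable.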
The framework is simple and contains only three classes, for
data types, dimensions, and fact tables, respectively. Each class
has two to four methods. The ETL process is directed by metadata
specifications, and the framework handles everything else,
including version management and tracking of deleted facts.
Internally, the process flow makes extensive use of parallelization
and inter-process communication (IPC) when processing facts, and
every dimension is handled in a separate process.
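The idea of one process per dimension, fed through IPC, can be sketched with Python's standard multiprocessing module. This is a generic illustration and makes no claim about SimpleETL's internals: the names dimension_worker and load_dimensions, the in-memory key assignment, and the example dimension names are all hypothetical. The sketch requests the "fork" start method, so it assumes a POSIX system.

```python
import multiprocessing as mp


def dimension_worker(name, in_q, out_q):
    """Assign surrogate keys to the values of one dimension (hypothetical)."""
    keys = {}
    for value in iter(in_q.get, None):  # None is the end-of-input sentinel
        keys.setdefault(value, len(keys) + 1)
    out_q.put((name, keys))


def load_dimensions(rows, dim_names):
    """Run one worker process per dimension and collect their key maps."""
    ctx = mp.get_context("fork")  # assumption: POSIX; not available on Windows
    out_q = ctx.Queue()
    workers = {}
    for name in dim_names:
        in_q = ctx.Queue()
        proc = ctx.Process(target=dimension_worker, args=(name, in_q, out_q))
        proc.start()
        workers[name] = (proc, in_q)
    # Stream each row's dimension values to the matching worker.
    for row in rows:
        for name, (_, in_q) in workers.items():
            in_q.put(row[name])
    for _, in_q in workers.values():
        in_q.put(None)
    # Collect results before joining so no worker blocks on a full pipe.
    results = dict(out_q.get() for _ in workers)
    for proc, _ in workers.values():
        proc.join()
    return results


if __name__ == "__main__":
    rows = [{"tid": 100, "shop": "a"}, {"tid": 109, "shop": "a"}]
    print(load_dimensions(rows, ["tid", "shop"]))
```

Because each dimension has its own input queue and a single producer, the order of values within a dimension is preserved, which keeps the key assignment deterministic.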
The main contribution of SimpleETL is to provide a convenient
and simple ETL framework for data scientists. Despite this focus
on simplicity, performance benchmarks on a real-world data
scenario in which facts are inserted, updated, and deleted show
that the framework is lightweight and that executing ETL batches
and maintaining versioned data and deletions are performed
efficiently.
There are a number of relevant directions for future work,
including automatic table partitioning to handle very large data
sets. Snowflaked dimensions are another commonly used data
warehousing technique that would be relevant to support in the
SimpleETL framework.