Duckdb: An Embeddable Analytical Database: Mark Raasveldt Hannes Mühleisen
While for very small data sets all systems will show comparable behavior, only DuckDB will be able to continue functioning for larger ones. SQLite will begin to suffer from its row-based execution model, and MonetDBLite will begin to suffer from excessive intermediate result materialization due to its bulk processing model. While HyPer is extremely fast in processing queries, its socket client protocol will not allow it to transfer result sets as quickly as DuckDB [12].

For the “drilldown” scenario, we invite the audience to propose their own query to be configured on the benchmark computers. This will allow direct appraisal of DuckDB’s performance, without the demonstration authors being able to cherry-pick queries where DuckDB excels. Again, the audience member who proposed the query will then be able to turn the dial to increase the amount of data read by the query and observe the impact on the four systems in real time.

4 CURRENT STATE AND NEXT STEPS

As of this writing, DuckDB runs all TPC-H queries and all but two TPC-DS queries. We expect complete TPC-DS coverage by the time the demonstration is presented. DuckDB also already completes most of SQLite’s SQL logic test suite, which contains millions of queries. Immediate next steps for DuckDB are the completion of the DataBlocks storage scheme and of subquery folding, both currently in branches. A buffer manager is also not yet implemented, but will be. DuckDB already supports inter-query parallelism, and intra-query parallelism will be added as well. We plan to implement a work-stealing scheduler to balance resources between short- and long-running queries. A special consideration is to also allow balancing resource usage with the host application, a particular issue for embedded operation. As with MonetDBLite, we will implement the database APIs of R, Python, etc.

A more advanced future direction is self-checking. We have learned to distrust the hardware the database is running on. This is particularly relevant in the edge computing use case, where hardware failures are expected to be commonplace. One approach is to keep checksums on all persistent and intermediate data and piggy-back checksum verification on scan operators. This might be possible without a significant performance impact. A vectorized engine is particularly suited for this, since a chunk of data typically fits in the CPU cache and additional passes do not require RAM access. Another approach to increasing trust in the hardware is inspired by video game developers, who periodically run sanity-check computations to ensure correct operation of CPU and RAM.

Acknowledgements

We would like to thank all past, current and future contributors to DuckDB at the CWI Database Architectures Group and elsewhere. We are also particularly indebted to the TUM database group for their papers on query optimization, window functions, storage and concurrency control that we used to implement DuckDB.

REFERENCES

[1] Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR 2005, Second Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2005. 225–237. http://cidrdb.org/cidr2005/papers/P19.pdf
[2] Lukas Fittl. 2019. C library for accessing the PostgreSQL parser outside of the server environment. https://github.com//fittl/libpg_query.
[3] Richard Hipp. 2019. Database File Format. https://www.sqlite.org/fileformat.html.
[4] Richard Hipp. 2019. Most Widely Deployed and Used Database Engine. https://www.sqlite.org/mostdeployed.html.
[5] Harald Lang, Tobias Mühlbauer, Florian Funke, et al. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016. 311–326. https://doi.org/10.1145/2882903.2882925
[6] Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Stéfan van der Walt and Jarrod Millman (Eds.). 51–56.
[7] Guido Moerkotte and Thomas Neumann. 2008. Dynamic programming strikes back. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008. 539–552. https://doi.org/10.1145/1376616.1376672
[8] Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB 4, 9 (2011), 539–550. https://doi.org/10.14778/2002938.2002940
[9] Thomas Neumann and Alfons Kemper. 2015. Unnesting Arbitrary Queries. In Datenbanksysteme für Business, Technologie und Web (BTW), 16. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), 4.-6.3.2015 in Hamburg, Germany. Proceedings. 383–402. https://dl.gi.de/20.500.12116/2418
[10] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 677–689. https://doi.org/10.1145/2723372.2749436
[11] Thomas Neumann and Bernhard Radke. 2018. Adaptive Optimization of Very Large Join Queries. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). ACM, New York, NY, USA, 677–692. https://doi.org/10.1145/3183713.3183733
[12] Mark Raasveldt and Hannes Mühleisen. 2017. Don’t Hold My Data Hostage - A Case For Client Protocol Redesign. PVLDB 10, 10 (2017), 1022–1033. https://doi.org/10.14778/3115404.3115408
[13] Mark Raasveldt and Hannes Mühleisen. 2018. MonetDBLite: An Embedded Analytical Database. CoRR abs/1805.08520 (2018). arXiv:1805.08520 http://arxiv.org/abs/1805.08520
[14] Hadley Wickham, Romain François, Lionel Henry, and Kirill Müller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr R package version 0.7.8.
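
The self-checking direction described in Section 4, keeping a checksum with each stored chunk and verifying it inside the scan operator, can be illustrated with a small sketch. This is not DuckDB's implementation. The chunk size, the use of CRC-32, the in-memory layout, and the names `store_column`, `scan`, and `column_sum` are all assumptions made for illustration.

```python
import struct
import zlib
from typing import Iterator, List, Tuple

CHUNK_ROWS = 1024  # hypothetical vector size, small enough to stay in cache

Chunk = Tuple[bytes, int]  # (packed int64 values, CRC-32 of those bytes)


def store_column(values: List[int]) -> List[Chunk]:
    """Split a column into chunks and record a checksum for each one."""
    chunks = []
    for start in range(0, len(values), CHUNK_ROWS):
        part = values[start:start + CHUNK_ROWS]
        payload = struct.pack(f"<{len(part)}q", *part)
        chunks.append((payload, zlib.crc32(payload)))
    return chunks


def scan(chunks: List[Chunk]) -> Iterator[List[int]]:
    """Scan operator that verifies each chunk before passing it on."""
    for index, (payload, expected) in enumerate(chunks):
        # Verification is piggy-backed on the scan: the chunk is being
        # read anyway, so the checksum pass touches the same bytes.
        if zlib.crc32(payload) != expected:
            raise IOError(f"checksum mismatch in chunk {index}")
        yield list(struct.unpack(f"<{len(payload) // 8}q", payload))


def column_sum(chunks: List[Chunk]) -> int:
    """A consumer that aggregates one chunk at a time."""
    return sum(sum(vector) for vector in scan(chunks))
```

Because verification happens per chunk, a corrupted chunk is reported before any of its values reach the consuming operator. A real engine would presumably use a cheaper or hardware-accelerated checksum and would cover intermediate results as well.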