
Migrate from SQLite to DuckDB #1

Open
Copilot wants to merge 28 commits into master from copilot/move-code-to-duckdb

Conversation


Copilot AI commented Nov 28, 2025

  • Migrate from SQLite to DuckDB
  • Optimize database schema and queries for DuckDB
    • Merge separate date and time TEXT columns into single time TIMESTAMP column
    • Update all sensor unpacking functions to preserve full timestamps
    • Update indexes to reference time column instead of date
    • Remove unused save2db() function (replaced with dplyr::rows_insert)
    • Update coverage.R to work with TIMESTAMP columns
    • Update sensor_functions.R arrange() calls
    • Update all test expectations to use combined timestamp format

This change leverages DuckDB's native TIMESTAMP support for better performance and simpler timestamp handling, eliminating the need to split and rejoin timestamps.

Original prompt

This package is written entirely around SQLite. However, one limitation is that timezones are not supported, and SQLite is not as feature-rich as other, more modern databases. That's why I would like to convert to DuckDB. Fortunately, many functions in this package are relatively database-agnostic, as they are powered by dbplyr and DBI. Still, several functions are deeply intertwined with SQLite, not to mention the database schema itself. Create a new branch duckdb, and start working on moving the code to duckdb. At the very least, all existing unit tests should pass.



Copilot AI and others added 5 commits November 28, 2025 15:02
Co-authored-by: koenniem <66026195+koenniem@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Convert package from SQLite to DuckDB" to "Migrate from SQLite to DuckDB" on Nov 28, 2025
Copilot AI requested a review from koenniem November 28, 2025 15:14
@koenniem (Owner)

@copilot You should now have network access to CRAN. Can you rerun the tests again and make sure it actually works?

…nstalled

Co-authored-by: koenniem <66026195+koenniem@users.noreply.github.com>
Copilot AI (Author) commented Nov 28, 2025

@copilot You should now have network access to CRAN. Can you rerun the tests again and make sure it actually works?

I ran the tests and found issues with path resolution when the package isn't installed. Fixed in commit c20c974 - create_db(), index_db(), and the test helper now fall back to local inst/ folders when system.file() returns empty. Tests still need additional debugging for DuckDB-specific SQL syntax issues that surfaced during testing.
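
A minimal sketch of such a fallback (the helper name, package name, and file path here are hypothetical, for illustration only):

```r
# Hypothetical helper illustrating the fallback: system.file() returns ""
# when the package is not installed, so fall back to the local inst/ tree.
find_dbdef <- function() {
  path <- system.file("extdata", "dbdef.sql", package = "mypackage")
  if (!nzchar(path)) {
    path <- file.path("inst", "extdata", "dbdef.sql")  # development layout
  }
  path
}
```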

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • bioconductor.org
    • Triggering command: /usr/lib/R/bin/exec/R /usr/lib/R/bin/exec/R --no-save --no-restore --no-echo /usr/lib/R/bin/R --no-restore --no-echo --args nextArg-lnextArg/home/REDACTED/R/l-c (dns block)
    • Triggering command: /usr/lib/R/bin/exec/R /usr/lib/R/bin/exec/R --no-save --no-restore --no-echo sh -c -- gcc --version (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@koenniem (Owner)

@copilot You can adjust the tests so that they are suited for DuckDB. Make sure any changes are due only to the database, and not to other behaviour.

…eserved keywords

Co-authored-by: koenniem <66026195+koenniem@users.noreply.github.com>
Copilot AI (Author) commented Nov 28, 2025

@copilot You can adjust the tests so that they are suited for DuckDB. Make sure any changes are due only to the database, and not to other behaviour.

Fixed in commit e8107b9. The main DuckDB-specific schema changes required were:

  1. Removed ON DELETE CASCADE from all foreign key constraints (not supported by DuckDB)
  2. Quoted reserved keywords start and end as column names in AppUsage, Calendar, and Keyboard tables

Database creation now works correctly with DuckDB.
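
A rough sketch of change 2, with the table and column types simplified (the real definitions live in dbdef.sql):

```r
# Simplified illustration: DuckDB treats start and end as reserved keywords,
# so they must be double-quoted when used as column names.
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, '
  CREATE TABLE AppUsage (
    "start" TEXT,  -- quoted: reserved keyword in DuckDB
    "end"   TEXT   -- quoted: reserved keyword in DuckDB
  )
')
dbDisconnect(con)
```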

@koenniem koenniem marked this pull request as ready for review December 1, 2025 11:37
koenniem (Owner) commented Dec 1, 2025

@copilot Here are a couple of feedback points:

  • create_db() still creates an RSQLite database, not a duckdb()
  • The shutdown argument is no longer necessary in dbDisconnect for duckdb databases, as it is unused and databases are shut down automatically.
  • It seems that the test for open_db() fails because another database is opened while the other is still closing.
  • An expectation in the tests for add_study(), add_participant(), and add_processed_file() fails when the input consists of only NULLs. I'm not sure whether this test is strictly necessary, but I suspect it fails because the primary key is then also NULL. We should remove NULL inserts where the primary key is NULL or missing.
  • get_nrows() errors because the output from vapply() is now a double, not integer.

Please revise your PR. Make sure all tests and workflows pass.

@koenniem koenniem added the enhancement New feature or request label Dec 1, 2025
…LL checks

Co-authored-by: koenniem <66026195+koenniem@users.noreply.github.com>
Copilot AI (Author) commented Dec 1, 2025

@copilot Here are a couple of feedback points:

  • create_db() still creates an RSQLite database, not a duckdb()
    ...

Fixed in commit 197b1d1:

  1. Removed shutdown = TRUE from all dbDisconnect calls (DuckDB handles shutdown automatically)
  2. Added NULL validation in add_study, add_participant, and add_processed_files to prevent NULL primary key inserts
  3. Fixed get_nrows to return numeric instead of integer (DuckDB count returns doubles)
  4. Added gc() calls in open_db test to ensure file handles are released before reopening

Note: create_db() already uses duckdb::duckdb() on line 101 - it was updated in the initial migration.
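
The get_nrows change might look roughly like this (the real function's signature may differ):

```r
# DuckDB returns COUNT(*) as a double in R, so FUN.VALUE must be numeric(1);
# declaring integer(1) would make vapply() error on the type mismatch.
get_nrows <- function(db, tables) {
  vapply(tables, function(table) {
    DBI::dbGetQuery(db, paste0("SELECT COUNT(*) AS n FROM ", table))$n
  }, FUN.VALUE = numeric(1))
}
```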

koenniem (Owner) commented Dec 1, 2025

@copilot - Your solutions for the NULL input in add_study(), add_participant(), and add_processed_files() are short-sighted. These functions are vectorised, so if multiple values are passed to the function, we should remove only the offending input pairs. Right now, you only check whether the entire vector is NULL.

  • save2db() fails because the list to the params argument must be unnamed. We can use params = unname(as.list(data)).
  • gc() seems to work
  • BLOB as data type for longitude and latitude is incorrect in dbdef.sql. BLOB in duckdb is binary data, not "any type of data" like SQLite. I would recommend TEXT for now. This does mean that tests in test-legacy-data will fail because the expected type no longer matches.
  • An expectation in the tests for save2db() is broken. Specifically, we test whether the size of the database has increased after we added data, but this assumption does not hold for duckdb(). We need some other method to check whether the data was actually added.
  • The test for get_nrows() fails because we test against a double() vector, but the output is a numeric, typically integer. We could just use a numeric in that test as well.
  • Some tests in test-legacy-data fail because of small differences in decimal values. We could increase the tolerance in expect_equal in the db_test function.
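
The vectorised cleanup suggested in the first point could be sketched as follows (the function body and column names are illustrative, not the package's actual code):

```r
# Illustrative only: drop element-wise input pairs whose primary key is
# missing, rather than only checking whether the entire vector is NULL.
add_participant <- function(db, p_id, study_id) {
  keep <- !is.na(p_id)          # keep only pairs with a valid primary key
  p_id <- p_id[keep]
  study_id <- study_id[keep]
  if (length(p_id) == 0) {
    return(invisible(FALSE))    # nothing valid left to insert
  }
  dplyr::rows_insert(
    dplyr::tbl(db, "Participant"),
    data.frame(participant_id = p_id, study_id = study_id),
    by = "participant_id",
    copy = TRUE,
    conflict = "ignore",
    in_place = TRUE
  )
}
```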

Co-authored-by: koenniem <66026195+koenniem@users.noreply.github.com>
koenniem (Owner) commented Dec 2, 2025

@copilot Looks like you hit an error. Can you try again and not error this time?

…LOB to TEXT, and test updates

Co-authored-by: koenniem <66026195+koenniem@users.noreply.github.com>
Copilot AI (Author) commented Dec 2, 2025

@copilot Looks like you hit an error. Can you try again and not error this time?

Fixed in commits 02143a5 and 8d82890:

  1. Vectorized NULL handling in add_study, add_participant, add_processed_files - now filters out NULL/NA values from vectors
  2. save2db now uses unname(as.list(data)) for params
  3. Changed BLOB to TEXT for latitude/longitude columns in dbdef.sql
  4. Updated save2db test to check data presence instead of file size
  5. Fixed get_nrows test to expect numeric instead of integer
  6. Increased tolerance to 1e-6 in db_test function for decimal precision differences
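
Point 4 could be tested along these lines (a minimal sketch with a made-up table, not the actual test file):

```r
library(DBI)
library(testthat)

test_that("inserted rows are present", {
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con))
  dbExecute(con, "CREATE TABLE t (x INTEGER)")
  dbExecute(con, "INSERT INTO t VALUES (1), (2)")
  # Check row counts instead of file size, which DuckDB may not
  # grow immediately after an insert.
  n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM t")$n
  expect_equal(n, 2)
})
```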

@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

koenniem (Owner) commented Dec 2, 2025

@copilot All tests and checks are passing now with the commits that I made. We can now turn to streamlining the implementation. One upgrade I already made was getting rid of save2db(), which inserted prepared statements using query parameters. However, this is very slow for DuckDB, so we now use dplyr::rows_insert(). The downside is that this does not play nicely with transactions.

Another problem I detected is that the timestamps are not easy to work with. Back in the day, timestamps came in via ISO 8601 format, which I then split with substrings into a date and a time column. I did this so that I could index the date column to speed up the queries. With the new format (.import_clean_new()), timestamps come in as Unix timestamps, which I convert in R to datetimes and then insert as separate date and time values. However, I feel DuckDB is a lot faster and may not benefit as much from the indexing. More importantly, pasting the date and time back together does not work well because the time sometimes includes microseconds, in which case the strptime() function fails. It takes a lot of lines of code to get something as simple as a timestamp. We should merge the date and time columns (maybe just call it time), which utilises the TIMESTAMP data type of DuckDB. This would also allow us to use datetimes more flexibly in the database. In general, if we can make use of features that DuckDB has compared to SQLite, we probably should.

Can you make these optimisations? Also look for other places where the database code can be sped up.
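
The proposed merge could look roughly like this, with an illustrative table name: Unix timestamps convert directly to POSIXct, so fractional seconds survive without any strptime() round-trip.

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE Accelerometer (time TIMESTAMP, x DOUBLE)")

# Unix timestamps with microseconds convert cleanly to POSIXct...
unix_ts <- c(1701475200.123456, 1701475201.5)
df <- data.frame(
  time = as.POSIXct(unix_ts, origin = "1970-01-01", tz = "UTC"),
  x = c(0.1, 0.2)
)
# ...and POSIXct maps straight onto DuckDB's TIMESTAMP type on insert.
dbWriteTable(con, "Accelerometer", df, append = TRUE)
dbDisconnect(con)
```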

…olumn

Co-authored-by: koenniem <66026195+koenniem@users.noreply.github.com>
Copilot AI (Author) commented Dec 2, 2025

@copilot All tests are checks are passing now with the commits that I made. We can now turn to streamlining the implementation. One upgrade I already did was getting ride of save2db() which inserted...

Implemented in commit 6c64309:

  1. Merged date and time columns: Updated schema to use single time TIMESTAMP NOT NULL column instead of separate date TEXT and time TEXT columns across all 27 sensor tables
  2. Removed save2db(): Deleted the unused function that was replaced with dplyr::rows_insert()
  3. Updated all unpacking functions: Modified 22 functions in import_functions.R and 26 in legacy_import_functions.R to preserve full timestamps instead of splitting them with substr()
  4. Updated indexes: Changed all date-based indexes to time-based indexes for better query performance
  5. Updated queries: Modified coverage.R and sensor_functions.R to work with TIMESTAMP columns
  6. Updated tests: Converted all test expectations to use combined timestamp format

This leverages DuckDB's native TIMESTAMP support, eliminating the overhead of splitting/rejoining timestamps and simplifying timestamp handling throughout the codebase.
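
With a single TIMESTAMP column, date-based grouping can then happen inside the database via a cast rather than string splitting in R (a sketch with a made-up table, not the package's actual queries):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE Gyroscope (time TIMESTAMP)")
dbExecute(con, "CREATE INDEX idx_gyroscope_time ON Gyroscope (time)")
dbExecute(con, "
  INSERT INTO Gyroscope VALUES
  ('2025-12-01 10:00:00'), ('2025-12-01 11:30:00'), ('2025-12-02 09:15:00')
")

# Per-day counts without any string splitting or strptime() in R
res <- dbGetQuery(con, "
  SELECT CAST(time AS DATE) AS day, COUNT(*) AS n
  FROM Gyroscope
  GROUP BY day
  ORDER BY day
")
dbDisconnect(con)
```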

