cuplyr implements a dplyr backend powered by RAPIDS cuDF, NVIDIA's GPU DataFrame library. Write standard dplyr code, execute on GPU hardware.
```r
library(cuplyr)

tbl_gpu(sales_data, lazy = TRUE) |>
  filter(year >= 2020, amount > 0) |>
  mutate(revenue = amount * price) |>
  group_by(region, quarter) |>
  summarise(total = sum(revenue)) |>
  inner_join(regions, by = "region") |>
  arrange(desc(total)) |>
  collect()
```

cuplyr translates dplyr operations into cuDF execution on NVIDIA GPUs. It follows the same backend pattern as dbplyr: write standard R code, execute on GPU hardware. This approach can provide significant speedups on larger datasets (typically >1M rows) without requiring major code changes.
Built on RAPIDS cuDF: cuDF is an open-source GPU DataFrame library developed by NVIDIA's RAPIDS team. It provides optimized CUDA kernels for data manipulation operations, backed by Apache Arrow's columnar memory format. cuplyr provides an R interface to this execution engine.
v0.1.0
This is experimental software under active development; expect breaking changes.
Data manipulation
- `filter()` – row filtering with comparison and logical operators
- `select()` – column selection and reordering
- `mutate()` – column transformations and arithmetic
- `arrange()` – row sorting with `desc()` support; NA handling follows dplyr conventions
- `group_by()` + `summarise()` – grouped aggregations (`sum`, `mean`, `min`, `max`, `n`)
- `left_join()`, `right_join()`, `inner_join()`, `full_join()` – GPU joins on key columns
- `collect()` – transfer results back to R
- `compute()` – execute lazy operations, keep the result on GPU
- `tbl_gpu(..., lazy = TRUE)` – enable lazy evaluation with AST optimization
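The split between `compute()` and `collect()` mirrors dbplyr: `compute()` materializes an intermediate result on the GPU so later verbs can reuse it, while `collect()` brings the final result back as an ordinary R data frame. A minimal sketch (assuming a working cuplyr installation and a hypothetical data frame `df` with `region` and `amount` columns):

```r
library(cuplyr)
library(dplyr)

gpu_tbl <- tbl_gpu(df, lazy = TRUE)

# Materialize the filtered table on the GPU once, then reuse it twice.
kept <- gpu_tbl |> filter(amount > 0) |> compute()

by_region <- kept |> group_by(region) |> summarise(total = sum(amount))
overall   <- kept |> summarise(total = sum(amount))

collect(by_region)  # transfer back to an ordinary R data frame
```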
Lazy mode defers execution until `collect()` or `compute()`, enabling automatic optimizations:
- Projection pruning (drop unused columns early)
- Filter pushdown (move filters closer to data sources)
- Mutate fusion (combine consecutive transformations)
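As a concrete sketch of what these optimizations mean in practice (the `trips` dataset and its columns are hypothetical), a pipeline can be written in a natural reading order and still execute efficiently, because lazy mode rewrites the plan before running it:

```r
library(cuplyr)
library(dplyr)

# Nothing executes on the GPU until collect().
trips_gpu <- tbl_gpu(trips, lazy = TRUE)

trips_gpu |>
  mutate(tip_rate = tip / fare) |>   # consecutive mutates can be fused
  mutate(total = fare + tip) |>
  filter(fare > 0) |>                # filter pushed down toward the scan
  select(total, tip_rate) |>         # unused columns pruned early
  collect()
```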
```r
# Enable globally
options(cuplyr.exec_mode = "lazy")

# Or per-table
tbl_gpu(data, lazy = TRUE)
```

| R Type | GPU Type |
|---|---|
| numeric (double) | FLOAT64 |
| integer | INT32 |
| character | STRING |
| logical | BOOL8 |
| Date | TIMESTAMP_DAYS |
| POSIXct | TIMESTAMP_MICROSECONDS |
| factor | INT32 (codes) |
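The mapping above implies that a GPU round trip should preserve R types. A sketch of checking this (column names are illustrative; assumes a working cuplyr installation):

```r
library(cuplyr)

df <- data.frame(
  x    = c(1.5, 2.5),                  # numeric   -> FLOAT64
  n    = 1:2,                          # integer   -> INT32
  s    = c("a", "b"),                  # character -> STRING
  flag = c(TRUE, FALSE),               # logical   -> BOOL8
  day  = as.Date("2024-01-01") + 0:1,  # Date      -> TIMESTAMP_DAYS
  f    = factor(c("lo", "hi"))         # factor    -> INT32 codes
)

out <- collect(tbl_gpu(df))
str(out)  # types should match the input per the table above
```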
Not yet supported:

- Complex joins with `join_by()`
- Window functions
- String operations
- Multi-GPU support
Contributions and feedback are welcome.
- R layer: S3 methods implementing dplyr generics
- AST optimizer: Projection pruning, filter pushdown, operation fusion
- Native bindings: Rcpp interface to libcudf C++ API
- Execution: cuDF GPU kernels via libcudf
- Memory: GPU-resident data with automatic cleanup via R garbage collection
- NVIDIA GPU with Compute Capability >= 6.0
- CUDA Toolkit >= 12.0
- RAPIDS libcudf >= 25.12
- R >= 4.3
```shell
# Install pixi if not already installed (https://pixi.sh)
# curl -fsSL https://pixi.sh/install.sh | bash
git clone https://github.com/bbtheo/cuplyr.git
cd cuplyr
pixi run install
```

Alternatively, install manually:

```shell
git clone https://github.com/bbtheo/cuplyr.git
cd cuplyr
# Ensure CUDA and cuDF are available, then:
R CMD INSTALL .
```

Benchmark code lives in `benchmark/benchmark.R`.
Benchmarks on 25 million rows (synthetic taxi data, median of 10 iterations):
| Operation | dplyr | data.table | DuckDB | cuplyr | cuplyr vs dplyr | cuplyr vs data.table | cuplyr vs DuckDB |
|---|---|---|---|---|---|---|---|
| Group & Summarise | 310.5 ms | 190.0 ms | 67.0 ms | 4.0 ms | 77.6x | 47.5x | 16.7x |
| Filter | 444.0 ms | 479.0 ms | 585.0 ms | 11.0 ms | 40.4x | 43.5x | 53.2x |
| Complete Workflow | 1237.0 ms | 574.5 ms | 126.5 ms | 20.0 ms | 61.9x | 28.7x | 6.3x |
Complete workflow: filter + mutate + group_by + summarise
Hardware: Intel Core i9-12900K (16 cores), NVIDIA RTX 5070 (12 GB VRAM)
End-to-end workflow including materialization/transfer:
| Workflow | dplyr | DuckDB (collect) | cuplyr (with GPU transfer) | cuplyr vs dplyr | cuplyr vs DuckDB |
|---|---|---|---|---|---|
| Complete Workflow + transfer | 1175.0 ms | 133.5 ms | 1213.0 ms | 1.0x | 0.1x |
GPU acceleration benefits grow with data size and compute intensity. For transfer-heavy workloads or smaller datasets, CPU-based engines can still be faster.
This project is built on RAPIDS cuDF by NVIDIA and the RAPIDS AI team.
License: Apache 2.0
Maintainer: @bbtheo
Documentation: DEVELOPER_GUIDE.md
