cuplyr

dplyr backend for GPU acceleration via RAPIDS cuDF

cuplyr implements a dplyr backend powered by RAPIDS cuDF, NVIDIA's GPU DataFrame library. Write standard dplyr code, execute on GPU hardware.

library(cuplyr)

tbl_gpu(sales_data, lazy = TRUE) |>
  filter(year >= 2020, amount > 0) |>
  mutate(revenue = amount * price) |>
  group_by(region, quarter) |>
  summarise(total = sum(revenue)) |>
  inner_join(regions, by = "region") |>
  arrange(desc(total)) |>
  collect()

About

cuplyr translates dplyr operations into cuDF execution on NVIDIA GPUs. It follows the same backend pattern as dbplyr: the dplyr verbs stay the same and only the execution engine changes. This approach can provide significant speedups on larger datasets (typically >1M rows) without requiring major code changes.

Built on RAPIDS cuDF: cuDF is an open-source GPU DataFrame library developed by NVIDIA's RAPIDS team. It provides optimized CUDA kernels for data manipulation operations, backed by Apache Arrow's columnar memory format. cuplyr provides an R interface to this execution engine.

Status

v0.1.0

This is experimental software under active development. Breaking changes should be expected.

Supported operations

Data manipulation

  • filter() – row filtering with comparison and logical operators
  • select() – column selection and reordering
  • mutate() – column transformations and arithmetic
  • arrange() – row sorting with desc() support; NA handling follows dplyr conventions
  • group_by() + summarise() – grouped aggregations (sum, mean, min, max, n)
  • left_join(), right_join(), inner_join(), full_join() – GPU joins on key columns
  • collect() – transfer results back to R
  • compute() – execute lazy operations, keep on GPU
  • tbl_gpu(..., lazy = TRUE) – enable lazy evaluation with AST optimization

Lazy evaluation

Lazy mode defers execution until collect() or compute(), enabling automatic optimizations:

  • Projection pruning (drop unused columns early)
  • Filter pushdown (move filters closer to data sources)
  • Mutate fusion (combine consecutive transformations)

# Enable globally
options(cuplyr.exec_mode = "lazy")

# Or per-table
tbl_gpu(data, lazy = TRUE)
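
To make the idea of deferred execution concrete, here is a toy sketch in base R. It is not cuplyr's implementation: the names `toy_lazy`, `toy_filter`, `toy_select`, and `toy_collect` are hypothetical, and the sketch only records operations and replays them on request. It does not perform projection pruning, filter pushdown, or mutate fusion.

```r
# Toy illustration only: NOT cuplyr's implementation. All names are hypothetical.
toy_lazy <- function(df) {
  structure(list(src = df, ops = list()), class = "toy_lazy")
}

# Append one pending operation without touching the data.
add_op <- function(x, op) {
  x$ops <- c(x$ops, list(op))
  x
}

toy_filter <- function(x, pred) add_op(x, list(kind = "filter", fn = pred))
toy_select <- function(x, cols) add_op(x, list(kind = "select", cols = cols))

# Replay the recorded operations in order; this is the only step that computes.
toy_collect <- function(x) {
  out <- x$src
  for (op in x$ops) {
    out <- switch(op$kind,
      filter = out[op$fn(out), , drop = FALSE],
      select = out[, op$cols, drop = FALSE]
    )
  }
  out
}

sales <- data.frame(
  year   = c(2019, 2020, 2021),
  amount = c(5, -1, 7),
  price  = c(2, 3, 4)
)

pending <- toy_select(
  toy_filter(toy_lazy(sales), function(d) d$year >= 2020 & d$amount > 0),
  c("year", "amount")
)

length(pending$ops)              # two operations recorded, nothing executed yet
result <- toy_collect(pending)   # execution happens here
```

Because the whole chain is known before anything runs, an optimizer has the chance to reorder or merge steps. That is the property the optimizations listed above rely on.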

Supported column types

| R Type | GPU Type |
| --- | --- |
| numeric (double) | FLOAT64 |
| integer | INT32 |
| character | STRING |
| logical | BOOL8 |
| Date | TIMESTAMP_DAYS |
| POSIXct | TIMESTAMP_MICROSECONDS |
| factor | INT32 (codes) |
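
To check in advance how the columns of a data frame would map, the table can be restated as a small base R helper. `gpu_type_for` is a hypothetical function written for this example, not part of cuplyr. It only mirrors the table above and returns `NA` for any type not listed there.

```r
# Hypothetical helper that restates the type table; not a cuplyr function.
gpu_type_for <- function(x) {
  # Class checks come first: factor, Date and POSIXct are stored as
  # integer or double, so typeof() alone would misclassify them.
  if (is.factor(x)) return("INT32 (codes)")
  if (inherits(x, "POSIXct")) return("TIMESTAMP_MICROSECONDS")
  if (inherits(x, "Date")) return("TIMESTAMP_DAYS")
  switch(typeof(x),
    double    = "FLOAT64",
    integer   = "INT32",
    character = "STRING",
    logical   = "BOOL8",
    NA_character_  # type not listed in the table
  )
}

df <- data.frame(
  a = 1.5,
  b = 2L,
  c = "x",
  d = TRUE,
  e = as.Date("2024-01-01"),
  f = as.POSIXct("2024-01-01", tz = "UTC"),
  g = factor("u")
)

types <- vapply(df, gpu_type_for, character(1))
print(types)
```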

Not yet implemented

  • Complex joins with join_by()
  • Window functions
  • String operations
  • Multi-GPU support

Contributions and feedback are welcome.

Architecture

  • R layer: S3 methods implementing dplyr generics
  • AST optimizer: Projection pruning, filter pushdown, operation fusion
  • Native bindings: Rcpp interface to libcudf C++ API
  • Execution: cuDF GPU kernels via libcudf
  • Memory: GPU-resident data with automatic cleanup via R garbage collection
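
One common way for an R package to tie a native resource to garbage collection is a finalizer. The base R sketch below shows that general mechanism with `reg.finalizer()`. It is an assumption that cuplyr works along these lines; the `freed` registry, `release`, and `new_handle` are hypothetical stand-ins, and no GPU memory is involved.

```r
# Illustrative only: a registry standing in for "GPU buffers that were released".
freed <- new.env()

# Finalizer defined at top level so it holds no reference to any handle.
release <- function(e) assign(e$id, TRUE, envir = freed)

# Create a handle object and ask R to call release() when it is collected.
new_handle <- function(id) {
  handle <- new.env()
  handle$id <- id
  reg.finalizer(handle, release, onexit = TRUE)
  handle
}

h <- new_handle("buffer_1")
rm(h)            # drop the last reference to the handle
invisible(gc())  # a collection gives R the chance to run the finalizer

was_freed <- exists("buffer_1", envir = freed, inherits = FALSE)
```

The practical consequence of this pattern is that cleanup timing follows R's garbage collector rather than the moment a variable goes out of scope.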

Installation

Requirements

  • NVIDIA GPU with Compute Capability >= 6.0
  • CUDA Toolkit >= 12.0
  • RAPIDS libcudf >= 25.12
  • R >= 4.3

Using pixi (recommended)

# Install pixi if not already installed (https://pixi.sh)
# curl -fsSL https://pixi.sh/install.sh | bash

git clone https://github.com/bbtheo/cuplyr.git
cd cuplyr
pixi run install

From source

git clone https://github.com/bbtheo/cuplyr.git
cd cuplyr

# Ensure CUDA and cuDF are available, then:
R CMD INSTALL .

Performance

Benchmark code lives in benchmark/benchmark.R.

Benchmarks on 25 million rows (synthetic taxi data, median of 10 iterations):

| Operation | dplyr | data.table | DuckDB | cuplyr | cuplyr vs dplyr | cuplyr vs data.table | cuplyr vs DuckDB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Group & Summarise | 310.5 ms | 190.0 ms | 67.0 ms | 4.0 ms | 77.6x | 47.5x | 16.7x |
| Filter | 444.0 ms | 479.0 ms | 585.0 ms | 11.0 ms | 40.4x | 43.5x | 53.2x |
| Complete Workflow | 1237.0 ms | 574.5 ms | 126.5 ms | 20.0 ms | 61.9x | 28.7x | 6.3x |

Complete workflow: filter + mutate + group_by + summarise

Hardware: Intel Core i9-12900K (16 cores), NVIDIA RTX 5070 (12 GB VRAM)

End-to-end workflow including materialization/transfer:

| Workflow | dplyr | DuckDB (collect) | cuplyr (with GPU transfer) | cuplyr vs dplyr | cuplyr vs DuckDB |
| --- | --- | --- | --- | --- | --- |
| Complete Workflow + transfer | 1175.0 ms | 133.5 ms | 1213.0 ms | 1.0x | 0.1x |

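GPU acceleration benefits grow with data size and compute intensity. For transfer-heavy workloads or smaller datasets, CPU-based engines can still be faster: in the end-to-end measurement above, including GPU transfer brings cuplyr to 1213.0 ms, roughly on par with dplyr and slower than DuckDB.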
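
The speedup figures in this section are ratios of median times (baseline time divided by cuplyr time). The base R snippet below recomputes them from the timings reported above, as a way to check the arithmetic. It does not rerun the benchmark, and the variable names are chosen for this example.

```r
# Median timings in milliseconds, copied from the tables in this section.
timings_ms <- data.frame(
  operation  = c("Group & Summarise", "Filter", "Complete Workflow"),
  dplyr      = c(310.5, 444.0, 1237.0),
  data.table = c(190.0, 479.0, 574.5),
  DuckDB     = c(67.0, 585.0, 126.5),
  cuplyr     = c(4.0, 11.0, 20.0)
)

# Speedup = baseline time / cuplyr time, computed per operation (row).
baselines <- c("dplyr", "data.table", "DuckDB")
speedup <- timings_ms[baselines] / timings_ms$cuplyr
rownames(speedup) <- timings_ms$operation
print(speedup)

# Same ratio for the end-to-end run, where cuplyr took 1213.0 ms with transfer.
end_to_end <- c(dplyr = 1175.0, DuckDB = 133.5) / 1213.0
print(end_to_end)
```

A ratio below 1 means cuplyr was slower than the baseline, which is the case for DuckDB in the end-to-end run.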

Acknowledgments

This project is built on RAPIDS cuDF by NVIDIA and the RAPIDS AI team.


License: Apache 2.0

Maintainer: @bbtheo

Documentation: DEVELOPER_GUIDE.md