cuplyr

dplyr backend for GPU acceleration via RAPIDS cuDF

cuplyr implements a dplyr backend powered by RAPIDS cuDF, NVIDIA's GPU DataFrame library. Write standard dplyr code, execute on GPU hardware.

library(cuplyr)

tbl_gpu(sales_data, lazy = TRUE) |>
  filter(year >= 2020, amount > 0) |>
  mutate(revenue = amount * price) |>
  group_by(region, quarter) |>
  summarise(total = sum(revenue)) |>
  inner_join(regions, by = "region") |>
  arrange(desc(total)) |>
  collect()

About

cuplyr translates dplyr operations into cuDF execution on NVIDIA GPUs. It follows the same backend pattern as dbplyr: you write standard dplyr code, and the backend translates it for another execution engine, here the GPU. This can yield significant speedups on larger datasets (typically >1M rows) without major code changes.

Built on RAPIDS cuDF: cuDF is an open-source GPU DataFrame library developed by NVIDIA's RAPIDS team. It provides optimized CUDA kernels for data manipulation operations, backed by Apache Arrow's columnar memory format. cuplyr provides an R interface to this execution engine.

Status

v0.1.0

This is experimental software under active development. Breaking changes should be expected.

Supported operations

Data manipulation

filter() – row filtering with comparison and logical operators
select() – column selection and reordering
mutate() – column transformations and arithmetic
arrange() – row sorting with desc() support; NA handling follows dplyr conventions
group_by() + summarise() – grouped aggregations (sum, mean, min, max, n)
left_join(), right_join(), inner_join(), full_join() – GPU joins on key columns
collect() – transfer results back to R
compute() – execute lazy operations, keep on GPU
tbl_gpu(..., lazy = TRUE) – enable lazy evaluation with AST optimization
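
A minimal sketch of how compute() and collect() might be combined, using only the verbs listed above. The data frame `sales` and its columns are made up for illustration, and passing a plain data frame to tbl_gpu() is assumed from the example at the top of this README. The guard skips the GPU part on machines without cuplyr installed; the guarded part still needs a working GPU setup.

```r
# Hypothetical input: a small data frame standing in for real data.
sales <- data.frame(
  region = c("north", "south", "north", "south"),
  year   = c(2019L, 2020L, 2021L, 2021L),
  amount = c(10, 20, 30, 40)
)

if (requireNamespace("cuplyr", quietly = TRUE)) {
  library(cuplyr)

  # Run the shared filter once and keep the result on the GPU.
  recent <- tbl_gpu(sales, lazy = TRUE) |>
    filter(year >= 2020) |>
    compute()

  # Reuse the GPU-resident intermediate; only the small summaries
  # are transferred back to R.
  by_region <- recent |>
    group_by(region) |>
    summarise(total = sum(amount)) |>
    collect()

  by_year <- recent |>
    group_by(year) |>
    summarise(total = sum(amount)) |>
    collect()
}
```

The intent is that the filter result stays on the device between the two summaries, so only the aggregated rows cross back to R.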

Lazy evaluation

Lazy mode defers execution until collect() or compute(), enabling automatic optimizations:

Projection pruning (drop unused columns early)
Filter pushdown (move filters closer to data sources)
Mutate fusion (combine consecutive transformations)

# Enable globally
options(cuplyr.exec_mode = "lazy")

# Or per-table
tbl_gpu(data, lazy = TRUE)
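
To make the three rewrites concrete, the sketch below applies them by hand to a small pipeline using plain dplyr on the CPU. It is not cuplyr's optimizer, only an illustration of why the rewritten pipeline gives the same result. The `trips` data and its column names are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical data; column names are illustrative only.
trips <- data.frame(
  year     = c(2019L, 2020L, 2021L, 2022L),
  distance = c(1.0, 2.5, 4.0, 0.5),
  fare     = c(5, 12, 20, 4),
  vendor   = c("a", "b", "a", "b")
)

# The pipeline as a user might write it.
as_written <- trips |>
  mutate(fare_per_km = fare / distance) |>
  mutate(fare_rounded = round(fare_per_km, 1)) |>
  filter(year >= 2020) |>
  select(year, fare_rounded)

# The same pipeline after applying the three rewrites by hand.
hand_optimised <- trips |>
  # Projection pruning: vendor is never used, so drop it first.
  select(year, distance, fare) |>
  # Filter pushdown: the filter only reads year, which no mutate changes.
  filter(year >= 2020) |>
  # Mutate fusion: two consecutive mutates collapse into one expression.
  mutate(fare_rounded = round(fare / distance, 1)) |>
  select(year, fare_rounded)

# Both versions should hold the same values.
stopifnot(isTRUE(all.equal(as_written, hand_optimised, check.attributes = FALSE)))
```

Pushdown is only valid here because the filter does not depend on a column created by mutate(); a filter on fare_rounded could not be moved ahead of the transformation.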

Supported column types

R Type	GPU Type
numeric (double)	FLOAT64
integer	INT32
character	STRING
logical	BOOL8
Date	TIMESTAMP_DAYS
POSIXct	TIMESTAMP_MICROSECONDS
factor	INT32 (codes)
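
As a rough aid for checking a data frame before moving it to the GPU, the table above can be transcribed into a lookup in base R. The helper `expected_gpu_types()` is hypothetical and not part of cuplyr; it only reports what the table implies and returns NA for classes the table does not list.

```r
# Lookup transcribed from the type table above.
gpu_type_for <- c(
  numeric   = "FLOAT64",
  integer   = "INT32",
  character = "STRING",
  logical   = "BOOL8",
  Date      = "TIMESTAMP_DAYS",
  POSIXct   = "TIMESTAMP_MICROSECONDS",
  factor    = "INT32 (codes)"
)

# Hypothetical helper: expected GPU type per column, NA if not listed.
expected_gpu_types <- function(df) {
  first_class <- vapply(df, function(col) class(col)[1], character(1))
  out <- unname(gpu_type_for[first_class])
  names(out) <- names(df)
  out
}

example <- data.frame(
  id    = 1:3,
  price = c(9.5, 3.25, 7),
  label = c("a", "b", "c"),
  day   = as.Date("2024-01-01") + 0:2,
  level = factor(c("low", "high", "low")),
  z     = complex(real = 1:3),   # complex is not in the table
  stringsAsFactors = FALSE
)

print(expected_gpu_types(example))

# Factors map to their integer codes; the level labels are a separate attribute.
print(as.integer(example$level))
```

Because the helper looks only at the first class of each column, subclasses such as ordered factors would also show up as NA in this sketch.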

Not yet implemented

Complex joins with join_by()
Window functions
String operations
Multi-GPU support

Contributions and feedback are welcome.

Architecture

R layer: S3 methods implementing dplyr generics
AST optimizer: Projection pruning, filter pushdown, operation fusion
Native bindings: Rcpp interface to libcudf C++ API
Execution: cuDF GPU kernels via libcudf
Memory: GPU-resident data with automatic cleanup via R garbage collection
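
The memory point can be illustrated with R's finalizer mechanism. The sketch below is a base-R stand-in, not cuplyr's actual implementation: an environment plays the role of a GPU handle, and the names `new_fake_gpu_handle()` and `release_buffer()` are hypothetical. A real binding would presumably attach a similar finalizer to an external pointer and free device memory inside it.

```r
# Record of "freed" buffers, so the effect of the finalizer is visible.
released <- character(0)

# Hypothetical release routine; a real binding would free device memory here.
release_buffer <- function(handle) {
  released <<- c(released, handle$id)
}

# Stand-in for a GPU allocation: an environment holding a fake buffer id.
new_fake_gpu_handle <- function(id) {
  handle <- new.env()
  handle$id <- id
  # Ask R to call release_buffer() once the handle is garbage collected.
  reg.finalizer(handle, release_buffer)
  handle
}

tbl <- new_fake_gpu_handle("buffer-1")
rm(tbl)          # drop the last reference to the handle
invisible(gc())  # force a collection so the finalizer runs now

print(released)
```

One consequence of this design is that memory is released when R decides to collect, not when the object goes out of scope, so dropping references to large tables and calling gc() can help when GPU memory is tight.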

Installation

Requirements

NVIDIA GPU with Compute Capability >= 6.0
CUDA Toolkit >= 12.0
RAPIDS libcudf >= 25.12
R >= 4.3
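
A quick pre-flight check from R can catch the most common setup problems before a build. The helper below is hypothetical and deliberately shallow: it confirms the R version and looks for the `nvidia-smi` and `nvcc` executables on the PATH, but it does not verify the Compute Capability, CUDA Toolkit version, or libcudf version.

```r
# Hypothetical pre-flight check; it inspects the local toolchain only.
check_cuplyr_prereqs <- function() {
  c(
    r_version_ok    = getRversion() >= "4.3",
    nvidia_smi_seen = nzchar(Sys.which("nvidia-smi")[[1]]),
    nvcc_seen       = nzchar(Sys.which("nvcc")[[1]])
  )
}

print(check_cuplyr_prereqs())
```

A FALSE for either executable only means it was not found on the PATH of the current R session; inside a pixi environment the result may differ from the system shell.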

Using pixi (recommended)

# Install pixi if not already installed (https://pixi.sh)
# curl -fsSL https://pixi.sh/install.sh | bash

git clone https://github.com/bbtheo/cuplyr.git
cd cuplyr
pixi run install

From source

git clone https://github.com/bbtheo/cuplyr.git
cd cuplyr

# Ensure CUDA and cuDF are available, then:
R CMD INSTALL .

Performance

Benchmark code lives in benchmark/benchmark.R.

Benchmarks on 25 million rows (synthetic taxi data, median of 10 iterations):

Operation	dplyr	data.table	DuckDB	cuplyr	cuplyr vs dplyr	cuplyr vs data.table	cuplyr vs DuckDB
Group & Summarise	310.5 ms	190.0 ms	67.0 ms	4.0 ms	77.6x	47.5x	16.7x
Filter	444.0 ms	479.0 ms	585.0 ms	11.0 ms	40.4x	43.5x	53.2x
Complete Workflow	1237.0 ms	574.5 ms	126.5 ms	20.0 ms	61.9x	28.7x	6.3x

Complete workflow: filter + mutate + group_by + summarise

Hardware: Intel Core i9-12900K (16 cores), NVIDIA RTX 5070 (12 GB VRAM)

End-to-end workflow including materialization/transfer:

Workflow	dplyr	DuckDB (collect)	cuplyr (with GPU transfer)	cuplyr vs dplyr	cuplyr vs DuckDB
Complete Workflow + transfer	1175.0 ms	133.5 ms	1213.0 ms	1.0x	0.1x

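GPU acceleration benefits grow with data size and compute intensity. For transfer-heavy workloads or smaller datasets, CPU-based engines can still be faster: in the end-to-end measurement above, cuplyr with GPU transfer (1213.0 ms) is roughly on par with dplyr (1175.0 ms) and about 9x slower than DuckDB (133.5 ms).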
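
The ratio columns in the tables are the baseline time divided by the cuplyr time. The base-R sketch below recomputes them from the reported medians and derives a rough estimate of the transfer cost. That estimate assumes the two tables are comparable runs, which the README does not state, so treat it as indicative only.

```r
# Median timings in milliseconds, transcribed from the first table above.
timings <- data.frame(
  operation  = c("Group & Summarise", "Filter", "Complete Workflow"),
  dplyr      = c(310.5, 444.0, 1237.0),
  data.table = c(190.0, 479.0, 574.5),
  duckdb     = c(67.0, 585.0, 126.5),
  cuplyr     = c(4.0, 11.0, 20.0)
)

# Speedup = baseline time / cuplyr time, computed row by row.
speedup <- timings[c("dplyr", "data.table", "duckdb")] / timings$cuplyr
rownames(speedup) <- timings$operation
print(round(speedup, 1))

# End-to-end medians, transcribed from the second table above.
end_to_end <- c(dplyr = 1175.0, duckdb = 133.5, cuplyr = 1213.0)
print(round(end_to_end[c("dplyr", "duckdb")] / end_to_end[["cuplyr"]], 1))

# Rough estimate of transfer/materialization cost: end-to-end cuplyr time
# minus the GPU-only time for the same workflow.
gpu_only_ms <- timings$cuplyr[timings$operation == "Complete Workflow"]
transfer_ms <- end_to_end[["cuplyr"]] - gpu_only_ms
print(transfer_ms)
```

On these numbers nearly all of the end-to-end time is spent outside the GPU computation itself, which is why keeping intermediate results on the device with compute() matters for multi-step analyses.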

Acknowledgments

This project is built on RAPIDS cuDF by NVIDIA and the RAPIDS AI team.

License: Apache 2.0

Maintainer: @bbtheo

Documentation: DEVELOPER_GUIDE.md
