keyed

Explicit Key Assumptions for Flat-File Data

The keyed package brings database-style primary key protections to R data frames. Declare which columns must be unique, and keyed enforces that constraint through filters, joins, and mutations. When assumptions break, it errors immediately instead of failing silently downstream.

Quick Start

library(keyed)

# Declare a primary key (errors if not unique)
orders <- data.frame(
  order_id = 1:4,
  item     = c("apple", "bread", "apple", "cheese"),
  qty      = c(2, 1, 5, 3)
)
orders <- key(orders, order_id)

# Key persists through dplyr
orders |> dplyr::filter(qty > 1) |> has_key()
#> [1] TRUE

# Watch for automatic drift detection
orders <- orders |> watch()
modified <- orders |> dplyr::mutate(qty = qty * 10)
check_drift(modified)
#> Drift detected
#> Modified: 4 row(s)
#>   qty: 4 change(s)

Statement of Need

In databases, you declare customer_id as a primary key and the engine enforces uniqueness. With CSV and Excel files, you get no such guarantees. Duplicates slip in silently, joins produce unexpected row counts, and data assumptions are implicit.
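To make the failure mode concrete, here is a minimal base R sketch. The data frames are invented for illustration and the code does not use keyed itself. A duplicated customer_id inflates a join without any warning:

```r
# Hypothetical data: customer_id 2 appears twice in the lookup table
customers <- data.frame(
  customer_id = c(1, 2, 2),
  name        = c("Ann", "Bob", "Bob B."),
  stringsAsFactors = FALSE
)
orders <- data.frame(
  order_id    = 1:3,
  customer_id = c(1, 2, 2)
)

# merge() raises no warning, but each order for customer 2 is matched twice
joined <- merge(orders, customers, by = "customer_id")
nrow(orders)   # 3
nrow(joined)   # 5

# A manual uniqueness check catches the problem before the join
anyDuplicated(customers$customer_id) > 0   # TRUE
```

The check on the last line is the kind of assumption that usually stays implicit when working with flat files.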

Existing validation packages (pointblank, validate) offer comprehensive rule engines but require upfront schema definitions. For analysts working interactively with flat files, this overhead is often too high. The result: assumptions go unchecked, and errors surface far from their source.

keyed addresses this gap with four lightweight mechanisms:

Feature	What it does
Keys	Declare unique columns, enforced through transformations
Locks	Assert conditions (no NAs, row counts, coverage) at pipeline checkpoints
UUIDs	Track row identity through filters, joins, and reshaping
Watch & Diff	Auto-snapshot before each transformation, cell-level drift reports

All four work directly on data frames with no external dependencies, so you get key safety without leaving R.

Features

Keys

Declare which columns must be unique. Keys persist through base R and dplyr operations, and block any transformation that would break uniqueness.

# Single or composite keys (assumes `customers` and `sales` are existing data frames)
customers <- key(customers, customer_id)
sales     <- key(sales, region, year)

# Keys survive filtering
active <- customers[customers$status == "active", ]
has_key(active)
#> [1] TRUE

# Uniqueness-breaking operations are blocked
customers |> dplyr::mutate(customer_id = 1)
#> Error: Key is no longer unique after transformation.
#> i Use `unkey()` first if you intend to break uniqueness.
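For readers who want to see what "unique" means for a composite key, the following base R sketch may help. The helper `is_unique_key()` and the `sales` data are hypothetical and only illustrate the idea; this is not how keyed is implemented.

```r
# Hypothetical helper: TRUE when the given columns jointly identify each row
is_unique_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

sales <- data.frame(
  region = c("north", "north", "south"),
  year   = c(2023, 2024, 2023),
  total  = c(10, 12, 7)
)

is_unique_key(sales, "region")              # FALSE: "north" repeats
is_unique_key(sales, c("region", "year"))   # TRUE: the pair is unique

# Overwriting a key column with a constant destroys uniqueness
broken <- transform(sales, year = 2023)
is_unique_key(broken, c("region", "year"))  # FALSE
```

A column can fail as a key on its own and still be part of a valid composite key, which is why the key is declared over the full set of columns.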

Join Diagnostics

Preview join cardinality before executing:

diagnose_join(customers, orders, by = "customer_id")
#> Cardinality: one-to-many
#> customers: 1000 rows (unique)
#> orders:    5432 rows (4432 duplicates)
#> Left join will produce ~5432 rows
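The idea behind a cardinality preview can be sketched in base R. The helpers `join_cardinality()` and `left_join_rows()` below are hypothetical, written only for illustration under the assumption of a single join column with no missing values; they are not keyed's implementation.

```r
# Hypothetical helpers for illustration; not keyed's implementation
join_cardinality <- function(x, y, by) {
  x_unique <- anyDuplicated(x[[by]]) == 0
  y_unique <- anyDuplicated(y[[by]]) == 0
  if (x_unique && y_unique) {
    "one-to-one"
  } else if (x_unique) {
    "one-to-many"
  } else if (y_unique) {
    "many-to-one"
  } else {
    "many-to-many"
  }
}

# Rows a left join would return: one per match, or one for an unmatched key
left_join_rows <- function(x, y, by) {
  matches <- vapply(x[[by]], function(k) sum(y[[by]] == k), integer(1))
  sum(pmax(matches, 1L))
}

customers <- data.frame(customer_id = 1:3)
orders    <- data.frame(order_id = 1:5, customer_id = c(1, 1, 2, 2, 2))

join_cardinality(customers, orders, by = "customer_id")  # "one-to-many"
left_join_rows(customers, orders, by = "customer_id")    # 6
nrow(merge(customers, orders, by = "customer_id", all.x = TRUE))  # 6
```

The predicted row count agrees with the actual left join, so the check can run before the join does.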

Locks

Assert conditions at pipeline checkpoints. Locks error immediately, never continuing silently.

customers |>
  lock_unique(customer_id) |>
  lock_no_na(email) |>
  lock_nrow(min = 100)

Available locks:

Function	Checks
`lock_unique(df, col)`	No duplicate values
`lock_no_na(df, col)`	No missing values
`lock_complete(df)`	No NAs in any column
`lock_coverage(df, threshold, col)`	Share of non-NA values above threshold
`lock_nrow(df, min, max)`	Row count in range
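The checkpoint pattern itself is easy to sketch in base R: each check either stops with an error or returns the data unchanged, so it can sit in a pipe. The functions `assert_no_na()` and `assert_nrow()` below are hypothetical stand-ins written for this illustration, not keyed's lock functions.

```r
# Hypothetical stand-ins for lock-style checks; not keyed's implementation
assert_no_na <- function(df, col) {
  n_missing <- sum(is.na(df[[col]]))
  if (n_missing > 0) {
    stop(sprintf("Column '%s' has %d missing value(s).", col, n_missing))
  }
  invisible(df)  # return the data so the check can sit inside a pipe
}

assert_nrow <- function(df, min = 0, max = Inf) {
  if (nrow(df) < min || nrow(df) > max) {
    stop(sprintf("Expected between %s and %s rows, got %d.", min, max, nrow(df)))
  }
  invisible(df)
}

customers <- data.frame(id = 1:3, email = c("a@x.org", NA, "c@x.org"))

# Passing check: the data flows through unchanged
checked <- customers |> assert_nrow(min = 2)

# Failing check: the pipeline stops with an error instead of continuing
result <- tryCatch(
  customers |> assert_no_na("email"),
  error = function(e) conditionMessage(e)
)
result
```

Returning the input on success is what lets several checks be chained without intermediate variables.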

UUIDs

Generate stable row identifiers when your data has no natural key. UUIDs survive all transformations and enable row-level tracking.

customers <- add_id(customers)

# Track which rows were added or removed
filtered <- customers |> dplyr::filter(name != "Bob")
compare_ids(customers, filtered)
#> Lost: 1 row (7b1e4a9c2f8d3601)
#> Kept: 2 rows
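The mechanism behind row-level tracking can be sketched in base R: attach an identifier once, then compare identifier sets before and after a transformation. The `.row_id` column and the counter-based identifiers are assumptions for this sketch; keyed's own identifiers and column names may differ.

```r
people <- data.frame(name = c("Ann", "Bob", "Cleo"), stringsAsFactors = FALSE)

# Hypothetical identifier column; a simple counter stands in for a UUID
people$.row_id <- sprintf("row-%03d", seq_len(nrow(people)))

filtered <- people[people$name != "Bob", ]

# Set operations on the identifiers show which rows were lost or kept
lost <- setdiff(people$.row_id, filtered$.row_id)
kept <- intersect(people$.row_id, filtered$.row_id)
lost          # "row-002"
length(kept)  # 2
```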

Watch & Diff

watch() turns drift detection from a manual ceremony into an automatic safety net. Watched data frames auto-snapshot before each dplyr verb, so check_drift() always gives you a cell-level report of what the last transformation changed.

df <- key(data.frame(id = 1:5, x = c(1, 2, 3, 4, 5)), id) |> watch()

# Every dplyr verb auto-snapshots before executing
filtered <- df |> dplyr::filter(id <= 3)
check_drift(filtered)
#> Drift detected
#> Removed: 2 row(s)
#> Unchanged: 3 row(s)

# Each step in a chain tracks drift from the previous step
result <- filtered |> dplyr::mutate(x = x * 100)
check_drift(result)
#> Drift detected
#> Modified: 3 row(s)
#>   x: 3 change(s)
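The snapshot-before-transform idea can be illustrated with a small base R sketch. The wrapper `with_snapshot()` and the `"snapshot"` attribute are hypothetical, chosen only to show the pattern; keyed's internal storage may work quite differently.

```r
# Hypothetical wrapper: run a transformation and remember its input
with_snapshot <- function(df, fn) {
  before <- df
  attr(before, "snapshot") <- NULL   # avoid nesting older snapshots
  after <- fn(df)
  attr(after, "snapshot") <- before
  after
}

df <- data.frame(id = 1:5, x = c(1, 2, 3, 4, 5))

step1 <- with_snapshot(df, function(d) d[d$id <= 3, ])
step2 <- with_snapshot(step1, function(d) transform(d, x = x * 100))

# Each result carries the state just before its own transformation
nrow(attr(step1, "snapshot"))   # 5
attr(step2, "snapshot")$x       # 1 2 3
step2$x                         # 100 200 300
```

Because every step stores only its immediate predecessor, a drift report for a given step describes that step alone rather than the whole chain.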

You can also compare a keyed data frame directly against another version of the data with diff():

old <- key(data.frame(id = 1:3, x = c("a", "b", "c")), id)
new <- data.frame(id = 2:4, x = c("B", "c", "d"))

diff(old, new)
#> Key: id
#> Removed: 1 row(s)
#> Added: 1 row(s)
#> Modified: 1 row(s)
#>   x: 1 change(s)
#> Unchanged: 1 row(s)
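A key-based diff of this kind can be reproduced by hand in base R, which shows where each count in the report comes from. This is a sketch for a single value column using the same example data, not keyed's implementation.

```r
old <- data.frame(id = 1:3, x = c("a", "b", "c"), stringsAsFactors = FALSE)
new <- data.frame(id = 2:4, x = c("B", "c", "d"), stringsAsFactors = FALSE)

removed <- setdiff(old$id, new$id)     # keys only in old
added   <- setdiff(new$id, old$id)     # keys only in new
common  <- intersect(old$id, new$id)   # keys present in both

# Align the shared rows by key, then compare the value column cell by cell
old_x <- old$x[match(common, old$id)]
new_x <- new$x[match(common, new$id)]
modified  <- common[old_x != new_x]
unchanged <- common[old_x == new_x]

removed    # 1
added      # 4
modified   # 2
unchanged  # 3
```

Matching rows by key rather than by position is what makes the comparison robust to reordering and to added or removed rows.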

Use unwatch() to stop automatic stamping, or clear_all_snapshots() to free memory.

Installation

# Install from CRAN
install.packages("keyed")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/keyed")

When to Use Something Else

Need	Better Tool
Enforced schema	SQLite, DuckDB
Full data validation	pointblank, validate
Production pipelines	targets

Documentation

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.

License

MIT (see the LICENSE.md file)

Citation

@software{keyed,
  author = {Colling, Gilles},
  title = {keyed: Explicit Key Assumptions for Flat-File Data},
  year = {2025},
  url = {https://CRAN.R-project.org/package=keyed},
  doi = {10.32614/CRAN.package.keyed}
}
