`memrecycle()`: match factor levels in UTF-8 by aitap · Pull Request #6890 · Rdatatable/data.table

aitap · 2025-03-31T16:52:11Z

Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE in a UTF-8 locale vs. CE_UTF8), causing duplicate levels.

Fixes: #6886

Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels. Fixes: #6886

codecov · 2025-03-31T17:00:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.59%. Comparing base (476de7e) to head (0af4cd2).
Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6890   +/-   ##
=======================================
  Coverage   98.59%   98.59%           
=======================================
  Files          79       79           
  Lines       14665    14667    +2     
=======================================
+ Hits        14459    14461    +2     
  Misses        206      206

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2025-03-31T17:08:03Z

Generated via commit 0af4cd2

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	4 minutes and 28 seconds
Installing different package versions	8 minutes and 7 seconds
Running and plotting the test cases	2 minutes and 16 seconds

- Correct encoding of source string in non-UTF-8 locale - Use nlevels(.) instead of length(levels(.)) Co-Authored-By: Michael Chirico <chiricom@google.com>

aitap · 2025-04-01T06:38:26Z

Yes to both, thank you for catching these! Especially the 'ø', that would fail the test on a non-UTF-8 system in multiple ways.

MichaelChirico · 2025-04-01T18:22:08Z

+
+# memrecycle() did not consider string encodings for factor levels #6886
+DT = data.table(factor(rep(enc2utf8("\uf8"), 3)))
+DT[1, V1 := iconv(levels(V1), from = "UTF-8", to = "latin1")]


this test is somewhat confusing, since the end result is basically "do nothing", maybe it'll help to be more practical?

e.g. new_val = factor('a', levels = c('a', iconv(...))), then check that all we get in the update is an a level?

Re-reading the issue, I see your example is actually quite true to the original bug report. So maybe add this second example as another test instead? And maybe assign iconv(...) outside of [...] too, that might help clarify it a little bit & make it "read" closer to the original report too?

It's also a good idea to test both scenarios: assigning from a string and assigning from a factor.

MichaelChirico

LGTM, just a minor comment to see if we can't further improve the test a little. Feel free to submit when you see fit.

Thanks!

Test assignment from a string as well as from a different factor. Co-Authored-By: Michael Chirico <chiricom@google.com>

* memrecycle(): match factor levels in UTF-8 Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels. Fixes: #6886 * Fix tests - Correct encoding of source string in non-UTF-8 locale - Use nlevels(.) instead of length(levels(.)) * NEWS entry * Expand test Test assignment from a string as well as from a different factor. Co-authored-by: Michael Chirico <chiricom@google.com>

ben-schwen · 2025-12-16T12:11:45Z

@aitap apparently this is the regression for #7404 since the by columns in db-benchmark are factors.

set.seed(42)

n_groups = 1e5
rows_per_group = 100
N = n_groups * rows_per_group

dt = data.table(
  id3 = as.factor(rep(seq_len(n_groups), each = rows_per_group)),
  v1  = rnorm(N),
  v2  = rnorm(N)
)

dt[sample.int(N, N * 0.05), v1 := NA_real_]
dt[sample.int(N, N * 0.05), v2 := NA_real_]

# 20 seconds after #6890
# 0.1 seconds before #6890
system.time(dt[, .(range_v1_v2 = max(v1, na.rm = TRUE) - min(v2, na.rm = TRUE)), by = id3])

memrecycle(): match factor levels in UTF-8

4cc77c6

Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels. Fixes: #6886

aitap requested review from HughParsonage and MichaelChirico as code owners March 31, 2025 16:52

MichaelChirico reviewed Apr 1, 2025

View reviewed changes

Comment thread inst/tests/tests.Rraw Outdated

MichaelChirico reviewed Apr 1, 2025

View reviewed changes

Comment thread inst/tests/tests.Rraw Outdated

aitap and others added 2 commits April 1, 2025 09:11

Fix tests

ed9ec3e

- Correct encoding of source string in non-UTF-8 locale - Use nlevels(.) instead of length(levels(.)) Co-Authored-By: Michael Chirico <chiricom@google.com>

NEWS entry

b8c0f5a

Merge branch 'master' into memrecycle-level-encoding

2ba5cb5

MichaelChirico reviewed Apr 1, 2025

View reviewed changes

MichaelChirico approved these changes Apr 1, 2025

View reviewed changes

Expand test

0af4cd2

Test assignment from a string as well as from a different factor. Co-Authored-By: Michael Chirico <chiricom@google.com>

aitap merged commit 6f49bf1 into master Apr 1, 2025
11 checks passed

aitap deleted the memrecycle-level-encoding branch April 1, 2025 20:53

aitap mentioned this pull request Apr 24, 2025

Release data.table 1.17.2 #6860

Closed

This was referenced Mar 3, 2026

Assignment operator := does not accept anymore text values with non-ASCII with attribute Encoding = unknown #7648

Closed

memrecycle: actually match assigned string to factor levels in UTF-8 #7657

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`memrecycle()`: match factor levels in UTF-8#6890

`memrecycle()`: match factor levels in UTF-8#6890
aitap merged 5 commits intomasterfrom
memrecycle-level-encoding

aitap commented Mar 31, 2025

Uh oh!

codecov Bot commented Mar 31, 2025 •

edited

Loading

Uh oh!

github-actions Bot commented Mar 31, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

aitap commented Apr 1, 2025

Uh oh!

MichaelChirico Apr 1, 2025 •

edited

Loading

Uh oh!

aitap Apr 1, 2025

Uh oh!

MichaelChirico left a comment

Uh oh!

Uh oh!

ben-schwen commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

aitap commented Mar 31, 2025

Uh oh!

codecov Bot commented Mar 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Mar 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aitap commented Apr 1, 2025

Uh oh!

MichaelChirico Apr 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aitap Apr 1, 2025

Choose a reason for hiding this comment

Uh oh!

MichaelChirico left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ben-schwen commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Mar 31, 2025 •

edited

Loading

github-actions Bot commented Mar 31, 2025 •

edited

Loading

MichaelChirico Apr 1, 2025 •

edited

Loading