Skip to content

memrecycle(): match factor levels in UTF-8#6890

Merged
aitap merged 5 commits intomasterfrom
memrecycle-level-encoding
Apr 1, 2025
Merged

memrecycle(): match factor levels in UTF-8#6890
aitap merged 5 commits intomasterfrom
memrecycle-level-encoding

Conversation

@aitap
Copy link
Copy Markdown
Member

@aitap aitap commented Mar 31, 2025

Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE in a UTF-8 locale vs. CE_UTF8), causing duplicate levels.

Fixes: #6886

Previously, by-reference sub-assignment to a factor column could fail to
match strings with identical content if they had different encoding bits
(even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels.

Fixes: #6886
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.59%. Comparing base (476de7e) to head (0af4cd2).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6890   +/-   ##
=======================================
  Coverage   98.59%   98.59%           
=======================================
  Files          79       79           
  Lines       14665    14667    +2     
=======================================
+ Hits        14459    14461    +2     
  Misses        206      206           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 31, 2025

Comparison Plot

Generated via commit 0af4cd2

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 4 minutes and 28 seconds
Installing different package versions 8 minutes and 7 seconds
Running and plotting the test cases 2 minutes and 16 seconds

Comment thread inst/tests/tests.Rraw Outdated
Comment thread inst/tests/tests.Rraw Outdated
aitap and others added 2 commits April 1, 2025 09:11
- Correct encoding of source string in non-UTF-8 locale
- Use nlevels(.) instead of length(levels(.))

Co-Authored-By: Michael Chirico <chiricom@google.com>
@aitap
Copy link
Copy Markdown
Member Author

aitap commented Apr 1, 2025

Yes to both, thank you for catching these! Especially the 'ø', that would fail the test on a non-UTF-8 system in multiple ways.

Comment thread inst/tests/tests.Rraw Outdated

# memrecycle() did not consider string encodings for factor levels #6886
DT = data.table(factor(rep(enc2utf8("\uf8"), 3)))
DT[1, V1 := iconv(levels(V1), from = "UTF-8", to = "latin1")]
Copy link
Copy Markdown
Member

@MichaelChirico MichaelChirico Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is somewhat confusing, since the end result is basically "do nothing", maybe it'll help to be more practical?

e.g. new_val = factor('a', levels = c('a', iconv(...))), then check that all we get in the update is an a level?

Re-reading the issue, I see your example is actually quite true to the original bug report. So maybe add this second example as another test instead? And maybe assign iconv(...) outside of [...] too, that might help clarify it a little bit & make it "read" closer to the original report too?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also a good idea to test both scenarios: assigning from a string and assigning from a factor.

Copy link
Copy Markdown
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor comment to see if we can't further improve the test a little. Feel free to submit when you see fit.

Thanks!

Test assignment from a string as well as from a different factor.

Co-Authored-By: Michael Chirico <chiricom@google.com>
@aitap aitap merged commit 6f49bf1 into master Apr 1, 2025
11 checks passed
@aitap aitap deleted the memrecycle-level-encoding branch April 1, 2025 20:53
aitap added a commit that referenced this pull request Apr 24, 2025
* memrecycle(): match factor levels in UTF-8

Previously, by-reference sub-assignment to a factor column could fail to
match strings with identical content if they had different encoding bits
(even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels.

Fixes: #6886

* Fix tests

- Correct encoding of source string in non-UTF-8 locale
- Use nlevels(.) instead of length(levels(.))

* NEWS entry

* Expand test

Test assignment from a string as well as from a different factor.

Co-authored-by: Michael Chirico <chiricom@google.com>
@aitap aitap mentioned this pull request Apr 24, 2025
TysonStanley pushed a commit that referenced this pull request Apr 24, 2025
* memrecycle(): match factor levels in UTF-8

Previously, by-reference sub-assignment to a factor column could fail to
match strings with identical content if they had different encoding bits
(even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels.

Fixes: #6886

* Fix tests

- Correct encoding of source string in non-UTF-8 locale
- Use nlevels(.) instead of length(levels(.))

* NEWS entry

* Expand test

Test assignment from a string as well as from a different factor.

Co-authored-by: Michael Chirico <chiricom@google.com>
@ben-schwen
Copy link
Copy Markdown
Member

@aitap apparently this is the regression for #7404 since the by columns in db-benchmark are factors.

set.seed(42)

n_groups = 1e5
rows_per_group = 100
N = n_groups * rows_per_group

dt = data.table(
  id3 = as.factor(rep(seq_len(n_groups), each = rows_per_group)),
  v1  = rnorm(N),
  v2  = rnorm(N)
)

dt[sample.int(N, N * 0.05), v1 := NA_real_]
dt[sample.int(N, N * 0.05), v2 := NA_real_]

# 20 seconds after #6890
# 0.1 seconds before #6890
system.time(dt[, .(range_v1_v2 = max(v1, na.rm = TRUE) - min(v2, na.rm = TRUE)), by = id3])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Strange behaviours when updating by reference a factor

3 participants