memrecycle(): match factor levels in UTF-8 #6890
Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE in a UTF-8 locale vs. CE_UTF8), causing duplicate levels. Fixes: #6886
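To make the description concrete, here is a minimal base-R sketch of what "identical content, different encoding bits" looks like. It does not reproduce the C-level matching inside memrecycle(); it only illustrates, under the assumption that the character ø is representable in latin1, that two strings can compare equal at the R level while carrying different bytes and encoding marks. The names `x` and `y` are illustrative.

```r
# A string literal with a \u escape is marked as UTF-8.
x = "\u00f8"
# Re-encode the same character to latin1; iconv() marks the result accordingly.
y = iconv(x, from = "UTF-8", to = "latin1")

Encoding(x)   # "UTF-8"
Encoding(y)   # "latin1"

# The underlying bytes differ (c3 b8 vs. f8) ...
charToRaw(x)
charToRaw(y)

# ... yet R-level comparison and matching translate before comparing.
x == y        # TRUE
match(y, x)   # 1

# Translating both to UTF-8 first gives the same string.
identical(enc2utf8(y), enc2utf8(x))  # TRUE
```

Code that compares the internal string objects without first translating to a common encoding would treat `x` and `y` as different, which is the kind of mismatch the description refers to.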
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #6890 +/- ##
=======================================
Coverage 98.59% 98.59%
=======================================
Files 79 79
Lines 14665 14667 +2
=======================================
+ Hits 14459 14461 +2
Misses 206 206
Generated via commit 0af4cd2. Download link for the artifact containing the test results: atime-results.zip
Yes to both, thank you for catching these! Especially the |
# memrecycle() did not consider string encodings for factor levels #6886
DT = data.table(factor(rep(enc2utf8("\uf8"), 3)))
DT[1, V1 := iconv(levels(V1), from = "UTF-8", to = "latin1")]
this test is somewhat confusing, since the end result is basically "do nothing", maybe it'll help to be more practical?
e.g. new_val = factor('a', levels = c('a', iconv(...))), then check that all we get in the update is an a level?
Re-reading the issue, I see your example is actually quite true to the original bug report. So maybe add this second example as another test instead? And maybe assign iconv(...) outside of [...] too, that might help clarify it a little bit & make it "read" closer to the original report too?
It's also a good idea to test both scenarios: assigning from a string and assigning from a factor.
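The two scenarios suggested in the review could look roughly like the sketch below. This is not the test that was committed; the variable names (`utf8`, `latin1`, `DT1`, `DT2`) are made up for illustration, and the checks assume a data.table build that includes this fix.

```r
library(data.table)

# Same character in two encodings, assigned outside of [...] for readability.
utf8 = enc2utf8("\u00f8")
latin1 = iconv(utf8, from = "UTF-8", to = "latin1")

# Scenario 1: assign a string to a factor column.
DT1 = data.table(V1 = factor(rep(utf8, 3)))
DT1[1, V1 := latin1]
# With the fix, no duplicate level should appear here.
nlevels(DT1$V1)

# Scenario 2: assign from a different factor whose levels include
# the re-encoded string plus a genuinely new level "a".
new_val = factor("a", levels = c("a", latin1))
DT2 = data.table(V1 = factor(rep(utf8, 3)))
DT2[1, V1 := new_val]
# With the fix, "a" should be the only level added by the update.
levels(DT2$V1)
```

Using nlevels(.) rather than length(levels(.)) follows the wording of the later "Fix tests" commit.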
MichaelChirico left a comment
LGTM, just a minor comment to see if we can't further improve the test a little. Feel free to submit when you see fit.
Thanks!
Test assignment from a string as well as from a different factor. Co-Authored-By: Michael Chirico <chiricom@google.com>
* memrecycle(): match factor levels in UTF-8

Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels. Fixes: #6886

* Fix tests
- Correct encoding of source string in non-UTF-8 locale
- Use nlevels(.) instead of length(levels(.))

* NEWS entry

* Expand test
Test assignment from a string as well as from a different factor.

Co-authored-by: Michael Chirico <chiricom@google.com>
@aitap apparently this is the regression for #7404 since the by columns in db-benchmark are factors.
library(data.table)
set.seed(42)
n_groups = 1e5
rows_per_group = 100
N = n_groups * rows_per_group
dt = data.table(
id3 = as.factor(rep(seq_len(n_groups), each = rows_per_group)),
v1 = rnorm(N),
v2 = rnorm(N)
)
dt[sample.int(N, N * 0.05), v1 := NA_real_]
dt[sample.int(N, N * 0.05), v2 := NA_real_]
# 20 seconds after #6890
# 0.1 seconds before #6890
system.time(dt[, .(range_v1_v2 = max(v1, na.rm = TRUE) - min(v2, na.rm = TRUE)), by = id3])
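To see how the timing grows with the number of groups without waiting on the full-size run, the reproduction above can be wrapped in a helper and run at smaller sizes. This is only a sketch: `time_grouping` and the chosen sizes are hypothetical, not part of the original report, and the absolute timings will depend on the machine and the data.table version.

```r
library(data.table)

# Hypothetical helper: build a table with a factor `by` column of `n_groups`
# levels and time the same grouped expression as in the report above.
time_grouping = function(n_groups, rows_per_group = 100L) {
  N = n_groups * rows_per_group
  dt = data.table(
    id3 = as.factor(rep(seq_len(n_groups), each = rows_per_group)),
    v1 = rnorm(N),
    v2 = rnorm(N)
  )
  system.time(
    dt[, .(range_v1_v2 = max(v1, na.rm = TRUE) - min(v2, na.rm = TRUE)), by = id3]
  )[["elapsed"]]
}

set.seed(42)
sizes = c(1e3, 1e4)  # scaled down from the 1e5 groups used in the report
timings = vapply(sizes, time_grouping, numeric(1))
data.table(n_groups = sizes, elapsed = timings)
```

Running this on builds from before and after #6890 would show whether the elapsed time grows faster than linearly in the number of factor levels.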
