# Minimal reproducible example
This happens on current master or 1.17.8:
DT <- data.table(a="a", grp=1L)
i <- c(1, 1, 1, NA, NA)
DT[i, tail(a, 1), by = grp, verbose = TRUE] # this is correct
i clause present and columns used in by detected, only these subset: [grp]
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
forder.c received a vector type 'integer' length 2
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'tail(a, 1)'
GForce optimized j to 'gtail(a, 1)' (see ?GForce)
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.001
gforce eval took 0.000
0.001s elapsed (0.001s cpu)
grp V1
<int> <char>
1: 1 a
2: NA <NA>
DT[i][, last(a, 1), by = grp, verbose = TRUE] # this is also correct
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'last(a, 1)'
GForce is on, but not activated for this query; left j unchanged (see ?GForce)
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
memcpy contiguous groups took 0.000s for 2 groups
eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
grp V1
<int> <char>
1: 1 a
2: NA <NA>
DT[i, last(base::print(a), 1), by = grp, verbose = TRUE] # result for the NA group is incorrect
i clause present and columns used in by detected, only these subset: [grp]
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.001s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
forder.c received a vector type 'integer' length 2
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'last(base::print(a), 1)'
GForce is on, but not activated for this query; left j unchanged (see ?GForce)
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... [1] "a" "a" "a"
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
[1] NA "a" # <-- should consist of all NAs for the NA group
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
collecting discontiguous groups took 0.000s for 1 groups
eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
grp V1
<int> <char>
1: 1 a
2: NA a # <-- should be NA_character_
This seems to originate in this special case where only one row is overwritten:
|
if (istarts[i] == NA_INTEGER || (LENGTH(order) && iorder[ istarts[i]-1 ]==NA_INTEGER)) { |
|
for (int j=0; j<length(SDall); ++j) { |
|
writeNA(VECTOR_ELT(SDall, j), 0, 1, false); |
|
// writeNA uses SET_ for STR and VEC, and we always use SET_ to assign to SDall always too. Otherwise, |
|
// this writeNA could decrement the reference for the old value which wasn't incremented in the first place. |
|
// Further, the eval(jval) could feasibly assign to SD although that is currently caught and disallowed. If that |
|
// became possible, that assign from user's j expression would decrement the reference which wasn't incremented |
|
// in the first place. And finally, upon release of SD, values will be decremented, where they weren't incremented |
|
// in the first place. All in all, too risky to write behind the barrier in this section. |
|
// Or in the words, this entire section, and this entire dogroups.c file, is now write-barrier compliant from v1.12.10 |
|
// and we hope that reference counting on by default from R 4.0 will avoid costly gc()s. |
|
} |
|
grpn = 1; // it may not be 1 e.g. test 722. TODO: revisit. |
|
SETLENGTH(I, grpn); |
|
INTEGER(I)[0] = 0; |
Should the code overwrite more of each by-group column, or resize them to length 1 as well?
This surfaced during the work on the resizable API, where the unpopulated rows of the sub-groups were overwritten to empty strings instead of the contents of the previous group.
#Minimal reproducible exampleThis happens on current
masteror 1.17.8:This seems to originate in this special case where only one row is overwritten:
data.table/src/dogroups.c
Lines 218 to 232 in 3c044ce
Should the code overwrite more of each by-group column, or resize them to length 1 as well?
This surfaced during the work on the resizable API, where the unpopulated rows of the sub-groups were overwritten to empty strings instead of the contents of the previous group.