Skip to content

With i and without GForce, by-group operations don't fill the whole sub-column for NA groups #7442

@aitap

Description

@aitap

# Minimal reproducible example

This happens on current master or 1.17.8:

DT <- data.table(a="a", grp=1L)
i <- c(1, 1, 1, NA, NA)
DT[i, tail(a, 1), by = grp, verbose = TRUE] # this is correct
i clause present and columns used in by detected, only these subset: [grp]
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
forder.c received a vector type 'integer' length 2
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'tail(a, 1)'
GForce optimized j to 'gtail(a, 1)' (see ?GForce)
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.001
gforce eval took 0.000
0.001s elapsed (0.001s cpu)
     grp     V1
   <int> <char>
1:     1      a
2:    NA   <NA>
DT[i][, last(a, 1), by = grp, verbose = TRUE] # this is also correct
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'last(a, 1)'
GForce is on, but not activated for this query; left j unchanged (see ?GForce)
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()

  memcpy contiguous groups took 0.000s for 2 groups
  eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
     grp     V1
   <int> <char>
1:     1      a
2:    NA   <NA>
DT[i, last(base::print(a), 1), by = grp, verbose = TRUE] # result for the NA group is incorrect
i clause present and columns used in by detected, only these subset: [grp]
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.001s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
forder.c received a vector type 'integer' length 2
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'last(base::print(a), 1)'
GForce is on, but not activated for this query; left j unchanged (see ?GForce)
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... [1] "a" "a" "a"
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
[1] NA  "a" # <-- should consist of all NAs for the NA group
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()

  collecting discontiguous groups took 0.000s for 1 groups
  eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
     grp     V1
   <int> <char>
1:     1      a
2:    NA      a # <-- should be NA_character_

This seems to originate in this special case where only one row is overwritten:

data.table/src/dogroups.c

Lines 218 to 232 in 3c044ce

if (istarts[i] == NA_INTEGER || (LENGTH(order) && iorder[ istarts[i]-1 ]==NA_INTEGER)) {
for (int j=0; j<length(SDall); ++j) {
writeNA(VECTOR_ELT(SDall, j), 0, 1, false);
// writeNA uses SET_ for STR and VEC, and we always use SET_ to assign to SDall always too. Otherwise,
// this writeNA could decrement the reference for the old value which wasn't incremented in the first place.
// Further, the eval(jval) could feasibly assign to SD although that is currently caught and disallowed. If that
// became possible, that assign from user's j expression would decrement the reference which wasn't incremented
// in the first place. And finally, upon release of SD, values will be decremented, where they weren't incremented
// in the first place. All in all, too risky to write behind the barrier in this section.
// Or in the words, this entire section, and this entire dogroups.c file, is now write-barrier compliant from v1.12.10
// and we hope that reference counting on by default from R 4.0 will avoid costly gc()s.
}
grpn = 1; // it may not be 1 e.g. test 722. TODO: revisit.
SETLENGTH(I, grpn);
INTEGER(I)[0] = 0;

Should the code overwrite more of each by-group column, or resize them to length 1 as well?

This surfaced during the work on the resizable API, where the unpopulated rows of the sub-groups were overwritten to empty strings instead of the contents of the previous group.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions