With `i` and without GForce, by-group operations don't fill the whole sub-column for NA groups

`#` `Minimal reproducible example`

This happens on current `master` or 1.17.8:

```r
DT <- data.table(a="a", grp=1L)
i <- c(1, 1, 1, NA, NA)
DT[i, tail(a, 1), by = grp, verbose = TRUE] # this is correct
```
```
i clause present and columns used in by detected, only these subset: [grp]
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
forder.c received a vector type 'integer' length 2
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'tail(a, 1)'
GForce optimized j to 'gtail(a, 1)' (see ?GForce)
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.001
gforce eval took 0.000
0.001s elapsed (0.001s cpu)
     grp     V1
   <int> <char>
1:     1      a
2:    NA   <NA>
```

```r
DT[i][, last(a, 1), by = grp, verbose = TRUE] # this is also correct
```
```
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'last(a, 1)'
GForce is on, but not activated for this query; left j unchanged (see ?GForce)
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()

  memcpy contiguous groups took 0.000s for 2 groups
  eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
     grp     V1
   <int> <char>
1:     1      a
2:    NA   <NA>
```
```r
DT[i, last(base::print(a), 1), by = grp, verbose = TRUE] # result for the NA group is incorrect
```
```
i clause present and columns used in by detected, only these subset: [grp]
Detected that j uses these columns: [a]
Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
forder.c received 5 rows and 1 columns
forderReuseSorting: opt=0, took 0.000s
0.001s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
forder.c received a vector type 'integer' length 2
forderReuseSorting: opt=0, took 0.000s
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'last(base::print(a), 1)'
GForce is on, but not activated for this query; left j unchanged (see ?GForce)
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... [1] "a" "a" "a"
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()
[1] NA  "a" # <-- should consist of all NAs for the NA group
last: using utils::tail: !is.xts(x) & nargs>1 & !'package:xts'%in%search()

  collecting discontiguous groups took 0.000s for 1 groups
  eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
     grp     V1
   <int> <char>
1:     1      a
2:    NA      a # <-- should be NA_character_
```

This seems to originate in this special case where only one row is overwritten:
https://github.com/Rdatatable/data.table/blob/3c044cebfdce9cb6181e6afbfa4b411278f7d3e8/src/dogroups.c#L218-L232

Should the code overwrite more of each by-group column, or resize them to length 1 as well?

This surfaced during the work on the resizable API, where the unpopulated rows of the sub-groups were overwritten to empty strings instead of the contents of the previous group.

	if (istarts[i] == NA_INTEGER \|\| (LENGTH(order) && iorder[ istarts[i]-1 ]==NA_INTEGER)) {
	for (int j=0; j<length(SDall); ++j) {
	writeNA(VECTOR_ELT(SDall, j), 0, 1, false);
	// writeNA uses SET_ for STR and VEC, and we always use SET_ to assign to SDall always too. Otherwise,
	// this writeNA could decrement the reference for the old value which wasn't incremented in the first place.
	// Further, the eval(jval) could feasibly assign to SD although that is currently caught and disallowed. If that
	// became possible, that assign from user's j expression would decrement the reference which wasn't incremented
	// in the first place. And finally, upon release of SD, values will be decremented, where they weren't incremented
	// in the first place. All in all, too risky to write behind the barrier in this section.
	// Or in the words, this entire section, and this entire dogroups.c file, is now write-barrier compliant from v1.12.10
	// and we hope that reference counting on by default from R 4.0 will avoid costly gc()s.
	}
	grpn = 1; // it may not be 1 e.g. test 722. TODO: revisit.
	SETLENGTH(I, grpn);
	INTEGER(I)[0] = 0;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

With `i` and without GForce, by-group operations don't fill the whole sub-column for NA groups #7442

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

With i and without GForce, by-group operations don't fill the whole sub-column for NA groups #7442

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

With `i` and without GForce, by-group operations don't fill the whole sub-column for NA groups #7442