Enhanced data.frame
Description
data.table inherits from data.frame. It offers fast, memory-efficient file reading and
writing, aggregations, updates, and equi, non-equi, rolling, range and interval joins, in a
short and flexible syntax, for faster development.
It is inspired by the A[B] syntax in R, where A is a matrix and B is a 2-column matrix. Since
a data.table is a data.frame, it is compatible with R functions and packages that
accept only data.frames.
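As a minimal sketch of the inheritance described above, the following builds a small table and passes it to base R functions written for data.frames. The column names id and value are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical toy data; the column names are illustrative only.
DT <- data.table(id = 1:3, value = c(2.5, 3.5, 4.0))

# A data.table carries both classes, so is.data.frame() is TRUE.
dt_class <- class(DT)
is_df <- is.data.frame(DT)

# Base R functions that expect a data.frame accept DT unchanged.
n_rows <- nrow(DT)
col_means <- colMeans(DT)

print(dt_class)
print(col_means)
```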
Usage
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL,
stringsAsFactors=FALSE)
Arguments
... Just as ... in data.frame. Usual recycling rules are applied to vectors
of different lengths to create a list of equal length vectors.
keep.rownames If ... is a matrix or data.frame, TRUE will retain the rownames of
that object in a column named rn.
x A data.table.
on argument (see below). It allows for both equi- and the newly
implemented non-equi joins.
If not, x must be keyed. Key can be set using setkey. If i is also keyed,
then the first key column of i is matched against the first key column of x, the
second against the second, etc.
If i is not keyed, then the first column of i is matched against the first key
column of x, the second column of i against the second key column of x, etc.
The expression '.()' is a shorthand alias to list(); they both mean the
same. (An exception is made for the use of .() within a call to bquote,
where .() is left unchanged.) As long as j returns a list, each element
of the list becomes a column in the resulting data.table. This is the
default enhanced mode.
Advanced: When i is a list (or data.frame or data.table), DT[i, j,
by=.EACHI] evaluates j for the groups in DT that each row in i joins
to. That is, you can join (in i) and aggregate (in j) simultaneously. We
call this grouping by each i. See this StackOverflow answer for a more
detailed explanation until we roll out vignettes.
Advanced: In the X[Y, j] form of grouping, the j expression sees
variables in X first, then Y. We call this join inherited scope. If the variable
is not in X or Y then the calling frame is searched, its calling frame, and so
on in the usual way up to and including the global environment.
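The grouping by each i and the join inherited scope described above can be sketched as follows. The tables X and Y and their columns id and a are hypothetical. Inside j, a refers to the column in X, and the i. prefix refers to the same-named column in Y.

```r
library(data.table)

# Hypothetical tables sharing the join column 'id' and a value column 'a'.
X <- data.table(id = c("p", "p", "q", "r"), a = 1:4)
Y <- data.table(id = c("p", "q"), a = c(10, 100))

# Join on 'id' and aggregate in one step: j is evaluated once per row of Y.
# 'a' is X's column; 'i.a' is Y's column of the same name.
res <- X[Y, .(total = sum(a * i.a)), on = "id", by = .EACHI]
print(res)
```

Row "r" of X has no partner in Y, so it does not contribute to any group in the result.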
keyby Same as by, but with an additional setkey() run on the by columns of
the result, for convenience. It is common practice to use 'keyby=' routinely
when you wish the result to be sorted.
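A short sketch of the difference between by and keyby, using a hypothetical table with a grouping column g: by keeps groups in order of first appearance, while keyby sorts the result and sets a key on it.

```r
library(data.table)

# Hypothetical data with groups appearing in the order b, a.
DT <- data.table(g = c("b", "a", "b", "a"), v = 1:4)

by_res    <- DT[, .(total = sum(v)), by = g]     # groups in order of appearance
keyby_res <- DT[, .(total = sum(v)), keyby = g]  # groups sorted, key set on g

print(by_res)
print(keyby_res)
print(key(keyby_res))
```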
with By default with=TRUE and j is evaluated within the frame of x; column
names can be used as variables.
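As an illustration, with the default with=TRUE a column name in j behaves like a variable. The with=FALSE form shown for contrast is taken from data.table's general behaviour rather than from the text above, and the table and the variable cols are hypothetical.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# Default with=TRUE: 'a' is looked up as a column of DT.
doubled <- DT[, a * 2]

# with=FALSE: j is treated as column names (or positions) to select.
cols <- "b"
sub <- DT[, cols, with = FALSE]

print(doubled)
print(sub)
```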
roll When i is a data.table and its row matches to all but the last x join
column, and its value in the last i join column falls in a gap (including
after the last observation in x for that group), then:
+Inf (or TRUE) rolls the prevailing value in x forward. It is also known
as last observation carried forward (LOCF).
When roll is a finite number, that limit is also applied when rolling the
ends.
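A minimal sketch of a rolling join, assuming a hypothetical price table and a set of query times that fall into gaps between observations. The first join carries the last observation forward without limit. The second applies a finite limit of 5 time units, which also applies past the last observation.

```r
library(data.table)

# Hypothetical observations for one symbol at times 1, 5 and 10.
prices  <- data.table(sym = "A", time = c(1L, 5L, 10L), price = c(100, 101, 102))
# Query times that fall in gaps, including after the last observation.
queries <- data.table(sym = "A", time = c(3L, 7L, 20L))

# roll=TRUE: last observation carried forward (LOCF).
locf <- prices[queries, on = .(sym, time), roll = TRUE]

# roll=5: only roll forward across gaps of at most 5 time units.
limited <- prices[queries, on = .(sym, time), roll = 5]

print(locf)
print(limited)
```

The query at time 20 is 10 units past the last observation, so it receives NA under the finite limit.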
which TRUE returns the row numbers of x that i matches to. If NA, returns the
row numbers of i that have no match in x. By default FALSE and the rows
in x that match are returned.
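The two non-default settings of which can be sketched on hypothetical tables X and Y joined on id:

```r
library(data.table)

X <- data.table(id = c("a", "b", "c", "b"), v = 1:4)
Y <- data.table(id = c("b", "z"))

# which=TRUE: row numbers of X matched by each row of Y (NA where no match).
matched_rows <- X[Y, on = "id", which = TRUE]

# which=NA: row numbers of Y that have no match in X.
unmatched_i <- X[Y, on = "id", which = NA]

print(matched_rows)
print(unmatched_i)
```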
.SDcols Specifies the columns of x to be included in the special symbol .SD, which
stands for Subset of data.table. May be character column names
or numeric positions. This is useful for speed when applying a function
to a subset of (possibly very many) columns; e.g., DT[,
lapply(.SD, sum), by="x,y", .SDcols=301:350].
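A small-scale sketch of the same pattern, on a hypothetical four-column table, selecting the .SD columns first by name and then by position:

```r
library(data.table)

DT <- data.table(g = c("x", "x", "y"), a = 1:3, b = 4:6, c = 7:9)

# Sum only columns 'a' and 'b' within each group of 'g'.
sums_by_name <- DT[, lapply(.SD, sum), by = g, .SDcols = c("a", "b")]

# The same selection by numeric position in DT (columns 2 and 3).
sums_by_pos <- DT[, lapply(.SD, sum), by = g, .SDcols = 2:3]

print(sums_by_name)
```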
verbose TRUE turns on status and information messages to the console. Turn this
on by default using options(datatable.verbose=TRUE). The
quantity and types of verbosity may be expanded in future.
allow.cartesian FALSE prevents joins that would result in more
than nrow(x)+nrow(i) rows. This is usually caused by duplicate
values in i's join columns, each of which joins to the same group in x over
and over again: a misspecified join. Usually this was not intended and the
join needs to be changed. The word 'cartesian' is used loosely in this
context. The traditional cartesian join, where every row in i joins to every
row in x (an nrow(x)*nrow(i) row result), is (deliberately) difficult to
achieve in data.table. 'cartesian' is just meant in a
'large multiplicative' sense.
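The row-count arithmetic above can be sketched with two hypothetical tables whose join column holds the same value ten times each. The default call is wrapped in tryCatch so the outcome can be inspected either way. Under the rule described above it is expected to stop, since 100 rows exceeds nrow(x)+nrow(i) = 20.

```r
library(data.table)

X <- data.table(k = rep("a", 10), xv = 1:10)
Y <- data.table(k = rep("a", 10), yv = 1:10)

limit <- nrow(X) + nrow(Y)   # 20

# Default allow.cartesian=FALSE: capture the error message if the join stops.
default_outcome <- tryCatch(
  nrow(X[Y, on = "k"]),
  error = function(e) conditionMessage(e)
)
print(default_outcome)

# Explicitly permit the large multiplicative result: 10 * 10 rows.
big <- X[Y, on = "k", allow.cartesian = TRUE]
print(nrow(big))
```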
drop Never used by data.table. Do not use. It needs to be here
because data.table inherits from data.frame. See datatable-faq.
Details
data.table builds on base R functionality to reduce 2 types of time: programming time (easier
to write, read, debug and maintain), and compute time (fast and memory efficient). The general
form of data.table syntax is:
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
The way to read this out loud is: "Take DT, subset rows by i, then compute j grouped by by."
Here are some basic usage examples expanding on this definition. See the vignette (and examples)
for working examples.
X[, a]                      # return col 'a' from X as vector. If not
                            # found, search in parent frame.
X[, .(a)]                   # same as above, but return as a data.table.
X[, sum(a)]                 # return sum(a) as a vector (with same
                            # scoping rules as above)
X[, .(sum(a)), by=c]        # get sum(a) grouped by 'c'.
X[, sum(a), by=c]           # same as above, .() can be omitted in j and by
                            # on single expression for convenience
X[, sum(a), by=c:f]         # get sum(a) grouped by all columns in
                            # between 'c' and 'f' (both inclusive)
X[, sum(a), keyby=b]        # get sum(a) grouped by 'b', and sort that
                            # result by the grouping column 'b'
X[, sum(a), by=b][order(b)] # same order as above, but by chaining
                            # compound expressions
X[c>1, sum(a), by=c]        # get rows where c>1 is TRUE, and on those
                            # rows, get sum(a) grouped by 'c'
X[Y, .(a, b), on="c"]       # get rows where Y$c == X$c, and select
                            # columns 'X$a' and 'X$b' for those rows
X[Y, .(a, i.a), on="c"]     # get rows where Y$c == X$c, and then select
                            # 'X$a' and 'Y$a' (=i.a)
X[Y, sum(a*i.a), on="c", by=.EACHI] # for *each* 'Y$c', get sum(a*i.a) on
                                    # matching rows in 'X$c'
See the see also section for the several other methods that are available for operating on
data.tables efficiently.
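To make a few of the forms above concrete, here is a runnable sketch on a hypothetical table X with columns a, b and c. The data are invented for illustration. Unnamed results of j get the default column name V1.

```r
library(data.table)

# Hypothetical data: 'a' is the value column, 'b' and 'c' are grouping columns.
X <- data.table(a = 1:6,
                b = c("u", "v", "u", "v", "u", "v"),
                c = c(1L, 1L, 2L, 2L, 3L, 3L))

vec      <- X[, a]                        # column 'a' as a vector
tab      <- X[, .(a)]                     # column 'a' as a one-column data.table
total    <- X[, sum(a)]                   # a single number
by_c     <- X[, .(total = sum(a)), by = c]  # one row per value of 'c'
filtered <- X[c > 1, sum(a), by = c]      # subset rows first, then group
sorted   <- X[, sum(a), keyby = b]        # grouped and sorted by 'b'

print(by_c)
print(filtered)
print(sorted)
```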
Note
If keep.rownames or check.names are supplied, they must be written in full because R does
not allow partial argument names after '...'. For example, data.table(DF,
keep=TRUE) will create a column called "keep" containing TRUE, and this is correct
behaviour; data.table(DF, keep.rownames=TRUE) was intended.
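The difference between the two calls can be seen on a hypothetical data.frame DF with explicit row names:

```r
library(data.table)

# Hypothetical data.frame with meaningful row names.
DF <- data.frame(v = 1:3, row.names = c("r1", "r2", "r3"))

# Partial name: 'keep' is taken as a new column, recycled to 3 rows.
wrong <- data.table(DF, keep = TRUE)

# Full name: the row names are retained in a column named 'rn'.
right <- data.table(DF, keep.rownames = TRUE)

print(names(wrong))
print(names(right))
```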
POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime.
POSIXlt columns are implicitly converted to POSIXct with a warning. You may also be interested
in IDateTime instead; it has methods to convert to and from POSIXlt.
Examples
## Not run:
example(data.table) # to run these examples at the prompt
## End(Not run)
is.data.frame(DT) # TRUE
tables()
# joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X
# setting keys
kDT = copy(DT)                # (deep) copy DT to kDT to work with it.
setkey(kDT,x)                 # set a 1-column key. No quotes, for convenience.
setkeyv(kDT,"x")              # same (v in setkeyv stands for vector)
v="x"
setkeyv(kDT,v)                # same
# key(kDT)<-"x"               # copies whole table, please use set*
                              # functions instead
haskey(kDT) # TRUE
key(kDT) # "x"
# all together
kDT[!"a", sum(v), by=.EACHI] # get sum(v) for each i != "a"
# multi-column key
setkey(kDT,x,y) # 2-column key
setkeyv(kDT,c("x","y")) # same
DT[, list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x, y%%2)] # by 2 expressions
# using rleid, get max(y) and min of all cols in .SDcols for each consecutive
# run of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
## Not run:
if (interactive()) {
vignette("datatable-intro")
vignette("datatable-reference-semantics")
vignette("datatable-keys-fast-subset")
vignette("datatable-secondary-indices-and-auto-indexing")
vignette("datatable-reshape")
vignette("datatable-faq")
# get the latest devel version (compiled binary for Windows available -- no
# tools needed)
# https://github.com/Rdatatable/data.table/wiki/Installation
}
## End(Not run)