Collapse Cheat Sheet
Introduction

collapse is a C/C++ based package supporting advanced (grouped, weighted, time series, panel data and recursive) statistical operations in R, with very efficient low-level vectorizations across both groups and columns.

It also offers a flexible, class-agnostic approach to data transformation in R: handling matrix and data frame based objects in a uniform, attribute-preserving way, and ensuring seamless compatibility with dplyr / (grouped) tibble, data.table, xts, sf and plm classes for panel data ('pseries', 'pdata.frame').

collapse provides full control to the user for statistical programming, with several ways to reach the same outcome and rich optimization possibilities. The default is na.rm = TRUE, implemented at very low cost at the algorithm level.

Calling help("collapse-documentation") brings up a detailed documentation, which is also available online. See also the fastverse package/project for a recommended set of complementary packages and easy package management.

Row/Column Arithmetic (by Reference)

Column-wise sweeping out of vectors/matrices/DFs/lists
%cr%, %c+%, %c-%, %c*%, %c/% e.g. Z = X %c/% rowSums(X)

Fast Statistical Functions

Fast functions to perform column-wise grouped and weighted computations on matrix-like objects

fmean, fmedian, fmode, fsum, fprod, fsd, fvar
fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct

Syntax
FUN(x, g = NULL, [w = NULL], TRA = NULL,
    [na.rm = TRUE], use.g.names = TRUE,
    [drop = TRUE], [nthreads = 1L])

x     vector, matrix, or (grouped) data frame / list
g     [optional] (list of) vectors / factors or GRP() object
w     [optional] vector of (frequency) weights
TRA   [optional] operation to transform data with computed statistics (see FUN argument to TRA() and Examples)
drop  drop matrix / data frame dimensions. default TRUE

Examples
fmean(AirPassengers) # Vector
## [1] 280.2986
fmean(AirPassengers, w = cycle(AirPassengers)) # Weighted mean
## [1] 284.3397

Grouping and Ordering

Optimized functions for grouping, ordering, unique values, splitting & recombining, and dealing with factors

GRP() - create a grouping object (class 'GRP'): pass to g arg.
g <- GRP(iris, ~ Species) # or GRP(iris$Species) or GRP(iris["Species"])
fndistinct(iris[1:4], g) # Computation without grouping overhead
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa               15          16            9           6
## versicolor           21          14           19           9
## virginica            21          13           20          12

fgroup_by() - attach 'GRP' object to data: a class-agnostic grouped frame supporting fast computations
mtcars |> fgroup_by(cyl, vs, am) |> ss(1:2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
##
## Grouped by: cyl, vs, am [7 | 5 (3.8) 1-12]
# Group Stats: [N. groups | mean (sd) min-max of group sizes]
# Fast Functions also have a grouped_df method: here wt-weighted medians
mtcars |> fgroup_by(cyl, vs, am) |> fmedian(wt) |> head(3)
##   cyl vs am sum.wt  mpg  disp hp drat  qsec gear carb
## 1   4  0  1  2.140 26.0 120.3 91 4.43 16.70    5    2
## 2   4  1  0  8.805 22.8 140.8 95 3.70 20.01    4    2
## 3   4  1  1 14.198 30.4  79.0 66 4.08 18.61    4    1

GRPN(), fgroup_vars(), fungroup() - get group count, grouping columns/variables, and ungroup data

Fast Data Manipulation

Minimal overhead implementations

fselect[<-]() - select/replace columns
fsubset() - subset data (rows and columns)
ss() - fast alternative to [, particularly for data frames
[row|col]order[v]() - reorder (sort) rows and columns
fmutate(), fsummarise() - dplyr-like, incl. across() feature
[f|set]transform[v][<-]() - transform cols (by reference)
fcompute[v]() - compute new cols dropping existing ones
[f|set]rename() - rename (any object with 'names' attribute)
[set]relabel() - assign/change variable labels ('label' attr.)
get_vars[<-]() - select/replace columns (standard eval.)
[num|cat|char|fact|logi|date]_vars[<-]() - select/replace columns by data type or retrieve names/indices
add_vars[<-]() - add or column-bind columns

Examples
mtcars |> fsubset(mpg > fnth(mpg, 0.95), disp:wt, cylinders = cyl)
##                disp hp drat    wt cylinders
## Fiat 128       78.7 66 4.08 2.200         4
## Toyota Corolla 71.1 65 4.22 1.835         4
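
The data manipulation verbs above can be chained with the base pipe. The following is a minimal sketch, assuming the collapse package is installed; the column name kpl, the conversion factor and the hp > 100 threshold are illustrative choices, not part of the cheat sheet.

```r
library(collapse)

res <- mtcars |>
  fsubset(hp > 100, mpg, cyl, wt) |>      # rows with hp > 100, keeping three columns
  fmutate(kpl = mpg * 0.425144) |>        # illustrative derived column (km per litre)
  fgroup_by(cyl) |>                       # attach a 'GRP' object to the data
  fsummarise(n         = fnobs(mpg),      # group sizes (non-missing mpg values)
             mean_kpl  = fmean(kpl),      # grouped mean
             wmean_mpg = fmean(mpg, w = wt))  # wt-weighted grouped mean

res
```

Each Fast Statistical Function inside fsummarise() receives the grouping from fgroup_by(), so no g argument is passed explicitly.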
## Mazda RX4     6 2.620 0.9691687 0.386125
## Mazda RX4 Wag 6 2.875 0.9691687 0.386125
# Much shorter than fsubset(mpg > fmean(mpg, cyl, TRA = "replace"))
mtcars |> fsubset(mpg > B(mpg, cyl)) |> head(2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
# Regression with cyl fixed effects - a la Mundlak (1978)
lm(mpg ~ carb + B(carb, cyl), data = mtcars) |> coef()
##  (Intercept)         carb B(carb, cyl)
##    34.829652    -0.465511    -4.775032
# Fast grouped (vs) bivariate regression slopes: mpg ~ carb
mtcars |> fgroup_by(vs) |> fmutate(dm_carb = W(carb)) |>
  fsummarise(beta = fsum(mpg, dm_carb) %/=% fsum(dm_carb^2))
##   vs       beta
## 1  0 -0.5557241
## 2  1 -2.0706468
# Residuals from regressing on 'Petal' vars and 'Species' FE
fhdwithin(iris[1:2], iris[3:5]) |> head(2)
##   Sepal.Length Sepal.Width
## 1   0.14989286   0.1102684
## 2  -0.05010714  -0.3897316
# Detrending with country-level cubic polynomials
HDW(wlddev, PCGDP + LIFEEX + POP ~ iso3c * poly(year, 3)) |> head(2)
##    HDW.PCGDP HDW.LIFEEX   HDW.POP
## 43 -258.4069  0.2360285 -317459.1
## 44 -119.5600  0.1136432  -33900.2
# Note: HD centering/prediction and polynomials require package 'fixest'

Linear Models

Fast (barebones) linear model fitting with 6 different solvers
flm(y, X, w = NULL, add.icpt = FALSE, method = "lm")

Fast R2-based F-test of exclusion restrictions for lm's (with FE)
fFtest(y, exc, X = NULL, w = NULL, full.df = TRUE)

Both functions also have formula interfaces:
flm(cbind(mpg, disp) ~ hp + carb, weights = wt, mtcars)
##                     mpg       disp
## (Intercept) 28.48401839  42.155002
## hp          -0.06834996   2.101036
## carb         0.33207257 -38.183910
# Test the exclusion of cyl-dummies and hp.
fFtest(mpg ~ qF(cyl) + hp | carb + qF(am), weights = wt, mtcars)
##                  R-Sq. DF1 DF2 F-Stat. P-Value
## Full Model       0.812   5  26  22.479   0.000
## Restricted Model 0.674   2  29  30.041   0.000
## Exclusion Rest.  0.138   3  26   6.351   0.002

## Indexed by: iso3c [1] | year [2 (61)]
# Index stats: [N. ids] | [N. periods (tot.N. periods: (max-min)/GCD)]
LIFEEXi = wldi$LIFEEX # Indexed series
str(LIFEEXi, strict.width = "cut")
## 'indexed_series' num [1:13176] 32.4 33 33.5 34 34.5 ...
## - attr(*, "index_df")=Classes 'index_df', 'pindex' and 'data.frame'..
## ..$ iso3c: Factor w/ 216 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 ..
## ..$ year : Ord.factor w/ 61 levels "1960"<"1961"<..: 1 2 3 4 5 6 7..
LIFEEXi[1:7] # Subsetting indexed series
## [1] 32.446 32.962 33.471 33.971 34.463 34.948 35.430
##
## Indexed by: iso3c [1] | year [7 (61)]
c(is_irregular(LIFEEXi), is_irregular(LIFEEXi[-5])) # Is irregular?
## [1] FALSE TRUE

Note: 'indexed series' and frames are supported via existing 'pseries'/'pdata.frame' methods for time series/panel functions.

Fast functions to perform time-based computations on (irregular) time series and (unbalanced) panel data

Lags/Leads, Differences, Growth Rates and Cumulative Sums
flag(x, n = 1, g = NULL, t = NULL, fill = NA, ...)
fdiff(x, n = 1, diff = 1, g = NULL, t = NULL,
      fill = NA, log = FALSE, rho = 1, ...)
fgrowth(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA,
        logdiff = FALSE, scale = 100, power = 1, ...)
fcumsum(x, g = NULL, o = NULL, na.rm = TRUE,
        fill = FALSE, check.o = TRUE, ...)

Statistical Operators: L(), F(), D(), Dlog(), G()

Example: Computing Growth Rates
# Ad-hoc use: note that G() supports formulas which fgrowth() doesn't
fgrowth(AirPassengers) |> head()
## [1] NA 5.357143 11.864407 -2.272727 -6.201550 11.570248
G(wlddev, c(1, 10), by = PCGDP ~ iso3c, t = ~ year) |> ss(11:12)
##   iso3c year G1.PCGDP L10G1.PCGDP
## 1   AFG 1970       NA          NA
## 2   AFG 1971       NA          NA
wlddev |> fgroup_by(iso3c) |> fselect(iso3c, year, PCGDP, LIFEEX) |>
  fmutate(PCGDP_growth = fgrowth(PCGDP, t = year)) |> head(2)
##   iso3c year PCGDP LIFEEX PCGDP_growth
## 1   AFG 1960    NA 32.446           NA
## 2   AFG 1961    NA 32.962           NA
settransform(wlddev, PCGDP_growth = G(PCGDP, g = iso3c, t = year))
# Note: can omit t -> requires consecutive observations and groups
# Usage with indexed series / frames:

varying() - check variation within groups (panel-ids)
pwcor(), pwcov(), pwnobs() - pairwise correlations, covariance and obs. (with P-value and pretty printing)

List Processing

Functions to process (nested) lists (of data objects)

ldepth() - level of nesting of list
is_unlistable() - is list composed of atomic objects
has_elem() - search if list contains certain elements
get_elem() - pull out elements from list / subset list
atomic_elem[<-](), list_elem[<-]() - get list with atomic / sub-list elements, examining only first level of list
reg_elem(), irreg_elem() - get full list tree leading to atomic ('regular') or non-atomic ('irregular') elements
rsplit() - efficient (recursive) splitting
t_list() - efficient list transpose (transpose lists of lists)
rapply2d() - recursive apply to lists of data objects
unlist2d() - recursive row-binding to data.frame

Example: Nested Linear Models
(dl <- mtcars |> rsplit(mpg + hp + carb ~ vs + am)) |> str(max.level = 2)
## List of 2
## $ 0:List of 2
## ..$ 0:'data.frame': 12 obs. of 3 variables:
## ..$ 1:'data.frame': 6 obs. of 3 variables:
## $ 1:List of 2
## ..$ 0:'data.frame': 7 obs. of 3 variables:
## ..$ 1:'data.frame': 7 obs. of 3 variables:
nest_lm <- dl |> rapply2d(lm, formula = mpg ~ .)
(nest_coef <- nest_lm |> rapply2d(summary, classes = "lm") |>
  get_elem("coefficients")) |> str(give.attr = FALSE, strict = "cut")
## List of 2
## $ 0:List of 2
## ..$ 0: num [1:3, 1:4] 15.8791 0.0683 -4.5715 3.655 0.0345 ...
## ..$ 1: num [1:3, 1:4] 26.9556 -0.0319 -0.308 2.293 0.0149 ...
## $ 1:List of 2
## ..$ 0: num [1:3, 1:4] 30.896903 -0.099403 -0.000332 3.346033 0.035..
## ..$ 1: num [1:3, 1:4] 37.0012 -0.1155 0.4762 7.3316 0.0894 ...
nest_coef |> unlist2d(c("vs", "am"), row.names = "variable") |> head(2)
##   vs am    variable    Estimate Std. Error  t value    Pr(>|t|)
## 1  0  0 (Intercept) 15.87914500 3.65495315 4.344555 0.001865018
## 2  0  0          hp  0.06832467 0.03449076 1.980956 0.078938069

## [1] "var1" "var2" "var3"
.c(values, vectors) %=% eigen(cov(mtcars)) # Multiple Assignment
# Variable labels: vlabels[<-], [set]relabel() etc. namlab() shows summary
namlab(wlddev[c(2, 9)], N = TRUE, Ndist = TRUE, class = TRUE)
##   Variable   Class     N Ndist                              Label
## 1    iso3c  factor 13176   216                       Country Code
## 2    PCGDP numeric  9470  9470 GDP per capita (constant 2010 US$)

API Extensions

Shorthands for frequently used functions
fselect -> slt, fsubset -> sbt, fmutate -> mtt, [f/set]transform[v] -> [set]tfm[v], fsummarise -> smr, across -> acr, fgroup_by -> gby, finteraction -> itn, findex_by -> iby, findex -> ix, frename -> rnm, get_vars -> gv, num_vars -> nv, add_vars -> av

Namespace masking
Can set options(collapse_mask = c(...)) with a vector of functions starting with f-, to export versions without f-, masking base R or dplyr. A few keywords exist to mask multiple functions, see help("collapse-options"). This allows clean & fast code, but poses additional namespace challenges:

# Masking all f- functions and specials n = GRPN and table = qtab
options(collapse_mask = "all")
library(collapse)
# The following is 100% collapse code, apart from the base pipe
wlddev |>
  subset(year >= 1990) |>
  group_by(year) |>
  summarise(n = n(), across(PCGDP:GINI, mean, w = POP))
with(mtcars, table(cyl, vs, am))
sum(mtcars)
diff(EuStockMarkets)
droplevels(wlddev)
mean(nv(iris), g = iris$Species)
scale(nv(GGDC10S), g = GGDC10S$Variable)
unique(GGDC10S, cols = c("Variable", "Country"))
range(wlddev$date)
wlddev |>
  index_by(iso3c, year) |>
  mutate(PCGDP_lag = lag(PCGDP),
         PCGDP_diff = PCGDP - PCGDP_lag,
         PCGDP_growth = growth(PCGDP)) |> unindex()

The best way to set this option is inside an .Rprofile file placed in the user or project directory. Use it carefully.
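
A narrower alternative to the "all" keyword is to mask only a few functions. The lines below are a sketch of what an .Rprofile entry might look like; the choice of the three functions is an illustrative assumption, and the option is assumed to take effect only if it is set before collapse is loaded.

```r
# Hypothetical .Rprofile entry (user or project directory).
# Assumption: the option is read when collapse is loaded, so it must be set first.
options(collapse_mask = c("fsubset", "fgroup_by", "fsummarise"))

# In a later session, after library(collapse), the unprefixed names
# subset(), group_by() and summarise() would then refer to the collapse versions:
# library(collapse)
# mtcars |> subset(mpg > 20) |> group_by(cyl) |> summarise(mean_hp = fmean(hp))
```

Masking only the functions actually used keeps the remaining base R and dplyr names untouched, which limits the namespace surprises mentioned above.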
CC-BY-SA Sebastian Krantz • Learn more at sebkrantz.github.io/collapse • Source code at github.com/SebKrantz/collapse • Updates announced at twitter.com/collapse_R - #rcollapse • Cheatsheet created for collapse version 1.8.8 • Updated: 2022-08