An R Package for Actuarial Science
Vincent Goulet
The actuar project is a package of Actuarial Science functions for the R statistical system. The project was launched in 2005 and the package has been available on CRAN (Comprehensive R Archive Network) since February 2006. The current version of the package contains functions for use in the fields of risk theory, loss distributions and credibility theory. This paper presents in detail, but in non-technical terms, the most recent version of the package.
The package is released under the GNU General Public License (GPL), version 2 or newer, thereby making it free software that anyone can use, modify and redistribute, to the extent that the derivative work is also released under the GPL.
3 Documentation
It is a requirement of the R packaging system that every function and data set in a package has a help page. The actuar package follows this requirement strictly. The help page of function foo is accessible by typing
> ?foo
or
> help("foo")
4.1 Probability laws
R already includes functions to compute the density function, cumulative distribution function and quantile function of, and to generate variates from, a fair number of probability laws. For some root foo, the functions are named dfoo, pfoo, qfoo and rfoo, respectively.
The actuar package provides d, p, q and r functions for all the probability laws useful for loss severity modeling found in Appendix A of Klugman et al. (2004) that are not already present in base R, excluding the inverse Gaussian and log-t but including the loggamma distribution (Hogg and Klugman, 1984). Among others, the functions for the Pareto distribution are a most welcome addition.
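For instance, basic usage of the Pareto functions follows the familiar pattern; a short illustration, with purely illustrative parameter values:

> dpareto(10, shape = 3, scale = 100)   # density at x = 10
> ppareto(10, shape = 3, scale = 100)   # cdf at x = 10
> qpareto(0.5, shape = 3, scale = 100)  # median
> rpareto(5, shape = 3, scale = 100)    # five random variates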
Tables 1–3 list the supported distributions, classified by family. Each table gives the name of each distribution as in Klugman et al. (2004), the root name of the R functions and the argument name corresponding to each parameter in the parametrization of Klugman et al. (2004). One will note that, by default, all functions (except those for the Pareto distribution) use a rate parameter equal to the inverse of the scale parameter. This differs from Klugman et al. (2004) but is more in line with the functions for the gamma, exponential and Weibull distributions in base R.
All functions are written in C for speed. In almost every respect, they behave just like the base R functions.
In addition to the d, p, q and r functions, the package provides m and lev functions to compute the theoretical raw and limited moments, respectively. All the probability laws of Tables 1–3 are supported, plus the following ones already in R: exponential, gamma, lognormal and Weibull. The m and lev functions are especially useful with estimation methods based on the matching of raw or limited moments. See Subsection 4.4 for their empirical counterparts.
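For example, under the naming scheme above, the second raw moment and the limited expected value at 1000 of a gamma distribution would be obtained as follows (parameter values are again illustrative):

> mgamma(2, shape = 2, rate = 0.01)       # E[X^2]
> levgamma(1000, shape = 2, rate = 0.01)  # E[min(X, 1000)]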
Table 1: Supported distributions from the Transformed Beta family, root name of the R functions and argument name corresponding to each parameter in the parametrization of Klugman et al. (2004).

Distribution name      Root (alias)        Arguments
Transformed beta       trbeta (pearson6)   shape1 (α), shape2 (γ), shape3 (τ), rate (λ = 1/θ), scale (θ)
Burr                   burr                shape1 (α), shape2 (γ), rate (λ = 1/θ), scale (θ)
Loglogistic            llogis              shape (γ), rate (λ = 1/θ), scale (θ)
Paralogistic           paralogis           shape (α), rate (λ = 1/θ), scale (θ)
Generalized Pareto     genpareto           shape1 (α), shape2 (τ), rate (λ = 1/θ), scale (θ)
Pareto                 pareto (pareto2)    shape (α), scale (θ)
Inverse Burr           invburr             shape1 (τ), shape2 (γ), rate (λ = 1/θ), scale (θ)
Inverse Pareto         invpareto           shape (τ), scale (θ)
Inverse paralogistic   invparalogis        shape (τ), rate (λ = 1/θ), scale (θ)
Table 2: Supported distributions from the Transformed Gamma family, root name of the R functions and argument name corresponding to each parameter in the parametrization of Klugman et al. (2004).

Distribution name           Root (alias)             Arguments
Transformed gamma           trgamma                  shape1 (α), shape2 (τ), rate (λ = 1/θ), scale (θ)
Inverse transformed gamma   invtrgamma               shape1 (α), shape2 (τ), rate (λ = 1/θ), scale (θ)
Inverse gamma               invgamma                 shape (α), rate (λ = 1/θ), scale (θ)
Inverse Weibull             invweibull (lgompertz)   shape (τ), rate (λ = 1/θ), scale (θ)
Inverse exponential         invexp                   rate (λ = 1/θ), scale (θ)
with the c_j's, omitting c_0 or not, etc. Moreover, with appropriate extraction, replacement and summary functions, manipulation of grouped data becomes similar to that of individual data.
First, function grouped.data creates a grouped data object similar to, and inheriting from, a data frame. The input of the function is a vector of group boundaries c_0, c_1, ..., c_r and one or more vectors of group frequencies n_1, ..., n_r. Note that there should be one more group boundary than group frequencies. Furthermore, the function assumes that the intervals are contiguous.
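For example, a grouped data object with two columns of frequencies could be created as follows; the group boundaries here are illustrative assumptions, while the frequencies match the displays further down:

> x <- grouped.data(Group = c(0, 25, 50, 100, 150, 250, 500),
+                   Line.1 = c(30, 31, 57, 42, 65, 84),
+                   Line.2 = c(26, 33, 31, 19, 16, 11))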
> class(x)
[1] "grouped.data" "data.frame"
Second, the package supports the most common extraction and replacement methods for "grouped.data" objects through the usual [ and [<- operators. In particular, the following extraction operations are supported.
i) Extraction of the vector of group boundaries (the first column):
> x[, 1]
ii) Extraction of the vector or matrix of group frequencies (the second and
third columns):
> x[, -1]
Line.1 Line.2
1 30 26
2 31 33
3 57 31
4 42 19
5 65 16
6 84 11
> x[1:3, ]
Replacement operations are also supported with [<-. For instance, the following call replaces the boundaries of the first group:
> x[1, 1] <- c(0, 20)
The package also defines a method of mean for grouped data objects:
> mean(x)
Line.1 Line.2
 188.0  108.2
Higher empirical moments can be computed with emm; see Subsection 4.4.
The R function hist splits individual data into groups and draws a histogram of the frequency distribution. The package introduces a method for data that are already grouped. Only the first column of frequencies is considered (see Figure 1 for the resulting graph):
> hist(x[, -3])
R has a function ecdf to compute the empirical cumulative distribution function (cdf) of an individual data set,

    F_n(x) = \frac{1}{n} \sum_{j=1}^{n} I\{x_j ≤ x\},    (2)
[Figure 1: histogram of the grouped data object, as drawn by hist(x[, -3]); the vertical axis shows the density.]
The package includes a function ogive that otherwise behaves exactly like ecdf. In particular, methods for functions knots and plot allow one to obtain, respectively, the knots c_0, c_1, ..., c_r of the ogive and a graph (see Figure 2):
> Fnt <- ogive(x)
[Figure 2: graph of the ogive of the grouped data object x; F(x) rises from 0 to 1 over the group boundaries.]
> knots(Fnt)
> Fnt(knots(Fnt))
> plot(Fnt)
> data(dental)
> dental
> data(gdental)
> gdental
cj nj
1 (0, 25] 30
2 ( 25, 50] 31
3 ( 50, 100] 57
4 (100, 150] 42
5 (150, 250] 65
6 (250, 500] 84
7 (500, 1000] 45
8 (1000, 1500] 10
9 (1500, 2500] 11
10 (2500, 4000] 3
Second, in the same spirit as ecdf and ogive, function elev returns a function to compute the empirical limited expected value, or first limited moment, of a sample for any limit. Again, there are methods for individual and grouped data (see Figure 3 for the graphs).
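The calls producing the values shown below would take the following form (the object name lev is ours):

> lev <- elev(dental)
> lev(knots(lev))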
[1] 16.0 37.6 42.4 85.1 105.5 164.5 187.7 197.9 241.1
[10] 335.5
[Figure 3: empirical limited expected value functions of the dental (left) and gdental (right) data sets; panels titled elev(x = dental) and elev(x = gdental), with the empirical LEV plotted against the limit x.]
where n = \sum_{j=1}^{r} n_j. By default, w_j = n_j^{-1}.
3. The layer average severity method (LAS) applies to grouped data only
and minimizes the squared difference between the theoretical and
empirical limited expected value within each group:
    d(θ) = \sum_{j=1}^{r} w_j \bigl( LAS(c_{j-1}, c_j; θ) − \widetilde{LAS}_n(c_{j-1}, c_j; θ) \bigr)^2,    (7)
The fits obtained under the three measures are, in order:

    rate  distance
0.003551  0.002842

    rate  distance
 0.00364     13.54

    rate  distance
0.002966     694.5
> mde(gdental, ppareto, start = list(shape = 3, scale = 600),
+ measure = "CvM")
Working in the log of the parameters often solves the problem, since the optimization routine can then flawlessly work with negative parameter values.
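A sketch of such a fit, assuming a wrapper pparetolog defined for the occasion (the wrapper and the starting values are ours; the output below corresponds to a fit of this form):

> pparetolog <- function(x, logshape, logscale)
+     ppareto(x, exp(logshape), exp(logscale))
> ( p <- mde(gdental, pparetolog,
+            start = list(logshape = log(3), logscale = log(600)),
+            measure = "CvM") )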
logshape logscale
1.581 7.128
distance
0.0007905
> exp(p$estimate)
logshape logscale
4.861 1246.485
of u). Then the definition of Y is

    Y = \begin{cases}
        \text{undefined}, & X < d \\
        X − d,            & d ≤ X ≤ u \\
        u − d,            & X ≥ u
    \end{cases}    (8)
[1] 0
[1] 0.1343
[1] 0.02936
[1] 0
See the "coverage" vignette for the detailed pdf and cdf formulas under various combinations of coverage modifications.
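These modified distributions are computed with the package's coverage function; a minimal sketch, with an illustrative gamma loss and arbitrary deductible and limit:

> f <- coverage(pdf = dgamma, cdf = pgamma, deductible = 1, limit = 10)
> f(3, shape = 5, rate = 1)  # density of the payment variable at y = 3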
5 Risk theory
The current version of actuar addresses only one risk theory problem, with two user-visible functions: the calculation of the aggregate claim amount distribution of an insurance portfolio using the classical collective model of risk theory (Klugman et al., 2004; Gerber, 1979; Denuit and Charpentier, 2004; Kaas et al., 2001). That said, the package offers five different calculation or approximation methods for the distribution and four different techniques to discretize a continuous loss variable. Moreover, we feel the implementation described below makes R shine as a computing and modeling platform for such problems.
Let the random variable S represent the aggregate claim amount (or total amount of claims) of a portfolio of independent risks, the random variable N represent the number of claims (or frequency) in the portfolio, and the random variable C_j the amount of claim j (or severity). Then we have the random sum

    S = C_1 + · · · + C_N,    (10)

where we assume that C_1, C_2, ... are mutually independent and identically distributed random variables, each independent of N. The task at hand consists in calculating numerically the cdf of S, given by
    F_S(x) = Pr[S ≤ x]
           = \sum_{n=0}^{∞} Pr[S ≤ x | N = n] p_n
           = \sum_{n=0}^{∞} F_C^{*n}(x) p_n,    (11)
1. Upper discretization, or forward difference of F(x):

    f_x = F(x + h) − F(x)    (13)
2. Lower discretization, or backward difference of F(x):

    f_x = \begin{cases}
        F(a),            & x = a \\
        F(x) − F(x − h), & x = a + h, ..., b
    \end{cases}    (14)
3. Rounding of the random variable. The true cdf passes exactly midway through the steps of the discretized cdf.
4. Unbiased, or local matching of the first moment:

    f_x = \begin{cases}
        \frac{E[X ∧ a] − E[X ∧ (a + h)]}{h} + 1 − F(a),         & x = a \\
        \frac{2 E[X ∧ x] − E[X ∧ (x − h)] − E[X ∧ (x + h)]}{h}, & a < x < b \\
        \frac{E[X ∧ b] − E[X ∧ (b − h)]}{h} − 1 + F(b),         & x = b
    \end{cases}    (16)
The discretized and the true distributions have the same total probability and expected value on (a, b).
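These four methods are available through function discretize. A minimal sketch for the upper and unbiased methods, with an illustrative exponential distribution on (0, 10):

> fu <- discretize(pexp(x, rate = 1), method = "upper",
+                  from = 0, to = 10, step = 0.5)
> fb <- discretize(pexp(x, rate = 1), method = "unbiased",
+                  from = 0, to = 10, step = 0.5,
+                  lev = levexp(x, rate = 1))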
[Figure 4: discretization of the lognormal cdf plnorm(x) on (0, 5) with the upper, lower, rounding and unbiased methods.]
3. Normal approximation of the cdf, that is,

    F_S(x) ≈ Φ\left( \frac{x − µ_S}{σ_S} \right),    (17)

where µ_S = E[S] and σ_S^2 = Var[S]. For most realistic models, this approximation is rather crude in the tails of the distribution.
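As an illustration of the interface, an object like the Fs examined below can be built with aggregateDist; a sketch, assuming a Poisson frequency with mean 10 and a gamma(2, 1) severity discretized on a half-unit grid (the model and its parameters are our assumptions, chosen to be consistent with the reported mean of 20 and the support in steps of 0.5):

> fx <- discretize(pgamma(x, 2, 1), method = "unbiased",
+                  from = 0, to = 22, step = 0.5,
+                  lev = levgamma(x, 2, 1))
> Fs <- aggregateDist("recursive", model.freq = "poisson",
+                     model.sev = fx, lambda = 10, x.scale = 0.5)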
Hence, object Fs contains an empirical cdf with support
> knots(Fs)
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
[12] 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5
[23] 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0
[34] 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5
[45] 22.0 22.5 23.0 23.5 24.0 24.5 25.0 25.5 26.0 26.5 27.0
[56] 27.5 28.0 28.5 29.0 29.5 30.0 30.5 31.0 31.5 32.0 32.5
[67] 33.0 33.5 34.0 34.5 35.0 35.5 36.0 36.5 37.0 37.5 38.0
[78] 38.5 39.0 39.5 40.0 40.5 41.0 41.5 42.0 42.5 43.0 43.5
[89] 44.0 44.5 45.0 45.5 46.0 46.5 47.0 47.5 48.0 48.5 49.0
[100] 49.5 50.0 50.5 51.0 51.5 52.0 52.5 53.0 53.5 54.0 54.5
[111] 55.0 55.5 56.0 56.5 57.0 57.5 58.0 58.5 59.0 59.5 60.0
[122] 60.5 61.0 61.5 62.0 62.5 63.0 63.5 64.0 64.5 65.0 65.5
[133] 66.0 66.5 67.0 67.5 68.0 68.5 69.0 69.5 70.0 70.5 71.0
A nice graph of this function is obtained with plot (see Figure 5). Finally, one can easily compute the mean and obtain the quantiles of the approximate distribution as follows:
> mean(Fs)
[1] 20
> quantile(Fs)
99.9%
49.5
To conclude on the subject, Figure 6 shows the cdf of S using five combinations of discretization and calculation methods supported by actuar. Other combinations are possible.
6 Credibility theory
The credibility theory facilities of actuar consist of one data set and three main functions: the data set of Hachemeister (1975) and functions simpf, bstraub and cm, all presented below.
[Figure 5: aggregate claim amount distribution, recursive method approximation; F_S(x) plotted for x from 0 to 60.]
[Figure 6: aggregate claim amount distribution under five combinations: recursive + unbiased, recursive + upper, recursive + lower, simulation, and normal approximation; F_S(x) plotted for x from 0 to 60.]
> data(hachemeister)
> hachemeister
[4,] 407 396 348 341 315 328
[5,] 2902 3172 3046 3068 2693 2910
weight.7 weight.8 weight.9 weight.10 weight.11 weight.12
[1,] 9456 8003 7365 7832 7849 9077
[2,] 1964 1515 1527 1748 1654 1861
[3,] 1277 1218 896 1003 1108 1121
[4,] 352 331 287 384 321 342
[5,] 3275 2697 2663 3017 3242 3425
The random variables Φ_i, Λ_ij, Ψ_i and Θ_ij are generally seen as risk parameters in the actuarial literature. The w_ijt's are known weights.
Function simpf is presented in the credibility theory section because it
was originally written in this context, but it has much wider applications.
For instance, as mentioned in Subsection 5.2, it is used by aggregateDist
for the approximation of the cdf of S by simulation.
Goulet and Pouliot (2007) describe in detail the model specification
method used in simpf. For the sake of completeness, we briefly outline
this method here.
A hierarchical model is completely specified by the number of nodes at each level (I, J_1, ..., J_I and n_11, ..., n_IJ above) and by the probability laws
at each level. The number of nodes is passed to simpf by means of a named
list where each element is a vector of the number of nodes at a given level.
Vectors are recycled when the number of nodes is the same throughout a
level. Probability models are expressed in a semi-symbolic fashion using an
object of mode "expression". Each element of the object must be named
— with names matching those of the number of nodes list — and should
be a complete call to an existing random number generation function, with
the number of variates omitted. Now, hierarchical models are achieved by
replacing one or more parameters of a distribution at a given level by any
combination of the names of the levels above. If no mixing is to take place
at a level, the model for this level can be NULL.
Function simpf also supports usage of weights, or volumes, in models.
These usually modify the frequency parameters to take into account the
“size” of an insurance contract. The weights will be used in simulation
wherever the name weights appears in a model.
Hence, function simpf has four main arguments: 1) nodes for the number of nodes list; 2) model.freq for the frequency model; 3) model.sev for the severity model; 4) weights for the vector of weights in lexicographic order, that is, all weights of contract 1, then all weights of contract 2, and so on.
For example, assume that I = 2, J_1 = 4, J_2 = 3, n_11 = · · · = n_14 = 4 and n_21 = n_22 = n_23 = 5 in model (20) above, and that the weights are simply simulated from a uniform distribution on (0.5, 2.5).
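Simulation of a data set with simpf is then achieved with a call of the following form; the probability models can be read off the display of pf below, while the object names nodes, mf, ms and wijt are ours (note that 4 × 4 + 3 × 5 = 31 weights are needed):

> nodes <- list(class = 2, contract = c(4, 3),
+               year = c(4, 4, 4, 4, 5, 5, 5))
> mf <- expression(class = rexp(2),
+                  contract = rgamma(class, 1),
+                  year = rpois(weights * contract))
> ms <- expression(class = rnorm(2, sqrt(0.1)),
+                  contract = rnorm(class, 1),
+                  year = rlnorm(contract, 1))
> wijt <- runif(31, 0.5, 2.5)
> pf <- simpf(nodes = nodes, model.freq = mf,
+             model.sev = ms, weights = wijt)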
> pf
Frequency model
class ~ rexp(2)
contract ~ rgamma(class, 1)
year ~ rpois(weights * contract)
Severity model
class ~ rnorm(2, sqrt(0.1))
contract ~ rnorm(class, 1)
year ~ rlnorm(contract, 1)
The aggregate claim amounts and the number of claims per node are obtained with aggregate and frequency, respectively:
> aggregate(pf)

> frequency(pf)
[2,] 1 2 0 0 0 0 NA
[3,] 1 3 0 3 0 2 NA
[4,] 1 4 0 1 1 1 NA
[5,] 2 1 0 1 1 1 2
[6,] 2 2 0 0 0 0 0
[7,] 2 3 3 4 2 2 0
class freq
[1,] 1 17
[2,] 2 16
> severity(pf)
$first
class contract claim.1 claim.2 claim.3 claim.4 claim.5
[1,] 1 1 7.974 23.401 3.153 4.368 11.383
[2,] 1 2 NA NA NA NA NA
[3,] 1 3 3.817 41.979 26.910 4.903 19.078
[4,] 1 4 98.130 50.622 55.705 NA NA
[5,] 2 1 11.793 2.253 2.397 9.472 1.004
[6,] 2 2 NA NA NA NA NA
[7,] 2 3 14.322 11.522 18.966 33.108 15.532
claim.6 claim.7 claim.8 claim.9 claim.10 claim.11
[1,] NA NA NA NA NA NA
[2,] NA NA NA NA NA NA
[3,] NA NA NA NA NA NA
[4,] NA NA NA NA NA NA
[5,] NA NA NA NA NA NA
[6,] NA NA NA NA NA NA
[7,] 14.99 25.11 40.15 17.44 4.426 10.16
$last
NULL
> severity(pf, splitcol = 1)
$first
class contract claim.1 claim.2 claim.3 claim.4 claim.5
[1,] 1 1 3.153 4.368 11.383 NA NA
[2,] 1 2 NA NA NA NA NA
[3,] 1 3 3.817 41.979 26.910 4.903 19.078
[4,] 1 4 98.130 50.622 55.705 NA NA
[5,] 2 1 11.793 2.253 2.397 9.472 1.004
[6,] 2 2 NA NA NA NA NA
[7,] 2 3 33.108 15.532 14.990 25.107 40.150
claim.6 claim.7 claim.8
[1,] NA NA NA
[2,] NA NA NA
[3,] NA NA NA
[4,] NA NA NA
[5,] NA NA NA
[6,] NA NA NA
[7,] 17.44 4.426 10.16
$last
class contract claim.1 claim.2 claim.3
[1,] 1 1 7.974 23.40 NA
[2,] 1 2 NA NA NA
[3,] 1 3 NA NA NA
[4,] 1 4 NA NA NA
[5,] 2 1 NA NA NA
[6,] 2 2 NA NA NA
[7,] 2 3 14.322 11.52 18.97
> weights(pf)
Function simpf was used to simulate the data in Forgues et al. (2006).
6.3 Fitting of hierarchical credibility models
The linear model fitting function of base R is named lm. Since credibility
models are very close in many respects to linear models, and since the
credibility model fitting function of actuar borrows much of its interface
from lm, we named the credibility function cm.
We hope that, in the long term, cm can act as a unified interface for most
credibility models. Currently, it supports the models of Bühlmann (1969)
and Bühlmann and Straub (1970), as well as the hierarchical model of Jewell
(1975). The last model includes the first two as special cases. (According
to our current counting scheme, a Bühlmann-Straub model is a two-level
hierarchical model.)
There are some variations in the formulas of the hierarchical model
in the literature. We estimate the structure parameters as indicated in
Goovaerts and Hoogstad (1987) but compute the credibility premiums as
given in Bühlmann and Jewell (1987) or Bühlmann and Gisler (2005); Goulet
(1998) has all the appropriate formulas for our implementation. For instance, for a three-level hierarchical model like (19)-(20), the best linear prediction of the ratio X_{ij, n_ij + 1} = S_{ij, n_ij + 1} / w_{ij, n_ij + 1} is
and

    m̂ = X_{zzw} = \sum_{i=1}^{I} \frac{z_i}{z_Σ} X_{izw}.
Function cm takes as arguments a formula describing the hierarchical interactions in the data set, a data set containing the variables referenced in the formula, and the names of the columns where the ratios and the weights are to be found in the data set. The data set should be a matrix or a data frame with one column of indexes (numeric or character) for each hierarchical interaction, at least two nodes in each level and more than one period of experience for at least one contract. Missing values are represented by NAs. There can be contracts with no experience (complete lines of NAs).
The function returns a fitted model object containing the estimators of the structure parameters. To compute the credibility premiums, one calls function predict with said object as argument. One can also obtain a nicely formatted view of the most important results with a call to summary. Both functions can report for the whole portfolio or for a subset of levels only by means of an argument levels.
In order to give an easily reproducible example, we group states 1 and 3 of the Hachemeister data set into one class and states 2, 4 and 5 into another. This also shows that the data do not have to be sorted by level.
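A fitting call consistent with the interface described above would be, for instance (the auxiliary object X, its column name class and the column selections are our reconstruction):

> X <- cbind(class = c(1, 2, 1, 2, 2), hachemeister)
> fit <- cm(~class + class:state, data = X,
+           ratios = ratio.1:ratio.12,
+           weights = weight.1:weight.12)

Detailed results are then obtained with summary: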
> summary(fit)
Detailed premiums
Level: class
class Ind. premium Weight Cred. factor Cred. premium
1 1967 1.407 0.9196 1949
2 1528 1.596 0.9284 1543
Level: state
  class state Ind. premium Weight Cred. factor Cred. premium
  1     1             2061 100155       0.8874          2048
  2     2             1511  19895       0.6103          1524
  1     3             1806  13735       0.5195          1875
  2     4             1353   4152       0.2463          1497
  2     5             1600  36110       0.7398          1585
The credibility premiums are obtained with predict:
> predict(fit)
$class
[1] 1949 1543
$state
[1] 2048 1524 1875 1497 1585
Finally, one can obtain the results above for the class level only as follows:
> summary(fit, levels = "class")
Detailed premiums
Level: class
class Ind. premium Weight Cred. factor Cred. premium
1 1967 1.407 0.9196 1949
2 1528 1.596 0.9284 1543
> predict(fit, levels = "class")
$class
[1] 1949 1543
The results above differ from those of Goovaerts and Hoogstad (1987) for the same example because the formulas for the credibility premiums are different. Otherwise, usage of summary and predict for models fitted with bstraub is identical to that for models fitted with cm.
7 Conclusion
The paper presented the facilities of the R package actuar version 0.9-3 in the fields of loss distribution modeling, risk theory and credibility theory. We feel this version of the package covers most of the basic needs in these areas. In the future we plan to improve the functions currently available, especially speed-wise, but also to start adding more advanced features. For example, future versions of the package should include support for dependence models in risk theory and for regression credibility models.
Obviously, the package has so far left many other fields of Actuarial Science untouched. For this situation to change, we hope that experts in their fields will join their efforts to ours and contribute code to the actuar project. The project intends to continue to grow and improve by and for the community of developers and users.
Finally, if you use R or actuar for actuarial analysis, please cite the software in publications. Use
> citation()
or
> citation("actuar")
Acknowledgments
The package would not be at this stage of development without the stimulating contribution of the following students: Sébastien Auclair, Mathieu Pigeon, Louis-Philippe Pouliot and Tommy Ouellet.
This research benefited from financial support from the Natural Sciences and Engineering Research Council of Canada and from the Chaire d'actuariat (Actuarial Science Chair) of Université Laval.
References
Bühlmann, H., 1969. Experience rating and credibility. ASTIN Bulletin 5,
157–165.
Bühlmann, H., Gisler, A., 2005. A Course in Credibility Theory and its Applications. Springer.
Daykin, C., Pentikäinen, T., Pesonen, M., 1994. Practical Risk Theory for
Actuaries. Chapman & Hall, London.
Forgues, A., Goulet, V., Lu, J., 2006. Credibility for severity revisited. North
American Actuarial Journal 10 (1), 49–62.
Goulet, V., 2007. actuar: An R Package for Actuarial Science, version 0.9-3.
École d’actuariat, Université Laval.
URL http://www.actuar-project.org
Hachemeister, C. A., 1975. Credibility for regression models with application to trend. In: Credibility, Theory and Applications. Proceedings of the Berkeley Actuarial Research Conference on Credibility. Academic Press, New York.
Hogg, R. V., Klugman, S. A., 1984. Loss Distributions. Wiley, New York.
Jewell, W. S., 1975. The use of collateral data in credibility theory: a hierarchical model. Giornale dell'Istituto Italiano degli Attuari 38, 1–16.
Kaas, R., Goovaerts, M., Dhaene, J., Denuit, M., 2001. Modern actuarial risk
theory. Kluwer Academic Publishers, Dordrecht.
Klugman, S. A., Panjer, H. H., Willmot, G., 1998. Loss Models: From Data to
Decisions. Wiley, New York.
Klugman, S. A., Panjer, H. H., Willmot, G., 2004. Loss Models: From Data to
Decisions, 2nd Edition. Wiley, New York.
Vincent Goulet
École d’actuariat
Pavillon Alexandre-Vachon, bureau 1620
Université Laval
Québec (QC) G1K 7P4
Canada
E-mail: [email protected]
URL: http://vgoulet.act.ulaval.ca/