Latent Profile Analysis in R: A Tutorial and Comparison to Mplus
Klaas J. Wardenaar
University Medical Center Groningen (UMCG)
[email protected]
April 9, 2021
Version 1.1
Summary
Latent profile analysis (LPA) can be used to identify data-driven classes of individuals
based on scoring patterns across continuous input variables. LPA can be conducted using
commercially available software packages like Mplus, Latent Gold, and SAS, but it is also
possible to use freely available R-packages. This tutorial aims to (1) help applied researchers
conduct an LPA in R and (2) show how results obtained in R compare to those obtained in
Mplus.
1. Background
Latent Profile Analysis (LPA) is a type of latent variable model that can be used to identify
latent classes or mixtures in a dataset, based on a set of continuous input variables (Gibson,
1959; Oberski, 2016). LPA is closely related to the widely used technique of Latent Class
Analysis, which is used to estimate latent classes based on discrete input variables (Nylund-
Gibson & Choi, 2018). In medical and psychological science, LPA can be useful when
considerable between-subject heterogeneity exists in scores on a range of variables and when
this variation cannot be explained by known, manifest variables (e.g., Wolfe 1970; Sterba
2013). Here, LPA can help to identify or approximate possibly meaningful subgroupings of
subjects that may help to better understand sample heterogeneity (Sterba, 2013).
Generally, LPA works under the assumption that sample (residual) variance can be reduced
by assuming a categorical latent variable that effectively subdivides the sample into >=2
subgroups that are more homogeneous in terms of their patterns of variable means and
(co)variances. When an LPA model fits a dataset well, subjects within each class typically
resemble each other closely in terms of their scores on the input variables. Depending on the
model configuration, the identified classes can show different class-specific patterns of means
and class-specific variances and covariances.
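Formally, LPA typically models the data as a finite mixture of multivariate normal distributions. In standard mixture-model notation (not tied to any particular software package), the density of an observed score vector y_i is:

f(y_i) = Σ_k π_k × φ(y_i | μ_k, Σ_k), with π_k > 0 and Σ_k π_k = 1,

where φ is the multivariate normal density, π_k are the class proportions, and μ_k and Σ_k are the class-specific mean vector and covariance matrix.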
Many different LPA model configurations are possible, each with different sets of parameters
that are either freely estimated in each class specifically or constrained to be equal across
classes in the resulting model (e.g., Celeux & Govaert, 1995). Model configurations with
many class-specific parameters can be very flexible and, as such, may fit the data well. A
downside is that these models are more complex (many more parameters to be estimated).
Using criteria such as the Bayesian Information Criterion (BIC) helps to find a model that
strikes a good balance between model fit and model complexity. When doing an LPA, most
applied researchers will be primarily interested in the differences and/or overlap between
the classes’ specific patterns of parameter estimates. These can be used to characterize the
classes and, possibly, provide clues about underlying mechanisms (Sterba 2013).
This tutorial
LPA is less widely used than other latent variable models and, possibly because of this, has
long been available only in specialized software packages such as Mplus. Luckily, ongoing
developments in many different scientific fields (e.g., ecology, econometrics) have yielded a
number of packages that also allow users to conduct LPA in the open-source R-platform.
However, the use of R does require experience and documentation of packages can be rather
limited or technical, making it less easily accessible for applied researchers. Therefore, this
tutorial aims to help applied researchers get going with LPA in R, illustrating the use
of several packages and, for reference, providing a comparison of the results obtained in R
with results obtained with Mplus.
2. Data
All examples in this tutorial will be using a simulated dataset (see Appendix for code). The
simulated data consist of 300 cases, each with responses on 10 continuous variables. The
data are simulated to consist of 3 classes, each with different mean scores across the 10
variables and each with a different variable (co)variance matrix. The following figure shows
the simulated data, with the different colors indicating the different classes.
[Figure: the simulated data, with scores ('value') on var1-var10 ('variable') plotted per case and the colors indicating the three classes.]
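A plot like the one above can be produced along the following lines (a sketch, assuming the simulated data.frame dat1 from the Appendix, with columns var1-var10 and class, and assuming the 'ggplot2' package is installed in addition to 'reshape2'):

library(reshape2)
library(ggplot2)
dat_plot <- dat1
dat_plot$id <- seq_len(nrow(dat_plot))                  # case identifier for line grouping
dat_long <- melt(dat_plot, id.vars = c("id", "class"))  # long format: id, class, variable, value
ggplot(dat_long, aes(x = variable, y = value, group = id, colour = factor(class))) +
  geom_line(alpha = 0.4) +
  labs(colour = "class")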
Note that the data were simulated to have a latent structure that is quite obvious. Even if the
classes’ lines in the plot had the same color, some clustering would still be observable. In real
life research settings, data with such obvious patterns are seldom, if ever, encountered.
3. Packages
3.1 R-packages
LPA models can be fitted in R (version 4.0.3; R Core Team, 2020), running in RStudio (version
1.3.1093 used here). There are many R-packages that offer some form of latent class/mixture
analytic functionality. However, the majority of packages focus on analyses with discrete
indicator variables (e.g., ‘poLCA’, Linzer & Lewis, 2011; ‘e1071::lca’, Meyer et al., 2019)
or require a lot of coding to define the required models (e.g., ‘OpenMx’, Neale et al., 2016).
For the current tutorial, two packages were selected that work with continuous indicator
variables and require limited user coding. The ‘mclust’ package (Scrucca et al., 2016) is the
first package that will be illustrated. This package is a specialist tool and allows for a wide
variety of model configurations to be estimated; as such it offers much more functionality
than most researchers will likely need. Therefore, the current tutorial focuses on a limited
range of relatively simple LPA model configurations that are commonly encountered in the
literature. Conveniently, Rosenberg et al. (2018) recently developed the package ‘tidyLPA’,
which can be used as a relatively easy-to-use front-end for estimating common LPA models
with ‘mclust’, basically streamlining some of the in- and output functionality of ‘mclust’.
Although ‘tidyLPA’ is easy to use, this comes at the expense of restricting the modeling options
to a few oft-used configurations. For completeness, both the ‘mclust’ and ‘tidyLPA’ approach will
be illustrated.
3.2 Mplus
Because it is one of the most widely used commercial software packages for latent variable
modeling, Mplus (version 5; Muthén & Muthén, 1998–2015) is used to fit some of the same LPA
models as are estimated using the two R-packages. The results obtained in Mplus are
compared to the results obtained with R and the extent of overlap and/or differences between
the software packages is evaluated.
4. Model configurations
LPA models can be configured in many different ways (see Scrucca et al., 2016; Banfield &
Raftery, 1993; Celeux & Govaert, 1995; Pastor et al., 2007). Here, four variants of increasing
complexity will be covered. In all models, cluster-specific means are estimated for each of
the k classes: each class has its own associated pattern of mean scores on the indicator
variables (e.g., var1-var10). The different model versions vary in terms of how the class-specific
(co)variance matrices of the indicator variables are constrained or allowed to vary within and
between classes. See the following table for an overview of the four models:
Model variant   Variances vary    Variances vary      Covariances vary    Covariances vary
                within class?     between classes?    within class?       between classes?
EEI             yes               no                  no; fixed to 0      no; fixed to 0
EEE             yes               no                  yes                 no
VVI             yes               yes                 no; fixed to 0      no; fixed to 0
VVV             yes               yes                 yes                 yes
In the first, most parsimonious LPA model variant, the indicator variables are set to have
zero covariances within and across classes. Indicator-variable variances are allowed to vary
within classes but are constrained to be equal between classes. Due to the latter, only one
set of variances needs to be estimated, resulting in a parsimonious model. In ‘mclust’, which
uses Gaussian Mixture Modeling vernacular, this variant is also referred to as the EEI (equal
volume, equal shape [and undefined orientation]) model (Scrucca et al., 2016).
The second model resembles the first model, but here the complete variable (co)variance
matrix is estimated: i.e. both the indicator variances and covariances are estimated. As in
the first model, the resulting (co)variance matrix is constrained to be equal across classes. In
‘mclust’ jargon, this type of model is called an EEE (equal volume, equal shape, and equal
orientation) model.
The third model variant allows for more variation across classes. All within- and between
class variable covariances are set to zero, but variances are now allowed to vary within and
between classes. As a result, the number of variance parameters to be estimated increases
with each class that is added to the model. In ‘mclust’, this type of model is called a VVI
(varying volume, varying shape [and undefined orientation]) model.
The fourth, most complex model allows for most variation in (co)variances across classes:
both the variances and covariances are allowed to vary within and between classes, resulting in
estimation of class-specific covariance matrices. This type of model is called a ‘VVV’ (varying
volume, varying shape, varying orientation) model.
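In terms of the class-specific covariance matrices Σ_k, the four variants can thus be summarized as follows (with D denoting a diagonal matrix):

EEI: Σ_k = D (diagonal, equal across classes)
EEE: Σ_k = Σ (full, equal across classes)
VVI: Σ_k = D_k (diagonal, class-specific)
VVV: Σ_k unconstrained (full, class-specific)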
Of course it is possible to design many other LPA models that, for instance, use partially
constrained (co)variance matrices across classes. The ‘mclust’ package alone covers an
additional 10 different model variants. Mplus, which takes a different approach to LPA than
‘mclust’, also allows for additional and highly customized model variations. These are not
covered here, but will be relatively easy to implement for users once they are familiar with
the four basic variants mentioned above.
5. Tutorial
5.1 ‘mclust’
To run LPA using the ‘mclust’ package, the Mclust() function can be used. This function
requires data as its minimum input. In addition, the number of clusters to fit (G) and model
variants to fit and select from (modelNames) can be entered (the defaults are G=1-9 and all
fourteen model variants). In many cases, we may want to consider a more limited range of
model variants with different numbers of classes. For multivariate data, a total of fourteen
options are available (see help("mclustModelNames") for an overview). Each model variant
options are available (see help("mclustModelNames") for an overview). Each model variant
is estimated for each value of the specified range of G using Expectation Maximization (EM),
using the Bayesian Information Criterion (BIC) to compare different model variants and
select the optimal one.
Conducting a latent variable or mixture analysis with Mplus or comparable software
usually entails fitting models with different numbers of classes (that are otherwise configured
similarly) and comparing the fit of these models to select the best one. The
default approach taken in the ‘mclust’ package is slightly different in that the main modeling
command Mclust() fits multiple models for all possible combinations of the specified numbers
of classes and model configurations. The eventual output of ‘mclust’ is the BIC-selected best
model variant with the optimal combination of number of classes and model configuration.
An advantage of the ‘mclust’ approach is that it is efficient and flexible: given a number of
classes, the best model configuration is selected straight away and we do not have to run all
model combinations one by one. This makes the approach especially suitable for exploratory
analyses. However, as stated above, this approach is less common in some fields, where LPA
estimation is approached in a more confirmatory fashion. In addition, it takes some control
away from the researcher, who may want to focus primarily on selecting the number of classes,
given a particular, theory-based model configuration that is kept constant throughout the
analyses. The latter strategy is often used in software like Mplus and is made easily available
through the ‘tidyLPA’ package.
Here, both approaches are shown: (1) the ‘mclust’ approach and (2) the ‘tidyLPA’ approach.
5.1.1 The ‘mclust’ approach
In this approach, we run consecutive models with increasing numbers of classes (G=1 to
G=9); for each number of classes, we let the package fit the four model variants described
above (“EEI”, “EEE”, “VVI” and “VVV”) and select the best-fitting one. We start by loading the package
(library(mclust)) and creating an object mnames that contains the names of the four models
we want to be fitted. Next, we use the Mclust() function to fit the models:
library(mclust)

# the four model variants to be fitted
mnames <- c("EEI", "EEE", "VVI", "VVV")

# fit all combinations of 1-9 classes and the four model variants
mod_g1_9 <- Mclust(dat1[1:10], G = 1:9, modelNames = mnames)
We can first look at the optimal number of classes and the optimal model variant that were
selected, using the following code:
# Optimal number of classes
mod_g1_9$G
## [1] 3
# Optimal model variant
mod_g1_9$modelName
## [1] "EEI"
This shows us that a 3-class EEI model was selected as the best model based on the BIC. We
can get a better perspective of this model’s performance if we compare it to the other fitted
models. We can do this by taking a closer look at the other models’ BIC values:
mod_g1_9$BIC
## Top 3 models based on the BIC criterion:
## EEI,3 VVI,3 EEI,4
## -12142.11 -12163.08 -12178.44
mod_g1_9$loglik
## [1] -5951.275
mod_g1_9$df
## [1] 42
Here, we can see the BICs for all fitted models (in this case 1-9 classes and 4 model variants:
36 models in total). We can see that the closest contenders were a 3-class model with a VVI
configuration (with variances allowed to vary both within and between classes) and a 4-class
EEI model.
Note that the BIC values are negative and that more negative values are considered to indicate
poorer fit than values closer to 0. This may strike some users as odd, given that in many
modeling applications we are used to lower BIC values indicating better fit. However, the
‘mclust’ approach of maximizing the BIC is correct for the modeling approach used (see e.g.,
Fraley & Raftery, 2003; Banfield & Raftery, 1993). For theoretical reasons, the BIC is
calculated in ‘mclust’ as:

BIC = 2 × loglik - df × log(n),
where df is the number of parameters (degrees of freedom), n is the sample size and log denotes
the natural logarithm. We can see that, here, higher loglikelihood values lead to a higher BIC.
For the best-fitting model in our example, the loglikelihood is -5951.275, the number of
parameters is 42 and the sample size is 300. If we plug these into the formula, we get:

BIC = 2 × (-5951.275) - 42 × log(300) = -12142.11
As expected, the obtained value indeed corresponds to the BIC we got in the model output
(see above). Other software packages, such as Mplus and ‘tidyLPA’ (see below), use a slightly
different formula to calculate the BIC, where higher loglikelihood values lead to lower BIC
values:

BIC = -2 × loglik + df × log(n)
We can see that the absolute BIC value is the same irrespective of the formula used, with
only the sign differing between the two approaches.
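As a quick check, both conventions can be computed directly from the fitted ‘mclust’ object (which stores the sample size in mod_g1_9$n):

# 'mclust' convention: higher (less negative) values indicate better fit
2 * mod_g1_9$loglik - mod_g1_9$df * log(mod_g1_9$n)

# Mplus/'tidyLPA' convention: lower values indicate better fit
-2 * mod_g1_9$loglik + mod_g1_9$df * log(mod_g1_9$n)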
Now, let us continue by taking a closer look at the class sizes and the mean values and
variances of the ten variables for the three identified classes. Because the best model has an
EEI configuration, printing the variance matrix of a single class will suffice, because the
variances are the same across classes (this is different in cases where the optimal model
allows for between-class differences in (co)variances):
# tabulate class-membership numbers
table(summary(mod_g1_9)$classification)
##
## 1 2 3
## 100 110 90
# display the means per class
mod_g1_9$parameters$mean
## (output truncated: the full output shows a 10 x 3 matrix with the
## mean of each variable in each class, followed by the variable
## variances, e.g., 2.509146 for var9 and 2.204457 for var10)
From this output, we can see that the three classes have class-sizes of 100, 110 and 90,
respectively. These sizes along with the patterns of class-specific mean variable scores
correspond with those that served as the input for the simulation. As specified, the variances
vary slightly across variables.
The class allocation in LPA is probabilistic in nature: each subject in the data is assigned a
probability for each of the estimated classes, based on their pattern of scores on the input
variables. These probabilities can be inspected in the z-matrix (here: mod_g1_9$z). Subjects
can be allocated to one of the classes based on their highest class-probability. These posterior
allocations can be found in the classification matrix (here: mod_g1_9$classification).
These class-allocations were tabulated above to evaluate class sizes.
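For a quick impression, the first few rows of both objects can be printed:

# posterior class probabilities (rows sum to 1) and modal class allocations
head(round(mod_g1_9$z, 3))
head(mod_g1_9$classification)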
Note that class-allocation in this way only yields useful classifications if the patterns of
class-probabilities allow for allocation of each subject to a single class with sufficient certainty.
In case of too much uncertainty, using the classification is not advised. We can inspect the
uncertainty of allocation for all subjects (here: mod_g1_9$uncertainty). This gives us a list
of uncertainties for all subjects. Next, we can evaluate the extent of uncertainty by looking
at the maximum uncertainty or, for instance, the averaged uncertainty across subjects:
max(mod_g1_9$uncertainty)
## [1] 0.003852604
mean(mod_g1_9$uncertainty)
## [1] 2.607299e-05
Here, the uncertainty is atypically low because of the idealized data used. In many cases,
uncertainty will be higher and it may be of interest to investigate it further, for instance
by looking at (e.g., plotting or tabulating) the uncertainty per class:
cprob <- cbind(mod_g1_9$z, mod_g1_9$classification)
cprob <- as.data.frame(cprob)
colnames(cprob) <- c("prob (class 1)", "prob (class 2)", "prob (class 3)", "class")
aggregate(cprob[, 1:3], list(cprob$class), mean)
The output shows that, within each class, the average probability of the allocated class was
very high and the probabilities of the other classes were very low. Again, these values reflect the
idealized data; in many cases, more differentiated class-probability patterns may be observed.
5.1.2 ‘tidyLPA’
When running ‘mclust’ models from the ‘tidyLPA’ package, the estimations are approached
a little differently. By default, the number of models that can be estimated is restricted to
six relatively common variants (see help(tidyLPA::estimate_profiles) for details). Of
these models, four can be estimated with ‘mclust’. These models differ in terms of how the
variances and covariances are allowed to vary or constrained to be equal across classes and
whether covariances are or are not fixed to zero within classes. These models have each been
allocated a number (1-6) in ‘tidyLPA’. See the table below for how these numbers correspond
to the model configurations mentioned above.
Model variant   Model number
EEI             1
EEE             3
VVI             2
VVV             6
Models four and five are different and can only be fit when R is interfacing with Mplus using
the ‘MplusAutomation’ package. This is outside the scope of this tutorial.
To fit an LPA model in ‘tidyLPA’, we use the estimate_profiles() function. Here, we need
to enter the data.frame to be used (df) and the number of classes to estimate (n_profiles).
To determine what kind of model configuration to estimate, the authors have provided two
different ways in the estimate_profiles() command. In the first approach, we simply specify
which model number (see the table above) we want to estimate, using the models argument.
For instance, if we want to estimate LPA models with 1 to 9 classes with an EEI
configuration, we can use:
suppressMessages(library(tidyLPA))
suppressMessages(mod_1c_v1 <- estimate_profiles(df = dat1[1:10], n_profiles = 1:9,
models = 1))
By default, the package issues a message notifying us that the models argument is used and
that the variances and covariances arguments are ignored; here, this message is suppressed.
The output shows us the model-fit information for the 1- to 9-class EEI models:
mod_1c_v1
## (rows for the 1- to 5-class models omitted)
## 1 6 11986.25 12264.04 0.84 0.78 1.00 0.08 0.30 0.16
## 1 7 11950.86 12269.39 0.81 0.77 0.91 0.08 0.17 0.01
## 1 8 11939.04 12298.30 0.82 0.79 0.92 0.08 0.17 0.02
## 1 9 11947.10 12347.11 0.82 0.76 0.92 0.06 0.16 0.57
Here, we can see that the BIC takes on different values compared to ‘mclust’, with lower
rather than higher values indicating better fit (see the explanation above). Instead of drawing the
fit indices (e.g., BIC, AIC) directly from the ‘mclust’ package, ‘tidyLPA’ only draws the ‘raw’
loglikelihood values and posterior probabilities from the ‘mclust’ output and (re)calculates
the BIC, AIC, entropy etc. to mirror as well as possible those provided in Mplus. In addition,
p-values of the bootstrapped likelihood ratio test (BLRT) are given by default. These
p-values indicate for each k-class model whether adding the kth class significantly improves
model fit. For instance, the BLRT_p of 0.04 for the 4-class model indicates that adding
a fourth class led to an improvement in model fit over the 3-class model that was only just
significant at an alpha of 0.05. When using ’mclust’ as a stand-alone package,
the BLRT is not calculated by default, but can be obtained with the mclustBootstrapLRT
function.
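A minimal sketch of such a BLRT with stand-alone ‘mclust’ (the maxG and nboot values here are illustrative; bootstrapping can take a while):

# bootstrap LRT comparing k-1 vs. k classes for the EEI variant
blrt <- mclustBootstrapLRT(dat1[1:10], modelName = "EEI", maxG = 4, nboot = 100)
blrt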
Next, we can run similar commands, with different values for the models argument. It is
also possible to estimate more than one model variant in a single run by providing a vector
of model numbers (e.g., models=c(1,6)).
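For example, to fit both the EEI and VVV variants for 1 to 9 classes in one call:

# models 1 (EEI) and 6 (VVV); see the model-number table above
mod_multi <- estimate_profiles(df = dat1[1:10], n_profiles = 1:9, models = c(1, 6))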
The second approach to determine the model variant to be estimated is by using the variances
and covariances arguments in the estimate_profiles command. The variances argument
can have the values "equal" (i.e. variable variances constrained to be equal across classes) or
"varying" (i.e. variances allowed to vary across classes). The covariances argument can
have the values: "zero" (i.e. all variable covariances fixed to zero), "equal" (i.e. covariances
constrained to be equal across classes) or "varying" (i.e. covariances allowed to vary across
classes). Now, if we again want to estimate LPA models with 1 to 9 classes, with an EEI
configuration, we can use:
mod_1c_v2 <- estimate_profiles(df = dat1[1:10], n_profiles = 1:9, variances = "equal",
covariances = "zero")
mod_1c_v2$model_1_class_2$fit
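Analogously, the VVV variant (model 6) can be requested with these arguments:

# variances and covariances allowed to vary across classes (VVV)
mod_1c_vvv <- estimate_profiles(df = dat1[1:10], n_profiles = 1:9,
                                variances = "varying", covariances = "varying")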
We can inspect the fit of each of these models one by one, or we can use compare_solutions
to do this in an automated fashion. Interestingly, the best model can then be selected based on
integrated information from several fit indices (an analytic hierarchy process; see
help(tidyLPA::AHP) for more details). We obtain the model comparison with the following code:
comp <- suppressWarnings(compare_solutions(mod_1c_v1))
comp$fits
## # A tibble: 9 x 18
## Model Classes LogLik AIC AWE BIC CAIC CLC KIC SABIC ICL
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 -7274. 14588. 14834. 14662. 14682. 14550. 14611. 14598. -14662.
## 2 1 2 -6549. 13161. 13543. 13276. 13307. 13101. 13195. 13177. -13276.
## 3 1 3 -5951. 11987. 12506. 12142. 12184. 11905. 12032. 12009. -12142.
## 4 1 4 -5938. 11982. 12638. 12178. 12231. 11878. 12038. 12010. -12218.
## 5 1 5 -5929. 11986. 12778. 12223. 12287. 11860. 12053. 12020. -12273.
## 6 1 6 -5918. 11986. 12915. 12264. 12339. 11838. 12064. 12026. -12353.
## 7 1 7 -5889. 11951. 13016. 12269. 12355. 11780. 12040. 11997. -12376.
## 8 1 8 -5873. 11939. 13141. 12298. 12395. 11747. 12039. 11991. -12409.
## 9 1 9 -5866. 11947. 13285. 12347. 12455. 11733. 12058. 12005. -12458.
## # ... with 7 more variables: Entropy <dbl>, prob_min <dbl>, prob_max <dbl>,
## # n_min <dbl>, n_max <dbl>, BLRT_val <dbl>, BLRT_p <dbl>
comp$best
5.2 Mplus

For comparison, the same EEI models were also fitted in Mplus. Below, the input files for the 1- to 4-class models are shown; the inputs for the models with more classes follow the same pattern.

!! 1-class model
DATA: file='dat1.dat';
VARIABLE:
names=id v1-v10;
usevariables= v1-v10;
classes = c(1);
ANALYSIS:
type = mixture;
starts= 250 50;
MODEL:
%overall%
%C#1%
[v1-v10];
!! 2-class model
DATA: file='dat1.dat';
VARIABLE:
names=id v1-v10;
usevariables= v1-v10;
classes = c(2);
ANALYSIS:
type = mixture;
starts= 250 50;
MODEL:
%overall%
%C#1%
[v1-v10];
%C#2%
[v1-v10];
!! 3-class model
DATA: file='dat1.dat';
VARIABLE:
names=id v1-v10;
usevariables= v1-v10;
classes = c(3);
ANALYSIS:
type = mixture;
starts= 250 50;
MODEL:
%overall%
%C#1%
[v1-v10];
%C#2%
[v1-v10];
%C#3%
[v1-v10];
!! 4-class model
DATA: file='dat1.dat';
VARIABLE:
names=id v1-v10;
usevariables= v1-v10;
classes = c(4);
ANALYSIS:
type = mixture;
starts= 250 50;
MODEL:
%overall%
%C#1%
[v1-v10];
%C#2%
[v1-v10];
%C#3%
[v1-v10];
%C#4%
[v1-v10];
Each of the models was fitted with multiple initial-stage random starts and multiple final-stage
optimizations to reduce the risk of ending up with a solution at a local maximum. For the 1- to
6-class models, 250 initial and 50 final starts were used. For the 7-class model, 1000 initial
starts and 200 final optimizations were used, and for the 8- and 9-class models a replicable
global solution could only be obtained with 5000 initial-stage starts and 1000 final-stage
optimizations.
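Following the starts syntax used in the inputs above, the ANALYSIS section for, e.g., the 9-class model would read:

ANALYSIS:
type = mixture;
starts= 5000 1000;

The fit indices for each of the fitted models are displayed in the table below. For now, we only look at the estimated fit indices.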
EEI models

Classes        AIC          BIC
1         14587.535    14661.611
2         13160.836    13275.653
3         11986.551    12142.110
4         11957.710    12154.010
5         11951.766    12188.808
6         11944.632    12222.415
7         11935.501    12254.027
8         11929.428    12288.695
9         11925.203    12325.212
If we compare the BIC values for the EEI models to those obtained with ‘mclust’ directly
and via ‘tidyLPA’, we can see that the results are quite similar for the less complex models,
with the BIC-values for the 2- and 3-class models being exactly the same in terms of their
absolute values. When taking a closer look at the 3-class model that was previously found to
be the best model, we can also see that the Mplus-based classifications (n=100, n=110,
and n=90) and entropy estimates (entropy=1.0) are similar to those obtained with ‘mclust’
and/or ‘tidyLPA’.
If we look at the larger picture and evaluate the overlap between the BIC values obtained
for all models with class numbers ranging from 1 to 9, we can see that the estimated BIC
values show more differences between the R-packages and Mplus for the models with larger
numbers of classes. This can be explained by the different methods used by Mplus and
‘mclust’/’tidyLPA’ to generate the start values for estimation (see below). These differences
can cause the packages to yield different results for more complex models and/or models that
are increasingly misspecified (as the models with more than 3 classes were in our example).
6. Final comments
In this tutorial we have seen that it is relatively easy to run an LPA in R. As with all
data-driven analyses, care should be taken not to over-interpret the results. In the end, LPA
identifies classes in such a way as to optimally explain variance on a range of variables and
not to optimize interpretability or usefulness.
Another remark with regard to LPA results is that class-membership is probabilistic in nature:
each subject in an analyzed sample has a probability of being in each of the model’s classes.
Subjects can be allocated to a class based on their highest class-probability. Importantly,
this can only be done with enough certainty if separation between classes is sufficient, which
means that we see that each subject has a clear highest probability (e.g., p=0.9) for one of
the classes. The entropy statistic is often used to quantify this separation, with values <0.8
being taken to indicate insufficient separation to allocate subjects to one class. In such cases,
the class probabilities themselves can still be used in further investigations.
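For reference, a commonly used definition of the relative entropy (as reported by, e.g., Mplus and ‘tidyLPA’) is:

E = 1 - ( Σ_i Σ_k -p_ik × ln(p_ik) ) / ( n × ln(K) ),

where p_ik is subject i’s posterior probability for class k, n is the sample size and K is the number of classes; values close to 1 indicate clear class separation.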
Depending on one’s preferences, one can choose either the ‘mclust’ or the ‘tidyLPA’ package.
The latter may be especially attractive for researchers who already have experience with Mplus
or comparable software. It is important to note, though, that differences exist between the
estimation approaches taken by both R-packages and by Mplus.
Mixture models such as LPA need to be estimated in an iterative fashion, using an EM
algorithm that is very sensitive to the start values used: poor start values can lead the
estimation process to a poor solution. In addition, there is often a chance that the iterative
EM process arrives at a solution at a local, rather than a global, maximum in the
likelihood ‘landscape’.
Different methods have been developed to initialize the estimation (get starting values) in
such a way as to optimize the chance of arriving at an accurate model solution (Biernacki et
al., 2003; Shireman et al., 2016). Mplus and ‘mclust’ take two different approaches. Mplus
uses a ‘brute force’ approach and reruns the model with multiple sets of random start values,
each generated from uniform distributions of values with ranges that are based on the data
(Muthén & Muthén, 1998-2015). The best solution is the model with the highest loglikelihood
that was arrived upon from at least two different starting points. In contrast, ‘mclust’ uses
the data itself to generate a set of plausible start values using hierarchical clustering. This
means that the start values are informed by the (hierarchical structure of) the data, which
can work well. A downside is that only a single set of start values is used, so the risk of
arriving at a local optimum is not addressed. These approaches show clear differences and
have been shown previously to not always arrive at the same solutions, with some authors
being especially critical of the ‘mclust’ approach (Shireman et al., 2017). In our example,
we did not see much difference between the two approaches, but it should be noted that
many real-world data sets do not have such a clear latent structure, making it much more
challenging to arrive at an accurate solution, given the data.
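Although ‘mclust’ has no built-in multi-start facility, one ad-hoc way to probe sensitivity to the initialization is to re-run a model while basing the hierarchical-clustering start on different random subsets of the data, via the initialization argument of Mclust() (a sketch; the subset size and number of re-runs are arbitrary here):

# refit the selected 3-class EEI model with five different random
# initialization subsets and compare the resulting BIC values
set.seed(123)
bics <- replicate(5, {
  init <- list(subset = sample(nrow(dat1), 150))
  Mclust(dat1[1:10], G = 3, modelNames = "EEI", initialization = init)$bic
})
bics

Identical BIC values across such re-runs suggest a stable solution; diverging values point to sensitivity to the starting partition.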
Based on the provided examples and these technical differences between ‘mclust’ and Mplus,
it is probably better to see ‘mclust’ and ‘tidyLPA’ as useful alternatives to Mplus, rather
than as tools for exactly mimicking what can be done with the latter software.
References
Banfield, J., Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering.
Biometrics. 49: 803–821.
Biernacki, C., Celeux, G., Govaert, G. (2003). Choosing starting values for the EM algorithm
for getting the highest likelihood in multivariate Gaussian mixture models. Computational
Statistics and Data Analysis, 41: 561–575.
Celeux, G., Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition,
28: 781–793.
Fraley, C., Raftery, A.E. (1998). How many clusters? Which clustering method? Answers via
model-based cluster analysis. The Computer Journal, 41: 578–588.
Gibson, W. A. (1959). Three multivariate models: Factor analysis, latent structure analysis,
and latent profile analysis. Psychometrika, 24, 229–252.
Linzer, D.A., Lewis, J. (2011). poLCA: An R package for polytomous variable latent class
analysis. Journal of Statistical Software, 42(10): 1–29.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F. (2019). e1071: Misc
Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU
Wien. R package version 1.7-3. https://CRAN.R-project.org/package=e1071
Muthén, L., Muthén, B. (1998–2015). Mplus User’s Guide (7th ed.). Los Angeles: Muthén &
Muthén.
Neale, M.C., Hunter, M.D., Pritikin, J.N., Zahery, M., Brick, T.R., Kirkpatrick, R.M.,
Estabrook, R., Bates, T.C., Maes, H.H., Boker, S.M. (2016). “OpenMx 2.0: Extended
structural equation and statistical modeling.” Psychometrika 81(2): 535-549.
Nylund-Gibson, K., Choi, A.Y. (2018). Ten frequently asked questions about latent class
analysis. Translational Issues in Psychological Science, 4: 440–461.
Oberski, D. (2016). Mixture models: Latent profile and latent class analysis. In J. Robertson
& M. Kaptein (Eds.), Modern statistical methods for HCI (pp. 275–287). Cham, Switzerland:
Springer International Publishing.
Pastor, D.A., Barron, K.E., Miller, B.J., Davis, S.L. (2007). A latent profile analysis of
college students’ achievement goal orientation. Contemporary Educational Psychology 32(1):
8-47.
Rosenberg, J.M., Beymer, P.N., Anderson, D.J., Van Lissa, C.J., & Schmidt, J.A. (2018).
tidyLPA: An R Package to Easily Carry Out Latent Profile Analysis (LPA) Using Open-Source
or Commercial Software. Journal of Open Source Software 3(30): 978.
Scrucca, L., Fop, M., Murphy, T.B., Raftery, A.E. (2016). mclust 5: Clustering, classification
and density estimation using Gaussian finite mixture models. The R Journal, 8(1): 289–317.
Shireman, E.M., Steinley, D., Brusco, M.J. (2016). Local optima in mixture modeling.
Multivariate Behavioral Research, 51(4): 466–481.
Shireman, E., Steinley, D., Brusco, M.J. (2017). Examining the effect of initialization
strategies on the performance of Gaussian mixture modeling. Behavior Research Methods,
49(1): 282–293.
Sterba, S.K. (2013). Understanding linkages among mixture models. Multivariate Behavioral
Research, 48(6): 775–815.
Wolfe, J. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral
Research, 5: 329–350.
Appendix
R code to simulate the tutorial data. Make sure to install the ‘MASS’ and ‘reshape2’ packages
first. Note that the class covariance matrices and the mean vectors of classes 2 and 3 below
are placeholders (marked as such in the code); the exact values used to generate the original
tutorial data are not reproduced here.
library(MASS)
library(reshape2)
##########################################################
### simulate v1-v10 for each class from a multivariate ###
### normal distribution, given a class-specific mean   ###
### vector (mu) and covariance matrix (Sigma)          ###
##########################################################
### NOTE: the Sigma matrices and the class-2 and class-3
### mean vectors below are illustrative placeholders, not
### the values used to generate the original tutorial data
Sigma1 <- diag(10)          # placeholder covariance matrix, class 1
Sigma2 <- diag(10) * 1.5    # placeholder covariance matrix, class 2
Sigma3 <- diag(10) * 0.8    # placeholder covariance matrix, class 3
set.seed(0111)
cl1 <- mvrnorm(100, mu=c(2,3,2,6,7,4,5,8,2,1), Sigma=Sigma1)
cl2 <- mvrnorm(110, mu=c(5,2,6,1,3,7,2,4,8,5), Sigma=Sigma2) # placeholder means
cl3 <- mvrnorm(90, mu=c(1,6,3,8,2,5,7,3,5,9), Sigma=Sigma3)  # placeholder means
### convert matrices to data.frame objects & add a class nr
cl1 <- as.data.frame(cl1)
cl1$class <-rep(1,100)
cl2 <- as.data.frame(cl2)
cl2$class <-rep(2,110)
cl3 <- as.data.frame(cl3)
cl3$class <- rep(3,90)
##########################################################
### combine the three classes into a single data.frame ###
dat1 <- rbind(cl1, cl2, cl3)
colnames(dat1)[1:10] <- paste0("var", 1:10)

##########################################################
### add a little stochastic noise for increased realism ;)
dat1$var1 <- dat1$var1 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var2 <- dat1$var2 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var3 <- dat1$var3 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var4 <- dat1$var4 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var5 <- dat1$var5 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var6 <- dat1$var6 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var7 <- dat1$var7 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var8 <- dat1$var8 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var9 <- dat1$var9 + rnorm(n=300, mean=0, sd=sqrt(1))
dat1$var10 <- dat1$var10 + rnorm(n=300, mean=0, sd=sqrt(1))