
Abstract

Mixtures of factor analysers (MFA) are popular tools for dimension reduction and model-based clustering. The conventional method of parameter estimation for MFA is maximum likelihood via the Expectation-Maximization (EM) algorithm. This typically assumes the number of components (g) and the number of factors (q) to be known a priori, but these are often unknown in practice. This article reviews different approaches for automatically inferring g and/or q from the data. A hybrid method, incremental automated MFA (IAMFA), is also proposed. We conduct a systematic comparison of these methods using simulated data and analyse their relative performance across various scenarios, looking at their ability to infer g and q correctly, the general model fit, clustering accuracy, and computation time.

Keywords

Clustering, dimension reduction, factor analysis, finite mixture models

1 Introduction

Mixtures of factor analysers (MFA) are frequently used for modelling high dimensional data with heterogeneous structure. The MFA model combines a mixture model with the factor analysis (FA) model and is thus able to perform both clustering and dimension reduction at the same time. Initially introduced by Ghahramani and Hinton (1997), the method has since found many applications in a wide range of areas, from microarray gene expressions (McLachlan et al., 2003) and image processing (Yang et al., 1999) to financial risk modelling (Ko and Beak, 2018).

Parameter estimation of the MFA model is typically carried out using the maximum likelihood estimation method via the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Several EM-type algorithms have been proposed for the MFA model, including the Expectation Conditional Maximization (ECM) algorithms by Ghahramani and Hinton (1997), McLachlan and Peel (2000a) and Zhao and Yu (2008). These algorithms typically assume the number of components (g) and the number of factors (q) to be known in advance. However, in practice, g and q are often unknown and need to be inferred from the data. To this end, there have been a number of attempts to determine g and/or q in an automated manner. Many authors perform an exhaustive search by fitting (one or more) MFA models for different values of g and q and then selecting a suitable model based on some information criterion. A main drawback of this approach is the high computation time. Kaya and Salah (2015) proposed an adaptive MFA algorithm which infers both g and q by progressively splitting components or adding factors and then incrementally merging components. Wang and Lin (2020) proposed an automated MFA algorithm where q is estimated during the execution of an EM algorithm, but which assumes g to be known in advance. On the other hand, several others have considered Bayesian inference, including the variational Bayesian MFA (Ghahramani and Beal, 1999), the infinite mixtures of infinite FA (Murphy et al., 2020), the overfitted mixtures of infinite FA (Murphy et al., 2020) and more recently, the dynamic mixture of MFA (Grushanina and Frühwirth-Schnatter, 2023).

This article presents a systematic comparison of automated approaches for determining g and/or q (and the remaining parameters) of the MFA model. Note that we do not aim to provide a comprehensive survey of all existing methods, but rather focus on those where there are publicly available software implementations. In addition, we propose a new method for inferring g and q—the incremental automated MFA (IAMFA) algorithm—where g is determined in an incremental manner and q is inferred based on an approximated information criterion. Performance of the methods was evaluated on 100 replications of each of 12 different scenarios. We assessed their performance based on the accuracy of the inferred value of g and q, the time taken to run the algorithm, the quality of the fitted models, and the correctness of cluster assignments.

The remainder of this article is organized as follows. In Section 2, we briefly review the MFA model and a commonly-used ECM algorithm to fit the model. In Section 3, we examine selected methods for inferring g and/or q. We also propose the IAMFA algorithm. Implementations of these methods are available from the R (R Core Team, 2025) packages IMIFA (Murphy et al., 2021) and autoMFA (Davey, 2021). In Section 4, we describe the design of the simulation study used to compare the methods examined in Section 3. The results of this simulation study are presented in Section 5. General conclusions are given in Section 6.

2 The mixture of factor analysers (MFA)

The mixture of factor analysers (MFA) is a mixture model where each component of the mixture follows the well-known factor analysis (FA) model. The latter postulates that the p-dimensional data vector Y_j is formed as the sum of an underlying mean μ, a linear combination of q < p independent underlying factors and an additive error vector e_j. It is typically assumed that the error vector and the factors each follow a normal distribution. Formally, the MFA model can be specified hierarchically:

\begin{array}{l} Y_{j} | (Z_{i j} = 1) = μ_{i} + B_{i} U_{j} + e_{j}, \\ U_{j} | (Z_{i j} = 1) \sim N_{q_{i}} (0, I_{q_{i}}), \\ e_{j} | (Z_{i j} = 1) \sim N_{p} (0, D_{i}), \\ Z_{j} \sim Multinomial (1, π), \end{array}

for i = 1, 2, …, g and j = 1, 2, …, n, and where U_j and e_j are independent (given Z_ij = 1). Here, I_qi denotes the q_i-dimensional identity matrix, where q_i < p. The vector Z_j = (Z_1j, Z_2j, …, Z_gj) has elements Z_ij that are binary indicator variables, taking the value of one if Y_j belongs to the ith component of the mixture model and zero otherwise. The vector π = (π_1, π_2, …, π_g) is the vector of mixing proportions. The p × 1 vector μ_i is the underlying mean of the ith component of the MFA model. The p × q_i matrices B_i are the factor loading matrices and D_i are the p × p diagonal error-variance matrices. The q_i × 1 vectors U_j are known as the factors.

It follows that the MFA model can be formulated as a Gaussian mixture model (GMM) with a ‘restricted’ component covariance matrix,

Y_{j} | (Z_{i j} = 1) \sim N_{p} (μ_{i}, B_{i} B_{i}^{⊤} + D_{i}),

with density given by

f (y_{j}; Θ) = \sum_{i = 1}^{g} π_{i} ϕ_{p} (y_{j}; μ_{i}, B_{i} B_{i}^{⊤} + D_{i}),

(2.1)

where $Θ = (π_{1}, π_{2}, \dots, π_{g - 1}, θ_{1}, θ_{2}, \dots, θ_{g})$ is the vector containing the unknown parameters of the mixture model and $θ_{i} = (μ_{i}, B_{i}, D_{i})$ is the vector containing the unknown parameters of the ith component of the mixture model. Here, $ϕ_{p} (\cdot; μ, Σ)$ denotes the density of the p-dimensional normal distribution with mean μ and covariance matrix Σ.

It is well-known that the MFA model suffers from identifiability issues. To ensure the number of parameters in the MFA model is less than that of a GMM model with a full component covariance matrix, the number of factors for each component (q_i) of the MFA model is required to obey the Ledermann bound (Ledermann, 1937), given by

q_{i} \leq p + \frac{1 - \sqrt{1 + 8 p}}{2},

(2.2)

for i = 1, 2, …, g. Apart from this, the MFA model inherits from the FA model the property of invariance to orthogonal transformations. This means post-multiplying a factor loading matrix B_i with an orthogonal matrix V will give the same model, that is, replacing B_i in (2.1) with B_i V leads to the same equation. One way to remedy this is to impose $\frac{1}{2} q_{i} (q_{i} - 1)$ constraints on the factor loading matrix B_i by applying, for example, the varimax rotation (Kaiser, 1958). Accordingly, we need a total of $\sum_{i = 1}^{g} \frac{1}{2} q_{i} (q_{i} - 1)$ constraints to achieve identifiability.
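As a quick numerical check of the Ledermann bound (2.2), the following minimal Python sketch computes the largest admissible integer q for a given dimension p. The function name `ledermann_max_q` is hypothetical and is not part of autoMFA or IMIFA.

```python
import math

def ledermann_max_q(p):
    """Largest integer q satisfying q <= p + (1 - sqrt(1 + 8p)) / 2, as in (2.2)."""
    bound = p + (1.0 - math.sqrt(1.0 + 8.0 * p)) / 2.0
    # A small tolerance guards against floating-point error when the bound is an integer.
    return math.floor(bound + 1e-9)

for p in (3, 10, 20):
    print(p, ledermann_max_q(p))
```

For p = 3 the bound equals 1, so a three-dimensional dataset admits at most one factor per component, whereas p = 10 admits up to six.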

In addition, it is often assumed that the number of factors for each component of the MFA model is the same, that is, q_i = q for all i. This assumption is made in most of the methods considered in Section 3. Another popular assumption is to have a common error-variance matrix across all components, that is, D_i = D for all i. This assumption was made, for example, in Ghahramani and Beal (1999).

For a fixed g and q, the parameters of the MFA model Θ can be estimated using an EM algorithm and generalizations thereof, such as the ECM algorithm (Meng and Rubin, 1993), the Alternating Expectation Conditional Maximization (AECM) algorithm (Meng and van Dyk, 1997), or the Parameter Expanded Expectation Maximization (PX-EM) algorithm (Liu et al., 1998). For example, Ghahramani and Hinton (1997) described an ECM algorithm for the MFA model, treating the factors U_j and the component indicators Z_ij as latent variables. The mathematical details can be found in Ghahramani and Hinton (1997). An alternative version was proposed by Zhao and Yu (2008), where only the Z_ij are treated as latent variables. Technical details can be found in that reference. Briefly, the E- and M-steps of this ECM algorithm are given as follows. At the kth iteration of the ECM algorithm, the E-step evaluates the responsibilities (i.e., the posterior probability that an observation belongs to a given mixture component (McLachlan and Peel, 2000b; Bishop, 2006; Murphy, 2012))

τ_{i j}^{(k)} = \frac{π_{i}^{(k)} ϕ_{p} (y_{j}; μ_{i}^{(k)}, B_{i}^{(k)} B_{i}^{{(k)}^{⊤}} + D_{i}^{(k)})}{f (y_{j}; Θ^{(k)})} .

(2.3)

The M-step then updates the parameters in Θ using

π_{i}^{(k + 1)} = \frac{1}{n} \sum_{j = 1}^{n} τ_{i j}^{(k)},

(2.4)

μ_{i}^{(k + 1)} = \frac{\sum_{j = 1}^{n} τ_{i j}^{(k)} y_{j}}{\sum_{j = 1}^{n} τ_{i j}^{(k)}},

(2.5)

B_{i}^{(k + 1)} = {[D_{i}^{(k)}]}^{\frac{1}{2}} U_{m_{i}} {(Λ_{m_{i}} - I_{m_{i}})}^{\frac{1}{2}} V_{i},

(2.6)

\begin{array}{l} D_{i}^{(k + 1)} = diag (d_{i 1}^{(k + 1)}, d_{i 2}^{(k + 1)}, \dots, d_{i p}^{(k + 1)}), \\ d_{i h}^{(k + 1)} = \max \{η, (ω_{i h}^{(k + 1)} + 1) d_{i h}^{(k)}\}, h = 1, \dots, p, \end{array}

(2.7)

where $m_{i} = \sum_{h = 1}^{p} I (λ_{i h} > 1)$, $V_{i}$ is an $m_{i} \times q$ matrix satisfying $V_{i} V_{i}^{⊤} = I_{m_{i}}$ and $Λ_{m_{i}}$ is the diagonal matrix containing the m_i largest eigenvalues of ${\tilde{S}}_{i}$. The definitions of $λ_{i h}, η, ω_{i h}$ and ${\tilde{S}}_{i}$ are given in Appendix A. This version of the ECM algorithm generally achieves a faster convergence rate than the original version by Ghahramani and Hinton (1997), but at the cost of not estimating the underlying factors U_j. We will adopt this version of the ECM algorithm as the parameter estimation method for the MFA model when performing exhaustive search over g and q in the simulation study in Section 5.
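The E-step (2.3) and the first two M-step updates (2.4) and (2.5) can be illustrated with a minimal Python sketch. It assumes that the component log-densities log ϕ_p(y_j; μ_i, B_i B_i^⊤ + D_i) have already been computed and are passed in as `log_dens[i][j]`; the function names are hypothetical, and the updates for B_i and D_i in (2.6) and (2.7) are omitted because they depend on quantities defined in Appendix A. Working on the log scale is a numerical-stability choice of this sketch, not something prescribed by the algorithm.

```python
import math

def e_step(log_pi, log_dens):
    """Responsibilities tau[i][j] as in (2.3), computed on the log scale."""
    g, n = len(log_dens), len(log_dens[0])
    tau = [[0.0] * n for _ in range(g)]
    for j in range(n):
        terms = [log_pi[i] + log_dens[i][j] for i in range(g)]
        top = max(terms)
        # log f(y_j; Theta) via the log-sum-exp trick
        log_f = top + math.log(sum(math.exp(t - top) for t in terms))
        for i in range(g):
            tau[i][j] = math.exp(terms[i] - log_f)
    return tau

def m_step_pi_mu(tau, y):
    """Updates of the mixing proportions (2.4) and component means (2.5)."""
    n, p = len(y), len(y[0])
    pi_new, mu_new = [], []
    for tau_i in tau:
        n_i = sum(tau_i)  # effective number of points in component i
        pi_new.append(n_i / n)
        mu_new.append([sum(t * yj[h] for t, yj in zip(tau_i, y)) / n_i
                       for h in range(p)])
    return pi_new, mu_new
```

Each column of the returned `tau` sums to one, which is a convenient sanity check when implementing the E-step.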

3 Methods for inferring g and/or q

We now consider algorithms in the literature that can infer g, q, or both g and q. In addition, we describe a new algorithm denoted by IAMFA in Section 3.7. The majority of these algorithms are based on either the ECM algorithm by Ghahramani and Hinton (1997) or the alternative version by Zhao and Yu (2008). Many of these algorithms attempt to choose g using a hierarchical splitting approach, that is, by starting with a single component model and then gradually splitting into more sub-components based on certain criteria. There are a number of different approaches proposed for choosing q. Some methods choose q based on an approximation of the Bayesian information criterion (BIC) (Schwarz, 1978), defined as −2 log(likelihood) + m log n, where m is the number of parameters in the model. Another approach chooses q by comparing the sample covariance structure with the modelled covariance structure of each component of the MFA model and then adding an extra factor to the component with the largest difference between these two structures. There are also a few attempts based on the Bayesian formulation of the MFA model, typically employing a non-parametric prior for g and/or q to achieve automatic inference. They usually allow for each q_i to be different.

Table 1

A summary of methods for inferring the number of components g and/or the number of factors q of a mixture of factor analysers (MFA) model. The methods are naive grid search (NGS), adaptive mixtures of factor analysers (AMoFA), automated mixtures of factor analysers (AMFA), variational Bayesian mixtures of factor analysers (VBMFA), overfitted mixtures of infinite factor analysers (OMIFA), infinite mixtures of infinite factor analysers (IMIFA) and incremental automated mixtures of factor analysers (IAMFA).

Method	Infers g	Infers q	R package	Reference
NGS	exhaustive	exhaustive	autoMFA	-
AMoFA	split & merge	incremental	autoMFA	Kaya and Salah (2015)
AMFA	–	approx. BIC	autoMFA	Wang and Lin (2020)
VBMFA	split	–	autoMFA	Ghahramani and Beal (1999)
OMIFA	Bayesian	Bayesian	IMIFA	Murphy et al. (2020)
IMIFA	Bayesian	Bayesian	IMIFA	Murphy et al. (2020)
IAMFA	split	approx. BIC	autoMFA	Section 3.7

In the subsections to follow, we provide an overview of several existing methods for inferring the number of components and/or factors in MFA models, along with our proposed IAMFA algorithm. The methods considered are naive grid search (NGS), adaptive mixtures of factor analysers (AMoFA), automated mixtures of factor analysers (AMFA), variational Bayesian mixtures of factor analysers (VBMFA), overfitted mixtures of infinite factor analysers (OMIFA) and infinite mixtures of infinite factor analysers (IMIFA). A summary of these methods is presented in Table 1, with further details provided later in the section.

3.1 Naive grid search (NGS)

The most intuitive and simplest method for automatically choosing g and q is perhaps performing an exhaustive search over different combinations of g and q. This method, which we call naive grid search (NGS), serves as a baseline for systematically evaluating all possible combinations of g and q within predefined ranges.

Given a range of plausible values for g and q (i.e., $g_{\min} \leq g \leq g_{\max}$ and $q_{\min} \leq q \leq q_{\max}$), the NGS approach fits an MFA model to each combination of (g, q) within the given range and then selects the best model amongst all of the candidate models according to some model selection criterion (typically the BIC). For the simulation study in this article, we employ the ECM algorithm described in Section 2 for parameter estimation of the MFA model. As the EM algorithm is sensitive to starting values, it is recommended (for each combination of (g, q)) to run the algorithm multiple times with different initializations and then choose the one with the highest likelihood value. We adopt the method suggested by McLachlan et al. (2003) to find initial parameter estimates, which is based on k-means clustering. As the NGS method searches over all possible values for g and q (within the specified ranges), we should be almost guaranteed to find the best combination of parameters (within that range). The other methods discussed in this section can be viewed as approximations of the NGS method. Hence, we will consider NGS as the ‘gold standard’ when comparing with other methods in the simulation study in Section 4. However, the total number of candidate models grows rapidly as g_max and q_max increase (assuming that we hold g_min and q_min fixed, often both at one). Let n_s be the number of initial values used for each (g, q) combination. There are $(q_{\max} - q_{\min} + 1) (g_{\max} - g_{\min} + 1)$ candidate models, requiring a total of

n_{s} (q_{\max} - q_{\min} + 1) (g_{\max} - g_{\min} + 1)

evaluations of the ECM algorithm. Even with small values of n_s, g_max and q_max (such as n_s = 10, g_max = 10 and q_max = 5), NGS would need a large number of ECM evaluations (500 in this example). For small datasets where the execution of the ECM algorithm is very fast, this approach may be reasonable. However, the computational cost can become prohibitively high for large datasets, rendering it infeasible in practice.
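The size of the NGS search space can be made concrete with a short Python sketch that enumerates the candidate (g, q) pairs and counts the ECM runs. The function names are hypothetical; the actual model fitting is deliberately left out.

```python
from itertools import product

def ngs_candidates(g_min, g_max, q_min, q_max):
    """All (g, q) combinations visited by a naive grid search."""
    return list(product(range(g_min, g_max + 1), range(q_min, q_max + 1)))

def ngs_ecm_runs(n_s, g_min, g_max, q_min, q_max):
    """Total ECM runs: n_s initializations for every candidate model."""
    return n_s * len(ngs_candidates(g_min, g_max, q_min, q_max))

# Example from the text: n_s = 10, g_max = 10, q_max = 5, g_min = q_min = 1
print(len(ngs_candidates(1, 10, 1, 5)), ngs_ecm_runs(10, 1, 10, 1, 5))
```

With these settings there are 50 candidate models and 500 ECM runs, matching the figure quoted above.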

3.2 Adaptive mixtures of factor analysers (AMoFA)

Kaya and Salah (2015) introduce a dynamic model selection approach based on the minimum message length (MML) criterion (Wallace and Boulton, 1968). Rather than searching over all possible candidate models as in NGS, AMoFA iteratively splits existing components or merges them to optimize the trade-off between model complexity and fit. The MML criterion functions similarly to BIC, but includes an additional complexity penalty to discourage complex models. By incrementally increasing model complexity only when justified by the data, the AMoFA approach avoids unnecessary computations while still ensuring a reasonable fit.

The algorithm comprises two phases: an incremental phase and a decremental phase. It begins by fitting a single component single factor MFA model to the data (i.e., (g, q) = (1, 1)) using a slightly modified version of the ECM algorithm by Ghahramani and Hinton (1997). During the incremental phase, AMoFA progressively splits a mixture component into two or adds a new factor to an existing component until the decrease in MML is less than a specified threshold. It then enters the decremental phase, where components are progressively merged until only one component remains. The final model is chosen to be the model with the lowest MML value found throughout the entire process. A summary of the AMoFA algorithm is presented in Algorithm 1 in Appendix B. For technical details of the splitting and merging process, the reader is referred to Kaya and Salah (2015). It should be noted that AMoFA allows each component to have its own number of factors q_i.

3.3 Automated mixtures of factor analysers (AMFA)

More recently, Wang and Lin (2020) proposed the automated mixtures of factor analysers (AMFA) algorithm that can estimate q without an exhaustive search or splitting/merging. AMFA modifies the EM algorithm by treating q as a parameter to be estimated rather than fixed a priori. It uses an approximation of the BIC to determine q, effectively embedding a regularization mechanism into the model-fitting process. The AMFA approach is a generalization of the automated factor analysis (AFA) algorithm by Zhao and Shi (2014) and is based on the ECM algorithm by Zhao and Yu (2008). With this approach, the value of q is estimated by minimizing the following approximation of the BIC,

q^{(k + 1)} = \underset{q \leq q_{\max}}{argmin} \{\sum_{i = 1}^{g} n π_{i}^{(k + 1)} \sum_{h = 1}^{q} (\log λ_{i h} - λ_{i h} + 1) + m \log n\},

(3.1)

where $m = g (2 p + p q + 1 - \frac{1}{2} q (q - 1)) - 1$ is the number of parameters in an MFA model with g components and q factors and λ_ih is defined in Appendix A. This is performed during the M-step of the ECM algorithm, after updating π_i (2.4) and μ_i (2.5) and before updating B_i (2.6). A summary of the AMFA algorithm is presented in Algorithm 2, in Appendix C. Note that the AMFA algorithm infers q only and g is assumed to be known.
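The selection rule (3.1) can be sketched in Python as follows. The sketch assumes that the eigenvalues λ_ih (defined in Appendix A, not reproduced here) are supplied per component in decreasing order, and that the search runs over q = 1, …, q_max; the function names `n_params` and `choose_q` are hypothetical and do not correspond to the autoMFA API.

```python
import math

def n_params(g, p, q):
    """Number of free parameters m of an MFA model with g components and q factors."""
    # q * (q - 1) is always even, so integer division is exact.
    return g * (2 * p + p * q + 1 - q * (q - 1) // 2) - 1

def choose_q(lam, pi, n, p, q_max):
    """Value of q minimizing the approximate BIC in (3.1).

    lam[i] holds the eigenvalues lambda_ih of component i in decreasing
    order (assumed given); pi[i] are the current mixing proportions.
    """
    g = len(lam)
    best_q, best_val = None, math.inf
    for q in range(1, q_max + 1):
        fit = sum(n * pi[i] * sum(math.log(l) - l + 1.0 for l in lam[i][:q])
                  for i in range(g))
        val = fit + n_params(g, p, q) * math.log(n)
        if val < best_val:
            best_q, best_val = q, val
    return best_q
```

Since log λ − λ + 1 is zero at λ = 1 and negative elsewhere, eigenvalues close to one contribute almost nothing to the fit term, so the penalty m log n stops q from growing once the remaining eigenvalues are near one.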

3.4 Variational Bayesian mixtures of factor analysers (VBMFA)

A number of attempts have been made to infer g and q using Bayesian techniques. One of the earlier attempts is Ghahramani and Beal (1999), who proposed the variational Bayesian mixture of factor analysers (VBMFA) based on a Bayesian formulation of the MFA model. The method makes use of the variational Bayesian expectation-maximization (VBEM) algorithm. We will not provide a full description of the VBMFA algorithm (see Beal (2003) for details), but instead focus on the inference process of g and q.

The inference of the number of components g is performed through an incremental process. It begins by assuming a single component and then progressively introduces new components through a mechanism known as component birth. At each step, a new component is added and the model is re-evaluated to check whether this addition improves the model fit. If no improvement is achieved, or if the specified maximum number of attempts has been reached, the process terminates. This allows the model to determine the most appropriate number of components while balancing model complexity and fit to the data.

The inference of q is handled using automatic relevance determination (ARD) priors. ARD allows the model to automatically decide which factors (i.e., which columns of the factor loading matrices B_i) are important and should be retained. Specifically, the columns of each factor loading matrix B_i are assigned a zero-mean normal prior whose variance is the inverse of a precision parameter ν_ih. A gamma prior is placed on ν_ih. During the fitting process, the precision parameter ν_ih can grow very large (approaching infinity) under certain conditions, effectively forcing the corresponding factor loadings to become very small. When this happens, the associated factor becomes irrelevant and the corresponding column in B_i is discarded. This mechanism allows for automatic dimension reduction by pruning unnecessary factors, leading to a more parsimonious model.

Despite the elegance of VBMFA in theory, practical implementation can present challenges. For instance, in our testing (using both our R implementation in autoMFA and the original MATLAB implementation by Ghahramani and Beal (1999)), we were unable to reproduce the behaviour shown in Beal (2003) that produced columns of factor loading matrices with extremely small loadings. This suggests that further refinements to the implementation may be necessary to align more closely with the theoretical outcomes.

3.5 Overfitted mixtures of infinite factor analysers (OMIFA)

OMIFA (Murphy et al., 2020) takes a non-parametric Bayesian approach by initializing with an over-specified number of components and factors and then gradually eliminating unnecessary components or factors based on data, guided by posterior probabilities. As the name suggests, it is a generalization of the infinite factor analysers (Bhattacharya and Dunson, 2011) to the mixture setting. It adopts an over-fitting finite mixture strategy. The model is initialized with a deliberately large number of components $g_{\max} = \max \{2 \log n, 25, n - 1\}$ and a Dirichlet prior on the mixing proportions, π ∼ Dirichlet(α). During Gibbs sampling, superfluous components tend to receive no observations, so the posterior mode of the number of ‘occupied’ components gives the estimate of g.

For the factor-analytic part, OMIFA employs the multiplicative gamma process (MGP) shrinkage prior on each factor loading matrix B_i. The MGP induces stronger shrinkage on higher-index columns, yielding a theoretically infinite factor model while retaining only as many active factors q_i as the data support. A modified adaptive Gibbs sampler (AGS) jointly updates factor loadings, scores, and component labels, automatically pruning empty components and irrelevant factors. It should be noted that the OMIFA model allows for each component to have its own number of factors.

3.6 Infinite mixtures of infinite factor analysers (IMIFA)

In the same article, Murphy et al. (2020) generalized OMIFA to IMIFA by embedding the factor analysis model into an infinite mixture framework. The primary distinction of IMIFA from OMIFA is the use of the Pitman-Yor process (PYP) for modelling the mixing proportions, replacing the fixed-g overfitting strategy used in OMIFA. The PYP generalizes the Dirichlet process, allowing heavier-tailed cluster-size distributions and hence richer partition structures. As in OMIFA, IMIFA uses the MGP prior to infer the number of factors. The MGP enables the model to adaptively determine the number of active factors, pruning irrelevant ones by assigning smaller or zero loadings to unimportant factors. This combination of a PYP prior for g and MGP shrinkage for q_i gives IMIFA great flexibility in adapting both the number of clusters and the dimension of each factor subspace.

Together, OMIFA and IMIFA share the same infinite-factor mechanism but differ in how they handle the component count: OMIFA prunes an over-specified finite mixture, whereas IMIFA learns g from a non-parametric prior.

3.7 Incremental automated mixtures of factor analysers (IAMFA)

We now introduce a hybrid algorithm that combines the strengths of AMFA and AMoFA, called the incremental automated mixtures of factor analysers (IAMFA) model. It infers g and q automatically using a progressive model-building approach. Specifically, q is inferred using the approximated BIC approach in AMFA and g is determined using an incremental approach similar to AMoFA and VBMFA. These modifications allow IAMFA to efficiently infer g and q while maintaining computational feasibility.

The algorithm begins with the simplest model, a single-component single-factor MFA model, fitted using the ECM algorithm described in Section 2. It then adopts a component-splitting strategy similar to AMoFA, but uses BIC in place of MML for model selection. This choice is motivated by the common use of BIC in penalized likelihood methods. Furthermore, empirical testing indicated that BIC achieves comparable model selection performance to MML while being computationally more efficient. Initially, the single component model will be split into a two-component model. Allocation of data points to the new component is determined in the same way as the assignment of points to the ‘child’ components in VBMFA (see Appendix H and also Beal (2003) for technical details). Parameter estimation of the two-component model is carried out via the AMFA algorithm given in Algorithm 2. Note that the number of factors q is updated during the execution of Algorithm 2. If the split results in a higher BIC value, the attempt is discarded and a new split is attempted. If no improvement to the BIC is made after a specified number of attempts, then the fitting process is terminated. Otherwise, the first split that decreases the BIC is accepted.

Then, the splitting process is repeated, but now involves a multivariate kurtosis metric. The kurtosis is chosen as it effectively captures deviations from normality that suggest a mixture component may contain multiple sub-clusters. While skewness measures asymmetry and number of modes detects multimodality, kurtosis is a more stable indicator of potential misspecified component structure. This aligns with prior findings (Beal, 2003) and empirical experiments in our study. Specifically, the splitting will be attempted in order of decreasing multivariate kurtosis metric $|γ_{i}|$ (from Mardia (1970); adapted for mixture models), where γ_i is defined as

γ_{i}^{(k)} = [b_{i}^{(k)} - p (p + 2)] \sqrt{\frac{\sum_{j = 1}^{n} τ_{i j}^{(k)}}{8 p (p + 2)}},

(3.2)

and where

b_{i}^{(k)} = \frac{\sum_{j = 1}^{n} τ_{i j}^{(k)} {[{(y_{j} - μ_{i}^{(k)})}^{⊤} {(B_{i}^{(k)} B_{i}^{{(k)}^{⊤}} + D_{i}^{(k)})}^{- 1} (y_{j} - μ_{i}^{(k)})]}^{2}}{\sum_{j = 1}^{n} τ_{i j}^{(k)}} .

It should be noted that all proposed splits—whether the initial or subsequent—are accepted only if they result in a lower BIC value. The multivariate kurtosis metric is used solely to rank the components in order of priority for splitting: at each iteration, the component with the highest kurtosis is selected first. If no split yields an improved BIC after a specified number of attempts, the algorithm terminates. A summary of the IAMFA algorithm is presented in Algorithm 3, in Appendix D. We implemented this algorithm in the R package autoMFA (Davey, 2021).
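The kurtosis-based ranking in (3.2) can be sketched in Python. The sketch assumes that the squared Mahalanobis distances of each observation to component i, computed under the fitted covariance B_i B_i^⊤ + D_i, are already available in `d2`; the function names are hypothetical and the BIC acceptance step is not shown.

```python
import math

def kurtosis_metric(tau_i, d2, p):
    """Weighted multivariate kurtosis metric gamma_i of (3.2) for one component.

    tau_i[j] are the responsibilities of component i and d2[j] the squared
    Mahalanobis distances of y_j to that component (assumed precomputed).
    """
    n_i = sum(tau_i)
    b_i = sum(t * d * d for t, d in zip(tau_i, d2)) / n_i
    return (b_i - p * (p + 2)) * math.sqrt(n_i / (8.0 * p * (p + 2)))

def split_order(gammas):
    """Component indices ordered by decreasing |gamma_i| (split priority)."""
    return sorted(range(len(gammas)), key=lambda i: abs(gammas[i]), reverse=True)

print(split_order([0.5, -3.0, 1.2]))
```

A component whose weighted fourth moment b_i equals p(p + 2), the value expected under normality, receives γ_i = 0 and is therefore tried last.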

4 Simulation study design

A systematic simulation study is carried out to compare the performance of the methods discussed in Section 3. We use the implementations of these methods that are available from the R packages autoMFA and IMIFA; see also Table 1. For the AMFA algorithm, we perform a naive search to determine g. For the VBMFA algorithm, we set q to the maximum value of q allowed. We also follow the authors’ suggestion to centre and scale the dataset before applying the VBMFA algorithm. The seven methods listed in Table 1 are tested on a range of datasets generated from the MFA model to represent various practical scenarios, defined by different combinations of the following settings:

the dimension of the data (p): low (p = 3) or high (p = 10);

the number of factors (q): low $(q = \frac{p}{3})$ or high $(q = \frac{2 p}{3})$ ;

the number of components (g): low (g = 3) or high (g = 10);

the degree of separation of the components: low or high;

the total number of data points in the dataset (n): low (n = 60g) or high (n = 240g);

whether the dataset is balanced: yes (π_i = 1/g) or no.

Each dataset in the simulation study is generated from

Y_{i j} \overset{\text{i.i.d.}}{\sim} N_{p} (μ_{i}, B_{i} B_{i}^{⊤} + D_{i}) \quad \text{for } i = 1, \dots, g; \; j = 1, \dots, n_{i},

where $\sum_{i = 1}^{g} n_{i} = n$ .

The component means μ_i are set as follows. We first define the temporary means $μ_{i}^{*}$. Depending on the setting for p and g, we have four different cases. When p = g = 3, we take $μ_{1}^{*} = {(1, 0, 0)}^{⊤}$, $μ_{2}^{*} = {(0, 1, 0)}^{⊤}$ and $μ_{3}^{*} = {(0, 0, 1)}^{⊤}$. When p = 10 and g = 3, we take $μ_{1}^{*} = {(1, 0, \dots, 0)}^{⊤}$, $μ_{2}^{*} = {(0, 1, \dots, 0)}^{⊤}$ and $μ_{3}^{*} = {(0, 0, 1, \dots, 0)}^{⊤}$. When p = 3 and g = 10, we take

\begin{array}{l} μ_{1}^{*} = (1, 0, 0)^{⊤}, μ_{2}^{*} = (1, 0, 1)^{⊤}, μ_{3}^{*} = (0, 0, 0)^{⊤}, μ_{4}^{*} = (0, 0, 1)^{⊤}, μ_{5}^{*} = (0, - 1, 0)^{⊤} \\ μ_{6}^{*} = (0, - 1, 1)^{⊤}, μ_{7}^{*} = (- 1, 0, 0)^{⊤}, μ_{8}^{*} = (- 1, 0, 1)^{⊤}, μ_{9}^{*} = (0, 1, 0)^{⊤}, μ_{10}^{*} = (0, 1, 1)^{⊤} . \end{array}

Finally, when p = g = 10, we take $μ_{1}^{*} = {(1, 0, \dots, 0)}^{⊤}, μ_{2}^{*} = {(0, 1, \dots, 0)}^{⊤}, \dots, μ_{10}^{*} = {(0, \dots, 0, 1)}^{⊤}$ . Then, for well-separated data, we set $μ_{i} = 3 μ_{i}^{*}$ whereas for not well-separated data we set $μ_{i} = 1.5 μ_{i}^{*}$ .

The covariance matrices $Σ_{i} = B_{i} B_{i}^{⊤} + D_{i}$ are set to be relatively small, by taking $D_{i} = 0.01 I_{p}$ and ${[B_{i}]}_{h s} = \sqrt{0.2} R_{h s}$ for h = 1, 2, …, p and s = 1, 2, …, q, where the R_hs are standard normal random variables. Under this setting, the covariance matrix Σ_i will be small compared to μ_i. This implies the generated data will be tightly clustered around each μ_i relative to the distance between the μ_i's. Hence the degree of separation between the components is entirely dependent on the separation between the μ_i's.

For imbalanced datasets, the number of points for each component n_i is set as follows. First, each component is initially assigned 30 points. This is to avoid having very small components. The remaining n − 30g points are allocated so that the largest component is 10 times larger than the smallest component, with intermediate components obtaining a linearly interpolated fraction of this amount. In effect, we have $π = (0. \bar{60}, 0. \bar{3}, 0. \bar{06})$ when g = 3 and for g = 10 we have

π = (0. \bar{18}, 0.1 \bar{63}, 0.1 \bar{45}, 0.1 \bar{27}, 0.1 \bar{09}, 0. \bar{09}, 0.0 \bar{72}, 0.0 \bar{54}, 0.0 \bar{36}, 0.0 \bar{18}) .
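The stated proportions can be reproduced with a small Python sketch that normalizes linearly interpolated weights running from 10 (largest component) down to 1 (smallest). This is an illustration of the quoted values of π only, under the assumption that the 10:1 ratio and linear interpolation apply directly to the proportions; it does not model the point-level allocation with the initial 30 points per component, and the function name is hypothetical.

```python
def imbalanced_proportions(g, ratio=10.0):
    """Mixing proportions with linearly interpolated weights from `ratio` down to 1."""
    weights = [ratio - (ratio - 1.0) * i / (g - 1) for i in range(g)]
    total = sum(weights)
    return [w / total for w in weights]

print([round(x, 4) for x in imbalanced_proportions(3)])
print([round(x, 4) for x in imbalanced_proportions(10)])
```

For g = 3 the weights are 10, 5.5 and 1 (summing to 16.5), and for g = 10 they are 10, 9, …, 1 (summing to 55), which gives the repeating decimals listed above.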

With six variables for the setting and two choices for each variable, there will be 2⁶ = 64 different combinations of the variables. Making use of a 2⁶⁻² fractional factorial design by Box (2005), the number of experiments needed can be reduced to 16, at the cost of introducing confounding effects from third and sixth order interactions. Furthermore, four of these combinations (where p = 3 and q = 2) do not satisfy the Ledermann bound (2.2) and hence are discarded. The settings for the remaining 12 experiments are summarised in Table 2. For each setting, we make 100 simulated replications. Hence there is a total of 1,200 simulations.

Table 2

Variable settings for each of the 12 simulated experiments; each experiment is replicated 100 times. n is the total number of points in the dataset. p is the dimension of an observation. g is the number of clusters. π are the cluster proportions. q is the number of factors in each cluster.

Experiment	p	n	Separation	g	Balanced	q
1	3	60g	Low	3	No	$\frac{1}{3} p = 1$
2	10	60g	Low	3	Yes	$\frac{1}{3} p = 3$
3	10	240g	Low	3	No	$\frac{2}{3} p = 6$
4	10	60g	High	3	No	$\frac{2}{3} p = 6$
5	3	240g	High	3	No	$\frac{1}{3} p = 1$
6	10	240g	High	3	Yes	$\frac{1}{3} p = 3$
7	10	60g	Low	10	Yes	$\frac{2}{3} p = 6$
8	3	240g	Low	10	Yes	$\frac{1}{3} p = 1$
9	10	240g	Low	10	No	$\frac{1}{3} p = 3$
10	3	60g	High	10	Yes	$\frac{1}{3} p = 1$
11	10	60g	High	10	No	$\frac{1}{3} p = 3$
12	10	240g	High	10	Yes	$\frac{2}{3} p = 6$

5 Experimental results

The performance of the seven methods is evaluated in five aspects: (a) the computation time, (b) the quality of the model fit according to the BIC, (c) the clustering accuracy according to the adjusted Rand index (ARI) (Hubert and Arabie, 1985), (d) the inferred g and (e) the inferred q. The results for aspects (a) to (c) are presented in Tables 3 to 5, respectively, and those for aspects (d) and (e) in Tables 6 to 10. In addition, histograms for the inferred values of g and q are given in Figures 1 and 2.

We will give a few general remarks before discussing each of the five aspects. Of the methods considered (and apart from NGS), the AMFA method performed the best, on average, for all of the metrics except the computation time and for inferring q to within an error of ± 2. This is followed by the OMIFA and IMIFA methods which generally performed comparably well. The IAMFA generally performed well, but not as well as the aforementioned methods. Lastly, the AMoFA and VBMFA are almost always the worst performing methods in these experiments.

Additionally, we conducted experiments with a higher dimension of p = 20, as suggested by a reviewer. The variable settings were the same as in the p = 10 experiments (i.e., Experiments 2, 3, 4, 6, 7, 9, 11, and 12). The observed trends remained consistent with those for p = 3 and p = 10, with the only notable difference being the expected increase in computation time. As these results do not provide new insights, they are not discussed here. However, a summary of computation times is included in Table 11 in Appendix F.

5.1 Computation time

The simulation study was performed in R on a mid-range quad-core computer. All experiments were conducted on the same machine and all methods were run under the same computational conditions. We implemented the methods NGS, AMoFA, AMFA, VBMFA, and IAMFA in the R package autoMFA. For the OMIFA and IMIFA methods, the authors’ implementation in the R package IMIFA was used. We note that some of the methods have implementations in other programming languages. However, for consistency, the R implementations were used. It should be noted that the code in autoMFA has not been professionally optimized. It has, however, been structured as consistently as possible between the different methods to facilitate a fair comparison.

Table 3

Mean CPU time taken (in seconds) to fit each model for each experiment group.

Exp.	AMFA	IAMFA	AMoFA	VBMFA	NGS	IMIFA	OMIFA
1	81	4	7	1	79	404	364
2	39	2	5	1	386	878	565
3	202	15	75	5	1 418	1 357	758
4	40	2	7	1	312	1 020	574
5	207	7	2	4	203	865	453
6	172	28	9	4	1 684	969	743
7	210	14	51	4	1 660	1 011	892
8	528	225	16	96	522	842	768
9	386	487	369	193	4 137	1 810	1 477
10	120	30	12	5	117	547	377
11	76	30	78	10	998	1 184	885
12	552	197	516	232	2 748	839	1 596
Avg.	218	87	96	46	1 189	977	788

The mean CPU time (in seconds) for each method and each experiment is presented in Table 3. On average across all 1 200 trials, VBMFA is the fastest method and (as expected) NGS is the slowest method. Ignoring the NGS method, the difference in mean computation time between the fastest and the (second) slowest is pronounced: VBMFA is roughly 21 times faster than IMIFA (46 versus 977 seconds on average). The average computation times of the IAMFA and the AMoFA methods are roughly twice that of the VBMFA method, placing them in second and third place, respectively.

It can be observed from Table 3 that the dimension of the data had a clear impact on the computation time for the NGS method. This is likely because the method is not only impacted by the higher computational burden of performing the necessary matrix computations in higher dimensions (as are all of the other methods), but it also has to evaluate six times more models, since when p = 10, the Ledermann bound affords q_max = 6 compared to q_max = 1 when p = 3.
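The values of q_max quoted here can be reproduced with a few lines of code. The sketch below assumes the Ledermann bound in its usual form, (p − q)² ≥ p + q, which matches q_max = 1 for p = 3 and q_max = 6 for p = 10; equation (2.2) itself lies outside this section. The range of g used in the example is a placeholder, not the grid used in the study.

```python
def ledermann_qmax(p):
    """Largest integer q satisfying (p - q)^2 >= p + q (assumed form of the bound)."""
    q = 0
    while q + 1 < p and (p - (q + 1)) ** 2 >= p + (q + 1):
        q += 1
    return q

def grid_size(g_values, p):
    """Number of (g, q) pairs an exhaustive grid search would have to fit."""
    return len(g_values) * ledermann_qmax(p)

g_values = range(1, 21)  # hypothetical search range for g
print(ledermann_qmax(3), ledermann_qmax(10))
print(grid_size(g_values, 10) // grid_size(g_values, 3))
```

Whatever range is searched over g, the grid for p = 10 is six times the size of the grid for p = 3, which is the factor referred to above.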

Unsurprisingly, the AMFA and NGS methods are generally slower than the VBMFA, AMoFA and IAMFA methods as the last three methods do not involve any exhaustive search. We might expect the AMFA method to be generally faster than the NGS method since the former does not need to search over a range of values for q. However, this was not always the case. For example, in Experiments 1, 5, 8 and 10, NGS was faster than AMFA. We note that in these experiments q_max = 1 (as p = 3) and so NGS only searches over g, like AMFA. The slight time differences may be due to the implementations of these methods.

Apart from NGS, the IMIFA and OMIFA methods were (on average) the slowest among the methods considered. However, we note that their computation times were considerably less variable than those of the other methods. This can be observed from the ratio of the longest mean computation time to the shortest mean computation time. For NGS, this ratio is 52.17, whereas for IMIFA and OMIFA it is 4.48 and 4.38 respectively.

Table 4

Mean BIC for each model and each experiment group. *For IMIFA and OMIFA, the mean BICM is provided.

Exp.	AMFA	IAMFA	AMoFA	VBMFA	NGS	IMIFA*	OMIFA*
1	1 051	1 054	1 294	—	1 051	−2 118	−2 385
2	3 508	3 608	3 956	—	3 495	−4 203	−4 772
3	17 812	17 849	20 142	—	17 793	−16 966	−18 977
4	5 149	5 386	5 557	—	4 986	−5 294	−5 854
5	3 830	3 832	3 896	—	3 830	−95 533	−97 660
6	12 408	12 703	12 587	—	12 408	−18 320	−20 831
7	19 096	19 089	20 776	—	18 798	−31 415	−25 044
8	17 626	17 635	17 822	—	17 626	−265 394	−87 380
9	46 660	47 633	51 111	—	46 656	−69 185	−75 599
10	5 322	5 311	5 864	—	5 322	−86 118	−74 742
11	13 675	15 604	15 440	—	13 600	−19 758	−22 377
12	66 760	69 780	72 590	—	66 644	−90 264	−101 371
Avg.	17 741	18 290	19 253	—	17 684	−58 714	−44 749

5.2 Quality of model fit

The mean BIC score for each method and each experiment is presented in Table 4. It should be noted that we cannot directly compare the IMIFA and OMIFA methods with the rest of the methods, as the former provided the BIC Monte Carlo (BICM) (Raftery et al., 2007) instead of BIC. As expected, the NGS achieved the lowest mean BIC across all 1 200 simulations. The difference in mean BIC between NGS and AMFA (the second best performing method) is over 50, suggesting the NGS method found better models than the other methods, on average. However, even though NGS generally attained lower mean BIC scores than the competing methods, the difference was very minor in some cases. For example, in Experiments 1, 5, 6, 8, 10, the AMFA method was able to achieve the same mean BIC score (up to four decimal places) as the NGS method.
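For reference, a BIC value of this kind can be computed as sketched below. The sketch assumes the common convention BIC = −2 log L + d log n (lower is better, consistent with the comparisons above) and the usual free-parameter count for an MFA with component-specific loadings and diagonal error covariances. The paper's own definition lies outside this section, so both the count and the function names are assumptions.

```python
import math

def mfa_num_params(g, p, q):
    """Free parameters of a g-component MFA with q factors in p dimensions."""
    mixing = g - 1                                # mixing proportions sum to one
    means = g * p                                 # one mean vector per component
    loadings = g * (p * q - q * (q - 1) // 2)     # loadings, adjusted for rotation
    uniquenesses = g * p                          # diagonal of each D_i
    return mixing + means + loadings + uniquenesses

def bic(log_likelihood, num_params, n):
    """BIC = -2 log L + d log n; smaller values indicate a preferred model."""
    return -2.0 * log_likelihood + num_params * math.log(n)

d = mfa_num_params(g=3, p=10, q=3)
# The log-likelihood below is a made-up number used only to exercise the function.
example_bic = bic(log_likelihood=-1500.0, num_params=d, n=180)
print(d, round(example_bic, 2))
```

Under this convention, a difference of more than 10 between two models fitted to the same data is usually read as strong evidence for the model with the lower BIC.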

We also observe from Table 4 that the difference in mean BIC between NGS and AMFA for Experiments 2, 3, 4, 7, 11 and 12 is greater than 10, indicating that NGS outperformed AMFA in these situations. We note that these six experiments all have p = 10. This suggests that AMFA’s ability to find the ‘gold standard’ models (fitted by NGS) declines as p increases.

The AMoFA method performed poorly across all the experiments, with much higher mean BICs than AMFA and NGS. The IAMFA method also generally performed poorly in comparison to NGS and AMFA, although it fitted better models in Experiment 10.

For the IMIFA and OMIFA methods, the former produced a lower average BICM score, although OMIFA actually obtained a lower BICM in 9 out of the 12 experiments. The main reason that the IMIFA method scored better on average is that in Experiment 8 it obtained an average BICM value whose magnitude was roughly three times that of the OMIFA method. Further investigation showed that this was not due to an outlier BICM value biasing the average for IMIFA. Overall, the BICM criterion suggests that in most of the experiments the OMIFA method produced better models than the IMIFA method, on average.

Table 5

Mean ARI score for each model and each experiment in the simulation study. A higher ARI score is preferred.

Exp.	AMFA	IAMFA	AMoFA	VBMFA	NGS	IMIFA	OMIFA
1	0.84	0.83	0.57	0.74	0.84	0.76	0.72
2	0.96	0.94	0.61	0.53	0.96	0.96	0.97
3	0.96	0.96	0.37	0.96	0.96	0.96	0.96
4	0.95	0.86	0.54	0.74	1.00	0.99	0.99
5	1.00	1.00	0.98	0.98	1.00	1.00	1.00
6	1.00	1.00	0.95	1.00	1.00	1.00	1.00
7	0.05	0.07	0.14	0.02	0.64	0.56	0.56
8	0.58	0.59	0.48	0.49	0.58	0.39	0.36
9	0.94	0.93	0.61	0.92	0.94	0.94	0.94
10	0.95	0.96	0.70	0.35	0.95	0.44	0.41
11	0.99	0.84	0.69	0.73	1.00	1.00	1.00
12	0.99	0.93	0.70	0.99	0.99	0.99	0.99
Avg.	0.85	0.82	0.61	0.70	0.90	0.83	0.82

5.3 Clustering accuracy

The mean ARI score for each method and each experiment is presented in Table 5. Note that higher ARI scores are preferred, with the maximum score being one. The NGS method obtained the highest overall mean ARI across all 1,200 simulations. AMFA is the next best performing method, followed by IMIFA, OMIFA, and IAMFA, which obtained similar average ARI scores. The VBMFA and AMoFA methods performed noticeably worse in terms of average ARI score.
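To make the ARI concrete, the following is a minimal pure-Python sketch of the Hubert and Arabie (1985) formula, computed from pair counts in the contingency table of true and estimated labels. It is an illustration rather than the implementation used in the study, and the function name is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same observations."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two observations")

    # Pairs of observations grouped together in both, in the true, and in the
    # estimated partition, respectively.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)   # value expected by chance
    maximum = (sum_true + sum_pred) / 2           # value under perfect agreement
    if maximum == expected:                       # degenerate case, e.g. one cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # relabelled but identical
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # one true cluster split
```

The index is invariant to relabelling of the clusters, so the first call returns 1, while splitting one true cluster into two in the second call lowers the score to 4/7.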

Looking closer at each experiment, we see that NGS always had the highest average ARI score except for Experiment 8 (where IAMFA attained the highest average ARI). The AMoFA method performed worst in the majority of the experiments. Of particular note is that VBMFA behaves erratically. It achieved an average ARI score comparable to that of the NGS method in Experiments 3, 5, 6, 9 and 12. These experiments all have high n, suggesting a possible link between VBMFA’s ability to recreate the underlying sub-population structure and the size of the dataset being used. However, in Experiment 8 (where n is also high), VBMFA performed very poorly, and its performance was even worse than in some of the experiments where n was low.

Concerning IMIFA and OMIFA, they obtained average ARI scores that are very close to those of the NGS method in most of the experiments, apart from Experiments 1, 7, 8 and 10. Of these, all but Experiment 10 were not well-separated, implying a possible relationship between the separation of clusters and these two methods’ ability to recreate the sub-population structure of the datasets.

Experiments 7 and 8 were the cases where all methods produced relatively low average ARI scores. There were g = 10 clusters in both cases and the clusters were not well-separated. Several of the clusters could be overlapping, making it challenging for the algorithms to correctly identify all ten components. We observe that all of the methods (except AMoFA) tend to underestimate g in these two experimental settings. In Experiment 7, NGS outperformed the other methods by a large margin. In Experiment 8, IAMFA obtained the highest average ARI score, followed by NGS and AMFA.

5.4 Inference on g

To assess how well the different methods inferred g, we calculated two metrics: the proportion of times they correctly obtained the true value of g and the proportion of times they obtained a value that is within 2 of the true value of g. We also fitted a Rasch model (Rasch, 1960) to each of these two cases. The proportion of times where each method inferred g correctly is summarized in Table 6 and the proportion of times each method inferred g correctly within 2 is summarized in Table 10. The ability parameter for each method as given by the Rasch model is listed in Table 7.
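The two metrics, and the form of the Rasch model, can be sketched as follows. The Rasch part assumes the standard dichotomous model, in which the probability of a correct outcome is a logistic function of ability minus difficulty, with each method playing the role of a respondent; how the authors mapped experiments and replications to items is not stated in this section, so that mapping and all function names here are assumptions. The inferred values in the example are made up.

```python
import math

def proportion_within(inferred, true_value, tolerance=0):
    """Share of replications whose inferred value is within `tolerance` of the truth."""
    if not inferred:
        raise ValueError("need at least one replication")
    hits = sum(1 for value in inferred if abs(value - true_value) <= tolerance)
    return hits / len(inferred)

def rasch_success_probability(ability, difficulty):
    """Standard dichotomous Rasch model: P(correct) = logistic(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical inferred values of g over six replications with true g = 10.
inferred_g = [10, 9, 8, 7, 12, 10]
exact = proportion_within(inferred_g, true_value=10)
within_two = proportion_within(inferred_g, true_value=10, tolerance=2)
print(exact, within_two)
```

Under this model, a method with a higher ability estimate has a higher probability of a correct outcome on every item, which is what makes the estimates usable as a single ranking across experiments.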

Unsurprisingly, NGS achieved the highest overall proportion of times where it inferred g correctly, as well as inferring g to within 2. The second-best performing method in Table 6 is AMFA. This is not surprising, given that AMFA performs a grid search over g. Overall, the worst performing method is AMoFA. The Rasch model ability estimates in Table 7 paint a similar picture. We see that NGS has the highest ability estimates in both metrics (for g and for g ± 2). The AMFA method has the second highest ability estimates.

Looking at each experiment, we see that NGS was able to determine g correctly the majority of times. In fact, it was able to determine the correct g 100% of the time in 7 out of the 12 experiments (Experiments 2 to 6 and 11 and 12).

All of the methods tended to perform better, on average, when g = 3 compared to when g = 10. This effect is especially evident in Table 6, where the first six rows correspond to the settings with g = 3. We observed near perfect scores in most of these rows for all methods except AMoFA and VBMFA. This suggests that, in general, the methods may be able to infer g more accurately when the true number of components is small.

All the methods performed poorly in Experiments 7 and 8. As revealed in Figure 1, the histogram of inferred g by each method shows that almost all methods (apart from AMoFA) tended to underestimate the number of components in these experimental settings. In particular, IMIFA and OMIFA seriously underestimated g in Experiment 8. The AMoFA method, in contrast, overestimated the number of components in Experiment 7.

On examining histograms similar to Figure 1 for the other experiments (not shown), we found that AMFA generally performed similarly to NGS, with the exception of Experiment 7. The AMoFA method generally had the poorest performance amongst the methods considered. It routinely and dramatically overestimated the number of components, sometimes even upwards of 40 components. The VBMFA method tended to underestimate g. The IAMFA method performed relatively well, except in Experiment 7.

5.5 Inference on q

We use the same two metrics as in Section 5.4 to assess how well each method infers the number of factors q. The results are presented in Tables 7, 8, and 9.

Overall, the NGS method inferred q correctly the highest proportion of the time, which is not surprising. AMFA took second place according to Table 8, with an overall average of 73%. At the other end, IMIFA and OMIFA were the least effective methods, with an overall average of around 4%. Inspection of Table 8 reveals that they incorrectly inferred q for all 100 replications in several experiments. The exception was Experiment 4, where they correctly inferred q about 30% of the time, whereas all the other methods did so less than 20% of the time. Interestingly, the second metric (where we allow for an error of ±2) paints a very different picture; see Table 9. The IMIFA and OMIFA methods both achieved perfect scores in all experiments. NGS followed closely behind, with an overall score of 99.67%. The AMoFA and IAMFA methods scored poorly on this metric. The results imply that while IMIFA and OMIFA were rarely able to infer q exactly, they were able to infer values very close to the true q very consistently.

Figure 1

Inferred number of components g for Experiments 7 and 8. The position of the red dot indicates the true number of components.

Table 6

Proportion of times where each model inferred the number of components g correctly.

Exp.	AMFA	IAMFA	AMoFA	VBMFA	NGS	IMIFA	OMIFA
1	0.88	0.83	0.08	0.65	0.88	0.63	0.52
2	0.99	0.93	0.04	0.08	1.00	1.00	1.00
3	1.00	0.99	0.01	0.99	1.00	1.00	1.00
4	0.88	0.67	0.01	0.27	1.00	1.00	1.00
5	1.00	0.98	0.66	0.97	1.00	0.98	0.98
6	1.00	1.00	0.69	1.00	1.00	1.00	1.00
7	0.00	0.00	0.04	0.00	0.45	0.17	0.10
8	0.21	0.31	0.18	0.19	0.21	0.00	0.00
9	0.98	0.99	0.00	0.42	0.99	1.00	1.00
10	0.76	0.76	0.10	0.02	0.76	0.00	0.00
11	0.91	0.09	0.03	0.00	1.00	1.00	1.00
12	1.00	0.54	0.04	1.00	1.00	1.00	1.00
Avg.	0.80	0.67	0.16	0.47	0.86	0.73	0.72

Table 7

Each method’s ability estimates for inferring the given criteria correctly, obtained by fitting a Rasch model for each of the criteria.

Test	AMFA	IAMFA	AMoFA	VBMFA	NGS	IMIFA	OMIFA
g	3.32	1.29	−5.91	−1.11	4.71	2.10	1.87
g ± 2	1.15	0.23	−3.50	−0.73	2.71	0.07	0.02
q	1.90	−0.64	0.06	—	3.49	−4.24	−4.13
q ± 2	0.07	−1.25	−0.81	—	2.72	3.40	3.40

It can be observed from Table 8 that all methods struggled with Experiments 4 and 7. The number of factors was q = 6 in both cases. Figure 2 indicates that AMFA, IAMFA, AMoFA, and NGS tended to underestimate the true number of factors. This was also true for IMIFA and OMIFA in Experiment 7, although to a lesser extent. In this case, the modal number of inferred factors was five for both methods. These two experiments were also the only experiments where AMFA was unable to correctly infer q to within an error of 2 in all trials (see Table 9). Both AMFA and IAMFA almost always underestimated q in these settings, which may suggest that the technique for inferring q (3.1) is less accurate as the number of factors increases. The AMoFA method shows a similar pattern, performing reasonably well when q = 1 and q = 3, but very poorly when q = 6. This may suggest that its performance scales poorly as q increases.

The Rasch ability estimates in Table 7 support our observations above. We see that NGS has the highest ability estimate for inferring q exactly, whereas IMIFA and OMIFA were ranked equal best when inferring q to within an error of 2. Excluding the NGS, overall AMFA performed best among the methods considered.

In addition, we investigated whether the accuracy of the inferred g had an impact on how well q is inferred. To pursue this, we examined the proportion of times q is inferred correctly given that g has been correctly inferred. For AMFA, IAMFA, NGS, and IMIFA, only a minimal increase was observed in the proportion of correctly inferred q. For OMIFA, a slight decrease was observed. However, for AMoFA, the proportion almost doubled. This is likely attributable to AMoFA’s poor ability to infer g correctly. It often drastically overestimates g, resulting in a very different clustering from the true one. On the other hand, if g was provided or estimated correctly, then the inference for q appears to be quite accurate.
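The conditional comparison described here amounts to restricting attention to the replications in which g was inferred correctly. A minimal sketch follows, with a hypothetical function name and made-up inferred values that are not results from the study.

```python
def q_accuracy_overall_and_given_g(inferred_g, inferred_q, true_g, true_q):
    """Return (P(q correct), P(q correct | g correct)) over a set of replications."""
    pairs = list(zip(inferred_g, inferred_q))
    if not pairs:
        raise ValueError("need at least one replication")
    overall = sum(q == true_q for _, q in pairs) / len(pairs)
    q_when_g_correct = [q == true_q for g, q in pairs if g == true_g]
    if not q_when_g_correct:
        return overall, float("nan")   # g was never inferred correctly
    conditional = sum(q_when_g_correct) / len(q_when_g_correct)
    return overall, conditional

# Toy replications: g is overestimated twice, and q is wrong whenever that happens.
overall, conditional = q_accuracy_overall_and_given_g(
    inferred_g=[3, 3, 5, 6, 3], inferred_q=[2, 2, 1, 1, 3], true_g=3, true_q=2
)
print(overall, conditional)
```

In the toy data the conditional proportion (2/3) is well above the overall proportion (2/5), the same kind of pattern reported for AMoFA.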

Figure 2

Inferred number of factors q for Experiments 4 and 7. The position of the red dot indicates the true number of factors.

Table 8

Proportion of times where each model inferred the number of factors q correctly.

Exp.	AMFA	IAMFA	AMoFA	VBMFA	NGS	IMIFA	OMIFA
1	1.00	1.00	0.97	–	1.00	0.00	0.00
2	0.91	0.31	0.65	–	1.00	0.00	0.00
3	0.55	0.68	0.03	–	0.86	0.01	0.01
4	0.00	0.01	0.02	–	0.19	0.31	0.32
5	1.00	1.00	0.97	–	1.00	0.00	0.01
6	1.00	0.00	0.94	–	1.00	0.00	0.00
7	0.00	0.00	0.00	–	0.03	0.12	0.16
8	1.00	1.00	0.99	–	1.00	0.00	0.00
9	1.00	0.00	0.18	–	1.00	0.00	0.00
10	1.00	1.00	1.00	–	1.00	0.00	0.00
11	0.89	0.00	0.43	–	1.00	0.03	0.03
12	0.41	0.11	0.04	–	0.99	0.02	0.02
Avg.	0.73	0.43	0.52	–	0.84	0.04	0.05

6 Concluding remarks

This study provides a systematic comparison of methods for automatically inferring the number of components (g) and factors (q) in mixtures of factor analysers (MFA) models, including Naive Grid Search (NGS), AMFA, AMoFA, VBMFA, OMIFA, IMIFA, and the proposed IAMFA. We evaluate their performance across a diverse range of settings in terms of clustering accuracy, model selection, and computational efficiency. The insights from this comparative study provide helpful guidance for automated model selection and are relevant for applied fields such as biomedical research, finance, image processing and social sciences. For example, MFA are frequently used for dimensionality reduction in genomic studies, disease subtyping, and medical imaging.

Our findings indicate that, apart from NGS, AMFA consistently performed best across most evaluation metrics, particularly in terms of model selection accuracy. OMIFA and IMIFA also demonstrated strong performance. IAMFA, while not the top-performing method, provided a balance between computational efficiency and model selection accuracy, making it a viable alternative for applications where exhaustive search methods like NGS are computationally prohibitive. In contrast, AMoFA and VBMFA exhibited the weakest overall performance in these experiments.

A key challenge identified is the computational burden of methods that iteratively refine model complexity. While IAMFA improves on exhaustive search, it still requires iterative component splitting and parameter estimation, which can be costly for high-dimensional data. Scalability remains an issue for even the best-performing methods as dimensionality and sample size increase, highlighting the need for more efficient inference techniques.

Given the strong performance of Bayesian methods such as OMIFA and IMIFA, future work could explore Bayesian adaptations of IAMFA and AMFA, using nonparametric priors and variational inference techniques to improve flexibility and robustness in inferring g and q. Enhancing computational efficiency, particularly for high-dimensional data, is also of interest for future research.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

Supplementary materials

Appendix

References

Beal (2003) Variational algorithms for approximate Bayesian inference. PhD thesis, University College London, Gower St, London.

Bhattacharya and Dunson (2011) Sparse Bayesian infinite factor models. Biometrika, 98, 291–306. doi: 10.1093/biomet/asr013

Bishop (2006) Pattern Recognition and Machine Learning. New York: Springer.

Box GEP (2005) Statistics for Experimenters: Design, Innovation, and Discovery. Wiley Series in Probability and Statistics. 2nd ed. Hoboken, NJ: Wiley-Interscience.

Davey (2021) autoMFA: Algorithms for Automatically Fitting MFA Models. URL https://CRAN.R-project.org/package=autoMFA. R package version 1.0.0.

Dempster, Laird and Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39, 1–38. doi: 10.1111/j.2517-6161.1977.tb01600.x

Ghahramani and Beal (1999) Variational inference for Bayesian mixtures of factor analysers. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS’99). Cambridge, MA: MIT Press, 449–55.

Ghahramani and Hinton (1997) The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1. University of Toronto.

Grushanina and Frühwirth-Schnatter (2023) Dynamic mixture of finite mixtures of factor analysers with automatic inference on the number of clusters and factors. arXiv preprint. doi: 10.48550/arXiv.2307.07045

Hubert and Arabie (1985) Comparing partitions. Journal of Classification, 2, 193–218. doi: 10.1007/BF01908075

Kaiser (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23, 187–200. doi: 10.1007/BF02289233

Kaya and Salah (2015) Adaptive mixtures of factor analyzers. arXiv preprint:1507.02801. doi: 10.48550/arXiv.1507.0280

Ko and Beak (2018) Var estimation using skewed mixture models and various mixtures of factor analyzers. Journal of the Korean Data and Information Science Society, 29, 769–78.

Ledermann (1937) On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika, 2, 85–93. doi: 10.1007/BF02288062

Liu, Rubin and Wu (1998) Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika, 85, 755–70. doi: 10.1093/biomet/85.4.755

Mardia (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika, 57, 519–30. doi: 10.1093/biomet/57.3.519

McLachlan and Peel (2000a) Mixtures of factor analyzers. In Proceedings of the Seventeenth International Conference on Machine Learning, edited by Langley, pages 599–606. San Francisco: Morgan Kaufmann. doi: 10.1002/0471721182.ch8

McLachlan and Peel (2000b) Finite Mixture Models. Hoboken, NJ: Wiley.

McLachlan, Peel and Bean (2003) Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis, 41, 379–88. doi: 10.1016/S0167-9473(02)00183-4

Meng X-L and Rubin (1993) Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267–78. doi: 10.1093/biomet/80.2.267

Meng X-L and van Dyk (1997) The EM algorithm - an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society. Series B (Methodological), 59, 511–67. doi: 10.1111/1467-9868.00082

Murphy, Viroli and Gormley (2020) Infinite mixtures of infinite factor analysers. Bayesian Analysis, 15, 937–63. doi: 10.1214/19-BA1179

Murphy, Viroli and Gormley (2021) IMIFA: Infinite Mixtures of Infinite Factor Analysers and Related Models. URL https://CRAN.R-project.org/package=IMIFA. R package version 2.1.8.

Murphy (2012) Machine Learning: A Probabilistic Perspective. MIT Press Academic.

R Core Team (2025) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

Raftery, Newton, Satagopan and Krivitsky (2007) Estimating the integrated likelihood via posterior simulation using the harmonic mean identity. Bayesian Statistics, 8, 1–45.

Rasch (1960) Probabilistic models for some intelligence and attainment tests. Studies in Mathematical Psychology. København: Danmarks Paedagogiske Institut.

Schwarz (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461–64. doi: 10.1214/aos/1176344136

Wallace and Boulton (1968) An information measure for classification. Computer Journal, 11, 185–94. doi: 10.1093/comjnl/11.2.185

Wang W-L and Lin T-I (2020) Automated learning of mixtures of factor analysis models with missing information. TEST, 29, 1098–1124. doi: 10.1007/s11749-020-00702-6

Yang, Ahuja and Kriegman (1999) Face detection using a mixture of factor analyzers. In Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348). Volume 3, pages 612–16. doi: 10.1109/ICIP.1999.817188

Zhao and Shi (2014) Automated learning of factor analysis with complete and incomplete data. Computational Statistics & Data Analysis, 72, 205–218. doi: 10.1016/j.csda.2013.11.008

Zhao and Yu (2008) Fast ML estimation for the mixture of factor analyzers via an ECM algorithm. IEEE Transactions on Neural Networks, 19, 1956–61. doi: 10.1109/TNN.2008.2003467


Approaches for fully automated parameter estimation of mixtures of factor analysers
