Abstract
This study aims to examine the performance of adaptive quadrature (AQ) estimation method for ordinal confirmatory factor analysis (CFA). Specifically, we compared four link functions (complimentary log-log [CLL], logit, log-log, and probit) of the AQ estimation method across varying factor structures, sample sizes, distributional shape of latent trait, and number of quadrature points. The study is conducted via a simulation study and using empirical data. The results demonstrate that the probit link function exhibits superiority across the vast majority of conditions, consistently yielding the highest proper convergence rates and, among successfully converged solutions, the lowest parameter recovery errors, and the best relative fit, whereas the logit generally showed the weakest performance. Additionally, a critical divergence was discovered regarding asymmetric link functions: while the probit link generally provided the best model fit across the positively skewed simulation conditions, the log-log link yielded the best relative model fit for the positively skewed empirical data. Furthermore, the study reveals the complex role of quadrature points in multidimensional spaces. Although using eight quadrature points may be necessary in more complex simulated models, it frequently causes severe estimation failures when applied to sparse real-world data.
Keywords
Introduction
The underlying logic of confirmatory factor analysis (CFA) applies to data at any level of measurement scales, but the estimation method employed varies depending on the scale type. This study aims to examine the performance of the adaptive quadrature (AQ) estimation method, proposed by Bryant and Jöreskog (2016), for ordinal categorical data under different conditions. The study begins by outlining the historical development of estimation methods used in CFA, followed by a review of previous research on estimation approaches for ordinal data. It then introduces the AQ method, a recently proposed estimation technique, and provides the rationale for investigating its performance in this study.
In the early stages of CFA’s popularity, maximum likelihood (ML) estimation was among the most widely used methods for estimating and testing measurement models involving unobservable variables (Kline, 2011). The ML estimator produces asymptotically unbiased, efficient, and consistent parameter estimates (Finch et al., 1997; Muthén & Kaplan, 1985, 1992; West et al., 1995) and is the default estimator in many software. However, ML estimation assumes data are at least on an interval scale and multivariate normally distributed. These assumptions can be easily violated with item-level data that is ordered categorical, a common response format in social and behavioral sciences. The nature of ordered categorical data has been a source of conceptual debates in CFA (DiStefano, 2002; Dolan, 1994; Fuller & Hemmerle, 1966; Li, 2016).
In applied research, particularly in social and behavioral sciences, categorical data on nominal or ordinal scales are common. The Likert-type scale is widely used to measure unobservable constructs through observable behaviors, such as in attitude and personality inventories (Crocker & Algina, 2008). While a five-point Likert scale was originally proposed by Likert (1932), different response categories are frequently employed in practice. For instance, the PISA 2015 student questionnaire used a four-category Likert-type scale, with response options ranging from “strongly disagree” to “strongly agree” (OECD, 2016). Numeric values are typically assigned to responses for statistical analysis, although such data should not be treated as interval-level data since it is inherently ordered categorical (Muthén & Kaplan, 1985).
CFA With Ordinal Data/non-normally Distributed Ordinal Data
Likert-type scales are often treated as continuous in analyses. Even though the latent trait is continuous, errors arising from dividing the construct’s continuous scale into a specific number of ordered categories are included in the psychometric analysis (DiStefano, 2002; Finney & DiStefano, 2006). In addition, some psychological traits such as depression or anxiety and some health-related traits generally are not normally distributed in the population. In general, these score distributions tend to be skewed, and a large percentage of individuals endorse items (Reise & Revicki, 2014).
Researchers have developed various methods for estimating CFA model parameters, standard errors, and model-fit indices when the data are categorical or ordinal with non-normal distribution. One existing approach is using the asymptotically distribution-free (ADF) or weighted least squares (WLS) estimator (Browne, 1984; Jöreskog, 1994), which has been proposed as superior to ML estimation when its assumptions are violated (DiStefano, 2002). The fit function of the WLS estimator includes a weight matrix, but estimating this matrix requires a very large sample for accuracy and stability. The WLS estimator also tends to underestimate standard errors, overestimate chi-square values, and sometimes fail to converge, especially with smaller samples, more severe non-normality, or more complex models (Finney & DiStefano, 2006).
When using the WLS method, there is no closed-form analytic solution for estimating the asymptotic covariance matrix of categorical data. While approaches exist to estimate the weight matrix asymptotically, issues may still arise in finite samples (Oranje, 2003). One approach to overcoming the limitation of the WLS estimator is using only the diagonal elements of the asymptotic covariance matrix of polychoric correlations instead of using all the elements when obtaining the weight matrix. Diagonally weighted least squares (DWLS) and robust WLS are adaptations of the WLS estimator that improve convergence rates and provide more accurate results (Finney & DiStefano, 2006; Lei, 2009; Mîndrilă, 2010).
Another alternative to overcome the computational difficulties of WLS is the Unweighted Least Squares (ULS) estimator. Unlike WLS or DWLS, the ULS method does not employ a weight matrix (or implies an identity matrix) during parameter estimation, thereby avoiding the instability associated with inverting large matrices in small samples. Similar to DWLS, ULS estimates are consistent, but standard errors and chi-square statistics must be adjusted for non-normality using robust corrections. In these modified approaches (Robust DWLS and Robust ULS), the standard errors and chi-square values are calculated based on the full weight matrix but are subsequently adjusted through the application of a scaling factor known as the Satorra-Bentler correction factor (Satorra & Bentler, 1994). Forero et al. (2009) evaluated DWLS and ULS methods through simulation, finding that while ULS gives more precise parameter estimates and standard errors, DWLS has higher convergence rates. Consequently, both DWLS and ULS are widely used and available in many software packages.
Adaptive quadrature estimation was proposed for ordered categorical and non-normal data as a full-information (Bryant & Jöreskog, 2016; Cagnone & Monari, 2013). The adaptive quadrature estimation involves employing a link function such as a logit link function to transform relationships among categorical data into continuous probability distributions. A detailed explanation of adaptive quadrature estimation is presented in the following section.
Adaptive Quadrature Estimation
The adaptive quadrature (AQ) estimation method is a recently proposed approach that serves as a full-information maximum likelihood alternative to limited-information methods (such as WLS and DWLS) for obtaining parameter estimates and standard errors in CFA models with ordinal data. Traditional limited-information methods rely on bivariate contingency tables and polychoric correlations, which can produce biased results when the data suffer from sparseness (i.e., many empty cells). As a full-information ML approach, AQ overcomes this limitation by maximizing the likelihood of all observed response patterns directly (Bryant & Jöreskog, 2016). Since the integrations in many likelihood functions have no closed-form solutions, latent variables are often numerically integrated to conduct ML estimation. While the Gauss-Hermite (GH) quadrature is commonly used to approximate these intractable integrals, it faces significant limitations as the number of latent variables increases. Specifically, the method becomes computationally infeasible when dealing with more complex models and categorical ordinal variables, as the number of observed variables, response categories, and latent variables all contribute to the computational complexity (Cagnone & Monari, 2013). To understand the superiority of AQ over traditional GH quadrature, one must consider how integration points (nodes) are selected. Standard GH quadrature approximates the integral by using a predetermined set of nodes and weights based on the standard normal distribution, which are typically centered around zero but are not equally spaced. This approach assumes the posterior density is well-represented by a standard normal distribution regardless of the data’s actual structure. However, in many latent variable models—especially those with ordinal data or high factor correlations—the posterior density for a given response pattern may be narrow or shifted far from zero. In such cases, the fixed nodes of GH may fall into regions with little probability mass, effectively missing the peak of the likelihood function. This inefficiency leads to substantial bias unless a prohibitive number of points is used (Lesaffre & Spiessens, 2001).
In contrast, AQ treats the posterior density as a moving target. It relies on an iterative process where the mean and variance of the posterior density are estimated for each response pattern. The integration nodes are then rescaled and shifted to center on the peak of the posterior density and scaled according to the posterior curvature (Rabe-Hesketh et al., 2005; Schilling & Bock, 2005). This adaptation allows AQ to achieve high precision with significantly fewer quadrature points compared to GH. Consequently, AQ is computationally more efficient and relatively more robust against the “curse of dimensionality,” particularly in complex factor structures (Schilling & Bock, 2005). Furthermore, the performance of AQ is highly dependent on the inter-factor correlation; while lower correlations make the posterior distribution diffuse and harder to integrate (often leading to non-convergence), higher correlations concentrate the distribution and improve estimation accuracy, although extremely high correlations in multidimensional models may introduce empirical underidentification (Cagnone & Monari, 2013; Lei, 2009).
Empirical studies confirm that AQ outperforms traditional GH quadrature in terms of both convergence and efficiency (Lesaffre & Spiessens, 2001; Rabe-Hesketh et al.,, 2005). Specifically, Cagnone and Monari (2013) found that AQ provided faster convergence and more accurate parameter estimates than GH, especially in conditions characterized by small sample sizes and high correlations between latent variables. Furthermore, while the choice of starting values strongly influenced the GH approach, it did not impact the stability of AQ estimation. AQ has thus been successfully applied to approximate integrals in the expectation-maximization algorithm for various latent variable models (Grilli & Rampichini, 2007; Vasdekis et al., 2012), utilizing a Bayesian perspective to approximate posterior densities (Naylor & Smith, 1982).
LISREL 9 (Jöreskog & Sörbom, 2013) and more recent LISREL versions offer four link functions (complimentary log-log [CLL], logit, log-log, and probit) for the AQ estimation approach, which converts relationships between ordered categorical variables into continuous probability distributions. These link functions differ in their underlying distributional assumptions for ordinal responses (Bryant & Jöreskog, 2016). The probit link function is normally distributed with a mean of 0 and a standard deviation of 1, with the 99% interval falling between z = −3 and z = +3. The logit link function exhibits a platykurtic distribution, with a mean of 0 and a standard deviation of 1.814, and its 99% interval is ranging from z = −5 and z = +5. The log-log link function displays positive skewness, featuring a mean of −.577 and a standard deviation of 1.283, with a 99% range spanning from z = −2 to z = +6. Conversely, the CLL link function demonstrates negative skewness, possessing a mean of .577 and a standard deviation of 1.283, and its 99% interval is situated between z = −6 and z = +2. Theoretically, symmetric link functions like probit and logit are expected to perform better with normally distributed data, whereas asymmetric functions like log-log and CLL are hypothesized to handle positively and negatively skewed data, respectively (Bryant & Jöreskog, 2016).
Purpose of the Study
While traditional limited-information methods (e.g., DWLS) rely on polychoric correlations, the Adaptive Quadrature (AQ) estimation is a full-information maximum likelihood (FIML) approach that directly evaluates the likelihood of individual response patterns using numerical integration and specific link functions (Schilling & Bock, 2005). Bryant and Jöreskog (2016) evaluated AQ in CFA with real ordinal data, demonstrating that AQ can provide accurate parameter estimates, standard errors, and fit indices, potentially outperforming other methods. However, their evaluation left critical unanswered questions regarding how different link functions—especially asymmetric ones like log-log and complementary log-log (CLL)—perform under varying degrees of severe non-normality. Furthermore, the specific interaction between model complexity (dimensionality) and the required number of quadrature points remains ambiguous; while too few points may cause estimation failure, too many may inflate computational time or trigger algorithmic failure without statistical gain (Cagnone & Monari, 2013).
Given the prevalent use of CFA with non-normal categorical data, it is imperative to understand the empirical boundaries of these estimation methods. Therefore, this study aims to comprehensively examine AQ’s performance through both an extensive Monte Carlo simulation and an empirical application using PISA 2018 data. We systematically manipulate sample size, distributional shape (skewness), number of response categories, CFA model structures, and inter-factor correlation levels. By evaluating four distinct link functions across different quadrature points, this study seeks to provide robust, evidence-based guidelines for applied researchers encountering complex, sparse, and non-normal ordinal data.
Method
A simulation study was first conducted, followed by an analysis of real-world data. This section details the simulation conditions, data generation processes, and psychometric properties of the scale used for the empirical data.
Data Generation
The simulation study systematically manipulated several design factors to evaluate estimation performance: sample size, distributional shape of the latent trait, measurement model complexity (number of factors), inter-factor correlations, and the number of response categories.
CFA requires a sufficiently large-sample size for stable parameter estimation (Finney & DiStefano, 2006). Jöreskog et al. (2016) proposed a minimum sample size requirement of (p + 1)(p + 2)/2, where p is the number of items. Since the number of items in this study was held constant at 18, the minimum sample size recommended by Jöreskog et al. (2016) was calculated as 190. These sample sizes reflect commonly used standards for CFA studies (Finney & DiStefano, 2006; Kline, 2011), where N = 200 is considered small, N = 500 moderate, and N = 1,000 large.
The distributional shape of the latent trait is a critical factor, as different estimation methods may perform inconsistently depending on the underlying variable distributions (Finch et al., 2018). To investigate this, continuous multivariate normal data were first generated and subsequently categorized into three- and five-category ordinal variables under three distributional conditions: normal, negatively skewed, and positively skewed. This approach mirrors the common theoretical assumption in psychological and educational research that underlying traits are continuously distributed prior to categorization (Flora & Curran, 2004; Kilic & Dogan, 2021). To achieve the desired asymmetric shapes, specific categorization thresholds were applied using the sim. structure() function in the R psych package (Revelle, 2023). Across the 100 replications, empirical skewness values ranged from −4.02 to −2.59 for negatively skewed conditions, −.10 to .10 for normal conditions, and 2.65 to 3.80 for positively skewed conditions, confirming successful distributional manipulation.
The measurement model comprised 18 items distributed asymmetrically across either two-factor or three-factor structures. To represent typical moderate-to-strong CFA scenarios, true factor loadings were specified between .60 and .75. Furthermore, inter-factor correlations were manipulated at two levels: low (r = .4) and high (r = .7), allowing for the assessment of estimation accuracy under varying degrees of factor distinctiveness, and particularly testing the algorithm’s vulnerability to empirical underidentification in multidimensional structures.
A critical element of AQ estimation is the selection of quadrature points, which balances the precision of numerical integration against computational feasibility. While the LISREL 12 user manual (Jöreskog & Sörbom, 2022) suggests 5 to 10 points for models with two to three factors, empirical evidence indicates that increasing the points beyond five yields diminishing returns in accuracy while exponentially increasing estimation time (Bryant & Jöreskog, 2016). Consequently, this study compared the performance of five quadrature points (a computationally efficient lower bound) against eight quadrature points (a robust median value) to systematically assess this trade-off.
In total, the study examined 72 distinct data generation conditions (3 sample sizes × 3 distributions × 2 category levels × 2 model structures × 2 correlation levels). One hundred datasets were generated per condition, resulting in a total of 7,200 unique datasets. To comprehensively evaluate the AQ method, each dataset was then estimated using four distinct link functions (CLL, logit, log-log, and probit) and crossed with the two quadrature levels (Q = 5 and Q = 8). This fully crossed design yielded a grand total of 57,600 estimated models. For all estimation procedures, the maximum number of iterations was set to 200; models failing to reach a solution within this limit were recorded as non-convergent.
To illustrate the skewness visually, the first item with three response categories of the first replication of three different distributions, generated with a sample size of 1,000 for a two-factor model with high correlation between factors is shown in Figure 1. Frequency of observed indicators
Real Data
To illustrate the practical implications of the simulation findings, an empirical application was conducted using real educational data from the Programme for International Student Assessment (PISA) 2018, which measures global educational outcomes. Specifically, data from Turkiye sample were filtered to extract three subscales: student’s interest in learning about other cultures (ST214), cognitive adaptability (ST216), and respect for people from other cultures (ST217). These subscales were selected due to their ordinal nature and their relevance to CFA modeling in large-scale assessments (OECD, 2019). To mirror the simulation conditions, random samples of 200, 500, and 1,000 students were drawn from this dataset. A two-factor measurement model was specified using the ST214 and ST216 subscales, while a three-factor model incorporated all three subscales (ST214, ST216, and ST217). The analyses were initially conducted using the original five-category response format and were subsequently repeated after recoding the responses into three categories to align with the categorical conditions of the simulation study. To create the three-category condition, the extreme response options were combined: “strongly disagree” and “disagree” were merged into a single category, and “strongly agree” and “agree” were merged into another, leaving the neutral midpoint unchanged. This resulted in a three-category ordinal scale with a neutral midpoint.
Finally, to fully replicate the simulation design, these real data models were estimated across the four link functions and crossed with the two quadrature levels (5 and 8 points). Given the positive skewness observed in these subscales, this real-world dataset provides an ideal empirical test to validate the simulation findings regarding the performance of asymmetric link functions (such as log-log) versus symmetric ones (such as probit) under non-normal conditions.
Descriptive statistics of the total scores from the subscales
Note. Fac = Number of factors; Cat = Number of response categories.
The examination of the descriptive statistics in Table 1 reveals that the subscale ST214 exhibits positive skewness, with values ranging from 1.8 to 2.6 in the data sets with three response categories, and from 1.0 to 1.4 in the data sets with five response categories, across both factor models. Similarly, the subscale ST217, used in the three-factor model, also demonstrates positive skewness, with values varying between 1.8 and 2.4 in the three-category data sets, and between 1.1 and 1.4 in the five-category data sets. In contrast, the subscale ST216 shows less deviation from normality, with skewness values ranging from .7 to .9 in the three-category data sets, and from .1 to .4 in the five-category data sets, for both factor models. These findings suggest that the degree of positive skewness tends to increase as the number of response categories decreases, particularly for the subscales ST214 and ST217.
Data Analysis
All CFA models with adaptive quadrature (AQ) estimation, across both the simulation and real data phases, were conducted using LISREL 12 (Jöreskog & Sörbom, 2022). To systematically evaluate the performance of AQ estimation across the simulated conditions and the empirical data, three primary criteria were examined: convergence rates, parameter recovery (for simulated data), and relative model fit. First, convergence rates were calculated for each condition. A model was considered to have successfully converged if the maximum likelihood algorithm reached a stable solution with a positive definite information matrix within the 200-iteration limit. Conditions with zero or extremely low convergence rates indicate severe computational boundaries for the AQ method. Second, parameter recovery was assessed using the root mean square error (RMSE) for the completely standardized factor loadings. The RMSE quantifies the discrepancy between the estimated factor loadings and their true population values specified during data generation. Lower RMSE values indicate higher precision in parameter estimation. RMSE values were averaged across the successfully converged replications within each condition. It is important to note that non-converged solutions and improper solutions were strictly excluded from the computation of the mean RMSE. Finally, the relative fit of the models estimated with different link functions and quadrature points was rigorously evaluated using three fit indices: the −2 Log-likelihood (−2lnL), the Akaike Information Criterion (AIC), and Schwarz’s Criterion (SC). For each valid replication in the simulation and each condition in the real data analysis, the values of −2lnL, AIC, and SC produced by the CLL, logit, log-log, and probit link functions were compared. The link function yielding the lowest values across these indices was identified as providing the optimal model fit. The proportions of selecting the best-fitting model were then aggregated to determine which link function performs optimally under specific distributional shapes, sample sizes, and quadrature configurations.
Findings
Simulation Study
Convergence Rates for Two-Factor Models
Note. Fac = Number of factors; Cor = Inter-factor correlation (low = .40, high = .70); Cat = Number of response categories; Quadrature = Number of quadrature points.
As shown in Tables 2 and 3, the distribution of the latent trait plays an important role in convergence rates. Under a normal distribution, analyses of the two-factor structure converged in all or nearly all conditions across the four link functions. In contrast, for the three-factor structure, convergence rates declined noticeably under conditions characterized by high inter-factor correlations, three response categories, and small sample sizes. Among the link functions, the logit link function appears to be the most affected by these conditions. Increasing the number of quadrature points to eight substantially improved convergence rates for the other link functions, whereas the improvement remained limited for the logit link (a finding consistent with Bryant & Jöreskog, 2016). These results further highlight the role of inter-factor correlation. Contrary to the expectation that highly correlated factors might facilitate estimation by concentrating the multidimensional likelihood, the results indicate that high correlations reduce proper convergence rates in complex AQ models. This pattern suggests that when latent factors are highly correlated, the estimation algorithm may struggle to empirically distinguish the dimensions, leading to an ill-conditioned Hessian matrix and subsequent estimation difficulties (Lei, 2009).
The model complexity and the number of quadrature points interact to form a complex barrier to convergence. Moving from a two-factor to a three-factor structure substantially increases the computational demand of multiple integral calculations, leading to greater instability. For example, under a negative distribution (three-categories, N = 200, five quadrature points, probit), the two-factor model successfully converged in 96% of replications, whereas the three-factor model dropped to 28%. Interestingly, the empirical effect of increasing quadrature points to eight depends strictly on model complexity. In the simpler two-factor models, increasing the grid to eight points proved highly detrimental, substantially reducing convergence rates (e.g., dropping from .95 to .17 for the logit link under low correlation, three-category, negative distribution). Conversely, in the complex three-factor models, increasing the grid to eight quadrature points often appeared to improve proper convergence rates across the link functions (e.g., increasing from .26 to .50 for the CLL link). This reveals the dual nature of numerical integration: an excessively fine grid may reduce estimation efficiency in lower dimensions (Bryant & Jöreskog, 2016), yet it may prove beneficial under higher-dimensional conditions.
Convergence rates are strongly affected by skewed distributions. Negatively skewed distributions generally yielded lower convergence rates than positively skewed distributions, particularly in complex models. In negatively skewed conditions, the log-log and probit links achieved relatively high convergence in two-factor models with five quadrature points, but rates declined substantially in three-factor models, especially with small samples, with probit link being more strongly affected than log-log link. Increasing the sample size and using eight quadrature points improved convergence for probit link, whereas for log-log link higher rates were primarily observed when the sample size reached at least 500.
Under positively skewed distributions, the CLL and probit links performed better than other link functions. In three-factor models, the CLL link achieved the highest convergence rates, provided the sample size was at least 500. Under these conditions, the probit link also performed reliably in larger samples when estimated with eight quadrature points.
An important interaction was observed between the distribution shape and the number of response categories. Under normal distributions, increasing categories from three to five improves convergence, likely because the denser information better approximates a continuous variable (e.g., three-factor, low correlation, normal, N = 200, five quadrature points, logit jumped from .71 with three categories to .98 with five categories). However, when the data were highly skewed (negative or positive), five-category data frequently performed worse than three-category data. This occurs because extreme skewness across a wider five-point scale inevitably creates “empty cells” (zero frequencies) for the extreme response categories, stalling the likelihood estimation.
Convergence Rates for Three-Factor Models
Note. Fac = Number of factors; Cor = Inter-factor correlation (low = .40, high = .70); Cat = Number of response categories; Quadrature = Number of quadrature points.
Mean RMSE for Two-Factor Models
Note. Fac = Number of factors; Cor = Inter-factor correlation (low = 0.40, high = 0.70); Cat = Number of response categories; Quadrature = Number of quadrature points.
As seen in Tables 4 and 5, the accuracy of parameter recovery was highly dependent on the informational density of the datasets, driven jointly by sample size and the number of response categories. As expected, increasing the sample size systematically reduced the RMSE across all conditions by narrowing the estimation variance. Furthermore, utilizing five response categories consistently yielded lower error rates than three categories. Expanding the scale to five categories minimizes information loss and more closely approximates a continuous distribution, allowing the algorithm to estimate parameters with greater precision (e.g., dropping RMSE from .07 to .05 under N = 200 normal conditions).
Regarding link functions and distribution shapes, the probit function demonstrated the most consistent performance under normal distributions. Normal distributions provided the safest estimation environment, yielding highly accurate estimates (RMSE between .02 and .10). Under these optimal conditions, the probit link minimized errors down to .02, whereas the logit link exhibited persistent bias, stalling around .17. However, when data were severely skewed, estimation errors surged dramatically, frequently reaching substantially higher error levels (.40–.70) in complex models. Under these asymmetric conditions, the pattern of parameter recovery shifted: particularly in complex three-factor models, reflecting the paradox identified earlier, the log-log link function yielded the lowest errors under negatively skewed distributions, while the CLL link function proved to be the most resilient under positively skewed distributions.
The interaction between model complexity and inter-factor correlation created the most challenging estimation environments. In two-factor models (Table 4), high correlation often stabilized or slightly reduced the RMSE (due to a concentrated likelihood). Conversely, in three-factor models (Table 5), high correlation combined with skewed data triggered a phenomenon akin to empirical underidentification. Because the estimation process becomes too unstable to clearly differentiate the three highly correlated latent dimensions, it commits substantial errors when assigning specific factor loadings, causing RMSE values to wander erratically in the 0.50–0.70 range (particularly for the logit link, as well as the asymmetric links when applied to their non-aligned skew directions).
Finally, a crucial methodological warning is required when interpreting the effect of quadrature points on parameter accuracy. In the simpler two-factor models (see Table 4), exceeding five quadrature points does not reduce error; instead, it actively induces algorithmic instability and increases the RMSE (e.g., two-factor, low correlation, three-category, negative, and logit, N = 200 increased from .21 at five quadrature points to .34 at eight quadrature points). Conversely, in highly unstable three-factor conditions (see Table 5), the aggregated RMSE for eight quadrature points sometimes appears deceptively lower than that of five quadrature points (e.g., three-factor, low correlation, five-category, negative, and probit, N = 200 shows .64 at Q = 5 vs. .24 at Q = 8). Rather than indicating genuinely superior precision across the board, this pattern is a classic manifestation of survivorship bias. Because the overall proper convergence rate in such extreme conditions is severely constrained, the deceptively low RMSE for eight quadrature points simply reflects a select subset of well-behaved samples or noise-free random datasets that managed to survive the intensive multidimensional integration process.
Mean RMSE for Three-Factor Models
Note. Fac = Number of factors; Cor = Inter-factor correlation (low = .40, high = .70); Cat = Number of response categories; Quadrature = Number of quadrature points.
Tables 6 and 7 presents the proportions of selecting the best-fitting link functions based on the information criterion across the 72 evaluated conditions. The most indisputable finding in Tables 6 and 7 is the absolute superiority of the probit link function. Corroborating its high proper convergence rates and low RMSE values, the probit function almost completely monopolized the best-fit selections under normal and positively skewed distributions. For instance, under the two-factor, high correlation, positive distribution (five categories, N = 1,000, five quadrature points) condition, the probit link achieved a proportion of 1.00, while alternative functions secured virtually zero selections (marked as ---). This overwhelming preference suggests that when ordinal scales provide sufficient granularity (five categories), the assumption of an underlying continuous normal distribution works exceptionally well, allowing the probit link to provide a near-perfect mathematical representation of the data. The informational density of the data—dictated by sample size and category format—played a decisive role in the algorithm’s ability to definitively identify the best model. When information was severely restricted (N = 200, three categories), the estimation environment became highly noisy. In these sparse data scenarios, the algorithm struggled to establish a single dominant model, resulting in selections being thinly distributed across multiple link functions (e.g., three-factor, high correlation, three categories, normal, N = 200, five quadrature points yielded scattered proportions like probit .16, log-log .06, and logit .03). However, when the sample size expanded (N = 1,000) and the category format approximated a continuous scale (five categories), estimation noise vanished, and the algorithm confidently locked onto the probit function.
Tables 6 and 7 also reveals a fascinating divergence between theoretical expectations and practical performance regarding skewed distributions. Theoretically, asymmetric link functions should outperform symmetric ones when data exhibit corresponding skewness: log-log for positive skew and CLL for negative skew (Bryant & Jöreskog, 2016). However, empirical fit results show that the log-log function largely fails to meet this expectation; even under severely positively skewed distributions, the symmetric probit function systematically outperformed the log-log link (e.g., two-factor, high correlation, five-category, positive, N = 1,000, five quadrature points resulted in probit 1.00, log-log ---). In contrast, the CLL function successfully validated theoretical expectations under negatively skewed conditions, but strictly when supported by a rich informational base. For example, in the three-factor, low correlation, negative distribution (five categories, N = 500, five quadrature points) cell, the CLL link secured the best fit with a proportion of .29, substantially surpassing the probit link (.16). This establishes that the asymmetric CLL function can successfully capture the structural properties of negatively skewed data, but only if the dataset provides enough statistical power (N = 500, five categories) to overcome algorithmic noise.
Finally, evaluating the performance of the quadrature points reveals the presence of “ghost columns” under the eight-quadrature specification. In the majority of complex conditions (especially three-factor), the total proportions for eight quadrature rarely sum to 1.00, frequently displaying sequences of dashes (---) across all link functions. Crucially, this absence is not merely due to convergence failures. As established in Tables 2 and 3, eight quadrature points frequently succeeded in achieving proper convergence in complex three-factor models. Instead, these “ghost columns” demonstrate that even when the eight-quadrature specification successfully converges, it fails to provide a superior model fit. Increasing the number of quadrature points is an algorithmic tuning parameter rather than an estimated model parameter; therefore, it does not mathematically increase the complexity penalty terms in the AIC or SC formulas. The eight-quadrature models frequently lose the relative fit comparisons simply because they converge to a higher (worse) −2lnL value. Consequently, while Q = 8 may occasionally act as a computational lifeline for convergence in complex spaces, it fails to yield a sufficient improvement in the log-likelihood, making Q = 5 the more practical and stable choice.
Real Data Results
Proportions of selecting the best-fitting model
Note. Fac = Number of factors; Cor = Inter-factor correlation (low = .40, high = .70); Cat = Number of response categories; Quadrature = Number of quadrature points.
Proportions of selecting the best-fitting model
Note. Fac = Number of factors; Cor = Inter-factor correlation (low = .40, high = .70); Cat = Number of response categories; Quadrature = Number of quadrature points, --- indicates conditions with no properly converged solution.
Table 8 details the model fit indices for the two-factor CFA models applied to the PISA 2018 empirical data. An examination of the convergence patterns and relative fit criteria provides compelling real-world illustration of the estimation dynamics observed in the primary simulation findings. Consistent with the simulation results regarding algorithmic stability, non-convergence was exclusively localized to the smallest sample size condition (N = 200). Specifically, under sparse data conditions (three-category), CLL and log-log functions failed to compute standard errors when using five quadrature points. Similarly, increasing to eight quadrature points triggered a collapse for the CLL function in the five-category condition.
The evaluation of relative model fit provides the most crucial empirical evidence supporting the theoretical interaction between distribution shape and link functions. As previously established in the descriptive statistics, the utilized PISA subscales exhibited substantial positive skewness. Consequently, the probit link—which overwhelmingly dominated the normally distributed simulated data—failed to provide the best fit in any of the real data scenarios. Instead, the log-log and logit functions consistently yielded the lowest AIC and SC values. The log-log function, which is theoretically tailored for positively skewed distributions, secured the best model fit across the larger three-category conditions (N = 500 and N = 1,000) and the small sample five-category condition (e.g., yielding an AIC of 4,432.5 compared to the probit’s 4,484.0 at Cat5, N = 200, Q = 5). In the most extreme sparse condition (Cat3, N = 200) and the larger five-category conditions, the logit link demonstrated superior fit. This empirical transition proves that when researchers encounter substantially skewed ordinal data in large-scale assessments, abandoning the default normality assumption in favor of alternative link functions yields a statistically more accurate representation of the latent structure.
Finally, comparing the performance of the quadrature points in the real data setting definitively resolves the dimensionality debate. Across the evaluated conditions, increasing the quadrature points to eight offered zero empirical reward. In the three-category conditions, utilizing eight quadrature points systematically inflated the AIC values; for example, the optimal log-log fit at N = 1,000 degraded from 13,646.7 using five quadrature points to 13,660.0 using eight quadrature. In the five-category conditions, the differences between five and eight quadrature points were practically indistinguishable. This robustly supports the simulation-based recommendation that exceeding five quadrature points is computationally hazardous and methodologically unnecessary.
The most striking visual and statistical feature of Table 9 is the high prevalence of non-convergence, represented by dashes (---). Transitioning from the two-factor to the three-factor model substantially increased the computational demands and instability of the estimation process. Under sparse data conditions (three-category) and small sample sizes (N = 200), the algorithm routinely failed to achieve proper solutions, particularly for the logit and CLL link functions. This real-world algorithmic failure perfectly mirrors the simulation findings, confirming that estimating complex, multidimensional ordinal models with AQ is highly unstable without a substantial sample size to anchor the likelihood function.
Regarding relative model fit, the empirical data once again punished the symmetric normality assumption. Because the underlying PISA subscales exhibited positive skewness, the logit and log-log link functions consistently outperformed the probit function. In the robust, information-rich conditions (five-category, N = 500 and N = 1,000), the logit link provided the absolute best model fit. For example, in the Cat5, N = 1,000 condition, the logit function (AIC = 32,166.9) demonstrated clear superiority over the probit function (AIC = 32,607.4). This consistency across both two-factor and three-factor empirical applications solidifies the necessity of matching the link function to the inherent asymmetry of the ordinal indicators.
−2lnL, AIC, and SC for two-factor model in real data sets
Note. Cat = Number of response categories; Quadrature = Number of quadrature points, --- indicates non-converged solution.
−2lnL, AIC, and SC for three-factor model in real data sets
Note. Cat = Number of response categories; Quadrature = Number of quadrature points, --- indicates non-converged solution.
Summary and Practical Implications
This study investigated the performance of four different link functions within the Adaptive Quadrature (AQ) estimation method under a variety of simulation and empirical conditions. The analyses included combinations of three sample sizes, three distribution types, two levels of response categories, two-factor structures (two-factor and three-factor models), and two levels of factor correlation, and the number of quadrature points. By rigorously filtering out improper solutions to ensure mathematically valid comparisons, the performance of the AQ method was evaluated across these diverse conditions based on proper convergence rates, parameter recovery accuracy, and relative model fit.
Findings from the simulation study robustly establish that the probit link function acts as the optimal default for maximizing proper convergence and accurately recovering standardized factor loadings across the majority of standard conditions. When improper solutions were strictly filtered, the probit function consistently achieved the highest proper convergence rates, the lowest parameter recovery error, and the best relative model fit. This dominance was particularly pronounced in normally distributed datasets and positively skewed conditions with adequate sample sizes. In stark contrast, the logit link function exhibited severe vulnerabilities, consistently demonstrating the lowest convergence rates and introducing substantial bias into factor loading estimates, especially in more complex models or under skewed distributions.
A critical theoretical evaluation emerges regarding the interaction between distribution shape and asymmetric link functions. Theoretically, the log-log function aligns with positive skewness, while the CLL function is suited for negatively skewed data (Bryant & Jöreskog, 2016). However, our results reveal a fascinating divergence between global model fit and parameter recovery. In the empirical application and relative fit evaluations, the theoretical alignment held true: the log-log link provided the best fit for positively skewed data. Conversely, regarding algorithmic stability and parameter accuracy in the simulation, the pattern reversed: the log-log function yielded lower errors under negative skew, while CLL was more stable under positive skew. The simulation results further suggest that skewed ordinal response distributions pose substantial challenges for likelihood-based estimation procedures. Because ordinal outcomes were generated by shifting thresholds on an underlying continuous latent variable, extreme skewness effectively pushes category thresholds toward the tails of the latent distribution. In such situations, the likelihood surface becomes more irregular, increasing the difficulty of numerical integration and parameter optimization. This mechanism may explain why asymmetric link functions sometimes improved model fit while simultaneously producing less accurate parameter recovery in certain conditions. These findings highlight an important distinction between global model fit and parameter accuracy. While a particular link function may provide a better approximation to the empirical response distribution, it does not necessarily guarantee improved recovery of the underlying latent parameters. Researchers should therefore evaluate both aspects when assessing model adequacy.
A notable empirical trade-off was also observed regarding the number of response categories under conditions of severe non-normality. While expanding the measurement scale from three to five categories systematically improved parameter recovery (yielding lower RMSE values) by providing greater informational density, it simultaneously degraded proper convergence rates in skewed distributions. This paradox occurs because imposing extreme skewness across a wider five-point scale inevitably generates “empty cells” (zero frequencies at the extreme tails), which severely destabilizes the full-information likelihood estimation. However, for the specific subset of datasets that successfully bypassed this sparseness issue and achieved convergence, the continuous-like nature of the five-point scale allowed the algorithm to estimate factor loadings with significantly greater precision. Consequently, applied researchers face a critical “convergence versus precision” dilemma: while three-category scales offer algorithmic stability under non-normality, five-category scales are required to minimize parameter estimation error, provided the sample size is large enough to prevent empty cells.
A major contribution of this study concerns the impact of latent dimensionality on the stability of AQ estimation. Increasing model complexity from two-factor to three-factor structures substantially reduced convergence rates across many conditions. This pattern is consistent with the well-known exponential increase in computational demand, in which the burden of multidimensional numerical integration increases rapidly as the number of latent variables grows. Higher correlations among latent factors further exacerbated this problem. When factors were strongly correlated, the estimation algorithm had greater difficulty distinguishing between latent dimensions, which occasionally resulted in non–positive definite information matrices and improper solutions. These findings emphasize that dimensionality and factor correlation jointly influence the practical feasibility of AQ-based estimation.
The results also clarify the role of the numerical integration grid in multidimensional ordinal CFA. Increasing the number of quadrature points from five to eight produced different effects depending on model complexity. For relatively simple two-factor models, increasing the grid density provided little improvement in parameter recovery or model fit. In several conditions, the larger grid slightly reduced convergence rates, suggesting that a five-point grid is generally sufficient for models with moderate dimensionality. In contrast, under the more demanding three-factor simulation conditions, the five quadrature points occasionally proved insufficient to approximate the likelihood surface accurately. In these cases, increasing the grid to eight quadrature points mathematically forced higher proper convergence rates for several link functions. However, as established earlier, this apparent stability is often an artifact of survivorship bias. While denser integration grids may theoretically assist convergence in highly parameterized simulated spaces, they do not guarantee general improvements in estimation accuracy and carry substantial risks in real-world applications.
Nevertheless, the empirical analysis using PISA data revealed an important practical limitation. When applied to real data characterized by sampling variability and sparse response patterns, the eight-point grid sometimes introduced numerical instability rather than improving estimation. This finding indicates that while larger quadrature grids may enhance numerical accuracy in idealized simulation environments, their practical utility depends on data quality, sample size, and response distribution characteristics. Taken together, the results suggest a conditional strategy for selecting quadrature points. For typical ordinal CFA applications with moderate dimensionality, five quadrature points appear to provide a reasonable balance between computational stability and estimation accuracy. Larger points may be considered for higher-dimensional models, but researchers should evaluate convergence behavior carefully before adopting them as default settings.
Limitations and Future Directions
While this study provides robust empirical and simulated evidence regarding the performance of adaptive quadrature (AQ) estimation, several limitations should be acknowledged to contextualize the findings and guide future research. First, the simulation design was restricted to two-factor and three-factor measurement models. Although the computational burden of multidimensional integration was already highly evident in the three-factor models, educational and psychological assessments frequently employ even more complex structures. Given our pivotal finding that fewer quadrature points provide superior stability and fit compared to eight quadrature points, future research should investigate whether even fewer quadrature points (e.g., 3 or 4) can maintain parameter accuracy in highly complex, higher-dimensional models without triggering algorithmic failure.
Second, this study exclusively focused on the AQ method and did not directly compare its performance against standard Gauss-Hermite (GH) quadrature or limited-information methods like Diagonally Weighted Least Squares (DWLS). This decision was driven by the software environment; modern structural equation modeling software (such as LISREL 12) heavily prioritizes or exclusively offers AQ for full-information maximum likelihood estimation with ordinal data, as standard GH is theoretically known to be inefficient for multidimensional models. Nonetheless, future studies could conduct cross-software simulations to benchmark the performance of various link functions within AQ against DWLS, particularly under severe asymmetry. Furthermore, the algorithmic failure observed under the denser eight quadrature points must be interpreted within the context of LISREL’s specific optimization routines.
Third, the distributional shapes in the simulation were manipulated by categorizing continuous normal variables to achieve specific states of normality, positive skewness, and negative skewness. While this successfully mirrored real-world non-normality, the exact degree or magnitude of the skewness was not treated as a continuously varying factor. Future simulation studies could systematically manipulate the threshold parameters to generate varying gradients of skewness and kurtosis, testing the exact tipping points where the probit function loses its advantage to the complimentary log-log or log-log functions.
Fourth, the evaluation of parameter recovery in this study was deliberately restricted to completely standardized factor loadings, as these are typically the primary parameters of interest for applied researchers establishing construct validity. Furthermore, because link functions utilize fundamentally different underlying variance scales, directly comparing completely standardized factor loadings via RMSE may partially confound true estimation error with inherent scale-definition differences. Consequently, the performance of the link functions regarding the recovery of inter-factor correlations, and the precision of standard errors were not explicitly evaluated. Because the choice of link function directly alters the category-probability functions, future research should expand this evaluation to encompass these specific parameters and inferential structures to provide a more holistic assessment of link function performance.
Footnotes
Ethical Considerations
No approval of research ethics committees was required to accomplish the goals of this study as it utilized publicly available data and simulation data, which did not involve human subjects or sensitive information.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
