Abstract
Accurately predicting entrepreneurial behaviour is critical for supporting innovation and designing effective entrepreneurship policies. This study introduces a novel modelling framework that integrates latent psychological constructs derived from Exploratory Structural Equation Modelling (ESEM) with multiple machine learning algorithms (MLAs) to predict entrepreneurial intention (EI) and outcome behaviour (EOB) using the 2019 Australian Global Entrepreneurship Monitor (GEM) dataset. The predictor set includes socio-demographic variables, regional classifications, and latent attitudinal and perceptual factors derived from GEM’s behavioural survey items. Seven MLAs were evaluated and compared to a standard logistic regression model. Among them, Multivariate Adaptive Regression Splines (MARS) achieved the highest classification accuracy, with sensitivities of around 83% to 84% for both EI and EOB, and outperformed traditional logistic models, other tree-based models, and newer neural network approaches in terms of sensitivity, interpretability, and parsimony. MARS identified self-efficacy (or internal perceived behavioural control), attitude, external perceived behavioural control, and age as the most important predictors of EI, while work status, age, self-efficacy, and intention were key predictors of EOB. These findings demonstrate that combining latent variable modelling with MLAs can produce robust, theory-informed predictive structures. While the Theory of Planned Behaviour (TPB) provided a useful guide for variable selection and latent construct identification, the results suggest that bounded rationality offers a promising lens for understanding entrepreneurial behaviour – emphasising heuristic decision-making and the limits of rational-choice models in entrepreneurial contexts.
Beyond technical accuracy, the results illustrate how predictive modelling can both refine theoretical understanding and offer practical tools for designing more targeted entrepreneurship support.
Plain Language Summary
Understanding and predicting who intends to start a business, and who actually follows through, is important for encouraging innovation and creating effective entrepreneurship policies. In this study, we examined survey data from the 2019 Australian Global Entrepreneurship Monitor, which includes information on individuals’ backgrounds, regional locality, attitudes, and perceptions regarding entrepreneurship. Instead of relying solely on traditional statistical methods, we employed modern computer-based techniques that are more effective at identifying patterns in complex data. We found that just a few factors play a big role in shaping entrepreneurial outcomes. People who feel confident in their abilities, hold positive attitudes toward entrepreneurship, and know other entrepreneurs are much more likely to intend to start a business. When it comes to actually starting one, employment status, age, confidence, and intention are the strongest predictors of success. Our best-performing method achieved over 80% accuracy in predicting entrepreneurial intentions and outcomes. The results show that entrepreneurial behaviour often follows simple and repeatable patterns, rather than fully rational step-by-step reasoning. This means that people may make decisions based on quick judgments or rules of thumb, rather than engaging in complex deliberation. For researchers, this suggests new ways of linking behavioural theories with data-driven approaches. For educators and policymakers, the findings can inform the design of more targeted entrepreneurship support programs by identifying the traits and circumstances most likely to influence whether someone becomes an entrepreneur.
Introduction
Predicting entrepreneurial behaviour remains a major challenge in entrepreneurship research, with significant implications for innovation, employment, and economic development. Entrepreneurial intention (EI) – a person’s stated commitment to engaging in entrepreneurial activity – has long been recognised as a key antecedent of entrepreneurial outcome behaviour (EOB) (Kautonen et al., 2015; Lortie & Castogiovanni, 2015). However, accurately predicting who translates intention into action remains difficult, making it essential to explore alternative modelling strategies capable of capturing these behavioural transitions (Schade & Schuhmacher, 2023).
Traditional models, such as logistic and linear regressions, as well as structural equation modelling, have been central in examining hypothesised causal and associative relationships in EI and EOB. While these approaches have advanced theory development, their utility for accurate behavioural prediction is more limited (Breiman, 2001). These methods also face known limitations: they assume linearity, are sensitive to multicollinearity, and often underperform when relationships are non-linear, involve complex interactions, or when outcome categories are rare (Berk, 2017; Breiman, 2001; Hastie et al., 2017). In much social science research, including entrepreneurship studies, predictive evaluations often rely on overall classification accuracy while overlooking severe class imbalance. Indeed, while these limitations are well recognised in traditional approaches, even recent applications of machine learning algorithms (MLAs) to GEM data have struggled with extreme imbalance, reporting sensitivities as low as 1% to 10% (Schade & Schuhmacher, 2023).
While these early applications highlight limitations, they also reflect a broader trend: management and entrepreneurship research has increasingly adopted machine learning (ML) methods for predictive analysis. These offer clear advantages for modelling complex, non-linear relationships, but challenges remain. Many applications continue to exhibit low sensitivity when predicting rare but important outcomes, prioritising overall predictive performance (overall accuracy) over parsimony, interpretability, and generalisability. As a result, opportunities remain to discover predictive structures for EI and EOB that are both accurate and theoretically meaningful – parsimonious models with considerable academic and practical value for researchers, managers, and policymakers.
This study addresses these gaps by integrating latent psychological constructs derived via Exploratory Structural Equation Modelling (ESEM) with socio-demographic and regional predictors to model both EI and EOB. Using a nationally representative 2019 Australian Global Entrepreneurship Monitor (GEM) survey (n = 10,143), we apply and compare seven MLAs alongside logistic regression. Each model is systematically evaluated across multiple performance metrics, including sensitivity, specificity, precision, F1 score, and ROC-AUC. By combining theory-grounded constructs with data-driven methods, the study advances a predictive framework that enhances sensitivity while remaining closely aligned with entrepreneurship theory.
Beyond methodological refinements, this study also addresses broader questions about how knowledge is constructed and the methods of scientific inquiry in management and entrepreneurship research. Predictive modelling is not only a technical exercise but also part of an ongoing debate about how best to generate, validate, and apply knowledge in the social sciences (Brady et al., 2010; Breiman, 2001; Freedman, 2010). Framing our approach within this wider conversation allows us to situate methodological advances alongside deeper reflections on the assumptions and pathways that shape entrepreneurship research. For example, Breiman (2001) argued that focusing on predictive accuracy can uncover structures in data that traditional statistical models may miss. Aligned with this view, Freedman (2009, 2010) cautioned that causal claims from observational data often rest on fragile assumptions, underscoring the importance of grounding inquiry in empirical observations and regularities. Simon (1996) offered a complementary behavioural perspective, showing that human decision-making is bounded: individuals rarely follow fully rational, deliberative processes, but instead rely on satisficing strategies and cognitive shortcuts. Such bounded or limited decision-making may be revealed through parsimonious predictive structures derived from observational data.
Collectively, these perspectives suggest that entrepreneurship research can benefit from approaches that integrate theory with prediction, using the latter as a filter for identifying constructs that are empirically robust and practically meaningful. In this study, we put this perspective into practice by testing TPB-derived factors such as attitudes, norms, and perceived behavioural control within ML frameworks. This approach shows how prediction can refocus, rather than replace, theory, suggesting that entrepreneurial behaviour may align more closely with heuristic perspectives – bounded rationality being one influential example – when models yield parsimonious and generalisable structures that reflect satisficing and decision-making under constraints, rather than fully rational deliberation.
This study makes three primary contributions: (1) demonstrating that machine learning approaches substantially outperform traditional statistical models in predicting EI and EOB; (2) showing that parsimonious models based on latent constructs can yield interpretable and practical insights; and (3) offering a replicable framework for integrating theory-grounded constructs with machine learning in entrepreneurship prediction. Together, these contributions not only strengthen methodological practice but also refine theoretical understanding by revealing which constructs remain empirically robust when tested in predictive models. When predictive models reveal parsimonious and generalisable structures, their outcomes can be meaningfully interpreted through established theoretical lenses – TPB as a guide for variable selection, and bounded rationality as a promising lens for explaining the behavioural patterns observed. This ensures that data prediction approaches serve as a bridge between data and theory, advancing both empirical modelling and conceptual insight in entrepreneurship research.
Theoretical Background and Research Context
Entrepreneurial Intention and Outcome Behaviour
This study focuses on two interrelated constructs central to entrepreneurship research: EI and EOB. EI refers to a person’s intention or commitment to start a business, while EOB captures the actual act of engaging in entrepreneurial activity, such as launching a venture. EI is widely considered a strong predictor of EOB, providing a foundation for models that seek to explain entrepreneurial behaviour (Ajzen & Kruglanski, 2019; Armitage & Conner, 2001; Fishbein & Ajzen, 2010; Lortie & Castogiovanni, 2015; Schlaegel & Koenig, 2014; Zaremohzzabieh et al., 2019).
Several conceptual models have been proposed to explain the relationship between EI and EOB and their antecedents, including the Theory of Planned Behaviour (TPB) (Zaremohzzabieh et al., 2019) and the Entrepreneurial Event Model (EEM) (Krueger et al., 2000). Others include a modified TPB model in which mediating and moderating items serve as antecedents to the attitude, subjective norms, and perceived behavioural control constructs (Lortie & Castogiovanni, 2015), social cognitive theory (Graham & Bonner, 2022), action theory (Frese & Gielnik, 2023), and a synthesised model using TPB and EEM constructs (Zaremohzzabieh et al., 2019).
The TPB stands out as the most empirically robust and widely applied model in the entrepreneurship literature (Kautonen et al., 2015). It is the initial theoretical framework used in this study for three reasons. First, it dominates social science research, having been applied in more than 2,000 empirical studies (Ajzen & Kruglanski, 2019; Schlaegel & Koenig, 2014; Zaremohzzabieh et al., 2019). Second, it has proved robust in predicting EI and EOB in both cross-sectional and longitudinal datasets (Kautonen et al., 2015). Third, the entrepreneurial attributes of attitudes, perceived behavioural control, EI, and EOB, as measured by the observed items within the GEM datasets, align more closely with the TPB constructs than with those of any other behavioural model. The TPB therefore guided the selection of GEM variables for the construction of latent constructs and their subsequent inclusion in the machine learning predictive models.
Theory of Planned Behaviour
The TPB proposes that intention is the immediate antecedent of behaviour, influenced by three cognitive components: (1) attitudes toward the behaviour, (2) perceived social norms, and (3) perceived behavioural control (PBC), a concept closely related to self-efficacy (Fishbein & Ajzen, 2010). The TPB assumes that individuals act on intentions when they have the motivation and the ability to do so (Fishbein & Ajzen, 2010). When people have positive attitudes, feel social support, believe they have the skills and resources to act, and have positive intentions, they are more likely to engage in a specific behaviour. For a detailed account of TPB components and their operational definitions, see Fishbein and Ajzen (2010).
In this study, TPB variables do more than explain behaviour: they are also used as inputs into ESEM and machine learning models. In this way, theory contributes not only to latent factor development but also to predictive model construction and evaluation. This reframing acknowledges the limits of causal inference in observational data (Brady et al., 2010; Freedman, 2009), while recognising that predictive structures can reveal which TPB elements remain empirically robust.
Global Entrepreneurship Monitor Data (GEM)
This study used the 2019 Australian Global Entrepreneurship Monitor (GEM) dataset, which includes 10,143 respondents from all states and territories. This large dataset is unique for Australian GEM research, as previous annual GEM surveys included only around 2,000 respondents. Additional regional delineators capture state or territory of residence, city versus non-city location, and a nineteen-category regional variable for Queensland respondents (Renando & Moyle, 2021).
Model Regionality Effects with GEM Data
Entrepreneurial activity is influenced not only by individual-level traits and intentions but also by regional characteristics. However, most studies examining EI and EOB focus on cross-national comparisons, with little attention to intra-national regional heterogeneity (Bosma & Schutjens, 2010; Kamal & Daoud, 2020; Morales & Velilla, 2021; Virasa et al., 2022; Vodă et al., 2020). Many of these studies have demonstrated that substantial and persistent differences exist in entrepreneurial activity and attitudes between nations. Two studies, however, assessed regional differences in EI or EOB within a country: Bosma et al. (2009) studied the effects of firm entry and exit on regional competitiveness across 40 regions in Holland between 1988 and 2002, and Alvarez et al. (2011) used GEM data from 2006 to 2009 to examine the formal and informal factors that influence entrepreneurship in 19 Spanish regions.
This study contributes to the regional literature by examining regional differences in entrepreneurial predictors across Australian states, cities/urban areas, and sub-regions within Queensland, using the 2019 Australian GEM dataset. This micro-level regional analysis highlights whether contextual factors influence entrepreneurship, offering valuable insights for place-based policy interventions.
Structural Equation Modelling
We employed the Exploratory Structural Equation Modelling (ESEM) procedure to identify the latent constructs underlying GEM’s attitudinal and perceptual variables. The ESEM procedure developed by Asparouhov and Muthén (2009) integrates the best features of both exploratory and confirmatory factor analyses (EFA and CFA), providing confirmatory tests of an a priori factor structure, while also serving as an effective algorithm for exploratory factor analysis (Asparouhov & Muthén, 2009; Marsh et al., 2009, 2014). Unlike traditional CFA, which imposes rigid assumptions about factor loadings, ESEM allows item cross-loadings and better accommodates the multidimensionality of attitudinal data, a common feature in social science surveys. This flexibility improves both the fit and interpretability of latent constructs (Marsh et al., 2009, 2014; Morin & Maïano, 2011). ESEM also provides access to standard goodness-of-fit indices and parameter estimates (e.g., alternative fit indices, standard errors, factor loadings, and variances) to guide model selection and refinement. The resultant latent factors derived from the ESEM procedure were pivotal predictors in the machine learning models for EI and EOB prediction.
Although ESEM is increasingly applied in psychology and education research (Marsh et al., 2009; Tsigilis et al., 2018), its use in entrepreneurship and management studies remains novel. Here, it offered two distinctive advantages: first, enabling a robust examination of the factor structure underlying GEM’s attitudinal and perceptual items; and second, providing a systematic means of reducing data dimensionality before subsequent machine learning modelling. In this way, ESEM not only served as a methodological bridge between traditional factor analysis and predictive modelling but also introduced an innovative contribution to entrepreneurship research.
Machine Learning Algorithms
Model Parsimony
Parsimony – the principle of achieving a desired outcome with the fewest predictors possible – plays an important role in algorithmic model development (Clarke et al., 2009; Hastie et al., 2017; Kuhn & Johnson, 2016). In predictive modelling, parsimony reduces complexity without sacrificing predictive power. This eases implementation burdens for practitioners, enhances understanding for policymakers and researchers, and improves the generalisability of predictive models to unseen datasets. While not widely discussed in management literature, parsimony aligns with a long-standing statistical principle – Occam’s Razor – and is crucial when seeking actionable and generalisable algorithmic models. In this study, we assess both predictive accuracy and parsimony to strike an optimal balance between performance, interpretability and generalisability (Berk, 2017; Breiman, 2001; James et al., 2023).
Machine Learning: General Principles and Use in This Study
Machine learning (also known as statistical learning, data mining or algorithmic modelling) plays a central role in predictive analytics across a range of disciplines, including medicine, science, environmental studies, finance, and social science (Berk, 2017; Breiman, 2001; Hastie et al., 2017). MLAs learn from data to build predictive models that identify the relationship between a set of input variables (X) and an outcome variable (Y). For binary classification problems such as predicting EI and EOB, MLAs aim to maximise predictive accuracy and to reveal the underlying predictive structure, that is, which variables underpin and characterise the problem at hand. Sections 1, 2 and 3 of the Supplemental Materials outline MLAs in detail, their distinctions from stochastic data models and their modelling advantages.
Seven MLAs were used in this study: Classification and Regression Trees (CART), Multivariate Adaptive Regression Splines (MARS), Random Forests (RF), TreeNet (Stochastic Gradient Boosting), Support Vector Machines (SVM), and two types of neural networks – Probabilistic Neural Network (PNN) and Radial Basis Function Neural Network (RBN). A binary logistic regression model was also run, with its results compared to the machine learning outcomes. In all the modelling procedures, the GEM dataset was partitioned into training and test subsets; models were optimised, and their performance was compared using multiple accuracy metrics – sensitivity, specificity, precision, F1 score, Youden’s J statistic, and ROC metrics. Each algorithm produced an importance score for each predictor, and the sum and median of these scores across the seven algorithms were calculated. These results identified the most accurate algorithm and the most important predictors for the 2019 Australian GEM dataset.
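The aggregation of predictor importance across algorithms can be illustrated with a minimal sketch. It assumes that each algorithm reports importance on a common 0 to 100 scale and that a predictor not used by an algorithm scores zero; the scores, the predictor labels, and the function name `aggregate_importance` are hypothetical and are not taken from the study's results.

```python
from statistics import median

# Hypothetical importance scores (0-100) per algorithm; illustrative values only,
# not results from the study.
importance = {
    "CART": {"self_efficacy": 100.0, "attitude": 60.0, "age": 40.0},
    "MARS": {"self_efficacy": 100.0, "attitude": 70.0, "age": 30.0},
    "RF": {"self_efficacy": 90.0, "attitude": 80.0, "age": 50.0},
}


def aggregate_importance(scores_by_model):
    """Return {predictor: (sum, median)} of importance scores across models.

    Assumption: a predictor missing from a model's output contributes 0.
    """
    predictors = sorted({p for scores in scores_by_model.values() for p in scores})
    summary = {}
    for predictor in predictors:
        values = [scores.get(predictor, 0.0) for scores in scores_by_model.values()]
        summary[predictor] = (sum(values), median(values))
    return summary


summary = aggregate_importance(importance)

# Rank predictors by summed importance, highest first.
ranked = sorted(summary, key=lambda p: summary[p][0], reverse=True)
for predictor in ranked:
    total, mid = summary[predictor]
    print(f"{predictor}: sum={total:.1f}, median={mid:.1f}")
```

Reporting the median alongside the sum limits the influence of any single algorithm that assigns an unusually high or low score to a predictor.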
While this study draws on MLAs to advance prediction in EI and EOB research, it is important first to situate our study within the broader methodological landscape. Prior studies have applied both traditional statistical models and more recent MLA techniques, each with distinct contributions and limitations. The following section reviews this literature to clarify the gaps our study addresses.
Prior Approaches to Modelling Entrepreneurial Intentions and Behaviour – Traditional Approaches
Much of the early research on EI and EOB relied on traditional statistical models such as ANOVA, linear regression, logistic regression, and multilevel regression models (Ali & Jabeen, 2020; Aloulou, 2016; Alvarez et al., 2011; Anderson, 2023; Brändle et al., 2018; Davidsson, 1995; Lazarczyk-Bilal & Glinka, 2020; Nguyen et al., 2019; Souitaris et al., 2007; Virasa et al., 2022; Vodă et al., 2020) and seemingly unrelated regressions (Douglas, 2013). Others used latent variable methods, including partial least squares (Liñán & Chen, 2009; Martínez-González et al., 2019, 2022; Šebjan et al., 2016; Shinnar et al., 2012) and structural equation modelling with or without exploratory/confirmatory factor analysis (Botsaris & Vamvaka, 2014; Kautonen et al., 2015; Trivedi, 2017; Vamvaka et al., 2020). These models were instrumental in establishing TPB as the dominant framework for EI and EOB research, yielding valuable measures of attitudinal and perceptual drivers. However, the emphasis of these studies was typically on explanatory fit (e.g., goodness-of-fit, regression or path coefficients, and R² statistics) rather than on predictive performance, although these measures have been touted as indices of predictive accuracy. Few studies reported the extent to which respondents were correctly classified as having a positive or negative response to EI or EOB, limiting practical insights for policy and intervention.
Machine Learning Algorithms
More recently, a growing number of publications in the entrepreneurship literature have applied MLAs to predict EI or EOB (Aloulou, 2016; Brixiova, 2011; Chung, 2023; Cinar et al., 2019; Graham & Bonner, 2022; Koumbarakis & Volery, 2022; Liu et al., 2023; Martínez-González et al., 2022; Montebruno et al., 2020; Zulkefly et al., 2021). Cinar et al. (2019) used CHAID regression trees with GEM data to compare entrepreneurial attitudes and outcomes across Mediterranean and North African countries. Chung (2023) benchmarked tree boosting and neural networks against logistic regression in predicting early-stage entrepreneurial activity – frequently using the GEM outcome variable called TEA – showing higher predictive accuracy for MLAs. Schade and Schuhmacher (2023) employed tree-based, naïve Bayes, deep learning neural network, and nearest-neighbour methods to predict opportunity- and necessity-motivated entrepreneurial activity across more than 1 million GEM respondents from 99 countries. Graham and Bonner (2022) combined GEM perception and attitudinal items with socio-demographic variables to predict TEA using merged GEM datasets from 2015 to 2018, resulting in accurate predictive models. Montebruno et al. (2020) compared ten MLAs with logistic regression on British historical data, reporting a 25% increase in predictive accuracy for MLAs. In addition, Liu et al. (2023) evaluated multiple MLAs on the Chinese General Social Survey, while Koumbarakis and Volery (2022) and Zulkefly et al. (2021) explored MLA approaches for predicting firm birth, emergence and abandonment, and entrepreneurs’ social impacts.
Across these diverse applications, a consistent theme emerges – MLAs demonstrate strong potential to improve classification accuracy while offering modelling flexibility and versatility. However, important limitations remain, particularly regarding replication, generalisability, and the handling of class imbalance and missing data. Addressing these challenges requires integrating MLA approaches with theory-informed constructs and robust modelling strategies – an approach this study advances using the 2019 Australian GEM dataset. To frame this contribution, the key methodological limitations evident in prior studies are outlined below.
Key Methodological Limitations
Three recurring issues stand out across current machine learning studies in entrepreneurship. First, sensitivity – the proportion of positive cases correctly identified – has rarely been used as the primary metric, even though it is the most relevant for binary classification tasks such as EI and EOB (Hastie et al., 2017). Accuracy measures based on overall fit can mask poor performance in detecting the entrepreneurial group of interest. Second, many MLA studies have relied on the Synthetic Minority Oversampling Technique (SMOTE) or related resampling methods to address class imbalance. While widely adopted, these techniques risk distorting the data distribution and increasing class overlap (Chawla et al., 2002; Fernandez et al., 2018; He & Garcia, 2009). By contrast, algorithmic priors embedded in CART, MARS, Random Forests, and Gradient Boosting offer a mathematically grounded solution that reduces bias towards majority classes (Berk, 2017; Hastie et al., 2017; Sherrod, 2003–2009). Third, strategies for handling missing data are inconsistent: many studies use case deletion or median imputation, whereas CART’s surrogate splits provide a more robust and less biased approach by substituting alternative predictors when data are missing (Hastie et al., 2017).
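A small worked example may help show how overall accuracy can mask low sensitivity under class imbalance. The confusion-matrix counts below are hypothetical (1,000 respondents, of whom 100 are positive cases) and the function name `classification_metrics` is illustrative; the formulas are the standard definitions of the metrics named in this study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)          # share of actual positives detected
    specificity = tn / (tn + fp)          # share of actual negatives detected
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "youden_j": sensitivity + specificity - 1,
    }


# Hypothetical imbalanced case: 100 positives, 900 negatives,
# and a model that detects only 10 of the 100 positives.
metrics = classification_metrics(tp=10, fn=90, fp=5, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this constructed case overall accuracy exceeds 90% even though only one in ten positive cases is detected, which is why sensitivity, F1 and Youden's J are more informative than accuracy alone when the outcome of interest is rare.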
This study directly addresses these gaps. We prioritise sensitivity alongside specificity, F1, Youden’s J, and ROC scores. We apply algorithmic priors instead of resampling to manage imbalance and employ CART surrogate splits with median substitution only when surrogates are unavailable. Beyond these methodological contributions, our design also advances theoretical integration. GEM’s attitudinal and perceptual variables aligned with TPB were analysed using ESEM, and the resulting latent constructs, together with socio-demographic and regional variables, provided TPB-guided predictors for machine learning models. In this way, our study applies both statistical and MLA methods. These design choices reflect a broader orientation: using prediction as an analytical filter to identify stable, repeatable patterns before turning to theory for explanation. This perspective prepares the ground for the integrated framework presented in the next section.
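The effect of class priors on tree-based classification can be sketched as follows. The sketch uses the classical CART formulation, in which the joint probability of class j and node t is estimated as the prior for class j multiplied by the share of class j cases falling in the node. The counts are hypothetical, and the use of equal priors is an assumption made for illustration; the prior settings actually applied in the study are not reproduced here.

```python
def node_class_probabilities(node_counts, class_totals, priors):
    """Estimate p(class | node) under given class priors (classical CART formulation).

    node_counts:  cases of each class inside the node
    class_totals: cases of each class in the whole training sample
    priors:       assumed prior probability of each class (should sum to 1)
    """
    joint = {c: priors[c] * node_counts[c] / class_totals[c] for c in class_totals}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}


# Hypothetical training sample: 100 positive (1) and 900 negative (0) cases.
class_totals = {1: 100, 0: 900}
# Hypothetical terminal node holding 30 positives and 60 negatives.
node_counts = {1: 30, 0: 60}

# Priors taken from the data reproduce the raw node proportions.
data_priors = {1: 0.1, 0: 0.9}
p_data = node_class_probabilities(node_counts, class_totals, data_priors)

# Equal priors (an assumption for this sketch) up-weight the rare class.
equal_priors = {1: 0.5, 0: 0.5}
p_equal = node_class_probabilities(node_counts, class_totals, equal_priors)

label_data = max(p_data, key=p_data.get)
label_equal = max(p_equal, key=p_equal.get)
print(f"data priors:  p(1|node)={p_data[1]:.3f} -> class {label_data}")
print(f"equal priors: p(1|node)={p_equal[1]:.3f} -> class {label_equal}")
```

With data priors the node is labelled negative because positives make up only a third of its cases; with equal priors the same node is labelled positive because it holds 30% of all positives but under 7% of all negatives. No observations are duplicated or synthesised, which is the key contrast with resampling methods such as SMOTE.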
Overview of the Study’s Processes
As illustrated in Figure 1, this study integrates TPB variables and ESEM-derived latent constructs with observational data to predict EI and EOB, while also accounting for regional and demographic variations. Latent constructs derived from ESEM are combined with these contextual variables and used as inputs into multiple machine learning algorithms. This integrated framework provides a rigorous yet practical approach, demonstrating how TPB-guided variables, latent psychological constructs, and modern predictive methods can be combined to identify the key drivers of EI and EOB, and generate actionable insights for researchers and decision-makers.

Figure 1. Conceptual and analytical framework for predicting entrepreneurial outcome behaviour.
Alternative Pathways of Scientific Inquiry – Data Set and Causal Process Observations
A common form of scientific inquiry in the social sciences involves analysing datasets with statistical models framed by formal hypotheses or research questions, a process referred to as Data Set Observations (DSO) (Brady et al., 2010; Freedman, 2010). Such models often rely on far-reaching model assumptions that generate wide-ranging causal claims (Freedman, 2009). Although DSO is endemic in management research, it is not the only legitimate pathway of scientific inquiry. Another form, often overlooked in this field but highly productive in others, is Causal Process Observation (CPO).
Many transformative scientific discoveries have originated from CPOs. In epidemiology, smoking–cancer associations were recognised through observed patterns long before causal mechanisms were understood. In astronomy, centuries of planetary observation preceded Newton’s theory of gravity. In medicine, breakthroughs such as penicillin, germ theory, the discovery of the cause of cholera and many other human diseases, the identification of the bacterium Helicobacter pylori as a cause of stomach ulcers, and the roles played by hormones in body regulation emerged from careful observations and pattern recognition. Even in economics and political science, empirical regularities have often led theory, not the reverse. These examples illustrate that prediction and observation are not inferior to hypothesis testing, but represent an alternative – a historically rich and fertile mode of scientific inquiry and progress.
CPO-driven discovery often begins with anomalies, patterns, or serendipitous observations (Freedman, 2010). It requires immersion in subject matter, openness to unexpected findings, and readiness to challenge long-held assumptions. Progress comes from discarding ineffective ideas, developing better ones, and evaluating them without bias. This study, like many others in the CPO tradition, begins with observational data and the goal of identifying predictive structures – which factors drive EI and EOB, and whether regional and socio-demographic variables shape these patterns. In this study, machine learning models serve as an analytical filter to detect stable, repeatable patterns within the 2019 Australian GEM dataset.
This paper adopts the CPO pathway: recognising novel, stable patterns, integrating theoretical interpretation, and remaining sceptical of premature causal claims from stochastic models. It does not lead with hypotheses or research questions, and for some readers familiar with DSO, it may appear to be a ‘methods-only’ paper. On the contrary, it is an approach that begins with an observational dataset and prediction, using theory-informed constructs as inputs, identifies clear and parsimonious structures, and then applies the theoretical lens that best explains the evidence (Schade & Schuhmacher, 2023; Shrestha et al., 2020). In this way, prediction serves as a bridge between data and theory – first uncovering stable patterns, then guiding the selection or refinement of the theoretical lens that best explains them.
Importantly, this predictive orientation is not without precedent in entrepreneurship research. Schade and Schuhmacher (2023), writing in the Journal of Business Venturing Insights, applied supervised machine and deep learning techniques to GEM data from 1.2 million individuals across 99 countries. Their study, published in a well-regarded entrepreneurship journal, demonstrates that predicting entrepreneurial activity is a legitimate academic pursuit with both research and practical implications, showing that valuable insights can be generated outside the confines of formal hypotheses or research questions. Crucially, they also emphasise that patterns uncovered through machine learning should be reviewed either by developing new theoretical approaches through algorithm-supported induction or by interpreting results through existing theories. Our study adopts the latter approach, positioning predictive outcomes within established theoretical frameworks, such as the TPB, and subsequently interpreting their broader behavioural implications through theories of rational and bounded decision-making that can best account for parsimonious and accurate model outcomes. In this way, predictive analytics is framed not as atheoretical, but as a tool for testing the robustness and boundaries of theoretical constructs.
Data and Methods
GEM 2019 Dataset and Variables
This study uses data from the 2019 GEM Survey, which collected responses from 10,143 adults across all states and territories. The GEM survey records individuals’ entrepreneurial activities, intentions, attitudes, and socio-demographic characteristics.
Our two outcome measures are:
Entrepreneurial Intentions (EI) – whether respondents expect to start a business in the next three years; and
Entrepreneurial Outcome Behaviour (EOB) – whether respondents are currently starting or managing a business that is less than 3.5 years old, also referred to as Total Early-Stage Entrepreneurial Activity (TEA19). In both cases, the outcome of interest is a ‘yes’ response, coded as 1 for the variables called futsup and TEA19 within the dataset.
Predictors included:
Socio-demographic variables – gender (gender), age (age), education (uneduc), and current work status (gemoccu); bracketed terms give the variable names within the 2019 GEM dataset.
Regional classifications – state (state), city/non-city division (aufullcity), and Queensland sub-regions (queensland).
Attitudinal and perceptual items – a range of items from knowing an entrepreneur (knowent) to respondents’ perceptions of whether businesses primarily aim to solve social problems (nbsocent), as described in Table 1 (14 items in total).
Table 1 summarises all modelled variables – outcome measures (EI and EOB), the perception and attitudinal items later combined into latent constructs, and the socio-demographic and regional predictors. Before analysis, all survey responses coded as ‘refused’ or ‘don’t know’ were recoded and treated as missing values.
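The recoding of non-responses can be sketched as follows. This is a minimal illustration only: the numeric codes used for ‘refused’ and ‘don’t know’ (-2 and -1) and the function name are assumptions, since the text does not state how these responses are coded in the GEM file.

```python
# Hypothetical non-response codes; the actual GEM codes are not given in the text.
REFUSED = -2
DONT_KNOW = -1
NON_RESPONSE = {REFUSED, DONT_KNOW}


def recode_missing(record):
    """Return a copy of a respondent record with non-response codes set to None."""
    return {
        name: (None if value in NON_RESPONSE else value)
        for name, value in record.items()
    }


# Illustrative respondent (values are made up for the example).
respondent = {"knowent": 1, "opport": DONT_KNOW, "suskill": REFUSED, "age": 34}
cleaned = recode_missing(respondent)
print(cleaned)
```

Treating both response types as a single missing category keeps later steps (surrogate splits, median substitution) working from one consistent definition of missingness.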
Description of GEM Variables used in this Study.
Source. own research.
Note. Categories for aufullcity include 1 = Greater Sydney, 2 = Rest of NSW, 3 = Greater Melbourne, 4 = Rest of Vic, 5 = Greater Brisbane, 6 = Rest of Qld, 7 = Greater Adelaide, 8 = Greater SA, 9 = Greater Perth, 10 = Rest of WA, 11 = Greater Hobart, 12 = Rest of Tas, 13 = Greater Darwin, 14 = Rest of NT, 15 = Australian Capital Territory (ACT). Categories for queensland include: 1 = East Brisbane, 2 = North Brisbane, 3 = South Brisbane, 4 = West Brisbane, 5 = Brisbane Inner City, 6 = Cairns, 7 = Darling Downs, 8 = Fitzroy, 9 = Gold Coast, 10 = Ipswich, 11 = Logan/Beaudesert, 12 = Mackay, 13 = Moreton Bay North, 14 = Moreton Bay South, 15 = Queensland Outback, 16 = Sunshine Coast, 17 = Toowoomba, 18 = Townsville, 19 = Wide Bay.
Since the predictors are derived from the 2019 Australian GEM dataset, the findings should be interpreted within this context. The predictive structures identified here may vary across countries and years; therefore, future validation is important to establish generalisability.
Exploratory Structural Equation Modelling
To reduce the dimensionality of the GEM attitudinal and perception variables, a factor model using these 14 predictors (variables 8 to 21 in Table 1) was developed using ESEM, implemented with the weighted least squares mean- and variance-adjusted (WLSMV) estimator and a Geomin oblique rotation. Standard SEM fit statistics were used to assess the overall data-model fit, with the following criteria taken to indicate a close-fitting model: RMSEA < 0.06, with the lower bound of its 90% confidence interval ≤ 0.05, the upper bound < 0.08, and a non-significant test of close fit (p > .05); SRMR < 0.08; CFI ≥ 0.95 (≥ 0.9 as a minimum); and TLI ≥ 0.9 (Kline, 2016; Wang & Wang, 2020; West et al., 2015). The traditional model χ2 is not used as a goodness-of-fit statistic in this study due to its high sensitivity to large sample sizes and to violations of the assumption of multivariate normality (the χ2 value increases with highly skewed and kurtotic distributions) (Wang & Wang, 2020; West et al., 2015).
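The fit criteria listed above can be expressed as a simple checklist. The sketch below is illustrative and not part of the original Mplus workflow; the function name and the dictionary keys are assumptions. The example call uses the four-factor fit statistics reported in the Results section of this paper.

```python
def esem_close_fit(rmsea, rmsea_ci_low, rmsea_ci_high, cfi, tli, srmr):
    """Apply the close-fit cut-offs listed in the text; returns pass/fail flags."""
    checks = {
        "rmsea": rmsea < 0.06,
        "rmsea_ci_low": rmsea_ci_low <= 0.05,
        "rmsea_ci_high": rmsea_ci_high < 0.08,
        "cfi": cfi >= 0.90,  # 0.95 is preferred; 0.90 is treated as the minimum
        "tli": tli >= 0.90,
        "srmr": srmr < 0.08,
    }
    # Overall flag is computed from the six individual checks above.
    checks["all_pass"] = all(checks.values())
    return checks


# Four-factor ESEM values as reported in the Results section.
four_factor = esem_close_fit(
    rmsea=0.037, rmsea_ci_low=0.035, rmsea_ci_high=0.04,
    cfi=0.98, tli=0.95, srmr=0.015,
)
print(four_factor["all_pass"])
```

The close-fit p-value is omitted from the checklist because it is reported as a probability rather than a cut-off on a fit index; it could be added as a seventh check if required.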
Model Accuracy and Robustness
To address class imbalance, equal priors were applied so that ‘yes’ and ‘no’ responses were weighted equally. This adjustment reduces bias toward the larger ‘no’ category and ensures sensitivity (the ability to detect positive cases) is accurately estimated (Berk, 2017; Breiman et al., 1984; Hastie et al., 2017; Sherrod, 2003–2009; Steinberg & Golovnya, 2006). Model accuracy was assessed primarily using sensitivity, which reflects the proportion of positive cases correctly identified. Specificity, precision, F1 scores, Youden's J index and ROC values were also reported for comparison. For CART, the Gini criterion was used for splitting, and surrogate predictors were used to handle missing data. For neural networks and SVM, where surrogates were unavailable in DTREG, median substitution was used. Random forests employed the standard out-of-bag (OOB) procedure for accuracy assessment (Berk, 2017; Hastie et al., 2017).
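For readers less familiar with these measures, the sketch below shows how sensitivity, specificity, precision, F1 and Youden's J follow from a 2 × 2 confusion matrix. The counts in the example are hypothetical and are not taken from the study's tables; the function name is likewise an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of actual 'yes' cases detected
    specificity = tn / (tn + fp)   # share of actual 'no' cases detected
    precision = tp / (tp + fp)     # share of predicted 'yes' that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    youden_j = sensitivity + specificity - 1
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "youden_j": youden_j,
    }


# Hypothetical counts for an imbalanced test partition (100 'yes', 900 'no').
m = classification_metrics(tp=80, fp=240, tn=660, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, sensitivity is high while precision is modest, which is the typical trade-off when a model is tuned to detect the minority class.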
Validation relied on a 60:40 train–test split rather than cross-validation, as the large sample size provided reliable out-of-sample estimates. As Breiman et al. (1984, p.75) note, test-sample validation is suitable when class sizes are around 900 or larger, which was the case in the GEM 2019 dataset. Each algorithm generated a ranking of variable importance (scored 0–100), and these were summarised across models by reporting both median and total scores. Further technical details of model settings are provided in Appendix A.
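The validation and summary steps can be sketched as below. This is an illustration under stated assumptions rather than the DTREG or SPM procedure: the text does not say whether the 60:40 split was stratified by class (assumed here), nor whether predictors a model did not use were excluded from the median (assumed here). All function names and importance scores in the example are hypothetical.

```python
import random
from statistics import median


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_idx, test_idx), splitting each class separately so both
    partitions keep roughly the same share of 'yes' cases (an assumption)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return train_idx, test_idx


def summarise_importance(scores_by_model):
    """scores_by_model maps model -> {predictor: importance score 0-100}.
    Predictors a model did not use are simply absent from its dictionary.
    Returns {predictor: (median score, total score)}."""
    by_predictor = {}
    for scores in scores_by_model.values():
        for predictor, score in scores.items():
            by_predictor.setdefault(predictor, []).append(score)
    return {p: (median(v), sum(v)) for p, v in by_predictor.items()}


# Hypothetical data: 100 'yes' and 900 'no' cases.
labels = [1] * 100 + [0] * 900
train, test = stratified_split(labels)
print(len(train), len(test))

# Hypothetical importance scores for two models.
summary = summarise_importance({
    "MARS": {"F1": 100.0, "age": 40.0},
    "CART": {"F1": 90.0, "age": 60.0, "gender": 5.0},
})
print(summary)
```

Reporting the median alongside the total guards against a single algorithm with an extreme score dominating the overall ranking.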
Software Applications
DTREG (version 8.8) generated all neural network, CART, RF, TreeNet and Support Vector Machine models. Minitab/SPM (version 8) was used to generate MARS models. Mplus (version 8.9) was used for ESEM models, and STATA (version 17.0) with Long and Freese's SPOST13 extensions was used to generate binary logit and post-logit estimation statistics (Long & Freese, 2014).
Results and Discussion
Missing Data and ESEM Outcomes
All model variables experienced some missing data except TEA19, the three regional classifiers (state, aufullcity and queensland), and gender. EI (futsup) had 6.9% missing data, while the remaining three socio-demographic variables had an average of 1.8% missing data. The 14 perception and attitudinal GEM variables had an average of 5.2% missingness. The two dependent variables, EI (futsup) and EOB (TEA19), exhibited binary class imbalances, as expected, with the focused ‘yes’ category representing 12.2% (n = 1,240) and 9.5% (n = 961) of all respondents, respectively.
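One way to read the equal-priors adjustment in light of these class shares is as a set of case weights that give the ‘yes’ and ‘no’ classes the same total weight. The sketch below illustrates that reading only; how DTREG and SPM implement priors internally is not described in the text, and the function name is an assumption.

```python
def equal_prior_weights(p_yes):
    """Per-case weights that give the 'yes' and 'no' classes equal total weight
    (0.5 each), given the observed share of 'yes' cases."""
    return {"yes": 0.5 / p_yes, "no": 0.5 / (1.0 - p_yes)}


# Observed 'yes' shares reported in the text: 12.2% for EI and 9.5% for EOB.
ei_weights = equal_prior_weights(0.122)
eob_weights = equal_prior_weights(0.095)
print(ei_weights)
print(eob_weights)
```

Under this reading each ‘yes’ case for EI counts roughly seven times as much as each ‘no’ case, which is why misclassifying a positive respondent becomes costly to the algorithm.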
An optimal four-factor ESEM with standardised factor loadings and item labels was generated from the ESEM analysis. The one-factor and two-factor ESEMs did not meet the minimum goodness of fit statistics, and the three-factor ESEM solution, although providing reasonable goodness of fit statistics, did not provide a coherent or robust factor solution compared to the four-factor ESEM. A five-factor ESEM or higher did not converge to a final solution.
The four-factor ESEM displayed the highest goodness of fit statistics, exceeding the minimum goodness of fit requirements with significant factor loadings ≥ 0.3 interpreted as salient. For this four-factor model, the goodness of fit statistics are: RMSEA = 0.037, 90% CI 0.035 to 0.04 with probability of RMSEA ≤ 0.05 = 1.0; CFI/TLI = 0.98 and 0.95, respectively; and SRMR = 0.015. Small factor cross-loadings were evident for all items, with most cross-loadings being < 0.1. The variable equalinc failed to meet minimum factor loading requirements and was rejected from the ESEM factor solution and subsequent analyses. The indicators knowent, suskill, creativ, and vision load on Factor 1; the items fearfailL, oppism, and proact load on Factor 2, the items opport and easystart load on Factor 3, and the items nbgood, nbstatus, nbmedia, and nbsocent load on Factor 4. The factor correlations were low to moderate, with the largest being -0.374 between Factors 2 and 3.
Factors 1 and 3 are labelled Internal Perceived Behavioural Control (PBC_I) and External Perceived Behavioural Control (PBC_E), respectively. People form beliefs about personal and environmental factors that can assist or impede their attempts to carry out a behaviour. Aggregated, these beliefs create a sense of high or low perceived behavioural control (PBC) about the behaviour in question (Fishbein & Ajzen, 2010). These control beliefs can be internal or external to the subject: internal control beliefs include skills, knowledge, competence, background, determination, and willpower, while external control beliefs include labour, finances, time, and access to resources (Ajzen, 2002). Factors 1 and 3 capture these two aspects of PBC, respectively. Factor 1 (PBC_I) is also commonly referred to as self-efficacy (Fishbein & Ajzen, 2010); in this study, the terms self-efficacy and internal perceived behavioural control are used interchangeably.
Factor 2 was labelled Attitude (Att). This factor includes the observed indicators fearfailL, oppism and proact. A broad definition of attitudes that is commonly cited is ‘An attitude is a mental and neural state of readiness, organised through experience, exerting a directive or dynamic influence upon the individual’s response to all objects and situations with which it is related’ (Ajzen & Kruglanski, 2019; Allport, 1935). Items comprising Factor 2 are negatively worded survey questions within the GEM survey and were not recoded prior to ESEM modelling. Perceived norms are the perceived social pressure to engage or not engage in a particular behaviour. This pressure comes from important individuals, significant groups, or organisations in the respondent’s life, through their approval or disapproval of the behaviour and through whether these referents themselves perform the behaviour in focus (Fishbein & Ajzen, 2010). Factor 4 (nbgood, nbstatus, nbmedia and nbsocent) characterises this attribute of perceived norms (PN). Sum scores were generated for each factor from its item scores for all cases in the dataset before machine learning modelling was initiated; the F2 items were recoded before this factor's sum score was derived.
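The derivation of factor sum scores can be sketched as follows, using the item-to-factor assignment reported above. Several details are assumptions made for illustration: the response scale (taken here as 1 to 5), the reverse-coding formula for the F2 items, and the rule that a factor score is missing whenever any of its items is missing. None of these is specified in the text.

```python
# Item-to-factor assignment as reported for the four-factor ESEM solution.
FACTOR_ITEMS = {
    "F1": ["knowent", "suskill", "creativ", "vision"],
    "F2": ["fearfailL", "oppism", "proact"],
    "F3": ["opport", "easystart"],
    "F4": ["nbgood", "nbstatus", "nbmedia", "nbsocent"],
}
REVERSED = {"F2"}  # negatively worded items, recoded before summing


def sum_scores(record, scale_min=1, scale_max=5):
    """Sum item scores per factor. The 1-5 scale is an assumed response range."""
    scores = {}
    for factor, items in FACTOR_ITEMS.items():
        values = [record.get(item) for item in items]
        if any(v is None for v in values):
            scores[factor] = None  # assumed rule: any missing item -> missing score
            continue
        if factor in REVERSED:
            values = [scale_min + scale_max - v for v in values]
        scores[factor] = sum(values)
    return scores


# Hypothetical respondent: all items answered 4 unless stated otherwise.
record = {item: 4 for items in FACTOR_ITEMS.values() for item in items}
record.update({"fearfailL": 1, "oppism": 2, "proact": 5, "opport": None})
scores = sum_scores(record)
print(scores)
```

Reverse-coding F2 before summing means that higher scores on every factor point in the same, favourable direction, which simplifies the reading of the later MARS basis functions.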
Machine Learning Algorithms Outcomes
Predicting Entrepreneurial Intentions
Table 2 shows the results of seven MLAs using the 2019 Australian GEM dataset to predict EI. These results apply to the test data partition. The binary logit model includes all records with non-missing data values. The baseline binary logistic regression model yielded sensitivity, specificity, and precision metrics of 10.73%, 98.03%, and 50.23%, respectively. The logit model's overall accuracy was 84.38%, but the sensitivity was very low – only one in 10 people with a high propensity for EI was detected by the logit model. Detailed crosstabulation and model estimators from the logit model are displayed in Supplemental Tables 1S and 2S of the Supplemental Materials. From the latter table, the estimators F1 to F4 and age are significant (P>|z| < 0.01), and gender also makes a significant contribution to the model (P>|z| < 0.012).
Model Accuracies for the Test Data Partition for EI.
Source. own research.
Note. Blank cells within the table indicate that the predictor was not used within the algorithmic model.
MARS and CART were the most accurate algorithms in predicting positive EI respondents, achieving approximately 83% and 82% sensitivity rates, respectively, with sensitivity accuracies more than seven times that of the binary logit model. Surprisingly, the PNN algorithm, a modern neural network model, had the lowest sensitivity value of 24.8% but the highest specificity accuracy of approximately 90%. These accuracies were similar to those of the binary logit model results. The TreeNet (stochastic gradient boosting), SVM, and RBN algorithms displayed similar and high sensitivity and specificity accuracies, exceeding 70%.
These latter three algorithms displayed the highest Youden's J indexes, but precision was broadly similar across all seven models, ranging from approximately 24% (MARS and CART) to 29.3% (TreeNet), a spread of about 5 percentage points. As anticipated, the ROC value was lowest for the PNN algorithm due to its low sensitivity and high specificity; ROC values ranged from approximately 70% (PNN) to almost 80% (TreeNet).
The five most important predictors in terms of total and median scores were the latent construct F1 (internal perceived behavioural control or self-efficacy), work status (gemoccu), the respondent's age, F2 (attitude), and education (uneduc). The variables state, queensland and gender were the least important variables. The remaining variables displayed modest model importance scores, with the regional variable aufullcity (city/non-city divide) having high model importance scores for the RF and TreeNet algorithms, with importance scores of 82.7 and 55.3, respectively. Across the seven algorithmic models, considerable variability in the variable importance rankings is evident. MARS and the SVM algorithms display a restricted set of four predictors. The three models, RF, PNN, and RBN, included all predictors; CART also included all predictors except for gender. The predictor F1 had high importance scores across all the MLAs.
MARS provided the most parsimonious list of predictors for EI assessment, utilising just four predictors in this model: age, F1, F2, and F3. The MARS model is displayed in Exhibit 1, along with the final linear function relating the basis functions to Y. If the predictor F1 is missing in this MARS model, BF1 is coded as ‘0’. If F1 is present, BF1 = ‘1’. A missing value for F1 also sets BF3 and BF4 to zero. These functions only contribute to this MARS model when F1 is a positive value.

MARS Basis Functions and Linear Relations to Y (EI).
The MARS basis functions revealed that the predictors F1, F2, and F3 positively contribute to EI when their estimators exceed their inflection point values of 8, 3, and 7, respectively. The predictor age shows a decreasing contribution to EI, with the BF7 and BF8 functions determining the localised regression slopes for this predictor. This MARS model accurately assessed whether a respondent has a positive EI, achieving a sensitivity accuracy above 80%, the highest of any MLA used in this study. However, the parameter estimates from MARS or those derived from any machine learning algorithm cannot be interpreted as causal explanations, and causal inferences are not generated in these models (Ahrens et al., 2020).
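The structure of such a model can be illustrated with hinge-type basis functions. The sketch below is not the fitted model from Exhibit 1: the knots for F1, F2 and F3 (8, 3 and 7) come from the text, but every coefficient is hypothetical, the function names are assumptions, and the age terms are omitted because their knots are not stated here.

```python
def hinge(x, knot):
    """MARS-style hinge: zero up to the knot, then rising linearly."""
    return max(0.0, x - knot)


def ei_basis(f1, f2, f3):
    """Basis functions for the three latent predictors.
    A missing F1 switches off the indicator and the F1 hinge term."""
    f1_present = 0.0 if f1 is None else 1.0
    f1_term = hinge(f1, 8.0) if f1 is not None else 0.0
    return {
        "f1_present": f1_present,
        "f1_above_knot": f1_term,
        "f2_above_knot": hinge(f2, 3.0),
        "f3_above_knot": hinge(f3, 7.0),
    }


# Hypothetical coefficients, chosen only to show the additive form.
COEF = {
    "intercept": 0.05,
    "f1_present": 0.02,
    "f1_above_knot": 0.03,
    "f2_above_knot": 0.02,
    "f3_above_knot": 0.04,
}


def ei_score(f1, f2, f3):
    """Linear combination of the basis functions (illustrative only)."""
    bf = ei_basis(f1, f2, f3)
    return COEF["intercept"] + sum(COEF[name] * value for name, value in bf.items())


print(ei_basis(12, 5, 9))
print(ei_score(12, 5, 9))
print(ei_score(None, 2, 7))
```

Because each term is a piecewise-linear function of a single predictor, the fitted model can be transcribed into a spreadsheet or a short script without specialised software.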
Predicting Entrepreneurial Outcome Behaviour
Table 3 summarises the results of the seven machine learning algorithms (MLAs) that predict EOB with the 2019 Australian-focused GEM dataset. These machine learning results apply to the test data partition. The binary logit model includes all records with non-missing data values. This baseline binary logit regression model yielded sensitivity, specificity, and precision metrics of 22.99%, 97.76%, and 57.14%, respectively. This model accurately assessed negative EOB respondents (Y = 0) but displayed a misclassification rate of 77% for respondents with a positive propensity for EOB. The logit model’s crosstabulation outcome and estimators for the EOB dependent variable are detailed in Supplemental Tables 3S and 4S in the Supplemental Materials. The latter table highlights that the estimators F1, F2, F3, futsup, age, four uneduc categories, and four gemoccu categories are significant (P>|z| < 0.01).
Model Accuracies for the Test Data Partition for EOB.
Source. own research.
Note. Blank cells within the table indicate that the predictor was not used within the algorithmic model.
MARS, Random Forests, and CART were the most accurate algorithms in predicting positive EOB respondents, with sensitivity accuracies ranging from approximately 82% to 84%. The PNN algorithm again displayed the lowest sensitivity value of 27.29% and the highest specificity accuracy of 92.24%. The remaining three machine learning models, TreeNet, SVM, and RBN, displayed high sensitivity and specificity accuracies, with values over 70%.
The Youden’s J indexes were highest for the CART, MARS, TreeNet, SVM, and RF algorithms, with precision similar across all MLAs, ranging from 24.59% (RBN) to 30.81% (TreeNet). High-sensitivity algorithms tend to generate more false positives, which lowers precision. As expected, the ROC value was lowest for the PNN algorithm due to its low sensitivity and high specificity, with all other model ROC values exceeding 80%.
The five most important predictors, in terms of total and median scores, were work status (gemoccu), EI (futsup), education status (uneduc), age (age), and F1 (internal perceived behavioural control or self-efficacy). The variables state, queensland, aufullcity, and gender were the least important. Across the seven algorithmic models, considerable variability in the variable importance rankings is evident. MARS and the SVM algorithms display a restricted set of four and five predictors, respectively. TreeNet and PNN used all predictors in their models, with TreeNet displaying moderate to high variable importance scores for all twelve predictors. The predictor gemoccu or work status displayed high variable importance scores in all seven predictive EOB algorithms.
MARS provided the most parsimonious predictive structure for EOB assessment, using only four predictors: gemoccu, futsup (labelled as FUTSUP_R in the MARS software output), F1 (internal perceived behavioural control or self-efficacy), and age. The basis functions characterising this algorithm, with their relationship to EOB (Y), are described in Exhibit 2. Self-employed respondents (gemoccu = 7), respondents with F1 scores above 10, and those with positive EI responses are highly likely to have positive responses to EOB. Increasing age has a decreasing effect on positive EOB responses post-28 years of age. The MARS results for EI and EOB align well regarding important predictors and their categories of importance.

MARS Basis Functions and Linear Relations to Y (EOB).
Figure 2 presents a visual summary of the predictive associations between EI and EOB identified in the MARS models. The nodes in the diagram represent important predictors and the target variables. Internodal linkages indicate the strength of the relationship between each predictor and its associated target variable; thicker lines indicate more important or impactful predictors.

Visual Summary of Key Predictors of EI and EOB in the MARS Models.
The most important predictor in the MARS model for each target variable is drawn with a link strength (line thickness) of 4 points, the second-ranked predictor with 3 points, the third with 2 points, and the fourth with 1 point. The links carry no arrows, indicating that all these modelled relationships are associative, not causal, and that the MARS parameter estimates have no statistical inferential interpretation.
These findings are robust within the Australian 2019 dataset; however, they should be interpreted cautiously when considering other countries or different time periods. Predictive structures may vary across contexts due to cultural, institutional, and economic differences. As such, the strength and ordering of predictors should not be assumed fixed across all contexts, but instead viewed as evidence of stable associations within this national dataset.
Conclusions
This study applied a range of MLAs to predict EI and EOB and to determine their underlying predictive structures. MARS generated a highly accurate and parsimonious model for predicting both target variables, outperforming a traditional logit regression model, particularly in terms of sensitivity accuracy – a critical requirement for practical screening and interventions. The four predictors used by the MARS model to predict EI included the latent constructs from the ESEM procedure – F1 (Internal Perceived Behavioural Control or self-efficacy), F2 (Attitude), F3 (External Perceived Behavioural Control) – and age. The four predictors used by this model to predict EOB included the latent construct F1, work status, age and EI itself, exhibiting sensitivities above 80% for both targets.
With its additive, linear basis functions, the MARS model offers an interpretable alternative that is readily generalisable across datasets with similar distributional characteristics. The linear basis functions of MARS are also easy to program into new predictive applications. Notably, regional variables (state, aufullcity, queensland) and gender contributed little to predictive performance, raising questions about their explanatory value in entrepreneurship behaviour models. This suggests the need for further regional and economic contextual analysis, utilising advanced analytics approaches to better model non-linearities and non-additive predictor effects, or rigorous qualitative frameworks.
In summary:
The binary logit model's accuracy in determining positive EI and EOB respondents in the 2019 GEM dataset is poor; all the MLAs used in this study offer practical advantages in predictive accuracy (compared to this logit model) for both these target variables using the 2019 Australian GEM data.
Although the logistic model exhibited high specificity and overall accuracy, it struggled to correctly classify positive cases of EI and EOB, with sensitivities of only 10.7% and 23.0%, respectively, which limits its practical utility in targeted policy or support interventions. Tables 1S and 2S in the Supplemental Materials provide a comprehensive overview of the EI logit model's outcomes.
Integrating ESEM and MLAs yielded methodological conciseness and practical benefits. The findings highlight the value of selecting MLAs based not only on predictive power but also on interpretability and data-handling robustness.
With just four predictors, the MARS model offers considerable advantages for identifying individuals with a high propensity for entrepreneurial behaviour, with sensitivity accuracy for both target variables (EI and EOB) between 83% and 84%.
MARS generated a concise list of predictors for positive EI responses, including age and the latent constructs F1, F2 and F3. Increasing age has a negative impact on EI after the age of 20. Increasing F1, F2, and F3 values, when these values exceeded their inflection point values of 8, 3, and 7, positively affected EI.
For EOB assessment, MARS provides a concise list of predictors, including work status (gemoccu), EI (futsup), F1, and age. Self-employed respondents with F1 scores of more than 10 who exhibit positive EI responses are highly likely to have positive responses to EOB. Increasing age has a negative impact on EOB after the age of 28. Supplemental Tables 3S and 4S in the Supplemental Materials provide a detailed overview of the EOB logit model, allowing for comparison with the EOB MARS model's outcomes.
The application of MLAs in this study highlighted that prior probabilities, as used in this study, are highly effective in ensuring maximum classification sensitivity accuracy, regardless of the magnitude of imbalances between the two target variable categories. Surrogacy is also a practical, convenient, and effective way of handling missing data for categorical or continuous predictors.
Consistent with four decades of work comparing algorithmic and stochastic approaches (e.g., Breiman et al., 1984; Breiman, 2001; Berk, 2017; Clarke et al., 2009; Hastie et al., 2017; Kuhn & Johnson, 2016), these results reaffirm that MLAs offer considerable advantages over stochastic statistical models when the objective is prediction rather than causal inference.
MLAs within a suitable analytic framework can enhance our understanding and prediction of social behaviour, offering practical and usable data insights for investors, policymakers, government agencies, and researchers aiming to drive social progress and economic development.
Overall, this study demonstrates that integrating behavioural theory with machine learning is technically powerful, providing a precise and actionable understanding of entrepreneurial outcome behaviour for researchers and decision-makers alike. Together, these results illustrate how data-driven methods rooted in behavioural theory can generate insights for identifying, supporting, and scaling entrepreneurial activity. These results demonstrate the technical advantages of machine learning approaches for prediction; however, their accuracy across regions and time requires further validation. Future studies using additional GEM waves and international datasets will be essential to confirm the stability of these predictive structures.
By integrating behavioural theory with predictive models, this study shows how constructs such as self-efficacy, age, and work status emerge as consistent predictors, while others fade in importance. This underscores an epistemological point: theory-driven hypothesis testing remains one path to knowledge, while prediction and the interpretation of theory-reflective outcomes provide another (Berk, 2017; Brady et al., 2010; Freedman, 2009). As Shrestha et al. (2020) argue, predictive results can be used to explain derived outcomes, either to extend existing theories or to generate new ones through algorithm-supported induction. Our study adopts the first approach, situating predictive findings within established frameworks such as TPB, while interpreting their broader implications through bounded rationality.
Two implications follow. First, the predictive dominance of a small set of variables suggests that entrepreneurial behaviour often reflects simple, repeatable decision rules rather than the full deliberative pathway implied by TPB. Second, this pattern is consistent with Simon’s theory of bounded rationality, which posits that individuals rely on a limited number of high-signal cues, reflecting satisficing and heuristic processing rather than fully rational choice. Future research should test the stability of these rules across time, cultures, and datasets, and examine whether predictive, theory-testing, or mixed-methods approaches – including qualitative insights – can clarify when bounded rationality provides the strongest explanatory leverage.
Recognising the role of bounded rationality also opens applied avenues: by understanding how simple cues shape entrepreneurial choices, researchers and policymakers can design interventions, training, and education programs that match the shortcuts individuals actually use, supporting more effective entrepreneurial entry and persistence. Taken together, this invites contributions from empirical modellers, machine learning scholars, and qualitative researchers alike, ensuring that bounded rationality is examined from multiple vantage points.
In conclusion, predictive models not only improve accuracy but also help reveal the cognitive shortcuts that underpin entrepreneurial decision-making. In doing so, they affirm CPO as a valuable pathway of scientific inquiry in entrepreneurship research. More broadly, this study demonstrates how integrating behavioural constructs with machine learning can enhance both practice and theory, providing researchers, educators, and policymakers with a replicable framework for understanding and supporting entrepreneurial behaviour.
Practical Implications
The findings of this study provide actionable insights for policymakers, educators, and entrepreneurship support organisations. By demonstrating that high predictive accuracy can be achieved with a small, theory-informed set of predictors, the results highlight opportunities for more efficient, data-driven support strategies:
Policy and government programs can incorporate focused predictor sets, such as those identified by the MARS model (self-efficacy, intentions, age, and work status), into grant schemes, training programs, and pre-screening assessments to better target individuals with a high likelihood of entrepreneurial behaviour.
Incubators and accelerators can apply these predictors to their intake processes, tailoring support pathways accordingly. For example, individuals high in perceived behavioural control, intentions and opportunity perception can be directed into action-oriented modules, while others may benefit from confidence- or skills-building interventions.
Education and training providers can rebalance curricula to place greater emphasis on self-efficacy and opportunity awareness, aligning program design with the factors and variables most predictive of entrepreneurial intention and action.
In summary, integrating behavioural theory with machine learning provides a practical framework for identifying and supporting future entrepreneurs, ensuring that program design and policy initiatives are better matched to the decision-making heuristics individuals actually use.
Limitations and Future Research
In this study, the priority was maximising model sensitivity by adjusting model priors to account for target class imbalances; however, this approach can also increase the rate of false positives. For instance, the MARS model produced a true positive to false positive ratio of approximately 1:3 for EI and EOB. In practical applications, this ratio may be considered suboptimal, depending on the cost associated with misclassification; however, it nevertheless represents a marked improvement over prior MLA studies in entrepreneurship, many of which reported sensitivities as low as 1–10%.
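The link between precision and the true positive to false positive ratio can be made explicit: if precision is p, each true positive is accompanied on average by (1 − p)/p false positives. The short sketch below applies this to the precision of approximately 24% reported for the MARS EI model; the function name is an assumption.

```python
def false_positives_per_true_positive(precision):
    """Number of false positives expected for each true positive,
    given precision = TP / (TP + FP)."""
    return (1.0 - precision) / precision


# Precision of approximately 24%, as reported for the MARS EI model.
ratio = false_positives_per_true_positive(0.24)
print(f"about 1:{ratio:.1f} true positives to false positives")
```

A precision of 24% therefore corresponds to roughly three false positives for each correctly identified respondent, in line with the ratio quoted above.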
This analysis was based on a cross-sectional dataset (2019 Australian GEM dataset), which limits the ability to assess changes over time or explore longitudinal consistency in predictive patterns. Future studies should validate these models using GEM datasets from different years and regions to enhance generalisability. Using observed items instead of latent constructs may reveal whether item-level predictors outperform composite factors in terms of variable importance and predictive accuracy.
Another limitation is the potential for unobserved heterogeneity. Hidden respondent subgroups may influence model outcomes and confound predictor relationships. Future work should consider latent class modelling to identify subgroups and assess their effects on model structure and predictive outcomes. Latent class identifiers could then be incorporated into MLA frameworks to improve classification accuracy and control for sample-level variability. Despite these limitations, this study demonstrates the potential of combining ESEM and MLAs to produce predictive tools that are both analytically robust and practical.
Future research could align predictive behavioural models, such as those developed in this study, with bibliometric mapping of the entrepreneurship literature. This would enable researchers to assess whether the conceptual focus of the field, including themes emerging in entrepreneurial intentions bibliometric research, aligns with empirically validated behavioural predictors at the individual level. Such integration may reveal gaps between scholarly discourse and actual behavioural mechanisms, offering opportunities for theory refinement and more targeted policy interventions.
Supplemental Material
sj-docx-1-sgo-10.1177_21582440251404789 – Supplemental material for Predicting Entrepreneurial Intentions and Behaviour: A Machine Learning Approach Using Latent Constructs from Australian GEM Data
Supplemental material, sj-docx-1-sgo-10.1177_21582440251404789 for Predicting Entrepreneurial Intentions and Behaviour: A Machine Learning Approach Using Latent Constructs from Australian GEM Data by Ray Duplock, Gian Luca Casali and Char-lee McLennan in SAGE Open
Footnotes
Appendix A
Across the seven machine learning algorithms:
Ethical Considerations
No human or animal subjects were involved in this research. This project is exempt from ethical considerations from QUT’s Human Ethics Committee (project id 7877).
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author is a recipient of an Australian Government Research Training Program Stipend that provided financial support for this research.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statements
Requests for this dataset can be made to the third author of this paper.
Declaration of Generative AI Application
ChatGPT assisted with manuscript editing. Grammarly assisted with text edits and grammar.
Supplemental Material
Supplemental material for this article is available online.
References
