
Abstract

Feedforward neural networks (FNNs) can be viewed as non-linear regression models, where covariates enter the model through a combination of weighted summations and non-linear functions. Although these models have some similarities to the approaches used within statistical modelling, the majority of neural network research has been conducted outside of the field of statistics. This has resulted in a lack of statistically based methodology, and, in particular, there has been little emphasis on model parsimony. Determining the input layer structure is analogous to variable selection, while the structure for the hidden layer relates to model complexity. In practice, neural network model selection is often carried out by comparing models using out-of-sample performance. However, in contrast, the construction of an associated likelihood function opens the door to information-criteria-based variable and architecture selection. A novel model selection method, which performs both input- and hidden-node selection, is proposed using the Bayesian information criterion (BIC) for FNNs. The choice of BIC over out-of-sample performance as the model selection objective function leads to an increased probability of recovering the true model, while parsimoniously achieving favourable out-of-sample performance. Simulation studies are used to evaluate and justify the proposed method, and applications on real data are investigated.

Keywords

information criteria; model selection; neural networks; variable selection

1 Introduction

Neural networks are a popular class of machine-learning models, which pervade modern society through their use in many artificial-intelligence-based systems (LeCun, Bengio, and Hinton, 2015). Their success can be attributed to their predictive performance in an array of complex problems (Abiodun, Jantan, Omolara, Dada, Mohamed, and Arshad, 2018). Recently, neural networks have been used to perform tasks such as natural language processing (Goldberg, 2016), anomaly detection (Pang, Shen, Cao, and Hengel, 2021), and image recognition (Voulodimos, Doulamis, Doulamis, and Protopapadakis, 2018). Feedforward neural networks (FNNs), which are a particular type of neural network, can be viewed as non-linear regression models, and have some similarities to statistical modelling approaches (e.g., covariates enter the model through a weighted summation, and the estimation of the weights for an FNN is equivalent to the calculation of a vector-valued statistic) (Ripley, 1994; White, 1989). Despite early interest from the statistical community (White, 1989; Ripley, 1993; Cheng and Titterington, 1994), the majority of neural network research has been conducted outside of the field of statistics (Breiman, 2001; Hooker and Mentch, 2021). Given this, there is a general lack of statistically based methods, such as model and variable selection, which focus on developing parsimonious models.

Typically, the primary focus when implementing a neural network centres on model predictivity (rather than parsimony); the models are viewed as 'black-boxes' whose complexity is not of great concern (Efron, 2020). It is perhaps not surprising, therefore, that there is a tendency for neural networks to be highly over-parameterized, miscalibrated, and unstable (Sun, Song, and Liang, 2022). Nevertheless, FNNs can capture more complex covariate effects than is typical within popular (linear/additive) statistical models. Consequently, there has been renewed interest in merging statistical models and neural networks, for example, in the context of flexible distributional regression (Rügamer, Kolb, and Klein, 2024) and mixed modelling (Tran, Nguyen, Nott, and Kohn, 2020). However, statistically based model selection procedures are required to increase the utility of the FNN within the statistician's toolbox.

Traditional statistical modelling is concerned with developing parsimonious models, as it is crucial for the efficient estimation of covariate effects and significance testing (Efron, 2020). Indeed, model selection (which includes variable selection) is one of the fundamental problems of statistical modelling (Fisher and Russell, 1922). It involves choosing the 'best' model, from a range of candidate models, by trading pure data fit against model complexity (Anderson and Burnham, 2004). As such, there has been a substantial amount of research on model and variable selection (Miller, 2002). As noted by Heinze, Wallisch, and Dunkler (2018), typical approaches include significance testing combined with forward selection or backward elimination (or a combination thereof); information criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC; Schwarz, 1978; Akaike, 1998; Anderson and Burnham, 2004); and penalized likelihood such as LASSO (Tibshirani, 1996; Fan and Lv, 2010).

In machine learning, due to the focus on model predictivity, relatively less emphasis is placed on finding a model that strikes a balance between complexity and fit. Looking at FNNs in particular, the number of hidden nodes is usually treated as a tunable hyperparameter (Bishop et al., 1995; Pontes, Amorim, Balestrassi, Paiva, and Ferreira, 2016). Input-node selection is not as common, as the usual consensus when fitting FNNs appears to be similar to the early opinion of Breiman (2001): 'the more predictor variables, the more information'. However, there are some approaches in this direction, and a survey of variable selection techniques in machine learning can be found in Chandrashekar and Sahin (2014). Nevertheless, the optimal model is usually determined by its predictive performance, such as the out-of-sample mean squared error (OOS), which can be calculated on a validation dataset. Unlike an information criterion, out-of-sample performance does not directly take account of model complexity.

When framing an FNN statistically, there are several motivating reasons for a model selection procedure that aims to obtain a parsimonious model. For example, the estimation of parameters in a larger-than-required model results in a loss in model efficiency, which, in turn, leads to less precise estimates. Input-node selection, which is often ignored in the context of neural networks, can provide the practitioner with insights on the importance of covariates. Instead, other feature importance measures are typically used such as the feature attribution methods described in Koenen and Wright (2024). Furthermore, eliminating irrelevant covariates can result in cheaper models by reducing potential costs associated with data collection (e.g., financial, time, energy). In this article, we take a statistical-modelling view of neural network selection by assuming an underlying (normal) error distribution. Doing so enables us to construct a likelihood function, and, hence, carry out information-criteria-based model selection, such as the BIC (Schwarz, 1978), naturally encapsulating the parsimony in the context of a neural network. More specifically, we propose an algorithm that alternates between selecting the hidden layer complexity and the inputs with the objective of minimizing the BIC. We have found, in practice, that this leads to more parsimonious neural network models than the more usual approach of minimizing out-of-sample error, while also not compromising the out-of-sample performance itself.

The remainder of this article is structured as follows. In Section 2, we introduce the FNN model while linking it to a normal log-likelihood function. Section 3 motivates and details the proposed model selection procedure. Simulation studies to investigate the performance of the proposed method, and to compare it to other approaches, are given in Section 4. In Section 5, we apply our method to real-data examples. Finally, we conclude in Section 6 with a discussion.

2 Feedforward neural network

Let $y = (y_{1}, y_{2}, \dots, y_{n}) \in ℝ^{n}$ be the response variable of interest for a regression-based problem, where $n$ represents the number of observations. For the $i$th observation, $i = 1, \dots, n$, let $x_{i} = {(x_{1 i}, x_{2 i}, \dots, x_{p i})}^{T}$ be a vector of $p$ covariates, i.e., the inputs to the neural network model. We assume a model of the form $y_{i} = NN (x_{i}) + ε_{i}$, where $ε_{i}$ is a random error that we assume has a $N (0, σ^{2})$ distribution, and $NN (\cdot)$ is a neural network,

\begin{matrix} NN (x_{i}) = γ_{0} + \sum_{k = 1}^{q} γ_{k} ϕ (\sum_{j = 0}^{p} ω_{j k} x_{j i}) . \end{matrix}

(2.1)

As we aim to frame FNNs as an alternative to other statistical non-linear regression models (i.e., used on small-to-medium sized tabular data sets relative to the much larger data sets seen more broadly in machine learning), and due to the universal approximation theorem (Cybenko, 1989; Hornik, Stinchcombe, and White, 1989), we are restricting our attention to FNNs with a single hidden layer. The parameters in Equation 2.1 are as follows: $ω_{0 k}$, the intercept term associated with the $k$th hidden node; $ω_{j k}$, the weight that connects the $j$th input node to the $k$th hidden node; $γ_{0}$, the intercept term associated with the output node; and $γ_{k}$, the weight that connects the $k$th hidden node to the output node. The function $ϕ (\cdot)$ is the activation function for the hidden layer, which is often a logistic function. The number of parameters in the neural network is given by $K = (p + 2) q + 1$. A diagram of a neural network architecture with $p$ input nodes and $q$ hidden nodes is shown in Figure 1. In the diagram, $x_{0} = 1$, $h_{0} = 1$, and $h_{k} = ϕ (\sum_{j = 0}^{p} ω_{j k} x_{j i})$.
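As an illustration of Equation 2.1 and the parameter count, the following is a minimal sketch in plain Python. It assumes a logistic activation (the text notes this is a common choice, not the only one), and the function names `logistic`, `nn_predict`, and `n_params` are hypothetical names chosen here, not part of any package mentioned in the article.

```python
import math


def logistic(z):
    """Logistic activation, written in a numerically stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nn_predict(x, omega, gamma):
    """Evaluate Equation 2.1 for a single observation.

    x     : list of p covariate values (without the constant x_0 = 1)
    omega : list of q lists; omega[k-1] = [w_0k, w_1k, ..., w_pk]
    gamma : list [g_0, g_1, ..., g_q]
    """
    x_aug = [1.0] + list(x)  # prepend x_0 = 1 for the hidden-node intercepts
    out = gamma[0]           # output-node intercept
    for k, w_k in enumerate(omega, start=1):
        z = sum(w * xj for w, xj in zip(w_k, x_aug))
        out += gamma[k] * logistic(z)
    return out


def n_params(p, q):
    """K = (p + 2) q + 1: (p + 1) q hidden-layer weights plus q + 1 output weights."""
    return (p + 2) * q + 1
```

For example, `n_params(3, 3)` gives 16, which matches the size of the 'true' model used later in the simulation studies.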

Given our assumption that $ε_{i} \sim N (0, σ^{2})$ , we then make use of the log-likelihood function

\begin{matrix} l (θ) = - \frac{n}{2} \log (2 π σ^{2}) - \frac{1}{2 σ^{2}} \sum_{i = 1}^{n} {(y_{i} - NN (x_{i}))}^{2}, \end{matrix}

(2.2)

where $θ = {(ω_{01}, \dots, ω_{p 1}, \dots, ω_{0 q}, \dots, ω_{p q}, γ_{0}, \dots, γ_{q}, σ^{2})}^{T}$. We maximize this log-likelihood to obtain $\hat{θ}$, but note that the estimates of the neural network parameters do not depend on the value of $σ^{2}$, i.e., the residual sum of squares, $\sum_{i = 1}^{n} {(y_{i} - NN (x_{i}))}^{2}$, can be minimized to obtain the neural network parameters. This is useful since standard neural network software (that minimizes the residual sum of squares) such as nnet (Ripley and Venables, 2022) can be used to optimize the neural network, followed by the estimation of $σ^{2}$ in a separate step.

Figure 1

Neural network architecture with p input nodes and q hidden nodes.

The calculation of a log-likelihood function allows for the use of information criteria when selecting a given model, and in particular, the BIC (Schwarz, 1978), BIC $= - 2 l (\hat{θ}) + log (n) (K + 1)$ , where we have K + 1 parameters, i.e., the K neural network parameters plus the variance parameter, $σ^{2}$ . An attractive property of the BIC is that it is 'dimension-consistent', i.e., the probability of selecting the 'true' model approaches one as sample size increases (Anderson and Burnham, 2004). It is important to note that other approaches for the calculation of the degrees of freedom exist (Murata, Yoshizawa, and Amari, 1994; Ye, 1998), but we find these do not penalize more complex models (with redundancies) heavily enough in the model selection context compared to using K (see Supplementary Material Section A).
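The two-step calculation described above (minimize the residual sum of squares, then estimate $σ^{2}$ and evaluate the BIC) can be sketched as follows. The sketch assumes that $σ^{2}$ is estimated by its maximum likelihood estimate, RSS / n; the article does not state which estimator is used, so this is an assumption, as are the function names.

```python
import math


def sigma2_mle(rss, n):
    """Maximum likelihood estimate of the error variance (assumed estimator)."""
    return rss / n


def max_loglik(rss, n):
    """Equation 2.2 evaluated at the fitted network and sigma^2 = RSS / n."""
    s2 = sigma2_mle(rss, n)
    return -0.5 * n * math.log(2.0 * math.pi * s2) - rss / (2.0 * s2)


def bic(rss, n, K):
    """BIC = -2 l(theta_hat) + log(n) (K + 1); the extra 1 counts sigma^2."""
    return -2.0 * max_loglik(rss, n) + math.log(n) * (K + 1)
```

With the variance profiled out in this way, the log-likelihood reduces to $-\frac{n}{2} (\log (2 π \hat{σ}^{2}) + 1)$, so two models fitted to the same data differ in BIC only through their residual sums of squares and their values of K.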

3 Model selection

To begin model selection, a set of candidate models must be considered. For the input layer, we can have up to $p_{max}$ inputs, where $p_{max}$ is the maximum number of covariates being considered, and this is often the total number of covariates available in the data under study. The input layer can contain any combination of these $p_{max}$ inputs. For the hidden layer, we must specify a $q_{max}$ value, which is the maximum number of hidden nodes to be considered; this controls the maximum level of complexity of the candidate models. We can then have between one and $q_{max}$ nodes in the hidden layer. From a neural network selection perspective, we aim to select a subset of $p \leq p_{max}$ covariates to enter the input layer and to build a hidden layer of $q \leq q_{max}$ nodes to adapt to the required complexity. To carry out these selections, we suggest a statistically motivated procedure based on minimizing the BIC, since it directly penalizes complexity and is known to be selection consistent, i.e., BIC minimization converges to the true model asymptotically. In contrast, and more usually in machine learning applications, one could consider predictive performance, for example, the OOS. We will also consider this approach but find that it leads to significantly more complex models than the use of BIC while only marginally improving predictive performance. Whether one is aiming to minimize BIC or OOS, multiple initializations of the neural network (from $n_{init}$ random vectors of parameters) are required to improve the chance of finding a global maximizer of the log-likelihood surface.

3.1 Proposed approach

We propose a stepwise procedure that starts with a hidden-node selection phase followed by an input-node selection phase. (We find that this ordering leads to improved model selection.) This is, in turn, followed by a fine-tuning phase that alternates between the hidden and input layers for further improvements. The proposed model selection procedure is detailed in Algorithm 4 (which relies on Algorithms 1–3), and a schematic diagram is provided in Figure 2. It is also described at a high level in the following paragraphs.

Figure 2

Model selection schematic. Nodes coloured grey are being considered in current phase. Nodes coloured gold represent optimal nodes in that phase to be brought forward to the next phase.

Algorithm 1

Fit Candidate Model

Algorithm 2

Hidden-Node Selection

Algorithm 3

Input-Node Selection

Algorithm 4

Model Selection

The procedure (Algorithm 4) is initialized with the full set of input nodes, $X_{full}$, the maximum number of hidden nodes being considered, $q_{max}$, and the number of initializations, $n_{init}$, and, as mentioned, starts with a hidden-node selection phase (Algorithm 2 with $Q = \{1, 2, \dots, q_{max} - 1\}$). For each candidate model in this phase (i.e., models with $q \in \{1, \dots, q_{max}\}$), the network optimizer is supplied with $n_{init}$ random vectors of initial parameters, the log-likelihood function is maximized at each of these vectors, and the overall maximizer is found (see Algorithm 1). Different vectors of initial parameters are supplied because the optimization surface for neural networks is complex and may contain several local maxima. Thus, the use of a set of initial vectors (rather than just one) aims to increase the chance of finding the global maximum; of course, this cannot be guaranteed, as is often the case in more complex statistical models. Once all of the $q_{max}$ candidate models have been fitted, the hidden-node selection phase is concluded by selecting the one whose hidden structure (i.e., number of nodes, q) minimizes the BIC.
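The hidden-node selection phase described above can be sketched as follows. This is not the selectnn implementation: the optimizer is abstracted as a hypothetical callable `fit_once(inputs, q, seed)` that returns the maximized log-likelihood from one random start, and `fit_candidate` and `select_hidden` are illustrative names. A toy stand-in replaces the real network fit so that the sketch runs on its own.

```python
import math


def fit_candidate(fit_once, inputs, q, n_init, n):
    """Best of n_init random starts for one candidate model, scored by BIC."""
    best_loglik = max(fit_once(inputs, q, seed) for seed in range(n_init))
    K = (len(inputs) + 2) * q + 1  # number of network parameters
    return -2.0 * best_loglik + math.log(n) * (K + 1)


def select_hidden(fit_once, inputs, q_candidates, n_init, n):
    """Fit every candidate number of hidden nodes and keep the BIC minimizer."""
    scores = {q: fit_candidate(fit_once, inputs, q, n_init, n) for q in q_candidates}
    best_q = min(scores, key=scores.get)
    return best_q, scores[best_q]


def toy_fit(inputs, q, seed):
    """Toy stand-in for a network fit: the fit stops improving beyond q = 3.

    The small seed term mimics different local maxima across random starts.
    """
    return -100.0 - 100.0 * max(0, 3 - q) + 0.1 * seed


inputs = ["x%d" % j for j in range(1, 14)]  # x1, ..., x13
best_q, best_bic = select_hidden(toy_fit, inputs, range(1, 11), n_init=5, n=500)
```

In the toy setting, adding hidden nodes beyond three leaves the log-likelihood unchanged while the $\log (n)$ penalty keeps growing, so the BIC minimizer is `best_q = 3`.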

Once the hidden-node selection phase has concluded, the focus switches to the input layer (Algorithm 3); at this point, there are $p_{max}$ inputs (i.e., the set of input nodes currently included in the model is the set of all input nodes, $X = X_{full}$). For the input-node selection phase, each input node is dropped in turn, with the aim of finding an input whose removal yields a lower BIC; as with the previous phase, random sets of initial parameters are used for each candidate model in the underlying likelihood optimization. If the removal of a given input node does yield a lower BIC value, then that input node is dropped from the model (and if the removal of two or more inputs would each result in a lower BIC, the one yielding the lowest BIC is removed). This is repeated until no covariate, when removed from the model, results in a lower BIC, and, then, the set of included input nodes, $X$, is returned. (Thus, in this phase, Algorithm 3 is applied with only the 'drop inputs' step and $n_{steps} = p_{max}$.)
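A minimal sketch of this backward elimination over inputs is given below. The model fit is abstracted as a hypothetical callable `score(inputs, q)` that returns the BIC of the best fit for that candidate; `drop_inputs` and `toy_score` are illustrative names, and keeping at least one input in the model is an assumption of the sketch.

```python
def drop_inputs(score, inputs, q):
    """Repeatedly drop the input whose removal lowers the BIC the most."""
    current = list(inputs)
    best = score(current, q)
    while len(current) > 1:  # assumption: never drop the last remaining input
        # BIC of each candidate model obtained by removing one input.
        trials = {x: score([v for v in current if v != x], q) for x in current}
        x_drop = min(trials, key=trials.get)
        if trials[x_drop] >= best:
            break  # no single removal lowers the BIC, so stop
        current.remove(x_drop)
        best = trials[x_drop]
    return current, best


def toy_score(inputs, q):
    """Toy BIC: a penalty per input and a large loss for each missing important input."""
    missing = sum(1 for x in ("x1", "x2", "x3") if x not in inputs)
    return 10.0 * len(inputs) + 1000.0 * missing


all_inputs = ["x%d" % j for j in range(1, 14)]
selected, selected_bic = drop_inputs(toy_score, all_inputs, q=3)
```

In this toy example, the ten unimportant inputs are removed one at a time and the procedure stops at `["x1", "x2", "x3"]`, since removing any of these would raise the score.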

Both the hidden layer and covariate selection phases are backward elimination procedures. Rather than stopping the algorithm after these two phases, we have found it fruitful to search for an improved model in a neighbourhood of the current 'best' model by carrying out some further fine tuning. This is done by considering the addition or removal of one hidden node (Algorithm 2 with $Q = \{q - 1, q + 1\}$), then the further addition or removal of one input node (Algorithm 3 with both the 'drop inputs' and 'add inputs' steps and $n_{steps} = 1$), and these two steps are repeated alternately until no further adjustment decreases the BIC (see Step 4 in Algorithm 4). This fine-tuning stage is analogous to stepwise model selection with backward and forward steps. Note that one could apply this alternating stepwise procedure from the outset, but we have found it to be significantly more computationally efficient to focus first on the hidden and input layers (separately and in that order) before moving to the stepwise phase.
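The alternating fine-tuning phase can be sketched in the same style. Again, `score(inputs, q)` is a hypothetical callable returning the BIC of a fitted candidate, and `fine_tune` is an illustrative name. The sketch imposes no upper cap on q and never drops the last input; both choices are assumptions rather than details taken from Algorithm 4.

```python
def fine_tune(score, inputs, q, all_inputs):
    """Alternate one-node hidden and input moves until the BIC stops decreasing."""
    current = list(inputs)
    best = score(current, q)
    improved = True
    while improved:
        improved = False
        # Hidden-layer step: consider q - 1 and q + 1 hidden nodes.
        hidden_moves = [(score(current, c), c) for c in (q - 1, q + 1) if c >= 1]
        s, c = min(hidden_moves)
        if s < best:
            best, q, improved = s, c, True
        # Input-layer step: consider dropping one input or adding one back.
        moves = []
        if len(current) > 1:
            moves += [[v for v in current if v != x] for x in current]
        moves += [current + [x] for x in all_inputs if x not in current]
        scored = [(score(m, q), m) for m in moves]
        if scored:
            s, m = min(scored, key=lambda t: t[0])
            if s < best:
                best, current, improved = s, m, True
    return current, q, best


def toy_score(inputs, q):
    """Toy BIC with its minimum at inputs {x1, x2, x3} and q = 3."""
    missing = sum(1 for x in ("x1", "x2", "x3") if x not in inputs)
    return 50.0 * (q - 3) ** 2 + 10.0 * len(inputs) + 1000.0 * missing


pool = ["x1", "x2", "x3", "x4", "x5"]
final_inputs, final_q, final_bic = fine_tune(toy_score, ["x1", "x2", "x4"], 2, pool)
```

Starting from a slightly wrong model (x3 missing, x4 wrongly included, q = 2), the toy run first moves to q = 3, then adds x3, then drops x4, and stops once no single-node move lowers the score.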

The particular order of the model selection steps described above has been chosen in order to have a higher probability of recovering the 'true' model, and to have a lower computational cost (see Section 4.1 for a detailed simulation). Note that choosing the set of input nodes requires a more extensive search than choosing the number of hidden nodes: there are more candidate structures for the input layer, as any combination of the nodes is possible. Therefore, it is recommended to perform hidden-node selection first, to eliminate any redundant hidden nodes and decrease the number of parameters in the model, before performing input-node selection.
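To give a sense of scale, the following sketch counts the candidate models in an exhaustive search and compares this with an upper bound on the number of candidate models visited by the hidden-node phase followed by the input-node phase (before fine tuning). The counts assume that every model has at least one input and one hidden node; the function names are illustrative.

```python
def exhaustive_count(p_max, q_max):
    """All non-empty input subsets combined with every q in 1, ..., q_max."""
    return (2 ** p_max - 1) * q_max


def hidden_then_input_upper_bound(p_max, q_max):
    """Upper bound on candidate models for the H phase followed by the I phase.

    Hidden phase: one candidate per q in 1, ..., q_max.
    Input phase: a round with m inputs fits m candidates (one per dropped input);
    in the worst case the rounds run with m = p_max, p_max - 1, ..., 2.
    """
    hidden_phase = q_max
    input_phase = sum(range(2, p_max + 1))
    return hidden_phase + input_phase
```

With $p_{max} = 13$ and $q_{max} = 10$, as in the simulation studies below, the exhaustive search contains 81,910 candidate models, whereas the two phases visit at most 100, each of which is fitted from $n_{init}$ starting vectors.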

4 Simulation studies

In order to justify and evaluate the proposed model selection approach, three simulation studies are used:

Simulation 1 (Section 4.1): In our first simulation study, we investigate the effect of the ordering of the model selection steps to justify the procedure. This includes the effect of performing the input-node or the hidden-node selection phase first, the improvement gained by including a stepwise fine-tuning step, and the performance of a procedure that only carries out iterative stepwise steps (i.e., fine tuning from the outset).

Simulation 2 (Section 4.2): The second simulation study compares the performance of using the BIC as the model selection objective function versus using AIC or OOS.

Simulation 3 (Section 4.3): The third simulation study investigates the performance of the proposed model selection procedure in the case where the true data-generating process is not a neural network but rather a linear-type regression model (albeit with non-linear and interaction terms). Here, we compare the performance of our procedure against classical linear-regression stepwise selection.

In the first two simulation studies, the response is generated from an FNN with known 'true' architecture. The weights are generated so that there are three important inputs, $x_{1}, x_{2}, x_{3}$ , with non-zero weights, and ten unimportant inputs, $x_{4}, \dots, x_{13}$ , with zero weights. All input variables are independent and generated from a standard normal distribution and the error variance is 0.7 (but the results are similar when the inputs are correlated as shown in Supplementary Material Section B).

The 'true' hidden layer consists of $q = 3$ hidden nodes, while we set our procedure to consider a maximum of $q_{max} = 10$ hidden nodes. The weights of the neural network are held constant over all repetitions and are given by ${(ω_{01} = 1.40, ω_{11} = 4.35, ω_{21} = 3.22, ω_{31} = - 2.43, ω_{02} = - 2.89, ω_{12} = 4.28, ω_{22} = - 3.27, ω_{32} = - 2.30, ω_{03} = - 1.90, ω_{13} = 4.49, ω_{23} = 3.24, ω_{33} = 2.46, γ_{0} = 2.98, γ_{1} = 2.37, γ_{2} = 2.37, γ_{3} = 2.47)}^{T}$. The metrics calculated to evaluate the performance of the model selection approach are the true negative rate (TNR) for the input nodes (i.e., the proportion of input nodes with true zero weights that are correctly dropped from the model), the false discovery rate (FDR) for the input nodes (i.e., the proportion of input nodes included in the selected model that have true zero weights), the average number of hidden nodes selected $(\overline{q})$, the probability of choosing the correct set of inputs (PI), the probability of choosing the correct number of hidden nodes (PH), and the probability of choosing the overall true model (PT). (All probabilities refer to the proportion of correct results from the 1000 simulation replicates.) In all simulation studies, we vary the sample size $n \in \{250, 500, 1000\}$ and carry out 1000 replicates. Our proposed model selection approach is implemented in our publicly available R package selectnn (McInerney and Burke, 2024). The neural network function used is nnet, which is available from the R package of the same name (Ripley and Venables, 2022). (Note that we do not use a weight decay penalty when fitting the models, i.e., we set decay = 0 within the nnet function.)
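The input-layer metrics for a single replicate can be computed as in the sketch below; the reported values would then be averages over replicates. Here the FDR is taken in its usual sense, i.e., the share of selected inputs that are truly unimportant, which is the reading consistent with the values reported in the tables. The function name `selection_metrics` and the convention that the FDR is zero when nothing is selected are assumptions of the sketch.

```python
def selection_metrics(selected, important, all_inputs):
    """Input-layer selection metrics for one simulation replicate."""
    selected = set(selected)
    important = set(important)
    unimportant = set(all_inputs) - important
    false_pos = len(unimportant & selected)  # unimportant inputs that were kept
    true_neg = len(unimportant - selected)   # unimportant inputs that were dropped
    return {
        "TNR": true_neg / len(unimportant),
        "FDR": false_pos / len(selected) if selected else 0.0,
        "TPR": len(important & selected) / len(important),
        "correct_inputs": selected == important,
    }


all_inputs = ["x%d" % j for j in range(1, 14)]
important = ["x1", "x2", "x3"]
m = selection_metrics(["x1", "x2", "x3", "x4"], important, all_inputs)
```

In this example, one of the ten unimportant inputs is retained, so the TNR is 0.9, while the FDR is 0.25 because one of the four selected inputs is unimportant.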

4.1 Simulation 1: model selection approach

This simulation study aims to justify the approach of the proposed model selection procedure, i.e., a hidden-node phase, followed by an input-node phase, followed by a fine-tuning phase; here, we label this approach as H-I-F. Some other possibilities would be: to start with the input-node phase (I-H-F), to stop the procedure without fine tuning (either H-I or I-H), or to only carry out fine tuning from the beginning (F). Descriptions of the considered model selection approaches are as follows (the proposed approach is highlighted in bold; round brackets indicate the reordering of the steps in Algorithm 4 required to achieve the approach):

H-I: Hidden-node selection phase, followed by input-node selection phase (Step 1 → Step 2).

I-H: Input-node selection phase, followed by hidden-node selection phase (Step 2 → Step 1).

H-I-F: Hidden-node selection phase, followed by input-node selection phase, and then a fine-tuning phase (Step 1 → Step 2 → Step 3).

I-H-F: Input-node selection phase, followed by hidden-node selection phase, and then a fine-tuning phase (Step 2 → Step 1 → Step 3).

F: Fine-tuning phase only (Step 3).

The objective function used for model selection is BIC, and each approach has $n_{init} = 5$ initial vectors for the optimization procedure. (The choice of objective function and the effect of $n_{init}$ are investigated in Section 4.2 and Supplementary Material Section C, respectively.) The results of the simulation study are shown in Table 1. Boxplots for TNR for the inputs and q for all approaches are displayed in Figure 3 and Figure 4, respectively. The true-positive rate is not shown as it is one for all methods.

Table 1

Simulation 1: model selection metrics.

n	Method	Time (s)	Input layer: TNR	Input layer: FDR	Input layer: PI	Hidden layer: $\overline{q}$ (true value 3)	Hidden layer: PH	PT
250	H-I	13	0.78	0.23	0.59	2.29	0.18	0.10
250	I-H	50	0.25	0.70	0.01	2.85	0.44	0.01
250	H-I-F	14	0.87	0.15	0.72	2.66	0.54	0.43
250	I-H-F	53	0.46	0.61	0.03	2.87	0.50	0.03
250	F	116	0.77	0.29	0.47	8.58	0.13	0.12
500	H-I	32	0.90	0.10	0.83	3.47	0.53	0.50
500	I-H	100	0.64	0.36	0.42	3.14	0.87	0.40
500	H-I-F	36	0.96	0.05	0.90	3.05	0.95	0.85
500	I-H-F	103	0.72	0.32	0.46	3.08	0.92	0.43
500	F	82	0.97	0.04	0.89	3.17	0.90	0.82
1000	H-I	53	1.00	0.00	0.99	3.02	0.98	0.97
1000	I-H	186	0.87	0.14	0.78	3.00	1.00	0.77
1000	H-I-F	53	1.00	0.00	0.99	3.00	1.00	0.99
1000	I-H-F	189	0.88	0.14	0.77	3.01	0.99	0.76
1000	F	169	0.99	0.02	0.97	3.04	0.99	0.96

Time (s), median time to completion in seconds (carried out on an Intel® Core™ i5-10210U Processor). Best values for a given sample size are highlighted in bold.

Looking at the model selection metrics, it is clear that the proposed H-I-F approach performs well, both in terms of selecting the correct set of input nodes and selecting the correct number of hidden nodes. Furthermore, the TNR is high, the FDR is low, and, as expected, we see that performance improves across all metrics with increasing sample size. From the results in Table 1, and from Figures 3 and 4, it is clear that the H-I-F approach performs best at recovering the true model structure.

Figure 3

Simulation 1: boxplots for TNR (the true negative rate for the input variables) for each method by sample size.

Figure 4

Simulation 1: boxplots for $q$ (the number of hidden nodes selected) for each method by sample size. Median value highlighted in red. Dashed line indicates the true value of q.

Comparing the methods without the fine-tuning stage in the boxplots, and looking at layerwise selection, the probability of selecting the correct structure is increased when that layer is selected in the second phase, e.g., input-node selection is best when it comes second (see H-I versus I-H in Figure 3). This suggests a relationship between the structure of the input and hidden layers (the probability of correctly selecting the structure of one layer increases when the other layer is more correctly specified). This is investigated further in Supplementary Material Section D. Therefore, H-I is likely better than I-H due to input-node selection being a more difficult task than hidden-node selection (determining the optimal set of input nodes versus the optimal number of hidden nodes), and, hence, it is favourable to perform it after hidden-node selection (given the number of hidden nodes is not substantially larger than the number of input nodes). This relationship between the structure of both layers can be handled by incorporating a fine-tuning phase after both the H and I phases are completed. Recall that the aim of fine tuning is to search for an improved solution in a neighbourhood of the current solution, where both H and I steps are carried out alternately (and include both backward and forward selections). Indeed, we see that the addition of the fine-tuning phase improves on H-I in the smaller sample sizes (in large part due to improved hidden-layer selection), but its addition does not greatly improve on I-H. Moreover, a boxplot for the computational time for each approach is provided in Supplementary Material Section E, and the addition of fine tuning only marginally adds to the computational expense. Overall, H-I-F is significantly better than I-H-F both in terms of computational expense and model selection. One may also consider only carrying out fine-tuning steps from the outset, which we denote by F. However, this does not perform as well as H-I-F at the smallest sample size and is more computationally demanding. From the above, the H-I-F approach is what we suggest, as it leads to good model selection performance while adding only marginally to the computational time of H-I, the fastest approach considered.

4.2 Simulation 2: model selection objective function

This simulation study aims to determine the performance of using different objective functions when carrying out model selection. In particular, it aims to determine whether the use of an information criterion can improve the ability of the model selection procedure to recover the true model; this is compared to the far more common approach in neural networks of using out-of-sample performance. Three objective functions are investigated: BIC, AIC, and OOS. The AIC approach is the same as the proposed approach in Section 3.1, swapping BIC for AIC $= - 2 l (\hat{θ}) + 2 (K + 1)$. The OOS approach follows the same procedure, but with the objective function replaced by OOS, which is calculated on an additional validation dataset that is 20% the size of the training dataset, i.e., $OOS = \frac{1}{\tilde{n}} \sum_{i = 1}^{\tilde{n}} {({\tilde{y}}_{i} - NN ({\tilde{x}}_{i}))}^{2}$, where $\tilde{n}$ is the number of observations in the validation dataset with response variable ${\tilde{y}}_{i}$ and covariate vector ${\tilde{x}}_{i}$. As before, $n_{init} = 5$ random initializations are used. The results of the simulation study are shown in Table 2, and boxplots of TNR for the inputs and q for the different objective functions are given in Supplementary Material Section G.
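The three objective functions can be written side by side as in the sketch below, which follows the formulas above; the function names are illustrative. The only difference between AIC and BIC is the per-parameter penalty, 2 versus $\log (n)$, whereas OOS carries no explicit complexity penalty.

```python
import math


def aic(loglik, K):
    """AIC = -2 l(theta_hat) + 2 (K + 1)."""
    return -2.0 * loglik + 2.0 * (K + 1)


def bic(loglik, K, n):
    """BIC = -2 l(theta_hat) + log(n) (K + 1)."""
    return -2.0 * loglik + math.log(n) * (K + 1)


def oos_mse(y_val, y_pred):
    """Out-of-sample mean squared error on a validation set."""
    return sum((y - f) ** 2 for y, f in zip(y_val, y_pred)) / len(y_val)


# Extra penalty incurred by going from K = 16 to K = 21 parameters
# (one additional hidden node when p = 3), for a fixed log-likelihood.
extra_aic = aic(-100.0, 21) - aic(-100.0, 16)
extra_bic = bic(-100.0, 21, 250) - bic(-100.0, 16, 250)
```

At n = 250, the additional hidden node costs 10 units under AIC but $5 \log (250) \approx 27.6$ under BIC, so BIC requires a considerably larger gain in log-likelihood before accepting the extra node.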

Table 2

Simulation 2: model selection metrics.

n	Method	Input layer: TNR	Input layer: FDR	Input layer: PI	Hidden layer: $\overline{q}$ (true value 3)	Hidden layer: PH	K (true value 16)	OOS Test	PT
250	AIC	0.25	0.71	0.00	11.70	0.00	144	2.29	0.00
250	BIC	0.87	0.15	0.72	2.66	0.54	16	0.86	0.43
250	OOS	0.45	0.60	0.04	2.79	0.28	27	1.30	0.01
500	AIC	0.24	0.71	0.00	11.40	0.00	144	1.03	0.00
500	BIC	0.96	0.05	0.90	3.05	0.95	16	0.53	0.85
500	OOS	0.46	0.60	0.03	3.91	0.36	37	0.57	0.00
1000	AIC	0.27	0.70	0.00	11.40	0.00	141	0.76	0.00
1000	BIC	1.00	0.00	0.99	3.00	1.00	16	0.56	0.99
1000	OOS	0.53	0.57	0.02	3.72	0.46	34	0.57	0.00

Best values for a given sample size are highlighted in bold.

The results show that BIC far outperforms OOS and AIC in identifying the correct FNN architecture. Using OOS as the model selection objective function almost never leads to the correct neural network architecture being identified. This is due to the inability of the OOS to correctly identify and remove the unimportant covariates (TNR is always relatively low). Using AIC leads to even worse performance, and this is likely due to the weaker penalty on model complexity compared to BIC. It is also of interest to compare the approaches in terms of the size of the model selected and its out-of-sample performance. The median number of neural network parameters, K (note that the true value is K = 16), and the median OOS Test evaluated on a test set are reported. The OOS Test is computed on an entirely new dataset (20% the size of the training set) that the OOS-optimizing procedure was not exposed to. Interestingly, BIC-minimization leads to the lowest OOS values on the test data. This is particularly noteworthy since this is achieved using approximately half as many parameters as the OOS-minimization procedure. Boxplots highlighting the values of OOS Test and K are shown in Figures 5 and 6, respectively. Figure 5 also displays the OOS Test values for the true model (inputs $x_{1}, x_{2}, x_{3}$ and $q = 3$) and the full model (inputs $x_{1}, x_{2}, \dots, x_{13}$ and $q = 10$); this allows us to evaluate the performance of selection compared to the full model, and how close we can get to the true model. The models selected using the BIC procedure have similar performance to the true model, particularly as the sample size increases. In contrast, the models selected using AIC have worse out-of-sample performance and significantly more parameters, and the performance is similar to fitting the full model.

Figure 5

Simulation 2: boxplots for OOS Test for the models selected by each objective function; for comparison, the results for the true model (with inputs $x_{1}, x_{2}, x_{3}$ and $q = 3$) and the full model (with inputs $x_{1}, x_{2}, \dots, x_{13}$ and $q = 10$) are also shown.

Figure 6

Simulation 2: boxplots for $K$ (number of parameters) for the models selected by each objective function.

We have also compared our proposed BIC-based selection procedure to two commonly used strategies for dealing with overfitting, namely, weight decay and early stopping. The results are deferred to Supplementary Material Section H, where we have found that our proposed approach yields improved OOS Test values compared to these other two strategies.

4.3 Simulation 3: data-generating process is not a neural network

For this simulation study, we investigate the performance of the proposed H-I-F model selection procedure on a dataset simulated from a data-generating process that is not a neural network:

\begin{matrix} y = x_{1} - 0.75 x_{2}^{2} + 0.9 x_{3} x_{4} x_{5} + ε, \end{matrix}

(4.1)

where $x_{1}, x_{2}, \dots, x_{10} \sim N (0, 1)$, i.e., there are five relevant and five irrelevant covariates, and $σ^{2} = Var (ε) = 0.3$. For comparison, we have also performed stepwise model selection for a linear model using BIC. We applied this using the stepAIC function from the MASS R package with k = log (n) (Venables and Ripley, 2002). To compare with the H-I-F procedure, we performed stepwise selection on a linear model with a search space containing (i) all terms up to three-way interactions (step-lm-3), (ii) all terms up to two-way interactions (step-lm-2), and (iii) only main effects (step-lm-1). Note that the first model is correctly specified, and the latter two are misspecified. For these linear models, we began the search with all possible terms in the model, and allowed the stepwise search to consider both the elimination of an included variable and the addition of an excluded variable at each step (i.e., direction = "both"). For the purpose of this study, when computing performance metrics (displayed in Table 3), we only considered whether or not relevant variables $(x_{1}, \dots, x_{5})$ and irrelevant variables $(x_{6}, \dots, x_{10})$ are selected. While the exact functional form of each selected variable is not considered, the OOS metrics facilitate model comparisons in the sense that lower OOS values imply a better approximation to the generating model (i.e., the functional form of input variables). In Table 3, as with earlier tables, the TNR, FDR, and PT selection metrics are shown, but, here, the TPR (true positive rate) metric is also shown. Moreover, we also show the median number of parameters (K), the median OOS Test evaluated on a test set, and the median computational time (Time) for each approach.
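For readers who wish to reproduce the setting, data from Equation 4.1 can be simulated as in the sketch below. It assumes normally distributed errors (the text gives only the error variance) and independent standard normal covariates; the function name `simulate_eq41` and its arguments are illustrative and do not come from the article.

```python
import random


def simulate_eq41(n, seed=0, sigma2=0.3, p=10):
    """Simulate n observations from y = x1 - 0.75 x2^2 + 0.9 x3 x4 x5 + eps.

    Covariates x1, ..., x10 are independent N(0, 1); x6, ..., x10 are irrelevant.
    The errors are assumed to be N(0, sigma2).
    """
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        eps = rng.gauss(0.0, sd)
        y.append(x[0] - 0.75 * x[1] ** 2 + 0.9 * x[2] * x[3] * x[4] + eps)
        X.append(x)
    return X, y


X, y = simulate_eq41(250, seed=1)
```

Since $E (x_{2}^{2}) = 1$ and the remaining terms have mean zero, the response has expectation $-0.75$, which provides a quick sanity check on simulated data.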

From Table 3, we see that the proposed H-I-F procedure has a high TNR and a low FDR, and that its true positive rate increases with the sample size; consequently, the probability of selecting the true set of covariates (PT) increases with the sample size. At the largest sample size, the out-of-sample performance is very close to that of the correctly specified third-order linear model (step-lm-3). Although this step-lm-3 model provides the lowest out-of-sample error, its TNR and FDR are relatively poor compared to the neural network, and, hence, the probability of selecting the true set of covariates does not approach one for the sample sizes we have considered. The selected step-lm-3 model does have fewer parameters on average than the neural network model (at n = 500 and n = 1000), but the step-lm-3 search is far more computationally intensive; this is due to the large number of possible interaction terms up to order three. It is important to note that the stepwise approaches for the linear models require the search space of models to be explicitly specified through the interaction and polynomial terms, and the performance of the misspecified (step-lm-2 and step-lm-1) approaches is very poor. In contrast, the proposed H-I-F selection approach does not require these terms to be explicitly specified, but still achieves very good out-of-sample performance since complex functional relationships and interactions are captured in a more automatic manner within the neural network structure.

Table 3

Simulation 3: Comparison of proposed model selection approach for neural networks with stepwise model selection for linear models for the data-generating process given by Equation 4.1.

Method	n	TPR	TNR	FDR	PT	K	OOS Test	Time (s)
H-I-F	250	0.53	0.90	0.13	0.02	11	2.12	12
H-I-F	500	0.78	0.95	0.04	0.50	43	0.52	22
H-I-F	1000	1.00	0.98	0.02	0.93	57	0.30	46
step-lm-3	250	1.00	0.13	0.45	0.04	61	0.73	53
step-lm-3	500	1.00	0.46	0.32	0.12	20	0.36	105
step-lm-3	1000	1.00	0.63	0.24	0.23	16	0.28	215
step-lm-2	250	0.85	0.35	0.43	0.00	15	1.95	2
step-lm-2	500	0.83	0.39	0.42	0.00	13	1.31	3
step-lm-2	1000	0.93	0.80	0.16	0.00	10	1.02	5
step-lm-1	250	0.20	1.00	0.00	0.00	2	3.55	0
step-lm-1	500	0.20	1.00	0.00	0.00	2	2.46	0
step-lm-1	1000	0.40	1.00	0.00	0.00	3	1.98	0

Time (s), median time to completion in seconds (carried out on an Intel® Core™ i5-10210U Processor).

5 Application to data

Airbnb is an online marketplace that provides both short-term and long-term rentals. Data relating to the rental listings can be obtained from Inside Airbnb (http://insideairbnb.com). Here, we focus on rental listings in the Dún Laoghaire-Rathdown area of Dublin on 7 September 2023, and aim to implement our proposed model selection approach and determine factors that may be associated with a listing's price. The data consist of information relating to 625 rental listings, and the following explanatory variables: the number of people the rental accommodates (accommodates), the rental's review rating (rating), the number of reviews per month (num_reviews), an indicator of whether the rental is an entire home or a private room (room_type; 0 for an entire home, 1 for a private room), an indicator of whether or not the host is a 'superhost' (superhost; i.e., a top-performing Airbnb host, where performance is based on reviews, responsiveness and cancellation rate), the total number of Airbnb listings that the host has (num_listings), an indicator of whether or not the listing is instantly bookable (instant), and the latitude (latitude) and longitude (longitude) of the rental. The response variable is the natural logarithm of the price per night of each rental (lnprice). The data are available in our R package selectnn (McInerney and Burke, 2024).

The dataset has been randomly split into a training set and a test set with an 80%-20% split, respectively, and all continuous variables have been standardized (based on the training data) to have zero mean and unit variance. The model selection procedure was implemented with $n_{init} = 10$ and $q_{max} = 10$ . For comparison purposes, the model found by our proposed model selection procedure is compared to an FNN fitted with all inputs and the maximum number of hidden nodes considered. For both models (selected and full), we report the number of input nodes (p), the number of hidden nodes (q), the total number of parameters (K), the BIC, and the OOS computed using the test set. For the covariates that are selected, we also report: (i) relative covariate importance via the change in BIC ( $\Delta BIC$ ) upon removal of that covariate, and (ii) a simple covariate effect ( $\hat{τ}$ ) as measured by the change in the average predicted response going from lower to higher covariate values (below/above median for numeric covariates and 0/1 for binary covariates). See Supplementary Material Section I for more detail on these measures.
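The pre-processing described above can be sketched as follows. This is only an illustration in Python using the standard library: the helper names `train_test_split` and `standardise` are hypothetical, the seed is arbitrary, and the use of the sample standard deviation (n - 1 denominator) is an assumption. The key point is that the test set is scaled with the mean and standard deviation of the training set, so that no test-set information leaks into the fitted model.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.2, seed=1):
    """Randomly split rows into (train, test) using the given test fraction."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = round(test_fraction * len(rows))
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test


def standardise(train_col, test_col):
    """Scale both columns using the training mean and standard deviation only."""
    mu = statistics.mean(train_col)
    sd = statistics.stdev(train_col)  # sample sd (n - 1 denominator), an assumption
    scaled_train = [(v - mu) / sd for v in train_col]
    scaled_test = [(v - mu) / sd for v in test_col]
    return scaled_train, scaled_test
```

With 625 listings, an 80%-20% split of this kind yields 500 training and 125 test observations.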

Our proposed procedure selects two hidden nodes and includes three covariates: accommodates, num_reviews and room_type. As shown in Table 4, the selected model has 100 fewer parameters than the full model, while also having a much lower BIC value and a lower OOS. The BIC differences and covariate effects (and their associated bootstrapped confidence intervals) for the variables that remain in the model are reported in Table 5. Using $\Delta BIC$ as a measure of variable importance, we find that accommodates is the most important variable with ${ΔBIC}_{accommodates} = 200.16$ . Based on its effect ( ${\hat{τ}}_{accommodates} = 1.38$ ), the more people the listing accommodates, the higher the price per night. The binary variable room_type has a negative effect, which suggests that the listing price of a private room is lower than that of an entire home, on average. The other covariate, num_reviews, is more important than room_type as judged by its $\Delta BIC$ value, but the confidence interval for its covariate effect includes zero. This suggests that num_reviews has a non-linear effect that cannot be seen in an overall average change in the predicted response.

Table 4

Dublin Airbnb: selected versus full model comparison.

	p	q	K	BIC	OOS
Selected	3	2	11	884.2	0.25
Full	9	10	111	1136.3	0.48
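The K values in Table 4 are consistent with counting the weights of an FNN with a single hidden layer, one bias per hidden node, and a single output node with its own bias, i.e., $K = (p + 1) q + (q + 1)$ . The short Python check below illustrates this; the function name `n_params` is hypothetical, and the assumption that the error variance is not included in K is inferred from the table rather than stated in this section.

```python
def n_params(p, q):
    """Weights in a single-hidden-layer FNN with p inputs, q hidden nodes, one output."""
    hidden = (p + 1) * q  # p input weights plus one bias for each hidden node
    output = q + 1        # q hidden-to-output weights plus one output bias
    return hidden + output


print(n_params(3, 2))   # selected model: 11
print(n_params(9, 10))  # full model: 111
```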

Table 5

Dublin Airbnb: covariate effects and BIC differences.

	$\hat{τ}$ (95% CI)	$\Delta BIC$
accommodates	1.38 (1.30, 1.47)	200.16
num_reviews	0.07 (-0.07, 0.21)	89.56
room_type	-1.44 (-1.51, -1.37)	64.02

The selected model dropped six covariates (from a set of nine possible covariates). While the underlying selection procedure cannot guarantee that this model minimizes the BIC, and an exhaustive search through all sub-models is computationally expensive, we have nevertheless carried out such a search for the purpose of comparison. To this end, we fitted the model with all covariates included, all nine models that arise by dropping each covariate in turn, all 36 models that arise when pairs of covariates are dropped, all 84 models that arise when triples of covariates are dropped, and so on, for each hidden layer size, $q = 1, \dots, 10$ . Each model was allowed $n_{init} = 10$ random initializations, mirroring our selection procedure.
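The size of this search space can be illustrated with a short Python sketch that enumerates the candidate input-layer and hidden-layer combinations. It assumes that at least one covariate is always retained (the text does not say whether the empty input set was fitted), and the generator name `candidate_models` is hypothetical.

```python
from itertools import combinations
from math import comb

COVARIATES = ["accommodates", "rating", "num_reviews", "room_type", "superhost",
              "num_listings", "instant", "latitude", "longitude"]
HIDDEN_SIZES = range(1, 11)  # q = 1, ..., 10


def candidate_models(covariates=COVARIATES, hidden_sizes=HIDDEN_SIZES):
    """Yield (kept_covariates, q) for every non-empty covariate subset and every q."""
    for n_dropped in range(len(covariates)):  # drop 0, 1, ..., 8 covariates
        for dropped in combinations(covariates, n_dropped):
            kept = tuple(c for c in covariates if c not in dropped)
            for q in hidden_sizes:
                yield kept, q


# Number of ways to drop one, two or three of the nine covariates.
print(comb(9, 1), comb(9, 2), comb(9, 3))  # 9 36 84
# Total candidate models under the stated assumption.
print(sum(1 for _ in candidate_models()))  # 5110
```

Under this assumption there are 511 covariate subsets for each of the 10 hidden-layer sizes, and each of the resulting candidate models is then fitted from $n_{init} = 10$ random starting points, which illustrates why the exhaustive search is expensive relative to a stepwise procedure.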

Figure 7 shows the BIC for each model, where each point corresponds to a different combination of input layer and hidden layer; for comparison, the model selected by our procedure is indicated using a box. First, note that there is a subset of models with relatively large BIC values. Each of these models is missing the variables accommodates and room_type, which further highlights their importance in the model. It is clear that the proposed selection procedure has indeed found a model with a BIC value that is among the lowest of the alternative models we have considered. That being said, the exhaustive search did return two models with lower BIC values ( $ΔBIC = - 4.4$ and $ΔBIC = - 2.7$ ). These models are more complex than the model selected by our procedure (with q = 2 and p = 3) as they have q = 2 and q = 2 hidden nodes with p = 4 and p = 5 input nodes, respectively. We note that the out-of-sample predictive performance is very similar across all three of these models.

Figure 7

Dublin Airbnb: BIC of models for different input-layer and hidden-layer combinations. Points are coloured according to the input layer size. The model selected by our procedure is enclosed in a box and the horizontal dashed line indicates the BIC for this model.

6 Discussion

FNNs have become very popular in recent years and have the potential to capture more complex covariate effects than traditional statistical models. However, model selection procedures are of the utmost importance in the context of FNNs since their flexibility may increase the chance of over-fitting; indeed, the principle of parsimony is very common throughout statistical modelling more generally. Therefore, we have proposed a statistically motivated neural network selection procedure by assuming an underlying (normal) error distribution, which then permits BIC minimization. More specifically, our procedure involves a hidden-node selection phase, followed by an input-node (covariate) selection phase, followed by a final fine-tuning phase. We have made this procedure available in our selectnn package in R (McInerney and Burke, 2024).
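As an illustration of how a normal error assumption leads to a BIC, the Python sketch below evaluates the Gaussian log-likelihood at the maximum likelihood variance estimate $\hat{σ}^{2} = RSS / n$ and adds the usual $\log (n)$ penalty per parameter. It is a generic textbook form rather than the selectnn implementation: the function name `gaussian_bic` is hypothetical, and whether the error variance should be counted among the k penalized parameters is left to the caller as an assumption.

```python
import math


def gaussian_bic(rss, n, k):
    """BIC for a regression model with normal errors.

    rss: residual sum of squares of the fitted model
    n:   number of observations
    k:   number of penalized parameters (counting the error variance or not
         is an assumption left to the caller)
    """
    sigma2_hat = rss / n  # maximum likelihood estimate of the error variance
    # Log-likelihood at the MLE: -n/2 * (log(2 * pi * sigma2_hat) + 1)
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1)
    return -2 * loglik + k * math.log(n)
```

Under this form, removing a node is favoured whenever the resulting increase in $n \log (RSS / n)$ is smaller than the $\log (n)$ penalty saved on the parameters removed, which is the trade-off that a stepwise BIC search evaluates at each step.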

Through extensive simulation studies, we have found that (i) the order of selection (input versus hidden layer) is important, with respect to both the probability of recovering the true model and the computational efficiency, (ii) the addition of a fine-tuning stage provides a non-negligible improvement while not significantly increasing the computational burden, (iii) using the BIC is necessary to asymptotically converge to the true model, and (iv) although the models selected using BIC have fewer parameters than those selected using out-of-sample performance, they have comparable, and sometimes improved, predictive performance. We suggest that statistically orientated model selection approaches are necessary in the application of neural networks - just as they are in the application of more traditional statistical models - and we have demonstrated the favourable performance of our proposal.

In its current form, a limitation of the proposed procedure is that, due to its stepwise nature, it would be more computationally intensive when dealing with larger models and datasets. We expect that randomization and/or divide-and-conquer throughout the selection phases would be required in more complex problems involving many covariates and/or hidden layers, and adaptations may also be required for stochastic optimization procedures used on much larger datasets. Nevertheless, neural networks are still valuable in more traditional (smaller) statistical problems for which procedures such as ours will lead to more insightful outputs. Furthermore, the implementation of statistical approaches more broadly (such as uncertainty quantification and hypothesis testing) in neural network modelling will be crucial for the enhancement of these insights. This will be the direction of our future work.

Supplementary materials

Supplemental

Supplemental Material for A statistical modelling approach to feedforward neural network model selection by Andrew McInerney and Kevin Burke, in Statistical Modelling

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6049. The second author was supported by the Confirm Smart Manufacturing Centre (https://confirm.ie/) funded by Science Foundation Ireland (Grant Number: 16/RC/3918). For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

References

Abiodun, Jantan, Omolara, Dada, Mohamed and Arshad (2018) State-of-the-art in artificial neural network applications: A survey. Heliyon, 4, e00938. doi: 10.1016/j.heliyon.2018.e00938.

Akaike (1998) Information theory and an extension of the maximum likelihood principle. In Parzen, Tanabe and Kitagawa, eds. Selected papers of Hirotugu Akaike, pages 199–213. New York, NY: Springer. doi: 10.1007/978-1-4612-1694-0_15.

Anderson and Burnham (2004) Model selection and multi-model inference. 2nd edn. New York, NY: Springer-Verlag.

Bishop C (1995) Neural networks for pattern recognition. Oxford: Oxford University Press.

Breiman (2001) Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16, 199–231. doi: 10.1214/ss/1009213726.

Chandrashekar and Sahin (2014) A survey on feature selection methods. Computers & Electrical Engineering, 40, 16–28. doi: 10.1016/j.compeleceng.2013.11.024.

Cheng and Titterington DM (1994) Neural networks: A review from a statistical perspective. Statistical Science, 9, 2–30.

Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303–314. doi: 10.1007/BF02551274.

Efron (2020) Prediction, estimation, and attribution. International Statistical Review, 88, S28–S59. doi: 10.1111/insr.12409.

Fan and Lv (2010) A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20, 101–148.

Fisher and Russell EJ (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222, 309–368. doi: 10.1098/rsta.1922.0009.

Goldberg (2016) A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57, 345–420. doi: 10.1613/jair.4992.

Heinze, Wallisch and Dunkler (2018) Variable selection - a review and recommendations for the practicing statistician. Biometrical Journal, 60, 431–449. doi: 10.1002/bimj.201700067.

Hooker and Mentch (2021) Bridging Breiman's brook: From algorithmic modeling to statistical learning. Observational Studies, 7, 107–125. doi: 10.1353/obs.2021.0027.

Hornik, Stinchcombe and White (1989) Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366. doi: 10.1016/0893-6080(89)90020-8.

Koenen and Wright (2024) Interpreting deep neural networks with the package innsight. arXiv preprint arXiv:2306.10822.

LeCun, Bengio and Hinton (2015) Deep learning. Nature, 521, 436–444. doi: 10.1038/nature14539.

McInerney and Burke (2024) selectnn: A Statistically-Based Approach to Neural Network Model Selection. Available at: https://github.com/andrew-mcinerney/selectnn. R package version 0.0.0.9000.

Miller (2002) Subset selection in regression. New York: Chapman and Hall/CRC. doi: 10.1201/9781420035933.

Murata, Yoshizawa and Amari (1994) Network information criterion - determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5, 865–872. doi: 10.1109/72.329683.

Pang, Shen, Cao and Hengel AVD (2021) Deep learning for anomaly detection: A review. ACM Computing Surveys (CSUR), 54, 1–38. doi: 10.1145/3439950.

Pontes, Amorim, Balestrassi, Paiva and Ferreira (2016) Design of experiments and focused grid search for neural network parameter optimization. Neurocomputing, 186, 22–34. doi: 10.1016/j.neucom.2015.12.061.

Ripley BD (1994) Neural networks and related methods for classification. Journal of the Royal Statistical Society: Series B (Methodological), 56, 409–437. doi: 10.1111/j.2517-6161.1994.tb01990.x.

Ripley and Venables (2022) nnet: Feedforward neural networks and multinomial log-linear models. Available at: https://CRAN.R-project.org/package=nnet. R package version 7.3-17.

Ripley (1993) Statistical aspects of neural networks. In Barndorff-Nielsen OE, Jensen and Kendall, eds. Networks and chaos: Statistical and probabilistic aspects, pages 40–123. London: Chapman & Hall.

Rügamer, Kolb and Klein (2024) Semi-structured distributional regression. The American Statistician, 78, 88–99. doi: 10.1080/00031305.2022.2164054.

Schwarz (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Sun, Song and Liang (2022) Learning sparse deep neural networks with a spike-and-slab prior. Statistics & Probability Letters, 180, 109246. doi: 10.1016/j.spl.2021.109246.

Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.

Tran M-N, Nguyen, Nott and Kohn (2020) Bayesian deep net GLM and GLMM. Journal of Computational and Graphical Statistics, 29, 97–113. doi: 10.1080/10618600.2019.1637747.

Venables and Ripley (2002) Modern applied statistics with S. 4th edn. New York: Springer.

Voulodimos, Doulamis, Doulamis and Protopapadakis (2018) Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018. doi: 10.1155/2018/7068349.

White (1989) Learning in artificial neural networks: a statistical perspective. Neural Computation, 1, 425–464. doi: 10.1162/neco.1989.1.4.425.

(1998) On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association, 93, 120–131.

