Abstract
Railroad ballast is typically composed of only large granular particles. However, the degradation of fresh ballast and the arrival of foreign fines result in ballast fouling. Compared with fresh ballast, fouled ballast exhibits reduced resilience and compromised drainage capabilities. To optimize track performance, maintenance activities for the ballast are frequently scheduled based on the fouling severity. An accurate assessment of ballast fouling conditions can enhance maintenance efficiency and reduce costs. Although many ballast fouling evaluation methods have been developed over the years, their widespread adoption has been hindered by system costs and implementation challenges. This study aims to address this gap by developing an affordable and easily implemented approach to estimating ballast fouling conditions using the Gaussian Mixture Model (GMM). Initially, images of fouled ballast are characterized by fitting the distributions of each RGB (Red, Green, Blue) channel. Subsequently, two mathematical methods, expectation-maximization and point estimation, are employed to solve for the GMM parameters. These derived GMM parameters are then used to backcalculate the sample parameters, facilitating the estimation of ballast fouling conditions. The results of this study reveal a close alignment between the ballast fouling conditions backcalculated with the GMM and those quantified through laboratory sieving analysis. This study thus presents a promising path forward, using images captured from cost-effective cameras to estimate ballast fouling conditions with minimal computational expense.
The ballasted track is the dominant track structure for freight railroads. It relies on coarse-graded granular material (
Despite ballast being engineered to resist wear and tear, train loads can lead to the breakdown of angular ballast particles (
Traditional methods for detecting ballast fouling conditions include sieve analysis for determining FI and Non-Destructive Testing (NDT) methods such as surface wave analysis, SmartRock trajectory, and Ground Penetration Radar (
The GMM, the weighted summation of several normal distributions, offers more flexibility in fitting unknown distributions (
This study uses GMM to characterize the color intensity of fouled ballast images. The linear relationship between sample variance and FI (
Fouled Ballast Image
Earlier research collected and analyzed images of fouled ballast samples specifically designed to represent varying FI values, as depicted in Figure 1 (

Gradation curves for fouled ballast.
Images of the fouled ballast are captured through transparent acrylic walls under controlled illumination conditions. Each FI value is associated with six images, each encompassing three color channels: Red, Green, and Blue. Each color channel possesses a relative frequency distribution, denoted as
Input Relative Frequency Distributions in This Study
Figure 2 depicts the RGB color density distributions for a fouled ballast sample with an FI value of 18. The relative frequency plot is clearly asymmetrical for each channel, suggesting that a single normal distribution would not fit these distributions well. Because a small value in a color channel corresponds to a dark color, the difference between the voids and the ballast should be reflected in the relative frequency plot. Similarly, the fouled ballast should be separable into large ballast particles and fines. Through a trial-and-error process, each distribution can be manually segmented into three ranges: the left tail, the left elbow, and the remaining portion. The three ranges for each distribution are shown in the respective channel color in Figure 2. Regardless of the color channel, each range highlights essentially the same region of the image. Range 1 corresponds to the voids within the fouled ballast, Range 2 represents the edges of large particles or of the clustered fines, and Range 3 represents the surfaces of large ballast particles and clustered fines.

An example of fouled ballast (FI 18) image decomposition.
Given that fouled ballast consists of ballast particles, fines, and voids, each color channel illustrated in Figure 2 reflects this composition in its intensity distribution. It is therefore reasonable to assume that a fouled ballast color intensity distribution comprises three distinct components. The relative frequency distributions in Figure 2 are asymmetric and cannot be fitted well by a single normal distribution, which has only two fitting parameters. Instead of assuming that the color intensity of a whole channel follows a normal distribution, each component of the fouled ballast image is assumed to follow its own normal distribution.
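The per-channel relative frequency distributions discussed above can be computed directly from pixel values. The following is a minimal sketch, assuming 8-bit intensities and an image already loaded as a sequence of (R, G, B) triplets; the function name `channel_relative_frequencies` and the toy pixel values are hypothetical and are not taken from this study.

```python
from collections import Counter


def channel_relative_frequencies(pixels, levels=256):
    """Return one relative frequency list per RGB channel.

    `pixels` is an iterable of (r, g, b) integer triplets in [0, levels - 1].
    Entry v of each returned list is the fraction of pixels with intensity v.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("at least one pixel is required")
    distributions = []
    for channel in range(3):
        counts = Counter(p[channel] for p in pixels)
        distributions.append(
            [counts.get(v, 0) / len(pixels) for v in range(levels)]
        )
    return distributions


# Toy "image" of four pixels (assumed values): dark pixels stand in for
# voids, bright pixels for particle surfaces.
toy_pixels = [(10, 12, 8), (10, 40, 8), (200, 180, 150), (200, 180, 150)]
red, green, blue = channel_relative_frequencies(toy_pixels)
```

Each returned list sums to 1, so it can be compared directly with a fitted probability density evaluated at the integer intensity levels.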
Gaussian Mixture Model
A GMM can be described as a multivariate distribution synthesized by a finite number of Gaussian or normal distributions. Each Gaussian distribution is treated as an independent normal distribution, complete with its own mean and covariance. These normal distributions are combined proportionally, each representing their respective fractions of the GMM population (
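As a rough illustration of such a proportional combination, the sketch below evaluates a one-dimensional, three-component mixture density, which is the case relevant to a single color channel. The function names and all numerical parameter values are assumptions made for illustration and are not results of this study.

```python
import math


def normal_pdf(x, mean, variance):
    """Density of a univariate normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def gmm_pdf(x, portions, means, variances):
    """Weighted sum of normal densities; the portions must sum to 1."""
    if abs(sum(portions) - 1.0) > 1e-9:
        raise ValueError("mixing portions must sum to 1")
    return sum(
        p * normal_pdf(x, m, v) for p, m, v in zip(portions, means, variances)
    )


# Hypothetical parameters for one channel on a 0-255 intensity scale.
portions = [0.2, 0.3, 0.5]
means = [30.0, 90.0, 150.0]
variances = [100.0, 400.0, 900.0]
density_at_100 = gmm_pdf(100.0, portions, means, variances)
```

Because the third portion is fixed once the first two are chosen, a three-component mixture of this kind has 3 + 3 + 2 = 8 free parameters per channel.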
The following equation describes the GMM in this study, which is a probability density function:
where the vector
Because the summation of mixing portions is 1, there are only two independent mixing portion parameters, making the number of elements in the parameter vector
where the likelihood function for
The maximum likelihood estimator,
and
and
A 3-D vector of zero-one indicator variables,
Associated with the labeling vector
The complete-data log-likelihood for
Expectation-Maximization
The expectation-maximization (EM) algorithm involves two steps, E- (for expectation) and M- (for maximization), to iteratively solve for
E-Step
The initial guess of
According to the property of
Noticing that is the linear relationship between
On the (
as the current conditional expectation
M-Step
To obtain the updated estimator
After substituting the expression of the normal distribution, Equations (5) and (6) can be solved explicitly at the (
The EM iterations stop as the difference of the incomplete likelihood function
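The E- and M-steps described above can be sketched for a univariate three-component mixture as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this study: it operates on raw samples instead of a histogram, uses the standard closed-form M-step updates for normal components, and stops when the change in the incomplete-data log-likelihood falls below a hypothetical tolerance `tol`. The synthetic data and the initial guess are invented for the example.

```python
import math
import random


def normal_pdf(x, mean, variance):
    """Density of a univariate normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def log_likelihood(data, portions, means, variances):
    """Incomplete-data log-likelihood of the mixture."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v)
                     for p, m, v in zip(portions, means, variances)))
        for x in data
    )


def em_fit(data, portions, means, variances, tol=1e-6, max_iter=500):
    """Iterate E- and M-steps until the log-likelihood change is below tol."""
    portions, means, variances = list(portions), list(means), list(variances)
    previous = log_likelihood(data, portions, means, variances)
    for _ in range(max_iter):
        # E-step: posterior probability that each sample belongs to each component.
        resp = []
        for x in data:
            weighted = [p * normal_pdf(x, m, v)
                        for p, m, v in zip(portions, means, variances)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: closed-form updates of portions, means, and variances.
        for i in range(len(portions)):
            weight = sum(r[i] for r in resp)
            portions[i] = weight / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / weight
            variances[i] = sum(
                r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)
            ) / weight
        current = log_likelihood(data, portions, means, variances)
        if abs(current - previous) < tol:
            break
        previous = current
    return portions, means, variances


# Synthetic samples from three well-separated normal components (assumed values).
random.seed(0)
data = ([random.gauss(30.0, 8.0) for _ in range(200)]
        + [random.gauss(100.0, 10.0) for _ in range(300)]
        + [random.gauss(180.0, 12.0) for _ in range(500)])
portions, means, variances = em_fit(
    data, [1 / 3] * 3, [20.0, 110.0, 170.0], [200.0] * 3
)
```

The sketch has no safeguard against a component collapsing onto a single sample or against numerical underflow far from every component, so a reasonable initial guess remains important.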
Initial Guess of Gaussian Mixture Model Parameters
Point Estimation
Another way to determine the parameter vector
and
As discussed earlier, the parameter vector
Obviously, the unknown parameters outnumber the equations. Noticing the ranges and their corresponding patterns in Figure 2, the first assumption is that:
The portion vector,
Because all six pictures under one FI are the realizations of this FI, their GMM parameters are all estimations of the GMM parameters of this FI. The second assumption is that:
The parameter vector,
Before making more assumptions about the FI value, the following equations hold for one FI, subscript
Substituting Equations (20) into Equations (19) yields the following equations for one FI, subscript
Equations (21) comprise 36 equations in 20 unknowns, so the system is overdetermined and can be solved by fitting.
Each equation in Equations (21) should have a residual
This non-linear fitting problem is solved by
Results and Discussion
EM Fitting
For each color channel of the color triplet (R, G, B), the relative frequency is plotted against its corresponding color triplet component in Figure 3, and all the subplots share common horizontal and vertical axis labels. Figure 3a demonstrates the GMM fitting of the relative frequency distribution of the Red channel. The green blocks constitute the relative frequency histogram of the Red channel, reflecting the statistical analysis of the fouled ballast image. The prominent curve represents the GMM distribution, while the three thinner curves correspond to the three GMM components. The three dashed vertical lines indicate the mean value for each component. Although the mixing portions are not explicitly presented, the area under each thin curve corresponds to its mixing portion. Figure 3b and c display the fitting results for the Green and Blue channels, respectively.

Expectation-maximization (EM) fitting results of all color intensity distributions in Table 1: color channels of the color triplet red (R), green (G), and blue (B): (
As the FI increases, the first component, which represents voids, occupies a smaller proportion in every channel. The highest peak, composed of the other two components, rises as more areas of clustered fines appear in the fouled ballast. The fouled ballast histogram transitions from a two-peak curve to a one-peak curve, signifying a decrease in population variance with the rising peak height.
With a few exceptions, such as Figure 3b (FI 18-VI) and Figure 3c (FI 0-V), the sequence of the three components remains unchanged. There are more anomalous fittings in Figure 3c than in Figure 3a or b, mainly because the mean value of the Blue intensity is lower than that of the other two channels, making it more challenging to separate the three components.
Examining the position of these vertical lines under a single FI, slight variations in precise positions across different subplots can be observed, as each fouled ballast image represents a realization of its FI. These minor discrepancies indicate that the second assumption holds for both the portions and the mean values.
Averaged EM Parameters and Point-Estimated Parameters
Although the estimated parameters in Figure 3 do not precisely follow the two assumptions, averaging these EM fitting parameters yields representative values for one FI, subscript
Therefore, these averaged EM-estimated parameters can be compared with those from point estimation.
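A minimal sketch of such averaging over the images of one FI is given below. Sorting the components of each fit by their mean before averaging is an assumption introduced here to keep component labels aligned across images; it is not necessarily how the averages in this study were computed. The function name and the two example fits are hypothetical.

```python
def average_parameters(fits):
    """Average GMM parameters over several fits of the same FI.

    `fits` is a list of (portions, means, variances) tuples, one per image.
    Components are reordered by ascending mean so that index 0 always
    refers to the darkest component.
    """
    ordered = []
    for portions, means, variances in fits:
        order = sorted(range(len(means)), key=lambda i: means[i])
        ordered.append((
            [portions[i] for i in order],
            [means[i] for i in order],
            [variances[i] for i in order],
        ))
    n = len(ordered)
    k = len(ordered[0][1])
    # One averaged list each for portions, means, and variances.
    return tuple(
        [sum(fit[j][i] for fit in ordered) / n for i in range(k)]
        for j in range(3)
    )


# Two hypothetical fits; the second lists its components in a different order.
fits = [
    ([0.2, 0.3, 0.5], [30.0, 90.0, 150.0], [100.0, 400.0, 900.0]),
    ([0.5, 0.2, 0.3], [160.0, 40.0, 100.0], [800.0, 120.0, 380.0]),
]
avg_portions, avg_means, avg_variances = average_parameters(fits)
```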
Figure 4 shows the mean values of GMM components in relation to the FI, with the two methods producing distinct trends. The average
The mean vector,

Component mean value estimation: (
Figure 5 illustrates the variance values of the GMM components in relation to the FI. The average variances determined through the EM method exhibit a gradual decrease in relation to the FI, whereas those computed through point estimation demonstrate more variability. Given that the component variance
2. The variance vector,

Component variance value estimation: (
Figure 6 illustrates the relationship between the mixing portions and the FI. Instinctively, the mixing portion should represent the area of a component. Therefore, a fouled ballast image with a higher FI value should exhibit fewer visible voids, leading to a smaller mixing portion for the first component,

Component mixing portion value estimation.
The two methods clearly produce different parameter estimates. Given that higher-order moments are expected to yield less accurate estimations, particularly for non-normal distributions, it is worth noting that the point estimation method implemented here involves the second moment, the sample variance. The objective function in Equation (22) does not directly quantify the GMM fitting. As this equation represents the sum of squared residuals for each equation in Equations (21), it applies equal weight to each equation. However, considering that the absolute value of the mean is generally larger than the absolute value of the variance, the precision of variance estimation is effectively diminished.
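The effect of equal weighting can be illustrated with a small sketch of a sum-of-squared-residuals objective on the sample mean and variance. The exact form of Equations (21) and (22) is not reproduced here: the functions below use the standard moment identities of a finite mixture, intensities are assumed to be scaled to [0, 1], and all names and numbers are hypothetical.

```python
def squared_residuals(portions, means, variances, observed_mean, observed_variance):
    """Squared residuals of the mixture-implied mean and variance."""
    model_mean = sum(p * m for p, m in zip(portions, means))
    model_variance = sum(
        p * (v + (m - model_mean) ** 2)
        for p, m, v in zip(portions, means, variances)
    )
    return (
        (observed_mean - model_mean) ** 2,
        (observed_variance - model_variance) ** 2,
    )


def objective(parameters, observations):
    """Equal-weight sum of squared residuals over all observed images."""
    portions, means, variances = parameters
    total = 0.0
    for observed_mean, observed_variance in observations:
        total += sum(
            squared_residuals(portions, means, variances,
                              observed_mean, observed_variance)
        )
    return total


# Hypothetical parameters on a [0, 1] intensity scale.
parameters = ([0.2, 0.3, 0.5], [0.1, 0.35, 0.6], [0.002, 0.005, 0.01])

# Observations that are each 10% above the mixture-implied mean and variance.
mean_term, variance_term = squared_residuals(*parameters, 0.4675, 0.0495275)
```

With these assumed values, the same 10% relative error contributes far more through the mean residual than through the variance residual, so an optimizer driven by this objective has little incentive to refine the variance.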
Backcalculated Sample Parameters
The sample mean value can be backcalculated with Equation (17). Figure 7 shows the experimental, EM-estimated, and point-estimated sample mean values. Both backcalculated mean values fit the laboratory sieve analysis results well.
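For a finite mixture, the overall mean and variance follow from the component parameters through standard identities. The sketch below applies those identities; that they coincide with the exact form of Equations (17) and (18) is an assumption, and the parameter values are hypothetical.

```python
def mixture_mean(portions, means):
    """Overall mean: portion-weighted sum of the component means."""
    return sum(p * m for p, m in zip(portions, means))


def mixture_variance(portions, means, variances):
    """Overall variance: within-component plus between-component spread."""
    overall = mixture_mean(portions, means)
    return sum(
        p * (v + (m - overall) ** 2)
        for p, m, v in zip(portions, means, variances)
    )


# Hypothetical GMM parameters for one channel on a 0-255 intensity scale.
portions = [0.2, 0.3, 0.5]
means = [30.0, 90.0, 150.0]
variances = [100.0, 400.0, 900.0]

sample_mean = mixture_mean(portions, means)                    # 108.0
sample_variance = mixture_variance(portions, means, variances) # 2786.0
```

The variance depends on all three groups of parameters, including the squared distances between component means, which is one reason it is harder to reproduce than the mean.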

Comparison of sample mean value.
The sample variance value can be backcalculated using Equation (18). Figure 8 illustrates the experimental, EM-estimated, and point-estimated variances. Compared with the performance of sample mean backcalculation, it is more challenging to accurately fit the sample variances. The linear relationship between these two quantities is intriguing because the FI can be linearly predicted by the experimental sample variance (

Comparison of sample variance value.
After conducting the linear regression fitting, the values of slope
Linear Fitting Results Between the FI and the Sample Variance
Figure 9 presents the fitted data points alongside the linear regression lines. The discrepancies among methods within a single color channel show that the experimental variance cannot be pooled with the backcalculated variance to establish an FI prediction model based on RGB variance. The two sets of backcalculated models deviate from the experimental regression models because the variance values differ for each FI value: the experimental variance is calculated directly from the image channel intensity, whereas the GMM backcalculated variances come from Equation (18). The estimation of all the elements of the vector

Linear regression between the sample variance and the FI.
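The linear regression between the sample variance and the FI can be sketched with ordinary least squares. The data points below are synthetic and exactly linear, chosen only to make the example checkable; they are not measurements from this study, and the function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic channel variances (intensities assumed scaled to [0, 1]) and FI values.
channel_variance = [0.050, 0.045, 0.040, 0.035]
fouling_index = [0.0, 10.0, 20.0, 30.0]
slope, intercept, r_squared = fit_line(channel_variance, fouling_index)
```

In this synthetic case the variance decreases as the FI increases, so the fitted slope is negative.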
Conclusions
Because fouled ballast comprises several elements (large ballast particles, clustered fines, and voids), this study decomposes the RGB distributions of fouled ballast images into three components for each channel. The GMM is employed to represent these components, yielding fitting parameters such as component means, variances, and mixing portions. Two estimation methods are used for these fitting parameters: expectation-maximization, which minimizes information loss by maximizing the log-likelihood, and point estimation, which optimizes parameters by minimizing the sum of the squared residuals of the sample mean and variance expressions. Based on these fitting outcomes, several notable conclusions can be drawn:
Both expectation-maximization and point estimation can estimate the GMM parameters under the stated assumptions, although the two methods yield different parameter values and different trends with the FI.
Both methods perform well when the backcalculated sample means are compared with the experimental sample means. However, neither method yields backcalculated sample variances that are consistent with the experimental data.
A linear relationship between the sample variance and the FI is verified and explained, but the linear model may only apply to certain ranges.
Further research could implement data filtering and smoothing techniques to enhance data accuracy. Two conjectures on component properties are yet to be verified: 1) component means are independent of the FI; and 2) component variances are independent of the FI. Given the expression of sample variance, these GMM parameters can be expressed as functions of the FI, thereby exploiting the GMM to characterize the fouled ballast more effectively.
Footnotes
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: Yufeng Gong, Yu Qian; data collection: Yufeng Gong; analysis and interpretation of results: Yufeng Gong, Yu Qian; draft manuscript preparation: Yufeng Gong, Yu Qian. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is partially funded by the Federal Railroad Administration (FRA), Loram Maintenance of Way, Inc., M×V Rail, and BNSF Railway.
