Abstract
Coke strength after reaction (CSR) is a critical parameter in metallurgical applications, and its accurate prediction is essential for optimizing coal blends and coking processes. This research develops a data-driven method to model CSR from eight input variables: moisture content, volatile matter, ash percentage, sulfur content, maximum fluidity, plastic layer thickness, mean maximum reflectance (MMR), and the basicity index. A dataset comprising 630 coal samples with diverse properties was analyzed using Pearson correlation analysis, the Monte Carlo outlier detection technique for data integrity assessment, and machine-learning models with five-fold cross-validation. Multiple algorithms were implemented, including random forests, decision trees, adaptive boosting, convolutional neural networks, support vector regression, multilayer perceptron-artificial neural networks, and an ensemble learning approach, with hyperparameter optimization and evaluation metrics such as mean squared error, R2, and average absolute relative error. The random forest model outperformed all other contenders, demonstrating superior predictive power through consistently high R2 values and minimal error rates. Furthermore, Shapley additive explanations analysis revealed the influence of each input variable, with volatile matter having a predominantly negative effect on CSR, while features such as MMR and moisture showed positive correlations. This systematic methodology underscores the importance of robust data assessment and machine-learning models in enhancing predictive accuracy for CSR.
Keywords
Introduction
Coke plays a pivotal role in metallurgical applications, serving as both a fuel and a reducing agent in the production of steel. The efficiency and cost-effectiveness of metallurgical processes are significantly influenced by coke quality (Aghdam et al., 2022a, 2022b; Jha et al., 2020; Sharma and Tiwari, 2024). A crucial property of coke is its ability to maintain sufficient strength over time to support the burden in a blast furnace during operation. Two critical indices are employed to evaluate coke performance under these conditions: coke strength after reaction (CSR) and coke reactivity index (CRI). CSR represents coke's hot strength, reflecting its capacity to sustain the furnace burden, while CRI measures coke's durability by assessing mass loss following gas–solid reactions in a blast furnace (Lundgren et al., 2009; Wang et al., 2016). As such, CSR and CRI are regarded as key benchmarks for assessing coke quality during manufacturing.
Extensive research on coke production has identified several factors that impact coke quality, such as ash content, sulfur levels, volatile matter, and the presence of a plastic layer (Niu et al., 2022; Zhang et al., 2025). As a result, numerous models have been established to explore the complex relationships between coal blends and coke properties (Abdel Hady and Abd El-Hafeez, 2024; Abdel Hady et al., 2024; Badawy et al., 2021; Eliwa et al., 2024). Accurate and reliable modeling of the complex relationships between coal composition and the coke production process is exceptionally difficult. The inherent variability and intricate nature of coal, coupled with multifaceted coke manufacturing processes, present substantial hurdles in developing a precise predictive model (Alekseev et al., 2023; Bulaevskii and Shved, 2010; Pusz and Buszko, 2012; Ulanovskiy, 2014).
Recent advancements have introduced approaches such as regression models, statistical analysis, and machine learning to improve the prediction of coke quality from coal blend characteristics (Aghdam et al., 2022a, 2022b; Khezerlooe-ye Aghdam et al., 2019; Qiu et al., 2024a, 2024b; Rejdak et al., 2021). Linear and nonlinear regression models, in particular, have proven useful for characterizing how coal composition influences coke properties. While these models can provide precise predictions for uniform coal blends and similar coking ovens, their performance deteriorates significantly when the coal blend composition changes or when they are applied across different equipment (Agarwal et al., 2021; Qiu et al., 2024a, 2024b; Ziqi et al., 2022).
In addition to regression frameworks, statistical techniques have been utilized to uncover intricate correlations between coke quality indices such as CSR and CRI and coal properties. Examples of these methods include partial least squares (PLS) regression, multivariate adaptive regression splines (MARS), and principal component analysis (PCA) (Adhab et al., 2025; Qiu et al., 2024a, 2024b; Sabancı et al., 2023). PCA is often applied to reduce dimensionality by condensing multiple input features into a smaller set of parameters while retaining most of the original information. By simplifying the inputs, PCA facilitates better insights into the underlying structure of the data and the relations between coke quality and coal properties (Zhang et al., 2017a, 2017b; Zhou et al., 2022). On the other hand, PLS regression, while also capable of dimensionality reduction, is specifically designed to prioritize predictive power by maximizing the covariance between the inputs and the target variable, making it particularly suited for regression tasks where the primary goal is prediction (Zhang et al., 2022b; Zhao et al., 2011; Zhu et al., 2019). MARS, designed for modeling complex nonlinear regressions, is particularly suited for predicting coke quality by employing flexible basis functions to address nonlinear interactions. While statistical methods are effective at uncovering relationships between coal properties and coke quality, their predictive performance depends heavily on the specific coal samples used and the complexity of the input variables, which often limits the reliability of their predictions (Tao et al., 2021; Wang et al., 2024a; Yang et al., 2023a).
The metallurgical industry is increasingly turning to machine learning because its models have proven more accurate than traditional regression methods and are far better at capturing the complex relationships between input factors and outcomes (Abbasi et al., 2017; Bassir and Madani, 2019a, 2019b; Elmessery et al., 2024; Farghaly et al., 2020a, 2020b). For example, methods such as artificial neural networks (ANNs) and support vector machines (SVMs) show real promise for predicting coke quality with high accuracy (Mostafa et al., 2024a, 2024b; Omar and Abd El-Hafeez, 2024). In fact, ANNs have already been successfully used to model the co-gasification of coal and petroleum coke, producing predictions that closely match experimental observations (Qiu et al., 2024a, 2024b; Wang et al., 2024b; Yang et al., 2022).
Research has also shown that multilayer ANNs can effectively predict coke hot strength within acceptable margins of error. Similarly, SVMs have been applied to investigate variables such as ash content and its influence on coke properties, achieving high predictive accuracy. Despite the promise of these machine-learning techniques, they have only been minimally applied to the prediction of coke quality indices, and their reliability across diverse coal blends has yet to be rigorously evaluated. Therefore, there is a critical need for a comprehensive predictive model capable of consistently and precisely predicting CRI and CSR across a wide range of coal mixtures (Xiang et al., 2021; Zhao et al., 2024).
To address the complexities associated with predicting CSR, this paper adopts a systematic and robust workflow. The study begins with data gathering and analysis, including the application of statistical tools to assess distributions, correlations, and critical patterns. To guarantee the integrity and reliability of the dataset, Monte Carlo outlier detection (MCOD) is applied to identify and address unusual data points. A range of machine-learning models is then developed and fine-tuned, including random forests, adaptive boosting (AdaBoost), decision trees, convolutional neural networks (CNNs), support vector regression (SVR), multilayer perceptron-artificial neural networks (MLP-ANNs), and various ensemble methods. Each model undergoes five-fold cross-validation to ensure it generalizes well to new data, and performance is compared using key metrics such as R2, average absolute relative error (AARE%), and mean squared error (MSE) to determine which models are most effective.
Additionally, Shapley additive explanations (SHAP) analysis offers detailed insights into feature relevance, refining the understanding of the complex interrelationships between input variables and CSR. Collectively, the paper aims to establish a data-driven approach for CSR prediction, offers practical insights into feature contributions, and provides researchers and industry professionals with tools for optimizing coal blend strategies. Table 1 outlines the modeling process and flowchart across its different phases.
Modeling process and flowchart in different phases.
MMR: mean maximum reflectance; CSR: coke strength after reaction; CNN: convolutional neural network; SVR: support vector regression; MLP-ANN: multilayer perceptron-artificial neural network; AdaBoost: adaptive boosting; SHAP: Shapley additive explanations.
Figure 1 illustrates the overall workflow of this research.

Study's data-driven modeling process.
Fundamentals of machine learning
Ensemble learning
Ensemble learning is a powerful framework that combines predictions from multiple base learners to construct accurate and reliable predictive models. By integrating diverse algorithms, ensemble methods enhance performance, stability, and generalization capabilities, often surpassing individual models. A key advantage of this approach lies in its error-canceling effect: aggregating predictions from independently trained models reduces bias and overfitting, making it especially well-suited for working with complex or high-dimensional datasets. As illustrated in Figure 2, while ensemble learning can involve various algorithms, this study specifically incorporates random forests, decision trees, and AdaBoost methods to achieve optimal results (Mamun et al., 2022; Yaghoubi et al., 2024; Zhang et al., 2022a).

Diagram of the ensemble learning machine-learning technique.
Recognized for its power in detecting subtle patterns and relationships, this method is now a core component of applications like regression, classification, and anomaly detection. Ensemble methods are widely utilized across different domains, including finance, for credit scoring and fraud detection, and healthcare, where they support disease prediction and medical image analysis. Their significant presence in artificial intelligence allows for greater predictive reliability and superior decision-making. The ability of ensemble learning to maximize predictive accuracy makes it highly valuable, particularly in competitive machine-learning environments. However, effective implementation requires careful consideration to ensure optimal performance across a range of practical applications (Mhawi et al., 2022; Tan et al., 2022; Yang et al., 2023b). Note that in Figure 2, “voting” and “averaging” refer to classification and regression tasks, respectively.
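As a hedged illustration of the averaging strategy shown in Figure 2, the sketch below combines the three tree-based base learners used in this study with scikit-learn's VotingRegressor; X_train, y_train, and X_test are placeholders for the coal-property data, and the hyperparameter values simply echo those tuned later in the paper.

```python
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, VotingRegressor
from sklearn.tree import DecisionTreeRegressor

# Averaging ensemble of the three tree-based learners used in this study; hyperparameter
# values echo the tuned settings reported later, and the data variables are placeholders.
ensemble = VotingRegressor(estimators=[
    ("rf", RandomForestRegressor(max_depth=20, random_state=42)),
    ("dt", DecisionTreeRegressor(max_depth=16, random_state=42)),
    ("ada", AdaBoostRegressor(n_estimators=30, random_state=42)),
])
ensemble.fit(X_train, y_train)        # each base learner is trained on the same data
y_pred = ensemble.predict(X_test)     # individual predictions are averaged (regression)
```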
Adaptive boosting
AdaBoost functions by iteratively aggregating less accurate classifiers into a robust predictive model. Created by Freund and Schapire, AdaBoost enhances predictive accuracy iteratively by concentrating on incorrectly classified instances, assigning them greater weight in the following iterations. This flexible mechanism allows the algorithm to handle intricate decision boundaries, successfully minimizing bias and variance. The process involves training weak learners, such as decision stumps, on altered iterations of the training data, where the input for each learner is weighted based on its performance, giving greater influence to more accurate classifiers on the final prediction. The algorithm's sequential approach, which focuses on misclassified instances for future iterations, significantly contributes to its effectiveness, making it ideal for both classification and regression. The effectiveness of every weak classifier ht is assessed through its weighted error rate, denoted εt, computed over the current sample weights wi as

εt = Σi wi I(ht(xi) ≠ yi) / Σi wi

and each classifier is then assigned the weight

αt = (1/2) ln((1 − εt)/εt)
This weighting system guarantees that more precise classifiers have a greater impact on the final prediction, whereas less precise classifiers exert diminished influence. The ensemble model's output is achieved by merging the results from each weak classifier, weighted by their αt values. In classification tasks, the final prediction H(x) is determined as

H(x) = sign(Σt=1…T αt ht(x))
In this context, T refers to the count of weak classifiers, while sign(⋅) assigns the predicted class label according to the combined votes of the ensemble model. The adaptability of AdaBoost has led to its widespread application. Its capacity to be customized for particular tasks, compatibility with various base learners, and efficiency in processing high-dimensional data render it a useful resource for numerous applications. Nonetheless, AdaBoost has drawbacks, particularly its vulnerability to noisy data and outliers. If not properly controlled, the algorithm's inclination to increase the significance of misclassified instances can result in overfitting, as it might highlight erroneous data points. Furthermore, its training method is sequential, which can require considerable computational resources, and the intricate nature of the final ensemble model may lead to difficulties in interpretation.
In spite of these drawbacks, AdaBoost continues to be an effective ensemble technique, particularly when implemented thoughtfully and suitable preprocessing measures are employed to handle outliers and noise. Its capacity to flexibly rectify mistakes and merge weak learners into a robust predictive model maintains its importance in machine-learning research and practical uses. With its benefits, drawbacks, and adaptability, the method offers a significant and powerful resource for numerous predictive modeling tasks when appropriate preprocessing is executed (Ghorbani et al., 2020; Mohebbanaaz et al., 2022; Ren et al., 2022).
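To make the εt and αt updates above explicit, the following toy sketch implements discrete AdaBoost for binary labels in {−1, +1} using decision stumps; it is a didactic illustration of the formulas, not the boosting implementation used in this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=30):
    """Toy discrete AdaBoost for labels y in {-1, +1}, using decision stumps as weak learners."""
    n = len(y)
    w = np.full(n, 1.0 / n)                              # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)            # classifier weight alpha_t
        w = w * np.exp(-alpha * y * pred)                # emphasize misclassified instances
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)                                # H(x) = sign(sum_t alpha_t * h_t(x))
```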
Decision tree
This approach is a broadly utilized and intuitive technique designed for regression and classification tasks. It works by systematically partitioning a dataset into subsets with similar target variable characteristics. The process relies on feature-based decisions at each internal node to maximize the homogeneity of the resulting partitions. The splitting criteria are typically based on measures such as Gini impurity or information gain, and splitting continues until predefined stopping criteria are met, such as reaching a maximum tree depth or a minimum number of samples in a leaf node. Predictions are made by tracing an instance's feature values through the tree, leading to a terminal leaf node, which provides the predicted output. This structured and hierarchical approach mimics human decision-making processes, rendering the model both interpretable and user-friendly.
One of the key strengths of decision trees is their interpretability, allowing for a clear understanding of how features influence the target variable. Their inherent transparency makes them invaluable in fields such as healthcare and finance, where model interpretability is essential. However, decision trees are vulnerable to overfitting, particularly when they grow overly deep and complex, which can hinder their performance on new data. To address these issues, techniques such as pruning, limiting tree depth, and employing ensemble methods such as gradient boosting and random forests are commonly utilized. These strategies improve predictive accuracy and enhance model robustness by combining multiple trees’ predictions or simplifying tree structures.
In summary, despite some limitations, decision trees remain an essential tool due to their simplicity, interpretability, and ability to highlight feature relationships. When integrated with ensemble learning techniques, they achieve higher predictive performance and overcome overfitting, ensuring their continued importance in machine-learning applications where accuracy and transparency are critical (Navada et al., 2011; Tang et al., 2015).
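As an illustrative sketch (not the exact configuration used here), a depth-limited regression tree for CSR could be set up as follows; X_train, y_train, and X_test are placeholders for the coal-property data, and the leaf-size setting is arbitrary.

```python
from sklearn.tree import DecisionTreeRegressor

# Depth-limited regression tree; max_depth=16 echoes the value tuned later in this study,
# and min_samples_leaf illustrates the leaf-size stopping criterion discussed above.
tree = DecisionTreeRegressor(max_depth=16, min_samples_leaf=5, random_state=42)
tree.fit(X_train, y_train)
csr_pred = tree.predict(X_test)          # each prediction follows one root-to-leaf path
```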
Multilayer perceptron-artificial neural networks
MLPs are a fundamental class of ANNs designed to model intricate, nonlinear relationships in data. Built from an input layer, an output layer, and one or more hidden layers, MLPs consist of interconnected neurons characterized by adjustable weights and biases. The network's training procedure consists of forward propagation, in which data flows through layers to produce predictions, and backpropagation, where the calculated error is used to modify the biases and weights by means of optimization techniques.
In every neuron outside the input layer, the primary operation entails computing a weighted sum of its inputs, adding a bias term, and applying an activation function:

y = φ(Σi wi xi + b)
In the above formula, xi denotes the inputs to the neuron, wi the corresponding weights, b the bias term, and φ(⋅) the activation function (for example, a sigmoid or ReLU) that introduces nonlinearity into the network.
MLPs have attained widespread success because of their ability to represent nonlinear relationships and generalize well. Nonetheless, they possess constraints, including their inclination to overfit, significant computational expense with extensive datasets, and the need for precise hyperparameter tuning. In spite of these obstacles, MLPs continue to be an essential element in machine learning. Their capability to approximate arbitrary functions has resulted in their extensive use in regression, classification, pattern recognition, natural language processing, and specific applications, like medical diagnosis systems and financial forecasting. The efficacy of MLPs highlights their importance as independent models and as foundational components for more advanced deep-learning systems (Foddis et al., 2019; Itano et al., 2018; Mustafa et al., 2013).
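A minimal, assumption-laden sketch of such a network for CSR regression is shown below; the layer sizes are arbitrary, the 400-iteration budget echoes the tuning reported later, and X_train, y_train, and X_test are placeholders.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical MLP for CSR regression: two hidden layers with ReLU activations trained by
# the Adam optimizer; layer sizes are illustrative, not the study's exact architecture.
mlp = make_pipeline(
    StandardScaler(),                    # feature scaling helps gradient-based training converge
    MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                 solver="adam", max_iter=400, random_state=42),
)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
```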
k-Nearest neighbors
k-Nearest neighbors (k-NN) is an instance-based learning method used for classification and regression. Unlike parametric models, k-NN relies directly on the training dataset and assumes that similar data points are located close to one another. During prediction, the algorithm identifies the k-nearest training examples to a given query point using a specified distance metric, such as Euclidean distance. Predictions are then generated based on the classes or values associated with these k neighbors. Due to its flexibility, k-NN is well-suited for capturing complex decision boundaries that parametric approaches may struggle to model. The selection of the parameter k is critical and is often determined through cross-validation to balance sensitivity to noise against excessive smoothing.
For classification, the predicted class is determined by the majority class among the closest neighbors, while for regression, techniques such as weighted averaging, where closer neighbors have greater influence, can be applied. Its nonparametric nature enables k-NN to perform effectively when there are no specific assumptions regarding the data distribution. However, the algorithm's computational cost increases with the size of the dataset, and its performance is strongly influenced by the selection of k and the distance measure. Additionally, k-NN may face issues with imbalanced datasets, which can disproportionately affect the neighbor selection process and, consequently, the predictions.
To conclude, k-NN is a highly effective algorithm, particularly for small datasets or problems with complex decision boundaries. However, careful consideration must be given to the selection of hyperparameters, data preprocessing, and computational demands. The lack of a parametric model limits its generalizability since the algorithm relies entirely on the stored training data (Balabin et al., 2011; Bansal et al., 2022; Hazbeh et al., 2021; Zamouche et al., 2023).
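The sketch below shows one plausible way to combine distance-weighted k-NN regression with cross-validated selection of k; the candidate grid is illustrative (the study ultimately settles on K = 2), and X_train and y_train are placeholders for the coal-property data.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Distance-weighted k-NN regression with k chosen by five-fold cross-validation.
knn_search = GridSearchCV(
    KNeighborsRegressor(weights="distance", metric="euclidean"),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=5, scoring="neg_mean_squared_error",
)
knn_search.fit(X_train, y_train)
print(knn_search.best_params_)           # chosen k balancing noise sensitivity and smoothing
```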
Support vector regression
SVR modifies SVMs specifically for regression tasks. Unlike SVMs, which prioritize finding an ideal hyperplane to categorize data distinctly, SVR aims to find a function that models the relation between input features and continuous outputs, keeping prediction errors within a defined tolerance. The core principles of SVR, such as margin maximization and the application of support vectors, remain unchanged but are adapted for regression applications.
The core objective of SVR is to find a regression function f(x) that predicts y from x while minimizing deviations from the true values. Nonetheless, instead of reducing all deviations, SVR focuses only on those that exceed a designated threshold ε, establishing an epsilon-insensitive tube around the predicted function f(x). Points within this tube are regarded as accurately predicted, even if they differ from the true values. The SVR regression function is

f(x) = W·X + b
Within this framework, W denotes the weight vector, X is the input feature vector, and b is the bias term. The SVR optimization problem is mathematically formulated as

minimize (1/2)||w||2 + C Σi (ξi + ξi*)

subject to yi − (w·xi + b) ≤ ε + ξi, (w·xi + b) − yi ≤ ε + ξi*, and ξi, ξi* ≥ 0
Here, ||w||2 serves as the regularization term, which ensures the solution remains as simple (flat) as possible; C is a penalty parameter that balances this flatness against the tolerance of deviations larger than ε; and ξi and ξi* are slack variables that measure how far individual points fall outside the epsilon-insensitive tube.
SVR provides various significant benefits for regression issues. It is resilient to outliers because of the epsilon-insensitive loss function and works especially well in high-dimensional feature spaces where other regression methods may have difficulty. Employing kernel functions enables SVR to represent intricate nonlinear relationships without the need for explicit data transformation, which enhances its flexibility. In addition, the convex optimization approach guarantees that the solution is globally optimal, sidestepping problems like local minima (Bansal et al., 2022; El-Sebakhy, 2009; Wandekokem et al., 2011; Zhang and O'Donnell, 2020).
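A hedged scikit-learn sketch of epsilon-insensitive SVR with an RBF kernel is given below; C = 20 mirrors the value tuned later in this study, while epsilon and gamma are illustrative defaults rather than reported settings.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Epsilon-insensitive SVR with an RBF kernel for CSR regression; data variables are placeholders.
svr = make_pipeline(
    StandardScaler(),                              # SVR is sensitive to feature scales
    SVR(kernel="rbf", C=20, epsilon=0.1, gamma="scale"),
)
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
```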
Convolutional neural networks
The approach represents a cornerstone in deep learning, particularly acclaimed for its exceptional performance in image analysis tasks. CNNs are designed to hierarchically extract increasingly complex features from visual data, enabling them to excel in tackling a wide range of challenging problems. Their architecture consists of a sequence of interconnected layers, where the initial layers detect fundamental features like edges and corners, and deeper layers capture more intricate patterns and structures. The core components of CNNs are convolutional layers, which utilize small filters (kernels) that traverse the input data to act as localized feature detectors. The shared-weight mechanism, inherent to this method, guarantees translation invariance. This crucial property allows CNNs to detect patterns consistently, regardless of their position within the input. Following this, pooling layers strategically downsample feature maps, often via max or average pooling, effectively highlighting essential features while simultaneously lowering computational demands.
Toward the end of a CNN, fully connected layers are often used. These layers take the features that the earlier parts of the network have extracted and combine them to make the final prediction, whether it's for classifying something or predicting a value. In these layers, every neuron is connected to every neuron in the layer before it. This comprehensive connectivity allows the network to learn complex, high-level patterns from the more basic features identified in the preceding layers. This structure allows CNNs to synthesize information from different levels to generate a comprehensive and accurate output. Due to their ability to perform automatic and robust feature extraction, CNNs have become indispensable for applications like face recognition, object detection, and medical imaging. Moreover, advancements like transfer learning and pretrained models have extended the scope of CNNs, enabling fine-tuning for domain-specific tasks with smaller datasets.
In summary, CNNs have revolutionized the fields of image processing, medical imaging, and even speech and language tasks through their powerful hierarchical feature extraction and generalization capabilities. Their adaptability and ongoing advancements ensure that CNNs continue to be a vital tool in the development of artificial intelligence systems (Aloysius and Geetha, 2017; Chagas et al., 2018; Lopez Pinaya et al., 2020; Sohn et al., 2021).
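For tabular inputs such as the eight coal properties used in this work, a 1D convolutional network can be sketched as below with Keras; the architecture and training settings are illustrative assumptions rather than the exact network developed in this study, and X_train and y_train are placeholder arrays.

```python
from tensorflow import keras

# Hedged sketch of a small 1D CNN treating the eight coal properties as a short sequence.
cnn = keras.Sequential([
    keras.layers.Input(shape=(8, 1)),               # eight input features, one channel
    keras.layers.Conv1D(16, kernel_size=3, padding="same", activation="relu"),
    keras.layers.MaxPooling1D(pool_size=2),          # downsample the feature maps
    keras.layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
    keras.layers.Flatten(),
    keras.layers.Dense(32, activation="relu"),       # fully connected combination of features
    keras.layers.Dense(1),                           # single continuous output: predicted CSR
])
cnn.compile(optimizer="adam", loss="mse")
cnn.fit(X_train.reshape(-1, 8, 1), y_train, epochs=100, validation_split=0.2, verbose=0)
```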
Random forest
This approach utilizes the collective power of multiple decision trees to achieve superior predictive accuracy, improved generalization, and enhanced robustness compared to a single decision tree. The algorithm follows two main steps: tree construction and prediction aggregation. In the tree construction phase, decision trees are built using bootstrapping, a technique that involves sampling the training data with replacement. Furthermore, when splitting nodes, a random set of characteristics is evaluated, adding another layer of randomness. This dual randomization strategy increases diversity among the trees in the ensemble, reducing the likelihood of overfitting and allowing each tree to contribute unique insights to the final model, thereby improving generalization.
When a random forest makes a prediction, each tree in the forest contributes its own individual forecast. For classification tasks, the final prediction is resolved by majority voting. The class that gets the most “votes” from all the trees is chosen as the ultimate prediction. For regression tasks, the predictions from all the trees are averaged. The mean of these individual tree outcomes becomes the final prediction. For classification problems, the algorithm typically employs the Gini impurity metric to evaluate splits, while for regression tasks, it uses MSE. By combining randomization, tree-based learning, and prediction aggregation, a random forest significantly reduces bias and variance, resulting in more accurate and reliable predictions.
In summary, the random forest algorithm stands out as a highly adaptable and dependable method, proving effective for tackling both regression and classification problems. It offers strong generalization, reduces overfitting, and can also provide valuable interpretative insights, making it a widely applicable machine-learning technique (Ao et al., 2019; Cha et al., 2021; Mohapatra et al., 2020; Rigatti, 2017; Sarica et al., 2017).
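A minimal sketch of such a forest for CSR regression follows; max_depth = 20 mirrors the tuned value reported later, whereas n_estimators, feature_names, and the data variables are assumed placeholders.

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative random forest for CSR regression.
rf = RandomForestRegressor(n_estimators=200, max_depth=20, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)                 # bootstrap sampling plus random feature subsets per split
y_pred = rf.predict(X_test)              # predictions averaged across all trees
print(dict(zip(feature_names, rf.feature_importances_)))   # impurity-based importances
```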
Proposed model architecture
To provide a clear understanding of our methodology, Table 2 illustrates the architecture of our proposed machine-learning ensemble model for predicting CSR. This framework is designed to leverage the strengths of multiple diverse algorithms, thereby enhancing predictive accuracy and robustness.
Proposed model architecture.
CSR: coke strength after reaction; MMR: mean maximum reflectance; CNN: convolutional neural network; SVR: support vector regression; ML: machine learning; MLP-ANN: multilayer perceptron-artificial neural network; AdaBoost: adaptive boosting.
As depicted in Table 2, the process begins with the input layer, which consists of eight critical coal characteristic variables: moisture content, volatile matter, ash percentage, sulfur content, maximum fluidity, plastic layer thickness, mean maximum reflectance (MMR), and the basicity index. These features are simultaneously fed into an array of individual machine-learning models, which serve as our base learners. Specifically, the ensemble incorporates predictions from random forest, decision tree, AdaBoost, CNN, SVR, and MLP-ANN models.
Data gathering and analysis
Statistical analysis of the dataset
The data employed in this work includes measurements of CSR along with various predictor variables. The input variables consist of moisture (%), volatile matter (%), ash (%), sulfur (%), maximum fluidity (DDPM), plastic layer thickness, MMR (%), and the basicity index, with the output being CSR. The dataset, sourced from a coke plant, contains 630 samples from coals of diverse origins and characteristics. Standard methodologies, such as those outlined by the American Society for Testing and Materials (ASTM, 1950), were employed to measure the predictor variables and evaluate the CSR values.
Figure 3 graphically presents the frequency plots and cumulative distributions of all parameters. The data is visually analyzed to provide insights into its distribution and range, as depicted in Figure 4. Figure 5 illustrates the Pearson correlation coefficients for each pair of variables, represented as a heatmap. This analysis indicates that moisture content, ash content, plastic layer thickness, and MMR content positively influence the output variable, CSR, with MMR content emerging as the most impactful parameter. Conversely, the remaining parameters exhibit a negative effect on CSR magnitude. Tables 3 and 4 describe the dataset features in more detail.

Frequency and cumulative distribution of all the parameters.

Scatter plot of input and output variables.

Heatmap matrix of pairwise correlations between parameters.
Statistical evaluation of dataset features.
Input and output parameters description.
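The exploratory statistics and Pearson heatmap described above can be reproduced along the following lines, assuming the 630-sample dataset has been loaded into a pandas DataFrame df with one column per input variable plus CSR; column names and plotting choices are placeholders.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# "df" is a placeholder DataFrame holding the 630 samples: eight coal-property columns plus CSR.
print(df.describe())                                   # central tendency, spread, and ranges
corr = df.corr(method="pearson")                       # pairwise Pearson coefficients
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation between coal properties and CSR")
plt.tight_layout()
plt.show()
```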
Suspected data identification
An initial visual inspection of the data, including frequency plots and cumulative distributions (Figure 3), and univariate checks via boxplots, revealed no gross or obvious univariate anomalies, suggesting the absence of simplistic recording errors. However, recognizing that boxplots primarily identify outliers based on interquartile range in a single dimension, and that more subtle or multivariate anomalies might exist, we proceeded with a more sophisticated approach: the MCOD technique to identify potential deviations within the multidimensional feature space.
In this research, we opted for the MCOD technique due to its demonstrated efficacy and efficiency, especially with extensive datasets. MCOD leverages a combination of random sampling and density estimation to pinpoint outliers, offering a resilient method for detecting data points that diverge significantly from their local neighborhood in terms of data density. This involves evaluating data distribution patterns, enabling the detection of anomalies that might otherwise be missed.
The MCOD approach operates by selecting a representative subset of the entire dataset by Monte Carlo sampling, significantly reducing the computational burden typically associated with outlier detection. This random sampling enables scalability and makes it advantageous for use with high-dimensional datasets, limiting the necessity for complete dataset evaluation. The approach is computationally efficient as well, making it appropriate for extensive datasets and real-time applications where resources are constrained. However, there is a trade-off to consider: the reliability of the results depends on factors such as the number of samples and the nearest-neighbor setting (k). More samples usually lead to better accuracy, but this comes with longer computation times. Meanwhile, the choice of k affects how sensitive the outlier detection is (Liu et al., 2008; Marx, 2015; Rofatto et al., 2020; Zhang et al., 2016).
While MCOD is a robust method and doesn't assume a specific data distribution, its effectiveness can be influenced by highly skewed features. In such cases, valid extreme values might appear as low-density points and be inadvertently flagged as outliers. Removing these genuinely extreme, yet valid, data points risks introducing sampling bias, which could limit the model's capability to generalize across the full spectrum of real-world conditions. To mitigate this, we rigorously reviewed all identified outliers with domain expertise, distinguishing true errors from legitimate extreme observations. Additionally, our chosen ensemble machine-learning models, like random forests, are inherently less sensitive to outliers, further buffering against any negative impact from remaining or inadvertently removed extreme values.
Still, MCOD is a useful tool for finding outliers and exploring data, especially when approximate results suffice or computing resources are limited. Its capacity to balance efficiency with effectiveness becomes essential in the context of complex, high-dimensional datasets, where conventional outlier detection techniques may be excessively slow or computationally infeasible. By providing a scalable yet reliable solution, MCOD streamlines the procedure of identifying and handling outliers, consequently enhancing the overall quality of the dataset and improving the robustness of subsequent machine-learning models.
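The following sketch conveys the Monte Carlo sampling plus density-scoring idea described above; it is one plausible reading of MCOD under stated assumptions (the subset size, number of trials, and 97.5th-percentile cut-off are arbitrary), not the exact routine applied in this study. X stands for the 630 × 8 array of coal properties.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mc_outlier_scores(X, n_ref=200, k=5, n_trials=20, seed=0):
    """Score each sample by its mean distance to the k nearest points of a random reference
    subset, averaged over several Monte Carlo draws; low-density points get high scores."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_trials):
        ref = rng.choice(len(X), size=n_ref, replace=False)   # Monte Carlo subsample
        nn = NearestNeighbors(n_neighbors=k).fit(X[ref])
        dist, _ = nn.kneighbors(X)                            # distances to the reference set
        scores += dist.mean(axis=1)
    return scores / n_trials

scores = mc_outlier_scores(X)
suspects = np.where(scores > np.percentile(scores, 97.5))[0]  # assumed cut-off; review with experts
```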
Figure 6 presents a boxplot illustrating the data distribution of this research, shedding light on data range and quality, which are fundamental for model development. The boxplot captures data spread effectively, focusing on central tendency, interquartile range, and possible outliers. Most data points fall within the predictable range, suggesting the dataset is well-suited for machine-learning model training. This distribution confirms there are no major anomalies or outliers that could negatively affect model performance. The utilization of the complete data range enhances the models’ ability to identify underlying patterns and variations throughout the entire dataset. This is particularly crucial when dealing with intricate data structures, allowing the models to cultivate a more refined comprehension of the factors influencing the target variable. As a result, the model's predictive accuracy is likely to improve when trained on a comprehensive and representative set of data points. The approach adopted in this study, therefore, amplifies both the dependability and the performance of the machine-learning methods, ultimately leading to more meaningful and precise predictions (Jia et al., 2018; Rocco and Moreno, 2002).

(a) Outlier data detected by Monte Carlo algorithm and (b) boxplot of data.
Modeling methodology
K-fold cross-validation is a robust resampling technique designed to evaluate the performance of models, particularly when working with limited data samples. Its primary objective is to accurately quantify a model's capacity for generalization to independent, unseen datasets.
The methodology is as follows: the complete dataset is partitioned into K subsets of equal size, referred to as “folds.” Iteratively, across K cycles, one fold is designated as the validation set for model evaluation, while the remaining K − 1 folds constitute the training data. This sequential execution ensures that each of the K folds serves as the validation set precisely once. The performance metrics derived from each cycle are subsequently averaged to yield a unified and more representative assessment of the model's predictive performance. This process is shown in Figure 7.

Diagrammatic illustration of the K-fold cross-validation method employed during the training procedure.
Specifically, in five-fold cross-validation, the dataset undergoes division into five equivalent sections. The model is trained on four of these sections and subsequently evaluated against the fifth, held-out section. This iterative process is executed five times, with a different fold acting as the validation set in each iteration. This systematic rotation guarantees that every data point within the dataset is used for validation exactly once. The aggregation (averaging) of performance metrics across these five iterations provides a more statistically reliable estimate of the model's actual performance compared to a singular train/test split. This approach is exceptionally advantageous for constrained datasets, as it maximizes the utility of available data for both model training and validation, thereby mitigating the risk of overfitting and furnishing a more dependable evaluation of the model's ability to generalize effectively (Jung, 2018; Rodriguez et al., 2009; Wong and Yeh, 2019).
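In scikit-learn terms, this scheme can be sketched as follows; X and y stand for the full feature matrix and CSR vector, and the estimator shown is only an example.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Five-fold cross-validation sketch; data variables and the estimator are placeholders.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
r2_scores = cross_val_score(RandomForestRegressor(max_depth=20, random_state=42),
                            X, y, cv=cv, scoring="r2")
print(r2_scores.mean(), r2_scores.std())   # averaged R2 over the five held-out folds
```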
In this research, five-fold cross-validation is used during the training and optimization step of the random forest algorithm. To assess the accuracy of each developed model, RE%, AARE%, MSE, and R2 are calculated as outlined below (Abbasi et al., 2023; Bassir and Madani, 2019a, 2019b; Bemani et al., 2023; Madani et al., 2021; Madani and Alipour, 2022):

RE%i = ((CSRi,pred − CSRi,exp) / CSRi,exp) × 100

AARE% = (1/N) Σi=1…N |RE%i|

MSE = (1/N) Σi=1…N (CSRi,pred − CSRi,exp)2

R2 = 1 − Σi=1…N (CSRi,exp − CSRi,pred)2 / Σi=1…N (CSRi,exp − mean(CSRexp))2
In the above equations, “i” refers to an individual data point. The terms “pred” and “exp” represent the predicted values from the model and the actual observed values, respectively. “N” signifies the total number of data points. The input features used for developing the models include ash content, moisture content, sulfur content, volatile matter content, MMR, maximum fluidity, plastic layer thickness, and basicity index, with CSR serving as the output variable.
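A small helper implementing these four metrics, consistent with the definitions above, might look like this (y_exp and y_pred are the measured and predicted CSR arrays):

```python
import numpy as np

def evaluation_metrics(y_exp, y_pred):
    """Compute the RE%, AARE%, MSE, and R2 defined above for measured and predicted CSR."""
    y_exp, y_pred = np.asarray(y_exp, dtype=float), np.asarray(y_pred, dtype=float)
    re = (y_pred - y_exp) / y_exp * 100.0                 # relative error (%) per data point
    aare = np.mean(np.abs(re))                            # average absolute relative error (%)
    mse = np.mean((y_pred - y_exp) ** 2)                  # mean squared error
    r2 = 1.0 - np.sum((y_exp - y_pred) ** 2) / np.sum((y_exp - y_exp.mean()) ** 2)
    return re, aare, mse, r2
```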
Results and discussion
Hyperparameter determination and optimization efficiency
The training and validation sets are essential for establishing parameters and hyperparameters in various methods. In the decision tree, the hyperparameter known as max-depth is configured to 16 (refer to Figure 8). As illustrated in Figure 9, the AdaBoost model employs 30 estimators. The k-NN model is configured with a K-value of 2, as shown in Figure 10. Figure 11 illustrates that the maximum depth for the random forest technique is adjusted to 20. As shown in Figure 12, the MSE loss values for the CNN model are assessed across various iterations. In Figure 13, it is established that the C hyperparameter of the SVR model is 20. Figure 14 shows that iterative tuning sets the number of iterations for the MLP-ANN to 400. It is essential to note that the random forest, decision tree, and AdaBoost methods, together with their optimized parameters, are combined as an ensemble learning approach. Note that the training times for MLP-ANN, CNN, decision tree, random forest, SVR, AdaBoost, and k-NN were recorded as 14.65, 21.67, 5.43, 6.87, 7.31, 5.23, and 4.98 s, respectively. These algorithms were run in the Python programming environment on a machine with 16 GB of RAM and a Core-i7 CPU.

Selection of the optimal max-depth hyperparameter in decision tree.

Identification of the optimal number of estimators for the AdaBoost algorithm. AdaBoost: adaptive boosting.

Tuning the optimal K value for k-NN. k-NN: k-nearest neighbor.

Selection of the optimal max-depth hyperparameter for the random forest.

Plot of mean squared error (MSE) versus epochs for the CNN model. MSE: mean squared error; CNN: convolutional neural network.

Determination of the optimal C hyperparameter for the support vector regression (SVR) technique. SVR: support vector regression.

Determining iterative tuning of the number of iterations for the MLP-ANN. MLP-ANN: multilayer perceptron-artificial neural network.
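As a hedged illustration, the max-depth search of Figures 8 and 11 could be reproduced with a cross-validated grid search such as the sketch below; the candidate grid is arbitrary, and the same pattern applies to the estimator count, K, C, and iteration settings tuned for the other models. X_train and y_train are placeholders.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical grid search over max-depth with five-fold cross-validation; the study's tuned
# result for the random forest is max_depth = 20.
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid={"max_depth": [5, 10, 15, 20, 25, 30]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)   # best depth and its cross-validated MSE
```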
Modeling evaluation
Table 5 presents the evaluation metrics obtained from the data-driven models developed for the prediction of CSR, encompassing training, testing, and overall datasets. Three evaluation metrics, R2, MSE, and AARE%, were used to rigorously appraise the performance of the proposed soft computing techniques. Based on these criteria, the random forest model emerged as the superior estimator for CSR prediction. Specifically, the random forest model achieved testing R2, MSE, and AARE% values of 0.82, 12.35, and 4.5, respectively. While the decision tree and AdaBoost models exhibited comparatively lower performance metrics, they also demonstrated evidence of overfitting during the training phase, rendering them less suitable for this prediction task. A graphical representation of the testing evaluation metrics is provided in Figure 15, facilitating a clear comparative analysis of the developed models.

Evaluation indices of R2, MSE, and AARE% for all the models. MSE: mean squared error; AARE%: average absolute relative error.
Statistical evaluation indices, R2, MSE, and AARE% for various models.
MSE: mean squared error; AARE%: average absolute relative error; AdaBoost: adaptive boosting; k-NN: k-nearest neighbor; CNN: convolutional neural network; SVR: support vector regression; MLP-ANN: multilayer perceptron-artificial neural network.
In this study, two visualization techniques, relative error plots and cross-plots, are used to determine the accuracy of the machine-learning methods. Figures 16 and 17 provide a visual comparison between the real and predicted CSR values for the various algorithms used in this research. The random forest model shows a strong agreement between its predictions and the actual values. This is evident in the cross-plot, where the data points are tightly grouped along the bisector line (y = x). The equations for the fitted lines also closely match y = x, further confirming the model's accuracy in predicting CSR.

Appraised versus real points regarding train and test samples for all models.

Relative error percent versus actual points regarding the train and test samples for all the developed models.
Figure 17 presents scatter plots of the relative errors, which reveal that the error distribution for the random forest model is tightly centered around the x-axis, reinforcing its high accuracy. These visualization techniques also confirm a strong agreement between the real CSR values and the predictions generated by the AdaBoost model. However, as clearly demonstrated in Figures 16 and 17, both the decision tree and AdaBoost models exhibit signs of overfitting during the training phase. This overfitting provides a compelling reason to exclude them as reliable approaches for CSR prediction.
Sensitivity analysis and relevance study
Understanding which input features are important is crucial for knowing how they affect predictions in machine learning. In the current study, we used SHAP to uncover the complex relationships between the input features and to quantify their individual importance. As a robust framework grounded in game-theoretic principles, SHAP provides a transparent and interpretable assessment of how individual features contribute to a model's predictions. Figure 18 illustrates a ranked list of input parameters based on their influence on CSR predictions, displayed in descending order of significance. The analysis reveals that the volatile matter content (%) is the most critical factor affecting CSR.

Ranked feature importance determined using SHAP. SHAP: Shapley additive explanations.
The SHAP feature contribution graph, presented in Figure 19, offers a detailed visualization of the effect of each input variable on the model's results. This graph employs a color gradient to represent feature values, where red dots correspond to higher values and blue dots to lower ones. For volatile matter content (%), the dominance of red points on the negative side of the plot suggests its adverse impact on CSR, confirming its role as the most influential determinant. Furthermore, elevated values of sulfur content (%) and the basicity index are similarly associated with a negative effect on CSR. In contrast, higher values of features such as moisture, plastic layer thickness, MMR, and maximum fluidity exhibit a positive correlation with CSR. These findings align with observations previously reported in the literature.

SHAP feature contributions graph. SHAP: Shapley additive explanations.
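The SHAP rankings and contribution plots discussed above can be generated along these lines, assuming rf is the fitted random forest and X is the feature DataFrame from the earlier modeling steps:

```python
import shap

# "rf" and "X" are assumed to come from the earlier training step.
explainer = shap.TreeExplainer(rf)                    # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")    # ranked mean |SHAP| importance (cf. Figure 18)
shap.summary_plot(shap_values, X)                     # beeswarm of signed contributions (cf. Figure 19)
```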
Discussion
This study successfully developed and evaluated a robust machine-learning pipeline for predicting CSR, a critical parameter in the metallurgical industry. This section discusses the key findings, explores their practical implications, acknowledges the study's limitations, and outlines directions for future research.
Interpretation of results
Our comprehensive evaluation demonstrated that machine-learning models, particularly random forests, can accurately predict CSR based on a diverse set of coal properties. The random forest model consistently outperformed other implemented algorithms (decision tree, AdaBoost, CNN, SVR, MLP-ANN, and the general ensemble approach), achieving the highest R2 values and lowest error rates (MSE, AARE%). This superior performance can be attributed to random forest's inherent power to handle complex nonlinear relationships, its robustness to overfitting through ensemble averaging, and its capacity to manage high-dimensional data effectively.
The MCOD technique proved instrumental in enhancing model reliability, as evidenced by our ablation study. Removing outliers led to a significant improvement in predictive accuracy, underscoring the importance of rigorous data integrity assessment in real-world industrial datasets.
Furthermore, the SHAP analysis provided critical insights into the feature importance. Volatile matter was consistently identified as having a predominantly negative correlation with CSR, suggesting that higher volatile content tends to result in weaker coke. Conversely, MMR and moisture content exhibited positive correlations, indicating their contribution to improved coke strength. These findings align with metallurgical principles, where coal rank (reflected by MMR) and controlled moisture levels can influence coking behavior and final coke quality.
Practical implications and industrial relevance
The accurate prediction of CSR is paramount for optimizing coal blend selection and coking process parameters in steel and iron production. The robust predictive model developed in this study offers several practical implications for coke plants:
Optimized coal blending: By accurately predicting CSR from coal properties, plant operators can refine coal blends, ensuring optimal coke quality while potentially reducing raw material costs.
Process control: The model can serve as an early warning system, allowing for proactive adjustments in the coking process to maintain the desired CSR levels, thereby preventing costly quality deviations.
Resource efficiency: Enhanced prediction capabilities can lead to more efficient utilization of coal resources, minimizing waste and improving overall productivity.
Decision support: The interpretability offered by SHAP analysis provides domain experts with clear, actionable insights into how specific coal characteristics influence coke strength, aiding in informed decision-making.
This data-driven approach contributes significantly to achieving greater efficiency and cost-effectiveness in metallurgical operations.
Limitations of the study
While this study presents a robust framework for CSR prediction, certain limitations should be acknowledged:
Dataset scope: The dataset, while comprising 630 diverse coal samples, originates from a specific coke plant. Although efforts were made to ensure diversity, the model's generalizability might benefit from validation on datasets from multiple different plants with varying operational conditions and coal sources.
Input variable exclusivity: The model was developed using eight specific input variables. Other potentially influential factors, such as coking conditions (e.g. heating rate, carbonization time, oven type), or more detailed petrographic analyses of coal, were not included in this study due to data unavailability. Their inclusion could further enhance predictive accuracy and model robustness.
Static model: The current model is static, trained on historical data. Real-time operational data streams and dynamic process changes are not directly incorporated into the model's training loop.
Focus on CSR: While CSR is a critical index, CRI is equally important. This study focused solely on CSR prediction, and a combined or sequential prediction of both indices could offer a more holistic quality assessment.
Future work
Building on the success of this research, future work could explore several promising avenues:
External validation: Validate the developed models, particularly the random forest model, on independent datasets from different coke plants to rigorously assess their generalizability and robustness across varied operational environments.
Integration of process parameters: Expand the input feature set to include critical coking process parameters (e.g. heating rates, blend charge densities) to develop a more comprehensive predictive model that accounts for both coal properties and operational conditions.
Multi-output prediction: Extend the current framework to simultaneously predict both CSR and CRI, providing a more complete assessment of coke quality.
Real-time prediction system: Develop a dynamic, real-time prediction system that can continuously monitor coal properties and coking parameters, providing live CSR predictions and facilitating immediate process adjustments.
Advanced ensemble techniques: Investigate more sophisticated ensemble learning or deep-learning architectures (e.g. transformers for sequential process data) that might capture even more intricate relationships within the coking process data.
Uncertainty quantification: Incorporate methods for quantifying the uncertainty of predictions, which is valuable for decision-making in industrial settings.
Conclusions
This study provides an innovative and comprehensive methodology for predicting CSR using advanced data analysis and novel machine-learning methods. The implementation of MCOD demonstrated the effectiveness of ensuring dataset reliability, which significantly enhanced the models’ predictive capabilities. Among the tested algorithms, random forest achieved the best performance in terms of predictive accuracy, evidenced by its high R2 and minimal errors across training and testing datasets. SHAP analysis further elucidated the influence of input variables like volatile matter, moisture content, and MMR on CSR, offering actionable insights into improving coking processes. Importantly, the findings validate the importance of employing robust data preprocessing, feature evaluation, and ensemble machine-learning methods for tackling complex prediction tasks. This work lays a foundation for scaling data-driven approaches to other metallurgical applications and reinforces the role of machine learning in process optimization across the steel industry.
Beyond these specific technical advancements, a crucial area for future work involves continuously refining the data acquisition and feature engineering processes. As real-world operational conditions and raw material characteristics evolve, ensuring the quality, consistency, and representativeness of incoming data will be paramount for sustained model performance. Further research could explore advanced methods for feature extraction from raw plant sensor data, integrate more nuanced domain-specific knowledge into feature selection, or investigate transfer learning approaches to adapt models rapidly to new operational settings or different coke plants. This iterative refinement of data and features is essential for maintaining the models’ predictive power and ensuring their long-term applicability in dynamic industrial environments.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: 2024 Jiangsu Provincial Educational Science Planning Project (Grant No. B-b/2024/02/33); Soft Science Research Project of Jiangsu Provincial Department of Agriculture and Rural Affairs (Grant No. 24ASS074); and Yangzhou City Social Science Key Project (Grant No. 2024YZD-029).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Data is accessible on request from the corresponding authors.
