Abstract
Precisely forecasting the coke reactivity index (CRI) plays a critical role in the metallurgical industry, as it enables optimization of coke quality, leading to cost-effective production and efficient resource utilization. In this research, several machine learning predictive models based on extra trees, decision tree, support vector machine, random forest, multilayer perceptron artificial neural network, K-nearest neighbors, convolutional neural network, ensemble learning, and adaptive boosting are developed to predict CRI using a dataset gathered from a coke plant. To minimize overfitting in each algorithm, K-fold cross-validation is employed during the training phase. The efficacy of each algorithm is visually represented through graphical methods and quantitatively evaluated using performance metrics. The findings indicate that maximum fluidity and mean maximum reflectance (MMR) exhibit a direct correlation with CRI, whereas moisture content, ash content, sulfur content, basicity index, and plastic layer thickness are inversely related to it. Among the predictive models evaluated, the random forest model emerged as the most accurate tool according to the performance metrics of R-squared, mean square error, and average absolute relative error (%), with values of 0.958, 3.718, and 2.545%, respectively, over the total datapoints. The developed tool can be used to estimate CRI accurately and reliably without the need for experimental or field data.
Keywords
Introduction
Coke plays an essential role in the metallurgical industry, functioning as both a reducing agent and a fuel in steel and iron manufacturing. The quality of coke significantly influences the efficiency and expense of metallurgical operations (Feng et al., 2024; Kardas and Pustějovská, 2019; Wang et al., 2024a; Yu et al., 2024). Coke is recognized for its capability to withstand the burden in a blast furnace for a defined duration. To assess the performance of coke under such conditions, coke-making facilities rely on two critical metrics: the coke strength after reaction (CSR) and the coke reactivity index (CRI). CSR determines the hot strength of coke, indicating its capability to support the furnace burden, whereas CRI evaluates its reactivity by determining the mass loss following its reaction within the blast furnace (Nishi et al., 1982; Xie et al., 2024). As a result, CSR and CRI continue to serve as key quality standards for evaluating coke production.
Extensive research on coke production has demonstrated that its quality is shaped by several factors, including ash content, volatile matter, sulfur concentration, and the existence of a plastic layer. Consequently, various mathematical models have been developed to analyze the complicated correlation between the properties of coke and coal blends (North et al., 2018a). However, developing a reliable model that precisely captures this relationship remains a significant challenge owing to the complex nature of coal and the intricacies of the coke-making process (Agarwal et al., 2021). Recently, several methods have been employed to forecast coke quality from coal blends, including regression models, statistical analyses, and machine learning algorithms. Both linear and nonlinear regression models have been extensively utilized to establish relationships between coal blend properties and coke quality. While these models deliver accurate predictions when applied to consistent ovens and similar coal types, their reliability diminishes significantly when the composition of the coal blend changes or when they are used across different facilities (Bao et al., 2024; Lv et al., 2024; Stankevich and Zolotukhin, 2015).
In addition to regression models, statistical analysis methods have proven to be advanced tools for uncovering complex relationships between coal properties and coke performance metrics, such as CSR and CRI, in blast furnaces. Principal component analysis (PCA), in particular, is a dimensionality reduction method that transforms numerous input variables into a smaller set of new features, retaining most of the essential information in the original dataset (Lech et al., 2019; Zhang et al., 2025). This transformation allows PCA to simplify complex datasets, streamline calculations, and provide insights into the correlation between coke quality and coal properties. While conceptually similar, partial least squares (PLS) regression takes a different focus. Unlike PCA, which prioritizes dimensionality reduction, PLS regression emphasizes maximizing the predictive power of input variables (Ding et al., 2019; Feng et al., 2023). Despite this distinction, both approaches demonstrate comparable effectiveness. Multivariate adaptive regression splines (MARS), on the other hand, employs a set of simple piecewise linear functions to address complex non-linear regression problems, making it well suited to forecasting coke quality (Chelgani et al., 2011). Although statistical analysis techniques are effective in illustrating the correlation between coal characteristics and coke quality indices, they often require numerous variables and can be influenced by the choice of coal samples, resulting in constrained predictive accuracy.
Recently, machine learning and artificial neural network algorithms (Yang et al., 2023) have emerged as powerful tools in various disciplines (Ghorbani et al., 2022, 2023a; Hajihosseinlou et al., 2024; Madani et al., 2017; Zhang et al., 2024). For example, Soltanian et al. (2024) successfully put forth data-driven models to predict the thermodynamic properties of hydrogen using state-of-the-art machine learning models. Abdi et al. (2021) predicted the adsorption capacity of carbon dioxide by porous metal-organic frameworks using tree-based machine learning algorithms. Razavi et al. (2020) introduced accurate models to estimate carbon dioxide absorption by various amino-acid solutions. The metallurgical industry has embraced the rapid progress of machine learning, which excels at uncovering complex relationships between predictor and response variables and offers superior predictive accuracy compared with traditional regression models (Chen and Bai, 2013). Recent research has focused on support vector machines (SVMs) and artificial neural networks (ANNs), both of which have demonstrated effectiveness in accurately predicting coke quality (North et al., 2018b). Kang et al. utilized ANNs to examine the gasification of petroleum and coal coke (Kang et al., 2022). Yu demonstrated that a four-layer artificial neural network could successfully predict the hot strength of coke for a set of 10 samples, achieving predictions within an acceptable margin of error (Sidorov and Aristova, 2020). Zhang and Chen applied SVMs to study the impacts of ash content on furnace performance and the cold strength of coke, providing valuable insights into these critical relationships (Zhang et al., 2022). Their study demonstrated that SVM is capable of accurately predicting key variables in coke production. However, the application of machine learning methodologies to forecasting CSR and CRI remains limited, and their performance across diverse coal samples has yet to be thoroughly investigated (North et al., 2018a). Therefore, developing a comprehensive forecasting model that can reliably and accurately predict CSR and CRI across diverse coal blends is essential.
In this study, a comprehensive database from a coke production facility is collected and examined, with a focus on identifying outliers and suspected data points. Sensitivity examination is conducted using a relevancy index to investigate the impact of various input parameters such as moisture content, ash content, sulfur content, mean maximum reflectance (MMR), volatile matter content, plastic layer thickness, maximum fluidity, and basicity index on the CRI. Subsequently, multiple machine learning models are developed, leveraging algorithms such as decision tree, extra trees, K-nearest neighbors (KNN), random forest, multilayer perceptron (MLP) artificial neural network (ANN), SVM, convolutional neural network, ensemble learning, and adaptive boosting. These models are designed to accurately estimate the CRI based on its influencing factors. The robustness of the data-driven models is assessed via several metrics and visualized through graphical plots. The complete methodology in this paper is visually summarized in Figure 1.

Methodological workflow to predict coke reactivity index using various machine learning models.
Machine learning backgrounds
Decision tree
A decision tree is an interpretable and versatile machine learning method used for both regression and classification tasks. It operates by segmenting the dataset into sub-portions based on feature values, represented as a tree-like structure of branches and nodes. At each internal node, a decision is made using a specific feature, with branches representing the possible outcomes of that decision. This procedure is repeated until a stopping criterion, such as reaching a maximum depth, is met. The terminal nodes, known as leaves, contain the final predictions or outcomes. This structure enables the algorithm to mirror human decision-making, making it highly intuitive and easy to visualize (Shi et al., 2023).
Decision trees are typically simple in structure, but they are prone to overfitting when they grow too complex. Despite this limitation, decision trees are extensively utilized across numerous areas, including healthcare, finance, and marketing, owing to their ability to handle both categorical and numerical data, produce clearly interpretable results, and perform well on moderately sized datasets (Myles et al., 2004).
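As a brief illustration of the method described above, the following minimal sketch fits a depth-limited decision tree regressor with scikit-learn; the synthetic data and the max_depth value are placeholders and do not correspond to the coke dataset or the tuned settings of this study.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the eight coal/coke input features and the CRI target
X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Depth-limited tree: the max_depth cap is the main guard against overfitting
tree = DecisionTreeRegressor(max_depth=13, random_state=0)
tree.fit(X_train, y_train)

print("Test MSE:", mean_squared_error(y_test, tree.predict(X_test)))
```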
Adaptive boosting
Adaptive Boosting (AdaBoost) is a powerful ensemble learning technique that aggregates several weak classifiers, typically decision trees, to generate a strong classifier. It operates by sequentially training several weak learners, with each one concentrating on correcting the mistakes of its predecessor. In this iterative process, data points misclassified by the preceding learner are assigned higher weights, increasing their importance in the training of the next learner. This ensures that the model pays more attention to challenging examples, progressively improving its accuracy. The final forecast is made using a weighted vote or combination of all the weak learners, emphasizing those with better performance (Lin et al., 2024).
AdaBoost is highly effective at reducing variance and bias, making it a robust choice for many classification problems. However, it is sensitive to outliers and noisy data, as these may disproportionately influence the model through the weight-adjustment mechanism. AdaBoost works best with weak learners that perform only slightly better than random guessing, such as shallow decision trees (also known as decision stumps). Its applications span numerous fields, including spam detection, face recognition, and medical diagnosis, where it is valued for its adaptability and ability to handle diverse datasets (Feng et al., 2020).
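A minimal AdaBoost regression sketch is shown below; scikit-learn's default weak learner (a shallow decision tree) is assumed, and the data and number of estimators are illustrative rather than the values tuned later in this work.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: eight predictors and one continuous, CRI-like target
X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)

# The default weak learner is a shallow decision tree; each new learner focuses
# on the samples the previous ones predicted poorly (via larger sample weights)
ada = AdaBoostRegressor(n_estimators=71, learning_rate=0.5, random_state=0)
scores = cross_val_score(ada, X, y, cv=5, scoring="neg_mean_squared_error")
print("Mean CV MSE:", -scores.mean())
```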
Random forest
Random Forest is a robust ensemble learning algorithm commonly applied to both classification and regression problems. It operates by constructing numerous decision trees during training and combining their outputs to generate a final prediction. Each tree is trained on a randomly selected subset of the data obtained through bootstrap sampling and evaluates a random subset of inputs when determining splits at each node. This randomization promotes diversity among the trees, which helps reduce overfitting and enhances the model's overall reliability. For classification tasks, the final forecast in Random Forest is obtained by majority voting among the trees, whereas for regression tasks it is computed as the mean of the predictions from all the trees (Ghorbani et al., 2023b).
Random Forest is recognized for its flexibility and its capability to handle large, high-dimensional datasets, including datasets with missing values or categorical features. It is less prone to overfitting than individual decision trees because the aggregation of many trees smooths out noise and prevents the model from focusing too much on specific patterns in the training data. While it offers high accuracy and interpretability through feature-importance measures, it can be computationally intensive and may not perform as well on datasets with very sparse or unbalanced classes. Random Forest is commonly implemented in fields like healthcare, finance, and ecology for tasks such as risk assessment, disease prediction, and species classification (Oshiro et al., 2012).
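The sketch below illustrates these ideas with scikit-learn's RandomForestRegressor, including the feature-importance measures and an out-of-bag estimate of generalization; the data and hyperparameter values are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the coke dataset (eight input features, CRI-like target)
X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)

# Each tree sees a bootstrap sample and a random feature subset at every split
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=16,          # illustrative depth cap
    max_features="sqrt",   # random feature subset considered per split
    oob_score=True,        # out-of-bag estimate of generalization performance
    random_state=0,
)
rf.fit(X, y)

print("Out-of-bag R^2:", rf.oob_score_)
print("Feature importances:", rf.feature_importances_)
```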
K-nearest neighbors
The KNN algorithm is a simple yet effective supervised method suitable for classification and regression tasks. KNN identifies the k closest data points (neighbors) to a query point within the feature space using a specified distance measure, such as the Manhattan or Euclidean distance. For classification, KNN assigns the query point to the most common class among its k neighbors. In regression, KNN predicts the output by averaging the values of the k nearest neighbors. This versatile algorithm provides a straightforward approach to solving various predictive problems. KNN is an instance-based algorithm, meaning it does not explicitly build a model during training but instead relies on the training data directly during prediction (Zhang et al., 2017).
KNN is valued for its simplicity and flexibility, as it can handle multi-class problems and non-linear decision boundaries. However, it may be computationally costly with large datasets, because predictions require calculating distances to all training samples. Moreover, its performance depends heavily on the choice of k and the distance metric, as well as the scale and quality of the data. Proper normalization or scaling of features is often necessary to prevent bias towards variables with larger ranges. KNN finds applications in pattern recognition, anomaly detection, and recommendation systems, where its ease of implementation and ability to capture local data structure are advantageous (Sha’Abani et al., 2020).
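Because KNN is distance based, feature scaling is usually combined with it in practice; the following sketch assumes synthetic data and an illustrative choice of k.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling matters for KNN; otherwise wide-ranged features dominate the distance
knn = make_pipeline(StandardScaler(),
                    KNeighborsRegressor(n_neighbors=2, metric="euclidean"))
knn.fit(X_train, y_train)
print("Test R^2:", knn.score(X_test, y_test))
```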
Ensemble learning
Ensemble learning is a machine learning approach that combines the outputs of multiple models, known as “base learners,” to build a more accurate and reliable predictor. The underlying principle is that a group of models working together can achieve better performance and generalization than any single model, as different models may capture different patterns in the data or correct each other's errors. Ensemble methods can be broadly divided into two categories: boosting and bagging. Bagging, exemplified by Random Forest, minimizes variance by training multiple models independently on randomly selected sub-portions of the dataset and combining their predictions, typically through averaging. Boosting, such as AdaBoost, tackles bias by training models sequentially, with each new model prioritizing the correction of errors made by its predecessors (Dong et al., 2020).
Ensemble learning is widely used in both regression and classification tasks because of its ability to improve model accuracy, reduce overfitting, and enhance robustness against noisy data. It can integrate various base models, including decision trees, neural networks, and SVMs, making it highly flexible. Popular ensemble methods such as Random Forest, stacking, and gradient boosting have become staples in competitions and real-world applications in many areas, such as healthcare, finance, and natural language processing. Despite its strengths, ensemble learning can increase computational complexity and may reduce interpretability, as the combined model's behavior becomes harder to explain than that of the individual models (Sagi and Rokach, 2018).
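A simple averaging combination of heterogeneous base learners can be expressed with scikit-learn's VotingRegressor, as in the sketch below; the base models and their settings are illustrative, not those adopted later in this study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)

# Average the predictions of several heterogeneous base learners
ensemble = VotingRegressor(estimators=[
    ("tree", DecisionTreeRegressor(max_depth=10, random_state=0)),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=3)),
])
scores = cross_val_score(ensemble, X, y, cv=5, scoring="r2")
print("Mean CV R^2:", scores.mean())
```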
Convolutional neural network
Convolutional neural networks (CNNs) are a specialized deep learning architecture for processing grid-structured data, such as images and audio signals. Inspired by the human visual system, CNNs are adept at handling spatial hierarchies and patterns in tasks like object detection, image classification, and video analysis. Their typical architecture comprises convolutional layers, pooling layers, and fully connected layers, enabling efficient processing and learning from spatial data. Within the convolutional layers, learnable filters (kernels) extract local features from the input data, identifying patterns such as edges, shapes, and textures. Pooling layers, such as max pooling, reduce the spatial dimensions of the feature maps, thereby enhancing computational efficiency and providing robustness to spatial transformations like scaling and translation. This combination of layers allows CNNs to efficiently detect and process spatial and hierarchical features in data (Wang et al., 2024b; Wu, 2017).
One of CNNs’ key strengths is their ability to automatically learn hierarchical feature representations, in which deeper layers capture more abstract and complex patterns. This makes CNNs highly effective for analyzing structured data with spatial relationships. Applications of CNNs extend beyond image processing to include natural language processing (e.g. text classification), medical imaging (e.g. tumor detection), and even audio signal analysis (e.g. speech recognition). While CNNs are powerful, they require substantial computational power and large volumes of data for effective training. Advances like transfer learning and pre-trained models have helped address these challenges, broadening CNNs’ accessibility and impact across diverse fields (Wang et al., 2024c; Yamashita et al., 2018).
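For tabular inputs such as the eight coal properties used here, a one-dimensional CNN can be applied by treating each feature vector as a short signal; the following Keras sketch is a minimal illustration under that assumption and is not the architecture used in this work.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in: each sample's eight features are treated as a length-8 "signal"
rng = np.random.default_rng(0)
X = rng.normal(size=(616, 8, 1)).astype("float32")
y = rng.normal(size=(616,)).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8, 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # local feature extraction
    layers.MaxPooling1D(pool_size=2),                     # downsample the feature maps
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                                      # continuous CRI-like output
])
model.compile(optimizer="adam", loss="mse")               # MSE loss, as in this study
model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2, verbose=0)
print("Final training loss:", model.evaluate(X, y, verbose=0))
```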
Support vector machine
SVMs are popular supervised learning algorithms used for regression and classification tasks. The core principle of an SVM is to find an optimal hyperplane that divides the data into different classes in a high-dimensional space. The hyperplane is chosen to maximize the margin, that is, the distance to the closest training points, known as support vectors, which enhances generalization and improves performance on unseen data. When the data cannot be linearly separated, SVM employs the kernel trick, a technique that implicitly projects the data into a higher-dimensional space, enabling the creation of non-linear decision boundaries and making the method effective for complex datasets (Hearst et al., 1998).
SVM is highly efficient for handling high-dimensional datasets and is especially well suited to scenarios where the number of input features exceeds the number of samples. Its flexibility comes from the ability to apply various kernel functions, such as linear, polynomial, and radial basis function (RBF) kernels, allowing it to adapt to diverse data types. However, SVM can be computationally demanding when working with large datasets, and its performance relies heavily on meticulous tuning of hyperparameters such as the regularization parameter C and the kernel-specific parameters. SVM has found applications in fields like bioinformatics (e.g. gene classification), text categorization, and image recognition, where it is valued for its accuracy and ability to cope with complex decision boundaries (Pisner and Schnyer, 2020).
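The following sketch shows an RBF-kernel support vector regressor with feature scaling; the C value and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)

# The RBF kernel maps the data implicitly to a higher-dimensional space;
# C controls the trade-off between margin width and training error
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=27.0, epsilon=0.1))
print("Mean CV R^2:", cross_val_score(svr, X, y, cv=5, scoring="r2").mean())
```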
Multilayer perceptron artificial neural network
The MLP, a prominent type of ANN, is widely recognized for its effectiveness in supervised learning tasks such as classification and regression. The architecture of a standard MLP comprises three primary components: an input layer, one or more hidden layers, and an output layer, each consisting of interconnected neurons. Neurons perform computations by aggregating weighted inputs, applying an activation function such as ReLU, sigmoid, or tanh, and forwarding the outcome to subsequent layers. The hidden layers play a vital role in capturing intricate, non-linear patterns within the data, making MLPs particularly suitable for addressing complex challenges (Lu et al., 2024; Luo et al., 2022; Wang et al., 2024d).
MLPs are trained using the backpropagation algorithm, which optimizes the weights by reducing a loss function through gradient descent. While MLPs are effective for structured data and relatively small-scale problems, their performance can degrade on high-dimensional or unstructured data (e.g. audio or images) compared to specialized architectures like CNNs. Despite this, MLPs remain a versatile and essential component of machine learning, often used in fields like finance (e.g. credit scoring), healthcare (e.g. disease prediction), and marketing (e.g. customer segmentation), due to their simplicity, adaptability, and effectiveness for a broad range of applications (Chen et al., 2025; Wang et al., 2024e; Zhu et al., 2024).
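A compact MLP regression sketch with scikit-learn is given below; the hidden-layer sizes, training settings, and data are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Two hidden layers with ReLU activations, trained by backpropagation (Adam solver)
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                 solver="adam", max_iter=2000, random_state=0),
)
mlp.fit(X_train, y_train)
print("Test R^2:", mlp.score(X_test, y_test))
```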
Extra trees
Extremely Randomized Trees (Extra Trees) is an ensemble learning method that extends the principles of decision-tree-based algorithms such as Random Forest and is designed for both regression and classification tasks. The primary distinction of Extra Trees lies in how it builds the individual decision trees. Unlike Random Forest, where trees are trained using bootstrap sampling and splits are determined by optimizing a criterion (e.g. Gini impurity or information gain), Extra Trees uses the entire dataset without bootstrapping and selects split points randomly from a range of values for each feature. This increased randomness improves the model's generalization capability and reduces the common issue of overfitting (Martiello Mastelini et al., 2022).
Extra Trees offers several advantages, such as computational efficiency due to simplified split selection and robust performance on datasets with noisy or irrelevant features. Its high degree of randomness makes it less sensitive to small changes in the data, improving stability. However, like other tree-based ensembles, it can become resource-intensive with very large datasets or high-dimensional feature spaces. Extra Trees is broadly utilized in tasks like image recognition, fraud detection, and bioinformatics, where its balance of simplicity, speed, and predictive power is especially valuable (Alfian et al., 2022).
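The contrast with Random Forest can be seen directly in scikit-learn, where ExtraTreesRegressor disables bootstrapping by default and randomizes split thresholds; the comparison below uses synthetic data and an illustrative number of estimators.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)

# Extra Trees: whole dataset per tree (no bootstrap by default) and random split
# thresholds, which is the key contrast with Random Forest
for name, model in [
    ("extra trees", ExtraTreesRegressor(n_estimators=39, random_state=0)),
    ("random forest", RandomForestRegressor(n_estimators=39, random_state=0)),
]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```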
Database analysis and modeling methodology
Statistical analysis of the gathered coke database
The dataset utilized in this study contains measurements of CRI and various predictor variables obtained from proximate analysis, ultimate analysis, petrographic analysis of coals, and coking properties. These variables are strongly associated with CRI and were selected based on their relevance to the coke-making process and data availability. The chosen predictors include moisture content, ash content, volatile matter content, maximum fluidity, sulfur content, MMR, basicity index, and plastic layer thickness. The dataset consists of 616 samples collected from a coke plant, representing a diverse range of coal origins and characteristics. The predictor variables and CRI values were measured using standard methodologies, including those established by the American Society for Testing and Materials (ASTM). This comprehensive dataset serves as a robust foundation for analyzing and predicting CRI in the context of coke production (Astm et al., 1950). Table 1 provides the statistical characteristics of the input and output parameters.
Statistical indices with respect to input and output variables in this research study.
Modeling methodology
K-fold cross-validation is a popular statistical technique for evaluating the performance of machine learning models. It partitions the training dataset into k equal-sized sub-portions called “folds.” The training phase is then carried out on k-1 folds, and the model is subsequently validated on the remaining fold. This procedure is repeated k times, so that each fold serves as the validation set exactly once. The final performance indices are obtained by averaging the outcomes across all folds, yielding a reliable estimate of the model's effectiveness. K-fold cross-validation mitigates the overfitting and underfitting risks associated with a single train-test split, ensuring a more reliable model evaluation (Wong and Yeh, 2019). Its main benefit lies in its efficient use of data, particularly for small datasets where reserving a large portion for testing would limit the data available for training. K-fold cross-validation ensures that each data point is used once for testing and k-1 times for training, thereby providing a more balanced and unbiased assessment of the model's performance. This consistency makes K-fold cross-validation a reliable approach for model comparison and hyperparameter tuning, offering a thorough and equitable assessment, and it helps mitigate overfitting when working with limited data samples (Jung, 2018). The choice of k is critical in shaping the effectiveness of K-fold cross-validation; commonly used values, such as k = 5 or k = 10, strike a balance between computational efficiency and reliable performance estimates (Anguita et al., 2009; Li et al., 2024; Nti et al., 2021). Figure 2 illustrates the framework of K-fold cross-validation.

Structure of k-fold cross validation technique used within the machine learning training process.
This study employs 5-fold cross-validation during the training and optimization stages of the machine learning algorithms. To assess the predictive performance of each model, several evaluation indices, namely the coefficient of determination (R-squared), the mean square error (MSE), and the average absolute relative error (AARE%), are calculated (Bassir and Madani, 2019a, 2019b; Bemani et al., 2023; Guan et al., 2025; Madani et al., 2021).
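The sketch below illustrates how 5-fold cross-validation and the three evaluation indices can be computed; the AARE% helper, the random forest settings, and the synthetic data (shifted to a positive, CRI-like range so that relative errors are well defined) are illustrative assumptions rather than the exact workflow of this study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data; the target is shifted so relative errors are meaningful
X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)
y = y - y.min() + 20.0

def aare_percent(y_true, y_pred):
    """Average absolute relative error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_scores.append((r2_score(y[val_idx], pred),
                        mean_squared_error(y[val_idx], pred),
                        aare_percent(y[val_idx], pred)))

r2, mse, aare = np.mean(fold_scores, axis=0)
print(f"5-fold averages -> R2: {r2:.3f}, MSE: {mse:.3f}, AARE%: {aare:.2f}")
```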
Results and discussion
Identification of suspected datapoints
The Leverage methodology is employed to detect data points with large deviations. This approach integrates the Hat matrix (H) with the standardized residuals, where the Hat matrix is estimated from the matrix of input variables X as (Bemani et al., 2023; Madani et al., 2021):

$$H = X\left(X^{T}X\right)^{-1}X^{T}$$
Detection of suspected datapoints via the highly-recognized leverage method.
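A compact sketch of the leverage calculation is given below, assuming the commonly used warning leverage H* = 3(p + 1)/n and a ±3 band on the standardized residuals; the exact thresholds adopted in this study may differ, and the random inputs merely stand in for the coke data and model residuals.

```python
import numpy as np

def leverage_analysis(X, residuals):
    """Flag suspected data points via the leverage (Williams plot) approach.

    X: (n, p) matrix of input features; residuals: model residuals of length n.
    Assumes the common warning leverage H* = 3(p + 1)/n and a +/-3 band on the
    standardized residuals.
    """
    n, p = X.shape
    # Diagonal of the Hat matrix H = X (X^T X)^{-1} X^T gives the leverages
    hat_diag = np.einsum("ij,jk,ik->i", X, np.linalg.pinv(X.T @ X), X)
    h_star = 3.0 * (p + 1) / n
    std_resid = (residuals - residuals.mean()) / residuals.std(ddof=1)
    suspected = (hat_diag > h_star) | (np.abs(std_resid) > 3.0)
    return hat_diag, std_resid, h_star, suspected

# Illustrative call with random data standing in for the coke inputs and residuals
rng = np.random.default_rng(0)
hat, sr, h_star, flagged = leverage_analysis(rng.normal(size=(616, 8)), rng.normal(size=616))
print(f"Warning leverage H* = {h_star:.4f}, suspected points: {flagged.sum()}")
```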
Sensitivity study
In this section, we evaluate the influence of each input factor, namely moisture content, ash content, volatile matter content, maximum fluidity, sulfur content, MMR, basicity index, and plastic layer thickness, on the CRI. This analysis ranks the input factors using the concept of the relevancy index, which is computed via (Abbasi et al., 2023; Bemani et al., 2023; Hasanzadeh and Madani, 2024; Madani et al., 2021):

$$r = \frac{\sum_{i=1}^{n}\left(X_{k,i}-\bar{X}_{k}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{k,i}-\bar{X}_{k}\right)^{2}\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}}$$

where $X_{k,i}$ and $\bar{X}_{k}$ denote the i-th value and the mean of the k-th input, and $Y_{i}$ and $\bar{Y}$ denote the i-th value and the mean of the CRI.
Relevancy index calculated for all the input parameters with respect to coke reactivity index.
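Assuming the relevancy index takes the Pearson-type form given above, it can be computed for each input as in the following sketch; the dataframe name and column layout are hypothetical.

```python
import numpy as np
import pandas as pd

def relevancy_index(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson-type relevancy factor of each input column with the target."""
    y = df[target]
    r = {}
    for col in df.columns.drop(target):
        x = df[col]
        r[col] = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
            ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
        )
    return pd.Series(r).sort_values()

# Hypothetical usage: coke_df holds the eight inputs plus a "CRI" column
# print(relevancy_index(coke_df, target="CRI"))
```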
Models’ optimization
The training and validation subsets are used to determine the parameters and hyperparameters of the different algorithms. For the decision tree model, the tuned hyperparameter is the maximum depth, which is found to be 13 (see Figure 5). As shown in Figure 6, the number of estimators in the AdaBoost model is determined to be 71. For the random forest model, as shown in Figure 7, the maximum depth is estimated to be 16. In the KNN model, the K value is estimated to be 2, as shown in Figure 8. The extra trees model has a tuned number of estimators equal to 39, as shown in Figure 9. The SVM model contains the regularization hyperparameter C, whose value is found to be 27, as illustrated in Figure 10. The loss values of the CNN and MLP-ANN models, taken as the mean square error (MSE) in this study, are reported for different iterations in Figures 11 and 12. Note that the KNN, random forest, adaptive boosting, and decision tree models, with their tuned parameters, are used as the base estimators of the ensemble learning methodology. A sketch of this tuning and combination workflow is given below.
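The tuning curves in Figures 5 to 10 can be reproduced in principle by sweeping each hyperparameter under 5-fold cross-validation, as sketched here for the decision tree; the data are synthetic, and the ensemble shown simply averages the four tuned base learners named above for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, VotingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=616, n_features=8, noise=5.0, random_state=0)

# MSE versus max_depth for a decision tree (the type of curve shown in Figure 5)
for depth in range(2, 21):
    mse = -cross_val_score(DecisionTreeRegressor(max_depth=depth, random_state=0),
                           X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"max_depth={depth:2d}  CV MSE={mse:.2f}")

# Ensemble of the four tuned base learners (hyperparameter values from the text)
ensemble = VotingRegressor(estimators=[
    ("knn", KNeighborsRegressor(n_neighbors=2)),
    ("rf", RandomForestRegressor(max_depth=16, random_state=0)),
    ("ada", AdaBoostRegressor(n_estimators=71, random_state=0)),
    ("dt", DecisionTreeRegressor(max_depth=13, random_state=0)),
])
print("Ensemble CV MSE:",
      -cross_val_score(ensemble, X, y, cv=5, scoring="neg_mean_squared_error").mean())
```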

MSE as a function of max depth within the decision tree algorithm.

MSE as a function of number of estimators within the adaptive boosting algorithm.

MSE as a function of max depth within the random forest algorithm.

MSE as a function of number of neighbors within the k-nearest neighbors algorithm.

MSE as a function of number of estimators within the extra trees algorithm.

MSE as a function of C hyperparameter within the support vector machine algorithm.

MSE as a function of epoch within the convolutional neural network.

MSE as a function of iteration within the multilayer perceptron artificial neural network.
Models’ evaluation
Table 2 displays the performance metrics, including the average absolute relative error (AARE%), the coefficient of determination (R-squared), and the MSE, for the intelligent models developed in this research. To provide a comprehensive evaluation, Figure 13 illustrates these metrics for the test phase, offering a visual representation of the models’ predictive capabilities. First, the large differences between the training and testing MSE values of the extra trees and adaptive boosting models show that these models suffer considerably from overfitting, even though they are reinforced with the K-fold cross-validation method. Ensemble learning is also not considered a reliable method here, because one of its base estimators is adaptive boosting. Random forest appears to be the most accurate and robust model based on the resulting MSE, AARE%, and R2 for both the training and testing stages.

MSE, R2, and AARE% with regard to testing stage for all the developed models.
Evaluation indices with regard to test, train and total datapoints of the developed machine learning models.
To evaluate the robustness and accuracy of the trained algorithms, various graphical plots are used in the current research work. Initially, crossplots for the suggested predictive models are created, as depicted in Figure 14. In the random forest model, the clustering of data points near the unit-slope line is a strong indication of accuracy. Additionally, the equations of the lines fitted to these data points closely approximate the bisector line. Figure 15 shows the spread of the relative deviations for all models. A tighter grouping of data points along the y = 0 line indicates higher accuracy of the estimator. According to this graph, the random forest approach stands out as the most effective predictive method for estimating the CRI. The visual analysis of these plots contributes to a comprehensive assessment of the models’ performance and supports the selection of the random forest model as the preferred choice for accurate predictions.

Crossplots of real versus modeled datapoints based upon train and test segments for all the constructed predictive models.

Crossplots of relative error percent versus real points based upon train and test segments for all the constructed predictive models.
The findings of this study have significant practical implications for the metallurgical industry, particularly in optimizing coke quality for blast furnace operations. By reliably predicting the CRI, plant operators can fine-tune the selection and blending of raw materials to achieve the desired reactivity levels. This approach minimizes the reliance on costly experimental procedures and reduces waste, contributing to cost-effective and sustainable production processes. Furthermore, understanding the relationship between key parameters such as maximum fluidity, MMR, and basicity index with CRI offers actionable insights for improving coke formulation strategies, ensuring consistent performance across different batches. For example, targeting specific ranges of maximum fluidity and controlling ash content can enable a more efficient adjustment of coke reactivity to meet operational requirements.
Beyond its practical utility, the study also highlights broader contributions to the field of machine learning in metallurgical applications. The superior performance of the random forest model, with high R-squared values and low error rates, demonstrates the robustness of ensemble methods in handling complex, nonlinear relationships between process variables and CRI. This reinforces the potential of machine learning models as reliable tools for predictive analytics in industrial processes. Moreover, the study's methodology, that is, employing K-fold cross-validation to mitigate overfitting, can serve as a template for future work aiming to predict other critical metrics in coke or steel production. These findings not only enhance the industry's understanding of coke reactivity but also underscore the role of data-driven approaches in advancing metallurgical research and operational excellence.
Conclusions
In this study, robust machine learning models based on decision tree, extra trees, random forest, KNN, SVM, MLP ANN, convolutional neural network, ensemble learning, and adaptive boosting were successfully constructed to predict CRI from its influencing parameters, using a dataset sourced from a coke plant. To mitigate overfitting during the training phase, the study adopted the K-fold cross-validation methodology, which enhances the reliability and robustness of the machine learning models. The performance of each algorithm was evaluated through evaluation metrics and graphical plots. The results show that maximum fluidity and MMR exhibit a direct correlation with CRI, whereas CRI is inversely correlated with moisture content, ash content, sulfur content, basicity index, and plastic layer thickness. The random forest model emerged as the best algorithm for precise prediction of CRI, with an AARE%, R-squared, and MSE of 2.545%, 0.958, and 3.718, respectively, based on the total datapoints.
Footnotes
Data availability statement
Data are available from the corresponding author upon reasonable request.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
