Abstract
Background:
Bone fractures present a significant diagnostic challenge in medical imaging, necessitating accurate and automated classification methods. Recent advancements in deep learning have greatly enhanced the diagnostic precision while reducing human error.
Objectives:
This study proposes an ensemble deep learning model, EnsembleAttenBoneNet, that integrates fine-tuned ResNet50 and EfficientNetB3 models augmented with a Squeeze-and-Excitation (SE) attention mechanism, for robust classification of bone fractures in X-ray images.
Design:
The dataset consists of ten distinct fracture categories, such as avulsion, comminuted, greenstick, and pathological fractures.
Methods:
Preprocessing techniques, including resizing, normalization, and augmentation, have been applied to improve generalization. Features extracted from both networks were concatenated and refined using the SE attention module to enhance feature representation.
Results:
The proposed model achieved a classification accuracy of 99.48%, outperforming the individual models (EfficientNetB3: 98.56%, ResNet50: 97.86%).
Conclusion:
Experimental results affirm that integrating deep learning models with attention mechanisms significantly improves diagnostic accuracy, making the model a valuable tool for clinical fracture detection. Future research will extend the dataset and conduct real-world validation to enhance its usability in medical imaging.
Keywords
Introduction
Bone fractures are a serious medical problem that affects people of all ages as a result of trauma, osteoporosis, or pathological conditions. A fracture is defined as a loss of bone continuity caused by mechanical stress exceeding the intrinsic strength of the bone. Fractures are classified as avulsion, comminuted, compression, greenstick, impacted, intra-articular, longitudinal, oblique, pathological, and spiral fractures, each requiring its own diagnostic and therapeutic strategy. 1 Timely and accurate diagnosis is important for selecting effective treatment and preventing long-term complications. Clinically, bone fractures present with pain at the fracture site, swelling, impaired function, and sometimes visible deformity. Major fractures, especially open ones, can cause complications including vascular damage, nerve damage, and infection. The pathophysiology of fracture is multifactorial: causes include high-energy trauma, falls, sports injuries, and degenerative diseases such as osteoporosis, which reduces bone mineral content and predisposes patients to low-energy fractures. 2 Metabolic disorders, neoplasms, and infection also raise susceptibility to fracture, which makes sensitive diagnostic techniques even more important for guiding suitable intervention. Fracture treatment ranges from conservative to surgical. Non-surgical management immobilizes the fracture, with or without casting and splinting, and relies on natural callus formation and bone remodeling.
Complex fractures, however, require operative techniques such as open reduction and internal fixation, which restore anatomic alignment and stability using metallic implants in the form of plates, screws, or intramedullary rods. 3 New regenerative techniques, including bone grafting, tissue engineering, and stem cell therapies, have shown promise in improving fracture repair, although they are still under active study for clinical use. Regardless of the treatment approach, correct diagnosis is the foundation of effective fracture therapy, which underscores the need for better diagnostic tools. 4 Deep learning has emerged as a transformative approach in medical imaging, offering diagnostic tools with improved accuracy, consistency, and speed. Computer-aided fracture detection has been performed extensively with convolutional neural networks (CNNs), which apply hierarchical feature extraction to identify bone discontinuity patterns. 5 In contrast to conventional machine learning methods, CNNs avoid the need for handcrafted feature extraction and enable end-to-end learning of spatial and structural representations. Research has reported that deep learning-driven fracture detection can outperform traditional radiographic interpretation, reduce diagnostic errors, and support clinical decision-making. 6
This study proposes an ensemble model based on fine-tuned EfficientNetB3 and ResNet50 networks for classifying bone fractures. EfficientNetB3 uses a compound scaling strategy that jointly scales the depth, width, and resolution of the network, achieving efficient feature extraction while keeping computation feasible. 7 ResNet50 uses residual learning to train deep networks effectively and overcome the vanishing gradient problem. The ensemble model integrates both architectures to capture fine-grained structural details and high-level semantic characteristics, making classification more robust. Concatenating feature maps from the deeper layers of both networks provides a more complete feature representation and thus improves model performance. 8 The main goals of this study are: (1) to create a deep learning model for automatic bone fracture classification, (2) to apply transfer learning with fine-tuned EfficientNetB3 and ResNet50, (3) to use feature-level fusion for better decision-making, and (4) to test the model on a curated dataset of radiographic images. This research aims to overcome the limitations of standard radiographic interpretation by proposing an AI-based framework that can enhance clinical workflows and reduce diagnostic workload. Although there has been remarkable progress in deep learning-based fracture classification, problems relating to class imbalance, overfitting, and generalization across imaging datasets remain. Most prior work has relied on a single CNN architecture and is therefore unable to capture complex fracture patterns effectively. 9 In addition, differences in image quality, data heterogeneity, and the absence of standardized testing protocols make it hard to use deep learning models in clinical practice.
This study attempts to address these issues by combining the strengths of both deep learning architectures to achieve efficient feature extraction and improved classification rates. 10 The contributions of this work are as follows.
The proposed EnsembleAttenBoneNet is based on fine-tuned ResNet50 and EfficientNetB3, both ImageNet-pretrained models, to increase feature extraction capacity. The models are further trained on the Bone Break Classification Image Dataset so that the network can recognize complex fracture patterns with high classification accuracy.
The model adopts an attention-based feature fusion scheme that targets fracture-specific features and removes background noise. This reduces misclassification, enhances interpretability, and improves computer-aided diagnosis (CAD) system decision-making.
Compared with traditional models classifying fractures as fractured or non-fractured, ours classifies 10 categories of fracture, from avulsion to spiral fractures. Greater granularity in the classification makes ours more clinically usable in orthopedic diagnosis.
First, a new ensemble learning model is proposed that integrates EfficientNetB3 with ResNet50 to leverage their complementary feature extraction properties. Second, the pre-trained models are fine-tuned to adapt them to the domain-specific task of bone fracture classification. Third, feature map concatenation is used to ensure holistic representation learning. Fourth, an attention mechanism is added to focus the model on significant fracture areas, enhancing classification accuracy. Lastly, a comprehensive assessment is performed on a heterogeneous dataset to verify the efficacy of the proposed method. The results of this research are expected to contribute to the further development of AI-based diagnostic aids, enabling effective and precise fracture classification in the clinical setting. The article is structured as follows: the “Literature review” section reviews existing research, highlighting methodologies, findings, and gaps. The “Proposed methodology” section details the dataset, preprocessing, model architecture, training, and evaluation. The “Results and discussion” section presents performance metrics, visualizations, and analysis. The “State-of-the-art analysis” section compares the proposed approach with existing techniques. Finally, the “Conclusion” section summarizes key findings, contributions, and potential future research directions.
Literature review
The detection and classification of bone fractures have been a longstanding challenge in diagnostic radiology. Traditional interpretation of radiographs depends on the experience and judgment of radiologists, which may lead to variability and errors, particularly in complex or subtle fracture types. With the rapid evolution of deep learning, CAD systems have gained momentum as an effective means of enhancing fracture detection and classification accuracy. Recent literature reveals a progression from early CNN implementations toward transfer learning, dataset-driven improvements, and hybrid or ensemble architectures. This section critically reviews relevant works within these domains, highlighting methodological advances and identifying unresolved challenges that motivate the present study. A comparative summary of prior methods is provided in Table 1.
Comparison of different techniques for bone fracture classification.
CNN, convolutional neural networks; DL, Deep Learning.
CNN-based fracture detection
CNNs have emerged as the foundation of medical image analysis, primarily due to their ability to learn hierarchical spatial features directly from radiographs without handcrafted inputs. Early research demonstrated their diagnostic value for bone fracture classification. Tanzi et al. 11 explored the application of VGG16 on bone X-ray datasets and achieved an accuracy of 85.20%. While this marked a substantial improvement over conventional radiological interpretation, the performance was limited by dataset size and the model’s relatively shallow feature extraction capacity. Expanding upon this baseline, Barhoom et al. 12 investigated a CNN model for automatic fracture classification across multiple types of radiographs, reporting a higher accuracy of 92%. Their study underscored CNNs’ potential to minimize misdiagnosis, yet also revealed susceptibility to overfitting when applied to small and heterogeneous datasets. Similarly, Karanam et al. 13 developed a systematic deep learning framework for fracture classification, which achieved an accuracy of 93.75%. This work demonstrated that careful data preprocessing and network tuning could significantly enhance predictive accuracy. Collectively, these early CNN-based approaches established proof of concept but also exposed limitations in generalizability, particularly when faced with highly variable clinical images.
Transfer learning approaches
Given the scarcity of large, annotated medical datasets, transfer learning has become increasingly important in fracture classification. By leveraging models pre-trained on large-scale datasets such as ImageNet, researchers have been able to extract more discriminative features while reducing the risk of overfitting. Yaseen et al. 14 implemented YOLOv8 for cervical spine fracture detection, achieving strong localization results with a precision of 0.900, a recall of 0.890, and an mAP50 of 0.935. This study illustrated the feasibility of transfer learning for real-time detection in anatomically complex regions such as the cervical spine, where manual interpretation is often error prone. Li et al. 15 employed ResNet50 for nasal bone fracture classification, reporting an accuracy of 88%. Although effective, the model’s reliance on a single backbone limited its adaptability across fracture categories. Further extending these strategies, Morita et al. 16 evaluated SSD and YOLOv8 for automated detection of midfacial fractures in CT scans, reporting average precision values of 0.899 and 0.769, respectively. Although the results underscored the applicability of transfer learning across different modalities, performance was constrained by dataset imbalance and the morphological complexity of craniofacial structures. Taken together, these works confirm that transfer learning enhances fracture detection but highlight ongoing challenges related to dataset scale, anatomical diversity, and generalization across imaging modalities.
Dataset-driven advances
The quality and quantity of available datasets have proven equally critical in advancing automated fracture classification. Farda et al. 17 examined the impact of augmentation strategies for CT-based calcaneal fracture classification, achieving an accuracy of 72%. Their relatively modest results reflected the inherent difficulty of small, domain-specific datasets but also demonstrated the utility of augmentation for enhancing model robustness. The introduction of the MURA dataset by Rajpurkar et al., 18 comprising over 40,000 musculoskeletal radiographs, has been a milestone for research in this field. When applied to VGG16, the dataset enabled classification performance with an accuracy of 85.77%. Despite this, inter-class variability and imaging artifacts limited further improvements, reinforcing the need for more advanced architectures. Moon et al. 19 explored YOLOX-S with mixup augmentation and optimized loss functions for facial bone fracture detection, achieving an accuracy of 69.80%. While performance was lower compared to other works, the study underscored the importance of augmentation and specialized loss design in improving generalization under challenging conditions. Overall, these dataset-driven studies highlight the critical role of high-quality, diverse datasets, and augmentation in enhancing the stability and reproducibility of deep learning models.
Hybrid and ensemble architectures
To overcome the shortcomings of single-architecture CNNs, hybrid and ensemble models have been proposed. Dey et al. 20 advanced this paradigm by combining ResNet50 and DenseNet121 in a hybrid transfer learning framework for humerus fracture classification. Their model attained an accuracy of 93.10%, demonstrating the complementary strengths of residual and densely connected networks. Wang et al. 21 introduced ParallelNet, a multi-backbone ensemble framework for thigh fracture detection, achieving 87.80% accuracy. By combining different convolutional methods, the model was able to detect both small and large fracture patterns more effectively than single-architecture networks. These studies collectively indicate that ensemble strategies can capture complementary features from diverse backbones, improving accuracy and robustness. However, most prior ensemble approaches have focused on binary or limited-class classification tasks, and only a few have explored the integration of attention mechanisms to prioritize fracture-relevant features.
Research gaps and motivation for the proposed work
The reviewed literature (Table 1) demonstrates substantial progress in automated bone fracture detection, yet several critical gaps persist. First, most CNN-based methods achieve moderate accuracy and suffer from overfitting, especially with small datasets. 11–13 Second, transfer learning approaches, while effective, often focus on specific anatomical regions and yield inconsistent generalizability across modalities. 14–16 Third, dataset-driven advances show that performance heavily depends on dataset scale and augmentation quality, with limited solutions for rare fracture types. 17–19 Finally, hybrid and ensemble architectures improve robustness but have not fully addressed the challenge of multi-class fracture classification across a broad set of categories. 20,21 These limitations highlight the need for an approach that combines the strengths of multiple architectures, integrates attention mechanisms for feature prioritization, and is capable of handling diverse fracture categories. The proposed EnsembleAttenBoneNet addresses these gaps by fusing EfficientNetB3 and ResNet50 at the feature level, enhanced with a squeeze-and-excitation attention block to emphasize fracture-specific patterns. Unlike previous works, this model is designed for 10-class fracture classification, achieving near-perfect accuracy and demonstrating robustness through extensive evaluation.
Proposed methodology
The proposed methodology employs a deep learning-based ensemble model for bone fracture classification using ResNet50 and EfficientNet-B3 as feature extractors. First, input X-ray images undergo resizing, normalization, and data augmentation, including horizontal flipping, rotation, zooming, shearing, and translation, to improve model generalization. Feature extraction is performed using fine-tuned ResNet50 and EfficientNet-B3, where the fully connected layers are removed, and a Global Average Pooling (GAP) layer is applied to convert feature maps into compact representations. The extracted features are then concatenated and passed through batch normalization and Rectified Linear Unit (ReLU) activation to stabilize training. To refine feature representation, a Squeeze-and-Excitation (SE) Attention Block is introduced, which applies squeeze, excitation (fully connected layers with sigmoid activation), and reweighting (feature scaling) to emphasize critical features. The refined feature vector is processed through fully connected layers with 512 and 256 neurons, followed by dropout layers (0.5 and 0.3 probability) to prevent overfitting. A final softmax activation function classifies images into 10 fracture categories. The model is trained using categorical cross-entropy loss and optimized with Adam, combining attention mechanisms and ensemble learning for robust and accurate classification. Figure 1 illustrates the proposed methodology, a robust approach to multi-class bone fracture classification from X-rays. In accordance with best practices for transparency and reproducibility in AI-based medical imaging research, this study adheres to the CLAIM (Checklist for Artificial Intelligence in Medical Imaging) reporting standards, ensuring that all relevant aspects of data handling, model development, validation, and result interpretation are appropriately addressed (Supplemental Material).

Proposed methodology introduces a robust approach to bone fracture multi-classification for X-rays.
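The final stage of the pipeline described above, a softmax over the 10 fracture categories trained with categorical cross-entropy, can be sketched in plain Python. This is a minimal illustration under stated assumptions and not the authors' implementation: the logit values and the choice of true class are made up for the example.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_index, eps=1e-12):
    """Loss for one sample whose one-hot label is 1 at true_index."""
    return -math.log(probs[true_index] + eps)

# Hypothetical logits for the 10 fracture classes (illustrative values only).
logits = [0.2, 1.5, -0.3, 0.0, 3.1, 0.4, -1.2, 0.9, 0.1, 0.3]
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
loss = categorical_cross_entropy(probs, true_index=4)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is the quantity the optimizer drives down during training.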
Input dataset
The Bone Break Classification Image Dataset, obtained from the publicly accessible platform Kaggle, is the primary dataset for this study. 22 It is a standard benchmark on Kaggle, widely used in prior fracture classification research, and thus provides a reliable reference point for evaluating deep learning models. 23 The dataset is a labeled collection of X-ray images organized into 10 fracture categories: avulsion fracture, comminuted fracture, fracture dislocation, greenstick fracture, hairline fracture, impacted fracture, longitudinal fracture, oblique fracture, pathological fracture, and spiral fracture. It contains 123 avulsion, 148 comminuted, 156 fracture dislocation, 111 hairline, 122 greenstick, 84 impacted, 80 longitudinal, 85 oblique, 134 pathological, and 86 spiral fracture images. All images are labeled and ready for use in model training. This class distribution covers all 10 fracture types, which supports the development of a robust and accurate deep learning model for automating fracture classification and assisting clinical judgment in radiological diagnostics. The ensemble deep learning model built on EfficientNetB3 and ResNet50 is designed to identify complex fracture patterns. The dataset and process adopted help advance AI-based orthopedic diagnostics by offering a scalable platform for clinical application. Figure 2 shows sample dataset images for various bone fractures.

Dataset images for different bone fractures.
To ensure a balanced evaluation of the proposed model, the dataset was divided into training, validation, and testing subsets using a stratified approach that maintained class distribution across all sets. As presented in Table 2, the dataset comprises 10 distinct types of bone fractures, with 70% of images allocated to the training set, 20% to the validation set, and 10% reserved for independent testing. For example, the avulsion fracture class contains a total of 123 images, of which 87 were used for training, 22 for validation, and 14 for testing. Similarly, other fracture categories, such as comminuted, pathological, and spiral fractures, were distributed proportionally. The “Total Images” column provides a complete overview of the dataset composition, making clear how each class is represented across the splits. This stratified distribution keeps class proportions consistent across the subsets and keeps the test images separate from the training images, thereby enabling robust model training and reliable performance evaluation (Table 2).
Distribution of training, validation, and testing images across bone fracture classes.
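The stratified 70/20/10 split described above can be sketched as follows. This is a minimal illustration, assuming each class is shuffled and split independently; the function name `stratified_split`, the rounding rule, and the seed are hypothetical choices, so the per-class counts it produces need not match those reported in Table 2.

```python
import random

def stratified_split(labels, train=0.7, val=0.2, seed=0):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(train * len(indices))
        n_val = round(val * len(indices))
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to the test set
    return train_idx, val_idx, test_idx

# Toy labels using two class sizes from the text (123 avulsion, 86 spiral).
labels = ["avulsion"] * 123 + ["spiral"] * 86
train_idx, val_idx, test_idx = stratified_split(labels)
```

Because every index is assigned to exactly one subset, no image can appear in both the training and the test set, which is the property that guards against leakage between splits.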
Data pre-processing
Data pre-processing is an important process in deep learning-powered medical image analysis, which standardizes, normalizes, and augments input images for better model performance. For bone fracture classification, raw X-ray images differ in size, contrast, and orientation, and hence need systematic pre-processing for better learning. The most important steps are resizing, normalization, and data augmentation, which ensure consistency, speed up convergence, and avoid overfitting. Resizing normalizes image sizes, normalization rescales pixel values for numerical stability, and augmentation adds variations to enhance generalization. These pre-processing methods ready the dataset for effective feature extraction by fine-tuned ResNet50 and EfficientNet-B3 models for maximum classification accuracy.
Data resizing
Every X-ray image is scaled to a fixed dimension of 224 × 224 pixels in order to standardize the input images. Let I denote the original image; the resized image I′ is given by equation (1):
$$I' = R(I,\ 224 \times 224) \tag{1}$$
where R is the resizing function that maps each pixel from the original image to the new dimensions using bilinear interpolation.
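Bilinear interpolation, as used by the resizing function, can be sketched for a single-channel image stored as a list of rows. This is an illustrative sketch only: the function name `resize_bilinear` is hypothetical, and the corner-aligned coordinate mapping is an assumption, since the text does not specify which sampling convention was used.

```python
def resize_bilinear(image, new_h, new_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation."""
    old_h, old_w = len(image), len(image[0])
    out = []
    for i in range(new_h):
        # Map the output row back to a fractional source row (corner-aligned).
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        dy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            dx = x - x0
            # Blend the four neighbouring source pixels.
            top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
            bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out

tiny = [[0, 100], [100, 200]]
resized = resize_bilinear(tiny, 3, 3)
```

In practice an image library would perform this step; the sketch only makes explicit how each output pixel is a weighted average of its four nearest source pixels.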
Data normalization
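Pixel values are scaled to the range [0,1] to guarantee consistent feature scaling and enhance training stability. Given an image I with pixel values in the range [0, 255], the normalized image is obtained as in equation (2):
$$I_{\text{norm}} = \frac{I}{255} \tag{2}$$
where I_norm is the normalized image, whose pixel values lie in [0, 1].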
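Scaling pixel values to the range [0, 1] can be sketched as below. The sketch assumes 8-bit images, so that dividing by 255 gives the target range; the text states only the target range, so the divisor and the function name `normalize` are assumptions.

```python
def normalize(image, max_value=255.0):
    """Scale pixel intensities from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]

raw = [[0, 128, 255], [64, 192, 32]]  # illustrative 8-bit pixel values
scaled = normalize(raw)
```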
Data augmentation
A number of data augmentation methods were used on the original X-ray images in order to increase the variety of the training dataset and strengthen the deep learning model. These augmentations replicate real-world variation in image acquisition, including variations in orientation, zoom, and lighting conditions.
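To replicate the natural variation in patient positioning during radiographic imaging, a random rotation was applied within a specified angle range (30 degrees). This augmentation helps the model become invariant to rotational changes in anatomical structures. Equation (3) gives the rotation:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{3}$$
where θ is the rotation angle, (x, y) are the original pixel coordinates, and (x′, y′) are the rotated coordinates.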
Random zooming rescales the image and then resizes it back to the target input dimensions. It improves the generalization ability of the model by simulating the effect of focusing on particular anatomical areas at different magnification levels. Scaling is performed by randomly zooming images within a factor of
where
Horizontal flipping of the images introduces diversity in left-right symmetry. This is especially helpful in medical imaging, where anatomical structures can appear in mirrored orientations. For an image of width W, each pixel at (x, y) is mapped to (W − x − 1, y), as shown in equation (5):
$$(x', y') = (W - x - 1,\ y) \tag{5}$$
Vertical flipping mirrors the image top to bottom, adding further diversity and increasing the data volume, which may help the model learn robust features. Translation shifts the image in the horizontal and vertical directions to simulate real-world variations in image positioning. Together, these transformations improve the model’s robustness to variations in medical imaging and support better generalization in bone fracture classification. Equation (6) gives the translation:
$$x' = x + t_x, \qquad y' = y + t_y \tag{6}$$
where t_x and t_y are the horizontal and vertical shifts.
Contrast adjustment: Implemented through contrast jittering, this augmentation modulates the image’s contrast levels to mimic variations in exposure and illumination during X-ray capture. It is needed for models that must perform consistently under different image quality settings. Figure 3 visualizes the data augmentation techniques applied to X-ray images.

Visualization of applied data augmentation techniques on X-ray images.
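The geometric augmentations described in this section act on pixel coordinates, and a minimal sketch of the flip, translation, and rotation mappings is given below. The function names are hypothetical, and rotating about a chosen centre point is an assumption, since the text does not state the centre of rotation or how the angle is sampled.

```python
import math

def flip_horizontal(x, y, width):
    """Mirror a pixel coordinate left to right: (x, y) -> (W - x - 1, y)."""
    return width - x - 1, y

def translate(x, y, tx, ty):
    """Shift a pixel coordinate by tx horizontally and ty vertically."""
    return x + tx, y + ty

def rotate(x, y, angle_deg, cx=0.0, cy=0.0):
    """Rotate (x, y) by angle_deg about the centre (cx, cy)."""
    theta = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    xr = cx + dx * math.cos(theta) - dy * math.sin(theta)
    yr = cy + dx * math.sin(theta) + dy * math.cos(theta)
    return xr, yr

mirrored = flip_horizontal(0, 5, 224)   # leftmost column maps to the rightmost
shifted = translate(10, 20, 3, -4)
rotated = rotate(1.0, 0.0, 90)          # quarter turn about the origin
```

Applying the horizontal flip twice returns the original coordinate, which is a quick sanity check that the mapping is its own inverse.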
As summarized in Table 3, each fracture class was increased proportionally through augmentation. For example, the avulsion fracture class, which originally contained 87 training images, was expanded to 435 images, while the fracture dislocation class grew from 110 to 550 images. Similarly, smaller classes, such as longitudinal (54 images) and impacted fractures (60 images), were scaled up to 270 and 300 images, respectively. In each of these examples the class grows fivefold, so augmentation enlarges every class while preserving the relative class proportions rather than equalizing the counts. The larger and more diverse training set reduces the risk of overfitting and enables the model to learn robust features from a wider range of fracture patterns (Table 3).
Distribution of training images across bone fracture classes before and after data augmentation.
Proposed EnsembleAttenBoneNet model
The proposed EnsembleAttenBoneNet integrates EfficientNetB3 and ResNet50 as fine-tuned feature extractors for bone fracture classification. Compound scaling in EfficientNetB3 yields rich hierarchical features, and residual learning in ResNet50 preserves gradient flow and deep representations. Both networks extract feature maps, which are pooled using GAP and concatenated to form a strong fused representation. To improve discriminative power, an SE attention block dynamically recalibrates feature importance by modeling channel-wise dependencies, amplifying informative features while suppressing irrelevant ones. The concatenated and refined features pass through a fully connected classification head consisting of dense layers with Batch Normalization, ReLU activation, and Dropout for regularization. The final classification layer uses a softmax activation function for multi-class classification. The model is trained with categorical cross-entropy loss and the Adam optimizer. This hybrid ensemble technique combines the efficiency of EfficientNetB3 with the depth of ResNet50 and refines the feature representation with an attention mechanism to make it more robust. Fusing multiple architectures at the feature level and applying attention is intended to improve generalization and diagnostic accuracy in bone fracture classification. Figure 4 shows the EnsembleAttenBoneNet model, which combines fine-tuned ResNet50 and fine-tuned EfficientNet-B3 for multi-class bone fracture classification.

Proposed EnsembleAttenBoneNet model for bone fracture classification.
Fine-tuned EfficientNetB3
EfficientNetB3 is a CNN architecture that balances accuracy and computational efficiency through compound scaling, which systematically scales network depth, width, and input resolution. The pre-trained EfficientNetB3 model, initialized with ImageNet weights, acts as a feature extractor and is fine-tuned on the target dataset to enhance domain-specific feature learning, as shown in Figure 4. Given an input image x, the model transforms it through a sequence of convolutional layers, as shown in equation (7):
$$F_{\text{Eff}} = f_{\theta}(x) \tag{7}$$
where f_θ denotes the EfficientNetB3 backbone with parameters θ and F_Eff is the resulting set of feature maps. GAP then reduces each feature map to a single value, as shown in equation (8):
$$z_{\text{Eff}}^{(c)} = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_{\text{Eff}}^{(c)}(i, j) \tag{8}$$
where W and H denote the width and height of the feature maps, respectively, and c indexes the channels. This transformation reduces dimensionality while preserving the most salient information in each channel. The fine-tuning process involves selectively unfreezing higher-level convolutional layers, enabling the model to learn domain-specific patterns essential for the classification of bone fractures.
Fine-tuning EfficientNetB3 enables the model to leverage its pre-trained hierarchical feature representations while adapting to the domain-specific characteristics of bone fracture classification. By unfreezing the top convolutional layers, the model refines its feature extraction and identifies the fine structural patterns characteristic of different fracture types. Adding a GAP layer improves representational efficiency by reducing the spatial dimensions without discarding important features. This strategy balances classification accuracy and computational efficiency, which makes EfficientNetB3 a suitable backbone for medical image analysis. The optimized feature extraction step, combined with domain adaptation through fine-tuning, contributes to better generalization and diagnostic capability in the proposed ensemble strategy. The fine-tuned EfficientNetB3 model is employed as a feature extractor for bone fracture classification. EfficientNetB3, referred to as f_θ, is pre-trained with ImageNet weights, where θ denotes the pre-trained parameters. The model takes an input image X of size (224, 224, 3) and extracts high-level features using its convolutional layers. To reduce the spatial dimensions while retaining significant information, GAP is applied, as presented in equation (9). Figure 5 illustrates the architecture of the fine-tuned EfficientNetB3.

Architecture of fine-tuned EfficientNetB3.
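Global Average Pooling, which this section uses to turn each feature map into a single value, can be sketched in a few lines. This is a minimal illustration with made-up 2 × 2 feature maps; the function name `global_average_pooling` and the nested-list layout (one H × W grid per channel) are assumptions made for the example.

```python
def global_average_pooling(feature_maps):
    """Average each channel over its spatial grid; returns one value per channel."""
    pooled = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled

# Two illustrative channels, each a 2 x 2 feature map.
maps = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]
vector = global_average_pooling(maps)  # one average per channel
```

The output length equals the number of channels regardless of the spatial size, which is why the pooled vectors of two different backbones can later be concatenated directly.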
Fine-tuned ResNet50
ResNet50 uses residual connections to mitigate the vanishing gradient problem so that deep models can be trained. The architecture consists of multiple convolutional blocks with identity shortcuts, which keep gradient flow smooth during training. The backbone is initialized with weights pre-trained on the ImageNet dataset, and the top layers of ResNet50 are adapted to the fracture classification task. Residual learning, the central building block of ResNet, is formulated mathematically in equation (10):
$$y = f(x; W) + x \tag{10}$$
where f(x; W) denotes the transformation applied through the convolutional layers with weights W, and x represents the input. The addition of the identity mapping x allows for easier gradient flow, which mitigates the vanishing gradient problem and supports the learning of deeper models. The feature extraction process of ResNet50 is given in equation (11):
$$F_{\text{Res}} = \text{ResNet50}(x) \tag{11}$$
where F_Res denotes the feature maps produced by the ResNet50 backbone for the input image x. GAP is then applied to each feature map, as in equation (12):
$$z_{\text{Res}}^{(c)} = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_{\text{Res}}^{(c)}(i, j) \tag{12}$$
where W and H here denote the feature map’s width and height, and c indexes the channels. This method generates a fixed-size vector representation of the image by computing the channel-wise average over the spatial dimensions. For bone fracture classification, ResNet50 is fine-tuned by selectively unfreezing some layers. This enables the model to acquire high-level, domain-specific patterns connected to bone fractures, hence enhancing its classification performance.
The fine-tuned ResNet50 model, excluding the last dense layer, functions as a deep feature extractor for multi-class bone fracture classification. It leverages a pre-trained ResNet50 backbone

Layers of fine-tuned ResNet50.
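The residual formulation described for ResNet50 above, in which the output is the learned transformation plus the unchanged input, can be sketched as follows. This is a minimal illustration, not the authors' code: `residual_block` and `toy_transform` are hypothetical names, and the toy transform merely stands in for the convolutional branch f(x; W).

```python
def residual_block(x, transform):
    """Return transform(x) + x, the identity-shortcut form y = f(x; W) + x."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the identity shortcut requires matching shapes")
    return [f_i + x_i for f_i, x_i in zip(fx, x)]


def toy_transform(x):
    # Hypothetical stand-in for the convolutional branch: scale, then ReLU.
    return [max(0.0, 0.5 * value) for value in x]


y = residual_block([2.0, -4.0, 0.0], toy_transform)
```

Even where the transform outputs zero (the second element), the input still passes through the shortcut, which is why gradients can flow through deep stacks of such blocks.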
Concatenation of feature maps
To benefit from both networks, the feature maps of EfficientNetB3 and ResNet50 are concatenated before the classification stage. Concatenation combines the two feature representations so that the integrated network learns from a richer view of the previously extracted features. The concatenation procedure is expressed in equation (14).
Batch normalization stabilizes the composite feature representation and reduces covariate shift, as shown in equation (15).
Each feature channel is normalized using batch normalization, which improves convergence during training and provides a regularizing effect that lowers the risk of overfitting. ReLU activation then introduces non-linearity and improves the learning capacity of the model. The ReLU activation function is given in equation (16).
This activation function ensures that the concatenated feature maps retain important features while suppressing non-significant activations, which supports better gradient flow and faster convergence during training. By concatenating the feature maps of two different deep learning architectures, the network benefits from joint learning, which tends to yield more robust and more accurate classification. The fusion effectively combines the strengths of EfficientNetB3 and ResNet50, yielding a highly discriminative feature space for classification.
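The fusion pipeline described above (concatenation, batch normalization, then ReLU) can be sketched in plain Python under a few stated assumptions: features are already pooled into per-image vectors, batch normalization uses batch statistics with no learned scale or shift, and the vector sizes and values are purely illustrative. All function names are hypothetical.

```python
import math


def concatenate(f_eff, f_res):
    """Join the two backbone feature vectors end to end."""
    return list(f_eff) + list(f_res)


def batch_norm(batch, eps=1e-5):
    """Normalize each feature column to zero mean and unit variance.

    Assumption: no learned scale (gamma = 1) or shift (beta = 0).
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def relu(batch):
    """Zero out negative activations element-wise."""
    return [[max(0.0, value) for value in row] for row in batch]


# Hypothetical pooled features for a batch of two images.
eff_features = [[1.0, 2.0], [3.0, 6.0]]  # stand-in for EfficientNetB3 output
res_features = [[0.0], [4.0]]            # stand-in for ResNet50 output
fused = [concatenate(e, r) for e, r in zip(eff_features, res_features)]
activated = relu(batch_norm(fused))
```

The fused vector has as many entries as the two inputs combined; after normalization, values below the batch mean become negative and are removed by ReLU.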
Attention block
An attention mechanism is introduced to dynamically refine the feature representation and thereby increase the discriminative capacity of the network. In particular, an SE block is used to weight channel-wise feature responses dynamically, recalibrating feature relevance. This mechanism highlights important spatial information for more effective classification and improves feature selection. Figure 7 illustrates the attention block.

Attention block.
Squeeze operation
The SE block consists of two primary operations: squeeze and excitation.
where c denotes the channel index, W is the width of the feature map, and H is its height. The output z is a channel-wise descriptor capturing the global spatial distribution of the activations. By collapsing the spatial dimensions of the feature map, this operation summarizes the information needed for recalibration.
Excitation operation
The excitation step models interactions between channels by learning attention weights that can emphasize or suppress particular feature maps. This is achieved by passing the global descriptor through two fully connected layers with non-linear activations, as shown in equation (18).
This process enables the network to determine which channels are most pertinent to the task and to assign suitable importance weights accordingly.
Feature recalibration
Once the attention weights s are computed, they are used to scale the original feature maps through channel-wise multiplication. Equation (19) shows the feature recalibration.
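The three SE steps described above (squeeze, excitation, recalibration) can be sketched as follows. This follows the standard SE formulation, with a ReLU after the first fully connected layer and a sigmoid after the second; the weight names `w1` and `w2`, the omission of biases, and all numerical values are assumptions for illustration, since the trained parameters are not given in the text.

```python
import math


def squeeze(feature_map):
    """Channel-wise global average: one descriptor z_c per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excitation(z, w1, w2):
    """Two fully connected layers: ReLU bottleneck, then sigmoid gates."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-o)) for o in matvec(w2, hidden)]


def recalibrate(feature_map, s):
    """Scale every value in channel c by its attention weight s_c."""
    return [
        [[s_c * value for value in row] for row in channel]
        for channel, s_c in zip(feature_map, s)
    ]


# Toy input: 2 channels, each 1 x 2 (illustrative values only).
features = [[[1.0, 3.0]], [[2.0, 2.0]]]
w1 = [[1.0, 0.0]]    # hypothetical reduction weights, 2 -> 1
w2 = [[0.0], [1.0]]  # hypothetical expansion weights, 1 -> 2

z = squeeze(features)
s = excitation(z, w1, w2)
f_att = recalibrate(features, s)
```

In this toy setting the second channel receives the larger gate, so its activations are preserved more strongly than those of the first channel.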
Classification layer
The final classification layer is a fully connected network that maps the refined feature representations produced by the EfficientNetB3-ResNet50 hybrid model with the attention mechanism to the predicted output classes. This layer decodes the high-level features and associates them with specific fracture classes.
Feature refinement and fully connected layers
Following the attention mechanism, which emphasizes the most informative spatial features, the resulting feature map F_att is passed through a series of dense (fully connected) layers that apply non-linear transformations. These layers enable the learning of intricate correlations between different features, as shown in equations (20) and (21).
where W3 and W4 are the weight matrices of the fully connected layers, responsible for feature transformation, and b3 and b4 are the corresponding bias terms. After each dense layer, the ReLU activation function adds non-linearity, enabling effective learning of complex decision boundaries.
Overfitting prevention: dropout regularization
In deep learning models, overfitting is a frequent problem, particularly in medical image classification. Dropout regularization is therefore included in the network to reduce it, as shown in equation (22). Dropout randomly deactivates a fraction of neurons during training, which discourages the model from depending too heavily on particular features and promotes generalization. This improves the model's capacity for reliable classification of unseen bone fracture images.
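A minimal sketch of dropout is given below. It uses the common "inverted" variant, in which surviving activations are rescaled during training so that no adjustment is needed at inference; whether equation (22) uses this exact form is an assumption, as is the rate of 0.5, which is chosen only for illustration.

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
acts = [1.0] * 1000
dropped = dropout(acts, rate=0.5, training=True, rng=rng)
unchanged = dropout(acts, rate=0.5, training=False, rng=rng)
```

During training roughly half of the units are zeroed and the rest are doubled, so the expected activation is preserved; at inference the activations pass through untouched.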
Final prediction: softmax activation for multi-class classification
The final output layer uses the softmax activation function to convert the outputs of the dense layer into a probability distribution over all classes, as shown in equation (23). Because the softmax function ensures that the output probabilities sum to 1, it is suited to multi-class classification. The model thus assigns a probability to every fracture type, and the predicted class is the one with the highest probability.
Loss function and optimization strategy
To train the model effectively, the categorical cross-entropy loss function is used. It is the standard loss function for multi-class classification problems and is given in equation (24).
where C is the total number of fracture classes and y_c represents the true label (1 for the correct class, 0 otherwise).
The function penalizes predictions more heavily when the predicted probability for the correct class is low, improving model calibration. The Adam optimizer, with a learning rate of 10⁻⁴, combines momentum optimization with RMSProp and dynamically adjusts the learning rate per parameter. This accelerates convergence, making it highly effective for training deep networks on large-scale medical image datasets.
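The softmax output and the categorical cross-entropy loss described above can be illustrated with a short sketch. The three-class logits are hypothetical values chosen for readability (the model itself has 10 outputs), and the max-subtraction and the small `eps` clamp are standard numerical-stability devices rather than details taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss = -sum_c y_c * log(p_c) for a one-hot label y_true."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


logits = [2.0, 1.0, 0.1]  # hypothetical dense-layer outputs
probs = softmax(logits)
predicted_class = probs.index(max(probs))

loss_correct = categorical_cross_entropy([1.0, 0.0, 0.0], probs)
loss_wrong = categorical_cross_entropy([0.0, 0.0, 1.0], probs)
```

The loss is small when the true class receives a high probability (`loss_correct`) and grows when the true class receives a low probability (`loss_wrong`), which is the penalizing behavior described in the text.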
Results and discussion
The proposed model attained a high classification accuracy of 99.48%, surpassing individual architectures ResNet50 (97.86%) and EfficientNetB3 (98.56%). Evaluation using precision, recall, and F1-score over 10 fracture classes demonstrated robust performance and reliability. Analysis of the confusion matrix revealed minimal misclassification, proving the model’s capability to identify complex fracture patterns.
Comparative assessment against state-of-the-art methods further validates the model’s superiority. Incorporation of a SE attention mechanism significantly improved feature extraction, resulting in better classification accuracy. The results highlight the model’s potential applicability in real-world clinical environments for automated fracture diagnosis.
To optimize performance, the model was trained using carefully selected hyperparameters. The learning rate was fixed at 0.001 to achieve steady convergence of the loss function. The Adam optimizer was used because it combines an adaptive learning rate with momentum. A batch size of 32 was chosen to balance gradient stability and memory efficiency. The model was trained for more than 50 epochs, giving the network sufficient time to learn discriminative features. Lastly, the output layer used a softmax activation function to produce a probability distribution over the 10 fracture classes, from which the predicted class was obtained. The hyperparameter values are listed in Table 4.
Hyperparameter tuning.
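To make the role of the Adam optimizer concrete, the sketch below implements a single-parameter Adam update and applies it to a toy quadratic objective. The learning rate is taken from the hyperparameter paragraph above; the decay rates `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumptions here, as is the toy objective f(w) = (w - 3)².

```python
import math

LEARNING_RATE = 0.001  # value stated in the hyperparameter paragraph


def adam_step(param, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    state holds the step count t and the running moment estimates m and v.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # momentum term
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # RMSProp-style term
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
state = {"t": 0, "m": 0.0, "v": 0.0}
w = 0.0
history = []
for _ in range(10):
    grad = 2.0 * (w - 3.0)
    w = adam_step(w, grad, state, lr=LEARNING_RATE)
    history.append(w)
```

Because the update is normalized by the running gradient magnitude, each early step moves the parameter by roughly the learning rate, regardless of how large the raw gradient is.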
Training and validation results
The training and validation results show that the model performs well and consistently, with steady improvement over the epochs. Appropriate regularization methods were used to limit overfitting and support generalization to unseen data. The loss values converged smoothly, indicating stable learning. The evaluation metrics show that the model classifies the various fracture types with a high degree of reliability, supporting its use in clinical practice.
Training and validation results for ResNet50
Figure 8 plots the training and validation accuracy and loss of the ResNet50 model for bone fracture classification over 50 epochs. The curves suggest that learning was effective and that the model generalizes well. Training accuracy increases quickly during the first 10 epochs, starting at about 75%, then rises above 95% and approaches perfect performance. Validation accuracy follows the same trend, increasing rapidly in the initial stages and settling at 96%–98% after epoch 10 with slight fluctuations. This close agreement between training and validation accuracy implies that the model is learning meaningful features and is not overfitting. The training loss drops steeply to less than 0.05 within the first 10 epochs and then decreases gradually to almost zero by the end of training. The validation loss also falls sharply at the beginning of training and then remains low and fairly constant, with some variance that reflects normal fluctuation in generalization performance. The gap between training and validation loss is small, which further supports the model's generalization ability. The minor fluctuations in the validation metrics may stem from variation in the data or slight sensitivity to the optimization path, both of which are typical of deep learning models. Overall, the findings indicate that the model classifies accurately and converges stably, and that the training regime, including data pre-processing, network architecture, and optimization strategy, is well suited to the task. These results support applying or further developing the model on analogous data.

ResNet50 results. (a) Model accuracy. (b) Model loss.
Training and validation results for EfficientNet-B3
Figure 9 shows the training and validation accuracy and loss of the EfficientNetB3 model for bone fracture classification over 50 epochs. The curves indicate that the model learns effectively and generalizes well. Training accuracy starts at approximately 70%, rises quickly to above 95% after the first 10 epochs, and approaches 100% by the end of training. Validation accuracy follows the same pattern, increasing steeply in the initial epochs and leveling off at about 96%–98%, which indicates consistent performance on unseen data. The training and validation accuracy curves show no notable divergence, suggesting the absence of overfitting. Likewise, the training loss drops smoothly and continuously from more than 0.5 to almost zero. The validation loss also decreases but varies more than the training loss, presumably because of differences in the data or batch-level noise in the validation set. Despite these fluctuations, the validation loss remains low throughout training and shows no sign of instability or overfitting. Taken together, these results indicate that the training process is well optimized and that the model reduces error effectively without compromising predictive accuracy on either the training or the validation set. The minor irregularities in the validation loss are within acceptable limits for real-world datasets and do not undermine the overall robustness of the model.
This performance trend indicates that the model architecture, training strategy, and data pre-processing pipeline are well calibrated to the task, making the model a good candidate for real-world applications.

EfficientNet-B3 results. (a) Model accuracy. (b) Model loss.
Training and validation results for the proposed EnsembleAttenBoneNet model
The training and validation curves in Figure 10 show efficient learning behavior and good generalization over 50 training epochs. Training accuracy grows rapidly and then remains steady: it exceeds 95% after the first five epochs and approaches 100% by epoch 15. Validation accuracy shows the same trend, peaking at around 97%–98%, with only a slight gap between the training and validation curves. This near coincidence of the two curves indicates very little overfitting and suggests that the model captures the underlying patterns in the data without becoming overly specific to the training set. The training loss falls sharply from above 0.5 to below 0.05 within the first 10 epochs and tends toward zero as training proceeds. The validation loss also trends downward, although it fluctuates more than the training loss; it nevertheless stays low, settling at about 0.05–0.1 in the later epochs. These trends indicate that the model not only learns effectively but also generalizes well to unseen data. Overall, the agreement between the accuracy and loss curves supports the robustness and reliability of the model for experimental use and real-world deployment.

EnsembleAttenBoneNet results. (a) Training and validation accuracy. (b) Training and validation loss.
Comparison of EfficientNet-B3, ResNet-50, and EnsembleAttenBoneNet model
Figure 11 is a bar graph of the training and validation results for three models, EfficientNet-B3, ResNet-50, and the proposed EnsembleAttenBoneNet, on the 10-class bone fracture classification task. The models are compared over 10 epochs using training accuracy, training loss, validation accuracy, and validation loss. In the initial epochs, both individual models have low training and validation accuracies: EfficientNet-B3 starts with a training accuracy of 10.31% and a validation accuracy of 16.81%, and ResNet-50 with 12.41% and 17.90%, respectively. This is the first stage of learning, when the models are still adjusting to the data. Both models improve considerably as training goes on. At epoch 10, EfficientNet-B3 reaches 98.60% training accuracy and 99.21% validation accuracy, while ResNet-50 reaches 96.40% training accuracy and 97.05% validation accuracy. The rising validation accuracies indicate that the models generalize effectively. The ensemble model, which combines the strengths of both networks, outperforms the individual models at every epoch. Beginning with a training accuracy of 98.65% and a validation accuracy of 97.84% in epoch 1, it continues to improve, reaching a training accuracy of 99.57% and a validation accuracy of 99.51% by epoch 10. EnsembleAttenBoneNet thus exploits the complementary features of the two individual models and performs better on the bone fracture classification task.

Bar graph comparison of ResNet-50, EfficientNet-B3, and EnsembleAttenBoneNet model. (a) Training and validation accuracy. (b) Training and validation loss.
Testing result
The test results confirm the efficacy of the proposed EnsembleAttenBoneNet model in bone fracture classification, with an accuracy of 99.48%. Confusion matrix analysis shows very few misclassifications, with high precision, recall, and F1-scores for all 10 fracture classes. The model consistently outperforms the individual ResNet50 and EfficientNetB3 models, reflecting its stronger feature extraction and classification performance. The ensemble method reduces false positives (FP) and false negatives, supporting high diagnostic accuracy. These findings demonstrate the robustness and generalizability of the model and indicate its promise as a tool for automated fracture detection in clinical applications, assisting radiologists in efficient and accurate diagnosis.
Test results for ResNet50
The confusion matrix in Figure 12 for bone fracture classification shows an overall accuracy of 97.86%, reflecting the model's ability to discriminate between the 10 fracture classes. The diagonal entries represent correctly classified samples; the highest numbers of correct classifications are for Fracture Dislocation (18), Pathological Fracture (18), and Oblique Fracture (16), demonstrating the model's high predictive capability for these classes. A few misclassifications are observed, such as one case in which a comminuted fracture was misclassified as an avulsion fracture and another in which a fracture dislocation was confused with a pathological fracture. These errors suggest that these fracture types share similar morphological features, which can mislead the model. The model classifies Greenstick, Longitudinal, and Oblique fractures perfectly, showing that it distinguishes these classes sharply. The minor misclassification of Spiral and Avulsion fractures indicates possible overlap in the radiographic features of these fracture types. Overall, the confusion matrix indicates that the model is highly specific and sensitive across all fracture types, with only a few false negatives and false positives. Because only a small percentage of cases are misclassified, additional tuning with more training data, better feature extraction methods, or an ensemble method may further improve classification accuracy. The low error rate and high accuracy reflect the model's potential for clinical use, where automated fracture classification could help radiologists improve diagnostic efficiency and reduce subjective inconsistency in fracture interpretation.

Confusion matrix for ResNet50.
For bone fracture classification with the ResNet50 model, the key performance metrics (precision, recall, F1-score, and accuracy) assess how well the model identifies the different fracture types. Precision is the proportion of correctly predicted fractures among all cases predicted as a given class; high precision means few false positives. Precision is high for all fracture types (from 0.85 to 1.00), showing that the model rarely assigns a fracture to the wrong class. Recall, or sensitivity, is the proportion of actual cases of a fracture type that the model classifies correctly. The recall scores are high and consistent (greater than 0.92 in all classes), which means that the model finds nearly all true cases with only a few false negatives. The F1-score, the harmonic mean of precision and recall, summarizes classification performance. The F1-scores range between 0.92 and 1.00, indicating strong performance across fracture types. The Spiral Fracture class has the lowest precision (0.85) but a recall of 1.00, meaning that every true spiral fracture is detected while some spurious spiral predictions are still produced. Accuracy, the overall proportion of correct predictions, is 97.86%, which shows that ResNet50 classifies bone fractures very consistently. The model achieves a perfect F1-score (1.00) for several fracture types, namely Greenstick, Hairline, Impacted, Longitudinal, Oblique, and Pathological fractures. Together, these metrics demonstrate that ResNet50 is an effective tool for accurate detection of the various bone fractures and a potentially useful deep learning approach for computer-aided medical diagnosis.
The performance parameters are presented in Table 5.
Classification parameters for ResNet50.
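The per-class metrics discussed above follow directly from a confusion matrix. The sketch below shows one way to compute them; the 3 x 3 matrix is a hypothetical example (not the paper's data), and the row-equals-true-class convention is an assumption. The final line checks the reported Spiral Fracture figures: a precision of 0.85 with a recall of 1.00 gives an F1-score that rounds to 0.92.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_metrics(confusion):
    """Return (precision, recall, f1) per class.

    Assumed convention: confusion[i][j] counts samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # column minus diagonal
        fn = sum(confusion[k]) - tp                        # row minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((precision, recall, f1_from(precision, recall)))
    return results


def accuracy(confusion):
    """Overall proportion of correct predictions."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical 3-class confusion matrix for illustration only.
toy_confusion = [
    [9, 1, 0],
    [0, 10, 0],
    [0, 2, 8],
]
metrics = per_class_metrics(toy_confusion)
overall = accuracy(toy_confusion)

spiral_f1 = round(f1_from(0.85, 1.00), 2)  # values reported for Spiral Fracture
```

In the toy matrix, the second class has perfect recall but lower precision because samples from the other two classes are assigned to it, which mirrors the Spiral Fracture pattern described in the text.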
Result analysis for EfficientNet-B3
The confusion matrix of EfficientNet-B3 in Figure 13 provides detailed information about the model's performance in classifying bone fracture types. Because it scales effectively in width, depth, and resolution, EfficientNetB3 is used here to detect 10 bone fracture categories. The confusion matrix plots true labels against predicted labels and thus shows how well the model identifies each fracture type. The high diagonal values, which correspond to correct predictions, show that EfficientNetB3 classifies with high accuracy. Misclassifications, indicated by off-diagonal values, reveal slight confusion between certain fracture types, for example comminuted and avulsion fractures. This could be attributed to structural similarities between fracture types in the X-ray images, which make discriminative features hard to extract. EfficientNetB3 uses compound scaling, which balances model size against computational cost, and is therefore suitable for medical image classification. Transfer learning from weights pre-trained on ImageNet improves feature extraction, which leads to higher classification accuracy. Class imbalance is also countered by data augmentation techniques, including rotation, scaling, and contrast changes, which improve generalization. Although the results are good, the misclassifications leave room for refinement; methods such as attention mechanisms could discriminate features better. Overall, the EfficientNetB3-based model shows high promise for automated classification of bone fractures, and the confusion matrix is an important instrument for measuring performance and guiding further improvements to the model.

Confusion matrix for EfficientNet-B3.
The classification results of the EfficientNet-B3 model across the 10 fracture classes show high diagnostic accuracy, as reflected in the precision, recall, and F1-score measures in Table 6. The model achieves an overall accuracy of 98.56%, illustrating excellent generalization capacity. Precision values of 1.00 for the majority of classes, such as Avulsion, Comminuted, Greenstick, Hairline, Impacted, Longitudinal, Oblique, and Spiral fractures, imply that the model makes high-confidence predictions with very few false positives. Recall values of 1.00 for the majority of classes imply that the model captures all relevant cases, reducing false negatives. Fracture Dislocation and Pathological Fracture deviate only slightly, with recall values of 1.00 and precision of 0.95 and 0.94, respectively, showing occasional misclassification into these classes. The F1-score remains consistently high, confirming that the model balances sensitivity and specificity well in classifying bone fractures. The somewhat lower recall of 0.92 for Avulsion and Comminuted fractures reflects a few false negatives, possibly due to structural similarity with other fracture types or to dataset imbalance. Despite these minor variations, the high recall and precision of the EfficientNet-B3 model across all fracture categories validate its efficacy in automated bone fracture classification. Future research should refine the model further with data augmentation and external validation to establish clinical robustness. The findings indicate that EfficientNet-B3 is a feasible deep learning model for aiding radiologists in fracture diagnosis, with the potential to enhance diagnostic efficiency and reduce observer variability. Table 6 shows the classification parameters of EfficientNet-B3.
Classification parameters for EfficientNet-B3.
Result analysis for EnsembleAttenBoneNet model
The confusion matrix of the EnsembleAttenBoneNet model, shown in Figure 14, corresponds to a classification accuracy of 99.48% and indicates strong performance in automated bone fracture detection. The model has high true positive rates for all fracture types, indicating its ability to classify varied fracture patterns accurately. The true negative counts are also consistently high, which means that the model distinguishes between fracture types without significant misclassification. There is only a single false positive in the Oblique fracture class, where a non-Oblique fracture was mistaken for Oblique. False negatives approach zero, indicating that the model rarely misses actual fractures. Importantly, the absence of severe inter-class confusion supports the feature-level ensemble strategy, which exploits the strengths of ResNet50 and EfficientNet-B3 for better feature extraction. However, despite the near-perfect accuracy, the model's performance must be examined further to assess the risk of overfitting, especially if the dataset lacks sufficient real-world variability. Future studies should focus on reducing the remaining misclassification errors and increasing dataset diversity to achieve robustness in practice. The findings indicate that the ensemble model has considerable potential to help radiologists classify fractures accurately and automatically, enabling more effective diagnostic workflows.

Confusion matrix for EnsembleAttenBoneNet model.
The classification results of the EnsembleAttenBoneNet model across the 10 fracture types, shown in Table 7, demonstrate high accuracy (99.48%) and the model's strength in automated medical image analysis. The model achieves perfect precision, recall, and F1-scores (1.00) for multiple classes, including Avulsion, Comminuted, Fracture Dislocation, Greenstick, Longitudinal, and Pathological fractures, with virtually no false positives or false negatives. Hairline and Spiral fractures score nearly perfectly (precision = 0.98, recall = 1.00, F1-score = 0.99), whereas Impacted Fracture achieves a precision of 1.00 with a slightly lower recall (0.98), reflecting a few misclassified cases. The Oblique fracture class has the lowest recall (0.90) and an F1-score of 0.95, a somewhat higher misclassification rate that is perhaps due to morphological overlap with other fracture types. Nonetheless, the overall classification metrics indicate the efficacy of the ensemble method, which combines ResNet50's deep feature learning with EfficientNet-B3's optimized architecture to improve diagnostic accuracy. The high recall values support the model's ability to detect true fractures, reducing the risk of misdiagnosis. Future studies should improve generalizability through external validation, dataset extension, and more sophisticated augmentation methods in order to address the remaining classification errors. The results point to the clinical utility of deep learning ensembles for aiding radiologists in automated, high-accuracy fracture classification, with the potential to improve diagnostic effectiveness and reduce inter-observer variability in clinical practice. Table 7 presents the classification parameters for the EnsembleAttenBoneNet model.
Classification parameters for the EnsembleAttenBoneNet model.
Comparative analysis on the test dataset
Table 8 summarizes the performance of the three models, ResNet50, EfficientNet-B3, and EnsembleAttenBoneNet, on bone fracture classification. ResNet50 delivers an accuracy of 97.86%, with average precision, recall, and F1-score values of approximately 0.979, reflecting effective classification and the advantage of its deep residual connections. EfficientNet-B3 is slightly better at 98.56% accuracy, with high precision and recall (0.989 and 0.984, respectively) and a well-balanced F1-score of 0.986, which makes it well suited to limiting false positives and false negatives. EnsembleAttenBoneNet outperforms both, with an accuracy of 99.48%, an average precision of 0.996, and an F1-score of 0.992, demonstrating the benefit of combining multiple models with an attention mechanism that focuses on significant features. This comparison shows that all three models carry out bone fracture classification with high precision and recall.
Comparative analysis of transfer learning models with the EnsembleAttenBoneNet model.
Ablation study
An ablation study was conducted to assess the classification performance of three deep CNN architectures, ResNet50, EfficientNet-B3, and the proposed EnsembleAttenBoneNet, on a multi-class bone fracture dataset. Evaluation metrics included precision, recall, F1-score, and overall accuracy for each fracture class. The ResNet50 model achieved a classification accuracy of 97.86%, with strong performance in most classes, although relatively lower F1-scores were observed for Fracture Dislocation (0.94) and Spiral Fracture (0.92). EfficientNet-B3 outperformed ResNet50 with an accuracy of 98.56%, with F1-scores of 0.97 for both Fracture Dislocation and Pathological Fracture. The proposed EnsembleAttenBoneNet model was the most accurate, with an accuracy of 99.48% and the highest precision and recall across the classes. It achieved perfect F1-scores (1.00) for most fracture types, with minor differences for Hairline (0.99), Impacted (0.99), and Oblique fractures (0.95). These results highlight the usefulness of the ensemble-attention architecture in improving feature representation and diagnostic accuracy in bone fracture classification tasks. Table 9 presents the relative performance analysis of ResNet50, EfficientNet-B3, and the proposed EnsembleAttenBoneNet.
Comparative performance evaluation of ResNet50, EfficientNet-B3, and the proposed EnsembleAttenBoneNet across 10 bone fracture classes using precision, recall, F1-score, and overall accuracy as evaluation metrics.
Statistical significance testing with McNemar’s test
Although the proposed EnsembleAttenBoneNet achieved the highest raw accuracy (99.48%) compared with ResNet50 (97.86%) and EfficientNet-B3 (98.56%), statistical testing was carried out to determine whether these differences were significant. Using McNemar’s test, we compared the predictions of the models on the same test set. For EnsembleAttenBoneNet versus ResNet50, the contingency table (Table 10) showed 136 cases where both models were correct, 3 cases correctly classified only by EnsembleAttenBoneNet, and 1 case correctly classified only by ResNet50, resulting in a test statistic of χ² = 1.0 with p = 0.625. Similarly, for EnsembleAttenBoneNet versus EfficientNet-B3, the contingency table (Table 11) showed 137 cases where both were correct, 2 cases uniquely correct for EnsembleAttenBoneNet, and 1 case uniquely correct for EfficientNet-B3, yielding χ² = 1.0 with p = 1.0. In both comparisons, the p-values exceeded 0.05, indicating that the observed differences in accuracy were not statistically significant at the 5% significance level. This suggests that although the proposed ensemble model scores consistently higher on raw performance metrics, the improvements cannot be conclusively distinguished from chance variation, given the limited size of the available test set.
Contingency tables for McNemar’s test for the proposed model versus the ResNet50 model.
Contingency tables for McNemar’s test for the proposed model versus the EfficientNet-B3 model.
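As a rough check on the comparisons above, McNemar's test can be computed directly from the two discordant counts (cases where exactly one model is correct). The sketch below assumes the exact two-sided binomial form for the p-value and the uncorrected form for the χ² statistic; the study does not state which variant was used, and the function name `mcnemar` is illustrative.

```python
from math import comb


def mcnemar(b, c):
    """McNemar's test from discordant counts.

    b = cases only model A classified correctly,
    c = cases only model B classified correctly.
    Returns (uncorrected chi-square statistic, exact two-sided p-value).
    """
    n = b + c
    if n == 0:
        return 0.0, 1.0  # no disagreements, nothing to test
    chi2 = (b - c) ** 2 / n
    k = min(b, c)
    # Under H0 the discordant cases split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return chi2, p_exact


# Discordant counts as stated in the text above.
chi2_resnet, p_resnet = mcnemar(3, 1)   # EnsembleAttenBoneNet vs ResNet50
chi2_effnet, p_effnet = mcnemar(2, 1)   # EnsembleAttenBoneNet vs EfficientNet-B3
print(chi2_resnet, p_resnet)
print(round(chi2_effnet, 3), p_effnet)
```

Under these assumptions the exact p-values come out as 0.625 and 1.0, matching the reported values. The uncorrected statistic is 1.0 for the first comparison and about 0.33 for the second when computed from the stated counts. With only three or four discordant cases, the test has very little power, which is consistent with the caution expressed about the test set size.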
Performance evaluation on the second dataset
In addition to the first dataset, we also evaluated the proposed EnsembleAttenBoneNet model on the Bone Break Classifier Dataset, 22 which contains 12 distinct classes of bone fractures: avulsion, comminuted, compression-crush, fracture dislocation, greenstick, hairline, impacted, intra-articular, longitudinal, oblique, pathological, and spiral fractures. As summarized in Table 12, the model achieved an overall classification accuracy of 98.31%, with precision, recall, and F1-scores consistently above 0.95 across all classes. The 95% confidence intervals reported alongside these metrics quantify the uncertainty of the estimates and support the conclusion that the model generalizes to this dataset. The consistently high performance across both datasets highlights the effectiveness of our ensemble attention-based framework and its potential for real-world deployment in automated fracture detection.
Classification results on the second dataset.
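To illustrate how a 95% confidence interval for a proportion such as accuracy can be obtained, the sketch below uses the Wilson score interval. This is an assumption for illustration: the study does not state which interval method was used, and the counts (290 correct out of 295) are hypothetical values chosen only to give an accuracy near 98.3%.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts (not taken from the study): 290 correct of 295 test images.
low, high = wilson_interval(successes=290, n=295)
print(f"accuracy = {290 / 295:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

The interval narrows as the test set grows, so the same point accuracy carries more evidential weight on a larger test set.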
State-of-the-art analysis
Deep learning has substantially improved the sensitivity and specificity of fracture detection and classification in medical imaging. Prior studies have used architectures ranging from single CNN models to hybrids that combine several models to improve diagnostic performance. VGG16 has been commonly used for bone fracture classification, reaching 85.0% and 85.8% accuracy on MURA-V1 and MURA-V1.1, respectively, for normal versus abnormal classification. EfficientNetB0 with Exemplar features, NCA, and SVM was also applied to the MURA dataset, with class-wise accuracy ranging from 89.4% to 92.6%. A Hypercolumn-CBAM architecture combining EfficientNetB0 and DenseNet169 reached an accuracy of 87.5%. Hybrid deep learning methods further improved classification accuracy, with one model achieving 93.41% accuracy in distinguishing positive from negative fracture cases. Ridge regression for osteoporosis diagnosis on MURA achieved 85.3% accuracy. Furthermore, CNN and RCNN models performed strongly on the Bone X-ray Image Dataset, with 98.0% accuracy and an F1-score of 0.972. Other work has investigated feature-based classifiers such as SVM with GLRLM-GLDM-DS, which attained an F1-score of 0.975, and DenseNet121 on Local PACS, with 89.0% accuracy and an ROC of 0.926. The proposed EnsembleAttenBoneNet, an ensemble of ResNet50 and EfficientNetB3, achieves 99.48% accuracy, exceeding the accuracies reported for these methods. Table 13 presents the state-of-the-art comparison.
State-of-the-art comparison.
Advances in AI and transfer learning for fracture classification
Recent advancements in artificial intelligence and transfer learning have significantly improved automated medical image analysis. Xiong et al. 27 introduced multichannel feature fusion networks for biomedical signal classification, demonstrating the importance of combining multiple feature representations for accurate diagnosis. Similarly, Xiang et al. 28 proposed a multimodal masked autoencoder with adaptive masking, showing improved classification performance on medical image datasets. Extending deep learning applications to radiology, Guan et al. 29 developed an enhanced CNN for arm fracture detection in X-rays, while Song et al. 30 introduced a transformer-based segmentation model capable of capturing complex structural patterns in dental images. In earlier work, Kim and MacKinnon 31 explored transfer learning for fracture detection and demonstrated the value of pre-trained convolutional networks in improving accuracy with limited datasets. Luan et al. 32 applied deep learning to super-resolution ultrasound imaging, and Tanzi et al. 33 established a strong deep learning baseline for X-ray bone fracture classification. Lee et al. 34 extended this with a meta-learned neural network model for femur fracture classification using pelvic X-rays. Complementary studies, such as Yu et al., 35 have enhanced denoising and localization performance in ultrasound imaging through deep neural filtering techniques, which are methodologically relevant to X-ray enhancement tasks. Sahin 36 integrated machine learning and image processing for bone fracture detection and classification, while Yang et al. 37 proposed an explainable ensemble learning framework using transfer learning for Optical Coherence Tomography (OCT) detection. Hardalaç et al. 38 focused on wrist X-ray fracture detection using deep object detection models, and Oka et al. 39 applied AI to diagnose distal radius fractures from biplane X-rays with high precision. In parallel, Song and Yang 40 utilized ant colony-based optimization for Magnetic Resonance Imaging (MRI) segmentation, highlighting the role of bio-inspired algorithms in medical imaging. Ali et al. 41 applied traditional machine learning to long bone fracture classification, while Alam et al. 6 recently introduced a transfer learning-based framework for radiographic bone fracture detection that outperformed existing state-of-the-art methods. Collectively, these studies highlight the growing impact of deep learning and transfer learning techniques in advancing automated bone fracture classification and improving diagnostic accuracy in medical imaging.
Conclusion
This study introduced EnsembleAttenBoneNet, an advanced deep learning-based framework for automated bone fracture classification using X-ray images. By integrating fine-tuned ResNet50 and EfficientNetB3 models with an SE attention mechanism, the approach significantly enhances feature extraction and classification accuracy. The model achieved a classification accuracy of 99.48%, surpassing standalone models (EfficientNetB3: 98.56%, ResNet50: 97.86%), demonstrating the effectiveness of ensemble learning with attention mechanisms in medical image analysis. The robust feature extraction improved interpretability and reduced misclassification, addressing key challenges such as dataset variability and class imbalance through extensive data augmentation and fine-tuning strategies. The confusion matrix analysis confirmed the model’s ability to accurately distinguish between 10 distinct fracture types, reducing diagnostic errors and enhancing clinical decision support. While the framework shows exceptional performance, further research is needed for validation across larger and more diverse datasets to ensure real-world applicability. Future work will focus on compressing the proposed model through approaches such as pruning, knowledge distillation, and INT8 quantization to significantly reduce the parameter size and inference latency. The optimized lightweight version will be evaluated on edge devices, including NVIDIA Jetson platforms, with FPS and power consumption reported to ensure feasibility for real-time clinical deployment without compromising accuracy. Future extensions of this research will explore validation on real-world clinical streams in collaboration with healthcare institutions, but for the scope of this study, the use of a recognized benchmark dataset provides sufficient validity.
Although the proposed EnsembleAttenBoneNet model achieved outstanding performance in bone fracture classification, several limitations should be acknowledged. The dataset used in this study was limited in size and sourced primarily from a single publicly available repository, which may not fully capture the variability present in real-world clinical imaging across different hospitals and equipment settings. Consequently, model generalizability to unseen clinical data requires further validation. Additionally, the study focused on static X-ray images, whereas incorporating multimodal data such as CT or MRI scans could enhance diagnostic robustness. Future work will aim to expand the dataset with multi-institutional and multi-modal samples, integrate explainable AI techniques such as Grad-CAM for improved interpretability, and evaluate the system in real-world clinical workflows to ensure broader applicability and clinical reliability.
While the proposed EnsembleAttenBoneNet model demonstrates exceptional accuracy in automated bone fracture classification, certain limitations remain. The dataset was limited to a single public source, which may not represent the full diversity of radiographic variations encountered in real-world clinical environments. Consequently, external validation using multi-institutional data is essential to confirm the model’s generalizability. Despite these limitations, the model holds strong potential for clinical integration. It can assist radiologists by reducing diagnostic workload, minimizing inter-observer variability, and providing consistent, rapid, and standardized fracture detection. Future research will focus on deploying the model in real-time clinical settings and extending its use to additional imaging modalities for comprehensive diagnostic support.
Supplemental Material
Supplemental material, sj-docx-1-tab-10.1177_1759720X251405099 for A transfer learning–based approach for automated bone fracture classification in X-ray imaging by Ruchika Bhuria, Sheifali Gupta, Rania M. Ghoniem, Jaibir Singh, Suman Rani, Belayneh Matebie Taye and Salil Bharany in Therapeutic Advances in Musculoskeletal Disease
Footnotes
Acknowledgements
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R138), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Declarations
Supplemental material
Supplemental material for this article is available online.
