Introduction
Shoulder pain is the third most common musculoskeletal-related reason for seeking medical attention in the United States.1 While the underlying cause of shoulder pain can be highly variable, an association with bursitis has been found; Draghi et al2 found that in people presenting with shoulder pain, regardless of the cause, there was a common association between pain and the presence of bursitis.
Bursae are synovial-lined structures that minimize friction between two or more structures moving against each other.2 The bursa is considered a potential space, seen on ultrasonography (US) as hypoechoic tissue between hyperechoic peribursal fat.2,3 Bursitis refers to swelling or inflammation of the bursa. The term bursitis is often a misnomer, however, as not all cases arise primarily from an inflammatory process; some instead represent non-inflammatory swelling of the bursa.4 In cases of bursitis on US, the bursa appears fluid-filled and is lined with a hyperechoic wall.5
The normal shoulder joint comprises multiple bursae, including the subacromial-subdeltoid (SA-SD) bursa. The SA-SD bursa is composed of two bursae that lie under the deltoid muscle and acromioclavicular joint and overlie the rotator cuff and bicipital groove.6,7 In people presenting with shoulder pain, there is often an association with SA-SD bursitis, regardless of the aetiology.2 Many pathologies may cause SA-SD bursitis, including repetitive stress or overuse, rotator cuff injury, trauma, rheumatoid arthritis, infection and pigmented villonodular synovitis.2,5 The treatment for bursitis is usually conservative, including activity modification, physiotherapy, non-steroidal anti-inflammatory drugs and corticosteroid injections. Surgical resection of the bursa is reserved for treatment-resistant cases. Thus, correctly diagnosing bursitis is important for choosing the correct management for patients with shoulder pain.
Although the findings of bursitis are relatively straightforward, there are no guidelines or classification systems that allow for standardized grading of bursitis. This leads to subjective assessments, which can produce both intraobserver and interobserver variability. For example, Naredo et al8 found that interobserver variability exists between experts in musculoskeletal (MSK) US (a combination of radiologists and rheumatologists), with 84% agreement for diagnosing bursitis on shoulder US. This is most likely attributable to differences in opinion about what constitutes bursitis, mainly whether the presence of inflammation is necessary for the diagnosis. While controlling for these differing opinions in defining bursitis, our aim was to determine whether intraobserver variability exists among fellowship-trained MSK radiologists at our institution. This could provide healthcare providers with information to more confidently choose the correct treatment for their patients presenting with shoulder pain.
Methods
We conducted a retrospective study of patients who were diagnosed with bursitis on shoulder US at our institution between January 1, 2019 and December 31, 2020. Research ethics board approval was obtained. Included patients were between 18 and 69 years of age. Patients with incomplete imaging and full-thickness rotator cuff tendon tears were excluded. A total of 70 patients were analysed, including a small subset of control cases. We collected single sonographic images, with standard window presets, of the SA-SD bursa from our institution's Picture Archiving and Communication System for each patient. Images were acquired by MSK-trained sonographers. These single images were randomized to form a 'test-bank' of varying degrees of shoulder bursitis.
The test-bank was administered to all participating fellowship-trained MSK radiologists (N = 10) within Hamilton in the form of an electronic document (Microsoft PowerPoint presentation). The participants were asked to grade each case as: within normal limits, mild, moderate, or severe. Given that no gold standard exists for grading bursitis, the present study did not seek to provide objective measurements for determining each grade. Thus, the participants were asked to grade based on their prior training and experience. The bursa was measured at the widest thickness between the peribursal fat and the superficial margin of the supraspinatus muscle, in a plane parallel to the transducer beam.9 Following the first administration, the test-bank was randomly reordered and readministered 3 months later. The participants were then asked to grade each case again, without knowing the grading previously assigned to each case. The participants were also asked how many years they had been practising MSK radiology.
Data were collected and analysed in Microsoft Excel. Cohen's kappa coefficient was calculated to determine intraobserver variability between the four categorical variables (within normal limits, mild, moderate and severe). A linear regression model was used to assess for correlation between the kappa coefficient as the dependent variable and the radiologist's years of experience as the independent variable. The corresponding P-value and Pearson correlation coefficient (r) were calculated. Statistical significance was declared when P < .05. The Pearson correlation coefficient range was defined as between −1.0 and +1.0, with the sign of the correlation coefficient representing the direction of the relationship. The strength of the correlation was defined based on the absolute value of r as perfect (r = 1.0), strong (.8 ≤ r ≤ 1.0), moderate (.5 ≤ r ≤ .8), weak (.2 ≤ r ≤ .5), very weak (.0 < r ≤ .2) or no association (r = .0).10 Data analysis was performed using SPSS version 28.0 (SPSS Inc., Chicago, IL).
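To illustrate the agreement statistic used above, the following sketch computes an unweighted Cohen's kappa for two hypothetical grading sessions. The case labels and counts are invented for demonstration only and do not come from the study data:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa between two sets of categorical ratings."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: proportion of cases graded identically both times.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement by chance, from the marginal category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical first and second readings of 10 cases.
first  = ["normal", "mild", "mild", "moderate", "severe",
          "mild", "moderate", "normal", "severe", "moderate"]
second = ["normal", "mild", "moderate", "moderate", "severe",
          "mild", "mild", "normal", "severe", "moderate"]

print(round(cohens_kappa(first, second), 2))  # 8/10 observed agreement -> 0.73
```

Note that kappa discounts the agreement expected by chance from the marginal frequencies, which is why 80% raw agreement here yields a kappa of only about .73.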
Results
Table 1. Patient characteristics, level of disagreements and intraobserver variability. Intraobserver variability was measured using the kappa coefficient. Number of disagreements >1 category refers to disagreements between two noncontiguous categories (between within normal limits and moderate, between within normal limits and severe, and between mild and severe). CI, confidence interval; *significant difference (P < .001).
The kappa coefficient with standard error for each participant is demonstrated in Figure 1. To demonstrate the trend of variability with level of experience, the x-axis is ordered by increasing years of experience. There was a moderate positive correlation between years of experience and improved variability (r = .69, P = .026).

Figure 1. Kappa coefficient (measure of intraobserver variability) for each participant, in order of increasing years of experience. Values are expressed as a single kappa value with error bars representing the 95% confidence interval.
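The correlation reported above can be reproduced in principle with a plain Pearson calculation. The sketch below uses invented experience/kappa pairs purely to show the computation; the numbers are not the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two standard-deviation terms.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: years of MSK experience vs. each reader's kappa.
years = [1, 3, 5, 8, 10]
kappas = [0.55, 0.58, 0.60, 0.70, 0.73]
print(round(pearson_r(years, kappas), 2))  # 0.98 for these invented values
```

A positive r indicates that kappa tends to rise with experience; the study's scale would label an r between .5 and .8 (such as the reported .69) a moderate correlation.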
Figure 2 displays the distribution of the grades of severity of bursitis assigned by each radiologist. Although interobserver variability was not measured in this study, this figure further illustrates the known existence of interobserver variability between the different radiologists. Representative cases of patients with each grade of bursitis that were unanimously agreed upon amongst all radiologists are demonstrated in Figure 3.

Figure 2. Distribution of the grades of shoulder bursitis for each participant.

Figure 3. Representative cases of each severity grade of shoulder bursitis.

Discussion
The present study retrospectively examined patients at our institution diagnosed with shoulder bursitis on US. Fellowship-trained MSK radiologists were asked to grade each case of bursitis using a single US image of the SA-SD bursa. Three months after each case was graded, the cases were randomly reordered and re-presented to the radiologists for regrading. This allowed for the assessment of intraobserver variability. This study demonstrated that intraobserver variability exists amongst the radiologists, with a moderate positive correlation of improved variability (increasing reliability) with increasing experience.
At the time of publication, intraobserver variability in grading shoulder bursitis on US had not previously been measured. The present study reports relatively good agreement in grading shoulder bursitis on US, regardless of years of experience. The kappa values ranged from .53 to .91 (Table 1). Although no researchers have previously measured intraobserver variability in grading shoulder bursitis on US, many have reported similar intraobserver variability for different MSK-related pathologies, in different joints, and amongst different sonographic experts (both radiologists and non-radiologists).12-15
Cohen’s kappa is a useful measure of intraobserver reliability. Values range from −1 to +1, where 0 represents the amount of agreement that can be expected from chance and 1 represents perfect agreement between two tests.16 Table 1 displays the individual kappa values for each participant and Figure 1 highlights the trend of increasing kappa with increasing experience. There was a moderate positive correlation of improved variability with increasing years of experience. The most experienced radiologist had the highest kappa of .91, representing disagreement in only 4 of 70 cases (6%). The kappa values of the remaining participants were between .53 and .73, which is similar to previous studies assessing intraobserver variability for different sonographic findings in various joints.12-15
To interpret the strength of agreement for given kappa values, we can separate kappa values into descriptive categories. Landis and Koch11 proposed the following standard: <.00 = poor, .00–.20 = slight, .21–.40 = fair, .41–.60 = moderate, .61–.80 = substantial and .81–1.00 = almost perfect. Similar standards have been proposed,17 albeit with slightly different descriptors. However, the cut-off for each tier is relatively arbitrary, and this must be considered when interpreting the results. The radiologists in the present study ranged from moderate to almost perfect (Table 1), with a moderate positive correlation of improved variability with increasing experience. In the first group of radiologists, with experience ranging from 1 to 5 years (N = 3), there was moderate agreement between tests; whereas in the most experienced group (16–30 years; N = 4), there was one moderate, two substantial and one almost perfect agreement. Although the difference between moderate and almost perfect appears categorically significant, Figure 1 better illustrates how similar the radiologists are in terms of numerical kappa values.
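The tiered interpretation above is simple enough to express as a small helper function. This is only an illustrative sketch; the band boundaries follow the Landis and Koch convention cited in the text:

```python
def landis_koch(kappa):
    """Map a kappa value to its Landis & Koch descriptive category."""
    if kappa < 0.0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial"),
                         (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa must lie in [-1, 1]")

# The study's extreme kappa values span two non-adjacent tiers.
print(landis_koch(0.53), "/", landis_koch(0.91))  # moderate / almost perfect
```

Applied to the study's range of .53 to .91, the function returns "moderate" through "almost perfect", matching the categorical spread described above while the underlying numerical values remain fairly close.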
The magnitude of kappa is influenced by additional factors, including the number of categories and the application of weighting factors to kappa.18 The greater the number of categories, the greater the potential for disagreement between tests. In this case, there were four categories (within normal limits, mild, moderate and severe). In a clinical setting, a disagreement between within normal limits and severe should be more significant than a disagreement between mild and moderate, for example. However, in the present study, only 6 of the 10 radiologists had a disagreement spanning two noncontiguous categories (between within normal limits and moderate, between within normal limits and severe, or between mild and severe), and each of those six radiologists made such a disagreement only once (representing 1.4% of cases).
Although the present study did not measure interobserver variability directly, Figure 2 illustrates the varying distribution of the different grades of shoulder bursitis between the different radiologists. Example cases of each grade of severity that were unanimously agreed upon amongst all participants are provided in Figure 3. We propose that interobserver variability would exist if it were measured in this cohort, as many researchers have shown interobserver variability in diagnosing and grading different MSK pathologies on US imaging.8,12-15,19-23 This is most likely attributable to differences in opinion amongst clinicians about what constitutes bursitis,8 as there is no gold-standard definition of bursitis on US. A clinician's definition and grading of shoulder bursitis is likely influenced by a combination of their prior training and clinical experience. The present study did not seek to establish a consensus definition of bursitis or a grading criterion for shoulder bursitis on US. The current researchers' chief interest is how clinicians would use this information to adjust their clinical practice. Most importantly, what impact does the radiologic impression of bursitis have in comparison with the clinical exam, and how much emphasis is placed on the radiologist's grading of bursitis? Furthermore, do the answers to these questions differ between clinicians (orthopaedic surgeons, rheumatologists, physiatrists, physiotherapists, etc.)?
The present study measured bursal distension on US as the distance between the peribursal fat and the superficial margin of the supraspinatus muscle, along a plane parallel to the transducer beam. We acknowledge that interobserver variability may exist when measuring the SA-SD bursa; future work should therefore assess this prospectively.
Our study has several limitations. It is a single-centre study based on the retrospective assessment of saved US images of the SA-SD bursae, without Doppler assessment. Additional clinical context, including patient characteristics, presenting symptomatology and comorbidities, was not available to the interpreting radiologists. Ultrasonography is an inherently operator-dependent imaging modality; operator dependence was not a primary objective of our study and was minimized by using cases performed by MSK-trained sonographers at a single centre.
Conclusion
This study demonstrates good intraobserver reliability in grading shoulder bursitis on US for all MSK-trained radiologists. Furthermore, there was a moderate positive correlation of improved variability with increasing years of experience. Thus, understanding the inherent intraobserver variability of shoulder US may help clinicians more confidently choose the correct treatment for their patients presenting with shoulder pain.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
