Abstract
Background
The well-known drawer tests to assess glenohumeral laxity and instability have shown appropriate reliability, although analysed mainly in healthy subjects.
Objective
To evaluate the intra- and inter-rater reliability of anterior and posterior drawer tests in subjects with symptoms of shoulder instability.
Design
A clinometric study of the intra- and inter-rater reliability of drawer tests was carried out following the COSMIN recommendations and the GRRAS checklist.
Setting
Centres with equipped facilities for assessments.
Participants
There were 105 participants (69 male/36 female) aged 18 to 60 years with instability symptoms in at least one shoulder. Each participant underwent bilateral assessments. The sample consisted of 210 shoulders, both unstable and healthy.
Intervention
Anterior and posterior drawer tests.
Main measures
Humeral translations were assessed using drawer tests and graded with the Hawkins scale, the modified Hawkins scale and a dichotomised result (positive/negative). Two sessions were performed (seven- to fourteen-day washout period): each patient was evaluated by two examiners in the first session and by one of them in the second. Reliability was analysed with weighted Kappa.
Results
The intra-rater reliability of the anterior and posterior drawer tests was excellent (weighted Kappa = 1) with the Hawkins scale. Inter-rater reliability was good for the anterior drawer: weighted Kappa = 0.76 (95% confidence interval: 0.67–0.85) with the Hawkins scale, weighted Kappa = 0.78 (95% confidence interval: 0.69–0.87) with modified Hawkins, and weighted Kappa = 0.80 (95% confidence interval: 0.71–0.89) dichotomising; and for the posterior drawer: weighted Kappa = 0.62 (95% confidence interval: 0.52–0.72), weighted Kappa = 0.67 (95% confidence interval: 0.57–0.78), and weighted Kappa = 0.70 (95% confidence interval: 0.59–0.80), respectively.
Conclusion
Drawer tests demonstrated excellent intra-rater and good inter-rater reliability in subjects with symptoms of shoulder instability.
Introduction
Glenohumeral instability, as well as its assessment, represents a challenge in the clinical and research setting. 1 Such instability is related to higher grades of shoulder laxity. 1 Laxity is considered a risk factor for multidirectional glenohumeral instability.2,3 A relationship between generalised ligament laxity and traumatic shoulder instability has even been proven. 4 Consequently, assessment and/or diagnosis of glenohumeral laxity is highly recommended to prevent or diagnose instability. 1
The diagnosis of shoulder instability is mostly based on clinical history and manual glenohumeral laxity tests, 5 which are sometimes combined with imaging to assess the integrity of musculoskeletal structures. 6 These include x-ray imaging (standard or stress imaging), 7 ultrasound, 6 magnetic resonance imaging 8 and even magnetic resonance arthrography after intra-articular injection of contrast, which has shown great diagnostic accuracy for labral injuries, 9 frequent in patients with instability. However, these methods reveal the state of the structures but not their function. In addition, they are costly, 10 are often not immediately available to clinicians and may carry risks, such as the radiation emitted by X-rays. 11
In relation to the specific physical assessment tests for glenohumeral laxity, the well-known anterior and posterior drawer tests used in the assessment of humeral translation and described for the first time by Gerber et al. 12 stand out. They are frequently used by clinicians and researchers because they are easy to perform, accessible and practical. 13 In addition, they offer positive evidence at the levels of sensitivity,14–16 specificity14–16 and inter-rater reliability17,18 in pathological groups (such as rotator cuff tears,15,16 impingement syndrome15,16). However, reliability studies19–22 were conducted mainly with healthy participants and with smaller sample sizes than recommended.23,24
Thus, this study aimed to evaluate the intra- and inter-rater reliability of the anterior and the posterior drawer tests in subjects with symptoms of shoulder instability.
Methods
The study design consisted of a clinometric analysis of the intra- and inter-rater reliability of anterior and posterior drawer tests. A flow diagram is shown in Supplementary Material. The research was based on the consensus-based standards for the selection of health measurement instruments (COSMIN). 25 It was approved by the Ethics Committee of the Virgen Macarena-Virgen del Rocío Hospitals of the Andalusian Public Health System (No. 1267-N-21) in accordance with the Helsinki Declaration. 26
The Guidelines for Reporting Reliability and Agreement Studies (GRRAS) 27 checklist was considered.
Participants were chosen by non-random convenience sampling from University, clinical, and sports centres in Seville and Cadiz. Inclusion criteria were: (a) persons with symptoms of instability in at least one shoulder with or without a clinical diagnosis, although both shoulders were always assessed (see next paragraph); (b) aged between 18 and 55 years. 28 Exclusion criteria: (a) subjects with musculoskeletal shoulder pathologies not associated with possible instability; (b) cognitive impairment that affected following the clinician's instructions.
The study considered the two shoulders of each participant as a sample, that is, it included asymptomatic shoulders in order to cover all grades of the laxity scales used (range zero to three, where grades zero and one are associated with no laxity).
The assessment of the glenohumeral laxity, specifically of humeral translation, was carried out by means of the anterior and posterior drawer tests. The patient is placed in the supine position and the physical therapist stands on the side of the shoulder to be evaluated. 12 Figure 1 shows the execution of the anterior drawer test and Figure 2 shows the execution of the posterior drawer test. The displacement of the humerus over the scapula can be easily appreciated and graded. 12

The anterior drawer test. 12 The examiner holds the patient's forearm with the elbow slightly flexed and relaxed. The shoulder is held between 80° and 120° of abduction, 0° and 20° of flexion, and 0° and 30° of external rotation. The examiner fixes the scapula with their medial hand. The lateral hand contacts the head of the humerus applying an anterior force that causes translation of the humerus over the scapula.

The posterior drawer test. 12 The medial hand stabilises the scapula. The examiner's lateral hand holds the patient's forearm with the elbow flexed to 120°. The shoulder is held between 80° and 120° abduction and 20° flexion. A posterior force is applied to the head of the humerus.
The degrees of translation were recorded according to the Hawkins scale19,20,29 and the modified Hawkins scale.19,20,30 The Hawkins scale ranges between zero and three: grade zero, no or minimal translation; grade one, translation of the humeral head to the glenoid but not over the rim; grade two, humeral head translates over the glenoid rim but does not lock; and grade three, humeral head locks out over the rim.29,31 The modified Hawkins scale1,30 equates grade zero with grade one without affecting clinical assessment, 30 and improves intra- and inter-rater reproducibility. 19 In addition, the results of both drawer tests were dichotomised based on the reliability study of clinical shoulder tests (e.g. load-and-shift) by Eshoj et al. 32 : grades zero and one of the Hawkins scale were considered negative, and grades two and three positive.
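The three grading schemes described above can be expressed as simple mappings from the original Hawkins grade. The following Python sketch is only an illustration of those rules; the names `Hawkins`, `to_modified_hawkins` and `dichotomise` are hypothetical and are not part of the study materials.

```python
from enum import IntEnum


class Hawkins(IntEnum):
    """Original Hawkins grades as described in the text (hypothetical names)."""
    NO_OR_MINIMAL = 0  # no or minimal translation
    TO_RIM = 1         # head translates to the glenoid rim but not over it
    OVER_RIM = 2       # head translates over the rim but does not lock
    LOCKED = 3         # head locks out over the rim


def to_modified_hawkins(grade: int) -> int:
    """Modified Hawkins: grade zero is merged into grade one (range 1 to 3)."""
    g = Hawkins(grade)  # raises ValueError for grades outside 0 to 3
    return max(int(g), 1)


def dichotomise(grade: int) -> str:
    """Grades zero and one are negative; grades two and three are positive."""
    g = Hawkins(grade)
    return "positive" if g >= Hawkins.OVER_RIM else "negative"
```

For example, an original grade of zero becomes grade one on the modified scale and "negative" when dichotomised, whereas a grade of two is unchanged on the modified scale and "positive" when dichotomised.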
Fieldwork was carried out in equipped facilities in Seville and Cadiz, that is, climate-controlled assessment rooms with treatment couches, ergonomic cushions, portable dividing screens to safeguard the privacy of the participants, tables, chairs and adjoining dressing rooms. After reading the information sheet and signing the informed consent form, descriptive data on affiliation, age, weight, height, body mass index, clinical diagnosis, symptomatology and surgical intervention were collected.
Subsequently, the anterior and posterior drawer tests were performed on the assessment couches with cushions to keep the patient in supine decubitus with knees and hips semi-flexed. Participants had to undress from the waist up except for a bra or top, leaving the shoulder girdle visible.
Two assessment sessions were performed, leaving a washout period of 7 to 14 days, so as not to cause substantial changes in the joint. 33 In addition, participants confirmed that their shoulder condition had not changed in the second session.
In the first session, each patient was evaluated by two examiners (RA and MB) – both physiotherapists with clinical experience in shoulders and prior rigorous training in the physical tests employed. The execution and results obtained by one examiner could not be known by the other. The anterior drawer test was performed first and then the posterior drawer test.
In the second session, in order to obtain data for intra-rater reliability, examiner RA repeated the assessments following the same guidelines.
After the fieldwork, all the data collected – descriptive data and from the physical tests – were transferred to an Excel matrix for subsequent analysis.
Statistical analysis
Sample size influences the accuracy of reliability estimates. 24 Thus, De Vet et al. 23 recommend including at least 50 patients to complete a two-by-two table. Moreover, considering the sample size in shoulders, for a 95% confidence level (that is, a 5% alpha error) and a precision of 3.5%, at least 203 shoulders are required. 34 This study doubled the suggested minimum number of participants (105) and exceeded the required number of shoulders (210).
As for the sample descriptives, absolute (N) and relative (%) frequencies were considered for qualitative variables. For quantitative variables, normality was assessed using the Kolmogorov–Smirnov test, taking the mean and standard deviation for parametric variables, and the median and interquartile range for nonparametric variables.
The observed proportions and the expected proportions by chance were calculated for the anterior and posterior drawer tests.35,36 The Kappa index was used to assess the level of agreement between the examiners in both tests. 37
The observed and the expected proportions by chance, as well as the level of agreement, were calculated for both the weighted and the unweighted forms.35,36
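As an illustration of these quantities, the sketch below computes the observed agreement, the agreement expected by chance and the unweighted Kappa from two lists of grades. It is a minimal example under our own assumptions (the function names and the toy ratings are hypothetical), not the SPSS procedure used in the study.

```python
from collections import Counter


def agreement_proportions(rater_a, rater_b):
    """Return (observed, expected-by-chance) agreement proportions."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of shoulders given the same grade by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions per grade.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return observed, expected


def cohen_kappa(rater_a, rater_b):
    """Unweighted Kappa: (observed - expected) / (1 - expected)."""
    observed, expected = agreement_proportions(rater_a, rater_b)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)


# Toy data (invented for illustration only): Hawkins grades for six shoulders.
rater_1 = [0, 1, 1, 2, 2, 3]
rater_2 = [0, 1, 2, 2, 2, 3]
print(agreement_proportions(rater_1, rater_2))
print(cohen_kappa(rater_1, rater_2))
```

With identical ratings from both raters the observed proportion is 1 and Kappa equals 1, which corresponds to perfect agreement.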
Intra- and inter-rater reliability was obtained using weighted Kappa to take into account the different levels of disagreement between categories. 38 In this study, the categories are zero to three in Hawkins and one to three in modified Hawkins. The interpretation of weighted Kappa was: 0, no reliability; 0.01–0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, good; and 0.81–1.00, excellent. 39 Statistical analysis was performed using IBM SPSS STATISTICS software version 29.
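To make the weighting idea concrete, the following sketch computes a weighted Kappa from disagreement weights and maps the result onto the interpretation bands listed above. The text does not state which weighting scheme was used, so linear weights are an assumption here (quadratic weights are offered as an alternative); the function names, the toy ratings and the treatment of values at or below zero as "no reliability" are likewise our own illustrative choices.

```python
def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Kappa = 1 - (observed disagreement / chance disagreement)."""
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two ordered categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Joint proportion table: rows are rater A grades, columns rater B grades.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[index[a]][index[b]] += 1 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    power = 1 if weights == "linear" else 2

    def w(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the most distant grades.
        return (abs(i - j) / (k - 1)) ** power

    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if d_exp == 0:
        raise ValueError("Kappa is undefined when both raters use one category")
    return 1 - d_obs / d_exp


def interpret(kappa):
    """Bands from the text; values at or below zero are treated as no reliability."""
    if kappa <= 0:
        return "no reliability"
    bands = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "good"), (1.00, "excellent")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("Kappa cannot exceed 1")


# Toy data (invented for illustration only): one disagreement by a single grade.
rater_1 = [0, 1, 1, 2, 2, 3]
rater_2 = [0, 1, 2, 2, 2, 3]
value = weighted_kappa(rater_1, rater_2, categories=[0, 1, 2, 3])
print(round(value, 3), interpret(value))
```

Because a disagreement between adjacent grades is penalised less than one between distant grades, the weighted value for these toy ratings is higher than the unweighted Kappa of the same data. With only two categories, as in the dichotomised result, weighted and unweighted Kappa coincide.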
Results
The sample consisted of 210 shoulders from 105 participants, 69 men and 36 women, with symptoms of instability in at least one shoulder (see inclusion criteria). The descriptions of the participants (age, weight, height, body mass index) and the shoulder sample (clinical diagnosis, symptomatic and surgical treatment) are shown in Table 1.
Descriptive characteristics of the participants and the assessed shoulders.
BMI: body mass index; CI: confidence interval; Med: median; IQR: interquartile range.
Note: sample comprising shoulders.
The intra-rater reliability of both the anterior and posterior drawer tests, based on one examiner's ratings (RA), was excellent (weighted Kappa = 1; which did not enable the calculation of the confidence interval) when using the Hawkins scale. Therefore, the same result was obtained with the modified Hawkins and dichotomising.
Table 2 shows the results of anterior and posterior drawer tests of both examiners.
Results of anterior and posterior drawer tests.
CI: confidence interval.
Grades of Hawkins scale. 29
The inter-rater reliability was good for the anterior and posterior drawer tests, being slightly higher in the anterior drawer test. In this case, the weighted Kappa values increased from 0.76 (95% confidence interval: 0.67–0.85) with the Hawkins scale to 0.80 (95% confidence interval: 0.71–0.89) when the scale was dichotomised. For the posterior drawer test, the values increased from 0.62 (95% confidence interval: 0.52–0.72) to 0.70 (95% confidence interval: 0.59–0.80) (Table 3).
Inter-rater reliability analysed through weighted Kappa.
The values in brackets correspond to the 95% confidence interval.
Additionally, Table 4 shows the observed proportions, the expected proportions by chance, and the Kappa index values, both weighted and unweighted, for the anterior and posterior drawer tests at the first assessment session. The anterior drawer test achieved good results with both the weighted Kappa (weighted Kappa = 0.76) and the unweighted Kappa (Kappa = 0.75). The posterior drawer test also showed good results, although slightly lower values (weighted Kappa = 0.62 and Kappa = 0.61, respectively).
Observed proportion, expected proportion by chance, and Kappa Index, weighted and unweighted.
Discussion
This study analysed the intra- and inter-rater reliability of anterior and posterior drawer tests for assessing glenohumeral laxity in subjects with symptoms of instability of at least one shoulder with or without a clinical diagnosis, whether or not they had undergone surgery. The main findings showed excellent intra-rater reliability (weighted Kappa = 1) for both drawer tests, even with the Hawkins scale, and good inter-rater reliability.
As for intra-rater reliability, Morita et al. 20 found a similar result for the anterior drawer with one rater (weighted Kappa = 0.861, calculated by us), and lower results than ours for the posterior drawer. One rater had good (weighted Kappa = 0.796) and excellent (weighted Kappa = 0.867) reliability with the Hawkins scale and its modification, 20 respectively, while other, less experienced raters had moderate (weighted Kappa = 0.587) and good (weighted Kappa = 0.678) reliability. Levy et al. 19 calculated intra-rater reliability with four raters, reporting lower reliability than our study and that of Morita et al., 20 using Kappa instead of weighted Kappa. As with Morita et al., 20 the lowest values came from the least experienced rater. The reliability was at most moderate (Kappa < 0.5) for three raters with the Hawkins scale and its modification. However, the authors noted that laboratory conditions could have negatively affected their results. 19
The inter-rater reliability obtained for the drawer tests was good, using both the Hawkins scale (anterior drawer: weighted Kappa = 0.76; posterior drawer: weighted Kappa = 0.62) and its modification (anterior drawer: weighted Kappa = 0.78; posterior drawer: weighted Kappa = 0.67). Similar results were reported by Morita et al. 20 for the anterior drawer. In contrast, inter-rater reliability for the posterior drawer with the modified Hawkins scale was only moderate according to Morita et al. (weighted Kappa = 0.428) 20 and Levy et al. (Kappa > 0.5). 19 Levy et al.'s results carry more weight because four raters were involved. Moreover, our study complemented these findings by dichotomising the Hawkins scale based on Eshoj et al., 32 which increased the inter-rater reliability of both drawer tests, to almost excellent for the anterior drawer (weighted Kappa = 0.80).
The evidence of the reliability of the drawer tests has been analysed mainly in healthy subjects.19–22 However, this study included unilateral and bilateral instability symptoms, although both shoulders were always included to increase the heterogeneity of the sample, that is unstable, lax (without instability symptoms) and stable shoulders. Given the relationship between shoulder instability and glenohumeral laxity,1–3 and that the drawer tests assess the latter, 12 analysing their reliability in this population is crucial.
As for the Hawkins scale, a score of zero could not be expected in our population. However, the assessed asymptomatic shoulders tended to be lax except for traumatic instability, based on the relationship between shoulder instability and generalised laxity. 4
Regarding sample sizes, unlike other studies,17–22 we exceeded the required and doubled De Vet et al.'s 23 recommendation of ≥50 participants, ensuring robust reliability evidence.
This study and McFarland et al. 35 advocate the modified Hawkins scale for grading humeral translation, as merging grade zero (rarely obtained in our sample) with grade one (frequent in the absence of instability) does not affect clinical assessment, 40 but increases inter-rater reliability 19 by avoiding confusion. 30 Nevertheless, when we compared the original Hawkins scale with the modified one, we found only a minimal improvement with the latter, smaller than the improvement reported by Levy et al., 19 from 47% to 78%.
On the other hand, some reliability studies of drawer tests19,20,22 employed the Kappa index to assess agreement, 38 whereas this study used the weighted Kappa because of the multiple response options. It reduces the error between the observed and the expected proportions by chance. 37 It considers the levels of disagreement between categories and the size of the differences,41,42 providing more consistent information for reliability.
Moreover, the excellent intra-rater and good inter-rater reliability were reinforced by other evidence 7 on the validity of diagnostic tools for laxity and/or instability. Thus, the stress radiography obtained a significant correlation with the anterior drawer. 7
Manual tests may be influenced by the experience, skill and sensitivity of raters, 13 as shown by Levy et al. 19 and Morita et al. 20 Many arthrometers were developed to measure glenohumeral laxity, but the discrepancies in the amount of force to be applied and the patient position lead to inconclusive findings. 13 Thus, manual laxity tests are still relevant.
Clinicians and researchers use the anterior and posterior drawer tests because they are simple, accessible and useful, with moderate sensitivity and high specificity.14–16 The original tests by Gerber et al. 12 have undergone modifications and there is no consensus on their execution. Like Gerber et al., 12 we advocate the supine position to ensure better relaxation 40 and reliability, 12 and scapular stabilisation to avoid compensatory movements that interfere with glenohumeral translation. However, Morita et al. 20 did not carry out this stabilisation.
Regarding the study's limitations, intra-rater reliability was assessed by a single rater, and not two as we did for inter-rater, where we obtained more robust evidence. In addition, an even larger sample would have enabled the tests to be applied randomly to further minimise bias.
As to its strengths, the reliability of the drawer tests followed COSMIN and GRRAS checklists; considered a large sample; and the weighted Kappa instead of Kappa. The humeral translation was graded with Hawkins scale, its modification and dichotomised results, enabling comparison between studies. Laboratory conditions were optimal.
A prospective study that analyses and compares the reliability of different physical tests for glenohumeral laxity to obtain the most appropriate, together with the patient's clinical history, would help in the diagnosis of shoulder instability.
Given the excellent intra-rater and good inter-rater reliability obtained, anterior and posterior drawer tests are recommended for the assessment of glenohumeral instability and/or laxity. We suggest their use by a single clinician to assess the progress of unstable shoulders. In other cases, these tests should be complemented with other objective assessment tools.
Clinical messages
Anterior and posterior drawer tests assess unstable and/or lax shoulders reliably and could be complemented by diagnostic imaging.
Drawer tests are appropriate assessment tools for a single clinician to evaluate the progression of shoulder laxity during physiotherapeutic and/or surgical treatments.
The modified Hawkins scale is better than the Hawkins to grade humeral translation.
Supplemental Material
sj-pdf-1-cre-10.1177_02692155251339380 - Supplemental material for Intra- and inter-rater reliability of anterior and posterior drawer tests for the assessment of people with shoulder instability
Supplemental material, sj-pdf-1-cre-10.1177_02692155251339380 for Intra- and inter-rater reliability of anterior and posterior drawer tests for the assessment of people with shoulder instability by Rocio Aldon-Villegas, Gema Chamorro-Moriana, Patricio Lopez-Tarrida and Maria-Luisa Benitez-Lugo in Clinical Rehabilitation
Supplemental Material
sj-docx-2-cre-10.1177_02692155251339380 - Supplemental material for Intra- and inter-rater reliability of anterior and posterior drawer tests for the assessment of people with shoulder instability
Supplemental material, sj-docx-2-cre-10.1177_02692155251339380 for Intra- and inter-rater reliability of anterior and posterior drawer tests for the assessment of people with shoulder instability by Rocio Aldon-Villegas, Gema Chamorro-Moriana, Patricio Lopez-Tarrida and Maria-Luisa Benitez-Lugo in Clinical Rehabilitation
Footnotes
Acknowledgements
The authors would like to acknowledge all the subjects for their participation in this methodological study. The authors would also like to thank the Research Group ‘Area of Physiotherapy CTS-305’ of the University of Seville for their collaboration.
Authors’ contributions
GC and RA conceptualised the idea and designed the study. RA and MB carried out the data collection. RA and PL performed the statistical data analysis. GC and RA wrote the first version of the paper. All authors contributed to the final version. All authors have read and agreed to the published version of the manuscript.
Consent to participate
Informed consent to participate was obtained in written form from all participants.
Consent for publication
Informed consent for publication was obtained in written form from all participants.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Ethical considerations
Ethics Committee of the Virgen Macarena-Virgen del Rocío Hospitals of the Andalusian Public Health System (No. 1267-N-21).
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
