Abstract
Objective:
The objective was to evaluate intrarater and inter-rater reliability of ultrasonography muscle morphology measurements in men and women, considering morphological differences between the genders.
Materials and Methods:
Thirty-two healthy subjects (16 male; 16 female; 26.6 ± 4.9 years) participated in two evaluation days. On day 1, subjects were evaluated by a single experienced rater, who repeated the evaluation on the next testing day along with two other raters. Muscle morphology of quadriceps femoris (QMT), rectus femoris (RFMT), vastus intermedius (VIMT), vastus medialis (VMMT), and vastus lateralis (VLMT) muscle thickness, rectus femoris muscle cross-sectional area (RFCSA), vastus lateralis muscle fascicle length (VLFL), and pennation angle (VLPA) were obtained. Reliability was evaluated by intraclass correlation coefficients (ICCs).
Results:
All intrarater comparisons demonstrated good reliability (ICC >0.90), even after stratification by participant gender (ICC >0.81). Inter-rater comparisons for RFCSA and muscle thickness showed good reliability (ICC >0.76). Inter-rater reliability of VLFL and VLPA was considered insufficient in men (ICC = 0.58 and 0.60, respectively), whereas it was higher in women (ICC = 0.79 and 0.75, respectively).
Conclusion:
Ultrasonography has potential to be used to observe changes in muscle size, but pennation angle and fascicle length should be evaluated by the same rater.
Ultrasonography (US) is a valid, convenient, and common diagnostic method to evaluate muscle architecture (i.e., muscle thickness, fascicle length, and angle of pennation of its fibers),1 –5 which is a determinant of muscle function and force production.6,7 A 2003 systematic review identified several studies that have shown the validity of this method to evaluate muscle size in comparison to the gold standards, magnetic resonance imaging (MRI) and computerized tomography (CT). 8 Since then, other studies have expanded the support for the validity of this method by comparing specific metrics (fascicle length and pennation angle) between US and cadaveric measurements, 9 identifying its association with function in different populations,2,10,11 and systematically reviewing current validity studies specifically for the diagnosis of sarcopenia. 2 The valid and reliable measurement of muscle architecture through time is important for detecting muscle losses or gains (and the associated changes in strength) that can occur during hospitalization12,13 and training,10,11 respectively.
A critical matter when conducting measurements in both research and clinical settings, particularly those where the measurement depends mostly on the rater (i.e., does not require the participant to execute a movement), is the reliability of the results. Reliability is defined as the quality of a measure that produces consistent scores on repeated administrations of a test. 14 Although the terminology of the different kinds of reliability may vary between studies, two are commonly found in the literature: (1) the intrarater reliability, which concerns the results obtained by the same rater at different moments, 15 and (2) the inter-rater reliability, which concerns the results obtained by different raters on the same day.16,17 Common metrics to quantify reliability are intraclass correlation coefficients (ICCs), standard errors of measurement (SEMs), and minimal detectable changes (MDCs), 18 which are used to understand if a change, observed as a result of a rehabilitation program or hospitalization, is true or due to measurement error.
Although several studies have evaluated the reliability of muscle architecture,9,15,16,19 –22 generally finding good to excellent values, the current literature is still limited. The primary limitation is the lack of studies comparing the reliability of measurements taken on both genders. Because females tend to have smaller muscle mass than males, 23 there is a possibility that measurements in men can be less reliable than in women, since a relatively smaller portion of the muscle is visible on US images, increasing the amount of mathematical extrapolation required to calculate fascicle length. Even though several studies have evaluated both males and females,1,9,19 –22,24,25 no study has specifically compared the reliability between genders. A second limitation is that studies typically focus on a few muscles (or muscle portions in the case of the quadriceps) and metrics, although there are many that can be used in clinical practice.19,26 –29 In addition, studies generally evaluate only intrarater reliability,9,19,20 despite the importance of both types when measurements are performed on different days to follow a patient’s progress and may also be conducted by different raters due to staff scheduling in a clinical setting. Finally, there are studies that chose a sample size that was not statistically justified and may have been underpowered to answer the proposed question.1,9,19,22
Because of the importance of being able to accurately measure muscle architecture to follow a patient’s response to disuse or to training programs, it is fundamental to know how the reliability of all portions of the quadriceps muscle is influenced by participant gender. In addition, considering situations in which these measures are carried out by different assessors, both the intrarater and inter-rater reliability should be evaluated. Therefore, the aim of this study was to compare intrarater and inter-rater reliability of measurements of the quadriceps portions of healthy men and women, using different metrics with a sufficient number of individuals determined by sample size calculation. It was hypothesized that, given the appropriate training of the raters, all measures would be reliable, particularly in the female subgroup.
Materials and Methods
This laboratory-based study used a methodology approved by the University’s Research Ethics Committee (CAEE # 36588914.4.1001.5347). The participants were asked to read and sign an informed consent form after all questions about the study were answered by the designated researcher.
Participants
Healthy and physically active men and women were recruited through social networks, advertisements, and in-person invitations at the institution where the study was conducted. None of the participants presented (1) injury to the evaluated lower limb, (2) any cardiovascular disease, or (3) central or peripheral neurological disease. The sample size was chosen to allow a precise estimation of ICCs when the evaluation is repeated three times (three raters), the true ICC being approximately 0.9 based on previous studies,21,25,26,30 and the probability of obtaining a precision of 0.20 (i.e., confidence interval [CI] being ±0.20 from the ICC) being 50%.31,32 The minimal sample size for each group using these criteria was 12. To anticipate possible data loss (and avoid reducing statistical power), 16 male and 16 female participants were recruited.
Procedures
Subjects underwent two assessment sessions, on two different days separated by a week. The evaluations were performed without the cooperation of the participants (they were asked to relax during all procedures), in a small temperature-controlled room (23°C) equipped with a stretcher to simulate a hospital intensive care unit environment. This was done because the study was part of a larger project for the implementation of neuromuscular electrical stimulation in intensive care units. Before the tests, participants rested for 20 minutes in order to redistribute body fluids, 33 while lying on the stretcher in a dorsal decubitus position, resting the right lower limb on a knee extensor board. Subsequently, the US muscle morphology measurements were performed with the muscle at rest.
On one of the testing days, participants were evaluated by a single experienced rater (R1 = six years of experience with US measurements) who repeated the evaluation on the other testing day along with two other raters, who were duly trained by the experienced rater to perform the procedures (R2 = one year of experience, R3 = three months of experience). Both the order of the raters’ evaluations and the order of the days on which participants were evaluated by a single rater or by the three raters were randomized to rule out any influence of the moment at which the subjects were evaluated. The second evaluation by R1 was conducted with the help of a map representing the transducer position in relation to moles, scars, and bone protuberances, which was created on the first day. 11 Raters were blinded to the other raters’ procedures, and all pen marks made on subjects’ skin were removed with alcohol to exclude inter-rater influence across evaluations.
Evaluation of Knee Extensors Morphological Properties
Two-dimensional real-time US was performed using a Vivid-I ultrasound system (GE Healthcare, Waukesha, Wisconsin). Images were recorded with a 44-mm wide linear-array transducer with a transmit frequency of 9 MHz. The transducer was coated with a water-soluble transmission gel, promoting acoustic contact without depressing the skin surface. Optimization parameters (e.g., brightness, gain) were kept the same during all evaluations, with only the depth being modified for different participants.
Images were analyzed using the ImageJ software (National Institutes of Health, USA) by a single experienced analyst, who was not one of the raters. The following measurements were obtained: (1) the rectus femoris muscle cross-sectional area (RFCSA); 34 the muscle thickness of the (2) quadriceps femoris (QMT), 35 (3) rectus femoris (RFMT), (4) vastus intermedius (VIMT), (5) vastus medialis (VMMT), 36 and (6) vastus lateralis (VLMT); 11 (7) the fascicle length of vastus lateralis muscle fibers (VLFL) 11 and (8) its respective pennation angle (VLPA). 11 For each evaluated morphological variable, the mean value of three images was used for statistical analysis.
Rectus Femoris Muscle Cross-Sectional Area Evaluation
Image capture was performed at the 70% level of the RF muscle belly (from the greater trochanter to the knee lateral epicondyle), in the transverse plane (See Figure 1A). The transducer was positioned transversely to the orientation of the muscle belly, with the image depth adjusted so that RF’s aponeuroses, as well as the femur, were visible. The RFCSA was obtained by tracing the muscle perimeter while excluding the aponeuroses and calculating the area of the resultant shape. The unit of measurement for this assessment was cm².
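As an illustration of how the area of a traced perimeter can be turned into cm², the sketch below applies the shoelace formula to polygon vertices given in pixel coordinates. This is a generic approach and not a description of how the analysis software computes the area internally; the function name `traced_area_cm2` and the `pixels_per_cm` calibration value are hypothetical and would come from the image scale.

```python
def traced_area_cm2(vertices_px, pixels_per_cm):
    """Area of a closed polygon traced in pixel coordinates, in cm^2.

    vertices_px: list of (x, y) points along the traced perimeter.
    pixels_per_cm: hypothetical calibration factor read from the image scale.
    """
    n = len(vertices_px)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")
    if pixels_per_cm <= 0:
        raise ValueError("pixels_per_cm must be positive")

    # Shoelace formula: sum the cross products of consecutive vertices.
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices_px[i]
        x2, y2 = vertices_px[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1

    area_px2 = abs(twice_area) / 2.0
    # Convert from square pixels to square centimetres.
    return area_px2 / (pixels_per_cm ** 2)


# Example: a 20 x 10 pixel rectangle at 10 pixels per cm is 2 cm x 1 cm.
print(traced_area_cm2([(0, 0), (20, 0), (20, 10), (0, 10)], 10))  # 2.0
```

The absolute value makes the result independent of whether the perimeter was traced clockwise or counterclockwise.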

Figure 1. Sonograms were obtained from a single participant and demonstrate the following: (A) rectus femoris cross-sectional area (RFCSA); (B) quadriceps muscle thickness (QMT), rectus femoris muscle thickness (RFMT), and vastus intermedius muscle thickness (VIMT); (C) vastus medialis muscle thickness (VMMT); (D) vastus lateralis muscle thickness (VLMT); and (E) vastus lateralis fascicle length (VLFL) and pennation angle (VLPA) analysis.
Muscle Thickness Evaluation
The QMT, RFMT, and VIMT were obtained at 50% of the distance from the greater trochanter to the knee lateral epicondyle. A single vertical measurement was taken in the central portion of each muscle belly, from the superficial to the deep RF’s aponeuroses for RFMT, from the superficial VI’s aponeurosis to the femur for VIMT and from the superficial RF’s aponeurosis to the femur for QMT (See Figure 1B). For the VMMT, images were captured at 70% of the thigh length (See Figure 1C) with the transducer positioned obliquely, longitudinally to the muscle fibers. For VLMT, images were captured at 50% of the thigh length (See Figure 1D), longitudinally to muscle fibers. Muscle thickness (cm) was considered the distance between the superficial and deep aponeuroses, with a single measurement performed for obtaining QMT, RFMT, and VIMT; five equidistant measurements obtained and averaged for the VMMT and three for the VLMT (one at the left, one at the center, and one at the right side of each image).
Evaluation of the Vastus Lateralis Fascicle Length and Pennation Angle
For the evaluation of the VLFL and the VLPA, the same images obtained for the evaluation of the VLMT were analyzed, with one representative muscle fiber being selected for each of the three images. Given that the ultrasound transducer had a small scanning area (44 mm), the VLFL (cm) was calculated using trigonometry 11 considering the VLPA (°) to mathematically extrapolate the trajectories of the structures out of the image (See Figure 1E).
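The exact trigonometric expression is given in the cited reference and is not reproduced in this text. As a minimal sketch of the general idea, the example below assumes a straight fascicle and a flat aponeurosis: the part of the fascicle that leaves the image is estimated from the vertical distance still to be covered and the pennation angle, then added to the visible length. The function and parameter names are hypothetical.

```python
import math


def extrapolated_fascicle_length(visible_length_cm, missing_depth_cm, pennation_angle_deg):
    """Estimate total fascicle length when part of it lies outside the image.

    Assumptions (not taken from the source): the fascicle is straight and the
    aponeurosis it would reach is flat, so the hidden segment is the
    hypotenuse of a right triangle whose opposite side is missing_depth_cm.
    """
    if not 0.0 < pennation_angle_deg < 90.0:
        raise ValueError("pennation angle must be between 0 and 90 degrees")
    hidden_length_cm = missing_depth_cm / math.sin(math.radians(pennation_angle_deg))
    return visible_length_cm + hidden_length_cm


# Example: 5 cm visible, 1 cm of depth still to cover, 30 degree pennation angle.
print(round(extrapolated_fascicle_length(5.0, 1.0, 30.0), 2))  # 7.0
```

Because the hidden segment scales with 1/sin(angle), a small error in a shallow pennation angle produces a comparatively large error in the extrapolated length, which is consistent with the concern about extrapolation raised in the text.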
Statistical Analysis
The Shapiro–Wilk and Levene tests were used to confirm data normality and homogeneity of variance, respectively. To compare RFCSA, QMT, RFMT, VIMT, VMMT, VLMT, VLFL, and VLPA values between men and women, an independent-samples t-test was used. To verify the clinical relevance of any differences found between measurements performed in men and women, effect sizes [Cohen’s d = (M2 − M1)⁄SDpooled] were calculated adopting the following criteria: <0.2: trivial; ≥0.2: small; ≥0.50: moderate; ≥0.80: large. 37
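The effect size calculation and its classification can be sketched as follows. The source does not define SDpooled, so this example assumes the usual pooled standard deviation weighted by each group's degrees of freedom; the function names are hypothetical.

```python
import math


def cohens_d(group1, group2):
    """Cohen's d = (M2 - M1) / SD_pooled for two independent groups.

    SD_pooled is assumed here to be the degrees-of-freedom-weighted pooled
    standard deviation; the source does not state which form was used.
    """
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    sd_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m2 - m1) / sd_pooled


def classify_effect(d):
    """Apply the thresholds stated in the text to the magnitude of d."""
    size = abs(d)
    if size < 0.2:
        return "trivial"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"


# Made-up values for illustration only (not study data).
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(d, classify_effect(d))
```

The classification uses the absolute value of d, so the label does not depend on which group is treated as M1.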
For the evaluation of intrarater and inter-rater reliability, the ICCs and their respective CIs were calculated using the “2, 1” model, 18 as follows:
ICC(2,1) = (MSS − MSE) ⁄ [MSS + (k − 1) × MSE + k × (MST − MSE) ⁄ n]
where MSS is the participants’ mean square, MSE is the mean square error, MST is the trials mean square, k is the number of trials, and n is the sample size. The SEM and MDC were calculated to quantify reliability, according to the formulas provided by Weir. 18
Reliability was considered “good” if ICCs were ≥0.75; 38 ICCs <0.75 were considered “insufficient.” All reliability tests were conducted separately for men and women as well as for the pooled sample. A significance level of p < .05 was adopted for all analyses. Analysis was completed using IBM SPSS software (v 20, IBM Corp., Armonk, New York).
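The reliability statistics described above can be sketched from a two-way ANOVA table as shown below. The ICC follows the standard “2, 1” form. For SEM and MDC, this example assumes SEM = SD × √(1 − ICC) and MDC = SEM × 1.96 × √2, which are among the expressions Weir describes; the text does not show which expressions were actually used, and the choice of SD (sample standard deviation of all scores) is likewise an assumption. Function names are hypothetical.

```python
import math


def icc_2_1(scores):
    """ICC(2,1) for a table with one row per participant and one column per trial."""
    n = len(scores)        # number of participants
    k = len(scores[0])     # number of trials (raters or days)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_trials = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_trials

    mss = ss_subjects / (n - 1)              # participants' mean square
    mst = ss_trials / (k - 1)                # trials mean square
    mse = ss_error / ((n - 1) * (k - 1))     # mean square error
    return (mss - mse) / (mss + (k - 1) * mse + k * (mst - mse) / n)


def sem_and_mdc(scores, icc):
    """SEM and MDC under the assumed forms stated in the lead-in."""
    values = [x for row in scores for x in row]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))
    sem = sd * math.sqrt(max(0.0, 1.0 - icc))  # guard against rounding above 1
    mdc = sem * 1.96 * math.sqrt(2)
    return sem, mdc


def reliability_label(icc):
    """Benchmark used in the text: good if ICC >= 0.75, otherwise insufficient."""
    return "good" if icc >= 0.75 else "insufficient"


# Made-up ratings for illustration only (6 participants x 4 raters, not study data).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(ratings)
sem, mdc = sem_and_mdc(ratings, icc)
print(round(icc, 2), reliability_label(icc))
```

Expressing SEM and MDC as a percentage of the mean, as done in the Results, only requires dividing each by the mean of the scores and multiplying by 100.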
Results
All muscle thickness measurements were higher for the men compared to the women (See Figure 2), with large effect size. All muscle thickness intrarater ICCs were considered good, with only VMMT for men and VLMT for women presenting values lower than 0.90 (0.81 and 0.83, respectively). The SEMs ranged from 1.9% to 4.7% and MDCs ranged from 3.5% to 9.2% of the mean values, with no clear difference between the genders (See Table 1). All muscle thickness inter-rater reliability values were considered good, with only VMMT for women and VLMT for all groups presenting values lower than 0.90 (0.76 and between 0.84 and 0.87, respectively). The SEMs ranged from 1.7% to 4.3% and MDCs ranged from 3.3% to 8.5% of the mean values, also with no clear differences between genders (See Table 2).

Figure 2. Example sonograms obtained from the cohort of female (right) and male (left) participants. (A) rectus femoris at 70% of thigh length; (B) rectus femoris and vastus intermedius at 50% of thigh length; (C) vastus lateralis; (D) vastus medialis. Numbers and ticks to the left indicate centimeters to exemplify scale.
Table 1. Mean and Standard Deviation Values of Ultrasonographic Measurements Acquired on Two Different Days by the Same Evaluator.
Abbreviations: CI, confidence interval; ICC, intraclass correlation coefficient; MDC, minimal detectable change; QMT, quadriceps muscle thickness; RFCSA, rectus femoris cross-sectional area; RFMT, rectus femoris muscle thickness; SEM, standard error of measurement; VIMT, vastus intermedius muscle thickness; VLFL, vastus lateralis fascicle length; VLMT, vastus lateralis muscle thickness; VLPA, vastus lateralis pennation angle; VMMT, vastus medialis muscle thickness.
Significant difference between men and women (p < .0001).
Significant difference between men and women (p < .05).
Effect sizes: a, trivial; b, small; c, moderate; and d, large.
Table 2. Mean and Standard Deviation Values of Ultrasonographic Measurements Acquired on the Same Day by Three Different Evaluators.
Abbreviations: CI, confidence interval; ICC, intraclass correlation coefficient; MDC, minimal detectable change; QMT, quadriceps muscle thickness; RFCSA, rectus femoris cross-sectional area; RFMT, rectus femoris muscle thickness; SEM, standard error of measurement; VIMT, vastus intermedius muscle thickness; VLFL, vastus lateralis fascicle length; VLMT, vastus lateralis muscle thickness; VLPA, vastus lateralis pennation angle; VMMT, vastus medialis muscle thickness.
Significant difference between men and women (p < .0001).
Significant difference between men and women (p < .05).
Effect sizes: a, trivial; b, small; c, moderate; and d, large. †Insufficient reliability.
The RFCSA was greater for men than for women, with large effect size. Intrarater ICCs were good for all groups (>0.94), with SEMs ranging from 5.6% to 6.6% and MDCs ranging from 11% to 12.9% of the mean measured values (See Table 1). Inter-rater ICCs were considered good for all groups, with women presenting a slightly lower value (ICC = 0.82) than men (ICC = 0.91). The SEMs ranged from 5.3% to 10.1% and MDCs ranged from 10.5% to 19.8% of the mean values, with no difference between genders (See Table 2).
The VLFL was not significantly different between men and women (small effect sizes). Intrarater ICCs were considered good (>0.93), with SEMs ranging from 2.4% to 3.4% and MDCs ranging from 4.7% to 6.6% of the mean measured values, being similar for both genders (See Table 1). However, although inter-rater VLFL ICCs were considered good for women (ICC = 0.79), they were insufficient for men and for the pooled sample (ICC = 0.58 and 0.69, respectively). The SEMs ranged from 2.5% to 7.9% and MDCs ranged from 4.8% to 15.4% of the mean, with no differences between genders (See Table 2).
The VLPA was not different between men and women (moderate effect sizes). Intrarater ICCs were considered good for men, women, and pooled samples (>0.88), with SEMs ranging from 2.1% to 4.0% and MDCs ranging from 4.1% to 8.0% of the mean, being similar between genders (See Table 1). Inter-rater VLPA ICCs were considered good for women (ICC = 0.75) but insufficient for men and pooled samples (ICC = 0.60 and 0.66, respectively). The SEMs ranged from 2.2% to 7.4% and MDCs ranged from 4.4% to 14.5% of the mean measured values, with no differences between genders (See Table 2).
Discussion
In this study, the intrarater and inter-rater reliability of different US measurements of the quadriceps portions were evaluated, with a particular focus on differences between participant genders. For women, all measurements were considered good (ICC ≥0.75), whereas for men, reliability was considered insufficient only for the inter-rater comparison of the VLFL and VLPA. These results partially confirm the study hypothesis that all measurements would be reliable, as fascicle length and pennation angle ICCs seemed to be rater- and gender-dependent, but SEMs and MDCs were similar between participants’ genders.
Muscle thickness of the four quadriceps portions and overall quadriceps presented high ICCs for both intrarater and inter-rater comparisons, regardless of participant gender (>0.81 for men and >0.76 for women). These results agree with the current literature that also found high reliability for quadriceps portions intrarater 9,19,24,29,39 –41 and inter-rater24 –26,29,30 comparisons in mixed or exclusively male populations. These studies were also conducted in populations that were healthy,9,26,39,40 hospitalized,25,29,30 or with diabetes mellitus. 24 Similarly, RFCSA also presented good reliability that was very similar to or slightly above literature values, which observed intrarater ICCs between 0.87 and 0.99 1,20,22,40,42 and inter-rater values between 0.79 and 0.99.21,24,42 It is important to note that this muscle was evaluated at 70% of the thigh length, as opposed to the most popular 50%, in order to fit the whole muscle in the image with the available transducer. Transducer position did not seem to be an issue, which is further supported by Lima et al, 22 who found similar reliability when evaluating at 50% or 15 cm proximal to the patellar edge. Overall, the current study’s reliability findings suggest that thickness and RFCSA can be reliable measurements to assess changes in muscle size due to training programs or disuse (e.g., during hospitalization).
The VLFL and VLPA inter-rater reliabilities were considered insufficient when considering all subjects pooled (0.69 and 0.66, respectively) and exclusively men (0.58 and 0.60). Conversely, when evaluating only women, ICCs rose to 0.79 and 0.75, values that were just above the threshold to be considered good. This difference is likely because women presented smaller VL muscles, making it easier to find a representative fascicle to be used for calculating the pennation angle. Furthermore, errors in pennation angle and muscle thickness are particularly important when extrapolating fascicle length using trigonometry, resulting in a higher reliability when the extrapolation is required for a small percentage of the length or not required at all. 23 Although smaller muscle size in women seems a reasonable explanation for their higher reliability in these measurements, other studies that could corroborate this hypothesis by stratifying groups by gender or presenting groups composed exclusively of women were not found.
There are not many studies that have investigated the reliability of these parameters when made by different raters, particularly in VL, as only Chiaramonte et al 26 evaluated pennation angle reliability in this muscle, finding an ICC of 0.95. However, it is not clear how much experience each rater had at the time of the measurements, nor which steps were taken to make sure the raters did not influence each other (e.g., exiting the room or erasing skin marks), all of which could have contributed to increasing the reliability. Inter-rater reliability has also been studied for other muscles. In a series of studies, Cho et al27,28,43 found medial gastrocnemius and tibialis anterior pennation angle ICC values between 0.81 and 0.98 in stroke patients. The only study found where fascicle length inter-rater reliability was measured was the one by König et al, 44 where medial gastrocnemius fascicle length ICC was 0.77, in addition to pennation angles between 0.80 and 0.90. Overall, the current study’s results were lower than those previously found, possibly due to the smaller size of medial gastrocnemius and tibialis anterior in comparison to the vastus lateralis, indicating that care should be taken when using fascicle length and pennation angle, measured by different raters, to make clinical decisions.
Intrarater reliability tends to be higher than inter-rater reliability because a given experienced rater uses a similar technique to identify the structures required to acquire a good sonographic image and may remember the characteristics of the participant from the previous evaluation. Another instrument that can be used to make sure the measurements are consistent is a map. This map can be produced using a transparent sheet where the position of the probe is recorded in relation to anatomical points such as moles, scars, and bony protrusions, making sure it is positioned in the same place in multiple evaluations. However, even using the map, there are other factors that are more difficult to accurately reproduce, such as the angle of the probe relative to the skin surface, so results differ between evaluations and reliability scores are lowered. Nevertheless, the current study findings showed that all the quadriceps measurements evaluated by an experienced rater had good reliability and can be used in research and clinical practice.
Inter-rater reliability can be highly influenced by the raters’ experience, given that experienced raters can easily identify anatomical points and the structures that need to be obtained during the evaluation of the US muscle morphology measurements. In the present study, the three raters had different levels of experience. While R1 had six years of experience with the assessment technique, R2 had one year of experience and R3 had worked with the technique for only three months. When looking at the reproducibility values obtained in comparisons between the most experienced rater and the other two raters individually, it can be observed that only the VLFL and VLPA were more reliable when comparing R1 with R2 (0.85 and 0.72) than when comparing R1 with R3 (0.65 and 0.60). However, when comparing R1 with himself (intrarater), the ICC values were 0.94 and 0.90, suggesting that these differences may have been caused by the raters’ different levels of experience. In addition, as previously discussed, a landmark map may also help to improve reliability by minimizing probe position variation. However, no study has investigated intrarater or inter-rater reliability with and without the use of this map. Nonetheless, what the current findings suggest is that, when evaluating fascicle length and pennation angle, it would be preferable that raters have more than one year of experience with the technique, whereas when evaluating thickness and cross-sectional areas, proper training and a shorter period of experience should be enough to provide reliable measurements.
Limitations
This study has major limitations arising from its design, which poses threats to internal and external validity. The following additional limitations should also be noted when interpreting these results: (1) the quadriceps muscle was chosen because it can represent well the participants’ muscle characteristics, given it is the largest muscle in the body and is highly associated with patient function and prognosis.2,45 However, other muscles can also be used in clinical practice and may present different inter-rater and intrarater reliabilities; (2) only one person analyzed all the images from both days and all raters, for consistency. The identification of structures by this analyst may also have influenced the results, and the inter-analyst reliability could also bring valuable information for understanding the reliability of these measurements. 17 Finally, (3) the participants were young and healthy and did not have any current pathology. The muscles of older people and people with pathologies may have different characteristics that could make the identification of muscle structures more difficult. Thus, care should be taken when extrapolating the results of this study to clinical practice.
Conclusion
The high ICCs and low SEMs and MDCs observed for all intrarater parameters demonstrated that these measurements were reliable when performed by an experienced rater at different moments. High reliability found for RFCSA and muscle thickness measures in inter-rater comparisons in all groups and high reliability for VLFL and VLPA in women demonstrate that these measures could be used in the evaluation of musculoskeletal morphology, when performed by different raters. The insufficient reliability found for VLFL and VLPA in mixed-gender and exclusively male groups in the inter-rater comparisons suggests that these parameters are evaluator-dependent. These results should be considered when using these types of measurement for making clinical decisions based on US muscle architecture values.
Footnotes
Ethics Approval
Ethical approval for this study was obtained from the University’s Research Ethics Committee (CAEE # 36588914.4.1001.5347).
Informed Consent
Written informed consent was obtained from all subjects before the study.
Animal Welfare
Guidelines for humane animal treatment did not apply to the present study because no animals were used during the study.
Trial Registration
Not applicable.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Brazilian Council of Scientific and Technological Development (CNPq) (grant number 458838/2013-6). MAV, PhD, is a recipient of a research fellowship from the CNPq.
