Abstract
Background
Accurate organ measurement is essential in neonatal ultrasound to guide clinical decisions. However, interobserver variability remains a challenge, especially in preterm infants, where small organ dimensions and technical limitations can affect reproducibility. Consistent agreement between examiners is crucial to ensure reliable and standardized assessments. This prospective observational study evaluates interobserver agreement in ultrasound measurements of the liver, kidney, and spleen in preterm infants.
Methods
In this prospective observational study, a total of 74 ultrasound examinations were performed in 30 preterm infants, with seven infants being measured twice. The 37 paired assessments included independent measurements of the liver (midsternal line (MSL), midclavicular line (MCL), and anterior axillary line (AAL)), kidneys, and spleen by two experienced examiners. Statistical analyses included the Wilcoxon matched-pairs signed rank test, paired t-test, Pearson correlation coefficient (PCC), intraclass correlation coefficient (ICC), and Bland–Altman analysis.
Results
No significant differences were found between examiners (P > .05), except for liver length in the MCL (P = .0369). PCC and ICC values for all measurements were 0.929 or higher, indicating excellent interobserver agreement.
Conclusion
Ultrasound investigations remain a reliable and reproducible tool for organ size assessment in neonatal care, even in extremely low birth weight infants with tiny anatomical structures. The strong interobserver agreement emphasizes the importance of standardized measurement protocols and ultrasound training, ensuring consistency in clinical practice.
Introduction
Abdominal sonography is an essential diagnostic tool in neonatal medicine, providing real-time and non-invasive assessment of parenchymal organs such as the liver, kidneys, and spleen. Accurate organ size measurements are essential for tracking growth, development, and early pathological changes in both preterm and term infants. Liver size abnormalities may indicate hepatomegaly, which can be associated with congestive heart failure, infections, metabolic disorders, liver tumors, or malformations.1–4 Renal dimensions play a crucial role in diagnosing congenital anomalies of the kidney and urinary tract (CAKUT), for example, renal dysplasia and duplex kidney, or ciliopathies like polycystic kidney disease.5–7 Equally, splenomegaly can indicate sepsis, portal hypertension, or hematological disorders like leukemia.8–10 Obtaining accurate and reproducible measurements in preterm infants presents specific challenges. Their small organ sizes, limited ultrasound windows, and frequent movement due to restlessness or respiratory effort contribute to measurement variability. These technical factors make interobserver agreement extremely important to ensure diagnostic reliability. Although interobserver variability has been examined in neonatal brain11–13 and lung ultrasound, 14 there is currently no published research evaluating agreement in abdominal organ measurements in preterm infants. Existing studies on sonographic interobserver reliability in abdominal imaging are limited overall and focus primarily on older children and adults.15, 16 While abdominal ultrasound is widely used in extremely low birth weight infants, for example, to detect conditions such as necrotizing enterocolitis,17, 18 no studies have evaluated the reproducibility of standardized liver, kidney, and spleen measurements in this population. 
This study aims to evaluate the interobserver agreement between two experienced examiners in standardized ultrasound measurements of the liver, kidneys, and spleen in preterm infants, using established statistical approaches to assess reproducibility in clinical practice.
Material and Methods
Study Design and Setting
This prospective study was conducted at a level III neonatal intensive care unit (NICU) at the Hannover Medical School, Lower Saxony, Germany, between March 2024 and January 2025. The NICU provides care for approximately 100 very low birth weight infants (<1.5 kg) annually, forming a diverse patient cohort. Most patients are inborn, with a smaller number of outborn transfers. All preterm infants admitted during the study period were eligible for inclusion if they were below 37 0/7 weeks of gestational age at the time of ultrasound examination and could be independently examined by both investigators within a 24-h interval. Infants were excluded if they had congenital anomalies or conditions that could affect abdominal organ size (e.g., hepatic or abdominal tumors, metabolic hepatopathies, polysplenia, or polycystic kidney disease). Further exclusion criteria included scheduling-related limitations: if the investigators’ duty rosters did not overlap within the required 24-h timeframe, it was not feasible to include the infant. The lack of written informed consent by the parents or legal guardians also led to exclusion. In total, 254 infants were screened for eligibility during the study period. Of these, 154 were excluded due to scheduling constraints, 56 due to lack of parental consent, eight due to early postnatal death, and six due to congenital or organ-specific anomalies. The final study population consisted of 30 preterm infants (Figure 1). This sample size was chosen based on feasibility and is in line with previously published interobserver agreement studies in neonatal ultrasound.13, 19 To evaluate its adequacy, a retrospective precision-based calculation was performed according to Bonett. 20 Assuming an expected intraclass correlation coefficient (ICC) of 0.90 and a two-rater design, this sample size yields a 95% confidence interval of approximately ± 0.05.
The study was approved by the local Ethics Committee and written informed consent was obtained from all legal guardians prior to participation. The study was conducted in accordance with the ethical standards of the 1964 Declaration of Helsinki and its later amendments.
Figure 1. Participant Flow Diagram: Overview of Screening, Exclusions, and Final Inclusion for Interobserver Ultrasound Analysis.
Variables
The primary outcome was the level of interobserver agreement in ultrasound measurements of the liver, kidneys, and spleen. No exposures or predictors were defined, as the study was designed to assess reproducibility rather than to examine causal relationships or prognostic factors. Potential confounding variables such as body weight, postmenstrual age, and sex were recorded during data collection but were not included in the statistical analysis, since adjustment was not applicable to the methodological objective of the study. No effect modifiers were considered. Diagnostic exclusion of congenital anomalies or organ-specific abnormalities was based on clinical evaluation and available imaging or medical record evidence, following standard NICU diagnostic practice.
Ultrasound Device and Examination Procedure
All ultrasound scans were taken using a GE Venue Go R4™ ultrasound scanner (GE HealthCare Technologies, Chicago, Illinois, USA) with a convex probe (8C). Infants were examined in a supine position to ensure consistency. In accordance with standard NICU care, neonates were neither woken nor sedated for the examination.
Ultrasound measurements were performed independently by two pediatricians with extensive experience in neonatal ultrasound: examiner 1 had 5.5 years and examiner 2 had 9 years of experience. Both held DEGUM level 1 certification in pediatric sonography (DEGUM = German Society for Ultrasound in Medicine). The time interval between the two examinations of the same infant was limited to a maximum of 24 h to minimize potential physiological variations. Both examiners used the same ultrasound device, probe, system presets, and patient positioning protocol to ensure technical consistency across all examinations. In addition, all measurements were performed in accordance with standardized DEGUM guidelines and current reference protocols.21, 22
The craniocaudal liver length was measured in three strictly sagittal planes: the midsternal line (MSL), using the aorta as a guide; the midclavicular line (MCL), with the gallbladder as a landmark; and the anterior axillary line (AAL), at the longest craniocaudal alignment. The spleen length was assessed in an oblique longitudinal scan from the left side, measuring from the upper to the lower pole. Renal volume was estimated using the ellipsoid formula: volume = length × width × depth × π/6, with the kidneys consistently measured from the ventral side. To prevent examiner bias, images and measurement data were stored on separate, independent servers, ensuring that neither examiner had access to the other's results. This setup eliminated any potential influence of prior measurements on subsequent assessments.
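The ellipsoid formula above can be illustrated with a short Python sketch. The function name, the choice of centimetres as input unit, and the example dimensions are assumptions made for illustration; they are not taken from the study data or from the ultrasound device software.

```python
import math


def ellipsoid_volume(length_cm: float, width_cm: float, depth_cm: float) -> float:
    """Estimate organ volume in mL from three orthogonal diameters in cm.

    Implements volume = length x width x depth x pi/6 (ellipsoid formula).
    """
    for value in (length_cm, width_cm, depth_cm):
        if value <= 0:
            raise ValueError("all diameters must be positive")
    return length_cm * width_cm * depth_cm * math.pi / 6


# Hypothetical kidney dimensions, chosen only to show the arithmetic.
example_volume = ellipsoid_volume(3.0, 1.5, 1.5)
print(f"{example_volume:.2f} mL")  # prints "3.53 mL"
```

Because the three diameters are multiplied, a relative error in any one of them carries over to the volume estimate in the same proportion, which is one reason why consistent probe positioning matters for renal measurements.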
Statistical Analysis
Data analyses were performed using GraphPad Prism version 10.4.1 (GraphPad Software, Boston, Massachusetts, USA). Data distribution was tested for normality using the Shapiro–Wilk test. Normally distributed data were presented as mean ± standard deviation (SD), while non-normally distributed data were given as median and interquartile range (IQR). Depending on the distribution, differences between examiners were analyzed using either a paired t-test for normally distributed data or the Wilcoxon matched-pairs signed rank test for non-normally distributed data. Since the primary outcome of this study was the level of agreement between the two examiners, interobserver reliability was assessed using the ICC and Pearson’s correlation coefficient (PCC). A Bland–Altman analysis was performed to evaluate systematic bias and limits of agreement (LoA). For normally distributed data, LoA was defined as the mean difference ± 1.96 SD, while for non-normally distributed data, the median difference and the 2.5th-97.5th percentile range were used. As seven infants underwent measurements twice, their values were averaged, resulting in 30 data points for all statistical analyses. The only exception was the kidney measurements: one infant had a horseshoe kidney, which reduced the number of comparisons to 29 data points. Apart from this excluded renal measurement, no data were missing. To preserve individual pairwise comparisons, the data from all the examinations were used for the Bland–Altman analysis.
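The two definitions of the limits of agreement described above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the GraphPad Prism procedure used in the study; the helper names, the linear-interpolation percentile rule, and the example values are assumptions.

```python
import statistics


def percentile(values, q):
    """Linearly interpolated percentile (q between 0 and 100)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def bland_altman(examiner_1, examiner_2, parametric=True):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    parametric=True : mean difference +/- 1.96 SD of the differences
    parametric=False: median difference with 2.5th and 97.5th percentiles
    """
    if len(examiner_1) != len(examiner_2) or len(examiner_1) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    differences = [a - b for a, b in zip(examiner_1, examiner_2)]
    if parametric:
        bias = statistics.mean(differences)
        spread = 1.96 * statistics.stdev(differences)  # sample SD
        return bias, bias - spread, bias + spread
    bias = statistics.median(differences)
    return bias, percentile(differences, 2.5), percentile(differences, 97.5)


# Hypothetical liver lengths in cm, not study data.
e1 = [4.1, 4.5, 3.9, 5.0, 4.7]
e2 = [4.0, 4.6, 4.0, 4.8, 4.7]
print(bland_altman(e1, e2, parametric=True))
print(bland_altman(e1, e2, parametric=False))
```

A bias close to zero indicates no systematic offset between examiners, while the width of the limits shows how far two individual measurements of the same infant may plausibly differ.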
Results
Study Population
A total of 254 preterm infants were assessed for potential inclusion during the study period. Of these, 154 could not be enrolled due to scheduling constraints, 56 were not enrolled because no parental consent was obtained, eight died early postnatally, and six had congenital or organ-specific anomalies. This resulted in a final cohort of 30 infants, in whom a total of 74 ultrasound examinations were conducted, with each of the 37 assessments performed independently by two examiners. Among them, seven infants underwent measurements twice. Liver and spleen measurements were complete for all infants; one kidney measurement was excluded due to a horseshoe kidney, resulting in 29 complete renal datasets. The study cohort included 16 females (53%) and 14 males (47%). The mean gestational age at birth was 28 3/7 weeks ± 3 3/7 weeks, with a median birth weight of 0.865 kg (IQR: 0.64-1.2) and a median birth length of 34 cm (IQR: 30.2-36.8). The first ultrasound examination was performed at a mean postnatal age of 42 days ± 29 days (Table 1).
Table 1. Demographic and Clinical Characteristics of the Study Population.
Across all measured parameters, no significant differences were observed between examiners (P > .05), except for liver length in MCL, where a small but statistically significant difference was detected (P = .0369). Overall interobserver agreement was excellent, with ICC ranging from 0.929 to 0.964 and PCC ranging from 0.930 to 0.965, indicating strong reliability (Table 2 and Figure 2). The highest agreement was noted for spleen measurements (ICC = 0.964, PCC = 0.965), while the lowest agreement was found for liver length in the AAL (ICC = 0.929, PCC = 0.930). Bland–Altman analysis further confirmed the strong agreement between examiners.
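To illustrate why both PCC and ICC are reported, the following Python sketch computes each for two raters. The study does not state which ICC model was used, so the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is an assumption here, as are the function names and the example data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def icc_2_1(x, y):
    """ICC(2,1) for two raters: two-way random effects, absolute agreement."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]  # one mean per infant
    col_means = [sum(x) / n, sum(y) / n]             # one mean per examiner
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: examiner 2 reads exactly 1 unit higher than examiner 1.
rater_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
rater_2 = [2.0, 3.0, 4.0, 5.0, 6.0]
print(pearson_r(rater_1, rater_2))  # perfect linear association
print(icc_2_1(rater_1, rater_2))    # lower, because of the constant offset
```

In this constructed example the correlation is perfect although the raters never agree, whereas the absolute-agreement ICC is reduced by the constant offset. Near-identical PCC and ICC values, as reported here, therefore suggest that systematic bias between examiners was small.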
Table 2. Comparison and Interobserver Agreement of Measurement Results.

The mean/median differences were small for all parameters, with 95% LoA lying within clinically acceptable ranges (Figure 3). Example images of the MSL, AAL, spleen, and kidney are shown in Figures 4 and 5.



Discussion
This study represents the first systematic interobserver analysis of abdominal organ measurements in preterm infants, offering valuable insights into the reproducibility of neonatal ultrasound. While previous studies have examined interobserver variability in individual organ measurements, particularly in older children and adults,23–26 comprehensive data on liver, kidney, and spleen measurements in preterm infants have been lacking.
Our findings demonstrate high interobserver agreement across all parameters, with no statistically significant differences between examiners for almost all measurements. The only exception was the measurement of liver length in MCL, which showed a small but statistically significant difference (P = .0369). However, this variation appears to be of limited clinical relevance, as supported by the Bland–Altman analysis, which showed narrow LoA within acceptable ranges. One possible explanation for the higher variability in MCL measurements is the use of the gallbladder as a landmark, which is prone to positional shifts due to respiratory movement and changes in its filling state. A fully distended gallbladder can be visible over a wide area of the abdomen, which allows variable transducer positioning and thereby limits the reproducibility of the measurement. Equally important is the physiological increase in craniocaudal liver length in the MCL region: because liver length changes markedly there, even small lateral displacements of the sagittally oriented transducer can produce substantial differences in the measurement. In contrast, MSL and AAL appear to be more stable reference points, potentially leading to greater measurement consistency. Opinions regarding the selection of the MCL as a valid and reproducible measurement point for craniocaudal liver length have differed in past studies.27–29 The inclusion of MCL measurements in the present study follows current recommendations, 22 whereas DEGUM only endorses MSL and AAL measurements. 21 Given its frequent use in liver assessments, further refinement of MCL-measurement protocols and examiner training could help minimize variability and improve reliability.
Clinical Implications and Perspectives
The high interobserver reliability in this study affirms the clinical value of sonography as a precise and reproducible imaging tool in neonates, including extremely low birth weight infants with very small anatomical structures. Accurate organ measurement is essential for the early detection of hepatosplenomegaly, renal growth abnormalities, and other congenital anomalies, allowing for timely intervention and monitoring. Liver size assessment is particularly relevant for detecting conditions such as infections, metabolic hepatopathies, or congestion due to cardiac failure.4, 30, 31 Even small inconsistencies in measurement technique could influence clinical decision-making and intraindividual follow-up, underscoring the need for consistent transducer positioning and respiratory phase control during imaging. In this context, artificial intelligence (AI) is increasingly being explored as a tool to improve measurement accuracy. AI has shown potential in reducing variability in sonographic imaging and in clinical decision-making,32–34 but its role in neonatal abdominal ultrasound remains limited. Future research could explore whether AI-assisted approaches can further improve measurement accuracy. Nevertheless, the experience of the examiner remains indispensable.
Strengths and Limitations
A major strength of this study is its prospective design, ensuring systematic data collection and analysis. The inclusion of a well-defined cohort of preterm infants enhances the clinical relevance of the findings, especially for NICU settings. Additionally, this study provides a comprehensive evaluation of interobserver agreement across multiple abdominal organs, using a standardized measurement protocol. The strict blinding of examiners further strengthens the study, ensuring that measurements were performed independently and without access to each other’s results, minimizing potential bias.
However, some limitations must be acknowledged. First, while a sample size of 30 infants (74 examinations) is appropriate for an interobserver study, a larger cohort could provide more robust data on measurement variability. Second, as a single-center study, our findings may have limited generalizability to other clinical settings, particularly those with less experienced examiners, different ultrasound equipment, or non-standardized measurement protocols. This restricts the external validity of our results and suggests that similar levels of agreement may not be universally achievable. Third, a potential limitation of the study lies in the timing of ultrasound examinations. While no infant was excluded due to clinical instability, some neonates were not examined during acute critical phases, and measurements were instead performed in a later, more stable condition. This approach reflects routine NICU practice, but it may limit the generalizability of interobserver agreement to clinically unstable patients. Fourth, intraobserver variability was not assessed, which may have provided additional insights into the overall reproducibility of the measurements. Finally, repeated measurements in seven infants were averaged for analysis, which may have reduced visible variance and influenced the observed agreement metrics.
Conclusion
This study confirms that ultrasound-based abdominal organ measurements in preterm infants are highly reproducible, supporting their continued use as a reliable bedside imaging tool. However, liver length in the MCL was the only plane with a statistically significant difference between the two examiners, likely due to the gallbladder’s variable position depending on its filling state and the physiological increase of liver length in this area. These factors make MCL, in contrast to MSL and AAL, a less reliable reference for liver length measurement. The observed variability in MCL measurements aligns with clinical experience and previous studies.
Footnotes
Acknowledgments
The authors would like to express their sincere gratitude to the infants and parents for taking part in the study. Parts of this work were translated from German into English using DeepL Translator version 25.1.11615133 (DeepL SE, Cologne, Germany).
Data Availability Statement
The datasets generated and analyzed during this study are not publicly available due to patient confidentiality regulations but are available from the corresponding author upon reasonable request.
Declaration of Conflicting Interests
The authors declared no conflict of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval and Informed Consent
The Ethics Committee of Hannover Medical School approved the study (09.04.2024, No. 11351_BO_K_2024), and informed consent was obtained from all legal guardians.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
