Abstract
Background and aims:
The American Foregut Society (AFS) recently developed an improved grading system of the esophagogastric junction (EGJ). In our study, we aimed to test the interobserver agreement of the AFS EGJ classification system using real-time endoscopy.
Methods:
We conducted a prospective observational study at a single center from October 2024 to December 2024. Inclusion criteria were veterans ≥18 years old referred for upper endoscopy for evaluation of GERD. Exclusion criteria included history of foregut surgery, upper aerodigestive cancer, or known major disorder of esophageal peristalsis. Baseline sociodemographic and clinical variables were collected. Five endoscopists completed standardized instruction on EGJ assessment before study initiation. Two endoscopists independently scored the EGJ of each patient. Endoscopists were blinded to one another's findings and to the self-reported Gastroesophageal Reflux Disease Questionnaire (GERDQ) scores. Interobserver agreement among endoscopists was determined using the kappa statistic with corresponding 95% confidence intervals (CIs), and strength of agreement was categorized according to established definitions for kappa values.
Results:
A total of 117 patients met inclusion criteria and 70 successfully completed the study. The mean age was 53 years and the participants were predominantly male (76%) and white (57%). Mean body mass index was 31 kg/m² with 35 (50%) reporting GERD symptoms ≥5 years. The AFS grade demonstrated fair overall interobserver agreement (κ = .371, P ≤ .001). AFS grade demonstrated a weak correlation with the GERDQ score (ρ = .077, P = .578).
Conclusion:
The AFS EGJ classification demonstrated fair interobserver agreement between endoscopists for patients referred for endoscopy for evaluation of GERD, which has implications for widespread implementation.
Key Learning Points
AFS classification showed only fair interobserver agreement (κ = .371) in real-time tandem endoscopy, improving to moderate (κ = .524) when simplified into two categories (AFS 1-2 vs 3-4).
Live endoscopy assessments revealed more variability than prior studies using still images, highlighting the challenges of applying a dynamic grading system in clinical practice.
The esophagologist tended to upgrade AFS grades, suggesting that specialized expertise may improve accuracy in EGJ assessment.
Variability across endoscopists underscores the need for enhanced training, standardized protocols, and potential use of AI tools to improve reproducibility.
Introduction
Gastroesophageal reflux disease (GERD) is a common condition afflicting almost one-third of the United States population. 1 An important mechanism of GERD is a breakdown of the anti-reflux barrier, which consists of three vital components: the crura, the gastroesophageal flap valve, and the lower esophageal sphincter and sling fibers. Historically, endoscopic assessment of the anti-reflux barrier has been performed using the Hill classification, which is reproducible 2 and correlates with GERD severity. 3,4 However, the Hill classification has been criticized for incomplete assessment of the esophagogastric junction (EGJ) and for subjectivity, and it has not been widely adopted.
The American Foregut Society subsequently developed a novel grading system of the EGJ using a modified Delphi methodology to define anatomic disruption based on objective, quantitative, standardized grading that includes maneuvers to induce hernia. 5 The AFS grade has been shown to correlate with pH studies, with a significant stepwise increase in esophageal acid exposure time as the AFS grade worsens. 6 The authors of this new classification posited it would have minimal interobserver variability, and a subsequent blinded multi-reader validation study suggested substantial reproducibility. 7
However, that study was based on assessment of still images of the EGJ, and to date no studies validating the inter-rater agreement of AFS classification with tandem endoscopy have been published. Here, we attempted to validate the new EGJ classification by evaluating interobserver agreement amongst sequential endoscopists who prospectively evaluated outpatients receiving upper endoscopy for assessment of GERD-related symptoms. We hypothesized that inter-rater agreement would be lower than in prior studies because our study involved real-time endoscopic assessment of the EGJ, which is both dynamic and operator dependent. A secondary goal of our study was to correlate the findings with a validated self-reported GERD questionnaire.
Methods
Study Setting and Population
This was a prospective, observational study involving outpatients at a large, tertiary care Veterans Affairs (VA) medical center in Houston, Texas who were referred for upper endoscopy for the diagnosis and evaluation of gastroesophageal reflux symptoms. Consecutive patients aged >18 years referred for upper endoscopy for GERD or associated conditions were screened for inclusion from October 2024 to December 2024. Patients were excluded if they declined to participate, did not complete the study, or had a known major disorder of peristalsis per Chicago v.4.0, 8 upper aerodigestive malignancy, or previous foregut surgery. This research was approved by the Institutional Review Boards for Human Subjects Research for Baylor College of Medicine (IRB # H-54016) and the VA Research and Development Committee of the Michael E. DeBakey Veterans Affairs Medical Center (IRB # 1776066). All patients provided written informed consent for participation.
Study Protocol
After enrollment, subjects completed a GERDQ questionnaire (Supplemental Figure 1), which is a validated tool consisting of six questions used to diagnose GERD in the primary care setting. 9 In addition, baseline socio-demographic and clinical variables were obtained from the patient using a brief questionnaire and chart review. Endoscopists performing endoscopy on enrolled patients were blinded to the GERDQ result to minimize bias. 10 Five attending endoscopists participated, including 1 expert esophagologist (WS) who serves as director of esophageal physiology testing and has conducted research on standardized reporting of the AFS classification system. The remaining endoscopists (SU, NS, CH, SL) had 2 to 49 years of post-fellowship experience (mean 18.25 years). No trainees participated. Before initiating the study, all endoscopists reviewed the AFS classification, 5 watched a video highlighting endoscopic EGJ assessment using AFS classification, 11 and received instructional material on expert advice regarding the endoscopic assessment of hiatal hernia. 12 The AFS classification is a grading system of the EGJ consisting of 4 categories ranging from grade 1 (no disruption) to grade 4 (complete disruption). It also has subcomponents that allow for the measurement or grading of the axial length, hiatal aperture, and flap valve (Supplemental Figure 2).
The protocol involved tandem endoscopy in which the first endoscopist completed the upper endoscopy with assessment for reflux-related pathology, including reflux esophagitis graded using the Los Angeles (LA) classification system, Barrett’s esophagus (BE) graded using the Prague classification, and peptic stricture. The EGJ was carefully examined using the EGJ protocol by Nguyen et al, with one deviation: the measurement of axial length was obtained upon withdrawing the endoscope after deflation of the stomach, as recommended by Katz et al. The first endoscopist then vacated the procedure room and scored their findings. A tandem endoscopist blinded to the findings of the first endoscopist then performed a second-look endoscopy following the same protocol and graded the EGJ separately. Tandem endoscopists were paired based on endoscopist availability. Both investigators had access to a visual diagram of the AFS classification on the scoring sheet for reference. If additional maneuvers (ie, biopsy) were required, the first endoscopist re-entered the endoscopy room and completed the required procedures. Otherwise, the second endoscopist withdrew the endoscope and terminated the procedure. At the conclusion of the procedure, no further survey instruments or follow-up data were obtained.
Conscious sedation was performed according to standard protocol using intravenous midazolam, fentanyl, and diphenhydramine. All endoscopies were performed using EVIS EXERA III (GIF-HQ190) gastroscopes (Olympus USA, Center Valley, PA) with 5-cm interval marking on the shaft. CO2 was used for insufflation for all procedures.
Study End Points and Statistical Analysis
The primary outcome was the overall AFS EGJ interobserver agreement, measured by Cohen’s kappa (κ) statistic. Secondary outcomes included interobserver agreement based on individual AFS grade and AFS subcomponents (length, diameter, and flap valve), AFS interobserver agreement between the esophagologist and other endoscopist raters, and correlation between AFS and GERDQ score.
Power Calculations and Statistical Methods
Assuming a power of .8, a significance level of .05, four classification categories, and an expected moderate effect size (ie, Cohen’s kappa of .4), a sample size of 60 subjects was sufficient to detect significant agreement between two raters.
Kappa (κ) calculations and 95% confidence intervals (CI) were performed using IBM SPSS Statistics (Version 28.0, IBM Corp, Armonk, NY). Measures of correlation were determined using both Spearman’s rank correlation coefficient (ρ) and univariable linear regression, or when residuals were not normally distributed, rank regression analysis. The cutoffs for kappa were defined as follows: poor < 0.00; slight = 0.00-0.20; fair = 0.21-0.40; moderate = 0.41-0.60; substantial = 0.61-0.80; almost perfect = 0.81-1.00. The cutoffs for Spearman’s ρ were defined as: no relationship = 0.01-0.19; weak relationship = 0.20-0.29; moderate relationship = 0.30-0.39; strong relationship = 0.40-0.69; very strong relationship ≥ 0.70.
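For readers who wish to reproduce this type of analysis outside SPSS, the following is a minimal Python sketch of Cohen's kappa for two raters, together with the agreement categories listed above. It is illustrative only: the function names and the example grades are hypothetical and are not study data, and the confidence interval uses a simple large-sample standard error that may differ from the asymptotic standard error SPSS reports.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(rater_a, rater_b):
    """Return (kappa, (ci_low, ci_high)) for two raters' categorical grades."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of patients given the same grade by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal grade frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[g] * count_b[g] for g in count_a) / n**2
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (p_o - p_e) / (1 - p_e)
    # Assumption: simple large-sample standard error, not the SPSS formula.
    se = sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)


def agreement_label(kappa):
    """Map kappa to the categories listed in the Methods."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical AFS grades (1-4) from two tandem endoscopists; NOT study data.
first_look = [1, 1, 2, 2, 3, 3, 4, 4, 3, 2]
second_look = [1, 2, 2, 3, 3, 3, 4, 3, 4, 2]
kappa, ci = cohen_kappa(first_look, second_look)
print(round(kappa, 3), agreement_label(kappa))
```

The published cutoffs leave small gaps between bands (for example, between 0.40 and 0.41); this sketch assigns such values to the higher band, which is one possible convention.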
Results
Baseline Patient Characteristics
A total of 236 patients were screened for eligibility; 117 met inclusion criteria. Forty-seven patients were excluded, and 70 participants successfully completed the study (see Figure 1 for details).

A flow diagram of the study.
The mean age was 53 years (standard deviation [SD] 15 years, range 24-79) and the participants were predominantly male (76%) and white (57%) (Table 1). Mean body mass index (BMI) was 31 kg/m² (±6). Thirty-five patients (50%) reported GERD symptoms for ≥5 years. Fifty-three patients (75.7%) were receiving antisecretory medication with 47 (67.1%) being on proton pump inhibitors alone. The mean GERDQ was 7.6 (±4.7). Moderate to severe valve disruption (AFS 3 or 4) was present in 43 (61.4%) patients, based on endoscopist #1’s grading.
Baseline Sociodemographic and Clinical Variables of Study Participants.
Abbreviations: AFS: American Foregut Society; ASA: American Society of Anesthesiologists; BMI: body mass index; EGJ: esophagogastric junction; GERDQ: Gastroesophageal Reflux Disease Questionnaire.
As determined by endoscopist #1.
Baseline Endoscopist Characteristics
Five endoscopists participated in the study (1 esophagologist, 2 general gastroenterologists and 2 advanced endoscopists). The esophagologist was 5 years from completion of training. One endoscopist was <3 years post training, and the 3 remaining were a minimum of 10 years post training (range 2-49 years). The esophagologist performed 36 procedures. The mean for the other endoscopists was 26 study procedures (range 7-39).
Interobserver Agreement of AFS Classification
The overall AFS grade demonstrated fair interobserver agreement (κ = .371, 95% CI 0.206-0.536, P ≤ .01) (Table 2). When analyzed by specific AFS grade, fair agreement was observed for grade 1 (κ = .364, 95% CI 0.050-0.678, P = .002), grade 2 (κ = .278, 95% CI 0.031-0.525, P = .02), and grade 3 (κ = .352, 95% CI 0.130-0.574, P = .003), and moderate agreement was present for assessment of AFS grade 4 (κ = .528, 95% CI 0.272-0.784, P ≤ .001). When dichotomized into AFS grades 1 and 2 (no to partial disruption) versus AFS grades 3 and 4 (moderate to severe disruption), the agreement was moderate (κ = .524, 95% CI 0.320-0.728, P ≤ .001).
Inter-Observer Agreement of Overall AFS, AFS by Grade, AFS by Component, and AFS by Expertise.
Abbreviation: AFS: American Foregut Society.
Analysis of each AFS component separately showed fair agreement for length (κ = .399, 95% CI 0.256-0.542, P ≤ .001), slight agreement for diameter (κ = .077, 95% CI −0.066 to 0.220, P = .254), and moderate agreement for flap valve (κ = .409, 95% CI 0.144-0.674, P < .001).
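The grade-specific and dichotomized analyses above can be sketched as simple recodings of the same paired ratings before kappa is computed. The Python sketch below is illustrative only: it assumes that grade-specific kappa is obtained by a one-versus-rest recoding (the source does not state the exact procedure), and the helper names and example grades are hypothetical, not study data.

```python
from collections import Counter


def kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / n**2
    # Note: undefined (division by zero) if both raters use a single category.
    return (p_o - p_e) / (1 - p_e)


def dichotomize(grades):
    """Collapse AFS grades into two categories: 1-2 versus 3-4."""
    return ["3-4" if g >= 3 else "1-2" for g in grades]


def one_vs_rest(grades, target):
    """Recode grades as True (target grade) or False (any other grade)."""
    return [g == target for g in grades]


# Hypothetical AFS grades from two tandem endoscopists; NOT study data.
first_look = [1, 1, 2, 2, 3, 3, 4, 4, 3, 2]
second_look = [1, 2, 2, 3, 3, 3, 4, 3, 4, 2]

overall = kappa(first_look, second_look)
collapsed = kappa(dichotomize(first_look), dichotomize(second_look))
grade_4 = kappa(one_vs_rest(first_look, 4), one_vs_rest(second_look, 4))
print(round(overall, 3), round(collapsed, 3), round(grade_4, 3))
```

In this toy example, collapsing to two categories raises kappa because disagreements between adjacent grades within the same half (for example, 3 versus 4) no longer count as disagreements.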
AFS Classification of Esophagologist Versus Endoscopists
Interobserver agreement between esophagologist and endoscopist raters showed fair agreement (κ = .326, 95% CI 0.107-0.546, P ≤ .001) (Table 2). When the esophagologist was excluded, the other endoscopists demonstrated moderate inter-rater agreement (κ = .435, 95% CI 0.214-0.657, P ≤ .001). There were no observable differences between interrater agreements of general and advanced endoscopists.
The esophagologist agreed with the tandem endoscopist's AFS classification in 19/36 (52.8%) cases and upgraded it in 14/36 (38.9%) cases. In contrast, the esophagologist downgraded the classification in 3/36 (8.3%) cases (Figure 2).

Spearman correlation between (A) overall AFS and GERDQ (B) esophagologist AFS and GERDQ (C) endoscopist AFS and GERDQ.
In the 17 discordant cases, the esophagologist most frequently modified the AFS score based on the diameter component (12 instances), with fewer changes attributable to length (1 instance) or a combination of length and diameter (4 instances). The flap valve did not contribute to any discrepancies (Supplemental Table 2).
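The agree, upgrade, and downgrade tallies reported above follow from a pairwise comparison of the two grades recorded for each patient. A minimal Python sketch of that comparison is shown below; the function names and the four example pairs are hypothetical, while the counts 19, 14, and 3 out of 36 used in the percentage check are those reported in the text.

```python
from collections import Counter


def compare_grades(reference, tandem):
    """Tally agreement from the reference rater's point of view.

    'upgrade' means the reference rater assigned a higher AFS grade than
    the tandem rater; 'downgrade' means a lower grade.
    """
    tally = Counter()
    for ref, other in zip(reference, tandem):
        if ref == other:
            tally["agree"] += 1
        elif ref > other:
            tally["upgrade"] += 1
        else:
            tally["downgrade"] += 1
    return tally


def as_percent(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Hypothetical paired grades (reference rater, tandem rater); NOT study data.
print(compare_grades([3, 2, 4, 1], [2, 2, 4, 2]))

# Percentages for the counts reported in the text (n = 36).
for label, count in [("agree", 19), ("upgrade", 14), ("downgrade", 3)]:
    print(label, as_percent(count, 36))
```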
Correlation with GERDQ Score
There was only a slight correlation between the AFS grade and GERDQ for all endoscopists (ρ = .077, 95% CI −0.201 to 0.362, P = .587) (Figure 3). Rank regression analysis revealed that for each 1-point increase in AFS grade, there was a 0.081 increase in GERDQ score (Supplemental Table 1).

Esophagologist grading versus endoscopist grading during tandem endoscopy. The esophagologist concurred with the EGJ assessment of the tandem endoscopist for 19/36 subjects, upgraded 14/36 subjects and downgraded 3/36 subjects.
Esophagologist AFS grading showed a slight correlation with the GERDQ score (ρ = .088, 95% CI −0.359 to 0.492, P = .657). Rank regression analysis indicated that for each 1-point increase in AFS grading by the esophagologist, there was a 0.096 increase in GERDQ score. Endoscopist AFS rating also showed a slight correlation with the GERDQ score (ρ = .062, 95% CI −0.357 to 0.494, P = .774). Rank regression analysis revealed that for each 1-point increase in AFS grading, there was a 0.063 increase in GERDQ score.
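Spearman's ρ, as used above, is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank (ties are frequent when one variable is a 4-level grade). The following Python sketch illustrates the calculation and the strength categories listed in the Methods. The function names and example values are hypothetical and are not study data, and treating values below the lowest listed band as "no relationship" is an assumption.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_label(rho):
    """Map |rho| to the categories listed in the Methods."""
    r = abs(rho)
    if r >= 0.70:
        return "very strong"
    if r >= 0.40:
        return "strong"
    if r >= 0.30:
        return "moderate"
    if r >= 0.20:
        return "weak"
    return "no relationship"


# Hypothetical AFS grades and GERDQ scores; NOT study data.
afs = [1, 2, 2, 3, 4]
gerdq = [3, 8, 5, 9, 12]
rho = spearman_rho(afs, gerdq)
print(round(rho, 3), correlation_label(rho))
```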
Discussion
We found in a prospective study that overall, the interobserver agreement was only fair (κ = .371) for AFS classification of the esophagogastric junction. However, inter-rater agreement fared somewhat better when AFS classification was dichotomized to two groups of AFS 1-2 versus AFS 3-4 (κ = .524, moderate). Despite moderate interobserver agreement of AFS classification, we found a weak correlation (ρ = .077) between GERDQ and AFS classification.
To our knowledge, our study is the first prospective interobserver validation of the AFS classification to involve real-time comparisons during live endoscopy. Even the original Hill grade study, 2 despite demonstrating good interobserver agreement, was based on evaluations of videotapes and still photographs rather than real-time tandem endoscopy. Our results differ from those of a previous study by Swei et al, which reported substantial inter-rater agreement (κ = .65) for the AFS classification among 5 blinded raters who examined only still endoscopic images of the EGJ with premeasured hernia lengths. In contrast, our study involved in vivo grading of each patient during endoscopy, which we believe is more reflective of ‘real world’ inter-rater agreement, especially since the AFS classification is an operator-dependent grading system involving dynamic assessment of the EGJ. Further, dynamic assessment of the EGJ depends on air insufflation and provocative maneuvers, which cannot be replicated with still endoscopic images. In fact, the authors of that study suggested that a prospective study evaluating AFS during live endoscopy would be useful to remedy the inherent bias and inaccuracy of ex vivo image inter-rater validation. It remains unclear to what extent the variability in our findings reflects challenges inherent to the AFS classification itself (namely, the application of a standardized grading system to a dynamic exam and anatomy) and how much is attributable to endoscopist-related factors such as training level, protocol adherence, and technique.
Notably, our study also demonstrated that AFS correlated weakly with the GERDQ score, which is consistent with a prior study that reported a negligible but statistically significant correlation (r = .202, P < .01) between AFS and GERDQ scores. 13 This is not surprising given that GERDQ scores have a sensitivity and specificity of 63% and 46%, respectively, on proton pump inhibitor (PPI) therapy and a sensitivity of 55% and specificity of 52% off PPI therapy when compared with 48-hour wireless pH monitoring. 14 Although rated fair, the observed interobserver agreement in our study is similar to that of other studies measuring interobserver agreement of other classifications using live endoscopy or video recordings. For example, a previous study evaluated interobserver agreement in 187 patients with Barrett’s esophagus who underwent two consecutive live endoscopies by different endoscopists. The authors reported that absolute agreement on hernia length, defined as a discrepancy of 0 cm, was only 29% (95% CI 22-35) and increased only to 63% (95% CI 56-70) when defined as a discrepancy of ≤1 cm. 15 Another study using video recordings reported similar findings: the κ value for the diagnosis of hernia was .177 (slight agreement) amongst 91 Japanese academic and community endoscopists. 16 It stands to reason that studies involving still images, video recordings and live endoscopy will likely have different inter-rater agreements.
An interesting finding in our study was the relatively lower interobserver agreement of the esophagologist in relation to the general and advanced endoscopists (κ = .326). Even more thought-provoking was the fact that 91.7% of the time the esophagologist agreed with or upgraded the AFS classification of the tandem endoscopist. Accurate assessment of the EGJ does require expertise, 17 and evidence suggests that esophageal experts are more accurate at assessment of EGJ anatomy 18 than other endoscopists. In our study it is plausible that the endoscopists underestimated the AFS grading. Even though the lack of a gold standard makes determination of accuracy difficult, anecdotal experience suggests underestimation of EGJ disruption is common, and the surgical literature validates this observation. 19
Our study has several notable strengths. First, it involved real-time in vivo prospective evaluation rather than examination of images or video sequences. Second, our endoscopists were blinded to one another and to the GERDQ score. Third, we utilized a standardized protocol with a visual diagram of the AFS classification to assist with grading. The limitations of our study include the lack of a gold standard, which limits assessment of accuracy. Future studies might record the endoscopist assessments and also use a panel of blinded experts to grade the EGJ and compare their results to those of the investigators. A limitation of our protocol is that axial length was measured after desufflation of the stomach, which may underestimate hernia size compared to the original AFS classification method that recommends measurement during maximal insufflation; while this deviation may have contributed to the fair interobserver agreement for axial length, it is unlikely to explain the variability observed in other components such as diameter and flap valve, suggesting the overall impact on study findings is minimal. Another potential criticism of our study, given the inter-rater variability observed, is how strictly the endoscopists adhered to the protocol. Importantly, all endoscopists received standardized instructions and were provided with visual diagrams to assist with grading, and yet the inter-rater agreement was suboptimal. In fact, it is plausible that interobserver agreement in our study is higher than in the real world given the possibility of a Hawthorne effect 20 and the use of visual diagrams, which may not be available in all endoscopy units and are unlikely to be present in community settings. Variability in interobserver agreement may reflect differences in experience and familiarity with dynamic EGJ assessment, despite standardized training. This suggests a potential learning curve and highlights the need for more extensive training, which could be explored in future studies.
Artificial intelligence may provide a solution: in the study by Kafetzkis et al, it performed better than inexperienced physicians at Hill grade assessment (85% vs 56%, P < .01) and trended better than experienced physicians (84% vs 69.6%, P = .07).
In conclusion, our study suggests that the AFS classification produces no better than fair interrater agreement. Increased education, quality metrics and implementation of artificial intelligence may be required to develop an accurate and reproducible adoption of this new EGJ classification system.
Supplemental Material
sj-docx-1-gut-10.1177_26345161251393104 – Supplemental material for Interobserver Agreement of the American Foregut Society Esophagogastric Junction Classification: A Prospective Observational Study
Supplemental material, sj-docx-1-gut-10.1177_26345161251393104 for Interobserver Agreement of the American Foregut Society Esophagogastric Junction Classification: A Prospective Observational Study by Zahraa Al Lami, Fouad Jaber, Fares W. Ayoub, Theresa H. Nguyen Wenker, Vinh Tran, Shifa Umar, Ned Snyder, Scott Larson, Clark Hair, David Y. Graham and Wasseem Skef in Foregut
Footnotes
Abbreviations
AFS: American Foregut Society
ASA: American Society of Anesthesiologists
BE: Barrett’s esophagus
BMI: body mass index
CI: confidence interval
EGJ: esophagogastric junction
GERD: gastroesophageal reflux disease
GERDQ: Gastroesophageal Reflux Disease Questionnaire
IRB: institutional review board
LA: Los Angeles
SD: standard deviation
VA: Veterans Affairs
Ethical Considerations
This research was approved by the Institutional Review Boards for Human Subjects Research for Baylor College of Medicine (IRB # H-54016) and the VA Research and Development Committee of the Michael E. DeBakey Veterans Affairs Medical Center (IRB # 1776066). All patients provided written informed consent for participation.
Author Contributions
Conception and design: WS, ZA, VT, SU, NS, SL, CH; analysis and interpretation of data: FJ, FA, WS; drafting of the article: FJ, WS; critical revision of the article for important intellectual content: DYG, FA, FJ, WS; final approval of the article: all authors.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Dr. Graham is supported in part by the Office of Research and Development Medical Research Service Department of Veterans Affairs, Public Health Service grant DK56338, which funds the Texas Medical Center Digestive Diseases Center and by the Cancer Prevention and Research Institute of Texas (RP220127). Dr. Nguyen Wenker is supported by the American College of Gastroenterology Junior Faculty Development Award.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Use of Artificial Intelligence
Artificial intelligence (AI) tools were used only to enhance the clarity and language of the manuscript. All ideas, interpretations, and the manuscript’s intellectual content were conceived, developed and written by the authors.
Supplemental Material
Supplemental material for this article is available online.
References
