Abstract
Compromised gill health is a critical cause of forfeited welfare in Atlantic salmon farming. Detecting and quantifying the early onset of gill disease is important to reveal initial inflicting stimuli. We collected gill samples of 45 Atlantic salmon from 2 commercial recirculating aquaculture systems (RASs) spanning fry-to-market-size fish with no clinical signs of gill disease. Gill samples were assessed histologically by 3 independent raters with different levels of experience. Semiquantitative scoring for 7 types of gill changes was carried out for 10 filaments per gill (450 filaments total) over 3 rounds on anonymized samples. Scores were summarized for each type of gill change. The assumed clinical relevance for each change was transformed into a category score, followed by an assessment of agreement within (intra) and between (inter) raters. A generalized linear model estimated the difference in score levels between raters. For each rater, intra-rater agreement was high for 6 gill changes and moderate for 1 gill change. Inter-rater agreement was moderate to almost-perfect, except for 2 gill changes; generalized linear model regression revealed systematic differences in score usage between the raters. Our scoring protocol worked satisfactorily for mucous cell amount, lamellar clubbing, lamellar hypertrophy and/or hyperplasia, and aneurysms, despite different levels of expertise in histologic evaluation. Intra-rater agreement was consistent, but differences existed for interlamellar hypercellularity, lamellar inflammation, and degeneration. Scoring subclinical gill changes is a challenge, and our scoring system for mild-to-moderate lesions may enable early intervention to limit the detrimental effects of poor gill health in RAS farming.
The gills are suitable as a biomarker for environmental impact in intensive aquaculture production. 36 Although the response repertoire of fish gills to irritants in the water is limited and often nonspecific,20,39 the severity of the response is influenced by the level of irritants, exposure, and recovery time.20,26 The induced changes increase the diffusion distance between blood and water, potentially affecting gill function. 34 Impaired gill function is expected to precede morphologic changes discernible at the light microscopic level. 34 A unifying histologic scoring system that quantifies mild-to-moderate gill changes could help increase our understanding of the early-stage gill changes that we believe can be diagnosed at the light microscopy level before manifestation of clinical disease.
Several grading and scoring systems have been deployed to assess gill pathology through histologic assessment of lesions,2,12,22,25,26,33 and used in ensuing studies, with or without modifications.8,17,24,27,29,32 Pathologic changes commonly evaluated in the gills are hypertrophy and/or hyperplasia of gill tissue, including mucous cell proliferation, lamellar fusion and edema or “lifting,” cellular anomalies, inflammation, and circulatory disturbances. Clavate lamellae, or clubbing, and the occurrence of various microorganisms are also reported.
“Histopathology is not an exact science” 38 as it is based on subjective interpretation of tissue changes,5,30 and precise understanding of gill lesions is best obtained when structural changes are clearly defined and processing artifacts are minimal. 23 Despite using predetermined score definitions, raters might still interpret definitions and findings differently, 1 and borderline scores might add to the variability among raters. 30 Inconsistency in definitions can be reduced by revising the scoring protocol. 9
A scoring system must produce consistent results with satisfactory repeatability and reproducibility (i.e., high agreement within and between evaluators).6,16,37 Studies have evaluated the objectivity of gill scoring systems through agreement assessment,12,22 but studies on mild-to-moderate gill pathology in Atlantic salmon are nonexistent. A semiquantitative gill scoring system should fit the expected range of lesions,10,21 and establishing a gill scoring system requires testing for robustness across rater experience.
We aimed to assess semiquantitative scoring of histopathologic changes in the gills of Atlantic salmon without clinical manifestation of gill disease obtained from commercial recirculating aquaculture system (RAS) sites, using raters with different levels of histopathology experience. Scoring was evaluated through intra- and inter-rater agreement analysis. Better understanding of mild-to-moderate gill lesions will improve our understanding of the transition process from normal variation toward clinical manifestation of gill disease, with a potential for pre-emptive intervention.
Materials and methods
Gills: origin, sampling, processing, and scoring
We used gill tissue from a repeated cross-sectional study on Atlantic salmon (Salmo salar L., n = 441) gill health. The fish sampled were farmed in commercial aquaculture operations, and fish size spanned fry-to-market-size. No clinical manifestations of gill disease were reported in the fish groups over the sampling period. We included 4 fish groups farmed in 2 RAS facilities in Norway. Site 1 is a commercial freshwater facility; site 2 is a seawater on-growth facility (2 groups each).
At each sampling, fish were haphazardly netted into a holding vessel with water and euthanized with an overdose of benzocaine (concentration depending on temperature, size of fish, and excitation). Depending on the size of the gill, the whole or the ventral branch of the second left gill arch was sampled from fish >2 g, and for the smallest fish, several gill arches were included. Fish <2 g were sampled whole. The samples were transferred to 10% phosphate-buffered formalin shortly after euthanasia and stored in formalin until further processing. The time between confinement, euthanasia, and sampling varied due to different handling procedures at various production or operational stages (i.e., the largest fish were crowded before netting at specific times, and the smallest fish were euthanized in groups). Samples were processed routinely 35 at a single laboratory, sectioned at 2–4 µm, stained with H&E, and the whole slide scanned to a digital image format and viewed with digital imaging software (NDP.view2; Hamamatsu Photonics).
We selected 45 fish gills semi-randomly from the pool described above for detailed histologic evaluation; all fish groups and several sizes were represented in these samples, and slides with major technical quality issues were discarded before using Excel (v.16; Microsoft) to select samples randomly. Each gill was evaluated 3 times by 3 evaluators (i.e., raters) with different levels of experience in histopathology over a period of 2–3 mo, rater 1 having less experience than raters 2 and 3. Each gill was given a random identity for each round (i.e., 1–45, 46–90, and 91–135 for rounds 1–3, respectively). To ensure anonymized assessment, the order in which the samples were assessed varied among the raters. Raters 1 and 3 rated in ascending order; rater 2 chose a more haphazard order. Up to 450 unique gill filaments were assessed per round, totaling up to 1,350 filament assessments per rater (Table 1). Criteria were set for which filaments to include. An extensive scoring system developed by A. Dalum (not published) was used as the basis for the protocol. Scoring was performed with a refined protocol established through 3 pilot assessments to reduce variation from subjective interpretation and create a fine-tuned scale fitted for low-to-moderate histopathologic changes (encompassing which and how many gill changes to include or differentiate between, definitions, the number of score levels, and scale-type [i.e., ordinal or visual analogue]). As we were aware that too much refinement and calibration could bias the result, especially since the pilot gills were to constitute a subset of the gills used in the final evaluation, we proceeded to the final assessment after the third revision. In the final evaluation, the same 3 raters assessed an increased number of gills, all of which were anonymized as described above.
Table 1. The number of gill samples and filaments scored, distributed among fish weight and facilities. Three raters assessed each gill thrice (3 rounds), resulting in 1,350 filament assessments per rater.
From 2 fish groups kept in freshwater. Fish weight is estimated as average on a tank basis, ≤150 g.
From 2 fish groups kept in seawater. Fish weight is individual fish weight, >500 g.
In general, 10 filaments were assessed per gill sample, but 17 of 135 samples had <10 filaments.
Fish sampled whole.
Scores were recorded in a customized scoring form in Excel. The “hide-label” setting in NDP.view2 was used so that raters could not accidentally see the original slide number, keeping the assessment anonymized.
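The per-round random identities described above (1–45, 46–90, and 91–135 for rounds 1–3) can be illustrated with a minimal Python sketch. The study itself used Excel; the function name `assign_round_ids`, the slide names, and the seed are hypothetical and serve only to show the idea.

```python
import random

N_GILLS = 45    # gills selected for detailed evaluation
N_ROUNDS = 3    # scoring rounds per rater


def assign_round_ids(gill_names, n_rounds=N_ROUNDS, seed=None):
    """Return {round: {anonymous_id: gill_name}} with a fresh shuffle per round.

    Round 1 uses IDs 1..n, round 2 uses n+1..2n, and so on, so the same gill
    carries a different, unrelated identity in every round.
    """
    rng = random.Random(seed)
    n = len(gill_names)
    key = {}
    for rnd in range(1, n_rounds + 1):
        first = (rnd - 1) * n + 1              # 1, 46, 91 when n = 45
        ids = list(range(first, first + n))    # e.g., 46..90 in round 2
        rng.shuffle(ids)                       # break any link to slide order
        key[rnd] = dict(zip(ids, gill_names))
    return key


# Hypothetical slide names; the real slide numbers are not given in the text.
gills = [f"gill_{i:02d}" for i in range(1, N_GILLS + 1)]
key = assign_round_ids(gills, seed=1)
print(sorted(key[2])[:3])   # [46, 47, 48]
```

The key linking anonymous IDs to slides would be held by someone other than the raters until scoring is complete.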
Target gill area and suitability criteria: fish >2 g
Ten filaments per gill were targeted; we required a minimum of 5 suitable filaments per gill for assessment. A filament was deemed suitable for assessment if it had symmetrical lamellae with a vascular structure in ≥20% of the assumed length of the filament. Symmetry of ipsilateral lamellae within a filament was defined as a lamellar length deviation of <50% (Fig. 1A–C), and the length of the lamellae had to be larger than the width of the filament cartilage (Fig. 1D), avoiding assessment of lamellae in the regions close to their afferent (trailing edge of the filament) or efferent (leading edge of the filament) side. This area was then defined as the scoreable filament area (Fig. 1A).

Figure 1. Examples of inclusion (A, E, F) and exclusion (B–D) criteria for gill scoring.
We assessed filaments from the second gill arch on the left side of the fish. The ventral branch was prioritized, counting from the assumed angle of the gill arch (defined as the joint formed by the epibranchial and ceratobranchial bones) and starting from the second filament from the angle. If the number of suitable filaments was too low on the ventral branch, suitable filaments on the dorsal branch, if present, were included, starting from the angle at the gill arch (Fig. 1E). The filaments scored per slide per round were not marked on the slides; thus, which filaments were included varied somewhat.
Target gill area and suitability criteria: fish <2 g
Material for evaluation included 2 slides of whole fry. All available gill tissue was assessed on both sides, and the best-oriented filaments (i.e., those most compatible with our protocol) were evaluated. The gills needed at least 1 filament, and the gill changes and score scale were the same as for larger fish. However, the thresholds for a filament’s suitability were not included (i.e., asymmetric lamellae and short filaments could still be included if no better option was available; Fig. 1F).
Gill changes
Seven gill changes were scored on each of up to 10 filaments per gill (Fig. 2A–F). A severity grading protocol was established for each change, and each filament was scored based on a severity grade recorded on an ordinal scale (0–4) for the individual changes (Table 2). Ten filaments (maximum) were scored per fish, and for mucous cells, the maximum score per filament was 2. Thus, the maximum aggregate score for mucous cells per fish was 20. Similar calculations were made for the other gill changes (Table 2). Additionally, 2 control parameters were recorded for each slide. Before the final evaluation, the severity grading was established and refined through 3 pilot rounds of individual scoring by the 3 raters of a subset of the gills used in the final evaluation (25–30 in each pilot), followed by discussion and adjustments. Clubbing was defined as cellular thickening of the distal end of the lamellae with combinations of mucous cells, hypertrophic and/or hyperplastic epithelial cells, circulatory disturbances, inflammation, and/or degeneration. Infiltration with inflammatory cells in lamellae, without differentiation between the different leukocyte types, constituted lamellar inflammation; necrosis and/or apoptosis of the gill epithelium or subepithelial tissue constituted degeneration.

Figure 2. Examples of semiquantitative histologic scoring criteria.
Table 2. Gill changes and their scoring at filament level according to their severity grade.
NA = not applicable.
Supplementary registrations
Raters assessed slide quality as optimal (= 1) or suboptimal (= 0) based on tissue orientation, symmetry, amount of cartilage, staining, contrast, artifacts, autolytic changes, or other postmortem changes. Raters also registered whether the correct starting point according to the protocol described above was easy to find: yes (= 1), not sure (= 0), as well as any other finding or remark, such as specific agents, epitheliocystis, or shortcomings of the slide, such as missing filaments.
Statistics
Gill change scores for up to 10 filaments per gill were summed to an aggregate score. No adjustments were made for gills having <10 scoreable filaments. The assumed clinical relevance for each gill change was used as the basis for transforming the aggregate score into a category score, graded 0–3 for mucous cell number, lamellar clubbing, and lamellar hypertrophy and/or hyperplasia, and 0–2 for interlamellar hypercellularity, aneurysms, lamellar inflammation, and degeneration. An increasing category score would mean increased gill reactions, and depending on the gill change, these reactions were classified as irritations or lesions. A critical remark is that the aggregate score thresholds or transformation criteria used are arbitrary and founded on a “biological relevance” approach, evaluating the consequence of different gill change scores at filament level on the aggregate score and the potential to affect gill function. The aggregate scores were transformed according to the scales used (e.g., a rater evaluated 10 filaments in a sample, according to Table 2; 5 filaments were given mucous cell score of 1, and 5 scored 2; the aggregate score of 15 resulted in a category score of 2 [i.e., classified as having moderate signs of irritation according to Table 3]).
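The two-step transformation described above (filament scores summed to an aggregate score, then mapped to a category score) can be sketched in Python. The cut points in `MUCOUS_CELL_CUTS` are hypothetical placeholders chosen only so that the worked example from the text (aggregate 15 gives category 2) holds; they are not the Table 3 thresholds, and the function names are illustrative.

```python
from bisect import bisect_right

# HYPOTHETICAL lower bounds of category scores 1, 2, and 3 for mucous cells.
# The real thresholds are defined in Table 3 and are not reproduced here.
MUCOUS_CELL_CUTS = [1, 8, 16]


def aggregate_score(filament_scores, max_filaments=10):
    """Sum the filament-level scores of one gill (no adjustment for <10 filaments)."""
    if not 1 <= len(filament_scores) <= max_filaments:
        raise ValueError("expected between 1 and 10 filament scores per gill")
    return sum(filament_scores)


def category_score(aggregate, cuts):
    """Map an aggregate score to a category score via ascending cut points."""
    return bisect_right(cuts, aggregate)


# Worked example from the text: 5 filaments scored 1 and 5 filaments scored 2.
filaments = [1] * 5 + [2] * 5
agg = aggregate_score(filaments)
print(agg, category_score(agg, MUCOUS_CELL_CUTS))   # 15 2
```

With a maximum filament score of 2, the largest possible aggregate for mucous cells is 10 × 2 = 20, which falls in the highest category under any set of cut points of this form.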
Table 3. The category scales and aggregate score thresholds. Each gill has an aggregate score for each change, calculated by summing the score from 10 filaments/gill, based on Table 2. Aggregate scores are then transformed into category scores.
NA = not applicable.
Lesions refer here to the rightmost column, where moderate and marked presence mean <20% and >20% of scoreable gill tissue involved, respectively.
Impaired gill function is expected at category score 3.
Agreement within (intra) rater and between (inter) raters was assessed using the kappaetc-framework (Stata), 15 and agreement within the rater was assessed both per rater basis and on an overall (global) level treating the 3 evaluations on each gill as “virtual” raters assessing 45 gills × 3 rounds of counts per gill (1–135). The percent agreement (PA) and Gwet agreement coefficient (GAC) 13 were calculated on the category score for each gill change. The calculations were performed unweighted to assess absolute agreement. When determining the strength of agreement, the probabilistic benchmarking method 13 was based on the kappaetc-framework, including the Landis and Koch variant of the benchmark scale.13,19 A conditional SE was used when estimating the benchmark interval for the intra-rater assessment; thus, this specific group of raters was used.
In contrast, an unconditional SE was used when estimating the inter-rater assessment benchmark interval to generalize to any rater. A generalized linear model (GLM) with ordinal family and logit link function was created to assess differences between rater scores for each gill change. All assessments were performed in Stata (release 16; StataCorp).
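As an illustration of the two agreement measures named above, the following Python sketch computes unweighted percent agreement and Gwet's first-order agreement coefficient (AC1) for any number of raters. It assumes that the GAC used in the study corresponds to the unweighted AC1, that every gill has at least 2 ratings, and it omits the standard errors and benchmarking that kappaetc reports; the function name and example data are hypothetical.

```python
from collections import Counter


def agreement(ratings, categories):
    """Return (percent agreement, Gwet AC1) for unweighted categorical ratings.

    ratings    -- one list per gill with the category score from each rater
    categories -- all category scores that the scale allows (e.g., [0, 1, 2, 3])
    """
    q = len(categories)
    n = len(ratings)
    pa_sum = 0.0
    pi = {k: 0.0 for k in categories}      # mean share of ratings per category
    for scores in ratings:
        r = len(scores)                    # raters who scored this gill (>= 2)
        counts = Counter(scores)
        # Share of agreeing rater pairs among all rater pairs for this gill.
        pa_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        for k, c in counts.items():
            pi[k] += c / r / n
    pa = pa_sum / n
    # Chance agreement as defined for AC1.
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return pa, (pa - pe) / (1 - pe)


# Hypothetical category scores for 5 gills from 3 raters.
example = [[0, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0], [2, 2, 1]]
pa, ac1 = agreement(example, categories=[0, 1, 2, 3])
print(round(pa, 3), round(ac1, 3))
```

For the intra-rater analysis, the same function could be applied with the 3 rounds of one rater taking the place of the 3 raters, mirroring the "virtual" rater approach described above.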
Results
Number of gill filaments scored and distribution of scores
A total of 402 of 405 planned gill assessments (99.3%) were completed: 135, 135, and 132 for raters 1–3, respectively. Most gill changes were skewed toward score 0; most lesions were mild, with a few moderate cases (Fig. 3). The percentage of examined gills having no observable gill changes (score of 0) was 52.7% for mucous cells, 60.7% for clubbing, 84.8% for hypertrophy and/or hyperplasia, 46.0% for interlamellar hypercellularity, 91.8% for aneurysms, 31.3% for lamellar inflammation, and 79.9% for degeneration.

Figure 3. The aggregate and category scores distributed per gill change, including all rounds and raters. The aggregate scores are shown in the left panel, and the category scores in the right panel.
Histoscores
The average scores per fish presented below are all based on aggregate scores for 10 filaments per fish (Table 2). Mucous cell scores were, on average, 3.5/fish for all raters (3.5, 1.7, and 5.2 for raters 1–3, respectively). There was a skewing toward 0-scores (Fig. 3), and average scores, excluding 0-scores for the 3 raters, were 7.9, 7.0, and 8.7, respectively (Fig. 4A, 4B).

Figure 4. Histologic features used in scoring gill lesions.
Lamellar clubbing was, on average, 3.2 for all raters per fish, and individually 4.1, 1.6, and 4.0 for raters 1–3. Also, there was a skewing toward 0-scores (Fig. 3); average scores, excluding 0-scores for the 3 raters, were 10, 5.5, and 8.4, respectively (Fig. 4C, 4D).
Lamellar hypertrophy and/or hyperplasia with epithelial thickening without interlamellar involvement was scored on average 0.9 per fish for all raters and individually 0.7, 0.5, and 1.6 for raters 1–3. Excluding 0-scores, the scores for raters 1–3 were 4.0, 5.1, and 8.3, respectively (Fig. 4E).
Interlamellar hypercellularity was scored on average 1.5 per fish for all raters, and individually 2.2, 0.2, and 2.0 for raters 1–3. Excluding 0-scores, the scores for raters 1–3 were 3.2, 1.6, and 3.0, respectively (Fig. 4F–H).
Aneurysms were scored collectively with no differentiation between acute (fresh blood), subacute, or chronic lesions (all blood resorbed) with an average score of 0.3, and individually 0.3, 0.3, and 0.3 for raters 1–3. Small aneurysms located in the distal end of the lamellae were scored under this gill change and not as lamellar clubbing (Fig. 4I).
Lamellar inflammation was, on average, 2.4 for all raters, and individually 3.2, 1.6, and 2.5 for raters 1–3. When excluding 0-scores, the scores for raters 1–3 were 4.0, 2.6, and 3.9, respectively (Fig. 4J).
Degeneration was, on average, 0.4 for all raters, and individually 1.3, 0.03, and 0.00 for raters 1–3. When excluding 0-scores, the scores for raters 1 and 2 were 2.2 and 1.0, respectively; rater 3 recorded no scores above 0 (Fig. 4K, 4L).
Rater agreement and differences
The intra-rater agreement was substantial to almost-perfect for all raters (Table 4). The GLM that was carried out for each of the gill changes showed systematic differences between the raters, and the margin analyses showed that the differences were related to using the low end of the scale (i.e., scores 0 and 1; Suppl. Table 1). Rater 2 scores were significantly different from rater 1 for mucous cell numbers, lamellar clubbing, lamellar hypertrophy and/or hyperplasia, and interlamellar hypercellularity; rater 3 differed from rater 1 for mucous cell number and lamellar inflammation (Suppl. Table 1).
Table 4. The results from the agreement analysis, based on the category scores and using the kappaetc-framework in Stata.
GAC = Gwet agreement coefficient; PA = percent agreement.
Calculated for GAC using Gwet probabilistic benchmarking method, taking the uncertainty of the coefficient estimate into account. The numbers may, therefore, deviate from the actual coefficient intervals in the benchmark scale from the kappaetc-framework in Stata.13,15,19 The agreement coefficients and their corresponding strength of agreement in this scale are as follows: 0.00 = poor; >0.00–<0.20 = slight; ≥0.20–<0.40 = fair; ≥0.40–<0.60 = moderate; ≥0.60–<0.80 = substantial; ≥0.80–1.00 = almost perfect.
For degeneration, rater 3 scored all gills as 0, resulting in an intra-rater agreement for this rater to be incalculable (IC).
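The probabilistic benchmarking mentioned in the Table 4 footnotes explains why a reported strength of agreement can be lower than the level in which the coefficient itself falls. The Python sketch below is a simplified illustration, not the kappaetc implementation: it assumes a normal approximation for the coefficient, accumulates interval membership probabilities from the top of the Landis and Koch scale downward, and returns the first level at which the cumulative probability reaches 0.95. The treatment of the lowest ("poor") interval and the 0.95 threshold are assumptions of this sketch.

```python
from statistics import NormalDist

# Landis and Koch benchmark intervals, highest level first.
# Treating "poor" as everything from -1 up to 0 is an assumption of this sketch.
LANDIS_KOCH = [
    (0.80, 1.00, "almost perfect"),
    (0.60, 0.80, "substantial"),
    (0.40, 0.60, "moderate"),
    (0.20, 0.40, "fair"),
    (0.00, 0.20, "slight"),
    (-1.00, 0.00, "poor"),
]


def benchmark(coef, se, critical=0.95):
    """Return the benchmark label for a coefficient with standard error se > 0."""
    z = NormalDist()
    cumulative = 0.0
    for lower, upper, label in LANDIS_KOCH:
        # Probability that the true coefficient lies within [lower, upper].
        cumulative += z.cdf((coef - lower) / se) - z.cdf((coef - upper) / se)
        if cumulative >= critical:
            return label
    return LANDIS_KOCH[-1][2]


# The same point estimate is benchmarked differently depending on its precision.
print(benchmark(0.85, 0.01))   # almost perfect
print(benchmark(0.85, 0.05))   # substantial
```

A coefficient of 0.85 sits in the "almost perfect" interval, but with a standard error of 0.05 only about 84% of the probability mass lies at or above 0.80, so the label drops one level.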
Two raters noted suboptimal quality in 19 of 45 and 27 of 44 slides; the third rater made no comments on slide quality. Of the 135 slides evaluated, 17 had <10 filaments, as these did not fulfill the suitability criteria. Autolytic changes were essentially absent. Examples of different slide quality issues are shown in Suppl. Fig. 1.
Discussion
Understanding gill reactions in farmed Atlantic salmon in different production environments is needed, 34 and a scoring system for mild-to-moderate lesions will increase our perception of the transition from normal variation toward clinical manifestation of gill disease. Such a system could reveal initial inflicting stimuli, enabling early intervention. We found that scoring of mild-to-moderate histopathologic gill changes was consistent within the scorer (intra-rater), with almost-perfect agreement for most gill changes over 3 rounds of scoring. Outcomes were slightly less consistent for inter-rater scores but still moderate to almost-perfect agreement for 5 of 7 gill changes assessed by raters with variable experience in histologic evaluation. Two categories of gill changes had “poor” inter-rater agreement. It is expected that intra-rater is higher than inter-rater agreement. 6 Re-scaling and better definition of the gill changes with poor inter-rater agreement are justified for future studies, plus optimizing tissue orientation, reducing out-of-focus areas, and ensuring good color contrast.
Degeneration and/or necrosis of epithelial cells may occur in normal gill tissue due to a high turnover rate, 33 and cellular debris may resemble necrosis in epithelial cells. This may have caused variation for the variable “degeneration”, in which differences occurred between scorers. Further, sparse lamellar inflammation can be difficult to detect, 34 and interlamellar hypercellularity and lamellar hyperplasia and/or inflammation can overlap, affecting classification of scores, and cause differences between scorers. Interlamellar hypercellularity can also be challenging to assess on gills with suboptimal tissue orientation or when the lamellae are short, 39 all factors contributing to variation.
We anonymized the slides and thus recognition from one round to the next was considered unlikely, but pilot rounds and the first round of scoring likely primed the raters for subsequent rounds. This might unconsciously have impacted the raters, rendering them less focused on details, especially for changes of low frequency. Some of the variations observed between raters could reside in different numbers of filaments being included or which filaments were scored. Including 10 filaments per gill was an attempt to mitigate some of these effects on the aggregate scores, as lesion distribution is often diffuse in gill tissue, 34 but variation does exist within different parts of the gills. This could have had an impact on the variation observed.
We used a scale of 3–5 score levels, which is in line with older studies2,22 and a recent study of gill lesions in farmed Atlantic salmon in which the authors recommend a scale of 0–5. 12 Our scoring also aligns with statistical evaluations where it has been shown that fewer than 3 grades reduce sensitivity, 10 and a higher number of animals would have been required to detect real biological differences in our material. In contrast, a higher number (>5) of score categories has a negative impact on repeatability or reproducibility,10,31 simply because it will be more difficult to distinguish between categories and thus assign the scores. This would have reduced repeatability in our study.
We found a low prevalence of target changes in the gills in our study, and the difference between raters was linked to the scores “0” and “1,” particularly for those target changes that included an inherent enumeration of changes present (without actual counting done by the raters). So, to what extent this is a misdiagnosis for either score (i.e., absence [0] or presence [1] of changes), or is related to the “sensitivity” of the rater to changes, is unknown. We believe that a low prevalence of changes will impact how the changes are interpreted. However, studies have shown that errors for normal images (false-positives) are higher than for abnormal images (false-negative errors), irrespective of experience. 4 Ultimately, the extent to which differences in cue usage explain differences observed herein is unknown. Several factors will impact the extent to which minor changes are recorded or missed. Studies have shown that missing real changes increased when the prevalence of the target is low, categorized as a low-prevalence effect. 40 We found that a low prevalence of targets demanded sustained attention because raters had to search each image thoroughly for target features in up to 450 filaments over 3 rounds. Sustained attention consumes cognitive resources, which can result in disengagement and an increased likelihood of observational error. 28 The general understanding is that image perception, successful detection of target changes, interpretation of targets, and diagnoses are based on finely tuned cognitive processes, 14 including visual search, pattern recognition, and various interpretation strategies, 18 all summarized as cue-based associations. 3 Further, several studies indicate that cue usage is vital in examining tissue and correctly identifying tissue features within the normal range versus pathologic processes. Including an assessment of cue usage among raters in future studies would add to understanding the observed differences. 
To what extent cue usage explains the difference between raters is not known, and the design of the study did not allow us to record it.
There is no consensus on the choice of statistical method for assessing inter-rater agreement, 7 and the choice depends, in general, on the type of data (categorical, ordinal, or continuous) and the number of raters involved in the study. 11 The GAC offers a good method for assessing inter-rater agreement, particularly when the data are ordinal, and is less susceptible to the prevalence paradox that typically occurs when most of the ratings fall into a single category, as was the case in our study with a high number of 0-scores. Such an imbalance leads to a low kappa value with, for example, the Cohen or Fleiss kappa, 7 even when there is substantial agreement; the GAC mitigates this issue by providing a more stable estimate of agreement that is less affected by the marginal distribution of ratings, particularly for rare events. 13 Finally, GAC is suitable for histologic studies involving >2 raters.
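The prevalence paradox described above can be made concrete with a small two-rater example in Python. The data are invented for illustration (20 gills, almost all scored 0), and the sketch assumes that the GAC corresponds to the unweighted Gwet AC1; neither function reproduces the kappaetc output.

```python
from collections import Counter


def observed_agreement(a, b):
    """Share of gills on which the two raters gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b, categories):
    """Cohen kappa: chance agreement from the product of each rater's marginals."""
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    po = observed_agreement(a, b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in categories)
    return (po - pe) / (1 - pe)


def gwet_ac1(a, b, categories):
    """Gwet AC1: chance agreement from the average marginal share per category."""
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    po = observed_agreement(a, b)
    pi = [(ca[k] + cb[k]) / (2 * n) for k in categories]
    pe = sum(p * (1 - p) for p in pi) / (len(categories) - 1)
    return (po - pe) / (1 - pe)


# Hypothetical scores: both raters give 0 to 18 gills and disagree on 2 gills.
rater_a = [0] * 18 + [1, 0]
rater_b = [0] * 18 + [0, 1]

print(observed_agreement(rater_a, rater_b))              # 0.9
print(round(cohen_kappa(rater_a, rater_b, [0, 1]), 3))   # -0.053
print(round(gwet_ac1(rater_a, rater_b, [0, 1]), 3))      # 0.89
```

The raters agree on 90% of the gills, yet kappa is slightly negative because almost all ratings fall in one category, whereas AC1 stays close to the observed agreement.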
Supplemental Material
Supplemental material, sj-pdf-1-vdi-10.1177_10406387241310900 for Assessment of a semiquantitative scoring system for mild-to-moderate gill lesions in Atlantic salmon reared in recirculating aquaculture systems in Norway by Thomas Amlie, Alf Dalum, Marit Stormoen and Øystein Evensen in Journal of Veterinary Diagnostic Investigation
Footnotes
Acknowledgements
We thank Asgeir Østvik for valuable help with sampling and Barbo Klakegg for valuable help administering the project. We are also indebted to Drs. Kilem Gwet and Daniel Klein for their valuable input on the statistical analysis and interpretation of data.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Thomas Amlie is an employee of Åkerblå.
Funding
The Research Council of Norway funded this study, project 298906, to Åkerblå.
Supplemental material
Supplemental material for this article is available online.
References
