Abstract

Compromised gill health is a critical cause of forfeited welfare in Atlantic salmon farming. Detecting and quantifying the early onset of gill disease is important to reveal initial inflicting stimuli. We collected gill samples of 45 Atlantic salmon from 2 commercial recirculating aquaculture systems (RASs) spanning fry-to-market-size fish with no clinical signs of gill disease. Gill samples were assessed histologically by 3 independent raters with different levels of experience. Semiquantitative scoring for 7 types of gill changes was carried out for 10 filaments per gill (450 filaments total) over 3 rounds on anonymized samples. Scores were summarized for each type of gill change. The assumed clinical relevance for each change was transformed into a category score, followed by an assessment of agreement within (intra) and between (inter) raters. A generalized linear model estimated the difference in score levels between raters. For each rater, intra-rater agreement was high for 6 gill changes and moderate for 1 gill change. Inter-rater agreement was moderate to almost-perfect, except for 2 gill changes; generalized linear model regression revealed systematic differences in score usage between the raters. Our scoring protocol worked satisfactorily for mucous cell amount, lamellar clubbing, lamellar hypertrophy and/or hyperplasia, and aneurysms, despite different levels of expertise in histologic evaluation. Intra-rater agreement was consistent, but differences existed for interlamellar hypercellularity, lamellar inflammation, and degeneration. Scoring subclinical gill changes is a challenge, and our scoring system for mild-to-moderate lesions may enable early intervention to limit the detrimental effects of poor gill health in RAS farming.

Keywords

Atlantic salmon; gill health; histopathology; RAS; rater agreement; semiquantitative scoring; statistics

The gills are suitable as a biomarker for environmental impact in intensive aquaculture production.³⁶ Although the response repertoire of fish gills to irritants in the water is limited and often nonspecific,^20,39 the severity of the response is influenced by the level of irritants, exposure, and recovery time.^20,26 The induced changes increase the diffusion distance between blood and water, potentially affecting gill function.³⁴ Impaired gill function is expected to precede morphologic changes discernible at the light microscopic level.³⁴ A unifying histologic scoring system that quantifies mild-to-moderate gill changes could help increase our understanding of the early-stage gill changes that we believe can be diagnosed at the light microscopy level before manifestation of clinical disease.

Several grading and scoring systems have been deployed to assess gill pathology through histologic assessment of lesions,^{2,12,22,25,26,33} and used in ensuing studies, with or without modifications.^{8,17,24,27,29,32} Pathologic changes commonly evaluated in the gills are hypertrophy and/or hyperplasia of gill tissue, including mucous cell proliferation, lamellar fusion and edema or “lifting,” cellular anomalies, inflammation, and circulatory disturbances. Clavate lamellae, or clubbing, and the occurrence of various microorganisms are also reported.

“Histopathology is not an exact science”³⁸ as it is based on subjective interpretation of tissue changes,^5,30 and precise understanding of gill lesions is best obtained when structural changes have clear definitions and minimal processing artifacts.²³ Despite using predetermined score definitions, raters might still interpret definitions and findings differently,¹ and borderline scores might add to the variability among raters.³⁰ Definition inconsistency is improved by revisions of scoring protocols.⁹

A scoring system must produce consistent results with satisfactory repeatability and reproducibility (i.e., high agreement within and between evaluators).^6,16,37 Studies have evaluated the objectivity of gill scoring systems through agreement assessment,^12,22 but studies on mild-to-moderate gill pathology in Atlantic salmon are nonexistent. A semiquantitative gill scoring system should fit the expected range of lesions,^10,21 and establishing a gill scoring system requires testing for robustness across rater experience.

We aimed to assess semiquantitative scoring of histopathologic changes in the gills of Atlantic salmon without clinical manifestation of gill disease obtained from commercial recirculating aquaculture system (RAS) sites, using raters with different levels of histopathology experience. Scoring was evaluated through intra- and inter-rater agreement analysis. Better understanding of mild-to-moderate gill lesions will improve our understanding of the transition process from normal variation toward clinical manifestation of gill disease, with a potential for pre-emptive intervention.

Materials and methods

Gills: origin, sampling, processing, and scoring

We used gill tissue from a repeated cross-sectional study on Atlantic salmon (Salmo salar L., n = 441) gill health. The fish sampled were farmed in commercial aquaculture operations, and fish size spanned fry-to-market-size. No clinical manifestations of gill disease were reported in the fish groups over the sampling period. We included 4 fish groups farmed in 2 RAS facilities in Norway. Site 1 is a commercial freshwater facility; site 2 is a seawater on-growth facility (2 groups each).

At each sampling, fish were haphazardly netted into a holding vessel with water and euthanized with an overdose of benzocaine (concentration depending on temperature, size of fish, and excitation). Depending on the size of the gill, the whole or the ventral branch of the second left gill arch was sampled from fish >2 g, and for the smallest fish, several gill arches were included. Fish <2 g were sampled whole. The samples were transferred to 10% phosphate-buffered formalin shortly after euthanasia and stored in formalin until further processing. The time between confinement, euthanasia, and sampling varied due to different handling procedures at various production or operational stages (i.e., the largest fish were crowded before netting at specific times, and the smallest fish were euthanized in groups). Samples were processed routinely³⁵ at a single laboratory, sectioned at 2–4 µm, stained with H&E, and the whole slide scanned to a digital image format and viewed with digital imaging software (NDP.view2; Hamamatsu Photonics).

We selected 45 fish gills semi-randomly from the pool described above for detailed histologic evaluation; all fish groups and several sizes were represented in these samples, and slides with major technical quality issues were discarded before deploying Excel (v.16; Microsoft) to select samples randomly. Each gill was evaluated 3 times by 3 evaluators (i.e., raters) with different levels of experience in histopathology over a period of 2–3 mo, rater 1 having less experience than raters 2 and 3. Each gill was given a random identity for each round (i.e., 1–45, 46–90, and 91–135 for rounds 1–3, respectively). To ensure anonymized assessment, the order by which the samples were assessed varied among the raters. Raters 1 and 3 rated in ascending order; rater 2 chose a more haphazard order. Up to 450 unique gill filaments were assessed per round, totaling 1,350 filament assessments per rater (Table 1). Criteria were set for which filament to include. An extensive scoring system developed by A. Dalum (not published) was used as the basis for the protocol. Scoring was performed with a refined protocol established through 3 pilot assessments to reduce subjective interpretation variation and create a fine-tuned scale fitted for low-to-moderate histopathologic changes (encompassing which and how many gill changes to include or differentiate between, definitions, the number of score levels, and scale-type [i.e., ordinal or visual analogue]). As we were aware that too much refinement and calibration could bias the result, especially since the pilot gills were to constitute a subset of the gills used in the final evaluation, we commenced with the final assessment after the third revision. In the final evaluation, the same 3 raters assessed an increased number of gills, all of which were anonymized as described above.
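The per-round identities described above (1–45, 46–90, and 91–135) can be illustrated with a minimal sketch. The source does not state how the identities were generated, so the function name, the fixed seed, and the use of Python's `random` module are assumptions made purely for illustration.

```python
import random


def assign_round_ids(n_gills=45, n_rounds=3, seed=0):
    """Return {round_number: {gill_index: anonymized_id}}.

    Round r (1-based) draws its identities from the block
    (r - 1) * n_gills + 1 .. r * n_gills, shuffled independently,
    so the same gill gets an unrelated identity in every round.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    ids_by_round = {}
    for r in range(1, n_rounds + 1):
        block = list(range((r - 1) * n_gills + 1, r * n_gills + 1))
        rng.shuffle(block)
        ids_by_round[r] = {gill: block[gill] for gill in range(n_gills)}
    return ids_by_round


ids = assign_round_ids()
```

Each round uses its own block of numbers, so an identity alone reveals the round but not which gill it belongs to.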

Table 1.

The number of gill samples and filaments scored, distributed among fish weight and facilities. Three raters assessed each gill thrice (3 rounds), resulting in 1,350 filament assessments per rater.

Fish weight, g	Gill samples, facility 1*	Gill samples, facility 2†	Gill filaments assessed per rater per round (n = 450)‡
0.3–2§	4		40
2–15	3		30
15–30	4		40
30–50	3		30
50–150	9		90
500–2,000		8	80
2,000–3,000		6	60
3,000–5,500		8	80

* From 2 fish groups kept in freshwater. Fish weight is estimated as average on a tank basis, ≤150 g.

† From 2 fish groups kept in seawater. Fish weight is individual fish weight, >500 g.

‡ In general, 10 filaments were assessed per gill sample, but 17 of 135 samples had <10 filaments.

§ Fish sampled whole.

Registrations were done using a customized scoring form in Excel. The setting “hide-label” in NDP.view2 was used to prevent accidentally seeing the original slide number and to make the study anonymous.

Target gill area and suitability criteria: fish >2 g

Ten filaments per gill were targeted; we required a minimum of 5 suitable filaments per gill for assessment. A filament was deemed suitable for assessment if it had symmetrical lamellae with a vascular structure in ≥20% of the assumed length of the filament. Symmetry of ipsilateral lamellae within a filament was defined as a lamellar length deviation of <50% (Fig. 1A–C), and the length of the lamellae had to be larger than the width of the filament cartilage (Fig. 1D), avoiding assessment of lamellae in the regions close to their afferent (trailing edge of the filament) or efferent (leading edge of the filament) side. This area was then defined as the scoreable filament area (Fig. 1A).
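The suitability criteria above can be written as a simple predicate. This is a sketch only: the raters judged these criteria visually, and the `Filament` fields, the choice to measure deviation relative to the longer lamella, and the requirement that both lamellae exceed the cartilage width are assumptions introduced here, not definitions from the protocol.

```python
from dataclasses import dataclass


@dataclass
class Filament:
    """Hypothetical measurements for one filament (any consistent length unit)."""
    lamella_len_a: float       # length of one lamella in the compared pair
    lamella_len_b: float       # length of the other lamella in the pair
    cartilage_width: float     # width of the filament cartilage
    scoreable_fraction: float  # share of filament length with symmetrical, vascularized lamellae


def is_suitable(f: Filament) -> bool:
    longer = max(f.lamella_len_a, f.lamella_len_b)
    shorter = min(f.lamella_len_a, f.lamella_len_b)
    if longer <= 0:
        return False  # no lamellae to assess
    # Assumption: "deviation" is measured relative to the longer lamella.
    symmetric = (longer - shorter) / longer < 0.5
    # Assumption: both lamellae must exceed the cartilage width.
    long_enough = shorter > f.cartilage_width
    return symmetric and long_enough and f.scoreable_fraction >= 0.20


def select_filaments(filaments, target=10, minimum=5):
    """Return up to `target` suitable filaments, or [] if fewer than `minimum`."""
    suitable = [f for f in filaments if is_suitable(f)]
    return suitable[:target] if len(suitable) >= minimum else []
```

A filament with lamellae on one side only fails the symmetry test here because the shorter length is 0, which matches exclusion example C in Fig. 1.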

Figure 1.

Examples of inclusion (A, E, F) and exclusion (B–D) criteria for gill scoring. A. Ideal evaluation area: ipsilateral lamellar lengths are within 50% of each other, and their length surpasses filament cartilage thickness. B. Lamellar length discrepancy is >50%. C. Filament with lamellae on one side only. D. Filament cartilage thickness exceeds lamellar length. E. Evaluation start (1) and end (10) points: first 5 filaments and the one between 6 and 7 (×) are unsuitable. Indicated are the gill arch’s ventral (vb = ceratobranchial bone) and dorsal (db = epibranchial bone) branches and dorsal angle (da). F. The entire gill basket of fry was assessed; all filaments with lamellae were evaluated.

We assessed filaments from the second gill arch from the left side of the fish. The ventral branch was prioritized, counting from the assumed angle of the gill arch (defined as the joint formed by the epibranchial and ceratobranchial bone), starting from the second filament from the angle. If the number of suitable filaments was too low on the ventral branch, suitable filaments on the dorsal branch, if present, were included, starting from the angle at the gill arch (Fig. 1E). The filaments scored per slide per round were not marked on the slides; thus, there was some variation as to which filaments were included.

Target gill area and suitability criteria: fish <2 g

Material for evaluation included 2 slides of whole fry. All available gill tissue was assessed on both sides, and the best-oriented filaments (i.e., those most compatible with our protocol) were evaluated. The gills needed at least 1 filament, and the gill changes and score scale were the same as for larger fish. However, the thresholds for a filament’s suitability were not included (i.e., asymmetric lamellae and short filaments could still be included if no better option was available; Fig. 1F).

Gill changes

Seven gill changes were scored on each of up to 10 filaments per gill (Fig. 2A–F). A severity grading protocol was established for each change, and each filament was scored based on a severity grade recorded on an ordinal scale (0–4) for the individual changes (Table 2). Ten filaments (maximum) were scored per fish, and for mucous cells, the maximum score per filament was 2. Thus, the maximum aggregate score for mucous cells per fish was 20. Similar calculations were made for the other gill changes (Table 2). Additionally, 2 control parameters were recorded for each slide. Before the final evaluation, the severity grading was established and refined through 3 pilot rounds of individual scoring by the 3 raters of a subset of the gills used in the final evaluation (25–30 in each pilot), followed by discussion and adjustments. Clubbing was defined as cellular thickening of the distal end of the lamellae with combinations of mucous cells, hypertrophic and/or hyperplastic epithelial cells, circulatory disturbances, inflammation, and/or degeneration. Infiltration with inflammatory cells in lamellae, without differentiation between the different leukocyte types, constituted lamellar inflammation; necrosis and/or apoptosis of the gill epithelium or subepithelial tissue constituted degeneration.

Figure 2.

Examples of semiquantitative histologic scoring criteria. A. Lamellar clubbing, marked by distal margin thickening (arrowheads), with mucous cells and mononuclear leukocytes. B. Increased lamellar epithelial layer, cell numbers, and mucous cells (arrowheads). C. Interlamellar hypercellularity (boxed mononuclear leukocytes). D. Inflammation beneath the lamellar epithelium (arrowheads) and at the base. E. Newly formed aneurysms. F. Subepithelial lamellar changes with degenerate and necrotic cells (arrowhead).

Table 2.

Gill changes and their scoring at filament level according to their severity grade.

Gill change	Score
Gill change	0	1	2	3	4
Mucous cell number, average mucous cells per lamellae	<1	1–6	>6	NA	NA
Lamellar clubbing, % of lamellae affected	<20	≥20–40	≥40–60	≥60–80	≥80
Lamellar hypertrophy and/or hyperplasia, % of lamellae affected	<33	≥33–66	≥66	NA	NA
Interlamellar hypercellularity (incl. all forms of lamellar adhesion), % of interlamellar spaces affected	0	<20	≥20	NA	NA
Aneurysms, % of lamellae affected	0	<20	≥20	NA	NA
Lamellar inflammation, % of lamellae affected	0	<20	≥20	NA	NA
Degeneration, % of lamellae/interlamellar spaces affected	0	<20	≥20	NA	NA

NA = not applicable.
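The thresholds in Table 2 can be expressed as small mapping functions from an estimated value to a filament-level score. This is a sketch of the threshold logic only; the raters estimated these quantities visually, and the function names are hypothetical.

```python
def score_mucous_cells(avg_cells_per_lamella):
    """Mucous cell number: <1 -> 0, 1-6 -> 1, >6 -> 2."""
    if avg_cells_per_lamella < 1:
        return 0
    return 1 if avg_cells_per_lamella <= 6 else 2


def score_clubbing(pct_lamellae):
    """Lamellar clubbing: <20 -> 0, then one step per 20 percentage points, >=80 -> 4."""
    for score, lower_bound in ((4, 80), (3, 60), (2, 40), (1, 20)):
        if pct_lamellae >= lower_bound:
            return score
    return 0


def score_hypertrophy_hyperplasia(pct_lamellae):
    """Lamellar hypertrophy and/or hyperplasia: <33 -> 0, >=33 and <66 -> 1, >=66 -> 2."""
    if pct_lamellae >= 66:
        return 2
    return 1 if pct_lamellae >= 33 else 0


def score_presence(pct_affected):
    """Interlamellar hypercellularity, aneurysms, lamellar inflammation,
    and degeneration: 0 -> 0, <20 -> 1, >=20 -> 2."""
    if pct_affected <= 0:
        return 0
    return 1 if pct_affected < 20 else 2
```

With 10 filaments per gill, these scales give a maximum aggregate score of 40 for clubbing and 20 for each of the other changes.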

Supplementary registrations

Raters assessed slide quality as optimal (= 1) or suboptimal (= 0) based on tissue orientation, symmetry, amount of cartilage, staining, contrast, artifacts, autolytic changes, or other postmortem changes. Raters also registered ease at finding the correct starting point according to the protocol described above: yes (= 1), not sure (= 0), as well as any other finding or remark, such as specific agents, epitheliocystis, or shortcomings of the slide, such as missing filaments.

Statistics

Gill change scores for up to 10 filaments per gill were summed to an aggregate score. No adjustments were made for gills having <10 scoreable filaments. The assumed clinical relevance for each gill change was used as the basis for transforming the aggregate score into a category score, graded 0–3 for mucous cell number, lamellar clubbing, and lamellar hypertrophy and/or hyperplasia, and 0–2 for interlamellar hypercellularity, aneurysms, lamellar inflammation, and degeneration. An increasing category score would mean increased gill reactions, and depending on the gill change, these reactions were classified as irritations or lesions. A critical remark is that the aggregate score thresholds or transformation criteria used are arbitrary and founded on a “biological relevance” approach, evaluating the consequence of different gill change scores at filament level on the aggregate score and the potential to affect gill function. The aggregate scores were transformed according to the scales used (e.g., a rater evaluated 10 filaments in a sample, according to Table 2; 5 filaments were given mucous cell score of 1, and 5 scored 2; the aggregate score of 15 resulted in a category score of 2 [i.e., classified as having moderate signs of irritation according to Table 3]).

Table 3.

The category scales and aggregate score thresholds. Each gill has an aggregate score for each change, calculated by summing the score from 10 filaments/gill, based on Table 2. Aggregate scores are then transformed into category scores.

Category score*	Possible aggregate scores
Category score*	Mucous cell number	Lamellar clubbing	Lamellar hypertrophy/hyperplasia	Interlamellar hypercellularity/aneurysms/lamellar inflammation/degeneration
0 = no signs of irritation/lesions	0–5	0–9	0–2	0
1 = mild signs of irritation	6–10	10–19	3–5	NA
2 = moderate signs of irritation or lesion presence	11–15	20–29	6–9	1–10
3 = marked irritation or lesion presence	16–20	30–40	10–20†	11–20†

NA = not applicable.

* Lesions refer here to the rightmost column, where moderate and marked presence mean <20% and >20% of scoreable gill tissue involved, respectively.

† Impaired gill function is expected at category score 3.
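The transformation from filament scores to an aggregate score and then to a category score (Table 3) can be sketched as follows. The dictionary keys and function name are hypothetical. For the four changes in the rightmost column, the sketch keeps the row labels of Table 3 as printed (0, 2, and 3, with category 1 not applicable), although the text describes that scale as 0–2.

```python
from bisect import bisect_right

# Lower bounds of categories 1, 2, and 3 for each gill change (Table 3).
THRESHOLDS = {
    "mucous_cell_number": (6, 11, 16),
    "lamellar_clubbing": (10, 20, 30),
    "lamellar_hypertrophy_hyperplasia": (3, 6, 10),
}

# Changes in the rightmost column of Table 3 (category 1 is not applicable).
LESION_CHANGES = {
    "interlamellar_hypercellularity",
    "aneurysms",
    "lamellar_inflammation",
    "degeneration",
}


def category_score(change, filament_scores):
    """Sum the filament scores and map the aggregate to a category score."""
    aggregate = sum(filament_scores)
    if change in LESION_CHANGES:
        if aggregate == 0:
            return 0
        return 2 if aggregate <= 10 else 3
    # Number of lower bounds that the aggregate has reached.
    return bisect_right(THRESHOLDS[change], aggregate)


# Worked example from the text: 5 filaments scored 1 and 5 scored 2.
example = category_score("mucous_cell_number", [1] * 5 + [2] * 5)
```

In the worked example, the aggregate score is 15, which falls in the 11–15 band and gives category score 2 (moderate signs of irritation).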

Agreement within (intra) rater and between (inter) raters was assessed using the kappaetc-framework (Stata),¹⁵ and agreement within the rater was assessed both per rater basis and on an overall (global) level treating the 3 evaluations on each gill as “virtual” raters assessing 45 gills × 3 rounds of counts per gill (1–135). The percent agreement (PA) and Gwet agreement coefficient (GAC)¹³ were calculated on the category score for each gill change. The calculations were performed unweighted to assess absolute agreement. When determining the strength of agreement, the probabilistic benchmarking method¹³ was based on the kappaetc-framework, including the Landis and Koch variant of the benchmark scale.^13,19 A conditional SE was used when estimating the benchmark interval for the intra-rater assessment; thus, this specific group of raters was used.

In contrast, an unconditional SE was used when estimating the inter-rater assessment benchmark interval to generalize to any rater. A generalized linear model (GLM) with ordinal family and logit link function was created to assess differences between rater scores for each gill change. All assessments were performed in Stata (release 16; StataCorp).
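For orientation, the two unweighted point estimates named above, PA and the Gwet agreement coefficient (AC1), can be computed as in the sketch below. This is not the kappaetc implementation used in the study: it omits standard errors and benchmarking, the function name is hypothetical, and the example ratings are invented for illustration. It assumes every gill has at least 1 rating.

```python
def percent_agreement_and_ac1(ratings, categories):
    """ratings: one list of category scores per gill (one entry per rater).

    Returns (PA, AC1), both unweighted.
    """
    q = len(categories)
    n = len(ratings)
    # counts[i][k]: number of raters who assigned gill i to category k
    counts = [[gill.count(k) for k in categories] for gill in ratings]

    # Observed agreement: share of agreeing rater pairs per gill,
    # averaged over gills rated by at least 2 raters.
    pair_terms = []
    for row in counts:
        r_i = sum(row)
        if r_i >= 2:
            pair_terms.append(sum(c * (c - 1) for c in row) / (r_i * (r_i - 1)))
    pa = sum(pair_terms) / len(pair_terms)

    # Chance agreement from the average prevalence of each category.
    prevalence = [sum(row[k] / sum(row) for row in counts) / n for k in range(q)]
    pe = sum(p * (1 - p) for p in prevalence) / (q - 1)

    return pa, (pa - pe) / (1 - pe)


# Invented example: 4 gills, 2 raters, categories 0 and 1.
pa, ac1 = percent_agreement_and_ac1([[0, 0], [0, 0], [0, 1], [1, 1]], [0, 1])
```

Here PA is 0.75 and AC1 is 9/17 (about 0.53), because the chance-agreement term is 0.46875.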

Results

Number of gill filaments scored and distribution of scores

A total of 402 gill assessments were scored (99.3% of 405 = 45 gills × 3 rounds × 3 raters), with 135, 135, and 132 for the 3 raters, respectively. Most gill changes were skewed toward score 0; most lesions were mild, with a few moderate cases (Fig. 3). The percentage of examined gills having no observable gill changes (score of 0) was 52.7% for mucous cells, 60.7% for clubbing, 84.8% for hypertrophy and/or hyperplasia, 46.0% for interlamellar hypercellularity, 91.8% for aneurysms, 31.3% for lamellar inflammation, and 79.9% for degeneration.

Figure 3.

The aggregate and category scores distributed per gill change, including all rounds and raters. The aggregate scores are shown in the left figure, and the category scores are on the right. A. Mucous cells. B. Clubbing. C. Lamellar hypertrophy and/or hyperplasia. D. Interlamellar hypercellularity. E. Aneurysm. F. Lamellar inflammation. G. Degeneration.

Histoscores

The average scores per fish presented below are all based on aggregate scores for 10 filaments per fish (Table 2). Mucous cell scores were, on average, 3.5/fish for all raters (3.5, 1.7, and 5.2 for raters 1–3, respectively). There was a skewing toward 0-scores (Fig. 3), and average scores, excluding 0-scores for the 3 raters, were 7.9, 7.0, and 8.7, respectively (Fig. 4A, 4B).

Figure 4.

Histologic features used in scoring gill lesions. A, B. Mucous cell hyperplasia (arrowhead in insert) in the respiratory epithelium (A = score 1, B = score 2, with agreement between raters). C, D. Lamellar clubbing is dominated by mucous cells and mononuclear leukocytes. E. Changes interpreted as lamellar epithelial hypertrophy (arrowheads in insert), although it should be noted that differentiation between hypertrophic epithelial cells and chloride cells can be challenging in H&E-stained sections. Note also mucous cell hyperplasia. F–H. Hypercellular inter- and intralamellar areas (boxes and inserts), inflammation (mononuclear leukocytes; F and G), or epithelial hyperplasia (H), giving reduced interlamellar space. I. Fresh aneurysms (insert). J. Lamellar subepithelial inflammation with mononuclear leukocytes (box and insert). K, L. Degeneration and/or necrosis (karyorrhexis; black arrowheads, and white arrowheads in inserts), minute (K) in the lamellae, or more widespread in the multilayered filament epithelium (L).

Lamellar clubbing was, on average, 3.2 for all raters per fish, and individually 4.1, 1.6, and 4.0 for raters 1–3. Also, there was a skewing toward 0-scores (Fig. 3); average scores, excluding 0-scores for the 3 raters, were 10, 5.5, and 8.4, respectively (Fig. 4C, 4D).

Lamellar hypertrophy and/or hyperplasia with epithelial thickening without interlamellar involvement was scored on average 0.9 per fish for all raters and individually 0.7, 0.5, and 1.6 for raters 1–3. Excluding 0-scores, the scores for the 3 raters were 4.0, 5.1, and 8.3, respectively (Fig. 4E).

Interlamellar hypercellularity was scored on average 1.5 per fish for all raters, and individually 2.2, 0.2, and 2.0 for raters 1–3. Excluding 0-scores, the scores for raters 1–3 were 3.2, 1.6, and 3.0, respectively (Fig. 4F–H).

Aneurysms were scored collectively with no differentiation between acute (fresh blood), subacute, or chronic lesions (all blood resorbed) with an average score of 0.3, and individually 0.3, 0.3, and 0.3 for raters 1–3. Small aneurysms located in the distal end of the lamellae were scored under this gill change and not as lamellar clubbing (Fig. 4I).

Lamellar inflammation was, on average, 2.4 for all raters, and individually 3.2, 1.6, and 2.5 for raters 1–3. When excluding 0-scores, the scores for raters 1–3 were 4.0, 2.6, and 3.9, respectively (Fig. 4J).

Degeneration was, on average, 0.4 for all raters, and individually 1.3, 0.03, and 0.00 for raters 1–3. When excluding 0-scores, the scores for raters 1 and 2 were 2.2 and 1.0, respectively; rater 3 recorded only 0-scores (Fig. 4K, 4L).

Rater agreement and differences

Intra-rater agreement was substantial to almost-perfect at the global level and, for each individual rater, for all but 1 gill change, for which agreement was moderate; degeneration was incalculable for rater 3 (Table 4). The GLM that was carried out for each of the gill changes showed systematic differences between the raters, and the margin analyses showed that the differences were related to using the low end of the scale (i.e., scores 0 and 1; Suppl. Table 1). Rater 2 scores were significantly different from rater 1 for mucous cell numbers, lamellar clubbing, lamellar hypertrophy and/or hyperplasia, and interlamellar hypercellularity; rater 3 differed from rater 1 for mucous cell number and lamellar inflammation (Suppl. Table 1).

Table 4.

The results from the agreement analysis, based on the category scores and using the kappaetc-framework in Stata.

Gill change	PA	GAC	Strength of agreement (probabilistic approach)*
Intra-rater: rater 1
Mucous cell number	0.93	0.91	Almost perfect
Lamellar clubbing	0.93	0.92	Almost perfect
Lamellar hypertrophy/hyperplasia	0.88	0.87	Substantial
Hypercellular interlamellar area	0.91	0.88	Almost perfect
Aneurysms	0.94	0.93	Almost perfect
Lamellar inflammation	0.85	0.82	Substantial
Degeneration	0.66	0.55	Moderate
Intra-rater: rater 2
Mucous cell number	0.87	0.85	Substantial
Lamellar clubbing	0.93	0.93	Almost perfect
Lamellar hypertrophy/hyperplasia	0.93	0.92	Almost perfect
Hypercellular interlamellar area	0.76	0.73	Substantial
Aneurysms	0.97	0.97	Almost perfect
Lamellar inflammation	0.76	0.69	Moderate
Degeneration	0.94	0.94	Almost perfect
Intra-rater: rater 3
Mucous cell number	0.88	0.85	Substantial
Lamellar clubbing	0.92	0.92	Almost perfect
Lamellar hypertrophy/hyperplasia	0.92	0.91	Almost perfect
Hypercellular interlamellar area	0.73	0.65	Moderate
Aneurysms	0.92	0.92	Almost perfect
Lamellar inflammation	0.83	0.78	Substantial
Degeneration†	No variation	No variation	IC
Intra-rater: global
Mucous cell number	0.89	0.87	Almost perfect
Lamellar clubbing	0.93	0.92	Almost perfect
Lamellar hypertrophy/hyperplasia	0.91	0.90	Almost perfect
Hypercellular interlamellar area	0.80	0.74	Substantial
Aneurysms	0.95	0.94	Almost perfect
Lamellar inflammation	0.82	0.76	Substantial
Degeneration	0.87	0.84	Substantial
Inter-rater
Mucous cell number	0.74	0.70	Moderate
Lamellar clubbing	0.86	0.85	Substantial
Lamellar hypertrophy/hyperplasia	0.83	0.81	Substantial
Hypercellular interlamellar area	0.55	0.40	Poor
Aneurysms	0.90	0.89	Almost perfect
Lamellar inflammation	0.71	0.63	Moderate
Degeneration	0.61	0.54	Poor

GAC = Gwet agreement coefficient; PA = percent agreement.

* Calculated for GAC using Gwet probabilistic benchmarking method, taking the uncertainty of the coefficient estimate into account. The numbers may, therefore, deviate from the actual coefficient intervals in the benchmark scale from the kappaetc-framework in Stata.^13,15,19 The agreement coefficients and their corresponding strength of agreement in this scale are as follows: 0.00 = poor; >0.00–<0.20 = slight; ≥0.20–<0.40 = fair; ≥0.40–<0.60 = moderate; ≥0.60–<0.80 = substantial; ≥0.80–1.00 = almost perfect.

† For degeneration, rater 3 scored all gills as 0, so the intra-rater agreement for this rater was incalculable (IC).

Two raters noted suboptimal quality in 19 of 45 and 27 of 44 slides; the third rater made no comments on slide quality. Of the 135 slides evaluated, 17 had <10 filaments, as these did not fulfill the suitability criteria. Autolytic changes were essentially absent. Examples of different slide quality issues are shown in Suppl. Fig. 1.

Discussion

Understanding gill reactions in farmed Atlantic salmon in different production environments is needed,³⁴ and a scoring system for mild-to-moderate lesions will increase our perception of the transition from normal variation toward clinical manifestation of gill disease. Such a system could reveal initial inflicting stimuli, enabling early intervention. We found that scoring of mild-to-moderate histopathologic gill changes was consistent within the scorer (intra-rater), with almost-perfect agreement for most gill changes over 3 rounds of scoring. Outcomes were slightly less consistent for inter-rater scores but still moderate to almost-perfect agreement for 5 of 7 gill changes assessed by raters with variable experience in histologic evaluation. Two categories of gill changes had “poor” inter-rater agreement. It is expected that intra-rater is higher than inter-rater agreement.⁶ Re-scaling and better definition of the gill changes with poor inter-rater agreement are justified for future studies, plus optimizing tissue orientation, reducing out-of-focus areas, and ensuring good color contrast.

Degeneration and/or necrosis of epithelial cells may occur in normal gill tissue due to a high turnover rate,³³ and cellular debris may resemble necrosis in epithelial cells. This may have caused variation for the variable “degeneration”, in which differences occurred between scorers. Further, sparse lamellar inflammation can be difficult to detect,³⁴ and interlamellar hypercellularity and lamellar hyperplasia and/or inflammation can overlap, affecting classification of scores, and cause differences between scorers. Interlamellar hypercellularity can also be challenging to assess on gills with suboptimal tissue orientation or when the lamellae are short,³⁹ all factors contributing to variation.

We anonymized the slides and thus recognition from one round to the next was considered unlikely, but pilot rounds and the first round of scoring likely primed the raters for subsequent rounds. This might unconsciously have impacted the raters, rendering them less focused on details, especially for changes of low frequency. Some of the variations observed between raters could reside in different numbers of filaments being included or which filaments were scored. Including 10 filaments per gill was an attempt to mitigate some of these effects on the aggregate scores, as lesion distribution is often diffuse in gill tissue,³⁴ but variation does exist within different parts of the gills. This could have had an impact on the variation observed.

We used a scale of 3–5 score levels, which is in line with older studies^2,22 and a recent study of gill lesions in farmed Atlantic salmon in which the authors recommend a scale of 0–5.¹² Our scoring also aligns with statistical evaluations where it has been shown that fewer than 3 grades reduce sensitivity,¹⁰ and a higher number of animals would have been required to detect real biological differences in our material. In contrast, a higher number (>5) of score categories has a negative impact on repeatability or reproducibility,^10,31 simply because it will be more difficult to distinguish between categories and thus assign the scores. This would have reduced repeatability in our study.

We found a low prevalence of target changes in the gills in our study, and the difference between raters was linked to the scores “0” and “1,” particularly for those target changes that included an inherent enumeration of changes present (without actual counting done by the raters). So, to what extent this is a misdiagnosis for either score (i.e., absence [0] or presence [1] of changes), or is related to the “sensitivity” of the rater to changes, is unknown. We believe that a low prevalence of changes will impact how the changes are interpreted. However, studies have shown that errors for normal images (false-positives) are higher than for abnormal images (false-negative errors), irrespective of experience.⁴ Ultimately, the extent to which differences in cue usage explain differences observed herein is unknown. Several factors will impact the extent to which minor changes are recorded or missed. Studies have shown that missing real changes increased when the prevalence of the target is low, categorized as a low-prevalence effect.⁴⁰ We found that a low prevalence of targets demanded sustained attention because raters had to search each image thoroughly for target features in up to 450 filaments over 3 rounds. Sustained attention consumes cognitive resources, which can result in disengagement and an increased likelihood of observational error.²⁸ The general understanding is that image perception, successful detection of target changes, interpretation of targets, and diagnoses are based on finely tuned cognitive processes,¹⁴ including visual search, pattern recognition, and various interpretation strategies,¹⁸ all summarized as cue-based associations.³ Further, several studies indicate that cue usage is vital in examining tissue and correctly identifying tissue features within the normal range versus pathologic processes. Including an assessment of cue usage among raters in future studies would add to understanding the observed differences. To what extent this explains the difference between raters is not known, and the design of the study did not allow us to record this.

There is no consensus for choice of statistical method for assessing inter-rater agreement,⁷ and it depends, in general, on the type of data (categorical, ordinal, or continuous) and the number of raters involved in the study.¹¹ The GAC offers a good method for assessing inter-rater agreement, particularly when the data are ordinal, and is less susceptible to the prevalence paradox that typically occurs when most of the ratings fall into a single category, as was the case in our study with a high number of 0-scores. Such a skewed distribution leads to a low kappa value when using, for example, the Cohen or Fleiss kappa,⁷ even when there is substantial agreement; the GAC mitigates this issue by providing a more stable estimate of agreement that is less affected by the marginal distribution of ratings, particularly for rare events.¹³ Finally, GAC is suitable for histologic studies involving >2 raters.
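The prevalence paradox can be illustrated with a small numerical sketch comparing the Cohen kappa and the Gwet AC1 for 2 raters. The ratings below are invented to mimic a distribution dominated by 0-scores; they are not data from this study, and the function name is hypothetical.

```python
def two_rater_coefficients(pairs, categories=(0, 1)):
    """pairs: list of (rater_a_score, rater_b_score). Returns (PA, kappa, AC1).

    Note: kappa is undefined (division by zero) if both raters
    use a single, identical category for every subject.
    """
    n = len(pairs)
    q = len(categories)
    pa = sum(a == b for a, b in pairs) / n

    # Cohen kappa: chance agreement from each rater's own marginal distribution.
    pe_kappa = sum(
        (sum(a == k for a, _ in pairs) / n) * (sum(b == k for _, b in pairs) / n)
        for k in categories
    )

    # Gwet AC1: chance agreement from the average prevalence of each category.
    pe_ac1 = 0.0
    for k in categories:
        pi_k = sum((a == k) + (b == k) for a, b in pairs) / (2 * n)
        pe_ac1 += pi_k * (1 - pi_k)
    pe_ac1 /= q - 1

    kappa = (pa - pe_kappa) / (1 - pe_kappa)
    ac1 = (pa - pe_ac1) / (1 - pe_ac1)
    return pa, kappa, ac1


# 100 invented gills: 90 scored 0 by both raters, 10 with disagreement.
pairs = [(0, 0)] * 90 + [(0, 1)] * 5 + [(1, 0)] * 5
pa, kappa, ac1 = two_rater_coefficients(pairs)
```

With these invented ratings, PA is 0.90, yet kappa is slightly negative (about −0.05) because its chance-agreement term is 0.905, whereas AC1 is about 0.89 because its chance-agreement term is only 0.095.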

Supplemental Material

Supplemental material, sj-pdf-1-vdi-10.1177_10406387241310900 for Assessment of a semiquantitative scoring system for mild-to-moderate gill lesions in Atlantic salmon reared in recirculating aquaculture systems in Norway by Thomas Amlie, Alf Dalum, Marit Stormoen and Øystein Evensen in Journal of Veterinary Diagnostic Investigation

Footnotes

Acknowledgements

We thank Asgeir Østvik for valuable help with sampling and Barbo Klakegg for valuable help administering the project. We are also indebted to Drs. Kilem Gwet and Daniel Klein for their valuable input on the statistical analysis and interpretation of data.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Thomas Amlie is an employee of Åkerblå.

Funding

The Research Council of Norway funded this study (project 298906, awarded to Åkerblå).

ORCID iDs

Thomas Amlie

Marit Stormoen

Øystein Evensen

Supplemental material

Supplemental material for this article is available online.

References

1. Banerjee, et al. Beyond kappa: a review of interrater agreement measures. Can J Statistics 1999;27:3–23.

2. Bernet, et al. Histopathology in fish: proposal for a protocol to assess aquatic pollution. J Fish Dis 1999;22:25–34.

3. Carrigan, et al. The role of cue-based strategies in skilled diagnosis among pathologists. Hum Factors 2022;64:1154–1167.

4. Carrigan, et al. Cue utilisation reduces the impact of response bias in histopathology. Appl Ergon 2022;98:103590.

5. Crissman, et al. Best practices guideline: toxicologic histopathology. Toxicol Pathol 2004;32:126–131.

6. Cross SS. Grading and scoring in histopathology. Histopathology 1998;33:99–106.

7. Feng GC. Factors affecting intercoder reliability: a Monte Carlo experiment. Quality Quantity 2013;47:2959–2982.

8. Flores-Lopes, Thomaz AT. Histopathologic alterations observed in fish gills as a tool in environmental monitoring. Braz J Biol 2011;71:179–188.

9. French METAVIR Cooperative Study Group. Intraobserver and interobserver variations in liver biopsy interpretation in patients with chronic hepatitis C. Hepatology 1994;20:15–20.

10. Gibson-Corley, et al. Principles for valid histopathologic scoring in research. Vet Pathol 2013;50:1007–1015.

11. Gisev, et al. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Res Social Adm Pharm 2013;9:330–338.

12. Gjessing, et al. Histopathological investigation of complex gill disease in sea farmed Atlantic salmon. PLoS One 2019;14:e0222926.

13. Gwet. Handbook of Inter-Rater Reliability. 5th ed. Vol. 1. Analysis of Categorical Ratings. Publisher Services, 2021.

14. Holdford. Content analysis methods for conducting research in social and administrative pharmacy. Res Social Adm Pharm 2008;4:173–181.

15. Klein. Implementing a general framework for assessing interrater agreement in Stata. Stata J 2018;18:871–901.

16. Kottner, et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. J Clin Epidemiol 2011;64:96–106.

17. Król, et al. Integration of transcriptome, gross morphology and histopathology in the gill of sea farmed Atlantic salmon (Salmo salar): lessons from multi-site sampling. Front Genet 2020;11:610.

18. Krupinski, et al. Characterizing the development of visual search expertise in pathology residents viewing whole slide images. Human Pathol 2013;44:357–364.

19. Landis, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.

20. Mallatt JM. Fish gill structural changes induced by toxicants and other irritants: a statistical review. Can J Fish Aquat Sci 1985;42:630–648.

21. Meyerholz, Beck AP. Principles and approaches for reproducible scoring of tissue stains in research. Lab Invest 2018;98:844–855.

22. Mitchell, et al. Development of a novel histopathological gill scoring protocol for assessment of gill health during a longitudinal study in marine-farmed Atlantic salmon (Salmo salar). Aquat Int 2012;20:813–825.

23. Mitchell, et al. Sampling artefacts in gill histology of freshwater Atlantic salmon (Salmo salar). Bull J Eur Assoc Fish Pathol 2023;43:1–11.

24. Østevik, et al. A cohort study of gill infections, gill pathology and gill-related mortality in sea-farmed Atlantic salmon (Salmo salar L.): a descriptive analysis. J Fish Dis 2022;45:1301–1321.

25. Østevik, et al. Assessment of acute effects of in situ net cleaning on gill health of farmed Atlantic salmon (Salmo salar L). Aquaculture 2021;545:737203.

26. Poleksić, et al. Fish gills as a monitor of sublethal and chronic effects of pollution. In: Mueller, Lloyd, eds. Sublethal and Chronic Effects of Pollutants on Freshwater Fish. Fishing News Books, 1994:339–352.

27. Rašković, et al. Histopathological indicators: a useful fish health monitoring tool in common carp (Cyprinus carpio Linnaeus, 1758) culture. Open Life Sci 2013;8:975–985.

28. Reiner, Krupinski. The insidious problem of fatigue in medical imaging practice. J Digit Imaging 2012;25:3–6.

29. Sanchez, et al. Morphometric and histochemical assessment of the branchial tissue response of rainbow trout, Oncorhynchus mykiss (Walbaum), associated with chloramine-T treatment. J Fish Dis 1997;20:375–381.

30. Schafer, et al. Use of severity grades to characterize histopathologic changes. Toxicol Pathol 2018;46:256–265.

31. Shackelford, et al. Qualitative and quantitative analysis of nonneoplastic lesions in toxicology studies. Toxicol Pathol 2002;30:93–96.

32. Speare, et al. Branchial lesions associated with intermittent formalin bath treatment of Atlantic salmon, Salmo salar L., and rainbow trout, Oncorhynchus mykiss (Walbaum). J Fish Dis 1997;20:27–33.

33. Speare, Ferguson HW. Fixation artifacts in rainbow trout (Salmo gairdneri) gills: a morphometric evaluation. Can J Fish Aquat Sci 1989;46:780–785.

34. Speare, et al. Gills and pseudobranchs. In: Ferguson, ed. Systematic Pathology of Fish: A Text and Atlas of Normal Tissues in Teleosts and Their Responses. Scotian Press, 2006:24–63.

35. Speilberg, et al. Evaluation of five different immersion fixatives for light microscopic studies of liver tissue in Atlantic salmon Salmo salar. Dis Aquat Org 1993;17:47–55.

36. Strzyżewska-Worotyńska, et al. Gills as morphological biomarkers in extensive and intensive rainbow trout (Oncorhynchus mykiss, Walbaum 1792) production technologies. Environ Monit Assess 2017;189:611.

37. Watson, Petrie. Method agreement analysis: a review of correct methodology. Theriogenology 2010;73:1167–1179.

38. Wolf JC. Comparing apples and oranges and pears and kumquats: the misuse of index systems for processing histopathology data in fish toxicological bioassays. Environ Toxicol Chem 2018;37:1688–1695.

39. Wolf, et al. Nonlesions, misdiagnoses, missed diagnoses, and other interpretive challenges in fish histopathology studies: a guide for investigators, authors, reviewers, and readers. Toxicol Pathol 2015;43:297–325.

40. Wolfe, et al. Cognitive psychology: rare items often missed in visual searches. Nature 2005;435:439–440.
