Predicting fungal infection sensitivity of sepals in harvested tomatoes using imaging spectroscopy and partial least squares discriminant analysis

Abstract

Tomatoes (Solanum lycopersicum L.) are a widely grown and globally traded vegetable, essential for both local consumption and international trade. However, approximately 30% of harvested tomato yields are lost due to fungal decay during postharvest handling. Timely disease identification is crucial to prevent such losses, but certain tomato varieties exhibit higher susceptibility to fungal infections than others. Additionally, there are variations in susceptibility among individual sepals, with unknown underlying causes. Traditional methods for assessing fungal presence in plants have limitations, such as sample destruction and a focus on symptom detection rather than evaluating susceptibility to fungal infection. Hence, there is a need for an accurate, non-destructive method capable of predicting susceptibility to fungal infection. The use of hyperspectral imaging (HSI) with chemometrics presents a pioneering approach to address this need. In this study, three tomato cultivars (‘Brioso,’ ‘Cappricia,’ and ‘Provine’) were studied. Hyperspectral images were captured on day-1 of harvest, followed by controlled fungal growth conditions. Ground truth assessments were conducted by three experts on day-3 and day-4, averaging severity scores assigned per sepal. The methodology involved extracting spectra from HSI images and calibrating and validating models using partial least squares discriminant analysis (PLSDA), aiming to optimize model parameters for accurate predictions. The models were categorized into those developed using data from a single variety (intravariety) and those utilizing data from multiple varieties combined (global models). The best-performing intravariety model was established using the Cappricia variety, achieving a balanced accuracy of 0.84. Conversely, a global model combining Cappricia and Provine varieties achieved a balanced accuracy of 0.70.
Overall, the results suggest that distinguishing between more and less susceptible sepals is feasible under controlled conditions.

Keywords

Chemometrics, imaging spectroscopy, feature selection, cross validation, modeling, tomatoes, post-harvest, band assignment

Introduction

Tomato (Solanum lycopersicum L.) is a “ubiquitous vegetable.” Tomatoes are produced globally, either for domestic consumption or as a commodity for international export. The nutritional composition of this fruit includes carbohydrates, lipids and proteins. In addition, it contains vitamins, minerals, and carotenes in smaller proportions.¹

Tomato quality is divided into different aspects: commercial, organoleptic and nutritional.² Market quality grade considers appearance (e.g. color, form, size), firmness and shelf life, whereas health benefits rely on the nutritional value as well as on the absence of pathogenic hazards or contaminants.^2–5

The portion of tomatoes that go to waste after the harvesting stage can reach 42% worldwide.⁶ Around 30% of the harvested tomato produce may be lost during postharvest handling, primarily because of microbial decay caused by fungi such as Rhizopus stolonifer, Alternaria alternata, and Botrytis cinerea.⁷ Pathogenic fungi can infect and spread to many different parts of a tomato plant, including the stem, calyx and skin of the fruit.⁸ In some countries tomatoes are sold with the calyx attached. Fresh-looking green parts of a tomato (calyx and vine) signal that the tomatoes are fresh, whereas older tomatoes show dehydration symptoms in the green parts. The calyx is also susceptible to infection by fungal spores. These spores may already be present on the tomato during cultivation. After harvest, under humid and poorly ventilated storage and transport conditions, these spores may germinate and grow further into visible mould on the calyx.⁹ This negatively affects the value of the fruit and may lead to extra food loss and waste.^9,10

The timely identification of disease has the potential to avert losses since prompt actions can be implemented to mitigate further damages (e.g. adapt packing strategies).⁹ Generally, the strategy employed in the industry to reduce pathogen attacks is the use of pesticides. However, these products can damage the food and diminish its nutritional value.² Whenever possible, it is preferable to protect the harvested fruits by using methods that do not introduce any additional chemicals or contaminants and do not harm the food in any way.

A possible means to assess the predisposition to microscopic fungal contamination is by tracking the growing and handling conditions of tomato produce within the supply chain. This correlation may be beneficial in the detection of probable origins of fungal contamination based on historical data. However, tracking individual tomatoes or even batches from growth to harvest and later post-harvest handling and logistics is highly difficult.

Some tomatoes are more susceptible to infection and growth of spores while others are not.^9,11 Moreover, susceptibility of individual sepals also differs. It is not known, yet, what is causing this difference. This knowledge would be useful to predict the susceptibility to this infection and growth. A more specific method is necessary which allows each calyx and sepal to be evaluated individually.

Some of the analytical methods traditionally used to evaluate the presence of fungus in plants are summarized here. Firstly, new DNA-based technology has been developed to support and replace morphology-based detections of phytopathogenic fungi. Jiménez-Fernández et al. developed a real-time qPCR assay for the calculation of F. oxysporum DNA in plant tissues and soil.¹² Moreover, tomato samples can be tested for mycotoxins, as a high level of these compounds is caused by fungal infection.¹³ Some detection solutions are, for instance, chromatography coupled with detector methods, electrochemical biosensors technology and immunological techniques such as enzyme-linked immunosorbent assay (ELISA), dipsticks and flow-through membranes.^14–17 Furthermore, gas chromatography-mass spectrometry (GC-MS) or electronic nose (e-nose) can be used to measure the shift of the composition and concentration of volatile organic compounds (VOCs) emitted by diseased tomatoes.¹³

Although these analytical methods are specific and accurate, they have several disadvantages. First, most of them destroy the sample during measurements. Furthermore, these methods detect disease symptoms and not the susceptibility to fungal infection and growth. That is, they evaluate what is happening to the fruit exactly at the moment of the measurement. In the case of visible symptoms of the fungus, the future is already known (this state will continue and worsen in the future); however, if the fruits are not yet infected or the fungi have not germinated, these methods cannot predict what will happen to the fruits in the future.

Therefore, there is a need for a reliable, non-destructive and specific method to predict susceptibility to fungal infection in a rapid manner. This would provide additional support for quality inspectors and post-harvest management.

Infrared spectroscopy can provide a possible solution to this problem. Skolik et al. have studied disease progression in whole tomatoes using Attenuated Total Reflectance coupled with Fourier-Transform Infrared Spectroscopy (ATR-FTIR) and have highlighted that plant-pathogen interaction can be identified through alteration in the spectral fingerprint.¹⁸

Moreover, imaging spectroscopy (or hyperspectral imaging, HSI) can be even more useful because spectral information can be captured across the complete product at pixel level. Wang et al. accurately classified 97.5% of healthy fruit and 100% of decayed fruit using spectral imaging.¹⁹

Drawing from the work of Brdar et al., who explored ensemble machine learning methods for early fungal infection detection in one tomato cultivar (Brioso), this study aims to extend these findings to multiple cultivars and investigate traditional chemometric approaches alongside ensemble machine learning techniques. Unique to this research is the application of HSI combined with chemometrics to predict susceptibility to fungal infection in recently harvested tomatoes. The objective is to bridge this gap in the literature by employing a methodology that involves spectra extraction from HSI images and model calibration and validation using partial least squares discriminant analysis (PLSDA), with a focus on optimizing model parameters for improved predictive performance.^11,20,21

Materials and methods

Materials

Three tomato cultivars, ‘Brioso,’ ‘Cappricia,’ and ‘Provine,’ were used in this study. Fresh samples were harvested from different greenhouses on the 9^th and 10^th of May 2022. On the 10^th May 2022, the tomatoes on the vine arrived at the Phenomea Laboratory in Wageningen, Netherlands. Tomatoes without visible fungal infection were cut from the vine (2 tomatoes from the middle of a vine, 32 samples from each cultivar). The wounds at the cut end were greased with stopcock grease to prevent dehydration at the junction.

Methods

Data collection

Samples were imaged in two separate groups of equal size. Hyperspectral images were recorded on day one (10^th May) using a Specim FX17 NIR linescan camera with a spectral range (937.33 nm-1718 nm).¹¹ Subsequently, tomatoes were stored on trays (7 mm blue Forex plate (35 × 55 cm²) with holes of 2.5 cm diameter) in controlled conditions encouraging fungal growth (20°C, in a closed sanitized box reaching 100% Relative Humidity, in a room at 60% RH, lights on during 7:00–19:00 h, 15 μmol·s⁻¹·m⁻²).

Ground truth observations were made per sepal by three experts on days three and four (12^th and 13^th May), and comprised severity scores from zero (no fungus) to four (severe infection). Ratings of the two days and three experts were averaged.

Spectra extraction from hyperspectral images

Hyperspectral images were converted to pseudo-color images, which were generated after manually choosing three bands which produced visibly good contrast between sepals and the background. These images were manually annotated with a separate polygon indicating the boundary of individual sepals (Figure 1). These polygons were converted to pixel masks, which indicated whether a pixel was included in the set of pixels belonging to a particular sepal. At sepal edges, because of blurring effects, there was some level of uncertainty with respect to which pixels to include. For this annotation, we favored keeping pixels only if they were substantially sepal containing. The spectrum of each pixel was collected and then used for further analysis.

Figure 1.

Spectra extraction from hyperspectral images. Visualization of the procedure carried out in each sepal.

The Darwin annotation tool from V7 labs was used to perform annotations.²² Annotations were used to extract sepal pixel spectra using a custom Python image processing pipeline employing the Numpy and Pandas libraries. The extracted sepal pixel spectra were made available for the R pipeline in spreadsheet format.²³
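The conversion from an annotated polygon to a pixel mask, and from the mask to a set of pixel spectra, can be illustrated with a minimal NumPy sketch. It is not the custom pipeline used in this work: the function names, the even-odd (ray casting) rule, the pixel-center convention and the toy image size are all assumptions made for illustration.

```python
import numpy as np

def polygon_to_mask(vertices, height, width):
    """Boolean mask of pixels whose centers lie inside a polygon (even-odd rule)."""
    rows, cols = np.mgrid[0:height, 0:width]
    x, y = cols + 0.5, rows + 0.5  # pixel centers (assumed convention)
    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Does a horizontal ray starting at the pixel center cross this edge?
        crosses = (y0 > y) != (y1 > y)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at_y = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
        # Toggle the state of every pixel located to the left of the crossing
        inside ^= crosses & (x < x_at_y)
    return inside

def extract_sepal_spectra(cube, vertices):
    """Return an (n_pixels, n_bands) array with the spectra inside one polygon."""
    mask = polygon_to_mask(vertices, cube.shape[0], cube.shape[1])
    return cube[mask]

# Toy hypercube: 6 x 6 pixels and 112 bands of random values
cube = np.random.default_rng(0).random((6, 6, 112))
spectra = extract_sepal_spectra(cube, [(1, 1), (4, 1), (4, 4), (1, 4)])
print(spectra.shape)  # (9, 112)
```

Keeping only pixels whose centers fall inside the polygon is one simple way to favor pixels that are substantially sepal containing; a stricter rule, such as eroding the mask by one pixel, would be an equally plausible choice.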

Data analysis

A chemometric analysis was conducted with the aim of calibrating and validating models to predict the susceptibility to fungal infection in tomatoes according to their degree of disease as observed by specialists after 4 days of germination. This analysis was done using R Statistical Software (v4.3.0; R Core Team 2021) with caret, rchemo and prospectr packages, and involved the following steps:^24–27

Data visualization

Firstly, spectra were plotted to have a first appreciation of the shape of the data, observe their clarity, signal-to-noise ratio, presence of obvious outliers, baseline, etc.

Data exploration and outlier removal

Burger found that bad pixels exhibit significantly different spectra compared to their neighbors. These abnormal pixels were classified into “dead pixels,” which do not respond to light, “hot pixels,” characterized by high dark current, and “stuck pixels,” which maintain an almost constant intermediate value. Liu further explained that hot pixels have a higher dark current than normal pixels, which experience a moderate dark current increase after irradiation. Moreover, some pixels are always noisy, while others are noisy only sporadically; some may show a “non-linear response to light intensity” while some others behave randomly.²⁸ In any case, these abnormal pixels exhibit distinct behavior compared to the rest and thus should be removed.

In this study, all pixels of a sepal were subjected to exploratory analysis using principal component analysis (PCA) at both the sepal and variety levels. During this process, any pixels presenting anomalies, such as being out of focus, were detected and removed before averaging all pixels of a sepal. This procedure served as a robust quality control measure, aimed at identifying and eliminating spectra that are out of focus or of poor quality.

In this step, PCA was applied over all the pixels for a given sepal. To detect outlier pixels, Mahalanobis distances were computed between the individual projection of each score value onto the model and the center of the model. The identification of outliers was determined based on a specified confidence level (0.95), indicating the probability that a data point lay within a certain range. The cutoff for Mahalanobis distances was employed as a threshold, beyond which data samples were classified as outliers. The confidence level played a crucial role in controlling the sensitivity of the outlier detection, with higher confidence levels leading to more stringent criteria for identifying outliers.
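As an illustration of this step, the following NumPy sketch flags outlier pixels within one sepal and averages the remaining ones. Several details are assumptions rather than the settings used in this study: the number of retained components (two), the chi-square form of the cutoff, and the function name `pca_outlier_mask`.

```python
import numpy as np

def pca_outlier_mask(spectra, confidence=0.95):
    """Flag spectra whose 2-PC score lies beyond a chi-square cutoff."""
    centered = spectra - spectra.mean(axis=0)
    # PCA via SVD; keep the scores on the first two components (assumption)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :2] * s[:2]
    # PCA scores are uncorrelated, so the squared Mahalanobis distance is
    # the sum of the squared standardized scores
    d2 = np.sum(scores**2 / scores.var(axis=0, ddof=1), axis=1)
    # The chi-square quantile with 2 degrees of freedom has a closed form
    cutoff = -2.0 * np.log(1.0 - confidence)
    return d2 > cutoff

# Toy sepal: 100 similar pixels plus one strongly deviating pixel
rng = np.random.default_rng(0)
pixels = 1.0 + 0.01 * rng.normal(size=(100, 20))
pixels = np.vstack([pixels, np.full(20, 2.0)])

mask = pca_outlier_mask(pixels)
sepal_mean = pixels[~mask].mean(axis=0)  # average spectrum of the kept pixels
```

With more than two components the closed-form cutoff no longer applies, and a chi-square quantile function from a statistics library would be needed instead.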

Once the outliers were removed, the remaining pixels were averaged, and the datasets were finally reassembled according to their labels. Consequently, data exploration was carried out again by PCA in order to remove outliers at variety level (in Provine, Brioso and Cappricia datasets). Score plots were created, outliers were detected visually and removed from the dataset.

Pretreatments on raw spectra

Different models were calibrated and validated using various pretreated forms of the original spectra, and their performances were compared. These methods include: Detrend grades 1 and 2; Savitzky–Golay first and second derivatives, second polynomial degree and either 9-, 11-, 15-, or 17-point smoothing windows; standard normal variate (SNV); and combinations of these.^29–31 Only the best results are presented.
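For readers unfamiliar with these pretreatments, a minimal NumPy sketch of SNV and of a Savitzky–Golay derivative is given here. It is an illustrative re-implementation, not the prospectr code used in the analysis; unit band spacing, the use of the sample standard deviation and the dropping of edge points are assumptions.

```python
from math import factorial

import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def savgol_derivative(spectra, window=17, polyorder=2, deriv=2):
    """Savitzky-Golay derivative from a local polynomial least-squares fit."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    # Row `deriv` of the pseudo-inverse gives the fitted coefficient of
    # x**deriv; multiplying by deriv! turns it into the derivative at x = 0
    design = np.vander(offsets, polyorder + 1, increasing=True)
    coeffs = np.linalg.pinv(design)[deriv] * factorial(deriv)
    # "valid" filtering drops `half` points at each end of every spectrum
    return np.array([np.correlate(row, coeffs, mode="valid") for row in spectra])

# Two toy spectra that differ only by an additive and a multiplicative effect
x = np.arange(50.0)
spectra = np.vstack([x**2, 3 * x**2 + 5])

z = snv(spectra)                 # both rows become identical after SNV
d2 = savgol_derivative(spectra)  # second derivative of the raw spectra
```

In this toy case SNV removes both effects, whereas the second derivative removes only the additive offset (the rows of `d2` equal 2 and 6), which is one reason the two pretreatments are often combined.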

Data split

Three binary-class scenarios were derived from the visual expert scoring described in the data collection section:

- Scenario 1: Score of 0 was considered healthy, and any other value was considered infected.

- Scenario 2: A score of 1 or less was considered healthy and the rest infected.

- Scenario 3: Scores from two consecutive days were averaged, and samples were considered healthy when the score was 0.5 or lower, otherwise the sepal was considered infected.
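The three labelling rules can be written as a threshold on the severity score. The sketch below assumes that the score passed in has already been aggregated as each scenario requires; the names `THRESHOLDS` and `label_sepal` are hypothetical.

```python
# Highest score still considered healthy in each scenario
THRESHOLDS = {1: 0.0, 2: 1.0, 3: 0.5}

def label_sepal(score, scenario):
    """Return 'healthy' or 'infected' for one sepal score under a scenario."""
    return "healthy" if score <= THRESHOLDS[scenario] else "infected"

# The same score can fall in different classes depending on the scenario
for scenario in (1, 2, 3):
    print(scenario, label_sepal(0.5, scenario))
```

For a score of 0.5 the sepal is infected under Scenario 1 and healthy under Scenarios 2 and 3, which shows how the choice of scenario shifts the class balance.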

Stratified sampling was carried out in the following way. Each dataset was divided into calibration (70%) and validation (30%) sets, in a representative way for each class, randomly. This means that the 70/30 ratio was respected in both classes.
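A stratified 70/30 split can be sketched in a few lines of standard-library Python. The function name, the fixed seed and the rounding of the per-class training size are assumptions; the example uses the Cappricia class sizes under Scenario 2 (77 healthy, 86 diseased).

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices so that each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))  # rounding is an assumption
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)

labels = ["healthy"] * 77 + ["diseased"] * 86
train, test = stratified_split(labels)
print(len(train), len(test))  # 114 49
```

With this rounding rule the split yields 54 healthy and 60 diseased training samples, and 23 healthy and 26 diseased test samples.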

Feature selection

An iterative process was used to select a sparse subset of important variables from the training set, using the CovSel algorithm.^32,33 For each pretreatment, labelling scenario and cultivar, the top 5 to top 39 important variables (IVs; range chosen arbitrarily) were selected iteratively. The selected variables were then used as input for the classification model and saved in a matrix called “CovSelTrain.” The same IVs were selected from the test set and saved in “CovSelTest.”
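The principle of covariance-based selection with deflation can be sketched as follows. This is a simplified NumPy illustration of the idea behind CovSel, not the rchemo implementation used here; the function name and the synthetic data are assumptions.

```python
import numpy as np

def covsel(X, Y, n_select):
    """Greedy covariance-based variable selection with deflation (sketch)."""
    Xr = X - X.mean(axis=0)
    Yr = Y - Y.mean(axis=0)
    selected = []
    for _ in range(n_select):
        # Squared covariance of every variable with the deflated response
        score = np.sum((Xr.T @ Yr) ** 2, axis=1)
        if selected:
            score[selected] = -np.inf  # never pick the same variable twice
        j = int(np.argmax(score))
        selected.append(j)
        # Project the selected variable out of both X and Y
        z = Xr[:, j]
        zz = z @ z
        Xr = Xr - np.outer(z, z @ Xr) / zz
        Yr = Yr - np.outer(z, z @ Yr) / zz
    return selected

# Synthetic example: the response depends on variables 3 and 7 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = (X[:, 3] + 0.5 * X[:, 7] + 0.05 * rng.normal(size=200)).reshape(-1, 1)
picked = covsel(X, Y, 2)
print(picked)
```

The column indices returned on the training set would then be applied unchanged to the test set, in the same way that “CovSelTest” is built from the variables chosen on the training data.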

Calibration and validation of PLSDA models

The Training set was split again, into calibration (70%) and validation (30%) sets, randomly. Different models with different number of latent variables (LVs) were calibrated in the calibration set and tested in the validation set. The number of LVs was selected according to the model that showed the lowest prediction error in the validation set.

All models calibrated in the calibration set must be tested (validated) later, in the validation set. A model might yield exceptional and highly accurate outcomes when applied to the calibration set. However, if overfitting occurs, the same model may produce poor results when evaluated using the validation set. In other words, an overfitted model fits perfectly well the calibration set, but cannot be generalized for efficient use in new, unknown samples.

To avoid this, the optimal number of latent variables must be chosen, according to the error observed in the validation set, when the model is tested on independent samples, which were not used during its calibration.

The prediction error in the calibration set can always decrease, carrying a risk of overfitting. Instead, the prediction error in the validation set decreases up to an optimal number of LVs, after which it increases. At this inflection point the optimal number of LVs that should be chosen for the model can be known. If a greater number of LVs is chosen, the model will have a risk of overfitting.

In other words, a prediction error value is obtained in the validation set for each number of latent variables. All the prediction error values corresponding to the different numbers of LVs must be compared, and the number of LVs that yields the smallest prediction error in the validation set is chosen.
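The selection of the number of LVs can be sketched with a small PLS1 (NIPALS) model fitted on a 0/1 class code and thresholded at 0.5. This is an illustrative assumption about how a PLSDA decision can be made, not the R code used in this study; the function names, the threshold and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit PLS1 by NIPALS; return the centering terms and the coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)
        t = Xr @ w                      # scores of the current latent variable
        tt = t @ t
        p = Xr.T @ t / tt               # X loadings
        c = (yr @ t) / tt               # y loading
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, beta

def plsda_error(model, X, y):
    """Misclassification rate of a fitted model on the samples (X, y)."""
    x_mean, y_mean, beta = model
    predicted = ((X - x_mean) @ beta + y_mean) > 0.5
    return float(np.mean(predicted != (y > 0.5)))

# Synthetic two-class data, split into calibration and validation sets
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=120) > 0).astype(float)
X_cal, y_cal, X_val, y_val = X[:84], y[:84], X[84:], y[84:]

# Validation error for each candidate number of LVs; keep the smallest
errors = {k: plsda_error(pls1_fit(X_cal, y_cal, k), X_val, y_val) for k in range(1, 9)}
best_lv = min(errors, key=errors.get)
```

When several numbers of LVs give the same validation error, `min` returns the first one, so the most parsimonious model is kept.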

The PLSDA model, already optimized for the number of latent variables, was tested in “CovSelTest,” and classification parameters were obtained.

Evaluation of results

Ten different parameters were used to evaluate the results: sensitivity, specificity, precision, accuracy and balanced accuracy (BA), geometric mean, F-measure, Youden index, positive likelihood ratio, and negative likelihood ratio. These were explained in detail in previous publications and shown in Table 1.^34,35 The final evaluation considered all of them simultaneously, because each one of them took into account different characteristics of the general discrimination effectiveness.

Table 1.

Parameters commonly used to evaluate classification models.

Measure	Formula
Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$
Misclassification rate (1-accuracy)	$\frac{FP + FN}{TP + TN + FP + FN}$
Sensitivity (or recall)	$\frac{TP}{TP + FN}$
Specificity	$\frac{TN}{TN + FP}$
Precision	$\frac{TP}{TP + FP}$
Balanced accuracy (BA)	$0.5 \cdot (\text{sensitivity} + \text{specificity})$
Geometric mean	$\sqrt{\text{sensitivity} \cdot \text{specificity}}$
Positive likelihood ratio	$\frac{\text{sensitivity}}{1 - \text{specificity}}$
Negative likelihood ratio	$\frac{1 - \text{sensitivity}}{\text{specificity}}$
F-measure	$\frac{2 \cdot \text{sensitivity} \cdot \text{precision}}{\text{sensitivity} + \text{precision}}$
Youden index	$\text{sensitivity} - (1 - \text{specificity})$

TP: true positives; TN: true negatives; FN: false negatives; FP: false positives.

Source: Akosa, 2017.
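As a worked illustration, the ten evaluation parameters can be computed from the four cells of a confusion matrix. The counts in the example are invented, the function name is hypothetical, and balanced accuracy is taken as the mean of sensitivity and specificity.

```python
from math import sqrt

def classification_metrics(tp, tn, fp, fn):
    """Compute the ten evaluation parameters from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": 0.5 * (sensitivity + specificity),
        "geometric_mean": sqrt(sensitivity * specificity),
        "f_measure": 2 * sensitivity * precision / (sensitivity + precision),
        "youden_index": sensitivity - (1 - specificity),
        "positive_lr": sensitivity / (1 - specificity),
        "negative_lr": (1 - sensitivity) / specificity,
    }

# Invented example: 50 positive and 50 negative samples
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this example sensitivity is 0.80 and specificity is 0.90, so the balanced accuracy is 0.85 and the Youden index is 0.70. The likelihood ratios are undefined when specificity equals 1 or 0, a case the sketch does not handle.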

The iterative process carried out in this work can be summarized as follows:

A. Spectra visualization and outlier removal.

B. Model selection.

0. Start with a cultivar from a set of cultivars. Start with no “best model” for the cultivar.

1. Select labelling scenario (from 3 scenarios).

2. Select one pretreatment or combination of pretreatments;

3. Split dataset.

4. Select important features.

5. Apply PLSDA and select the optimal number of latent variables (LVs).

6. Repeat Steps 4 and 5 selecting from 5 to 39 variables by CovSel.

7. If a model BA is higher than the previous model, keep the current model as the “best model.”

Note: Results of steps 1 to 7 will give the best model per cultivar.
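The model selection loop of part B can be sketched as an exhaustive search that keeps the configuration with the highest BA. The `evaluate` callable stands in for steps 3 to 5 (split, feature selection, PLSDA) and is purely hypothetical, as is the toy scoring function used in the example.

```python
from itertools import product

def select_best_model(scenarios, pretreatments, n_variables, evaluate):
    """Return the configuration with the highest balanced accuracy (BA)."""
    best = None
    for scenario, pretreatment, k in product(scenarios, pretreatments, n_variables):
        ba = evaluate(scenario, pretreatment, k)
        # Keep the current model only if it improves on the best one so far
        if best is None or ba > best["ba"]:
            best = {"scenario": scenario, "pretreatment": pretreatment,
                    "n_variables": k, "ba": ba}
    return best

def toy_evaluate(scenario, pretreatment, k):
    """Placeholder score; a real implementation would fit and test a model."""
    bonus = 0.1 if pretreatment == "snv" else 0.0
    return 0.5 + 0.01 * scenario + bonus - 0.001 * abs(k - 12)

best = select_best_model((1, 2, 3), ("raw", "snv"), range(5, 40), toy_evaluate)
print(best)
```

Running the same search once per cultivar, or once per combination of cultivars, would give the best intravariety and global models respectively.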

C. The same process was repeated for global modeling (“GM”) where different scenarios of variety combinations were investigated: Cappricia + Provine, Cappricia + Brioso, Brioso + Provine and Cappricia + Provine + Brioso.

Results and discussion

Average of the pixels coming from the region of interest (ROI)

Table 2 shows the description of the dimensionality of the initial and final datasets before and after averaging the spectra that belonged to the same sepal.

Table 2.

Description of the dimensionality of the initial and final datasets before and after averaging the spectra that belonged to the same sepal.

Dataset name/number of	Pixels per sepal	Sepals per tomato	Tomatoes per image	Spectra in the initial dataset	Spectra in the averaged dataset	Variables
Provine	Between 90 and 119	5 or 6	16	16,156	159	112
Brioso	Between 45 and 53	5 or 6	32	6,497	164	112
Cappricia	Between 81 and 124	5 or 6	16	12,816	165	112

Interpretation of raw spectra

Three bands were observed in the pure and pretreated spectra of all varieties (Figure 2(a) and (b)). In the following paragraphs, tentative assignments will be mentioned along with their bibliographic sources.

Figure 2.

Raw (a) and SNV and second derivative (2, 17, 2) spectra (b) for each variety.

The maximum intensities observed were 1.80 (Cappricia), 2.21 (Provine) and 2.03 (Brioso); at 1455 nm (6873 cm⁻¹) in Cappricia and at 1448 nm (6907 cm⁻¹) in the other two varieties. These bands can be attributed to the symmetric and asymmetric stretching vibrations of water molecules at the first harmonic of the OH stretching vibrations of water.³⁶ More specifically, those wavelengths are included in two well-defined wavelength ranges where water shows the greatest variation of energy absorbance in response to disturbances (Water Matrix Coordinates, “WAMACS”), called C8 and C9. “WAMACS describe different conformations of water such as water dimers, trimers, superoxides, water solvation shells, etc.”^36,37

C9 1458-1468 nm: Water molecules with 2 hydrogen bonds (S2)

C8 1448-1454 nm: ν2 + ν3, Water solvation shell, OH-(H2O)4,5.^37,38

The other peak in raw spectra was located at 1195 nm (8368 cm⁻¹) in all three varieties. According to Jakubíková et al. “The region from 8300 to 8600 cm⁻¹ corresponds to the third overtone band of the bond CH.”³⁸

Dalimov et al. concluded that tomato has approximately 11% lignin with carboxylic groups that distinguish it from other plants.²⁸ Moreover, these authors analyzed IR spectra of suspended tomato particles, and found typical absorption bands for lignin and carbohydrates. They assigned the 1195 nm wavelength to the second C-H stretching overtones of methyl groups, CH₃-groups, as well as the lignin component of tomatoes.

However, other publications assigned this band to glucose. Tanaka et al. measured several glucose anomers in light and heavy water by NIR, and found a peak at 1195 nm in both solvents.³⁹ Furthermore, Lopez et al. performed carbohydrate analysis by NIR, and assigned the same peak to the OH stretch 1^st overtone of glucose.⁴⁰

Finally, the three raw spectra have a peak at 979 nm (10,242 cm⁻¹). It has been assigned in literature to the O–H stretching second overtones, to the hydroxide ion (980 nm) and to the hydrogen-bonded –OH, 2^nd overtone (980 nm).^41,42

Pretreatments and exploratory analysis

First, the raw spectra were plotted after being extracted from the images. This first visualization allowed us to have a first appreciation of how the spectra looked in relation to noise and scattering effects, distortions in the baselines, signal-to-noise ratio, in addition to the presence of clear outliers. To understand the presence of multiplicative and/or additive effects in the spectra, their intensities were plotted as a function of the average spectra (graphs not shown here). The shape of these graphs (millefeuille or cone) helped distinguish effects in the spectra. In all cases, combined effects (multiplicative and additive) were found in the analyzed spectra. Figure 2(a) shows the average of the raw spectra for each variety, and Figure 2(b) shows the average of the spectra pretreated with SNV and second derivative (17-point window, 2^nd order polynomial fit). As mentioned above, other pretreatments were applied and compared as well. It should be mentioned that in this study, the most appropriate pretreatments were chosen according to the way in which they modified the performance of the models.

In this example, SNV was used to remove both the scattering effects caused by the diffusion of photons and the measurement noise (random phenomena present throughout the entire measurement chain). The resulting spectra had a mean equal to zero and a standard deviation equal to one. Furthermore, the second derivative made it possible to find the exact location (center) of the shoulders in the original spectra, by deconvoluting and highlighting the peaks. As a result, significantly narrower bands were observed. The peaks appeared in the same locations as the peaks in the original spectra.

The PCA analysis was performed in this case on the pretreated spectra, first with SNV and then with the second derivative (17-point window, 2^nd order polynomial fit), in all three varieties.

When examining all three varieties together using PCA, discernible clustering patterns did not emerge (Figure 3). The analysis indicated that the cumulative variance explained by the first 20 principal components accounts for approximately 99.6% of the total variance. Specifically, the first principal component (PC1), the second principal component (PC2), and the third principal component (PC3) represented the most significant contributors to this cumulative variance, explaining 45.5%, 24.6%, and 10.4% of the total variance, respectively. Subsequent components gradually contributed smaller proportions of variance, with PC20 explaining 0.1% of the total variance. Despite the comprehensive coverage of variance by the first 20 components, no distinct separation between the varieties was observed in the PCA plot, suggesting considerable overlap in their underlying characteristics.

Figure 3.

PCA score plots of three sample groups (Cappricia, Brioso, Provine): (a) PC1 vs PC2, (b) PC2 vs PC3, (c) PC1 vs PC3, (d) PC1 versus PC4.

Conversely, when each variety was analyzed independently, unique variances and contributions to the overall dataset became apparent. These findings implied that while the amalgamation of all three varieties may obscure underlying patterns, examining each variety separately revealed more nuanced insights. This underscored the importance of considering individual characteristics and subgroupings within each variety for a comprehensive understanding of the dataset’s structure and composition. The number of principal components to accumulate the variance explained by each model, together with the number of outliers detected can be seen in Table 3.

Table 3.

Results of exploration by PCA, to detect outliers at cultivar level.

Cultivar	Number of principal components	% variance explained by the model	Number of extreme outliers detected visually
Brioso	8	99.2	6
Cappricia	7	99.0	6
Provine	8	99.1	4

As previously mentioned, principal component analysis was conducted individually for each variety, as well as collectively for all varieties combined. Opting for the former approach afforded a more nuanced understanding of the intrinsic variability inherent to each distinct variety. The focus centered on elucidating the important wavelengths showed by the loadings of the initial three principal components (PC1, PC2, and PC3). It is worth noting that while the analysis predominantly relied on these three components, a more comprehensive examination necessitated a greater number of components to adequately account for the variance observed within each dataset. For instance, in the case of Cappricia, 99.0 % of the variance was explained by 7 PCs, Provine exhibited 99.1% variance explained by 8 PCs, and Brioso manifested 99.2% variance explained by 8 PCs.

These loadings represented the correlation between the original variables (wavelengths) and the principal components. Loadings of greater magnitude indicated stronger correlations between the variables (wavelengths) and the principal components. The sign of the loadings indicated the direction of the correlation. Positive loadings indicated a positive correlation between the wavelength and the principal component, while negative loadings indicated a negative correlation.
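The quantities discussed here (explained variance, number of components and loadings) can be obtained from a singular value decomposition, as in the NumPy sketch below. The function name, the 99% variance target and the synthetic data are assumptions; the loadings are returned as unit-norm vectors, so they are proportional to, not equal to, correlations.

```python
import numpy as np

def pca_summary(spectra, variance_target=0.99):
    """Explained variance ratios, loadings and number of PCs reaching a target."""
    centered = spectra - spectra.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # First component count whose cumulative explained variance reaches the target
    n_components = int(np.searchsorted(np.cumsum(explained), variance_target) + 1)
    return explained, vt, n_components

# Synthetic data: two underlying sources carried by variables 0 and 1
rng = np.random.default_rng(1)
scores = rng.normal(size=(50, 2)) * np.array([3.0, 1.0])
data = np.zeros((50, 10))
data[:, 0], data[:, 1] = scores[:, 0], scores[:, 1]
data += 0.001 * rng.normal(size=data.shape)

explained, loadings, n_components = pca_summary(data)
top_variable = int(np.argmax(np.abs(loadings[0])))  # strongest variable on PC1
```

Applied to spectra, the index of the largest absolute loading would be mapped back to its wavelength to identify the bands that dominate each component.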

PC1 loadings highlighted distinct wavelengths across the three varieties (Figure 4). Between 1034 nm and 1132 nm, Cappricia and Brioso exhibited notable wavelengths, while Provine showed a key wavelength at a lower intensity. Cappricia displayed a significant wavelength at 1251 nm. In the range from 1300 nm to 1400 nm, all three varieties presented key wavelengths. Lastly, in the 1400 to 1500 nm range, Brioso had peaks at 1420 nm and 1505 nm, while Cappricia showed peaks at 1413 nm and 1512 nm.

Figure 4.

PC 1 Loading plots of cultivars Cappricia, Provine and Brioso.

Similarly, PC2 loadings showed shared peak positions among the varieties (Figure 5). Each variety exhibited a peak at 1034 nm, while all three varieties presented notable peaks at 1258 nm, 1391 nm, and 1469 nm. Between 1500 nm and 1600 nm, Brioso showed defined peaks at 1533 nm and 1633 nm, with Provine peaking at 1554 nm and 1618 nm. Cappricia also aligned with Brioso at 1533 nm and 1633 nm. Starting from 1300 nm, the intensity range was more constrained for Provine (0.32329) than for Brioso (0.35), highlighting distinctions in spectral intensity ranges among the varieties.

Figure 5.

PC 2 loading plots of cultivars Cappricia, Provine and Brioso.

In addition, the analysis of PC3 loadings identified unique peak positions for each variety (Figure 6). Brioso exhibited peaks at 1027 nm, 1146 nm, 1244 nm, 1335 nm, 1462 nm, and 1611 nm. Cappricia showed peaks at 1209 nm, 1363 nm, and 1583 nm, while Provine presented peaks at 1209 nm, 1391 nm, and 1540 nm. These distinct wavelengths highlight the spectral characteristics of each variety based solely on peak positions.

Figure 6.

PC 3 loading plots of cultivars Cappricia, Provine and Brioso.

In summary, each tomato variety demonstrated distinct spectral patterns across PC1, PC2, and PC3, highlighting unique characteristics that may be influenced by NIR wavelengths. PC1 revealed notable differences in Cappricia, suggesting distinctive biochemical or structural components relative to Brioso and Provine. PC2 showcased spectral features that set Provine apart, indicating specific biochemical markers or metabolic traits distinguishing it from the other varieties. Lastly, PC3 emphasized unique spectral properties primarily in Brioso, marking characteristics less evident in Cappricia and Provine. These findings underscore the potential for NIR wavelengths to differentiate tomato varieties based on their spectral profiles.

Data split

Table 4 shows the number of samples belonging to each class according to each labeling scenario, in the complete datasets.

Table 4.

Number of spectra in each class (healthy: class 1; diseased: class 2) when dataset was split according to different labelling scenarios (label 1: 0/123; label 2: 01/23 and label 3: 0.5/123).

Cultivar	n	Label 1		Label 2		Label 3
Cultivar	n	Healthy	Diseased	Healthy	Diseased	Healthy	Diseased
Cappricia	163	139	24	77	86	117	46
Brioso	153	145	8	74	77	126	27
Provine	152	137	15	83	74	129	23

Table 5 shows how the original number of samples belonging to each class was divided into the Training (70%) and Test (30%) sets, consistently for both classes (the details of the calculation can also be seen). Then, the Training sets were randomly divided into Calibration (70%) and Validation (30%).

Table 5.

The number of samples in each set, after dividing the original datasets into train set (70%), test set (30%), randomly. The train set was split again into calibration set (70%) and validation set (30%).

Class	Provine		Brioso		Cappricia		Global model
Class	Healthy	Diseased	Healthy	Diseased	Healthy	Diseased	Healthy	Diseased
Nb. of samples	83	74	74	77	77	86	160	160
Training	58	52	52	54	54	60	112	112
Test	25	22	22	23	23	26	48	48
Total training	110		106		114		224
Total test	47		45		49		96
Calibration	77		74		80		157
Validation	33		32		34		67
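The counts in Table 5 can be reproduced by applying the 70/30 proportion to each class and rounding to the nearest integer. The following is a minimal sketch under that assumption; the function name `split_counts` is illustrative and does not come from the study.

```python
def split_counts(n, fraction=0.7):
    """Return (first part, remainder) when n samples are split at `fraction`."""
    first = round(n * fraction)
    return first, n - first

# Per-class sample counts under Scenario 2, as listed in Table 5.
classes = {
    "Provine": {"Healthy": 83, "Diseased": 74},
    "Brioso": {"Healthy": 74, "Diseased": 77},
    "Cappricia": {"Healthy": 77, "Diseased": 86},
}

for cultivar, counts in classes.items():
    # First split: each class separately, so both keep the same proportion.
    train = {name: split_counts(n)[0] for name, n in counts.items()}
    total_train = sum(train.values())
    total_test = sum(counts.values()) - total_train
    # Second split: the training set into calibration and validation.
    calibration, validation = split_counts(total_train)
    print(cultivar, train, total_train, total_test, calibration, validation)
```

Splitting each class separately keeps the healthy-to-diseased ratio the same in the training and test sets.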

Classification

Intravariety models

The results obtained with the different labeling scenarios for the healthy and diseased classes in the Cappricia cultivar were compared (plots not shown). The precision metric behaved erratically when Scenario 1 was selected: it showed high values when fewer than 13 variables were chosen, decreased abruptly with 14 variables, and increased again when 15 variables were chosen. This counterintuitive behavior arose because the precision metric includes false positives in its denominator, and their number changed abruptly with different splits. In other words, the behavior of the precision metric showed that the data were not evenly distributed between the two classes when Scenarios 1 and 3 were chosen. Another indicator of class balance was the agreement between the accuracy and the balanced accuracy. When the classes were balanced, these metrics were almost identical and their lines overlapped, as observed in Scenario 2. When the classes were imbalanced, on the other hand, a clear separation was observed between them, with the accuracy higher than the BA (graph not shown).

As a rule, high BA values showed that model performance was good for both classes. A high accuracy, on the other hand, showed that the model performed well in general, given the existing dataset balance. When using Scenarios 1 and 3, this was the case only for the majority classes.
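The gap between accuracy and BA under class imbalance can be illustrated with a deliberately extreme, hypothetical case: a classifier that labels every Brioso sepal as healthy under Scenario 1 (145 healthy and 8 diseased samples in Table 4). The classifier is an assumption made for illustration only and is not one of the models in this study.

```python
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def balanced_accuracy(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)  # recall on the positive (healthy) class
    specificity = tn / (tn + fp)  # recall on the negative (diseased) class
    return (sensitivity + specificity) / 2

# Hypothetical "always healthy" classifier on the Brioso Scenario 1 counts:
# all 145 healthy sepals are correct, all 8 diseased sepals are missed.
tp, fn, fp, tn = 145, 0, 8, 0
acc = accuracy(tp, fn, fp, tn)
ba = balanced_accuracy(tp, fn, fp, tn)
print(f"accuracy = {acc:.2f}, balanced accuracy = {ba:.2f}")
# prints "accuracy = 0.95, balanced accuracy = 0.50"
```

The accuracy looks high only because the majority class dominates, while BA reveals that the minority class is never recognized.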

It should be mentioned that there are several ways to address data imbalance. One of them is oversampling (adding samples from the least represented class); another is undersampling (deleting samples from the majority class). In the first case, the risk of overfitting increases, since during cross-validation the same samples that are in the model can be used to validate it. In the second case, important information can be removed from the model.
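For illustration, the two resampling strategies can be sketched as follows. This is a generic sketch using random selection; the function names are hypothetical, and neither strategy was applied in this study. To limit the overfitting risk described above, such resampling would normally be applied to the calibration samples only, after the data split.

```python
import random
from collections import Counter

def _group_by_class(labels):
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return groups

def undersample(labels, seed=0):
    """Indices kept after randomly dropping majority-class samples."""
    rng = random.Random(seed)
    groups = _group_by_class(labels)
    target = min(len(g) for g in groups.values())
    return sorted(i for g in groups.values() for i in rng.sample(g, target))

def oversample(labels, seed=0):
    """Indices obtained after randomly repeating minority-class samples."""
    rng = random.Random(seed)
    groups = _group_by_class(labels)
    target = max(len(g) for g in groups.values())
    picked = []
    for g in groups.values():
        picked += g + [rng.choice(g) for _ in range(target - len(g))]
    return sorted(picked)

# Class sizes of Cappricia under Scenario 1 (Table 4): 139 healthy, 24 diseased.
labels = ["healthy"] * 139 + ["diseased"] * 24
under = Counter(labels[i] for i in undersample(labels))
over = Counter(labels[i] for i in oversample(labels))
print(under, over)
```

Undersampling leaves 24 samples per class and discards 115 healthy spectra, whereas oversampling reaches 139 per class by repeating diseased spectra.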

A one-class classification could also have been used, where all samples similar to the samples of one class are included, and the others discarded by the model. However, these models are always less specific, and in the case of the present study, they showed poorer classification metrics.

For example, a one-class SIMCA analysis on the ‘Healthy’ classes revealed high variance explained by the models, with values of 99.1% for Brioso, 99.8% for Provine, and 99.1% for Cappricia. The model gave good results for the healthy class in the three varieties. In the calibration set (Cal), Brioso achieved a True Positive (TP) count of 74, with a Sensitivity (Sens Cal.) and Accuracy (Acc. Cal.) of 0.97, while Provine showed a TP of 72, with Sens Cal. and Acc. Cal. of 0.95, and Cappricia exhibited a TP of 71, with Sens Cal. and Acc. Cal. of 0.92. However, when the trained models were validated on their corresponding diseased classes, they were not able to reject samples with high specificity. In the validation set (Val), Brioso demonstrated an Accuracy (Acc. Val.) of 0.16, Provine had an Acc. Val. of 0.30, and Cappricia achieved an Acc. Val. of 0.27. These validation metrics offered insights into the robustness of the model across different cultivars (Table 6).

Table 6.

Results of one-class SIMCA on “healthy” class. Exp.Var: % variance explained by the model.

Cultivar	Exp.Var	TP Cal	FN Cal	Spec. Cal	Sens Cal.	Acc. Cal.	PC	Spec. Val	Acc. Val
Brioso	99.1	74	2	NA	0.97	0.97	6	0.16	0.16
Provine	99.8	72	4	NA	0.95	0.95	6	0.30	0.30
Cappricia	99.1	71	6	NA	0.92	0.92	6	0.27	0.27

For these reasons, Scenario 2 was chosen to calibrate and validate the models. No addition or removal of samples was made, except for the aforementioned outliers.

The relationship between BA and ivs can be seen in Figure 7, for the different Scenarios 1, 2 and 3, in the Cappricia model. Once again, we can see that Scenario 2 was the best option, because it showed higher BA values.

Figure 7.

Comparison of balanced accuracy in different labelling scenarios according to different number of important variables as input for PLSDA, in the Cappricia model.

Another interesting aspect to highlight in Figure 7 is that the model’s performance remained the same with 5 variables as it did with 37 or 38 variables. This indicated that a smaller subset of variables captured the essential information needed for classification, making the model more efficient without significantly compromising accuracy. Specifically, for Cappricia, the 5 most influential variables chosen by CovSel were: 959 nm, 1000 nm, 1097 nm, 1441 nm, and 1654 nm. It is interesting, therefore, to analyze what these variables brought and what the NIR was sensitive to in this case.
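To make the variable-selection step more concrete, a CovSel-style greedy selection can be sketched as below. This is a simplified, single-response illustration written for this text, not the implementation used in the study: at each step it picks the variable with the largest squared covariance with the response and then removes from both the spectra and the response the part explained by that variable. The data are hypothetical.

```python
def covsel(X, y, n_select):
    """Greedy CovSel-style selection of column indices for one response y."""
    n, p = len(X), len(X[0])
    n_select = min(n_select, p)
    # Centre every column of X and the response.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    selected = []
    for _ in range(n_select):
        # Squared covariance (up to a constant) between each column and y.
        scores = [sum(Xc[i][j] * yc[i] for i in range(n)) ** 2 for j in range(p)]
        for j in selected:
            scores[j] = -1.0  # a column is never selected twice
        best = max(range(p), key=lambda j: scores[j])
        selected.append(best)
        x = [Xc[i][best] for i in range(n)]
        norm2 = sum(v * v for v in x)
        if norm2 == 0.0:
            continue
        # Deflation: project X and y orthogonally to the selected column.
        for j in range(p):
            coef = sum(x[i] * Xc[i][j] for i in range(n)) / norm2
            for i in range(n):
                Xc[i][j] -= coef * x[i]
        coef_y = sum(x[i] * yc[i] for i in range(n)) / norm2
        yc = [yc[i] - coef_y * x[i] for i in range(n)]
    return selected

# Hypothetical spectra (rows) at four wavelengths; y codes healthy = 0, diseased = 1.
X = [
    [0.50, 0.20, 0.31, 0.40],
    [0.52, 0.21, 0.35, 0.41],
    [0.49, 0.80, 0.30, 0.39],
    [0.51, 0.82, 0.36, 0.42],
    [0.50, 0.19, 0.33, 0.40],
    [0.50, 0.79, 0.32, 0.41],
]
y = [0, 0, 1, 1, 0, 1]
print(covsel(X, y, 1))
```

In this toy example only the second column (index 1) varies with the class, so it is selected first. Because of the deflation, later picks favour variables that carry information not already explained by the earlier ones.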

In a general sense, NIR spectroscopy is particularly sensitive to overtones and combination bands of fundamental molecular vibrations, primarily involving C-H, O-H, N-H, and S-H bonds. The spectral assignments of the specified wavelengths were elucidated through their association with various molecular vibrations and the corresponding chemical compounds. The wavelength at 959 nm was related to the first overtone of C-H stretching vibrations, commonly found in hydrocarbons, lipids, and fatty acids.⁴³ The wavelength at 1000 nm lies in a region that corresponds to harmonic combinations of C-H and O-H stretches, as well as overtones of C-H bonds in methyl and methylene groups. This wavelength was often linked to carbohydrates, proteins, and alcohols.⁴⁴ The wavelength of 1097 nm was linked to the first overtone of O-H and N-H stretching vibrations, indicating the presence of water, proteins, and amines.⁴⁵ The 1441 nm wavelength corresponded to a combination of O-H stretching and bending bands, commonly found in water and alcohols. It was also sensitive to the hydrogen bonding in these compounds.⁴⁴ Finally, the wavelength at 1654 nm was associated with combination and overtone bands of N-H stretching and bending vibrations, along with potential C-H combinations, indicative of proteins, amides, and organic compounds containing nitrogen.⁴⁶

Achieving a balance between optimizing the model to achieve the highest statistical parameters and a more efficient model in terms of complexity but with slightly lower performance is always necessary. It is therefore worth noting that while the model with 5 variables was more efficient, the optimized model required 33 variables. This underscored the trade-off between model complexity and efficiency, where the choice depended on the specific requirements and priorities of the analysis.

Optimal models were generated using different parameters for each tomato variety: for Cappricia, SNV pre-treatment along with a Savitzky-Golay second derivative (15-point window, 2nd order polynomial fit) and 33 independent variables (ivs) were employed; for Provine, SNV with 13 ivs was applied; while for Brioso, raw spectra with 18 ivs were used.

Global models

Table 7 shows PLSDA classification results of a global model, calibrated and validated with cultivars Cappricia and Provine. In this model, spectra were pretreated with SNV, and then with second derivative (17-point window, 2nd order polynomial fit). Then, 19 important variables were chosen by the CovSel algorithm. Finally, the PLSDA model was trained on the calibration data, and 17 latent variables were chosen. This relatively high number can be understood as being due to the complexity of adding two different varieties in one model.
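The pre-treatment chain described above can be sketched in plain Python. The sketch assumes that the second derivative is a Savitzky-Golay filter (as the "SG" notation in Table 8 suggests), that the wavelengths are evenly spaced, and that points without a full window are simply dropped; the function names and the example spectrum are hypothetical.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own statistics."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in spectrum]

def sg_second_derivative(y, window=17):
    """Second derivative from a local quadratic least-squares fit (odd window)."""
    m = window // 2
    offsets = range(-m, m + 1)
    s0 = window
    s2 = sum(i ** 2 for i in offsets)
    s4 = sum(i ** 4 for i in offsets)
    denom = s0 * s4 - s2 ** 2
    # Weights giving twice the quadratic coefficient of the fitted parabola.
    weights = [2 * (s0 * i ** 2 - s2) / denom for i in offsets]
    # Only positions with a full window are returned (no edge padding).
    return [
        sum(w * y[k + i] for w, i in zip(weights, offsets))
        for k in range(m, len(y) - m)
    ]

# Hypothetical spectrum: a smooth curve sampled at 40 evenly spaced points.
raw = [0.4 + 0.001 * (k - 20) ** 2 for k in range(40)]
pretreated = sg_second_derivative(snv(raw), window=17)
print(len(raw), len(pretreated))
```

The derivative is expressed per sampling step; dividing by the squared wavelength spacing would convert it to physical units.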

Table 7.

PLSDA classification results of a global model, calibrated and validated with Cappricia and Provine.

Data set	Real/predicted	Healthy	Diseased	Sens Val.	Spec. Val	Prec. Val	BA. Val
Calibration	Healthy	62	47	0.57	0.74	0.68	0.66
Calibration	Diseased	29	81	0.74	0.57	0.63	0.66
Validation	Healthy	23	25	0.48	0.71	0.62	0.60
Validation	Diseased	14	34	0.71	0.48	0.58	0.60
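The sensitivity, specificity, and precision values in Table 7 follow directly from the confusion matrix counts. A minimal sketch for the validation rows is given below; the function name `class_metrics` is illustrative.

```python
def class_metrics(matrix, positive):
    """Sensitivity, specificity and precision from matrix[real][predicted]."""
    negative = next(c for c in matrix if c != positive)
    tp = matrix[positive][positive]
    fn = matrix[positive][negative]
    fp = matrix[negative][positive]
    tn = matrix[negative][negative]
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return sensitivity, specificity, precision

# Validation rows of Table 7: real class -> counts per predicted class.
validation = {
    "Healthy": {"Healthy": 23, "Diseased": 25},
    "Diseased": {"Healthy": 14, "Diseased": 34},
}

for cls in validation:
    sens, spec, prec = class_metrics(validation, cls)
    print(f"{cls}: sens = {sens:.2f}, spec = {spec:.2f}, prec = {prec:.2f}")
```

Taking each class in turn as the positive class swaps sensitivity and specificity, which is why the two rows of each data set mirror each other.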

Table 8 shows PLSDA modeling results of all the optimal models in this study. Balanced accuracy was comparable to traditional accuracy in all models created with Scenario 2, showing that the classifier performed equally well on either class.

Table 8.

PLSDA classification results of all the optimal models in this study.

Parameter/model	Cappricia raw, 15v Label 3	Cappricia SNV + SG (2, 15, 2), 33v Label 2	Provine raw, 14v Label 3	Provine SNV, 13v Label 2	Brioso raw, 18v Label 2	Global model SNV, (Cap + Pro) 6v Label 2
Accuracy	0.83	0.84	0.71	0.71	0.66	0.70
Sensitivity or recall	0.89	0.71	0.08	0.76	0.43	0.81
Specificity	0.64	0.89	0.97	0.65	0.88	0.58
Precision	0.89	0.71	0.50	0.70	0.77	0.66
Balanced accuracy	0.77	0.80	0.52	0.71	0.65	0.70
Geometric mean	0.75	0.79	0.27	0.70	0.62	0.69
F-measure	0.89	0.71	0.14	0.73	0.55	0.73
Youden’s index	0.53	0.60	0.05	0.41	0.31	0.39
Positive likelihood ratio	2.47	6.45	2.67	2.17	3.58	1.93
Negative likelihood ratio	0.17	0.32	0.95	0.37	0.65	0.33

Accurate predictions for healthy sepals (sensitivity) were as follows: Cappricia (0.71), Provine (0.76), GM (0.81). Similarly, for diseased sepals correctly classified as such (specificity): Cappricia (0.89), Provine (0.65), GM (0.58).

Moreover, good performance on both the positive and negative classes was found in the Cappricia intravariety model, which showed a high positive likelihood ratio of 6.45 for the Healthy class (a value above 1 indicates increased evidence for disease-free) and a low negative likelihood ratio of 0.32 for the Infected class (increased evidence for disease).

For two-class classification, the geometric mean was calculated as the square root of the product of specificity and sensitivity (Table 1). As a rule, if one of the classes cannot be recognized by the model, the geometric mean tends to zero.⁴⁷ This behavior was apparent when the geometric mean was below 0.5, as observed for the classification of the Provine variety using Scenario 3: although the specificity of this model was high, the sensitivity was very low (0.08), and the geometric mean was 0.28. In all other cases, this parameter was greater than 0.5, showing that the models were able to recognize both classes.
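The summary indices discussed here can be derived from sensitivity and specificity alone. The sketch below uses the standard definitions and the rounded values reported in Table 8 as inputs, so the last digit of a result may differ from the tabulated one; the function name is illustrative.

```python
from math import sqrt

def summary_indices(sensitivity, specificity):
    """Indices that depend only on sensitivity and specificity."""
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "geometric_mean": sqrt(sensitivity * specificity),
        "youden_index": sensitivity + specificity - 1,
        "positive_lr": sensitivity / (1 - specificity),
        "negative_lr": (1 - sensitivity) / specificity,
    }

cappricia = summary_indices(0.71, 0.89)   # Cappricia, Scenario 2 (Table 8)
provine_l3 = summary_indices(0.08, 0.97)  # Provine, Scenario 3 (Table 8)

for name, result in (("Cappricia", cappricia), ("Provine L3", provine_l3)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

For the Provine Scenario 3 model, the geometric mean of about 0.28 falls well below 0.5 even though the specificity is 0.97, which is the pattern described above for a model that fails to recognize one class.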

Interpretation of information conveyed by PLSDA loadings

Out of the 33 variables selected by CovSel in the Cappricia model, PLSDA assigned greater importance to 1083 nm (iv 7), 1188 nm (iv 10), 1363 nm (iv 18), and 1427 nm (iv 21), according to the loadings of its first principal component.

In the Brioso model, 18 variables were selected by CovSel, with PLSDA also emphasizing 1363 nm (iv 7), similar to Cappricia, indicating its relevance to the susceptibility of tomato sepals. Additionally, PLSDA placed emphasis on variable number 9, at 1413 nm, which represents an important feature in Brioso sepals, aiding in their distinction from other varieties.

Regarding Provine, 13 variables were selected by CovSel, and among them, the PLSDA model for classification utilized 1090 nm (iv 6), 1427 nm (iv 8), and 1654 nm (iv 10). Interestingly, wavelength 1427 nm (iv 8) was common between Provine and Cappricia, indicating its significance across multiple varieties.

In the GM with SNV, comprising the 37 most important variables according to CovSel, five were key: 986 nm (iv 5), 1090 nm (iv 9), 1335 nm (iv 15), 1448 nm (iv 20), and 1668 nm (iv 31). It is worth noting that the 1363 nm wavelength (iv 18) was common among Cappricia, Brioso, and the GM, while the 1427 nm wavelength (iv 21) was common between Cappricia, Provine, and the GM.

Several reasons explain the selection of specific wavelengths by PLSDA models. According to Silva et al. in 2020, the factors contributing to tomatoes’ susceptibility to fungal diseases include the abundance of simple sugars and organic acids, or the activity of host cell wall-modifying proteins.⁴⁸ These factors, combined with the presence of water and environmental conditions, elucidate why the selected wavelengths played a significant role in the classification studied in this work.

It is worth noting that natural defenses in fruit, like cell walls, waxy coatings, and the skin, provide inherent resistance. When an infection takes place, the fruit initiates various systemic signals that activate specific defenses against the pathogen, thus safeguarding other parts of the fruit. The stage of ripeness determines the type of defense compounds present in the fruit. Nonetheless, some pathogens can bypass these defenses and cause infections, highlighting the importance of the composition of ripe fruit in the emergence of post-harvest diseases.⁴⁹

As mentioned earlier, the raw spectra showed bands at approximately 979 nm, 1195 nm, 1448 nm, and 1455 nm. These bands were also observed in similar wavelengths (970 nm, 1446 nm, and 1200 nm) by Li et al., who estimated the sensory qualities of tomatoes using visible and near infrared spectroscopy.⁵⁰ These authors attributed the 970 nm band to water and the 1190 nm band to the second overtone symmetric stretch of methyl groups.⁴⁸

Furthermore, the bands at 970 nm, 979 nm (or 978 nm), 1188 nm (or 1190 nm), and 1448 nm (or 1450 nm) have been attributed to the O-H stretching first overtone in water by several authors.⁵¹⁻⁵³ According to de Brito et al. the peak at 979 nm (or 978 nm) is caused by water due to its relation to the O–H absorption band range (740 nm, 840 nm, 960 nm, and 1440 nm).⁵¹ Similarly, the bands at 970 nm, 1188 nm (or 1190 nm) and 1448 nm (or 1450 nm) were attributed to the O–H stretching first overtone in water.⁵²,⁵³

The water content in tomato sepals can affect their susceptibility to fungal attacks.⁵⁴ High moisture levels in plant tissues, including sepals, can create a favorable environment for fungal growth and infection. Fungi, such as Botrytis cinerea (grey mold) and A. alternata, thrive in humid conditions where water availability facilitates spore germination and mycelial growth.⁵⁵ Spores of many pathogenic fungi require high humidity conditions to germinate. The presence of water in plant tissues, including sepals, provides the moist environment necessary for spores to activate and begin germinating.³⁸ According to Thomma, some spores require high humidity to germinate and penetrate plant tissues.⁵⁶ Elad et al. emphasized that water content and humidity levels can significantly impact the resistance or susceptibility of tomatoes to fungal infections, as fungi like Botrytis cinerea require moisture for spore germination and infection establishment.⁵⁷

Secondly, polysaccharide and saccharide levels contribute to tomatoes’ susceptibility to fungi.⁴⁸ Peaks for polysaccharides and/or saccharides (at 1170 nm and 1200 nm) were used to differentiate tomatoes of different ages, indicating postharvest ripening levels regardless of storage conditions.⁵² Blanco et al. found that the polysaccharide content in tomato stem tissues significantly affects the susceptibility of tomatoes to Botrytis cinerea infections. Research has shown that these compounds play crucial roles in the structural integrity and biochemical properties of plant tissues, thus affecting their interactions with fungal pathogens. Tomato sepals, like other plant tissues, are composed of cells with cell walls containing polysaccharides. These polysaccharides contribute to the structural integrity of the sepals and play a role in their defense against pathogen invasion.³⁸,⁵⁵ By fortifying the cell wall, polysaccharides in tomato sepals enhance their resistance to fungal pathogens. A robust cell wall can impede the penetration of fungal hyphae or spores, limiting their ability to infect the sepals and cause disease.³⁸ According to Jones and Dangl, during pathogen invasion, fungal pathogens attempt to penetrate the plant cell wall to gain access to nutrients and cause disease. Polysaccharides play a vital role in strengthening the cell wall, making it more resistant to penetration by fungal hyphae or spores. This reinforcement of the cell wall serves as a physical barrier that impedes the progress of fungal pathogens, thereby enhancing plant defense mechanisms.⁵⁸

Moreover, tomato sepals contain various sugars such as glucose, fructose, and sucrose, which can serve as energy sources for fungi. The 1170 nm band is associated with the second overtone of C–H stretching, indicating the presence of carbohydrates in fruit. Furthermore, Blanco et al. indicated that NIR data between 1069 and 1125 nm, specifically at 1083 nm and 1090 nm in this current research, have been used to predict the acid–Brix ratio of tomato juice. Regarding the 1668 nm band, it has previously been assigned to the first and second overtones of the C–H stretches, the first overtones of alkene C=C bonds, and the -CONH- of secondary amides.⁵¹ These functional groups, such as cis-RCH=CHR′, CH, aromatic, and CH₃, are associated with sugars, fruit acids, and some amino acids.⁵⁰

On the other hand, loadings at 1455 nm (or 1456 nm) relate to CH₃ bending vibration in lipids and proteins, while a loading at 1090 nm (or 1095 nm) corresponds to symmetric PO2 stretching.⁵¹ The 986 nm (or 981 nm) region is associated with the phosphodiester region, and the 1427 nm (or 1422 nm) band pertains to proteins and lipids.⁵¹

The susceptibility of tomato sepals to fungal infections is significantly influenced by their lipid content. Lipids serve as essential components of plant cell membranes, affecting permeability and rigidity, which can either hinder or facilitate fungal penetration.⁵⁹ Additionally, certain lipids act as barrier components, with cuticular waxes providing hydrophobicity to prevent pathogen entry.⁵⁹ A reduction in these protective lipids can increase susceptibility to fungal infections. Furthermore, sepals possess an intricate defense mechanism that is triggered by the detection of invading pests or pathogens, leading to a series of events that may result in enhanced resistance.⁶⁰ As integral parts of cellular membranes, lipids play a pivotal role in facilitating the communication pathways that are essential for the regulation of defensive actions in plants. Research has shown that a variety of lipids are critical in the transmission of signals during plant-pathogen interactions.⁶⁰

When it comes to proteins, Ferreira et al. described their role in plant-pathogen interactions as follows: “The interaction between a plant and a pathogen can be likened to a battle, where the primary tools used in the fight are proteins produced by both the plant and the pathogen.”⁶¹ According to Bashir et al., the function of a cell or tissue is primarily determined by its protein composition.⁶² Furthermore, a plant’s resistance to fungal infections relies on alterations in the protein composition of its cells. Upon exposure to fungal species, certain proteins stimulate the accumulation of lignin in the plant cell walls, which can strengthen the plant’s defenses against fungal pathogens.⁶² Moreover, enzymes involved in lipid synthesis, such as those belonging to the phospholipase family, are responsible for generating signaling compounds that are released in response to pathogen assaults.⁶⁰

Furthermore, the 1363 nm (or 1360 nm) band is attributed to the R-O-H stretching first overtone in alcohols.⁴⁹ This spectral feature can be related to the defense mechanisms of tomato sepals against fungi, because specific alcohols or phenolic compounds in the sepals that exhibit this band might play a role in enhancing the plant’s defenses. Moreover, these compounds serve as precursors in the synthesis of phytoalexins, antimicrobial compounds produced in response to fungal infection, and in the formation of phenols and flavonoids, which possess antifungal properties.⁶³,⁶⁴ Alcohols also contribute to the production of lignans and tannins, which reinforce cell wall integrity and act as physical barriers against fungal invasion.⁴⁹

Phenylpropanoids, which include various chemical families such as flavonoids, isoflavonoids, stilbenes, monolignols, and lignin, function as inducible phytoalexins in many vegetable species to combat pathogens. In tomatoes, the primary phytoalexin is α-tomatine, a glycoalkaloid found throughout the plant, including the leaves, stems, and unripe fruit. Tomatine serves as a natural defense compound, protecting the tomato plant against pests and pathogens.⁴⁹ A previous study conducted by Ito et al. revealed that α-tomatine induced cell death in Fusarium oxysporum, a fungus prevalent in tomato crops.⁶⁵

Additionally, peaks related to cell wall components like pectin (1448 nm), cellulose (1363 nm), and lignin (1195 nm) are crucial.⁵³ Changes in these components, along with cell wall thickness during ripening, are significant. Other key peaks identified across different conditions were those relating to the structural and compositional development of the cuticle and cell wall. Compositional changes in key compounds such as pectin, cellulose, and other polysaccharides, as well as changes in cell wall thickness, are part of the ripening process.⁶⁶ Pectin undergoes de-esterification, serving as a measure of tomato maturity. Variations at 1300 nm may be due to fungal growth in inoculated date fruits, and the 1650 nm band might relate to hardness, as explained by Wang et al. for grain kernel at 1680 nm.¹⁹ Fungal infection can alter the texture of food, making the date fruits softer as the infection progresses, a change identified at 1650 nm.⁶⁷ The 1195 nm (or 1200 nm) band is linked to the C–H stretching second overtone in fiber components like cellulose and lignin.⁵³

Lastly, acids such as malic and citric acid, which alter pH, and phenolic compounds, which some fungi use as nutrients, also play roles in the susceptibility of tomato sepals to fungal infections.⁶⁸,⁶⁹ These organic acids can influence the pH of plant tissues. The pH can affect fungal growth and the activity of fungal enzymes. Fungi often thrive in specific pH ranges, and altering the pH of the host tissue can either inhibit or promote fungal infection.⁶⁸ Similar to sugars, citric acid declines with progressing maturation after ripening while the content of malic acid remains relatively constant.⁶⁶ Phenolic compounds are part of the plant’s defense mechanisms. However, some fungi have evolved to utilize these compounds as nutrients, potentially aiding their growth and infection processes.⁵⁴,⁶⁹

To provide clarity, the most important wavelengths highlighted in this study, along with their tentative assignment, were summarized in Table 9.

Table 9.

Summary of tentative band assignments of the important variables (ivs) found in this study.

Wavelength (nm)	Found in this study as	Band assignment	Ref.
1083	PLSDA-PC1 loading in Cappricia	First overtone of O-H stretching vibrations; C-H stretching vibrations	⁵¹
1083	PLSDA-PC1 loading in Cappricia	This band is often observed in the spectra of water, alcohols, and organic compounds containing hydroxyl groups	⁵⁵
1188	PLSDA-PC1 loading in Cappricia	O–H stretching first overtone in water	⁵³,⁵⁸
1363	PLSDA-PC1 loading in Cappricia, Brioso and GM	R–O–H stretching first overtone in alcohols	⁵³
1363	PCA-PC3 loading in Cappricia	Cellulose	⁵¹
1427	PLSDA-PC1 loading in Cappricia, Provine and GM	Proteins and lipids	⁵¹
1413	PLSDA-PC1 loading in Brioso; PCA-PC1 Cappricia	First overtone of the N-H stretching vibration in proteins	⁵⁵
1090	PLSDA-PC1 loading in Provine and GM	Symmetric PO2 stretching; acid–Brix ratio of tomato juice	⁵¹
1654	PLSDA-PC1 loading in Provine; CovSel iv in Cappricia	Combination and overtone bands of N-H stretching and bending vibrations; C-H combinations, indicative of proteins, amides, and organic compounds containing nitrogen	⁴⁶
986	PLSDA-PC1 loading in GM; CovSel in GM	Phosphodiester region.	⁴⁶
1335	PLSDA-PC1 loading in GM; CovSel-GM; PCA-PC3 Brioso	Combination and overtone bands of O-H stretching and bending vibrations; C-H stretching vibrations in organic compounds such as carbohydrates and proteins	⁵⁵
1448	PLSDA-PC1 loading in GM	ν2 + ν3, water solvation shell	³⁸
	PLSDA-PC1 loading in GM	OH-(H2O)4,5	⁵³
	CovSel GM	O-H stretching first overtone in water	⁵¹
	Raw spectra of Brioso and Provine	Pectin	⁵²
1668	PLSDA-PC1 loading in GM	First and second overtones of the C–H stretches; first overtones of alkene C=C bonds; the –CONH- of secondary amides	⁵¹
1455	Raw spectra of Cappricia	CH3 bending vibration in lipids and proteins	⁵¹
1195	Raw spectra of Cappricia, Brioso and Provine.	Second C–H stretching overtones of methyl groups	²⁸
		CH3-groups	⁵³
		C–H stretching second overtone in fiber components like cellulose and lignin	³⁹
		OH stretch 1st overtone of glucose	⁴⁰

Conclusion

This work was carried out with the objective of developing a method to predict the susceptibility of freshly harvested tomatoes to the presence of fungi, in a non-destructive way, before the disease can be observed visually. To this aim, hyperspectral images of the samples were measured, and models were developed based on their relationships with ground truth data.

The models can be divided into two general categories: those calibrated and validated using a single variety (intravariety), and those calibrated and validated with several varieties together (global models). In both cases, the best results were found using Scenario 2 as a reference.

Within the first category, the optimal model was created with the Cappricia variety: Balanced accuracy = 0.84, Sensitivity = 0.71 and Specificity = 0.89. As for the global models, the optimal models were calibrated using Cappricia and Provine together: Balanced accuracy = 0.70, Sensitivity = 0.81, Specificity = 0.58.

In this study, the significance of specific wavelengths in distinguishing tomato varieties and understanding their susceptibility to fungal infections was investigated. For the Cappricia model, PLSDA highlighted the importance of 1083 nm, 1188 nm, 1363 nm, and 1427 nm, indicating their relevance in characterizing tomato sepals’ susceptibility. Similarly, in the Brioso model, 1363 nm was emphasized, along with 1413 nm, distinguishing Brioso sepals from others. The Provine model emphasized 1090 nm, 1427 nm, and 1654 nm, with the wavelength 1427 nm being common among multiple varieties. These findings were supported by previous studies suggesting that factors like sugar abundance, organic acids, water content, and environmental conditions contribute to tomatoes’ susceptibility to fungal diseases. The selected wavelengths corresponded to various biochemical and structural components in tomato sepals, such as polysaccharides, saccharides, lipids, proteins, and phenolic compounds, which play crucial roles in plant defense mechanisms and fungal infection susceptibility.

The results of this research suggest that discrimination between more susceptible and less susceptible sepals is feasible under controlled conditions.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: EU Commission; 664387.

ORCID iDs

Mercedes Bertotto

Hendrik AC de Villiers

Marko Panic

References

1. OECD. Tomato (Solanum lycopersicum). Safety assessment of transgenic organisms in the environment. Paris, France: OECD Publishing, 2017.

2. Bertin, Génard. Tomato quality as influenced by preharvest factors. Sci Hortic 2018; 233(15): 264–276.

3. Codex Alimentarius. Standard for tomatoes (codex stan 293-2008). Rome, Italy: Food and Agriculture Organization (FAO) and the World Health Organization (WHO), 2008.

4. United Nations Economic Commission for Europe. UNECE Standard FFV-36 concerning the marketing and commercial quality control of Tomatoes. Geneva, Switzerland: UNECE, 2017.

5. Regulation (EU) No 543/2011 of the European Parliament and of the Council of 11 April 2011 on the harmonisation of certain social legislation relating to road transport and amending Council Regulation (EEC) No 3821/85 and repealing Council Regulation (EEC) No 3820/85. OJ L 157, 31 May 2011, p. 1–21. Available from: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32011R0543.

6. Arah, Amaglo, Kumah, et al. Preharvest and postharvest factors affecting the quality and shelf life of harvested tomatoes: a mini review. International Journal of Agronomy. 2015; 2015: 478041.

7. Peralta-Ruiz, Sinning-Mangonez, Coronell, et al. Reduction of postharvest quality loss and microbiological decay of tomato “chonto” (Solanum lycopersicum L.) using chitosan-E essential oil-based edible coatings under low-temperature storage. Polymers. 2020; 12: 1822.

8. Smid, Hendriks, Boerrigter, et al. Surface disinfection of tomatoes using the natural plant compound trans-cinnamaldehyde. Postharvest Biol Technol 1996; 9(3): 343–350.

9. Mensink, Chauhan, El Harchioui, et al. Kwaliteit van tomatenkronen na oogst. Wageningen, Netherlands: Wageningen Food & Biobased Research. DOI: 10.18174/555206.

10. Janse, Boerrigter, HAM. Kroonschimmel bij tomaat: consultancyonderzoek. Bleiswijk, Netherlands: Wageningen UR Glastuinbouw WUAaFSG, 2007.

11. Bradr, Panić, Hogeveen-van Echtelt, et al. Predicting sensitivity of recently harvested tomatoes and tomato sepals to future fungal infections. Nature. 2021; 11: 23109. DOI: 10.1038/s41598-021-02302-2.

12. Jiménez-Fernández, Montes-Borregob, Navas-Cortés, et al. Identification and quantification of Fusarium oxysporum in planta and soil by means of an improved specific and quantitative PCR assay. Appl Soil Ecol. 2009; 46(3): 372–382.

13. Nan, Xue. Contamination, detection and control of mycotoxins in fruits and vegetables. Toxins 2022; 14(5): 309.

14. Delavy, Dos Santos, Heiman. Investigating antifungal susceptibility in Candida species with MALDI-TOF MS-based assays. Front Cell Infect Microbiol. 2019; 9: 19.

15. Kazbek, Sambasivam, Bar, et al. Biosensor technologies for early detection and quantification of plant pathogens. Front Chem 2021; 9: 636245.

16. Rajeshwari, Shylaja, Krishnappa, et al. Development of ELISA for the detection of Ralstonia solanacearum in tomato: its application in seed health testing. World J Microbiol Biotechnol. 1998; 14: 697–704.

17. Yan, et al. Establishment of the recombinase polymerase amplification–lateral flow dipstick detection technique for Fusarium oxysporum. Plant Dis. 2023; 107(9): 2665–2672.

18. Skolik, McAinsh, Martin. ATR-FTIR spectroscopy non-destructively detects damage-induced sour rot infection in whole tomato fruit. Planta. 2019; 249(3): 925–939.

19. Wang, Zhang, et al. Identification of tomatoes with early decay using visible and near infrared hyperspectral imaging and image-spectrum merging technique. J Food Process Eng 2021; 44(4): 13654.

20. Ståhle, LWS, Wold. Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study. J Chemometr 1987; 1(3): 185–196.

21. Barker, MRW, Rayens. Partial least squares for discrimination. J Chemometr 2003; 17(3): 166–173.

22. The Darwin annotation tool from V7 labs. Available from: https://darwin.v7labs.com (Last accessed 24th January 2024).

23. Van Rossum, Drake, JFL. Python reference manual. Amsterdam, Netherlands: Centrum voor Wiskunde en Informatica Amsterdam, 1995.

24. R Core Team. R: a language and environment for statistical computing [internet]. Vienna, Austria: R Core Team, 2016. Available from: https://www.R-project.org/

25. Kuhn. Building predictive models in R using the caret package. J Stat Software 2008; 28(5): 1–26. DOI: 10.18637/jss.v028.i05.

26. Brandolini-Bunlon, Jallais, Roger, et al. R package rchemo: dimension reduction, regression and discrimination for chemometrics. https://github.com/mlesnoff/rchemo (2023).

27. Stevens, Ramirez-Lopez. An introduction to the prospectr package. R package Vignette R package version 0.2.6. 2022.

28. Dalimov, Dalimova, Bhatt. Chemical composition and lignins of tomato and pomegranate seeds. Chem Nat Compd 2003; 39(1): 37–40.

29. Smith. Polynomial detrending in near-infrared spectroscopy. Journal of Spectroscopic Techniques 2010; 20(3): 123–135.

30. Antonov. An alternative for the calculation of derivative spectra in the near-infrared spectroscopy. J Near Infrared Spectrosc 2017; 25(2): 145–148.

31. Barnes, Dhanoa, Lister. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl Spectrosc 1989; 43(5): 772–777.

32. Roger, Mallet, Marini. Preprocessing NIR spectra for aquaphotomics. Molecules 2022; 27(20): 6795.

33. Biancolillo, Marini, Roger, J-M. SO-CovSel: a novel method for variable selection in a multiblock framework. J Chemometr. 2019; 34(2): 3120.

34. Akosa. Predictive accuracy: a misleading performance measure for highly imbalanced data. Stillwater, OK: Oklahoma State University, 2017.

35. Luz, D’Opazo, Quiles, et al. Biopreservation of tomatoes using fermented media by lactic acid bacteria. Lebensm Wiss Technol 2020; 130: 109618.

36. Muncan, Tsenkova. Aquaphotomics—from innovative knowledge to integrative platform in science and technology. Molecules 2019; 24(15): 2742.

37. Gowen, Tsenkova, Esquerre, Downey. Use of near infrared hyperspectral imaging to identify water matrix co-ordinates in mushrooms (Agaricus bisporus) subjected to mechanical vibration. J Near Infrared Spectrosc. 2009; 17(6): 363–71. DOI: 10.1255/jnirs.860.

38. Jakubíková, Kleinová, Májek. Near-infrared spectroscopy for rapid classification of fruit spirits. J Food Sci Technol 2016; 56(6): 2797–2803.

39. Tanaka, Tsenkova, Yasui. Details of glucose solution near-infrared band assignment revealed the anomer difference in the structure and the interaction with water molecules. J Mol Liq 2021; 324: 114764.

40. López, García-González, Franco-Robles. Carbohydrate analysis by NIRS-chemometrics. In: Kyprianidis

Skvaril

, et al (eds) Developments in near-infrared spectroscopy. Norderstedt, Germany: InTech, 2017.

41.

Kovacs

Muncan

Veleva

, et al. Aquaphotomics for monitoring of groundwater using short-wavelength near-infrared spectroscopy. Spectrochim Acta A Mol Biomol Spectrosc. 2022; 279: 121378. DOI: 10.1016/j.saa.2022.121378.

42.

Tsenkova

Iordanova

Toyoda

, et al. Prion protein fate governed by metal binding. Biochem Biophys Res Commun 2004; 325(3): 1005–1012.

43.

Workman

Lois

. Practical guide and spectral atlas for interpretive near-infrared spectroscopy. Boca Raton, FL: CRC Press, 2012.

44.

Burns

Ciurczak

. Handbook of near-infrared analysis. New York, NY: Marcel Dekker, 2001.

45.

Osborne

. Near infrared spectroscopy in food analysis. In: Encyclopedia of analytical chemistry: applications, theory, and instrumentation. Hoboken, NJ: Wiley, 2006.

46.

McClure

. Near-infrared spectroscopy the giant is running strong. Anal Chem 2008; 66(1): 43A–53A.

47.

WHO . https://imbalancedlearn.org/stable/references/generated/imblearn.metrics.geometric_mean_score (Last accesed on the 30th January 2024).

48.

Silva

van den Abeele

Ortega-Salazar

, et al. Tomato fruit susceptibility to fungal disease can be uncoupled from ripening by suppressing susceptibility factors. bioRxiv. 2020; 132829. DOI: 10.1101/2020.06.03.132829.

49.

Rodrigues

MHP

Furlong

. Fungal diseases and natural defense mechanisms of tomatoes (Solanum lycopersicum): a review. Physiol Mol Plant Pathol 2022; 122: 101906.

50.

Hayakawa

Nakano

, et al. Estimating the sensory qualities of tomatoes using visible and near-infrared spectroscopy and interpretation based on gas chromatography–mass spectrometry metabolomics. Food Chem. 2021; 343: 128470.

51.

de Brito

Campos

Nascimento

, et al. Determination of soluble solid content in market tomatoes using near-infrared spectroscopy. Food Control 2021; 126: 108068.

52.

Omar

Atan

MatJafri

. NIR spectroscopic properties of aqueous acids solution. Molecules. 2012; 17(6): 7440–7450.

53.

Emsley

NEM

Holden

Guo

, et al. Machine learning approach using a handheld near-infrared (NIR) device to predict the effect of storage conditions on tomato biomarkers. ACS Food Sci Technol 2022; 2: 187–194.

54.

Ulrich

Richard

. Phenolic compounds in plant disease resistance. Phytoparasitica 1988; 16: 153–170.

55.

Osborne

Fearn

Hindle

. Practical NIR spectroscopy with applications in food and beverage analysis. Harlow, UK: Longman Scientific & Technical, 1993.

56.

Thomma

BPHJ.

Alternaria spp.: from general saprophyte to specific parasite. Mol Plant Pathol. 2003; 4(4): 225–36.

57.

Elad

. Mycoparasitism. In: Kohmoto

Singh

(eds) Pathogenesis and hostspecificity in plant diseases: Histopathological, biochemical, genetic and molecular basis. Oxford, UK: Pergamon, Elsevier Science Ltd, 1995, Vol. II: Eukaryotes, pp. 289–307.

58.

Jones

JDG

Dangl

. The plant immune system. Nature. 2006; 444: 323–329. DOI: 10.1038/nature05286.

59.

Yeats

Rose

JKC

. The formation and function of plant cuticles. Plant Physiol. 2013; 163(1): 5–20.

60.

Tanashvi

Sejal

Umar

, et al. The intricate role of lipids in orchestrating plant defense responses. Plant Sci 2024; 338: 111904.

61.

Ferreira

Monteiro

Freitas

, et al. The role of plant defence proteins in fungal pathogenesis. Mol Plant Pathol. 2007; 8(5): 677–700.

62.

Zoobia

Shafique

Ahmad

, et al. Tomato plant proteins actively responding to fungal applications and their role in cell physiology. Front Physiol 2016; 7: 257.

63.

Philippe

JCH

Deville

M-A

Cordelier

, et al. Deciphering the role of phytoalexins in plant-microorganism interactions and human health. Molecules 2014; 19(11): 18033–18056.

64.

Silva-Beltrán

Ruiz-Cruz

Cira-Chávez

, et al. Total phenolic, flavonoid, tomatine, and tomatidine contents and antioxidant and antimicrobial activities of extracts of tomato plant. Int J Anal Chem. 2015; 2015: 284071. DOI: 10.1155/2015/284071.

65.

Ito

Ihara

Tamura

, et al. Alpha-Tomatine, the major saponin in tomato, induces programmed cell death mediated by reactive oxygen species in the fungal pathogen Fusarium oxysporum. FEBS Lett. 2007; 581(17): 3217–3222.

66.

Agius

von Tucher

Poppenberger

, et al. Quantification of sugars and organic acids in tomato fruits. MethodsX 2018; 5: 537–550.

67.

Teena

Manickavasagan

Ravikanth

, et al. Near infrared (NIR) hyperspectral imaging to classify fungal infected date fruits. J Stored Prod Res 2014; 59: 306–313.

68.

Prusky

Yakoby

. Pathogenic fungi: leading or led by ambient pH? Mol Plant Pathol. 2003; 4(6): 509–516.

69.

Vincenzo

Veronica

MTL

Cardinali

. Role of phenolics in the resistance mechanisms of plants against fungal pathogens and insects. Phytochemistry: Advances in Research. 2006; 280: 23–67.