Abstract
This study investigates the potential of near infrared (NIR) spectroscopy for detecting the botanical trash content of seed cotton harvested by cotton-picker (SCHCP). A large quantity of trash becomes commingled with cotton fiber during harvesting, especially when the cotton is harvested with a cotton-picker. In China, the trash content of seed cotton (SC) has to be determined when farmers sell SC to ginneries, because trash reduces the price of SC and its weight should be deducted from the total weight. The conventional instrumental method for determining the trash content of SC, ginning followed by trash analysis, is complex and time consuming. In this study, 353 SC samples were collected from three ginneries, and their NIR spectra from 12,000 to 4000 cm⁻¹ were acquired with a Nexus FT-NIR spectrometer. Models relating the NIR spectra (12,000–4000 cm⁻¹) to the trash contents of these SC samples were developed using partial least squares regression (PLSR). Multiplicative signal correction (MSC) was used to eliminate the negative effects caused by sample shape, and second derivative spectra were used to eliminate translation and rotation of the spectral baseline. For the optimized model, R² reaches 0.985 (calibration set) and 0.973 (prediction set), RMSEC is as low as 0.072 g, and RMSEP is 0.158 g. ANOVA results also confirm that the trash contents calculated with the models are consistent with the actual trash contents.
Introduction
Cotton is an essential natural fiber, accounting for approximately 27% of all fibers. China is one of the world’s largest cotton producing countries. 1 In recent years, machine-picked cotton has been rapidly promoted, and about 80% of cotton is now harvested mechanically. The harvested seed cotton (SC) is ginned to separate lint from cotton seeds and various types of impurities, such as leaf, bark, stem, seed coat, paper, and plastic bags. The ginning procedure consists of a series of sequential steps, including several SC cleaning processes, ginning, several lint cotton (LC) cleaning processes, and packaging of LC into bales. If the impurity cleaning efficiency is too low, the high impurity content of LC will adversely affect spinning. According to a survey by the International Textile Manufacturers Federation (ITMF), 26% of the cotton processed by spinning mills was found to be moderately or severely contaminated with impurities. However, excessive cleaning significantly degrades fiber quality,2,3 for example, the length, the length uniformity, and the strength, which not only decreases the monetary value of cotton fiber but also reduces the overall quality of yarns and cotton textiles. Therefore, the detection of impurity content is very important in the cotton industry.
The commonly used instrumental detection methods for cotton impurities include gravimetric methods, geometric methods, and spectroscopic methods. Gravimetric methods are mainly used to separate impurities from fiber. For example, the saw impurity analyzer, which is the standard method for measuring the impurity content of LC in China, separates botanical impurities from cotton fiber using the differences in their densities and volumes.
Geometric methods are mainly used to detect the impurity content. For example, Lieberman et al.4–7 used neural networks and learning vector quantization methods to detect impurities in LC, and impurities in large cotton lumps have also been detected. Li et al. 8 proposed a method that extracted feature vectors with the Gabor operator and separated white foreign fibers in the binary image composed from those feature vectors. Zhang et al. 9 proposed a machine vision method for identifying impurities in SCHCP using a support vector machine classifier optimized with a genetic algorithm.
Spectroscopic methods are also mainly used to detect the impurity content. For example, Fortier et al.10,11 acquired the FT-NIR spectral characteristics of cotton hull, leaf, seed, and stem, and identified cotton impurity components with a NIR spectral database; the identification accuracy was as high as 98% when the spectrum of an impurity from a new sample was compared with the reference spectral library. Rodgers et al. 12 monitored the micronaire of cotton fiber with a portable NIR instrument. Liu et al. classified cotton samples into different levels using models between the cotton samples and their spectra (220–2200 nm), and the results indicated that a model using the 1105–1700 nm bands could reach an acceptable separation; Liu et al.13–15 also found large differences between the spectra (1200–900 cm⁻¹) of mature and immature cotton fiber. Allen et al. 16 used an FT-IR spectral database to classify cotton samples into different impurity levels. Gamble and Foulk 17 built partial least squares (PLS) models of six botanical trash types using fluorescence spectroscopy, and the models for leaves and hulls were capable of predicting the individual trash components with a high degree of confidence. Gaitán-Jurado et al. 18 determined the moisture content and impurity level of SC using NIR spectra. For moisture, the best model was obtained using PLS regression with the drying method as the reference and with the first derivative, standard normal variate (SNV), and detrending as pretreatments.
Geometric and spectroscopic methods are often combined to detect cotton impurities and other features of cotton fiber. For example, Zhang et al. 19 inspected foreign matter on the surface of LC using liquid crystal tunable filter hyperspectral imaging with a spectral range from 900 to 1700 nm. Mustafic et al.20,21 found that a fluorescent imaging apparatus with blue and UV light excitation sources could be a promising method for cotton foreign matter detection. Jia and Ding 22 transformed the differences in the absorption characteristics of cotton fibers and foreign fibers in the NIR band into image features, then selected an image segmentation algorithm to extract foreign fiber objects from the cotton background. The high volume instrument (HVI), which incorporates both spectroscopy and imaging, can measure multiple indexes of LC, such as the color grade, length, strength, and fineness of cotton fiber. HVI can also measure the impurity content of LC by counting the number of impurities and the percentage of the total surface area they cover. 23
In summary, there is a large body of research on detecting the impurities of LC, while there is almost no research on detecting the impurities of SC. The impurity content of SC is much larger than that of LC, especially because a large quantity of stems, hulls, and leaves of the cotton plant become commingled with cotton fiber in SC harvested with a cotton-picker (SCHCP). Another difference is that SC contains cotton seeds while LC does not. In this research, NIR spectroscopy is investigated for detecting the impurity content (mainly stems, hulls, and leaves of the cotton plant) of SCHCP. Partial least squares (PLS) models between the diffuse reflection NIR spectra and the impurity content of SCHCP are developed.
Materials and methods
The whole experiment mainly consists of four steps: sampling and sample preparation, NIR spectra acquisition, separation of impurities from cotton, and mathematical modeling, as shown in Figure 1.

Figure 1. Experimental procedure: sampling and sample preparation, NIR spectra acquisition, separation of trash from cotton fibers, and modeling and optimization.
Sampling and preparation
As is known, different geographical locations have different climatic conditions, so the cotton maturity period also differs, which in turn leads to different impurity rates of cotton. To ensure representativeness, three ginneries were selected from the three most representative producing areas in Xinjiang, China’s main cotton production area. In each ginnery, two cotton modules with a volume of 10 × 2 × 2.5 m³ and a weight of 10 tons were randomly selected. In each module, 60 samples were collected from the two longest sides. On each side, 30 samples were collected along three roughly equidistant lines. Along each line, 10 samples were collected at roughly equal spacing, not less than 10 cm under the surface of the module. The weight of each sample ranged from 100 to 150 g. In total, 360 samples were collected from the three ginneries, with 120 samples per ginnery. However, seven sample bags were broken during transport. Therefore, 353 samples were actually used for modeling and analysis.
Before acquiring NIR spectral data, 20 g of SC was separated from each sample and conditioned in the laboratory at a constant temperature of (20 ± 1)°C and a relative humidity of (65 ± 2)% for more than 24 h. Then, all of the samples were sealed in PE self-sealing bags, as shown in Figure 2.

Figure 2. The 20 g SC samples used for NIR analysis, ginning, and trash analysis. The weighing error was less than 0.01 g, and the bags were weighed individually.
NIR spectral acquisition
The NIR spectra were acquired with a Nexus FT-NIR spectrometer (Thermo Electron Corp., Madison, Wisc., USA) equipped with a smart diffuse reflectance accessory, an InGaAs detector, and a built-in 50 W quartz halogen lamp as the light source. The background was the NIR spectrum of a Teflon plate. Before spectral acquisition, all samples were kept at 20 ± 1°C and 65 ± 2% RH for more than 24 h. The spectra were acquired over the range of 12,000–4000 cm⁻¹ at a resolution of 8 cm⁻¹ with 32 scans.
The sample pool is a cylinder (Figure 3(a)) made of a special metal material with an internal diameter of 10.16 cm and a height of 6.35 cm. One end of the cylinder is sealed with low OH quartz glass, which hardly absorbs near infrared light, and the other end is open for loading samples. Because the spot of the spectrometer is small while the surface of the sample is quite large, it is difficult to characterize a sample with the spectrum from just one position. In this research, therefore, the average of five spectra acquired at five different positions (1–5 marked in Figure 3(a)) was used to represent the corresponding sample. A gold coated compression tool was used to press the samples to a fixed height and density (Figure 3(b)). This ensures consistent density and height across samples and also prevents stray light.

Figure 3. The sample pool and the spectral acquisition procedure: (a) sample pool and compression tool (sample pool of IN312/x, 97 mm diameter, hard-coated aluminum with low OH quartz window, 90 mm high; compression tool, 87 mm diameter, for use with IN312), where 1–5 mark the spectral acquisition positions, and (b) diffuse reflectance spectra acquisition process.
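The averaging of the five position spectra can be illustrated with a minimal sketch. It assumes that the five spectra share one wavenumber grid and are averaged point by point; the function name and the absorbance values are hypothetical and are not taken from the measured data.

```python
def average_spectrum(position_spectra):
    """Average several spectra point by point (all on the same wavenumber grid)."""
    n_points = len(position_spectra[0])
    if any(len(s) != n_points for s in position_spectra):
        raise ValueError("all spectra must share the same wavenumber grid")
    n_spectra = len(position_spectra)
    return [sum(s[j] for s in position_spectra) / n_spectra for j in range(n_points)]


# Hypothetical absorbance readings at three wavenumbers for positions 1-5.
five_positions = [
    [0.50, 0.60, 0.70],
    [0.52, 0.58, 0.74],
    [0.48, 0.62, 0.66],
    [0.51, 0.61, 0.71],
    [0.49, 0.59, 0.69],
]
sample_spectrum = average_spectrum(five_positions)
```

Averaging over positions reduces the influence of local heterogeneity (for example, a single stem fragment under the spot) on the spectrum that represents the sample.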
Trash detection
In this study, trash mainly refers to stems, hulls, and leaves of cotton and grass (Figure 4), because they account for most of the botanical trash in SCHCP. As shown in Figure 1 (P3), the trash in SCHCP is separated in the following three steps:
(1) In process P3.1, most of the large trash, mainly cotton stems and hulls, was picked out manually.
(2) In process P3.2, the SCHCP was ginned with an SY-20 small roller ginning machine (roller size 120 mm × 205 mm, rotational speed of 88 rpm, produced by River machinery plant, Xinxiang, Henan, China), which is specially designed to gin small quantities of SC. In this process, the remaining cotton hulls and part of the cotton plant leaves were separated from the cotton fiber.
(3) In process P3.3, the remaining leaves were separated from the cotton fiber with a YG-041 cotton trash analyzer (roller size 57.15 mm × 490 mm, roller rotational speed of 0.9 rpm, produced by Changzhou No.1 Textile Equipment Co., Ltd., Changzhou, Jiangsu, China), which is also specially designed for small quantities of LC, according to the test method for percentage of trash content in raw cotton (GB/T-6499).

Figure 4. An SCHCP sample and the main kinds of trash it contains: (a) sample of SC harvested with a cotton-picker, (b) stems of the cotton plant, (c) hulls of the cotton plant, and (d) leaves of the cotton plant.
In fact, although most of the trash could be separated from the cotton fiber in the above processes, the LC still contains a small amount of trash that is difficult to separate. In general, the trash content of cotton samples after trash analysis is deemed to be 0, so this residual trash is ignored in this research.
Tools and data analysis
Origin 2017 (OriginLab Corporation, Northampton, MA 01060 USA), Omnic 8.0, and TQ Analyst 8.0 (Thermo Electron Corp., Madison, Wisc., USA) were used for spectral pretreatment and modeling. Origin 2017 and Omnic 8.0 provide commonly used spectral preprocessing and spectral feature extraction methods, such as calculation of average spectra, first and second derivative spectra, spectral smoothing, and peak fitting. TQ Analyst 8.0 is specialized software for infrared spectral modeling, which provides commonly used spectral preprocessing methods (multiplicative signal correction, baseline correction, spectral smoothing, data format transformation) and modeling methods (e.g. distance match, discriminant analysis, simple Beer’s law, principal component regression, and PLS).
Models were formulated that relate the FT-NIR spectra to the trash content of each sample. The prediction ability of a model is given as the root mean square error of calibration (RMSEC) and the root mean square error of prediction (RMSEP), where n is the number of samples in the calibration set and m is the number of samples in the prediction set.
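As an illustration of the two error measures, the sketch below computes a root mean square error from actual and calculated trash weights. It assumes the commonly used definition with the sample count (n for the calibration set, m for the prediction set) as the divisor; modeling software may apply a degrees-of-freedom correction instead. The function name and the weights are hypothetical.

```python
import math


def rmse(actual, calculated):
    """Root mean square error between actual and calculated values.

    Assumption: the divisor is the plain sample count, with no
    degrees-of-freedom correction.
    """
    if len(actual) != len(calculated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((c - a) ** 2 for a, c in zip(actual, calculated))
    return math.sqrt(squared_error / len(actual))


# Hypothetical trash weights in grams (not measured data).
cal_actual, cal_calculated = [1.00, 1.50, 2.00, 2.50], [1.10, 1.40, 2.10, 2.40]
pred_actual, pred_calculated = [1.20, 1.80], [1.00, 2.00]

rmsec = rmse(cal_actual, cal_calculated)    # calibration set, n = 4
rmsep = rmse(pred_actual, pred_calculated)  # prediction set, m = 2
```

Because both measures share one formula and differ only in the sample set, a RMSEP noticeably larger than the RMSEC indicates that the model fits the calibration samples better than it generalizes.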
Analysis of variance (ANOVA) was used to determine whether there were significant differences between the predicted trash contents and the measured trash contents. In this research, the number of data sets is k = 2, and N is the number of samples in each group.
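The ANOVA comparison can be sketched as a one-way F statistic. This is a minimal illustration assuming the standard one-way layout, in which the between-group mean square is divided by the within-group mean square; with k = 2 groups of N samples each, the degrees of freedom are 1 and 2(N − 1). The function name and the values are hypothetical, and the sketch stops at the F statistic, which would then be compared with a critical value of the F distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical actual and calculated trash contents (k = 2, N = 3).
actual = [1.0, 2.0, 3.0]
calculated = [2.0, 3.0, 4.0]
f_value, df_between, df_within = one_way_anova_f([actual, calculated])
```

An F value below the critical value for the chosen significance level means that no significant difference between the two groups is detected.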
Results and discussion
NIR spectra
As described in the NIR spectral acquisition section, multiple acquisition points were selected for each sample and multiple scans (32) were acquired at each point to improve the representativeness and the signal-to-noise ratio (SNR). In addition, several other pretreatment methods were used to improve the quality of the spectra before modeling. The baseline was corrected with the least squares baseline correction algorithm, then the spectra were smoothed with the Savitzky-Golay method (window size of 5 points, polynomial order of 2), and finally the spectra were normalized by dividing by the maximum ordinate value (Figure 5).

Figure 5. Spectra pretreated with least squares baseline correction, Savitzky-Golay smoothing, and normalization by dividing by the maximum ordinate value, for the samples of: (a) Shihezi, (b) Kuitun, and (c) Aler.
The average spectra of the samples from the different ginneries over the entire spectral range (12,000–4000 cm⁻¹) are compared in Figure 6. There are obvious differences in the raw spectra (Figure 6(a)), but the differences become smaller after baseline correction and smoothing (Figure 6(b) and (c)), and they almost disappear after normalization (Figure 6(d)).
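The smoothing and normalization steps can be sketched as follows. This is an illustration rather than the procedure implemented in the software used here: it assumes the standard 5-point quadratic Savitzky-Golay convolution weights (−3, 12, 17, 12, −3)/35, leaves the two points at each end unsmoothed (an assumption about edge handling), and omits the baseline correction step. The function names and the example spectrum are hypothetical.

```python
# Standard Savitzky-Golay weights for a 5-point window and polynomial order 2.
SG_WEIGHTS_5_2 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_smooth_5pt(y):
    """Smooth interior points with the 5-point quadratic Savitzky-Golay filter."""
    if len(y) < 5:
        raise ValueError("at least 5 points are required")
    smoothed = list(y)  # edge points (two at each end) are kept unchanged
    for i in range(2, len(y) - 2):
        smoothed[i] = sum(w * y[i + k - 2] for k, w in enumerate(SG_WEIGHTS_5_2))
    return smoothed


def normalize_by_max(y):
    """Divide every ordinate by the maximum ordinate value."""
    peak = max(y)
    if peak == 0:
        raise ValueError("maximum ordinate is zero")
    return [v / peak for v in y]


# Hypothetical noisy absorbance values (not measured data).
raw = [0.10, 0.30, 0.20, 0.40, 0.35, 0.50, 0.45]
pretreated = normalize_by_max(savgol_smooth_5pt(raw))
```

A quadratic filter of this kind reproduces any locally quadratic signal exactly, so it suppresses point-to-point noise while largely preserving band shapes.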

Figure 6. The average spectra of the SCHCP samples of the different ginneries: (a) raw spectra, (b) spectra after least squares baseline correction, (c) spectra after Savitzky-Golay smoothing, and (d) spectra after normalization by dividing by the maximum ordinate value.
Two bands representing cotton botanical trash can also be observed: one at about 2050–2100 nm (4878–4762 cm⁻¹), with an O-H bend and C-O stretch combination, and one at about 2200–2270 nm (4545–4405 cm⁻¹), with an O-H and C-O stretch combination or a C-H stretch and CH2 deformation. Despite this, there were considerable overlaps among the FT-NIR spectra of cotton fiber and the various kinds of botanical trash (stems, leaves, hulls, seed coat, seed meat, etc.). Thus, the components of cotton fiber and trash could not be uniquely identified with the original FT-NIR spectra.
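The band positions are quoted both in wavelength and in wavenumber; the two are related by wavenumber (cm⁻¹) = 10⁷ / wavelength (nm). A short sketch of the conversion, with hypothetical function names:

```python
def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    return 1e7 / wavelength_nm


def wavenumber_to_nm(wavenumber_cm):
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1e7 / wavenumber_cm


# The two trash-related bands, given in nm.
bands_nm = [(2050, 2100), (2200, 2270)]
bands_cm = [
    (round(nm_to_wavenumber(low)), round(nm_to_wavenumber(high)))
    for low, high in bands_nm
]
# bands_cm == [(4878, 4762), (4545, 4405)]

# The acquisition range of 12,000-4000 cm^-1 expressed in nm.
range_nm = (wavenumber_to_nm(12000), wavenumber_to_nm(4000))
```

The acquisition range of 12,000–4000 cm⁻¹ thus corresponds to roughly 833–2500 nm, which contains both bands.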
Trash content
The main statistical characteristics of the weight of trash and the trash content are shown in Figure 7. The weight of the trash separated in P3.3 (Figure 7(a)) is slightly larger than the weight of the trash separated in P3.1 and P3.2 (Figure 7(c)). Moreover, the trash separated in P3.3 is distributed more uniformly than the trash separated in P3.1 and P3.2, because the trash separated in P3.3 is mainly cotton leaves, while the trash separated in P3.1 and P3.2 contains not only leaves but also cotton plant stems and hulls, which are larger than leaves and far fewer in number. This uneven distribution of trash might lead to poor models.

Figure 7. The weights of trash: (a and b) weight of the trash separated in P3.3 and its ratio, (c and d) weight of the trash separated in P3.1 and P3.2 and its ratio, and (e and f) total weight and its ratio.
The detailed statistical parameters of Figure 7 are shown in Table 1. The trash contents (shown in Figure 7(b), (d), and (f)) are calculated according to formulas (5)–(7), in which I.1 represents the trash separated in P3.1 and P3.2 and I.2 represents the trash separated in P3.3.
Table 1. The statistical parameters of the weight of trash and the trash content.
Models between FT-NIR spectra and trash content
The quantitative prediction models were established using the software TQ Analyst 8.0. Multiplicative signal correction (MSC) was used to eliminate the negative effects caused by differences in sample shape and in the granularity of fiber and trash. The PLS algorithm was used to build the relationship between the NIR spectra and the trash content, because PLS is known to usually perform better than other methods in cotton analysis: the errors caused by sample tightness and thickness can be greatly reduced in the transformation of the data matrix composed of both dependent and independent variables. To avoid overfitting, each group of SCHCP samples was divided into five subgroups for internal cross validation. One subgroup was used as the prediction set and the calibration model was built on the remaining four subgroups, and this procedure was repeated until each subgroup had been used as the prediction set. The models for the three spectral types (original spectrum, first derivative, and second derivative) over the range of 12,000–4000 cm⁻¹ were established and compared (Table 2).
Table 2. The parameters of the models between the trash contents and the NIR spectra of SCHCP samples (before optimizing).
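Two of the steps described above, MSC and the five-subgroup cross validation, can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation in the modeling software: MSC is written in its common form, in which each spectrum is regressed on the mean spectrum and then corrected as (spectrum − intercept) / slope, and the samples are assigned to subgroups by interleaving, which is an assumption because the assignment rule is not specified. All names and values are hypothetical.

```python
def msc(spectra):
    """Multiplicative signal correction against the mean spectrum."""
    n_points = len(spectra[0])
    reference = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_points)]
    ref_mean = sum(reference) / n_points
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_points
        # Least squares fit: s ~ intercept + slope * reference
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(reference, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


def cross_validation_splits(n_samples, n_subgroups=5):
    """Yield (calibration, prediction) index lists, one pair per subgroup."""
    indices = list(range(n_samples))
    subgroups = [indices[i::n_subgroups] for i in range(n_subgroups)]
    for i, prediction in enumerate(subgroups):
        calibration = [idx for j, g in enumerate(subgroups) if j != i for idx in g]
        yield calibration, prediction


# Hypothetical spectra: one band shape with different offsets and scales.
base = [1.0, 2.0, 4.0, 3.0]
spectra = [base, [2 * v + 0.5 for v in base], [0.5 * v - 0.2 for v in base]]
corrected = msc(spectra)
splits = list(cross_validation_splits(12))
```

In this constructed example the three spectra differ only by an additive offset and a multiplicative scale, so MSC maps all of them onto the same curve, which is the kind of scatter effect the correction is meant to remove.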
The second derivative is clearly the most suitable spectral type for establishing models between the NIR spectra of SCHCP samples and their trash contents (before optimizing). The most important reasons are the non-uniform translation of the baseline caused by differences in sample tightness and the significant non-uniform rotation caused by differences in sample height. Although all of the samples were pressed with the same compression tool, the variations in trash content, the inconsistent traits of the cotton seeds, and similar factors result in differences in sample tightness. These variations in tightness in turn result in different sample heights, which cause not only non-uniform translation of the baseline but also non-uniform rotation. It is known that the first and second derivatives can reduce the effects of spectral translation and rotation to a certain extent. 24 Therefore, the models between the NIR spectra of SCHCP and their trash contents should be established using the second derivative. Figure 8 shows the relationship between the actual trash contents and the calculated trash contents of the SCHCP samples.

Figure 8. The relationship between the calculated trash contents and the actual trash contents of the SCHCP samples of Aler (a, d, and g), Kuitun (b, e, and h), and Shihezi (c, f, and i). The models were built with the original spectra (a–c), the first derivative (d–f), and the second derivative (g–i) before optimizing.
Although the results for the calibration set are quite good, the results for the prediction set are not as good. It is known that abnormal samples (with abnormal physical or chemical properties, or abnormal spectra) might cause a significant decrease in model quality. The samples that are abnormal in trash content were identified in the Trash content section. Here, the Mahalanobis distance, the Chauvenet test, and the leverage were used to identify the samples that were abnormal in their FT-NIR spectra. In addition, a sample in the prediction set was deemed abnormal if its trash content fell outside the trash content range covered by the calibration set. In total, seven abnormal samples in the Aler group, seven in the Kuitun group, and four in the Shihezi group were excluded from the models. The parameters of the optimized models are shown in Table 3.
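The effect of the derivatives on baseline translation and rotation can be checked numerically. The sketch below uses finite differences on an evenly spaced grid as a simple stand-in for derivative spectra (derivative algorithms in spectroscopy software typically also include smoothing); the band shape, the offset, and the slope are hypothetical.

```python
def first_difference(y):
    """Finite-difference first derivative on an evenly spaced grid."""
    return [b - a for a, b in zip(y, y[1:])]


def second_difference(y):
    """Finite-difference second derivative on an evenly spaced grid."""
    return first_difference(first_difference(y))


# Hypothetical band on a flat baseline.
band = [0.0, 0.1, 0.4, 0.9, 0.4, 0.1, 0.0]
offset, slope = 0.3, 0.05

translated = [v + offset for v in band]                           # baseline translation
rotated = [v + offset + slope * i for i, v in enumerate(band)]    # translation + rotation

d1_band, d1_translated, d1_rotated = map(first_difference, (band, translated, rotated))
d2_band, d2_rotated = second_difference(band), second_difference(rotated)
```

The first difference removes the constant offset but keeps a constant shift equal to the slope, whereas the second difference of the rotated spectrum matches that of the original band. This is consistent with the second derivative giving the best models when both translation and rotation are present.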
Table 3. The parameters of the models between the trash content and the NIR spectra of SCHCP samples (after optimizing).
The relationship between the calculated trash contents and the actual trash contents after excluding the abnormal samples is shown in Figure 9. The models are significantly improved: the average RMSEC is reduced to 0.072 g and the average RMSEP to 0.158 g.
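Two of the screening rules can be sketched as follows; the Mahalanobis distance and the leverage are not shown. The sketch assumes the usual form of Chauvenet's criterion, in which a value is flagged when the expected number of samples at least as extreme, under a normal distribution, is below 0.5, and it applies the criterion to one scalar score per sample (for example, a distance). The function names and the values are hypothetical.

```python
import math
import statistics


def chauvenet_outliers(scores):
    """Return indices of scores rejected by Chauvenet's criterion."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        return []
    flagged = []
    for i, value in enumerate(scores):
        z = abs(value - mean) / sd
        # Two-sided tail probability of a standard normal deviate.
        tail_probability = math.erfc(z / math.sqrt(2))
        if n * tail_probability < 0.5:
            flagged.append(i)
    return flagged


def outside_calibration_range(calibration_contents, prediction_contents):
    """Return indices of prediction samples outside the calibration coverage."""
    low, high = min(calibration_contents), max(calibration_contents)
    return [i for i, v in enumerate(prediction_contents) if v < low or v > high]


# Hypothetical per-sample scores and trash contents (not measured data).
scores = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 1.03, 3.0]
abnormal_by_score = chauvenet_outliers(scores)
abnormal_by_range = outside_calibration_range([1.0, 2.5, 1.8], [0.8, 1.5, 2.6])
```

The range check reflects that a regression model should not be expected to extrapolate beyond the trash contents represented in its calibration set.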

Figure 9. The relationship between the calculated trash contents and the actual trash contents of the SCHCP samples of Aler (a, d, and g), Kuitun (b, e, and h), and Shihezi (c, f, and i). The models were built with the original spectra (a–c), the first derivative (d–f), and the second derivative (g–i) after excluding the abnormal samples.
ANOVA was used to test whether there are significant differences between the actual trash contents and the calculated trash contents. The results are shown in Table 4, from which it is clear that the trash contents calculated with the models are consistent with the actual trash contents.
Table 4. The ANOVA results (in this table, the actual trash content is the average trash content measured by ginning and trash analysis, and the calculated trash content is the average trash content predicted with the NIR models).
Conclusions
This study investigated the potential of FT-NIR spectroscopy for detecting the botanical trash content of SCHCP samples. PLS models were built with the original FT-NIR spectra, the first derivative, and the second derivative over the range of 12,000–4000 cm⁻¹. The results indicate that the second derivative is the most suitable for establishing NIR models to predict the botanical trash content of SCHCP samples: the correlation coefficient of the optimized model is as high as 0.985 (calibration set) and 0.973 (prediction set), the RMSEC is 0.072 g, and the RMSEP is 0.158 g. The ANOVA results also confirm that the trash contents calculated with the models are consistent with the actual trash contents. Future studies will include more calibration samples, and a NIR knowledge-based expert system (NIR-KBES) for detecting the trash content of SC will be developed.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We are grateful for the financial support provided by the National Natural Science Foundation of China (No. 31601224), the Natural Science Foundation of Anhui Provincial Department of Education No. KJ2019A0650 and KJ2020ZD004, the key research and development plan of Anhui province No. 202104a06020014.
