Stop selecting features on pre-processed NIR spectra and then trying to explain the chemistry: it simply doesn't make sense

Abstract

Feature selection in NIR spectroscopy is a critical step in chemometric data modeling. It involves identifying and selecting the most relevant wavelengths or spectral bands from the full NIR spectrum for use in building predictive models. In practice, features are selected not only to achieve robust and accurate models but also, in many cases, to develop low-cost multispectral systems. Although dimensionality reduction methods like Partial Least Squares or Principal Component Analysis are typically sufficient and often preferable for improving model robustness and accuracy, as they leverage the multivariate nature of data to represent the underlying chemical structure, feature selection remains an active area of research.

In scientific literature, feature selection and latent space modeling are frequently debated, often with model performance cited as the primary benchmark. However, this is not always a reliable metric. Slight changes in the number of selected features or latent variables during model optimization can significantly alter conclusions, making comparisons somewhat fragile. Nonetheless, there are valid reasons to pursue feature or band selection. One common motivation is computational efficiency: traditional deflation-based latent space models can become prohibitively time-consuming when dealing with a large number of variables, especially when cross-validation is involved. Combine this with the additional complexity introduced by various preprocessing strategies, and the process quickly becomes a never-ending task. Reducing the dataset to a smaller subset of variables can therefore greatly speed up model development.

Another key driver is hardware limitation. Embedded NIR sensors, which often have limited processing power, require a reduced set of features for real-time computation. Additionally, selecting meaningful spectral bands can help highlight regions associated with specific chemical bonds (e.g., O–H, C–H, N–H), thereby improving both the interpretability of models and our understanding of the underlying chemistry.

Despite these advantages, issues arise when feature selection is performed on pre-processed NIR spectra. The problem escalates when the selected features are then used to infer chemical meaning, such as assigning bond overtones, or to design multispectral systems. It is important to understand that NIR spectra result from the complex interaction of light with the sample surface, particularly in reflection mode. These spectra encode physicochemical properties as a function of absorption, transmission, reflection, and scattering. The data captured by the sensor often represent a mixture of these optical phenomena, with absorption and reflection typically dominating.

Scattering, frequently treated as noise in the context of quantifying chemical species, is commonly reduced or removed through preprocessing. Ideally, under controlled conditions and constant path length, Beer’s Law implies that light absorption is proportional to analyte concentration. However, many practitioners, especially those entering the field from other scientific domains, tend to apply various preprocessing methods (and their combinations) without a deep understanding of what these methods remove or alter. In fact, some studies have shown that preprocessing can actually degrade model performance. Still, preprocessing remains standard practice in most NIR modeling pipelines and is often built in as an automatic step in commercial chemometric software. Some software packages even offer automatic exploration of preprocessing combinations, which, although convenient, may lack scientific rationale and is often glossed over in technical discussions.

While preprocessing may or may not improve model performance when using the full spectrum, its impact on feature selection is particularly problematic. The core issue is that certain preprocessing techniques fundamentally alter the physical meaning of the NIR spectra. Common methods such as normalization (e.g., Standard Normal Variate and its robust variants), derivatives of any order, and advanced scatter correction techniques transform spectra from physically interpretable absorption/reflection values to arbitrary units. These operations often rely on relationships between neighboring bands, meaning they cannot be applied to isolated features. Furthermore, they redistribute variance, both signal and noise, across the spectrum, blurring the chemical relevance of specific wavelengths. As a result, any attempt to explain the chemistry based on features selected from pre-processed spectra becomes highly questionable.
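The dependence on neighboring bands can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the spectra are synthetic (two Gaussian bands with random multiplicative and additive scatter), the wavelength grid and the three "selected" band indices are hypothetical, and SNV is implemented in its plain textbook form. The sketch compares SNV applied to the full spectrum followed by band selection with SNV computed from the selected bands alone, which is all a multispectral sensor would measure.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row)
    by its own mean and standard deviation across all bands."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


rng = np.random.default_rng(0)

# Hypothetical wavelength grid (nm) and synthetic spectra.
wavelengths = np.linspace(900, 1700, 200)
band_a = np.exp(-0.5 * ((wavelengths - 1200) / 30) ** 2)
band_b = np.exp(-0.5 * ((wavelengths - 1450) / 40) ** 2)
conc = rng.uniform(0.2, 1.0, size=(10, 1))      # analyte level per sample
scale = rng.uniform(0.8, 1.2, size=(10, 1))     # multiplicative scatter
offset = rng.uniform(0.0, 0.3, size=(10, 1))    # additive scatter
spectra = scale * (conc * band_a + 0.5 * band_b) + offset

# Hypothetical outcome of a feature selection step (band indices).
selected = [50, 75, 137]

# Route 1: SNV on the full spectrum, then keep the selected bands.
full_then_select = snv(spectra)[:, selected]
# Route 2: only the selected bands are measured, then SNV is applied.
select_then_snv = snv(spectra[:, selected])

max_gap = np.abs(full_then_select - select_then_snv).max()
print("largest difference between the two routes:", max_gap)

# Changing a band that was NOT selected still changes the SNV values
# at the selected bands, because the row mean and std change.
perturbed = spectra.copy()
perturbed[:, 10] += 0.5
shift = np.abs(snv(perturbed)[:, selected] - snv(spectra)[:, selected]).max()
print("shift at selected bands after perturbing band 10:", shift)
```

In this sketch the two routes disagree, and the values at the selected bands move when an unrelated band is perturbed. A derivative filter would behave analogously, since each output point is computed from a window of adjacent bands.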

To mitigate this, some experienced NIR practitioners have proposed techniques such as integrating regression coefficients derived from derivative spectra, in order to recover physically meaningful interpretations. These approaches acknowledge the limitations of preprocessing and attempt to reverse-engineer the chemical context lost in the process.

Despite these challenges, not all preprocessing is inherently problematic for feature selection. Certain transformations, such as taking the logarithm of reflectance or transmittance data to compute absorbance, are acceptable, as they operate independently on each band and preserve the original variance distribution. In fact, log transformation can also help reduce some scatter effects. Likewise, simple baseline corrections can be beneficial, especially for feature selection methods sensitive to scale (e.g., covariance-based techniques). The general rule is this: any mathematical operation that treats each feature independently may be safely used and might even offer additional benefits.
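The band-independence rule can be checked in the same spirit. The sketch below is illustrative only: the reflectance values are random numbers rather than real measurements, and the selected band indices are hypothetical. It shows that converting reflectance to absorbance and selecting bands can be done in either order with the same result, and that altering an unselected band leaves the selected bands untouched.

```python
import numpy as np


def absorbance(reflectance):
    """Band-wise conversion of reflectance to absorbance, A = -log10(R).
    Each output value depends only on the same band of the input."""
    return -np.log10(reflectance)


rng = np.random.default_rng(1)

# Synthetic reflectance values in (0, 1): 10 samples, 200 bands.
reflectance = rng.uniform(0.05, 0.95, size=(10, 200))

# Hypothetical outcome of a feature selection step (band indices).
selected = [50, 75, 137]

# Transform the full spectrum and then select, or select and then transform.
full_then_select = absorbance(reflectance)[:, selected]
select_then_transform = absorbance(reflectance[:, selected])
routes_agree = np.allclose(full_then_select, select_then_transform)
print("both routes give the same values:", routes_agree)

# Perturbing a band that was NOT selected has no effect on the selected bands.
perturbed = reflectance.copy()
perturbed[:, 10] *= 0.5
unaffected = np.allclose(absorbance(perturbed)[:, selected], full_then_select)
print("selected bands unaffected by other bands:", unaffected)
```

Because the transform never looks beyond a single band, a multispectral device that measures only the selected bands could reproduce exactly the values the selection was based on.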

Now, imagine designing a multispectral system based on features selected from Standard Normal Variate pre-processed spectra. Would it work in the real world? Most likely not. This practice fundamentally misunderstands the interaction between preprocessing and spectral feature selection. If you know someone applying feature selection in this way, please share this commentary with them. Together, we can raise awareness and ensure that the science of NIR spectroscopy remains grounded, interpretable, and practically useful.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

Time spent on this commentary was supported by KB-54 Sustainable Nutrition & Health within Wageningen University & Research, which received financing from the Dutch Ministry of Agriculture, Fisheries, Food Security and Nature.
