Abstract
Over the past decades, many methods have been proposed to solve the linear or nonlinear mixing of spectra inside the hyperspectral data. Due to a relatively low spatial resolution of hyperspectral imaging, each image pixel may contain spectra from multiple materials. In turn, hyperspectral unmixing is finding these materials and their abundances. A few main approaches to performing hyperspectral unmixing have emerged, such as nonnegative matrix factorization (NMF), linear mixture modelling (LMM), and, most recently, autoencoder networks. These methods use different approaches in finding the endmember and abundance of information from hyperspectral images. However, due to the huge variation of hyperspectral data being used, it is difficult to determine which methods perform sufficiently on which datasets and if they can generalize on any input data to solve hyperspectral unmixing problems. By trying to mitigate this problem, we propose a hyperspectral unmixing algorithm testing methodology and create a standard benchmark to test already available and newly created algorithms. A few different experiments were created, and a variety of hyperspectral datasets in this benchmark were used to compare openly available algorithms and to determine the best-performing ones.
Introduction
Hyperspectral imagery is used in many different areas due to the information it can capture. It is widely used in agriculture, mineralogy, food industry and others because it enables fast and accurate analysis with a non-destructive data-gathering method. Usually, hyperspectral cameras gather many light bands simultaneously but, in turn, have a small spatial resolution. Because of this, pixels of hyperspectral images may be a mixture of light emitted by different substances or materials, for example, different minerals captured by the hyperspectral camera while filming a quarry. The gathered light data can be mixed in a linear or non-linear way. In turn, this mixed data may be less useful for conducting analysis; therefore, hyperspectral image unmixing is an important issue that requires solutions. Additionally, hyperspectral cameras may gather a substantial amount of noise, especially when used in open fields, which creates additional errors in analysis, such as reflection from mixed or contaminated objects, atmospheric influences, weather-induced noise (from clouds or rain), and electrical noises from hardware.
To solve the problem of mixed data in hyperspectral pixels, hyperspectral unmixing (HU) methods are used. HU is the process used to separate hyperspectral image pixel spectra into a set of spectral signatures, called
The paper by Bioucas-Dias
This section describes and reviews available algorithms used for Hyperspectral unmixing. This section is split into three main parts describing different algorithms used. These parts are supervised algorithms, semi-supervised algorithms, and unsupervised algorithms. This section describes the algorithms in each category and shows the hyperspectral unmixing results that the authors of these algorithms acquired using experimentation. The reviewed algorithms were also checked if the authors publicly shared the algorithm implementation code. From these openly available algorithms, a few were selected and tested using the created hyperspectral unmixing benchmark. The code created for this paper’s benchmark and algorithm testing implementation is published as an open source. The implementation details and code is provided in Section 4.
Supervised Algorithms
Supervised algorithms are machine learning methods similar to function approximation algorithms that try to find the connection between input and output data and, in turn, require a collection of input and output (or ground truth) data to be present. Some examples of supervised algorithms are the nearest neighbour (hyperspectral image classification, Guo
Koirala Real hyperspectral dataset is gathered. Ground truth abundance and endmembers are used to linearly mix spectra into an artificial hyperspectral image corresponding to the real dataset. Both data sources are input to a machine learning algorithm to learn the mappings between data. After training, the model is created and saved. Trained model is then tested with a part of the real hyperspectral dataset. A linearly mixed spectra are generated due to the unmixing resulting in the abundance map of the hyperspectral testing dataset.
The authors used a neural network and two regression algorithms to learn the mapping between the generated linear and nonlinear training spectra. The algorithms were tested using 10,000 mixed spectra using
Semi-supervised algorithms are a combination of supervised and unsupervised learning. Because creating a high-quality labelled dataset is a time-consuming and difficult task, semi-supervised machine learning models may be used to help speed up this process. These methods use as input a dataset of labelled data. By training the machine learning method on this dataset, the created model can extrapolate data labels on a new collection of unlabelled data. A review of automatically labelled data may already be faster than labelling the data by hand, and with an expanding dataset, these models become more accurate at labelling new data.
Sparse Regression
Comparison of the results of sparse regression algorithms.
Comparison of the results of sparse regression algorithms.
A regression problem is learning a function or model capable of estimating the dependent variables from given observations or features. Sparsity refers to the input and output data being incomplete and not fully populated. In machine learning, sparsity indicates data that includes many zeros or other non-significant values. In turn, sparse regression is a subcategory of regression machine learning algorithms designed to handle non-densely packed data. The same regression algorithms can be used for sparse regression (linear, lasso, ridge, and others), but an additional step is often required to determine the subset of predictors. The problem of regression is learning a model that allows estimating a certain quantity of the dependent variable from several observed variables, known as independent variables. Table 1 shows an overview of algorithm results. More detailed explanations of each algorithm featured in Table 1 are provided below in paragraphs of this subsection.
The
A few key takeaways and conclusions were made from the review of semi-supervised hyperspectral unmixing algorithms and the results provided by the authors of these papers: Most commonly used metric was SRE, with some papers using SNR for synthetically generated datasets. The Jasper ridge dataset was the most commonly used real-world dataset in these papers, with the MCSU algorithm having the lower SRE metric for this dataset. Most of the synthetically created datasets that were used to test these algorithms had an added additional artificial noise. Most commonly, The SUnSAl algorithm is the most influential of the algorithms reviewed due to citation amount and other algorithms created from it. Using the Jasper ridge dataset, as it should be identical between different papers and the SRE metric, the highest value of
These algorithms do not require any labelled data of previously known ground truths to train the models. The main subcategories of unsupervised algorithms reviewed in this paper are linear mixture models (LMM) and non-negative matrix factorization (NMF) methods.
Comparison of the results of Nonnegative matrix factorization algorithms.
Comparison of the results of Nonnegative matrix factorization algorithms.
Nonnegative matrix factorization (NMF) is an algorithm group that, as the name states, factorizes a matrix into two separate matrices with an additional assumption that all matrices have no negative elements. Because hyperspectral data cannot have negative values and, in turn, the endmember and abundance matrices are also not negative, these algorithms are widely used in hyperspectral unmixing. Table 2 summarizes the datasets and metrics used in testing the algorithms reviewed below and the results achieved by the authors of corresponding papers.
The spectra at each hyperspectral image pixel are assumed to be a linear mixture of several endmembers. Therefore, the image
Estimation of the number of endmembers in the dataset being analysed. Identification of the endmember signatures. Estimation of the abundances in each pixel.
SAD values of 3.68 and 0.66 for a synthetic dataset with 20 and 80 dB SNR, respectively, are given.
From the conducted review of algorithms using nonnegative matrix factorization for hyperspectral unmixing, a few conclusions were gathered: Most commonly used metric was SAD, and compared to semi-supervised algorithms, SRE matric was not used. Cuprite and Jasper ridge datasets were the most common real-world datasets used in these reviewed papers. The most cited algorithm of the reviewed is CNMF, while the most popular now is the LIDAR-NTF due to the number of citations per year since it was published in 2021. Using the SAD metrics provided by the authors, the best algorithm from this review is SGSNMF for the Cuprite dataset (0.0913). The differences between SGSNMF and other algorithms that used the Cuprite dataset are very small, and visually inspecting the provided hyperspectral data cube reconstructions, the differences are imperceptible.
Autoencoders are a type of unsupervised learning-based neural network architecture. An artificial neuron bottleneck is created to create an autoencoder network that forces the input data to be compressed into a small number of features, extracting additional nonlinear information from the data. A few different types of encoder networks exist and are used for different purposes: Denoising autoencoder – a network that adds noise to input data, and from the corrupted input, it is trained to reconstruct the original data. This allows the removal of noise from the data in the future. Deep autoencoder – consists of at least 4 encoder and decoder layers that are Restricted Boltzmann Machines. Convolutional autoencoder – use the convolution to consider that a signal can be seen as a sum of other signals. In turn, they try to encode the input in a set of simple signals and reconstruct it in the decoder part.
The most commonly used autoencoder network for hyperspectral unmixing is the variational autoencoder. The model consists of the encoder part of the network that compresses the data and the decoder that reconstructs the original data using the compressed features as input. This allows the networks to be trained by minimizing the data reconstruction error, which measures the difference between input and reconstructed data. After the model is trained, the compressed data is extracted and can be input into other algorithms. This method extracts hidden or latent features in the training data. Variational autoencoder neural network architecture consists of two parts: the first part of the network is the encoder, and the second is the decoder, with a bottleneck layer in between. The variational part of the encoder provides distributions of values in the latent space of the network instead of a single value like in a regular autoencoder. The diagram also shows that the middle layers are smaller than the input and output layers, and the lines between nodes depict the neuron connection and weights. An overview of algorithm results is shown in Table 3, summarising results, metrics, and datasets from each corresponding paper.
Comparison of the results of autoencoder network algorithms.
A few conclusions were derived from the review of algorithms using autoencoder networks to solve the hyperspectral unmixing problems: Most common metric used in these reviewed papers was the RMSE. But a few variations of RMSE were used to analyse the differences between the reconstructed hyperspectral data, RMSE average over different material (classes), and separate RMSE for abundance matrix analysis. For autoencoder network algorithms, the most common real-world dataset was the Urban dataset. By using the provided RMSE metric of hyperspectral data reconstruction error, the algorithm with the lowest value ( Compared to the algorithms in semi-supervised and non-negative matrix factorization categories, the autoencoder network algorithms are newer, with the oldest published in 2020.
Linear mixture models (LMM) are regression model that simultaneously considers the variation of the dependent and the independent variables. The variations of both types of variables are often called fixed and random effects, and because the model uses both of these effects, it is called a mixed model. The linear mixture model is represented in equation (3). In the equation,
Comparison of the linear mixture model and supervised algorithms results.
This section establishes and discusses the methodology used in creating the experiments to develop a hyperspectral unmixing algorithm performance benchmark. The proposed benchmark methodology could be used as a standardized way to simultaneously test hyperspectral unmixing algorithms in a few different ways. Different experiments test different aspects of the unmixing algorithms.
Datasets
This section describes the datasets used for the algorithm testing experiments. Three different datasets were used to test the various performance metrics of hyperspectral unmixing algorithms. These datasets were chosen due to their popularity, usability, and availability: A synthetic hyperspectral data cube was created artificially by mixing different amounts of pure spectra from the USGS spectral library. To create the synthetic datasets, version 7 of the USGS spectral library (splib07a) (Kokaly A hyperspectral dataset created by the article’s authors (Zhao
2018 IEEE GRSS data fusion hyperspectral data RGB reconstruction. IEEE GRSS 2018 data fusion contest hyperspectral dataset (Prasad

To correctly test the performance and the ability of these algorithms in various aspects, a few metrics and their variants were chosen. Multiple different metrics are used in the hyperspectral unmixing problems. The most common are root mean squared error, signal reconstruction error, spectral angle mapping, and spectral angle distance. Root mean squared error and signal reconstruction error metrics were selected due to their popularity in hyperspectral unmixing algorithm performance evaluation and their overall simplicity in describing the differences between evaluated and real spectra in this benchmark:
Root mean squared error (RMSE) shows the difference between the predicted spectra and the ground truth value. Different authors used a few variations of RMSE to test a different aspect of the created algorithms; these include average RMSE between all endmembers, reconstruction RMSE and abundance RMSE.
Signal reconstruction error (SRE) is used to determine the quality of the spectral mixture reconstruction generated by the algorithms. A higher SRE value means a better reconstruction quality.
Metrics are calculated using these formulas:
To test the different aspects of the algorithm, the main experiment part is divided into four main sections:
Hyperparameter testing. This experiment tests the tested algorithms’ results when changing the available hyperparameter. Standard and controlled datasets are created to ensure that the results are only affected by the change in algorithm hyperparameter. This test allows checking the algorithm dependencies on the hyperparameters and, in turn, checking the universality of the algorithm. The laboratory-created dataset (Zhao
Endmember robustness. This tests the algorithm’s ability to be generalized and its overall performance when the input number of endmembers is changed. This type of test allows checking the algorithm’s ability to find endmembers and reconstruct hyperspectral images depending on the scene’s difficulty. Due to the changing number of endmembers, a synthetic dataset created using a combination of IEEE GRSS (Prasad
Robustness to noise. This experiment determines the algorithm’s ability to accurately unmix the hyperspectral image spectra when a different level of artificial noise is added to said image. This experiment tests algorithms with different amounts of white noise and a noise profile created from a real-world scenario. The dataset created to test endmember robustness was used as a base hyperspectral dataset, and a layer of artificial noise was added to it.
Impact of differences in input image sizes. By setting different sizes of hyperspectral images, the amount of spatial and spectral information changes, affecting the overall performance of algorithms. This also allows us to determine the optimal image size for the most accurate unmixing result and performance combination. It also shows the data required for algorithms to achieve their best accuracy. The same endmember robustness dataset was used and then downscaled using the methodology described below to create the different spatial size hyperspectral images.
The experiments described above were performed using the different datasets described in Section 3.1.
Endmember robustness testing is done by creating a group of artificially generated datasets according to a set of rules: Datasets The number of endmembers For each For each
Artificial hyperspectral image RGB representation. An artificial hyperspectral image

A collection of artificially generated hyperspectral images is created to test the algorithm’s robustness to noise. Then a different amount of noise is added to the images according to these set rules:
A collection of 4 different datasets are created with different endmembers using the same methodology as in the endmember robustness experiment in Section 3.4.
For each of the 4 datasets, a different amount of artificial noise is added.
The created noise is measured in SNR dB, in which a lower number means a higher amount of white noise.
A random noise with a mean value of 0 is generated with the desired SNR dB values of 20, 25, 30, 35, 40, 45, and 50.
Real and artificially created noise profile Pearson correlation comparison.
This noise is then applied to each of the 4 datasets to create noisy images.
In addition to the random white noise generated, a set of noise parameters was extracted from a hyperspectral imaging camera used for research in an uncontrolled field environment. The camera was a BaySpec OCI-F Hyperspectral Imager in VIS-NIR range (BaySpec, 2021). A Pearson correlation coefficient was calculated to measure the amount of noise the camera-generated at each wavelength. Each neighbouring wavelength was taken from a hyperspectral image, and the correlation between the values of these wavelengths across the whole image was calculated. Figure 3 (orange line) shows the correlation coefficient at wavelength index
A set of artificial noise parameters was found using a gradient descent minimization algorithm. When applied to our synthetically generated hyperspectral image, the band-to-band Pearson correlation coefficients were close to resembling a real-world camera noise specification. In this instance, a multivariate optimization algorithm was used to calculate the number of wavelengths – 1 amount of different variables. Specifically, an evolutionary algorithm from python library
Algorithm performance testing according to different image sizes was conducted using these steps: A synthetically generated hyperspectral image dataset with different numbers of endmembers was generated. Created using the exact methodology of the endmember robustness experiment described in Section 3.4. These datasets are then downscaled using mean values in an area of RMSE and SRE metrics are then calculated on these 3 collections of datasets to compare the results.
In this section, all of the algorithms used in the experiments are described, and the final benchmark results are given. These algorithms were selected based on a few main factors: algorithm code was made public by the authors and opened to use, and the algorithm solved at least one of the hyperspectral unmixing tasks.
Tested Algorithms
The algorithm code was gathered from the author’s GitHub or personal pages. All the code used in the experiments and links to the author’s pages is provided in the Github repository ( SUnSAL – solves an l1–l2 norm optimization problem with several constraints: positivity, which checks if all resulting values are greater or equal to 0, and Add-to-one, which calculates if the sum of the results (abundances) is equal to 1. The algorithm tries to minimize the l1 and l2 regularization norms. In other words, l1 and l2 norm optimization is simultaneously a sparse regression calculation on both linear and squared values. SUnSAL-TV – is an extension of the SUnSAL algorithm that adds an isotropic or non-isotropic total variation spatial regularization. S2WSU – an algorithm that uses spectral and spatial data at the same time to calculate a sparse unmixing matrix. CNMF – an algorithm that fuses high spatial resolution multispectral data and high spectral resolution hyperspectral data to calculate image endmembers and unmix these spectra. R-CoNMF – algorithm performs 3 important steps to find the endmembers, gather their signatures, and calculate the unmixing matrix. SGSNMF – considers the spatial data and pixel locations and runs under the assumption that unmixing matrices are sparse. RSNMF – a total variation regularized blind unmixing algorithm that considers pixel location and their correlation to nearby pixels. ALMM – a linear model that uses an endmember dictionary to help calculate the spectral variability.
This subsection describes the results collected by running the created experiments on available algorithms. The code used in creating and running these benchmarks can be accessed at
10 different datasets were created using the same 21 endmembers to add statistical differences to calculations.
Endmember robustness experiment diagram.
Endmember robustness experiment result with box plots for each endmember group and algorithm. (Colours: purple – SUnSAL, dark blue – SUnSAL-TV, blue – SGSNMF, light blue – S2WSU, cyan – RSNMF, yellow – R-CoNMF, orange – CNMF, red – ALMM.) A combined synthetic IEEE GRSS and USGS spectral library dataset was used as test data.
Endmembers were randomly selected into groups to create different amounts of endmembers, from 2 to 21, for each of the classes in the pattern.
For each group of endmembers, uniformly distributed abundances were created.
Other 9 variations of abundances were randomly selected and mixed into 10 different hyperspectral images.
Figure 5 and Table 5 show the results gathered from algorithms and calculating the RMSE metric between the predicted values and ground truth abundances that were generated. In Fig. 5, almost all algorithms excluding RSNMF show a consistent RMSE with different amounts of endmembers in the image, with SGSNMF having the biggest errors and, in turn, the worst performance while SUnSAL having the lowest error of all algorithms on average. The SGSNMF and RSNMF algorithms have the biggest value distributions out of these algorithms. The smaller distributions show more consistent results of these algorithms, while RSNMF is inconsistent at low amounts of endmembers. Table 5 displays the same information given in the previously mentioned Fig. 5, but in a numerical form and with the values averaged instead of their distributions.
Endmember robustness experiment results with average RMSE values for each endmember group and algorithm. (Columns list algorithms tested, and rows are several endmembers.)

Algorithm robustness to noise experiment results. A combined synthetic dataset of IEEE GRSS and USGS spectral library with added noise was used as test data.

Algorithm performance with 9 times down scaled hyperspectral images. A combined synthetic dataset of IEEE GRSS and USGS spectral library scaled down 9 times was used.

Algorithm performance with 4 times down scaled hyperspectral images. A combined synthetic dataset of IEEE GRSS and USGS spectral library scaled down 4 times was used.
To better compare these results, a Table 6 was created showing the averages of RMSE and SAD metrics for each algorithm and each set of scaled images. These results are shown in Table 6. Negative SRE values represent a worse reconstruction than positive values because the higher the SRE, the better the signal reconstruction. This table determines the effects of image scaling on the results of tested algorithms. All tested algorithms got consistent results between the different image scales. The SGSNMF algorithm got the worst results, while SUnSAL got the lowest RMSE results. The SRE values were consistent across the image scales, with S2WSU having a big difference in metrics values. Overall RSNMF algorithm got the lowest RMSE results of all the values.
During the benchmark experiment calculations, a log of the time spent on calculations of each algorithm was recorded to compare the time differences between them. This is not a standardized test, so the time comparison is only relative and will depend on the hardware. To compare the running times of the different algorithms with each dataset, all experiments were performed using a desktop computer with 12 core 24-thread AMD CPU and 64 GB of RAM and an Nvidia GTX 1080Ti with 11 Gb of VRAM. The average recorded times were gathered and are shown in Table 7.
In this paper, we analyse different available hyperspectral unmixing algorithms, propose a methodology, and create a benchmark to more accurately test these algorithms against each other. The code for the benchmark is available on GitHub. A hyperparameter testing experiment was conducted to determine the optimal hyperparameter of each tested algorithm. The main conclusion from this experiment was that hyperparameters are highly dependent on the datasets used and are not universal. An endmember robustness experiment was created to test the algorithm’s ability to accurately detect the abundances in hyperspectral images with different numbers of endmembers. Robustness to noise experiment shows the algorithm’s ability to get accurate results despite the artificially generated noise added to the same dataset. Image size difference experiment tests the algorithm’s ability to unmix hyperspectral images depending on the size of the image given and, in turn, the amount of spatial and spectral data available. One of the main takeaways from the conducted research is a perceived lack of standard algorithm testing methodology. Many reviewed papers use different metrics, testing methodologies and hyperspectral datasets to test their created algorithms. This makes it difficult to determine the best-performing algorithms. In this paper, we proposed a hyperspectral unmixing algorithms benchmark to help homogenize this type of algorithm testing. From the conducted hyperspectral unmixing algorithm benchmark experiments, we can conclude:
The SUnSAL algorithm got the lowest RMSE results (0.008) across all of the experiments except on the dataset with a noise profile that resembles a real-world scenario (4.824) which indicates that the algorithm may not be suitable for real-world use especially if the gathered data tends to have noise.
Image size difference algorithm comparison results.
Algorithm average calculation times in seconds.
In a real-world noise scenario, CNMF algorithm got the lowest RMSE result (0.0961). The resulting RMSE value was close to half of the next best value, but the values are small (at around 0.1 RMSE), so a perceived difference between these results may be minimal.
Using the SRE metric shows that the S2WSU (4.603) and SUnSAL (20.095) algorithms achieved the most accurate image size comparison experiment results. The difference between the most accurate algorithms is almost ten times, and in turn, differences between the best and the worst algorithms are more than a few orders of magnitude. But algorithms amongst themselves in the three different image sizes remain in the same SRE magnitude, showing little to no degradation of results when images are downscaled.
Image size comparison experiment showed that the differences in results between each image size were unnoticeable; from that, it is concluded that all of the algorithms are robust to changes in image size if their quality stays the same.
SUnSAL and R-CoNMF got the fastest calculation times, 228 and 257 seconds, of all algorithms. It has to be taken into account that this comparison between running times is only relative between the algorithms as the test was not normalized for other factors such as hardware and software resources.
