Image fusion based on shift invariant shearlet transform and stacked sparse autoencoder

Abstract

Stacked sparse autoencoder is an efficient unsupervised feature extraction method, which has excellent ability in representation of complex data. Besides, shift invariant shearlet transform is a state-of-the-art multiscale decomposition tool, which is superior to traditional tools in many aspects. Motivated by the advantages mentioned above, a novel stacked sparse autoencoder and shift invariant shearlet transform-based image fusion method is proposed. First, the source images are decomposed into low- and high-frequency subbands by shift invariant shearlet transform; second, a two-layer stacked sparse autoencoder is adopted as a feature extraction method to get deep and sparse representation of high-frequency subbands; third, a stacked sparse autoencoder feature-based choose-max fusion rule is proposed to fuse the high-frequency subband coefficients; then, a weighted average fusion rule is adopted to merge the low-frequency subband coefficients; finally, the fused image is obtained by inverse shift invariant shearlet transform. Experimental results show the proposed method is superior to the conventional methods both in terms of subjective and objective evaluations.

Keywords

Image fusion stacked sparse autoencoder shift invariant shearlet transform feature extraction

Introduction

In recent years, thanks to the rapid development of multisensor technology, image fusion has been widely applied into military reconnaissance, medical diagnosis, and remote sensing. Image fusion is a kind of technology which can merge relevant information from multiimages with the same scene into a single image. The integrated image is more informative and more suitable for human visual perception.

Basically, image fusion methods can be divided into two groups—spatial domain-based fusion method and transform domain-based fusion method. The spatial domain-based fusion method such as principal component analysis (PCA) and IHS-based fusion methods usually perform quickly but suffer from spatial distortion (Intensity, Hue, Saturation). The transform domain-based fusion methods convert images into frequency domain first. Then, fusion rules are conducted on transform subbands. This way one can solve the problem of spatial distortion easily. One of the key factors for transform domain-based fusion method is the selection of multiscale decomposition (MSD) tool. MSD includes Gaussian pyramid (GP) decomposition, discrete wavelet transform (DWT), stationary wavelet transform (SWT), contourlet transform (CT), nonsubsampled contourlet transform (NSCT), shearlet transform (ST), and shift invariant shearlet transform (SIST). DWT was proposed by Mallat,¹ which can capture both frequency and location information of signal. However, DWT suffers from limited directions in high-frequency subbands and lacks shift invariance for the down sampling step. Rockinger² proposed SWT in 1997. SWT can solve the shift invariant issue of DWT. However, SWT can only provide limited directional information in high-frequency subbands as DWT. To overcome the limited decomposition direction, CT was proposed by Do and Vetterli.³ CT has more directions but it still lacks of shift invariance. To overcome this drawback, Da Cunha et al.⁴ proposed NSCT which is shift invariant. Subsequently, Guo and Labate⁵ proposed ST. Compared with CT and NSCT, the computational efficiency of ST is high and without the limitation on the number of directions. SIST is proposed by Easley et al.⁶ Owing to the usage of nonsubsampled pyramid in decomposition progress, SIST is shift invariant. Based on the above analyses, the SIST is adopted as the MSD tool in this work.

The other important factor of transform domain-based image fusion method is the design of fusion rule. The task of fusion rule is to find an appropriate way to get a fused image which is more suitable for human visual perception and the subsequent process. The fusion rule consists of fusion strategy and the activity-level measure. Commonly used fusion strategies include the choose-max strategy and the weighted average strategy. The choose-max strategy treats coefficients with maximum activity-level measure as fused coefficients. The weighted average strategy gets the fused coefficients by building a weight function according to the activity-level measure of coefficient. Fusion result obtained by the choose-max fusion rule usually shows high contrast while fusion result obtained by the weighted average fusion rule usually inherits more information from original images. As we all know, the low-frequency subband of SIST is the approximation of source image while the high-frequency subbands of SIST contain the detail information of source image. The fused low subband should contain the energy information of source images as much as possible. Therefore, the average fusion strategy is selected for low-frequency subband coefficients. The fused high-frequency subbands should preserve the detail information as much as possible. Therefore, the choose-max fusion strategy is used to merge the high-frequency coefficients in this work.

The quality of fused image highly depends on the activity-level measure. The activity-level measure of coefficient should reflect the inner structure of image fully. At present, many activity-level measures are presented. For example, PCA is a commonly used feature. Hu et al.⁷ calculated the principal component of each image by using the eigenvalues and eigenvectors of correlation coefficients first. Second, the grayscale of high resolution image is stretched to get the same mean and variance as the principal component. Then, the principal component is replaced by high resolution coefficients and merges them by a fusion rule. In the end, use PCA inverse transform to get the fused image. Considering the fact that tensor has the capability to describe the high-dimensional data, Liang et al.⁸ decomposed source images by higher order singular value decomposition tool. Then, use 1-norm of coefficients as the activity-level measure to get better fused image. Yang and Li⁹ divided source image into overlapped image blocks first and then calculated the sparse coefficients of each image block. Sparse representation (SR) can describe complex data by a linear combination of an over complete dictionary. Finally, a fusion rule is proposed based on the sparse coefficients. However, the above-mentioned activity measures are either hand-crafted features or low level features and they cannot fully represent the inner structure of complex data. Over the past few years, deep learning method has achieved great success in the computer vision field because deep learning method is expert in interpreting complex data. However, traditional deep learning method is rarely applied to image fusion because the labeled data lack in this area. Fortunately, stacked autoencoder is a kind of unsupervised deep learning method, which is suitable for image fusion tasks. The sparse characters by adding sparse constrains in the hidden layers of SSAE can be obtained as the activity-level measure, which is an accurate representation of source image. Based on the above analysis, a novel SIST and stacked sparse autoencoder (SSAE)-based image fusion method is proposed in this work. First, decompose the source images into low- and high-frequency subband coefficients; then, design fusion rules for merging them, respectively. The average fusion rule is adopted to the fusion of low-frequency subband coefficients and the SSAE-based choose-max fusion rule is used to fuse the high-frequency subband coefficients; finally, using the SIST inverse transform to get fused image.

The rest of the paper is organized as follows. The next section presents the framework and procedure of the proposed method; “Related work” section briefly introduces the principle of SIST and SSAE; “Fusion rule for SIST subbands” section describes the fusion rule in detail; Experimental results and analyses are shown in “Experimental results and analyses” section; at last, conclusions are drawn in the final section.

The framework and procedure of the proposed method

The whole framework of the proposed method is given in Figure 1. The procedure of the proposed method is described as follows.

Figure 1.

The overall fusion architecture of the proposed method. SIST: shift invariant shearlet transform; SSAE: stacked sparse autoencoder.

Decompose the source images A and B into low-frequency subbands ${C_{L}^{A}, C_{L}^{B}}$ and high-frequency subbands ${C_{H}^{A}, C_{H}^{B}}$ by SIST;

Low-frequency subbands represent the approximation of the source images, therefore $C_{L}^{A}$ and $C_{L}^{B}$ are fused by using the average fusion strategy;

Considering the fact that high-frequency subbands contain very important detail information, $C_{H}^{A}$ and $C_{H}^{B}$ are fused by the SSAE-based choose-max fusion strategy;

The fused low-frequency subband coefficients and high-frequency subband coefficients are merged to fused image by the inverse SIST transform.

Related work

Brief introduction of the SIST

The SIST can be completed by two steps: multiscale partition and directional localization. In the first step, the nonsubsampled pyramid filter is adopted to perform multiscale partition. In the second step, the shift invariant shearing filter is used to decompose the frequency plane into one low-frequency subband and multiple bandpass directional subbands.¹⁰ The details about SIST can be found in Easley et al.⁶

Principle of SSAE

As the outstanding ability of representing image, deep learning methods are becoming more and more popular in machine learning area. The essence of deep learning is to compute hierarchical features or represent of the observational data, where the higher level features or factors are defined from lower level ones.¹¹ As a branch of deep learning, SSAE is a kind of unsupervised feature learning method which is suitable for complex and unlabeled datasets such as image and voice data.¹² Considering the above facts, SSAE is introduced into the research field of image fusion in this work.

An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation. The autoencoder is trained to minimize the discrepancy between the data and its reconstruction. So, the autoencoder is an artificial neural network which can be trained in a fully unsupervised way. The autoencoder consists of two parts: encoder and decoder. Encoder network transforms the high-dimensional input data into low-dimensional code. Decoder network can recover the data from the code.

As is shown in Figure 2, a SSAE is a neural network consisting of multiple layers of sparse autoencoders in which the output of each layer is employed as the input of the subsequent layer. The greedy layer-wise approach for pretraining a deep network works by training each layer in turn.

Figure 2.

The structure of SSAE.

The encoding step for an n-layer SSAE network is defined by forward order which is described by equations (1) and (2)

z^{l, 1} = W^{l, 1} h^{l - 1} + b^{l, 1}

(1)

h^{l} = f (z^{l, 1})

(2)

where

W^{l, 1}

is the weight matrix for the lth encoder,

b^{l, 1}

is the bias metrix for the lth encoder,

z^{l, 1}

is the weighted sum (including the bias term) between input neurons and hidden neurons in layer

l

h^{l}

is the activation of hidden neurons in the lth layer, specifically,

h^{l} = x

when

l = 1

f (\cdot)

is the activation function such as sigmoid function, tanh function, and ReLu function.

The decoding step for the n-layer SSAE network is achieved by inverse order which is described by equations (3) and (4)

{\hat{h}}^{l} = f^{- 1} (z^{l, 2})

(3)

z^{l, 2} = W^{l, 2} {\hat{h}}^{l + l} + b^{l, 2}

(4)

where

n

denotes the dimension of input vector;

n_{i}

is the number of hidden neuron units for the l-th layer autoencoder,

z^{l, 2}

is the weighted sum of the l-th decoder;

W^{l, 2}

denotes the weight of the l-th decoder, which is equal to the transpose of

W^{l, 1}

b^{l, 2}

is the bias of the decoder, which is equal to

b^{l, 1}

\hat{h}

is the activation of output neuron units,

f^{- 1} (\cdot)

is the inverse of activation function.

Fusion rule for SIST subbands

Fusion rule for low-frequency subband coefficients

Low-frequency subbands contain most of energy of source image, which is approximate images. In this paper, we choose the average fusion rule to fuse the low-frequency subband coefficients which are described in equation (5)

C_{L}^{E} (x, y) = \frac{1}{2} \times C_{L}^{A} (x, y) + \frac{1}{2} \times C_{L}^{B} (x, y)

(5)

where

C_{L}^{A} (x, y)

C_{L}^{B} (x, y)

, and

C_{L}^{F} (x, y)

represent low-frequency subband coefficients of source image A, image B, and fused image F located at

(x, y)

Fusion rule for high-frequency subband coefficients

The design of fusion rule of high-frequency subband is important for image fusion algorithm because it contains a lot more detail. The activity-level measure plays an important role in the design of fusion rule. The traditional method is extracting hand-crafted features from the source image as the activity-level measure. Some successful features used on fusion rule include spatial frequency, variance, gradient features, etc. These hand-crafted features designed by human often suffer from inaccurate image representation. The lost information may cause an inferior fusion result. Currently, most researchers pay attention to deep neural networks. Deep neural network takes the source image as input and extract features automatically by machine instead of human. This method has demonstrated great success in image representation. Compared to hand-crafted features, the feature obtained by deep neural network is a nonlinear feature and a higher level representation as the depth of the model increases. Therefore, it is a better choice to take high-level feature as activity-level measure. In addition, considering the difficulty that it is hard to get label data in image fusion field, we adopt SSAE to extract high-level feature as activity-level measure. After the layer-wise training of each building block and the building of a deep architecture, the output of SSAE is used for the high-frequency representation in the subsequent fusion processing. Inspired by the above analyses, a novel fusion rule based on SSAE for high-frequency subband coefficients is proposed. The fusion scheme is given in Figure 3 and the fusion procedure is described as follows.

Figure 3.

The fusion scheme of high-frequency subbands (here $n = 2$ , $n_{1} = 25$ , $n_{2} = 16$ , $n_{3} = 9$ ).

The high-frequency subband $C_{H}^{A}$ and $C_{H}^{B}$ are divided into $n u m$ overlapped image blocks by using sliding window technology;

All image blocks are reshaped into vectors $V_{H}^{s} (k)$ , where $s \in (A, B)$ which means from image A and B, and $k$ means the $k th$ block $(1 \leq k \leq n u m)$ ;

The features $F_{H}^{s} (k)$ are extracted from $V_{H}^{s} (k)$ by training a two-layer SSAE;

To make the feature contrast more apparent, the space frequency of the features $F_{H}^{s} (k)$ ( $S F F_{H}^{s} (k)$ ) is adopted in the fusion scheme. The $S F F_{H}^{s} (k)$ is defined by equation (6), $M$ is the row number of $F_{H}^{s} (k)$ , $N$ is the column number of $F_{H}^{s} (k)$

S F F_{H}^{s} (k) = \sqrt{\frac{1}{M \times N} (\sum_{i = 1}^{M} \sum_{j = 2}^{N} {[F_{H}^{s} (k (i, j)) - F_{H}^{s} (k) (i, j - 1)]}^{2} + \sum_{i = 2}^{M} \sum_{j = 1}^{N} {[F_{H}^{s} (k) (i, j) - F_{H}^{s} (k) (i - 1, j)]}^{2})}

(6)

For each pair of $V_{H}^{A} (k)$ and $V_{H}^{B} (k)$ , choose the one with maximum space frequency as the fused vector $V_{K}^{F} (k)$ which is given in equation (7)

V_{H}^{F} (k) = {\begin{matrix} V_{H}^{A} (k), if (S F F_{H}^{A} (k) \geq S F F_{H}^{B} (k)) \\ V_{H}^{B} (k), if (S F F_{H}^{A} (k) < S F F_{H}^{B} (k)) \end{matrix}

(7)

The fused high-frequency subbands $C_{H}^{F}$ are obtained from the fused vector $V_{H}^{F}$ by using inverse sliding windows technology, where the averaging method is taken for handing the overlapped pixels.

Next, we will give the detailed description about the above-mentioned step (3). With regard to a two-layer SSAE model, the first layer SSAE is trained first. The $V_{H}^{s} (k)$ are set as input and output, the original weight $W^{l, 1}$ and bias $b^{l, 1}$ is set as a random number close to zero, then $W^{l, 1}$ and $b^{l, 1}$ are adjusted by minimizing the reconstruction error $J (W^{l, 1}, b^{l, 1})$ . The $J (W, b)$ is calculated by equation (8)

J (W, b) = \frac{1}{N} \sum_{i = 1}^{N} (\frac{1}{2} ‖ z_{1} - x_{1} ‖^{2}) + β \sum {j = 1}_{D}^{'} K L (ρ ‖ \hat{ρ})

(8)

where

W^{l, 1}

and

b^{l, 1}

are weight and bias,

Δ W^{l, 1}

and

Δ b^{l, 1}

are increment of

W^{l, 1}

and

b^{l, 1}

J (W^{l, 1}, b^{l, 1})

is the reconstruction error, where

K L (ρ ‖ \hat{ρ}) = ρ \log \frac{ρ}{\hat{ρ}} + (1 - ρ) \log \frac{1 - ρ}{1 - \hat{ρ}}

is the Kullback–Leibler distance,

\hat{ρ}

is the average activation of hidden unit,

ρ

is the sparsity parameter whose value is close to zero, so average activation of hidden unit will be close to zero,

β

is the weight of the sparse penalty. This is an optimization problem, the training stage is used to update the parameters

W^{l, 1}

and

b^{l, 1}

(

l = 1, 2

), the gradient descent method is adopted in this work.

After the $W^{2, 1}$ and $b^{2, 1}$ are optimized, the activation of second-layer hidden neuron units $h^{2}$ is obtained by using (1) and (2). The $h^{2} (k)$ is defined as $F_{H}^{s} (k)$ where $s \in {A, B}$ .

In the step (5) of high-frequency subband fusion procedure, $S F F_{H}^{s} (k)$ is used as the activity-level measure to fuse the high-frequency coefficient. To verify the effectiveness of SFF compared to the traditional activity-level measure SF, a comparison experiment is given in Figure 4. Figure 4(a) and (b) is a pair of “Clock” multifocus source images. A big clock located at the right of source images is marked with a red rectangle. The red rectangle region of Figure 4(a) (A) is in focus but the corresponding region of Figure 4(b) (B) is out of focus. The red rectangle region includes 10,000 pixels. The SFF and SF measures are used to evaluate the sharpness of corresponding coefficients in the red rectangle regions, respectively. In Figure 4(c), the blue bar shows the number of pixels that the activity-level measure of $A (x, y)$ is great than that of $B (x, y)$ . On the contrary, the red bar shows the number of pixels that the activity-level measure of $A (x, y)$ is less than or equal to that of $B (x, y)$ . It can be counted from Figure 4(c) that there are 9032 pixels with $S F F_{A} (x, y) > S F F_{B} (x, y)$ and 8703 pixels with $S F_{A} (x, y) > S F_{B} (x, y)$ . Thus, the superiority of SFF is verified by this experiment.

Figure 4.

Activity-level measure for “Clock”: (a) Right-focused image with the clear region marked in a red rectangle, (b) left-focused image with the blur region marked in a red rectangle, and (c) performance comparison experiment of activity-level measure for SFF and SF.

Experimental results and analyses

To verify the effectiveness of the proposed fusion method, it is conducted on multifocus images, medical images, and infrared-visible images. Moreover, it is compared with the conventional gradient pyramid (GP)-based method,¹³ DWT-based method,¹⁴ state-of-the-art st-DWT (structure tensor and DWT)-based method,¹⁵ pulse coupled neural network (PCNN)-based method,¹⁶ N-PCNN(NSCT-PCNN)-based method,¹⁷ SR-based method,¹⁸ and directive contrast (DC)-based method¹⁹ from the visual perception and objective index perspectives. To carry out the quantitative comparison of fusion results, the EN (entropy), average gradient (AG), Qabf,²⁰ edge intensity (EI), mutual information (MI), and standard deviation (SD) metrics are computed in this paper.

In our experiments, the number of hidden neuron for two SSAE is set as 16 and 9, the sparsity parameter $ρ$ is set as 0.01, the weight of the sparse penalty $β$ is set as 4, and the target value for $J (W, b)$ is set as $10^{- 6}$ . The pyramid filter of SIST is set as “maxflat” and the source images are decomposed into three levels with the number of directions 3, 4, 5. The size of image block is 5 × 5 and the sliding step is defined as 2. Parameters of contrast experiments are set according to the reference papers. The computer environment of experiments is MATLAB-2016b under CPU-3.1 GHz windows 10 system.

Multifocus image fusion

In this section, the fusion methods are conducted on “Clock” multifocus images. Figure 5(a) is the left focus image whose left clock is clear while right clock is blurred, Figure 5(b) is the right focus image whose left clock is blurred and right clock is clear. The fused images are shown in Figure 5(c) to (j). Figure 5(d) and (f) obtained by DWT-based method and PCNN-based method is blurred. Though the right clock of fused images obtained by GP-based method and stDWT-based method is clear, the scale on the left clock can barely be seen. As for the N-PCNN-based method, the SR-based method, the DC-based method, and the proposed method, the fusion results show that they have the capability of multifocus image fusion. However, it is hard to tell which is better in visual perception. To further assess the qualities of the fused images, the objective comparisons are listed in Table 1. As can be seen from Table 1, the objective metrics of fused images obtained by N-PCNN-based method, SR-based method, DC-based method, and proposed method are superior to those obtained by other methods. This means the objective evaluation is consistent with the subjective perception. Except for the DC-based method, the proposed method gets the best score in four out of six objective indexes. It shows that the proposed method is efficient.

Figure 5.

(a) is the left focus image; (b) is the right focus image; and (c) to (j) are the fused results of GP, DWT, stDWT, PCNN, N-PCNN, SR, DC, and the proposed method.

Table 1.

The objective comparison of fused results, where black bold text denotes the best result (Figure 5).

Method	Evaluation metrics
Method	EN	AG	Qabf	EI	MI	SD
GP	7.45	7.97	0.66	53.77	6.57	48.76
DWT	7.43	6.24	0.64	48.47	7.01	48.89
stDWT	7.44	7.73	0.60	54.36	6.62	49.14
PCNN	7.43	6.29	0.61	47.88	7.09	48.82
N-PCNN	7.45	8.34	0.64	60.71	6.76	49.67
SR	7.45	8.41	0.70	61.41	7.43	50.26
DC	7.34	11.72	0.67	65.69	6.15	50.96
Proposed	7.48	8.42	0.68	61.35	7.58	50.31

AG: average gradient; DWT: discrete wavelet transform; EI: edge intensity; EN: entropy; GP: Gaussian pyramid; MI: mutual information; PCNN: pulse coupled neural network; SD: standard deviation; SR: sparse representation.

IR and visible image fusion

In this section, we test our method on IR and visible image fusion. Figure 6(a) is a visible image and Figure 6(b) is the corresponding infrared image. As can be seen from Figure 5, visible image provides the information of the right house, trees, and road but the visualities of the left two houses are weak. It is fortunate that the infrared image is high luminance and the left two houses are obvious. Fused results obtained by GP, DWT, stDWT, PCNN, N-PCNN, SR, DC and the proposed method are shown in Figure 6(c) to (j). It can be seen that the sky area of Figure 6(c) and (d) is blurry and the trees of Figure 6(e) and (g) are unclear. Further we can find that there is a little fuzzy in the left tree of Figure 6(f) and there is obvious blocking effect in the right tree of Figure 6(h). To quantitatively assess the performance of different methods, the evaluation metrics are computed and given in Table 2. Owing to the low visual quality of Figure 6(i), the DC-based method is excluded from the objective comparison. Except for the DC-based method, the table indicates that the fused result obtained by the proposed method is of better EN and MI values. These means that the fused result obtained by the proposed method holds maximum amount of information compared with those obtained by other methods. Besides, the proposed method is superior to other methods in AG and SD metrics, which means the texture of fused result obtained by the proposed method is abundant. Quantitative comparisons show that the proposed method achieves the best fusion performance.

Figure 6.

(a) is the visible image, (b) is the infrared image, and (c) to (j) are the fused results of GP, DWT, stDWT, PCNN, N-PCNN, SR, DC, and the proposed method.

Table 2.

The objective comparison of fused results (Figure 6).

Method	Evaluation metrics
Method	EN	AG	Qabf	EI	MI	SD
GP	6.60	4.99	0.52	38.11	2.14	23.86
DWT	6.54	3.36	0.37	29.26	2.36	23.04
strDWT	7.05	5.51	0.51	40.60	3.38	27.29
PCNN	7.17	5.33	0.62	46.45	3.74	29.33
N-PCNN	6.84	5.47	0.56	44.90	2.58	29.22
SR	7.39	6.06	0.81	52.34	4.05	43.19
DC	7.52	6.06	0.28	58.66	1.29	47.49
Proposed	7.41	6.01	0.70	54.53	4.37	45.79

Medical image fusion

In this section, we test our method on a pair of medical images. Figure 7(a) is the CT image and Figure 7(b) is the MRI image. As can be seen from Figure 7, Figure 7(a) gives the bone structure information while Figure 7(b) presents the organizational structure. All fused results are shown in Figure 7(c) to (j). It is apparent that Figure 7(e) lose important organizational structure information in the MRI source image. There is an overexposure phenomenon in the bone structure area of Figure 7(f) and (g). In addition, the spatial distortion appears in Figure 7(h) and (i). Compared with the fused result obtained by the GP-based method, the fused result obtained by the proposed method has better contrast in bone structure area and can fully preserve the organizational structure information in the MRI source image. Table 3 gives the objective evaluation of the fused results. The proposed method gets the best values in terms of Qabf and EI indexes, which is consistent with visual perception. Given all that, we draw the conclusion that the proposed method is effective in medical image fusion task.

Figure 7.

(a) and (b) are the source images and (c) to (j) are the fused results of GP, DWT, stDWT, PCNN, N-PCNN, SR, DC, and the proposed method.

Table 3.

The objective comparison of fused results (Figure 7).

Method	Evaluation metrics
Method	EN	AG	Qabf	EI	MI	SD
GP	6.60	10.75	0.70	83.22	4.62	90.22
DWT	6.48	7.92	0.60	68.90	6.59	90.62
strDWT	6.35	10.36	0.57	77.39	4.67	82.11
PCNN	6.09	9.34	0.51	73.39	5.86	102.12
N-PCNN	6.56	10.17	0.69	88.06	4.65	98.16
SR	5.89	11.70	0.65	87.94	5.24	107.03
DC	7.09	9.987	0.49	86.62	3.84	79.85
Proposed	6.83	10.94	0.70	90.28	4.17	99.64

As for the GP method, subband coefficients are acquired by using a Gaussian average and scaled down. This transform cannot extract directional information of source images. Therefore, the fused results obtained by GP method cannot present clear texture; compared with GP transform, there are three directional high-frequency subbands in DWT transform domain. In additional, the strDWT method introduces the structure tensor in the fusion process. The structure tensor is an advanced feature extraction method which can more accurately present the image than DWT subband coefficients. As a result, the strDWT method obtains better performance than the former ones; PCNN is a shallow neural network. The fusion method based on PCNN takes the firing times as the activity-level measure of source image. Therefore, this method is more suitable for the fusion of highly complementary source images. This phenomenon is verified by the experiment on IR and visible image fusion. N-PCNN method employs PCNN in NSCT domain. Compared with DWT, NSCT can provide anisotropy, shift invariant, and more directions. Therefore, compared with PCNN and other above-mentioned methods, the N-PCNN method can get better fusion performance. SR is a novel feature learning method which converts complex source data into sparse and useful representation. As can be seen from experimental results, this method gets the second best fusion performance. However, it is hard to represent all kinds of images by a dictionary. In the DC-based method, phase congruency and DC of NSCT coefficient are computed and used to fuse NSCT low- and high-frequency coefﬁcients. These two features cannot accurately represent all kinds of images. Therefore, this method gets good fusion result in “Clock” image fusion but it fails to fuse Figures 5(a) and (b) and 6(a) and (b). The proposed method is based on SSAE and SIST. The SSAE is a deep architecture neural network which has powerful feature extraction ability on image data. It is a kind of data-driven feature extraction method and can obtain more universal and representative feature compared with structure tensor, PCNN, and SR. Besides, SIST is a kind of flexible multiscale, multidirectional, anisotropy and shift-invariant MSD method, which is better than DWT and NSCT, Therefore, the proposed method achieves the best fusion performance.

Conclusion

The SIST is a popular MSD tool. Moreover, the SSAE is an unsupervised deep learning method, which can effectively represent the source images. Therefore, a novel image fusion method based on SIST and SSAE has been proposed in this paper. The low-frequency subbands are fused by the weighted average fusion rule and the high-frequency subband coefficients are fused by using the SSAE-based choose-max fusion rule. To achieve the fusion of high-frequency subbands, they are divided into overlapped image blocks, and then a two-layer SSAE network is adopted to extract feature from each image blocks. The space frequency of SSAE coding is calculated as the activity-level measure of image block. Experimental results demonstrate the effectiveness of the proposed method.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of P. R. China under grant no. 61772237 and the Provincial research grant no. BK20151358, BK20151202. The Ministry of Housing and Urban-rural Development of the People’s Republic of China under grant no. 2015-K8–035. The Fundamental Research Funds for the Central Universities JUSRP51618B and the Equipment Development and Ministry of Education union fund 6141A02033312.

References

Mallat

SG.

A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Machine Intell 1989; 11: 674–693.

Rockinger

Image sequence fusion using a shift invariant wavelet transform. In: Proceedings of IEEE international conference on image processing, Santa Barbara, CA, USA, 26–29 October 1997, pp.288–291, IEEE Publisher.

MN and

Vetterli

The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans Image Process 2005; 14: 2091–2106.

Da Cunha

Zhou

J and

MN.

The nonsubsampled contourlet transform: theory, design, and applications. IEEE Trans Image Process 2006; 15: 3089–3101.

Guo

K and

Labate

Optimally sparse multidimensional representation using shearlets. SIAM J Math Anal 2007; 39: 298–318.

Easley

Labate

D and

Lim

WQ.

Sparse directional image representations using the discrete shearlet transform. Appl Comput Harmonic Anal 2008; 25: 25–46.

Liu

Xiao-Ping

et al . Research and recent development of image fusion at pixel level. Appl Res Comput 2008; 25: 650–655.

Liang

Liu

et al . Image fusion using higher order singular value decomposition. IEEE Trans Image Process 2012; 21: 2898–2909.

Yang

B and

Multifocus image fusion and restoration with sparse representations. IEEE Trans Instrum Meas 2010; 59: 884–892.

10.

Luo

Zhang

Z and

Adaptive multistrategy image fusion method. J Electron Imaging 2014; 23: 053011.

11.

Bengio

Courville

A and

Vincent

Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 2013; 35: 1798–1828.

12.

Hinton

GE and

Salakhutdinov

RR.

Reducing the dimensionality of data with neural networks. Science 2006; 313: 504–507.

13.

Zhang

Z and

Blum

RS.

A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application. Proc IEEE 1999; 87: 1315–1326.

14.

Pajares

G and

De La Cruz

JM.

A wavelet-based image fusion tutorial. Pattern Recognit 2004; 37: 1855–1872.

15.

B and

Miao

Structure tensor based image fusion. In: Proc. third international symposium on electronic commerce and security workshops, Guangzhou, P. R. China, 29–31 July 2010, pp.343–346. Guangzhou: Academy Publisher.

16.

Zhai

Geng

et al . A novel algorithm of image fusion based on PCNN and shearlet. Int J Digital Content Technol Appl 2011; 5: 347–354.

17.

Yang

Wang

et al . Fusion of multiparametric SAR images based on SW-nonsubsampled contourlet and PCNN. Signal Process 2009; 89: 2596–2608.

18.

Yang

B and

Multifocus image fusion and restoration with sparse representation.

IEEE Trans Instrum Meas 2010; 59: 884–892.

19.

Bhatnagar

G and

Jonathan Wu

QM.

Directive contrast based multimodal medical image fusion in NSCT domain. IEEE Trans Multimedia 2013; 15: 1014–1024.

20.

Xydeas

CS and

Petrovic

Objective image fusion performance measure. Electron Lett 2000; 36: 308–309.