Abstract
This paper presents a compressive-sensing- (CS-) based video codec suitable for wireless video systems that require simple encoders but can tolerate more complex decoders. At the encoder side, each video frame is independently measured by a block-based random matrix, and the resulting measurements are encoded into a compressed bitstream by entropy coding. In particular, to reduce the quantization errors of the measurements, nonuniform quantization is integrated into the DPCM-based quantizer. At the decoder side, a novel joint reconstruction algorithm is proposed to improve the quality of the reconstructed video frames. The proposed algorithm first uses a temporal autoregressive (AR) model to generate the Side Information (SI) of a video frame and then recovers the residual between the original frame and the corresponding SI. To exploit the sparsity of the residual, whose statistics vary locally, Principal Component Analysis (PCA) is used to learn online a transform matrix adapted to the residual structures. Extensive experiments validate that the joint reconstruction algorithm in the proposed codec achieves much better results than many existing methods when both reconstructed quality and computational complexity are considered. The rate-distortion performance of the proposed codec is superior to that of a state-of-the-art CS-based video codec, although a considerable gap remains between it and traditional video codecs.
1. Introduction
1.1. Motivation and Objective
In wireless sensor networks, given the constraints of limited processing capability, a limited power/energy budget, and information loss [1], it is challenging for video sensors to encode and transmit high-volume video sequences using traditional video codecs (e.g., H.264/AVC [2] and HEVC [3]). Various video-coding schemes have therefore been developed to provide a low-complexity but high-compression encoder, among which Distributed Video Coding (DVC) [4, 5] and Compressed Video Sensing (CVS) [6, 7] attract the most attention. For video coding in wireless sensor networks, CVS is potentially more suitable because its theoretical foundation, compressive sensing (CS) [8], enables the simultaneous sampling and compression of each video frame by optical devices (e.g., CS-MUVI [9] and CACTI [10]). Many CVS schemes attempt to realize a codec that codes the measurements of a video sequence into bits, but their rate-distortion performance is still far from satisfactory.
The first objective of this paper is to design a CS-based video codec framework for wireless sensor networks. In particular, building on existing work on CS, each step of the pipeline, from original video sequences to bits and back, is carefully designed, and we also discuss (1) how to design the quantization of measurements to reduce quantization error and (2) how the prediction structures used in decoding affect the performance of joint reconstruction. Another objective of this paper is to propose an efficient reconstruction algorithm that further improves the rate-distortion performance of the codec. Interframe correlation, an important characteristic of video sequences, is exploited in the reconstruction of video frames.
1.2. Related Work
The basic elements of a CVS scheme are random measurement, quantization of measurements, and reconstruction. Because of the huge amount of video data, the random measurement must be implemented frame by frame; the block-based random matrix [11] and the structurally random matrix [12] are often used owing to their small memory requirements, low complexity, and high universality. In the majority of CVS schemes [13, 14], the measurements are not quantized. Uniform scalar quantization is occasionally used, as in [15, 16], but it results in poor rate-distortion performance. Recently, more attention has been paid to quantized CS [17–20], and some quantizers specific to measurements have been designed to improve recovery performance, such as the DPCM-based quantizer [19] and the binned progressive quantizer [20]. Existing reconstruction strategies can be divided into three categories: frame-by-frame reconstruction [6, 7], volumetric reconstruction [21, 22], and joint reconstruction [14, 23, 24]. Frame-by-frame reconstruction regards the video sequence as a series of independent video frames and applies a still-image recovery algorithm to each frame. This strategy neglects interframe correlation, which results in poor reconstruction performance. Volumetric reconstruction regards the video sequence as a three-dimensional (3D) signal and uses a fixed 3D basis (e.g., 3D DWT or 3D DCT) to reconstruct the whole video sequence or a video clip. However, it is not a practical strategy because huge memory and high computational complexity are required at the decoder side. Joint reconstruction is derived from the decoding strategy of DVC: each video frame is reconstructed with the aid of Side Information (SI), which is interpolated by Motion Estimation (ME) and Motion Compensation (MC).
This strategy not only retains the small memory footprint and low computational complexity of single-frame reconstruction but also exploits the motion information between adjacent frames through the SI. Therefore, joint reconstruction is the most promising of the three strategies.
Joint reconstruction is used in the proposed CVS scheme, and it consists of SI generation and SI-based recovery. The SI can be interpolated either by Frame Rate Upconversion (FRUC) [25] techniques, as in [13, 15], or from both the measurements and the neighboring frames of the current frame, as in [14, 24]. Generally speaking, SI generated from measurements and neighboring frames outperforms SI generated by FRUC because the former can use information from the current frame. After the SI is generated, SI-based recovery uses the measurements to enhance its quality. References [15, 26] resorted to the Wyner-Ziv codec [5] of DVC to realize SI-based recovery; however, these methods depend on the encoder in real time because they require the encoder to transmit measurements of the current frame via a feedback channel. Without a feedback channel, SI-based recovery can still be performed by modifying the CS recovery model; for example, [13] used the SI to modify the initialization and stopping criterion of the GPSR (Gradient Projection for Sparse Reconstruction) algorithm, while [14, 23, 24, 27] proposed recovering the residual between the original frame and its SI. Decoding the video sequence independently of the encoder is important for low-latency video communication, so methods based on a modified CS recovery model make the CVS scheme more practical and flexible. In particular, the performance of the residual recovery used in [14, 23, 24] can be further improved by developing a more amenable CS recovery algorithm, since the residual is expected to be much more compressible than the original frame.
In existing CVS approaches, the measurements of video frames are not efficiently compressed by quantization and entropy coding, which motivates us to improve the compression of the CVS encoder. In addition, the SI generation and residual recovery are each improved to guarantee better joint reconstruction. First, we design the architecture of a CS-based video codec, including the encoder and decoder frameworks. In particular, a DPCM-based nonuniform quantizer is proposed to reduce the quantization error of measurements, and we also analyze the performance of various decoding predictive structures. Second, a joint reconstruction algorithm is proposed to improve the rate-distortion performance of the codec. Specifically, combined with the measurements of the current frame, a temporal autoregressive (AR) model is used to generate its SI, and the quality of the reconstructed residual is improved by using an adaptive orthogonal transform matrix learned online by Principal Component Analysis (PCA) [28].
1.3. Main Contributions
First, we present a CS-based video codec. At the encoder side, based on the statistical characteristics of the measurements, we propose replacing the uniform quantization in the DPCM-based quantizer with a nonuniform method. At the decoder side, we analyze the effect of various decoding predictive structures on reconstruction performance.
Second, we propose a joint reconstruction algorithm consisting of AR prediction and adaptive residual recovery. The AR model's ability to preserve local image structure motivates us to use it to generate the SI of a video frame. To exploit the highly sparse nature of the residual, we use PCA to track its locally varying statistics.
The remainder of this paper is organized as follows. Section 2 provides a brief review of CS theory and a comparison between CVS and traditional video codecs. Section 3 presents the proposed CS-based video codec architecture. Section 4 describes the joint reconstruction algorithm, including AR prediction and adaptive residual recovery. Experimental results are reported in Section 5 to evaluate the performance of the proposed video codec. Finally, conclusions are drawn in Section 6.
2. Background
2.1. CS Theory
CS theory builds on the groundbreaking work of Candès et al. [29] and Donoho [30], which asserts that one can accurately recover certain signals from far fewer samples or measurements than the Nyquist rate requires. To make this possible, CS relies on three principles: sparsity or compressibility of the signal, incoherent measurement, and optimal recovery. Sparsity or compressibility is a necessary condition, optimal recovery is the method for reconstructing the original signal, and incoherent measurement ensures the convergence of the optimal recovery [31].
The mathematical formulation of CS is described as follows. Suppose x ∈ R^N is a signal that is sparse or compressible in some basis Ψ, that is, x = Ψs, where s has only K ≪ N significant entries. The measurement process is y = Φx, where y ∈ R^M is the measurement vector and Φ ∈ R^(M×N) (K < M ≪ N) is a random measurement matrix incoherent with Ψ. The signal can then be recovered by solving the sparsity-promoting program min ||s||_1 subject to y = ΦΨs.
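As a self-contained numerical illustration of this recovery problem (not part of the paper's codec), the sketch below measures a synthetic sparse signal with a random Gaussian matrix and recovers it with Orthogonal Matching Pursuit, one standard CS recovery algorithm; the dimensions and sparsity level are arbitrary.

```python
import numpy as np

def omp(Phi, y, K, tol=1e-9):
    """Orthogonal Matching Pursuit: recover a K-sparse s from y = Phi @ s."""
    M, N = Phi.shape
    residual = y.copy()
    support = []
    s_hat = np.zeros(N)
    for _ in range(K):
        # Pick the column most correlated with the current residual.
        idx = int(np.argmax(np.abs(Phi.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares fit on the current support.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    s_hat[support] = coef
    return s_hat

rng = np.random.default_rng(0)
N, M, K = 256, 64, 5                              # sub-Nyquist: M << N
s = np.zeros(N)
s[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)    # random Gaussian measurement
y = Phi @ s                                       # sampling and compression at once
s_hat = omp(Phi, y, K)
print(np.linalg.norm(s - s_hat) / np.linalg.norm(s))
```

With these dimensions the relative recovery error is numerically zero with overwhelming probability, illustrating the exact-recovery claim above.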
With the development of CS theory, it has been applied in medical imaging, data communication, wireless sensor networks, remote sensing, and so forth. In wireless sensor networks especially, each sensor cannot afford excessive computation, and the unstable wireless channel requires that the data output by sensors be robust to noise. The sub-Nyquist sampling of CS ensures low computational complexity, and the equal importance of the individual measurements also increases robustness to noise [32], which makes CS a good fit for data acquisition in wireless sensor networks.
2.2. CVS versus Traditional Video Codec
Traditional video codecs (e.g., MPEG and H.264) use a hybrid encoding framework whose encoder performs motion estimation to exploit the spatial-temporal redundancy in the video signal. In general, the encoding complexity is about 5 to 10 times the decoding complexity, so traditional video codecs suit applications that encode once and decode many times, for example, video broadcasting, video on demand, and video storage [4]. However, because of the limited computation, memory, and energy of sensors, wireless sensor networks invert this pattern: they require a low-complexity encoder but can tolerate a high-complexity decoder. A CVS system uses CS theory to transfer the majority of the video-encoding complexity to the receiver and is therefore more suitable for wireless sensor networks than a traditional video codec. In addition to substantially reduced encoding complexity, CVS has other attractive and intriguing properties, particularly when random measurement is employed at the sensor. Random measurements are universal in the sense that any transform matrix can be used in the decoder, allowing the same encoding strategy to be applied in different sensing environments. Because the measurements coming from each sensor have equal priority, random coding is also robust to bit errors; that is, one or more measurements can be lost or corrupted without destroying the entire recovery [33].
3. CS-Based Video Codec Architecture
In this section, we describe the proposed CS-based video codec architecture in detail. The overall flow of this codec is shown in Figure 1. The input video sequence is first divided into several Groups of Pictures (GOPs) with a fixed length L, and then each GOP is processed by the encoder.

Overall flow of the proposed CS-based video codec architecture.
3.1. Encoder Framework
In the encoder framework, whose block diagram is depicted as the dotted box marked Encoder in Figure 1, the key frame is first split from the GOP, and the other frames are regarded as nonkey frames. Each frame is then measured independently by the block-based random matrix.
The DPCM-based quantizer in [19] uniformly quantizes the residuals between the measurements of consecutive blocks because these residuals carry less redundancy and can further reduce the bits per video frame. Figure 2 shows histograms of the residuals for the 2nd frame of the Foreman sequence (CIF format, 30 fps) at different subrates; the residual values are unevenly distributed, with small values appearing far more frequently. Given this statistical characteristic, uniform quantization is not well suited to reducing the quantization errors of the measurements, and nonuniform quantization is a better choice. The block diagram of DPCM-NQ is shown in Figure 3. For the mth measurement

Histograms of residuals for the 2nd frame of Foreman sequence with CIF format at the different subrates: (a) 0.1, (b) 0.2, and (c) 0.3.

Block diagram of the DPCM-based nonuniform quantizer: (a) quantization and (b) dequantization.
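The paper specifies only that compression and expansion (companding) operations surround a uniform quantizer in the DPCM loop; the sketch below assumes μ-law companding as one plausible nonuniform stage applied to the DPCM residuals of consecutive measurements. The companding law, step size, and peak normalization are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def mulaw_compress(x, mu=255.0):
    """Compressor: fine quantization steps near zero, coarse near the peak."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_expand(x, mu=255.0):
    """Expander: exact inverse of the compressor."""
    return np.sign(x) * np.expm1(np.abs(x) * np.log1p(mu)) / mu

def dpcm_nq_encode(measurements, step, mu=255.0, peak=1.0):
    """DPCM across consecutive measurements plus a uniform quantizer applied
    in the companded domain; the encoder mirrors the decoder's reconstruction
    so that quantization errors do not accumulate."""
    indices, pred = [], 0.0
    for m in measurements:
        r = (m - pred) / peak                         # DPCM residual, normalized
        q = int(round(mulaw_compress(r, mu) / step))  # uniform in companded domain
        indices.append(q)
        pred += mulaw_expand(q * step, mu) * peak
    return indices

def dpcm_nq_decode(indices, step, mu=255.0, peak=1.0):
    out, pred = [], 0.0
    for q in indices:
        pred += mulaw_expand(q * step, mu) * peak
        out.append(pred)
    return out
```

Because small DPCM residuals dominate (Figure 2), companding allocates most of the quantization levels where the residuals actually fall, which is the intuition behind replacing the uniform stage.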
Finally, Huffman encoding is used to compress the quantized measurements into bits. So that the various headers required by wireless network protocols (such as the IP header) can be conveniently added, these bits and some important decoding parameters are packed into a packet according to the format shown in Figure 4. The fields of the packet are defined below; note that the decoding parameters are stored as positive integers unless otherwise mentioned.
Block size B: 8 bits — this field provides the block size of the video frame.
Sequence number i of GOP: 16 bits — this field uniquely identifies the order of the GOP so that the video sequence can be regrouped at the decoder side.
Length L of GOP: 8 bits — this field provides the fixed length of the GOP.
Bit depth b: 8 bits — this field is used to compute the number of quantization levels.
Maximum measurement — this field carries the maximum measurement value.
Data — the bits of the measurements of each frame are saved in this field.

Packet format.
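A minimal sketch of packing the header fields named above, using the field sizes stated in the text (B: 8 bits, i: 16 bits, L: 8 bits, b: 8 bits) followed by the measurement bits. The field order, big-endian layout, and the omission of the maximum-measurement field are assumptions for illustration, since Figure 4 is not reproduced here.

```python
import struct

def pack_packet(block_size, gop_seq, gop_len, bit_depth, payload):
    """Pack the header fields named in the text ahead of the measurement
    bits. ">BHBB" = big-endian: 8-bit B, 16-bit i, 8-bit L, 8-bit b."""
    header = struct.pack(">BHBB", block_size, gop_seq, gop_len, bit_depth)
    return header + payload

def unpack_packet(packet):
    """Recover the decoding parameters and the measurement payload."""
    block_size, gop_seq, gop_len, bit_depth = struct.unpack(">BHBB", packet[:5])
    return block_size, gop_seq, gop_len, bit_depth, packet[5:]
```

The 16-bit sequence number lets the decoder reorder up to 65536 GOPs, matching the stated purpose of regrouping the video sequence at the decoder side.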
3.2. Decoder Framework
The block diagram of decoder framework is shown as the dotted box marked by Decoder in Figure 1. The key frame is reconstructed independently by only using these dequantized measurements
To exploit interframe statistical dependencies, an efficient Predictive Structure (PS) is required to select the reference frames for joint reconstruction. In the PS, the key frame is called the I frame because no reference frames are available for its prediction, and the nonkey frames are classified into two types: P frames using unidirectional prediction and B frames using bidirectional prediction. The PS starts from the I frame, and a high-quality initial reference frame helps improve the reconstruction of the following frames; therefore, the I frame requires a higher subrate than the P and B frames. In joint reconstruction, a B frame outperforms a P frame because it can use more temporal information, so inserting B frames into the PS has the potential to achieve a substantial performance gain. Figure 5 illustrates five different PSs for a GOP of length 8, in which I, P, and B frames are combined in different reconstruction orders. Each PS is a strategy for exploring interframe correlation; however, only a reasonable combination of I, P, and B frames can significantly improve the rate-distortion performance of the codec. Experimental results using the different PSs depicted in Figure 5 are given in Section 5.6, which analyzes their effect on the performance of the proposed codec.

Five reference frame structures for temporal prediction when the length of GOP is 8: (a) PS1, (b) PS2, (c) PS3, (d) PS4, and (e) PS5. The number at the top-right of box represents the reconstruction order.
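Since Figure 5 is not reproduced here, the sketch below generates one generic hierarchical-B decode order for a GOP of length 8. It illustrates how B frames can be inserted midway between frames that are already decoded, so that each B frame has both a past and a future reference; it is not the paper's exact PS1–PS5.

```python
def hierarchical_decode_order(gop_len=8):
    """One generic hierarchical-B decode order for a GOP: the I frame first,
    then the P frame closing the GOP, then B frames inserted midway between
    already-decoded frames. Returns (frame_index, frame_type) pairs."""
    order = [(0, "I"), (gop_len, "P")]

    def split(lo, hi):
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append((mid, "B"))  # both references lo and hi are decoded
        split(lo, mid)
        split(mid, hi)

    split(0, gop_len)
    return order
```

For a GOP of length 8 this yields the decode order 0, 8, 4, 2, 1, 3, 6, 5, 7, making concrete why bidirectional prediction costs extra decoding latency but gives every B frame two temporal references.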
4. Joint Reconstruction Algorithm
Here, we propose a novel joint reconstruction algorithm consisting of AR prediction and adaptive residual recovery. The AR model captures the fact that a local area along the temporal axis can be viewed as a stationary process [36], so AR prediction can exploit local temporal correlation to improve the quality of the SI. Like natural signals, the residual between the original frame and its SI typically has locally varying statistics, and there is no fixed transform matrix under which all blocks of the residual are sparse [37]; this motivates a PCA-based, locally adaptive strategy for recovering the residual.
4.1. Autoregressive Prediction
As shown in Figure 6, the AR model is used to describe the temporal correlation between pixels along the motion trajectories from the block

Equation (12) is not a realistic representation of
To control the
In the abovementioned AR model, it is essential to compute the Motion Vector (MV) from the block

Relative positions of candidate MVs used in 3DRS.
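A toy sketch of temporal AR prediction, under simplifying assumptions: motion is taken as already compensated, a single set of AR weights is shared across the whole frame, and the coefficients are fit by least squares on the previous known frame pair (t-2 → t-1) and then applied to frame t-1 to predict the SI of frame t. The paper's model fits block-wise coefficients along 3DRS motion trajectories; this is only the core idea.

```python
import numpy as np

def _neighborhoods(frame, nbr=1):
    """Stack each pixel's (2*nbr+1)^2 neighborhood (edge-padded) into a row."""
    k = 2 * nbr + 1
    pad = np.pad(frame, nbr, mode="edge")
    H, W = frame.shape
    return np.array([pad[y:y + k, x:x + k].ravel()
                     for y in range(H) for x in range(W)])

def fit_ar(src, dst, nbr=1):
    """Least-squares AR weights mapping neighborhoods of `src` to pixels of
    `dst`; trained on a known frame pair (t-2 -> t-1)."""
    A = _neighborhoods(src, nbr)
    a, *_ = np.linalg.lstsq(A, dst.ravel(), rcond=None)
    return a

def ar_side_info(prev_frame, a, nbr=1):
    """Apply the learned AR weights to the latest decoded frame to predict
    the current frame's side information."""
    A = _neighborhoods(prev_frame, nbr)
    return (A @ a).reshape(prev_frame.shape)
```

For content that drifts by one pixel per frame, the least-squares fit learns a shift kernel, so the AR-predicted SI is much closer to the true frame than simply repeating the previous frame, which is the advantage the AR prediction step exploits.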
4.2. Adaptive Residual Recovery
Without the original residual
To learn online the PCA transform matrix of each residual block, the framework of iterative shrinkage algorithm summarized in [43] is used to realize the adaptive residual recovery, which is presented as follows.
The Proposed Adaptive Residual Recovery Algorithm
Task. Find the optimal solution
Initialization. Initialize
Main Iteration. Increment j by 1, and apply these steps:
(i) PCA-Update. Compute the PCA transformation matrix
(ii) Shrinkage. Compute
(iii) Back-Projection. Compute
Stopping Rule. Stop when
Output. The result
A high-quality initial residual estimate helps gradually improve the accuracy of the PCA transform matrix over the iterations; at the same time, the initialization of the residual should not introduce excessive computation. Therefore, the initial estimate is computed by the Minimum Mean Square Error (MMSE) linear estimation used in [11]; that is,
The implementation of PCA-Update is described as follows. At first, we pixel-by-pixel extract M samples
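A compact numpy sketch of the iterative shrinkage loop described above, with the PCA-Update, Shrinkage, and Back-Projection steps made explicit. The pseudo-inverse initialization, patch size, soft threshold, and iteration count are illustrative choices, not the paper's exact parameters (the paper initializes with an MMSE linear estimator and learns block-wise matrices).

```python
import numpy as np

def extract_patches(img, p=4):
    """Pixel-by-pixel extraction of all overlapping p x p patches as rows."""
    H, W = img.shape
    return np.array([img[y:y + p, x:x + p].ravel()
                     for y in range(H - p + 1) for x in range(W - p + 1)])

def recover_residual(Phi, y, shape, n_iter=15, p=4, lam=0.05):
    """Iterative shrinkage with an online PCA-learned transform, following
    the PCA-Update / Shrinkage / Back-Projection structure above."""
    H, W = shape
    # Cheap linear initial estimate consistent with the measurements.
    x = (Phi.T @ np.linalg.solve(Phi @ Phi.T, y)).reshape(shape)
    for _ in range(n_iter):
        # PCA-Update: learn an orthonormal basis from the current estimate.
        patches = extract_patches(x, p)
        mean = patches.mean(axis=0)
        _, U = np.linalg.eigh(np.cov((patches - mean).T))
        # Shrinkage: soft-threshold patch coefficients in the PCA domain.
        coef = (patches - mean) @ U
        coef = np.sign(coef) * np.maximum(np.abs(coef) - lam, 0.0)
        rec = coef @ U.T + mean
        # Average the overlapping denoised patches back into an image.
        acc, den = np.zeros(shape), np.zeros(shape)
        i = 0
        for yy in range(H - p + 1):
            for xx in range(W - p + 1):
                acc[yy:yy + p, xx:xx + p] += rec[i].reshape(p, p)
                den[yy:yy + p, xx:xx + p] += 1.0
                i += 1
        x = acc / den
        # Back-Projection: restore exact consistency with the measurements.
        xf = x.ravel()
        xf = xf + Phi.T @ np.linalg.solve(Phi @ Phi.T, y - Phi @ xf)
        x = xf.reshape(shape)
    return x
```

Because the transform is relearned from the current estimate at every iteration, it tracks the locally varying statistics of the residual, which is exactly what a fixed DCT or wavelet basis cannot do.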
5. Experimental Results
In this section, various experiments are conducted to evaluate the performance of the proposed CS-based video codec. Because the nonuniform quantization, AR prediction, and PCA-based adaptive residual recovery are the methods used to improve the quality of the reconstructed video frames, each is evaluated separately to verify its performance gain: (1) we compare the quantization errors of the DPCM-based nonuniform quantizer (DPCM-NQ) with those of the DPCM-based uniform quantizer (DPCM-UQ) proposed in [19]; (2) the encoding complexity of the proposed video codec is analyzed, and its encoding time is compared with those of H.264/AVC [2], HEVC [3], and DISCOVER [5]; (3) the performance of the proposed joint reconstruction algorithm is evaluated using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [45], and comparisons with some existing reconstruction algorithms [13, 14, 23, 24] are also presented; (4) the computational complexity of the proposed joint reconstruction is analyzed, and its reconstruction time is compared with those of the existing CS-based methods in [13, 14, 23, 24] and of the decoder of the traditional video codec H.264/AVC under different frame resolutions; (5) a performance comparison is made between the adaptive PCA matrix and the fixed DCT and Daubechies-4 matrices [46] when applied to residual recovery; (6) we discuss the effects of the various PSs depicted in Figure 5 on the performance of the proposed video codec. Finally, the rate-distortion performance of the overall proposed codec is evaluated from two aspects. On the one hand, we integrate the proposed joint reconstruction and the other CS-based algorithms in [13, 14, 23, 24] into our CVS system, respectively, and compare their rate-distortion performance.
On the other hand, the rate-distortion curve of the proposed CVS system is also compared with those of H.264/AVC-Intra codec, DISCOVER, and CS-KLT video codec proposed in [16].
Four test sequences with CIF resolution of
5.1. Quantization Performances
The DPCM-UQ and DPCM-NQ are used in turn in the proposed video codec to encode the first 100 frames of each test sequence, and the average quantization errors of the two quantizers on all test sequences are presented in Table 1. The quantization errors of DPCM-NQ are much smaller than those of DPCM-UQ at every subrate, decreasing by 58.90% on average, which benefits from the fact that nonuniform quantization better matches the distribution of the measurement residuals. However, DPCM-NQ also pays a price for reducing the quantization error. Table 1 shows the average total execution time to encode each test sequence: DPCM-NQ requires more time than DPCM-UQ at every subrate, taking 55.17% longer on average, because the compression and expansion operations introduce extra computation.
Performance comparisons of different quantizers.
5.2. Encoding Complexity
Owing to the nonstationary statistics of video sequences, it is not possible to accurately predict the number of operations required to encode each video frame; instead, we use the execution time of encoding a video sequence to indirectly reveal the encoding complexity. The first 100 frames of each test sequence are encoded by the proposed CS-based video codec, DISCOVER (available online at http://www.discoverdvc.org/), the H.264/AVC JM9.5 software (available online at http://iphome.hhi.de/suehring/tml/), and the HEVC HM10.0 software (available online at http://www.hevc.info/); our codec is written in MATLAB, and the others are programmed in C++. The test conditions are as follows.
Proposed codec and DISCOVER: insert one I frame every 10 frames; the proposed codec is configured with different subrates for the nonkey frames. H.264/AVC and HEVC: the first configuration is All Intra with QP set to 27 (AI27), where all frames are encoded as I frames, and the second configuration is Low Delay with QP set to 27 (LD27), where only the first frame is encoded as an I frame and the others are encoded as P frames.
Table 2 presents the encoding time of the various video codecs under the above test conditions. The proposed codec requires more time as the subrate increases; however, it takes no more than 10 s even at a higher subrate; for example, encoding the Mobile sequence requires only about 7.38 s when the subrate is 0.5. DISCOVER has a moderate encoding time, requiring about 42.09 s on average per test sequence. Under both AI27 and LD27, H.264/AVC and HEVC take a long time; in particular, the LD27 configuration of HEVC has a heavy computational burden. Although these results have limited comparability because of the tradeoff between encoding complexity and rate-distortion performance, they confirm that, under common test conditions, the encoder of the proposed codec has very low complexity compared with H.264/AVC, HEVC, and DISCOVER. The Compression Ratio (CR) for all test sequences is also shown in the last row of Table 2: the proposed codec obtains a higher CR while reducing the encoding time, contrary to H.264/AVC and HEVC; however, the high CR merely shifts the computational complexity from the encoder to the decoder. Besides, DISCOVER achieves a high CR with the help of the feedback channel, but the existence of the feedback channel makes it harder to deploy.
Encoding time of various video codecs.
5.3. Reconstruction Performances
Next, the performance of the proposed joint reconstruction algorithm is evaluated from objective and subjective views by comparing it with those of methods proposed by [13, 14, 23, 24]. We successively process 10 GOPs of length
The average PSNR results of the various reconstruction methods for each test sequence are provided in Table 3. The proposed method is very efficient for the highly textured Mobile sequence and the slow translational Container sequence; for example, compared with the best of the comparative methods, the proposed method improves the results by up to 2.85 dB and 1.00 dB for the Mobile and Container sequences, respectively. For the Foreman sequence, with moderate and large motions, the proposed method achieves obvious PSNR gains at low subrates but loses about 0.2 dB at high subrates compared with [24]. The method of [24] also obtains PSNR gains of about 0.04–0.45 dB over our method for the Highway sequence, which has fast global motion. Similar results can be observed in terms of SSIM in Table 4. We also visually assess some video frames reconstructed by the different methods. Figures 8-9 show the reconstructed frames of Foreman and Mobile, at the subrate
Average PSNR (in dB) of different joint reconstruction algorithms.
Average SSIM of different joint reconstruction algorithms.

Visual comparison of the reconstructed 26th frame of Foreman by different methods (

Visual comparison of the reconstructed 46th frame of Mobile by different methods (
5.4. Reconstruction Complexity
Regarding the computational complexity of the various methods, Table 5 shows that the proposed method has moderate complexity; for example, its reconstruction time is only about half that of [24] for the sequences in QCIF and CIF formats. The reconstruction time of every algorithm increases with the resolution of the video frame; for the 720P sequences in particular, the reconstruction time of the proposed method grows significantly because the PCA computations are sensitive to large-scale signals. Although some methods require less time than ours at the various resolutions, there is a large reconstruction-performance gap between them and ours. Therefore, taking full account of reconstructed quality and computational complexity, the proposed method performs better than the other CS-based methods. We also present the decoding time of the H.264/AVC JM9.5 software with the LD27 configuration; the CS-based methods carry a heavy computational burden compared with H.264/AVC, which verifies that the significant decrease in encoding complexity comes at the expense of increased decoding complexity.
Comparison of average reconstruction time at all tested subrates.
5.5. PCA versus Fixed Transform Matrices
In the proposed joint reconstruction, we recover the residual frame by using the PCA-based adaptive transform matrix. To verify the effectiveness of adaptive residual recovery, the GPSR algorithm [27] is provided with the fixed DCT and Daubechies-4 matrices, respectively, to recover residual of each frame, and their resulting average PSNR curves on all test sequences with CIF format are compared with that of adaptive residual recovery using PCA, which is presented in Figure 10. It can be seen that our adaptive PCA matrix has higher PSNR values than the fixed matrices at any subrate; particularly for the subrate of 0.3, the PSNR gain is about 0.35 dB when compared with DCT matrix, which indicates that the PCA matrix better explores the sparsity of residual due to its adaptivity to the local structures.

Average PSNR curves of the joint reconstruction algorithm when the residual recovery uses the different transform matrix.
5.6. Prediction Structures
In this subsection, we evaluate the decoding performance of the proposed CS-based codec under the five PSs depicted in Figure 5. The 10 GOPs of length L = 8 in each test sequence are reconstructed at different subrates, and Figure 11 shows the average PSNR values and decoding times over all reconstructed test sequences under the various PSs. Figure 11(a) shows that the PSNR gradually rises as the number of B frames in the PS increases; for example, PS4, with all B frames, gains up to 2.04 dB over PS5, which has no B frames. The results for PS2 and PS3 show that different prediction approaches for B frames have little impact on reconstructed quality over a short time interval. Figure 11(b) shows that PS4 requires the longest decoding time of all the PSs and that the decoding time decreases as the number of B frames is reduced, largely because a B frame requires more computation than a P frame: the former combines the previous and following reference frames to fulfill the decoding task.

Decoding performance under various predictive structures: (a) average PSNR curves and (b) average decoding time.
5.7. Rate-Distortion Performances
The proposed joint reconstruction algorithm and the other CS-based algorithms in [13, 14, 23, 24] are each applied in our CVS system, and their rate-distortion curves on the CIF test sequences Foreman and Mobile are presented in Figure 12. The CVS system combined with the proposed method is superior, over most of the bitrate range, to the system combined with the algorithms in [13, 14, 23, 24]. This superior rate-distortion performance is largely attributable to the algorithm's ability to generate high-quality SI with AR prediction, which enhances the sparsity of the residual; in addition, the PCA-based adaptive residual recovery effectively corrects the errors between the SI and the original frame.

Figure 13 compares the rate-distortion performance, averaged over the first 100 frames of the Foreman, Highway, and Container sequences, of the Intra-coded results of the H.264/AVC JM9.5 software (H.264i), DISCOVER, the CS-KLT codec proposed in [16], and the proposed codec. For both DISCOVER and the proposed codec, the GOP length L is set to 10, and the decoding prediction structure PS1 is used in the proposed codec. The CS-KLT codec implements ME and MC at the decoder by sparsity-aware reconstruction using an interframe Karhunen-Loève Transform (KLT, equivalent to PCA) basis, and it is among the best-performing existing CS-based video codecs. Note that the results of the CS-KLT codec are taken directly from the order-10 decoding in [16]. Figure 13 shows that the proposed codec is superior to the CS-KLT codec over the whole range of bitrates; for example, the highest PSNR gain reaches 10.86 dB for the Container sequence. Moreover, the CS-KLT codec requires heavy computation; its order-2 decoding time is about 332.81 seconds per frame on average, whereas our codec requires only about 10.77 seconds on average to decode one frame. These CS-based video codecs still perform worse than H.264i and DISCOVER. H.264i expends many computations at the encoder side to explicitly retain the information of each video frame, which makes efficient decoding with a light computational burden easy to guarantee. With the help of the feedback channel, DISCOVER requires the encoder to transmit parity bits of the SI in real time during the decoding of each frame, so the reserved backward channel trades decoding independence and latency for better rate-distortion performance.
For a CS-based video codec, by contrast, the simple encoding approach, realized by dimensionality reduction, implicitly captures all the information of each video frame in its measurements, which makes decoding an inverse problem; consequently, it is harder for the CS-based codec than for H.264i and DISCOVER to improve the PSNR as the bitrate increases.

Rate-distortion curves for H.264/AVC-Intra, DISCOVER, CS-KLT codec, and the proposed codec: (a) Foreman, (b) Highway, and (c) Container.
6. Conclusions
In this paper, we presented a CS-based video codec with a low-complexity encoder. The coding process starts by dividing the input video sequence into several GOPs. At the encoder side, each video frame in a GOP is independently encoded by block-based measurement, and a DPCM-based nonuniform quantizer then quantizes the resulting measurements to reduce their quantization errors. Finally, Huffman encoding compresses the quantized measurements into bits, which are packed into packets with a defined format. To fully explore interframe correlation, a key frame with a high subrate can be inserted into each GOP, while the other frames in the GOP are encoded at relatively low subrates. At the decoder side, the key frame is reconstructed by a still-image CS recovery algorithm and provides a high-quality initial reference frame. For the nonkey frames, we proposed a novel joint reconstruction algorithm consisting of AR prediction and adaptive residual recovery. The AR prediction uses local temporal correlation to accurately generate the SI of a video frame, and the adaptive residual recovery learns online a PCA-based transform matrix adapted to the residual structures to improve the reconstructed quality of the residual. We also discussed the effects of various decoding predictive structures on the performance of the joint reconstruction algorithm. Various experiments evaluated the proposed CS-based video codec from several perspectives, and the results demonstrated that the DPCM-based nonuniform quantizer effectively reduces the quantization errors of measurements, that the encoder of the proposed codec carries a light computational burden, and that the proposed joint reconstruction algorithm outperforms many existing methods in both PSNR and visual quality.
The rate-distortion performance of the proposed codec strongly outperforms that of the CS-KLT video codec (one of the state-of-the-art CS-based video codecs); however, our codec still falls short of H.264/AVC and DISCOVER. In future work, we will therefore seek a more efficient joint reconstruction algorithm to further improve the rate-distortion performance of the CS-based video codec.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China, under Grants nos. 61501393, 61202194, and 61471162, in part by Youth Sustentation Fund of Xinyang Normal University, under Grant no. 2015-QN-043, in part by the Key Scientific Research Project of Colleges and Universities in Henan Province of China, under Grant no. 15A520026, and in part by the Technology Research Program of Henan Provincial Department of Education (no. 12A520035).
