Abstract
With the proliferation of video devices ranging from low-power visual sensors to mobile phones, the video sequences captured by these simple devices must be compressed with low complexity and reconstructed by relatively more powerful servers. For such scenarios, distributed compressed video sensing (DCVS), which combines distributed video coding (DVC) and compressed sensing (CS), has been developed as a novel and powerful signal-sensing and compression approach for video signals. In DCVS, each video frame is compressed separately to a few measurements, while the interframe correlation is exploited by the joint recovery algorithm. In this paper, a new DCVS joint recovery scheme using side-information-based belief propagation (SI-BP) is proposed to exploit both the intraframe and interframe correlations, which is particularly efficient over error-prone channels. The DCVS scheme using SI-BP is designed for two frame signal models, the mixture Gaussian (MG) model and the wavelet hidden Markov tree (WHMT) model. Simulation results on two video sequences show that the SI-BP-based DCVS scheme is error resilient when the measurements are transmitted through noisy wireless channels.
1. Introduction
Current video coding paradigms, such as MPEG and ITU-T H.26x, are traditionally designed for applications that follow the so-called “broadcast” model, as shown in the left part of Figure 1. The video sequence is encoded only once, with high complexity, at a powerful server, and the compressed video stream is then distributed and decoded many times on cheap and simple user devices. The MPEG and H.26x standards therefore both pair a complex encoder with a lightweight decoder.

Figure 1: Comparison of the “broadcast” model and the “multiple-access” model.
However, with the proliferation of video devices ranging from low-power visual sensors to camera phones, visual applications have moved beyond this broadcast model. The video processing paradigm in camera sensor networks, which are composed of spatially distributed smart camera devices capable of processing images or videos of a scene from a variety of viewpoints, is closer to a “multiple-access” model, as shown in the right part of Figure 1. In this scenario, video devices with limited battery power and storage memory need to send their captured video streams to a monitoring server. High compression efficiency is also required, given the limitations of both wireless bandwidth and transmission power. The requirements of this video processing paradigm are diametrically opposed to those addressed by MPEG and H.26x. Thus, lightweight and efficient encoding technologies have been developed to satisfy the requirements of these multiaccess applications.
The first step toward satisfying the multiaccess model was to exploit the interframe statistics at the decoder only, which is known as distributed video coding (DVC) [1]. The information-theoretic basis of DVC is distributed source coding (DSC) [2], which states that correlated sources can be separately encoded and jointly decoded using the correlation between them; furthermore, this separate encoding can achieve the same coding efficiency as joint encoding. DSC has attracted much attention with the growth of wireless sensor networks, where the correlated sources captured by the sensors have to be encoded without communicating with each other and decoded jointly at the sink node.
The Slepian-Wolf theorem [2] for lossless coding and the Wyner-Ziv theorem [3] for lossy coding are the most important theoretical foundations of DSC. The first practical DSC strategy was proposed in [4] to exploit the potential of the Slepian-Wolf theorem by introducing channel codes. The statistical dependence between the two sources is modeled as a virtual correlation channel, and one source is used as the side information to help decode the other source.
DVC combines DSC methods with traditional intraframe video coding systems so as to shift the complicated motion search operations from the encoder to the decoder, which suits the “multiple-access” model well. A classical framework called power-efficient, robust, high-compression, syndrome-based multimedia coding (PRISM) was proposed in [5]. In PRISM, the P frames are encoded following the same procedures as the I frames but at lower rates, while motion search is used at the decoder to estimate the side information from the neighboring recovered I frames. Based on the side information, the P frames can be successfully reconstructed using DSC decoding algorithms.
Girod et al. [6] investigated DVC in a similar framework but used turbo codes instead of the trellis codes in [5]. The scheme in [7] tackled multiview correlations using DVC methods. DVC not only provides a lightweight video encoder but also contributes to the low-cost protection of traditional video coding. The layered Wyner-Ziv video coding system [8] achieved robust video transmission by adding Wyner-Ziv bitstream layers as enhancement layers.
Another recent direction toward a lightweight encoder exploits the sparsity of the intraframe signal, known as compressed sensing (CS) [9–14]. In traditional intraframe coding, a large number of pixel values are first transformed, and then only the important low-frequency coefficients are entropy coded while the other coefficients are discarded. In CS, we denote an N-dimensional vector as
As long as an M-by-N sensing matrix Φ can be found incoherent with the representing matrix Ψ, the signal X can be sampled as
The solution to the optimization program is
Sparsity [11] is one of the two essential ingredients of CS and guarantees that the signal can be compressively sampled; the other is incoherence [12], which determines the admissible sensing matrices. Usually, random matrices are largely incoherent with any fixed representing matrix. In the CS methodology, image and video signals are usually sparse in the discrete cosine transform (DCT) basis or the discrete wavelet transform (DWT) basis [15], so the frame signal can be directly compressed to a small number of measurements. Thus, CS may greatly improve the efficiency and decrease the complexity of intraframe compression. CS has been applied to image coding by many researchers [16, 17], and [18, 19] discussed modified transform, blocking, and quantization methods for video coding that account for the features of CS measurements.
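The sampling-and-recovery idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the recovery method used in this paper: the signal is assumed to be sparse directly in the sampling domain (the representing matrix is taken as the identity), the sensing matrix is i.i.d. Gaussian, and recovery uses plain matching pursuit, a simple greedy method substituted here for the optimization program and for GPSR. The names `sense` and `matching_pursuit` and all parameter values are hypothetical.

```python
import math
import random


def sense(phi, x):
    """Compute the measurement vector y = Phi x (Phi given as a list of rows)."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


def matching_pursuit(phi, y, n_iter):
    """Greedy sparse recovery: repeatedly pick the column most correlated
    with the residual and subtract its contribution."""
    m, n = len(phi), len(phi[0])
    cols = [[phi[i][j] for i in range(m)] for j in range(n)]
    norms2 = [sum(c * c for c in col) for col in cols]
    theta = [0.0] * n
    r = list(y)
    for _ in range(n_iter):
        corr = [sum(c * ri for c, ri in zip(col, r)) for col in cols]
        j = max(range(n), key=lambda k: abs(corr[k]) / math.sqrt(norms2[k]))
        step = corr[j] / norms2[j]
        theta[j] += step
        r = [ri - step * c for ri, c in zip(r, cols[j])]
    return theta, r


# Illustrative sizes (assumptions): N coefficients, M measurements, K nonzeros.
rng = random.Random(0)
N, M, K = 64, 32, 3
theta_true = [0.0] * N
for idx in rng.sample(range(N), K):
    theta_true[idx] = rng.gauss(0.0, 10.0)

# Random Gaussian sensing matrix, scaled so columns have roughly unit norm.
phi = [[rng.gauss(0.0, 1.0 / math.sqrt(M)) for _ in range(N)] for _ in range(M)]
y = sense(phi, theta_true)

theta_hat, residual = matching_pursuit(phi, y, n_iter=50)
y_norm = math.sqrt(sum(v * v for v in y))
res_norm = math.sqrt(sum(v * v for v in residual))
print(f"measurements: {len(y)}, residual norm: {res_norm:.4f} (of {y_norm:.4f})")
```

Each matching pursuit step removes the projection of the residual onto one column, so the residual norm never increases; how closely `theta_hat` matches `theta_true` depends on M, K, and the particular random matrix.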
Combining DVC and CS is a main research direction for video compressed sensing, called distributed compressed video sensing (DCVS). In [20], the authors first reconstructed the difference between frame signals using the ordinary gradient projection for sparse reconstruction (GPSR) [21] algorithm and then recovered each signal. GPSR is essentially a gradient projection algorithm applied to a quadratic programming formulation of (1), in which the search path at each iteration is obtained by projecting the negative-gradient direction onto the feasible set. The authors of [22] tried to exploit the correlation between random measurements of signals using the Wyner-Ziv method, but such a cascaded design needs two sets of encoders and decoders, which increases the complexity. Taking the side information into account, the scheme in [23] modified the initialization and various stopping criteria within GPSR recovery. The idea is similar to ours, but its recovery algorithm, which is based on basis pursuit, differs from ours. It should also be noted that none of these schemes considered the transmission noise of the measurements.
In this paper, a novel DCVS scheme is proposed that uses side-information-based belief propagation (SI-BP) to exploit both the interframe and intraframe correlations of a video sequence. Each frame is compressed separately using structured sensing matrices. The P frames, however, have a lower compression ratio because they can be jointly decoded using SI-BP, where the reconstructed adjacent I frames are used to generate the side information. The SI-BP algorithm is derived from Bayesian inference, so the frame signal model is crucial for the performance of the DCVS scheme. Two signal models are introduced, the mixture Gaussian (MG) and the wavelet hidden Markov tree (WHMT). Because the recovery method is based on Bayesian inference, our scheme is resilient to the noisy transmission channels that are inevitable in wireless networks. Although a recovery method based on belief propagation (BP) has been introduced into CS in [24], it is designed only for a single signal, not for image or video signals.
The proposed DCVS scheme has three advantages. First, the system structure of DVC is adopted so that the motion search is moved from the encoder to the decoder, which reduces the complexity of the encoder. Second, a coding-theory-like sensing and recovery scheme is proposed based on Bayesian inference, where the SI-BP algorithm is used; this is quite different from the optimization-based recovery schemes in prior work. Third, the MG and WHMT models of wavelet coefficients are exploited to recover signals from the viewpoint not only of sparsity but also of statistical distribution. Fully exploiting the properties of the video frame signal makes our scheme robust to noise-prone transmission channels.
The rest of the paper is organized as follows. Section 2 introduces the two frame signal models, MG and WHMT. In Section 3, we present the details of the proposed DCVS scheme, which consists of separate compression and joint recovery algorithms. In Section 4, simulation results are presented and discussed. Finally, Section 5 gives some concluding thoughts and directions for future work.
2. Frame Signal Models
In our SI-BP-based DCVS scheme, the correlations between the frames are used to initialize the decoding iteration, so the correlation models strongly affect the performance of the decoder. Generally, the frame signal of a video is compressible in the DCT basis or the DWT basis, and the transform coefficients constitute a compressible signal with a particular structure. In this section, we discuss the effectiveness of the MG model and the WHMT model.
2.1. Mixture Gaussian Model
The MG model has proven to be a simple yet effective model of real sparse signals in image processing and inference problems. The wavelet coefficients of the video frames can be regarded as a K-sparse signal
The information-theoretic bounds on the performance of CS for particular sparse representations have been a continuing focus of research [25–27]. For such a two-state MG sparse signal, [25] derived the rate-distortion bounds using the mean squared error (MSE) distortion measure. The simple upper bound on
The lower bound on
These results hold for a single DWT vector; for video coding, the correlation between the I frame and the P frame should also be exploited. Thus, we discuss the rate-distortion bound for one sparse signal X with the side information Y. Based on the correlation model
These results on the rate-distortion bounds of DWT coefficients are the theoretical foundations of our proposed scheme. Note that the MG model is not specific to video frame signals but is a generic model for any sparse signal. In other words, the particular features of video frame signals are not fully captured by this model. We therefore adopt another model, the WHMT model, which is introduced in the next subsection.
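To make the two-state MG model concrete, the following Python sketch draws a synthetic coefficient vector in which K entries are in the “large” state and the rest are in the “small” state. It is a minimal illustration under assumptions that are not taken from this paper: both states are zero-mean Gaussians, exactly K entries are large (rather than each entry being large with some probability), and the function name `sample_mixture_gaussian` and the numeric parameters are hypothetical.

```python
import random


def sample_mixture_gaussian(n, k, sigma_large, sigma_small, rng):
    """Draw n coefficients from a two-state mixture Gaussian.

    Exactly k coefficients are in the "large" state (std sigma_large);
    the remaining n - k are in the "small" state (std sigma_small).
    Both states are assumed zero-mean for this sketch.
    """
    states = [1] * k + [0] * (n - k)
    rng.shuffle(states)  # random positions for the large coefficients
    coeffs = [rng.gauss(0.0, sigma_large if s else sigma_small) for s in states]
    return states, coeffs


rng = random.Random(1)
states, coeffs = sample_mixture_gaussian(
    n=256, k=16, sigma_large=30.0, sigma_small=1.0, rng=rng
)
large = [abs(c) for s, c in zip(states, coeffs) if s]
small = [abs(c) for s, c in zip(states, coeffs) if not s]
print(f"mean |large| = {sum(large) / len(large):.2f}, "
      f"mean |small| = {sum(small) / len(small):.2f}")
```

A signal drawn this way is compressible rather than strictly sparse: the small-state coefficients are nonzero but carry little energy compared with the K large ones.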
2.2. Wavelet Hidden Markov Tree Model
The DWT coefficients of an image or a video frame have a quad-tree structure [15], which was well studied before the advent of CS. Recent wavelet-based CS methods [28, 29] incorporate this prior information into the CS reconstruction. We also introduce it into our joint recovery scheme.
Figure 2 shows the tree structure of the 3-level DWT coefficients of an image. The coefficients are decomposed with high-pass (H) and low-pass (L) filters at each level.

Figure 2: The quad-tree structure of wavelet transform coefficients.
The DWT coefficients are modeled as a WHMT, which is a generalization of the two-state MG model. The two hidden states of this WHMT are again the “large” state and the “small” state. The coefficient values of each state are drawn from a Gaussian distribution
3. Implementation of DCVS Using SI-BP
In this section, we will focus on the implementation of DCVS with SI-BP, including the separate compression and joint recovery algorithms. The framework of DCVS is shown in Figure 3. A frame group is assumed to consist of one I frame and two P frames. The I frames

Figure 3: The system model of the proposed DCVS.
The coefficients of I frames and P frames are sampled by sensing matrices
3.1. Separate Compression
The BP and SI-BP recovery algorithms operate on a bipartite graph, so the sensing matrices have to be designed as low-density sensing matrices (LDSMs).
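As a rough illustration of what a low-density sensing matrix looks like, the Python sketch below builds a sparse matrix with a fixed number of nonzero entries per row and uses it to take measurements. The construction is an assumption made for illustration only (fixed row weight, nonzero values drawn from ±1, uniformly random positions); it is not the specific LDSM design of this paper, and `make_ldsm`, `measure`, and the sizes are hypothetical names and values.

```python
import random


def make_ldsm(m, n, row_weight, rng):
    """Build an m-by-n sparse sensing matrix as a list of {column: value} rows.

    Each row (check node) touches exactly `row_weight` coefficients
    (variable nodes), with values drawn from {-1, +1}. This is an
    assumed construction for illustration.
    """
    rows = []
    for _ in range(m):
        cols = rng.sample(range(n), row_weight)
        rows.append({c: rng.choice((-1.0, 1.0)) for c in cols})
    return rows


def measure(rows, x):
    """Compute the measurements y_i = sum_j Phi[i][j] * x[j], skipping zeros."""
    return [sum(v * x[c] for c, v in row.items()) for row in rows]


rng = random.Random(2)
ldsm = make_ldsm(m=32, n=128, row_weight=8, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(128)]
y = measure(ldsm, x)

# The nonzero pattern doubles as the edge list of the bipartite graph:
# check node i is connected to the variable nodes listed in ldsm[i].
edges = [(i, c) for i, row in enumerate(ldsm) for c in row]
print(f"{len(y)} measurements, {len(edges)} graph edges")
```

Because each row has only a few nonzeros, both the measurement cost at the encoder and the number of messages per iteration at the decoder grow with the number of edges rather than with M times N.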
For the MG model, the LDSMs
For the WHMT model, according to the unequal importance of different parts of the tree, the LDSM is redefined as
In other words, the LDSM is written in a layered format
So, for both the MG and the WHMT model, the measurements of frames are generated as
The compression ratios are calculated as
3.2. Joint Recovery
Let us first recall the principles of the BP algorithm in traditional channel decoding. The BP algorithm approximately calculates the marginal distributions of all the variable nodes from the global distribution by passing messages iteratively along the edges of the bipartite graph. At the beginning of the iterations, the variable nodes are initialized by the received codeword, while the check nodes have no external information. In contrast, the BP algorithm in CS [24] has no prior information for the variable nodes but has external information for the check nodes.
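The two elementary message operations of sum-product BP can be sketched on a toy example. The Python code below is a generic illustration, not the SI-BP update rules of this paper: it assumes integer-valued variables with discrete probability tables in place of sampled pdfs, a noiseless check of the form y = x1 + x2 with unit weights, and hypothetical function names (`variable_update`, `check_update`). A variable node multiplies the beliefs it receives, and a check node convolves the beliefs of the other variables and reflects the result around the measurement.

```python
def normalize(p):
    """Scale a {value: weight} table so that the weights sum to 1."""
    total = sum(p.values())
    return {v: w / total for v, w in p.items()}


def variable_update(messages):
    """Variable node: pointwise product of all incoming beliefs."""
    support = set(messages[0])
    for m in messages[1:]:
        support &= set(m)
    out = {v: 1.0 for v in support}
    for m in messages:
        for v in out:
            out[v] *= m[v]
    return normalize(out)


def convolve(p, q):
    """Distribution of the sum of two independent discrete variables."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * qb
    return out


def check_update(y, others):
    """Check node for y = x_i + sum(others): belief about x_i = y - sum(others)."""
    total = {0: 1.0}
    for m in others:
        total = convolve(total, m)
    return normalize({y - s: w for s, w in total.items()})


# Toy example: x1 has a flat prior, x2 has an informative prior
# (playing the role of prior knowledge at a variable node), and y = 3.
prior_x1 = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
prior_x2 = {1: 0.1, 2: 0.9}
msg_to_x1 = check_update(3, [prior_x2])
belief_x1 = variable_update([prior_x1, msg_to_x1])
print(belief_x1)
```

In this example the check node carries external information (the measurement y) and the variable nodes carry priors, so the final belief about x1 concentrates on the value consistent with both.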
The proposed SI-BP algorithm is different from the aforementioned situations because both the variable nodes and check nodes have prior information in SI-BP. The correlation between the side information frame and the current P frame to be decoded can be modeled as a virtual channel. So the side information
To clarify the iterative BP decoding algorithms, we first state the following assumptions.
Denote the variable nodes by The messages sent on the edges of the bipartite graph are probability density functions (pdfs). As the pdf is real function over the interval
The decoding algorithms for the MG model and the WHMT model are similar but differ in the initialization. Moreover, the WHMT model degenerates to the MG model by setting all
Each variable node of I frames is initialized as
It can be found that the side information
The check node
The variable node
The message passing is repeated for the desired number of iterations. Finally, we get the pdf of each variable node as
In the SI-BP algorithm, the side information
4. Simulation Results and Analysis
For the simulations, we select the commonly used YUV 4:2:0 video sequences “Coastguard” and “Foreman” to test the performance of the DCVS algorithm, where the former is in QCIF format and the latter is in CIF format. Only the Y frames are processed. For the CIF sequence, a frame is divided into
To evaluate the performance of the SI-BP-based recovery scheme against traditional convex optimization recovery schemes, we compare our SI-BP scheme with the DCVS algorithm of [23], which uses GPSR [21]. Both the two-state MG model and the WHMT model are simulated in our scheme. For the MG model, the sparsity K, the standard deviation
4.1. The Noiseless Channel Case
First, the ideal transmission channel is considered. Figure 4 compares the performance of the proposed SI-BP schemes and the GPSR scheme on the two video sequences “Foreman” and “Coastguard”. The SI-BP scheme based on the WHMT model performs better than both the SI-BP scheme based on the MG model and the GPSR scheme. This is because the WHMT model is specific to video frame signals and relies only on the statistical properties of the signal, which can be obtained by training. In addition, the irregular density distribution of the LDSM protects the important low-layer DWT coefficients, so the WHMT-based scheme performs better than the MG-based one. For the MG model, the PSNR performance on the Coastguard sequence is worse than that of the GPSR scheme, while on the Foreman sequence it is better. This is because the MG model considers both the sparse and the statistical properties of the signal, and the SI-BP recovery algorithm depends heavily on the accuracy of the signal model. We can infer that the sparse property plays a more important role in the noiseless channel case.
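For reference, the PSNR figure of merit used in these comparisons can be computed as in the short Python sketch below. It uses the standard definition for 8-bit samples (peak value 255) applied to flattened lists of luma values; the function name `psnr` and the sample values are illustrative, and the exact evaluation procedure of the simulations (for example, averaging over frames) is not specified here.

```python
import math


def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(reconstructed):
        raise ValueError("frames must have the same number of samples")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


# Tiny illustrative "frames" of luma samples.
original = [52, 55, 61, 66, 70, 61, 64, 73]
recovered = [54, 55, 60, 67, 69, 62, 64, 70]
print(f"PSNR = {psnr(original, recovered):.2f} dB")
```

Higher PSNR means a smaller mean squared error between the recovered and the original frame.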

Figure 4: PSNR performance for the (a) “Foreman” CIF and (b) “Coastguard” QCIF sequences with the noiseless channel.
4.2. The Noisy Channel Case
Next, the PSNR performance of the three schemes is simulated for the error-prone channel case, where the noise standard deviation

Figure 5: PSNR performance for the (a) “Foreman” CIF and (b) “Coastguard” QCIF sequences, where the transmission channel is an AWGN channel with the noise standard deviation
Figure 6 further demonstrates the resilience of the SI-BP scheme to noisy measurements. The PSNR performances of the SI-BP scheme based on the WHMT and MG models and of the GPSR scheme are compared at the average compression ratio

Figure 6: PSNR performance for the “Foreman” CIF sequence with the average compression ratio
5. Conclusions and Future Work
In this paper, a novel DCVS scheme using side-information-based belief propagation (SI-BP) is proposed to address the multiaccess model of video processing. Each video frame is compressed to its measurements separately, and the intraframe and interframe correlations are exploited at the joint decoder. The SI-BP recovery algorithm is based on Bayesian inference, which is quite different from the optimization-based recovery schemes in prior CS work, and it is error resilient when the measurements are transmitted through noise-prone channels.
The proposed DCVS scheme shifts the complexity of video coding to the decoder and provides a lightweight and efficient encoder for constrained camera sensors. The decoding algorithm based on SI-BP is better suited to practical applications where transmission noise is inevitable. In the future, we will further improve the performance of SI-BP-based DCVS in noiseless scenarios and extend the results of this paper to other video and image analysis tasks, such as motion tracking.
Acknowledgments
This work is supported by the National Science Foundation of China (no. 61201149), the 111 Project (no. B08004), and the Fundamental Research Funds for the Central Universities. This work is also supported (in part) by the Korea Evaluation Institute of Industrial Technology (KEIT), under the R&D support program of the Ministry of Knowledge Economy, Korea. The authors would like to thank all of the reviewers and editors for their detailed comments, which have certainly improved the quality of this paper.
