IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 4, JUNE 2013


Cloud-Based Image Coding for Mobile Devices—Toward Thousands to One Compression

Huanjing Yue, Xiaoyan Sun, Jingyu Yang, and Feng Wu, Senior Member, IEEE

Abstract—Current image coding schemes make it hard to utilize external images for compression even if highly correlated images can be found in the cloud. To solve this problem, we propose a method of cloud-based image coding that is different from current image coding even on the ground. It no longer compresses images pixel by pixel and instead tries to describe images and reconstruct them from a large-scale image database via the descriptions. First, we describe an input image based on its down-sampled version and local feature descriptors. The descriptors are used to retrieve highly correlated images in the cloud and identify corresponding patches. The down-sampled image serves as a target to stitch retrieved image patches together. Second, the down-sampled image is compressed using current image coding. The feature vectors of local descriptors are predicted by the corresponding vectors extracted from the decoded down-sampled image. The prediction residual vectors are compressed by transform, quantization, and entropy coding. The experimental results show that the visual quality of reconstructed images is significantly better than that of intra-frame coding in HEVC and JPEG at thousands to one compression.

Index Terms—Image compression, local descriptor, mobile devices, SIFT (scale-invariant feature transform), the cloud.

I. INTRODUCTION

THE cloud is characterized by a large amount of computing resources, storage, and data [1]. Imagine a cloud that collects a huge number of images, e.g., Google street view images [2]. When you randomly take a picture with your phone on the street, you can often find some highly correlated images in the cloud that were taken at the same location with different viewpoints, angles, focal lengths, and illuminations. If you try to share the photo with friends through the cloud, it is problematic to use conventional image coding (e.g., JPEG), which usually provides only an 8:1 compression ratio [3]. Transmitting such a high-resolution, high-quality JPEG image consumes a great deal of precious power and network bandwidth. It would be more convenient to take advantage of the cloud for compression and transmission if there is a high probability of finding very similar images in the cloud.

Manuscript received March 02, 2012; revised July 26, 2012; accepted November 13, 2012. Date of publication January 11, 2013; date of current version May 13, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Xiaoqing Zhu. H. Yue and J. Yang are with Tianjin University, Tianjin 300072, China (e-mail: [email protected]; [email protected]). X. Sun and F. Wu are with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2013.2239629

However, state-of-the-art image coding, consisting of directional intra prediction and transform [4], [5], makes it hard to take advantage of highly correlated images in the cloud. Intra prediction uses decoded neighboring pixels from the same image to generate predictions, and the pixels are then coded after subtracting the predictions. It requires the predictions used at the encoder and decoder to be identical. The idea of intra prediction cannot be extended to external images. First, when mobile devices compress images, it is difficult to know which highly correlated images can be found in the cloud. Second, the number of images in the cloud is huge, so it is impossible to store them in mobile devices, even partially. Third, the images in the cloud change dynamically, so maintaining image consistency between the cloud and all mobile devices would be very costly.

Image search has been demonstrated to be a successful application on the Internet [6]. By submitting the description of an image, including semantic content [7], outline [8], [9], or local feature descriptors [10], [11], one can easily retrieve many similar images. Near- and partial-duplicate image detection is a hot research topic in this field [10], [12], [13]. However, the purpose of image search is not to generate an image from the search results. In fact, reconstructing a given image from similar images is a tougher problem than image search itself.

Recent efforts have shed light on using a large-scale image database to recover, compose, and even reconstruct an image [14]–[19]. In particular, Weinzaepfel et al. are the first to reconstruct an image by using SIFT (Scale-Invariant Feature Transform) local feature descriptors [20]. The follow-up work presented in [21] tries to reconstruct an image using SIFT and SURF (Speeded-Up Robust Features) descriptors. However, it is very challenging to reconstruct a visually pleasing image using local feature descriptors only.

To solve the above problem, we propose describing input images by SIFT descriptors and their down-sampled images. SIFT descriptors are extracted from the original images and are used to retrieve near- and partial-duplicate images in the cloud and to identify corresponding patches. The down-sampled images play an important role in making the reconstructed images visually pleasing: they are used to verify every retrieved image patch and to guide how the image patches are stitched so that the result resembles the given image. The down-sampled image is compressed by conventional image coding or by intra-frame coding in video. Using the correlation between the down-sampled image and the SIFT descriptors to compress the SIFT feature vectors is another important technical contribution of this paper. We first compress the locations, scales, and orientations of the SIFT descriptors and then use them to extract prediction vectors from the decoded down-sampled image, so that the high-dimension SIFT vectors can be efficiently compressed by prediction and transform coding.



The reconstruction of images in this paper is similar to that in [20]. Besides using the down-sampled image, there are two other technical differences. First, we adopt the approach proposed in [11], [12] for partial-duplicate image retrieval, which bundles features instead of using a single feature and thus significantly increases retrieval accuracy. Second, we use all SIFT descriptors in a patch to estimate the patch transformation with the RANSAC algorithm [22], and a perspective projection model is used.

The rest of this paper is organized as follows. Section II gives a brief overview of related work. Section III proposes the cloud-based image encoder and decoder. Section IV discusses how to generate image descriptions. Section V presents the proposed SIFT compression. Section VI describes how to reconstruct high-quality images. The experimental results are presented in Section VII. Section VIII discusses limitations and future work. Finally, Section IX concludes this paper.

II. RELATED WORK

A. Visual Content Generation

Image inpainting is the task of filling in or replacing a region of an image [23]. Many approaches have been proposed to learn from the known regions of an image and then recover an unknown region from what has been learned. Hays et al. are the first to propose image completion from a large-scale image database [14]. Their approach uses GIST descriptors to retrieve images of similar scenes, which are applied to recover the unknown region of an image. Whyte et al. later propose retrieving images of the same scene from the Internet via viewpoint-invariant search and replacing a user-specified region [15].

Image composition is a long-standing research topic. Here we focus on the papers that compose an image from a sketch based on a large-scale image database. Eitz et al. propose the first sketch-to-photo scheme [16], [17]. It takes a sketch drawn by a user as input and retrieves correlated images by sketch-based search. Finally, it composes an image by graph cut and Poisson blending. Chen et al. improve the scheme by taking both a sketch and text labels as input [18]. Image composition can be viewed as the recovery of all objects in an image, where the objects are represented only by the input sketch. It is a much tougher problem than image inpainting and completion, and human interactions are required in current sketch-to-photo schemes. Although only a small number of bits is required to represent a sketch and text labels, it is unacceptable to rely on user interaction for compressing an image. From the compression viewpoint, composed images must look like the input images in detail, but for an input sketch, image composition may generate quite different results in color and texture. Taking CG (computer graphics) images as input is an improvement over a sketch [19].

Recently, Weinzaepfel et al. propose reconstructing an image from SIFT descriptors [20]. Although the purpose of their work is to study privacy in image search, it is actually the first to reconstruct an image from local feature descriptors. Daneshi et al. further use SURF descriptors and study the role of scale data in reconstruction [21].

Although content can be recognized in the reconstructed images, which is consistent with the goal of image compression, the reconstructed visual quality is poor because SIFT descriptors provide only local information.

To make image description and reconstruction meet the goal of image compression, it is important to select descriptions with a better trade-off between reconstruction distortion and data size. A sketch is a highly abstracted description that does not contain any details. Local feature descriptors provide local details only. Neither of them can guarantee that the reconstructed images look like the input images. To solve this problem, we propose describing input images by local feature descriptors and their down-sampled images.

B. Local Feature Compression

SIFT descriptors, proposed by Lowe in [24], present distinctive invariant features of images and consist of location, scale, orientation, and a feature vector. The scale and location of SIFT descriptors are determined by the maxima and minima of difference-of-Gaussian images. One orientation is assigned to each SIFT descriptor according to the dominant direction of the local gradient histogram. The feature vector is a 128-dimension vector that characterizes a local region by gradient histograms in different directions. Since SIFT descriptors give a good interpretation of the response properties of complex neurons in the visual cortex [25] and excellent practical performance, they have been extensively applied to object recognition, image retrieval, 3D reconstruction, annotation, watermarking, and so on.

Compression of SIFT descriptors has recently become a requirement of mobile applications. One image usually has thousands of SIFT descriptors; without any data reduction and compression, the total size of the SIFT descriptors may even exceed the image size. Ke et al. propose applying principal components analysis (PCA) to greatly reduce the dimension of the feature vector [26]. Hua et al. propose linear discriminant analysis to reduce the dimension of the feature vector [27]. Chandrasekhar et al. propose transform coding of the feature vectors [28]. Yeo et al. propose using coarsely quantized random projections of descriptors to build binary hashes [29]. Jegou et al. decompose feature vectors into a Cartesian product of low-dimension vectors that are quantized separately [30].

Several approaches propose directly reducing the dimension of the generated feature vectors. SURF reduces the dimension to 64 [21] with performance similar to SIFT. Compressed histogram of gradients (CHoG) descriptors, proposed by Chandrasekhar et al. [31], [32], are designed for compression. They not only change the generation of the gradient histogram but also compress it by tree coding and entropy coding, achieving a high compression ratio. However, judging from both practical adoption and recent evaluations in MPEG [33], SIFT descriptors are still a good choice in many applications.

Compared with conventional image coding, the approaches for the compression of SIFT descriptors are far from mature. Thus, several papers propose compressing images first and then extracting SIFT descriptors from the decompressed images. Makar et al. propose compressing image patches at different scales and then generating SIFT descriptors from the decompressed patches [34]. This performs better than direct compression of SIFT descriptors. However, if there are many patches to code in an image, they overlap each other, which leads to low performance.


Chao et al. analyze the locations of SIFT descriptors in an image and assign more bits to the regions with SIFT descriptors in JPEG compression [35].

All of the above approaches target the compression of SIFT feature vectors only. Since we target reconstructing an image, we need not only the SIFT feature vectors but also the SIFT locations, scales, and orientations, because the latter tell us where retrieved image patches should be stitched. For the same reason, we also need the down-sampled image. Since SIFT feature vectors have a strong correlation with the image, the problem examined in this paper is how to compress both the down-sampled image and all information of the SIFT descriptors efficiently. The basic idea proposed in this paper mimics inter-frame coding in video: the prediction of a feature vector is extracted from the up-sampled decoded image, and residual feature vectors are compressed after prediction by transform, quantization, and entropy coding.

C. Image Reconstruction

Many approaches use SIFT descriptors for image retrieval [36]–[40], where SIFT feature vectors are quantized into visual words [37]. Quantization of SIFT feature vectors makes image retrieval applicable to a large-scale database, but it also reduces the discriminative power of the SIFT feature vectors. To alleviate this problem, Chum et al. propose expanding highly ranked images from the original query as new queries [38]. Philbin et al. propose quantizing a SIFT descriptor to multiple visual words [39]. Jegou et al. introduce binary signatures to refine visual words [40]. Zhou et al. propose quantizing feature vectors to bit-vectors [41]. For near- and partial-duplicate image retrieval, the geometric relationship of visual words plays an important role. To utilize this information, Wu et al. propose bundling a maximally stable region and visual words together [12]. Zhou et al. propose using spatial coding to represent the spatial relationships among SIFT descriptors in an image [13].

Image alignment is a historic research topic, and most key papers have been surveyed in [42]. When SIFT descriptors are available for two images to be aligned, the most popular approach for estimating the transformation between them is RANSAC [22]. Torr et al. improve the algorithm by using maximum likelihood estimation instead of the number of inliers [43]. Philip et al. introduce a pyramid structure with ascending resolutions to improve the performance of RANSAC [44]. Chum et al. greatly speed up the approach by introducing confident matches [45].

III. THE PROPOSED CLOUD-BASED IMAGE CODING

The block diagram of the proposed cloud-based image encoder is shown in Fig. 1. For the input image, a down-sampled image is first generated and compressed. SIFT descriptors are also extracted from the original image. The location, scale, and orientation of every extracted SIFT descriptor are used to extract a feature vector, serving as a prediction, from the decompressed image after up-sampling. Finally, the prediction is subtracted from the feature vector. All components of the SIFT descriptors are compressed and transmitted to the cloud along with the compressed down-sampled image.
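As a rough illustration of this encoder flow, the following Python sketch strings the steps together. The helper names compress_thumbnail, decompress_thumbnail, and encode_sift are hypothetical placeholders rather than components of the paper, and OpenCV's SIFT stands in for the descriptor extractor; this is a sketch of the pipeline, not the authors' implementation.

```python
import cv2
import numpy as np

def encode_image(img, ratio=16):
    """Hypothetical sketch of the proposed encoder flow (Fig. 1)."""
    # 1. Down-sample the input image (Gaussian filtering + decimation).
    thumb = cv2.resize(img, None, fx=1.0 / ratio, fy=1.0 / ratio,
                       interpolation=cv2.INTER_AREA)

    # 2. Compress the thumbnail with a conventional codec (HEVC intra in the
    #    paper); the codec wrappers below are assumed helpers.
    thumb_bits = compress_thumbnail(thumb)
    thumb_rec = decompress_thumbnail(thumb_bits)

    # 3. Extract SIFT descriptors (location, scale, orientation, vector)
    #    from the original image.
    sift = cv2.SIFT_create()
    keypoints, vectors = sift.detectAndCompute(img, None)

    # 4. Re-extract feature vectors at the same locations, scales, and
    #    orientations from the up-sampled decoded thumbnail; these serve as
    #    predictions. (Keypoints are assumed to survive recomputation.)
    up = cv2.resize(thumb_rec, (img.shape[1], img.shape[0]),
                    interpolation=cv2.INTER_CUBIC)
    _, predictions = sift.compute(up, keypoints)

    # 5. Form the prediction residuals; transform, quantization, and entropy
    #    coding of the residuals are left to the assumed encode_sift helper.
    residuals = vectors.astype(np.float32) - predictions.astype(np.float32)
    sift_bits = encode_sift(keypoints, residuals)
    return thumb_bits, sift_bits
```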


Fig. 1. The block diagram of the proposed cloud-based image encoder.

The block diagram of the proposed cloud-based image decoder is depicted in Fig. 2. In the cloud, a server first decompresses the down-sampled image and the SIFT data. By using the decompressed location, scale, and orientation of every SIFT descriptor again, one prediction vector, exactly the same as that in Fig. 1, is extracted from the decompressed image after up-sampling. Then, the SIFT feature vector is reconstructed by adding the prediction. To reconstruct the input image, the decompressed SIFT descriptors are used to retrieve highly correlated image patches. For every retrieved image patch, we estimate its transformation and then stitch it onto the up-sampled decompressed image. Finally, a high-resolution and high-quality image is obtained.

The block diagrams of the encoder and decoder shown in Figs. 1 and 2 look very different from those of conventional image coding. In Sections IV–VIII, we discuss the details of the proposed cloud-based image coding.

IV. EXTRACTION OF IMAGE DESCRIPTORS

In this paper, a down-sampled image is used to describe the target for reconstruction in the cloud because it carries enough information, including outline, shape, color, and objects. Furthermore, after an image is down-sampled, it can be efficiently compressed by conventional image coding. The down-sampling process in this paper is described as follows:

$$\bar{I}(x, y) = \sum_{(m, n) \in \Omega} g(m, n)\, I(x - m,\, y - n), \tag{1}$$

where $g$ is the discrete low-pass filter with support $\Omega$. The kernel of the filter is a Gaussian function with variance $\sigma^2$. After the filtering, the down-sampled image is generated as

$$I_d(x, y) = \bar{I}(s x,\, s y), \tag{2}$$

where $s$ is an integer. Parameter $s$ is the down-sampling ratio in one dimension. $s$ is often set to 2 or 4 in most current applications. In this paper, $s$ is set to 8 or 16, so the down-sampling factor of images is 64:1 or 256:1. It is much larger than the factors considered in super-resolution, so that our targeted compression ratio can be achieved.
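A minimal sketch of (1) and (2) in Python with OpenCV is given below. The kernel support and the exact variance used in the paper are not specified; sigma_scale is an assumed parameter that ties the Gaussian variance to the sampling ratio.

```python
import cv2

def downsample(img, s=16, sigma_scale=0.5):
    """Gaussian low-pass filtering (1) followed by s:1 decimation (2).

    sigma_scale is an assumed parameter, not a value from the paper; it
    simply makes the filter bandwidth shrink as the sampling ratio grows.
    """
    sigma = sigma_scale * s
    # ksize=(0, 0) lets OpenCV derive the kernel support from sigma.
    blurred = cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=sigma)
    return blurred[::s, ::s]  # keep every s-th sample in each dimension
```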


Fig. 2. The block diagram of the proposed cloud-based image decoder.
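As a rough counterpart to the encoder sketch above, the decoder flow of Fig. 2 can be outlined as follows. Retrieval, transformation estimation, and stitching are detailed in Sections V and VI; every helper named here (decode_thumbnail, decode_sift, retrieve_correlated_patches, estimate_patch_transform, stitch_patch) is a hypothetical placeholder.

```python
import cv2

def decode_image(thumb_bits, sift_bits, target_size):
    """Hypothetical sketch of the cloud-side decoder flow (Fig. 2)."""
    # 1. Decompress the thumbnail and the SIFT data.
    thumb_rec = decode_thumbnail(thumb_bits)            # assumed helper
    keypoints, residuals = decode_sift(sift_bits)       # assumed helper

    # 2. Up-sample the thumbnail and re-extract the prediction vectors at
    #    the decoded locations, scales, and orientations; add the residuals.
    up = cv2.resize(thumb_rec, target_size, interpolation=cv2.INTER_CUBIC)
    _, predictions = cv2.SIFT_create().compute(up, keypoints)
    vectors = predictions + residuals

    # 3. Retrieve highly correlated patches from the cloud database, verify
    #    each one against the up-sampled image, and stitch it in.
    result = up.copy()
    for patch in retrieve_correlated_patches(keypoints, vectors):   # assumed
        transform = estimate_patch_transform(patch, result)         # assumed
        result = stitch_patch(result, patch, transform)             # assumed
    return result
```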

Fig. 3. Part of the Gaussian scale space with the 3rd and 4th octaves. The marked point indicates an extreme point detected in the difference-of-Gaussian images, and the red rectangle indicates the corresponding SIFT descriptor.

By repeatedly applying the filtering (1) and down-sampling (2) with $s = 2$ to the input image, a Gaussian scale space is generated, as shown on the left side of Fig. 3. The image used here comes from the INRIA Holiday dataset [46]. The space consists of a set of images $L(x, y, n, m)$, where $n$ is an octave index indicating that the input image has been down-sampled $n$ times and $m$ is a filtering index within an octave. The differences of Gaussian images are generated by $D(x, y, n, m) = L(x, y, n, m+1) - L(x, y, n, m)$; they are depicted on the right side of Fig. 3. In fact, all $D(x, y, n, m)$ constitute a Laplacian scale space, in which a scale index is defined jointly by the octave index and the filtering index.

Feature points are detected as maxima and minima in $D$. If one sample is the largest or smallest among its eight neighbors in the same difference image and the nine neighbors in each of the two adjacent difference images, it is taken as a feature point. Fig. 3 marks one such point together with the corresponding SIFT descriptor. Let us define a vector $\mathbf{x} = (x, y, \sigma)^{T}$. The sub-pixel location and finer scale index are derived by fitting a 3D quadratic to $D$. The Taylor expansion of $D$ at the detected point is described as

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x} + \frac{1}{2}\,\mathbf{x}^{T}\,\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x}. \tag{3}$$

The accurate extreme point can be calculated by

$$\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}\frac{\partial D}{\partial \mathbf{x}}. \tag{4}$$

The refined $(x, y)$ and scale index give the location and scale of a SIFT descriptor. In the Gaussian scale space, only a limited number of scales is available, so the descriptor is associated with the Gaussian image whose scale index is closest to its refined scale.

The orientation of the SIFT descriptor is calculated in a region around its location whose size is proportional to its scale. Local image gradients are calculated using a centered derivative mask. The orientation range of 360 degrees is evenly partitioned into 36 bins, and a histogram with 36 bins is calculated by

$$h(k) = \sum_{(x, y):\, \theta(x, y) \in \mathrm{bin}_k} w(x, y)\, m(x, y), \qquad k = 0, \ldots, 35, \tag{5}$$

where $m(x, y)$ and $\theta(x, y)$ are the local gradient magnitude and orientation, and $w(x, y)$ is a weighting factor defined as a Gaussian function centered at the descriptor location. The histogram is normalized and approximated by a polynomial function; the orientation at the highest peak of the function is taken as the descriptor orientation. Besides the highest peak, if other peaks exist with a value above 80% of the highest peak, multiple SIFT descriptors are generated for these peak orientations with the same location and scale index.

The feature vector is also extracted in a region around the descriptor location whose size is proportional to its scale. The region is first rotated to the descriptor orientation, which provides rotation invariance for the SIFT descriptors; a factor of $\sqrt{2}$ guarantees a complete rectangle region after rotation. The rectangle region is further partitioned into 4 × 4 uniform sub-regions. The gradient orientation is evenly partitioned into 8 bins, and for each sub-region $R_j$ an 8-dimension vector is generated by

$$v_j(k) = \sum_{(x, y) \in R_j:\, \theta(x, y) \in \mathrm{bin}_k} w(x, y)\, m(x, y), \qquad k = 0, \ldots, 7, \tag{6}$$

where $w(x, y)$ is again a weighting factor defined as a Gaussian function centered at the descriptor location. Combining all 16 sub-regions, we get one 128-dimensional vector. After normalization, it is the feature vector of a SIFT descriptor.

Fig. 4. An exemplified prediction of SIFT descriptors. The right side is the Gaussian scale space generated from the original image, with several extracted SIFT descriptors drawn on it. The left side is the Gaussian scale space generated from the up-sampled decompressed image. The SIFT descriptors in this space are extracted by using the same locations and orientations as those on the right side.

V. COMPRESSION OF IMAGE DESCRIPTORS

The down-sampled image is directly compressed by the intra-frame coding of HEVC [5]. The core problem in our proposed scheme is how to efficiently compress the SIFT descriptors. As shown in Fig. 4, the key idea proposed in this paper is to use feature vectors extracted from the decompressed image as predictions, similar to inter-frame coding in video. Two questions are answered here. The first is whether the feature vectors extracted from the down-sampled image are good predictions. The second is how to efficiently use the predictions for the compression of SIFT descriptors.

A. Prediction Evaluation

First of all, we observe that SIFT descriptors extracted from a down-sampled image differ from those extracted from the original image in location and scale index. This is reasonable because of the down-sampling process. Fortunately, we also observe a strong correlation between the feature vectors extracted from the down-sampled image and those from the original image when the scale index is larger than a certain value. Here we evaluate this correlation. The decompressed down-sampled image is first up-sampled to the original resolution, and a Gaussian scale space is generated by the approach described in Section IV. For a SIFT descriptor extracted from the original image, we generate a prediction vector $\hat{v}$ in this scale space by using the descriptor's location, scale, and orientation. We evaluate the prediction using the normalized mean square error (MSE)

$$\mathrm{MSE} = \frac{1}{128}\sum_{i=1}^{128}\big(v_i - \hat{v}_i\big)^{2}, \tag{7}$$

where $v$ is the feature vector extracted from the original image.

Fig. 5. The predicted MSE in different octaves.

We take the same image used in Figs. 3 and 4 as an example and draw the curves of the average MSE of all SIFT descriptors in an octave versus the octave index, as shown in Fig. 5. We have tested other images in the dataset and observed similar results. In Fig. 5, different approaches are used to generate the down-sampled image. The curve “Non-coded” indicates that the down-sampled image is not coded. The curves labeled with a QP value indicate that the down-sampled image is coded by HEVC with the given QP. In this experiment, the image is down-sampled by 16 in each dimension, which corresponds to scale index 4. We first observe that, when the QP is equal to or less than 22, the curves are close to that without compression. We also observe that, if the scale index is equal to or larger than 4, the average MSE approaches zero, which indicates that the prediction vector predicts the original feature vector very well. When the scale index is 2 or 3, the prediction is good but has some errors. For scale indices smaller than 2, the prediction becomes considerably poorer because the details have been removed by down-sampling.

B. Compression of SIFT Descriptors

Based on the above evaluation, we propose different strategies for compressing SIFT feature vectors in different octaves. For octaves whose scale indices are large (no smaller than the scale index of the down-sampling ratio), the feature vector is not coded; its prediction directly serves as its reconstruction. For octaves whose scale indices are smaller than that by up to a small constant (usually set to 1 or 2), the residual vector is coded by transform, quantization, and entropy coding.
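The prediction and residual formation behind these strategies can be sketched as follows: SIFT feature vectors are recomputed on the up-sampled decoded thumbnail at the decoded locations, scales, and orientations, and the per-descriptor MSE of (7) is evaluated. OpenCV's SIFT is used as a stand-in, and the keypoint objects are assumed to carry the decoded location, size, and angle.

```python
import cv2
import numpy as np

def predict_vectors(upsampled_rec, keypoints):
    """Extract prediction vectors at fixed keypoints (decoded location,
    scale, and orientation) from the up-sampled decoded thumbnail."""
    sift = cv2.SIFT_create()
    _, predictions = sift.compute(upsampled_rec, keypoints)
    return predictions

def normalized_mse(original_vectors, predictions):
    """Per-descriptor mean square error of (7) over the 128 dimensions.

    The same difference (original minus prediction) is the residual that is
    transformed, quantized, and entropy coded in Section V-B.
    """
    diff = original_vectors.astype(np.float32) - predictions.astype(np.float32)
    return np.mean(diff ** 2, axis=1)
```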


Fig. 6. The coding process of SIFT descriptors in one image.

For the remaining octaves, the scale indices are small, so the retrieved patches are too small to help image reconstruction; those SIFT descriptors are discarded directly. Selecting part of the SIFT descriptors is a common practice in mobile applications [32]. Although it may have a negative impact on search results and reconstructed quality, it is a cost-effective way for practical applications.

The coding process of the SIFT descriptors is depicted in Fig. 6. The locations of all SIFT descriptors are compressed first; the number of SIFT descriptors can then be determined from the number of locations. All scale indices and orientations are then compressed in raster scan order, respectively. According to the coding strategies discussed above and the coded scale indices, we know which residual feature vectors should be coded. Finally, they are compressed one by one.

Locations are important in the proposed scheme. First, they indicate where orientations and feature vectors are calculated. Second, they are used to calculate the transformation between retrieved image patches and the input image, which requires high precision. Thus, they are only quantized into integers at the original resolution. Given the set of all SIFT locations, where M is the total number, we first generate a binary matrix $B$ of the same size as the input image, whose elements are defined as

$$B(x, y) = \begin{cases} 1, & \text{if } (x, y) \text{ is a SIFT location}, \\ 0, & \text{otherwise}. \end{cases} \tag{8}$$

This matrix is compressed by binary arithmetic coding. For any nonzero element, an integer is used to indicate the number of SIFT descriptors at that location; it is most likely one, but the maximum number allowed is four. It is compressed by small-alphabet arithmetic coding. This approach is similar to the coding of a block of DCT coefficients in H.264 [4].

The scale decides the region sizes used to calculate the orientation and the feature vector. For every descriptor, the octave index is coded with 3 bits, and the refined scale within the octave is quantized into 16 levels at a precision of 1/16 and coded with 4 bits. The orientation decides the rotation invariance of SIFT descriptors. Every orientation is quantized into 128 levels (a precision of 360/128 degrees) and coded with 7 bits. Therefore, the octave index, scale, and orientation of a SIFT descriptor take 14 bits in total.

In the compression of the residual feature vectors, a binary string is first generated, where each bit indicates whether a residual feature vector is zero or not after transform and quantization. This string is also compressed by binary arithmetic coding. Given a residual vector to code, we organize it as two 8 × 8 matrices

$$M_1 = \begin{bmatrix} r_1 \\ \vdots \\ r_8 \end{bmatrix}, \qquad M_2 = \begin{bmatrix} r_9 \\ \vdots \\ r_{16} \end{bmatrix}, \tag{9}$$

where $r_i$ is an 8-dimension row vector of the residual. Each matrix is transformed by an 8 × 8 DCT. After quantization, most coefficients are zero and the non-zero coefficients are most likely small integers.
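A minimal sketch of the residual transform coding of (9): a 128-dimension residual is reshaped into two 8 × 8 matrices, each transformed by an 8 × 8 DCT and uniformly quantized. The quantization step is a free parameter here, and the binary significance map and arithmetic coding that follow are omitted.

```python
import cv2
import numpy as np

def code_residual(residual, qstep=30.0):
    """Transform and quantize one 128-dim residual vector as in (9)."""
    # Reshape row-wise into two 8x8 matrices and apply an 8x8 DCT to each.
    blocks = residual.astype(np.float32).reshape(2, 8, 8)
    coeffs = np.stack([cv2.dct(b) for b in blocks])
    # Uniform quantization; the zero/non-zero map and the non-zero levels
    # would then be entropy coded.
    return np.round(coeffs / qstep).astype(np.int32)

def decode_residual(levels, qstep=30.0):
    """Inverse quantization and inverse DCT back to a 128-dim residual."""
    rec = np.stack([cv2.idct(l.astype(np.float32) * qstep) for l in levels])
    return rec.reshape(128)
```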

Fig. 7. Energy compaction of residual feature vectors.

Similar to the approach in (8), the DCT coefficients are converted to a binary matrix and a set of non-zero integers: one bit indicates whether a coefficient is zero or not, and an integer gives the value of each non-zero coefficient.

Finally, we evaluate the DCT transformation in our scheme. The role of the transformation is to compact signal energy. We randomly select ten thousand residual feature vectors in our scheme and plot their average energy distribution in the pixel domain and the DCT domain in Fig. 7. We observe that the energy after transformation concentrates at low and middle frequencies and is reduced at high frequencies. But the energy compaction is obviously not as good as for residual images. Considering the computation of the transform, whether the transformation can be skipped should be further studied in the future.

VI. IMAGE RECONSTRUCTION

The decoding of SIFT descriptors is the inverse of the process described in Section V. Thus we skip the details of decoding and assume that both the down-sampled image and the SIFT descriptors have been decoded. This section focuses on image reconstruction.

A. Patch Retrieval

The first step of reconstruction is to retrieve highly correlated images. Assume that the SIFT descriptors of all images in the cloud have been extracted. Fifty million SIFT feature vectors are selected randomly and trained into one million visual words by approximate k-means [37]. Every SIFT feature vector in the cloud is quantized to the visual word with the minimum Euclidean distance. In an image, the region of a SIFT descriptor with a large scale index often partially or completely covers the regions of some SIFT descriptors with small scale indices. We bundle them as a group. One image often has tens to hundreds of groups. Every group is represented by a set of visual words and their geometric relationship in the image. Decoded SIFT feature vectors are quantized to visual words and organized into groups too.
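A minimal sketch of the visual-word quantization step, assuming the one-million-word vocabulary has already been trained offline (approximate k-means in the paper); a brute-force nearest-centroid search is used here purely for illustration.

```python
import numpy as np

def quantize_to_visual_words(vectors, vocabulary):
    """Assign each SIFT feature vector to its nearest visual word.

    vectors:    (N, 128) array of SIFT feature vectors.
    vocabulary: (K, 128) array of visual-word centroids (trained offline).
    Returns an (N,) array of visual-word indices.
    """
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (np.sum(vectors ** 2, axis=1, keepdims=True)
          - 2.0 * vectors @ vocabulary.T
          + np.sum(vocabulary ** 2, axis=1))
    return np.argmin(d2, axis=1)
```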


Every group is matched with all the groups in the cloud. The matching is scored by the number of matched visual words and their geometric relationship. This score is assigned to the image that contains the group. After all groups are matched, several images with high sum scores are selected as the highly correlated images for reconstruction. More details can be found in [12], [13].

In order to guarantee matching precision, patch retrieval operates on the 128-dimension vectors instead of visual words. Every decoded SIFT descriptor is independently matched with every SIFT descriptor of the selected images. The matching criterion between a decoded descriptor $s$ and a descriptor $t$ from the selected images is defined as

$$\mathrm{match}(s, t) = \begin{cases} 1, & \text{if } \lambda\, d_{1} < d_{2}, \\ 0, & \text{otherwise}, \end{cases} \tag{10}$$

where the feature vector of $t$ is the closest one to that of $s$ in squared Euclidean distance, $d_1$ is that distance, $d_2$ is the distance to the second-closest feature vector, and $\lambda$ is a constant usually set to 1.5. From a matched SIFT descriptor $t$, we can get a patch from the retrieved images according to its location, scale, and orientation. Similarly, we can get another patch from the up-sampled decompressed image according to $s$.

B. Patch Transformation

The flowchart of image reconstruction in this paper is similar to that in [20]. In addition to the thumbnail being applied to verify every patch and guide the stitching, another key technical difference from [20] is that we use all the SIFT descriptors located on a patch to estimate the transformation of the patch. Therefore, the RANSAC algorithm is adopted [22] and a perspective projection model is used.

The retrieved patch is a high-resolution patch and the patch from the up-sampled decompressed image is a low-resolution one, so it is difficult to estimate an accurate transformation between them by pixel-based matching. Fortunately, the SIFT locations of both patches are extracted at high precision, so feature-based matching is adopted in the proposed scheme. The corresponding feature points are detected by matching the SIFT descriptors of the two patches. Since the descriptor sets restricted to the two patches are much smaller than those of the whole images, more pairs of matched SIFT descriptors can be found using the same criterion (10).

In general, the transformation from one patch to the other is defined as a planar projection with eight parameters. The parameters are estimated by RANSAC [22]. Several pairs of SIFT locations are randomly selected to calculate a candidate transformation, and the remaining pairs are used to verify it. For a pair, if the Euclidean distance between one point and the transformed position of the other is smaller than a given threshold, the pair is called an inlier to the candidate; otherwise it is an outlier. The candidate transformation and its number of inliers are recorded, and the process is repeated. If a new candidate has more inliers, the recorded transformation and number of inliers are updated. Finally, we obtain the transformation $H$ with the maximum number of inliers.

Besides the estimated $H$, we can immediately write another transformation from the location, scale index, and orientation of a matched descriptor pair if the transformation between the two patches is assumed to be a combination of translation, rotation, and scaling:

$$H' = \begin{bmatrix} \alpha \cos\Delta\theta & -\alpha \sin\Delta\theta & t_x \\ \alpha \sin\Delta\theta & \alpha \cos\Delta\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{11}$$

where $\alpha$ is the ratio of the two descriptors' scales, $\Delta\theta$ is the difference of their orientations, and $(t_x, t_y)$ is determined by their locations. $H'$ is a rival transformation to $H$, and in the proposed scheme we select the better one during patch stitching. Although $H'$ is usually not as accurate as $H$, it can be used to control the error if $H$ is not correct. Furthermore, it is easier to obtain because there is no need to run RANSAC.

C. Patch Stitching

Both transformations $H$ and $H'$ are applied to the retrieved patch to get two candidate patches. The up-sampled decompressed image is used to guide the patch stitching. The two candidate patches are each matched within a region of the up-sampled decompressed image centered at the target location. The match at every location is scored by the MSE between the candidate patch and the corresponding patch in the up-sampled decompressed image. The better patch and the best location are decided by the minimum score, and we obtain the patch that will be stitched. If the minimum score is larger than a threshold, we discard the patch because it is most likely not correct for this region. Since the patch may come from an image with different illumination, it is blended into the up-sampled image by Poisson editing [20], [47].
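A sketch of the patch transformation step, under the assumption that matched keypoint locations are available: the eight-parameter perspective model is estimated with RANSAC via OpenCV, and the rival transform of (11) is built directly from one matched descriptor pair's location, scale, and orientation. The reprojection threshold is illustrative, not a value from the paper.

```python
import cv2
import numpy as np

def estimate_homography(src_pts, dst_pts, reproj_thresh=3.0):
    """Perspective model between matched SIFT locations, estimated by RANSAC.

    src_pts, dst_pts: (N, 2) float32 arrays of matched keypoint locations.
    """
    H, inlier_mask = cv2.findHomography(src_pts, dst_pts,
                                        cv2.RANSAC, reproj_thresh)
    return H, inlier_mask

def similarity_from_pair(kp_src, kp_dst):
    """Rival transform (11): translation + rotation + scaling derived from a
    single pair of matched cv2.KeyPoint objects (no RANSAC needed)."""
    scale = kp_dst.size / kp_src.size
    angle = np.deg2rad(kp_dst.angle - kp_src.angle)
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    x0, y0 = kp_src.pt
    x1, y1 = kp_dst.pt
    # Rotate and scale about the origin, then translate so that the source
    # keypoint lands exactly on the target keypoint.
    tx = x1 - (c * x0 - s * y0)
    ty = y1 - (s * x0 + c * y0)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]], dtype=np.float32)
```

During stitching, whichever of the two warped patches better matches the up-sampled decompressed image (lowest MSE) would be kept; a routine such as cv2.seamlessClone could stand in for the Poisson blending step.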

Fig. 8. The selected images used as inputs in our experiments, denoted as “a” to “j” in raster scan order.

VII. EXPERIMENTAL RESULTS AND ANALYSES

We use the INRIA Holiday dataset [46], which has 1491 images in total, for the experiments in this paper. Images in this dataset have resolutions of up to 8M pixels. There are multiple images of the same scene captured at different viewpoints and focal lengths. As shown in Fig. 8, 10 images are selected as input images and the rest are used as the images in the cloud. For convenience, we denote them in this paper from “a” to “j” in raster scan order. We select the intra-frame coding of HEVC and JPEG as the anchors. HEVC is an on-going video coding standard and is much better than H.264 [5]. JPEG is the most popular image coding standard on the Internet, although its coding efficiency

is not good to date. Three schemes will be evaluated for both compression ratio and visual quality.

TABLE I
IMAGE SIZES, COMPRESSED HEVC SIZES, AND COMPRESSED SIFT SIZES IN THE PROPOSED SCHEME


TABLE II
COMPARISONS WITH INTRA-FRAME CODING OF HEVC AND JPEG

TABLE III
AVERAGE SCORES AND PSNR FOR JPEG, HEVC, AND THE PROPOSED SCHEME

A. Compression Ratio

In the proposed scheme, all images are down-sampled by 256:1 and compressed by the intra-frame coding of HEVC. In order to get a high compression ratio, all coding tools in HEVC are enabled. The quantization step for all images is set to 22. All SIFT descriptors in the octaves with scale indices larger than 2 are compressed. Since there are too many SIFT descriptors in the 2nd octaves, only the part of the SIFT descriptors that can be predicted better is selected for compression. The residual feature vectors are transformed and quantized; the quantization step is set to 30, a normal parameter for inter-frame coding.

The experimental results are listed in Table I. The average size of the down-sampled images is 6.14 KB after HEVC compression. Each image has a different number of SIFT descriptors, ranging from 227 to 722 depending on image content; the average number of SIFT descriptors per image is 500. The average size of all SIFT descriptors per image is 1.82 KB after compression by the proposed scheme, equivalent to about 30 bits per SIFT descriptor on average.

The total size and compression ratio of every image are listed in Table II. For the proposed scheme, the total size is 7.96 KB on average and the corresponding compression ratio is 1885:1; the maximum compression ratio is as high as 4000:1. JPEG results are generated by the software IrfanView [48]. We try to make the JPEG file sizes close to those of the proposed scheme; even at the lowest quality setting, the file size is still 42.84 KB on average and the corresponding compression ratio is only 338:1. HEVC results are generated using the reference software HM 4.0 [49] with the quantization step set to 51. The file size is 13.44 KB on average and the compression ratio is 1319:1. Even for HEVC, the most efficient image coding scheme to date, the file size is still larger than that of the proposed scheme by 70% on average.

B. Visual Quality

For the files listed in Table II, we evaluate their visual quality here. Since the coding artifacts of JPEG and HEVC are quite different, it is hard to evaluate them by an objective criterion.

We adopt the double-stimulus continuous quality-scale (DSCQS) method [50] for subjective testing. Twenty undergraduate students without any coding experience served as assessors. The decoded images are displayed on an ultra-HD monitor in random order. Image quality is scored from 1 to 5, indicating bad, poor, fair, good, and excellent.

The average scores of every image are listed in Table III. The scores of JPEG and HEVC are only 1.1 and 2.42, respectively, indicating that their quality is bad and poor. The average score of the proposed scheme is 3.61, indicating that the visual quality is between fair and good. We notice that the scores of 50% of the images are more than 4 in the proposed scheme; in other words, their quality is close to the excellent grade even after thousands to one compression. Images “b” and “f” have low scores for the proposed scheme because there are several reconstruction artifacts. PSNR is also given in Table III as a reference.

Due to space limitations, only the reconstructed results of images “a”, “e”, and “f” at different scores are shown in Fig. 9. Each group of results in Fig. 9 consists of two rows. In the first row, the images from left to right are the original image and the results of the proposed scheme, HEVC, and JPEG. We can observe that the results of the proposed scheme are very consistent with the input images. The second row shows the details of a corresponding region indicated by red rectangles. The artifacts of HEVC and JPEG can be observed clearly, whereas the results of the proposed scheme present the details clearly. As we have pointed out, the score of image “f” is low in the proposed scheme.


Fig. 9. The reconstructed results of images “a”, “e”, and “f”. Each group of results consists of two rows. In the first row, the images from left to right are the original image and the results of the proposed scheme, HEVC, and JPEG. The second row shows the details of a corresponding region indicated by red rectangles. The red circles indicate reconstruction artifacts in the proposed scheme.

Although the result looks like the original image, some people in front of the building are not reconstructed well, as marked by the red circles. This is a typical artifact of the proposed scheme.

C. Highly Correlated Images

Highly correlated images in this paper mainly refer to images taken of the same scene with different viewpoints, focal lengths, and illuminations. In fact, the proposed scheme can be extended to the case of partial-duplicate images [12], [13], where the scenes are different but some regions and objects are similar.

Fig. 10. Highly correlated images retrieved for images “a”, “e”, and “f”.

Highly correlated images play an important role in the proposed scheme. We show four retrieved images for images “a”, “e”, and “f” in Fig. 10. We observe that the image retrieval performs well and highly correlated images can be found and ranked first. For images “a” and “e”, one and two of the retrieved images, respectively, are non-correlated. But since their patches have large MSE values with respect to the decompressed image, they are finally discarded and do not degrade the results.

What is the result of the proposed scheme if no highly correlated images are retrieved? For image “a”, we remove the first four correlated images in Fig. 10 from the dataset and then reconstruct it. The four newly retrieved images are shown at the top of Fig. 11, and they are obviously not correlated to image “a”. The reconstructed result is also shown in Fig. 11. We observe that all patches except one are removed because of their large MSE values with respect to the image. The remaining patch is stitched in the region marked by a red rectangle and does not introduce annoying artifacts. The final result is the up-sampled decompressed image, which looks very blurry.

Fig. 11. Images in the first row are the new correlated images for image “a”. The second row is the result of the proposed scheme in this case. The red rectangle indicates the single stitched patch.

D. Complexity Analyses

The client needs to encode and decode a down-sampled image, extract SIFT descriptors, and encode the SIFT descriptors. Since the size of the down-sampled image is small, its encoding and decoding are simple. As discussed in Section V, the proposed SIFT compression has low computation without training, complicated prediction, or large-size transforms. The largest computation is needed for building a pyramid image space and extracting SIFT feature vectors. Since two octaves with small scales are skipped, the computation has been reduced greatly. In our current implementation, it usually takes 5–7 seconds on a single-core phone at 1 GHz. This time is expected to decrease to 1 second or less on a four-core phone, which is now available in the market.

Computation in the cloud is a key problem. Correlated images are retrieved by visual words, and highly correlated patches are retrieved only from several selected images, so their computations are not high. However, both the RANSAC algorithm [22] and Poisson editing have high computation. In our current implementation, it often takes a minute or more to reconstruct an image. The RANSAC estimation of every patch is well suited to parallel computation: we can take hundreds of cores in the cloud to calculate the projection models of all patches simultaneously. Poisson editing, however, can only be performed patch by patch. Recently, we have found that it can be approximated by membrane interpolation [51]. In our parallel implementation, the reconstruction time has been reduced to several seconds [52].

E. Comparison With SIFT Feature Vector Coding

Finally, we evaluate the proposed compression approach in the image retrieval scenario by comparing it to CHoG [31], [32]. The Zurich Building Database (ZuBuD) is used [53]. It contains 1005 images of 201 buildings in Zurich at a resolution of 640 × 480, and 115 query images at a resolution of 320 × 240. The experiment is set up strictly according to the method in [32]. For the proposed scheme, SIFT descriptors are extracted from the original query images. Thumbnails are generated by 2:1 down-sampling in each dimension and then compressed by HEVC intra-frame coding. SIFT locations, scales, orientations, and residual vectors are compressed by the approach described in Section V. The compressed description size is adjusted by changing the quantization steps for the thumbnail from 20 to 50 at an interval of 5. At the same time, the number of SIFT descriptors varies by changing the scale indices and the predicted MSE values.

As shown in Fig. 12, 7 compressed descriptions have sizes from 0.33 kB to 8.62 kB, including the thumbnails. The vertical axis is the recall, which is defined as the percentage of query images correctly retrieved [32]. The SIFT curve shows the results without any compression; it should provide the best retrieval results obtainable with SIFT descriptors. As shown in Fig. 12, point D is the un-compressed version of points D1, D2, and D3, which have different quantization steps. We observe that the proposed compression causes only up to 1.8% recall loss at point B1 as compared with point B, while the query size is reduced by more than 90%. The curve of CHoG from [32] is also drawn in Fig. 12. The two approaches have similar recall at similar query sizes, and both show a sharply decreasing recall when the query sizes become too small. One thing we have to mention is that, for the proposed compression, the query includes the thumbnail, which can potentially be applied to improve retrieval quality and re-rank search results.


Fig. 12. The recall rate versus query size for SIFT, CHoG, and the proposed compression approach.

VIII. FURTHER DISCUSSIONS

The proposed cloud-based image coding is not a replacement for conventional image coding. Conventional image coding is a mature way to exploit the correlations of pixels within the same image for compression. Only when a large-scale database of images is available in the cloud can the proposed coding scheme exploit the correlations with external images and significantly reduce the compressed data size.

A. Limitations

The large-scale image database plays the most important role in the quality of the reconstructed images. For an input image, if we cannot find highly correlated images in the cloud, the scheme cannot output a good-quality reconstruction. Although mobile photo sharing is discussed as an exemplary application, it is not clear which practical cloud applications can always have highly correlated images available. This will limit the usage of the proposed scheme.

There are some technical limitations to the proposed scheme too. Images of some complicated scenes may be difficult to reconstruct. For example, there are often people in front of famous places. Although the background can be reconstructed from the correlated images, it is hard to accurately reconstruct a face from other images. One solution for this issue is to segment the foreground with several user interactions and then compress the foreground by conventional image compression. In addition, an image of a dynamic scene (e.g., billboards, windmills, etc.) may also be difficult to reconstruct.

B. Future Work

The description of the input image needs to be further studied. If the dimension of the feature vectors can be reduced as in SURF and CHoG, more bits will be saved for residual feature vector coding. In addition, the optimization between distortion and rate should be carefully studied for coding SIFT descriptors; some coding parameters are set empirically in this paper. Last but not least, retrieved image patches are currently stitched into the up-sampled decompressed image one by one. This can be improved in two ways. First, for complicated scenes, it is difficult to find a correct transformation model for a large patch; it can be split into multiple pieces, and each piece can be refined before stitching. Second, a patch completely replaces the corresponding region in the proposed scheme; since the decompressed image is more reliable in the low frequencies, the patch stitching could be improved in the frequency domain.

IX. CONCLUSIONS

This paper proposes a cloud-based image coding scheme in which images are described by SIFT descriptors and down-sampled images. The SIFT descriptors are compressed by prediction from the corresponding SIFT descriptors extracted from the down-sampled images, followed by transform coding. Finally, high-quality and high-resolution images are reconstructed from a set of images in the cloud using the proposed description. Experimental results show that the proposed scheme not only achieves 1885:1 compression on average but also obtains high subjective scores; the reconstructed quality of half of the test images is close to that of the original images.

ACKNOWLEDGMENT

The authors would like to thank Dr. K. He and Ms. J. Liu for valuable insights on image reconstruction and for providing an implementation of Poisson editing.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Commun. ACM, vol. 53, pp. 50–58, 2010. [2] [Online]. Available: https://developers.google.com/maps/documentation/streetview [3] G. K. Wallace, "The JPEG still picture compression standard," Commun. ACM, vol. 34, pp. 30–44, 1991. [4] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003. [5] G. J. Sullivan and J. R. Ohm, "Recent developments in standardization of high efficiency video coding (HEVC)," Proc. SPIE, vol. 7798, pp. 77980V1–V7, 2010. [6] Y. Rui, T. S. Huang, and S. F. Chang, "Image retrieval: Current techniques, promising directions, open issues," J. Visual Commun. Image Representation, vol. 10, pp. 39–62, 1999. [7] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Computing, Commun. Appl., vol. 2, pp. 1–19, 2006. [8] J. R. Smith and S. F. Chang, "Visualseek: A fully automated content-based image query system," ACM Multimedia, pp. 87–98, 1996. [9] C. H. Wang, Z. W. Li, and L. Zhang, "Mindfinder: Image search by interactive sketching and tagging," in Proc. WWW, 2010, pp. 1309–1312. [10] Y. Ke, R. Sukthankar, and L. Huston, "An efficient parts-based near-duplicate and sub-image retrieval system," in Proc. 12th ACM Int. Conf. Multimedia, 2004, pp. 869–876. [11] Q. F. Zheng, W. Q. Wang, and W. Gao, "Effective and efficient object-based image retrieval using visual phrases," in Proc. ACM Multimedia, 2006, pp. 77–80. [12] Z. Wu, Q. F. Ke, M. Isard, and J. Sun, "Bundling features for large scale partial-duplicate web image search," in Proc. IEEE Conf. CVPR, 2009, pp. 25–32. [13] W. G. Zhou, Y. J. Lu, H. Q. Li, Y. B. Song, and Q. Tian, "Spatial coding for large scale partial-duplicate web image search," in Proc. ACM Multimedia, 2010, pp. 511–520. [14] J. Hays and A. A. Efros, "Scene completion using millions of photographs," ACM Trans. Graphics, vol. 126, 2007. [15] O. Whyte, J. Sivic, and A. Zisserman, "Get out of my picture! Internet-based inpainting," in Proc. Brit. Mach. Vis. Conf., 2009, pp. 1–11. [16] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa, "PhotoSketch: A sketch based image query and composition," in Proc. SIGGRAPH, 2009, p. 60. [17] M. Eitz, R. Richter, K. Hildebrand, T. Boubekeur, and M. Alexa, "PhotoSketcher: Interactive sketch-based image synthesis," IEEE Comput. Graph. Applic., vol. 31, pp. 56–66, 2011.


[18] T. Chen, M. M. Cheng, P. Tan, A. Shamir, and S. M. Hu, “PhotoSketch: Internet image montage,” SIGGRAPH Asia, 2009. [19] M. K. Johnson, K. Dale, S. Avidan, H. Pfister, and W. T. Freeman, “CG2Real: Improving the realism of computer generated images using a large collection of photographs,” IEEE Trans. Vis. Comput. Graphics, vol. 17, no. 9, pp. 1273–1285, Sep. 2011. [20] H. J. Weinzaepfel and P. Pérez, “Reconstructing an image from its local descriptors,” in Proc. IEEE Conf. CVPR, 2011, pp. 337–344. [21] M. Daneshi and J. Q. Guo, “Image reconstruction based on local feature descriptors,” 2011. [Online]. Available: http://www.stanford.edu/class/ee368/project_11/reports/Daneshi_GUO_image_reconstrcution_from_descriptors.pdf [22] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, vol. 24, pp. 381–395, 1981. [23] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proc. SIGGRAPH, 2000, pp. 417–424. [24] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, pp. 91–110, 2004. [25] S. Edelman, N. Intrator, and T. Poggio, “Complex cells and object recognition.” [Online]. Available: http://kybele.psych.cornell.edu/~edelman/Archive/nips97.pdf [26] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. IEEE Conf. CVPR, 2004, vol. 2, pp. 506–513. [27] G. Hua, M. Brown, and S. Winder, “Discriminant embedding for local image descriptors,” in Proc. IEEE Conf. CV, 2007, pp. 1–8. [28] V. Chandrasekhar, G. Takacs, D. Chen, S. S. Tsai, J. Singh, and B. Girod, “Transform coding of image feature descriptors,” in Proc. SPIE Conf. Vis. Commun. Image Process., 2009, vol. 7257, Art. ID 725710. [29] C. H. Yeo, P. Ahammad, and K. Ramchandran, “Rate efficient visual correspondences using random projections,” in Proc. IEEE Conf. Image Process., 2008, pp. 217–220. [30] H. Jégou, M. Douze, and C. Schmid, “Product quantization for neareast neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011. [31] V. Chandrasekhar, G. Takacs, D. Chen, S. S. Tsai, R. Grzeszczuk, and B. Girod, “CHoG: Compressed histogram of gradients a low bit-rate feature descriptor,” in Proc. IEEE Conf. CVPR, 2009, pp. 2504–2511. [32] V. Chandrasekhar, G. Takacs, D. Chen, S. S. Tsai, R. Grzeszczuk, and B. Girod, “Compressed histogram of gradients: A low-bitrate descriptor,” Int. J. Comput. Vis., vol. 96, pp. 384–399, 2012. [33] G. Francini, S. Lepsoy, and M. Balestri, “Description of test model under consideration for CDVS,” Geneva, Switzerland, ISO/IEC JTC1/ SC29/WG11, N12367, Dec. 2011, , , . [34] M. Makar, C. L. Chang, D. Chen, S. S. Tsai, and B. Girod, “Compression of image patches for local feature extraction,” in Proc. IEEE Conf. ASSP, 2009, pp. 821–824. [35] J. S. Chao and E. Steinbach, “Preserving SIFT features in JPEG-encoded images,” in Proc. IEEE Conf. Image Process., 2011, pp. 301–304. [36] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. IEEE Conf. Comput. Vis., 2003, vol. 2, pp. 1470–1477. [37] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. IEEE Conf. CVPR, 2007, pp. 1–8. [38] O. Chum, J. Philbin, J. S. M. Isard, and A. 
Zisserman, “Total recall: Automatic query expansion with a generative feature model for object retrieval,” in Proc. IEEE Conf. Comput. Vis., 2007, pp. 1–8. [39] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image database,” in Proc. IEEE Conf. CVPR, 2008, pp. 1–8. [40] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Proc. Eur. Conf. Comput. Vis., 2008. [41] W. G. Zhou, Y. J. Lu, H. Q. Li, and Q. Tian, “Scalar quantization for large scale image search,” in Proc. ACM Multimedia, 2012, pp. 169–178. [42] R. Szeliski, “Image alignment and stitching: A tutorial,” Found. Trends Comput. Graphics Vis., vol. 2, pp. 1–104, 2012.

[43] P. Torr and A. Zisserman, “MLESAC: A new robust estimator with application to estimating image geometry,” J. Comput. Vis. Image Understanding, vol. 78, pp. 138–156, 2000. [44] H. S. Philip and D. Colin, “IMPSAC: Synthesis of importance sampling and random sample consensus,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 3, pp. 354–364, Mar. 2003. [45] O. Chum and J. Matas, “Matching with Prosac—Progressive sample consensus,” in Proc. IEEE Conf. CVPR, 2005, pp. 220–226. [46] H. Jégou and M. Douze, “INRIA Holiday Dataset,” 2008. [Online]. Available: http://lear.inrialpes.fr/people/jegou/data.php [47] P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Trans. Graphics, vol. 22, pp. 313–318, 2003. [48] IrfanView. [Online]. Available: http://www.irfanview.com [49] HEVC Test Model (HM). ver. 4.0 [Online]. Available: http://hevc. info/HM-doc [50] “Methodology for the Subjective Assessment of the Quality of Television Pictures,” ITU, Recommendation ITU-R BT. 5000-11, 2002. [51] Z. Farbman, R. Fattal, and D. Lischinski, “Convolution pyramids,” in Proc. SIGGRAPH Asia, 2011, pp. 175:1–175:8. [52] L. Dai, H. Yue, X. Sun, and F. Wu, “IMShare: Instantly sharing your mobile landmark images by search-based reconstruction,” in Proc. ACM Multimedia, 2012, pp. 579–588. [53] H. Shao, T. Svoboda, and L. V. Gool, “Zubud-Zurich Buildings Database for Image Based Recognition,” ETH Zurich, Tech. Rep. 206, 2003. Huanjing Yue received the B.S. degree in electrical engineering from Tianjin University, Tianjin, China, in 2010, where she is currently working toward the Ph.D. degree in electrical engineering. Her research interests include image compression and image super-resolution.

Xiaoyan Sun received the B.S., M.S., and Ph.D. degrees in computer science from Harbin Institute of Technology, Harbin, China, in 1997, 1999, and 2003, respectively. Since 2004, she has been with Microsoft Research Asia, Beijing, China, where she is currently a Lead Researcher with the Internet Media Group. She has authored or co-authored more than 50 journal and conference papers, ten proposals to standards. She has filed seven granted patents. Her current research interests include vision-based and cloud-based image and video compression. Dr. Sun was a recipient of the Best Paper Award of the IEEE Transactions on Circuits and Systems for Video Technology in 2009.

Jingyu Yang received the B.E. degree from Beijing University of Posts and Telecommunications, Beijing, China, in 2003, and the Ph.D. degree (with honors) from Tsinghua University, Beijing, China, in 2009. Since 2009, he has been with the faculty of Tianjin University, China, where he is currently an Associate Professor of electronic and information engineering. He visited Microsoft Research Asia (MSRA) from February to August 2011 within MSRA's young scholar supporting program, and the Signal Processing Laboratory at EPFL, Lausanne, Switzerland, from July to September 2012. His research interests mainly include image/video processing and computer vision. He was selected into the program for New Century Excellent Talents in University (NCET) of the Ministry of Education of China in 2011.


Feng Wu received the B.S. degree in electrical engineering from Xidian University in 1992, and the M.S. and Ph.D. degrees in computer science from Harbin Institute of Technology in 1996 and 1999, respectively. He joined Microsoft Research Asia, formerly named Microsoft Research China, as an Associate Researcher in 1999. He has been a researcher with Microsoft Research Asia since 2001 and is now a Senior Researcher/Research Manager. His research interests include image and video compression, media communication, and media analysis and synthesis. He has authored or co-authored over 200 high-quality papers and filed 67 U.S. patents, and 13 of his techniques have been adopted into international video coding standards. As a co-author, he received the best paper award at IEEE T-CSVT 2009, PCM 2008, and SPIE VCIP 2007. He is a Senior Member of the IEEE. He serves as an associate editor of the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Multimedia, and several other international journals. He also served as TPC chair of MMSP 2011, VCIP 2010, and PCM 2009, TPC track chair of ICME 2012, ICME 2011, and ICME 2009, and special sessions chair of ICME 2010.
