
Landmark Image Super-Resolution by Retrieving Web Images

Huanjing Yue, Xiaoyan Sun, Senior Member, IEEE, Jingyu Yang, Member, IEEE, and Feng Wu, Fellow, IEEE

Manuscript received January 23, 2013; revised June 19, 2013 and August 15, 2013; accepted August 15, 2013. Date of publication August 21, 2013; date of current version October 1, 2013. This work was partially supported by the General Program of NSFC (Grant No. 61072062). H. Yue and J. Yang are with Tianjin University, Tianjin 300072, China (e-mail: [email protected]; [email protected]). X. Sun and F. Wu are with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2013.2279315

Abstract— This paper proposes a new super-resolution (SR) scheme for landmark images by retrieving correlated web images. Using correlated web images significantly improves exemplar-based SR. Given a low-resolution (LR) image, we extract local descriptors from its up-sampled version and bundle the descriptors according to their spatial relationship to retrieve correlated high-resolution (HR) images from the web. Though similar in content, the retrieved images are usually taken with different illumination, focal lengths, and shot perspectives, resulting in uncertainty for the HR detail approximation. To solve this problem, we first propose aligning these images to the up-sampled LR image through a global registration, which identifies the corresponding regions in these images and reduces mismatches. Second, we propose a structure-aware matching criterion and adaptive block sizes to improve the mapping accuracy between LR and HR patches. Finally, the matched HR patches are blended together by solving an energy minimization problem to recover the desired HR image. Experimental results demonstrate that our SR scheme achieves significant improvement over four state-of-the-art schemes in terms of both subjective and objective quality.

Index Terms— Super-resolution, image retrieval, web image, hallucination, landmark image.

I. INTRODUCTION

Image super-resolution (SR) is an under-constrained problem, as one low-resolution (LR) image may correspond to multiple high-resolution (HR) images. Therefore, some constraints on the underlying truth images are imposed to ensure image SR is well-posed and tractable. Essentially, such constraints exploit the correlations within a single image or across multiple images. According to the method of using correlations, image SR can be classified into three categories: interpolation-based, multi-image-based, and exemplar-based. The interpolation-based schemes [1], which do not use extra assisted information, suffer from blurring artifacts because HR details are difficult to infer given only a single LR image. Multi-image-based methods fuse multiple LR images of the same scene to provide additional information for extracting HR details [2]. Methods in this category generally perform

better than methods that use only a single LR image, but it is still hard to recover high-frequency (HF) details when the magnification factor is large [3]. Exemplar-based methods [4]-[10] build a training set of HR/LR patch pairs to infer HR patches from LR patches. By using external HR images for additional information, exemplar-based methods are able to introduce new and plausible HR details. However, the training set is generally fixed, thus limiting the SR performance.

The evolution of image SR methods suggests that one of the crucial components for high SR performance is to exploit more correlation from external images to better infer HR details. For a query image, powerful image retrieval techniques have significantly facilitated the availability of a huge number of external correlated images from the Internet. This brings new opportunities for many image processing applications, e.g., completion [11], composition [12]-[15], hallucination [16], and compression [17], [18]. These methods exploit correlation by searching for similar regions in the retrieved correlated images. Indeed, such an expansion of correlated sources has produced promising results, as reported in [11]-[18]. However, for image SR, when the correlated images are captured from the same scene by various devices with different configurations (e.g., resolutions, focal lengths, and viewpoints), it is difficult to find truly matched patches without proper alignment. Artifacts are introduced when correlated but misaligned patches are fused, as shown in [16]: the HR images contain oil-painting-like artifacts at 8x up-sampling even when there are highly correlated images among the matching scenes. On the other hand, general retrieved images may not be taken from the same spot and do not admit simple alignments. These challenges limit SR performance improvement when Internet-scale correlated images are used as assisted information.

Among the vast number of images on the Internet, we note an interesting category of highly correlated images, which we call landmark images: images taken at landmark sites around the world. The number of landmark images on the Internet is dramatically increasing due to the boom of social websites; for example, about 300 million photos are uploaded to Facebook every day. For a landmark image, one can readily retrieve many highly correlated images taken at the same place, possibly by cameras with different parameters and viewpoints. The retrieved correlated images are near or partial duplicates of the query image, with shifts in viewpoint and illumination variations. By applying a global geometric transformation, the correlated images can be well aligned with the query image.



Fig. 1. Exemplar-based SR results in three schemes. (a) Traditional SR result using a fixed training set (25.06 dB). (b) Traditional SR result using correlated images as training set (24.67 dB). (c) Our SR result (26.79 dB). (d) Original image. To clearly show the details, one region is highlighted with a red rectangle.

This merit could significantly facilitate high-performance SR from correlated images. Furthermore, the state-of-the-art results of [11], [19]-[21] demonstrate the efficiency of using correlated images to provide assisted information.

Based on the above observations, we propose hallucinating landmark images by retrieving highly correlated images from the Internet. We believe landmark image SR helps enhance the HR quality of landmark-related images (landmark images or images containing a landmark) when browsing on an HR display or examining a region of interest, just as with other kinds of images. It can also be used to improve the realism of free-view, free-resolution 3D landmark reconstruction.

Implementing our scheme is not trivial at all. First, it is not easy to search HR images using the observed LR image: traditional near or partial duplicate image retrieval is designed for images of similar resolution, and the images on the Internet depict a huge variety of scenes. Second, it is difficult to hallucinate realistic details even given highly correlated images. The correlated images usually contain irrelevant objects and differ from the observed LR image in viewpoint, focal length, and illumination. Although they provide context-specific details, it is difficult to find matched patches via exhaustive patch-level search in the reference images. Third, when there are no correlated images or the retrieval fails, we still need to produce a sharp and reliable HR image.

This paper offers three main contributions. First, our scheme generates an adaptive, highly-correlated HR image dataset based on the local features of the observed LR image. Second, because the correlated images are similar to our LR image in content but differ in viewpoint, scale, and illumination, we propose a novel matching method that combines high-level matching and low-level matching. In high-level matching, we identify the region of correlated patches through global registration, which reduces the chance of a mismatch; in low-level patch matching, a structure-aware matching criterion and adaptive patch sizes are adopted to improve the mapping accuracy between LR and HR patches. Third, our SR scheme is flexible in terms of magnification factors. It avoids the offline exemplar training with a fixed factor that is widely used in common exemplar-based SR methods. It directly up-samples the LR image to the desired resolution and aligns the correlated images with the up-sampled version of the LR image.

To show the improvement obtained by our method, Fig. 1 shows three SR results for a down-sampled version of image "a" shown in Fig. 9 using different exemplar-based SR

schemes. The magnification factor is 4. The SR results shown in Fig. 1 (a) and (b) are produced by the same exemplar-based SR method similar to [7], but with different images as the training set. Fig. 1 (a) uses several fixed natural images as the training set, and (b) uses four correlated images (three of them are shown in Fig. 5 (a)) as the training set. Fig. 1 (c) is the SR result produced by our proposed method using the same correlated images as (b). To show the details more clearly, one region of interest is marked with a red rectangle in the corresponding image. There are few HF details recovered in the SR result (a). The SR result (b) provides more details than (a), but has the lowest PSNR value because of the introduction of noisy details. Our method produces the highest PSNR value and the most photorealistic details.

The remainder of this paper is organized as follows: Section II gives a brief overview of related work. Section III illustrates our framework. Section IV discusses the correlated image retrieval and global registration. Section V presents our local matching and blending method. Experimental results are presented in Section VI. Section VII discusses the limitations and future work. Finally, Section VIII concludes this paper.

II. RELATED WORK

This section gives an overview of related work on exemplar-based image SR, image retrieval, and retrieval-aided image SR.

A. Exemplar-Based Image SR

In this paper, we focus on exemplar-based image SR, which recovers HF details from a training set of HR/LR patch pairs. The up-sampled version of the observed LR image is split into overlapped patches, and each LR patch searches for a matched LR patch in the training set. Then the corresponding HF patch in the set is added to the up-sampled image, resulting in an HR image. This approach has become a hot topic since it was first proposed by Freeman et al. in [4]. In this pioneering work, Freeman et al. propose embedding two matching conditions into a Markov network: first, the matched LR patch should be similar to the observed LR patch; second, the new HR patch should be consistent with its neighbors. Later on, the exemplar-based work is improved by introducing constraints on edge smoothness [5] or the gradient [6]. Also, in [7], LR features are enhanced to improve the matching precision of LR patches. These methods achieve excellent results in preserving the sharpness of salient edges.


C. Hsu and C. Lin propose adopting scale-invariant feature transform (SIFT) descriptors to enrich the training set, and obtain improvement when their method is integrated with several classical image SR methods [22]. Sparse kernel ridge regression and a natural image prior are adopted to hallucinate image details in [8], which provides a coherent enhancement of existing textures and edge patterns. J. Yang et al. propose learning a compact dictionary for HR/LR patch pairs and produce competitive results [9].

In order to hallucinate plausible details in texture regions, [10] integrates high-level image analysis with custom low-level image synthesis. The observed LR image is interpreted as tiling textures, and each of the textures is matched with relevant textures in the database, which requires some user guidance. This method enhances the appearance of textured regions. Y. Tai et al. propose combining edge-directed SR with detail synthesis from a user-supplied example image, which hallucinates plausible texture details at very large magnification [23]. Not coincidentally, a context-constrained hallucination method is presented in [24] to improve the SR performance. Its training set consists of HR/LR segment pairs from 4000 natural images, which is much richer than previous training sets. It constrains the search regions for matched HR patches by finding matched patches in similar texture regions, which alleviates the disturbance from non-correlated regions. J. Sun et al. [25] propose selective image super-resolution, in which the SR region, source, and refinement are selective. The interesting objects are hallucinated based on dictionaries trained from objects of the same class. This method works well for interesting object regions, which demonstrates the efficiency of using similar objects to train the HR/LR patch set. [26] presents a system to automatically construct high-resolution images from an unordered set of photos of similar scenes by projecting details from child images into the parent image.

The excellent results of these methods demonstrate the efficiency of searching for matched patches in a set of similar objects and textures. These methods overcome the shortcoming of the traditional exemplar-based method proposed in [4], which produces more noise due to the lack of relevant examples. These papers show that image hallucination has been improved in two ways. On the one hand, the HR/LR training set is becoming more and more diverse, which provides more related examples. On the other hand, matched patches are searched by integrating high-level segment-region matching with low-level patch matching. Our proposed scheme, with global-level registration and local patch-level matching in regions with correlated patches, can be thought of as taking this matching trend to its extreme.

B. Image Retrieval

Image retrieval has been demonstrated to be a popular application on the Internet [27]. Current research efforts enable users or applications to retrieve related images by inputting certain query information, such as key words [28], outlines [29], and local and global feature descriptors [30], [31]. Partial or near duplicate image retrieval [30], [32]-[34] has


specifically drawn considerable attention as it is desirable for many applications, such as copyright protection. Local feature descriptors, especially SIFT descriptors, are widely used in partial or near duplicate retrieval. One feasible SIFT-based solution to large-scale image retrieval is proposed in [35], in which SIFT descriptors are represented by bag-of-visual-words (BoW) produced by clustering a large number of SIFT descriptors. Each SIFT descriptor is first quantized to the nearest visual word and then used for similarity rating. This quantization greatly accelerates image retrieval but at the same time reduces the discriminative power of SIFT descriptors. To increase retrieval accuracy, some advanced methods have been proposed. In [36], each SIFT feature is softly quantized to a weighted set of visual words instead of a single word to compensate for the quantization error. It is also possible to increase the performance by applying Hamming embedding to the descriptors and integrating weak geometric consistency within the inverted-file system [37]. The geometric consistency is further exploited through spatial verification based on spatial coding [33] or a group of bundled features [32], [34]. In this paper, we adopt the bundled SIFT features proposed in [34] in the search for highly correlated images.

C. Retrieval-Aided Image SR

There is a trend of leveraging more images in exemplar-based image SR. Research efforts have demonstrated that SR performance improves as the number of images used in the training of exemplars increases, from several [4], [6], [7] to several thousand [24]. Recently, L. Sun and J. Hays propose using Internet-scale scene images to build the database for scene image SR [16]. In this method, scene-matching images are retrieved using a combination of descriptors (e.g., color and texton histograms). These images, as well as the query LR image, are then segmented into overlapped regions to cover transitions and boundaries. The HR details are then synthesized at the segment level by patch-based matching with a gradient constraint. For face hallucination, M. F. Tappen and C. Liu propose using SIFT-flow to warp the retrieved similar HR face images to the input LR image [21]. The synthesized realistic details demonstrate the effectiveness of alignment.

In this paper, we also propose using large-scale images to build the database. Different from [16], which uses global features for scene image SR, our scheme utilizes local features to retrieve similar landmark images. Thanks to the local features, our SR scheme is able to well exploit the local correlations between images, which enables the subsequent homographic transformation that further enhances the correlations between images. Moreover, advanced methods, including a structure-aware matching criterion, a size-adaptive mechanism, and energy-minimized blending, are presented to improve the matching accuracy as well as the final reconstruction quality.

III. FRAMEWORK OVERVIEW

Our proposed retrieval-aided landmark image SR framework is depicted in Fig. 2. For an observed LR image I^l,


Fig. 2. The framework of our landmark image super-resolution. I^l is the observed LR image. The intermediate HR result is Ĩ. The SIFT descriptors of Ĩ are denoted by Φ̃. I_j represents the registered image. The final HR result is I^h.

we generate an HR image I^h at a magnification ratio of n. The observed LR image I^l may be captured by a low-end imaging device or be a thumbnail of an interesting landmark. The LR image is up-sampled into an intermediate HR result Ĩ at n× magnification by bi-cubic interpolation. The task at hand is to learn HF details for the LR image from external HR images.

First, we extract SIFT descriptors from Ĩ. Second, the correlated images are retrieved from a large-scale database for Ĩ using the bundled SIFT descriptors. Third, because the correlated images are at different scales and viewpoints, a global registration is applied to align these images with Ĩ based on the matched feature points. The global registration not only increases the pixel-level correlation between the retrieved images and the anchor, but also significantly reduces the search space of the subsequent local patch matching.

HF details are then inferred at the patch level from the registered correlated images. The intermediate HR image is first split into overlapped patches whose sizes adapt to local characteristics. Then, for each LR patch in Ĩ, we search for its best-matched LR patch in the registered correlated LR images with a structure-aware matching criterion. The matched HR patches are recovered by adding the corresponding HF patches to the LR patches in Ĩ. The HR version of Ĩ is recovered by blending the matched HR patches under an energy minimization framework. Each component of the proposed method is detailed in the following sections.

In our SR scheme, we explore the correlations between images via local features instead of pixel values. Specifically, the SIFT features play an important role: in the retrieval stage, they are bundled according to their spatial correlations to find the most similar HR images; in the registration stage, they are randomly sampled to deduce the transformation matrix.
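To make the data flow concrete, the following minimal Python sketch mirrors Fig. 2. The three helper functions are hypothetical stand-ins for the components detailed in Sections IV and V; only the bi-cubic up-sampling is concrete code.

import cv2

# Hypothetical stand-ins for the components of Sections IV and V.
def retrieve_correlated(img, top_k=4):
    return []                      # bundled-SIFT retrieval (Sec. IV-B)

def register_to(anchor, img):
    return img                     # homography alignment (Sec. IV-C)

def match_and_blend(anchor, I_l, refs):
    return anchor                  # patch matching + blending (Sec. V)

def super_resolve(I_l, n=4):
    h, w = I_l.shape[:2]
    # Intermediate HR estimate by bi-cubic interpolation at n x magnification.
    I_tilde = cv2.resize(I_l, (w * n, h * n), interpolation=cv2.INTER_CUBIC)
    refs = [register_to(I_tilde, I_j) for I_j in retrieve_correlated(I_tilde)]
    return match_and_blend(I_tilde, I_l, refs)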

IV. CORRELATED IMAGE RETRIEVAL AND GLOBAL REGISTRATION

This section focuses on the generation of highly correlated content for patch-based SR. The modules, including the generation of SIFT descriptors, correlated image retrieval, and global registration, are introduced in detail.

A. SIFT Extraction

SIFT descriptors [38] characterize image regions invariantly to image scale, rotation, and 3D camera viewpoint. They are highly distinctive and thus widely used in image retrieval, 3D reconstruction, and panoramic stitching. A SIFT descriptor is denoted as

f = {v, x, s, o},   (1)

where v is a 128-dimensional vector describing one image region by gradient histograms in different directions. The location, scale, and dominant orientation are represented by x, s, and o, respectively. A SIFT descriptor is extracted from a set of scale-space images generated from an input image by both Gaussian filtering and hierarchical down-sampling. The feature location and scale are determined by the extrema of the difference-of-Gaussian (DoG) space, which is the difference between two adjacent Gaussian-blurred images. One or more orientations are calculated in a region that is centered at the feature location and whose size is determined by the corresponding scale. The vector covers the gradient information calculated from 16 sub-regions after the region is transformed according to the dominant orientation. Please refer to [38] for further details on SIFT extraction.

We note that an image is down-sampled into multiple octaves during SIFT extraction. Although SIFT descriptors are invariant to image scale, SIFT descriptors extracted from the observed LR image are different from those extracted from the original HR image. In our scheme, we approximate the HR details from the correlated HR images retrieved by the LR image. The correlated images are obtained by matching the SIFT descriptors between the up-sampled LR (ULR) image and the HR web images. We choose the ULR image instead of the LR image for SIFT descriptor matching for two reasons: 1) the SIFT descriptors detected from the ULR image are used not only in the retrieval but also in the following global registration; 2) as shown in Fig. 3, the number of SIFT descriptors extracted from the ULR image¹ (red circles in (a)) is larger than that from the LR query (blue circles in (b)); also, as shown in Fig. 3 (c), most of the descriptors from the ULR image are aligned with those from the original HR image (green circles in (a)), which is helpful in making use of the spatial correlation among descriptors in the bundled matching.

¹ Here we detect the SIFT descriptors of the ULR image from the second octave to reduce the gradient distortions at small scales.
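As a concrete illustration of this step, the snippet below extracts SIFT keypoints and descriptors from the ULR image with OpenCV. OpenCV does not expose the starting octave, so the second-octave detail of the footnote is not replicated; this is an approximate sketch, not our exact extraction code.

import cv2

def ulr_sift(I_l_gray, n=4):
    # Bi-cubically up-sample the LR image to obtain the ULR image.
    h, w = I_l_gray.shape
    ulr = cv2.resize(I_l_gray, (w * n, h * n), interpolation=cv2.INTER_CUBIC)
    # Each keypoint carries location kp.pt, scale kp.size and orientation
    # kp.angle; des holds the 128-D vectors v of Eq. (1).
    sift = cv2.SIFT_create()
    kps, des = sift.detectAndCompute(ulr, None)
    return ulr, kps, des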


Fig. 3. SIFT descriptors extracted from (a) ULR image (in red) and (b) LR image (in blue). The SIFT descriptors extracted from original HR image are denoted in green in (a). The numbers of SIFT descriptors extracted from original HR, ULR and LR images are 971, 424 and 126, respectively. (c) shows the percentage of ULR keypoints aligned with original keypoints with a precision of at least 1 and 2 pixels, respectively.

Fig. 5. Exemplified highly correlated images and registration results. For each image group, the small image on the left is the observed LR image. The other six images on the right are the three highly correlated images (at the top) and the corresponding aligned versions (at the bottom).

Fig. 4. The inverted-file index used in our retrieval scheme.

B. Correlated Landmark Image Retrieval

Using a fixed training set limits the ability of exemplar-based SR to find highly correlated HR patches. Our scheme uses an adaptively retrieved HR image set from a large-scale database built by crawling images from the Internet. Bundled SIFT descriptors proposed in [34] are adopted for image retrieval in our scheme; they are more distinctive than single descriptors in large-scale partial-duplicate web image search. In the bundled retrieval, the SIFT descriptors are grouped according to the spatial correlations among them. As shown in Fig. 2, one image region associated with a large-scale descriptor often contains some small-scale descriptors. These descriptors are bundled as one group to serve as a basic unit in feature matching. To accelerate the search process, the bag-of-words (BoW) model [35] is used, in which the SIFT descriptors are quantized into visual words. The bundled groups of one image are then stored in an inverted-file index [39], as illustrated in Fig. 4. Each visual word has an entry in the index that contains the list of images in which the visual word appears. It also contains the number of members in the group centered at this visual word, followed by the member visual words and sector indices. In this paper, the number of member visual words in a group is limited to 128 and the number of sector indices is set at 4.

The SIFT descriptors extracted from the query image Ĩ are also bundled into groups in the same way. Then each bundled group is matched with the bundled groups stored in the inverted-file index. The matching is scored by the number of matching visual words and the geometric relationship, and the score is assigned to the image associated with the matched group. After all the bundled groups in the query image are matched, the correlation between a candidate image and the query image is measured by the sum of scores of matched bundled groups between them. A larger total score suggests a stronger correlation. Images with the highest total scores are selected as correlated images for the query image.

The visual-group-based retrieval outperforms the BoW method in terms of mean average precision (mAP) due to the use of spatial correlations among descriptors. For example, as revealed in [34], the mAP of the visual-group method is 0.715 whereas that of BoW is 0.627 with a 1M codebook on the Oxford5K dataset [35]. Fig. 5 shows some examples of the retrieval results. The results demonstrate that the retrieval method does a good job of finding near-duplicate images of the same scene with different imaging configurations.
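The voting logic of the inverted-file index can be sketched as follows. This simplified version scores plain visual-word matches only; the bundled geometric-consistency scoring of [34] is omitted, and the word ids are assumed to come from a pre-trained vocabulary.

from collections import defaultdict

def build_inverted_index(db_words):
    # db_words: {image_id: iterable of visual-word ids for that image}.
    index = defaultdict(set)
    for img_id, words in db_words.items():
        for w in words:
            index[w].add(img_id)
    return index

def score_candidates(query_words, index):
    # One vote per shared visual word; the bundled variant additionally
    # checks the geometric layout inside each group.
    scores = defaultdict(int)
    for w in set(query_words):
        for img_id in index.get(w, ()):
            scores[img_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])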


C. Global Registration

Different from previous SR schemes, especially [16], we propose aligning the correlated images using SIFT-based global registration before HR detail recovery. As shown in Fig. 5, the retrieved images are highly correlated but significantly different at the pixel level, since they are often created with different viewpoints, focal lengths, and illumination. Direct SR from these images decreases the HF correlation at the patch level. As shown in Fig. 1 (b), even when very similar images are available, the HF information inferred at the patch level is inadequate since it is generated through pixel-wise matching. One can establish pixel-level correspondence by optical flow to compensate for pixel displacements; however, this imposes considerable computational burdens. One way to reduce the search space is to use image segmentation and search for patches in the matched segmentation texture regions, as done in [24], [16]. However, the problem of pixel non-alignment still remains due to different imaging configurations such as viewpoints and focal lengths. To solve this problem, we obtain an approximate alignment through geometric global registration, which is further refined by local patch-based matching.

Image registration is a classical research topic and can be pixel-based or feature-based. Since the SIFT descriptors of all the involved images are available in our case, we estimate the registration transformation between the ULR image Ĩ and one correlated image I using the Random Sample Consensus (RANSAC) algorithm [40]. During registration, correspondences of feature points are first established between the image Ĩ and each of the correlated images. We denote the set of SIFT descriptors of image Ĩ by Φ̃ := {f̃_i, i = 1, 2, ..., M}, where f̃_i = {ṽ_i, x̃_i, s̃_i, õ_i} is the i-th SIFT descriptor, and the set of SIFT descriptors of the correlated image I by Φ := {f_j, j = 1, 2, ..., N}. For each SIFT pair (f̃, f) ∈ Φ̃ × Φ, we define a mapping function M(f̃, f): Φ̃ × Φ → {0, 1} to evaluate whether they match each other as:

M(f̃, f) = { 1, if v = v_min and ‖ṽ − v′_min‖₂² > C·‖ṽ − v_min‖₂²
             0, otherwise,   (2)

where v_min is the closest feature vector in Φ to ṽ and v′_min is the second closest vector in Φ to ṽ. When the two descriptors match, the descriptor vector v is not only the closest one to ṽ but also C times smaller than the second nearest one in squared Euclidean distance. C is a constant and is usually set at 1.5, similar to the criterion proposed in [38]. Then the corresponding points of matched descriptors are extracted as

P = {(x̃, x) | M(f̃, f) = 1, f̃ ∈ Φ̃, f ∈ Φ}.   (3)

Second, an eight-parameter homographic matrix H is estimated using RANSAC. Matrix H is iteratively produced by first randomly selecting four pairs of corresponding points from P for the computation of H. Then the remaining corresponding points are transformed by H. For a point pair (x̃_i, x_i) ∈ P, if the Euclidean distance between x̃_i and H·x_i is less than a given threshold ε (in this work, ε is set at 2), the pair is called an inlier of H; otherwise it is an outlier. The number of inliers and the matrix H are recorded during each iteration. The matrix corresponding to the maximum number of inliers is selected as the optimum matrix. Finally, this matrix is re-estimated from its inliers, producing the final transformation matrix H. The correlated image is aligned to the LR image using matrix H. After the transformation, we obtain a reference image set.
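Assuming OpenCV's SIFT matcher and RANSAC homography estimator are acceptable substitutes for the steps above, the registration can be sketched as follows; all names are illustrative.

import cv2
import numpy as np

def register_to_anchor(anchor_gray, corr_gray, C=1.5, eps=2.0):
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(anchor_gray, None)  # ULR image
    kp_c, des_c = sift.detectAndCompute(corr_gray, None)    # correlated image
    # Eq. (2): keep a match when the second-nearest squared distance
    # exceeds C times the nearest squared distance.
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_c, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and C * p[0].distance ** 2 < p[1].distance ** 2]
    src = np.float32([kp_c[m.trainIdx].pt for m in good])   # x  (correlated)
    dst = np.float32([kp_a[m.queryIdx].pt for m in good])   # x~ (anchor)
    # RANSAC with the paper's inlier threshold eps = 2 pixels.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, eps)
    h, w = anchor_gray.shape
    aligned = cv2.warpPerspective(corr_gray, H, (w, h))
    # Binary matrix B_j: anchor pixels covered by the warped image.
    B = cv2.warpPerspective(np.ones_like(corr_gray, np.uint8), H, (w, h))
    return aligned, B, H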

Fig. 6. The ROC curves (c) of using a retrieved image (at the top of (b)) and a registered image (at the bottom of (b)) as reference for hallucinating the observed LR image (a).

Fig. 5 shows an example of global registration applied to the retrieved images for three LR images. Before registration, the pixels of some objects have considerably varying displacements across the correlated images; after applying the transformation, all the correlated images are aligned in terms of the scale and viewpoint of the LR image, which greatly increases the relevance of pixel-wise patches between the LR and correlated images.

To evaluate the effect of registration, a receiver operating characteristic (ROC) curve [41] is shown in Fig. 6 to show the relationship between the hit rate h_r and the error rate e_r. For each test patch P^l sampled from Ĩ, the error rate e_r is defined as:

e_r = ‖P^h − (P^l + Q^h_* − Q^l_*)‖₂² / ‖P^h‖₂²,   (4)

where P^h is the original HR patch for P^l, Q^l_* is the nearest sample for P^l in the retrieved correlated or registered image, and Q^h_* is its corresponding HR patch. The hit rate h_r is defined as the percentage of test patches whose error rate is less than e_r. For a given error rate, the higher the hit rate, the more effective the reference image. In this experiment, we randomly choose 5,000 test patches from the up-sampled version of the image shown in Fig. 6 (a). The patch size is 13 × 13. We then search for each patch's nearest neighbor in the LR versions of the images shown at the top (retrieved correlated image) and at the bottom (registered correlated image) of Fig. 6 (b), respectively. Fig. 6 (c) gives the corresponding ROC curves for the two reference images, which demonstrates that the registered image provides more accurate HR patches.

We also observe that, after registration, a transformed correlated image I_j may only partially cover the LR image Ĩ (as shown in Fig. 5). We associate one binary matrix B_j with each of the transformed images to identify pixels that have correspondences in Ĩ. In general, each pixel of Ĩ has at least one candidate pixel in the transformed correlated images. These binary matrices help to access valid pixels of the transformed images in the subsequent local patch matching.
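For reference, the error rate of Eq. (4) used in this experiment is straightforward to compute; a minimal sketch:

import numpy as np

def error_rate(P_h, P_l, Q_h, Q_l):
    # Eq. (4): relative error of the HR patch approximated by transferring
    # the high-frequency difference Q_h - Q_l onto the LR patch P_l.
    approx = P_l + Q_h - Q_l
    return np.sum((P_h - approx) ** 2) / np.sum(P_h ** 2)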


V. LOCAL MATCHING AND BLENDING

Given the reference image set, the task at hand is to fuse the reference images into an HR image. Note that the reference images are only approximately aligned to the LR image. Since the performance of image SR from multiple images depends on the accuracy of the alignment [2], we further refine the alignment with local patch matching before we blend the corresponding patches into an HR image. We propose enhancing the SR performance by a gradient-aware matching criterion, an adaptive patch size decision, and energy-minimized blending.

A. Patch Matching

The ULR image Ĩ is split into overlapped patches of size m × m with a step size of m/2 along both the rows and columns. For each patch in Ĩ, we search for its corresponding patch from the reference images within a search window. Assume image Ĩ has T reference images {I_j}, j = 1, ..., T. Each reference image I_j is first down-sampled at a ratio of 1/n (n is the desired up-sampling ratio in SR) and then up-sampled by bi-cubic interpolation to the original resolution, resulting in Ĩ_j. For one patch P^l centered at (x, y) in Ĩ, the search window W_P is defined as a square region centered at (x, y) in Ĩ_j. To ensure there are corresponding pixels, the search is conducted in regions of Ĩ_j where B_j is labeled 1 within W_P. We seek a patch Q^l from W_P that minimizes D(P^l, Q^l), where D(P, Q) measures the difference between two patches. Since the reference images are approximately aligned with the LR image, a small search window, e.g., 3m × 3m, is chosen for correspondence retrieval. During this search process, two important components, the matching criterion D(P, Q) and the patch size m, are investigated and improved to increase the matching efficiency.

1) Matching Criterion: Commonly-used criteria to evaluate the distance between two image patches include the mean absolute difference (MAD) and the mean squared error (MSE). These pixel-wise fidelity criteria lack awareness of intrinsic image structures. This can become an even more severe problem when the reference patch from Ĩ is only a rough estimate of the original patch. To solve this problem, we propose a structure-aware criterion using gradient information for the distance measurement. The criterion D(P, Q) is defined as:

D(P^l, Q^l) = ‖P^l − Q^l‖₂² + η·‖∇P^l − ∇Q^l‖₂²,
  Q^l ∈ Ĩ_j(W_P), B_j(W_P) ≠ 0, j = 1, ..., T,   (5)

where Q^l is a candidate patch for P^l, ∇Q^l represents the gradient of patch Q^l, η is a weighting factor set to 10, and the constraint B_j(W_P) ≠ 0 ensures that the search region has candidates. The reference patches Q^l are densely sampled from W_P. Note that the patches in (5) are normalized by removing their DC components before solving (5). We define the value of D(P^l, Q^l)/(m × m) as the gradient mean squared error (GMSE), which can be efficiently computed in parallel.

Our GMSE-based patch matching increases the accuracy in finding the correct HR candidate, especially for structured regions. Given one LR patch P^l and its matched patch Q^l_* generated by minimizing (5), the reconstructed HR patch for P^l is denoted as

P^h_* = P^l + Q^h_* − Q^l_*,   (6)

where Q^h_* is the HR patch collocated with Q^l_*. We randomly choose 12,000 patches from the 12 test images shown in Fig. 9. The patch size m is set at 13. The average matching error e is 202.63 for the MSE search and 192.84 for the GMSE search. The matching error is calculated as

e = ‖P^h − P^h_*‖₂² / (m × m).   (7)

Fig. 7. Patch matching results using the MSE and the proposed GMSE criteria. (b) and (e) are the original LR and HR patches, respectively. (a) and (d) are the matched LR patch and its corresponding reconstructed HR patch using MSE. (c) and (f) are the matched LR patch and its corresponding recovered HR result using GMSE.

Fig. 7 shows exemplified matching results using the two matching criteria: MSE and our proposed GMSE. For an input LR patch P^l shown in Fig. 7 (b), two matched LR patches, (a) and (c), are retrieved using MSE and GMSE, respectively. Fig. 7 (d) and (f) show the corresponding reconstructed HR patches. From Fig. 7, we observe that the LR patch (a) obtained by MSE matching is closer to the input LR patch (b) than the one (c) obtained by our GMSE matching. However, comparing the recovered HR patches with the original, the HR patch (f) recovered from the GMSE-matched patch preserves the image structure better than the one (d) recovered from the MSE-matched patch. This clearly demonstrates that our proposed matching criterion is aware of image structure by integrating gradient information, and thus achieves higher matching accuracy.
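A direct, unoptimized implementation of the GMSE criterion of Eq. (5) and the full search inside a window may look as follows; the boundary handling is simplified and the B_j mask test is omitted.

import numpy as np

def gmse(P, Q, eta=10.0):
    # Structure-aware distance of Eq. (5), normalized per pixel.
    # Patches are DC-removed before comparison, as in the paper.
    P = P - P.mean()
    Q = Q - Q.mean()
    gPy, gPx = np.gradient(P)
    gQy, gQx = np.gradient(Q)
    d = np.sum((P - Q) ** 2) + eta * (np.sum((gPx - gQx) ** 2) +
                                      np.sum((gPy - gQy) ** 2))
    return d / P.size

def best_match(P_l, ref, top_left, m):
    # Full search over a window of roughly 3m x 3m around top_left.
    y0, x0 = top_left
    r = 3 * m // 2
    H, W = ref.shape
    best, best_xy = np.inf, None
    for y in range(max(0, y0 - r), min(H - m, y0 + r)):
        for x in range(max(0, x0 - r), min(W - m, x0 + r)):
            d = gmse(P_l, ref[y:y + m, x:x + m])
            if d < best:
                best, best_xy = d, (y, x)
    return best_xy, best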


2) Patch Size: We notice that previous SR schemes pay little attention to the patch size of exemplars and often use a fixed patch size. However, we observe that the patch size can significantly affect the final SR performance. In general, a larger LR patch containing richer structural information increases the chance of finding a well-matched HR patch if one exists in the reference images, but it results in a high matching error when no matched candidate is available or when there is a mismatch. In contrast, a smaller LR patch with less structural information may be more flexible in finding well-matched patches, but has difficulty blending into coherent image structures and thus produces artifacts. Therefore, we propose adapting the patch size according to local characteristics: large patches are adopted if the correlated images are well aligned with the LR image; otherwise, small patches are preferred. The goodness of alignment is evaluated by the GMSE between the LR image and one correlated image within the matching region. A smaller GMSE indicates a better alignment. The patch sizes are empirically determined by the GMSE value as:

m = { 21, GMSE ≤ α
      17, α < GMSE ≤ β
      13, β < GMSE ≤ γ
       9, GMSE > γ,   (8)

where α, β, and γ are three parameters that divide the GMSE into four partitions corresponding to the four possible patch sizes. In our implementation, the three parameters are set as α = 1000, β = 1500, and γ = 2500. The GMSE value is calculated at the initial patch size 21 × 21. Then the patch is split into overlapped patches of the corresponding size according to (8). This adaptive patch-splitting strategy is inspired by the variable-sized block inter-prediction mode decision in H.264 [42]. Moreover, the sliding step of patches depends on the patch size: the step size is set at m/2 if m is larger than 16, and at m/3 otherwise.

Fig. 8. The HR results obtained by fixed-size patch matching (24.38 dB) (a) and adaptive-size patch matching (25.17 dB) (b). Three highlighted regions are shown on the right side for better visual inspection.

Fig. 8 shows two SR images generated using a fixed patch size (13 × 13) and our adaptive size, respectively. The PSNR results demonstrate that our adaptive patch size achieves a better reconstructed SR image in comparison with a fixed one. This is more obvious when closely observing the highlighted regions. The first two cropped regions (top and middle) in Fig. 8 (b) are much clearer than those shown in Fig. 8 (a), which benefits from the large patch size used in these regions. The third cropped region in Fig. 8 (b) has fewer artifacts than the one in Fig. 8 (a) since a smaller patch size (9 × 9) is selected. This demonstrates that adapting the patch size to the goodness of alignment significantly improves the SR performance over a fixed patch size.
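Eq. (8) and the size-dependent sliding step translate directly into code; a small sketch using the thresholds above as defaults:

def patch_size_from_gmse(g, alpha=1000.0, beta=1500.0, gamma=2500.0):
    # Eq. (8): choose the patch size from the alignment quality
    # (GMSE measured at the initial 21 x 21 size).
    if g <= alpha:
        return 21
    if g <= beta:
        return 17
    if g <= gamma:
        return 13
    return 9

def sliding_step(m):
    # Step size m/2 for large patches, m/3 otherwise.
    return m // 2 if m > 16 else m // 3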

B. Patch Blending

After the local patch matching, each pixel p in Ĩ has N_p recovered candidates {c_i(p)}, i = 1, ..., N_p, where c_i(p) are the recovered HR values of the query LR patches at pixel p. The number of candidates N_p depends on the patch size and the sampling step of the patches. The straightforward way to fuse these candidate pixels is averaging, which tends to produce over-smoothness. A more powerful approach is an energy minimization framework into which constraints and priors can be flexibly integrated. For example, contextual constraints are incorporated into an energy function to achieve high SR performance in [24]. In our scheme, we propose blending the candidates by minimizing the energy function:

E(I^h) = E_d(I^h, I^l) + μ·E_h(I^h),   (9)

where μ is a weighting factor to adjust the importance of the two terms, I^l is the observed LR image, and I^h is the desired HR image.

The first energy term, namely the data term, forces the reconstructed result to be consistent with I^l at low resolution:

E_d(I^h, I^l) = ‖D·G·I^h − I^l‖₂²,   (10)

where G is a blurring operator to reduce aliasing and D is a down-sampling operator.

The second energy term is the hallucination term, which makes the pixels in I^h close to the pixel candidates from the reference images. One way to compute this term is to evaluate the Euclidean distance between I^h and the average of the candidate pixels. However, as mentioned before, this would introduce over-smoothness. Therefore, we evaluate the squared error between I^h(p) and the closest candidate in c_i(p):

E_h(I^h) = Σ_p min_i (I^h(p) − c_i(p))².   (11)

However, this energy term is not differentiable and is difficult to minimize. Let d_i(p) = (I^h(p) − c_i(p))²; then the minimum in (11) for each pixel p can be approximated by

min{d_1, d_2, ..., d_{N_p}} ≈ −(1/λ)·log Σ_{i=1}^{N_p} exp(−λ·d_i(p)).

The approximation is good when the minimum significantly deviates from the other entries; therefore, λ is set to 10. As suggested in [24], we then relax the minimum selection in the hallucination term (11) into the following differentiable function:

E_h(I^h) = −(1/λ)·Σ_p log Σ_{i=1}^{N_p} exp(−λ·d_i(p)).   (12)

The energy function (9) is minimized by the gradient descent method:

I^h_{t+1} = I^h_t − τ·∇E(I^h_t),   (13)

where t is the iteration index and τ is the step size along the negative gradient (set at 0.25).
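A compact sketch of the gradient-descent blending of Eqs. (9)-(13) follows (the gradients correspond to Eqs. (14)-(15) below). It assumes the candidate values are stacked into one array per pixel, and it approximates the adjoint of the blur and down-sampling operators by up-sampling followed by blurring, an implementation choice not specified in the paper.

import cv2
import numpy as np

def blend(I0, I_l, cands, n=4, mu=1.0, lam=10.0, tau=0.25, iters=15):
    # I0: initial HR estimate (H, W); I_l: observed LR image (H/n, W/n);
    # cands: (N, H, W) array stacking the candidate values c_i(p).
    I = I0.astype(np.float32)
    I_l = I_l.astype(np.float32)
    cands = cands.astype(np.float32)
    h, w = I_l.shape
    for _ in range(iters):
        # Data term gradient: G^T D^T (D G I - I^l).
        low = cv2.GaussianBlur(I, (5, 5), 1.0)[::n, ::n] - I_l
        up = cv2.resize(low, (w * n, h * n), interpolation=cv2.INTER_CUBIC)
        g_data = cv2.GaussianBlur(up, (5, 5), 1.0)
        # Hallucination term gradient: soft-min weights over candidates,
        # with d_i(p) = (I(p) - c_i(p))^2.
        diff = I[None] - cands
        d = diff ** 2
        wgt = np.exp(-lam * (d - d.min(0)))  # stabilized soft-min weights
        wgt /= wgt.sum(0)
        g_hall = (wgt * diff).sum(0)
        I -= tau * (g_data + mu * g_hall)
    return I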


TABLE I. KEY WORDS OF THE LANDMARKS SELECTED TO GENERATE THE IMAGE DATABASE

The gradient of the energy function is the sum of each term's gradient:

∇E(I^h) = Gᵀ·Dᵀ·(D·G·I^h − I^l) + μ·∇E_h(I^h),   (14)

where Gᵀ is the transpose of G and ∇E_h is the gradient of the hallucination term:

∇E_h(I^h(p)) = Σ_{i=1}^{N_p} [ exp(−λ·d_i(p)) / Σ_k exp(−λ·d_k(p)) ] · (I^h(p) − c_i(p)),   (15)

and μ is set at 1. I^h_0 is initialized by the weighted averaging of the candidate patches. Essentially, a patch that is more consistent with the observed LR patch is assigned a larger weight. The weight w of each patch is defined as:

w = 1.2 × e^(−GMSE/700).   (16)

The gradient descent algorithm for minimizing (9) generally converges after 15 iterations, resulting in the final SR image I^h.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

Here we evaluate the performance of our proposed SR scheme. We first present the details of the image database generated from the Internet. Then the SR results are evaluated both objectively and subjectively in comparison with four state-of-the-art SR methods [8], [9], [22], [16]. Finally, we discuss the impact of illumination and correlated images, and analyze the computational complexity of our algorithm.

A. Database Generation

We build an image database by crawling images from the Internet. As our scheme targets super-resolution for landmark images, we retrieve images whose width or height is larger than 1024 from Flickr² using the key words listed in Table I. Our database contains 535,518 images, and the total size of the JPEG files is 245 GB. The test images for the SR performance evaluation are crawled from Google³. As shown in Fig. 9, 12 images of resolution around 1024 × 768 are selected as test images and are excluded from the database. The number of correlated images T for each test image can be set adaptively depending on the database. In the following tests (Table II), we select the top four correlated images (T = 4) for each test image, as shown in [43]. It can be observed that the retrieved correlated images are similar to the input LR image in content but with different viewpoints, scales, and illuminations.

² http://www.flicker.com
³ http://www.google.com

Fig. 9. Test images used in our experiments, denoted from "a" to "l".

B. Experimental Results

In our experiment, the magnification factor n is set to 4. For color images, we only enhance the luminance component. The chrominance components are directly up-sampled by bi-cubic interpolation, since humans are more sensitive to luminance changes. The patch size in our experiments is adaptively selected using (8). We compare our method with four SR schemes [8], [9], [22], [16]. The results of [8] are produced with a fixed training set. The training sets in [9], [22], and [16] are replaced by our retrieved correlated images so that they use the same HR images as our scheme for the SR test. The SR scheme in [9] seeks a sparse representation for LR patches and uses the coefficients to recover the HR patches; its results are produced using the public code⁴ with the parameters recommended in [9]. The SR scheme in [22] exploits different scales and orientations in images using SIFT descriptors to enrich the training set; its comparison results in this paper are generated by integrating it with exemplar-based SR. The SR scheme in [16] uses matched scenes to hallucinate the HR details. We also compare our method with bi-cubic interpolation.

⁴ http://www.ifp.illinois.edu/~jyang29/

Table II lists the image SR results of our method and the four comparison methods in terms of two objective metrics: PSNR and structural similarity (SSIM). Our method achieves the best SR performance at 27.22 dB (PSNR) and 0.81 (SSIM) on average over the 12 test images, a gain of 1.82 dB in PSNR and 0.08 in SSIM over the average results (25.40 dB and 0.73) of the second best method [16]. We also list the gain against the second best individual result (underlined in Table II) for each test image, which shows that our method consistently outperforms the other SR methods. Note that for several images it is quite challenging to achieve good SR results. For example, for images "e", "h", and "j", some methods even generate lower SR quality than the bi-cubic method. In contrast, our method provides remarkable results, with up to 2.47 dB gain over bi-cubic for these images. We also notice that the average PSNR results of [9] and [16] are quite close to [8], in which a fixed training set is


TABLE II. PSNR AND SSIM VALUES FOR THE BI-CUBIC METHOD, FOUR COMPARED METHODS [8], [9], [22], [16], AND OUR METHOD. THE SECOND BEST RESULT IS UNDERLINED. OUR GAIN COMPARED WITH THE SECOND BEST RESULT IS SHOWN IN THE 'GAIN' COLUMN

Fig. 10. SR result of one indoor scene. From left to right: input LR image, three crops of the HR results for the highlighted regions (indicated in the LR image) produced by bicubic method, method [8], our method and the original image.

used, but our scheme clearly outperforms these schemes. This indicates that even when highly correlated images are used in [9] and [16], it is difficult to enhance the accuracy of the HF approximation in the SR process. In contrast to these schemes, our proposed scheme significantly improves the SR performance both by exploiting the correlations from the aligned images and by adopting advanced patch matching and blending methods.

⁵ More results are provided in the supporting document in order to meet the page limit.

We also compare the above SR methods in terms of visual quality. Fig. 14 shows the SR images for four of the twelve test images⁵. For each image, we show two cropped regions highlighted with red rectangles to visualize the details. It can be observed that for all the test images our method generates photorealistic details with sharp and natural edges (e.g., the window frame in image "d") and vivid textures (e.g., the decorative pattern on the door in image "h"). Bi-cubic interpolation gives the lowest visual quality, with blurring and jagged artifacts along the edges. [8], [9], and [16] tend to produce sharp edges but smoothed texture regions, e.g., the face in image "e", and [22] results in noisy artifacts along the edges. Another interesting phenomenon is that our scheme is able to faithfully recover the ridges on the roof, while all the other schemes provide

much blurred or unnatural results. For the Chinese characters in image "j", our result is much clearer than those produced by all the compared methods. In brief, our scheme not only provides the best SR performance in terms of objective quality, as shown in Table II, but also provides the best results in terms of visual quality.

Our SR scheme is flexible in terms of magnification factors. It avoids the offline exemplar training with a fixed factor that is widely used in common exemplar-based SR methods. Given one LR image, after retrieving HR correlated images from the Internet, the LR image can be directly magnified to the desired resolution (e.g., 3×, 4×, 5×) and the correlated images are aligned with the up-sampled version of the LR image. Our scheme is also applicable to general images that have correlated images available. Fig. 10 presents the SR result of one indoor scene, whose correlated images are shown in [43]. We can see that our method generates more realistic details compared with [8] and bi-cubic interpolation.

C. Illumination

The scales and viewpoints of the correlated images can be aligned with the LR image by one global transformation. However, the illumination of the correlated images often differs from that of the LR image. As shown in Fig. 11, the LR image is very bright while the correlated image has low lighting. Since our SR scheme only needs the high-frequency components of the correlated image, this kind of illumination difference has limited effect on our SR result, as demonstrated in Fig. 11.

D. Correlated Images

The key motivation of our work is to boost image SR performance by exploiting strong correlations from web images. The quality of our SR results depends on the retrieved correlated images. As indicated in Table II, our SR performance varies across images. For example, for image "f", our gain over the second best method is above 4 dB, while for image "b" the gain is only 0.22 dB. The reason is that the retrieved images


Fig. 11. The SR result with a correlated image whose illumination differs from that of the query image. From left to right: the observed LR image, the correlated image, and the SR result.

Fig. 13. The SR result with uncorrelated images as reference images. From left to right: input LR image, reference images, three crops of the HR results for the highlighted regions (indicated in the LR image) produced by our proposed method with uncorrelated images, with highly correlated images, and the bi-cubic method.

Fig. 12. The SR result for a landmark image containing a person. From left to right: the observed LR image, the reference image, the SR result. To clearly show the details, two crops are highlighted with red rectangles.

for image "f" are more strongly correlated than those retrieved for image "b". The results shown in Table II are all obtained with correlated images [43] available. For a thorough evaluation of the role of correlated images, we test our method in two further cases: first, a person wants to hallucinate a landmark image containing him or herself; second, no correlated image is available.

In the first case, it is difficult to find correlated images containing the same person. Most likely, the correlated image contains only the landmark. As shown in Fig. 12, we select the top retrieved correlated image as the reference. We can see that the SR result for the building's roof is realistic, while the person is not well hallucinated. One possible solution to this issue is to segment the person via user interaction and include other photos of the person as correlated images [20].

In the second case, for image "g", we use two totally different reference images for SR, as shown in Fig. 13. In this case, the global registration does not work due to the lack of properly matched features, and the patch size is fixed to 9 × 9. Using our patch matching and blending methods, we generate the SR result. For comparison, SR results for the bi-cubic method and for our proposed method with highly correlated images are also shown for the highlighted regions. We observe that the SR image produced without highly correlated images is not as sharp along edges as the one produced with highly correlated images, but it is much sharper than the one generated by bi-cubic interpolation. This shows that, even without highly correlated images, our method is able to provide good SR results by exploiting patch-level correlations.

E. Complexity Analysis

In this section, we discuss the computational complexity of our scheme. It takes about one second to calculate the SIFT descriptors for one image of resolution 1024 × 768. The image

retrieval is based on visual words, and the global registration is only performed for a few images. The image retrieval and registration normally take three seconds on average per correlated image. The patch matching stage is performed patch by patch and is the most time-consuming part of our scheme: it takes about five minutes to hallucinate one image of resolution 1024 × 768 using a simple single-threaded full-search method. Fortunately, the patch matching operations are independent, and the computation time is reduced to 80 seconds in our parallel implementation using four threads. Moreover, the patch matching stage in our scheme is similar to motion estimation in video coding, which can be accelerated by more than 100 times using fast patch matching algorithms [44].

VII. DISCUSSIONS

In this section, we discuss the limitations of our framework and future directions of our research.

A. Limitations

Although our scheme targets landmark images, it has some limitations in super-resolving complex objects, such as faces and animals, since current image retrieval methods may not be able to find highly correlated images for them. Also, for images containing rich textural regions without dominant structures, such as grass and leaves, which are hard to match with SIFT descriptors, our scheme suffers the same limitation as traditional SR schemes.

B. Future Work

As mentioned before, the complexity of our SR scheme can be greatly reduced by fast matching; we would like to work on accelerating our scheme in the future. Furthermore, we would like to enhance the scalability of our SR scheme: we will increase the diversity of the database by extending the selected landmarks to more than 1000 and collecting 100 million images. Also, using a single homographic transformation for aligning images has limitations. In the future, we would like to pay more attention to advanced alignment methods, e.g.,


Fig. 14. The HR results for images “d”,“e”, “h”, and “j”. For each column, images from the first row to the sixth row are obtained by the bi-cubic method, the compared methods [8], [9], [22], [16], and our method. The last row is the original image. To see the details clearly, two crops of the selected regions, highlighted with red rectangles, are shown in the corresponding image.

image-based 3D reconstruction and multi-model fitting [45] to detect multi-homography regions, in order to enhance the registration accuracy. In this paper, we evaluate our SR scheme on landmark images. With the rapid growth of web images, we believe our proposed framework can be applied to more scenarios.

In the future, we will put effort into extending our scheme to generic images as well as videos [46].

VIII. CONCLUSION

We propose a novel method to hallucinate landmark images by retrieving correlated web images. In contrast to the


traditional exemplar-based hallucination methods, which utilize one fixed training set, we build an adaptive correlated image set for each LR image by retrieving correlated images from the Internet. In the global matching stage, these images are registered with the LR image, so that more accurate high-frequency details can be retrieved from the registered images. In the patch matching stage, a GMSE-based search and an adaptive block size decision are proposed to improve the matching accuracy between LR and HR patches. The candidate patches are then blended optimally through a global energy minimization. Experimental results show that our method obtains the best results in both objective and subjective evaluations compared with four state-of-the-art methods.

ACKNOWLEDGMENT

The authors would like to thank the authors of [16], [22] for generating their results on our datasets.

REFERENCES

[1] L. Zhang and X. Wu, "An edge-guided image interpolation algorithm via directional filtering and data fusion," IEEE Trans. Image Process., vol. 15, no. 8, pp. 2226-2238, Aug. 2006.
[2] S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: A technical overview," IEEE Signal Process. Mag., vol. 20, no. 3, pp. 21-36, May 2003.
[3] S. Baker and T. Kanade, "Limits on super-resolution and how to break them," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1167-1183, Sep. 2002.
[4] W. T. Freeman, T. R. Jones, and E. C. Pasztor, "Example-based super resolution," IEEE Comput. Graph. Appl., vol. 22, no. 2, pp. 56-65, Mar./Apr. 2002.
[5] S. Y. Dai, M. Han, W. Xu, Y. Wu, and Y. Gong, "Soft edge smoothness prior for alpha channel super resolution," in Proc. IEEE Conf. CVPR, Jun. 2007, pp. 1-8.
[6] J. Sun, J. Sun, Z. B. Xu, and H.-Y. Shum, "Image super-resolution using gradient profile prior," in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1-8.
[7] Z. Xiong, X. Sun, and F. Wu, "Image hallucination with feature enhancement," in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 2074-2081.
[8] K. I. Kim and Y. Kwon, "Single-image super-resolution using sparse regression and natural image prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 6, pp. 1127-1133, Jun. 2010.
[9] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861-2873, Nov. 2010.
[10] Y. HaCohen, R. Fattal, and D. Lischinski, "Image upsampling via texture hallucination," in Proc. IEEE Int. Conf. Comput. Photography, Mar. 2010, pp. 20-30.
[11] O. Whyte, J. Sivic, and A. Zisserman, "Get out of my picture! Internet-based inpainting," in Proc. 20th Brit. Mach. Vis. Conf., Sep. 2009, pp. 1-11.
[12] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa, "PhotoSketch: A sketch based image query and composition," in Proc. ACM Conf. Graph., 2009.
[13] M. Eitz, R. Richter, K. Hildebrand, T. Boubekeur, and M. Alexa, "Photosketcher: Interactive sketch-based image synthesis," IEEE Comput. Graph. Appl., vol. 31, no. 6, pp. 56-66, Nov./Dec. 2011.
[14] T. Chen, M. M. Cheng, P. Tan, A. Shamir, and S. M. Hu, "Sketch2Photo: Internet image montage," in Proc. ACM SIGGRAPH ASIA, 2009.
[15] M. K. Johnson, K. Dale, S. Avidan, H. Pfister, and W. T. Freeman, "CG2Real: Improving the realism of computer generated images using a large collection of photographs," IEEE Trans. Vis. Comput. Graph., vol. 17, no. 9, pp. 1273-1285, Sep. 2011.
[16] L. Sun and J. Hays, "Super-resolution from internet-scale scene matching," in Proc. IEEE Conf. ICCP, Jun. 2012, pp. 1-12.
[17] H. Yue, X. Sun, J. Yang, and F. Wu, "SIFT-based image compression," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2012, pp. 473-478.
[18] L. Dai, H. Yue, X. Sun, and F. Wu, "IMShare: Instantly sharing your mobile landmark images by search-based reconstruction," in Proc. 20th ACM Int. Conf. Multimedia, 2012, pp. 579-588.
[19] X. Liu, L. Wan, Y. Qu, T.-T. Wong, S. Lin, C.-S. Leung, and P.-A. Heng, "Intrinsic colorization," ACM Trans. Graph., vol. 27, no. 5, pp. 152:1–152:9, 2008.
[20] N. Joshi, W. Matusik, E. H. Adelson, and D. J. Kriegman, "Personal photo enhancement using example images," ACM Trans. Graph., vol. 29, no. 2, p. 12, 2010.
[21] M. F. Tappen and C. Liu, "A Bayesian approach to alignment-based image hallucination," in Proc. ECCV, 2012, pp. 236–249.
[22] C. Hsu and C. Lin, "Image super-resolution via feature-based affine transform," in Proc. 13th Int. Workshop MMSP, Oct. 2011, pp. 1–5.
[23] Y. Tai, S. Liu, M. S. Brown, and S. Lin, "Super resolution using edge prior and single image detail synthesis," in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 2400–2407.
[24] J. Sun and M. F. Tappen, "Context-constrained hallucination for image super-resolution," in Proc. CVPR, 2010, pp. 231–238.
[25] J. Sun, Q. Chen, S. Yan, and L.-F. Cheong, "Selective image super-resolution," 2010, arXiv:1010.5610v1.
[26] M. Eisemann, E. Eisemann, H. Seidel, and M. Magnor, "Photo zoom: High resolution from unordered image collections," in Proc. GI, 2010, pp. 71–78.
[27] Y. Rui, T. S. Huang, and S. F. Chang, "Image retrieval: Current techniques, promising directions, open issues," J. Vis. Commun. Image Represent., vol. 10, no. 1, pp. 39–62, 1999.
[28] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Comput., Commun. Appl., vol. 2, no. 1, pp. 1–19, 2006.
[29] C. H. Wang, Z. W. Li, and L. Zhang, "MindFinder: Image search by interactive sketching and tagging," in Proc. 19th Int. Conf. World Wide Web, 2010, pp. 1309–1312.
[30] Y. Ke, R. Sukthankar, and L. Huston, "An efficient parts-based near-duplicate and sub-image retrieval system," in Proc. 12th Annu. ACM Multimedia, 2004, pp. 869–876.
[31] Q. F. Zheng, W. Q. Wang, and W. Gao, "Effective and efficient object-based image retrieval using visual phrases," in Proc. 14th Annu. ACM Int. Conf. Multimedia, 2006, pp. 77–80.
[32] Z. Wu, Q. F. Ke, M. Isard, and J. Sun, "Bundling features for large scale partial-duplicate web image search," in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 25–32.
[33] W. G. Zhou, Y. J. Lu, H. Q. Li, Y. B. Song, and Q. Tian, "Spatial coding for large scale partial-duplicate web image search," in Proc. Int. Conf. Multimedia, 2010, pp. 511–520.
[34] L. Dai, X. Sun, F. Wu, and N. Yu, "Large scale image retrieval with visual groups," in Proc. IEEE ICIP, Sep. 2013.
[35] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Conf. CVPR, Jun. 2007, pp. 1–8.
[36] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: Improving particular object retrieval in large scale image databases," in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–8.
[37] H. Jegou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. 10th Eur. Conf. Comput. Vis., 2008, pp. 304–317.
[38] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[39] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, Oct. 2003, pp. 1470–1477.
[40] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[41] J. Sun, N. N. Zheng, H. Tao, and H. Y. Shum, "Image hallucination with primal sketch priors," in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2003, pp. 729–736.
[42] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[43] (2013). Correlated Images [Online]. Available: https://skydrive.live.com/?lc=1033#cid=033247E518A5E217&id=33247E518A5E217%21116
[44] S. I. A. Pandian, G. J. Bala, and B. A. George, "A study on block matching algorithms for motion estimation," Int. J. Comput. Sci. Eng., vol. 3, pp. 34–44, Jan. 2011.
[45] A. Delong, A. Osokin, H. Isack, and Y. Boykov, "Fast approximate energy minimization with label costs," Int. J. Comput. Vis., vol. 96, no. 1, pp. 1–27, 2012.
[46] C. Ancuti, T. Haber, T. Mertens, and P. Bekaert, "Video enhancement using reference photographs," Vis. Comput., vol. 24, nos. 7–9, pp. 709–717, 2008.
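For readers who wish to prototype the pipeline summarized in the conclusion, the following Python sketch illustrates its two matching stages. It is an illustrative approximation under stated assumptions, not the authors' implementation: registration is reduced to a single SIFT + RANSAC homography per retrieved image (in the spirit of [38], [40]); gmse() is a hypothetical gradient-weighted stand-in for the structure-aware matching criterion; the block size is fixed rather than adaptive; and the energy-minimization blending of [45] is replaced by naive best-candidate copying. All function names and parameters (alpha, block, search) are ours, not the paper's.

# Illustrative sketch only -- not the authors' implementation.
# Assumes grayscale uint8 inputs; requires OpenCV >= 4.4 for SIFT.
import numpy as np
import cv2

def register_to_lr(up_lr, hr_img):
    """Global registration: warp a retrieved HR image onto the grid of
    the up-sampled LR image via SIFT matches [38] and RANSAC [40]."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(up_lr, None)
    kp2, des2 = sift.detectAndCompute(hr_img, None)
    matches = cv2.BFMatcher().knnMatch(des2, des1, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    # (A production version should check len(good) >= 4 before fitting.)
    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = up_lr.shape[:2]
    return cv2.warpPerspective(hr_img, H, (w, h))

def gmse(p, q, alpha=0.5):
    """Hypothetical structure-aware distance: intensity MSE blended
    with gradient MSE so that edge structure dominates the score."""
    p, q = p.astype(np.float64), q.astype(np.float64)
    gy_p, gx_p = np.gradient(p)
    gy_q, gx_q = np.gradient(q)
    d_int = np.mean((p - q) ** 2)
    d_grad = np.mean((gx_p - gx_q) ** 2 + (gy_p - gy_q) ** 2)
    return (1.0 - alpha) * d_int + alpha * d_grad

def match_patches(up_lr, registered, block=16, search=8):
    """For each block of the up-sampled LR image, search a small window
    around the co-located position in every registered image and copy
    the candidate with the lowest GMSE (a naive stand-in for the
    paper's global energy-minimization blending [45])."""
    h, w = up_lr.shape[:2]
    out = up_lr.astype(np.float64)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            ref = up_lr[y:y + block, x:x + block]
            best, best_d = ref, np.inf
            for img in registered:
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy <= h - block and 0 <= xx <= w - block:
                            cand = img[yy:yy + block, xx:xx + block]
                            d = gmse(ref, cand)
                            if d < best_d:
                                best, best_d = cand, d
            out[y:y + block, x:x + block] = best
    return out.astype(np.uint8)

A typical invocation, assuming up_lr is the bicubically up-sampled LR image and retrieved is the list of retrieved HR images: registered = [register_to_lr(up_lr, img) for img in retrieved]; sr = match_patches(up_lr, registered).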


Huanjing Yue received the B.S. degree in electrical engineering from Tianjin University, Tianjin, China, in 2010, where she is currently pursuing the Ph.D. degree in electrical engineering. Her current research interests include image compression and image processing.

Xiaoyan Sun (M’04–SM’10) received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1997, 1999, and 2003, respectively. Since 2004, she has been with Microsoft Research Asia, Beijing, China, where she is currently a Lead Researcher with the Internet Media Group. She has authored or co-authored more than 60 journal and conference papers and ten proposals to standards. Her current research interests include image and video compression, image processing, computer vision, and cloud computing. Dr. Sun was a recipient of the Best Paper Award of the IEEE Transactions on Circuits and Systems for Video Technology in 2009.

Jingyu Yang (M’10) received the B.E. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 2003, and the Ph.D. (Hons.) degree from Tsinghua University, Beijing, in 2009. Since 2009, he has been with the Faculty of Tianjin University, Tianjin, China, where he is currently an Associate Professor with the School of Electronic Information Engineering. He visited Microsoft Research Asia (MSRA) from February to August 2011 within MSRA’s Young Scholar Supporting Program, and visited the Signal Processing Laboratory at EPFL, Lausanne, Switzerland, from July to October 2012. His research interests mainly include image/video processing and computer vision. He was selected for the New Century Excellent Talents in University (NCET) program of the Ministry of Education of China in 2011 and for the Elite Scholar Program of Tianjin University in 2012.

Feng Wu received the B.S. degree in electrical engineering from Xidian University, Xi’an, China, in 1992, and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1996 and 1999, respectively. He joined Microsoft Research Asia (formerly Microsoft Research China) as an Associate Researcher in 1999. He has been a Researcher with Microsoft Research Asia since 2001 and is now a Senior Researcher/Research Manager. His research interests include image and video compression, media communication, and media analysis and synthesis. He has authored or co-authored over 200 high-quality papers and filed 67 U.S. patents, and 13 of his techniques have been adopted into international video coding standards. As a co-author, he received the Best Paper Award of the IEEE TCSVT in 2009, PCM in 2008, and SPIE VCIP in 2007. He serves as an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Multimedia, and several other international journals. He served as TPC Chair of MMSP 2011, VCIP 2010, and PCM 2009, as TPC Track Chair of ICME 2012, ICME 2011, and ICME 2009, and as Special Sessions Chair of ICME 2010.

Feb 2, 2012 - interact (e.g., talk with microphones/ headsets, listen to presentations, ask questions, etc.) with other avatars virtu- ally located in the same ...