
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 14, NO. 3, MARCH 2004

Three-Dimensional Image Processing in the Future of Immersive Media

Francesco Isgrò, Emanuele Trucco, Peter Kauff, and Oliver Schreer

Abstract—This survey paper discusses the three-dimensional image processing challenges posed by present and future immersive telecommunications, especially immersive video conferencing and television. We introduce the concepts of presence, immersion, and co-presence and discuss their relation to virtual collaborative environments in the context of communications. Several examples are used to illustrate the current state of the art. We highlight the crucial need of real-time, highly realistic video with adaptive viewpoint for future immersive communications and identify calibration, multiple-view analysis, tracking, and view synthesis as the fundamental image-processing modules addressing such a need. For each topic, we sketch the basic problem and representative solutions from the image processing literature.

Index Terms—Computer vision, immersive communications, three-dimensional (3-D) image processing, three-dimensional television (3-D-TV), videoconferencing.

Fig. 1. Classification of presence (adapted from [1] and [2]).

I. INTRODUCTION

This survey paper discusses the future of immersive telecommunications, especially video conferencing and television, and the three-dimensional (3-D) image-processing techniques needed to support such systems. The discussion is organized into two parts.

Part 1 introduces immersive telecommunication systems through the concepts of presence, immersion, and co-presence and their relation to virtual collaborative environments and shared environments within communication scenarios. We focus on two major applications, immersive video conferencing and immersive television, which have emerged as challenging research areas in the past few years. In both, immersiveness relies mostly on visual experience within a mixed-reality scenario, in which participants interact with each other in a half-real, half-virtual environment. As the virtual imagery is created electronically from real video material, it is necessary to ask which computer vision and image-processing techniques will play a major role in supporting the immersive conferencing and television systems of the future.

Part 2 attempts to provide an answer focused on the two key applications identified in Part 1. The key modules required must, crucially, support the dynamic rendering of 3-D objects correctly and consistently; 3-D image processing is therefore at the heart of immersive communications.

Manuscript received January 14, 2003; revised October 13, 2003. The work was supported in part by the European Union Framework-V under Grant VIRTUE IST-1999-10044. F. Isgrò is with the Dipartimento di Informatica e Scienze dell’Informazione, Università di Genova, 16146 Genova, Italy (e-mail: [email protected]). E. Trucco is with the School of Engineering and Physical Sciences, Electrical, Electronic and Computer Engineering, Heriot-Watt University, EH14 4AS Edinburgh, U.K. (e-mail: [email protected]). P. Kauff and O. Schreer are with the Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, D-10587 Berlin, Germany (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TCSVT.2004.823389

Our discussion, which is limited due to space constraints, identifies calibration, multiple-view analysis, tracking, and view synthesis as the fundamental image-processing modules that immersive systems must incorporate to achieve immersiveness within mixed-reality scenarios. For each of these topics, we sketch the basic problem and some known representative solutions from the image-processing literature. Further modules, obviously necessary (e.g., figure-background segmentation) but less foundational or characteristic for our applications, are not dealt with here.

II. APPLICATIONS, SCENARIOS, AND CHALLENGES

A. From Presence to Immersive Telepresence

The idea of immersive media is grounded in two basic concepts: presence and immersion. The structure of presence has been studied for a long time in the interdisciplinary field of human factors research. Although several aspects are still unclear, it is commonly agreed that the basic meaning of "presence" can be stated as "being virtually there" [1], [2]. The different approaches found in the literature can be divided roughly into two main categories, social and physical presence (Fig. 1).

Social presence refers simply to the feeling of being together with other parties engaged in a communication activity. It does not necessarily aim at a good reproduction of spatial proximity or at high levels of realism. A visualization of the communication situation is not required and sometimes even undesirable. In fact, Internet chats and e-mail, phone calls, or conventional paper letters may give us a strong impression of social presence. In contrast, physical presence concerns the sensation of being physically colocated in a mediated space with communication partners; radio, TV, and cinema are classical examples.



Note that most of these are entertainment services that, although sometimes enjoyed in groups, do not necessarily improve the social aspects of communication. At the intersection of these two categories, we can identify the quite new area of co-presence systems (see Fig. 1). Videoconferencing and shared virtual environments are good examples of this intermediate category: by providing a sense of togetherness and an impression of co-location in a shared working space, such systems support social and physical presence simultaneously.

In contrast to the general interpretation of presence formulated from the human factors standpoint, the concept of immersion fits into the technical domain and is clearly linked to the category of physical presence. Immersion concerns concrete technical solutions specifically improving the sense of physical presence in a given application scenario. It is interesting to notice that the roots of immersive systems are not found in telecommunications, but in stand-alone applications like cinema, theme park entertainment, flight simulators, and other virtual training systems; a popular example is the transition from conventional cinemas to IMAX theaters.

The introduction of immersion in telecommunications and broadcast is a new and exacting challenge for image processing, computer graphics, and video coding. Recent advances in computer hardware, networks, and 3-D video processing technologies are beginning to supply adequate support for algorithms to meet this challenge. A new type of capability, immersive telepresence, is therefore emerging through real systems. The user of an immersive telepresence system feels part of a virtual or mixed-reality scene within an interactive communication situation, e.g., video conferencing. This feeling is mainly determined by visual cues like high-resolution images, realistic rendering of 3-D objects, low latency and delay, motion parallax, and the seamless combination of artificial and live video contents, but also by acoustic cues like correct and realistic 3-D sound.

We concentrate here on two major applications of immersive telepresence. The first is immersive 3-D videoconferencing, which allows geographically distributed users to hold a videoconference with a strong sense of physical and social presence. Participants are led to believe that they are co-present at a round-table discussion thanks to the realistic, real-time reproduction of real-life communication cues like gestures, gaze directions, body language, and eye contact. The second application is immersive television, which can be regarded as a next-generation broadcast technology. The ultimate challenge for immersive broadcast systems is the natural reproduction and rendering of large-scale, real-world 3-D scenes and their interactive presentation on suitable displays. Before we discuss next-generation systems in these two applications (Sections II-C and II-D), we sketch the main concepts behind collaborative systems and shared environments in general.

B. Collaborative Systems and Shared Environments in Immersive Communications

Collaborative systems allow geographically distributed users to work jointly at the same task. A simple example is document


Fig. 2. Representation of an office scene for CVE applications (from the German KICK project).

Fig. 3. CAVE, University College London.

sharing, for which NetMeeting is often used in collaborative teamwork applications. A more advanced approach is collaborative virtual environments (CVEs) or shared virtual environments (SVEs). These are typical co-presence systems, but not necessarily immersive. The main reason is that they are usually PC-based applications with small displays, which greatly reduces the scope for real-life realistic interaction. Example applications are given by the European IST project TOWER (Theatre of Work) and the German project KICK (see Fig. 2). They both provide awareness of collaborative activities among team members and their shared working context through symbolic presentation in a virtual 3-D environment [3], [4]. The aim is to enhance distributed collaborative work with group awareness and spontaneous communication capabilities very similar to face-to-face working conditions. The step toward immersive telecollaboration (ITC) occurred with the development of CAVEs and workbenches (see Fig. 3) [5]. These systems achieve presence by allowing the user to interact naturally with a virtual environment, using head tracking or haptic devices to align the virtual and real worlds. CAVEs and workbenches were originally stand-alone systems, mainly developed for presentation purposes, but the increasing number of


Fig. 6. Rendering of a virtual 3-D conference.

Fig. 4. ACTS project Coven.

Fig. 5. Multiview capture for the VIRTUE three-party conference system.

CAVE sites, combined with increasingly accessible broad-band backbone networks, paved the way for the introduction of telecollaboration.

A particularly important field of CVE or SVE in the context of immersive video conferencing is shared virtual table environments (SVTEs). The basic idea is to place 3-D visual representations of participants, usually graphical avatars, at predefined positions around a virtual table. An example is the European ACTS project COVEN [6], which demonstrated the benefits of the SVTE concept with a networked virtual reality (VR) business-game application in 1997 (see Fig. 4) [7]. In more complex systems, the motion and 3-D shape of the participants are captured at each terminal by a multiple-camera setup like the ones shown in Fig. 5 (the VIRTUE system, see Section II-C).

The 3-D arrangement of participants around the shared table is ideally isotropic (all participants appear in the same size and are equally spaced around the table), as symmetry suggests equal social importance. Hence, in a three-party conference the participants would form an equilateral triangle, in a four-party conference a square, and so on. Following this composition rule and given the number of participants, the same appropriate SVTE can be displayed at each terminal of a conference. Individual views of the virtual conference scene can then be rendered at each terminal using a virtual camera (see Fig. 6) which follows the instantaneous position of the participant's eyes, continuously estimated by a head tracker.
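As a concrete illustration of this composition rule (this is not code from COVEN or VIRTUE; all names and parameter values are illustrative assumptions), the following sketch places N participants isotropically around a virtual round table and builds a simple look-at view matrix for the virtual camera driven by the tracked eye position.

```python
import numpy as np

def svte_seats(n_participants, radius=1.2):
    """Place N participants isotropically on a circle of given radius (meters).
    Returns an (N, 3) array of seat positions on the table plane (y = 0)."""
    angles = 2.0 * np.pi * np.arange(n_participants) / n_participants
    return np.stack([radius * np.cos(angles),
                     np.zeros(n_participants),
                     radius * np.sin(angles)], axis=1)

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """Right-handed look-at view matrix: maps world points into the frame of a
    virtual camera located at `eye` and looking toward `target`."""
    eye, target, up = (np.asarray(v, dtype=float) for v in (eye, target, up))
    fwd = target - eye
    fwd /= np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, fwd)
    view = np.eye(4)
    view[0, :3], view[1, :3], view[2, :3] = right, true_up, -fwd
    view[:3, 3] = -view[:3, :3] @ eye
    return view

# Three-party conference: the seats form an equilateral triangle around the table.
seats = svte_seats(3)
# The local participant's tracked eye position (illustrative head-tracker value)
# drives the virtual camera used to render the shared conference scene.
eye_pos = np.array([seats[0, 0], 0.35, seats[0, 2]])
view_matrix = look_at(eye_pos, target=(0.0, 0.0, 0.0))
print(seats)
print(view_matrix)
```

Because every terminal can evaluate the same rule, all participants see a consistent spatial arrangement; only the virtual-camera pose differs, following the local head tracker.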

Assuming that the real and virtual worlds are correctly calibrated and aligned, the conference scene can be displayed to each participant from the correct viewpoint, even if the participant moves his or her head continuously. The geometric alignment of virtual and real worlds crucially supports the visual presence cues of an SVTE application, e.g., gaze awareness, gesture reproduction, and eye contact. In addition, the support of head motion parallax allows participants to change their viewpoint purposively, e.g., to watch the scene from a different perspective or to look behind objects.

As already mentioned, most approaches in this area have been limited to strictly graphical environments and avatars for visualizing remote users. Several such systems have been proposed during the last decade; recently, researchers have begun to integrate video streaming into the virtual scene, for instance incorporating video presentations on virtual screens in the scene, or integrating seamlessly 2-D video images or even 3-D video avatars into CVEs to increase realism [8], [9]. Some of these approaches were driven by the advent of the MPEG-4 multimedia standard and its powerful coding and composition tools. An example is Virtual Meeting Point (VMP), developed by the Fraunhofer Institute for Telecommunications/Heinrich-Hertz-Institut in collaboration with Deutsche Telekom [10]. VMP is a low-bitrate, MPEG-4-based software application in which the image of each participant becomes a video object to be pasted into a virtual scene displayed on each participant's screen. Further examples are the virtual conference application of the European ACTS project MoMuSys and the IST project SonG [11], [12].

C. Immersive Videoconferencing

Effective videoconferencing is an important facility for businesses with geographically distributed operations, and high-speed, computer-based videoconferencing is a potential killer application, gathering research efforts from major market players like VTEL, PictureTel, SONY, Teleportec, VCON, and others. Corporate reasons for using videoconferencing systems include business globalization or decentralization, increased competition, pressure for higher reactivity and shorter decision-making, an increase in the number of partners, and reduced time and travel costs. According to a proprietary Wainhouse Research study from March 2001, voice conferencing and e-mail are still preferred, in general, to videoconferencing. Key barriers seem to be high unit prices, the limited (perceived) business needs, the cost of ownership, and concerns about integration, lack of training, and user friendliness. One reason is that these


Fig. 7. Screen shot of a NetMeeting session.

Fig. 9. Design of a virtual auditorium, Stanford University.

Fig. 10. Plasma-lift videoconference table, D+S sound lab.

Fig. 11. Examples of the Teleportec system.

Fig. 8. Access grid.

systems still offer little support for natural, human-centered communication. Most of them are window-based, multiparty applications where single-user images are presented in separate PC windows or displayed in full-body size on large video walls, often in combination with other window tools. One example used frequently in the field of desktop applications is NetMeeting (Fig. 7). Notice that gestures, expressions, and appearance are reproduced literally, i.e., camera images are displayed unprocessed. Further examples can be found in the framework of the U.S. Internet2 consortium, for instance the Access Grid (AG) (Fig. 8), the Virtual Rooms Video Conferencing Service (VRVS), the Virtual Auditorium of Stanford University (Fig. 9), or the Global Conference System [13]–[16]. Based on high-speed backbone networks, these systems offer high-quality audio and video equipment with presence capabilities for different applications like teleteaching, teleconferencing and telecollaboration. Nevertheless, realism is lacking as eye contact and realistic viewing conditions are not supported. Other systems are dedicated to two-site conferences and use a point-to-point connection between two user groups. The local group sits at one end of a long conference table placed against a large screen; the table is continued virtually in the screen, and

the remote group appears at the other end of the table (in the screen). All participants get the impression of sitting at the same table. Furthermore, as the members of each group sit close together and the virtual viewing distance between the two groups is quite large, eye contact, gaze direction, and body language can be reproduced at least approximately. However, as mentioned above, such videoconference table systems are usually restricted to two-site, point-to-point scenarios. Two commercial state-of-the-art examples are reported in [17] and [18] (Figs. 10 and 11). The main restriction in all these systems is the use of conventional, unprocessed 2-D video, often coded by MPEG-2 or H.263. This makes it impossible to meet a basic requirement of immersive, human-centered communication in videoconferencing, which is that every participant gets his or her own view of the conference scene. This feature requires a virtual camera tracking the individual viewer’s viewpoint and an adaptation of the viewpoint from which the incoming video images are displayed. Given this, the mission of immersive 3-D video conferencing can be seen as combining the SVTE concept with adequately processed video streams, consequently taking advantage of both the high grade of realism provided by real video and the versatile functionalities of SVTE systems. The main objective is to offer rich communication modalities, as


Fig. 12. Setup of the tele-cubicle approach of UNC, Chapel Hill, and the University of Pennsylvania.

similar as possible to those used in face-to-face meetings (e.g., gestures, gaze awareness, realistic images, and correct sound direction). This would overcome the limitations of conventional videoconferencing and VR-based CVE approaches, in which face-only images shown in separate windows, unrealistic avatars, and missing eye contact impoverish communication.

The most promising video-based SVTE approach is probably the tele-cubicle approach [19], [20] developed within the U.S. National Tele-Immersion Initiative (NTII) [21]. Here, remote participants appear on separate stereo displays arranged in an SVTE-like spatial setup. A common feature is the symmetric arrangement of participants around the shared table, with each participant appearing in his or her own screen (Fig. 12). Note that symmetry guarantees consistent eye contact, gaze awareness, and gesture reproduction: everybody in the conference perceives consistently who is talking to whom or who is pointing at what (i.e., everybody perceives the same spatial arrangement) and in the correct perspective (i.e., the view is consistent with each individual viewpoint). For example, if the person at the terminal in Fig. 12 talks to the one on the left while making a gesture toward the one on the right, the latter can easily recognize that the two others are talking about him. Viewing stereo images with shutter glasses supports the 3-D impression of the represented scene and the remote participants.

The tele-cubicle concept holds undeniable merit, but it still carries disadvantages and unsolved problems. First of all, the specifically arranged displays appear as "windows" in the offices of the various participants, resulting in a restricted mediation of social and physical presence. Furthermore, the tele-cubicle concept is well suited to a fixed number of participants (e.g., three in the setup of Fig. 12) and limited to single-user terminals only, but does not scale well: any addition of further terminals requires a physical rearrangement of displays and cameras, simply to adjust the geometry of the SVTE setup to the new situation. Finally, it is difficult to reconcile the tele-cubicle concept with the philosophy of shared virtual working spaces. Although the NTII has already demonstrated an integration of telecollaboration tools into their experimental tele-cubicle setup, the possibility of joint interaction is limited to two participants only, and shared workspaces with more than two partners are hard to achieve because of the physical separation of the tele-cubicle windows.

Fig. 13. VIRTUE demonstrator.

To overcome these shortcomings, a new SVTE concept has been proposed by the IST project VIRTUE (Virtual Team User Environment) [22], [23]. It offers all the benefits of tele-cubicles, but extends them by integrating SVTEs with shared virtual working spaces. The main idea is a twofold combination of the SVTE and mixed-reality metaphors. First, a seamless transition between the real table in front of the display and the virtual conference table in the screen gives the user the impression of being part of a single, extended perceptual and working space. Second, the remote participants are rendered seamlessly and in the correct perspective into the virtual conference scene using real video with adapted viewpoint. Fig. 13 shows the VIRTUE setup, which was demonstrated in full for the first time at the Immersive Communication and Broadcast Systems (ICOB) workshop in Berlin, Germany, in January 2003.

D. Immersive Broadcast Systems

Being present at a live event is undeniably the most exciting way to experience any kind of entertainment. The mission of immersive broadcast services is to bring this experience to users unable to participate in person. A first technical approach was realized by Kanade at the 2001 Super Bowl within the EyeVision project [24]. The objective was to design a new broadcast medium, combining the realism of video or cinema with natural interaction with scene contents, as in VR applications, and to provide immersive home entertainment systems, bridging the gap between the different levels of participation and intensity granted by live events and state-of-the-art consumer electronics (Fig. 14). For this purpose, immersive broadcast services must incorporate three different features: panoramic large-screen video viewing, stereo viewing, and head motion parallax viewing.

Panoramic viewing is well known from large-screen projection techniques in cinema or IMAX theaters. Electronic panorama projection is mainly attractive for digital cinema, but also for other immersive visualization techniques like video walls, the "office of the future," ambient rooms, CAVEs, and workbenches. Such large-screen projections require an extremely high definition, say, at least 4000 × 3000 pixels. Often, the horizontal resolution required is even higher. In contrast, the best digital cinema projectors available on the market are limited to QXGA resolution (2048 × 1536 pixels) and can be very expensive.


Fig. 14. Objective of immersive TV.

Fig. 15. Multiple video projection with six cascaded CineBoxes.

Fig. 16. Immersive TV viewed by head-mounted systems.

Fig. 17. Technical implementation of an Immersive TV system.

Due to these drawbacks, several researchers have proposed to mosaic multiple projections into one large panoramic image. One example is the CineBox approach of the German joint project D_CINEMA. As shown in Fig. 15, it is a modular approach using one CineBox as a basic unit. Each CineBox provides an MPEG-2 HD decoder with extended functionality. It offers electronic blending functions to control a seamless transition from one image to another in overlap areas. In addition, the MPEG decoding of the various CineBoxes can be synchronized. Hence, cascading CineBoxes for multiple projections is very flexible, and up to six HD images can be mosaiced into a panoramic view. Further examples and details on multiple-projection techniques can be found in [25] and [26].

Lodge [27], [28] was the first to propose the use of such techniques for broadcast. The resulting Immersive TV concept envisages capturing, encoding, and broadcasting wide-angle, high-resolution views of live events combined with multichannel audio, and displaying these events through individual, immersive, head-mounted viewing systems in combination with a head tracker and other multisensory devices like motion seats (Fig. 16). The most significant feature, however, is that Immersive TV targets a one-way distribution service. This means that, unlike usual VR applications, the same signal can be sold to

any number of viewers without the need for the broadcaster to handle costly interactive networks and servers. A possible implementation, similar to the one shown in Fig. 15 for digital cinema, is outlined in Fig. 17. A large panoramic view is transmitted in the form of multiple HD MPEG-2 streams, synchronized at the receiver and stitched seamlessly into one image. A special merger unit is used for this purpose and allows the user to look around the scene while wearing a head-mounted display. Optionally, the stitched HD frames can also be watched jointly by a small group of viewers as wide-screen panorama projection, for which purpose a given number of HD projectors can be plugged into a single merger unit. An extension of this system toward head motion parallax viewing, called Interactive Virtual Viewpoint Video (IVVV), is reported in [29] and [30]. Several panoramic viewing systems have been proposed for Internet applications; [31] is a representative example. Another important cue for Immersive TV is stereo viewing. The exploitation of stereo vision for broadcast services has long been a focus in 3-D television (3-D-TV). However, most approaches to 3-D-TV were restricted to the transmission of two video streams, one for each eye. In contrast, the European IST project ATTEST has recently proposed a new concept for 3-D-TV [32], [33]. It is based on a flexible, modular, and open architecture that provides important system features, such as backward compatibility to today’s 2-D digital TV, scalability in terms of receiver complexity, and adaptability to a wide range of different 2-D and 3-D displays. The data representation and coding syntax of the ATTEST system make use of the layered structure shown in Fig. 18. This structure basically consists of


one base layer and at least one additional enhancement layer. To achieve backward compatibility with today's conventional 2-D digital TV, the base layer is encoded using the state-of-the-art MPEG-2 and DVB standards. The enhancement layer delivers the additional information to the 3-D-TV receiver. The minimum information transmitted here is an associated depth map providing one depth value for each pixel of the base layer, so that a stereo view can be reconstructed from the baseline MPEG-2 stream. Note that the layered structure in Fig. 18 is extendable in this sense. For critical video content (e.g., large-scale scenes with a high amount of occlusions), one can add further layers, for example, segmentation masks and maps with occluded texture. Hence, the ATTEST concept can be seen as an interesting introduction scenario for immersive TV [34]. It strictly follows an evolutionary approach, being backward compatible with existing services on the one hand and open for future extensions on the other. In addition, it allows the use of a head tracker to support head motion parallax viewing for both 2-D and 3-D displays. This is an important feature for immersive broadcast services, because it gives the user the opportunity to interact intuitively with the scene content: for instance, the head position can be changed purposively to watch the same TV scene from different viewpoints. Thus, ATTEST is a first step toward visionary scenarios like free-viewpoint TV, Virtualized Reality, EyeVision, or Ray-Space TV, where the user can walk with a virtual camera through moving video scenes [35]–[38].

Fig. 18. Layered data representation used by the ATTEST system.
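To make the role of the depth enhancement layer concrete, the sketch below shows one naive way a receiver could synthesize a second view from the decoded base-layer image and its per-pixel depth map by horizontal pixel shifting (depth-image-based rendering). The parallel-camera parallax model, baseline, and focal-length values are illustrative assumptions rather than the ATTEST specification, and the forward warping leaves disocclusion holes that a real system would fill, e.g., from the occluded-texture layers mentioned above.

```python
import numpy as np

def synthesize_right_view(left, depth, baseline=0.06, focal_px=900.0):
    """Naive depth-image-based rendering: forward-warp each pixel of the
    base-layer image `left` (H x W x 3) horizontally by a disparity derived
    from the per-pixel depth map `depth` (H x W, meters), assuming a simple
    parallel-camera model: disparity = focal_px * baseline / depth."""
    h, w = depth.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    disparity = focal_px * baseline / np.maximum(depth, 1e-3)
    for y in range(h):
        for x in range(w):
            xr = int(round(x - disparity[y, x]))
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    return right, ~filled   # holes mark disoccluded areas needing extra layers

# Toy example: a flat background 50 m away with a nearer object at 5 m.
left = np.tile(np.arange(64, dtype=np.uint8).reshape(1, 64, 1), (16, 1, 3))
depth = np.full((16, 64), 50.0)
depth[4:12, 20:36] = 5.0
right, holes = synthesize_right_view(left, depth)
print(int(holes.sum()), "disoccluded pixels to fill from additional layers")
```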

III. 3-D VIDEO PROCESSING FOR IMMERSIVE COMMUNICATIONS

How will 3-D computer vision and video processing play a role in the scenarios described above? Which techniques are most likely to be needed, developed, and integrated in the immersive communication systems of the future? In Fig. 19, we attempt a schematic representation of our answer, focusing especially on the systems described in Sections II-C and II-D. From an input set of video sequences, the system generates, in general, a new video sequence, e.g., seen from a different viewpoint. Some calibration is generally necessary; this is either precomputed by an offline procedure or obtained online from a set of features tracked and matched across the sequences. Once the system is calibrated, a description of the scene must be extracted from the video data in order to create the output sequence. To this end, we look for a spatial relationship between synchronized frames in the various sequences via the image-matching module. Once the structure of the scene has been determined, the output sequences can be rendered according to the particular application needs.

In the following, we identify some foundational techniques for the application scenarios described in this paper. For each of them, we state basic definitions and problems, provide a quick, structured tour of image-processing solutions, and try to point out which solutions are feasible for immersive systems.

A. Camera Calibration

Camera calibration [39], [40] is the process of estimating the parameters involved in the projection equations. Such parameters encode the geometry of the projection taking place in each camera and the positions of the cameras in space, and they are necessary if we want to extract the Euclidean geometry of the scene from 2-D images. This is frequently useful or even necessary for immersive systems, for instance to guarantee geometric consistency across different terminals or to select the correct size of synthetic objects to be integrated in a virtual scene. Calibration algorithms abound in the literature, but, given the practical importance of the topic, it seems appropriate to include a concise discussion of recent developments. We identify two main classes of camera calibration techniques: Euclidean calibration and self-calibration.

1) Euclidean Calibration: Euclidean calibration methods estimate the values of the projection parameters with no ambiguity apart from their intrinsic accuracy limits. These methods are based on the observation of a calibration object, the 3-D geometry of which is known to a high accuracy. Examples of this approach are given in [39] and [41]–[44]. These methods may look tricky, as they require particular setups (e.g., special calibration objects are necessary, sometimes to be placed in particular positions). This can be a drawback, especially for large-market applications, where users cannot be expected, in general, to go through complicated setup procedures. Therefore, easier and more flexible algorithms are ideally required, for which little or no setup is necessary. Some steps in this direction have been made recently: in [45], the calibration is performed from at least two arbitrary views of a planar calibration pattern, which can be constructed easily by printing dots on a sheet of paper. More recently, the same author presented an algorithm for calibration from one-dimensional (1-D) objects [46].

Most immersive communication systems include several cameras (for instance, VIRTUE uses four cameras), which must be calibrated within the same world reference frame. This can be achieved simply with ad hoc calibration patterns (a multifaceted one, for instance) visible from all cameras, but more specialized algorithms exist, especially for the case of two-camera stereo [44], [47]. For systems with more than two cameras, the reader is referred to [48]–[50].
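As a concrete illustration of the planar-pattern approach of [45], the following sketch uses OpenCV's implementation to estimate the intrinsic parameters of a single camera, plus one pose per view, from several images of a printed checkerboard; the pattern size, square size, and folder name are illustrative assumptions. For a multicamera setup such as VIRTUE, the same pattern observed by several cameras can be used to chain or jointly refine the extrinsics into one world reference frame.

```python
import glob
import cv2
import numpy as np

# Planar calibration target: inner corner grid and square size (assumed values).
PATTERN = (9, 6)          # inner corners per row, column
SQUARE_MM = 25.0          # printed square size

# 3-D coordinates of the corners in the pattern's own plane (Z = 0).
object_pts = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
object_pts[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
image_size = None
for fname in glob.glob("calib_views/*.png"):      # hypothetical image folder
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(object_pts)
    img_points.append(corners)
    image_size = gray.shape[::-1]

# Estimate intrinsics K, distortion, and one rotation/translation per view.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, image_size, None, None)
print("reprojection RMS error (pixels):", rms)
print("camera matrix:\n", K)
```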


Fig. 19. Schematic representation of a computer vision system for video immersive applications.

2) Self-Calibration: Where several views of a scene are available (e.g., multiple-camera systems, moving cameras), a full Euclidean calibration may be difficult to achieve or even unnecessary. In these cases, it is still possible to achieve a weak calibration, which still gives the geometric relation among the different cameras, but in a projective space. In practical terms, this involves computing an algebraic structure encoding the constraints imposed by the multicamera geometry. These structures encode important relations among images (e.g., useful constraints for correspondence search) and are computed directly from image correspondences. From these structures, it is possible to recover (up to a projective transformation) the camera parameters. The algebraic structures mentioned above are the fundamental matrix [51] for a two-camera stereo system, the trifocal tensor [52] for a three-camera system, and the quadrifocal tensor for a four-camera system. The list stops here, as it has been proved that no such structures combining more than four images exist [53].

Recent research on calibration has focused on this problem, called self-calibration: estimating the camera parameters from weakly calibrated systems and with little or no a priori Euclidean information about the 3-D world. The main assumption made by this class of algorithms is the rigidity of the scene [54]–[57]. A critical review of self-calibration techniques can be found in [58]. The highest potential of self-calibration for immersive communications is probably in the entertainment industry, e.g., mixing movies from different sources or creating augmented-reality film effects with no set rigs or additional structures (see, for instance, the BouJou system by 2D3 [59]). For applications requiring a strong sense of physical presence, such as immersive teleconferencing, it is usually acceptable to fully calibrate the system offline.

B. Multiview Correspondence

Multiview correspondence, or multiple-view matching, is the fundamental problem of determining which parts of two or more

Fig. 20. Camera arrangement for the VIRTUE setup (left-hand stereo pair).

Fig. 21. Resulting original stereo pair acquired from cameras a and b.

Fig. 22. Visualization of the associated disparity maps from camera a to b (left) and from b to a (right). Notice the inversion of grey levels encoding disparity.

images (views) are projections of the same scene element. The output is a disparity map for each pair of cameras, giving the relative displacement, or disparity, of corresponding image elements (see Figs. 20–22). Disparity maps allow us to estimate


the 3-D structure of the scene and the geometry of the cameras in space.

Passive stereo [40] remains one of the fundamental technologies for estimating 3-D geometry. It is desirable in many applications because it requires no modifications to the scene and because dense information (that is, at each image pixel) can nowadays be achieved at video rate on standard processors for medium-resolution images (e.g., CIF, CCIR) [60]–[62]. For instance, systems in the late 1990s already reported a frame rate of 22 Hz for images of size 320 × 240 on a Pentium III at 500 MHz [63]. The availability of real-time disparity maps also enables segmentation by depth, which can be useful for layered scene representations [34], [64]–[66].

Large-baseline stereo, generating significantly different images, can be of paramount importance for some SVTE applications, as it is not always possible to position cameras close enough to achieve small baselines, or because doing so would imply using too many cameras given speed or bandwidth constraints. The VIRTUE system [22] is an example: four cameras can only be positioned around a large plasma screen, and using more than four cameras would increase delay and latency beyond acceptable levels for usability (but see recent systems using high numbers of cameras [37], [67], [68]).

There are two broad classes of correspondence algorithms, seeking to achieve, respectively, a sparse set of corresponding points (yielding a sparse disparity map) or a dense set (yielding a dense disparity map).

1) Sparse Disparities and Rectification: Determining a sparse set of correspondences among the images is a key problem for multiview analysis. It is usually performed as the first step in order to calibrate (fully or weakly) the system, when nothing about the geometry of the imaging system is known yet and no geometric constraint can be used to help the search. We can classify the algorithms presented in the literature so far into two categories: feature matching and template matching. Algorithms in the first category select feature points independently in the two images, then match them using tree searching, relaxation, maximal clique detection, or string matching [69]–[72]. A different algorithm is given in [73], which presents an interesting and easy-to-implement algebraic approach based on point positions and correlation measures. Algorithms in the second category select templates in one image (usually patches with some texture information) and then look for corresponding points in the other image using a similarity measure [39], [74], [75]. The algorithms in this class tend to be slower than those in the first class, as the search is less constrained, but it is possible to speed up the search in some particular cases [76].

The search for matches between two images is simplified and sped up if the two images are warped in such a way that corresponding points lie on the same scanline in both images. This process is called rectification [40]. The rectified images can often be regarded as acquired by cameras rotated with respect to the original ones, or as the images of these cameras projected onto a common plane. Most of the stereo algorithms in the literature assume rectified images.
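The sketch below illustrates this sparse-matching-plus-rectification pipeline with generic OpenCV building blocks (it is not the specific algorithms of [69]–[76]): feature points are detected and matched independently in the two images, the fundamental matrix is estimated robustly from the matches (a weak calibration), and rectifying homographies are derived so that corresponding points fall on the same scanline. The file names are placeholders.

```python
import cv2
import numpy as np

def rectify_uncalibrated(img_left, img_right, max_features=2000):
    """Sparse matching + weak calibration + rectification for one image pair."""
    orb = cv2.ORB_create(nfeatures=max_features)
    kps1, desc1 = orb.detectAndCompute(img_left, None)
    kps2, desc2 = orb.detectAndCompute(img_right, None)

    # Descriptor matching with cross-checking to reduce false associations.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(desc1, desc2), key=lambda m: m.distance)
    pts1 = np.float32([kps1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kps2[m.trainIdx].pt for m in matches])

    # Weak calibration: robustly estimate the fundamental matrix (RANSAC).
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    pts1, pts2 = pts1[inliers.ravel() == 1], pts2[inliers.ravel() == 1]

    # Rectifying homographies: corresponding points end up on the same scanline.
    h, w = img_left.shape[:2]
    ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))
    rect_left = cv2.warpPerspective(img_left, H1, (w, h))
    rect_right = cv2.warpPerspective(img_right, H2, (w, h))
    return F, rect_left, rect_right

# Usage (hypothetical file names):
# F, rl, rr = rectify_uncalibrated(cv2.imread("cam_a.png", 0), cv2.imread("cam_b.png", 0))
```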

2) Dense Disparities: Dense stereo matching is a well-studied topic in image analysis [39], [40]. An excellent review, including suggestions for comparative evaluation, is given in [78]; we refer the reader to that paper for an exhaustive list of algorithms. (A methodology for evaluating the performance of a stereo-matching algorithm for the particular task of immersive media has recently been proposed in [77].) Here we give a general discussion of dense matching and focus on large-baseline matching, given its importance for advanced visual communication systems.

The output of a dense matching algorithm is a disparity map. As already mentioned (see Section III-A2), the matching image points must satisfy geometric constraints imposed by algebraic structures such as the fundamental matrix for two views, plus other constraints (physical and photometric). These include: order (if two points in two images match, then matches of nearby points should maintain the same order); smoothness (the disparities should change smoothly around each pixel); and uniqueness (each pixel cannot match more than one pixel in any of the other images).

Points are usually matched using correlation-like correspondence methods [78]: given a window in a frame, standard methods in this class explore all possible candidate windows within a given search region in the next frame and pick the one optimizing an image similarity (or dissimilarity) metric. Typical metrics include the sum of squared differences (SSD), the sum of absolute differences (SAD), or correlation. Typically the windows are centered around the pixel for which we are computing the disparity. This choice can give poor performance in some cases (e.g., around edges). Results can be improved by adopting multiple-window matching, where different windows centered at different pixels are used [79], [80], at the cost of a higher computational time. Computation of disparity maps can be expensive, but the computation of the similarity measure can be sped up by using box-filtering techniques [78] and partial distances [81]. Good guidelines for an efficient implementation of stereo-matching algorithms on state-of-the-art hardware are given in [82] and [83].
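A minimal NumPy sketch of the correlation-like window search described above follows: for every pixel of the left rectified image it scans a horizontal disparity range and keeps the disparity minimizing the SAD over a square window, using cumulative sums as a simple box-filtering speed-up. It omits the refinements discussed in the text (multiple windows, partial distances, ordering or uniqueness checks), so it illustrates the principle rather than a real-time implementation; the window size and disparity range are assumed values.

```python
import numpy as np

def sad_block_matching(left, right, max_disp=32, half_win=3):
    """Dense disparity from left to right for rectified grayscale images
    (H x W float arrays). Brute-force SAD over a (2*half_win+1)^2 window."""
    h, w = left.shape
    best_cost = np.full((h, w), np.inf)
    disparity = np.zeros((h, w), dtype=np.int32)
    pad = ((half_win, half_win), (half_win, half_win))
    left_p = np.pad(left, pad, mode="edge")
    right_p = np.pad(right, pad, mode="edge")
    win = 2 * half_win + 1

    for d in range(max_disp + 1):
        # Candidate correspondence: right pixel at x - d, obtained by shifting
        # the right image d pixels to the right.
        shifted = np.roll(right_p, d, axis=1)
        abs_diff = np.abs(left_p - shifted)
        # Sum |differences| over each window via 2-D cumulative sums (box filter).
        c = abs_diff.cumsum(axis=0).cumsum(axis=1)
        c = np.pad(c, ((1, 0), (1, 0)))
        sad = (c[win:, win:] - c[:-win, win:] - c[win:, :-win] + c[:-win, :-win])
        better = sad < best_cost
        best_cost[better] = sad[better]
        disparity[better] = d

    disparity[:, :max_disp] = 0   # left border: wrapped-around candidates, ignore
    return disparity
```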


3) Large-Baseline Matching: This is the difficult problem of determining correspondences between significantly different images, typically because the cameras' relative displacement or rotation is large. This problem is very important for our scenario; see, for instance, the VIRTUE terminal (Fig. 13). As a consequence of the significant difference between the images, direct correlation-based matching fails at many more locations than in small-baseline stereo. From an algorithmic point of view, the images of a large-baseline stereo pair lead to significant disparities and may present considerable amounts of relative distortions and occlusions. Large camera translations and rotations induce large disparities in pixels, thus forcing search algorithms to cover large areas and increasing the computational effort. Large displacements between cameras may also introduce geometric and photometric distortions, which complicate image matching. As to occlusions, the farther away the viewpoints, the more likely occluded areas (i.e., areas visible to one camera but not to the other). The problem of occlusions can be partially solved, at the cost of extra computation, in multicamera systems, as long as every scene point is imaged by at least two cameras [84]–[86]. However, in practice, increasing the number of cameras may increase the risk of unacceptably high delay and latency.

Solutions to the problem of large-baseline matching include intrinsic curves, coarse-to-fine approaches, maximal regions [88], and other invariant regions [88], [89]. Intrinsic curves [90] are an image representation that transforms the stereo-matching problem into a nearest-neighbor problem in a different space. The interest of intrinsic curves here is that they are ideally invariant to disparity, so that they support matching, theoretically, irrespective of disparity values. In coarse-to-fine approaches [91]–[93], matching is performed at increasing image resolutions. The advantages are that an exhaustive search is performed only on the coarsest-resolution image, where the computational effort is minimal, and only a localized search takes place on high-resolution images. Approaches based on invariant features rely on properties that remain unchanged under (potentially strong) geometric and photometric changes between images. Very good results have been reported, but computational costs are usually high, and direct application to real-time telepresence systems is unfeasible.

Indeed, all of the techniques above are still too time-consuming if the target is a full-resolution disparity map for full-size video at frame rate. The methods in [60] and [94] are two approaches that address this point within immersive communications by exploiting the redundancy of information in video sequences. The former [60] unifies the advantages of block-recursive disparity estimation and pixel-recursive optical-flow estimation in one common scheme, leading to a fast matching algorithm. The latter [94] uses motion detection to reduce the quantity of pixels at which disparity is computed.

C. View Synthesis

View synthesis addresses the problem of generating convincing virtual images of a 3-D scene from real images acquired from different viewpoints, without reconstructing explicit 3-D models of the scene. In other words, the target is to generate the image that would be acquired by a specified virtual camera from an arbitrary point of view, directly from video or images. The range of applications for such techniques is very wide, including collaborative environments, computer games, and virtual and augmented reality, and it is of paramount importance for immersive communication systems (see Fig. 23 for some examples).

The generation of virtual images has been for years the territory of computer graphics. The classic approach generates synthetic images using a 3-D geometric model of the scene, a photometric model of the illumination, and a model of the camera projection. A computer-aided design (CAD) model is typically adopted for the scene, and it must be created manually or obtained from 3-D sensors (time of flight, triangulation, or passive stereo). Colors are obtained by mapping texture onto the model; the texture can again be artificial or obtained from real images.

In recent years, an alternative trend for the generation of synthetic views has emerged, which is based on real images


Fig. 23. Frames from a synthetic fly-over around a moving speaker, generated from the synchronized stereo sequence from which the pair in Fig. 20 was extracted.

only. This approach, called image-based rendering (IBR) [95], [96], aims to generate photorealistic synthetic images and videos solely from multiple real images, which capture all the necessary information about a scene under real illumination conditions. There is no need for complex 3-D geometric models or physics-based radiometric simulations (e.g., light sources or surface reflectance) to achieve realism, as the realism is in the images themselves [97]. However, some modeling is still necessary to guarantee the consistency of the synthetic images: it is impossible to generate views of a generic scene without any 3-D information. Most view-synthesis algorithms do not compute 3-D structure explicitly, but need dense disparity maps between the input images, information intimately related to the 3-D structure of the scene. As a consequence, the quality of the synthetic views depends crucially on the quality of the disparity maps. Where the accuracy of disparities degrades (typically and especially in occluded areas), artifacts are introduced in the novel view.

A class of techniques that, following [97], we call here CAD-like modeling represents a compromise between the classic computer graphics approach (using full geometric and radiometric models) and IBR methods (using only images). A notable example of CAD-like modeling is the system developed at Carnegie Mellon University by Kanade's group [67]. These methods concatenate computer vision and computer graphics modules: they first obtain a 3-D reconstruction of the scene from images and then use the recovered model to render novel views. The advantage is that it is possible to use existing, specialized rendering hardware, as IBR methods are not yet


well supported by hardware and software libraries [98] at the same level as polygon-based computer graphics techniques. However, real-time implementations of IBR techniques do exist [99], [100], and hardware supporting image warping is appearing [101].

A classical approach to rendering novel views is image interpolation, introduced in [102] and popularized by QuickTimeVR products [103]. Such methods can only produce images that are intermediate views between two original images (i.e., the virtual camera lies on the baseline between the two real cameras). This approach was adopted in the PANORAMA system [104]. Various researchers have adopted image interpolation for creating novel views; Seitz and Dyer [105] showed that straightforward image interpolation generates views that are not physically valid, i.e., they cannot be produced by any real camera, and derived a criterion for creating correct synthetic views.

Another way of generating synthetic views is to construct a lookup table (LUT) approximating the plenoptic function [106] (the plenoptic function is a representation of the flow of light at every 3-D position and for every 2-D viewing direction) from a series of sample images taken from different viewpoints. A view from an arbitrary viewpoint is then synthesized by interpolating the LUT. Representative approaches using this technique are [107] and [108], where, under the assumption of free space, a four-dimensional (4-D) plenoptic function has been adopted. The results produced are very realistic, but the method needs a very large number of densely sampled images (on the order of 100); therefore, it is not easily suitable for online applications such as the ones considered in this paper. It is worth pointing out that the method does not need to compute dense disparities between the images explicitly, as the effort is transferred to the image sampling procedure, exploiting the fact that the plenoptic function used is 4-D.

Pollard et al. [109] use three sample images (and cameras) and edge transfer to synthesize any view within the triangle defined by the three optical centers; only edge matching is necessary, instead of dense pixel correspondence. A wider range of views can be created using point transfer based on the principle of the fundamental matrix for re-projection [110], called epipolar transfer. The drawbacks are a nonnatural way of specifying the virtual viewpoint if no calibration is available (the method suggested is to choose the position of four control points in the target image) and the existence of degenerate configurations: the method fails to reproject points lying on the trifocal plane (i.e., the plane containing the three optical centers), and any points at all when the three optical centers are collinear. More generic and easier to implement is the method presented in [111], which exploits the algebraic relation existing within triplets of images, formalized by the trifocal tensor. The real advantage of this method is that it fails only in reprojecting points lying on the baseline between the two real cameras. The paper also suggests a simple way to specify the novel viewpoint that can be used for epipolar transfer as well.

To generate a synthetic view, in general, correspondence information between the original images is necessary, and it is not surprising that the quality of the synthetic images strongly depends on the quality of the disparity maps. Moreover, the presence of occluded areas (i.e., 3-D parts of the scene visible in only one image, for which no disparity values are available) can create disturbing artifacts.
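To illustrate the basic mechanics of disparity-driven view interpolation between two rectified views, and why missing disparities in occluded areas leave holes, here is a deliberately simple forward-warping sketch; real systems add the physical-validity criterion of [105], proper occlusion handling, and hole filling, none of which is attempted here.

```python
import numpy as np

def interpolate_view(left, disp_lr, alpha=0.5):
    """Synthesize a virtual view at fraction `alpha` of the baseline between two
    rectified views (alpha = 0 reproduces the left view, alpha = 1 approximates
    the right one) by forward-warping left pixels by alpha * disparity.
    Unfilled pixels (occlusions or missing disparities) stay at -1."""
    h, w = disp_lr.shape
    virtual = np.full(left.shape, -1, dtype=np.float32)
    # Process pixels far-to-near so nearer points (larger disparity) overwrite
    # farther ones, a crude resolution of visibility conflicts.
    order = np.argsort(disp_lr, axis=None)
    ys, xs = np.unravel_index(order, disp_lr.shape)
    for y, x in zip(ys, xs):
        xv = int(round(x - alpha * disp_lr[y, x]))
        if 0 <= xv < w:
            virtual[y, xv] = left[y, x]
    return virtual

# Toy usage: a one-row "image" with a near object (disparity 4) on a far background.
left = np.array([[10, 10, 50, 50, 10, 10, 10, 10]], dtype=np.float32)
disp = np.array([[1, 1, 4, 4, 1, 1, 1, 1]], dtype=np.float32)
print(interpolate_view(left, disp, alpha=0.5))
```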

It is worth mentioning at this point that IBR techniques are not conceptually different from 3-D reconstruction plus reprojection [97]: they are indeed a shortcut, as disparity maps supply, in principle, the same information as a 3-D structure (considering only scene points for which disparity can be computed).

Point reprojection alone, although quite effective, is not enough to cover all of the situations occurring in immersive communications, as images are rendered under the same lighting conditions as the original scene. Recent work has sought to also incorporate illumination changes in the IBR process. The limit of standard IBR algorithms is that they assume a static scene with fixed illumination; when the viewpoint is moved, the illumination moves rigidly with it. However, in several applications, e.g., navigation of environments or augmented reality, illumination variation should be considered. If a linear reflectance model (e.g., Lambertian) applies and light sources are assumed at infinity, then every light direction can be synthesized by a linear combination of three static images under different light directions [112]. More recently, it has been shown [113] that an image of a Lambertian surface obtained with an arbitrary distant light source can be approximated by a linear combination of a basis of nine images of the same scene under different light conditions.

Unfortunately, these assumptions are not realistic in most cases, so that different techniques are needed. The techniques reported can be divided into two categories: estimation of the bidirectional reflectance distribution function (BRDF), and photometric IBR. Methods in the first class generally require knowledge of the 3-D structure of the scene and information about lighting conditions. They recover the complete BRDF from several static images acquired under different lighting conditions [114], [115] or even from a single image [116]. Typically these methods are very slow and may require up to several hours of computation. Photometric IBR, instead, does not recover any reflectance model, but uses a set of basis images under different lighting conditions as a model of the BRDF. Images under synthetic lights are obtained by interpolating the basis images [117]–[119]. However, these methods are still unwieldy for real-time applications, as computing reflectance models can be too computationally expensive and difficult with dynamic scenes. All of the methods mentioned so far apply to rigid scenes and fail if this condition is not satisfied. This case has been addressed by recent work modeling nonrigid surfaces as stochastically rigid [120].
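The linear-combination property cited from [112] can be verified in a few lines: under a Lambertian model with distant light sources (and ignoring attached shadows), the image under any new light direction is a linear combination of three basis images taken under linearly independent light directions. The light directions and the synthetic scene below are illustrative assumptions.

```python
import numpy as np

def relight(basis_images, basis_lights, new_light):
    """basis_images: three images (3, H, W) of a Lambertian scene lit, in turn,
    by the three distant light directions in basis_lights (3, 3). Returns the
    image under new_light as the corresponding linear combination."""
    coeffs = np.linalg.solve(np.asarray(basis_lights).T, np.asarray(new_light))
    return np.tensordot(coeffs, basis_images, axes=1)

# Synthetic check: random albedo and normals, shading I = albedo * (N . L),
# with no clamping (i.e., attached shadows ignored).
rng = np.random.default_rng(0)
h, w = 4, 5
albedo = rng.uniform(0.2, 1.0, (h, w))
normals = rng.normal(size=(h, w, 3))
normals /= np.linalg.norm(normals, axis=2, keepdims=True)

def shade(light):
    return albedo * (normals @ np.asarray(light))

basis_lights = np.eye(3)                       # three independent light directions
basis_images = np.stack([shade(l) for l in basis_lights])
new_light = np.array([0.3, 0.5, 0.8])
print(np.allclose(relight(basis_images, basis_lights, new_light), shade(new_light)))
# -> True for this shadow-free Lambertian model
```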


D. Tracking

Video tracking is the problem of following moving targets through an image sequence. We mention it last as it is a ubiquitous module, and its importance for the scenarios introduced in the first part of this paper cannot be overstated. For instance, videoconferencing systems incorporating 3-D effects need head tracking to estimate the 3-D head position of the viewer in order to generate images of the participants. In VIRTUE [22], head position estimates are achieved by passive video tracking, but a variety of head-tracking technologies exist, either passive [121], [122] or active [123]. Here, passive tracking means that the target (in this case, the head) is tracked using optical sensors placed in the scene, following a set of landmarks on the moving user (which can be natural landmarks such as the nose or eyes), whereas active tracking means that the optical sensors are on the moving user, and the landmarks are fixed targets in the scene.

In augmented- and mixed-reality applications, inserting CAD elements into real video consistently with the current image is a key requirement. "Consistently" means that the size, position, and shading of synthetic objects must match those of the surrounding image or real objects. This requires tracking the motion of real objects and the egomotion of the camera (or equivalent information), usually achieved by tracking subsets of image points.

In terms of performance characteristics, the following is sought of a video tracker: robustness to clutter (the tracker should not be distracted by image elements resembling the target being tracked), robustness to occlusion (tracking should not be lost because of temporary target occlusion, but resumed correctly when the target reappears), few or no false positives and negatives, agility (the tracker should follow targets moving with significant speed and acceleration), and stability (the lock and accuracy should be maintained indefinitely over time). We now review briefly the existing classes of tracking systems, from the image-processing point of view, in increasing order of target complexity, culminating with methods learning the shape and dynamics of nonrigid targets.

1) Window Tracking: This is usually performed adopting the same correlation-like measures as for stereo matching, so the two problems are practically the same. However, in the context of tracking, some patterns of grey levels are better than others, as they guarantee better numerical properties [124].

2) Feature Tracking: The next more complex targets are image elements with specific properties, called image features, which we define as detectable parts of an image that can be used to support a vision task. Notice that this definition is entirely functional to our discussion and does not capture the many usages of the word "feature." Feature tracking takes place by first locating features in two subsequent frames and then matching each feature in one frame with one feature in the other (if such a match exists). Notice that the two processes are not necessarily sequential.

Tracking Local Features: Local features cover limited areas of an image (e.g., edges, lines, and corner points). Their key advantage over image windows is that local features are invariant, within reasonable limits, to image changes caused by scene changes and can be expected to remain detectable over more frames. Moreover, extracting features reduces substantially the volume of data passed on for further processing. The disadvantages of local features are a limited suggestiveness (i.e., local features may not correspond to meaningful parts of 3-D objects), the fact that they do not appear only on the target object (e.g., edges and corners are generally detected on both target and background, increasing the risk of false associations), and sensitivity to clutter. Typical image elements used as local features are intensity edges [125], [126], lines [127], and the ever-popular corners [128], [129]. A good review of local feature tracking is given in [130].
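As an illustration of corner-based local-feature tracking in the spirit of [128]–[130], the sketch below uses OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade optical flow to follow a sparse set of points from frame to frame, re-detecting when too many tracks are lost. The parameters and the input file are assumptions, and this is not the specific head tracker used in VIRTUE.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("conference.avi")          # hypothetical input sequence
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect good-to-track corners (Shi-Tomasi) in the first frame.
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

while True:
    ok, frame = cap.read()
    if not ok or points is None or len(points) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: match each corner window into the new frame.
    new_points, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                       points, None, **lk_params)
    points = new_points[status.ravel() == 1].reshape(-1, 1, 2)
    prev_gray = gray
    # Re-detect when too many tracks are lost (e.g., occlusion or clutter).
    if len(points) < 50:
        points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                         qualityLevel=0.01, minDistance=7)

cap.release()
```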


Fig. 24. Example of eye tracking in a videoconferencing sequence.

Tracking Extended Features: Extended features cover a larger part of the image than local features; they can be contours of basic shapes (e.g., ellipses or rectangles), free-form contours, or image regions. Typical examples of the last two classes are the head and the eyes. The main advantage of extended features is their greater robustness to clutter [131], as they rely on a larger image support; in other words, one expects a higher risk of false positives and negatives with local features than with extended ones. Another advantage is that extended features are related more directly to significant 3-D entities (e.g., circles in space always appear as ellipses in images). The price is more complex matching and motion algorithms.

Fig. 24. Example of eye tracking in a videoconferencing sequence.

Extended features, unlike local ones, can change substantially over time: consider, for instance, the image of a person walking. Here a contour tracker must incorporate not only a motion model, but also a shape-deformation model constraining the possible deformations. Considering that a discrete contour can be formed by several tens of pixels, the search space can grow large and unwieldy. Extended image features can change because the corresponding 3-D entities are moving rigidly, as for a rotating circle, or changing shape, as for a walking human; deformable objects combine both effects. Clearly, it is easier to predict the appearance of moving, rigid 3-D shapes [132]–[136] than that of moving and deforming 3-D objects [137]–[140]. Devising sufficiently general models for the latter is very difficult; for this reason, several authors have turned to visual learning techniques [141], [142] or use various templates [143]. In immersive communications, tracking targets are frequently parts of the human body, for instance, the head [144] or eyes (see Fig. 24), hands [145], and legs [146].
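To make the role of the motion and deformation models concrete, here is a small CONDENSATION-style particle filter in the spirit of [131] (a schematic sketch, not the algorithm of any cited system): the state of an elliptical contour (position, velocity, and an isotropic scale acting as a one-parameter deformation model) is propagated by a constant-velocity model with process noise, and each hypothesis is weighted by the image gradient magnitude sampled along the predicted contour. The synthetic test frame, function names, and noise levels are placeholders.

```python
import numpy as np
import cv2

def contour_likelihood(grad_mag, x, y, a, b, n_samples=36):
    """Average gradient magnitude along an ellipse with semi-axes (a, b);
    strong edges under the hypothesized contour mean a good fit."""
    t = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    xs = np.clip((x + a * np.cos(t)).astype(int), 0, grad_mag.shape[1] - 1)
    ys = np.clip((y + b * np.sin(t)).astype(int), 0, grad_mag.shape[0] - 1)
    return grad_mag[ys, xs].mean() + 1e-6

def condensation_step(particles, weights, grad_mag, axes=(20.0, 28.0)):
    """One resample-predict-update cycle; state = (x, y, vx, vy, scale)."""
    n = len(particles)
    # Resample in proportion to the previous weights.
    idx = np.random.choice(n, size=n, p=weights / weights.sum())
    p = particles[idx].copy()
    # Constant-velocity motion model plus process noise (the dynamics).
    p[:, 0] += p[:, 2] + np.random.randn(n) * 2.0
    p[:, 1] += p[:, 3] + np.random.randn(n) * 2.0
    p[:, 2:4] += np.random.randn(n, 2) * 0.5
    # One-parameter deformation model: isotropic scaling of the contour.
    p[:, 4] = np.clip(p[:, 4] + np.random.randn(n) * 0.02, 0.5, 2.0)
    # Measurement update: weight each hypothesis by the contour likelihood.
    w = np.array([contour_likelihood(grad_mag, s[0], s[1],
                                     axes[0] * s[4], axes[1] * s[4]) for s in p])
    return p, w

# Usage on a synthetic frame containing one dark ellipse on a bright background.
frame = np.full((240, 320), 220, dtype=np.uint8)
cv2.ellipse(frame, (160, 120), (20, 28), 0, 0, 360, 40, -1)
gx = cv2.Sobel(frame, cv2.CV_32F, 1, 0)
gy = cv2.Sobel(frame, cv2.CV_32F, 0, 1)
grad_mag = cv2.magnitude(gx, gy)

particles = np.zeros((300, 5))
particles[:, 0] = np.random.uniform(0, 320, 300)   # x
particles[:, 1] = np.random.uniform(0, 240, 300)   # y
particles[:, 4] = 1.0                               # scale
weights = np.ones(300)
for _ in range(20):
    particles, weights = condensation_step(particles, weights, grad_mag)
estimate = np.average(particles[:, :2], axis=0, weights=weights)
print("estimated contour centre:", estimate.round(1))
```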

IV. CONCLUSION

The main objective of this paper was to offer an overview of immersive communication applications, especially videoconferencing and immersive TV (ITV), together with the 3-D video processing techniques that they require. We hope that our classification and discussion of topics, from presence and immersion to shared virtual table environments, can help the reader to identify and understand the main recent and future streams of research in this area. The examples mentioned in the first part have illustrated different approaches taken in the development of recent, advanced prototypes. The main message here is that the combination of immersive telepresence with shared virtual environments, as emerging in very recent videoconferencing and ITV prototypes, has created a powerful new paradigm for immersive communications, which may well become the basis of killer applications for the associated markets.

The main message of the second part of this paper is that truly immersive systems require real-time, highly realistic video with a continuously adaptive viewpoint, and that this can nowadays be achieved by IBR techniques (as opposed to traditional, polygon-based graphics). The key image-processing techniques we identified as necessary for future 3-D immersive systems are calibration, multiview analysis, view synthesis, and tracking. For each technique, we identified the key problem and discussed some representative solutions from the sometimes very large number of existing approaches. The last three are likely to see a substantial push in applied research to achieve algorithms meeting the demanding needs of immersive communications.

REFERENCES

[1] W. A. Ijsselsteijn, M. Lombard, and J. Freeman, “Toward a core bibliography of presence,” Cyber Psychol. Behavior, vol. 4, no. 2, 2001. [2] W. A. Ijsselsteijn, J. Freeman, and H. D. Ridder, “Presence: Where are we?,” Cyber Psychol. Behavior, vol. 4, no. 2, 2001. [3] L. Schäfer and S. Küppers, “Camera agents in a theatre of work,” in Proc. 2002 Int. Conf. Intelligent User Interfaces, San Francisco, CA, Jan. 13–16, 2002, pp. 218–219. [4] R. Buß, L. Mühlbach, and D. Runde, “Advantages and disadvantages of virtual environments for supporting informal communication in distributed workgroups,” in Proc. Int. Conf. Human Computer Interaction, Munich, Germany, 1999. [5] [Online]. Available: http://www.cs.ucl.ac.uk/research/vr/Projects/Cave/ [6] COVEN: COllaborative Virtual Environments. ACTS. [Online]. Available: http://www.crg.cs.nott.ac.uk/research/projects/coven [7] V. Normand, C. Babski, S. Benford, A. Bullock, S. Carion, Y. Chrysanthou, N. Farcet, E. Frécon, J. Harvey, N. Kuijpers, N. Magnenat-Thalmann, S. Raupp-Musse, T. Rodden, M. Slater, G. Smith, A. Steed, D. Thalmann, J. Tromp, M. Usoh, G. Van Liempd, and N. Kladias, “The COVEN project: Exploring applicative, technical and usage dimensions of collaborative virtual environments,” PRESENCE: Teleoperators and Virtual Environment, vol. 8, no. 2, 1997. [8] O. Ståhl, “Meetings for real – Experiences from a series of VR-based project meetings,” in Proc. Symp.
Virtual Reality Software and Technology, London, U.K., Dec. 20–22, 1999, pp. 164–165.

[9] D. Sandin et al., “A realistic video avatar system for networked virtual environments,” in Proc. Immersive Projection Technology Symp., Orlando, FL, Mar. 24–25, 2002. [10] S. Rauthenberg, P. Kauff, and A. Graffunder, “The virtual meeting room,” in Proc. 3rd Int. Workshop Presence, Delft, The Netherlands, Mar. 27–28, 2000. [11] Home Page of ACTS Project MoMuSys. [Online]. Available: http://www.cordis.lu/infowin/acts/rus/projects/ac098.htm [12] Home Page of IST Project SonG. [Online]. Available: http://www.octaga.com/SoNG-Web/ [13] “Access Grid,” Home Page. AG Alliance. [Online]. Available: http://www-fp.mcs.anl.gov/fl/accessgrid/ag-spaces.htm [14] Virtual Room Video-Conferencing System Home Page. VRVS. [Online]. Available: http://www.vrvs.org/About/index.html [15] M. Chen, “Design of a virtual auditorium,” in Proc. ACM Multimedia, Ottawa, ON, Canada, Sept. 30–Oct. 5, 2001, pp. 19–28. [16] (2002) Fuqua School of Business: “Global Conference Systeme” Press Release. Duke Univ. [Online]. Available: http:// www.fuqua.duke.edu/admin/extaff/news/global_conf_2002.htm [17] The Plasma-Lift A/V Conference Table. D+S Sound Labs Inc. [Online]. Available: http://www.dssoundlabs.com/avtable.htm [18] [Online]. Available: www.teleportec.com [19] W. C. Chen et al., “Toward a compelling sensation of telepresence: Demonstrating a portal to a distant (static) office,” in Proc. IEEE Conf. Visualization, Salt Lake City, UT, Oct. 8–13, 2000, pp. 377–383. [20] T. Aoki et al., “MONJUnoCHIE system : Videoconference system with eye contact for decision making,” in Proc. Int. Workshop Advanced Image Technology, 1999. [21] H. Towles, W.-C. Chen, R. Yang, S.-U. Kum, H. Fuchs, N. Kelshikar, J. Mulligan, K. Daniilidis, L. Holden, B. Zeleznik, A. Sadagic, and J. Lanier, “3D tele-immersion over Internet2,” in Proc. Int. Workshop Immersive Telepresence, Juan Les Pins, France, Dec. 6, 2002, pp. 185–194. [22] “VIRTUE Home” European Union’s Information Societies Technology Programme, Project IST-1999–10 044. British Telecom. [Online]. Available: http://www.virtue.eu.com [23] O. Schreer and P. Sheppard, “VIRTUE – The step toward immersive telepresence in virtual video conference systems,” in Proc. eWorks, Madrid, Spain, Sept. 2000. [24] Takeo Kanade’s Superbowl 2001EyeVision Project. [Online]. Available: http://www.ri.cmu.edu/events/sb35/tksuperbowl.html [25] Y. Ruigang, D. Gotz, J. Hensley, H. Towles, and M. Brown, “PixelFlex: A reconfigurable multi-projector display system,” in Proc. IEEE Conf. Visualization, San Diego, CA, Oct. 21–26, 2001, pp. 167–174. [26] G. Welch, H. Fuchs, R. Raskar, M. Brown, and H. Towles, “Projected imagery in your office in the future,” IEEE Comput. Graphics Applicat., vol. 20, pp. 62–67, July/Aug. 2000. [27] N. Lodge, “Being part of the fun—Immersive television,” in Proc. Conf. Broadcast Engineering Society of India, New Dehli, India, Feb. 1999. [28] N. Lodge and D. Harrison, “Being part of the action – Immersive television!,” in Proc. Int. Broadcasting Convention, Amsterdam, The Netherlands, Sept. 1999. [29] C. Fehn, P. Kauff, O. Schreer, and R. Schäfer, “Interactive virtual view video for immersive TV applications,” in Proc. Int. Broadcasting Convention, Amsterdam, The Netherlands, Sept. 2000. [30] C. Fehn, E. Cooke, O. Schreer, and P. Kauff, “3D analysis and imagebased rendering for immersive TV applications,” Signal Processing: Image Commun. J., vol. 17, no. 9, pp. 705–715, Oct. 2002. [31] T. Pintaric, U. Neumann, and A. Rizzo, “Immersive panoramic video,” in Proc. 8th ACM Int. Conf. 
Multimedia, Oct. 2000, pp. 493–494. [32] M. O. de Beeck, P. Wilinski, C. Fehn, and P. Kauff, “Toward an optimized 3D broadcast chain,” in Proc. 3D-TV, Video & Display, SPIE Int. Symp., Boston, MA, Aug. 2002. [33] C. Fehn, P. Kauff, M. O. de Beeck, F. Ernst, W. Ijsselsteijn, M. Pollefeys, L. Vangool, E. Ofek, and I. Sexton, “An evolutionary and optimized approach on 3D-TV,” in Proc. Int. Broadcast Convention, Amsterdam, The Netherlands, Sept. 2002. [34] C. Fehn and P. Kauff, “Interactive virtual view video (IVVV) – The bridge between 3D-TV and immersive TV,” in Proc. 3D-TV, Video & Display, SPIE Int. Symp., Boston, MA, Aug. 2002. [35] T. Fujii and M. Tanimoto, “Free-viewpoint TV system based on rayspace representation,” in Proc. 3D-TV, Video & Display, SPIE Int. Symp., Boston, MA, Aug. 2002. [36] T. Kanade, P. Rander, S. Vedula, and H. Saito, “Virtualized reality: Digitizing a 3D time-varying event as is and in real time,” in Mixed Reality, Merging Real and Virtual Worlds, Y. Ohta and H. Tamura, Eds. Berlin, Germany: Springer-Verlag, 1999, pp. 41–57.

[37] T. Kanade, P. Rander, and P. J. Narayanan, “Virtualized reality: Constructing virtual worlds from real scenes,” IEEE Multimedia, Immersive Telepresence, vol. 4, pp. 34–47, Jan. 1997. [38] Free Viewpoint Video Representation System. NHK. [Online]. Available: http://www.nhk.or.jp/strl/open2002/en/tenji/id08/08index.html [39] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint. Cambridge, MA: MIT Press, 1993. [40] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Upper Saddle River, NJ: Prentice-Hall, 1998. [41] R. Y. Tsai, “A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf tv cameras and lenses,” IEEE J. Robot. Automat., vol. RA-3, pp. 323–344, Aug. 1987. [42] O. Faugeras and G. Toscani, “Camera calibration for 3D computer vision,” in Proc. Int. Workshop Machine Vision and Machine Intelligence, Feb. 1987, pp. 240–247. [43] J. Heikkillä and O. Silvén, “A four-step camera calibration procedure with implicit image correction,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 1106–1112. [44] O. Faugeras and G. Toscani, “The calibration problem for stereo,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1986, pp. 15–20. [45] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown orientations,” in Proc. Int. Conf. Computer Vision, 1999, pp. 666–673. , “Camera calibration with one-dimensional objects,” in Proc. Eur. [46] Conf. Computer Vision, vol. IV, 2002, p. 161. [47] B. Kamgar-Parsi and R. D. Eastman, “Calibration of a stereo system with small relative angles,” Comput. Vis. Graph. Image Processing, vol. 51, no. 1, July 1990. [48] F. Pedersini, A. Sarti, and S. Tubaro, “Accurate and simple geometric calibration of multi-camera systems,” Signal Processing, vol. 77, pp. 309–334, Sept. 1999. [49] H. G. Maas, “Image sequence based automatic multi-camera system calibration techniques,” Int. Archives Photogrammetry and Remote Sensing, vol. 32, 1998. [50] P. Baker and Y. Aloimonos, “Complete calibration of a multi-camera network,” in Proc. IEEE Workshop Omnidirectional Vision, Hilton Head Island, SC, June 16, 2000, pp. 134–141. [51] Z. Zhang, “Determining the epipolar geometry and its uncertainty: A review,” Int. J. Comput. Vis., vol. 27, no. 2, pp. 161–195, Mar. 1998. [52] R. I. Hartley and A. Zisserman, Multiple View Geometry. Cambridge, U.K.: Cambrige Univ. Press, 2000. [53] T. Moons, “A guided tour through multiview relations,” in Proc. SMILE Workshop, vol. 825, Lecture Notes in Computer Science, 1998, pp. 297–316. [54] S. J. Maybank and O. Faugeras, “A theory of self-calibration of a moving camera,” Int. J. Comput. Vis., vol. 8, no. 2, pp. 123–151, 1992. [55] Q. T. Luong and T. Vieville, “Canonical representations for the geometries of multiple projective views,” Comput. Vis. Image Understand., vol. 64, no. 2, pp. 193–229, 1996. [56] A. Azarbayejani and A. P. Pentland, “Recursive estimation of motion, structure and focal length,” IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 562–575, June 1995. [57] M. Pollefeys, R. Koch, and L. Van Gool, “Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters,” in Proc. Int. Conf. Computer Vision, 1998, pp. 90–95. [58] A. Fusiello, “Uncalibrated Euclidean reconstruction: A review,” Image Vis. Comput., vol. 18, no. 6-7, pp. 555–563, 2000. [59] [Online]. Available: http://www.2d3.com/ [60] O. Schreer, N. Brandenburg, S. Askar, and P. 
Kauff, “Hybrid recursive matching and segmentation-based postprocessing in real-time immersive video conferencing,” in Proc. Conf. Vision, Modeling and Visualization, Stuttgart, Germany, Nov. 21–23, 2001. [61] R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in Proc. Eur. Conf. Computer Visions, vol. 2, Stockholm, Sweden, May 2–6, 1994, pp. 151–158. [62] K. Muhlmann, D. Maier, J. Hesser, and R. Manner, “Calculating dense disparity maps from color stereo images, an efficient implementation,” Int. J. Comput. Vis., vol. 47, no. 1/2/3, pp. pf 79–88, 2002. [63] K. Konolige. The SRI Small Vision System. [Online]. Available: http://www.ai.sri.com~konolige/svs [64] J. W. Shade, “Layered depth images,” in Proc. SIGGRAPH, Orlando, FL, 1998, pp. 231–242.

[65] J. Snyder and J. Lengyel, “Visibility sorting and compositing without splitting dor image layer decomposition,” in Proc. SIGGRAPH, Orlando, FL, 1998, pp. 219–230. [66] E. Trucco, F. Isgrò, and F. Bracchi, “Plane detection in disparity space,” in Proc. IEE Int. Conf. Visual Information Engineering, 2003, pp. 73–76. [67] [Online]. Available: http://www.ri.cmu.edu/labs/lab_62.html [68] H. Fuchs, G. Bishop, K. Arthur, L. McMillan, R. Bajcsy, S. W. Lee, H. Farid, and T. Kanade, “Virtual space teleconferencing using a sea of cameras,” in Proc. 1st Int. Symp. Medical Robotics and Computer Assisted Surgery, Pittsburgh, PA, 1994, pp. 161–167. [69] J. K. Cheng and T. S. Huang, “Image registration by matching relational structures,” Pattern Recognit., vol. 17, no. 1, pp. 149–159, 1984. [70] R. Horaud and T. Skordas, “Stereo correspondence through feature grouping and maximal clique,” IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 1168–1180, Nov. 1989. [71] S. Ullman, The Interpretation of Visual Motion. Cambridge, MA: MIT Press, 1989. [72] D. Tell and S. Carlsson, “Combining appearance and topology for wide baseline matching,” in Proc. Eur. Conf. Computer Vision, vol. I, 2002, pp. 68–81. [73] M. Pilu, “A direct method for stereo correspondence based on singular value decomposition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 261–266. [74] A. Goshtasby, S. H. Gage, and J. F. Bartholic, “A two stage cross correlation approach to template matching,” IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, pp. 374–378, Mar. 1984. [75] C. H. Chou and Y. C. Chen, “Moment-preserving pattern matching,” Pattern Recognit., vol. 23, no. 5, pp. 461–474, 1990. [76] M. Pilu and F. Isgrò, “A fast and reliable planar registration method with applications to document stitching,” in Proc. British Machine Vision Conf., Cardiff, U.K., Sept. 2–5, 2002, pp. 688–697. [77] J. Mulligan, V. Isler, and K. Daniilidis, “Performance evaluation of stereo for tele-presence,” in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 2001, pp. 558–565. [78] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. Comput. Vis., vol. 47, no. 1-3, pp. 7–42, Apr. 2002. [79] A. Fusiello, E. Trucco, and A. Verri, “Efficient stereo with multiple windowing,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 1997, pp. 858–863. [80] T. Kanade and M. Okutomi, “A stereo matching algorithm with an adaptive window: Theory and experiments,” IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 920–932, Sept. 1994. [81] K. Lengwehasarit and A. Ortega, “Probabilistic partial-distance fast matching algorithms for motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 139–152, Feb. 2001. [82] K. Muhlmann, D. Maier, J. Hesser, and R. Manner, “Calculating dense disparity maps from color stereo images, an efficient implementation,” Int. J. Comput. Vis., vol. 47, no. 1/2/3, pp. pf 79–88, 2002. [83] M. Perez and F. Cabestaing, “A comparison of hardware resources required by real-time stereo dense algorithms,” in Proc. IEEE Int. Workshop Computer Architecture for Machine Perception, New Orleans, LA, May 12–14, 2003. [84] J. Mulligan, V. Isler, and K. Daniilidis, “Trinocular stereo: A real-time algorithm and its evaluation,” Int. J. Computer Vision, vol. 47, no. 1/2/3, pp. pf 51–61, 2002. [85] M. Okutomi and T. Kanade, “A multiple-baseline stereo,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 353–363, Apr. 1993. 
[86] S. B. Kang, R. Szeliski, and J. Chai, “Handling occlusions in dense multi-view stereo,” in Proc. Int. Conf. Computer Vision and Pattern Recognition, vol. 1, Kuaui, HI, Dec. 8–14, 2001, pp. 103–110. [87] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” in Proc. British Machine Vision Conf., 2002, pp. 384–393. [88] C. Schmid and R. Mohr, “Local grayvalue invariants for image retrieval,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 530–535, May 1997. [89] A. Baumberg, “Reliable feature matching across widely separated views,” in Proc. IEEE Int. Conf. Comp. Vision and Pattern Recognition, vol. I, 2000, pp. 774–781. [90] C. Tomasi and R. Manduchi, “Stereo matching as nearest-neighbor problem,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 333–340, Mar. 1998.

[91] S. Crossley, N. A. Thacker, and N. L. Seed, “Benchmarking of bootstrap temporal stereo using statistical and physical scene modeling,” in Proc. British Machine Vision Conf., 1998, pp. 346–355. [92] L. Matthies and M. Okutomi, “Bootstrap algorithms for dynamic stereo vision,” in Proc. 6th Mutidimensional Signal Processing Workshop, 1989, pp. 12–22. [93] M. O’Neil and M. Demos, “Automated system for coarse to fine pyramidal area correlation stereo matching,” Image Vis. Comput., vol. 14, pp. 225–136, 1996. [94] F. Isgrò, E. Trucco, and L. Q. Xu, “Toward teleconferencing by view synthesis and large-baseline stereo,” in Proc. IAPR Int. Conf. Image Analysis and Processing, Sept. 2001, pp. 198–203. [95] S. B. Kang, “A Survey of Image-Based Rendering Techniques,” Cambridge Research Lab., Digital Equipment Corp., Tech. Rep. 97/4, 1997. [96] L. McMillan and G. Bishop, “Plenoptic modeling: An image-based rendering system,” in Proc. SIGGRAPH95, 1995, pp. 39–46. [97] Z. Zhang, “Image-based geometrically-correct photorealistic scene/object modeling: A review,” in Proc. Asian Conf. Computer Vision, 1998, pp. 231–236. [98] T. Whitted, “Overview of IBR: Software and hardware issues,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, 2000, pp. 1–4. [99] V. Popescu, A. Lastra, D. Alliaga, and M. de Oliveira Neto, “Efficient warping for architectural walkthroughs using layered depth images,” in Proc. IEEE Visalization, 1998, pp. 211–215. [100] V. Popescu, “Forward rasterization: A reconstruction algorithm for image-based rendering,” Ph.D. dissertation, Univ. of North Carolina, Chapel Hill, 2001. [101] J. Torborg and J. Kajiya, “Talisman: Commodity real-time 3D graphics for the PC,” in Proc. SIGGRAPH, 1998, pp. 353–363. [102] S. E. Chen and L. Williams, “View interpolation for image synthesis,” in Proc. SIGGRAPH 93, 1993, pp. 279–288. [103] S. E. Chen, “Quicktime VR – An image-based approach to virtual environment navigation,” in Proc. SIGGRAPH 95, Los Angeles, CA, Aug. 6–11, 1995, pp. 29–38. [104] [Online]. Available: http://www.tnt.uni-hannover.de/project/eu/ panorama/overview.html [105] S. M. Seitz and C. R. Dyer, “Physically-valid view synthesis by image interpolation,” in Proc. IEEE Workshop Representations of Visual Scenes, 1995, pp. 18–25. [106] L. McMillan and G. Bishop, “Plenoptic modeling: An image-based rendering system,” in Proc. SIGGRAPH95, 1995, pp. 39–46. [107] S. J. Gortler, R. Grzesczuk, R. Szeliski, and M. F. Cohen, “The lumigraph,” in Proc. SIGGRAPH’96, 1996, pp. 43–54. [108] M. Levoy and P. Hanrahan, “Light field rendering,” in Proc. SIGGRAPH’96, 1996, pp. 31–42. [109] S. Pollard, M. Pilu, S. Hayes, and A. Lorusso, “View synthesis by trinocular edge matching and transfer,” in Proc. British Machine Vision Conf., 1998, pp. 770–779. [110] S. Laveau and O. Faugeras, “3-D scene representation as a collection of images,” in Proc. IAPR Int. Conf. Pattern Recognition, 1994, pp. 689–691. [111] S. Avidan and A. Shashua, “Novel view synthesis in tensor space,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 1034–1040. [112] A. Shashua, “Illumination and view position in 3D visual recognition,” in Advances in Neural Information Processing Systems, S. E. Moody, S. J. Hanson, and R. P. Lippman, Eds. San Mateo, CA: Morgan Kaufmann, 1992, pp. 404–411. [113] R. Basri and D. W. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE Trans. Pattern Anal. Machine Intell., vol. 25, pp. 218–233, Feb. 2003. [114] K. J. Dana, B. van Ginneken, S. K. Nayar, and J. J. 
Koenderink, “Reflectance and texture of real-world surfaces,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 151–157. [115] Y. Yu, P. Debevec, J. Malik, and T. Hawkins, “Inverse global illumination: Recovering reflectance models of real scenes from photographs,” in Proc. SIGGRAPH, Los Angeles, CA, Aug. 8–13, 1999, pp. 215–224. [116] S. Boivin and A. Gagalowicz, “Image-based rendering of diffuse, specular and glossy surfaces from a single image,” in Proc. SIGGRAPH, Los Angeles, CA, Aug. 12–17, 2001, pp. 107–116. [117] Z. Zhang, “Modeling geometric structure and illumination variation of a scene from real images,” in Proc. Int. Conf. Computer Vision, 1998, pp. 1041–1046. [118] Y. Mukaigawa, S. Mihashi, and T. Shakunaga, “Photometric image-based rendering for virtual lighting image synthesis,” in Proc. IEEE and ACM Int. Workshop Augmented Reality, San Francisco, CA, Oct. 20–21, 1999, pp. 115–124.

[119] Y. Mukaigawa, H. Miyaki, S. Mihashi, and T. Shakunaga, “Photometric image-based rendering for image generation in arbitratry illumination,” in Proc. Int. Conf. Computer Vision, Vancouver, BC, Canada, July 9–12, 2001, pp. 652–659. [120] A. Fitzgibbon, “Stochastic rigidity: Image registration for nowhere-static scenes,” in Proc. Int. Conf. Computer Vision 2001, vol. 1, 2001, pp. 662–670. [121] M. La Cascia, S. Sclaroff, and V. Athitsos, “Fast, reliable head tracking under varying illumination: An approach based on robust registration of texture-mapped 3D models,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 322–336, Apr. 2000. [122] S. Birchfield, “Elliptical head tracking using intensity gradients and color histograms,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, June 1998, pp. 232–237. [123] G. Welch, G. Bishop, L. Vicci, S. Brumback, K. Keller, and D. Colucci, “The HiBall tracker: High-performance wide-area tracking for virtual and augmented environments,” in Proc. ACM Symp. Virtual Reality Software and Technology, 1999, pp. 20–22. [124] J. Shi and C. Tomasi, “Good features to track,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 194, pp. 593–600. [125] L. Bretzner and T. Lindeberg, “Feature tracking with automatic selection of spatial scale,” Comput. Vis. Image Understand., vol. 71, no. 3, pp. 385–391, 1998. [126] H. Gu, M. Asada, and Y. Shirai, “The optimal partition of moving edge segments,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1993, pp. 367–372. [127] R. Deriche and O. Faugeras, “Tracking line segments,” in Proc. Eur. Conf. Computer Vision, 1990, pp. 259–268. [128] H. Wang and M. Brady, “Real-time corner detection algorithm for motion estimation,” Image Vis. Comput., vol. 13, no. 9, pp. 695–705, 1995. [129] J. Shi and C. Tomasi, “Good features to track,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1994, pp. 593–600. [130] S. Smith. (1999) Literature review on feature-based tracking approaches. [Online]. Available: http://www.dai.ed.ac.uk/CVonline/motion.htm [131] M. Isard and A. Blake, “Condensation – Conditional density propagation for visual tracking,” Int. J. Comput. Vis., vol. 29, no. 1, pp. 5–28, 1998. [132] K. Kanatani, Geometric Computation for Machine Vision. Oxford, U.K.: Oxford Univ. Press, 1993. [133] D. Lowe, “Robust model-based motion tracking through the integration of search and estimation,” Int. J. Comput. Vis., vol. 8, pp. 113–122, 1992. [134] M. Pilu, A. W. Fitzgibbon, and R. B. Fisher, “Ellipse-specific leastsquares fitting,” in Proc. Int. Conf. Pattern Recognition, vol. 1, Vienna, Austria, Aug. 25–30, 1996, pp. 253–257. [135] E. Marchand and G. D. Hager, “Dynamic sensor planning in visual servoing,” in Proc. IEEE Int. Conf. Robotics and Automation, vol. 1, 1998, pp. 1988–1993. [136] C. E. Smith and N. P. Papanikolopoulos, “Grasping of static and moving objects using a vision-based control approach,” J. Intell. Robot. Syst., vol. 19, pp. 237–270, 1997. [137] A. Blake and M. Isard, Active Contours. London, U.K.: Springer-Verlag, 1998. [138] Y. Ricquebourg and P. Bouthemy, “Real-time tracking of moving persons by exploiting spatio-temporal image slices,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 797–808, Aug. 2000. [139] L. Torresani and C. Bregler, “Space-time tracking,” in Proc. Eur. Conf. Computer Vision, vol. I, 2002, pp. 801–812. [140] M. Brand, “Morphable 3D models from video,” in Proc. IEEE Conf. 
Computer Vision and Pattern Recognition, vol. II, 2001, pp. 456–633. [141] S. Avidan, “Support vector tracking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001, pp. 283–310. [142] M. Pontil and A. Verri, “Object recognition with support vector machines,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 637–646, June 1998. [143] T. Schoepflin, V. Chalana, D. R. Haynor, and K. Yongmin, “Video object tracking with a sequential hierarchy of template deformations,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 1171–1182, Nov. 2001. [144] R. Yang and Z. Zhang, “Model based head tracking with stereo vision,” in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, 2002, pp. 112–117. [145] S. Malik, C. McDonald, and G. Roth, “Hand tracking for interactive pattern-based augmented reality,” in Proc. Int. Symp. Mixed and Augmented Reality, 2002, pp. 117–126. [146] C. C. Chang and W. H. Tsai, “Vison-based tracking and interpretation of human leg movement for virtual reality applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 9–24, Jan. 2001.

Francesco Isgrò received the Laurea degree in mathematics from Università di Palermo, Palermo, Italy, in 1994, and the Ph.D. degree in computer science from Heriot-Watt University, Edinburgh, U.K., in 2001. From 2000 to 2002, he was a Research Associate with the Department of Computing and Electrical Engineering, Heriot-Watt University. He is now a Research Associate with the Dipartimento di Informatica e Scienze dell’Informazione, Università di Genova, Genova, Italy. He is also a part-time Lecturer at Università di Palermo. His current research interests are image-based rendering and applications to videoconferencing, three-dimensional analysis, and image registration.

Peter Kauff received the Diploma degree in electrical engineering and telecommunication from the Technical University of Aachen, Aachen, Germany, in 1984. He is the Head of the “Immersive Media & 3D Video” Group in the Image Processing Department, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut (FhG/HHI), Berlin, Germany. He has been with Heinrich-Hertz-Institute since 1984, involved in numerous German and European projects related to digital HDTV signal processing and coding, interactive MPEG-4-based services, as well as in a number of projects related to advanced three-dimensional video processing for immersive telepresence and immersive media. He has been engaged in several European research projects, such as EUREKA 95, RACE-project FLASH, ACTS-project HAMLET, COST211, ESPRIT-Project NEMESIS, IST-Projects VIRTUE and ATTEST and the Presence Working Group of IST-FET Proactive Initiative. Mr. Kauff is a reviewer of several IEEE and IEE publications.

Emanuele Trucco received the B.Sc. and Ph.D. degrees from the University of Genoa, Genoa, Italy, both in electronic engineering, in 1984 and 1990, respectively. He is now a Reader in the School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, U.K. His current interests are in multiview stereo, motion analysis, image-based rendering, and applications to videoconferencing, medical image processing, and subsea robotics. He has managed grants in excess of one million pounds from the European Union, EPSRC, various foundations (e.g., Royal Society and British Council), and industry. He has served on professional, technical, and organizing committees of several conferences. He is an Honorary Editor of the IEE Proceedings on Vision, Image and Speech Processing. He has published more than 100 refereed publications and coauthored a book widely adopted by the international community. Press reports include New Scientist, the Financial Times, and an invited participation in the BBC Tomorrow’s World Roadshow 2002.

Oliver Schreer received the degree in electronics and electrical engineering and the Ph.D. degree in electrical engineering from the Technical University of Berlin, Berlin, Germany, in 1993 and 1999, respectively. From 1993 until 1998, he was an Assistant Teacher with the Institute of Measurement and Automation, Faculty of Electrical Engineering, Technical University of Berlin, where he was responsible for lectures and practical courses in the fields of image processing and pattern recognition. His research interests have been camera calibration, stereo image processing, three-dimensional (3-D) analysis, navigation, and collision avoidance of autonomous mobile robots. Since August 1998, he has been working as a project leader in the “Immersive Media & 3-D Video” Group, Image Processing Department, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut (FhG/HHI), Berlin, Germany. In this context, he is engaged in research on 3-D analysis, novel view synthesis, real-time videoconferencing systems, and immersive TV applications. He was the representative of FhG/HHI within the European FP5 IST project VIRTUE and leader of the “Real-time” work package. Since autumn 2001, he has been an Adjunct Professor with the Faculty of Electrical Engineering and Computer Science, Technical University of Berlin. Dr. Schreer has served as a Guest Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and is a reviewer for several IEEE and IEE journals.
