Online Inserting Virtual Characters into Dynamic Video Scenes

Yijiang Zhang1,2, Julien Pettré1, Jan Ondřej1, Xueying Qin2,3, Qunsheng Peng2, and Stéphane Donikian1

1 Bunraku team, INRIA Rennes
2 State Key Lab of CAD&CG, Zhejiang University
3 School of Computer Science and Technology, Shandong University

Abstract The seamless integration of virtual characters into dynamic scenes captured by video is a challenging problem. In order to achieve consistent composite results, the virtual and real characters must share the same geometrical space and their interactions must appear natural. One essential question is how to detect the motion of real objects - such as real characters moving in the video - and how to steer virtual characters accordingly to avoid unrealistic collisions. We propose an online solution. First, by analyzing the input video, the motion states of the real objects are recovered into a common 3D world coordinate system. Meanwhile, a simplified accuracy measurement is defined to represent the confidence of each motion estimate. Then, under the constraints imposed by the real dynamic objects, the motion of the virtual characters is accommodated by a uniform steering model. The final step is to merge the virtual objects back into the real video scene while accounting for visibility and occlusion constraints between real foreground objects and virtual ones: an estimate of the depth of real objects in the video is captured into alpha maps, which are used to reduce the ambiguity caused by occlusions between real objects in the original video. Several examples demonstrate the efficiency of the proposed algorithm.

Keywords: mixed reality, steering method, collision avoidance

1 Introduction Mixed Reality refers to the merging of real and virtual worlds to produce new environments where physical and digital objects co-exist and interact naturally in real time. Integrating virtual characters with the real world is especially significant, as humans are the principal actors in most events. A number of applications, notably in the simulation and entertainment industries, are inspired by this technique. The visual realism of mixed reality can be exhibited by integrating virtual objects into a real video. To achieve a seamless integration, firstly, the virtual objects must be defined in a space consistent with the one presented in the

video. Meanwhile, the movement of the virtual characters should fit the real environment. This problem is a spatiotemporal one, as the geometrical constraints imposed by the real environment change over time due to the presence of dynamic objects. Our motivation is to synthesize believable animation of virtual characters: their motion and their reactions to surrounding objects should reach an acceptable level of realism in comparison to real human behaviors. For example, once an update of the moving people in the video is detected, the integrated virtual characters should predict the future movement of these walkers and adapt their own motion to avoid any collision with real obstacles. It is therefore essential for the virtual characters to be aware of the real environment. Although the geometry of the static part of the scene can be reconstructed in advance with current 3D reconstruction techniques [1, 2], the motion of dynamic objects, and in particular of pedestrians, is still very difficult to reconstruct online. A new method with a better trade-off between accuracy and efficiency is therefore of great significance for this challenge. Occlusion is another problem that may break the visual realism of mixed reality. When a virtual character is occluded by some real objects, a visibility computation must be conducted to reflect the correct spatial relationship between the virtual character and the real objects. The changing shape of moving objects makes tracking and synthesizing the corresponding silhouettes very difficult. In this paper, we develop techniques to seamlessly integrate virtual characters into real groups of pedestrians online. First, we propose an adaptive steering model under the static and dynamic constraints set by both the real and the virtual world. The virtual characters can then adapt their motion to avoid collisions and reach predefined destinations. Secondly, with the knowledge of the camera model and the geometry of the static environment in the video, we present an improved segmentation algorithm for mixed reality which removes several visual artifacts and ensures the quality of the final mixed video. Finally, we demonstrate a mixed reality system which integrates virtual characters online into a real video capturing several real walkers. The remainder of this paper is organized as follows: in

section 2, related work in both tracking and crowd animation is reviewed. Section 3 presents an overview of our system. The tracking of multiple pedestrians is discussed in section 4. An approach for embedding virtual characters into the real scene is proposed in section 5. To achieve a seamless integration, a finer alpha map is computed for each pedestrian based on its binary mask; the related issues are presented in section 6. Experimental results are demonstrated in section 7, and section 8 concludes the paper with highlights on future research directions.

2 Related Work Our techniques are at the crossroads of computer vision and virtual human simulation. We summarize related work in this section from two aspects: multiple-object segmentation and tracking in computer vision, and crowd animation in simulation. To reconstruct the motion of the dynamic targets in a given video, most approaches adopt segmentation as a primary process [3, 4] to extract the dynamic targets from the static background on each frame of the video, and then track the trajectories of these targets. Although background subtraction, under the assumption of a fixed camera, is the most efficient and robust algorithm for segmentation, occlusion may cause serious problems in multiple-object segmentation and tracking. This is because the conventional background subtraction approach cannot distinguish two real pedestrians who occlude each other and therefore form a single cluster in the foreground. In [5], Khan and Shah proposed to model each target with a Gaussian color model and to derive the tracking and segmentation result from the segmentation at the previous frame. [6, 7] adopted Markov Random Fields (MRF) to combine motion, appearance and occlusion in order to find the optimal segmentation. In [4], Zhao and Nevatia employed a human shape model and lighting knowledge to analyze the foreground and proposed a robust method to detect and track pedestrians. To track pedestrians in 3D world space, Rosales and Sclaroff proposed a combined 2D/3D approach in [8]. The authors tracked multiple humans while accounting for occlusion, and used an extended Kalman filter to estimate the relative 3D motion trajectories up to a scale factor. In our case, the camera is calibrated; we therefore not only track the pedestrians in 3D world space but also quantify the accuracy of the estimates. Besides, to satisfy the visual requirements of the mixed video, the binary mask from segmentation is refined and replaced by an alpha map. Behavior modeling of the virtual characters integrated in the video is another problem. Helbing's social force model is one of the most popular works [9]. This model has been widely adopted and improved [10, 11]. In these models, virtual walkers are modeled as velocity-controlled particles undergoing a sum of acceleration forces, in analogy with Newtonian physics. Interactions are modeled as repulsive forces between walkers, expressed as a function of their relative distance. Treuille et al. also made an analogy with physics [12] and formulated interactions as a minimization problem. Walkers' motion is deduced from a potential field whose dynamic component results from a repulsion emitted by walkers. Reynolds [13] enabled interaction with

anticipation. The unaligned collision avoidance behavior extrapolates walkers' trajectories - assuming that their velocity is constant - and checks for collisions in the next time segment. A reactive acceleration is computed for both walkers if a potential collision is detected. Pettré et al. [14] proposed an interaction model based on an experimental study, in which the interaction between two walkers is solved by a combination of velocity and orientation adaptations. It is noteworthy that the authors explained the pair interaction by decomposing the trajectories into three phases: initial, interaction and recovery. The walkers have different tasks in each phase. In [15], the velocity obstacle formulation is extended to solve multiple interactions, which is similar to the method in [14]. Note that in mixed reality, virtual crowd animation differs from this prior work because the real walkers move autonomously. Such dynamic constraints, together with the noise in the motion estimation, make the animation more challenging. Somasundaram and Parent proposed a system which can integrate some virtual characters into a real crowd on video [16]. However, such a system is based on global path planning and can only be performed off-line. In this paper we propose a new system which is capable of integrating virtual characters into real video scenes online.

3 Overview Our system can be divided into two parts: preprocess and on-line process, as displayed in Figure 1.

Figure 1: System Diagram

In the preprocess, the 3D space of the scene is acquired. For simplicity, we select scenes with open ground. Assuming that the camera is fixed, the background can be estimated from the first few frames of the video. If there are static obstacles in the scene, the corresponding masks in the background and their positions in the 3D world space are assigned manually. We perform camera calibration to reconstruct the 3D space of the scene [17]. The intrinsic parameters can be derived by chessboard

calibration, while the extrinsic parameters are derived from several matches between marks in the scene and their corresponding pixels in the image. If there are not enough marks in the scene, some artificial marks can be placed for calibration and removed afterwards. The on-line process will be discussed in detail in this paper. The motions of the pedestrians are estimated in the 3D world space, and the accuracy of the estimate is used as a confidence measure for the tracking. In the simulation, virtual characters are inserted into the 3D space of the real scene and modeled to interact with the real pedestrians and other static obstacles. During synthesis, the virtual characters are rendered with the same lighting as in the original video. To render the occlusion of the inserted virtual characters by the pedestrians in the video precisely, improved alpha maps for the relevant pedestrians are used to improve the quality of the mixed video. This task is accomplished in two steps, as described in section 6.
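As a concrete illustration of this calibration step, the sketch below shows one possible realization with OpenCV; it is not the authors' implementation, and the board size, square size and mark correspondences are placeholder assumptions.

```python
# Sketch of the calibration step (not the authors' exact code): intrinsics from
# chessboard views, extrinsics from a few scene-mark / pixel correspondences.
import cv2
import numpy as np

def calibrate_intrinsics(chessboard_images, board_size=(9, 6), square=0.025):
    """Estimate the camera matrix K and distortion from chessboard views."""
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts, img_shape = [], [], None
    for img in chessboard_images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img_shape = gray.shape[::-1]
        found, corners = cv2.findChessboardCorners(gray, board_size)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    _, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, img_shape, None, None)
    return K, dist

def calibrate_extrinsics(marks_3d, marks_2d, K, dist):
    """Recover the camera pose from matches between scene marks and image pixels."""
    _, rvec, tvec = cv2.solvePnP(np.asarray(marks_3d, np.float32),
                                 np.asarray(marks_2d, np.float32), K, dist)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # world-to-camera rotation and translation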

4 Tracking of Multiple Pedestrians To make the virtual characters' motion plausible, the virtual characters are first enabled to perceive the evolution of the environment. In our work, we assume the only moving objects in the scene are pedestrians; perceiving the evolution of the environment therefore amounts to estimating the pedestrians' motions in the 3D world space. In this section, with a monocular video as input, we segment and track the pedestrians in video space and then estimate their motions in the 3D world space. Various pedestrian tracking techniques have been proposed, as summarized in the previous section. We adopt an approach similar to the one proposed in [3]. First, the moving pedestrians are segmented by background subtraction. The foreground regions consist of several connected components (blobs). A single pedestrian corresponds to one blob, while pedestrians occluding each other correspond to a single merged blob. The bounding box of each blob is recorded to represent the corresponding pedestrian's position in the image. The continuity of the video helps to associate tracks with foreground blobs, so the pedestrians can be tracked efficiently in the video. A merging and splitting detection method provides a mechanism to handle occlusion. The tracking result in the video is then recovered to the 3D world space. The pedestrians are assumed to move on a known ground plane, and their vertical movements in the 3D world space are ignored. The camera model and the ground plane together serve as a bridge between 2D and 3D quantities. Technically, an extended Kalman filter (EKF) is used to recursively predict the pedestrian's position based on the current position and velocity [8]. In this EKF, the state vector is defined as x_k = (x_k, y_k, ẋ_k, ẏ_k)^T, where (x_k, y_k) and (ẋ_k, ẏ_k) are the position and velocity of the tracked pedestrian on the ground plane at time k. In the prediction, the pedestrian is assumed to move with constant velocity over a short time. Our measurement vector in this EKF is z_k = (u_k, v_k)^T, where (u_k, v_k) are the image coordinates of the tracked pedestrian, given as the middle pixel of the bottom edge of the pedestrian's bounding box in the image. Since the camera model is known, the measurement equation is derived from the projection relation between x_k

and z_k. Occlusion between pedestrians may push z_k far from the true image coordinates of the tracked pedestrian. When an occlusion is detected, we skip the measurement update in the EKF and only use the EKF prediction to update the motion. In each loop of the EKF, we obtain not only the estimated motion of the pedestrian, x̂_k|k, but also the a posteriori error covariance matrix, P_k|k. P_k|k measures the accuracy of the motion estimate. As a 4 × 4 matrix with correlated entries, P_k|k is cumbersome to use directly, so a simplified accuracy measurement σ_k is defined here:

σ_k = ( √(P_k|k(1,1)) + √(P_k|k(2,2)) ) / 2,    (1)

i.e., σ_k is the average of the position accuracy along the two horizontal coordinates of the 3D world space. We consider the estimated motion x̂_k|k as a dynamic constraint for the virtual character simulation in the next section, while σ_k represents the confidence of the corresponding constraint.
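To make the tracking loop concrete, the following sketch implements a constant-velocity EKF together with the simplified accuracy measurement of Eq. (1). It is a minimal re-implementation under stated assumptions, not the authors' code: the ground-to-image projection project(x, y) is assumed to come from the calibrated camera model, its Jacobian is approximated numerically, and the noise levels are placeholders.

```python
# Sketch of the constant-velocity EKF used for ground-plane tracking.
# `project(x, y)` is assumed to map a ground-plane point to image coordinates
# (u, v) via the calibrated camera; q and r are placeholder noise levels.
import numpy as np

class PedestrianEKF:
    def __init__(self, project, dt=1.0 / 30.0, q=0.5, r=2.0):
        self.project = project                        # ground plane -> (u, v)
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt              # constant-velocity model
        self.Q = q * np.eye(4)                        # process noise
        self.R = r * np.eye(2)                        # pixel measurement noise
        self.x = np.zeros(4)                          # (x, y, x_dot, y_dot)
        self.P = 10.0 * np.eye(4)                     # large initial uncertainty

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q

    def update(self, z, occluded=False):
        if occluded:                                  # skip the measurement update
            return
        H = self._jacobian()
        y = np.asarray(z, float) - np.asarray(self.project(self.x[0], self.x[1]), float)
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ H) @ self.P

    def _jacobian(self, eps=1e-4):
        """Numerical Jacobian of the projection; only the position enters h(x)."""
        H = np.zeros((2, 4))
        base = np.asarray(self.project(self.x[0], self.x[1]), float)
        for i in range(2):
            xp = self.x.copy()
            xp[i] += eps
            H[:, i] = (np.asarray(self.project(xp[0], xp[1]), float) - base) / eps
        return H

    def accuracy(self):
        """Simplified accuracy measurement sigma_k of Eq. (1)."""
        return 0.5 * (np.sqrt(self.P[0, 0]) + np.sqrt(self.P[1, 1]))
```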

5 Simulation In this section we insert virtual characters into the real scene and discuss how to simulate them in the mixed environment. In our work, the main mission of the inserted virtual characters is to move from given initial positions to chosen destinations on the ground. A virtual character A is modeled as a velocity-controlled moving point. Without any obstacles, A can easily deduce its motion from its destination: its desired velocity Vd is oriented toward this destination, and its norm is the character's comfort speed vc, which is constant and thus an individual parameter. If there are obstacles in the real environment, the virtual character should be prevented from colliding with them. The real static obstacles have been located manually beforehand, and the moving obstacles, i.e., pedestrians, have been perceived by the tracking of the previous section. Under the constraints posed by such obstacles, the virtual characters' motions are adapted. On the other hand, the real pedestrians cannot perceive the virtual ones, so the interaction between real pedestrians and virtual characters is actually the virtual characters' reaction to the real pedestrians. In addition, interactions between virtual characters are also considered, which is a typical topic in computer animation. Here, we adopt a steering model based on an egocentric representation of walkers' relative motion, which is inspired by the work presented in [14]. This model is designed for pairwise interactions between two virtual walkers. We extend it to static obstacles of arbitrary shape and refine it for uncontrolled real walkers. The idea consists of three phases: first, the definition of the personal area is extended from virtual characters to general objects, which can be virtual or real, static or dynamic. Then the model for avoiding a single obstacle is proposed. Finally, we introduce the solution for collision avoidance in sparsely populated environments.

5.1 Personal Area Personal Area is the region surrounding a person which they

regard as psychologically theirs. Invasion of the personal area often leads to discomfort, anger, or anxiety on the part of the victim. The size of the personal area depends on personality, environment, social culture, etc. For a virtual character, collision avoidance is equivalent to preserving its personal area from invasion by other obstacles [18]. We use a kite shape to approximate the personal area, as represented in Figure 2. The personal area has the same size R_A on the left, right and back sides. As an input parameter, R_A is called the radius of the personal area; we set R_A = 0.4 m in this paper. The front of the personal area, along the velocity relative to the world space V_A/W, is set to a larger size that depends on this velocity as well as on R_A, i.e., L_A = R_A + R_A |V_A/W|. This definition accords with common sense: as a walker goes faster, he needs to pay more attention to the space in front of him. For real walkers in the mixed environment, their motions and avoidances are completely controlled by themselves. In order to combine them with virtual walkers into a unified model, we design a personal area for each of the real walkers. Taking A as a real walker, the definition of his personal area is almost the same as for a virtual one except for the radius: according to the analysis in the previous section, the motion estimation of a real walker is not perfectly accurate, and an accuracy measurement σ_k is available. A dynamic personal area is therefore set based on this estimation confidence, R_A = R_basic + 3σ (for the sake of clarity, the subscript k is omitted). R_basic (0.2 m in this paper) is a critical value ensuring that the personal area does not become too small. The second term 3σ is inspired by the 3-σ principle in probability theory. Moreover, the personal area can be extended to general static obstacles. For a static obstacle, we vertically project it onto the ground and define its personal area as a circum-polygon of its projection; conservatively, a larger but simpler polygon is a better personal area representation. In this way, all of the objects formally share the same attribute, which allows a unified approach to collision avoidance. Next, we describe the steering model gradually in three different conditions.
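The sizing rules above translate into a few lines of code. The sketch below is only an illustration of the stated rules; the function names are ours, and the constants follow the values quoted in the text.

```python
# Minimal illustration of the personal-area sizing rules: side/back radius R_A,
# front length L_A = R_A + R_A * |V_A/W|, and, for real tracked walkers, an
# uncertainty-inflated radius R_basic + 3 * sigma.
import numpy as np

R_VIRTUAL = 0.4   # radius of a virtual character's personal area (m)
R_BASIC = 0.2     # minimal radius for a real, tracked walker (m)

def personal_area_virtual(velocity):
    """Return (side/back radius, front length) for a virtual walker."""
    r = R_VIRTUAL
    return r, r + r * np.linalg.norm(velocity)

def personal_area_real(velocity, sigma):
    """Real walkers get a radius inflated by the tracking uncertainty (3-sigma rule)."""
    r = R_BASIC + 3.0 * sigma
    return r, r + r * np.linalg.norm(velocity)
```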

Figure 2: Personal Area

5.2 Avoid One Walker We now introduce the model for a virtual character A avoiding one walker. The main idea of our strategy is to keep the predicted position of the perceived walker outside of A's personal area by adapting A's velocity. Our description is supported by the example introduced in Figure 3. Two walkers walk straight toward their goals, and their trajectories intersect. We denote by A the reference walker whose motion the model is controlling. Regardless of a

future collision, the reference walker A's desired velocity is V_d, as explained before. A is walking at the desired velocity since no adaptation has been made yet, i.e., V_A/W = V_d. The perceived walker on the bottom-right of the figure is denoted B, which can be real or virtual. Our approach is performed in the local coordinate system centered and oriented on the reference walker A. First, the personal area of B is transferred to A: the personal area of A is redefined with the new radius R_AB = R_A + R_B. In the local coordinate system, B's relative position is denoted P_B/A, and the relative velocity is computed as follows:

V_B/A = V_B/W + V_W/A,    (2)

where V_B/W is the velocity of B relative to the world W, and V_W/A is the velocity of W relative to A (simply deduced from the absolute velocity vector: V_W/A = −V_A/W).

Figure 3: Illustration of the avoidance model

T1 and T2 are the tangents to the personal area passing through P_B/A. These tangents (colored in light green in Figure 3) delimit an area called the interaction area. We also define the interaction point I as follows:

I = P_B/A + u_t · V_B/W + u_t · V_W/A,    (3)

where u_t is the unit of time. I is the predicted position of B relative to A, and our technique is mainly based on the position of I relative to the interaction area. If I is outside of the interaction area, the relative position of B will always stay outside of the personal area of A as long as both walkers keep a constant velocity; we therefore consider that there is no interaction between them and no need to adapt A's motion. Conversely, if I is located inside the interaction area, the extrapolated trajectory of B crosses A's personal area, which means that a collision will happen if they keep constant velocities. To prevent such a collision, the solution we adopt is to move I onto the boundary of the interaction area by adjusting the velocity of the reference walker A. Since we want I ∈ T1 ∪ T2, the solution velocity V_sol,W/A has to verify the following condition:

P_B/A + u_t · V_B/W + u_t · V_sol,W/A ∈ T1 ∪ T2.    (4)

An infinity of solutions exists; the point of T1 ∪ T2 closest to I, denoted I_sol, is chosen to determine the optimal solution:

V_sol,W/A = ( I_sol − P_B/A − u_t · V_B/W ) / u_t.    (5)

The solution V_sol,W/A is sufficient for A to prevent a collision within u_t. Considering the physical ability of the characters,

V_sol,W/A needs to be truncated by the maximum speed; the collision avoidance can still be completed over several time intervals. It is worth noting that the two solution domains T1 and T2 respectively correspond to two different roles: passing first or giving way. For example, when I_sol is located on T1, as shown in Figure 3, A's choice is to pass first, and its velocity adaptation V_sol,A/W is to speed up and turn left.
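To make the geometry of this section concrete, here is a compact sketch of one pairwise avoidance step. It is an approximation rather than the exact model: for simplicity the kite-shaped combined personal area is replaced by a disc of radius R_AB, the anticipation time u_t and the maximum speed are free parameters, and all quantities are 2D ground-plane vectors expressed in A's local frame.

```python
# Simplified sketch of the pairwise avoidance of section 5.2 (Eqs. (3)-(5)).
# The kite-shaped personal area is approximated here by a disc of radius r_ab.
import numpy as np

def _cross2(a, b):
    return a[0] * b[1] - a[1] * b[0]

def avoid_one_walker(p_b, v_a_w, v_b_w, r_ab, ut=1.0, v_max=2.0):
    """Return A's adapted world velocity (or v_a_w unchanged if no interaction)."""
    p_b, v_a_w, v_b_w = (np.asarray(x, float) for x in (p_b, v_a_w, v_b_w))
    v_w_a = -v_a_w                                   # velocity of the world w.r.t. A
    i_pred = p_b + ut * v_b_w + ut * v_w_a           # interaction point I, Eq. (3)

    d = np.linalg.norm(p_b)
    if d <= r_ab:                                    # already inside the area: back off
        return -v_max * p_b / max(d, 1e-6)

    # Tangent directions T1, T2 from P_B/A to the disc-approximated personal area.
    to_a = np.arctan2(-p_b[1], -p_b[0])
    alpha = np.arcsin(r_ab / d)
    t1 = np.array([np.cos(to_a + alpha), np.sin(to_a + alpha)])
    t2 = np.array([np.cos(to_a - alpha), np.sin(to_a - alpha)])

    w = i_pred - p_b
    if not (_cross2(t2, w) >= 0 and _cross2(w, t1) >= 0):
        return v_a_w                                 # I outside the interaction area

    # Move I onto the closer tangent ray (I_sol) and solve Eq. (5).
    candidates = [p_b + max(np.dot(w, t), 0.0) * t for t in (t1, t2)]
    i_sol = min(candidates, key=lambda c: np.linalg.norm(c - i_pred))
    v_sol_w_a = (i_sol - p_b - ut * v_b_w) / ut
    v_new = -v_sol_w_a                               # back to A's velocity in the world
    speed = np.linalg.norm(v_new)
    return v_new if speed <= v_max else v_new * (v_max / speed)
```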

5.3 Avoid One Static Obstacle Our adaptation model can be extended to avoid collisions with a static obstacle. The principle is the same: keeping A's personal area free from invasion. We first explain how a virtual character avoids a line-shaped static obstacle, which is considered the primary element of general static obstacles. As Figure 4(a) shows, a line-shaped static obstacle B is on the virtual character A's way. Each of these two objects has its own personal area: A's personal area is the kite-shaped region, while B's is its projection onto the ground, i.e., the line segment. A collision is predicted if A keeps a constant velocity. The center of B is denoted O. As in the previous method, B's personal area is transferred to A: the line segment is translated to each vertex of A's personal area, aligning the center O with each vertex, see Figure 4(b). The circum-polygon of the four translated line segments, the gray area in Figure 4(b), is defined as the new personal area of A. Meanwhile, B is reduced to the point O with velocity 0. Afterwards, we adopt the same approach as before to obtain the solution velocity.


Figure 4: Model for avoiding a line-shaped static obstacle

For a static obstacle with a general shape, the personal area is defined as a circum-polygon of its projection onto the ground. As a result, the problem of a virtual walker avoiding such a static obstacle is reduced to the problem of the virtual character avoiding a set of line segments simultaneously. Once the solution for A to avoid more than one obstacle is available, the avoidance of a general static obstacle follows directly.
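The construction of Figure 4(b) can be sketched as follows. This is an illustrative reading of the description above under assumed conventions: the kite is parameterized by its four extreme vertices, the segment endpoints are expressed in A's local frame, and the circum-polygon is computed as the convex hull of the translated endpoints.

```python
# Illustrative construction of Figure 4(b): the line segment is translated to
# each vertex of A's kite (aligning its center O with the vertex), the convex
# hull of the translated endpoints becomes A's new personal area, and the
# obstacle is reduced to the point O with zero velocity.
import numpy as np

def kite_vertices(r, l, heading):
    """Four extreme vertices of the kite personal area, rotated to `heading`."""
    c, s = np.cos(heading), np.sin(heading)
    local = np.array([[l, 0.0], [0.0, r], [-r, 0.0], [0.0, -r]])
    rot = np.array([[c, -s], [s, c]])
    return local @ rot.T

def convex_hull(points):
    """Andrew's monotone chain, returning the hull counter-clockwise."""
    pts = sorted(map(tuple, points))
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    def build(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = build(pts), build(list(reversed(pts)))
    return np.array(lower[:-1] + upper[:-1])

def transferred_personal_area(r, l, heading, seg_p0, seg_p1):
    """A's enlarged personal area after absorbing a line-shaped obstacle."""
    seg_p0, seg_p1 = np.asarray(seg_p0, float), np.asarray(seg_p1, float)
    center = 0.5 * (seg_p0 + seg_p1)                 # obstacle reduced to its center O
    offsets = (seg_p0 - center, seg_p1 - center)
    pts = [v + o for v in kite_vertices(r, l, heading) for o in offsets]
    return convex_hull(pts), center
```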

5.4 Avoid Multiple Obstacles We continue using personal area to solve multiple avoidances. A solution velocity is required to avoid all

of the perceived obstacles at the same time. Consider the case where the virtual character A is moving with several obstacles B_i, i = 1, ..., n around it, each of which may be a real or virtual walker, or a line-shaped static obstacle. To adapt the motion of A correctly, each pair (A, B_i) is first processed as in the previous method. Pairs with no interaction are omitted. For each pair with an interaction, instead of choosing the optimal solution by distance, we keep the two velocity domains T_i1 and T_i2. The result is a set of solution velocity domains S = {T_i1, T_i2} with i = 1, ..., m. Note that m ≤ n because some of the perceived obstacles have no interaction with the considered reference walker A. Secondly, we merge the solution velocity domains to enable a better motion adaptation for A. A solution velocity candidate for a specific obstacle may not be an eligible candidate with respect to all the perceived obstacles, because it may fall into the interaction area of another perceived obstacle. Thus S is pruned by eliminating the solution candidates which conflict with other perceived obstacles. Eventually, the final solution velocity domain is reduced to S'. In case S' is empty, we set V_sol = 0; otherwise, the velocity closest to V_W/A is selected. A detailed explanation is given in [14].
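Below is a schematic version of this pruning step, reusing the disc approximation of the personal areas from the sketch of section 5.2; it illustrates the procedure but is not the authors' implementation. Each interacting obstacle contributes two candidate velocities (pass first / give way), candidates that would drive the predicted relative position into another obstacle's interaction area are discarded, and the surviving candidate closest to the current velocity is selected; a zero velocity is returned when the pruned set S' is empty.

```python
# Schematic multi-obstacle pruning (disc approximation of the personal areas).
# obstacles: list of (p_b, v_b_w, r_ab) tuples expressed in A's local frame.
import numpy as np

def _cross2(a, b):
    return a[0] * b[1] - a[1] * b[0]

def _cone(p, r):
    """Tangent directions (t1, t2) delimiting the interaction area seen from p."""
    d = max(np.linalg.norm(p), 1e-6)
    to_a = np.arctan2(-p[1], -p[0])
    alpha = np.arcsin(min(r / d, 1.0))
    t1 = np.array([np.cos(to_a + alpha), np.sin(to_a + alpha)])
    t2 = np.array([np.cos(to_a - alpha), np.sin(to_a - alpha)])
    return t1, t2

def _predicted_inside(p, v_b, v_a, r, ut, eps=1e-9):
    """True if B's predicted relative position falls inside the interaction area."""
    w = ut * (np.asarray(v_b, float) - np.asarray(v_a, float))
    t1, t2 = _cone(p, r)
    return _cross2(t2, w) > eps and _cross2(w, t1) > eps

def avoid_multiple(v_a_w, obstacles, ut=1.0):
    v_a_w = np.asarray(v_a_w, float)
    candidates = []
    for p, v_b, r in obstacles:
        p = np.asarray(p, float)
        if not _predicted_inside(p, v_b, v_a_w, r, ut):
            continue                                  # no interaction: pair omitted
        t1, t2 = _cone(p, r)
        i_pred = p + ut * (np.asarray(v_b, float) - v_a_w)
        for t in (t1, t2):                            # I_sol on T_i1 and on T_i2
            i_sol = p + max(np.dot(i_pred - p, t), 0.0) * t
            candidates.append(-(i_sol - p - ut * np.asarray(v_b, float)) / ut)
    if not candidates:
        return v_a_w                                  # no interaction at all
    pruned = [v for v in candidates
              if not any(_predicted_inside(np.asarray(p, float), v_b, v, r, ut)
                         for p, v_b, r in obstacles)]
    if not pruned:
        return np.zeros(2)                            # S' empty: V_sol = 0
    return min(pruned, key=lambda v: np.linalg.norm(v - v_a_w))
```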

6 Occlusion Handling for Synthesis Whereas the previous steps guarantee collision-free motion of the virtual characters among real obstacles, it is now required to seamlessly compose real and virtual objects into the video space. One issue is how to synthesize the virtual characters into the original video with correct visibility. If a virtual character is located behind a real object, we have to determine which part of the virtual character is occluded. To this end, the alpha map of the foreground object in the original video is required. For the static objects, the alpha map is created manually during the preprocess. For the moving ones, i.e., pedestrians, we fortunately already have a raw segmentation from the background subtraction used in tracking. The binary mask, however, is usually not an accurate representation of the alpha map, because it cannot embody the transition between foreground and background, see Figure 5(b). Though this kind of error may be small around the boundary, the flickering artifacts it causes can be very distracting and unpleasant in the final mixed video. Therefore the alpha value of each boundary pixel is changed to a moderate value: the frequency of foreground pixels in its surrounding neighborhood. Besides, when two pedestrians occlude each other, the merged blob is far from an acceptable alpha map for either of them. A better alpha map is sometimes necessary for these pedestrians, which means that the merged blob needs to be separated. For example, as Figure 5(a) shows, two pedestrians form a merged blob in the foreground mask. When a virtual character happens to be located roughly between them in the 3D world space, an artificial ambiguity will appear in the final mixed video if we do not separate the merged blob properly, see Figure 8(a). The case where two pedestrians occlude each other and a virtual character is located between them occurs only occasionally, but the resulting artifacts are very conspicuous. To prevent such artifacts, a separation of the merged blob is a prerequisite. Unfortunately, an accurate and fast separation is almost impossible considering the color ambiguity and the deformation of the pedestrians. Here, a shape-based method is adopted to suggest the projection of each pedestrian in the image and separate the merged blob when possible.
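The boundary softening can be illustrated as follows. The sketch is one plausible realization rather than the authors' code: the window size is a placeholder, and the boundary band is taken between a dilation and an erosion of the binary mask.

```python
# One plausible realization of the boundary softening: alpha on the boundary
# band is set to the local frequency of foreground pixels in a win x win window.
import cv2
import numpy as np

def soften_alpha(binary_mask, win=5):
    """binary_mask: uint8 array with values in {0, 1}; returns a float32 alpha map."""
    mask = binary_mask.astype(np.float32)
    local_freq = cv2.blur(mask, (win, win))       # frequency of foreground pixels
    kernel = np.ones((3, 3), np.uint8)
    boundary = (cv2.dilate(binary_mask, kernel) > 0) & (cv2.erode(binary_mask, kernel) == 0)
    alpha = mask.copy()
    alpha[boundary] = local_freq[boundary]        # moderate value on the boundary
    return alpha
```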


Figure 5: Segmentation and Separation: (a) is a sample input frame, (b) is the foreground from standard background subtraction, (c) is the alpha map obtained by softening the boundary, and (d) shows the separation of the two pedestrians by the shape model

We employ the motion estimation of section 4 to accomplish the separation. We model the walker shape by a rectangular plate parallel to the image plane of the ground-level camera, whose width and height are set to 0.4 m and 1.75 m by default. With the knowledge of the camera model and the pedestrian's horizontal position, the projection of the rectangular plate suggests the extent of the pedestrian in the image. Once all the related rectangular plates are projected onto the image plane, the projections cover the merged blob in the image. For the pedestrian with the largest depth, the segmentation is set as the intersection of the projection and the merged blob, see the gray area in Figure 5(d). The blob is updated by setting the segmented area as background. Then the pedestrian with the second-largest depth is extracted from the remaining blob in the same manner. We iterate the operation until the pedestrian with the smallest depth is reached. The reason why we prefer this descending depth order is that a larger depth means a smaller silhouette, and hence relatively less error is caused by the above crude segmentation. The performance of the separation depends on the degree of occlusion. Severe occlusion, for example when the head is occluded, causes more error, so the separation is skipped for any pedestrian whose head is occluded. The head detection is accomplished by analyzing the boundary of the merged blob, as proposed in [4].
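A sketch of this depth-ordered peeling is given below. It is illustrative only: project_rect(pos) is assumed to return the image-space quadrilateral of the 0.4 m × 1.75 m plate standing at ground position pos, derived from the calibrated camera model, and head-occlusion detection is assumed to have been performed beforehand.

```python
# Illustrative sketch of the shape-based separation: pedestrians in a merged
# blob are peeled off in order of decreasing depth using the projected plates.
import cv2
import numpy as np

def separate_merged_blob(blob_mask, pedestrians, project_rect):
    """
    blob_mask:   uint8 mask (values 0/1) of the merged foreground blob.
    pedestrians: list of (pid, ground_pos, depth, head_occluded) tuples.
    Returns a {pid: mask} dict with one mask per separated pedestrian.
    """
    remaining = blob_mask.copy()
    masks = {}
    # Farthest pedestrians first: their silhouettes are smaller, so the crude
    # rectangle projection introduces relatively less error.
    for pid, pos, depth, head_occluded in sorted(pedestrians,
                                                 key=lambda p: p[2], reverse=True):
        if head_occluded:
            continue                              # severe occlusion: skip separation
        poly = np.asarray(project_rect(pos), np.int32)
        rect_mask = np.zeros_like(remaining)
        cv2.fillPoly(rect_mask, [poly], 1)
        masks[pid] = remaining * rect_mask        # intersection with the blob
        remaining = remaining * (1 - rect_mask)   # peel the area off the blob
    return masks
```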

7 Results The system was implemented and experiments were conducted on an Intel(R) Core(TM)2 Extreme 2.8 GHz PC. For all experiments we used a consumer hand-held video camera recording at 30 frames/sec (640 × 360, color). In order to test the system, we collected 10 sequences in three scenes. The whole pipeline runs online on the recorded videos, and the average speed is about 20 fps without any code-level optimization. The system spends most of its time on segmentation. Considering that the virtual characters are only occasionally occluded, we can accelerate the system by skipping the unnecessary computation in the occlusion handling. Next we show some representative results.

Figure 6: Example of segmentation in video space and tracking in 3D world space

Our first result, in Figure 6, consists of a 13-second (390 frames) multi-pedestrian walking sequence. Four real pedestrians enter the view of the camera from four different directions and walk straight ahead. The first row shows 4 sample frames of the original video. The middle row shows the pedestrians' masks. When there is a partial occlusion, the merged blob is separated by the shape model, and the occluded pedestrian is represented as a gray mask, as shown in the second and third images. The quality of the separation depends on the severity of the occlusion. We could improve the separation with more subtle techniques, such as a color model, but this would cost more computation. For head occlusions, see the left occlusion in the third image, the separation is skipped. The bottom images are top views showing the pedestrians' positions on the ground. The red triangle in the bottom right represents the fixed camera. The four curves on the ground represent the trajectories of the four pedestrians along the video, and the small disks on the trajectories represent the pedestrians' positions. Note that there is a gap in the orange trajectory. This is because the associated pedestrian is occluded soon after entering the view. While the motion estimate is not yet accurate enough, the occlusion invalidates the measurement update in the EKF and the estimate therefore drifts far from the real state. This error disappears as soon as the occlusion is over. Figure 7 illustrates the perspective effect on the motion estimation. The top view, Figure 7(a), displays the estimated trajectory of a pedestrian tracked by the EKF. The distance between the pedestrian and the camera varies as the pedestrian moves, and because of the perspective effect, the accuracy of the motion estimation varies accordingly. Figure 7(b) illustrates this dependency. At the beginning, the estimation accuracy is initialized with a large value.


Figure 7: Tracking accuracy as a function of distance. (a) plots the camera and an estimated trajectory; (b) displays the relation between the distance from the tracked object to the camera (blue) and the estimation accuracy σ (dark green).

It rapidly decreases and afterwards varies with the distance from the tracked object to the camera. We also infer that the estimation error is below 15 cm (3σ) within a range of 15 m around the camera, which is acceptable for the motion adaptation of the virtual characters in our experiments.


Figure 8: Shape-based separation avoids artifacts in the final mixed video

Figure 8 demonstrates the validity of the merged blob separation. The original frame is Figure 5(a). In the mixed video, a virtual walker is located roughly between these two real pedestrians. Without the separation, the merged blob is treated as the silhouette of the real pedestrian closer to the camera. As a result, the virtual human appears to be occluded by the farther real pedestrian in the mixed video, see Figure 8(a). The separation avoids this artifact, see Figure 8(b).


Figure 10: Top views of two interaction examples. The trajectories of real (virtual) walkers are plotted in blue (red), and the black rectangles represent real static obstacles.

Some samples of the mixed video are displayed in Figure 9, and the top views in Figure 10 demonstrate the validity of the motion model for virtual walkers. More detailed examples can be found in the supplementary material. It is worth mentioning a phenomenon about the interaction between a real walker and a virtual one. Real walkers cannot perceive the inserted virtual walkers in the mixed environment and therefore cannot deliberately avoid them; the collision avoidance between them depends entirely on the virtual walkers' motion adaptation. Passing first becomes somewhat risky for the virtual walker, because the real one will not slow down or turn to cooperate with him: when the virtual walker attempts to pass first, the real walker may collide with him from behind. To avoid such collisions, we individually increase the size of the back part of the virtual character's personal area. As a result, the virtual walker needs to adapt more than before in order to pass first.

8 Discussion and Future Work We proposed a video-based mixed reality system enabling interaction between real and virtual humans in real scenes. The real scene is captured by a camera, and the evolution of the environment is obtained by video analysis, which provides the constraints for the motion of the virtual characters. The virtual characters are then modeled to move around the scene, keeping away from real obstacles and other virtual characters. To seamlessly integrate the virtual characters, the alpha map of each real pedestrian is carefully extracted. The framework of the system is flexible, and each component of the system can be improved independently. Our system is a prototype for online mixed crowd simulation. Several limitations exist and suggest directions for our future work, which are presented in the following paragraphs. In the background subtraction, we intentionally avoided strong lighting effects so as not to be distracted by troublesome segmentation details. If the real pedestrians cast strong shadows, our segmentation will include the shadows in the foreground, which disturbs the subsequent processing in both tracking and synthesis. We have shown a tracking failure in the top view in Figure 6, where an occlusion in the original video causes an obvious error in the motion estimation, forming a gap in the trajectory. In fact, as the crowd complexity increases, such errors occur more frequently. As a result, the virtual walkers cannot obtain an accurate motion estimate of the real ones, and a collision between them may appear when they are close to each other. To improve the motion estimation, we would have to adopt a more subtle tracking method using new cues from the video, and therefore sacrifice efficiency. For example, an elaborate 3D model of the pedestrian can make the segmentation more accurate and yield a better observation for the EKF. Moreover, we can ignore some occlusions within a group and track the group as a whole. Basically, frequent occlusions of long duration cause ambiguity and hide a lot of essential information about the 3D world. As far as we know, none of the existing tracking techniques handles severe occlusions well. Comparatively, it is easier to synthesize collision-free motion of virtual characters among real obstacles once

Figure 9: Samples of the mixed video

the motion estimation is obtained correctly. As summarized in Related Work, there are several different approaches to simulating the motion of the virtual characters. The advantage of our proposed model is that the personal area induces asymmetry in the interaction. As a result, the trajectories of the mixed interaction look more realistic than with the other approaches. Although our model can handle most near-linear motion, which is the most common case in daily life, the non-linear case still needs care. In [14], a new criterion for interaction adaptation is applied to address this problem. In our mixed case, fortunately, the accuracy measurement σ increases when the tracked pedestrian undergoes a sudden motion change, which extends the personal area and decreases the probability of collision. In short, the non-linear case still needs more attention from the animation community. Future work will extend the current system to handle more complicated environments. In complex crowds containing groups, it is not valid to track each pedestrian as a single target; instead, we should analyze the group and track it as a whole. Normally, the motions of pedestrians are smooth and predictable over a relatively long duration, so learning the patterns of the pedestrians' motion trajectories is a better substitute for the current tracking method. In the simulation, the tracked groups are a new type of real obstacle, and we can also simulate virtual groups, which will make the mixed crowd more realistic. In our current work, the pedestrians cannot perceive virtual objects yet. Another important task for us is to provide the real pedestrians with the ability to perceive the virtual characters.

References

[1] Marc Pollefeys, Luc Van Gool, Maarten Vergauwen, Frank Verbiest, Kurt Cornelis, Jan Tops, and Reinhard Koch. Visual modeling with a hand-held camera. Int. J. Comput. Vision, 59(3):207–232, 2004.
[2] Guofeng Zhang, Jiaya Jia, Tien-Tsin Wong, and Hujun Bao. Consistent depth maps recovery from a video sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:974–988, 2009.
[3] Tao Yang, Stan Z. Li, Quan Pan, and Jing Li. Real-time multiple objects tracking with occlusion handling in dynamic scenes. In CVPR '05, pages 970–975, Washington, DC, USA, 2005. IEEE Computer Society.
[4] Tao Zhao and Ram Nevatia. Tracking multiple humans in complex situations. IEEE Trans. Pattern Anal. Mach. Intell., 26:1208–1221, September 2004.
[5] Sohaib Khan and Mubarak Shah. Tracking people in presence of occlusion. In Asian Conference on Computer Vision, pages 1132–1137, 2000.

[6] Aurélie Bugeau and Patrick Pérez. Track and cut: simultaneous tracking and segmentation of multiple objects with graph cuts. J. Image Video Process., pages 1–14, 2008.
[7] Chaohui Wang, Martin de La Gorce, and Nikos Paragios. Segmentation, ordering and multi-object tracking using graphical models. In ICCV '09: IEEE International Conference on Computer Vision, 2009.
[8] R. Rosales and S. Sclaroff. 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, page 123, 1999.
[9] D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical Review E, 1995.
[10] Dirk Helbing, Lubos Buzna, and Anders Johansson. Self-organized pedestrian crowd dynamics: Experiments, simulations, and design solutions. Transportation Science, 39:1–24, February 2005.
[11] Nuria Pelechano and Jan M. Allbeck. Controlling individual agents in high-density crowd simulation. In Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '07, pages 99–108, Aire-la-Ville, Switzerland, 2007. Eurographics Association.
[12] Adrien Treuille, Seth Cooper, and Zoran Popović. Continuum crowds. ACM Transactions on Graphics, 25(3):1160–1168, July 2006.
[13] Craig Reynolds. Steering behaviors for autonomous characters. pages 763–782, 1999.
[14] Julien Pettré, Jan Ondřej, Anne-Hélène Olivier, Armel Crétual, and Stéphane Donikian. Experiment-based modeling, simulation and validation of interactions between virtual walkers. In Eurographics/ACM SIGGRAPH Symposium on Computer Animation, 2009.
[15] Jur van den Berg, Ming C. Lin, and Dinesh Manocha. Reciprocal velocity obstacles for real-time multi-agent navigation. In IEEE International Conference on Robotics and Automation, pages 1928–1935. IEEE, 2008.
[16] Arunachalam Somasundaram and Rick Parent. Inserting synthetic characters into live-action scenes of multiple people. In CASA '03, page 137, Washington, DC, USA, 2003. IEEE Computer Society.
[17] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[18] Martin Gérin-Lajoie and Carol Richards. The negotiation of stationary and moving obstructions during walking: anticipatory locomotor adaptations and preservation of personal space. Motor Control, 3:342–69, 2005.
