Fast Forensic Video Event Retrieval Using Geospatial Computing

Hongli Deng, Mun Wai Lee, Asaad Hakeem, Omar Javed, Weihong Yin, Li Yu, Andrew Scanlon, Zeeshan Rasheed, Niels Haering
ObjectVideo, Inc., Reston, Virginia

{hdeng, mlee, ahakeem, ojaved, wyin, liyu, ascanlon, zrasheed, nhaering}@objectvideo.com

ABSTRACT
This paper presents a fast forensic video event analysis and retrieval system built on a geospatial framework. Starting from tracking targets and analyzing video streams from distributed camera networks, the system generates video tracking metadata for each video, then maps and fuses the metadata into a uniform geospatial coordinate system. The combined metadata is saved into a spatial database, where target trajectories are represented using the geometry and geography data types. Powered by the spatial functions of the database, various video events, such as crossing a line, entering an area, loitering and meeting, are detected by executing stored procedures that we have implemented. Geographic information system (GIS) data from TIGER/Line1 and GeoNames2 are integrated with the system to provide contextual information for more advanced forensic queries. A semantic data mining subsystem is also attached to generate text descriptions of events and scene contextual information. NASA World Wind3 is the geobrowser used to submit queries and visualize results. The main contribution of this system is that it pioneers video event retrieval using geospatial computing techniques. This interdisciplinary combination makes the system scalable and manageable for large amounts of video data from distributed cameras. It also makes online video search possible by efficiently filtering tremendous amounts of data using geospatial indexing techniques. From the application point of view, it extends the frontier of geospatial applications by presenting a forward-looking application model.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process; H.2.8 [Database Management]: Database Applications—Spatial Databases and GIS; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Video analysis; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Motion

1 www.census.gov/geo/www/tiger
2 www.geonames.org
3 worldwind.arc.nasa.gov/java

Keywords
video analysis, video retrieval, video event search, spatial database, video surveillance

1. INTRODUCTION

1.1 Challenges
State-of-the-art closed circuit television (CCTV)-based video security systems employ networks of cameras to monitor large facilities such as air/sea ports or even large sections of cities. These camera networks generate a tremendous amount of distributed video data. Most video data analysis is currently performed by human operators. While people are good at recognizing threatening or suspicious behavior, they are not very good at effectively reviewing large amounts of video. Short attention spans, vulnerability to interruptions or distractions, and a demonstrated difficulty in processing multiple video streams all work against highly effective analysis of video by humans. Thus, there is a dire need for fast, accurate and automatic video analysis capabilities over camera networks. Current automatic video analysis over camera networks faces three main challenges. Firstly, cameras are widely distributed and the number of cameras is large; there is no user-friendly framework for managing them. Secondly, surveillance videos stream continuously and thus generate tremendous amounts of data; storing and organizing these data is extremely hard. Thirdly, analyzing, annotating, and searching these long-duration videos efficiently has not yet been achieved. Our system is designed to address these problems.


1.2 Previous Works

Video analysis and retrieval is a long-standing topic in computer vision. In this system, video analysis refers mainly to object tracking and trajectory-based analysis. Although trajectories are quite simple, they can be analyzed to support many advanced event inferences. There has been a great deal of previous work on trajectory-based video analysis in the computer vision community. Some techniques analyze trajectory patterns in a single video [10, 2, 7]. These methods track objects and analyze activities by clustering

motion tracks and finding deviations. Others perform trajectory analysis over camera networks [9, 13]. These methods infer the topology of camera views, solve the correspondence problem, stitch together trajectories of the same target, and analyze the trajectories in a common coordinate system. Many trajectory-based video retrieval systems [16, 4, 3, 8] have also been developed, but none of these techniques performs video analysis using geospatial computing. There have been few reported video analysis applications in the geospatial domain. Previous work has integrated imagery data with geospatial systems, but mainly for visualization. Google Street View [14] is a map-based street view system that enables users to view and navigate street images through a map interface; it is mainly a visualization system and does not perform image or video understanding. Ay et al. [1] built a geo-referenced video search engine, but their system only correlates video fields of view (FOVs) with GIS locations; it can retrieve and display videos given a map area, but it cannot retrieve videos based on their contents. Another GIS map-based surveillance system [15] was proposed recently. That system manages cameras on a map but only analyzes video events in a single video. In contrast, our system analyzes single videos as well as multiple videos from camera networks in a uniform geospatial framework. Our system is an early venture in employing advanced geocomputing techniques for video analysis and retrieval. This interdisciplinary combination brings substantial benefits. On the one hand, the geospatial framework and computing techniques provide an easy and scalable way to manage, analyze, retrieve and visualize video data. This is very important for building practical systems, especially when the number of cameras and the amount of video data become intractably large. On the other hand, the system presents a new application model for the geospatial application domain.

1.3 A Geospatial-Based Video Forensic Analysis System

Figure 1 illustrates the architecture of the system. There are five major components: the video management subsystem; the video analysis and fusion subsystem; the semantic data generation subsystem; the activity and semantics inference subsystem; and the geobrowser query and visualization subsystem. Video data from camera networks are streamed into the video management subsystem, which stores the video in a customized video database to provide safe storage and quick access. The video analysis and fusion subsystem applies computer vision algorithms to generate video metadata from the same video feeds and fuses the data in world coordinates; the fused metadata are saved into the video metadata database. The semantic data generation subsystem processes the video metadata and generates video events in plain text and semantic formats. The activity and semantics inference subsystem detects events of interest and unusual activities using information from the metadata database. The geobrowser-based query interface and visualization subsystem interprets user-specified threat scenarios as queries, submits the queries to the activity and semantics inference subsystem, displays retrieved targets and events on the map, and extracts the corresponding videos from the video database.

Figure 1: Architecture of the intelligent video analysis system.

2. VIDEO METADATA GENERATION

2.1 Camera Registration
To build the system in a geospatial framework, we first need to geo-register each surveillance camera view. As camera FOVs are usually small, the mapping from camera view to world coordinates can be approximated by a homography transformation [6]. A homography is a projective transformation between two planes: any two images of the same planar surface are related by a homography (assuming a pinhole camera model). In our case, we assume that targets move on the same ground plane in every single camera view. This assumption is valid in most cases, as the effective area covered by a surveillance camera is normally a small planar area. The mapping can be represented as $P = Hp$, where $P$ are ground points in world coordinates and $p$ are ground points in image coordinates. If we rewrite the mapping matrix $H$ in vector form $h = (h_{11}, h_{12}, h_{13}, h_{21}, h_{22}, h_{23}, h_{31}, h_{32}, h_{33})^{\top}$, the equation becomes $Ah = 0$, where $A$ is the matrix in (1):

$$
A = \begin{pmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 X_1 & -y_1 X_1 & -X_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 Y_1 & -y_1 Y_1 & -Y_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_n & y_n & 1 & 0 & 0 & 0 & -x_n X_n & -y_n X_n & -X_n \\
0 & 0 & 0 & x_n & y_n & 1 & -x_n Y_n & -y_n Y_n & -Y_n
\end{pmatrix} \quad (1)
$$

In matrix (1), $x, y$ are the coordinates of points $p$, $X, Y$ are the coordinates of points $P$, and $n$ is the total number of corresponding points. These corresponding points are obtained from mouse clicks on the video frame and on the geobrowser by human operators. Given a number of corresponding points, the least-squares estimate of $h$ is the eigenvector corresponding to the smallest eigenvalue of $A^{\top}A$; this eigenvector can also be obtained directly from the singular value decomposition (SVD) of $A$. Figure 2 shows the mapping between the camera view and the geobrowser view.
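For completeness, the two rows of $A$ contributed by each correspondence follow directly from expanding $P = Hp$ in homogeneous coordinates and clearing the denominator:

$$
X_i = \frac{h_{11}x_i + h_{12}y_i + h_{13}}{h_{31}x_i + h_{32}y_i + h_{33}}, \qquad
Y_i = \frac{h_{21}x_i + h_{22}y_i + h_{23}}{h_{31}x_i + h_{32}y_i + h_{33}} .
$$

Cross-multiplying gives

$$
x_i h_{11} + y_i h_{12} + h_{13} - x_i X_i h_{31} - y_i X_i h_{32} - X_i h_{33} = 0,
$$
$$
x_i h_{21} + y_i h_{22} + h_{23} - x_i Y_i h_{31} - y_i Y_i h_{32} - Y_i h_{33} = 0,
$$

which are exactly the two rows of $A$ in (1) for the $i$-th correspondence; with $n \ge 4$ non-degenerate correspondences the system determines $h$ up to scale.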

2.2 Spatial Database

A critical component of this system is a spatial database. A spatial database is a conventional database enhanced and optimized to store, manipulate, and query geometric and geographic data. The enhancements include spatial data types, spatial functions and spatial indexes. For example, Microsoft SQL Server 2008 is a spatial database with two embedded data types: geometry and geography. In addition, the database provides support for spatial data processing and querying, such as finding the closest point, filtering out objects in an area and calculating the distance between two objects. All these operations can be computed in an optimized manner using the spatial index.

Figure 2: Mapping between camera view and geobrowser view at two sites. Left column: two camera views at Reston, VA (the upper two images) and the locations of the two sites (the lower image). Middle column: the geobrowser views of the two sites. Right column: four camera views at Panama City, FL.
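As a minimal sketch of how such a spatially-enabled table can be declared in SQL Server 2008, the following statements create a hypothetical map tracklet table with a geography column and a spatial index; the table name, column names and grid settings are illustrative assumptions, not the system's actual schema:

    -- Hypothetical map tracklet table (names are illustrative)
    CREATE TABLE MapTracklets (
        TrackletID  INT IDENTITY PRIMARY KEY,
        TargetID    INT        NOT NULL,
        StartTime   DATETIME   NOT NULL,
        Trackline   geography  NOT NULL
    );

    -- Spatial index over the geography column; grid density is just an example
    CREATE SPATIAL INDEX SIdx_MapTracklets_Trackline
        ON MapTracklets (Trackline)
        USING GEOGRAPHY_GRID
        WITH (GRIDS = (MEDIUM, MEDIUM, MEDIUM, MEDIUM), CELLS_PER_OBJECT = 16);

The spatial index is what later allows queries such as STIntersects to prune candidate tracklines without scanning every row.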

2.3 Video Metadata Generation

The video analysis and fusion subsystem employs computer vision algorithms to generate video metadata and stores it in the spatial database. Each individual sensor in the camera network executes a video analysis module that performs background subtraction, foreground blobbing, target tracking and recognition. It also collects useful information such as target size, speed and target snapshot images. We call this information video primitives. Video primitives from multiple sensors are streamed into a central map-based fusion engine. The map fusion engine translates target locations from image coordinates to world coordinates using the mapping described in Section 2.1 and outputs fused map primitives. As our system supports both single-video-based queries and map-based queries, both video primitives and map primitives are saved into tables in the spatial database. Once set up, the whole video processing and metadata extraction pipeline runs automatically without human supervision. The general structure of the generated video metadata is shown in Figure 3. One important data field is the Trackline, which records target tracking trajectories. Tracklines are line segments extracted from the tracking trajectories that approximate the target motion tracks. Tracklines are saved as the geometry data type in the video tracklet table and as the geography data type in the map tracklet table.
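To make the storage format concrete, the following is a minimal sketch of how one fused map primitive could be written into such a table using Well-Known Text; the MapTracklets table from the sketch above and the literal coordinates are illustrative assumptions, not the system's actual schema or data:

    -- Illustrative insert of one trackline as a geography LINESTRING (WKT, SRID 4326)
    INSERT INTO MapTracklets (TargetID, StartTime, Trackline)
    VALUES (
        42,
        '2010-01-01T11:30:00',
        geography::STGeomFromText(
            'LINESTRING(-77.3570 38.9580, -77.3568 38.9582, -77.3565 38.9585)', 4326)
    );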

3. VIDEO EVENT INFERENCES

3.1 A Simple Example

Figure 3: Video primitives and map primitives.

We use a simple example to illustrate the idea of event inference using video metadata and the spatial database. Given a video (see the left image of Figure 4) and a user query, find any object crossing the red virtual tripwire (the red line), the SQL query in Figure 5 is executed to retrieve all tracklines that intersect the tripwire. In this example, the tripwire @g is represented as a geometry value in Well-Known Text (WKT) format, and STIntersects is the spatial function that computes intersections of geometric objects. The results of the SQL execution are shown in Figure 6: six tracklines intersect the tripwire, indicating that six objects crossed this line. In this example, the inference is performed on a single video. Inference on the map is similar, replacing the geometry data type with the geography data type, and video tracklines with map tracklines.

Figure 4: Left: Tripwire (the red line) in a video frame. Right: Video tracklines.

    use VideoEventDB
    DECLARE @g geometry
    SET @g = geometry::STGeomFromText(
        'MULTILINESTRING((0.181 0.417, 0.781 0.375))', 0);
    SELECT StartTime, TargetID, geom, geom.STIntersects(@g) as event
    FROM Primitives
    ORDER BY event DESC

Figure 5: SQL for calculating tripwire crossings.
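As a sketch of the map-based variant mentioned above, the same query could be expressed against the fused map primitives using the geography type; the MapTracklets table name, the SRID and the coordinates are illustrative assumptions:

    -- Map-based tripwire: same idea with geography instead of geometry
    DECLARE @tripwire geography
    SET @tripwire = geography::STGeomFromText(
        'LINESTRING(-77.3572 38.9579, -77.3560 38.9586)', 4326);
    SELECT StartTime, TargetID, Trackline,
           Trackline.STIntersects(@tripwire) AS event
    FROM MapTracklets
    ORDER BY event DESC;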

Target Property:     time, classification, speed, physical size, color feature
Target Actions:      tripwire, enter, exit, appear, disappear, loiter, contain, distance
Target Interactions: meet, follow, approach
Semantics:           text search, semantic query

Table 1: Categories and examples of video event queries.

Figure 6: Tripwire crossing results from the SQL execution. Numbers indicate the corresponding tracklines and events.

3.2 Various User-Defined Events

To provide a comprehensive video event search system, we implemented four categories of queries (see Table 1) based on the needs of surveillance and forensic analysis. The first category is target property-based search. This includes time, speed, color, 3D physical size, feature vector and classification; these data are generated by the ObjectVideo analysis system embedded in each distributed camera. The second category is target action-based search. The system detects target actions such as crossing lines, entering or exiting areas of interest (AOIs), appearing in and disappearing from AOIs, loitering within AOIs, and being contained in, or within a certain distance of, geographical objects. The third category is target interaction-based search. This includes finding targets meeting in AOIs, one target following another target, and one target approaching another target. The fourth category is semantics-based search, in which users can search for events via keyword search on the generated plain text descriptions of events, or via semantic search using the XML query language (XQuery) or the SPARQL protocol and RDF query language (SPARQL). We call all these user-defined queries discriminators: target property discriminators, target action discriminators, target interaction discriminators and semantic discriminators. Some example queries formed by the four discriminators are listed below:

• Finding a red vehicle entering an AOI.
• Finding a person exiting an AOI at a speed of more than 16 feet per second.
• Listing all targets longer than 10 feet on Sunrise Valley Dr.
• Finding three persons meeting for more than 1 minute at a street intersection.
• Finding all targets approaching a specific vehicle.
• Finding all vehicles turning right onto Freedom Dr. between 11:30:00 and 11:50:00 on 2010/01/01.
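To illustrate how a property discriminator could feed into such queries, here is a minimal sketch of a target property filter in SQL; the Targets table and its Classification, Color and Speed columns are hypothetical stand-ins for the metadata fields listed in Table 1, not the system's actual schema:

    -- Hypothetical property filter: red vehicles faster than 16 ft/s in a time window
    SELECT TargetID, StartTime, Classification, Color, Speed
    FROM Targets
    WHERE Classification = 'vehicle'
      AND Color = 'red'
      AND Speed > 16
      AND StartTime BETWEEN '2010-01-01T11:30:00' AND '2010-01-01T11:50:00';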

3.3 Implementation of Event Query

Implementation of the target property discriminator is straightforward. As all target properties are stored as value types in the database, a normal SQL query on the Target table returns the filtered targets. Implementation of the action discriminators follows the idea of the simple example in Section 3.1, but different user-defined geographical objects (points, lines, polygons) and spatial functions are applied, and some extra non-SQL post-processing steps are needed. The action discriminators are implemented as follows (a SQL sketch of the Enter discriminator appears after this list):

• Tripwire computes intersections between Tracklines (lines) and Tripwires (lines). This operation generates potential candidates for all line crossings. Crossing directions can be determined by computing the cross product between Tracklines and Tripwires.

• Enter, Exit computes intersections between Tracklines and the boundary lines of the AOI polygon. This operation generates both Enter and Exit candidates. The final results are obtained by checking whether the ending point of a Trackline is within the AOI polygon: if it is, the event is an Enter; otherwise, it is an Exit.

All these operations are written as SQL stored procedures, which run faster and can be called by multiple users across the network. Figure 7 shows a Tripwire and an Enter AOI drawn on the map by users.

• Contain is a basic discriminator used by many other discriminators. It computes intersections of Tracklines with the solid AOI polygon. This operation quickly prunes a large number of unwanted Tracklines and returns only the Tracklines that are within the AOI.

• Appear, Disappear starts by running the Contain discriminator, then checks whether the first Trackline of each target is inside the AOI. If it is, the event is an Appear. Disappear is calculated in the same way by checking whether the last Trackline is within the AOI.

• Loiter starts with the Contain discriminator, then checks whether the time lapse between the first Trackline and the last Trackline is greater than a user-defined threshold (the loiter time). If it is, the target event is reported.

• Distance retrieves all Tracklines that are within a user-defined distance of geographical objects. These geographical objects can be points, lines or polygons defined by users or returned by GIS queries.

Implementation of the interaction discriminators involves more computation than the action discriminators.

• Meet starts by running the Contain discriminator to retrieve all targets within the user-defined meeting area. The time span of these targets is then divided into small sequential time slices. Every time slice is checked to see whether the distances between targets are smaller than the predefined meeting threshold (say, 10 feet) or whether the number of targets is greater than the user-defined threshold (say, more than three persons). Meeting duration is also checked by counting the total number of consecutive meeting slices.

• Approach is similar to Meet, but unlike Meet it is a many-to-one rather than a many-to-many target relation: there is a central target specified by the user, and all other targets Meet (Approach) this central target.

• Follow performs shape matching between Tracklines with time constraints.

Implementation of the semantics discriminator is composed of plain text queries and semantic queries. The plain text queries are supported by the MSSQL server, which performs a fuzzy search to find relevant text while allowing some variation in the search terms; with full-text indexing, the retrieval is very efficient. Semantic XQuery and SPARQL queries are executed on VEML [12] or web ontology language and resource description framework (OWL/RDF) representations of video events. The generation of the plain text, VEML and OWL/RDF representations of video events is described in Section 5.
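The following is a minimal sketch of how the Enter discriminator described above could be expressed as a stored procedure over the single-video tracklet data; the Primitives table and geom column follow the names shown in Figure 5, the AOI is passed as a polygon in WKT, and the procedure name is hypothetical. This is an assumed illustration, not the authors' actual stored procedure:

    -- Assumed sketch of an Enter discriminator (not the authors' implementation).
    -- @aoi_wkt is the AOI polygon in WKT, in the same normalized image
    -- coordinates used by the Primitives table of Figure 5.
    CREATE PROCEDURE DetectEnterEvents
        @aoi_wkt NVARCHAR(MAX)
    AS
    BEGIN
        DECLARE @aoi geometry = geometry::STGeomFromText(@aoi_wkt, 0);

        SELECT StartTime, TargetID, geom
        FROM Primitives
        WHERE geom.STIntersects(@aoi.STBoundary()) = 1              -- trackline crosses the AOI boundary
          AND geom.STPointN(geom.STNumPoints()).STWithin(@aoi) = 1; -- ends inside the AOI => Enter
    END;

An Exit query would invert the final check (= 0 instead of = 1), matching the rule given in the Enter, Exit bullet above.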

Figure 7: Tripwire and Enter discriminators on the map. Red area: Enter AOI; red line: Tripwire; blue lines: Tracklines of targets; yellow areas: camera FOVs.

3.4 System Scalability

Scalability is critical for handling tremendous amounts of video data and providing online search speed. This system achieves scalability in two ways. Firstly, video metadata provide a very concise representation of bulky video content: in our experiments, it takes roughly 12 Megabytes of database space to store the metadata for a 1 Gigabyte MPEG-4 video lasting approximately 1 hour. This well-organized and information-rich metadata greatly reduces the data volume. Secondly, geospatial indexing provides strong support for fast data pruning and retrieval. The system takes advantage of the spatial indexing techniques provided by the spatial database to make online querying possible. Spatial indexing has been widely used in geospatial computing; it is a proven technique and is already embedded in the spatial database. From the system design point of view this is very important: fully taking advantage of the index in the spatial database simplifies development, improves the reliability of the system and achieves fast querying. We ran an experiment querying a subset of geography objects over 89,296 line objects with and without a spatial index. The speed difference is listed in Table 2.

              With Index   Without Index   Speed Gain (X)
    Speed (s)   0.84866      26.4256         31.14

Table 2: Speed gain from using the spatial index.

We also noticed that the spatial index has a different impact on speed for different data types and spatial functions. If a function involves only index-based calculations, meaning the results can be obtained directly by checking the index, it shows the same, fastest speed for all data types, but the results may be nondeterministic and inaccurate because the space can only be indexed to a finite resolution. In other circumstances, indexing is used only to prune data, and further calculations follow to obtain accurate results; in that case different data types show different speed gains. We tested this on 1 million geographical Point objects and 90K Line objects. The speed difference is shown in Table 3.

              Lines   Points
    Speed (s)   1.75    0.06

Table 3: Speed difference when querying different data types.

From Table 3 we can see that retrieving points is faster than retrieving lines. This suggests that the point calculation is based on indexing only, while the line calculation involves other functions. To achieve scalability when running event inference queries, all these conditions must be taken into account. For example, to find tracklets that are within a given distance of a point, the Distance function should not be used; instead, we form a circular area around the point with that distance as its radius. This transforms the problem from a distance calculation into an intersection calculation, which is much faster. Query speed is discussed further in Section 7.
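A minimal sketch of this rewrite in SQL Server spatial syntax is shown below; the MapTracklets table, column names, buffer radius and coordinates follow the earlier illustrative schema and are assumptions rather than the system's actual implementation:

    -- Distance form (the slower formulation discussed above)
    DECLARE @pt geography = geography::Point(38.9583, -77.3566, 4326);
    SELECT TargetID, StartTime
    FROM MapTracklets
    WHERE Trackline.STDistance(@pt) <= 100;   -- meters

    -- Faster form: buffer the point once, then let the spatial index
    -- drive an intersection test against the resulting circular area
    DECLARE @area geography = @pt.STBuffer(100);
    SELECT TargetID, StartTime
    FROM MapTracklets
    WHERE Trackline.STIntersects(@area) = 1;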

4. GIS-ENABLED QUERIES

The user experience of this system is greatly improved by integrating GIS information. GIS data provide relevant geographical context that is useful for scene understanding in the map configuration. For instance, the movement of land vehicles can be associated with the streets on which the vehicles are traveling, anybody who loiters on a specific road can be retrieved, people who enter any hospital in a city can be found, and any meeting of people at a street intersection can be detected. These GIS-based queries enable users to search for video events based on geographical content. We integrate the U.S. Census Bureau's TIGER/Line and the GeoNames database into the system. TIGER/Line describes geographic features such as roads, rivers, lakes, and administrative boundaries. GeoNames is a large dataset containing over 6 million geographical features around the world, with the name, alternative names, location, elevation, and administrative subdivision of each feature. In our system, software modules parse the above-mentioned GIS datasets and insert them into the MSSQL database. During querying, a user submits a GIS query; the returned geographical entities become areas of interest that are then used to run discriminators. The geographical objects returned by the GIS system are sometimes not directly suitable for running action or interaction discriminators. For example, querying a road name in TIGER/Line returns a sequence of lines, but only the Tripwire and Distance discriminators can run on lines; querying GeoNames returns points, but only the Distance discriminator can run on points. To give GIS-enabled queries a wider application domain, we introduce a Buffer discriminator that defines an area around a geo-object. The Buffer discriminator generates buffering areas (polygons) around geographical objects at a predefined distance. If the input object is a point, the buffered area is a circle with the distance as its radius. If the input object is a line, the buffered area is a polygon around the line such that every point within the polygon is at most the given distance from the line. As TIGER/Line normally returns connected line sequences, after applying the Buffer discriminator to each line a polygon Union operation is needed to form the whole area of interest. After these operations, all discriminators described in Section 3.2 can be executed on GIS query results. Examples of GIS-based queries are shown in Figure 8.

Figure 8: Results of GIS queries. A: Buffer area around a road. B: Buffered area of a street intersection. C: Hospitals in the view. D: High schools in the view. E: Camera views.
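As a rough sketch of the buffer-and-union step, assuming the segments of a queried street have already been loaded into a hypothetical TigerRoads table with a geography Segment column and a FullName attribute, the buffered AOI could be accumulated with STBuffer and STUnion. This illustrates the operation described above and is not the system's actual module:

    -- Accumulate a buffered AOI around all segments of one road (illustrative)
    DECLARE @aoi geography = NULL;
    SELECT @aoi = CASE WHEN @aoi IS NULL
                       THEN Segment.STBuffer(30)              -- 30 m buffer per segment
                       ELSE @aoi.STUnion(Segment.STBuffer(30))
                  END
    FROM TigerRoads
    WHERE FullName = 'Sunrise Valley Dr';

    -- @aoi can now be handed to discriminators such as Contain, Loiter or Meet
    SELECT TargetID, StartTime
    FROM MapTracklets
    WHERE Trackline.STIntersects(@aoi) = 1;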

5. SEMANTICS ENHANCEMENTS

To enable text and semantic search, video content is represented using three types of semantic representation: (i) the XML-based Video Event Markup Language (VEML) used by the computer vision community [12], (ii) the OWL/RDF representation used in the Semantic Web, and (iii) plain text descriptions. The first two types are inter-convertible; they present the same information in different formats to enable interoperability with other tools and to support both XQuery and SPARQL queries. The plain text description is generated from the VEML data using a text generation engine [11] and presents a natural language report of the visual events. This also allows text-based search using existing full-text search engines and document retrieval tools. In addition, the system correlates and integrates information from other sources in the Semantic Web and GIS datasets, providing enhanced contextual information about video content. With these representations, the system provides query tools for users to search and retrieve video content, supporting several query languages: SQL, XQuery, SPARQL, and natural language queries [5]. Some examples of generated text for video events in urban and maritime scenes are shown in Figure 9.
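For the plain-text path, a keyword query could be served directly by SQL Server's full-text search, as in the following sketch; the EventDescriptions table and its Description column are hypothetical, and a full-text index on that column is assumed to exist:

    -- Hypothetical full-text keyword search over generated event descriptions
    SELECT EventID, Description
    FROM EventDescriptions
    WHERE CONTAINS(Description, 'boat NEAR follows');

    -- A fuzzier variant using FREETEXT, which matches on word forms and meaning
    SELECT EventID, Description
    FROM EventDescriptions
    WHERE FREETEXT(Description, 'vehicle turning right');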

6. GEOBROWSER QUERY AND VISUALIZATION SUBSYSTEM

This system uses the open source 3D interactive world viewer NASA World Wind as its user interface. A querying interface is developed as a plug-in for World Wind; it is shown in Figure 10. The user query process is as follows:

1. The user opens World Wind and loads the query interface plug-in.

2. The user inputs three kinds of information: target properties, event locations, and event actions or interactions. Target properties and action types are specified directly in the interface. Event locations are drawn as points, lines or polygons on the World Wind map or generated by GIS queries.

3. The user interface wraps this information into discriminators and sends them to the server.

4. Retrieval results are passed back to the interface and shown in the result list. When a result is chosen, a video window pops up and the result is played in the video and animated on the map.

Figure 10: User interface of the system. Left: the query interface plug-in. Right: the NASA World Wind geobrowser.

Figure 9: Samples of generated text and corresponding video snapshots. Examples: "Boat 4 follows Boat 3 between 35:36 and 37:23." "Boat 7 turns right at 55:00." "Land vehicle 359 approaches intersection along Fountain Dr. at 57:27. It stops at 57:29. Land vehicle 360 approaches intersection along Freedom Dr. at 57:31. Land vehicle 359 enters intersection at 57:35. It turns right at 57:39. It leaves intersection at 57:36. It exits the scene along Freedom Dr. at 57:18."

7. RESULT COMPARISON

This system illustrates an emerging geospatial application. To our knowledge, there is no similar work integrating geocomputing into video surveillance applications, so it is hard to find previous academic methods or similar applications to compare against. We therefore compare it with our previous system, which does not use geospatial computing techniques. The comparison covers two aspects: speed gain and event inference accuracy. In our old system, video metadata is not indexed, so events are inferred sequentially; in addition, video metadata are saved in individual files, which take time to load into memory when running a query. The speed comparison on a series of Tripwire queries is shown in Table 4. The new system is 371 times faster than the old system.

    Video Length (hrs)   Number of Targets   Old system (secs)   New system (secs)   Speed gain (X)
    19                   3243                1583.5              4.27                371

Table 4: Speed gain on tripwire event queries.

The Tripwire detection accuracy is shown in Table 5. It can be seen that the detection accuracy is the same for the two systems.

    Video Length (hrs)   Number of Targets   Number of events   Accuracy of old system   Accuracy of new system
    19                   3243                1967               96%                      96%

Table 5: Accuracy of tripwire event queries.

8. CONCLUSION AND FUTURE WORKS

This paper describes a new video event analysis and retrieval system based on geospatial computing techniques. By transforming and fusing video analysis and tracking results into map coordinates and saving them in a spatial database, various user-defined video event queries can be executed directly in the geospatial framework through a geobrowser. The spatial database gives the system a stable, fast and easily manageable platform, which is essential for managing large amounts of video data. The spatial index provided by the database makes online querying possible by quickly pruning unrelated data and working only on the data of interest. This system leads the way in introducing geospatial computing into the video retrieval domain and, at the same time, presents a forward-looking but practical application model for the geospatial application domain. We are currently working on several advanced capabilities for this system: image-based target retrieval based on user feedback, automatic calibration of distributed cameras, customized feature vector indexes in the spatial database, and a new web-based user interface. We expect this system to be deployed in surveillance systems in the near future.

9. ACKNOWLEDGMENTS

This work was supported, in part, by the Office of Naval Research under Contracts #N00014-07-M-0287, #N00014-08-C-0639, and #N00014-09-C-0688. We sincerely express our gratitude to our sponsor. We thank NASA for providing the wonderful open source geobrowser World Wind. We also thank the U.S. Census Bureau and various public sources for providing the TIGER/Line and GeoNames GIS data.

10. REFERENCES

[1] S. Ay, L. Zhang, S. Kim, M. He, and R. Zimmermann. GRVS: A georeferenced video search engine. In Proceedings of the 17th ACM International Conference on Multimedia, pages 977–978. ACM, 2009.
[2] C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747–757, Aug. 2000.
[3] M. Campbell, A. Haubold, S. Ebadollahi, M. Naphade, A. Natsev, J. Seidl, J. Smith, J. Tešić, and L. Xie. IBM Research TRECVID-2006 video retrieval system. In NIST TRECVID-2006 Workshop, 2006.
[4] S. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong. A fully automated content-based video search engine supporting spatiotemporal queries. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):602–615, Sept. 1998.
[5] A. Hakeem, M. W. Lee, O. Javed, and N. Haering. Semantic video search using natural language queries. In ACM International Conference on Multimedia. ACM, 2009.
[6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, England, 2003.
[7] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank. A system for learning statistical motion patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1450–1464, Sept. 2006.
[8] W. Hu, D. Xie, Z. Fu, W. Zeng, and S. Maybank. Semantic-based surveillance video retrieval. IEEE Transactions on Image Processing, 16(4):1168–1181, Apr. 2007.
[9] O. Javed, Z. Rasheed, K. Shafique, and M. Shah. Tracking across multiple cameras with disjoint views. In Proceedings of the IEEE International Conference on Computer Vision, pages 1207–1216. IEEE, 2003.
[10] N. Johnson and D. Hogg. Learning the distribution of object trajectories for event recognition. In Proceedings of the 6th British Conference on Machine Vision, pages 583–592. BMVA Press, 1995.
[11] M. W. Lee, A. Hakeem, N. Haering, and S.-C. Zhu. SAVE: A framework for semantic annotation of visual events. In First Workshop on Internet Vision. IEEE, 2008.
[12] R. Nevatia, J. Hobbs, and B. Bolles. An ontology for video event representation. In Workshop on Event Detection and Recognition. IEEE, 2004.
[13] Y. Sheikh and M. Shah. Trajectory association across multiple airborne cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):361–367, Feb. 2008.
[14] L. Vincent. Taking online maps down to street level. Computer, 40(12):118–120, Dec. 2007.
[15] X. Xiong, B. Wang, and D. Wang. Research of event-based emergency video surveillance system. In Proceedings of the 2009 International Workshop on Information Security and Application. Academy Publisher, Finland, 2009.
[16] H. Zhang, J. Wu, D. Zhong, and S. W. Smoliar. An integrated system for content-based video retrieval and browsing. Pattern Recognition, 30(4):643–658, Apr. 1997.
