Informedia Digital Video Library:  Digital video library research at Carnegie Mellon School of Computer Science
  Carnegie Mellon University
  School of Computer Science
  5000 Forbes Avenue
  Pittsburgh, PA 15213

Informedia Video Information Summarization & Demonstration Testbed
About VACE I   |  Ongoing Project Info   |  VACE I Reports  ||   VACE II   |  Sponsor  
Howard Wactlar
Mike Christel, Alex Hauptmann, Jianbo Shi
Advanced Research and Development Activity (ARDA)
Video Analysis and Content Exploitation
September 2000 - December 2002

Project Description

The Informedia project is advancing the automatic generation of video summaries over very large archives of video segments, building on the Informedia Project's processing and integration infrastructure. We intend to extend and modularize the underlying Informedia processing, query, and display infrastructure with standardized interfaces to enable its use as a demonstration and testbed vehicle for the work of other researchers in video understanding component technology. This proposal addresses the ARDA desired target capabilities for:

  • Fully automating the indexing of video streams based on information content in image, audio and textual components.
  • Developing cross-media processing techniques to extract information using all components of a video stream.

In particular, this research will:

  • Provide unified infrastructure for integration and demonstration of
    - Object detection, recognition and tracking
    - Event understanding, query-by-example, multi-modal fusion
  • Develop new capabilities for video summaries and multimodal video mining

We will develop a new prototype video analysis system that automatically processes video data, indexes the extracted data, and provides mechanisms for search and retrieval. The system will include the current CMU face and text detection and recognition capabilities already under development in the Informedia Project, as well as CMU's Sphinx speech recognition system. Additional image processing will determine shot boundaries and allow for image similarity comparisons. Combining features from text, speech, and image analysis will enhance both the performance and the quality of the video metadata extraction processes, compared to processing each modality in isolation. All derived metadata will be indexed in support of more efficient query interfaces. We will initially populate the VACE digital video library with automatically processed data from broadcast news sources such as CNN, and will start with MPEG-1 video formats, with the goal of later handling other formats, including MPEG-2 video data streams.
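The shot-boundary step mentioned above is commonly implemented by comparing intensity or color histograms of adjacent frames; a large distance between consecutive histograms suggests a cut. The sketch below illustrates that general technique only — the frame representation, bin count, and threshold are illustrative assumptions, not the Informedia implementation:

```python
# Sketch of histogram-difference shot-boundary detection.
# A "frame" here is simply a flat list of pixel intensities.

def histogram(frame, bins=8, max_val=256):
    """Normalized intensity histogram of a frame."""
    counts = [0] * bins
    for px in frame:
        counts[px * bins // max_val] += 1
    total = len(frame)
    return [c / total for c in counts]

def detect_shot_boundaries(frames, threshold=0.5):
    """Return indices where a new shot starts, using L1 histogram distance."""
    boundaries = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        dist = sum(abs(a - b) for a, b in zip(prev, cur))
        if dist > threshold:   # abrupt histogram change => likely cut
            boundaries.append(i)
        prev = cur
    return boundaries

# Two synthetic "shots": dark frames followed by bright frames.
dark = [10] * 100
bright = [240] * 100
print(detect_shot_boundaries([dark, dark, dark, bright, bright]))  # → [3]
```

Real systems must also distinguish gradual transitions (fades, dissolves) from abrupt cuts, which a single-threshold scheme like this one handles poorly.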

Figure 1: Video summarizer metadata extraction.

This system architecture will be modularized to support tailorability: the video processing modules for shot detection, text detection/recognition, and face detection/recognition can be replaced with components developed at CMU or other research organizations. The system will also allow for the insertion of new vehicle and object detection and recognition modules, based on a standardized interface definition. The metadata extracted by the new modules will be automatically searchable in the system. This metadata will also be time-aligned to the processed video, enabling it to be used in conjunction with other synchronized metadata for building efficient, effective interfaces to the video collections.
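A standardized module interface of the kind described could take roughly the following shape. This is a speculative sketch; the names (VideoModule, MetadataItem) and fields are illustrative assumptions, not the actual Informedia interface definition:

```python
# Hypothetical pluggable processing-module contract: every module
# emits metadata time-aligned to the source video, so outputs from
# different detectors can be merged onto one timeline and indexed.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

@dataclass
class MetadataItem:
    start_sec: float   # alignment to the video timeline
    end_sec: float
    kind: str          # e.g. "shot", "face", "text", "vehicle"
    value: str

class VideoModule(ABC):
    """A detector/recognizer plugs in by implementing process()."""
    @abstractmethod
    def process(self, video_path: str) -> List[MetadataItem]: ...

class DummyShotDetector(VideoModule):
    def process(self, video_path: str) -> List[MetadataItem]:
        # A real module would analyze the video; this stub shows the contract.
        return [MetadataItem(0.0, 4.2, "shot", "shot-1")]

def merge_timelines(modules, video_path):
    """Combine every module's output into one time-ordered metadata track."""
    items = [it for m in modules for it in m.process(video_path)]
    return sorted(items, key=lambda it: it.start_sec)
```

Because each item carries its own start/end times, swapping one detector for another leaves the downstream indexing and interface code unchanged, which is the point of the standardized interface.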

Figure 2: Conceptual view of the processing architecture.

Beyond providing a plug-and-play framework for other processing modules, the research will also develop video summaries in the form of "collages", using the metadata generated from the modules along with any available collateral data such as manually generated transcripts and closed-captioned text. Video information collages will be built by advancing information visualization research to effectively deal with multiple video documents. A video information collage is a presentation of text, images, audio, and video derived from multiple video sources in order to summarize, provide context, and communicate aspects of the content from the originating set of sources. The collages to be investigated include chrono-collages emphasizing time sequences, geo-collages emphasizing spatial relationships, and auto-documentaries, which preserve the video's temporal nature. Users will be able to interact with the video collages to generate multimodal queries across time, space, and sources. Video collages can be made adaptive by giving preference to the concepts and query terms in the user's interaction history. The synthesis and summarization functions underlying these collages will be made possible through extensions of text clustering and Expectation-Maximization algorithms to video and audio features.
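As a reminder of what the Expectation-Maximization machinery mentioned above does in its simplest form, the sketch below fits a two-component 1-D Gaussian mixture, the textbook case that the collage synthesis would extend to higher-dimensional video and audio features. The initialization and iteration count are illustrative assumptions:

```python
# Minimal EM for a two-component 1-D Gaussian mixture.
import math

def em_two_gaussians(xs, iters=50):
    mu = [min(xs), max(xs)]          # crude initialization at the extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate parameters from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(1e-6, sum(r[k] * (x - mu[k]) ** 2
                                   for r, x in zip(resp, xs)) / nk)
            pi[k] = nk / len(xs)
    return mu, var, pi

# Points clustered near 0 and near 10 should yield means near 0 and 10.
data = [-0.5, 0.0, 0.4, 9.6, 10.0, 10.3]
mu, var, pi = em_two_gaussians(data)
```

Extending this to video means replacing the scalar observations with multimodal feature vectors (image, audio, and text features per segment), but the E-step/M-step alternation is the same.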

We will examine the effects of metadata quantity and quality on the generation and use of video information collages. We will explore the importance of particular processing modules on collages for a video genre, e.g., face detection for chrono-collages of news broadcasts. Through successful incorporation of key processing modules, video information collages can be constructed so that users can efficiently access large video collections and assimilate information relevant to their needs. The project's anticipated advances in processing architecture and video summaries complement the research of others focusing directly on the video analysis domain.

The system architecture proposed will allow video to be processed and indexed, and the resulting derived and extracted data to be searched and compared. The modular design of the database and the processing modules will simplify the exchange of video content extraction modules and the addition of specialized processing components for other video analysis and object extraction. The system will initially provide functionality such as keyframe extraction, shot-break detection, text detection and recognition, and face detection and recognition, although at a limited level of accuracy. The infrastructure reconstruction and modularization of the Informedia architecture is a two-year effort under proposed funding levels.

The proposed video summarization research will change the paradigm for accessing digital video archives so that users can explore meaningful, manipulable overviews of video document sets, issue true multimodal queries, and be aided by adaptive summarizations of very large amounts of video. Our work will enable users to more quickly interpret and assimilate information relevant to their needs via automatic, intelligent synthesis of different video sources. The summarization research proposed here initiates the early stages of a five to six year effort, leveraging related work of the broader Digital Library research program.

Copyright 1994-2002 Carnegie Mellon and its licensors. All rights reserved.