Informedia Digital Video Library:  Digital video library research at Carnegie Mellon School of Computer Science

Informedia Contexture: Analyzing and Synthesizing Video and Verbal Context for Intelligence Analysis Dialogues

PI: Howard Wactlar
Co-PIs: Alex Hauptmann, Mike Christel, Mark Derthick, Dorbin Ng, Scott Stevens
Sponsor: Advanced Research and Development Activity (ARDA) AQUAINT Program
Period: April 2004 - March 2006

Project Description

This project will develop tools and techniques to automatically process and extract evidence from multimedia source content to understand questions, find answers, and organize and present the answers as contextures supporting the analysts' activities. Contexture, for the purposes of this research, is the weaving or assembling of multimedia parts into a coherent whole in order to provide a more complete "picture" or information structure for both questions and answers. We emphasize in this AQUAINT Phase 2 effort the need to interpret and communicate an associated visual and verbal context to information, which may surround or accompany a question, an answer or a follow-up, to throw light on its meaning or significance, or explain its circumstances. The major innovation of the proposed research is the automatic delivery of contextures serving as context-rich presentations that organize and summarize answers, while encouraging and facilitating review and refinement of the answer space through analyst interaction. Such contextures function more like a questioning expert human colleague than an illustrated encyclopedic resource.

We continue to pursue and resolve the challenge of how to query unstructured video and natural language information sources for purposes of intelligence gathering. In so doing, we will accelerate discovery by both system and analyst, thereby enabling a more robust and relevant sequence of follow-on questions and responses. We focus in this phase on the need (1) to understand the analysts' context when they seek information, (2) to automatically extract and interpret context from multimedia content, revealing aspects from bias to time and venue, and (3) to deliver results with additional context, in words, images, audio, and motion video. Understanding and delivering information within a context, both verbal and visual, is what engages the analyst and catalyzes the analyst-to-system dialogue to effectively explore a problem-solving scenario.

We illustrated the utility of video source material and visual answers from CNN video as part of our AQUAINT Phase 1 work. For Phase 2, we consider video content ranging from produced domestic (English) and foreign (Chinese Mandarin) news broadcasts, to correspondents' B-rolls of unedited source material, down to field-captured video from training camps to combat zones. We will leverage the very rich, customizable visual displays of composite information sources that we developed in AQUAINT Phase 1 and other programs sponsoring Carnegie Mellon Informedia research. Several innovations are proposed to develop and deliver Informedia contextures supporting advanced question answering activities by the skilled professional analyst. Figure 1 shows a conceptual overview of the proposed work.

We start with a dramatic leap from Informedia digital video extraction work to video interpretation research as we address the goal of understanding video perspectives: expressed and subtle opinions and attitudes in video information from varied sources and even across multiple languages. We apply Informedia multi-modal processing to extract low-level syntactic features and a textual narrative derived from speech recognition. We introduce new machine understanding of textual and visual rhetoric, taking advantage of the redundancy in the aural and visual presentations to overcome shortfalls in understanding the derived text alone.
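
As a rough illustration of this multi-modal processing, the sketch below pairs an ASR-derived transcript with a simple per-shot visual feature and uses their agreement to flag on-camera opinion statements; every class, field and cue list here is a hypothetical stand-in for illustration, not Informedia code.

    # Hypothetical sketch: exploit redundancy between the aural and visual
    # streams by pairing speech-recognition text with a per-shot face count.
    # Names, cues and thresholds are illustrative assumptions only.
    from dataclasses import dataclass

    @dataclass
    class Shot:
        start_sec: float
        end_sec: float
        transcript: str     # text derived from speech recognition
        face_count: int     # faces found by a visual face detector

    OPINION_CUES = ("i think", "in my view", "we believe", "i am convinced")

    def on_camera_opinion(shot: Shot) -> bool:
        """Flag shots where textual and visual evidence agree that someone is
        voicing an opinion on camera (a crude stand-in for rhetoric analysis)."""
        textual = any(cue in shot.transcript.lower() for cue in OPINION_CUES)
        visual = shot.face_count >= 1
        return textual and visual

    shots = [Shot(0.0, 8.5, "I think the talks will resume next week", 1),
             Shot(8.5, 15.0, "aerial footage of the border crossing", 0)]
    print([on_camera_opinion(s) for s in shots])   # [True, False]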

We will develop innovative tools and techniques that provide more context to the answers the system delivers. For example, by spotting, labeling and naming people, places and objects as video is processed (technology being developed in companion research efforts), we will be able to generate continuous video dossiers that trace individuals and their associates over time and location, providing background context when needed.
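
A minimal sketch of how such a continuous video dossier might be structured, assuming that person, place and associate labels arrive from the companion spotting and naming research; the schema and names below are illustrative only.

    # Hypothetical sketch of a video dossier that accumulates sightings of a
    # named individual across time and location as clips are processed.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Sighting:
        clip_id: str            # identifier of the processed video clip
        date: str               # broadcast or capture date, e.g. "2004-06-12"
        location: str           # place name resolved from captions or speech
        associates: List[str]   # other named individuals in the same clip

    @dataclass
    class Dossier:
        person: str
        sightings: List[Sighting] = field(default_factory=list)

        def add(self, s: Sighting) -> None:
            self.sightings.append(s)

        def timeline(self) -> List[Tuple[str, str]]:
            """Chronological (date, location) trace for background context."""
            return sorted((s.date, s.location) for s in self.sightings)

    d = Dossier("Person X")
    d.add(Sighting("clip_0417", "2004-06-12", "Kabul", ["Person Y"]))
    print(d.timeline())   # [('2004-06-12', 'Kabul')]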

Figure 1. Conceptual overview of the research involved in analyzing and synthesizing video and verbal context for intelligence analysis dialogues.

Working with colleagues at USC, we will further support the understanding of perspectives and the generation of dossiers by developing an ontology for structuring visual knowledge in video. In this phase we will generate a representation language and grammar that describes news broadcasts, down to characterizing the content of field reports. This will enable us to parse, extract and compare similar elements from broadcasts produced by arbitrary sources in multiple formats, even in different languages. Furthermore, this work will improve our ability to "rephrase" an analyst's spoken or typed queries into our visual context.
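
One possible flavor of such a representation, sketched under the assumption of a production-style grammar; the element names below are placeholders chosen for illustration, not the ontology being developed with USC.

    # Hypothetical sketch: a toy grammar of news-broadcast structure, against
    # which parsed segments could be aligned and compared across sources and
    # languages. Element names are illustrative placeholders.
    BROADCAST_GRAMMAR = {
        "Broadcast":   ["AnchorIntro", "Story+", "SignOff"],
        "Story":       ["AnchorLead", "FieldReport?", "StudioWrap?"],
        "FieldReport": ["CorrespondentStandup", "BRoll*", "Interview*"],
    }

    def follows_story_production(segments: list) -> bool:
        """Crude check that a parsed story starts with an anchor lead and uses
        only elements the Story production allows."""
        allowed = {"AnchorLead", "FieldReport", "StudioWrap"}
        return bool(segments) and segments[0] == "AnchorLead" and \
            all(s in allowed for s in segments)

    print(follows_story_production(["AnchorLead", "FieldReport"]))  # True
    print(follows_story_production(["BRoll"]))                      # False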

We will pursue the challenge of how to ask a better question by developing a capability for applying and interpreting verbal, visual and multi-modal queries that may contain language, imagery and gestures (e.g., pointing). Correspondingly, when answers are delivered verbally, we will automatically "illustrate" them with corresponding video clips or sound bites if available.
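
As a sketch only, one way to picture such a query and its illustrated answer is a pair of simple records: one carrying the language, optional imagery and optional gesture, the other carrying the verbal answer plus references to supporting clips and sound bites. All names here are hypothetical.

    # Hypothetical sketch of a multi-modal query (language + imagery +
    # gesture) and an answer "illustrated" with clip and audio references.
    # Field names are assumptions, not an Informedia API.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class MultiModalQuery:
        text: str                                     # spoken or typed question
        example_image: Optional[bytes] = None         # e.g. a face crop
        gesture_point: Optional[Tuple[float, float]] = None  # point on a map

    @dataclass
    class IllustratedAnswer:
        text: str
        clip_refs: List[str] = field(default_factory=list)   # supporting video
        audio_refs: List[str] = field(default_factory=list)  # supporting sound bites

    q = MultiModalQuery("Who met this man near here last month?",
                        gesture_point=(34.52, 69.17))
    a = IllustratedAnswer("Two meetings were reported in that district.",
                          clip_refs=["clip_0417#t=35,58"])
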
We will explicitly represent measures of uncertainty in the results through the interfaces. System responses vary in credibility because of limited evidence, the error rates of automatic processing, and the provenance of the source content, and this varying credibility needs to be conveyed to the analyst.
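
A minimal sketch, with hypothetical field names and an illustrative placeholder formula, of how uncertainty and provenance might travel with each response so the interface can convey how much trust it deserves:

    # Hypothetical sketch: each answer carries the factors that limit its
    # credibility (evidence, processing error rate, provenance). The combining
    # formula is a placeholder, not a tuned or validated model.
    from dataclasses import dataclass

    @dataclass
    class ScoredAnswer:
        answer: str
        evidence_count: int      # independent clips supporting the answer
        asr_word_error: float    # speech-recognition error rate on the source
        source: str              # provenance of the originating broadcast

        def credibility(self) -> float:
            """Crude combined score in [0, 1]."""
            support = min(self.evidence_count / 5.0, 1.0)
            return support * (1.0 - self.asr_word_error)

    a = ScoredAnswer("The meeting occurred on March 3", 2, 0.25, "CCTV-4")
    print(round(a.credibility(), 2))   # 0.3
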
Finally, we will integrate our capabilities with the CMU Javelin end-to-end system through a drag-and-drop protocol and SOAP/XML exchange that enable the passing of objects and results between the two systems. We will jointly experiment with complementary corpora of textual newswire and broadcast video covering the same periods. Both systems will also be Mandarin-capable.
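
The snippet below sketches the kind of object exchange involved, serializing a result as XML that could travel in a SOAP body to Javelin; the element names and attributes are assumptions, since the actual exchange schema is not specified here.

    # Hypothetical sketch: packaging a result as XML for exchange with the
    # Javelin end-to-end system (e.g., inside a SOAP envelope). Element names
    # are assumptions; no actual Informedia/Javelin schema is implied.
    import xml.etree.ElementTree as ET

    result = ET.Element("InformediaResult")
    ET.SubElement(result, "Question").text = "Where did the meeting take place?"
    answer = ET.SubElement(result, "Answer", confidence="0.72", lang="en")
    ET.SubElement(answer, "Text").text = "Kabul"
    ET.SubElement(answer, "ClipRef").text = "clip_0417.mpg#t=35,58"
    print(ET.tostring(result, encoding="unicode"))
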
We structure our research to satisfy three fundamental requirements of a dialogue query system:
(1) That the questioning be iterative. The system must interact with the analyst to provide more than a single set of images and sound bites; it must also provide context for them and lay the groundwork for the follow-on questions that comprise an analysis dialogue or scenario.

(2) That the system be integrative. We address multiple integration issues: combining multiple sources and data types, and crossing languages; accounting for what the analyst already knows, observes and annotates; and applying broadband human-computer interaction and visualization technologies to project a collage of relevant information that unifies verbal, image, graphical, geographic, and temporal information.

(3) That the interaction be intuitive. The paradigm for querying video needs to go far beyond finding similar images or occurrences of particular objects. Though these still need to be provided, with associated thresholds of confidence, the need is for much more semantic inference in both what the system sees and what the analyst observes. What matters may be who is seen with whom, or the rock formations in the background. In viewing satellite imagery or reconnaissance photography, analysts often know what they are looking for and rapidly interpret much of what they observe, but they rarely need to articulate the process. They operate within a context that the system needs to better understand. Our systems will be developed with a continuous program of user testing and interface evolution to achieve an effective, efficient and enjoyable analyst interaction.

The government shall be granted government purpose license rights to system and application software developed or modified and used in the deliverables of this contract, including integration into other systems for purposes of evaluation and demonstration. Limitations and exceptions may apply (see Section II.J).


 

Copyright 1994-2002 Carnegie Mellon and its licensors. All rights reserved.