Informedia Digital Video Library: Digital video library research at Carnegie Mellon School of Computer Science
Question Answering from Errorful Multimedia Streams

PI: Howard Wactlar
Co-PIs: Alex Hauptmann, John Lafferty, Steven Roth, Mark Derthick, Mike Christel, Laurie Waisel (Concurrent Technologies Corporation)
Sponsor: Advanced Research and Development Activity (ARDA) AQUAINT Program
Period: July 2002 - December 2003

Project Description

There are 33,071 television stations in the world, according to the CIA World Factbook 2000. At 16 broadcast hours per day, these stations transmit 193 million hours of total programming per year. Computational resources can be applied to combining massive numbers of errorful video extracts in order to synthesize crisp, reliable, and visually rich answers drawn from this content. This project will develop probabilistic tools for deriving evidence from such multi-media source content to find answers, and to organize and present the answers in context-rich summary visualizations.

Multi-media data streams, such as television and radio broadcasts, radio call-in shows, and telephone conversations, can provide highly valuable intelligence information, but they are currently underexploited for intelligence purposes because of the high cost of analyzing such data manually. Most of today's research on question answering and summarization focuses on textual documents, such as newspaper reports or hypertext documents found on the Web. While it is possible to produce text transcripts from multi-media data, such transcripts are often errorful and ungrammatical, and lack sentence boundaries, punctuation, and capitalization. To date, most methods for question answering, information extraction, named entity analysis, fact extraction, and multi-document summarization rely on a syntactic parse of grammatically well-formed and punctuated sentences. We currently lack methods to discover meaningful answers in text errorfully extracted from streamed media or OCR.
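
The programming-volume figure is simple arithmetic over the broadcast statistics above:

    33,071 stations × 16 hours/day × 365 days/year ≈ 1.93 × 10^8 hours/year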

This project seeks to overcome that gap. The goal of the project is the development of a system that provides analysts with answers from multi-media data streams, based on errorfully extracted information. To make use of multi-media content for intelligence analysis, we need to extract the names, places, organizations, and other entities mentioned in media streams, identify what they refer to, find out how different entities relate to each other, and present this to the analyst in a coherent, summarized form in the proper context.

The project achieves this through focused tasks, which can be grouped into two categories: information extraction and processing to determine the answer, and automated visualization design to present answers.

For determining the answer we will probabilistically extract information from collections of multi-media documents in the face of errorful speech and image recognition by:

  • Resolving co-references between different extracted named entities despite ungrammaticalities.
  • Extracting information about semantic relations from media data and secondary text sources.
  • Learning models of information flow between different sources.
  • Hardening uncertain information using additional evidence actively extracted from multiple sources, both unstructured and structured, such as phone books, census data, and gazetteers.

To present answers, we will summarize and organize the information together with related contextual and meta-data through:

  • Text summaries that combine evidence from multiple candidate answers, as determined above.
  • Augmenting direct answers with supporting contextual material that can be absorbed at a glance using automated design of visualizations specific to the user, task, data, and analysis history.
  • Presenting maps, charts and images that serve as interfaces for rapidly posing follow-up questions or drilling down to supporting material and assumptions.
  • Representing uncertainty explicitly in the interfaces supporting answers, i.e., representing varying amounts of credibility due to the amount of evidence, error rates of automatic processing, and authorship of material (a scoring sketch follows this list).
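
To make the last point concrete, here is a minimal sketch of how an interface might fold evidence count, recognizer error rate, and source authorship into a single displayed credibility score. Everything here, from the field names to the weighting scheme, is a hypothetical illustration rather than the proposed system's actual design:

    from dataclasses import dataclass

    @dataclass
    class AnswerCandidate:
        text: str
        evidence_count: int          # independent supporting extractions
        asr_word_error_rate: float   # speech recognizer error rate, 0..1
        source_credibility: float    # prior trust in the authoring source, 0..1

    def credibility(a: AnswerCandidate) -> float:
        """Hypothetical score in [0, 1]: more independent evidence raises
        confidence; recognition errors and dubious sources lower it."""
        evidence = 1.0 - 0.5 ** a.evidence_count   # diminishing returns
        recognition = 1.0 - a.asr_word_error_rate
        return evidence * recognition * a.source_credibility

    answer = AnswerCandidate("Nairobi", evidence_count=3,
                             asr_word_error_rate=0.25, source_credibility=0.9)
    print(f"{answer.text}: credibility {credibility(answer):.2f}")   # ~0.59

An interface could then render such a score directly, for example as a bar or shading next to each answer, so the analyst sees at a glance how much trust each answer has earned.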

Figure 1. Proposed interface showing a simulated summarized multimedia answer extracted from broadcast news, addressing the origins of terrorists in the American embassy bombings in East Africa.

Determining the Answer
Entity references will be processed with techniques that analyze statistical similarities among the contexts in which the references occur, thereby resolving ambiguities in the transcripts. Co-reference resolution has been studied extensively in the MUC domain, and that work will be extended to deal with the ungrammatical and errorful nature of text extracted from multi-media streams, as well as to account for visual metadata elements. Learning to extract semantic relations among entities can be seen as a generalization of finding co-reference relations, and we will apply the same kinds of algorithms, adapted to errorful multi-media data. Relations can be relatively static (e.g., stories about Bush and McCain tend to indicate differences of opinion) or dynamic roles in a particular event (e.g., Bush signed the nuclear energy research initiative), in which case the time and type of the event must also be extracted.
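
The core of the context-similarity idea fits in a few lines: represent each mention by a bag-of-words profile of its surrounding transcript words, and treat mentions whose profiles are similar as candidate co-referents. This is a minimal sketch of that idea, not the project's actual resolver; the 0.3 threshold is a made-up tuning parameter:

    import math
    from collections import Counter

    def context_vector(words):
        """Bag-of-words profile of the words surrounding a mention."""
        return Counter(w.lower() for w in words)

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Do two transcript mentions of "Bush" refer to the same entity?
    m1 = context_vector("president bush signed the energy bill today".split())
    m2 = context_vector("bush signed energy legislation at the white house".split())
    if cosine(m1, m2) > 0.3:   # hypothetical merge threshold
        print("merge: contexts suggest the same entity")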

We will learn Bayesian network models of information flow to trace the sources of information, including hidden nodes that represent unknown sources and account for the parallel appearance of similar documents. Later we will add more knowledge so that we can recognize antecedents based on more abstract shared points of view.
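
A toy version of such a model shows how a hidden node explains parallel appearance: if the same story surfaces in two outlets that rarely carry it otherwise, the posterior on a shared hidden source jumps. All probabilities below are invented for illustration; a real model would learn them from data:

    # Hidden source S; outlets A and B carry its story with some probability.
    P_S = 0.10                       # prior that the hidden source released the story
    P_A = {True: 0.80, False: 0.05}  # P(A carries story | S)
    P_B = {True: 0.70, False: 0.05}  # P(B carries story | S)

    def posterior_source(a_carries, b_carries):
        """P(S=True | observations), by enumeration over the two values of S."""
        def likelihood(s):
            pa = P_A[s] if a_carries else 1 - P_A[s]
            pb = P_B[s] if b_carries else 1 - P_B[s]
            return pa * pb
        num = P_S * likelihood(True)
        return num / (num + (1 - P_S) * likelihood(False))

    # Parallel appearance in both outlets strongly implicates the shared source:
    print(f"P(S | A and B) = {posterior_source(True, True):.2f}")   # ~0.96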

We will "harden" the above algorithms by connecting the unstructured information extracted from open-source, free form broadcast news to structured databases from phone books, census data, and gazetteers. We will develop techniques to actively seek out confirming or disconfirming evidence from phone-book type databases of people or places with associated information (street address, affiliated organization, role in organization, title), with the entities mentioned in the broadcast news stream. Our approach will initially involve direct lookup, but since spurious matches are very likely in such large lists, probabilistic disambiguation will be performed using corroborating evidence.

Organizing and Presenting the Answer
Dynamic, query-driven summaries of facts and the underlying media source streams will be generated using maximal marginal relevance and adaptive clustering to identify and summarize the subset of information most relevant to the analyst's current interest. Most summarization work has been extractive: representative sentences are selected from single or multiple documents, ordered, and smoothed into more palatable English. More recently, learning approaches have been successfully applied to summarization of speech documents and other errorful data. Our work will present textual answers combined with visualizations by placing relevant text and image information on interactive maps and charts. Our previous experience in location extraction and presentation of geographic information has shown that visualizations allow complex geographic, temporal, organizational, and other relationships to be summarized clearly. Visualizations efficiently convey summary information and serve as context as an analysis session progresses. The visualization-based interfaces will allow the analyst to effortlessly explore variations on, or follow-ups to, the original question. For instance, sliders can modify the time range of interest, and hyperlinks support drill-down to details or to follow-up questions that the system can anticipate.
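
Maximal marginal relevance itself is compact: at each step, select the sentence that best trades relevance to the query against redundancy with what the summary already contains. A minimal sketch follows, where sim is any sentence-similarity function (the bag-of-words cosine sketched earlier would do) and the weight lam is a tuning parameter:

    def mmr_select(query, sentences, sim, k=3, lam=0.7):
        """Greedy maximal marginal relevance: favor sentences relevant to the
        query (first term) but penalize redundancy with sentences already
        selected for the summary (second term)."""
        selected, remaining = [], list(sentences)
        while remaining and len(selected) < k:
            best = max(remaining,
                       key=lambda d: lam * sim(d, query)
                       - (1 - lam) * max((sim(d, s) for s in selected),
                                         default=0.0))
            selected.append(best)
            remaining.remove(best)
        return selected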

Foundations
This research project builds on the well-established multi-media processing infrastructure of the Informedia Project, developed under previous NSF, DARPA, and ARDA funding, which focuses specifically on information extraction from video and audio content. Informedia pioneered the extraction of textual information from video and audio streams. Over two terabytes of online data have been collected, with automatically extracted metadata and indices for retrieving videos from this library. The proposed research tasks address functionality in question answering that is not currently available in the Informedia system. The extraction of relevant named entities and facts from the multi-media source data provides the basis for summarizing uncertain and perhaps conflicting information into answers. Information may have been unreliably extracted, or may appear to conflict either because the sources themselves conflict or because of errorful extraction.

The combination of textual summaries and multimedia visualizations of geo-spatial, relational, and numeric data is the foundation of an interactive Q&A system, which allows the analyst to quickly digest aggregate answers in display templates and to interactively refine and modify questions within the current context. We have previously demonstrated integrated text and graphics presentations of retrieved video segments, automated design of user-, task-, and data-specific visualizations, and automated domain-specific generation of extended text and graphics briefings. The challenge here is to extend these foundations to advanced question answering.

Evaluation
We will measure the accuracy and effectiveness of the answer extraction techniques and the visualization and presentation interfaces both qualitatively and quantitatively. Quantitative evaluation of the analysis, synthesis, and access experience is only possible for pieces of the technology, using measures such as precision, recall, and ROC curves for individual modules. We will also evaluate overall usability with expert analysts provided by Concurrent Technologies Corporation (CTC), using standard HCI techniques such as contextual inquiry, heuristic evaluation, cognitive walkthrough, and think-aloud user study analysis. CTC is a subcontractor in this effort that is committed to successfully transferring leading-edge technologies to the civil-military industrial base and will provide consultation on intelligence analysis and usability design.
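
The per-module measures named above are standard; as a reminder of what they compute, here is a brief sketch (the example counts are arbitrary):

    def precision_recall(tp, fp, fn):
        """Precision and recall from true/false positive and false negative counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    def roc_points(scored):
        """ROC curve from (score, is_relevant) pairs: sweep the score threshold
        and report (false-positive rate, true-positive rate) at each step."""
        scored = sorted(scored, reverse=True)
        pos = sum(1 for _, rel in scored if rel)
        neg = len(scored) - pos
        tp = fp = 0
        points = [(0.0, 0.0)]
        for _, rel in scored:
            if rel:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg if neg else 0.0, tp / pos if pos else 0.0))
        return points

    print(precision_recall(tp=40, fp=10, fn=20))   # (0.8, 0.667): precision, recall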

Click here to see a PowerPoint overview slide.

Copyright 1994-2002 Carnegie Mellon and its licensors. All rights reserved.