There are 33,071 television stations in the world, according
to the CIA World Factbook 2000. At 16 broadcast hours per day, these stations
transmit 193 million hours of total programming per year. Computational
resources can be applied to combining massive numbers of errorful video
extracts to synthesize crisp, reliable, and visually rich answers
drawn from this content. This project will develop probabilistic tools
for deriving evidence from such multi-media source content to find answers,
and to organize and present the answers in context-rich summary visualizations.
Multi-media data streams, such as television and radio broadcasts, radio
call-in shows, and telephone conversations, can provide highly valuable
intelligence information but are currently not sufficiently exploited
for intelligence purposes due to the high costs of analyzing such data
manually. Most of today's research on question answering and summarization
focuses on analysis of textual documents, such as newspaper reports or
hypertext documents found on the Web. While it is possible to provide
text transcripts from multi-media data, such textual data is often errorful,
ungrammatical, and without sentence boundaries, punctuation, and capitalization.
To date, most methods developed for question answering, information extraction,
named entity analysis, fact extraction, and multi-document summarization
rely on a syntactic parse of grammatically well-formed and punctuated
sentences. We currently lack methods to discover meaningful answers in
text that has been errorfully extracted from streamed media or OCR output.
This project seeks to close that gap. The goal of the
project is the development of a system that provides analysts with answers
from multi-media data streams, based on errorfully extracted information.
To make use of multi-media content for intelligence analysis, we need
to extract the names, places, organizations, and other entities mentioned
in media streams, identify what they refer to, find out how different
entities relate to each other, and present this to the analyst in a coherent,
summarized form in the proper context.
The project will achieve this through focused tasks, which
can be grouped into two categories: information extraction and processing
to determine the answer, and automated visualization design to present
answers.
To determine the answer, we will probabilistically
extract information from collections of multi-media documents in the face
of errorful speech and image recognition by:
- Resolving co-references between different extracted
named entities despite ungrammaticalities.
- Extracting information about semantic relations
from media data and secondary text sources.
- Learning models of information flow between
different sources.
- Hardening uncertain information with additional
evidence actively extracted from multiple sources, including unstructured
text as well as structured data such as phone books, census data, and gazetteers.
To present answers, we will summarize and organize
the information together with related contextual and meta-data through:
- Generating text summaries that combine evidence from multiple
candidate answers, as determined above.
- Augmenting direct answers with supporting contextual
material that can be absorbed at a glance using automated design
of visualizations specific to the user, task, data, and analysis history.
- Presenting maps, charts, and images that serve as
interfaces for rapidly posing follow-up questions or drilling down
to supporting material and assumptions.
- Representing uncertainty explicitly in the interfaces
supporting answers, i.e., conveying varying degrees of credibility
based on the amount of evidence, the error rates of automatic processing,
and the authorship of the material.

Figure 1. Proposed interface showing a simulated
summarized multimedia answer extracted from broadcast news, addressing
the origins of terrorists in the American embassy bombings in East Africa.
Determining the Answer
Named-entity references will be processed with techniques that analyze the
statistical similarity of the contexts in which they occur, thereby
resolving ambiguities in the transcripts. Co-reference
resolution has been studied extensively in the MUC domain; that work
will be extended to deal with the ungrammatical and errorful nature of
text extracted from multi-media streams, and to account for visual
metadata elements. Learning to extract semantic relations among entities
can be seen as a generalization of finding co-reference relations, and
we will apply the same kind of algorithms adapted to errorful multi-media
data. Relations can be relatively static (e.g., stories about Bush and
McCain tend to indicate differences of opinion), or can be dynamic roles
in a particular event (e.g., Bush signed the nuclear energy research initiative),
in which case the time and type of event must also be extracted.
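As a concrete illustration of the context-similarity machinery underlying both
tasks, the Python sketch below decides whether ambiguous mentions of the same
name co-refer by comparing the bag-of-words contexts around them. This is a
minimal sketch under invented assumptions: the stopword list, similarity
threshold, and transcript snippets are illustrative only, and the project's
actual techniques would also exploit visual metadata and richer statistics.

    # Minimal sketch: decide whether ambiguous mentions with the same surface
    # form co-refer by comparing the bag-of-words contexts in which they occur.
    # The stopword list, threshold, and snippets are illustrative assumptions.
    from collections import Counter
    from math import sqrt

    STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "to"}

    def context_vector(snippet):
        """Bag of non-stopword context words around a mention."""
        return Counter(w for w in snippet.lower().split() if w not in STOPWORDS)

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def corefer(mentions, threshold=0.3):
        """Greedily cluster same-name mentions whose contexts are similar enough."""
        clusters = []
        for surface, vec in mentions:
            for cluster in clusters:
                if any(surface == s and cosine(vec, v) >= threshold for s, v in cluster):
                    cluster.append((surface, vec))
                    break
            else:
                clusters.append([(surface, vec)])
        return clusters

    # Three noisy transcript snippets mentioning "washington" (no case or punctuation):
    snippets = [
        "protesters marched in washington against the embassy bombings",
        "officials in washington briefed reporters about the bombings",
        "denzel washington starred in the new film",
    ]
    mentions = [("washington", context_vector(s)) for s in snippets]
    print([len(c) for c in corefer(mentions)])  # -> [2, 1]: capital vs. actor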
We will learn Bayesian network models of information flow to trace the sources
of information, including hidden nodes that represent unknown sources
and account for the parallel appearance of similar documents. Later, we will
add more knowledge so that antecedents can be recognized from more
abstract shared points of view.
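A toy version of such a model, with a single hidden source node queried by
exact enumeration, is sketched below in Python. The outlets, prior, and
conditional probabilities are invented placeholders; in the project these
parameters would be learned from data rather than specified by hand.

    # Minimal sketch: a tiny Bayesian network with one hidden "source" node
    # and observed broadcast nodes, queried by enumeration. The outlets,
    # prior, and conditional probabilities are illustrative assumptions.
    PRIOR = {"wire_service": 0.6, "unknown_source": 0.4}

    # P(outlet carries the story | source), one entry per monitored outlet.
    LIKELIHOOD = {
        "wire_service":   {"cnn": 0.9, "bbc": 0.8, "local_tv": 0.3},
        "unknown_source": {"cnn": 0.2, "bbc": 0.3, "local_tv": 0.7},
    }

    def posterior(observations):
        """P(source | which outlets carried the story), by enumeration."""
        scores = {}
        for source, prior in PRIOR.items():
            p = prior
            for outlet, carried in observations.items():
                p_carry = LIKELIHOOD[source][outlet]
                p *= p_carry if carried else (1.0 - p_carry)
            scores[source] = p
        total = sum(scores.values())
        return {source: p / total for source, p in scores.items()}

    # A story seen only on local TV shifts belief toward the unknown source.
    print(posterior({"cnn": False, "bbc": False, "local_tv": True}))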
We will "harden" the above algorithms by connecting
the unstructured information extracted from open-source, free form broadcast
news to structured databases from phone books, census data, and gazetteers.
We will develop techniques that actively seek confirming or disconfirming
evidence by matching the entities mentioned in the broadcast news stream
against phone-book-style databases of people or places and their associated
information (street address, affiliated organization, role in organization, title). Our
approach will initially involve direct lookup, but since spurious matches
are very likely in such large lists, probabilistic disambiguation will
be performed using corroborating evidence.
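The sketch below illustrates this lookup-then-disambiguate step in Python: an
extracted name is matched against a toy phone-book table, and the candidate
records are ranked by how much corroborating evidence also appears near the
mention in the transcript. The records, field names, and weights are hypothetical.

    # Minimal sketch of "hardening": look up an extracted name in a structured
    # record list, then rank the (possibly spurious) matches by corroborating
    # evidence found in the transcript. Records, fields, and weights are
    # illustrative assumptions.
    PHONE_BOOK = [
        {"name": "john smith", "city": "nairobi", "organization": "embassy"},
        {"name": "john smith", "city": "chicago", "organization": "acme corp"},
        {"name": "jon smith",  "city": "mombasa", "organization": "red cross"},
    ]
    WEIGHTS = {"city": 2.0, "organization": 1.5}

    def candidate_records(name, records):
        """Direct lookup: exact, whitespace-insensitive name match."""
        key = name.lower().replace(" ", "")
        return [r for r in records if r["name"].replace(" ", "") == key]

    def corroboration(record, transcript_terms):
        """Score a candidate by the structured fields also seen in the transcript."""
        return sum(w for field, w in WEIGHTS.items()
                   if record[field] in transcript_terms)

    def disambiguate(name, transcript_terms, records=PHONE_BOOK):
        candidates = candidate_records(name, records)
        return sorted(candidates,
                      key=lambda r: corroboration(r, transcript_terms),
                      reverse=True)

    # Transcript mentions "john smith" near "nairobi" and "embassy":
    terms = {"john", "smith", "nairobi", "embassy", "bombing"}
    for record in disambiguate("John Smith", terms):
        print(corroboration(record, terms), record)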
Organizing and Presenting the Answer
Dynamic, query-driven summaries of facts and the underlying media source
streams will be generated using maximal marginal relevance and adaptive
clustering to identify and summarize the subset of information most relevant
to the analyst's current interest. Most summarization work to date has been
extractive, selecting representative sentences from single
or multiple documents, ordering them, and smoothing the output summary
into more palatable English. More recently, learning approaches have been
successfully applied to summarization of speech documents and other errorful
data. Our work will present textual answers combined with visualizations
by placing relevant text and image information on interactive maps and
charts. Our previous experience in location extraction and presentation
of geographic information has shown that visualizations allow complex
geographic, temporal, organizational, and other relationships to be summarized
clearly. Visualizations efficiently convey summary information, and serve
as context as an analysis session progresses. The visualization-based
interfaces will allow the analyst to effortlessly explore variations on
or follow-ups to the original question. For instance, sliders can modify
the time range of interest, and hyperlinks support drill-down to details
or to follow-up questions that the system can anticipate.
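As one illustration of the maximal-marginal-relevance step, the Python sketch
below repeatedly selects the candidate sentence that is most relevant to the
analyst's query and least redundant with the sentences already chosen, using
simple bag-of-words cosine similarity. The lambda weight, summary length, and
candidate sentences are illustrative assumptions; the full system would operate
over richer features and metadata.

    # Minimal sketch of maximal marginal relevance (MMR): pick sentences that
    # are relevant to the query but not redundant with those already selected.
    # The lambda weight, summary length, and candidates are illustrative.
    from collections import Counter
    from math import sqrt

    def bow(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def mmr_summary(query, sentences, k=2, lam=0.7):
        q = bow(query)
        vectors = [bow(s) for s in sentences]
        selected, remaining = [], list(range(len(sentences)))
        while remaining and len(selected) < k:
            def mmr_score(i):
                relevance = cosine(vectors[i], q)
                redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                                 default=0.0)
                # lam trades off query relevance against redundancy.
                return lam * relevance - (1 - lam) * redundancy
            best = max(remaining, key=mmr_score)
            selected.append(best)
            remaining.remove(best)
        return [sentences[i] for i in selected]

    # Candidate answer sentences extracted from several broadcasts:
    candidates = [
        "the embassy bombing suspects trained in camps abroad",
        "suspects in the embassy bombing trained in camps abroad",
        "investigators traced the suspects travel through nairobi",
    ]
    # Picks the first sentence plus the Nairobi sentence, skipping the near-duplicate.
    print(mmr_summary("where did the embassy bombing suspects come from",
                      candidates, k=2))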
Foundations
This research project builds on the well-established multi-media processing
infrastructure of the Informedia Project, developed under previous NSF,
DARPA, and ARDA funding, which focuses specifically on information extraction
from video and audio content. Informedia pioneered the extraction of textual
information from video and audio streams. Over two terabytes of online
data were collected, with automatically extracted metadata and indices
for retrieving videos from this library. The proposed research tasks address
functionality in question answering that is not currently available in
the Informedia system. The extraction of relevant named entities and facts
from the multi-media source data provides the basis for summarizing uncertain
and perhaps conflicting information into answers. Information may have
been unreliably extracted, or may appear conflicting either because the
sources genuinely disagree or because of errorful extraction.
The combination of textual summaries and multimedia visualizations
of geo-spatial, relational, and numeric data is the foundation of an interactive
Q&A system, which allows the analyst to quickly digest aggregate answers
in display templates and to interactively refine and modify questions
within the current context. We have previously demonstrated integrated
text and graphics presentations of retrieved video segments, automated
design of user, task, and data-specific visualizations, and automated
domain-specific generation of extended text and graphics briefings. The
challenge here is to extend these foundations to advanced question-answering.
Evaluation
We will measure the accuracy and effectiveness of the answer extraction
techniques and the visualization and presentation interfaces both qualitatively
and quantitatively. Quantitative evaluation of the analysis, synthesis,
and access experience is only possible for individual pieces of the technology,
using measures such as precision, recall, and ROC curves for each module.
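As a minimal example of such module-level scoring, the Python sketch below
computes precision and recall for an entity extractor against a hand-labeled
reference set; the entity sets are invented for illustration.

    # Minimal sketch of module-level scoring: precision and recall of an
    # entity extractor against a hand-labeled reference set. The entity
    # sets are invented for illustration.
    reference = {"nairobi", "dar es salaam", "american embassy", "east africa"}
    extracted = {"nairobi", "dares salaam", "american embassy", "east africa"}

    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted)   # 3 / 4 = 0.75
    recall = true_positives / len(reference)      # 3 / 4 = 0.75
    print(f"precision={precision:.2f} recall={recall:.2f}")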
We will also evaluate the overall usability with expert analysts provided
by Concurrent Technologies Corporation (CTC) using standard HCI techniques
such as contextual inquiry, heuristic evaluation, cognitive walkthrough,
and think-aloud user study analysis. CTC is a subcontractor in this effort
that is committed to transferring leading-edge technologies
to the civil-military industrial base and will provide consultation
on intelligence analysis and usability design.