Carnegie Mellon University
School of Computer Science
5000 Forbes Avenue
Pittsburgh, PA 15213
informedia@cs.cmu.edu
Project Description

This project will develop tools and techniques to automatically process and extract evidence from multimedia source content in order to understand questions, find answers, and organize and present those answers as contextures supporting the analysts' activities. Contexture, for the purposes of this research, is the weaving or assembling of multimedia parts into a coherent whole to provide a more complete "picture," or information structure, for both questions and answers. In this AQUAINT Phase 2 effort we emphasize the need to interpret and communicate the visual and verbal context associated with information, which may surround or accompany a question, an answer, or a follow-up, to throw light on its meaning or significance, or to explain its circumstances.

The major innovation of the proposed research is the automatic delivery of contextures serving as context-rich presentations that organize and summarize answers, while encouraging and facilitating review and refinement of the answer space through analyst interaction. Such contextures function more like a questioning expert human colleague than like an illustrated encyclopedic resource. We continue to pursue and resolve the challenge of how to query unstructured video and natural language information sources for purposes of intelligence gathering. In so doing, we will accelerate discovery by both system and analyst, enabling a more robust and relevant sequence of follow-on questions and responses.

In this phase we focus on the need (1) to understand the analysts' context when they seek information, (2) to automatically extract and interpret context from multimedia content, revealing aspects from bias to time and venue, and (3) to deliver results with additional context, in words, images, audio, and motion video. Understanding and delivering information within a context, both verbal and visual, is what engages the analyst and catalyzes the analyst-to-system dialogue to effectively explore a problem-solving scenario.

We illustrated the utility of video source material and visual answers from CNN video as part of our AQUAINT Phase 1 work. For Phase 2, we consider video content ranging from produced domestic (English) and foreign (Chinese Mandarin) news broadcasts, to correspondents' B-rolls of unedited source material, down to field-captured video from training camps and combat zones. We will leverage the very rich, but customizable, visual displays of composite information sources that we developed in AQUAINT Phase 1 and other programs sponsoring Carnegie Mellon Informedia research.

Several innovations are proposed to develop and deliver Informedia contextures supporting advanced question answering activities by the skilled professional analyst. Figure 1 shows a conceptual overview of the proposed work. We start with a dramatic leap from Informedia digital video extraction work to video interpretation research as we address the goal of understanding video perspectives: expressed and subtle opinions and attitudes in video information from varied sources, even across multiple languages. We apply Informedia multi-modal processing to extract low-level syntactic features and a textual narrative derived from speech recognition. We introduce new machine understanding of textual and visual rhetoric, taking advantage of the redundancy between the aural and visual presentations to overcome the shortfalls of understanding the derived text alone.
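As a rough illustration of how speech-derived narrative and low-level visual features might be combined per shot to exploit cross-modal redundancy, consider the following minimal Python sketch. The names here (Shot, align_transcript_to_shots, redundant_evidence) are hypothetical placeholders invented for this example; they do not describe the actual Informedia pipeline.

```python
# Hedged sketch: fusing speech-recognized narrative with visual labels per shot.
# All names are illustrative placeholders, not the Informedia implementation.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Shot:
    """One video shot with its time span, visual labels, and aligned transcript."""
    start_sec: float
    end_sec: float
    visual_labels: List[str] = field(default_factory=list)  # e.g., "anchor", "crowd"
    transcript: str = ""                                     # speech-recognized narrative


def align_transcript_to_shots(shots: List[Shot], words: List[Tuple[str, float]]) -> None:
    """Attach each recognized word (text, start_sec) to the shot it falls inside."""
    for text, start in words:
        for shot in shots:
            if shot.start_sec <= start < shot.end_sec:
                shot.transcript += (" " + text) if shot.transcript else text
                break


def redundant_evidence(shot: Shot, concept: str) -> bool:
    """A concept is better supported when both modalities mention it, which helps
    compensate for speech-recognition errors in either stream."""
    return concept in shot.visual_labels and concept in shot.transcript.lower()


if __name__ == "__main__":
    shots = [Shot(0.0, 12.5, ["anchor", "studio"]), Shot(12.5, 30.0, ["crowd", "street"])]
    words = [("protesters", 14.2), ("gathered", 14.8), ("downtown", 15.3)]
    align_transcript_to_shots(shots, words)
    print(redundant_evidence(shots[1], "crowd"))  # False: only the visual stream has it
```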
We will develop innovative tools and techniques that provide more context for the answers the system delivers. For example, by spotting, labeling, and naming people, places, and objects as video is processed (technology being developed in companion research efforts), we will be able to generate continuous video dossiers that trace individuals and their associates over time and location, providing background context when needed.
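A continuous video dossier of this kind can be thought of as a time-ordered record of where, when, and with whom a person appears across processed video. The Python sketch below shows one plausible organization of such a record; the class and field names (Sighting, VideoDossier) are invented for illustration and do not reflect the project's actual design.

```python
# Hedged sketch of a "continuous video dossier": a per-person, time-ordered log of
# appearances (video source, time, place, co-appearing associates). Names are illustrative.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sighting:
    video_id: str
    timestamp: str          # e.g., broadcast date/time
    location: str           # named place, if recognized
    associates: List[str]   # other people identified in the same shot


@dataclass
class VideoDossier:
    person: str
    sightings: List[Sighting] = field(default_factory=list)

    def add(self, sighting: Sighting) -> None:
        self.sightings.append(sighting)

    def associates_over_time(self) -> Dict[str, int]:
        """Count how often each associate co-occurs with this person."""
        counts: Dict[str, int] = {}
        for s in self.sightings:
            for a in s.associates:
                counts[a] = counts.get(a, 0) + 1
        return counts


if __name__ == "__main__":
    dossier = VideoDossier("Person A")
    dossier.add(Sighting("cnn_2004_03_01", "2004-03-01T19:05", "Baghdad", ["Person B"]))
    dossier.add(Sighting("field_clip_17", "2004-03-04", "unknown", ["Person B", "Person C"]))
    print(dossier.associates_over_time())  # {'Person B': 2, 'Person C': 1}
```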
Figure 1. Conceptual overview of the research involved in analyzing and synthesizing video and verbal context for intelligence analysis dialogues.

Working with colleagues at USC, we will further support the understanding of perspectives and the generation of dossiers by developing an ontology for structuring visual knowledge in video. In this phase we will create a representation language and grammar that describes news broadcasts, down to characterizing the content of field reports. This will enable us to parse, extract, and compare similar elements from broadcasts produced by arbitrary sources in multiple formats, even in different languages. Furthermore, this work will improve our ability to "rephrase" an analyst's spoken or typed queries into our visual context.

We will pursue the challenge of how to ask a better question by developing a capability for applying and interpreting verbal, visual, and multi-modal queries that may contain language, imagery, and gestures (e.g., pointing). Correspondingly, when answers are delivered verbally, we will automatically "illustrate" them with corresponding video clips or sound bites if available.

(3) That the interaction be intuitive: the paradigm for querying video needs to go far beyond finding similar images or occurrences of particular objects. Though these still need to be provided, with associated thresholds of confidence, the need is for much more semantic inference in both what the system sees and what the analyst observes. What matters may be who is seen with whom, or the rock formations in the background. In viewing satellite imagery or reconnaissance photography, analysts often know what they are looking for and rapidly interpret much of what they observe, but they rarely need to articulate the process. They operate within a context that the system needs to understand better. Our systems will be developed with a continuous program of user testing and interface evolution to achieve an effective, efficient, and enjoyable analyst interaction.
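To make the multi-modal query idea above more concrete, the sketch below shows one plausible way to represent a query mixing typed or spoken language, an example image, and a pointing gesture, and to filter candidate shots by a confidence threshold. Everything here (the MultiModalQuery structure, the scoring and fusion functions) is a hypothetical illustration under assumed interfaces, not the system's actual query language.

```python
# Hedged sketch of a multi-modal query (language + imagery + pointing gesture) and
# confidence-thresholded retrieval of answer shots. All names are invented for illustration.

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MultiModalQuery:
    text: str                                    # spoken or typed question
    example_image: Optional[str] = None          # path to an image the analyst supplied
    pointed_region: Optional[Tuple[int, int, int, int]] = None  # x, y, w, h from a pointing gesture


@dataclass
class CandidateShot:
    shot_id: str
    text_score: float     # match between query text and shot transcript, 0..1
    visual_score: float   # match between example image/region and shot imagery, 0..1


def combined_confidence(c: CandidateShot, text_weight: float = 0.5) -> float:
    """Simple weighted fusion of textual and visual evidence (illustrative only)."""
    return text_weight * c.text_score + (1.0 - text_weight) * c.visual_score


def answer_shots(candidates: List[CandidateShot], threshold: float = 0.6) -> List[CandidateShot]:
    """Return only shots confident enough to show the analyst, best first."""
    kept = [c for c in candidates if combined_confidence(c) >= threshold]
    return sorted(kept, key=combined_confidence, reverse=True)


if __name__ == "__main__":
    query = MultiModalQuery("who met the minister last week", "analyst_photo.jpg", (120, 80, 60, 60))
    shots = [CandidateShot("newsA_012", 0.8, 0.7), CandidateShot("fieldB_003", 0.4, 0.5)]
    print([s.shot_id for s in answer_shots(shots)])  # ['newsA_012']
```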
The government shall be granted government purpose license rights to system and application software developed or modified and used in the deliverables of this contract, including integration into other systems for purposes of evaluation and demonstration. Limitations and exceptions may apply (see Section II.J).