NSF Progress Reports
Informedia: Integrated Speech, Image, and Language Understanding for the Creation and Exploration of Digital Video Libraries
Carnegie Mellon University Informedia Digital Video Library Project
NSF Cooperative Agreement IRI 9411299
Quarterly Report, November 1996
Howard D. Wactlar, Project Director
Following is a brief summary of research and implementation progress for the period 1 August 1996 to 31 October 1996. In this period we have: (1) improved skim selection and enhanced image matching, (2) begun constructing a Web version of the Informedia client, and (3) released version 0.91 of the software to the testbed site.
Speech, Image, Language Understanding for Library Creation
Video Skimming
We have created more rules for image selection in video skimming. These rules draw on existing image understanding technology, such as camera motion analysis, face and text detection, and scene segmentation, as well as new technology currently under development. Our new rules include:
Keyword and keyphrase detection has been extended to include proper names. With this technique, we can identify regions throughout the video in which people are mentioned.
We have incorporated our systems for text and face detection to select representative poster frames. Human faces are among the most interesting objects in video. When a face appears with captions, this usually indicates a person or affiliation relevant to the video segment. Captions seldom appear in documentaries, so this technology will primarily be used for selecting poster frames in broadcast news. In that domain, captions are used to show a person's name and affiliation, as well as locations and descriptions of events.
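To make the rule combination concrete, the following minimal sketch shows how face and caption detections might be scored to rank candidate poster frames. The frame fields (face_count, caption_present) and the weights are illustrative assumptions on our part, not the deployed Informedia rule set.

```python
# Minimal sketch of rule-based poster-frame selection.
# The detection fields and weights are illustrative assumptions,
# not the actual Informedia rule set.

def score_frame(frame):
    """Score a candidate poster frame from its detections."""
    score = 0.0
    score += 2.0 * frame["face_count"]          # faces are highly informative
    if frame["caption_present"]:
        score += 3.0                            # captions name people/places
    if frame["face_count"] and frame["caption_present"]:
        score += 5.0                            # face + caption: likely a named person
    return score

def pick_poster_frame(frames):
    """Return the highest-scoring candidate frame of a segment."""
    return max(frames, key=score_frame)

# Example: three candidate frames from a broadcast-news segment.
frames = [
    {"time": 1.2, "face_count": 0, "caption_present": False},
    {"time": 4.8, "face_count": 1, "caption_present": True},
    {"time": 9.5, "face_count": 2, "caption_present": False},
]
print(pick_poster_frame(frames))   # the face-with-caption frame wins
```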
Video Spotting and Parsing
We are working on automatic extraction of typical, important content from news video: the nature of the material (e.g., a speech or conversation), the occasion being covered in the news segment (e.g., a meeting or conference), and the specific event or incident. For the automatic detection of such semantically rich information, we are developing both image analysis and natural language analysis techniques.
Typical phrases in text suggesting the above information were analyzed both by humans and by an existing natural language parser. We investigated keyword spotting and the detection of subjects and topics. We were able to detect more than half of the appropriate information in the content we analyzed.
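A minimal sketch of the cue-phrase spotting idea is given below. The cue phrases and category names are illustrative assumptions, not the actual phrase lists analyzed in this work.

```python
# Minimal sketch of cue-phrase spotting for classifying the nature of
# news material. The cue phrases below are illustrative assumptions.

CUES = {
    "speech":       ["in a speech", "addressed", "told the audience"],
    "conversation": ["in an interview", "spoke with", "asked about"],
    "meeting":      ["at a summit", "met with", "at a conference"],
}

def spot_material_nature(transcript):
    """Return the categories whose cue phrases appear in the transcript."""
    text = transcript.lower()
    return [kind for kind, phrases in CUES.items()
            if any(p in text for p in phrases)]

print(spot_material_nature(
    "The president, in a speech at a conference of governors, ..."))
# ['speech', 'meeting']
```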
We also analyzed images suggesting the same information, and we are currently investigating the use of our existing face detection technique for identifying the nature of news material (e.g., speech, conversation).
We are developing a method for "linking" images and text. By aligning images and text through dynamic programming, we can generate structured, self-contained data for a "speech": for example, a record pairing a speaker's face with the contents of the speech, which can then be referenced by face, topic, name, date, and so on. Extensions for event classification and identification are being considered.
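The following minimal sketch illustrates such a dynamic-programming alignment using only the event times of face appearances and name mentions. The cost function and gap penalty are illustrative assumptions; the actual method aligns much richer image and text features.

```python
# Minimal sketch of linking face occurrences to name mentions by
# dynamic programming. Costs and the gap penalty are illustrative
# assumptions; the real system aligns richer image/text features.

def align(face_times, name_times, gap=10.0):
    """Align two time-ordered event lists, minimizing total time offset.
    Returns a list of (face_index, name_index) pairs."""
    n, m = len(face_times), len(name_times)
    # dp[i][j]: best cost aligning the first i faces with the first j names
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i-1][j-1] + abs(face_times[i-1] - name_times[j-1])
            dp[i][j] = min(match, dp[i-1][j] + gap, dp[i][j-1] + gap)
    # Trace back to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i-1][j-1] + abs(face_times[i-1] - name_times[j-1]):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i][j] == dp[i-1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Faces detected at 12s and 45s; names mentioned at 10s, 30s, 44s.
print(align([12.0, 45.0], [10.0, 30.0, 44.0]))  # [(0, 0), (1, 2)]
```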
Face and Name Association
To enhance "Name-It", our name and face association method, we have developed a face tracking method and an enhanced name extraction method.
Our face tracking method adaptively builds a statistical face color model from a detected face, tracks the face within MPEG video, and provides occurrence and duration information for that face. In addition, it evaluates the angle of each face in order to output the best (most nearly frontal) shot of that face.
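A minimal sketch of the adaptive color-model idea follows. The per-channel Gaussian model, the likelihood score, and the region representation are illustrative assumptions, not the implemented tracker.

```python
# Minimal sketch of adaptive color-model tracking. A simple statistical
# model of skin color is estimated from the detected face region and
# used to score candidate regions in later frames. The scoring rule and
# region representation are illustrative assumptions.

import statistics

def build_color_model(face_pixels):
    """Estimate per-channel mean and stdev from the detected face."""
    channels = list(zip(*face_pixels))          # [(r...), (g...), (b...)]
    return [(statistics.mean(c), statistics.pstdev(c) or 1.0)
            for c in channels]

def region_likelihood(model, region_pixels):
    """Average per-channel closeness of a region to the face model."""
    score = 0.0
    for pixel in region_pixels:
        for value, (mu, sigma) in zip(pixel, model):
            score -= abs(value - mu) / sigma    # smaller deviation = better
    return score / max(len(region_pixels), 1)

def track(model, candidate_regions):
    """Pick the candidate region that best matches the color model."""
    return max(candidate_regions, key=lambda r: region_likelihood(model, r))

face = [(200, 150, 130), (195, 148, 128), (205, 155, 135)]
model = build_color_model(face)
candidates = [
    [(60, 60, 200), (70, 65, 190)],             # blue background
    [(198, 151, 131), (202, 149, 129)],         # skin-colored region
]
print(track(model, candidates))                  # the skin-colored region
```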
Our enhanced name extraction technique uses dictionaries and a parser to analyze transcripts, extracting name candidates much more reliably.
Library Exploration
Color Image Retrieval System
We have developed an initial version of the Advanced Region Based Image Retrieval System (ARBIRS). Based on observations of various existing image retrieval systems and on our own experimental studies, we concluded that prominent regions in an image, along with their associated features, provide the best foundation for a higher-level, content-based image retrieval system. A major challenge of this approach is that retrieval quality depends heavily on the robustness and accuracy of the image segmentation method that detects prominent regions in the image. We have made substantial progress on this issue by developing a new method to segment regions under non-uniform illumination conditions such as shade, highlight, and high contrast.
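As a rough illustration of why illumination must be discounted, the sketch below groups pixels by intensity-normalized chromaticity, a standard way to make a shaded and an unshaded patch of the same surface look alike. It is offered only to convey the problem; it is not the ARBIRS segmentation algorithm.

```python
# Minimal sketch of illumination-tolerant pixel grouping. Dividing out
# intensity (normalized rg-chromaticity) makes a shaded and an unshaded
# patch of the same surface look alike. This illustrates the general
# idea only; it is not the ARBIRS segmentation method.

def chromaticity(pixel):
    """Map (R, G, B) to intensity-normalized (r, g)."""
    r, g, b = pixel
    total = r + g + b or 1
    return (r / total, g / total)

def same_region(p1, p2, tol=0.02):
    """Treat two pixels as one region if their chromaticities agree."""
    c1, c2 = chromaticity(p1), chromaticity(p2)
    return abs(c1[0] - c2[0]) < tol and abs(c1[1] - c2[1]) < tol

bright = (200, 100, 50)     # surface in direct light
shaded = (100, 50, 25)      # same surface in shade: half the intensity
other  = (50, 100, 200)     # a different surface
print(same_region(bright, shaded))  # True  -- shading is discounted
print(same_region(bright, other))   # False
```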
A High-dimensional Indexing Method
To enable efficient exploration of a digital video library, image similarity matching among millions of images is necessary. Since a typical image similarity matching method converts each image into a high-dimensional feature vector, efficient high-dimensional indexing is required.
We reviewed existing indexing methods, including the R*-tree and the SS-tree, which are among the most successful and are used in other image matching systems. We discovered problems with these methods, especially when they are applied to extremely high-dimensional data. To overcome those problems we developed the SR-tree, which, according to our evaluations, outperforms the other indexing methods on high-dimensional data. We are planning to incorporate it into the current Informedia image retrieval facility.
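The defining idea of the SR-tree is to describe each node's region as the intersection of a bounding sphere and a bounding rectangle, so that the distance from a query to the region can be lower-bounded by the larger of the two individual bounds, pruning more aggressively in high dimensions. The sketch below illustrates that bound; the node layout is an illustrative assumption rather than our implementation.

```python
# Minimal sketch of the SR-tree pruning idea: each node region is the
# intersection of a bounding sphere and a bounding rectangle, so the
# lower bound on query-to-node distance is the larger of the two
# individual lower bounds. The node contents here are illustrative.

import math

def dist_to_rect(query, lo, hi):
    """Minimum distance from a query point to an axis-aligned rectangle."""
    return math.sqrt(sum(
        max(l - q, 0.0, q - h) ** 2 for q, l, h in zip(query, lo, hi)))

def dist_to_sphere(query, center, radius):
    """Minimum distance from a query point to a bounding sphere."""
    d = math.sqrt(sum((q - c) ** 2 for q, c in zip(query, center)))
    return max(d - radius, 0.0)

def node_lower_bound(query, node):
    """SR-tree bound: intersection of sphere and rectangle regions."""
    return max(dist_to_rect(query, node["lo"], node["hi"]),
               dist_to_sphere(query, node["center"], node["radius"]))

node = {"lo": (0.0, 0.0), "hi": (4.0, 4.0),
        "center": (2.0, 2.0), "radius": 2.0}
query = (5.0, 5.0)
# The rectangle alone bounds the distance at ~1.41; the sphere raises
# it to ~2.24, so the combined region allows more aggressive pruning.
print(node_lower_bound(query, node))   # ~2.24
```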
Annotations
We added the ability for users to "mark up" the library at their site. These annotations can then be used to refine searches within that library. For example, we are currently annotating a small library we are creating from videos of the Republican and Democratic conventions as well as the two presidential debates. The annotations we are adding simply identify the "speaker" and the "location" (of the speech). Having done that, a user can then issue a "fielded" query such as: "welfare policies" AND Speaker "Dole" AND Location "debate". This will resolve to clips where "welfare policies" was found in the transcript, title, or abstract, but only those clips whose annotations contain "Dole" somewhere in the "speaker" field AND "debate" somewhere in the "location" field.
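A minimal sketch of how such a fielded query resolves against annotated clips is shown below. The clip records and matching rules are illustrative assumptions consistent with the example above.

```python
# Minimal sketch of resolving a fielded query against annotated clips.
# The clip records and field names are illustrative assumptions.

def matches(clip, text_terms, field_terms):
    """A clip matches if all text terms occur in its transcript/title/
    abstract and each fielded term occurs in the named annotation field."""
    free_text = " ".join(
        clip.get(k, "") for k in ("transcript", "title", "abstract")).lower()
    if not all(t.lower() in free_text for t in text_terms):
        return False
    return all(value.lower() in clip["annotations"].get(field, "").lower()
               for field, value in field_terms.items())

clips = [
    {"title": "Dole on welfare policies",
     "transcript": "... welfare policies ...",
     "annotations": {"speaker": "Dole", "location": "debate, San Diego"}},
    {"title": "Convention speech",
     "transcript": "... welfare policies ...",
     "annotations": {"speaker": "Clinton", "location": "convention"}},
]
hits = [c for c in clips
        if matches(c, ["welfare policies"],
                   {"speaker": "Dole", "location": "debate"})]
print([c["title"] for c in hits])   # ['Dole on welfare policies']
```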
Annotations will be useful in many contexts as users begin to "customize" their libraries toward local usage patterns. For instance, teachers in schools might want to include "class notes" in the library.
We also completed a user-friendly "annotation editor" utility to make it easy for users to annotate their libraries. This editor allows the user to play the video, mark start/stop times to define the boundaries of a new annotation, and define new "annotation types" (e.g., "speaker_id", "class_notes") as well as fields within those types (e.g., for "speaker_id", a "speaker" field and a "location" field).
The database API was modified so that library clients can build search dialogs on the fly by querying the database for the types of "fielded data" on which the user can issue a refined query.
Information Retrieval Studies
We extended our information retrieval baseline experiments to see how the retrieval results change with larger corpus sizes. Each data set used the same 105 prompts, for which corresponding stories had been transcribed either manually or by a speech recognizer. In this case, though, three corpus sizes were generated by adding manually generated transcripts. Average rank figures were computed using the best retrieval system. The table below shows the average correct rank for the set of 105 stories retrieved from the different-sized databases.
Average correct rank by corpus size:

                                    602 stories   2,600 stories   12,000 stories
  Manually prepared transcripts         2.32           5.65             9.34
  Speech-generated transcripts          7.89          31.16            60.19
These results show the scaling behavior of the average rank measure for the best retrieval system as the number of documents in the corpus is increased by adding more manually generated "distractor" story transcripts. The average rank rises more quickly for speech-recognized transcripts than for manually created transcripts; however, both conditions appear to degrade approximately with the log of the corpus size. Since only "perfect" manual transcripts were added to the corpus, this data is slightly biased against the speech-recognized transcripts: one would expect better measured performance in the speech recognition condition if the additional stories in the corpus were of the same type (i.e., speech transcripts) as the original 105 stories. It is also worth noting that the ratio of average ranks between speech-recognized and manual transcripts increases with the size of the corpus. This indicates that speech-recognized transcripts are less focused on the correct topic and are more likely to be displaced by other apparently relevant stories from the "distractor" set.
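For reference, the average correct rank measure itself is simple to state; the sketch below computes it from ranked retrieval results. The sample scores are invented for illustration.

```python
# Minimal sketch of the average correct rank measure: for each query
# prompt, find the rank of its corresponding story in the retrieval
# system's result list, then average. The sample data is illustrative.

def average_correct_rank(results, correct):
    """results: {prompt: ranked list of story ids}
    correct: {prompt: id of the story that should be retrieved}"""
    ranks = [results[p].index(correct[p]) + 1 for p in correct]
    return sum(ranks) / len(ranks)

results = {
    "q1": ["s9", "s1", "s4"],     # correct story s1 is ranked 2nd
    "q2": ["s2", "s7", "s3"],     # correct story s2 is ranked 1st
}
correct = {"q1": "s1", "q2": "s2"}
print(average_correct_rank(results, correct))   # (2 + 1) / 2 = 1.5
```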
Data Organization, Networking Architecture, and Interoperability
DEC Video Server
We completed installation of a 40 GB MPEG library (approximately 60 hours) on the MediaPlex server, an Alpha 600 running Digital Unix 3.0. We also demonstrated the ability to stream MPEG video at VHS-quality throughput (30 fps) over TCP on a network with at least 1.5 Mb/s of bandwidth per client. The disappointing news with respect to the video server is that the expected client-side browser plug-in (which was to be used as an embedded control within an HTML document) never materialized from the outside developer. This type of control is imperative for Informedia, since much of what our client displays is two or more synchronized streams, e.g., a transcript and filmstrips synchronized to video playback. Without the client-side control, our prospects for fully exploiting this technology are not promising.
Java-based Client
A first version of the Java-based Web client for Informedia has been completed and is functional. Performance issues remain to be addressed over the next few months.
This work has included the design and construction of a Java applet implementing the Informedia search and retrieval function, and another permitting library browsing. A further component has been the design and construction of a TCP/IP access mechanism enabling remote access to library content. Initial experiments designed to test interoperability via this interface are underway.
Testbeds, Specialized Corpora, and User Studies
User Studies
We collected data via interviews and transaction logs from our initial testbed users at Winchester Thurston (WT). A discussion of how this data was used in the refinement of the digital library interface is presented in the following electronic document:
Christel, M.G., and Pendyala, K. "Informedia Goes to School: Early Findings from the Digital Video Library Project." D-Lib Magazine, September, 1996. http://www.dlib.org/dlib/september96/informedia/09christel.html.
We conducted a formal empirical study with 30 high school and college students over the summer. This work has implications for multimedia abstractions used by a digital video library as well as how the data is segmented within that library. A discussion of the study has been submitted to a conference (CHI) for publication consideration. From that paper:
Quick access to short, relevant segments of video enables the efficient use of a digital video library. Three interfaces were designed for such access, allowing the user to browse through a set of video segments in support of a fact-finding task. An experiment is described in which subjects' performance and attitudes are measured to determine the relative effectiveness of these three interfaces. Results show that visual imagery benefits both performance and subjective satisfaction compared to text list presentation. Tailoring the representative images for video segments in a set based on the query which produced that set (query-based poster frames) provides significant improvements.
We are planning, and doing preliminary work for, an empirical study of the effectiveness of a collapsed video, or video "skim," created in four different ways: fixed time intervals, "best" subsets chosen by considering only the video data, "best" subsets chosen by considering only the audio data, and "best" subsets chosen by considering both audio and video content. Pilot tests will be conducted in November 1996, followed by more formal studies.
Specialized Corpus
We processed a small amount of content provided by DARPA (approximately 5-7 hours) and installed this library, along with the latest client, at DARPA.
Interoperability
We completed work on the server side of a Web "gateway" to the Informedia library: a CGI script that works with vanilla HTTP servers, parsing HTTP requests to access Informedia metadata and media content, process queries, and browse our library hierarchically.
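A minimal sketch of the gateway pattern, assuming a query parameter named "q" and a stand-in lookup function (both our own assumptions, not the actual Informedia gateway interface), might look like this:

```python
#!/usr/bin/env python3
# Minimal sketch of the Web gateway idea: a CGI script that parses the
# HTTP query string, runs it against the library, and emits HTML. The
# parameter name "q" and the lookup function are illustrative
# assumptions, not the actual Informedia gateway interface.

import os
from urllib.parse import parse_qs

def search_library(terms):
    """Stand-in for the real Informedia metadata search."""
    catalog = {"welfare": ["Clip 12: Dole on welfare policies"]}
    return [hit for t in terms for hit in catalog.get(t.lower(), [])]

params = parse_qs(os.environ.get("QUERY_STRING", ""))
terms = params.get("q", [""])[0].split()

print("Content-Type: text/html")
print()
print("<html><body><ul>")
for hit in search_library(terms):
    print("<li>%s</li>" % hit)
print("</ul></body></html>")
```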
We also completed work on a first cut at a "Web client," i.e., one that works within the constraints of a Web browser. Initially we chose to maximize the portability of this client by writing most of the GUI in Java. The price of attempting to keep the client "portable" was the steep performance hit exacted by Java.
We began work on replacing the Java client interface with ActiveX controls for better performance. While ActiveX itself is still an emerging technology, what we have seen thus far seems to address our need for better run-time performance as well as (network) scalability.
In order to address the problem of actually playing back content over the Internet, we have begun work to implement a "slideshow," which we view as a surrogate for actual video playback. A slideshow would stream continuous audio and display still images from the video synchronized to the audio stream. We have been investigating the plethora of commercial Web gadgetry becoming available to do this effectively, including ActiveX, RealAudio, Xing StreamWorks, InterVu, VDOnet, Xtreme, etc. Ultimately, we hope to "ratchet" the number of frames per second displayed in the slideshow up or down by dynamically responding to the currently available network throughput.
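A minimal sketch of the ratcheting idea follows. The thresholds, step sizes, and per-frame bandwidth figure are illustrative assumptions, not measured values.

```python
# Minimal sketch of "ratcheting" the slideshow image rate to measured
# network throughput. The thresholds, step sizes, and bandwidth probe
# are illustrative assumptions.

def adjust_rate(current_fps, measured_kbps, kbps_per_frame=40,
                min_fps=0.5, max_fps=4.0):
    """Raise or lower frames/second to fit the measured bandwidth."""
    needed = current_fps * kbps_per_frame
    if measured_kbps < needed:          # falling behind: back off
        return max(min_fps, current_fps / 2)
    if measured_kbps > 2 * needed:      # ample headroom: ratchet up
        return min(max_fps, current_fps * 2)
    return current_fps

fps = 1.0
for kbps in [120, 300, 300, 20]:        # simulated throughput samples
    fps = adjust_rate(fps, kbps)
    print("throughput %4d kb/s -> %.1f fps" % (kbps, fps))
# 120 -> 2.0, 300 -> 4.0, 300 -> 4.0 (held), 20 -> 2.0
```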
We began an experiment to interoperate with the DL project at Stanford. The basic idea is that we issue queries from our Web client over a socket connection to a Stanford "infobus" server, parse the returns, and display the results in the same client in which we show results from our own library. Most (if not all) of Stanford's content is _not_ multimedia. Likewise, Stanford will issue HTTP requests to our library Web server to search, browse, and retrieve objects from the Informedia library for use within their own client interface. This builds on their current work of building "proxies" to scale the amount of data accessible via the Infobus.
We helped organize a workshop at this year's ACM Multimedia Conference in Boston (November 18-22) entitled "Interoperability for Digital Video Libraries". The workshop has attracted about eight different projects from the US, Europe, and Australia, and our hope is that this meeting will spark more such experiments among the participants.
External Interactions
Visitors & Industry Contacts
Public Presentations and Conference Papers
I certify that to the best of my knowledge (1) the statements herein (excluding scientific hypotheses and scientific opinions) are true and complete, and (2) the text and graphics in this report as well as any accompanying publications or other documents, unless otherwise indicated, are the original work of the signatories or individuals working under their supervision. I understand that the willful provision of false information or concealing a material fact in this report(s) or any other communication submitted to NSF is a criminal offense (U.S. Code, Title 18, Section 1011).
Howard D. Wactlar
Project Director
Copyright © 1997 Carnegie Mellon University
All Rights Reserved