NSF Progress Reports
Informedia: Integrated Speech, Image, and Language Understanding for the Creation and Exploration of Digital Video Libraries
Carnegie Mellon University Informedia Digital Video Library
NSF Cooperative Agreement IRI 9411299
Quarterly Report, August 1996
Howard D. Wactlar, Project Director
The following is a brief summary of research and implementation progress for the period 1 May 1996 to 31 July 1996. In this period we have: (1) improved skim selection and enhanced image matching, (2) begun constructing a Web version of the Informedia client, and (3) released version 0.91 of the software to the testbed site.
Speech, Image and Language Understanding for Library Creation
Video Skimming
We have worked to improve our rules for skim selection by further integrating image and language characteristics. Improved characterization includes:
Detection of object presence and motion.
Keyword/Keyphrase detection through improved document corpus for TF/IDF.
Detection of proper names to identify people.
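As an illustration of the TF/IDF keyword detection listed above, the following is a minimal sketch, not the project's implementation. The function name `tfidf_keywords`, the whitespace tokenization, the smoothed IDF formula, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_keywords(document, corpus, top_k=3):
    """Rank the words of `document` by TF-IDF against a reference `corpus`.

    Hypothetical helper: tokenization and the smoothing are assumptions.
    """
    doc_tokens = document.lower().split()
    corpus_tokens = [set(text.lower().split()) for text in corpus]
    n_docs = len(corpus_tokens)

    scores = {}
    for word, count in Counter(doc_tokens).items():
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for tokens in corpus_tokens if word in tokens)
        # Smoothed inverse document frequency, so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf

    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy reference corpus and transcript segment (invented for illustration).
corpus = [
    "the senate passed the budget bill",
    "the storm hit the coast",
    "the team won the game",
]
segment = "the volcano erupted and the volcano ash closed the airport"
print(tfidf_keywords(segment, corpus, top_k=1))
```

A larger and more representative reference corpus gives more reliable document frequencies, which is why an improved corpus can improve keyword selection.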
Our ability to match images and image regions is being enhanced through research in detecting background and foreground motion. This will also enhance our skim selection by allowing discrimination between camera and object motion.
We are currently planning user studies to test the usefulness and quality of video skims, and working on developing "dynamic skims", i.e., skims that are relevant to a particular query.
Color Image Similarity Matching
We developed an R-tree based, multidimensional indexing method to accelerate color image similarity matching. It is being tested with actual color histograms and is expected to support retrieval over collections of more than 100,000 images with practical retrieval times.
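To make the matching step concrete, here is a minimal sketch of color histogram similarity using a brute-force linear scan. It does not implement the R-tree index; it shows only the kind of comparison such an index would accelerate. The bin count, the L1 distance, and all function and image names are assumptions for illustration.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Build a normalized RGB histogram from (r, g, b) tuples with values in 0..255."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = sum(hist) or 1.0
    return [value / total for value in hist]


def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0.0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def most_similar(query_hist, library):
    """Linear scan over `library` (image name -> histogram); no index is used."""
    return min(library, key=lambda name: l1_distance(query_hist, library[name]))


# Tiny invented "images" given as pixel lists.
library = {
    "sunset": color_histogram([(250, 10, 10)] * 4),
    "ocean": color_histogram([(10, 10, 250)] * 4),
}
query = color_histogram([(255, 0, 0)] * 3 + [(0, 0, 255)])
print(most_similar(query, library))
```

A linear scan costs one distance computation per stored image, which is the cost a multidimensional index is meant to reduce for large collections.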
User Interfaces and Client Implementation
Web Based Client
We have completed design work for a full-featured, platform-independent, web-based Java client with functionality similar to that of the current MS Visual Basic client, and implementation is proceeding. We have enabled browsing and searching of the Informedia database via the Internet (with only local access to date because of rights issues). The most difficult challenge will be serving appropriate video segments over the Internet's sometimes limited bandwidth. We are currently investigating both a high-bandwidth MPEG-based video streamer from Digital and the low-bandwidth video streamer from VDO. Both clients are capable of displaying the poster frames and automatically generated text synopses for returned results.
A related challenge is to enable speech recognition for a web-based interface. We are exploring a number of options for this in collaboration with the Wearable Speech project in the CMU speech group.
A Baseline for Information Retrieval
We have established an information retrieval baseline for our data. We used titles from 105 stories to retrieve the corresponding story (with titles) from a database of 602 stories. Three versions were available: JGI transcripts, CC transcripts, and SR transcripts. Relative to the JGI transcripts, the CC transcripts had a word error rate of 15% and the SR (speech recognition) transcripts 50.7%.
The correct story (i.e., the one that goes with the human-generated title from JGI) was determined and its rank computed among all the returned results. What follows is the average correct rank for the set of 105 stories retrieved from the database of 602 stories.
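Word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The report does not say which tool was used, so the following is only a sketch of that standard definition; the function name and sample sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# Invented example: one substitution and one deletion against six reference words.
reference = "the president spoke to congress today"
hypothesis = "the resident spoke congress today"
print(word_error_rate(reference, hypothesis))
```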
TYPE                  | JGI TEXT | CC     | SR
Random Selection      | 301.00   | 301.00 | 301.00
Pursuit Search Engine | 91.49    | NA     | 125.26
Pursuit               | 68.60    | 83.16  | 101.23
Stopwords (6/96)      | 5.69     | 7.03   | 35.78
Best as of (7/96)     | 1.79     | 2.00   | 7.60
The best system uses a combination of stemming, stopwords, tf/idf, document length normalization, and document weighting; Pursuit plus stopwords is therefore a realistic baseline.
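The "average correct rank" measure used in this evaluation can be sketched as follows. This is an illustration only: it assumes ranks are counted from 1, and the function name, data layout, and story identifiers are hypothetical.

```python
def average_correct_rank(ranked_results, correct_ids):
    """Mean 1-based rank of the correct story over all queries.

    ranked_results: query -> list of story ids, best match first (assumed layout).
    correct_ids:    query -> id of the story the title was taken from.
    """
    ranks = []
    for query, correct in correct_ids.items():
        # list.index is 0-based, so add 1 to obtain a rank starting at 1.
        ranks.append(ranked_results[query].index(correct) + 1)
    return sum(ranks) / len(ranks)


# Two invented title queries over a three-story database.
ranked_results = {
    "title one": ["s1", "s2", "s3"],
    "title two": ["s3", "s2", "s1"],
}
correct_ids = {"title one": "s1", "title two": "s2"}
print(average_correct_rank(ranked_results, correct_ids))
```

Under this measure a lower value is better, and a value of 1.0 would mean every title retrieved its own story in first place.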
Face & Name Association
Our face and name association technique, "Name-It", is being enhanced in both image and natural language processing. In image processing, we analyzed the statistical model of human face color, which will then be applied to face tracking in image sequences to obtain face sequences as well as face occurrence duration. We are developing a "best face" model to select, from among image sequences, the "best face" that is most appropriate for face matching. In natural language processing, we are applying parsing and structural knowledge of news to news transcripts to achieve high-level name detection, i.e., matching pronouns to proper names.
New Client Software
We completed version 0.91 of the Informedia application in late July. This version is anticipated to be released to Winchester Thurston late in August. It will require new search indices to be generated for every catalog, and every video must be tagged with its actual length.
New features of the software include:
New search engine.
Spell checker has been integrated to flag potential misspelled words in the query.
Better organized options. The options are now:
Better usage tracking via better timing and logging of transactions.
Continuous listening is partially implemented.
A separate "Copy Text" menu item supplements "Copy Text with Attribution."
Window activation will usually follow the mouse, except when over the browser form. When the mouse is over the browser, a click is necessary to activate the browser.
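The report does not describe how the integrated spell checker works. As a generic illustration of flagging potentially misspelled query words, the sketch below checks each word against a vocabulary and suggests close matches with Python's standard `difflib` module. The function name, vocabulary, and query are invented for the example.

```python
import difflib


def flag_misspellings(query, vocabulary, max_suggestions=3):
    """Return {unknown query word: list of close vocabulary words}."""
    vocab = {word.lower() for word in vocabulary}
    candidates = sorted(vocab)  # sorted for deterministic suggestion order
    flagged = {}
    for word in query.lower().split():
        if word not in vocab:
            # get_close_matches ranks candidates by similarity ratio (default cutoff 0.6).
            flagged[word] = difflib.get_close_matches(
                word, candidates, n=max_suggestions
            )
    return flagged


# Invented vocabulary; in practice it might come from the search index terms.
vocabulary = ["volcano", "eruption", "senate", "budget"]
print(flag_misspellings("volcno eruption", vocabulary))
```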
Testbeds, Specialized Corpuses, and User Studies
We analyzed logs of Winchester Thurston user activity, and used that as input to the design of an interface study on poster frames for this summer and as input to the design of the next testbed.
We conducted a study with Pennsylvania Governor's School high school students and CMU students, with results to be analyzed in late August or early September.
We installed a client at the Testbed Integration Environment at DARPA. The first client was left at DARPA in late May/early June; an improved client was sent in early July.
Interoperability
We completed the RPC interface to our library to the extent that it replicates current functionality of our stand-alone client. While the next step would be to add caching support and other features necessary for implementation on a WAN, we've deferred this work in favor of an alternate solution to interoperability over a WAN, i.e., a Web interface.
We began an effort within our group to implement the current library API behind a Web server. This promises to be much more scalable from an interoperability standpoint, since changes in the API and the library itself are all hidden behind a central location (the Web server). In this model, the Web server communicates with the library directly, and exports the search engine, the library browser, and various media types and playback options via HTML. The one missing piece in this model is that typical Web browsers are not equipped with enough multimedia gadgets to support many of the datatypes and playback mechanisms which we currently have in our client. We are currently examining the latest Web technology (Java, ActiveX) available to supplement the browsers with multimedia objects.
We installed the beta (and later the first production) version of the DEC Video Server, with a test library of approximately seven hours of MPEG video. We wrote a batch utility which can load the video library onto the server without human intervention. The product requires specific client-side hardware and software, which we have also installed in a testbed.
While we still have to iron out some technical problems with DEC, we did prove to our satisfaction that the product dovetails nicely with our new Web interface efforts. A prototype of our web client is able to stream MPEG video from the Video Server via embedded HTML links. Furthermore, the server has the capability to stream specified segments within a larger MPEG file, which is crucial to our application.
Personnel Changes
We hired a new research scientist, Dr. Yihong Gong, who started on June 1st to work in the Informedia image understanding group.
External Interactions
Visitors and Industry Contacts
Public Presentations and Conference Papers
I certify that to the best of my knowledge (1) the statements herein (excluding scientific hypotheses and scientific opinions) are true and complete, and (2) the text and graphics in this report as well as any accompanying publications or other documents, unless otherwise indicated, are the original work of the signatories or individuals working under their supervision. I understand that the willful provision of false information or concealing a material fact in this report(s) or any other communication submitted to NSF is a criminal offense (U.S. Code, Title 18, Section 1011).
Howard D. Wactlar
Project Director
08/02/96