Development and Evaluation of Digital Video Library Interfaces
Outline
Surrogates for Informedia Digital Video Library
Abstractions for single video document
Empirical studies on thumbnail images, skims
Quick overview of early HCI investigations
Summaries across video documents (collages)
Demonstration of information visualization
Required advances in automated content extraction
TREC Video Retrieval Track 2002
Overview of Carnegie Mellon participation and results
Multiple storyboard interface emphasizing imagery
Informedia Digital Video Library Project
Initiated by the National Science Foundation, DARPA, and NASA under the Digital Libraries Initiative, 1994-98
Continued funding via Digital Libraries Initiative Phase 2 (NSF, DARPA, National Library of Medicine, Library of Congress, NASA, National Endowment for the Humanities)
New work and directions via NSF, NSDL, ARDA VACE, the “Capturing, Coordinating, and Remembering Human Experience” (CCRHE) project, CareMedia, etc.
Details at http://www.informedia.cs.cmu.edu/
Techniques Underlying Video Metadata
Image processing
Detection of text overlaid on video
Detection of faces
Identification of camera and object motion
Breaking video into component shots (see the sketch after this list)
Detecting corpus-specific categories, e.g., anchorperson shots and weather map shots
Speech recognition
Text extraction and alignment
Natural language processing
Determining best text matches for a given query
Identifying places, organizations, people
Producing phrase summaries
Text and Face Detection
Text Extraction and Alignment
Deriving “Matching Shots”
Initial User Testing of Video Library, ca. 1996
104-hour library consisting of 3,481 clips
Average clip length of 1.8 minutes, consuming 15.7 megabytes of storage
Automatic logs generated for usage of Informedia Library by high school science teachers and students
243 hours logged (2473 queries, 2910 video clips played)
Early Lessons Learned
Titles frequently used; should include clip length and production date
Results and title placement affect usage
Greater quantity of video was desired
Storyboards (filmstrips) used infrequently
Empirical Study Into Thumbnail Images
Text-based Result List
“Naïve” Thumbnail List (Uses First Shot Image)
Query-based Thumbnail Result List
Query-based Thumbnail Selection Process
Thumbnail Study Results
Empirical Study Summary*
Significant performance improvements for query-based thumbnail treatment over other two treatments
Subjective satisfaction significantly greater for query-based thumbnail treatment
Subjects could not identify differences between thumbnail treatments, but their performance definitely showed differences!
_____
*Christel, M., Winkler, D., and Taylor, C.R. Improving Access to a Digital Video Library. In Human-Computer Interaction: INTERACT ’97, Chapman & Hall, London, 1997, 524-531
Thumbnail View with Query Relevance Bar
Close-up of Thumbnail with Relevance Bar
“Skim Video”:  Extracting Significant Content
Skims:  Preliminary Findings
Real benefit for skims appears to be for comprehension rather than navigation
For PBS documentaries, information in audio track is very important
Empirical study conducted in September 1997 to determine advantages of skims over subsampled video, and synchronization requirements for audio and visuals
Empirical Study:  Skims
Skim Study Results
Skim Study Questions on User Satisfaction
Skim Study Results*
1996 “selective” skims performed no better than subsampled skims, but results from the 1997 study show significant differences, with “selective” skims more satisfactory to users
audio is less choppy than earlier 1996 skim work
synchronization with video is better preserved
grain size has increased
_____
*Christel, M., Smith, M., Taylor, C.R., and Winkler, D. Evolving Video Skims into Useful Multimedia Abstractions. In Proc. ACM CHI ’98 (Los Angeles, CA, April 1998), ACM Press, 171-178
Match Information
Using Match Information For Browsing
Using Match Info to Reduce Storyboard Size
Adding Value to Video Surrogates via Text
Captions AND pictures better than either modality alone
Large, A., et al. Multimedia and Comprehension: The Relationship among Text, Animation, and Captions. J. American Society for Information Science 46(5) (June 1995), 340-347
Nugent, G.C. Deaf Students' Learning from Captioned Instruction: The Relationship between the Visual and Caption Display. J. Special Education 17(2) (1983), 227-234
Video surrogates better with BOTH images and text
Ding, W., et al. Multimodal Surrogates for Video Browsing. In Proc. ACM Conf. on Digital Lib. (Berkeley, CA, Aug. 1999), 85-93
Christel, M. and Warmack, A. The Effect of Text in Storyboards for Video Navigation. In Proc. IEEE ICASSP, (Salt Lake City, UT, May 2001), Vol. III, pp. 1409-1412
For news/documentaries, audio narrative is important, but other video genres may be different
Li, F., Gupta, A., et al. Browsing Digital Video. In Proc. ACM CHI ’00 (The Hague, Neth., April 2000), 169-176
How Much Text, and Does Layout Matter?
Results from Christel/Warmack Study
More Results from Storyboard/Text Study
Conclusions from Storyboard/Text Study
Storyboard surrogates clearly improved with text
Participants favored interleaved presentation
Navigation efficiency is best served with reduced interleaved text (BriefByRow)
BriefByRow and All had best task performance, but BriefByRow requires less display space
If interleaving is done in conjunction with text reduction, so as to better preserve and represent the time association between lines of text, imagery, and their affiliated video sequence, then a storyboard with great utility for both information assessment and navigation can be constructed.
Discussed Multimedia Surrogates, i.e., Abstractions based on Library Metadata
Range of Multimedia Surrogates
Designing and Evaluating Video Surrogates
Techniques discussed here:
transaction logs
formal empirical studies
Other techniques used in interface refinement:
contextual inquiry
heuristic evaluation
cognitive walkthroughs
“think aloud” protocols
Extending to Surrogates ACROSS Video
As digital video assets grow, so do result sets
As automated processing techniques improve, e.g., speech and image processing, more metadata is generated with which to build interfaces into video
Need overview capability to deal with greater volume
Prior work offered many solutions:
Visualization By Example (VIBE) for matching entity relationships
Scatter plots for low dimensionality relationships, e.g., timelines
Dynamic query sliders for direct manipulation of plots
Colored maps for geographic relationships
Enhancing Library Utility via Better Metadata
Displaying Metadata in Effective “Collages”
Zooming into “Collage” to Reveal Details
Example of “Chrono-Collage”
Named Entity Extraction
Challenge:  Integrating Imagery into Collages
Great Volume of Imagery Requires Filtering
Video can be decomposed into shots
Consider 2050 hours of CNN videos from 1997-2002
1,688,000 shots
67,700 segments/stories
1 minute 53 seconds average story duration
4.5 seconds average shot duration
23 shots per segment on average
Result sets for queries number in the hundreds or thousands
Against the 2001 CNN collection, the top 1000 stories for queries on “terrorism” and “bomb threat” produced 17,545 and 18,804 shots, respectively
User needs a way to filter down tens of thousands of images
Adding Imagery to Visualizations
Query-based thumbnail images added to VIBE, timeline, map summaries
Layout differs:  overlap in VIBE/timeline; tile in map
Extend concept of “highest scoring” to represent country, or a point in time or a point on VIBE plot
Adding Text Overviews to Collages
Transcript and other derived text, such as scene text and characters overlaid on broadcast video, provide input for further processing
Named entity tagging and common phrase extraction provide a filtering mechanism to reduce the text into defined subsets
Visualization interface allows subsets, e.g., people, organizations, locations, and common phrases, to be displayed for the set of documents plotted in the visualization view
Example of Text-Augmented Timeline
Most frequent common phrases and people from query on “anthrax” against 2001 news listed beneath timeline plot.
Example of Text-Augmented VIBE Plot
Refinement of Collages*
Image addition to summaries improved over time
Anchorperson removal for more representative visuals
Better layout lets images consume more of the available timeline space
Image resizing under user control to see detail on demand
Text addition found to require new interface controls
Selection controls, e.g., list people, organizations, locations, and/or common phrases
Stopping rules, e.g., list at most X terms, or list terms only if they are covered by Y documents or Z% of the document set (see the sketch after this slide)
Show some text where the user’s attention is focused, at the mouse pointer, i.e., pop-up tooltip text
_____
*Christel, M., et al. Collages as Dynamic Summaries for News Video. In Proc. ACM Multimedia ’02 (Juan-les-Pins, France, Dec. 2002)
NIST TREC Video Retrieval Track
Definitive information at NIST TREC Video Track web site:  http://www-nlpir.nist.gov/projects/trecvid/
TREC series sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies
Goal is to encourage research in information retrieval from large amounts of text by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results
Video Retrieval Track started in 2001
Goal is investigation of content-based retrieval from digital video
Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip
TREC-Video 2001 and TREC-Video 2002
2001 collection had ~11 hours of MPEG-1 video:  260 segments, 8000 shots, 80,000 I-frames
2002 search test collection had ~40 hours of MPEG-1 video:  1160 segments, 14,524 shots (given by TREC-V), 292,000 I-frames
2001 results
http://trec.nist.gov/pubs/trec10/t10_proceedings.html
Automatic search (no human in loop) difficult:  about 1/3 of queries were unanswered by any of the automatic systems
Research groups submitting search runs included Carnegie Mellon, Dublin City Univ., Fudan Univ. China, IBM, Lowlands Group Netherlands
2002 results
http://www.cdvp.dcu.ie/Papers/TREC2002_Video_report.pdf
25 search topics developed by NIST
Search runs submitted by list above, plus CLIPS-IMAG (Fr), Imperial College London, Microsoft Research Asia, U. Maryland, U. Oulu (Fin)
TREC-Video 2002 Queries
Specific item or person
Eddie Rickenbacker, James Chandler, George Washington, Golden Gate Bridge, Price Tower in Bartlesville, Okla.
Specific fact
Arch in Washington Square Park in NYC, map of continental US
Instances of a category
football players, overhead views of cities, one or more women standing in long dresses
Instances of events/activities
people spending leisure time at the beach, one or more musicians with audible music, crowd walking in an urban environment, locomotive approaching the viewer
TREC-Video 2002 Features for Auto-Detection
Outdoors: recognizably outdoor location
Indoors: recognizably indoor location
Face: at least one human face with nose, mouth, and both eyes
People: group of two or more humans
Cityscape: recognizably city/urban/suburban setting
Landscape: a predominantly natural inland setting, i.e., one with little or no evidence of development by humans
Text Overlay: superimposed text large enough to be read
Speech: human voice uttering recognizable words
Instrumental Sound: sound produced by one or more musical instruments, including percussion instruments
Monologue: an event in which a single person is at least partially visible and speaks for a long time without interruption by another speaker
TREC-V 2002 Search Results
Interface Development for TREC-V
Multiple document storyboards
Resolution and layout under user control
Query context plays a key role in filtering image sets to manageable sizes
TREC-V 2002 image feature set offers additional filtering capabilities for indoor, outdoor, faces, people, etc.
Displaying filter counts and distributions guides their use in manipulating the storyboard views
Multiple Document Storyboards
Resolution and Layout under User Control
Summary
Genre matters, e.g., tailoring to news/documentaries
Context, e.g., matching terms, and synchronization between imagery and narrative can reduce surrogate complexity for both single clip and sets of video clips
Text with imagery more useful in video summaries than either text alone or imagery alone
Richness of imagery holds potential; look to TREC-V and similar venues to chart progress
“Overview first, zoom and filter, then details on demand”
Visual Information-Seeking Mantra of Ben Shneiderman
Direct manipulation interfaces leave the user in control
Iterative prototyping reveals areas needing further work
Credits
Many Informedia Project and CMU research community members contributed to this work; a partial list appears here:
Project Director:  Howard Wactlar
User Interface: Mike Christel, Chang “Liz” Huang, Neema Moraveji, Adrienne Warmack, Dave Winkler
Image Processing: Takeo Kanade, Norm Papernick, Yanjun Qi, Robert Chen, Toshio Sato, Henry Schneiderman, Michael Smith
Speech and Language Processing:  Alex Hauptmann, Ricky Houghton, Rong Jin, Dorbin Ng, Michael Witbrock, Rong Yan
Informedia Library Essentials:  Bob Baron, Colleen Everett, Mark Hoy, Melissa Keaton,  Bryan Maher, Scott Stevens