Surrogates for Informedia Digital Video
Library |
|
Abstractions for single video document |
|
Empirical studies on thumbnail images,
skims |
|
Quick overview of early HCI
investigations |
|
Summaries across video documents
(collages) |
|
Demonstration of information
visualization |
|
Required advances in automated content
extraction |
|
TREC Video Retrieval Track
2002 |
|
Overview of Carnegie Mellon participation and
results |
|
Multiple storyboard interface emphasizing
imagery |
|
| |
|
Initiated by the National Science Foundation,
DARPA, and NASA under the Digital Libraries Initiative,
1994-98 |
|
Continued funding via Digital Libraries Initiative
Phase 2 (NSF, DARPA, National Library of Medicine, Library of
Congress, NASA, National Endowment for the Humanities) |
|
New work and directions via NSF, NSDL, ARDA VACE, the
“Capturing, Coordinating, and Remembering Human Experience”
(CCRHE) project, CareMedia, etc. |
|
Details at
http://www.informedia.cs.cmu.edu/ | |
|
Image processing |
|
Detection of text overlaid on video |
|
Detection of faces |
|
Identification of camera and object
motion |
|
Breaking video into component shots (sketched after this list) |
|
Detecting corpus-specific categories, e.g.,
anchorperson shots and weather map shots |
|
Speech recognition |
|
Text extraction and alignment |
|
Natural language processing |
|
Determining best text matches for a given
query |
|
Identifying places, organizations,
people |
|
Producing phrase
summaries | |
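As an illustration of the shot decomposition step above, here is a minimal sketch of histogram-based shot-boundary detection. The bin count and threshold are illustrative assumptions, not the Informedia project's actual parameters.

```python
# Minimal sketch of histogram-based shot-boundary detection.
# Bin count and threshold are illustrative assumptions.
import numpy as np

def color_histogram(frame, bins=8):
    """Normalized per-channel color histogram of an RGB frame array."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.3):
    """Return frame indices where adjacent histograms differ sharply."""
    boundaries = []
    prev = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = color_histogram(frame)
        if np.abs(cur - prev).sum() / 2 > threshold:  # L1 distance in [0, 1]
            boundaries.append(i)
        prev = cur
    return boundaries
```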
|
104-hour library consisting of 3,481 clips |
|
Average clip length of 1.8 minutes, consuming 15.7 megabytes of storage |
|
Automatic logs generated for usage of the Informedia Library by high school science teachers and students |
|
243 hours logged (2,473 queries, 2,910 video clips played) | |
|
Titles frequently used; they should include length and production date |
|
Results and title placement affect
usage |
|
A greater quantity of video was desired |
|
Storyboards (filmstrips) used
infrequently |
|
| |
|
Significant performance improvements for the query-based thumbnail treatment over the other two treatments (thumbnail selection sketched below) |
|
Subjective satisfaction significantly greater for
query-based thumbnail treatment |
|
Subjects could not identify differences between the thumbnail treatments, but their performance clearly showed differences! |
|
_____ |
|
*Christel, M., Winkler, D., and Taylor, C.R. Improving Access to a Digital Video Library. In Human-Computer Interaction: INTERACT ’97, Chapman & Hall, London, 1997, 524-531 | |
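The core of the query-based treatment can be sketched as follows: score each shot's transcript text against the query and use the keyframe of the best-matching shot as the clip's thumbnail. The data layout and the simple term-overlap score are illustrative assumptions, not the study's implementation.

```python
# Sketch of query-based thumbnail selection. The shot representation
# and term-overlap scoring are illustrative assumptions.
def pick_thumbnail(shots, query):
    """shots: list of dicts with 'keyframe' and 'words' (transcript terms)."""
    terms = set(query.lower().split())
    def overlap(shot):
        return len(terms & {w.lower() for w in shot["words"]})
    return max(shots, key=overlap)["keyframe"]
```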
|
The real benefit of skims appears to be comprehension rather than navigation |
|
For PBS documentaries, information in audio track
is very important |
|
Empirical study conducted in September 1997 to determine the advantages of skims over subsampled video, and synchronization requirements for audio and visuals (both skim styles sketched below) | |
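The two skim styles compared in the study can be sketched as follows. The grain size, importance scores, and greedy selection are illustrative assumptions, not the actual skim generator.

```python
# Sketch of the two skim styles: a subsampled skim takes evenly spaced
# windows; a "selective" skim keeps the highest-scoring chunks, where
# importance scores are assumed to come from audio and image analysis.
def subsampled_skim(duration_s, target_s, grain_s=3.0):
    """Evenly spaced grain_s-second windows totaling ~target_s seconds."""
    n = max(1, int(target_s // grain_s))
    step = duration_s / n
    return [(i * step, i * step + grain_s) for i in range(n)]

def selective_skim(scored_chunks, target_s):
    """scored_chunks: (start_s, end_s, importance) tuples."""
    picked, used = [], 0.0
    for start, end, _ in sorted(scored_chunks, key=lambda c: -c[2]):
        if used + (end - start) <= target_s:
            picked.append((start, end))
            used += end - start
    return sorted(picked)  # play back in chronological order
```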
|
1996 “selective” skims performed no better than subsampled skims, but results from the 1997 study show significant differences, with “selective” skims more satisfactory to users |
|
Audio is less choppy than in the earlier 1996 skim work |
|
Synchronization with video is better preserved |
|
Grain size has increased |
|
_____ |
|
*Christel, M., Smith, M., Taylor, C.R.,
and Winkler, D. Evolving Video Skims into Useful Multimedia
Abstractions. In Proc. ACM CHI ’98 (Los Angeles, CA, April 1998),
ACM Press, 171-178 | |
|
Captions AND pictures better than either
modality alone |
|
Large, A., et al. Multimedia and Comprehension: The
Relationship among Text, Animation, and Captions. J. American
Society for Information Science 46(5) (June 1995),
340-347 |
|
Nugent, G.C. Deaf Students' Learning from Captioned
Instruction: The Relationship between the Visual and Caption
Display. J. Special Education 17(2) (1983), 227-234 |
|
Video surrogates better with BOTH images
and text |
|
Ding, W., et al. Multimodal Surrogates for Video
Browsing. In Proc. ACM Conf. on Digital Lib. (Berkeley, CA, Aug.
1999), 85-93 |
|
Christel, M. and Warmack, A. The Effect of Text in
Storyboards for Video Navigation. In Proc. IEEE ICASSP, (Salt Lake
City, UT, May 2001), Vol. III, pp. 1409-1412 |
|
For news/documentaries, audio narrative
is important, but other video genres may be different |
|
Li, F., Gupta, A., et al. Browsing Digital Video.
In Proc. ACM CHI ’00 (The Hague, Neth., April 2000),
169-176 |
|
| |
|
Storyboard surrogates clearly improved with
text |
|
Participants favored interleaved
presentation |
|
Navigation efficiency is best served with reduced
interleaved text (BriefByRow) |
|
BriefByRow and All had best task performance, but
BriefByRow requires less display space |
|
If interleaving is done in conjunction with text reduction, so as to preserve and represent the time association between lines of text, imagery, and their affiliated video sequence, then a storyboard with great utility for information assessment and navigation can be constructed. | |
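A minimal sketch of such an interleaved, text-reduced storyboard: one shortened transcript caption per row of thumbnails, kept time-aligned with that row's shots. The row width and word cap are illustrative assumptions, not the BriefByRow parameters.

```python
# Sketch of row-interleaved, reduced text for a storyboard.
# row_width and max_words are illustrative assumptions.
def brief_by_row(shots, row_width=5, max_words=8):
    """shots: list of (thumbnail, transcript_text) pairs in time order."""
    rows = []
    for i in range(0, len(shots), row_width):
        row = shots[i:i + row_width]
        words = " ".join(text for _, text in row).split()
        rows.append({"images": [thumb for thumb, _ in row],
                     "caption": " ".join(words[:max_words])})
    return rows
```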
|
Techniques discussed here: |
|
transaction logs |
|
formal empirical studies |
|
Other techniques used in interface
refinement: |
|
contextual inquiry |
|
heuristic evaluation |
|
cognitive walkthroughs |
|
“think aloud” protocols |
|
| |
|
As digital video assets grow, so do
result sets |
|
As automated processing techniques
improve, e.g., speech and image processing, more metadata is
generated with which to build interfaces into video |
|
Need overview capability to deal with
greater volume |
|
Prior work offered many
solutions: |
|
Visualization By Example (VIBE) for matching entity
relationships |
|
Scatter plots for low dimensionality relationships,
e.g., timelines |
|
Dynamic query sliders for direct manipulation of plots (sketched below) |
|
Colored maps for geographic
relationships | |
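A minimal sketch of the dynamic-query-slider idea: each slider movement immediately re-filters the documents shown in the plot. The attribute names are illustrative assumptions.

```python
# Sketch of dynamic query filtering: re-run on every slider change so
# the plot updates immediately. Attribute names are illustrative.
def apply_sliders(documents, date_range, min_score):
    """Keep only documents inside every slider's current range."""
    lo, hi = date_range
    return [d for d in documents
            if lo <= d["date"] <= hi and d["score"] >= min_score]
```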
|
Video can be decomposed into
shots |
|
Consider 2,050 hours of CNN video from 1997-2002 |
|
1,688,000 shots |
|
67,700 segments/stories |
|
1 minute 53 seconds average story
duration |
|
4.5 seconds average shot duration |
|
23 shots per segment on average |
|
Result sets for queries number in the
hundreds or thousands |
|
Against the 2001 CNN collection, the top 1,000 stories for queries on “terrorism” and “bomb threat” produced 17,545 and 18,804 shots, respectively |
|
Users need a way to filter down tens of thousands of images | |
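The averages above follow from the totals; a quick consistency check:

```python
# Consistency check of the CNN collection statistics quoted above.
hours, shots, stories = 2050, 1_688_000, 67_700
seconds = hours * 3600
print(seconds / shots)    # ~4.4 s, matching the ~4.5 s average shot
print(seconds / stories)  # ~109 s, close to the quoted 1 min 53 s story
print(shots / stories)    # ~24.9, roughly the quoted 23 shots per segment
```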
|
Query-based thumbnail images added to VIBE,
timeline, map summaries |
|
Layout differs: overlap in VIBE/timeline;
tile in map |
|
Extend the concept of “highest scoring” to represent a country, a point in time, or a point on the VIBE plot (sketched below) | |
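A sketch of that extension: for each bin of the visualization (a country on the map, a point in time, or a VIBE plot cell), show the keyframe of the bin's highest-scoring document. Field names are illustrative assumptions.

```python
# Sketch of picking one representative thumbnail per visualization bin.
# Document fields and the bin_of mapping are illustrative assumptions.
def representative_thumbnails(documents, bin_of):
    """bin_of maps a document to its bin key (country, week, plot cell)."""
    best = {}
    for doc in documents:
        key = bin_of(doc)
        if key not in best or doc["score"] > best[key]["score"]:
            best[key] = doc
    return {key: doc["keyframe"] for key, doc in best.items()}
```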
|
Transcript and other derived text such as scene
text and characters overlaid on broadcast video provide input for
further processing |
|
Named entity tagging and common phrase extraction provide a filtering mechanism to reduce text into defined subsets (sketched below) |
|
Visualization interface allows subsets, e.g.,
people, organizations, locations, and common phrases, to be
displayed for the set of documents plotted in the visualization
view | |
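A sketch of that reduction step, assuming a hypothetical tagger tag_entities that yields (phrase, label) pairs; the subset names follow the slide above.

```python
# Sketch of grouping derived text into named-entity subsets for the
# visualization view. tag_entities is a hypothetical tagger.
from collections import Counter

def entity_subsets(texts, tag_entities):
    """Count tagged phrases by category across the plotted documents."""
    subsets = {"people": Counter(), "organizations": Counter(),
               "locations": Counter(), "phrases": Counter()}
    for text in texts:
        for phrase, label in tag_entities(text):
            subsets.setdefault(label, Counter())[phrase] += 1
    return subsets
```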
|
Most frequent common phrases and people from query
on “anthrax” against 2001 news listed beneath timeline
plot. | |
|
Image addition to summaries improved over
time |
|
Anchorperson removal for more
representative visuals |
|
More of the timeline consumed by images via better layout |
|
Image resizing under user control to see
detail on demand |
|
Text addition found to require new
interface controls |
|
Selection controls, e.g., list people,
organizations, locations, and/or common phrases |
|
Stopping rules, e.g., list at most X terms, or list terms only if they are covered by Y documents or Z% of the document set (sketched below) |
|
Show some text where the user’s attention is focused, at the mouse pointer, i.e., pop-up tooltip text |
|
_____ |
|
*Christel, M., et al. Collages as Dynamic
Summaries for News Video. In Proc. ACM Multimedia ’02
(Juan-les-Pins, France, Dec. 2002) |
|
| |
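A sketch of the stopping rules named above, with X, Y, and Z surfaced as parameters; the names and defaults are illustrative assumptions.

```python
# Sketch of stopping rules for listing terms in a collage: at most
# max_terms terms, each covered by at least min_docs documents or by
# min_fraction of the document set. Defaults are illustrative.
def select_terms(term_doc_counts, n_docs, max_terms=10,
                 min_docs=3, min_fraction=0.05):
    """term_doc_counts: mapping term -> number of documents containing it."""
    eligible = [(term, count) for term, count in term_doc_counts.items()
                if count >= min_docs or count / n_docs >= min_fraction]
    eligible.sort(key=lambda tc: -tc[1])
    return [term for term, _ in eligible[:max_terms]]
```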
|
Definitive information at NIST TREC Video
Track web site:
http://www-nlpir.nist.gov/projects/trecvid/ |
|
TREC series sponsored by the National
Institute of Standards and Technology (NIST) with additional support
from other U.S. government agencies |
|
Goal is to encourage research in information
retrieval from large amounts of text by providing a large test
collection, uniform scoring procedures, and a forum for
organizations interested in comparing their results |
|
Video Retrieval Track started in
2001 |
|
Goal is investigation of content-based retrieval
from digital video |
|
Focus on the shot as the unit of information
retrieval rather than the scene or story/segment/clip |
|
| |
|
2001 collection had ~11 hours of MPEG-1 video: 260 segments, 8,000 shots, 80,000 I-frames |
|
2002 search test collection had ~40 hours of MPEG-1 video: 1,160 segments, 14,524 shots (given by TREC-V), 292,000 I-frames |
|
2001 results |
|
http://trec.nist.gov/pubs/trec10/t10_proceedings.html |
|
Automatic search (no human in the loop) proved difficult: about 1/3 of the queries were unanswered by any of the automatic systems |
|
Research groups submitting search runs included
Carnegie Mellon, Dublin City Univ., Fudan Univ. China, IBM, Lowlands
Group Netherlands |
|
2002 results |
|
http://www.cdvp.dcu.ie/Papers/TREC2002_Video_report.pdf |
|
25 search topics developed by NIST |
|
Search runs submitted by list above, plus
CLIPS-IMAG (Fr), Imperial College London, Microsoft Research Asia,
U. Maryland, U. Oulu (Fin) |
| |
|
Specific item or person |
|
Eddie Rickenbacker, James Chandler, George
Washington, Golden Gate Bridge, Price Tower in Bartlesville
Okla. |
|
Specific fact |
|
Arch in Washington Square Park in NYC, map of
continental US |
|
Instances of a category |
|
football players, overhead views of cities, one or
more women standing in long dresses |
|
Instances of
events/activities |
|
people spending leisure time at the beach, one or
more musicians with audible music, crowd walking in an urban
environment, locomotive approaching the viewer |
| |
|
Outdoors: recognizably outdoor
location |
|
Indoors: recognizably indoor location |
|
Face: at least one human face with nose, mouth, and
both eyes |
|
People: group of two or more humans |
|
Cityscape: recognizably city/urban/suburban
setting |
|
Landscape: a predominantly natural inland setting,
i.e., one with little or no evidence of development by
humans |
|
Text Overlay: superimposed text large enough to be
read |
|
Speech: human voice uttering recognizable
words |
|
Instrumental Sound: sound produced by one or more
musical instruments, including percussion instruments |
|
Monologue: an event in which a single person is at
least partially visible and speaks for a long time without
interruption by another
speaker | |
|
Multiple-document storyboards |
|
Resolution and layout under user
control |
|
Query context plays a key role in filtering image
sets to manageable sizes |
|
TREC-V 2002 image feature set offers additional
filtering capabilities for indoor, outdoor, faces, people,
etc. |
|
Displaying filter counts and distributions guides their use in manipulating the storyboard views (sketched below) |
|
| |
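A sketch of computing those filter counts, assuming each shot carries a set of detected feature labels drawn from the TREC-V 2002 feature set.

```python
# Sketch of per-feature filter counts over a storyboard's shots.
# The 'features' field is an illustrative assumption.
from collections import Counter

def filter_counts(shots):
    """shots: list of dicts whose 'features' is a set of detected labels."""
    counts = Counter()
    for shot in shots:
        counts.update(shot["features"])
    return counts  # e.g., Counter({'outdoors': 412, 'face': 187, ...})
```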
|
Genre matters, e.g., tailoring to
news/documentaries |
|
Context, e.g., matching terms, and synchronization between imagery and narrative can reduce surrogate complexity for both single clips and sets of video clips |
|
Text with imagery more useful in video
summaries than either text alone or imagery alone |
|
Richness of imagery holds potential; look
to TREC-V and similar venues to chart progress |
|
“Overview first, zoom and filter, then
details on demand” |
|
Visual Information-Seeking Mantra of Ben
Shneiderman |
|
Direct manipulation interfaces leave the user in
control |
|
Iterative prototyping reveals areas
needing further work | |
|
Many Informedia Project and CMU research community
members contributed to this work; a partial list appears
here: |
|
Project Director: Howard Wactlar |
|
User Interface: Mike Christel, Chang “Liz” Huang,
Neema Moraveji, Adrienne Warmack, Dave Winkler |
|
Image Processing: Takeo Kanade, Norm Papernick,
Yanjun Qi, Robert Chen, Toshio Sato, Henry Schneiderman, Michael
Smith |
|
Speech and Language Processing: Alex Hauptmann, Ricky
Houghton, Rong Jin, Dorbin Ng, Michael Witbrock, Rong
Yan |
|
Informedia Library Essentials: Bob Baron, Colleen Everett,
Mark Hoy, Melissa Keaton,
Bryan Maher, Scott
Stevens | |