Electronic Proceedings of the
ACM Workshop on Effective Abstractions in Multimedia
November 4, 1995
San Francisco, California
Addressing the Contents of Video in a Digital Library
- Michael G. Christel
- Software Engineering Institute
- Carnegie Mellon University
- Pittsburgh, PA 15213-3890
- 412-268-7799
- mac@sei.cmu.edu
- http://www.cs.cmu.edu/~christel
Abstract
A digital video library must be efficient at giving users
precisely the material they need, due to the unique characteristics of video as
compared to text. To make the retrieval of bits faster, and to enable faster
viewing or information assimilation, the digital video library will need to
support partitioning video into small-sized clips and alternate representations
of the video.
For a general purpose digital video library, precision may have to be
sacrificed in order to ensure that the material the user is interested in will
be recalled in the result set to a query. The result set may then become quite
large, so the user may need to filter the set and decide what is important. This
can be accomplished by collapsing the playback rate of video objects in the
result set as well as adjusting the size of the objects in the result set. The
Informedia Digital Video Library
Project at Carnegie Mellon University deals with these issues and is
introduced here with pointers to additional information.
A library cannot be very effective if it
is merely a collection of information without some understanding of what is
contained in that collection. Without that understanding it could take hundreds
of hours of viewing to determine if an item of interest is in a 1000 hour video
library. Obviously, such a library would not be used very often. Marchionini and
Maurer reflect on information accessible via the Internet [Marchionini95,
p. 72]:
- It has often been said that the Internet is starting to provide the
largest library humankind has ever had. As true as this may be, the Internet
is also the messiest library that ever has existed.
Information is found best on the Internet when the providers augment the
information with rich keywords and descriptors, provide links to related
information, and allow the contents of their pages to be searched and indexed.
There is a long history of sophisticated parsing and indexing for text
processing in various structured forms, from ASCII to PostScript to SGML and
HTML. However, how does one represent video content to support content-based
retrieval and manipulation?
An hour-long motion video segment clearly contains some information suitable
for indexing, so that a user can find an item of interest within it. The problem
is not the lack of information in video, but rather the inaccessibility of that
information to our primarily text-based information retrieval mechanisms today.
In fact, the video likely contains an overabundance of information, conveyed in
both the video signal (camera motion, scene changes, colors) and the audio
signal (noises, silence, dialogue). A common practice today is to log or tag the
video with keywords and other forms of structured text to identify its contents.
Such text descriptors have the following limitations:
- Manual processes are tedious and time consuming.
- Manual processes are seriously incomplete. Even if full transcripts of the
audio track are entered, other information about the video will almost surely
be left out, such as the identity of persons and objects in each scene.
- Transcripts are inaccurate, with mistypings and incorrect classifications
often introduced.
- Text descriptors are biased by whatever predetermined structures are used
to classify the video contents.
- Cinematic information is complex and difficult to describe, especially for
non-experts.
- Text descriptors are biased by the ambiguity of natural language.
The Informedia Digital Video
Library (IDVL) Project at Carnegie Mellon University is an ongoing research
project begun in 1994, but leveraging two decades of related CMU research [Stevens94,
Hauptmann95,
Smith95].
Central to the project is the establishment of a large, online digital video
library that goes beyond keyword-only approaches to indexing video content. Some
other techniques are surveyed below, followed by a concluding outline of how the
IDVL Project is addressing this task.
Anyone who has
retrieved video from the Internet realizes that because of its size a video clip
can take a long time to move from one location to another, such as from the
digital video library to the user. Likewise, if a library consists of only 30
minute clips, when users check one out it may take them 30 minutes to determine
whether the clip met their needs. Returning a full one-half hour video when only
one minute is relevant is much worse than returning a complete book, when only
one chapter is needed. With a book, electronic or paper, tables of contents,
indices, skimming, and reading rates permit users to quickly find the chunks
they need. Since the time to scan a video cannot be dramatically shorter than
the real time of the video, a digital video library must be efficient at giving
users the material they need. To make the retrieval of bits faster, and to
enable faster viewing or information assimilation, the digital video library
will need to support partitioning video into small-sized clips and alternate
representations of the video.
Just as text books can be
decomposed into paragraphs embodying topics of discourse, the video library can
be partitioned into video paragraphs. The difficulties arise in how this
partitioning is to be carried out. Does the author of the video information
supply paragraph tags marking how a larger video should be divided into
smaller clips? This is routinely accomplished in text through chapters,
sections, subheadings, and similar conventions. Analogous structure is contained
in video through scenes, shots, camera motions, and transitions. Manually
describing this structure in a machine readable form would place a tremendous
burden on the video author, and in any case would not solve the partitioning
problem for pre-existing video material created without paragraph markings.
Perhaps the paragraph boundaries can be inferred from whatever parsing and
indexing is done on the video segment. Some video, such as news broadcasts, has
a well-defined structure which could be parsed into short video paragraphs for
different news stories, sports, and weather. Techniques monitoring the video
signal can break the video into sequences sharing the same spatial location, and
these scenes could be used as paragraphs.
Davis cautions, however, that physically segmenting a video library into
clips imposes a fixed segmentation on the video data [Davis94].
The library is decomposed into a fixed number of clips, i.e., a fixed number of
small video files, which are separated from their original context and may not
meet the future needs of the library user. A more flexible alternative is to
logically segment the library by adding sets of video paragraph markers and
indices, but keeping the video data intact in its original context. A basic
tenet of MIT's Media Streams is that what we need are "representations which
make clips, not representations of clips" [Davis94,
p. 121].
In order for a digital video library to be logically segmented as such, the
system must be capable of delivering a subset of a movie (rather than having
that subset stored as its own movie) quickly and efficiently to the user. Video
compression schemes will have to be chosen carefully for the library to retain
the necessary random access within a video to allow it to be logically
segmented.
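As a rough illustration of logical segmentation, the sketch below keeps each video intact and stores only paragraph markers as (video, start, end) offsets. The `Paragraph` record, the `paragraphs_overlapping` helper, the identifier "news-001", and all offsets are hypothetical names and values chosen for this example; they are not taken from Media Streams or from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    """A logical video paragraph: two markers into an intact source video."""
    video_id: str
    start_s: float  # offset from the start of the source video, in seconds
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def paragraphs_overlapping(index, video_id, start_s, end_s):
    """Return the paragraphs of one video that overlap [start_s, end_s)."""
    return [
        p for p in index
        if p.video_id == video_id and p.start_s < end_s and p.end_s > start_s
    ]


# Two independent marker sets over the same, unmodified video (hypothetical data).
by_story = [
    Paragraph("news-001", 0.0, 95.0),
    Paragraph("news-001", 95.0, 240.0),
]
by_shot = [
    Paragraph("news-001", 0.0, 12.5),
    Paragraph("news-001", 12.5, 95.0),
    Paragraph("news-001", 95.0, 130.0),
]

# A query window that straddles the first story boundary matches both stories.
hits = paragraphs_overlapping(by_story, "news-001", 90.0, 100.0)
```

Because the markers live outside the video data, several competing segmentations (here, by story and by shot) can coexist over the same file, which is the flexibility that a fixed physical segmentation gives up.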
In
addition to trying to size the video clips appropriately, the digital video
library can provide the users alternate representations for the video, or layers
of information. Users could then cheaply (in terms of data transfer time,
possible economic cost, and user viewing time) review a given layer of
information before deciding upon whether to incur the cost of richer layers of
information or the complete video clip. For example, a given half hour video may
have a text title, a text abstract, a full text transcript, a representative
single image, and a representative one minute "skim" video, all in addition to
the full video itself. The user could quickly review the title and perhaps the
representative image, decide on whether to view the abstract and perhaps full
transcript, and finally make the decision on whether to retrieve and view the
full video.
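The layered review described above can be sketched as an ordering of representations by cost. In the example below, the `Layer` record, the helper names, and every byte size are assumptions made up for illustration; only the list of layer types comes from the text, and a real library might order layers by something other than transfer size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    size_bytes: int  # hypothetical transfer size, for illustration only


# Alternate representations of one half-hour video (sizes are invented).
LAYERS = [
    Layer("full video", 300_000_000),
    Layer("title", 80),
    Layer("skim video", 10_000_000),
    Layer("abstract", 1_000),
    Layer("transcript", 30_000),
    Layer("poster image", 20_000),
]


def review_order(layers):
    """Cheapest layers first, so the user pays for richer layers only if needed."""
    return sorted(layers, key=lambda layer: layer.size_bytes)


def cost_until(layers, stop_name):
    """Total bytes transferred if the user stops after viewing `stop_name`."""
    total = 0
    for layer in review_order(layers):
        total += layer.size_bytes
        if layer.name == stop_name:
            return total
    raise ValueError(f"unknown layer: {stop_name}")


order = [layer.name for layer in review_order(LAYERS)]
text_only_cost = cost_until(LAYERS, "transcript")
```

Under these invented sizes, a user who rejects the clip after reading the transcript has transferred a few tens of kilobytes rather than the hundreds of megabytes of the full video.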
These layered approaches to describing video are implemented in a number of
systems [Hauptmann95,
Zhang95,
Rao95].
The problems are similar to the indexing problem: how should the alternate
representations or descriptors be generated? How can they be as complete and
accurate as possible, and can tools alleviate the labor and tediousness involved
in their creation?
The utility of the digital video library can be judged on the
ability of the users to get the information they need from the library easily
and efficiently. The two standard measures of performance in information
retrieval are recall and precision. Recall is the proportion of relevant
documents that are actually retrieved, and precision is the proportion of
retrieved documents that are actually relevant. These two measures may be traded
off one for the other. For example, returning one document that is a known match to a
query guarantees 100% precision, but fails at recall if a number of other
documents were relevant as well. Returning all of the library's contents for a
query guarantees 100% recall, but fails miserably at precision and filtering the
information. The goal of information retrieval is to maximize both recall and
precision.
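The two measures and the two extreme cases just described can be checked with a few lines of code. This is a minimal sketch: the function name `precision_recall`, the 100-clip library, and the four relevant clips are hypothetical values chosen for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Both ratios are undefined for an empty denominator; this sketch reports 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


library = {f"clip{i}" for i in range(1, 101)}        # 100 clips in total
relevant = {"clip1", "clip2", "clip3", "clip4"}      # 4 of them match the query

# Returning a single known match: perfect precision, poor recall.
one_match = precision_recall({"clip1"}, relevant)

# Returning the whole library: perfect recall, poor precision.
everything = precision_recall(library, relevant)
```

Here `one_match` is (1.0, 0.25) and `everything` is (0.04, 1.0), which makes the trade-off between the two measures concrete.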
In many information systems, precision is maximized by narrowing the domain
considerably, extensively indexing the data according to the parameters of the
domain, and allowing queries only via those parameters. This approach is taken
by many CD-ROM data sets, but has the following limitations:
- Data could really only be added if it falls within the boundaries of the
domain established by the predefined indices.
- Access to the data is limited by the predefined indices.
Researchers of multimedia information systems have raised concerns over the
difficulties in adequately indexing a video database so that it can be used as a
general purpose library, rather than say a more narrow domain such as a network
news archive [Davis94,
Zhang95].
For general purpose use, there may not be enough domain knowledge to apply to
the user's query and to the library index in order to return only a very small
subset of the library to the user matching just the given query. For example, in
a soccer-only library, a query about "goal" can be interpreted to mean a score,
and just those appropriate materials can be retrieved accordingly. In a more
open context, "goal" could mean a score in hockey or a general aim or objective. A
larger set of results will need to be returned to the user, given less domain
knowledge to leverage.
In attempting to create a general purpose digital video library, precision
may have to be sacrificed in order to ensure that the material the user is
interested in will be recalled in the result set. The result set may then become
quite large, so the user may need to filter the set and decide what is
important. Three principal issues with respect to searching for information are
how to let the user
- quickly skim the video objects to locate sections of interest
- adjust the size of the video objects returned
- identify desired video clips when multiple objects are returned
Browsing can help users
quickly and intelligently filter a number of results to the precise information
they are seeking. However, browsing video is not as easy as browsing text.
Scanning by jumping a set number of frames may skip the target information
completely. On the other hand, accelerating the playback of motion video to, for
instance, twenty times normal rate presents the information at an
incomprehensible speed.
The difference between video or audio and text or images is that video and
audio have constant rate outputs that cannot be changed without significantly
and negatively impacting the user's ability to extract information. Video and
audio are constant-rate, continuous-time media. Their temporal nature is
constant due to the requirements of the viewer/listener. Text is a variable-rate
continuous medium. Its temporal nature is manifest in users, who read and
process the text at different rates.
While video and audio data types are constant rate, continuous-time, the
information contained in them is not. In fact, the granularity of the
information content is such that a one-half hour video may easily have one
hundred semantically separate chunks. The chunks may be linguistic or visual in
nature. They may range from sentences to paragraphs and from images to scenes.
If the important information from a video can be retrieved and the less
important information collapsed, the resulting "skim" video could be browsed
quickly by the user and still give him or her a great deal of understanding
about the contents of the complete video clip. This introduces the issue of
deciding what is important within a video clip and worthy of preservation in a
"skim" video.
Another approach to
letting the user browse and filter through search results more efficiently is to
return smaller video clips in the result set. There are about 150 spoken words
per minute of "talking head" video. One hour of video contains 9,000 words,
which is about 15 pages of text. Even if a high playback rate of 3 to 4 times
normal speed were comprehensible, continuous play of audio and video is a totally
unacceptable browsing mechanism. For example, assume that a desired piece of
information is halfway through a one hour video file. Fast forwarding at 4 times
normal speed would take 7.5 minutes to find it. Returning the optimally sized
chunk of digital video is one aspect of the solution to this problem.
If the user issues a query and receives ten half-hour video clips, it could
take them hours to review the results to determine their relevance, especially
given the difficulties in collapsing video playback as mentioned above. If the
result set were instead ten two-minute clips, the user's review time would be
reduced considerably. In order to return small, relevant clips the video
contents need to be indexed well and sized appropriately, tasks discussed
earlier in this abstract.
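The figures quoted in the two paragraphs above follow from simple arithmetic, reproduced in the sketch below. The speaking rate comes from the text; the 600 words per page is an assumption inferred from 9,000 words being about 15 pages, and the function names are hypothetical.

```python
WORDS_PER_MINUTE = 150  # speaking rate for "talking head" video, from the text
WORDS_PER_PAGE = 600    # assumption: implied by 9,000 words being about 15 pages


def transcript_words(minutes):
    """Approximate number of spoken words in a video of the given length."""
    return minutes * WORDS_PER_MINUTE


def transcript_pages(minutes):
    """Approximate pages of text equivalent to the spoken words."""
    return transcript_words(minutes) / WORDS_PER_PAGE


def seek_minutes(target_offset_minutes, speedup):
    """Viewing time needed to reach a target by continuous fast-forward play."""
    return target_offset_minutes / speedup


def review_minutes(clip_count, clip_minutes, speedup=1.0):
    """Time to watch every clip in a result set from start to finish."""
    return clip_count * clip_minutes / speedup


words = transcript_words(60)        # one hour of video
pages = transcript_pages(60)
halfway = seek_minutes(30, 4)       # target halfway through an hour, 4x playback
long_set = review_minutes(10, 30)   # ten half-hour clips
short_set = review_minutes(10, 2)   # ten two-minute clips
```

The ten half-hour clips take 300 minutes (five hours) to review at normal speed, against 20 minutes for ten two-minute clips, a fifteenfold reduction before any skimming is applied.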
Users often wish to peruse
video much as they flip through the pages of a book. Unfortunately, today's
mechanisms for this are inadequate. The results from a query to a video library
may be too large to be effectively handled with conventional presentations such
as a scrollable list. To enable better filtering and browsing, the features
deemed important by the user should be emphasized and made visible. What are
these features, though, and how can they be made visible, especially if the
digital video library is general purpose rather than specialized to a particular
domain? These questions bring us back to the problem of identifying the content
within the video data and representing it in forms that facilitate browsing,
visualization, and retrieval. Researchers at Xerox PARC's Intelligent
Information Access and Information Visualization projects note that the
information in digital libraries should not just be retrieved but should allow
for rich interaction, so that users can tailor the information into effective
and memorable renderings appropriate to their needs [Rao95].
If such rich interaction can be achieved, it can be used to browse not only
query result sets but the contents of the full library itself, allowing for
another access mechanism to the information.
The IDVL
Project builds on the assumption that a video's contents are conveyed in both
the narrative (speech and language) and the image. Only by the collaborative
interaction of image, speech and natural language understanding technology can
diverse video collections be successfully populated, segmented, indexed, and
searched with satisfactory recall and precision. This approach compensates for
problems of interpretation and search in error-full and ambiguous data
environments.
Using a high-quality speech recognizer, the sound track of each video is
converted to a textual transcript. A language understanding system analyzes and
organizes the transcript and stores it in a full-text information retrieval
system, as well as generates brief text abstracts for the videos. Image
understanding techniques are used for segmenting video sequences by
automatically locating boundaries of shots, scenes, and conversations.
Integration of these techniques provides for richer indexing and segmentation of
the video library. For example, text displayed in the video can be located via
image processing and then added to the body of text for natural language
processing. As another example, having both a visual scene change and a change
in the narrative increases the likelihood of a segment boundary.
Figure 1. Techniques underlying segmentation of video into
smaller paragraphs.
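One widely used way to locate shot boundaries automatically is to compare intensity or color histograms of adjacent frames and flag large differences. The sketch below assumes grayscale frames given as flat lists of 0-255 pixel values, an 8-bin histogram, and a threshold of 0.5; these names and values are assumptions for illustration, not the IDVL implementation.

```python
def histogram(frame, bins=8, max_value=256):
    """Count pixels per intensity bin for one grayscale frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return counts


def histogram_distance(a, b):
    """Normalized L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def shot_boundaries(frames, threshold=0.5):
    """Indices i where frame i differs sharply from frame i - 1."""
    hists = [histogram(frame) for frame in frames]
    return [
        i for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]


# Toy frames (hypothetical data): a dark shot, a bright shot, then dark again.
dark = [10, 20, 30, 25] * 4
bright = [200, 220, 240, 210] * 4
frames = [dark, dark, bright, bright, dark]

cuts = shot_boundaries(frames)
```

A visual cut found this way is only one source of evidence; as noted above, a coinciding change in the narrative would raise the likelihood that the cut is also a paragraph boundary.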
Library exploration is based on these same techniques. The user can browse
through parallel presentations of alternate representations of video clips, from
titles to single image "poster frames" to skims. In creating a skim, image
understanding techniques are used to select important, high interest segments of
video. Scene changes (as marked by color histogram spikes characterizing big
differences in adjacent frames), camera motion, object detection (e.g., the
entrance and exit of a human face in the scene), and text detection (e.g., a
title or name of a person being interviewed overlaid on the video) are used in
the heuristics determining which video should be included in the skim. Using
parallel criteria for linguistic information, natural language processing
selects appropriate audio. For example, the term frequency-inverse document
frequency weighting scheme can be used to determine word relevance, with other
heuristics employed to further filter which audio to use, such as not repeating
the same word within a certain time limit. Selected audio and video are then
integrated into a skim of the original video.
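The audio selection step can be illustrated with a small TF-IDF computation followed by a repeat filter. The text names the weighting scheme but not a specific variant, so the formula below (relative term frequency times the natural log of inverse document frequency) is one common choice rather than the project's own; the function names, sample transcripts, threshold, and ten-second gap are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def pick_keywords(timed_words, weight, threshold, min_gap_s):
    """Keep high-weight words, skipping repeats of a word within min_gap_s seconds."""
    last_used = {}
    picked = []
    for time_s, word in timed_words:
        if weight.get(word, 0.0) < threshold:
            continue
        if word in last_used and time_s - last_used[word] < min_gap_s:
            continue
        last_used[word] = time_s
        picked.append((time_s, word))
    return picked


# Hypothetical transcripts of three video paragraphs.
transcripts = [
    "the shuttle launch was delayed by weather".split(),
    "the weather forecast calls for rain".split(),
    "the senate vote was delayed".split(),
]
weights = tf_idf(transcripts)

# Hypothetical word timings (seconds) within the first paragraph.
timed = [(0.0, "shuttle"), (1.0, "the"), (2.0, "shuttle"), (12.0, "shuttle")]
selected = pick_keywords(timed, weights[0], threshold=0.1, min_gap_s=10.0)
```

In this toy data "the" occurs in every transcript and receives a weight of zero, while "shuttle" occurs in only one and is kept; its second occurrence is dropped because it falls within the repeat interval.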
Figure 2. Portion of skim created from significant audio and
video data.
For more details on the IDVL interface in a news-on-demand application,
consult the on-line walkthrough found in [Hauptmann95].
This work is partially funded by the
National Science Foundation, the National Aeronautics and Space Administration,
and the Advanced Research Projects Agency. For a complete list of sponsors and
partners for the Informedia Digital Video Library Project, consult the IDVL sponsor list.
- [Davis94]
- Davis, M. "Knowledge Representation for Video." Proc. of AAAI '94,
1994, Seattle, WA, pp. 120-127.
- [Hauptmann95]
- Hauptmann, A.G., Witbrock, M.J., & Christel, M.G. "News-on-Demand: an
Application of Informedia Technology." Online document available at URL
http://www.informedia.cs.cmu.edu/documents/dlib95_haupt.htm,
D-Lib Magazine, September 1995.
- [Marchionini95]
- Marchionini, G. and Maurer, H. "The Roles of Digital Libraries in Teaching
and Learning." Communications of the ACM, 38, April 1995, pp.
67-75.
- [Rao95]
- Rao, R., Pedersen, J., Hearst, M., Mackinlay, J., Card, S., Masinter, L.,
Halvorsen, P.-K., and Robertson, G. "Rich Interaction in the Digital
Library." Communications of the ACM, 38, April 1995, pp. 29-39.
- [Smith95]
- Smith, M.A., & Christel, M.G. "Automating the Creation of a Digital
Video Library." Online document available at URL
http://www.ius.cs.cmu.edu/afs/cs.cmu.edu/Web/People/msmith/mm_95_msmith.html,
Proceedings of the ACM Multimedia '95 Conference, San Francisco, November
1995.
- [Stevens94]
- Stevens, S., Christel, M., & Wactlar, H. "Informedia: Improving Access
to Digital Video." interactions, 1, October 1994, pp. 67-71.
- [Zhang95]
- Zhang, H., Tan, S., Smoliar, S., and Gong, Y. "Automatic Parsing and Indexing
of News Video." Multimedia Systems, 2, 1995, pp. 256-266.