The Informedia Digital Video Library:

Integrated Speech, Image and Language Understanding for the Creation
and Exploration of Digital Video Libraries

Proposal responding to Announcement NSF 93-141 Research on Digital Libraries Copyright (c) 1994 by Carnegie Mellon University

Principal Investigators:

Takeo Kanade, Robotics Institute
Michael Mauldin, Center for Machine Translation
Raj Reddy, School of Computer Science
Marvin Sirbu, Information Networking Institute
Scott Stevens, Software Engineering Institute
Doug Tygar, Computer Science Department
Howard Wactlar, School of Computer Science

1. Executive Summary
2. Project Description
3. Testbed Facility
4. Organizational Roles
5. References


1. Executive Summary

The Informedia library project will establish a large, on-line digital video library by developing intelligent, automatic mechanisms to populate the library and allow for full-content and knowledge-based search and retrieval via desktop computer and metropolitan area networks. Initially, the library will be populated with 1000 hours of raw and edited video drawn from video assets of WQED/Pittsburgh, Fairfax County (VA) Public Schools, and the Open University (U.K.). We will deploy the library at Carnegie Mellon University and local area K-12 schools.

The distinguishing feature of our technical approach is the integrated application of speech, language and image understanding technologies for efficient creation and exploration of the library. Using a high-quality speech recognizer, the sound track of each videotape is converted to a textual transcript. A language understanding system then analyzes and organizes the transcript and stores it in a full-text information retrieval system. Likewise, image understanding techniques are used for segmenting video sequences by automatically locating boundaries of shots, scenes, and conversations. Exploration of the library is based on these same techniques. Additionally, the user interface will be instrumented to investigate user protocols and human factors issues peculiar to manipulating video segments. We will implement a network billing server to study the economics of charging strategies and also incorporate mechanisms to ensure privacy and security.

The Informedia Project has industry partners who are committed to providing substantial resources and base technology. They will evaluate commercial opportunities for the underlying technology and for the provision of information services. Currently, our committed partners include Digital Equipment, Microsoft, QED Enterprises, and Bell Atlantic. Together, these companies span the requisite disciplines for digital video library commercialization.

1.1 Motivation

Vast digital libraries of information will soon become available on the nation's Information Superhighway as a result of emerging multimedia computing technologies. These libraries will have a profound impact on the conduct of business, professional, and personal activities. However, it is not enough to simply store and play back information as in commercial video-on-demand services. New technology is needed to organize and search these vast data collections, retrieve the most relevant selections, and effectively reuse them.

The Informedia Library project proposes to develop these new technologies and to embed them in a video library system primarily for use in education and training. The nation's schools and industry together spend between $400 and $600 billion per year on education and training, an activity that is 93% labor-intensive, with little change in teacher productivity ratios since the 1800s. The new digital video library technology will allow independent, self-motivated access to information for learning, exploration, and research. This will bring about a revolutionary improvement in the way education and training are delivered and received.

More specifically, the Informedia Digital Video Library proposes to develop intelligent, automatic mechanisms that provide full-content search of, and selective retrieval from, an extremely large on-line digital video library. We will build the initial library from WQED Pittsburgh's video archives, video course material produced by the BBC for the Open University and material from Fairfax County (VA) Public Schools' Electronic Field Trips series. We will develop the tools that will populate the library and support access via desktop computers and local-to-metropolitan area networks. We will also research the economics and security of network accessible video intellectual property, which is a vital issue in the future of commercial and public video libraries.

Jointly conceived by Carnegie Mellon University and QED Communications (WQED/Pittsburgh), the Informedia Library project integrates regional and international resources. Carnegie Mellon is an acknowledged international leader in computer science education and research. QED is a major Public Broadcasting Service production center and winner of thirty-five Emmy and eight Peabody Awards. The UK's Open University, a model for distance teaching universities world-wide, brings us access to one of the world's largest collections of educational video. Fairfax County Public Schools has been a pioneer in satellite distribution of elementary school materials combined with networked communications for learning communities. The project's industrial sponsors currently include some of the nation's leading technology companies in computing, software and communications.

Several factors distinguish our project from similar efforts. First, our technical approach integrates image, speech and language understanding to operate simultaneously on the same data stream. This reaches beyond those systems for video or text searching that succeed or fail on the strength of one mode of recognition or interpretation. Second, the integration of these technologies provides new research opportunities, both within and across disciplines. Third, we will produce a highly usable library of commercial broadcast quality and a mechanism for disseminating and commercializing our products. Fourth, our work also addresses human factors issues: learning, interaction, motivation, and effective usage modes for K-12, post-secondary, and life-long learning. And fifth, we incorporate billing, variable pricing and security mechanisms that enable and encourage commercialization.

1.2 Technical Approach

Our approach utilizes several techniques for content-based searching and video sequence retrieval. Content is conveyed in both the narrative (speech and language) and the image. Only by the collaborative interaction of image, speech and natural language understanding technology can we successfully populate, segment, index, and search diverse video collections with satisfactory recall and precision.

This approach uniquely compensates for problems of interpretation and search in errorful and ambiguous data sets. We start with a highly accurate, speaker-independent, connected speech recognizer that will automatically transcribe video soundtracks and store them in a full-text information retrieval system. This text database allows for rapid retrieval of individual video segments that satisfy an arbitrary query based on the words in the soundtrack. Image and natural language understanding enable us to locate and delineate the corresponding "video paragraph" by using combined source information about camera cuts, object tracking, speaker changes, timing of audio and/or background music, and changes in the content of spoken words. Controls allow the user to interactively request the corresponding video page, to adjust the quantity of video returned, to intelligently "skim" the returned content, and to reuse the stored video objects in diverse ways.
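As an illustration of the retrieval step, a time-aligned transcript can be treated as an inverted index from words to spoken-time offsets, so that a text query answers with video time positions. The data layout and function names below are hypothetical assumptions, not the project's actual design:

```python
# Toy sketch: word-level inverted index over a time-aligned transcript.
# Each word maps to the seconds offsets at which it was spoken.
from collections import defaultdict

def build_index(transcript):
    """transcript: list of (word, start_seconds) pairs."""
    index = defaultdict(list)
    for word, start in transcript:
        index[word.lower()].append(start)
    return index

def query(index, words, window=30.0):
    """Return candidate segment start times at which all query words
    occur within `window` seconds of one another."""
    hits = []
    for t in index.get(words[0].lower(), []):
        if all(any(abs(u - t) <= window for u in index.get(w.lower(), []))
               for w in words[1:]):
            hits.append(t)
    return hits

transcript = [("communications", 12.0), ("satellites", 12.8),
              ("weather", 95.0), ("satellites", 96.1)]
idx = build_index(transcript)
print(query(idx, ["communications", "satellites"]))  # → [12.0]
```

The returned offsets would then seed the image- and language-based "video paragraph" delineation described above.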

This project builds upon extensions and applications of existing and evolving technology from both new and established research programs at CMU and elsewhere. These programs include automated speech recognition (Sphinx-II), image understanding (Machine Vision), natural language processing (TIPSTER, Ferret, Scout), human-computer interaction (ALT), distributed data systems (AFS, DFS), networking (INI), and security and economics of access (NetBill). Multi-disciplinary teams will apply these technologies to the problems of video information systems, permitting rapid population of the library and agent-assisted retrieval from vast information stores in a commercially feasible manner.

1.2.1 Integrating the System Components

The Informedia system will integrate data and processes from several independent research efforts. Additionally, the Informedia system will allow modification and incorporation of emerging standards in media compression and storage, breakthrough products for media manipulation and display, and evolving high-bandwidth communication standards and services. Wherever possible and practical, we will build upon existing commercial hardware and software products. The major subtasks of developing Informedia are described in the sections that follow. The system implementation will incorporate fall-back mechanisms for the most ambitious features in case they fail to perform adequately. This also permits us to release early versions of the system, with more limited libraries and capabilities, to project affiliate organizations before all the automated facilities are fully operational.

1.2.2 Building the Video Database

We will populate our initial video library with materials from three primary sources: (1) QED's vast library of science programs, documentaries, and original source materials, (2) the BBC's educational video course material for the U.K.'s Open University, and (3) the Fairfax County (VA) public schools' Electronic Field Trip series. Early versions will use commercial compression formats (e.g., Intel's DVI), thus requiring only 10 Mbytes per source-video minute to achieve VHS-quality playback (256x240 pixels). Later versions may utilize newer compression technologies (e.g., MPEG, MPEG-II). We anticipate that the primary media-server file system will require one terabyte (10^12 bytes) of storage and comprise over 1000 hours of video.
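The storage estimate follows from simple arithmetic on the stated figures (10 Mbytes per source-video minute, 1000 hours of video):

```python
# Back-of-envelope check of the storage estimate, assuming the stated
# 10 MB per source-video minute for DVI-compressed VHS-quality playback.
MB_PER_MINUTE = 10
hours = 1000
video_bytes = hours * 60 * MB_PER_MINUTE * 10**6   # 6.0e11 bytes
print(video_bytes / 10**12)  # → 0.6 (terabytes of raw video)
```

The raw video alone accounts for roughly 0.6 terabytes; the one-terabyte figure presumably leaves headroom for transcripts, indices, and collection growth.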

Our collection will incorporate not only the broadcast programs themselves, but also the unedited source materials from which they were derived. Such background materials enrich our library significantly, as reference resources and for uses other than those originally intended. They also enlarge it greatly: Typical QED sources run 50 to 100 times longer than the corresponding broadcast footage. A recent interview with Arthur C. Clarke for WQED's Space Age series, for example, produced two minutes of airtime from four hours of tape.

Our particular combination of video resources should enable our users to retrieve material on the same subject presented at varying levels of complexity, ranging from the popular, example-based presentation often used in PBS documentaries, through elementary and high school presentations from Fairfax County, to the more advanced college-level treatment by the OU. The self-learner at any level can iterate on the search in order to build understanding and comprehension through multiple examples and decreasing (or increasing) depth, complexity, and formalism.

1.2.3 Powerful Indexing Through Automated Transcription

The Informedia system will use the Sphinx-II speech recognition system to transcribe narratives and dialogues automatically. Sphinx-II is a large-vocabulary, speaker-independent, continuous speech recognizer developed at Carnegie Mellon. With recent advances in acoustic and language modeling, it has achieved a 5% error rate on standardized tests for a 5000-word, general dictation task.
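Recognition error rates such as the 5% figure above are conventionally computed as word error rate: the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the reference length. A minimal sketch (illustrative only, not part of Sphinx-II):

```python
# Word error rate via word-level Levenshtein edit distance.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the satellite relays the signal",
                      "the satellites relay the signal"))  # → 0.4
```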

By itself, the video library's larger vocabulary will degrade the recognition rate. However, several innovative techniques will be exploited to reduce errors. The recognizer will additionally employ program-specific information, such as topic-based lexicons and interest-ranked word lists. Word hypotheses will be improved by using adaptive, "long-distance" language models, and we will use a multi-pass recognition approach that considers multi-sentence contexts. Aiding these processes is the fact that our video footage will typically be of high audio quality and will be narrated by trained professionals.

The transcript generated by Sphinx-II need not be viewed by users, but will be hidden, and will be time-aligned with the video for subsequent retrieval. Because of this, we expect our system will tolerate higher error rates than those that would be required to produce a human-readable transcript. On-line scripts and closed-captioning, where available, can provide base vocabularies for recognition and searchable texts for early system versions.

1.2.4 Intelligent Searching with Natural Language

Current retrieval technology works well on textual material from newspapers, electronic archives and other sources of grammatically correct and properly spelled written content. Furthermore, natural language queries allow straightforward description by the user of the subject matter desired. However, the video retrieval task, based upon searching errorful transcripts of spoken language, challenges the state of the art. Even understanding a perfect transcription of the audio would be too complicated for current natural language technology.

The Informedia system will extend current leading-edge performance systems and algorithms (TIPSTER, Ferret, Scout) and apply them to augment and index the Sphinx-transcribed soundtracks. Tasks will include (1) developing facilities to automate error reduction in transcripts through post-processing, (2) identifying topics and subtopics in transcript collections, (3) exploring "sound-alike" matching as a technique to overcome out-of-vocabulary misses during transcription, and (4) enriching the natural language retrieval request interface. This integrated approach will significantly increase the system's ability to locate a particular video segment quickly, despite transcription errors, inadequate keywords, and ambiguous sounds.
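One classic "sound-alike" key is the Soundex code. The simplified variant below (in which vowels, h, and w all reset the previous code) illustrates how words that sound alike can be made to collide on the same index key, so a misrecognized "Clark" would still match an indexed "Clarke":

```python
# Simplified Soundex: first letter kept, consonants mapped to digit
# classes, adjacent duplicates collapsed, padded/truncated to 4 chars.
CODES = {c: d for letters, d in
         [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
          ("l", "4"), ("mn", "5"), ("r", "6")]
         for c in letters}

def soundex(word):
    word = word.lower()
    key = word[0].upper()
    prev = CODES.get(word[0], "")
    for c in word[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]

print(soundex("Clarke"), soundex("Clark"))  # → C462 C462
```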

1.2.5 Video Segmentation

Our work will automate the segmentation process with techniques developed in CMU's Image Understanding Systems Laboratory. Our design avoids the time-consuming, conventional procedure of reviewing a file frame-by-frame around an index entry point. To identify segment boundaries, the Informedia system will automatically locate begin/end points for the shot, scene, conversation, etc. by applying machine vision methods that interpret image sequences. This approach can track objects, even across camera motions, to determine the limits of a video segment. The resulting segmentation process will be faster, more precise, and more easily controlled than the manual method allows.
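One standard machine-vision cue for locating camera cuts, sketched here under simplifying assumptions (grayscale frames as flat pixel lists, a single fixed threshold), is the histogram difference between consecutive frames:

```python
# Declare a shot boundary wherever the gray-level histogram difference
# between consecutive frames exceeds a per-pixel threshold.
def histogram(frame, bins=8):
    h = [0] * bins
    for pixel in frame:          # pixel values in 0..255
        h[pixel * bins // 256] += 1
    return h

def shot_boundaries(frames, threshold=0.5):
    cuts = []
    for i in range(1, len(frames)):
        a, b = histogram(frames[i - 1]), histogram(frames[i])
        diff = sum(abs(x - y) for x, y in zip(a, b)) / len(frames[i])
        if diff > threshold:
            cuts.append(i)
    return cuts

dark, bright = [20] * 100, [230] * 100
print(shot_boundaries([dark, dark, bright, bright]))  # → [2]
```

In practice, cues like this would be combined with the object tracking and camera-motion analysis described above, since a histogram change alone cannot distinguish a cut from, say, a sudden lighting change.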

1.2.6 Engineering an Effective Interface

The application of speech recognition, natural language, and image understanding technologies will permit intelligent searching of large corpora of digital video and audio. Nonetheless, three significant problems remain: how to aid users in the identification of desired video when multiple objects are returned, how to let the user adjust the length of the video objects returned, and how to let the user quickly skim the video objects to locate sections of interest. Solutions to these problems require an intimate understanding of digital video and audio, and the development of new interface modes based on that understanding.

The user interface will likely include a context-sizing slide switch. This control will enable the user to adjust the duration and information content of retrieved segments. A skimming dial control will allow the user to adjust the information playback rate as well as the media playback rate. When a search returns multiple hits, parallel presentation will be used to simultaneously present numerous intelligently chosen sequences. Functionality will be provided to enable the user to extract subsequences from the delivered segments and reuse them for other purposes in various forms and applications.
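One plausible interpretation of such a skimming control, with illustrative names and units (not the project's actual design), is a dial that selects what fraction of each retrieved segment is actually played, trading completeness for speed while preserving segment structure:

```python
# Hypothetical skimming dial: play only the leading fraction of each
# "video paragraph", where the dial value is that fraction.
def skim(paragraphs, dial):
    """paragraphs: list of (start, end) offsets in seconds; dial in (0, 1]."""
    plan = []
    for start, end in paragraphs:
        plan.append((start, start + (end - start) * dial))
    return plan

paragraphs = [(0, 60), (60, 150), (150, 180)]
print(skim(paragraphs, 0.1))  # → [(0, 6.0), (60, 69.0), (150, 153.0)]
```

At dial = 0.1, three and a half minutes of video are skimmed in twenty-one seconds.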

1.2.7 Accounting and Economics

Commercialization of digital video information services cannot be realized without very low cost, auditable, private and secure data and billing services. Copyright owners need to be compensated when their intellectual property is distributed to users. Accordingly, the digital library must be supported by a system for authenticating users, verifying willingness and ability to pay, authorizing access, recording charges, invoicing the user, receiving and processing payments, and managing accounts. We envision a generalized internet billing service, NetBill, which can help to relieve the Informedia Digital Library, or any other service provider, from having to manage user accounts or handle individual payments. To this end, our implementation will support the mechanisms necessary to provide adequate privacy protection, a wide range of pricing policies set by intellectual property owners, and restrictive access policies that dynamically limit the accessibility of certain collections to classes of users. For example, there may be content which is age-sensitive. A more complex example would be restricting specific segments within larger unrestricted materials. We will study the impact of alternative pricing policies on the user's information-seeking behavior, and perform an empirically grounded economic analysis of the likely impacts of electronic multimedia delivery on intellectual property owners and users.
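As a hypothetical sketch only (not the NetBill design), the accounting chain described above (authorize, price, record, invoice) might be modeled as:

```python
# Minimal per-use accounting: owners set prices and access restrictions;
# each delivery is authorized, priced, and recorded for later invoicing.
class Ledger:
    def __init__(self):
        self.prices = {}        # asset -> (owner, price, min_age)
        self.charges = []       # (user, asset, owner, price)

    def register(self, asset, owner, price, min_age=0):
        self.prices[asset] = (owner, price, min_age)

    def deliver(self, user, user_age, asset):
        owner, price, min_age = self.prices[asset]
        if user_age < min_age:
            return False        # access policy denies delivery
        self.charges.append((user, asset, owner, price))
        return True

    def invoice(self, user):
        return sum(p for u, _, _, p in self.charges if u == user)

ledger = Ledger()
ledger.register("space-age-clip-7", "WQED", 0.25)
ledger.deliver("student1", 16, "space-age-clip-7")
print(ledger.invoice("student1"))  # → 0.25
```

A real service would add authentication, cryptographic receipts, and the richer pricing policies discussed above; the point here is only that the charge record ties user, asset, and copyright owner together at delivery time.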

1.3 Testbed Facilities Plan

User test sites will be established at (1) Carnegie Mellon University, building on the experience and success with deploying the Mercury electronic text and image library, (2) the Winchester Thurston School, a culturally diverse, academically excellent, K-12 college preparatory school in Pittsburgh, (3) the Fairfax County (VA) public school system, and (4) the Open University in the U.K. We will provide metropolitan area network access over new, low-cost, switched multimegabit data services. Iterative evaluations of prototype systems with reduced data and more limited functionality will provide information vital to a successful technology development and transition strategy. To understand issues of acceptance and modes of use, systems will be instrumented to collect statistical data of user protocols and retrieval success measures. Additionally, successful acceptance and integration of the developed library in local school systems will provide a national reference site for the project.

Other researchers will be invited to participate in exploring or building upon our research system. We foresee several forms of involvement.

1.4 Organizational Roles and Management Plan

An important and explicit goal of this project is to accelerate acceptance of Informedia Library technologies by seeding the market and priming the providers. We have assembled the project partners and organized the project structure with this goal in mind. The partnerships we have established for resources, field testing and productization will enable us to achieve pervasive impact and commercial realization.

QED Enterprises, the commercial division of QED Communications, will be pursuing follow-on commercial licensing opportunities incorporating the systems and technologies developed. They will explore the use of video assets as library source materials for education and training markets. In collaboration with other project affiliates, QED Enterprises will assess the business model for providing continuing video library reference services to local area schools, hospitals, and commercial clients.

The Open University, U.K., will assess the application of our library system to structured distance learning, applicable to both academic and commercial training and education. Companies will participate in technology exchange with us, gaining early access to our research and in turn providing the project with their experimental hardware and software. This enables us to evaluate and build upon new commercial infrastructure and industry standards as they become available.

CMU and QED will serve as parent organizations with a governing board representing both institutions and other founding project members, with additional input from industrial and other institutional affiliates. CMU will manage personnel, provide space and be responsible for the operations and technical agenda. There will be joint strategic planning, public representation, and fund raising.


2. Project Description

The Informedia project will establish a large on-line digital video library by developing intelligent, automatic mechanisms to populate the library and allow for full-content, knowledge-based search and segment retrieval via desktop computer and metropolitan area networks. Initially, the library will be populated with video assets from WQED/Pittsburgh, Fairfax County (VA) Public Schools, and the Open University (UK).

Our approach utilizes multiple modalities for content-based searching and video sequence retrieval. The content of video data is conveyed by both narrative (speech and language) and image. Only by the collaborative integration of image, speech, and natural language understanding technology can we hope to automatically populate, segment, index, and search diverse video collections. This analysis approach uniquely compensates for problems of interpretation and search in errorful and ambiguous data environments. We start with a highly accurate, speaker-independent speech recognizer that automatically transcribes video soundtracks. The textual transcript, time-aligned with the video track, is then analyzed and organized by a language understanding system and stored in a full-text information retrieval system. This allows for rapid retrieval of the individual video segments that satisfy an arbitrary subject-area query based on the words in the soundtrack.

Image understanding techniques are employed to segment video sequences into ``video paragraphs'' by automatically locating boundaries of shots, scenes, and conversations. User controls provide for interactively requesting the corresponding video page or video volume, ``skimming'' the returned content, and reusing the stored video objects in diverse ways.

The system will be instrumented to enable the study of user protocols and human factors issues peculiar to manipulating video segments. We will implement a network billing server to study the economics of charging strategies and incorporate mechanisms to ensure privacy and security. We will also deploy and evaluate the system at Carnegie Mellon University and K-12 schools.

The project builds upon extensions and applications of existing and evolving technology from established research programs at CMU in automated speech recognition (Sphinx-II), image understanding (Machine Vision), natural language processing (TIPSTER, Ferret, Scout), human-computer interaction (ALT), distributed data systems (AFS, DFS), networking (INI), and security and economics of access (NetBill). Application of these technologies in this new domain will permit rapid population of the digital video library and agent-assisted retrieval from vast information stores, realizable in commercial settings. To appreciate the scope of the integration task, three perspectives of the system are presented: a user's perspective, a technology perspective, and a systems engineering perspective.

2.1 Informedia: User Perspective

Imagine a high school student sitting at a multimedia workstation in the school's library. Her class project is to create a multimedia composition on how world culture has been changed by communications satellites. Groping for a beginning, she begins speaking to the monitor: ``I've got to put something together on culture and satellites. What are they?''

Transparent to the user, the system has just performed highly accurate, speaker-independent, continuous speech recognition on her query. It then used sophisticated natural language processing to understand the query and translate it into retrieval commands that locate relevant portions of digital video. The video is searched based on transcripts from audio tracks that were automatically generated through the same speech recognition technology. The appropriate selection is further refined through scene sizing developed by image understanding technology.

Almost as soon as she has finished her question, the screen shows several icons, some showing motion clips of the video contained, followed by text forming extended titles and abstracts of the information contained in the video (see Figure 1).

To make this possible, image processing helped select representative still images for icons and sequences from scenes for intelligent moving icons. Speech recognition created transcripts, which natural language technologies use to summarize and abstract the selections.


Figure 1: Sample User Display



Through either a mouse or a spoken command, the student requests the second icon. The screen fills with a video of Arthur Clarke describing how he did not try to patent communications satellites, even though he was the first to describe them. Next the student requests the third choice, and sees villages in India that are using satellite dishes to view educational programming.

When she asks to go back, Arthur Clarke reappears. Now, speaking directly to Clarke, she wonders if he has any thoughts on how his invention has shaped the world. Clarke, speaking from his office, starts talking about his childhood in England and how different the world was then. Using a skimming control, she finds a particularly relevant section to include in her multimedia composition.

Beyond the requisite search and retrieval, giving our student such functionality requires image understanding to intelligently create scenes and the ability to skim them.

The next day she gives her teacher access to her project. More than a simple presentation of a few video clips, our student has created a video laboratory that can be explored and whose structure is itself indicative of the student's understanding.

Helping this student be successful are tools for building multimedia objects that include assistance in the language of cinema, appropriate use of video, and structuring composition. Behind the scenes the system has created a profile of how the video was used, distributing that information to the library's accounts. Assets for which the school has unlimited rights are tracked to understand curricular needs. And accounts for assets that the school has restricted, pay-per-use rights are debited.

2.2 Informedia: Technology Perspective

The digital video library presents unique challenges and opportunities for the user: videotape has historically been the most difficult medium to search automatically. The constant-time nature of analog video makes it difficult to work with, because reviewing 1000 hours of video tape takes 1000 hours. The capability to digitize video and to automatically transcribe its audio component is a breakthrough that allows effective indexing and retrieval of video data for the first time.

Previous demonstrations ranging from the Aspen Project [Lippman80] to ClearBoard [Ishii92] use analog video, which limits the user interface design. Current multimedia applications, usually CD-ROM based, associate short video and audio objects with an image or section of text (hypermedia), providing only the ability to select video clips based on their titles. These techniques have been employed since the first computer-controlled videodiscs [Fuller82]. Because they treat the video segment as a black box, however, they are totally inadequate for access to extremely large digital video libraries. These projects have also ignored the need for accounting of digital video use to allow owners of copyrighted material to be appropriately compensated.

By contrast, Informedia synergistically integrates five technologies and research areas: speech recognition, natural language understanding, image understanding, human-computer interaction, and the security and economics of information access.

Combining speech recognition, natural language, and image understanding technologies permits Informedia to intelligently search the full content of large digital video collections. Still, three significant problems remain: how to aid users in the identification of desired video when multiple objects are returned; how to let the user adjust the size of the video objects returned; and how to let the user quickly skim the video objects to locate sections of interest. Solutions to these problems require an intimate understanding of digital video and the development of new interfaces for it.

Informedia will provide the user with a variety of techniques to locate desired material in the library. An initial query may be typed on a keyboard, clicked with a mouse or spoken into a microphone. Techniques for automatically computing image similarity can be used to process visual queries, allowing the user to find related video segments with similar images or backgrounds.
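Normalized histogram intersection is one common way such image similarity might be computed: two images are scored by how much their color distributions overlap (1.0 means identical distributions). The three-bin histograms below are toy data:

```python
# Normalized histogram intersection as an image-similarity score.
def intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2)) / sum(h1)

mostly_sky = [70, 20, 10]   # toy 3-bin color histograms
also_sky   = [65, 25, 10]
forest     = [10, 15, 75]
print(intersection(mostly_sky, also_sky))  # → 0.95
print(intersection(mostly_sky, forest))    # → 0.35
```

A visual query would compute the histogram of the example image and rank stored segments by this score, returning those whose imagery or backgrounds most resemble it.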

2.2.1 Digital Video: Interrupted Video to Interactive Video

In complex, emerging fields such as digital libraries and multimedia, it is not surprising that most of today's applications have failed to take full advantage of the information bandwidth, much less the capabilities of a multimedia, digital video and audio environment.

Today's designs typically employ a VCR/Video-Phone view of multimedia. In this simplistic model, video and audio can be played, stopped, their windows positioned on the screen, and, possibly, manipulated in other ways such as by displaying a graphic synchronized to a temporal point in the multimedia object. This is the traditional analog interactive video paradigm developed almost two decades ago. Rather than interactive video, a much more appropriate term for this is ``interrupted video.''

Today's interrupted video paradigm views multimedia objects more as text with a temporal dimension [Hodges89, Yankelovich88]. Researchers note the unique nature of motion video. However, differences between motion video and other media, such as text and still images, are attributed to the fact that time is a parameter of video and audio. In the hands of a user, every medium has a temporal nature. It takes time to read (process) a text document or a still image. However, in traditional media each user absorbs the information at his or her own rate. One may even assimilate visual information holistically, that is, come to an understanding of complex information nearly at once.

However, to convey almost any meaning at all, video and audio must be played at a constant rate, the rate at which they were recorded. While a user might accept video and audio played back at 1.5 times normal speed for a brief time, it is unlikely that users would accept long periods of such playback rates. In fact, studies show that there is surprisingly significant sensitivity to altering playback fidelity [Christel91]. Even if users did accept accelerated playback, the information transfer rate would still be principally controlled by the system.

The real difference between video or audio and text or images is that video and audio have constant-rate outputs that cannot be changed without significantly and negatively impacting the user's ability to extract information. Video and audio are constant-rate, continuous-time media. Their temporal nature is constant due to the requirements of the viewer/listener. Text is a variable-rate continuous medium. Its temporal nature only comes to life in the hands of the users.

While video and audio data types are constant-rate and continuous-time, the information contained in them is not. In fact, the granularity of the information content is such that a one-half hour video may easily have one hundred semantically separate chunks. The chunks may be linguistic or visual in nature. They may range from sentences to paragraphs and from images to scenes.

Understanding the information contained in video is essential to successfully implementing the Informedia digital video library. Returning a full half-hour video when only one minute is relevant is much worse than returning a complete book when only one chapter is needed. With a book, electronic or paper, tables of contents, indices, skimming, and variable reading rates let users quickly find the chunks they need. Since the time to scan a video cannot be dramatically shorter than the real time of the video, a digital video library must give users just the material they need. Understanding the information content of video enables Informedia not only to find the relevant material but to present it in useful forms.

Content is conveyed in both narrative (speech and language) and image. Only by the collaborative interaction of image, speech and natural language understanding technology can we hope to automatically populate, segment, index, and search diverse video collections with satisfactory recall and precision. This approach uniquely compensates for problems of interpretation and search in error-full and ambiguous data environments.

2.2.2 Identifying Digital Video

An information search illustrates a significant difference between constant-rate, continuous-time media and variable rate continuous media. The human visual system is adept at quickly, holistically viewing an image, or a page of text, and finding a desired piece of information while ignoring unwanted information (noise). This has been viewed as a general principle of selective omission of information [Resnikoff89] and is one of the factors that makes flipping through the pages of a book a relatively efficient process. Even when the location of a piece of information is known a priori from an index, the final search of a page is aided by this ability.

Building on these principles, this project will investigate parallel presentation of video, taking advantage of the special abilities of the human visual system. When a search produces multiple hits, as will usually be the case, the system presents numerous sequences simultaneously in separate windows. Several representations of this extracted video will be tested in this project. The simplest, single images extracted from the video, will use the first image with valid (i.e., non-blank) data as determined by the image recognition techniques described in section 2.4. A slightly more complex representation will be motion icons, or micons [Brondmo90]. As implemented by Brondmo, micons are short motion sequences extracted from the first few seconds or minutes of the video they are to represent.

Both still iconic and miconic representations of video information can easily mislead a user. For example, a search for video sequences related to transportation of goods during the early 1800s may return 20 relevant items. If the first 20 seconds of several sequences are ``talking head'' introductions, icons and micons will provide no significant visual clue about the content of the video; the information after the introduction may or may not be interesting to the user. However, intelligent moving icons, or imicons, may overcome some of these limitations. Image segmentation technology can create short sequences that more closely map to the visual information contained in the video stream. Several frames from each new scene will be used to create the imicon. This technique will allow for the inclusion of all relevant image information in the video and the elimination of redundant data.

For a video containing only one scene with little motion, a micon may be the appropriate representation. If video data contains a single scene but with considerable motion content, or multiple scenes, the imicon is needed to display the visual content. To determine the imicon content, significant research will be performed on the optimal number of frames needed to represent a scene, the optimal frame rate, and the requisite number of scenes needed for video representation. Since the human visual system is adept at quickly finding a desired piece of information, the simultaneous presentation of intelligently created motion icons will let the user act as a filter to choose high interest material.
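As a minimal sketch of this selection step (the function and its parameter names are illustrative, not part of the proposed system), an imicon's frame list might be assembled from detected scene boundaries as follows:

```python
def build_imicon(num_frames, scene_starts, frames_per_scene=3):
    """Pick up to `frames_per_scene` evenly spaced frames from each scene.

    `scene_starts` are frame indices where new scenes begin, in order;
    the video runs from frame 0 to num_frames - 1.
    """
    boundaries = list(scene_starts) + [num_frames]
    selected = []
    for start, end in zip(boundaries, boundaries[1:]):
        length = end - start
        k = min(frames_per_scene, length)
        step = length / k
        # Evenly spaced sample positions within this scene.
        selected.extend(start + int(i * step) for i in range(k))
    return selected
```

The number of frames taken per scene is exactly the open research parameter discussed above.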

2.2.3 Sizing Digital Video

Once an object of interest is identified, those objects that have intrinsic constant temporal rates, such as video and audio, are difficult to search. There are about 150 spoken words per minute of ``talking head'' video, so one hour of video contains 9,000 words, which is about 15 pages of text. This problem is acute if one is searching for a specific piece of a video lecture, and worse yet with audio only. Even if a high playback rate of 3 to 4 times normal speed were comprehensible, continuous play of audio and video is a totally unacceptable search mechanism. This can be seen by assuming the target information lies, on average, halfway through a one-hour video file; in that case it would take 7.5 to 10 minutes to find. Returning the optimally sized chunk of digital video is one aspect of the solution to this problem.

Detailed indexing of the video can aid this process. However, users often wish to peruse video much as they flip through the pages of a book. Unfortunately, today's mechanisms for this are inadequate. Scanning by jumping a set number of frames may skip the target information completely. On the other hand, accelerating the playback of motion video to, for instance, twenty times normal rate presents the information at an incomprehensible speed. Even if users could comprehend such accelerated motion, this would still take six minutes to scan through two hours of videotape. A two second scene would be presented in only one tenth of a second.
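The arithmetic in the preceding paragraphs can be verified with a throwaway calculation:

```python
# Spoken-word volume of "talking head" video.
WORDS_PER_MINUTE = 150
words_per_hour = WORDS_PER_MINUTE * 60          # 9,000 words

def search_minutes(video_minutes, speedup):
    """Expected time to reach target information located, on average,
    halfway through the video, at accelerated playback."""
    return (video_minutes / 2) / speedup

# Scanning two hours at twenty times normal rate, and the on-screen
# duration of a two-second scene at that rate.
scan_minutes = 120 / 20                          # 6 minutes
scene_seconds = 2 / 20                           # 0.1 seconds
```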

Playing audio fast during the scan will not help. Beyond 1.5 or 2 times normal speed, audio becomes incomprehensible because the faster playback rates shift frequencies into inaudible ranges [Degen92]. Digital signal processing techniques are available to reduce these frequency shifts, but at high playback rates they present sound bites much like an analog videodisc scan.

Tools have been created to facilitate sound browsing by presenting graphical representations of the audio waveform to aid identification of locations of interest. However, this approach has been shown to be useful only for audio segments under three minutes [Degen92]. When searching for a specific piece of information in hours of audio or video, other mechanisms are required. In previous work at CMU [Christel92, Stevens89], a multidimensional model of multimedia objects (text, images, digital video, and digital audio) was developed. In this model, called ALT, variable-granularity knowledge about the domain, content, image structure, and the appropriate use of the multimedia object is embedded with the object. Based on a history of current interactions (input and output), the system judges what to display and how to display it. Techniques using such associated abstract representations have been proposed as mechanisms to facilitate searches of large digital video and audio spaces [Stevens92]. In this scheme, embedding knowledge of the video information with the video objects allows scans by various views, such as by content area or depth of information.

Using automatically derived transcripts as detailed in section 2.4, natural language searching, and video segmentation, video objects are imbued with knowledge about their content and their use. This allows first-pass searches to retrieve more focused segments of video. Integrated together, these techniques will permit the creation of context-sizing interfaces. These will present slider controls enabling the user to adjust the ``size'' of the retrieved video/audio segments for playback. Here, ``size'' may be time duration, but more likely it will be abstract knowledge chunks, where information complexity or type is the determining measure. This research will investigate the appropriate metaphors to use when the ``size'' the user is adjusting is abstract content. We will research what it means, from both an interface-design and a search-methods standpoint, to permit the user to say ``I want more background on each subject returned.''

2.2.4 Scanning Through Digital Video

No matter how precise the selection of video is, users will want to scan through video themselves. Video segmentation can aid the scanning process. By knowing where scenes begin and end, the Informedia system will perform high-speed scans of digital video files by presenting quick representations of scenes. This is a great improvement over jumping a set number of frames, since scene changes often reflect changes in the organization of the video, much like sections in a book.
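One common way to locate such scene boundaries, sketched here with per-frame color histograms rather than the project's actual image-understanding techniques, is to threshold the frame-to-frame histogram difference:

```python
def histogram_difference(h1, h2):
    """L1 distance between two frame histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def scene_boundaries(histograms, threshold):
    """Return indices of frames that start a new scene.

    `histograms` holds one histogram per frame; a large jump between
    consecutive histograms is taken as a cut.
    """
    cuts = [0]  # the first frame always starts a scene
    for i in range(1, len(histograms)):
        if histogram_difference(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts
```

A scanner can then jump between the returned cut points instead of skipping a fixed number of frames.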

Application of these techniques will permit the development of a skimming dial: information-based scanning of digital video data. Much like the chapter and section headings in a book, the skimming dial will permit fast, content-based perusal of video data. Even though the Informedia system is designed to return the most appropriate data, with the rich set of information available this ``dial'' will be critical, allowing users to skim by content and more precisely find desired information in video.

2.2.5 Tools For Reuse

Just viewing video from digital video libraries, while useful, is not enough. Once users identify video objects of interest, they will need to manipulate, organize, and reuse the video. Demonstrations abound in which students create video documents by associating video clips with text. While these are excellent steps in the right direction, the reuse of video is more than simply editing a selection and linking it to text.

Today, excellent stand-alone tools for editing digital video exist in the commercial market, and this project will use commercial off-the-shelf (COTS) tools when available. However, there are currently no tools to aid in the creative design and use of video as there are for document production. One reason is the intrinsic, constant-rate temporal aspect of video; another is the complexity of understanding the nature and interplay of scene, framing, camera angle, and transition. Building on previous work at CMU [Stevens89, Christel92], tools will be developed to provide expert assistance in cinematic knowledge. The long-range goal is to integrate the output of the image understanding and natural language understanding subsystems with this tool to create semantic understanding of the video. This would make possible context-sensitive assistance in the reuse of video and its composition into new forms.

As an example, permitting a student to interview an important historical or contemporary figure would provide a more interesting, personal, and exploratory experience than watching a linear interview. Creating such a synthetic interviewee is possible with existing video resources. Broadcast productions typically shoot 50 to 100 times as much material as they actually broadcast. As previously noted, WQED interviewed Arthur C. Clarke for its recent series ``Space Age.'' Approximately two minutes of the interview were broadcast, but over four hours were taped. While few would want to sit through four hours of an interview, many would like to ask their own questions. It would be especially interesting and motivating if the character responded in a fashion that made the viewer feel the answer was ``live,'' that is, specifically and dynamically created in response to the question.

Similar synthetic interviews have been hand-crafted [Stevens89, Christel92]. For typical users to create such an interview, new tools will be needed. Certainly, searching, parallel presentation, context sizing, and skimming will be needed to find, organize, and size the responses. In addition, tools must be developed to refine automatically generated abstractions of the transcript, associate those abstractions with specific responses, and define how to use the responses. The nature and form of such tools will be investigated through iterative development and testing with users interacting with the digital video testbed.

2.2.6 Billing and Privacy

Billing servers with privacy guarantees are essential in order to enable commercial development of video information and entertainment services. Owners of copyrighted information typically want to be compensated when their intellectual property is distributed to users. The digital library must be supported by a system for authenticating users, verifying willingness and ability to pay, authorizing access, recording the charges, invoicing the user, receiving and processing payments, and managing accounts. Indeed, any service provided via the Internet -- information retrieval, printing, or data processing -- requires a means to perform these functions. We envision a generalized internet billing service which can act as a factor to relieve the Informedia Digital Library, or any other service provider, from the need to be directly involved in user account management and handling individual payments.

Because the economic value of any individual video clip or other information object may be only a few cents, these services must be provided in a highly automated way, so that the transaction costs associated with any individual purchase amount to fractions of a cent. While keeping transactions simple, suppliers must have assurance that a user cannot, with a few keystrokes, initiate transactions that far outstrip his or her ability to pay; the user, in turn, must be protected against fraudulent charges being debited to his or her account.
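As an illustration of these two safeguards, a per-user spending cap checked before each micro-charge is recorded might look as follows (the class and method names here are hypothetical, not the NetBill design):

```python
class BillingAccount:
    """Hypothetical per-user account with a spending cap."""

    def __init__(self, cap_cents):
        self.cap_cents = cap_cents
        self.charges = []          # (item, amount_cents) audit trail

    def balance(self):
        return sum(amount for _, amount in self.charges)

    def charge(self, item, amount_cents):
        """Record a charge only if it keeps the user under the cap."""
        if self.balance() + amount_cents > self.cap_cents:
            return False           # refused: would outstrip ability to pay
        self.charges.append((item, amount_cents))
        return True
```

The audit trail is what makes later invoicing and dispute of fraudulent charges possible.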

We envision an internet billing service, or NetBill System, which can provide all of the services necessary to account for intellectual property delivered via a network. The system will be designed to provide these services not just for the Informedia collections, but also for collections of intellectual property made available by other organizations. In particular, two other responders to this solicitation -- one group from CMU and one from M.I.T. -- have indicated their desire to rely on the NetBill system to handle these functions for the Digital Library testbeds they are proposing. Our intent is to design NetBill as an open system which could be used by any organization providing intellectual property even through simple means such as an anonymous FTP server. Because we expect multiple financial services organizations to eventually wish to provide factoring services, open protocol interfaces will be designed so that users may charge purchases through any of several independently operated NetBill compatible systems.

2.3 Informedia: Systems Engineering Perspective


Figure 2: Informedia Digital Video Library System Overview


The Informedia system will incorporate a number of independently developed research systems whose data and processes must be integrated. Figure 2 provides a pictorial view of how the Informedia system integrates these processes and data flows. Processes are divided between those that occur off-line and are time-insensitive, and those that execute on-line in real time for the interacting user.

Furthermore, the system must be organized and constructed to allow for modification and incorporation of (1) emerging standards for media compression and storage, (2) breakthrough products for media manipulation and display, and (3) evolving high-bandwidth communication standards and services. Field experiments will produce feedback regarding system usability and performance which must be addressed rapidly in subsequent releases. Therefore, where possible and practical, we will build Informedia upon existing commercial hardware and software products.

We propose that the system be constructed as four major ``subassemblies,'' each of which provides a major function and can be implemented independently, without knowledge of the others' internal structure.

They communicate with one another primarily through network protocols, and so can be independently tested through simulated calls. Each subassembly calls upon one or more of the underlying fundamental technologies for implementation of its functions.


Figure 3: Informedia Data and Networking Architecture


Each subassembly comprises a number of subtasks, which constitute the subassembly's role in Informedia. The background library creation subassembly assimilates new analog video and audio and produces an indexed video, audio, and textual database through the following processes:

The interactive user station is the client's networked interface to the system. The network billing server enables commerce in video information services. The data and network architecture provide the infrastructure that enables them all. Figure 3 shows the Informedia architecture and subassemblies.

The fundamental modularity and interactivity of these subassemblies permit us to generate, test, and release new versions of the total system frequently. This enables more timely scheduled releases of ``shaken-down'' systems to the external testbed users. The total system implementation will be staged with fall-back, or substitute, mechanisms for some of the processes should they not all perform suitably when needed. For example, if automated speech recognition is not providing adequate accuracy in time for a scheduled release of new user interface function, a system can be released with only those videos which were closed-captioned, thus providing an alternative transcription. Similarly, for inadequate video image segmentation we can temporarily substitute time brackets. This also permits us to release early versions with more limited libraries to affiliate organizations before all the automated data creation is operational.

The construction of the data and network infrastructure is based upon extant technology from research and commercial systems that has not previously been applied to the video server task. This approach provides predictable baseline performance and reliability levels, which may nonetheless be strained by this application. We incorporate real-time Unix extensions, hierarchically cached file systems, and commercial switched multi-megabit data services to provide the required performance and bandwidth.

The long-term implementation plan anticipates shifting our data repositories to industrial video-on-demand servers as they become available and accessible to us. This reinforces our adherence to industry or de facto standards in areas where there is marginal, if any, gain in building our own experimental systems.

2.4 Informedia Technology Research and Development

We describe the proposed research and development for each technology area: speech, language, image, user interface, billing, and architecture.

2.4.1 Automatically Derived Transcripts

Speech recognition and language understanding provide automatic generation of audio transcripts for video segmentation, processing and retrieval.

2.4.1.1 Speech Understanding in Informedia

Even though much of broadcast television is closed-captioned, the vast majority of the nation's video and film assets are not. More importantly, typical video production generates 50 to 100 times more material that is never broadcast and therefore never captioned. Clearly, effective use and reuse of the country's video information assets will require automatic generation of transcripts. In this section we describe how we plan to do this. The transcripts thus generated will be used for analysis, indexing, and retrieval of movie- and television-based multimedia data.

Unlimited-vocabulary, speaker-independent, connected speech recognition is an unsolved problem. However, recent results in the recognition of newspaper dictation hold out the promise of automatic transcription of unlimited-vocabulary spoken language. In this section we describe the current state of the art, the problems that remain to be solved in order to make progress on video transcription, our proposed approach, and the results we expect in the 1996 time frame.

2.4.1.2 State of the Art in Speech Recognition

In the 1992 ARPA speech recognition evaluations, systems developed at Carnegie Mellon had the highest accuracy of all systems tested [Hwang93]. Carnegie Mellon has a history of work in this area spanning over twenty-five years and currently has a team of about twenty researchers working on various aspects of this problem.

The current best system, Sphinx-II, uses a 20,000-word vocabulary to recognize connected spoken utterances from many different speakers. The task domain is recognition of dictated passages from the Wall Street Journal. On a 150-MIPS DEC Alpha workstation the system operates in near real time and, on average, makes one error in every eight words.

Sphinx-II uses senonic semi-continuous hidden Markov models (HMMs) to model between-word context-dependent phones. The system uses four types of codebooks: mel-frequency cepstral coefficients, 1st cepstral differences, 2nd cepstral differences, and power and its first and second differences. Twenty-seven phone classes are identified, and a set of four VQ codebooks is trained for each phone class [Hwang94]. Cepstral vectors are normalized with an utterance-based cepstral mean value. The semi-continuous observation probability is computed using a variable-sized mixture of the top Gaussians from each phone-dependent codebook.
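Schematically, and in one dimension only (this sketch is illustrative and is not the Sphinx-II implementation), mixing only the top-scoring Gaussians of a codebook looks like this:

```python
import math

def gaussian_density(x, mean, var):
    """Univariate Gaussian probability density."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def semicontinuous_probability(x, codebook, weights, top_n=4):
    """Mix the top-N Gaussians of one codebook for observation x.

    `codebook` is a list of (mean, var) pairs shared across models;
    `weights` are the corresponding (state-specific) mixture weights.
    """
    densities = [gaussian_density(x, m, v) for m, v in codebook]
    # Keep only the best-scoring codewords, as described above.
    top = sorted(range(len(codebook)), key=lambda i: densities[i],
                 reverse=True)[:top_n]
    return sum(weights[i] * densities[i] for i in top)
```

In the real system this computation is repeated for each of the four codebooks and combined, with the codebook Gaussians shared across all senones.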

The recognizer processes an utterance in four steps:

The acoustic models are trained on speakers from both the WSJ0 and WSJ1 sets. The training data is partitioned by sex into two clusters via a linear warping of all same-sex training speakers at the phonetic level. (The speakers are clustered around the two most disparate speakers, using the warping distance metric as the basis of comparison.) Training is accomplished using the forward-backward algorithm with senonic decision trees obtained from the training data. A total of 10,000 senones are trained.

2.4.1.3 Proposed Research and Approach

A number of sources of error and variability arise naturally in a video transcription task. We present the prospects for solving these problems over the next few years.

Multiple Signal-to-Noise Ratio Problem. In broadcast video productions, whether documentary-style interviews or theatrical productions, the system must recognize speech from multiple speakers standing in different locations. The result is speech whose signal-to-noise ratio varies from speaker to speaker. Further confounding the problem are the effects of the speakers' differing orientations and the reverberation characteristics of the room [Liu93]. Signal adaptation techniques have been developed which appear to correct automatically for such variability, but they have not been tested in environments where nearly every other sentence has a different signal-to-noise ratio. We expect that with appropriate preprocessing and detection of signal levels we will be able to modify the current CDCN technology to solve this problem.

Multiple Unknown Microphone Problem. Most current systems optimize performance using close-talking, head-mounted microphones. As we move to the table-top, lapel, and directional boom microphones traditionally used in broadcast video production, the variability arising from differences in microphone characteristics and in signal-to-noise ratios will significantly degrade performance. Recent results by Stern and Sullivan indicate that dynamic microphone adaptation can significantly reduce the error without retraining for the new microphone [Sullivan93].

Fluent Speech Problem. In a typical video interview, people speak fluently, which means many words are reduced or mispronounced. The lexical descriptions of pronunciations used in conventional dictation systems, where careful articulation is the norm, will not work well for spontaneous, fluent speech. At present the only known technique is manual adaptation of the lexicon by knowledgeable linguists. It is our hope that this task domain will provide a rich enough data source that automatic pronunciation-learning techniques can be formulated to handle fluent speech phenomena.

Unlimited Vocabulary Problem. Unlike the Wall Street Journal dictation task, where the domain limits the size and nature of the vocabulary likely to be used, video transcriptions generally lack such constraints, though they do represent specific task domains. Our recent research in long-distance language models indicates that a twenty to thirty percent improvement in accuracy may be realized by dynamically adapting the vocabulary based on words observed in prior utterances. In addition, most broadcast video programs have significant descriptive text available: early descriptions of the program design called treatments, working scripts, abstracts describing the program, and captions. In combination, these resources can provide valuable additions to the dictionaries used by the recognizer.
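The flavor of such dynamic adaptation can be conveyed by a simple cache language model, which interpolates a static unigram estimate with the frequencies of recently observed words (a sketch only; the long-distance models referred to above are more elaborate):

```python
from collections import Counter, deque

class CacheLanguageModel:
    """Interpolate a static unigram model with a recency cache."""

    def __init__(self, static_probs, cache_size=200, cache_weight=0.3):
        self.static_probs = static_probs      # word -> probability
        self.cache = deque(maxlen=cache_size)  # recently observed words
        self.cache_weight = cache_weight

    def observe(self, word):
        self.cache.append(word)

    def probability(self, word):
        static = self.static_probs.get(word, 1e-6)
        if not self.cache:
            return static
        cached = Counter(self.cache)[word] / len(self.cache)
        return (1 - self.cache_weight) * static + self.cache_weight * cached
```

A rare word such as a program-specific proper name becomes far more probable once it has been recognized a few times.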

Relaxed Time Constraints. For transcription of digital video, processing time can be traded for higher accuracy; the system does not have to operate in real time. This permits the use of larger, continuously expanding dictionaries and of more computationally intensive language models and search algorithms. In initial, untuned testing of our current system, interview data appears to increase the error rate from around 12% to over 50% in unrestricted environments. We expect that the techniques outlined above will remove many of these new problems. In addition, removing the constraint of real-time processing will make it possible to deepen the recognizer's searches beyond what is feasible in real-time applications.

By the end of 1995 the error rate is expected to return to 12 to 15% for unrestricted video data. Improvements in computer technology, search technology, and speech processing can be expected to halve the error again, yielding a 5 to 6% word error rate by 1996. At these levels we believe the semantically based indexing techniques proposed here will prove acceptable for routine use of multimedia libraries.

2.4.2 Language Processing of Queries and Transcripts

2.4.2.1 Natural Language Processing in Informedia

Our initial goals for Informedia are to provide retrieval, browsing, and viewing of segments from a large videotape library. To this end we focus on ways for the user to specify a subject search and on ways to implement that search based on the transcript of the audio track. We therefore have three principal tasks:

  1. Query processing: the user must be able to specify a subject or content area for search without having to resort to specialized syntax or complicated command forms. The Informedia system must be able to process simple English statements of user interests.

  2. Retrieval: once the system has digested a user query, the corresponding text objects must be located, scored, and ranked according to user interest.

  3. Display: the video segments associated with each relevant text object must be located, and appropriate scene boundaries identified for each video object (e.g., corresponding to a ``visual sentence,'' paragraph, or page) are used to generate a menu of visual segments for user selection. A combination of image-based and transcript-based techniques will be used to identify scene boundaries.

An initial query may be textual, entered through the keyboard or mouse or as spoken words entered via microphone and recognized by the system. Subsequent refinements of the query, or new, related queries, may refer to visual attributes, such as ``find me scenes with similar visual backgrounds.''

Subsequent goals for the project flow from the potential for automatic processing of the audio track:

  1. Summarization: by analyzing the words in the audio track for each visual paragraph, the Informedia system will be able to determine the subject area and theme of the narrative. This understanding can be used to generate headlines or summaries of each video segment for icon labeling, tables of contents, or indexing.

  2. Tagging: using data extraction technology (from the Tipster and Scout projects), Informedia will be able to identify names of people, places, companies, organizations, and other entities mentioned in the sound track. This will allow the user to find all references to a particular entity with a single query.

  3. Transcript correction: the most ambitious goal is to automatically generate transcripts of the audio with speech recognition errors corrected. Using semantic and syntactic constraints from NLP, combined with a phonetic knowledge base such as the Sphinx dictionary, some recognition errors should be correctable.
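As a crude illustration of the tagging idea, far simpler than the Tipster/Scout extraction technology, candidate proper names can be pulled from a transcript as runs of capitalized words:

```python
import re

def tag_entities(transcript):
    """Return candidate proper-name phrases: runs of capitalized words.

    A naive stand-in for real data extraction; it will both over- and
    under-generate (e.g. sentence-initial words become false positives).
    """
    pattern = r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b"
    return re.findall(pattern, transcript)
```

Real extraction systems add gazetteers, pattern sets, and context rules to classify each candidate as a person, place, or organization.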

2.4.2.2 State of the Art

Our current work on text processing and retrieval is embodied in the Scout System. Scout is a full-text information storage and retrieval system that also serves as a testbed for information retrieval and data extraction technology. Figure 4 shows Scout being used to search a textual database of audio transcripts from one of the WQED videotapes. The user has selected the ``Clarke'' database and searched using ``independent'' as the query. Scout finds several hits using approximate matching to handle the spelling errors in the transcript. For Informedia, we will use the retrieval engine from Scout as the basis for a digital video search module, and we will incorporate data extraction technology from the Textract system developed for ARPA's Tipster program.


Figure 4: Scout Screen Dump


Current retrieval research focuses on newspapers, electronic archives, and other sources of ``clean'' documents. Natural language queries (instead of complex query languages) allow straightforward description of the material sought [TREC93].

The video retrieval task challenges the state of the art in two ways:

  1. Non-grammaticality: Written texts, especially news articles, correspond closely to the strict rules of classroom English, whereas the utterances recorded on videotape contain false starts, meta-utterances, pauses, ``um''s, grunts, deictic references to objects in the visual plane, and other phenomena that are not handled by standard grammars of English. So even perfect transcripts of the audio would be more complicated than current natural language technology can reliably parse.

  2. Noise: Current speech recognition techniques do not provide perfect transcripts. So we anticipate that only 4 out of 5 words will be correctly recognized by the speech algorithms. This level of error would reduce the effectiveness of typical retrieval algorithms.
For example, in one of our training tapes, Arthur Clarke is being interviewed. He says
SELF FULFILLING PROPHECIES
but because Sphinx was run using a smaller dictionary that does not contain the words ``prophecy'' or ``prophecies,'' Sphinx returns the closest phonetic match:
SELF FULFILLING PROFIT SEIZE
Maintaining high recall performance will require the retrieval of segments in spite of such mis-recognition.

2.4.2.3 Proposed Research and Approach

The Informedia task challenges the existing state of the art in three specific ways:

Performance of current retrieval algorithms on transcribed speech with recognition errors: the search and retrieval operations for regular text are well understood [Mauldin89, Mauldin91, Jacobs93], but existing work has focused on high-quality newswire text. What is not understood is how well these algorithms work on spoken rather than written language, and how much their performance is degraded by recognition errors.

Elaboration of syntactic and semantic models for spoken language: Our current retrieval technology relies on pattern sets and grammars that were developed for retrieving newspaper-quality texts from full-text databases. They do not address the additional complexity of spoken language.

Enhancement of pattern matching and parsing to recover from and correct errors in the token string: CMU researchers are already investigating the use of noise-tolerant grammar-based parsing [Lavie93]. We will investigate the use of this technique and the statistical techniques developed for SCOUT on the Digital Video Library corpus. Using the phonetic similarity measures produced by the Sphinx System, a graded string similarity measure will be used to retrieve and rank partial matches.

To address the issue of the inadequacy of current retrieval algorithms, we propose to first document their performance on transcribed video. We will create a test collection of queries and relevant video segments from the digital library. Using manual methods we will establish the relevant set of segments from the library. We will then use the test collection to evaluate the retrieval performance of our existing retrieval algorithms in terms of recall and precision.

We will use the results of the baseline performance test to direct additional research along two main lines of attack.

(1) We will elaborate current pattern sets, rules, grammars and lexicons to cover the additional complexity of spoken language by using large, data-driven grammars. To provide efficient implementation and high development rates, we will use regular expression approximations to the context-free grammars typically used for natural language. This approach worked well in our Textract data extraction system, which was evaluated by ARPA under the Tipster text program. Our hypothesis is that extending this technique to an automatically recognized audio track will provide acceptable levels of recall and precision in video scene retrieval.

(2) We will extend the basic pattern matching and parsing algorithms to be more robust, and to function in spite of lower-level recognition errors, by using a minimal divergence criterion for choosing between ambiguous interpretations of the spoken utterance. CMU's SCOUT text retrieval system already uses a partial match algorithm to recognize misspelled words in texts. We will extend our existing algorithm to match in phonetic space as well as textual space. The earlier example is converted in phonetic space to:

Query: P R AA1 F AH0   S IY0 Z		prophecies

Data:  P R AA1 F AH0 T S IY1 Z		profit seize

The two sequences deviate by only one insertion (T) and one change in stress (IY0 to IY1).
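The phonetic-space matching described above can be sketched as an edit distance over phoneme tokens. This is an illustrative sketch with uniform edit costs, not SCOUT's actual graded similarity weighting:

```python
# Hypothetical sketch: Levenshtein edit distance over ARPAbet phoneme tokens
# rather than characters. Uniform costs are an assumption for illustration.

def phoneme_edit_distance(query, data):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn one phoneme sequence into the other."""
    m, n = len(query), len(data)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == data[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

prophecies = "P R AA1 F AH0 S IY0 Z".split()
profit_seize = "P R AA1 F AH0 T S IY1 Z".split()
# One insertion (T) plus one stress change (IY0 -> IY1): distance 2.
```

Ranking candidate segments by this distance, rather than requiring exact token matches, is what allows retrieval to survive mis-recognitions such as ``profit seize.''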

We will focus first on error-tolerance, and later we will extend that to error correction. We will periodically re-evaluate the performance of the retrieval against the baseline to track accomplishments.

Since the first phase of the retrieval engine must function early in the project to allow for iterative development and rapid prototyping, the first year's focus is on adapting existing boolean and vector-space models of information retrieval to the Informedia architecture. We will also develop a test collection to measure recall and precision, and establish a baseline performance level. Users will have various options for ordering the returned set of ``hits,'' and for limiting the number of hits returned as well.

In subsequent years we will develop enhancements to the retrieval algorithms, including phonetics-based matching, use of robust parsing, and concept-based retrieval. We will also explore the use of language processing of the audio track to improve the scene segmentation.

Finally, we will investigate the correction of transcript errors using context-based language processing. We will also investigate the automatic summarization of retrieved material and build a module that will assemble the retrieved segments into a single user-oriented video sequence.

The two standard measures of performance in information retrieval are recall and precision. Recall is the proportion of relevant documents that are actually retrieved, and precision is the proportion of retrieved documents that are actually relevant. These two measures may be traded off one for the other, and the goal of information retrieval is to maximize them both.
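As a minimal illustration of the two measures (the function name and the toy counts are ours):

```python
# Recall: fraction of relevant segments actually retrieved.
# Precision: fraction of retrieved segments actually relevant.

def recall_precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision

# Suppose 10 segments are relevant and the system returns 8, of which 6 are relevant:
r, p = recall_precision(retrieved=range(8),
                        relevant=list(range(6)) + [20, 21, 22, 23])
# recall = 6/10, precision = 6/8
```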

Each technology will be included in the testbed evaluation system as it is developed. Each additional algorithm enhancement or module will be evaluated for its effect on recall, precision, and user functionality.

2.4.3 Content-based Image Manipulation

2.4.3.1 Image Understanding in Informedia

Image understanding plays a critical role in Informedia for organizing, searching, and reusing digital video. In Informedia, digital video will be ``annotated'' automatically by speech and language understanding, as well as using other textual data that has been associated with the video. Spoken words (and sentences) can be attached to their associated frames. Yet, the traditional database search by keywords, where images are only referenced, not directly searched for, is not appropriate or useful for Informedia. Rather, digital video images themselves must be segmented, searched for, manipulated, and presented for similarity matching, parallel presentation, context sizing, and skimming, while preserving image content.

The first capability required for a digital video library is segmentation, or ``paragraphing,'' of video into groups of frames as the library is formed. Each group can be reasonably abstracted by a ``representative frame,'' and thus can be treated as a unit for context sizing or for image content search. Part of this task can be done by content-free methods that detect large image changes, for example, ``key frame'' detection by changes in the DCT coefficients.
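A hedged sketch of such content-free paragraphing follows; the hand-made feature vectors stand in for DCT coefficients or histograms, and the threshold is an assumption rather than a tuned value:

```python
# Content-free cut detection sketch: declare a paragraph boundary wherever
# the frame-to-frame change in a per-frame feature vector exceeds a
# threshold. Feature extraction and the threshold are illustrative only.

def detect_cuts(features, threshold):
    """features: list of equal-length numeric vectors, one per frame.
    Returns indices i where a cut occurs between frame i-1 and frame i."""
    cuts = []
    for i in range(1, len(features)):
        diff = sum(abs(a - b) for a, b in zip(features[i - 1], features[i]))
        if diff > threshold:
            cuts.append(i)
    return cuts

# Two nearly static shots joined at frame 3:
frames = [[10, 10], [11, 10], [10, 11], [50, 60], [51, 60]]
# detect_cuts(frames, threshold=20) -> [3]
```

The frame immediately after each detected cut is then a natural candidate for the ``representative frame'' of the new paragraph.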

However, to be completely successful we need content-based video paragraphing methods: for example, recognizing an individual-speaker or ``talking head'' paragraph. Content-based query is also essential. An Informedia user is interested in retrieval by subject or content, not just by image. The subject consists of both image content and textual content; the combination specifies the subject. The attached textual information is useful for quickly filtering video segments to locate potential items of interest, but subsequent queries are often visual, referring to image content: for example, ``Find video with similar scenery,'' ``Find the same scene with different camera motion,'' or ``Find video with the same person.'' Again, part of this capability can be realized by content-free methods, such as histogram comparison, but real solutions lie in content-based image search, which presents a long-term challenge to the field of computer vision research.

2.4.3.2 State of the Art

Research in image databases which allow for visual query is becoming popular. However, video information is temporal, spatial, often unstructured, and massive. As a result, a complete solution of automatic extraction of semantic information or a ``general'' vision recognition system is not feasible at this point. Current efforts in image databases, in fact, are mostly based on indirect image statistics methods. With few exceptions, they fail to exploit language information associated with images or to deal with 3D events.

Image statistics methods compute primitive image features and their time functions, such as color histograms [Swain91, Gong92], coding coefficients, shape [Kato92, Satoh92] and texture measures [Kato92], and use them for indexing, matching, and segmenting images. This is a practical and powerful approach for some applications, but it deals only with images, not with their content.
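The histogram comparison cited above [Swain91] can be sketched as histogram intersection; the three-bin histograms below are toy data, not real image statistics:

```python
# Histogram intersection in the spirit of [Swain91]: similarity is the
# overlap of two normalized color histograms, a value in [0, 1].

def normalize(h):
    total = sum(h)
    return [v / total for v in h]

def histogram_intersection(h1, h2):
    """Both histograms assumed normalized to sum to 1."""
    return sum(min(a, b) for a, b in zip(h1, h2))

sunset = normalize([80, 15, 5])    # mostly red bins (toy data)
forest = normalize([5, 85, 10])    # mostly green bins
sunset2 = normalize([75, 20, 5])
# intersection(sunset, sunset2) is high; intersection(sunset, forest) is low
```

Because only bin counts are compared, two unrelated scenes that happen to share a palette will also match, which is exactly the ``images, not content'' limitation noted above.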

Content-based retrieval by content-preserving coding has shown good results for relatively well-defined classes of static images of a single object, such as faces, textures, and 2D shapes [Pentland93a,b,94]. Ambitious efforts toward automatic and semi-automatic stratification of movies are also being studied [Smith91, Satoh92]. However, not only is the visual recognition difficult, but the text input for context setting is also done manually.

CMU's Image Understanding group, with 80 researchers, is one of the largest vision groups in the country. The group's activities range from basic physics-based vision theory to applied vision systems for large-scale mobile robots. In particular, the results in color understanding [Klinker88,90], stereo [Kanade91], motion tracking [Lucas81], shape and motion from image sequences [Tomasi92, Poelman93], and efficient shape matching with arbitrary orientation and location [Yoshimura93] are unique and have great potential for application to the digital video library. Our work for Informedia, proposed below, will build on these results and continue to be leveraged by the basic image understanding effort.

2.4.3.3 Proposed Research and Development

We propose a staged development of the following abilities and their incorporation into Informedia for organizing and retrieving video. The strategy is to begin with tools based on mature techniques and move toward those which require further research and development effort but promise great payoffs, enhancing the digital video library system with unique capabilities. Thus, Informedia will have a constantly increasing set of video segmentation tools available as our research proceeds.

Use of Comprehensive Image Statistics for Segmentation and Indexing. Raw video materials are first segmented into video paragraphs so that each segment can be connected and integrated for indexing with the transcribed text. This initial segmentation can be done in a relatively content-free manner, as has been demonstrated previously, by using image statistics, especially by monitoring coding coefficients, such as DCT coefficients, and detecting rapid changes in them. This analysis also identifies the key frame(s) of each video paragraph; the key frame is usually at the beginning of the visual sentence and relatively static.

Once a video paragraph is identified, we extract image features like texture, color, and shape from video as attributes. We will develop a comprehensive set of image statistics as part of the video segmentation tool kit. While these are ``indirect statistics'' to image content, they have been proven to be quite useful in quickly comparing and categorizing images, and will be used at the time of retrieval.

Concurrent Use of Image and Speech/Language Information. In addition to image properties, other cues, such as speaker changes, the timing of audio and/or background music, and changes in the content of spoken words, can be used for reliable segmentation. Figure 5 is an example where keywords are used to locate items of interest and then image statistics (motion) are used to select representative frames of the video paragraph. In this example, the words ``toy'' and ``kinex'' have been used as keywords. The initial and closing frames have similar color and textural properties.


Figure 5: Video paragraphing for the K'nex.

Figure not available


Structural and temporal relationships between video segments can also be extracted and indexed.

Camera and Object Motion in 2D. One important kind of visual segmentation is based on the computer interpreting and following smooth camera motions such as zooming, panning, and forward camera motion: for example, when a large panoramic scene is being surveyed, when the camera (and narration) focus the viewer's attention on a small area within a large scene, or when a camera is mounted on a moving vehicle such as a boat or airplane.

A more important kind of video segment is defined not by motion of the camera, but by motion or action of the objects being viewed. For example, in an interview segment, once the interviewer has been located by speech recognition, the user may wish to see the entire clip containing the interview with this same person. This can be done by looking forward or backward in the video sequence to locate the frames at which this person appeared in or disappeared from the scene. Such single-object tracking is relatively easy; we are in fact capable of tracking far more complicated objects.

A technique is being developed [Rehg94] to track high degree-of-freedom objects, such as a human hand (27 degrees of freedom), based on ``deformable templates'' [Kass87] and Extended Kalman Filtering [Matthies89]. Such a technique gives the video database a tool to track and classify the motions of highly articulated objects.

Object Presence. Segmenting video by the appearance of a particular object, or a combination of objects, is a powerful tool. While this is difficult for general 3D objects at arbitrary locations and orientations, the KL Transform technique has proven able to detect particular classes of objects. Human presence is the most important and common case of object presence detection, and we will include this function.

Object and Scene in 3D. The techniques discussed so far are two-dimensional, but video mostly depicts 3D shape and motion. Adding a 3D understanding capability to the image understanding tool kit will revolutionize the scope of the system. The ``factorization'' approach, pioneered at Carnegie Mellon University [Tomasi90], has this potential. In this approach, an ``interest point'' operator finds, in each image frame, numerous corner points and other features that lend themselves to unambiguous matching from frame to frame. The coordinates of these interest points, across all frames of the video sequence, are collected into a large array of data. Linear algebra theory proves that this array - whose rank is always 3 or less - can be decomposed into shape and motion information; i.e., Observations = Shape X Motion. We will investigate the use of such 3D shape and motion understanding for the digital video library.
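The rank constraint at the heart of the factorization method can be demonstrated numerically; the synthetic 3D points and orthographic cameras below are our own toy setup, not part of the proposal:

```python
# Numerical illustration of the rank constraint behind the factorization
# method [Tomasi90]: under orthographic projection, the registered
# measurement matrix of tracked interest points has rank at most 3.
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((3, 20))      # 20 random 3D interest points

rows = []
for _ in range(10):                        # 10 frames
    # random orthographic camera: two orthonormal image-axis rows
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    rows.append(q[:2] @ points)            # 2 x 20 image coordinates
W = np.vstack(rows)                        # 20 x 20 measurement matrix
W = W - W.mean(axis=1, keepdims=True)      # register: subtract row centroids

rank = np.linalg.matrix_rank(W, tol=1e-8)  # Observations = Shape x Motion
```

Despite W having 400 entries, its numerical rank is 3, which is what makes the decomposition into shape and motion well posed.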

2.4.4 User Interface Development and Testing

Multimedia technology can deliver more information, more effectively than any scheme developed to date. But more than just delivering information, effective multimedia systems require a deep understanding of how users interact with huge volumes of information in many forms. The Informedia user environment requires much more of its designers than typical applications. User studies of subjects utilizing Informedia are thus an integral part of this development effort.

The Informedia workstation will be instrumented to keep a global history of each session. This will include all of the original digitized speech from the session, the associated text as recognized by Sphinx-II, the queries generated by Scout and the video objects returned, compositions created by users, and a log of all user interactions. In essence, Informedia will be able to replay a complete session, much as a flight simulator can replay a session. This will permit both comprehensive statistical studies and detailed individual protocol analyses.

Informedia's integration of speech recognition, natural language, and image understanding technologies creates a natural, literally invisible first-order user interface for searching large corpora of digital video. Nonetheless, significant user interface issues remain. Three principal issues with respect to searching for information are: how to aid users in identifying the desired video when multiple objects are returned; how to let the user adjust the size of the video objects returned; and how to let the user quickly skim the video objects to locate sections of interest. With respect to reuse of video objects, tools are required that go beyond simple editing to provide expert assistance in the visual and temporal organization of video. Solutions to these problems require an intimate understanding of digital video and the development of new modes of interface based on this model.

The initial studies will focus on the presentation and control interfaces:

Parallel presentation. When a search returns many hits, the system will simultaneously present icons, intelligent moving icons (imicons), and full motion sequences, along with their text summarizations. To develop heuristics for imicon creation, empirical studies will be performed to determine the number of unique scenes needed to represent a video chunk; the effect of camera and subject movements on the selection of images to represent each scene; and the best rate of presentation of images. Users will likely react differently to a screen populated by still images than to the same number of moving images, so studies will also be used to identify the optimal number and mix of object types. Outcomes of this work will be input to the image and natural language understanding portions of this research to refine the scene identification and summarization capabilities of Informedia.

Context-sizing slide switch. This simulated slide switch enables the user to adjust the ``size'' (duration) of the retrieved video/audio segments for playback. The ``size'' may be literal time duration, but more likely it will be abstract chunks whose determining measure is information complexity or type. This research will investigate the appropriate metaphors to use when the ``size'' the user is adjusting is abstract content. Empirical studies will help determine typical visual ``paragraphs'' for different materials. For example, it is well known that higher production value video has more shot changes per minute than, say, a videotaped lecture; although it is visually richer, it may be linguistically less dense. These studies will help determine the balance of linguistic and visual information density appropriate for different types of video information. We will also research what it means, from both an interface and a search-methods perspective, to permit the user to say ``I want more background on each subject returned.''

Skimming dial. This simulated analog rotary dial will interactively control the rate of playback of a given retrieved segment, at the expense of both informational and perceptual quality. One could also set this dial to skim by content, e.g., by visual scene changes. Video segmentation will aid this process: by knowing where scenes begin and end, the Informedia system can perform high-speed scans of digital video files by presenting quick representations of scenes. This is an improvement over jumping a set number of frames, since scene changes often reflect changes in the organization of the video, much like sections in a book. Empirical studies will be conducted to determine the rate of scene presentation that best enables users' searches, and the differences, if any, between image selection for optimal scans and image selection for the creation of imicons.

Once users identify video objects of interest they will need to be able to manipulate, organize, and reuse the video. Even the simple task of editing is far from trivial. To effectively reuse video assets, the user will need to combine text, images, video and audio in new and creative ways. To be able to write effectively, we spend years learning formal grammar; the language of film is both rich and complex, and deep cinematic knowledge, the grammar of video, cannot be required of users.

While excellent stand-alone tools to edit digital video exist, and will be used by Informedia, there are currently no tools to aid in the creative design and use of video as there are for document production. One reason is the intrinsic, constant-rate temporal aspect of video. Another is the complexity of understanding the nature and interplay of scene, framing, camera angle, and transition. Building on previous work at CMU [Stevens89, Christel92], tools will be developed to provide expert assistance in cinematic knowledge. The long-range goal is to integrate the output of the image understanding and natural language understanding sub-systems with this tool to create semantic understanding of the video.

For example, the contraposition of a high quality, visually rich presentation edited together with a selection from a college lecture on the same material may be inappropriate. However, developing a composition where the lecture material is available for those interested, but not automatically presented, may create a richer learning environment. With deep understanding of the video materials, it will be possible to more intelligently assist in their reuse.

Prototypes will be placed early on in cooperating affiliate schools and laboratories. Beyond the user studies described above, multimedia compositions will be collected and analyzed along with the histories of the users' session. Additionally, focused protocol analyses and exit interviews will be conducted to refine both the tools described previously and those providing assistance in the reuse of video.

Advanced multimedia applications require much more of developers and computing systems than does today's interrupted video. The multimedia equivalent of a teletypewriter I/O paradigm must be avoided to take advantage of the convergence of computing and video. Through a creative, multi-disciplinary approach, this project proposes to engineer a new digital video interface paradigm, further extending CMU research related to human factors issues in multimedia information environments [Christel91, Stevens85].

2.4.5 The NetBill System

2.4.5.1 Accounting in Informedia

For there to be a large digital video collection in the first place, copyright owners must be assured that their property will be properly protected and that its use is measured to ensure appropriate compensation. The NetBill system will provide security and accounting for Informedia. Major research issues must be addressed, including system design problems of availability, scalability, and auditability.

The NetBill system will evolve from three earlier Internet Billing Service prototypes designed and built at CMU [Mak91, Sirbu93, Scope93]. We have also extensively analyzed user requirements [Requirements92,93] and design tradeoffs [Design92,93, Scalability93]. Figure 6 illustrates our model of how the NetBill system relates to the user and the multimedia library.


Figure 6: NetBill


2.4.5.2 Research Issues and Proposed Research

NetBill depends on a number of functions working correctly, each of which poses significant research questions.

Authentication: NetBill must establish the identities of all parties to ensure the proper transfer of funds (for reasons of privacy, it may be desirable to support some types of anonymous participation in a transaction). This raises subtle security issues (for example, the interaction of authentication protocols with other security-related protocols is not yet well understood [Heintze94]; security flaws may be introduced when two protocols are combined). Scaling provides another set of challenging questions - authentication software that can easily handle ten thousand users may fail when it must handle millions of users.

Possible starting points for authentication services include the Kerberos system [Steiner88] (which is based on Needham and Schroeder's authentication protocol [Needham78]), systems based on public key digital signature methods such as RSA [Rivest78] or NIST's DSS [NIST91], or ``zero-knowledge'' protocol based methods that use dynamic probabilistic proofs [Goldwasser85]. None of these methods has yet been extended to the scale envisioned when all of the nation's K-12 students become potential users.

We will select an authentication mechanism taking into account technical concerns, existing mechanisms, and standardization trends, choosing unencumbered algorithms where possible.

Security and Privacy: Security is paramount to this work; we must ensure that all transfers of funds are authorized by all relevant parties and that we can account for all funds. In particular, we will develop a mechanism for users to preauthorize their account to be charged for a certain sum; those funds are then frozen until the user is charged for the transaction, guaranteeing that the funds are reserved and reducing collection problems. We will explore a variety of cryptographic mechanisms to verify the negotiation and agreement between parties on a fund transfer limit, including the use of digital signatures [Rabin79], cryptographic checksums [Rabin81, Rivest91], and private key encryption [USNBS77].
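As a hedged sketch of the preauthorization idea (not NetBill's actual protocol, and with a modern keyed checksum standing in for the cryptographic checksums cited), the agreed transfer limit can be bound to a verifiable tag:

```python
# Illustrative sketch only: the user keys a cryptographic checksum (HMAC)
# over (account, spending limit, nonce); the billing server verifies the
# tag before releasing the frozen funds. Message format and key handling
# are our own assumptions.
import hashlib
import hmac

def authorize(key, account, limit_cents, nonce):
    msg = f"{account}|{limit_cents}|{nonce}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify(key, account, limit_cents, nonce, tag):
    expected = authorize(key, account, limit_cents, nonce)
    return hmac.compare_digest(expected, tag)

key = b"shared-secret"              # assumed pre-established between parties
tag = authorize(key, "student-42", 500, "n-001")
# verify() with the same fields succeeds; altering the limit fails
```

The nonce prevents a captured authorization from being replayed against a later transaction.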

Privacy raises other concerns. Logging of all transactions raises the specter of a third party tracing transactions by a given user, or a given service provider, or satisfying some particular criteria. We will research and build mechanisms which allow for normal billing while providing a maximum amount of privacy to individual users. In many cases, users may wish to request a service anonymously, or to be sure that even the service provider is not tracking the nature of requests.

Libraries in particular have long recognized the importance of keeping patron identity confidential. We will research mechanisms for anonymous billing of service use.

However, we must make sure that these services can be overridden to allow auditing of transactions under a number of circumstances. For example, if a bill is disputed, users may want to trace a transaction. Also, there are many circumstances under US law where a financial transaction must be reported (for example, transactions over $10,000). We will investigate the use of threshold schemes that allow all transactions - even anonymous ones - to be traced if several people authorize the tracing.

Access Control: The NetBill system must be able to specify which classes of users can access which services. For example, in this proposal, there may be a desire to restrict the access afforded to faculty versus students. In general, the access control restrictions will vary with each service provider, and may be fairly complex. Access control may be implemented independently by each service provider, or centrally by the NetBill system for a broad category of services. For example, it should be possible to restrict the access of a student to age appropriate materials.

Because of the highly dynamic nature of intellectual property, it will require research to find the best way to specify and enforce these access control lists. It is important to note that we do not intend to address the confinement problem in this research: we will not be able to stop a user who has legitimate access to some information from forwarding that information to a third party.

We do intend to investigate mechanisms that make it difficult and inconvenient to forward information to unauthorized third parties. We also intend to investigate mechanisms of tagging documents with tracing data that will make it possible to locate the source of a confinement breach.

Account Hierarchies: An important feature of our system is that it allows account hierarchies. This permits a single organization to manage spending by department as well as by user. On the service provider side, we use hierarchical accounts to allow aggregate payments for separate services operated by one administrative entity. This sort of structure provides advantages not only in management, but also in maintenance of availability. New research is required to fully integrate hierarchical accounts on the scale of use that we envision.
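The hierarchical aggregation described above might be sketched as follows; the account structure and method names are our own illustrative assumptions:

```python
# Toy sketch of hierarchical accounts: spending posted to a leaf account
# aggregates up through its ancestors, so an organization sees totals by
# department as well as by user.

class Account:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.spent = name, parent, 0

    def charge(self, amount):
        node = self
        while node is not None:       # propagate to every ancestor
            node.spent += amount
            node = node.parent

district = Account("school-district")
school = Account("high-school", parent=district)
alice = Account("alice", parent=school)
bob = Account("bob", parent=school)

alice.charge(300)
bob.charge(200)
# school.spent == 500, district.spent == 500, alice.spent == 300
```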

Auditing: There is a fundamental tension between our ability to audit accounts and protecting the privacy of users. Audits will occasionally be necessary: when requested by a user, when required by court order, when required by tax or Treasury authorities, and when security breaches are suspected. In other cases, summary reports must be generated (for economic experiments described later, or to monitor usage). It is also important that the mechanisms for privacy in this system provide maximum protection against individuals who may wish to probe accounts. For example, we want to keep accounts private from a nosy clerk, even if that clerk is working for a NetBill billing service. Basic research is required to find mechanisms that provide privacy but allow auditing under certain circumstances (such as when certain internal thresholds are exceeded, or when several parties approve the auditing). We plan to use basic cryptographic mechanisms such as secret sharing [Shamir79, Herlihy87, Rabin88] and secret counting [Benaloh87, Camp94].
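The threshold-authorized tracing mechanism cited above can be illustrated with a minimal secret-sharing sketch in the style of [Shamir79]; the field size, parameters, and secret are toy choices:

```python
# Minimal Shamir secret sharing over a small prime field: any k of n
# shareholders can reconstruct the tracing key; fewer learn nothing.
import random

P = 2 ** 13 - 1  # small Mersenne prime field (8191), toy-sized

def make_shares(secret, k, n):
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret)
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(secret=1234, k=3, n=5)
# any 3 of the 5 shares reconstruct 1234
```

In the auditing setting, the shared secret would be a key that unlocks transaction tracing, so tracing requires cooperation among several authorized parties.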

Scalability: The Internet has 15 million users today. Mechanisms that work well for thousands or tens of thousands of users may fail when put on the Internet. Problems of scale will affect every aspect of system design. We must also provide very high availability for billing services. To achieve both high availability and wide scale, we will need to investigate advanced system mechanisms such as delegation of responsibility [Satya93], caches [Gray93, Weihl93], failure tolerant protocols and platforms [Lampson93], and multiprocessor platforms [Mullender93, Accetta86] to build our system.

2.4.5.3 Pricing Server

A flexible infrastructure is required that can calculate prices according to whatever formula the copyright holder specifies. We propose to assign responsibility for calculating prices to a logically separate pricing server. Normally, each service provider that makes use of the NetBill system would operate its own pricing server, although the NetBill service provider could operate a pricing server on behalf of a service provider that did not want to run its own.

Pricing raises a number of significant research questions:

Analysis of these issues in the first year of the project will provide input to the pricing server design.

2.4.5.4 User Response to Pricing Alternatives

We will evaluate users' response to a variety of pricing schemes, especially the distinction between usage-based and subscription-based pricing. We will also offer the user a choice of lower quality video at a reduced price (reducing bandwidth requirements by reducing the spatial, temporal, or color fidelity of the video). By the end of the project, we expect to have sufficient numbers of users to begin to understand usage patterns and the response to various pricing/quality options. We will also attempt to determine the amount of redistribution of materials under various pricing schemes.

2.4.5.5 Analyses of Economic Impacts on Producers, Libraries and Users

Over the long term, digital archives are expected to dramatically change the nature of video access. While there have been several theoretical studies of the economics of electronic libraries [Zahray90a,90b], there is little literature on the economics of video archives [Vin93]. For example, the extent to which viewing is concentrated among a small fraction of video segments, rather than spread evenly across the archive, is not well understood. The existence of a significant corpus in electronic form will, for the first time, allow us to measure use and characterize the ``locality of viewing.'' Early in the project we will examine the economics of architectural tradeoffs, building a mathematical model that we can use to compare architectures. Other questions to be addressed include:

These studies will largely be undertaken in Years 3 and 4, using empirical evidence from operation of the testbed to test the analyses.

2.4.6 Data and Network Architecture

2.4.6.1 Continuous Media Playback Requirements

The fundamental problem in providing continuous media (e.g. video and audio) from remote storage systems is the ability to sustain sufficient data rates from the file system and over the network in order to provide pleasing audio and video fidelity (e.g. frame rate, size and resolution) on playback for the receiving user. The ability to continuously transmit 30 frames/second of full-color, full-screen, TV quality images even to a single user is limited by network bandwidth and allocation. For current compression ratios yielding 10 Mbytes/min. of video, a minimum 1.3 Mbit/s dedicated link would be required to deliver continuous video, which is not commonly achievable across the Internet. The ability to deliver the same video material simultaneously to a number of users is further limited by disk transfer rates.
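The link-rate figure above can be checked with simple arithmetic; the disk transfer rate used for the simultaneous-user estimate is our own illustrative assumption, not a figure from the proposal:

```python
# The 1.3 Mbit/s figure, spelled out from the compression ratio cited
# (10 Mbytes of video per minute).

video_rate_mbytes_per_min = 10
bits_per_second = video_rate_mbytes_per_min * 8 * 10**6 / 60
# -> about 1.33e6 bits/s, i.e. the minimum 1.3 Mbit/s dedicated link

# How many such streams a single disk could feed, assuming (our assumption)
# a sustained transfer rate of 40 Mbit/s:
disk_mbit_per_s = 40
streams = round(disk_mbit_per_s * 10**6 / bits_per_second)   # ~30 users
```

This is why disk transfer rates, not just network bandwidth, bound the number of simultaneous viewers of the same material.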

2.4.6.2 File System and Data Organization

The digital video/audio archive storage we propose to start with will be a hierarchically cached file system, with all the digitized data at the top ``media-server'' node (approximately 1 terabyte) and caches of the most recently accessed media at the ``site-server'' nodes (40-50 gigabytes). The server will be implemented as a multi-threaded user-level process on a UNIX system, with a fixed priority policy scheduler [Tokuda90, Nakajima93], and will communicate continuous media data over standard network connections. There remain numerous sources of timing and synchronization failures at the various levels of the data hierarchy, and many operating system approaches to deal with them [Herrtwich92]. These are psychologically ameliorated for this project by the user's search-retrieve-display model of information search and retrieval systems [Blattner92]. They will also be operationally moderated by the fact that teachers typically develop lesson plans ahead of assignment to students. It is common for teachers to reserve library material in anticipation of high student use; using Informedia, teachers can similarly request local storage of broad searches related to upcoming assignments prior to student use.

The ``site-server'' sits on a local area net with end-user PC-workstations. The searchable transcripts and auxiliary indices will exist at the main server and be replicated at each site. This permits the cpu-intensive searches to be performed locally, and media to be served either from the local cache or from the central server. The local user PC-workstation can alternately be a buffering display station, a display plus search engine, or the latter plus media cache (approximately 2 gigabytes), depending upon its size and performance class. Caching strategies will be implemented through standard file systems: Transarc's Andrew File System (AFS) [Satya85, Howard88, Spector89] and OSF's industry-standard Distributed File System (DFS) [DFS91]. Concentration of viewing strongly influences system architecture: where and how much to cache depend on the ``locality of viewing.'' Early in the project we will examine the economics of these architectural tradeoffs, and intend to build a mathematical model that we can use to compare architectures.
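The two-level lookup described above (site-server cache first, fall-through to the central media-server on a miss) can be sketched as follows. This is a minimal illustration, not the project's implementation; the names `SiteServer` and `fetch_from_media_server`, and the LRU eviction policy, are illustrative assumptions.

```python
# Sketch of a site-server cache with LRU eviction and fall-through
# to the central media-server on a miss.
from collections import OrderedDict

class SiteServer:
    def __init__(self, capacity_segments, fetch_from_media_server):
        self.capacity = capacity_segments
        self.fetch = fetch_from_media_server   # callable: segment_id -> bytes
        self.cache = OrderedDict()             # ordered oldest-first

    def get_segment(self, segment_id):
        if segment_id in self.cache:
            self.cache.move_to_end(segment_id)  # hit: refresh recency
            return self.cache[segment_id]
        data = self.fetch(segment_id)           # miss: go to central server
        self.cache[segment_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return data
```

A teacher's advance request for material ahead of an assignment corresponds to calling `get_segment` before class to warm the site cache.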

The stringent continuous-stream network data requirements typical of video-on-demand systems are relaxed in our library system implementation because (1) most sequences are anticipated to be short (<2 minutes), (2) many will be delivered from the locally networked site-server, and (3) the data display is always performed from the buffer constituted by the user's local disk, typically 1-2 gigabytes in early system deployments. Currently used compression techniques reduce the data requirement to approximately 10 Mbytes/minute of video. The performance assumptions therefore hold well unless very long video sequences are requested. Forthcoming research and commercial file systems structured for delivery of continuous media [Anderson92] and video-on-demand [Rangan92, Vin93] address the problems of achieving sufficient server performance, including the use of disk striping on disk arrays to enable continuous delivery to large numbers of simultaneous viewers of the same material [Schwartz94]. As we shift our data repositories to such higher-performance servers and the higher-bandwidth network links anticipated over the four years of this proposal, we can correspondingly reduce the on-line secondary storage requirements (and costs) for the end-user nodes.
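The rough numbers behind point (3) above: at the cited compression ratio, a short clip occupies only a small fraction of the user's local-disk buffer (taking the conservative 1 gigabyte end of the 1-2 gigabyte range).

```python
# Size of a typical short clip versus the user's local-disk buffer.
MBYTES_PER_MIN = 10
clip_mbytes = 2 * MBYTES_PER_MIN       # a sequence under 2 minutes
disk_mbytes = 1_000                    # conservative end of 1-2 GB

print(clip_mbytes, disk_mbytes // clip_mbytes)  # 20 MB per clip; ~50 clips fit
```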

We estimate that if all prime time television of the last 40 years (approximately 160,000 hours) were digitized in the format we propose using, it would require a 100 terabyte repository. Thus, our 1 terabyte testbed will be sufficiently representative of the commercial environments we foresee, and will demonstrate many of the same operational and performance issues.
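The archive estimate checks out under the same 10 Mbytes/minute compression assumption used throughout:

```python
# 40 years of prime time television at 10 Mbytes/minute of video.
HOURS = 160_000
MBYTES_PER_MIN = 10

total_terabytes = HOURS * 60 * MBYTES_PER_MIN / 1_000_000
print(f"{total_terabytes:.0f} TB")   # 96 TB, i.e. on the order of 100 terabytes
```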

2.4.6.3 Networking Services

We intend to connect our server network to selected affiliate sites via Bell Atlantic's commercially available switched multi-megabit data service (SMDS). This currently provides very economically priced T1 data rates (1.17 Mbits/sec) at a flat rate anywhere in the 412 area code. Frame relay services from 56 Kbps to 1.5 Mbps are also provided for remote satellite services, appropriate to the model of a school district serving a number of schools from a single ``media'' or ``site'' server. Communication interfaces developed by CMU to connect local workstation Ethernets to the SMDS cloud are in place now under an experiment sponsored by Bell Atlantic.

A key element of the on-line digital video library is the communication fabric through which media-servers and satellite (user) nodes are interconnected. Traditional modem-based access over voice-grade phone lines is not adequate for this multi-media application. The ideal fabric has the following characteristics. First, communication should be transparent to the user. Special-purpose hardware and software support should be minimized in both server and slave nodes. Second, communication services must be cost effective, implying that link capability (bandwidth) be scalable to match the needs of a given node. Server nodes, for example, will require the highest bandwidth because they are shared among a number of satellite nodes. Finally, the deployment of a custom communication network must be avoided. The most cost-effective, and timely, solution will build on communication services already available or in field-test. The implementation already begun for this project satisfies these requirements. Currently, in cooperation with Bell Atlantic, we have begun to deploy a tele-commuting Wide-Area Network (WAN) ideally suited for the on-line digital video library. This WAN is based on services from Bell that are currently available.

The topology of the WAN we have deployed is shown in Figure 7. The two key elements of the communication fabric are (1) use of Central-Office Local-Area Networks (CO-LANs) to provide unswitched data services to workstations over digital subscriber loop technology and (2) use of a Switched Multi-Megabit Data Service (SMDS) ``cloud'' to interconnect the CO-LANs and high-bandwidth server nodes.


Figure 7: On-line digital video library communication fabric


High-bandwidth server nodes are directly connected into the SMDS cloud through a standard 1.17 Mbit/s T1-access line. The SMDS infrastructure provides for higher bandwidth connections (from 4 Mbit/s through 34 Mbit/s) should they be required. Currently, a T1-class SMDS connection is tariffed at a flat $600/month, anywhere within our local (412) area code.

SMDS is a public, packet-switched data service that is in limited service today. It is offered by Bell Atlantic and other telecommunication carriers and supports a range of data applications that depend on high-speed communications. SMDS extends the performance and efficiencies of LANs over a wide area, while offering the economic benefits of a shared service. SMDS is connectionless, meaning that there is no need to set up a connection through the network before sending data. This provides bandwidth on demand for efficient transmission of bursty data. SMDS also provides any-to-any communication: any SMDS client can exchange data with any other SMDS client. Finally, SMDS is protocol independent, permitting any end-to-end protocols (e.g. TCP/IP, OSI, DECnet, Novell IPX, etc.) to be used between connected clients.


3. Testbed Facility

Our proposal contains within it a plan for four testbed installations, of varying size and intended application, each with different constraints (e.g. network bandwidth, library size) or characteristically different user communities (e.g. grade school children, university faculty). We have partners who have already committed to serve these roles, which are also essential to our own research studies of use, performance and economics. In addition, we will be able to (1) provide networked access to the primary testbed, (2) export portions of the system and data to other sites for their local exploration and experimentation, and (3) import enhanced or substitute components from external researchers for experiment or incorporation into our local test facility.

3.1 Planned User and System Studies

The initial user test sites will be established at (1) Carnegie Mellon University, building on the experience and success of deploying the Mercury electronic text and image library, (2) the Winchester Thurston School, a culturally diverse, academically excellent, K-12 college preparatory school in Pittsburgh, (3) the Fairfax County (VA) public school system, and (4) the Open University in the U.K. The latter two will also provide resource materials. The CMU system will allow campus-wide access across its networked infrastructure; users will be college students and faculty in all disciplines. The Winchester Thurston testbed will include a site-server and local area network linked to the CMU primary server via Bell Atlantic SMDS service. Users will be students in three age groups from the lower, middle and upper (high school equivalent) schools, as we test the viability of the library search concept and the usability of the user interface for various age and interest groups. The Fairfax site will start with Internet access to the CMU server, but will mostly work with more limited resource material preloaded on its site-server. The O.U. installation is expected to be a scaled-down replication of the CMU installation, though its Internet connection is available for non-real-time video downloading.

3.2 External Research Studies

The highly modular structure and implementation of our proposed system make it a fertile testbed for researchers in many disciplines. Within each of the four major ``sub-assemblies'' of our research implementation there are subsystems for a large number of separable functions or techniques: speech recognition; image sequence segmentation; user interface display and control tools; text indexing, search and retrieval; video servers; network streaming protocols; dynamic pricing algorithms; and others. It is our intent to permit researchers with interests in any of the components, as well as in overall system use and application, to participate. If they build subsystems to our interfaces and data types, there is potential for incorporation into the system.

We foresee several forms of involvement. For those seeking network access to the existing testbed, resources may be restricted by the providers. For use of prototype systems with limited protection mechanisms implemented, users may need to sign licensing agreements protecting the property rights of the providers; the concerns generally relate to unauthorized copying and redistribution of source material. Requests for involvement by external researchers will be evaluated by the project's principal investigators. Criteria include the anticipated impact on the performance or function of the overall system and, where implementation is involved, the costs to integrate and verify their contributions. Whatever is contributed by external researchers must be available for continued use and subsequent research in the project, and redistribution must be permitted to the extent that the project's base technology will be redistributed, including access to the various project partners.

3.3 Non-research Use

Additional user access to other schools, particularly in the Pittsburgh region, will be considered. Institutions will need to write proposals justifying their request. They will be required to provide funding sources for their own equipment and communication (networking) costs, which must meet minimal bandwidth requirements. Proposals will be evaluated based on their intended usage and internal support plans. Certain library resources may be access restricted.

4. Organizational Roles

An important and explicit goal of this proposed project is to accelerate acceptance of the techniques and technologies developed by seeding the market and priming the providers. We have assembled the project partners and organized the project structure to realize this goal.

4.1 Project Partner Backgrounds

Carnegie Mellon University

Carnegie Mellon is an acknowledged international leader in computer science education and research. During the past quarter century its laboratories have produced some of the most important breakthroughs in artificial intelligence, operating systems, machine architectures, speech recognition, image understanding, machine translation and electronic libraries. Carnegie Mellon has been one of the pioneers in the implementation and operation of digital libraries. In 1988, the university began its Mercury project, the first university library system to be built around a modern distributed computing environment rather than a time-shared view of computing. Mercury has concentrated on two forms of material: text and page images of documents. The project had a strong technical component, but it has also addressed the practical questions of building a working library, including close relationships with publishers, operational and user support, system integration, standards, copyright, authentication and control. The Informedia project will address many of the same issues, but with a different material: digitized video.

CMU will manage the project, conduct the fundamental research, and become the initial networked deployment testbed.

CMU Information Networking Institute

The Information Networking Institute was created at CMU in 1989 to coordinate research and teaching at the intersection of communications, computers, business and policy studies. Research activities have ranged from distributed systems to gigabit networks to organizational impacts of EDI. A cooperative program of CMU's schools of engineering, computer science and business, the Institute also administers an interdisciplinary MS degree program in Information Networking in which students take courses from all three schools.

A key feature of the MS program is the requirement for an individual thesis or integrated group project. In 1992 and 1993 groups of 15 and 10 students respectively worked full time for four months on the problem of internet billing services under the supervision of faculty from several departments. The result has been two generations of requirements analysis, design, and prototype implementation covering both business and technical issues for an internet billing server.

QED Communications and QED Enterprises

QED Communications is the parent company of WQED/Pittsburgh, the first community-owned public television station in the United States. A major production center for the Public Broadcasting Service, WQED has won thirty-five Emmy Awards and eight Peabodys. It has proven itself an innovator in science and educational programming, and drama and wild-life documentary production. WQED-TV was the first television station in the country to broadcast television programs into the classroom. QED brings to the partnership a rich film library and a history of fruitful collaborations with world-class affiliates, such as The National Academy of Science, The National Wildlife Federation, The National Geographic Society and NHK-Japan. QED boasts an exceptionally creative team and years of experience in designing projects that simultaneously teach and delight an audience.

QED will provide a large library of video resources and pursue follow-on commercial service opportunities through its commercial subsidiary, QED Enterprises.

Winchester Thurston

Founded in 1887, the Winchester Thurston School is an independent, co-educational, college-preparatory school enrolling students in kindergarten through twelfth grade and serving Pittsburgh and surrounding communities. Its 500-member student body is ethnically, racially and economically diverse. The school has a long-standing reputation for providing students an exceptional education with an emphasis on the individual. It has been a leader in science education, stressing the process of discovery within the science curriculum for all grades. The school's computing facilities have access to the Pittsburgh Supercomputing Center, and the school has participated in a number of joint programs and experimental studies with CMU and other universities.

Winchester will be the initial K-12 testbed site and has agreed to experiment with early prototypes in order to provide feedback on usability issues at various age levels.

Fairfax County Schools

The Fairfax County (Virginia) School District is a very large urban public school system representing a diverse socio-economic and intellectual range of students. Fairfax County schools consistently score among the nation's best on standardized tests. An innovator in applying media instruction and distance-learning technology, the district has agreed to contribute the video materials from its Electronic Field Trip program. These surrogate travels include trips to the Berlin Wall, Wales, China and the Biosphere, as well as other points of interest. As many as 9000 schools have participated in each of these Electronic Field Trips. The district will also participate as a remote field testbed site once the project is operational.

The Open University, UK

The O.U. has often been described as Britain's most important educational innovation of the last quarter of a century. Founded by Royal Charter in 1969 as a fully autonomous independent university, it has become the UK's largest single educational institution, with more than 200,000 people currently studying its courses. It is the model for other open and distance teaching universities in many parts of the world. The University has developed a flexible, multi-media teaching system which includes specially-produced textbooks, local tuition and other support services, short residential schools, home experiment kits, computer networks, audio-visual materials and broadcasts on national radio and television. By these means it reaches people of every age and background, enabling them to fit in their studies with the rest of their lives. It offers a wide range of educational opportunities, including degrees, diplomas and other qualifications, professional training and updating, as well as courses and study packs for personal development. Research interests of Open University staff span a wide range of disciplines and provide the firm base for the University's teaching strategy.

The O.U. will provide a large collection of video course material in the math, science and technology disciplines and will deploy the system first for internal use by the faculty and potentially for remote student use pending issues of network accessibility.

Digital Equipment Corporation

Digital, the world's second largest manufacturer of computing systems, has a strong interest in accelerating the development of the desktop video information industry. The range of their computing product offerings, from very high-end servers and storage systems to high-performance and commodity-priced PCs, is an excellent match to our needs as we build and test with today's technology for a market and technology base that will exist 4-5 years from now. Digital has committed to contribute the equipment necessary to implement, test and deploy the system, data repository and user PC/workstations described in this proposal. They have also committed to working with us to increase the number of affiliated industrial partners to extend our support base and incorporate our technology.

Microsoft Corporation

Microsoft is the world's largest producer of commercial software. They have demonstrated great interest in video-on-demand software and multimedia information libraries and related tools. We have agreed to a technology exchange effort which will enable the Informedia Project to evaluate and build upon applicable Microsoft research and commercial systems. Additionally, Microsoft has committed to provide some direct financial support to the project if NSF funded.

Bell Atlantic

Bell Atlantic, the regional Bell operating company, has been an innovative developer and marketer of communication infrastructure for high-bandwidth multimedia communication at modest cost. Their recently proposed acquisition of TCI cable and their involvement in field tests of interactive video to the home demonstrate their intent to be a major player in the area of interactive multimedia information systems. They have agreed to underwrite the communication costs of our regional testbed sites, permitting us use of the latest appropriate technology in extended field trials.

4.2 Achieving Pervasive Impact

Universal access to vast, low-cost digital information and entertainment will significantly impact the conduct of business, professional, and personal activity. Most of the major computer manufacturers, news media producers, publishers, cable and communication companies have involved themselves in one or more joint ventures to explore the technology and market potential of digital video information products and services. The initial impact of the project's activity will be on the broad accessibility and reuse of existing video materials (e.g., documentaries, news, vocational, training) previously generated for public broadcast; public and professional education; vocational, military and business training.

The greatest societal impact of what we do will most likely be in K-12 education. The Digital Video Library represents a critical step toward an educational future that we can hardly recognize today. Ready access to multimedia resources will bring to the paradigm of ``books, blackboards, and classrooms'' the energy, vitality, and intimacy of ``entertainment'' television and video games. The key, of course, is the access mechanism itself: easy and intuitive to use, powerful and efficient in delivering the desired video clip. The persistent and pervasive impact of such capabilities will revolutionize education as we've known it, making it as engaging and powerful as the television students have come to love.

The greatest commercial impact will be in industrial/commercial training and education, at reduced cost and in less time. We enable individuals to learn through exploration and examples at varying levels of complexity in an often entertaining, highly visual and auditory information flow.

Our initial project members represent significant testbeds of two important sectors on which we are focused - K-12 and university education. The schools involved represent a diverse socio-economic and intellectual range of students. The Winchester Thurston School and CMU will provide the first testing of the new digital video library system, and play a key role in mapping the new technology into the urban and college classroom. Our studies of usage and motivation with these students will provide invaluable input on how to provide ubiquitous information services across the national information infrastructure (NII). Combined, these environments will provide the requisite span of discipline, reference, and casual users.

4.3 Ability to Commercialize

The project includes all of the types of companies required to make this initiative successful: together they can create an industry that provides ongoing digital video library products and services, in a variety of markets, without Federal support.

QED Enterprises, the commercial division of QED Communications, will be pursuing follow-on commercial licensing opportunities incorporating the systems and technology developed. They will explore use of their own and other past and future video assets as general and special purpose collections of library source materials for the education and training markets. In collaboration with other project affiliates, QED Enterprises will assess the business model for providing continuing video library reference services to local area schools, hospitals, and commercial clients, including requirements at the local sites, local- and metropolitan- area networks for delivery, and centralized video database repositories. This proposed effort provides a unique opportunity for them to achieve new commercial value from their vast libraries while enabling them to fulfill their fundamental education mission.

Bell Atlantic has been actively supporting our prototype networking efforts, including the provision of CMU equipment space and connections within the telephone central office. They have a direct interest in determining requirements and monitoring performance of our applications so that they can cost-effectively offer appropriate data services for future commercial installations of digital video library service providers. Both Bell and QED will closely follow our research related to network billing servers and ensuring data security and privacy.

We have engaged both Digital Equipment and Microsoft Corporation in technology exchange relationships which will provide them access to our developed technology for assessment and potential productization. Equally important, they will provide us early access to their research prototypes and production hardware and software systems for video and multimedia servers and related delivery systems. This will enable us to evaluate and build upon forthcoming commercial infrastructure and industry standards as they become available.


4.4 Management Plan

4.4.1 Development Management

CMU will manage personnel, provide space and be responsible for the operations and technical agenda. QED will provide initial source materials and production expertise, explore commercial applications, generate plans for providing ongoing services and outreach, and provide production resources on a cost-recovery basis. There will be joint strategic planning, public representation and fund raising.

4.4.2 Affiliate Members

Additional Affiliate members from all sectors will be invited to participate in the project. They will provide, at varying levels, cash resources, previously produced source material, technical personnel, user training personnel, and content expertise. They will receive early access to prototype Informedia implementations and special license to selective materials, and may participate in user studies with student, faculty, and casual subjects. Members will be sought from several interest areas: information technology and computing companies, organizations that do continuous internal training, consulting companies, publishing and resource companies, communication companies, and professional education organizations. Both Digital Equipment and Microsoft as founding partners have committed to broadening the number of industrial affiliates and resource providers.

4.4.3 Project Management

CMU and QED will serve as parent organizations and equal partners with a governing board comprised of representatives from both institutions and other founding project members, with additional inputs from industrial and other institutional affiliates. CMU will provide oversight through an ad hoc committee reporting to the Provost.

4.4.4 Intellectual Property

Generally, base technology will be owned by the project member that actually created it, and the remaining members will have licensable rights regarding use and (where appropriate) distribution of the technology. Intellectual property rights relating to donated source materials will remain with the contributing members. However, the University maintains a non-exclusive license to utilize all donated material for internal and testbed educational research purposes. Implementation of appropriate data security and access privacy to protect the rights of resource providers and users are part of the base research.


5. References

[Accetta86]
Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A. Jr., Young, M.W. "Mach: A New Kernel Foundation for Unix Development," Proceedings of Summer USENIX, Atlanta, GA, July 1986.
[Anderson92]
Anderson, D.P., Osawa, Y., Govindan, R. "A File System for Continuous Media," ACM Transactions on Computer Systems, Vol 10, No 4, pp. 311-337, November 1992.
[Benaloh87]
Benaloh, J. Verifiable Secret-Ballot Elections, Technical Report, Yale/DCS/RR-561, Yale, 1987.
[Blattner92]
Blattner, Meera M., Dannenberg, Roger B., eds., Multimedia Interface Design, ACM Press, New York, N.Y., and Addison-Wesley, 1992.
[Brondmo90]
Brondmo, H.P., Davenport, G. "Creating and Viewing the Elastic Charles - a Hypermedia Journal," Hypertext, State of the Art, McAleese, R., Green, C. (eds.), Intellect Ltd.
[Camp94]
Camp, L.J., Tygar J.D. "Protecting Privacy While Preserving Access to Data," The Information Society, Vol 10, No 1, 1994.
[Christel91]
Christel, M.G. A Comparative Evaluation of Digital Video Interactive Interfaces in the Delivery of a Code Inspection Course, Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, 1991.
[Christel92]
Christel, M.G. and Stevens, S.M. "Rule Base and Digital Video Technologies Applied to Training Simulations," Software Engineering Institute Technical Review T92, Pittsburgh, PA: Software Engineering Institute, 1992.
[Degen92]
Degen, L., Mander, R., and Salomon, G. "Working with Audio: Integrating Personal Tape Recorders and Desktop Computers." Proceedings of ACM CHI '92 Conference on Human Factors In Computing Systems, 1992.
[Design92]
Internet Billing Server Prototype Design Document, INI Technical Report INI92-4, Information Networking Institute, Pittsburgh, PA, 1992.
[Design93]
Internet Billing Server Prototype Design Document, INI Technical Report INI93-3, Information Networking Institute, Pittsburgh, PA, 1993.
[DFS91]
File Systems in a Distributed Computing Environment, White Paper, Open Software Foundation, Cambridge, MA, 1991.
[Fuller82]
Fuller, R.G., Zollman, D. "The Puzzle of the Tacoma Narrows Bridge Collapse: An Interactive Video Disc Program for Physics Instruction," Creative Computing, pp. 100-109, 1982.
[Goldwasser85]
Goldwasser, S., Micali, S., Rackoff, C. "The Knowledge Complexity of Interactive Proof Systems," Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, Providence, RI, May 1985.
[Gong92]
Gong, Y., Sakauchi, M. "A Method for Color Moving Image Classification Using the Color and Motion Features of Moving Images," ICARCV '92, 1992.
[Gray93]
Gray, J., Reuter, A. Transaction Processing, Morgan-Kauffman, Palo Alto, CA, 1993.
[Heintz94]
Heintze, N.C., Tygar J.D. "A Model for Secure Protocols and Their Compositions," IEEE Symposium on Security and Privacy, May 1994.
[Herrtwich92]
Herrtwich, R.G. "Network and Operating System Support for Digital Audio and Video," Second International Workshop, Heidelberg, Germany, November 18-19, 1991, Proceedings, Springer-Verlag, 1992.
[Herlihy87]
Herlihy, M.P., Tygar, J.D. "How to Make Replicated Data Secure," Advances in Cryptology, CRYPTO-87, Springer-Verlag, August 1987. Also to appear in Journal of Cryptology.
[Hodges89]
Hodges, M.E., Sasnett, R.M., Ackerman, M.S. "A Construction Set for Multimedia Applications," IEEE Software, January 1989.
[Howard88]
Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.H. "Scale and Performance in a Distributed File System", ACM Transactions on Computer Systems 6, 1 (February 1988).
[Hwang93]
Hwang, Huang, Alleva "Predicting Unseen Triphones with Senones", ICASSP-93, 1993.
[Hwang94]
Hwang, Thayer, Huang "Semi-Continuous HMMs with Phone Dependent VQ Codebooks for Continuous Speech Recognition," to appear ICASSP-94, 1994.
[Ishii92]
Ishii, H., Kobayashi, M. "ClearBoard: A Seamless Medium for Shared Drawing and Conversation with Eye Contact," Proceedings of ACM CHI '92 Conference on Human Factors In Computing Systems, 1992.
[Jacobs93]
Jacobs, P. "Description of the TIPSTER/SHOGUN System as used for MUC-5", Jacobs, P. (ed.), Proceedings of the Fifth Message Understanding Conference, sponsored by ARPA/SISTO, Baltimore, MD, August, 1993.
[Kanade91]
Kanade, T., Okutomi, M. "A Stereo Matching Algorithm with an Adaptive Window: Theory and Experiment," Proceedings of the 1991 IEEE International Conference on Robotics and Automation, Sacramento, CA, IEEE Computer Society Press, Vol 2, pp. 1088-1095, 1991.
[Kass87]
Kass, M., Terzopoulos, D., Witkin, A. "Symmetry-Seeking Models and 3D Object Reconstruction," International Journal of Computer Vision, Vol 1, No 3, pp. 211-221, 1987.
[Kato92]
Kato, T. "Database Architecture for Content-Based Image Retrieval," SPIE: Image Storage and Retrieval Systems, SPIE, San Jose, CA, February 1992.
[Klinker88]
Klinker, G., Shafer, S., and Kanade, T. "The Measurement of Highlights in Color Images," International Journal of Computer Vision, Vol. 2, No. 1, pp. 7-32, June 1988.
[Klinker90]
Klinker, G., Shafer, S., and Kanade, T. "A Physical Approach to Color Image Understanding," International Journal of Computer Vision, Vol. 4, No. 1, pp. 7-38, January 1990.
[Lampson93]
Lampson, B.W. "Reliable Messages and Connection Establishment," Distributed Systems, Second edition, Mullender, S. (ed.), Addison-Wesley and ACM Press, 1993.
[Lavie93]
Lavie, A., Tomita, M. "GLR* - An Efficient Noise-skipping Parsing Algorithm for Context-free Grammars," Proceedings of Third International Workshop on Parsing Technologies (IWPT-93), Tilburg, The Netherlands, 1993.
[Lippman80]
Lippman, A. "Movie-Maps: An Application of the Optical Videodisc to Computer Graphics," Computer Graphics, Vol 14, No 3, 1980.
[Liu93]
Liu, F.H., Stern, R.M., Huang, X., and Acero, A., "Efficient Cepstral Normalization For Robust Speech Recognition," Proceedings of the Sixth ARPA Workshop on Human Language Technology, Princeton, NJ, Morgan Kaufmann, M. Bates, Ed. 1993.
[Lucas81]
Lucas, B.D. and Kanade, T. "An Iterative Technique of Image Registration and Its Application to Stereo," Proc. 7th International Joint Conference on Artificial Intelligence, pp. 674-679, August 1981.
[Mak91]
Mak, S. Network-Based Billing Server, Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, 1991.
[Matthies89]
Matthies, L., Kanade, T. and Szeliski, R. "Kalman Filter-based Algorithms for Estimating Depth from Image Sequences," International Journal of Computer Vision, Vol. 3, pp. 209-236, 1989.
[Mauldin89]
Mauldin, M. Information Retrieval by Text Skimming, Ph.D. Thesis, Carnegie Mellon University, August 1989. (also available as CMU Computer Science technical report CMU-CS-89-193). Revised edition published as Conceptual Information Retrieval: A Case Study in Adaptive Partial Parsing, Kluwer Academic Press, September 1991.
[Mauldin91]
Mauldin, M. "Retrieval Performance in FERRET: A Conceptual Information Retrieval System," Proceedings of the 14th International Conference on Research and Development in Information Retrieval, Chicago, October, 1991.
[Mullender93]
Mullender, S. "Kernel Support for Distributed Systems," Distributed Systems, Second edition, Mullender, S. (ed.), Addison-Wesley and ACM Press, 1993.
[Nakajima93]
Nakajima, T., Kitayama, T., Arakawa, H., Tokuda, H. "Integrated Management of Priority Inversion in Real-Time Mach," Proceedings of the Real-Time Systems Symposium, IEEE Computer Society, pp. 120-130, 1993.
[Needham78]
Needham, R.M., Schroeder, M.D. "Using Encryption for Authentication in Large Networks of Computers," Communications of the ACM, Vol 21, No 12, pp. 993-999, 1978. Also Xerox Research Report, CSL-78-4, Xerox Research Center, Palo Alto, CA, 1978.
[NIST91]
A Proposed Federal Information Processing Standard for Digital Signature Standard, Technical report, National Institute of Standards and Technology, Docket No. 910907-1207, RIN 0693-AA86, 1991.
[Pentland93a]
Pentland, A. "Modal Descriptions for Vision and Graphics," IEICE Trans. Information and Systems, Special Issue on Computer Vision and Graphics, January 1993.
[Pentland93b]
Sclaroff, S., Pentland, A. "Modal Matching for Correspondence and Recognition," MIT Media Lab Technical Report No 201, May 1993 (portions appeared in ICCV '93).
[Pentland94]
Pentland, A., Moghaddam, B., Starner, T., Oliyide, O., Turk, M. "View-Based and Modular Eigenspaces for Face Recognition," MIT Media Lab Technical Report No 245, 1994.
[Poelman93]
Poelman, C., Kanade T., "A Paraperspective Factorization Method for Shape and Motion Recovery," Carnegie Mellon University, December, 1993. (also available as CMU Computer Science technical report CMU-CS-92-219).
[Rabin88]
Rabin, M., Tygar, J.D. An Integrated Toolkit for Operating System Security (Revised Version), Technical Report TR-05-87R, Center for Research in Computing Technology, Aiken Laboratory, Harvard University, August 1988.
[Rabin81]
Rabin, M. Fingerprinting by Random Polynomials, Technical Report TR-81-15, Center for Research in Computing Technology, Aiken Laboratory, Harvard University, May 1981.
[Rabin79]
Rabin, M. Digitized Signatures and Public-Key Functions as Intractable as Factorization, Technical Report MIT/LCS/TR-212, Laboratory for Computer Science, Massachusetts Institute of Technology, January 1979.
[Rangan92]
Rangan, P.V., Vin, H.M., Ramanathan, S. "Designing an On-Demand Multimedia Service," IEEE Communications Magazine, Vol 30, No 7, pp. 56-64, July 1992.
[Rehg94]
Rehg, J. and Kanade, T. "Visual Tracking of High DOF Articulated Structures: an Application to Human Hand Tracking," to be presented at ECCV94, May 94.
[Requirements92]
Internet Billing Server Requirements Document, INI Technical Report INI92-1, Information Networking Institute, Pittsburgh, PA, 1992.
[Requirements93]
Internet Billing Server Requirements Document, INI Technical Report INI93-2, Information Networking Institute, Pittsburgh, PA, 1993.
[Resnikoff89]
Resnikoff, H. L. The Illusion of Reality, New York: Springer-Verlag, 1989.
[Rivest91]
Rivest, R., Dusse, S. The MD5 Message-Digest Algorithm, unpublished manuscript, July 1991.
[Rivest78]
Rivest, R., Shamir, A., Adleman, L. "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Communications of the ACM, Vol 21, No 2, pp. 120-126, February 1978.
[Satoh92]
Satoh, T., Yamane, J., Yee-Hong, G., and Sakauchi, M. "A Multimedia Retrieval System Using Video Scene Description Language," Seisan Kenkyu (in Japanese), Vol 44, No 11, pp. 23-25, November 1992.
[Satya93]
Satyanarayanan, M. "Distributed File Systems," Distributed Systems, Second edition, Mullender, S. (ed.), Addison-Wesley and ACM Press, 1993.
[Satya85]
Satyanarayanan, M., Howard, J.H., Nichols, D.N., Sidebotham, R.N., Spector, A.Z., West, M.J. "The ITC Distributed File System: Principles and Design," Proceedings of the 10th ACM Symposium on Operating System Principles, Orcas Island, December 1985.
[Scalability93]
Availability, Reliability and Scalability Issues in the Internet Billing Server Design and Prototype, INI Technical Report INI93-5, Information Networking Institute, Pittsburgh, PA, 1993.
[Schwartz94]
Schwartz, E.I. "Demanding Task: Video on Demand," The New York Times, 23 January 1994.
[Scope93]
Internet Billing Server Prototype Scope Document, INI Technical Report INI93-1, Information Networking Institute, Pittsburgh, PA, 1993.
[Shamir79]
Shamir, A. "How to Share a Secret," Communications of the ACM, Vol 22, No 11, pp. 612-614, November 1979.
[Sirbu93]
Sirbu, M. "Internet Billing Server Design and Prototype Implementation," Intellectual Property Project Proceedings, Interactive Multimedia Association, Vol 1, No 1, 1994.
[Smith91]
Smith, T., Davenport, G., "The Stratification System: A Design Environment for Random Access Video," MIT Media Lab, Summer 1991.
[Spector89]
Spector, A.Z., Kazar, M.L. "Wide Area File Service and the AFS Experimental System", Unix Review, Vol 7, No 3, pp. 60-71, March 1989.
[Steiner88]
Steiner, J.G., Neuman, C., Schiller, J.I. "Kerberos: An Authentication Service for Open Network Systems," USENIX Conference Proceedings, Dallas, pp. 191-200, Winter 1988.
[Stevens85]
Stevens, S.M. "Interactive Computer/videodisc Lessons and Their Effect on Students' Understanding of Science," National Association for Research in Science Teaching: 58th Annual NARST Conference, ERIC, Columbus, OH, 1985.
[Stevens89]
Stevens, S.M. "Intelligent Interactive Video Simulation of a Code Inspection," Communications of the ACM, July 1989.
[Stevens92]
Stevens, S.M. "Next Generation Network and Operating System Requirements for Continuous Time Media," Network and Operating System Support for Digital Audio and Video, Herrtwich, R.G. (Ed.) New York: Springer-Verlag, 1992.
[Stevens93]
Stevens, S.M. "Multimedia Computing: Applications, Designs and Human Factors," User Interface Software, Bass, L. & Dewan, P. (Ed.) New York: Wiley, 1993.
[Sullivan & Stern 93]
Sullivan, T.M., and Stern, R.M. "Multi-Microphone Correlation-Based Processing for Robust Speech Recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN, Vol 2, pp. 91-94, 1993.
[Swain91]
Swain, M.J., Ballard, D.H. "Color Indexing," International Journal of Computer Vision, Vol 7, No 1, pp. 11-32, November 1991.
[TREC93]
"Proceedings of the Second Text Retrieval Conference," Harman, D. (ed.), sponsored by ARPA/SISTO, August 1993.
[Tokuda90]
Tokuda, H., Nakajima, T., Rao, P. "Real-Time Mach: Toward a Predictable Real-Time System," Proceedings of the USENIX Mach Workshop, USENIX, October 1990.
[Tomasi90]
Tomasi, C., Kanade, T. "Shape and Motion without Depth," ICCV '90, Osaka, Japan, 1990.
[Tomasi92]
Tomasi, C., Kanade, T. "Shape and Motion from Image Streams under Orthography: a Factorization Method," International Journal of Computer Vision, Vol 9, No 2, pp. 137-154, 1992.
[USNBS77]
Federal Information Processing Standards Publication 46: Data Encryption Standard, FIPS PUB 46, U.S. National Bureau of Standards, Washington D.C., January 1977.
[Vin93]
Vin, H.M., Rangan, P.V. "Designing a Multi-User HDTV Storage Server," IEEE Journal on Selected Areas in Communications, Vol 11, No 1, January 1993.
[Weihl93]
Weihl, W., "Transaction-Processing Techniques," Distributed Systems, Second edition, (S. Mullender, Editor), Addison-Wesley and ACM Press, 1993.
[Yankelovich88]
Yankelovich, N., Haan, B.J., Meyrowitz, N.K., and Drucker, S.M. "Intermedia: The Concept and the Construction of a Seamless Information Environment," IEEE Computer, January 1988.
[Yoshimura93]
Yoshimura, S., Kanade, T. "Fast and Accurate Object Matching with Rotation," Technical Report, Carnegie Mellon University, Pittsburgh, PA, June 1993.
[Zahray90a]
Zahray, P. Electronic Dissemination of Scholarly Journals: An Economic and Technical Analysis, Ph.D. Thesis, Carnegie Mellon University, Department of Engineering and Public Policy, 1990.
[Zahray90b]
Zahray, P., Sirbu, M. "The Provision of Scholarly Journals by Libraries via Electronic Technologies: An Economic Analysis," Information Economics and Policy, Vol 4, pp. 127-154, 1991.