1. Executive Summary
2. Project Description
3. Testbed Facility
4. Organizational Roles
5. References
The distinguishing feature of our technical approach is the integrated application of speech, language and image understanding technologies for efficient creation and exploration of the library. Using a high-quality speech recognizer, the sound track of each videotape is converted to a textual transcript. A language understanding system then analyzes and organizes the transcript and stores it in a full-text information retrieval system. Likewise, image understanding techniques are used for segmenting video sequences by automatically locating boundaries of shots, scenes, and conversations. Exploration of the library is based on these same techniques. Additionally, the user interface will be instrumented to investigate user protocols and human factors issues peculiar to manipulating video segments. We will implement a network billing server to study the economics of charging strategies and also incorporate mechanisms to ensure privacy and security.
The Informedia Project has industry partners who are committed to provide substantial resources and base technology. They will evaluate commercial opportunities for the underlying technology and for the provision of information services. Currently, our committed partners include Digital Equipment, Microsoft, QED Enterprises, and Bell Atlantic. Together, these companies span the requisite disciplines for digital video library commercialization.
The Informedia Library project proposes to develop these new technologies and to embed them in a video library system primarily for use in education and training. The nation's schools and industry together spend between $400 billion and $600 billion per year on education and training, an activity that is 93% labor-intensive, with little change in teacher productivity ratios since the 1800s. The new digital video library technology will allow independent, self-motivated access to information for learning, exploration, and research. This will bring about a revolutionary improvement in the way education and training are delivered and received.
More specifically, the Informedia Digital Video Library proposes to develop intelligent, automatic mechanisms that provide full-content search of, and selective retrieval from, an extremely large on-line digital video library. We will build the initial library from WQED Pittsburgh's video archives, video course material produced by the BBC for the Open University and material from Fairfax County (VA) Public Schools' Electronic Field Trips series. We will develop the tools that will populate the library and support access via desktop computers and local-to-metropolitan area networks. We will also research the economics and security of network accessible video intellectual property, which is a vital issue in the future of commercial and public video libraries.
Jointly conceived by Carnegie Mellon University and QED Communications (WQED/Pittsburgh), the Informedia Library project integrates regional and international resources. Carnegie Mellon is an acknowledged international leader in computer science education and research. QED is a major Public Broadcasting Service production center and winner of thirty-five Emmy and eight Peabody Awards. The UK's Open University, a model for distance teaching universities world-wide, brings us access to one of the world's largest collections of educational video. Fairfax County Public Schools has been a pioneer in satellite distribution of elementary school materials combined with networked communications for learning communities. The project's industrial sponsors currently include some of the nation's leading technology companies in computing, software and communications.
Several factors distinguish our project from similar efforts. First, our technical approach integrates image, speech and language understanding to operate simultaneously on the same data stream. This reaches beyond those systems for video or text searching that succeed or fail on the strength of one mode of recognition or interpretation. Second, the integration of these technologies provides new research opportunities, both within and across disciplines. Third, we will produce a highly usable library of commercial broadcast quality and a mechanism for disseminating and commercializing our products. Fourth, our work also addresses human factors issues: learning, interaction, motivation, and effective usage modes for K-12, post-secondary, and life-long learning. And fifth, we incorporate billing, variable pricing and security mechanisms that enable and encourage commercialization.
This approach uniquely compensates for problems of interpretation and search in errorful and ambiguous data sets. We start with a highly accurate, speaker-independent, connected speech recognizer which will automatically transcribe video soundtracks and store them in a full-text information retrieval system. This text database allows for rapid retrieval of individual video segments which satisfy an arbitrary query based on the words in the soundtrack. Image and natural language understanding enable us to locate and delineate the corresponding "video paragraph" by using combined source information about camera cuts, object tracking, speaker changes, timing of audio and/or background music, and change in content of spoken words. Controls allow the user to interactively request the corresponding video paragraph or a larger quantity of video, to intelligently "skim" the returned content, and to reuse the stored video objects in diverse ways.
This project builds upon extensions and applications of existing and evolving technology from both new and established research programs at CMU and elsewhere. These programs include automated speech recognition (Sphinx-II), image understanding (Machine Vision), natural language processing (TIPSTER, Ferret, Scout), human-computer interaction (ALT), distributed data systems (AFS, DFS), networking (INI), and security and economics of access (NetBill). Multi-disciplinary teams will apply these technologies to the problems of video information systems to permit rapid population of the library and agent-assisted retrieval from vast information stores in a commercially feasible manner.
Our collection will incorporate not only the broadcast programs themselves, but also the unedited source materials from which they were derived. Such background materials enrich our library significantly, as reference resources and for uses other than those originally intended. They also enlarge it greatly: Typical QED sources run 50 to 100 times longer than the corresponding broadcast footage. A recent interview with Arthur C. Clarke for WQED's Space Age series, for example, produced two minutes of airtime from four hours of tape.
Our particular combination of video resources should enable our users to retrieve same subject matter material presented at varying levels of complexity - ranging from the popular example-based presentation often used in PBS documentaries through elementary and high school presentations from Fairfax Co., to the more advanced college-level treatment by the OU. The self-learner at any level can iterate on the search in order to build understanding and comprehension through multiple examples and decreasing (or increasing) depth, complexity, and formalism.
By itself, the video library's larger vocabulary will degrade the recognition rate. However, several innovative techniques will be exploited to reduce errors. Program-specific information, such as topic-based lexicons and interest-ranked word lists, will additionally be employed by the recognizer. Word hypotheses will be improved by using adaptive, "long-distance" language models, and we will use a multi-pass recognition approach that considers multi-sentence contexts. Aiding these processes is the fact that our video footage will typically be of high audio quality and will be narrated by trained professionals.
The transcript generated by Sphinx-II need not be viewed by users, but will be hidden, and will be time-aligned with the video for subsequent retrieval. Because of this, we expect our system will tolerate higher error rates than those that would be required to produce a human-readable transcript. On-line scripts and closed-captioning, where available, can provide base vocabularies for recognition and searchable texts for early system versions.
The Informedia system will extend current leading-edge performance systems and algorithms (TIPSTER, Ferret, Scout) and apply them to augment and index the Sphinx-transcribed soundtracks. Tasks will include (1) developing facilities to automate error reduction in transcripts through post-processing, (2) identifying topics and subtopics in transcript collections, (3) exploring "sound-alike" matching as a technique to overcome out-of-vocabulary misses during transcription, and (4) enriching the natural language retrieval request interface. This integrated approach will significantly increase the system's ability to locate a particular video segment quickly, despite transcription errors, inadequate keywords, and ambiguous sounds.
The user interface will likely include a context-sizing slide switch. This control will enable the user to adjust the duration and information content of retrieved segments. A skimming dial control will allow the user to adjust the information playback rate as well as to adjust the media playback rate. When a search returns multiple hits, parallel presentation will be used to simultaneously present numerous intelligently chosen sequences. Functionality will be provided to enable the user to extract subsequences from the delivered segments and reuse them for other purposes in various forms and applications.
Other researchers will be invited to participate in exploring or building upon our research system. We foresee several forms of involvement:
QED Enterprises, the commercial division of QED Communications, will be pursuing follow-on commercial licensing opportunities incorporating the systems and technologies developed. They will explore the use of video assets as library source materials for education and training markets. In collaboration with other project affiliates, QED Enterprises will assess the business model for providing continuing video library reference services to local area schools, hospitals, and commercial clients.
The Open University, U.K., will assess the application of our library system to structured distance learning, applicable to both academic and commercial training and education. Companies will participate in technology exchange with us, gaining early access to our research and in turn providing the project with their experimental hardware and software. This enables us to evaluate and build upon new commercial infrastructure and industry standards as they become available.
CMU and QED will serve as parent organizations with a governing board representing both institutions and other founding project members, with additional input from industrial and other institutional affiliates. CMU will manage personnel, provide space and be responsible for the operations and technical agenda. There will be joint strategic planning, public representation, and fund raising.
Our approach utilizes multiple modalities for content-based searching and video sequence retrieval. The content of video data is conveyed by both narrative (speech and language) and image. Only by the collaborative integration of image, speech, and natural language understanding technology can we hope to automatically populate, segment, index, and search diverse video collections. This analysis approach uniquely compensates for problems of interpretation and search in errorful and ambiguous data environments. We start with the use of a highly accurate, speaker-independent speech recognizer to automatically transcribe video soundtracks. The textual transcript is then analyzed and organized by a language understanding system and stored in a time-track-corresponding full-text information retrieval system. This allows for rapid retrieval of the individual corresponding video segments which satisfy an arbitrary subject area query based on the words in the soundtrack.
Image understanding techniques are employed for segmenting video sequences into ``video paragraphs'' by automatically locating boundaries for shots, scenes, and conversations. User controls provide for interactively requesting the corresponding video page or video volume, ``skimming'' the returned content, and reusing the stored video objects in diverse ways.
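To make the data flow concrete, the following fragment is a minimal, purely illustrative sketch (the names and structures are hypothetical, not the Informedia implementation) of how a time-aligned transcript could be indexed so that a word-level query maps back to the enclosing video paragraph:

# Minimal sketch of time-aligned transcript indexing (hypothetical, illustrative only).
from collections import defaultdict

# Each transcript word carries the time (in seconds) at which it was recognized.
transcript = [("satellite", 12.4), ("communication", 13.1), ("clarke", 15.0)]

# Video paragraph boundaries (start, end) found by image/audio segmentation.
paragraphs = [(0.0, 10.0), (10.0, 25.0), (25.0, 40.0)]

# Inverted index: word -> list of time stamps at which the word was spoken.
index = defaultdict(list)
for word, t in transcript:
    index[word].append(t)

def retrieve(query_word):
    """Return the video paragraphs whose time span contains the query word."""
    hits = []
    for t in index.get(query_word, []):
        for start, end in paragraphs:
            if start <= t < end:
                hits.append((start, end))
    return hits

print(retrieve("clarke"))   # -> [(10.0, 25.0)]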
The system will be instrumented to enable the study of user protocols and human factors issues peculiar to manipulating video segments. We will implement a network billing server to study the economics of charging strategies and incorporate mechanisms to ensure privacy and security. We will also deploy and evaluate the system at Carnegie Mellon University and in K-12 schools.
The project builds upon extensions and applications of existing and evolving technology from established research programs at CMU in automated speech recognition (Sphinx-II), image understanding (Machine Vision), natural language processing (Tipster, Ferret, Scout), human-computer interaction (ALT), distributed data systems (AFS, DFS), networking (INI), and security and economics of access (NetBill). Application of these technologies in this new domain will permit rapid population of the digital video library and agent-assisted retrieval from vast information stores, realizable in commercial settings. To appreciate the scope of the integration task, three perspectives of the system are presented: a user's perspective; a technology perspective; and a systems engineering perspective.
Transparent to the user, the system has just performed highly accurate, speaker-independent, continuous speech recognition on her query. It then used sophisticated natural language processing to understand the query and translate it into retrieval commands to locate relevant portions of digital video. The video is searched based on transcripts from audio tracks that were automatically generated through the same speech recognition technology. The appropriate selection is further refined through scene sizing developed by image understanding technology.
Almost as soon as she has finished her question, the screen shows several icons, some showing motion clips of the video contained, followed by text forming an extended title/abstracts of the information contained in the video (see Figure 1).
Making this possible, image processing helped select representative still images for icons and sequences from scenes for intelligent moving icons. Speech recognition created transcripts which are used by natural language technologies to summarize and abstract the selections.
Through either a mouse or a spoken command, the student requests the second icon. The screen fills with a video of Arthur Clarke describing how he did not try to patent communications satellites, even though he was the first to describe them. Next the student requests the third choice, and sees villages in India that are using satellite dishes to view educational programming.
When she asks to go back, Arthur Clarke reappears. Now, speaking directly to Clarke, she wonders if he has any thoughts on how his invention has shaped the world. Clarke, speaking from his office, starts talking about his childhood in England and how different the world was then. Using a skimming control, she finds a particularly relevant section to be included in her multimedia composition.
Beyond the requisite search and retrieval, giving our student such functionality requires image understanding to intelligently create scenes and the ability to skim them.
The next day she gives her teacher access to her project. More than a simple presentation of a few video clips, our student has created a video laboratory that can be explored and whose structure is itself indicative of the student's understanding.
Helping this student be successful are tools for building multimedia objects that include assistance in the language of cinema, appropriate use of video, and structuring composition. Behind the scenes the system has created a profile of how the video was used, distributing that information to the library's accounts. Assets for which the school has unlimited rights are tracked to understand curricular needs, and accounts are debited for assets to which the school holds restricted, pay-per-use rights.
Previous demonstrations ranging from the Aspen Project [Lippman80] to ClearBoard [Ishii92] used analog video, which limits the user interface design. Current multimedia applications, usually CD-ROM based, associate short video and audio objects with an image or section of text (hypermedia), providing only the ability to select video clips based on the title. These techniques have been employed since the first computer-controlled videodiscs [Fuller82]. Because they treat the video segment as a black box, however, they are totally inadequate for access to extremely large digital video libraries. These projects have also ignored the need for accounting of digital video use to allow owners of copyrighted material to be appropriately compensated.
By contrast, the following five technologies and research areas are synergistically integrated in Informedia:
Informedia will provide the user with a variety of techniques to locate desired material in the library. An initial query may be typed on a keyboard, clicked with a mouse or spoken into a microphone. Techniques for automatically computing image similarity can be used to process visual queries, allowing the user to find related video segments with similar images or backgrounds.
Today's designs typically employ a VCR/Video-Phone view of multimedia. In this simplistic model, video and audio can be played, stopped, their windows positioned on the screen, and, possibly, manipulated in other ways such as by displaying a graphic synchronized to a temporal point in the multimedia object. This is the traditional analog interactive video paradigm developed almost two decades ago. Rather than interactive video, a much more appropriate term for this is ``interrupted video.''
Today's interrupted video paradigm views multimedia objects more as text with a temporal dimension [Hodges89, Yankelovich88]. Researchers note the unique nature of motion video. However, differences between motion video and other media, such as text and still images, are attributed to the fact that time is a parameter of video and audio. In the hands of a user, every medium has a temporal nature. It takes time to read (process) a text document or a still image. However, in traditional media each user absorbs the information at his or her own rate. One may even assimilate visual information holistically, that is, come to an understanding of complex information nearly at once.
However, to convey almost any meaning at all, video and audio must be played at a constant rate, the rate at which they were recorded. While a user might accept video and audio played back at 1.5 times normal speed for a brief time, it is unlikely that users would accept long periods of such playback rates. In fact, studies show that there is surprisingly significant sensitivity to altering playback fidelity [Christel91]. Even if users did accept accelerated playback, the information transfer rate would still be principally controlled by the system.
The real difference between video or audio and text or images is that video and audio have constant rate outputs that cannot be changed without significantly and negatively impacting the user's ability to extract information. Video and audio are constant-rate, continuous-time media. Their temporal nature is constant due to the requirements of the viewer/listener. Text is a variable-rate continuous medium. Its temporal nature only comes to life in the hands of the users.
While video and audio data types are constant rate, continuous-time, the information contained in them is not. In fact, the granularity of the information content is such that a one-half hour video may easily have one hundred semantically separate chunks. The chunks may be linguistic or visual in nature. They may range from sentences to paragraphs and from images to scenes.
Understanding the information contained in video is essential to successfully implementing the Informedia digital video library. Returning a full one-half hour video when only one minute is relevant is much worse than returning a complete book, when only one chapter is needed. With a book, electronic or paper, tables of contents, indices, skimming, and reading rates permit users to quickly find the chunks they need. Since the time to scan a video cannot be dramatically shorter than the real time of the video, a digital video library must give users just the material they need. Understanding the information content of video enables Informedia to not only find the relevant material but to present it in useful forms.
Content is conveyed in both narrative (speech and language) and image. Only by the collaborative interaction of image, speech and natural language understanding technology can we hope to automatically populate, segment, index, and search diverse video collections with satisfactory recall and precision. This approach uniquely compensates for problems of interpretation and search in error-full and ambiguous data environments.
Building on these principles, parallel presentation of video, taking advantage of the special abilities of the human vision system, will be investigated in this project. When a search produces multiple hits, as will usually be the case, the system presents numerous sequences simultaneously in separate windows. Several representations of this extracted video will be tested in this project. The simplest, single images extracted from the video, will use the first image with valid (i.e. non-blank) data as determined by the image recognition techniques described in section 2.4. A slightly more complex representation will be motion icons, micons [Brondmo90]. As implemented by Brondmo, micons are short motion sequences extracted from the first few seconds or minutes of the video they are to represent.
Both still iconic and miconic representations of video information can easily mislead a user. For example, a search for video sequences related to transportation of goods during the early 1800's may return 20 relevant items. If the first 20 seconds of several sequences are ``talking head'' introductions, icons and micons will provide no significant visual clue about the content of the video; the information after the introduction may or may not be interesting to the user. However, intelligent moving icons, imicons, may overcome some of these limitations. Image segmentation technology can create short sequences that more closely map to the visual information contained in the video stream. Several frames from each new scene will be used to create the imicon. This technique will allow for the inclusion of all relevant image information in the video and the elimination of redundant data.
For a video containing only one scene with little motion, a micon may be the appropriate representation. If video data contains a single scene but with considerable motion content, or multiple scenes, the imicon is needed to display the visual content. To determine the imicon content, significant research will be performed on the optimal number of frames needed to represent a scene, the optimal frame rate, and the requisite number of scenes needed for video representation. Since the human visual system is adept at quickly finding a desired piece of information, the simultaneous presentation of intelligently created motion icons will let the user act as a filter to choose high interest material.
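As a purely illustrative sketch of the imicon idea (the frame-selection heuristics shown are hypothetical, and are precisely what the proposed studies will determine), a few frames could be sampled from each detected scene rather than from the opening seconds alone:

# Illustrative imicon assembly: sample a few frames from each detected scene.
def build_imicon(scene_boundaries, frames_per_scene=3):
    """scene_boundaries: list of (first_frame, last_frame) pairs, one per scene.
    Returns the frame numbers that make up the intelligent moving icon."""
    imicon_frames = []
    for first, last in scene_boundaries:
        length = last - first
        # Spread the samples evenly across the scene rather than taking only
        # its opening frames, so redundant material is skipped.
        step = max(1, length // frames_per_scene)
        imicon_frames.extend(range(first, last, step)[:frames_per_scene])
    return imicon_frames

print(build_imicon([(0, 90), (90, 300), (300, 330)]))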
Detailed indexing of the video can aid this process. However, users often wish to peruse video much as they flip through the pages of a book. Unfortunately, today's mechanisms for this are inadequate. Scanning by jumping a set number of frames may skip the target information completely. On the other hand, accelerating the playback of motion video to, for instance, twenty times normal rate presents the information at an incomprehensible speed. Even if users could comprehend such accelerated motion, this would still take six minutes to scan through two hours of videotape. A two second scene would be presented in only one tenth of a second.
Playing audio fast during the scan will not help. Beyond 1.5 or 2 times normal speed, audio becomes incomprehensible since the faster playback rates shift frequencies to inaudible ranges [Degen92]. Digital signal processing techniques are available to reduce these frequency shifts, but at high playback rates these techniques present sound bites much like the analog videodisc scan.
Tools have been created to facilitate sound browsing which present graphical representations of the audio waveform to the user to aid identification of locations of interest. However, this has been shown to be useful only for audio segments under three minutes [Degen92]. When searching for a specific piece of information in hours of audio or video, other mechanisms will be required. In previous work at CMU [Christel92, Stevens89], a multidimensional model of multimedia objects (text, images, digital video, and digital audio) was developed. With this model (called ALT), variable granularity knowledge about the domain, content, image structure, and the appropriate use of the multimedia object is embedded with the object. Based on a history of current interactions (input and output), the system makes a judgment on what to display, and how to display it. Techniques using such associated abstract representations have been proposed as mechanisms to facilitate searches of large digital video and audio spaces [Stevens92]. In this scheme, embedding knowledge of the video information with the video objects allows for scans by various views, such as by content area or depth of information.
Using automatically derived transcripts as detailed in section 2.4, natural language searching, and video segmentation, video objects are imbued with knowledge about their content and their use. This allows first-pass searches to retrieve more focused segments of video. Integrated together, these techniques will permit the creation of context-sizing interfaces. These will simulate slide switches, enabling the user to adjust the ``size'' of the retrieved video/audio segments for playback. Here, the ``size'' may be time duration, but more likely it will be abstract knowledge chunks where information complexity or type will be the determining measure. This research will investigate the appropriate metaphors to use when the ``size'' the user is adjusting is abstract content. Here we will research what it means, from both an interface development and a search methods standpoint, to permit the user to say ``I want more background on each subject returned.''
Application of these techniques will permit the development of a skimming dial, an information-based scanning of digital video data. Much like the chapter and section headings in a book, the skimming dial will permit fast, content based perusal of video data. Even though the Informedia system is designed to return the most appropriate data, with the rich set of information available this ``dial'' will be critical, allowing users to skim by content, more precisely finding desired information in video.
Today, excellent stand-alone tools to edit digital video exist in the commercial market and this project will use commercial off-the-shelf (COTS) tools when available. However, there are currently no tools to aid in the creative design and use of video as there are for document production. One reason is the intrinsic, constant rate temporal aspect of video. Another is the complexities in understanding the nature and interplay of scene, framing, camera angle, and transition. Building on previous work at CMU [Stevens89, Christel92], tools will be developed to provide expert assistance in cinematic knowledge. The long range goal will be to integrate the output of the image understanding and natural language understanding sub-systems with this tool to create semantic understanding of the video. This would make possible context sensitive assistance in the reuse of video and its composition into new forms.
As an example, compared with watching a linear interview, permitting a student to interview an important historical or contemporary figure would provide a more interesting, personal, and exploratory experience. Creating such a synthetic interviewee is possible with existing video resources. Broadcast productions typically shoot 50 to 100 times as much material as they actually broadcast. As previously noted, WQED interviewed Arthur C. Clarke for its recent series ``Space Age.'' Approximately two minutes of the interview was broadcast, but over 4 hours were taped. While few would want to sit through 4 hours of an interview, many would like to ask their own questions. It would be especially interesting and motivating if the character responded in a fashion that caused the viewer to feel as if the answer was ``live.'' That is, specifically and dynamically created in response to the question.
Similar synthetic interviews have been hand-crafted [Stevens89, Christel92]. For typical users to create such an interview, new tools will be needed. Certainly, searching, parallel presentation, context sizing, and skimming will be needed to find, organize, and size the responses. In addition, tools must be developed to refine automatically generated abstractions of the transcript, associate those abstractions to specific responses, and define how to use the responses. The nature and form of such tools will be investigated through iterative development and testing with users interacting with the digital video testbed.
Because the economic value of any individual video clip or other information object may be only a few cents, these services must be provided in a highly automated way, so that the transaction costs associated with any individual purchase amount to fractions of a cent. While keeping the transaction simple, suppliers must have assurance that a user does not, with a few keystrokes, initiate transactions which far outstrip his or her ability to pay, while the user must be protected against fraudulent charges being debited to his account.
We envision an internet billing service, or NetBill System, which can provide all of the services necessary to account for intellectual property delivered via a network. The system will be designed to provide these services not just for the Informedia collections, but also for collections of intellectual property made available by other organizations. In particular, two other responders to this solicitation -- one group from CMU and one from M.I.T. -- have indicated their desire to rely on the NetBill system to handle these functions for the Digital Library testbeds they are proposing. Our intent is to design NetBill as an open system which could be used by any organization providing intellectual property even through simple means such as an anonymous FTP server. Because we expect multiple financial services organizations to eventually wish to provide factoring services, open protocol interfaces will be designed so that users may charge purchases through any of several independently operated NetBill compatible systems.
The Informedia system will incorporate a number of independently developed research systems whose data and processes must be integrated. Figure 2 provides a pictorial view of how the Informedia system integrates these processes and data flows. Processes are divided between those that occur off-line and are time-insensitive, and those that are executed on-line in real time for the interacting user.
Furthermore, the system must be organized and constructed to allow for modification and incorporation of (1) emerging standards for media compression and storage, (2) breakthrough products for media manipulation and display, and (3) evolving high-bandwidth communication standards and services. Field experiments will produce feedback regarding system usability and performance which must be addressed rapidly in subsequent releases. Therefore, where possible and practical, we will build Informedia upon existing commercial hardware and software products.
We propose that the system be constructed as four major ``subassemblies,'' each of which provides a major function which can be independently implemented without a need-to-know about one another's internal structure.
Each subassembly has within it a number of subtasks, which constitute the subassembly's role in Informedia. The background library creation subassembly takes in new analog video and audio and produces an indexed video, audio and textual database through the following processes:
The construction of the data and network infrastructure is based upon extant technology from research and commercial systems that has not yet been applied to the video server task. This approach provides predictable baseline performance and reliability levels, which may nevertheless be strained by this application. We incorporate real-time Unix extensions, hierarchically cached file systems, and commercial switched multi-megabit data services to provide the required performance and bandwidth.
The long-term implementation plan anticipates shifting our data repositories to industrial video-on-demand servers as they become available and accessible to us. This reinforces our adherence to industry or de facto standards in areas where there is marginal, if any, gain in building our own experimental systems.
Unlimited-vocabulary, speaker-independent, connected speech recognition is an unsolved problem. However, recent results in the recognition of newspaper dictation hold the promise of being able to create automatic transcriptions of unlimited-vocabulary spoken language. In this section we describe the current state of the art, the problems that remain to be solved in order to make progress on video transcription, our proposed method of approach, and the expected results in the 1996 time frame.
The current best system, Sphinx II, uses a 20,000 word vocabulary to recognize connected spoken utterances from many different speakers. The task domain is recognition of dictation of passages from the Wall Street Journal. On a 150 MIPS DEC Alpha workstation the system operates in near real time and on average makes one error out of eight words.
Sphinx-II uses senonic semi-continuous hidden Markov models (HMMs) to model between-word context-dependent phones. The system uses four types of codebooks: mel-frequency cepstral coefficients, 1st cepstral differences, 2nd cepstral differences, and power and its first and second differences. Twenty-seven phone classes are identified, and a set of four VQ codebooks is trained for each phone class [Hwang94]. Cepstral vectors are normalized with an utterance-based cepstral mean value. The semi-continuous observation probability is computed using a variable-sized mixture of the top Gaussians from each phone-dependent codebook.
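As a minimal numerical illustration of two of the front-end operations mentioned above, utterance-based cepstral mean normalization and first/second cepstral differences, the following sketch assumes a precomputed matrix of mel-frequency cepstral coefficients; it is illustrative only and is not the Sphinx-II code:

import numpy as np

def normalize_and_difference(cepstra):
    """cepstra: array of shape (frames, coefficients) of mel-frequency cepstra.
    Returns (mean-normalized cepstra, first differences, second differences)."""
    # Utterance-based cepstral mean normalization.
    normalized = cepstra - cepstra.mean(axis=0)
    # First and second cepstral differences (simple frame-to-frame deltas).
    delta1 = np.diff(normalized, n=1, axis=0, prepend=normalized[:1])
    delta2 = np.diff(delta1, n=1, axis=0, prepend=delta1[:1])
    return normalized, delta1, delta2

cep = np.random.randn(100, 13)            # 100 frames, 13 coefficients
norm, d1, d2 = normalize_and_difference(cep)
print(norm.shape, d1.shape, d2.shape)     # (100, 13) for each stream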
The recognizer processes an utterance in four steps:
Multiple Signal to Noise Ratio Problem. Broadcast video productions, whether documentary-style interviews or theatrical productions, require recognizing speech from multiple speakers standing in different locations. This results in speech signals with different signal-to-noise ratio properties. Further confounding the problem are the effects of different orientations of the speakers and reverberation characteristics of the room [Liu93]. Signal adaptation techniques have been developed which appear to automatically correct for such variability. However, such systems have not been tested in environments where nearly every other sentence has a different signal-to-noise ratio. We expect, with appropriate preprocessing and detection of the signal levels, to be able to modify the current CDCN technology to solve this problem.
Multiple Unknown Microphone Problem. Most current systems optimize the performance using close talking head mounted microphones. As we go to table top microphones, lapel microphones, and directional boom microphones traditionally used in broadcast video productions, the variability arising from differences in microphone characteristics and differences in signal to noise ratios will significantly degrade performance. Recent results by Stern and Sullivan indicate that dynamic microphone adaptation can significantly reduce the error without having to retrain for the new microphone [Sullivan & Stern 93].
Fluent Speech Problem. In a typical video interview, people speak fluently. This implies many of the words are reduced or mispronounced. Lexical descriptions of pronunciations used in conventional systems for dictation where careful articulation is the norm will not work very well for spontaneous, fluent speech. At present the only known technique is for manual adaptation of the Lexicon using knowledgeable linguists. It is our hope that this task domain will provide us with a rich data source so that automatic pronunciation learning techniques can be formulated to handle fluent speech phenomena.
Unlimited Vocabulary Problem. Unlike the Wall Street Journal dictation task, where the domain limits the size and nature of the vocabulary likely to be used in sentences, video transcriptions generally do not have such constraints. However, they do represent specific task domains. Our recent research in long-distance language models indicates that a twenty to thirty percent improvement in accuracy may be realized by dynamically adapting the vocabulary based on words that have recently been observed in prior utterances. In addition, most broadcast video programs have significant descriptive text available. These include early descriptions of the program design called treatments, working scripts, abstracts describing the program, and captions. In combination, these resources can provide valuable additions to the dictionaries used by the recognizer.
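The flavor of such dynamic adaptation can be conveyed with a toy cache-interpolated unigram model, a deliberately simplified stand-in for the long-distance language models described above:

from collections import Counter

class CacheUnigram:
    """Toy cache language model: interpolates a static unigram estimate with a
    cache of recently recognized words, boosting words seen in prior utterances
    (illustrative of dynamic vocabulary adaptation only)."""
    def __init__(self, static_probs, cache_weight=0.2):
        self.static = static_probs          # dict: word -> base probability
        self.weight = cache_weight
        self.cache = Counter()

    def observe(self, words):
        self.cache.update(words)

    def prob(self, word):
        total = sum(self.cache.values()) or 1
        p_cache = self.cache[word] / total
        p_static = self.static.get(word, 1e-6)
        return (1 - self.weight) * p_static + self.weight * p_cache

lm = CacheUnigram({"satellite": 1e-4, "the": 5e-2})
lm.observe(["satellite", "orbit", "satellite"])
print(lm.prob("satellite") > 1e-4)   # True: recently observed words are boosted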
Relaxed Time Constraints. For transcription of digital video, processing time can be traded for higher accuracy. The system doesn't have to operate in real time. This permits the use of larger, continuously expanding dictionaries and more computationally intensive language models and search algorithms. In initial untuned testing on our current system, interview data appears to increase the error rate from around 12% to over 50% in unrestricted environments. It is our expectation that using the techniques outlined above we will remove many of the new problems. In addition, by removing the constraint of real-time processing it will be possible to deepen searches by the recognizer beyond that currently used in real-time applications.
By the end of 1995 the error rate is expected to return to 12 to 15% for unrestricted video data. Improvements in computer technology, search technology, and speech processing techniques can be expected to reduce the error again by one half resulting in a 5 to 6% word error rate by 1996. At these levels we believe semantically based indexing techniques proposed in this proposal should prove to be acceptable in routine use of multi-media libraries.
An initial query may be textual, entered either through the keyboard, mouse, or spoken words entered via microphone and recognized by the system. Subsequent refinements of the query, or new, related queries may relate to visual attributes such as, ``find me scenes with similar visual backgrounds.''
Subsequent goals for the project flow from the potential for automatic processing of the audio track.

Summarization: by analyzing the words in the audio track for each visual paragraph, the Informedia system will be able to determine the subject area and theme of the narrative. This understanding can be used to generate headlines or summaries of each video segment for icon labeling, tables of contents, or indexing.

Tagging: using data extraction technology (from the Tipster and Scout projects), Informedia will be able to identify names of people, places, companies, organizations and other entities mentioned in the sound track. This will allow the user to find all references to a particular entity with a single query.

Transcript correction: the most ambitious goal is to automatically generate transcripts of the audio with speech recognition errors corrected. Using semantic and syntactic constraints from NLP, combined with a phonetic knowledge base such as the Sphinx dictionary, some recognition errors should be correctable.
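As a toy illustration of the tagging goal only (real Tipster/Scout extraction uses far richer pattern sets, and raw recognizer output lacks the capitalization this sketch relies on), a transcript fragment can be scanned for candidate proper-name sequences:

import re

# Toy entity spotting on a (capitalized) transcript fragment; illustrative only.
transcript = "Arthur Clarke described communications satellites to the BBC in London."
pattern = r"\b(?:[A-Z][a-z]+|[A-Z]{2,})(?:\s+(?:[A-Z][a-z]+|[A-Z]{2,}))*"
candidates = re.findall(pattern, transcript)
print(candidates)   # ['Arthur Clarke', 'BBC', 'London']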
Current retrieval research focuses on newspapers, electronic archives, and other sources of ``clean'' documents. Natural language queries (instead of complex query languages) allow straightforward description of the material sought [TREC93].
The video retrieval task challenges the state of the art in two ways:
SELF FULFILLING PROPHECIES

but because Sphinx was run using a smaller dictionary that does not contain the words ``prophecy'' or ``prophecies,'' Sphinx returns the closest phonetic match:

SELF FULFILLING PROFIT SEIZE

Maintaining high recall performance will require the retrieval of segments in spite of such mis-recognition.
Performance of current retrieval algorithms on transcribed speech with recognition errors: The search and retrieval operations for regular text are well understood [Mauldin89,91, Jacobs93], but existing work has focused on high quality newswire text. What is not understood is how well these algorithms work in the context of spoken rather than written speech, and how their performance is degraded by errors in recognition.
Elaboration of syntactic and semantic models for spoken language: Our current retrieval technology relies on pattern sets and grammars that were developed for retrieving newspaper-quality texts from full-text databases. They do not address the additional complexity of spoken language.
Enhancement of pattern matching and parsing to recover from and correct errors in the token string: CMU researchers are already investigating the use of noise-tolerant grammar-based parsing [Lavie93]. We will investigate the use of this technique and the statistical techniques developed for SCOUT on the Digital Video Library corpus. Using the phonetic similarity measures produced by the Sphinx System, a graded string similarity measure will be used to retrieve and rank partial matches.
To address the issue of the inadequacy of current retrieval algorithms, we propose to first document their performance on transcribed video. We will create a test collection of queries and relevant video segments from the digital library. Using manual methods we will establish the relevant set of segments from the library. We will then use the test collection to evaluate the retrieval performance of our existing retrieval algorithms in terms of recall and precision.
We will use the results of the baseline performance test to direct additional research into two main lines of attack: (1) we will elaborate current pattern sets, rules, grammars and lexicons to cover the additional complexity of spoken language by using large, data-driven grammars. To provide efficient implementation and high development rates, we will use regular expression approximations to the context free grammars typically used for natural language. This approach worked well in our Textract data extraction system that was evaluated by ARPA under the Tipster text program. Our hypothesis is that extending this technique to an automatically recognized audio track will provide acceptable levels of recall and precision in video scene retrieval. (2) We will extend the basic pattern matching and parsing algorithms to be more robust, and to function in spite of lower level recognition errors by using a minimal divergence criterion for choosing between ambiguous interpretations of the spoken utterance. CMU's SCOUT text retrieval system already uses a partial match algorithm to recognize misspelled words in texts. We would extend our existing algorithm to match in phonetic space as well as textual. So the earlier example is converted in phonetic space to:
Query: P R AA1 F AH0 S IY0 Z    (prophecy)
Data:  P R AA1 F AH0 T S IY1 Z  (profit seize)

which deviate only by one insertion (T) and one change in stress (IY0 to IY1).
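The deviation noted above can be scored with a standard edit distance computed over phone symbols rather than characters; the following sketch is illustrative only and is not the SCOUT partial-match implementation:

def phone_edit_distance(a, b):
    """Minimal Levenshtein distance over phone symbols (insert/delete/substitute = 1)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

query = "P R AA1 F AH0 S IY0 Z".split()
data  = "P R AA1 F AH0 T S IY1 Z".split()
print(phone_edit_distance(query, data))   # 2: one insertion, one stress change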
We will focus first on error-tolerance, and later we will extend that to error correction. We will periodically re-evaluate the performance of the retrieval against the baseline to track accomplishments.
Since the first phase of the retrieval engine must function early in the project to allow for iterative development and rapid prototyping, the first year's focus is on adapting existing boolean and vector-space models of information retrieval to the Informedia architecture. We will also develop a test collection to measure recall and precision, and establish a base line performance level. Users will have various options for ordering the returned set of ``hits,'' and for limiting the size of the hits as well.
In subsequent years we will develop enhancements to the retrieval algorithms, including phonetics-based matching, use of robust parsing, and concept-based retrieval. We will also explore the use of language processing of the audio track to improve the scene segmentation.
Finally, we will investigate the correction of transcript errors using context-based language processing. We will also investigate the automatic summarization of retrieved material and build a module that will assemble the retrieved segments into a single user-oriented video sequence.
The two standard measures of performance in information retrieval are recall and precision. Recall is the proportion of relevant documents that are actually retrieved, and precision is the proportion of retrieved documents that are actually relevant. These two measures may be traded off one for the other, and the goal of information retrieval is to maximize them both.
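In symbols, writing R for the set of segments relevant to a query and A for the set actually retrieved:

\[
\mathrm{recall} = \frac{|R \cap A|}{|R|}, \qquad
\mathrm{precision} = \frac{|R \cap A|}{|A|}.
\]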
Each technology will be included in the testbed evaluation system as it is developed. Each additional algorithm enhancement or module will be evaluated for its effect on recall, precision, and user functionality.
The first capability required for the digital video library is segmentation, or ``paragraphing,'' of video into groups of frames when the video library is formed. Each group can be reasonably abstracted by a ``representative frame,'' and thus can be treated as a unit for context sizing or for image content search. Part of this task can be done by content-free methods that detect big ``image changes,'' for example, ``key frame'' detection by changes in the DCT coefficients.
However, to be completely successful we need content-based video paragraphing methods; for example, an individual speaker paragraph or a ``talking head'' paragraph. Content-based query is also essential. What an Informedia user is interested in is ``subject or content'' retrieval, not just ``image'' retrieval. The subject consists of both image content and textual content; the combination specifies the subject. The attached textual information is useful for quickly filtering video segments to locate potential items of interest. But subsequent query is often visual, referring to image content. For example, ``Find video with similar scenery,'' ``Find the same scene with different camera motion,'' ``Find video with the same person,'' and so on. Again, we notice that part of the capability can be realized by content-free methods, such as histogram comparison, but real solutions lie in content-based image search, which presents a long-term challenge to the field of computer vision research.
Image statistics methods compute primitive image features and their time functions, such as color histograms [Swain91, Gong92], coding coefficients, shape [Kato92, Satoh92] and texture measures [Kato92], and use them for indexing, matching and segmenting images. This is a practical and powerful approach for some applications, but it obviously deals only with images, not their content.
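One such statistic, the color histogram intersection of [Swain91], can be sketched as follows (illustrative only); it scores how much of one frame's color distribution is accounted for by another's:

import numpy as np

def histogram_intersection(h1, h2):
    """Color histogram intersection [Swain91]: sum of bin-wise minima,
    normalized by the mass of the second histogram. 1.0 = identical distributions."""
    return np.minimum(h1, h2).sum() / h2.sum()

# Toy 8-bin color histograms for two frames.
frame_a = np.array([10, 5, 0, 20, 3, 0, 1, 1], dtype=float)
frame_b = np.array([ 9, 6, 1, 18, 4, 0, 1, 1], dtype=float)
print(histogram_intersection(frame_a, frame_b))   # ~0.93: visually similar frames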
Content-based retrieval by content-preserving coding has shown good results for a relatively well-defined class of static images of a single object, such as faces, textures, and 2D shapes [Pentland93a,b,94]. Ambitious efforts toward automatic and semi-automatic stratification of movies are also being studied [Smith91, Satoh92]. However, not only is the visual recognition difficult, but the text input for context setting is also done manually.
CMU's Image Understanding group with 80 researchers is one of the largest vision groups in the country. The group's activities range from basic physics-based vision theory to applied vision systems for a large scale mobile robot. In particular, the results in color understanding [Klinker88,90], stereo [Kanade91], motion tracking [Lucas81], shape and motion from image sequence [Tomasi92, Poelman93], and efficient shape matching with arbitrary orientation and location [Yoshimura93] are unique and have a great potential for application to the digital video library. Our work for Informedia proposed below will be based on these results, and will continue to be leveraged by the basic image understanding effort.
Use of Comprehensive Image Statistics for Segmentation and Indexing. Raw video materials are first segmented into video paragraphs so that each segment can be connected/integrated for indexing with the transcribed text. This initial segmentation can be done in a relatively content-free manner, as has been demonstrated previously, by using image statistics, especially by monitoring coding coefficients, such as DCT, and detecting fast changes in them. This analysis also allows for identifying the key frame(s) of each video paragraph; the key frame is usually at the beginning of the visual sentence and relatively static.
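A minimal sketch of this content-free step follows; the threshold and the per-frame statistic are placeholders standing in for whatever coding coefficients are actually monitored:

import numpy as np

def detect_boundaries(frame_stats, threshold):
    """frame_stats: array of shape (frames, coefficients), one feature vector
    (e.g. block DCT coefficients or a histogram) per frame.
    Returns frame indices where a new video paragraph is judged to begin."""
    diffs = np.abs(np.diff(frame_stats, axis=0)).sum(axis=1)
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]

# Toy example: a sharp change at frame 3 marks a cut.
stats = np.array([[1.0, 1.0], [1.1, 0.9], [1.0, 1.0], [6.0, 5.5], [6.1, 5.4]])
print(detect_boundaries(stats, threshold=3.0))   # -> [3]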
Once a video paragraph is identified, we extract image features like texture, color, and shape from video as attributes. We will develop a comprehensive set of image statistics as part of the video segmentation tool kit. While these are ``indirect statistics'' to image content, they have been proven to be quite useful in quickly comparing and categorizing images, and will be used at the time of retrieval.
Concurrent Use of Image and Speech/Language Information. In addition to image properties, other cues, such as speaker changes, timing of audio and/or background music, and change in content of spoken words, can be used for reliable segmentation. Figure 5 is an example where keywords are used to locate items of interest and then image statistics (motion) are used to select representative figures of the video paragraph. In this example, the words ``toy'' and ``kinex'' have been used as key words. The initial and closing frames have similar color and texture properties.
Figure not available
Structural and temporal relationships between video segments can also be extracted and indexed.
Camera and Object Motion in 2D. One important kind of visual segmentation is based on the computer interpreting and following smooth camera motions such as zooming, panning, and forward camera motion. Examples include a large panoramic scene being surveyed, the camera (and narration) focusing the viewer's attention on a small area within a large scene, or a camera mounted on a vehicle such as a boat or airplane.
A more important kind of video segment is defined not by motion of the camera, but by motion or action of the objects being viewed. For example, in an interview segment, once the interviewer has been located by speech recognition, the user may desire to see the entire clip containing the interview with this same person. This can be done by looking forward or backward in the video sequence to locate the frame at which this person appeared or disappeared from the scene. Such single-object tracking is relatively easy; we are in fact capable of tracking far more complicated objects.
A technique is being developed [Rehg94] to track high degree-of-freedom objects, such as a human hand (27 degrees of freedom), based on ``deformable templates'' [Kass87] and the Extended Kalman Filtering method [Matthies89]. Such a technique provides a tool to the video database to track and classify motions of highly articulated objects.
Object Presence. Segmenting video by the appearance of a particular object or a combination of objects is a powerful tool. While this is difficult for a general 3D object at arbitrary location and orientation, the technique of the KL Transform has proven to work for detecting particular classes of objects. Human presence is the most important and common case of object presence detection, and we will include this function.
Object and Scene in 3D. The techniques discussed so far are two-dimensional, but video mostly represents 3D shape and motion. Adding a 3D understanding capability to the image understanding tool kit will revolutionize the scope of the system. The ``factorization'' approach, pioneered at Carnegie Mellon University [Tomasi90], has this potential. In this approach, in each image frame, an ``interest point'' operator finds numerous corner points and other points in the image that lend themselves to unambiguous matching from frame to frame. All the coordinates of these interest points, in all frames of the video sequence, are put into a large array of data. Based on linear algebra theory, it has been proven that this array - whose rank is always equal to or less than 3 - can be decomposed into shape and motion information; i.e., Observations = Shape X Motion. We will investigate the use of such 3D shape and motion understanding for the digital video library.
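A hedged numerical sketch of the rank-constrained decomposition follows; the full factorization method of [Tomasi90] includes a metric-upgrade step omitted here:

import numpy as np

def factor_measurement_matrix(W):
    """W: 2F x P matrix of tracked interest-point coordinates (x rows for F
    frames stacked over y rows). After registering each row to its mean, W is
    (ideally) rank 3 and factors into motion and shape components."""
    W_registered = W - W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W_registered, full_matrices=False)
    # Keep the three dominant components: Observations ~ Motion x Shape.
    motion = U[:, :3] * np.sqrt(s[:3])             # 2F x 3 camera motion
    shape = np.sqrt(s[:3])[:, None] * Vt[:3, :]    # 3 x P object shape
    return motion, shape

# Toy example: random rank-3 data for 4 frames (8 rows) and 10 points.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 10))
motion, shape = factor_measurement_matrix(W)
print(np.allclose(motion @ shape, W - W.mean(axis=1, keepdims=True)))   # True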
The Informedia workstation will be instrumented to keep global history of each session. This will include all of the original digitized speech from the session, the associated text as recognized by Sphinx-II, the queries generated by Scout and the video objects returned, compositions created by users, and a log of all user interactions. In essence, Informedia will be able to replay a complete session, much like a flight simulator can replay a session. This will permit both comprehensive statistical studies and detailed individual protocol analyses.
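The kind of time-stamped event log that supports such replay and protocol analysis might look like the following sketch; the structure and field names are hypothetical:

import json, time

class SessionLog:
    """Toy session instrumentation: every user or system event is appended with
    a timestamp so the session can be replayed or analyzed later."""
    def __init__(self):
        self.events = []
        self.start = time.time()

    def record(self, kind, payload):
        self.events.append({"t": time.time() - self.start,
                            "kind": kind, "payload": payload})

    def dump(self, path):
        with open(path, "w") as f:
            json.dump(self.events, f, indent=2)

log = SessionLog()
log.record("query", {"spoken": True, "text": "communications satellites"})
log.record("result_selected", {"icon": 2, "segment": "clarke_interview"})
log.dump("session.json")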
Informedia's integration of speech recognition, natural language, and image understanding technologies creates a natural, literally invisible first-order user interface for searching large corpora of digital video. Nonetheless, significant user interface issues remain. Three principal issues with respect to searching for information are: how to aid users in the identification of desired video when multiple objects are returned; how to let the user adjust the size of the video objects returned; and how to let the user quickly skim the video objects to locate sections of interest. With respect to reuse of video objects, tools are required that go beyond editing of video and include expert assistance in the visual and temporal organization of video. Solutions to these problems require an intimate understanding of digital video and the development of new modes of interfaces based on this model.
The initial studies will focus on the presentation and control interfaces:
Parallel presentation. When a search contains many hits, the system will simultaneously present icons, intelligent moving icons (imicons), and full motion sequences along with their text summarization. To develop heuristics for imicon creation, empirical studies will be performed to determine the number of unique scenes needed to represent a video chunk; the effect of camera movements and subject movements on the selection of images to represent each scene; and the best rate of presentation of images. Users will likely react differently to a screen populated by still images than to the same number of moving images. Therefore studies will also be used to identify the optimal number and mix of object types. Outcomes of this work will be input to the image and natural language understanding portions of this research to refine the scene identification and summarization capabilities of Informedia.
Context-sizing slide switch. This simulated slide switch enables the user to adjust the ``size'' (duration) of the retrieved video/audio segments for playback. Here, the ``size'' may be time duration, but more likely it will be abstract chunks where information complexity or type will be the determining measure. This research will investigate the appropriate metaphors to use when the ``size'' the user is adjusting is abstract content. Here, empirical studies will be used to help determine typical visual ``paragraphs'' for different materials. For example, it is well known that higher-production-value video has more shot changes per minute than, for example, a videotaped lecture. And although it is visually richer, it may be linguistically less dense. These studies will help determine the unique balance of linguistic and visual information density appropriate for different types of video information. Here we will research what it means, from both an interface development and a search methods standpoint, to permit the user to say ``I want more background on each subject returned.''
Skimming dial. This simulated analog rotary dial will interactively control the rate of playback of a given retrieved segment, at the expense of both informational and perceptual quality. One could also set this dial to skim by content, e.g., visual scene changes. Video segmentation will aid this process. By knowing where scenes begin and end, the Informedia system will perform high speed scans of digital video files by presenting quick representations of scenes. This can be an improvement over jumping a set number of frames, since scene changes often reflect changes in the organization of the video, much like sections in a book. Empirical studies will be conducted to determine the rate of scene presentation that best supports user searching and the differences, if any, between image selection for optimal scans and image selection for the creation of imicons.
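A minimal sketch of scene-based skimming, assuming that scene boundaries are supplied by the segmentation step, is shown below; the midpoint-frame heuristic and the frame numbers are illustrative placeholders.

    # Sketch of content-based skimming: show one representative frame per detected
    # scene at a user-controlled rate, instead of jumping a fixed number of frames.
    # Scene boundaries are assumed to come from the video segmentation step.

    def skim_frames(scene_boundaries, scenes_per_second):
        """Yield (frame_index, display_seconds) pairs for a quick scan."""
        dwell = 1.0 / scenes_per_second
        for start, end in scene_boundaries:
            representative = (start + end) // 2   # midpoint frame as a stand-in heuristic
            yield representative, dwell

    scenes = [(0, 240), (241, 500), (501, 1100)]  # (first_frame, last_frame) per scene
    for frame, secs in skim_frames(scenes, scenes_per_second=4):
        print(f"show frame {frame} for {secs:.2f}s")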
Once users identify video objects of interest they will need to be able to manipulate, organize, and reuse the video. Even the simple task of editing is far from trivial. To effectively reuse video assets, the user will need to combine text, images, video and audio in new and creative ways. To be able to write effectively, we spend years learning formal grammar. The language of film is both rich and complex, and deep cinematic knowledge, the grammar of video, cannot be required of users.
While excellent stand-alone tools to edit digital video exist, and will be used by Informedia, there are currently no tools to aid in the creative design and use of video as there are for document production. One reason is the intrinsic, constant-rate temporal aspect of video. Another is the complexity of understanding the nature and interplay of scene, framing, camera angle, and transition. Building on previous work at CMU [Stevens89, Christel92], tools will be developed to provide expert assistance in cinematic knowledge. The long range goal will be to integrate the output of the image understanding and natural language understanding sub-systems with this tool to create semantic understanding of the video.
For example, the contraposition of a high quality, visually rich presentation edited together with a selection from a college lecture on the same material may be inappropriate. However, developing a composition where the lecture material is available for those interested, but not automatically presented, may create a richer learning environment. With deep understanding of the video materials, it will be possible to more intelligently assist in their reuse.
Prototypes will be placed early on in cooperating affiliate schools and laboratories. Beyond the user studies described above, multimedia compositions will be collected and analyzed along with the histories of the users' sessions. Additionally, focused protocol analyses and exit interviews will be conducted to refine both the tools described previously and those providing assistance in the reuse of video.
Advanced multimedia applications require much more of developers and computing systems than does today's interrupted video. The multimedia equivalent of a teletypewriter I/O paradigm must be avoided to take advantage of the convergence of computing and video. Through a creative, multi-disciplinary approach, this project proposes to engineer a new digital video interface paradigm, further extending CMU research related to human factors issues in multimedia information environments [Christel91, Stevens85].
The NetBill system will evolve from three earlier Internet Billing Service prototypes designed and built at CMU [Mak91, Sirbu93, Scope93]. We have also extensively analyzed user requirements [Requirements92, 93] and design tradeoffs [Design92, 93, Scalability93]. Figure 6 illustrates our model of how the NetBill system relates to the user and the multimedia library.
NetBill depends on a number of functions working correctly, each of which poses significant research questions.
Authentication: NetBill must establish the identities of all parties to ensure the proper transfer of funds (for reasons of privacy, it may be desirable to support some types of anonymous participation in a transaction). This raises subtle security issues (for example, the interaction of authentication protocols with other security-related protocols is not yet well understood [Heintze94]; security flaws may be introduced when two protocols are combined). Scaling provides another set of challenging questions - authentication software that can easily handle ten thousand users may fail when it must handle millions of users.
Possible starting points for authentication services include the Kerberos system [Steiner88] (which is based on Needham and Schroeder's authentication protocol [Needham78]), systems based on public key digital signature methods such as RSA [Rivest78] or NIST's DSS [NIST91], or ``zero-knowledge'' protocol based methods that use dynamic probabilistic proofs [Goldwasser85]. None of these methods has yet been extended to the scale envisioned when all of the nation's K-12 students become potential users.
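None of the cited systems is reproduced here, but as a small illustration of the shared-secret end of this design space, the sketch below shows a challenge-response exchange built on a keyed hash using only the Python standard library; it omits key distribution entirely and is not a substitute for Kerberos, RSA, DSS, or zero-knowledge protocols.

    # Toy challenge-response authentication with a keyed hash (HMAC-SHA256).
    # Illustrative only: key distribution, replay protection at scale, and
    # protocol composition issues are deliberately out of scope.
    import hmac, hashlib, os

    shared_key = os.urandom(32)                      # established out of band

    def make_challenge():
        return os.urandom(16)                        # server-side nonce

    def respond(key, challenge):
        return hmac.new(key, challenge, hashlib.sha256).hexdigest()

    def verify(key, challenge, response):
        expected = hmac.new(key, challenge, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, response)

    challenge = make_challenge()
    print(verify(shared_key, challenge, respond(shared_key, challenge)))   # True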
We will select an authentication mechanism taking into account technical concerns, existing mechanisms, and standardization trends, choosing unencumbered algorithms where possible.
Security and Privacy: Security is paramount to this work; we must ensure that all transfers of funds are authorized by all relevant parties and that we can account for all funds. In particular, we will develop a mechanism for users to preauthorize their account to be charged for a certain sum; those funds are then frozen until the user is charged for the transaction, guaranteeing that the funds are reserved and reducing collection problems. We will explore a variety of cryptographic mechanisms to verify the negotiation and agreement between parties on a fund transfer limit, including the use of digital signatures [Rabin79], cryptographic checksums [Rabin81, Rivest91], and private key encryption [USNBS77].
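A minimal sketch of this preauthorize-then-charge idea is given below; the class and method names are illustrative assumptions, and the cryptographic verification of the authorization itself is omitted.

    # Sketch of preauthorization: funds are frozen at authorization time and only
    # debited when the transaction settles; the unspent remainder is released.
    # Names and amounts are illustrative.

    class Account:
        def __init__(self, balance):
            self.balance = balance
            self.holds = {}                       # hold_id -> frozen amount

        def preauthorize(self, hold_id, amount):
            if amount > self.balance - sum(self.holds.values()):
                raise ValueError("insufficient uncommitted funds")
            self.holds[hold_id] = amount

        def charge(self, hold_id, amount):
            frozen = self.holds.pop(hold_id)
            if amount > frozen:
                raise ValueError("charge exceeds preauthorized amount")
            self.balance -= amount                # remainder of the hold is released

    acct = Account(balance=20.00)
    acct.preauthorize("txn-1", 5.00)
    acct.charge("txn-1", 3.50)
    print(acct.balance)                           # 16.5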
Privacy raises other concerns. Logging of all transactions raises the specter of a third party tracing the transactions of a given user, or of a given service provider, or transactions satisfying some particular criterion. We will research and build mechanisms that allow for normal billing while providing a maximum amount of privacy to individual users. In many cases, users may wish to request a service anonymously, or to be sure that even the service provider is not tracking the nature of requests.
Libraries in particular have long recognized the importance of keeping patron identity confidential. We will research mechanisms for anonymous billing of service use.
However, we must make sure that these services can be overridden to allow auditing of transactions under a number of circumstances. For example, if a bill is disputed, users may want to trace a transaction. Also, there are many circumstances under US law where a financial transaction must be reported (for example, transactions over $10,000). We will investigate the use of threshold schemes that allow all transactions - even anonymous ones - to be traced if several people authorize the tracing.
Access Control: The NetBill system must be able to specify which classes of users can access which services. For example, in this proposal, there may be a desire to restrict the access afforded to faculty versus students. In general, the access control restrictions will vary with each service provider, and may be fairly complex. Access control may be implemented independently by each service provider, or centrally by the NetBill system for a broad category of services. For example, it should be possible to restrict the access of a student to age appropriate materials.
Because of the highly dynamic nature of intellectual property, it will require research to find the best way to specify and enforce these access control lists. It is important to note that we do not intend to address the confinement problem in this research: we will not be able to stop a user who has legitimate access to some information from forwarding that information to a third party.
We do intend to investigate mechanisms that make it difficult and inconvenient to forward information to unauthorized third parties. We also intend to investigate mechanisms of tagging documents with tracing data that will make it possible to locate the source of a confinement breach.
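As a concrete illustration of the per-class access checks described above, the following sketch evaluates a simple policy table; the service names, user classes, and grade-level rule are illustrative assumptions rather than a proposed policy format.

    # Illustrative per-class access check: each service declares which user
    # classes (and, for students, which grade levels) may retrieve it.
    # The policy layout is an assumption for illustration only.

    policy = {
        "bbc-open-university": {"classes": {"faculty", "student"}},
        "fairfax-field-trips": {"classes": {"faculty", "student"}, "max_grade": 8},
    }

    def may_access(user, service):
        rule = policy.get(service)
        if rule is None or user["class"] not in rule["classes"]:
            return False
        if user["class"] == "student" and "max_grade" in rule:
            return user["grade"] <= rule["max_grade"]
        return True

    print(may_access({"class": "student", "grade": 5}, "fairfax-field-trips"))   # True
    print(may_access({"class": "student", "grade": 11}, "fairfax-field-trips"))  # False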
Account Hierarchies: An important feature of our system is that it allows account hierarchies. This permits a single organization to manage spending by department as well as by user. On the service provider side, we use hierarchical accounts to allow aggregate payments for separate services operated by one administrative entity. This sort of structure provides advantages not only in management, but also in maintenance of availability. New research is required to fully integrate hierarchical accounts on the scale of use that we envision.
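The following short sketch shows how spending posted to a leaf account can roll up through such a hierarchy; the account names and tree structure are placeholders.

    # Sketch of hierarchical accounts: a charge posted to a leaf account rolls up
    # to every ancestor, so an organization can track totals by department.
    from collections import defaultdict

    parents = {"teacher-01": "science-dept", "science-dept": "district"}
    spent = defaultdict(float)

    def post_charge(account, amount):
        while account is not None:
            spent[account] += amount
            account = parents.get(account)

    post_charge("teacher-01", 2.50)
    post_charge("teacher-01", 1.25)
    print(spent["science-dept"], spent["district"])   # 3.75 3.75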
Auditing: There is a fundamental tension between our ability to audit accounts and protecting the privacy of users. Audits will occasionally be necessary: when requested by a user, when required by court order, when required by tax or Treasury authorities, and when security breaches are suspected. In other cases, summary reports must be generated (for economic experiments described later, or to monitor usage). It is also important that the mechanisms for privacy in this system provide maximum protection against individuals who may wish to probe accounts. For example, we want to keep accounts private from a nosy clerk, even if that clerk is working for a NetBill billing service. Basic research is required to find mechanisms that provide privacy but allow auditing under certain circumstances (such as when certain internal thresholds are exceeded, or when several parties approve the auditing). We plan to use basic cryptographic mechanisms such as secret sharing [Shamir79, Herlihy87, Rabin88] and secret counting [Benaloh87, Camp94].
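A compact sketch of (k, n) threshold secret sharing in the spirit of [Shamir79] appears below; the field size and parameters are chosen for readability only, and a production mechanism would require far more careful parameter and randomness choices.

    # Toy (k, n) threshold secret sharing in the spirit of [Shamir79]: any k of
    # the n shares reconstruct the secret (e.g. an audit key); fewer reveal
    # nothing.  Small prime field chosen only for illustration.
    import random

    P = 2**31 - 1                                     # prime modulus

    def split(secret, n, k):
        coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
        return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
                for x in range(1, n + 1)]

    def reconstruct(shares):
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = (num * -xj) % P
                    den = (den * (xi - xj)) % P
            secret = (secret + yi * num * pow(den, -1, P)) % P
        return secret

    shares = split(secret=123456789, n=5, k=3)
    print(reconstruct(shares[:3]))                    # 123456789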
Scalability: The Internet has 15 million users today. Mechanisms that work well for thousands or tens of thousands of users may fail when put on the Internet. Problems of scale will affect every aspect of system design. We must also provide very high availability for billing services. To achieve both high availability and wide scale, we will need to investigate advanced system mechanisms such as delegation of responsibility [Satya93], caches [Gray93, Weihl93], failure tolerant protocols and platforms [Lampson93], and multiprocessor platforms [Mullender93, Accetta86] to build our system.
Pricing raises a number of significant research questions:
The ``site-server'' sits on a local area net with end-user PC workstations. The searchable transcripts and auxiliary indices will exist at the main server and be replicated at each site. This permits the CPU-intensive searches to be performed locally, and media to be served either from the local cache or from the central server. The local user PC workstation can alternately be a buffering display station, a display plus search engine, or the latter plus media cache (approximately 2 gigabytes), depending upon its size and performance class. Caching strategies will be implemented through standard file system implementations: Transarc's Andrew File System (AFS) [Satya85, Howard88, Spector89] and OSF's industry standard Distributed File System (DFS) [DFS91]. Concentration of viewing strongly influences system architecture. Where and how much to cache depend on ``locality of viewing.'' Early in the project we will examine the economics of these architectural tradeoffs, and intend to build a mathematical model that we can use to compare architectures.
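A first cut at such a model is sketched below, comparing the monthly cost of serving everything centrally with caching popular material at the site-server; every parameter value is a placeholder, and the point is only the shape of the tradeoff driven by locality of viewing.

    # First-cut cost model: local cache storage cost versus wide-area traffic
    # cost, as a function of the cache hit rate ("locality of viewing").
    # All parameter values are placeholders, not measurements.

    def monthly_cost(hit_rate, requests_per_month, mbytes_per_request,
                     cache_gbytes, storage_cost_per_gb, wan_cost_per_gb):
        local_storage = cache_gbytes * storage_cost_per_gb
        wan_traffic_gb = (1.0 - hit_rate) * requests_per_month * mbytes_per_request / 1024
        return local_storage + wan_traffic_gb * wan_cost_per_gb

    for hit_rate in (0.0, 0.5, 0.9):
        cost = monthly_cost(hit_rate, requests_per_month=5000, mbytes_per_request=20,
                            cache_gbytes=2, storage_cost_per_gb=1.0, wan_cost_per_gb=0.5)
        print(f"hit rate {hit_rate:.1f}: ${cost:.2f}/month")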
The stringent continuous stream network data requirements typical of video-on-demand systems are relaxed in our library system implementation because (1) most sequences are anticipated to be short (<2 minutes), (2) many will be delivered from the locally networked site-server, and (3) the data display is always performed from the buffer constituted by the user's local disk, typically 1-2 gigabytes in early system deployments. Currently used compression techniques reduce the data requirement to approximately 10 Mbytes/minute of video. The performance assumptions therefore hold well unless very long video sequences are requested. Forthcoming research and commercial file systems structured for delivery of continuous media [Anderson92] and video-on-demand [Rangan92, Vin93] address the problems of achieving sufficient server performance, including the use of disk striping on disk arrays to enable continuous delivery to large numbers of simultaneous viewers of the same material [Schwartz94]. As we shift our data repositories to such higher performance servers and the higher bandwidth network links anticipated over the four years of this proposal, we can correspondingly reduce the on-line secondary storage requirements (and costs) for the end-user nodes.
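The data-rate assumption above translates into the following back-of-the-envelope figures, computed here only to make the buffering argument explicit.

    # Back-of-the-envelope check of the delivery assumptions:
    # ~10 Mbytes per minute of compressed video, clips typically under 2 minutes,
    # buffered on a 1-2 gigabyte local disk.

    mbytes_per_minute = 10
    sustained_mbit_per_s = mbytes_per_minute * 8 / 60        # ~1.33 Mbit/s
    clip_mbytes = 2 * mbytes_per_minute                      # a 2-minute clip
    clips_per_gbyte = (1 * 1024) // clip_mbytes              # clips per gigabyte of cache

    print(f"sustained rate ~{sustained_mbit_per_s:.2f} Mbit/s")
    print(f"2-minute clip ~{clip_mbytes} Mbytes; ~{clips_per_gbyte} such clips per GB of cache")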
We estimate that if all prime time television of the last 40 years (approximately 160,000 hours) were digitized in the same format that we propose using (roughly 600 Mbytes per hour), it would require a 100 terabyte repository. Thus, our 1 terabyte testbed will be sufficiently representative of the commercial environments we foresee, and will demonstrate many of the same operational and performance issues.
A key element of the on-line digital video library is the communication fabric through which media-servers and satellite (user) nodes are interconnected. Traditional modem-based access over voice-grade phone lines is not adequate for this multi-media application. The ideal fabric has the following characteristics. First, communication should be transparent to the user. Special-purpose hardware and software support should be minimized in both server and satellite nodes. Second, communication services must be cost effective, implying that link capability (bandwidth) be scalable to match the needs of a given node. Server nodes, for example, will require the highest bandwidth because they are shared among a number of satellite nodes. Finally, the deployment of a custom communication network must be avoided. The most cost-effective, and timely, solution will build on communication services already available or in field-test. The implementation already begun for this project satisfies these requirements. Currently, in cooperation with Bell Atlantic, we have begun to deploy a telecommuting Wide-Area Network (WAN) ideally suited to the on-line digital video library. This WAN is based on services from Bell that are currently available.
The topology of the WAN we have deployed is shown in Figure 7. The two key elements of the communication fabric are (1) use of Central-Office Local-Area Networks (CO-LANs) to provide unswitched data services to workstations over digital subscriber loop technology and (2) use of a Switched Multi-Megabit Data Service (SMDS) ``cloud'' to interconnect the CO-LANs and high-bandwidth server nodes.
High-bandwidth server nodes are directly connected into the SMDS cloud through a standard 1.17 Mbit/s T1-access line. The SMDS infrastructure provides for higher bandwidth connections (from 4 Mbit/s through 34 Mbit/s) should they be required. Currently, a T1-class SMDS connection is tariffed at a flat $600/month, anywhere within our local (412) area code.
SMDS is a public, packet-switched data service that is in limited service today. It is offered by Bell Atlantic and other telecommunication carriers and supports a range of data applications that depend on high-speed communications. SMDS extends the performance and efficiencies of LANs over a wide area, while offering the economic benefits of a shared service. SMDS is connectionless, meaning that there is no need to set up a connection through the network before sending data. This provides bandwidth on demand for efficient transmission of bursty data. SMDS also provides any-to-any communication: any SMDS client can exchange data with any other SMDS client. Finally, SMDS is protocol independent, permitting any end-to-end protocols (e.g. TCP/IP, OSI, DECnet, Novell IPX, etc.) to be used between connected clients.
CMU will manage the project, conduct the fundamental research, and become the initial networked deployment testbed.
A key feature of the MS program is the requirement for an individual thesis or integrated group project. In 1992 and 1993, groups of 15 and 10 students, respectively, worked full time for four months on the problem of internet billing services under the supervision of faculty from several departments. The result has been two generations of requirements analysis, design, and prototype implementation covering both business and technical issues for an internet billing server.
QED will provide a large library of video resources and pursue follow-on commercial service opportunities through its commercial subsidiary, QED Enterprises.
Winchester will be the initial K-12 testbed site and has agreed to experiment with early prototypes in order to provide feedback on usability issues at various age levels.
The O.U. will provide a large collection of video course material in the math, science and technology disciplines and will deploy the system first for internal use by the faculty, and potentially for remote student use, pending resolution of network accessibility issues.
The greatest societal impact of what we do will most likely be in K-12 education. The Digital Video Library represents a critical step toward an educational future that we can hardly recognize today. Ready access to multimedia resources will bring to the paradigm of ``books, blackboards, and classrooms'' the energy, vitality, and intimacy of ``entertainment'' television and video games. The key, of course, is the access mechanism itself: easy and intuitive to use, powerful and efficient in delivering the desired video clip. The persistent and pervasive impact of such capabilities will revolutionize education as we've known it, making it as engaging and powerful as the television students have come to love.
The greatest commercial impact will be in industrial/commercial training and education, delivered at reduced cost and in less time. We will enable individuals to learn through exploration and examples at varying levels of complexity in an often entertaining, highly visual and auditory information flow.
Our initial project members represent significant testbeds of two important sectors on which we are focused - K-12 and university education. The schools involved represent a diverse socio-economic and intellectual range of students. The Winchester Thurston School and CMU will provide the first testing of the new digital video library system, and play a key role in mapping the new technology into the urban and college classroom. Our studies of usage and motivation with these students will provide invaluable input on how to provide ubiquitous information services across the national information infrastructure (NII). Combined, these environments will provide the requisite span of discipline, reference, and casual users.
QED Enterprises, the commercial division of QED Communications, will be pursuing follow-on commercial licensing opportunities incorporating the systems and technology developed. They will explore use of their own and other past and future video assets as general and special purpose collections of library source materials for the education and training markets. In collaboration with other project affiliates, QED Enterprises will assess the business model for providing continuing video library reference services to local area schools, hospitals, and commercial clients, including requirements at the local sites, local- and metropolitan- area networks for delivery, and centralized video database repositories. This proposed effort provides a unique opportunity for them to achieve new commercial value from their vast libraries while enabling them to fulfill their fundamental education mission.
Bell Atlantic has been actively supporting our prototype networking efforts, including the provision of CMU equipment space and connections within the telephone central office. They have a direct interest in determining requirements and monitoring performance of our applications so that they can cost effectively offer appropriate data services for future commercial installations of digital video library service providers. Both Bell and QED will closely follow our research related to network billing servers and to ensuring data security and privacy.
We have engaged both Digital Equipment and Microsoft Corporation in technology exchange relationships which will provide them access to our developed technology for assessment and potential productization. Equally important, they will provide us early access to their research prototypes and production hardware and software systems for video and multimedia servers and related delivery systems. This will enable us to evaluate and build upon forthcoming commercial infrastructure and industry standards as they become available.