Bulletin of The American Society for Information Science

Vol. 26, No. 5

June/July 2000

User Interface Design for Speech-Based Retrieval

by Douglas W. Oard

Users now routinely search massive collections of electronic text. Interactive full-text searching has been embraced for both highly structured applications, such as electronic document delivery services, and highly distributed environments, such as the World Wide Web. The world's stock of electronic text is growing rapidly, but enormous quantities of information are now becoming available in other modalities as well. For example, in mid-February of this year, real.com identified over 1,500 Internet audio broadcasters.

Searching audio collections presents some unique challenges, but it also presents some unique opportunities that would be difficult to exploit in electronic text. In this article we briefly review the process of searching electronic text collections and then describe how the process can be adapted to support searching large audio collections based on speech contained in those collections.

The Search Process

The four stages shown in Figure 1 are present to some degree in every full-text interactive search process. In the query formulation stage, users interact with the system to express their information need. The user's goal in this stage is to craft an expression of the information need - a query - that the system can use to produce a useful search result. Most commonly, the query takes the form of either a Boolean or a "natural language" expression. In the sorting stage, the system reorders the documents in an effort to put the more promising documents ahead of less promising ones. In natural language systems this is sometimes referred to as "relevance ranking"; in Boolean systems it typically equates to placing documents into one of two sets. Machines are able to search through very large collections quickly - particularly if information about the documents can be organized into an easily searched index structure in advance.

The speed of the machine is what makes it possible to search large collections quickly, but it is the synergy with the sophistication of the user on which the effectiveness of interactive searching ultimately depends. Humans bring sophisticated pattern recognition, abstraction and inference skills to the search process, but the number of documents to which those skills can usefully be applied is limited. The goal of the selection stage is to allow the user to discover efficiently the most promising documents from among those ranked highly by the system by examining indicative summaries - summaries that are designed to support selection decisions. The indicative summaries are generally quite terse because brief summaries can be scanned efficiently. The summaries typically contain a title, some information about the source and currency of the document, and perhaps some application-dependent information such as file size or a few extracted keywords. Because summaries may not provide enough information to support a final selection decision, full-text search systems also provide users with the ability to examine individual documents. Direct use of the document may also result from this examination process, or a separate document delivery stage may be required (for example, the document might be printed before being read). But our focus here is on examination for the purpose of making the final selection decision. The backward links in Figure 1 illustrate the iterative nature of the search process. Manual feedback (query reformulation) is always possible, and some systems also support automated techniques (relevance feedback).
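
To make the sorting and selection stages concrete, the sketch below ranks a tiny document collection with a simple TF-IDF-style score and prints a terse indicative summary for each result. The scoring scheme, field names and example data are illustrative only; real systems use far more sophisticated ranking methods.

import math
from collections import Counter

def rank(query_terms, documents):
    """Order documents by a simple TF-IDF-style score for the query terms."""
    n = len(documents)
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in documents if t in d["text"].lower().split())
          for t in query_terms}
    scored = []
    for doc in documents:
        tf = Counter(doc["text"].lower().split())
        score = sum(tf[t] * math.log((n + 1) / (df[t] + 1)) for t in query_terms)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

def indicative_summary(doc):
    """A terse summary to support the selection decision: title, source and date."""
    return "%s (%s, %s)" % (doc["title"], doc["source"], doc["date"])

docs = [
    {"title": "Budget hearing", "source": "C-SPAN", "date": "2000-02-14",
     "text": "The committee debated the budget resolution at length."},
    {"title": "Evening news", "source": "WXYZ", "date": "2000-02-15",
     "text": "Markets rose sharply and the budget was barely mentioned."},
]
for doc in rank(["budget", "committee"], docs):
    print(indicative_summary(doc))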

It should be noted here that what we have called the "search process" is only one of many ways in which users can seek information. Alternatives range from straightforward variations on the process, such as replacing the ranked list with a spatial visualization of the document collection, to radical departures from the process we have outlined, such as asking a colleague for advice. But the process described above has been applied with considerable success to electronic text. Our purpose in this article is to consider how the same process can be applied to search audio collections based on speech that is present in those collections.

Speech-Based Retrieval

Perhaps the simplest approach to searching audio would be to segment the audio stream into stories, transcribe each story as electronic text and then search the electronic text using the process described above. Such an approach introduces several challenges. Automatic story segmentation is imperfect, automatic transcription is reliable only under favorable conditions, selection is complicated by the lack of preexisting titles for automatically segmented stories, and examining an audio clip can take far longer than examining an equivalent amount of text in written form. Nevertheless, systems that adopt such a straightforward adaptation of techniques originally developed to search text can achieve a level of performance that is useful for some purposes - the Compaq SpeechBot (http://speechbot.research.compaq.com) is one example of such a system.
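
The toy sketch below illustrates that pipeline end to end. The story segmenter and speech recognizer are simulated here (each "story" already carries a timestamped, possibly errorful transcript); in a real system those two stand-in functions would be replaced by actual segmentation and recognition components.

def segment_stories(stream):
    """Stand-in segmenter: assume the stream is already split into story records."""
    return stream

def transcribe(story):
    """Stand-in recognizer: assume each story already carries its (errorful) transcript."""
    return story["transcript"]

def build_index(stream):
    """Index each automatically segmented story by the words in its transcript."""
    index = {}
    for story_id, story in enumerate(segment_stories(stream)):
        for word in transcribe(story).lower().split():
            index.setdefault(word, set()).add(story_id)
    return index

def search(index, stream, query):
    """Return (start, end) pointers into the audio for stories matching all query words."""
    hits = set.intersection(*(index.get(w.lower(), set()) for w in query.split()))
    return [(stream[i]["start"], stream[i]["end"]) for i in sorted(hits)]

stream = [
    {"start": 0.0, "end": 93.5, "transcript": "the senate passed the budget bill today"},
    {"start": 93.5, "end": 180.0, "transcript": "in sports the home team one again"},  # recognition error
]
index = build_index(stream)
print(search(index, stream, "budget bill"))   # -> [(0.0, 93.5)]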

Automated support for searching electronic text evolved over the course of several decades, from Vannevar Bush's vision of a "Memex" through H. P. Luhn's more refined vision of how specific technologies might be applied to the task. Then came large-scale systems based on manual indexing and, as computer hardware grew faster, the introduction of automated indexing based on free text.

Now, declining storage costs have made it possible to examine resources online, and the breadth of end-user search experience is growing, thanks to widespread use of the Web.

At each step along the way, developers have sought to match the capabilities of their systems with the characteristics of the electronic texts that users wish to search. It should, therefore, come as little surprise that the resulting systems are optimized for electronic text and not for speech-based indexing of audio. What is needed is a view of the search process that is designed around two factors: what speech processing machines can do well and what use people can make of their output in the search process.

Speech is in many ways a richer form of communication than written text. In speech, for example, listeners can sometimes determine the speaker's identity based on the sound of the voice, and machines are able to attain a comparable level of performance at that task. The same task with written text is far more difficult for both people and machines. Authorship information is often important to searchers, so author metadata is often needed to support searching in collections of electronic text. Such support may be less crucial when searching speech. Another difference is that turn-taking among multiple speakers is common in speech - similar phenomena are far rarer in written text. Turn-taking patterns can reveal something about the purpose of the speech. Consider, for example, the different patterns in a news program and in the last classroom training session that you attended. Would you be able to distinguish the two using turn-taking patterns? Could you identify the person filling the role of teacher or news anchor based on the turn-taking patterns alone?
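
As a toy illustration of that last question, the share of total talk time held by the single most active speaker already separates a lecture-like recording from a multi-speaker program reasonably well. The 0.8 threshold and the synthetic turn data below are assumptions chosen for illustration, not measurements.

from collections import defaultdict

def dominant_share(turns):
    """turns: list of (speaker, duration_seconds). Return the largest share of talk time."""
    time = defaultdict(float)
    for speaker, duration in turns:
        time[speaker] += duration
    return max(time.values()) / sum(time.values())

lecture = [("teacher", 540), ("student_1", 20), ("teacher", 600), ("student_2", 15)]
newscast = [("anchor", 60), ("reporter_1", 90), ("anchor", 30), ("reporter_2", 85)]
for name, turns in [("lecture", lecture), ("newscast", newscast)]:
    share = dominant_share(turns)
    print(name, round(share, 2), "lecture-like" if share > 0.8 else "multi-speaker")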

The list of speech features that might usefully support searching includes the words that were spoken, who spoke those words, when they were spoken, and the way in which they were spoken (e.g., language, accent, speaking rate and prosody). These features can be further processed to infer things that are not directly observable. For example, differences in stress or fatigue from one speech act to another by the same speaker can sometimes be detected automatically. Related characteristics such as detecting which participants were calling in by telephone or noting the presence of foreground or background music during some portions of a recording might also prove useful in some cases. In the remainder of this section we suggest some ways that these features might be used to construct interfaces that support the query formulation, selection and examination processes.
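
As a concrete (if simplified) illustration, these features might be recorded for each automatically detected segment along the following lines; the field names and types are assumptions rather than a description of any deployed system.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpeechSegment:
    start: float                      # seconds from the start of the recording
    end: float
    words: List[str]                  # recognizer output, possibly errorful
    speaker_id: Optional[str]         # identified speaker, if known to the system
    language: Optional[str] = None
    speaking_rate: Optional[float] = None   # e.g., words per minute
    telephone: bool = False                 # detected narrowband (call-in) audio
    background_music: bool = False

segment = SpeechSegment(start=12.4, end=19.8,
                        words=["the", "committee", "will", "come", "to", "order"],
                        speaker_id="speaker_07", language="en", speaking_rate=148.0)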

We envision rich interactive query formulation interfaces that offer users the ability to specify search terms, desired speakers, metadata such as program source and date, and genres such as interview or lecture. Search terms will likely be specified in the usual ways: typed by the user, selected from one or more displayed transcripts using a pointing device, or derived through some more automatic form of relevance feedback. Speakers that are known to the system might be selected from a name authority file using dynamic queries or selection from a list. Alternatively, searchers might designate desired speakers by pointing to a timeline on which speaker changes are marked.
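
One simple way such a query might be applied, sketched below, is as a conjunction of constraints over indexed records; the record layout, field names and matching rules are illustrative assumptions.

def matches(record, terms, speakers=None, source=None, genres=None):
    """True if an indexed speech record satisfies all specified constraints."""
    text = record["transcript"].lower()
    if not all(t.lower() in text for t in terms):
        return False
    if speakers and not set(speakers) & set(record["speakers"]):
        return False
    if source and record["source"] != source:
        return False
    if genres and record["genre"] not in genres:
        return False
    return True

records = [
    {"transcript": "we turn now to our guest for an interview about the budget",
     "speakers": ["anchor", "senator_smith"], "source": "WXYZ", "genre": "interview"},
]
hits = [r for r in records
        if matches(r, ["budget"], speakers=["senator_smith"], genres=["interview"])]
print(len(hits))   # -> 1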

This ability to designate speakers whose names are not known in advance offers intriguing possibilities for searches that first use term-based searching to find speakers that address a topic and then probe more deeply using speaker-based searching. If a set of understandable and reliably detectable genres can be identified, precision might be improved by culling out uninteresting types of recordings. Some recordings may include more than one genre, but such cases can likely be handled using straightforward adaptations of passage retrieval techniques.

So-called "natural language" retrieval systems depend heavily on robust selection interfaces to compensate for their lack of Boolean and proximity operators. Experience with the design of systems for retrieval of electronic text has shown that this is often a good tradeoff because many searchers lack the skills needed to get good results from highly structured query languages. The design of interactive selection interfaces for speech-based retrieval poses interesting challenges, however, and there is presently a dearth of useful examples from which developers can learn. Generally, research prototypes have provided the user with a ranked list of programs in which only the date and time of the program and either the program title or the source (e.g., broadcast network) are shown. If we wish to support effective natural language searching, we will probably need to provide the user with a far richer view of the search results. The Informedia project at Carnegie Mellon University has pointed the way toward richer summaries by extracting a salient phrase that is intended to be evocative of the content. Similarly, MITRE's Broadcast News Navigator project has explored the use of named entity extraction and automatic classification to associate proper names and controlled vocabulary keywords with a speech recognition transcript. At the University of Maryland we are exploring the utility of compact timelines in which search terms, speaker changes and the gender of each speaker are depicted as a way of further enriching the selection interface.
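
The fragment below sketches one way such a compact timeline might be rendered in plain text, marking each time slice with the gender of the current speaker and overlaying query-term hits. The rendering scheme is invented here for illustration and is not a description of the Maryland interface.

def timeline(duration, speaker_turns, term_hits, width=40):
    """speaker_turns: list of (start, end, gender); term_hits: list of times in seconds.
    Returns one character per time slice: 'f'/'M' for speaker gender, '*' for a term hit."""
    slots = ["." for _ in range(width)]
    for start, end, gender in speaker_turns:
        lo = int(start / duration * width)
        hi = max(lo + 1, int(end / duration * width))
        for i in range(lo, min(hi, width)):
            slots[i] = "M" if gender == "male" else "f"
    for t in term_hits:
        slots[min(int(t / duration * width), width - 1)] = "*"
    return "".join(slots)

print(timeline(600,
               [(0, 240, "female"), (240, 600, "male")],
               term_hits=[30, 350, 360]))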

One factor that complicates selection interface design is that the appropriate temporal extent for the items being searched may be difficult to determine. In some applications, natural units can be found. For example, story boundaries can often be detected fairly reliably in news broadcasts. In other applications, such as the electronic archive of C-SPAN public affairs programming being developed at Northwestern University, it may be necessary to rely on subtler topic shifts such as those detected by BBN's OASIS system. Many present research systems finesse this problem by dividing programs into arbitrary fixed length segments, but this is almost certainly a suboptimal solution from the perspective of user interface design.
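
For reference, the arbitrary fixed-length segmentation that many research systems fall back on amounts to nothing more than the overlapping windowing sketched below; the 30-second window and 50 percent overlap are illustrative parameters.

def fixed_windows(duration, window=30.0, overlap=15.0):
    """Yield (start, end) segments covering the recording, overlapping so that a
    relevant passage is less likely to be split across a hard boundary."""
    start = 0.0
    while start < duration:
        yield (start, min(start + window, duration))
        start += window - overlap

print(list(fixed_windows(100.0)))
# [(0.0, 30.0), (15.0, 45.0), ..., (90.0, 100.0)]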

Direct examination of recorded speech is time consuming, and there is a clear trade-off between fidelity and speed in the examination process. It is thus natural to provide the user with multiple representations. A timeline display can be used to provide an overview, as in the AT&T Research SCAN system. Speech recognition transcripts can provide a complementary view of the content, while highlighting terms or passages that are closely associated with the query can help to focus the reader's attention despite recognition errors that may be present. Automatically detected speaker changes can provide natural start and end points for audio replay that the user might select using either the timeline or the transcript display. If accelerated replay is desired, comprehensible replay three times faster than real time has been reported using perceptually motivated time compression techniques. Alternatively, automatic audio abstracts could be constructed using techniques similar to the MIT Speech Skimmer. It is important to point out that the adequacy of these techniques should not be judged by how well they support the decision to choose one recording over another. Supporting the ultimate uses to which retrieved audio might be put could require entirely different techniques, but that task is outside the scope of this article.
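
As a small sketch of using detected speaker changes to bound replay, the function below returns the start and end of the speaker turn that contains a query-term hit; times are in seconds and the example boundary list is invented.

import bisect

def replay_extent(hit_time, speaker_changes, duration):
    """Return (start, end) bounded by the nearest speaker changes around hit_time."""
    times = sorted(speaker_changes)
    i = bisect.bisect_right(times, hit_time)
    start = times[i - 1] if i > 0 else 0.0
    end = times[i] if i < len(times) else duration
    return start, end

print(replay_extent(130.0, [45.0, 112.0, 190.0, 260.0], duration=300.0))
# -> (112.0, 190.0)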

Conclusion

It is now quite practical to apply speech-based retrieval techniques to collections containing several thousand hours of recorded audio, and Moore's Law assures us that those numbers will grow rapidly over the next few years. If we are to make the best use of these emerging capabilities, we will need to devote increased attention to the design of user interfaces that support effective search strategies. This will undoubtedly be an iterative process, since we cannot hope to understand which search strategies will be most effective until we have a rich set of user interfaces with which to experiment. Priming the pump requires an appreciation for what has worked well in other cases, what new challenges we face, and what new opportunities speech-based retrieval offers. With this as background, we should be well positioned to explore this new horizon.

For More Information

Links to many of the projects described in this article can be found at

http://www.clis.umd.edu/dlrg/speech/

Douglas W. Oard is with the College of Library and Information Services at the University of Maryland, College Park, MD 20742. He can be reached by e-mail at oard@glue.umd.edu



© 2000, American Society for Information Science