Bulletin of the American Society for Information Science and Technology, Vol. 27, No. 6, August/September 2001


The 2001 Infonortics Search Engine Meeting

by Candy S. Schwartz

Candy S. Schwartz is a professor, GSLIS, Simmons College. She can be reached there by mail at 300 The Fenway, Boston, MA 02115-5898; by phone at 617/521-2849; by fax at 617/521-3192; or by e-mail at candy.schwartz@simmons.edu

About 180 people attended the two-day Infonortics Search Engine Meeting held in Boston, April 9-10, 2001. Harry Collier of Infonortics kept everything running smoothly, and Ev Brenner, as always, was the power behind the podium, keeping the program on schedule and serving as host and general moderator. The title "Search Engines: Diversity and Controversy" was a little odd – I could see the diversity, but I can't honestly say that there was much in the way of controversy. My take on the general threads goes like this:

  • The aspects of search which cause the most problems (volume, heterogeneous data sources, poorly formed queries) are only going to get worse.
  • Many different approaches (frequency-based methods, natural language processing, automatically and intellectually built taxonomies) will be working side-by-side.
  • Search results need to be just-in-time and adjusted to individual work needs and tasks.
  • The enterprise portal is where the money is.

The title given to the first day was "Search Engines Today." Teaching duties kept me away from the morning session, which included the opening keynote address by David Seuss (Northern Light), followed by a panel on search engine developments and trends since the last meeting. Since the presentations are all online (www.infonortics.com/searchengines/sh01/slides-01/sh01pro.html), I can tell you that Seuss focused on the role of search engines as large enterprises try to provide enterprise information portals tailored to the needs of employees. Enterprise portals are a growing market for search engine technology, but successful implementations need to address the difficulties of providing seamless and customizable access to blended external and internal content. Criteria for success include advanced search and categorization capabilities in a milieu of heterogeneous information resources, and Seuss believes that it is easier to scale a Web search engine down to meet these needs than to scale a typical non-Web search engine up.

The panel on developments and trends included Greg Notess (Montana State University-Bozeman and creator of Search Engine Showdown), Chris Sherman (About.com) and Avi Rappaport (SearchTools.com). These reviews of progress are always useful overviews of what's been happening in the world of search engines and also a good source of URLs for further exploration. Notess' talk was titled "Death & Databases" and covered the changes in the commercial search engine landscape as companies come and go, database sizes and structures change, and profits continue to be hard to come by. Sherman reviewed improvements in crawlers (discovery agents), advances in indexing the deep Web, new work in visualization and the use of natural language processing (NLP) to provide answers rather than to improve query processing. Rappaport echoed Seuss' keynote in his remarks about search engine companies looking to the enterprise portal market and identified other trends such as remote search hosting and peer-to-peer search, the increasing importance of search in e-commerce, and improvements in handling diverse alphabets and file formats.

Three presentations rounded out the morning session – Lou Rosenfeld (Argus Center for Information Architecture), Bob Travis and Andrei Broder (AltaVista) and Chris Cardinal (Hummingbird). Rosenfeld presented an information architect's perspective – since search has not solved the information finding problem and since volume will only increase, we need to focus on the entire "finding" process, which includes navigation, organization, labeling and other elements in addition to search.

Travis and Broder highlighted the differences between classic information retrieval and the Web search engine environment and suggested that different types of user needs (informational, navigational and transactional) require different approaches with respect to query processing and relevance ranking.
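As a toy illustration of the distinction (my own sketch in Python, not something presented in the talk), the three intents might be separated with crude heuristics: queries that look like site addresses suggest navigational intent, queries containing action words like "buy" or "download" suggest transactional intent, and the rest default to informational.

```python
# Toy heuristic for informational / navigational / transactional query
# intents. Purely illustrative; the cue words and rules are invented.
import re

TRANSACTIONAL_CUES = {"buy", "download", "order", "purchase", "rent"}

def classify_query(query: str) -> str:
    terms = query.lower().split()
    # A query that looks like a web address suggests the user wants to
    # reach a particular site (navigational intent).
    if re.search(r"\.(com|org|net|edu)\b", query.lower()):
        return "navigational"
    # Action words suggest the user wants to complete a transaction.
    if TRANSACTIONAL_CUES & set(terms):
        return "transactional"
    # Everything else is treated as an informational need.
    return "informational"

if __name__ == "__main__":
    for q in ("altavista.com", "buy digital camera", "history of the telephone"):
        print(q, "->", classify_query(q))
```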

Cardinal addressed a question raised by many at last year's conference – why many commercial search engine companies had stopped participating in TREC. His overview of TREC costs and benefits led to the conclusion that apart from a general lack of knowledge and interest, the costs of participation (in terms of manpower and the need to adapt data) were not sufficiently balanced by the perceived benefits.

Mark Hansen and Elizabeth Shriver (Bell Labs, Lucent Technologies) opened Monday afternoon, describing their work on improving the search experience using information derived from page content, link information, existing content hierarchies and transaction log data on user activities both within and outside the search engine space.

The rest of the afternoon was devoted to a panel on the topic of content architecture, moderated by Susan Feldman (IDC).  Feldman defined content architecture as "a set of techniques for defining and extracting the main concepts, ideas, events and people, places and things from a text document" (Content architecture. Available: http://www.infonortics.com/searchengines/sh01/slides-01/feldman.pdf). These techniques might include categorization and concept mapping, XML tagging and data extraction, all intended to improve information discovery and analysis through processes such as query expansion, visualization, question-answering and text mining.

Each of the panelists then highlighted a specific approach to content architecture. Ashok Chandra (Verity) spoke of the need to combine linguistics, search, taxonomy, personalization and the social networks represented by hyperlinking. Eytan Ruppin (Zapper) described the use of the semantic information in query term context passages to generate new queries, select relevant search engines and re-rank search results. Richard Boulderstone (LookSmart) discussed the value of directories like LookSmart for some types of information needs and described LookSmart's work with metadata. Daniel Lulich (Rulespace) highlighted the importance of categorization for adding structure to unstructured data and described the advantages of "bootstrapping" user-driven categorization by using prefabricated classifiers derived from systems such as Yahoo or the Open Directory.
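To make the "prefabricated classifier" idea concrete, here is a minimal sketch in Python of seeding a categorizer from a directory-style taxonomy; the categories and seed terms are invented for illustration and are not drawn from Yahoo, the Open Directory or Rulespace's product.

```python
# Minimal sketch of "bootstrapping" categorization from a directory-style
# taxonomy: each category starts with seed terms (invented here), and a
# document is assigned to the category whose seeds it overlaps most.

SEED_TERMS = {
    "Sports":  {"score", "league", "coach", "season", "playoff"},
    "Finance": {"stock", "market", "earnings", "fund", "dividend"},
    "Health":  {"patient", "treatment", "clinical", "symptom", "therapy"},
}

def categorize(text: str) -> str:
    words = set(text.lower().split())
    # Score each category by how many of its seed terms appear in the text.
    scores = {cat: len(words & seeds) for cat, seeds in SEED_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Uncategorized"

if __name__ == "__main__":
    print(categorize("The fund posted strong earnings as the market rallied."))
```

In practice such seed classifiers would be refined with user-supplied examples, which is the "user-driven" half of the bootstrapping Lulich described.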

As the panel continued, Raymond Lau (iPhrase) described how iPhrase has experimented with adding structure to relatively unstructured free text by using data mining to extract information into a database and also using NLP to derive index terms from semantic concepts. At Fast, according to Tom Wilde, the focus is on the crawler and on preprocessing data – crawler throughput has to be increased as the Web grows, and intelligence will need to be added at the query level to disambiguate single word queries (by, for example, asking questions and looking at individual and collective past behaviors).  The last panelist of the day, Horst Koerner (consultant to Seruba), described the content architecture for a 4-language ontological "lexicosaurus" being developed by teams of subject experts as a query improvement tool.

The working title for Tuesday was "The New Frontier," and the day began with an overview by Eric Brewer (Inktomi). He called on the search community to define search in terms of users rather than collections – individuals should be able to search across the arrays of collections that form personal networks. With the growth of Web resources, portals and the invisible Web, it is essential to forge relationships with content publishers (of all media types) so that content structure and metadata can be leveraged to replace inefficient crawling for content mining.

Most of the rest of the day's speakers showed examples of what the future might hold. Bernard Normier (LexiQuest) described work with Lexisez, a customizable NLP application used for a range of purposes in connection with question-answering systems, including best sentence extraction, summarization and cross-language search. David Evans (Clairvoyance) talked about the importance of moving beyond syntactic and semantic processing to include affect (feelings, emotions, attitudes, and so on). As one example, an affect lexicon (in which words are scored for their values on dimensions to do with affect and intensity) has been used to profile movie genres.
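As a rough sketch of how such an affect lexicon might be applied (the dimensions, entries and weights below are invented for illustration and are not Clairvoyance's actual resource), each word carries scores on a few affect dimensions plus an intensity weight, and a text is profiled by the intensity-weighted average of the lexicon words it contains.

```python
# Illustrative affect-lexicon scoring: each word has values on a few affect
# dimensions plus an intensity weight; a text is profiled by the weighted
# average over the words found in the lexicon. Entries are invented.

AFFECT_LEXICON = {
    #  word        (fear, joy, anger, intensity)
    "terrifying": (0.9, 0.0, 0.2, 0.8),
    "delightful": (0.0, 0.9, 0.0, 0.6),
    "furious":    (0.1, 0.0, 0.9, 0.9),
    "pleasant":   (0.0, 0.6, 0.0, 0.3),
}

def affect_profile(text: str) -> dict:
    totals = {"fear": 0.0, "joy": 0.0, "anger": 0.0}
    weight = 0.0
    for word in text.lower().split():
        if word in AFFECT_LEXICON:
            fear, joy, anger, intensity = AFFECT_LEXICON[word]
            totals["fear"] += fear * intensity
            totals["joy"] += joy * intensity
            totals["anger"] += anger * intensity
            weight += intensity
    # Normalize by total intensity so long and short texts are comparable.
    return {k: (v / weight if weight else 0.0) for k, v in totals.items()}

if __name__ == "__main__":
    print(affect_profile("a terrifying yet delightful thriller"))
```

Averaging such profiles over the texts associated with each movie genre would yield the kind of genre fingerprints Evans described.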

Getting back to the business side of things, Steve Arnold (AIT) reviewed recent shifts in vertical search engine funding, from general business-to-business to much more specific and task-oriented niche markets and applications. He sees trends toward "embedded" search (where search is tightly integrated into work tools), more attention to combining information from many heterogeneous sources and ubiquitous search (i.e., wired and wireless). John Snyder (Webtop.com) provided an example of breaking the keyboard barrier with Webtop, which currently works through highlight and click, but which Snyder hopes to take to voice interaction. He provided an excellent review of the current state of speech recognition and noted that the market for this kind of technology is expected to explode.

Matt Koll reminded the audience that the invisible Web is not a new concept, is much larger than most think and contains resources that are both controlled in distribution and of higher quality than public Web resources. Distributed searching, which can bring the invisible Web forward, requires different types of knowledge in order to accommodate merging and ranking. Examples of areas that have an impact on massively distributed search include

  • resource discovery capability (which databases to connect to);
  • meta-information (rich content abstraction and functional details);
  • authentication, trust and credibility;
  • centralized control (or not); and
  • performance – some form of clustering will be needed, since search term frequency will be even worse than it is in normal engines, and centralized systems probably won't scale.
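To make the merging-and-ranking problem concrete, here is a minimal sketch in Python (the sources, scores and min-max normalization scheme are my own assumptions, not anything Koll presented) of combining result lists whose relevance scores come from different engines and are therefore not directly comparable.

```python
# Minimal sketch of merging results from several distributed sources.
# Each source returns (doc_id, score) pairs on its own scale, so scores
# are min-max normalized per source before the lists are combined.

def normalize(results):
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [(doc, (s - lo) / span) for doc, s in results]

def merge(source_results, limit=10):
    merged = []
    for source, results in source_results.items():
        for doc, score in normalize(results):
            merged.append((score, source, doc))
    # Rank all documents by normalized score, best first.
    merged.sort(reverse=True)
    return merged[:limit]

if __name__ == "__main__":
    sources = {
        "engine_a": [("docA1", 12.0), ("docA2", 7.5)],
        "engine_b": [("docB1", 0.91), ("docB2", 0.40)],
    }
    for score, source, doc in merge(sources):
        print(f"{doc} ({source}): {score:.2f}")
```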

Craig Silverstein (Google) continued the look into the future with his speculation as to what the world of search would look like some years from now. His vision included ubiquity (including throwaway computers), sophisticated query processing, limited semantic analysis, voice recognition, automatic metadata generation (promoting cross-collection search), visualization, customized results pages more attuned to context and use, and a political information gap. Gregory Grefenstette (Xerox Research Centre Europe) picked up on a variation of the information gap theme with a discussion of the growth of non-English websites and the barriers faced by users who do not have English as a first language. In the context of global e-commerce applications, he discussed methods for estimating language presence and for cross-language search.

The conference closed with a panel on "the secularization of search," introduced by David Evans with a reminder that search tools which a decade ago were known only to a few are now used regularly by millions. We need to think about how a better understanding of search can become common knowledge among the general populace. Joshua Arai and Keiichi Kitagawa (Justsystem) described a collaborative model among teachers and students in a Japanese high school system, centered on a database of course materials, texts and newspapers. They found a general lack of awareness of information retrieval in education; although students and teachers were enthusiastic about the project, know-how tended to reside in individuals rather than being integrated into training and curricula. William Hersh (Oregon Health Sciences University) looked at health care as an example of a field where improving search understanding is vital, as individuals are beginning to take more active roles in seeking out personal health care information, and practitioners are increasingly inundated with information. In this context, quality and accuracy become especially important, metadata has a large role to play, and search needs to become part of general health information literacy.

As usual, this was a good opportunity to catch up on the world of search engines, with a nice mix of research and industry updates. Next year the conference moves to San Francisco, April 15-16.


Copyright 2001, American Society for Information Science and Technology