
Bulletin of The American Society for Information Science

Vol. 26, No. 2

December / January 2000



Annual Meeting Coverage

Track 1:
Knowledge Discovery, Capture and Creation

by Linda C. Smith

In 1972 the ASIS Annual Meeting in Washington, DC, had the theme A World of Information; online systems were in their infancy, but there was a vision of the possibilities of global information systems. In 1999 the meeting was again in Washington, this time with the theme Knowledge: Creation, Organization and Use. Papers in the track on Knowledge Discovery, Capture and Creation sought to go beyond discussions of information organization and retrieval to consider how new knowledge might be created (as suggested by Roy Davies in "The Creation of New Knowledge by Information Retrieval and Classification," Journal of Documentation, 45(4):273-301, 1989). The focus on knowledge discovery is timely: it is the theme of the November 1999 issue of Communications of the ACM, and the Summer 1999 issue of Library Trends (edited by ASIS members Jian Qin and M. Jay Norton) deals specifically with knowledge discovery in bibliographic databases.


What does knowledge discovery mean in the context of information systems? Traditionally the emphasis has been on an individual's role in gathering information and creating new knowledge. Anyone who does literature-based research can appreciate the aphorism of Georg Christoph Lichtenberg, an 18th-century German physicist better known for his wit: Lesen heisst borgen, daraus erfinden abtragen [To read means to borrow; to create out of one's readings is paying off one's debts]. While information retrieval systems may be used to locate relevant documents, researchers carry out the subsequent knowledge discovery process on their own. The field of artificial intelligence, where discussion of knowledge discovery dates to the 1980s, offers a different vision for the future. Edward Feigenbaum ("Toward the Library of the Future," Long Range Planning, 22(1):122, 1989) contrasts the libraries of today ("warehouses of passive objects" where "books and journals sit on shelves waiting for us to use our intelligence to find them, to interpret them, and cause them finally to divulge their stored knowledge") with a library of the future where "books" would interact and collaborate with the user. This knowledge system would involve an intelligent computer agent interacting with one or more people. Ten years later, work on knowledge discovery, capture and creation seeks to find new ways to foster such human-computer and human-human collaboration.

Data, Information, Knowledge

There is a large body of literature in information science as well as in other fields (such as philosophy) that seeks to define and differentiate among the concepts of data, information and knowledge. While that discussion is outside the scope of this article, it is still helpful to introduce two different categorizations of knowledge. These can serve as a framework for considering what types of knowledge are found through knowledge discovery techniques, and what is omitted. Volume 1 of Fritz Machlup's Knowledge: Its Creation, Distribution and Economic Significance (Princeton University Press, 1980) distinguishes five categories: practical knowledge (useful in the knower's work, decisions and actions); intellectual knowledge (satisfying intellectual curiosity); small-talk and pastime knowledge (satisfying nonintellectual curiosity or the desire for light entertainment); spiritual knowledge; and unwanted knowledge (outside one's interests, usually accidentally acquired, aimlessly retained). In the context of knowledge management, Karl M. Wiig distinguishes among three types of knowledge: public knowledge (explicit, taught and shared routinely, and generally available in the public domain); shared expertise (proprietary knowledge assets; held by knowledge workers and shared in their work or embedded in technology and other proprietary manifestations); and personal knowledge (exists tacitly in people's minds). Because terms such as data, information and knowledge are not used consistently, it is important to look beyond the terms and determine what is actually being analyzed and synthesized by the various techniques described briefly in the remainder of this article.

Text as Raw Material

Several techniques use texts/documents as the raw material for knowledge discovery. Meta-analysis is a statistical procedure for integrating results of independent studies that are combinable, often to gain greater confidence in the outcome of investigations such as randomized clinical trials in medicine. Bibliometrics is the application of mathematics and statistical methods to various forms of publications. One commonly used approach is citation analysis, made easier with the availability of citation indexes in electronic form. Citation counts may look for patterns in who is cited, the age of the literature cited or the types of literature cited. Bibliographic coupling examines linkages among documents in terms of the cited references that they hold in common. Co-citation analysis considers joint citation of earlier works and has been used to discover the intellectual structure of science and scholarship by clustering and mapping. Visualization is an important aid in knowledge discovery, as illustrated by the work of Howard White and Katherine McCain ("Visualizing a Discipline: An Author Co-Citation Analysis of Information Science, 1972-1995," Journal of the American Society for Information Science, 49(4):327-355, 1998).
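
The difference between bibliographic coupling and co-citation is easiest to see on a small example. The following Python sketch is illustrative only: the document names and reference lists are invented toy data, and the function names are hypothetical rather than drawn from any bibliometric package. Coupling is counted between pairs of citing documents; co-citation is counted between pairs of cited works.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each citing document maps to the set of works it cites.
references = {
    "doc1": {"A", "B", "C"},
    "doc2": {"B", "C", "D"},
    "doc3": {"A", "C"},
}

def bibliographic_coupling(refs):
    """Count the cited references shared by each pair of citing documents."""
    strength = {}
    for d1, d2 in combinations(sorted(refs), 2):
        shared = refs[d1] & refs[d2]
        if shared:
            strength[(d1, d2)] = len(shared)
    return strength

def cocitation_counts(refs):
    """Count how often each pair of earlier works is cited together."""
    counts = Counter()
    for cited in refs.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

coupling = bibliographic_coupling(references)
cocited = cocitation_counts(references)
```

In this toy data doc1 and doc2 are coupled through two shared references (B and C), while works A and C are co-cited twice (by doc1 and doc3). A matrix of such pair counts is the kind of input that clustering and mapping would then work from.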

In contrast to the various techniques exploiting citations, co-word analysis is based on the co-occurrence frequency of pairs of words or phrases in texts. It has been used to discover linkages among subjects in a research field and thus to trace the development of science. Text mining offers possibilities for creating knowledge out of the massive amounts of unstructured information available on the Internet and corporate intranets. This approach uses techniques from data mining, machine learning, information retrieval, natural language understanding, case-based reasoning, statistics and knowledge management to help people gain new insights from large quantities of text. Information extraction involves more focused processing of text through lexical preprocessing, parsing and semantic analysis, and discourse interpretation. The task is to extract information about a pre-specified set of entities, relations or events from natural language texts, such as extracting details of events from news stories. Finally, the search for undiscovered public knowledge, as pioneered by Don Swanson of the University of Chicago, seeks to use bibliographic databases to discover previously unknown causal connections. The process is based on identifying two literatures that are not co-cited and that do not cite each other, but that are implicitly related by the logic of their respective arguments.
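
The logic of searching for an implicit link between two literatures can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Swanson's actual procedure: it uses shared index terms as a stand-in for the "not co-cited, do not cite each other" test, the function name is hypothetical, and the term sets are invented toy data that only loosely echo his well-known fish oil and Raynaud's syndrome example.

```python
# Hypothetical toy data: each article is represented by its set of index terms.
articles = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"raynaud's syndrome", "blood viscosity"},
    {"raynaud's syndrome", "vasoconstriction"},
]

def linking_terms(articles, term_a, term_c):
    """Return terms that appear with term_a in some articles and with
    term_c in others, provided no article mentions both directly."""
    with_a = [terms for terms in articles if term_a in terms]
    with_c = [terms for terms in articles if term_c in terms]
    # If any article already mentions both, the connection is explicit,
    # so there is nothing "undiscovered" to report.
    if any(term_c in terms for terms in with_a):
        return set()
    neighbours_of_a = set().union(*with_a) - {term_a}
    neighbours_of_c = set().union(*with_c) - {term_c}
    return neighbours_of_a & neighbours_of_c

candidates = linking_terms(articles, "fish oil", "raynaud's syndrome")
```

Here the only intermediate term shared by the two otherwise disconnected sets of articles is "blood viscosity". Such a term is merely a candidate bridge; judging whether the two arguments really connect remains the researcher's task.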

Data as Raw Material

Data mining or knowledge discovery in databases (KDD) involves manipulation of data from structured databases. A variety of methods are used to evaluate data for relevant relationships that could yield new knowledge. The intent is to find valid, novel, potentially useful and ultimately understandable patterns in data. Goals of data mining can include prediction and description. Prediction makes use of existing variables in the database in order to predict unknown or future values of interest. Description focuses on finding patterns in the data for subsequent presentation for user interpretation. The work draws on machine learning, pattern recognition, statistics and visualization techniques.
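
The two goals can be contrasted with a minimal Python sketch. Everything here is an assumption made for illustration: the records are invented, the field names are hypothetical, description is represented by the support and confidence of a simple rule, and prediction by the most common value among matching records. Real data mining tools use far richer methods.

```python
from collections import Counter

# Hypothetical toy table: each row is one record from a structured database.
records = [
    {"region": "north", "plan": "basic", "renewed": True},
    {"region": "north", "plan": "basic", "renewed": True},
    {"region": "north", "plan": "premium", "renewed": False},
    {"region": "south", "plan": "basic", "renewed": False},
    {"region": "south", "plan": "premium", "renewed": False},
]

def matches(row, criteria):
    """True if the row has every field value listed in criteria."""
    return all(row[key] == value for key, value in criteria.items())

def describe_rule(rows, condition, outcome):
    """Description: support and confidence of the rule condition -> outcome."""
    with_condition = [r for r in rows if matches(r, condition)]
    with_both = [r for r in with_condition if matches(r, outcome)]
    support = len(with_both) / len(rows)
    confidence = len(with_both) / len(with_condition) if with_condition else 0.0
    return support, confidence

def predict(rows, known, target):
    """Prediction: most common target value among rows matching the known fields."""
    values = [r[target] for r in rows if matches(r, known)]
    if not values:
        return None  # no comparable records, so no prediction is offered
    return Counter(values).most_common(1)[0][0]

support, confidence = describe_rule(records, {"region": "north"}, {"renewed": True})
guess = predict(records, {"region": "south"}, "renewed")
```

The first call summarizes a pattern for a person to interpret (two of the five records are northern renewals, and two of the three northern records renewed); the second uses existing variables to estimate an unknown value for a new southern record.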

Data warehouses help set the stage for data mining. They involve selection, assembly and structuring of data from disparate sources. This may require data cleaning to check for errors or missing data.
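
As a rough illustration of what such cleaning involves, the Python sketch below assembles rows from two hypothetical sources, maps them onto one schema, and sets aside rows with missing or implausible values. The field names, the sample rows and the 0 to 130 age range are all assumptions made for the example.

```python
# Hypothetical rows from two sources that name their fields differently.
source_a = [{"id": 1, "age": "34"}, {"id": 2, "age": ""}]
source_b = [{"ID": 3, "Age": "-5"}, {"ID": 4, "Age": "51"}]

def normalize(row):
    """Map source-specific field names onto one lowercase schema."""
    return {key.lower(): value for key, value in row.items()}

def clean(rows):
    """Split rows into usable records and (row, reason) problem reports."""
    good, problems = [], []
    for row in map(normalize, rows):
        raw_age = str(row.get("age", "")).strip()
        if not raw_age:
            problems.append((row, "missing age"))
            continue
        try:
            age = int(raw_age)
        except ValueError:
            problems.append((row, "age is not a number"))
            continue
        if not 0 <= age <= 130:  # assumed plausibility range
            problems.append((row, "age out of range"))
            continue
        good.append({"id": row["id"], "age": age})
    return good, problems

warehouse, rejected = clean(source_a + source_b)
```

Keeping the rejected rows together with a reason, rather than silently dropping them, lets someone review and correct the source data before mining begins.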

Knowledge Capture

In the artificial intelligence tradition, an expert system incorporates know-how gathered from experts and is designed to perform as human experts do. Developers of expert systems use various techniques for knowledge acquisition including interviewing, protocol analysis (asking the person to talk aloud while performing a task), questionnaires and surveys, and observation and simulation. Knowledge management in business settings is likewise concerned with knowledge capture, finding ways to make tacit knowledge explicit (e.g., documenting best practices) or creating expert directories to foster knowledge sharing through human-human collaboration.

Technology Assessment

Any discussion of techniques for knowledge discovery should take into account social, ethical and legal issues raised by the application of such techniques. For example, data mining may undermine personal privacy guidelines (O'Leary, D. E. et al. "Some Privacy Issues in Knowledge Discovery: The OECD Personal Privacy Guidelines," IEEE Expert 10(2):48-59, April 1995). Several countries have generated principles to protect individuals from the potential invasion of privacy posed by data collection and retrieval.

Laws and regulations may therefore constrain certain types of knowledge discovery, such as extraction and manipulation of data from medical records.


This article has briefly outlined the range of techniques being used to support knowledge discovery. The Annual Review of Information Science and Technology offers a useful starting point for learning more, with chapters on "Visualization of Literatures" and "Data Mining and Knowledge Discovery" in the 1997 volume and on "Text Mining" in the 1999 volume. It is interesting to note that the role of information science in knowledge discovery was foreshadowed by Pierre Piganiol as early as 1971 (Information for a Changing Society, OECD, 1971, p. 13):

    Information should not build up a dead structure: the body of knowledge is in continuous evolution and it is vital, in order to forecast and influence the future, that information should contain at least the seeds of tomorrow's progress and discoveries. What distinguishes modern information [science] from traditional documentation is precisely the introduction of this heuristic element.

Linda C. Smith is a professor in the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E. Daniel St., Champaign IL 61820-6211; 217/333-7742.


© 2000, American Society for Information Science