Bulletin of the American Society for Information Science and Technology, Vol. 29, No. 2, December/January 2003


Editor's note: This article has been condensed from the paper that was awarded third place in the ASIST SIG/III 2002 International Paper Competition. Other papers from this year's contest, including the first and second prize winners, will appear in later issues of the Bulletin.

A Knowledge Network Constructed by Integrating Classification, Thesaurus and Metadata in a Digital Library
by Wang Jun

Wang Jun is associated with the Information Management Department of Peking University, Beijing, China, and can be reached by e-mail at junwang@pku.edu.cn

Knowledge management in digital libraries is a universal problem. Keyword-based searching is applied everywhere, whether the resources are indexed databases or full-text Web pages. In keyword matching, the valuable content description and indexing carried by the metadata, such as subject descriptors and classification notations, are treated merely as ordinary keywords to be matched against the user query. Without the support of vocabulary control tools such as classification systems and thesauri, the intellectual labor of content analysis, description and indexing that goes into metadata production is largely wasted. New retrieval paradigms are needed to exploit the potential of metadata resources. Could classifications and thesauri, which contain the condensed intelligence of generations of librarians, be used in a digital library to organize networked information, especially metadata, so that it becomes more usable and the digital library becomes a knowledge management environment?

To examine that question, we designed and implemented a new paradigm that incorporates a classification system, a thesaurus and metadata. The classification and the thesaurus are merged into a concept network, and the metadata are distributed into the nodes of the concept network according to their subjects. The abstract concept node instantiated with the related metadata records becomes a knowledge node. A coherent and consistent knowledge network is thus formed. It is not only a framework for resource organization but also a structure for knowledge navigation, retrieval and learning.

We have built an experimental system based on the Chinese Classification and Thesaurus, which is the most comprehensive and authoritative in China, and we have incorporated more than 5000 bibliographic records in the computing domain from the Peking University Library. The result is encouraging. In this article, we review the tools, the architecture and the implementation of our experimental system, which is called Vision.

The Development of the Chinese Classification and Thesaurus

China has a long tradition of classification, owing to its abundance of ancient books. Modern Chinese classification was greatly influenced by Dewey, although the Dewey Decimal Classification itself was never popular in China. All the classifications now in use were created after the founding of the People's Republic of China. The Book Classification of Chinese Libraries (BCCL), first published in 1975 and revised four times since, is the most developed.

The most famous comprehensive thesaurus in China is the Chinese Thesaurus (CT), compiled between 1974 and 1980 in an effort involving more than 1000 people. It was the biggest thesaurus at that time, containing 91,158 preferred terms and 17,410 non-preferred terms. In 1986, under the influence of faceted thesauri, a huge project was started to combine the BCCL and the CT into the Chinese Classification and Thesaurus (CCT). More than 40 institutions were involved. The CCT was finished in 1994 and contains 14 million words in six volumes. Currently, it is used in all public libraries and more than 90 percent of non-public libraries and information institutions in China.

The CCT is not designed for the network environment and has seldom been applied there. Its application has several inherent obstacles, most of which, as Zhang Qiyu recently explained ["Discussions of the information retrieval language of 21 Century," Forum of Libraries, 21 (5)], are common to other classifications and thesauri of broad scope. These problems include currency and difficulty in tailoring coverage to specific domains. Moreover, the tools were designed for the organization of information resources and for the management of hard copy documents, not for information retrieval. They are very complex and require that indexers and classifiers have extensive training. Finally, in the case of the CCT the classification and thesaurus are relatively independent of each other and cannot be updated synchronously.

The Knowledge Network

The exceptions, of course, are the online public access catalogs (OPACs) in Chinese libraries. Their bibliographic data are indexed strictly according to the CCT, and as collections of living materials they contain plenty of new professional terms in their title fields. To overcome the obstacles mentioned above, the classification, the thesaurus and the bibliographic data can be combined to complement each other. The knowledge structure of the classification and thesaurus provides a skeleton for the organization of the bibliographic data; the concrete bibliographic data restore flesh and blood to the skeleton. New terms can be extracted automatically from the bibliographic data to update the classification and thesaurus, using the mapping between the subject descriptors and the titles that they index. At the same time, the classification and thesaurus are customized to the specific domains of the OPAC resources. The knowledge network thus formed provides the user with a natural structure for navigation, searching and learning. We will call it a KNICTM (Knowledge Network Integrated of Classification, Thesaurus and Metadata).

In Vision we combined the classification and indexing terms for the computing domain in the CCT with the bibliographic records for all Chinese materials in computer science that were published between 1990 and 1999 and are held by the Peking University Library. This gave the Vision system a database of more than 5000 bibliographic records.

A KNICTM is built in three steps:

  1. Construction of the original concept nodes based on the classification and thesaurus. First, the thesaurus is turned into a concept network consisting of nodes and edges. A node is composed of the synonym set of a subject descriptor, that is, the descriptor and all the terms connected to it by the equivalence relationship (Use/Use For) in the thesaurus. If there is a hierarchical relationship (Broader Term/Narrower Term) between two terms, an "is-a" edge is set up between the corresponding concept nodes. Next, the classification scheme is embedded in this original concept network as a discipline-oriented hierarchical backbone (Figure 1). Since the CCT is a reciprocal index between the BCCL and the CT, rather than a faceted thesaurus, there is no direct mapping between the categories of the BCCL and the concepts of the CT. Therefore, category nodes had to be created, and relationships established between the category nodes and the concept nodes.
  2. Distribution of the bibliographic data to the concept network. The bibliographic data are arranged into the nodes of the original concept network according to their subjects. This is the key task of the KNICTM construction. Supported by the bibliographic data, each abstract concept node becomes a knowledge node where the abstract concept is bound to the metadata records. The concept network thus turns into a knowledge network formed by the integration of classification, thesaurus and metadata. It is a kind of metadata "shelving." If a bibliographic record contains only one subject descriptor, we take the record as an instance of the corresponding concept node and add the record to the node. If it contains several descriptors, we add it to all the related concept nodes as an instance of each. If it contains a composite subject described by a coordination of descriptors, we create a new concept node and connect it to the nodes of all the coordinated descriptors with "related-to" edges. The newly created node is called a co-concept node; for the moment it holds bibliographic records only and no term.
     For example, a bibliographic record with the title "Internet Firewall Technologies" is indexed with the string "Network--Security", assuming that the thesaurus has no descriptor "Firewall". To add this record to the concept network, a co-concept node is created and connected to the concept nodes for "Network" and "Security" by "related-to" edges. Since associative relationships easily get out of control in a thesaurus, we do not create them in Step 1. Only when the correlation of two concepts is supported by a bibliographic record do we establish the "related-to" relationship between them through a co-concept. Thus the bibliographic data function as verification of the associative relationships. In Step 3, when a newly extracted term with the meaning of the co-concept occurs ("Firewall" in the above example), the new term is added to the co-concept. The KNICTM needs periodic manual examination to confirm the co-concepts created. When a preferred term is determined for the co-concept, the co-concept node becomes an ordinary concept node.
  3. Enhancement of the KNICTM. The last and most difficult task is to mine new terms from the metadata collection to enhance the KNICTM. The title of a scientific document usually summarizes its content and reveals its central topics. A direct mapping exists between the keywords of the title and the subject descriptors and classification notation used to index the document. Based on this mapping, statistical and semantic techniques can be applied to extract new terms from the title and add them to the concept network. There are three difficulties:
    · Segmenting the title into words and phrases, a classical problem for Chinese.
    · Extracting valuable terms from among the common terms in the title.
    · Determining the point where the extracted terms should be inserted into the KNICTM.
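
Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Vision implementation: the class and method names, the "+"-joined co-concept identifier, and the sample synonym and narrower term are all assumptions made for the example, and category nodes are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                     # "concept" or "co-concept"
    terms: set = field(default_factory=set)       # synonym set; empty for a new co-concept
    records: list = field(default_factory=list)   # titles of attached bibliographic records


class KnowledgeNetwork:
    def __init__(self):
        self.nodes = {}      # node id -> Node
        self.edges = set()   # (source id, label, target id)

    def add_concept(self, descriptor, used_for=(), broader=None):
        """Step 1: one node per descriptor plus its Use/Use For synonyms."""
        self.nodes[descriptor] = Node("concept", {descriptor, *used_for})
        if broader is not None:
            self.edges.add((descriptor, "is-a", broader))

    def add_record(self, title, headings):
        """Step 2: attach a record to the node(s) named by its subject headings.

        Each heading is a tuple of descriptors; a tuple with more than one
        descriptor is a coordinated (composite) subject and gets a co-concept.
        """
        for heading in headings:
            if len(heading) == 1:
                node_id = heading[0]
            else:
                node_id = "+".join(sorted(heading))   # hypothetical co-concept id
                if node_id not in self.nodes:
                    self.nodes[node_id] = Node("co-concept")
                    for descriptor in heading:
                        self.edges.add((node_id, "related-to", descriptor))
            self.nodes[node_id].records.append(title)


net = KnowledgeNetwork()
net.add_concept("Network", used_for=["Computer network"])   # synonym is illustrative
net.add_concept("Security")
net.add_concept("Local area network", broader="Network")    # illustrative hierarchy
net.add_record("Internet Firewall Technologies", [("Network", "Security")])

# Step 3, once "Firewall" has been extracted and confirmed for the co-concept:
net.nodes["Network+Security"].terms.add("Firewall")
```

In this sketch a "related-to" edge exists only because a record was indexed with the coordinated heading, which mirrors the rule that associative relationships must be supported by bibliographic data.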

The Benefits of the KNICTM

The KNICTM provides a number of benefits:

    · A framework for the organization of network resources. It is a network of knowledge with substantial data appended rather than a mere abstract concept network. As the instances of the concept, the metadata records inherit all the relationships among the concepts. The metadata records which were isolated from each other become semantically connected now and are woven into an interconnected knowledge network.

    · An adaptive concept network based on the applied resources. The classification and thesaurus represent general knowledge and cannot fit a specific information collection perfectly. The KNICTM is an adaptive concept network capable of customizing itself to the scale and domain of the given collection. Nodes and edges that are supported by metadata instances demonstrate the usability of the corresponding concepts. If nodes and edges have no metadata instances, the corresponding concepts and relationships are unusable and may need updating. Furthermore, statistical and semantic techniques can be applied to mine new terms, concepts and relationships from the metadata collection and enrich the concept network automatically.

    · A structure for knowledge navigation and retrieval. Keyword-based search seriously under-exploits the value of the metadata. The KNICTM provides a conceptual retrieval network and visual navigational ontology. First, the KNICTM can guide a user to clarify the information demand and express a query clearly. Second, because all the metadata have been arranged into the KNICTM, there is no need for the user to dig into the metadata collection by keyword matching. It is only necessary to locate the knowledge node that best matches the query and follow the surrounding edges to reach other nodes to complete the process. Third, now that all the metadata have been arranged into the structure of the KNICTM according to their subjects, the retrieval result is displayed in that structure, already ranked and classified.

    · A well-organized knowledge network to support knowledge learning. The knowledge nodes are organized into a discipline-based hierarchy and clustered into topic areas through the links among them. A friendly interface like the Cat-a-Cone developed by Marti Hearst and Chandu Karadi can display the organization of the knowledge nodes. With such an interface, a user can learn the discipline structure of a domain, master the professional terms, understand the relationships among the subjects and pick out documents to study.

    · A digital library for knowledge management. The most essential elements of a library are its information resources and its information organization and retrieval tools, namely the classification and thesauri. In the KNICTM, these elements have been integrated into a coherent and consistent knowledge network. The KNICTM could easily be extended to support other activities of digital libraries, such as collecting and indexing. Thus all the activities of the digital library, including indexing, organization, navigation, retrieval and learning, could center on this knowledge network. If it continues to develop, the KNICTM will bring the digital library from information management to knowledge management.
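
The navigation-and-retrieval benefit above (locate the best-matching knowledge node, then follow its edges) can be illustrated with a small sketch. The data structures, the function names `locate` and `retrieve`, the `max_hops` parameter and all sample data except the firewall title are assumptions for illustration; the article does not specify how Vision matches a query to a node.

```python
from collections import deque

# Toy network, illustrative only: neighbours, term sets and attached records.
EDGES = {
    "Network": ["Local area network", "Network+Security"],
    "Local area network": ["Network"],
    "Security": ["Network+Security"],
    "Network+Security": ["Network", "Security"],
}
TERMS = {
    "Network": {"network", "computer network"},
    "Local area network": {"local area network"},
    "Security": {"security"},
    "Network+Security": {"firewall"},
}
RECORDS = {
    "Network": ["Example title 1"],
    "Security": ["Example title 2"],
    "Network+Security": ["Internet Firewall Technologies"],
}


def locate(query):
    """Return the node whose term set contains the query, or None."""
    wanted = query.strip().lower()
    for node, terms in TERMS.items():
        if wanted in terms:
            return node
    return None


def retrieve(query, max_hops=1):
    """Breadth-first walk from the matched node; results stay grouped by node."""
    start = locate(query)
    if start is None:
        return []
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue:
        node, hops = queue.popleft()
        results.append((node, hops, RECORDS.get(node, [])))
        if hops < max_hops:
            for neighbour in EDGES.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
    return results


results = retrieve("Firewall")
```

Because records are returned per node and in order of distance from the matched node, the result list is already grouped by subject, which is the sense in which the retrieval result arrives "ranked and classified."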

The Construction of the Knowledge Network

We have completed the first phase of the Vision system. It has a client/server architecture. On the server side the knowledge network is supported by Oracle9i. On the client side is a user interface implemented in Java. We chose Oracle9i for its powerful object-oriented features, such as nested tables and variable arrays, which support our complex objects. Java makes it easy to transfer the system to the Web.

The Ontology Design. There are many objects in our system and their relationships are complex. We therefore used ontology tools such as Ontolingua and Protégé to design it. Our ontology consists of seven classes: term, concept, co-concept, category, document, author and publisher. Their names reflect their meanings, but the relationships among them are intricate. Figure 2 depicts these relationships; the numbers indicate the cardinality of the links. We converted the completed ontology into the relational schema of the database system and created the corresponding tables in Oracle.

The Server Side: The Knowledge Network. The original dataset used to build the Vision system included the e-text of the CCT and the bibliographic data of the computing domain. Both of them were provided by the Peking University Library. The characteristics of the original data had considerable influence on the system design and implementation.

There are three steps in building the Vision server:

    · The e-text file of the CCT is processed to set up the fundamental structure of the Vision system. A particular tool was developed to serve this purpose. The e-text of the CCT is read in and all the entries (categories and terms) on computer science are processed. According to the structure, layout and notation rules of these entries, the related records are created and appended into four tables respectively: TERM, CONCEPT, CoCONCEPT and CATEGORY. Through this process, we collected 2194 terms, including 1684 preferred terms which became concepts, and 278 non-preferred terms. Some non-computing-domain terms were also captured since they are the related terms.

    · The bibliographic data are loaded into the database and organized into the original concept network constructed in the preceding step. The bibliographic data are in CNMARC format. We developed a tool to decode the CNMARC format and extract the required fields (title, subject, author, etc.) to form a new record, which is appended to the DOCUMENT table. In total 5053 document records were created. Other records were discarded for various reasons, for example, an unrecognized title or two ISBNs. This data processing required a lot of time and energy. After the data were loaded, the records of the DOCUMENT table were connected with the records of the CONCEPT table according to their subject descriptors. When necessary, a new record was created in the CoCONCEPT table. These processes accomplish the task of organizing the metadata into the knowledge network described above.

    · New terms are extracted from the DOCUMENT table and added to enhance the knowledge network of the Vision system. Some of the problems have already been mentioned. This process is the focus of the ongoing second phase of the Vision project, so here we just outline roughly what we have done to date.

      · Extraction: At present a statistical algorithm is applied to extract terms in titles. First, the title is segmented into basic words and phrases using a general segmentation tool, and then the co-occurrence frequencies of neighboring terms are counted. If the frequency is higher than a given threshold, the combination is selected as a candidate term. Then we look at the distribution of subject categories in the set of documents in which the candidate term occurs. If the distribution of the subject categories is convergent, the new term is accepted.

      · Insertion: The convergent point found above helps to determine the position where the new term should be inserted. We are considering applying Lattice Theory or Formal Concept Analysis to this problem.
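
The extraction step can be sketched roughly as follows. This is a minimal sketch under stated assumptions: titles are taken as already segmented into tokens, "convergent" is interpreted as one subject category accounting for at least a given share of the documents containing the candidate, and the function name, parameter names, default values and sample data are all hypothetical. The article does not give the actual convergence test or thresholds.

```python
from collections import Counter, defaultdict


def extract_terms(docs, threshold=1, min_share=0.6):
    """docs: iterable of (segmented_title, subject_category) pairs.

    Returns {candidate term: dominant category} for the accepted candidates.
    """
    pair_counts = Counter()
    pair_categories = defaultdict(Counter)
    for tokens, category in docs:
        # Count each pair of neighbouring tokens once per title.
        for pair in set(zip(tokens, tokens[1:])):
            pair_counts[pair] += 1
            pair_categories[pair][category] += 1

    accepted = {}
    for pair, count in pair_counts.items():
        if count <= threshold:          # keep only pairs more frequent than the threshold
            continue
        top_category, top_count = pair_categories[pair].most_common(1)[0]
        if top_count / count >= min_share:   # assumed convergence test
            # Tokens are joined with a space here; Chinese tokens would be joined directly.
            accepted[" ".join(pair)] = top_category
    return accepted


# Illustrative data: English tokens and made-up category labels "A" and "B".
docs = [
    (["internet", "fire", "wall", "technologies"], "A"),
    (["fire", "wall", "design"], "A"),
    (["building", "a", "fire", "wall"], "A"),
    (["design", "of", "databases"], "B"),
    (["design", "of", "networks"], "A"),
]
new_terms = extract_terms(docs)
```

Here "fire wall" is accepted because it occurs in three titles that all share one category, while "design of" is rejected because its two occurrences are split across categories. The dominant category returned for an accepted term is one possible starting point for the insertion step.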

The Client Side: Knowledge Navigation and Retrieval. We implemented a system in Java to navigate and retrieve the Vision knowledge network. Figure 3 is a snapshot of the user interface. There are four physical areas in the interface: the query dialog, the concept network window, the information window and the document window.

Within the concept network window, there are three basic ways to view the concept network: hierarchical tree, alphabetical list and concept family. The hierarchical tree is similar to a faceted thesaurus. All the categories and concepts (identified by the preferred terms) are organized into an expandable conceptual tree. The alphabetical list is an index of all the terms in alphabetical (Chinese Pinyin) order. The concepts can also be organized into concept families, that is, the term families of the thesaurus, and listed in alphabetical order by the top concepts. A fourth option is a hybrid of the hierarchical tree and the alphabetical list.

When the user clicks on a concept, its detailed information is displayed in the information window, including its term set, super-concept, sub-concepts, corresponding category and the co-concepts around it. The documents connected with it are displayed in the document window. All the windows trigger each other and act in a chain, and all the objects in the windows are clickable.

Conclusion and Future Work

Centuries of library work have shown that the organization of information is the basis for the effective use of information resources. This is the fundamental value of a library, and the same is true for digital libraries. For lack of organization, the potential of metadata, one of the most important networked resources, is not sufficiently exploited. This article has presented an approach to organizing metadata into an integrated knowledge network and setting up a new paradigm for knowledge management in digital libraries. Our approach is differentiated from other ontology-driven and concept-based systems by its incorporation of concepts and the relevant metadata records into integrated knowledge nodes that form a knowledge network. Our experiment also demonstrates that traditional resources such as bibliographic data still have indispensable value worth further exploration, in spite of the continuous increase in other digital resources.

The Vision system is entering its second phase. We are endeavoring to achieve the following goals:

    · A better method to compute the extension and intension of a term and determine its position in the concept network. This is critical if the concept network is to be a self-sufficient system. We are now considering applying a modified version of Formal Concept Analysis to this problem.

    · A language for concept query and manipulation that will simplify operations on the concept network and add an automatic query expansion and contraction mechanism.

    · A visualization interface such as Cat-a-Cone or Inxight-Star-Tree, which can provide more friendly interaction with the user. A visualization interface that is structurally isomorphic to the concept network will support knowledge learning more powerfully.

When all aspects of the system have been tested, the system will be moved to the Web and incorporated into the current OPAC system.

Integrating classification, a thesaurus and metadata into a coherent knowledge network has promising applications in digital libraries. The network could easily be expanded to support automatic classification and indexing in scientific domains. Enhanced by the bibliographic data, the knowledge network could absorb other metadata, such as the index databases of journals, magazines or newspapers.

The Web community also recognizes the importance of standardizing and organizing Web information. XML, RDF, Dublin Core and other specifications are preparing the Web to become manageable: the Semantic Web as envisioned by Berners-Lee (www.w3.org/2000/Talks/1206-xml2k-tbl). But how is it to be constructed? Our paradigm provides one approach.

Acknowledgements

This research is a portion of my doctoral dissertation at Peking University. I wish particularly to thank Deng Peng, Zhu Xingguo and Zu Yong who assisted me in developing the Vision system, and I'd like to thank Professors Yang Dongqin and Tang Shiwei as well as Dai Longji, president of the Peking University Library, and his staff.

Figure 1. The knowledge network with a knowledge node zoomed in. [Figure not reproduced. Labels in the figure: category, is-a, related-to, co-cpt, Concept, Knowledge Node, Terms, Metadata.]

Figure 2. The objects and relationships in the Vision system. [Figure not reproduced. Objects shown: Publisher, Author, Document, Term, Concept, Co-Concept, Category.]



Copyright © 2003, American Society for Information Science and Technology