Bulletin of the American Society for Information Science and Technology       Vol. 27, No. 3       February/March 2001


SIG/III International Paper Competition: Papers on Practical Collaborative Applications of Digital Libraries

    Editor's Note: Last year the ASIST Special Interest Group/International Information Issues (SIG/III) held an international competition for the best papers on practical collaborative applications of digital libraries or information science and technology in advancing communications, information and knowledge in the developing world. To be eligible, the principal author had to be a citizen of a developing country. In this issue of the Bulletin we present the winner of the first place honors in the competition. The paper was written by Kyiho Lee of the Republic of South Korea. Our congratulations to Mr. Lee and his colleagues for their outstanding effort. In subsequent issues of the Bulletin we will publish a number of other entries, including the second and third place papers.

    Under the leadership of Sue O'Neill Johnson of the World Bank and Nathalie Leroy of the United Nations, SIG/III was delighted to receive 35 papers representing 20 developing countries. Six were awarded prizes and presented at the 2000 Annual Meeting. ASIST, with the help of Academic Press and other donors, was able to fund some of the winning authors to come to Washington to present their work. Contest judges were Dr. Trudi Bellardo Hahn, University of Maryland; Dr. Ben-Ami Lipetz, State University of New York at Albany; and Dr. Bahaa El-Hadidy, Professor Emeritus, University of South Florida. As editor of the Bulletin, I also read all of the entries and offer my own congratulations to all the contestants. Many very fine papers were submitted, although not all the winners and other publishable papers were suitable for the Bulletin. As an alternative, the International Information and Library Review is preparing a special issue that will include many of the ones I was unable to accept.

    SIG/III is planning a similar competition for 2001. For more details and entry information, please contact Sue O'Neill Johnson (sjohnson3@worldbank.org) or Nathalie Leroy (leroyn@un.org).

    Irene L. Travis, Editor

    Bulletin of the American Society for Information Science and Technology

Winning Papers from the SIG/III 2000 Competition

    1. Kyiho Lee (Republic of South Korea). Construction of a Full-Text Database and Service System for Korean Electronic Theses and Dissertations.

    2. Aashish Sharma (India) and William Yurcik. The Emergence of Rural Digital Libraries in India: the Gyandoot Digital Library Intranet.

    3. Chao-chen Chen, et al (Taiwan). The Design of Metadata Interchange for Chinese Information and Implementation of Metadata Management System.

    4. Goswani (India). Statistical Databases in India: Toward a Better Dissemination.

    5. Yan Quan Liu and Jin Zhang (China). Digital Library Infrastructure: A Case Study on Sharing Information Resources in China.

    6. Duncan Wambogo Omole (Kenya). Information Science and Technology in Developing Countries.

 

Construction of a Full-Text Database and Service System for Korean Electronic Theses and Dissertations

by Kyiho Lee

Kyiho Lee is associated with the Korea Research and Development Information Center, Eundong 1, Taejon, Korea, and can be reached by e-mail at ghlee@kordic.re.kr

The recent rapid development of Web-based communication technologies and high-speed retrieval engines enables the world to share information on the Internet beyond the limits of time and space. With the generation of numerous kinds of data, information formats have changed from traditional text to a variety of complex multimedia formats. Since the mid-1990s most documents have been produced by various word processors and stored in electronic form. Furthermore, it is possible to transform text on paper to digitized form using OCR techniques. Today, most universities require their students to submit electronic theses and dissertations (ETDs) for graduation. Since ETD documents are full-text and generated by various word processors, a full-text database and retrieval system is necessary for their effective utilization.

The concept of the ETD was first publicly discussed at a 1987 meeting in Ann Arbor, Michigan, arranged by University Microfilms International (UMI), Virginia Polytechnic Institute and State University (Virginia Tech), the University of Michigan, SoftQuad and ArborText. Since 1992 Virginia Tech has worked with the Coalition for Networked Information (CNI), the Council of Graduate Schools (CGS) and UMI in developing a digital library for ETDs.

The Networked Digital Library of Theses and Dissertations (NDLTD) was established in 1997 to coordinate the international efforts related to ETDs. NDLTD focuses on supporting education, research, exchange of scholarly information and technology transfer related to digital libraries and theses and dissertations (Fox, 1999). It currently has 98 members: 87 member universities (including 3 consortia) and 11 institutions.

Since the mid-1990s, given the popularity of word processing, most universities in Korea have gradually requested that students submit electronic versions of their theses and dissertations. In 1997 Pohang University of Science and Technology (POSTEC) began requiring students to submit ETD files in one of a number of formats to be considered below. Today, most Korean national universities have adopted the same system and have plans to build their ETD digital libraries. However, standards for ETD formats were not established, and a large number of ETD files were stored in university library warehouses and not fully utilized.

In 1998, to improve this situation, the Korea Research and Development Information Center (KORDIC) initiated a project sponsored by the Korean Ministry of Science and Technology to build a national digital library for Korean ETDs. KORDIC is an independent institute under the Office of the Prime Minister. Its major function is to create a general distribution system for Korean-specific and overseas technical information, including research projects, reports, research manpower, and equipment and instruments for science and technology. The aim of the project is to construct a full-text database for Korean ETD files and to implement a large-scale full-text document retrieval system accessible on the Internet. The project team built the system in two steps:

    • First, we created a full-text database, for which we had to develop a method of transforming a variety of complex document formats to a unified standard.

    • Second, we developed a retrieval system to provide full-text document services effectively on the Internet.

Format Conversion and Processing of Korean ETDs

Currently the most popular format for ETD files worldwide is Adobe's Portable Document Format (PDF). However, since Korean ETDs are multi-lingual (Korean, English and sometimes Chinese), the project had to select the electronic format that would be most efficient in our environment. This process is described below.

The Criteria for Format Conversion and Data Transmission for ETDs. One of the most important things in transforming different ETD formats to a unified format is to keep the contents of the original documents and to preserve the editing status of each word processor. Since most theses and dissertations are more than 100 pages long, it was necessary to consider a page-division technique that would allow the system to transmit a portion of the work to provide fast full-text services in the Internet environment. To promote effective ETD transformation and fast data transmission on the Internet, we established the following criteria:

    • The primary goal is to implement a full-text document retrieval system. The system should be able to provide basic full-text retrieval functions for the original documents and enhance retrieval effectiveness.

    • The text information should be extracted. To allow full-text retrieval, the system should be able to extract text from the transformed file.

    • The file size must be reasonable for Internet service. Generally, the file size of complex documents is too large for Internet transmission.

    • The system must support Korean fonts. Since many Korean ETDs are, of course, in Korean, a format that is compatible with Korean fonts is an important factor.

    • The transmission of page units should be allowed. Since users usually want to read a portion of the document, the transmission of page units should be available.

    • Compatibility among different formats is an important consideration. If a format is exchangeable with different formats, it will provide great flexibility to reconstruct the full-text database.

The Standard for a Document Format

Due to the fast rate of change in computer technology and information industry trends, it is very difficult to predict which electronic format will be the standard for Internet service in the future. Table 1 shows a comparison of electronic document formats that are widely used at present. The table describes the characteristics of each format, introduces the related software and compares file sizes.

 Table 1.  Comparison of different electronic document formats

Format | Description | Generating Program | Viewer Program | File Size
DVI | Device Independent format | TeX, TeXplus (HWP) Writer | TeXplus Viewer, various DVI viewers | 1
PS | Format for PostScript (PS) printer & graphics | PS Printer Driver, DVIPS | GhostScript/Viewer | 2 (English)
PDF | Compressed PS format generated by Adobe | Adobe PDF Writer, Distiller, DocuCom Distiller | Acrobat Reader, DocuCom Viewer | 5
DOC | Microsoft Word document | MS Word | MS Word Viewer | 6
HWP | Format for Hangul W/P (Korean language) | HWP Driver | Namo Viewer | 4
XLX | Based on Hewlett-Packard's Printer Command Language (PCL) for HP-compatible laser printers | JetDoc | JetDoc Viewer | 3
XML | Extensible Markup Language - a subset of the Standard Generalized Markup Language (SGML) for the Web | Microsoft Office 99 (not released by Microsoft yet) and others | Internet Explorer | 2

Popular document tools such as Adobe Distiller, MS-Word, WordPerfect and HWP (Korean) provide their own conversion tools for HTML, CGI programs or plug-ins for Internet service. However, there is still no standard electronic format for all kinds of documents. In the 1990s the Standard Generalized Markup Language (SGML) was proposed as an international standard format, but it is not widely used because encoding documents in it is too complex and expensive. Currently, XML, a subset of SGML, has emerged as a new standard format for the Internet; however, no one can predict the future of XML as yet.

As mentioned before, ETD files in Korea generally contain a mixture of both English and Korean. Deciding on a standard format for Korean ETDs is therefore more complicated, because the features of the Korean language must be considered. To provide full-text retrieval functions for Korean documents, we need to consider carefully such issues as support of the Korean font, availability of a Korean indexing program, compatibility among different file formats and trends for future standards for file formats.

PDF and DVI

Today the most popular and widely used document format in the field of electronic publishing is PDF. A PDF file can be easily created in MS-Word or WordPerfect by printing the file to a PDF Writer. Alternatively, users can save their files as PostScript and use Adobe Distiller to create PDF files. For these reasons, UMI, for instance, has selected PDF as its standard format, and now transforms all the theses and dissertations it receives to PDF for its full-text database.

However, even though PDF is a flexible format for English ETD files, it is not efficient for Korean ETD files. First, PDF is based on 1-byte European languages, so it has some limitations in expressing 2-byte Asian languages such as Korean and Chinese. Second, to process Korean documents with PDF, it is necessary to convert Korean fonts to PostScript fonts and to store the converted fonts inside the PDF file. As a result, the average size of the PDF file becomes several megabytes, so it is not efficient to send the documents and provide full-text service through the Internet. Third, since Adobe does not support the indexing function for Korean, it is difficult to extract Korean text for full-text document retrieval.

DVI is a document format developed at Stanford University for academic document exchange and printing. Since DVI is the output format of TeX, which is widely used to generate academic documents, it is appropriate for document exchange in the area of science and technology. The main advantages of DVI files are that the file size is relatively small and that the layout is almost the same as that of the original document. Also, the technology of DVI is now open, so it is easy to obtain the source code and to apply the technology to any information retrieval system. It is possible to convert DVI files to PDF format using a simple conversion tool; therefore, DVI and PDF will be compatible with each other for Korean ETD files once Adobe supports the 2-byte Korean code. Based on these considerations, we selected DVI as the standard format for Korean ETDs. Table 2 compares DVI and PDF formats and shows that DVI provides greater flexibility for Korean ETD files (Koh, 1998).

 Table 2.  Comparison Between PDF and DVI

Feature | DVI format | PDF format
Font | True Type | PS Type 1 and 4
Font Process | Not embedded. Both Korean and English fonts are replaced by DVI fonts | Korean font is embedded. English font is replaced by PDF font
Inserted Vector Graphic | Compressed and embedded in document. Supports both EPS and WMF | Compressed and embedded in document. Based on EPS
Inserted Bitmap Graphic | Compressed and embedded JPG and GIF | Compressed and embedded JPG and GIF
Text | Uses standard compression method. Wide selection of retrieval engines and indexers | Uses Adobe compression method. Restricted selection of retrieval engines and indexers
Mathematics Formula | Uses its own font for TeX and MathType formulas | Embedded in the document that contains the formula

Document Conversion

The project analyzed the Korean ETD files accumulated at KAIST and POSTEC since 1997 and found that more than 90% of Korean ETD files are in MS-Word, HWP or LaTeX format (Lee, 1998). Therefore, we developed conversion tools that transform these three different formats to DVI. We developed TeXplus Writer for MS-Word files and TeXplus HWP Writer for HWP files. To retrieve the documents by page, we developed DVI2TXT, which extracts the Korean index from the converted DVI file. We also developed DVI Split to generate pages from the DVI file and TeXplus to browse the DVI file. The following are the major functions and features of these tools:

TeXplus Writer. TeXplus Writer is the printer driver software that generates DVI files for MS-Windows 95/98/NT. It supports various Korean word processors such as Hangul-Word of MS-Korea, Hoonminjungum of Samsung, and Arirang of Handy-Office, but it does not support HWP. Since it provides functions for complex formulas, tables and graphics, it can generate Korean DVI files that are the same as the original ETD. Since TeX is difficult to use, we designed TeXplus Writer to be used easily in various MS-Windows applications.

TeXplus HWP Writer. Like TeXplus Writer, TeXplus HWP Writer is printer driver software; it generates DVI files from HWP files in MS-Windows 95/98/NT. Most Korean word processors follow the standard Microsoft printing method. However, HWP has its own printing method, so HWP files cannot be transformed to DVI through TeXplus Writer. TeXplus HWP Writer is designed to provide the same functions as TeXplus Writer; it can, therefore, generate Korean DVI files that are the same as the original ETD.

DVI Document Split. One of the main characteristics of complex documents is that the size of the generated file is generally much larger than the size of an HTML file. As a result, response time and transmission efficiency are issues in the Internet environment. Today in many information service systems, it takes considerable time to view a complex document on the Internet because the client has to receive the whole document from the server before users can view the pages of interest. However, when users want to see a portion of a thesis or dissertation, it is not efficient to bring all the pages from the server. To improve response time and transmission efficiency, the size of the file for transmission should be small, and partial data transmission should be allowed. We designed the structure of DVI files to enable sending and receiving a document partially. From the unified DVI file generated from different ETD files, DVI Split selects the part of an ETD file that satisfies a user's request, restructures that part of the DVI file and sends the restructured file to the client by page unit.
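The page-unit idea can be illustrated with a minimal sketch. It does not parse real DVI data and is not the DVI Split program itself; it assumes a converted document is already available as a list of per-page byte strings, and the names `PagedDocument`, `select_pages` and `transfer_size` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PagedDocument:
    """Hypothetical stand-in for a converted ETD: one byte string per page."""
    name: str
    pages: List[bytes]


def select_pages(doc: PagedDocument, first: int, last: int) -> PagedDocument:
    """Return a smaller document holding only pages first..last (1-based, inclusive)."""
    if first < 1 or last > len(doc.pages) or first > last:
        raise ValueError(f"page range {first}-{last} is outside 1-{len(doc.pages)}")
    return PagedDocument(f"{doc.name}[{first}-{last}]", doc.pages[first - 1:last])


def transfer_size(doc: PagedDocument) -> int:
    """Number of bytes that would be sent to the client for this document."""
    return sum(len(page) for page in doc.pages)


# A made-up 120-page thesis with 2,000 bytes per page (illustrative numbers only).
thesis = PagedDocument("thesis.dvi", [b"x" * 2000 for _ in range(120)])
excerpt = select_pages(thesis, 40, 42)
print(transfer_size(thesis), transfer_size(excerpt))
```

Under these assumed sizes, sending three pages moves a small fraction of the bytes needed for the whole file, which is the effect the page-division criterion aims for.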

DVI Text Extractor. To implement full-text document retrieval, it is essential to extract text from the DVI files to generate the index file. Currently, there are several free conversion programs that convert the DVI files to text. However, the programs are developed for English, so they cannot convert files in a 2-byte language such as Korean. DVI Text Extractor is an indexing tool that extracts English and Korean text from DVI files generated by TeXplus Writer, TeXplus HWP Writer or TeX. It extracts all the text of a document except images and mathematics formulas, and stores the text with the corresponding page information. Extracted texts from DVI files are loaded by the information retrieval system and stored in the full-text document database. Since all the texts are stored with their page information, users can search the full-text database by page unit.
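Storing text together with page information amounts to a page-level inverted index. The sketch below shows the general technique only; it is not the DVI Text Extractor or the KRISTAL-II Hangul indexer. The tokenizer is a simplifying assumption (runs of Latin letters/digits or of Hangul syllables), whereas real Korean indexing requires morphological analysis, and the sample documents are invented.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumption: a token is a run of Latin letters/digits or a run of Hangul syllables.
TOKEN = re.compile(r"[A-Za-z0-9]+|[\uac00-\ud7a3]+")

Posting = Tuple[str, int]  # (document id, 1-based page number)


def build_page_index(docs: Dict[str, List[str]]) -> Dict[str, Set[Posting]]:
    """Map every token to the set of (document, page) pairs where it occurs."""
    index: Dict[str, Set[Posting]] = defaultdict(set)
    for doc_id, pages in docs.items():
        for page_no, text in enumerate(pages, start=1):
            for token in TOKEN.findall(text):
                index[token.lower()].add((doc_id, page_no))
    return index


def search(index: Dict[str, Set[Posting]], term: str) -> List[Posting]:
    """Return the pages containing the term, sorted by document and page."""
    return sorted(index.get(term.lower(), set()))


# Invented sample data: each document is a list of extracted page texts.
docs = {
    "etd-001": ["Full-text retrieval of theses", "한국어 색인 and retrieval"],
    "etd-002": ["Image database", "색인 tools"],
}
index = build_page_index(docs)
print(search(index, "retrieval"))
print(search(index, "색인"))
```

Because each posting carries a page number, a hit can be resolved directly to the page to transmit rather than to the whole document.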

DVI Viewer (TeXplus Viewer). TeXplus Viewer is a plug-in program that displays DVI files in the Internet environment. Whenever a DVI file is opened on the Web, TeXplus automatically runs in the Web browser. TeXplus displays not only the DVI files from TeXplus Writer and TeXplus HWP Writer, but also the DVI files from all TeX compilers. The viewer supports Web browsers such as Netscape Navigator and Microsoft Internet Explorer for Windows 95/98/NT.

Implementation of Full-Text Retrieval Service System

The major change facilitated by electronic documents is the transition from traditional bibliographic to full-text retrieval. To satisfy the needs of users who want to retrieve full-text rather than the bibliographic information, commercial database vendors have built full-text databases and developed service systems for more than a decade. Today full-text database services are becoming more popular and increasing at a remarkable rate (Moon, 1993).

A full-text retrieval system is defined as "an information retrieval system that stores the full-text of all documents in the collection on a computer so that every character of every word in every sentence of every document can be located by the machine" (Blair & Maron, 1985). Generally, automatic full-text retrieval in the Internet environment provides two advantages. First, its costs are declining with the rapid development of digital technology, which continues to provide computers that are larger, faster, cheaper, more reliable and easier to use. Second, it avoids the need for human indexers, whose employment is increasingly costly and whose work often appears inconsistent and less than fully effective.

As described above, we developed software to extract text from the unified DVI files for full-text search and to segment large files by page for fast data transmission on the Internet. In addition, we used the KRISTAL-II information retrieval engine developed at KORDIC to store the ETD files in a database and to implement full-text document retrieval.

KRISTAL-II

To provide fast and accurate access to researchers in science and technology, KRISTAL-II has been under development at KORDIC over the last 5 years. It is specially designed to manage Korean texts as well as English texts by incorporating an automatic Korean indexer. Its extended Boolean model, based on NISO Z39.58, supports automatic indexing for Korean texts, efficient management of variable length documents, and fast disk I/O based on raw devices. KRISTAL-II also supports convenient tool kits for developing WWW and Windows interfaces (Lee, 1997).

The early version of KRISTAL-II was an information retrieval (IR) system designed mainly for storing and retrieving bibliographic information. However, the explosive growth of the Internet and the progress in communication technology in recent years have given birth to new IR application areas such as digital library services and electronic document management. In 1998, KORDIC released KRISTAL-II version 2, which concentrates on the functions of digital library and full-text document retrieval. The new version added a ranking facility, which can provide an appropriate number of documents sorted in descending order of estimated relevance to a given query (Park, 1998). It also implements the SQL-Plus library to couple KRISTAL-II with the ORACLE DBMS so that users can retrieve data from commercial DBMS storage. For electronic document management, the new version implements file format conversion from HWP to DVI and MS-Word to DVI.
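The paper does not give the relevance formula used by the KRISTAL-II ranking facility. As an illustration of what "sorted in descending order of estimated relevance" means in practice, the following sketch uses a simple TF-IDF score as a stand-in; the scoring function, the `rank` name and the sample documents are all assumptions, not the KRISTAL-II method.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def rank(query: List[str], docs: Dict[str, List[str]], top_k: int = 10) -> List[Tuple[str, float]]:
    """Score each document by summed TF-IDF of the query terms; best first."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = sum(tf[t] * math.log(1 + n_docs / df[t]) for t in query if t in tf)
        if score > 0:
            scores[doc_id] = score
    # Descending order of estimated relevance, truncated to top_k documents.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Invented, already-tokenized sample documents.
docs = {
    "a": ["digital", "library", "thesis"],
    "b": ["thesis", "thesis", "retrieval"],
    "c": ["image", "database"],
}
ranked = rank(["thesis", "retrieval"], docs)
print([doc_id for doc_id, _ in ranked])
```

The `top_k` cut-off corresponds to returning "an appropriate number of documents" rather than every match.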

Currently, KRISTAL-II is widely used in research institutes, national universities and industrial companies in Korea. It has been adopted as the main information retrieval engine for the National Digital Library Project, in which the Korea Library of Congress and six other government-supported institutes are participating.

Construction of Full-Text Database

As mentioned previously, several universities in Korea began receiving data in electronic form from their graduate students in the mid-1990s. Beginning in 1997, students at KAIST and POSTEC were required to submit their theses and dissertations in DOC, HWP or DVI format. Today most graduate schools in Korean universities ask their students to submit ETD files along with the traditional bound copy for their degree.

Table 3.  Distribution of ETD files

University | HWP | DOC | DVI | Total
KAIST | 617 | 863 | 431 | 1911
POSTEC | 110 | 136 | 79 | 325
Total | 727 | 999 | 510 | 2236

As a result, a huge number of diskettes that contain various ETD files are stacked in the warehouses of university libraries, but most of them are not utilized properly. Even though there are some efforts to build digital libraries using the image database of theses and dissertations, there is still no full-text document database and service system for ETD files in Korea.

To develop a solution for this problem, we selected two universities, KAIST and POSTEC, to build an initial ETD database. They provided all the ETD files and the corresponding bibliographic information along with the original hard copy text. Table 3 describes the distribution of ETD files that the two universities have acquired since 1997. It shows that more than 40% of all ETDs are in DOC format, the most popular format at the two universities.
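The "more than 40%" figure can be checked directly from the counts in Table 3. The short calculation below uses only the numbers reported in the table; the variable names are illustrative.

```python
# ETD counts per university and format, as reported in Table 3.
counts = {
    "KAIST": {"HWP": 617, "DOC": 863, "DVI": 431},
    "POSTEC": {"HWP": 110, "DOC": 136, "DVI": 79},
}

# Sum each format across the two universities.
totals = {}
for per_format in counts.values():
    for fmt, n in per_format.items():
        totals[fmt] = totals.get(fmt, 0) + n

grand_total = sum(totals.values())
# Percentage share of each format, rounded to one decimal place.
shares = {fmt: round(100 * n / grand_total, 1) for fmt, n in totals.items()}
print(totals, grand_total, shares)
```

DOC accounts for 999 of 2236 files, or roughly 45%, with HWP at about a third and DVI making up the remainder.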

To build the full-text ETD database, first we transformed three different ETD formats to DVI by applying the TeXplus Writer and TeXplus HWP Writer programs that we had developed. After generating the unified format, we stored the newly created DVI files in the DVI server in the KRISTAL-II system. For full-text information retrieval efficiency, index files for the new DVI files were regenerated by the KRISTAL-II Hangul indexer. Finally we developed a schema for the full-text extracted by the DVI Text Extractor and loaded the full-text data into the ETD full-text database using KRISTAL-II.
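The build procedure can be summarized as a per-file plan. The sketch below only lists the steps named in the text for a given input file; it does not invoke any of the actual tools. Mapping formats to file extensions and the function name `plan_conversion` are assumptions made for illustration.

```python
from pathlib import PurePath
from typing import Dict, List, Optional

# Assumption: submitted formats can be recognized by file extension.
# The tool names are those given in the text; None means no conversion is needed.
CONVERTERS: Dict[str, Optional[str]] = {
    ".doc": "TeXplus Writer",
    ".hwp": "TeXplus HWP Writer",
    ".dvi": None,
}


def plan_conversion(filename: str) -> List[str]:
    """List the processing steps for one submitted ETD file."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"unsupported ETD format: {filename}")
    steps = []
    tool = CONVERTERS[suffix]
    if tool is not None:
        steps.append(f"convert to DVI with {tool}")
    steps += [
        "store the DVI file on the DVI server",
        "extract page text with DVI Text Extractor",
        "index the text and load it into the full-text database",
    ]
    return steps


for name in ("thesis.HWP", "thesis.doc", "thesis.dvi"):
    print(name, plan_conversion(name))
```

Keeping the format-specific work in a single lookup table means that supporting another word processor would only require adding one entry, while the storage, extraction and indexing steps stay the same.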

The Structure of the Full-Text Retrieval System

Design Features. The design principle of our full-text retrieval system is to search the entire ETD database and to retrieve all the pages in relevant documents according to users' requests. The following are some of the features to implement our full-text search functions:

  • Complete full-text retrieval. Most ETD service systems are designed to retrieve bibliographic information such as abstract, title and author and then provide access to full-text information. This system allows users to access the database and retrieve information by page unit.
  • Automatic indexing with ranking ability. The KRISTAL-II engine automatically generates index files containing all the words in the text regardless of the file size. The new version adds a ranking facility for better precision, and it can generate a list of documents ordered by their relevance to a given query.
  • Page-by-page retrieval. When users want to retrieve some part of the full text of a document, only the related pages are transmitted, which improves the transmission speed.
  • Link to an image ETD database. For those theses and dissertations submitted before the mid-1990s, which are all in hard copy, KORDIC constructed an image database using the TIFF format. Along with the full-text document service, the system is designed to provide image database services.

The Structure of the System

Figure 1 (Structure of Service System for ETD) illustrates the structure of the full-text retrieval system for ETDs, which basically consists of four modules: conversion tools, full-text database, storage and retrieval engine (KRISTAL-II), and Web interface.

For a given query, users first receive the abstract information for all the retrieved pages such as title, author and the summary of each page. When the user wants to retrieve some part of the full-text of a document, only related pages are transmitted to the users. Since all the tables and figures are indexed as well, users can search tables and figures directly using keywords. The step-by-step retrieval process is as follows:

    1. The user first enters a query in natural language.

    2. The Web gateway converts the query to a new query for the KRISTAL-II retrieval engine. The retrieval engine searches the ETD database and retrieves relevant documents. It then sends the summary information for the pages within each ETD that contains the search term(s) to the client through the Web gateway.

    3. The user selects the pages for which he or she wants to see the full-text.

    4. The Web gateway finds the name of the DVI file, the page number and other location information for the page the user selected, then calls the DVI viewer.

    5. The DVI viewer calls the DVI server to fetch the corresponding page.

    6. The DVI viewer displays the corresponding page. It also provides various functions for the full-text document such as next page, previous page and specific page search.
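The six steps above can be sketched as two calls: a search that returns page summaries and a fetch that returns a single page. This is a toy illustration only; the in-memory dictionaries stand in for the ETD database and the DVI server, substring matching stands in for the KRISTAL-II engine, and all names and sample data are hypothetical.

```python
from typing import List, Tuple

# Hypothetical stand-in for the full-text database: (DVI file, page) -> page text.
PAGE_TEXT = {
    ("etd-001.dvi", 1): "full-text retrieval of Korean theses",
    ("etd-001.dvi", 2): "page-by-page transmission",
    ("etd-002.dvi", 1): "image database of older theses",
}
# Hypothetical stand-in for the DVI server: (DVI file, page) -> page bytes.
PAGE_DATA = {key: f"<DVI page {key[1]} of {key[0]}>".encode() for key in PAGE_TEXT}


def search_pages(query: str) -> List[Tuple[str, int, str]]:
    """Steps 1-2: return (file, page, summary) for pages containing every query term."""
    terms = query.lower().split()
    hits = []
    for (name, page), text in sorted(PAGE_TEXT.items()):
        # Simplification: plain substring matching instead of a real index lookup.
        if all(term in text.lower() for term in terms):
            hits.append((name, page, text[:40]))
    return hits


def fetch_page(name: str, page: int) -> bytes:
    """Steps 4-5: resolve the selected hit and fetch only that page."""
    return PAGE_DATA[(name, page)]


hits = search_pages("theses")          # steps 1-2: query and summaries
name, page, _summary = hits[0]         # step 3: the user picks a page
print(fetch_page(name, page))          # steps 4-6: fetch the page for display
```

Only the summaries travel to the client until a page is selected, and even then only the selected page is transmitted.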

Conclusion

The purpose of this paper was to describe how the Korea ETD system was developed and the issues relating to the design and implementation of a multilingual full-text database and the retrieval system for Korean ETDs. To develop a full-text service system, KORDIC developed various data conversion tools to transform documents generated by different word processors into a unified format and built a full-text database using its own retrieval engine and Korean indexer.

For full-text retrieval and fast search on the Internet, the system design focused on the page-by-page retrieval, direct access to the full-text database, automatic indexing with ranking ability and a link to the traditional image ETD database.

Since Korean ETDs consist of multilingual texts, there were language-related issues such as determination of a standard format, support for the Korean font, availability of a Korean indexer and compatibility between different formats. After considering two important factors for implementation of this multilingual full-text database system, namely file size and support of the Korean font, we decided that DVI was the best fit as the standard format for ETD files. There were also issues related to fast search and direct access to the full-text database. Generally, an average ETD file is too large to transmit efficiently on the Internet; therefore, the system was designed to process ETD files in page units and to deliver only the related pages. In addition, the full-text retrieval system provides ranking ability for high precision.

Our next step in this project is to conduct use and user studies in order to improve our ETD system. We will evaluate the system performance and analyze user behavior relating to this full-text retrieval system. 

Acknowledgment

The author gratefully acknowledges Dr. Yin Zhang from Kent State School of Library and Information Science for her time and effort to review and edit previous versions of this paper.

References

    Blair, D.C. & Maron, M.E. (1985). An evaluation of a full-text document-retrieval system. Communications of the ACM, 28, 289-299.

    Fox, E. (1999). Contribution by Edward A. Fox regarding Networked Digital Library of Theses and Dissertations (NDLTD) for UNESCO Meeting, September 27-28, Paris.

    Koh, K. (1998). Comparison Between DVI Format and PDF Format [On-line]. Available: http://www.texplus.com/texplus/comp5.html.

    Lee, J. (1997). Development of an Effective Storage System for Information Retrieval. Project Report to Korea Research and Development Information Center (KORDIC).

    Lee, K. (1998). Building Digital Library Infra-Structure and Its Database. Project Report to KORDIC.

    Moon, S. (1993). Enhancing performance of full-text retrieval systems using relevance feedback. Journal of the Korea Society for Information Management, 10(2), 43-67.

    Park, H. (1998). Implementation of an Effective Information Retrieval Environment. Project Report to KORDIC.

    Suh, Y. (1998). A Study on Manipulation of Complex Document via Internet. Project Report to KORDIC.

    Yoo, S. (1998). A Study on the Online Document Management for MS-Word Format in the Internet/Intranet Environment. Project Report to KORDIC.


Copyright 2001, American Society for Information Science and Technology