Please tell us what you think of this issue!  Feedback

Bulletin, October/November 2009


Integration of Taxonomy and Keyword Searches: A Comparison of Two Implementations 


by Ronald P. Millett 

Ronald P. Millett is chief scientist at Perfect Search Corporation, 1145 South 800 East, Suite 325, Orem, Utah. He can be reached by email at ron.millett<at>perfectsearchcorp.com, or through the company’s website at www.PerfectSearchCorp.com

Subject taxonomies have a long history of usefulness in indexing libraries of documents or records. One of the weaknesses of this approach is the difficulty of traversing the taxonomy hierarchy to find the subject term. Full text indexing of keywords enables automatic discovery of records containing terms ordered by relevance including phrase, title text, word proximity, emphasized text, linguistic and other criteria. Major weaknesses of the keyword approach are the lack of the ability to use controlled subject terms and result hits that are false-positive. This paper compares two implementations that combine taxonomy and keyword approaches in an attempt to address these weaknesses for efficiency, ease of use, recall, precision and hit relevance.

NICEM Database Search Implementations
These two search implementations combine taxonomy and keyword searches: 

  1. A system based on Access Innovation's Search Harmony product, using their subject taxonomy and user interface with the Perfect Search search engine
     
  2. A system that is an earlier Access Innovation’s implementation using the Lucene open-source search engine with the same subject taxonomy. 

The comparison search data set for both systems uses about 500,000 records from summaries of multimedia videos or presentations in the National Information Center for Educational Media (NICEM) database stored as separate XML files. The subject taxonomy terms were specified with a XML subject category tag. Other important tags include title, abstract and publication date. Table 1 summarizes the sample database and taxonomy information. 
 

Records (in separate XML files) 480,000
Average record size 0.96 k
Database size 468 MB
Number of subject taxonomy terms About 6,000
Average subject terms per record 3.0
Table 1. NICEM test database

The database was indexed by both systems on similar 64 bit Windows servers, with from 3 to 16 GB of RAM. The indexing process was single threaded even though there were several cores available. Table 2 summarizes indexing and retrieval information for each system. The Lucene implementation used the Java version.
 

  Lucerne Perfect Search
Indexing Time 161 minutes 13 minutes
Index Size 168 MB 440 MB (main index) + 192 MB search accelerators
Threading Single Single
Typical Engine Retrieval Time Not Available 1,200-7,000 queries per second
Table 2. Lucene and Perfect Search Indexing and Retrieval Information

Retrieval Speed vs. Indexing Cost
One of the apparent anomalies of the Perfect Search system is that the typical inverse relationship between retrieval speed and indexing cost is missing. Faster retreival speed usually implies that a slow and expensive indexing process needs to be executed. However, Perfect Search's approach uses an indexing pipeline that creates temporary files and utilizes indexing pattern speedups at every point of the indexing process to improve the indexing functionality. Note that even though the Perfect Search index and accelerators used in retrieval are much bigger than Lucene's index, these files were generated much faster than the Lucene index. This indexing and retrieval architecture results in both impressive indexing speeds and retrieval speeds for the same system. 

Once indexing speeds are a small multiple of file copying speeds, as Perfect Search's data for this test illustrates, the whole character of creating and using indexes can change. Formerly expensive re-index operations are no longer a problem, if needed. Creation of incremental index files can become almost instantaneous and merging together many small index files can also be an easy part of making a very responsive indexing process to support the search of dynamic data. 

Document caching and acceleration files are also generated and included in the numbers in the table for the Perfect Search-based system. The 192 MB acceleration files were generated for the test system in under a minute. On a large system such as the installation for WorldVitalRecords.com, the 149 GB of indexes require acceleration files of only 13 GB (9% of index). 440 MB is not considered a large index for a Perfect Search-based system and the acceleration files are a much bigger percentage of the total index size (192 MB, 43%) than they would be for a larger search system.

These acceleration files play a key role in enabling very fast retrieval speeds for the system. Even when a single query can actually trigger as many as 20 sub-searches for different relevance rankings, a typical retrieval speed of 1,200 to 7,000 queries per second is truly impressive for each complete search. These speeds are for a system where both accelerators and index files have been cached.

Autocompletion of Subject Taxonomy Terms 
Using precategorized subject terms, searches can avoid irrelevant hits and find records or documents that directly pertain to the subject category selected. An important weakness of taxonomies is the difficulty of finding the descriptor that the user has in mind if it has to be traversed from the top of a hierarchy. Both test systems use the Access Innovations autocompletion module that uses a "key word in context" (KWIC) [3] list and term matching to allow the user to find any term that he has in mind that might be part of a subject description. Figure 1 shows a sample screen shot of this powerful taxonomy autocompletion window. 

Figure 1
Figure 1. Taxonomy Autocompletion Window after user input of "op"

The taxonomy tree structure is still very important. It shows subject descriptors in parent/child/sibling set relationships with other descriptors. A disadvantage of this approach can result when a pure binary tree or single top node results in too many operations to find the needed term like the "20 questions" game. Access Innovations uses a forest of taxonomies approach with typically non-binary child divisions that allows for rapid traversal of the taxonomy structure. Figure 2 shows a screen shot of this taxonomy directory available on the Search Harmony system.

The combination of both top down and bottom up approaches to locating taxonomy terms provides the user with unprecedented flexibility and power to find that carefully crafted subject category for his search.

Figure 2
Figure 2. Taxonomy Directory Window in Search Harmony NICEM search system 

Precision and Recall Comparisons
Except for one characteristic of the Lucene-based system that lowered precision to 99%, the two systems both receive 100% precision and recall values for the taxonomy subject descriptors. When the Perfect Search stemming module and its effects are added, the recall of the Lucene-based system went down. The Perfect Search algorithmic stemmer was too broad in some cases, which reduced precision for a few specific terms.

Table 3 compares the Lucene and Perfect Search hits for the text "communications" in descriptor and keyword modes. This search was selected because it shows some interesting precision and recall differences.
 

  Lucene Perfect Search
Taxonomy descriptor hits 2,760 2,738
Hits from "communications" sub-terms
"communications industry" 
"satellite communications"
22 0
Recall % for exact descriptor hits 100% 100%
Precision % for exact descriptor hits 99% 100%
     
Keyword hits  3,678 23,960
Keyword hits with correctly stemmed terms 0 13,018
Keyword hits with stemming errors 0 7,264
Precision for exact keyword hits 100% 100%
Precision including stemmed terms N/A 70%
Recall for keyword hits 100% 100%
Recall including correctly stemmed terms 22% 100%
Table 3. Hit analysis for "communications" subject descriptor and keyword searches

Note that because the Lucene implementation can get a hit for "communications" as a sub-term of subject taxonomy descriptors such as "communications industry" and "satellite communications," Lucene has a higher retrieval count on hits for this descriptor entry than Perfect Search. Because the actual descriptor "communications" was not in some hits, the precision value for this search for the Lucene implementation is 99%. 

Note also that the Perfect Search system uses a stemming subsystem to include terms like "communication," "communicate" and "communicating" as lower relevance hits for a keyword search for "communications." The current implementation uses an algorithmic stemming module that also included hits for "community," "communities," "communist" and "communism" as possible word forms for "communications." This algorithmic stemmer created false positive lower relevance hits. A dictionary-based stemming module being implemented for the next version of Perfect Search will avoid this problem and not include the extraneous forms.

When correctly stemmed terms are included for this keyword search, Lucene then has its recall percentage for "communications" go down to 22% while Perfect Search maintains a 100% recall percentage. When the false positive stems are counted, Perfect Search has its precision percentage go down to 70%. Lucene does have a similar algorithmic stemming module available, but it was not used in this implementation [4, pp. 282-284]. 

Relevance Algorithms
The descriptors receive a relevance score of 100 for the Search Harmony Perfect Search-based system and are ordered according to the order they were indexed. Keyword searches can have some false positive low relevance hits due to the stemming algorithm currently in use as previously mentioned. Because of the ambiguity inherent in all languages, the use of a word in a search may not match the query's intended meaning. We did not evaluate that factor in our search precision numbers. Having exact phrase and near proximity in compact fields like the title helps to have more relevant hits and avoid false positive hits. 

Combinations of the title, abstract and other fields plus term proximity (exact phrase, near content word proximity) and term linguistic stemming provide relevance scoring of hits from scores of 100 to 50. These orderings and values can be customized according to the users' preferences using XML configuration file information. 

The Lucene-based system also has calculated relevance for keyword searches. Boosts are available for low frequency terms, special fields and normalized term weights [4, pp.78-81] [5].

Mixing and Matching Taxonomy and Keyword Terms
The Lucene-based search system has two modes of searching: preferred terms (taxonomy subject categories) and keywords. The two modes cannot be mixed in the same query. The preferred terms are matched using the autocompletion typing window.

The Perfect Search-based search system allows the user to mix and match preferred taxonomy descriptors and keywords and use a similar syntax to Lucene. Each subject taxonomy descriptor is indexed as a single complex term that is not divisible (e.g., SC:"children's literature"). New searches and refinement of existing searches can select from the taxonomy terms or the keyword terms. 

Figures 3 and 4 illustrate multiple search terms using taxonomy descriptors and keywords in the same search using the Search Harmony Perfect Search-based system. Tables 4 and 5 show the hit frequency for each search term and each stage of the search intersection process. 

If, for example, a single term retrieves 1000 hits, the user cannot and will not in normal practice look through the hits to find the right hit. Typically, unless the hit is in the first few pages of results, it might as well be invisible as far as the user is concerned. Precision and recall numbers for large sets might as well not exist when the set cannot reasonably be examined. The power of having both carefully defined descriptors and full text terms in the same query results in a smaller search result set that can easily be viewed and examined. 

In Table 4 and Figure 3 we see a search where, even after two subject-descriptor terms were entered, there were still over 1,000 hits in the results list. The keywords "lion" and "witch" also occurred in many database entries, but after the descriptors filtered the set down to 1,078, the two terms resulted in a short list that contained two copies of an audio book of the C. S. Lewis’ work The Lion, The Witch and the Wardrobe

Table 5 and Figure 4 illustrate another search where a single category descriptor "business planning" was combined with keywords "computers" and "forecasting."
 

Search Term Frequency Cumulative Hit Count
Subject descriptor: children's literature 16,488 16,488
Subject descriptor: fiction 7,954 1,078
Keyword: lion 998 6
Keyword: witch 550 2
Table 4. Search for two subject descriptors and two keywords illustrating mix and match capability of the Search Harmony system

Figure 3
Figure 3. Screen shot of a Search Harmony search query with both taxonomy and keyword terms. Descriptors "children's literature" and "fiction" were intersected with keywords "lion" and "witch" to retrieve two hits. 
 

Search Term Frequency Cumulative Hit Count
Subject descriptor: business planning 516 516
Keyword: computers 12,638 37
Keyword: forecasting 384 2
Table 5. Search for multiple descriptor and keyword entries giving a small results set 

Figure 4
Figure 4. Screen shot for Search Harmony interface combining taxonomy and keyword terms 

More Relevant Results through Tapping Multiple Dimensions of Search
The key to success in a search system is to help users quickly find the data they are looking for. The ambiguity of language, syntactic and UI details and the amount of data that a search system has to winnow out to find key results for the users are formidible obstacles for any system. The advantages of including taxonomy-based subject descriptor searches and keyword searches in the same system were emphasized by searches on both test systems. The lack of being able to mix and match descriptor and keyword searches in the Lucene implementation made it much more difficult to retrieve a result list that can easily be examined by the user.

The Search Harmony system combines several search methodologies in one unified system, including mixing and matching descriptor and keyword search terms. These methodologies include navigational trees, subject descriptor autocompletion, search within results and relevance-based full text searches including fuzzy search features such as stemming. Combining all of these features in a single interface, each as it were in a different search dimension, provides an important laboratory for improving search results and finding that "Voila!" best result. 

Resources Mentioned in the Article
[1] Welty, C.A. (1998). The ontological nature of subject taxonomies. In N. Guarino (Ed.), Formal ontology in information systems (pp. 317-327). Fairfax, VA: IOS Press. Portions of this paper are available on Google Books. For more recent conferences go to www.formalontology.org.
 
[2] Bast, H., & Weber, I. (August 10, 2006). When you're lost for words: Faceted search with autocompletion. In A. Z. Broder & Y. S. Maarek (Eds.), Proceedings of the SIGIR 2006 Workshop on Faceted Search (pp. 31-25). Retrieved August 21, 2009, from www.mpi-inf.mpg.de/~bast/papers/autocompletion-faceted.pdf.
 
[3] Keyword in context. Wikipedia.org. Retrieved August 21, 2009, from http://en.wikipedia.org/wiki/Key_Word_in_Context
 
[4] Gospodnetić, O., & Hatcher, H. (2005). Lucene in action. Greenwich, CT: Manning Publications. 
 
[5] Kumar, J. (July 8, 2006). Document scoring/calculating relevance in lucene [blog entry]. Retrieved August 24, 2009, from http://jayant7k.blogspot.com/2006/07/document-scoringcalculating-relevance_08.html.