Features

Embracing Lossy: Sacrificing Metadata to Gain Agility

Editor’s Summary

Borrowed from data encoding and compression, the concept of lossiness captures our ability to extract meaning despite a reduced set of data. Metadata are lossy representations of complete data and may be the solution to information overload, enabling agile response to urgent information needs. But library professionals and data curators, eager to preserve original data, tend to favor their own customized descriptions of content and may be reluctant to compromise by using the minimalist Dublin Core metadata vocabulary. By favoring unique descriptions and resisting adoption of common bibliographic metadata, they lose access to valuable open access data sources. A better approach is to enable access through shared metadata, embracing lossy and making vast stores of rich information resources available. InfoSynth, a prototype system, helps small digital libraries adapt their custom metadata, represent metadata including Dublin Core elements as RDF/XML triples and return trimmed-down, normalized metadata-based descriptions in triple format. The use of a standard format facilitates sharing with other sources for enhanced information access.

by James Powell, Tamara M. McMahon and Linn Collins

Can you imagine an emergency response plan that includes “Contact the nearest university or public library for assistance with information needs”? No? This omission is unfortunate, because the knowledge housed within libraries, the “thought in cold storage”[1], would be helpful as communities grapple with disasters. But the effort required to “thaw” this information, that is, to locate, filter, collect and present it to those who could make use of it, is still significant. However, libraries have adopted standards that facilitate data sharing. These advances have made it possible to quickly build custom, focused collections of content. The problem is that few libraries leverage this new agility, even though it is needed now more than ever.

Lossy is a term used to describe a class of compression algorithms or heuristics used to reduce the size of digital files representing acoustic or electromagnetic phenomena. Some lossy algorithms take advantage of the fact that what a human being can actually perceive can be represented with a fraction of the information that would be required to fully represent the phenomena. For example, there are frequencies of sound that are masked by other sounds in studio recordings. Omitting the data that represents these overlapping frequencies has virtually no perceptible effect on what a human hears. In digital images, extremely minute variations in color between a pixel and those that surround it can be eliminated with minimal perceptible change to the image. But the resulting file can at times be substantially smaller than the original uncompressed image or sound file. Nonetheless, because some data has been lost, the compression strategy is referred to as lossy.

Anyone who uses the Internet is aware of the overwhelming amount of content it offers. We typically employ our own patchwork of strategies to cope with this problem. We select sites we trust or which provide us with useful information or entertainment options. We use search engines, tags and social contacts to filter. We don’t spend time on sites written in languages we can’t read. These strategies make the information overload more manageable, although we don’t necessarily see everything of interest to us.

Lossiness is not entirely foreign to librarians. After all, cataloging is inherently a lossy process, whereby the product of someone’s intellect is categorized, classified and reduced to a representative surrogate so that it can be located among all the other items in the collection. In some cases, text searching is slowly displacing these meticulously crafted surrogates, but they still have significant value. In fact, as we will show later in this article, specific slices of this data can be effectively explored using newer information retrieval techniques, yielding new insights.

All the information in the universe cannot be represented even by all of the particles that could possibly be used to describe it [2]; a finite entity can exceed its own capacity to describe itself. Approximate, lossy representations are therefore the only practical solution to many information overload challenges. Yet the natural tendency of a library is to behave like that hoarding uncle we all have or know of, the one whose home is stacked floor to ceiling with books, magazines, newspapers or vinyl records, leaving only a narrow passage for coming and going. And in a sense, the world needs this obsessive-compulsive tendency now more than ever, because a great deal of information that is born digital either disappears literally, as when a hard drive crashes on a server somewhere, or becomes essentially un-findable. What libraries have failed to do well, however, is to help users identify and maintain persistent, focused collections of content of interest to them. Such a capability would greatly improve the utility of a library to researchers – and even more so in an emergency.

Imagine if your library were asked to assist in responding to a terrorist attack involving a biological weapon. The 2008 World at Risk report of the U.S. Commission on the Prevention of WMD Proliferation and Terrorism identified a bio-attack as one of the most likely types of terrorist attacks that the United States and the world at large may soon face [3]. Given the pace of technology and the emergence of biohacking as a hobby, it is equally or perhaps even more likely that a bioterrorism attack will be based upon a novel or modified organism that we will not immediately be able to identify, for which no vaccine exists and whose characteristics might be derived from several different organisms. And such an attack may not be limited to humans – agriculture, pharmaceuticals and even biofuel production are also at risk of attack.

The outbreak of a virulent pathogen that attacks humans or agriculture is likely to be fought with information before it is conquered with vaccines. And libraries, with their vetted and high-quality content, are a better source of information than general-purpose web search engines, which typically lump in the good with the bad. Libraries devote more effort to exposing material that has intellectual heft, scientific value and/or historical significance. And standards such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a standard not well known outside of the digital library world, allow libraries to expose this information for aggregation, exploration and sharing [4].

In the development of the OAI standard, data providers (libraries) were offered an opportunity – and a challenge: large quantities of metadata could be made available, some in multiple formats, but in order to provide services that could be applied to the largest possible number of records, every record would have to be available as Dublin Core metadata, even if richer variants were offered by some repositories. In January 2001, members of the digital library community hammered out a workable standard for enabling interoperability among distributed metadata repositories. OAI-PMH defines a set of “verbs” (that is, commands) by which a client can find out about the contents of a remote repository. Clients can request a list of identifiers or sets contained within a repository, request records, request another batch of records (sites can limit the number of records returned per request and provide a resumption token that clients can use to retrieve the next set) and request the metadata types supported by a repository. A data provider may offer other XML-tagged variants of its metadata, but all are required to supply OAI Dublin Core metadata (oai_dc).
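To make the mechanics concrete, here is a minimal harvesting sketch in Python. It is an illustration of the protocol rather than production code: the repository URL is a placeholder, and a real harvester would also need to handle errors, deleted records and polite request pacing.

Python sketch:
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://example.org/oai"  # placeholder OAI-PMH data provider

def harvest(base_url, metadata_prefix="oai_dc"):
    # Issue ListRecords requests and follow resumption tokens until exhausted.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            yield record
        token = tree.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # A resumption request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for record in harvest(BASE_URL):
    header_id = record.find(OAI + "header/" + OAI + "identifier")
    print(header_id.text if header_id is not None else "(no identifier)")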

This new standard vaulted the modest Dublin Core metadata vocabulary into a crucial role in distributed digital libraries, and it introduced lossiness into metadata curation and exposure. But the more significant accomplishment of OAI has been to force libraries to think about how to share their collections. What it has failed to do, thus far, is generate momentum within libraries for supporting special digital collections for institutions or individuals.

Nine years after the debut of OAI-PMH, a large number of repositories expose metadata for harvesting; that is, they are acting as data providers. The idea behind OAI-PMH was that there would be sites, also known as service providers, that would collect, aggregate and expose that data in useful, and perhaps novel, ways (that is, make it searchable). Despite the maturity of this standard, the world is not yet awash in OAI service providers. And many services that were created have not persisted. We surveyed a popular OAI service registry site, the University of Illinois OAI-PMH Service Provider Registry (http://gita.grainger.uiuc.edu/registry/services/), and found that more than 45% of the registered service providers were no longer available [5]. Those that remained tended to aggregate content from a large number of sources, without regard to subject matter. Of those, only a handful featured advanced interfaces, which we define as interactive interfaces that support query, browse and personalization via Web 2.0 technologies.

We believe that one of the barriers to OAI digital library services is the library’s historical role as curator of data and materials. Libraries find it difficult to provide services built upon a small, heterogeneous subset of data, and yet, as we show, it is possible to build rich, highly useful services atop minimal subsets of bibliographic metadata, such as Dublin Core, the minimal metadata format required of all OAI data providers. Focused harvests, combined with title de-duplication, basic author name disambiguation and explicit representation of implied social authorship networks, as well as augmentation of the data using natural language and semantic web tools, can provide deep, rich, sophisticated services for exploring harvested data sets and personal digital libraries. But increasingly, libraries are electing to outsource metadata provisioning, providing users with access to tools like Web of Science or SCOPUS and maintaining a local catalog only for physical holdings. This trade-off may save money and staff, but it represents a lost opportunity for libraries.
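As a hedged illustration of how lightweight such processing can be, the following Python sketch de-duplicates records by normalized title and crudely normalizes author names for matching. The normalization rules are illustrative assumptions, not the algorithms InfoSynth actually uses.

Python sketch:
import re

def normalize_title(title):
    # Lowercase, strip punctuation and collapse whitespace so that trivially
    # different renderings of the same title compare equal.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", title.lower())).strip()

def normalize_author(name):
    # Reduce "Last, First M." or "First M. Last" to "last, f" for rough matching.
    name = name.strip()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        if not parts:
            return ""
        last, first = parts[-1], " ".join(parts[:-1])
    return last.lower() + ", " + (first[0].lower() if first else "")

def deduplicate(records):
    # records: a list of dicts with "title" and "creators" keys.
    merged = {}
    for record in records:
        key = normalize_title(record["title"])
        entry = merged.setdefault(key, {"titles": [], "creators": set()})
        entry["titles"].append(record["title"])
        entry["creators"].update(normalize_author(a) for a in record["creators"])
    return merged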

Lossiness, whether it occurs in mapping metadata from one XML standard to another or, as with photographs, when an image file is compressed, is generally viewed as a bad compromise in the digital library profession. Preservation of the original data, and even of the original software used to view it, is among the many concerns of data curators. And yet, as mentioned earlier, workable standards usually require compromise. Dublin Core is a significant compromise, at least when compared to standards such as MARC XML, which represents the culmination of decades of effort in describing a vast array of library content. Dublin Core appears almost conversational in its accessibility, readability and ease of creation, in comparison to MARC. A MARC record describing a single item may contain dozens of fields, each designated with a numeric field code (a subject is a 650 field, for example). Each numeric field can have alphabetic subfield indicators, and the record is prefaced with a header block containing still more metadata, resulting in something that looks more like internal microprocessor instructions than metadata. Indeed, MARC itself stands for MAchine-Readable Cataloging, so there is little doubt who, or what, its intended audience was.

It is important not to let frustrations with the MARC format trivialize the knowledge, skill and craftsmanship that go into the construction of a cataloging record. The effort expended to link to normalized author names and well-defined subject headings still yields value today. And even though the MARC format emphasizes physical description and discovery, it also encapsulates exhaustive granularity for describing an object so that it can be discovered meticulously or serendipitously. MARC is machine-readable, but the first machines to read it were simply mass-producing cards for card catalogs. Depending on when you peg the birth of the relational database, MARC preceded this technological leap by three or maybe even 15 years [6].

And speaking of pegs, MARC is clearly a square peg for the RDBMS’s round hole – its concept of structure more strongly resembles formatting instructions than database tables and columns. MARC was born when computing resources were extremely modest by today’s standards, so every bit counted. Many of its subtleties have to do with nuances of formatting printed cards, and it only makes a great screen display format if you want to replicate a 3 x 5 note card on a computer screen. Still, no serious contender has been able to dethrone MARC from its central role in library information systems, because none matches MARC in terms of granularity. But we suggest that granularity is often a hindrance rather than a help, and that tools enabling exploration of slices of the carefully crafted metadata applicable to online content are now more useful than exhaustive detail. In fact, as the volume of information increases, it is considerably more important to be able to explore larger amounts of data than to zero in on the specifics. The result set as a tool has been neglected for far too long, in favor of an emphasis on finding the one right answer (thank you, Google). Research bears little resemblance to, say, playing Jeopardy, because there is often no one right answer (sorry, Watson) [7].

Given this new standard and a philosophy that embraces lossiness, we decided to prototype a system that would facilitate rapid response to an information challenge in an emergency. Our system, called InfoSynth, supports small focused digital library collections for diverse users [8]. You can think of it as a sort of LibGuides for metadata. The system has a number of distinct components that work in concert to create and expose custom metadata collections. The topicHarvester component is designed to collect metadata from a local or remote repository, selected in response to a user query. The harvested data is normalized, regardless of its source or original format (though it is usually Dublin Core), to an RDF/XML (Resource Description Framework/Extensible Markup Language) representation. The RDF/XML expresses various entries in the metadata in the form of triples. A triple is, as the name implies, a three-part statement that represents some piece of information [9]. We load these triples into an RDF repository, also known as a triplestore. A triplestore is much like a relational database, but designed to store and retrieve subject-predicate-object statements. Despite some minor variations in the structure of the XML returned by these repositories, we found it quite simple to map Dublin Core elements into triples. Here is an example, first of a raw OAI-harvested record and then of some triples representing a semantic mapping of that data, expressed in Notation 3 (N3), “a shorthand non-XML serialization of Resource Description Framework models, designed with human-readability in mind” [10]:

Raw OAI-harvested record (oai_dc):
<record>
<header>
<identifier>oai:casi.ntrs.nasa.gov:20010044998</identifier>
<datestamp>2004-09-16</datestamp>
<setSpec>casi</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:rights>No Copyright</dc:rights>
<dc:subject>91</dc:subject>
<dc:title>Synthesis and Characterization of Nanocomposite Analogs of Interstellar Amorphous Silicates</dc:title>
<dc:date>20010101</dc:date>
<dc:date>2001</dc:date>
<dc:description>Synthetic analog materials consisting of nanophase Fe metal in a silica glass matrix have been prepared. The infrared properties of the analog materials show remarkable similarities to amorphous interstellar silicates. Additional information is contained in the original extended abstract.</dc:description>
<dc:creator>Al-Badri, Z.</dc:creator>
<dc:creator>Keller, L. P.</dc:creator>
<dc:creator>Grier, D. G.</dc:creator>
<dc:creator>McCarthy, G. J.</dc:creator>
<dc:creator>Chauhan, B. P. S.</dc:creator>
<dc:creator>Boudjouk, P.</dc:creator>
<dc:identifier>Document ID: 20010044998</dc:identifier>
<dc:type>Lunar and Planetary Science XXXII, LPI-Contrib-1080</dc:type>
</oai_dc:dc>
</metadata>
</record>

Triples in N3 notation:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<oai:casi.ntrs.nasa.gov:20010044998> dc:title "Synthesis and Characterization of Nanocomposite Analogs of Interstellar Amorphous Silicates" .
<oai:casi.ntrs.nasa.gov:20010044998> dc:creator "Al-Badri, Z." .
<oai:casi.ntrs.nasa.gov:20010044998> dc:date "20010101" .

In plain English, the above three triples state that an object identified by “oai:casi.ntrs.nasa.gov:20010044998” has the title “Synthesis and Characterization of Nanocomposite Analogs of Interstellar Amorphous Silicates,” has an author “Al-Badri, Z.” and has a publication date of “20010101.” Some fields, such as date, vary so much from one repository to another that we can do little more than display them to end users. Sorting by date within a collection of records from a single repository may still be possible if the date content is uniformly structured. Such consistency is not always the case, so date becomes an informational field for end-user consumption. However, fields such as subjects, descriptions and creators lend themselves to interesting mappings and/or use with augmentation services.
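The mapping itself need only be a few lines of code. The sketch below, which assumes a record shaped like the oai_dc example above, extracts selected Dublin Core elements and emits N3-style triples; the element selection and escaping are illustrative rather than InfoSynth’s actual implementation.

Python sketch:
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def record_to_n3(record, elements=("title", "creator", "date", "subject")):
    # record: an ElementTree <record> element from an OAI-PMH ListRecords response.
    identifier = record.find(OAI + "header/" + OAI + "identifier").text
    lines = ["@prefix dc: <http://purl.org/dc/elements/1.1/> ."]
    for name in elements:
        for element in record.iter(DC + name):
            if element.text:
                literal = element.text.strip().replace('"', '\\"')
                lines.append('<%s> dc:%s "%s" .' % (identifier, name, literal))
    return "\n".join(lines)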

Another component of InfoSynth is the collections and services registry. The registry is a set of triples describing each focused collection and all the possible services that could be used to explore the data. Collection metadata includes enough information to identify the source of the contents of a collection along with specific query or filter terms used to select records. A collection can contain data from multiple queries and/or repositories.

The normalized data lends itself to exploration in many different ways. Our prototype system emphasizes graph-based tools. Any triple is also a graph: the subject and object are nodes, and the predicate is an edge that connects them. A bibliographic record thus becomes a multidimensional graph, in which authors, titles and other metadata values are nodes, each linked by an edge back to the identifier for that record [11]. An InfoSynth user can perform a title and/or abstract query against a focused collection, which results in a subset of records that matches his or her search. This result set can be explored in various ways. Users can view a textual list of results and browse the metadata, explore a co-authorship network for the items that matched their search, see the result sets overlaid onto maps or even retrieve raw data sets that they can process or integrate with other data. We also leverage external web services to augment the content when appropriate. For example, we send the abstracts of journal articles to OpenCalais, which returns a rich set of RDF triples describing various aspects of the content, such as georeference data for the places mentioned in titles, subjects or abstracts.

Figure 1. At left, a subject-title graph with papers linked by subject heading nodes. At right, a co-authorship graph of a slice of authorship data from a result set returned for a query.
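A co-authorship graph like the one at right in Figure 1 can be derived directly from the creator triples in a result set. The sketch below is a simplified stand-in for that step: it assumes triples arrive as plain (subject, predicate, object) tuples and counts how many records each pair of authors shares.

Python sketch:
from collections import defaultdict
from itertools import combinations

def coauthorship_edges(triples):
    # Group author names by record identifier, then link every pair of
    # co-authors on the same record; edge weights count shared records.
    authors_by_record = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "dc:creator":
            authors_by_record[subject].add(obj)
    edges = defaultdict(int)
    for authors in authors_by_record.values():
        for a, b in combinations(sorted(authors), 2):
            edges[(a, b)] += 1
    return edges

triples = [
    ("oai:casi.ntrs.nasa.gov:20010044998", "dc:creator", "Al-Badri, Z."),
    ("oai:casi.ntrs.nasa.gov:20010044998", "dc:creator", "Keller, L. P."),
    ("oai:casi.ntrs.nasa.gov:20010044998", "dc:creator", "Grier, D. G."),
]
print(coauthorship_edges(triples))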

At an architectural level, the tools expose both a human-friendly rendering layer and a web-services middleware layer that offers search results in formats compatible with other visualization or rendering tools. For example, in the context of our digital library application, when a user elects to view the results of a query overlaid onto a map, we present the data integrated with Google Maps, but we also provide a mechanism for requesting the result set as KML, which is compatible with Google Earth and other geospatial applications.

Figure 2. InfoSynth screen shot showing georeferenced papers about the Fukushima Daiichi Power Station in Japan.
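The KML export mentioned above requires very little code once results carry georeference data. This sketch assumes a simple list of results with title, latitude and longitude fields, standing in for whatever the middleware layer actually returns.

Python sketch:
import xml.etree.ElementTree as ET

def results_to_kml(results):
    # Build a minimal KML document that Google Earth and other geospatial
    # applications can open; each result becomes a Placemark with a Point.
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    document = ET.SubElement(kml, "Document")
    for item in results:
        placemark = ET.SubElement(document, "Placemark")
        ET.SubElement(placemark, "name").text = item["title"]
        point = ET.SubElement(placemark, "Point")
        # KML coordinates are written as longitude,latitude (in that order).
        ET.SubElement(point, "coordinates").text = "%s,%s" % (item["lon"], item["lat"])
    return ET.tostring(kml, encoding="unicode")

print(results_to_kml([{"title": "Example georeferenced paper", "lat": 37.42, "lon": 141.03}]))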

The simplicity of the normalized data means that it is easy to incorporate metadata from other sources into our digital library. For example, when the World Health Organization (WHO) feared that swine flu might become a pandemic flu strain, we created a focused collection about what was known about H1N1 by aggregating information from PubMed with RSS feeds from WHO and records returned in response to a query into our own local bibliographic metadata collection containing some 95,000,000 MARC-XML records. Content from searches against Apache Solr or RSS newsfeeds lends itself to a similar process of automated mapping and augmentation. And since the process can be automated, we believe that it could be developed into a user self-service system, in which users select content from the net and automatically generate their own semantic digital library repositories.
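Mapping an RSS feed into the same triple form follows a similar pattern. The sketch below shows how such a mapping might look, not the pipeline we used for the H1N1 collection, and the feed URL is a placeholder.

Python sketch:
import urllib.request
import xml.etree.ElementTree as ET

def rss_to_triples(feed_url):
    # Parse an RSS 2.0 feed and emit (subject, predicate, object) tuples,
    # using each item's link as its identifier, mirroring the bibliographic triples.
    with urllib.request.urlopen(feed_url) as response:
        tree = ET.parse(response)
    triples = []
    for item in tree.iter("item"):
        link = item.findtext("link") or item.findtext("guid")
        if not link:
            continue
        for tag, predicate in (("title", "dc:title"),
                               ("description", "dc:description"),
                               ("pubDate", "dc:date")):
            value = item.findtext(tag)
            if value:
                triples.append((link, predicate, value.strip()))
    return triples

triples = rss_to_triples("http://example.org/feed.rss")  # placeholder feed URL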

The system’s tolerance for lossiness in the mapped data makes it possible to aggregate and explore content from a variety of sources. Our RDF digital library solution uses a variety of technologies to preserve and expose as much data as is available for a collection and to allow users to explore that data through a suite of tools built atop web services and semantic web repositories. At the same time, its reliance on normalized semantic representations of the original content enables new methods of exploring and understanding that content. Automatic inferencing can expose relationships within larger collections of data than a human would be able to review. The digital library becomes a toolkit from which users can select the best tool for their information needs. Since the tools can accommodate lossiness in the data, a much greater variety of data can be explored, albeit with varying degrees of utility depending on the richness of the original data source. But then it’s about time the digital library stepped out into the real world!

And so we return to our information-centric disaster scenario – a biological disaster threatening humanity. The past decade has offered ample evidence that mankind will soon face just such a crisis. In 2001, researchers in Australia accidentally created a strain of mousepox virus that exhibited 100% mortality in the lab, even among mice previously vaccinated against the pathogen; the strain was subsequently dubbed “more virulent than nature’s worst” [12]. The researchers reported that the same modifications could be applied to the human smallpox virus. If this organism had escaped from the lab, scientists would have been scrambling to determine what it was and how it might be stopped. This is exactly what happened in 2003 when SARS emerged. In the early weeks of the outbreak, when scientists had no idea what the organism was, they formed ad hoc journal clubs to identify, review and share relevant literature as they sorted out the mystery of this newly emerged pathogen [8]. Libraries provide this literature and can facilitate this process of knowledge discovery and dissemination, but only if they focus more effort on agile and focused responses to information-seeking challenges.

Resources Mentioned in the Article
[1] Hull, D., Pettifer, S.R., & Kell, D.B. (October 31, 2008). Defrosting the digital library: Bibliographic tools for the next generation web. PLoS Computational Biology, 4(10), e1000204, 1-14.

[2] Boucher, N.J. (December 17, 2003). Simulation of the universe. Retrieved March 17, 2011, from www.seti.org.au/spacecom/SimUniverseV1.htm.

[3] Ackerman, G. (2009). World at risk: The report of the Commission on the Prevention of WMD Proliferation and Terrorism. Journal of Homeland Security and Emergency Management, 6(1), 41.

[4] Van de Sompel, H. & Lagoze, C. (2000). The Santa Fe Convention of the Open Archives Initiative. D-Lib Magazine. Retrieved March 17, 2011, from www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html.

[5] Powell, J.E. (2009). Unpublished survey.

[6] Thomale, J. (2010, September 21). Interpreting MARC: Where’s the bibliographic data? Code{4}Lib Journal, 11. Retrieved March 24, 2011, from http://journal.code4lib.org/articles/3832

[7] Jones, N. (2011, February 15). Jeopardy-playing Watson computer system could revolutionize research. Scientific American. Retrieved March 24, 2011, from www.scientificamerican.com/article.cfm?id=jeopardy-playing-watson

[8] Powell, J. E., Collins, L.M., & Martinez, M.L.B. (2009). Using architectures for semantic interoperability to create journal clubs for emergency response. Proceedings of the 6th International ISCRAM Conference, Gothenburg, Sweden. Retrieved March 24, 2011, from www.iscram.org/ISCRAM2009/papers/Contributions/155_Using%20Architectures%20for%20Semantiv%20Interoperability_Powell2009.pdf

[9] Rodriguez, M.A. (2009, September). A reflection on the structure and process of the Web of Data. Bulletin of the American Society for Information Science and Technology, 35(6), 38-43. Retrieved March 23, 2011, from www.asis.org/Bulletin/Aug-09/AugSep09_Rodriguez.html

[10] Palmer, S.B. A rough guide to Notation3. Infomesh.net. Retrieved March 24, 2011, from http://infomesh.net/2002/notation3/

[11] Powell, J.E., Alcazar, D.A., Hopkins, M., Olendorf, R., McMahon, T.M., Wu, A., & Collins, L. (2011). Graphs in libraries: A primer. ITAL: Information Technology and Libraries (in press). Preprint retrieved March 24, 2011, from www.ala.org/ala/mgrps/divs/lita/ital/prepub/powell.pdf.

[12] Nowak, R. (2001, January 10). Killer virus. New Scientist. Retrieved March 23, 2011, from www.newscientist.com/article/dn311-killer-virus.html


The authors are members of the knowledge systems team at Los Alamos National Laboratory Research Library. James Powell can be reached at jepowell<at>lanl.gov; Tamara M. McMahon at tmcmahon<at>lanl.gov; and Linn Collins at linn<at>lanl.gov.