Building a community of those interested in data management was a key focus of ASIS&T’s March 2011 Research Data Access and Preservation (RDAP) Summit, starting with the kickoff session by Gary Marchionini. The keynote address emphasized the role of institutional repositories and digital library services in handling research data, offering data management considerations used by the National Science Foundation including the data’s value, storage, lifecycle, sharing and reuse. Through several case studies, speakers explored challenges for data publication repositories and federal agencies; described advantages of cross-discipline advisory groups; addressed the need for data replication and security; and promoted metadata use, monitoring of data reuse and strong policy and governance. Open discussion revealed management to be the most challenging issue on a personal level while infrastructure was viewed as the major concern for the field. Among the conclusions of a closing panel on the future of digital libraries was recognition of digital libraries’ key role in data management and archiving and the need for training for specialists in data management.

digital object preservation
information resources management
digital repositories
scientific and technical information
librarians
scientists
meetings

Bulletin, June/July 2011


Research Data Access and Preservation 2 Summit

RDAP2:  Session by Session 

by Joseph A. Hourclé

Editor’s Note:
This is a significantly reduced summary of the RDAP Summit; all presentations have been posted to the ASIS&T slideshare at www.slideshare.net/asist_org/

Kick Off
Gary Marchionini opened the Research Data Access and Preservation (RDAP) Summit with lessons learned from last year, reminding people to submit comments on ways to improve the Summit and to use #RDAP11 on Twitter. He discussed the goal of building community and how issues of data management have received more visibility recently [1].

Keynote
Clifford Lynch from the Coalition for Networked Information gave the keynote address, in which he discussed how the library community and institutional repositories (IRs) fit into the management of research data. Because IRs focus on long-term preservation of deposited objects, they are useful to scientists looking for places to store and distribute their data, particularly for smaller data collections that are frequently forgotten as people deal with “big data.”

Lynch presented five aspects of the National Science Foundation (NSF) data management plan requirements:

  1. What data is valuable?
     
  2. How is it stored?
     
  3. What is the lifecycle?
     
  4. How is it shared?
     
  5. How can it be re-used?

He pointed out some of the flaws of IRs for data storage, such as the need for specialized scientific metadata, the uncertainty of the cost of storage and the occasional need for complex authorization rules. But he also suggested solutions such as separating the scientific data cataloging from the IR and considering some parts as services that could be contracted out or done through consortia or endowments. Lynch called out the need to focus on the older, smaller data sets that are more at risk without getting distracted by the more complex and difficult collections that are not a good fit for traditional IRs.

The audience raised some significant issues, such as how to measure the importance of data when its value may change over time [2]. Lynch responded that the issues are economic, technical and ethical: weighing the cost of storage against the cost of re-running an experiment breaks down when future experimental procedures might make past trials obsolete, and one also has to weigh the cost of re-running an experiment that might put lives at risk. Also discussed were how IRs fit into scientific workflows, issues of data encumbered by regulatory and other restrictions and issues of cost recovery by discipline repositories that are ineligible for NSF funding.

Institutional Repository Case Studies
The first session featured speakers Jonas Dupuich from Berkeley Electronic Press, Katherine Kott from the Stanford Digital Repository and Terry Reese from Oregon State University, who presented on the current use of IRs to expose research data.

Dupuich discussed three common approaches, all of which store metadata about the research data, but then either offer (1) a link to the data, (2) a guide to obtaining and using the data or (3) the data itself. The second approach has a number of advantages, as the textual nature of the documentation helps make the object findable by search engines.

Kott demonstrated how Stanford used a re-writing of Ranganathan's five laws to evaluate their institutional repository and how that evaluation drove their latest implementation. She also provided details on services they currently perform. 

Reese showed the benefits of IRs, making objects easier to access by other researchers while providing download metrics to the depositor. He also discussed problems that research data causes for IRs and how modifications to DSpace's storage provided for better performance and alignment with their storage needs. Reese relayed that researchers are interested in metadata training and help with other aspects of curation even if they do not want the library to take over the storage of their data.

Highlights of Audience Discussion on the Institutional Repository Case Studies

● The inability to assign structured metadata at the collection level with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which would make it easier for groups building scientific catalogs to discover which collections they might wish to harvest (see the harvesting sketch after this list).

● Questions about the types of information that went into user guides for the data.

● Limitations of using the same license for both data and publications, the need for varying embargo lengths and the inclusion of policies on attribution, all of which were covered again during Mackenzie Smith's talk. 

● Many fields don't consider themselves to have “data,” but they have many “files” such as documents or images that serve the same function and could be described as the “data” in their research.

● Faculty in different fields have different attitudes about being connected to their advisees' theses.
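
On the OAI-PMH point above, the following is a minimal sketch of how a catalog builder might list a repository's sets (collections) with the standard ListSets verb. The endpoint URL is hypothetical, and setSpec/setName are typically the only collection-level description available, which is the limitation raised in the discussion.

# Minimal OAI-PMH ListSets request using only the Python standard library.
# The endpoint URL is a placeholder; any OAI-PMH-compliant repository works.
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://repository.example.edu/oai"  # hypothetical endpoint

def list_sets(endpoint=ENDPOINT):
    """Yield (setSpec, setName) pairs describing the repository's collections."""
    with urllib.request.urlopen(endpoint + "?verb=ListSets") as resp:
        tree = ET.parse(resp)
    for s in tree.iter(OAI_NS + "set"):
        yield s.findtext(OAI_NS + "setSpec"), s.findtext(OAI_NS + "setName")

if __name__ == "__main__":
    for spec, name in list_sets():
        # A catalog builder would inspect these names to decide which
        # collections to harvest with ?verb=ListRecords&set=<setSpec>.
        print(spec, "-", name)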

NSF Data Management Plan Case Studies
In this session, Eric Chen from Cornell, Andrew Sallans from the University of Virginia and Mackenzie Smith from MIT discussed cross-discipline groups that have formed at their institutions to advise researchers on the NSF data management plan (DMP) requirements and the new issues that surfaced. 

Chen described Cornell's effort to determine the needs of their research community. Data from their surveys shows that most people suspect they will need help with DMPs. He described their solution of a concierge service to receive questions from researchers and direct them to different people from libraries, research computing or a specific research department.

Sallans gave an overview of an effort at the University of Virginia with an even broader collaboration (including an attorney to look into policy and ownership issues) and their template system for NSF DMPs, based on the UK Joint Information Systems Committee (JISC) DMP online and developed in collaboration with a number of other groups.

Smith discussed the importance of policies and pointed out that two of the five NSF guidelines relate to policy: (1) access and sharing and (2) re-use and derived works. She addressed some of the technical, social and legal issues, including confusion over who “owns” the data, applicability of foreign laws in international collaborations, issues with copyright of data in the United States, enforceability of “data use licenses” and some of the problems with using some of the Creative Commons licenses with data. She mentioned the need for further work on issues concerning licenses, attribution, persistent identifiers, provenance, metadata and registries. 

Highlights of Audience Discussion of the NSF DMP Case Studies

● How should we deal with people coming for help at the last minute? Make sure they understand the five basic areas of the NSF DMP requirements and that they are educated about where to go for help the next time.

● How do we make contact with researchers earlier in the process? Do we look into either tying into existing Institutional Review Boards or setting up something similar? 

● How do we fund these efforts? The general consensus was to treat it as overhead, although there may be specific cost-recovery models for offering storage as a service.

Data Publication Repositories
The third panel session featured Ruth Duerr from the National Snow and Ice Data Center (NSIDC), Phil Bourne from Protein Data Bank and Steve Hughes from the California Institute of Technology, who made presentations on their science data efforts and shared various recommendations and insights from their decades of experience in this field.

Duerr talked about how NSIDC provides standard ways to cite data as if it were any other publication, but noted that scientific journals do not require that citation in the published research. She discussed efforts from Smithsonian Astrophysical Observatory (SAO)/NASA Astrophysics Data System (ADS) and arXiv to link articles to the data and how the lack of metadata affects the ability to find, select, obtain, understand and use the data. She reminded us that data files are not like a book, but are more like a page or even a sentence from a book: a single data file might be meaningless without the associated context. She also reminded the audience that there should be metadata assigned at both the file and collection levels and that the appropriate level of records returned depends on the context of the request. 

Bourne described Protein Data Bank (PDB) as a community effort and how the scientists successfully lobbied their discipline's journals not to accept papers without corresponding data in the PDB. He mentioned that the issues were much harder and took longer than they had originally thought and that political issues took more time to solve than technical issues. Bourne remarked how the PDB's role has shifted from being an archive to including both analysis and education. They seek to integrate more tightly with analysis tools to allow someone to explore the data in much richer ways than would be available from just reading about it in a journal article. 

Hughes discussed Object Oriented Data Technology (OODT), a flexible repository architecture that he helped build for the Planetary Data System (PDS), but which is now also used by the National Institutes of Health’s (NIH) Early Detection Research Network and managed by the Apache Software Foundation. He discussed the trends in eScience towards highly distributed, loosely coupled, federated systems with complex modeling and the need to support both varied data analysis and decision support tools. Hughes also talked about how OODT's modular design allows different groups to swap out the data model, security, discovery tools or other components to support their specific needs.

Highlights from Audience Discussion of Data Publication Repositories

● What metrics were used to evaluate these systems, and what factors contributed to their success?

● One of the significant contributing factors in the space and earth sciences was the 1986 CODMAC report [3], which concluded that scientists should be responsible for their own data and for organizing themselves to develop PDS and other systems.

Data Archives in Federal Agencies
Arnold Rots from the Harvard/Smithsonian Center for Astrophysics, Joey Comeaux from the National Center for Atmospheric Research (NCAR), Jay Hnilo from the National Oceanic and Atmospheric Administration (NOAA) and Dan Kowal, also from NOAA, spoke about data management from the view of federally funded archives in this panel session.

Rots’ presentation was on the Virtual Astronomical Observatory (VAO) and its role of federating search across data systems from U.S. observatories and space missions and its participation as a member of the larger International Virtual Observatory Alliance (IVOA). He pointed out that many of the issues mentioned in earlier sessions weren't a problem in their field, as it had standardized file formats, did not collect personal information and produced data that would not lead to patents. Many of the standards and general analysis tools had been developed by IVOA, solving the interoperability issues in their field. Rots talked about the varied groups within VAO, including user support, operations, data curation and preservation, education and public outreach, and technical assessment. He described their relationships with ADS and expanded on ADS's efforts to use data identifiers to provide semantic linking between data and publications.

Comeaux discussed the need for stable funding to retain good staff and provide services such as user support and software maintenance and development. He also raised the point that it's impossible to document everything; there are times when one has to consult a human expert. NCAR’s experience is that it takes five to 10 years to develop the necessary expertise. Comeaux explained NCAR's evaluation of the reasons for data loss, and although natural disasters and hardware failure are on the list, aspects of bad curation such as loss of metadata, accidental overwrites and simple lack of sufficient information on the value of the data are also problems. He spoke of the need to select good archival data formats that are documented at the byte level and are not dependent on specific software, hardware or operating systems.
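
To illustrate the point about byte-level documentation (a generic sketch with a made-up record layout, not NCAR's practice specifically), a format whose layout is written down can be read by any future software in a few lines:

# Hypothetical documented record layout:
#   bytes 0-3   uint32  big-endian  station identifier
#   bytes 4-7   float32 big-endian  temperature (kelvin)
#   bytes 8-15  float64 big-endian  unix timestamp of the observation
import struct

RECORD = struct.Struct(">Ifd")  # 16 bytes per record, matching the layout above

def read_records(path):
    """Yield (station_id, temperature, timestamp) tuples from a documented binary file."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a truncated trailing record)
            yield RECORD.unpack(chunk)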

Hnilo introduced the National Climatic Data Center's (NCDC) National Climate Model Portal and revealed that until 2002 there had been no practice of long-term archiving of climate data. He discussed the need for reducing the size of the data through sub-setting and downscaling to provide for interoperability and reuse. Hnilo also described the portal’s formal requirements for a submission agreement and its processes for determining what to archive.
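
As a rough illustration of sub-setting (a generic sketch, not the portal's implementation; the grid and variable are fabricated), extracting only the region and time window of interest can shrink a gridded data set dramatically:

# Illustrative spatial/temporal sub-setting of a gridded climate variable,
# assuming a NumPy array indexed as (time, latitude, longitude).
import numpy as np

lats = np.arange(-89.5, 90.0, 1.0)   # 1-degree global latitudes
lons = np.arange(0.5, 360.0, 1.0)    # 1-degree global longitudes
data = np.random.rand(120, lats.size, lons.size)  # stand-in for 120 monthly fields

def subset(data, lats, lons, lat_range, lon_range, time_slice):
    """Return the portion of the grid inside the bounding box and time window."""
    lat_idx = np.where((lats >= lat_range[0]) & (lats <= lat_range[1]))[0]
    lon_idx = np.where((lons >= lon_range[0]) & (lons <= lon_range[1]))[0]
    return data[time_slice][:, lat_idx][:, :, lon_idx]

# Example: last 12 months over a region roughly covering the continental U.S.
regional = subset(data, lats, lons, lat_range=(25, 50), lon_range=(235, 295),
                  time_slice=slice(-12, None))
print(regional.shape)  # much smaller than the full (120, 180, 360) global grid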

Kowal went into more detail on the appraisal process that occurs even before the submission agreement and also on the process to review and sunset data rather than keeping all data in perpetuity. He discussed the need for communicating the value of various metadata fields so that the people making data systems understand the need for populating those fields to promote re-use. Kowal also raised the issues of how to communicate with researchers, of the need to review the return on investment of data rescue and of how to deal with data nominated for archiving by someone other than the people who currently maintain that data. 

Highlights of Audience Discussion on Data Archives in Federal Agencies

● How long did it take to respond to requests for data to be archived? We were told that the full NOAA process could take more than a year, but could move faster for data at risk of being lost. 

● What types of people are needed to curate the data? Response: Typically, the staff are research domain experts who are trained in data management, metadata standards, curation or whatever additional skills might be needed. However, because the staff are tightly affiliated with the research field, there is a risk they might get distracted by the exciting new data and neglect the older, less exciting but equally valuable data.

Policy-Based Data Management
The final session of presentations by Micah Altman from Harvard, Eliot Metsger from Johns Hopkins University, Monica Omodei from the Australian National Data Service and Reagan Moore from the University of North Carolina focused on specific technologies to help manage archives. 

Altman discussed the Dataverse Network's SafeArchive, a layer over LOCKSS (Lots of Copies Keep Stuff Safe) that verifies that sufficient copies of files exist on the network and, should it be necessary, initiates new copies at other archives with sufficient resources, in order to demonstrate compliance with the TRAC (Trustworthy Repositories Audit and Certification) checklist.
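
The policy check itself is conceptually simple; the following is a generic sketch (hypothetical archive names and identifiers, not SafeArchive's actual code) of auditing whether each object has enough copies and nominating additional archives when it does not:

# Generic replication-policy audit in the spirit of SafeArchive: confirm each
# object is held by at least REQUIRED_COPIES archives and, if not, pick
# additional archives to host it. All names and identifiers are invented.
from collections import defaultdict

REQUIRED_COPIES = 3

# holdings: archive name -> set of object identifiers it currently stores.
holdings = {
    "archive-a": {"doi:10.5072/obj1", "doi:10.5072/obj2"},
    "archive-b": {"doi:10.5072/obj1"},
    "archive-c": {"doi:10.5072/obj2"},
}

def audit(holdings, required=REQUIRED_COPIES):
    """Return {object_id: [archives that should fetch a new copy]}."""
    copies = defaultdict(set)
    for archive, objects in holdings.items():
        for obj in objects:
            copies[obj].add(archive)
    actions = {}
    for obj, holders in copies.items():
        deficit = required - len(holders)
        if deficit > 0:
            candidates = [a for a in holdings if a not in holders]
            actions[obj] = candidates[:deficit]  # archives asked to replicate obj
    return actions

print(audit(holdings))
# e.g. {'doi:10.5072/obj1': ['archive-c'], 'doi:10.5072/obj2': ['archive-b']}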

Metsger described the current plans of the Data Conservancy to abstract many of the policy rules to allow embargoes, logging, authentication and authorization or obfuscation of the data. He explained the need for obfuscating the data through “fuzzing,” where the information is made less precise to allow for data re-use without revealing sensitive information.
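
A fuzzing step might look something like the following sketch (the field names and precision choices are invented for illustration and are not Data Conservancy code):

# Illustrative "fuzzing": reduce the precision of sensitive fields so a record
# can be shared for re-use without revealing exact values.
def fuzz_record(record):
    fuzzed = dict(record)
    # Report age only as the start of a 5-year band.
    fuzzed["age"] = (record["age"] // 5) * 5
    # Truncate coordinates to one decimal place (~11 km), hiding exact locations.
    fuzzed["lat"] = round(record["lat"], 1)
    fuzzed["lon"] = round(record["lon"], 1)
    # Keep only the year of the observation date.
    fuzzed["date"] = record["date"][:4]
    return fuzzed

original = {"age": 37, "lat": 38.99725, "lon": -76.85021, "date": "2011-03-31"}
print(fuzz_record(original))
# {'age': 35, 'lat': 39.0, 'lon': -76.9, 'date': '2011'}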

Omodei presented on the efforts of the Australian National Data Service (ANDS) to promote sharing and re-use of data and the systems they have developed to track the existence of data in a registry, provide for discovery of that data, assign Digital Object Identifiers (DOIs) for data and provide vocabulary services. She described other externally funded efforts to establish IRs, registries and software tools for data integration and how ANDS has partnered with other agencies within Australia for name authority. She also described some of their problems in dealing with metadata catalogs distributed in PDF format or as spreadsheets and the inability to track specific records within those files.

Moore discussed aspects of policy-based data management, including policy-based data environments and policy aspects of the data life cycle, building shared collections, policy-based interoperability, generic vs. specific data infrastructure and data virtualization.

Highlights of Audience Discussion on Policy-Based Data Management

● How to deal with policies that aren't enforceable as part of the system, such as attribution.

● How to deal with conflicting policies, such as different embargo times.

● Whether the tools and techniques described could be used for sunsetting data.

Themes and Questions of RDAP2
After the last session of prepared talks, Gary Marchionini led a discussion with attendees about key themes and challenges. He posed the following question: “What keeps you up at night?” and asked attendees to consider it both for their own institutions and for the field as a whole.

Topic            Keeps me up at night / Keeps the field up at night
Infrastructure   17 / 26
Management       23 / 12
Researchers      15 / 14
Description      12 / 13
Personnel         5 / 7
Funding           5 / 4
Appraisal         4 / 4

Table 1. What keeps me up at night?/What keeps the field up at night?

Attendees were surprised that management issues appeared so high on the list. The category includes issues related to the need for public outreach and the need for centralized management rather than many stovepipes. We continued with a more in-depth discussion of what exactly was meant by public outreach and whether it meant advertising our services and value to the scientists and researchers whom we serve or to the general populace. It was noted that we need to reach the general citizenry to explain the value of curating data for the advancement of society. The public rightfully has questions about what researchers and faculty do to earn their salaries, and we need to explain the benefit to society of expanding knowledge and of data that supports evidence-based policy making.

We also discussed the need for funding for public infrastructure, particularly as it affects these data efforts, and the need for all people involved in these efforts to explain the importance of curating this data. We would like to make data both easier to find and more usable by the general community. Although the more highly processed data is not useful for the original investigation, there is a lot of data out there, and it may prove useful to various projects down the line. Although there were concerns about spending too much effort on preparing data for public use, as it might not fit within an organization's funding mandates, there was a suggestion to better track press releases and the visualizations of data used in them to help find both the professional and popular websites that are making use of data.

Attendees found it surprising that funding isn't seen as the largest issue, since if funding is cut, the data effectively evaporates. We saw a need to review economics and sustainability, cost models and inter-institution collaboration, which was later expanded into a discussion of socio-political aspects and the need for better stories about the benefits of good data practices that we can use to bolster continued support from management and funders.

There was great interest in what other communities we need to reach out to and work with, such as IASSIST (International Association for Social Science Information Services and Technology), CODATA (Committee on Data for Science and Technology), AAAS (American Association for the Advancement of Science), ACRL (Association of College and Research Libraries), ARL (Association of Research Libraries), the eScience Institute, DCC (Digital Curation Centre) and DCMI-SAM (Dublin Core Metadata Initiative Science and Metadata Community), and it was noted that this meeting conflicted with a meeting for ACRL. We discussed the need to involve the national agencies, various discipline-focused communities of data and metadata managers, science data librarians and the Digital Library Federation, as well as other digital preservation and infrastructure efforts. We don’t want to duplicate efforts, but agree there is a need for a forum such as RDAP to discuss the cross-discipline issues as well as a need for many perspectives for this effort to be successful. All attendees were asked to share our trip reports and notes widely so that we can help to inform others of these efforts.

The discussion closed with the issue of reduced budgets and the need for ways for other people to participate and be informed without attending the meeting in person. We hope that this summary is a small start, but live streaming of next year's meeting was mentioned as a better alternative. There were also valid points raised about the need for the effort to be more than just an annual meeting, indeed to be an ongoing conversation. The RDAP mailing list hosted by ASIS&T is one place to discuss and engage with the participants from this, last year’s and next year’s RDAP Summits - http://mail.asis.org/mailman/listinfo/rdap 

The Future of Digital Libraries
The Summit ended with a panel on the future of digital libraries. There were too many issues brought up to discuss them all in depth here, but the following were among the broad topics:

  • The role of digital libraries in data management and archiving
     
  • The move towards knowledge repositories
     
  • Training for specialists in data management
     
  • The need for outreach to scientists, the public and policy makers
     
  • How social networks and other techniques could be applied to data systems
     
  • The importance of re-use of data
     
  • The integration of discipline search engines with IRs and digital libraries
     
  • The need for linkages between data and publications
     
  • The need to reach out to other communities working on similar efforts

An amazing variety of issues was covered in both the presentations and the open discussion. I suggest that we continue this discussion on the RDAP mailing list and that we look into ways in which we can organize into task groups to try to tackle these many and varied issues.

Thank you to Gary Marchionini, University of North Carolina, for organizing the Summit, to Erin O'Meara, University of North Carolina, Michael Giarlo, Penn State, Bill Anderson, University of Texas at Austin, and Reagan Moore, University of North Carolina-RENCI, who helped to organize specific sessions, and to the many other ASIS&T members from the special interest groups for both digital libraries (SIG/DL) and science and technical information (SIG/STI) who participated in planning and other guidance for this meeting.

Resources Mentioned in the Article
[1] Science Magazine. (2011). Special online collection: Dealing with data. Retrieved April 13, 2011, from www.sciencemag.org/site/special/data/.

[2] Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B., & Stafford, S.G. (1997). Nongeospatial metadata for the ecological sciences. Ecological Applications, 7(1), 330–342.

[3] Committee on Data Management and Computation, Space Sciences Board and the Committee on Physical Sciences, Mathematics, and Resources, National Research Council. (1986). Issues and Recommendations Associated with Distributed Computation and Data Management Systems for the Space Sciences. Washington, DC: National Academy Press. Retrieved April 13, 2011, from www.nap.edu/catalog.php?record_id=12343.
 


Joseph Hourclé is principal software engineer, Wyle Information Systems. He can be reached at oneiros<at>grace.nascom.nasa.gov