Librarians, data managers and educators convened at ASIS&T’s third annual Research Data Access and Preservation Summit in March 2012 to discuss the complex issues involved in managing research data. Major topics included the sociocultural landscape of data management, education and training for data managers, and data citation practices. Among the many questions without answers were the issues of revenue sources and whether science data curation services are better handled by institutions or scientific communities. Financial and technical sustainability are critical, together with human sustainability, including data managers’ training and education. While the relative importance of data management theory and technical skills was a matter of debate, all supported the need for continuing education in data curation. There was agreement that current data citation practices are inconsistent and that citation metadata standards and unique and persistent identifiers for datasets are necessary.

research data sets
data set management
bibliographic citations
information science education
standards
metadata
persistent identifiers

Bulletin, June/July 2012


Special Section
 
RDAP12 Summit: Challenges and Opportunities for Data Management

by Karen M. Wickett, Xiao Hu and Andrea Thomer

The third annual ASIS&T Research Data Access and Preservation Summit was held in conjunction with the ASIS&T Information Architecture Summit on March 22 and 23, 2012, in New Orleans. The Summit brought together panelists and participants from a variety of institutional contexts to discuss topics related to the management of research data. Major themes that arose from these discussions include the sociocultural landscape of research data management, education and training for data managers and issues around data citation.

The Sociocultural Landscape in Data Management
The steadily increasing ability of researchers to create and analyze large amounts of data has raised questions about providing and maintaining access to those data throughout the past century and was a motivating factor in the formation of societies and professional groups like ASIS&T. New solutions for search and storage have arisen along with new technologies, but the need for data management has reached a scale where it is inefficient or unrealistic to manage data at the level of an individual researcher or project. Questions remain about what kinds of data management services an institution should provide and under what service models.

At some research universities, libraries have participated in the development of service models to support the management and preservation of research data. As part of the curation service models panel at the RDAP summit, Michael Witt from Purdue University spoke about the development of the Purdue University Research Repository, which “provides an online, collaborative working space and data-sharing platform to support the data management needs of Purdue researchers and their collaborators.” [1] Witt traced the development of this service through the Purdue e-Data Task Force, the Distributed Data Curation Center (D2C2) system to support online collaboration and the development of Curation Service Profiles, which identify researcher needs and expectations for data curation.

The Purdue services follow a philosophy in the development of data service models that is oriented toward meeting scientists “where they are” in terms of data management and developing services to meet the perceived needs of scientists. This model is something of a departure from what might be seen as a more traditional approach to providing services in academic libraries, where, although collection development was shaped by the user needs of researchers, the basic service of providing literature access was fairly static. The curation service model approach is oriented toward identifying areas in the work of scientists, such as planning for the long-term preservation of data, where the expertise of an information professional can usefully be integrated into scientists’ practices. 

The recent requirements from the National Science Foundation and other funding agencies that a data management plan (DMP) accompany grant proposals have drawn attention to curation services and have fostered new opportunities for information professionals in research contexts. Barbara Pralle from Johns Hopkins University (JHU) spoke about the university’s data management services unit, which is housed by the JHU Sheridan Library and “offers direct support for data management planning and assists with the generation of data management plans.” [2] The data management services group provides consultation around DMPs, conducted in a form similar to a reference interview, where a consultant reviews operational data management with a researcher or group and guides them in exploring the options for curation and preservation of data and associated materials.

Pralle discussed data curation services as lying along a continuum based on the size and coherence of the research community being served. She placed individual researchers at one end of the continuum, emerging research clusters in the middle and established communities of researchers at the other end. The kinds of services a researcher or group needs will depend not only on the kinds and volumes of data they use, but also on what communities they are a part of and the degree to which those communities have established shared practices and services.

Many scientific communities have established shared practices and built curation services that are focused around certain fields or topics and do not have the same kind of institutional identity as a library or university data management service. The existence of these communities means that some scientists don’t feel a need for an institutional repository since they have been using a repository oriented toward their domain-based community. They may feel that the domain repository is better suited to meet requirements around accounting for methodologies and data processing that are specific to their domain.

Many concerns around data management, such as proper accounting for processing methods or documentation of research methodologies around scientific data, are tied closely to the particular scientific domain in which researchers or groups are working. Beyond questions about whether a domain repository or an institutional repository is to be preferred for archiving, these concerns are a live issue for the development of curation services. Researchers may feel that librarians or other information professionals do not have sufficient training in or awareness of the relevant research methods to make decisions about the description or preservation of scientific data.

David Minor from the University of California, San Diego presented on the services provided by the university’s research cyberinfrastructure (RCI) group [3]. RCI provides services for researchers across the university in high-performance computing, data center collocation, network services, storage and data curation. The data curation services are provided through a collaborative effort between the UC, San Diego libraries and the San Diego Supercomputer Center (SDSC). RCI has developed several data curation pilots [4] in which it has worked closely with research teams to develop and implement preservation and access techniques for research data associated with specific research communities. These pilots include the Brain Observatory [5], which is designed to support the preservation of a digital version of a highly studied patient brain; the UCSD Levantine Archaeology Laboratory [6], which unites field work, objects in cold storage and digital images to support archaeological research; and the Laboratory for Computational Astrophysics [7], which supports large-scale computing and is aimed at improving scientific collaboration around data products.

Strategies for starting data management and curation services in an institutional context like a university were a lively topic of discussion at RDAP. The RCI approach of identifying pilot projects and building services and infrastructure around the particular needs of a specific group of users is promising since it can allow a new group to demonstrate value and help define and solidify core services. This kind of pilot project can provide clarity in terms of how a curation or data management service can improve the perceived scholarly impact of the institution as a whole. However, it should be noted that while scaling up technical elements like hardware and storage is relatively straightforward, scaling up the provision of services like consultation and intensive curation will generally be more difficult.

Questions about how curation and data management services can be maintained in the long run in an institutional context were addressed throughout the meeting and were a particular topic of discussion during the ASIS&T SIG/DL sustainability panel. This panel in particular featured speakers discussing funding models that fall a bit outside of the models for centrally funded services, which tend to have grown out of traditional institutional services like libraries or information technology management. Oya Rieger from the Cornell University Library discussed the funding model for the open access e-print service arXiv [8]. Although Cornell has hosted arXiv since 2001, a very low percentage of the download usage of the service is due to Cornell users, which means that the library is hosting a service that is used mostly outside of the institution itself. So although the university library does continue to fund some portion of the service, arXiv has shifted toward seeking voluntary contributions from libraries and research institutions. 

Peggy Schaeffer also spoke on the sustainability panel about efforts to create a sustainable business model for the Dryad data repository [9], which is a repository for data associated with peer-reviewed publications in the biological sciences. Although Dryad currently receives funding through NSF, they are developing a business model that will not be solely dependent on grant funding. In the anticipated sustainability model, deposit fees are expected to be the main source of revenue, since most of the costs for Dryad are incurred at the time of deposit, and deposit fees will allow revenue to scale up along with costs over time. Dryad’s close association with peer-reviewed publications means that they expect to establish payment models with journals or with individual authors. For example, a member journal might pay Dryad for a subscription and then handle deposits as part of their author fees for publication.

Across institutional contexts, sustainability and business models that recognize institutional realities are essential for data management and curation services. Pralle discussed sustainability for the data management services at JHU in terms of three facets: financial sustainability, human sustainability and technical sustainability. While all three of these facets are essential for maintaining services, the issues around human sustainability are of great interest at the moment since the sociocultural landscape around data management and curation is currently in flux. This instability means that, although there are opportunities for the development of new service models, there are also tensions around what kinds of expertise are required for dealing with research data, who has the authority to deal with data and how resources are deployed within an institutional context.

This shifting sociocultural landscape was a theme throughout the RDAP summit. In her presentation on the DataOne cyberinfrastructure [10], Suzie Allard highlighted the sociocultural elements behind gaining adoption for data management and curation initiatives. Discussions of training and education for information professionals who will work in data management and curation drew attention to the same sociocultural issues.

Education and Training for Data Managers
Some of the tensions between scientists and information professionals with respect to data management arise from concerns about who is adequately prepared to manage research data. Educational topics centered on this issue were discussed during the panel on training data management practitioners. The panelists came to these issues from two perspectives: the hard sciences and library and information science (LIS). Kirk Borne from the School of Physics, Astronomy and Computational Sciences at George Mason University and Peter Fox from the Tetherless World Constellation at Rensselaer Polytechnic Institute represented educators in the hard science disciplines, where data analysis has a long tradition in research discovery and education. Jian Qin from the School of Information Studies at Syracuse University represented efforts in LIS, where managing information and knowledge has been the center of intellectual pursuits. This combination broadened the horizon for audience members who have been following either tradition and inspired a lively discussion among Summit participants. (See more on this session in Session Summary: The RDAP12 Panel on Training Data Management Practitioners in this section.)

A common theme in all three panel presentations was the degree to which teaching should address theory as opposed to technical skills. There was a general sense of agreement that technical skills are important, especially in helping students start their careers, but that fostering an understanding of the conceptual foundations underlying technical skills was more likely to empower students throughout their entire careers. However, there is no single established body of theory that underlies data management. As Peter Fox pointed out, we have basic theories of information and knowledge, and one can argue that analog data has theoretical roots in museum studies, but these theories seem to fall short for the foundations of digital data. 

Although we don’t have a single shared theory for data management, there is progress in developing learning objectives for educational programs. Jian Qin discussed the learning objectives that were identified as part of the eScience librarianship curriculum project [11]. These objectives include competency in scientific data management, competency in cyberinfrastructure technologies and the ability to plan and carry out relevant projects. 

Another topic of discussion in the education and training panel was whom educators expect to enroll in these programs. In the science- (rather than library-) based programs, students mainly have backgrounds in physical sciences, information technologies or computer science. Those programs emphasize data analysis as an integral part of science research and being data smart as a requirement for future scientists. In the LIS programs, students are more likely to be thought of as future e-Science librarians, who must be data smart when it comes to providing services, rather than analysis. 

While some LIS students have backgrounds in domain sciences, computer science or information technologies, they are more likely to have backgrounds in the humanities or to be practicing librarians who wish to extend their expertise to science data management. These backgrounds raise questions about whether it is reasonable to expect students in LIS programs to get up to speed on topics like data analysis, since many students won’t be entering an LIS program with adequate preparation for those topics.

There was strong audience support for continuing education opportunities for mid-career professionals, especially with a focus on science data curation. The demand for data management professionals is rising so steeply that the number of new graduates from programs in e-Science librarianship, data curation and data management is unlikely to be sufficient to supply the workforce. There are currently only a handful of programs focused on data curation and management in U.S. library schools (for example, those at Illinois, North Carolina, Syracuse and Tennessee), and all are still in their early years. It is therefore also necessary to provide in-career training for information professionals who are working in jobs related to data services.

Although the curricula in science data analysis and e-Science librarianship have different emphases, they share a similar vision on the importance of theoretical foundations, competency in cyberinfrastructure technologies and soft skills such as communication, teamwork and leadership. 

Data Citation 
The development of standards and practices around the citation of research data was a major topic of discussion at the RDAP Summit. Like traditional publication citation, data citation can support measurement of scholarly output and allow proper credit for all parties that contribute to a body of research. In addition, data citation has the potential to go beyond traditional publication citation by linking scientific research with relevant datasets, thus facilitating data reuse and data-driven discovery. 

Despite these potential benefits, current data citation practices are far from mature. Workshops that target specific domains let researchers, publishers and information professionals come together to develop solutions that address the needs of their community. However, citation practices that are too closely based on the needs of researchers working with particular kinds of data are unlikely to gain adoption outside of that field. This restriction is a problem if we anticipate data citation supporting the reuse of data in new contexts or data-driven discovery, since domain-specific citation practices will raise the bar for finding and using data across disciplines. Venues like the RDAP summit can bring together information professionals from a broader base to discuss proposals and seek solutions that work across domains.

The data citation panel at RDAP featured a discussion of the current state of data citation practices and possible approaches to the challenges. (See more on this session in Session Summary: The RDAP12 Data Citation Panel in this section.) Mark Martin from DataCite, an international organization devoted to creating a global citation framework for research data [12], discussed the methods that researchers currently use to approximate data citation. One common approach is to use an acknowledgement section in a publication to give credit to data providers. Approaches like this will not support metrics for measuring scholarly output since the acknowledgement sections are not commonly indexed by publication databases. 

Another technique currently used by researchers to approximate data citation is to cite a research paper where the dataset is introduced and used. This approach does not make a distinction between the data and other parts of a paper such as research results and analysis. As a result, it also will not support metrics that correctly gauge the impact of a dataset itself. In general these challenges bring up the point that the development of metrics for the scholarly impact of research data will require coordination with publishers, who will need to expose data citation for indexing, and the producers of impact metrics, who will need to develop and refine their metrics to incorporate citation of research data.

The final approach mentioned by Martin is to cite a published paper focusing on the data, if such a publication is available. One problem with this approach is that it will not reflect the changing nature of many of the datasets used by researchers. In practice, datasets may be updated, sometimes frequently, whereas published papers are static and may become outdated. This issue hints at a major challenge for the management of research data, which is that it is often difficult to apply many of our conceptual models for information management to research data and datasets. The fact that datasets are commonly considered to change or grow over time is difficult to accommodate in models that are designed to handle more static objects like journal articles. 

One clear solution to many of these issues would be to cite a well-described dataset itself (for example, one with descriptive and provenance metadata as well as links to related studies). However, in reality, little data is formally published, and many datasets do not even have formal titles or authors, let alone well-written metadata. 
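As a rough illustration of what citing "a well-described dataset itself" might involve, the following minimal sketch models a dataset record and formats a citation from it. Everything in it is an assumption made for illustration: the `DatasetRecord` class, its field names, the citation layout and the sample values (including the placeholder identifier) do not come from any standard or from the panel presentations. The `version` field is one simple way to acknowledge that datasets may change over time.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal description of a citable dataset."""
    creators: List[str]
    title: str
    year: int
    version: str      # distinguishes snapshots of a dataset that changes over time
    repository: str
    identifier: str   # placeholder for whatever persistent identifier scheme is used
    related_studies: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of core citation fields that are empty."""
        core = ["creators", "title", "version", "repository", "identifier"]
        return [name for name in core if not getattr(self, name)]

    def citation(self) -> str:
        """Format a citation string; raise if the record is too sparse to cite."""
        missing = self.missing_fields()
        if missing:
            raise ValueError("cannot cite dataset, missing: " + ", ".join(missing))
        authors = "; ".join(self.creators)
        return (f"{authors} ({self.year}). {self.title}, version {self.version}. "
                f"{self.repository}. {self.identifier}")


# Illustrative values only; none of these refer to a real dataset.
record = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    title="Example stream temperature readings",
    year=2012,
    version="1.2",
    repository="Example Repository",
    identifier="doi:10.5555/example-dataset",
)
print(record.citation())
```

The `missing_fields` check reflects the practical obstacle described above: a dataset with no formal title or named creators cannot be cited in this way until someone supplies that description.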

The development of citation standards would be advanced significantly by agreement on standards for unique and persistent identifiers for datasets. Digital object identifiers (DOIs) are one of many possible solutions, but this issue is still an open question. The Federation of Earth Science Information Partners (ESIP) [13] has convened experts to review possible identifier schemes in terms of their suitability for data citation and other data management problems. Although ESIP members have published the results of their analysis as a report [14], it seems that we are still quite a way from having any established standard identifier scheme for scientific data. Many challenges arise from the variety of specific needs for identifier schemes connected to factors such as whether datasets are expected to grow over time and what kind of authority is deemed necessary to assign an identifier to a particular data product.

Similarly, there is no widely accepted metadata schema for describing and managing research data. Groups are working to develop such schemas, among them the Dublin Core Science and Metadata Community (DC-SAM), which provides a forum for discussing metadata challenges specific to scientific data curation. Once again, this issue is complicated by the differences in requirements for data description among different disciplines and scientific communities.

It is also still unclear from current practices who will create metadata for datasets. Researchers who create the datasets often do not attend to the process of metadata creation and at best provide insufficient documentation for their datasets. On the other hand, librarians who ingest the datasets into repositories may not have complete knowledge of the research projects that produce the data in the first place. 

This kind of problem may be addressed to some degree by systems like OpenWMS [15], which was discussed by Aletia Morgan from Rutgers University. Systems like this one streamline the creation of metadata by integrating it into the ingest process for bringing digital objects into a repository. Making the creation of metadata for datasets easier may help ease metadata creation into scholarly practices, but it will not address tensions around who has the authority to create metadata or address fundamental modeling problems such as those that arise from the mutability of data and datasets.

Beyond the modeling and description issues for data citation, there is a sociocultural element to the adoption of any data citation standard or practice, which suggests that the incentives for citing data within scholarly communities should be investigated and promoted. Taking a broad view, it is not hard to understand the benefits of data citation for scientific research, but for individual researchers there is still a lack of direct incentives for citing data properly or even citing data at all. Educating researchers about the benefits of data citation will be an important and necessary element of any efforts to promote standard practices for data citation.

Standards and best practices will be a necessary element of promoting data citation. Paul Uhlir from the Board on Research Data and Information (BRDI) [16] at the U.S. National Academy of Sciences presented recent activities and efforts carried out in the Data Citation Standards and Practices Task Group in the International Council for Science: Committee on Data for Science and Technology (CODATA) [17]. This task group is in the process of developing a white paper on current practices in data citation, which is expected in 2013 and will include developments in standardization proposals and best practices, emerging principles for data citation and tools and infrastructure. 

Concluding Remarks
The RDAP Summit brought together librarians, data managers and educators from a number of different environments to discuss open issues in preservation and access for research data. Workshops and meetings that address the needs and capabilities of specific communities are an important part of making progress in these areas, but we also need meetings that bring together different perspectives and can draw attention to the issues that are common across many disciplines. Issues of how to prepare and train information professionals or the sociocultural elements of developing data management services are unlikely to receive sufficient attention to aid progress without meetings like RDAP. The RDAP Summit gave participants the opportunity to share experiences, consider strategies and hear about new opportunities and challenges for data management.

We thank all of the panelists and participants at the RDAP meeting and especially the RDAP12 program committee for bringing the meeting together.

Resources Mentioned in the Article
[1] Purdue University Research Repository: http://research.hub.purdue.edu/

[2] Johns Hopkins University Data Management Services: http://dmp.data.jhu.edu/

[3] UC San Diego Research Cyberinfrastructure: http://rci.ucsd.edu/

[4] UC San Diego Research Cyberinfrastructure: Pilot Projects: http://rci.ucsd.edu/pilots/index.html

[5] The Brain Observatory: http://thebrainobservatory.ucsd.edu/

[6] Center of Interdisciplinary Science for Art, Architecture and Archaeology (CISA3): http://culturalheritage.calit2.net/cisa3/

[7] Laboratory for Computational Astrophysics: http://lca.ucsd.edu/

[8] arXiv.org: http://arxiv.org/

[9] Dryad: http://datadryad.org/

[10] DataOne: www.dataone.org/

[11] Syracuse University eScience Fellows Program: http://eslib.ischool.syr.edu/

[12] DataCite: www.datacite.org

[13] ESIP: the Federation of Earth Science Information Partners: www.esipfed.org/

[14] Duerr, R. E., Downs, R. R., Tilmes, C., Barkstrom, B., Lenhardt, W. C., Glassy, J., et al. (2011). On the utility of identification schemes for digital earth science data: An assessment and recommendations. Earth Science Informatics 4(3), 139-160. doi:10.1007/s12145-011-0083-6. Retrieved April 28, 2012 from www.springerlink.com/content/52760gq3h200gw38/

[15] RUCore: Rutgers University Community Repository: http://rucore.libraries.rutgers.edu/open/projects/openwms/

[16] National Academy of Sciences. Board on Research Data and Information: www.nas.edu/brdi

[17] CODATA: www.codata.org/taskgroups/TGdatacitation/index.html


Karen Wickett is a doctoral candidate in library and information science at the University of Illinois at Urbana-Champaign. Her research interests are conceptual foundations of information organization systems and the semantics of metadata vocabularies; her teaching interests are information organization and modeling. She can be reached at wickett2<at>illinois.edu.

Xiao Hu is an assistant professor in the library and information science program at the University of Denver. Her research and teaching interests include organization of information, digital libraries, music information retrieval and data mining for information professionals. She can be reached at xiao.hu<at>du.edu.

Andrea Thomer will be starting the doctoral program at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign in the fall of 2012. Her research interests include data curation, biodiversity informatics, natural history museum informatics and the organization of information. She can be reached at thomer2<at>illinois.edu.