The WDS Scientific Committee (WDS-SC) requests WDS Members to maintain the quality of their data and services. Another important task for the WDS-SC is to recruit data centres from as many and as wide a range of disciplines as possible to serve their data and so promote science, in particular interdisciplinary science. However, based on my experience as a researcher in Solar–Terrestrial Physics and as Director of the World Data Centre for Geomagnetism, Kyoto (WDS Regular Member), I believe that we have one more important task for promoting science: collecting and serving useful data from the 'dark long tail' of datasets.
There are a huge number of datasets, mainly obtained on a research project basis, that are not registered with active data centres and hence are 'dark' to many of us. These datasets are typically built by small research groups over a limited period, and data quality checks are often insufficient. Although their quality may not be high and they exist only for a limited period, such data are very important and useful if the location of the observation site is highly unique or if no other observations are available.
We know of many such 'dark long tail' datasets, and some have been sent to our data centre, but even when we find them and can ingest them, we often have difficulty maintaining (or confirming) their quality. Nevertheless, my personal opinion is that these data should also be served by WDS Members, even if they conflict with the membership requirements of WDS.
One way to reconcile data quality with the serving of data from the 'dark long tail' is to register metadata that describe the observations in as much detail as possible. An example of this in practice is IUGONET (Interuniversity Upper atmosphere Global Observation NETwork), which has a common database of metadata and forms a virtual data centre of distributed databases at several institutions. This data system includes databases from the 'dark long tail', as well as large well-known databases.
The WDS-SC and WDS Member Organizations must therefore take action (and advocate) to ensure such 'dark' datasets are registered in appropriate data centres or systems with adequate metadata to make them useful. Otherwise, I am concerned that they may simply be kept in individual institutional repositories in a way that cannot be exploited, or could even be lost forever.
To improve the situation domestically, we held two workshops at Kyoto University last autumn that explored possibilities for collaboration among Japanese university libraries, informatics experts, and research scientists. University libraries in Japan are not very positive in general about functioning as repositories for scientific data. In contrast, some researchers are actively trying to develop the related technology or systems for that to happen. Moreover, a Japanese endeavour to register datasets and attach Digital Object Identifiers started last year. My hope is that these activities grow and form a stream of open data from the 'dark long tail'.
I would like to introduce a new initiative of DANS (Data Archiving and Networked Services) in the Netherlands. During the International Open Access Week last month, DANS launched, together with the Dutch publisher Brill, a new Research Data Journal for the Humanities and Social Sciences. The Research Data Journal is a digital-only, open access journal, which documents deposited datasets through the publication of data papers. The journal concentrates on the Social Sciences and the Humanities, covering history, archaeology, language and literature in particular.
Data papers are scholarly publications of medium length containing a non-technical description of a dataset and putting the data in a research context. Each paper gets a persistent identifier providing publication credits to the author.
Data papers call attention to particular research datasets, which may increase the likelihood that the datasets could be re-used or re-purposed by other researchers in the future. Additional benefits are that they are peer-reviewed, can be listed on CVs, and can accumulate citations just like traditional journal articles. This way they provide important incentives for researchers to put time and effort into preparing their datasets for public access.
The DANS Research Data Journal is an enhanced publication in more than one respect. The text is enhanced with direct links to datasets in the long-term repository. Additionally, the journal is enriched with features that contribute to greater usability of the content in terms of overview and navigation by adding background information and various forms of visualization. Where possible, data can be previewed and explored online, rather than through time-consuming downloads and offline applications. In short, an enhanced data paper provides an integrated view of data in their research context.
At DANS we hope that this initiative will stimulate researchers in the Netherlands and abroad to make their data more easily available to others.
Just returned from a useful trip to visit my collaborators working with the 'Chinese plant trait database' at the Northwestern Agricultural and Forestry University in Yangling, China. We now have information from several hundred sites across China, and this will allow us to make detailed analyses of trait–climate relationships. But trips like these remind me that data are the bitcoin of Chinese science. It requires management of complex social networks to put together a dataset this large, but there are always people outside these networks who nevertheless could contribute. And then, once the science is done, what happens to the data? There is an international database for plant trait data, but where is the trusted repository for such data in China? My young collaborators are keen to share openly with other scientists and we need to make this easier. China is one of the few countries that has a national group supporting WDS activities – guess what I am going to be talking to them about next? Sandy H.
During the last International Polar Year (IPY) 2007–2008, a wide range of research topics were addressed, from glaciology to biology, from biochemistry to biophysics, from oceanography to physiology, from atmospheric to social sciences. Despite the vast amounts of data collected, there was no central archive for IPY-related data. Instead, the data have been spread widely, with much of it published in research articles only.
To enhance the availability and visibility of publication-related IPY data, a concerted effort among PANGAEA – Data Publisher for Earth and Environmental Science, the ICSU World Data System, and the International Council for Scientific and Technical Information (ICSTI) was undertaken to extract data resulting from IPY publications for long-term preservation. A list of 1380 references was provided by ICSTI, and this bibliography served as a basis for me to filter out journal articles containing extractable data—either from the articles themselves (in the form of tables) or from supplementary materials supplied with the publication.
Ultimately, data and their associated metadata were extracted from 450 IPY articles. These data can now be accessed from here, and individual parts can be searched using the PANGAEA search engine and adding +project:ipy.
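As a small illustration of the search tip above, the sketch below appends the +project:ipy filter to free-text search terms and builds a link to the search page. The function names are hypothetical, and the base address and the "q" query parameter are assumptions made for illustration; only the +project:ipy filter itself comes from the text.

```python
from urllib.parse import urlencode

# Assumption: the public search page accepts the query text in a "q" parameter.
SEARCH_PAGE = "https://www.pangaea.de/"


def ipy_query(terms: str) -> str:
    """Append the +project:ipy filter to free-text search terms."""
    return f"{terms.strip()} +project:ipy".strip()


def ipy_search_url(terms: str) -> str:
    """Build a URL-encoded search link restricted to IPY data (hypothetical helper)."""
    return SEARCH_PAGE + "?" + urlencode({"q": ipy_query(terms)})


if __name__ == "__main__":
    print(ipy_query("sea ice thickness"))
    print(ipy_search_url("sea ice thickness"))
```

The plus sign and colon in the filter are percent-encoded by urlencode, so the filter survives being passed in a URL.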
For more information, see also Driemel et al. (2015), The IPY 2007–2008 data legacy – creating open data from IPY publications. Earth Syst. Sci. Data, 7, 239–244, doi:10.5194/essd-7-239-2015.
Christine Borgman—a Professor in Information Studies at University of California, Los Angeles—has been given a three-year research grant by the Alfred P. Sloan Foundation to analyze how data are handled in four different research projects with the aim of simplifying data practices and challenging assumptions about the value of sharing data.
The following article on Professor Borgman's work and on the complexities of data sharing by Tiffany Esmailian was first published on 25 September 2015 on phys.org. We hope that the WDS community will find it of interest.
A blog post by Paolo Manghi and Sandro La Bruzzo (OpenAIRE)
Sharing links between the published literature and datasets is crucial to achieve the full potential of research data publishing. This article presents the coordination and implementation efforts of the ICSU-WDS–RDA Data Publishing Services Working Group (DPS-WG) and the OpenAIRE infrastructure towards realizing and operating an open and universal data-literature interlinking service (DLI Service). The service is the result of an open collaboration between major stakeholders in the field of data publishing. It provides access to a graph of dataset–literature and dataset–dataset links collected from a variety of major data centres, publishers, and research organizations. On the basis of feedback from content providers and consumers, the service will also enable the incremental refinement of an interlinking data model and exchange format, towards shaping up a universal, cross-platform, cross-discipline solution for sharing dataset–literature links.
Introduction and vision
Challenges to realizing the full potential of research data exist at different levels: from cultural aspects (such as proper rewards and incentives) to policy and funding, and to technology. The challenges are interconnected and affect a diversity of stakeholders in the research data landscape, including researchers, research organizations, funding bodies, data centres, and publishers. To make progress in overcoming barriers and building a stronger research data infrastructure, it is essential that the different stakeholders work together to address common issues and move forward on a common path. Alongside other organizations, the ICSU World Data System (ICSU-WDS), the Research Data Alliance (RDA), and OpenAIRE provide useful forums for such collaborations. In particular, they are today working in synergy on an initiative that brings together different parties in the research data landscape with the objective of creating the Data Literature Interlinking Service (DLI Service), namely, 'an open, freely accessible, web-based service that enables its users to identify datasets that are associated with a given article, and vice versa'. At the time of writing, members of the initiative include: the ICSU-WDS–RDA DPS-WG, OpenAIRE, RDA, ICSU-WDS, STM, CrossRef, DataCite, ORCID, the Australian National Data Service, and the RMap project. The vision is to move away from the several bilateral arrangements that characterize the research ecosystem today, towards establishing common standards and tools that sit in the middle and interact with all parties (see Figure). Such a transition would facilitate interoperability between platforms and systems operated by the different parties, reduce systemic inefficiencies in the ecosystem, and ultimately enable new tools and functionalities to the benefit of researchers.
The DLI Service populates and provides access to a graph of 'authoritative' dataset–literature links collected and aggregated from a variety of major data centres, publishers, and research organizations. It is intended to offer facilities for the following classes of actors:
– End users: Searching and browsing the graph of links via the Prototype PORTAL
– Third-party service developers: Accessing publications and datasets in the graph via programmatic APIs
– Content providers: Willing to feed high-quality authoritative links between publications and datasets or between datasets to the service (complete list of content providers).
Note: Formal data acquisition policies, SLAs, and data provider registration procedures will be produced at a later stage; currently each 'application' is processed independently with bilateral agreements. On the basis of feedback from content providers and consumers, the DLI Service will refine its underlying interlinking data model and exchange format to make it a universal, cross-platform, cross-discipline solution for collecting and sharing dataset–literature links, balancing the information that can be shared across content providers against the information needed by its consumers.
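To make the idea of a link graph more concrete, here is a minimal sketch of how dataset–literature links could be stored and queried in both directions ('datasets that are associated with a given article, and vice versa'). It is an illustration only and not the DLI Service's actual data model or API: the Link and LinkGraph names, their fields, and all identifiers in the usage example are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One asserted relationship between two identified objects (hypothetical model)."""
    source_pid: str  # e.g. the persistent identifier of an article
    target_pid: str  # e.g. the persistent identifier of a dataset
    provider: str    # the content provider that asserted the link


class LinkGraph:
    """Undirected lookup over links, so queries work from either end."""

    def __init__(self) -> None:
        self._links: set[Link] = set()
        self._neighbours: dict[str, set[str]] = defaultdict(set)

    def add(self, link: Link) -> None:
        self._links.add(link)
        self._neighbours[link.source_pid].add(link.target_pid)
        self._neighbours[link.target_pid].add(link.source_pid)

    def related(self, pid: str) -> list[str]:
        """Return every object linked to the given identifier, sorted."""
        return sorted(self._neighbours.get(pid, set()))


if __name__ == "__main__":
    graph = LinkGraph()
    # Placeholder identifiers, invented for this example.
    graph.add(Link("article:A", "dataset:1", "provider-x"))
    graph.add(Link("article:A", "dataset:2", "provider-y"))
    graph.add(Link("dataset:1", "dataset:3", "provider-x"))
    print(graph.related("article:A"))
    print(graph.related("dataset:1"))
```

Keeping the provider on each link is one way to preserve who asserted a relationship, which matters if only 'authoritative' links are to be aggregated.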
In the forthcoming months, further work will be carried out towards the delivery of a production service that is fully reliable in terms of QoS and quality of content. The following actions will be undertaken:
Definition of a content acquisition policy: minimal quality requirements to be respected by content providers in order for their publications, datasets, and the relationships between them to be aggregated by the system;
Definition of SLAs for content providers: make sure content providers are aware of and agree on how their content (metadata) will be made openly accessible via the service;
Technical enhancements: data harmonization (e.g. cross-PID deduplication), programmatic data access (e.g. high-throughput resolver), data scalability (e.g. moving away from open source databases);
Deployment as an OpenAIRE infrastructure operational service: deploying the service on the OpenAIRE hardware infrastructure.
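As an illustration of the cross-PID deduplication mentioned among the technical enhancements above, the sketch below merges records that share at least one persistent identifier, using a union-find structure. This is one possible approach under simple assumptions (each record is just a set of identifier strings), not a description of how the DLI Service implements it; the function name and the identifiers in the example are hypothetical.

```python
def deduplicate(records: list[set[str]]) -> list[list[str]]:
    """Group identifiers that belong to the same object.

    Each record is the set of PIDs one provider supplied for one object.
    Records sharing any PID are treated as the same object and merged.
    """
    parent: dict[str, str] = {}

    def find(pid: str) -> str:
        # Locate the representative of pid's group, compressing the path.
        parent.setdefault(pid, pid)
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for record in records:
        pids = list(record)
        for pid in pids:
            find(pid)  # register every PID, including singletons
        for other in pids[1:]:
            union(pids[0], other)

    groups: dict[str, set[str]] = {}
    for pid in parent:
        groups.setdefault(find(pid), set()).add(pid)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Placeholder identifiers, invented for this example.
    records = [
        {"doi:a", "handle:1"},
        {"handle:1", "urn:x"},
        {"doi:b"},
    ]
    print(deduplicate(records))
```

In this example the first two records share handle:1, so their identifiers end up in one group, while doi:b stays on its own.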