The main point of the article is to encourage the broad research community to work towards open and FAIR data and put in place the policies, guidelines, incentives, and funding necessary to support the needed culture and systemic change around how we handle our scientific data.
A Blog post by Ingrid Dillo (WDS Scientific Committee Vice-chair; Deputy Director of WDS Regular Member: DANS, Data Archiving and Networked Services)
In this WDS Blog post, I want to highlight a set of guidelines developed in a community that is not yet very well represented within the membership of the World Data System, but that is getting more and more involved. I am talking about the Humanities. Coming from the Humanities myself, and being active in a broader international data environment, I know from experience that the Humanities data community has a lot to offer other disciplines. Humanists often struggle with very fuzzy, multi-interpretable, scattered, and incomplete data, and so they need to be highly resourceful. For the Digital Humanities, therefore, international collaboration is a sine qua non.
An example of such international collaboration is the PARTHENOS Project, which comprises 16 European partners, including DANS (a WDS Regular Member). PARTHENOS stands for ‘Pooling Activities, Resources and Tools for Heritage E-research Networking, Optimization and Synergies’. It is inspired by Athena Parthenos, the Greek goddess of wisdom, inspiration, and civilization.
PARTHENOS aims to strengthen the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology, and other related fields. This is being achieved through, for example, the definition and support of common standards and the harmonization of policy definitions and implementation.
One of the activities under the umbrella of PARTHENOS concerns the definition of common policies and implementation strategies for Research Data Management (RDM). The ubiquitous FAIR principles were chosen as a framework to structure a set of guidelines and recommendations. The concrete (and freely available) outcome of this activity is the very practical booklet: Guidelines to FAIRify data management and make data reusable.
The booklet offers a series of guidelines to align the efforts of data producers, data archivists, and data users in the Humanities, and thus make research data as reusable as possible. The guidelines are the result of the work of over 50 PARTHENOS project members, who were responsible for investigating commonalities in the implementation of policies and strategies for RDM and who conducted desk research, questionnaires, and interviews with selected experts to gather around 100 current data management policies—including guides for preferred formats, data review policies, and best practices (both formal and tacit).
The booklet also offers recommendations for two important stakeholder groups:
Researchers and research communities,
Research infrastructures and, in particular, data repositories.
By focussing on (meta)data and repository quality, a set of twenty guidelines was extracted. For easy reference, the guidelines have been grouped under the four FAIR principles.
The guide starts with an important message: Invest in people and infrastructure. Investing in data infrastructures and trustworthy data repositories, as well as in hiring and educating data experts, is an important prerequisite to be able to implement any data management guideline. This way, we can enable researchers to comply with data management mandates coming from funders and journals.
Please have a look at the set of guidelines and see whether they are reusable in your domain.
Data drives so much of our professional life today, from the organization of business meetings (virtual or face-to-face) to the publication of our research results. Data may simplify or complicate our lives, but for sure it is ubiquitous, though often unseen and behind the scenes.
But, what are the future challenges? And who are the future influencers and curators, when thinking about scientific data, its analysis and curation? We take a closer look at "our" WDS future generation, the enthusiastic group that builds the Early Career Researchers and Scientists (ECR) Network. And we take our hat off to our young and outstanding awardees, such as Wouter Beek in 2018. They represent the next generation and are our link to upcoming thrills and challenges in data science and management. They are our inspiration and hope when it comes to data curation for the next generations to come and we hope they raise their voice to become data influencers in the scientific community.
Want to be part of the next generation of data influencers? Want to meet fellow data experts keen to share their experience, or want to support an outstanding colleague working with data? Do not hesitate to join and participate in the different activities WDS proposes. The WDS ECR Network is always happy to invite data experts or future data influencers to join their telecons and events. Moreover, the WDS Data Stewardship Award is a good opportunity for you to support a colleague in becoming part of the next generation of data experts: the Call for Nominations is now open.
Big data gurus and advocates for a cyberinfrastructure or big data science describe a data-centric future in which massive quantities of digital data will be available for reuse in research, artificial intelligence, making predictions, or engaging in data-driven discovery. In the world of biology, this translates into the expectation that molecular sequence information will be available from nucleotide repositories such as GenBank, or that any and all occurrence data can be found at GBIF. It is also presumed that the data will have been vetted and, in all aspects, are trustworthy.
The vision is flawed. An unknown but large fraction of newborn digital data does not make it beyond the maternity ward. If data are to be properly prepared for re-use in the big data world, they must have moved a long way from the hands of their creators and into the custody of data managers and repositories that will guarantee access to vetted content in perpetuity.
There are hundreds of thousands of sources where digital information is born. The long tail of parents includes individual researchers, research teams, research programs, legacy data recovery projects, local, state, and national governmental bodies, and international initiatives. These parents rarely have the understanding or skillsets to ensure that their newborn will mature appropriately for a rôle in the big data world. For this to happen, data must be handed on to those who specialise in data management and curation. These adoptive parents will shepherd the content through the maturation process that will make it ready for repositories that are designed to make trustworthy data and services available to the public. The challenges to completion of the path are numerous. The first step is simply to make the data visible and accessible. Bad data need to be set aside or put back on the right path. For content to be discoverable, standardized metadata and ontologies need to be added so that the data can be found in the appropriate context. Interoperability requires access through appropriate services and for the data to be clothed with standardized ontologies and metadata. Just as the idiosyncratic swaddling clothes must be set aside, the new descriptors will need to be embellished with increasing detail, and be continually corrected and improved. Provenance metadata will help creators and managers gain credit for their effort, and will open up a pathway through which concerns about the data can be expressed. There will be problems that are specific to particular disciplines. As an example, relationships among taxa in ‘evolutionary trees’, which are created by algorithms, become less trustworthy as new information and new algorithms emerge. In the biodiversity sciences, taxa may be mis-identified. Further, with the passage of time, new species are discovered, a process that renders ambiguous the taxa identified by earlier, less stringent criteria. The ecosystem through which the content moves must provide the support that ensures continued fitness for purpose. Confidentiality and ethical concerns vary with subject matter but also have to be addressed.
As data mature, they will move from the hundreds of thousands of parents to a small number of data repositories that are funded using models that guarantee the persistence of their services. As far as is feasible, we expect the managers and repositories to apply the FAIR principles to the content they hold. Then, if the holder of the baton can meet the expectations of the CoreTrustSeal accreditation, the data will have found a secure and persistent home, with data ready for reuse. Fifty or so repositories have gained the CoreTrustSeal certification. But, as we have seen from the recent US governmental shutdown in December 2018 and January 2019, even major and certified data suppliers cannot be relied upon and may blink out unpredictably.
Many components already exist, but they are not joined up. Not only do most data fall by the wayside, but much of what remains is not fit for a rôle in a data-centric world. The data are too contextualised, and descriptors are incomplete or inaccurate. Few, if any, of the big data world providers allow users to correct errors. The consequence is that users of open data have to work with contaminated material. The World Data System is charged by the International Science Council to promote universal and equitable access to scientific data and information, and to increase the capacity to generate new knowledge. WDS is especially concerned with the trustworthiness of the data and services. We will move further and faster when we acknowledge that the research and discovery paradigm needs to be complemented with an investment in infrastructure and services. That investment will provide the framework and support that is required for data to live long and to prosper.
Several countries in the Latin American and Caribbean (LAC) region, such as Argentina, Brazil and Mexico, among others, have significant scientific output and participate in important multinational associations and projects. However, this scientific activity has not been reflected in effective participation in forums related to the management of scientific data. An example is that no LAC country has a scientific data system accredited as a Regular Member of the World Data System (or certified by the CoreTrustSeal). At both the International Data Week held in September 2016 in Denver, Colorado and the recent one held in November 2018 in Gaborone, Botswana, the number of researchers from these countries was very limited.
The Workshop brought together more than 150 participants, comprising researchers from countries in the region such as Brazil, Chile, Argentina, Costa Rica, Uruguay, Guatemala, Paraguay, Panama, El Salvador and Honduras; members of the WDS Scientific Committee; representatives of the Research Data Alliance; and policy-making and funding institutions, including the Brazilian Ministry of Science, Technology, Innovation and Communication and the Foundation for Research Support of São Paulo State (FAPESP). Unfortunately, Mexico, the second largest producer of science in the region, was not represented, despite numerous efforts to identify and attract researchers.
The Workshop consisted of presentations on the state-of-the-art and future perspectives for scientific data management; on ways to accredit scientific data repositories; and on opportunities for funding, training, and infrastructure in scientific data management. In addition, 44 scientific data system initiatives in the region were presented from a wide range of fields—soil and agriculture, biodiversity, climate, astronomy, geology, health sciences, anthropometry, and the humanities, to name but some—providing a good landscape of what is being done. The full programme, the papers presented, and videos of all the sessions can be accessed on the Workshop website.
It became clear from the presentations that there are major disparities in the LAC region. On the one side, there are highly structured and complex data systems, such as the Inter-Institutional Laboratory of e-Astronomy developed by the National Observatory, the Center for Integration of Data and Health Knowledge of the Oswaldo Cruz Foundation, and the Biodiversity Research Program coordinated by the National Institute of Amazonian Research. On the other side, there are universities or countries that are taking the first steps towards creating their repositories of scientific publications. For the latter, the La Referencia project has been fundamental, since it is creating a network of open access repositories for science and involves many countries in the LAC region.
The Workshop, moreover, brought to the fore some of the difficulties faced by countries in the region in developing scientific data systems. One is that the structuring and management of scientific data is not perceived by institutions as bringing benefits. As a result, these tasks are left to individual initiatives, which lack adequate infrastructure and sustainability. This problem is especially aggravated by the discontinuity of project funding by governments and agencies. Another aspect mentioned is the lack of rules and standards established by agencies and enforced at the time when financial support is given to a project. The exception is FAPESP, which in 2017 implemented a rule whereby projects receiving up to approximately 50,000 USD must plan in advance the destination of all scientific data collected or produced during the course of the research.
In addition to exchanges of information and the development of greater synergies among the participants, the Workshop provided a number of concrete results. The first was the creation of a Working Group at the ABC to draw up norms and guidelines on the treatment of open data for science and technology in Brazil. This Working Group is currently planning a follow-up meeting to the Workshop, to be held in 2019. A further consequence of the Workshop is that the ABC has been invited to participate in debates in the National Congress of Brazil on a new law regarding the treatment and protection of personal data. There are likely many more such examples beyond Brazil, and Workshop participants will be contacted to discover the positive outcomes of the Workshop in promoting scientific data systems in countries and projects throughout the LAC region.
A Blog post by H. K. 'Rama' Ramapriyan (Science Systems and Applications, Inc. contractor for NASA Earth Science Data and Information System Project) and Alex de Sherbinin (WDS Scientific Committee member)
Any service organization exists to serve its customers. For a science data repository, the customers are users of its data, be they researchers or applied users in the scientific domain of that repository. For the data repository to serve the community best, it is essential that its managers understand the requirements of the users, respond to their changing needs, and evolve with technological changes as well. Many different mechanisms can be used to interact with users to continually maintain an understanding of their needs. NASA’s Earth Science Data and Information System (ESDIS) Project and its constituent Distributed Active Archive Centers (DAACs) engage with User Working Groups (UWGs) to fulfil this function. The purpose of this Blog post is to discuss how UWGs have benefitted ESDIS and the DAACs, and to share lessons gleaned from 25 years of UWG experience to demonstrate how advisory bodies can improve data curation and management for the benefit of diverse user communities.
Figure 1. Map of the NASA DAACs.
NASA’s ESDIS Project, a WDS Network Member, is responsible for 12 DAACs, ten of which are Regular Members of ICSU-WDS. The DAACs serve user communities in various Earth Science disciplines and are geographically distributed across the United States as shown in Figure 1. The majority of the DAACs have UWGs consisting of experts who represent broad user communities in the respective Earth Science disciplines; specifically, the members of a UWG are regular users of the data served by that particular DAAC. These UWGs have typically existed since the mid-1990s, when the DAACs started their operations, although their memberships have changed during that time as members have regularly rotated off and new members have been recruited.
Each UWG has a charter that specifies its role in providing advice to its DAAC regarding the data and services offered. A summary of the common aspects of these charters is given below:
• Ensure science user involvement in the planning, development, and operations of the DAAC.
• Define the DAAC's science goals.
• Provide recommendations on annual work plans and long-range planning.
• Represent the science user community in reviewing and guiding the DAAC activities.
• Review progress and performance of the DAAC relative to its missions.
• Assess data product and service quality by periodically reviewing applications of the data products made by the broad user community, and by sampling the confidence of the user community.
• Communicate users’ assessment of the DAAC performance to the DAAC and NASA.
• Advise the DAAC on the levels of service provided to the user community.
• Advise the DAAC on improvements to the user access, user interface, and relative priorities for DAAC-related functions.
• Recommend to the DAAC and NASA the addition of new data products and new services based upon documented NASA research needs.
• Provide advice on research and development in support of product prototyping and generation.
The UWGs generally hold annual in-person meetings attended by representatives of the responsible DAAC and the ESDIS Project. Staff members from some of the other DAACs also attend the meetings to benefit from the discussions that may apply to their own activities. The UWGs also hold teleconferences a few times each year. UWG meetings consist of presentations by the DAAC staff addressing data and services offered, on-going developments, responses to prior recommendations by the UWG, and the status of action items. Following the presentations, the UWG provides comments on implementation of past recommendations and advice regarding improvements for the future. The advice can either be DAAC-specific or apply to the broader cross-DAAC and ESDIS Project activities.
Some examples of advice provided by the UWGs in the past year are given below. 'Data search' is highlighted separately because this is the primary way the users find the data they need in repositories that are growing larger by the day, and so getting this right has important implications for the user community.
• Obtain user input on the design and usability of the Earthdata Search Client. Search relevance should be based on user experience in addition to the characteristics of the data.
• Provide multiple avenues to access data holdings so that different types of users have tools appropriate for their needs.
• Make data more readily searchable to a non-technical audience. Develop a data search page intended for inexperienced and non-specialist users, containing popular data products and explanations of these products.
• Make filtering and refining more obvious on webpages ('Amazon style').
• Add ability to save a search/search parameters.
• Enable users to rank search results in order of 'popularity' based on, for example, the number of previous downloads (search relevance by popularity).
• Enable users to rank search results by relevance in order to help them deal with the large number of search returns.
• Add the ability to filter by spatial and temporal resolution.
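To make the ranking and filtering recommendations above a little more concrete, here is a minimal Python sketch. It is not how the Earthdata Search Client or any DAAC actually implements search; the `DatasetRecord` fields, the sample catalogue, and the function names are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical search hit; every field name here is an assumption."""
    title: str
    relevance: float                 # text-match score between 0 and 1
    downloads: int                   # number of previous downloads
    spatial_resolution_km: float     # grid spacing of the product
    temporal_resolution_days: float  # interval between observations


def filter_by_resolution(
    records: List[DatasetRecord],
    max_spatial_km: Optional[float] = None,
    max_temporal_days: Optional[float] = None,
) -> List[DatasetRecord]:
    """Keep records at least as fine as the requested resolutions."""
    kept = []
    for record in records:
        if max_spatial_km is not None and record.spatial_resolution_km > max_spatial_km:
            continue
        if max_temporal_days is not None and record.temporal_resolution_days > max_temporal_days:
            continue
        kept.append(record)
    return kept


def rank(records: List[DatasetRecord], by: str = "relevance") -> List[DatasetRecord]:
    """Sort best-first by relevance or by popularity (download count)."""
    if by == "relevance":
        key = lambda r: (r.relevance, r.downloads)
    elif by == "popularity":
        key = lambda r: (r.downloads, r.relevance)
    else:
        raise ValueError(f"unknown ranking criterion: {by!r}")
    return sorted(records, key=key, reverse=True)


# Made-up catalogue entries, used only to exercise the functions.
CATALOG = [
    DatasetRecord("Daily sea surface temperature", 0.90, 120, 1.0, 1.0),
    DatasetRecord("Monthly precipitation", 0.60, 5000, 25.0, 30.0),
    DatasetRecord("Vegetation index composite", 0.75, 800, 0.25, 16.0),
]

if __name__ == "__main__":
    fine_grained = filter_by_resolution(CATALOG, max_spatial_km=1.0)
    for record in rank(fine_grained, by="popularity"):
        print(record.title, record.downloads)
```

Keeping the ranking criterion as a user-selectable parameter, rather than a fixed ordering, mirrors the advice that different types of users need different ways into the same holdings.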
• Create a simpler and more intuitive user interface that is optimized for touchscreen.
• Convene a focus group to determine how current users have performed analyses. The focus group would also solicit guidance from data producers.
• Communicate with Principal Investigators (who provide datasets to the DAAC) at least twice a year to ensure data and metadata are current and accurate.
• Work within security requirements, but have a reliable, easy to access system that provides users with direct access to all the data holdings. Include an Application Programming Interface (API) and OPeNDAP with all data holdings available using these tools.
• Focus on increasing the capacity of the service so that the entire satellite data record can be accessed using a secure but easy to use interface, and which includes an API to script data access.
• Expand tools that facilitate coordinate system changes. (Context: The impact of the Digital Elevation Map choice and coordinate system on the ability for users to intercompare radar and optical imagery is currently poorly understood.)
• Explore ways to reorganize high value datasets into analytics-optimized storage, while assessing the impact on the user community, and emphasizing variable-level analysis.
• Pursue/develop a plan to become an event-driven repository for major hydrometeorological events (e.g., Hurricane Harvey), providing bundled datasets that enable researchers to analyze specific events.
• Examine issues involved with multisensor analyses, including cross-DAAC coordination. (Context: User needs are becoming increasingly complex. Analyses often include multisensor data from a variety of sources.)
• Explore methods to improve and enhance data and computing, and tools using alternative access points; for example, commercial cloud systems.
• Provide users more consistency DAAC-wide (e.g., comparable product descriptions, more unified terminology, and similar 'look and feel').
As these examples suggest, significant involvement by knowledgeable users in advising the DAACs has been extremely valuable for maintaining the high level of their performance. UWGs are one of several mechanisms that the NASA ESDIS Project and the DAACs use for receiving user inputs and feedback regarding their data and services. As enshrined in CoreTrustSeal Requirement VI, it is recommended that data repositories in all disciplines engage advisory groups in their respective areas to ensure responsiveness to their user communities.