A Blog post by H. K. 'Rama' Ramapriyan (Science Systems and Applications, Inc. contractor for NASA Earth Science Data and Information System Project) and Alex de Sherbinin (WDS Scientific Committee member)
Any service organization exists to serve its customers. For a science data repository, the customers are the users of its data, be they researchers or applied users in the scientific domain of that repository. For the data repository to serve the community best, it is essential that its managers understand the requirements of the users, respond to their changing needs, and evolve with technological changes as well. Many different mechanisms can be used to interact with users to maintain a continual understanding of their needs. NASA’s Earth Science Data and Information System (ESDIS) Project and its constituent Distributed Active Archive Centers (DAACs) engage with User Working Groups (UWGs) to fulfil this function. The purpose of this Blog post is to discuss how UWGs have benefitted ESDIS and the DAACs, and to share lessons gleaned from 25 years of UWG experience to demonstrate how advisory bodies can improve data curation and management for the benefit of diverse user communities.
Figure 1. Map of the NASA DAACs.
NASA’s ESDIS Project, a WDS Network Member, is responsible for 12 DAACs, ten of which are Regular Members of ICSU-WDS. The DAACs serve user communities in various Earth Science disciplines and are geographically distributed across the United States, as shown in Figure 1. Most of the DAACs have UWGs consisting of experts who represent the broad user communities in their respective Earth Science disciplines; specifically, the members of a UWG are regular users of the data served by that particular DAAC. These UWGs have typically existed since the mid-1990s, when the DAACs started operations, although their memberships have changed over that time through regular rotation of members off the groups and the recruitment of new members.
Each UWG has a charter that specifies its role in providing advice to its DAAC regarding the data and services offered. A summary of the common aspects of these charters is given below:
• Ensure science user involvement in the planning, development, and operations of the DAAC.
• Define the DAAC’s science goals.
• Provide recommendations on annual work plans and long-range planning.
• Represent the science user community in reviewing and guiding DAAC activities.
• Review progress and performance of the DAAC relative to its missions.
• Assess data product and service quality by periodically reviewing applications of the data products made by the broad user community, and by sampling the confidence of the user community.
• Communicate users’ assessment of DAAC performance to the DAAC and NASA.
• Advise the DAAC on the levels of service provided to the user community.
• Advise the DAAC on improvements to user access, the user interface, and relative priorities for DAAC-related functions.
• Recommend to the DAAC and NASA the addition of new data products and new services based upon documented NASA research needs.
• Provide advice on research and development in support of product prototyping and generation.
The UWGs generally hold annual in-person meetings attended by representatives of the responsible DAAC and the ESDIS Project. Staff members from some of the other DAACs also attend to benefit from discussions that may apply to their own activities. The UWGs also hold teleconferences a few times each year. UWG meetings consist of presentations by DAAC staff addressing the data and services offered, ongoing developments, responses to prior UWG recommendations, and the status of action items. Following the presentations, the UWG comments on the implementation of past recommendations and gives advice on improvements for the future. This advice can either be DAAC-specific or apply to broader cross-DAAC and ESDIS Project activities.
Some examples of advice provided by the UWGs in the past year are given below. 'Data search' is highlighted separately because search is the primary way users find the data they need in repositories that are growing larger by the day, and so getting this right has important implications for the user community.
• Obtain user input on the design and usability of the Earthdata Search Client. Search relevance should be based on user experience in addition to the characteristics of the data.
• Provide multiple avenues to access data holdings so that different types of users have tools appropriate for their needs.
• Make data more readily searchable by a non-technical audience. Develop a data search page intended for inexperienced and non-specialist users, containing popular data products and explanations of these products.
• Make filtering and refining more obvious on webpages ('Amazon style').
• Add the ability to save a search/search parameters.
• Enable users to rank search results in order of 'popularity' based on, for example, the number of previous downloads (search relevance by popularity).
• Enable users to rank search results by relevance in order to help them deal with the large number of search returns.
• Add the ability to filter by spatial and temporal resolution.
• Create a simpler and more intuitive user interface that is optimized for touchscreens.
• Convene a focus group to determine how current users have performed analyses. The focus group would also solicit guidance from data producers.
• Communicate with Principal Investigators (who provide datasets to the DAAC) at least twice a year to ensure data and metadata are current and accurate.
• Work within security requirements, but have a reliable, easy-to-access system that provides users with direct access to all the data holdings. Include an Application Programming Interface (API) and OPeNDAP, with all data holdings available using these tools (a sketch of such scripted access follows this list).
• Focus on increasing the capacity of the service so that the entire satellite data record can be accessed using a secure but easy-to-use interface, including an API to script data access.
• Expand tools that facilitate coordinate system changes. (Context: The impact of the choice of Digital Elevation Model and coordinate system on users' ability to intercompare radar and optical imagery is currently poorly understood.)
• Explore ways to reorganize high-value datasets into analytics-optimized storage, while assessing the impact on the user community and emphasizing variable-level analysis.
• Pursue/develop a plan to become an event-driven repository for major hydrometeorological events (e.g., Hurricane Harvey), providing bundled datasets that enable researchers to analyze specific events.
• Examine issues involved with multisensor analyses, including cross-DAAC coordination. (Context: User needs are becoming increasingly complex. Analyses often include multisensor data from a variety of sources.)
• Explore methods to improve and enhance data, computing, and tools using alternative access points; for example, commercial cloud systems.
• Provide users more consistency DAAC-wide (e.g., comparable product descriptions, more unified terminology, and a similar 'look and feel').
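Several of the recommendations above concern scripted, API-based access. As a minimal illustrative sketch (not an official example; the product short name, bounding box, and time range are hypothetical placeholders), the following Python snippet queries NASA's Common Metadata Repository (CMR), the metadata backend behind the Earthdata Search Client, for matching data granules:

```python
import requests

# Illustrative granule search against NASA's CMR API. The short_name,
# bounding box, and time range below are placeholder values.
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"

params = {
    "short_name": "EXAMPLE_PRODUCT",  # hypothetical collection short name
    "bounding_box": "-10,35,5,45",    # west,south,east,north (degrees)
    "temporal": "2017-06-01T00:00:00Z,2017-06-30T23:59:59Z",
    "page_size": 10,
}

response = requests.get(CMR_GRANULE_SEARCH, params=params, timeout=30)
response.raise_for_status()

# The JSON response follows an Atom-like layout: feed -> entry list.
for granule in response.json()["feed"]["entry"]:
    print(granule["title"])
```

Scripted access of this kind is what makes the UWG recommendations on APIs and filtering (by time, space, and resolution) actionable for users who work with many granules at once.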
As these examples suggest, significant involvement by knowledgeable users in advising the DAACs has been extremely valuable for maintaining their high level of performance. UWGs are one of several mechanisms that the NASA ESDIS Project and the DAACs use to receive user input and feedback regarding their data and services. As enshrined in CoreTrustSeal Requirement VI, it is recommended that data repositories in all disciplines engage advisory groups in their respective areas to ensure responsiveness to their user communities.
In the 'European State of the Climate 2017' report, presented at the European Parliament in April 2018, glaciers are one of the five 'Headline Climate Indicators', which focus on long-term key indicators of global and regional climate change.
On the global level, the World Glacier Monitoring Service (WGMS; WDS Regular Member) is in charge of compiling changes in glacier length, area, volume, and mass, based on in-situ (length, mass) and remote sensing (area, volume) measurements. An increasing number of glacier-related research projects produce data from air- and space-borne sensors. Combined with the existing in-situ network, these studies complement the multi-level glacier monitoring strategy of the Global Terrestrial Network for Glaciers. WGMS actively fosters glacier observations from space through several initiatives, such as its cooperation with the Copernicus Climate Change Service (C3S) or the Climate Change Initiative of the European Space Agency.
C3S is implemented by the European Centre for Medium-Range Weather Forecasts (ECMWF) within the Copernicus programme, Europe’s flagship project to monitor the Earth and its many ecosystems. C3S delivers freely accessible operational data and information services that provide users with reliable and up-to-date information related to environmental issues. These services are based on the Sentinel satellites and other contributing space missions, as well as 'in-situ' (meaning 'in the field' or 'on-site') measurement sensors on land, at sea, and in the air. Collectively, these sources provide huge amounts of Earth observation data that are converted into products for up to 20 Essential Climate Variables. The products can be freely accessed from a one-stop portal: the Copernicus Climate Data Store (CDS). This wealth of climate information forms the basis for generating a wide variety of climate indicators aimed at supporting adaptation and mitigation policies in Europe in a number of sectors; for example, water management, tourism, energy, and health (Fig. 1).
WGMS and the Department of Geography at the University of Zurich, together with Gamma Remote Sensing, the University of Oslo, and the US National Snow and Ice Data Center are contributing glacier data and information to C3S. The partners compile and produce information on the global distribution of glaciers (inventories), as well as their volume and mass changes, using field and remote sensing observations at a global scale. Starting in 2017, they integrated both the Randolph Glacier Inventory 6.0 and the available glacier volume and mass change series from the Fluctuations of Glaciers database maintained by WGMS into the CDS. Furthermore, C3S enables the improvement and extension of existing glacier inventory datasets and will also boost the compilation and computation of glacier volume changes (geodetic method) using space-borne sensors.
Figure 2. Cumulative glacier mass changes in Europe from 1967–2017 for glaciers with long-term records in nine different regions. Cumulative mass balance values are given in the unit ‘metre water equivalent (m w.e.)’ relative to 1997. Data source: WGMS (2017, updated). Credit: WGMS / C3S.
Similar to the global mean, glaciers in Europe experienced relatively stable to slightly negative mass balances in the 1970s and 1980s, with short periods of mass gain (e.g., in the Alps at the end of the 1980s, or in coastal Scandinavia in the 1990s), followed by strong and continued mass losses after 1997 (Fig. 2). Since 1997, the monitored glaciers in Europe have lost between 7 m and 23 m of water equivalent, that is, between 7 and 23 tonnes per square metre. On average, this is a loss of about 14 tonnes of water per square metre of ice cover or, when multiplied by the total glacier area (51,250 km², excluding peripheral glaciers on Greenland), about 690 km³; namely, 14 times the water volume of Lake Constance. In other words, the mean annual ice loss of these glaciers (35 km³) would cover the fresh water needs of New York City (see NYC OpenData 2018), the city with the highest consumption worldwide (Kennedy et al. 2015), for a period of about 25 years. The graph also highlights strong regional variability, in particular when comparing Northern Scandinavia and the Alps.
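As a quick back-of-the-envelope check, these rounded figures are internally consistent:

\[
V \approx \bar{h}\,A \approx 0.0135\ \mathrm{km} \times 51\,250\ \mathrm{km^2} \approx 690\ \mathrm{km^3},
\qquad
\frac{690\ \mathrm{km^3}}{20\ \mathrm{yr}} \approx 35\ \mathrm{km^3\,yr^{-1}},
\]

where \(\bar{h} \approx 13.5\) m w.e. is the mean specific mass loss over the 20 years from 1997 to 2017 (quoted above, rounded, as about 14 tonnes per square metre). At a consumption of roughly \(1.4\ \mathrm{km^3\,yr^{-1}}\), implied by the figures cited, one year's loss of 35 km³ would indeed supply New York City for \(35/1.4 \approx 25\) years.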
A Blog post by Guoqing Li (WDS Scientific Committee member)
On 12 February 2018, the Chinese Academy of Sciences (CAS) announced the launch of its 'Big Earth Data Science Engineering' (CASEarth) project in Beijing. With total funding of approximately RMB 1.76 billion (USD 279 million) and more than 1,200 scientists from 130 institutions around the world involved, this five-year project (2018–2022) is headed by Huadong Guo, an academician at the CAS Institute of Remote Sensing and Digital Earth.
The overall objective of CASEarth is to establish an International Centre of Big Earth Data Science with the mission of:
1. Building the world’s leading Big Earth Data infrastructure to overcome the bottlenecks of data access and sharing.
2. Developing a world-class Big Earth Data platform to drive the discipline.
3. Constructing a decision-support system to serve high-level government authorities and solve multiple issues.
This joint centre will provide very large volumes of data, comprehensively enhancing national technological innovation, scientific discovery, macro-level decision-making, and public knowledge dissemination, among other significant outputs.
Framework of the Big Earth Data Science Engineering Project (CASEarth)
To achieve these goals, CASEarth consists of eight Work Packages (research components) intended to deliver technological advances and innovative results, paying special attention to data sharing and encouraging interested scholars to use the platform to carry out their research. For example, CASEarth Small Satellites will provide continuous observation data from space according to the missions of the project; the Big Data and Cloud Service Platform will build a cloud-based Big Data information infrastructure and processing engines with 50 PB of storage and 2 Pflops of data-intensive computing resources; and the Digital Earth Science Platform will integrate multidisciplinary data and information into a visualization facility to meet the requirements of decision-making and applications in areas such as Biodiversity and Ecological Security, Three-Dimensional Information Ocean, and Spatiotemporal Three-Pole Environment. In particular, it will provide comprehensive displays and dynamic simulations of sustainable development processes and ecological conditions along the Belt and Road, and provide accurate evaluation and decision support for the sustainable development of Beautiful China.
In summary, CASEarth is expected to increase open data and data sharing; realize comprehensive assimilation of data, models, and services in the fields of resources, the environment, biology, and ecology; and build platforms with global influence. CASEarth will explore a new paradigm of scientific discovery involving Big Data-driven multidisciplinary integration and worldwide collaboration, and will constitute a major breakthrough in Earth Systems Science, Life Sciences, and related disciplines.
Nominate an Early Career Researcher for the 2018 WDS Data Stewardship Award (Deadline: 21 May 2018)
A Blog post by Linhuan Wu (2017 WDS Data Stewardship Award Winner)
Demands on Information Technology from Culture Collections
Global culture collections play a crucial role in the long-term, stable preservation of microbial resources, and provide authentic reference materials for scientists and industry. With the development of modern biotechnology, the available knowledge on any given microbial species is growing at an unprecedented rate. In particular, since the advent of high-throughput sequencing technology, enormous amounts of sequence data have accumulated and are increasing exponentially, making data processing and analysis capacity indispensable to microbiological and biotechnological research. Culture collections must therefore not only preserve and provide authentic microbial materials, but also function as data and information repositories serving academia, industry, and the public.
Figure 1. A system-level overview of the WDCM databases.
There is a gap between the capacity of culture collections and the needs of their potential users for a stable and efficient data management system, as well as for advanced information services. Curators and scientists at culture collections now need not only to share data, but also to design and implement data platforms that meet the changing requirements of the microbial community. However, not all culture collections can afford the infrastructure and personnel to maintain their own databases and ensure a high level of data quality, let alone provide additional services such as visualization, statistical, and other analytical tools to enhance understanding and utilization of the microbial resources they preserve.
The WFCC-MIRCEN World Data Centre for Microorganisms (WDCM) was established 50 years ago and became a WDS Regular Member in 2013. The longstanding aim of WDCM is to provide integrated information services for culture collections and microbiologists all over the world. To clear the roadblocks to utilizing information technology and building capacity in culture collections, WDCM has constructed its own information system: a comprehensive data platform with several constituent databases (Fig. 1).
Figure 2. Web interface of CCINFO database.
Culture Collections Information Worldwide (CCINFO), which serves as a metadata recorder, has collected detailed information on 745 culture collections in 76 countries and regions as of January 2018 (Fig. 2). In addition, WDCM has assigned a unique identifier to each culture collection registered in CCINFO to facilitate further data sharing.
To help culture collections establish online catalogues and further digitalize information on their microbial resources, WDCM launched the Global Catalogue of Microorganisms (GCM) project in 2012 and has gradually built up a system with fast, accurate, and convenient data accessibility. The current version of the GCM database records 403,572 strains from 118 collections in 46 countries and regions, and performs automatic data quality control, including validation of data formats and contents—for example, checking species names against taxonomic databases.
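A minimal sketch of the kind of automated quality-control check described above is given below. The accepted-names list and record fields are illustrative only; a real pipeline such as GCM's would query an authoritative taxonomic database rather than a hard-coded set.

```python
# Illustrative strain-record validation. ACCEPTED_NAMES stands in for an
# authoritative taxonomic reference; the record fields are hypothetical.
ACCEPTED_NAMES = {
    "Escherichia coli",
    "Bacillus subtilis",
    "Saccharomyces cerevisiae",
}

def validate_record(record):
    """Return a list of quality-control problems found in one strain record."""
    problems = []
    name = record.get("species_name", "").strip()
    if not name:
        problems.append("missing species name")
    elif name not in ACCEPTED_NAMES:
        problems.append("unrecognized species name: %r" % name)
    if not record.get("collection_id"):
        problems.append("missing culture collection identifier")
    return problems

print(validate_record({"species_name": "Escherichia coli", "collection_id": "CC-001"}))
print(validate_record({"species_name": "E. colli"}))  # misspelled name -> flagged
```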
At present, a major problem impeding the exploitation of microbial resources by academia and bioindustries is the low efficiency of data sharing. Since most culture collections use different data formats for data management and publication, users have to sift through a huge body of data to find valuable information, let alone obtain suitable microbial materials efficiently, resulting in considerable waste of time and money.
Although many international organizations and initiatives have implemented their own data standards or recommended datasets for microbial resources data management—for instance, Darwin Core and the Organization for Economic Cooperation and Development's Best Practice Guidelines for Biological Resource Centres—there is still a long way to go before realizing efficient data exchange, sharing, and integration globally.
WDCM has established minimum and recommended datasets and has implemented them in the database management system of GCM to ensure uniform data formats and data fields. WDCM is also committed to developing a standard under International Organization for Standardization Technical Committee 276 – Biotechnology; namely, AWI 21710, 'Specification on data management and publication in microbial resource centres (mBRCs)'. This work aims to improve the data traceability of microbial resources preserved in different culture collections by normalizing local identifiers and mapping those already in use, as well as to give recommendations for popularizing a machine-readable persistent identifier system to enable data integration.
The WDCM database currently uses a centralized model for data integration. Future developments based on 'Big Data' technologies, including the Semantic Web and Linked Open Data, will enable the system to provide more flexible data integration from broader data sources. Linking WDCM strain data to, for example, environmental, chemistry, and research literature datasets can add value to data mining and help in targeting microorganisms as potential sources of new drugs or industrial products. Linking microbial strain data to climate, agriculture, and environmental data can also provide tools for climate-smart agriculture and food security. WDCM will work with Research Infrastructures, publishers, research funders, data holders, and individual collections and scientists to ensure data interoperability and the provision of enhanced tools for research and development.
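As a toy illustration of the Linked Open Data idea mentioned above (all namespaces, identifiers, and property names are hypothetical, and this is not WDCM's actual data model), the following Python sketch uses rdflib to describe a strain and link it by URI to an external sampling site and a publication:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

# Hypothetical namespaces; a real deployment would use published vocabularies.
STRAIN = Namespace("http://example.org/strain/")
PROP = Namespace("http://example.org/prop/")

g = Graph()
strain = STRAIN["GCM-00042"]
g.add((strain, RDF.type, PROP.MicrobialStrain))
g.add((strain, RDFS.label, Literal("Streptomyces sp. Example-1")))
# Linking by URI to records in other datasets is what lets generic tools
# traverse from strains to environments, literature, and beyond.
g.add((strain, PROP.isolatedFrom, URIRef("http://example.org/site/soil-sample-17")))
g.add((strain, PROP.describedIn, URIRef("https://doi.org/10.1234/example-paper")))

print(g.serialize(format="turtle"))
```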
The Scientific Committee of the ICSU World Data System (WDS-SC) believes that all Early Career Researchers (ECRs) require a basic set of data-related skills. The following presents essential areas of Research Data Management (RDM) that are relevant to budding scientists and that cover the range of issues they are likely to encounter as they collect, analyze, and manage data over the course of their careers. It has been formulated with the assumption that ECRs play an important role in future data sharing, and must take an interest in data stewardship and best practices in data management, including how to make data openly accessible and reusable.
The Essentials of RDM
Open Data. Almost all science funding agencies require that research results, including data, be made publicly available. Journals, too, are requesting that authors of scientific articles post their data, and even the code used to generate results. Data sharing and open data are important to the advancement of science, and data reuse has resulted in important scientific discoveries. ECRs need to be familiar with the FAIR principles—that data need to be Findable, Accessible, Interoperable, and Reusable—and work towards data sharing and research transparency in their own work.
Big Data. The term ‘Big Data’ arose to describe the Volume, Variety, and Velocity (the three Vs) of data being generated almost continuously by a range of sciences, from Biomedical to Earth Sciences. An ECR should have an understanding of what is meant by Big Data, and how they are increasingly important to a variety of scientific fields. Familiarity with tools and approaches to analyzing Big Data is also an important requisite for career advancement.
Definitions and Jargon. An ECR must know some of the terminology in the data arena, such as ‘ontologies’, ‘informatics’, ‘metadata’, and ‘knowledge networks’. A critical element for data sharing is common definitions, and particular attention should be paid to understanding ontologies, thesauri, and controlled vocabularies: what ontologies are, where to find them, and how to create them, as well as ways for integrating ontologies and using them to support metadata and data disambiguation efforts.
Funder Requirements and Writing Data Management Plans (DMPs). Funders increasingly require that scientists articulate, at the onset of a project and through a DMP, how they will ensure the long-term open availability of their data. An ECR should know how to prepare a thoughtful DMP, which will also increase their odds of obtaining funding. Awareness of the domain-specific data repositories where their data may be archived is also important (see below). A conceptually ideal DMP is extensible, interoperable, and machine readable, and an ECR should understand why these properties are needed and how to address them.
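To make 'machine readable' concrete, here is a minimal sketch of a DMP fragment expressed as structured data. The field names are loosely modelled on the RDA DMP Common Standard for machine-actionable DMPs, and all project details are hypothetical placeholders:

```python
import json

# A minimal, illustrative machine-readable DMP fragment. Field names are
# loosely inspired by the RDA DMP Common Standard; values are placeholders.
dmp = {
    "dmp": {
        "title": "DMP for the Example Glacier Survey",
        "language": "eng",
        "dataset": [
            {
                "title": "Glacier front positions 2018-2020",
                "personal_data": "no",
                "distribution": [
                    {
                        "format": "text/csv",
                        "license": [
                            {"license_ref": "https://creativecommons.org/licenses/by/4.0/"}
                        ],
                        "host": {"title": "Example disciplinary repository"},
                    }
                ],
            }
        ],
    }
}

print(json.dumps(dmp, indent=2))
```

Because every element is a named, typed field rather than free prose, funders and repositories can validate such a plan automatically and track whether its commitments are met.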
Data Organization and Storage. Organization and long-term preservation of data is an increasingly daunting task. As they begin to generate data, an ECR should know methods for ensuring the sustainability and continuance of databases. Documentation, versioning, the choice of technologies and standards, and archiving also need to be understood. Fundamental to this is the principle that data have several end uses throughout their lifecycle—each with its own requirements—and an ECR should be familiar with the concepts of ‘Analysis-ready’ and ‘Publication-ready’ data (data with quality assurance, citation, and metadata).
Metadata Formats, Usage, and Data Discovery. Metadata are critical for data discovery and reuse, and are the bread-and-butter of catalogue services. Metadata standards are strongly format- and discipline-dependent, but common elements are increasingly captured in efforts by DataCite, DCAT, and others. The International Organization for Standardization (ISO) has also developed a number of domain-specific standards, such as ISO 19115 for geospatial information. An ECR should recognize the importance of proper metadata development, and be aware of a number of the standards that are available.
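For a sense of what such common elements look like, the sketch below renders the mandatory properties of the DataCite Metadata Schema (identifier, creator, title, publisher, publication year, and resource type) in simplified form. All values are hypothetical placeholders, and the exact JSON layout used by DataCite's services differs in detail:

```python
# Simplified rendering of DataCite's mandatory metadata properties.
# Values are hypothetical; real records carry many optional fields too.
record = {
    "identifier": {"identifier": "10.1234/example-doi", "identifierType": "DOI"},
    "creators": [{"name": "Doe, Jane", "affiliation": "Example University"}],
    "titles": [{"title": "Example Arctic sea-ice thickness dataset"}],
    "publisher": "Example Data Repository",
    "publicationYear": "2018",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}
```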
Data Documentation. To be of use to other researchers, data need to be carefully documented: to describe how they were developed, their limitations, and to what use they may be put. Incomplete and cursory documentation often renders data unfit for future use. An ECR should have knowledge of the different approaches taken to data documentation in various fields of science, as well as of the increasingly important practice of properly referencing protocols, methods, and samples.
Data Formats and Interoperability. Data formats and the applicable standards for data and metadata depend largely on the scientific discipline and the type of software used. Some data formats are common across disciplines, but this is not the norm. An ECR should support open formats and well-entrenched standardized services (e.g., CSV files, DDI, OGC services, and OPeNDAP, to name a few), and an overview of their scope is a useful starting point for making appropriate choices. For a discussion of data standards and interoperability in the health domain, visit AHIMA.
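A small sketch of what working with two of these open formats and services can look like in practice follows; the file name, OPeNDAP URL, and variable name are hypothetical placeholders:

```python
import pandas as pd
import xarray as xr

# CSV: a plain-text tabular format readable by virtually any tool.
df = pd.read_csv("field_measurements.csv")
print(df.describe())

# OPeNDAP: lets clients subset remote gridded data without downloading
# whole files; xarray opens such endpoints much like local netCDF files.
ds = xr.open_dataset("https://opendap.example.org/dodsC/sst_monthly.nc")
print(ds["sst"].sel(time="2017-06").mean())
```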
Choosing a Long-term Repository. An ECR must understand not only which disciplinary repositories are best suited to the domain in which they are working, but also the ‘trustworthiness’ of these data repositories and how this is underpinned by a hierarchy of certification standards (e.g., the CoreTrustSeal). Examining the strengths and weaknesses of different repositories in terms of data access, documentation, and so on helps an ECR to conceptualize what makes for a successful data service.
Standardization, Licences, and Intellectual Property Rights. To aid in their reuse, data should ideally be made available in standardized schema and using standardized services. Each ‘data family’ has its own set of such standards, and an ECR should know which are relevant to their discipline. Moreover, with Open Data an increasing norm in the scientific community (see above), an ECR should be aware of the different types of licensing and copyright arrangements under which data are often disseminated, in addition to the importance of machine-readable licensing arrangements.
Data Ethics. While primarily salient for ECRs working in the Health Sciences, Social Sciences, and Humanities, ethical issues arising throughout the data management lifecycle should be of broad interest to all researchers likely to engage with disclosive data (e.g., research on rare biodiversity, where there may be commercial interests in its exploitation). Areas that an ECR should have knowledge about include, but are not limited to: data ownership and stewardship, handling sensitive data, consent, privacy and confidentiality, reconciling ethical and legal norms impacting data sharing and exportation, constructing equitable partnerships and data sharing agreements, and navigating the complexity of ethics review.
Data Publication, Citation, and Persistent Identifiers. An increasing number of data journals, such as the Nature Group’s Scientific Data, are now available for the publication of datasets. In addition, proper citation of data using persistent identifiers is becoming the norm in the scientific community. An ECR should be aware of the approaches to data publication and citation and the importance of doing these properly.
Research Translation and Societal Benefits. To facilitate use of the data collected and stored within archives, an increasingly wide range of software has been developed for decision analytics and support. In addition, there is a great deal of work on integrating data across disciplines to support new discoveries. An ECR should understand the value that well-curated and sustained data management provides to the scientific community and to larger society, and have some understanding of data indicators, decision-analysis techniques, and the graphical interfaces that can simplify exchanges. Linked ontologies and robust metadata can facilitate these possibilities.
Citizen Science and Crowdsourced Scientific Data. Citizen science and crowdsourced data have already proven to be of tremendous scientific value. However, the modest budgets of these initiatives typically mean that systems are lacking for the curation and long-term stewardship of their data. An ECR should know what citizen science is, and how to design an initiative that engages citizens in improving scientific data collection and use: addressing issues of data stewardship, validation, confidentiality, dissemination, and licensing from the beginning. SciStarter provides a good introduction to Citizen Science, and an example of pointers for the design of citizen science can be found at the Cornell Lab of Ornithology.
Figure 1 shows the number of observatories from which we keep data in analogue and digital forms.
Optical recording on photographic paper was originally used for most analogue recording. Digital recording of the data observed by modern electronic magnetometers started to increase from around 1980 and, in 1992, finally overtook analogue recording. By 2000, the number of analogue stations had decreased to less than 10% of the total, and now all data are provided in digital form. In the mid-1990s, the Internet and World Wide Web became popular, with WDC - Geomagnetism, Kyoto starting its web service in 1995.
The WDCs for Geomagnetism have been exchanging the data collected at each data centre among themselves for 60 years. During the analogue era, it took money and manpower to collect data from distant observatories and copy them onto microfilm, so large data centres such as WDC-A in Boulder (now the World Data Service for Geophysics) and WDC-B in Moscow (now WDC - Solar-Terrestrial Physics, Moscow) mainly collected the data and distributed them to the other, smaller data centres. After the shift to digital data and the Internet, the situation changed: collecting data via the Internet is much easier than collecting photographic records from distant stations, and international collaboration is also much easier than before.
Nowadays, more than half of geomagnetic data are provided through an international consortium, INTERMAGNET (WDS Network Member). The transition from analogue to digital recording thus also changed the main player in the provision of geomagnetic data services.