Boris Biskaborn Wins 2016 WDS Data Stewardship Award

Congratulations to Dr Boris Biskaborn, who was selected by the WDS Scientific Committee (WDS-SC) as the 2016 winner of the WDS Data Stewardship Award. This accolade highlights exceptional contributions to the improvement of scientific data stewardship by early career researchers through their engagement with the community, academic achievements, and innovations. Dr Biskaborn will be ...

WDS-convened Session at AGU 2016 – Call for Papers (Deadline: 3 August)

Paper contributions are invited for the session 'Publishing and Managing Data: The case for Trustworthy Digital Repositories', accepted in the Earth and Space Science Informatics focus group at the AGU 2016 Fall Meeting. This session is expected to highlight the important role played by the WDS community in seizing the opportunities and addressing the challenges outlined in the below ...

RDA and ICSU-WDS Announce the Scholix Framework for Linking Data and Literature

New framework presents vision and guidelines for linking research data and literature using a common, global approach. The Research Data Alliance (RDA) and the International Council for Science World Data System (ICSU-WDS) announce a new global framework for linking publications and datasets. The Scholix framework (Scholarly Link Exchange) represents a set of aspirational principles and ...

Call for Papers for SciDataCon 2016 Now Closed!

Following numerous requests, the deadline for submitting abstracts to SciDataCon 2016 was extended until Monday, 30 May. With that deadline passed, we would like to confirm that the submissions system has now been closed. Thank you to everyone for your poster and paper abstracts; the response to the Call has been overwhelming! A reminder that online registration is open for International ...

Identifying the Birth Defect Epidemic for Zika Virus: Where Are the Relevant Databases?

A Blog post by Elaine Faustman (WDS Scientific Committee member)

Hello! I am a Professor in a School of Public Health who directs an Institute for Risk Analysis and Risk Communication, and in that role I am frequently asked questions about current health risks. The recent Zika epidemic is a significant example of such a request, and it provides an opportunity to illustrate the use of databases to answer risk assessment questions for this emergent issue.

In risk assessment for Zika virus, we are interested in identifying specific health impacts—including potential birth defects—that may be associated with exposure. We are also interested in the potency of the virus, the duration of infection, and whether that duration relates to the severity of the health impacts. In this post, we pose the question: what databases and data sources exist for us not only to examine this epidemic, but also to prepare for potential future epidemics? I share with you example databases that I used to answer these questions in a recent journal club, along with a series of comments and conclusions about the utility of these databases for risk assessment questions.

Background on Zika Virus

I’d like to start by providing a little background on Zika virus, as one critical step in risk assessment is hazard identification and characterization. Though Zika virus was first discovered in 1947 in Africa, the first large epidemic was not reported until 2007 on the Pacific island of Yap (Al-Qahtani et al. 2016). Since then, outbreaks have been reported in French Polynesia (2013), and in Brazil and surrounding countries (Chang et al. 2016). The first case of Zika virus in Brazil was reported in May 2015. Currently, 30 countries in the Americas have reported active cases of Zika virus. Though Zika is usually transmitted through the bite of a mosquito of the Aedes genus (Aedes albopictus and Aedes aegypti), it can also be spread through sexual activity and through blood, such as via transfusions. For most healthy individuals, infection leads to mild flu-like symptoms or may even be asymptomatic. However, infection (both symptomatic and asymptomatic) during pregnancy can lead to irreparable birth defects that severely impair child development (Kleber de Oliveira et al. 2016).

The most common birth defect associated with Zika virus exposure during pregnancy is microcephaly (Rasmussen et al. 2016). The basic definition of microcephaly is 'the clinical finding of a small head compared with infants of the same sex and gestational age' (CDC 2016). Problematically, there is no universally accepted definition of microcephaly; thus, when tracking cases of microcephaly and Zika virus across healthcare providers, provinces, states, countries, and regions, the criteria employed can be drastically different. Inconsistencies in data collection techniques frequently limit the ability of Public Health professionals to accurately identify and predict Zika-induced microcephaly cases. To complicate matters further, microcephaly is not unique to Zika infection: it can be caused by a number of environmental and viral exposures—such as toxoplasmosis, rubella, cytomegalovirus, herpes, HIV, syphilis, mercury, alcohol, and radiation—as well as by genetic and maternal health conditions, including poorly controlled maternal diabetes and hyperphenylalaninemia (CDC 2016).


Figure 1: Visual representation of microcephaly (CDC 2016)

This fast-spreading epidemic demonstrates the need for access to global databases tracking the spread of mosquito species, infections, and birth defects under both current and future climate conditions. Next, I will describe databases and data sources relevant to tackling this multifaceted global health risk.

Surveillance databases

Mosquitos: Because Zika virus is a vector-borne infection, tracking the distribution of both Aedes albopictus and Aedes aegypti under current and future climate conditions will be critical to combating seasonal outbreaks, preventing the geographical spread of current outbreaks, and developing long-term strategic interventions to interrupt the vector-host pathway. HealthMap provides an excellent resource for tracking and predicting the spread of Zika virus with up-to-date interactive maps that show the distributions of both mosquito species and Zika infections on a global scale. Through an automated system, HealthMap updates distributions on a daily basis and provides convenient interfaces in nine different languages. Because the Zika epidemic has spread at such an alarming rate, the availability of data in real time is critical. In addition to Zika cases, HealthMap also tracks Yellow Fever, West Nile Virus, and Chikungunya, which are related to Zika virus. By co-tracking these better characterized viruses, we may be able to translate lessons learned into Zika research and prevention. The Centers for Disease Control and Prevention (CDC) also tracks mosquito distributions in the United States. These ranges show that while the Aedes aegypti distribution is primarily in the southern region of the United States, the Aedes albopictus distribution reaches as far north as New Hampshire and extends into the Midwest, reaching Minnesota. While this does not mean that Zika will spread in all of these areas, knowing mosquito distribution patterns can help communities prepare for and mitigate risks.

Figure 2: HealthMap shows the distribution of Zika (purple dots) along with the distribution of Aedes aegypti.
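
For readers who want to consume such surveillance feeds programmatically, below is a minimal sketch of polling a GeoJSON feed and filtering it by disease. The endpoint URL and field names are hypothetical placeholders, not HealthMap's actual API; consult HealthMap's own documentation for its real access methods.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint and schema, for illustration only.
FEED_URL = "https://example.org/outbreaks.geojson"

def zika_reports(feed_url: str = FEED_URL) -> list:
    """Fetch a GeoJSON feature collection and keep only Zika-related reports."""
    with urlopen(feed_url) as response:
        collection = json.load(response)
    return [
        feature
        for feature in collection.get("features", [])
        if feature.get("properties", {}).get("disease", "").lower() == "zika"
    ]

if __name__ == "__main__":
    for report in zika_reports():
        longitude, latitude = report["geometry"]["coordinates"]
        print(report["properties"].get("date"), latitude, longitude)
```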

As the global climate changes, mosquito distributions are predicted to expand. Many options exist for predicting changes in mosquito distribution alongside increased temperatures and shifts in global precipitation patterns (see resources below). Many of these programs have been optimized to describe changes in malaria infections (e.g., Medlock and Leach 2015). Lessons learned from malaria surveillance programs that predict climate-related changes in disease can thus be translated to Zika epidemic prediction.

Zika Infections: Both the World Health Organization (WHO) and the CDC are actively tracking global cases of Zika virus. However, because infection can be mild or asymptomatic, these counts are likely underestimates. Additionally, Zika infections occurring in underserved communities may go unreported owing to lack of access to healthcare.

Figure 3: Distribution of Zika infections in the United States (CDC).

Birth Defects Registries: The CDC and WHO track the incidence of microcephaly at national and global scales, respectively. Generally, birth defects are identified by active or passive surveillance systems. Under active surveillance, Public Health or healthcare professionals seek out birth defect information; for example, an expert goes to hospitals and reviews medical records to find babies with birth defects. Passive surveillance, on the other hand, relies on doctors or hospitals to send reports to the Public Health Department responsible for tracking birth defects. In this model, doctors and healthcare providers must be able to accurately diagnose birth defects and report them to the proper Public Health Department. Hybrid approaches are also used, in which the surveillance is passive but Public Health professionals follow up to confirm birth defect reports. Microcephaly is particularly complicated to track because of discrepancies in how the condition is diagnosed, and comparing countries with active and passive surveillance systems often introduces biases into the analyses. Additionally, depending on the legal and healthcare environment, women carrying fetuses with known birth defects may terminate their pregnancies before a birth defect can be reported, leading to an underestimation of birth defects. Together, these complexities make international comparisons of birth defects difficult.

Dysmorphology: Efforts to standardize the definitions of congenital abnormalities, including microcephaly, are important in harmonizing data collection at national and international levels. The CDC uses the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) ontology as a controlled vocabulary for describing congenital abnormalities. SNOMED CT also includes an extensive catalogue of known causes of microcephaly, including genetic abnormalities.
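
The value of a controlled vocabulary is easiest to see in code. Below is a minimal sketch, assuming a small hypothetical synonym table; the concept identifier shown is illustrative only and should not be taken as an authoritative SNOMED CT lookup.

```python
# Illustrative synonym table: the concept ID below is a placeholder,
# not an authoritative SNOMED CT extract.
SNOMED_SYNONYMS = {
    "microcephaly": "1829003",
    "microcephalus": "1829003",
    "congenital small head": "1829003",
}

def to_concept_id(free_text_diagnosis: str) -> str:
    """Map a clinician's free-text term to a single shared concept ID."""
    return SNOMED_SYNONYMS.get(free_text_diagnosis.strip().lower(), "unmapped")

# Reports coded this way can be aggregated across registries regardless
# of the local wording each provider happened to use.
for term in ["Microcephaly", "MICROCEPHALUS", "congenital small head"]:
    print(term, "->", to_concept_id(term))
```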

Conclusions

Databases were available that answered all of these questions and provided additional details on potential challenges related to data collection. However, separate databases had to be consulted to track microcephaly and Zika cases alongside mosquito populations under current and projected climate scenarios. Some of these databases are updated automatically and consistently; others must be updated manually and can become out of date relatively quickly. Current projections are now being used to answer questions about the global and local risks associated with the upcoming Olympic Games.

The available databases enabled decision-makers to craft location-specific risk communication advice and to make predictions of vector spread. As with many emerging risks, more information is always needed, and the frequency of database updates therefore governed how often advice could be revised. Information sources differed in detail and were dynamic. With birth defects in particular, getting the message wrong or relying on inaccurate data can have serious consequences for healthcare decisions. Most of the databases we accessed to make these assessments were government- and/or agency-based databases, best suited to population-level predictions rather than to individual patient-based decisions. At the population level, these databases were exceptionally helpful.

All in all, we found a wide variety of databases available that are relevant to understanding and predicting risks associated with Zika virus. Weaknesses include: a lack of international standards for diagnosing microcephaly; difficulties in quantifying the prevalence of Zika virus in rural and underserved communities; infrequently updated databases; and a lack of 'one-stop shopping'. However, there are many promising tools, such as HealthMap, which contains information on both mosquitos and Zika cases and is frequently updated.

Special thanks to M. Smith and D. Pyle of the Institute for Risk Analysis and Risk Communication for their contributions to this blog post.

Relevant Resources

Mosquito Surveillance
Geographic spread of mosquitos:
 – HealthMap
 – Zika Virus Vector Range (CDC)

Climate change models for mosquito spread:
 – Medlock, J. M. and S. A. Leach (2015) 'Effect of climate change on vector-borne disease risk in the UK.' The Lancet Infectious Diseases 15(6): 721-730.
 – Paz, S. and J. C. Semenza (2016) 'El Niño and climate change-contributing factors in the dispersal of Zika virus in the Americas?' The Lancet 387(10020): 745.
 – Sucaet, Y., J. V. Hemert, B. Tucker and L. Bartholomay (2008). 'A Web-based Relational Database for Monitoring and Analyzing Mosquito Population Dynamics.' Journal of Medical Entomology 45(4): 775-784.
 – Vector Map

Zika Tracking
WHO Pan American Health Organization:
 – 'Zika Virus Infection'

CDC:
 – Global Surveillance
 – National Surveillance

Birth Defects
CDC:
 – Active and passive surveillance registries
 – Tools and resources

WHO:
 – Atlas of selected congenital anomalies

European Union Tracking and Surveillance:
 – European Surveillance of Congenital Anomalies
 – The International Clearinghouse for Birth Defects

Dysmorphology
CDC Public Health Information Vocabulary Access and Distribution System based on the SNOMED ontology:
 – For congenital abnormalities
 – For microcephaly causes

Contribution to Long-term Environmental Monitoring

A Blog post by Arona Diedhiou (WDS Scientific Committee member)

The world is facing greater changes in its climate than in past decades. Such global changes have direct impacts on the social and economic aspects of our lives, as well as on the environment. Comprehending the complexity of these phenomena and their effects requires in-depth investigation. To this end, environmental observations can supply information about past climates while providing benchmarks for comparison with future changes. The observations hence serve as a basis for assessing potential impacts and for planning adaptation measures and mitigation policies.

The Institute of Research for Development (IRD, France) has been involved for many years in observing the environment in intertropical zones. The observation systems it has put in place are an integral part of the research carried out by IRD and its partners in developing countries. The ongoing operation of these systems is essential to gain an understanding, over a sufficiently long period, of variations in both environmental processes and major cycles within the current context of climate change and accelerated development of human activity.

The observatories are jointly operated and managed with partners from the South and the North, which promotes North–South and South–South exchanges. They back up data and results, make them available to scientific communities, and disseminate them to a wider audience. These actions build on and complement the environmental monitoring efforts carried out in each country by local organizations or intergovernmental entities, and they include training and technology-transfer initiatives with the aim of fostering academic training in topic-based schools.

Together, with standardized observational procedures and certified data, we can tackle these global challenges.

For more information on IRD's climate surveillance systems, please go to: https://en.ird.fr/climate/research-on-climatic-change/environmental-research-observatories

Yet Another Paradigm Shift…

A Blog post by Wim Hugo (WDS Scientific Committee member)

At the recently completed European Geosciences Union General Assembly 2016, I was one of the participants in a double session called "20 years of persistent identifiers – where do we go next?". Apart from reviewing the obvious elements, issues, and benefits of persistent identification—and agreeing on the success of the Research Data Alliance (RDA) Working Group on Data Citation and their excellent set of 14 guidelines for implementation—we also had a number of robust discussions, not least because Vienna was an airport too far for some of the presenters, leaving us with free time.

Firstly, most of us agreed that being able to reproduce the result of queries (and potentially other transformations or processes) applied to data or subsets of the data was the hardest of the guidelines to implement.

One can deal with this by keeping archived copies of all such query and transformation results (painless to implement, but potentially devastating from a storage-provisioning perspective), or one can opt to store the query and transformation instructions themselves, with a view to reproducing the query or transformation result at some point in the future.

This second option equates to always starting with base ingredients (egg yolks, lemon juice, butter, and maybe mustard or cayenne) and storing these alongside a recipe (in this case, for Hollandaise Sauce). It too is painless to implement—until there is a change in the underlying database schema, code, or both. In that case, one must either maintain backward compatibility (potentially almost ad infinitum) so that historical operations continue to work, or maintain working copies of all historical releases for the purpose of reproducing a query or transformation result at some point in the future. Clearly, neither is very practical.
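
To make the recipe idea concrete, here is a minimal sketch of what a stored, citable recipe might look like. This is a hypothetical illustration, not a standard: the field names, the operation vocabulary, and the `load` resolver are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Recipe:
    """A citable 'recipe': a pinned base dataset plus ordered operations."""
    dataset_id: str       # persistent identifier of the base data
    dataset_version: str  # the exact release the original query ran against
    schema_version: str   # guards against silent schema or code drift
    operations: tuple     # e.g., (("filter", {"country": "BR"}), ("sort", {"by": "date"}))

def replay(recipe: Recipe, load: Callable[[str, str], list]) -> list:
    """Re-derive a cited result from base ingredients.

    `load` is assumed to resolve (dataset_id, dataset_version) to a list of
    row dicts; a real repository would supply this resolver.
    """
    rows = load(recipe.dataset_id, recipe.dataset_version)
    for operation, args in recipe.operations:
        if operation == "filter":
            rows = [r for r in rows if all(r.get(k) == v for k, v in args.items())]
        elif operation == "sort":
            rows = sorted(rows, key=lambda r: r[args["by"]])
        else:
            # An unknown operation means the stored recipe has outlived the code.
            raise ValueError(f"unsupported operation: {operation}")
    return rows
```

The schema_version field makes the fragility discussed above explicit: when the underlying schema changes, replay must either translate old recipes forward or fail loudly, rather than silently returning a different result.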

By the way, there were some excellent ideas on how to record recipes systematically: Lesley Wyborn presented work on defining an ontology whereby queries and transformations could be documented as an automated script, and Edzer Pebesma and colleagues are conceiving an algebra for spatial operations with much the same objective in mind.

This approach, of course, requires an additional consensus: at what point do we store results as a new dataset instead of executing a potentially longer and longer list of processes on original data? There must be some value to buying Hollandaise Sauce off the shelf for our Eggs Benedict—at least some of the time.

Secondly, all of this trouble is required to achieve either one or both of two objectives: reliably finding the data referenced by a citation (via a digital object identifier or other persistent identifier), and supporting reproducibility in science. This last point was enthusiastically agreed on by most (one or two abstained, and there was one dissenter):

"Science Isn’t Science If It Isn’t Reproducible".

This assertion set me thinking about the process of reproducing results in the new world of data-intensive science—a world in which code and systems are increasingly distributed and reliant on external vocabularies, lookups, services, and libraries (which may themselves be referenced by persistent identifiers). None of these resources, each of which may significantly alter the result of a process should it change, is under the control of the code running in my environment. Which brings us to Claerbout’s Principle:

"The scholarship does not only consist of theorems and proofs but also (and perhaps even more important) of data, computer code and a runtime environment which provides readers with the possibility to reproduce all tables and figures in an article."

Easier said than done. We can, of course—as we should in a world of formal systems engineering—insist on proper configuration control and versioning of all components, internal and external; but I am not convinced that the research community is ready for this level of maturity, which is typically reserved for moon rockets and defense procurement, at orders of magnitude in additional cost. Perhaps more importantly, scientists writing code are not going to invest the time and effort to document, version, and package it to a standard that supports reproducibility. Hence, the code that we use to transform our data, whether we like it or not, will not automatically produce the same result at some unspecified point in the future—all the more so if it has external web-based dependencies (which, in turn, may have external dependencies of their own). There is some utility in packaging entire runtime environments (much in the way that one could persist the result of a query or transformation), but this does not solve the problem of external dependencies.
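
Short of full configuration control, one modest step is to snapshot the runtime environment alongside each published result, so that a later rerun can at least detect drift. A minimal sketch, assuming nothing beyond the Python standard library:

```python
import json
import platform
import sys
from importlib import metadata

def environment_manifest() -> dict:
    """Record the interpreter, OS, and installed package versions."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }

def drifted_packages(saved: dict, current: dict) -> list:
    """List packages whose versions differ from the archived manifest."""
    old, new = saved["packages"], current["packages"]
    return [name for name in old if new.get(name) != old[name]]

if __name__ == "__main__":
    # Archive this JSON next to the published tables and figures.
    print(json.dumps(environment_manifest(), indent=2))
```

Note that this detects local drift only; it cannot pin the external web-based dependencies discussed above.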

Which raises an interesting dilemma: in the world of linked open data, the semantic web, and open distributed processing, the state of the web at any point in time cannot be reproduced ever again—which may create significant issues for reproducible science if it uses any form of distributed code.

Not only that! As we rely more and more on processing enormous volumes of data by digital means, we will depend more and more on artificial intelligence, machine learning, and automated research. As the body of knowledge available to automated agents changes, so presumably, will their conclusions and inferences.

So... we need a new consensus on what science means in the era of data-intensive, increasingly automated research: our rules, notions, and paradigms will soon be outdated.

Fitting subject for an RDA Interest Group, I would think.

Some interesting additional reading:
http://www.nature.com/news/reproducibility-1.17552
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/MattShotwell/MSRetreat2013Slides.pdf

Surveying User Satisfaction: The NASA DAAC Experience

A Blog post by Alex de Sherbinin (WDS Scientific Committee member)

It is often said—disparagingly—that America’s culture is a consumer culture. Although it may be true that America’s consumerism is problematic, not least for the planet, the flip side is how consumer culture drives a service mentality in businesses and government. The old adage that “the customer is king” does motivate US government agencies and government-supported centers, including NASA’s Distributed Active Archive Centers (DAACs), to innovate and improve services in response to user feedback and evolving user needs.

Since 2004, NASA’s Earth Science Data and Information System (ESDIS) Project [WDS Network Member] has commissioned the CFI Group to conduct an annual customer satisfaction survey of users of Earth Observing System Data and Information System (EOSDIS) data and services available through the twelve DAACs. The American Customer Satisfaction Index (ACSI) is a uniform, cross-industry measure of satisfaction with goods and services available to US consumers in both the private and public sectors. The ACSI represents an important source of information on user satisfaction and needs that feeds into DAAC operations and evolution. It may hold some lessons for WDS data services more broadly as they seek feedback from their users, endeavor to expand their user bases, and justify funding support.

The ACSI survey invitation is sent to anyone who has registered to download data from the NASA DAACs. In the past, registration was ad hoc, and each DAAC had its own system. In early 2015, ESDIS began implementing a uniform user registration system called EarthData Login, which requires users to establish a free account before they can access datasets. Accounts are associated with a given DAAC, but they allow access to data across all the DAACs. All those who register are sent invitations to fill out the ACSI survey. Response rates vary from a few percent at most DAACs to as high as 38% for the Land Processes DAAC [WDS Regular Member] (which also has the highest number of respondents, at just over 2,000).

In 2015, the overall EOSDIS ACSI was 77 out of 100, which is better than the overall government and National ACSI scores for 2015 (64 and 74, respectively), but lower than that of the National Weather Service (80). This score is based on users’ overall assessment of satisfaction with each data center, relative to their expectations and to an “ideal” data center. The ACSI model provided by the CFI Group also assesses specific “drivers” of user satisfaction—customer support, product search, product selection and order, product documentation, product quality, and data delivery—and their relative importance to the overall ACSI score. This allows the DAACs to identify the areas where improvement is needed and would have the most impact on overall satisfaction.
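
To illustrate how driver scores can feed an overall index, here is a toy weighted-average calculation. The driver names come from the survey described above, but the weights and scores are invented for illustration and do not reproduce the CFI Group's actual ACSI model.

```python
# Toy roll-up of driver scores into an overall index. Driver names are
# from the survey; the weights and scores are invented for illustration.
drivers = {
    # driver: (score out of 100, relative importance weight)
    "customer support":            (86, 0.25),
    "product search":              (78, 0.20),
    "product selection and order": (80, 0.15),
    "product documentation":       (75, 0.15),
    "product quality":             (82, 0.15),
    "data delivery":               (79, 0.10),
}

overall = sum(score * weight for score, weight in drivers.values())
print(f"Overall index: {overall:.1f}")

# Ranking drivers by (headroom x weight) shows where improvement effort
# should move the overall score the most.
leverage = sorted(
    drivers.items(),
    key=lambda item: (100 - item[1][0]) * item[1][1],
    reverse=True,
)
print("Biggest leverage:", leverage[0][0])
```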

The ACSI enables the EOSDIS to assess changes from year to year. For example, from 2014 to 2015 customer support went from 89 to 86, with drops in professionalism, technical knowledge, helpfulness in correcting a problem, and timeliness of response (all statistically significant). Many changes likely reflect shifts in the pool of survey respondents and in their expectations over time, rather than actual drops in service provision. But for individual DAACs, declining scores in certain areas, in combination with free-text responses to open-ended questions, can help to flag issues that are in need of attention.

For example, the ACSI scores and free-text responses to open-ended questions helped our DAAC—the Socioeconomic Data and Applications Center (SEDAC) [WDS Regular Member]—in undertaking a major website overhaul in 2011. From a disparate set of pages with different designs, we created a coherent site with consistent navigation. The resulting site was evaluated very favorably by Blink UX, a user experience evaluation firm that reviewed all of the DAAC websites. Deficiencies in data documentation for selected datasets have also been pointed out by survey respondents, and we are now reviewing our guidelines for documentation to ensure that all datasets meet a minimum standard. Some users indicated difficulty in finding the latest dataset releases, so we are developing an email alert system for new data releases.

At the Alaska Satellite Facility (ASF) DAAC [WDS Regular Member], the ACSI results have been very helpful in getting a sense of how people are using ASF DAAC data and services. The free-text responses to questions regarding new data, services, search capabilities, and data formats are particularly informative. For example, one user suggested that it would be useful to have quick access to Synthetic Aperture Radar data for specific regions in the world for disaster response. A data feed was developed after the recent Nepal earthquake that notified users of any new Sentinel-1A data received at ASF DAAC for that specific area. This data feed quickly provided additional data for disaster responders and researchers studying this event. Data feeds are now available for several seismically active areas of the world that have been designated by the scientific community (i.e., Supersites).

Overall, the strong EOSDIS ACSI scores have been important in objectively demonstrating and documenting the continuing value of EOSDIS and the individual DAACs to the broad user community. The annual score is reported as one of NASA’s annual performance metrics, supporting NASA’s goal to provide results-driven management focused on optimizing value to the American public.

Although surveys can be costly, and the response rates low, WDS Members would do well to consider periodic surveys of users. We find that highly motivated users do respond and provide really useful suggestions, especially if they find that their responses actually lead to tangible changes in their user experience. While annual surveys may be more than is needed, surveys every 2–3 years could provide your data service with valuable feedback on its content and services. And of course, none of this should supplant other mechanisms for gathering user feedback, such as help desk software (e.g., UserVoice used by SEDAC or Kayako used by NASA’s EarthData), email, and telephone helplines. Through these multiple mechanisms, our user communities can help drive significant improvements in the services offered by WDS Members and the successful use of our valuable data by growing numbers of users.