How to Reconcile Data Quality and Service of Data From the 'Dark Long Tail'

A Blog post by Toshihiko Iyemori (WDS-SC)

The WDS Scientific Committee (WDS-SC) asks WDS Members to maintain the quality of their data and services. Another important task for the WDS-SC is to recruit data centres from as many disciplines, and as wide a range of them, as possible, so that their data can be served to promote science, in particular interdisciplinary science. However, based on my experience as a researcher in Solar–Terrestrial Physics and as Director of the World Data Centre for Geomagnetism, Kyoto (a WDS Regular Member), I believe that we have one more important task for promoting science: collecting and serving useful data from the 'dark long tail' of datasets.

A huge number of datasets, mainly obtained on a research-project basis, are not registered with active data centres and hence are 'dark' to many of us. These datasets are typically built by small research groups over a limited period, and their quality checks are often insufficient. Although their quality may be poor and they exist only for a limited period, such data are very important and useful if the observation site is unique or if no other observations are available.

We know of many such 'dark long tail' datasets, and some have been sent to our data centre. Even when we find them and can ingest them, however, we often have difficulty maintaining (or confirming) their quality. Nevertheless, my personal opinion is that these data should also be served by WDS Members, even if doing so conflicts with the membership requirements of WDS.

One way to reconcile data quality with the service of data from the 'dark long tail' is to register metadata that describe the observations in as much detail as possible, so that users can judge for themselves whether a dataset suits their purpose. An example of this in practice is IUGONET (Interuniversity Upper atmosphere Global Observation NETwork), which has a common database of metadata and forms a virtual data centre of distributed databases at several institutions. This data system includes databases from the 'dark long tail', as well as large well-known databases.
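To make the idea of a common metadata database concrete, here is a minimal Python sketch. It is not IUGONET's actual schema or software: the class names, field names, and the two records are invented for illustration, and the URLs are placeholders. It only shows how detailed metadata, including an honest note on quality checks, can make a small dataset discoverable while the data themselves stay at the holding institution.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ObservationMetadata:
    """Hypothetical metadata record; the field names are assumptions."""
    dataset_id: str
    holding_institution: str  # where the data files actually live
    site_name: str
    latitude_deg: float       # degrees north
    longitude_deg: float      # degrees east
    start_date: str           # ISO 8601 (YYYY-MM-DD), so strings sort by date
    end_date: str
    instrument: str
    quality_note: str         # states which quality checks were or were not done
    access_url: str           # points to the holding institution, not a copy


class MetadataCatalogue:
    """Common metadata index over datasets held at several institutions."""

    def __init__(self) -> None:
        self._records = {}

    def register(self, record: ObservationMetadata) -> None:
        # Refuse duplicates so that one identifier always means one dataset.
        if record.dataset_id in self._records:
            raise ValueError(f"duplicate dataset_id: {record.dataset_id}")
        self._records[record.dataset_id] = record

    def search(
        self,
        lat_range: Tuple[float, float],
        lon_range: Tuple[float, float],
        date: str,
    ) -> List[ObservationMetadata]:
        """Return records whose site lies in the box and whose period covers date."""
        lat_min, lat_max = lat_range
        lon_min, lon_max = lon_range
        return [
            r for r in self._records.values()
            if lat_min <= r.latitude_deg <= lat_max
            and lon_min <= r.longitude_deg <= lon_max
            and r.start_date <= date <= r.end_date
        ]


# Invented example records: one short project dataset, one long-running one.
catalogue = MetadataCatalogue()
catalogue.register(ObservationMetadata(
    dataset_id="example-project-a",
    holding_institution="Example University",
    site_name="Example Station A",
    latitude_deg=10.0, longitude_deg=100.0,
    start_date="2010-01-01", end_date="2010-12-31",
    instrument="fluxgate magnetometer",
    quality_note="Raw values only; no baseline correction applied.",
    access_url="https://example.org/data/project-a",
))
catalogue.register(ObservationMetadata(
    dataset_id="example-observatory-b",
    holding_institution="Example Data Centre",
    site_name="Example Observatory B",
    latitude_deg=35.0, longitude_deg=135.0,
    start_date="1990-01-01", end_date="2020-12-31",
    instrument="fluxgate magnetometer",
    quality_note="Routinely checked against absolute measurements.",
    access_url="https://example.org/data/observatory-b",
))

# Which datasets cover a low-latitude box on a given day?
hits = catalogue.search(lat_range=(0.0, 20.0), lon_range=(90.0, 110.0),
                        date="2010-06-15")
```

In this sketch the catalogue holds only descriptions. The quality note travels with each record, so a dataset of unconfirmed quality can still be found and used with appropriate caution.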

The WDS-SC and WDS Member Organizations must therefore take action (and advocate) to ensure that such 'dark' datasets are registered in appropriate data centres or systems, with adequate metadata to make them useful. Otherwise, I am concerned that they may simply be kept in individual institutional repositories in a way that cannot be exploited, or may even be lost forever.

To improve the situation domestically, we held two workshops at Kyoto University last autumn that explored possibilities for collaboration among Japanese university libraries, informatics experts, and research scientists. University libraries in Japan are generally not very positive about functioning as repositories for scientific data. In contrast, some researchers are actively trying to develop the technology or systems needed for that to happen. Moreover, a Japanese endeavour to register datasets and attach Digital Object Identifiers (DOIs) to them started last year. My hope is that these activities will grow and form a stream of open data from the 'dark long tail'.