Monthly Archives: May 2014

Sharing data: a step forward

You would think that scientists would be eager to share data. After all, the myth of science that is taught to students is that we build on each other’s work, so of course if we have an interesting data set, we will let anyone have it who wants it, right?

It turns out that the truth is somewhat other than we would like it to be. There are both bad and worse reasons why data is not routinely shared. Probably the worst reason of all is wanting to sit on the data so that one extracts the maximum benefit from the data set while shutting out others. A variation on this theme is only allowing people access to your data if they will agree to make you a coauthor. I once collaborated with a scientist who wanted to use a crystal structure obtained by another lab. (I will leave the names out of it since they’re not relevant.) This was in the mid-1990s when the requirement to deposit structures with the Protein Data Bank (PDB) prior to publication was not yet universal. She was told by her colleagues (and I use the word loosely here) that she could only have their coordinate files if she agreed to include them as coauthors on any paper in which those coordinates were used for the following five years. My colleague and I were astonished by this. Beyond providing coordinates from an already-published structure, they would have made no intellectual contribution to her work, and yet they wanted to be treated as coauthors for an extended period of time. This is no longer possible with protein structures due to the now-universal requirement to deposit structures at the PDB as a condition of publication, but clearly people who hold data sometimes feel this gives them power they can use to further their careers. This is just wrong.

A not-so-good reason for not sharing data is that doing so takes time. A data set that may be perfectly OK for your use may not be suitable for sharing as is. I won’t get into issues of confidentiality with human subjects because I’m not an expert in this area, but clearly anonymizing medical data prior to sharing is important, and then there’s the tricky issue of consent: If the participants in a study did not explicitly agree to have their data used in other studies, is it OK to share the data set with others, even with suitable safeguards in place to protect the privacy of the study participants? Even for data not involving human subjects, sharing data takes time because you have to make sure you provide enough information about the data set for users to be able to make sense of it. This includes (obviously) a full description of what the various data fields represent, but also the conditions under which the data were obtained, any post-processing of the data, etc. Many scientists opt to just keep their data to themselves rather than generating all the necessary metadata. This situation is made worse by the fact that one gets very little credit for putting together a usable data set: It doesn’t count as a publication, so it won’t help a student land a scholarship; tenure and promotion committees are unlikely to give a data set much weight in their deliberations; and granting agencies won’t give you a grant solely because you generate high-quality reusable data.

A significant step forward has been taken with the launch of a new online journal by the Nature Publishing Group entitled Scientific Data. (Incidentally, I learned about this new journal from an article in The Scientist.) This journal is dedicated to the publication of data sets with proper metadata so that they can be used widely. Hopefully, the clout of the Nature Publishing Group will get the attention of the various bodies that decide which scientific activities are valued, and will lead to an increase in the sharing of data sets.

In case you’re wondering whether I put my money where my mouth is: My web site includes a small section of data sets that I have generated that others might find of interest. Could I do more? Sure. Making it worth my while to do so via a journal like Scientific Data might be just the push I need.