I’ve been posting recently about some issues in scientific data management. One topic I haven’t addressed yet is policies around data sharing. Generally:
- Scientists, like other academics, have their research judged largely on the basis of their published papers.
- The data that scientists capture benefits their careers mainly by informing, and being used in, their published papers.
- Scientists are correspondingly uninterested in, if not actively opposed to, sharing their data with the rest of the world:
  - Promptly (for the data they use to directly support their publications)
  - Perhaps ever (for the rest of the data)
On the other hand, it’s blindingly obvious that the world as a whole would be better off with widespread scientific data sharing, provided that making data “free” doesn’t significantly undermine scientists’ incentives to capture it in the first place. And institutions such as funding agencies are taking note. Thus:
Scientific data management technology should be suitable for either of two scenarios:
- Data is widely shared among scientists.
- Data is jealously guarded by the scientists who first gather it.
Biologists, it seems, are furthest along in sharing data. But they’ve had some drama about that recently. My very incomplete knowledge includes:
- At XLDB, it was said that in some areas of biology — and perhaps in some journals? — it was required that you make your data available to get a paper published.
- The NIH (National Institutes of Health) often requires or at least encourages data sharing as a condition of funding.
- A common practice is for data to be shared immediately, but with a 12-month embargo period during which nobody except the scientists who gathered it may use it in a paper.
- There was a recent kerfuffle when an embargo was broken on data residing in the NIH-sponsored genomics data repository dbGaP.
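To make the embargo idea concrete, here is a minimal sketch of how a data management system might encode such a rule. Everything in it is hypothetical: the `Dataset` fields, the `may_publish` function, and the choice to end the embargo exactly on the anniversary of the sharing date are my assumptions for illustration, not how dbGaP or any other repository actually works.

```python
import calendar
from dataclasses import dataclass
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` later, clamping the day to the month's end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


@dataclass(frozen=True)
class Dataset:
    # All fields are hypothetical, chosen only for this illustration.
    name: str
    gatherers: frozenset       # scientists who gathered the data
    shared_on: date            # date the data was made available
    embargo_months: int = 12   # the 12-month period mentioned above


def may_publish(dataset: Dataset, author: str, today: date) -> bool:
    """Anyone may read the data; this only decides who may use it in a paper."""
    if author in dataset.gatherers:
        return True
    # Assumption: the embargo lifts on the anniversary date itself.
    return today >= add_months(dataset.shared_on, dataset.embargo_months)


study = Dataset("example-study", frozenset({"alice"}), date(2009, 1, 15))
print(may_publish(study, "alice", date(2009, 6, 1)))
print(may_publish(study, "bob", date(2009, 6, 1)))
```

Note that a check like this only records the rule. Since the data is readable from day one, the embargo ultimately rests on community norms and journal enforcement rather than on technology.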