To share or not to share?

Big data is great. I don’t think anyone can argue against the benefits of making datasets, whether they are from independent, controlled experiments, or from large-scale projects such as the Earth microbiome, publicly available. Depositing your data in one of the databases available, such as figshare or MG-RAST, can only ever help science. It progresses science by preventing fraud, making the process more transparent, and allowing for crosschecking of results. Sharing facilitates discussion. I have never heard of unethical use of shared data.

Nothing new there. In fact, sharing data might prevent masses and masses of unpublished data from getting lost forever, and thus has the potential to save millions of pounds of public money being spent on experiments that have already been done, but that no one knows about. Think about all those PhD thesis chapters, all analysed and written up, that never get published, and are therefore potentially lost to science forever. Similar to their open access strategy, research councils and funding agencies should perhaps set up a system that obliges PhD students to deposit their data before they can get their doctorate.

Of course, data sharing, or big data, is not going to answer all our questions. To test specific hypotheses, you need to perform mechanistic, controlled experiments. However, observational data, or the mining of datasets not tailored to answer your specific question, can reveal patterns and help to develop hypotheses. Analysing these large amounts of data is greatly helped by the rapid development and availability of new and sophisticated data processing and analysis methods, such as those of the R project. The hypotheses that are developed based on patterns in big data then need to be tested in controlled experiments, as Professor Jim Prosser recently argued in this excellent article. In line with his ‘think before you sequence’, I wrote in my previous blog post that you don’t always have to use the sexiest method for analysing soil organisms; rather, you should use the method that answers your question most appropriately.

So, I am very much in favour of data sharing, and analysing big data, as long as you make sure you do it ethically, and acknowledge the limitations. Together with a collaborator, I recently re-analysed a couple of datasets to propose and explore hypotheses about microbial community stability. Some of these data were my own, some were kindly provided by a colleague, and some we got from MG-RAST. All of this was done in agreement with the owners of the data, and the results allowed us to pose hypotheses and make recommendations for testing them mechanistically (this work is currently in review).

However, so far, I have not shared any of my own data. Although I am very keen to do so, there are many unknowns and uncertainties. Although I am either the lead author, or the principal investigator, on most of the datasets I’d like to share, I feel I need the permission of my coauthors, which is something I simply haven’t got round to asking for. Also, although some of these datasets are already published, I am still thinking of re-analysing them myself. What if I posted them, and someone came up with the same idea, but published it before me? Or, what if someone used my data in a way that I didn’t think was appropriate?

Still, I firmly believe in the principle of data sharing, and, I promise, I will soon be making my contribution!

2 comments on “To share or not to share?”

  1. ibartomeus says:

    You mention a couple of “fears” about sharing data: people being quicker than you, and people misusing the data. I think the first scenario is unlikely because you are in a much better position than anyone else to understand and use your data, especially if you already have a question in mind. I am not saying it can’t happen, but most people will use your data in new ways and in combination with other datasets, and I think it is unlikely that this will overlap much with your plans. Misuse is also possible, but I like to think that for any bad use of the data (which hopefully will be caught by the review process), you create 10 opportunities to use it appropriately.

    As for the pros, other than being scientifically open with the community, as you point out, sharing data increases your chances of being noticed (and cited!) and also opens the door to new collaborations!

  2. em409 says:

    Yes! It’s not just data we should share. With so many people spilling over their PhD deadlines and short-term contracts, I’ve seen many leave our lab and take information with them that would help others. It’s a perennial problem of short-term contracts, but to be more efficient, science needs to build a community that feels it’s OK to take time to write protocols and share skills and experience. Too many people think they “don’t have time” for that sort of behaviour.
