Big data is great. I don’t think anyone can argue against the benefits of making datasets publicly available, whether they come from independent, controlled experiments or from large-scale projects such as the Earth Microbiome Project. Depositing your data in one of the available repositories, such as figshare or MG-RAST, can only ever help science: it prevents fraud, makes the research process more transparent, and allows results to be cross-checked. Sharing facilitates discussion, and I have never heard of unethical use of shared data.
Nothing new there. In fact, sharing might prevent masses and masses of unpublished data from being lost forever, and thus has the potential to save millions of pounds of public money from being spent on experiments that have already been done, but that no one knows about. Think of all those PhD thesis chapters, fully analysed and written up, that never get published and are therefore potentially lost to science forever. Similar to their open-access strategies, research councils and funding agencies should perhaps set up a system that obliges PhD students to deposit their data before they can be awarded their doctorate.
Of course, data sharing, or big data, is not going to answer all our questions. To test specific hypotheses, you need to perform mechanistic, controlled experiments. However, observational data, or the mining of datasets not tailored to your specific question, can reveal patterns and help to develop hypotheses. Analysing these large amounts of data is greatly helped by the rapid development and availability of sophisticated data processing and analysis methods, such as those of the R project. The hypotheses developed from patterns in big data then need to be tested in controlled experiments, as Professor Jim Prosser recently argued in this excellent article. In line with his ‘think before you sequence’, I argued in my previous blog post that you don’t always have to use the sexiest method for analysing soil organisms; rather, you should use the method that answers your question most appropriately.
So, I am very much in favour of data sharing, and of analysing big data, as long as you do it ethically and acknowledge the limitations. Together with a collaborator, I recently re-analysed a couple of datasets to propose and explore hypotheses about microbial community stability. Some of these data were my own, some were kindly provided by a colleague, and some we obtained from MG-RAST. All of this was done in agreement with the owners of the data, and the results allowed us to pose hypotheses and make recommendations for testing them mechanistically (this work is currently in review).
However, so far, I have not shared any of my own data. Although I am very keen to do so, there are many unknowns and uncertainties. Even though I am either the lead author or the principal investigator on most of the datasets I’d like to share, I feel I need my co-authors’ permission, which I simply haven’t got round to asking for. Also, although some of these datasets are published, I am still thinking of re-analysing them myself. What if I posted them, and someone came up with the same idea but published it before me? Or what if someone used my data in a way I didn’t think was appropriate?
Still, I firmly believe in the principle of data sharing, and, I promise, I will soon make my own contribution!