The culture of "mine" in science at the age of cloud-based analysis systems*

Scientists need computing power and storage space for their data sets. For scientific institutions, this translates into relatively high long-term fixed costs. The resources required to buy hardware, maintain network infrastructure, and pay system administrators to look after these masses of electronics add up to a considerable overhead. As a result, public scientific institutes spend a great deal of money and human resources to create and maintain infrastructures for storing and analyzing scientific data sets.

Yet, when it comes to hardware demands, one scientific institute is pretty much an exact replica of another: the resources needed in one place should in theory be very similar to those needed anywhere else. Therefore, instead of investing in system administrators, storage and computational resources, scientific institutes could lease these services from cloud-based infrastructures, with more flexible pricing and far less overhead. Replacing your system administrator with two PhD students is an appealing idea, after all.

There is actually nothing illuminating in this view, because it has already been happening in the corporate world for more than ten years. Many hosting companies also offer VNC-based access to their servers, letting customers run software on powerful remote machines. Beyond simple hosting, Google Cloud and Amazon AWS have made the transformation real by integrating all sorts of compute, storage and parallelization tools and selling them as a service.

Where are we in neuroscience? Some important milestones are finally becoming a reality in the natural sciences, and I think the point of no return is slowly being reached for neuroscience as well. I believe this because standards for storing and sharing datasets are becoming more and more mainstream, and this shift has the potential to change the day-to-day scientific enterprise radically. For example, OpenNeuro is one such platform: you can upload your brain imaging dataset in the BIDS format and let analyses run on their servers. I think this is just the start of a large-scale transformation in how we do science; a minimal sketch of what such a standardized dataset looks like follows below.
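To make the standardization point concrete, here is a minimal sketch, in Python with only the standard library, of what a BIDS-style dataset looks like on disk and how a trivial sanity check might work. The directory layout follows the BIDS convention; the function itself is purely illustrative and is not the official BIDS validator.

```python
from pathlib import Path

# A minimal BIDS-style dataset is just a plain directory tree, e.g.:
#
#   my_dataset/
#     dataset_description.json              <- required metadata file
#     sub-01/
#       anat/sub-01_T1w.nii.gz              <- anatomical scan
#       func/sub-01_task-rest_bold.nii.gz   <- functional run
#     sub-02/
#       ...
#
# Platforms like OpenNeuro can ingest any dataset following this layout.

def looks_like_bids(root: str) -> bool:
    """Illustrative sanity check, not the official BIDS validator."""
    path = Path(root)
    if not path.is_dir():
        return False
    has_description = (path / "dataset_description.json").is_file()
    has_subjects = any(p.is_dir() and p.name.startswith("sub-")
                       for p in path.iterdir())
    return has_description and has_subjects

if __name__ == "__main__":
    print(looks_like_bids("my_dataset"))  # True once the layout above exists
```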

Here is how I think it will unfold:

(1) Scientific publication

The way we publish our reports has probably not changed since the times of Fisher, or even Newton. The world today is a very different place, but many of the tools invented in the era of internet-based communication have not been incorporated into the way we conduct science. OK, instead of sending a manuscript to the editor's office by post, we now use email. Fine.

For example, the scientific reviewing system has not incorporated crowd-sourcing mechanisms to evaluate the quality of scientific papers. The decision of whether a manuscript or research proposal is worth publishing rests largely in the hands of a few non-randomly selected referees and an editor. The process is opaque, prone to biases, and has no means of preventing the formation of small-world cartels that mutually benefit from positive biases.

Nor is the distribution of reputation based on metrics that reflect a person's long-term value to science in general. In the best case, reputation is equivalent to your h-index, which is heavily biased by the random success of your publication track, not by how good a scientist you are. Metrics that ensure the long-term advancement of science are typically not included; for example, we lack a metric that judges a professor by the number of their students who themselves became professors within the last 5-10 years. The infrastructure needed for a better and more democratic system has been in place for more than a decade, and I believe the change will come faster as cloud-based systems drive down the cost of storage and computation.

In the very near future, I believe any serious publication will also need to include the related datasets and analysis pipelines, and make them publicly available to the whole scientific community (and to other citizens as well). This is already happening in part: many journals ask you to agree to share your data promptly when requested. However, the definition of "prompt" is very subjective; you may want to read this tweet storm for a recent example. And even if uploading the dataset were obligatory, re-evaluating the data is not among the responsibilities of the referees. This means that incrementally modifying the existing system to better suit the current demands of scientific democratization is not enough; we need a radically different way of publishing science.

When the data is stored and the analysis run on a cloud-based system, reviewers will have no more excuses for not being involved in the data analysis, as the time it takes to have a closer look at the data and the analysis pipelines will be insignificant. Therefore, I believe any serious publication will take the concerted effort of, on the one hand, authors who designed the experiment, collected the data and wrote the initial draft of the paper, and on the other, reviewers who will be required to contribute to the data analysis using the same cloud-based storage and computational infrastructure. There may not be much difference between the collaborators of today and the reviewers of tomorrow.

(2) Cloud-based analysis

Most published reports use similar methods, which are reinvented again and again by successive generations of PhD students and postdocs, a complete waste of time and resources. It is hard to imagine a more inefficient system than today's science; a large company would not be able to function like this.

Once we start talking about cloud-based storage and analysis pipelines, it will also be possible to run these analyses automatically on a server. You will simply tick the checkbox for this or that analysis and receive the results by email, in the form of a presentation or a web page (example) to click and browse around. This is of course an oversimplification, but the point is that scientists will spend more time on (1) standardizing their datasets so that analyses can run on the cloud-based system, and (2) building analysis pipelines that are compatible with standardized datasets. The time saved will let many scientists record more data. A minimal sketch of such a checkbox-driven pipeline follows below.
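As a toy illustration of the checkbox idea, here is a minimal sketch in Python. Every name in it (the analysis functions, run_requested_analyses, the HTML report) is hypothetical and meant only to show the shape of such a service, not any existing API.

```python
from pathlib import Path

# Hypothetical analysis steps; in a real service each would wrap an
# established pipeline (preprocessing, GLM, connectivity, ...).
def motion_summary(dataset: Path) -> str:
    return "<h2>Motion summary</h2><p>(plots would go here)</p>"

def activation_map(dataset: Path) -> str:
    return "<h2>Activation map</h2><p>(maps would go here)</p>"

# The "checkboxes": names the user can tick, mapped to analysis functions.
AVAILABLE_ANALYSES = {
    "motion_summary": motion_summary,
    "activation_map": activation_map,
}

def run_requested_analyses(dataset_path: str, checked: list[str]) -> str:
    """Run the ticked analyses on a standardized dataset and return a
    browsable HTML report (which the service would then email back)."""
    dataset = Path(dataset_path)
    sections = [AVAILABLE_ANALYSES[name](dataset) for name in checked]
    return "<html><body>" + "\n".join(sections) + "</body></html>"

if __name__ == "__main__":
    report = run_requested_analyses("my_dataset", ["motion_summary"])
    Path("report.html").write_text(report)  # open in a browser to inspect
```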

(3) The end of the culture of "mine"

One of the most intriguing anthropological traits of the daily scientific enterprise is what I call the culture of "mine". This is not something out there somewhere; it is right inside our offices. By this I mean how closed students, PhD candidates, postdocs and professors (basically the whole crew) are to the idea of sharing their projects and opening them to external influences. Most often, if not always, a project is assigned to a single person in the lab, and this person is expected to run it to the end. Because that person believes it is their project, they and their boss hold a monopoly on how the project is run and can adjust the level of external involvement (politics). This results in a very conservative set of interactions between people, as any request for help, or indeed any communication, can be seen as a contribution to the project; the culture of "mine" will of course be there and trigger the appropriate set of behaviors to prevent exactly that. Unfortunately, there are countless examples of authorship disputes that arise from precisely this type of culture.

Once the opportunity to upload your dataset and run your analyses on a cloud-based system is within reach, there will be no reason not to open your data and let other people analyze it in ways different from what you thought would be most appropriate. In a crowd-sourced science, you will still own your data, but you will allow other people to look into it, pretty much the same way people are allowed to look at you when you walk down the street. The constructive discussions that follow belong to all parties and can be moderated by the person who created the dataset. I believe there will be a shift in how people conceptualize ownership of projects and data, replacing the culture of "mine" with crowd-sourced intelligence.

I recently found an article by Jeremy Freeman entitled "Open source tools for large-scale neuroscience", which made me super happy, as it expresses many of the thoughts I sketched in this post in a systematic and professional manner.

*This article is biased by the perspective of a neuroscientist.