Academic production before internet
Today, science is produced by the following basic infrastructure. This simplistic picture depicts how humans in a lab interact with the nodes making up the infrastructure. Humans basically use computational resources to analyze data by writing code. The input to these nodes overall is extremely sparse. That is we generally do not have other peoples' data available, we generally cook our own code specifically for our own data, we generally use our own computational resources.
|Unit infrastructure for scientific production|
This unit is the fundament for academia. Research is carried out by the same infrastructure that is simply replicated across geographies and time. Of course, there could be labs collaborating with each other, of course we could be using an external grid engine to run our tasks, of course we might download a toolbox to run analyses. These are all connections that are not shown in this picture.
But the emphasis here is that we spent most of our time to reinvent the wheel by
-writing the same piece of code that many people had done it in the past,
-collecting yet a new dataset instead of generating a new hypothesis compatible with available datasets,
-buying large computers that could be used by other people
-hiring system administrators that are doing exactly the same work as in another lab
the list is long...
Each lab in this culture becomes a specialized idiosyncratic creature with its own way of doing things. Politically this implies committing in long-term fixed-costs to maintain an academic infrastructure that is short-sighted and benefits mainly to the labs short-term agenda. Academia mainly benefits from the contributions of labs in the form of publications, which is considered as the unique currency in academic reward system. What is the impact of this system on the society? In the light of current replication crises in science, it is hard to be optimistic.
|Infrastructure for academic work. Culture of mine.|
This type of infrastructure organization has mainly historical reasons. This model is archaic, and has been a good model for the pre-internet era where people and systems were connected sparsely with each other, where it made sense to travel to a conference and meet other people.
A new way of doing science at the age of cloud-based systems
We have to rethink about how to place boxes shown in the previous pictures, how to set novel incentive mechanisms, and how to organize the work flow across scientists and nodes. Let's talk about this simple picture.
|A novel infrastructure for academic work. Culture of sharing.|
Outsourcing the storage and compute resources to a cloud service (e.g. AWS, GDC or some supranational public cloud service yet to be put in place) are for the benefit of the society in terms of reducing overall costs.
However the main point here is not about outsourcing storage and compute resources. The real reason for this move is for making datasets accessible to other scientists. And in the long-term this actually means making data to be publicly available to all citizens.
The only thing that is specific to a given lab is the data that is collected there. That's what labs should do: collect data. Most importantly data must to be stored according to strict standardization. That is to each data set, a map has to be associated, that will help people on how to navigate this data set. Furthermore, every dataset should be stored with a minimal code that ensure basic access to data. Also, most often datasets spans multiple modalities. For example, my fMRI datasets are typically bundled together with pupil recordings and heart-beat recordings. We therefore need not only a standardization for storage of specific kind of datasets, but we also need a way to create dataset-bundles that represents an experiment in a flexible manner. A principled way to bundle standardized datasets. Let's call this step 1.
The other thing that labs do is to write code to process their data. To my opinion this is where the biggest challenge is located, namely on finding a system where people can collaborate and create something together. Assuming that the step 1 is solved, the code that is written will also be publicly available. Therefore, code that is written will be directly connected to a dataset type.
For example, if I am trying to detect peaks in a more or less periodical physiological recording, I will not start looking for literature, find someone's algorithm, implement my version of it. I will simply search for code that is compatible with this type of data, browse among alternative codes, read comments to figure out strengths and weaknesses, consider ratings and incorporate that code to my pipeline.
Basically putting up an analysis will be about creating a pipeline using previously coded nodes or coding new nodes when the analysis has not been previously carried out. When something doesn't work as expected, code needs to be improved via collaboration. Writing good quality code will be one great novel incentive for scientists.
Another challenge is to find a way how to fund this novel system. This is certainly beyond the capacity of a single start-up. This is also beyond the scope of a single lab or institute. I also don't think today's national states are visionary enough to take such moves. To my opinion this could only be established by some tech giants who have the know-how required to solve all these problems.