The culture of "mine" in science in the age of cloud-based analysis systems*

Scientists need computing power and storage space for their data sets. For scientific institutions, this translates into relatively high, long-term fixed costs: buying hardware, maintaining network infrastructure, and paying system administrators to take care of these masses of electronics are a considerable overhead. As a result, public scientific institutes spend a lot of money and human resources to create and maintain infrastructures for storing and analyzing scientific data sets.

Yet, when it comes to hardware demands, one scientific institute is pretty much an exact replica of another: the resources needed in one place should in theory be very similar to those needed elsewhere. Therefore, instead of investing in system administrators, storage and computational resources, scientific institutes could lease these services from cloud-based infrastructures, with more flexible pricing and far less overhead. Replacing your system administrator with two PhD students is an appealing idea, after all.

There is actually nothing illuminating in this view, because it has already been happening in the corporate world for more than 10 years. Many hosting companies also offer VNC-based systems to connect to their servers and use software on powerful machines. Beyond simple hosting companies, Google Cloud and Amazon AWS are making the transformation real by integrating all sorts of compute, storage and parallelization tools and selling them as a service.

Where are we in neuroscience? Important milestones are finally becoming a reality in the natural sciences, and I think the point of no return is slowly being reached in neuroscience as well. I believe this because standardized procedures for storing and sharing datasets are becoming more and more mainstream, and this shift has the potential to change the day-to-day scientific enterprise radically. For example, OpenNeuro is one such platform, where you can upload your brain imaging dataset in the BIDS format and let analyses run on their servers. I think this is just the start of a large-scale transformation of how we do science.

Here is how I think this will unfold:

(1) Scientific publication

The way we publish our reports has probably not changed since the times of Fisher, or even Newton. The world today is a very different place, but many of the novel tools invented in the era of internet-based communication have not been incorporated into the way we conduct science. OK, instead of sending a manuscript to the editor's office by post, we now use email. Fine.

For example, the scientific reviewing system has not incorporated crowd-sourcing mechanisms to evaluate the quality of scientific papers. The decision of whether a manuscript or research proposal is worth publishing stays largely in the hands of a few non-randomly selected referees and an editor. The process is opaque, prone to biases, and has no means to stop the formation of small-world cartels that mutually benefit from positive biases.

The redistribution of reputation is not based on metrics that reflect a person's long-term value to science in general. In the best case, reputation is equivalent to your h-index, which is heavily biased by the random success of your publication track record, not by how good a scientist you are. Metrics that ensure the long-term advancement of science are typically not included; for example, we lack a metric that judges a professor by the number of their students who themselves became professors within the last 5-10 years. The infrastructure to achieve a better and more democratic system has been in place for more than a decade. I believe this change will come faster as cloud-based systems decrease the cost of storage and computation.

In the very near future, I believe any serious publication will also need to contain the related datasets and analysis pipeline, and make them publicly available to the whole scientific community (but also to other citizens). This is already happening: many journals require you to agree to share your data promptly when requested. However, the definition of "prompt" is very subjective; you may want to read this tweet-storm for a recent example. And even where uploading the dataset is obligatory, re-evaluating the data is not among the responsibilities of the referees. This means that incrementally modifying the existing system to fit the current demands of scientific democratization is not enough; we need a radically different way of publishing science.

When the data are stored and the analyses run on a cloud-based system, there will be no more excuses for reviewers not to get involved in the data analysis, as the time it takes them to have a closer look at the data and the analysis pipelines will be insignificant. Therefore, I believe any serious publication will take the concerted effort of, on the one hand, authors who designed the experiment, collected the data and wrote the initial draft of the paper, and on the other, reviewers who will be required to contribute to the data analysis using the storage and computational infrastructure provided by the cloud. There will possibly be little difference between the collaborators of today and the reviewers of tomorrow.

(2) Cloud-based analysis

Most published reports use similar methods, which are re-invented again and again by successive generations of PhD students and postdocs, a complete waste of time and resources. I actually believe there could hardly be a more inefficient system than today's science; a large company would not be able to function like this.

Once we start talking about cloud-based storage and analysis pipelines, it will also be possible to run these analyses automatically on a server. You will tick the checkbox for this or that analysis and receive the results by email in the form of a presentation or a web page (example) to click and browse around. This is of course an oversimplification, but the point is that scientists will spend their time on (1) standardizing their datasets so that they can be analyzed on the cloud-based system and (2) building analysis pipelines that are compatible with standardized datasets. The time saved on re-implementing analyses can then go into recording more data.

(3) The end of the culture of "mine"

One of the most intriguing anthropological traits of the daily scientific enterprise is what I call the culture of "mine". This is not something out there somewhere; it is right inside our offices. By this I mean how closed students, PhD candidates, postdocs and professors (the whole crew, basically) are to the idea of sharing their projects and opening them to external influences. Most often, if not always, a project is assigned to a single person in the lab, and this person is expected to run it to the end. Because this person believes it is her or his project, she or he, together with the boss, holds a monopoly on how the project is run and adjusts the level of external influence accordingly (politics). This results in a very conservative set of interactions between people, as any request for help, or even any communication, can be seen as a contribution to the project; the culture of "mine" then kicks in with the appropriate set of behaviors to prevent exactly that. Unfortunately, there are countless examples of authorship disputes that arise precisely from this type of culture.

Once the opportunity to upload your dataset and run your analyses on a cloud-based system is within reach, there will be no reason not to open your data and let other people analyze it in ways different from what you thought most appropriate. In a crowd-sourced science, you will still own your data, but you will allow other people to look into it, pretty much the same way people are allowed to look at you when you walk down the street. The constructive discussions that follow belong to all parties and can be moderated by the person who created the dataset. I believe there will be a shift in how people conceptualize ownership of projects and data, replacing the culture of "mine" with crowd-sourced intelligence.

I recently found an article by Jeremy Freeman entitled "Open source tools for large-scale neuroscience", which made me very happy, as it expresses many of the thoughts I scratched at in this post in a systematic and professional manner.

*This article is written from the perspective of a neuroscientist and biased accordingly.

Standardization of fMRI analysis with SPM using the FancyCarp Matlab toolbox

Fancycarp is a Matlab toolbox for streamlining SPM-based fMRI preprocessing and analysis.

With the FancyCarp toolbox you can create a folder hierarchy for your fMRI project, download the fMRI data from the DICOM server (IFSN internal only), preprocess the data with SPM, and conduct first- and second-level analyses.

It can be integrated with SCR and pupil analyses (experimental) for the analysis of autonomic responses.

How do I proceed?

1/ Get the repo

First clone this repository and switch to the mrt/main branch.
onat@neocortex:/tmp$ git clone https://github.com/selimonat/fancycarp.git
Cloning into 'fancycarp'...
remote: Counting objects: 2154, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 2154 (delta 0), reused 0 (delta 0), pack-reused 2150
Receiving objects: 100% (2154/2154), 986.53 KiB | 801.00 KiB/s, done.
Resolving deltas: 100% (1414/1414), done.
Checking connectivity... done.

onat@neocortex:/tmp/fancycarp$ git checkout mrt/main
Branch mrt/main set up to track remote branch mrt/main from origin.
Switched to a new branch 'mrt/main'
If you would like to version-track your repository (which you should), create a new branch following the pattern mrt/XXXX, where XXXX is the name of your project.
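For example, assuming your project is called myproject (a placeholder name), you would create and switch to that branch with:
onat@neocortex:/tmp/fancycarp$ git checkout -b mrt/myproject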

2/ Add paths

Fire up Matlab and add Fancycarp to Matlab's path. Don't forget to add SPM to your path as well.
addpath('/tmp/fancycarp')
addpath('/home/onat/Documents/Code/Matlab/spm12-6685') %your SPM installation (same path as path_spm below)

3/ Project Object

Fancycarp is based on OOP. This is convenient because it lets you define a Project with a set of properties and methods and encapsulate them in a single file. Functions that are specific to a project are declared in the ProjectObject, whereas other functions, for example those specific to participants, are declared in their respective objects. This removes clutter and helps to define a clear functional separation of the code. The ProjectObject contains all sorts of functions used for low-level housekeeping. For example, to find the path of a given subject and run, you can run the following code.
>> Project().pathfinder(2,2)
ans =
/tmp/mynextsciencepaper/data//sub002/run002/
The role of the ProjectObject is to provide functionality that spans multiple participants at once. For example, the second-level analysis is defined in the ProjectObject, as it is a project-wide operation that uses all participants. Similarly, path finding, DICOM transformations, sanity checks and MySQL requests are all defined in the ProjectObject.
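To make this concrete, here is a heavily stripped-down illustration of that layout (not the actual Fancycarp source; the path and method body are simplified):
% In Project.m: project-wide functionality (illustrative sketch only).
classdef Project
    properties (Hidden, Constant)
        path_project = '/tmp/mynextsciencepaper/data/';
    end
    methods
        function p = pathfinder(self, subject, run)
            %returns the canonical folder of a given subject and run
            p = sprintf('%s/sub%03d/run%03d/', self.path_project, subject, run);
        end
    end
end

% In Subject.m: participant-specific methods (get_hr, get_epi, ...);
% everything defined in Project is inherited automatically.
classdef Subject < Project
end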
Now, when you start a new project, the first thing to do is to adapt the properties defined in the Project object to your project. These are defined in the file called Project.m.
properties (Hidden, Constant)%adapt these properties for your project
        %All these properties MUST BE CORRECT and adapted to one's own project.
        %Path to your project folder.
        path_project          = '/tmp/mynextsciencepaper/data/';
        %Path to your SPM installation.
        path_spm              = '/home/onat/Documents/Code/Matlab/spm12-6685/';
        %Session identifiers of your participants on the scanner.
        trio_sessions         = {'PRISMA_19873' 'PRISMA_19875'};
        %For every participant, the DICOM series numbers of the TRIO/PRISMA acquisitions to be downloaded.
        dicom_serie_selector  = {[8 19 20 21 6 7 17 18 ] [8 19 20 21 6 7 17 18 ] };
        %Series [8, 19, 20, ...] will be distributed to run folders [1, 2, 3, ...].
        dicom2run             = repmat({[1:8]},1,length(Project.dicom_serie_selector));
        %For field map corrections, enter their runs as stored on the DICOM server.
        runs_fieldmap         = [{5 6} {7 8}];%The order is important: first one is the magnitude and the second one is the phase.
        %Related to VDM correction.
        apply_vdm             = [{1}   {2 3 4}];
        data_folders          = {'midlevel' 'mrt' 'design'};%if you need another folder, add it here.
        %Your TR, high-pass filtering value, and whether you want surface extraction (requires CAT12).
        TR                    = 0.99;
        HParam                = 128;%parameter for high-pass filtering
        surface_wanted        = 0;%do you want the CAT12 toolbox to generate surfaces during segmentation (0/1)
        smoothing_factor      = 4;%how many mm images should be smoothed when calling the SmoothVolume method
    end

4/ Data Folders

Standardizing the data folders of a project is the first step towards code sharing and reproducibility. To create the folder hierarchy, run
>> Project().CreateFolderHierarchy()
This will use the properties of the ProjectObject and loop over all participants and runs to create a folder structure for downloading anatomical and functional data. Check it out with the shell command tree.

The community seems to have settled on the BIDS (Brain Imaging Data Structure) standard. BIDS will also be adopted in future versions of the Fancycarp toolbox.
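Under the hood, the folder creation essentially boils down to a nested loop over participants, runs and data folders, something like the following sketch (simplified, not the actual implementation):
%simplified sketch of what CreateFolderHierarchy amounts to
for ns = 1:length(Project.trio_sessions)      %one folder per participant
    for nr = 0:8                              %run000 (anatomy) plus the functional runs
        for f = Project.data_folders          %{'midlevel' 'mrt' 'design'}
            mkdir(fullfile(Project.path_project, sprintf('sub%03d', ns), sprintf('run%03d', nr), f{1}));
        end
    end
end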
>> !tree                      
.
└── data
    ├── sub001
    │   ├── run000
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run001
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run002
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run003
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run004
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run005
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run006
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run007
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   └── run008
    │       ├── design
    │       ├── midlevel
    │       └── mrt
    ├── sub002
    │   ├── run000
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run001
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run002
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run003
    │   │   ├── design
    │   │   ├── midlevel
    │   │   └── mrt
    │   ├── run004
    │   │   ├── design

5/ Downloading Anatomical Data (Internal)

Now that we have the folder hierarchy, we can proceed with downloading the anatomical and functional data from the internal server. To download the data of a given participant, we need to create a Subject instance. Thanks to OOP inheritance, the SubjectObject receives all the methods defined in the ProjectObject in addition to the methods it defines itself. Running the get_hr method will download the subject's anatomical scans, convert them to NIfTI and store them in the subXXX/run000/mrt folder.
>> s = Subject(1)
Subject Constructor for id:1 is called:
s = 
  Subject with properties:
              id: 1
            path: '/tmp/mynextsciencepaper/data//sub001/'
    trio_session: 'PRISMA_19873'
       total_run: 8
>> s.get_hr                                 
get_hr:
Will now dump the latest HR (16:05:01)
Dicom Server returns:
=====
Database: prisma
Patient: XXXXXX
  * Examination: PRISMA_XXXXX (XXXX, F)                          [ 1|  23|  5192]
    + Study:  1 (2017-11-08 11:15:00)                            [  |  23|  5192] 
      - Series:   1 {localizer (expectb-a, blank)              } [  |    |    11] 
      - Series:   2 {AAHead_Scout_64ch-head-coil               } [  |    |   128] 
      - Series:   3 {AAHead_Scout_64ch-head-coil_MPR_sag       } [  |    |     5] 
      - Series:   4 {AAHead_Scout_64ch-head-coil_MPR_cor       } [  |    |     3] 
      - Series:   5 {AAHead_Scout_64ch-head-coil_MPR_tra       } [  |    |     3] 
      - Series:   6 {gre_field_map, 2mm, filter M              } [  |    |    90] 
      - Series:   7 {gre_field_map, 2mm, filter M              } [  |    |    45] 
      - Series:   8 {ep2d_bold, mb3, loc                       } [  |    |  1193] 
      - Series:   9 {ep2d_diff, mb3, ref                       } [  |    |    45] 
      - Series:  10 {ep2d_diff, mb3, blip inv                  } [  |    |    45] 
      - Series:  11 {mprage, HR64                              } [  |    |   240] 
      - Series:  12 {localizer (expectb-b, blank)              } [  |    |    11] 
      - Series:  13 {AAHead_Scout_64ch-head-coil               } [  |    |   128] 
      - Series:  14 {AAHead_Scout_64ch-head-coil_MPR_sag       } [  |    |     5] 
      - Series:  15 {AAHead_Scout_64ch-head-coil_MPR_cor       } [  |    |     3] 
      - Series:  16 {AAHead_Scout_64ch-head-coil_MPR_tra       } [  |    |     3] 
      - Series:  17 {gre_field_map, 2mm, filter M              } [  |    |    90] 
      - Series:  18 {gre_field_map, 2mm, filter M              } [  |    |    45] 
      - Series:  19 {ep2d_bold, mb3, part 1                    } [  |    |  1061] 
      - Series:  20 {ep2d_bold, mb3, part 2                    } [  |    |   887] 
      - Series:  21 {ep2d_bold, mb3, part 3                    } [  |    |  1061] 
      - Series:  22 {ep2d_diff, mb3, ref                       } [  |    |    45] 
      - Series:  23 {ep2d_diff, mb3, blip inv                  } [  |    |    45] 
=====
The latest recorded HR data:
Series:  11 {mprage, HR64                              } [  |    |   240] 
DicomDownload:
Calling system's COPY function to dump the data...16:05:04
source:/XXX/XXX/prisma/images/XXXX
destination:mrt/
COPY finished successully 16:05:04
ConvertDicom:
Found 240 files...
Dicom conversion s#1... (16:05:10)
Will call spm_jobman...
Running SPM jobman 1...
------------------------------------------------------------------------
Running job #1
------------------------------------------------------------------------
Running 'DICOM Import'
   Changing directory to: mrt/
   Changing back to directory: /tmp/fancycarp
Done    'DICOM Import'
Done
Finished... (16:05:15)
Deleting DICOM images in (16:05:15)
mrt/
Finished... (16:05:15)

6/ Downloading Functional Data

Similarly, the get_epi method downloads the EPIs based on the ProjectObject properties.
>> s.get_epi
Making a dicom query, sometimes this might take long (so be patient)...(16:05:14)
This is what I found for you:
You told me to download the following series: 8,19,20,21,6,7,17,18,
Double check if everything is fine.
Making a dicom query, sometimes this might take long (so be patient)...(16:05:15)
Will now dump series (16:05:17)
DicomDownload:
Calling system's COPY function to dump the data...16:05:17
source:/XXX/XXX
destination:/tmp/mynextsciencepaper/data//sub001/run001/mrt/
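In practice, you would typically download the data of all participants in one go; a short usage sketch, assuming every session listed in trio_sessions should be fetched:
for ns = 1:length(Project.trio_sessions)
    s = Subject(ns);  %create the SubjectObject for participant ns
    s.get_hr;         %anatomical scan -> subXXX/run000/mrt
    s.get_epi;        %functional runs -> subXXX/runYYY/mrt
end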

7/ Preprocessing

s.preprocess_pipeline takes care of the preprocessing steps. It attempts a fieldmap (VDM) correction if the required files are present, and then continues with a/ surface segmentation of the anatomical data with CAT12, b/ skull stripping (removal of non-neural voxels), c/ bringing the MNI atlas into native space, d/ realignment and coregistration of the functional runs, and e/ segmentation and skull stripping of the mean EPI with newSegment.
function preprocess_pipeline(self,runs)
    %meta method to run all the required steps for hr
    %preprocessing. RUNS specifies the functional runs, make it a
    %vector if needed. RUNS will be used with Re_Coreg.
    if nargin > 1
        try
            fprintf('Will attempt field corrections\n');
            self.ComputeVDM;
            self.ApplyVDM;
            self.epi_prefix = 'u_';%if we came that far, change the default EPI prefix.
        catch
            fprintf('Failed... Will work on non-field corrected EPIs\n');
        end
        %
        self.SegmentSurface_HR;  %cat12 segmentation
        self.SkullStrip;         %removes non-neural voxels
        self.MNI2Native;         %brings the atlas (if present) to native space
        self.Re_Coreg(runs);     %realignment and coregistration
        self.Segment_meanEPI;    %segments mean EPI with new segment
        self.SkullStrip_meanEPI; %creates a native mask
    else
        fprintf('One input argument is required!\n');
    end
end
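For example, to run the full pipeline on all functional runs of the subject created above, pass a vector of run indices:
>> s.preprocess_pipeline(1:s.total_run)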

8/ Analysis of Functional Data

So far, Fancycarp has covered the basic preprocessing steps, which should be fairly common to anybody, independent of their project. First-level analyses of fMRI data, however, have project-specific flavors that cannot be fully automated without additional, project-specific information. To this end, the CreateFolderHierarchy method has created a design folder in every subject and run folder, which is supposed to contain the parameters necessary for building the design matrix required for the first-level analysis. You may create as many design models as you like, each indexed with an integer, and use this number as the argument to the first-level analysis method defined in the SubjectObject.
>> pwd
ans =
/mnt/data/project_fearamy/data/sub005/run001/design
>> ls -l
total 136
drwxr-xr-x 2 onat onat 4096 Feb 24  2017 model01
drwxr-xr-x 2 onat onat 4096 Feb 24  2017 model02
drwxrwxr-x 2 onat onat 4096 Feb 13 11:29 model03
drwxr-xr-x 2 onat onat 4096 Feb 24  2017 model04
....
In each of these folders there should be a data.mat file that contains SPM-compatible timing information for the different stimuli and conditions. For example, the model below contains 10 conditions as a Matlab structure array, which could be fed directly to the SPM GUI.
>> cond
cond = 
1x10 struct array with fields:
    name
    onset
    duration
    tmod
    pmod
Calling s.analysis_firstlevel(1,1) should then fit a first-level GLM using model 1 on run 1.
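For reference, such a condition structure can be assembled and saved in a few lines. The following is a minimal sketch: the condition names, onsets and durations are placeholders, and it assumes the variable stored in data.mat is called cond, as in the listing above.
%build an SPM-compatible condition structure with placeholder values
cond = struct('name', {}, 'onset', {}, 'duration', {}, 'tmod', {}, 'pmod', {});
for nc = 1:10
    cond(nc).name     = sprintf('condition%02d', nc);
    cond(nc).onset    = [10 60 110] + nc;  %onsets in seconds (placeholders)
    cond(nc).duration = 0;                 %0 models events as stick functions
    cond(nc).tmod     = 0;                 %no time modulation
    cond(nc).pmod     = struct('name', {}, 'param', {}, 'poly', {});%no parametric modulators
end
%store it where Fancycarp expects it, e.g. subXXX/runYYY/design/modelXX/data.mat
d = fullfile(Project().pathfinder(5, 1), 'design', 'model01');
if ~exist(d, 'dir'); mkdir(d); end
save(fullfile(d, 'data.mat'), 'cond');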