Volume 9: How Would ReproNim Containerize a Workflow

Version 0.7, published August 27, 2020.

Change log

  • Version 0.5: Published April 5, 2020.

  • Version 0.6: Published May 25, 2020.

  • Version 0.7: Published August 27, 2020. Formatting for Sphinx.

Authors: Peer Herholz and The ReproNim Team1

Special Thanks to: Fabrizio Pizzagalli, Neda Jahanshad and Paul Thompson at ENIGMA.

Overview

Stakeholder: Data Analyst

Problem Statement

Imagine you are part of a tremendously large research project that has collaborators all over the world, and you are responsible for creating comprehensive neuroimaging data analysis workflows. More specifically, the research project focuses on mental health, involves thousands of participants, and you are tasked with creating a pipeline that computes and extracts anatomical brain shape features from MRI brain data in volume and surface format generated by the FreeSurfer2 structural analysis pipeline. A good real-world example of such an endeavor is the Enhancing NeuroImaging Genetics through MetaData (ENIGMA) project3. After a very long development time (several months or even years), your pipeline (in this case /home/frodo/work/enigma/shape_features_pipeline) is working robustly and is ready to be shared with hundreds of collaborators all around the world. Not long after you share your script via email, you receive hundreds, yes hundreds, of responses, all stating that your pipeline isn’t working and won’t even start running. What is happening? It worked perfectly fine for you, hence they must have all been doing it wrong… But then it hits you: nothing is wrong, everything works for you, but only for you. In this case “you” refers to your computing environment, the resources and setup you used all this time to create your pipeline. It was never tested anywhere else. Maybe you are not aware of it, but prior research work (insert references here) showed that everything from the base, the operating system, to the top, specific versions of libraries, and everything in between, can and will have an influence on your results. This encompasses everything from differences at the 10th decimal place to changes in the significance of your hypotheses. What now? Is it like the end of The Lord of the Rings:

“How do you pick up the threads of an old life? How do you go on, when in your heart you begin to understand… there is no going back? There are some things that time cannot mend. Some hurts that go too deep, that have taken hold.”?

Behold, all is not lost, ReproNim to the rescue!

ReproNim Solution

In theory

Ideally, going full ReproNim, everything you do is under version control, FAIR and reproducible: from data, through analysis workflows, to results. At this point in time, we have all the resources to make this a reality and a standard. Instead of only working on your machine, your analysis pipeline can be made fully reproducible everywhere by utilizing container technologies such as Docker or Singularity. Instead of everything breaking down after changes, these containers and the workflow can be placed under version control through DataLad4 and git5. The former allows you to put your data under version control to prevent situations like: “What happened to my data? What did you do with the data? Why are 20 participants missing?” and many more. Last but not least, these tools leverage repositories, so that everyone can find and reuse your resource without having to ask you for it, and without you handling hundreds of emails in case of updates and/or changes. Depending on your level of experience and training, this all might sound like rocket science and fairytales to you. However, it’s all real and all possible to understand and implement yourself. ReproNim is here to guide you along this exciting journey that will save you and countless other folks an immense amount of time and stress.

In Practice: The Gory Details

Enough talking! Pitter patter, let’s do this! To start from the beginning, we prepare a few things. What we, that is you, need is the following: git, a GitHub or GitLab account, Docker or Singularity, and DataLad (provide links for all). Please note: for this guide, we’re assuming you are using GitHub and Docker. The other resources work comparably, but may differ in small aspects and can be the topics of future chapters of this document. On a related note: these tools work best (natively) under Unix-based systems such as Linux distributions and macOS. Windows users should consider using the Windows Subsystem for Linux with Ubuntu (reference).
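If you want to quickly confirm that these tools are installed and available on your PATH, each of them can report its version (a minimal check; the printed version numbers will of course differ between systems):

$ git --version
$ docker --version
$ datalad --version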

Step 1: Create a project directory and place it under version control.

We start with creating a project directory (we will call it enigma_shape), that will be placed under version control and within which we will work on the pipeline and containers.

We create the project directory:

$ mkdir /home/frodo/enigma_shape

go into it:

$ cd /home/frodo/enigma_shape

and initiate version control:

$ git init

As a result, a hidden directory called .git will be created in our project directory (/home/frodo/enigma_shape/.git). Great, with that we have already completed one of the most crucial steps: providing the possibility to track and log every single change we make to whatever is placed in our project directory. As a first and very common step, we can create a README file that explains the content and goal of our project, among other potentially useful and important information. Using your favorite text editor (nano in the example), create the README file. In this example, the first line creates a file called “README.md”, the second depicts the text we want to include, and the final two keystrokes close and save our newly created file:

$ nano README.md

This is the project of the ENIGMA anatomical brain shape feature
pipeline.

ctrl + x

y

Now we have to add this file to our version control:

$ git add README.md

While we are at it, we will move the scripts we have worked on to this directory as our goal was to place these files under version control for further development:

$ mv /home/frodo/work/enigma/shape_features_pipeline /home/frodo/enigma_shape
$ git add shape_features_pipeline

This is our new beginning, our new year zero. Everything we will do to the analysis pipeline from this point in time onwards can and will be version controlled. This includes adding the README.md and our scripts to the “new” project. In version control terms, we ‘commit these changes’. While doing so, we make sure to include a helpful and informative commit message that tells us (and our future selves) what we have changed:

$ git commit -m "Add README.md and current version of scripts."

To make everything FAIR, speed up the development process and create the opportunity for interaction and collaboration, we will make our project available on GitHub. To do so, we have to create a respective project on GitHub first:

  1. Login to your GitHub account

  2. Click on “New repository” and name it “enigma_shape”

  3. Click on “Create new repository”

With that, we have both the parts we need to make this collaborative: our local repository where we will work on, test and implement changes and our remote place on GitHub where we will “push” our changes to.

$ git remote add origin git@github.com:frodo/enigma_shape
$ git push -u origin master

Shortly after, our project and the included files, along with the changes we committed, can be found on GitHub, either visible to everyone (ideal case) or only to your collaborators. In either case, folks can go to your repository and download the resources without you having to send everything by email. Once downloaded, they can also run the analyses. Oh, wait…wasn’t there something? Yes, there was. Namely, the whole reason you are reading this: they most likely still cannot run it, as the pipeline only works for you and your machine/setup. We also have to provide the corresponding necessary computing environment. But how can we do that? Buying hundreds of laptops on which we put clones of your machine? Maybe yes, maybe no, maybe we will see how in Step 2: Isolating and sharing computing environments.

Step 2: Isolating and sharing computing environments.

To answer the question (hundreds of laptops?) from before: hard no. Ain’t nobody got money and time for that (also think about the environment). Instead we are going to utilize virtualization techniques (reference?) that allow us to create, modify and share entire computing environments. As crazy as this might sound, it is actually a commonly and widely used procedure these days, with a lot of software and tools depending on it. While there are different ways to achieve the goal of virtualization, we will go with so-called “containers” given their efficiency, lightweight setup and supporting resources (including online repositories). As you can see below in Figure 1, the reason why they are so convenient and efficient is that, unlike other virtualization methods, they do not simulate an entire guest operating system with the respective resources, but utilize the resources (CPU, RAM, etc.) of the machine on which they are running through the “container engine”. On top of that sit the binaries and libraries, as well as specific applications.

../_images/vol09_container_stack.png

Figure 1: The container software stack (image source: https://blog.netapp.com/wp-content/uploads/2016/03/Screen-Shot-2018-03-20-at-9.24.09-AM.png).

As mentioned before, there are Docker and Singularity. Both are highly comparable in terms of usage and implementation. However, because Docker requires root-equivalent privileges and thereby allows privilege escalation, it is typically not permitted on HPCs and similar shared architectures. For these cases Singularity, where the user is identical within and outside the container, is more appropriate. Within our adventure, we will use Docker, but include a section on Singularity at the end.

So, back to our goal: isolating and sharing computing environments. Based on the (super) short primer above, this becomes a bit more understandable. In brief: we have to recreate the computing environment that was used to develop and test the pipeline. Depending on your “IT/Computer” knowledge level, this is more or less straightforward. The process of creating containers is actually helpful in understanding the necessary steps: containers are created or “built” from different “layers”, which are specific components of your computing environment. These layers are specified within so-called Dockerfiles (https://docs.docker.com/engine/reference/builder/), which are used during a container’s build process. Depending on your pipeline or whatever the container should do, these Dockerfiles can become very large and complex. However, we are once more lucky, as ReproNim created a tool to help you with the robust, reliable, reproducible and easy creation of Dockerfiles. Instead of writing everything by hand, including Docker-specific commands, we can use Neurodocker (https://github.com/ReproNim/neurodocker), a Docker container that creates custom Dockerfiles to use within container builds. Yes, you heard me: a Docker container to create Docker containers, it’s Dockerception! Neurodocker, as the name suggests, is intended for creating neuroimaging-related containers (nevertheless, it of course also works for other purposes). With only a few lines of code, we can create exhaustive Dockerfiles, ready to build our dream container. ‘Nough said, let’s start.

First, we need to download the Neurodocker container in order to use it. Docker containers, no matter how small or large, no matter the purpose, can freely be shared on DockerHub (https://hub.docker.com/), an online repository for and home to thousands of containers. The Docker jargon for downloading a container from DockerHub is “pulling” and works as follows:

$ docker pull username/container_name:tag

Where docker runs the Docker application, pull indicates what we want to do (in this case, pulling a Docker container from DockerHub), and the last part is the specific container we want to pull. In more detail, username is the Docker ID of the person or organization who created and uploaded the container you want to pull, container_name is the name of the container you want to pull, and tag is the particular version of the container you want to pull. Important note: if you don’t include tag, Docker will, by default, pull the latest version of the container. More precisely, Docker will search for the container version that is tagged with ‘latest’. In ReproNim terms: this is not cool. With the flow of time, versions (and hence the ‘latest’ version) will change, so ‘latest’ today is not necessarily the same as ‘latest’ next year or last year. If the container version changes, it may not work the same as other versions. Therefore, make sure to always use a specific version. This holds true for all containers you will end up using in your research. There’s no use in running containers if their version is not distinct, as you will end up with the same problem that brought us here (only not on your local machine, but within the realm of containers). Now that we have talked about that, let’s adapt the docker pull command to our needs.

From Neurodocker’s page on DockerHub (https://hub.docker.com/r/kaczmarj/neurodocker/) we see that we should type and execute the following:

$ docker pull kaczmarj/neurodocker

Translating it again, we are going to pull the neurodocker container from kaczmarj (this is Jakub’s Docker ID, the person who mainly develops Neurodocker within ReproNim).

But wait…we forgot the tag. In every repository on DockerHub you will find the “Tags” tab, which, once clicked, displays all available tags for a given container. For Neurodocker, as of this writing, this is how it looks:

../_images/vol09_neurodocker.png

We will use the latest version with a specific tag, in this case: 0.6.0. Our complete Docker pull command thus is:

$ docker pull kaczmarj/neurodocker:0.6.0

After executing this command, you should see something like this:

../_images/vol09_terminal_pulled.png

After stating the version of the container and where you are pulling it from, you see four lines of cryptic number/letter combinations, followed by “Pull complete”. These are the aforementioned layers that are necessary to compose and create a given container. After that, you get a digest (the image’s SHA checksum) and the status message

“Downloaded newer image for kaczmarj/neurodocker:0.6.0”

telling you that you successfully pulled the container. In case you don’t believe it (or me), you can run the docker images command to verify that the image is there, which additionally provides useful information, including the size of the image and when it was built:

$ docker images
../_images/vol09_terminal_images.png

With that, we are ready to use Neurodocker to build not only the container we need, but also the one we deserve! This is where the fun really starts, as we now have the opportunity to recreate our computing environment, isolate it and share it. We can rebuild it, we have the technology. From the Neurodocker page, we can grasp that it is used, like most other Docker containers, through the command line and provides a broad range of input arguments and settings. As it’s neither my nor your first day in the computer realm, we can anticipate some trial and error along the process. Hence, let’s put our Neurodocker adventure in a small script that will ease up the process big time and also, you already guessed it, make it more reproducible. Long story short: we create a bash file in our project directory and add some information:

$ nano generate_enigma-sulci_images.sh

#This is the neurodocker script to create the ENIGMA shape features pipeline container.

ctrl + x

y

Great! Off we go to the next step… HA, gotcha! You forgot to add this new file to git, didn’t you? Please go to the corner of unreproducible research and write “I need to add project files to git” one hundred times. Go, I’ll wait…Done? Good. Let’s add our new file:

$ git add generate_enigma-sulci_images.sh
$ git commit -m "Added generate_enigma-sulci_images.sh"

Better than standing in the corner, isn’t it?

From the examples included in the Neurodocker repo, we can see that we can define a little bash function to help us. We will add that part below our little explanation:

$ nano generate_enigma-sulci_images.sh

set -e
generate_docker() {
    docker run --rm ${image} \
    generate docker
}

The first line, “set -e”, is a bash-specific setting that will result in an immediate exit of our little script or “program” in case an error appears. The next part, “generate_docker() {}”, defines a function called “generate_docker”, with the function body being whatever is inside “{}”. In our case this is “docker run --rm ${image} generate docker”, which is the Docker command for running an image, here “${image}”, while “generate docker” is already part of the input arguments or settings of what this “${image}” should do, in this case, generating a Dockerfile. You might wonder: “What is ${image} actually?”. Good question! It’s a super important variable we have yet to define. Remember the “tag” side quest we had above? That of course also applies here, as we need to set the specific version of Neurodocker we want to use. Reproducibility transcends every level and aspect folks6. Obviously, we are going to use the version we pulled earlier:

set -e

image=kaczmarj/neurodocker:0.6.0

generate_docker() {
    docker run --rm ${image} \
    generate docker
}

We now have the backbone of our script ready, as we defined a function that runs a specific version of the Neurodocker container and indicates that we would like to generate a Dockerfile. From the examples, we see that we should define the input argument “--base” next. This is the underlying operating system, or: “This is the operating system on which we will build our computing environment!”. Container virtualization relies on UNIX-based systems, in this case Linux distributions. While it’s possible to build Windows containers (REFERENCE), macOS is not supported (not to say that maybe somewhere in northern California (no, not Napa Valley, a bit below) this exists). Instead of being a shock, this should be a relief, as the majority of neuroimaging analysis software runs natively on Linux but not Windows or macOS (here macOS is actually doing OK because of its UNIX lineage). With regard to our case, we are also lucky, because you used Ubuntu 12.04 to develop and test the pipeline. Creating a container running Ubuntu 12.04 is fairly easy based on our little function and Neurodocker, as we just have to add the --base argument:

set -e

image=kaczmarj/neurodocker:0.6.0

generate_docker() {
    docker run --rm ${image} \
    generate docker \
    --base ubuntu:12.04
}

Furthermore, it is stated that we need to include the “--pkg-manager” argument, which sets the Linux package manager we want to install and later utilize within our container. We will go with the classic “apt”:

set -e

image=kaczmarj/neurodocker:0.6.0

generate_docker() {
    docker run --rm ${image} \
    generate docker \
    --base ubuntu:12.04 \
    --pkg-manager apt
}

As we can see from the documentation, the output of this function will be text. Thus, we will redirect it into a text file (more precisely, a text file named ‘Dockerfile’).

set -e

image=kaczmarj/neurodocker:0.6.0

generate_docker() {
    docker run --rm ${image} \
    generate docker \
    --base ubuntu:12.04 \
    --pkg-manager apt
}

generate_docker > Dockerfile

Given that we use Ubuntu 12.04 as our base, we need to remove “locale specific settings”, as Neurodocker has some problems with such old Linux distributions. As this is unfortunately really outside the scope of our adventure (yes, there is such a thing as too many side quests), I would like to point you to this thing called “the internet” if you want to follow up on this. For now, we will remove the corresponding parts of our Dockerfile, which can be found in lines 29-31 (we delete line 29 three times because, after each deletion, the following lines move up by one):

set -e

image=kaczmarj/neurodocker:0.6.0

generate_docker() {
    docker run --rm ${image} \
    generate docker \
    --base ubuntu:12.04 \
    --pkg-manager apt
}

generate_docker > Dockerfile

sed -i '29d' Dockerfile
sed -i '29d' Dockerfile
sed -i '29d' Dockerfile

That’s it! That’s all we need. Before we conduct the first test run, we need to commit the changes:

$ git commit -am "update generate_docker function, base, pkg included"

You waited long enough, let’s run it:

$ bash generate_enigma-sulci_images.sh

If everything went according to plan, you should see a new file called Dockerfile. If you open it (e.g. using your favorite text editor), you should see some cryptic things with some words that appear to be familiar. Nope, these aren’t hieroglyphs, these are Docker-specific commands that are set within a Dockerfile and used within the build process of Docker containers. By now, you are glad that Neurodocker wrote that for you, eh?
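To give you a rough idea of what to expect (this excerpt is purely illustrative and not the verbatim output, which depends on the Neurodocker version you used), the beginning of the generated Dockerfile looks roughly like this:

# Illustrative excerpt of a Neurodocker-generated Dockerfile (not verbatim output)
FROM ubuntu:12.04

USER root

ARG DEBIAN_FRONTEND="noninteractive"

RUN apt-get update -qq \
    && apt-get install -y --no-install-recommends \
           apt-utils bzip2 ca-certificates curl unzip \
    && apt-get clean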

As usual, we will add our new file to version control and commit changes:

$ git add Dockerfile
$ git commit -m "add first version of Dockerfile"

Everything is logged and we are good to go, our first custom made container awaits. We will use the docker build command as follows:

$ docker build -t enigma-sulci .

Where docker runs the Docker application, build specifies the Docker command we want to run, -t provides our container with a name, and . indicates that the information and settings found within the Dockerfile present in the current directory should be used to build our container. Upon running this command you should see something like the following (please note that I won’t include the complete output as it’s very long, but only the parts we focus on):

../_images/vol09_terminal_build.png

We can see our container being built with the specifics we set. First, our base is pulled from DockerHub. Yes, another Dockerception moment! As mentioned before, we create our custom container on top of a certain operating system. This operating system is itself available as a Docker container, and we use that instance as the starting point from which we go further. Next, the package manager is installed along with some basic libraries. And with the line

../_images/vol09_terminal_build_done.png

our container is done-zel washington. Not kidding, it was as easy as that! Don’t believe it? Use docker images to verify that our container is in fact there.

../_images/vol09_terminal_images_2.png

Now, you are feeling it, right? This immense power? The incredible, endless possibilities? Please, as always: don’t get corrupted by it. That’s the path to the dark side. But wait…here, the dark side is actually good, as it’s fully reproducible! Phew, existential crisis averted. Let’s use the momentum and keep going. Our next building blocks are the libraries and binaries we are going to need for our pipeline to work properly. This part heavily depends on your workflow and pipeline, apart from some very common libraries and binaries. To be honest, this part is quite often trial and error, as you are most likely not aware of all the things running in the background that are essential for your operating system and thus your specific application. Lessons learned, eh? If something is missing, you will receive an error message stating something like:

“error while loading shared libraries: *library_name*: cannot open shared object file: No such file or directory”

In this case, use your favorite search engine to look up this particular problem; with a fair chance you are not the first one to run into it, and someone has posted the solution, which, in the majority of cases, is to install whatever is missing. As we’re going to use Neurodocker’s “--install” argument to provide a space-separated list of libraries that should be installed, you can just add the “missing parts” and rebuild your container. For our current example, we are once again lucky and know what needs to be installed for our pipeline to work. Thus we can indicate this in our function accordingly:

set -e

image=kaczmarj/neurodocker:0.6.0

generate_docker() {
    docker run --rm ${image} \
    generate docker \
    --base ubuntu:12.04 \
    --pkg-manager apt \
    --install curl git gcc g++ imagemagick \
    xvfb r-base libgl1-mesa-dev \
    num-utils libqt4-dev \
    libqt4-opengl-dev libqt4-sql \
    libqt4-sql-mysql
}

generate_docker > Dockerfile

sed -i '29d' Dockerfile
sed -i '29d' Dockerfile
sed -i '29d' Dockerfile

Please note that depending on your pipeline, this list can be very short or very long, easy or hard to assemble. Don’t give up, searching the world wide web and asking questions in suitable forums like Neurostars (www.neurostars.org) will lead to the answer! Don’t worry, these things need time, practice and experience, especially if you are new to all of this!
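As a small, purely illustrative sketch of how such detective work can look (the binary name below is a hypothetical placeholder, and apt-file is an extra Ubuntu package you would have to install first), you could list the shared libraries one of your pipeline’s programs needs and then search for the package that provides a missing one:

$ ldd shape_features_pipeline/some_binary   # list required shared libraries (binary name is hypothetical)
$ apt-file search libGL.so.1                # on Ubuntu, shows which package ships a missing library (requires apt-file)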

Ok, motivational section over, let’s continue. For the prize of fully reproducible research, we now continue with: a) binge-watching GoT, b) git committing our changes, or c) questioning our career choice. It’s obviously answer b) (even though a) would also be acceptable):

$ git commit -am "add library install to generate_enigma-sulci_images.sh"

Now, we can recreate our Dockerfile and rebuild our container. Based on our function all we need to do is:

$ bash generate_enigma-sulci_images.sh
$ git commit -am "include library install in Dockerfile"
$ docker build -t enigma-sulci .

You should see that, instead of recreating everything from scratch, our previous container version is reused where the underlying components did not change; only the new parts are added and a new container with the same name is created.

Please note that a given version of a container can never be changed; only a new version containing the respective changes can be created.

If you check your newly created container, you will notice a larger file size as we added additional libraries:

$ docker images
../_images/vol09_terminal_images_3.png

Oh, what’s that? A wild Docker image with no name and tag appeared! Let me rephrase that: “Hello Docker container my old friend…”. It is in fact the previous version of our container, which has been “untagged”.

Please note that rebuilding a container won’t automatically remove the previous version. We have to do this manually using the “rmi” (remove image) command, to which we provide the (seemingly arbitrary) IMAGE ID of the container we want to remove as input:

$ docker rmi 8191209b8a59

If we now check again, we can verify that the old version was removed:

$ docker images
../_images/vol09_terminal_images_4.png
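A small aside (the version tags below are made up purely for illustration): if you prefer to keep older builds around in a recognizable way instead of letting them become untagged, you can give each build an explicit version tag rather than reusing the bare image name:

$ docker build -t enigma-sulci:0.1 .
# ...update the script, regenerate the Dockerfile...
$ docker build -t enigma-sulci:0.2 .
$ docker images enigma-sulci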

We are really making some progress here, awesome! Off we go to the next thing, or should I say layer (muhaha → sinister laugh). While checking your code again, you remember that you installed FreeSurfer some time ago to utilize some of its functions within your pipeline. Thus, we need to include it in our container as well. To do so, we will basically conduct the steps one would normally go through, but in a Neurodocker way. More precisely, we need to run some bash commands in order to, first, download the FreeSurfer software, second, unzip the downloaded file and, third, set the “FREESURFER_HOME” environment variable. I assume that by now you have started to worry less about these things, as you have already experienced Neurodocker in all its beauty. And you’re completely right to do so! Setting or running bash commands is a no-brainer with Neurodocker: the “--run-bash” argument!

The first two points are set via:

--run-bash "curl -sSL https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/6.0.1/freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.1.tar.gz | tar zxv --no-same-owner -C /opt"

which runs the curl command to download the FreeSurfer software and then uses tar to unzip the downloaded file into the directory /opt. As you can see, all you have to do is pass the bash command, exactly as you would use it on your local machine, to the --run-bash argument. Given that we run this command during the build process of our container, the respective steps are conducted within the container, thus the directory /opt is a path inside the to-be-created container! While typing, you remember that you deleted a bunch of FreeSurfer files as they were very big and you didn’t need them for your pipeline. So, instead of packing those things into your container, thus unnecessarily increasing its size, we will just exclude them. No biggie!

--run-bash "curl -sSL https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/6.0.1/freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.1.tar.gz | tar zxv --no-same-owner -C /opt \
    --exclude='freesurfer/diffusion' \
    --exclude='freesurfer/docs' \
    --exclude='freesurfer/fsfast' \
    --exclude='freesurfer/lib/cuda' \
    --exclude='freesurfer/matlab' \
    --exclude='freesurfer/mni/share/man' \
    --exclude='freesurfer/subjects/fsaverage_sym' \
    --exclude='freesurfer/subjects/fsaverage' \
    --exclude='freesurfer/subjects/fsaverage3' \
    --exclude='freesurfer/subjects/fsaverage4' \
    --exclude='freesurfer/subjects/fsaverage5' \
    --exclude='freesurfer/subjects/fsaverage6' \
    --exclude='freesurfer/subjects/cvs_avg35' \
    --exclude='freesurfer/subjects/cvs_avg35_inMNI152' \
    --exclude='freesurfer/subjects/bert' \
    --exclude='freesurfer/subjects/lh.EC_average' \
    --exclude='freesurfer/subjects/rh.EC_average' \
    --exclude='freesurfer/subjects/sample-*.mgz' \
    --exclude='freesurfer/subjects/V1_average' \
    --exclude='freesurfer/trctrain'"

Good catch, that will save us, that is the Docker container, some space.

Now to the third point. In order to set environment variables, we can use the --env argument followed by the variable definition, just as you would do it in bash:

--env FREESURFER_HOME="/opt/freesurfer"

Damn, Neurodocker has it all! All we have to do now is include these things in our function:

set -e

image=kaczmarj/neurodocker:0.6.0

generate_docker() {
    docker run --rm ${image} \
    generate docker \
    --base ubuntu:12.04 \
    --pkg-manager apt \
    --install curl git gcc g++ imagemagick \
    xvfb r-base libgl1-mesa-dev \
    num-utils libqt4-dev \
    libqt4-opengl-dev libqt4-sql \
    libqt4-sql-mysql \
    --run-bash "curl -sSL https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/6.0.1/freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.1.tar.gz \| tar zxv --no-same-owner -C /opt \
    --exclude='freesurfer/diffusion' \
    --exclude='freesurfer/docs' \
    --exclude='freesurfer/fsfast' \
    --exclude='freesurfer/lib/cuda' \
    --exclude='freesurfer/matlab' \
    --exclude='freesurfer/mni/share/man' \
    --exclude='freesurfer/subjects/fsaverage_sym' \
    --exclude='freesurfer/subjects/fsaverage' \
    --exclude='freesurfer/subjects/fsaverage3' \
    --exclude='freesurfer/subjects/fsaverage4' \
    --exclude='freesurfer/subjects/fsaverage5' \
    --exclude='freesurfer/subjects/fsaverage6' \
    --exclude='freesurfer/subjects/cvs_avg35' \
    --exclude='freesurfer/subjects/cvs_avg35_inMNI152'\
    --exclude='freesurfer/subjects/bert' \
    --exclude='freesurfer/subjects/lh.EC_average'\
    --exclude='freesurfer/subjects/rh.EC_average'\
    --exclude='freesurfer/subjects/sample-*.mgz' \
    --exclude='freesurfer/subjects/V1_average' \
    --exclude='freesurfer/trctrain'" \
    --env FREESURFER_HOME="/opt/freesurfer"
}

generate_docker > Dockerfile

sed -i '29d' Dockerfile
sed -i '29d' Dockerfile
sed -i '29d' Dockerfile

We are good ReproNimers and commit the changes:

$ git commit -am "include FreeSurfer in generate_enigma-sulci_images.sh"

Tale as old as time: recreate the Dockerfile, commit the changes and rebuild the container!

$ bash generate_enigma-sulci_images.sh
$ git commit -am "include FreeSurfer in Dockerfile"
$ docker build -t enigma-sulci .

After you press “enter”, go get a coffee (or rather 10 coffees), as the FreeSurfer file is quite large and the download will take a while. And when the years have passed by, the download will eventually finish and our container will be built, this time with FreeSurfer included. As usual, we can confirm that it’s there and notice the, this time quite impressive, increase in size (holy guacamole!). Also, we remove the old version:

$ docker images
../_images/vol09_terminal_images_5.png
$ docker rmi 2c5dcabac178
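If you are curious which layer is responsible for how much of that size, docker history breaks an image down into its individual layers and their sizes:

$ docker history enigma-sulci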

Ok, we’re getting somewhere. Slow but steady. Next in line is BrainVISA7, another software package your pipeline depends on. It feels like the ReproNim version of Groundhog Day, doesn’t it? Wait, what’s that? You are tired of downloads and asking if we can’t just copy the software from your machine into the container somehow? Well, what do you think this is? Amateur hour? Of course, and you can bet all your other so-far unreproducible pipelines that Neurodocker has you covered! However, first things first: we copy the directory that contains the software into our project directory and add it to git, committing the changes:

$ mv /home/frodo/work/brainvisa /home/frodo/enigma_shape
$ git add brainvisa
$ git commit -m “added brainvisa to the project directory”

To copy this directory into our container during the build process, we can use the --copy argument, which takes the directory we want to copy and the destination within the container as input arguments:

--copy brainvisa /opt/brainvisa

We add it to our function (I will stop including every instance of the function as it’s getting longer and longer; the complete, final version will be displayed at the end of this section) and conduct the well-known steps: commit and rebuild.

$ git commit -am "added brainvisa to generate_enigma-sulci_images.sh"
$ bash generate_enigma-sulci_images.sh
$ git commit -am "added brainvisa to Dockerfile"
$ docker build -t enigma-sulci .
$ docker images
$ docker rmi XXX

You see that thingy over there? That’s the finish line, and you are almost there! The next part is your pipeline itself, and we can use the same commands as above to include it in the container-building script:

--copy shape_features_pipeline /opt/shape_features_pipeline
$ git commit -am "added enigma-sulci pipeline to generate_enigma-sulci_images.sh"
$ bash generate_enigma-sulci_images.sh
$ git commit -am "added enigma-sulci pipeline to Dockerfile"
$ docker build -t enigma-sulci .
$ docker images
$ docker rmi XXX

Wanna know what? That’s it: you isolated a computing environment and are able to share it, for example via cloud storage, a USB drive or repositories. The finish line, my friend, is blowing in the reprowind. You can now start working with your container via:

$ docker run -it enigma-sulci

Where the run command starts your container and, thanks to the -it flags, drops you into a shell within your container. While you could already start running (noticed the pun?) and testing your pipeline within the container, we will go one step further. This is the Reprolympics! Instead of always starting your container, cd’ing to and sourcing your pipeline, and adding local data within mapped paths, you can save this precious time for more fun things (talking analyses, of course) by automating the behavior of your container. This refers to functions or tasks that are defined to be executed during the start of your container. The other benefit besides saving time? Reproducibility!

Step 3: Automating your container

A common example of containers with such behavior are BIDS-Apps (http://bids-apps.neuroimaging.io/), containerized pipelines or applications that understand BIDS. While you could adapt your pipeline to also work very well with BIDS datasets, we won’t go down that rabbit hole for now and leave the pipeline dataset agnostic. Speaking of which, how does your pipeline work again? Oh, that’s right: the main function is run_pipeline.sh, it assumes that the data you want to process are in a directory called /fs_data, and it takes the identifier of the participant you want to analyze within that directory as input. Furthermore, it needs a FreeSurfer license. Finally, it will save the results to a directory called /output. We can automate this in two steps. First, we need to tell the container that your pipeline should be executed upon starting. This behavior can be achieved by modifying the entrypoint of your container. Neurodocker’s way of doing this is the --entrypoint argument, which we can use to add the main function of your pipeline, located in /opt/shape_features_pipeline, to the entrypoint or startup of the container:

--entrypoint "/neurodocker/startup.sh /opt/shape_features_pipeline/run_pipeline.sh"

With that, your main function is automatically executed every time you run your container. And now, for the last time: commit and rebuild:

$ git commit -am "added main function to entrypoint in generate_enigma-sulci_images.sh"
$ bash generate_enigma-sulci_images.sh
$ git commit -am "added main function to entrypoint in Dockerfile"
$ docker build -t enigma-sulci .
$ docker images
$ docker rmi XXX

The next step entails an adaptation of the docker run command. We need to make the data directory, the output directory and a FreeSurfer license available within our container. Quick reminder: our computing environment is isolated, the doors of the container are closed, nothing in, nothing out. Except that we can create the possibility for the container to interact with files and paths on your local machine or, in container terms, the host machine. This process is called mapping and is implemented through the -v flag, which expects two arguments separated by a colon: a path or file on your host machine and the path it should be mapped to inside the container. Adjusted to our needs, this would look as follows:

-v /home/frodo/freesurfer_outputs:/fs_data
-v /home/frodo/shape_analyses:/output
-v /home/frodo/freesurfer/license:/opt/freesurfer/license

The first two map the input data and output directory respectively, while the third maps the FreeSurfer license file. We can now bring it all together and also add the ID of the participant that should be analyzed; let’s call him Spock:

docker run -v /home/frodo/freesurfer_outputs:/fs_data \
    -v /home/frodo/shape_analyses:/output \
    -v /home/frodo/freesurfer/license:/opt/freesurfer/license \
    enigma-sulci \
    Spock

This, this is the moment. Experience it in all its glory. You have come a long way. From a complex pipeline that only worked on your machine and was shared by email to an application that works for everyone who has Docker installed and can be shared via version-controlled repositories that are accessible to everyone. What a ride! Let’s bring it home and actually push your container to DockerHub:

../_images/vol09_terminal_images_6.png
$ docker tag d06b7f5f6a55 frodo/enigma-sulci
$ docker push frodo/enigma-sulci
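To illustrate what this means in practice (the local paths below are hypothetical placeholders that a collaborator would replace with their own, and ideally both sides would also use a specific version tag, as discussed earlier), a collaborator on another machine could now simply run:

$ docker pull frodo/enigma-sulci
$ docker run -v /data/freesurfer_outputs:/fs_data \
    -v /data/shape_analyses:/output \
    -v /data/freesurfer/license:/opt/freesurfer/license \
    frodo/enigma-sulci \
    Spock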

Your container can now be ‘pulled’ by others! However, there’s always room for more reproducibility…

Step 4: It’s getting meta

At the very beginning of our story, something about version-controlled data was mentioned. And while this is, strictly speaking, not part of bringing your pipeline into a container, it should be! Because these days, we have the possibility to connect your containerized pipeline to the dataset you want to process with it. Furthermore, we can version control every application of our container. The future is now! A future called DataLad8. Learn all about it here. We will discuss this further in future chapters of this story.

Step 5: Going the extra mile - automated builds, perturbation analyses and more

There are many more chapters to this never-ending story. Stay tuned for more ways to enhance your reproducibility and efficiency. Chapters will include: automated builds from GitHub; testing different base systems to see whether updating to a newer base is possible; and many more.

What did this cost me?

In the long term, this did not cost you much: all the steps covered here are things you already had to do anyway. You procured a computer, you installed a base operating system, you installed a bunch of software tools you needed and solved their dependencies, you developed a processing script, and you used that script. The cost, in this example, of retrospectively applying these procedures lies in remembering what you had to do in order to make this work, which could span a development timeline of multiple years. And you had to learn about containerization, Docker specifically in this case, and a new tool to help you perform containerization (NeuroDocker, in this case).

What have I gained?

Going forward, building each of your specific processing workflows with a complete enumeration of all the details necessary for its implementation greatly facilitates your own reuse of the workflow. You gain clarity, reproducibility and complete ‘describability’ for your ‘future self’ and for the readers of your wonderful manuscripts, and you provide a foundation upon which your results and conclusions can more seamlessly fit into the fabric of emerging scientific knowledge. Also, your shared container is a scientific product of your research efforts, and is itself reusable, citable, and can be a source of scientific productivity to advance your research career. You can’t get any of that from a script sitting on your lab computer…

Conclusion