Posted on May 4, 2020
Reproducible research with Overleaf and Gigantum
This is a guest blog post written by Dean Kleissas, CTO at Gigantum.
Quick look: Gigantum’s gigaleaf package automatically updates Overleaf with your latest results, keeping your paper in sync with your research.
If you are reading this you probably know how to use Overleaf, but you might not have heard about Gigantum. So let’s start with a quick introduction to Gigantum and the problems that it solves.
Data science and machine learning projects should be transparent, reproducible, and portable (i.e., easy to move and run on different computers). These properties are essential for understanding results, collaborating, and optimizing cost and effort. If neglected, work often ends up anchored to a single person or computer, resulting in something that is unreliable and hard to share.
Sharing a data science project isn’t as simple as dumping your Jupyter notebooks on GitHub and calling it a day. You must instead bundle together all of your code, data, environment configuration, and a complete work history. A skilled and diligent person may be able to do this well, but for many it can be too hard and time consuming.
So what if there were a way to automatically version and package everything you do, providing a complete and ready-to-run record of your project? There is, and that’s why we built Gigantum.
What is Gigantum?
Gigantum is a platform for the creation and exchange of reproducible data science. It has two main components: Gigantum Client and Gigantum Hub.
The MIT-licensed Gigantum Client is free to install and runs almost anywhere. It automates things like Git and Docker containers to version and bundle your work so you don't have to. The Client makes running managed and containerized Jupyter or RStudio projects super easy and reproducible, even on your laptop.
Gigantum Hub is a cloud-based service for backing up, sharing, and moving your work. You can sync projects to and from the Hub to share with others or move between different compute resources.
Why integrate Overleaf and Gigantum?
The most obvious reason is to make communicating reproducible research easier. It is common to use LaTeX to author a paper or presentation that is backed by results from a data science project. Overleaf and Gigantum do similar things for these different, but related, research activities.
Overleaf is an extremely popular tool for creating LaTeX documents. Among its many benefits, Overleaf enables collaborative authorship with zero configuration required. For paid accounts, it also provides a full version history and syncing with Git-based repositories.
Gigantum provides similar benefits, except for your data science work. It manages complex compute environments for you and versions every result. Inserting these results, such as a figure or table, in your LaTeX document is typically a manual export and upload process. But if you could link outputs (e.g., saved images and CSV files) to an Overleaf project, the results in your paper would have a fully reproducible provenance trail. Update your analysis and your paper is automatically updated and correct!
Another clear use case is saving time. If you have a LaTeX document with many figures or tables, updating everything yourself can take a lot of effort and introduce errors. We are sure there are other interesting use cases, and we’d love to hear on Twitter or in our forum how you use these tools together!
How does it work?
To test this idea we built a proof-of-concept integration through a Python package called gigaleaf. This package provides a way to link and update an Overleaf project with files generated in a Gigantum Project. gigaleaf uses Git syncing in Overleaf (via Overleaf’s Git bridge feature), so the Overleaf project owner must have a paid account. In the rest of this post we’ll walk through a short example of gigaleaf in action!
To start, we’ll assume we have a Gigantum Project that contains a Jupyter Notebook that generates a figure and a table (you can learn more about creating Projects here). We will then use gigaleaf to automatically update our Overleaf document with these results.
First, we need to add gigaleaf to the Project’s environment using pip. In the Gigantum Client, use the package manager widget on the Environment tab of the Project, as shown below.
Next we need to configure gigaleaf to connect the Overleaf and Gigantum projects. Open the Overleaf project and copy the Overleaf Git URL from the Git sync pop-up as shown in the diagram below.
Now open a Jupyter notebook in the Gigantum Project and create an instance of the Gigaleaf class. As shown in the video clip below, you will be prompted to set up the integration when no existing configuration is detected. You must provide the Overleaf project’s Git URL, as copied above, along with your Overleaf username and password. These will be used to clone the Overleaf project and automatically update Overleaf for you.
After connecting the two projects, you must tell gigaleaf which files to track. A Gigantum Project organizes files into three folders: code, input, and output. The example Project saves a figure as a PNG file and a table as a CSV file to the Project’s output directory. We can track these files via gigaleaf’s link methods, such as link_csv(), and they will be automatically updated in the Overleaf project.
Finally, run the sync() method to update Overleaf. This will check for any file or metadata changes and then synchronize everything for you. Run sync() any time results have changed and you want to update the Overleaf project.
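Put together, the steps above can be sketched in a few lines of Python. This is only a rough outline, not a substitute for the gigaleaf README: the helper name update_overleaf is hypothetical, the import path in the usage comment is an assumption, and so is the idea that link_csv() takes the file's path as its first argument. Only the Gigaleaf class, link_csv(), and sync() are named in this post.

```python
def update_overleaf(gl, csv_paths):
    """Link each CSV file for tracking, then push changes to Overleaf.

    `gl` is expected to be a Gigaleaf instance (or any object that
    provides link_csv() and sync()).
    """
    for path in csv_paths:
        # Assumption: link_csv() takes the file's path as its first argument.
        gl.link_csv(path)
    # sync() checks for file or metadata changes and updates Overleaf.
    gl.sync()


# Hypothetical usage inside a notebook in the Gigantum Project:
#
#   from gigaleaf import Gigaleaf   # assumed import path
#   gl = Gigaleaf()                 # prompts for setup on first use
#   update_overleaf(gl, ["../output/table.csv"])
```

The helper takes the gigaleaf object as an argument so the same cell can be re-run at the end of a notebook whenever results change. The file name table.csv and the relative path are placeholders.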
The following video shows how gigaleaf detects changes and automatically updates Overleaf.
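To give a feel for what detecting changes can mean in practice, here is a minimal sketch of one common approach: comparing content hashes of tracked files between runs. This post does not describe how gigaleaf does it internally, so treat this as an illustration of the general idea only. The function names and the `previous` dictionary are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, previous):
    """Return the paths whose contents differ from the last recorded digest.

    `previous` maps path strings to digests and is updated in place,
    so a second call with unchanged files returns an empty list.
    """
    changed = []
    for path in paths:
        key = str(path)
        digest = file_digest(path)
        if previous.get(key) != digest:
            changed.append(key)
            previous[key] = digest
    return changed
```

With a scheme like this, only files that were actually regenerated with different contents would need to be pushed on the next sync.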
To make this process even easier, gigaleaf generates LaTeX subfiles that you can use in the main.tex file of your Overleaf project. Notice how in the video above, the figure and dataframe (in Gigantum on the left) appear in the LaTeX document as a captioned figure and table (in Overleaf on the right). Since the subfiles are programmatically generated, whenever you make updates, including metadata such as captions, the changes are seamlessly reflected in Overleaf.
The following image shows an Overleaf LaTeX code fragment using subfiles.
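For readers who have not used the LaTeX subfiles package, here is a rough sketch of how a subfile for a figure could be generated programmatically, and what the matching line in main.tex might look like. The function names, file paths, figure width, and exact layout are assumptions made for illustration. The subfiles gigaleaf actually writes may look different.

```python
def render_figure_subfile(image_path, caption, label, main_file="../main.tex"):
    """Build the text of a LaTeX subfile that wraps one captioned figure."""
    return "\n".join([
        # The optional argument points back to the main document.
        rf"\documentclass[{main_file}]{{subfiles}}",
        r"\begin{document}",
        r"\begin{figure}[ht]",
        r"\centering",
        rf"\includegraphics[width=0.5\textwidth]{{{image_path}}}",
        rf"\caption{{{caption}}}",
        rf"\label{{{label}}}",
        r"\end{figure}",
        r"\end{document}",
    ])


def include_line(subfile_path):
    """Return the line main.tex would use to pull in a subfile."""
    return rf"\subfile{{{subfile_path}}}"
```

For this to compile, main.tex would need \usepackage{subfiles} and \usepackage{graphicx} in its preamble. Because the caption and label are just arguments, regenerating the subfile is enough to change them in the paper without editing main.tex.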
If you want to try this out yourself, then check out the gigaleaf README for detailed instructions on how to install, configure, and use the library. The example Gigantum Project from this post is published without gigaleaf configured, so you can copy it as a starting point. The completed Overleaf project is also available to inspect.
If people find gigaleaf useful, the team at Gigantum would love to build similar functionality directly into the Gigantum Client. We would appreciate your feedback: What could we do to make this better? How would you use it? What’s missing? If you have suggestions, ideas, or feedback please contact us via our forum or create an issue in the gigaleaf GitHub repository. We look forward to seeing what you create!
Finally, a very special thanks to Overleaf for building an awesome product and supporting this post.
Gigantum is a system-of-record for the transfer and exchange of data science and machine learning projects. Our open-core model and decentralized architecture provide a fundamentally new approach to computational reproducibility and collaboration. Gigantum strives to make something that works everywhere, for everybody, and on every budget. Founded in 2017, Gigantum, Inc. is based in Washington, DC.
Disclaimer: Overleaf and Gigantum are part of Digital Science.