Walk the Walk: Reproducibility All the Way Down

When doing science and publishing papers about it, we hope for excellent reproducibility. Whether or not anyone actually tries to confirm or disprove a given result, the core goal of a peer-reviewed, published paper is, ostensibly, to enable as complete a recreation as possible of the work the paper covers.

When the paper in question is itself a reproducibility study, this, if anything, sets the bar even higher. A confirmation has, whether it should or not, a synergistic effect with the original paper, cementing its result in the minds of the scientific community. On the other hand, a failure to reproduce can, depending on the impact of the original, have far-reaching repercussions, as it calls into question (if not directly disproves) the entire branch of work that may have grown from it.

Thus, it is reasonable to expect that a reproducibility study 'walk the walk' and itself be highly, and hopefully easily, reproducible.

At the recent Curry On 2019 conference in London, Jan Vitek's talk regarding a reproducibility study caught our interest:

This study, On the Impact of Programming Languages on Code Quality by Berger et al., examines the 2014 paper A Large Scale Study of Programming Languages and Code Quality in GitHub by Ray et al., which found statistically significant differences in the prevalence of 'defects' in code between 17 languages, studied across 729 GitHub projects. Berger et al.'s attempt to reproduce that work ran into a number of issues, including methodological and labeling problems; once these were corrected to the best of their abilities, most of the original results were invalidated and the significance of the remainder was greatly reduced.

Now, the core topic itself is quite interesting, and the original study had a fairly large impact. Whether there are truly better languages, if it is even possible to scientifically determine a real answer to that, is a potentially important question, both academically and economically. It also speaks to people's opinions, and so the reception of research pointing one way or another could easily be affected by confirmation bias.

Aside from the topic, the fact that Berger et al. provide a GitHub repo for their analysis promised an interesting case study for Nextjournal. Reproducibility is, of course, one of the core goals of the Nextjournal platform: to pin down as many factors as possible in a given notebook's software stack, making it possible to repeat an analysis instantly while also lending transparency to the work. So, we took a look.

Essentially, the repo held the makings of two notebooks for us: a methods notebook containing the details of the analysis itself, and a publication-quality results paper, a PDF generated with LaTeX from some of the output of the analysis.

Reproducing the PDF Paper

As a first level, we set out to rebuild the publishable PDF. Because this implicitly reruns the R code within a set of .Rmd files, it also confirms that the analysis itself is runnable, even though that work stays hidden in the background.

The Nextjournal reproduction at this level takes over some of the functions of the Makefile:

  1. An environment is built for the repo, installing a few missing system packages and checking the required R packages in setup_r.R for any we are missing (see the sketch after this list).
  2. The large artifact.tgz file (containing the original study's data) is downloaded into storage.
  3. The analysis is performed by running the Makefile's artifact target, building the .Rmd files into HTML documents and producing all required images.
  4. The code, LaTeX sources, and results—excluding data files—are then archived into storage.
  5. Finally, the archive is extracted in our complete LaTeX environment, and the final PDF is built.
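As a rough illustration of steps 1 and 3, the sketch below checks for any missing R packages and then knits each R Markdown file to HTML. This is a hypothetical sketch rather than the repo's actual setup_r.R or Makefile logic: the package list and the file discovery are assumptions made here for illustration.

    # Hypothetical sketch, not the repository's setup_r.R: make sure the R
    # packages the analysis needs are present, then knit each .Rmd to HTML,
    # roughly what the Makefile's artifact target drives for us.
    required <- c("dplyr", "ggplot2", "knitr", "rmarkdown")  # assumed package list
    missing  <- setdiff(required, rownames(installed.packages()))
    if (length(missing) > 0) {
      install.packages(missing)
    }

    # Render every .Rmd in the working directory to an HTML document.
    for (rmd in list.files(pattern = "\\.Rmd$")) {
      rmarkdown::render(rmd, output_format = "html_document")
    }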

The result does differ somewhat from the PDF on arXiv.org; presumably, the changes are those mentioned in the Curry On talk, made in response to reviewer comments after the paper was submitted.

Reproducing the Methods in a Notebook

That's all well and good, but we wanted to get the code behind that make artifact command out into the open. This presented a challenge, as the code and comments for the analysis are spread across seven R Markdown files, and Nextjournal's importer turns one file into one notebook.

Thus, the first step was to combine all seven files into one. The issues encountered in this process were:

  • Some repeated empty lines would show up in the output.
  • The YAML header of each file contained a title, which would get lost in the concatenation and so had to be converted into a first-level header. This meant that every existing header needed to be demoted by one level, gaining one more #, while leaving code comments alone.
  • Extra options in the code blocks' language directives could confuse the Nextjournal importer, making it skip cells.
  • The presence of Bash code before R code revealed an error in the import logic—such a configuration leads to most cells in the import result being broken.

This notebook contains the code to work around all of these issues and combine the cleaned files (plus a small header text) into one .Rmd file; a rough sketch of the idea follows.
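To give a flavor of what that cleanup involves, here is a minimal sketch in R under some simplifying assumptions: the file list, the presence of standard YAML front matter in each file, the small "# Combined analysis" header, and the exact cleanup rules are all placeholders for illustration, not the notebook's actual code.

    # Hypothetical merge sketch: concatenate the analysis .Rmd files, turn each
    # file's YAML title into a first-level header, demote existing headers,
    # simplify chunk directives, and collapse repeated blank lines.
    files <- list.files(pattern = "\\.Rmd$")   # assumed: the seven analysis files
    out <- c("# Combined analysis", "")        # assumed small header text

    for (f in files) {
      lines <- readLines(f)

      # Assumes standard YAML front matter delimited by a pair of '---' lines.
      yaml_end <- which(lines == "---")[2]
      title <- sub('^title:\\s*"?([^"]*)"?\\s*$', "\\1",
                   grep("^title:", lines[1:yaml_end], value = TRUE)[1])
      body <- lines[(yaml_end + 1):length(lines)]

      # Demote markdown headers by one level, but only outside code chunks,
      # so comments in R or Bash cells are left alone.
      in_chunk <- FALSE
      body <- vapply(body, function(line) {
        if (grepl("^```", line)) in_chunk <<- !in_chunk
        if (!in_chunk && grepl("^#", line)) line <- paste0("#", line)
        # Strip extra chunk options that can confuse the importer.
        if (grepl("^```\\{r", line)) line <- "```{r}"
        line
      }, character(1), USE.NAMES = FALSE)

      out <- c(out, paste("#", title), "", body, "")
    }

    # Collapse runs of empty lines into single blanks.
    blank <- out == ""
    out <- out[!(blank & c(FALSE, head(blank, -1)))]

    writeLines(out, "combined.Rmd")

The fence tracking is the important detail: only prose headers gain the extra #, so comment lines inside code cells stay untouched.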

With this in hand, it turned out we were basically finished. Importing the file into Nextjournal yields a large, cleanly formatted methods notebook. An environment was created, much like the one for the paper-building notebook, but this time we chose to include the full repo (with data) in it. In this environment, the notebook runs cleanly except for one font error in a ggsave() call, which prevents one figure from being saved as a PDF.
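For what it's worth, this sort of font error from the default pdf() device can often be sidestepped by pointing ggsave() at the Cairo PDF device, which does its own font handling. A minimal, hypothetical sketch with a throwaway plot rather than the study's actual figure:

    library(ggplot2)

    # Common workaround for missing-font errors when saving PDFs: render the
    # figure through cairo_pdf (from base grDevices) instead of the default
    # pdf() device. The plot and file name here are placeholders.
    p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
    ggsave("figure.pdf", plot = p, device = cairo_pdf, width = 6, height = 4)

Whether that is the right fix for this particular figure we did not verify, but it is the usual first thing to try.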

Conclusion

It is clear that this reproducibility study does indeed 'walk the walk': its analysis, all the way from source data to results, is very easily repeated; its code is well documented, leaving open the potential for the 'higher-level' reproducibility mentioned in the talk; and as a bonus it can build all the way to the publishable paper.

Of course, it is regrettable that the original study did not have a repository this extensive and well-conceived, given that higher-level reproducibility was found to be impossible. As the reasons listed include data discrepancies and missing code, a complete, versioned repository—or dare we suggest a Nextjournal notebook—would have made a reproducibility study easier, and perhaps shed additional light on the problems. Or, perhaps, with complete, runnable, and transparent code, the original study could have been improved—or disproved—before it was published at all.