was published last week, on which yours truly is a co-author.
Its abstract (emphasis added):
Ongoing technological developments have made it easier than ever before for scientists to share their data, materials, and analysis code. Sharing data and analysis code makes it easier for other researchers to reuse or check published research. However, these benefits will emerge only if researchers can reproduce the analyses reported in published articles and if data are annotated well enough so that it is clear what all variable and value labels mean. Because most researchers are not trained in computational reproducibility, it is important to evaluate current practices to identify those that can be improved. We examined data and code sharing for Registered Reports published in the psychological literature from 2014 to 2018 and attempted to independently computationally reproduce the main results in each article. Of the 62 articles that met our inclusion criteria, 41 had data available, and 37 had analysis scripts available. Both data and code for 36 of the articles were shared. We could run the scripts for 31 analyses, and we reproduced the main results for 21 articles. Although the percentage of articles for which both data and code were shared (36 out of 62, or 58%) and the percentage of articles for which main results could be computationally reproduced (21 out of 36, or 58%) were relatively high compared with the percentages found in other studies, there is clear room for improvement. We provide practical recommendations based on our observations and cite examples of good research practices in the studies whose main results we reproduced.
Here's how my involvement came about: Daniel Lakens submitted a compute capsule to Code Ocean last year that reproduced the article's results; as Code Ocean's then Developer Advocate, I verified the project's computational reproducibility before it was published. I then wrote to Daniel to volunteer to reproduce the results of papers whose code was in MATLAB or Python (and, it turned out, Julia) – and the team graciously added me as a co-author in exchange for my doing so.
One thing I like about this paper is that its GitHub page effectively records a lot of our conversations about the piece as pull requests and commits. If this becomes common in academia, you could scrape data in this category and investigate, e.g., whether there's a strong relationship between the quantity (or quality?) of commits and author order (in disciplines where author order matters).
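As a toy sketch of that idea: tally commits per author from `git log --format=%an` output, then rank-correlate commit counts against byline position. Everything here is illustrative – the author names are made up, and Spearman correlation is just one plausible choice of measure, not anything from the paper.

```python
from collections import Counter

def commit_counts(log_authors):
    """Tally commits per author from a list of `git log --format=%an` lines."""
    return Counter(a.strip() for a in log_authors if a.strip())

def rank(values):
    """Rank values (1 = largest); ties broken by order of appearance."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos
    return ranks

def spearman(x, y):
    """Spearman rank correlation (no tie correction) between two sequences."""
    n = len(x)
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical byline and git history for one paper:
byline = ["Ada", "Ben", "Cam"]                    # first author listed first
log = ["Ada", "Ada", "Ben", "Ada", "Cam", "Ben"]  # one line per commit
counts = commit_counts(log)
per_author = [counts[a] for a in byline]          # commit counts in byline order
# Score byline position so that earlier authors get higher scores,
# then ask: do earlier authors commit more?
rho = spearman(per_author, list(range(len(byline), 0, -1)))
```

In this toy history the first author commits most and the last author least, so the correlation comes out perfectly positive; real repositories would of course be messier, and "quality" of commits would need a measure of its own.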
Happy to discuss further – I’m sag2212 at columbia dot edu.