Awesome Reproducible Research

A curated list of reproducible research case studies, projects, tutorials, and media

Contents

Case studies

The term "case studies" is used here in a general sense to describe any study of reproducibility. A reproduction is an attempt to arrive at comparable results with identical data using computational methods described in a paper. A refactor restructures existing code to follow frameworks and other reproducibility best practices while preserving the original data. A replication involves generating new data and applying existing methods to achieve comparable results. A robustness test applies various protocols, workflows, statistical models or parameters to a given data set to study their effect on results, either as a follow-up to an existing study or as a "bake-off". A census is a high-level tabulation conducted by a third party. A survey is a questionnaire sent to practitioners. A case narrative is an in-depth first-person account. An independent discussion uses a second, independent author to interpret the results of a study as a means of improving inferential reproducibility.

| Study | Field | Approach | Size |
| --- | --- | --- | --- |
| Glasziou et al 2008 | Medicine | Census | 80 studies |
| Baggerly & Coombes 2009 | Cancer biology | Refactor | 8 studies |
| Hothorn et al. 2009 | Biostatistics | Census | 56 studies |
| Ioannidis et al 2009 | Genetics | Reproduction | 18 studies |
| Anda et al 2009 | Software engineering | Replication | 4 companies |
| Vandewalle et al 2009 | Signal processing | Census | 134 papers |
| Prinz 2011 | Biomedical sciences | Survey | 23 PIs |
| Hothorn & Leisch 2011 | Bioinformatics | Census | 100 studies |
| Begley & Ellis 2012 | Cancer biology | Replication | 53 studies |
| Collberg et al 2014, Collberg & Proebsting 2016 | Computer science | Census | 613 papers |
| OSC 2015 | Psychology | Replication | 100 studies |
| Bandrowski et al 2015 | Biomedical sciences | Census | 100 papers |
| Patel et al 2015 | Epidemiology | Robustness test | 417 variables |
| Chang et al 2015 | Economics | Reproduction | 67 papers |
| Iqbal et al 2016 | Biomedical sciences | Census | 441 papers |
| Baker 2016 | Science | Survey | 1,576 researchers |
| Névéol et al 2016 | NLP | Replication | 3 studies |
| Reproducibility Project 2017 | Cancer biology | Replication | 9 studies |
| Vasilevsky et al 2017 | Biomedical sciences | Census | 318 journals |
| Kitzes et al 2017 | Science | Case narrative | 31 PIs |
| Barone et al 2017 | Biological sciences | Survey | 704 PIs |
| Kim & Dumas 2017 | Bioinformatics | Refactor | 1 study |
| Camerer 2017 | Economics | Replication | 18 studies |
| Olorisade 2017 | Machine learning | Census | 30 studies |
| Strupler & Wilkinson 2017 | Archaeology | Case narrative | 1 survey |
| Danchev et al 2017 | Comparative toxicogenomics | Census | 51,292 claims in 3,363 papers |
| Kjensmo & Gundersen 2018 | Artificial intelligence | Census | 400 papers |
| Gertler et al 2018 | Economics | Census | 203 papers |
| Stodden et al 2018 | Computational science | Reproduction | 204 papers, 180 authors |
| Madduri et al 2018 | Genomics | Case narrative | 1 study |
| Camerer et al 2018 | Social sciences | Replication | 21 papers |
| Silberzahn et al 2018 | Psychology | Robustness test | One data set, 29 analyst teams |
| Boulesteix et al 2018 | Medicine and health sciences | Census | 30 papers |
| Eaton et al 2018 | Microbiome immuno oncology | Replication | 1 paper |
| Vaquero-Garcia et al 2018 | Bioinformatics | Refactor and test of robustness | 1 paper |
| Wallach et al 2018 | Biomedical Sciences | Census | 149 papers |
| Miller et al 2018 | Bioinformatics | Synthetic replication & refactor | 1 paper |
| Konkol et al 2018 | Geosciences | Survey, Reproduction | 146 scientists, 41 papers |
| Rahtz 2018 | Reinforcement Learning | Reproduction, case narrative | 1 paper |
| Stodden et al 2018 | Computational physics | Census | 306 papers |
| AlNoamany & Borghi 2018 | Science & Engineering | Survey | 215 participants |
| Li et al 2018 | Nephrology | Robustness test | 1 paper |
| Chen 2018 | Social sciences & other | Census | 810 Dataverse studies |
| Trisovic et al 2021 | Social sciences & other | Census, Survey | 2109 replication datasets |
| Nüst et al 2018 | GIScience/Geoinformatics | Census, Survey | 32 papers, 22 participants |
| Raman et al 2018 | Genomics | Robustness test | 8 studies |
| Stagge et al 2019 | Geosciences | Survey | 360 papers |
| Bizzego et al 2019 | Deep learning | Robustness test | 1 analysis |
| Madduri et al 2019 | Genomics | Case narrative | 1 analysis |
| Mammoliti et al 2019 | Pharmacogenomics | Case narrative | 2 analyses |
| Allen & Mehler 2019 | Biomedical sciences and Psychology | Census | 127 registered reports |
| Pimentel et al 2019 | All | Census | 1,159,166 Jupyter notebooks |
| Fergusson et al 2019 | Virology | Census | 236 papers |
| Vlisides et al 2019, Sieber et al 2019 | Anaesthesia | Independent discussion | 1 study |
| Bakker et al 2019 | Psychology | Replication | 1 paper |
| Niepel et al 2019 | Cell pharmacology | Robustness test | 5 labs |
| Dacrema et al 2019 | Machine learning | Reproduction | 18 conference papers |
| Eran et al 2019 | Experimental archaeology | Replication | 1 theory |
| Rauh et al 2019 | Neurology | Census | 202 papers |
| Sætrevik & Sjåstad 2019 | Psychology | Replication | 2 experiments |
| Feng et al. 2019 | Ecology and Evolution | Census | 163 papers |
| Botvinik-Nezer et al. 2019 | Neuroimaging | Robustness test | 1 data set, 70 teams |
| Klein et al. 2019 | Psychology | Replication | 1 experiment, 21 labs, 2,220 participants |
| Obels et al. 2019 | Psychology | Census | 62 papers |
| Wayant et al 2019 | Oncology | Census | 154 meta-analyses |
| Simoneau et al. 2020 | Bioinformatics | Robustness test | 1 data set |
| Miyakawa 2020 | Neurobiology | Census | 41 papers |
| Thelwall et al 2020 | Genetics | Census | 1799 papers |
| Maassen et al 2020 | Psychology | Reproduction | 33 meta-analyses |
| Riedel et al 2020 | Biomedical science | Census | 792 papers |
| Culina et al 2020 | Ecology | Census | 346 papers |
| Clementi & Barba 2020 | Physics | Replication | 2 papers |
| Kemper et al 2020 | Reproductive endocrinology | Census | 222 papers |
| Marqués et al 2020 | Biomedical sciences | Census | 240 papers |
| Janssen et al 2020 | Environmental Modelling | Census | 7500 papers |
| Anderson et al 2020 | Cardiology | Census | 532 papers |
| Ostermann et al 2020 | GIS | Census | 75 papers |
| Samota & Davey 2020 | Life sciences | Survey | 251 researchers |
| Bedford & Tzovaras 2020 | Genetics | Robustness test | 1 paper |
| Krassowski et al 2020 (repo) | Life sciences | Census | 3377 articles |
| Boudreau et al 2021 | Computational Biology | Census | 622 papers |
| Heumos et al 2021 | Computational Biology | Robustness test | 6 studies |
| Hrynaszkiewicz et al 2021 | Computational Biology | Survey | 214 researchers |
| Päll et al 2021 | Differential expression | Census | 2109 GEO submissions |

Ad-hoc reproductions

These are one-off unpublished attempts to reproduce individual studies

| Reproduction | Original study |
| --- | --- |
| https://rdoodles.rbind.io/2019/06/reanalyzing-data-from-human-gut-microbiota-from-autism-spectrum-disorder-promote-behavioral-symptoms-in-mice/ and https://notstatschat.rbind.io/2019/06/16/analysing-the-mouse-autism-data/ | Sharon, G. et al. Human Gut Microbiota from Autism Spectrum Disorder Promote Behavioral Symptoms in Mice. Cell 2019, 177 (6), 1600–1618.e17. |
| https://github.com/sean-harrison-bristol/CCR5_replication | Wei, X.; Nielsen, R. CCR5-∆32 Is Deleterious in the Homozygous State in Humans. Nat. Med. 2019 DOI: 10.1038/s41591-019-0459-6. (retracted) |
| https://github.com/leipzig/placenta | Leiby et al "Lack of detection of a human placenta microbiome in samples from preterm and term deliveries" https://doi.org/10.1186/s40168-018-0575-4 |

Theory papers

| Authors/Date | Title | Field | Type |
| --- | --- | --- | --- |
| Ioannidis 2005 | Why most published research findings are false | Science | Statistical reproducibility |
| Noble 2005 | A Quick Guide to Organizing Computational Biology Projects | Bioinformatics | Best practices |
| Sandve et al 2013 | Ten Simple Rules for Reproducible Computational Research | Computational science | Best practices |
| Freedman et al 2015 | The Economics of Reproducibility in Preclinical Research | Preclinical research | Best practices |
| Yarkoni 2019 | The Generalizability Crisis | Psychology | Statistical reproducibility |
| Bouthillier et al 2019 | Unreproducible Research is Reproducible | Machine Learning | Methodology |
| Milton & Possolo 2019 | Trustworthy data underpin reproducible research | Physics | Scientific philosophy |
| Devezer et al 2019 | Scientific discovery in a model-centric framework: Reproducibility, innovation, and epistemic diversity | Science | Statistical reproducibility |
| Tierney et al 2020 | A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility | Science | Best practices |
| Haibe-Kains et al 2020 | The importance of transparency and reproducibility in artificial intelligence research | Artificial Intelligence | Critique |
| Nosek & Errington 2020 | What is replication? | Science | Scientific philosophy |
| Alston & Rick 2020 | A Beginner’s Guide to Conducting Reproducible Research | Ecology | Best Practices |
| Hejblum et al 2020 | Realistic and Robust Reproducible Research for Biostatistics | Biostatistics | Best practices |
| Pawlik et al 2019 | A Link is not Enough – Reproducibility of Data | Databases | Best practices |
| Schriml et al 2020 | COVID-19 pandemic reveals the peril of ignoring metadata standards | Virology | Critique |
| Stoudt et al 2020 | Principles for data analysis workflows | Data science | Best practices |
| Peng & Hicks 2020 | Reproducible Research: A Retrospective | Public health | Review |
| Reiter et al 2020 | Streamlining Data-Intensive Biology With Workflow Systems | Biology | Best practices |
| Ulrich & Miller 2020 | Meta Research: Questionable research practices may have little effect on replicability | Science | Statistical reproducibility |
| Kasif & Roberts 2020 | We need to keep a reproducible trace of facts, predictions, and hypotheses from gene to function in the era of big data | Functional genomics | Critique |
| Raman 2021 | A research parasite's perspective on establishing a baseline to avoid errors in secondary analyses | Science | Best practices |
| Hoffmann et al 2021 | The multiplicity of analysis strategies jeopardizes replicability: lessons learned across disciplines | Science | Critique |

Tool reviews

| Authors/Date | Title | Tools |
| --- | --- | --- |
| Isdahl & Gundersen 2019 | Out-of-the-box Reproducibility: A Survey of Machine Learning Platforms | MLflow, Polyaxon, StudioML, Kubeflow, CometML, Sagemaker, GCPML, AzureML, Floydhub, BEAT, Codalab, Kaggle |
| Pimentel et al 2019 | A Survey on Collecting, Managing, and Analyzing Provenance from Scripts | Astro-Wise, CPL, CXXR, Datatrack, ES3, ESSW, IncPy, Lancet, Magni, noWorkflow, Provenance Curios, pypet, RDataTracker, Sacred, SisGExp, SPADE, StarFlow, Sumatra, Variolite, VCR, versuchung, WISE, YesWorkflow |
| Leipzig et al 2019 (supplemental) | The Role of Metadata in Reproducible Computational Research | CellML, CIF2, DATS, DICOM, EML, FAANG, GBIF, GO, ISO/TC 276, MIAME, NetCDF, OGC, ThermoML, CRAN, Conda, pip setup.cfg, EDAM, CodeMeta, Biotoolsxsd, DOAP, ontosoft, SWO, OBCS, STATO, SDMX, DDI, MEX, MLSchema, MLFlow, Rmd, CWL, CWLProv, RO-Crate, RO, WICUS, OPM, PROV-O, ReproZip, ProvOne, WES, BagIt, BCO, ERC, BEL, DC, JATS, ONIX, MeSH, LCSH, MP, Open PHACTS, SWAN, SPAR, PWO, PAV, Manubot, ReScience, PandocScholar |
| Konkol, Markus, Nüst, Daniel, Goulier, Laura 2020 | Publishing computational research - a review of infrastructures for reproducible and transparent scholarly communication | Authorea, Binder, CodeOcean, eLife RDS, Galaxy Project, Gigantum, Manuscript, o2r, REANA, ReproZip, Whole Tale |

Courses

Development Resources

  • R
  • Python
    • mlf-core - Framework to develop GPU deterministic machine learning models with PyTorch, TensorFlow and XGBoost

User tools

  • Open With Binder for Chrome or Firefox - open the GitHub repository you are visiting using MyBinder.org
  • DVC - DVC tracks machine learning models and data sets
  • SciScore - SciScore evaluates methods sections against a variety of rigor criteria, analyzes sentences that mention research resources (antibodies, cell lines, plasmids and software tools), and determines how uniquely identifiable each resource is based on the provided metadata.
  • Ripeta - Ripeta quickly scans research manuscripts or articles to identify and record key reproducibility variables, such as data availability, code acknowledgements, and research analysis methods.

Books

Databases

  • ReplicationWiki - Database for empirical studies with information about methods, data and software used, availability of replication material and whether replications, corrections or retractions are known. Mostly focused on social sciences.
  • ReproCrawl

Data Repositories

All these repositories assign Digital Object Identifiers (DOIs) to data

  • DataCite - 12M+ DOIs registered for 46 allocators. Offers APIs and a metadata schema.
  • Data Dryad - curated, metadata-centric, focused on data associated with published articles, $120 submission fee (various waivers available)
  • Figshare - 20 GB of free private space, unlimited public space, >2M articles, >5k projects
  • OSF - Project-oriented system with access control and integration with popular tools. Unlimited storage for projects, but individual files are limited to 5 gigabytes (GB) each.
  • Zenodo - Supports embargoed and restricted access, plus metadata. 50 GB limit.
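
Because every repository above hands out DOIs, a dataset reference can be stored as a bare DOI and turned into a resolvable link on demand. The following is a minimal sketch, assuming the common `10.<registrant>/<suffix>` shape; the function name `doi_to_url` and the regular expression are illustrative choices, not part of any repository's API, and the pattern is a pragmatic check rather than a full validation of DOI syntax.

```python
import re

# Pragmatic DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Assumption: this covers modern DOIs; it is not a complete syntax validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste in front of a DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver URL for a DOI string."""
    cleaned = doi.strip()
    # Drop a leading "doi:" or resolver URL so all input forms are accepted.
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    return "https://doi.org/" + cleaned


# DOI taken from the ad-hoc reproductions listed earlier in this document.
print(doi_to_url("doi:10.1186/s40168-018-0575-4"))
```

Storing the bare DOI rather than a repository-specific URL keeps the reference valid even if the hosting repository reorganizes its landing pages.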

Exemplar Portals

Places to find papers with code or portals to host them

  • Jupyter Gallery - Gallery of interesting Jupyter notebooks
  • Papers With Code - ML papers with code
  • NARPS - Code related to Neuroimaging Analysis Replication and Prediction Study
  • Codeocean - A gallery of cloud-based containers with reproducible analyses

Runnable Papers

Experimental papers that have associated notebooks

Haibe-Kains lab

| Publication | CodeOcean link |
| --- | --- |
| Mer AS et al. Integrative Pharmacogenomics Analysis of Patient Derived Xenografts | codeocean.com/capsule/056639 |
| Gendoo, Zon et al. MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature | codeocean.com/capsule/643863 |
| Yao et al. Tissue specificity of in vitro drug sensitivity | codeocean.com/capsule/550275 |
| Safikhani Z et al. Gene isoforms as expression-based biomarkers predictive of drug response in vitro | codeocean.com/capsule/000290 |
| El-Hachem et al. Integrative cancer pharmacogenomics to infer large-scale drug taxonomy | codeocean.com/capsule/425224 |
| Safikhani Z et al. Revisiting inconsistency in large pharmacogenomic studies | codeocean.com/capsule/627606 |
| Sandhu V et al. Meta-analysis of 1,200 transcriptomic profiles identifies a prognostic model for pancreatic ductal adenocarcinoma | codeocean.com/capsule/269362 |

Pachter lab

| Publication | GitHub link |
| --- | --- |
| Pimentel et al 2017. Differential analysis of RNA-seq incorporating quantification uncertainty | sleuth_paper_analysis |
| Melsted et al 2019. Modular and efficient pre-processing of single-cell RNA-seq | MBGBLHGP_2019 |
| Chari et al 2021. Whole Animal Multiplexed Single-Cell RNA-Seq Reveals Plasticity of Clytia Medusa Cell Types | CWGFLHGCCHAP_2021 |

Siepel lab

Blumberg et al 2021. Characterizing RNA stability genome-wide through combined analysis of PRO-seq and RNA-seq data https://codeocean.com/capsule/7351682

Journals

  • ReScience - Journal dedicated to in silico reproductions and tests of robustness, lives on GitHub.
  • eLife - Executable Research Articles (ERA) with inline executable code blocks

Ontologies

Minimal Standards

  • STORMS - Strengthening The Organization and Reporting of Microbiome Studies (STORMS) is a checklist for reporting on human microbiome studies. Preprint

Organizations

  • ResearchObject.org - RO specifications and publications
  • BioCompute - BCO specs
  • rOpenSci - Tools, conferences, and education
  • Open Science Framework - Open source project management
  • pyOpenSci - Promotes open and reproducible research through peer-review of scientific Python packages
  • Replication Network - Furthering the practice of replication in economics. Econ replication database.
  • repliCATS project - Estimating the replicability of research in the social sciences. Paper
  • ReproHack - 1-day reproducibility hackathons held worldwide
  • CODECHECK - community for checking executability of scientific preprints and papers
  • CASCaD - Certification Agency for Scientific Code and Data. Issues reproducibility certificates.

Awesome Lists

Contribute

Contributions welcome! Read the contribution guidelines first. You may find my src/doi2md.py script useful for quickly generating entries from a DOI.
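
For contributors curious about what generating an entry from a DOI can involve, here is a rough sketch. It is not the repository's `src/doi2md.py`, whose behavior is not described here. It assumes metadata is fetched as CSL-JSON through doi.org content negotiation, and it mimics the short labels used in the tables above ("Baker 2016", "Kim & Dumas 2017", "Glasziou et al 2008"). The function names and the sample record are hypothetical.

```python
import json
import urllib.request


def fetch_csl_metadata(doi: str) -> dict:
    """Fetch CSL-JSON metadata for a DOI via doi.org content negotiation.

    Requires network access; shown for context and not called below.
    """
    request = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def short_citation(metadata: dict) -> str:
    """Build a label in the list's style: one, two, or 'et al' authors plus year."""
    families = [a["family"] for a in metadata.get("author", []) if "family" in a]
    if not families:
        raise ValueError("metadata has no author family names")
    if len(families) == 1:
        names = families[0]
    elif len(families) == 2:
        names = f"{families[0]} & {families[1]}"
    else:
        names = f"{families[0]} et al"
    # CSL-JSON stores the date as nested "date-parts": [[year, month, day]].
    year = metadata["issued"]["date-parts"][0][0]
    return f"{names} {year}"


def markdown_entry(metadata: dict) -> str:
    """Return a markdown link whose text is the short citation."""
    return f"[{short_citation(metadata)}](https://doi.org/{metadata['DOI']})"


# Made-up record for illustration only; not a real publication.
sample = {
    "author": [{"family": "Doe"}, {"family": "Roe"}, {"family": "Poe"}],
    "issued": {"date-parts": [[2020, 1, 1]]},
    "DOI": "10.1234/example",
}
print(markdown_entry(sample))  # [Doe et al 2020](https://doi.org/10.1234/example)
```

Keeping the network call separate from the formatting makes the formatting step easy to check by hand before opening a pull request.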

License

CC0

To the extent possible under law, Jeremy Leipzig has waived all copyright and related or neighboring rights to this work.