The Practice of Reproducible Research by Justin Kitzes, Daniel Turek, Fatma Deniz - Paper

About the Book

The Practice of Reproducible Research presents concrete examples of how researchers in the data-intensive sciences are working to improve the reproducibility of their research projects. In each of the thirty-one case studies in this volume, the author or team describes the workflow that they used to complete a real-world research project. Authors highlight how they utilized particular tools, ideas, and practices to support reproducibility, emphasizing the very practical how, rather than the why or what, of conducting reproducible research.

Part 1 provides an accessible introduction to reproducible research, a basic reproducible research project template, and a synthesis of lessons learned from across the thirty-one case studies. Parts 2 and 3 focus on the case studies themselves. The Practice of Reproducible Research is an invaluable resource for students and researchers who wish to better understand the practice of data-intensive sciences and learn how to make their own research more reproducible.

About the Author

Justin Kitzes is Assistant Professor of Biology at the University of Pittsburgh.

Daniel Turek is Assistant Professor of Statistics at Williams College.

Fatma Deniz is Postdoctoral Scholar at the Helen Wills Neuroscience Institute and the International Computer Science Institute, and Data Science Fellow at the University of California, Berkeley.

Contributors
Preface: Nullius in Verba
Philip B. Stark
Introduction
Justin Kitzes

PART I: PRACTICING REPRODUCIBILITY
Assessing Reproducibility
Ariel Rokem, Ben Marwick, and Valentina Staneva
The Basic Reproducible Workflow Template
Justin Kitzes
Case Studies in Reproducible Research
Daniel Turek and Fatma Deniz
Lessons Learned
Kathryn Huff
Building toward a Future Where Reproducible, Open Science Is the Norm
Karthik Ram and Ben Marwick
Glossary
Ariel Rokem and Fernando Chirigati

PART II: HIGH-LEVEL CASE STUDIES
Case Study 1: Processing of Airborne Laser Altimetry Data Using Cloud-Based Python and Relational Database Tools
Anthony Arendt, Christian Kienholz, Christopher Larsen, Justin Rich, and Evan Burgess
Case Study 2: The Trade-Off between Reproducibility and Privacy in the Use of Social Media Data to Study Political Behavior
Pablo Barberá
Case Study 3: A Reproducible R Notebook Using Docker
Carl Boettiger
Case Study 4: Estimating the Effect of Soldier Deaths on the Military Labor Supply
Garret Christensen
Case Study 5: Turning Simulations of Quantum Many-Body Systems into a Provenance-Rich Publication
Jan Gukelberger and Matthias Troyer
Case Study 6: Validating Statistical Methods to Detect Data Fabrication
Chris Hartgerink
Case Study 7: Feature Extraction and Data Wrangling for Predictive Models of the Brain in Python
Chris Holdgraf
Case Study 8: Using Observational Data and Numerical Modeling to Make Scientific Discoveries in Climate Science
David Holland and Denise Holland
Case Study 9: Analyzing Bat Distributions in a Human-Dominated Landscape with Autonomous Acoustic Detectors and Machine Learning Models
Justin Kitzes
Case Study 10: An Analysis of Household Location Choice in Major US Metropolitan Areas Using R
Andy Krause and Hossein Estiri
Case Study 11: Analyzing Cosponsorship Data to Detect Networking Patterns in Peruvian Legislators
José Manuel Magallanes
Case Study 12: Using R and Related Tools for Reproducible Research in Archaeology
Ben Marwick
Case Study 13: Achieving Full Replication of Our Own Published CFD Results, with Four Different Codes
Olivier Mesnard and Lorena A. Barba
Case Study 14: Reproducible Applied Statistics: Is Tagging of Therapist-Patient Interactions Reliable?
K. Jarrod Millman, Kellie Ottoboni, Naomi A. P. Stark, and Philip B. Stark
Case Study 15: A Dissection of Computational Methods Used in a Biogeographic Study
K. A. S. Mislan
Case Study 16: A Statistical Analysis of Salt and Mortality at the Level of Nations
Kellie Ottoboni
Case Study 17: Reproducible Workflows for Understanding Large-Scale Ecological Effects of Climate Change
Karthik Ram
Case Study 18: Reproducibility in Human Neuroimaging Research: A Practical Example from the Analysis of Diffusion MRI
Ariel Rokem
Case Study 19: Reproducible Computational Science on High-Performance Computers: A View from Neutron Transport
Rachel Slaybaugh
Case Study 20: Detection and Classification of Cervical Cells
Daniela Ushizima
Case Study 21: Enabling Astronomy Image Processing with Cloud Computing Using Apache Spark
Zhao Zhang

PART III: LOW-LEVEL CASE STUDIES
Case Study 22: Software for Analyzing Supernova Light Curve Data for Cosmology
Kyle Barbary
Case Study 23: pyMooney: Generating a Database of Two-Tone Mooney Images
Fatma Deniz
Case Study 24: Problem-Specific Analysis of Molecular Dynamics Trajectories for Biomolecules
Konrad Hinsen
Case Study 25: Developing an Open, Modular Simulation Framework for Nuclear Fuel Cycle Analysis
Kathryn Huff
Case Study 26: Producing a Journal Article on Probabilistic Tsunami Hazard Assessment
Randall J. LeVeque
Case Study 27: A Reproducible Neuroimaging Workflow Using the Automated Build Tool “Make”
Tara Madhyastha, Natalie Koh, and Mary K. Askren
Case Study 28: Generation of Uniform Data Products for AmeriFlux and FLUXNET
Gilberto Pastorello
Case Study 29: Developing a Reproducible Workflow for Large-Scale Phenotyping
Russell Poldrack
Case Study 30: Developing and Testing Stochastic Filtering Methods for Tracking Objects in Videos
Valentina Staneva
Case Study 31: Developing, Testing, and Deploying Efficient MCMC Algorithms for Hierarchical Models Using R
Daniel Turek

Index

Reviews

“Understanding why science should be open is only the first step; the second is to actually do it. This wide-ranging new book shows how researchers from a variety of disciplines are translating general principles into specific improvements in their work that others can learn from and imitate. Everyone who is trying to squeeze insight out of data will learn something from it, and those who are trying to train the next generation of scientists will find it a rich source of examples and models for their students to emulate.”—Greg Wilson, Cofounder of Software Carpentry, now Principal Consultant with Rangle.io

“The integrity of scientific knowledge depends crucially on the reliability and reproducibility of our published results. But what happens if we don’t even report enough information to allow our experiments to be repeated? This book offers practical solutions to enhance the reporting and validation of research.”—Randy Schekman, Professor, Department of Molecular and Cell Biology, University of California, Berkeley, and Investigator, Howard Hughes Medical Institute
