3. Reproducible analyses
Motivating scenario: You can load data and make a plot. Now you want to use your skills to write a clean R script showing best practices in reproducible science.
Learning goals: By the end of this chapter you should be able to
- Make an R script file following best practices in reproducibility.
- Recognize best practices in reproducible code and spot when and where they have gone wrong.

It is now expected that your research is reproducible. By reproducible, we mean that someone else (or future you) can run your code on the same data and obtain the same results. From an altruistic perspective, this increases the integrity of science and allows others to build on your work. From a selfish perspective, you are the person most likely to revisit your own code, and your life will be better if it runs smoothly.
Here I provide a checklist of best practices in reproducible code, followed by an example of a reproducible R script.
Reproducible Code: A Checklist From AmNat
What you do to data and how you analyze it is as much a part of science as how you collect it. As such, it is essential to make sure your code:
- Reliably works – even on other computers
- And can be understood.
As a guide, I provide the principles from the scientific journal The American Naturalist’s policy in the box below. I have worked at “AmNat” as a data editor for years, and the journal has been leading the way in reproducible science. I have bolded key points. You will see that this reiterates the guidance in this section and my previous guidance on writing scripts.
REQUIRED:
- Scripts should start by loading required packages, then importing raw data from files archived in your data repository.
- Use relative paths to files and folders (e.g. avoid `setwd()` with an absolute path in R), so other users can replicate your data input steps on their own computers.
- Make sure your code works. Shut down and restart R (or type `rm(list = ls())` into the console) and run the code again. You should get the same results. If not, go back and fix your mistakes.
- Annotate your code with comments indicating the purpose of each set of commands (i.e., “why?”). If the functioning of the code (i.e., “how”) is unclear, strongly consider re-writing it to be clearer/simpler. In-line comments can provide specific details about a particular command.
  - Note that ChatGPT is very good at commenting your code.
- Annotate code to indicate how commands correspond to figure numbers, table numbers, or subheadings of results within the manuscript.
- If you are adapting other researchers’ published code for your own purposes, acknowledge and cite the sources you are using. Likewise, cite the authors of packages that you use in your published article.
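As a minimal sketch of the relative-path advice, suppose (hypothetically) your project folder contains your script and a `data/` subfolder holding the archived CSV. A relative path then works on anyone’s computer, while an absolute `setwd()` call works only on yours:

```r
# Hypothetical project layout assumed:
#   my_project/
#   ├── my_analysis.R
#   └── data/clarkia_rils.csv
library(readr)

# Avoid this -- it only works on one computer:
# setwd("/Users/me/Desktop/my_project")

# Do this instead -- a relative path from the project root:
ril_data <- read_csv("data/clarkia_rils.csv")
```

Running the script from the project root (e.g., via an RStudio project) makes the relative path resolve the same way for every user.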
RECOMMENDED:
- Test code ideally on a pristine machine without any packages installed, but at least using a new session.
- Use informative names for input files, variables, and functions (and describe them in the README file).
- Any data manipulations (merging, sorting, transforming, filtering) should be done in your script, for fully transparent documentation of any changes to the data.
- Organize your code by splitting it into logical sections, such as importing and cleaning data, transformations, analysis, and graphics and tables. Sections can be separate script files run in order (as explained in your README), blocks of code within one script separated by clear breaks (e.g., comment lines, `#--------------`), or a series of function calls (which can facilitate reuse of code).
- Label code sections with headers that match the figure number, table number, or text subheading of the paper.
- Omit extraneous code not used for generating the results of your publication, or place any such code in a Coda at the end of your script.
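The checklist requires citing the authors of packages you use. R’s built-in `citation()` function prints a formatted reference for any installed package (or for R itself), which you can paste into your manuscript’s bibliography:

```r
# Print the recommended citation for a package you used:
citation("ggplot2")

# And for R itself:
citation()
```

This is a quick way to give package authors the academic credit the policy asks for.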
Reproducible Code: Example
Below I provide an example of a reproducible script that
- Loads necessary packages.
- Loads my data.
- Cleans up column names.
- Makes a plot.
# Yaniv Brandvain
# Loading clarkia data, cleaning names and making a plot
# Feb 18th 2026
#################################################
# Loading packages #
#################################################
library(conflicted)
library(readr)
library(dplyr)
library(janitor)
library(ggplot2)
#################################################
# Loading data #
#################################################
path <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
ril_data <- read_csv(path)
#################################################
# Cleaning names #
#################################################
ril_data <- clean_names(ril_data) |>                 # Convert to standard naming
  rename(petal_area = petal_area_mm,                 # Remove units & clarify names
         petal_perimeter = petal_perim_mm,
         stem_diameter = stem_dia_mm,
         anther_stigma_distance = asd_mm)
##################################################
# Plot pollinator visits by petal size and color #
##################################################
visit_plots <- ril_data |>
  ggplot(aes(x = petal_perimeter,
             y = mean_visits,
             color = petal_color)) +
  geom_point() +
  facet_wrap(~petal_color) +
  geom_smooth(method = "lm") +
  theme(legend.position = "bottom")
# save the plot
ggsave("visits_by_petal_perimeter_and_color.png", visit_plots)

Note that I wrote this script interactively.
For example, after entering `ril_data <- read_csv(path)`, I went to the console and entered `glimpse(ril_data)` to check that the data had loaded as expected. Similarly, I built my plot a few lines at a time. First I typed and ran

ril_data |>
  ggplot(aes(x = petal_perimeter, y = mean_visits, color = petal_color)) +
  geom_point()

I then added `facet_wrap(~petal_color)`, then `geom_smooth(method = "lm")`, then `theme(legend.position = "bottom")`, and finally assigned the plot to `visit_plots`.
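Once a script like this is finished, the checklist above says to verify it runs from a clean start. One way to do that (the script file name `my_analysis.R` here is hypothetical) is to clear the workspace and re-run the whole file:

```r
rm(list = ls())          # clear the workspace (or, better, restart R entirely)
source("my_analysis.R")  # re-run the whole script from the top
# If this errors, or the results differ, the script is not yet reproducible.
```

Restarting R (or testing on a fresh session, as AmNat recommends) is stricter than `rm(list = ls())`, since it also catches dependence on packages loaded only in your interactive session.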