4. Data in R: Code
Overview
The past two chapters introduced the basic functions and workflow for preparing to analyze your data in R. Below I provide an example script for this process. Remember that:
- The contents of your script are not all the code that you wrote. Rather, your script is what someone else needs to generate your results and understand how you got there. For example, I used the glimpse() function to look into the data. I also had numerous versions of wrong code and false starts. While it is useful for you (as learners) to understand that such additional code and false starts are common, they are not needed to recreate my final product.
- Reproducibility requires a reliable file structure. I cheated by reading data from the web. If your data reside on your computer, be sure to include a .zip file with the R script, the data, and an R project that will work on a computer that has none of your files.
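As a rough illustration of what a "reliable file structure" can look like, the sketch below reads data through a path relative to the project folder instead of an absolute path that only exists on one machine. The folder name my_project, the file toy_rils.csv, and its contents are all invented for this example; a temporary folder stands in for a real project so the code runs anywhere.

```r
library(readr)

# Hypothetical layout (names invented for illustration):
#   my_project/my_project.Rproj
#   my_project/data/toy_rils.csv
# Build that layout in a temporary folder so the sketch is self-contained.
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write_csv(
  data.frame(location = c("A", "B"), mean_visits = c(0, 1.5)),
  file.path(project_dir, "data", "toy_rils.csv")
)

# Opening the .Rproj file sets the working directory to the project folder.
# Here setwd() only imitates that step for the sake of the demonstration.
old_wd <- setwd(project_dir)

# A relative path works on any computer that has the unzipped project.
toy_rils <- read_csv(file.path("data", "toy_rils.csv"), show_col_types = FALSE)

setwd(old_wd)  # restore the original working directory
toy_rils
```

In a real project you would not call setwd() yourself; opening the R project and using relative paths is usually enough.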
Functions
I used the following functions to handle the data (let’s not focus on the ggplot functions for now).
- read_csv(). In the readr package. It loads the data into R.
- select(). In the dplyr package. It allows us to winnow down our dataset to the columns we care about. NOTE: You can choose either the columns you want by name, or the columns you don’t want by putting a negative sign, -, before their names.
- rename(). In the dplyr package. It allows us to change the names of columns: rename(NEW_NAME = OLD_NAME).
- mutate(). In the dplyr package. Adds or modifies a column.
- ifelse(). In base R. This is a bonus function that you don’t NEED to know now, but it is helpful! It allows you to do different things based on the answer to a logical question. In this case, I used it to set visited to "visited" if there are nonzero pollinator visits, and to "not visited" if there are zero visits. This gives us clearer plot labels than TRUE/FALSE.
- filter(). In the dplyr package. Allows you to choose rows to retain based on their values in a column. Because there is another filter() function (in the stats package, which R loads by default), always specify that you want dplyr’s filter. You can do this by either:
  - Using the conflict_prefer() function in the conflicted package: conflict_prefer("filter", winner = "dplyr"). OR
  - Using the package::function() convention: dplyr::filter().
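The functions above can be tried on a small made-up data frame before running the full script. The sketch below is only an illustration: the toy object and its columns are invented and are not part of the Clarkia data. It shows dropping a column with a negative sign, the new = old order in rename(), and the fact that ifelse() returns NA when its input is NA, which is why a filter() on !is.na() is a natural next step.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  id          = c("a", "b", "c", "d"),
  mean_visits = c(0, 2.5, NA, 1),
  notes       = c("x", "y", "z", "w")
)

toy_clean <- toy |>
  dplyr::select(-notes) |>                    # drop a column with a minus sign
  dplyr::rename(visits = mean_visits) |>      # new_name = old_name
  dplyr::mutate(visited = ifelse(visits > 0,  # an NA in visits stays NA here
                                 "visited",
                                 "not visited")) |>
  dplyr::filter(!is.na(visited))              # keep rows with a known status

toy_clean
```

Writing dplyr::filter() here sidesteps the name clash without needing the conflicted package; either approach from the list above should work.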
Reproducible script
Now here it is!
# Yaniv Brandvain
# Feb 21 2026
# Goal: load and clean Clarkia data for analysis
###--------------------------
### Load packages
library(dplyr)
library(ggplot2)
library(readr)
library(conflicted)
conflict_prefer("filter", winner = "dplyr") # Prefer dplyr's filter function.
###--------------------------
### Load data
ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
ril_data <- readr::read_csv(ril_link)
###--------------------------
### Format data. Note the many line breaks make code easier to read, but don't change how it works
ril_data <- ril_data |>
  select(location,                      # Focus on a few columns of interest
         prop_hybrid,
         mean_visits,
         petal_color,
         petal_area_mm,
         asd_mm,
         growth_rate) |>
  rename(petal_area = petal_area_mm,    # Makes names better
         asd = asd_mm,
         visits = mean_visits) |>
  mutate(growth_rate = as.numeric(growth_rate),   # Improve and add columns
         visited = ifelse(visits > 0, "visited", "not visited"),
         has_hyb = ifelse(prop_hybrid > 0, "yes hybrid", "no hybrid"),
         relative_asd = asd / petal_area) |>
  filter(!is.na(visited),               # Remove NA data
         !is.na(has_hyb))
###--------------------------
### Plot the association between receiving a visit and having a hybrid, by location
final_plot <- ggplot(ril_data, aes(x = has_hyb, fill = visited)) +
  geom_bar() +
  facet_wrap(~location, labeller = "label_both") +
  theme(legend.position = "bottom",                # tricks to make better plots
        axis.title.y = element_text(size = 18),    # we didn't learn these tricks yet
        axis.title.x = element_blank(),            # we will learn this in chapter 8
        axis.text = element_text(size = 18),
        legend.title = element_blank(),
        legend.text = element_text(size = 18),
        strip.text = element_text(size = 18))
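Note that the script stores the plot in final_plot but does not display or save it. As a minimal sketch of one way to do both, the example below builds a small stand-in plot from invented data (toy and toy_plot are placeholders, not the Clarkia results), prints it, and writes it to a file with ggsave(). The file name and dimensions are assumptions chosen only for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for ril_data
toy <- data.frame(
  has_hyb = c("yes hybrid", "no hybrid", "no hybrid", "yes hybrid"),
  visited = c("visited", "not visited", "visited", "visited")
)

toy_plot <- ggplot(toy, aes(x = has_hyb, fill = visited)) +
  geom_bar()

print(toy_plot)  # printing a ggplot object draws it

# Save the plot; the file name and size are placeholders
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(out_file, plot = toy_plot, width = 6, height = 4)
```

With the real script, the same two calls would take final_plot in place of toy_plot.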