4. Modifying columns

Motivating scenario: you want to change or add a column in a tibble.

Learning goals: By the end of this sub-chapter you should be able to:

  1. Add or change a column with the mutate() function in the dplyr package.
  2. Change between variable types in a column.

Loading and formatting data to match where we last left off.
library(dplyr)
library(readr)
library(ggplot2)

ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
ril_data <- readr::read_csv(ril_link)
ril_data <- ril_data |> 
  select(location,  prop_hybrid,  mean_visits, petal_color, 
         petal_area_mm, asd_mm, growth_rate) |>
  rename(petal_area = petal_area_mm, 
         asd        = asd_mm,
         visits     = mean_visits)

Why clean a column?

Perhaps the most important rule in statistics is not to lie, cheat, or deceive. With that in mind, you might wonder why we should ever modify values in a column. Figure 1 shows an example of the need to modify columns. Such modifications are not “cheating” or dishonest; they are necessary for honest analyses.

library(ggplot2)
ggplot(ril_data, aes(x = growth_rate, y = petal_area)) +
  geom_point()
Scatterplot of petal_area vs growth_rate. There are many x-axis labels and x-axis ticks.
Figure 1: A plot of petal area across growth rates. Can you spot something fishy? This is not a pretend example. I did not “cook it up” for this story. I see this mistake numerous times every year.

Q: Identify a few weird things about Figure 1. Then guess why they may arise and how this plot could be fixed.


Using dplyr’s mutate() to change columns

To fix Figure 1, we must convert growth_rate into a number. But this isn’t the only time we want to change values in a column. We may want to:

  • Log transform data.
  • Standardize values by dividing by some constant.
  • Convert from Celsius to Fahrenheit.
  • and so on.

dplyr’s mutate() function helps us with this. Here’s the syntax:

TIBBLE_NAME |> mutate(COLUMN_NAME = FUNCTION(COLUMN_NAME))
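
As a minimal sketch of this pattern applied to the transformations listed above, consider a small made-up tibble. The tibble toy and its columns mass, height, and temp are hypothetical and are not part of the Clarkia data. Each line inside mutate() overwrites a column with a transformed version of itself.

```r
library(dplyr)

# Hypothetical toy data (not from the Clarkia dataset)
toy <- tibble(
  mass   = c(1, 10, 100),   # arbitrary units
  height = c(5, 10, 20),    # arbitrary units
  temp   = c(0, 10, 100)    # degrees Celsius
)

toy_out <- toy |>
  mutate(
    mass   = log10(mass),           # log transform (base 10)
    height = height / max(height),  # standardize by dividing by a constant (here, the maximum)
    temp   = temp * 9 / 5 + 32      # convert Celsius to Fahrenheit
  )

toy_out
```

Because the column names on the left of each = already exist, mutate() replaces the old values rather than adding new columns.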

In our case we use the as.numeric() function to convert growth_rate into a number.

ril_data <- ril_data      |>
  mutate(growth_rate = as.numeric(growth_rate))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `growth_rate = as.numeric(growth_rate)`.
Caused by warning:
! NAs introduced by coercion

The warning above is telling us that at least one value could not be turned into a number and was converted to NA.
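
One way to find out which entries caused the coercion is to look for values that are not missing as text but become NA after as.numeric(). The sketch below assumes a hypothetical toy tibble (toy, with made-up values) in which growth_rate is still stored as text. On real data, this check would need to run before the column is overwritten.

```r
library(dplyr)

# Hypothetical toy data: one entry has the letter O in place of a zero
toy <- tibble(growth_rate = c("1.25", "1.8O", "2.10", NA))

bad_values <- toy |>
  mutate(as_number = suppressWarnings(as.numeric(growth_rate))) |>  # silence the expected warning
  filter(is.na(as_number), !is.na(growth_rate))                     # NA only after conversion

bad_values
```

Rows that were already missing in the original column are excluded, so bad_values holds only the entries that failed to convert.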

It turns out the mistake was that I entered “1.80” as “1.8O” (the letter O in place of the digit zero). If you knew that to be true, you could fix the data directly in the spreadsheet. Better yet, make the correction in your R script, so that it is documented and reproducible.
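
A scripted correction might look something like the sketch below. It assumes a hypothetical toy tibble (toy, with made-up values) in which growth_rate is still text, and it uses dplyr's if_else() to swap the mistyped entry before converting. This is one possible approach, not necessarily the one used to produce Figure 2.

```r
library(dplyr)

# Hypothetical toy data with the mistyped entry
toy <- tibble(growth_rate = c("1.25", "1.8O", "2.10"))

toy_fixed <- toy |>
  mutate(
    growth_rate = if_else(growth_rate == "1.8O", "1.80", growth_rate),  # fix the known typo
    growth_rate = as.numeric(growth_rate)                               # then convert to a number
  )

toy_fixed
```

Fixing the typo before the conversion means as.numeric() no longer produces an NA or a warning for that entry.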

library(ggplot2)
library(dplyr)

ggplot(ril_data, aes(x = growth_rate, y = petal_area)) +
  geom_point()
Corrected scatterplot of petal_area against growth_rate. The x-axis now displays a continuous scale with a modest number of evenly spaced numeric ticks, indicating growth_rate is being treated as numeric rather than categorical
Figure 2: A corrected plot of petal area across growth rates.

Using dplyr’s mutate() to add columns

A visual representation of a data transformation using `mutate()` in `dplyr`. The top table contains two columns: `prop_hyb` (proportion of hybrids) and `n_assayed` (number of individuals assayed), with values showing different proportions and a constant sample size of 8. Below, an R code snippet applies `mutate(n_hyb = prop_hyb * n_assayed)`, generating a new column, `n_hyb`, which contains the computed number of hybrids (0, 1, and 2, respectively). The updated dataset is displayed in a bottom table with the new `n_hyb` column highlighted in a darker shade.
Figure 3: An illustration of the mutate() function. The top table represents the original dataset, containing columns for the proportion of hybrids (prop_hyb) and the number of individuals assayed (n_assayed). The mutate() function is then applied to compute n_hyb, the total number of hybrid individuals, by multiplying prop_hyb by n_assayed. The resulting dataset, shown in the bottom table, includes this newly created n_hyb column.
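
The transformation illustrated in Figure 3 can be sketched in code as follows. The tibble name hyb is hypothetical. The prop_hyb values are inferred from the figure description (a constant n_assayed of 8 and resulting n_hyb values of 0, 1, and 2), so treat them as an assumption rather than the exact values shown.

```r
library(dplyr)

# Values inferred from the Figure 3 description (an assumption)
hyb <- tibble(
  prop_hyb  = c(0, 0.125, 0.25),  # proportion of hybrid seeds
  n_assayed = c(8, 8, 8)          # number of individuals assayed
)

hyb_out <- hyb |>
  mutate(n_hyb = prop_hyb * n_assayed)  # new column: number of hybrids

hyb_out
```

Because n_hyb is a new name, mutate() adds a column and leaves prop_hyb and n_assayed unchanged.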

Our data often include important information that is not directly noted. For example, we may want to know:

  • If a plant received any pollinators
  • If a plant set any hybrid seeds
  • If a plant’s anther stigma distance is unusually large for its petal area

Here we want to add columns, not overwrite old ones. No worries, we can also add columns with mutate() – just be sure to use a new column name:

ril_data <- ril_data             |>
  mutate(visited      = visits > 0,
         has_hyb      = prop_hybrid > 0,
         rel_asd      = asd / petal_area)  

Below, I show the result after temporarily removing location and shortening prop_hybrid to prop_hyb so that the table fits (I do not save these changes). Note a new dplyr::select() trick: you can select all but specified columns with a minus sign.

library(dplyr)
ril_data |>
  rename(prop_hyb = prop_hybrid) |>
  select(-location) # select everything but location