| location | GC | GC | GC | GC | GC | GC |
| ril | A1 | A100 | A102 | A104 | A106 | A107 |
| mean_visits | 0.0000 | 0.1875 | 0.2500 | 0.0000 | 0.0000 | 0.0000 |
4. Data in R summary
Chapter summary

Keeping data in the tidy format, where each column represents a variable and each row represents an observation, allows you to fully leverage the powerful tools of the tidyverse. In the tidyverse, data are stored in tibbles, a modern update to data frames that enhances readability and maintains consistent data types. The dplyr package offers a suite of intuitive functions for transforming and analyzing data. These functions include:
- mutate() for adding or modifying columns,
- select() for choosing columns,
- filter() for subsetting rows, and
- rename() for changing column names.
Together, and especially when used with the pipe operator, these tools enable clear, reproducible workflows. In the next two chapters we will begin summarizing data with dplyr tools.
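As a quick refresher, here is a minimal sketch that chains the four functions with the pipe. It assumes dplyr (and its dependency tibble) is installed and uses the built-in iris data; the column name sl_sw_ratio and the object names are made up for illustration.

```r
library(dplyr)

# Convert the built-in iris data frame to a tibble
iris_tbl <- tibble::as_tibble(iris)

setosa_ratio <- iris_tbl |>
  mutate(sl_sw_ratio = Sepal.Length / Sepal.Width) |>  # add a column
  dplyr::filter(Species == "setosa") |>                 # keep only some rows
  select(Species, sl_sw_ratio) |>                       # keep only some columns
  rename(species = Species)                             # change a column name

setosa_ratio
```

Each step takes the output of the previous step as its first argument, so the code reads top to bottom in the order the work is done.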
Chatbot tutor
Please interact with this custom chatbot (ChatGPT link here, Gemini link here) that I have made to help you with this chapter. I suggest at least ten back-and-forth exchanges to ramp up, then stopping once you feel you have gotten what you needed from it.
Practice Questions
Try these questions!
Q1) Consider the table at the top of this page (with location, ril, and mean_visits each in its own row). The data are
Here the data are transposed, so the data are not tidy. Remember, in tidy data each variable is a column, not a row. The transposed layout is particularly hard for R to work with because each column then mixes several types of data (here, text and numbers).
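To see why mixed types in a column are a problem, here is a small sketch in base R. The vector below is an illustrative stand-in for one column of a transposed table (a location, a RIL name, and a mean); it is not taken from a real data file.

```r
# One column of a transposed table has to hold text ("GC", "A1")
# and a number (0) at the same time.
transposed_column <- c("GC", "A1", 0.0000)

# R coerces everything to a single type, so the number becomes text
class(transposed_column)
```

Once the number is stored as text, you can no longer take its mean or do other arithmetic without converting it back.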
Q2) Consider the table below. The data are
| location-ril | mean_visits |
|---|---|
| GC-A1 | 0.0000 |
| GC-A100 | 0.1875 |
| GC-A102 | 0.2500 |
| GC-A104 | 0.0000 |
| GC-A106 | 0.0000 |
| GC-A107 | 0.0000 |
Here location and ril are combined in a single column, so the data are not tidy. Remember, in tidy data each variable is its own column. In this format it would be hard to compute, for example, means for each RIL or for each location.
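One possible way to tidy a combined column like this, sketched with mutate() and select() from this chapter plus the base R function sub(). The tibble below retypes the first three rows of the table above; the object names are hypothetical, and other approaches exist.

```r
library(dplyr)

combined <- tibble::tibble(
  `location-ril` = c("GC-A1", "GC-A100", "GC-A102"),
  mean_visits    = c(0.0000, 0.1875, 0.2500)
)

tidy_visits <- combined |>
  mutate(
    location = sub("-.*", "", `location-ril`),  # keep the text before the dash
    ril      = sub(".*-", "", `location-ril`)   # keep the text after the dash
  ) |>
  select(location, ril, mean_visits)            # drop the combined column

tidy_visits
```

Note the backticks around location-ril: they are needed because the dash makes it a non-standard column name.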
Q3) Consider the table below. The data are
| ril | GC | SR |
|---|---|---|
| A1 | 0.0000 | 0.6667 |
| A100 | 0.1875 | 0.5833 |
| A102 | 0.2500 | 0.6667 |
| A104 | 0.0000 | 1.7500 |
| A106 | 0.0000 | 0.5000 |
| A107 | 0.0000 | 1.5000 |
This is known as “wide format” and is not tidy. Here the values of the variable location (GC and SR) are used as column headings. This can be a fine way to present data to people, but it is not the format we use when analyzing data.
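For the curious, here is a sketch of how wide data like this could be made tidy. It assumes the tidyr package is installed; pivot_longer() is a tidyr function and is not one of the functions introduced in this chapter. The tibble retypes the first three rows of the table above.

```r
library(tidyr)

wide <- tibble::tibble(
  ril = c("A1", "A100", "A102"),
  GC  = c(0.0000, 0.1875, 0.2500),
  SR  = c(0.6667, 0.5833, 0.6667)
)

# Move the GC and SR headings into a location column,
# and their cell values into a mean_visits column
long <- wide |>
  pivot_longer(cols = c(GC, SR), names_to = "location", values_to = "mean_visits")

long
```

The result has one row per combination of ril and location, so each variable is its own column.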
Q4) You should always make sure data are tidy when (pick best answer)
For the following questions consider the iris data set built into R. Preview below.
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
Q5) Consider the table above and code below. What is the value of pl_pw_ratio in the first row?
iris |>
  mutate(pl_pw_ratio = Petal.Length / Petal.Width)

Q6) Consider the code below. Which column names will appear in the output? (Select all that apply.)
iris |>
  mutate(pl_pw_ratio = Petal.Length / Petal.Width) |>
  select(Species, pl_pw_ratio)

Q7) Say I wanted to use clean_names() from janitor. Which two of the scripts below would work?
# A
iris |>
  clean_names() |>
  mutate(pl_pw_ratio = Petal.Length / Petal.Width)

# B
iris |>
  mutate(pl_pw_ratio = Petal.Length / Petal.Width) |>
  clean_names()

# C
iris |>
  clean_names() |>
  mutate(pl_pw_ratio = petal_length / petal_width)

# D
iris |>
  clean_names() |>
  mutate(pl_pw_ratio = Petal.Length / Petal.Width)

Use the R environment below to answer the next set of questions.
As you can see, the code above returns the error: "Error: object ‘Species’ not found."
Q8) What was this code aiming to do?
Q9) Why did the error, "Error: object ‘Species’ not found." arise?
Try typing glimpse(iris)
Q10) How can you fix the error "Error: object ‘Species’ not found."? Find both correct answers.
Debugger’s note: In a hidden setup chunk I wrote: conflict_prefer("filter", "stats") to tell R to prefer stats::filter() over dplyr::filter(). Because of this, filter() no longer understands column names inside a data frame. So I forced this error on you.
In real workflows, it can be hard to predict which version of a function R will use. This issue arises even when I don’t trick you. So always tell R which version of filter() you want to use. filter() may work on your computer today without dplyr::filter() or conflict_prefer("filter","dplyr"), but it might not work on someone else's computer (it might even fail on your computer next week).
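Here is a minimal sketch of the safer habit, assuming dplyr is installed. Writing dplyr::filter() spells out which filter() you mean, so the result does not depend on package load order or conflict settings. The object names are hypothetical.

```r
library(dplyr)

# Explicitly use the dplyr version of filter()
setosa <- iris |>
  dplyr::filter(Species == "setosa")

nrow(setosa)

# Explicitly using the stats version fails on the same request,
# because stats::filter() does not look up column names in a data frame.
# tryCatch() captures the error so the script keeps running.
stats_attempt <- tryCatch(
  stats::filter(iris, Species == "setosa"),
  error = function(e) e
)
conditionMessage(stats_attempt)
```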
Glossary of Terms
- Tidy Data: A structured format where:
  - Each row represents an observation.
  - Each column represents a variable.
  - Each cell contains a single measurement.
- Tibbles: A modern form of a data frame in R with:
  - Cleaner printing (only first 10 rows, fits columns to screen).
  - Explicit display of data types (e.g., <chr>, <dbl>).
  - Strict subsetting (prevents automatic type conversion).
  - Character data is not automatically converted to factors.
- Piping (|>) functions: A way to chain operations together, making code more readable and modular.
- Missing Data (NA): R uses NA to represent missing values. Operations with NA return NA unless handled explicitly (e.g., na.rm = TRUE to ignore missing values, use = "pairwise.complete.obs", etc.).
- Warnings: Indicate a possible issue but allow code to run (e.g., NAs introduced by coercion).
- Errors: Stop execution completely when something is invalid.
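The last three glossary entries can be illustrated with a short base R sketch. The values below are made up for illustration.

```r
visits <- c(0.0000, 0.1875, NA, 0.2500)

# Operations with NA return NA unless you handle the missing value
mean(visits)
mean(visits, na.rm = TRUE)  # ignores the NA

# A warning: the code still runs, but "abc" cannot be a number,
# so it becomes NA ("NAs introduced by coercion")
converted <- as.numeric(c("0.25", "abc"))

# An error: execution stops. tryCatch() captures the error message
# here so that the rest of the script can keep running.
caught <- tryCatch(log("abc"), error = function(e) conditionMessage(e))
caught
```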
Key R functions
- read_csv() (readr): Reads a CSV file into R as a tibble, automatically guessing column types.
- dplyr::filter() (dplyr): Retains only rows which meet some logical criteria. Because there is a different filter() in a different package (stats), always specify which filter you want R to use.
- rename() (dplyr): Allows you to manually change column names.
- as.numeric(): Converts a vector to a numeric data type.
- |> (Base R Pipe Operator): Passes the result of one function into another, making code more readable.
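A small sketch that uses all five together, assuming readr and dplyr are installed. The file is a temporary CSV written just for this example, and its contents and column names are made up. The col_types = "cc" argument forces both columns to be read as text, purely so that as.numeric() has something to convert; normally read_csv() would guess the numeric type for you.

```r
library(readr)
library(dplyr)

# Write a tiny example CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ril,Mean Visits", "A1,0.0000", "A100,0.1875", "A102,0.2500"), csv_path)

visits <- read_csv(csv_path, col_types = "cc") |>    # read both columns as text
  rename(mean_visits = `Mean Visits`) |>             # new_name = old_name
  mutate(mean_visits = as.numeric(mean_visits)) |>   # text to numbers
  dplyr::filter(mean_visits > 0)                     # keep rows with visits

visits
```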
R Packages Introduced
Additional resources
R Recipes:
Other web resources:
- Chapter 10: Tidy data from R for data science (Grolemund & Wickham (2018)).
- Animated dplyr functions from R for the Rest of Us.
Videos:
Basic Data Manipulation (From Stat454).
Calculations on tibble (From Stat454).