• 4. Data in R summary

Links to: Summary, Chatbot Tutor, Practice Questions, Glossary, R functions, R packages introduced, and Additional resources.

Chapter summary

Figure: A beautiful *Clarkia xantiana* flower.

Keeping data in the tidy format, where each column represents a variable and each row represents an observation, allows you to fully leverage the powerful tools of the tidyverse. In the tidyverse, data are stored in tibbles, a modern update to data frames that enhances readability and maintains consistent data types. The dplyr package offers a suite of intuitive functions for transforming and analyzing data. These functions include:

- mutate() for adding or modifying columns,
- select() for choosing columns,
- filter() for subsetting rows, and
- rename() for changing column names.

Together, especially when used with the pipe operator, these tools enable clear, reproducible workflows. In the next two chapters we will begin summarizing data with dplyr tools.
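As a quick recap, here is a minimal sketch that chains the four functions summarized above with the pipe. It uses the built-in iris data rather than the chapter's own data, and the object name flowers is just an illustrative choice. It assumes R 4.1 or later (for |>) and that dplyr is installed.

```r
library(dplyr)

# iris ships with R as a data frame; as_tibble() converts it to a tibble
flowers <- iris |>
  as_tibble() |>
  rename(species = Species) |>                            # change a column name
  mutate(pl_pw_ratio = Petal.Length / Petal.Width) |>     # add a new column
  dplyr::filter(species == "setosa") |>                   # keep only setosa rows
  select(species, Petal.Length, Petal.Width, pl_pw_ratio) # keep four columns

flowers
```

Reading the pipeline top to bottom gives the order in which the steps run, which is a large part of what makes piped code easy to follow.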

Chatbot tutor

Please interact with this custom chatbot (ChatGPT link here, Gemini link here) I have made to help you with this chapter. I suggest interacting with at least ten back-and-forths to ramp up and then stopping when you feel like you got what you needed from it.

Practice Questions

Try these questions!

Q1) Consider the table below. Are the data tidy?

location	GC	GC	GC	GC	GC	GC
ril	A1	A100	A102	A104	A106	A107
mean_visits	0.0000	0.1875	0.2500	0.0000	0.0000	0.0000

Here the data are transposed, so the data are not tidy. Remember in tidy data each variable is a column, not a row. This is particularly hard for R because there are numerous types of data in a column.

Q2) Consider the table below. Are the data tidy?

location-ril	mean_visits
GC-A1	0.0000
GC-A100	0.1875
GC-A102	0.2500
GC-A104	0.0000
GC-A106	0.0000
GC-A107	0.0000

Here location and ril are combined in a single column, so the data are not tidy. Remember that in tidy data each variable is its own column. It would be hard to get, for example, means for RILs or locations in this format.

Q3) Consider the table below. Are the data tidy?

ril	GC	SR
A1	0.0000	0.6667
A100	0.1875	0.5833
A102	0.2500	0.6667
A104	0.0000	1.7500
A106	0.0000	0.5000
A107	0.0000	1.5000

This is known as “wide format” and is not tidy. Here the values of the variable location are used as column headings. This can be a fine way to present data to people, but it is not the format we use when analyzing data.
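For the curious, here is one way such a wide table could be reshaped into tidy form. This is a sketch only: it relies on pivot_longer() from the tidyr package, which is not one of the packages introduced in this chapter, and the object names wide_visits and long_visits are made up for illustration.

```r
library(tibble)
library(tidyr)

# The wide table from Q3, typed in by hand
wide_visits <- tribble(
  ~ril,   ~GC,    ~SR,
  "A1",   0.0000, 0.6667,
  "A100", 0.1875, 0.5833,
  "A102", 0.2500, 0.6667,
  "A104", 0.0000, 1.7500,
  "A106", 0.0000, 0.5000,
  "A107", 0.0000, 1.5000
)

# Move the GC and SR headings into a location column,
# and their values into a mean_visits column
long_visits <- wide_visits |>
  pivot_longer(
    cols      = c(GC, SR),
    names_to  = "location",
    values_to = "mean_visits"
  )

long_visits
```

After reshaping, each row is one RIL at one location, so location is a proper column that dplyr functions can work with.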

Q4) You should always make sure data are tidy when (pick best answer)

- collecting data
- presenting data
- analyzing data with dplyr
- all of the above

For the following questions consider the iris data set built into R. Preview below.

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa

Q5) Consider the table above and code below. What is the value of pl_pw_ratio in the first row?

iris |> 
  mutate(pl_pw_ratio = Petal.Length / Petal.Width)

Q6) Consider the code below. Which column names will appear in the output? (select all that apply).

iris |> 
  mutate(pl_pw_ratio = Petal.Length / Petal.Width) |>
  select(Species, pl_pw_ratio)

Q7) Say I wanted to use clean_names() from janitor. Which two of the scripts below would work?

# A
iris |> 
  clean_names() |>
  mutate(pl_pw_ratio = Petal.Length / Petal.Width)

# B
iris |> 
  mutate(pl_pw_ratio = Petal.Length / Petal.Width) |>
  clean_names()

# C
iris |> 
  clean_names()  |>
  mutate(pl_pw_ratio = petal_length / petal_width)

# D
iris |> 
  clean_names()  |>
  mutate(pl_pw_ratio = Petal.Length / Petal.Width)

Glossary of Terms

Tidy Data: A structured format where:
- Each row represents an observation.
- Each column represents a variable.
- Each cell contains a single measurement.
Tibbles: A modern form of a data frame in R with:
- Cleaner printing (only first 10 rows, fits columns to screen).
- Explicit display of data types (e.g., <dbl>, <chr>).
- Strict subsetting (prevents automatic type conversion).
- Character data is not automatically converted to factors.
Piping (|>) functions: A way to chain operations together, making code more readable and modular.
Missing Data (NA): R uses NA to represent missing values. Operations with NA return NA unless handled explicitly (e.g., na.rm = TRUE to ignore missing values, use = "pairwise.complete.obs", etc).
Warnings: Indicate a possible issue but allow code to run (e.g., NAs introduced by coercion).
Errors: Stop execution completely when something is invalid.
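The last three glossary entries are easier to remember with a small example. The sketch below uses made-up vectors (visits and counts are hypothetical names) and base R only. The tryCatch() call is there just so the script keeps running after the error; it is not something this chapter covers.

```r
# Missing data: NA propagates unless it is handled explicitly
visits <- c(0.25, NA, 0.75)
mean(visits)                # returns NA
mean(visits, na.rm = TRUE)  # ignores the NA and averages the other two values

# Warning: the code still runs, but "unknown" cannot be converted and becomes NA
counts <- as.numeric(c("1", "2", "unknown"))
counts

# Error: execution stops, so we catch the error and keep its message instead
result <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)
result
```

Note the difference in behavior: after the warning, counts still exists (with an NA in it), whereas without tryCatch() the error would have stopped the script at log("ten").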

Key R functions

read_csv() (readr): Reads a CSV file into R as a tibble, automatically guessing column types.
select() (dplyr): Selects specific columns from a dataset.
mutate() (dplyr): Creates or modifies columns in a dataset.
dplyr::filter() (dplyr): Retains only rows that meet some logical criteria. Because there is a different filter() in a different package, always specify which filter you want R to use.
rename() (dplyr): Allows you to manually change column names.
as.numeric(): Converts a vector to a numeric data type.
|> (Base R Pipe Operator): Passes the result of one function into another, making code more readable.
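Here is a small sketch of several of these functions working together. The CSV text is hypothetical and typed inline so the example runs without a file; wrapping it in I() asks read_csv() to treat the string as literal data, which I am assuming is available because it needs readr 2.0 or later. In practice you would pass a file path instead.

```r
library(readr)
library(dplyr)

# Hypothetical CSV contents; normally this would be a path such as "my_data.csv"
csv_text <- I("ril,loc,mean_visits\nA1,GC,0\nA100,GC,0.1875\nA1,SR,0.6667\nA100,SR,0.5833\n")

# read_csv() returns a tibble and guesses that mean_visits is numeric
visits <- read_csv(csv_text, show_col_types = FALSE)

gc_visits <- visits |>
  rename(location = loc) |>          # give the column a clearer name
  dplyr::filter(location == "GC") |> # namespaced so R uses dplyr's filter()
  select(ril, mean_visits)           # keep only the columns we need

gc_visits
```

Writing dplyr::filter() rather than filter() costs a few extra characters but removes any doubt about which package's filter is being called.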

R Packages Introduced

readr: A tidyverse package for reading rectangular data files (e.g., read_csv()).
dplyr: A tidyverse package for data manipulation, including mutate(), glimpse(), and across().

Additional resources

Other web resources:

Chapter 10: Tidy data from R for Data Science (Grolemund & Wickham, 2018).
Animated dplyr functions from R for the Rest of Us.

Videos:

Basic Data Manipulation (From Stat454).
Calculations on tibble (From Stat454).