• 6. Associations I: Script


Here’s a brief run down of the key functions and pipelines used in this chapter.

Two categorical variables

Lets look at a new (classic) data set - the survival of adults sitting first class on the Titanic. The code below takes the data from old-R style into the tidyverse framework we’re used to. The data now look like this:

Formatting the Titanic data & then glimpsing
library(tidyr) # for "uncount"
library(dplyr)
library(janitor)
library(ggplot2)
titanic_first_class_adults <- Titanic["1st",,,]  |>
    data.frame()                                 |> 
    filter(Age == "Adult")                       |> 
    select(-Age)                                 |>
    uncount(weights = Freq)                      |>
  slice_sample(prop = 1)

glimpse(titanic_first_class_adults)
Rows: 319
Columns: 2
$ Sex      <fct> Male, Male, Female, Male, Female, Male, Female, Female, Male,…
$ Survived <fct> No, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, No, Yes, N…

Stacked bar chart showing counts of adult first-class Titanic passengers by sex and survival status. For males, the majority did not survive, with a smaller portion surviving. For females, most survived, with very few deaths. The y-axis shows counts, the x-axis shows sex, and colors distinguish survival status (teal = survived, coral = did not survive).

Survival counts for adult first-class passengers on the Titanic by sex. Stacked bars show the number of adult first-class passengers who did or did not survived, separated by sex.
Sex prop_survived
Male 0.326
Female 0.972
library(dplyr)
library(janitor)
library(ggplot2)

# making plot
titanic_first_class_adults |>
  ggplot(aes(x = Sex, fill = Survived))+
  geom_bar()

# finding conditional proportions
titanic_first_class_adults                     |>
  tabyl(Sex, Survived)                         |> 
  mutate(n_tot =  No + Yes)                    |>
  mutate(prop_survived =  Yes / n_tot)         |>
  select(Sex, prop_survived)
Formatting the Titanic data & then looking
titanic_first_class_adults_counts  <- Titanic["1st",,,]  |>
    data.frame()                                 |> 
    filter(Age == "Adult")                       |> 
    select(-Age)  

titanic_first_class_adults_counts
     Sex Survived Freq
1   Male       No  118
2 Female       No    4
3   Male      Yes   57
4 Female      Yes  140

Often categorical data comes summarized as counts, as above. When this is the case we can still proceed without much delay. We just need a few different R tricks.

  • For plotting, map the “count” column (in this case Freq) onto y in aes and use geom_col() instead of geom_bar().
  • For counting and summarizing we “munge” the data to make it easy to deal with. In this case, we first make the data wide with tidyr’s pivot_wider() function.
library(dplyr)
library(janitor)
library(ggplot2)
library(tidyr)
# making plot
titanic_first_class_adults_counts |>
  ggplot(aes(x = Sex, y = Freq, fill = Survived))+
  geom_col()

# finding conditional proportions
titanic_first_class_adults_counts                        |>
  pivot_wider(names_from = Survived, values_from = Freq) |> #make it wide
  # now were back to the previous version after tabyl
  mutate(n_tot =  No + Yes)                    |>
  mutate(prop_survived =  Yes / n_tot)         |>
  select(Sex, prop_survived)

A binary and continuous variable

Returning to Clarkia RIL’s let’s compare petal area of pink plants at GC that were and were not visited by pollinators.

Formatting the GC data
library(readr)
ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
ril_data <- readr::read_csv(ril_link) 

pink_gc_rils <- ril_data |>
  dplyr::filter(location == "GC",  petal_color == "pink", !is.na(petal_area_mm), !is.na(mean_visits))|>
  mutate(visited = ifelse(mean_visits > 0,"some_visits","no_visits")) |>
  select( petal_area_mm,  visited)

Scatter plot of petal area (mm²) by visitation category (no_visits vs. some_visits) for pink-flowered plants. Black dots show individual plant values. Red dots show group means. The mean petal area for plants receiving visits is slightly larger than for plants receiving no visits. A dashed line connects the two group means, highlighting the small difference.

Mean petal area differs slightly between visited and unvisited pink-flowered RILs at the GC site. Each point represents an individual pink-flowered recombinant inbred line (RIL). Black points show individual observations. Red points indicate group means for plants receiving no visits and those receiving at least one visit. The dashed line connects the group mean.
visited mean_petal_area
no_visits 61.5
some_visits 63.1
Cohens_d CI CI_low CI_high
0.11 0.95 -0.44 0.66
library(effectsize)
library(ggplot2)
library(dplyr)

# Make a plot
pink_gc_rils |> 
    ggplot(aes(x = visited, y = petal_area_mm))+ 
    geom_jitter(width = .2)+
    stat_summary(aes(group = 1), geom = "line", lty = 2)+
    stat_summary(geom = "point", color = "red",size = 2)

# Conditional means
pink_gc_rils |> 
    group_by(visited) |>
    summarise(mean_petal_area = mean(petal_area_mm))

# Cohen's D
cohens_d(petal_area_mm ~ visited, data = pink_gc_rils,reference = "no_visits")