• 15. Two-sample t-test

Motivating scenario: We want to test the null that two groups have the same mean.

Learning goals: By the end of this section, you should be able to:

  • Explain the logic of the two-sample t-test and how it relates to comparing group means.

  • Calculate the test statistic and p-value by hand and interpret it relative to a critical value.

  • Use R’s t.test() function to test for differences between two groups.

  • Write up results of a two-sample t-test in terms of both statistical significance and effect size, and connect them back to biological meaning.

Loading and processing data
ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
SR_visits <- readr::read_csv(ril_link) |>
    dplyr::rename(visits = mean_visits)  |>
    dplyr::filter(location == "SR",
                  !is.na(petal_color),
                  !is.na(visits))        |>
    dplyr::select(petal_color, visits)

library(dplyr)
color_visit_summaries <- SR_visits |>
  group_by(petal_color) |>
  summarise(MEAN = mean(visits),
            VAR  = var(visits),
            N    = n())

Testing the null of equal means

In the previous section, we calculated a difference between pink- and white-petaled RILs at site SR of \(1.76 - 0.733 = 1.027\). We also quantified uncertainty in this estimate as a standard error of 0.262. Recall the equation for the t-statistic:

\[t = \frac{\bar{x}-\mu_0}{s_\bar{x}} = \frac{(\bar{x_1}-\bar{x_2})-0}{s_\overline{x_1-x_2}}\] In this case

\[t = \frac{1.76 - 0.733}{0.262}= 3.91\]
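Here the standard error in the denominator is the one we built in the previous section from the pooled variance:

\[s_{\overline{x_1-x_2}} = \sqrt{s_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}, \text{ where } s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}\]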

Since we had 105 degrees of freedom, our critical t-value will be just a hair greater than 1.96. Because our observed t-value of \(\approx 3.9\) is much bigger than our critical t-value of \(\approx 2\), we strongly reject the null hypothesis that pink- and white-flowered Clarkia xantiana subspecies parviflora RILs at site SR are visited equally by pollinators.

More precisely, crit-t = qt(0.025, df = 105, lower.tail = FALSE) = \(1.983\).

We can use the pt() function to quantify exactly how rarely the null would generate a result as or more extreme than ours:

2 * pt(q = 3.91, df = 105, lower.tail = FALSE) = 0.000164 = 1.64e-4
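The whole by-hand calculation can be put together in R. This is a sketch using the rounded values from the text, so the t-statistic comes out at about 3.92 rather than the 3.9082 that t.test() reports from the full data:

```r
# By-hand two-sample t-test using the rounded summary values from the text
mean_diff <- 1.76 - 0.733            # difference in group means
se_diff   <- 0.262                   # standard error of the difference
t_obs     <- mean_diff / se_diff     # observed t-statistic (about 3.92)

crit_t <- qt(0.025, df = 105, lower.tail = FALSE)      # critical value, about 1.983
p_val  <- 2 * pt(t_obs, df = 105, lower.tail = FALSE)  # two-sided p-value

c(t = t_obs, crit = crit_t, p = p_val)
```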

Review: Why do we multiply our p-value by two?

A two-sample t-test in R

We can use the “formula” syntax in the t.test() function to have R test this null for us. Note that we set var.equal = TRUE. We see that this provides p-values and 95% confidence intervals identical to what we calculated ourselves:

t.test(visits ~ petal_color, data = SR_visits, var.equal = TRUE)

    Two Sample t-test

data:  visits by petal_color
t = 3.9082, df = 105, p-value = 0.0001649
alternative hypothesis: true difference in means between group pink and group white is not equal to 0
95 percent confidence interval:
 0.505242 1.545845
sample estimates:
 mean in group pink mean in group white 
           1.758946            0.733402 

Again, we can make this output tidy / easier to process in R with the tidy() function in the broom package:

For some reason that I don’t understand, tidy() labels the column with the “degrees of freedom” as “parameter”. smdh.

library(broom)
t.test(visits ~ petal_color, data = SR_visits, var.equal = TRUE)|>
  tidy()
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high            method alternative
   1.026     1.759     0.733     3.908 1.65e-4       105    0.505     1.546 Two Sample t-test   two.sided
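Once tidied, the results are plain columns you can use programmatically. A minimal sketch on simulated stand-in data (substitute SR_visits for the real analysis):

```r
# Using tidy() output programmatically.
# Simulated stand-in data; substitute SR_visits for the real analysis.
library(broom)

set.seed(1)
dat <- data.frame(
  petal_color = rep(c("pink", "white"), each = 50),
  visits      = c(rpois(50, 1.8), rpois(50, 0.7))
)

tidy_t <- t.test(visits ~ petal_color, data = dat, var.equal = TRUE) |>
  tidy()

tidy_t$p.value     # the p-value as a plain number
tidy_t$parameter   # the degrees of freedom (n1 + n2 - 2 = 98 here)
```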

If our data met the test's assumptions, we would be done. But they do not (pollinator visits are right-skewed). So let's validate our conclusions with tests whose assumptions better match our data.

A two-sample t-test is often robust

Now that we know we can reject the null when the data are transformed to meet the assumptions of the two-sample t-test, and we have a bootstrap 95% confidence interval, we can complement these analyses with an analysis of the untransformed data. This step is not always necessary or reliable, but here it helps us tell a coherent story.

First, let’s calculate our summary statistics:

color_visit_summaries <- SR_visits |>
  group_by(petal_color) |>
  summarise(MEAN = mean(visits),
            VAR  = var(visits),
            N    = n()) |>
  summarise(mean_diff  = diff(MEAN) |> abs(),
            pooled_var = sum((N - 1) * VAR) / (sum(N) - 2),
            pooled_sd  = sqrt(pooled_var),
            cohens_D   = mean_diff / pooled_sd)

summary     estimate
mean_diff      1.026
pooled_var     1.834
pooled_sd      1.354
cohens_D       0.758

Now let’s use R to test the null hypothesis and estimate confidence intervals. For fun let’s compare results from an analysis assuming equal variance to one that does not:

library(broom)

bind_rows(
  t.test(visits ~ petal_color, data = SR_visits, var.equal = TRUE)|> tidy(),
  t.test(visits ~ petal_color, data = SR_visits, var.equal = FALSE)|> tidy()
)
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high                  method alternative
   1.026     1.759     0.733     3.908 1.65e-4     105.0    0.505     1.546       Two Sample t-test   two.sided
   1.026     1.759     0.733     4.047 1.61e-4      89.6    0.522     1.529 Welch Two Sample t-test   two.sided

Welch's two-sample t-test does not assume equal variances, and in practice it is essentially never worse than the standard two-sample t-test (that's why R uses it as the default). Its test statistic and degrees of freedom are:

\[t = \frac{\overline{X}_1 - \overline{X}_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} \text{, and degrees of freedom: } \text{df} \approx \frac{\left(\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}\right)^2}{\frac{s_1^4}{N_1^2 \times df_1} + \frac{s_2^4}{N_2^2 \times df_2}}\]
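A small helper makes the Welch formula concrete. The group means, standard deviations, and sample sizes below are made-up illustrative values, not the Clarkia estimates:

```r
# Welch t-statistic and Satterthwaite degrees of freedom from group summaries
welch_t_df <- function(m1, m2, s1, s2, n1, n2) {
  se2_1 <- s1^2 / n1                      # per-group squared standard errors
  se2_2 <- s2^2 / n2
  t  <- (m1 - m2) / sqrt(se2_1 + se2_2)   # Welch t-statistic
  df <- (se2_1 + se2_2)^2 /               # Satterthwaite approximation
    (se2_1^2 / (n1 - 1) + se2_2^2 / (n2 - 1))
  c(t = t, df = df)
}

# Hypothetical summaries with unequal variances; note that the Welch df
# falls below the pooled df of n1 + n2 - 2
res <- welch_t_df(m1 = 1.76, m2 = 0.73, s1 = 1.6, s2 = 0.9, n1 = 53, n2 = 54)
res
```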

However, the standard two-sample t-test is often good enough. It also has the benefit of connecting directly to linear models, and it is mathematically simpler, so we usually teach the standard version. In practice the difference between the two tests rarely matters, except when the group variances are massively different.

Permutation test

Of course, rather than using a parametric test, we could find a p-value by a permutation test. As I introduced previously, permutation tests build a null sampling distribution by shuffling the relation between explanatory and response variables, and comparing our actual difference in means to this permuted distribution. One way to do this in R is with the pipeline in the infer package:

library(infer)
raw_diff <- SR_visits |>
  specify(visits ~ petal_color) |>              # response ~ explanatory
  calculate(stat = "diff in means",             # difference in group means
            order = c("pink", "white"))

# PERMUTE
perm_SR_visits <- SR_visits |>
  specify(visits ~ petal_color) |>              # specify response ~ explanatory
  hypothesize(null = "independence") |>
  generate(reps = 1000000, type = "permute") |>
  calculate(stat = "diff in means", order = c("pink", "white"))

perm_SR_visits  |>
  get_p_value(obs_stat = pull(raw_diff) , direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1 0.00006

Here, our p-value of 0.00006 is a bit smaller than that estimated from our t-test. But this would not meaningfully change any of our conclusions.
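The same logic can also be sketched in base R without the infer package. This version uses simulated stand-in data; for the real analysis, substitute SR_visits and use many more permutations:

```r
# Base-R permutation test sketch (same logic as the infer pipeline).
# Simulated stand-in data; substitute SR_visits for the real analysis.
set.seed(1)
dat <- data.frame(
  petal_color = rep(c("pink", "white"), each = 50),
  visits      = c(rpois(50, 1.8), rpois(50, 0.7))
)

obs_diff <- with(dat, mean(visits[petal_color == "pink"]) -
                      mean(visits[petal_color == "white"]))

perm_diffs <- replicate(10000, {
  shuffled <- sample(dat$petal_color)            # shuffle group labels
  mean(dat$visits[shuffled == "pink"]) -
    mean(dat$visits[shuffled == "white"])
})

# Two-sided p-value: proportion of permuted diffs at least as extreme
p_perm <- mean(abs(perm_diffs) >= abs(obs_diff))
p_perm
```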

Writing up results

Now we can write up our results. Note this takes some thinking because we had to make some decisions. Here’s my attempt.

Note that I chose to report values from Welch’s t-test (which does not assume equal variance) on untransformed data. To me, this strikes a balance between interpretability and statistical correctness. Although the result is essentially the same as above, the conservatism of Welch’s t-test may help build readers’ trust.

At the Sawmill Road site, pink-flowered Clarkia xantiana ssp. parviflora RILs received, on average, more pollinator visits during a 15-minute observation than white-flowered RILs (mean pink = 1.76, mean white = 0.73). The mean difference of 1.03 visits was statistically significant (Welch's t = 4.05, df = 89.6, p = 0.00016) and associated with a moderate-to-large effect size (Cohen's D = 0.76). Results were robust to the right-skew in the data: bootstrap confidence intervals closely matched analytic ones, and a permutation-based p-value was similarly much smaller than 0.0001. We reject the null and conclude that pink-flowered plants attract more pollinators than white-flowered plants at this site.