• 18. Regression: R workflow

Motivating scenario: We have now learned how to fit a regression, interpret a slope, quantify uncertainty, test a null hypothesis, check model assumptions, and compare parametric results to simulation-based approaches. Now you just want to do it!

Learning goals: By the end of this section, you should be able to:

  1. Visualize a linear relationship with a scatterplot and regression line.
  2. Fit a linear model with lm().
  3. Extract estimates, standard errors, confidence intervals, and p-values from a fitted model.
  4. Make basic diagnostic plots for a linear model.
  5. Use permutation and bootstrap approaches to evaluate a regression slope.

Loading and formatting data
library(broom)

ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"

gc_rils <- readr::read_csv(ril_link) |>
  dplyr::mutate(visited = mean_visits > 0) |>
  filter(location == "GC", !is.na(prop_hybrid), !is.na(petal_area_mm)) |>
  dplyr::select(ril, location, prop_hybrid, petal_area_mm) |>
  mutate(
    petal_area_quartile = cut(
      petal_area_mm,
      breaks = quantile(petal_area_mm, probs = seq(0, 1, 0.25), na.rm = TRUE),
      include.lowest = TRUE
    )
  )

This chapter was a bit long. As a quick reference for doing a linear regression in R, I present a simple workflow. This does not mean you should ignore the rest of this chapter - it is key for understanding what goes into these functions and for helping you responsibly interpret this output.

Step 1: Plot the data

Remember ABP - Always Be Plotting. The first step in any analysis is to look at your data. The code below generates a simple plot with a regression line, like Figure 1.

ggplot(gc_rils, aes(x = petal_area_mm, y = prop_hybrid)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
A scatterplot shows proportion hybrid seed on the y-axis and petal area in square millimeters on the x-axis. Each point represents one Clarkia RIL from the GC site. A blue regression line slopes upward, showing that RILs with larger petals tend to have a higher proportion of hybrid seed.
Figure 1: Relationship between petal area and proportion hybrid seed in Clarkia RILs from the GC site. Points show individual RILs, and the blue line shows the fitted linear regression.

Step 2: Build a model

Now we fit the linear model of the expected proportion of hybrid seed as a linear function of petal area, with R’s lm() function.

hybrid_lm <- lm(prop_hybrid ~ petal_area_mm, data = gc_rils)

Step 3: Extract estimates and uncertainty

Base R’s summary() function gives a lot of useful information about the model:

summary(hybrid_lm)

Call:
lm(formula = prop_hybrid ~ petal_area_mm, data = gc_rils)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32681 -0.14608 -0.05947  0.08676  0.83802 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.207834   0.099466  -2.089 0.039176 *  
petal_area_mm  0.005738   0.001555   3.691 0.000362 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2246 on 101 degrees of freedom
Multiple R-squared:  0.1188,    Adjusted R-squared:  0.1101 
F-statistic: 13.62 on 1 and 101 DF,  p-value: 0.0003625

We can also find confidence intervals with R’s confint() function: confint(hybrid_lm). Broom’s tidy function is even slicker!

tidy(hybrid_lm, conf.int = TRUE)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.2078 0.0995 -2.0895 0.0392 -0.4051 -0.0105
petal_area_mm 0.0057 0.0016 3.6909 0.0004 0.0027 0.0088

Step 4: Generate diagnostic plots

Before trusting the model too much, we should check whether the model looks reasonable. The plot() function in Base R generates a useful set of diagnostic plots (Figure 2) for lm() objects. See the section on regression assumptions for help on how to interpret these.

plot(hybrid_lm)

When you run this on your computer, R will show you one plot at a time. So for reproducible code, try the trick in the box below.

par(mfrow = c(2, 2))
plot(hybrid_lm)      # make base r plots 2x2
par(mfrow = c(1, 1)) # reset layout
A four-panel diagnostic plot for a linear regression model. The panels show residuals versus fitted values, a normal Q-Q plot of standardized residuals, a scale-location plot, and residuals versus leverage. These plots are used to check whether the linear model has obvious problems such as nonlinear patterns, non-normal residuals, unequal residual variance, or highly influential observations.
Figure 2: Diagnostic plots for the linear model of proportion hybrid seed as a function of petal area. The residuals vs. fitted plot helps identify nonlinearity or unequal variance; the Q-Q plot helps evaluate whether residuals are approximately normal; the scale-location plot gives another view of changing residual spread; and the residuals vs. leverage plot helps identify observations that may have unusually strong influence on the fitted model.

When data violate assumptions

If data violate assumptions you can:

  1. Build a more appropriate model (we will discuss this later).
  2. Use a nonparametric test (e.g. Spearman’s rho)
  3. Transform (Use mutate() once you get the right transformation).
  4. Ignore minor violations.
  5. I usually pair 4 with bootstrap and permutation to help me feel ok about ignoring these violations (or not). Here is the code:

Bootstrap based Uncertainty

boot_dist <- gc_rils |>
  specify(prop_hybrid ~ petal_area_mm) |> 
  generate(reps = 10000, type = "bootstrap") |>
  calculate(stat = "slope")

boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1  0.00218  0.00889

Permutation based NHST

library(infer)
perm_dist <- gc_rils |>
  specify(prop_hybrid ~ petal_area_mm) |>
  hypothesize(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  calculate(stat = "slope")

obs_slope <- coef(hybrid_lm)["petal_area_mm"]

perm_dist |>
  mutate(as_or_more_extreme = abs(stat) >= abs(obs_slope)  )|>
  summarise(p_value = mean(as_or_more_extreme))
# A tibble: 1 × 1
  p_value
    <dbl>
1  0.0003