Motivating example: We have learned a few linear models (a two-sample t-test and an ANOVA), and now we want to see how linear regression fits in this linear model framework. Here, we connect linear regression to the models we have already seen by showing how an intercept and slope generate predicted values, how residuals measure prediction error, and how to build this model in R with the lm() function.
Learning goals: By the end of this section, you should be able to:
Explain how a regression model predicts \(\hat{Y}\) from an intercept and slope.
Calculate and interpret fitted values and residuals.
Explain how the slope and intercept are found from summaries of \(X\) and \(Y\).
Linear models predict (or “model”) each response value by starting with an intercept and then adding each coefficient multiplied by the corresponding explanatory-variable value.
In a two-sample t-test, each individual starts with the reference group mean, represented by the intercept (\(a\)). We then add the difference between group means (\(b_1\)) for individuals in the non-reference group. For individuals in the reference group, the indicator variable equals 0, so we add nothing (Equation 1).
In a one-way ANOVA (with three groups), each individual starts with the reference group mean, represented by the intercept (\(a\)). We then add the appropriate difference between group means for individuals in each non-reference group. For individuals in the reference group, all indicator variables equal 0, so we add nothing (Equation 2).
In a linear regression, each individual starts with the intercept (\(a\)), corresponding to the value of \(Y\) predicted by the model when \(X\) equals zero. We then add the product of the individual’s value of the explanatory variable (\(X_i\)) and the slope (\(b\)) to find an individual’s predicted value for the response variable (Equation 3).
\[
\hat{Y}_i = a + X_i \times b
\tag{3}\]
In all linear models, the residual (\(e_i\)) is the difference between an individual’s observed and predicted value.
\[
e_i = Y_i - \hat{Y}_i
\tag{4}\]
Understanding linear regression is easier if you have the basic notation down. Here is a reference table of this notation to help you get familiar with this:
A table of standard notation in a linear model
Notation
Name
\(Y\)
Response variable
\(X\)
Explanatory variable
\(Y_i\)
Value of \(Y\) for individual \(i\)
\(X_i\)
Value of \(X\) for individual \(i\)
\(\hat{Y}_i\)
Predicted value of \(Y\) for individual \(i\)
\(e_i\)
Residual for individual \(i\)
\(\bar{Y}\)
Mean of \(Y\)
\(\bar{X}\)
Mean of \(X\)
\(a\) (sometimes \(b_0\))
Intercept
\(b\) (sometimes \(b_1\))
Slope
Finding the slope and intercept
Predicting \(\hat{Y}\) requires finding the equation for the linear regression, which consists of both a slope and intercept (Figure 1):
The slope equals the covariance divided by the variance in \(X\), \(\sigma_x^2\) (Equation 5), as introduced previously.
The intercept equals the mean of \(Y\), minus the product of the slope (\(b\)) and the mean of \(X\) (Equation 6).
\[
a = \overline{Y} - b \times \overline{X}
\tag{6}\]
Figure 1: A) The intercept shifts a line up or down, while the slope describes the direction and steepness of the relationship between two variables. A positive slope means predicted values increase as the predictor increases, a negative slope means predicted values decrease, and a slope of zero means the predicted value does not change with the predictor. B) In a regression model, the slope tells us the expected change in the response variable for a one-unit increase in a predictor, holding other variables constant. In this example, a dragon that is one foot taller is expected to weigh 0.3 tons more, on average, assuming all other variables stay the same. From Allison Horst.
Linear regression in R
Like all linear models, we fit a linear regression in R with the linear modeling function: lm(y ~ x), which generates an lm object. We can see a quick summary of this model by just running the code as follows:
So, we expect a 0.57 percentage point increase in proportion hybrid seeds for each \(1\text{ mm}^2\) increase in petal area.
Finding \(\hat{Y}_i\) from the regression equation
Now that we have our linear regression equation, we can find \(\hat{Y}_i\) by simply plugging in numbers. Say we wanted to predict the proportion of hybrid seeds for a plant with a petal area of \(103\text{ mm}^2\):
\[\text{PROP HYBRID | Petal area of 103 mm}^2 = -0.208 + 0.00574 \times 103\]
\[\text{PROP HYBRID | Petal area of 103 mm}^2 = -0.208 + 0.591\]
\[\text{PROP HYBRID | Petal area of 103 mm}^2 = 0.383\]
Residuals
The actual RIL with a petal area of \(103 \text{ mm}^2\) had a hybrid seed proportion of 0.625. So the residual for this sample is \(0.625 - 0.383 = 0.242\).
The augment() function in the broom package shows \(X\) and \(Y\), as well as the predicted value, \(\hat{Y}\) (with column heading, .fitted), and the residual value (.resid).
library(broom)lm(prop_hybrid ~ petal_area_mm, data = gc_rils) |>augment()
Remember: the intercept is often the value needed to place the regression line correctly, not a biological claim.
To see this, consider this example, where the intercept of negative \(0.2\) is the proportion of hybrid seeds predicted by the model for a RIL with a petal area of zero. Of course neither a flower with zero petal area nor a negative proportion of hybrid seeds are biologically meaningful, so this is a nonsensical biological prediction. But it does allow for reasonable predictions for our actual data.