Section III: Linear Models
Review: Conditional Means
In our study of associations between a categorical explanatory variable and a continuous response variable, I introduced the idea of a conditional mean. We can think of a conditional mean as the “predicted” or “expected” value of our response variable, \(y\), given the value of the relevant explanatory variables. So, for example, we may be interested in the expected proportion of hybrid seeds for a pink flower that has an area of 70 square millimeters. In this case, the expected value is conditional on petal color (pink) and petal area (\(70\text{ }mm^2\)).
Conditional mean: The expected value of a response variable given specific values of the explanatory variables (i.e., the model’s best guess for the response based on the explanatory variables).
In somewhat more formal notation, the conditional mean for the \(i^\text{th}\) observation (e.g., the proportion of hybrid seeds of the \(4^\text{th}\) plant in a spreadsheet) is written as \(\hat{Y}_i\). This represents the predicted response for the explanatory variable values associated with that observation:
\[\begin{equation} \hat{Y}_i = f(\text{explanatory variables}_i) \end{equation}\]
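To make this concrete, here is a minimal sketch in R of estimating such a conditional mean. The data frame `flowers` and its columns `prop_hybrid`, `petal_color`, and `petal_area` are hypothetical names for illustration, not a real dataset:

```r
# Fit a linear model predicting the proportion of hybrid seeds
# from petal color and petal area (hypothetical column names)
flower_model <- lm(prop_hybrid ~ petal_color + petal_area, data = flowers)

# The conditional mean for a pink flower with 70 mm^2 petals
predict(flower_model,
        newdata = data.frame(petal_color = "pink", petal_area = 70))
```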
Adding Uncertainty and NHST
We have just completed our section on the foundations of statistics. In that section, we introduced the idea that we should make sure to quantify uncertainty when presenting estimates.
We also introduced the idea that the “null hypothesis significance testing” (NHST) tradition in statistics works by assuming that data came from the “null model”, and that we “reject” this hypothesis when the null model rarely generates values as extreme as what we see empirically.
Here, rather than using bootstrapping and permutation to quantify uncertainty and test null hypotheses, we run through some of the mathematical tools used in linear modelling. These models are the bread and butter of most biostats papers.
However, whether we use mathematical or computational approaches to estimate uncertainty and test null hypotheses, the concepts are the same.
Assumptions of linear models
A major difference between linear models and computational approaches to stats is that while all statistical models make assumptions, linear models make a specific set of assumptions that are needed to make the math work.
Luckily for us, we will see that:
- Many of these assumptions are actually appropriate most of the time.
- Linear models are often robust to modest violations of assumptions.
- We can build more specific models that better fit our data.
We will say more about these points as we go, but for now let’s introduce the major assumptions of linear models:
Linear models assume linearity
In the coming sections, we will see that linear models are “linear” because an individual’s predicted value, \(\hat{Y}_i\), is built by adding the contributions of each component of the model.
This linearity assumption does not mean that we cannot include squared terms or interactions. In fact, the assumption of linearity sometimes requires that we add non-linear terms. The key is that we add up each term to get the overall prediction.
Thus, a fundamental assumption of linear models is that predictions are formed by adding the contributions of each explanatory variable.
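As a minimal illustration (using \(b_0\), \(b_1\), and \(b_2\) to denote coefficients estimated from the data), a model with a linear and a squared term can be written as:

\[\begin{equation} \hat{Y}_i = b_0 + b_1 X_i + b_2 X_i^2 \end{equation}\]

Even though \(X_i^2\) is a non-linear transformation of \(X_i\), this is still a linear model, because the prediction is formed by adding up the contribution of each term.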
Linear models assume independence
Linear models assume that observations are independent – or, more precisely, that they are independent conditional on the explanatory variables. A simple way to say this is that we assume the residuals are independent.
In the next chapter (Chapter 12), we will see that a residual is the difference between an observation and the model prediction. So the residual value for individual \(i\), \(e_i\), is the difference between the value of their response variable, \(Y_i\), and the value the model predicts given individual \(i\)’s values of explanatory variables, \(X_i\).
\[e_i = Y_i - \hat{Y}_i\]
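Previewing Chapter 12, here is a minimal sketch of extracting residuals and predicted values in R with the `augment()` function from the `broom` package, assuming a fitted model like the hypothetical `flower_model` above:

```r
library(broom)

# augment() returns the original data plus .fitted (Y-hat) and .resid (e)
augmented <- augment(flower_model)

# The residual is the observed value minus the predicted value
all.equal(augmented$.resid, augmented$prop_hybrid - augmented$.fitted)
```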
Linear models assume normality
In addition to assuming that residuals are independent, linear models assume that the residuals follow a normal distribution. A normal distribution is a symmetric, bell-shaped curve that occurs frequently in nature and has many convenient mathematical properties. We will introduce the normal distribution in Chapter 13.
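A common way to evaluate this assumption is to plot the residuals, for example with a histogram or a normal quantile-quantile plot. A minimal base-R sketch, again assuming the hypothetical `flower_model` from above:

```r
# Histogram of residuals: should look roughly bell-shaped
hist(resid(flower_model))

# Normal quantile-quantile plot: points should fall near the line
qqnorm(resid(flower_model))
qqline(resid(flower_model))
```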
Linear models assume constant variance
Linear models assume that the variance of the residuals is independent of the predicted value of the response variable, \(\hat{Y}_i\).
Fancy words for these ideas are:
Homoscedasticity: Variability of residuals is constant – i.e., the standard deviation of the residuals, \(\sigma_e\), does not vary with the predicted value, \(\hat{Y}\).
Heteroscedasticity: Variance of residuals is not constant; it depends on predictors.
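A standard diagnostic is to plot residuals against predicted values: under homoscedasticity, the spread should be roughly even across the range of \(\hat{Y}\). A minimal base-R sketch, once more assuming the hypothetical `flower_model`:

```r
# Residuals vs. fitted values: look for even scatter around zero,
# not a funnel shape (which would suggest heteroscedasticity)
plot(fitted(flower_model), resid(flower_model))
abline(h = 0, lty = 2)
```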
Linear models assume independence of explanatory variables
For models with multiple explanatory variables, it is assumed that the predictors are not strongly correlated with one another.
Multicollinearity is the fancy word for high correlations between predictor variables.
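A quick first check is to look at pairwise correlations among numeric predictors. A minimal sketch, using the hypothetical `flowers` data frame from above (the column `petal_length` is an invented second predictor for illustration):

```r
# Pairwise correlations among numeric predictors; values near 1 or -1
# suggest multicollinearity
cor(flowers[, c("petal_area", "petal_length")])
```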
What’s ahead
This section gets into linear models, the workhorse of data analysis.
Chapter 12 introduces the idea of linear models and previews several common types of linear models, without worrying about uncertainty or hypothesis testing. We will also use this chapter to familiarize ourselves with R’s `lm()` function for building linear models, as well as the `augment()` function from the `broom` package for extracting residuals and predicted values from a model.

Chapter 13 introduces key features of the normal distribution, to prepare us for modelling normally distributed residuals.
Chapter 14 addresses a challenge of linear models. Linear models assume normally distributed residuals, but because we estimate model parameters from data, we don’t know the population standard deviation. So, we introduce the t-distribution – a distribution which is like the normal but incorporates the uncertainty in our estimate of the standard deviation. This allows us to both test null hypotheses and estimate uncertainty using standard linear modelling tools.
In Chapter 15, we build on this by modelling means as a function of a binary explanatory variable. In this chapter, we estimate uncertainty in the difference between group means, and test the null hypothesis that group means do not differ.
Chapter 16 presents the final key component of linear modelling – partitioning variation into the portion explained by the model and the residual variation. We show that the ratio of these variance components provides another critical tool for evaluating null hypotheses posed by linear models.
With these tools in place, we can build models with a nominal categorical predictor (Chapter 17), a numeric predictor (Chapter 18), and all combinations thereof!
After completing this section, we will be equipped with the standard tools of biostatistics.