13. The Normal is Common
Motivating scenario: You’re getting ready to enter the world of linear models, but you’ve heard they often assume normally distributed residuals. What are the odds your data will actually meet that assumption? Here, you’ll learn that the odds are good, and that it’s often okay even if our raw data aren’t quite normally distributed. You’ll see that the normal distribution arises whenever we add up small, random deviations. This is the key to understanding why sampling distributions (which are built from sample means) so often end up being normally distributed, even when the raw data are not.
Learning goals: By the end of this chapter you should be able to:
- Explain the Central Limit Theorem (CLT) and why it is so important in statistics.
- Distinguish between the distribution of data in your sample (or population) and the shape of the sampling distribution.
- Explain how the shape of the population distribution affects the sample size (n) needed for the CLT to apply.
Why normal distributions are common
One amazing thing about the world is just how frequently normal distributions occur. The reason is that whenever a value results from adding up many small, independent factors, that value will be approximately normally distributed, regardless of the distributions of those individual factors. For example, your height is influenced by many genes across your genome, as well as numerous environmental factors, each contributing a small amount to the final outcome.
A Galton board! At every peg, a bead has a 50/50 chance of bouncing left or right. The final position of the bead in a bin at the bottom is the sum of all these random left and right steps. Most often the left and right bounces roughly even out and the bead lands near the center, but not always! Across many beads, this generates a normal distribution.
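We can mimic a Galton board in a few lines of R. This is a minimal sketch (the number of beads and pegs are arbitrary choices): each bead's final bin is just the sum of many independent ±1 steps, and those sums pile up into an approximately normal shape.

```r
# Simulate a Galton board: each bead's bin is the sum of many 50/50 steps.
set.seed(1)
n_beads <- 10000
n_pegs  <- 20

# Each step bounces the bead -1 (left) or +1 (right) with equal probability;
# the bead's final bin is the sum of all its steps.
final_bins <- replicate(n_beads, sum(sample(c(-1, 1), size = n_pegs, replace = TRUE)))

# The pile of beads is centered near zero and roughly bell-shaped:
mean(final_bins)   # close to 0
sd(final_bins)     # close to sqrt(n_pegs)
hist(final_bins, breaks = 30, main = "Simulated Galton board")
```

Even though each individual step is as non-normal as a distribution can get (just two possible values!), the sums are already close to normal after twenty pegs.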
An important consequence of this is that the sampling distribution of means tends to be normally distributed, provided the sample size is not too small. This principle, known as the Central Limit Theorem, is very useful in statistics. It allows us to create reasonable statistical models of sample means by assuming normality, even when the underlying data may not be perfectly normal.
The Central Limit Theorem is crucial for statistics because many common statistical analyses assume normality. Thanks to the CLT, tests that make this assumption (like the t-test) can still give valid inferences about the mean even when the raw data aren't quite normally distributed.
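We can check this claim directly with a quick simulation. The sketch below (population, sample size, and number of replicates are all arbitrary choices) asks how often a 95% confidence interval from `t.test()` covers the true mean when the raw data come from a strongly skewed (exponential) population.

```r
# Sketch: t-based inference on skewed data. Even though the raw values are
# far from normal, the CLT makes the t confidence interval for the mean
# behave reasonably at moderate n.
set.seed(3)
true_mean <- 1                        # the mean of rexp(rate = 1)

covers <- replicate(2000, {
  x  <- rexp(50, rate = 1)            # a skewed sample of n = 50
  ci <- t.test(x)$conf.int            # 95% CI for the mean
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covers)                           # close to the nominal 0.95
```

The observed coverage is near (if slightly below) the nominal 95%, despite the pronounced skew of the raw data.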
How Large Must a Sample Be for the Central Limit Theorem to Work?
The Central Limit Theorem assures us that with a sufficiently large sample size, the sampling distribution of means will be normal, regardless of the distribution of the underlying data points. But how large is sufficiently large? The answer depends on how far from normal the initial data are. The less normal the original data, the larger the sample size needed before the sampling distribution becomes normal.
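One way to see this dependence on sample size is to compare the skewness of the sampling distribution of the mean at a small versus a large n. The sketch below uses a simulated exponential "population" and a simple skewness helper (both are illustrative choices, not part of the chapter's dataset).

```r
# Sketch: how sample size tames skew. We draw 1000 sample means from a
# strongly right-skewed population at two sample sizes and compare the
# skewness of the resulting sampling distributions.
set.seed(2)
pop <- rexp(1e5, rate = 1)            # a skewed "population"

sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(pop, size = n, replace = TRUE)))
}

# A simple moment-based skewness estimate (0 = symmetric)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_small <- skewness(sample_means(n = 2))     # still clearly skewed
skew_large <- skewness(sample_means(n = 100))   # much closer to symmetric
```

With n = 2 the sampling distribution inherits much of the population's skew; by n = 100 it is nearly symmetric, just as the CLT promises.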
The webapp below is a simulation of the sampling distribution to help you build intuition for the Central Limit Theorem. It lets you draw samples from four variables from our parviflora RIL plants at GC.
- The top row of plots always shows the shape of the original data.
- The bottom row shows the sampling distribution of the mean, which is built from the averages of 1000 different samples.
Use the Sample Size (n) control for each variable, and evaluate how large n needs to be before the QQ plot points form a straight line, signaling that the sampling distribution has become approximately normal.
#| '!! shinylive warning !!': |
#| shinylive does not work in self-contained HTML documents.
#| Please set `embed-resources: false` in your metadata.
#| column: page-right
#| standalone: true
#| viewerHeight: 1000
library(shiny)
library(ggplot2)
library(dplyr)
library(readr)
library(tidyr)
library(cowplot)
# --- UI Definition ---
ui <- fluidPage(
titlePanel("The Central Limit Theorem in Action"),
sidebarLayout(
sidebarPanel(
selectInput("var", "Population Distribution:",
choices = c("Petal Area" = "petal_area_mm",
"Prop. Hybrid" = "prop_hybrid",
"Mean Visits" = "mean_visits",
"Pink Flowers" = "pink_flowers")),
selectInput("n", "Sample Size (n):",
choices = c("2", "5", "10", "25", "50", "100"),
selected = "25"),
hr(),
helpText("We take 1000 random samples and calculate the mean for each.")
),
mainPanel(
plotOutput("distPlot", height = "600px")
)
)
)
# --- Server Logic ---
server <- function(input, output) {
# Load data once
dataset <- reactive({
ril_link <- "https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/clarkia_rils.csv"
df <- readr::read_csv(ril_link) %>%
mutate(growth_rate = ifelse(growth_rate == "1.8O", "1.80", growth_rate),
growth_rate = as.numeric(growth_rate),
visited = mean_visits > 0,
pink_flowers = as.numeric(petal_color == "pink")) %>%
filter(location == "GC") %>%
select(petal_area_mm, pink_flowers, mean_visits, prop_hybrid) %>%
drop_na()
df
})
population_dist <- reactive({
req(input$var, dataset())
tibble(x = dataset()[[input$var]])
})
sampling_dist <- reactive({
req(population_dist(), input$n)
pop_vec <- population_dist()$x
n_val <- as.numeric(input$n)
means <- replicate(1000, {
mean(sample(pop_vec, size = n_val, replace = TRUE))
})
tibble(mean_x = means)
})
output$distPlot <- renderPlot({
pop_data <- population_dist()
samp_dist <- sampling_dist()
# Population Plots
pop_hist <- ggplot(pop_data, aes(x = x)) +
geom_histogram(bins = 30, color = "white", fill = "pink") +
labs(x = "Observed Values", title = "Actual Data (Population)") +
theme_minimal(base_size = 14)
pop_qq <- ggplot(pop_data, aes(sample = x)) +
geom_qq(color = "pink") +
geom_qq_line(color = "pink") +
labs(title = "Actual Data QQ") +
theme_minimal(base_size = 14)
# Sampling Plots
samp_hist <- ggplot(samp_dist, aes(x = mean_x)) +
geom_histogram(bins = 30, color = "white", fill = "#3b82f6") +
labs(x = "Sample Means", title = "Sampling Distribution") +
theme_minimal(base_size = 14)
samp_qq <- ggplot(samp_dist, aes(sample = mean_x)) +
geom_qq(color = "#3b82f6") +
geom_qq_line(color = "#3b82f6") +
labs(title = "Sampling Dist QQ") +
theme_minimal(base_size = 14)
plot_grid(pop_hist, pop_qq, samp_hist, samp_qq, ncol = 2)
})
}
shinyApp(ui = ui, server = server)
For each of the populations below, use the app to find the smallest sample size (n) where the sampling distribution of the mean becomes approximately normal (i.e., the QQ plot is a straight line).
Q1. What is the minimum sample size for the petal area data?
At n=25, the QQ plot is reasonably straight, showing the CLT has taken effect, while smaller sample sizes still show noticeable curvature. Reasonable answers range from about 10 to 50, so n=25 is a sensible middle choice.
Q2. What is the minimum sample size for the proportion pink data?
The sampling distribution only starts to look continuous and normal-like when the sample size is large enough. At n=25, the QQ plot straightens out nicely.
Q3. What is the minimum sample size for the highly skewed pollinator visits data?
The original data are very skewed, so a large sample size is needed for the CLT to work. At n=25, the sampling distribution is still visibly skewed, but by n=100, it becomes much more symmetric and bell-shaped.