6. Associations: Part I

Motivating Scenario:
You’re curious to know the extent to which two variables are associated and need background on standard ways to summarize associations.

Learning Goals: By the end of this chapter, you should be able to:

Recognize the difference between correlation and causation
- Memorize the phrase “Correlation does not necessarily imply causation,” explain what it means and why it’s important in statistics, and know that this is true of all measures of association.
- Identify when correlation may or may not reflect a causal relationship.
Explain and interpret summaries of associations with a binary explanatory variable
- Describe an association between two binary variables.
  - As differences in conditional proportions.
- Describe an association between a binary explanatory variable and a continuous response
  - As differences in conditional means.
  - As an “effect size”.
- Use R to calculate and interpret summaries of association.
- Use R to visualize difference between two categorical predictor values.

library(GGally)
ggpairs(gc_rils)

A matrix of plots showing pairwise relationships among variables in the Clarkia xantiana dataset. The matrix includes bar plots for categorical variables, histograms and density plots for distributions, scatterplots for continuous variables, and boxplots comparing continuous and categorical variables. Correlation coefficients are displayed in the upper triangle for numeric pairs. The figure visually summarizes a range of associations among traits such as petal color, pollinator visitation, petal area, anther–stigma distance, and hybridization rate. — Figure 1: A matrix of plots showing associations among several variables in the RIL dataset. It includes combinations of categorical and continuous variables related to floral traits, pollinator visitation, and hybrid seed formation. The diagonals show distributions of individual variables, while the upper triangle displays correlation coefficients, and the lower triangle shows scatterplots or boxplots depending on the variable type.

Summarizing a single variable can be illuminating. But we usually want to know more than that in our biostatistical adventures. We want to know about associations between variables. A careful study of Figure 1 (plus a bit more information) shows some potentially interesting associations:

Pink flowers seem to have a better chance of receiving at least one pollinator than do white flowers.
Despite our attempts to genetically disentangle floral traits by creating Recombinant Inbred Lines (RILs), an association between petal area and anther stigma distance remains.

Of course, it’s not just the association between variables we care about — it’s what such associations imply. We want to:

Predict one variable from another.
Anticipate what some intervention will do to a biological system.

In future chapters we will see when and how we can achieve these higher goals. But for now, know that while such goals are noble,

We cannot make causal claims or even good predictions from correlations alone.

Correlation is not causation

“Correlation is not causation.” You’ve probably heard that before — but what does it actually mean? Let’s start by unpacking the two key concepts in that statement:

Correlation means that two variables are associated.
- A positive association means that when one variable is large (or small) the other is often big (or large).
- A negative association means that when one variable is large (or small) the other is often small (or large).
Causation means that changing one variable produces a change in the other. (For a deeper dive, see Wikipedia.)

Figure 2: This man is not starting and stopping the train. (tweet)

Correlation is often confused for causation because it’s easy to assume that if two things are associated, one must be causing the other — especially when the association feels intuitive or lines up with our expectations, but this is wrong. While correlation may hint at causation, a direct cause is neither necessary nor sufficient to generate a correlation. Take the video in Figure 2 – an alien might think this man is starting and stopping the train, but clearly he has nothing to do with the train starting or stopping.

There are three basic reasons why and when we can have a correlation without a causal relationship– Coincidence, Confound, and Reverse causation.

Coincidence: Chance is surprisingly powerful. In a world full of many possible combinations between variables, some strong associations will arise purely by luck. Later sections of the book will show how to evaluate the “NULL” hypothesis that an observed association arose by chance.

Figure 3: **Potential confounding** in parviflora RILs. The observed association between proportion hybrid seed and anther–stigma distance (A), might be due to the fact that both anther–stigma distance (B), and proportion hybrid seed increases with petal area (C), rather than a causal effect of anther–stigma distance itself.

Confounding: An association between two variables may reflect not a causal connection between them – but rather the fact that both are caused by a third variable (known as a confound). Such confounding may be at play in our RIL data – we observe that anther–stigma distance is associated with the proportion hybrid seed, but anther–stigma distance is also associated petal area (presumably because both are caused by flower growth), which itself is associated with the proportion of hybrids (Figure 3). So, does petal area or anther stigma distance (or both or neither) cause an increase in proportion of hybrid seed? The answer awaits better data, or at least better analyses (see section on causal inference), but I suspect that petal area, not anther stigma distance “causes” proportion hybrid. Unfortunately, we rarely know the confound, let alone its value. So, interpreting any association as causation requires exceptional caution.

Reverse causation: Figure 4 shows that pink flowers are more likely to receive a pollinator than are white flowers. We assume this means that pink attracts pollinators– and with the caveat that we must watch out for coincidence and confounds, this conclusion makes sense. However, an association alone cannot tell if pink flowers attracted pollinators or if pollinator visitation turned plants pink. In this case the answer is clear – petal color was measured for RILs in the greenhouse, and there’s no biological mechanism by which a pollinator could change petal color. However, these answers require us to bring in biological knowledge – the data alone can’t tell us which way the effect goes.

Two bar plots showing the relationship between petal color (pink or white) and pollinator visitation (visited or not visited). The left plot places visit status on the x-axis and shows the proportion of pink and white flowers within each visit category. The right plot reverses this, placing petal color on the x-axis and showing the proportion of flowers that were visited or not. Both plots show that pink flowers are more likely to be visited, but the choice of x-axis can influence how we interpret the relationship, emphasizing the importance of considering — but not assuming — causal direction. — Figure 4: **Visualizing an association between petal color and pollinator visitation.** Both panels show that pink-petaled flowers are more likely to be visited by pollinators than white-petaled flowers. In the left panel, visit status is on the x-axis and petal color is shown within bars; in the right panel, petal color is on the x-axis and visit status is shown within bars. While the association is the same, the visual framing shifts the way we interpret direction — we typically place and think of explanatory variables (causes) on the x-axis and outcomes on the y-axis. However, these alternative visualizations make it clear that the data cannot speak to cause.

This issue isn’t just theoretical — I’m currently grappling with a real case in which the direction of causation is unclear. In natural hybrid zones, white-flowered parviflora plants tend to carry less genetic ancestry from xantiana (their sister taxon) than do pink-flowered parviflora plants. There are two potential explanations for this observation:

Perhaps, as the RIL data suggests, white-flowered parviflora plants are less likely to hybridize with xantiana than are pink-flowered parviflora, so white-flowered plants have less xantiana ancestry (pink flowers cause more gene flow).
Alternatively, all xantiana are pink-flowered, white-flowered parviflora can be white or pink. So maybe the pink flowers are actually caused by ancestry inherited from xantiana (more gene flow causes pink flowers).

I do not yet know the answer.

# YANIV ADD FIGURE FROM SHELLEY

Making things independent

If we cannot break the association between anther stigma distance and petal area by genetic crosses maybe we could do so by physical manipulation. For example, we could use tape or some other approach to move stigmas closer to or further from anthers.

There is still value in finding associations

The caveats above are important, but they should not stop us from finding associations. With appropriate experimental designs, statistical analyses, biological knowledge, and humility in interpretation, quantifying associations is among the most important ways to summarize and understand data.

The following sections provide the underlying logic, mathematical formulas, and R functions to summarize associations.

Let’s get started with summarizing associations!

The following sections introduce how to summarize associations between variables by:

Then we summarize the chapter, present practice questions, a glossary, a review of R functions and R packages introduced, and present additional resources.

In the next chapter we will continue our study of associations as we will investigate associatiosn between continuous variables!