Motivating Scenario:
You are digging deeper into your summaries of associations between two categorical variables.
Learning Goals: By the end of this subchapter, you should be able to:
Explain the intuition of the multiplication rule for independent events
Calculate and explain the covariance as a deviation from expectations under independence. Again you should be able to do this with basic math and with R code.
In the previous section we calculated UNCONDITIONAL (also known as marginal) proportions (n_case / n_total) of two binary categorical variables. Let's call these proportions:
A common quantification of the association between two binary variables is the covariance. This quantification follows some simple mathematical logic:
If two categorical variables, A & B are independent, the probability of observing both A and B (aka, \(P_{AB}\)) should equal the product of their unconditional probabilities: \[P_{AB \mid \text{independence}} = P_A \times P_B\]
The covariance is the deviation from this expectation: \[COV_{AB} = P_{AB} - P_A \times P_B\]
So for the association between petal color and being visited, the covariance is:
\[COV_{\text{visited, pink}} =P_{\text{visited and pink}} - P_{\text{visited}} \times P_{\text{pink}}\]
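As a concrete sketch of this calculation, here are hypothetical logical vectors `visited` and `pink` (made-up values, not the chapter's actual dataset):

```r
# Hypothetical data: was each flower visited, and is it pink?
visited <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
pink    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

p_visited          <- mean(visited)          # unconditional proportion visited
p_pink             <- mean(pink)             # unconditional proportion pink
p_visited_and_pink <- mean(visited & pink)   # joint proportion

# Covariance as the deviation from the independence expectation
cov_visited_pink <- p_visited_and_pink - p_visited * p_pink
cov_visited_pink
```

Because `mean()` treats `TRUE` as 1 and `FALSE` as 0, `mean(visited & pink)` is exactly the proportion of flowers that are both visited and pink.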
Alternatively, R can calculate the covariance with the cov() function. This only works if the values are numeric or logical (TRUE [1] / FALSE [0]). Note the slight mismatch between this result and our calculation, due to "Bessel's correction" (see fyi box, below).
Our answers differ slightly from one another. We used a denominator of \(n\) (which calculates the population covariance), while the sample covariance has a denominator of \(n-1\) (this is known as Bessel's correction). We will soon see that although the latter is more appropriate, the difference is negligible when \(n\) is large. We can convert the population covariance to the sample covariance by multiplying by \(\frac{n}{n-1}\). See the next section for a discussion of the difference between population parameters and sample estimates.
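We can check this conversion directly. Continuing with hypothetical logical vectors `visited` and `pink` (made-up data), multiplying the population covariance by \(n/(n-1)\) recovers what `cov()` returns:

```r
visited <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
pink    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
n <- length(visited)

# Population covariance (denominator n), by hand
pop_cov <- mean(visited & pink) - mean(visited) * mean(pink)

# Sample covariance (denominator n - 1), as R's cov() computes it
samp_cov <- cov(visited, pink)

# Bessel's correction converts one into the other
pop_cov * n / (n - 1)   # matches cov(visited, pink)
```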
From covariance to correlation
Covariance gives us a numerical measure of how far our data deviate from what we'd expect under independence. In this case, the value is 0.10 – but is that meaningful? It turns out a covariance is difficult to interpret as strong or weak. The correlation, r, describes both the sign and strength of the association between variables. An absolute value of r close to 1 means that x almost perfectly predicts y, while a value close to zero means x carries no information about y. The sign of r describes whether y increases (\(r > 0\)) or decreases (\(r < 0\)) with x.
We can convert a covariance to an effect size, known as a correlation (or r), by dividing the covariance by the product of standard deviations.
\[r_{x,y}= \frac{cov_{x,y}}{s_x s_y}\]
We have previously seen the equation for a standard deviation: \[s_x = \sqrt{\frac{\sum{(x_i-\bar{x})^2 }}{n-1}}\].
For a binary variable, the population variance simplifies to \(p_x (1-p_x)\), so the standard deviation is \(\sqrt{p_x (1-p_x)}\). Again, this simplification arises when we ignore Bessel's correction and divide by n instead of n − 1. So in R:
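Here is a sketch with hypothetical logical vectors `visited` and `pink` (made-up data). Because Bessel's correction appears in both the numerator and the denominator of \(r\), it cancels, so our population-based calculation matches R's `cor()`:

```r
# Hypothetical binary (logical) variables
visited <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
pink    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

p_visited <- mean(visited)
p_pink    <- mean(pink)

# Population standard deviation of a binary variable: sqrt(p * (1 - p))
s_visited <- sqrt(p_visited * (1 - p_visited))
s_pink    <- sqrt(p_pink * (1 - p_pink))

# Population covariance, then the correlation r = cov / (s_x * s_y)
pop_cov <- mean(visited & pink) - p_visited * p_pink
r <- pop_cov / (s_visited * s_pink)
r

# Bessel's correction cancels in the ratio, so this matches cor()
cor(visited, pink)
```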
Two population-genetic summaries tend to confuse genetics students – deviations from Hardy-Weinberg Equilibrium, and linkage disequilibrium. Both can be naturally considered as covariances between two binary variables.
Hardy-Weinberg Equilibrium (HWE)
The Hardy-Weinberg Equilibrium principle states that under numerous assumptions – which essentially ensure that an individual's two alleles at a locus are chosen independently – genotypes at a locus are simply two random samples of alleles from a population,
Genotype frequencies should be \(p^2\), \(2p(1-p)\), and \((1-p)^2\)
Where \(p\) is the frequency of one allele, A, at a locus.
We can think of this as a covariance between alleles inherited from mom and dad at the same locus.
So \(p^2\) is actually \(p_\text{A maternal} \times p_\text{A paternal}\).
The deviation from HWE, \(F\), is simply the difference between the observed frequency of homozygotes (\(p_\text{AA} = p_\text{A maternal, A paternal}\)) and their expected frequency, \(p^2\). So \(F = p_\text{AA} - p_A^2\).
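A minimal sketch, assuming hypothetical vectors of maternally and paternally inherited alleles (TRUE when the A allele was inherited; these data are made up):

```r
# Hypothetical alleles: TRUE = A allele, FALSE = any other allele
maternal <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
paternal <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)

p_A  <- mean(c(maternal, paternal))   # overall frequency of the A allele
p_AA <- mean(maternal & paternal)     # observed frequency of AA homozygotes

# Deviation from HWE as a covariance between maternal and paternal alleles
F_dev <- p_AA - p_A^2
F_dev
```

A positive value means AA homozygotes are more common than expected if maternal and paternal alleles were independent draws from the population.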
Linkage Disequilibrium (LD)
Linkage disequilibrium quantifies the statistical association between alleles at different loci. The most standard quantification of linkage disequilibrium is \(D = Cov(A,B) = p_{AB} - p_A p_B\), where A and B are alleles at two different loci. Although usually quantified as a covariance, other measures of LD include a correlation in allele frequencies between loci.
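As a sketch, with hypothetical haplotypes (one chromosome per entry; TRUE when the focal allele is carried at that locus; these data are made up):

```r
# Hypothetical haplotypes: does each chromosome carry allele A (locus 1)
# and allele B (locus 2)?
has_A <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
has_B <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

p_A  <- mean(has_A)
p_B  <- mean(has_B)
p_AB <- mean(has_A & has_B)   # frequency of the AB haplotype

# Linkage disequilibrium as a covariance between the two loci
D <- p_AB - p_A * p_B
D
```

This is the same calculation as the flower-color covariance above, just applied to alleles rather than phenotypes.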