Correlation/NP/Spearman’s Rho
Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.
Background
Spearman’s rho (\(\rho\)) is a statistic for measuring rank correlation [1].
Notation
Let \((Y_{i1}, Y_{i2})\) be the pair of random variables corresponding to the \(i\)th sample, where \(i = 1, \ldots, n\).
| | \(Y_{i1}\) | \(Y_{i2}\) |
|---|---|---|
| Sample: 1 | \(Y_{11}\) | \(Y_{12}\) |
| Sample: 2 | \(Y_{21}\) | \(Y_{22}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Sample: \(n\) | \(Y_{n1}\) | \(Y_{n2}\) |
Let \((R_{i1}, R_{i2})\) be the ranks of \(Y_{i1}\) and \(Y_{i2}\) within their respective columns. In the case of ties, one method is to assign every member of a tied group the average of the ranks that the group occupies. For the \(i\)th sample, let \(S_{i1,1}\) be the number of observed values less than \(Y_{i1}\), \(S_{i1,2}\) be the number of observed values equal to \(Y_{i1}\) (including \(Y_{i1}\) itself), and \(S_{i1,3}\) be the number of observed values greater than \(Y_{i1}\). We can calculate the rank of a single sample as
\[ R_{i1} = S_{i1,1} + \frac{S_{i1,2} + 1}{2} \tag{20} \]
In Python, the rank function of pandas.DataFrame with the method='average' option calculates ranks as defined in Eq. (20).
In R, the rank function with the ties.method='average' option does the same.
See reference [2] for ranking in Julia.
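As a minimal sketch of the tie-handling rule, the snippet below computes each rank from the counts of smaller and equal observations, assuming the average rank takes the form \(S_{i1,1} + (S_{i1,2} + 1)/2\), and compares the result with scipy.stats.rankdata. The helper name `average_rank` and the data are hypothetical and used only for illustration.

```python
import numpy as np
from scipy.stats import rankdata


def average_rank(values):
    """Average rank per value: (# smaller) + ((# equal, itself included) + 1) / 2."""
    v = np.asarray(values)
    # Entry [i, j] of each comparison matrix compares v[j] against v[i].
    less = (v[None, :] < v[:, None]).sum(axis=1)    # S_{i,1}
    equal = (v[None, :] == v[:, None]).sum(axis=1)  # S_{i,2}
    return less + (equal + 1) / 2


y = [10, 20, 20, 30, 20]  # hypothetical data with a three-way tie
ranks = average_rank(y)
reference = rankdata(y, method="average")

print(ranks)      # the three tied values share the average of ranks 2, 3, 4
print(reference)
```

The pairwise comparison takes \(O(n^2)\) time, which is fine for illustration; library routines sort instead.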
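Spearman’s \(\rho\) can be calculated as:
\[ \rho = \frac{\sum_i R_{i1} R_{i2} - \frac{n}{4}(n+1)^2}{\sqrt{s_1^2 \, s_2^2}} \]
where \(s_1^2 = \sum_i R_{i1}^2 - \frac{n}{4}(n+1)^2\), and \(s_2^2 = \sum_i R_{i2}^2 - \frac{n}{4}(n+1)^2\). The term \(\frac{n}{4}(n+1)^2\) is \(n\) times the squared mean rank \(\frac{n+1}{2}\), so \(\rho\) is the Pearson correlation of the ranks.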
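The following sketch evaluates \(\rho\) directly from the ranks, under the assumption that the cross-product sum and both squared sums are centered by \(n(n+1)^2/4\), that is, \(n\) times the squared mean rank. The function name `spearman_rho` and the data are hypothetical; the result is compared with scipy.stats.spearmanr as a sanity check.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr


def spearman_rho(y1, y2):
    """Spearman's rho as the Pearson correlation of average ranks."""
    r1 = rankdata(y1, method="average")
    r2 = rankdata(y2, method="average")
    n = len(r1)
    center = n * (n + 1) ** 2 / 4          # n * (mean rank)^2
    s1_sq = np.sum(r1 ** 2) - center       # centered sum of squares, column 1
    s2_sq = np.sum(r2 ** 2) - center       # centered sum of squares, column 2
    return (np.sum(r1 * r2) - center) / np.sqrt(s1_sq * s2_sq)


# Hypothetical data containing ties in both columns
y1 = [2.0, 7.0, 7.0, 1.0, 9.0]
y2 = [3.0, 3.0, 8.0, 0.5, 6.0]

rho_manual = spearman_rho(y1, y2)
rho_scipy, _ = spearmanr(y1, y2)
print(rho_manual, rho_scipy)
```

The function is undefined when either column is constant, because the corresponding sum of squares is zero.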
Example - Group-1
| | \(Y_{i1}\) | \(Y_{i2}\) |
|---|---|---|
| Sample: 1 | 1 | 4 |
| Sample: 2 | 3 | 6 |
| Sample: 3 | 2 | 5 |

| | \(Y_{i1}\) | \(Y_{i2}\) |
|---|---|---|
| Sample: 1 | 1 | 6 |
| Sample: 2 | 3 | 4 |
| Sample: 3 | 2 | 5 |
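As a worked check on the two tables above, the arithmetic can be carried out by hand. This is a sketch that assumes \(\rho\) is the Pearson correlation of the ranks, with each sum centered by \(\frac{n}{4}(n+1)^2 = 12\) for \(n = 3\).

In the first table, the ranks are \(R_{i1} = (1, 3, 2)\) and \(R_{i2} = (1, 3, 2)\):
\[ \sum_i R_{i1} R_{i2} = 1 + 9 + 4 = 14, \qquad s_1^2 = s_2^2 = 14 - 12 = 2, \qquad \rho = \frac{14 - 12}{\sqrt{2 \cdot 2}} = 1 \]

In the second table, the ranks are \(R_{i1} = (1, 3, 2)\) and \(R_{i2} = (3, 1, 2)\):
\[ \sum_i R_{i1} R_{i2} = 3 + 3 + 4 = 10, \qquad s_1^2 = s_2^2 = 14 - 12 = 2, \qquad \rho = \frac{10 - 12}{\sqrt{2 \cdot 2}} = -1 \]

The first table is perfectly increasing in rank and the second perfectly decreasing, which gives the two extreme values of \(\rho\).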
How-to
To use scipy.stats [3]:
from scipy.stats import spearmanr
# Data from the first example table
y1 = [1, 3, 2]
y2 = [4, 6, 5]
# spearmanr returns the correlation and a two-sided p-value
rho, p_value = spearmanr(y1, y2)
print("Spearman's rho:", rho)
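As an alternative sketch, pandas can compute the same statistic through Series.corr with method='spearman'. The column names `y1` and `y2` are assumptions made for this illustration; the data are the two example tables above.

```python
import pandas as pd

# First and second example tables; column names are illustrative only
group_1 = pd.DataFrame({"y1": [1, 3, 2], "y2": [4, 6, 5]})
group_2 = pd.DataFrame({"y1": [1, 3, 2], "y2": [6, 4, 5]})

# Series.corr ranks both columns internally and correlates the ranks
rho_1 = group_1["y1"].corr(group_1["y2"], method="spearman")
rho_2 = group_2["y1"].corr(group_2["y2"], method="spearman")

# The ranks themselves, using the average method for ties
ranks_2 = group_2.rank(method="average")

print("Group 1 rho:", rho_1)
print("Group 2 rho:", rho_2)
print(ranks_2)
```

This route returns only the correlation; scipy.stats.spearmanr is the one that also reports a p-value.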
More Details
Assume that \(Y_{11}, \ldots, Y_{n1}\) are independent and identically distributed with \(Y_{i1} \sim \mathcal{D}\). If \(\mathcal{D}\) is continuous, ties occur with probability zero, so \(P(S_{i1,2}=1)=1\) for all \(i\) and Eq. (20) simplifies to \(R_{i1} = S_{i1,1}+1\). For a given sample size \(n\) and \(r \in \{1, \ldots, n\}\), the pmf of \(R_{i1}\) is \(P(R_{i1} = r) = \frac{1}{n}\), which does not depend on \(r\) or \(\mathcal{D}\) [4].
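The uniform distribution of a single rank can be checked empirically. The simulation below is a rough sketch: the exponential distribution, the sample size, the number of replications, and the seed are all arbitrary choices made for illustration, and any continuous distribution should give similar frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 5, 20000

# Each row is one sample of size n from a continuous distribution
samples = rng.exponential(size=(n_rep, n))

# With no ties, the rank of the first observation is (# smaller values) + 1
rank_first = (samples < samples[:, [0]]).sum(axis=1) + 1

# Relative frequency of each rank 1..n across replications
freq = np.bincount(rank_first, minlength=n + 1)[1:] / n_rep
print(freq)  # each entry should be close to 1/n
```

Replacing `rng.exponential` with another continuous sampler, such as `rng.normal`, should leave the frequencies near \(1/n\), in line with the claim that the pmf does not depend on \(\mathcal{D}\).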