Correlation/NP/Spearman’s Rho

Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.

Background

Spearman’s rho (\(\rho\)) is a statistic that measures rank correlation [1].

Notation

Let \((Y_{i1}, Y_{i2})\) be the pair of random variables corresponding to the \(i\)th sample, where \(i = 1, \ldots, n\).

Table 23 Observed Value

Sample | \(Y_{i1}\) | \(Y_{i2}\)
1 | \(Y_{11}\) | \(Y_{12}\)
2 | \(Y_{21}\) | \(Y_{22}\)
\(\vdots\) | \(\vdots\) | \(\vdots\)
\(n\) | \(Y_{n1}\) | \(Y_{n2}\)

Let \((R_{i1}, R_{i2})\) be the ranks of \(Y_{i1}\) and \(Y_{i2}\), each ranked within its own column. In the case of ties, one method is to assign every member of the tied group the average of the ranks that the group occupies. For the \(i\)th sample, let \(S_{i1,1}\) be the number of observed values \(Y_{j1}\), \(j = 1, \ldots, n\), less than \(Y_{i1}\), \(S_{i1,2}\) be the number of observed values equal to \(Y_{i1}\) (counting \(Y_{i1}\) itself), and \(S_{i1,3}\) be the number of observed values greater than \(Y_{i1}\), so that \(S_{i1,1} + S_{i1,2} + S_{i1,3} = n\). We can calculate the rank of a single sample as

\[R_{i1} = S_{i1,1} + \frac{S_{i1,2}+1}{2} = n - S_{i1,3} - \frac{S_{i1,2}-1}{2}. \tag{20}\]

For a vector, pandas.DataFrame provides the rank method, which computes the rank defined in Eq. (20) when called with the method='average' option. In R, the same result can be obtained from the rank function with the ties.method='average' option. See reference [2] for ranking in Julia.
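As a rough illustration of Eq. (20), the sketch below computes average ranks directly from the three counts, using only the Python standard library. The function name `average_rank` and the sample vector are hypothetical choices made for this example and are not part of any library.

```python
def average_rank(values):
    """Rank each value with Eq. (20): R = S1 + (S2 + 1) / 2."""
    n = len(values)
    ranks = []
    for v in values:
        s1 = sum(1 for u in values if u < v)   # values strictly less than v
        s2 = sum(1 for u in values if u == v)  # values equal to v (v itself included)
        s3 = sum(1 for u in values if u > v)   # values strictly greater than v
        rank = s1 + (s2 + 1) / 2
        # The second form of Eq. (20) must give the same number.
        assert rank == n - s3 - (s2 - 1) / 2
        ranks.append(rank)
    return ranks


y = [10, 20, 20, 30, 20]   # illustrative data with a three-way tie
ranks = average_rank(y)
print(ranks)  # [1.0, 3.0, 3.0, 5.0, 3.0]
```

The three tied values would occupy ranks 2, 3, and 4, so each receives their average, 3. This double loop is quadratic in \(n\) and is meant only to mirror the formula; the library functions mentioned above are the practical choice.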

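Spearman’s \(\rho\) can then be calculated as

\[\rho = \frac{\frac{1}{n}\sum_i R_{i1}R_{i2} - \frac{1}{4}(n+1)^2}{s_1 s_2}, \tag{21}\]

where \(s_1^2 = \frac{1}{n}\sum_i R_{i1}^2 - \frac{1}{4}(n+1)^2\) and \(s_2^2 = \frac{1}{n}\sum_i R_{i2}^2 - \frac{1}{4}(n+1)^2\). Here \(\frac{1}{4}(n+1)^2\) is the square of the mean rank \(\frac{n+1}{2}\), so the numerator is the covariance of the two rank vectors and \(s_1^2\), \(s_2^2\) are their variances.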
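The calculation can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the helper names `average_rank` and `spearman_rho` are hypothetical, and each variance term is computed as the mean of the squared ranks minus the squared mean rank \(\frac{1}{4}(n+1)^2\). The two data sets are the ones listed in Table 24 and Table 25.

```python
import math


def average_rank(values):
    """Average ranks: (count of smaller values) + (count of equal values + 1) / 2."""
    return [
        sum(1 for u in values if u < v) + (sum(1 for u in values if u == v) + 1) / 2
        for v in values
    ]


def spearman_rho(y1, y2):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    n = len(y1)
    r1, r2 = average_rank(y1), average_rank(y2)
    mean_sq = (n + 1) ** 2 / 4  # squared mean rank, ((n + 1) / 2) ** 2
    cov = sum(a * b for a, b in zip(r1, r2)) / n - mean_sq
    s1 = math.sqrt(sum(a * a for a in r1) / n - mean_sq)
    s2 = math.sqrt(sum(b * b for b in r2) / n - mean_sq)
    return cov / (s1 * s2)


rho_pos = spearman_rho([1, 3, 2], [4, 6, 5])  # data of Table 24
rho_neg = spearman_rho([1, 3, 2], [6, 4, 5])  # data of Table 25
print(round(rho_pos, 6), round(rho_neg, 6))   # 1.0 -1.0
```

For Table 24 both rank vectors are (1, 3, 2), so the covariance and both variances equal \(\frac{14}{3} - 4 = \frac{2}{3}\) and the ratio is 1. The sketch does not guard against a constant column, where a variance of zero would make the ratio undefined.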

Example - Group-1

Table 24 Spearman’s \(\rho = 1.0\)

Sample | \(Y_{i1}\) | \(Y_{i2}\)
1 | 1 | 4
2 | 3 | 6
3 | 2 | 5

Table 25 Spearman’s \(\rho = -1.0\)

Sample | \(Y_{i1}\) | \(Y_{i2}\)
1 | 1 | 6
2 | 3 | 4
3 | 2 | 5

How-to

To use scipy.stats [3]:

from scipy.stats import spearmanr

y1 = [1, 3, 2]
y2 = [4, 6, 5]

rho, p_value = spearmanr(y1, y2)
print("Spearman's rho:", rho)
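The snippet above uses the data of Table 24. As a complementary sketch, the same `spearmanr` call can be applied to the data of Table 25, where the two columns are ranked in exactly opposite order; the variable names are simply carried over from the example above.

```python
from scipy.stats import spearmanr

# Data of Table 25: the ranks of y2 are the reverse of the ranks of y1.
y1 = [1, 3, 2]
y2 = [6, 4, 5]

rho, p_value = spearmanr(y1, y2)
# Rounded for display; the value should be -1.0 up to floating-point error.
print("Spearman's rho:", round(float(rho), 6))
```

With only three samples the returned p-value carries little information, so it is not printed here.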

More Details

Assume that \(Y_{11}, \ldots, Y_{n1}\) are independent and identically distributed with \(Y_{i1} \sim \mathcal{D}\). For continuous \(Y_{i1}\), if we can assume that ties do not occur, that is, \(P(S_{i1,2}=1)=1\) for all \(i\), then Eq. (20) simplifies to \(R_{i1} = S_{i1,1}+1\). For a given sample size \(n\) and any \(r \in \{1, \ldots, n\}\), the pmf of \(R_{i1}\) is \(P(R_{i1} = r) = \frac{1}{n}\), which does not depend on \(r\) or \(\mathcal{D}\) [4].
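A small simulation can make the uniform pmf plausible. The sketch below is an illustration under stated assumptions, not a proof: it draws independent samples from an exponential distribution (an arbitrary continuous choice of \(\mathcal{D}\)), records the rank of the first observation, and compares the empirical frequencies with \(\frac{1}{n}\). The function name `rank_of_first`, the seed, and the sample sizes are hypothetical choices for this example.

```python
import random
from collections import Counter


def rank_of_first(sample):
    """Rank of sample[0] within the sample, assuming no ties: R = S1 + 1."""
    return sum(1 for u in sample if u < sample[0]) + 1


random.seed(0)           # fixed seed so the run is repeatable
n, trials = 5, 20000     # sample size and number of simulated samples
counts = Counter()
for _ in range(trials):
    # Any continuous distribution could be used here; exponential is one choice.
    sample = [random.expovariate(1.0) for _ in range(n)]
    counts[rank_of_first(sample)] += 1

freqs = {r: counts[r] / trials for r in range(1, n + 1)}
for r, f in freqs.items():
    print(f"P(R = {r}) is approximately {f:.3f}; 1/n = {1 / n:.3f}")
```

Swapping `random.expovariate(1.0)` for another continuous generator such as `random.gauss(0.0, 1.0)` should leave the frequencies close to \(\frac{1}{n}\), in line with the claim that the pmf does not depend on \(\mathcal{D}\).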

Reference