Correlation/NP/Kendall’s Tau

Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.

Background

Kendall’s tau (\(\tau\)) is a statistic used for measuring rank correlation [1] [2].

Notation

Let \((Y_{i1}, Y_{i2})\) be a pair of random variables corresponding to the \(i\) th sample where \(i = 1, \ldots, n\).

Table 20 Observed Value

| Sample | \(Y_{i1}\) | \(Y_{i2}\) |
|---|---|---|
| Sample: 1 | \(Y_{11}\) | \(Y_{12}\) |
| Sample: 2 | \(Y_{21}\) | \(Y_{22}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Sample: \(n\) | \(Y_{n1}\) | \(Y_{n2}\) |

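Let \(Z_{ij1} \equiv sign(Y_{i1}-Y_{j1})\) and \(Z_{ij2} \equiv sign(Y_{i2}-Y_{j2})\). Let \(c = \sum_{i=1}^n \sum_{j < i} I(Z_{ij1}Z_{ij2}=1)\) be the number of concordant pairs, \(d = \sum_{i=1}^n \sum_{j < i} I(Z_{ij1}Z_{ij2}=-1)\) be the number of discordant pairs, and \(t = \frac{n(n-1)}{2}\) be the total number of pairs, where \(I(\cdot)\) is the indicator function. The coefficient \(\tau\) (tau-a) can be calculated as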

\[\tau = \frac{c - d}{t}. \tag{18}\]

If there are no ties, the maximum value of Eq. (18) is 1 at \(c=t\), and the minimum is -1 at \(d=t\).
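As an illustration of Eq. (18), the following minimal sketch counts concordant and discordant pairs directly. The function name `kendall_tau_a` and the helper `sign` are hypothetical names chosen for this example, and the sketch assumes at least two samples. The input values are those of the two tables in the Example section.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau_a(y1, y2):
    """Tau-a as in Eq. (18): (c - d) / t, assuming len(y1) == len(y2) >= 2."""
    n = len(y1)
    c = 0  # number of concordant pairs
    d = 0  # number of discordant pairs
    for i, j in combinations(range(n), 2):
        product = sign(y1[i] - y1[j]) * sign(y2[i] - y2[j])
        if product == 1:
            c += 1
        elif product == -1:
            d += 1
    t = n * (n - 1) // 2  # total number of pairs
    return (c - d) / t


print(kendall_tau_a([1, 3, 2], [4, 6, 5]))  # 1.0  (all pairs concordant)
print(kendall_tau_a([1, 3, 2], [6, 4, 5]))  # -1.0 (all pairs discordant)
```

Tied pairs contribute to neither `c` nor `d`, which is why tau-a cannot reach 1 or -1 when ties are present.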

Eq. (18) can also be expressed as

\[\begin{split}\tau &= \frac{2}{n(n-1)} \left( \sum_{i=1}^n \sum_{j < i} Z_{ij1}Z_{ij2} \right) \\ &= \frac{1}{n(n-1)} \left( \sum_{i=1}^n \sum_{j=1}^n Z_{ij1}Z_{ij2} \right).\end{split} \tag{19}\]

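Under the independent-sample assumption, for a fixed \(n\) and with no ties, we know that \(E(Z_{ij1})=E(Z_{ij2})=0\) and, taking the variance over all \(n^2\) ordered pairs \((i,j)\) (including \(i=j\), where \(Z_{ii1}=Z_{ii2}=0\)), \(Var(Z_{ij1})=Var(Z_{ij2})=1-\frac{1}{n}\). Dividing the average of the products \(Z_{ij1}Z_{ij2}\) over those pairs by this variance gives Eq. (19), so \(\tau\) is a type of correlation coefficient.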
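One way to read the mean and variance stated above is as averages over all \(n^2\) ordered pairs \((i,j)\) of a tie-free sample, including the \(n\) diagonal pairs where the sign is 0. The sketch below checks that reading numerically with NumPy. The data values are arbitrary and chosen only for illustration, and this interpretation is an assumption of the example rather than something stated in the text.

```python
import numpy as np

# Arbitrary tie-free values, chosen only for illustration.
y = np.array([1, 3, 2, 5, 4])
n = len(y)

# n x n matrix whose (i, j) entry is sign(y_i - y_j).
z = np.sign(y[:, None] - y[None, :])

mean_z = z.mean()          # average of Z_ij over all n^2 ordered pairs
mean_z2 = (z ** 2).mean()  # average of Z_ij^2 over the same pairs

print("mean of Z:", mean_z)
print("mean of Z^2:", mean_z2)
print("1 - 1/n:", 1 - 1 / n)
```

The matrix is antisymmetric, so its entries average to zero, and exactly \(n(n-1)\) of its \(n^2\) entries have absolute value 1.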

If there are ties, the maximum value of Eq. (18) becomes less than 1. Consider the scenario in which there are \(n_{t1}\) groups of ties in \(\{Y_{i1}\}\) and \(n_{t2}\) groups of ties in \(\{Y_{i2}\}\). Let \(n_{t1,k}\) be the number of tied observations in the \(k\) th group of ties in \(\{Y_{i1}\}\), and \(n_{t2,k}\) be the number of tied observations in the \(k\) th group of ties in \(\{Y_{i2}\}\). The adjusted \(\tau\) (tau-b) is calculated by replacing \(t\) in Eq. (18) with

\[t^* = \sqrt{\frac{1}{2}n(n-1)-\sum_{k=1}^{n_{t1}} \frac{1}{2}n_{t1,k}(n_{t1,k}-1)}\sqrt{\frac{1}{2}n(n-1)-\sum_{k=1}^{n_{t2}} \frac{1}{2}n_{t2,k}(n_{t2,k}-1)}.\]
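The tie adjustment can be sketched in a few lines of plain Python. The function name `kendall_tau_b` is a hypothetical name for this example, the input data are made up to contain one tie group in each variable, and the sketch assumes that neither variable is constant (otherwise \(t^*\) is zero).

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(y1, y2):
    """Tau-b: (c - d) / t*, with t* built from the tie-group sizes."""
    n = len(y1)
    c_minus_d = 0
    for i, j in combinations(range(n), 2):
        a = y1[i] - y1[j]
        b = y2[i] - y2[j]
        # Product of signs is +1 (concordant), -1 (discordant), or 0 (tied).
        c_minus_d += ((a > 0) - (a < 0)) * ((b > 0) - (b < 0))
    t = n * (n - 1) / 2
    # Each group of m equal values removes m(m-1)/2 pairs; m = 1 removes none.
    tied_pairs_1 = sum(m * (m - 1) / 2 for m in Counter(y1).values())
    tied_pairs_2 = sum(m * (m - 1) / 2 for m in Counter(y2).values())
    t_star = sqrt(t - tied_pairs_1) * sqrt(t - tied_pairs_2)
    return c_minus_d / t_star


# Made-up data: y1 has one tie group (1, 1) and y2 has one tie group (2, 2).
y1 = [1, 1, 2, 3]
y2 = [1, 2, 2, 3]
print("tau-b:", kendall_tau_b(y1, y2))
```

For this data \(c - d = 4\), \(t = 6\), and \(t^* = 5\), so tau-b is approximately 0.8 while tau-a is \(4/6\). When there are no ties, \(t^* = t\) and the two coefficients agree.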

Example - Group-1

Table 21 Kendall’s \(\tau = 1.0\)

| Sample | \(Y_{i1}\) | \(Y_{i2}\) |
|---|---|---|
| Sample: 1 | 1 | 4 |
| Sample: 2 | 3 | 6 |
| Sample: 3 | 2 | 5 |

Table 22 Kendall’s \(\tau = -1.0\)

| Sample | \(Y_{i1}\) | \(Y_{i2}\) |
|---|---|---|
| Sample: 1 | 1 | 6 |
| Sample: 2 | 3 | 4 |
| Sample: 3 | 2 | 5 |

How-to

To use scipy.stats [3]:

from scipy.stats import kendalltau
y1 = [1,3,2]
y2 = [4,6,5]

tau, p_value = kendalltau(y1, y2)
print("Kendall's tau:", tau)

Lab Exercise

  1. Show \(E(Z_{ij1})=E(Z_{ij2})=0\).

Algorithm - 1

Warning

FOR SMALL SAMPLE SIZES ONLY

Note that the algorithm in this section is implemented in mtbp3Lab for illustration purposes. Although the matrix form closely mirrors Eq. (19), the calculation time increases greatly as the sample size increases. Other algorithms can be found in the references.

Let \(Y_{1} = (Y_{11}, \ldots, Y_{n1})\) and \(Y_{2} = (Y_{12}, \ldots, Y_{n2})\). Let \(\times\) represent the matrix product, \(\times_{car}\) represent the Cartesian product, \(\times_{ele}\) represent the element-wise product, \(g([(a,b)]) = [sign(a-b)]\), and \(h(X_n) = 1_n \times X_n \times 1_n^T\), where \(X_n\) is an \(n\) by \(n\) matrix and \(1_n\) is a length-\(n\) vector of ones. Both tau-a and tau-b can be calculated using the following steps:

  1. calculate components \(\tau_1 = g(Y_{1} \times_{car} Y_{1})\) and \(\tau_2 = g(Y_{2} \times_{car} Y_{2})\)

  2. calculate \(\tau\) as \(\tau = \frac{h(\tau_1 \times_{ele} \tau_2) }{ \sqrt{h(abs(\tau_1))}\sqrt{h(abs(\tau_2))} }\)
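The two steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the mtbp3Lab implementation: the function name `kendall_tau_matrix` is hypothetical, broadcasting stands in for the Cartesian product, and neither input is assumed to be constant.

```python
import numpy as np


def kendall_tau_matrix(y1, y2):
    """Matrix-form Kendall's tau following the two steps above (O(n^2) memory)."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    ones = np.ones(len(y1))

    def h(x):
        # h(X) = 1_n x X x 1_n^T, i.e. the sum of all entries of X.
        return ones @ x @ ones

    # Step 1: sign matrices over all ordered pairs (i, j).
    tau1 = np.sign(y1[:, None] - y1[None, :])
    tau2 = np.sign(y2[:, None] - y2[None, :])

    # Step 2: element-wise product, normalized by the non-tied pair counts.
    return h(tau1 * tau2) / (np.sqrt(h(np.abs(tau1))) * np.sqrt(h(np.abs(tau2))))


print(kendall_tau_matrix([1, 3, 2], [4, 6, 5]))  # close to 1.0, up to rounding
print(kendall_tau_matrix([1, 3, 2], [6, 4, 5]))  # close to -1.0, up to rounding
```

Because \(h(abs(\tau_1))\) counts only the ordered pairs that are not tied, the ratio as written equals tau-b when ties are present and reduces to tau-a when there are none. The \(n\) by \(n\) matrices are the reason this approach is suitable for small sample sizes only.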

How-to

To use mtbp3Lab:

import numpy as np
from mtbp3Lab.statlab.corr import CorrCalculator

size = 100
y1 = np.random.randint(1, size+1, size=size).tolist()
y2 = np.subtract(np.random.randint(1, size+1, size=size),y1).tolist()
t = CorrCalculator([y1,y2])
print("Kendall's tau (mtbp3Lab.corr):", t.calculate_kendall_tau())

To create a scatter plot of y1 and y2:

t.plot_y_list(axis_label=['y1','y2'])

Reference