Fleiss’s Kappa

Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.

Background

Fleiss’s kappa (\(\kappa\)) is a statistic used for describing inter-rater reliability of multiple independent raters with categorical rating outcomes [1] [2].

Notation

Assume there are \(R+N_0\) raters in total, with \(R \geq 2\), and that each of the \(n\) samples was rated by \(R\) randomly selected raters and not rated by the remaining \(N_0\) raters. For ratings with \(J\) categories, let \(Y_{r,i} \in \{v_0, v_1,v_2,\ldots, v_J \}\) represent the rating from rater \(r=1,2,\ldots,R+N_0\) for sample \(i = 1, \ldots, n\). Let \(N_{ij}\) represent the total number of raters who gave rating \(v_j\) to sample \(i\), where \(j \in \{0, 1,\ldots,J\}\). The value \(v_0\) indicates that a rater did not rate sample \(i\), and \(N_{i0}=N_0\) is the same fixed number for all \(i\). Therefore, \(v_0\) is not included in the discussion below.

Table 11 Count of Ratings
	\(v_1\)	\(v_2\)	\(\ldots\)	\(v_J\)	Row Total
Sample: 1	\(N_{11}\)	\(N_{12}\)	\(\ldots\)	\(N_{1J}\)	\(R\)
Sample: 2	\(N_{21}\)	\(N_{22}\)	\(\ldots\)	\(N_{2J}\)	\(R\)
Sample: 3	\(N_{31}\)	\(N_{32}\)	\(\ldots\)	\(N_{3J}\)	\(R\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)	\(\vdots\)
Sample: \(n\)	\(N_{n1}\)	\(N_{n2}\)	\(\ldots\)	\(N_{nJ}\)	\(R\)
Column total	\(N_{\bullet 1}\)	\(N_{\bullet 2}\)	\(\ldots\)	\(N_{\bullet J}\)	\(nR\)

The average observed agreement is calculated as

(2)\[\bar{p}_O = \frac{1}{n} \sum_{i=1}^n p_{O,i},\]

where \(p_{O,i} = \frac{1}{R(R-1)} \left(\sum_{j=1}^J N_{ij}(N_{ij}-1)\right)= \frac{1}{R(R-1)} \left(\sum_{j=1}^J N_{ij}^2 - R\right)\).

The expected agreement is calculated as

(3)\[\bar{p}_E = \sum_{j=1}^J p_{E,j}^2,\]

where \(p_{E,j} = \frac{N_{\bullet j}}{nR}\).

Fleiss’s \(\kappa\) statistic is calculated from Eq. (2) and Eq. (3) as

(4)\[\kappa = \frac{\bar{p}_O - \bar{p}_E}{1-\bar{p}_E}.\]
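Eq. (2) to Eq. (4) can be sketched in plain Python as below. This is a minimal illustration rather than the implementation used by any package; the function name `fleiss_kappa_from_counts` and the small count table are hypothetical and are not part of the original page.

```python
def fleiss_kappa_from_counts(counts):
    """Fleiss's kappa from an n x J table of rating counts (each row sums to R)."""
    n = len(counts)
    R = sum(counts[0])
    if any(sum(row) != R for row in counts):
        raise ValueError("every sample must be rated by the same number of raters")
    # Eq. (2): per-sample agreement p_{O,i}, averaged over the n samples
    p_o = sum(
        sum(c * (c - 1) for c in row) / (R * (R - 1)) for row in counts
    ) / n
    # Eq. (3): sum of squared overall category proportions p_{E,j}
    p_e = sum((sum(col) / (n * R)) ** 2 for col in zip(*counts))
    # Eq. (4)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical table: n = 3 samples, R = 4 raters, J = 2 categories
example = [[4, 0], [2, 2], [0, 4]]
kappa_example = fleiss_kappa_from_counts(example)
print(kappa_example)
```

Note that Eq. (4) is undefined when \(\bar{p}_E = 1\), that is, when every rating falls in a single category; the sketch does not guard against that case.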

Example - Group-1

Table 12 Fleiss’s \(\kappa = 1.0\)
	\(v_1\)	\(v_2\)	\(v_3\)	\(v_4\)
Sample 1	12	0	0	0
Sample 2	0	12	0	0
Sample 3	0	0	12	0
Sample 4	0	0	12	0
Sample 5	0	0	0	12
Column Total	12	12	24	12

Table 13 Fleiss’s \(\kappa = -0.0909090909090909\)
	\(v_1\)	\(v_2\)	\(v_3\)	\(v_4\)
Sample 1	3	3	3	3
Sample 2	3	3	3	3
Sample 3	3	3	3	3
Sample 4	3	3	3	3
Sample 5	3	3	3	3
Column Total	15	15	15	15
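
As a check on the two captions, the values can be worked out by hand from Eq. (2) to Eq. (4) with \(n=5\), \(R=12\), and \(J=4\). In Table 12 every sample has all 12 ratings in one category, so \(\bar{p}_O = 1\) and \(\kappa = 1\) regardless of \(\bar{p}_E\) (here \(\bar{p}_E = 0.2^2+0.2^2+0.4^2+0.2^2 = 0.28\)). In Table 13 every sample has the same counts, so

\[p_{O,i} = \frac{4 \times 3 \times 2}{12 \times 11} = \frac{2}{11}, \qquad \bar{p}_E = 4 \times \left(\frac{15}{60}\right)^2 = \frac{1}{4}, \qquad \kappa = \frac{2/11 - 1/4}{1 - 1/4} = -\frac{1}{11} \approx -0.0909.\]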

How-to

To use both statsmodels.stats.inter_rater and mtbp3Lab.statlab:

import statsmodels.stats.inter_rater as ir
from mtbp3Lab.statlab import kappa

r1 = ['NA'] * 20 + ['B'] * 50 + ['A'] * 30
r2 = ['A'] * 20 + ['NA'] * 20 + ['B'] * 60
r3 = ['A'] * 40 + ['NA'] * 20 + ['B'] * 30 + ['C'] * 10
r4 = ['B'] * 60 + ['NA'] * 20 + ['C'] * 10 + ['A'] * 10
r5 = ['C'] * 60 + ['A'] * 10 + ['B'] * 10 + ['NA'] * 20
data = [r1, r2, r3, r4, r5]
# KappaCalculator is accessed through the imported kappa module;
# the result is stored under a different name so the module is not shadowed
kc = kappa.KappaCalculator(data, stringna='NA')

print("Fleiss's kappa (statsmodels.stats.inter_rater): "+str(ir.fleiss_kappa(kc.y_count)))
print("Fleiss's kappa (mtbp3Lab.statlab): "+str(kc.fleiss_kappa))
print("Number of raters per sample: "+str(kc.n_rater))
print("Number of rating categories: "+str(kc.n_category))
print("Number of samples: "+str(kc.y_count.shape[0]))

Output:

Fleiss's kappa (statsmodels.stats.inter_rater): -0.14989733059548255
Fleiss's kappa (mtbp3Lab.statlab): -0.14989733059548255
Number of raters per sample: 4.0
Number of rating categories: 3
Number of samples: 100
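
The same number can be reproduced without either package. The sketch below uses only the Python standard library to turn the five rater lists into the count table of Table 11 and then applies Eq. (2) to Eq. (4). The variable names are illustrative, and the code assumes, as in the data above, that every sample has the same number of non-'NA' ratings.

```python
from collections import Counter

r1 = ['NA'] * 20 + ['B'] * 50 + ['A'] * 30
r2 = ['A'] * 20 + ['NA'] * 20 + ['B'] * 60
r3 = ['A'] * 40 + ['NA'] * 20 + ['B'] * 30 + ['C'] * 10
r4 = ['B'] * 60 + ['NA'] * 20 + ['C'] * 10 + ['A'] * 10
r5 = ['C'] * 60 + ['A'] * 10 + ['B'] * 10 + ['NA'] * 20
raters = [r1, r2, r3, r4, r5]

# Rating categories actually used, excluding the "not rated" marker
categories = sorted({y for r in raters for y in r if y != 'NA'})

# One row per sample: number of raters who chose each category
counts = []
for ratings in zip(*raters):
    tally = Counter(y for y in ratings if y != 'NA')
    counts.append([tally[c] for c in categories])

n = len(counts)
R = sum(counts[0])  # assumes the same number of ratings for every sample
p_o = sum(sum(c * (c - 1) for c in row) for row in counts) / (n * R * (R - 1))
p_e = sum((sum(col) / (n * R)) ** 2 for col in zip(*counts))
kappa_value = (p_o - p_e) / (1 - p_e)
print(categories, n, R, kappa_value)
```

Here \(\bar{p}_O = 0.3\) and \(\bar{p}_E = 0.39125\), which gives the negative \(\kappa\) reported in the output above.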

Lab Exercise

Find a bootstrap CI for Fleiss’s kappa (see the function for the Cohen’s kappa CI).

More Details

Eq. (2) corresponds to the observed probability that two randomly selected raters agree on a sample, estimated from Table 11. Eq. (3) corresponds to the expected probability that two randomly selected raters agree on a sample under the assumption of no agreement, which corresponds to the assumption of \((N_{i1},\ldots, N_{iJ}) \sim multi(R, (p_1,\ldots, p_J))\) where \(R>4\).

Let \(S_{p2} = \sum_j p_j^2\), \(S_{p3} = \sum_j p_j^3\), and \(S_{p4} = \sum_j p_j^4\). Eq. (4) can be expressed as [2] (Eq. 9),

\[\kappa = \frac{\sum_{i=1}^{n}\sum_{j=1}^J N_{ij}^2 - nR\left(1+(R-1) S_{p2} \right)}{nR(R-1)(1- S_{p2} )}\]

Note that Fleiss (1971) assumed large \(n\) and fixed \(p_j\) when deriving the variance of kappa; see Fleiss (1971) for further discussion. The variance of \(\kappa\) under the assumption of no agreement beyond chance can be approximated as:

(5)\[var(\kappa) = c(n,R,\{p_j\}) var\left(\sum_{j=1}^J N_{1j}^2 \right),\]

where

\[c(n,R,\{p_j\}) = n^{-1}\left(R(R-1)\left(1-S_{p2}\right)\right)^{-2},\]

and

(6)\[\begin{split}var\left(\sum_{j} N_{ij}^2 \right) =& E\left( \left(\sum_{j} N_{ij}^2\right)^2\right) - \left(E\left(\sum_{j} N_{ij}^2\right)\right)^2 \\ =& E\left(\sum_{j} N_{ij}^4\right) + E\left(\sum_j\sum_{k \neq j} N_{ij}^2 N_{ik}^2 \right) - \left(E\left(\sum_{j} N_{ij}^2\right)\right)^2.\end{split}\]

To calculate Eq. (6), we can use the MGF, \(\left(\sum_{j}p_je^{t_j}\right)^R\), to derive \(E\left(N_{ij}^2\right) = Rp_j + R(R-1)p_j^2\), and \(E\left(N_{ij}^3\right) = Rp_j + 3R(R-1)p_j^2 + R(R-1)(R-2)p_j^3\).

The first element of Eq. (6) can be calculated as [2] (Eq. 12)

(7)\[E\left(\sum_{j} N_{ij}^4\right) = R + 7R(R-1)S_{p2} + 6R(R-1)(R-2)S_{p3} + R(R-1)(R-2)(R-3)S_{p4}\]

The third element of Eq. (6) can be calculated as [2] (Eq. 14)

(8)\[\begin{split}\left(E\left(\sum_{j} N_{ij}^2\right)\right)^2 =& R^2\left(1 + (R-1)S_{p2} \right)^2 \\ =& R^2 + R^2(R-1)\left(2 S_{p2} + (R-1)S_{p2}^2\right)\end{split}\]

The second element of Eq. (6) can be calculated, using \(E\left( N_{ij}^2 N_{ik}^2 \right) = R(R-1)p_j(p_k+(R-2)p_k^2) + R(R-1)(R-2)p_j^2(p_k+(R-3)p_k^2)\), as

(9)\[\begin{split}E\left( \sum_j\sum_{k \neq j} N_{ij}^2 N_{ik}^2 \right) =& R(R-1) + R(R-1)(2R-5)S_{p2} - 2R(R-1)(R-2)S_{p3} \\ &- R(R-1)(R-2)(R-3)S_{p4} + R(R-1)(R-2)(R-3) S_{p2}^2\end{split}\]

Combining Eq. (7), Eq. (8), and Eq. (9), Eq. (6) can be calculated as [2] (Eq. 15)

\[var\left(\sum_{j} N_{ij}^2 \right) = 2R(R-1)\left(S_{p2} - (2R-3)S_{p2}^2 + 2(R-2)S_{p3}\right).\]
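
One way to gain confidence in this closed form is to compare it with the exact variance obtained by enumerating every outcome of a small multinomial distribution. The sketch below does that with the standard library only; the function names and the choice \(R=5\), \(p=(0.5, 0.3, 0.2)\) are illustrative assumptions, not values from the original page.

```python
from itertools import product
from math import factorial, prod


def exact_var_sum_sq(R, p):
    """Exact Var(sum_j N_j^2) for (N_1, ..., N_J) ~ multinomial(R, p), by enumeration."""
    first_moment = second_moment = 0.0
    for counts in product(range(R + 1), repeat=len(p)):
        if sum(counts) != R:
            continue  # keep only count vectors that sum to R
        # Multinomial coefficient R! / (N_1! ... N_J!), computed exactly
        coef = factorial(R)
        for c in counts:
            coef //= factorial(c)
        prob = coef * prod(pj ** c for pj, c in zip(p, counts))
        s = sum(c * c for c in counts)
        first_moment += prob * s
        second_moment += prob * s * s
    return second_moment - first_moment ** 2


def closed_form_var_sum_sq(R, p):
    """Closed form above: 2R(R-1)(S_p2 - (2R-3) S_p2^2 + 2(R-2) S_p3)."""
    s2 = sum(pj ** 2 for pj in p)
    s3 = sum(pj ** 3 for pj in p)
    return 2 * R * (R - 1) * (s2 - (2 * R - 3) * s2 ** 2 + 2 * (R - 2) * s3)


R, p = 5, [0.5, 0.3, 0.2]
exact = exact_var_sum_sq(R, p)
closed = closed_form_var_sum_sq(R, p)
print(exact, closed)
```

The enumeration grows as \((R+1)^J\), so it is only practical as a check for small \(R\) and \(J\).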

Let \(s^2\) be the estimated variance of \(\kappa\) using Eq. (5). Under the hypothesis of no agreement beyond chance, the limiting distribution of \(\kappa/s\) is standard normal. The value of \(\kappa/s\) can then be used to describe whether the overall agreement is greater than expected by chance alone [2].
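
The steps from Eq. (5) to \(\kappa/s\) can be sketched as follows. The sketch assumes that \(s^2\) is obtained by plugging the estimated proportions \(p_{E,j}\) into Eq. (5) in place of \(p_j\); the function name `fleiss_kappa_z` is hypothetical, and the counts are those of Table 13.

```python
from math import sqrt


def fleiss_kappa_z(counts):
    """Return (kappa, estimated var(kappa), kappa / s) for an n x J count table."""
    n = len(counts)
    R = sum(counts[0])  # assumes every row sums to the same R
    # Plug-in estimates of p_j (assumption: p_{E,j} replaces p_j)
    p = [sum(col) / (n * R) for col in zip(*counts)]
    s2 = sum(pj ** 2 for pj in p)
    s3 = sum(pj ** 3 for pj in p)
    p_o = sum(sum(c * (c - 1) for c in row) for row in counts) / (n * R * (R - 1))
    kappa = (p_o - s2) / (1 - s2)
    # Variance of sum_j N_ij^2 under no agreement (closed form of Eq. (6))
    var_sum_sq = 2 * R * (R - 1) * (s2 - (2 * R - 3) * s2 ** 2 + 2 * (R - 2) * s3)
    # Scaling constant c(n, R, {p_j}) from Eq. (5)
    c = 1 / (n * (R * (R - 1) * (1 - s2)) ** 2)
    var_kappa = c * var_sum_sq
    return kappa, var_kappa, kappa / sqrt(var_kappa)


table_13 = [[3, 3, 3, 3]] * 5
kappa, var_kappa, z = fleiss_kappa_z(table_13)
print(kappa, var_kappa, z)
```

For Table 13 the estimated variance works out to \(1/990\). With only \(n=5\) samples the large-\(n\) normal approximation is rough, so the resulting \(\kappa/s\) is shown for illustration only.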

Lab Exercise

Find \(Cov(N_{i1},N_{i2})\) under the no-agreement assumption.
