Reliability/cate/Fleiss’s Kappa
Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.
Background
Fleiss’s kappa (\(\kappa\)) is a statistic used for describing inter-rater reliability of multiple independent raters with categorical rating outcomes [1] [2].
Notation
Assume there are \(R+N_0\) raters in total (\(R \geq 2\)), and each of \(n\) samples is rated by \(R\) randomly selected raters and not rated by the remaining \(N_0\) raters. For ratings with \(J\) categories, let \(Y_{r,i} \in \{v_0, v_1, v_2, \ldots, v_J\}\) represent the rating from rater \(r = 1, 2, \ldots, R+N_0\) for sample \(i = 1, \ldots, n\). Let \(N_{ij}\) represent the total number of raters who gave rating \(v_j\) to sample \(i\), where \(j \in \{0, 1, \ldots, J\}\). The value \(v_0\) indicates that a rater did not rate sample \(i\), and \(N_{i0} = N_0\) is a fixed number for all \(i\). Therefore, \(v_0\) is not included in the discussion below.
| | \(v_1\) | \(v_2\) | \(\ldots\) | \(v_J\) | Row Total |
|---|---|---|---|---|---|
| Sample: 1 | \(N_{11}\) | \(N_{12}\) | \(\ldots\) | \(N_{1J}\) | \(R\) |
| Sample: 2 | \(N_{21}\) | \(N_{22}\) | \(\ldots\) | \(N_{2J}\) | \(R\) |
| Sample: 3 | \(N_{31}\) | \(N_{32}\) | \(\ldots\) | \(N_{3J}\) | \(R\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| Sample: \(n\) | \(N_{n1}\) | \(N_{n2}\) | \(\ldots\) | \(N_{nJ}\) | \(R\) |
| Column total | \(N_{\bullet 1}\) | \(N_{\bullet 2}\) | \(\ldots\) | \(N_{\bullet J}\) | \(nR\) |
The observed average agreement is calculated as
\[\bar{p}_O = \frac{1}{n}\sum_{i=1}^n p_{O,i}, \tag{2}\]
where \(p_{O,i} = \frac{1}{R(R-1)} \left(\sum_{j=1}^J N_{ij}(N_{ij}-1)\right)= \frac{1}{R(R-1)} \left(\sum_{j=1}^J N_{ij}^2 - R\right)\).
The expected agreement is calculated as
\[\bar{p}_E = \sum_{j=1}^J p_{E,j}^2, \tag{3}\]
where \(p_{E,j} = \frac{N_{\bullet j}}{nR}\).
Fleiss’s \(\kappa\) statistic is calculated from Eq. (2) and Eq. (3) as
\[\kappa = \frac{\bar{p}_O - \bar{p}_E}{1 - \bar{p}_E}. \tag{4}\]
Example - Group-1
| | \(v_1\) | \(v_2\) | \(v_3\) | \(v_4\) |
|---|---|---|---|---|
| Sample 1 | 12 | 0 | 0 | 0 |
| Sample 2 | 0 | 12 | 0 | 0 |
| Sample 3 | 0 | 0 | 12 | 0 |
| Sample 4 | 0 | 0 | 12 | 0 |
| Sample 5 | 0 | 0 | 0 | 12 |
| Column Total | 12 | 12 | 24 | 12 |
| | \(v_1\) | \(v_2\) | \(v_3\) | \(v_4\) |
|---|---|---|---|---|
| Sample 1 | 3 | 3 | 3 | 3 |
| Sample 2 | 3 | 3 | 3 | 3 |
| Sample 3 | 3 | 3 | 3 | 3 |
| Sample 4 | 3 | 3 | 3 | 3 |
| Sample 5 | 3 | 3 | 3 | 3 |
| Column Total | 15 | 15 | 15 | 15 |
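As a check on the two tables above, the following minimal sketch applies Eq. (2), Eq. (3), and Eq. (4) to each table with exact fractions. The function name `fleiss_kappa_exact` and the variable names are illustrative and are not part of any library; the sketch assumes every sample has the same number of ratings \(R\).

```python
from fractions import Fraction


def fleiss_kappa_exact(counts):
    """Fleiss's kappa from an n-by-J table of counts N_ij with equal row totals R."""
    n = len(counts)
    R = sum(counts[0])
    assert all(sum(row) == R for row in counts), "every sample needs R ratings"
    # Eq. (2): average over samples of p_{O,i} = sum_j N_ij (N_ij - 1) / (R (R - 1))
    p_obs = sum(
        Fraction(sum(c * (c - 1) for c in row), R * (R - 1)) for row in counts
    ) / n
    # Eq. (3): sum over categories of the squared column proportions N_.j / (n R)
    col_totals = [sum(col) for col in zip(*counts)]
    p_exp = sum(Fraction(t, n * R) ** 2 for t in col_totals)
    # Eq. (4): agreement beyond chance, scaled by its maximum possible value
    return (p_obs - p_exp) / (1 - p_exp)


# First table: all 12 raters agree on every sample.
table_1 = [
    [12, 0, 0, 0],
    [0, 12, 0, 0],
    [0, 0, 12, 0],
    [0, 0, 12, 0],
    [0, 0, 0, 12],
]
# Second table: ratings are spread evenly over the four categories.
table_2 = [[3, 3, 3, 3] for _ in range(5)]

kappa_1 = fleiss_kappa_exact(table_1)
kappa_2 = fleiss_kappa_exact(table_2)
print(kappa_1, kappa_2)  # prints: 1 -1/11
```

The first table gives \(\kappa = 1\) (complete agreement), while the second gives \(\kappa = -1/11\), slightly below zero because an even split produces less pairwise agreement than the chance level of \(1/4\).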
How-to
To use both statsmodels.stats.inter_rater and mtbp3Lab.statlab:
```python
import statsmodels.stats.inter_rater as ir
from mtbp3Lab.statlab import kappa

# Each list holds one rater's ratings for the 100 samples;
# 'NA' marks a sample that the rater did not rate.
r1 = ['NA'] * 20 + ['B'] * 50 + ['A'] * 30
r2 = ['A'] * 20 + ['NA'] * 20 + ['B'] * 60
r3 = ['A'] * 40 + ['NA'] * 20 + ['B'] * 30 + ['C'] * 10
r4 = ['B'] * 60 + ['NA'] * 20 + ['C'] * 10 + ['A'] * 10
r5 = ['C'] * 60 + ['A'] * 10 + ['B'] * 10 + ['NA'] * 20
data = [r1, r2, r3, r4, r5]

# KappaCalculator is assumed to be defined in the imported kappa module;
# a separate name (calc) avoids shadowing that module.
calc = kappa.KappaCalculator(data, stringna='NA')
print(f"Fleiss's kappa (statsmodels.stats.inter_rater): {ir.fleiss_kappa(calc.y_count)}")
print(f"Fleiss's kappa (mtbp3Lab.statlab): {calc.fleiss_kappa}")
print(f"Number of raters per sample: {calc.n_rater}")
print(f"Number of rating categories: {calc.n_category}")
print(f"Number of samples: {calc.y_count.shape[0]}")
```
Output:
```
Fleiss's kappa (statsmodels.stats.inter_rater): -0.14989733059548255
Fleiss's kappa (mtbp3Lab.statlab): -0.14989733059548255
Number of raters per sample: 4.0
Number of rating categories: 3
Number of samples: 100
```
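The output above can be cross-checked without either library. The sketch below, which uses only the Python standard library, tallies the same ratings into the \(N_{ij}\) table (dropping `'NA'`, which plays the role of \(v_0\)) and then evaluates the observed agreement, the expected agreement, and \(\kappa\). The names `raters`, `categories`, `counts`, and `kappa_value` are illustrative, and the sketch assumes every sample ends up with the same number of ratings.

```python
from collections import Counter

# Ratings copied from the How-to example; 'NA' means "not rated".
r1 = ['NA'] * 20 + ['B'] * 50 + ['A'] * 30
r2 = ['A'] * 20 + ['NA'] * 20 + ['B'] * 60
r3 = ['A'] * 40 + ['NA'] * 20 + ['B'] * 30 + ['C'] * 10
r4 = ['B'] * 60 + ['NA'] * 20 + ['C'] * 10 + ['A'] * 10
r5 = ['C'] * 60 + ['A'] * 10 + ['B'] * 10 + ['NA'] * 20
raters = [r1, r2, r3, r4, r5]

# Rating categories v_1..v_J, excluding the "not rated" marker.
categories = sorted({y for ratings in raters for y in ratings} - {'NA'})

# Row i of counts holds N_i1..N_iJ for sample i.
counts = []
for sample_ratings in zip(*raters):
    tally = Counter(y for y in sample_ratings if y != 'NA')
    counts.append([tally[c] for c in categories])

n = len(counts)
R = sum(counts[0])
assert all(sum(row) == R for row in counts), "every sample needs R ratings"

# Observed agreement: mean over samples of sum_j N_ij (N_ij - 1) / (R (R - 1)).
p_obs = sum(c * (c - 1) for row in counts for c in row) / (n * R * (R - 1))
# Expected agreement: sum over categories of squared column proportions.
p_exp = sum((sum(col) / (n * R)) ** 2 for col in zip(*counts))
kappa_value = (p_obs - p_exp) / (1 - p_exp)

print(n, R, len(categories), kappa_value)
```

Here each sample is rated by 4 of the 5 raters, the observed agreement is 0.3, the expected agreement is 0.39125, and the resulting \(\kappa\) agrees with the printed value up to floating-point rounding.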
Lab Exercise
Find a bootstrap CI for Fleiss’s kappa (see the function for the Cohen’s kappa CI).
More Details
Eq. (2) corresponds to the observed probability that two randomly selected raters agree on a sample, estimated from Table 1. Eq. (3) corresponds to the expected probability that two randomly selected raters agree on a sample under the assumption of no agreement, which corresponds to the assumption \((N_{i1},\ldots, N_{iJ}) \sim \mathrm{multi}(R, (p_1,\ldots, p_J))\), where \(R>4\).
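Let \(S_{p2} = \sum_j p_j^2\), \(S_{p3} = \sum_j p_j^3\), and \(S_{p4} = \sum_j p_j^4\). With \(p_j\) in place of \(p_{E,j}\), Eq. (4) can be expressed as [2] (Eq. 9),
\[\kappa = \frac{\sum_{i=1}^n \sum_{j=1}^J N_{ij}^2 - nR\left(1 + (R-1)S_{p2}\right)}{nR(R-1)\left(1 - S_{p2}\right)}.\]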
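Note that Fleiss (1971) assumed large \(n\) and fixed \(p_j\) when deriving the variance of kappa. See Fleiss (1971) for more discussion. Because the \(n\) samples are independent and identically distributed, the variance of \(\kappa\) under the assumption of no agreement beyond chance can be approximated as:
\[Var(\kappa) \approx c \, Var\left(\sum_{j} N_{ij}^2\right), \tag{5}\]
where
\[c = \frac{1}{n R^2 (R-1)^2 \left(1 - S_{p2}\right)^2},\]
and
\[Var\left(\sum_{j} N_{ij}^2\right) = \sum_{j} E\left(N_{ij}^4\right) + \sum_{j \neq k} E\left(N_{ij}^2 N_{ik}^2\right) - \left(E\left(\sum_{j} N_{ij}^2\right)\right)^2. \tag{6}\]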
To calculate Eq. (6), we can use the MGF, \(\left(\sum_{j}p_je^{t_j}\right)^R\), to derive \(E\left(N_{ij}^2\right) = Rp_j + R(R-1)p_j^2\), and \(E\left(N_{ij}^3\right) = Rp_j + 3R(R-1)p_j^2 + R(R-1)(R-2)p_j^3\).
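The moment formulas above can be checked numerically. Under the multinomial assumption each \(N_{ij}\) is marginally binomial with parameters \(R\) and \(p_j\), so the sketch below sums \(m^k\) over the exact binomial probabilities and compares the result with the closed forms. The values `R = 6` and `p = 3/10` are arbitrary illustrative choices, and the function name is hypothetical. The fourth moment, obtained from the same MGF, is included because it is needed for the first element of Eq. (6).

```python
from fractions import Fraction
from math import comb


def binomial_raw_moment(R, p, k):
    """E(N^k) for N ~ Binomial(R, p), summed over the exact probability mass function."""
    return sum(
        comb(R, m) * p**m * (1 - p) ** (R - m) * m**k for m in range(R + 1)
    )


R, p = 6, Fraction(3, 10)  # illustrative values only

# Closed forms obtained by differentiating the MGF.
second = R * p + R * (R - 1) * p**2
third = R * p + 3 * R * (R - 1) * p**2 + R * (R - 1) * (R - 2) * p**3
fourth = (
    R * p
    + 7 * R * (R - 1) * p**2
    + 6 * R * (R - 1) * (R - 2) * p**3
    + R * (R - 1) * (R - 2) * (R - 3) * p**4
)

checks = {
    "E(N^2)": binomial_raw_moment(R, p, 2) == second,
    "E(N^3)": binomial_raw_moment(R, p, 3) == third,
    "E(N^4)": binomial_raw_moment(R, p, 4) == fourth,
}
print(checks)
```

Exact fractions are used so that the comparison is an equality test rather than a floating-point tolerance check.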
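The first element of Eq. (6) can be calculated as [2] (Eq. 12)
\[\sum_{j} E\left(N_{ij}^4\right) = R + 7R(R-1)S_{p2} + 6R(R-1)(R-2)S_{p3} + R(R-1)(R-2)(R-3)S_{p4}. \tag{7}\]
The third element of Eq. (6) can be calculated as [2] (Eq. 14)
\[\left(E\left(\sum_{j} N_{ij}^2\right)\right)^2 = R^2\left(1 + (R-1)S_{p2}\right)^2. \tag{8}\]
The second element of Eq. (6) can be calculated, using \(E\left( N_{ij}^2 N_{ik}^2 \right) = R(R-1)p_j(p_k+(R-2)p_k^2) + R(R-1)(R-2)p_j^2(p_k+(R-3)p_k^2)\), as
\[\sum_{j \neq k} E\left(N_{ij}^2 N_{ik}^2\right) = R(R-1)\left(1 - S_{p2}\right) + 2R(R-1)(R-2)\left(S_{p2} - S_{p3}\right) + R(R-1)(R-2)(R-3)\left(S_{p2}^2 - S_{p4}\right). \tag{9}\]
Combining Eq. (7), Eq. (8), and Eq. (9), Eq. (6) can be calculated as [2] (Eq. 15)
\[Var\left(\sum_{j} N_{ij}^2\right) = 2R(R-1)\left(S_{p2} - (2R-3)S_{p2}^2 + 2(R-2)S_{p3}\right). \tag{10}\]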
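The combined result can be verified by brute force for a small case. The sketch below enumerates every outcome of a multinomial vector with \(R\) trials, computes \(Var\left(\sum_j N_{ij}^2\right)\) exactly, and compares it with the closed form \(2R(R-1)\left(S_{p2} - (2R-3)S_{p2}^2 + 2(R-2)S_{p3}\right)\). The choice of `R = 5` and the three probabilities are arbitrary illustrative values, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial_pmf(counts, probs):
    """Exact probability of one outcome (N_1, ..., N_J) under multi(R, probs)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # stays an integer at every step
    prob = Fraction(coef)
    for c, p in zip(counts, probs):
        prob *= p**c
    return prob


def var_sum_squares_enumerated(R, probs):
    """Var(sum_j N_j^2) computed by enumerating all outcomes with total R."""
    mean = Fraction(0)
    mean_sq = Fraction(0)
    for counts in product(range(R + 1), repeat=len(probs)):
        if sum(counts) != R:
            continue
        weight = multinomial_pmf(counts, probs)
        total = sum(c * c for c in counts)
        mean += weight * total
        mean_sq += weight * total * total
    return mean_sq - mean**2


def var_sum_squares_closed_form(R, probs):
    """Closed form in terms of S_p2 and S_p3."""
    s2 = sum(p**2 for p in probs)
    s3 = sum(p**3 for p in probs)
    return 2 * R * (R - 1) * (s2 - (2 * R - 3) * s2**2 + 2 * (R - 2) * s3)


R = 5
probs = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # illustrative values

enumerated = var_sum_squares_enumerated(R, probs)
closed_form = var_sum_squares_closed_form(R, probs)
print(enumerated == closed_form)
```

Note that \(S_{p4}\) does not appear in the closed form: the fourth-power terms from the first and second elements cancel.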
Let \(s^2\) be the estimated variance of \(\kappa\) using Eq. (5). Under the hypothesis of no agreement beyond chance, the limiting distribution of \(\kappa/s\) is standard normal. The value of \(\kappa/s\) can then be used to describe whether the overall agreement is greater than expected by chance alone [2].
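As an illustration of the statistic \(\kappa/s\), the sketch below evaluates it for the second example table (every cell equal to 3). It assumes that \(p_j\) is estimated by the column proportions \(N_{\bullet j}/(nR)\) and that the variance under no agreement takes the form \(2\left(S_{p2} - (2R-3)S_{p2}^2 + 2(R-2)S_{p3}\right) / \left(nR(R-1)(1-S_{p2})^2\right)\), which is what results from substituting the closed form of \(Var\left(\sum_j N_{ij}^2\right)\) into the variance approximation. The function and variable names are illustrative.

```python
from math import sqrt


def fleiss_kappa_z(counts):
    """Return (kappa, s, kappa / s) for an n-by-J count table with equal row totals."""
    n = len(counts)
    R = sum(counts[0])
    assert all(sum(row) == R for row in counts), "every sample needs R ratings"
    # Column proportions stand in for the unknown p_j (an assumption of this sketch).
    props = [sum(col) / (n * R) for col in zip(*counts)]
    s2 = sum(p**2 for p in props)
    s3 = sum(p**3 for p in props)
    # Observed agreement and kappa.
    p_obs = sum(c * (c - 1) for row in counts for c in row) / (n * R * (R - 1))
    kappa_hat = (p_obs - s2) / (1 - s2)
    # Variance of kappa under the hypothesis of no agreement beyond chance.
    var0 = (
        2 * (s2 - (2 * R - 3) * s2**2 + 2 * (R - 2) * s3)
        / (n * R * (R - 1) * (1 - s2) ** 2)
    )
    se0 = sqrt(var0)
    return kappa_hat, se0, kappa_hat / se0


table_2 = [[3, 3, 3, 3] for _ in range(5)]
kappa_hat, se0, z = fleiss_kappa_z(table_2)
print(kappa_hat, se0, z)
```

For this table \(\kappa = -1/11\) and \(s^2 = 1/990\), so \(\kappa/s\) is about \(-2.86\). A negative value indicates agreement below the chance level rather than above it.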
Lab Exercise
Find \(Cov(N_{i1},N_{i2})\) under the no-agreement assumption.