Kappa Variations
Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.
Background
Cohen’s kappa (\(\kappa\)) is a statistic used for describing inter-rater reliability of two raters (or intra-rater) with categorical rating outcomes [1]. This page discusses some variations.
Notation
For two raters and two or more category ratings, let \(Y_{r,i} \in \{v_1,\ldots, v_J \}\) represent the rating from rater \(r \in \{1,2\}\) for sample \(i \in \{ 1, \ldots, n \}\). Let \(N_{j_1,j_2}\) represent the total number of samples that received ratings \((v_{j_1}, v_{j_2})\) from two raters, where \(j_1,j_2 \in \{1,\ldots,J\}\). See Table 14.
Table 14: Counts of ratings from two raters for \(J\) categories.

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Rater 2: \(v_3\) | \(\ldots\) | Row Total |
|---|---|---|---|---|---|
| Rater 1: \(v_1\) | \(N_{11}\) | \(N_{12}\) | \(N_{13}\) | \(\ldots\) | \(N_{1\bullet}\) |
| Rater 1: \(v_2\) | \(N_{21}\) | \(N_{22}\) | \(N_{23}\) | \(\ldots\) | \(N_{2\bullet}\) |
| Rater 1: \(v_3\) | \(N_{31}\) | \(N_{32}\) | \(N_{33}\) | \(\ldots\) | \(N_{3\bullet}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| Column Total | \(N_{\bullet 1}\) | \(N_{\bullet 2}\) | \(N_{\bullet 3}\) | \(\ldots\) | \(n\) |
The observed raw proportion of agreement is defined as \(p_O = \sum_{j=1}^J N_{jj} / n\). The expected number of agreements is estimated by \(\sum_{j=1}^J\hat{E}_{j} = \frac{1}{n}\sum_{j=1}^J N_{\bullet j} N_{j\bullet} \equiv n p_E\). Cohen's \(\kappa\) statistic is then calculated as \(\kappa = \frac{p_O - p_E}{1-p_E}\), and the SE of \(\kappa\) is calculated as \(\sqrt{\frac{p_O(1-p_O)}{n(1-p_E)^2}}\).
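To make these formulas concrete, here is a minimal sketch in Python with NumPy (the function name `cohens_kappa` and the table layout are my own choices, following Table 14): it computes \(p_O\), \(p_E\), \(\kappa\), and the SE from a \(J \times J\) table of counts.

```python
import numpy as np

def cohens_kappa(counts):
    """Cohen's kappa and its SE from a J x J table of counts N[j1, j2]
    (rows = rater 1's ratings, columns = rater 2's ratings)."""
    N = np.asarray(counts, dtype=float)
    n = N.sum()
    p_o = np.trace(N) / n                                # observed agreement
    p_e = np.sum(N.sum(axis=1) * N.sum(axis=0)) / n**2   # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, se

# Counts from Table 18 (Example 1 below): kappa ≈ 0.70, SE ≈ 0.07
kappa, se = cohens_kappa([[40, 9], [6, 45]])
print(f"kappa = {kappa:.3f}, SE = {se:.3f}")
```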
Bias, Prevalence and Adjusted Kappas (Byrt et al., 1993)
All discussion in this section is based on Byrt, T., Bishop, J., & Carlin, J. B. (1993) [1] unless cited otherwise.
Bias Index (BI)
For two raters and two categories (\(J=2\)), Byrt et al. define the Bias Index (BI) as the difference between the two raters' probabilities of giving one of the ratings (here \(v_1\)), which can be estimated as:

\[
\hat{BI} = \frac{N_{1\bullet} - N_{\bullet 1}}{n} = \frac{N_{12} - N_{21}}{n}.
\]

\(\hat{BI}\) has the following properties:

- when the two off-diagonal counts are equal, i.e. \(N_{12} = N_{21}\), then \(\hat{BI} = 0\);
- when the two raters give rating \(v_1\) with the same frequency, i.e. \(N_{1 \bullet} = N_{\bullet 1}\) (equivalently \(N_{11}+N_{12} = N_{11}+N_{21}\)), then \(\hat{BI} = 0\);
- when \(N_{12} = n\) or \(N_{21} = n\), then \(|\hat{BI}| = 1\).
Note that the sign of \(\hat{BI}\) depends on which rater (\(r=1\) or \(r=2\)) is assigned as "rater A". Within this page, rater \(r=2\) corresponds to the rater labeled "A" in Byrt et al. (1993), so that the table structures are similar.
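As a small sketch (assuming a \(2 \times 2\) table laid out as in Table 14 and the sign convention above; the function name is illustrative), \(\hat{BI}\) can be computed as:

```python
import numpy as np

def bias_index(counts):
    """Bias Index for a 2 x 2 count table: (N12 - N21) / n."""
    N = np.asarray(counts, dtype=float)
    return (N[0, 1] - N[1, 0]) / N.sum()

# Counts from Table 18 (Example 1 below): N12 = 9, N21 = 6
print(bias_index([[40, 9], [6, 45]]))   # (9 - 6) / 100 = 0.03
```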
Bias-adjusted Kappa (BAK)
BAK is defined as the kappa calculated after replacing \(N_{12}\) and \(N_{21}\) with

\[
N_{12}^{(BA)} = N_{21}^{(BA)} = \frac{N_{12} + N_{21}}{2}.
\]

That yields the adjusted marginal totals

\[
N_{1\bullet}^{(BA)} = N_{11} + \frac{N_{12} + N_{21}}{2}
\]

and

\[
N_{2\bullet}^{(BA)} = N_{22} + \frac{N_{12} + N_{21}}{2},
\]

which serve as both the row and column totals.
See Table 15.
Table 15: Adjusted counts used for BAK.

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | \(N_{11}\) | \(N_{12}^{(BA)}\) | \(N_{1\bullet}^{(BA)}\) |
| Rater 1: \(v_2\) | \(N_{12}^{(BA)}\) | \(N_{22}\) | \(N_{2\bullet}^{(BA)}\) |
| Column Total | \(N_{1 \bullet}^{(BA)}\) | \(N_{2 \bullet}^{(BA)}\) | \(n\) |
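A sketch of the BAK computation implied by Table 15 (replace both off-diagonal counts with their average, then apply the usual kappa formula; the function names are my own):

```python
import numpy as np

def kappa_from_table(counts):
    """Plain Cohen's kappa from a square table of counts."""
    N = np.asarray(counts, dtype=float)
    n = N.sum()
    p_o = np.trace(N) / n
    p_e = np.sum(N.sum(axis=1) * N.sum(axis=0)) / n**2
    return (p_o - p_e) / (1 - p_e)

def bias_adjusted_kappa(counts):
    """BAK: average the two off-diagonal counts, then recompute kappa."""
    N = np.asarray(counts, dtype=float)
    off = (N[0, 1] + N[1, 0]) / 2.0
    return kappa_from_table([[N[0, 0], off],
                             [off, N[1, 1]]])

# Counts from Table 18 (Example 1 below): BAK ≈ 0.699
print(bias_adjusted_kappa([[40, 9], [6, 45]]))
```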
Prevalence Index (PI)
For two raters and two categories (\(J=2\)), Byrt et al. define the Prevalence Index (PI) as the difference between the two ratings' probabilities averaged over the two raters, which can be estimated as:

\[
\hat{PI} = \frac{N_{11} - N_{22}}{n}.
\]

\(\hat{PI}\) has the following properties:

- when \(N_{11} = N_{22}\), \(\hat{PI} = 0\);
- when \(N_{11} = n\), \(\hat{PI} = 1\);
- when \(N_{22} = n\), \(\hat{PI} = -1\).
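A matching sketch for \(\hat{PI}\) (same assumptions as the \(\hat{BI}\) sketch above; the function name is illustrative):

```python
import numpy as np

def prevalence_index(counts):
    """Prevalence Index for a 2 x 2 count table: (N11 - N22) / n."""
    N = np.asarray(counts, dtype=float)
    return (N[0, 0] - N[1, 1]) / N.sum()

# Counts from Table 18 (Example 1 below): N11 = 40, N22 = 45
print(prevalence_index([[40, 9], [6, 45]]))   # (40 - 45) / 100 = -0.05
```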
Prevalence-adjusted Bias-adjusted Kappa (PABAK)
PABAK is defined as the kappa calculated after replacing \(N_{12}\) and \(N_{21}\) as in BAK, and additionally replacing \(N_{11}\) and \(N_{22}\) with

\[
N_{11}^{(PA)} = N_{22}^{(PA)} = \frac{N_{11} + N_{22}}{2}.
\]

Note that the superscript "(PA)" is used to indicate that the replaced \(N_{11}^{(PA)}\) may differ from the originally observed \(N_{11}\). That yields

\[
N_{11}^{(PA)} + N_{12}^{(BA)} = \frac{N_{11} + N_{22}}{2} + \frac{N_{12} + N_{21}}{2} = \frac{n}{2},
\]

and therefore every row and column total of the adjusted table equals \(\frac{n}{2}\).
See illustration in Table 16.
Table 16: Adjusted counts used for PABAK.

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | \(N_{11}^{(PA)}\) | \(N_{12}^{(BA)}\) | \(\frac{n}{2}\) |
| Rater 1: \(v_2\) | \(N_{12}^{(BA)}\) | \(N_{11}^{(PA)}\) | \(\frac{n}{2}\) |
| Column Total | \(\frac{n}{2}\) | \(\frac{n}{2}\) | \(n\) |
Based on Table 16, we can find the adjusted \(p_E^{(BAPA)}\) and \(p_O^{(BAPA)}\):

\[
p_E^{(BAPA)} = \frac{1}{n^2}\left(\frac{n}{2}\cdot\frac{n}{2} + \frac{n}{2}\cdot\frac{n}{2}\right) = \frac{1}{2} \tag{10}
\]

and

\[
p_O^{(BAPA)} = \frac{N_{11}^{(PA)} + N_{11}^{(PA)}}{n} = \frac{N_{11} + N_{22}}{n} = p_O. \tag{11}
\]

Therefore, the \(\kappa\) value based on Table 16 can be calculated as

\[
\kappa^{(BAPA)} = \frac{p_O^{(BAPA)} - p_E^{(BAPA)}}{1 - p_E^{(BAPA)}} = \frac{p_O - \frac{1}{2}}{1 - \frac{1}{2}} = 2 p_O - 1, \tag{12}
\]

which is a linear function of \(p_O\) with possible values between \(-1\) and \(1\).
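Following Eq. (12), PABAK can be computed directly as \(2 p_O - 1\). The sketch below (illustrative names, counts from Table 18 in Example 1 below) also checks that this matches the kappa computed on the adjusted table of Table 16:

```python
import numpy as np

def pabak(counts):
    """PABAK for a 2 x 2 table via Eq. (12): 2 * p_O - 1."""
    N = np.asarray(counts, dtype=float)
    p_o = np.trace(N) / N.sum()
    return 2.0 * p_o - 1.0

def pabak_via_adjusted_table(counts):
    """Build the adjusted table of Table 16, then apply the kappa formula."""
    N = np.asarray(counts, dtype=float)
    n = N.sum()
    diag = (N[0, 0] + N[1, 1]) / 2.0        # N11^(PA) = N22^(PA)
    off = (N[0, 1] + N[1, 0]) / 2.0         # N12^(BA) = N21^(BA)
    A = np.array([[diag, off], [off, diag]])
    p_o = np.trace(A) / n
    p_e = np.sum(A.sum(axis=1) * A.sum(axis=0)) / n**2   # equals 1/2 here
    return (p_o - p_e) / (1 - p_e)

T18 = [[40, 9], [6, 45]]
print(pabak(T18), pabak_via_adjusted_table(T18))   # both ≈ 0.70
```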
Observed \(\kappa\) as a function of PABAK, \(\hat{BI}\), and \(\hat{PI}\)
From Eq. (12), we can see that [1] (Equation 1 and Appendix A)

\[
p_O = \frac{1}{2}\left(\kappa^{(BAPA)} + 1\right).
\]

Combining \(p_O = \frac{1}{n}(N_{11}+N_{22})\) and \(1-p_O = \frac{1}{n}(N_{12}+N_{21})\) with the definitions of \(\hat{BI}\) and \(\hat{PI}\), the observed counts can be expressed as in Table 17 below.
Table 17: Observed counts expressed in terms of \(p_O\), \(\hat{BI}\), and \(\hat{PI}\).

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | \(\frac{n}{2}(p_O + \hat{PI})\) | \(\frac{n}{2}(1 - p_O + \hat{BI})\) | \(\frac{n}{2}(1 + \hat{BI} + \hat{PI})\) |
| Rater 1: \(v_2\) | \(\frac{n}{2}(1 - p_O - \hat{BI})\) | \(\frac{n}{2}(p_O - \hat{PI})\) | \(\frac{n}{2}(1 - \hat{BI} - \hat{PI})\) |
| Column Total | \(\frac{n}{2}(1 - \hat{BI} + \hat{PI})\) | \(\frac{n}{2}(1 + \hat{BI} - \hat{PI})\) | \(n\) |
From Table 17, we can see that \(p_E = \frac{1}{2}( 1 - \hat{BI}^2 + \hat{PI}^2)\) and [1] (Equation 1 and Appendix A)

\[
\kappa = \frac{p_O - p_E}{1 - p_E} = \frac{\kappa^{(BAPA)} + \hat{BI}^2 - \hat{PI}^2}{1 + \hat{BI}^2 - \hat{PI}^2}. \tag{13}
\]

From Eq. (13), we can observe how \(\kappa\) changes with \(\kappa^{(BAPA)}\), \(\hat{BI}\), and \(\hat{PI}\).
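A quick numerical check of Eq. (13), using the counts from Table 18 in Example 1 below (the script and variable names are illustrative):

```python
import numpy as np

N = np.asarray([[40, 9], [6, 45]], dtype=float)   # counts from Table 18 below
n = N.sum()

p_o = np.trace(N) / n
p_e = np.sum(N.sum(axis=1) * N.sum(axis=0)) / n**2
kappa = (p_o - p_e) / (1 - p_e)                   # plain Cohen's kappa

pabak = 2.0 * p_o - 1.0                           # Eq. (12)
bi_hat = (N[0, 1] - N[1, 0]) / n                  # Bias Index
pi_hat = (N[0, 0] - N[1, 1]) / n                  # Prevalence Index

# Right-hand side of Eq. (13)
kappa_reexpressed = (pabak + bi_hat**2 - pi_hat**2) / (1 + bi_hat**2 - pi_hat**2)

print(kappa, kappa_reexpressed)   # the two values agree (≈ 0.70)
```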
Extend PABAK to More Than 2 Categories
Byrt et al. (1993) [1] discuss \(PABAK\) in detail for ratings with 2 categories, and mention the equivalence of \(PABAK\) to Bennett's \(S\), which can be calculated for two or more categories (\(J \geq 2\)) and whose variance formula is the one used by SAS [2] [3].
For \(J \geq 2\), Eq. (10) and Eq. (11) become

\[
p_E^{(BAPA)} = \sum_{j=1}^{J} \frac{1}{J} \cdot \frac{1}{J} = \frac{1}{J} \tag{14}
\]

and

\[
p_O^{(BAPA)} = p_O. \tag{15}
\]

Combining Eq. (14) and Eq. (15), we can see that Eq. (12) becomes

\[
\kappa^{(BAPA)} = \frac{p_O - \frac{1}{J}}{1 - \frac{1}{J}} = \frac{J p_O - 1}{J - 1}, \tag{16}
\]

which is a linear function of \(p_O\) for a fixed value of \(J\). The variance of \(\kappa^{(BAPA)}\) in Eq. (16) can be expressed as

\[
\widehat{\mathrm{Var}}\left(\kappa^{(BAPA)}\right) = \frac{p_O(1-p_O)}{n\left(1 - \frac{1}{R}\right)^2}. \tag{17}
\]
Note that Eq. (17) uses the notation \(R = J = \frac{1}{p_E^{(BAPA)}}\).
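A sketch of \(\kappa^{(BAPA)}\) (Bennett's \(S\)) together with the variance of Eq. (17) for a general \(J \times J\) table (the function name and table layout are my own choices):

```python
import numpy as np

def bennett_s(counts):
    """Bennett's S (PABAK for J >= 2), Eq. (16), with the variance of Eq. (17)."""
    N = np.asarray(counts, dtype=float)
    n = N.sum()
    J = N.shape[0]                      # number of categories (the R in Eq. (17))
    p_o = np.trace(N) / n               # observed agreement
    s = (J * p_o - 1.0) / (J - 1.0)     # Eq. (16): linear in p_O for fixed J
    var = p_o * (1.0 - p_o) / (n * (1.0 - 1.0 / J) ** 2)   # Eq. (17)
    return s, var

# For J = 2 this reduces to PABAK = 2 * p_O - 1 (counts from Table 18 below)
s, var = bennett_s([[40, 9], [6, 45]])
print(f"S = {s:.3f}, SE = {np.sqrt(var):.3f}")
```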
Discussion from the Original Paper
The first paragraph in the Discussion section of Byrt et al. (1993) [1] mentions:
“We have shown that for a \(2 \times 2\) table of agreement kappa can be simply expressed in terms of three easily interpreted indices. … The reexpression of kappa enables a clear explanation of the conceptually distinct and independent components that arise in the investigation of agreement.”
Examples
Example 1
Given a fixed \(p_O\), the \(\kappa\) statistic can be calculated as \(\kappa = 1 + \frac{p_O - 1}{1-p_E}\), which is a decreasing function of \(p_E\). Byrt et al. (1993) [1] (Table 1 and Table 2) quoted an example from Feinstein and Cicchetti (1990), reproduced as Table 18 and Table 19, showing that given the same values of \(p_O\), different values of \(p_E\) can yield \(\kappa\) “more than 2-fold higher in one instance than the other”.
Table 18: First example table of Example 1.

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 40 | 9 | 49 |
| Rater 1: \(v_2\) | 6 | 45 | 51 |
| Column Total | 46 | 54 | 100 |
Table 19: Second example table of Example 1.

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 80 | 10 | 90 |
| Rater 1: \(v_2\) | 5 | 5 | 10 |
| Column Total | 85 | 15 | 100 |
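The sketch below reproduces the comparison in Example 1 for the two tables above (the helper name is my own; the values in the comments follow from the formulas in the Notation section):

```python
import numpy as np

def kappa_summary(counts):
    """Return p_O, p_E, and kappa for a square table of counts."""
    N = np.asarray(counts, dtype=float)
    n = N.sum()
    p_o = np.trace(N) / n
    p_e = np.sum(N.sum(axis=1) * N.sum(axis=0)) / n**2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

table_18 = [[40, 9], [6, 45]]
table_19 = [[80, 10], [5, 5]]

print(kappa_summary(table_18))   # p_O = 0.85, p_E ≈ 0.50, kappa ≈ 0.70
print(kappa_summary(table_19))   # p_O = 0.85, p_E = 0.78, kappa ≈ 0.32
```

Both tables share \(p_O = 0.85\), but the highly unbalanced marginals of Table 19 raise \(p_E\) to 0.78, which pulls \(\kappa\) down to about 0.32, compared with about 0.70 for Table 18.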