Kappa Variations

Disclaimer: This page is provided for study and practice only. The author does not intend to promote or advocate any particular analysis method or software.

Background

Cohen’s kappa (\(\kappa\)) is a statistic that describes the inter-rater (or intra-rater) reliability of two raters with categorical rating outcomes [1]. This page discusses some of its variations.

Notation

For two raters and two or more rating categories, let \(Y_{r,i} \in \{v_1,\ldots, v_J \}\) represent the rating from rater \(r \in \{1,2\}\) for sample \(i \in \{ 1, \ldots, n \}\). Let \(N_{j_1,j_2}\) represent the total number of samples that received ratings \((v_{j_1}, v_{j_2})\) from raters 1 and 2, respectively, where \(j_1,j_2 \in \{1,\ldots,J\}\). See Table 14.

Table 14 Counts for 3 or more categories

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Rater 2: \(v_3\) | \(\ldots\) | Row Total |
| --- | --- | --- | --- | --- | --- |
| Rater 1: \(v_1\) | \(N_{11}\) | \(N_{12}\) | \(N_{13}\) | \(\ldots\) | \(N_{1\bullet}\) |
| Rater 1: \(v_2\) | \(N_{21}\) | \(N_{22}\) | \(N_{23}\) | \(\ldots\) | \(N_{2\bullet}\) |
| Rater 1: \(v_3\) | \(N_{31}\) | \(N_{32}\) | \(N_{33}\) | \(\ldots\) | \(N_{3\bullet}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| Column Total | \(N_{\bullet 1}\) | \(N_{\bullet 2}\) | \(N_{\bullet 3}\) | \(\ldots\) | \(n\) |

The observed proportion of agreement is defined as \(p_O = \sum_{j=1}^J N_{jj} / n\). The expected number of agreements is estimated by \(\sum_{j=1}^J\hat{E}_{j} = \frac{1}{n}\sum_{j=1}^J N_{\bullet j} N_{j\bullet} \equiv n p_E\). Cohen’s \(\kappa\) statistic is then calculated as \(\kappa = \frac{p_O - p_E}{1-p_E}\), and its standard error (SE) is \(\sqrt{\frac{p_O(1-p_O)}{n(1-p_E)^2}}\).
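As a concrete illustration of these formulas, here is a minimal NumPy sketch; the function name `cohens_kappa` and its array interface are choices made for this page, not from [1].

```python
import numpy as np

def cohens_kappa(counts):
    """Cohen's kappa and its SE from a J x J table of counts,
    with counts[j1 - 1, j2 - 1] = N_{j1, j2} (rater 1 on rows,
    rater 2 on columns)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_o = np.trace(counts) / n              # observed agreement p_O
    row = counts.sum(axis=1)                # row totals N_{j .}
    col = counts.sum(axis=0)                # column totals N_{. j}
    p_e = (row * col).sum() / n**2          # chance agreement p_E
    kappa = (p_o - p_e) / (1 - p_e)
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e)**2))
    return kappa, se
```

The later sketches on this page reuse this function.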

Bias, Prevalence and Adjusted Kappas (Byrt et al., 1993)

All discussion in this section is based on Byrt, T., Bishop, J., & Carlin, J. B. (1993) [1] unless cited otherwise.

Bias Index (BI)

For two raters and two categories (\(J=2\)), Byrt et al. define the Bias Index (BI) as the difference between the two raters’ probabilities of giving one particular rating, which can be estimated as:

\[\hat{BI} = \frac{1}{n}(N_{1 \bullet} - N_{\bullet 1}) = \frac{1}{n}(N_{12} - N_{21}).\]

\(\hat{BI}\) has the following properties:

  • when the two off-diagonal counts are equal, i.e., \(N_{12} = N_{21}\), then \(\hat{BI} = 0\);

  • when the two raters use each rating with the same frequency, i.e., \(N_{1 \bullet} = N_{\bullet 1}\) (equivalently, \(N_{11}+N_{12} = N_{11}+N_{21}\)), then \(\hat{BI} = 0\);

  • when \(N_{12} = n\) or \(N_{21}=n\), \(|\hat{BI}|=1\).

Note that the sign of \(\hat{BI}\) depends on which rater (\(r=1\) or \(r=2\)) is assigned as “rater A”. On this page, rater \(r=2\) corresponds to the rater labeled “A” in Byrt et al. (1993), so that the table structures are similar.

Bias-adjusted Kappa (BAK)

BAK is defined as the kappa calculated after replacing \(N_{12}\) and \(N_{21}\) with

\[N_{12}^{(BA)} = N_{21}^{(BA)} = \frac{1}{2}(N_{12} + N_{21}).\]

That yields

\[N_{1 \bullet}^{(BA)} = N_{\bullet 1}^{(BA)} = N_{11} + \frac{1}{2}(N_{12} + N_{21}),\]

and

\[N_{2 \bullet}^{(BA)} = N_{\bullet 2}^{(BA)} = N_{22} + \frac{1}{2}(N_{12} + N_{21}).\]

See Table 15.

Table 15 Adjusted counts for 2 categories - using BA

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
| --- | --- | --- | --- |
| Rater 1: \(v_1\) | \(N_{11}\) | \(N_{12}^{(BA)}\) | \(N_{1\bullet}^{(BA)}\) |
| Rater 1: \(v_2\) | \(N_{12}^{(BA)}\) | \(N_{22}\) | \(N_{2\bullet}^{(BA)}\) |
| Column Total | \(N_{1 \bullet}^{(BA)}\) | \(N_{2 \bullet}^{(BA)}\) | \(n\) |
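Reusing `cohens_kappa` from the sketch above, BAK amounts to symmetrizing the off-diagonal cells before applying the usual kappa formula; the function name `bak` is illustrative.

```python
def bak(counts):
    """Bias-adjusted kappa for a 2 x 2 table: average the two
    off-diagonal counts, then compute kappa on the adjusted table."""
    counts = np.asarray(counts, dtype=float)
    off = (counts[0, 1] + counts[1, 0]) / 2.0   # N_{12}^{(BA)} = N_{21}^{(BA)}
    adjusted = np.array([[counts[0, 0], off],
                         [off, counts[1, 1]]])
    return cohens_kappa(adjusted)[0]
```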

Prevalence Index (PI)

For two raters and two categories (\(J=2\)), Byrt et al. define the Prevalence Index (PI) as the difference between the rater-averaged probabilities of the two ratings, which can be estimated as:

\[\hat{PI} = \frac{1}{n}(N_{11} - N_{22}).\]

\(\hat{PI}\) has the following properties:

  • when \(N_{11} = N_{22}\), \(\hat{PI}=0\);

  • when \(N_{11} = n\), \(\hat{PI}=1\);

  • when \(N_{22} = n\), \(\hat{PI}=-1\).
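Both indices are simple functions of the \(2 \times 2\) counts; here is a minimal sketch consistent with the estimators above (function names are illustrative).

```python
def bias_index(counts):
    """BI estimate for a 2 x 2 table: (N12 - N21) / n."""
    counts = np.asarray(counts, dtype=float)
    return (counts[0, 1] - counts[1, 0]) / counts.sum()

def prevalence_index(counts):
    """PI estimate for a 2 x 2 table: (N11 - N22) / n."""
    counts = np.asarray(counts, dtype=float)
    return (counts[0, 0] - counts[1, 1]) / counts.sum()
```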

Prevalence-adjusted Bias-adjusted Kappa (PABAK)

PABAK is defined as the kappa calculated after replacing \(N_{12}\) and \(N_{21}\) as in BAK and additionally replacing \(N_{11}\) and \(N_{22}\) with:

\[N_{11}^{(PA)} = N_{22}^{(PA)} = \frac{1}{2}(N_{11} + N_{22}).\]

Note that the superscript “(PA)” indicates that the replaced \(N_{11}^{(PA)}\) may differ from the originally observed \(N_{11}\).

That yields

\[N_{1 \bullet}^{(BAPA)} = N_{\bullet 1}^{(BAPA)} = \frac{n}{2},\]

and

\[N_{2 \bullet}^{(BAPA)} = N_{\bullet 2}^{(BAPA)} = \frac{n}{2}.\]

See illustration in Table 16.

Table 16 Adjusted counts for 2 categories - using both BA and PA

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
| --- | --- | --- | --- |
| Rater 1: \(v_1\) | \(N_{11}^{(PA)}\) | \(N_{12}^{(BA)}\) | \(\frac{n}{2}\) |
| Rater 1: \(v_2\) | \(N_{12}^{(BA)}\) | \(N_{11}^{(PA)}\) | \(\frac{n}{2}\) |
| Column Total | \(\frac{n}{2}\) | \(\frac{n}{2}\) | \(n\) |

Based on Table 16, we can find the adjusted \(p_E^{(BAPA)}\) and \(p_O^{(BAPA)}\):

(10)\[p_E^{(BAPA)} = \frac{1}{n^2} \left( \frac{n}{2}\frac{n}{2} + \frac{n}{2}\frac{n}{2} \right) = \frac{1}{2},\]

and

(11)\[p_O^{(BAPA)} = \frac{1}{n} \left( N_{11}^{(PA)} + N_{22}^{(PA)} \right) = \frac{1}{n} \left( N_{11} + N_{22} \right) = p_O.\]

Therefore, the \(\kappa\) value based on Table 16 can be calculated as:

(12)\[\kappa^{(BAPA)} = \frac{p_O - 0.5}{1 - 0.5} = 2p_O - 1,\]

which is a linear function of \(p_O\) with possible values between -1 and 1.
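Since the adjustments leave \(p_O\) unchanged and force \(p_E^{(BAPA)} = 1/2\), PABAK needs only the observed agreement; a minimal sketch:

```python
def pabak(counts):
    """PABAK for a 2 x 2 table, via Eq. (12): 2 * p_O - 1."""
    counts = np.asarray(counts, dtype=float)
    p_o = np.trace(counts) / counts.sum()
    return 2.0 * p_o - 1.0
```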

Observed \(\kappa\) as a function of PABAK, \(\hat{BI}\), and \(\hat{PI}\)

From Eq. (12), we can see that [1] (Equation 1 and Appendix A)

  • \(p_O = \frac{1}{2}(\kappa^{(BAPA)} + 1)\)

  • combining \(p_O = \frac{1}{n}(N_{11}+N_{22})\) and \(1-p_O = \frac{1}{n}(N_{12}+N_{21})\) with the definitions of \(\hat{BI}\) and \(\hat{PI}\), the observed counts can be expressed as in Table 17 below

Table 17 Observed counts for 2 categories - expressed using \(p_O\), \(\hat{BI}\) and \(\hat{PI}\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
| --- | --- | --- | --- |
| Rater 1: \(v_1\) | \(\frac{n}{2}(p_O + \hat{PI})\) | \(\frac{n}{2}(1 - p_O + \hat{BI})\) | \(\frac{n}{2}(1 + \hat{BI} + \hat{PI})\) |
| Rater 1: \(v_2\) | \(\frac{n}{2}(1 - p_O - \hat{BI})\) | \(\frac{n}{2}(p_O - \hat{PI})\) | \(\frac{n}{2}(1 - \hat{BI} - \hat{PI})\) |
| Column Total | \(\frac{n}{2}(1 - \hat{BI} + \hat{PI})\) | \(\frac{n}{2}(1 + \hat{BI} - \hat{PI})\) | \(n\) |

From Table 17, we can see that \(p_E = \frac{1}{2}( 1 - \hat{BI}^2 + \hat{PI}^2)\) and [1] (Equation 1 and Appendix A)

(13)\[\kappa = \frac{\kappa^{(BAPA)} + \hat{BI}^2 - \hat{PI}^2}{1 + \hat{BI}^2 - \hat{PI}^2}.\]

From Eq. (13), we can see how \(\kappa\) changes with \(\kappa^{(BAPA)}\), \(\hat{BI}\), and \(\hat{PI}\): for a fixed \(\kappa^{(BAPA)} < 1\), \(\kappa\) increases with \(\hat{BI}^2\) and decreases with \(\hat{PI}^2\).
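Eq. (13) can be verified numerically with the sketches above; the counts below are arbitrary illustrative numbers, not data from [1].

```python
table = np.array([[30, 10],
                  [5, 55]])            # arbitrary illustrative counts
k_direct = cohens_kappa(table)[0]      # kappa from the definition
bi = bias_index(table)                 # 0.05
pi = prevalence_index(table)           # -0.25
k_bapa = pabak(table)                  # 0.70
k_eq13 = (k_bapa + bi**2 - pi**2) / (1 + bi**2 - pi**2)
print(np.isclose(k_direct, k_eq13))    # True
```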

Extending PABAK to More Than 2 Categories

Byrt et al. (1993) [1] discuss \(PABAK\) in detail for ratings in 2 categories and mention the equivalence of \(PABAK\) to Bennett’s \(S\), which can be calculated for more than 2 categories (\(J \geq 2\)); this yields the variance used by SAS [2] [3].

For \(J \geq 2\), Eq. (10) and Eq. (11) become

(14)\[p_E^{(BAPA)} = \frac{1}{n^2} \left( \sum_{j=1}^J \frac{n}{J}\frac{n}{J} \right) = \frac{1}{J},\]

and

(15)\[p_O^{(BAPA)} = \frac{1}{n} \left( \sum_{j=1}^J N_{jj}^{(PA)} \right) = \frac{1}{n} \left( \sum_{j=1}^J N_{jj} \right) = p_O.\]

Combining Eq. (14) and Eq. (15), we can see that Eq. (12) becomes

(16)\[\kappa^{(BAPA)} = \frac{p_O - \frac{1}{J}}{1 - \frac{1}{J}},\]

which is a linear function of \(p_O\) for a fixed number of categories \(J\). The variance of \(\kappa^{(BAPA)}\) in Eq. (16) can be expressed as

(17)\[\mathrm{var}\left(\kappa^{(BAPA)}\right) = \left(\frac{1}{1 - \frac{1}{J}}\right)^2\left(\frac{p_O(1-p_O)}{n}\right) = \left(\frac{J}{J - 1}\right)^2\left(\frac{p_O(1-p_O)}{n}\right).\]

From Eq. (17), note that the notation \(R\) in [2] [3] corresponds to \(J = \frac{1}{p_E^{(BAPA)}}\).
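A sketch of the multi-category version (Bennett’s \(S\)) together with the Eq. (17) variance; the function name `pabak_s` is illustrative.

```python
def pabak_s(counts):
    """PABAK / Bennett's S for a J x J table, with the Eq. (17) variance."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    J = counts.shape[0]
    p_o = np.trace(counts) / n
    s = (p_o - 1.0 / J) / (1.0 - 1.0 / J)              # Eq. (16)
    var = (J / (J - 1.0))**2 * p_o * (1 - p_o) / n     # Eq. (17)
    return s, var
```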

Discussion from the Original Paper

The first paragraph in the Discussion section of Byrt et al. (1993) [1] mentions:

“We have shown that for a \(2 \times 2\) table of agreement kappa can be simply expressed in terms of three easily interpreted indices. … The reexpression of kappa enables a clear explanation of the conceptually distinct and independent components that arise in the investigation of agreement.”

Examples

Example 1

Given a fixed \(p_O\), the \(\kappa\) statistic can be written as \(\kappa = 1 + \frac{p_O - 1}{1-p_E}\), which is a decreasing function of \(p_E\). Byrt et al. (1993) [1] (Table 1 and Table 2) quote an example from Feinstein and Cicchetti (1990), reproduced here as Table 18 and Table 19, showing that for the same value of \(p_O\), different values of \(p_E\) can yield a \(\kappa\) “more than 2-fold higher in one instance than the other”.

Table 18 \(p_O = 0.85\) and Cohen’s \(\kappa = 0.70\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
| --- | --- | --- | --- |
| Rater 1: \(v_1\) | 40 | 9 | 49 |
| Rater 1: \(v_2\) | 6 | 45 | 51 |
| Column Total | 46 | 54 | 100 |

Table 19 \(p_O = 0.85\) and Cohen’s \(\kappa = 0.32\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
| --- | --- | --- | --- |
| Rater 1: \(v_1\) | 80 | 10 | 90 |
| Rater 1: \(v_2\) | 5 | 5 | 10 |
| Column Total | 85 | 15 | 100 |
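Both tables can be checked with the earlier `cohens_kappa` and `pabak` sketches; since the two tables share \(p_O = 0.85\), they have the same PABAK of 0.70 even though Cohen’s \(\kappa\) differs.

```python
t18 = np.array([[40, 9], [6, 45]])   # Table 18 counts
t19 = np.array([[80, 10], [5, 5]])   # Table 19 counts
print(cohens_kappa(t18)[0])          # ~0.70
print(cohens_kappa(t19)[0])          # ~0.32
print(pabak(t18), pabak(t19))        # 0.7 0.7
```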

References