Reliability/cate/Cohen’s Kappa

Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.

Background

Cohen’s kappa (\(\kappa\)) is a statistic used for describing inter-rater reliability of two raters (or intra-rater) with categorical rating outcomes [1]. Please note that there are also additional considerations for the use of \(\kappa\) for quantifying agreement [2] [3].

Notation

For two raters and two category ratings, let \(Y_{r,i} \in \{v_j; j=1,2\}\) represent the rating from rater \(r \in \{1,2\}\) for sample \(i \in \{ 1, \ldots, n \}\). Let \(N_{j_1,j_2}\) represent the total number of samples that received ratings \((v_{j_1}, v_{j_2})\) from two raters, where \(j_1,j_2 \in \{1,2\}\). See Table 1.

Table 1 Counts for 2 categories

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | \(N_{11}\) | \(N_{12}\) | \(N_{1\bullet}\) |
| Rater 1: \(v_2\) | \(N_{21}\) | \(N_{22}\) | \(N_{2\bullet}\) |
| Column Total | \(N_{\bullet 1}\) | \(N_{\bullet 2}\) | \(n\) |

For two raters and three or more category ratings, let \(Y_{r,i} \in \{v_1,v_2,v_3, \ldots, v_J \}\) represent the rating from rater \(r \in \{1,2\}\) for sample \(i \in \{ 1, \ldots, n \}\). Let \(N_{j_1,j_2}\) represent the total number of samples that received ratings \((v_{j_1}, v_{j_2})\) from two raters, where \(j_1,j_2 \in \{1,\ldots,J\}\). See Table 2.

Table 2 Counts for 3 or more categories

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Rater 2: \(v_3\) | \(\ldots\) | Row Total |
|---|---|---|---|---|---|
| Rater 1: \(v_1\) | \(N_{11}\) | \(N_{12}\) | \(N_{13}\) | \(\ldots\) | \(N_{1\bullet}\) |
| Rater 1: \(v_2\) | \(N_{21}\) | \(N_{22}\) | \(N_{23}\) | \(\ldots\) | \(N_{2\bullet}\) |
| Rater 1: \(v_3\) | \(N_{31}\) | \(N_{32}\) | \(N_{33}\) | \(\ldots\) | \(N_{3\bullet}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| Column Total | \(N_{\bullet 1}\) | \(N_{\bullet 2}\) | \(N_{\bullet 3}\) | \(\ldots\) | \(n\) |
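As a minimal sketch of how the counts \(N_{j_1,j_2}\) and the row and column totals could be tabulated from two lists of ratings, the following uses only the Python standard library. The function name `contingency_table` and the illustrative ratings are assumptions made for this example and are not part of any package mentioned on this page.

```python
from collections import Counter

def contingency_table(rater1, rater2, categories=None):
    """Cross-tabulate two equal-length rating lists into counts N[j1][j2]."""
    if len(rater1) != len(rater2):
        raise ValueError("both raters must rate the same samples")
    if categories is None:
        # Use every category seen by either rater, in sorted order.
        categories = sorted(set(rater1) | set(rater2))
    pair_counts = Counter(zip(rater1, rater2))
    # Rows index rater 1's category, columns index rater 2's category.
    table = [[pair_counts[(a, b)] for b in categories] for a in categories]
    return categories, table

# Illustrative ratings (hypothetical data): 100 samples, two categories.
y1 = ["v2"] * 70 + ["v1"] * 30
y2 = ["v1"] * 70 + ["v2"] * 30

categories, N = contingency_table(y1, y2)
row_totals = [sum(row) for row in N]          # N_{j .}
col_totals = [sum(col) for col in zip(*N)]    # N_{. j}
print(categories, N, row_totals, col_totals)
```

Passing `categories` explicitly keeps a category in the table even when neither rater happened to use it.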

The observed raw percentage of agreement is defined as

\[p_O = \sum_{j=1}^J N_{jj} / n\]

where \(J \geq 2\) is the number of rating categories (the size of the value set).

Assume that

\[(N_{1\bullet}, \ldots, N_{J\bullet}) \sim \mathrm{Multinomial}(n, (p_{r=1,1}, \ldots, p_{r=1,J})),\]

and

\[(N_{\bullet 1}, \ldots, N_{\bullet J}) \sim \mathrm{Multinomial}(n, (p_{r=2,1}, \ldots, p_{r=2,J})),\]

with \(\sum_j N_{j \bullet} = \sum_j N_{\bullet j} = n\) and \(\sum_j p_{r=1,j} = \sum_j p_{r=2, j} = 1\).

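Under the assumption that the two raters rate independently, the expected number of agreements is estimated by \(\sum_{j=1}^J\hat{E}_{j} = \frac{1}{n}\sum_{j=1}^J N_{\bullet j} N_{j\bullet} \equiv n p_E\), so that \(p_E\) is the estimated proportion of agreement expected by chance.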

The Cohen’s \(\kappa\) statistic is calculated as

\[\kappa = \frac{p_O - p_E}{1-p_E}. \tag{1}\]

An approximate (large-sample) standard error (SE) of \(\kappa\) is calculated as

\[\sqrt{\frac{p_O(1-p_O)}{n(1-p_E)^2}}.\]
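The formulas for \(p_O\), \(p_E\), \(\kappa\) and the SE can be checked numerically. The sketch below is one possible implementation in plain Python; the function name `cohen_kappa_from_table` and the 2-by-2 counts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def cohen_kappa_from_table(table):
    """Return (p_O, p_E, kappa, SE) for a square table of counts N[j1][j2]."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]        # N_{j .}
    col_totals = [sum(col) for col in zip(*table)]  # N_{. j}
    # Observed agreement: diagonal counts divided by n.
    p_o = sum(table[j][j] for j in range(len(table))) / n
    # Chance agreement under independence: sum_j N_{j.} N_{.j} / n^2.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    # kappa is undefined when p_E equals 1 (division by zero).
    kappa_hat = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return p_o, p_e, kappa_hat, se

# Hypothetical counts: 85 of 100 samples on the diagonal.
p_o, p_e, kappa_hat, se = cohen_kappa_from_table([[40, 10], [5, 45]])
print(p_o, p_e, kappa_hat, se)
```

For these counts \(p_O = 0.85\) and \(p_E = 0.5\), so \(\kappa = 0.35 / 0.5 = 0.7\).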

Interpretation of Cohen’s Kappa Suggested in Literature

Several approaches to interpretation exist. Some loosely defined types are listed below:

  1. Table-based interpretation: a shared lookup table simplifies application and makes values easy to compare.

  2. Interpretation based on approximate model-based confidence intervals or bootstrap confidence intervals, judged against a preselected criterion.

  3. Interpretation based on Bayesian inference [8].

Cohen (1960) [4] suggested the Kappa result be interpreted as follows:

Table 3 Cohen’s Kappa Interpretation (Cohen, 1960)

| Value of \(\kappa\) | Interpretation |
|---|---|
| \(-1 \leq \kappa \leq 0\) | indicating no agreement |
| \(0 < \kappa \leq 0.2\) | none to slight |
| \(0.2 < \kappa \leq 0.4\) | fair |
| \(0.4 < \kappa \leq 0.6\) | moderate |
| \(0.6 < \kappa \leq 0.8\) | substantial |
| \(0.8 < \kappa \leq 1\) | almost perfect agreement |
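A table-based interpretation such as Table 3 is easy to apply in code. The following is a small sketch that maps a value of \(\kappa\) to the labels of Table 3; the function name `interpret_kappa` is an assumption for this example, and the cut points are simply the upper bounds of the table's intervals.

```python
from bisect import bisect_left

# Upper bounds of the first five bands in Table 3 (each band is "lower < kappa <= upper").
CUT_POINTS = [0.0, 0.2, 0.4, 0.6, 0.8]
LABELS = [
    "no agreement",
    "none to slight",
    "fair",
    "moderate",
    "substantial",
    "almost perfect agreement",
]

def interpret_kappa(kappa_value):
    """Return the Table 3 label for a kappa value in [-1, 1]."""
    if not -1 <= kappa_value <= 1:
        raise ValueError("kappa must lie between -1 and 1")
    # bisect_left keeps a value equal to a cut point in the lower band.
    return LABELS[bisect_left(CUT_POINTS, kappa_value)]

print(interpret_kappa(0.65))
```

Such a lookup only attaches a label; it does not account for the sampling uncertainty of \(\kappa\).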

Interpretation suggested by McHugh (2012) [5]:

Table 4 Cohen’s Kappa Interpretation (McHugh, 2012)

| Value of \(\kappa\) | Level of Agreement | % of Data That Are Reliable |
|---|---|---|
| \(-1 \leq \kappa \leq 0\) | Disagreement | NA |
| 0-.20 | None | 0-4% |
| .21-.39 | Minimal | 4-15% |
| .40-.59 | Weak | 15-35% |
| .60-.79 | Moderate | 35-63% |
| .80-.90 | Strong | 64-81% |
| Above .90 | Almost Perfect | 82-100% |

As discussed by Sim and Wright [6], biases and other factors can affect the interpretation of \(\kappa\).

Example - Group-1

Table 5 Cohen’s \(\kappa = 0\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 9 | 21 | 30 |
| Rater 1: \(v_2\) | 21 | 49 | 70 |
| Column Total | 30 | 70 | 100 |

Table 6 Cohen’s \(\kappa = 0\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 49 | 21 | 70 |
| Rater 1: \(v_2\) | 21 | 9 | 30 |
| Column Total | 70 | 30 | 100 |

Table 7 Cohen’s \(\kappa = 1\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 30 | 0 | 30 |
| Rater 1: \(v_2\) | 0 | 70 | 70 |
| Column Total | 30 | 70 | 100 |

Table 8 Cohen’s \(\kappa = 1\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 50 | 0 | 50 |
| Rater 1: \(v_2\) | 0 | 50 | 50 |
| Column Total | 50 | 50 | 100 |

Table 9 Cohen’s \(\kappa = -1\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 0 | 50 | 50 |
| Rater 1: \(v_2\) | 50 | 0 | 50 |
| Column Total | 50 | 50 | 100 |

Table 10 Cohen’s \(\kappa = -0.7241379310344827\)

| | Rater 2: \(v_1\) | Rater 2: \(v_2\) | Row Total |
|---|---|---|---|
| Rater 1: \(v_1\) | 0 | 30 | 30 |
| Rater 1: \(v_2\) | 70 | 0 | 70 |
| Column Total | 70 | 30 | 100 |
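As a check on the value reported for Table 10, the quantities in Equation (1) can be worked out by hand from the counts in that table:

\[p_O = \frac{0 + 0}{100} = 0, \qquad p_E = \frac{30 \times 70 + 70 \times 30}{100^2} = 0.42, \qquad \kappa = \frac{0 - 0.42}{1 - 0.42} \approx -0.7241.\]

The same steps applied to Table 9 give \(p_E = 0.5\) and \(\kappa = -0.5 / 0.5 = -1\). Both tables show complete disagreement (\(p_O = 0\)), so the difference in \(\kappa\) comes only from the marginal totals: with \(p_O = 0\), \(\kappa = -p_E / (1 - p_E)\), which reaches \(-1\) only when \(p_E = 0.5\).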

How-to

To use sklearn.metrics (stable):

from sklearn.metrics import cohen_kappa_score
y1 = ['v2'] * 70 + ['v1'] * 30
y2 = ['v1'] * 70 + ['v2'] * 30
print("Cohen's kappa:", cohen_kappa_score(y1, y2))

To use mtbp3Lab (testing):

from mtbp3Lab.statlab import kappa
y1 = ['v2'] * 70 + ['v1'] * 30
y2 = ['v1'] * 70 + ['v2'] * 30
kappa = kappa.KappaCalculator([y1,y2])
print("Cohen's kappa:", kappa.cohen_kappa)

Bootstrap CI

To use mtbp3Lab.statlab:

print( kappa.bootstrap_cohen_ci(n_iterations=1000, confidence_level=0.95, out_digits=6) )

Output:

Cohen's kappa: -0.724138
Confidence Interval (95.0%): [-0.907669, -0.496558]
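To illustrate the idea behind a bootstrap confidence interval, the sketch below resamples the rated samples with replacement and takes percentiles of the resulting \(\kappa\) values. It is a generic percentile bootstrap written with the Python standard library, not the implementation used by mtbp3Lab; the function names `cohen_kappa` and `bootstrap_kappa_ci` and the `seed` parameter are assumptions for this example.

```python
import random
from collections import Counter

def cohen_kappa(r1, r2):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (p_o - p_e) / (1 - p_e)  # raises ZeroDivisionError when p_E == 1

def bootstrap_kappa_ci(r1, r2, n_iterations=1000, confidence_level=0.95, seed=0):
    """Percentile bootstrap interval: resample samples (rating pairs) with replacement."""
    rng = random.Random(seed)
    n = len(r1)
    estimates = []
    for _ in range(n_iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(cohen_kappa([r1[i] for i in idx], [r2[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip resamples where kappa is undefined
    estimates.sort()
    alpha = 1 - confidence_level
    lower = estimates[int(alpha / 2 * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lower, upper

y1 = ["v2"] * 70 + ["v1"] * 30
y2 = ["v1"] * 70 + ["v2"] * 30
point = cohen_kappa(y1, y2)
lower, upper = bootstrap_kappa_ci(y1, y2)
print(point, lower, upper)
```

Because the resampling is random, the interval depends on the seed and on the number of iterations, so it will generally differ slightly from the output shown above.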

Note that examples of using SAS/PROC FREQ and R package vcd for calculating \(\kappa\) can be found in reference [7] .

Bubble Plot

To create a bubble plot using mtbp3Lab.statlab:

import numpy as np
from mtbp3Lab.statlab import kappa

# Simulate two raters choosing independently among three categories
fruits = ['Apple', 'Orange', 'Pear']
np.random.seed(100)
r1 = np.random.choice(fruits, size=100).tolist()
r2 = np.random.choice(fruits, size=100).tolist()

# KappaCalculator is accessed through the imported module, as in the earlier example
kappa = kappa.KappaCalculator([r1,r2], stringna='NA')
print("Cohen's kappa (mtbp3Lab.statlab): "+str(kappa.cohen_kappa))
print("Number of raters per sample: "+str(kappa.n_rater))
print("Number of rating categories: "+str(kappa.n_category))
print("Number of sample: "+str(kappa.y_count.shape[0]))

kappa.create_bubble_plot()

Output:

Cohen's kappa (mtbp3Lab.statlab): 0.06513872135102527
Number of raters per sample: 2.0
Number of rating categories: 3
Number of sample: 100
[Figure: bubble plot]

Monitoring each rater's individual rating rates is sometimes needed when interpreting \(\kappa\). To create a bubble plot with a summary of each rater's ratings using mtbp3Lab.statlab:

kappa.create_bubble_plot(hist=True)
[Figure: bubble plot with histograms]

Note that the counts where the two raters agree lie on the 45-degree line. To put these counts on the -45-degree line instead:

kappa.create_bubble_plot(hist=True, reverse_y=True)
[Figure: bubble plot with histograms, reversed y-axis]

Lab Exercise

Assume that two raters are responsible for rating 2 studies, each with a sample size of 100, and that you are tasked with studying the characteristics of \(\kappa\).

For the first study, the first rater completed the rating with marginal rates following a multinomial distribution (100, (1/3, 1/3, 1/3)). Afterwards, assume that you filled a portion (\(0 < r < 1\)) of the sample’s ratings as a second rater with exactly the same rating as the first rater, and filled out the rest with random ratings following the same distribution as the first rater.

For the second study, the first rater completed the rating with marginal rates following a multinomial distribution (100, (0.9, 0.05, 0.05)). Afterwards, assume that you filled a portion (\(0 < r < 1\)) of the sample’s ratings as a second rater with exactly the same rating as the first rater, and filled out the rest with random ratings following the same distribution as the first rater.

  1. Find the relationship between \(r\) and \(\kappa\) for these two studies.
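One possible way to start exploring the exercise is by simulation. The sketch below follows the setup described above under some assumptions: the first rater's ratings are drawn independently with the stated category probabilities, the second rater copies the first `round(r * n)` ratings and draws the rest independently from the same probabilities, and \(\kappa\) is averaged over repeated simulations. The names `simulate_kappa` and `mean_kappa`, the number of repetitions, and the seed are all hypothetical choices.

```python
import random
from collections import Counter

def cohen_kappa(r1, r2):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (p_o - p_e) / (1 - p_e)  # raises ZeroDivisionError when p_E == 1

def simulate_kappa(probs, r, n=100, rng=None):
    """One simulated study: rater 2 copies a fraction r of rater 1's ratings."""
    rng = rng or random.Random()
    categories = list(range(len(probs)))
    rater1 = rng.choices(categories, weights=probs, k=n)
    n_copy = round(r * n)
    # Copied part first, then independent random ratings for the remainder.
    rater2 = rater1[:n_copy] + rng.choices(categories, weights=probs, k=n - n_copy)
    return cohen_kappa(rater1, rater2)

def mean_kappa(probs, r, n_rep=500, seed=1):
    """Average kappa over repeated simulated studies (undefined cases skipped)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_rep):
        try:
            values.append(simulate_kappa(probs, r, rng=rng))
        except ZeroDivisionError:
            continue
    return sum(values) / len(values)

for r in (0.2, 0.5, 0.8):
    study1 = mean_kappa((1/3, 1/3, 1/3), r)
    study2 = mean_kappa((0.9, 0.05, 0.05), r)
    print(f"r={r:.1f}  study 1: {study1:.3f}  study 2: {study2:.3f}")
```

Plotting the averages against a finer grid of \(r\), and looking at the spread of \(\kappa\) across repetitions rather than only the mean, would be natural next steps.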

Extensions

Some scenarios discussed by Hallgren (2012) [9] include:

  • the prevalence problem: one category occurs much more often than the other categories, which causes \(\kappa\) to be low.

  • the bias problem: the two raters' marginal distributions differ substantially, which causes \(\kappa\) to tend to be high.

  • unequal importance

(Please note that this is not an exhaustive list.)
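As a small worked illustration of the prevalence problem, consider hypothetical counts \(N_{11} = 90\), \(N_{12} = 4\), \(N_{21} = 4\), \(N_{22} = 2\) (these numbers are made up for this example and do not come from the cited reference). Both raters place 94 of the 100 samples in \(v_1\), so

\[p_O = \frac{90 + 2}{100} = 0.92, \qquad p_E = \frac{94 \times 94 + 6 \times 6}{100^2} = 0.8872, \qquad \kappa = \frac{0.92 - 0.8872}{1 - 0.8872} \approx 0.29.\]

Although the raters agree on 92% of the samples, the agreement expected by chance is already about 89%, which leaves \(\kappa\) in the "fair" band of Table 3.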

Weighted \(\kappa\)

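Let \(w_{j_1,j_2}\) represent the disagreement weight given to the samples that received ratings \((v_{j_1}, v_{j_2})\) from the two raters, where \(j_1,j_2 \in \{1,\ldots,J\}\) and typically \(w_{j,j} = 0\), and let \(\hat{E}_{j_1,j_2} = N_{j_1\bullet} N_{\bullet j_2} / n\) be the estimated expected count under independence. The weighted \(\kappa\) is calculated as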

\[\kappa = 1- \frac{\sum_{j_1=1}^J\sum_{j_2=1}^J w_{j_1,j_2}N_{j_1,j_2}}{\sum_{j_1=1}^J\sum_{j_2=1}^J w_{j_1,j_2}\hat{E}_{j_1, j_2}}.\]
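The weighted formula can be sketched in a few lines of Python. The example below assumes disagreement weights that are zero on the diagonal and uses three illustrative weighting schemes (0/1, linear and quadratic in the category index); the function names and the 3-by-3 counts are hypothetical.

```python
def weighted_kappa(table, weight):
    """Weighted kappa: 1 - sum(w * N) / sum(w * E), with E = N_{j1.} N_{.j2} / n."""
    J = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(weight(a, b) * table[a][b] for a in range(J) for b in range(J))
    expected = sum(
        weight(a, b) * row_totals[a] * col_totals[b] / n
        for a in range(J) for b in range(J)
    )
    return 1 - observed / expected

def zero_one(a, b):
    return 0 if a == b else 1   # every disagreement counts equally

def linear(a, b):
    return abs(a - b)           # penalty grows with category distance

def quadratic(a, b):
    return (a - b) ** 2         # larger penalty for distant categories

# Hypothetical counts for three ordered categories.
table = [[20, 5, 0], [3, 30, 7], [2, 3, 30]]
for name, w in (("0/1", zero_one), ("linear", linear), ("quadratic", quadratic)):
    print(name, weighted_kappa(table, w))
```

With the 0/1 weights the weighted statistic reduces to the unweighted \(\kappa\) of Equation (1), which gives a convenient sanity check. Linear and quadratic weights are only meaningful when the categories are ordered.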

(There shall be another page discussing weighted methods and variations)

Reference