Cohen’s Kappa

Disclaimer: This page is provided only for studying and practicing. The author does not intend to promote or advocate any particular analysis method or software.

Background

Cohen’s kappa (\(\kappa\)) is a statistic for describing inter-rater reliability between two raters (or intra-rater reliability, when the same rater rates the samples twice) when the rating outcomes are categorical [1]. Please note that there are additional considerations when using \(\kappa\) to quantify agreement [2] [3].

Notation

For two raters and two category ratings, let \(Y_{r,i} \in \{v_j; j=1,2\}\) represent the rating from rater \(r \in \{1,2\}\) for sample \(i \in \{ 1, \ldots, n \}\). Let \(N_{j_1,j_2}\) represent the total number of samples that received ratings \((v_{j_1}, v_{j_2})\) from two raters, where \(j_1,j_2 \in \{1,2\}\). See Table 1.

Table 1 Counts for 2 categories
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Row Total
Rater 1: \(v_1\)	\(N_{11}\)	\(N_{12}\)	\(N_{1\bullet}\)
Rater 1: \(v_2\)	\(N_{21}\)	\(N_{22}\)	\(N_{2\bullet}\)
Column Total	\(N_{\bullet 1}\)	\(N_{\bullet 2}\)	\(n\)

For two raters and three or more category ratings, let \(Y_{r,i} \in \{v_1,v_2,v_3, \ldots, v_J \}\) represent the rating from rater \(r \in \{1,2\}\) for sample \(i \in \{ 1, \ldots, n \}\). Let \(N_{j_1,j_2}\) represent the total number of samples that received ratings \((v_{j_1}, v_{j_2})\) from two raters, where \(j_1,j_2 \in \{1,\ldots,J\}\). See Table 2.

Table 2 Counts for 3 or more categories
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Rater 2: \(v_3\)	\(\ldots\)	Row Total
Rater 1: \(v_1\)	\(N_{11}\)	\(N_{12}\)	\(N_{13}\)	\(\ldots\)	\(N_{1\bullet}\)
Rater 1: \(v_2\)	\(N_{21}\)	\(N_{22}\)	\(N_{23}\)	\(\ldots\)	\(N_{2\bullet}\)
Rater 1: \(v_3\)	\(N_{31}\)	\(N_{32}\)	\(N_{33}\)	\(\ldots\)	\(N_{3\bullet}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)
Column Total	\(N_{\bullet 1}\)	\(N_{\bullet 2}\)	\(N_{\bullet 3}\)	\(\ldots\)	\(n\)
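
As a minimal sketch of how the counts \(N_{j_1,j_2}\) and the marginal totals in Tables 1 and 2 can be obtained from two lists of ratings, the following uses only the Python standard library. The helper name crosstab and the six ratings are hypothetical and are used for illustration only.

```python
from collections import Counter

def crosstab(rater1, rater2):
    """Return (categories, table), where table[a][b] counts the samples rated
    categories[a] by rater 1 and categories[b] by rater 2."""
    if len(rater1) != len(rater2):
        raise ValueError("both raters must rate the same samples")
    categories = sorted(set(rater1) | set(rater2))
    pair_counts = Counter(zip(rater1, rater2))
    table = [[pair_counts[(a, b)] for b in categories] for a in categories]
    return categories, table

# Hypothetical ratings for n = 6 samples (not taken from this page)
y1 = ["v1", "v1", "v2", "v2", "v3", "v3"]
y2 = ["v1", "v2", "v2", "v2", "v3", "v1"]

categories, table = crosstab(y1, y2)
row_totals = [sum(row) for row in table]        # N_{j .} for rater 1
col_totals = [sum(col) for col in zip(*table)]  # N_{. j} for rater 2

print(categories)
print(table)
print(row_totals, col_totals)
```

Rows correspond to rater 1 and columns to rater 2, matching the layout of the tables above.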

The observed raw proportion of agreement is defined as

\[p_O = \sum_{j=1}^J N_{jj} / n\]

where \(J \geq 2\) is the size of the value set (the number of rating categories).

Assume that

\[(N_{1\bullet}, \ldots, N_{J\bullet}) \sim \text{Multinomial}(n, (p_{r=1,1}, \ldots, p_{r=1,J})),\]

and

\[(N_{\bullet 1}, \ldots, N_{\bullet J}) \sim \text{Multinomial}(n, (p_{r=2,1}, \ldots, p_{r=2,J})),\]

with \(\sum_j N_{j \bullet} = \sum_j N_{\bullet j} = n\) and \(\sum_j p_{r=1,j} = \sum_j p_{r=2, j} = 1\).

Under the assumption that the two raters rate independently, the expected number of agreements is estimated by \(\sum_{j=1}^J\hat{E}_{j} = \frac{1}{n}\sum_{j=1}^J N_{\bullet j} N_{j\bullet} \equiv n p_E\).

Cohen’s \(\kappa\) statistic is calculated as

\[\kappa = \frac{p_O - p_E}{1-p_E}. \tag{1}\]

An approximate standard error (SE) of \(\kappa\) is calculated as

\[\sqrt{\frac{p_O(1-p_O)}{n(1-p_E)^2}}.\]
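
The formulas above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used by any package mentioned on this page: the function name cohen_kappa and the 2 x 2 counts are hypothetical, and the case \(p_E = 1\) (where \(\kappa\) is undefined) is not handled.

```python
import math

def cohen_kappa(table):
    """Return (p_O, p_E, kappa, SE) for a square table of counts N[j1][j2]."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]        # N_{j .}
    col_totals = [sum(col) for col in zip(*table)]  # N_{. j}
    p_o = sum(table[j][j] for j in range(len(table))) / n
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)  # undefined when p_E = 1
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return p_o, p_e, kappa, se

# Hypothetical counts: rows are rater 1, columns are rater 2
table = [[40, 10],
         [5, 45]]
p_o, p_e, kappa, se = cohen_kappa(table)
print(f"p_O={p_o:.4f}  p_E={p_e:.4f}  kappa={kappa:.4f}  SE={se:.4f}")
```

For these counts, \(p_O = 0.85\) and \(p_E = (50 \cdot 45 + 50 \cdot 55)/100^2 = 0.5\), so \(\kappa = 0.35/0.5 = 0.7\).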

Interpretation of Cohen’s Kappa Suggested in Literature

Interpretations of \(\kappa\) fall into several loosely defined groups:

Table-based interpretation: a shared interpretation table simplifies application and makes values easy to compare.
Interpretation based on approximate model-based confidence intervals or bootstrap confidence intervals with a preselected criterion
Bayesian inference based interpretation [8]

Cohen (1960) [4] suggested the Kappa result be interpreted as follows:

Table 3 Cohen’s Kappa Interpretation (Cohen, 1960)
Value of \(\kappa\)	Interpretation
\(-1 \leq \kappa \leq 0\)	indicating no agreement
\(0 < \kappa \leq 0.2\)	none to slight
\(0.2 < \kappa \leq 0.4\)	fair
\(0.4 < \kappa \leq 0.6\)	moderate
\(0.6 < \kappa \leq 0.8\)	substantial
\(0.8 < \kappa \leq 1\)	almost perfect agreement
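
A table-based interpretation such as Table 3 amounts to a simple lookup. The sketch below encodes the thresholds of Table 3 exactly as listed; the function name interpret_kappa is hypothetical, and the labels carry the same caveats as the table itself.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the label listed in Table 3."""
    if not -1 <= kappa <= 1:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa <= 0:
        return "no agreement"
    # (upper bound of the interval, label), in increasing order
    bands = [
        (0.2, "none to slight"),
        (0.4, "fair"),
        (0.6, "moderate"),
        (0.8, "substantial"),
        (1.0, "almost perfect agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label

for value in (-0.72, 0.07, 0.4, 0.7, 1.0):
    print(value, "->", interpret_kappa(value))
```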

Interpretation suggested by McHugh (2012) [5]:

Table 4 Cohen’s Kappa Interpretation (McHugh, 2012)
Value of \(\kappa\)	Level of Agreement	% of Data That Are Reliable
\(-1 \leq \kappa \leq 0\)	Disagreement	NA
\(0-.20\)	None	\(0-4\%\)
\(.21-.39\)	Minimal	\(4-15\%\)
\(.40-.59\)	Weak	\(15-35\%\)
\(.60-.79\)	Moderate	\(35-63\%\)
\(.80-.90\)	Strong	\(64-81\%\)
Above .90	Almost Perfect	\(82-100\%\)

As discussed by Sim and Wright [6], bias and other factors can affect the interpretation of \(\kappa\).

Example - Group-1

Table 5 Cohen’s \(\kappa = 0\)
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Row Total
Rater 1: \(v_1\)	9	21	30
Rater 1: \(v_2\)	21	49	70
Column Total	30	70	100

Table 6 Cohen’s \(\kappa = 0\)
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Row Total
Rater 1: \(v_1\)	49	21	70
Rater 1: \(v_2\)	21	9	30
Column Total	70	30	100

Table 7 Cohen’s \(\kappa = 1\)
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Row Total
Rater 1: \(v_1\)	30	0	30
Rater 1: \(v_2\)	0	70	70
Column Total	30	70	100

Table 8 Cohen’s \(\kappa = 1\)
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Row Total
Rater 1: \(v_1\)	50	0	50
Rater 1: \(v_2\)	0	50	50
Column Total	50	50	100

Table 9 Cohen’s \(\kappa = -1\)
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Row Total
Rater 1: \(v_1\)	0	50	50
Rater 1: \(v_2\)	50	0	50
Column Total	50	50	100

Table 10 Cohen’s \(\kappa = -0.7241379310344827\)
	Rater 2: \(v_1\)	Rater 2: \(v_2\)	Row Total
Rater 1: \(v_1\)	0	30	30
Rater 1: \(v_2\)	70	0	70
Column Total	70	30	100
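
The \(\kappa\) values in the captions of Tables 5 to 10 can be checked directly from the counts. The following is a small sketch in plain Python; the helper name kappa_from_table is hypothetical, and the counts are copied from the tables above.

```python
def kappa_from_table(table):
    """Cohen's kappa from a square table (rows: rater 1, columns: rater 2)."""
    n = sum(map(sum, table))
    p_o = sum(table[j][j] for j in range(len(table))) / n
    # zip(table, zip(*table)) pairs row j with column j
    p_e = sum(sum(row) * sum(col) for row, col in zip(table, zip(*table))) / n**2
    return (p_o - p_e) / (1 - p_e)

examples = {
    "Table 5": [[9, 21], [21, 49]],
    "Table 6": [[49, 21], [21, 9]],
    "Table 7": [[30, 0], [0, 70]],
    "Table 8": [[50, 0], [0, 50]],
    "Table 9": [[0, 50], [50, 0]],
    "Table 10": [[0, 30], [70, 0]],
}
results = {name: kappa_from_table(t) for name, t in examples.items()}
for name, value in results.items():
    print(name, round(value, 6))
```

Tables 5 and 6 have 58% raw agreement, yet \(\kappa = 0\) because the expected agreement under independence is also 58%.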

How-to

To use sklearn.metrics (stable):

from sklearn.metrics import cohen_kappa_score
y1 = ['v2'] * 70 + ['v1'] * 30
y2 = ['v1'] * 70 + ['v2'] * 30
print("Cohen's kappa:", cohen_kappa_score(y1, y2))

To use mtbp3Lab (testing):

from mtbp3Lab.statlab import kappa
y1 = ['v2'] * 70 + ['v1'] * 30
y2 = ['v1'] * 70 + ['v2'] * 30
kappa = kappa.KappaCalculator([y1,y2])
print("Cohen's kappa:", kappa.cohen_kappa)

Bootstrap CI

To use mtbp3Lab.statlab:

print( kappa.bootstrap_cohen_ci(n_iterations=1000, confidence_level=0.95, out_digits=6) )

Output:

Cohen's kappa: -0.724138
Confidence Interval (95.0%): [-0.907669, -0.496558]
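
For readers who want to see the idea behind a bootstrap interval, here is a minimal percentile-bootstrap sketch in plain Python. It is not the mtbp3Lab implementation: the resampling scheme (resampling samples, that is, rating pairs, with replacement), the percentile rule, the seed, and the names kappa_from_pairs and percentile_bootstrap_ci are all assumptions made for illustration, so the interval will generally differ somewhat from the output above.

```python
import random

def kappa_from_pairs(pairs):
    """Cohen's kappa from a list of (rater 1, rater 2) rating pairs."""
    n = len(pairs)
    categories = {c for pair in pairs for c in pair}
    p_o = sum(a == b for a, b in pairs) / n
    p_e = sum(
        sum(a == c for a, _ in pairs) * sum(b == c for _, b in pairs)
        for c in categories
    ) / n**2
    if p_e == 1:
        return float("nan")  # kappa is undefined in this case
    return (p_o - p_e) / (1 - p_e)

def percentile_bootstrap_ci(pairs, n_iterations=1000, confidence_level=0.95, seed=0):
    """Percentile bootstrap interval obtained by resampling rating pairs."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iterations):
        resample = rng.choices(pairs, k=len(pairs))  # sample with replacement
        k = kappa_from_pairs(resample)
        if k == k:  # drop undefined (NaN) replicates
            estimates.append(k)
    estimates.sort()
    alpha = (1 - confidence_level) / 2
    lower = estimates[int(alpha * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - alpha) * len(estimates)))]
    return lower, upper

# Same ratings as in the How-to section (Table 10)
y1 = ['v2'] * 70 + ['v1'] * 30
y2 = ['v1'] * 70 + ['v2'] * 30
pairs = list(zip(y1, y2))

point = kappa_from_pairs(pairs)
lower, upper = percentile_bootstrap_ci(pairs)
print(f"kappa: {point:.6f}, 95% percentile bootstrap CI: [{lower:.6f}, {upper:.6f}]")
```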

Note that examples of using SAS/PROC FREQ and the R package vcd for calculating \(\kappa\) can be found in reference [7].

Bubble Plot

To create a bubble plot using mtbp3Lab.statlab:

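import numpy as np
from mtbp3Lab.statlab import kappa

# Simulate two raters who each pick one of three categories at random
fruits = ['Apple', 'Orange', 'Pear']
np.random.seed(100)
r1 = np.random.choice(fruits, size=100).tolist()
r2 = np.random.choice(fruits, size=100).tolist()

# Note: this rebinds the name `kappa` from the module to the calculator instance
kappa = kappa.KappaCalculator([r1,r2], stringna='NA')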
print("Cohen's kappa (mtbp3Lab.statlab): "+str(kappa.cohen_kappa))
print("Number of raters per sample: "+str(kappa.n_rater))
print("Number of rating categories: "+str(kappa.n_category))
print("Number of sample: "+str(kappa.y_count.shape[0]))

kappa.create_bubble_plot()

Output:

Cohen's kappa (mtbp3Lab.statlab): 0.06513872135102527
Number of raters per sample: 2.0
Number of rating categories: 3
Number of sample: 100

Sometimes each rater's individual rating rates are needed to interpret \(\kappa\). To create a bubble plot with a summary for each rater using mtbp3Lab.statlab:

kappa.create_bubble_plot(hist=True)

Note that the agreed counts are on the 45 degree line. To put agreed counts on the -45 degree line:

kappa.create_bubble_plot(hist=True, reverse_y=True)

Lab Exercise

Assume that two raters are responsible for rating 2 studies, each with a sample size of 100, and that you are tasked with studying the characteristics of \(\kappa\).

For the first study, the first rater completed the rating with marginal rates following a multinomial distribution (100, (1/3, 1/3, 1/3)). Afterwards, assume that you, acting as the second rater, copied the first rater's ratings for a portion \(r\) (\(0 < r < 1\)) of the samples and filled in the rest with random ratings following the same distribution as the first rater.

For the second study, the first rater completed the rating with marginal rates following a multinomial distribution (100, (0.9, 0.05, 0.05)). Afterwards, assume that you, acting as the second rater, copied the first rater's ratings for a portion \(r\) (\(0 < r < 1\)) of the samples and filled in the rest with random ratings following the same distribution as the first rater.

Find the relationship between \(r\) and \(\kappa\) for these two studies.
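
One possible starting point for this exercise is a small simulation, sketched below in plain Python. Several details are assumptions rather than part of the exercise statement: the copied portion is taken to be the first round(r * n) samples, each rating is drawn independently with the stated category probabilities, results are averaged over 200 replications, and the names cohen_kappa and simulate_mean_kappa are hypothetical.

```python
import random

def cohen_kappa(y1, y2, categories):
    """Cohen's kappa for two equally long lists of ratings."""
    n = len(y1)
    p_o = sum(a == b for a, b in zip(y1, y2)) / n
    p_e = sum(y1.count(c) * y2.count(c) for c in categories) / n**2
    if p_e == 1:
        return float("nan")  # kappa is undefined in this case
    return (p_o - p_e) / (1 - p_e)

def simulate_mean_kappa(probs, r, n=100, n_rep=200, seed=0):
    """Average kappa over n_rep simulated studies for a copied portion r."""
    rng = random.Random(seed)
    categories = list(range(len(probs)))
    n_copied = round(r * n)  # assumption: the first n_copied samples are copied
    kappas = []
    for _ in range(n_rep):
        y1 = rng.choices(categories, weights=probs, k=n)
        y2 = y1[:n_copied] + rng.choices(categories, weights=probs, k=n - n_copied)
        k = cohen_kappa(y1, y2, categories)
        if k == k:  # skip undefined (NaN) replicates
            kappas.append(k)
    return sum(kappas) / len(kappas)

studies = {
    "study 1": (1/3, 1/3, 1/3),
    "study 2": (0.9, 0.05, 0.05),
}
for name, probs in studies.items():
    for r in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(name, r, round(simulate_mean_kappa(probs, r), 3))
```

Plotting the averaged \(\kappa\) against \(r\) for each study, and also looking at the spread across replications, is one way to compare the two marginal distributions.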

Extensions

Some scenarios discussed by Hallgren (2012) [9] include:

the prevalence problem: one category is much more prevalent than the others, which causes \(\kappa\) to be low.
the bias problem: the raters' marginal distributions differ substantially, which causes \(\kappa\) to tend to be high.
unequal importance

(Please note that this is not an exhaustive list.)
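
To make the prevalence problem concrete, the sketch below compares two hypothetical 2 x 2 tables (the counts are invented for illustration) that share the same raw agreement of 90% but differ in how the ratings are spread across categories. The helper name agreement_and_kappa is also hypothetical.

```python
def agreement_and_kappa(table):
    """Return (p_O, kappa) for a square table (rows: rater 1, columns: rater 2)."""
    n = sum(map(sum, table))
    p_o = sum(table[j][j] for j in range(len(table))) / n
    p_e = sum(sum(row) * sum(col) for row, col in zip(table, zip(*table))) / n**2
    return p_o, (p_o - p_e) / (1 - p_e)

balanced = [[45, 5],
            [5, 45]]   # both categories equally common
skewed = [[90, 5],
          [5, 0]]      # category v1 dominates

p_o_balanced, kappa_balanced = agreement_and_kappa(balanced)
p_o_skewed, kappa_skewed = agreement_and_kappa(skewed)
print(f"balanced: p_O={p_o_balanced:.2f}, kappa={kappa_balanced:.4f}")
print(f"skewed:   p_O={p_o_skewed:.2f}, kappa={kappa_skewed:.4f}")
```

In the skewed table the expected agreement is \(p_E = (95^2 + 5^2)/100^2 = 0.905\), which exceeds the observed 0.90, so \(\kappa\) is slightly negative even though the raters agree on 90 of the 100 samples.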

Weighted \(\kappa\)

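Let \(w_{j_1,j_2}\) represent the disagreement weight assigned to the \(N_{j_1,j_2}\) samples that received ratings \((v_{j_1}, v_{j_2})\) from the two raters, where \(j_1,j_2 \in \{1,\ldots,J\}\), and let \(\hat{E}_{j_1,j_2} = N_{j_1\bullet} N_{\bullet j_2} / n\) be the corresponding expected count under independence. The weighted \(\kappa\) is calculated as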

\[\kappa = 1- \frac{\sum_{j_1=1}^J\sum_{j_2=1}^J w_{j_1,j_2}N_{j_1,j_2}}{\sum_{j_1=1}^J\sum_{j_2=1}^J w_{j_1,j_2}\hat{E}_{j_1, j_2}}.\]
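
The weighted formula can be sketched as follows in plain Python. The weight functions shown (unit, linear, and quadratic disagreement weights based on category position) are common choices but are assumptions here, as are the function names and the 3 x 3 counts; with unit weights the result coincides with the unweighted \(\kappa\) of equation (1).

```python
def weighted_kappa(table, weight):
    """Weighted kappa with disagreement weights weight(j1, j2)."""
    J = len(table)
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(weight(a, b) * table[a][b]
                   for a in range(J) for b in range(J))
    # Expected count under independence: E[a][b] = N_{a .} * N_{. b} / n
    expected = sum(weight(a, b) * row_totals[a] * col_totals[b] / n
                   for a in range(J) for b in range(J))
    return 1 - observed / expected

def unit(a, b):
    return 0 if a == b else 1   # every disagreement counts equally

def linear(a, b):
    return abs(a - b)           # penalty grows with category distance

def quadratic(a, b):
    return (a - b) ** 2         # larger penalty for distant categories

# Hypothetical counts for three ordered categories
table = [[20, 5, 0],
         [5, 20, 5],
         [0, 5, 40]]
for name, w in (("unit", unit), ("linear", linear), ("quadratic", quadratic)):
    print(name, round(weighted_kappa(table, w), 4))
```

Linear and quadratic weights only make sense when the categories are ordered, since they rely on the distance between category positions.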

(There shall be another page discussing weighted methods and variations)