**This has been integrated into the overview slides.**

Here, we demonstrate the use of the Kappa statistic for computing inter-rater agreement for categorical labels as compared with simple percent agreement. Simple percent agreement is the proportion of times that two raters agree on a categorical item.

$$
p_a = \frac{a}{n}
$$

&hellip;where $a$ is the number of times the raters agree, and $n$ is the total number of ratings.

In contrast, the Kappa statistic (here, Cohen's kappa for two raters) corrects for the agreement that would be expected to occur by chance.

$$
\kappa = \frac{p_a - p_e}{1 - p_e}
$$

&hellip;where $p_e$ is the probability of agreement by chance.
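
For two raters, chance agreement is usually estimated from each rater's own label frequencies. As a sketch, writing $n_{1k}$ and $n_{2k}$ for the number of items that raters 1 and 2 assigned to category $k$ (this notation is an assumption introduced here for illustration):

$$
p_e = \sum_k \frac{n_{1k}}{n} \cdot \frac{n_{2k}}{n}
$$

Each term is the probability that both raters would pick category $k$ for the same item if they labeled independently at their observed rates.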

The Kappa statistic ranges from -1 to 1, where 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate less agreement than chance would produce (-1 is the theoretical minimum). A kappa of 0.8 or above is commonly considered good. Kappa is generally considered to be a more robust measure than simple percent agreement, and it is particularly useful when some categories are used much more often than others.
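
To illustrate the point about unevenly distributed categories, here is a minimal pure-Python sketch. The helper `kappa_from_scratch` and the skewed label lists are hypothetical, made up for this illustration; they are not part of the scikit-learn example used below.

```python
from collections import Counter


def kappa_from_scratch(rater_a, rater_b):
    """Return (percent agreement, Cohen's kappa) for two equal-length label lists."""
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each label, multiply the two raters' marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return p_a, (p_a - p_e) / (1 - p_e)


# Hypothetical skewed data: each rater says "negative" 9 times out of 10,
# and the raters never agree on which item is "positive".
skewed_1 = ["negative"] * 9 + ["positive"]
skewed_2 = ["negative"] * 8 + ["positive", "negative"]

agreement, kappa = kappa_from_scratch(skewed_1, skewed_2)
print(f"Agreement: {agreement:.2f}\tKappa: {kappa:.2f}")
```

With these made-up labels the percent agreement is 0.80, yet kappa is slightly negative, because two raters who each say "negative" 90% of the time would agree about 82% of the time by chance alone. Note that this sketch does not handle the degenerate case $p_e = 1$, where kappa is undefined.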

In [2]:
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score


Consider this example from <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html>

In [3]:
# The original test data from the example:
t1_c1 = ["negative", "positive", "negative", "neutral", "positive"]
t1_c2 = ["negative", "positive", "negative", "neutral", "negative"]

# Modified test data, with the same number of agreements, but only two possible answers (no neutral):
t2_c1 = ["negative", "positive", "negative", "positive", "positive"]
t2_c2 = ["negative", "positive", "negative", "positive", "negative"]

Percent agreement is the same for both tests, even though agreeing on 4 of 5 items is harder to achieve by chance with three categories than with two. The Kappa statistic reflects this difference: it is higher for the three-way coding than for the two-way coding.

In [4]:
print(f"3-Way Test -\tAgreement: {accuracy_score(t1_c1, t1_c2)}\tKappa: {cohen_kappa_score(t1_c1, t1_c2)}")
print(f"2-Way Test -\tAgreement: {accuracy_score(t2_c1, t2_c2)}\tKappa: {cohen_kappa_score(t2_c1, t2_c2)}")

3-Way Test -	Agreement: 0.8	Kappa: 0.6875
2-Way Test -	Agreement: 0.8	Kappa: 0.6153846153846154
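
The gap between the two kappa values comes entirely from the chance-agreement term. As a rough sketch, the exact values can be checked with Python's `fractions` module; the helper `chance_agreement` below is a hypothetical name introduced for this illustration, and the label lists are copied from the cell above.

```python
from collections import Counter
from fractions import Fraction


def chance_agreement(rater_a, rater_b):
    """Exact p_e: sum over labels of the product of the raters' marginal rates."""
    n = len(rater_a)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    return sum(Fraction(counts_a[k] * counts_b[k], n * n) for k in counts_a)


three_way = (["negative", "positive", "negative", "neutral", "positive"],
             ["negative", "positive", "negative", "neutral", "negative"])
two_way = (["negative", "positive", "negative", "positive", "positive"],
           ["negative", "positive", "negative", "positive", "negative"])

p_a = Fraction(4, 5)  # both pairs of codings agree on 4 of 5 items
for name, (c1, c2) in [("3-way", three_way), ("2-way", two_way)]:
    p_e = chance_agreement(c1, c2)
    print(f"{name}: p_e = {p_e}, kappa = {(p_a - p_e) / (1 - p_e)}")
```

This gives $p_e = 9/25$ and $\kappa = 11/16$ for the three-way coding, against $p_e = 12/25$ and $\kappa = 8/13$ for the two-way coding. Working in exact fractions also shows that trailing digits in floating-point results, such as 0.6875000000000001, are rounding noise rather than part of the statistic.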


Here, we hand-calculate the Kappa for the original example.

In [5]:
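# Chance that both coders pick each label: P(coder 1 picks it) * P(coder 2 picks it)
p_positive = 2/5 * 1/5
p_negative = 2/5 * 3/5
p_neutral = 1/5 * 1/5

# Total probability of agreeing by chance
p_e = p_positive + p_negative + p_neutral

# Observed agreement is 4/5
kappa = (4/5 - p_e) / (1 - p_e)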
kappa

0.6875000000000001

Here's a question for the preparation quiz.

In [6]:
# The original three-value test data from the example, modified for the quiz.
q_c1 = ["negative", "positive", "negative", "neutral", "positive"]
q_c2 = ["negative", "positive", "negative", "negative", "negative"]

print(f"Quiz Test -\tAgreement: {accuracy_score(q_c1, q_c2)}\tKappa: {cohen_kappa_score(q_c1, q_c2)}")

p_positive = 2/5 * 1/5
p_negative = 2/5 * 4/5
p_neutral = 1/5 * 0/5

p_e = p_positive + p_negative + p_neutral

kappa = (3/5 - p_e) / (1 - p_e)
kappa

Quiz Test -	Agreement: 0.6	Kappa: 0.3333333333333335


0.33333333333333326

And here's a final quiz version of the question.

In [15]:
# a set of 4 labels from 2 labelers for a 2-value labeling task.
q1 = ["negative", "positive", "negative", "positive"]
q2 = ["negative", "positive", "negative", "negative"]

print(f"Kappa = {cohen_kappa_score(q1, q2)}")

p_positive = 2/4 * 1/4
p_negative = 2/4 * 3/4

p_e = p_positive + p_negative
print(f"p_e = {p_e}")

kappa = (3/4 - p_e) / (1 - p_e)
print(f"kappa = (3/4 - 1/2) / (1 - 1/2) = {kappa}")

Kappa = 0.5
p_e = 0.5
kappa = (3/4 - 1/2) / (1 - 1/2) = 0.5
