Data Engineering: Training Data

Keith VanderLinden
Calvin University

Sampling Data

  • Non-probabilistic Sampling
  • Probabilistic Sampling
    • Random
    • Stratified
    • Weighted
    • Reservoir
    • Importance
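Reservoir sampling, for example, draws a uniform random sample of k items from a stream whose length is not known in advance. A minimal sketch (names are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)    # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1000), 5, seed=0))
```

Each item in the stream ends up in the sample with equal probability k/n, even though n is never known ahead of time.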

Labeling Data

Labeling data is the process of assigning a class label to each data item in a dataset.

  • Hand labeling
  • Natural labeling
  • Weakly supervised labeling
  • Semi-supervised labeling
  • Active learning

Labeling is a key part of many ML system workflows.
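Weakly supervised labeling, for example, replaces hand labels with heuristic labeling functions whose votes are combined. A minimal sketch, with hypothetical labeling functions and a simple majority vote standing in for a full framework such as Snorkel:

```python
# Hypothetical labeling functions: each returns a label or None (abstain).
def lf_contains_great(text):
    return "positive" if "great" in text else None

def lf_contains_awful(text):
    return "negative" if "awful" in text else None

def weak_label(text, lfs):
    """Combine labeling-function votes by simple majority; None if all abstain."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not None]
    if not votes:
        return None
    return max(set(votes), key=votes.count)

lfs = [lf_contains_great, lf_contains_awful]
print(weak_label("a great movie", lfs))   # → positive
print(weak_label("so-so at best", lfs))   # → None (every function abstains)
```

Real weak-supervision systems learn to weight and denoise these votes rather than taking a raw majority.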

The Kappa Statistic

Kappa measures inter-rater agreement for category labels.

Percent Agreement \[ p_a = \frac{a}{n} \] …where \(a\) is the number of times the raters agree, and \(n\) is the number of ratings.

Kappa Statistic \[ \kappa = \frac{p_a - p_e}{1 - p_e} \] …where \(p_e\) is the probability of agreement by chance.

A labeled dataset with a high Kappa is often known as a gold standard dataset.
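The two formulas above translate directly into code. A minimal sketch (the function name is illustrative) that computes Cohen's kappa from two equal-length label lists:

```python
from collections import Counter

def kappa(r1, r2):
    """Cohen's kappa for two equal-length lists of category labels."""
    n = len(r1)
    p_a = sum(a == b for a, b in zip(r1, r2)) / n      # percent agreement
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: for each label, the product of the raters'
    # marginal proportions, summed over all labels either rater used.
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(r1) | set(r2))
    return (p_a - p_e) / (1 - p_e)
```

With perfectly agreeing raters this returns 1.0; agreement no better than chance returns 0.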

The Kappa Statistic - Example

Here, we compare the statistics for two different label-sets.

from sklearn.metrics import accuracy_score, cohen_kappa_score

# The original test data from the example:
t11 = ["negative", "positive", "negative", "neutral", "positive"]
t12 = ["negative", "positive", "negative", "neutral", "negative"]

# Modified: same agreement count, only two possible answers:
t21 = ["negative", "positive", "negative", "positive", "positive"]
t22 = ["negative", "positive", "negative", "positive", "negative"]

print(f"Agmt: {accuracy_score(t11, t12)} Kappa: {cohen_kappa_score(t11, t12)}")
print(f"Agmt: {accuracy_score(t21, t22)} Kappa: {cohen_kappa_score(t21, t22)}")
Agmt: 0.8 Kappa: 0.6875
Agmt: 0.8 Kappa: 0.6153846153846154

Here, we hand-calculate the Kappa for the original example.

p_o = 4/5                                      # observed agreement: 4 of 5 ratings match
p_e = (2/5 * 1/5) + (2/5 * 3/5) + (1/5 * 1/5)  # chance agreement: positive, negative, neutral
(p_o - p_e) / (1 - p_e)
0.6875000000000001