Data Engineering: Training Data

Keith VanderLinden
Calvin University

Sampling Data

  • Non-probabilistic Sampling
  • Probabilistic Sampling
    • Random
    • Stratified
    • Weighted
    • Reservoir
    • Importance
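Reservoir sampling, for example, draws a uniform random sample of k items from a stream whose length is not known in advance. A minimal sketch (names are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)    # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1000), 5, seed=0))
```

Each item in the stream ends up in the sample with equal probability k/n, even though n is never known ahead of time.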

Labeling Data

Labeling data is the process of assigning a class label to each data item in a dataset.

  • Hand labeling
  • Natural labeling
  • Weakly supervised labeling
  • Semi-supervised labeling
  • Active learning

Labeling is a key part of many ML system workflows.
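Weakly supervised labeling, for example, replaces hand labels with heuristic labeling functions whose votes are combined. A minimal sketch, with hypothetical labeling functions and a simple majority vote standing in for a full framework such as Snorkel:

```python
# Hypothetical labeling functions: each returns a label or None (abstain).
def lf_contains_great(text):
    return "positive" if "great" in text else None

def lf_contains_awful(text):
    return "negative" if "awful" in text else None

def weak_label(text, lfs):
    """Combine labeling-function votes by simple majority; None if all abstain."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not None]
    if not votes:
        return None
    return max(set(votes), key=votes.count)

lfs = [lf_contains_great, lf_contains_awful]
print(weak_label("a great movie", lfs))   # → positive
print(weak_label("so-so at best", lfs))   # → None (every function abstains)
```

Real weak-supervision systems learn to weight and denoise these votes rather than taking a raw majority.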

The Kappa Statistic

Kappa measures inter-rater agreement for category labels.

Percent Agreement \[ p_a = \frac{a}{n} \] …where \(a\) is the number of times the raters agree, and \(n\) is the number of ratings.

Kappa Statistic \[ \kappa = \frac{p_a - p_e}{1 - p_e} \] …where \(p_e\) is the probability of agreement by chance.

A labeled dataset with a high Kappa is often known as a gold standard dataset.
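The two formulas above translate directly into code. A minimal sketch (the function name is illustrative) that computes Cohen's kappa from two equal-length label lists:

```python
from collections import Counter

def kappa(r1, r2):
    """Cohen's kappa for two equal-length lists of category labels."""
    n = len(r1)
    p_a = sum(a == b for a, b in zip(r1, r2)) / n      # percent agreement
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: for each label, the product of the raters'
    # marginal proportions, summed over all labels either rater used.
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(r1) | set(r2))
    return (p_a - p_e) / (1 - p_e)
```

With perfectly agreeing raters this returns 1.0; agreement no better than chance returns 0.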

The Kappa Statistic - Example

Here, we compare the statistics for two different label-sets.

from sklearn.metrics import accuracy_score, cohen_kappa_score

# The original test data from the example:
t11 = ["negative", "positive", "negative", "neutral", "positive"]
t12 = ["negative", "positive", "negative", "neutral", "negative"]

# Modified: same agreement count, only two possible answers:
t21 = ["negative", "positive", "negative", "positive", "positive"]
t22 = ["negative", "positive", "negative", "positive", "negative"]

print(f"Agmt: {accuracy_score(t11, t12)} Kappa: {cohen_kappa_score(t11, t12)}")
print(f"Agmt: {accuracy_score(t21, t22)} Kappa: {cohen_kappa_score(t21, t22)}")
Agmt: 0.8 Kappa: 0.6875
Agmt: 0.8 Kappa: 0.6153846153846154

Here, we hand-calculate the Kappa for the original example.

p_o = 4/5                                      # observed agreement: 4 of 5 ratings match
p_e = (2/5 * 1/5) + (2/5 * 3/5) + (1/5 * 1/5)  # chance agreement: positive, negative, neutral
(p_o - p_e) / (1 - p_e)
0.6875000000000001