Model Monitoring

Keith VanderLinden
Calvin University

ML System Failures

ML systems fail if:

  • The software system fails to operate as expected.
  • The ML system fails to perform as expected.

The first failure mode is shared by all software systems; the second is unique to ML systems.

Data Distribution Shifts

We distinguish these distributions:

  • Source (the distribution the model was trained on)
  • Target (the distribution the model encounters in production)

Rarely are these distributions:

  • Identical
  • Stationary

Detecting and addressing distribution shifts are crucial for maintaining ML system performance.
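
One way to detect a shift is to compare a feature's source sample against a recent target sample with a statistical two-sample test. Below is a minimal sketch using SciPy's Kolmogorov-Smirnov test; the generated data and the significance level are illustrative assumptions.

    # Sketch: flag a possible shift in one numeric feature with a
    # two-sample Kolmogorov-Smirnov test.
    import numpy as np
    from scipy.stats import ks_2samp

    def shift_detected(source_sample, target_sample, alpha=0.05):
        """Return True if the samples are unlikely to share a distribution."""
        statistic, p_value = ks_2samp(source_sample, target_sample)
        return p_value < alpha

    rng = np.random.default_rng(0)
    source = rng.normal(loc=0.0, scale=1.0, size=1000)   # training-time data
    target = rng.normal(loc=0.5, scale=1.0, size=1000)   # production data; mean has drifted
    print(shift_detected(source, target))                # True for this example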

Monitoring ML Systems

If our systems are observable, we can monitor key metrics on:

  • Raw inputs
  • Features
  • Predictions
  • Accuracy

Metrics later in this list, particularly predictions and accuracy-related metrics, are easier to monitor and to interpret.
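
As a minimal sketch of monitoring one of these artifacts, the snippet below tracks the mean of recent predictions against a baseline; the window size and drift tolerance are illustrative assumptions.

    # Sketch: compare the mean of recent predictions to a baseline mean.
    from collections import deque
    import numpy as np

    class PredictionMonitor:
        def __init__(self, baseline_mean, window_size=1000, tolerance=0.05):
            self.baseline_mean = baseline_mean
            self.window = deque(maxlen=window_size)   # most recent predictions
            self.tolerance = tolerance                # illustrative absolute threshold

        def record(self, prediction):
            self.window.append(prediction)

        def drifted(self):
            if len(self.window) < self.window.maxlen:
                return False  # wait for a full window before comparing
            return abs(float(np.mean(self.window)) - self.baseline_mean) > self.tolerance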

Monitoring Tools

There are three main types of monitoring tools:

  • Logs
  • Dashboards
  • Alerts

DMLS focuses on monitoring from the user’s perspective.
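
As a rough sketch of how logs and alerts relate, the snippet below logs each prediction event with Python's standard logging module and raises a warning-level alert when a simple metric crosses a threshold; the metric and threshold are illustrative assumptions, and a dashboard would typically be built on top of the same logged events.

    # Sketch: log prediction events plus a simple alert rule.
    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("model_monitor")

    LOW_CONFIDENCE_ALERT_FRACTION = 0.30   # illustrative alert threshold

    def log_prediction(request_id, prediction, confidence):
        logger.info("request=%s prediction=%s confidence=%.3f",
                    request_id, prediction, confidence)

    def check_alert(recent_confidences):
        low = sum(c < 0.5 for c in recent_confidences) / len(recent_confidences)
        if low > LOW_CONFIDENCE_ALERT_FRACTION:
            logger.warning("ALERT: %.0f%% of recent predictions are low-confidence",
                           100 * low)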

Continual Learning

Continual Learning establishes an infrastructure for retraining models in production. We distinguish:

  • Stateful retraining (continue training the existing model on only the new data)
  • Stateless retraining (train a new model from scratch each time)

It can help address data distribution shifts and the cold start problem, but is challenged by data collection and model management.
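
The difference between the two retraining styles can be sketched with scikit-learn's SGDClassifier, which supports incremental updates through partial_fit; the data arguments here are placeholders.

    # Sketch: stateless vs. stateful retraining with an incremental learner.
    from sklearn.linear_model import SGDClassifier

    def stateless_retrain(X_all, y_all):
        """Train a fresh model from scratch on all available data."""
        model = SGDClassifier()
        model.fit(X_all, y_all)
        return model

    def stateful_retrain(model, X_new, y_new, classes):
        """Continue training the existing model on only the newly collected data."""
        if not hasattr(model, "classes_"):
            # the class labels must be supplied on the first incremental call
            model.partial_fit(X_new, y_new, classes=classes)
        else:
            model.partial_fit(X_new, y_new)
        return model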

Test in Production

Test in Production is the practice of proactively evaluating candidate models on live traffic rather than relying solely on offline evaluation. We distinguish between:

  • Shadow deployment
  • A/B testing
  • Canary testing

It’s commonly used in large ML systems.
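
As a rough sketch of how shadow deployment and canary testing might be wired into a prediction service (the routing fraction and the model objects are illustrative assumptions):

    # Sketch: route live requests to a candidate model in shadow or canary mode.
    import logging
    import random

    logger = logging.getLogger("deployment")
    CANARY_FRACTION = 0.05   # illustrative: 5% of traffic goes to the candidate

    def predict(request, current_model, candidate_model, mode="shadow"):
        if mode == "shadow":
            # Shadow deployment: the candidate scores every request, but its
            # prediction is only logged for offline comparison, never served.
            logger.info("shadow prediction: %s", candidate_model.predict(request))
            return current_model.predict(request)
        if mode == "canary":
            # Canary testing: a small fraction of requests are served by the candidate.
            if random.random() < CANARY_FRACTION:
                return candidate_model.predict(request)
        return current_model.predict(request)

A/B testing follows the same routing idea as the canary branch, but assigns users to the two models deliberately so that their metrics can be compared statistically.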