Exercise 376.1: LM Evaluation | CS 375-376 Spring 2026 at Calvin University

Warning: This content has not yet been fully revised for this year.

Outcomes

Conduct a quantitative experiment comparing openly available language models

This exercise addresses the following course objectives:

[MS-LLM-Eval]
[MS-LLM-API]
[MS-Eval-Experiment]

You may also found opportunities to demonstrate the following course objectives:

[CI-LLM-Failures]
[MS-LLM-Prompting]
[MS-LLM-Advanced]
[MS-LLM-Compute]
[LM-ICL]

Overview

We’re going to do the same task as Discussion 1, but in code.

Start by picking one specific example from your Discussion 1 task. We’ll hard-code it for simplicity.

Part 1: LLM APIs

Write code that runs the “FlipFlop” experiment for your one example by calling an LLM API. Run the experiment 5 times (in a loop) and report the average initial accuracy and average accuracy after the “are you sure?”.

Notes:

Make sure to handle the conversation context correctly.
Instruct the model to provide its final answer as a single word, e.g., “You may think as much as necessary, but end your response with either ‘Answer: yes’ or ‘Answer: no’”.

Part 2: Local Models (optional)

Repeat the same experiment but running the model within your own notebook. Follow instructions on Hugging Face Chat Basics.

You might choose models like:

Submit

Write a Jupyter notebook with your code and experiment results.