Exercise 376.1: LM Evaluation | CS 375-376 Spring 2025 at Calvin University

Outcomes

Conduct a quantitative experiment comparing openly available language models

This exercise addresses the following course objectives:

[MS-LLM-Eval] I can apply and critically analyze evaluation strategies for generative models.
[MS-LLM-API] I can apply industry-standard APIs to work with pretrained language models (LLMs) and generative AI systems.
[MS-Eval-Experiment] I can design, run, and analyze empirical experiments to quantify the impact of hyperparameter changes on model performance.

You may also found opportunities to demonstrate the following course objectives:

[CI-LLM-Failures] I can identify common types of failures in LLMs, such as hallucination (confabulation) and bias.
[MS-LLM-Prompting] I can critique and refine prompts to improve the quality of responses from an LLM.
[MS-LLM-Advanced] I can apply techniques such as Retrieval-Augmented Generation, in-context learning, tool use, and multi-modal input to solve complex tasks with an LLM.
[MS-LLM-Compute] I can analyze the computational requirements of training and inference of generative AI systems.
[LM-ICL] I can explain how in-context learning can be used to improve test-time performance of a model.

Overview

We’re going to do the same task as Discussion 1, but in code.

Start by picking one specific example from your Discussion 1 task. We’ll hard-code it for simplicity.

Part 1: LLM APIs

Write code that runs the “FlipFlop” experiment for your one example by calling an LLM API. Run the experiment 5 times (in a loop) and report the average initial accuracy and average accuracy after the “are you sure?”.

Notes:

Make sure to handle the conversation context correctly.
Instruct the model to provide its final answer as a single word, e.g., “You may think as much as necessary, but end your response with either ‘Answer: yes’ or ‘Answer: no’”.

Part 2: Local Models (optional)

Repeat the same experiment but running the model within your own notebook. Follow instructions on Hugging Face Chat Basics.

You might choose models like:

Submit

Write a Jupyter notebook with your code and experiment results.