Exercise 376.1: LM Evaluation

Outcomes

This exercise addresses the following course objectives:

You may also found opportunities to demonstrate the following course objectives:

Overview

We’re going to do the same task as Discussion 1, but in code.

Start by picking one specific example from your Discussion 1 task. We’ll hard-code it for simplicity.

Part 1: LLM APIs

Write code that runs the “FlipFlop” experiment for your one example by calling an LLM API. Run the experiment 5 times (in a loop) and report the average initial accuracy and average accuracy after the “are you sure?”.

Notes:

Part 2: Local Models (optional)

Repeat the same experiment but running the model within your own notebook. Follow instructions on Hugging Face Chat Basics.

You might choose models like:

Submit

Write a Jupyter notebook with your code and experiment results.

Token and Context Embeddings