Outcomes
- Conduct a quantitative experiment comparing openly available language models
This exercise addresses the following course objectives:
- [MS-LLM-Eval] I can apply and critically analyze evaluation strategies for generative models.
- [MS-LLM-API] I can apply industry-standard APIs to work with pretrained language models (LLMs) and generative AI systems.
- [MS-Eval-Experiment] I can design, run, and analyze empirical experiments to quantify the impact of hyperparameter changes on model performance.
You may also found opportunities to demonstrate the following course objectives:
- [CI-LLM-Failures] I can identify common types of failures in LLMs, such as hallucination (confabulation) and bias.
- [MS-LLM-Prompting] I can critique and refine prompts to improve the quality of responses from an LLM.
- [MS-LLM-Advanced] I can apply techniques such as Retrieval-Augmented Generation, in-context learning, tool use, and multi-modal input to solve complex tasks with an LLM.
- [MS-LLM-Compute] I can analyze the computational requirements of training and inference of generative AI systems.
- [LM-ICL] I can explain how in-context learning can be used to improve test-time performance of a model.
Overview
We’re going to do the same task as Discussion 1, but in code.
Start by picking one specific example from your Discussion 1 task. We’ll hard-code it for simplicity.
Part 1: LLM APIs
Write code that runs the “FlipFlop” experiment for your one example by calling an LLM API. Run the experiment 5 times (in a loop) and report the average initial accuracy and average accuracy after the “are you sure?”.
Notes:
- Make sure to handle the conversation context correctly.
- Instruct the model to provide its final answer as a single word, e.g., “You may think as much as necessary, but end your response with either ‘Answer: yes’ or ‘Answer: no’”.
Part 2: Local Models (optional)
Repeat the same experiment but running the model within your own notebook. Follow instructions on Hugging Face Chat Basics.
You might choose models like:
- https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- One of the SmolLM2 models
- https://www.kaggle.com/models/google/gemma-3
- https://www.kaggle.com/models/metaresearch/llama-3.2
Submit
Write a Jupyter notebook with your code and experiment results.