376 Unit 4: Generation and Prompt Engineering

Contents

376 Preparation 4 (draft!)
The content may not be revised for this year.
Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use

If under the hood all a language model does is predict the next token, how can you get it to do useful tasks for you? The key idea is to make “doing a task” look like “predicting the next token” in some context. This lab will introduce you to a few ways to do that.

We’ll be using a model released by Google, called Gemma.

Objectives

This lab will address the following course objectives:

You may also use this lab to demonstrate the following course objectives (e.g., by adding additional discussion to your notebook submission or having a conversation with the instructor or a chatbot):

Getting Started

Start by accepting Google’s license agreement for the Gemma models. If you have any difficulty accepting the license, let the instructor know.

Start with the Lab 4 notebook. Also open a document where you can write the answers to the questions (we won’t be turning in a notebook for this lab). Create headings for each section of the lab and write your answers under each heading.

Prompt Engineering (notebook: u11n1-prompt-engineering.ipynb; preview it or open it in Colab)

We’ll use two different models: first, the non-instruction-tuned model, then the instruction-tuned model (-it). We’ll use the “2B” model size for both.

If you’re using Kaggle, the models should already be added to the notebook. Check the Inputs section to see if it already has two Gemma models. If not, add the Gemma model to your notebook by following these steps:

If you’re not on Kaggle, you can use the Hugging Face model hub to download the model; see the code in the notebook for details.

Gemma Warm-Up

Try completing the following tasks using the Gemma model (without instruction tuning). Do this by modifying the doc string given in the example code chunk; a sketch of what the generation call looks like appears after the task list. You might try setting the do_sample parameter to True (to get a sense of the range of possible outputs) or False (to get a single greedy prediction).

  1. A trivia task, like: “The capital of France is”
  2. A math task, like: “2 + 2 = ___.” (You might want to frame it like “Expression: 2 + 2. Result:”)
  3. A translation task, like: “An expert Spanish translation of ‘Language models are statistical models that can generate text.’ is ___.”
  4. A programming task, like:
def sum_evens(lst):
  # Input: a list of numbers
  # Output: the sum of the even numbers in the list
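
For reference, the generation call in the notebook looks roughly like the following sketch. The model id (assumed here to be google/gemma-2b) and generation settings may differ from what the notebook actually uses.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2b"  # assumed id for the non-instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

doc = "The capital of France is"
inputs = tokenizer(doc, return_tensors="pt")
# do_sample=False gives a single greedy prediction; set it to True to sample.
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))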

Prompt Engineering

Few-Shot Learning

In machine learning jargon, “shot” refers to the number of examples you have of a particular task. “Few-shot learning” refers to the problem of learning a new task with only a few examples.

For example, you might have noticed that the model completes “The capital of France is” as if it were a travel article (or perhaps a multiple-choice question)—because that’s the sort of document it was trained on! But you can give it examples of the sort of things you want. For example, try this instead:

The capital of Michigan is Lansing.
The capital of England is London.
The capital of France is

We can consider the first two lines as “examples” of the task we want the model to do. This is a “few-shot” example, because we’re giving the model only a few examples of the task we want it to do.
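
In the notebook, this just means replacing doc with a multi-line string, for example:

doc = """The capital of Michigan is Lansing.
The capital of England is London.
The capital of France is"""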

Write a brief summary of how Gemma performed on this task, as compared with not giving it any examples.

Also try this:

Request: capital("Michigan")
Response: "Lansing"
Request: capital("England")
Response: "London"
Request: capital("France")
Response:

Chain of Thought

Try the following prompt, again with plain Gemma:

I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

(Before you run this, think about how you would solve this problem.)

What does Gemma predict?

Now, add the following to the prompt: Let's think step by step. After I bought the apples, I had

How does the generated text change?

Instruction Tuning

Instruction tuning does several things to make the model better at following instructions:

  1. An instruction can get the model into the right “mode” for a task, more efficiently and reliably than few-shot examples can. For example, an instruction-tuned model will interpret questions like “What is the capital of France?” as a question, not as a continuation of a travel article.
  2. Instruction tuning is also often done together with human feedback. Since the model was trained on the Internet, its training data includes both helpful and unhelpful examples, but the human feedback tunes the model to give the most helpful responses.
  3. Instruction tuning gets the model to play the role of a dialogue agent, which is often a useful way to interact with a model.

Let’s switch to the instruction-tuned model.

Repeat the steps above to add a model, except this time select the variation ‘gemma-3-1b-it’. Change the model-loading code in the notebook to load this model (USE_INSTRUCTION_TUNED should be True). Stop the session and then restart it to run with the new model.

Conversations as Documents

Instruction-tuned models were fine-tuned on documents formatted as dialogues between a user and an assistant. To get the best performance from these models at inference time, we need to format our prompts the same way those documents were formatted during fine-tuning. Different models use different formats, but fortunately the Hugging Face Transformers library has code to help us format our prompts correctly for each model.

The “Chat Templating” section of the notebook includes code to format the prompt for the instruction-tuned model. The apply_chat_template method takes a list of messages, where each message is a dictionary with two keys: “role” and “content”. The “role” key can be either “user”, “assistant”, or “system”. The “content” key is the text of the message. See the Gemma documentation for more details.

role = """You are a helpful 2nd-grade teacher. Help a 2nd grader to answer questions in a short and clear manner."""
task = """Explain why the sky is blue"""

messages = [
    {
        "role": "user",
        "content": f"{role}\n\n{task}",
    },
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
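
To actually generate a response, you could follow the templating step with something like this sketch (it assumes model is the instruction-tuned Gemma model loaded earlier in the notebook):

# Generate a continuation of the templated chat and decode only the new tokens.
outputs = model.generate(tokenized_chat, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True))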

Retrieval-Augmented Generation

Completing a task often requires information that was not included in a model’s training set. For example, try asking the following question to the instruction-tuned model (either omit the {role} or use a role like “You are an expert in PyTorch”):

“Explain the options for TripletMarginLoss in PyTorch.”

Notice that the result includes hallucinations, i.e., information that it simply made up. This is a common problem with language models.

One way to reduce (but not eliminate) hallucinations is to explicitly provide the model with the information it needs. This is called retrieval-augmented generation. The idea is to provide the model with a “retrieval” of relevant information, which it can then use to generate a response.

We’ll use the docstrings for PyTorch functions as our knowledge base. Use the following code to extract the docstrings for the functions and classes in the torch.nn module:

import inspect
import torch.nn

docstrings = {}
for name, obj in inspect.getmembers(torch.nn):
    if inspect.isfunction(obj) or inspect.isclass(obj):
        docstrings[name] = inspect.getdoc(obj)

Now, give the instruction-tuned model a prompt like:

{task}

Answer using the following context:
{context}

where {task} is the question you want to ask and {context} is the docstring for the function you want to ask about. In this example, use context = docstrings['TripletMarginLoss']. (Refer to the documentation page for the module to check the model’s answer.)
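
For concreteness, here is a sketch of assembling that prompt with the chat-templating code from earlier (variable names are just suggestions):

task = "Explain the options for TripletMarginLoss in PyTorch."
context = docstrings["TripletMarginLoss"]

messages = [{
    "role": "user",
    "content": f"{task}\n\nAnswer using the following context:\n{context}",
}]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")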

Note: In practice, we would use a more sophisticated retrieval system, like a search engine, to provide the model with context. Often this is vector search: we find the documents whose embedding vectors are most similar to the prompt’s embedding. The Sentence Transformers library is often used for this, with embedding models from the Hugging Face model hub such as GTE. See that model’s documentation page for an example; a minimal sketch follows, and you might try it out on your own.
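
For example, a minimal vector-search sketch using the sentence-transformers library might look like this (thenlper/gte-small is one GTE variant; any sentence-embedding model would work):

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("thenlper/gte-small")  # assumed embedding model
doc_names = [name for name, doc in docstrings.items() if doc]
doc_embeddings = embedder.encode([docstrings[name] for name in doc_names])

query = "Explain the options for TripletMarginLoss in PyTorch."
query_embedding = embedder.encode(query)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
print(doc_names[int(scores.argmax())])  # the docstring we'd retrieve as context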

Tool Use

We can also prompt the model to use a tool, like a calculator, when it recognizes that it can’t answer a question directly. For example, try the following dialogue:

    {
        "role": "user",
        "content": "What is the sum of the odd numbers less than 20?",
    },
    {
        "role": "assistant",
        "content": """
Run Python code: print(sum(x for x in range(20) if x % 2 == 1))
Code output: 100

The result is 100."""
    },
    {
        "role": "user",
        "content": "What is the sum of the even numbers less than 40?",
    },

Note that we would need to intercept the generation process and detect that the model has generated a request to run some code – then run that code and insert the result in the dialogue. For simplicity we won’t actually do that in this lab.
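
If you did want to implement that interception, it might look roughly like this sketch (the "Run Python code:" marker follows the dialogue format above; a real system would sandbox the code rather than calling exec directly):

import io
import contextlib

def maybe_run_tool(generated_text):
    """If the generated text asks to run Python code, execute it and return the output."""
    marker = "Run Python code:"
    if marker not in generated_text:
        return None
    code = generated_text.split(marker, 1)[1].splitlines()[0].strip()
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code)  # for illustration only; unsafe outside a sandbox
    return buffer.getvalue().strip()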

In this case, the “tool” is the Python interpreter. We could also provide the model with other tools, like a search engine (which would insert text from the search results into the dialogue), a call to an API like Wolfram Alpha (see Stephen Wolfram’s blog post on this topic), or a call to an API that does something in the physical world (like turning on a light).

Note: we could treat retrieval as a tool, too. For example, the model could generate a request to run a search query against a database, then insert the results into the dialogue. This is called “agentic RAG”.

Your Turn

Suppose we wanted to make a chatbot that answers incoming students’ questions about Calvin University on topics like courses, schedules, recent events, activities, etc.

Wrap-Up

Since we didn’t need to write much code today, you don’t need to submit a notebook. Instead, submit the answers to the tasks above.

Exercise 376.3: Course Advisor Bot

This exercise is focused on prompting and structured output techniques to attempt to make a useful and reliable system out of an LLM. This exercise will allow you to demonstrate the following course objectives:

The fancy (resume/buzzword) name for what we’re going to do here is Agentic RAG. But we’re going to own our control flow rather than letting the LLM fully drive the interaction. We’re also going to be practicing engineering techniques to make the system reliable and measure its performance.

Task: Make a course advisor bot

We’ll try to create a chatbot that can help students choose courses according to their interests and goals, using retrieval-augmented generation (RAG) techniques to query the course catalog.

You may choose to do this as a Streamlit app or a Jupyter notebook.

Here’s how I approached it:

First, I defined a set of “tools” that the bot can use, for example, a tool that can query the course catalog and a tool that can recommend a set of courses. (Note that tools are just structured outputs, so we don’t need the model to specifically be trained to “use tools”.)

Then, I wrote a function that basically did the following:

def get_courses_matching_interests(interests):
    messages = [{
        "role": "system",
        "content": "...",  # System message describing the goal, the tools available, and guidance for the conversation.
    }]
    messages.append({
        "role": "user",
        "content": interests
    })

    # Get search queries from the LLM
    search_query_tool = do_llm_call()  # with the Search Query output format required
    messages.append({
        "role": "assistant",
        "content": search_query_tool.model_dump_json()
    })
    # Search for courses matching the queries.
    courses = search_courses(search_query_tool.queries)
    messages.append({
        "role": "user",
        "content": format_courses(courses)
    })

    if len(courses) == 0:
        ...  # Repeat the previous request, so the model can try a different search.

    # Get recommendations from the LLM
    recommendations = do_llm_call()  # with the Recommendations output format required

    return recommendations

We’ll walk through the LLM calls and the course search process below.

Part 1: Structured Output from an LLM

When we’re making a larger system out of component modules, it’s critical that each module have a well-defined interface. Fortunately, we can constrain the LLM to generate responses of a desired format.

Start by getting an OpenAI-compatible LLM endpoint. Here are a few options:

  1. Use the OpenAI API (e.g., gpt-4o), but that will cost money. So…
  2. Use a free Google Gemini API key (accessed through its OpenAI-compatible endpoint), as described in CS 375 Homework 2. Or:
  3. Run locally, using Ollama.

I’d recommend Ollama, because (1) it’s actually running on your computer, and (2) you’ll be constrained to smaller LLMs, so prompt engineering will make a bigger difference. To do this:

  1. Install Ollama.
  2. Start the server: ollama serve
  3. Pull the model: ollama pull gemma3:1b-it-qat

If you have a lot of memory, or a good GPU, you can try gemma3:4b-it-qat.

If you do this, you can use:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
model = "gemma3:1b-it-qat"

Test that your model is working:

completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(completion.choices[0].message.content)

Structured Output

A common library for working with structured data in Python is Pydantic. It allows you to define a data model (not to be confused with an AI model) and then validate that the data you get from any source (an API result, an LLM call, etc.) matches that model.

Here’s an example Pydantic model for a search query:

from typing import Literal
from pydantic import BaseModel

class SearchTool(BaseModel):
    tool_name: Literal["search_course_catalog"] = "search_course_catalog"
    thinking: str
    queries: list[str]


example_search = SearchTool(
    thinking="The user wants to know some trivia.",
    queries=[
        "What is the capital of France?",
        "What is the largest mammal?",
    ])

print(example_search.thinking)
print('; '.join(example_search.queries))

And here’s how we might use it in an OpenAI-compatible LLM call:

completion = client.beta.chat.completions.parse(
    model=model,
    messages=[
        {"role": "system", "content": f"""Write 10 search queries."""},
        {"role": "user", "content": "I'm looking for courses related to AI."},
    ],
    response_format=SearchTool,
    temperature=0.5
)

event = completion.choices[0].message.parsed
event

Observe that the response_format parameter is set to SearchTool, which means the LLM will be forced to output JSON that matches the SearchTool schema.

Try making the following changes to the system prompt and see how they affect the output:

Add an example

For in-context learning, it can sometimes be helpful to provide examples of the kind of output that you expect. But it can also sometimes lead to the model getting fixated on your specific examples. Try it out by adding an example to the system prompt. Try adding something like:

Example:
Student interest: "art"
Queries: ["art", "photography", "visual rhetoric", "painting", "sculpture", "art history", "graphic design", "digital media", "art theory", "contemporary art"]

How useful was adding this example?

Add additional instructions

You might try adding a “notes” section to the system prompt to give the model additional guidance. For example, you could say:

Notes:
- Before responding, write a short thought about what kinds of courses might be relevant to the user's interest.
- Assume that queries will be run against a specific course catalog, so avoid general terms like "course" or "department".
- Ensure that each query would match the title or description of one or more specific courses in an undergraduate program

Did these notes help the model produce better output? How would you measure that?

Add the output schema

You might add the following (within the f-string; note that this requires import json):

The output should be JSON with the following schema: {json.dumps(SearchTool.model_json_schema())}
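
Put together, that part of the system prompt might be built like this sketch:

import json

system_prompt = f"""Write 10 search queries.

The output should be JSON with the following schema: {json.dumps(SearchTool.model_json_schema())}"""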

Overall, which of these changes was most helpful for getting the model to produce useful output? Are there any other changes you could make? Refer to our course readings on prompt engineering for more ideas.

Part 2: Search the Course Catalog

Now we need to find courses that match those queries.

To keep it fast and simple, we’ll use a local mirror of the course catalog.

Here’s how to load that file and search it (sections_json_url should point to the catalog mirror):

import requests

sections_json = requests.get(sections_json_url)
sections_json.raise_for_status()
sections = sections_json.json()

example_section = next(section for section in sections if section['SectionName'].startswith('CS 108'))
print(example_section)

The listing is by section, so it’ll be helpful to organize by course instead:

course_descriptions = {
    section['SectionName'].split('-', 1)[0].strip(): (section["SectionTitle"], section["CourseDescription"])
    for section in sections
    if "CourseDescription" in section
    and section.get('AcademicLevel') == 'Undergraduate'
    and section.get('Campus') == 'Grand Rapids Campus'
}

print("Found", len(course_descriptions), "courses")
print(course_descriptions["CS 108"])

Here’s a function to find courses matching a query:

def search_courses(query: str):
    """
    Search for courses that match the query.
    """
    query = query.lower()
    matches = []
    for course, (title, description) in course_descriptions.items():
        if query in title.lower() or query in description.lower():
            matches.append((course, title, description))
    return matches
search_courses("programming")

If you have multiple queries, you might want to combine the results:

def find_courses_matching_queries(queries: list[str]):
    """
    Find courses that match any of the queries.
    """
    return set(
        course
        for query in queries
        for course in search_courses(query)
    )
find_courses_matching_queries(["programming", "AI"])
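
The outline near the top of this exercise also used a format_courses helper to turn the matches into text for the model. Here is one possible sketch (the exact formatting is up to you):

def format_courses(courses):
    """Format (code, title, description) tuples as text to show the model."""
    if not courses:
        return "No courses matched the search queries."
    return "\n\n".join(
        f"{code}: {title}\n{description}"
        for code, title, description in courses
    )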

Part 3: Recommendations

Here’s a possible recommendation output format (it has some issues that you might want to fix later):

class CourseRecommendation(BaseModel):
    course_code: str
    course_title: str
    course_description: str
    reasoning: str

class RecommendTool(BaseModel):
    tool_name: Literal["recommend_course"] = "recommend_course"
    thinking: str
    recommendations: list[CourseRecommendation]
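
As in Part 1, you can require this format in an LLM call. A sketch (messages is whatever conversation you have built up so far):

completion = client.beta.chat.completions.parse(
    model=model,
    messages=messages,
    response_format=RecommendTool,
    temperature=0.5,
)
recommendations = completion.choices[0].message.parsed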

Now put it all together to make a course advisor bot! First try running these steps “by hand” to see how they work. Then wrap them up in a function, following the rough outline given in the code snippet near the top of this exercise.

Part 4: Testing

Test your bot with a few different student interests. Measure at least the following:

You’ll have to think about how to measure this. You might want to ask a few friends to try it out and give you feedback.

Think about how you could improve the system.