If under the hood all a language model does is predict the next token, how can you get it to do useful tasks for you? The key idea is to make “doing a task” look like “predicting the next token” in some context. This lab will introduce you to a few ways to do that.
We’ll be using a model released by Google, called Gemma.
This lab will address the following course objectives:
You may also use this lab to demonstrate the following course objectives (e.g., by adding additional discussion to your notebook submission or having a conversation with the instructor or a chatbot):
Start by accepting Google’s license agreement for the Gemma models. If you have any difficulty with accepting this license, let the instructor know.
Start with the Lab 4 notebook. Also open a document where you can write the answers to the questions (we won’t be turning in a notebook for this lab). Create headings for each section of the lab and write your answers under each heading.
Prompt Engineering
(name: u11n1-prompt-engineering.ipynb; show preview, open in Colab)
We’ll use two different models: first, the non-instruction-tuned model, then the instruction-tuned model (-it). We’ll use the “2B” model size for both.
If you’re using Kaggle, the models should already be added to the notebook. Check the Inputs section to see if it already has two Gemma models. If not, add the Gemma model to your notebook by following these steps:
If you’re not on Kaggle, you can use the Hugging Face model hub to download the model; see the code in the notebook for details.
Try completing the following tasks using the Gemma model (without instruction tuning). Do this by modifying the doc given in the example code chunk. You might try setting the do_sample parameter to True (to get a sense of the range of possible outputs), or False (to get a single prediction).
def sum_evens(lst):
    # Input: a list of numbers
    # Output: the sum of the even numbers in the list
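If you want to see how the do_sample setting mentioned above changes the output, a generation call looks roughly like this (a sketch assuming the model and tokenizer objects loaded in the notebook; the notebook’s own helper code may differ):

doc = "The capital of France is"
inputs = tokenizer(doc, return_tensors="pt")
# do_sample=True samples from the distribution; do_sample=False gives a single greedy prediction
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))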
In machine learning jargon, “shot” refers to the number of examples you have of a particular task. “Few-shot learning” refers to the problem of learning a new task with only a few examples.
For example, you might have noticed that the model completes “The capital of France is” as if it were a travel article (or perhaps a multiple-choice question)—because that’s the sort of document it was trained on! But you can give it examples of the sort of things you want. For example, try this instead:
The capital of Michigan is Lansing.
The capital of England is London.
The capital of France is
We can consider the first two lines as “examples” of the task we want the model to do. This is a “few-shot” example, because we’re giving the model only a few examples of the task we want it to do.
Write a brief summary of how Gemma performed on this task, as compared with not giving it any examples.
Also try this:
Request: capital("Michigan")
Response: "Lansing"
Request: capital("England")
Response: "London"
Request: capital("France")
Response:
Try the following prompt, again with plain Gemma:
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
(Before you run this, think about how you would solve this problem.)
What does Gemma predict?
Now, add the following to the prompt: Let's think step by step. After I bought the apples, I had
How does the generated text change?
Instruction tuning does several things to make the model better at following instructions and completing tasks:
Let’s switch to the instruction-tuned model.
Repeat the steps above to add a model, except this time select the variation: ‘gemma-3-1b-it’. Change the model loading code in the notebook to load this model (USE_INSTRUCTION_TUNED should be True). Stop the session and then restart it to run with this new model.
Instruction-tuned models were fine-tuned on documents formatted as dialogues between a user and an assistant. To get the best performance from these models at inference, we need to format our prompts in a similar way as the documents were formatted during fine-tuning. Different models have different fine-tuning formats, but fortunately the HuggingFace Transformers library has code to help us format our prompts correctly for each model.
The “Chat Templating” section of the notebook includes code to format the prompt for the instruction-tuned model. The apply_chat_template method takes a list of messages, where each message is a dictionary with two keys: “role” and “content”. The “role” key can be either “user”, “assistant”, or “system”. The “content” key is the text of the message. See the Gemma documentation for more details.
role = """You are a helpful 2nd-grade teacher. Help a 2nd grader to answer questions in a short and clear manner."""
task = """Explain why the sky is blue"""
messages = [
    {
        "role": "user",
        "content": f"{role}\n\n{task}",
    },
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
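To actually generate a response from this templated prompt, the call might look roughly like this (a sketch assuming the notebook’s model object; the max_new_tokens value is arbitrary):

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True))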
Completing a task often requires information that was not included in a model’s training set. For example, try asking the following question to the instruction-tuned model (either omit the {role} or use a role like “You are an expert in PyTorch”):
“Explain the options for TripletMarginLoss in PyTorch.”
Notice that the result includes hallucinations, i.e., information that it simply made up. This is a common problem with language models.
One way to reduce (but not eliminate) hallucinations is to explicitly provide the model with the information it needs. This is called retrieval-augmented generation. The idea is to provide the model with a “retrieval” of relevant information, which it can then use to generate a response.
We’ll use the docstrings for PyTorch functions as our knowledge base. Use the following code to extract the docstrings for all functions in the torch.nn module:
import inspect
import torch

docstrings = {}
for name, obj in inspect.getmembers(torch.nn):
    if inspect.isfunction(obj) or inspect.isclass(obj):
        docstrings[name] = inspect.getdoc(obj)
Now, give the instruction-tuned model a prompt like:
{task}
Answer using the following context:
{context}
where {task} is the question you want to ask and {context} is the docstring for the function you want to ask about. In this example, use context = docstrings['TripletMarginLoss']. (Refer to the documentation page for the module to check the model’s answer.)
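Concretely, building that prompt might look like this (a sketch reusing the message format from the chat-templating section above):

task = "Explain the options for TripletMarginLoss in PyTorch."
context = docstrings["TripletMarginLoss"]
messages = [
    {
        "role": "user",
        "content": f"{task}\n\nAnswer using the following context:\n\n{context}",
    },
]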
Note: In practice, we use a more sophisticated retrieval system, like a search engine, to provide the model with context. Often, vector search is used for retrieval: we find the document whose vector is most similar to the prompt’s vector. Libraries like Sentence Transformers are often used for this, with embedding models from the Hugging Face model hub such as GTE. See that model’s documentation page for an example; you might try it out on your own, as sketched below.
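For example, a minimal sketch of vector search over the docstrings might look like the following (entirely optional; it assumes the sentence-transformers package is installed, and the GTE checkpoint name is just one example from the Hugging Face hub):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("thenlper/gte-small")
doc_names = [name for name, doc in docstrings.items() if doc]
doc_vectors = encoder.encode([docstrings[name] for name in doc_names], convert_to_tensor=True)

# Embed the query and find the most similar docstring.
query_vector = encoder.encode("triplet margin loss", convert_to_tensor=True)
scores = util.cos_sim(query_vector, doc_vectors)[0]
print(doc_names[scores.argmax().item()])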
We can also prompt the model to use a tool, like a calculator, when it recognizes that it can’t answer a question directly. For example, try the following dialogue:
{
    "role": "user",
    "content": "What is the sum of the odd numbers less than 20?",
},
{
    "role": "assistant",
    "content": """
Run Python code: print(sum(x for x in range(20) if x % 2 == 1))
Code output: 100
The result is 100.""",
},
{
    "role": "user",
    "content": "What is the sum of the even numbers less than 40?",
},
Note that we would need to intercept the generation process and detect that the model has generated a request to run some code – then run that code and insert the result in the dialogue. For simplicity we won’t actually do that in this lab.
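For reference, a rough sketch of what that interception could look like (purely illustrative and not part of this lab; running model-generated code is unsafe outside a sandbox):

import io, contextlib

generated_text = 'Run Python code: print(sum(x for x in range(40) if x % 2 == 0))'
prefix = "Run Python code:"
if prefix in generated_text:
    code = generated_text.split(prefix, 1)[1].splitlines()[0].strip()
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code)  # WARNING: a real system would sandbox this
    print("Code output:", buffer.getvalue().strip())
    # A real system would append this output to the dialogue and let the model continue generating.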
In this case, the “tool” is the Python interpreter. We could also provide the model with other tools, like a search engine (which would insert text from the search results into the dialogue), a call to an API like Wolfram Alpha (see Stephen Wolfram’s blog post on this topic), or a call to an API that does something in the physical world (like turning on a light).
Note: we could treat retrieval as a tool, too. For example, the model could generate a request to run a search query against a database, then insert the results into the dialogue. This is called “agentic RAG”.
Suppose we wanted to make a chatbot that answers incoming students’ questions about Calvin University on topics like courses, schedule, recent events, activities, etc.
Since we didn’t need to write much code today, you don’t need to submit a notebook. Instead, submit the answers to the tasks above.
This exercise is focused on prompting and structured output techniques to attempt to make a useful and reliable system out of an LLM. This exercise will allow you to demonstrate the following course objectives:
The fancy (resume/buzzword) name for what we’re going to do here is Agentic RAG. But we’re going to own our control flow rather than letting the LLM fully drive the interaction. We’re also going to be practicing engineering techniques to make the system reliable and measure its performance.
We’ll try to create a chatbot that can help students choose courses according to their interests and goals, using retrieval-augmented generation (RAG) techniques to query the course catalog.
You may choose to do this as a Streamlit app or a Jupyter notebook.
Here’s how I approached it:
First, I defined a set of “tools” that the bot can use, for example, a tool that can query the course catalog and a tool that can recommend a set of courses. (Note that tools are just structured outputs, so we don’t need the model to specifically be trained to “use tools”.)
Then, I wrote a function that basically did the following:
def get_courses_matching_interests(interests):
    messages = [{
        "role": "system",
        # System message describing the goal, the tools available, and guidance for the conversation.
        "content": "...",
    }]
    messages.append({
        "role": "user",
        "content": interests,
    })
    # Get search queries from the LLM
    search_query_tool = do_llm_call()  # with the Search Query output format required
    messages.append({
        "role": "assistant",
        "content": search_query_tool.model_dump_json(),
    })
    # Search for courses matching the queries.
    courses = search_courses(search_query_tool.queries)
    messages.append({
        "role": "user",
        "content": format_courses(courses),
    })
    if len(courses) == 0:
        # Repeat the previous request, so the model can try a different search.
        ...
    # Get recommendations from the LLM
    recommendations = do_llm_call()  # with the Recommendations output format required
    return recommendations
We’ll walk through the LLM calls and the course search process below.
When we’re making a larger system out of component modules, it’s critical that each module have a well-defined interface. Fortunately, we can constrain the LLM to generate responses of a desired format.
Start by getting an OpenAI-compatible LLM endpoint. Here are a few options:

- You could use a hosted API (e.g., gpt-4o), but that will cost money. So… I’d recommend Ollama, because (1) it’s actually running on your computer, and (2) you’ll be constrained to smaller LLMs, so prompt engineering will make a bigger difference. To do this:
  - Install ollama.
  - Run ollama serve.
  - Run ollama pull gemma3:1b-it-qat. If you have a lot of memory, or a good GPU, you can try gemma3:4b-it-qat.
If you do this, you can use:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
model = "gemma3:1b-it-qat"
completion = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(completion.choices[0].message.content)
A common library for working with structured data in Python is Pydantic. It allows you to define a data model (not to be confused with an AI model) and then validate that the data you get from any source (an API result, an LLM call, etc.) matches that model.
Here’s an example Pydantic model for a search query:
from typing import Literal
from pydantic import BaseModel
class SearchTool(BaseModel):
    tool_name: Literal["search_course_catalog"] = "search_course_catalog"
    thinking: str
    queries: list[str]

example_search = SearchTool(
    thinking="The user wants to know some trivia.",
    queries=[
        "What is the capital of France?",
        "What is the largest mammal?",
    ],
)
print(example_search.thinking)
print('; '.join(example_search.queries))
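Validation also works in the other direction: given a JSON string (for example, one produced by an LLM), Pydantic can parse and check it against the schema. A quick illustration (the field values here are just made up):

raw = '{"tool_name": "search_course_catalog", "thinking": "Find intro courses.", "queries": ["introductory programming"]}'
parsed = SearchTool.model_validate_json(raw)  # raises pydantic.ValidationError if the JSON doesn't match
print(parsed.queries)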
And here’s how we might use it in an OpenAI-compatible LLM call:
completion = client.beta.chat.completions.parse(
    model=model,
    messages=[
        {"role": "system", "content": f"""Write 10 search queries."""},
        {"role": "user", "content": "I'm looking for courses related to AI."},
    ],
    response_format=SearchTool,
    temperature=0.5,
)
event = completion.choices[0].message.parsed
event
Observe that the response_format parameter is set to SearchTool, which means the LLM will be forced to output JSON that matches the SearchTool schema.
Try making the following changes to the system prompt and see how they affect the output:
For in-context learning, it can sometimes be helpful to provide examples of the kind of output that you expect. But it can also sometimes lead to the model getting fixated on your specific examples. Try it out by adding an example to the system prompt. Try adding something like:
Example:
Student interest: "art"
Queries: ["art", "photography", "visual rhetoric", "painting", "sculpture", "art history", "graphic design", "digital media", "art theory", "contemporary art"]
How useful was adding this example?
You might try adding a “notes” section to the system prompt to give the model additional guidance. For example, you could say:
Notes:
- Before responding, write a short thought about what kinds of courses might be relevant to the user's interest.
- Assume that queries will be run against a specific course catalog, so avoid general terms like "course" or "department".
- Ensure that each query would match the title or description of one or more specific courses in an undergraduate program
Did these notes help the model produce better output? How would you measure that?
You might add (within the f-string):
The output should be JSON with the following schema: {json.dumps(SearchTool.model_json_schema())}
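Putting that together, the system message might look like this sketch (it assumes json has been imported; the surrounding wording is just an example):

import json

system_prompt = f"""Write 10 search queries.

The output should be JSON with the following schema: {json.dumps(SearchTool.model_json_schema())}"""

You would then pass system_prompt as the content of the system message in the parse call above.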
Overall, which of these changes was most helpful for getting the model to produce useful output? Are there any other changes you could make? Refer to our course readings on prompt engineering for more ideas.
Now we need to find courses that match those queries.
To keep it fast and simple, we’ll use a local mirror of the course catalog.
Here’s how to load that file and search it:
import requests

# sections_json_url points to the course catalog mirror described above.
sections_json = requests.get(sections_json_url)
sections_json.raise_for_status()
sections = sections_json.json()
example_section = next(section for section in sections if section['SectionName'].startswith('CS 108'))
print(example_section)
The listing is by section, so it’ll be helpful to organize by course instead:
course_descriptions = {
    section['SectionName'].split('-', 1)[0].strip(): (section["SectionTitle"], section["CourseDescription"])
    for section in sections
    if "CourseDescription" in section
    and section.get('AcademicLevel') == 'Undergraduate'
    and section.get('Campus') == 'Grand Rapids Campus'
}
print("Found", len(course_descriptions), "courses")
print(course_descriptions["CS 108"])
Here’s a function to find courses matching a query:
def search_courses(query: str):
    """
    Search for courses that match the query.
    """
    query = query.lower()
    matches = []
    for course, (title, description) in course_descriptions.items():
        if query in title.lower() or query in description.lower():
            matches.append((course, title, description))
    return matches
search_courses("programming")
If you have multiple queries, you might want to combine the results:
def find_courses_matching_queries(queries: list[str]):
    """
    Find courses that match any of the queries.
    """
    # Note: search_courses returns (course, title, description) tuples,
    # so this is a set of those tuples.
    return set(
        course
        for query in queries
        for course in search_courses(query)
    )
find_courses_matching_queries(["programming", "AI"])
Here’s a possible recommendation output format (it has some issues that you might want to fix later):
class CourseRecommendation(BaseModel):
    course_code: str
    course_title: str
    course_description: str
    reasoning: str

class RecommendTool(BaseModel):
    tool_name: Literal["recommend_course"] = "recommend_course"
    thinking: str
    recommendations: list[CourseRecommendation]
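You can request this output format from the LLM the same way the SearchTool format was requested earlier. A sketch (reusing client, model, and search_courses from above; the prompt wording and the example interest are just placeholders):

found = search_courses("artificial intelligence")
course_listing = "\n".join(f"{code}: {title} - {description}" for code, title, description in found)

completion = client.beta.chat.completions.parse(
    model=model,
    messages=[
        {"role": "system", "content": "Recommend a few of the provided courses that fit the student's interests."},
        {"role": "user", "content": f"Interest: AI\n\nCourses found:\n{course_listing}"},
    ],
    response_format=RecommendTool,
)
print(completion.choices[0].message.parsed.recommendations)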
Now it’s time to make a course advisor bot! First, try running these steps “by hand” to see how they work. Then put it all together in a function, following the rough outline given in the code snippet above.
Test your bot with a few different student interests. Measure at least the following:
You’ll have to think about how to measure this. You might want to ask a few friends to try it out and give you feedback.
Think about how you could improve the system.