Language Model Inputs and Outputs¶

Task: Trace the complete pipeline from a text string to model output tokens; explain how a chat conversation is represented as a structured document.

Objectives: OG-LLM-Tokenization, OG-LLM-ConversationAsDocument

By the end of this notebook you should be able to answer:

  • What does a language model actually receive as input?
  • What does it produce as output?
  • How does a multi-turn chat conversation get turned into something the model can process?

Setup¶

We'll use Qwen2.5-0.5B-Instruct, a small (500M parameter) instruction-tuned language model. It's fast enough to run on free Kaggle/Colab GPUs, and supports chat templates — which we'll need in Section 3.

Run this cell to load the model. It may take a minute.

In [ ]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = 'Qwen/Qwen2.5-0.5B-Instruct'

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', dtype=torch.bfloat16)
streamer = TextStreamer(tokenizer)
# Silence a warning.
tokenizer.decode([tokenizer.eos_token_id]);
print("Loaded.")
Loading tokenizer...
Loading model...
Loaded.
In [ ]:
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Model parameters: {model.num_parameters():,}")
print(f"Device: {model.device}, dtype: {model.dtype}")

Section 1: From Words to Numbers¶

A language model cannot process text directly — it works with integers. The tokenizer converts text into a sequence of token IDs (and back again).

The tokenizer has two jobs:

  1. Segment the text into subword pieces called tokens
  2. Map each token to an integer ID from the vocabulary

Let's trace this pipeline step by step.
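Before you try it on the real tokenizer, here is the two-job pipeline in miniature, using a made-up five-entry vocabulary and greedy longest-match segmentation (a toy sketch only; real tokenizers use learned subword vocabularies of ~150,000 entries and a more careful merge procedure):

```python
# Toy illustration of the tokenizer's two jobs (NOT the real algorithm):
# a hypothetical five-entry vocabulary, segmented by greedy longest match.
toy_vocab = {"I": 0, " visited": 1, " Mus": 2, "ke": 3, "gon": 4}

def toy_tokenize(text):
    """Job 1: segment text into pieces, greedily matching the longest entry."""
    tokens = []
    while text:
        match = max((t for t in toy_vocab if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

tokens = toy_tokenize("I visited Muskegon")
ids = [toy_vocab[t] for t in tokens]  # Job 2: map each piece to its integer ID
print(tokens)  # -> ['I', ' visited', ' Mus', 'ke', 'gon']
print(ids)     # -> [0, 1, 2, 3, 4]
```

Rare words like "Muskegon" are not in the vocabulary as a whole, so they get stitched together from smaller pieces — keep that in mind for exercise 1a.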

In [ ]:
phrase = "I visited Muskegon"

1a. Use tokenizer.tokenize to split the phrase into tokens.

In [ ]:
# your code here

Notice the Ġ character at the beginning of some tokens. In byte-level BPE tokenizers like this one, Ġ stands in for a space before the token in the original text; it's how the tokenizer records word boundaries within a flat sequence of tokens.
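Because Ġ is just an ordinary character standing in for a space, you can reassemble ASCII text from GPT-2-style token pieces with plain string operations (a sketch with hypothetical pieces; the real decoder also reverses a full byte-to-character mapping):

```python
# Hypothetical token pieces in the byte-level BPE style, where a
# leading Ġ means "this token was preceded by a space".
pieces = ["I", "Ġvisited", "ĠMus", "keg", "on"]

# Concatenate, then turn each Ġ back into the space it represents.
recovered = "".join(pieces).replace("Ġ", " ")
print(recovered)  # -> I visited Muskegon
```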

What do you observe about how "Muskegon" was split? Why might this happen?

your answer here

1b. Use tokenizer.encode to convert the phrase directly to integer IDs. Call the result input_ids.

In [ ]:
input_ids = ...
input_ids

1c. Use tokenizer.decode to convert input_ids back to a readable string. Verify the round-trip works.

In [ ]:
# your code here

1d. Try tokenizing a made-up word or a badly misspelled word. What happens, and why?

In [ ]:
# your code here

your answer here

Section 2: Packaging Input for the Model¶

The model doesn’t take a Python list — it takes a PyTorch tensor with a specific shape. This section shows how to package text as a model-ready batch and run a generation.
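The key idea is the extra batch dimension: the tensor holds a list of sequences, even when there is only one sequence. A plain-Python sketch of the same shape logic, with hypothetical token IDs:

```python
# One sequence of 4 hypothetical token IDs...
ids = [40, 18838, 21043, 6241]

# ...wrapped in a batch of size 1, matching the (1, N) tensor shape
# the model expects.
batch_of_ids = [ids]

shape = (len(batch_of_ids), len(batch_of_ids[0]))
print(shape)  # -> (1, 4)
```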

2a. Call tokenizer(phrase, return_tensors='pt') and inspect the result. What keys does the dictionary have? What is the shape of input_ids?

In [ ]:
batch = tokenizer(...)
print(batch)
print("input_ids shape:", batch['input_ids'].shape)

Fill in the blanks:

The shape (1, N) means ___ example(s) in the batch, and the sequence has ___ tokens.

your answer here

2b. Run model.generate on the batch to see generation in action. The streamer will print tokens as they appear.

In [ ]:
with torch.inference_mode():
    output_ids = model.generate(
        **batch.to(model.device),
        max_new_tokens=20,
        do_sample=False,
        streamer=streamer
    )

2c. Decode output_ids[0] (the first example in the batch) to a string. Notice that the output includes the input tokens — the model returns the entire sequence, not just the new part.

In [ ]:
# your code here
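A related pattern you will use often: since the returned sequence is prompt plus continuation, you can recover just the new tokens by slicing from the prompt length onward. A toy sketch with plain lists and hypothetical IDs:

```python
# Hypothetical IDs: the prompt the model received, and the full
# sequence generate() returned (which echoes the prompt first).
prompt_ids = [40, 18838, 21043]
full_ids = [40, 18838, 21043, 11, 1234, 567]

# Keep only the newly generated part.
new_ids = full_ids[len(prompt_ids):]
print(new_ids)  # -> [11, 1234, 567]
```

With real tensors the same idea is `output_ids[0, batch['input_ids'].shape[1]:]`.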

Section 3: A Conversation is a Document¶

Here’s the central insight of this unit: the model has no built-in concept of “user” and “assistant.” It just completes text. When we use a chat model, the conversation is encoded as a specially formatted document, and the model predicts what comes next.

The apply_chat_template method takes a list of messages and formats them into that document. Let’s see exactly what it produces.

3a. Call apply_chat_template with tokenize=False to see the raw string before it becomes numbers.

In [ ]:
messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

# See the raw document string BEFORE tokenization
raw_doc = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(raw_doc)

In the output above, identify:

  • The special tokens that mark the start and end of the user’s turn
  • What comes at the very end (the “generation prompt” — this is where the model will continue from)

your answer here

3b. Now tokenize the same messages and generate a response.

In [ ]:
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors='pt', return_dict=True
)
print(f"Input shape: {tokenized_chat['input_ids'].shape}")
with torch.inference_mode():
    output_ids = model.generate(
        **tokenized_chat.to(model.device),
        max_new_tokens=40,
        do_sample=False,
        streamer=streamer
    )

3c. Build a two-turn conversation: add the assistant’s response as a second message, then ask a follow-up. Print the raw document string. What does the full conversation look like as a document?

In [ ]:
messages_2turn = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is it famous for?"},
]

# your code here

You should see the entire conversation history stitched together as one document — both turns, with special tokens separating them. The model’s task at the end is simply to predict the next token after <|im_start|>assistant.
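The stitching logic can be sketched by hand. This toy renderer uses the ChatML-style `<|im_start|>`/`<|im_end|>` markers that Qwen uses; note that the real template also adds a default system turn and other details, so in practice you should always use `apply_chat_template` rather than formatting by hand:

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a message list as one flat ChatML-style document (toy sketch)."""
    doc = ""
    for m in messages:
        doc += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # The cue the model continues from: an opened, empty assistant turn.
        doc += "<|im_start|>assistant\n"
    return doc

msgs = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is it famous for?"},
]
print(to_chatml(msgs))
```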

3d. Count the token overhead from the template. How many tokens does the formatting add beyond the raw message content?

In [ ]:
# Total tokens in the formatted chat:
tokenized_2turn = tokenizer.apply_chat_template(...)
total_tokens = tokenized_2turn.shape[1]

# Tokens for just the message content (no template):
content_only = ' '.join(m['content'] for m in messages_2turn)
content_tokens = len(tokenizer.encode(content_only))

print(f"Total tokens (with template): {total_tokens}")
print(f"Content tokens only: {content_tokens}")
print(f"Template overhead: {total_tokens - content_tokens} tokens")

Section 4: Why Tokenization Choices Matter¶

Objectives: OG-LLM-Tokenization, OG-LLM-ConversationAsDocument

The tokenizer’s design shapes what the model can and cannot do easily. Let’s look at three concrete consequences.

4a. Letter counting. Tokenize the word "strawberry". Then explain why asking an LLM “how many r’s are in strawberry?” is structurally harder than it seems.

In [ ]:
# your code here

your answer here

4b. Capitalization. Tokenize "Paris", "paris", and "PARIS". Does capitalization affect the tokenization? What about the number of tokens?

In [ ]:
# your code here

your answer here

4c. Token efficiency. Tokenize an English sentence and a rough translation of it into another language (or a snippet of code), and compare tokens per character. Which inputs cost more tokens, and why might that matter for context windows and API pricing?

In [ ]:
# your code here

your answer here

Analysis Questions¶

Write your answers in the markdown cells below. Aim for 3–5 sentences each.

Q1 (OG-LLM-Tokenization): Describe the complete journey from a user’s typed message to the model’s first output token. Be specific — name each step, what goes in, and what comes out. Your answer should mention: tokenization, chat template / special tokens, input_ids, model.generate, and decode.

your response here

Q2 (OG-LLM-ConversationAsDocument): A model trained only on text documents can behave as a chat assistant — how? Describe what the “document” looks like just before the assistant’s first token is predicted.

your response here

Q3 (OG-LLM-Tokenization): Pick one of the three tokenization effects from Section 4 (letter counting, capitalization, or token efficiency). Explain the root cause, and describe a realistic developer scenario where you would need to account for it.

your response here

Q4 (Bonus — preview of next week): Qwen2.5-0.5B-Instruct has a vocabulary of ~150,000 tokens. At each generation step, the model outputs one score (logit) for every possible next token.

If the input so far is 50 tokens long, what is the shape of the logits tensor for that generation step? Why does vocabulary size matter for memory and compute?

your response here