This is an optional mini-project that builds on the tokenization exercise we did this week.

Select 4-5 text samples (100-200 words each) from different domains; a sketch of one way to store them in code follows the list. Examples include:
- General prose (e.g., news article)
- Code or structured data (e.g., HTML, JSON, XML, CSV, …)
- Technical/scientific text
- Social media/informal text
- Non-English or multilingual text
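A convenient way to keep the samples organized is a plain dict keyed by domain label, as in the hypothetical sketch below; the snippets shown are short placeholders, so substitute your own 100-200 word excerpts and whatever domain labels you pick.

```python
# Hypothetical layout for the text samples: a dict keyed by domain label.
# The values are short placeholders -- swap in your own 100-200 word excerpts.
samples = {
    "news": "The city council voted on Tuesday to approve the transit expansion...",
    "code": '{"user": {"id": 42, "name": "Ada", "tags": ["admin", "beta"]}}',
    "scientific": "We minimize the cross-entropy loss over the held-out corpus...",
    "social": "ngl this update is wild lol, can't believe they actually shipped it",
    "multilingual": "El modelo procesa texto en varios idiomas a la vez.",
}
```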
Select two or three tokenizers to compare (e.g., the Llama or Gemma tokenizers, spaCy, or scikit-learn's CountVectorizer). You may use the same tokenizers from the previous exercise or try new ones.
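The snippet below is a minimal sketch of loading a few tokenizers behind one common interface (a callable from string to a list of token strings) so the same measurement code can run over all of them. The model names `gpt2` and `bert-base-uncased` are just ungated stand-ins for whichever models you choose; gated models such as Llama or Gemma require requesting access on the Hugging Face Hub first, and scikit-learn could be slotted in the same way via `CountVectorizer().build_analyzer()`.

```python
# A minimal sketch, assuming the `transformers` and `spacy` packages are installed.
from transformers import AutoTokenizer
import spacy

hf_gpt2 = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE
hf_bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
spacy_en = spacy.blank("en")  # rule-based word splitter, no model download needed

# Common interface: display name -> callable that maps a string to token strings.
tokenizers = {
    "gpt2 (BPE)": hf_gpt2.tokenize,
    "bert (WordPiece)": hf_bert.tokenize,
    "spacy (rule-based)": lambda text: [t.text for t in spacy_en(text)],
}

for name, tokenize in tokenizers.items():
    print(name, tokenize("Tokenization behaves very differently across domains.")[:8])
```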
For each tokenizer, tokenize each text sample and calculate the following metrics (a sketch of the calculations follows this list):
- Characters-per-token ratio
- 5 longest tokens
- 5 words that the tokenizer splits into multiple tokens (if applicable)
- 5 words mapped to an “unknown” token (if applicable)
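One possible way to compute these metrics is sketched below, assuming a Hugging Face tokenizer such as the ones loaded above. The word-level checks simply split on whitespace, which is a simplification, and the unknown-token check only applies when the tokenizer actually defines an `unk_token` (byte-level BPE tokenizers such as GPT-2's essentially never emit one).

```python
# A sketch of the per-sample metrics for a Hugging Face tokenizer.
from transformers import AutoTokenizer

def tokenizer_report(tok, text, top_n=5):
    tokens = tok.tokenize(text)

    # Characters-per-token ratio: higher means the text packs more characters
    # into each token, i.e. it tokenizes more efficiently.
    chars_per_token = len(text) / max(len(tokens), 1)

    # The top_n longest distinct tokens produced for this sample.
    longest = sorted(set(tokens), key=len, reverse=True)[:top_n]

    # Whitespace-separated words that the tokenizer splits into two or more tokens.
    words = text.split()
    multi_token = [w for w in words if len(tok.tokenize(w)) > 1][:top_n]

    # Words that map to the tokenizer's unknown token, if it defines one.
    unknown = []
    if tok.unk_token is not None:
        unknown = [w for w in words if tok.unk_token in tok.tokenize(w)][:top_n]

    return {
        "chars_per_token": round(chars_per_token, 2),
        "longest_tokens": longest,
        "multi_token_words": multi_token,
        "unknown_words": unknown,
    }

# Example usage with one sample; loop over your own samples instead.
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer_report(bert, "The citizens' assembly convened to discuss decarbonization policy."))
```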
Analyze the results and answer the following questions:
- Which types of text tokenize most efficiently and why?
- How might these insights influence prompt design for different tasks?
- What specific improvements could be made to the tokenization process for one of your text domains?