You have two options for this final assignment. You may choose to do one or both.
- Do the assignment described here, on your own or with a partner.
- Do the discussion assignment and post in the Forum on Moodle.
Train a Sequence Model on Your Own Data
Part 1: Training
- Collect a plain text file (`.txt`) of some text that you’ve written recently. (Perhaps all of your discussion posts in this class, or all of the papers you’ve written in the past year, or even all of the code you’ve written for a recent project.) It should be your own work.
- Fine-tune GPT-2 on your text using the `run_clm.py` script documented in the Hugging Face Transformers repo (see the example command after this list).
- You will need a validation `.txt` file also, but it need not be large. You may even be able to leave it empty; I haven’t tested.
- Unless line breaks are significant, use `--keep_linebreaks=false`.
- If you get complaints about the version of `transformers`, grab the script from an older release tag, e.g., 4.27.0.
- You can run a script from a notebook using `!python script.py --args --go --here`.
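A minimal fine-tuning command might look like the sketch below. The file names, output directory, and hyperparameter values are placeholders to adjust, not requirements of the assignment:

```
!python run_clm.py \
    --model_name_or_path gpt2 \
    --train_file my_writing.txt \
    --validation_file valid.txt \
    --keep_linebreaks=false \
    --do_train --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --output_dir ./gpt2-finetuned
```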
Part 2: Generation
Generate some example text. We did this in a Fundamentals notebook a few weeks ago, or you can use this script.
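As a rough sketch, generation from your fine-tuned checkpoint could look like the following, where `./gpt2-finetuned` is the placeholder output directory from the Part 1 example:

```python
from transformers import pipeline

# Load the checkpoint that run_clm.py saved ("./gpt2-finetuned"
# is the placeholder output_dir from the Part 1 sketch).
generator = pipeline("text-generation", model="./gpt2-finetuned")

# Sample a continuation; do_sample=True gives varied output.
result = generator("Write a prompt in your own voice:",
                   max_length=60, do_sample=True, top_p=0.9)
print(result[0]["generated_text"])
```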
Part 3: Evaluations
- Pick a phrase from your corpus.
- Evaluate the perplexity of three different models on your phrase (a code sketch for the first two follows this list):
- The original GPT-2 model
- Your fine-tuned model
- A model in the OpenAI Playground (use Full Spectrum probabilities to show the logprobs)
- What do you notice about the perplexities of the three models?
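For the two Hugging Face models, one standard recipe is to exponentiate the model’s average token loss. A minimal sketch, assuming the same placeholder output directory as above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, phrase: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean
        # cross-entropy loss over the phrase's tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    # Perplexity is exp of the average negative log-likelihood.
    return torch.exp(loss).item()

phrase = "a phrase from your corpus"
print(perplexity("gpt2", phrase))              # original GPT-2
print(perplexity("./gpt2-finetuned", phrase))  # your fine-tuned model
```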
Wrap-Up
Write a brief summary of your experience:
- What data did you use?
- What sort of results did you get?