Professor Arnold is willing to advise one of the following research-y projects. Others may be permitted if you have a very clear proposal, discussed well in advance.
(You may notice some commonalities among these ideas. That’s intentional.)
Denoising Autoencoder for Text
Goal: help people clean up or edit text.
Approach: train a model to map corrupted text back to the original text. (In general this is called a “denoising autoencoder” because it learns to remove “noise.”) Specific approaches (a corruption sketch follows this list):
- drop out some words entirely (this is the main approach used in TSDAE)
- add noise to the word vectors
- treat position as a continuous value and add noise to it (so word order gets partly shuffled). Explicit position inputs are a distinctive characteristic of Transformer models, and I don’t know whether this has been explored in autoencoders before.
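A minimal sketch of these three corruption options, assuming a PyTorch-style pipeline; the function names and noise rates are illustrative, not taken from any particular paper:

```python
import random

import torch

def drop_words(tokens, p=0.3):
    """Deletion noise in the spirit of TSDAE: drop each token with probability p."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else tokens[:1]  # never return an empty sequence

def noisy_embeddings(embeds, sigma=0.1):
    """Add Gaussian noise to the word vectors; embeds has shape (seq_len, dim)."""
    return embeds + sigma * torch.randn_like(embeds)

def noisy_positions(seq_len, sigma=1.5):
    """Treat position as continuous and jitter it, partly shuffling word order."""
    return torch.arange(seq_len, dtype=torch.float) + sigma * torch.randn(seq_len)
```

The jittered positions would replace the integer positions fed to the positional encoding; in all three cases the decoder is trained to reconstruct the original sentence.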
Text vectorization using prefix learning
In many tasks it’s helpful to have a vector that captures the general meaning of a sentence (search, paraphrasing, summarization, etc.). In Transformer models, the representation generally has one vector per input token (exception: the Perceiver architecture, e.g., Perceiver AR). But it’s unclear how to compare these representations by similarity, and they may capture too much specific information about the input. So: what if we try to use smaller sequences to represent the larger ones? This is the basic idea of compression, which is fundamental to learning.
The paper Multimodal Few-Shot Learning with Frozen Language Models exemplifies recent results in using pretrained models: you keep the generation model frozen and just prepend some virtual tokens to the input, represented by vectors generated by an encoder model. Those tokens basically set up a virtual context that the existing model completes. This is cool because there are way fewer parameters to learn: you don’t need to learn to generate English, only to control the existing model so that it generates the sequence you want by itself.
The approach here would look like this (a sketch follows the list):
- Set up a frozen language model (like GPT-2) as a decoder.
- Set up the model to take a learnable vector (or several vectors) as the beginning of the decoded text.
- Set up an encoder to read the input sentence and output those vectors.
- Train the model so that both the encoder and decoder get the target sentence, and optimize the cross-entropy loss of the target sequence.
- Investigate the representation that it learned, e.g., what word vectors are close to the learned vectors for the sentence?
- Evaluate this using sentence-vector metrics; see Sentence Transformers.
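A minimal sketch of this setup, assuming Hugging Face transformers and PyTorch; the class name, the choice of a BERT encoder, and the number of prefix vectors are illustrative assumptions, not requirements:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, GPT2LMHeadModel

class PrefixSentenceEncoder(nn.Module):
    def __init__(self, n_prefix=4, enc_name="bert-base-uncased", dec_name="gpt2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(enc_name)        # trainable
        self.decoder = GPT2LMHeadModel.from_pretrained(dec_name)  # frozen
        for p in self.decoder.parameters():
            p.requires_grad = False
        self.n_prefix = n_prefix
        # Map the encoder's pooled output to n_prefix "virtual token" vectors.
        self.to_prefix = nn.Linear(self.encoder.config.hidden_size,
                                   n_prefix * self.decoder.config.n_embd)

    def forward(self, enc_inputs, dec_input_ids):
        pooled = self.encoder(**enc_inputs).last_hidden_state[:, 0]  # [CLS]-style vector
        prefix = self.to_prefix(pooled).view(-1, self.n_prefix,
                                             self.decoder.config.n_embd)
        # Embed the target tokens with the frozen decoder's own embedding table.
        tok_emb = self.decoder.transformer.wte(dec_input_ids)
        inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
        # Ignore the prefix positions in the loss (-100 is the ignore index).
        prefix_labels = torch.full(prefix.shape[:2], -100,
                                   dtype=torch.long, device=dec_input_ids.device)
        labels = torch.cat([prefix_labels, dec_input_ids], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds, labels=labels).loss
```

After training, the prefix vectors (or the pooled encoder output) would be the sentence representation to compare by cosine similarity or to probe against the decoder’s word-embedding table.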
Question generation by inverting a QA system
Goal: When a writer feels “stuck” about what to write, they can ask a system to generate questions that they might answer.
Approach: One option is to take an existing question-answering system and run it backwards: given answers, generate questions. But that might not help writers, because the system would need the answer to already be written. However, question-answering datasets usually annotate where in the document the answer to the question was found. So what if we just blank out the part of the document that gives the answer and train the model to generate the corresponding question?
See Question Answering Datasets | Papers With Code for some potential datasets. You could probably use one of the Transformers example codebases with little modification.
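A rough data-preparation sketch, assuming SQuAD-style records loaded with the Hugging Face datasets library; the [BLANK] placeholder and the field handling are illustrative choices:

```python
from datasets import load_dataset

def qa_to_question_gen(example):
    context = example["context"]
    answers = example["answers"]
    start = answers["answer_start"][0]
    end = start + len(answers["text"][0])
    # Blank out the answer span so the model must ask about the gap
    # rather than copy the answer.
    blanked = context[:start] + "[BLANK]" + context[end:]
    return {"source": blanked, "target": example["question"]}

squad = load_dataset("squad", split="train")
pairs = squad.map(qa_to_question_gen, remove_columns=squad.column_names)
print(pairs[0]["source"][:200], "->", pairs[0]["target"])
```

The resulting source/target pairs could then be fed to a standard seq2seq fine-tuning script.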
Decompose and Recompose Complex Sentences using Simple Sentences
Writers sometimes like to write really long and complicated sentences because those are the first things that come to mind and it’s easy to just keep typing and get your ideas out there but it’s not really clear what you’re trying to say and you’re thinking while you’re writing so you end up with this big long train of thought that’s hard for people to follow and it would be really helpful to readers if the writer could split the big sentence apart into little sentences that are simpler but sometimes there are actually complicated things that the writer is trying to explain and the simple little sentences get hard to follow so we don’t necessarily want to do this entirely automatically so it would be helpful to have the writer stay in control of this process. So:
- Task 1: Complex input sentence in, set of simpler sentences out.
- Task 2: Set of simple sentences in, combined sentence out.
Possible dataset: BiSECT Dataset | Papers With Code. There are also “sentence combination” exercises that language students do; there are probably some datasets from those.
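A minimal sketch of building both task directions from one aligned pair; the task prefixes and the example sentences are made up, and the idea would be to train a single seq2seq model on both directions:

```python
def make_examples(complex_sent, simple_sents):
    """Turn one (complex sentence, simple sentences) pair into both task directions."""
    joined = " ".join(simple_sents)
    return [
        {"source": "split: " + complex_sent, "target": joined},    # Task 1
        {"source": "combine: " + joined, "target": complex_sent},  # Task 2
    ]

print(make_examples(
    "Although it was raining, the game, which had been delayed twice, went ahead.",
    ["It was raining.", "The game had been delayed twice.", "The game went ahead."],
))
```

Keeping the writer in control could then mean presenting the split (or combined) candidates as suggestions rather than applying them automatically.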
Effect of different loss functions for NLP
NLP models tend to focus on common, expected phrasing. How could we get models to capture the richness of how people express themselves? One of many possible approaches could be to tweak the loss function. Current models almost exclusively use cross-entropy loss, but other losses might encourage different model behavior. A few potential options:
- cyclical focal loss (see General Cyclical Training of Neural Networks for a description in context)
- use the unigram or bigram frequency to weight the cross-entropy loss (a sketch follows this list)
- penalize the difference between the model’s cross-entropy loss and the cross-entropy loss that a sentence should have: certain sequences simply should be more informative / surprising than others because they convey information, so a model shouldn’t be too confident about them. The expected loss could perhaps be computed from corpus statistics. (Related idea: the actor-critic setting in reinforcement learning.)
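As a concrete example of the frequency-weighting option, a hedged sketch in PyTorch; the inverse-frequency weighting scheme and the alpha parameter are illustrative choices:

```python
import torch
import torch.nn.functional as F

def freq_weighted_ce(logits, targets, token_counts, alpha=0.5, ignore_index=-100):
    """logits: (B, T, V); targets: (B, T); token_counts: (V,) corpus unigram counts."""
    probs = token_counts.float() / token_counts.sum()
    weights = probs.clamp_min(1e-9).pow(-alpha)              # rarer token -> larger weight
    weights = (weights / weights.mean()).to(logits.device)   # keep the loss scale comparable
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        weight=weights,
        ignore_index=ignore_index,
    )
```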
One way to measure success would be to generate from this model and check whether the distribution of characteristics of the generated sentences matches the corpus distribution.
Predictive Text from Very Rough Drafts (e.g., rambling speech)
Speech recognition technology is a powerful and efficient way to enter text on a touchscreen device, but many people don’t use it. One reason is that it is cognitively challenging: you must think of exactly what to say, and how to say it clearly enough to be understood, the first time, potentially in a distracting or non-private environment. But what if you could first “think out loud” about what you want to say, perhaps whispering a stream of consciousness to your phone—then your phone would give you (a) an outline of the main points you wanted to say and (b) really accurate predictions about what word to type next in order to say it?
- Input: a “stream of consciousness” rambling about something you want to communicate, likely already in text form (perhaps transcribed by a low-quality speech recognizer)
- Training data could be generated automatically by corrupting ground-truth outputs in various ways (see the sketch below).
- Output: a prediction of the next word to be typed in the final message you want to send.
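A toy sketch of that automatic corruption, with invented disfluencies and probabilities; a real pipeline might also simulate speech-recognition errors:

```python
import random

FILLERS = ["um", "uh", "like", "you know", "I mean"]

def ramble(sentence, p_filler=0.1, p_repeat=0.1, p_drop=0.05, seed=None):
    """Turn a clean target sentence into a noisy 'stream of consciousness' input."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if rng.random() < p_drop:
            continue                         # drop the word entirely
        if rng.random() < p_filler:
            out.append(rng.choice(FILLERS))  # insert a disfluency
        out.append(word)
        if rng.random() < p_repeat:
            out.append(word)                 # stutter / repetition
    return " ".join(out)

print(ramble("I wanted to ask if we could move the meeting to Thursday afternoon", seed=0))
```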
Other language tasks
- Paraphrasing using any-to-any translation
- Mining Wikipedia edit history (or other datasets, like Newsela) for data on editing language
- Language models that learn to generate “blanks” that a writer would need to fill in
De-EQ
Given a sound corrupted by a random EQ curve or other processing step, predict the parameters of that processing step. This kind of task is called self-supervised learning. See Microsoft’s HEXA.
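A minimal sketch of how training pairs could be generated, assuming torchaudio for the EQ step; the parameter ranges are illustrative:

```python
import torch
import torchaudio.functional as AF

def random_eq_example(waveform, sample_rate):
    """Apply a random peaking EQ and return (corrupted audio, EQ parameters)."""
    center_freq = float(torch.empty(1).uniform_(100.0, 8000.0))
    gain_db = float(torch.empty(1).uniform_(-12.0, 12.0))
    q = float(torch.empty(1).uniform_(0.5, 4.0))
    corrupted = AF.equalizer_biquad(waveform, sample_rate, center_freq, gain_db, q)
    # The sampled parameters become the regression target for the model.
    return corrupted, torch.tensor([center_freq, gain_db, q])
```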