Ken's hunches

The project for this class is a great place to explore crazy ideas that probably won’t work but would be cool if they did. Here are a few things I’ve thought of, just to spark ideas.

  • Look at the change in loss (maybe proportional to the change in activations?) on a batch other than the one driving the update
    • could be a useful diagnostic for learning
    • could be useful for adapting learning rate / strategy
  • For sequence modeling tasks, what if attention values could mix with the attention values of nearby timesteps? A kind of smoothed attention
    • More flexibly: the attention given to a token at “past” timesteps can be used as input to computing the attention at the current timestep
  • Regularization by “crossing wires” (mixing weights or activations with reshuffled versions from the same layer), or by a local blur (nonlinear?), or by a windowed softmax for local inhibition.
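To make the first hunch concrete, here’s a toy sketch of probing the loss on a held-out batch before and after each update, and using the change to adapt the learning rate. Everything here (logistic regression, the halving rule, all names) is my own illustration of the idea, not a worked-out method:

```python
import numpy as np

def loss_and_grad(w, X, y):
    # logistic loss and gradient on one batch
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_train = rng.normal(size=(64, 5)); y_train = (X_train @ w_true > 0).astype(float)
X_probe = rng.normal(size=(64, 5)); y_probe = (X_probe @ w_true > 0).astype(float)

w, lr = np.zeros(5), 0.5
for step in range(50):
    _, g = loss_and_grad(w, X_train, y_train)
    before, _ = loss_and_grad(w, X_probe, y_probe)   # probe loss pre-update
    w = w - lr * g
    after, _ = loss_and_grad(w, X_probe, y_probe)    # probe loss post-update
    if after > before:   # the update hurt the probe batch: back off the step size
        lr *= 0.5

final_probe_loss, _ = loss_and_grad(w, X_probe, y_probe)
```

The interesting part is the `after - before` signal itself, which could be logged as a diagnostic even without the learning-rate adaptation.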
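The smoothed-attention hunch could look something like the following: compute ordinary scaled dot-product attention weights, then blur each query’s weights across neighboring key timesteps before mixing the values. The blur kernel and the re-normalization step are my own assumptions about how one might implement it:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smoothed_attention(Q, K, V, kernel=(0.25, 0.5, 0.25)):
    # standard scaled dot-product attention weights: shape (T_q, T_k)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    # blur each query's weights across neighboring key timesteps
    k = np.asarray(kernel)
    pad = len(k) // 2
    A_pad = np.pad(A, ((0, 0), (pad, pad)), mode="edge")
    A_smooth = np.stack([np.convolve(row, k, mode="valid") for row in A_pad])
    # re-normalize so each query's weights still sum to 1
    A_smooth /= A_smooth.sum(axis=-1, keepdims=True)
    return A_smooth @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 3))
out = smoothed_attention(Q, K, V)
```

The “more flexible” variant in the sub-bullet, where past attention weights feed into computing the current ones, would need the kernel replaced by a learned, causal function of previous rows of `A`.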
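And a minimal sketch of the “crossing wires” regularizer: during training, blend each unit’s activation with a randomly permuted copy of the same layer’s activations, so no unit can rely on a perfectly private signal path. The mixing coefficient and the per-call permutation are assumptions for illustration:

```python
import numpy as np

def cross_wires(h, mix=0.1, rng=None):
    # blend activations with a reshuffled "wiring" of units in the same layer
    rng = rng or np.random.default_rng()
    perm = rng.permutation(h.shape[-1])
    return (1 - mix) * h + mix * h[..., perm]

rng = np.random.default_rng(0)
h = rng.normal(size=(32, 16))            # a batch of layer activations
h_mixed = cross_wires(h, mix=0.1, rng=rng)
```

Note that because the blend is a convex combination over a permutation, the per-example sum of activations is preserved exactly; the local-blur and windowed-softmax variants would instead mix only nearby units rather than arbitrary ones.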
Ken Arnold
Assistant Professor of Computer Science