Block Recurrent Transformer

A PyTorch implementation of Hutchins & Schlag et al. Owes a great deal to Phil Wang's x-transformers. Very much a work in progress.

A Dockerfile, requirements.txt, and environment.yaml are all included, because I love chaos.

Differences from the Paper (as of 2022/05/04)

Keys and values are not shared between the "vertical" and "horizontal" directions (the standard input -> output information flow and the recurrent state flow, respectively).
The state vectors are augmented with Rotary Embeddings for positional encoding, instead of using learned embeddings.
The special LSTM gate initialization is not yet implemented.
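
To make the rotary-embedding point above concrete, here is a minimal sketch of rotating a set of state vectors by position-dependent angles. It is illustrative only and is not taken from this repository: the helper names (`rotary_angles`, `rotate_half`, `apply_rotary`), the tensor shapes, and the base of 10000 are all assumptions, and the rotation is applied directly to the state tensor for simplicity, whereas an implementation may instead rotate the queries and keys derived from it.

```python
import torch


def rotary_angles(num_positions: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles of shape (num_positions, dim). `dim` must be even.

    Hypothetical helper: one frequency per feature pair, repeated so the
    result lines up with the half-split layout used by `rotate_half`.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    positions = torch.arange(num_positions, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (num_positions, dim // 2)
    return torch.cat((angles, angles), dim=-1)     # (num_positions, dim)


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map the two halves (x1, x2) of the last dimension to (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate each feature pair of `x` by the matching angle."""
    return x * angles.cos() + rotate_half(x) * angles.sin()


# Assumed shapes: (batch, number of state vectors, feature dimension).
batch, num_state, dim = 2, 8, 16
state = torch.randn(batch, num_state, dim)

# Each state vector gets an index-dependent rotation; no learned parameters.
state_rot = apply_rotary(state, rotary_angles(num_state, dim))
```

Because the operation is a pure rotation, it preserves the norm of every state vector and leaves the vector at index 0 unchanged, and it adds no parameters to train, unlike a learned embedding table.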