If we use fixed positional encodings, these pre-calculated embeddings will have
the same positions as the current context.
They introduce relative positional encoding, where the positional encodings
are introduced at the attention calculation.
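As a rough intuition, here is a minimal, single-head sketch of how the relative distance $i - j$ can enter the attention score at computation time. This is not the repository's `relative_mha` implementation; the function name, shapes, and the naive embedding lookup are assumptions for illustration.

```python
import torch

def relative_attention_scores(q: torch.Tensor, k: torch.Tensor,
                              rel_emb: torch.Tensor,
                              u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head sketch of relative attention scores, following the
    Transformer-XL style decomposition
        score[i, j] = q_i.k_j + q_i.r_{i-j} + u.k_j + v.r_{i-j}
    where r_{i-j} embeds the relative distance i - j and u, v are learned biases.

    q:       [q_len, d]        queries for the current context
    k:       [k_len, d]        keys over memory + current context
    rel_emb: [num_dist, d]     embeddings indexed by relative distance
                               (assumed large enough to cover every distance)
    u, v:    [d]
    """
    q_len, d = q.shape
    k_len = k.shape[0]

    # Content-based terms: (q_i + u) . k_j
    content = (q + u) @ k.t()                                 # [q_len, k_len]

    # Query i sits at absolute position i + (k_len - q_len) within the keys,
    # so its relative distance to key j is (i + k_len - q_len) - j.
    i = torch.arange(q_len).unsqueeze(1) + (k_len - q_len)    # [q_len, 1]
    j = torch.arange(k_len).unsqueeze(0)                      # [1, k_len]
    r = rel_emb[i - j + rel_emb.shape[0] // 2]                # [q_len, k_len, d]

    # Position-based terms: (q_i + v) . r_{i-j}
    position = torch.einsum('id,ijd->ij', q + v, r)

    return (content + position) / d ** 0.5
```

The annotated implementation linked below additionally handles multiple attention heads, masking, and a more efficient computation of the position-dependent term.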
Annotated implementation of relative multi-headed attention is in [`relative_mha.py`](https://nn.labml.ai/transformers/xl/relative_mha.html).

Here's [the training code](https://nn.labml.ai/transformers/xl/experiment.html) and a notebook for training a Transformer XL model on the Tiny Shakespeare dataset.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb)
## Compression operation
The paper introduces multiple choices for the compression function $f_c$, and we have only implemented
1D convolution, which seems to give the best results.
Each layer has a separate compression operation $f_c^{(i)}$, where
$i$ is the layer number.
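As a sketch of what that compression could look like (the module and argument names are assumptions, not necessarily the repository's code), a strided 1D convolution maps every $c$ consecutive memory vectors to a single compressed memory vector:

```python
import torch
from torch import nn

class Conv1dCompression(nn.Module):
    """Sketch of a per-layer compression function f_c using 1D convolution."""

    def __init__(self, d_model: int, compression_rate: int):
        super().__init__()
        # Kernel size and stride equal to the compression rate c, so every
        # c consecutive memories become one compressed memory.
        self.conv = nn.Conv1d(d_model, d_model,
                              kernel_size=compression_rate,
                              stride=compression_rate)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: [mem_len, batch, d_model] -> [batch, d_model, mem_len] for Conv1d
        x = mem.permute(1, 2, 0)
        x = self.conv(x)
        # Back to [compressed_len, batch, d_model]
        return x.permute(2, 0, 1)
```

Each layer would get its own instance of such a module, i.e. its own $f_c^{(i)}$.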
## Training compression operation
Since training compression with BPTT requires maintaining
a very large computational graph (many time steps), the paper proposes
an *auto-encoding loss* and an *attention reconstruction loss*.
The auto-encoding loss decodes the original memories from the compressed memories
and calculates the reconstruction loss between them.
The attention reconstruction loss computes the multi-headed attention results
on the compressed memory and on the uncompressed memory and takes the mean squared error
between them.
We have implemented the latter here since it gives better results.
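A sketch of that attention reconstruction loss, assuming `attn` is the layer's multi-headed attention module and `compress` is its compression function $f_c^{(i)}$ (names and shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

def attention_reconstruction_loss(attn, h, old_mem, compress):
    """
    h:        [seq_len, batch, d_model]  hidden states used as queries
    old_mem:  [mem_len, batch, d_model]  memories about to be compressed
    compress: the layer's compression function f_c
    """
    # Detach so this loss does not back-propagate into the main network
    # activations; only the compression function should learn from it.
    h = h.detach()
    old_mem = old_mem.detach()
    new_mem = compress(old_mem)

    # Attention over the original (uncompressed) memories ...
    target = attn(query=h, key=old_mem, value=old_mem)
    # ... and over the compressed memories.
    reconstructed = attn(query=h, key=new_mem, value=new_mem)

    # Mean squared error between the two attention outputs.
    return F.mse_loss(reconstructed, target)
```

In the paper the attention parameters are also frozen for this loss, so effectively only the compression function is trained by it.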
This implementation uses pre-layer normalization,
while the paper uses post-layer normalization.
Pre-layer norm does the layer norm before the [FFN](../feedforward.html) and
self-attention, and the pass-through in the residual connection is not normalized.
This is supposed to be more stable in standard transformer setups.
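For comparison, a minimal sketch of a pre-layer-norm block (the module and argument names are illustrative, not the repository's classes):

```python
import torch
from torch import nn

class PreNormBlock(nn.Module):
    """Pre-layer-norm transformer block: normalize the input to each sub-layer,
    but leave the residual pass-through un-normalized."""

    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn = attn                     # any self-attention module: [seq, batch, d] -> [seq, batch, d]
        self.ffn = ffn                       # position-wise feed-forward network
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Layer norm is applied *before* self-attention ...
        x = x + self.attn(self.norm_attn(x))
        # ... and before the FFN; the residual `x` itself is never normalized.
        x = x + self.ffn(self.norm_ffn(x))
        return x
```

Post-layer norm, as used in the paper, would instead compute `x = norm(x + sublayer(x))` for each sub-layer.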
Here's [the training code](https://nn.labml.ai/transformers/compressive/experiment.html) and a notebook for training a compressive transformer
model on the Tiny Shakespeare dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/compressive/experiment.ipynb)