Commit 5442dfb1 authored by Varuna Jayasiri

compressive transformer links

Parent 969df719
...@@ -88,6 +88,7 @@ implementations.</p>
<li><a href="transformers/xl/relative_mha.html">Relative multi-headed attention</a></li>
</ul>
</li>
<li><a href="transformers/compressive/index.html">Compressive Transformer</a></li>
<li><a href="transformers/gpt/index.html">GPT Architecture</a></li> <li><a href="transformers/gpt/index.html">GPT Architecture</a></li>
<li><a href="transformers/glu_variants/simple.html">GLU Variants</a></li> <li><a href="transformers/glu_variants/simple.html">GLU Variants</a></li>
<li><a href="transformers/knn/index.html">kNN-LM: Generalization through Memorization</a></li> <li><a href="transformers/knn/index.html">kNN-LM: Generalization through Memorization</a></li>
......
...@@ -84,6 +84,10 @@ and derivatives and enhancements of it.</p>
<h2><a href="xl/index.html">Transformer XL</a></h2>
<p>This implements Transformer XL model using
<a href="xl/relative_mha.html">relative multi-head attention</a></p>
<h2><a href="compressive/index.html">Compressive Transformer</a></h2>
<p>This is an implementation of the Compressive Transformer,
which extends <a href="xl/index.html">Transformer XL</a> by compressing
the oldest memories to give a longer attention span.</p>
<h2><a href="gpt">GPT Architecture</a></h2> <h2><a href="gpt">GPT Architecture</a></h2>
<p>This is an implementation of GPT-2 architecture.</p> <p>This is an implementation of GPT-2 architecture.</p>
<h2><a href="glu_variants/simple.html">GLU Variants</a></h2> <h2><a href="glu_variants/simple.html">GLU Variants</a></h2>
...@@ -102,10 +106,10 @@ Our implementation only has a few million parameters and doesn&rsquo;t do model
It does single GPU training but we implement the concept of switching as described in the paper.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">52</span><span></span><span class="kn">from</span> <span class="nn">.configs</span> <span class="kn">import</span> <span class="n">TransformerConfigs</span> <div class="highlight"><pre><span class="lineno">57</span><span></span><span class="kn">from</span> <span class="nn">.configs</span> <span class="kn">import</span> <span class="n">TransformerConfigs</span>
<span class="lineno">53</span><span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">TransformerLayer</span><span class="p">,</span> <span class="n">Encoder</span><span class="p">,</span> <span class="n">Decoder</span><span class="p">,</span> <span class="n">Generator</span><span class="p">,</span> <span class="n">EncoderDecoder</span> <span class="lineno">58</span><span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">TransformerLayer</span><span class="p">,</span> <span class="n">Encoder</span><span class="p">,</span> <span class="n">Decoder</span><span class="p">,</span> <span class="n">Generator</span><span class="p">,</span> <span class="n">EncoderDecoder</span>
<span class="lineno">54</span><span class="kn">from</span> <span class="nn">.mha</span> <span class="kn">import</span> <span class="n">MultiHeadAttention</span> <span class="lineno">59</span><span class="kn">from</span> <span class="nn">.mha</span> <span class="kn">import</span> <span class="n">MultiHeadAttention</span>
<span class="lineno">55</span><span class="kn">from</span> <span class="nn">labml_nn.transformers.xl.relative_mha</span> <span class="kn">import</span> <span class="n">RelativeMultiHeadAttention</span></pre></div> <span class="lineno">60</span><span class="kn">from</span> <span class="nn">labml_nn.transformers.xl.relative_mha</span> <span class="kn">import</span> <span class="n">RelativeMultiHeadAttention</span></pre></div>
</div>
</div>
</div>
......
...@@ -87,8 +87,8 @@ If we use fixed-positional encodings these pre-calculated embeddings will have
the same positions as the current context.
They introduce relative positional encoding, where the positional encodings
are introduced at the attention calculation.</p>
-<p>Annotated implementation of relative multi-headed attention is in <a href="relative_mha.html"><code>relative_mha.py</code></a>.</p>
-<p>Here&rsquo;s <a href="experiment.html">the training code</a> and a notebook for training a transformer XL model on Tiny Shakespeare dataset.</p>
+<p>Annotated implementation of relative multi-headed attention is in <a href="https://nn.labml.ai/transformers/xl/relative_mha.html"><code>relative_mha.py</code></a>.</p>
+<p>Here&rsquo;s <a href="https://nn.labml.ai/transformers/xl/experiment.html">the training code</a> and a notebook for training a transformer XL model on Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a> <p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
<a href="https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p> <a href="https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p>
</div> </div>
......
...@@ -19,6 +19,7 @@ implementations.
* [Transformer building blocks](transformers/models.html)
* [Transformer XL](transformers/xl/index.html)
  * [Relative multi-headed attention](transformers/xl/relative_mha.html)
* [Compressive Transformer](transformers/compressive/index.html)
* [GPT Architecture](transformers/gpt/index.html)
* [GLU Variants](transformers/glu_variants/simple.html)
* [kNN-LM: Generalization through Memorization](transformers/knn/index.html)
......
...@@ -21,6 +21,12 @@ and derivatives and enhancements of it.
This implements Transformer XL model using
[relative multi-head attention](xl/relative_mha.html)
## [Compressive Transformer](compressive/index.html)
This is an implementation of the Compressive Transformer,
which extends [Transformer XL](xl/index.html) by compressing
the oldest memories to give a longer attention span.
## [GPT Architecture](gpt)
This is an implementation of GPT-2 architecture.
...@@ -30,7 +36,6 @@ This is an implementation of GPT-2 architecture.
This is an implementation of the paper
[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
## [kNN-LM](knn)
This is an implementation of the paper
......
# [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)
This is an implementation of
[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
in [PyTorch](https://pytorch.org).
It is an extension of [Transformer XL](https://nn.labml.ai/transformers/xl/index.html) where past memories
are compressed to give a longer attention range.
That is, the furthest $n_{cm} c$ memories are compressed into
$n_{cm}$ memories, where $c$ is the compression rate.
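To make that concrete, here is a minimal sketch of the memory bookkeeping (the names, shapes, and the `compress` callable are illustrative assumptions, not the repository's exact code): new hidden states are appended to the uncompressed memory, and anything older than the most recent `n_mem` states is compressed and moved into the compressed memory.

```python
import torch

def update_memories(mem, c_mem, new_hidden, n_mem, n_c_mem, compress):
    # Tensors are assumed to have shape `[seq_len, batch, d_model]`.
    # Append the newest hidden states to the uncompressed memory.
    mem = torch.cat((mem, new_hidden), dim=0)
    if mem.shape[0] > n_mem:
        # Oldest states that no longer fit in the uncompressed memory;
        # their count should be a multiple of the compression rate `c`.
        old = mem[:-n_mem]
        mem = mem[-n_mem:]
        # Compress them (e.g. with the 1D convolution sketched below) and
        # append to the compressed memory, keeping only the most recent
        # `n_c_mem` compressed memories.
        c_mem = torch.cat((c_mem, compress(old)), dim=0)[-n_c_mem:]
    return mem, c_mem
```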
## Compression operation
The compression operation is defined as
$f_c: \mathbb{R}^{nc \times d} \rightarrow \mathbb{R}^{n \times d}$.
The paper introduces multiple choices for $f_c$; we have only implemented
1D convolution, which seems to give the best results.
Each layer has a separate compression operation $f_c^{(i)}$ where
$i$ is the layer number.
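A minimal sketch of such a per-layer compression function, assuming memories are laid out as `[seq_len, batch, d_model]` (this mirrors the idea rather than reproducing the exact module in the repository):

```python
import torch
from torch import nn

class Conv1dCompression(nn.Module):
    """Compress `n * c` memories to `n` with a strided 1D convolution."""

    def __init__(self, compression_rate: int, d_model: int):
        super().__init__()
        # Kernel size and stride equal to the compression rate `c`,
        # so every `c` consecutive memories map to one compressed memory.
        self.conv = nn.Conv1d(d_model, d_model,
                              kernel_size=compression_rate,
                              stride=compression_rate)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # `mem` is `[seq_len, batch, d_model]`; Conv1d expects `[batch, channels, seq_len]`
        mem = mem.permute(1, 2, 0)
        c_mem = self.conv(mem)
        # Back to `[seq_len / c, batch, d_model]`
        return c_mem.permute(2, 0, 1)
```

One such module would be instantiated per layer, corresponding to the per-layer $f_c^{(i)}$ above.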
## Training compression operation
Since training compression with BPTT requires maintaining
a very large computational graph (many time steps), the paper proposes
an *auto-encoding loss* and an *attention reconstruction loss*.
The auto-encoding loss decodes the original memories from the compressed memories
and calculates the loss.
The attention reconstruction loss computes multi-headed attention
over the compressed memory and over the uncompressed memory,
and takes the mean squared error between the two results.
We have implemented the latter here since it gives better results.
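A simplified sketch of that loss for a single layer, assuming `attn` is a multi-head attention module that takes `query`, `key`, and `value` (the real implementation also freezes/detaches the attention parameters so that this loss only trains the compression function):

```python
import torch
from torch import nn
import torch.nn.functional as F

def attention_reconstruction_loss(attn: nn.Module,
                                  h: torch.Tensor,
                                  old_mem: torch.Tensor,
                                  compress: nn.Module) -> torch.Tensor:
    # Attention from the current hidden states over the uncompressed
    # (old) memories; treated as the target, so no gradients flow here.
    target = attn(query=h, key=old_mem, value=old_mem).detach()
    # Attention over the compressed version of the same memories.
    c_mem = compress(old_mem)
    reconstruction = attn(query=h, key=c_mem, value=c_mem)
    # Mean squared error between the two attention outputs.
    return F.mse_loss(reconstruction, target)
```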
This implementation uses pre-layer normalization
while the paper uses post-layer normalization.
Pre-layer norm applies the layer norm before the [FFN](../feedforward.html) and
self-attention, and the pass-through in the residual connection is not normalized.
This is supposed to be more stable in standard transformer setups.
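A generic sketch of the difference (not the repository's exact layer): in pre-layer norm the sub-layer sees normalized inputs while the residual pass-through stays un-normalized, whereas post-layer norm normalizes the sum.

```python
import torch
from torch import nn

class PreNormResidual(nn.Module):
    """Pre-LN: normalize before the sub-layer; the residual is not normalized."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm([d_model])
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))


class PostNormResidual(nn.Module):
    """Post-LN (as in the paper): normalize after adding the residual."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm([d_model])
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))
```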
Here are [the training code](https://nn.labml.ai/transformers/compressive/experiment.html) and a notebook for training a compressive transformer
model on the Tiny Shakespeare dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/compressive/experiment.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=0d9b5338726c11ebb7c80242ac1c0002)
...@@ -16,9 +16,9 @@ the same positions as the current context.
They introduce relative positional encoding, where the positional encodings
are introduced at the attention calculation.
-Annotated implementation of relative multi-headed attention is in [`relative_mha.py`](relative_mha.html).
-Here's [the training code](experiment.html) and a notebook for training a transformer XL model on Tiny Shakespeare dataset.
+Annotated implementation of relative multi-headed attention is in [`relative_mha.py`](https://nn.labml.ai/transformers/xl/relative_mha.html).
+Here's [the training code](https://nn.labml.ai/transformers/xl/experiment.html) and a notebook for training a transformer XL model on Tiny Shakespeare dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002)
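For reference, the attention score with relative positional encodings decomposes as in the Transformer-XL paper (their notation: $E_{x_i}$ are token embeddings, $R_{i-j}$ is the relative positional encoding, and $u$, $v$ are learned biases):

$$
A^{\mathrm{rel}}_{i,j} =
  \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{\text{content-content}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{\text{content-position}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{\text{global content bias}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{\text{global position bias}}
$$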
...@@ -25,6 +25,7 @@ implementations almost weekly.
* [Transformer building blocks](https://nn.labml.ai/transformers/models.html)
* [Transformer XL](https://nn.labml.ai/transformers/xl/index.html)
  * [Relative multi-headed attention](https://nn.labml.ai/transformers/xl/relative_mha.html)
* [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)
* [GPT Architecture](https://nn.labml.ai/transformers/gpt/index.html)
* [GLU Variants](https://nn.labml.ai/transformers/glu_variants/simple.html)
* [kNN-LM: Generalization through Memorization](https://nn.labml.ai/transformers/knn)
......