Commit 5442dfb1 authored by Varuna Jayasiri

compressive transformer links

Parent 969df719
...@@ -88,6 +88,7 @@ implementations.</p>
<li><a href="transformers/xl/relative_mha.html">Relative multi-headed attention</a></li>
</ul>
</li>
<li><a href="transformers/compressive/index.html">Compressive Transformer</a></li>
<li><a href="transformers/gpt/index.html">GPT Architecture</a></li> <li><a href="transformers/gpt/index.html">GPT Architecture</a></li>
<li><a href="transformers/glu_variants/simple.html">GLU Variants</a></li> <li><a href="transformers/glu_variants/simple.html">GLU Variants</a></li>
<li><a href="transformers/knn/index.html">kNN-LM: Generalization through Memorization</a></li> <li><a href="transformers/knn/index.html">kNN-LM: Generalization through Memorization</a></li>
......
...@@ -84,6 +84,10 @@ and derivatives and enhancements of it.</p>
<h2><a href="xl/index.html">Transformer XL</a></h2>
<p>This implements Transformer XL model using
<a href="xl/relative_mha.html">relative multi-head attention</a></p>
<h2><a href="compressive/index.html">Compressive Transformer</a></h2>
<p>This is an implementation of the Compressive Transformer,
which extends <a href="xl/index.html">Transformer XL</a> by compressing
the oldest memories to give a longer attention span.</p>
<h2><a href="gpt">GPT Architecture</a></h2> <h2><a href="gpt">GPT Architecture</a></h2>
<p>This is an implementation of GPT-2 architecture.</p> <p>This is an implementation of GPT-2 architecture.</p>
<h2><a href="glu_variants/simple.html">GLU Variants</a></h2> <h2><a href="glu_variants/simple.html">GLU Variants</a></h2>
...@@ -102,10 +106,10 @@ Our implementation only has a few million parameters and doesn&rsquo;t do model
It does single GPU training but we implement the concept of switching as described in the paper.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">52</span><span></span><span class="kn">from</span> <span class="nn">.configs</span> <span class="kn">import</span> <span class="n">TransformerConfigs</span> <div class="highlight"><pre><span class="lineno">57</span><span></span><span class="kn">from</span> <span class="nn">.configs</span> <span class="kn">import</span> <span class="n">TransformerConfigs</span>
<span class="lineno">53</span><span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">TransformerLayer</span><span class="p">,</span> <span class="n">Encoder</span><span class="p">,</span> <span class="n">Decoder</span><span class="p">,</span> <span class="n">Generator</span><span class="p">,</span> <span class="n">EncoderDecoder</span> <span class="lineno">58</span><span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">TransformerLayer</span><span class="p">,</span> <span class="n">Encoder</span><span class="p">,</span> <span class="n">Decoder</span><span class="p">,</span> <span class="n">Generator</span><span class="p">,</span> <span class="n">EncoderDecoder</span>
<span class="lineno">54</span><span class="kn">from</span> <span class="nn">.mha</span> <span class="kn">import</span> <span class="n">MultiHeadAttention</span> <span class="lineno">59</span><span class="kn">from</span> <span class="nn">.mha</span> <span class="kn">import</span> <span class="n">MultiHeadAttention</span>
<span class="lineno">55</span><span class="kn">from</span> <span class="nn">labml_nn.transformers.xl.relative_mha</span> <span class="kn">import</span> <span class="n">RelativeMultiHeadAttention</span></pre></div> <span class="lineno">60</span><span class="kn">from</span> <span class="nn">labml_nn.transformers.xl.relative_mha</span> <span class="kn">import</span> <span class="n">RelativeMultiHeadAttention</span></pre></div>
</div>
</div>
</div>
......
...@@ -87,8 +87,8 @@ If we use fixed-positional encodings these pre-calculated embeddings will have
the same positions as the current context.
They introduce relative positional encoding, where the positional encodings
are introduced at the attention calculation.</p>
-<p>Annotated implementation of relative multi-headed attention is in <a href="relative_mha.html"><code>relative_mha.py</code></a>.</p>
-<p>Here&rsquo;s <a href="experiment.html">the training code</a> and a notebook for training a transformer XL model on Tiny Shakespeare dataset.</p>
+<p>Annotated implementation of relative multi-headed attention is in <a href="https://nn.labml.ai/transformers/xl/relative_mha.html"><code>relative_mha.py</code></a>.</p>
+<p>Here&rsquo;s <a href="https://nn.labml.ai/transformers/xl/experiment.html">the training code</a> and a notebook for training a transformer XL model on Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a> <p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
<a href="https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p> <a href="https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p>
</div> </div>
......
...@@ -19,6 +19,7 @@ implementations.
* [Transformer building blocks](transformers/models.html)
* [Transformer XL](transformers/xl/index.html)
  * [Relative multi-headed attention](transformers/xl/relative_mha.html)
* [Compressive Transformer](transformers/compressive/index.html)
* [GPT Architecture](transformers/gpt/index.html)
* [GLU Variants](transformers/glu_variants/simple.html)
* [kNN-LM: Generalization through Memorization](transformers/knn/index.html)
......
...@@ -21,6 +21,12 @@ and derivatives and enhancements of it.
This implements Transformer XL model using
[relative multi-head attention](xl/relative_mha.html)
## [Compressive Transformer](compressive/index.html)
This is an implementation of the Compressive Transformer,
which extends [Transformer XL](xl/index.html) by compressing
the oldest memories to give a longer attention span.
## [GPT Architecture](gpt)
This is an implementation of GPT-2 architecture.
...@@ -30,7 +36,6 @@ This is an implementation of GPT-2 architecture.
This is an implementation of the paper
[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
## [kNN-LM](knn)
This is an implementation of the paper
......
# [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)
This is an implementation of
[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
in [PyTorch](https://pytorch.org).
It is an extension of [Transformer XL](https://nn.labml.ai/transformers/xl/index.html) where past memories
are compressed to give a longer attention range.
That is, the furthest $n_{cm} c$ memories are compressed into
$n_{cm}$ memories, where $c$ is the compression rate.
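To make that concrete, here is a minimal sketch of the memory bookkeeping (the names, shapes, and the `compress` callable are illustrative assumptions, not the repository's exact code): new hidden states are appended to the uncompressed memory, and anything older than the most recent `n_mem` states is compressed and moved into the compressed memory.

```python
import torch

def update_memories(mem, c_mem, new_hidden, n_mem, n_c_mem, compress):
    # Tensors are assumed to have shape `[seq_len, batch, d_model]`.
    # Append the newest hidden states to the uncompressed memory.
    mem = torch.cat((mem, new_hidden), dim=0)
    if mem.shape[0] > n_mem:
        # Oldest states that no longer fit in the uncompressed memory;
        # their count should be a multiple of the compression rate `c`.
        old = mem[:-n_mem]
        mem = mem[-n_mem:]
        # Compress them (e.g. with the 1D convolution sketched below) and
        # append to the compressed memory, keeping only the most recent
        # `n_c_mem` compressed memories.
        c_mem = torch.cat((c_mem, compress(old)), dim=0)[-n_c_mem:]
    return mem, c_mem
```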
## Compression operation
The compression operation is defined as
$f_c: \mathbb{R}^{nc \times d} \rightarrow \mathbb{R}^{n \times d}$.
The paper introduces multiple choices for $f_c$; we have only implemented
1D convolution, which seems to give the best results.
Each layer has a separate compression operation $f_c^{(i)}$ where
$i$ is the layer number.
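A minimal sketch of such a per-layer compression function, assuming memories are laid out as `[seq_len, batch, d_model]` (this mirrors the idea rather than reproducing the exact module in the repository):

```python
import torch
from torch import nn

class Conv1dCompression(nn.Module):
    """Compress `n * c` memories to `n` with a strided 1D convolution."""

    def __init__(self, compression_rate: int, d_model: int):
        super().__init__()
        # Kernel size and stride equal to the compression rate `c`,
        # so every `c` consecutive memories map to one compressed memory.
        self.conv = nn.Conv1d(d_model, d_model,
                              kernel_size=compression_rate,
                              stride=compression_rate)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # `mem` is `[seq_len, batch, d_model]`; Conv1d expects `[batch, channels, seq_len]`
        mem = mem.permute(1, 2, 0)
        c_mem = self.conv(mem)
        # Back to `[seq_len / c, batch, d_model]`
        return c_mem.permute(2, 0, 1)
```

One such module would be instantiated per layer, corresponding to the per-layer $f_c^{(i)}$ above.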
## Training compression operation
Since training compression with BPTT requires maintaining
a very large computational graph (many time steps), the paper proposes
an *auto-encoding loss* and an *attention reconstruction loss*.
The auto-encoding loss decodes the original memories from the compressed memories
and calculates the loss.
The attention reconstruction loss computes multi-headed attention
over the compressed memory and over the uncompressed memory,
and takes the mean squared error between the two results.
We have implemented the latter here since it gives better results.
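A simplified sketch of that loss for a single layer, assuming `attn` is a multi-head attention module that takes `query`, `key`, and `value` (the real implementation also freezes/detaches the attention parameters so that this loss only trains the compression function):

```python
import torch
from torch import nn
import torch.nn.functional as F

def attention_reconstruction_loss(attn: nn.Module,
                                  h: torch.Tensor,
                                  old_mem: torch.Tensor,
                                  compress: nn.Module) -> torch.Tensor:
    # Attention from the current hidden states over the uncompressed
    # (old) memories; treated as the target, so no gradients flow here.
    target = attn(query=h, key=old_mem, value=old_mem).detach()
    # Attention over the compressed version of the same memories.
    c_mem = compress(old_mem)
    reconstruction = attn(query=h, key=c_mem, value=c_mem)
    # Mean squared error between the two attention outputs.
    return F.mse_loss(reconstruction, target)
```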
This implementation uses pre-layer normalization
while the paper uses post-layer normalization.
Pre-layer norm applies the layer norm before the [FFN](../feedforward.html) and
self-attention, and the pass-through in the residual connection is not normalized.
This is supposed to be more stable in standard transformer setups.
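A generic sketch of the difference (not the repository's exact layer): in pre-layer norm the sub-layer sees normalized inputs while the residual pass-through stays un-normalized, whereas post-layer norm normalizes the sum.

```python
import torch
from torch import nn

class PreNormResidual(nn.Module):
    """Pre-LN: normalize before the sub-layer; the residual is not normalized."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm([d_model])
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))


class PostNormResidual(nn.Module):
    """Post-LN (as in the paper): normalize after adding the residual."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm([d_model])
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))
```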
Here are [the training code](https://nn.labml.ai/transformers/compressive/experiment.html) and a notebook for training a compressive transformer
model on the Tiny Shakespeare dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/compressive/experiment.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=0d9b5338726c11ebb7c80242ac1c0002)
...@@ -16,9 +16,9 @@ the same positions as the current context.
They introduce relative positional encoding, where the positional encodings
are introduced at the attention calculation.
-Annotated implementation of relative multi-headed attention is in [`relative_mha.py`](relative_mha.html).
-Here's [the training code](experiment.html) and a notebook for training a transformer XL model on Tiny Shakespeare dataset.
+Annotated implementation of relative multi-headed attention is in [`relative_mha.py`](https://nn.labml.ai/transformers/xl/relative_mha.html).
+Here's [the training code](https://nn.labml.ai/transformers/xl/experiment.html) and a notebook for training a transformer XL model on Tiny Shakespeare dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002)
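For reference, the attention score with relative positional encodings decomposes as in the Transformer-XL paper (their notation: $E_{x_i}$ are token embeddings, $R_{i-j}$ is the relative positional encoding, and $u$, $v$ are learned biases):

$$
A^{\mathrm{rel}}_{i,j} =
  \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{\text{content-content}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{\text{content-position}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{\text{global content bias}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{\text{global position bias}}
$$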
...@@ -25,6 +25,7 @@ implementations almost weekly.
* [Transformer building blocks](https://nn.labml.ai/transformers/models.html)
* [Transformer XL](https://nn.labml.ai/transformers/xl/index.html)
  * [Relative multi-headed attention](https://nn.labml.ai/transformers/xl/relative_mha.html)
* [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)
* [GPT Architecture](https://nn.labml.ai/transformers/gpt/index.html)
* [GLU Variants](https://nn.labml.ai/transformers/glu_variants/simple.html)
* [kNN-LM: Generalization through Memorization](https://nn.labml.ai/transformers/knn)
......