Commit 3089e913 authored by: V Varuna Jayasiri

📚 layer norm docs

Parent 5388e807
......@@ -125,6 +125,7 @@ and
<h4><a href="https://nn.labml.ai/normalization/index.html">Normalization Layers</a></h4>
<ul>
<li><a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a></li>
<li><a href="https://nn.labml.ai/normalization/layer_norm/index.html">Layer Normalization</a></li>
</ul>
<h3>Installation</h3>
<pre><code class="bash">pip install labml_nn
......
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<meta name="description" content=""/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:image:src" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta name="twitter:title" content="Batch Normalization"/>
<meta name="twitter:description" content=""/>
<meta name="twitter:site" content="@labmlai"/>
<meta name="twitter:creator" content="@labmlai"/>
<meta property="og:url" content="https://nn.labml.ai/normalization/batch_norm/readme.html"/>
<meta property="og:title" content="Batch Normalization"/>
<meta property="og:image" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta property="og:site_name" content="LabML Neural Networks"/>
<meta property="og:type" content="object"/>
<meta property="og:title" content="Batch Normalization"/>
<meta property="og:description" content=""/>
<title>Batch Normalization</title>
<link rel="shortcut icon" href="/icon.png"/>
<link rel="stylesheet" href="../../pylit.css">
<link rel="canonical" href="https://nn.labml.ai/normalization/batch_norm/readme.html"/>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-4V3HC8HBLH"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-4V3HC8HBLH');
</script>
</head>
<body>
<div id='container'>
<div id="background"></div>
<div class='section'>
<div class='docs'>
<p>
<a class="parent" href="/">home</a>
<a class="parent" href="../index.html">normalization</a>
<a class="parent" href="index.html">batch_norm</a>
</p>
<p>
<a href="https://github.com/lab-ml/labml_nn/tree/master/labml_nn/normalization/batch_norm/readme.md">
<img alt="Github"
src="https://img.shields.io/github/stars/lab-ml/nn?style=social"
style="max-width:100%;"/></a>
<a href="https://join.slack.com/t/labforml/shared_invite/zt-egj9zvq9-Dl3hhZqobexgT7aVKnD14g/"
rel="nofollow">
<img alt="Join Slact"
src="https://img.shields.io/badge/slack-chat-green.svg?logo=slack"
style="max-width:100%;"/></a>
<a href="https://twitter.com/labmlai"
rel="nofollow">
<img alt="Twitter"
src="https://img.shields.io/twitter/follow/labmlai?style=social"
style="max-width:100%;"/></a>
</p>
</div>
</div>
<div class='section' id='section-0'>
<div class='docs'>
<div class='section-link'>
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of Batch Normalization from paper
<a href="https://arxiv.org/abs/1502.03167">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a>.</p>
<h3>Internal Covariate Shift</h3>
<p>The paper defines <em>Internal Covariate Shift</em> as the change in the
distribution of network activations due to the change in
network parameters during training.
For example, let&rsquo;s say there are two layers $l_1$ and $l_2$.
At the beginning of training, the outputs of $l_1$ (inputs to $l_2$)
could follow the distribution $\mathcal{N}(0.5, 1)$.
Then, after some training steps, they could move to a different distribution such as $\mathcal{N}(0.6, 1.5)$.
This is <em>internal covariate shift</em>.</p>
<p>Internal covariate shift will adversely affect training speed because the later layers
($l_2$ in the above example) have to adapt to this shifted distribution.</p>
<p>By stabilizing the distribution, batch normalization minimizes the internal covariate shift.</p>
<h2>Normalization</h2>
<p>It is known that whitening improves training speed and convergence.
<em>Whitening</em> is linearly transforming inputs to have zero mean, unit variance,
and be uncorrelated.</p>
<h3>Normalizing outside gradient computation doesn&rsquo;t work</h3>
<p>Normalizing outside the gradient computation using pre-computed (detached)
means and variances doesn&rsquo;t work. For instance (ignoring the variance), let
<script type="math/tex; mode=display">\hat{x} = x - \mathbb{E}[x]</script>
where $x = u + b$, $b$ is a trained bias,
and $\mathbb{E}[x]$ is computed outside the gradient computation (a pre-computed constant).</p>
<p>Note that updating $b$ has no effect on $\hat{x}$, because the change in $b$ is subtracted out again once $\mathbb{E}[x]$ is recomputed.
Therefore,
$b$ will increase or decrease based on
$\frac{\partial{\mathcal{L}}}{\partial x}$,
and keep growing indefinitely with each training update while the loss stays the same.
The paper notes that similar explosions happen with variances.</p>
<h3>Batch Normalization</h3>
<p>Whitening is computationally expensive because you need to de-correlate the features,
and the gradients must flow through the full whitening calculation.</p>
<p>The paper introduces a simplified version, which it calls <em>Batch Normalization</em>.
The first simplification is to normalize each feature independently to have
zero mean and unit variance:
<script type="math/tex; mode=display">\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}</script>
where $x = (x^{(1)} &hellip; x^{(d)})$ is the $d$-dimensional input.</p>
<p>The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
and variance $Var[x^{(k)}]$ from the mini-batch
for normalization, instead of calculating the mean and variance across the whole dataset.</p>
<p>Normalizing each feature to zero mean and unit variance could affect what the layer
can represent.
As an example, the paper illustrates that if the inputs to a sigmoid are normalized,
most of them will be within the $[-1, 1]$ range, where the sigmoid is approximately linear.
To overcome this, each feature is scaled and shifted by two trained parameters
$\gamma^{(k)}$ and $\beta^{(k)}$.
<script type="math/tex; mode=display">y^{(k)} =\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}</script>
where $y^{(k)}$ is the output of the batch normalization layer.</p>
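<p>As a minimal sketch (not the implementation in this repository; the function and tensor names are illustrative), the training-time computation for a mini-batch of shape <code>[batch_size, features]</code> puts these two steps together as follows:</p>
<pre><code class="python">import torch

def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    # x: [batch_size, features]; gamma, beta: [features]
    mean = x.mean(dim=0)                        # E[x^(k)] over the mini-batch
    var = x.var(dim=0, unbiased=False)          # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)  # normalize each feature independently
    return gamma * x_hat + beta                 # scale and shift
</code></pre>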
<p>Note that when applying batch normalization after a linear transform
like $Wu + b$, the bias parameter $b$ gets cancelled by the normalization.
So you can and should omit the bias parameter in linear transforms right before
batch normalization.</p>
<p>Batch normalization also makes backpropagation invariant to the scale of the weights.
Empirically, it also improves generalization, so it has a regularization effect as well.</p>
<h2>Inference</h2>
<p>We need to know $\mathbb{E}[x^{(k)}]$ and $Var[x^{(k)}]$ in order to
perform the normalization.
So during inference, you either need to go through the whole (or part of) dataset
and find the mean and variance, or you can use an estimate calculated during training.
The usual practice is to calculate an exponential moving average of
mean and variance during the training phase and use that for inference.</p>
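<p>A minimal sketch of that practice, assuming a momentum-style exponential moving average (the momentum value and names are illustrative, not this repository&rsquo;s API):</p>
<pre><code class="python">import torch

def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    # Exponential moving average of the mini-batch statistics, updated every training step
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var

def batch_norm_eval(x, running_mean, running_var, gamma, beta, eps=1e-5):
    # At inference, normalize with the stored running estimates instead of batch statistics
    x_hat = (x - running_mean) / torch.sqrt(running_var + eps)
    return gamma * x_hat + beta
</code></pre>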
<p>Here&rsquo;s <a href="https://nn.labml.ai/normalization/batch_norm/mnist.html">the training code</a> and a notebook for training
a CNN classifier that uses batch normalization for the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
<a href="https://web.lab-ml.com/run?uuid=011254fe647011ebbb8e0242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p>
</div>
<div class='code'>
</div>
</div>
</div>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-AMS_HTML">
</script>
<!-- MathJax configuration -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true,
processEnvironments: true
},
// Center justify equations in code and markdown cells. Elsewhere
// we use CSS to left justify single line equations in code cells.
displayAlign: 'center',
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
</body>
</html>
\ No newline at end of file
......@@ -74,10 +74,10 @@
<h1>Normalization Layers</h1>
<ul>
<li><a href="batch_norm/index.html">Batch Normalization</a></li>
<li><a href="layer_norm/index.html">Layer Normalization</a></li>
</ul>
<p><em>TODO</em></p>
<ul>
<li>Layer Normalization</li>
<li>Instance Normalization</li>
<li>Group Normalization</li>
</ul>
......
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<meta name="description" content=""/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:image:src" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta name="twitter:title" content="Layer Normalization"/>
<meta name="twitter:description" content=""/>
<meta name="twitter:site" content="@labmlai"/>
<meta name="twitter:creator" content="@labmlai"/>
<meta property="og:url" content="https://nn.labml.ai/normalization/layer_norm/readme.html"/>
<meta property="og:title" content="Layer Normalization"/>
<meta property="og:image" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta property="og:site_name" content="LabML Neural Networks"/>
<meta property="og:type" content="object"/>
<meta property="og:title" content="Layer Normalization"/>
<meta property="og:description" content=""/>
<title>Layer Normalization</title>
<link rel="shortcut icon" href="/icon.png"/>
<link rel="stylesheet" href="../../pylit.css">
<link rel="canonical" href="https://nn.labml.ai/normalization/layer_norm/readme.html"/>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-4V3HC8HBLH"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-4V3HC8HBLH');
</script>
</head>
<body>
<div id='container'>
<div id="background"></div>
<div class='section'>
<div class='docs'>
<p>
<a class="parent" href="/">home</a>
<a class="parent" href="../index.html">normalization</a>
<a class="parent" href="index.html">layer_norm</a>
</p>
<p>
<a href="https://github.com/lab-ml/labml_nn/tree/master/labml_nn/normalization/layer_norm/readme.md">
<img alt="Github"
src="https://img.shields.io/github/stars/lab-ml/nn?style=social"
style="max-width:100%;"/></a>
<a href="https://join.slack.com/t/labforml/shared_invite/zt-egj9zvq9-Dl3hhZqobexgT7aVKnD14g/"
rel="nofollow">
<img alt="Join Slact"
src="https://img.shields.io/badge/slack-chat-green.svg?logo=slack"
style="max-width:100%;"/></a>
<a href="https://twitter.com/labmlai"
rel="nofollow">
<img alt="Twitter"
src="https://img.shields.io/twitter/follow/labmlai?style=social"
style="max-width:100%;"/></a>
</p>
</div>
</div>
<div class='section' id='section-0'>
<div class='docs'>
<div class='section-link'>
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/normalization/layer_norm/index.html">Layer Normalization</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of
<a href="https://arxiv.org/abs/1607.06450">Layer Normalization</a>.</p>
<h3>Limitations of <a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a></h3>
<ul>
<li>You need to maintain running means.</li>
<li>Tricky for RNNs. Do you need different normalizations for each step?</li>
<li>Doesn&rsquo;t work well with small batch sizes;
large NLP models are usually trained with small batch sizes.</li>
<li>Need to compute means and variances across devices in distributed training</li>
</ul>
<h2>Layer Normalization</h2>
<p>Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
<em>Note that batch normalization fixes the zero mean and unit variance for each feature, computed across the batch.</em>
Layer normalization does it for each sample, across all the features.</p>
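<p>The difference is only in the dimensions over which the statistics are computed. A minimal sketch for a batch of embeddings of shape <code>[batch_size, features]</code> (illustrative only, not the implementation in this repository):</p>
<pre><code class="python">import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # Statistics are computed per sample, across the features
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # gamma, beta: [features]
    return gamma * x_hat + beta

# Batch normalization would instead use x.mean(dim=0): per feature, across the batch.
</code></pre>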
<p>Layer normalization is generally used for NLP tasks.</p>
<p>We have used layer normalization in most of the
<a href="https://nn.labml.ai/transformers/gpt/index.html">transformer implementations</a>.</p>
</div>
<div class='code'>
</div>
</div>
</div>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-AMS_HTML">
</script>
<!-- MathJax configuration -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true,
processEnvironments: true
},
// Center justify equations in code and markdown cells. Elsewhere
// we use CSS to left justify single line equations in code cells.
displayAlign: 'center',
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
</body>
</html>
\ No newline at end of file
......@@ -50,7 +50,7 @@
<url>
<loc>https://nn.labml.ai/activations/swish.html</loc>
<lastmod>2021-01-25T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -83,6 +83,13 @@
</url>
<url>
<loc>https://nn.labml.ai/normalization/layer_norm/index.html</loc>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/normalization/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
......@@ -183,7 +190,7 @@
<url>
<loc>https://nn.labml.ai/optimizers/mnist_experiment.html</loc>
<lastmod>2020-12-10T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -225,7 +232,7 @@
<url>
<loc>https://nn.labml.ai/transformers/knn/train_model.html</loc>
<lastmod>2021-01-25T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -253,7 +260,7 @@
<url>
<loc>https://nn.labml.ai/transformers/models.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -267,14 +274,14 @@
<url>
<loc>https://nn.labml.ai/transformers/gpt/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/feed_forward.html</loc>
<lastmod>2021-01-30T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -295,7 +302,7 @@
<url>
<loc>https://nn.labml.ai/transformers/feedback/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -309,7 +316,7 @@
<url>
<loc>https://nn.labml.ai/transformers/feedback/experiment.html</loc>
<lastmod>2021-01-29T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -330,14 +337,14 @@
<url>
<loc>https://nn.labml.ai/transformers/glu_variants/experiment.html</loc>
<lastmod>2021-01-26T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/glu_variants/simple.html</loc>
<lastmod>2021-01-26T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -358,7 +365,7 @@
<url>
<loc>https://nn.labml.ai/transformers/switch/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -372,28 +379,28 @@
<url>
<loc>https://nn.labml.ai/transformers/switch/experiment.html</loc>
<lastmod>2021-01-25T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/positional_encoding.html</loc>
<lastmod>2021-01-07T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/label_smoothing_loss.html</loc>
<lastmod>2020-12-10T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/mha.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......
......@@ -60,6 +60,7 @@ and
#### ✨ [Normalization Layers](https://nn.labml.ai/normalization/index.html)
* [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
* [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html)
### Installation
......
......@@ -8,10 +8,10 @@ summary: >
# Normalization Layers
* [Batch Normalization](batch_norm/index.html)
* [Layer Normalization](layer_norm/index.html)
*TODO*
* Layer Normalization
* Instance Normalization
* Group Normalization
"""
\ No newline at end of file
......@@ -109,18 +109,21 @@ class BatchNorm(Module):
When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations,
where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{BN}(X) = \gamma
\frac{X - \underset{B, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{B, H, W}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{B \times C}$ is a batch of vector embeddings,
When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings,
where $B$ is the batch size and $C$ is the number of features.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{BN}(X) = \gamma
\frac{X - \underset{B}{\mathbb{E}}[X]}{\sqrt{\underset{B}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{B \times C \times L}$ is a batch of sequence embeddings,
When input $X \in \mathbb{R}^{B \times C \times L}$ is a batch of a sequence of embeddings,
where $B$ is the batch size, $C$ is the number of features, and $L$ is the length of the sequence.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{BN}(X) = \gamma
\frac{X - \underset{B, L}{\mathbb{E}}[X]}{\sqrt{\underset{B, L}{Var}[X] + \epsilon}}
+ \beta$$
......@@ -205,6 +208,9 @@ class BatchNorm(Module):
def _test():
"""
Simple test
"""
from labml.logger import inspect
x = torch.zeros([2, 3, 2, 4])
......@@ -216,5 +222,6 @@ def _test():
inspect(bn.exp_var.shape)
#
if __name__ == '__main__':
_test()
# [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
This is a [PyTorch](https://pytorch.org) implementation of Batch Normalization from paper
[Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167).
### Internal Covariate Shift
The paper defines *Internal Covariate Shift* as the change in the
distribution of network activations due to the change in
network parameters during training.
For example, let's say there are two layers $l_1$ and $l_2$.
At the beginning of training, the outputs of $l_1$ (inputs to $l_2$)
could follow the distribution $\mathcal{N}(0.5, 1)$.
Then, after some training steps, they could move to a different distribution such as $\mathcal{N}(0.6, 1.5)$.
This is *internal covariate shift*.
Internal covariate shift will adversely affect training speed because the later layers
($l_2$ in the above example) have to adapt to this shifted distribution.
By stabilizing the distribution, batch normalization minimizes the internal covariate shift.
## Normalization
It is known that whitening improves training speed and convergence.
*Whitening* is linearly transforming inputs to have zero mean, unit variance,
and be uncorrelated.
### Normalizing outside gradient computation doesn't work
Normalizing outside the gradient computation using pre-computed (detached)
means and variances doesn't work. For instance (ignoring the variance), let
$$\hat{x} = x - \mathbb{E}[x]$$
where $x = u + b$, $b$ is a trained bias,
and $\mathbb{E}[x]$ is computed outside the gradient computation (a pre-computed constant).
Note that updating $b$ has no effect on $\hat{x}$, because the change in $b$ is subtracted out again once $\mathbb{E}[x]$ is recomputed.
Therefore,
$b$ will increase or decrease based on
$\frac{\partial{\mathcal{L}}}{\partial x}$,
and keep growing indefinitely with each training update while the loss stays the same.
The paper notes that similar explosions happen with variances.
### Batch Normalization
Whitening is computationally expensive because you need to de-correlate the features,
and the gradients must flow through the full whitening calculation.
The paper introduces a simplified version, which it calls *Batch Normalization*.
The first simplification is to normalize each feature independently to have
zero mean and unit variance:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
where $x = (x^{(1)} ... x^{(d)})$ is the $d$-dimensional input.
The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
and variance $Var[x^{(k)}]$ from the mini-batch
for normalization, instead of calculating the mean and variance across the whole dataset.
Normalizing each feature to zero mean and unit variance could affect what the layer
can represent.
As an example, the paper illustrates that if the inputs to a sigmoid are normalized,
most of them will be within the $[-1, 1]$ range, where the sigmoid is approximately linear.
To overcome this, each feature is scaled and shifted by two trained parameters
$\gamma^{(k)}$ and $\beta^{(k)}$.
$$y^{(k)} =\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$
where $y^{(k)}$ is the output of the batch normalization layer.
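A minimal sketch of this training-time computation, for a mini-batch of shape `[batch_size, features]` (illustrative only; the names are not from this repository):

```python
import torch

def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    mean = x.mean(dim=0)                        # E[x^(k)] over the mini-batch
    var = x.var(dim=0, unbiased=False)          # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)  # normalize each feature independently
    return gamma * x_hat + beta                 # scale and shift; gamma, beta: [features]
```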
Note that when applying batch normalization after a linear transform
like $Wu + b$, the bias parameter $b$ gets cancelled by the normalization.
So you can and should omit the bias parameter in linear transforms right before
batch normalization.
Batch normalization also makes backpropagation invariant to the scale of the weights.
Empirically, it also improves generalization, so it has a regularization effect as well.
## Inference
We need to know $\mathbb{E}[x^{(k)}]$ and $Var[x^{(k)}]$ in order to
perform the normalization.
So during inference, you either need to go through the whole (or part of) dataset
and find the mean and variance, or you can use an estimate calculated during training.
The usual practice is to calculate an exponential moving average of
mean and variance during the training phase and use that for inference.
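A sketch of such a running estimate, assuming a momentum-style update (the momentum value and names are illustrative):

```python
def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    # Exponential moving average of the mini-batch statistics; used at inference time
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var
```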
Here's [the training code](https://nn.labml.ai/normalization/batch_norm/mnist.html) and a notebook for training
a CNN classifier that uses batch normalization for the MNIST dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=011254fe647011ebbb8e0242ac1c0002)
......@@ -24,7 +24,7 @@ Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
*Note that batch normalization, fixes the zero mean and unit variance for each vector.
*Note that batch normalization fixes the zero mean and unit variance for each element.*
Layer normalization does it for each batch across all elements.
Layer normalization is generally used for NLP tasks.
......@@ -41,18 +41,42 @@ from labml_helpers.module import Module
class LayerNorm(Module):
"""
r"""
## Layer Normalization
Layer normalization $\text{LN}$ normalizes the input $X$ as follows:
When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings,
where $B$ is the batch size and $C$ is the number of features.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{LN}(X) = \gamma
\frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{L \times B \times C}$ is a batch of a sequence of embeddings,
where $B$ is the batch size, $C$ is the number of channels, $L$ is the length of the sequence.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{LN}(X) = \gamma
\frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations,
where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width.
This is not a widely used scenario.
$\gamma \in \mathbb{R}^{C \times H \times W}$ and $\beta \in \mathbb{R}^{C \times H \times W}$.
$$\text{LN}(X) = \gamma
\frac{X - \underset{C, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{C, H, W}{Var}[X] + \epsilon}}
+ \beta$$
"""
def __init__(self, normalized_shape: Union[int, List[int], Size], *,
eps: float = 1e-5,
elementwise_affine: bool = True):
"""
* `normalized_shape` $S$ is shape of the elements (except the batch).
* `normalized_shape` $S$ is the shape of the elements (except the batch).
The input should then be
$X \in \mathbb{R}^{* \times S[0] \times S[1] \times ... \times S[n]}$
* `eps` is $\epsilon$, used in $\sqrt{Var[X}] + \epsilon}$ for numerical stability
* `eps` is $\epsilon$, used in $\sqrt{Var[X] + \epsilon}$ for numerical stability
* `elementwise_affine` is whether to scale and shift the normalized value
We've tried to use the same names for arguments as PyTorch `LayerNorm` implementation.
......@@ -74,34 +98,35 @@ class LayerNorm(Module):
For example, in an NLP task this will be
`[seq_len, batch_size, features]`
"""
# Keep the original shape
x_shape = x.shape
# Sanity check to make sure the shapes match
assert self.normalized_shape == x.shape[-len(self.normalized_shape):]
# Reshape into `[M, S[0], S[1], ..., S[n]]`
x = x.view(-1, *self.normalized_shape)
# The dimensions to calculate the mean and variance on
dims = [-(i + 1) for i in range(len(self.normalized_shape))]
# Calculate the mean across first dimension;
# i.e. the means for each element $\mathbb{E}[X}]$
mean = x.mean(dim=0)
# Calculate the squared mean across first dimension;
# Calculate the mean of all elements;
# i.e. the means for each element $\mathbb{E}[X]$
mean = x.mean(dim=dims, keepdims=True)
# Calculate the squared mean of all elements;
# i.e. the means for each element $\mathbb{E}[X^2]$
mean_x2 = (x ** 2).mean(dim=0)
# Variance for each element $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
mean_x2 = (x ** 2).mean(dim=dims, keepdims=True)
# Variance of all elements $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
var = mean_x2 - mean ** 2
# Normalize $$\hat{X} = \frac{X} - \mathbb{E}[X]}{\sqrt{Var[X] + \epsilon}}$$
# Normalize $$\hat{X} = \frac{X - \mathbb{E}[X]}{\sqrt{Var[X] + \epsilon}}$$
x_norm = (x - mean) / torch.sqrt(var + self.eps)
# Scale and shift $$\text{LN}(x) = \gamma \hat{X} + \beta$$
if self.elementwise_affine:
x_norm = self.gain * x_norm + self.bias
# Reshape to original and return
return x_norm.view(x_shape)
#
return x_norm
def _test():
"""
Simple test
"""
from labml.logger import inspect
x = torch.zeros([2, 3, 2, 4])
......@@ -113,5 +138,6 @@ def _test():
inspect(ln.gain.shape)
#
if __name__ == '__main__':
_test()
# [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html)
This is a [PyTorch](https://pytorch.org) implementation of
[Layer Normalization](https://arxiv.org/abs/1607.06450).
### Limitations of [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
* You need to maintain running means.
* Tricky for RNNs. Do you need different normalizations for each step?
* Doesn't work well with small batch sizes;
large NLP models are usually trained with small batch sizes.
* Need to compute means and variances across devices in distributed training
## Layer Normalization
Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
*Note that batch normalization fixes the zero mean and unit variance for each feature, computed across the batch.*
Layer normalization does it for each sample, across all the features.
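A minimal sketch of the difference, for a batch of embeddings of shape `[batch_size, features]` (illustrative only, not this repository's implementation):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # Statistics are computed per sample, across the features
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

# Batch normalization would instead use x.mean(dim=0): per feature, across the batch.
```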
Layer normalization is generally used for NLP tasks.
We have used layer normalization in most of the
[transformer implementations](https://nn.labml.ai/transformers/gpt/index.html).
\ No newline at end of file
......@@ -66,6 +66,7 @@ and
#### ✨ [Normalization Layers](https://nn.labml.ai/normalization/index.html)
* [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
* [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html)
### Installation
......