Commit 3089e913 authored by: V Varuna Jayasiri

📚 layer norm docs

Parent 5388e807
......@@ -125,6 +125,7 @@ and
<h4><a href="https://nn.labml.ai/normalization/index.html">Normalization Layers</a></h4>
<ul>
<li><a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a></li>
<li><a href="https://nn.labml.ai/normalization/layer_norm/index.html">Layer Normalization</a></li>
</ul>
<h3>Installation</h3>
<pre><code class="bash">pip install labml_nn
......
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<meta name="description" content=""/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:image:src" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta name="twitter:title" content="Batch Normalization"/>
<meta name="twitter:description" content=""/>
<meta name="twitter:site" content="@labmlai"/>
<meta name="twitter:creator" content="@labmlai"/>
<meta property="og:url" content="https://nn.labml.ai/normalization/batch_norm/readme.html"/>
<meta property="og:title" content="Batch Normalization"/>
<meta property="og:image" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta property="og:site_name" content="LabML Neural Networks"/>
<meta property="og:type" content="object"/>
<meta property="og:title" content="Batch Normalization"/>
<meta property="og:description" content=""/>
<title>Batch Normalization</title>
<link rel="shortcut icon" href="/icon.png"/>
<link rel="stylesheet" href="../../pylit.css">
<link rel="canonical" href="https://nn.labml.ai/normalization/batch_norm/readme.html"/>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-4V3HC8HBLH"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-4V3HC8HBLH');
</script>
</head>
<body>
<div id='container'>
<div id="background"></div>
<div class='section'>
<div class='docs'>
<p>
<a class="parent" href="/">home</a>
<a class="parent" href="../index.html">normalization</a>
<a class="parent" href="index.html">batch_norm</a>
</p>
<p>
<a href="https://github.com/lab-ml/labml_nn/tree/master/labml_nn/normalization/batch_norm/readme.md">
<img alt="Github"
src="https://img.shields.io/github/stars/lab-ml/nn?style=social"
style="max-width:100%;"/></a>
<a href="https://join.slack.com/t/labforml/shared_invite/zt-egj9zvq9-Dl3hhZqobexgT7aVKnD14g/"
rel="nofollow">
<img alt="Join Slact"
src="https://img.shields.io/badge/slack-chat-green.svg?logo=slack"
style="max-width:100%;"/></a>
<a href="https://twitter.com/labmlai"
rel="nofollow">
<img alt="Twitter"
src="https://img.shields.io/twitter/follow/labmlai?style=social"
style="max-width:100%;"/></a>
</p>
</div>
</div>
<div class='section' id='section-0'>
<div class='docs'>
<div class='section-link'>
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of Batch Normalization from paper
<a href="https://arxiv.org/abs/1502.03167">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a>.</p>
<h3>Internal Covariate Shift</h3>
<p>The paper defines <em>Internal Covariate Shift</em> as the change in the
distribution of network activations due to the change in
network parameters during training.
For example, let&rsquo;s say there are two layers $l_1$ and $l_2$.
At the beginning of training, the outputs of $l_1$ (inputs to $l_2$)
could follow the distribution $\mathcal{N}(0.5, 1)$.
Then, after some training steps, they could move to a different distribution such as $\mathcal{N}(0.6, 1.5)$.
This is <em>internal covariate shift</em>.</p>
<p>Internal covariate shift will adversely affect training speed because the later layers
($l_2$ in the above example) have to adapt to this shifted distribution.</p>
<p>By stabilizing the distribution, batch normalization minimizes the internal covariate shift.</p>
<h2>Normalization</h2>
<p>It is known that whitening improves training speed and convergence.
<em>Whitening</em> is linearly transforming inputs to have zero mean, unit variance,
and be uncorrelated.</p>
<h3>Normalizing outside gradient computation doesn&rsquo;t work</h3>
<p>Normalizing outside the gradient computation using pre-computed (detached)
means and variances doesn&rsquo;t work. For instance (ignoring the variance), let
<script type="math/tex; mode=display">\hat{x} = x - \mathbb{E}[x]</script>
where $x = u + b$, $b$ is a trained bias,
and $\mathbb{E}[x]$ is computed outside the gradient computation (a pre-computed constant).</p>
<p>Note that updating $b$ has no effect on $\hat{x}$, because the change in $b$ is subtracted out again once $\mathbb{E}[x]$ is recomputed.
Therefore,
$b$ will increase or decrease based on
$\frac{\partial{\mathcal{L}}}{\partial x}$,
and keep growing indefinitely with each training update while the loss stays the same.
The paper notes that similar explosions happen with variances.</p>
<h3>Batch Normalization</h3>
<p>Whitening is computationally expensive because you need to de-correlate the features,
and the gradients must flow through the full whitening calculation.</p>
<p>The paper introduces a simplified version, which it calls <em>Batch Normalization</em>.
The first simplification is to normalize each feature independently to have
zero mean and unit variance:
<script type="math/tex; mode=display">\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}</script>
where $x = (x^{(1)} &hellip; x^{(d)})$ is the $d$-dimensional input.</p>
<p>The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
and variance $Var[x^{(k)}]$ from the mini-batch
for normalization, instead of calculating the mean and variance across the whole dataset.</p>
<p>Normalizing each feature to zero mean and unit variance could affect what the layer
can represent.
As an example, the paper illustrates that if the inputs to a sigmoid are normalized,
most of them will be within the $[-1, 1]$ range, where the sigmoid is approximately linear.
To overcome this, each feature is scaled and shifted by two trained parameters
$\gamma^{(k)}$ and $\beta^{(k)}$.
<script type="math/tex; mode=display">y^{(k)} =\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}</script>
where $y^{(k)}$ is the output of the batch normalization layer.</p>
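<p>As a minimal sketch (not the implementation in this repository; the function and tensor names are illustrative), the training-time computation for a mini-batch of shape <code>[batch_size, features]</code> puts these two steps together as follows:</p>
<pre><code class="python">import torch

def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    # x: [batch_size, features]; gamma, beta: [features]
    mean = x.mean(dim=0)                        # E[x^(k)] over the mini-batch
    var = x.var(dim=0, unbiased=False)          # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)  # normalize each feature independently
    return gamma * x_hat + beta                 # scale and shift
</code></pre>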
<p>Note that when applying batch normalization after a linear transform
like $Wu + b$, the bias parameter $b$ gets cancelled by the normalization.
So you can and should omit the bias parameter in linear transforms right before
batch normalization.</p>
<p>Batch normalization also makes backpropagation invariant to the scale of the weights.
Empirically, it also improves generalization, so it has a regularization effect as well.</p>
<h2>Inference</h2>
<p>We need to know $\mathbb{E}[x^{(k)}]$ and $Var[x^{(k)}]$ in order to
perform the normalization.
So during inference, you either need to go through the whole (or part of) dataset
and find the mean and variance, or you can use an estimate calculated during training.
The usual practice is to calculate an exponential moving average of
mean and variance during the training phase and use that for inference.</p>
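<p>A minimal sketch of that practice, assuming a momentum-style exponential moving average (the momentum value and names are illustrative, not this repository&rsquo;s API):</p>
<pre><code class="python">import torch

def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    # Exponential moving average of the mini-batch statistics, updated every training step
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var

def batch_norm_eval(x, running_mean, running_var, gamma, beta, eps=1e-5):
    # At inference, normalize with the stored running estimates instead of batch statistics
    x_hat = (x - running_mean) / torch.sqrt(running_var + eps)
    return gamma * x_hat + beta
</code></pre>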
<p>Here&rsquo;s <a href="https://nn.labml.ai/normalization/batch_norm/mnist.html">the training code</a> and a notebook for training
a CNN classifier that uses batch normalization for the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
<a href="https://web.lab-ml.com/run?uuid=011254fe647011ebbb8e0242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p>
</div>
<div class='code'>
</div>
</div>
</div>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-AMS_HTML">
</script>
<!-- MathJax configuration -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true,
processEnvironments: true
},
// Center justify equations in code and markdown cells. Elsewhere
// we use CSS to left justify single line equations in code cells.
displayAlign: 'center',
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
</body>
</html>
\ No newline at end of file
......@@ -74,10 +74,10 @@
<h1>Normalization Layers</h1>
<ul>
<li><a href="batch_norm/index.html">Batch Normalization</a></li>
<li><a href="layer_norm/index.html">Layer Normalization</a></li>
</ul>
<p><em>TODO</em></p>
<ul>
<li>Layer Normalization</li>
<li>Instance Normalization</li>
<li>Group Normalization</li>
</ul>
......
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<meta name="description" content=""/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:image:src" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta name="twitter:title" content="Layer Normalization"/>
<meta name="twitter:description" content=""/>
<meta name="twitter:site" content="@labmlai"/>
<meta name="twitter:creator" content="@labmlai"/>
<meta property="og:url" content="https://nn.labml.ai/normalization/layer_norm/readme.html"/>
<meta property="og:title" content="Layer Normalization"/>
<meta property="og:image" content="https://avatars1.githubusercontent.com/u/64068543?s=400&amp;v=4"/>
<meta property="og:site_name" content="LabML Neural Networks"/>
<meta property="og:type" content="object"/>
<meta property="og:title" content="Layer Normalization"/>
<meta property="og:description" content=""/>
<title>Layer Normalization</title>
<link rel="shortcut icon" href="/icon.png"/>
<link rel="stylesheet" href="../../pylit.css">
<link rel="canonical" href="https://nn.labml.ai/normalization/layer_norm/readme.html"/>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-4V3HC8HBLH"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-4V3HC8HBLH');
</script>
</head>
<body>
<div id='container'>
<div id="background"></div>
<div class='section'>
<div class='docs'>
<p>
<a class="parent" href="/">home</a>
<a class="parent" href="../index.html">normalization</a>
<a class="parent" href="index.html">layer_norm</a>
</p>
<p>
<a href="https://github.com/lab-ml/labml_nn/tree/master/labml_nn/normalization/layer_norm/readme.md">
<img alt="Github"
src="https://img.shields.io/github/stars/lab-ml/nn?style=social"
style="max-width:100%;"/></a>
<a href="https://join.slack.com/t/labforml/shared_invite/zt-egj9zvq9-Dl3hhZqobexgT7aVKnD14g/"
rel="nofollow">
<img alt="Join Slact"
src="https://img.shields.io/badge/slack-chat-green.svg?logo=slack"
style="max-width:100%;"/></a>
<a href="https://twitter.com/labmlai"
rel="nofollow">
<img alt="Twitter"
src="https://img.shields.io/twitter/follow/labmlai?style=social"
style="max-width:100%;"/></a>
</p>
</div>
</div>
<div class='section' id='section-0'>
<div class='docs'>
<div class='section-link'>
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/normalization/layer_norm/index.html">Layer Normalization</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of
<a href="https://arxiv.org/abs/1607.06450">Layer Normalization</a>.</p>
<h3>Limitations of <a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a></h3>
<ul>
<li>You need to maintain running means.</li>
<li>Tricky for RNNs. Do you need different normalizations for each step?</li>
<li>Doesn&rsquo;t work well with small batch sizes;
large NLP models are usually trained with small batch sizes.</li>
<li>Need to compute means and variances across devices in distributed training</li>
</ul>
<h2>Layer Normalization</h2>
<p>Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
<em>Note that batch normalization fixes the zero mean and unit variance for each feature, computed across the batch.</em>
Layer normalization does it for each sample, across all the features.</p>
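<p>The difference is only in the dimensions over which the statistics are computed. A minimal sketch for a batch of embeddings of shape <code>[batch_size, features]</code> (illustrative only, not the implementation in this repository):</p>
<pre><code class="python">import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # Statistics are computed per sample, across the features
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # gamma, beta: [features]
    return gamma * x_hat + beta

# Batch normalization would instead use x.mean(dim=0): per feature, across the batch.
</code></pre>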
<p>Layer normalization is generally used for NLP tasks.</p>
<p>We have used layer normalization in most of the
<a href="https://nn.labml.ai/transformers/gpt/index.html">transformer implementations</a>.</p>
</div>
<div class='code'>
</div>
</div>
</div>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-AMS_HTML">
</script>
<!-- MathJax configuration -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true,
processEnvironments: true
},
// Center justify equations in code and markdown cells. Elsewhere
// we use CSS to left justify single line equations in code cells.
displayAlign: 'center',
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
</body>
</html>
\ No newline at end of file
......@@ -50,7 +50,7 @@
<url>
<loc>https://nn.labml.ai/activations/swish.html</loc>
<lastmod>2021-01-25T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -83,6 +83,13 @@
</url>
<url>
<loc>https://nn.labml.ai/normalization/layer_norm/index.html</loc>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/normalization/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
......@@ -183,7 +190,7 @@
<url>
<loc>https://nn.labml.ai/optimizers/mnist_experiment.html</loc>
<lastmod>2020-12-10T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -225,7 +232,7 @@
<url>
<loc>https://nn.labml.ai/transformers/knn/train_model.html</loc>
<lastmod>2021-01-25T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -253,7 +260,7 @@
<url>
<loc>https://nn.labml.ai/transformers/models.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -267,14 +274,14 @@
<url>
<loc>https://nn.labml.ai/transformers/gpt/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/feed_forward.html</loc>
<lastmod>2021-01-30T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -295,7 +302,7 @@
<url>
<loc>https://nn.labml.ai/transformers/feedback/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -309,7 +316,7 @@
<url>
<loc>https://nn.labml.ai/transformers/feedback/experiment.html</loc>
<lastmod>2021-01-29T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -330,14 +337,14 @@
<url>
<loc>https://nn.labml.ai/transformers/glu_variants/experiment.html</loc>
<lastmod>2021-01-26T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/glu_variants/simple.html</loc>
<lastmod>2021-01-26T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -358,7 +365,7 @@
<url>
<loc>https://nn.labml.ai/transformers/switch/index.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......@@ -372,28 +379,28 @@
<url>
<loc>https://nn.labml.ai/transformers/switch/experiment.html</loc>
<lastmod>2021-01-25T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/positional_encoding.html</loc>
<lastmod>2021-01-07T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/label_smoothing_loss.html</loc>
<lastmod>2020-12-10T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/mha.html</loc>
<lastmod>2021-02-01T16:30:00+00:00</lastmod>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......
......@@ -60,6 +60,7 @@ and
#### ✨ [Normalization Layers](https://nn.labml.ai/normalization/index.html)
* [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
* [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html)
### Installation
......
......@@ -8,10 +8,10 @@ summary: >
# Normalization Layers
* [Batch Normalization](batch_norm/index.html)
* [Layer Normalization](layer_norm/index.html)
*TODO*
* Layer Normalization
* Instance Normalization
* Group Normalization
"""
\ No newline at end of file
......@@ -109,18 +109,21 @@ class BatchNorm(Module):
When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations,
where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{BN}(X) = \gamma
\frac{X - \underset{B, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{B, H, W}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{B \times C}$ is a batch of vector embeddings,
When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings,
where $B$ is the batch size and $C$ is the number of features.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{BN}(X) = \gamma
\frac{X - \underset{B}{\mathbb{E}}[X]}{\sqrt{\underset{B}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{B \times C \times L}$ is a batch of sequence embeddings,
When input $X \in \mathbb{R}^{B \times C \times L}$ is a batch of a sequence of embeddings,
where $B$ is the batch size, $C$ is the number of features, and $L$ is the length of the sequence.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{BN}(X) = \gamma
\frac{X - \underset{B, L}{\mathbb{E}}[X]}{\sqrt{\underset{B, L}{Var}[X] + \epsilon}}
+ \beta$$
......@@ -205,6 +208,9 @@ class BatchNorm(Module):
def _test():
"""
Simple test
"""
from labml.logger import inspect
x = torch.zeros([2, 3, 2, 4])
......@@ -216,5 +222,6 @@ def _test():
inspect(bn.exp_var.shape)
#
if __name__ == '__main__':
_test()
# [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
This is a [PyTorch](https://pytorch.org) implementation of Batch Normalization from paper
[Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167).
### Internal Covariate Shift
The paper defines *Internal Covariate Shift* as the change in the
distribution of network activations due to the change in
network parameters during training.
For example, let's say there are two layers $l_1$ and $l_2$.
At the beginning of training, the outputs of $l_1$ (inputs to $l_2$)
could follow the distribution $\mathcal{N}(0.5, 1)$.
Then, after some training steps, they could move to a different distribution such as $\mathcal{N}(0.6, 1.5)$.
This is *internal covariate shift*.
Internal covariate shift will adversely affect training speed because the later layers
($l_2$ in the above example) have to adapt to this shifted distribution.
By stabilizing the distribution, batch normalization minimizes the internal covariate shift.
## Normalization
It is known that whitening improves training speed and convergence.
*Whitening* is linearly transforming inputs to have zero mean, unit variance,
and be uncorrelated.
### Normalizing outside gradient computation doesn't work
Normalizing outside the gradient computation using pre-computed (detached)
means and variances doesn't work. For instance (ignoring the variance), let
$$\hat{x} = x - \mathbb{E}[x]$$
where $x = u + b$, $b$ is a trained bias,
and $\mathbb{E}[x]$ is computed outside the gradient computation (a pre-computed constant).
Note that updating $b$ has no effect on $\hat{x}$, because the change in $b$ is subtracted out again once $\mathbb{E}[x]$ is recomputed.
Therefore,
$b$ will increase or decrease based on
$\frac{\partial{\mathcal{L}}}{\partial x}$,
and keep growing indefinitely with each training update while the loss stays the same.
The paper notes that similar explosions happen with variances.
### Batch Normalization
Whitening is computationally expensive because you need to de-correlate the features,
and the gradients must flow through the full whitening calculation.
The paper introduces a simplified version, which it calls *Batch Normalization*.
The first simplification is to normalize each feature independently to have
zero mean and unit variance:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
where $x = (x^{(1)} ... x^{(d)})$ is the $d$-dimensional input.
The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
and variance $Var[x^{(k)}]$ from the mini-batch
for normalization, instead of calculating the mean and variance across the whole dataset.
Normalizing each feature to zero mean and unit variance could affect what the layer
can represent.
As an example, the paper illustrates that if the inputs to a sigmoid are normalized,
most of them will be within the $[-1, 1]$ range, where the sigmoid is approximately linear.
To overcome this, each feature is scaled and shifted by two trained parameters
$\gamma^{(k)}$ and $\beta^{(k)}$.
$$y^{(k)} =\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$
where $y^{(k)}$ is the output of the batch normalization layer.
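A minimal sketch of this training-time computation, for a mini-batch of shape `[batch_size, features]` (illustrative only; the names are not from this repository):

```python
import torch

def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    mean = x.mean(dim=0)                        # E[x^(k)] over the mini-batch
    var = x.var(dim=0, unbiased=False)          # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)  # normalize each feature independently
    return gamma * x_hat + beta                 # scale and shift; gamma, beta: [features]
```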
Note that when applying batch normalization after a linear transform
like $Wu + b$, the bias parameter $b$ gets cancelled by the normalization.
So you can and should omit the bias parameter in linear transforms right before
batch normalization.
Batch normalization also makes backpropagation invariant to the scale of the weights.
Empirically, it also improves generalization, so it has a regularization effect as well.
## Inference
We need to know $\mathbb{E}[x^{(k)}]$ and $Var[x^{(k)}]$ in order to
perform the normalization.
So during inference, you either need to go through the whole (or part of) dataset
and find the mean and variance, or you can use an estimate calculated during training.
The usual practice is to calculate an exponential moving average of
mean and variance during the training phase and use that for inference.
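A sketch of such a running estimate, assuming a momentum-style update (the momentum value and names are illustrative):

```python
def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    # Exponential moving average of the mini-batch statistics; used at inference time
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var
```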
Here's [the training code](https://nn.labml.ai/normalization/batch_norm/mnist.html) and a notebook for training
a CNN classifier that uses batch normalization for the MNIST dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=011254fe647011ebbb8e0242ac1c0002)
......@@ -24,7 +24,7 @@ Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
*Note that batch normalization, fixes the zero mean and unit variance for each vector.
*Note that batch normalization fixes the zero mean and unit variance for each element.*
Layer normalization does it for each batch across all elements.
Layer normalization is generally used for NLP tasks.
......@@ -41,18 +41,42 @@ from labml_helpers.module import Module
class LayerNorm(Module):
"""
r"""
## Layer Normalization
Layer normalization $\text{LN}$ normalizes the input $X$ as follows:
When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings,
where $B$ is the batch size and $C$ is the number of features.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{LN}(X) = \gamma
\frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{L \times B \times C}$ is a batch of a sequence of embeddings,
where $B$ is the batch size, $C$ is the number of channels, $L$ is the length of the sequence.
$\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.
$$\text{LN}(X) = \gamma
\frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}}
+ \beta$$
When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations,
where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width.
This is not a widely used scenario.
$\gamma \in \mathbb{R}^{C \times H \times W}$ and $\beta \in \mathbb{R}^{C \times H \times W}$.
$$\text{LN}(X) = \gamma
\frac{X - \underset{C, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{C, H, W}{Var}[X] + \epsilon}}
+ \beta$$
"""
def __init__(self, normalized_shape: Union[int, List[int], Size], *,
eps: float = 1e-5,
elementwise_affine: bool = True):
"""
* `normalized_shape` $S$ is shape of the elements (except the batch).
* `normalized_shape` $S$ is the shape of the elements (except the batch).
The input should then be
$X \in \mathbb{R}^{* \times S[0] \times S[1] \times ... \times S[n]}$
* `eps` is $\epsilon$, used in $\sqrt{Var[X}] + \epsilon}$ for numerical stability
* `eps` is $\epsilon$, used in $\sqrt{Var[X] + \epsilon}$ for numerical stability
* `elementwise_affine` is whether to scale and shift the normalized value
We've tried to use the same names for arguments as PyTorch `LayerNorm` implementation.
......@@ -74,34 +98,35 @@ class LayerNorm(Module):
For example, in an NLP task this will be
`[seq_len, batch_size, features]`
"""
# Keep the original shape
x_shape = x.shape
# Sanity check to make sure the shapes match
assert self.normalized_shape == x.shape[-len(self.normalized_shape):]
# Reshape into `[M, S[0], S[1], ..., S[n]]`
x = x.view(-1, *self.normalized_shape)
# The dimensions to calculate the mean and variance on
dims = [-(i + 1) for i in range(len(self.normalized_shape))]
# Calculate the mean across first dimension;
# i.e. the means for each element $\mathbb{E}[X}]$
mean = x.mean(dim=0)
# Calculate the squared mean across first dimension;
# Calculate the mean of all elements;
# i.e. the means for each element $\mathbb{E}[X]$
mean = x.mean(dim=dims, keepdims=True)
# Calculate the squared mean of all elements;
# i.e. the means for each element $\mathbb{E}[X^2]$
mean_x2 = (x ** 2).mean(dim=0)
# Variance for each element $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
mean_x2 = (x ** 2).mean(dim=dims, keepdims=True)
# Variance of all elements $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
var = mean_x2 - mean ** 2
# Normalize $$\hat{X} = \frac{X} - \mathbb{E}[X]}{\sqrt{Var[X] + \epsilon}}$$
# Normalize $$\hat{X} = \frac{X - \mathbb{E}[X]}{\sqrt{Var[X] + \epsilon}}$$
x_norm = (x - mean) / torch.sqrt(var + self.eps)
# Scale and shift $$\text{LN}(x) = \gamma \hat{X} + \beta$$
if self.elementwise_affine:
x_norm = self.gain * x_norm + self.bias
# Reshape to original and return
return x_norm.view(x_shape)
#
return x_norm
def _test():
"""
Simple test
"""
from labml.logger import inspect
x = torch.zeros([2, 3, 2, 4])
......@@ -113,5 +138,6 @@ def _test():
inspect(ln.gain.shape)
#
if __name__ == '__main__':
_test()
# [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html)
This is a [PyTorch](https://pytorch.org) implementation of
[Layer Normalization](https://arxiv.org/abs/1607.06450).
### Limitations of [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
* You need to maintain running means.
* Tricky for RNNs. Do you need different normalizations for each step?
* Doesn't work well with small batch sizes;
large NLP models are usually trained with small batch sizes.
* Need to compute means and variances across devices in distributed training
## Layer Normalization
Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
*Note that batch normalization fixes the zero mean and unit variance for each feature, computed across the batch.*
Layer normalization does it for each sample, across all the features.
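A minimal sketch of the difference, for a batch of embeddings of shape `[batch_size, features]` (illustrative only, not this repository's implementation):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # Statistics are computed per sample, across the features
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

# Batch normalization would instead use x.mean(dim=0): per feature, across the batch.
```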
Layer normalization is generally used for NLP tasks.
We have used layer normalization in most of the
[transformer implementations](https://nn.labml.ai/transformers/gpt/index.html).
\ No newline at end of file
......@@ -66,6 +66,7 @@ and
#### ✨ [Normalization Layers](https://nn.labml.ai/normalization/index.html)
* [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html)
* [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html)
### Installation
......