diff --git a/README.md b/README.md index d4edd17af2cfae28fb1e21827329609f54a8c004..f2bfe856e01ce47fca4efe74c46d080090e37793 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,5 @@ -## fast.ai 机器学习和深度学习中文笔记 +# fast.ai 机器学习和深度学习中文笔记 +## 贡献指南 + +需要校对。接受 Pull Request。 diff --git a/zh/dl1.md b/zh/dl1.md new file mode 100644 index 0000000000000000000000000000000000000000..f557361a93c74d87f35e05369308d54eafc9d0bd --- /dev/null +++ b/zh/dl1.md @@ -0,0 +1,431 @@ +# 深度学习2:第1部分第1课 + +### [第1课](http://forums.fast.ai/t/wiki-lesson-1/9398/1) + +#### 入门[ [0:00](https://youtu.be/IPBSB1HLNLo) ]: + +* 为了训练神经网络,你肯定需要图形处理单元(GPU) - 特别是NVIDIA GPU,因为它是唯一支持CUDA(几乎所有深度学习图书馆和从业者都使用的语言和框架)的人。 +* 租用GPU有几种方法:Crestle [04:06],Paperspace [ [06:10](https://youtu.be/IPBSB1HLNLo%3Ft%3D6m10s) ] + +#### [Jupyter笔记本和狗与猫的介绍](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1.ipynb) [ [12:39](https://youtu.be/IPBSB1HLNLo%3Ft%3D12m39s) ] + +* 您可以通过选择它并按下`shift+enter`来运行单元格(您可以按住`shift`并多次按`enter`键以继续按下单元格),或者您可以单击顶部的“运行”按钮。 单元格可以包含代码,文本,图片,视频等。 +* Fast.ai需要Python 3 + +``` + %reload_ext autoreload %autoreload 2 %matplotlib inline +``` + +``` + _# This file contains all the main external libs we'll use_ **from** **fastai.imports** **import** * +``` + +``` + **from** **fastai.transforms** **import** * **from** **fastai.conv_learner** **import** * **from** **fastai.model** **import** * **from** **fastai.dataset** **import** * **from** **fastai.sgdr** **import** * **from** **fastai.plots** **import** * +``` + +``` + PATH = "data/dogscats/" sz=224 +``` + +**先看图片[** [**15:39**](https://youtu.be/IPBSB1HLNLo%3Ft%3D15m40s) **]** + +``` + !ls {PATH} +``` + +``` + _models sample test1 tmp train valid_ +``` + +* `!` 告诉使用bash(shell)而不是python +* 如果您不熟悉训练集和验证集,请查看Practical Machine Learning课程(或阅读[Rachel的博客](http://www.fast.ai/2017/11/13/validation-sets/) ) + +``` + !ls {PATH}valid +``` + +``` + _cats dogs_ +``` + +``` + files = !ls {PATH}valid/cats | head files +``` + +``` + _['cat.10016.jpg',_ _'cat.1001.jpg',_ _'cat.10026.jpg',_ _'cat.10048.jpg',_ _'cat.10050.jpg',_ _'cat.10064.jpg',_ _'cat.10071.jpg',_ _'cat.10091.jpg',_ _'cat.10103.jpg',_ _'cat.10104.jpg']_ +``` + +* 此文件夹结构是共享和提供图像分类数据集的最常用方法。 每个文件夹都会告诉您标签(例如`dogs`或`cats` )。 + +``` + img = plt.imread(f' **{PATH}** valid/cats/ **{files[0]}** ') plt.imshow(img); +``` + +![](../img/1_Uqy-JLzpyZedFNdpm15N2A.png) + +* `f'{PATH}valid/cats/{files[0]}'` - 这是一个Python 3.6。 格式化字符串,可以方便地格式化字符串。 + +``` + img.shape +``` + +``` + _(198, 179, 3)_ +``` + +``` + img[:4,:4] +``` + +``` + _array([[[ 29, 20, 23],_ _[ 31, 22, 25],_ _[ 34, 25, 28],_ _[ 37, 28, 31]],_ +``` + +``` + _[[ 60, 51, 54],_ _[ 58, 49, 52],_ _[ 56, 47, 50],_ _[ 55, 46, 49]],_ +``` + +``` + _[[ 93, 84, 87],_ _[ 89, 80, 83],_ _[ 85, 76, 79],_ _[ 81, 72, 75]],_ +``` + +``` + _[[104, 95, 98],_ _[103, 94, 97],_ _[102, 93, 96],_ _[102, 93, 96]]], dtype=uint8)_ +``` + +* `img`是一个三维数组(又名3级张量) +* 这三个项目(例如`[29, 20, 23]` )表示0到255之间的红绿蓝像素值 +* 我们的想法是利用这些数字并用它们来预测这些数字是代表猫还是狗,这是基于查看猫和狗的大量图片。 +* 这个数据集来自[Kaggle竞赛](https://www.kaggle.com/c/dogs-vs-cats) ,当它发布时(早在2013年),最先进的技术准确率为80%。 + +**让我们训练一个模型[** [**20:21**](https://youtu.be/IPBSB1HLNLo%3Ft%3D20m21s) **]** + +以下是训练模型所需的三行代码: + +``` + **data** = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz)) **learn** = ConvLearner.pretrained(resnet34, data, precompute= **True** ) **learn.fit** (0.01, 3) +``` + +``` + _[ 0\. 0.04955 0.02605 0.98975]_ _[ 1\. 0.03977 0.02916 0.99219]_ _[ 2\. 
0.03372 0.02929 0.98975]_ +``` + +* 这将做3个**时期** ,这意味着它将三次查看整个图像集。 +* 输出中的三个数字中的最后一个是验证集上的准确度。 +* 前两个是训练集和验证集的损失函数值(在这种情况下是交叉熵损失)。 +* 开始(例如, `1.` )是纪元号。 +* 我们通过3行代码在17秒内达到了~99%(这将在2013年赢得Kaggle比赛)! [ [21:49](https://youtu.be/IPBSB1HLNLo%3Ft%3D21m49s) ] +* 很多人都认为深度学习需要大量的时间,大量的资源和大量的数据 - 一般来说,这不是真的! + +#### Fast.ai图书馆[ [22:24](https://youtu.be/IPBSB1HLNLo%3Ft%3D22m24s) ] + +* 该库采用了他们可以找到的所有最佳实践和方法 - 每次出现看起来很有趣的论文时,他们会对其进行测试,如果它适用于各种数据集,并且他们可以弄清楚如何调整它,它会得到它在库中实现它。 +* Fast.ai策划所有这些最佳实践并为您打包,并且大部分时间都会找出自动处理事物的最佳方法。 +* Fast.ai位于名为PyTorch的库之上,这是一个非常灵活的深度学习,机器学习和由Facebook编写的GPU计算库。 +* 大多数人比PyTorch更熟悉TensorFlow,但Jeremy现在知道的大多数顶级研究人员已经转向PyTorch。 +* Fast.ai非常灵活,您可以根据需要尽可能多地使用所有这些策划的最佳实践。 在任何时候都很容易挂钩并编写自己的数据扩充,丢失功能,网络架构等,我们将在本课程中学习所有内容。 + +#### 分析结果[ [24:21](https://youtu.be/IPBSB1HLNLo%3Ft%3D24m12s) ] + +这就是验证数据集标签(将其视为正确答案)的样子: + +``` + data.val_y +``` + +``` + _array([0, 0, 0, ..., 1, 1, 1])_ +``` + +这些0和1代表什么? + +``` + data.classes +``` + +``` + _['cats', 'dogs']_ +``` + +* `data`包含验证和培训数据 +* `learn`包含模型 + +让我们对验证集进行预测(预测是以对数比例): + +``` + log_preds = learn.predict() log_preds.shape +``` + +``` + _(2000, 2)_ +``` + +``` + log_preds[:10] +``` + +``` + _array([[ -0.00002, -11.07446],_ _[ -0.00138, -6.58385],_ _[ -0.00083, -7.09025],_ _[ -0.00029, -8.13645],_ _[ -0.00035, -7.9663 ],_ _[ -0.00029, -8.15125],_ _[ -0.00002, -10.82139],_ _[ -0.00003, -10.33846],_ _[ -0.00323, -5.73731],_ _[ -0.0001 , -9.21326]], dtype=float32)_ +``` + +* 输出表示猫的预测和狗的预测 + +``` + preds = np.argmax(log_preds, axis=1) _# from log probabilities to 0 or 1_ probs = np.exp(log_preds[:,1]) _# pr(dog)_ +``` + +* 在PyTorch和Fast.ai中,大多数模型返回预测的对数而不是概率本身(我们将在后面的课程中了解原因)。 现在,只知道要获得概率,你必须做`np.exp()` + +![](../img/1_7upaprK7pvlI1x4SnIl9aQ.png) + +* 确保你熟悉numpy( `np` ) + +``` + _# 1\. A few correct labels at random_ plot_val_with_title(rand_by_correct( **True** ), "Correctly classified") +``` + +* 图像上方的数字是成为狗的概率 + +``` + _# 2\. A few incorrect labels at random_ plot_val_with_title(rand_by_correct( **False** ), "Incorrectly classified") +``` + +![](../img/1_ZLhFRuLXqQmFV2uAok84DA.png) + +``` + plot_val_with_title(most_by_correct(0, **True** ), "Most correct cats") +``` + +![](../img/1_RxYBmvqixwG4BYNPQGAZ4w.png) + +``` + plot_val_with_title(most_by_correct(1, **True** ), "Most correct dogs") +``` + +![](../img/1_kwUuA3gN-xbNBIUjDBHePg.png) + +更有趣的是,这里的模型认为它绝对是一只狗,但结果却是一只猫,反之亦然: + +``` + plot_val_with_title(most_by_correct(0, **False** ), "Most incorrect cats") +``` + +![](../img/1_gvPAqSdB9IRFmhU4DCk-mg.png) + +``` + plot_val_with_title(most_by_correct(1, **False** ), "Most incorrect dogs") +``` + +![](../img/1_jXaTLkWMrvpC8Yz0QfR6LA.png) + +``` + most_uncertain = np.argsort(np.abs(probs -0.5))[:4] plot_val_with_title(most_uncertain, "Most uncertain predictions") +``` + +![](../img/1_wZDDn_XFH-z7libyMUlsBg.png) + +* 为什么看这些图像很重要? Jeremy在构建模型后所做的第一件事就是找到一种可视化其构建方式的方法。 因为如果他想让模型变得更好,那么他需要利用做得好的事情来解决那些做得很糟糕的事情。 +* 在这种情况下,我们已经了解了有关数据集本身的一些信息,即这里有一些可能不应该存在的图像。 但同样清楚的是,这种模式还有改进的余地(例如数据增加 - 我们将在后面介绍)。 +* 现在您已准备好构建自己的图像分类器(对于常规照片 - 可能不是CT扫描)! 
例如, [这](https://towardsdatascience.com/fun-with-small-image-data-sets-8c83d95d0159)是其中一个学生所做的。 +* 查看[此论坛帖子](http://forums.fast.ai/t/understanding-softmax-probabilities-output-on-a-multi-class-classification-problem/8194) ,了解可视化结果的不同方式(例如,当有超过2个类别时等) + +#### 自上而下与自下而上[ [30:52](https://youtu.be/IPBSB1HLNLo%3Ft%3D30m52s) ] + +自下而上:了解您需要的每个构建块,并最终将它们组合在一起 + +* 很难保持动力 +* 很难知道“大局” +* 很难知道你真正需要哪些作品 + +fast.ai:让学生立即使用神经网络,尽快获得结果 + +* 逐渐剥离层,修改,看看引擎盖下 + +#### 课程结构[ [33:53](https://youtu.be/IPBSB1HLNLo%3Ft%3D33m53s) ] + +![](../img/1_xTuKc0FAP9yKZ6fpcymKbA.png) + +1. 深度学习的图像分类器(代码最少) +2. 多标签分类和不同类型的图像(例如卫星图像) +3. 结构化数据(例如销售预测) - 结构化数据来自数据库或电子表格 +4. 语言:NLP分类器(例如电影评论分类) +5. 协同过滤(例如推荐引擎) +6. 生成语言模型:如何从头开始逐字逐句地编写自己的尼采哲学 +7. 回到计算机视觉 - 不只是识别猫照片,而是找到猫在照片中的位置(热图),还学习如何从头开始编写我们自己的架构(ResNet) + +#### 图像分类器示例: + +图像分类算法对很多东西很有用。 + +* 例如,AlphaGo [ [42:20](https://youtu.be/IPBSB1HLNLo%3Ft%3D42m20s) ]看了成千上万的棋盘,每个人都有一个标签,说明棋盘是否最终成为赢球或输球球员。 因此,它学会了一种图像分类,能够看到一块棋盘,并弄清楚它是好还是坏 - 这是玩得好的最重要的一步:知道哪个动作更好。 +* 另一个例子是早期的学生创建[了鼠标移动图像](https://www.splunk.com/blog/2017/04/18/deep-learning-with-splunk-and-tensorflow-for-security-catching-the-fraudster-in-neural-networks-with-behavioral-biometrics.html)和检测到的欺诈性交易[的图像分类器](https://www.splunk.com/blog/2017/04/18/deep-learning-with-splunk-and-tensorflow-for-security-catching-the-fraudster-in-neural-networks-with-behavioral-biometrics.html) 。 + +#### 深度学习≠机器学习[ [44:26](https://youtu.be/IPBSB1HLNLo%3Ft%3D44m26s) ] + +* 深度学习是一种机器学习 +* 机器学习是由Arthur Samuel发明的。 在50年代后期,他通过发明机器学习,得到了一台IBM大型机,可以更好地玩跳棋。 他让大型机多次与自己对抗并弄清楚哪种东西能够取得胜利,并在某种程度上用它来编写自己的程序。 1962年,亚瑟·塞缪尔说,有一天,绝大多数计算机软件将使用这种机器学习方法而不是手工编写。 +* C-Path(计算病理学家)[ [45:42](https://youtu.be/IPBSB1HLNLo%3Ft%3D45m42s) ]是传统机器学习方法的一个例子。 他拍摄了乳腺癌活组织检查的病理学幻灯片,咨询了许多病理学家关于哪些类型的模式或特征可能与长期生存相关的想法。 然后他们编写专家算法来计算这些特征,通过逻辑回归进行运算,并预测存活率。 它的表现优于病理学家,但是领域专家和计算机专家需要多年的工作才能建立起来。 + +#### 更好的方式[ [47:35](https://youtu.be/IPBSB1HLNLo%3Ft%3D47m35s) ] + +![](../img/1_R4qix1l4TjKOrLkrA4t6EA.png) + +* 具有这三个属性的一类算法是深度学习。 + +#### 无限灵活的功能:神经网络[ [48:43](https://youtu.be/IPBSB1HLNLo%3Ft%3D48m43s) ] + +深度学习使用的基础功能称为神经网络: + +![](../img/1_0YOpyzGWkrS4VW3ntJRQ5Q.png) + +* 您现在需要知道的是,它由许多简单的线性层组成,其中散布着许多简单的非线性层。 当你散布这些层时,你会得到一种称为通用逼近定理的东西。 通用近似定理所说的是,只要添加足够的参数,这种函数就可以解决任何给定的问题,任意精度。 + +#### 通用参数拟合:梯度下降[ [49:39](https://youtu.be/IPBSB1HLNLo%3Ft%3D49m39s) ] + +![](../img/1_ezus486-s4OMT2YrXq81Dg.png) + +#### 快速且可扩展:GPU [ [51:05](https://youtu.be/IPBSB1HLNLo%3Ft%3D51m5s) ] + +![](../img/1_qPZYWZebPi6Sx_usSPWHUg.png) + +上面显示的神经网络示例具有一个隐藏层。 我们在过去几年中学到的东西是这些神经网络不是快速或可扩展的,除非我们添加了多个隐藏层 - 因此称为“深度”学习。 + +#### 全部放在一起[ [53:40](https://youtu.be/IPBSB1HLNLo%3Ft%3D53m40s) ] + +![](../img/1_btdxcSWzAoJMPuo0qgVXKw.png) + +以下是一些示例: + +* [https://research.googleblog.com/2015/11/computer-respond-to-this-email.html](https://research.googleblog.com/2015/11/computer-respond-to-this-email.html) +* [https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/](https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/) +* [https://www.skype.com/en/features/skype-translator/](https://www.skype.com/en/features/skype-translator/) +* [https://arxiv.org/abs/1603.01768](https://arxiv.org/abs/1603.01768) + +![](../img/1_BFG_B7UpS3AvJxE6lH0lug.gif) + +#### 诊断肺癌[ [56:55](https://youtu.be/IPBSB1HLNLo%3Ft%3D56m55s) ] + +![](../img/1__E0tiKelpZ3_7u0rOo6T5A.png) + +其他目前的应用: + +![](../img/1_LtaJU-2GsBanHnxavDysDA.png) + +### 卷积神经网络[ [59:13](https://youtu.be/IPBSB1HLNLo%3Ft%3D59m13s) ] + +#### 线性层 + +[http://setosa.io/ev/image-kernels/](http://setosa.io/ev/image-kernels/) + 
+![](../img/1_saLeHFHg9zMRmmCAR-qqzA.png) + +#### 非线性层[ [01:02:12](https://youtu.be/IPBSB1HLNLo%3Ft%3D1h2m12s) ] + +[**神经网络和深度学习**](http://neuralnetworksanddeeplearning.com/chap4.html)：在这一章中，作者给出了关于通用逼近定理的简单且以可视化为主的解释，并一步一步地展开（neuralnetworksanddeeplearning.com）。 + +![](../img/1_hdh2d_jl5YohqGU265ce5A.png) + +![](../img/1_QindKA4Dt7Ol3CbICMSxWw.png) + +
Sigmoid和ReLU
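在继续往下看之前，这里给出一个极简的示意代码（基于 PyTorch，层数和维度都是随意假设的，与课程中的具体网络无关），展示“线性层 + ReLU 这类逐元素非线性”交替堆叠起来是什么样子：

```
import torch
import torch.nn as nn

# 一个玩具网络：线性层与 ReLU 非线性交替堆叠
# （仅作示意，输入维度 10、隐藏维度 100、输出 2 都是随意取的假设值）
model = nn.Sequential(
    nn.Linear(10, 100),   # 线性层
    nn.ReLU(),            # 逐元素非线性
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 2),    # 输出两个类别的得分
)

x = torch.randn(64, 10)   # 一个批次的随机输入
print(model(x).shape)     # torch.Size([64, 2])
```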
+ + + +* 线性层跟随元素非线性函数的组合允许我们创建任意复杂的形状 - 这是通用逼近定理的本质。 + +#### 如何设置这些参数来解决问题[ [01:04:25](https://youtu.be/IPBSB1HLNLo%3Ft%3D1h4m25s) ] + +* 随机梯度下降 - 我们沿着山坡走小步。 步长称为**学习率** + +![](../img/1_BOJUtuAtlS9wUUyj0JJCHw.gif) + +![](../img/1_YQWdnHVTRPGjr0-VGuSxyA.jpeg) + +* 如果学习率太大,它将发散而不是收敛 +* 如果学习率太小,则需要永远 + +#### 可视化和理解卷积网络[ [01:08:27](https://youtu.be/IPBSB1HLNLo%3Ft%3D1h8m27s) ] + +![](../img/1_RPakI9UqMTYmGIm4ELhh6w.png) + +我们从一些非常简单的东西开始,但如果我们使用它作为一个足够大的规模,由于通用近似定理和在深度学习中使用多个隐藏层,我们实际上获得了非常丰富的功能。 这实际上是我们在训练我们的狗与猫识别器时使用的。 + +#### Dog vs. Cat Revisited - 选择学习率[ [01:11:41](https://youtu.be/IPBSB1HLNLo%3Ft%3D1h11m41s) ] + +``` + learn.fit(0.01, 3) +``` + +* 第一个数字`0.01`是学习率。 +* _学习速率_决定了您想要更新_权重_ (或_参数_ )的速度或速度。 学习率是设置最困难的参数之一,因为它会显着影响模型性能。 +* 方法`learn.lr_find()`可帮助您找到最佳学习率。 它使用2015年论文“ [循环学习率训练神经网络”中](http://arxiv.org/abs/1506.01186)开发的技术,我们只需将学习率从非常小的值提高,直到损失停止下降。 我们可以绘制不同批次的学习率,看看它是什么样的。 + +``` + learn = ConvLearner.pretrained(arch, data, precompute= **True** ) learn.lr_find() +``` + +我们的`learn`对象包含一个属性`sched` ,其中包含我们的学习速率调度程序,并具有一些方便的绘图功能,包括以下内容: + +``` + learn.sched.plot_lr() +``` + +![](../img/1_iGjSbGhX60ZZ3bHbqaIURQ.png) + +* Jeremy目前正在尝试以指数方式和线性方式提高学习率。 + +我们可以看到损失与学习率的关系,看看我们的损失在哪里停止下降: + +``` + learn.sched.plot() +``` + +![](../img/1_CWF7v1ihFka2QG4RebgqjQ.png) + +* 然后我们选择学习率,其中损失仍在明显改善 - 在这种情况下`1e-2` (0.01) + +#### 选择时代数量[ [1:18:49](https://youtu.be/IPBSB1HLNLo%3Ft%3D1h18m49s) ] + +``` + _[ 0\. 0.04955 0.02605 0.98975]_ _[ 1\. 0.03977 0.02916 0.99219]_ _[ 2\. 0.03372 0.02929 0.98975]_ +``` + +* 尽可能多的人,但如果你运行太久,准确性可能会变得更糟。 它被称为“过度拟合”,我们稍后会详细了解它。 +* 另一个考虑因素是您可以使用的时间。 + +#### 提示与技巧[ [1:21:40](https://youtu.be/IPBSB1HLNLo%3Ft%3D1h21m40s) ] + +**1.** `Tab` - 当您忘记功能名称时,它将自动完成 + +![](../img/1_g5JpxoRhb-rIPeaXlrII1w.png) + +**2.** `Shift + Tab` - 它将显示函数的参数 + +![](../img/1_u5P7II8U2C-6tYpQoTyIVA.png) + +**3.** `Shift + Tab + Tab` - 它将显示一个文档(即docstring) + +![](../img/1_bCD9vgqELl-M4qQY2WU-Zw.png) + +**4.** `Shift + Tab + Tab + Tab` - 它将打开一个具有相同信息的单独窗口。 + +![](../img/1_632kupN-LwycuzR7wLBwKA.png) + +打字`?` 后跟一个单元格中的函数名称并运行它将与`shift + tab (3 times)`相同`shift + tab (3 times)` + +![](../img/1_fCXsVFZq_0Pmsa8-g_5nUg.png) + +**5.**键入两个问号将显示源代码 + +![](../img/1_rmoEReBWYIfc-Kgz-QOgWQ.png) + +**6.**在Jupyter Notebook中键入`H`将打开一个带有键盘快捷键的窗口。 尝试每天学习4或5个快捷方式 + +![](../img/1_z0rM6FP5gJZExHbmsVr5Xg.png) + +**7.**停止Paperspace,Crestle,AWS - 否则你将被收取$$ + +**8.**请记住[论坛](http://forums.fast.ai/)和[http://course.fast.ai/](http://course.fast.ai/) (每节课)以获取最新信息。 diff --git a/zh/dl10.md b/zh/dl10.md new file mode 100644 index 0000000000000000000000000000000000000000..73646ba9dcd4b7a2393b8e46fdd2ffcacb564e7e --- /dev/null +++ b/zh/dl10.md @@ -0,0 +1,1012 @@ +# 深度学习2:第2部分第10课 + +#### [视频](https://youtu.be/h5Tz7gZT9Fo) / [论坛](http://forums.fast.ai/t/part-2-lesson-10-wiki/14364/1) + +![](../img/1_g_wGv7SlgRghedYKSaIJ1w.png) + +#### 回顾上周[ [0:16](https://youtu.be/h5Tz7gZT9FoY%3Ft%3D16s) ] + +* 许多学生正在努力学习上周的材料,所以如果你觉得很难,那很好。 杰里米预先把它放在那里的原因是我们有一些东西要思考,思考,并逐渐努力,所以在第14课,你将得到第二个裂缝。 +* 要理解这些碎片,您需要了解卷积层输出,感受域和损失函数的形状 - 无论如何,这些都是您需要了解的所有深度学习研究的内容。 +* 一个关键的问题是我们从简单的东西开始 - 单个对象分类器,没有分类器的单个对象边界框,然后是单个对象分类器和边界框。 除了我们首先必须解决匹配问题之外,我们去多个对象的位实际上几乎相同。 我们最终创建了比我们的基础真实边界框所需的激活更多的激活,因此我们将每个地面实况对象与这些激活的子集相匹配。 一旦我们完成了这个,我们对每个匹配对做的损失函数几乎与这个损失函数(即单个对象分类器和边界框的一个)相同。 +* 如果您感觉困难,请返回第8课,确保您了解数据集,DataLoader,最重要的是了解损失函数。 +* 因此,一旦我们有一些可以预测一个对象的类和边界框的东西,我们通过创建更多激活来进入多个对象[ [2:40](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D2m40s) ]。 然后我们必须处理匹配问题,处理了一个匹配问题,然后我们将每个锚箱移入和移出一点点左右,所以他们试图与特定的地面实例对象进行对齐。 +* 我们谈到了我们如何利用网络的卷积性质来尝试进行具有与我们预测的基本事实对象类似的接受场的激活。 
Chloe提供了以下精彩图片，逐行分析了SSD_MultiHead.forward: + +![](../img/1_BbxbH3gWu8RHMTuXZlDasA.png) + +
作者: [Chloe Sultan](http://forums.fast.ai/u/chloews)
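正文紧接着会提到用 pdb.set_trace() 检查这些中间张量形状的技巧。下面是一个极简的示意（模块名、层数和维度都是假设的，并非课程里的 SSD_MultiHead 本身），展示把断点放进 forward 之后如何查看激活的尺寸：

```
import pdb
import torch
import torch.nn as nn

class TinyHead(nn.Module):
    """一个假设的小模块，仅用于演示在 forward 中设置断点、检查形状。"""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(256, 256, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(256, 256, 3, stride=2, padding=1)

    def forward(self, x):
        x = self.conv1(x)
        # pdb.set_trace()   # 取消注释后，可以在这里交互式地输入 x.size() 查看形状
        x = self.conv2(x)
        return x

m = TinyHead()
out = m(torch.randn(2, 256, 7, 7))
print(out.size())   # 步幅 2 的卷积做了两次：7x7 -> 4x4 -> 2x2
```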
+ + + +Chloe在这里所做的是她特别关注路径中每个点处张量的维数,因为我们使用步幅2卷积逐渐下采样,确保她理解为什么这些网格大小发生然后理解输出是如何产生的。 + +![](../img/1_gwvD-lSxiUH_EFq9n-ofOQ.png) + +* 这是你必须记住这个`pbd.set_trace()` 。 我刚刚上课前进入`SSD_MultiHead.forward`并输入了`pdb.set_trace()`然后我运行了一个批处理。 然后我可以打印出所有这些的大小。 我们犯了错误,这就是为什么我们有调试器并且知道如何检查事物并一路上做一些小事。 +* 然后我们讨论了增加_k_ [ [5:49](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D5m49s) ],这是每个卷积网格单元的锚箱数量,我们可以用不同的缩放比例,宽高比进行处理,这给我们提供了大量的激活,因此预测了边界框。 +* 然后我们使用非最大抑制去了一个小数字。 +* 非最大抑制是一种hacky,丑陋和完全启发式,我们甚至没有谈论代码,因为它看起来很可怕。 最近有人提出了一篇论文,试图用一个端到端的转发网来取代那个NMS( [https://arxiv.org/abs/1705.02950](https://arxiv.org/abs/1705.02950) )。 + +![](../img/1_PadSMuPPUl1W0fPhIdylYQ.png) + +* 没有足够的人在阅读论文! 我们现在在课堂上所做的是实施论文,论文是真正的基本事实。 而且我想你通过与人们交谈了解人们不读纸的原因很多,因为很多人都不认为他们有能力读报纸。 他们认为他们不是那种阅读论文的人,但你是。 你在这里。 我们上周开始查看一篇论文,我们读到的是英文单词,我们对它们有很大的了解。 如果你仔细看看上面的图片,你会发现`SSD_MultiHead.forward`没有做同样的事情。 然后你可能想知道这是否更好。 我的回答可能是。 因为SSD_MultiHead.forward是我尝试过的第一件事。 在这篇与YOLO3论文之间,它们可能是更好的方法。 +* 有一点你会特别注意到他们使用较小的k,但他们有更多的网格组1x1,3x3,5x5,10x10,19x19,38x38 - 8732每个级别。 比我们更多,所以这将是一个有趣的实验。 +* 我注意到的另一件事是我们有4x4,2x2,1c1这意味着有很多重叠 - 每一组都适合每一组。 在这种情况下你有1,3,5,你没有那个重叠。 所以它实际上可能更容易学习。 你可以玩很多有趣的东西。 + +![](../img/1_1AG_zIXUouogXB5--BFuzw.png) + +* 也许我建议最重要的是将代码和方程式放在一起。 你是数学家或代码人。 通过将它们并排放置,您将学习到另一个。 +* 学习数学很难,因为符号似乎很难查找,但有很好的资源,如[维基百科](https://en.wikipedia.org/wiki/List_of_mathematical_symbols) 。 +* 你应该尝试做的另一件事是重新创建你在论文中看到的东西。 这是焦点损失论文中关键的最重要的数字1。 + +![](../img/1_NJ8DKEP6qwePIi9hGhvhAg.png) + +![](../img/1_WhwnGf2r6-aboRYEDWrOUA.png) + +* 上周我确实在我的代码中发现了一个小错误[ [12:14](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D12m14s) ] - 我对卷积激活的讨论方式并不符合我在丢失函数中使用它们的方式,并且修复它使它更好一些。 + +![](../img/1_B_pXi5zpN2EGnhATYo-ZWg.png) + +**问题** :通常,当我们下采样时,我们会增加滤波器的数量或深度。 当我们从7x7到4x4等进行采样时,为什么我们将数字从512减少到256? 为什么不降低SSD头的尺寸? (性能相关?)[ [12:58](https://youtu.be/_ercfViGrgY%3Ft%3D12m58s) ]我们有许多外出路径,我们希望每个路径都相同,所以我们不希望每个路径都有不同数量的过滤器,这也就是论文的内容我试图与之相匹配。 拥有这256个 - 这是一个不同的概念,因为我们不仅利用了最后一层,而且利用了之前的层。 如果我们让它们更加一致,生活会更容易。 + +* * * + +### 自然语言处理[ [14:10](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D14m10s) ] + +#### 我们要去的地方: + +我们在每节课中都看到过采用预先训练过的模型的想法,在顶部掀起一些东西,用新的东西替换它,并让它做类似的事情。 我们已经有点潜入了与`ConvLearner.pretrained`更深入的内容。 `ConvLearner.pretrained`它有一种标准的方式将东西粘在顶部,这是一个特定的事情(即分类)。 然后我们了解到实际上我们可以在最后使用我们喜欢的任何PyTorch模块并让它用`custom_head`做任何我们喜欢的`custom_head` ,所以突然你发现我们可以做一些非常有趣的事情。 + +事实上,杨璐说“如果我们做了不同类型的自定义头怎么办?”并且不同的自定义头是让我们拍摄原始图片,旋转它们,并使我们的因变量与旋转相反,看它是否可以学习解旋它。 这是一个非常有用的东西,事实上,我认为Google照片现在有这个选项,它实际上会自动为你旋转你的照片。 但是很酷的是,正如他在这里展示的那样,你可以通过与我们上一课完全相同的方式建立这个网络。 但是你的自定义头是一个吐出一个数字的头部,它可以旋转多少,而你的数据集有一个因变量,你可以旋转多少。 + +![](../img/1_4MNEdvVzjzHbbtlub98chw.png) + +
[http://forums.fast.ai/t/fun-with-lesson8-rotation-adjustment-things-you-can-do-without-annotated-dataset/14261/1](http://forums.fast.ai/t/fun-with-lesson8-rotation-adjustment-things-you-can-do-without-annotated-dataset/14261/1)
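下面给出一个非常粗略的示意，说明这种“主干 + 自定义头”做旋转角度回归的思路大概长什么样。写法参照的是第 8 课中边界框回归所用的 custom_head 模式，但这里的数据目录、CSV 文件名和各层尺寸都是假设值，并非那位同学的原始实现：

```
from fastai.conv_learner import *

PATH = 'data/rotated/'   # 假设的数据目录：CSV 里每张图对应一个旋转角度（连续值）
sz, arch = 224, resnet34

tfms = tfms_from_model(arch, sz)
# continuous=True 表示因变量是连续值（角度），而不是类别
md = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv',
                                  tfms=tfms, continuous=True)

# 自定义头：把主干最后的卷积激活摊平，再用一个线性层直接输出 1 个数（预测的旋转角度）
head = nn.Sequential(Flatten(), nn.Linear(25088, 1))   # 25088 = 512*7*7 (resnet34, sz=224)
learn = ConvLearner.pretrained(arch, md, custom_head=head)
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss()   # 用 L1 损失衡量预测角度与真实角度的差距

learn.fit(1e-2, 3)
```

（生成带随机旋转角度标签的训练数据这一步，这里没有展示。）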
+ + + +所以你突然意识到这个主干和定制头的想法,你几乎可以做任何你能想到的事情[ [16:30](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D16m30s) ]。 + +* 今天,我们将看看同样的想法,看看它如何适用于NLP。 +* 在下一课中,我们将进一步说明NLP和计算机视觉是否可以让您做同样的基本想法,我们如何将两者结合起来。 我们将学习一个实际上可以学习从图像中找到单词结构,从单词结构中找到图像或从图像中找到图像的模型。 如果你想进一步做一些事情,比如从一个图像到一个句子(即图像字幕),或者从一个句子到一个我们开始做的图像,一个短语到图像,那将构成基础。 +* 从那里开始,我们必须更深入地进入计算机视觉,思考我们可以用预先训练的网络和定制头的想法做些什么。 因此,我们将研究各种图像增强,例如增加低分辨率照片的分辨率以猜测缺少的内容或在照片上添加艺术滤镜,或将马的照片更改为斑马照片等。 +* 最后,这将再次带我们回到边界框。 为了达到这个目的,我们首先要学习分割,这不仅仅是找出边界框的位置,而是要弄清楚图像中每个像素的一部分 - 所以这个像素是一个部分人,这个像素是汽车的一部分。 然后我们将使用这个想法,特别是一个名为UNet的想法,结果证明了这个UNet的想法,我们可以应用于边界框 - 它被称为特征金字塔。 我们将使用它来获得带有边界框的非常好的结果。 这是我们从这里开始的道路。 这一切都将相互依赖,但将我们带入许多不同的领域。 + +#### torchtext to fastai.text [ [18:56](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D18m56s) ]: + +![](../img/1_E9I4m0Oj7vpWbf0DlnyFmw.png) + +对于NLP,最后一部分,我们依赖于一个名为torchtext的库,但是从那时起,我从那时起就发现它的局限性太难以继续使用了。 正如你们很多人在论坛上抱怨的那样,部分原因是因为它没有进行并行处理,部分是因为它不记得你上次做了什么,而是从头开始重新做了。 然后很难做出相当简单的事情,比如很多你试图进入Kaggle的有毒评论竞赛,这是一个多标签问题,并试图用火炬文本做到这一点,我最终得到它的工作,但它带我像一个一周的黑客攻击有点荒谬。 为了解决所有这些问题,我们创建了一个名为fastai.text的新库。 Fastai.text是torchtext和fastai.nlp组合的替代品。 所以不要再使用fastai.nlp - 这已经过时了。 它更慢,更混乱,在各方面都不太好,但有很多重叠。 有意地,许多类和函数具有相同的名称,但这是非torchtext版本。 + +#### IMDb [ [20:32](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D20m32s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/imdb.ipynb) + +我们将再次与IMDb合作。 对于那些已经忘记的人,请回去看看第[4课](https://medium.com/%40hiromi_suenaga/deep-learning-2-part-1-lesson-4-2048a26d58aa) 。 这是一个电影评论的数据集,我们用它来确定我们是否可以享受Zombiegeddon,我们认为可能是我的事情。 + +``` + **from** **fastai.text** **import** * **import** **html** +``` + +> 我们需要从这个网站下载IMDB大型电影评论: [http](http://ai.stanford.edu/~amaas/data/sentiment/) : [//ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)直接链接: [链接](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) + +``` + BOS = 'xbos' _# beginning-of-sentence tag_ FLD = 'xfld' _# data field tag_ +``` + +``` + PATH=Path('data/aclImdb/') +``` + +#### 标准化格式[ [21:27](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D21m27s) ] + +NLP的基本路径是我们必须采用句子并将它们转换为数字,并且有几个要到达那里。 目前,有些故意,fastai.text没有提供那么多辅助函数。 它的设计更多是为了让您以相当灵活的方式处理事物。 + +``` + CLAS_PATH=Path('data/imdb_clas/') CLAS_PATH.mkdir(exist_ok= **True** ) +``` + +``` + LM_PATH=Path('data/imdb_lm/') LM_PATH.mkdir(exist_ok= **True** ) +``` + +正如你在这里[ [21:59](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D21m59s) ]所看到的,我写了一些叫做get_texts的东西,它贯穿了`CLASSES`每一件事。 IMDb中有三个类:负数,正数,然后还有另一个文件夹“无人监督”,其中包含尚未标记的文件夹 - 所以我们现在只称它为一个类。 所以我们只需浏览这些类中的每一个,然后查找该文件夹中的每个文件,打开它,读取它,然后将它放入数组的末尾。 正如您所看到的,使用pathlib,抓取内容并将其拉入内容非常容易,然后标签就是我们到目前为止所处的任何类别。 我们将为训练集和测试集做到这一点。 + +``` + CLASSES = ['neg', 'pos', 'unsup'] +``` + +``` + **def** get_texts(path): texts,labels = [],[] **for** idx,label **in** enumerate(CLASSES): **for** fname **in** (path/label).glob('*.*'): texts.append(fname.open('r').read()) labels.append(idx) **return** np.array(texts),np.array(labels) +``` + +``` + trn_texts,trn_labels = get_texts(PATH/'train') val_texts,val_labels = get_texts(PATH/'test') +``` + +``` + len(trn_texts),len(val_texts) +``` + +``` + _(75000, 25000)_ +``` + +列车有75,000,测试有25,000。 火车组中的50,000个是无人监管的,当我们进入分类时,我们实际上无法使用它们。 Jeremy发现这比使用大量图层和包装器的torch.text方法更容易,因为最后阅读文本文件并不那么难。 + +``` + col_names = ['labels','text'] +``` + +有一点总是好主意是随机排序[ [23:19](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D23m19s) ]。 知道这个随机排序的简单技巧很有用,特别是当你有多个东西需要以同样的方式排序时。 在这种情况下,您有标签和`texts. np.random.permutation` `texts. 
np.random.permutation` ,如果你给它一个整数,它会给你一个0到0之间的随机列表,不包括你以某种随机顺序给它的数字。 + +``` + np.random.seed(42) trn_idx = np.random.permutation(len(trn_texts)) val_idx = np.random.permutation(len(val_texts)) +``` + +你可以将它作为索引器传递给你一个按随机顺序排序的列表。 所以在这种情况下,它将以相同的随机方式对`trn_texts`和`trn_labels`进行排序。 所以这是一个有用的小习惯用法。 + +``` + trn_texts = trn_texts[trn_idx] val_texts = val_texts[val_idx] +``` + +``` + trn_labels = trn_labels[trn_idx] val_labels = val_labels[val_idx] +``` + +现在我们将文本和标签排序,我们可以从它们创建数据[框](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D24m7s) [ [24:07](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D24m7s) ]。 我们为什么这样做? 原因是因为文本分类数据集开始出现一些标准的方法,即将训练设置为首先带有标签的CSV文件,然后是NLP文档的文本。 所以它基本上是这样的: + +![](../img/1_KUPgEBboQilVi7wcp6RO0Q.png) + +``` + df_trn = pd.DataFrame({'text':trn_texts, 'labels':trn_labels}, columns=col_names) df_val = pd.DataFrame({'text':val_texts, 'labels':val_labels}, columns=col_names) +``` + +``` + df_trn[df_trn['labels']!=2].to_csv(CLAS_PATH/'train.csv', header= **False** , index= **False** ) df_val.to_csv(CLAS_PATH/'test.csv', header= **False** , index= **False** ) +``` + +``` + (CLAS_PATH/'classes.txt').open('w') .writelines(f' **{o}\n** ' **for** o **in** CLASSES) (CLAS_PATH/'classes.txt').open().readlines() +``` + +``` + _['neg\n', 'pos\n', 'unsup\n']_ +``` + +所以你有你的标签和文本,然后是一个名为classes.txt的文件,它只列出了类。 我说有点标准,因为在最近的一篇学术论文中,Yann LeCun和一个研究小组研究了不少数据集,并且他们使用这种格式。 所以这就是我最近开始使用的论文。 你会发现这款笔记本,如果你把数据放到这种格式,整个笔记本每次都会工作[ [25:17](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D25m17s) ]。 因此,我只是选择一种标准格式,而不是拥有一千种不同的格式,而您的工作就是将数据放入CSV文件格式。 默认情况下,CSV文件没有标头。 + +你会注意到一开始我们有两条不同的路径[ [25:51](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D25m51s) ]。 一个是分类路径,另一个是语言模型路径。 在NLP中,你会一直看到LM。 LM意味着语言模型。 分类路径将包含我们将用于创建情绪分析模型的信息。 语言模型路径将包含创建语言模型所需的信息。 所以他们有点不同。 有一点不同的是,当我们在分类路径中创建train.csv时,我们会删除标签为2的所有内容,因为标签2是“无人监督”而我们无法使用它。 + +![](../img/1_CdwPQjBC0mmDYXXRQh6xMA.png) + +``` + trn_texts,val_texts = sklearn.model_selection.train_test_split( np.concatenate([trn_texts,val_texts]), test_size=0.1) +``` + +``` + len(trn_texts), len(val_texts) +``` + +``` + _(90000, 10000)_ +``` + +第二个区别是标签[ [26:51](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D26m51s) ]。 对于分类路径,标签是实际标签,但对于语言模型,没有标签,所以我们只使用一堆零,这使得它更容易,因为我们可以使用一致的数据帧/ CSV格式。 + +现在是语言模型,我们可以创建自己的验证集,所以你现在可能已经遇到过, `sklearn.model_selection.train_test_split`这是一个非常简单的函数,可以抓取数据集并随机将其拆分为训练集和验证集根据你指定的比例。 在这种情况下,我们将分类培训和验证结合在一起,将其拆分10%,现在我们有90,000次培训,10,000次验证我们的语言模型。 这样我们的语言模型和分类器就会以标准格式获取数据。 + +``` + df_trn = pd.DataFrame({'text':trn_texts, 'labels': [0]*len(trn_texts)}, columns=col_names) df_val = pd.DataFrame({'text':val_texts, 'labels': [0]*len(val_texts)}, columns=col_names) +``` + +``` + df_trn.to_csv(LM_PATH/'train.csv', header= **False** , index= **False** ) df_val.to_csv(LM_PATH/'test.csv', header= **False** , index= **False** ) +``` + +#### 语言模型令牌[ [28:03](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D28m3s) ] + +接下来我们要做的就是标记化。 标记化意味着在这个阶段,对于文档(即电影评论),我们有一个很长的字符串,我们想把它变成一个标记列表,类似于单词列表但不完全。 例如, `don't`希望它是`do`而`n't` ,我们可能希望完全停止成为令牌,等等。 标记化是我们传递给一个名为spaCy的极好的库 - 部分非常棒,因为澳大利亚人写了它并且部分非常棒,因为它擅长它的功能。 我们在spaCy上添加了一些东西,但绝大部分工作都是由spaCy完成的。 + +``` + chunksize=24000 +``` + +在我们将它传递给spaCy之前,Jeremy编写了这个简单的修复功能,每次他查看不同的数据集(大约十几个建立这个)时,每个人都有不同的奇怪的东西需要被替换。 所以这是他到目前为止所提出的所有内容,希望这也会帮助你。 所有的实体都是未转义的html,我们会替换更多的东西。 看一下在你输入的文本上运行它的结果,并确保那里没有更多奇怪的标记。 + +``` + re1 = re.compile(r' +') +``` + +``` + **def** fixup(x): x = x.replace('#39;', "'").replace('amp;', '&') .replace('#146;', "'").replace('nbsp;', ' ') .replace('#36;', '$').replace(' **\\** n', " **\n** ") .replace('quot;', "'").replace('
', " **\n** ") .replace(' **\\** "', '"').replace('','u_n') .replace(' @.@ ','.').replace(' @-@ ','-') .replace(' **\\** ', ' **\\** ') **return** re1.sub(' ', html.unescape(x)) +``` + +``` + **def** get_texts(df, n_lbls=1): labels = df.iloc[:,range(n_lbls)].values.astype(np.int64) texts = f' **\n{BOS}** **{FLD}** 1 ' + df[n_lbls].astype(str) **for** i **in** range(n_lbls+1, len(df.columns)): texts += f' **{FLD}** {i-n_lbls} ' + df[i].astype(str) texts = texts.apply(fixup).values.astype(str) +``` + +``` + tok = Tokenizer().proc_all_mp(partition_by_cores(texts)) **return** tok, list(labels) +``` + +`get_all function`调用`get_texts`和get_texts会做一些事情[ [29:40](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D29m40s) ]。 其中之一是应用我们刚才提到的`fixup` 。 + +``` + **def** get_all(df, n_lbls): tok, labels = [], [] **for** i, r **in** enumerate(df): print(i) tok_, labels_ = get_texts(r, n_lbls) tok += tok_; labels += labels_ **return** tok, labels +``` + +让我们看看这个,因为有一些有趣的事情需要指出[ [29:57](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D29m57s) ]。 我们将使用pandas从语言模型路径打开我们的train.csv,但是我们传入了一个额外的参数,你可能在看到`chunksize`之前看不到。 在存储和使用文本数据时,Python和pandas的效率都很低。 因此,您会发现NLP中很少有人使用大型语料库。 杰里米认为,部分原因是传统工具使得它变得非常困难 - 你的内存总是耗尽。 因此,他今天向我们展示了这个过程,他使用这个确切的代码成功地使用了超过十亿字的语料库。 其中一个简单的伎俩就是用大熊猫称为`chunksize` 。 这意味着pandas不返回数据帧,但它返回一个迭代器,我们可以迭代数据帧的块。 这就是为什么我们不说`tok_trn = get_text(df_trn)` ,而是调用`get_all`来循环数据帧,但实际上它正在做的是它循环遍历数据帧的块,所以这些块中的每一个基本上都是数据帧代表数据的一个子集[ [31:05]](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D31m5s) 。 + +**问题** :当我使用NLP数据时,很多时候我遇到了带有外国文本/字符的数据。 放弃它们还是保留它们更好[ [31:31](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D31m31s) ]? 不,不,绝对保留它们。 整个过程是unicode,我实际上在中文文本上使用了这个。 这适用于几乎任何事情。 一般来说,大多数时候,删除任何东西都不是一个好主意。 老式的NLP方法倾向于完成所有这些,如词形还原和所有这些规范化步骤,以摆脱事物,小写一切等等。但这会丢弃你不知道它是否有用的信息。 所以不要丢掉信息。 + +所以我们遍历每个块,每个块都是一个数据帧,我们调用`get_texts` [ [32:19](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D32m19s) ]。 `get_texts`将获取标签并使它们成为整数,并且它将获取文本。 有几点需要指出: + +* 在我们包含文本之前,我们在开头定义了“开始流”( `BOS` )令牌。 这些特殊的字母串没有什么特别之处 - 它们只是我认为不经常出现在普通文本中的字母。 因此,每个文本都将以`'xbos'`开头 - 为什么? 
因为模型通常有助于了解新文本何时开始。 例如,如果它是一种语言模型,我们将把所有文本连接在一起。 因此,知道所有这些文章已经完成并且新的文章已经开始真的很有用,所以我现在可能会忘记它们的一些上下文。 +* 同上,文本通常有多个字段,如标题和摘要,然后是主文档。 因此,出于同样的原因,我们在这里得到了这个东西,它让我们在CSV中实际上有多个字段。 所以这个过程的设计非常灵活。 再次在每一个的开头,我们放置一个特殊的“field starts here”标记,后跟从这里开始的字段数量,就像我们拥有的字段一样多。 然后我们对它应用`fixup` 。 +* 然后最重要的是[ [33:54](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D33m54s) ],我们将它标记化 - 我们通过执行“进程所有多处理”( `proc_all_mp` )来标记它。 令牌化速度往往相当缓慢,但我们现在已经在我们的机器中拥有多个内核,而AWS上的一些更好的机器可以拥有数十个内核。 spaCy不太适合多处理,但Jeremy最终想出了如何让它发挥作用。 好消息是它现在全部包含在这一功能中。 因此,您需要传递给该函数的是要标记化的事物列表,该列表的每个部分将在不同的核心上进行标记化。 还有一个名为`partition_by_cores`的函数,它接受一个列表并将其拆分为子列表。 子列表的数量是计算机中的核心数。 在没有多处理的Jeremy机器上,这需要大约一个半小时,并且通过多处理,大约需要2分钟。 所以这是一个非常有用的东西。 随意查看它,并利用它为您自己的东西。 请记住,即使在我们的笔记本电脑中,我们都拥有多个内核,并且Python中很少有东西可以利用它,除非你付出一些努力使其工作。 + +``` + df_trn = pd.read_csv(LM_PATH/'train.csv', header= **None** , chunksize=chunksize) df_val = pd.read_csv(LM_PATH/'test.csv', header= **None** , chunksize=chunksize) +``` + +``` + tok_trn, trn_labels = get_all(df_trn, 1) tok_val, val_labels = get_all(df_val, 1) +``` + +``` + _0_ _1_ _2_ _3_ _0_ +``` + +``` + (LM_PATH/'tmp').mkdir(exist_ok= **True** ) +``` + +这是最后的结果[ [35:42](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D35m42s) ]。 流令牌( `xbos` )的开头,字段编号1标记( `xfld 1` )的开头和标记化文本。 您会看到标点符号现在是一个单独的标记。 + +`**t_up**` : `t_up mgm` - MGM最初资本化。 但有趣的是,通常人们要么小写一切,要么就是按原样离开。 现在,如果您保持原样,那么“SCREW YOU”和“screw you”是两组完全不同的令牌,必须从头开始学习。 或者如果你将它们全部小写,那么根本就没有区别。 那么你如何解决这个问题,以便你们都能得到“我现在正在发挥作用”的语义影响,但不必学习大喊大叫的版本与正常版本。 因此,我们的想法是提出一个独特的令牌来表示接下来的事情都是大写的。 然后我们小写它,所以现在任何曾经是大写的是小写的,然后我们可以学习所有大写的语义含义。 + +`**tk_rep**` :同样,如果你有29个`!` 在连续的情况下,我们没有为29个感叹号学习单独的标记 - 而是我们为“下一个重复很多次”添加了一个特殊标记,然后输入数字29和一个感叹号(即`tk_rep 29 !` ) 。 所以有一些这样的技巧。 如果您对NLP感兴趣,请查看Jeremy添加的这些小技巧的tokenizer代码,因为其中一些很有趣。 + +``` + ' '.join(tok_trn[0]) +``` + +![](../img/1_avBwSHfjT31_-m28KGf4NA.png) + +以这种方式做事的`np.save`是我们现在可以只是`np.save`并将其加载回来[ [37:44](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D37m44s) ]。 我们不必每次重新计算所有这些东西,就像我们倾向于使用torchtext或许多其他库一样。 现在我们已经将它标记化了,接下来我们要做的就是把它变成我们称之为数字化的数字。 我们将它数字化的方式非常简单。 + +* 我们列出了以某种顺序出现的所有单词的列表 +* 然后我们将每个单词的索引替换为该列表 +* 所有令牌的列表,我们称之为词汇表。 + +``` + np.save(LM_PATH/'tmp'/'tok_trn.npy', tok_trn) np.save(LM_PATH/'tmp'/'tok_val.npy', tok_val) +``` + +``` + tok_trn = np.load(LM_PATH/'tmp'/'tok_trn.npy') tok_val = np.load(LM_PATH/'tmp'/'tok_val.npy') +``` + +这是一些词汇[ [38:28](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D38m28s) ]的例子。 Python中的Counter类非常方便。 它基本上为我们提供了一个独特项目及其计数的列表。 以下是词汇表中最常见的25个内容。 一般来说,我们不希望词汇表中的每个唯一标记。 如果它没有出现至少两次,那么可能只是一个拼写错误或一个单词,如果它不经常出现,我们无法学到任何关于它的东西。 此外,一旦你的词汇量超过60,000,我们将在这部分中学到的东西会变得有点笨拙。 如果时间允许的话,我们可能会看看Jeremy最近在处理大型词汇表时所做的一些工作,否则可能需要在未来的课程中进行。 但实际上对于分类来说,做大于60,000个单词似乎无论如何都没有帮助。 + +``` + freq = Counter(p **for** o **in** tok_trn **for** p **in** o) freq.most_common(25) +``` + +``` + _[('the', 1207984),_ _('.', 991762),_ _(',', 985975),_ _('and', 587317),_ _('a', 583569),_ _('of', 524362),_ _('to', 484813),_ _('is', 393574),_ _('it', 341627),_ _('in', 337461),_ _('i', 308563),_ _('this', 270705),_ _('that', 261447),_ _('"', 236753),_ _("'s", 221112),_ _('-', 188249),_ _('was', 180235),_ _('\n\n', 178679),_ _('as', 165610),_ _('with', 159164),_ _('for', 158981),_ _('movie', 157676),_ _('but', 150203),_ _('film', 144108),_ _('you', 124114)]_ +``` + +因此,我们将把词汇量限制为60,000个单词,这些单词至少出现两次[ [39:33](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D39m33s) ]。 这是一个简单的方法。 使用`.most_common` ,传递最大词汇量。 这将按频率排序,如果它看起来不是最低频率,那么根本不用担心它。 这给了我们`itos` - 这与torchtext使用的名称相同,它意味着整数到字符串。 这只是词汇中唯一标记的列表。 我们`_unk_`插入两个令牌 - 一个用于未知( `_unk_` )的词汇项和一个用于填充的词汇项( `_pad_` )。 + +``` + max_vocab = 60000 min_freq = 2 +``` + +``` + itos = [o 
**for** o,c **in** freq.most_common(max_vocab) **if** c>min_freq] itos.insert(0, '_pad_') itos.insert(0, '_unk_') +``` + +然后我们可以创建相反方向的字典(字符串到整数)[ [40:19](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D40m19s) ]。 这不会涵盖所有内容,因为我们故意将其截断至60,000字。 如果我们遇到字典中没有的东西,我们想用零替换未知,所以我们可以使用带有lambda函数的`defaultdict` ,它始终返回零。 + +``` + stoi = collections.defaultdict( **lambda** :0, {v:k **for** k,v **in** enumerate(itos)}) len(itos) +``` + +``` + _60002_ +``` + +所以现在我们定义了`stoi`字典,然后我们可以为每个句子调用每个单词[ [40:50](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D40m50s) ]。 + +``` + trn_lm = np.array([[stoi[o] **for** o **in** p] **for** p **in** tok_trn]) val_lm = np.array([[stoi[o] **for** o **in** p] **for** p **in** tok_val]) +``` + +这是我们的数字化版本: + +![](../img/1_1VnI0YwW5Lb2N1qFHuOpAA.png) + +当然,好消息是我们也可以保存这一步骤。 每次我们进入另一步,我们都可以保存它。 与您使用图像时相比,这些文件不是很大。 文字通常很小。 + +保存那些词汇( `itos` )非常重要。 数字列表没有任何意义,除非你知道每个数字所指的是什么,这就是`itos`告诉你的。 + +``` + np.save(LM_PATH/'tmp'/'trn_ids.npy', trn_lm) np.save(LM_PATH/'tmp'/'val_ids.npy', val_lm) pickle.dump(itos, open(LM_PATH/'tmp'/'itos.pkl', 'wb')) +``` + +所以你保存了这三件事,以后再加载它们就可以了。 + +``` + trn_lm = np.load(LM_PATH/'tmp'/'trn_ids.npy') val_lm = np.load(LM_PATH/'tmp'/'val_ids.npy') itos = pickle.load(open(LM_PATH/'tmp'/'itos.pkl', 'rb')) +``` + +现在我们的词汇量为60,002,我们的培训语言模型中有90,000个文档。 + +``` + vs=len(itos) vs,len(trn_lm) +``` + +``` + _(60002, 90000)_ +``` + +这是你做的预处理[ [42:01](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D42m1s) ]。 如果我们想要的话,我们可以在实用程序函数中包含更多的内容,但它非常简单,并且一旦您以CSV格式获得它,那么确切的代码将适用于您拥有的任何数据集。 + +#### 预训练[ [42:19](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D42m19s) ] + +![](../img/1_4RrK26gE8W28T8uJBl013g.png) + +这是一种新的洞察力,它根本不是新的,我们想要预先培训一些东西。 我们从第4课就知道,如果我们通过首先创建一个语言模型然后将其作为分类器进行微调来预先训练我们的分类器,这是有帮助的。 它实际上为我们带来了一个新的最先进的结果 - 我们获得了相当多的IMDb分类器结果。 我们的目标不是那么远,因为IMDb的电影评论与其他任何英文文件没有什么不同; 与它们与随机字符串甚至中文文档的不同程度相比较。 因此,就像ImageNet允许我们训练能够识别看起来像图片的东西的东西一样,我们可以将它用在与ImageNet无关的东西上,就像卫星图像一样。 为什么我们不训练一个擅长英语的语言模型,然后对它进行微调以擅长电影评论。 + +因此,这一基本见解促使Jeremy尝试在维基百科上构建语言模型。 Stephen Merity已经处理了维基百科,发现了其中大部分内容的一部分,但抛弃了留下较大文章的愚蠢小文章。 他称之为wikitext103。 Jeremy抓住了wikitext103并训练了一个语言模型。 他使用完全相同的方法向您展示训练IMDb语言模型,但他训练了一个wikitext103语言模型。 他保存了它并将其提供给任何想要在[此URL](http://files.fast.ai/models/wt103/)上使用它的人。 现在的想法是让我们训练一个以这些权重开始的IMDb语言模型。 希望对你们这些人来说,这是一个非常明显的,极具争议性的想法,因为它基本上就是我们迄今为止在几乎所有课程中所做的。 但是,当Jeremy在去年6月或7月向NLP社区的人们首次提到这一点时,可能并没有那么少的兴趣,并被告知它是愚蠢的[ [45:03](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D45m3s) ]。 因为杰里米很顽固,所以即使他们对NLP有更多了解并且无论如何都试过它,他都会忽视他们。 让我们看看发生了什么。 + +#### wikitext103转换[ [46:11](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D46m11s) ] + +我们就是这样做的。 抓住wikitext模型。 如果你做了`wget -r` ,它会递归地抓取整个目录,里面有一些东西。 + +``` + # ! wget -nH -r -np -P {PATH} [http://files.fast.ai/models/wt103/](http://files.fast.ai/models/wt103/) +``` + +我们需要确保我们的语言模型具有与Jeremy的wikitext完全相同的嵌入大小,隐藏数量和层数,否则您无法加载权重。 + +``` + em_sz,nh,nl = 400,1150,3 +``` + +这是我们预先训练的路径和我们预先训练的语言模型路径。 + +``` + PRE_PATH = PATH **/** 'models' **/** 'wt103' PRE_LM_PATH = PRE_PATH **/** 'fwd_wt103.h5' +``` + +让我们继续前进,并从前面的wikitext103模型中`torch.load`那些权重。 我们通常不使用torch.load,但这是PyTorch抓取文件的方式。 它基本上为您提供了一个字典,其中包含图层的名称和这些权重的张量/数组。 + +现在问题是wikitext语言模型是用一定的词汇构建的,与我们的词汇不同[ [47:14](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D47m14s) ]。 我们的#40与wikitext103型号#40不同。 所以我们需要将一个映射到另一个。 这非常简单,因为幸运的是Jeremy为wikitext词汇保存了itos。 + +``` + wgts = torch.load(PRE_LM_PATH, map_location= **lambda** storage, loc: storage) +``` + +``` + enc_wgts = to_np(wgts['0.encoder.weight']) row_m = enc_wgts.mean(0) +``` + +Here is the list of what each word is for wikitext103 model, and we can do the same `defaultdict` trick to map it in reverse. 
We'll use -1 to mean that it is not in the wikitext dictionary. + +``` + itos2 = pickle.load((PRE_PATH / 'itos_wt103.pkl').open('rb')) +``` + +``` + stoi2 = collections.defaultdict( lambda : - 1, {v:k for k,v in enumerate(itos2)}) +``` + +So now we can just say our new set of weights is just a whole bunch of zeros with vocab size by embedding size (ie we are going to create an embedding matrix) [ [47:57](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D47m57s) ]. We then go through every one of the words in our IMDb vocabulary. We are going to look it up in `stoi2` (string-to-integer for the wikitext103 vocabulary) and see if it's a word there. If that is a word there, then we won't get the `-1` . So `r` will be greater than or equal to zero, so in that case, we will just set that row of the embedding matrix to the weight which was stored inside the named element `'0.encoder.weight'` . You can look at this dictionary `wgts` and it's pretty obvious what each name corresponds to. It looks very similar to the names that you gave it when you set up your module, so here are the encoder weights. + +If we don't find it [ [49:02](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D49m2s) ], we will use the row mean — in other words, here is the average embedding weight across all of the wikitext103\. So we will end up with an embedding matrix for every word that's in both our vocabulary for IMDb and the wikitext103 vocab, we will use the wikitext103 embedding matrix weights; for anything else, we will just use whatever was the average weight from the wikitext103 embedding matrix. + +``` + new_w = np.zeros((vs, em_sz), dtype=np.float32) for i,w in enumerate(itos): r = stoi2[w] new_w[i] = enc_wgts[r] if r > =0 else row_m +``` + +We will then replace the encoder weights with `new_w` turn into a tensor [ [49:35](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D49m35s) ]. We haven't talked much about weight tying, but basically the decoder (the thing that turns the final prediction back into a word) uses exactly the same weights, so we pop it there as well. Then there is a bit of weird thing with how we do embedding dropout that ends up with a whole separate copy of them for a reason that doesn't matter much. So we popped the weights back where they need to go. So this is now a set of torch state which we can load in. + +``` + wgts['0.encoder.weight'] = T(new_w) wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w)) wgts['1.decoder.weight'] = T(np.copy(new_w)) +``` + +#### Language model [ [50:18](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D50m18s) ] + +Let's create our language model. Basic approach we are going to use is we are going to concatenate all of the documents together into a single list of tokens of length 24,998,320\. That is going to be what we pass in as a training set. So for the language model: + +* We take all our documents and just concatenate them back to back. +* We are going to be continuously trying to predict what's the next word after these words. +* We will set up a whole bunch of dropouts. +* Once we have a model data object, we can grab the model from it, so that's going to give us a learner. +* Then as per usual, we can call `learner.fit` . We do a single epoch on the last layer just to get that okay. The way it's set up is the last layer is the embedding words because that's obviously the thing that's going to be the most wrong because a lot of those embedding weights didn't even exist in the vocab. So we will train a single epoch of just the embedding weights. 
+* Then we'll start doing a few epochs of the full model. How is it looking? In lesson 4, we had the loss of 4.23 after 14 epochs. In this case, we have 4.12 loss after 1 epoch. So by pre-training on wikitext103, we have a better loss after 1 epoch than the best loss we got for the language model otherwise. + +**Question** : What is the wikitext103 model? Is it a AWD LSTM again [ [52:41](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D52m41s) ]? Yes, we are about to dig into that. The way I trained it was literally the same lines of code that you see above, but without pre-training it on wikitext103\. + +* * * + +#### A quick discussion about fastai doc project [ [53:07](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D53m7s) ] + +The goal of fastai doc project is to create documentation that makes readers say “wow, that's the most fantastic documentation I've ever read” and we have some specific ideas about how to do that. It's the same kind of idea of top-down, thoughtful, take full advantage of the medium approach, interactive experimental code first that we are all familiar with. If you are interested in getting involved, you can see the basic approach in [the docs directory](https://github.com/fastai/fastai/tree/master/docs) . In there, there is, amongst other things, [transforms-tmpl.adoc](https://raw.githubusercontent.com/fastai/fastai/master/docs/transforms-tmpl.adoc) . `adoc` is [AsciiDoc](http://asciidoc.org/) . AsciiDoc is like markdown but it's like what markdown needs to be to create actual books. A lot of actual books are written in AsciiDoc and it's as easy to use as markdown but there's way more cool stuff you can do with it. [Here](https://raw.githubusercontent.com/fastai/fastai/master/docs/transforms.adoc) is more standard asciiDoc example. You can do things like inserting a table of contents ( `:toc:` ). `::` means put a definition list here. `+` means this is a continuation of the previous list item. So there are many super handy features and it is like turbo-charged markdown. So this asciidoc creates this HTML and no custom CSS or anything added: + +![](../img/1_9UfkC1UD_8TZP0PpTbJhdg.png) + +We literally started this project 4 hours ago. So you have a table of contents with hyper links to specific sections. We have cross reference we can click on to jump straight to the cross reference. Each method comes along with its details and so on. To make things even easier, they've created a special template for argument, cross reference, method, etc. The idea is, it will almost be like a book. There will be tables, pictures, video segments, and hyperlink throughout. + +You might be wondering what about docstrings. But actually, if you look at the Python standard library and look at the docstring for `re.compile()` , for example, it's a single line. Nearly every docstring in Python is a single line. And Python then does exactly this — they have a website containing the documentation that says “this is what regular expressions are, and this is what you need to know about them, and if you want do them fast, you need to compile, and here is some information about compile” etc. These information is not in the docstring and that's how we are going to do as well — our docstring will be one line unless you need like two sometimes. Everybody is welcome to help contribute to the documentation. + +* * * + +**Question** : Hoes this compare to word2vec [ [58:31](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D58m31s) ]? This is actually a great thing for you to spend time thinking about during the week. 
I'll give you the summary now but it's a very important conceptual difference. The main conceptual difference is “what is word2vec?” Word2vec is a single embedding matrix — each word has a vector and that's it. In other words, it's a single layer from a pre-trained model — specifically that layer is the input layer. Also specifically that pre-trained model is a linear model that is pre-trained on something called a co-occurrence matrix. So we have no particular reason to believe that this model has learned anything much about English language or that it has any particular capabilities because it's just a single linear layer and that's it. What's this wikitext103 model? It's a language model and it has a 400 dimensional embedding matrix, 3 hidden layers with 1,150 activations per layer, and regularization and all that stuff tied input output matrices — it's basically a state-of-the-art AWD LSTM. What's the difference between a single layer of a single linear model vs. a three layer recurrent neural network? Everything! They are very different levels of capabilities. So you will see when you try using a pre-trained language model vs. word2vec layer, you'll get very different results for the vast majority of tasks. + +**Question** : What if the numpy array does not fit in memory? Is it possible to write a PyTorch data loader directly from a large CSV file [ [1:00:32](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h32s) ]? It almost certainly won't come up, so I'm not going to spend time on it. These things are tiny — they are just integers. Think about how many integers you would need to run out of memories? That's not gonna happen. They don't have to fit in GPU memory, just in your memory. I've actually done another Wikipedia model which I called giga wiki which was on all of Wikipedia and even that easily fits in memory. The reason I'm not using it is because it turned out not to really help very much vs. wikitext103\. I've built a bigger model than anybody else I've found in the academic literature and it fits in memory on a single machine. + +**Question** : What is the idea behind averaging the weights of embeddings [ [1:01:24](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h1m24s) ]? They have to be set to something. These are words that weren't there, so the other option is we could leave them as zero. But that seems like a very extreme thing to do. Zero is a very extreme number. Why would it be zero? We could set it equal to some random numbers, but if so, what would be the mean and standard deviation of those random numbers? Should they be uniform? If we just average the rest of the embeddings, then we have something that's reasonably scaled. Just to clarify, this is how we are initializing words that didn't appear in the training corpus. + +#### Back to Language Model [ [1:02:20](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h2m20s) ] + +This is a ton of stuff we've seen before, but it's changed a little bit. It's actually a lot easier than it was in part 1, but I want to go a little bit deeper into the language model loader. + +``` + wd=1e-7 bptt=70 bs=52 opt_fn = partial(optim.Adam, betas=(0.8, 0.99)) +``` + +``` + t = len(np.concatenate(trn_lm)) t, t//64 +``` + +``` + (24998320, 390598) +``` + +This is the `LanguageModelLoader` and I really hope that by now, you've learned in your editor or IDE how to jump to symbols [ [1:02:37](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h2m37s) ]. I don't want it to be a burden for you to find out what the source code of `LanguageModelLoader` is. 
If your editor doesn't make it easy, don't use that editor anymore. There's lots of good free editors that make this easy. + +So this is the source code for LanguageModelLoader, and it's interesting to notice that it's not doing anything particularly tricky. It's not deriving from anything at all. What makes something that's capable of being a data loader is that it's something you can iterate over. + +![](../img/1_ttM96lLbHQn06byFwmHj0g.png) + +Here is the `fit` function inside fastai.model [ [1:03:41](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h3m41s) ]. This is where everything ends up eventually which goes through each epoch, creates an iterator from the data loader, and then just does a for loop through it. So anything you can do a for loop through can be a data loader. Specifically it needs to return tuples of independent and dependent variables for mini-batches. + +![](../img/1_560U29nWI0xNGLsHgnWFNQ.png) + +So anything with a `__iter__` method is something that can act as an iterator [ [1:04:09](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h4m9s) ]. `yield` is a neat little Python keywords you probably should learn about if you don't already know it. But it basically spits out a thing and waits for you to ask for another thing — normally in a for loop or something. In this case, we start by initializing the language model passing it in the numbers `nums` this is the numericalized long list of all of our documents concatenated together. The first thing we do is to “batchfy” it. This is the thing which quite a few of you got confused about last time. If our batch size is 64 and we have 25 million numbers in our list. We are not creating items of length 64 — we are creating 64 items in total. So each of them is of size `t` divided by 64 which is 390k. So that's what we do here: + +`data = data.view(self.bs, -1).t().contiguous()` + +We reshape it so that this axis is of length 64 and `-1` is everything else (390k blob), and we transpose it. So that means that we now have 64 columns, 390k rows. Then what we do each time we do an iterate is we grab one batch of some sequence length, which is approximately equal to `bptt` (back prop through time) which we set to 70\. We just grab that many rows. So from `i` to `i+70` rows, we try to predict that plus one. Remember, we are trying to predict one past where we are up to. + +So we have 64 columns and each of those is 1/64th of our 25 million tokens, and hundreds of thousands long, and we just grab 70 at a time [ [1:06:29](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h6m29s) ]. So each of those columns, each time we grab it, it's going to kind of hook up to the previous column. That's why we get this consistency. This language model is stateful which is really important. + +Pretty much all of the cool stuff in the language model is stolen from Stephen Merity's AWD-LSTM [ [1:06:59](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h6m59s) ] including this little trick here: + +![](../img/1_Mv0c41-UvTGlNKHMuPlsHw.png) + +If we always grab 70 at a time and then we go back and do a new epoch, we're going to grab exactly the same batches every time — there is no randomness. Normally, we shuffle our data every time we do an epoch or every time we grab some data we grab it at random. You can't do that with a language model because this set has to join up to the previous set because it's trying to learn the sentence. If you suddenly jump somewhere else, that doesn't make any sense as a sentence. 
So Stephen's idea is to say “okay, since we can't shuffle the order, let's instead randomly change the sequence length”. Basically, 95% of the time, we will use `bptt` (ie 70) but 5% of the time, we'll use half that. Then he says “you know what, I'm not even going to make that the sequence length, I'm going to create a normally distributed random number with that average and a standard deviation of 5, and I'll make that the sequence length.” So the sequence length is seventy-ish and that means every time we go through, we are getting slightly different batches. So we've got that little bit of extra randomness. Jeremy asked Stephen Merity where he came up with this idea, did he think of it? and he said “I think I thought of it, but it seemed so obvious that I bet I didn't think of it” — which is true of every time Jeremy comes up with an idea in deep learning. It always seems so obvious that you just assume somebody else has thought of it. But Jeremy thinks Stephen thought of it. + +`LanguageModelLoader` is a nice thing to look at if you are trying to do something a bit unusual with a data loader [ [1:08:55](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h8m55s) ]. It's a simple role model you can use as to creating a data loader from scratch — something that spits out batches of data. + +Our language model loader took in all of the documents concatenated together along with batch size and bptt [ [1:09:14](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h9m14s) ]. + +``` + trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt) val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt) md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt) +``` + +Now generally speaking, we want to create a learner and the way we normally do that is by getting a model data object and calling some kind of method which have various names but often we call that method `get_model` . The idea is that the model data object has enough information to know what kind of model to give you. So we have to create that model data object which means we need LanguageModelData class which is very easy to do [ [1:09:51](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h9m51s) ]. + +Here are all of the pieces. We are going to create a custom learner, a custom model data class, and a custom model class. So a model data class, again this one doesn't inherit from anything so you really see there's almost nothing to do. You need to tell it most importantly what's your training set (give it a data loader), what's the validation set (give it a data loader), and optionally, give it a test set (data loader), plus anything else that needs to know. It might need to know the bptt, it needs to know number of tokens(ie the vocab size), and it needs to know what is the padding index. And so that it can save temporary files and models, model datas as always need to know the path. So we just grab all that stuff and we dump it. 而已。 That's the entire initializer. There is no logic there at all. + +![](../img/1_GPeBIZ7A9P8gdCulCCrREw.png) + +Then all of the work happens inside `get_model` [ [1:10:55](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h10m55s) ]. get_model calls something we will look at later, which just grabs a normal PyTorch nn.Module architecture, and chucks it on GPU. Note: with PyTorch, we would say `.cuda()` , with fastai it's better to say `to_gpu()` , the reason is that if you don't have GPU, it will leave it on the CPU. It also provides a global variable you can set to choose whether it goes on the GPU or not, so it's a better approach. 
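As a rough sketch of the pattern being described here (a paraphrase of the idea, not necessarily the exact fastai source), a `to_gpu` style helper with a global override flag looks roughly like this:

```
import torch

# Global flag: defaults to whether a GPU is visible, but can be overridden by hand.
USE_GPU = torch.cuda.is_available()

def to_gpu(x, *args, **kwargs):
    """Move a tensor/module to the GPU if one is available, otherwise leave it on the CPU."""
    return x.cuda(*args, **kwargs) if USE_GPU else x

model = to_gpu(torch.nn.Linear(10, 2))   # works the same way with or without a GPU
```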
We wrapped the model in a `LanguageModel` and the `LanguageModel` is a subclass of `BasicModel` which almost does nothing except it defines layer groups. Remember when we do discriminative learning rates where different layers have different learning rates or we freeze different amounts, we don't provide a different learning rate for every layer because there can be a thousand layers. We provide a different learning rate for every layer group. So when you create a custom model, you just have to override this one thing which returns a list of all of your layer groups. In this case, the last layer group contains the last part of the model and one bit of dropout. The rest of it ( `*` here means pull this apart) so this is going to be one layer per RNN layer. So that's all that is. + +Then finally turn that into a learner [ [1:12:41](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h12m41s) ]. So a learner, you just pass in the model and it turns it into a learner. In this case, we have overridden learner and the only thing we've done is to say I want the default loss function to be cross entropy. This entire set of custom model, custom model data, custom learner all fits on a single screen. They always basically look like this. + +The interesting part of this code base is `get_language_model` [ [1:13:18](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h13m18s) ]. Because that gives us our AWD LSTM. It actually contains the big idea. The big, incredibly simple idea that everybody else here thinks it's really obvious that everybody in the NLP community Jeremy spoke to thought was insane. That is, every model can be thought of as a backbone plus a head, and if you pre-train the backbone and stick on a random head, you can do fine-tuning and that's a good idea. + +![](../img/1_QoAsI-zGJ3XKMBDY-3o1Rg.png) + +These two bits of code, literally right next to each other, this is all there is inside `fastai.lm_rnn` . + +`get_language_model` : Creates an RNN encoder and then creates a sequential model that sticks on top of that — a linear decoder. + +`get_rnn_classifier` : Creates an RNN encoder, then a sequential model that sticks on top of that — a pooling linear classifier. + +We'll see what these differences are in a moment, but you get the basic idea. They are doing pretty much the same thing. They've got this head and they are sticking on a simple linear layer on top. + +**Question** : There was a question earlier about whether that any of this translates to other languages [ [1:14:52](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h14m52s) ]. Yes, this whole thing works in any languages. Would you have to retrain your language model on a corpus from that language? 绝对! So the wikitext103 pre-trained language model knows English. You could use it maybe as a pre-trained start for like French or German model, start by retraining the embedding layer from scratch might be helpful. Chinese, maybe not so much. But given that a language model can be trained from any unlabeled documents at all, you'll never have to do that. Because almost every language in the world has plenty of documents — you can grab newspapers, web pages, parliamentary records, etc. As long as you have a few thousand documents showing somewhat normal usage of that language, you can create a language model. One of our students tried this approach for Thai and he said the first model he built easily beat the previous state-of-the-art Thai classifier. 
For those of you that are international fellow, this is an easy way for you to whip out a paper in which you either create the first ever classifier in your language or beat everybody else's classifier in your language. Then you can tell them that you've been a student of deep learning for six months and piss off all the academics in your country. + +Here is our RNN encoder [ [1:16:49](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h16m49s) ]. It is a standard nn.Module. It looks like there is more going on in it than there actually is, but really all there is is we create an embedding layer, create an LSTM for each layer that's been asked for, that's it. Everything else in it is dropout. Basically all of the interesting stuff (just about) in the AWS LSTM paper is all of the places you can put dropout. Then the forward is basically the same thing. Call the embedding layer, add some dropout, go through each layer, call that RNN layer, append it to our list of outputs, add dropout, that's about it. So it's pretty straight forward. + +![](../img/1_HrRraVW1kuyghw-PIhV89g.png) + +The paper you want to be reading is the AWD LSTM paper which is [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) . It's well written, pretty accessible, and entirely implemented inside fastai as well — so you can see all of the code for that paper. A lot of the code actually is shamelessly plagiarized with Stephen's permission from his excellent GitHub repo [AWD LSTM](https://github.com/Smerity/awd-lstm-lm) . + +The paper refers to other papers. For things like why is it that the encoder weight and the decoder weight are the same. It's because there is this thing called “tie weights.” Inside `get_language_model` , there is a thing called `tie_weights` which defaults to true. If it's true, then we literally use the same weight matrix for the encoder and the decoder. They are pointing at the same block of memory. 这是为什么? What's the result of it? That's one of the citations in Stephen's paper which is also a well written paper you can look up and learn about weight tying. + +![](../img/1_b0FeRkWrz1MxE96PMak8xw.png) + +We have basically a standard RNN [ [1:19:52](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h19m52s) ]. The only reason where it's not standard is it has lots more types of dropout in it. In a sequential model on top of the RNN, we stick a linear decoder which is literally half the screen of code. It has a single linear layer, we initialize the weights to some range, we add some dropout, and that's it. So it's a linear layer with dropout. + +![](../img/1_8qFWffVOekS8lvZYmZIUdA.png) + +So the language model is: + +* RNN → A linear layer with dropout + +#### Choosing dropout [ [1:20:36](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h20m36s) ] + +What dropout you choose matters a lot .Through a lot of experimentation, Jeremy found a bunch of dropouts that tend to work pretty well for language models. But if you have less data for your language model, you'll need more dropout. If you have more data, you can benefit from less dropout. You don't want to regularize more than you have to. Rather than having to tune every one of these five things, Jeremy's claim is they are already pretty good ratios to each other, so just tune this number ( `0.7` below), we just multiply it all by something. If you are overfitting, then you'll need to increase the number, if you are underfitting, you'll need to decrease this. Because other than that, these ratio seem pretty good. 
+ +``` + drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])* 0.7 +``` + +``` + learner= md.get_model(opt_fn, em_sz, nh, nl, dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4]) +``` + +``` + learner.metrics = [accuracy] learner.freeze_to(-1) +``` + +#### Measuring accuracy [ [1:21:45](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h21m45s) ] + +One important idea which may seem minor but again it's incredibly controversial is that we should measure accuracy when we look at a language model . Normally for language models, we look at a loss value which is just cross entropy loss but specifically we nearly always take e to the power of that which the NLP community calls “perplexity”. So perplexity is just `e^(cross entropy)` . There is a lot of problems with comparing things based on cross entropy loss. Not sure if there's time to go into it in detail now, but the basic problem is that it is like that thing we learned about focal loss. Cross entropy loss — if you are right, it wants you to be really confident that you are right. So it really penalizes a model that doesn't say “I'm so sure this is wrong” and it's wrong. Whereas accuracy doesn't care at all about how confident you are — it cares about whether you are right. This is much more often the thing which you care about in real life. The accuracy is how often do we guess the next word correctly and it's a much more stable number to keep track of. So that's a simple little thing that Jeremy does. + +``` + learner.model.load_state_dict(wgts) +``` + +``` + lr=1e-3 lrs = lr +``` + +``` + learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1) +``` + +``` + _epoch trn_loss val_loss accuracy_ + 0 4.398856 4.175343 0.28551 +``` + +``` + [4.175343, 0.2855095456305303] +``` + +``` + learner.save('lm_last_ft') +``` + +``` + learner.load('lm_last_ft') +``` + +``` + learner.unfreeze() +``` + +``` + learner.lr_find(start_lr=lrs/10, end_lr=lrs*10, linear= True ) +``` + +``` + learner.sched.plot() +``` + +We train for a while and we get down to a 3.9 cross entropy loss which is equivalent of ~49.40 perplexity ( `e^3.9` ) [ [1:23:14](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h23m14s) ]. To give you a sense of what's happening with language models, if you look at academic papers from about 18 months ago, you'll see them talking about state-of-the-art perplexity of over a hundred. The rate at which our ability to understand language and measuring language model accuracy or perplexity is not a terrible proxy for understanding language. If I can guess what you are going to say next, I need to understand language well and the kind of things you might talk about pretty well. The perplexity number has just come down so much that it's been amazing, and it will come down a lot more. NLP in the last 12–18 months, it really feels like 2011–2012 computer vision. We are starting to understand transfer learning and fine-tuning, and basic models are getting so much better. Everything you thought about what NLP can and can't do is rapidly going out of date. There's still lots of things NLP is not good at to be clear. Just like in 2012, there were lots of stuff computer vision wasn't good at. But it's changing incredibly rapidly and now is a very very good time to be getting very good at NLP or starting startups base on NLP because there is a whole bunch of stuff which computers would absolutely terrible at two years ago and now not quite good as people and then next year, they'll be much better than people. 
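Since perplexity is just `e^(cross entropy)`, both numbers are easy to compute from the same predictions. Here is a minimal sketch in plain PyTorch (not fastai's metric code; the shapes are made up purely for illustration):

```
 import torch
 import torch.nn.functional as F

 def lm_metrics(logits, targets):
     """Cross entropy, perplexity (e^CE) and next-word accuracy for a batch of
     language-model predictions. logits: (n_tokens, vocab_sz), targets: (n_tokens,)."""
     ce = F.cross_entropy(logits, targets)
     ppl = torch.exp(ce)                                      # the number NLP papers usually report
     acc = (logits.argmax(dim=1) == targets).float().mean()   # how often the next word is guessed right
     return ce.item(), ppl.item(), acc.item()

 # e.g. a cross entropy loss of 3.9 corresponds to exp(3.9) ≈ 49.4 perplexity
```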
+ +``` + learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15) +``` + +``` + epoch trn_loss val_loss accuracy 0 4.332359 4.120674 0.289563 1 4.247177 4.067932 0.294281 2 4.175848 4.027153 0.298062 3 4.140306 4.001291 0.300798 4 4.112395 3.98392 0.302663 5 4.078948 3.971053 0.304059 6 4.06956 3.958152 0.305356 7 4.025542 3.951509 0.306309 8 4.019778 3.94065 0.30756 9 4.027846 3.931385 0.308232 10 3.98106 3.928427 0.309011 11 3.97106 3.920667 0.30989 12 3.941096 3.917029 0.310515 13 3.924818 3.91302 0.311015 14 3.923296 3.908476 0.311586 +``` + +``` + [3.9084756, 0.3115861900150776] +``` + +**Question** : What is your ratio of paper reading vs. coding in a week [ [1:25:24](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h25m24s) ]? Gosh, what do you think, Rachel? You see me. I mean, it's more coding, right? “It's a lot more coding. I feel like it also really varies from week to week” (Rachel). With that bounding box stuff, there were all these papers and no map through them, so I didn't even know which one to read first and then I'd read the citations and didn't understand any of them. So there was a few weeks of just kind of reading papers before I even know what to start coding. That's unusual though. Anytime I start reading a paper, I'm always convinced that I'm not smart enough to understand it, always, regardless of the paper. And somehow eventually I do. But I try to spend as much time as I can coding. + +Nearly always after I've read a paper [ [1:26:34](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h26m34s) ], even after I've read the bit that says this is the problem I'm trying to solve, I'll stop there and try to implement something that I think might solve that problem. And then I'll go back and read the paper, and I read little bits about these are how I solve these problem bits, and I'll be like “oh that's a good idea” and then I'll try to implement those. That's why for example, I didn't actually implement SSD. My custom head is not the same as their head. It's because I kind of read the gist of it and then I tried to create something as best as I could, then go back to the papers and try to see why. So by the time I got to the focal loss paper, Rachel will tell you, I was driving myself crazy with how come I can't find small objects? How come it's always predicting background? I read the focal loss paper and I was like “that's why!!” It's so much better when you deeply understand the problem they are trying to solve. I do find the vast majority of the time, by the time I read that bit of the paper which is solving a problem, I'm then like “yeah, but these three ideas I came up with, they didn't try.” Then you suddenly realize that you've got new ideas. Or else, if you just implement the paper mindlessly, you tend not to have these insights about better ways to do it . + +**Question** : Is your dropout rate the same through the training or do you adjust it and weights accordingly [ [1:26:27](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h26m27s) ]? Varying dropout is really interesting and there are some recent papers that suggest gradually changing dropout [ [1:28:09](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h28m9s) ]. It was either good idea to gradually make it smaller or gradually make it bigger, I'm not sure which. Maybe one of us can try and find it during the week. I haven't seen it widely used. I tried it a little bit with the most recent paper I wrote and I had some good results. I think I was gradually make it smaller, but I can't remember. 
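If you want to experiment with gradually changing dropout, a sketch might look like the following. This is purely illustrative (the lecture isn't sure which direction helps), and it assumes a model built from standard `nn.Dropout` modules; the AWD-LSTM's custom dropouts store their probabilities differently.

```
 import torch.nn as nn

 def dropout_schedule(epoch, n_epochs, p_start=0.5, p_end=0.1):
     """Linearly anneal the dropout probability over training (the direction is a guess)."""
     return p_start + (p_end - p_start) * epoch / max(1, n_epochs - 1)

 def set_dropout(model, p):
     """Set every standard nn.Dropout module in the model to probability p."""
     for m in model.modules():
         if isinstance(m, nn.Dropout):
             m.p = p

 # before each epoch: set_dropout(model, dropout_schedule(epoch, n_epochs))
```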
+

**Question** : Am I correct in thinking that this language model is built on word embeddings? Would it be valuable to try this with phrase or sentence embeddings? I ask this because I saw from Google the other day the universal sentence encoder [ [1:28:45](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h28m45s) ]. This is much better than that. This is not just an embedding of a sentence, this is an entire model. An embedding by definition is a fixed thing, whereas a sentence or phrase embedding is always created by some model. Here we've got a model that's trying to understand language. It's not just a phrase or a sentence — it handles a whole document in the end, and we are training the whole model, not just an embedding. A huge problem with NLP for years now has been this attachment to embeddings. Even the paper that the community has been most excited about recently, from [AI2](http://allenai.org/) (Allen Institute for Artificial Intelligence), called ELMo — they found much better results across lots of models, but again it was an embedding. They took a fixed model and created a fixed set of numbers which they then fed into a model. But in computer vision, we've known for years that that approach of having a fixed set of features (they're called hypercolumns in computer vision) doesn't work as well; people stopped using them 3 or 4 years ago because fine-tuning the entire model works much better. For those of you who have spent quite a lot of time with NLP and not much time with computer vision, you're going to have to start re-learning. All that stuff you have been told about this idea that there are these things called embeddings, that you learn them ahead of time, and then you apply these fixed things whether it be word level or phrase level or whatever level — don't do that. You want to actually create a pre-trained model and fine-tune it end-to-end; that's where you'll see much better results.

**Question** : For using accuracy instead of perplexity as a metric for the model, could we work that into the loss function rather than just use it as a metric [ [1:31:21](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h31m21s) ]? No, you never want to do that, whether it be computer vision or NLP or whatever. It's too bumpy. So cross entropy is fine as a loss function. And I'm not saying use it instead of; I use it in addition to. I think it's good to look at the accuracy and to look at the cross entropy. But for your loss function, you need something nice and smooth. Accuracy doesn't work very well.

```
 learner.save('lm1') learner.save_encoder('lm1_enc')
```

#### `save_encoder` [ [1:31:55](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h31m55s) ]

You'll see there are two different versions of `save` . `save` saves the whole model as per usual. `save_encoder` just saves that bit:

![](../img/1_H8mfqVgmT04qnT1ludJRFQ.png)

In other words, in the sequential model, it saves just `rnn_enc` and not `LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc)` (which is the bit that actually makes it into a language model). We don't care about that bit in the classifier; we just care about `rnn_enc` . That's why we save two different models here.

```
 learner.sched.plot_loss()
```

![](../img/1_NI2INKONs4lYhviEqp3zFQ.png)

#### Classifier tokens [ [1:32:31](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h32m31s) ]

Let's now create the classifier. We will go through this pretty quickly because it's the same. But when you go back during the week and look at the code, convince yourself it's the same.
+ +``` + df_trn = pd.read_csv(CLAS_PATH/'train.csv', header= None , chunksize=chunksize) df_val = pd.read_csv(CLAS_PATH/'test.csv', header= None , chunksize=chunksize) +``` + +``` + tok_trn, trn_labels = get_all(df_trn, 1) tok_val, val_labels = get_all(df_val, 1) +``` + +``` + _0_ _1_ _0_ _1_ +``` + +``` + (CLAS_PATH/'tmp').mkdir(exist_ok= True ) +``` + +``` + np.save(CLAS_PATH/'tmp'/'tok_trn.npy', tok_trn) np.save(CLAS_PATH/'tmp'/'tok_val.npy', tok_val) +``` + +``` + np.save(CLAS_PATH/'tmp'/'trn_labels.npy', trn_labels) np.save(CLAS_PATH/'tmp'/'val_labels.npy', val_labels) +``` + +``` + tok_trn = np.load(CLAS_PATH/'tmp'/'tok_trn.npy') tok_val = np.load(CLAS_PATH/'tmp'/'tok_val.npy') +``` + +We don't create a new `itos` vocabulary, we obviously want to use the same vocabulary we had in the language model because we are about to reload the same encoder [ [1:32:48](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h32m48s) ]. + +``` + itos = pickle.load((LM_PATH/'tmp'/'itos.pkl').open('rb')) stoi = collections.defaultdict( lambda :0, {v:k for k,v in enumerate(itos)}) len(itos) +``` + +``` + 60002 +``` + +``` + trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn]) val_clas = np.array([[stoi[o] for o in p] for p in tok_val]) +``` + +``` + np.save(CLAS_PATH/'tmp'/'trn_ids.npy', trn_clas) np.save(CLAS_PATH/'tmp'/'val_ids.npy', val_clas) +``` + +#### Classifier + +``` + trn_clas = np.load(CLAS_PATH/'tmp'/'trn_ids.npy') val_clas = np.load(CLAS_PATH/'tmp'/'val_ids.npy') +``` + +``` + trn_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'trn_labels.npy')) val_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'val_labels.npy')) +``` + +The construction of the model hyper parameters are the same [ [1:33:16](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h33m16s) ]. We can change the dropout. Pick a batch size that is as big as you can that doesn't run out of memory. + +``` + bptt,em_sz,nh,nl = 70,400,1150,3 vs = len(itos) opt_fn = partial(optim.Adam, betas=(0.8, 0.99)) bs = 48 +``` + +``` + min_lbl = trn_labels.min() trn_labels -= min_lbl val_labels -= min_lbl c=int(trn_labels.max())+1 +``` + +#### TextDataset [ [1:33:37](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h33m37s) ] + +This bit is interesting. There's fun stuff going on here. + +``` + trn_ds = TextDataset(trn_clas, trn_labels) val_ds = TextDataset(val_clas, val_labels) +``` + +The basic idea here is that for the classifier, we do really want to look at one document. Is this document positive or negative? So we do want to shuffle the documents. But those documents have different lengths and so if we stick them all into one batch (this is a handy thing that fastai does for you) — you can stick things of different lengths into a batch and it will automatically pat them, so you don't have to worry about that. But if they are wildly different lengths, then you're going to be wasting a lot of computation times. If there is one thing that's 2,000 words long and everything else is 50 words long, that means you end up with 2000 wide tensor. That's pretty annoying. So James Bradbury who is one of Stephen Merity's colleagues and the guy who came up with torchtext came up with a neat idea which was “let's sort the dataset by length-ish”. So kind of make it so the first things in the list are, on the whole, shorter than the things at the end, but a little bit random as well. + +Here is how Jeremy implemented that [ [1:35:10](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h35m10s) ]. The first thing we need is a Dataset. So we have a Dataset passing in the documents and their labels. 
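In code, such a dataset is only a few lines. Here is a rough sketch of the idea (the actual fastai `TextDataset` is shown in the screenshot below):

```
 import numpy as np
 from torch.utils.data import Dataset

 class SimpleTextDataset(Dataset):
     """Minimal sketch of a text dataset: index in, (document, label) out."""
     def __init__(self, x, y):
         self.x, self.y = x, y            # x: token-id sequences, y: labels
     def __getitem__(self, idx):
         return np.array(self.x[idx]), self.y[idx]
     def __len__(self):
         return len(self.x)
```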
Here is `TextDataSet` which inherits from `Dataset` and `Dataset` from PyTorch is also shown below: + +![](../img/1_5X1u6uQ6ywmiDVOa8qzbgg.png) + +Actually `Dataset` doesn't do anything at all [ [1:35:34](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h35m34s) ]. It says you need `__getitem__` if you don't have one, you're going to get an error. Same is true for `__len__` . So this is an abstract class. To `TextDataset` , we are going to pass in our `x` and `y` , and `__getitem__` will grab `x` and `y` , and return them — it couldn't be much simpler. Optionally, 1\. they could reverse it, 2\. stick an end of stream at the end, 3\. stick start of stream at the beginning. But we are not doing any of those things, so literally all we are doing is putting `x` and `y` and `__getitem__` returns them as a tuple. The length is however long the `x` is. That's all `Dataset` is — something with a length that you can index. + +#### Turning it to a DataLoader [ [1:36:27](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h36m27s) ] + +``` + trn_samp = SortishSampler(trn_clas, key= lambda x: len(trn_clas[x]), bs=bs//2) val_samp = SortSampler(val_clas, key= lambda x: len(val_clas[x])) +``` + +``` + trn_dl = DataLoader(trn_ds, bs//2, transpose= True , num_workers=1, pad_idx=1, sampler=trn_samp) val_dl = DataLoader(val_ds, bs, transpose= True , num_workers=1, pad_idx=1, sampler=val_samp) md = ModelData(PATH, trn_dl, val_dl) +``` + +To turn it into a DataLoader, you simply pass the Dataset to the DataLoader constructor, and it's now going to give you a batch of that at a time. Normally you can say shuffle equals true or shuffle equals false, it'll decide whether to randomize it for you. In this case though, we are actually going to pass in a sampler parameter and sampler is a class we are going to define that tells the data loader how to shuffle. + +* For validation set, we are going to define something that actually just sorts. It just deterministically sorts it so that all the shortest documents will be at the start, all the longest documents will be at the end, and that's going to minimize the amount of padding. +* For training sampler, we are going to create this thing called sort-ish sampler which also sorts (ish!) + +![](../img/1_Z_0F0rRH8odcUq8n7bRDVg.png) + +What's great about PyTorch is that they came up with this idea for an API for their data loader where we can hook in new classes to make it behave in different ways [ [1:37:27](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h37m27s) ]. SortSampler is something which has a length which is the length of the data source and has an iterator which is simply an iterator which goes through the data source sorted by length (which is passed in as `key` ). For the SortishSampler, it basically does the same thing with a little bit of randomness. It's just another of those beautiful design things in PyTorch that Jeremy discovered. He could take James Bradbury's ideas which he had written a whole new set of classes around, and he could just use inbuilt hooks inside PyTorch. You will notice data loader is not actually PyTorch's data loader — it's actually fastai's data loader. But it's basically almost entirely plagiarized from PyTorch but customized in some ways to make it faster mainly using multi-threading instead of multi-processing. + +**Question** : Does the pre-trained LSTM depth and `bptt` need to match with the new one we are training [ [1:39:00](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h39m) ]? No, the `bptt` doesn't need to match at all. 
That's just like how many things we look at at a time. It has nothing to do with the architecture. + +So now we can call that function we just saw before `get_rnn_classifier` [ [1:39:16](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h39m16s) ]. It's going to create exactly the same encoder more or less, and we are going to pass in the same architectural details as before. But this time, with the head we add on, you have a few more things you can do. One is you can add more than one hidden layer. In `layers=[em_sz*3, 50, c]` : + +* `em_sz * 3` : this is what the input to my head (ie classifier section) is going to be. +* `50` : this is the output of the first layer +* `c` : this is the output of the second layer + +And you can add as many as you like. So you can basically create a little multi-layer neural net classifier at the end. Similarly, for `drops=[dps[4], 0.1]` , these are the dropouts to go after each of these layers. + +``` + # part 1 dps = np.array([0.4, 0.5, 0.05, 0.3, 0.1]) +``` + +``` + dps = np.array([0.4,0.5,0.05,0.3,0.4])*0.5 +``` + +``` + m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1, layers=[em_sz*3, 50, c], drops=[dps[4], 0.1], dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3]) +``` + +``` + opt_fn = partial(optim.Adam, betas=(0.7, 0.99)) +``` + +We are going to use RNN_Learner just like before. + +``` + learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn) learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1) learn.clip=25\. learn.metrics = [accuracy] +``` + +We are going to use discriminative learning rates for different layers [ [1:40:20](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h40m20s) ]. + +``` + lr=3e-3 lrm = 2.6 lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr]) +``` + +``` + lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2]) +``` + +You can try using weight decay or not. Jeremy has been fiddling around a bit with that to see what happens. + +``` + wd = 1e-7 wd = 0 learn.load_encoder('lm2_enc') +``` + +We start out just training the last layer and we get 92.9% accuracy: + +``` + learn.freeze_to(-1) +``` + +``` + learn.lr_find(lrs/1000) learn.sched.plot() +``` + +``` + learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3)) +``` + +``` + _epoch trn_loss val_loss accuracy_ + 0 0.365457 0.185553 0.928719 +``` + +``` + [0.18555279, 0.9287188090884525] +``` + +``` + learn.save('clas_0') learn.load('clas_0') +``` + +Then we unfreeze one more layer, get 93.3% accuracy: + +``` + learn.freeze_to(-2) +``` + +``` + learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3)) +``` + +``` + _epoch trn_loss val_loss accuracy_ + 0 0.340473 0.17319 0.933125 +``` + +``` + [0.17319041, 0.9331253991245995] +``` + +``` + learn.save('clas_1') learn.load('clas_1') +``` + +``` + learn.unfreeze() +``` + +``` + learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10)) +``` + +``` + epoch trn_loss val_loss accuracy 0 0.337347 0.186812 0.930782 1 0.284065 0.318038 0.932062 2 0.246721 0.156018 0.941747 3 0.252745 0.157223 0.944106 4 0.24023 0.159444 0.945393 5 0.210046 0.202856 0.942858 6 0.212139 0.149009 0.943746 7 0.21163 0.186739 0.946553 8 0.186233 0.1508 0.945218 9 0.176225 0.150472 0.947985 10 0.198024 0.146215 0.948345 11 0.20324 0.189206 0.948145 12 0.165159 0.151402 0.947745 13 0.165997 0.146615 0.947905 +``` + +``` + [0.14661488, 0.9479046703071374] +``` + +``` + learn.sched.plot_loss() +``` + +``` + learn.save('clas_2') +``` + +Then we fine-tune the whole thing [ [1:40:47](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h40m47s) ]. 
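The staged training above can be summarized as a small gradual-unfreezing loop. This is just a compact restatement of the calls we already ran, reusing the same `learn`, `lrs` and `wd`:

```
 # Gradual unfreezing: train the last layer group, then the last two, then everything.
 for n_groups, cycle_len, clr in [(1, 1, (8, 3)), (2, 1, (8, 3)), (None, 14, (32, 10))]:
     if n_groups is None:
         learn.unfreeze()                 # finally make every layer group trainable
     else:
         learn.freeze_to(-n_groups)       # only the last n_groups layer groups are trainable
     learn.fit(lrs, 1, wds=wd, cycle_len=cycle_len, use_clr=clr)
```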
This was the main attempt before our paper came along at using a pre-trained model: + +[Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) + +What they did is they used a pre-trained translation model but they didn't fine tune the whole thing. They just took the activations of the translation model and when they tried IMDb, they got 91.8% — which we beat easily after only fine-tuning one layer. They weren't state-of-the-art, the state-of-the-art is 94.1% which we beat after fine-tuning the whole thing for 3 epochs and by the end, we are at 94.8% which is obviously a huge difference because in terms of error rate, that's gone done from 5.9%. A simple little trick is go back to the start of this notebook and reverse the order of all of the documents, and then re-run the whole thing. When you get to the bit that says `fwd_wt_103` , replace `fwd` for forward with `bwd` for backward. That's a backward English language model that learns to read English backward. So if you redo this whole thing, put all the documents in reverse, and change this to backward, you now have a second classifier which classifies things by positive or negative sentiment based on the reverse document. If you then take the two predictions and take the average of them, you basically have a bi-directional model (which you trained each bit separately)and that gets you to 95.4% accuracy. So we basically lowered it from 5.9% to 4.6%. So this kind of 20% change in the state-of-the-art is almost unheard of. It doesn't happen very often. So you can see this idea of using transfer learning, it's ridiculously powerful that every new field thinks their new field is too special and you can't do it. So it's a big opportunity for all of us. + +#### Universal Language Model Fine-tuning for Text Classification [ [1:44:02](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h44m2s) ] + +![](../img/1_XzWZUyxcsTu-ehYucd_vFQ.png) + +So we turned this into a paper, and when I say we, I did it with this guy Sebastian Ruder. Now you might remember his name because in lesson 5, I told you that I actually had shared lesson 4 with Sebastian because I think he is an awesome researcher who I thought might like it. I didn't know him personally at all. Much to my surprise, he actually watched the video. He watched the whole video and said: + +Sebastian: “That's actually quite fantastic! We should turn this into a paper.” + +Jeremy: “I don't write papers. I don't care about papers and am not interested in papers — that sounds really boring” + +Sebastian: “Okay, how about I write the paper for you.” + +Jeremy: “You can't really write a paper about this yet because you'd have to do like studies to compare it to other things (they are called ablation studies) to see which bit actually works. There's no rigor here, I just put in everything that came in my head and chucked it all together and it happened to work” + +Sebastian: “Okay, what if I write all the paper and do all your ablation studies, then can we write the paper?” + +Jeremy: “Well, it's like a whole library that I haven't documented and I'm not going to yet and you don't know how it all works” + +Sebastian: “Okay, if I wrote the paper, and do the ablation studies, and figure out from scratch how the code works without bothering you, then can we write the paper?” + +Jeremy: “Um… yeah, if you did all those things, then we can write the paper. 
Okay!” + +Then two days later, he comes back and says “okay, I've done a draft of the paper.” So, I share this story to say, if you are some student in Ireland and you want to do good work, don't let anybody stop you. I did not encourage him to say the least. But in the end, he said “I want to do this work, I think it's going to be good, and I'll figure it out” and he wrote a fantastic paper. He did the ablation study and he figured out how fastai works, and now we are planning to write another paper together. You've got to be a bit careful because sometimes I get messages from random people saying like “I've got lots of good ideas, can we have coffee?” — “I don't want… I can have coffee in my office anytime, thank you”. But it's very different to say “hey, I took your ideas and I wrote a paper, and I did a bunch of experiments, and I figured out how your code works, and I added documentation to it — should we submit this to a conference?” You see what I mean? There is nothing to stop you doing amazing work and if you do amazing work that helps somebody else, in this case, I'm happy that we have a paper. I don't particularly care about papers but I think it's cool that these ideas now have this rigorous study. + +#### Let me show you what he did [ [1:47:19](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h47m19s) ] + +He took all my code, so I'd already done all the fastai.text and as you have seen, it lets us work with large corpuses. Sebastian is fantastically well-read and he said “here's a paper that Yann LeCun and some guys just came out with where they tried lots of classification datasets so I'm going to try running your code on all these datasets.” So these are the datasets: + +![](../img/1_NFanphEYzNa9uMV4iSY2bw.png) + +Some of them had many many hundreds of thousands of documents and they were far bigger than I had tried — but I thought it should work. + +And he had a few good ideas as we went along and so you should totally make sure you read the paper. He said “well, this thing that you called in the lessons differential learning rates, differential kind of means something else. Maybe we should rename it” so we renamed it. It's now called discriminative learning rate. So this idea that we had from part one where we use different learning rates for different layers, after doing some literature research, it does seem like that hasn't been done before so it's now officially a thing — discriminative learning rates. This is something we learnt in lesson 1 but it now has an equation with Greek and everything [ [1:48:41](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h48m41s) ]: + +![](../img/1_KeaQyBreXN5QHfKCG-dJ0Q.png) + +When you see an equation with Greek and everything, that doesn't necessarily mean it's more complex than anything we did in lesson 1 because this one isn't. + +Again, that idea of like unfreezing a layer at a time, also seems to never been done before so it's now a thing and it's got the very clever name “gradual unfreezing” [ [1:48:57](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h48m57s) ]. + +![](../img/1_W3JSe1RPeRaYhMrr-RZoWw.png) + +#### Slanted triangular learning rate [ [1:49:10](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h49m10s) ] + +So then, as promised, we will look at slanted triangular learning rates . This actually was not my idea. Leslie Smith, one of my favorite researchers who you all now know about, emailed me a while ago and said “I'm so over cyclical learning rates. I don't do that anymore. 
I now do a slightly different version where I have one cycle which goes up quickly at the start, and then slowly down afterwards. I often find it works better.” I've tried going back over all of my old datasets and it works better for all of them — every one I tried. So this is what the learning rate look like. You can use it in fastai just by adding `use_clr=` to your `fit` . The first number is the ratio between the highest learning rate and the lowest learning rate so the initial learning rate is 1/32 of the peak. The second number is the ratio between the first peak and the last peak. The basic idea is if you are doing a cycle length 10, that you want the first epoch to be the upward bit and the other 9 epochs to be the downward bit, then you would use 10\. I find that works pretty well and that was also Leslie's suggestion is make about 1/10 of it the upward bit and 9/10 the downward bit. Since he told me about it, maybe two days ago, he wrote this amazing paper: [A DISCIPLINED APPROACH TO NEURAL NETWORK HYPER-PARAMETERS](https://arxiv.org/abs/1803.09820) . In which, he describes something very slightly different to this again, but the same basic idea. This is a must read paper. It's got all the kinds of ideas that fastai talks about a lot in great depth and nobody else is talking about this. It's kind of a slog, unfortunately Leslie had to go away on a trop before he really had time to edit it properly, so it's a little bit slow reading, but don't let that stop you. It's amazing. + +![](../img/1_ydr4ZUCrsDg71s_C73ggTg.png) + +The equation on the right is from my paper with Sebastian. Sebastian asked “Jeremy, can you send me the math equation behind that code you wrote?” and I said “no, I just wrote the code. I could not turn it into math” so he figured out the math for it. + +#### Concat pooling [ [1:51:36](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h51m36s) ] + +So you might have noticed, the first layer of our classifier was equal to embedding size*3 . Why times 3? Times 3 because, and again, this seems to be something which people haven't done before, so a new idea “concat pooling”. It is that we take the average pooling over the sequence of the activations, the max pooling of the sequence over the activations, and the final set of activations, and just concatenate them all together. This is something which we talked about in part 1 but doesn't seem to be in the literature before so it's now called “concat pooling” and it's now got an equation and everything but this is the entirety of the implementation. So you can go through this paper and see how the fastai code implements each piece. + +![](../img/1_ilEQlVMIdx3m2WAKzOCjfQ.png) + +#### BPT3C [ [1:52:46](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h52m46s) ] + +One of the kind of interesting pieces is the difference between `RNN_Encoder` which you've already seen and MultiBatchRNN encoder. So what's the difference there? The key difference is that the normal RNN encoder for the language model, we could just do `bptt` chunk at a time. But for the classifier, we need to do the whole document. We need to do the whole movie review before we decide if it's positive or negative. And the whole movie review can easily be 2,000 words long and we can't fit 2.000 words worth of gradients in my GPU memory for every single one of my weights. So what do we do? So the idea was very simple which is I go through my whole sequence length one batch of `bptt` at a time. 
And I call `super().forward` (in other words, the `RNN_Encoder` ) to grab its outputs, and then I've got this maximum sequence length parameter where it says "okay, as long as you are doing no more than that sequence length, then start appending it to my list of outputs." So in other words, the thing that it sends back to this pooling is only as many activations as we've asked it to keep. That way, you can figure out what `max_seq` your particular GPU can handle. So it's still using the whole document, but let's say `max_seq` is 1,000 words and your longest document is 2,000 words. It's still going through the RNN creating states for those first thousand words, but it's not actually going to store the activations for the backprop of the first thousand. It's only going to keep the last thousand. So that means that it can't back-propagate the loss back to any state that was created in the first thousand words — basically that's now gone. So it's a really simple piece of code, and honestly when I wrote it I didn't spend much time thinking about it; it seemed so obviously the only way this could possibly work. But again, it seems to be a new thing, so we now have backprop through time for text classification. You can see there are lots of little pieces in this paper.

![](../img/1_N-GZd5Z6Z3HjbEJnTID43g.png)

#### Results [ [1:55:56](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h55m56s) ]

What was the result? On every single dataset we tried, we got a better result than any previous academic paper for text classification. All different types. Honestly, IMDb was the only one I spent any time trying to optimize the model on, so for most of them, we just took whatever came out first. So if we actually spent time with them, I think these would be a lot better. The things these are compared to are mostly different in each table because they are, on the whole, customized algorithms. So this is saying one simple fine-tuning algorithm can beat these really customized algorithms.

![](../img/1_D9ntGwft-g9FgWsuNonGJQ.png)

#### Ablation studies [ [1:56:56](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h56m56s) ]

Here are the ablation studies Sebastian did. I was really keen that if we were going to publish a paper, we had to say why it works. So Sebastian went through and tried removing all of those different contributions I mentioned. What if we don't use gradual unfreezing? What if we don't use discriminative learning rates? What if, instead of discriminative learning rates, we use cosine annealing? What if we don't do any pre-training with Wikipedia? What if we don't do any fine-tuning? And the really interesting one to me was: what's the validation error rate on IMDb if we only used a hundred training examples (vs. 200, vs. 500, etc.)? And you can see, very interestingly, that the full version of this approach is nearly as accurate with just a hundred training examples as it is with the full 20,000 training examples, whereas if you are training from scratch on 100, it's almost random. It's what I expected. I had said to Sebastian that I really think this is most beneficial when you don't have much data. This is where fastai is most interested in contributing — small data regimes, small compute regimes, and so forth. So he did these studies to check.

![](../img/1_JsahawCY9ja-kZHTd90lFQ.png)

### Tricks to run ablation studies [ [1:58:32](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D1h58m32s) ]

#### Trick #1: VNC

The first trick is something which I know you're all going to find really handy.
I know you've all been annoyed when you are running something in a Jupyter notebook, and you lose your internet connection for long enough that it decides you've gone away, and then your session disappears, and you have to start it again from scratch. So what do you do? There is a very simple cool thing called VNC where you can install on your AWS instance or PaperSpace, or whatever: + +* X Windows ( `xorg` ) +* Lightweight window manager ( `lxde-core` ) +* VNC server ( `tightvncserver` ) +* Firefox ( `firefox` ) +* Terminal ( `lxterminal` ) +* Some fonts ( `xfonts-100dpi` ) + +Chuck the lines at the end of your `./vnc/xstartup` configuration file, and then run this command ( `tightvncserver :13 -geometry 1200x900` ): + +![](../img/1_A6iP79W389q7anG5nyASyg.png) + +It's now running a server where you can then run the TightVNC Viewer or any VNC viewer on your computer and you point it at your server. But specifically, what you do is you use SSH port forwarding to forward :5913 to localhost:5913: + +![](../img/1_fPDXeYX8HkT_JTuUEIHgSQ.png) + +Then you connect to port 5013 on localhost. It will send it off to port 5913 on your server which is the VNC port (because you said `:13` ) and it will display an X Windows desktop. Then you can click on the Linux start like button and click on Firefox and you now have Firefox. You see here in Firefox, it says localhost because this Firefox is running on my AWS server. So you now run Firefox, you start your thing running, and then you close your VNC viewer remembering that Firefox is displaying on this virtual VNC display, not in a real display, so then later on that day, you log back into VNC viewer and it pops up again. So it's like a persistent desktop, and it's shockingly fast. It works really well. There's lots of different VNC servers and clients, but this one works fine for me. + +#### Trick #2: Google Fire [ [2:01:27](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D2h1m27s) ] + +![](../img/1_03yxHYXeuHZUZbYaqKRs5g.png) + +Trick #2 is to create Python scripts, and this is what we ended up doing. So I ended up creating a little Python script for Sebastian to kind of say this is the basic steps you need to do, and now you need to create different versions for everything else. And I suggested to him that he tried using this thing called Google Fire. What Google Fire does is, you create a function with tons of parameters, so these are all the things that Sebastian wanted to try doing — different dropout amounts, different learning rates, do I use pre-training or not, do I use CLR or not, do I use discriminative learning rate or not, etc. So you create a function, and then you add something saying: + +``` + if __name__ == '__main__': fire.Fire(train_clas) +``` + +You do nothing else at all — you don't have to add any metadata, any docstrings, anything at all, and you then call that script and automatically you now have a command line interface. That's a super fantastic easy way to run lots of different variations in a terminal. This ends up being easier if you want to do lots of variations than using a notebook because you can just have a bash script that tries all of them and spits them all out. + +#### Trick #3: IMDb scripts [ [2:02:47](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D2h2m47s) ] + +You'll find inside the `courses/dl2` , there's now something called `imdb_scripts` , and I put all the scripts Sebastian and I used. Because we needed to tokenize and numericalize every dataset, then train a language model and a classifier for every dataset. 
And we had to do all of those things in a variety of different ways to compare them, so we had scripts for all of those things. You can check them out and see all of the scripts that we used.

![](../img/1_4wNUZhHpjSgRLj6s6ECddQ.png)

![](../img/1_SkRiJH47FdHtubjyUeYdLA.png)

#### Trick #4: pip install -e [ [2:03:32](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D2h3m32s) ]

When you are doing a lot of scripts, you've got different code all over the place. Eventually it gets frustrating to symlink your fastai library again and again. But you probably don't want to pip install it either, because that version tends to be a little bit old; we move so fast that you want to use the current version in Git. If you say `pip install -e .` from the fastai repo base, it does something quite neat: it basically creates a symlink to the fastai library (i.e. your locally cloned Git repo) inside your site-packages directory. Your site-packages directory is your main Python library. So if you do this, you can then access fastai from anywhere, and every time you do `git pull` , you've got the most recent version. One downside of this is that it installs any updated versions of packages from pip, which can confuse Conda a little bit, so another alternative is just to symlink the fastai library into your site-packages directory yourself. That works just as well. You can use fastai from anywhere and it's quite handy when you want to run scripts that use fastai from different directories on your system.

![](../img/1_tg8X-gjGJ6rFAg-aiPIgpQ.png)

#### Trick #5: SentencePiece [ [2:05:06](https://youtu.be/h5Tz7gZT9Fo%3Ft%3D2h5m6s) ]

This is something you can try if you like. You don't have to tokenize into words. Instead of tokenizing words, you can tokenize what are called sub-word units. For example, "unsupervised" could be tokenized as "un" and "supervised". "Tokenizer" could be tokenized as ["token", "izer"]. Then you can do the same thing: a language model that works on sub-word units, a classifier that works on sub-word units, etc. How well does that work? I started playing with it and, with not too much playing, I was getting classification results that were nearly as good as using word-level tokenization — not quite as good, but nearly as good. I suspect with more careful thinking and playing around, maybe I could have gotten as good or better. But even if I couldn't, if you create a sub-word-unit wikitext model, then an IMDb language model, and then a classifier, forwards and backwards, and then ensemble it with the forwards and backwards word-level ones, you should be able to beat us. So here is an approach with which you may be able to beat our state-of-the-art result.

![](../img/1_Ihivmbwld8tPdMracJ-FuQ.png)

Sebastian told me about this particular project — Google has a project called SentencePiece which actually uses a neural net to figure out the optimal splitting-up of words, so you end up with a vocabulary of sub-word units. In my playing around, I found that creating a vocabulary of about 30,000 sub-word units seems to be about optimal. If you are interested, this is something you can try. It is a bit of a pain to install — it's C++ and doesn't have great error messages, but it will work. There is a Python library for it. If anybody tries this, I'm happy to help them get it working. There have been few, if any, experiments with ensembling sub-word and word-level classification, and I do think it should be the best approach.

Have a great week!
diff --git a/zh/dl11.md b/zh/dl11.md new file mode 100644 index 0000000000000000000000000000000000000000..a4b1d3fb3cef5dd72db70e030cb804c66780cb85 --- /dev/null +++ b/zh/dl11.md @@ -0,0 +1,1465 @@ +# 深度学习2:第2部分第11课 + +### 链接 + +[**论坛**](http://forums.fast.ai/t/part-2-lesson-11-in-class/14699/1) **/** [**视频**](https://youtu.be/tY0n9OT5_nA) + +### 开始之前: + +* Sylvain Gugger的1循环[政策](https://sgugger.github.io/the-1cycle-policy.html) 。 基于Leslie Smith的新论文,该论文采用了前两篇关键论文(循环学习率和超级收敛),并在其基础上进行了大量实验,以展示如何实现超收敛。 超级收敛使您可以比以前的逐步方法快五倍地训练模型(并且比CLR快,尽管它不到五次)。 超级融合让你可以在1到3之间达到高学习率。超融合的有趣之处在于,你可以在相当大的一部分时期内以极高的学习率进行训练,在此期间,失去的不是真的非常好。 但诀窍在于它正在通过空间进行大量搜索,以找到看似真正具有普遍性的区域。 Sylvain通过刷新丢失的部分在fastai实施了它,然后确认他实际上在CIFAR10的培训上实现了超级融合。 它目前称为`use_clr_beta`但将来会重命名。 他还为fastai图书馆增添了周期性的动力。 +* [如何使用](https://towardsdatascience.com/how-to-create-data-products-that-are-magical-using-sequence-to-sequence-models-703f86a231f8) Hamel Husain的[序列到序列模型创建神奇的数据产品](https://towardsdatascience.com/how-to-create-data-products-that-are-magical-using-sequence-to-sequence-models-703f86a231f8) 。 他在博客上写了关于培训模型以总结GitHub问题的文章。 以下是基于他的博客创建的Kubeflow团队[演示](http://gh-demo.kubeflow.org/) 。 + +### 神经机器翻译[ [5:36](https://youtu.be/tY0n9OT5_nA%3Ft%3D5m36s) ] + +让我们构建一个序列到序列的模型! 我们将致力于机器翻译。 机器翻译已经存在了很长时间,但是我们将研究一种使用神经网络进行翻译的神经翻译方法。 神经机器翻译出现在几年前,它不如使用经典特征工程和标准NLP方法的统计机器翻译方法一样好,如词干,摆弄单词频率,n-gram等。一年后,它比其他一切都好。 它基于一个名为BLEU的度量标准 - 我们不会讨论该度量标准,因为它不是一个非常好的度量标准,并且它不是很有趣,但每个人都使用它。 + +![](../img/1_f0hoBLrTuevFPgAl-lFIfQ.png) + +我们看到机器翻译开始沿着我们在2012年开始计算机视觉对象分类的路径开始,该路径刚刚超过了现有技术并且现在以极快的速度拉开了它。 任何观看此操作的人都不太可能真正构建机器翻译模型,因为[https://translate.google.com/](https://translate.google.com/)可以很好地运行。 那么我们为什么要学习机器翻译呢? 我们学习机器翻译的原因在于,将法语中的某种输入作为句子并将其转换为任意长度的其他类型输出(例如英语句子)的一般想法是非常有用的。 例如,正如我们刚刚看到的那样,Hamel将GitHub问题转化为摘要。 另一个例子是拍摄视频并将其转换为描述,或者基本上是任何你正在吐出任意大小的输出的东西,这通常是一个句子。 也许进行CT扫描并吐出放射学报告 - 这是您可以使用序列来排序学习的地方。 + +#### 神经机器翻译的四大胜利[ [8:36](https://youtu.be/tY0n9OT5_nA%3Ft%3D8m36s) ] + +![](../img/1_c2kAArVl9mF_VaeqnXafBw.png) + +* 端到端培训:没有充分利用启发式和hacky功能工程。 +* 我们能够构建这些分布式表示,这些表示由单个网络中的许多概念共享。 +* 我们能够在RNN中使用长期状态,因此它比n-gram类型方法使用更多的上下文。 +* 最后,我们生成的文本也使用RNN,因此我们可以构建更流畅的东西。 + +#### BiLSTMs(+ Attn)不仅适用于神经MT [ [9:20](https://youtu.be/tY0n9OT5_nA%3Ft%3D9m20s) ] + +![](../img/1_LrzmE8Xi5-mwsUrqzEQA6Q.png) + +我们将注意使用双向GRU(基本上与LSTM相同) - 如上所述,这些一般性想法也可以用于许多其他事情。 + +#### 让我们跳进代码[ [9:47](https://youtu.be/tY0n9OT5_nA%3Ft%3D9m47s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/translate.ipynb) + +我们将尝试按照标准的神经网络方法将法语翻译成英语: + +1. 数据 +2. 建筑 +3. 
损失函数 + +#### 1.数据 + +像往常一样,我们需要`(x, y)`对。 在这种情况下,x:法语句子,y:英语句子,您将比较您的预测。 我们需要大量的这些法语句子元组及其等效的英语句子 - 被称为“平行语料库”,并且比语言模型的语料库更难找到。 对于语言模型,我们只需要某种语言的文本。 对于任何生活语言,至少有几千兆字节的文字漂浮在互联网上供你抓取。 对于翻译,有一些非常好的平行语料库可用于欧洲语言。 欧洲议会在每种欧洲语言中都有一句话。 任何流向联合国的东西都被翻译成许多语言。 对于法语到英语,我们有特别好的东西,几乎任何半官方的加拿大网站都有法语版和英文版[ [12:13](https://youtu.be/tY0n9OT5_nA%3Ft%3D12m13s) ]。 + +#### 翻译文件 + +``` + **from** **fastai.text** **import** * +``` + +来自[http://www.statmt.org/wmt15/translation-task.html的](http://www.statmt.org/wmt15/translation-task.html)法语/英语并行文本。 它由Chris Callison-Burch创建,他抓取了数百万个网页,然后使用_一套简单的启发法将法语网址转换为英文网址(即将“fr”替换为“en”和其他约40个其他手写规则),并假设这些文件是彼此的翻译_ 。 + +``` + PATH = Path('data/translate') TMP_PATH = PATH/'tmp' TMP_PATH.mkdir(exist_ok= **True** ) fname='giga-fren.release2.fixed' en_fname = PATH/f' **{fname}** .en' fr_fname = PATH/f' **{fname}** .fr' +``` + +对于边界框,所有有趣的东西都在损失函数中,但对于神经翻译,所有有趣的东西都将在他的体系结构中[ [13:01](https://youtu.be/tY0n9OT5_nA%3Ft%3D13m1s) ]。 让我们快速完成这一切,杰里米希望你特别考虑的事情之一就是我们正在做的任务以及我们如何在语言建模与神经翻译之间做到这一点的关系或相似之处。 + +![](../img/1_458KhA7uSET5eH3fe4EhDw.png) + +第一步是完成我们在语言模型中所做的完全相同的事情,这是一个句子,并通过RNN [ [13:35](https://youtu.be/tY0n9OT5_nA%3Ft%3D13m35s) ]。 + +![](../img/1_ujJzFdJfk2jP0466peyyrQ.png) + +现在有了分类模型,我们有了一个解码器,它接受了RNN输出并抓住了三个东西:所有时间步骤的`maxpool`和`meanpool` ,以及最后一步的RNN值,将所有这些叠加在一起并通过它线性层[ [14:24](https://youtu.be/tY0n9OT5_nA%3Ft%3D14m24s) ]。 大多数人不这样做,只使用最后一步,所以我们今天要讨论的所有事情都使用最后一步。 + +我们首先通过RNN清除输入句子,然后从中出现一些“隐藏状态”(即一些向量,表示编码句子的RNN的输出)。 + +#### 编码器≈骨干[ [15:18](https://youtu.be/tY0n9OT5_nA%3Ft%3D15m18s) ] + +斯蒂芬使用“编码器”这个词,但我们倾向于使用“骨干”这个词。 就像我们谈到为现有模型添加自定义头部时一样,例如,现有的预先训练过的ImageNet模型,我们说这是我们的支柱,然后我们会在它上面坚持一些能够完成我们想要的任务的头部。 顺序学习序列,他们使用单词编码器,但它基本上是相同的东西 - 它是一个神经网络架构的一部分,它接受输入并将其转换为一些表示,然后我们可以在顶部粘贴更多层从我们为分类器中抓取一些东西,我们在分类器上堆叠一个线性层,将int变成情绪。 但这一次,我们有一些东西比创造情绪更难[ [16:12](https://youtu.be/tY0n9OT5_nA%3Ft%3D16m12s) ]。 我们不是将隐藏状态转变为正面或负面情绪,而是将其变成一系列令牌,其中令牌序列是斯蒂芬的例子中的德语句子。 + +这听起来更像是语言模型而不是分类器,因为语言有多个令牌(对于每个输入词,都有一个输出词)。 但语言模型也更容易,因为语言模型输出中的令牌数量与语言模型输入中的令牌数量相同。 它们不仅长度相同,而且它们完全匹配(例如,在第一个词出现第二个词之后,第二个词出现第三个词,依此类推)。 对于翻译语言,您不一定知道单词“he”将被翻译为输出中的第一个单词(不幸的是,在这种特殊情况下)。 通常情况下,主题对象顺序会有所不同,或者会插入一些额外的单词,或者我们需要添加一些性别文章等一些代词。这是我们要处理的关键问题是我们有一个任意长度的输出,其中输出中的标记不对应于输入中的相同顺序或特定标记[ [17:31](https://youtu.be/tY0n9OT5_nA%3Ft%3D17m31s) ]。 但总体思路是一样的。 这是一个对输入进行编码的RNN,将其转换为某种隐藏状态,然后我们要学习的新事物就是生成一个序列输出。 + +#### 序列输出[ [17:47](https://youtu.be/tY0n9OT5_nA%3Ft%3D17m47s) ] + +我们已经知道了: + +* 序列到类(IMDB分类器) +* 序列到等长序列(语言模型) + +但是我们还不知道如何做一个通用序列来排序,所以这就是今天的新事物。 除非你真正理解第6课RNN是如何工作的,否则这一点很有意义。 + +#### 快速回顾[第6课](https://medium.com/%40hiromi_suenaga/deep-learning-2-part-1-lesson-6-de70d626976c) [ [18:20](https://youtu.be/tY0n9OT5_nA%3Ft%3D18m20s) ] + +我们了解到RNN的核心是标准的完全连接网络。 下面是一个有4层 - 一个输入并通过四层,但在第二层,它连接在第二个输入,第三层连接在第三个输入,但我们实际上在Python中写了这只是一个四层神经网络。 除线性图层和ReLU之外,我们没有使用任何其他内容。 每当输入进入时我们使用相同的权重矩阵,每当我们从一个隐藏状态进入下一个状态时我们使用相同的矩阵 - 这就是为什么这些箭头是相同的颜色。 + +![](../img/1_ZTtw8vtjy-K2CptW_xpLqw.png) + +我们可以像下面[ [19:29](https://youtu.be/tY0n9OT5_nA%3Ft%3D19m29s) ]重新绘制上面的图表。 + +![](../img/1_QE70fqvLLDhq8fq_ctxpZQ.png) + +我们不仅重新绘制它,而且我们在PyTorch中使用了四行线性线性线性线性代码,我们用for循环替换它。 记住,我们有一些与下面完全相同的东西,但它只有四行代码说`self.l_in(input)` ,我们用for循环替换它,因为这很好重构。 不改变任何数学,任何想法或任何输出的重构是RNN。 它将代码中的一堆独立行转换为Python for循环。 + +![](../img/1_ewV_N6jZBjStFNpSCMndxg.png) + +我们可以获取输出,使其不在循环之外并将其放入循环[ [20:25](https://youtu.be/tY0n9OT5_nA%3Ft%3D20m25s) ]。 如果我们这样做,我们现在将为每个输入生成一个单独的输出。 上面的代码,隐藏状态每次都被替换,我们最终只是吐出最后的隐藏状态。 但是,如果相反,我们有一些东西说`hs.append(h)`并在最后返回`hs` ,这将是下图。 + +![](../img/1_CX45skUFZZO6uHsR8IndzA.png) + +要记住的主要事情是,当我们说隐藏状态时,我们指的是一个向量 - 技术上是小批量中每个东西的向量,所以它是一个矩阵,但通常当杰里米谈到这些东西时,他忽略了小批量只需一件物品即可。 + 
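下面是上面描述的"用 for 循环重构的 RNN"的一个最小示意实现（仅作示意，不是 fastai 的代码）：每个时间步复用同样的两个权重矩阵，并把每一步的隐藏状态都追加到一个列表里。

```
 import torch
 import torch.nn as nn

 class MinimalRNN(nn.Module):
     """A from-scratch RNN: the same two weight matrices are reused at every time step,
     and the hidden state of every step is collected in a list (illustration only)."""
     def __init__(self, n_in, n_hid):
         super().__init__()
         self.l_in = nn.Linear(n_in, n_hid)    # input -> hidden, shared across time steps
         self.l_hid = nn.Linear(n_hid, n_hid)  # hidden -> hidden, shared across time steps

     def forward(self, xs):                    # xs: (seq_len, batch, n_in)
         h = torch.zeros(xs.size(1), self.l_hid.in_features)
         outs = []
         for x in xs:                          # the "for loop" refactoring described above
             h = torch.relu(self.l_in(x) + self.l_hid(h))
             outs.append(h)                    # keep one output per input step
         return torch.stack(outs), h
```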
+![](../img/1_ch4De-RThVp-fthGpqsaWw.png) + +我们还了解到你可以将这些层叠在一起[ [21:41](https://youtu.be/tY0n9OT5_nA%3Ft%3D21m41s) ]。 因此,不是左边的RNN(在上图中)吐出输出,它们只能将输入吐出到第二个RNN中。 如果你正在考虑这一点“我想我理解这一点,但我不太确定”这意味着你不理解这一点。 你知道自己真正了解它的唯一方法就是在PyTorch或Numpy中从头开始编写。 如果你不能这样做,那么你知道你不理解它,你可以回去重新观看第6课,看看笔记本并复制一些想法,直到你可以。 从头开始编写代码非常重要 - 它不仅仅是代码屏幕。 因此,您需要确保可以创建2层RNN。 下面是展开它的样子。 + +![](../img/1_2GKBK9P_zpUieQF6JFyChw.png) + +为了得到我们有(x,y)对句子的点,我们将从下载数据集[ [22:39](https://youtu.be/tY0n9OT5_nA%3Ft%3D22m39s) ]开始。 培训翻译模型需要很长时间。 谷歌的翻译模型有八层RNN叠加在一起。 八层和两层之间没有概念上的区别。 如果您是Google,并且您拥有的GPU或TPU比您知道的更多,那么您可以这样做。 在其他情况下,在我们的情况下,很可能我们正在构建的序列模型的序列类型不需要那么级别的计算。 所以为了简单[起见](https://youtu.be/tY0n9OT5_nA%3Ft%3D23m22s) [ [23:22](https://youtu.be/tY0n9OT5_nA%3Ft%3D23m22s) ],让我们做一个简单的事情,而不是学习如何将法语翻译成英语用于任何句子,让我们学习将法语问题翻译成英语问题 - 特别是从什么/哪里开始的问题/哪个/时。 所以这是一个正则表达式,它寻找以“wh”开头并以问号结尾的内容。 + +``` + re_eq = re.compile('^(Wh[^?.!]+\?)') re_fq = re.compile('^([^?.!]+\?)') +``` + +``` + lines = ((re_eq.search(eq), re_fq.search(fq)) **for** eq, fq **in** zip(open(en_fname, encoding='utf-8'), open(fr_fname, encoding='utf-8'))) +``` + +``` + qs = [(e.group(), f.group()) **for** e,f **in** lines **if** e **and** f] +``` + +我们通过语料库[ [23:43](https://youtu.be/tY0n9OT5_nA%3Ft%3D23m43s) ],打开两个文件中的每一个,每行是一个平行文本,将它们压缩在一起,抓住英语问题和法语问题,并检查它们是否与正则表达式匹配。 + +``` + pickle.dump(qs, (PATH/'fr-en-qs.pkl').open('wb')) qs = pickle.load((PATH/'fr-en-qs.pkl').open('rb')) +``` + +把它作为一个泡菜倾倒,所以我们不必再这样做,所以现在我们有52,000个句子对,这里有一些例子: + +``` + qs[:5], len(qs) +``` + +``` + _([('What is light ?', 'Qu'est-ce que la lumière?'),_ _('Who are we?', 'Où sommes-nous?'),_ _('Where did we come from?', "D'où venons-nous?"),_ _('What would we do without it?', 'Que ferions-nous sans elle ?'),_ _('What is the absolute location (latitude and longitude) of Badger, Newfoundland and Labrador?',_ _'Quelle sont les coordonnées (latitude et longitude) de Badger, à Terre-Neuve-etLabrador?')],_ _52331)_ +``` + +关于这一点的一个[好处](https://youtu.be/tY0n9OT5_nA%3Ft%3D24m8s)是什么/谁/哪里类型问题往往相当短[ [24:08](https://youtu.be/tY0n9OT5_nA%3Ft%3D24m8s) ]。 但我们可以从头开始学习,而不是先前对语言的概念有所了解的想法,更不用说英语或法语了,我们可以创建一些可以将任意一个问题翻译成另一个只有50k句子的话,这听起来像一个非常难以理解的事情要求这样做。 如果我们能够取得任何进展,那将是令人印象深刻的。 这是一项非常少的数据,可以进行非常复杂的练习。 + +`qs`包含法语和英语的元组[ [24:48](https://youtu.be/tY0n9OT5_nA%3Ft%3D24m48s) ]。 您可以使用这个方便的习惯用法将它们分成英语问题列表和法语问题列表。 + +``` + en_qs,fr_qs = zip(*qs) +``` + +然后我们将英语问题标记出来并将法语问题标记化。 所以请记住,只是意味着将它们分成单独的单词或类似单词的东西。 默认情况下[ [25:11](https://youtu.be/tY0n9OT5_nA%3Ft%3D25m11s) ],我们在这里使用的标记器(记住这是一个围绕spaCy标记器的包装器,这是一个很棒的标记器)假设是英文。 所以要求法语,你只需添加一个额外的参数`'fr'` 。 第一次执行此操作时,您将收到一条错误消息,指出您没有安装spaCy French模型,因此您可以运行`python -m spacy download fr`来获取法语模型。 + +``` + en_tok = Tokenizer.proc_all_mp(partition_by_cores(en_qs)) +``` + +``` + fr_tok = Tokenizer.proc_all_mp(partition_by_cores(fr_qs), 'fr') +``` + +你不可能在这里遇到RAM问题,因为这不是特别大的语料库,但有些学生在本周试图训练一种新的语言模型并且有RAM问题。 如果你这样做,那么值得知道这些函数( `proc_all_mp` )实际上在做什么。 `proc_all_mp`正在处理多个进程中的每个句子[ [25:59](https://youtu.be/tY0n9OT5_nA%3Ft%3D25m59s) ]: + +![](../img/1_3dijGYRXl1Vf9MFLD5AUOA.png) + +上面的函数找出你有多少CPU,除以2(因为通常使用超线程它们实际上并不都是并行工作),然后并行运行这个`proc_all`函数。 因此,这将为您拥有的每个CPU吐出一个完整的Python进程。 如果你有很多内核,那就是很多Python进程 - 每个人都会加载所有这些数据,这可能会耗尽你所有的内存。 所以你可以用`proc_all`替换它而不是`proc_all_mp`来使用更少的RAM。 或者你可以使用更少的核心。 目前,我们正在调用`partition_by_cores` ,它调用列表中的`partition` ,并根据您拥有的CPU数量要求将其拆分为多个相等长度的内容。 因此,您可以将其替换为拆分为较小的列表,并在较少的事情上运行它。 + +![](../img/1_9_D6dkXM4mR8fPf0E2eLcg.png) + +将英语和法语标记化后,您可以看到它如何分裂[ [28:04](https://youtu.be/tY0n9OT5_nA%3Ft%3D28m4s) ]: + +``` + en_tok[0], fr_tok[0] +``` + +``` + (['what', 'is', 'light', '?'], ['qu'', 'est', '-ce', 
'que', 'la', 'lumière', '?']) +``` + +你可以看到法语的标记化看起来完全不同,因为法国人喜欢他们的撇号和连字符。 因此,如果您尝试使用英语标记符来表示法语句子,那么您将获得非常糟糕的结果。 你不需要知道大量的NLP思想来使用NLP的深度学习,但只是为你的语言使用正确的标记化器这些基本的东西很重要[ [28:23](https://youtu.be/tY0n9OT5_nA%3Ft%3D28m23s) ]。 本周我们学习小组的一些学生一直在尝试为中文实例建立语言模型,当然这些模型并没有真正具有标记器的概念,所以我们一直在开始研究[句子](https://github.com/google/sentencepiece) 。将事物分成任意子字单元,所以当Jeremy说令牌化时,如果你使用的语言没有空格,你可能应该检查句子或其他类似的子词单位。 希望在接下来的一两周内,我们将能够用中文报​​告这些实验的早期结果。 + +``` + np.percentile([len(o) **for** o **in** en_tok], 90), np.percentile([len(o) **for** o **in** fr_tok], 90) +``` + +``` + _(23.0, 28.0)_ +``` + +``` + keep = np.array([len(o)<30 **for** o **in** en_tok]) +``` + +``` + en_tok = np.array(en_tok)[keep] fr_tok = np.array(fr_tok)[keep] +``` + +``` + pickle.dump(en_tok, (PATH/'en_tok.pkl').open('wb')) pickle.dump(fr_tok, (PATH/'fr_tok.pkl').open('wb')) +``` + +``` + en_tok = pickle.load((PATH/'en_tok.pkl').open('rb')) fr_tok = pickle.load((PATH/'fr_tok.pkl').open('rb')) +``` + +因此将其标记化[ [29:25](https://youtu.be/tY0n9OT5_nA%3Ft%3D29m25s) ],我们将其保存到磁盘。 然后记住,我们创建令牌后的下一步是将它们变成数字。 要做到这一点,我们有两个步骤 - 第一步是获取所有出现的单词的列表,然后我们将每个单词转换为索引。 如果出现超过40,000个单词,那么让我们将其剪掉,这样就不会太疯狂了。 我们为流( `_bos_` ),填充( `_pad_` ),流尾( `_eos_` )和未知( `_unk` )的开头插入了一些额外的令牌。 因此,如果我们试图查找不在40,000最常见的东西,那么我们使用`deraultdict`返回3,这是未知的。 + +``` + **def** toks2ids(tok,pre): freq = Counter(p **for** o **in** tok **for** p **in** o) itos = [o **for** o,c **in** freq.most_common(40000)] itos.insert(0, '_bos_') itos.insert(1, '_pad_') itos.insert(2, '_eos_') itos.insert(3, '_unk') stoi = collections.defaultdict( **lambda** : 3, {v:k **for** k,v **in** enumerate(itos)}) ids = np.array([([stoi[o] **for** o **in** p] + [2]) **for** p **in** tok]) np.save(TMP_PATH/f' **{pre}** _ids.npy', ids) pickle.dump(itos, open(TMP_PATH/f' **{pre}** _itos.pkl', 'wb')) **return** ids,itos,stoi +``` + +现在我们可以继续将每个标记转换为ID,方法是将它通过字符串`stoi`我们刚刚创建的整数字典( `stoi` )中,然后在结尾处添加数字2,这是流的结尾。 你在这里看到的代码是Jeremy在迭代和试验时写的代码[ [30:25](https://youtu.be/tY0n9OT5_nA%3Ft%3D30m25s) ]。 因为他在迭代和实验时编写的代码中有99%证明是完全错误或愚蠢或令人尴尬,而你却无法看到它。 但是,当他写这篇文章时,没有点重构并使它变得美丽,所以他希望你能看到他所拥有的所有小捷径。 而不是为`_eos_`标记使用一些常量并使用它,当他进行原型设计时,他只是做了简单的事情。 并非如此,他最终会破坏代码,但他试图在美丽的代码和有效的代码之间找到一些中间立场。 + +**问** :刚听到他提到我们将CPU的数量除以2,因为使用超线程,我们无法使用所有超线程内核加速。 这是基于实际经验还是有一些潜在的原因导致我们不能获得额外的加速[ [31:18](https://youtu.be/tY0n9OT5_nA%3Ft%3D31m18s) ]? 是的,这只是实际经验而且并非所有事情都像这样,但我确实注意到了令牌化 - 超线程似乎让事情变得缓慢。 此外,如果我使用所有内核,通常我想同时做其他事情(比如运行一些交互式笔记本),我没有任何空余的空间来做这件事。 + +现在我们的英语和法语,我们可以获取ID列表`en_ids` [ [32:01](https://youtu.be/tY0n9OT5_nA%3Ft%3D32m1s) ]。 当我们这样做时,当然,我们需要确保我们也存储词汇。 如果我们不知道数字5代表什么是没有任何意义,那就没有数字5.这就是我们的词汇表`en_itos`和反向映射`en_stoi` ,我们可以用来转换未来更多的语料库。 + +``` + en_ids,en_itos,en_stoi = toks2ids(en_tok,'en') fr_ids,fr_itos,fr_stoi = toks2ids(fr_tok,'fr') +``` + +只是为了确认它是否正常工作,我们可以遍历每个ID,将int转换为字符串,然后将其吐出来 - 现在我们已经将句子返回到末尾的流结束标记。 我们的英语词汇是17,000,而我们的法语词汇是25,000,所以这不是太大,也不是我们正在处理的太复杂的词汇。 + +``` + **def** load_ids(pre): ids = np.load(TMP_PATH/f' **{pre}** _ids.npy') itos = pickle.load(open(TMP_PATH/f' **{pre}** _itos.pkl', 'rb')) stoi = collections.defaultdict( **lambda** : 3, {v:k **for** k,v **in** enumerate(itos)}) **return** ids,itos,stoi +``` + +``` + en_ids,en_itos,en_stoi = load_ids('en') fr_ids,fr_itos,fr_stoi = load_ids('fr') +``` + +``` + [fr_itos[o] **for** o **in** fr_ids[0]], len(en_itos), len(fr_itos) +``` + +``` + _(['qu'', 'est', '-ce', 'que', 'la', 'lumière', '?', '_eos_'], 17573, 24793)_ +``` + +#### 单词向量[ [32:53](https://youtu.be/tY0n9OT5_nA%3Ft%3D32m53s) ] + +本周我们在论坛上花了很多时间讨论无意义的单词向量是如何以及如何停止对它们如此兴奋 - 现在我们将使用它们。 为什么? 
我们一直在学习使用语言模型和预训练的适当模型而不是预先训练的线性单层(这是单词向量)的所有内容同样适用于序列到序列。 但杰里米和塞巴斯蒂安开始关注这一点。 对于任何有兴趣创造一些真正新的高度可发表结果的人来说,有一个完整的事情,用预先训练的语言模型排序的整个序列区域还没有被触及。 杰里米认为它会和分类一样好。 如果你正在努力解决这个问题,那么你就会发现一些令人兴奋的东西并希望帮助发布它,Jeremy非常乐意帮助共同撰写论文。 因此,当您有一些有趣的结果时,请随时与我们联系。 + +在这个阶段,我们没有任何这个,所以我们将使用很少的fastai [ [34:14](https://youtu.be/tY0n9OT5_nA%3Ft%3D34m14s) ]。 我们所拥有的只是单词向量 - 所以让我们至少使用体面的单词向量。 Word2vec是非常古老的单词词向量。 现在有更好的单词向量和fast.text是一个非常好的单词向量源。 有数百种语言可供他们使用,您的语言可能会被表示出来。 + +fasttext字向量可从[https://fasttext.cc/docs/en/english-vectors.html获得](https://fasttext.cc/docs/en/english-vectors.html) + +pytext中没有fasttext Python库,但这是一个方便的技巧[ [35:03](https://youtu.be/tY0n9OT5_nA%3Ft%3D35m3s) ]。 如果有一个GitHub存储库中有一个setup.py和reqirements.txt,你可以在开始时查看`git+`然后将其粘贴到你的`pip install`并且它可以工作。 几乎没有人知道这一点,如果你去快照回购,他们不会告诉你这个 - 他们会说你必须下载它并插入它并等等但你没有。 你可以运行这个: + +``` + _# !_ _pip install git+https://github.com/facebookresearch/fastText.git_ +``` + +``` + **import** **fastText** **as** **ft** +``` + +要使用fastText库,您需要为您的语言下载[fasttext字向量](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md) (下载'bin plus text'文件)。 + +``` + en_vecs = ft.load_model(str((PATH/'wiki.en.bin'))) +``` + +``` + fr_vecs = ft.load_model(str((PATH/'wiki.fr.bin'))) +``` + +以上是我们的英语和法语模型。 有文本版本和二进制版本。 二进制版本更快,所以我们将使用它。 文本版本也有点儿麻烦。 我们将把它转换成标准的Python字典,以便更容易使用[ [35:55](https://youtu.be/tY0n9OT5_nA%3Ft%3D35m55s) ]。 这只是通过字典理解来完成每个单词并将其保存为pickle字典: + +``` + **def** get_vecs(lang, ft_vecs): vecd = {w:ft_vecs.get_word_vector(w) **for** w **in** ft_vecs.get_words()} pickle.dump(vecd, open(PATH/f'wiki. **{lang}** .pkl','wb')) **return** vecd +``` + +``` + en_vecd = get_vecs('en', en_vecs) fr_vecd = get_vecs('fr', fr_vecs) +``` + +``` + en_vecd = pickle.load(open(PATH/'wiki.en.pkl','rb')) fr_vecd = pickle.load(open(PATH/'wiki.fr.pkl','rb')) +``` + +``` + ft_words = ft_vecs.get_words(include_freq= **True** ) ft_word_dict = {k:v **for** k,v **in** zip(*ft_words)} ft_words = sorted(ft_word_dict.keys(), key= **lambda** x: ft_word_dict[x]) +``` + +现在我们有了泡菜字典,我们可以继续查找一个单词,例如逗号[ [36:07](https://youtu.be/tY0n9OT5_nA%3Ft%3D36m7s) ]。 这将返回一个向量。 向量的长度是这组单词向量的维数。 在这种情况下,我们有300维英语和法语单词向量。 + +``` + dim_en_vec = len(en_vecd[',']) dim_fr_vec = len(fr_vecd[',']) dim_en_vec,dim_fr_vec +``` + +``` + _(300, 300)_ +``` + +由于您将在稍后看到的原因,我们还想知道我们的向量的平均值和标准偏差是什么。 因此平均值约为零,标准差约为0.3。 + +``` + en_vecs = np.stack(list(en_vecd.values())) en_vecs.mean(),en_vecs.std() +``` + +``` + _(0.0075652334, 0.29283327)_ +``` + +#### 模型数据[ [36:48](https://youtu.be/tY0n9OT5_nA%3Ft%3D36m48s) ] + +通常语料库具有相当长的序列长度分布,并且它是最长的序列,往往会压倒事情需要多长时间,使用多少内存等等。因此在这种情况下,我们将获得英语的第99百分位数到第97百分位数和法语并将它们截断到那个数量。 最初Jeremy使用了90个百分位数(因此变量名称): + +``` + enlen_90 = int(np.percentile([len(o) **for** o **in** en_ids], 99)) frlen_90 = int(np.percentile([len(o) **for** o **in** fr_ids], 97)) enlen_90,frlen_90 +``` + +``` + _(29, 33)_ +``` + +我们[快到](https://youtu.be/tY0n9OT5_nA%3Ft%3D37m24s)了[ [37:24](https://youtu.be/tY0n9OT5_nA%3Ft%3D37m24s) ]。 我们已经获得了我们的标记化,数字化的英语和法语数据集。 我们有一些单词向量。 所以现在我们需要为PyTorch做好准备。 PyTorch需要一个`Dataset`对象,希望现在可以说数据集对象需要两个东西 - 长度( `__len__` )和索引器( `__getitem__` )。 Jeremy开始编写`Seq2SeqDataset` ,结果证明它只是一个通用的`Dataset` [ [37:52](https://youtu.be/tY0n9OT5_nA%3Ft%3D37m52s) ]。 + +``` + en_ids_tr = np.array([o[:enlen_90] **for** o **in** en_ids]) fr_ids_tr = np.array([o[:frlen_90] **for** o **in** fr_ids]) +``` + +``` + **class** **Seq2SeqDataset** (Dataset): **def** __init__(self, x, y): self.x,self.y = x,y **def** __getitem__(self, idx): **return** A(self.x[idx], self.y[idx]) **def** __len__(self): **return** len(self.x) +``` 
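可以用几个随手编的 id 序列快速检查一下上面定义的 `Seq2SeqDataset`（仅作示意，不是真实语料；其中用到的 `A` 等辅助函数在下面解释）：

```
 fr_toy = [[5, 9, 2], [7, 3, 11, 2]]   # two toy "French" id sequences, 2 = _eos_
 en_toy = [[4, 8, 2], [6, 10, 2]]      # the corresponding toy "English" id sequences

 toy_ds = Seq2SeqDataset(fr_toy, en_toy)
 print(len(toy_ds))   # 2
 x, y = toy_ds[0]     # a pair of numpy arrays: (French ids, English ids)
```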
* `A`: arrays. It goes through each thing you pass it, converts it into a numpy array if it is not one already, and returns a tuple of everything you passed it, now guaranteed to be numpy arrays [ [38:32](https://youtu.be/tY0n9OT5_nA%3Ft%3D38m32s) ].
* `V`: Variables
* `T`: Tensors

#### Training and validation sets [ [39:03](https://youtu.be/tY0n9OT5_nA%3Ft%3D39m3s) ]

Now we need to take our English and French ids and get a training set and a validation set out of them. One of the really disappointing things about a lot of code on the internet is that it does not follow some simple best practices. For example, the PyTorch website has an example section with a sequence-to-sequence translation example, and that example has no separate validation set. Jeremy tried training with their settings and tested it with a validation set, and it turned out that it overfit massively. So this is not just a theoretical problem: the actual official PyTorch repo has the actual official sequence-to-sequence translation example, and it does not check for overfitting, and it overfits [ [39:41](https://youtu.be/tY0n9OT5_nA%3Ft%3D39m41s) ]. It also fails to use mini-batches, so it cannot take advantage of any of PyTorch's efficiency. Even if you find code in the official PyTorch repo, do not assume it is any good. The other thing you will notice is that pretty much every other sequence-to-sequence model Jeremy has found in PyTorch anywhere on the internet has clearly copied that awful PyTorch example, because they all have the same variable names, the same problems, and the same mistakes.

Another example: nearly every PyTorch convolutional neural network Jeremy has found does not use an adaptive pooling layer [ [40:27](https://youtu.be/tY0n9OT5_nA%3Ft%3D40m27s) ]. In other words, the final layer is always an average pool of (7, 7). They assume the previous layer is 7 by 7, and if you use an input of any other size you get an exception. As a result, nearly everyone Jeremy has spoken to who uses PyTorch thinks that a fundamental limitation of CNNs is that they are tied to the input size, which has not been true since VGG. So every time Jeremy grabs a new model and sticks it in the fastai repo, he has to search for "pool", add "adaptive" to the start, and replace the 7 with a 1, and then it works on any size of input. So be careful. It is still early days and, believe it or not, even though most of you only started your deep learning journey in the last year or so, you know quite a lot more about many of the important practical aspects than the vast majority of people publishing and writing things in official repos. So you need more self-confidence than you might expect when reading other people's code. If you find yourself thinking "that looks odd", it is not necessarily you.

If the repo you are looking at does not have a section saying "here are the tests we did, and we got the same results as the paper we are implementing", that almost certainly means they did not get the same results as the paper they are implementing, and probably did not even check [ [42:13](https://youtu.be/tY0n9OT5_nA%3Ft%3D42m13s) ]. If you run it, it definitely will not get those results, because it is hard to get things right the first time; it takes Jeremy 12 attempts. If they have not tested it once, it almost certainly will not work.

Here is an easy way to get a training set and a validation set [ [42:45](https://youtu.be/tY0n9OT5_nA%3Ft%3D42m45s) ]. Grab a bunch of random numbers, one for each row of your data, and see if they are greater than 0.1. That gives you a list of booleans. Index into your array with that boolean list to get the training set, and index with the opposite of that boolean list to get the validation set.

```
np.random.seed(42)
trn_keep = np.random.rand(len(en_ids_tr))>0.1
en_trn,fr_trn = en_ids_tr[trn_keep],fr_ids_tr[trn_keep]
en_val,fr_val = en_ids_tr[~trn_keep],fr_ids_tr[~trn_keep]
len(en_trn),len(en_val)
```

```
 _(45219, 5041)_
```

Now we can create our dataset with our X's and Y's (ie French and English) [ [43:12](https://youtu.be/tY0n9OT5_nA%3Ft%3D43m12s) ]. If you want to translate English to French instead, switch these two around and you're done.

```
trn_ds = Seq2SeqDataset(fr_trn,en_trn)
val_ds = Seq2SeqDataset(fr_val,en_val)
```

Now we need to create DataLoaders [ [43:22](https://youtu.be/tY0n9OT5_nA%3Ft%3D43m22s) ]. We can just grab our data loader and pass in our dataset and batch size. We actually have to transpose the arrays — we won't go into the details about why, but we can talk about it during the week if you're interested; have a think about why we might need to transpose their orientation. Since we've already done all the pre-processing, there is no point spawning off multiple workers to do augmentation, etc. because there is no work to do. So setting `num_workers=1` will save you some time. We have to tell it what our padding index is — that is pretty important because what's going to happen is that we've got different length sentences and fastai will automatically stick them together and pad the shorter ones so that they are all equal length. Remember a tensor has to be rectangular.
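To make the padding and the transposition concrete, here is a tiny hand-rolled illustration of just the shape logic (this is not what the fastai `DataLoader` does internally):

```
import numpy as np

pad_idx = 1
seqs = [[4, 9, 2], [7, 5], [3, 8, 6, 2]]            # three numericalized sentences
maxlen = max(len(s) for s in seqs)

# pad the shorter ones at the end so the batch becomes rectangular
batch = np.array([s + [pad_idx]*(maxlen-len(s)) for s in seqs])   # (batch, seq_len)

# PyTorch RNN modules expect (seq_len, batch) by default, hence the transpose
batch_t = batch.T
print(batch.shape, batch_t.shape)                   # (3, 4) (4, 3)
```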
+ +``` + bs=125 +``` + +``` + trn_samp = SortishSampler(en_trn, key= lambda x: len(en_trn[x]), bs=bs) val_samp = SortSampler(en_val, key= lambda x: len(en_val[x])) +``` + +``` + trn_dl = DataLoader(trn_ds, bs, transpose= True , transpose_y= True , num_workers=1, pad_idx=1, pre_pad= False , sampler=trn_samp) val_dl = DataLoader(val_ds, int(bs*1.6), transpose= True , transpose_y= True , num_workers=1, pad_idx=1, pre_pad= False , sampler=val_samp) md = ModelData(PATH, trn_dl, val_dl) +``` + +In the decoder in particular, we want our padding to be at the end, not at the start [ [44:29](https://youtu.be/tY0n9OT5_nA%3Ft%3D44m29s) ]: + +* Classifier → padding in the beginning. Because we want that final token to represent the last word of the movie review. +* Decoder → padding at the end. As you will see, it actually is going to work out a bit better to have the padding at the end. + +**Sampler [** [**44:54**](https://youtu.be/tY0n9OT5_nA%3Ft%3D44m54s) **]** Finally, since we've got sentences of different lengths coming in and they all have to be put together in a mini-batch to be the same size by padding, we would much prefer that the sentences in a mini-batch are of similar sizes already. Otherwise it is going to be as long as the longest sentence and that is going to end up wasting time and memory. Therefore, we are going to use the sampler tricks that we learnt last time which is the validation set, we are going to ask it to sort everything by length first. Then for the training set, we are going to randomize the order of things but to roughly make it so that things of similar length are about in the same spot. + +**Model Data [** [**45:40**](https://youtu.be/tY0n9OT5_nA%3Ft%3D45m40s) **]** At this point, we can create a model data object — remember a model data object really does one thing which is it says “I have a training set and a validation set, and an optional test set” and sticks them into a single object. We also has a path so that it has somewhere to store temporary files, models, stuff like that. + +We are not using fastai for very much at all in this example. We used PyTorch compatible Dataset and and DataLoader — behind the scene it is actually using the fastai version because we need it to do the automatic padding for convenience, so there is a few tweaks in fastai version that are a bit faster and a bit more convenient. We are also using fastai's Samplers, but there is not too much going on here. + +#### Architecture [ [46:59](https://youtu.be/tY0n9OT5_nA%3Ft%3D46m59s) ] + +![](../img/1_IMBl2Aiclyt6PCrg1IQg5A.png) + +* The architecture is going to take our sequence of tokens. +* It is going to spit them into an encoder (aka backbone). +* That is going to spit out the final hidden state which for each sentence, it's just a single vector. + +None of this is going to be new [ [47:41](https://youtu.be/tY0n9OT5_nA%3Ft%3D47m41s) ]. That is all going to be using very direct simple techniques that we've already learned. + +* Then we are going to take that, and we will spit it into a different RNN which is a decoder. That's going to have some new stuff because we need something that can go through one word at a time. And it keeps going until it thinks it's finished the sentence. It doesn't know how long the sentence is going to be ahead of time. It keeps going until it thinks it's finished the sentence and then it stops and returns a sentence. 
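As a rough sketch of that control flow before looking at the real code (the "encoder" and "decoder" here are stand-in functions, not networks; the actual classes follow below):

```
def toy_encoder(tokens):
    return sum(tokens)                        # pretend single-vector summary of the sentence

def toy_decoder(prev_word, state):
    return (prev_word + state) % 5, state     # pretend next-word prediction

def translate_sketch(tokens, bos=0, eos=1, max_len=10):
    state = toy_encoder(tokens)               # encode the whole input once
    out, prev = [], bos
    for _ in range(max_len):                  # decode one word at a time
        prev, state = toy_decoder(prev, state)
        if prev == eos: break                 # stop when the model thinks the sentence is done
        out.append(prev)                      # the prediction is fed back in as the next input
    return out

print(translate_sketch([2, 3, 4]))            # [4, 3, 2]
```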
+ +``` + def create_emb(vecs, itos, em_sz): emb = nn.Embedding(len(itos), em_sz, padding_idx=1) wgts = emb.weight.data miss = [] for i,w in enumerate(itos): try : wgts[i] = torch.from_numpy(vecs[w]*3) except : miss.append(w) print(len(miss),miss[5:10]) return emb +``` + +``` + nh,nl = 256,2 +``` + +Let's start with the encoder [ [48:15](https://youtu.be/tY0n9OT5_nA%3Ft%3D48m15s) ]. In terms of the variable naming here, there is identical attributes for encoder and decoder. The encoder version has `enc` the decoder version has `dec` . + +* `emb_enc` : Embeddings for the encoder +* `gru` : RNN. GRU and LSTM are nearly the same thing. + +We need to create an embedding layer because remember — what we are being passed is the index of the words into a vocabulary. And we want to grab their fast.text embedding. Then over time, we might want to also fine tune to train that embedding end-to-end. + +`create_emb` [ [49:37](https://youtu.be/tY0n9OT5_nA%3Ft%3D49m37s) ]: It is important that you know now how to set the rows and columns for your embedding so the number of rows has to be equal to your vocabulary size — so each vocabulary has a word vector. The size of the embedding is determined by fast.text and fast.text embeddings are size 300\. So we have to use size 300 as well otherwise we can't start out by using their embeddings. + +`nn.Embedding` will initially going to give us a random set of embeddings [ [50:12](https://youtu.be/tY0n9OT5_nA%3Ft%3D50m12s) ]. So we will go through each one of these and if we find it in fast.text, we will replace it with the fast.text embedding. Again, something you should already know is that ( `emb.weight.data` ): + +* A PyTorch module that is learnable has `weight` attribute +* `weight` attribute is a `Variable` that has `data` attribute +* The `data` attribute is a tensor + +Now that we've got our weight tensor, we can just go through our vocabulary and we can look up the word in our pre-trained vectors and if we find it, we will replace the random weights with that pre-trained vector [ [52:35](https://youtu.be/tY0n9OT5_nA%3Ft%3D52m35s) ]. The random weights have a standard deviation of 1\. Our pre-trained vectors has a standard deviation of about 0.3\. So again, this is the kind of hacky thing Jeremy does when he is prototyping stuff, he just multiplied it by 3\. By the time you see the video of this, we may able to put all this sequence to sequence stuff into the fastai library, you won't find horrible hacks like that in there (sure hope). But hack away when you are prototyping. Some things won't be in fast.text in which case, we'll just keep track of it [ [53:22](https://youtu.be/tY0n9OT5_nA%3Ft%3D53m22s) ]. The print statement is there so that we can see what's going on (ie why are we missing stuff?). Remember we had about 30,000 so we are not missing too many. + +``` + 3097 ['l'', "d'", 't_up', 'd'', "qu'"] + 1285 ["'s", ''s', "n't", 'n't', ':'] +``` + +Jeremy has started doing some stuff around incorporating large vocabulary handling into fastai — it's not finished yet but hopefully by the time we get here, this kind of stuff will be possible [ [56:50](https://youtu.be/tY0n9OT5_nA%3Ft%3D56m50s) ]. 
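As a quick sanity check of that ×3 scaling (stand-in numbers only, not the actual fastText matrix): `nn.Embedding` initialises its weights from N(0, 1), so the random rows have a standard deviation of about 1, while the fastText vectors have a standard deviation of about 0.3.

```
import numpy as np
import torch.nn as nn

emb = nn.Embedding(1000, 300)
print(emb.weight.data.std())                 # ~1.0 (default random init)

fake_fasttext = np.random.normal(0, 0.3, (1000, 300)).astype(np.float32)
print(fake_fasttext.std())                   # ~0.3, like the real vectors
print((fake_fasttext * 3).std())             # ~0.9, roughly matching the random rows
```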
+ +``` + class Seq2SeqRNN (nn.Module): def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2): super().__init__() self.nl,self.nh,self.out_sl = nl,nh,out_sl self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc) self.emb_enc_drop = nn.Dropout(0.15) self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25) self.out_enc = nn.Linear(nh, em_sz_dec, bias= False ) self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec) self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1) self.out_drop = nn.Dropout(0.35) self.out = nn.Linear(em_sz_dec, len(itos_dec)) self.out.weight.data = self.emb_dec.weight.data def forward(self, inp): sl,bs = inp.size() h = self.initHidden(bs) emb = self.emb_enc_drop(self.emb_enc(inp)) enc_out, h = self.gru_enc(emb, h) h = self.out_enc(h) dec_inp = V(torch.zeros(bs).long()) res = [] for i in range(self.out_sl): emb = self.emb_dec(dec_inp).unsqueeze(0) outp, h = self.gru_dec(emb, h) outp = self.out(self.out_drop(outp[0])) res.append(outp) dec_inp = V(outp.data.max(1)[1]) if (dec_inp==1).all(): break return torch.stack(res) def initHidden(self, bs): return V(torch.zeros(self.nl, bs, self.nh)) +``` + +The key thing to know is that encoder takes our inputs and spits out a hidden vector that hopefully will learn to contain all of the information about what that sentence says and how it sets it [ [58:49](https://youtu.be/tY0n9OT5_nA%3Ft%3D58m49s) ]. If it can't do that, we can't feed it into a decoder and hope it to spit our our sentence in a different language. So that's what we want it to learn to do. We are not going to do anything special to make it learn to do that — we are just going to do the three things (data, architecture, loss function) and cross our fingers. + +**Decoder [** [**59:58**](https://youtu.be/tY0n9OT5_nA%3Ft%3D59m58s) **]** : How do we now do the new bit? The basic idea of the new bit is the same. We are going to do exactly the same thing, but we are going to write our own for loop. The for loop is going to do exactly what the for loop inside PyTorch does for encoder, but we are going to do it manually. How big is the for loop? It's an output sequence length ( `out_sl` ) which was something passed to the constructor which is equal to the length of the largest English sentence. Since we are translating into English, so it can't possibly be longer than that at least in this corpus. If we then used it on some different corpus that was longer, this is going to fail — you could always pass in a different parameter, of course. So the basic idea is the same [ [1:01:06](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h1m6s) ]. + +* We are going to go through and put it through the embedding. +* We are going to stick it through the RNN, dropout, and a linear layer. +* We will then append the output to a list which will be stacked into a single tensor and get returned. + +Normally, a recurrent neural network works on a whole sequence at a time, but we have a for loop to go through each part of the sequence separately [ [1:01:37](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h1m37s) ]. Wo we have to add a leading unit axis to the start ( `.unsqueeze(0)` ) to basicaly say this is a sequence of length one. We are not really taking advantage of the recurrent net much at all — we could easily re-write this with a linear layer. + +One thing to be aware of is `dec_inp` [ [1:02:34](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h2m34s) ]: What is the input to the embedding? The answer is it is the previous word that we translated. 
The basic idea is if you are trying to translate the 4th word of the new sentence but you don't know what the third word you just said was, that is going to be really hard. So we are going to feed that in at each time step. What was the previous word at the start? There was none. Specifically, we are going to start out with a beginning of stream token ( `_bos_` ) which is zero. + +`outp` [ [1:05:24](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h5m24s) ]: it is a tensor whose length is equal to the number of words in our English vocabulary and it contains the probability for every one of those words that it is that word. + +`outp.data.max` : It looks in its tensor to find out which word has the highest probability. `max` in PyTorch returns two things: the first thing is what is that max probability and the second is what is the index into the array of that max probability. So we want that second item which is the word index with the largest thing. + +`dec_inp` : It contains the word index into the vocabulary of the word. If it's one (ie padding), that means we are done — we reached the end because we finished with a bunch of padding. If it's not one, let's go back and continue. + +Each time, we appended our outputs (not the word but the probabilities) to the list [ [1:06:48](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h6m48s) ] which we stack up into a tensor and we can now go ahead and feed that to a loss function. + +#### Loss function [ [1:07:13](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h7m13s) ] + +The loss function is categorical cross entropy loss. We have a list of probabilities for each of our classes where the classes are all the words in our English vocab and we have a target which is the correct class (ie which is the correct word at this location). There are two tweaks which is why we need to write our own loss function but you can see basically it is going to be cross entropy loss. + +``` + def seq2seq_loss(input, target): sl,bs = target.size() sl_in,bs_in,nc = input.size() if sl>sl_in: input = F.pad(input, (0,0,0,0,0,sl-sl_in)) input = input[:sl] return F.cross_entropy(input.view(-1,nc), target.view(-1)) +``` + +Tweaks [ [1:07:40](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h7m40s) ]: + +1. If the generated sequence length is shorter than the sequence length of the target, we need to add some padding. PyTorch padding function requires a tuple of 6 to pad a rank 3 tensor (sequence length, batch size, by number of words in the vocab). Each pair represents padding before and after that dimension. + +2\. `F.cross_entropy` expects a rank 2 tensor, but we have sequence length by batch size, so let's just flatten out. That is what `view(-1, ...)` does. + +``` + opt_fn = partial(optim.Adam, betas=(0.8, 0.99)) +``` + +The difference between `.cuda()` and `to_gpu()` : `to_gpu` will not put to in the GPU if you do not have one. You can also set `fastai.core.USE_GPU` to `false` to force it to not use GPU that can be handy for debugging. + +``` + rnn = Seq2SeqRNN(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90) learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn) learn.crit = seq2seq_loss +``` + +``` + 3097 ['l'', "d'", 't_up', 'd'', "qu'"] + 1285 ["'s", ''s', "n't", 'n't', ':'] +``` + +We then need something that tells it how to handle learning rate groups so there is a thing called `SingleModel` that you can pass it to which treats the whole thing as a single learning rate group [ [1:09:40](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h9m40s) ]. 
So this is the easiest way to turn a PyTorch module into a fastai model. + +![](../img/1_NW1_lYHLm8R0ML_BWq0QRA.png) + +We could just call Learner to turn that into a learner, but if we call RNN_Learner, it does add in `save_encoder` and `load_encoder` that can be handy sometimes. In this case, we really could have said `Leaner` but `RNN_Learner` also works. + +``` + learn.lr_find() learn.sched.plot() +``` + +![](../img/1_Fwxxo1lXoIqdM5v24sWfIA.png) + +``` + lr=3e-3 learn.fit(lr, 1, cycle_len=12, use_clr=(20,10)) +``` + +``` + _epoch trn_loss val_loss_ + 0 5.48978 5.462648 + 1 4.616437 4.770539 + 2 4.345884 4.37726 + 3 3.857125 4.136014 + 4 3.612306 3.941867 + 5 3.375064 3.839872 + 6 3.383987 3.708972 + 7 3.224772 3.664173 + 8 3.238523 3.604765 + 9 2.962041 3.587814 + 10 2.96163 3.574888 + 11 2.866477 3.581224 +``` + +``` + [3.5812237] +``` + +``` + learn.save('initial') learn.load('initial') +``` + +#### Test [ [1:11:01](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h11m1s) ] + +Remember the model attribute of a learner is a standard PyTorch model so we can pass some `x` which we can grab out of our validation set or you could `learn.predict_array` or whatever you like to get some predictions. Then we convert those predictions into words by going `.max()[1]` to grab the index of the highest probability words to get some predictions. Then we can go through a few examples and print out the French, the correct English, and the predicted English for things that are not padding. + +``` + x,y = next(iter(val_dl)) probs = learn.model(V(x)) preds = to_np(probs.max(2)[1]) for i in range(180,190): print(' '.join([fr_itos[o] for o in x[:,i] if o != 1])) print(' '.join([en_itos[o] for o in y[:,i] if o != 1])) print(' '.join([en_itos[o] for o in preds[:,i] if o!=1])) print() +``` + +``` + quels facteurs pourraient influer sur le choix de leur emplacement ? _eos_ + what factors influencetheir location ? _eos_ + what factors might might influence on the their ? _?_ _eos_ + + qu' est -ce qui ne peut pas changer ? _eos_ + what can not change ? _eos_ + what not change change ? _eos_ + + que faites - vous ? _eos_ + what do you do ? _eos_ + what do you do ? _eos_ + + qui réglemente les pylônes d' antennes ? _eos_ + who regulates antenna towers ? _eos_ + who regulates the doors doors ? _eos_ + + où sont - ils situés ? _eos_ + where are they located ? _eos_ + where are the located ? _eos_ + + quelles sont leurs compétences ? _eos_ + what are their qualifications ? _eos_ + what are their skills ? _eos_ + + qui est victime de harcèlement sexuel ? _eos_ + who experiences sexual harassment ? _eos_ + who is victim sexual sexual ? _?_ _eos_ + + quelles sont les personnes qui visitent les communautés autochtones ? _eos_ + who visits indigenous communities ? _eos_ + who are people people aboriginal aboriginal ? _eos_ + + pourquoi ces trois points en particulier ? _eos_ + why these specific three ? _eos_ + why are these two different ? _?_ _eos_ + + pourquoi ou pourquoi pas ? _eos_ + why or why not ? _eos_ + why or why not _eos_ +``` + +Amazingly enough, this kind of simplest possible written largely from scratch PyTorch module on only fifty thousand sentences is sometimes capable, on validation set, of giving you exactly the right answer. Sometimes the right answer is in slightly different wording, and sometimes sentences that really aren't grammatically sensible or even have too many question marks. So we are well on the right track. 
We think you would agree even the simplest possible seq-to-seq trained for a very small number of epochs without any pre-training other than the use of word embeddings is surprisingly good. We are going to improve this later but the message here is even sequence to sequence models you think is simpler than they could possibly work even with less data than you think you could learn from can be surprisingly effective and in certain situations this may be enough for your needs. + +**Question** : Would it help to normalize punctuation (eg `'` vs. `'` )? [ [1:13:10](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h13m10s) ] The answer to this particular case is probably yes — the difference between curly quotes and straight quotes is really semantic. You do have to be very careful though because it may turn out that people using beautiful curly quotes like using more formal language and they are writing in a different way. So if you are going to do some kind of pre-processing like punctuation normalization, you should definitely check your results with and without because nearly always that kind of pre-processing make things worse even when you're sure it won't. + +**Question** : What might be some ways of regularizing these seq2seq models besides dropout and weight decay? [ [1:14:17](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h14m17s) ] Let me think about that during the week. AWD-LSTM which we have been relying a lot has dropouts of many different kinds and there is also a kind of a regularization based on activations and on changes. Jeremy has not seen anybody put anything like that amount of work into regularizing sequence to sequence model and there is a huge opportunity for somebody to do like the AWD-LSTM of seq-to-seq which might be as simple as stealing all the ideas from AWD-LSTM and using them directly in seq-to-seq that would be pretty easy to try. There's been an interesting paper that Stephen Merity added in the last couple weeks where he used an idea which take all of these different AWD-LSTM hyper parameters and train a bunch of different models and then use a random forest to find out the feature importance — which ones actually matter the most and then figure out how to set them. You could totally use this approach to figure out for sequence to sequence regularization approaches which one is the best and optimize them and that would be amazing. But at the moment, we don't know if there are additional ideas to sequence to sequence regularization beyond what is in that paper for regular language model. + +### Tricks [ [1:16:28](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h16m28s) ] + +#### **Trick #1 : Go bi-directional** + +For classification, the approach to bi-directional Jeremy suggested to use is take all of your token sequences, spin them around, train a new language model, and train a new classifier. He also mentioned that wikitext pre-trained model if you replace `fwd` with `bwd` in the name, you will get the pre-trained backward model he created for you. Get a set of predictions and then average the predictions just like a normal ensemble. That is how we do bi-dir for that kind of classification. There may be ways to do it end-to-end, but Jeremy hasn't quite figured them out yet and they are not in fastai yet. So if you figure it out, that's an interesting line of research. But because we are not doing massive documents where we have to chunk it into separate bits and then pool over them, we can do bi-dir very easily in this case. 
It is literally as simple as adding `bidirectional=True` to our encoder. People tend not to do bi-directional for the decoder partly because it is kind of considered cheating but maybe it can work in some situations although it might need to be more of an ensemble approach in the decoder because it's a bit less obvious. But encoder it's very simple — `bidirectional=True` and we now have a second RNN that is going the opposite direction. The second RNN is visiting each token in the opposing order so when we get to the final hidden state, it is the first (ie left most) token . But the hidden state is the same size, so the final result is that we end up with a tensor with an extra axis of length 2\. Depending on what library you use, often that will be then combined with the number of layers, so if you have 2 layers and bi-directional — that tensor dimension is now length 4\. With PyTorch it depends which bit of the process you are looking at as to whether you get a separate result for each layer and/or for each bidirectional bit. You have to look up the documentation and it will tell you input's output's tensor sizes appropriate for the number of layers and whether you have `bidirectional=True` . + +In this particular case, you will see all the changes that had to be made [ [1:19:38](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h19m38s) ]. For example ,when we added `bidirectional=True` , the `Linear` layer now needs number of hidden times 2 (ie `nh*2` ) to reflect the fact that we have that second direction in our hidden state. Also in `initHidden` it's now `self.nl*2` . + +``` + class Seq2SeqRNN_Bidir (nn.Module): def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2): super().__init__() self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc) self.nl,self.nh,self.out_sl = nl,nh,out_sl self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25, bidirectional= True ) self.out_enc = nn.Linear(nh *2 , em_sz_dec, bias= False ) self.drop_enc = nn.Dropout(0.05) self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec) self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1) self.emb_enc_drop = nn.Dropout(0.15) self.out_drop = nn.Dropout(0.35) self.out = nn.Linear(em_sz_dec, len(itos_dec)) self.out.weight.data = self.emb_dec.weight.data def forward(self, inp): sl,bs = inp.size() h = self.initHidden(bs) emb = self.emb_enc_drop(self.emb_enc(inp)) enc_out, h = self.gru_enc(emb, h) h = h.view(2,2,bs,-1).permute(0,2,1,3) .contiguous().view(2,bs,-1) h = self.out_enc(self.drop_enc(h)) +``` + +``` + dec_inp = V(torch.zeros(bs).long()) res = [] for i in range(self.out_sl): emb = self.emb_dec(dec_inp).unsqueeze(0) outp, h = self.gru_dec(emb, h) outp = self.out(self.out_drop(outp[0])) res.append(outp) dec_inp = V(outp.data.max(1)[1]) if (dec_inp==1).all(): break return torch.stack(res) def initHidden(self, bs): return V(torch.zeros(self.nl *2 , bs, self.nh)) +``` + +**Question** : Why is making the decoder bi-directional considered cheating? [ [1:20:13](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h20m13s) ] It's not just cheating but we have this loop going on so it is not as simple as having two tensors. Then how do you turn those two separate loops into a final result? After talking about it during the break, Jeremy has gone from “everybody knows it doesn't work” to “maybe it could work”, but it requires more thought. It is quite possible during the week, he'll realize it's a dumb idea, but we'll think about it. 
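To see what the hidden-state reshape in `forward` is doing, here is just the shape bookkeeping with made-up sizes (this mirrors the `h.view(...).permute(...)` line above rather than adding anything new): a 2-layer bidirectional GRU hands back `h` of shape `(num_layers * num_directions, batch, nh)`, and the reshape concatenates the forward and backward states of each layer along the feature dimension before the linear layer.

```
import torch

nl, bs, nh = 2, 4, 256
h = torch.zeros(nl*2, bs, nh)                    # (layers * directions, batch, nh)

# (layer, direction, batch, nh) -> (layer, batch, direction, nh) -> concat directions
h2 = h.view(2, 2, bs, -1).permute(0, 2, 1, 3).contiguous().view(2, bs, -1)
print(h2.shape)                                  # torch.Size([2, 4, 512])
```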
+ +**Question** : Why do you need to set a range to the loop? [ [1:20:58](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h20m58s) ] Because when we start training, everything is random so `if (dec_inp==1).all(): break` will probably never be true. Later on, it will pretty much always break out eventually but basically we are going to go forever. It's really important to remember when you are designing an architecture that when you start, the model knows nothing about anything. So you want to make sure if it's going to do something at least it's vaguely sensible. + +We got 3.58 cross entropy loss with single direction [ [1:21:46](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h21m46s) ]. With bi-direction, we got down to 3.51, so that improved a little. It shouldn't really slow things down too much. Bi-directional does mean there is a little bit more sequential processing have to happen, but it is generally a good win. In the Google translation model, of the 8 layers, only the first layer is bi-directional because it allows it to do more in parallel so if you create really deep models you may need to think about which ones are bi-directional otherwise we have performance issues. + +``` + rnn = Seq2SeqRNN_Bidir(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90) learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn) learn.crit = seq2seq_loss +``` + +``` + learn.fit(lr, 1, cycle_len=12, use_clr=(20,10)) +``` + +``` + _epoch trn_loss val_loss_ + 0 4.896942 4.761351 + 1 4.323335 4.260878 + 2 3.962747 4.06161 + 3 3.596254 3.940087 + 4 3.432788 3.944787 + 5 3.310895 3.686629 + 6 3.454976 3.638168 + 7 3.093827 3.588456 + 8 3.257495 3.610536 + 9 3.033345 3.540344 + 10 2.967694 3.516766 + 11 2.718945 3.513977 +``` + +``` + [3.5139771] +``` + +#### Trick #2 Teacher Forcing [ [1:22:39](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h22m39s) ] + +Now let's talk about teacher forcing. When a model starts learning, it knows nothing about nothing. So when the model starts learning, it is not going to spit out “Er” at the first step, it is going to spit out some random meaningless word because it doesn't know anything about German or about English or about the idea of language. And it is going to feed it to the next process as an input and be totally unhelpful. That means, early learning is going to be very difficult because it is feeding in an input that is stupid into a model that knows nothing and somehow it's going to get better. So it is not asking too much eventually it gets there, but it's definitely not as helpful as we can be. So what if instead of feeing in the thing I predicted just now, what if we instead we feed in the actual correct word was meant to be. We can't do that at inference time because by definition we don't know the correct word - it has to translate it. We can't require the correct translation in order to do translation. + +![](../img/1_DU776SGr1rhYeU7ilIKX9w.png) + +So the way it's set up is we have this thing called `pr_force` which is probability of forcing [ [1:24:01](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h24m1s) ]. If some random number is less than that probability then we are going to replace our decoder input with the actual correct thing. If we have already gone too far and if it is already longer than the target sequence, we are just going to stop because obviously we can't give it the correct thing. So you can see how beautiful PyTorch is for this. 
The key reason we switched to PyTorch at this exact point in last year's class was that Jeremy tried to implement teacher forcing in Keras and TensorFlow and went even more insane than he started. It was weeks of getting nowhere; then he saw on Twitter that Andrej Karpathy had said something about this thing called PyTorch that had just come out and was really cool. He tried it that day and, by the next day, he had teacher forcing working. All the debugging suddenly became so much easier, and this kind of dynamic approach is so much easier. So this is a great example of "hey, I get to use random numbers and if statements".

```
class Seq2SeqStepper(Stepper):
    def step(self, xs, y, epoch):
        self.m.pr_force = (10-epoch)*0.1 if epoch<10 else 0
        xtra = []
        output = self.m(*xs, y)
        if isinstance(output, tuple): output, *xtra = output
        self.opt.zero_grad()
        loss = raw_loss = self.crit(output, y)
        if self.reg_fn: loss = self.reg_fn(output, xtra, raw_loss)
        loss.backward()
        if self.clip:   # gradient clipping
            nn.utils.clip_grad_norm(trainable_params_(self.m), self.clip)
        self.opt.step()
        return raw_loss.data[0]
```

Here is the basic idea [ [1:25:29](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h25m29s) ]. At the start of training, let's set `pr_force` really high so that nearly always it gets the actual correct previous word and so it has a useful input. Then as we train a bit more, let's decrease `pr_force` so that by the end `pr_force` is zero and it has to learn properly, which is fine because it is now actually feeding in sensible inputs most of the time anyway.

```
class Seq2SeqRNN_TeacherForcing(nn.Module):
    def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2):
        super().__init__()
        self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc)
        self.nl,self.nh,self.out_sl = nl,nh,out_sl
        self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25)
        self.out_enc = nn.Linear(nh, em_sz_dec, bias=False)
        self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec)
        self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1)
        self.emb_enc_drop = nn.Dropout(0.15)
        self.out_drop = nn.Dropout(0.35)
        self.out = nn.Linear(em_sz_dec, len(itos_dec))
        self.out.weight.data = self.emb_dec.weight.data
        self.pr_force = 1.

    def forward(self, inp, y=None):
        sl,bs = inp.size()
        h = self.initHidden(bs)
        emb = self.emb_enc_drop(self.emb_enc(inp))
        enc_out, h = self.gru_enc(emb, h)
        h = self.out_enc(h)
```

```
        dec_inp = V(torch.zeros(bs).long())
        res = []
        for i in range(self.out_sl):
            emb = self.emb_dec(dec_inp).unsqueeze(0)
            outp, h = self.gru_dec(emb, h)
            outp = self.out(self.out_drop(outp[0]))
            res.append(outp)
            dec_inp = V(outp.data.max(1)[1])
            if (dec_inp==1).all(): break
            if (y is not None) and (random.random()<self.pr_force):
                if i>=len(y): break
                dec_inp = y[i]
        return torch.stack(res)

    def initHidden(self, bs):
        return V(torch.zeros(self.nl, bs, self.nh))
```

`pr_force`: "probability of forcing". High in the beginning, zero by the end.

Let's now write something such that, in the training loop, it gradually decreases `pr_force` [ [1:26:01](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h26m1s) ]. How do we do that? One approach would be to write our own training loop, but let's not do that because we already have a training loop that has progress bars, uses exponentially weighted averages to smooth out the losses, keeps track of metrics, and does a bunch of other things. It also takes care of calling `reset` for the RNN at the start of each epoch to make sure the hidden state is set to zeros.
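Spelled out, the forcing schedule that `Seq2SeqStepper` sets each epoch (just printing the expression used above) decays linearly to zero over the first ten epochs:

```
for epoch in range(12):
    pr_force = (10-epoch)*0.1 if epoch<10 else 0
    print(epoch, round(pr_force, 1))
# 0 1.0, 1 0.9, ..., 9 0.1, 10 0, 11 0
```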
What we've tended to find is that as we start to write some new thing and we need to replace some part of the code, we then add some little hook so that we can all use that hook to make things easier. In this particular case, there is a hook that Jeremy has ended up using all the time, which is the hook called the stepper. If you look at the source code, `model.py` is where our `fit` function lives, which is the lowest-level thing that does not require a Learner or anything much at all — it just requires a standard PyTorch model and a model data object. You just need to tell it how many epochs, and give it a standard PyTorch optimizer and a standard PyTorch loss function. We hardly ever use it directly in the class; we normally call `learn.fit`, but `learn.fit` calls this.

![](../img/1_hhksba0Jh8iyWmuC_tPtqg.png)

We have looked at the source code before [ [1:27:49](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h27m49s) ]. We've seen how it loops through each epoch, then loops through each thing in our batch and calls `stepper.step`. `stepper.step` is the thing that is responsible for:

* calling the model
* getting the loss
* finding the loss function
* calling the optimizer

![](../img/1_dlBOu68q6RyNuQ0opzvxMg.png)

So by default, `stepper.step` uses a particular class called `Stepper` which basically calls the model, zeros the gradients, calls the loss function, calls `backward`, does gradient clipping if necessary, then calls the optimizer. These are the basic steps that, back when we looked at "PyTorch from scratch", we had to do ourselves. The nice thing is, we can replace that with something else rather than replacing the training loop. If you inherit from `Stepper` and then write your own version of `step`, you can just copy and paste the contents of `step` and add whatever you like. Or if it's something that you're going to do before or afterwards, you could even call `super().step`. In this case, Jeremy rather suspects he was being unnecessarily complicated [ [1:29:12](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h29m12s) ] — he probably could have done something like:

```
class Seq2SeqStepper(Stepper):
    def step(self, xs, y, epoch):
        self.m.pr_force = (10-epoch)*0.1 if epoch<10 else 0
        return super().step(xs, y, epoch)
```

But as he said, when he is prototyping, he doesn't think carefully about how to minimize his code — he copied and pasted the contents of `step` and added a single line at the top which replaces `pr_force` in the module with something that decreases linearly over the first 10 epochs and is zero after that. So it is a total hack, but good enough to try it out. The nice thing is that everything else is the same except for the addition of these three lines:

```
if (y is not None) and (random.random()<self.pr_force):
    if i>=len(y): break
    dec_inp = y[i]
```

And the only thing we need to do differently is that when we call `fit`, we pass in our customized stepper class.
+ +``` + rnn = Seq2SeqRNN_TeacherForcing(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90) learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn) learn.crit = seq2seq_loss +``` + +``` + learn.fit(lr, 1, cycle_len=12, use_clr=(20,10), stepper=Seq2SeqStepper) +``` + +``` + _epoch trn_loss val_loss_ + 0 4.460622 12.661013 + 1 3.468132 7.138729 + 2 3.235244 6.202878 + 3 3.101616 5.454283 + 4 3.135989 4.823736 + 5 2.980696 4.933402 + 6 2.91562 4.287475 + 7 3.032661 3.975346 + 8 3.103834 3.790773 + 9 3.121457 3.578682 + 10 2.917534 3.532427 + 11 3.326946 3.490643 +``` + +``` + [3.490643] +``` + +And now our loss is down to 3.49\. We needed to make sure at least do 10 epochs because before that, it was cheating by using the teacher forcing. + +#### Trick #3 Attentional model [ [1:31:00](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h31m) ] + +This next trick is a bigger and pretty cool trick. It's called “attention.” The basic idea of attention is this — expecting the entirety of the sentence to be summarized into this single hidden vector is asking a lot. It has to know what was said, how it was said, and everything necessary to create the sentence in German. The idea of attention is basically maybe we are asking too much. Particularly because we could use this form of model (below) where we output every step of the loop to not just have a hidden state at the end but to have a hidden state after every single word. Why not try and use that information? It's already there but so far we've just been throwing it away. Not only that but bi-directional, we got two vectors of state every step that we can use. How can we do this? + +![](../img/1_CX45skUFZZO6uHsR8IndzA.png) + +Let's say we are translating a word “liebte” right now [ [1:32:34](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h32m34s) ]. Which of previous 5 pieces of state do we want? We clearly want “love” because it is the word. How about “zu”? We probably need “eat” and “to” and loved” to make sure we have gotten the tense right and know that I actually need this part of the verb and so forth. So depending on which bit we are translating, we would need one or more bits of these various hidden states. In fact, we probably want some weighting of them. In other words, for these five pieces of hidden state, we want a weighted average [ [1:33:47](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h33m47s) ]. We want it weighted by something that can figure out which bits of the sentence is the most important right now. How do we figure out something like which bits of the sentence are important right now? We create a neural net and we train the neural net to figure it out. When do we train that neural net? End to end. So let's now train two neural nets [ [1:34:18](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h34m18s) ]. Well, we've already got a bunch — RNN encoder, RNN decoder, a couple of linear layers, what the heck, let's add another neural net into the mix. This neural net is going to spit out a weight for every one of these states and we will take the weighted average at every step, and it's just another set of parameters that we learn all at the same time. So that is called “attention”. + +![](../img/1_JTCoNaf3I5LQVz2SrYCz0A.png) + +The idea is that once that attention has been learned, each word is going to take a weighted average as you can see in this terrific demo from Chris Olah and Shan Carter [ [1:34:50](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h34m50s) ]. 
Check out this [distill.pub article](https://distill.pub/2016/augmented-rnns/) — these things are interactive diagrams that shows you how the attention works and what the actual attention looks like in a trained translation model. + +[![](../img/1_fkL30nxS54fKVmyC2jtMrw.png)](https://distill.pub/2016/augmented-rnns/) + +Let's try and implement attention [ [1:35:47](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h35m47s) ]: + +``` + def rand_t(*sz): return torch.randn(sz)/math.sqrt(sz[0]) def rand_p(*sz): return nn.Parameter(rand_t(*sz)) +``` + +``` + class Seq2SeqAttnRNN (nn.Module): def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2): super().__init__() self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc) self.nl,self.nh,self.out_sl = nl,nh,out_sl self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25) self.out_enc = nn.Linear(nh, em_sz_dec, bias= False ) self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec) self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1) self.emb_enc_drop = nn.Dropout(0.15) self.out_drop = nn.Dropout(0.35) self.out = nn.Linear(em_sz_dec*2, len(itos_dec)) self.out.weight.data = self.emb_dec.weight.data +``` + +``` + self.W1 = rand_p(nh, em_sz_dec) self.l2 = nn.Linear(em_sz_dec, em_sz_dec) self.l3 = nn.Linear(em_sz_dec+nh, em_sz_dec) self.V = rand_p(em_sz_dec) +``` + +``` + def forward(self, inp, y= None , ret_attn= False ): sl,bs = inp.size() h = self.initHidden(bs) emb = self.emb_enc_drop(self.emb_enc(inp)) enc_out, h = self.gru_enc(emb, h) h = self.out_enc(h) +``` + +``` + dec_inp = V(torch.zeros(bs).long()) res,attns = [],[] w1e = enc_out @ self.W1 for i in range(self.out_sl): w2h = self.l2(h[-1]) u = F.tanh(w1e + w2h) a = F.softmax(u @ self.V, 0) attns.append(a) Xa = (a.unsqueeze(2) * enc_out).sum(0) emb = self.emb_dec(dec_inp) wgt_enc = self.l3(torch.cat([emb, Xa], 1)) outp, h = self.gru_dec(wgt_enc.unsqueeze(0), h) outp = self.out(self.out_drop(outp[0])) res.append(outp) dec_inp = V(outp.data.max(1)[1]) if (dec_inp==1).all(): break if (y is not None ) and (random.random()=len(y): break dec_inp = y[i] +``` + +``` + res = torch.stack(res) if ret_attn: res = res,torch.stack(attns) return res +``` + +``` + def initHidden(self, bs): return V(torch.zeros(self.nl, bs, self.nh)) +``` + +With attention, most of the code is identical. The one major difference is this line: `Xa = (a.unsqueeze(2) * enc_out).sum(0)` . We are going to take a weighted average and the way we are going to do the weighted average is we create a little neural net which we are going to see here: + +``` + w2h = self.l2(h[-1]) u = F.tanh(w1e + w2h) a = F.softmax(u @ self.V, 0) +``` + +We use softmax because the nice thing about softmax is that we want to ensure all of the weights that we are using add up to 1 and we also expect that one of those weights should probably be higher than the other ones [ [1:36:38](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h36m38s) ]. Softmax gives us the guarantee that they add up to 1 and because it has `e^` in it, it tends to encourage one of the weights to be higher than the other ones. + +Let's see how this works [ [1:37:09](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h37m9s) ]. We are going to take the last layer's hidden state and we are going to stick it into a linear layer. Then we are going to stick it into a nonlinear activation, then we are going to do a matrix multiply. So if you think about it — a linear layer, nonlinear activation, matrix multiple — it's a neural net. 
It is a neural net with one hidden layer. Stick it into a softmax and then we can use that to weight our encoder outputs. Now rather than just taking the last encoder output, we have the whole tensor of all of the encoder outputs which we just weight by this neural net we created. + +In Python, `A @ B` is the matrix product, `A * B` the element-wise product + +#### Papers [ [1:38:18](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h38m18s) ] + +* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) — One amazing paper that originally introduced this idea of attention as well as a couple of key things which have really changed how people work in this field. They say area of attention has been used not just for text but for things like reading text out of pictures or doing various things with computer vi sion. +* [Grammar as a Foreign Language](https://arxiv.org/abs/1412.7449) — The second paper which Geoffrey Hinton was involved in that used this idea of RNN with attention to try to replace rules based grammar with an RNN which automatically tagged each word based on the grammar. It turned out to do it better than any rules based system which today seems obvious but at that time it was considered really surprising. They are summary of how attention works which is really nice and concise. + +**Question** : Could you please explain attention again? [ [1:39:46](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h39m46s) ] Sure! Let's go back and look at our original encoder. + +![](../img/1_NsX_t2WEEjsPcVyWOrcXFA.png) + +The RNN spits out two things: it spits out a list of the state after every time step ( `enc_out` ), and it also tells you the state at the last time step ( `h` )and we used the state at the last time step to create the input state for our decoder which is one vector `s` below: + +![](../img/1_QQiYoum-_J9Rm7DEQCElxA.png) + +But we know that it's creating a vector at every time steps (orange arrows), so wouldn't it be nice to use them all? But wouldn't it be nice to use the one or ones that's most relevant to translating the word we are translating now? So wouldn't it be nice to be able to take a weighted average of the hidden state at each time step weighted by whatever is the appropriate weight right now. For example, “liebte” would definitely be time step #2 is what it's all about because that is the word I'm translating. So how do we get a list of weights that is suitable fore the word we are training right now? The answer is by training a neural net to figure out the list of weights. So anytime we want to figure out how to train a little neural net that does any task, the easiest way, normally always to do that is to include it in your module and train it in line with everything else. The minimal possible neural net is something that contains two layers and one nonlinear activation function, so `self.l2` is one linear layer. + +![](../img/1_fnTtr-UiW5JtNy8M9q1mkg.png) + +In fact, instead of a linear layer, we can even just grab a random matrix if we do not care about bias [ [1:42:18](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h42m18s) ]. `self.W1` is a random tensor wrapped up in a `Parameter` . 
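As a minimal illustration of what that wrapping buys you (a toy module, not part of the notebook): anything assigned as an `nn.Parameter` shows up in `module.parameters()` and therefore gets gradients and optimizer updates, while a plain tensor attribute is ignored.

```
import torch
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.w_learn = nn.Parameter(torch.randn(3, 3))   # registered, will be trained
        self.w_fixed = torch.randn(3, 3)                 # just an attribute, not trained

print(sum(p.numel() for p in Tiny().parameters()))       # 9 (only w_learn counts)
```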
+ +`Parameter` : Remember, a `Parameter` is identical to PyTorch `Variable` but it just tells PyTorch “I want you to learn the weights for this please.” [ [1:42:35](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h42m35s) ] + +So when we start out our decoder, let's take the current hidden state of the decoder, put that into a linear layer ( `self.l2` ) because what is the information we use to decide what words we should focus on next — the only information we have to go on is what the decoder's hidden state is now. So let's grab that: + +* put it into the linear layer ( `self.l2` ) +* put it through a non-linearity ( `F.tanh` ) +* put it through one more nonlinear layer ( `u @ self.V` doesn't have a bias in it so it's just matrix multiply) +* put that through softmax + +That's it — a little neural net. It doesn't do anything. It's just a neural net and no neural nets do anything they are just linear layers with nonlinear activations with random weights. But it starts to do something if we give it a job to do. In this case, the job we give it to do is to say don't just take the final state but now let's use all of the encoder states and let's take all of them and multiply them by the output of that little neural net. So given that the things in this little neural net are learnable weights, hopefully it's going to learn to weight those encoder hidden states by something useful. That is all neural net ever does is we give it some random weights to start with and a job to do, and hope that it learns to do the job. It turns out, it does. + +![](../img/1_jOVVKAFwMxGt9v6WEqgLXw.png) + +Everything else in here is identical to what it was before. We have teacher forcing, it's not bi-directional, so we can see how this goes. + +``` + rnn = Seq2SeqAttnRNN(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90) learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn) learn.crit = seq2seq_loss lr=2e-3 +``` + +``` + learn.fit(lr, 1, cycle_len=15, use_clr=(20,10), stepper=Seq2SeqStepper) +``` + +``` + _epoch trn_loss val_loss_ + 0 3.882168 11.125291 + 1 3.599992 6.667136 + 2 3.236066 5.552943 + 3 3.050283 4.919096 + 4 2.99024 4.500383 + 5 3.07999 4.000295 + 6 2.891087 4.024115 + 7 2.854725 3.673913 + 8 2.979285 3.590668 + 9 3.109851 3.459867 + 10 2.92878 3.517598 + 11 2.778292 3.390253 + 12 2.795427 3.388423 + 13 2.809757 3.353334 + 14 2.6723 3.368584 +``` + +``` + [3.3685837] +``` + +Teacher forcing had 3.49 and now with nearly exactly the same thing but we've got this little minimal neural net figuring out what weightings to give our inputs and we are down to 3.37\. Remember, these loss are logs, so `e^3.37` is quite a significant change. + +``` + learn.save('attn') +``` + +#### Test [ [1:45:37](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h45m37s) ] + +``` + x,y = next(iter(val_dl)) probs,attns = learn.model(V(x),ret_attn= True ) preds = to_np(probs.max(2)[1]) +``` + +``` + for i in range(180,190): print(' '.join([fr_itos[o] for o in x[:,i] if o != 1])) print(' '.join([en_itos[o] for o in y[:,i] if o != 1])) print(' '.join([en_itos[o] for o in preds[:,i] if o!=1])) print() +``` + +``` + quels facteurs pourraient influer sur le choix de leur emplacement ? _eos_ + what factors influencetheir location ? _eos_ + what factors might influence the their their their ? _eos_ +``` + +``` + qu' est -ce qui ne peut pas changer ? _eos_ + what can not change ? _eos_ + what can not change change ? _eos_ +``` + +``` + que faites - vous ? _eos_ + what do you do ? _eos_ + what do you do ? 
_eos_ +``` + +``` + qui réglemente les pylônes d' antennes ? _eos_ + who regulates antenna towers ? _eos_ + who regulates the lights ? _?_ _eos_ +``` + +``` + où sont - ils situés ? _eos_ + where are they located ? _eos_ + where are they located ? _eos_ +``` + +``` + quelles sont leurs compétences ? _eos_ + what are their qualifications ? _eos_ + what are their skills ? _eos_ +``` + +``` + qui est victime de harcèlement sexuel ? _eos_ + who experiences sexual harassment ? _eos_ + who is victim sexual sexual ? _eos_ +``` + +``` + quelles sont les personnes qui visitent les communautés autochtones ? _eos_ + who visits indigenous communities ? _eos_ + who is people people aboriginal people ? _eos_ +``` + +``` + pourquoi ces trois points en particulier ? _eos_ + why these specific three ? _eos_ + why are these three three ? _?_ _eos_ +``` + +``` + pourquoi ou pourquoi pas ? _eos_ + why or why not ? _eos_ + why or why not ? _eos_ +``` + +不错。 It's still not perfect but quite a few of them are correct and again considering that we are asking it to learn about the very idea of language for two different languages and how to translate them between the two, and grammar, and vocabulary, and we only have 50,000 sentences and a lot of the words only appear once, I would say this is actually pretty amazing. + +**Question:** Why do we use tanh instead of ReLU for the attention mini net? [ [1:46:23](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h46m23s) ] I don't quite remember — it's been a while since I looked at it. You should totally try using value and see how it goes. Obviously tanh the key difference is that it can go in each direction and it's limited both at the top and the bottom. I know very often for the gates inside RNNs, LSTMs, and GRUs, tanh often works out better but it's been about a year since I actually looked at that specific question so I'll look at it during the week. The short answer is you should try a different activation function and see if you can get a better result. + +> From Lesson 7 [ [44:06](https://youtu.be/H3g26EVADgY%3Ft%3D44m6s) ]: As we have seen last week, tanh is forcing the value to be between -1 and 1\. Since we are multiplying by this weight matrix again and again, we would worry that relu (since it is unbounded) might have more gradient explosion problem. Having said that, you can specify RNNCell to use different nonlineality whose default is tanh and ask it to use relu if you wanted to. + +#### Visualization [ [1:47:12](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h47m12s) ] + +What we can do also is we can grab the attentions out of the model by adding return attention parameter to `forward` function. You can put anything you'd like in `forward` function argument. So we added a return attention parameter, false by default because obviously the training loop it doesn't know anything about it but then we just had something here says if return attention, then stick the attentions on as well ( `if ret_attn: res = res,torch.stack(attns)` ). The attentions is simply the value `a` just chuck it on a list ( `attns.append(a)` ). We can now call the model with return attention equals true and get back the probabilities and the attentions [ [1:47:53](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h47m53s) ]: + +``` + probs,attns = learn.model(V(x),ret_attn= True ) +``` + +We can now draw pictures, at each time step, of the attention. 
+ +``` + attn = to_np(attns[...,180]) +``` + +``` + fig, axes = plt.subplots(3, 3, figsize=(15, 10)) **for** i,ax **in** enumerate(axes.flat): ax.plot(attn[i]) +``` + +![](../img/1_CSG2P8oBICyPDBnoT9z7Wg.png) + +When you are Chris Olah and Shan Carter, you make things that looks like ☟when you are Jeremy Howard, the exact same information looks like ☝︎[ [1:48:24](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h48m24s) ]. You can see at each different time step, we have a different attention. + +![](../img/1_zOxcT0Nib1_VtVkyQN4FQQ.png) + +It's very important when you try to build something like this, you don't really know if it's not working right because if it's not working (as per usual Jeremy's first 12 attempts of this were broken) and they were broken in a sense that it wasn't really learning anything useful. Therefore, it was giving equal attention to everything and it wasn't worse — it just wasn't much better. Until you actually find ways to visualize the thing in a way that you know what it ought to look like ahead of time, you don't really know if it's working [ [1:49:16](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h49m16s) ]. So it's really important that you try to find ways to check your intermediate steps in your outputs. + +**Question** : What is the loss function of the attentional neural network? [ [1:49:31](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h49m31s) ] No, there is no loss function for the attentional neural network. It is trained end-to-end. It is just sitting inside our decoder loop. The loss function for the decoder loop is the same loss function because the result contains exactly same thing as before — the probabilities of the words. How come the mini neural net learning something? Because in order to make the outputs better and better, it would be great if it made the weights of weighted-average better and better. So part of creating our output is to please do a good job of finding a good set of weights and if it doesn't do a good job of finding good set of weights, then the loss function won't improve from that bit. So end-to-end learning means you throw in everything you can into one loss function and the gradients of all the different parameters point in a direction that says “hey, you know if you had put more weight over there, it would have been better.” And thanks to the magic of the chain rule, it knows to put more weight over there, change the parameter in the matrix multiply a little, etc. That is the magic of end-to-end learning. It is a very understandable question but you have to realize there is nothing particular about this code that says this particular bits are separate mini neural network anymore than the GRU is a separate little neural network, or a linear layer is a separate little function. It's all ends up pushed into one output which is a bunch of probabilities which ends up in one loss function that returns a single number that says this either was or wasn't a good translation. So thanks to the magic of the chain rule, we then back propagate little updates to all the parameters to make them a little bit better. This is a big, weird, counterintuitive idea and it's totally okay if it's a bit mind-bending. It is the bit where even back to lesson 1 “how did we make it find dogs vs. cats?” — we didn't. All we did was we said “this is our data, this is our architecture, this is our loss function. 
Please back propagate into the weights to make them better and after you've made them better a while, it will start finding cats from dogs.” In this case (ie translation), we haven't used somebody else's convolutional network architecture. We said “here is a custom architecture which we hope is going to be particularly good at this problem.” Even without this custom architecture, it was still okay. But we made it in a way that made more sense or we think it ought to do worked even better. But at no point, did we do anything different other than say “here is a data, here is an architecture, here is a loss function — go and find the parameters please” And it did it because that's what neural nets do. + +So that is sequence-to-sequence learning [ [1:53:19](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h53m19s) ]. + +* If you want to encode an image into a CNN backbone of some kind, and then pass that into a decoder which is like RNN with attention, and you make your y-values the actual correct caption of each of those image, you will end up with an image caption generator. +* If you do the same thing with videos and captions, you will end up with a video caption generator. +* If you do the same thing with 3D CT scan and radiology reports, you will end up with a radiology report generator. +* If you do the same thing with Github issues and people's chosen summaries of them, you'll get a Github issue summary generator. + +> Seq-to-seq is magical but they work [ [1:54:07](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h54m7s) ]. And I don't feel like people have begun to scratch the surface of how to use seq-to-seq models in their own domains. Not being a Github person, it would never have occurred to me that “it would be kind of cool to start with some issue and automatically create a summary”. But now, of course, next time I go into Github, I want to see a summary written there for me. I don't want to write my own commit message. Why should I write my own summary of the code review when I finished adding comments to lots of lines — it should do that for me as well. Now I'm thinking Github so behind, it could be doing this stuff. So what are the thing in your industry? You could start with a sequence and generate something from it. I can't begin to imagine. Again, it is a fairly new area and the tools for it are not easy to use — they are not even built into fastai yet. Hopefully there will be soon. I don't think anybody knows what the opportunities are. + +### Devise [ [1:55:23](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h55m23s) ] + +[Notebook](https://github.com/fastai/fastai/blob/master/courses/dl2/devise.ipynb) / [Paper](http://papers.nips.cc/paper/5204-devise-a-deep-visual-semantic-embedding-model.pdf) + +We are going to do something bringing together for the first time our two little worlds we focused on — text and images [ [1:55:49](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h55m49s) ]. This idea came up in a paper by an extraordinary deep learning practitioner and researcher named Andrea Frome. Andrea was at Google at the time and her crazy idea was words can have a distributed representation, a space, which particularly at that time was just word vectors. And images can be represented in a space. In the end, if we have a fully connected layer, they ended up as a vector representation. Could we merge the two? Could we somehow encourage the vector space that the images end up with be the same vector space that the words are in? And if we could do that, what would that mean? What could we do with that? 
So what could we do with that covers things like well, what if I'm wrong what if I'm predicting that this image is a beagle and I predict jumbo jet and Yannet's model predicts corgi. The normal loss function says that Yannet's and Jeremy's models are equally good (ie they are both wrong). But what if we could somehow say though you know what corgi is closer to beagle than it is to jumbo jets. So Yannet's model is better than Jeremy's. We should be able to do that because in word vector space, beagle and corgi are pretty close together but jumbo jet not so much. So it would give us a nice situation where hopefully our inferences would be wrong in saner ways if they are wrong. It would also allow us to search for things that are not in ImageNet Synset ID (ie a category in ImageNet). Why did we have to train a whole new model to find dog vs. cats when we already have something that found corgis and tabbies. Why can't we just say find me dogs? If we had trained it in word vector space, we totally could because they are word vector, we can find things with the right image vector and so forth. We will look at some cool things we can do with it in a moment but first of all let's train a model where this model is not learning a category (one hot encoded ID) where every category is equally far from every other category, let's instead train a model where we're finding a dependent variable which is a word vector. so What word vector? Obviously the word vector for the word you want. So if it's corgi, let's train it to create a word vector that's the corgi word vector, and if it's a jumbo jet, let's train it with a dependent variable that says this is the word vector for a jumbo jet. + +``` + **from** **fastai.conv_learner** **import** * torch.backends.cudnn.benchmark= True import fastText as ft +``` + +``` + PATH = Path('data/imagenet/') TMP_PATH = PATH/'tmp' TRANS_PATH = Path('data/translate/') PATH_TRN = PATH/'train' +``` + +It is shockingly easy [ [1:59:17](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h59m17s) ]. Let's grab the fast text word vectors again, load them in (we only need English this time). + +``` + ft_vecs = ft.load_model(str((TRANS_PATH/'wiki.en.bin'))) +``` + +``` + np.corrcoef(ft_vecs.get_word_vector('jeremy'), ft_vecs.get_word_vector('Jeremy')) +``` + +``` + array([[1\. , 0.60866], + [0.60866, 1\. ]]) +``` + +So for example, “jeremy” and “Jeremy” have a correlation of .6\. + +``` + np.corrcoef(ft_vecs.get_word_vector('banana'), ft_vecs.get_word_vector('Jeremy')) +``` + +``` + array([[1\. , 0.14482], + [0.14482, 1\. ]]) +``` + +Jeremy doesn't like bananas at all, and “banana” and “Jeremy” .14\. So words that you would expect to be correlated are correlated and words that should be as far away from each other as possible, unfortunately, they are still slightly correlated but not so much [ [1:59:41](https://youtu.be/tY0n9OT5_nA%3Ft%3D1h59m41s) ]. + +#### Map ImageNet classes to word vectors + +Let's now grab all of the ImageNet classes because we actually want to know which one is corgi and which one is jumbo jet. + +``` + ft_words = ft_vecs.get_words(include_freq= True ) ft_word_dict = {k:v for k,v in zip(*ft_words)} ft_words = sorted(ft_word_dict.keys(), key= lambda x: ft_word_dict[x]) +``` + +``` + len(ft_words) +``` + +``` + 2519370 +``` + +``` + from fastai.io import get_data +``` + +We have a list of all of those up on files.fast.ai that we can grab them. 
+ +``` + CLASSES_FN = 'imagenet_class_index.json' get_data(f'http://files.fast.ai/models/{CLASSES_FN}', TMP_PATH/CLASSES_FN) +``` + +Let's also grab a list of all of the nouns in English which Jeremy made available here: + +``` + WORDS_FN = 'classids.txt' get_data(f'http://files.fast.ai/data/{WORDS_FN}', PATH/WORDS_FN) +``` + +So we have the names of each of the thousand ImageNet classes and all of the nouns in English according to WordNet which is a popular thing for representing what words are and are not. We can now load that list of ImageNet classes, turn that into a dictionary, so `classids_1k` contains the class IDs for the 1000 images that are in the competition dataset. + +``` + class_dict = json.load((TMP_PATH/CLASSES_FN).open()) classids_1k = dict(class_dict.values()) nclass = len(class_dict); nclass +``` + +``` + 1000 +``` + +这是一个例子。 A “tench” apparently is a kind of fish. + +``` + class_dict['0'] +``` + +``` + ['n01440764', 'tench'] +``` + +Let's do the same thing for all those WordNet nouns [ [2:01:11](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h1m11s) ]. It turns out that ImageNet is using WordNet class names so that makes it nice and easy to map between the two. + +``` + classid_lines = (PATH/WORDS_FN).open().readlines() classid_lines[:5] +``` + +``` + ['n00001740 entity\n', + 'n00001930 physical_entity\n', + 'n00002137 abstraction\n', + 'n00002452 thing\n', + 'n00002684 object\n'] +``` + +``` + classids = dict(l.strip().split() for l in classid_lines) len(classids),len(classids_1k) +``` + +``` + (82115, 1000) +``` + +So these are our two worlds — we have the ImageNet thousand and we have the 82,000 which are in WordNet. + +``` + lc_vec_d = {w.lower(): ft_vecs.get_word_vector(w) for w in ft_words[-1000000:]} +``` + +So we want to map the two together which is as simple as creating a couple of dictionaries to map them based on the Synset ID or the WordNet ID. + +``` + syn_wv = [(k, lc_vec_d[v.lower()]) for k,v in classids.items() if v.lower() in lc_vec_d] syn_wv_1k = [(k, lc_vec_d[v.lower()]) for k,v in classids_1k.items() if v.lower() in lc_vec_d] syn2wv = dict(syn_wv) len(syn2wv) +``` + +``` + 49469 +``` + +What we need to do now is grab the 82,000 nouns in WordNet and try and look them up in fast text. We've managed to look up 49,469 of them in fast text. We now have a dictionary that goes from synset ID which is what WordNet calls them to word vectors. We also have the same thing specifically for the 1k ImageNet classes. + +``` + pickle.dump(syn2wv, (TMP_PATH/'syn2wv.pkl').open('wb')) pickle.dump(syn_wv_1k, (TMP_PATH/'syn_wv_1k.pkl').open('wb')) +``` + +``` + syn2wv = pickle.load((TMP_PATH/'syn2wv.pkl').open('rb')) syn_wv_1k = pickle.load((TMP_PATH/'syn_wv_1k.pkl').open('rb')) +``` + +Now we grab all of the ImageNet which you can download from Kaggle now [ [2:02:54](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h2m54s) ]. If you look at the Kaggle ImageNet localization competition, that contains the entirety of the ImageNet classifications as well. + +``` + images = [] img_vecs = [] +``` + +``` + for d in (PATH/'train').iterdir(): if d.name not in syn2wv: continue vec = syn2wv[d.name] for f in d.iterdir(): images.append(str(f.relative_to(PATH))) img_vecs.append(vec) +``` + +``` + n_val=0 for d in (PATH/'valid').iterdir(): if d.name not in syn2wv: continue vec = syn2wv[d.name] for f in d.iterdir(): images.append(str(f.relative_to(PATH))) img_vecs.append(vec) n_val += 1 +``` + +``` + n_val +``` + +``` + 28650 +``` + +It has a validation set of 28,650 items in it. 
For every image in ImageNet, we can grab its fast text word vector using the synset to word vector ( `syn2wv` ) and we can stick that into the image vectors array ( `img_vecs` ), stack that all up into a single matrix and save that away. + +``` + img_vecs = np.stack(img_vecs) img_vecs.shape +``` + +Now what we have is something for every ImageNet image, we also have the fast text word vector that it is associated with [ [2:03:43](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h3m43s) ] by looking up the synset ID → WordNet → Fast text → word vector. + +``` + pickle.dump(images, (TMP_PATH/'images.pkl').open('wb')) pickle.dump(img_vecs, (TMP_PATH/'img_vecs.pkl').open('wb')) +``` + +``` + images = pickle.load((TMP_PATH/'images.pkl').open('rb')) img_vecs = pickle.load((TMP_PATH/'img_vecs.pkl').open('rb')) +``` + +``` + arch = resnet50 +``` + +``` + n = len(images); n +``` + +``` + 766876 +``` + +``` + val_idxs = list(range(n-28650, n)) +``` + +Here is a cool trick [ [2:04:06](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h4m6s) ]. We can now create a model data object which specifically is an image classifier data object and we have this thing called `from_names_and_array` I'm not sure if we've used it before but we can pass it a list of file names (all of the file names in ImageNet) and an array of our dependent variables (all of the fast text word vectors). We can then pass in the validation indexes which in this case is just all of the last IDs — we need to make sure that they are the same as ImageNet uses otherwise we will be cheating. Then we pass in `continuous=True` which means this puts a lie again to this image classifier data is now an image regressive data so continuous equals True means don't one hot encode my outputs but treat them just as continuous values. So now we have a model data object that contains all of our file names and for every file name a continuous array representing the word vector for that. So we have data, now we need an architecture and the loss function. + +``` + tfms = tfms_from_model(arch, 224, transforms_side_on, max_zoom=1.1) md = ImageClassifierData. from_names_and_array (PATH, images, img_vecs, val_idxs=val_idxs, classes= None , tfms=tfms, continuous= True , bs=256) +``` + +``` + x,y = next(iter(md.val_dl)) +``` + +Let's create an architecture [ [2:05:26](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h5m26s) ]. We'll revise this next week, but we can use the tricks we've learnt so far and it's actually incredibly simple. Fastai has a `ConvnetBuilder` which is what gets called when you say `ConvLerner.pretrained` and you specify: + +* `f` : the architecture (we are going to use ResNet50) +* `c` : how many classes you want (in this case, it's not really classes — it's how many outputs you want which is the length of the fast text word vector ie 300). +* `is_multi` : It is not a multi classification as it is not classification at all. +* `is_reg` : Yes, it is a regression. +* `xtra_fc` : What fully connected layers you want. We are just going to add one fully connected hidden layer of a length of 1024\. Why 1024? The last layer of ResNet50 I think is 1024 long, the final output we need is 300 long. We obviously need our penultimate (second to the last) layer to be longer than 300\. Otherwise it's not enough information, so we just picked something a bit bigger. Maybe different numbers would be better but this worked for Jeremy. +* `ps` : how much dropout you want. Jeremy found that the default dropout, he was consistently under fitting so he just decreased the dropout from 0.5 to 0.2\. 
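To make that configuration concrete, here is a rough plain-PyTorch sketch of the kind of head being described. The exact fastai `ConvnetBuilder` internals differ, and the 2048-channel 7x7 feature map is an assumption about the ResNet50 body, so treat this purely as an illustration:

```python
import torch
import torch.nn as nn

# Sketch only: pooled ResNet50 features -> 1024 hidden units -> 300-d word vector.
# No softmax at the end because this is a regression onto fastText vectors.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(0.2),
    nn.Linear(2048, 1024),
    nn.ReLU(inplace=True),
    nn.Dropout(0.2),
    nn.Linear(1024, 300),
)
features = torch.randn(2, 2048, 7, 7)  # stand-in for the ResNet50 convolutional output
print(head(features).shape)            # torch.Size([2, 300])
```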
+ +So this is now a convolutional neural network that does not have any softmax or anything like that because it's regression it's just a linear layer at the end and that's our model [ [2:06:55](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h6m55s) ]. We can create a ConvLearner from that model and give it an optimization function. So now all we need is a loss function. + +``` + models = ConvnetBuilder(arch, md.c, is_multi= False , is_reg= True , xtra_fc=[1024], ps=[0.2,0.2]) +``` + +``` + learn = ConvLearner(md, models, precompute= True ) learn.opt_fn = partial(optim.Adam, betas=(0.9,0.99)) +``` + +**Loss Function** [ [2:07:38](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h7m38s) ]: Default loss function for regression is L1 loss (the absolute differences) — that is not bad. But unfortunately in really high dimensional spaces (anybody who has studied a bit of machine learning probably knows this) everything is on the outside (in this case, it's 300 dimensional). When everything is on the outside, distance is not meaningless but a little bit awkward. Things tend to be close together or far away, it doesn't really mean much in these really high dimensional spaces where everything is on the edge. What does mean something, though, is that if one thing is on the edge over here, and one thing is on the edge over there, we can form an angle between those vectors and the angle is meaningful. That is why we use cosine similarity when we are looking for how close or far apart things are in high dimensional spaces. If you haven't seen cosine similarity before, it is basically the same as Euclidean distance but it's normalized to be a unit norm (ie divided by the length). So we don't care about the length of the vector, we only care about its angle. There is a bunch of stuff that you could easily learn in a couple of hours but if you haven't seen it before, it's a bit mysterious. For now, just know that loss functions and high dimensional spaces where you are trying to find similarity, you care about angle and you don't care about distance [ [2:09:13](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h9m13s) ]. If you didn't use the following custom loss function, it would still work but it's a little bit less good. Now we have data, architecture, and loss function, therefore, we are done. We can go ahead and fit. + +``` + def cos_loss(inp,targ): return 1 - F.cosine_similarity(inp,targ).mean() learn.crit = cos_loss +``` + +``` + learn.lr_find(start_lr=1e-4, end_lr=1e15) +``` + +``` + learn.sched.plot() +``` + +``` + lr = 1e-2 wd = 1e-7 +``` + +We are training on all of ImageNet that is going to take a long time. So `precompute=True` is your friend. Remember `precompute=True` ? That is the thing we've learnt ages ago that caches the output of the final convolutional layer and just trains the fully connected bit. Even with `precompute=True` , it takes about 3 minutes to train an epoch on all of ImageNet. So this is about an hour worth of training, but it's pretty cool that with fastai, we can train a new custom head on all of ImageNet for 40 epochs in an hour or so. 
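One quick aside before kicking off that training: the cosine criterion defined above really does ignore magnitude and only penalizes direction. A tiny standalone check with made-up vectors (mirroring the `cos_loss` definition, not part of the notebook):

```python
import torch
import torch.nn.functional as F

def cos_loss(inp, targ): return 1 - F.cosine_similarity(inp, targ).mean()

targ = torch.tensor([[1., 2., 3.]])
print(cos_loss(10 * targ, targ))                      # ~0: same direction despite a big L1 gap
print(cos_loss(torch.tensor([[-2., 1., 0.]]), targ))  # ~1: orthogonal direction
```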
+ +``` + learn.precompute= True +``` + +``` + learn.fit(lr, 1, cycle_len=20, wds=wd, use_clr=(20,10)) +``` + +``` + _epoch trn_loss val_loss_ + 0 0.104692 0.125685 + 1 0.112455 0.129307 + 2 0.110631 0.126568 + 3 0.108629 0.127338 + 4 0.110791 0.125033 + 5 0.108859 0.125186 + 6 0.106582 0.123875 + 7 0.103227 0.123945 + 8 0.10396 0.12304 + 9 0.105898 0.124894 + 10 0.10498 0.122582 + 11 0.104983 0.122906 + 12 0.102317 0.121171 + 13 0.10017 0.121816 + 14 0.099454 0.119647 + 15 0.100425 0.120914 + 16 0.097226 0.119724 + 17 0.094666 0.118746 + 18 0.094137 0.118744 + 19 0.090076 0.117908 +``` + +``` + [0.11790786389489033] +``` + +``` + learn.bn_freeze( True ) +``` + +``` + learn.fit(lr, 1, cycle_len=20, wds=wd, use_clr=(20,10)) +``` + +``` + _epoch trn_loss val_loss_ + 0 0.104692 0.125685 + 1 0.112455 0.129307 + 2 0.110631 0.126568 + 3 0.108629 0.127338 + 4 0.110791 0.125033 + 5 0.108859 0.125186 + 6 0.106582 0.123875 + 7 0.103227 0.123945 + 8 0.10396 0.12304 + 9 0.105898 0.124894 + 10 0.10498 0.122582 + 11 0.104983 0.122906 + 12 0.102317 0.121171 + 13 0.10017 0.121816 + 14 0.099454 0.119647 + 15 0.100425 0.120914 + 16 0.097226 0.119724 + 17 0.094666 0.118746 + 18 0.094137 0.118744 + 19 0.090076 0.117908 +``` + +``` + [0.11790786389489033] +``` + +``` + lrs = np.array([lr/1000,lr/100,lr]) +``` + +``` + learn.precompute= False learn.freeze_to(1) +``` + +``` + learn.save('pre0') +``` + +``` + learn.load('pre0') +``` + +### Image search + +#### Search imagenet classes + +At the end of all that, we can now say let's grab the 1000 ImageNet classes, let's predict on our whole validation set, and take a look at a few pictures [ [2:10:26](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h10m26s) ]. + +``` + syns, wvs = list(zip(*syn_wv_1k)) wvs = np.array(wvs) +``` + +``` + %time pred_wv = learn.predict() +``` + +``` + CPU times: user 18.4 s, sys: 7.91 s, total: 26.3 s + Wall time: 7.17 s +``` + +``` + start=300 +``` + +``` + denorm = md.val_ds.denorm +``` + +``` + def show_img(im, figsize= None , ax= None ): if not ax: fig,ax = plt.subplots(figsize=figsize) ax.imshow(im) ax.axis('off') return ax +``` + +``` + def show_imgs(ims, cols, figsize= None ): fig,axes = plt.subplots(len(ims)//cols, cols, figsize=figsize) for i,ax in enumerate(axes.flat): show_img(ims[i], ax=ax) plt.tight_layout() +``` + +Because validation set is ordered, tall the stuff of the same type are in the same place. + +``` + show_imgs(denorm(md.val_ds[start:start+25][0]), 5, (10,10)) +``` + +![](../img/1_exiD0uDeL6xx5EOLdPS3BA.png) + +**Nearest neighbor search** [ [2:10:56](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h10m56s) ]: What we can now do is we can now use nearest neighbors search. So nearest neighbors search means here is one300 dimensional vector and here is a whole a lot of other 300 dimensional vectors, which things is it closest to? Normally that takes a very long time because you have to look through every 300 dimensional vector, calculate its distance, and find out how far away it is. But there is an amazing almost unknown library called **NMSLib** that does that incredibly fast. Some of you may have tried other nearest neighbor's libraries, I guarantee this is faster than what you are using — I can tell you that because it's been bench marked by people who do this stuff for a living. This is by far the fastest on every possible dimension. We want to create an index on angular distance, and we need to do it on all of our ImageNet word vectors. 
Adding a whole batch, create the index, and now we can query a bunch of vectors all at once, get the 10 nearest neighbors. The library uses multi-threading and is absolutely fantastic. You can install from pip ( `pip install nmslib` ) and it just works. + +``` + import nmslib +``` + +``` + def create_index(a): index = nmslib.init(space='angulardist') index.addDataPointBatch(a) index.createIndex() return index +``` + +``` + def get_knns(index, vecs): return zip(*index.knnQueryBatch(vecs, k=10, num_threads=4)) +``` + +``` + def get_knn(index, vec): return index.knnQuery(vec, k=10) +``` + +``` + nn_wvs = create_index(wvs) +``` + +It tells you how far away they are and their indexes [ [2:12:13](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h12m13s) ]. + +``` + idxs,dists = get_knns(nn_wvs, pred_wv) +``` + +So now we can go through and print out the top 3 so it turns out that bird actually is a limpkin. Interestingly the fourth one does not say it's a limpkin and Jeremy looked it up. He doesn't know much about birds but everything else is brown with white spots, but the 4th one isn't. So we don't know if that is actually a limpkin or if it is mislabeled but sure as heck it doesn't look like the other birds. + +``` + [[classids[syns[id]] for id in ids[:3]] for ids in idxs[start:start+10]] +``` + +``` + [['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['spoonbill', 'bustard', 'oystercatcher'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill']] +``` + +This is not a particularly hard thing to do because there is only a thousand ImageNet classes and it is not doing anything new. But what if we now bring in the entirety of WordNet and we now say which of those 45 thousand things is it closest to? + +#### Search all WordMet noun classes + +``` + all_syns, all_wvs = list(zip(*syn2wv.items())) all_wvs = np.array(all_wvs) +``` + +``` + nn_allwvs = create_index(all_wvs) +``` + +``` + idxs,dists = get_knns(nn_allwvs, pred_wv) +``` + +``` + [[classids[all_syns[id]] for id in ids[:3]] for ids in idxs[start:start+10]] +``` + +``` + [['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['spoonbill', 'bustard', 'oystercatcher'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill'], + ['limpkin', 'oystercatcher', 'spoonbill']] +``` + +Exactly the same result. It is now searching all of the WordNet. + +#### Text -> image search [ [2:13:16](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h13m16s) ] + +Now let's do something a bit different — which is to take all of our predictions ( `pred_wv` ) so basically take our whole validation set of images and create a KNN index of the image representations because remember, it is predicting things that are meant to be word vectors. Now let's grab the fast text vector for “boat” and boat is not an ImageNet concept — yet we can now find all of the images in our predicted word vectors (ie our validation set) that are closest to the word boat and it works even though it is not something that was ever trained on. 
+ +``` + nn_predwv = create_index(pred_wv) en_vecd = pickle.load(open(TRANS_PATH/'wiki.en.pkl','rb')) vec = en_vecd['boat'] +``` + +``` + idxs,dists = get_knn(nn_predwv, vec) show_imgs([open_image(PATH/md.val_ds.fnames[i]) for i in idxs[:3]], 3, figsize=(9,3)); +``` + +![](../img/1_BLLI7IWFO84BPwB-Y1uCKg.png) + +What if we now take engine's vector and boat's vector and take their average and what if we now look in our nearest neighbors for that [ [2:14:04](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h14m4s) ]? + +``` + vec = (en_vecd['engine'] + en_vecd['boat'])/2 +``` + +``` + idxs,dists = get_knn(nn_predwv, vec) show_imgs([open_image(PATH/md.val_ds.fnames[i]) for i in idxs[:3]], 3, figsize=(9,3)); +``` + +![](../img/1_EkwjxE8m8xeX2lDRLIJTCw.png) + +These are boats with engines. I mean, yes, the middle one is actually a boat with an engine — it just happens to have wings on as well. By the way, sail is not an ImageNet thing , neither is boat. Here is the average of two things that are not ImageNet things and yet with one exception, it's found us two sailboats. + +``` + vec = (en_vecd['sail'] + en_vecd['boat'])/2 +``` + +``` + idxs,dists = get_knn(nn_predwv, vec) show_imgs([open_image(PATH/md.val_ds.fnames[i]) for i in idxs[:3]], 3, figsize=(9,3)); +``` + +![](../img/1_-7X7LGxPWi1qv8fF1hkCog.png) + +#### Image->image [ [2:14:35](https://youtu.be/tY0n9OT5_nA%3Ft%3D2h14m35s) ] + +Okay, let's do something else crazy. Let's open up an image in the validation set. Let's call `predict_array` on that image to get its word vector like thing, and let's do a nearest neighbor search on all the other images. + +``` + fname = 'valid/n01440764/ILSVRC2012_val_00007197.JPEG' img = open_image(PATH/fname) show_img(img); +``` + +![](../img/1_FVCX6O367r2oJXVhR1r-Sg.png) + +``` + t_img = md.val_ds.transform(img) pred = learn.predict_array(t_img[ None ]) +``` + +``` + idxs,dists = get_knn(nn_predwv, pred) show_imgs([open_image(PATH/md.val_ds.fnames[i]) for i in idxs[1:4]], 3, figsize=(9,3)); +``` + +![](../img/1_hQ8r1gcwNfh7KlZGjS-lcQ.png) + +And here are all the other images of whatever that is. So you can see, this is crazy — we've trained a thing on all of ImageNet in an hour, using a custom head that required basically like two lines fo code, and these things run in 300 milliseconds to do these searches. + +Jeremy taught this basic idea last year as well, but it was in Keras, and it was pages and pages of code, and everything took a long time and complicated. And back then, Jeremy said he can't begin to think all of the stuff you could do with this. He doesn't think anybody has really thought deeply about this yet, but he thinks it's fascinating. So go back and read the DeVICE paper because Andrea had a whole bunch of other thoughts and now that it is so easy to do, hopefully people will dig into this now. Jeremy thinks it's crazy and amazing. + +Alright, see you next week! 
diff --git a/zh/dl12.md b/zh/dl12.md new file mode 100644 index 0000000000000000000000000000000000000000..ed9ce8c21adfe43ad4b7b03126c1f3a94075ea80 --- /dev/null +++ b/zh/dl12.md @@ -0,0 +1,779 @@ +# 深度学习2:第2部分第12课 + +![](../img/1_iFOmwPIB-BHiM7G4ttDb9w.png) + +### 生成对抗网络(GAN) + +[视频](https://youtu.be/ondivPiwQho) / [论坛](http://forums.fast.ai/t/part-2-lesson-12-in-class/15023) + +非常热门的技术,但绝对值得深入学习课程的一部分,因为它们并没有被证明对任何事情都有用,但它们几乎就在那里并且肯定会到达那里。 我们将专注于它们在实践中肯定会有用的东西,并且在许多领域它们可能变得有用但我们还不知道。 因此我认为它们在实践中肯定会有用的区域是你在幻灯片左侧看到的那种东西 - 例如将绘图转换为渲染图片。 这是来自[2天前刚刚发布的论文](https://arxiv.org/abs/1804.04732) ,所以现在正在进行一项非常活跃的研究。 + +**从上一次讲座[** [**1:04**](https://youtu.be/ondivPiwQho%3Ft%3D1m4s) **]:**我们的多元化研究员之一Christine Payne拥有斯坦福大学的医学硕士学位,因此她有兴趣思考如果我们建立一种医学语言模型会是什么样子。 我们在第4课中简要介绍过但上次没有真正谈论过的事情之一就是这个想法,你实际上可以种下一个生成语言模型,这意味着你已经在一些语料库中训练了一个语言模型,然后你就是将从该语言模型生成一些文本。 你可以先用几句话来说“这是在语言模型中创建隐藏状态的前几个单词,然后从那里生成。 克里斯汀做了一些聪明的事情,就是用一个问题来播种它,并重复三次问题并让它从那里产生。 她提供了许多不同医学文本的语言模型,如下所示: + +![](../img/1_v6gjjQ9Eu_yyJnj5qJoMyA.png) + +杰里米对此感到有趣的是,对于那些没有医学硕士学位的人来说,这是一个可信的答案。 但它与现实无关。 他认为这是一种有趣的道德和用户体验困境。 Jeremy参与​​了一家名为doc.ai的公司,该公司正在努力做一些事情,但最终为医生和患者提供了一个应用程序,可以帮助创建一个对话用户界面,帮助他们解决他们的医疗问题。 他一直在对那个团队的软件工程师说,请不要尝试使用LSTM创建一个生成模型,因为他们会非常善于创建听起来令人印象深刻的糟糕建议 - 有点像政治专家或终身教授谁可以说具有很大权威的背叛。 所以他认为这是非常有趣的实验。 如果你做了一些有趣的实验,请在论坛,博客,Twitter上分享。 让人们了解它并让真棒的人注意到。 + +#### CIFAR10 [ [5:26](https://youtu.be/ondivPiwQho%3Ft%3D5m26s) ] + +让我们来谈谈CIFAR10,原因是我们今天要研究一些比较简单的PyTorch的东西来构建这些生成的对抗模型。 现在根本没有对GAN说话的快速支持 - 很快就会出现,但目前还没有,我们将从头开始构建大量模型。 我们已经做了很多严肃的模型建设已经有一段时间了。 我们在课程的第1部分看了CIFAR10,我们制作了一些准确率达到85%的东西,花了几个小时训练。 有趣的是,现在正在进行一场竞赛,看谁能最快地训练CIFAR10( [DAWN](https://dawn.cs.stanford.edu/benchmark/) ),目标是让它达到94%的准确率。 看看我们是否可以构建一个可以达到94%精度的架构会很有趣,因为这比我们之前的尝试要好得多。 希望在这样做的过程中,我们将学到一些关于创建良好架构的知识,这对于今天查看GAN非常有用。 此外它很有用,因为Jeremy在过去几年关于不同类型的CNN架构的论文中已经深入研究,并且意识到这些论文中的许多见解没有被广泛利用,并且显然没有被广泛理解。 因此,如果我们能够利用这种理解,他想告诉你会发生什么。 + +#### [cifar10-darknet.ipynb](https://github.com/fastai/fastai/blob/master/courses/dl2/cifar10-darknet.ipynb) [ [7:17](https://youtu.be/ondivPiwQho%3Ft%3D7m17s) ] + +笔记本电脑被称为[暗网,](https://pjreddie.com/darknet/)因为我们要看的特定架构非常接近暗网架构。 但是你会在整个过程中看到暗网结构不是整个YOLO v3端到端的东西,而只是它们在ImageNet上预先训练过来进行分类的部分。 它几乎就像你能想到的最通用的简单架构,所以它是实验的一个非常好的起点。 因此我们将其称为“暗网”,但它并不是那样,你可以摆弄它来创造绝对不是暗网的东西。 它实际上只是几乎所有基于ResNet的现代架构的基础。 + +CIFAR10是一个相当小的数据集[ [8:06](https://youtu.be/ondivPiwQho%3Ft%3D8m6s) ]。 图像大小只有32 x 32,这是一个很好的数据集,因为: + +* 与ImageNet不同,您可以相对快速地训练它 +* 相对少量的数据 +* 实际上很难识别图像,因为32乘32太小,不容易看到发生了什么。 + +这是一个不受重视的数据集,因为它已经过时了。 当他们可以使用整个服务器机房处理更大的数据时,谁想要使用小的旧数据集。 但它是一个非常好的数据集,专注于。 + +来吧,导入我们常用的东西,我们将尝试从头开始构建一个网络来训练[ [8:58](https://youtu.be/ondivPiwQho%3Ft%3D8m58s) ]。 + +``` + %matplotlib inline %reload_ext autoreload %autoreload 2 +``` + +``` + **from** **fastai.conv_learner** **import** * PATH = Path("data/cifar10/") os.makedirs(PATH,exist_ok= **True** ) +``` + +对于那些对他们的广播和PyTorch基本技能没有100%信心的人来说,这是一个非常好的练习,可以理解Jeremy如何提出这些`stats`数据。 这些数字是CIFAR10中每个通道的平均值和标准偏差。 尝试并确保您可以重新创建这些数字,看看是否可以使用不超过几行代码(无循环!)来完成。 + +因为它们相当小,我们可以使用比平时更大的批量大小,这些图像的大小是32 [ [9:46](https://youtu.be/ondivPiwQho%3Ft%3D9m46s) ]。 + +``` + classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck') stats = (np.array([ 0.4914 , 0.48216, 0.44653]), np.array([ 0.24703, 0.24349, 0.26159])) num_workers = num_cpus()//2 bs=256 sz=32 +``` + +转换[ [9:57](https://youtu.be/ondivPiwQho%3Ft%3D9m57s) ],通常我们使用这组标准的side_on转换,用于普通对象的照片。 我们不打算在这里使用它,因为这些图像非常小,试图将32 x 32的图像稍微旋转会引入大量的块状失真。 所以人们倾向于使用的标准变换是随机水平翻转,然后我们在每一侧添加4个像素(大小除以8)的填充。 一个非常有效的方法是默认情况下fastai不添加许多其他库所做的黑色填充。 
Fastai拍摄现有照片的最后4个像素并翻转并反射它,我们发现默认情况下使用反射填充可以获得更好的效果。 现在我们有了40 x 40的图像,这组训练中的变换会随机选择一个32乘32的作物,所以我们得到一点变化而不是堆。 我们可以使用普通的`from_paths`来获取我们的数据。 + +``` + tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomFlip()], pad=sz//8) data = ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs) +``` + +现在我们需要一个架构,我们将创建一个适合一个屏幕[ [11:07](https://youtu.be/ondivPiwQho%3Ft%3D11m7s) ]。 这是从头开始的。 我们使用预定义的`Conv2d` , `BatchNorm2d` , `LeakyReLU`模块,但我们没有使用任何块或任何东西。 整个过程都在一个屏幕上,所以如果你想知道我能理解一个现代的优质建筑,绝对是! 我们来研究这个。 + +``` + **def** conv_layer(ni, nf, ks=3, stride=1): **return** nn.Sequential( nn.Conv2d(ni, nf, kernel_size=ks, bias= **False** , stride=stride, padding=ks//2), nn.BatchNorm2d(nf, momentum=0.01), nn.LeakyReLU(negative_slope=0.1, inplace= **True** )) +``` + +``` + **class** **ResLayer** (nn.Module): **def** __init__(self, ni): super().__init__() self.conv1=conv_layer(ni, ni//2, ks=1) self.conv2=conv_layer(ni//2, ni, ks=3) **def** forward(self, x): **return** x.add_(self.conv2(self.conv1(x))) +``` + +``` + **class** **Darknet** (nn.Module): **def** make_group_layer(self, ch_in, num_blocks, stride=1): **return** [conv_layer(ch_in, ch_in*2,stride=stride) ] + [(ResLayer(ch_in*2)) **for** i **in** range(num_blocks)] **def** __init__(self, num_blocks, num_classes, nf=32): super().__init__() layers = [conv_layer(3, nf, ks=3, stride=1)] **for** i,nb **in** enumerate(num_blocks): layers += self.make_group_layer(nf, nb, stride=2-(i==1)) nf *= 2 layers += [nn.AdaptiveAvgPool2d(1), Flatten(), nn.Linear(nf, num_classes)] self.layers = nn.Sequential(*layers) **def** forward(self, x): **return** self.layers(x) +``` + +一个体系结构的基本出发点是它是一堆堆叠的层,一般来说,层会有某种层次[ [11:51](https://youtu.be/ondivPiwQho%3Ft%3D11m51s) ]。 在最底层,有卷积层和批量规范层之类的东西,但是只要你有卷积,你可能会有一些标准序列。 通常它会是: + +1. CONV +2. 批量规范 +3. 非线性激活(例如ReLU) + +我们将首先确定我们的基本单元是什么,并在一个函数( `conv_layer` )中定义它,这样我们就不必担心尝试保持一致性,它会使一切变得更简单。 + +**Leaky Relu** [ [12:43](https://youtu.be/ondivPiwQho%3Ft%3D12m43s) ]: + +![](../img/1_p1xIcvOk2F-EWWDgTTfW7Q.png) + +Leaky ReLU(其中_x_ <0)的梯度变化但是大约0.1或0.01的常数。 它背后的想法是,当你处于负区域时,你不会得到零梯度,这使得更新它变得非常困难。 在实践中,人们发现Leaky ReLU在较小的数据集上更有用,而在大数据集中则不太有用。 但有趣的是,对于[YOLO v3](https://pjreddie.com/media/files/papers/YOLOv3.pdf)论文,他们使用了Leaky ReLU并从中获得了很好的表现。 它很少使事情变得更糟,它往往使事情变得更好。 因此,如果您需要创建自己的架构以使您的默认设置是使用Leaky ReLU,那可能并不错。 + +你会注意到我们没有在`conv_layer`定义PyTorch模块,我们只是做`nn.Sequential` [ [14:07](https://youtu.be/ondivPiwQho%3Ft%3D14m7s) ]。 如果您阅读其他人的PyTorch代码,这是真的未充分利用。 人们倾向于将所有东西都写成带有`__init__`和`forward`的PyTorch模块,但是如果你想要的东西只是一个接一个的序列,那么它就更简洁易懂,使它成为一个`Sequential` 。 + +**剩余块** [ [14:40](https://youtu.be/ondivPiwQho%3Ft%3D14m40s) ]:如前所述,在大多数现代网络中通常存在许多单元层次结构,现在我们知道ResNet的这个单元层次结构中的下一个层次是ResBlock或残差块(参见`ResLayer` ) 。 回到我们上次做CIFAR10时,我们过度简化了这一点(作弊一点)。 我们有`x`进来,我们通过`conv` ,然后我们将它添加回`x`出去。 在真正的ResBlock中,有两个。 当我们说“conv”时,我们将它用作我们的`conv_layer` (conv,batch norm,ReLU)的快捷方式。 + +![](../img/1_unH5bhpWH7HfLCG8WozNfA.png) + +这里有一个有趣的见解是这些卷积中的通道数量[ [16:47](https://youtu.be/ondivPiwQho%3Ft%3D16m47s) ]。 我们有一些`ni`进来(一些输入通道/过滤器)。 黑暗人们设置的方式是他们让这些Res层中的每一个都吐出相同数量的通道,Jeremy喜欢这个,这就是他在`ResLayer`使用它的原因,因为它让生活更简单。 第一个转换器将通道数量减半,然后第二个转换器将它再次加倍。 所以你有这种漏斗效果,64个频道进入,第一个转换为32个频道,然后再次恢复到64个频道。 + +**问题:**为什么`LeakyReLU`中的`LeakyReLU` `inplace=True` [ [17:54](https://youtu.be/ondivPiwQho%3Ft%3D17m54s) ]? 谢谢你的询问! 
很多人都忘记了这个或者不知道它,但这是一个非常重要的记忆技术。 如果你考虑一下,这个`conv_layer` ,它是最低级别的东西,所以我们的ResNet中的所有内容一旦全部放在一起就会有很多`conv_layer` 。 如果你没有`inplace=True` ,那么它将为ReLU的输出创建一个完整的独立内存,因此它将分配一大堆完全没有必要的内存。 另一个例子是`ResLayer`中的原始`forward`看起来像: + +``` + **def** forward(self, x): **return** x + self.conv2(self.conv1(x)) +``` + +希望你们中的一些人可能还记得在PyTorch中,几乎每个函数都有一个下划线后缀版本,告诉它在原地进行。 `+`相当于`add`和就地版本`add_`所以这会减少内存使用量: + +``` + **def** forward(self, x): **return** x.add_(self.conv2(self.conv1(x))) +``` + +这些都是非常方便的小动作。 Jeremy最初忘记了`inplace=True` ,但是他不得不将批量减少到更低的数量,这让他发疯了 - 然后他意识到这种情况已经消失了。 如果你有辍学,你也可以通过辍学来做到这一点。 以下是需要注意的事项: + +* 退出 +* 所有激活功能 +* 任何算术运算 + +**问题** :在ResNet中,为什么在conv_layer [ [19:53](https://youtu.be/ondivPiwQho%3Ft%3D19m53s) ]中偏差通常设置为False? 在`Conv` ,有一个`BatchNorm` 。 请记住, `BatchNorm`每次激活都有2个可学习的参数 - 您乘以的东西和您添加的东西。 如果我们在`Conv`有偏见然后在`BatchNorm`添加另一个东西,我们将添加两个完全没有意义的东西 - 这是两个权重,其中一个会做。 因此,如果您在`Conv`之后有BatchNorm,您可以告诉`BatchNorm`不要包含添加位,或者更容易告诉`Conv`不要包含偏差。 没有特别的伤害,但同样,它需要更多的记忆,因为它需要跟踪更多的渐变,所以最好避免。 + +另外一个小技巧是,大多数人的`conv_layer`都有填充作为参数[ [21:11](https://youtu.be/ondivPiwQho%3Ft%3D21m11s) ]。 但一般来说,你应该能够轻松地计算填充。 如果你的内核大小为3,那么显然每侧会有一个单元重叠,所以我们要填充1.否则,如果它的内核大小为1,那么我们不需要任何填充。 所以一般来说,内核大小“整数除以”的填充是你需要的。 有时会有一些调整,但在这种情况下,这非常有效。 再次,尝试通过计算机为我计算内容来简化我的代码,而不是我自己必须这样做。 + +![](../img/1_Pc3_ut-tOnPm5FLdYqRrOA.png) + +两个`conv_layer`的另一件事[ [22:14](https://youtu.be/ondivPiwQho%3Ft%3D22m14s) ]:我们有这个瓶颈的想法(减少通道然后再增加它们),还有使用的内核大小。 第一个有1比1的`Conv` 。 什么实际发生在1对1转? 如果我们有4个4格的网格和32个滤波器/通道,我们将逐步进行转换,转换器的内核看起来像中间的那个。 当我们谈论内核大小时,我们从不提及最后一块 - 但是让我们说它是1乘1乘32,因为这是过滤器的一部分并过滤掉了。 内核以黄色放在第一个单元格上,我们得到这32个深位的点积,这给了我们第一个输出。 然后我们将它移动到第二个单元格并获得第二个输出。 因此,网格中的每个点都会有一堆点积。 它允许我们以任何方式在通道维度中更改维度。 我们正在创建`ni//2`过滤器,我们将有`ni//2`点积,这些产品基本上是输入通道的加权平均值。 通过非常少的计算,它可以让我们添加额外的计算和非线性步骤。 这是一个很酷的技巧,利用这些1比1的转换,创建这个瓶颈,然后再用3乘3转将其拉出 - 这将充分利用输入的2D性质。 或者,1乘1转并没有充分利用它。 + +![](../img/1_-lUndrOFGp_FdJ27T_Px-g.png) + +这两行代码中没有太多内容,但它是对你的理解和直觉的一个非常好的考验[ [25:17](https://youtu.be/ondivPiwQho%3Ft%3D25m17s) ] - 为什么它有效? 为什么张量排队? 为什么尺寸排列很好? 为什么这是个好主意? 它到底在做什么? 摆弄它是一件非常好的事情。 也许在Jupyter Notebook中创建一些小的,自己运行它们,看看输入和输出是什么输入和输出。 真的感受到了这一点。 一旦你这样做了,你就可以玩弄不同的东西。 + +其中一篇真正[未被重视的](https://youtu.be/ondivPiwQho%3Ft%3D26m9s)论文是[ [26:09](https://youtu.be/ondivPiwQho%3Ft%3D26m9s) ] - [广泛的剩余网络](https://arxiv.org/abs/1605.07146) 。 这是非常简单的纸张,但他们所做的是他们用这两行代码摆弄: + +* 我们做了什么`ni*2`而不是`ni//2` ? +* 如果我们添加了`conv3`怎么`conv3` ? + +他们提出了这种简单的符号来定义两行代码的样子,并展示了大量的实验。 他们展示的是,这种减少ResNet中几乎普遍的渠道数量的瓶颈方法可能不是一个好主意。 事实上,从实验中,绝对不是一个好主意。 因为它会让你创建真正的深层网络。 创建ResNet的人因创建1001层网络而闻名。 但是关于1001层的事情是你在第1层完成之前无法计算第2层。在完成第2层计算之前,你无法计算第3层。所以它是顺序的。 GPU不喜欢顺序。 所以他们展示的是,如果你有更少的层,但每层有更多的计算 - 所以一个简单的方法是删除`//2` ,没有其他变化: + +![](../img/1_89Seymgfa5Bdx1_EXBW_lA.png) + +在家尝试一下。 尝试运行CIFAR,看看会发生什么。 甚至乘以2或摆弄。 这可以让你的GPU做更多的工作而且非常有趣,因为绝大多数谈论不同架构性能的论文从未实际计算通过它运行批量需要多长时间。 他们说“这个每批需要X次浮点运算”,但他们从来没有像真正的实验主义者那样费心去运行,并且发现它是更快还是更慢。 许多真正着名的体系结构现在变得像糖蜜一样缓慢并且需要大量的内存并且完全没用,因为研究人员从来没有真正费心去看它们是否很快并且实际上看它们是否适合RAM正常批量大小。 因此,广泛的ResNet纸张不同寻常之处在于它实际上需要花费多长时间才能获得同样的洞察力的YOLO v3纸张。 他们可能错过了Wide ResNet论文,因为YOLO v3论文得出了许多相同的结论,但Jeremy不确定他们选择了Wide ResNet论文,所以他们可能不知道所有这些工作都已完成。 很高兴看到人们真正计时并注意到实际上有意义的东西。 + +**问题** :您对SELU(缩放指数线性单位)有何看法? [ [29:44](https://youtu.be/ondivPiwQho%3Ft%3D29m44s) ] [SELU](https://youtu.be/ondivPiwQho%3Ft%3D29m44s)主要用于完全连接的层,它允许你摆脱批量规范,基本的想法是,如果你使用这个不同的激活功能,它是自我规范化。 自我归一化意味着它将始终保持单位标准差和零均值,因此您不需要批量规范。 它并没有真正去过任何地方,原因是因为它非常挑剔 - 你必须使用一个非常具体的初始化,否则它不会以恰当的标准偏差和平均值开始。 很难将它与嵌入之类的东西一起使用,如果你这样做,你必须使用一种特殊的嵌入初始化,这对于嵌入是没有意义的。 你完成所有这些工作,很难做到正确,如果你最终做到了,那有什么意义呢? 
好吧,你已经设法摆脱了一些批量标准层,无论如何都没有真正伤害你。 这很有趣,因为SELU论文 - 人们注意到它的主要原因是它是由LSTM的发明者创造的,并且它有一个巨大的数学附录。 因此人们认为“很多来自一个着名人物的数学 - 它一定很棒!”但在实践中,杰里米并没有看到任何人使用它来获得任何最先进的结果或赢得任何比赛。 + +`Darknet.make_group_layer`包含一堆`ResLayer` [ [31:28](https://youtu.be/ondivPiwQho%3Ft%3D31m28s) ]。 `group_layer`将有一些频道/过滤器进入。我们将通过使用标准的`conv_layer`加倍进入的频道数量。 可选地,我们将使用2的步幅将网格大小减半。然后我们将做一大堆ResLayers - 我们可以选择多少(2,3,8等),因为记住ResLayers不会改变网格大小和它们不会更改通道数,因此您可以添加任意数量的通道而不会出现任何问题。 这将使用更多的计算和更多的RAM,但没有其他理由,你不能添加任意多的。 因此, `group_layer`最终会使通道数增加一倍,因为初始卷积使通道数增加一倍,并且根据我们在`stride`传递的内容,如果我们设置`stride=2` ,它也可以将网格大小减半。 然后我们可以根据需要进行一大堆Res块计算。 + +为了定义我们的`Darknet` ,我们将传递一些看起来像这样的东西[ [33:13](https://youtu.be/ondivPiwQho%3Ft%3D33m13s) ]: + +``` + m = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32) m = nn.DataParallel(m, [1,2,3]) +``` + +这说的是创建五个组层:第一个将包含1个额外的ResLayer,第二个将包含2个,然后是4个,6个,3个,我们希望从32个过滤器开始。 ResLayers中的第一个将包含32个过滤器,并且只有一个额外的ResLayer。 第二个,它将使过滤器的数量增加一倍,因为这是我们每次有新的组层时所做的事情。 所以第二个将有64,然后是128,256,512,那就是它。 几乎所有的网络都将成为那些层,并记住,这些组层中的每一个在开始时也都有一个卷积。 那么我们所拥有的就是在所有这一切发生之前,我们将在一开始就有一个卷积层,最后我们将进行标准的自适应平均池化,展平和线性层来创建数字最后的课程。 总结[ [34:44](https://youtu.be/ondivPiwQho%3Ft%3D34m44s) ],在一端的一个卷积,自适应池和另一端的一个线性层,在中间,这些组层每个由卷积层和随后的`n`个ResLayers组成。 + +**自适应平均汇集** [ [35:02](https://youtu.be/ondivPiwQho%3Ft%3D35m2s) ]:杰里米曾多次提到这一点,但他还没有看到任何代码,任何地方,任何地方,使用自适应平均池。 他所看到的每一个人都像`nn.AvgPool2d(n)`那样写,其中`n`是一个特定的数字 - 这意味着它现在与特定的图像大小相关联,这绝对不是你想要的。 因此,大多数人仍然认为特定架构与特定大小相关联。 当人们认为这是一个巨大的问题,因为它确实限制了他们使用较小尺寸来启动建模或使用较小尺寸进行实验的能力。 + +**顺序** [ [35:53](https://youtu.be/ondivPiwQho%3Ft%3D35m53s) ]:创建体系结构的一个好方法是首先创建一个列表,在这种情况下,这是一个只有一个`conv_layer`的列表,而`make_group_layer`返回另一个列表。 然后我们可以使用`+=`将该列表附加到上一个列表,并对包含`AdaptiveAvnPool2d`另一个列表执行相同的操作。 最后,我们将调用所有这些层的`nn.Sequential` 。 现在, `forward`只是`self.layers(x)` 。 + +![](../img/1_nr69J3I7lNPlsblmrLt15A.png) + +这是一个很好的图片,说明如何使您的架构尽可能简单。 有很多你可以摆弄。 您可以参数化`ni`的分隔符,使其成为您传入的数字以传递不同的数字 - 可能会执行2次。 您还可以传入更改内核大小的内容,或更改卷积层的数量。 杰里米有一个版本,他将为你运行,它实现了Wide ResNet论文中的所有不同参数,所以他可以摆弄一下,看看效果如何。 + +![](../img/1_mR3BupmhN_XGo34Qvm58Sg.png) + +``` + lr = 1.3 learn = ConvLearner.from_model_data(m, data) learn.crit = nn.CrossEntropyLoss() learn.metrics = [accuracy] wd=1e-4 +``` + +``` + %time learn.fit(lr, 1, wds=wd, cycle_len=30, use_clr_beta=(20, 20, 0.95, 0.85)) +``` + +一旦我们得到了它,我们可以使用`ConvLearner.from_model_data`来获取我们的PyTorch模块和模型数据对象,并将它们变成学习者[ [37:08](https://youtu.be/ondivPiwQho%3Ft%3D37m8s) ]。 给它一个标准,如果我们愿意,添加一个指标,然后我们可以适应和离开我们去。 + +**问题** :您能解释一下自适应平均汇总吗? 如何设置1工作[ [37:25](https://youtu.be/ondivPiwQho%3Ft%3D37m25s) ]? 当然。 通常当我们进行平均合并时,假设我们有4x4而且我们做了`avgpool((2, 2))` [ [40:35](https://youtu.be/ondivPiwQho%3Ft%3D40m35s) ]。 这会创建2x2区域(下方为蓝色)并取这四个区域的平均值。 如果我们传入`stride=1` ,则下一个是2x2,显示为绿色并取平均值。 所以这就是正常的2x2平均合并量。 如果我们没有任何填充,那将会吐出3x3。 如果我们想要4x4,我们可以添加填充。 + +![](../img/1_vTPZGULUC12lQplYtkGZuQ.png) + +如果我们想要1x1怎么办? 
然后我们可以说`avgpool((4,4), stride=1)`将以黄色做4x4并且平均整个批次导致1x1。 但这只是一种方法。 而不是说汇集过滤器的大小,为什么我们不说“我不关心输入网格的大小是什么。 我总是一个接一个地想要“。 这就是你说`adap_avgpool(1)` 。 在这种情况下,您没有说明池化过滤器的大小,而是说出我们想要的输出大小。 我们想要一个接一个的东西。 如果你输入一个整数`n` ,它假定你的意思是`n`乘以`n` 。 在这种情况下,具有4x4网格的自适应平均合并1与平均合并(4,4)相同。 如果它是7x7网格,它将与平均合并(7,7)相同。 它是相同的操作,它只是表达它的方式,无论输入如何,我们想要一些大小的输出。 + +**DAWNBench** [ [37:43](https://youtu.be/ondivPiwQho%3Ft%3D37m43s) ]:让我们看看我们如何利用我们简单的网络来对抗这些最先进的结果。 杰里米准备好了。 我们已经把所有这些东西都放到了一个简单的Python脚本中,他修改了他提到的一些参数来创建一些他称之为`wrn_22`网络的东西,这种东西不是正式存在但是它对我们所讨论的参数有一些变化基于杰里米的实验。 它有一堆很酷的东西,如: + +* 莱斯利史密斯的一个周期 +* 半精度浮点实现 + +![](../img/1_OhRBmhkrMWXgMpDHKZGJJQ.png) + +这将在具有8个GPU和Volta架构GPU的AWS p3上运行,这些GPU具有对半精度浮点的特殊支持。 Fastai是第一个将Volta优化的半精度浮点实际集成到库中的库,因此您可以自动`learn.half()`并获得该支持。 它也是第一个整合一个周期的人。 + +它实际上做的是使用PyTorch的多GPU支持[ [39:35](https://youtu.be/ondivPiwQho%3Ft%3D39m35s) ]。 由于有八个GPU,它实际上将启动八个独立的Python处理器,每个处理器将进行一些训练,然后最后它将梯度更新传递回将要集成的主进程他们都在一起。 所以你会看到很多进度条一起出现。 + +![](../img/1_7JkxOliX34lgAkzwGbl1tA.png) + +当你这样做时,你可以看到训练三到四秒。 在其他地方,当杰里米早些时候训练时,他每个时期得到30秒。 所以这样做,我们可以训练的东西快10倍,非常酷。 + +**检查状态** [ [43:19](https://youtu.be/ondivPiwQho%3Ft%3D43m19s) ]: + +![](../img/1_fgY5v-w-44eIBkkS4fPqEA.png) + +完成! 我们达到了94%,耗时3分11秒。 以前最先进的是1小时7分钟。 是否值得摆弄这些参数并了解这些架构如何实际工作而不只是使用开箱即用的东西? 好吧,圣洁的废话。 我们刚刚使用了一个公开的实例(我们使用了一个现场实例,所以每小时花费8美元--3分钟40美分),从头开始训练这比以往任何人都快20倍。 所以这是最疯狂的最先进的结果之一。 我们已经看过很多,但是这个只是把它从水中吹走了。 这部分归功于摆弄架构的那些参数,主要是坦率地说使用Leslie Smith的一个周期。 提醒它正在做什么[ [44:35](https://youtu.be/ondivPiwQho%3Ft%3D44m35s) ],对于学习率,它创造了与向下路径同样长的向上路径,因此它是真正的三角形循环学习率(CLR)。 按照惯例,您可以选择x和y的比率(即起始LR /峰值LR)。 在 + +![](../img/1_5lQZ0Jln6Cn29rd_9Bvzfw.png) + +在这种情况下,我们选择了50比例。 所以我们开始时学习率要小得多。 然后它就有了这个很酷的想法,你可以说你的时代占了三角形底部的百分比从几乎一直到零 - 这是第二个数字。 所以15%的批次都是从三角形底部进一步消耗的。 + +![](../img/1_E0gxTQ5sf4XSceo9pWKxWQ.png) + +这不是一个周期的唯一作用,我们也有动力。 动量从.95到.85。 换句话说,当学习率非常低时,我们会使用很多动力,当学习率非常高时,我们使用的动量非常小,这很有意义但是直到Leslie Smith在论文中表明这一点,Jeremy从来没有看到有人以前做过。 这是一个非常酷的技巧。 您现在可以通过在fastai中使用`use-clr-beta`参数( [Sylvain的论坛帖子](http://forums.fast.ai/t/using-use-clr-beta-and-new-plotting-tools/14702) )来使用它,您应该能够复制最先进的结果。 您可以在自己的计算机或纸张空间中使用它,您唯一不会得到的是多GPU部件,但这使得它更容易进行训练。 + +**问题** : `make_group_layer`包含步幅等于2,因此这意味着步幅为第1层,第2步为第二层。 它背后的逻辑是什么? 
通常我所看到的步伐是奇怪的[ [46:52](https://youtu.be/ondivPiwQho%3Ft%3D46m52s) ]。 跨栏是一两个。 我认为你在考虑内核大小。 所以stride = 2意味着我跳过两个意味着你将网格大小减半。 所以我认为你可能会对步幅和内核大小感到困惑。 如果你有一个步幅,网格大小不会改变。 如果你有两步,那就确实如此。 在这种情况下,因为这是CIFAR10,32乘32很小,我们不会经常将网格大小减半,因为很快我们就会耗尽单元格。 这就是为什么第一层有一个步幅,所以我们不会立即减小网格尺寸。 这是一种很好的方式,因为这就是为什么我们在第一个`Darknet([1, 2, 4, 6, 3], …)`数字很​​少。 我们可以从大网格上没有太多的计算开始,然后随着网格变得越来越小,我们可以逐渐进行越来越多的计算,因为较小的网格计算将花费更少的时间 + +### 生成性对抗网络(GAN)[ [48:49](https://youtu.be/ondivPiwQho%3Ft%3D48m49s) ] + +* [Wasserstein GAN](https://arxiv.org/abs/1701.07875) +* [用深层卷积生成对抗网络学习无监督表示](https://arxiv.org/abs/1511.06434) + +我们将谈论生成对抗网络,也称为GAN,特别是我们将关注Wasserstein GAN论文,其中包括继续创建PyTorch的Soumith Chintala。 Wasserstein GAN(WGAN)深受卷积生成对抗性网络论文的影响,该论文也与Soumith有关。 这是一篇非常有趣的论文。 很多看起来像这样: + +![](../img/1_9zXXZvCNC8_eF_V9LJIolw.png) + +好消息是你可以跳过那些位,因为还有一些看起来像这样: + +![](../img/1_T90I-RKpUzV7yyo_TOqowQ.png) + +很多论文都有一个理论部分,似乎完全是为了超越评论者对理论的需求。 WGAN论文不是这样。 理论位实际上很有趣 - 你不需要知道它就可以使用它,但是如果你想了解一些很酷的想法并看到为什么这个特殊的算法背后的想法,它绝对是迷人的。 在这篇论文出来之前,杰里米知道没有人研究它所基于的数学,所以每个人都必须学习数学。 这篇文章很好地布置了所有的部分(你必须自己做一堆阅读)。 因此,如果你有兴趣深入挖掘一些论文背后的深层数学,看看研究它是什么样的,我会选择这个,因为在理论部分结束时,你会说“我现在可以看到他们为什么使这个算法成为现实。“ + +GAN的基本思想是它是一个生成模型[ [51:23](https://youtu.be/ondivPiwQho%3Ft%3D51m23s) ]。 它可以创建句子,创建图像或生成某些东西。 它会尝试创造一个很难分辨生成的东西和真实东西之间的东西的东西。 因此,可以使用生成模型来交换视频 - 这是一个非常有争议的深刻假货和伪造的色情内容。 它可以用来伪造某人的声音。 它可以用来假冒医学问题的答案 - 但在这种情况下,它不是真的假,它可能是一个医学问题的生成性答案,实际上是一个很好的答案,所以你生成语言。 例如,您可以为图像生成标题。 所以生成模型有很多有趣的应用。 但一般来说,它们需要足够好,例如,如果你使用它来为Carrie Fisher在下一部星球大战电影中自动创建一个新场景而且她不再玩那个部分,你想尝试生成她的形象看起来一样,然后它必须欺骗星球大战的观众思考“好吧,这看起来不像一些奇怪的嘉莉费舍尔 - 看起来像真正的嘉莉费舍尔。 或者,如果您正在尝试生成医学问题的答案,您希望生成能够很好地清晰读取的英语,并且听起来具有权威性和意义。 生成对抗网络的想法是我们不仅要创建生成图像的生成模型,而且要创建第二个模型来尝试选择哪些是真实的,哪些是生成的(我们称之为“假的”) )。 因此,我们有一个生成器,它将创建我们的虚假内容和一个鉴别器,它将试图善于识别哪些是真实的,哪些是假的。 因此,将会有两个模型,它们将是对抗性的,意味着发电机将试图不断变得更好地愚弄鉴别者认为假货是真实的,并且鉴别器将试图在辨别力方面继续变得更好真假之间。 所以他们将要面对面。 它基本上就像杰里米刚才描述的那样容易[ [54:14](https://youtu.be/ondivPiwQho%3Ft%3D54m14s) ]: + +* 我们将在PyTorch中构建两个模型 +* 我们将创建一个训练循环,首先说鉴别器的损失函数是“你能告诉真实和假的区别,然后更新它的权重。 +* 我们将为生成器创建一个损失函数,“你可以生成一些愚弄鉴别器并从损失中更新权重的东西。 +* 我们将循环几次,看看会发生什么。 + +#### 看代码[ [54:52](https://youtu.be/ondivPiwQho%3Ft%3D54m52s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/wgan.ipynb) + +你可以用GANS做很多不同的事情。 我们将做一些有点无聊但易于理解的事情,它甚至可能很酷,我们将从无到有生成一些图片。 我们只是想拍一些照片。 具体来说,我们将把它画成卧室的照片。 希望你有机会在一周内使用自己的数据集来解决这个问题。 如果您选择一个像ImageNet这样变化很大的数据集,然后让GAN尝试创建ImageNet图片,那么它往往不会很好,因为它不够清晰,您想要的图片。 所以最好给它,例如,有一个名为[CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)的数据集,这是名人面孔的图片,与GAN很有效。 你创造了真正清晰的名人面孔,这些面孔实际上并不存在。 卧室数据集也很好 - 相同类型的图片。 + +有一种叫做LSUN场景分类数据集[ [55:55](https://youtu.be/ondivPiwQho%3Ft%3D55m55s) ]。 + +``` + **from** **fastai.conv_learner** **import** * **from** **fastai.dataset** **import** * **import** **gzip** +``` + +下载LSUN场景分类数据集卧室类别,解压缩并将其转换为jpg文件(脚本文件夹位于`dl2`文件夹中): + +``` + curl 'http://lsun.cs.princeton.edu/htbin/download.cgi?tag=latest&category=bedroom&set=train' -o bedroom.zip +``` + +``` + unzip bedroom.zip +``` + +``` + pip install lmdb +``` + +``` + python lsun-data.py {PATH}/bedroom_train_lmdb --out_dir {PATH}/bedroom +``` + +这不在Windows上测试 - 如果它不起作用,您可以使用Linux框转换文件,然后将其复制。 或者,您可以从Kaggle数据集下载[此20%样本](https://www.kaggle.com/jhoward/lsun_bedroom) 。 + +``` + PATH = Path('data/lsun/') IMG_PATH = PATH/'bedroom' CSV_PATH = PATH/'files.csv' TMP_PATH = PATH/'tmp' TMP_PATH.mkdir(exist_ok= **True** ) +``` + +在这种情况下,在处理我们的数据时,更容易使用CSV路径。 因此,我们生成一个包含我们想要的文件列表的CSV,以及一个假标签“0”,因为我们根本没有这些标签。 One CSV file contains everything in that bedroom dataset, and another one contains random 10%. 
It is nice to do that because then we can most of the time use the sample when we are experimenting because there is well over a million files even just reading in the list takes a while. + +``` + files = PATH.glob('bedroom/**/*.jpg') with CSV_PATH.open('w') as fo: for f in files: fo.write(f'{f.relative_to(IMG_PATH)},0 \n ') +``` + +``` + # Optional - sampling a subset of files CSV_PATH = PATH/'files_sample.csv' +``` + +``` + files = PATH.glob('bedroom/**/*.jpg') with CSV_PATH.open('w') as fo: for f in files: if random.random()<0.1: fo.write(f'{f.relative_to(IMG_PATH)},0 \n ') +``` + +This will look pretty familiar [ [57:10](https://youtu.be/ondivPiwQho%3Ft%3D57m10s) ]. This is before Jeremy realized that sequential models are much better. So if you compare this to the previous conv block with a sequential model, there is a lot more lines of code here — but it does the same thing of conv, ReLU, batch norm. + +``` + class ConvBlock (nn.Module): def __init__(self, ni, no, ks, stride, bn= True , pad= None ): super().__init__() if pad is None : pad = ks//2//stride self.conv = nn.Conv2d(ni, no, ks, stride, padding=pad, bias= False ) self.bn = nn.BatchNorm2d(no) if bn else None self.relu = nn.LeakyReLU(0.2, inplace= True ) def forward(self, x): x = self.relu(self.conv(x)) return self.bn(x) if self.bn else x +``` + +The first thing we are going to do is to build a discriminator [ [57:47](https://youtu.be/ondivPiwQho%3Ft%3D57m47s) ]. A discriminator is going to receive an image as an input, and it's going to spit out a number. The number is meant to be lower if it thinks this image is real. Of course “what does it do for a lower number” thing does not appear in the architecture, that will be in the loss function. So all we have to do is to create something that takes an image and spits out a number. A lot of this code is borrowed from the original authors of this paper, so some of the naming scheme is different to what we are used to. But it looks similar to what we had before. We start out with a convolution (conv, ReLU, batch norm). Then we have a bunch of extra conv layers — this is not going to use a residual so it looks very similar to before a bunch of extra layers but these are going to be conv layers rather than res layers. At the end, we need to append enough stride 2 conv layers that we decrease the grid size down to no bigger than 4x4\. So it's going to keep using stride 2, divide the size by 2, and repeat till our grid size is no bigger than 4\. This is quite a nice way of creating as many layers as you need in a network to handle arbitrary sized images and turn them into a fixed known grid size. + +**Question** : Does GAN need a lot more data than say dogs vs. cats or NLP? Or is it comparable [ [59:48](https://youtu.be/ondivPiwQho%3Ft%3D59m48s) ]? Honestly, I am kind of embarrassed to say I am not an expert practitioner in GANs. The stuff I teach in part one is things I am happy to say I know the best way to do these things and so I can show you state-of-the-art results like we just did with CIFAR10 with the help of some of the students. I am not there at all with GANs so I am not quite sure how much you need. In general, it seems it needs quite a lot but remember the only reason we didn't need too much in dogs and cats is because we had a pre-trained model and could we leverage pre-trained GAN models and fine tune them, probably. I don't think anybody has done it as far as I know. That could be really interesting thing for people to think about and experiment with. 
Maybe people have done it and there is some literature there we haven't come across. I'm somewhat familiar with the main pieces of literature in GANs but I don't know all of it, so maybe I've missed something about transfer learning in GANs. But that would be the trick to not needing too much data. + +**Question** : So the huge speed-up a combination of one cycle learning rate and momentum annealing plus the eight GPU parallel training in the half precision? Is that only possible to do the half precision calculation with consumer GPU? Another question, why is the calculation 8 times faster from single to half precision, while from double the single is only 2 times faster [ [1:01:09](https://youtu.be/ondivPiwQho%3Ft%3D1h1m9s) ]? Okay, so the CIFAR10 result, it's not 8 times faster from single to half. It's about 2 or 3 times as fast from single to half. NVIDIA claims about the flops performance of the tensor cores, academically correct, but in practice meaningless because it really depends on what calls you need for what piece — so about 2 or 3x improvement for half. So the half precision helps a bit, the extra GPUs helps a bit, the one cycle helps an enormous amount, then another key piece was the playing around with the parameters that I told you about. So reading the wide ResNet paper carefully, identifying the kinds of things that they found there, and then writing a version of the architecture you just saw that made it really easy for us to fiddle around with parameters, staying up all night trying every possible combination of different kernel sizes, numbers of kernels, number of layer groups, size of layer groups. And remember, we did a bottleneck but actually we tended to focus instead on widening so we increase the size and then decrease it because it takes better advantage of the GPU. So all those things combined together, I'd say the one cycle was perhaps the most critical but every one of those resulted in a big speed-up. That's why we were able to get this 30x improvement over the state-of-the-art CIFAR10\. We have some ideas for other things — after this DAWN bench finishes, maybe we'll try and go even further to see if we can beat one minute one day. That'll be fun. + +``` + class DCGAN_D (nn.Module): def __init__(self, isize, nc, ndf, n_extra_layers=0): super().__init__() assert isize % 16 == 0, "isize has to be a multiple of 16" self.initial = ConvBlock(nc, ndf, 4, 2, bn= False ) csize,cndf = isize/2,ndf self.extra = nn.Sequential(*[ConvBlock(cndf, cndf, 3, 1) for t in range(n_extra_layers)]) pyr_layers = [] while csize > 4: pyr_layers.append(ConvBlock(cndf, cndf*2, 4, 2)) cndf *= 2; csize /= 2 self.pyramid = nn.Sequential(*pyr_layers) self.final = nn.Conv2d(cndf, 1, 4, padding=0, bias= False ) def forward(self, input): x = self.initial(input) x = self.extra(x) x = self.pyramid(x) return self.final(x).mean(0).view(1) +``` + +So here is our discriminator [ [1:03:37](https://youtu.be/ondivPiwQho%3Ft%3D1h3m37s) ].The important thing to remember about an architecture is it doesn't do anything rather than have some input tensor size and rank, and some output tensor size and rank. As you see the last conv has one channel. This is different from what we are used to because normally our last thing is a linear block. But our last layer here is a conv block. It only has one channel but it has a grid size of something around 4x4 (no more than 4x4). So we are going to spit out (let's say it's 4x4), 4 by 4 by 1 tensor. What we then do is we then take the mean of that. 
So it goes from 4x4x1 to a scalar. This is kind of like the ultimate adaptive average pooling because we have something with just one channel and we take the mean. So this is a bit different — normally we first do average pooling and then we put it through a fully connected layer to get our one thing out. But this is getting one channel out and then taking the mean of that. Jeremy suspects that it would work better if we did the normal way, but he hasn't tried it yet and he doesn't really have a good enough intuition to know whether he is missing something — but it will be an interesting experiment to try if somebody wants to stick an adaptive average pooling layer and a fully connected layer afterwards with a single output. + +So that's a discriminator. Let's assume we already have a generator — somebody says “okay, here is a generator which generates bedrooms. I want you to build a model that can figure out which ones are real and which ones aren't”. We are going to take the dataset and label bunch of images which are fake bedrooms from the generator, and a bunch of images of real bedrooms from LSUN dataset to stick a 1 or a 0 on each one. Then we'll try to get the discriminator to tell the difference. So that is going to be simple enough. But we haven't been given a generator. We need to build one. We haven't talked about the loss function yet — we are going to assume that there's some loss function that does this thing. + +#### **Generator** [ [1:06:15](https://youtu.be/ondivPiwQho%3Ft%3D1h6m15s) ] + +A generator is also an architecture which doesn't do anything by itself until we have a loss function and data. But what are the ranks and sizes of the tensors? The input to the generator is going to be a vector of random numbers. In the paper, they call that the “prior.” How big? We don't know. The idea is that a different bunch of random numbers will generate a different bedroom. So our generator has to take as input a vector, stick it through sequential models, and turn it into a rank 4 tensor (rank 3 without the batch dimension) — height by width by 3\. So in the final step, `nc` (number of channel) is going to have to end up being 3 because it's going to create a 3 channel image of some size. + +``` + class DeconvBlock (nn.Module): def __init__(self, ni, no, ks, stride, pad, bn= True ): super().__init__() self.conv = nn.ConvTranspose2d(ni, no, ks, stride, padding=pad, bias= False ) self.bn = nn.BatchNorm2d(no) self.relu = nn.ReLU(inplace= True ) def forward(self, x): x = self.relu(self.conv(x)) return self.bn(x) if self.bn else x +``` + +``` + class DCGAN_G (nn.Module): def __init__(self, isize, nz, nc, ngf, n_extra_layers=0): super().__init__() assert isize % 16 == 0, "isize has to be a multiple of 16" cngf, tisize = ngf//2, 4 while tisize!=isize: cngf*=2; tisize*=2 layers = [DeconvBlock(nz, cngf, 4, 1, 0)] csize, cndf = 4, cngf while csize < isize//2: layers.append(DeconvBlock(cngf, cngf//2, 4, 2, 1)) cngf //= 2; csize *= 2 layers += [DeconvBlock(cngf, cngf, 3, 1, 1) for t in range(n_extra_layers)] layers.append(nn.ConvTranspose2d(cngf, nc, 4, 2, 1, bias= False )) self.features = nn.Sequential(*layers) def forward(self, input): return F.tanh(self.features(input)) +``` + +**Question** : In ConvBlock, is there a reason why batch norm comes after ReLU (ie `self.bn(self.relu(…))` ) [ [1:07:50](https://youtu.be/ondivPiwQho%3Ft%3D1h7m50s) ]? 
I would normally expect to go ReLU then batch norm [ [1:08:23](https://youtu.be/ondivPiwQho%3Ft%3D1h8m23s) ] that this is actually the order that makes sense to Jeremy. The order we had in the darknet was what they used in the darknet paper, so everybody seems to have a different order of these things. In fact, most people for CIFAR10 have a different order again which is batch norm → ReLU → conv which is a quirky way of thinking about it, but it turns out that often for residual blocks that works better. That is called a “pre-activation ResNet.” There is a few blog posts out there where people have experimented with different order of those things and it seems to depend a lot on what specific dataset it is and what you are doing with — although the difference in performance is small enough that you won't care unless it's for a competition. + +#### Deconvolution [ [1:09:36](https://youtu.be/ondivPiwQho%3Ft%3D1h9m36s) ] + +So the generator needs to start with a vector and end up with a rank 3 tensor. We don't really know how to do that yet. We need to use something called a “deconvolution” and PyTorch calls it transposed convolution — same thing, different name. Deconvolution is something which rather than decreasing the grid size, it increases the grid size. So as with all things, it's easiest to see in an Excel spreadsheet. + +Here is a convolution. We start, let's say, with a 4 by 4 grid cell with a single channel. Let's put it through a 3 by 3 kernel with a single output filter. So we have a single channel in, a single filter kernel, so if we don't add any padding, we are going to end up with 2 by 2\. Remember, the convolution is just the sum of the product of the kernel and the appropriate grid cell [ [1:11:09](https://youtu.be/ondivPiwQho%3Ft%3D1h11m9s) ]. So there is our standard 3 by 3 conv one channel one filter. + +![](../img/1_FqkDO90rEDwa_CgxTAlyIQ.png) + +So the idea now is we want to go the opposite direction [ [1:11:25](https://youtu.be/ondivPiwQho%3Ft%3D1h11m25s) ]. We want to start with our 2 by 2 and we want to create a 4 by 4\. Specifically we want to create the same 4 by 4 that we started with. And we want to do that by using a convolution. How would we do that? + +If we have a 3 by 3 convolution, then if we want to create a 4 by 4 output, we are going to need to create this much padding: + +![](../img/1_flOxFmF21kUyLpPDJ6kr-w.png) + +Because with this much padding, we are going to end up with 4 by 4\. So let's say our convolutional filter was just a bunch of zeros then we can calculate our error for each cell just by taking this subtraction: + +![](../img/1_HKcU-wgdLPgxd5kJfEkmlg.png) + +Then we can get the sum of absolute values (L1 loss) by summing up the absolute values of those errors: + +![](../img/1_mjLTOFUXneXGeER4hKj4Kw.png) + +So now we could use optimization, in Excel it's called “solver” to do a gradient descent. So we will set the Total cell equal to minimum and we'll try and reduce our loss by changing our filter. You can see it's come up with a filter such that Result is almost like Data. It's not perfect, and in general, you can't assume that a deconvolution can exactly create the same exact thing you want because there is just not enough. Because there is 9 things in the filter and 16 things in the result. But it's made a pretty good attempt. So this is what a deconvolution looks like — a stride 1, 3x3 deconvolution on a 2x2 grid cell input. 
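The same shape arithmetic can be checked directly in PyTorch with a minimal, purely illustrative snippet (random weights, so only the output shape is meaningful):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)  # a single-channel 2x2 input
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0, bias=False)
print(deconv(x).shape)       # torch.Size([1, 1, 4, 4]) -- a stride-1 3x3 deconv grows 2x2 into 4x4
```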
+

![](../img/1_QzJe8qhpZl6hfKAB0Zw-vQ.png)

**Question** : How difficult is it to create a discriminator to identify fake news vs. real news [ [1:13:43](https://youtu.be/ondivPiwQho%3Ft%3D1h13m43s) ]? You don't need anything special — that's just a classifier. You would just use the NLP classifier from the previous class and lesson 4\. In that case, there is no generative piece, so you just need a dataset that says these are the things we believe are fake news and these are the things we consider to be real news, and it should actually work very well. To the best of our knowledge, if you try it you should get as good a result as anybody else has got — whether it's good enough to be useful in practice, Jeremy doesn't know. The best thing you could do at this stage would be to build a kind of triage that says these things look pretty sketchy based on how they are written, and then a human could go in and fact-check them. An NLP classifier or RNN can't fact-check things, but it could recognize that these are written in that kind of highly popularized style which fake news is often written in, so maybe these ones are worth paying attention to. That is probably the best you could hope for without drawing on some kind of external data sources. But it's important to remember the discriminator is basically just a classifier, and you don't need any special techniques beyond what we've already learned about NLP classification.

#### ConvTranspose2d [ [1:16:00](https://youtu.be/ondivPiwQho%3Ft%3D1h16m) ]

To do deconvolution in PyTorch, just say:

`nn.ConvTranspose2d(ni, no, ks, stride, padding=pad, bias=False)`

* `ni` : number of input channels
* `no` : number of output channels
* `ks` : kernel size

The reason it's called a ConvTranspose is that this calculation turns out to be the same as the one used for the gradient of a convolution. That's why they call it that.

**Visualizing** [ [1:16:33](https://youtu.be/ondivPiwQho%3Ft%3D1h16m33s) ]

![](../img/1_GZz25GtnzqaYy5MV5iQPmA.png)
[http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)
+

The one on the left is what we just saw: a 2x2 deconvolution. If there is a stride 2, then you don't just have padding around the outside, you actually have to put padding in the middle as well. They are not actually quite implemented this way because it is slow to do; in practice you'll implement them in a different way, but it all happens behind the scenes, so you don't have to worry about it. We've talked about this [convolution arithmetic tutorial](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html) before, and if you are still not comfortable with convolutions, or want to get comfortable with deconvolutions, this is a great site to go to. If you want to see the paper, it is [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285) .

`DeconvBlock` looks identical to a `ConvBlock` except it has the word `Transpose` [ [1:17:49](https://youtu.be/ondivPiwQho%3Ft%3D1h17m49s) ]. We just go conv → relu → batch norm as before, and it has input filters and output filters. The only difference is that stride 2 means the grid size will double rather than halve.

![](../img/1_vUpDoEX5vPs6y3auKiCFsQ.png)

**Question** : Both `nn.ConvTranspose2d` and `nn.Upsample` seem to do the same thing, ie expand grid-size (height and width) from the previous layer. Can we say `nn.ConvTranspose2d` is always better than `nn.Upsample` , since `nn.Upsample` merely resizes and fills the unknowns by zeros or interpolation [ [1:18:10](https://youtu.be/ondivPiwQho%3Ft%3D1h18m10s) ]? No, you can't. There is a fantastic interactive paper on distill.pub called [Deconvolution and Checkerboard Artifacts](https://distill.pub/2016/deconv-checkerboard/) which points out that what we are doing right now is extremely suboptimal — but the good news is everybody else does it too.

![](../img/1_-EmXZ1cNtZEO-2SwEYG6bA.png)

Have a look here — can you see these checkerboard artifacts? These are all from actual papers. They noticed that every one of these papers with generative models has these checkerboard artifacts, and what they realized is that it happens because when you have a stride 2 convolution with a size 3 kernel, the kernels overlap. So some grid cells get twice as much activation.

![](../img/1_rafmdyh7EfqCsptcOppq1w.png)

So even if you start with random weights, you end up with checkerboard artifacts, and the deeper you go, the worse it gets. Their advice is less direct than it ought to be, but Jeremy found that for most generative models, upsampling is better. `nn.Upsample` basically does the opposite of pooling — it replaces one grid cell with four (2x2). There are a number of ways to upsample: one is just to copy the value across to those four cells, another is to use bilinear or bicubic interpolation. There are various techniques to try to create a smooth upsampled version, and you can choose any of them in PyTorch. If you do a 2x2 upsample and then a regular stride 1 3x3 convolution, that is another way of doing the same kind of thing as a ConvTranspose — it's doubling the grid size and doing some convolutional arithmetic on it. For generative models, it pretty much always works better. In the distill.pub publication, they indicate that maybe that's a good approach, but they don't just come out and say "just do this", whereas Jeremy would just say "just do this."
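As a rough sketch of that upsample-then-convolve alternative (a hypothetical block in plain PyTorch — the channel sizes are made up and this is not the fastai implementation):

```
import torch
import torch.nn as nn

# Double the grid size by copying each cell into a 2x2 block, then let a
# stride 1 conv do the "convolutional arithmetic" — this avoids the
# overlapping stride-2 kernels that cause checkerboard artifacts.
upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(32),
)

x = torch.randn(4, 64, 8, 8)
print(upsample_block(x).shape)   # torch.Size([4, 32, 16, 16])
```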
Having said that, for GANS, he hasn't had that much success with it yet and he thinks it probably requires some tweaking to get it to work, The issue is that in the early stages, it doesn't create enough noise. He had a version where he tried to do it with an upsample and you could kind of see that the noise didn't look very noisy. Next week when we look at style transfer and super-resolution, you will see `nn.Upsample` really comes into its own. + +The generator, we can now start with the vector [ [1:22:04](https://youtu.be/ondivPiwQho%3Ft%3D1h22m04s) ]. We can decide and say okay let's not think of it as a vector but actually it's 1x1 grid cell, and then we can turn it into a 4x4 then 8x8 and so forth. That is why we have to make sure it's a suitable multiple so that we can create something of the right size. As you can see, it's doing the exact opposite as before. It's making the cell size bigger and bigger by 2 at a time as long as it can until it gets to half the size that we want, and then finally we add `n` more on at the end with stride 1\. Then we add one more ConvTranspose to finally get to the size that we wanted and we are done. Finally we put that through a `tanh` and that will force us to be in the zero to one range because of course we don't want to spit out arbitrary size pixel values. So we have a generator architecture which spits out an image of some given size with the correct number of channels with values between zero and one. + +![](../img/1_sNvYsoGpBl6vzCdcjWkH1Q.png) + +At this point, we can now create our model data object [ [1:23:38](https://youtu.be/ondivPiwQho%3Ft%3D1h23m38s) ]. These things take a while to train, so we made it 128 by 128 (just a convenient way to make it a little bit faster). So that is going to be the size of the input, but then we are going to use transformation to turn it into 64 by 64\. + +There's been more recent advances which have attempted to really increase this up to high resolution sizes but they still tend to require either a batch size of 1 or lots and lots of GPUs [ [1:24:05](https://youtu.be/ondivPiwQho%3Ft%3D1h24m5s) ]. So we are trying to do things that we can do with a single consumer GPU. Here is an example of one of the 64 by 64 bedrooms. + +``` + bs,sz,nz = 64,64,100 +``` + +``` + tfms = tfms_from_stats(inception_stats, sz) md = ImageClassifierData.from_csv(PATH, 'bedroom', CSV_PATH, tfms=tfms, bs=128, skip_header= False , continuous= True ) +``` + +``` + md = md.resize(128) +``` + +``` + x,_ = next(iter(md.val_dl)) +``` + +``` + plt.imshow(md.trn_ds.denorm(x)[0]); +``` + +![](../img/1_FIBPb5I8EloAjg7mvtRXaQ.png) + +#### Putting them all together [ [1:24:30](https://youtu.be/ondivPiwQho%3Ft%3D1h24m30s) ] + +We are going to do pretty much everything manually so let's go ahead and create our two models — our generator and discriminator and as you can see they are DCGAN, so in other words, they are the same modules that appeared in [this paper](https://arxiv.org/abs/1511.06434) . It is well worth going back and looking at the DCGAN paper to see what these architectures are because it's assumed that when you read the Wasserstein GAN paper that you already know that. + +``` + netG = DCGAN_G(sz, nz, 3, 64, 1).cuda() netD = DCGAN_D(sz, 3, 64, 1).cuda() +``` + +**Question** : Shouldn't we use a sigmoid if we want values between 0 and 1 [ [1:25:06](https://youtu.be/ondivPiwQho%3Ft%3D1h25m6s) ]? As usual, our images have been normalized to have a range from -1 to 1, so their pixel values don't go between 0 and 1 anymore. 
This is why we want values going from -1 to 1 — otherwise we wouldn't be giving a correct input to the discriminator (via [this post](http://forums.fast.ai/t/part-2-lesson-12-wiki/15023/140) ).

So we have a generator and a discriminator, and we need a function that returns a “prior” vector (ie a bunch of noise) [ [1:25:49](https://youtu.be/ondivPiwQho%3Ft%3D1h25m49s) ]. We do that by creating a bunch of zeros. `nz` is the size of `z` — very often in our code, if you see a mysterious letter, it's because that's the letter they used in the paper. Here, `z` is our noise vector. We then fill it with random numbers drawn from a normal distribution with mean 0 and standard deviation 1\. And that needs to be a variable because it's going to be participating in the gradient updates.

```
 def create_noise(b): return V(torch.zeros(b, nz, 1, 1).normal_(0, 1))
```

```
 preds = netG(create_noise(4)) pred_ims = md.trn_ds.denorm(preds) fig, axes = plt.subplots(2, 2, figsize=(6, 6)) for i,ax in enumerate(axes.flat): ax.imshow(pred_ims[i])
```

![](../img/1_4nHm3LLiShNb0pSS3dCuCw.png)

So here is an example of creating some noise — the plots show the (currently random) generator's output for four different pieces of noise.

```
 def gallery(x, nc=3): n,h,w,c = x.shape nr = n//nc assert n == nr*nc return (x.reshape(nr, nc, h, w, c) .swapaxes(1,2) .reshape(h*nr, w*nc, c))
```

We need an optimizer in order to update our weights from the gradients [ [1:26:41](https://youtu.be/ondivPiwQho%3Ft%3D1h26m41s) ]. In the Wasserstein GAN paper, they told us to use RMSProp:

![](../img/1_5o4cwLlNjQfgrNVgLrsVlg.png)

We can easily do that in PyTorch:

```
 optimizerD = optim.RMSprop(netD.parameters(), lr = 1e-4) optimizerG = optim.RMSprop(netG.parameters(), lr = 1e-4)
```

In the paper, they suggested a learning rate of 0.00005 ( `5e-5` ); we found `1e-4` seemed to work, so we made it a little bit bigger.

Now we need a training loop [ [1:27:14](https://youtu.be/ondivPiwQho%3Ft%3D1h27m14s) ]:

![](../img/1_VROXSgyt6HWaJiMMY6ogFQ.png)
For easier reading
+ + + +A training loop will go through some number of epochs that we get to pick (so that's going to be a parameter). Remember, when you do everything manually, you've got to remember all the manual steps to do: + +1. You have to set your modules into training mode when you are training them and into evaluation mode when you are evaluating because in training mode batch norm updates happen and dropout happens, in evaluation mode, those two things gets turned off. +2. We are going to grab an iterator from our training data loader +3. We are going to see how many steps we have to go through and then we will use `tqdm` to give us a progress bar, and we are going to go through that many steps. + +The first step of the algorithm in the paper is to update the discriminator (in the paper, they call discriminator a “critic” and `w` is the weights of the critic). So the first step is to train our critic a little bit, and then we are going to train our generator a little bit, and we will go back to the top of the loop. The inner `for` loop in the paper correspond to the second `while` loop in our code. + +What we are going to do now is we have a generator that is random at the moment [ [1:29:06](https://youtu.be/ondivPiwQho%3Ft%3D1h29m6s) ]. So our generator will generate something that looks like the noise. First of all, we need to teach our discriminator to tell the difference between the noise and a bedroom — which shouldn't be too hard you would hope. So we just do it in the usual way but there is a few little tweaks: + +1. We are going to grab a mini batch of real bedroom photos so we can just grab the next batch from our iterator, turn it into a variable. +2. Then we are going to calculate the loss for that — so this is going to be how much the discriminator thinks this looks fake (“does the real one look fake?”). +3. Then we are going to create some fake images and to do that we will create some random noise, and we will stick it through our generator which at this stage is just a bunch of random weights. That will create a mini batch of fake images. +4. Then we will put that through the same discriminator module as before to get the loss for that (“how fake does the fake one look?”). Remember, when you do everything manually, you have to zero the gradients ( `netD.zero_grad()` ) in your loop. If you have forgotten about that, go back to the part 1 lesson where we do everything from scratch. +5. Finally, the total discriminator loss is equal to the real loss minus the fake loss. + +So you can see that here [ [1:30:58](https://youtu.be/ondivPiwQho%3Ft%3D1h30m58s) ]: + +![](../img/1_atls5DInIbp5wHZz8szQ1A.png) + +They don't talk about the loss, they actually just talk about one of the gradient updates. + +![](../img/1_9nGWityXFzNdgOxN15flRA.png) + +In PyTorch, we don't have to worry about getting the gradients, we can just specify the loss and call `loss.backward()` then discriminator's `optimizer.step()` [ [1:34:27](https://youtu.be/ondivPiwQho%3Ft%3D1h34m27s) ]. There is one key step which is that we have to keep all of our weights which are the parameters in PyTorch module in the small range of -0.01 and 0.01\. 为什么? Because the mathematical assumptions that make this algorithm work only apply in a small ball. It is interesting to understand the math of why that is the case, but it's very specific to this one paper and understanding it won't help you understand any other paper, so only study it if you are interested. 
It is nicely explained and Jeremy thinks it's fun but it won't be information that you will reuse elsewhere unless you get super into GANs. He also mentioned that after the came out and improved Wasserstein GAN came out that said there are better ways to ensure that your weight space is in this tight ball which was to penalize gradients that are too high, so nowadays there are slightly different ways to do this. But this line of code is the key contribution and it is what makes it Wasserstein GAN: + +``` + for p in netD.parameters(): p.data.clamp_(-0.01, 0.01) +``` + +At the end of this, we have a discriminator that can recognize real bedrooms and our totally random crappy generated images [ [1:36:20](https://youtu.be/ondivPiwQho%3Ft%3D1h36m20s) ]. Let's now try and create some better images. So now set trainable discriminator to false, set trainable generator to true, zero out the gradients of the generator. Our loss again is `fw` (discriminator) of the generator applied to some more random noise. So it's exactly the same as before where we did generator on the noise and then pass that to a discriminator, but this time, the thing that's trainable is the generator, not the discriminator. In other words, in the pseudo code, the thing they update is Ɵ which is the generator's parameters. So it takes noise, generate some images, try and figure out if they are fake or real, and use that to get gradients with respect to the generator, as opposed to earlier we got them with respect to the discriminator, and use that to update our weights with RMSProp with an alpha learning rate [ [1:38:21](https://youtu.be/ondivPiwQho%3Ft%3D1h38m21s) ]. + +``` + def train(niter, first= True ): gen_iterations = 0 for epoch in trange(niter): netD.train(); netG.train() data_iter = iter(md.trn_dl) i,n = 0,len(md.trn_dl) with tqdm(total=n) as pbar: while i < n: set_trainable(netD, True ) set_trainable(netG, False ) d_iters = 100 if (first and (gen_iterations < 25) or (gen_iterations % 500 == 0)) else 5 j = 0 while (j < d_iters) and (i < n): j += 1; i += 1 for p in netD.parameters(): p.data.clamp_(-0.01, 0.01) real = V(next(data_iter)[0]) real_loss = netD(real) fake = netG(create_noise(real.size(0))) fake_loss = netD(V(fake.data)) netD.zero_grad() lossD = real_loss-fake_loss lossD.backward() optimizerD.step() pbar.update() set_trainable(netD, False ) set_trainable(netG, True ) netG.zero_grad() lossG = netD(netG(create_noise(bs))).mean(0).view(1) lossG.backward() optimizerG.step() gen_iterations += 1 print(f'Loss_D {to_np(lossD)}; Loss_G {to_np(lossG)}; ' f'D_real {to_np(real_loss)}; Loss_D_fake {to_np(fake_loss)}') +``` + +You'll see that it's unfair that the discriminator is getting trained _ncritic_ times ( `d_iters` in above code) which they set to 5 for every time we train the generator once. And the paper talks a bit about this but the basic idea is there is no point making the generator better if the discriminator doesn't know how to discriminate yet. So that's why we have the second while loop. And here is that 5: + +``` + d_iters = 100 if (first and (gen_iterations < 25) or (gen_iterations % 500 == 0)) else 5 +``` + +Actually something which was added in the later paper or maybe supplementary material is the idea that from time to time and a bunch of times at the start, you should do more steps at the discriminator to make sure that the discriminator is capable. 
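One small helper in the loop above is `set_trainable`, which freezes one network while the other is being updated. A minimal sketch of what such a helper does (an assumption about the fastai 0.7 helper, shown only for clarity) is simply to toggle `requires_grad` on the parameters:

```
def set_trainable(module, flag):
    # Freeze or unfreeze every parameter so gradients only flow into the
    # network we are currently training (discriminator or generator).
    for p in module.parameters():
        p.requires_grad = flag
```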
+ +``` + torch.backends.cudnn.benchmark= True +``` + +Let's train that for one epoch: + +``` + train(1, False ) +``` + +``` + 0%| | 0/1 [00:00 From experimenting I figured that Adam and WGANs not just work worse — it causes to completely fail to train meaningful generator. + +> from WGAN paper: + +> _Finally, as a negative result, we report that WGAN training becomes unstable at times when one uses a momentum based optimizer such as Adam [8] (with β1>0) on the critic, or when one uses high learning rates. Since the loss for the critic is nonstationary, momentum based methods seemed to perform worse. We identified momentum as a potential cause because, as the loss blew up and samples got worse, the cosine between the Adam step and the gradient usually turned negative. The only places where this cosine was negative was in these situations of instability. We therefore switched to RMSProp [21] which is known to perform well even on very nonstationary problems_ + +**Question** : Which could be a reasonable way of detecting overfitting while training? Or of evaluating the performance of one of these GAN models once we are done training? In other words, how does the notion of train/val/test sets translate to GANs [ [1:41:57](https://youtu.be/ondivPiwQho%3Ft%3D1h41m57s) ]? That is an awesome question, and there's a lot of people who make jokes about how GANs is the one field where you don't need a test set and people take advantage of that by making stuff up and saying it looks great. There are some famous problems with GANs, one of them is called Mode Collapse. Mode collapse happens where you look at your bedrooms and it turns out that there's only three kinds of bedrooms that every possible noise vector maps to. You look at your gallery and it turns out they are all just the same thing or just three different things. Mode collapse is easy to see if you collapse down to a small number of modes, like 3 or 4\. But what if you have a mode collapse down to 10,000 modes? So there are only 10,000 possible bedrooms that all of your noise vectors collapse to. You wouldn't be able to see in the gallery view we just saw because it's unlikely you would have two identical bedrooms out of 10,000\. Or what if every one of these bedrooms is basically a direct copy of one of the input — it basically memorized some input. Could that be happening? And the truth is, most papers don't do a good job or sometimes any job of checking those things. So the question of how do we evaluate GANS and even the point of maybe we should actually evaluate GANs properly is something that is not widely enough understood even now. Some people are trying to really push. Ian Goodfellow was the first author on the most famous deep learning book and is the inventor of GANs and he's been sending continuous stream of tweets reminding people about the importance of testing GANs properly. If you see a paper that claims exceptional GAN results, then this is definitely something to look at. Have they talked about mode collapse? Have they talked about memorization? 等等。 + +**Question** : Can GANs be used for data augmentation [ [1:45:33](https://youtu.be/ondivPiwQho%3Ft%3D1h45m33s) ]? Yeah, absolutely you can use GAN for data augmentation. 你应该? 我不知道。 There are some papers that try to do semi-supervised learning with GANs. I haven't found any that are particularly compelling showing state-of-the-art results on really interesting datasets that have been widely studied. 
I'm a little skeptical and the reason I'm a little skeptical is because in my experience, if you train a model with synthetic data, the neural net will become fantastically good at recognizing the specific problems of your synthetic data and that'll end up what it's learning from. There are lots of other ways of doing semi-supervised models which do work well. There are some places that can work. For example, you might remember Otavio Good created that fantastic visualization in part 1 of the zooming conv net where it showed letter going through MNIST, he, at least at that time, was the number one in autonomous remote control car competitions, and he trained his model using synthetically augmented data where he basically took real videos of a car driving around the circuit and added fake people and fake other cars. I think that worked well because A. he is kind of a genius and B. because I think he had a well defined little subset that he had to work in. But in general, it's really really hard to use synthetic data. I've tried using synthetic data and models for decades now (obviously not GANs because they're pretty new) but in general it's very hard to do. Very interesting research question. + +### Cycle GAN [ [1:41:08](https://youtu.be/ondivPiwQho%3Ft%3D1h41m8s) ] + +[Paper](https://arxiv.org/abs/1703.10593) / [Notebook](https://github.com/fastai/fastai/blob/master/courses/dl2/cyclegan.ipynb) + +We are going to use cycle GAN to turn horses into zebras. You can also use it to turn Monet prints into photos or to turn photos of Yosemite in summer into winter. + +![](../img/1_dWd0lVTbnu80UZM641gCbw.gif) + +This is going to be really straight forward because it's just a neural net [ [1:44:46](https://youtu.be/ondivPiwQho%3Ft%3D1h44m46s) ]. All we are going to do is we are going to create an input containing lots of zebra photos and with each one we'll pair it with an equivalent horse photo and we'll just train a neural net that goes from one to the other. Or you could do the same thing for every Monet painting — create a dataset containing the photo of the place …oh wait, that's not possible because the places that Monet painted aren't there anymore and there aren't exact zebra versions of horses …how the heck is this going to work? This seems to break everything we know about what neural nets can do and how they do them. + +So somehow these folks at Berkeley cerated a model that can turn a horse into a zebra despite not having any photos. Unless they went out there and painted horses and took before-and-after shots but I believe they didn't [ [1:47:51](https://youtu.be/ondivPiwQho%3Ft%3D1h47m51s) ]. So how the heck did they do this? It's kind of genius. + +The person I know who is doing the most interesting practice of cycle GAN right now is one of our students Helena Sarin [**@** glagolista](https://twitter.com/glagolista) . She is the only artist I know of who is a cycle GAN artist. + +![](../img/1_y0xHbQJvxcwUsx7EEK4nHQ.jpeg) + +![](../img/1_QZWqdoLXR1TjgeWDivTlnA.jpeg) + +![](../img/1_JIF1OaO04wxkWIP_7b14uA.jpeg) + +![](../img/1_xn7L_rsu2J6Py2Mjq_q1LA.jpeg) + +Here are some more of her amazing works and I think it's really interesting. I mentioned at the start of this class that GANs are in the category of stuff that is not there yet, but it's nearly there. And in this case, there is at least one person in the world who is creating beautiful and extraordinary artworks using GANs (specifically cycle GANs). 
At least a dozen people I know of who are just doing interesting creative work with neural nets more generally. And the field of creative AI is going to expand dramatically. + +![](../img/1_oqSRuiHT8Z9pWl0Zq9_Sjw.png) + +Here is the basic trick [ [1:50:11](https://youtu.be/ondivPiwQho%3Ft%3D1h50m11s) ]. This is from the cycle GAN paper. We are going to have two images (assuming we are doing this with images). The key thing is they are not paired images, so we don't have a dataset of horses and the equivalent zebras. We have bunch of horses, and bunch of zebras. Grab one horse _X_ , grab one zebra _Y_ . We are going to train a generator (what they call here a “mapping function”) that turns horse into zebra. We'll call that mapping function _G_ and we'll create one mapping function (aka generator) that turns a zebra into a horse and we will call that _F._ We will create a discriminator just like we did before which is going to get as good as possible at recognizing real from fake horses so that will be _Dx._ Another discriminator which is going to be as good as possible at recognizing real from fake zebras, we will call that _Dy_ . That is our starting point. + +The key thing to making this work [ [1:51:27](https://youtu.be/ondivPiwQho%3Ft%3D1h51m27s) ]— so we are generating a loss function here ( _Dx_ and _Dy_ ). We are going to create something called **cycle-consistency loss** which says after you turn your horse into a zebra with your generator, and check whether or not I can recognize that it's a real. We turn our horse into a zebra and then going to try and turn that zebra back into the same horse that we started with. Then we are going to have another function that is going to check whether this horse which are generated knowing nothing about _x_ — generated entirely from this zebra _Y_ is similar to the original horse or not. So the idea would be if your generated zebra doesn't look anything like your original horse, you've got no chance of turning it back into the original horse. So a loss which compares _x-hat_ to _x_ is going to be really bad unless you can go into _Y_ and back out again and you're probably going to be able to do that if you're able to create a zebra that looks like the original horse so that you know what the original horse looked like. And vice versa — take your zebra, turn it into a fake horse, and check that you can recognize that and then try and turn it back into the original zebra and check that it looks like the original. + +So notice _F_ (zebra to horse) and _G_ (horse to zebra) are doing two things [ [1:53:09](https://youtu.be/ondivPiwQho%3Ft%3D1h53m9s) ]. They are both turning the original horse into the zebra, and then turning the zebra back into the original horse. So there are only two generators. There isn't a separate generator for the reverse mapping. You have to use the same generator that was used for the original mapping. So this is the cycle-consistency loss. I think this is genius. The idea that this is a thing that could even be possible. Honestly when this came out, it just never occurred to me as a thing that I could even try and solve. It seems so obviously impossible and then the idea that you can solve it like this — I just think it's so darn smart. + +It's good to look at the equations in this paper because they are good examples — they are written pretty simply and it's not like some of the Wasserstein GAN paper which is lots of theoretical proofs and whatever else [ [1:54:05](https://youtu.be/ondivPiwQho%3Ft%3D1h54m5s) ]. 
In this case, they are just equations that lay out what's going on. You really want to get to a point where you can read them and understand them. + +![](../img/1_Mygxs_TWrjycbanbH5aUeQ.png) + +So we've got a horse _X_ and a zebra _Y_ [ [1:54:34](https://youtu.be/ondivPiwQho%3Ft%3D1h54m34s) ]. For some mapping function _G_ which is our horse to zebra mapping function then there is a GAN loss which is a bit we are already familiar with it says we have a horse, a zebra, a fake zebra recognizer, and a horse-zebra generator. The loss is what we saw before — it's our ability to draw one zebra out of our zebras and recognize whether it is real or fake. Then take a horse and turn it into a zebra and recognize whether that's real or fake. You then do one minus the other (in this case, they have a log in there but the log is not terribly important). So this is the thing we just saw. That is why we did Wasserstein GAN first. This is just a standard GAN loss in math form. + +**Question** : All of this sounds awfully like translating in one language to another then back to the original. Have GANs or any equivalent been tried in translation [ [1:55:54](https://youtu.be/ondivPiwQho%3Ft%3D1h55m54s) ]? [Paper from the forum](https://arxiv.org/abs/1711.00043) . Back up to what I do know — normally with translation you require this kind of paired input (ie parallel text — “this is the French translation of this English sentence”). There has been a couple of recent papers that show the ability to create good quality translation models without paired data. I haven't implemented them and I don't understand anything I haven't implemented, but they may well be doing the same basic idea. We'll look at it during the week and get back to you. + +**Cycle-consistency loss** [ [1:57:14](https://youtu.be/ondivPiwQho%3Ft%3D1h57m14s) ]: So we've got a GAN loss and the next piece is the cycle-consistency loss. So the basic idea here is that we start with our horse, use our zebra generator on that to create a zebra, use our horse generator on that to create a horse and compare that to the original horse. This double lines with the 1 is the L1 loss — sum of the absolute value of differences [ [1:57:35](https://youtu.be/ondivPiwQho%3Ft%3D1h57m35s) ]. Where else if this was 2, it would be the L2 loss so the 2-norm which would be the sum of the squared differences. + +![](../img/1_0wq511kW9eRhBMWS94G0Bw.png) + +We now know this squiggle idea which is from our horses grab a horse. This is what we mean by sample from a distribution. There's all kinds of distributions but most commonly in these papers we're using an empirical distribution, in other words we've got some rows of data, grab a row. So here, it is saying grab something from the data and we are going to call that thing _x_ . To recapture: + +1. From our horse pictures, grab a horse +2. Turn it into a zebra +3. Turn it back into a horse +4. Compare it to the original and sum of the absolute values +5. Do it for zebra to horse as well +6. And add the two together + +That is our cycle-consistency loss. 
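A minimal sketch of that term in code (assuming `G` maps horses to zebras and `F` maps zebras to horses, using L1 as the paper does — the function names here are illustrative, not the library's API):

```
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G, F, real_horse, real_zebra):
    # horse -> zebra -> horse, compared against the original horse
    loss_horse = l1(F(G(real_horse)), real_horse)
    # zebra -> horse -> zebra, compared against the original zebra
    loss_zebra = l1(G(F(real_zebra)), real_zebra)
    return loss_horse + loss_zebra
```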
+ +**Full objective** [ [1:58:54](https://youtu.be/ondivPiwQho%3Ft%3D1h58m54s) ] + +![](../img/1_84eYJ5eck_7r3zVJzrzGzA.png) + +Now we get our loss function and the whole loss function depends on: + +* our horse generator +* a zebra generator +* our horse recognizer +* our zebra recognizer (aka discriminator) + +We are going to add up : + +* the GAN loss for recognizing horses +* GAN loss for recognizing zebras +* the cycle-consistency loss for our two generators + +We have a lambda here which hopefully we are kind of used to this idea now that is when you have two different kinds of loss, you chuck in a parameter there you can multiply them by so they are about the same scale [ [1:59:23](https://youtu.be/ondivPiwQho%3Ft%3D1h59m23s) ]. We did a similar thing with our bounding box loss compared to our classifier loss when we did the localization. + +Then for this loss function, we are going to try to maximize the capability of the discriminators to discriminate, whilst minimizing that for the generators. So the generators and the discriminators are going to be facing off against each other. When you see this _min max_ thing in papers, it basically means this idea that in your training loop, one thing is trying to make something better, the other is trying to make something worse, and there're lots of ways to do it but most commonly, you'll alternate between the two. You will often see this just referred to in math papers as min-max. So when you see min-max, you should immediately think **adversarial training** . + +#### Implementing cycle GAN [ [2:00:41](https://youtu.be/ondivPiwQho%3Ft%3D2h41s) ] + +Let's look at the code. We are going to do something almost unheard of which is I started looking at somebody else's code and I was not so disgusted that I threw the whole thing away and did it myself. I actually said I quite like this, I like it enough I'm going to show it to my students. [This](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) is where the code came from, and this is one of the people that created the original code for cycle GANs and they created a PyTorch version. I had to clean it up a little bit but it's actually pretty darn good. The cool thing about this is that you are now going to get to see almost all the bits of fast.ai or all the relevant bits of fast.ai written in a different way by somebody else. So you're going to get to see how they do datasets, data loaders, models, training loops, and so forth. + +You'll find there is a `cgan` directory [ [2:02:12](https://youtu.be/ondivPiwQho%3Ft%3D2h2m12s) ] which is basically nearly the original with some cleanups which I hope to submit as a PR sometime . It was written in a way that unfortunately made it a bit over connected to how they were using it as a script, so I cleaned it up a little bit so I could use it as a module. But other than that, it's pretty similar. + +``` + **from** **fastai.conv_learner** **import** * **from** **fastai.dataset** **import** * from cgan.options.train_options import * +``` + +So `cgan` is their code copied from their github repo with some minor changes. The way `cgan` mini library has been set up is that the configuration options, they are assuming, are being passed into like a script. So they have `TrainOptions().parse` method and I'm basically passing in an array of script options (where's my data, how many threads, do I want to dropout, how many iterations, what am I going to call this model, which GPU do I want run it on). 
That gives us an `opt` object which you can see what it contains. You'll see that it contains some things we didn't mention that is because it has defaults for everything else that we didn't mention. + +``` + opt = TrainOptions().parse(['--dataroot', '/data0/datasets/cyclegan/horse2zebra', '--nThreads', '8', '--no_dropout', '--niter', '100', '--niter_decay', '100', '--name', 'nodrop', '--gpu_ids', '2']) +``` + +So rather than using fast.ai stuff, we are going to largely use cgan stuff. + +``` + from cgan.data.data_loader import CreateDataLoader from cgan.models.models import create_model +``` + +The first thing we are going to need is a data loader. So this is also a great opportunity for you again to practice your ability to navigate through code with your editor or IDE of choice. We are going to start with `CreateDataLoader` . You should be able to go find symbol or in vim tag to jump straight to `CreateDataLoader` and we can see that's creating a `CustomDatasetDataLoader` . Then we can see `CustomDatasetDataLoader` is a `BaseDataLoader` . We can see that it's going to use a standard PyTorch DataLoader, so that's good. We know if you are going to use a standard PyTorch DataLoader, you have pass it a dataset, and we know that a dataset is something that contains a length and an indexer so presumably when we look at `CreateDataset` it's going to do that. + +Here is `CreateDataset` and this library does more than just cycle GAN — it handles both aligned and unaligned image pairs [ [2:04:46](https://youtu.be/ondivPiwQho%3Ft%3D2h4m46s) ]. We know that our image pairs are unaligned so we are going to `UnalignedDataset` . + +![](../img/1_wDbxkFlSWbEnC9QDtymlZA.png) + +As expected, it has `__getitem__` and `__len__` . For length, A and B are our horses and zebras, we got two sets, so whichever one is longer is the length of the `DataLoader` . `__getitem__` is going to: + +* Randomly grab something from each of our two horses and zebras +* Open them up with pillow (PIL) +* Run them through some transformations +* Then we could either be turning horses into zebras or zebras into horses, so there's some direction +* Return our horse, zebra, a path to the horse, and a path of zebra + +Hopefully you can kind of see that this is looking pretty similar to the kind of things fast.ai does. Fast.ai obviously does quite a lot more when it comes to transforms and performance, but remember, this is research code for this one thing and it's pretty cool that they did all this work. + +![](../img/1_zWN8sgzWry6qu7R9FS0Ydw.png) + +``` + data_loader = CreateDataLoader(opt) dataset = data_loader.load_data() dataset_size = len(data_loader) dataset_size +``` + +``` + 1334 +``` + +We've got a data loader so we can go and load our data into it [ [2:06:17](https://youtu.be/ondivPiwQho%3Ft%3D2h6m17s) ]. That will tell us how many mini-batches are in it (that's the length of the data loader in PyTorch). + +Next step is to create a model. Same idea, we've got different kind of models and we're going to be doing a cycle GAN. + +![](../img/1_TmC6TtfaP2xRyS9KK1ryjA.png) + +Here is our `CycleGANModel` . There is quite a lot of stuff in `CycleGANModel` , so let's go through and find out what's going to be used. At this stage, we've just called initializer so when we initialize it, it's going to go through and define two generators which is not surprising a generator for our horses and a generator for zebras. 
There is some way for it to generate a pool of fake data and then we're going to grab our GAN loss, and as we talked about our cycle-consistency loss is an L1 loss. They are going to use Adam, so obviously for cycle GANS they found Adam works pretty well. Then we are going to have an optimizer for our horse discriminator, an optimizer for our zebra discriminator, and an optimizer for our generator. The optimizer for the generator is going to contain the parameters both for the horse generator and the zebra generator all in one place. + +![](../img/1_eDn2CkHKsIDaAz1M5WnWBg.png) + +So the initializer is going to set up all of the different networks and loss functions we need and they are going to be stored inside this `model` [ [2:08:14](https://youtu.be/ondivPiwQho%3Ft%3D2h8m14s) ]. + +``` + model = create_model(opt) +``` + +It then prints out and shows us exactly the PyTorch model we have. It's interesting to see that they are using ResNets and so you can see the ResNets look pretty familiar, so we have conv, batch norm, Relu. `InstanceNorm` is just the same as batch norm basically but it applies to one image at a time and the difference isn't particularly important. And you can see they are doing reflection padding just like we are. You can kind of see when you try to build everything from scratch like this, it is a lot of work and you can forget the nice little things that fast.ai does automatically for you. You have to do all of them by hand and only you end up with a subset of them. So over time, hopefully soon, we'll get all of this GAN stuff into fast.ai and it'll be nice and easy. + +![](../img/1_YTCDe7-xeLelfeQNiKiq4A.png) + +We've got our model and remember the model contains the loss functions, generators, discriminators, all in one convenient place [ [2:09:32](https://youtu.be/ondivPiwQho%3Ft%3D2h9m32s) ]. I've gone ahead and copied and pasted and slightly refactored the training loop from their code so that we can run it inside the notebook. So this one should look a lot familiar. A loop to go through each epoch and a loop to go through the data. Before we did this, we set up `dataset` . This is actually not a PyTorch dataset, I think this is what they used slightly confusingly to talk about their combined what we would call a model data object — all the data that they need. Loop through that with `tqdm` to get a progress bar, and so now we can go through and see what happens in the model. 
+ +``` + total_steps = 0 for epoch in range(opt.epoch_count, opt.niter + opt.niter_decay+1): epoch_start_time = time.time() iter_data_time = time.time() epoch_iter = 0 for i, data in tqdm(enumerate(dataset)): iter_start_time = time.time() if total_steps % opt.print_freq == 0: t_data = iter_start_time - iter_data_time total_steps += opt.batchSize epoch_iter += opt.batchSize model.set_input(data) model.optimize_parameters() if total_steps % opt.display_freq == 0: save_result = total_steps % opt.update_html_freq == 0 if total_steps % opt.print_freq == 0: errors = model.get_current_errors() t = (time.time() - iter_start_time) / opt.batchSize if total_steps % opt.save_latest_freq == 0: print('saving the latest model(epoch %d ,total_steps %d )' % (epoch, total_steps)) model.save('latest') iter_data_time = time.time() if epoch % opt.save_epoch_freq == 0: print('saving the model at the end of epoch %d , iters %d ' % (epoch, total_steps)) model.save('latest') model.save(epoch) print('End of epoch %d / %d \t Time Taken: %d sec' % (epoch, opt.niter + opt.niter_decay, time.time() - epoch_start_time)) model.update_learning_rate() +``` + +`set_input` [ [2:10:32](https://youtu.be/ondivPiwQho%3Ft%3D2h10m32s) ]: It's a different approach to what we do in fast.ai. This is kind of neat, it's quite specific to cycle GANs but basically internally inside this model is this idea that we are going to go into our data and grab the appropriate one. We are either going horse to zebra or zebra to horse, depending on which way we go, `A` is either horse or zebra, and vice versa. If necessary put it on the appropriate GPU, then grab the appropriate paths. So the model now has a mini-batch of horses and a mini-batch of zebras. + +![](../img/1__s9OBHq4z1OBiR9SJORySw.png) + +Now we optimize the parameters [ [2:11:19](https://youtu.be/ondivPiwQho%3Ft%3D2h11m19s) ]. It's kind of nice to see it like this. You can see each step. First of all, try to optimize the generators, then try to optimize the horse discriminators, then try to optimize the zebra discriminator. `zero_grad()` is a part of PyTorch, as well as `step()` . So the interesting bit is the actual thing that does the back propagation on the generator. + +![](../img/1_CXawhHC0Mc9pgBFBIWg22Q.png) + +Here it is [ [2:12:04](https://youtu.be/ondivPiwQho%3Ft%3D2h12m4s) ]. Let's jump to the key pieces. There's all the formula that we just saw in the paper. Let's take a horse and generate a zebra. Let's now use the discriminator to see if we can tell whether it's fake or not ( `pred_fake` ). Then let's pop that into our loss function which we set up earlier to get a GAN loss based on that prediction. Let's do the same thing going the opposite direction using the opposite discriminator then put that through the loss function again. Then let's do the cycle consistency loss. Again, we take our fake which we created and try and turn it back again into the original. Let's use the cycle consistency loss function we created earlier to compare it to the real original. And here is that lambda — so there's some weight that we used and that would set up actually we just use the default that they suggested in their options. Then do the same for the opposite direction and then add them all together. We then do the backward step. 而已。 + +![](../img/1_q-ir1SHyywXmO5EkTDVq1w.png) + +So we can do the same thing for the first discriminator [ [2:13:50](https://youtu.be/ondivPiwQho%3Ft%3D2h13m50s) ]. Since basically all the work has been done now, there's much less to do here. There that is. 
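To summarise the generator update just described, here is a rough sketch of its shape (the names and the default lambda are illustrative only — they are not the exact attributes of the cgan code):

```
import torch.nn as nn

l1 = nn.L1Loss()

def generator_step(G, F, D_horse, D_zebra, gan_loss, real_horse, real_zebra,
                   optimizer_G, lambda_cyc=10.0):
    """Illustrative sketch of the generator update described above."""
    optimizer_G.zero_grad()
    fake_zebra = G(real_horse)                 # horse -> zebra
    fake_horse = F(real_zebra)                 # zebra -> horse
    # try to fool both discriminators
    loss_gan = gan_loss(D_zebra(fake_zebra), True) + gan_loss(D_horse(fake_horse), True)
    # cycle consistency: go there and back again, compare with the originals
    loss_cyc = l1(F(fake_zebra), real_horse) + l1(G(fake_horse), real_zebra)
    loss = loss_gan + lambda_cyc * loss_cyc
    loss.backward()                            # gradients w.r.t. both generators
    optimizer_G.step()
    return loss
```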
We won't step all through it but it's basically the same basic stuff that we've already seen. + +![](../img/1_PPZdNJDrTHrrQLVRjzucgg.png) + +So `optimize_parameters()` is calculating the losses and doing the optimizer step. From time to time, save and print out some results. Then from time to time, update the learning rate so they've got some learning rate annealing built in here as well. Kind of like fast.ai, they've got this idea of schedulers which you can then use to update your learning rates. + +![](../img/1_Xrc3Dxs8hKV7pWHQBfZSsQ.png) + +For those of you are interested in better understanding deep learning APIs, contributing more to fast.ai, or creating your own version of some of this stuff in some different back-end, it's cool to look at a second API that covers some subset of some of the similar things to get a sense for how they are solving some of these problems and what the similarities/differences are. + +``` + def show_img(im, ax= None , figsize= None ): if not ax: fig,ax = plt.subplots(figsize=figsize) ax.imshow(im) ax.get_xaxis().set_visible( False ) ax.get_yaxis().set_visible( False ) return ax +``` + +``` + def get_one(data): model.set_input(data) model.test() return list(model.get_current_visuals().values()) +``` + +``` + model.save(201) +``` + +``` + test_ims = [] for i,o in enumerate(dataset): if i>10: break test_ims.append(get_one(o)) +``` + +``` + def show_grid(ims): fig,axes = plt.subplots(2,3,figsize=(9,6)) for i,ax in enumerate(axes.flat): show_img(ims[i], ax); fig.tight_layout() +``` + +``` + for i in range(8): show_grid(test_ims[i]) +``` + +We train that for a little while and then we can just grab a few examples and here we have them [ [2:15:29](https://youtu.be/ondivPiwQho%3Ft%3D2h15m29s) ]. Here are horses, zebras, and back again as horses. + +![](../img/1_CcsmcW4TlZvn7eywxQPHUQ.png) + +![](../img/1_uMqqilXzEmTMXf8x5ry0CQ.png) + +![](../img/1__9FdL_2vB30MCQ1V8fU-qw.png) + +![](../img/1_SanvgXWJHOoucKANA6A36A.png) + +![](../img/1_TcjrwAtTdYLkV1x5kCqdxA.png) + +![](../img/1_Rlpp3gVTYSaKknsAq4qOig.png) + +![](../img/1_xxPYAgd8hRxgvGv2mBQ2Vg.png) + +![](../img/1_MxQzw0SwBT_iYfbyd4BD_w.png) + +It took me like 24 hours to train it even that far so it's kind of slow [ [2:16:39](https://youtu.be/ondivPiwQho%3Ft%3D2h16m39s) ]. I know Helena is constantly complaining on Twitter about how long these things take. I don't know how she's so productive with them. + +``` + #! wget https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/horse2zebra.zip +``` + +I will mention one more thing that just came out yesterday [ [2:16:54](https://youtu.be/ondivPiwQho%3Ft%3D2h16m54s) ]: + +[Multimodal Unsupervised Image-to-Image Translation](https://arxiv.org/abs/1804.04732) + +There is now a multi-modal image to image translation of unpaired. So you can basically now create different cats for instance from this dog. + +This is basically not just creating one example of the output that you want, but creating multiple ones. This came out yesterday or the day before. I think it's pretty amazing. So you can kind of see how this technology is developing and I think there's so many opportunities to maybe do this with music, speech, writing, or to create kind of tools for artists. 
diff --git a/zh/dl13.md b/zh/dl13.md new file mode 100644 index 0000000000000000000000000000000000000000..19e15da7d88b3dcf2ea64190b4f2f2e6d22d6113 --- /dev/null +++ b/zh/dl13.md @@ -0,0 +1,836 @@ +# 深度学习2:第2部分第13课 + +[论坛](http://forums.fast.ai/t/lesson-13-discussion-and-wiki/15297/1) / [视频](https://youtu.be/xXXiC4YRGrQ) + +![](../img/1_hnLrcMUPpHa6RyMy1DqTMA.png) + +![](../img/1_M8g-O4rEHrAO62K9J_AG6w.png) + +图像增强 - 我们将涵盖您可能熟悉的这幅画。 但是,你之前可能没有注意到这幅画中的老鹰。 之前你可能没有注意到这幅画的原因是它上面没有鹰。 出于同样的原因,第一张幻灯片上的画也不习惯美国队长的盾牌。 + +![](../img/1_7jEj71y4syKb2ylnSGrUAg.png) + +这是一篇很酷的新论文,刚刚在几天前[出版](https://arxiv.org/abs/1804.03189) ,名为[Deep Painterly Harmonization](https://arxiv.org/abs/1804.03189) ,它几乎完全采用了我们将在本课中学习的技巧,并进行了一些小的调整。 但你可以看到基本的想法是将一张图片粘贴在另一张图片的顶部,然后使用某种方法将两者结合起来。 这种方法被称为“风格转移”。 + +在我们谈论这个之前,我想提一下William Horton的这个非常酷的贡献,他将这种随机权重平均技术添加到fastai库中,现在它已经合并并准备好了。 他写了一篇关于我强烈建议您查看的帖子,不仅因为随机权重平均让您从现有的神经网络中获得更高的性能,基本上没有额外的工作(就像为您的拟合函数添加两个参数一样简单: `use_swa` , `swa_start` )但他也描述了他构建这个的过程以及他如何测试它以及他如何为图书馆做出贡献。 所以我觉得如果你有兴趣做这样的事情会很有趣。 我认为威廉之前没有建立过这种类型的图书馆,因此他描述了他是如何做到的。 + +![](../img/1_hJeKM7VaXaDyvVTTXWlFqg.png) + +
[https://medium.com/@hortonhearsafoo/adding-a-cutting-edge-deep-learning-training-technique-to-the-fast-ai-library-2cd1dba90a49](https://medium.com/%40hortonhearsafoo/adding-a-cutting-edge-deep-learning-training-technique-to-the-fast-ai-library-2cd1dba90a49)
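下面是一个使用这两个参数的示意性例子(假设 `learn` 是一个已经构建好的 fastai Learner;参数的具体行为以库中的实现为准):

```
# 开启随机权重平均(SWA): use_swa 打开该功能,
# swa_start 指定从第几个 epoch 开始对权重做平均
learn.fit(1e-2, 3, use_swa=True, swa_start=1)
```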
+ + + +#### TrainPhase [ [2:01](https://youtu.be/xXXiC4YRGrQ%3Ft%3D2m1s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/training_phase.ipynb) + +对fastai库的另一个很酷的贡献是新的列车阶段API。 而且我将要做一些我以前从未做过的事情,那就是我要介绍别人的笔记本。 我以前没有这么做的原因是因为我不喜欢任何笔记本足以认为它们值得展示它,但是Sylvain在这里做了出色的工作,不仅创造了这个新的API而且创造了一个美丽的笔记本描述了什么它是什么以及它是如何工作的等等。 这里的背景是你们知道我们一直在努力更快地训练网络,部分是作为Dawn替补席比赛的一部分,也是因为你将在下周学到的原因。 我上周在论坛上提到,如果我们有一种更简单的方法来尝试不同的学习率计划等,我们的实验会非常方便,而且我设计了一个我想到的API,因为如果有人真的很酷可以这样写,因为我现在要睡觉了,明天我需要它。 Sylvain在论坛上回答说这听起来像是一个很好的挑战,到24小时后,它已经完成并且它非常酷。 我想带你通过它,因为它可以让你研究以前没人尝过的东西。 + +它被称为TrainPhase API [ [3:32](https://youtu.be/xXXiC4YRGrQ%3Ft%3D3m32s) ],显示它的最简单方法是展示它的作用的一个例子。 这是对您熟悉的学习率图表的迭代。 这是我们以0.01的学习率训练一段时间然后我们以0.001的学习率训练一段时间。 我实际上想要创建非常类似于学习率图表的东西,因为大多数训练过ImageNet的人都使用这种逐步的方法,而实际上并不是内置于fastai的东西,因为它通常不是我们推荐的东西。 但是为了复制现有的论文,我想以同样的方式去做。 因此,不是用不同的学习率编写一些合适的,适合的,适合的呼叫,而是能够以这个学习速率训练n个时期,然后以该学习速率训练m个时期。 + +![](../img/1_aWc3pHrwlAI7kfIPDANQpw.png) + +所以这是你如何做到这一点: + +``` + phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-2), TrainingPhase(epochs=2, opt_fn=optim.SGD, lr = 1e-3)] +``` + +阶段是具有特定优化器参数的训练时段,并且`phases`由许多训练阶段对象组成。 训练阶段对象说明要训练多少个时期,使用什么优化功能,以及我们将看到的其他事物的学习率。 在这里,您将看到刚刚在该图表上看到的两个培训阶段。 所以,现在,你不是打电话给`learn.fit` ,而是说: + +``` + learn.fit_opt_sched(phases) +``` + +换句话说,使用具有这些阶段的优化程序调度程序来学习。 从那里开始,你传入的大部分内容都可以按照惯例发送到fit函数,因此大多数常用参数都能正常工作。 一般来说,我们可以使用这些培训阶段,您会发现它符合常规方式。 然后,当你说`plot_lr`你会看到上面的图表。 它不仅绘制学习率,还绘制动量,每个阶段,它告诉你它使用了什么优化器。 你可以关闭优化器的打印( `show_text=False` ),你可以关闭动作的打印( `show_moms=False` ),你可以做其他的小事情,比如训练阶段可以有一个`lr_decay`参数[ [5:47](https://youtu.be/xXXiC4YRGrQ%3Ft%3D5m47s) ] : + +``` + phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-2), TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = (1e-2,1e-3), lr_decay=DecayType.LINEAR), TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-3)] +``` + +所以这里是一个固定的学习率,然后是一个线性衰减学习率,然后是一个固定的学习率,它放弃了这张图: + +``` + lr_i = start_lr + (end_lr - start_lr) * i/n +``` + +![](../img/1_duLsu6JhsWXtVxSTJ3LvTw.png) + +这可能是一种非常好的训练方式,因为我们知道在高学习率下,你可以更好地探索,而且在低学习率的情况下,你可以更好地进行微调。 并且在两者之间逐渐滑动可能会更好。 所以我怀疑这实际上并不是一个糟糕的方法。 + +你可以使用其他衰变类型,如余弦[ [6:25](https://youtu.be/xXXiC4YRGrQ%3Ft%3D6m25s) ]: + +``` + phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-2), TrainingPhase(epochs=1, opt_fn=optim.SGD, lr =(1e-2,1e-3), lr_decay=DecayType.COSINE), TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-3)] +``` + +这可能更有意义,因为它具有真正潜在有用的学习速率退火形状。 + +``` + lr_i = end_lr + (start_lr - end_lr)/2 * (1 + np.cos(i * np.pi)/n) +``` + +![](../img/1_ABxYYgRpKEiadfQj7tSWHg.png) + +指数是超级流行的方法: + +``` + lr_i = start_lr * (end_lr/start_lr)**(i/n) +``` + +![](../img/1_cDjz1ZSPmDfHWQGDt4nFUQ.png) + +多项式不是非常受欢迎但实际上在文献中比其他任何东西都更好,但似乎在很大程度上被忽略了。 所以多项式是很好的意识到。 Sylvain所做的是他给了我们每条曲线的公式。 因此,使用多项式,您可以选择要使用的多项式。 我相信0.9的p是我见过的非常好的结果 - 仅供参考。 + +``` + lr_i = end_lr + (start_lr - end_lr) * (1 - i/n) ** p +``` + +![](../img/1_Ku8WXqHiEI4_Q_XtfrL4Qg.png) + +如果你在LR衰减时没有给出学习率的元组,那么它将一直衰减到零[ [7:26](https://youtu.be/xXXiC4YRGrQ%3Ft%3D7m26s) ]。 正如您所看到的,您可以愉快地在不同的点开始下一个周期。 + +``` + phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-2), TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-2, lr_decay=DecayType.COSINE), TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-3)] +``` + +![](../img/1__2LEpZ_eiPa_Oh6uaUz3Ig.png) + +#### SGDR [ [7:43](https://youtu.be/xXXiC4YRGrQ%3Ft%3D7m43s) ] + +所以很酷的是,现在我们可以使用除了这些培训阶段之外的所有现有计划复制。 所以这里有一个名为`phases_sgdr`的函数,它使用新的训练阶段API执行SGDR。 + +``` + **def** phases_sgdr(lr, opt_fn, num_cycle,cycle_len,cycle_mult): phases = [TrainingPhase(epochs = cycle_len/ 20, opt_fn=opt_fn, 
lr=lr/100), TrainingPhase(epochs = cycle_len * 19/20, opt_fn=opt_fn, lr=lr, lr_decay=DecayType.COSINE)] **for** i **in** range(1,num_cycle): phases.append(TrainingPhase(epochs=cycle_len* (cycle_mult**i), opt_fn=opt_fn, lr=lr, lr_decay=DecayType.COSINE)) **return** phases +``` + +所以你可以看到,如果他运行这个时间表,这就是它的样子: + +![](../img/1_4ZZswpXpsQ5ZAwLpnMlUiw.png) + +他甚至完成了我所拥有的小技巧,你只需要一点点学习率,然后弹出并做几个周期,周期越来越长[ [8:05](https://youtu.be/xXXiC4YRGrQ%3Ft%3D8m5s) ]。 这一切都是在一个功能中完成的。 + +#### 1周期[ [8:20](https://youtu.be/xXXiC4YRGrQ%3Ft%3D8m20s) ] + +我们现在可以使用一个小函数实现新的1循环。 + +``` + **def** phases_1cycle(cycle_len,lr,div,pct,max_mom,min_mom): tri_cyc = (1-pct/100) * cycle_len **return** [TrainingPhase(epochs=tri_cyc/2, opt_fn=optim.SGD, lr=(lr/div,lr), lr_decay=DecayType.LINEAR, momentum=(max_mom,min_mom), momentum_decay=DecayType.LINEAR), TrainingPhase(epochs=tri_cyc/2, opt_fn=optim.SGD, lr=(lr,lr/div), lr_decay=DecayType.LINEAR, momentum=(min_mom,max_mom), momentum_decay=DecayType.LINEAR), TrainingPhase(epochs=cycle_len-tri_cyc, opt_fn=optim.SGD, lr=(lr/div,lr/(100*div)), lr_decay=DecayType.LINEAR, momentum=max_mom)] +``` + +因此,如果我们适合这一点,我们得到这个三角形,接着是一点点平坦的一点,动量是一个很酷的东西 - 动量有动量衰减。 在第三个TrainingPhase中,我们有一个固定的动力。 所以它同时在做动力和学习率。 + +![](../img/1_Wws6eOCVXppEMOtORM1dSQ.png) + +#### 判别学习率+ 1周期[ [8:53](https://youtu.be/xXXiC4YRGrQ%3Ft%3D8m53s) ] + +所以我还没有尝试过,但我认为真的很有趣的是使用判别学习率和1周期的组合。 还没有人尝试过。 所以那真的很有趣。 我遇到的唯一具有歧视性学习率的论文使用了一种叫做LARS的东西。 通过查看每层的梯度和平均值之间的比率,并使用它来自动更改每层的学习率,它用于训练具有非常大批量的ImageNet。 他们发现他们可以使用更大的批量。 这是我见过这种方法的唯一其他地方,但是你可以尝试结合歧视学习率和不同的有趣时间表来尝试很多有趣的事情。 + +#### 你自己的LR发现者[ [10:06](https://youtu.be/xXXiC4YRGrQ%3Ft%3D10m6s) ] + +你现在可以编写自己的不同类型的LR手指,特别是因为现在有这个`stop_div`参数,这基本上意味着它将使用你要求的任何时间表,但是当损失太糟糕时,它将停止训练。 + +添加的一个有用的东西是`plot`函数的`linear`参数。 如果您在学习速率查找器中使用线性时间表而不是指数时间表,如果您对大致正确的区域进行微调,那么您可以使用线性来找到正确的区域。 然后你可能想用线性刻度绘制它。 这就是为什么你现在也可以将线性传递到情节。 + +您可以在每个阶段[ [11:06](https://youtu.be/xXXiC4YRGrQ%3Ft%3D11m6s) ]更改优化程序。 这比你想象的更重要,因为实际上目前用于真正大批量训练的最先进技术对于ImageNet实际上很快就从RMSProp开始第一位,然后他们切换到第二位的SGD。 所以这可能是一个有趣的实验,因为至少有一篇论文现在表明它可以很好地运作。 同样,它还没有得到很好的认可。 + +#### 改变数据[ [11:49](https://youtu.be/xXXiC4YRGrQ%3Ft%3D11m49s) ] + +然后我发现最有趣的是你可以改变你的数据。 我们为什么要更改数据? 因为您从第1课和第2课中记得,您可以在开始时使用小图像,稍后可以使用更大的图像。 理论上你可以使用它来更快地用较小的图像训练第一位,并记住如果你将高度减半并将宽度减半,你每层都有四分之一的激活,所以它可以快得多。 它甚至可能更好地概括。 因此,您现在可以创建几个不同的大小,例如,他有28和32大小的图像。 这是CIFAR10所以你只能这么做。 然后,如果在调用`fit_opt_sched`时`fit_opt_sched`此`data_list`参数中的数据数组,它将为每个阶段使用不同的数据集。 + +``` + data1 = get_data(28,batch_size) data2 = get_data(32,batch_size) +``` + +``` + learn = ConvLearner.from_model_data(ShallowConvNet(), data1) +``` + +``` + phases = [TrainingPhase(epochs=1, opt_fn=optim.Adam, lr=1e-2, lr_decay=DecayType.COSINE), TrainingPhase(epochs=2, opt_fn=optim.Adam, lr=1e-2, lr_decay=DecayType.COSINE)] +``` + +``` + learn.fit_opt_sched(phases, data_list=[data1,data2]) +``` + +这真的很酷,因为我们可以使用它,就像我们可以在我们的DAWN工作台条目中使用它,看看当我们用非常少的代码实际增加大小时会发生什么。 那么当我们这样做时会发生什么[ [13:02](https://youtu.be/xXXiC4YRGrQ%3Ft%3D13m2s) ]? 
答案就在ImageNet的DAWN工作台培训中: + +![](../img/1_5CGZBcVMwjuz7N6J9906xQ.png) + +你可以在这里看到谷歌已经在一个TPU集群上赢了半个小时。 最好的非集群TPU结果是fast.ai + 3个小时以下的学生在128台计算机上击败英特尔,在其他地方,我们在一台计算机上运行。 我们还击败谷歌在TPU上运行,所以使用这种方法,我们已经显示: + +* 最快的GPU结果 +* 最快的单机结果 +* 最快的公开基础设施结果 + +除非您是Google,否则您无法使用这些TPU广告连播。 此外,成本很小(72.54美元),这款英特尔的计算机价值1200美元 - 他们甚至没有在这里写过它,但如果你使用并行的128台计算机,每台计算机有36个核心,每个都有140G,那就是你得到的与我们的单个AWS实例进行比较。 所以这是我们能做的突破。 我们可以在单个公共机器上训练ImageNet的想法是72美元,顺便说一下,它实际上是25美元,因为我们使用了一个现场实例。 我们的一个学生安德鲁·肖建立了整个系统,允许我们抛出一大堆现场实例实验并同时运行它们并且非常自动地运行它们,但是DAWN工作台没有引用我们使用的实际数字。 所以它实际上是25美元,而不是72美元。 所以这个`data_list`想法非常重要且有用。 + +#### CIFAR10结果[ [15:15](https://youtu.be/xXXiC4YRGrQ%3Ft%3D15m15s) ] + +我们的CIFAR10结果现在也正式上传,你可能还记得以前最好的结果是一个多小时。 这里的诀窍是使用1循环,所以Sylvain的训练阶段API中的所有这些东西都是我们用来获得这些最佳结果的所有东西。 另一位名叫bkj的fast.ai学生已经接受了这个并完成了他自己的版本,他拿了一个Resnet18并添加了你可能记得我们在顶部学到的concat汇集,并使用了Leslie Smith的1循环,所以他有了在排行榜上。 所以前三名都是fast.ai学生,这很精彩。 + +![](../img/1_9UOqTbSEEsBMtEp5qKdUQg.png) + +#### CIFAR10成本结果[ [16:05](https://youtu.be/xXXiC4YRGrQ%3Ft%3D16m5s) ] + +相同的成本 - 前3,你可以看到,Paperspace。 Brett在Paperspace上运行了这个,并且在bkj之前得到了最便宜的结果。 + +![](../img/1_c2xPu_54XdllFPo8mrQmaA.png) + +所以我认为你可以看到[ [16:25](https://youtu.be/xXXiC4YRGrQ%3Ft%3D16m25s) ],目前很多有趣的机会,培训内容更快,更便宜,都是关于学习速率退火,尺寸退火和不同时间不同参数的训练,我仍然认为每个人都在摸索表面。 我认为我们可以更快,更便宜。 这对于资源有限的环境中的人来说真的很有帮助,除了Google之外,基本上每个人都可能是Facebook。 + +尽管[ [17:00](https://youtu.be/xXXiC4YRGrQ%3Ft%3D17m) ],架构也很有趣,我们上周看到的其中一件事就是创建了一个更简单的暗网架构版本。 但是有一个架构我们没有谈论哪些是理解Inception网络所必需的。 Inception网络实际上非常有趣,因为它们使用一些技巧来提高效率。 我们目前没有使用这些技巧,我觉得也许我们应该尝试一下。 最有趣和最成功的Inception网络是他们的Inception-ResNet-v2网络,其中的大多数块看起来像这样: + +![](../img/1_HUqLof5PFa8QA0imCbzjgw.png) + +它看起来很像一个标准的Res​​Net块,因为它有一个身份连接,并且有一个转换路径,我们将它们加在一起[ [17:47](https://youtu.be/xXXiC4YRGrQ%3Ft%3D17m47s) ]。 但事实并非如此。 第一个是中间转换路径是1x1转换,并且值得思考1x1转换实际上是什么。 + +#### 1x1卷积[ [18:23](https://youtu.be/xXXiC4YRGrQ%3Ft%3D18m23s) ] + +1x1 conv只是对输入中的每个网格单元说,你基本上有一个向量。 滤波器张数的1乘1基本上是矢量。 对于输入中的每个网格单元,您只需要使用该张量进行点积。 当然,它将成为我们正在创建的192个激活中的每个激活的向量之一。 所以基本上做192网点产品与网格单元(1,1),然后192网格单元(1,2)或(1,3)等等。 因此,您将得到与输入具有相同网格大小的内容以及输出中的192个通道。 因此,这是一种非常好的方法,可以在不改变网格大小的情况下减少维度或增加输入的维度。 这通常是我们使用1x1转换的原因。 在这里,我们有一个1x1转换和另一个1x1转换,然后他们将它们加在一起。 然后是第三条路径,不添加第三条路径。 没有明确提到,但第三条路径是连接在一起的。 有一种形式的ResNet与ResNet基本相同,但我们不做加号,我们做concat。 这叫做DenseNet。 它只是一个ResNet,我们用concat代替plus。 这是一种有趣的方法,因为那时身份路径的类型实际上是被复制的。 因此,你可以获得整个流程,因此我们将在下周看到,这对于细分和类似的东西来说更好,你真的想保留原始像素,第一层像素,第二层像素层不受影响。 + +连接而不是添加分支是一件非常有用的事情,我们正在连接中间分支和右分支[ [20:22](https://youtu.be/xXXiC4YRGrQ%3Ft%3D20m22s) ]。 最正确的分支正在做一些有趣的事情,首先是1x1转换,然后是1x7,然后是7x1。 那里发生了什么? 所以,那里发生的事情基本上我们真正想做的就是做7x7转换。 我们想要进行7x7转换的原因是,如果你有多条路径(每条路径都有不同的内核大小),那么它就可以查看不同数量的图像。 最初的Inception网络有1x1,3x3,5x5,7x7连接在一起或类似的东西。 因此,如果我们可以使用7x7过滤器,那么我们可以立即查看大量图像并创建一个非常丰富的表示。 所以Inception网络的主干是Inception网络的前几层实际上也使用了这种类型的7x7转换,因为你从这个224乘224乘3开始,你想把它变成112乘112的东西64.通过使用7x7转换,您可以在每个输出中获得大量信息,以获得64个过滤器。 但问题是7x7转换是很多工作。 对于每个通道的每个输入像素,您有49个内核值乘以49个输入。 所以计算很疯狂。 对于第一层,你可以为它(可能)侥幸逃脱,事实上,ResNet的第一个转换是7x7转换。 + +但对于Inception [ [22:30](https://youtu.be/xXXiC4YRGrQ%3Ft%3D22m30s) ]则不是这样。 他们没有做7x7转换,相反,他们做1x7跟随7x1。 因此,要解释一下,初始网络的基本思想或它的所有不同版本,你有许多具有不同卷积宽度的独立路径。 在这种情况下,概念上的想法是中间路径是1x1卷积宽度,右路径将是7卷积宽度,因此他们查看不同数量的数据然后我们将它们组合在一起。 但我们不希望通过网络获得7x7转换,因为它的计算成本太高。 + +但是如果你考虑它[ [23:18](https://youtu.be/xXXiC4YRGrQ%3Ft%3D23m18s) ],如果我们有一些输入,我们有一些我们想要的大过滤器,而且它太大而无法处理。 我们能做什么? 我们做5x5。 我们能做的是创建两个过滤器 - 一个是1x5,一个是5x1。 我们对前一层进行了激活,并将它通过了1x5。 我们从中取出激活,并通过5x1,而另一端则出现了一些东西。 现在另一端出现了什么? 而不是把它想象成,首先,我们采取激活,然后我们通过1x5,然后我们通过5x1,如果相反我们将这两个操作放在一起,并说什么是5x1点积和一个1x5点的产品一起做? 
实际上,你可以采取1x5和5x1,其外部产品将给你一个5x5。 现在你不能通过获取该产品来创建任何可能的5x5矩阵,但是你可以创建很多5x5矩阵。 所以这里的基本思想是当你考虑操作的顺序时(如果你对这里的更多理论感兴趣,你应该看看Rachel的数值线性代数课程,这基本上是关于这个的整个过程)。 但从概念上讲,这个想法很常见,你想要做的计算实际上比整个5x5卷积更简单。 通常,我们在线性代数中使用的术语是有一些较低等级的近似值。 换句话说,1x5和5x1组合在一起 - 5x5矩阵几乎与5x5矩阵一样好,如果你能够,你理想情况下会计算出来。 所以在实践中经常出现这种情况 - 仅仅因为现实世界的本质是现实世界往往具有比随机更多的结构。 + +很酷的是[26:16],如果我们用1x7和7x1替换我们的7x7转换器,对于每个单元,它通过输出通道点产品有14个输入通道,而7x7一个有49个要做。 所以它会快得多,我们不得不希望它几乎一样好。 根据定义,它肯定会捕获尽可能多的信息宽度。 + +![](../img/1_V1k6QCPEuoJpD2GmRrCfAQ.png) + +如果您有兴趣了解更多相关信息,特别是在深度学习领域,您可以谷歌参与**Factored Convolutions** 。 这个想法是在3或4年前提出来的。 它可能已经存在了很长时间,但那是我第一次看到它的时候。 事实证明它工作得非常好,并且Inception网络广泛使用它。 他们实际上是在他们的干中使用它。 我们之前已经讨论了我们如何倾向于加载 - 例如,当我们有ResNet34时,我们倾向于说这是主要的主干。 这是所有卷积的主要支柱,然后我们可以添加一个自定义头,它往往是最大池或完全连接层。 最好谈谈骨干包含两个部分:一个是干,另一个是主干。 原因是进入的东西只有3个通道,所以我们想要一些操作序列,这些操作将扩展到更丰富的东西 - 通常类似于64个通道。 + +![](../img/1_5JWfOol_FWA5gjThyktI7g.png) + +在ResNet中,词干非常简单。 这是一个7x7步幅2转,然后是一个步幅2最大池(我认为如果内存正确服务就是这样)。 初始有一个更复杂的词干,多个路径被组合和连接,包括factored conv(1x7和7x1)。 我很感兴趣,例如,如果你将一个标准的Res​​Net堆叠在一个Inception干上,会发生什么。 我认为这将是一个非常有趣的事情,因为一个Inception词干是一个非常精心设计的东西,而你如何采用3通道输入并将其转化为更丰富的东西似乎非常重要。 所有这些工作似乎都被ResNet抛弃了。 我们喜欢ResNet,它的效果非常好。 但是,如果我们将一个密集的网络骨干放在一个Inception干线上呢? 或者如果我们用标准ResNet中的1x7和7x1因子转换替换7x7转换器怎么办? 我们可以尝试很多东西,我认为它会非常有趣。 因此,对潜在的研究方向有更多的想法。 + +所以这就是我的一小部分随机内容[ [29:51](https://youtu.be/xXXiC4YRGrQ%3Ft%3D29m51s) ]。 更接近实际的主题是图像增强。 我将简要讨论一篇新论文,因为它真正将我刚才讨论的内容与我们接下来要讨论的内容联系起来。 这篇关于渐进式GAN的论文来自Nvidia: [GAN的渐进式增长,用于提高质量,稳定性和变异性](http://research.nvidia.com/publication/2017-10_Progressive-Growing-of) 。 渐进式GAN采用逐渐增加图像尺寸的想法。 这是我所知道的唯一其他方向,人们实际上逐渐增加了图像尺寸。 这让我感到惊讶,因为这篇论文实际上非常受欢迎,众所周知,并且很受欢迎,但是,人们还没有采取逐步增加图像尺寸的基本思想,并在其他任何地方使用它,向您展示您可以期待的一般创造力水平也许,在深度学习研究社区中找到。 + +![](../img/1_QZSuQJD2MhWOyQ5bxTOSEg.png) + +他们真的回去了,他们从4x4 GAN开始[ [31:47](https://youtu.be/xXXiC4YRGrQ%3Ft%3D31m47s) ]。 从字面上看,他们试图复制4x4像素,然后是8x8(上面的左上角)。 这是CelebA数据集,所以我们正在尝试重新创建名人的照片。 然后他们去了16x16,32,64,128,然后是256.他们做的一件非常好的事情是,随着他们增加尺寸,他们还会为网络添加更多层。 哪种有意义,因为如果你正在做更多的ResNet-y类型的东西,那么你正在吐出一些希望在每个网格单元大小上有意义的东西,所以你应该能够将东西分层。 他们做了另一件漂亮的事情,当他们这样做时他们添加跳过连接,并且他们逐渐改变线性插值参数,使其越来越远离旧的4x4网络并转向新的8x8网络。 然后,一旦它完全移动它,他们扔掉了额外的连接。 细节并不重要,但它使用了我们所讨论的基本思想,逐渐增加图像大小并跳过连接。 这是一篇很好的研究论文,因为它是这些罕见的事情之一,优秀的工程师实际上建立了一些只是以一种非常明智的方式工作的东西。 现在,这实际上来自于Nvidia本身也就不足为奇了。 Nvidia不做很多论文,有趣的是,当他们这样做时,他们会构建一些非常实用和明智的东西。 所以我认为这是一篇很好的论文,如果你想把我们学到的很多不同的东西放在一起,并且没有很多的重新实现,所以这是一个有趣的事情,也许你可以建立和找到别的东西。 + +接下来会发生什么[ [33:45](https://youtu.be/xXXiC4YRGrQ%3Ft%3D33m45s) ]。 我们最终会达到1024x1024,你会发现这些图像不仅分辨率更高,而且越来越好。 所以我要看看你是否可以猜出以下哪一个是假的: + +![](../img/1_xKpY8dRD8lSeL6Twd2lcgw.png) + +他们都是假的。 那是下一个阶段。 你上升了,他们砰的一声。 所以GAN和东西都变得疯狂,你们中的一些人可能在本周[ [34:16](https://youtu.be/xXXiC4YRGrQ%3Ft%3D34m16s) ]看到了这一点。 这段视频刚刚问世,这是巴拉克奥巴马的演讲,让我们来看看: + +正如你所看到的,他们已经使用这种技术来实现奥巴马面对乔丹皮尔脸部移动的方式。 你现在基本上拥有了所需的所有技术。 这是一个好主意吗? + +### 人工智能中的伦理[ [35:31](https://youtu.be/xXXiC4YRGrQ%3Ft%3D35m31s) ] + +这就是我们谈论什么是最重要的,现在我们可以做所有这些事情,我们应该做什么以及我们如何思考? 
TL; DR版本是我其实不知道的。 最近很多人看到spaCy神童们的创始人在Explosion AI上做了一个演讲,Matthew和Ines,然后我和他们一起吃饭,我们基本上整个晚上都在谈论,辩论,争论什么做这意味着像我们这样的公司正在构建工具,使得可以以有害方式使用的工具的民主化。 他们是非常有思想的人,我们不会说我们不同意,我们自己也无法得出结论。 所以我只是要提出一些问题,并指出一些研究,当我说研究时,大多数实际的文献回顾和把它放在一起是由Rachel完成的,所以谢谢Rachel。 + +首先我要说的是,我们建立的模型往往非常糟糕,并不是很明显[ [36:52](https://youtu.be/xXXiC4YRGrQ%3Ft%3D36m52s) ]。 你不会知道它们是多么糟糕,除非与你一起建造它们的人是一系列人,与你一起使用它们的人是一群人。 例如,一些出色的研究人员, [Timnit Gebru](https://twitter.com/timnitGebru)在微软和[Joy Buolamwini](https://twitter.com/jovialjoy)刚刚从麻省理工学院获得博士学位,他们做了这个非常有趣的研究,他们看了一些现成的面部识别器,一个来自FACE ++,这是一个巨大的中国人公司,IBM和微软,他们寻找一系列不同的面部类型。 + +![](../img/1_nELJUHaM-pD_MHwLz4ThEg.png) + +一般来说,微软的一个特别令人难以置信的准确,除非脸部类型突然变暗皮肤突然变得更糟。 IBM差不多有一半时间弄错了。 对于像这样的大公司来说,发布一种产品,对于世界上很大一部分而言,不起作用不仅仅是技术故障。 这是一个非常深刻的失败,无法理解需要使用什么样的团队来创建这样的技术,测试这样的技术,甚至了解客户是谁。 你的一些顾客皮肤黝黑。 “我还要补充一点,分类器对女性的影响都比男性差”(Rachel)。 令人震惊的。 有趣的是,雷切尔前几天发布了类似这样的内容,而且有人说“这是怎么回事? 你在说什么? 难道你不知道人们长时间制造汽车 - 你是说你需要女性来制造汽车?“雷切尔指出 - 实际上是的。 对于汽车安全的大部分历史来说,汽车中的女性死于汽车的风险远远超过男性,因为这些男性创造了男性外观,感觉,大小的碰撞测试假人,因此汽车安全实际上没有在女性身体上进行测试。 糟糕的产品管理以及多样性和理解的完全失败对我们的领域来说并不陌生。 + +“我只想说比较男女相似力量的影响”(瑞秋)。 我不知道为什么每当你在推特上说这样的话时,雷切尔就不得不说这个,因为无论何时你在Twitter上说这样的话,大约有10个人会说“哦,你必须比较所有这些其他东西”好像我们不知道那样。 + +![](../img/1_mHJOIpE3W1Q4Z4NAl_JWlw.png) + +其他一些我们最着名的系统就像微软的面部识别器或谷歌的语言翻译器,你转过身来“她是一名医生。 他是一名护士。“进入土耳其语并且非常正确 - 两个代词都变成了O,因为土耳其语中没有性别代词。 走向另一个方向,它变成了什么? “他是一个医生。 她是一名护士。“因此,我们将这些偏见纳入了我们每天都在使用的工具中。 再一次,人们说“哦,它只是向我们展示了世界上的东西”,而且,这个基本断言有很多问题,但正如你所知,机器学习算法喜欢概括。 + +![](../img/1_kAtnaDbTmcC_uVWSzxOGEg.png) + +所以,因为他们喜欢概括,这是一个很酷的事情,你们现在知道技术细节,因为他们喜欢概括,当你看到60%的人烹饪的东西是他们用来建立这个模型和然后你在一组单独的图片上运行模型,然后他们选择烹饪的人中有84%是女性,而不是正确的67%。 对于算法而言,这是一个真正可以理解的事情,因为它采用了偏置输入并创建了更偏向的输出,因为对于这个特定的损失函数,它就是最终的结果。 这是一种非常常见的模型放大。 + +这件事很重要[ [41:41](https://youtu.be/xXXiC4YRGrQ%3Ft%3D41m41s) ]。 它的重要性不仅仅是笨拙的翻译或黑人的照片没有被正确分类。 也许也有一些胜利 - 比如到处都是恐怖的监视,也许不会对黑人有用。 “或者情况会更糟,因为这是可怕的监视,而且是种族主义和错误的结果”(雷切尔)。 但是,让我们更深入。 For all we say about human failings, there is a long history of civilization and societies creating layers of human judgement which avoid, hopefully, the most horrible things happening. And sometimes companies which love technology think “let's throw away humans and replace them with technology” like Facebook did. A couple years ago, Facebook literally got rid of their human editors, and this was in the news at the time. And they were replaced with algorithms. So now as algorithms put all the stuff on your news feed and human editors were out of the loop. What happened next? + +![](../img/1_VkIbRF2g5fsRvgfopPRDZQ.png) + +Many things happened next. One of which was a massive horrifying genocide in Myanmar. Babies getting torn out of their mothers arms and thrown into fires. Mass rape, murder, and an entire people exiled from their homeland. + +![](../img/1_6-Uu8ezBnUol5cYw4Q11lA.png) + +Okay, I'm not gonna say that was because Facebook did this, but what I will say is that when the leaders of this horrifying project are interviewed, they regularly talk about how everything they learnt about the disgusting animal behaviors of Rohingyas that need to be thrown off the earth, they learnt from Facebook. Because the algorithms just want to feed you more stuff that gets you clicking. If you get told these people that don't look like you and you don't know the bad people and here's lots of stories about bad people and then you start clicking on them and then they feed you more of those things. Next thing you know, you have this extraordinary cycle. 
People have been studying this, so for example, we've been told a few times people click on our fast.ai videos and then the next thing recommended to them is like conspiracy theory videos from Alex Jones, and then continues from there. Because humans click on things that shock us, surprise us, and horrify us. At so many levels, this decision has had extraordinary consequences which we're only beginning to understand. Again, this is not to say this particular consequence is because of this one thing, but to say it's entirely unrelated would be clearly ignoring all of the evidence and information that we have. + +#### Unintended consequences [ [45:04](https://youtu.be/xXXiC4YRGrQ%3Ft%3D45m4s) ] + +![](../img/1_8stxAKlNajQqn4Q6mt9HpQ.png) + +The key takeaway is to think what are you building and how could it be used. Lots and lots of effort now being put into face detection including in our course. We've been spending a lot of time thinking about how to recognize stuff and where it is. There's lots of good reasons to want to be good at that for improving crop yields in agriculture, for improving diagnostic and treatment planning in medicine, for improving your LEGO sorting robot system, etc. But it's also being widely used in surveillance, propaganda, and disinformation. Again, the question is what do I do about that? I don't exactly know. But it's definitely at least important to be thinking about it, talking about it. + +#### Runaway feedback loops [ [46:10](https://youtu.be/xXXiC4YRGrQ%3Ft%3D46m10s) ] + +![](../img/1_MQa0eNjEl__LOn8pc0YmDw.png) + +Sometimes you can do really good things. For example, meetup.com did something which I would put in the category of really good thing which is they recognized early a potential problem which is that more men are tending to go to their meet ups. And that was causing their collaborative filtering systems, which you are familiar building now to recommend more technical content to men. And that was causing more men to go to more technical content which was causing the recommendation system to suggest more technical content to men. This kind of runaway feedback loop is extremely common when we interface the algorithm and the human together. So what did Meetup do? They intentionally made the decision to recommend more technical content to women, not because highfalutin idea about how the world should be, but just because that makes sense. Runaway feedback loop was a bug — there are women that want to go to tech meetups, but when you turn up for a tech meet up and it's all men and you don't go, then it recommends more to men and so on and so forth. So Meetup made a really strong product management decision here which was to not do what the algorithm said to do. Unfortunately this is rare. Most of these runaway feedback loops, for example, in predictive policing where algorithms tell policemen where to go which very often is more black neighborhoods which end up crawling with more policemen which leads to more arrests which is assisting to tell more policemen to go to more black neighborhoods and so forth. + +#### Bias in AI [ [48:09](https://youtu.be/xXXiC4YRGrQ%3Ft%3D48m9s) ] + +![](../img/1_Bd_fR4tfFYj5fBQYgum35A.png) + +This problem of algorithmic bias is now very wide spread and as algorithms become more and more widely used for specific policy decisions, judicial decisions, day-to-day decisions about who to give what offer to, this just keeps becoming a bigger problem. 
Some of them are really things that the people involved in the product management decision should have seen at the very start, didn't make sense, and unreasonable under any definition of the term. For example, this stuff Abe Gong pointed out — these were questions that were used for both pretrial so who was required to post bail, so these are people that haven't even been convicted, as well as for sentencing and for who gets parole. This was upheld by the Wisconsin Supreme Court last year despite all the flaws. So whether you have to stay in jail because you can't pay the bail and how long your sentence is for, and how long you stay in jail for depends on what your father did, whether your parents stayed married, who your friends are, and where you live. Now turns out these algorithms are actually terribly terribly bad so some recent analysis showed that they are basically worse than chance. But even if the company's building them were confident on these were statistically accurate correlations, does anybody imagine there's a world where it makes sense to decide what happens to you based on what your dad did? + +A lot of this stuff at the basic level is obviously unreasonable and a lot of it just fails in these ways that you can see empirically that these kind of runaway feedback loops must have happened and these over generalizations must have happened. For example, these are the cross tabs that anybody working in any field using these algorithm should be preparing. So prediction of likelihood of reoffending for black vs. white defendants, we can just calculate this very simply. Of the people that were labeled high-risk but didn't reoffend — they were 23.5% white but about twice that African American. Where else, those that were labeled lower risk but did reoffend was half the white people and only 28% of the African American. This is the kind of stuff where at least if you are taking the technologies we've been talking about and putting the production in any way, building an API for other people, providing training for people, or whatever — then at least make sure that what you are doing can be tracked in a way that people know what's going on so at least they are informed. I think it's a mistake in my opinion to assume that people are evil and trying to break society. I think I would prefer to start with an assumption of if people are doing dumb stuff, it's because they don't know better. So at least make sure they have this information. I find very few ML practitioners thinking about what is the information they should be presenting in their interface. Then often I'll talk to data scientists who will say “oh, the stuff I'm working on doesn't have a societal impact.” Really? A number of people who think that what they are doing is entirely pointless? 来吧。 People are paying you to do it for a reason. It's going to impact people in some way. So think about what that is. + +#### Responsibility in hiring [ [52:46](https://youtu.be/xXXiC4YRGrQ%3Ft%3D52m46s) ] + +![](../img/1_7V8grUptQO556VPQ4Dw8Sw.png) + +The other thing I know is a lot of people involved here are hiring people and if you are hiring people, I guess you are all very familiar with the fast.ai philosophy now which is the basic premise that, and I thin it comes back to this idea that I don't think people on the whole are evil, I think they need to be informed and have tools. 
So we are trying to give as many people the tools as possible that they need and particularly we are trying to put those tools in the hands of a more diverse range of people. So if you are involved in hiring decisions, perhaps you can keep this kind of philosophy in mind as well. If you are not just hiring a wider range of people, but also promoting a wider range of people, and providing appropriate career management for a wider range of people, apart from anything else, your company will do better. It actually turns out that more diverse teams are more creative and tend to solve problems more quickly and better than less diverse teams, but also you might avoid these kind of awful screw-ups which, at one level, are bad for the world and another level if you ever get found out, they can destroy your company. + +#### IBM &amamp; “Death's Calculator” [ [54:08](https://youtu.be/xXXiC4YRGrQ%3Ft%3D54m8s) ] + +![](../img/1_sOrJFBPiJYdUsxzythGk8w.png) + +Also they can destroy you or at least make you look pretty bad in history. A couple of examples, one is going right back to the second world war. IBM provided all of the infrastructure necessary to track the Holocaust. These are the forms they used and they had different code — Jews were 8, Gypsies were 12, death in the gas chambers was 6, and they all went on these punch cards. You can go and look at these punch cards in museums now and this has actually been reviewed by a Swiss judge who said that IBM's technical assistance facilitated the task of the Nazis and the commission their crimes against humanity. It is interesting to read back the history from these times to see what was going through the minds of people at IBM at that time. What was clearly going through the minds was the opportunity to show technical superiority, the opportunity to test out their new systems, and of course the extraordinary amount of money that they were making. When you do something which at some point down the line turns out to be a problem, even if you were told to do it, that can turn out to be a problem for you personally. For example, you all remember the diesel emission scandal in VW. Who is the one guy that went to jail? It was the engineer just doing his job. If all of this stuff about actually not messing up the world isn't enough to convince you, it can mess up your life too. If you do something that turns out to cause problems even though somebody told you to do it, you can absolutely be held criminally responsible. Aleksandr Kogan was the guy that handed over the Cambridge Analytica data. He is a Cambridge academic. Now a very famous Cambridge academic the world over for doing his part to destroy the foundations of democracy. This is not how we want to go down in history. + +![](../img/1_qXLN21dyZdaaYfxXuCTWhg.png) + +**Question:** In one of your tweets, you said dropout is patented [ [56:50](https://youtu.be/xXXiC4YRGrQ%3Ft%3D56m50s) ]. I think this is about WaveNet patent from Google. 这是什么意思? Can you please share more insight on this subject? Does it mean that we will have to pay to use dropout in the future? One of the patent holders is Geoffrey Hinton. 所以呢? Isn't that great? Invention is all about patents, blah blah. My answer is no. Patents have gone wildly crazy. The amount of things that are patentable that we talk about every week would be dozens. 
It's so easy to come up with a little tweak and then if you turn that into a patent to stop everybody from using that little tweak for the next 14 years and you end up with a situation we have now where everything is patented in 50 different ways. Then you get these patent trolls who have made a very good business out of buying lots of crappy little patents and then suing anybody who accidentally turned out did that thing like putting rounded corners on buttons. So what does it mean for us that a lot of stuff is patented in deep learning? 我不知道。 + +One of the main people doing this is Google and people from Google who replied to this patent tend to assume that Google doing it because they want to have it defensively so if somebody sues them, they can say don't sue us we'll sue you back because we have all these patents. The problem is that as far as I know, they haven't signed what's called a defensive patent pledge so basically you can sign a legally binding document that says our patent portfolio will only be used in defense and not offense. Even if you believe all the management of Google would never turn into a patent troll, you've got to remember that management changes. To give you a specific example I know, the somewhat recent CFO of Google has a much more aggressive stance towards the PNL, I don't know, maybe she might decide that they should start monetizing their patents or maybe the group that made that patent might get spun off and then sold to another company that might end up in private equity hands and decide to monetize the patents or whatever. So I think it's a problem. There has been a big shift legally recently away from software patents actually having any legal standing, so it's possible that these will all end up thrown out of court but the reality is that anything but a big company is unlikely to have the financial ability to defend themselves against one of these huge patent trolls. + +You can't avoid using patented stuff if you write code. I wouldn't be surprised if most lines of code you write have patents on them. Actually funnily enough, the best thing to do is not to study the patents because if you do and you infringe knowingly then the penalties are worse. So the best thing to do is to put your hands in your ear, sing a song, and get back to work. So the thing about dropouts patented, forget I said that. You don't know that. You skipped that bit. + +### Style Transfer [ [1:01:28](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h1m28s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/style-transfer.ipynb) + +![](../img/1_GPdF7Xu7mAiUAYEDbT-SHA.png) + +
Paper: [A Neural Algorithm of Artistic Style (Gatys et al., 2015)](https://arxiv.org/abs/1508.06576)
+ + + +This is super fun — artistic style. We are going a bit retro here because this is actually the original artistic style paper and there's been a lot of updates to it and a lot of different approaches and I actually think in many ways the original is the best. We are going to look at some of the newer approaches as well, but I actually think the original is a terrific way to do it even with everything that's gone since. Let's jump to the code. + +``` + %matplotlib inline %reload_ext autoreload %autoreload 2 +``` + +``` + **from** **fastai.conv_learner** **import** * from pathlib import Path from scipy import ndimage torch.cuda.set_device(3) torch.backends.cudnn.benchmark= True +``` + +``` + PATH = Path('data/imagenet') PATH_TRN = PATH/'train' +``` + +``` + m_vgg = to_gpu(vgg16( True )).eval() set_trainable(m_vgg, False ) +``` + +The idea here is that we want to take a photo of a bird, and we want to create a painting that looks like Van Gogh painted the picture of the bird. Quite a bit of the stuff that I'm doing, by the way, uses an ImageNet. You don't have to download the whole of ImageNet for any of the things I'm doing. There is an ImageNet sample in [files.fast.ai/data](http://files.fast.ai/data/) which has a couple of gig which should be plenty good enough for everything we are doing. If you want to get really great result, you can grab ImageNet. You can download it from [Kaggle](https://www.kaggle.com/c/imagenet-object-localization-challenge/data) . The localization competition actually contains all of the classification data as well. If you've got room, it's good to have a copy of ImageNet because it comes in handy all the time. + +``` + img_fn = PATH_TRN/'n01558993'/'n01558993_9684.JPEG' img = open_image(img_fn) plt.imshow(img); +``` + +So I just grabbed the bird out of my ImageNet folder and there is my bird: + +![](../img/1_eZb9GpF1VGMMIukO2AE91w.png) + +``` + sz=288 +``` + +``` + trn_tfms,val_tfms = tfms_from_model(vgg16, sz) img_tfm = val_tfms(img) img_tfm.shape +``` + +``` + (3, 288, 288) +``` + +``` + opt_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32) plt.imshow(opt_img); +``` + +What I'm going to do is I'm going to start with this picture: + +![](../img/1_jJzsPvZ9uHAtHqu9mZ_skA.png) + +And I'm going to try to make it more and more like a picture of the bird painted by Van Gogh. The way I do that is actually very simple. You're all familiar with it [ [1:03:44](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h3m44s) ]. We will create a loss function which we will call _f_ . The loss function is going to take as input a picture and spit out as output a value. The value will be lower if the image looks more like the bird photo painted by Van Gogh. Having written that loss function, we will then use the PyTorch gradient and optimizers. Gradient times the learning rate, and and we are not going to update any weights, we are going to update the pixels of the input image to make it a little bit more like a picture which would be a bird painted by Van Gogh. And we will stick it through the loss function again to get more gradients, and do it again and again. 而已。 So it's identical to how we solve every problem. You know I'm a one-trick pony, right? This is my only trick. Create a loss function, use it to get some gradients, multiply it by learning rates to update something, always before, we've updated weights in a model but today, we are not going to do that. They're going to update the pixels in the input. But it's no different at all. 
We are just taking the gradient with respect to the input rather than respect to the weights. 而已。 So we are nearly done. + +![](../img/1_6sYVxXfPJU86MMBBKib7Rw.png) + +Let's do a couple more things [ [1:05:49](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h5m49s) ]. Let's mention here that there's going to be two more inputs to our loss function One is the picture of the bird. The second is an artwork by Van Gogh. By having those as inputs as well, that means we'll be able to rerun the function later to make it look like a bird painted by Monet or a jumbo jet painted by Van Gogh, etc. Those are going to be the three inputs. Initially, as we discussed, our input here is some random noise. We start with some random noise, use the loss function, get the gradients, make it a little bit more like a bird painted by Van Gogh, and so forth. + +So the only outstanding question which I guess we can talk about briefly is how we calculate how much our image looks like this bird painted by Van Gogh [ [1:07:09](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h7m9s) ]. Let's split it into two parts: + +**Content Loss** : Returns a value that's lower if it looks more like the bird (not just any bird, the specific bird that we have coming in). + +**Style Loss** : Returns a lower number if the image is more like VG's style + +![](../img/1_MetpfEESntmYRQ5Z6e5rQw.png) + +There is one way to do the content loss which is very simple — we could look at the pixel of the output, compare them to the pixel of the bird, and do a mean squared error, and add them up. So if we did that, I ran this for a while. Eventually our image would turn into an image of the bird. You should try it. You should try this as an exercise. Try to use the optimizer in PyTorch to start with a random image and turn it into another image by using mean squared error pixel loss. Not terribly exciting but that would be step one. + +The problem is, even if we already had our style loss function working beautifully and then presumably, what we are going to do is we are going to add these two together, and then one of them, we'll multiply by some lambda to adjust how much style versus how much content. Assuming we had a style loss and we picked some sensible lambda, if we used pixel wise content loss then anything that makes it look more like Van Gogh and less like the exact photo, the exact background, the exact contrast, lighting, everything will increase the content loss — which is not what we want. We want it to look like the bird but not in the same way. It is still going to have the same two eyes in the same place and be the same kind of shape and so forth, but not the same representation. So what we are going to do is, this is going to shock you, we are going to use a neural network! We are going to use the VGG neural network because that's what I used last year and I didn't have time to see if other things worked so you can try that yourself during the week. + +The VGG network is something which takes in an input and sticks it through a number of layers, and I'm going to treat these as just the convolutional layers there's obviously ReLU there and if it's a VGG with batch norm, which most are today, then it's also got batch norm. There's some max pooling and so forth but that's fine. 
What we could do is, we could take one of these convolutional activations and then rather than comparing the pixels of this bird, we could instead compare the VGG layer 5 activations of this (bird painted by VG) to the VGG layer 5 activations of our original bird (or layer 6, or layer 7, etc). So why might that be more interesting? Well for one thing, it wouldn't be the same bird. It wouldn't be exactly the same because we are not checking the pixels. We are checking some later set of activations. So what are those later sets of activations contain? Assuming it's after some max pooling, they contain a smaller grid — so it's less specific about where things are. And rather than containing pixel color values, they are more like semantic things like is this kind of an eyeball, is this kind of furry, is this kind of bright, or is this kind of reflective, or laying flat, or whatever. So we would hope that there's some level of semantic features through those layers where if we get a picture that matches those activations, then any picture that matches those activations looks like the bird but it's not the same representation of the bird. So that's what we are going to do. That's what our content loss is going to be. People generally call this a **perceptual loss** because it's really important in deep learning that you always create a new name for every obvious thing you do. If you compare two activations together, you are doing a perceptual loss. 而已。 Our content loss is going to be a perceptual loss. Then we will do the style loss later. + +Let's start by trying to create a bird that initially is random noise and we are going to use perceptual loss to create something that is bird-like but it's not the particular bird [ [1:13:13](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h13m13s) ]. We are going to start with 288 by 288\. Because we are going to do one bird, there is going to be no GPU memory problems. I was actually disappointed that I realized that I picked a rather small input image. It would be fun to try this with something much bigger to create a really grand scale piece. The other thing to remember is if you are productionizing this, you could do a whole batch at a time. People sometimes complain about this approach (Gatys is the lead author) the Gatys' style transfer approaches being slow, and I don't agree it's slow. It takes a few seconds and you can do a whole batch in a few seconds. + +![](../img/1_eZb9GpF1VGMMIukO2AE91w.png) + +``` + sz=288 +``` + +So we are going to stick it through some transforms for VGG16 model as per usual [ [1:14:12](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h14m12s) ]. Remember, the transform class has dunder call method ( `__call__` ) so we can treat it as if it's a function. If you pass an image into that, then we get the transformed image. Try not to treat the fast.ai and PyTorch infrastructure as a black box because it's all designed to be really easy to use in a decoupled way. So this idea of that transforms are just “callables” (ie things that you can do with parentheses) comes from PyTorch and we totally plagiarized the idea. So with torch.vision or with fast.ai, your transforms are just callables. And the whole pipelines of transforms is just a callable. + +``` + trn_tfms,val_tfms = tfms_from_model(vgg16, sz) img_tfm = val_tfms(img) img_tfm.shape +``` + +``` + (3, 288, 288) +``` + +Now we have something of 3 by 288 by 288 because PyTorch likes the channel to be first [ [1:15:05](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h15m5s) ]. 
As you can see, it's been turned into a square for us, it's been normalized to (0, 1), all that normal stuff. + +Now we are creating a random image. + +``` + opt_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32) plt.imshow(opt_img); +``` + +![](../img/1_jJzsPvZ9uHAtHqu9mZ_skA.png) + +Here is something I discovered. Trying to turn this into a picture of anything is actually really hard. I found it very difficult to actually get an optimizer to get reasonable gradients that went anywhere. And just as I thought I was going to run out of time for this class and really embarrass myself, I realized the key issue is that pictures don't look like this. They have more smoothness, so I turned this into the following by blurring it a little bit: + +``` + opt_img = scipy.ndimage.filters.median_filter(opt_img, [8,8,1]) plt.imshow(opt_img); +``` + +![](../img/1_84Vk7fPct3lIUwXWFZhWRQ.png) + +I used a median filter — basically it is like a median pooling, effectively. As soon as I change it to this, it immediately started training really well. A number of little tweaks you have to do to get these things to work is kind of insane, but here is a little tweak. + +So we start with a random image which is at least somewhat smooth [ [1:16:21](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h16m21s) ]. I found that my bird image had a mean of pixels that was about half of this, so I divided it by 2 just trying to make it a little bit easier for it to match (I don't know if it matters). Turn that into a variable because this image, remember, we are going to be modifying those pixels with an optimization algorithm, so anything that's involved in the loss function needs to be a variable. And specifically, it requires a gradient because we are actually updating the image. + +``` + opt_img = val_tfms(opt_img)/2 opt_img_v = V(opt_img[ None ], requires_grad= True ) opt_img_v.shape +``` + +``` + torch.Size([1, 3, 288, 288]) +``` + +So we now have a mini batch of 1, 3 channels, 288 by 288 random noise. + +``` + m_vgg = nn.Sequential(*children(m_vgg)[:37]) +``` + +We are going to use, for no particular reason, the 37th layer of VGG. If you print out the VGG network (you can just type in `m_vgg` and prints it out), you'll see that this is mid to late stage layer. So we can just grab the first 37 layers and turn it into a sequential model. So now we have a subset of VGG that will spit out some mid layer activations, and that's what the model is going to be. So we can take our actual bird image and we want to create a mini batch of one. Remember, if you slice in Numpy with `None` , also known as `np.newaxis` , it introduces a new unit axis in that point. Here, I want to create an axis of size 1 to say this is a mini batch of size one. So slicing with None just like I did here ( `opt_img_v = V(opt_img[ **None** ], requires_grad= **True** )` ) to get one unit axis at the front. Then we turn that into a variable and this one doesn't need to be updated, so we use `VV` to say you don't need gradients for this guy. So that is going to give us our target activations. + +* We've taken our bird image +* Turned it into a variable +* Stuck it through our model to grab the 37th layer activations which is our target. We want our content loss to be this set of activations. +* We are going to create an optimizer (we will go back to the details of this in a moment) +* We are going to step a bunch of times +* Zero the gradients +* Call some loss function +* Loss.backward() + +That's the high level version. 
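As a minimal sketch of that high-level loop, here is what the pixel-wise MSE exercise suggested above might look like, using ordinary SGD rather than the VGG activations and LBFGS the notebook actually uses (the tensors below are stand-ins, not the lesson's variables):

```
import torch
import torch.nn.functional as F

target  = torch.rand(1, 3, 288, 288)                      # stand-in for the bird image tensor
opt_img = torch.rand(1, 3, 288, 288, requires_grad=True)  # the pixels we optimize

optimizer = torch.optim.SGD([opt_img], lr=0.5)

for i in range(501):
    optimizer.zero_grad()
    loss = F.mse_loss(opt_img, target)   # pixel-wise content loss (the exercise version)
    loss.backward()
    optimizer.step()                     # updates the input pixels, not any weights
    if i % 100 == 0:
        print(i, loss.item())
```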
I'm going to come back to the details in a moment, but the key thing is that the loss function we are passing in that randomly generated image — the variable of optimization image. So we pass that to our loss function and it's going to update this using the loss function, and the loss function is the mean squared error loss comparing our current optimization image passed through our VGG to get the intermediate activations and comparing it to our target activations. We run that bunch of times and we'll print it out. And we have our bird but not the representation of it. + +``` + targ_t = m_vgg(VV(img_tfm[ None ])) targ_v = V(targ_t) targ_t.shape +``` + +``` + torch.Size([1, 512, 18, 18]) +``` + +``` + max_iter = 1000 show_iter = 100 optimizer = optim.LBFGS([opt_img_v], lr=0.5) +``` + +#### Broyden–Fletcher–Goldfarb–Shanno (BFGS) [ [1:20:18](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h20m18s) ] + +A couple of new details here. One is a weird optimizer ( `optim.LBFGS` ). Anybody who's done certain parts of math and computer science courses comes into deep learning discovers we use all this stuff like Adam and the SGD and always assume that nobody in the field knows the first thing about computer science and immediately says “any of you guys tried using BFGS?” There's basically a long history of a totally different kind of algorithm for optimization that we don't use to train neural networks. And of course the answer is actually the people who have spent decades studying neural networks do know a thing or two about computer science and it turns out these techniques on the whole don't work very well. But it's actually going to work well for this, and it's a good opportunity to talk about an interesting algorithm for those of you that haven't studied this type of optimization algorithm at school. BFGS (initials of four different people) and the L stands for limited memory. It is an optimizer so as an optimizer, that means that there's some loss function and it's going to use some gradients (not all optimizers use gradients but all the ones we use do) to find a direction to go and try to make the loss function go lower and lower by adjusting some parameters. It's just an optimizer. But it's an interesting kind of optimizer because it does a bit more work than the ones we're used to on each step. Specifically, the way it works is it starts the same way that we are used to which is we just pick somewhere to get started and in this case, we've picked a random image as you saw. As per usual, we calculate the gradient. But we then don't just take a step but we actually do is as well as finding the gradient, we also try to find the second derivative. The second derivative says how fast does the gradient change. + +**Gradient** : how fast the function change + +**The second derivative** : how fast the gradient change + +In other words, how curvy is it? The basic idea is that if you know that it's not very curvy, then you can probably jump farther. But if it's very curvy then you probably don't want to jump as far. So in higher dimensions, the gradient is called the Jacobian and the second derivative is called the Hessian. You'll see those words all the time, but that's all they mean. Again, mathematicians have to invent your words for everything as well. They are just like deep learning researchers — maybe a bit more snooty. 
With BFGS, we are going to try and calculate the second derivative and then we are going to use that to figure out what direction to go and how far to go — so it's less of a wild jump into the unknown. + +Now the problem is that actually calculating the Hessian (the second derivative) is almost certainly not a good idea[ [1:24:15](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h24m15s) ]. Because in each possible direction that you are going to head, for each direction that you're measuring the gradient in, you also have to calculate the Hessian in every direction. It gets ridiculously big. So rather than actually calculating it, we take a few steps and we basically look at how much the gradient is changing as we do each step, and we approximate the Hessian using that little function. Again, this seems like a really obvious thing to do but nobody thought of it until someone did surprisingly a long time later. Keeping track of every single step you take takes a lot of memory, so duh, don't keep track of every step you take — just keep the last ten or twenty. And the second bit there, that's the L to the LBFGS. So a limited-memory BFGS means keep the last 10 or 20 gradients, use that to approximate the amount of curvature, and then use the curvature in gradient to estimate what direction to travel and how far. That's normally not a good idea in deep learning for a number of reasons. It's obviously more work to do than than Adam or SGD update, and it also uses more memory — memory is much more of a big issue when you've got a GPU to store it on and hundreds of millions of weights. But more importantly, the mini-batch is super bumpy so figuring out curvature to decide exactly how far to travel is kind of polishing turds as we say (yeah, Australian and English expression — you get the idea). Interestingly, actually using the second derivative information, it turns out, is like a magnet for saddle points. So there's some interesting theoretical results that basically say it actually sends you towards nasty flat areas of the function if you use second derivative information. So normally not a good idea. + +``` + def actn_loss(x): return F.mse_loss(m_vgg(x), targ_v)*1000 +``` + +``` + def step(loss_fn): global n_iter optimizer.zero_grad() loss = loss_fn(opt_img_v) loss.backward() n_iter+=1 if n_iter%show_iter==0: print(f'Iteration: n_iter, loss: {loss.data[0]} ') return loss +``` + +But in this case [ [1:26:40](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h26m40s) ], we are not optimizing weights, we are optimizing pixels so all the rules change and actually turns out BFGS does make sense. Because it does more work each time, it's a different kind of optimizer, the API is a little bit different in PyTorch. As you can see here, when you say `optimizer.step` , you actually pass in the loss function. So our loss function is to call `step` with a particular loss function which is our activation loss ( `actn_loss` ). And inside the loop, you don't say step, step, step. But rather it looks like this. So it's a little bit different and you're welcome to try and rewrite this to use SGD, it'll still work. It'll just take a bit longer — I haven't tried it with SGD yet and I'd be interested to know how much longer it takes. 
+ +``` + n_iter=0 while n_iter <= max_iter: optimizer.step(partial(step,actn_loss)) +``` + +``` + Iteration: n_iter, loss: 0.8466196656227112 + Iteration: n_iter, loss: 0.34066855907440186 + Iteration: n_iter, loss: 0.21001280844211578 + Iteration: n_iter, loss: 0.15562333166599274 + Iteration: n_iter, loss: 0.12673595547676086 + Iteration: n_iter, loss: 0.10863320529460907 + Iteration: n_iter, loss: 0.0966048613190651 + Iteration: n_iter, loss: 0.08812198787927628 + Iteration: n_iter, loss: 0.08170554041862488 + Iteration: n_iter, loss: 0.07657770067453384 +``` + +So you can see the loss function going down [ [1:27:38](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h27m38s) ]. The mean squared error between the activations at layer 37 of our VGG model for our optimized image vs. the target activations, remember the target activations were the VGG applied to our bird. Make sense? So we've now got a content loss. Now, one thing I'll say about this content loss is we don't know which layer is going to work the best. So it would be nice if we were able to experiment a little bit more. And the way it is here is annoying: + +![](../img/1_KTfbdTPG-pZ95vEOLrJa9Q.png) + +Maybe we even want to use multiple layers. So rather than lopping off all of the layers after the one we want, wouldn't it be nice if we could somehow grab the activations of a few layers as it calculates. Now, we already know one way to do that back when we did SSD, we actually wrote our own network which had a number of outputs. Remember? The different convolutional layers, we spat out a different `oconv` thing? But I don't really want to go and add that to the torch.vision ResNet model especially not if later on, I want to try torch.vision VGG model, and then I want to try NASNet-A model, I don't want to go into all of them and change their outputs. Beside which, I'd like to easily be able to turn certain activations on and off on demand. So we briefly touched before this idea that PyTorch has these fantastic things called hooks. You can have forward hooks that let you plug anything you like into the forward pass of a calculation or a backward hook that lets you plug anything you like into the backward pass. So we are going to create the world's simplest forward hook. + +``` + x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0] plt.figure(figsize=(7,7)) plt.imshow(x); +``` + +![](../img/1_CzZ-KObFhqarMxnV5lD-IQ.png) + +### Forward hook [ [1:29:42](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h29m42s) ] + +This is one of these things that almost nobody knows about so almost any code you find on the internet that implements style transfer will have all kind of horrible hacks rather than using forward hooks. But forward hook is really easy. + +To create a forward hook, you just create a class. The class has to have something called `hook_fn` . And your hook function is going to receive the `module` that you've hooked, the `input` for the forward pass, and the `output` then you do whatever you'd like. So what I'm going to do is I'm just going to store the output of this module in some attribute. 而已。 So `hook_fn` can actually be called anything you like, but “hook function” seems to be the standard because, as you can see, what happens in the constructor is I store inside some attribute the result of `m.register_forward_hook` ( `m` is going to be the layer that I'm going to hook) and pass in the function that you want to be called when the module's forward method is called. 
When its forward method is called, it will call `self.hook_fn` which will store the output in an attribute called `features` . + +``` + class SaveFeatures (): features= None def __init__(self, m): self.hook = m.register_forward_hook(self.hook_fn) def hook_fn(self, module, input, output): self.features = output def close(self): self.hook.remove() +``` + +So now what we can do is we can create a VGG as before. And let's set it to not trainable so we don't waste time and memory calculating gradients for it. And let's go through and find all the max pool layers. So let's go through all of the children of this module and if it's a max pool layer, let's spit out index minus 1 — so that's going to give me the layer before the max pool. In general, the layer before a max pool or stride 2 conv is a very layer. It's the most complete representation we have at that grid cell size because the very next layer is changing the grid. So that seems to me like a good place to grab the content loss from. The best most semantic, most interesting content we have at that grid size. So that's why I'm going to pick those indexes. + +``` + m_vgg = to_gpu(vgg16( True )).eval() set_trainable(m_vgg, False ) +``` + +These are the indexes of the last layer before each max pool in VGG [ [1:32:30](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h32m30s) ]. + +``` + block_ends = [i-1 for i,o in enumerate(children(m_vgg)) if isinstance(o,nn.MaxPool2d)] block_ends +``` + +``` + [5, 12, 22, 32, 42] +``` + +I'm going to grab `32` — no particular reason, just try something else. So I'm going to say `block_ends[3]` (ie 32). `children(m_vgg)[block_ends[3]]` will give me the 32nd layer of VGG as a module. + +``` + sf = SaveFeatures(children(m_vgg)[block_ends[3]]) +``` + +Then if I call the `SaveFeatures` constructor, it's going to go: + +`self.hook = {32nd layer of VGG}.register_forward_hook(self.hook_fn)` + +Now, every time I do a forward pass on this VGG model, it's going to store the 32nd layer's output inside `sf.features` . + +``` + def get_opt(): opt_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32) opt_img = scipy.ndimage.filters.median_filter(opt_img, [8,8,1]) opt_img_v = V(val_tfms(opt_img/2)[ None ], requires_grad= True ) return opt_img_v, optim.LBFGS([opt_img_v]) +``` + +``` + opt_img_v, optimizer = get_opt() +``` + +See here [ [1:33:33](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h33m33s) ], I'm calling my VGG network, but I'm not storing it anywhere. I'm not saying `activations = m_vgg(VV(img_tfm[ **None** ]))` . I'm calling it, throwing away the answer, and then grabbing the features we stored in our `SaveFeatures` object. + +`m_vgg()` — this is how you do a forward path in PyTorch. You don't say `m_vgg.forward()` , you just use it as a callable. Using as a callable on an `nn.module` automatically calls `forward` . That's how PyTorch modules work. + +So we call it as a callable, that ends up calling our forward hook, that forward hook stores the activations in `sf.features` , and so now we have our target variable — just like before but in a much more flexible way. + +`get_opt` contains the same 4 lines of code we had earlier [ [1:34:34](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h34m34s) ]. It is just giving me my random image to optimize and an optimizer to optimize that image. 
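If the hook mechanism still feels abstract, here is a self-contained toy sketch (a made-up two-layer model of my own, not the lesson's VGG) showing `register_forward_hook` doing the same job as `SaveFeatures`:

```
import torch
import torch.nn as nn

# A toy model; we hook the first conv to capture its activations.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
)

saved = {}
def hook_fn(module, inputs, output):
    saved['features'] = output          # same idea as SaveFeatures.features

handle = model[0].register_forward_hook(hook_fn)
_ = model(torch.randn(1, 3, 32, 32))    # the forward pass fires the hook
print(saved['features'].shape)          # torch.Size([1, 8, 32, 32])
handle.remove()                         # the equivalent of SaveFeatures.close()
```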
+ +``` + m_vgg(VV(img_tfm[ None ])) targ_v = V(sf.features.clone()) targ_v.shape +``` + +``` + torch.Size([1, 512, 36, 36]) +``` + +``` + def actn_loss2(x): m_vgg(x) out = V(sf.features) return F.mse_loss(out, targ_v)*1000 +``` + +Now I can go ahead and do exactly the same thing. But now I'm going to use a different loss function `actn_loss2` (activation loss #2) which doesn't say `out=m_vgg` , again, it calls `m_vgg` to do a forward pass, throws away the results, and and grabs `sf.features` . So that's now my 32nd layer activations which I can then do my MSE loss on. You might have noticed, the last loss function and this one are both multiplied by a thousand. Why are they multiplied by a thousand? This was like all the things that were trying to get this lesson to not work correctly. I didn't used to have a thousand and it wasn't training. Lunch time today, nothing was working. After days of trying to get this thing to work, and finally just randomly noticed “gosh, the loss functions — the numbers are really low (like 10E-7)” and I thought what if they weren't so low. So I multiplied them by a thousand and it started working. So why did it not work? Because we are doing single precision floating point, and single precision floating point isn't that precise. Particularly once you're getting gradients that are kind of small and then you are multiplying by the learning rate that can be small, and you end up with a small number. If it's so small, they could get rounded to zero and that's what was happening and my model wasn't ready. I'm sure there are better ways than multiplying by a thousand, but whatever. It works fine. It doesn't matter what you multiply a loss function by because all you care about is its direction and the relative size. Interestingly, this is something similar we do for when we were training ImageNet. We were using half precision floating point because Volta tensor cores require that. And it's actually a standard practice if you want to get the half precision floating to train, you actually have to multiply the loss function by a scaling factor. We were using 1024 or 512\. I think fast.ai is now the first library that has all of the tricks necessary to train in half precision floating point built-in, so if you are lucky enough to have a Volta or you can pay for a AWS P3, if you've got a learner object, you can just say `learn.half` , it'll now just magically train correctly half precision floating point. It's built into the model data object as well, and it's all automatic. Pretty sure no other library does that. + +``` + n_iter=0 while n_iter <= max_iter: optimizer.step(partial(step,actn_loss2)) +``` + +``` + Iteration: n_iter, loss: 0.2112911492586136 + Iteration: n_iter, loss: 0.0902421623468399 + Iteration: n_iter, loss: 0.05904778465628624 + Iteration: n_iter, loss: 0.04517251253128052 + Iteration: n_iter, loss: 0.03721420466899872 + Iteration: n_iter, loss: 0.03215853497385979 + Iteration: n_iter, loss: 0.028526008129119873 + Iteration: n_iter, loss: 0.025799645110964775 + Iteration: n_iter, loss: 0.02361033484339714 + Iteration: n_iter, loss: 0.021835438907146454 +``` + +This is just doing the same thing on a slightly earlier layer [ [1:37:35](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h37m35s) ]. And the bird looks more bird-like. Hopefully that makes sense to you that earlier layers are getting closer to the pixels. There are more grid cells, each cell is smaller, smaller receptive field, less complex semantic features. 
So the earlier we get, the more it's going to look like a bird. + +``` + x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0] plt.figure(figsize=(7,7)) plt.imshow(x); +``` + +![](../img/1_i2SK83mI6XYD9al6OV4fhw.png) + +``` + sf.close() +``` + +In fact, the paper has a nice picture of that showing various different layers and zooming into this house [ [1:38:17](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h38m17s) ]. They are trying to make this house look like The Starry Night picture. And you can see that later on, it's pretty messy, and earlier on, it looks like the house. So this is just doing what we just did. One of the things I've noticed in our study group is anytime I say to somebody to answer a question, anytime I say read the paper there is a thing in the paper that tells you the answer to that question, there's always this shocked look “read the paper? me?” but seriously the papers have done these experiments and drawn the pictures. There's all this stuff in the papers. It doesn't mean you have to read every part of the paper. But at least look at the pictures. So check out Gatys' paper, it's got nice pictures. So they've done the experiment for us but looks like they didn't go as deep — they just got some earlier ones. + +![](../img/1_cqZ5Az70HWX2dUhPnUDAZg.png) + +#### Style match [ [1:39:29](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h39m29s) ] + +The next thing we need to do is to create style loss. We've already got the loss which is how much like the bird is it. Now we need how like this painting style is it. And we are going to do nearly the same thing. We are going to grab the activations of some layer. Now the problem is, the activations of some layer, let's say it was a 5x5 layer (of course there are no 5x5 layers, it's 224x224, but we'll pretend). So here're some activations and we could get these activations both per the image we are optimizing and for our Van Gogh painting. Let's look at our Van Gogh painting. There it is — The Starry Night + +``` + style_fn = PATH/'style'/'starry_night.jpg' +``` + +``` + style_img = open_image(style_fn) style_img.shape, img.shape +``` + +``` + ((1198, 1513, 3), (291, 483, 3)) +``` + +``` + plt.imshow(style_img); +``` + +![](../img/1_3QN8_RpikQBlk8wwjD9B3w.png) + +I downloaded this from Wikipedia and I was wondering what is taking son long to load [ [1:40:39](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h40m39s) ] — turns out, the Wikipedia version I downloaded was 30,000 by 30,000 pixels. It's pretty cool that they've got this serious gallery quality archive stuff there. I didn't know it existed. Don't try to run a neural net on that. Totally killed my Jupyter notebook. + +So we can do that for our Van Gogh image and we can do that for our optimized image. Then we can compare the two and we would end up creating an image that has content like the painting but it's not the painting — that's not what we want. We want something with the same style but it's not the painting and doesn't have the content. So we want to throw away all of the spatial information. We are not trying to create something that has a moon here, stars here, and a church here. We don't want any of that. So how do we throw away all the special information? + +![](../img/1_YVBXuBYYyoalPWcW2avrsg.png) + +In this case, there are 19 faces on this — 19 slices. So let's grab this top slice that's going to be a 5x5 matrix. Now, let's flatten it and we've got a 25 long vector. In one stroke, we've thrown away the bulk of the spacial information by flattening it. 
Now let's grab a second slice (ie another channel) and do the same thing. So we have channel 1 flattened and channel 2 flattened, and they both have 25 elements. Now, let's take the dot product which we can do with `@` in Numpy (Note: [here is Jeremy's answer to my dot product vs. matrix multiplication question](http://forums.fast.ai/t/part-2-lesson-13-wiki/15297/140%3Fu%3Dhiromi) ). So the dot product is going to give us one number. What's that number? What is it telling us? Assuming the activations are somewhere around the middle layer of the VGG network, we might expect some of these activations to be how textured is the brush stroke, and some of them to be like how bright is this area, and some of them to be like is this part of a house or a part of a circular thing, or other parts to be, how dark is this part of the painting. So a dot product is basically a correlation. If this element and and this element are both highly positive or both highly negative, it gives us a big result. Where else, if they are the opposite, it gives a small results. If they are both close to zero, it gives no result. So basically a dot product is a measure of how similar these two things are. So if the activations of channel 1 and channel 2 are similar, then it basically says — Let's give an example [ [1:44:28](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h44m28s) ]. Let's say the first one was how textured are the brushstrokes (C1) and that one there says how diagonally oriented are the brush strokes (C2). + +![](../img/1_ho9iuqmJh3hVXPNeZ9E_Xg.png) + +If C1 and C2 are both high for a cell (1, 1) at the same time, and same is true for a cell (4, 2), then it's saying grid cells that would have texture tend to also have diagonal. So dot product would be high when grid cells that have texture also have diagonal, and when they don't, they don't (have high dot product). So that's `C1 @ C2` . Where else, `C1 @ C1` is the 2-norm effectively (ie the sum of the squares of C1). This is basically saying how many grid cells in the textured channel is active and how active it is. So in other words, `C1 @ C1` tells us how much textured painting is going on. And `C2 @ C2` tells us how much diagonal paint stroke is going on. Maybe C3 is “is it bright colors?” so `C3 @ C3` would be how often do we have bright colored cells. + +So what we could do then is we could create a 19 by 19 matrix containing every dot product [ [1:47:17](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h47m17s) ]. And like we discussed, mathematicians have to give everything a name, so this particular matrix where you flatten something out and then do all the dot product is called Gram matrix. + +![](../img/1_hboObzQV-8h0yiVvqZNvZg.png) + +I'll tell you a secret [ [1:48:29](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h48m29s) ]. Most deep learning practitioners either don't know or don't remember all these things like what is a Gram matrix if they ever did study at university. They probably forgot it because they had a big night afterwards. And the way it works in practice is you realize “oh, I could create a kind of non-spacial representation of how the channels correlate with each other” and then when I write up the paper, I have to go and ask around and say “does this thing have a name?” and somebody will be like “isn't that the Gram matrix?” and you go and look it up and it is. So don't think you have to go study all of math first. Use your intuition and common sense and then you worry about what the math is called later, normally. 
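Here is a tiny sketch of that flatten-and-dot-product idea with toy numbers of my own, anticipating the `gram` function that appears a little further down:

```
import torch

acts = torch.randn(19, 5, 5)   # pretend activations: 19 channels on a 5x5 grid
flat = acts.view(19, -1)       # flatten each channel into a 25-long vector
gram = flat @ flat.t()         # every pairwise dot product: a 19x19 Gram matrix
print(gram.shape)              # torch.Size([19, 19])

# gram[i, i]: how strongly channel i fires overall
# gram[i, j]: how often channels i and j fire together (the style correlations)
```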
Sometimes it works the other way, not with me because I can't do math. + +So this is called the Gram matrix [ [1:49:22](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h49m22s) ]. And of course, if you are a real mathematician, it's very important that you say this as if you always knew it was a Gram matrix and you kind of just go oh yes, we just calculate the Gram matrix. So the Gram matrix then is this kind of map — the diagonal is perhaps the most interesting. The diagonal is which channels are the most active and then the off diagonal is which channels tend to appear together. And overall, if two pictures have the same style, then we are expecting that some layer of activations, they will have similar Gram matrices. Because if we found the level of activations that capture a lot of stuff about like paint strokes and colors, then the diagonal alone (in Gram matrices) might even be enough. That's another interesting homework assignment, if somebody wants to take it, is try doing Gatys' style transfer not using the Gram matrix but just using the diagonal of the Gram matrix. That would be like a single line of code to change. But I haven't seen it tried and I don't know if it would work at all, but it might work fine. + +“Okay, yes Christine, you've tried it” [ [1:50:51](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h50m51s) ]. “I have tried that and it works most of the time except when you have funny pictures where you need two styles to appear in the same spot. So it seems like grass in one half and a crowd in one half, and you need the two styles.” (Christine). Cool, you're still gonna do your homework, but Christine says she'll do it for you. + +``` + def scale_match(src, targ): h,w,_ = img.shape sh,sw,_ = style_img.shape rat = max(h/sh,w/sw); rat res = cv2.resize(style_img, (int(sw*rat), int(sh*rat))) return res[:h,:w] +``` + +``` + style = scale_match(img, style_img) +``` + +``` + plt.imshow(style) style.shape, img.shape +``` + +``` + ((291, 483, 3), (291, 483, 3)) +``` + +![](../img/1_3QDp1KCdg6RkKL8yhkRbDw.png) + +So here is our painting [ [1:51:22](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h51m22s) ]. I've tried to resize the painting so it's the same size as my bird picture. So that's all this is just doing. It doesn't matter too much which bit I use as long as it's got lots of the nice style in it. + +I grab my optimizer and my random image just like before: + +``` + opt_img_v, optimizer = get_opt() +``` + +And this time, I call `SaveFeatures` for all of my `block_ends` and that's going to give me an array of SaveFeatures objects — one for each module that appears the layer before the max pool. Because this time, I want to play around with different activation layer styles, or more specifically I want to let you play around with it. So now I've got a whole array of them. + +``` + sfs = [SaveFeatures(children(m_vgg)[idx]) for idx in block_ends] +``` + +`style_img` is my Van Gogh painting. So I take my `style_img` , put it through my transformations to create my transform style image ( `style_tfm` ). + +``` + style_tfm = val_tfms(style_img) +``` + +Turn that into a variable, put it through the forward pass of my VGG module, and now I can go through all of my SaveFeatures objects and grab each set of features. Notice I call `clone` because later on, if I call my VGG object again, it's going to replace those contents. I haven't quite thought about whether this is necessary. If you take it away and it's not, that's fine. But I was just being careful. 
So here is now an array of the activations at every `block_end` layer. And here, you can see all of those shapes: + +``` + m_vgg(VV(style_tfm[ None ])) targ_styles = [V(o.features.clone()) for o in sfs] [o.shape for o in targ_styles] +``` + +``` + [torch.Size([1, 64, 288, 288]), + torch.Size([1, 128, 144, 144]), + torch.Size([1, 256, 72, 72]), + torch.Size([1, 512, 36, 36]), + torch.Size([1, 512, 18, 18])] +``` + +And you can see, being able to whip up a list comprehension really quickly is really important in your Jupyter fiddling around [ [1:53:30](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h53m30s) ]. Because you really want to be able to immediately see here's my channel count (64, 128, 256, …), and the grid size halving as we would expect (288, 144, 72…) because all of these appear just before a max pool. + +So to do a Gram MSE loss, it's going to be the MSE loss on the Gram matrix of the input vs. the Gram matrix of the target. And the Gram matrix is just the matrix multiply of `x` with `x` transpose ( `x.t()` ) where x is simply equal to my input where I've flattened the batch and channel axes all down together. I've only got one image, so you can ignore the batch part — it's basically channel. Then everything else ( `-1` ), which in this case is the height and width, is the other dimension because there's now going to be channel by height and width, and then as we discussed we can then just do the matrix multiply of that by its transpose. And just to normalize it, we'll divide that by the number of elements ( `b*c*h*w` ) — it would actually be more elegant if I had said `input.numel()` (number of elements), that would be the same thing. Again, this gave me tiny numbers so I multiply it by a big number to make it something more sensible. So that's basically my loss. + +``` + def gram(input): b,c,h,w = input.size() x = input.view(b*c, -1) return torch.mm(x, x.t())/input.numel()*1e6 def gram_mse_loss(input, target): return F.mse_loss(gram(input), gram(target)) +``` + +So now my style loss is to take my image to optimize, throw it through the VGG forward pass, grab an array of the features in all of the SaveFeatures objects, and then call my Gram MSE loss on every one of those layers [ [1:55:13](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h55m13s) ]. And that's going to give me an array and then I just add them up. Now you could add them up with different weightings, you could add up subsets, or whatever. In this case, I'm just grabbing all of them. + +``` + def style_loss(x): m_vgg(opt_img_v) outs = [V(o.features) for o in sfs] losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)] return sum(losses) +``` + +Pass that into my optimizer as before: + +``` + n_iter=0 while n_iter <= max_iter: optimizer.step(partial(step,style_loss)) +``` + +``` + Iteration: n_iter, loss: 230718.453125 + Iteration: n_iter, loss: 219493.21875 + Iteration: n_iter, loss: 202618.109375 + Iteration: n_iter, loss: 481.5616760253906 + Iteration: n_iter, loss: 147.41177368164062 + Iteration: n_iter, loss: 80.62625122070312 + Iteration: n_iter, loss: 49.52326965332031 + Iteration: n_iter, loss: 32.36254119873047 + Iteration: n_iter, loss: 21.831811904907227 + Iteration: n_iter, loss: 15.61091423034668 +``` + +And here we have a random image in the style of Van Gogh, which I think is kind of cool. + +``` + x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0] plt.figure(figsize=(7,7)) plt.imshow(x); +``` + +![](../img/1_Z2UuUEecjCVOR07scQDw1g.png) + +Again, Gatys has done it for us.
Here are different layers of the random image in the style of Van Gogh. So in the first one, as you can see, the activations are simple geometric things — not very interesting at all. The later layers are much more interesting. So we kind of have a suspicion that we probably want to use the later layers largely for our style loss if we want it to look good. + +![](../img/1_p5JFBuMVDA5kw6CYh_fCfQ.png) + +![](../img/1_BBOkPG0_GV-KNdPhlmLUgA.png) + +I added this `SaveFeatures.close` [ [1:56:35](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h56m35s) ] which just calls `self.hook.remove()` . Remember, I stored the hook as `self.hook` so `hook.remove()` gets rid of it. It's a good idea to get rid of it because otherwise you can potentially just keep using up memory. So at the end, I just go through each of my SaveFeatures objects and close them: + +``` + for sf in sfs: sf.close() +``` + +#### Style transfer [ [1:57:08](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h57m8s) ] + +Style transfer is adding the content loss and the style loss together with some weight. So there is not much to show. + +Grab my optimizer, grab my image: + +``` + opt_img_v, optimizer = get_opt() +``` + +And my combined loss is the MSE loss at one particular layer, plus my style loss at all of my layers: sum up the style losses, add them to the content loss, and scale the content loss. Actually the style loss I scaled already by 1e6, so they are both scaled exactly the same. Add them together. Again, you could try weighting the different style losses or you could maybe remove some of them, so this is the simplest possible version. + +``` + def comb_loss(x): m_vgg(opt_img_v) outs = [V(o.features) for o in sfs] losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)] cnt_loss = F.mse_loss(outs[3], targ_vs[3])*1000000 style_loss = sum(losses) return cnt_loss + style_loss +``` + +Train that: + +``` + n_iter=0 while n_iter <= max_iter: optimizer.step(partial(step,comb_loss)) +``` + +``` + Iteration: n_iter, loss: 1802.36767578125 + Iteration: n_iter, loss: 1163.05908203125 + Iteration: n_iter, loss: 961.6024169921875 + Iteration: n_iter, loss: 853.079833984375 + Iteration: n_iter, loss: 784.970458984375 + Iteration: n_iter, loss: 739.18994140625 + Iteration: n_iter, loss: 706.310791015625 + Iteration: n_iter, loss: 681.6689453125 + Iteration: n_iter, loss: 662.4088134765625 + Iteration: n_iter, loss: 646.329833984375 +``` + +``` + x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0] plt.figure(figsize=(9,9)) plt.imshow(x, interpolation='lanczos') plt.axis('off'); +``` + +![](../img/1_ElVcIGvL7cWUMhoftkw9-g.png) + +``` + for sf in sfs: sf.close() +``` + +And holy crap, it actually looks good. So I think that's pretty awesome. The main takeaway here is: if you want to solve something with a neural network, all you've got to do is set up a loss function and then optimize something. And the loss function is something where a lower number is something that you're happier with. Because then when you optimize it, it's going to make that number as low as it can, and it'll do what you wanted it to do. So here, Gatys came up with a loss function that does a good job of being a smaller number when it looks like the thing we want it to look like, and it looks like the style of the thing we want it to be in the style of. That's all we had to do.
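That takeaway is worth writing down on its own: the "model" being trained here is just the input image. A minimal sketch of the pattern in plain modern PyTorch (not the notebook's fastai/Variable code; `my_loss` is a stand-in for whatever loss you define, such as `comb_loss` above):

```python
import torch

def my_loss(img):
    # placeholder: anything differentiable that gets smaller as img looks more like what you want
    return (img ** 2).mean()

img = torch.randn(1, 3, 288, 288, requires_grad=True)  # the thing we optimize is the image itself
opt = torch.optim.LBFGS([img], lr=0.5)

def closure():
    opt.zero_grad()
    loss = my_loss(img)
    loss.backward()
    return loss

for _ in range(10):
    opt.step(closure)  # each step nudges the pixels to lower the loss
```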
+ +What it actually comes to it [ [1:59:10](https://youtu.be/xXXiC4YRGrQ%3Ft%3D1h59m10s) ], apart from implementing Gram MSE loss which was like 6 lines of code if that, that's our loss function: + +![](../img/1_si7enkSgWg1k6EKmx-2ijQ.png) + +Pass it to our optimizer, and wait about 5 seconds, and we are done. And remember, we could do a batch of these at a time, so we could wait 5 seconds and 64 of these will be done. So I think that's really interesting and since this paper came out, it has really inspired a lot of interesting work. To me though, most of the interesting work hasn't happened yet because to me, the interesting work is the work where you combine human creativity with these kinds of tools. I haven't seen much in the way of tools that you can download or use where the artist is in control and can kind of do things interactively. It's interesting talking to the guys at [Google Magenta](https://magenta.tensorflow.org/) project which is their creative AI project, all of the stuff they are doing with music is specifically about this. It's building tools that musicians can use to perform in real time. And you'll see much more of that on the music space thanks to Magenta. If you go to their website, there's all kinds of things where you can press the buttons to actually change the drum beats, melodies, keys, etc. You can definitely see Adobe or Nvidia is starting to release little prototypes and starting to do this but this kind of creative AI explosion hasn't happened yet. I think we have pretty much all the technology we need but no one's put it together into a thing and said “look at the thing I built and look at the stuff that people built with my thing.” So that's just a huge area of opportunity. + +So the paper that I mentioned at the start of class in passing [ [2:01:16](https://youtu.be/xXXiC4YRGrQ%3Ft%3D2h1m16s) ] — the one where we can add Captain America's shield to arbitrary paintings basically used this technique. The trick was though some minor tweaks to make the pasted Captain America shield blend in nicely. But that paper is only a couple of days old, so that would be a really interesting project to try because you can use all this code. It really does leverage this approach. Then you could start by making the content image be like the painting with the shield and then the style image could be the painting without the shield. That would be a good start, and then you could see what specific problems they try to solve in this paper to make it better. But you could have a start on it right now. + +**Question** : Two questions — earlier there were a number of people that expressed interest in your thoughts on Pyro and probabilistic programming [ [2:02:34](https://youtu.be/xXXiC4YRGrQ%3Ft%3D2h2m34s) ]. So TensorFlow has now got this TensorFlow probability or something. There's a bunch of probabilistic programming framework out there. I think they are intriguing, but as yet unproven in the sense that I haven't seen anything done with any probabilistic programming system which hasn't been done better without them. The basic premise is that it allows you to create more of a model of how you think the world works and then plug in the parameters. So back when I used to work in management consulting 20 years ago, we used to do a lot of stuff where we would use a spreadsheet and then we would have these Monte Carlo simulation plugins — there was one called At Risk(?) and one called Crystal Ball. I don't know if they still exist decades later. 
Basically they would let you change a spreadsheet cell to say this is not a specific value but it actually represents a distribution of values with this mean and this standard deviation, or it's got this distribution, and then you would hit a button and the spreadsheet would recalculate a thousand times, pulling random numbers from these distributions and showing you the distribution of your outcome, which might be profit or market share or whatever. We used them all the time back then. I somewhat feel that a spreadsheet is a more obvious place to do that kind of work because you can see it all much more naturally, but I don't know. We'll see. At this stage, I hope it turns out to be useful because I find it very appealing and, as I say, it is close to the kind of work I used to do a lot of. There are actually whole practices around this stuff that they used to call system dynamics, which really was built on top of this kind of thing, but it never quite went anywhere. + +**Question** : Then there was a question about pre-training for generic style transfer [ [2:04:57](https://youtu.be/xXXiC4YRGrQ%3Ft%3D2h4m57s) ]. I don't think you can pre-train for a generic style, but you can pre-train for a generic photo for a particular style which is where we are going to get to. Although, it may end up being a homework. I haven't decided yet. But I'm going to do all the pieces. + +**Question** : Please ask him to talk about multi-GPU [ [2:05:31](https://youtu.be/xXXiC4YRGrQ%3Ft%3D2h5m31s) ]. Oh yeah, I haven't had a slide about that. We're about to hit it. + +Before we do, just another interesting picture from Gatys' paper. They've got a few more that just didn't fit on my slide, showing different convolutional layers for the style and different style to content ratios, and here are the different images. Obviously this isn't Van Gogh any more, this is a different combination. So you can see, if you just do all style, you don't see any image. If you do lots of content, but you use a low enough convolutional layer, it looks okay but the background is kind of dumb. So you kind of want somewhere in the middle. So you can play around with it and experiment, but also use the paper to help guide you. + +![](../img/1_x_UN319I-Ppe3xHnvzqgag.png) + +#### The Math [ [2:06:33](https://youtu.be/xXXiC4YRGrQ%3Ft%3D2h6m33s) ] + +Actually, I think I might work on the math now and we'll talk about multi GPU and super resolution next week, because this is from the paper and one of the things I really do want you to do after we talk about a paper is to read the paper and then ask questions on the forum about anything that's not clear. But there's a key part of this paper which I wanted to talk about and discuss how to interpret it. So the paper says, we're going to be given an input image _x_ and this little thing normally means it's a vector, Rachel, but this one is a matrix. I guess it could mean either. I don't know. Normally a small bold letter means vector, or a small letter with an arrow on top means vector. And normally a big letter means matrix, or a small letter with two arrows on top means matrix. In this case, our image is a matrix. We are going to basically treat it as a vector, so maybe we're just getting ahead of ourselves. + +![](../img/1_kU-HMZL4kI2So7WV5xow6g.png) + +So we've got an input image _x_ and it can be encoded in a particular layer of the CNN by the filter responses (ie activations). Filter responses are activations. Hopefully, that's something you all understand. That's basically what a CNN does: it produces layers of activations.
A layer has a bunch of filters which produce a number of channels. This here says that layer number _l_ has capital N _l_ filters. Again, this capital does not mean matrix. So I don't know, math notation is so inconsistent. So it has capital N _l_ distinct filters at layer _l_ , which means it also has that many feature maps. So make sure you can see this letter N _l_ is the same as this letter. So you've got to be very careful to read the letters and recognize it's like snap, that's the same letter as that. So obviously, N _l_ filters create N _l_ feature maps or channels, each one of size M _l_ (okay, I can see this is where the unrolling is happening). So this is like M[ _l_ ] in numpy notation. It's the _l_ th layer. So M for the _l_ th layer. The size is height times width — so we flattened it out. So the responses in a layer _l_ can be stored in a matrix F (and now the _l_ goes at the top for some reason). So this is not F to the power of _l_ , it's just another way of indexing. We are just moving it around for fun. This thing here where we say it's an element of R — this is a special R meaning the real numbers N times M (this is saying that the dimensions of this are N by M). So this is really important: you don't move on. It's just like with PyTorch, making sure that you understand the rank and size of your dimensions first, same with math. These are the bits where you stop and think why is it N by M? N is the number of filters, M is height by width. So do you remember that thing when we did `.view(b*c, -1)` ? Here it is. So try to map the code to the math. So F is `x` : + +![](../img/1_uZYTy9gDHiXBhjRhbtbabg.png) + +If I was nicer to you, I would have used the same letters as the paper. But I was too busy getting this darn thing working to do that carefully. So you can go back and rename it as capital F. + +So the reason we moved the _l_ to the top is because we're now going to have some more indexing. Whereas in Numpy or PyTorch, we index things with square brackets and then lots of things with commas between, the approach in math is to surround your letter by little letters all around it — just throw them up there everywhere. So here, F _l_ is the _l_ th layer of F and then _ij_ is the activation of the _i_ th filter at position _j_ of layer _l_ . So position _j_ goes up to size M, which is height by width. This is the kind of thing that would be easy to get confused about. Often you'd see an _ij_ and assume that's indexing into a position of an image like height by width, but it's totally not, is it? It's indexing into channel by flattened image. It even tells you — it's the _i_ th filter/channel in the _j_ th position in the flattened out image in layer _l_ . So you're not gonna be able to get any further in the paper unless you understand what F is. That's why these are the bits where you stop and make sure you're comfortable. + +So now, the content loss, I'm not going to spend much time on, but basically we are just going to take the difference between the values of the activations and the predictions, squared [ [2:12:03](https://youtu.be/xXXiC4YRGrQ%3Ft%3D2h12m3s) ]. So there's our content loss. The style loss will be much the same thing, but using the Gram matrix G: + +![](../img/1_v6S37SK4jm1o-aJXUFysAw.png) + +I really wanted to show you this one. I think it's super. Sometimes I really like things you can do in math notation, and they're things that you can also generally do in J and APL, which is this kind of implicit loop going on here.
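For reference, the two formulas being discussed here (the content loss and the Gram matrix from Gatys et al.) can be transcribed as:

```latex
% content loss at layer l: activations F of the generated image vs. activations P of the content image
\mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}

% Gram matrix at layer l: dot product between flattened channels i and j
G^{l}_{ij} = \sum_{k} F^{l}_{ik} \, F^{l}_{jk}
```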
What this is saying is there's a whole bunch of values of _i_ and a whole bunch of values of _j_ , and I'm going to define G for all of them. And there's a whole bunch of values of _l_ as well, and I'm going to define G for all of those as well. So for all of my G at every _l_ , every _i_ , and every _j_ , it's going to be equal to something. And you can see that something has an _i_ and a _j_ and an _l_ , matching G, and it also has a _k_ and that's part of the sum. So what's going on here? Well, it's saying that my Gram matrix in layer _l_ for the _i_ th position in one axis and the _j_ th position in another axis is equal to my F matrix (so my flattened out matrix) for the _i_ th channel in that layer vs. the _j_ th channel in the same layer, and then I'm going to sum over _k_ . We are going to take the _k_ th position of each, multiply them together and then add them all up. So that's exactly what we just did before when we calculated our Gram matrix. So there's a lot going on because of some, to me, very neat notation — which is there are three implicit loops all going on at the same time, plus one explicit loop in the sum, and then they all work together to create this Gram matrix for every layer. So go back and see if you can match this up with the code. All of that is happening at once, which is pretty great. + +That's it. So next week, we're going to be looking at a very similar approach, basically doing style transfer all over again but in a way where we are actually going to train a neural network to do it for us rather than having to do the optimization. We'll also see that you can do the same thing to do super resolution. And we are also going to go back and revisit some of the SSD stuff as well as doing some segmentation. So if you've forgotten SSD, it might be worth doing a little bit of revision this week. Alright, thanks everybody. See you next week.
diff --git a/zh/dl14.md b/zh/dl14.md new file mode 100644 index 0000000000000000000000000000000000000000..1b70e9dead8ddded44aa736e1d7dcc64c34accd6 --- /dev/null +++ b/zh/dl14.md @@ -0,0 +1,2310 @@ +# 深度学习2:第2部分第14课 + +[论坛](http://forums.fast.ai/t/part-2-lesson-14-wiki/15650/1) / [视频](https://youtu.be/nG3tT31nPmQ) + +![](../img/1_X98pzSCWnxb5gbQxDyZ92Q.png) + +#### 从上周开始讲述 + +![](../img/1_iZP-sgkKoKU2dlGj5CUBaw.jpeg) + +Alena Harley做了一些非常有趣的事情,她试图找出如果你只用三四百张图像循环GAN会发生什么,我真的很喜欢这些项目,人们只是使用API​​或其中一个库来进行Google Image Search那里。 我们的一些学生已经创建了一些非常好的库,用于与Google图像API交互,下载他们感兴趣的一些东西,在这种情况下是一些照片和一些彩色玻璃窗。 有300到400张照片,她训练了一些不同的模特 - 这是我特别喜欢的。 正如您所看到的,使用相当少的图像,她可以获得非常漂亮的彩色玻璃效果。 所以我认为这是一个有趣的例子,使用了很少的数据,这些数据很容易获得,她能够很快下载。 如果您有兴趣,可以在论坛上获得更多相关信息。 +有趣的是,想知道人们会用这种生成模型提出什么样的东西。 这显然是一种伟大的艺术媒介。 对于伪造品和面包店来说,这显然是一个很好的媒介。 我想知道人们会意识到他们可以用这些生成模型做些什么。 我认为音频将成为下一个重要领域。 也非常互动的类型的东西。 Nvidia刚刚发布了一篇论文,展示了一种互动式的照片修复工具,你只需刷过一个物体,就可以用非常好的深度学习替换它取而代之。 我觉得那些互动工具也很有意思。 + +### 超级分辨率[ [2:06](https://youtu.be/nG3tT31nPmQ%3Ft%3D2m6s) ] + +[实时样式转换和超分辨率的感知损失](https://arxiv.org/abs/1603.08155) + +上次,我们通过实际直接优化像素来研究样式转移。 与第二部分中的大部分内容一样,并不是因为我希望您理解样式转换本身,而是直接优化输入并将激活作为损失函数的一部分的想法真的是关键外卖。 + +因此,有效地看待后续论文是有趣的,不是来自同一个人,而是来自Justin Johnson和斯坦福大学的人们在这些视觉生成模型的序列中接下来的论文。 它实际上做了同样的事情 - 风格转移,但它以不同的方式。 我们不会优化像素,而是回到更熟悉的东西并优化一些权重。 具体来说,我们将训练一个模型,该模型学会拍摄照片并将其翻译成特定艺术作品风格的照片。 因此每个转换网将学习生成一种风格。 + +现在事实证明,到了那一点,有一个中间点(我实际上认为更有用并将我们带到我们中途)是一种称为超分辨率的东西。 所以我们实际上将从超级分辨率开始[ [3:55](https://youtu.be/nG3tT31nPmQ%3Ft%3D3m55s) ]。 因为那时我们将建立在超级分辨率之上,以完成基于转换网络的样式转移。 + +超分辨率是我们拍摄低分辨率图像(我们将采用72乘72)并将其放大到更大的图像(在我们的例子中为288乘288),试图创建一个看起来尽可能真实的更高分辨率图像。 这是一件具有挑战性的事情,因为在72乘72时,关于很多细节的信息并不多。 很酷的是,我们将采用与视觉模型相关的方式进行操作,视觉模型与输入大小无关,因此您可以完全采用此模型并将其应用于288 x 288图像并获取某些内容这是每侧大四倍,所以比原来大16倍。 通常它甚至在这个级别上工作得更好,因为你真的在更精细的细节中引入了很多细节,你可以真正打印出一些高分辨率的打印件,这些打印件之前很像是像素化的。 + +#### [笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/enhance.ipynb) [ [5:06](https://youtu.be/nG3tT31nPmQ%3Ft%3D5m6s) ] + +这很像CSI样式增强,我们将采取看似信息不存在的东西,我们发明它 - 但是转换网将学会以一致的方式发明它有了那里的信息,所以希望它发明了正确的信息。 关于这类问题的一个非常好的事情是我们可以在没有任何标签要求的情况下创建我们自己的数据集,因为我们可以通过对图像进行下采样轻松地从高分辨率图像创建低分辨率图像。 所以我希望你们这些人本周尝试的其他方法是做其他类型的图像到图像的翻译,你可以发明“标签”(你的因变量)。 例如: + +* **歪斜** :要么将旋转了90度或更好的东西识别为旋转了5度并将它们拉直。 +* **着色** :将一堆图像变成黑白,并学会再次将颜色重新调整。 +* **降噪** :也许做一个非常低质量的JPEG保存,并学会把它恢复到应该如何。 +* 也许采用16色调的东西,并将其放回更高的调色板。 + +我认为这些东西都很有趣,因为它们可以用来拍摄你可能在高分辨率之前用旧的数码相机拍摄的照片,或者你可能已经扫描了一些现在已经褪色的旧照片等等。我认为这是能够做到的确非常有用,这是一个很好的项目,因为它与我们在这里所做的非常相似,但是相当不同,你在路上遇到了一些有趣的挑战,我敢肯定。 + +我将再次使用ImageNet [ [7:19](https://youtu.be/nG3tT31nPmQ%3Ft%3D7m19s) ]。 你根本不需要使用所有的ImageNet,我恰好让它躺在那里。 您可以从files.fast.ai下载ImageNet的1%样本。 你可以使用你老实说谎的任何一组照片。 + +``` + matplotlib inline %reload_ext autoreload %autoreload 2 +``` + +### 超分辨率数据 + +``` + **from** **fastai.conv_learner** **import** * **from** **pathlib** **import** Path torch.backends.cudnn.benchmark= **True** +``` + +``` + PATH = Path('data/imagenet') PATH_TRN = PATH/'train' +``` + +在这种情况下,正如我所说,我们本身并没有真正的标签,所以我只是给所有标签零,这样我们就可以更容易地将它与我们现有的基础设施一起使用。 + +``` + fnames_full,label_arr_full,all_labels = folder_source(PATH, 'train') fnames_full = ['/'.join(Path(fn).parts[-2:]) **for** fn **in** fnames_full] list(zip(fnames_full[:5],label_arr_full[:5])) +``` + +``` + _[('n01440764/n01440764_9627.JPEG', 0),_ _('n01440764/n01440764_9609.JPEG', 0),_ _('n01440764/n01440764_5176.JPEG', 0),_ _('n01440764/n01440764_6936.JPEG', 0),_ _('n01440764/n01440764_4005.JPEG', 0)]_ +``` + +``` + all_labels[:5] +``` + +``` + _['n01440764', 'n01443537', 'n01484850', 'n01491361', 'n01494475']_ +``` + +现在,因为我指的是一个包含所有ImageNet的文件夹,我当然不想等待所有的ImageNet完成运行一个纪元。 
所以在这里,我只是,大多数时候,我会将“保持百分比”( `keep_pct` )设置为1或2%。 然后我只生成一堆随机数,然后我保留那些小于0.02的数据,这样我就可以快速对我的行进行二次采样。 + +``` + np.random.seed(42) # keep_pct = 1. _keep_pct = 0.02_ keeps = np.random.rand(len(fnames_full)) < keep_pct fnames = np.array(fnames_full, copy= **False** )[keeps] label_arr = np.array(label_arr_full, copy= **False** )[keeps] +``` + +所以我们将使用VGG16 [ [8:21](https://youtu.be/nG3tT31nPmQ%3Ft%3D8m21s) ]而VGG16是我们在这个类中没有真正看过的东西,但它是一个非常简单的模型,我们采用正常的3通道输入,我们基本上通过一个3x3卷积的数量,然后不时,我们通过2x2 maxpool,然后我们再做3x3卷积,maxpool,等等。 这是我们的支柱。 + +![](../img/1_kj2sH_5R5tNvT7ajbHqXKw.png) + +然后我们不做自适应平均池层。 在其中一些之后,我们像往常一样(或类似的东西)最终得到这个7x7x512网格。 因此,我们做了一些不同的事情,而不是平均汇集,我们将整个事情弄平 - 所以如果内存正确运行,它会向外喷出一个非常长的7x7x512大小的激活向量。 然后将其送入两个完全连接的层,每个层具有4096个激活,以及一个完全连接的层,其具有许多类。 所以如果你考虑一下,这里的权重矩阵,它是巨大的7x7x512x4096。 正是由于这个权重矩阵,VGG很快就失宠了 - 因为它占用了大量的内存并且需要大量的计算而且速度非常慢。 这里有很多冗余的东西,因为实际上那些512次激活并不特定于它们所处的7x7网格单元中的哪一个。但是当你在这里拥有每个可能组合的整个权重矩阵时,它会将所有这些单独处理。 因此,这也可能导致泛化问题,因为只有很多权重等等。 + +![](../img/1__UB-iwca2SW15UhI8VLZ6g.png) + +我的观点是在每个现代网络中使用的方法,我们在这里进行自适应平均池(在Keras中它被称为全局平均池,在fast.ai中,我们做一个AdaptiveConcatPool),它直接将它吐出到512长激活[ [11:06](https://youtu.be/nG3tT31nPmQ%3Ft%3D11m6s) ]。 我认为这会丢掉太多的几何体。 所以对我来说,可能正确的答案介于两者之间,并且将涉及某种因素卷积或某种张量分解,这可能是我们中的一些人在未来几个月可能会想到的。 所以现在,无论如何,我们已经从一个极端,即自适应平均汇集到另一个极端,即这个巨大的扁平完全连接层。 + +关于VGG的一些有趣的事情使它今天仍然有用[ [11:59](https://youtu.be/nG3tT31nPmQ%3Ft%3D11m59s) ]。 第一个是在这里有更多有趣的层与大多数现代网络,包括ResNet系列,第一层通常是7x7转换,步幅2或类似的东西。 这意味着我们可以立即丢弃一半的网格大小,因此几乎没有机会使用精细细节,因为我们从不对它进行任何计算。 因此,对于分段或超分辨率模型这样的问题,这是一个问题,因为细节很重要。 我们实际上想恢复它。 然后第二个问题是自适应池化层完全抛弃了最后几个部分中的几何体,这意味着模型的其余部分实际上没有像几何体那样有趣的学习。 因此,对于依赖于位置的事物,任何需要生成模型的任何类型的基于本地化的方法都将变得不那么有效。 所以我希望你在我描述的内容之一就是可能没有一个现有的架构实际上是理想的。 我们可以发明一个新的 。 实际上,我只是尝试在一周内发明一个新的,即将VGG头部连接到ResNet骨干网。 有趣的是,我发现我的分类器实际上比普通的ResNet好一些,但它也有一些更有用的信息。 训练需要花费5或10%的时间,但没有什么值得担心的。 也许我们可以在ResNet中替换这个(7x7转步2),正如我们之前简要讨论过的那样。 这个非常早期的卷积更像是一个Inception词干,它有更多的计算。 我认为这些架构肯定有一些很好的调整空间,因此我们可以构建一些可能更通用的模型。 目前,人们倾向于构建只做一件事的架构。 他们并没有真正想到我在机会方面扔掉了什么,因为这就是出版的运作方式。 你发表了“我已经掌握了这一件事的最新技术,而不是你创造了许多擅长许多东西的东西。 + +由于这些原因,我们今天将使用VGG,尽管它很古老而且缺少很多很棒的东西[ [14:42](https://youtu.be/nG3tT31nPmQ%3Ft%3D14m42s) ]。 我们要做的一件事是使用稍微更现代的版本,这是VGG的一个版本,在所有卷积之后添加了批量规范。 在fast.ai,当你要求VGG网络时,你总是得到批量规范,因为这基本上总是你想要的。 所以这是具有批量规范的VGG。 有16和19,19更大更重,并没有真正做得更好,所以没有人真正使用它。 + +``` + arch = vgg16 sz_lr = 72 +``` + +我们将从72 `sz_lr` 72 LR( `sz_lr` :尺寸低分辨率)输入。 我们最初将它按比例缩放2倍,批量大小为64,得到2 * 72,144 144输出。 那将是我们的第一阶段。 + +``` + scale,bs = 2,64 _# scale,bs = 4,32_ sz_hr = sz_lr*scale +``` + +我们将为此创建我们自己的数据集,并且非常值得查看fastai.dataset模块并查看其中的内容[ [15:45](https://youtu.be/nG3tT31nPmQ%3Ft%3D15m45s) ]。 因为你想要的任何东西,我们可能有一些几乎你想要的东西。 所以在这种情况下,我想要一个数据集,其中我的_x_是图像,而我的_y_也是图像。 我们可以从_x_继承的文件数据集中继承,然后我只是继承了它,我只是复制并粘贴了`get_x`并将其转换为`get_y`因此它只是打开一个图像。 现在我有一些东西,其中_x_是一个图像, _y_是一个图像,在这两种情况下,我们传入的是一个文件名数组。 + +``` + **class** **MatchedFilesDataset** (FilesDataset): **def** __init__(self, fnames, y, transform, path): self.y=y **assert** (len(fnames)==len(y)) super().__init__(fnames, transform, path) **def** get_y(self, i): **return** open_image(os.path.join(self.path, self.y[i])) **def** get_c(self): **return** 0 +``` + +我要做一些数据扩充[ [16:32](https://youtu.be/nG3tT31nPmQ%3Ft%3D16m32s) ]。 显然,对于所有的ImageNet,我们并不真正需要它,但这主要适用于使用较小数据集来充分利用它的任何人。 `RandomDihedral`指的是每个可能的90度旋转加上可选的左/右翻转,因此它们是八个对称的二面体组。 通常我们不会对ImageNet图片使用这种转换,因为你通常不会颠倒狗,但在这种情况下,我们不是要试图分类它是狗还是猫,我们只是试图保持一般的结构它。 所以实际上每次可能的翻转对于这个问题都是合理合理的。 + +``` + aug_tfms = [RandomDihedral(tfm_y=TfmType.PIXEL)] +``` + +以通常的方式创建验证集[ [17:19](https://youtu.be/nG3tT31nPmQ%3Ft%3D17m19s) ]。 你可以看到我正在使用更低级别的功能 - 一般来说,我只是将它们复制并粘贴在fastai源代码中以找到我想要的位。 
所以这里是一个位,它接受一组验证集索引和一个或多个变量数组,并简单地拆分。 在这种情况下,将此( `np.array(fnames)` )转换为训练和验证集,并将此(第二个`np.array(fnames)` )转换为训练和验证集,以便为我们提供_x_和_y_的。 在这种情况下, _x_和_y_是相同的。 我们的输入图像和输出图像是相同的。 我们将使用转换来使其中一个降低分辨率。 这就是为什么这些都是一回事。 + +``` + val_idxs = get_cv_idxs(len(fnames), val_pct=min(0.01/keep_pct, 0.1)) ((val_x,trn_x),(val_y,trn_y)) = split_by_idx(val_idxs, np.array(fnames), np.array(fnames)) len(val_x),len(trn_x) +``` + +``` + _(12811, 1268356)_ +``` + +``` + img_fn = PATH/'train'/'n01558993'/'n01558993_9684.JPEG' +``` + +我们需要做的下一件事是按照惯例创建我们的转换[ [18:13](https://youtu.be/nG3tT31nPmQ%3Ft%3D18m13s) ]。 我们将使用`tfm_y`参数,就像我们对边界框所做的那样,而不是使用`TfmType.COORD`我们将使用`TfmType.PIXEL` 。 这告诉我们的转换框架你的_y_值是包含普通像素的图像,所以你对_x_做的任何事情,你也需要对_y_做同样的事情。 您需要确保您使用的任何数据扩充转换都具有相同的参数。 + +``` + tfms = tfms_from_model(arch, sz_lr, tfm_y=TfmType.PIXEL, aug_tfms=aug_tfms, sz_y=sz_hr) datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH_TRN) md = ImageData(PATH, datasets, bs, num_workers=16, classes= **None** ) +``` + +您可以看到可能的转换类型: + +* CLASS:我们即将在今天下半年使用细分的分类 +* COORD:坐标 - 根本没有转变 +* PIXEL + +一旦我们有`Dataset`类和一些_x_和_y_训练和验证集。 有一个方便的小方法叫做get datasets( `get_ds` ),它基本上在所有不同的东西上运行该构造函数,你需要以完全正确的格式返回所需的所有数据集,以传递给ModelData构造函数(在本例中为`ImageData`构造函数) )。 所以我们有点回到fastai的封面下并从头开始构建它。 在接下来的几个星期里,这一切都将被包装并重构为您可以在fastai中一步完成的事情。 但本课程的重点是学习一些关于幕后工作的内容。 + +我们之前简要介绍的一点是,当我们拍摄图像时,我们不仅使用数据增强对它们进行变换,而且还将通道尺寸移动到开始,我们减去平均值除以标准偏差等[ [20:08](https://youtu.be/nG3tT31nPmQ%3Ft%3D20m8s) ] 。 因此,如果我们希望能够显示那些来自我们的数据集或数据加载器的图片,我们需要对它们进行去规范化。 因此,模型数据对象的( `md` )数据集( `val_ds` )具有知道如何执行此操作的denorm函数。 为了方便,我只是给它一个简短的名字: + +``` + denorm = md.val_ds.denorm +``` + +所以现在我要创建一个可以显示来自数据集的图像的函数,如果你传入的东西说这是一个标准化的图像,那么我们就会对它进行说明。 + +``` + **def** show_img(ims, idx, figsize=(5,5), normed= **True** , ax= **None** ): **if** ax **is** **None** : fig,ax = plt.subplots(figsize=figsize) **if** normed: ims = denorm(ims) **else** : ims = np.rollaxis(to_np(ims),1,4) ax.imshow(np.clip(ims,0,1)[idx]) ax.axis('off') +``` + +``` + x,y = next(iter(md.val_dl)) x.size(),y.size() +``` + +``` + _(torch.Size([32, 3, 72, 72]), torch.Size([32, 3, 288, 288]))_ +``` + +你会在这里看到我们已经传递了尺寸低res( `sz_lr` )作为我们的变换尺寸和尺寸高res( `sz_hr` ),因为这是新的,尺寸y参数( `sz_y` )[ [20:58](https://youtu.be/nG3tT31nPmQ%3Ft%3D20m58s) ]。 因此这两个位将变得不同。 + +![](../img/1_7-vIZ1E_I3mvI59kTj_sVQ.png) + +在这里你可以看到_x_和_y_的两种不同分辨率,对于一大堆鱼来说。 + +``` + idx=1 fig,axes = plt.subplots(1, 2, figsize=(9,5)) show_img(x,idx, ax=axes[0]) show_img(y,idx, ax=axes[1]) +``` + +![](../img/1_vPyOJ9-D-s2gzRhraY21YA.png) + +按照惯例, `plt.subplots`创建我们的两个图,然后我们可以使用回来的不同轴将东西放在彼此旁边。 + +``` + batches = [next(iter(md.aug_dl)) **for** i **in** range(9)] +``` + +然后我们可以看看几个不同版本的数据转换[ [21:37](https://youtu.be/nG3tT31nPmQ%3Ft%3D21m37s) ]。 在那里你可以看到它们在所有不同的方向被翻转。 + +``` + fig, axes = plt.subplots(3, 6, figsize=(18, 9)) **for** i,(x,y) **in** enumerate(batches): show_img(x,idx, ax=axes.flat[i*2]) show_img(y,idx, ax=axes.flat[i*2+1]) +``` + +![](../img/1_9OOex0WAIoQPqzT6SwvW8g.png) + +#### 型号[ [21:48](https://youtu.be/nG3tT31nPmQ%3Ft%3D21m48s) ] + +让我们创建我们的模型。 我们将有一个小图像进入,我们希望有一个大图像出来。 所以我们需要在这两者之间做一些计算来计算大图像的样子。 基本上有两种方法可以进行计算: + +* 我们首先可以进行一些上采样,然后进行一些步骤来进行大量计算。 +* 我们可以先做很多步骤来完成所有计算,然后在最后做一些上采样。 + +我们将选择第二种方法,因为我们希望对较小的东西进行大量计算,因为以这种方式执行它会快得多。 此外,我们在上采样过程中可以利用所有计算。 上采样,我们知道几种可能的方法。 我们可以用: + +* 转置或分步跨越的卷积 +* 最近邻居上采样后跟1x1转换 + +在“做大量的计算”部分,我们可以拥有一大堆3x3转换。 但在特殊情况下,似乎ResNet块可能更好,因为输出和输入实际上非常相似。 所以我们真的想要一个流通路径,尽可能少的烦恼,除了做超级分辨率所需的最小量。 如果我们使用ResNet块,那么它们已经具有标识路径。 所以你可以想象那些简单的版本,它采用双线性采样方法,或者它可以通过身份块一直通过,然后在上采样块中,只需学习获取输入的平均值并得到一些不太可怕的东西。 + +这就是我们要做的事情。 
我们将创建具有五个ResNet块的东西,然后对于每个2倍的扩展,我们必须做,我们将有一个上采样块。 + +![](../img/1_d6GkM4JtsJb3WHmTA5NLUg.png) + +正如往常一样,它们都将包含卷积层,可能在许多卷之后具有激活函数[ [24:37](https://youtu.be/nG3tT31nPmQ%3Ft%3D24m37s) ]。 我喜欢把我的标准卷积块放到一个函数中,这样我就可以更容易地重构它。 我不会担心传递填充,只是将其直接计算为内核大小超过两个。 + +``` + **def** conv(ni, nf, kernel_size=3, actn= **False** ): layers = [nn.Conv2d(ni, nf, kernel_size, padding=kernel_size//2)] **if** actn: layers.append(nn.ReLU( **True** )) **return** nn.Sequential(*layers) +``` + +关于我们的小转换块的一个有趣的事情是没有批量规范,这对于ResNet类型模型来说是非常不寻常的。 + +![](../img/1_rMC3ob6YdywFeTHcAruD_A.png) + +
[https://arxiv.org/abs/1707.02921](https://arxiv.org/abs/1707.02921)
+ + + +没有批量规范的原因是因为我从最近这篇精彩的论文中窃取了想法,这篇论文实际上赢得了最近超级分辨率表现的竞争。 为了看看这篇论文有多好,SRResNet是先前的技术水平,他们在这里所做的就是他们已经放大了一个上采样的网格/围栏。 人力资源是原创的。 您可以在之前的最佳方法中看到,存在大量失真和模糊现象。 或者,在他们的方法中,它几乎是完美的。 因此,本文是一个非常大的进步。 他们称他们的模型为EDSR(增强型深度超分辨率网络),他们做了两件事与以前的标准方法不同: + +1. 获取ResNet块并丢弃批量规范。 他们为什么要抛弃批量规范? 原因是因为批量规范改变了东西,我们想要一个不会改变东西的直接路径。 所以这里的想法是,如果你不想比你必须更多地输入输入,那么不要强迫它必须计算批量规范参数之类的东西 - 所以扔掉批处理规范。 +2. 缩放因子(我们很快就会看到)。 + +``` + **class** **ResSequential** (nn.Module): **def** __init__(self, layers, res_scale=1.0): super().__init__() self.res_scale = res_scale self.m = nn.Sequential(*layers) **def** forward(self, x): **return** x + self.m(x) * self.res_scale +``` + +因此,我们将创建一个包含两个卷积的残差块。 正如你在他们的方法中看到的那样,他们在第二次转发之后甚至没有ReLU。 所以这就是我第一次激活的原因。 + +``` + **def** res_block(nf): **return** ResSequential( [conv(nf, nf, actn= **True** ), conv(nf, nf)], 0.1) +``` + +这里有几件有趣的事[ [27:10](https://youtu.be/nG3tT31nPmQ%3Ft%3D27m10s) ]。 一个是这种想法,即拥有某种主要的ResNet路径(conv,ReLU,conv),然后通过将其添加回身份将其转换为ReLU块 - 这是我们经常做的事情,我将其考虑到了一个名为ResSequential的微小模块。 它只需要将一堆图层放入剩余路径中,将其转换为顺序模型,运行它,然后将其添加回输入。 有了这个小模块,我们现在可以通过包装ResSequential将任何内容(如conv激活转换)转换为ResNet块。 + +但这并不是我所做的全部,因为通常Res块在其`forward`只有`x + self.m(x)` 。 但我也有`* self.res_scale` 。 什么是`res_scale` ? `res_scale`是数字0.1。 为什么会这样? 我不确定是否有人知道。 但简短的回答是,发明批量规范的人最近也做了一篇论文,其中他(我认为)首次展示了在一小时内培训ImageNet的能力。 他这样做的方法就是启动很多机器并使它们并行工作以创建非常大的批量。 现在,通常当你按照_N阶段_增加批量大小时,你也可以通过_N_阶来提高学习率。 因此,通常非常大的批量训练也意味着非常高的学习率训练。 他发现,在训练开始时,这些非常大的批量大小为8,000+甚至高达32,000, 他的激活基本上会直接进入无限 。 许多其他人也发现了这一点。 我们实际上发现,当我们在CIFAR和ImageNet竞赛中参加DAWN测试时,我们真的很难充分利用我们试图利用的八个GPU,因为这些较大的批量大小的挑战和服用他们的优势。 基督徒发现的东西是在ResNet块中,如果他将它们乘以小于1的某个数字,比如.1或.2,它确实有助于在开始时稳定训练。 这有点奇怪,因为在数学上,它是相同的。 因为很明显,无论我在这里乘以它,我都可以按相反的数量缩放权重并使用相同的数字。 但我们并不是在处理抽象数学 - 我们正在处理真正的优化问题,不同的初始化,学习率以及其他任何问题。 所以权重消失在无穷大的问题,我想通常实际上是关于计算机在实践中的离散和有限性质。 所以这些小技巧经常会有所不同。 + +在这种情况下,我们只是根据我们的初始初始化来减少事情。 所以可能有其他方法可以做到这一点。 例如,Nvidia的一些人称之为LARS,我上周简要提到的是一种使用实时计算的判别学习率的方法。 基本上看一下渐变与激活之间的比例,以逐层扩展学习率。 因此他们发现他们不需要这个技巧来大量扩大批量。 也许完全不同的初始化就是必要的。 我提到这个的原因并不是因为我认为很多人可能想要在大量的计算机集群上进行训练,而是我认为很多人想要快速训练模型,这意味着使用高学习率并且理想情况下获得超级收敛。 我认为这些技巧是我们需要能够在更多不同的架构上实现超级融合的技巧等等。 除了莱斯利史密斯之外,除了现在的一些快餐学生之外,没有其他人真正致力于超级融合。 所以这些关于我们如何以非常高的学习率训练的事情,我们将不得不成为那些想出来的人,因为据我所知,没有其他人关心。 所以看一下有关培训ImageNet的文献,我们现在在一小时或更近的时间里训练ImageNet 15分钟,我认为这些文章实际上有一些技巧可以让我们以高学习率训练。 所以这是其中之一。 + +有趣的是,除了列车ImageNet在一小时的纸上,我见过的唯一另一个地方就是这篇EDSR论文。 这真的很酷,因为赢得比赛的人,我发现他们非常务实,读得很好。 他们实际上必须让事情发挥作用。 因此,本文描述了一种实际上比其他人的方法更好的方法,并且他们做了这些务实的事情,例如抛弃批量规范并使用几乎没有人似乎知道的这个小的缩放因子。 那就是.1来自哪里。 + +``` + **def** upsample(ni, nf, scale): layers = [] **for** i **in** range(int(math.log(scale,2))): layers += [conv(ni, nf*4), nn.PixelShuffle(2)] **return** nn.Sequential(*layers) +``` + +所以基本上我们的超级分辨率ResNet( `SrResnet` )将进行卷积,从我们的三个通道转到64个通道,只是为了稍微增加空间[ [33:25](https://youtu.be/nG3tT31nPmQ%3Ft%3D33m25s) ]。 然后我们实际上有8个而不是5个Res块。 请记住,这些Res块中的每一个都是步幅1,因此网格大小不会改变,滤波器的数量不会改变。 它一直只有64。 我们将再进行一次卷积,然后我们将按照我们要求的规模进行上采样。 然后我添加了一些这是一个批量规范,因为它感觉只是缩放最后一层可能会有所帮助。 然后最终转回到我们想要的三个频道。 所以你可以看到这里有很多大量的计算,然后就像我们描述的那样进行一些上采样。 + +``` + **class** **SrResnet** (nn.Module): **def** __init__(self, nf, scale): super().__init__() features = [conv(3, 64)] **for** i **in** range(8): features.append(res_block(64)) features += [conv(64,64), upsample(64, 64, scale), nn.BatchNorm2d(64), conv(64, 3)] self.features = nn.Sequential(*features) **def** forward(self, x): **return** self.features(x) +``` + +简而言之,正如我现在所倾向的那样,这一切都是通过创建一个包含图层的列表来完成的,然后最后变成一个顺序模型,这样我的前向功能就可以了。 + +这是我们的上采样和上采样有点有趣,因为它没有做两件事(转换或分步跨越卷积或最近邻居上采样后跟1x1转换)。 让我们来谈谈上采样。 + +![](../img/1_he7T9_w-2Q2wo0jgno5Qfg.png) + +这是纸上的图片(实时样式转移和超分辨率的感知损失)。 
所以他们说“嘿,我们的做法好多了”,但看看他们的做法。 它里面有文物。 这些只是到处流行,不是吗。 其中一个原因是他们使用转置卷积,我们都知道不使用转置卷积。 + +![](../img/1_s9IHmwTn9La0u0M8omsd2w.png) + +这是转换的卷积[ [35:39](https://youtu.be/nG3tT31nPmQ%3Ft%3D35m39s) ]。 这是来自这个奇妙的卷积算术论文,也在Theano文档中显示。 如果我们从(蓝色是原始图像)3x3图像到5x5图像(如果我们添加了一层填充,则为6x6),那么所有转置卷积都会使用常规的3x3转换,但它会在白色零像素之间粘贴每一对像素。 这使得输入图像更大,当我们对它进行卷积时,因此给我们更大的输出。 但这显然是愚蠢的,因为当我们到达这里时,例如,九个像素进来,其中八个是零。 所以我们只是浪费了大量的计算。 另一方面,如果我们略微偏离,那么我们的九个中有四个是非零的。 但是,我们只使用一个过滤器/内核,因此根据有多少零进来它不能改变。所以它必须适合两者并且它是不可能的,所以我们最终得到这些工件。 + +![](../img/1_afBXEvE8aOwzNjRNb5bt6Q.png) + +
[http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)
+ + + +我们已经学会使其变得更好的一种方法是不在这里放置白色的东西,而是将像素的值复制到这三个位置中的每一个[ [36:53](https://youtu.be/nG3tT31nPmQ%3Ft%3D36m53s) ]。 这是最近邻居的上采样。 这当然好一点,但它仍然很糟糕,因为现在我们到达这九个(如上所示),其中4个是完全相同的数字。 当我们跨越一个,那么现在我们完全有不同的情况。 因此,取决于我们的位置,特别是,如果我们在这里,重复次数会少得多: + +![](../img/1_uOVWwijHzuyf8IY4g6lKhQ.png) + +所以,我们还有这样的问题,即数据中存在浪费的计算和太多的结构,并且它将再次导致工件。 因此,上采样优于转置卷积 - 最好复制它们而不是用零替换它们。 但它还不够好。 + +相反,我们要做像素洗牌[ [37:56](https://youtu.be/nG3tT31nPmQ%3Ft%3D37m56s) ]。 Pixel shuffle是这个子像素卷积神经网络中的一个操作,它有点令人费解,但它有点令人着迷。 + +![](../img/1_Yj-niImdJg30IKlk0aBJFQ.png) + +
[**使用高效亚像素卷积神经网络的实时单图像和视频超分辨率**](https://arxiv.org/abs/1609.05158)
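作为参考，下面是一个极简的形状演示（只是我自己的示意，不是课程笔记本中的代码），展示 PyTorch 内置的 `nn.PixelShuffle` 如何把 r² 倍的通道重新排列成边长放大 r 倍的网格：

```python
import torch
import torch.nn as nn

r = 3                                # 放大倍数 upscale factor
x = torch.randn(1, 64 * r**2, 7, 7)  # (batch, channels*r^2, height, width)
ps = nn.PixelShuffle(r)
print(ps(x).shape)                   # torch.Size([1, 64, 21, 21])
```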
+ + + +我们从输入开始,我们经过一些卷积来创建一些特征映射一段时间,直到最终我们得到具有n [i-1]个特征映射的n [i-1]层。 我们将进行另外3x3转换,我们的目标是从7x7网格单元(我们将要进行3x3升级),因此我们将升级到21x21网格单元。 那么我们能做到的另一种方式是什么呢? 为了简单起见,让我们选择一个面/层 - 所以让我们采取最顶层的过滤器,然后对其进行卷积只是为了看看会发生什么。 我们要做的是我们将使用卷积,其中内核大小(过滤器的数量)比我们需要的大9倍(严格来说)。 因此,如果我们需要64个过滤器,我们实际上将做64次9过滤器。 为什么? 这里,r是比例因子,因此3 2是9,所以这里有九个滤波器来覆盖这些输入层/切片之一。 但我们能做的就是从7x7开始,然后我们把它变成了7x7x9。 我们想要的输出等于7乘3乘7次3.换句话说,这里有相同数量的像素/激活,因为在前一步骤中有激活。 因此,我们可以逐步重新调整这些7x7x9激活以创建7x3乘7x3地图[ [40:16](https://youtu.be/nG3tT31nPmQ%3Ft%3D40m16s) ]。 所以我们要做的是我们要在这里拿一个小管(每个网格的左上角),我们将把紫色的一个放在左上角,然后将蓝色放在左上方。右边是浅蓝色,右边是浅蓝色,中间是浅绿色,等等。 因此,左上角的这九个单元中的每一个都将在我们网格的小3x3部分中结束。 然后我们将采用(2,1)并将所有这些9和更多它们带到网格的这些3x3部分,依此类推。 因此,我们将最终在7x3 x 7x3图像中进行这些7x7x9激活中的每一个。 + +所以首先要认识到的是当然这是在作品的某些定义下工作的,因为我们在这里有一个可学习的卷积,并且它会得到一些渐变,它可以做到最好的工作,它可以填充正确的激活,这样这输出是我们想要的东西。 所以第一步是要意识到这里没有什么特别神奇的东西。 我们可以创建任何我们喜欢的架构。 我们可以随心所欲地移动东西,我们在卷积中的权重将尽我们所能去做。 真正的问题是 - 这是个好主意吗? 对于它来说这是一件更容易的事情,并且它比转换卷积或逐个转换后的逐次取样更灵活吗? 简短的回答是肯定的,简而言之就是这里的卷积发生在低分辨率7x7空间中,这非常有效。 或者,如果我们首先进行上采样然后再进行转换,那么我们的转换将发生在21乘21空间,这是很多计算。 此外,正如我们所讨论的,最近邻上采样版本中存在大量复制和冗余。 事实上,他们实际上在本文中表明,他们有一个后续技术说明,他们提供了更多的数学细节,以确切地说正在做什么工作,并表明这项工作确实更有效率。 这就是我们要做的事情。 对于我们的上采样,我们有两个步骤: + +1. 3x3转换,r²倍于我们原先想要的频道 +2. 然后是像素混洗操作,它将每个网格单元格中的所有内容移动到通过这里定位的_r_网格中的小_r_ 。 + +所以这里是: + +![](../img/1_he7T9_w-2Q2wo0jgno5Qfg.png) + +这是一行代码。 这是一个转换器数量为4的滤波器数量,因为我们正在进行两个上采样(2²= 4)。 这是我们的卷积,然后这是我们在PyTorch中内置的像素shuffle。 Pixel shuffle是将每个东西移动到正确位置的东西。 因此,这将通过比例因子2进行上采样。因此,我们需要执行该对数基数2比例时间。 如果比例是4,那么我们将做两次两次两次。 这就是这里的例子。 + +#### 棋盘图案[ [44:19](https://youtu.be/nG3tT31nPmQ%3Ft%3D44m19s) ] + +大。 你猜怎么着。 这并没有摆脱棋盘图案。 我们仍然有棋盘图案。 所以我确信在极大的愤怒和挫折中,来自Twitter的同一个团队我认为这是回来的时候,他们曾经是一家名为魔术小马的创业公司,Twitter买回来的另一篇论文说好了,这次我们已经摆脱了的棋盘格。 + +![](../img/1_GHf4mB-n_o6owwX6MoY_NQ.png) + +
[https://arxiv.org/abs/1707.02937](https://arxiv.org/abs/1707.02937)
+ + + +为什么我们还有棋盘? 我们之后仍然有一个棋盘格的原因是,当我们在开始时随机初始化这个卷积内核时,这意味着这里的这个小3x3网格中的这9个像素中的每一个都将完全随机地变化。 但随后下一组3个像素将彼此随机不同,但将与之前3x3部分中的相应像素非常相似。 所以我们将一直重复3x3的事情。 然后,当我们尝试更好地学习时,它从重复的3x3起点开始,这不是我们想要的。 我们实际想要的是这些3x3像素开始时是相同的。 为了使这些3x3像素相同,我们需要为每个滤波器使这9个通道相同。 所以本文的解决方案非常简单。 当我们在开始时随机初始化它时初始化这个卷积时,我们不会完全随机地初始化它。 我们随机初始化一组r²通道,然后将它们复制到另一个r²,因此它们都是相同的。 这样,最初,这些3x3中的每一个都是相同的。 所以这被称为ICNR,这就是我们马上要用的东西。 + +#### 像素丢失[ [46:41](https://youtu.be/nG3tT31nPmQ%3Ft%3D46m41s) ] + +在我们开始之前,让我们快速看一下。 So we've got this super resolution ResNet which just does lots of computation with lots of ResNet blocks and then it does some upsampling and gets our final three channels out. + +Then to make life faster, we are going to run tins in parallel. One reason we want to run it in parallel is because Gerardo told us that he has 6 GPUs and this is what his computer looks like right now. + +![](../img/1_rPTiAdy8iIVV3twfOG7mHg.png) + +So I'm sure anybody who has more than one GPU has had this experience before. So how do we get these men working together? All you need to do is to take your PyTorch module and wrap it with `nn.DataParallel` . Once you've done that, it copies it to each of your GPUs and will automatically run it in parallel. It scales pretty well to two GPUs, okay to three GPUs, better than nothing to four GPUs and beyond that, performance does go backwards. By default, it will copy it to all of your GPUs — you can add an array of GPUs otherwise if you want to avoid getting in trouble, for example, I have to share our box with Yannet and if I didn't put this here, then she would be yelling at me right now or boycotting my class. So this is how you avoid getting into trouble with Yannet. + +``` + m = to_gpu(SrResnet(64, scale)) m = nn.DataParallel(m, [0,2]) learn = Learner(md, SingleModel(m), opt_fn=optim.Adam) learn.crit = F.mse_loss +``` + +One thing to be aware of here is that once you do this, it actually modifies your module [ [48:21](https://youtu.be/nG3tT31nPmQ%3Ft%3D48m21s) ]. So if you now print out your module, let's say previously it was just an endless sequential, now you'll find it's an `nn.Sequential` embedded inside a module called `Module` . In other words, if you save something which you had `nn.DataParallel` and then tried and load it back into something you haven't `nn.DataParallel` , it'll say it doesn't match up because one of them is embedded inside this Module attribute and the other one isn't. It may also depend even on which GPU IDs you have had it copy to. Two possible solutions: + +1. Don't save the module `m` but instead save the module attribute `m.module` because that's actually the non data parallel bit. +2. Always put it on the same GPU IDs and then use data parallel and load and save that every time. That's what I was using. + +This is an easy thing for me to fix automatically in fast.ai and I'll do it pretty soon so it will look for that module attribute and deal with it automatically. But for now, we have to do it manually. It's probably useful to know what's going on behind the scenes anyway. + +So we've got our module [ [49:46](https://youtu.be/nG3tT31nPmQ%3Ft%3D49m46s) ]. I find it'll run 50 or 60% faster on a 1080Ti if you are running on volta, it actually parallelize a bit better. There are much faster ways to parallelize but this is a super easy way. + +We create our learner in the usual way. We can use MSE loss here so that's just going to compare the pixels of the output to the pixels that we expected. We can run our learning rate finder and we can train it for a while. 
+ +``` + learn.lr_find(start_lr=1e-5, end_lr=10000) learn.sched.plot() +``` + +``` + 31%|███▏ | 225/720 [00:24<00:53, 9.19it/s, loss=0.0482] +``` + +![](../img/1_3NGCeWjks8iCWo97ag8KgQ.png) + +``` + lr=2e-3 +``` + +``` + learn.fit(lr, 1, cycle_len=1, use_clr_beta=(40,10)) +``` + +``` + 2%|▏ | 15/720 [00:02<01:52, 6.25it/s, loss=0.042] epoch trn_loss val_loss 0 0.007431 0.008192 +``` + +``` + [array([0.00819])] +``` + +``` + x,y = next(iter(md.val_dl)) preds = learn.model(VV(x)) +``` + +Here is our input: + +``` + idx=4 show_img(y,idx,normed= False ) +``` + +![](../img/1_eLOmwY4FKA-XU1MjNlw2PA.png) + +And here is our output. + +``` + show_img(preds,idx,normed= False ); +``` + +![](../img/1_qHj8r-MMhw_koKNxyNFKxQ.png) + +And you can see that what we've managed to do is to train a very advanced residual convolutional network that's learnt to blue things. 这是为什么? Well, because it's what we asked for. We said to minimize MSE loss. MSE loss between pixels really the best way to do that is just average the pixel ie to blur it. So that's why pixel loss is no good. So we want to use our perceptual loss. + +``` + show_img(x,idx,normed= True ); +``` + +![](../img/1_c2wlq6d0wuNB2xIjaRMcZw.png) + +``` + x,y = next(iter(md.val_dl)) preds = learn.model(VV(x)) +``` + +``` + show_img(y,idx,normed= False ) +``` + +![](../img/1_MEYmY0bflI07lSFsEDMS4Q.png) + +``` + show_img(preds,idx,normed= False ); +``` + +![](../img/1_34ihffYGeiFCpgaWIcDOjw.png) + +``` + show_img(x,idx); +``` + +![](../img/1_SLZNRUiVAoakl-7YKNuqyw.png) + +#### Perceptual loss [50:57] + +With perceptual loss, we are basically going to take our VGG network and just like we did last week, we are going to find the block index just before we get a maxpool. + +``` + def icnr(x, scale=2, init=nn.init.kaiming_normal): new_shape = [int(x.shape[0] / (scale ** 2))] + list(x.shape[1:]) subkernel = torch.zeros(new_shape) subkernel = init(subkernel) subkernel = subkernel.transpose(0, 1) subkernel = subkernel.contiguous().view(subkernel.shape[0], subkernel.shape[1], -1) kernel = subkernel.repeat(1, 1, scale ** 2) transposed_shape = [x.shape[1]] + [x.shape[0]] + list(x.shape[2:]) kernel = kernel.contiguous().view(transposed_shape) kernel = kernel.transpose(0, 1) return kernel +``` + +``` + m_vgg = vgg16( True ) blocks = [i-1 for i,o in enumerate(children(m_vgg)) if isinstance(o,nn.MaxPool2d)] blocks, [m_vgg[i] for i in blocks] +``` + +``` + ([5, 12, 22, 32, 42], + [ReLU(inplace), ReLU(inplace), ReLU(inplace), ReLU(inplace), ReLU(inplace)]) +``` + +So here are the ends of each block of the same grid size. If we just print them out, as we'd expect, every one of those is a ReLU module and so in this case these last two blocks are less interesting to us. The grid size there is small enough, and course enough that it's not as useful for super resolution. So we are just going to use the first three. Just to save unnecessary computation, we are just going to use those first 23 layers of VGG and we'll throw away the rest. We'll stick it on the GPU. We are not going to be training this VGG model at all — we are just using it to compare activations. So we'll stick it in eval mode and we will set it to not trainable. 
+ +``` + vgg_layers = children(m_vgg)[:23] m_vgg = nn.Sequential(*vgg_layers).cuda().eval() set_trainable(m_vgg, False ) +``` + +``` + def flatten(x): return x.view(x.size(0), -1) +``` + +Just like last week, we will use `SaveFeatures` class to do a forward hook which saves the output activations at each of those layers [ [52:07](https://youtu.be/nG3tT31nPmQ%3Ft%3D52m7s) ]. + +``` + class SaveFeatures (): features= None def __init__(self, m): self.hook = m.register_forward_hook(self.hook_fn) def hook_fn(self, module, input, output): self.features = output def remove(self): self.hook.remove() +``` + +So now we have everything we need to create our perceptual loss or as I call it here `FeatureLoss` class. We are going to pass in a list of layer IDs, the layers where we want the content loss to be calculated, and a list of weights for each of those layers. We can go through each of those layer IDs and create an object which has the forward hook function to store the activations. So in our forward, then we can just go ahead and call the forward pass of our model with the target (high res image we are trying to create). The reason we do that is because that is going to then call that hook function and store in `self.sfs` (self dot save features) the activations we want. Now we are going to need to do that for our conv net output as well. So we need to clone these because otherwise the conv net output is going to go ahead and just clobber what I already had. So now we can do the same thing for the conv net output which is the input to the loss function. And so now we've got those two things we can zip them all together along with the weights so we've got inputs, targets, and weights. Then we can do the L1 loss between the inputs and the targets and multiply by the layer weights. The only other thing I do is I also grab the pixel loss, but I weight it down quite a bit. Most people don't do this. I haven't seen papers that do this, but in my opinion, it's maybe a little bit better because you've got the perceptual content loss activation stuff but the really finest level it also cares about the individual pixels. So that's our loss function. + +``` + class FeatureLoss (nn.Module): def __init__(self, m, layer_ids, layer_wgts): super().__init__() self.m,self.wgts = m,layer_wgts self.sfs = [SaveFeatures(m[i]) for i in layer_ids] def forward(self, input, target, sum_layers= True ): self.m(VV(target.data)) res = [F.l1_loss(input,target)/100] targ_feat = [V(o.features.data.clone()) for o in self.sfs] self.m(input) res += [F.l1_loss(flatten(inp.features),flatten(targ))*wgt for inp,targ,wgt in zip(self.sfs, targ_feat, self.wgts)] if sum_layers: res = sum(res) return res def close(self): for o in self.sfs: o.remove() +``` + +We create our super resolution ResNet telling it how much to scale up by. + +``` + m = SrResnet(64, scale) +``` + +And then we are going to do our `icnr` initialization of that pixel shuffle convolution [ [54:27](https://youtu.be/nG3tT31nPmQ%3Ft%3D54m27s) ]. This is very boring code, I actually stole it from somebody else. Literally all it does is just say okay, you've got some weight tensor `x` that you want to initialize so we are going to treat it as if it has shape (ie number of features) divided by scale squared features in practice. So this might be 2² = 4 because we actually want to just keep one set of then and then copy them four times, so we divide it by four and we create something of that size and we initialize that with, by default, `kaiming_normal` initialization. 
Then we just make scale² copies of it. And the rest of it is just kind of moving axes around a little bit. So that's going to return a new weight matrix where each initialized sub kernel is repeated r² or `scale` ² times. So that details don't matter very much. All that matters here is that I just looked through to find what was the actual conv layer just before the pixel shuffle and store it away and then I called `icnr` on its weight matrix to get my new weight matrix. And then I copied that new weight matrix back into that layer. + +``` + conv_shuffle = m.features[10][0][0] kernel = icnr(conv_shuffle.weight, scale=scale) conv_shuffle.weight.data.copy_(kernel); +``` + +As you can see, I went to quite a lot of trouble in this exercise to really try to implement all the best practices [ [56:13](https://youtu.be/nG3tT31nPmQ%3Ft%3D56m13s) ]. I tend to do things a bit one extreme or the other. I show you a really hacky version that only slightly works or I go to the _n_ th degree to make it work really well. So this is a version where I'm claiming that this is pretty much a state of the art implementation. It's a competition winning or at least my re-implementation of a competition winning approach. The reason I'm doing that is because I think this is one of those rare papers where they actually get a lot of the details right and I want you to get a feel of what it feels like to get all the details right. Remember, getting the details right is the difference between the hideous blurry mess and the pretty exquisite result. + +``` + m = to_gpu(m) +``` + +``` + learn = Learner(md, SingleModel(m), opt_fn=optim.Adam) +``` + +``` + t = torch.load(learn.get_model_path('sr-samp0'), map_location= lambda storage, loc: storage) learn.model.load_state_dict(t, strict= False ) +``` + +``` + learn.freeze_to(999) +``` + +``` + for i in range(10,13): set_trainable(m.features[i], True ) +``` + +``` + conv_shuffle = m.features[10][2][0] kernel = icnr(conv_shuffle.weight, scale=scale) conv_shuffle.weight.data.copy_(kernel); +``` + +So we are going do DataParallel on that again [ [57:14](https://youtu.be/nG3tT31nPmQ%3Ft%3D57m14s) ]. + +``` + m = nn.DataParallel(m, [0,2]) learn = Learner(md, SingleModel(m), opt_fn=optim.Adam) +``` + +``` + learn.set_data(md) +``` + +We are going to set our criterion to be FeatureLoss using our VGG model, grab the first few blocks and these are sets of layer weights that I found worked pretty well. + +``` + learn.crit = FeatureLoss(m_vgg, blocks[:3], [0.2,0.7,0.1]) +``` + +``` + lr=6e-3 wd=1e-7 +``` + +Do a learning rate finder. + +``` + learn.lr_find(1e-4, 0.1, wds=wd, linear= True ) +``` + +``` + 1%| | 15/1801 [00:06<12:55, 2.30it/s, loss=0.0965] 12%|█▏ | 220/1801 [01:16<09:08, 2.88it/s, loss=0.42] +``` + +``` + learn.sched.plot(n_skip_end=1) +``` + +![](../img/1_pFxZy5AKqmhC031Q6a71PQ.png) + +Fit it for a while + +``` + learn.fit(lr, 1, cycle_len=2, wds=wd, use_clr=(20,10)) +``` + +``` + epoch trn_loss val_loss 0 0.04523 0.042932 1 0.043574 0.041242 +``` + +``` + [array([0.04124])] +``` + +``` + learn.save('sr-samp0') +``` + +``` + learn.save('sr-samp1') +``` + +And I fiddled around for a while trying to get some of these details right. But here is my favorite part of the paper is what happens next. Now that we've done it for scale equals 2 — progressive resizing. So progressive resizing is the trick that let us get the best best single computer result for ImageNet training on DAWN bench. It's this idea of starting small gradually making bigger. 
I only know of two papers that have used this idea. One is the progressive resizing of GANs paper which allows training very high resolution GANs and the other one is the EDSR paper. And the cool thing about progressive resizing is not only are your earlier epochs (assuming your images are 2x2 smaller) four times faster, you can also make the batch size maybe 3 or 4 times bigger. But more importantly, they are going to generalize better because you are feeding your model different sized images during training. So we were able to train half as many epochs for ImageNet as most people. Our epochs were faster and there were fewer of them. So progressive resizing is something that, particularly if you are training from scratch (I'm not so sure if it's useful for fine-tuning transfer learning, but if you are training from scratch), you probably want to do nearly all the time. + +#### Progressive resizing [ [59:07](https://youtu.be/nG3tT31nPmQ%3Ft%3D59m7s) ] + +So the next step is to go all the way back to the top and change to scale 4, batch size 32, and restart. I saved the model before I did that. + +![](../img/1_Wm9cQuH2YuzT4zLdjFx5gQ.png) + +So going back, that's why there's a little bit of fussing around in here with reloading, because what I needed to do now is load my saved model back in. + +![](../img/1_sbHDTKhKUuuOqvDz8RayHw.png) + +But there's a slight issue which is I now have one more upsampling layer than I used to have, to go from 2x2 to 4x4. My loop here is now looping through twice, not once. Therefore, it's added an extra conv and an extra pixel shuffle. So how am I going to load in weights for a different network? + +![](../img/1_bu5wo96iiUbwq3lfABmLSg.png) + +The answer is that I use a very handy thing in PyTorch, `load_state_dict` . This is what `learn.load` calls behind the scenes. If I pass this parameter `strict=False` then it says "okay, if you can't fill in all of the layers, just fill in the layers you can." So after loading the model back in this way, we are going to end up with something where it's loaded in all the layers that it can, and that one conv layer that's new is going to be randomly initialized. + +![](../img/1_FJ6Ntp4aiKC1Lav6V4Q-9A.png) + +Then I freeze all my layers and then unfreeze that upsampling part [ [1:00:45](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h45s) ]. Then I use `icnr` on my newly added extra layer. Then I can go ahead and learn again. So then the rest is the same. + +If you are trying to replicate this, don't just run this top to bottom. Realize it involves a bit of jumping around.
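Here is a minimal sketch of that partial-loading trick in isolation (my own toy modules, not the notebook's model), showing how `load_state_dict(..., strict=False)` fills in the layers it can and leaves the new one randomly initialized:

```python
import torch
import torch.nn as nn

old = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
sd = old.state_dict()            # in practice this would come from torch.load(...)

# same backbone plus one extra conv that the saved weights know nothing about
new = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1))

# strict=False: copy in every layer that matches; the new conv keeps its random init
result = new.load_state_dict(sd, strict=False)
print(result)                    # recent PyTorch versions report the missing/unexpected keys
```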
+ +``` + learn.load('sr-samp1') +``` + +``` + lr=3e-3 +``` + +``` + learn.fit(lr, 1, cycle_len=1, wds=wd, use_clr=(20,10)) +``` + +``` + epoch trn_loss val_loss 0 0.069054 0.06638 +``` + +``` + [array([0.06638])] +``` + +``` + learn.save('sr-samp2') +``` + +``` + learn.unfreeze() +``` + +``` + learn.load('sr-samp2') +``` + +``` + learn.fit(lr/3, 1, cycle_len=1, wds=wd, use_clr=(20,10)) +``` + +``` + epoch trn_loss val_loss 0 0.06042 0.057613 +``` + +``` + [array([0.05761])] +``` + +``` + learn.save('sr1') +``` + +``` + learn.sched.plot_loss() +``` + +![](../img/1_M7lOaEQaa21WBgAXVd8roQ.png) + +``` + def plot_ds_img(idx, ax= None , figsize=(7,7), normed= True ): if ax is None : fig,ax = plt.subplots(figsize=figsize) im = md.val_ds[idx][0] if normed: im = denorm(im)[0] else : im = np.rollaxis(to_np(im),0,3) ax.imshow(im) ax.axis('off') +``` + +``` + fig,axes=plt.subplots(6,6,figsize=(20,20)) **for** i,ax **in** enumerate(axes.flat): plot_ds_img(i+200,ax=ax, normed= True ) +``` + +![](../img/1_iQrT4lAb6bYt0s65aYEVsA.png) + +``` + x,y=md.val_ds[215] +``` + +``` + y=y[ None ] +``` + +``` + learn.model.eval() preds = learn.model(VV(x[ None ])) x.shape,y.shape,preds.shape +``` + +``` + ((3, 72, 72), (1, 3, 288, 288), torch.Size([1, 3, 288, 288])) +``` + +``` + learn.crit(preds, V(y), sum_layers= False ) +``` + +``` + [Variable containing: 1.00000e-03 * 1.1935 [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: 1.00000e-03 * 8.5054 [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: 1.00000e-02 * 3.4656 [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: 1.00000e-03 * 3.8243 [torch.cuda.FloatTensor of size 1 (GPU 0)]] +``` + +``` + learn.crit.close() +``` + +The longer you train, the better it gets [ [1:01:18](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h1m18s) ]. I ended up training it for about 10 hours, but you'll still get very good results much more quickly if you're less patient. So we can try it out and and here is the result. On the left is my pixelated bird and on the right is the upsampled version. It literally invented coloration. But it figured out what kind of bird it is, and it knows what these feathers are meant to look like. So it has imagined a set of feathers which are compatible with these exact pixels which is genius. Same for the back of its head. There is no way you can tell what these blue dots are meant to represent. But if you know that this kind of bird has an array of feathers here, you know that's what they must be. Then you can figure out whether the feathers would have to be such that when they were pixelated they would end up in these spots. So it literally reverse engineered given its knowledge of this exact species of bird, how it would have to have looked to create this output. This is so amazing. It also knows from all the signs around it that this area here (background) was almost certainly blurred out. So it actually reconstructed blurred vegetation. If it hadn't have done all of those things, it wouldn't have gotten such a good loss function. Because in the end, it had to match the activations saying “oh, there's a feather over here and it's kind of fluffy looking and it's in this direction” and all that. + +``` + _,axes=plt.subplots(1,2,figsize=(14,7)) show_img(x[ None ], 0, ax=axes[0]) show_img(preds,0, normed= True , ax=axes[1]) +``` + +![](../img/1_wJTo6Q3kodiPLTyxaLDYZg.png) + +Well, that brings us to the end of super resolution [ [1:03:18](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h3m18s) ]. 
Don't forget to check out the [ask Jeremy anything](http://forums.fast.ai/t/ask-jeremy-anything/15646/1) thread. + +### Ask Jeremy Anything + +**Question** : What are the future plans for fast.ai and this course? Will there be a part 3? If there is a part 3, I would really love to take it [ [1:04:11](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h4m11s) ]. + +**Jeremy** : I'm not quite sure. It's always hard to guess. I hope there will be some kind of follow-up. Last year, after part 2, one of the students started up a weekly book club going through the Ian Goodfellow deep learning book, and Ian actually came in and presented quite a few of the chapters and there was somebody, an expert, who presented every chapter. That was a really cool part 3\. To a large extent, it will depend on you, the community, to come up with ideas and help make them happen, and I'm definitely keen to help. I've got a bunch of ideas but I'm nervous about saying them because I'm not sure which ones will happen and which ones won't. But the more support I have in making things happen that you want to happen from you, the more likely they are to happen. + +**Question** : What was your experience like starting down the path of entrepreneurship? Have you always been an entrepreneur or did you start at a big company and transition to a startup? Did you go from academia to startups or startups to academia? [ [1:05:13](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h5m13s) ] + +**Jeremy** : No, I was definitely not an academia. I am totally a fake academic. I started at McKinsey and company which is a strategy firm when I was 18 which meant I couldn't really go to university so it didn't really turn up. Then spent 8 years in business helping really big companies on strategic questions. I always wanted to be an entrepreneur, planned to only spend two years in McKinsey, only thing I really regret in my life was not sticking to that plan and wasting eight years instead. So two years would have been perfect. But then I went into entrepreneurship, started two companies in Australia. The best part about that was that I didn't get any funding so all the money that I made was mine or the decisions were mine and my partner's. I focused entirely on profit and product and customer and service. Whereas I find in San Francisco, I'm glad I came here and so the two of us came here for Kaggle, Anthony and I, and raised ridiculous amount of money 11 million dollar for this really new company. That was really interesting but it's also really distracting trying to worry about scaling and VC's wanting to see what your business development plans are and also just not having any real need to actually make a profit. So I had a bit of the same problem at Enlitic where I again raised a lot of money 15 million dollars pretty quickly and a lot of distractions. I think trying to bootstrap your own company and focus on making money by selling something at a profit and then plowing that back into the company, it worked really well. Because within five years, we were making a profit from 3 months in and within 5 years, we were making enough for profit not just to pay all of us and our own wages but also to see my bank account growing and after 10 years sold it for a big chunk of money, not enough that a VC would be excited but enough that I didn't have to worry about money again. So I think bootstrapping a company is something which people in the Bay Area at least don't seem to appreciate how good of an idea that is. 
+ +**Question** : If you were 25 years old today and still know what you know where would you be looking to use AI? What are you working on right now or looking to work on in the next 2 years [ [1:08:10](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h8m10s) ]? + +**Jeremy** : You should ignore the last part of that. I won't even answer it. Doesn't matter where I'm looking. What you should do is leverage your knowledge about your domain. So one of the main reasons we do this is to get people who have backgrounds in recruiting, oil field surveys, journalism, activism, whatever and solve your problems. It'll be really obvious to you what real problems are and it will be really obvious to you what data you have and where to find it. Those are all the bits that for everybody else that's really hard. So people who start out with “oh, I know deep learning now I'll go and find something to apply it to” basically never succeed where else people who are like “oh, I've been spending 25 years doing specialized recruiting for legal firms and I know that the key issue is this thing and I know that this piece of data totally solves it and so I'm just going to do that now and I already know who to call or actually start selling it to”. They are the ones who tend to win. If you've done nothing but academic stuff, then it's more maybe about your hobbies and interests. So everybody has hobbies. The main thing I would say is please don't focus on building tools for data scientists to use or for software engineers to use because every data scientist knows about the market of data scientists whereas only you know about the market for analyzing oil survey world or understanding audiology studies or whatever it is that you do. + +**Question** : Given what you've shown us about applying transfer learning from image recognition to NLP, there looks to be a lot of value in paying attention to all of the developments that happen across the whole ML field and that if you were to focus in one area you might miss out on some great advances in other concentrations. How do you stay aware of all of the advancements across the field while still having time to dig in deep to your specific domains [ [1:10:19](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h10m19s) ]? + +**Jeremy** : Yeah, that's awesome. I mean that's one of the key messages of this course. Lots of good work's being done in different places and people are so specialized and most people don't know about it. If I can get state of the art results in NLP within six months of starting to look at NLP and I think that says more about NLP than it does about me, frankly. It's kind of like the entrepreneurship thing. You pick the areas you see that you know about and kind of transfer stuff like “oh, we could use deep learning to solve this problem” or in this case, we could use this idea of computer vision to solve that problem. So things like transfer learning, I'm sure there's like a thousand opportunities for you to do in other field to do what Sebastian and I did in NLP with NLP classification. So the short answer to your question is the way to stay ahead of what's going on would be to follow my feed of Twitter favorites and my approach is to then follow lots and lots of people on Twitter and put them into the Twitter favorites for you. Literally, every time I come across something interesting, I click favorite. There are two reasons I do it. The first is that when the next course comes along, I go through my favorites to find which things I want to study. 
The second is so that you can do the same thing. And then which you go deep into, it almost doesn't matter. I find every time I look at something it turns out to be super interesting and important. So pick something which you feel like solving that problem would be actually useful for some reason and it doesn't seem to be very popular which is kind of the opposite of what everybody else does. Everybody else works on the problems which everybody else is already working on because they are the ones that seem popular. I can't quite understand this train of thinking but it seems to be very common. + +**Question** : Is Deep Learning an overkill to use on Tabular data? When is it better to use DL instead of ML on tabular data [ [1:12:46](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h12m46s) ]? + +**Jeremy** : Is that a real question or did you just put that there so that I would point out that Rachel Thomas just wrote an article? [http://www.fast.ai/2018/04/29/categorical-embeddings/](http://www.fast.ai/2018/04/29/categorical-embeddings/) + +So Rachel has just written about this and Rachel and I spent a long time talking about it and the short answer is we think it's great to use deep learning on tabular data. Actually, of all the rich complex important and interesting things that appear in Rachel's Twitter stream covering everything from the genocide of Rohingya through to latest ethics violations in AI companies, the one by far that got the most attention and engagement from the community was the question about is it called tabular data or structured data. So yeah, ask computer people how to name things and you'll get plenty of interest. There are some really good links here to stuff from Instacart and Pinterest and other folks who have done some good work in this area. Any of you that went to the Data Institute conference would have seen Jeremy Stanley's presentation about the really cool work they did at Instacart. + +**Rachel** : I relied heavily on lessons 3 and 4 from part 1 in writing this post so much of that may be familiar to you. + +**Jeremy** : Rachel asked me during the post like how to tell whether you should use the decision tree ensemble like GBM or random forest or neural net and my answer is I still don't know. Nobody I'm aware of has done that research in any particularly meaningful way. So there's a question to be answered there, I guess. My approach has been to try to make both of those things as accessible as possible through fast.ai library so you can try them both and see what works. That's what I do. + +**Question** : Reinforcement Learning popularity has been on a gradual rise in the recent past. What's your take on Reinforcement Learning? Would fast.ai consider covering some ground in popular RL techniques in the future [ [1:15:21](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h15m21s) ]? + +**Jeremy** : I'm still not a believer in reinforcement learning. I think it's an interesting problem to solve but it's not at all clear that we have a good way of solving this problem. So the problem, it really is the delayed credit problem. So I want to learn to play pong, I've moved up or down and three minutes later I find out whether I won the game of pong — which actions I took were actually useful? So to me, the idea of calculating the gradients of the output with respect to those inputs, the credit is so delayed that those derivatives don't seem very interesting. I get this question quite regularly in every one of these four courses so far. I've always said the same thing. 
I'm rather pleased that recently there have been some results showing that plain random search often does better than reinforcement learning. What has basically happened is that very well-funded companies with vast amounts of computational power throw all of it at reinforcement learning problems and get good results, and people then say "oh, it's because of the reinforcement learning" rather than because of the vast amounts of compute power. Or they use extremely thoughtful and clever algorithms, like the combination of convolutional neural nets and Monte Carlo tree search in the AlphaGo work, to get great results, and people incorrectly say "oh, that's because of reinforcement learning" when it wasn't really reinforcement learning at all. So I'm very interested in solving these kinds of more generic optimization-type problems, rather than just prediction problems, and that's what these delayed-credit problems tend to look like. But I don't think we've yet got good enough best practices that I have anything ready to teach and say "I've got to teach you this thing because I think it's still going to be useful next year." So we'll keep watching and see what happens.
+
+#### Super resolution network to a style transfer network [ [1:17:57](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h17m57s) ]
+
+![](../img/1_nBoerE_seZP-z5lqL7bsUg.png)
+
+We are now going to turn the super resolution network into a style transfer network, and we'll do this pretty quickly. We basically already have something: _x_ is my input image, I'm going to have some loss function, and I've got some neural net again. Instead of a neural net that does a whole lot of compute and then does upsampling at the end, our input this time is just as big as our output. So we are going to do some downsampling first, then our compute, and then our upsampling. That's the first change we are going to make: we add some downsampling, i.e. some stride 2 convolution layers, to the front of our network. The second change is that _yc_ and _x_ are now the same thing: we are basically going to say our output should still look like the input image at the end. Specifically, we are going to compare them by chucking the output through VGG and comparing activations at one of its layers. And then its style should look like some painting, which we'll do just as we did with the Gatys approach, by looking at the Gram matrix correspondence at a number of layers. So that's basically it, and it ought to be super straightforward. It's really combining two things we've already done.
+
+#### Style transfer net [ [1:19:19](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h19m19s) ]
+
+[Notebook](https://github.com/fastai/fastai/blob/master/courses/dl2/style-transfer-net.ipynb)
+
+So all this code starts identical, except we don't have high res and low res; we just have one size, 256.
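+
+Before diving into the notebook code, here is a minimal sketch of the combined objective described above: one content term comparing output and input activations at a VGG layer, plus Gatys-style Gram-matrix terms against the style image at several layers. This is not the notebook's implementation (that is the `CombinedLoss` class further down); the toy shapes, weights and choice of content layer are made up purely for illustration.
+
+```
+import torch
+import torch.nn.functional as F
+
+def gram(f):
+    # f: (batch, channels, h, w) activations from one VGG block
+    b, c, h, w = f.size()
+    x = f.view(b, c, -1)
+    return torch.bmm(x, x.transpose(1, 2)) / (c * h * w)
+
+def style_transfer_loss(out_feats, inp_feats, style_feats, ct_wgt, style_wgts):
+    # Content: the transformed output should still match the *input* image at one VGG layer...
+    content = F.mse_loss(out_feats[-1], inp_feats[-1]) * ct_wgt
+    # ...plus style: Gram matrices of the output should match those of the style image.
+    style = sum(F.mse_loss(gram(o), gram(s)) * w
+                for o, s, w in zip(out_feats, style_feats, style_wgts))
+    return content + style
+
+# Toy activations standing in for hooked VGG features from three blocks.
+shapes = [(1, 64, 32, 32), (1, 128, 16, 16), (1, 256, 8, 8)]
+out_feats   = [torch.randn(*s) for s in shapes]
+inp_feats   = [torch.randn(*s) for s in shapes]
+style_feats = [torch.randn(*s) for s in shapes]
+print(style_transfer_loss(out_feats, inp_feats, style_feats, 1.0, [0.2, 0.3, 0.5]))
+```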
+ +``` + %matplotlib inline %reload_ext autoreload %autoreload 2 +``` + +``` + **from** **fastai.conv_learner** **import** * from pathlib import Path torch.cuda.set_device(0) +``` + +``` + torch.backends.cudnn.benchmark= True +``` + +``` + PATH = Path('data/imagenet') PATH_TRN = PATH/'train' +``` + +``` + fnames_full,label_arr_full,all_labels = folder_source(PATH, 'train') fnames_full = ['/'.join(Path(fn).parts[-2:]) for fn in fnames_full] list(zip(fnames_full[:5],label_arr_full[:5])) +``` + +``` + [('n01440764/n01440764_9627.JPEG', 0), + ('n01440764/n01440764_9609.JPEG', 0), + ('n01440764/n01440764_5176.JPEG', 0), + ('n01440764/n01440764_6936.JPEG', 0), + ('n01440764/n01440764_4005.JPEG', 0)] +``` + +``` + all_labels[:5] +``` + +``` + ['n01440764', 'n01443537', 'n01484850', 'n01491361', 'n01494475'] +``` + +``` + np.random.seed(42) # keep_pct = 1. # keep_pct = 0.01 keep_pct = 0.1 keeps = np.random.rand(len(fnames_full)) < keep_pct fnames = np.array(fnames_full, copy= False )[keeps] label_arr = np.array(label_arr_full, copy= False )[keeps] +``` + +``` + arch = vgg16 # sz,bs = 96,32 sz,bs = 256,24 # sz,bs = 128,32 +``` + +``` + class MatchedFilesDataset (FilesDataset): def __init__(self, fnames, y, transform, path): self.y=y assert (len(fnames)==len(y)) super().__init__(fnames, transform, path) def get_y(self, i): return open_image(os.path.join(self.path, self.y[i])) def get_c(self): return 0 +``` + +``` + val_idxs = get_cv_idxs(len(fnames), val_pct=min(0.01/keep_pct, 0.1)) ((val_x,trn_x),(val_y,trn_y)) = split_by_idx(val_idxs, np.array(fnames), np.array(fnames)) len(val_x),len(trn_x) +``` + +``` + (12800, 115206) +``` + +``` + img_fn = PATH/'train'/'n01558993'/'n01558993_9684.JPEG' +``` + +``` + tfms = tfms_from_model(arch, sz, tfm_y=TfmType.PIXEL) datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH_TRN) md = ImageData(PATH, datasets, bs, num_workers=16, classes= None ) +``` + +``` + denorm = md.val_ds.denorm +``` + +``` + def show_img(ims, idx, figsize=(5,5), normed= True , ax= None ): if ax is None : fig,ax = plt.subplots(figsize=figsize) if normed: ims = denorm(ims) else : ims = np.rollaxis(to_np(ims),1,4) ax.imshow(np.clip(ims,0,1)[idx]) ax.axis('off') +``` + +#### Model [ [1:19:30](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h19m30s) ] + +My model is the same. One thing I did here is I did not do any kind of fancy best practices for this one at all. Partly because there doesn't seem to be any. There's been very little follow up in this approach compared to the super resolution stuff. We'll talk about why in a moment. So you'll see, this is much more normal looking. + +``` + def conv(ni, nf, kernel_size=3, stride=1, actn= True , pad= None , bn= True ): if pad is None : pad = kernel_size//2 layers = [nn.Conv2d(ni, nf, kernel_size, stride=stride, padding=pad, bias= not bn)] if actn: layers.append(nn.ReLU(inplace= True )) if bn: layers.append(nn.BatchNorm2d(nf)) return nn.Sequential(*layers) +``` + +I've got batch norm layers. I don't have scaling factor here. + +``` + class ResSequentialCenter (nn.Module): def __init__(self, layers): super().__init__() self.m = nn.Sequential(*layers) +``` + +``` + def forward(self, x): return x[:, :, 2:-2, 2:-2] + self.m(x) +``` + +``` + def res_block(nf): return ResSequentialCenter([conv(nf, nf, actn= True , pad=0), conv(nf, nf, pad=0)]) +``` + +I don't have a pixel shuffle — it's just using a normal upsampling followed by 1x1 conf. So it's just more normal. 
+ +``` + def upsample(ni, nf): return nn.Sequential(nn.Upsample(scale_factor=2), conv(ni, nf)) +``` + +One thing they mentioned in the paper is they had a lot of problems with zero padding creating artifacts and the way they solved that was by adding 40 pixel of reflection padding at the start. So I did the same thing and then they used zero padding in their convolutions in their Res blocks. Now if you've got zero padding in your convolutions in your Res blocks, then that means that the two parts of your ResNet won't add up anymore because you've lost a pixel from each side on each of your two convolutions. So my `ResSequential` has become `ResSequentialCenter` and I've removed the last 2 pixels on each side of those good cells. Other than that, this is basically the same as what we had before. + +``` + class StyleResnet (nn.Module): def __init__(self): super().__init__() features = [nn.ReflectionPad2d(40), conv(3, 32, 9), conv(32, 64, stride=2), conv(64, 128, stride=2)] for i in range(5): features.append(res_block(128)) features += [upsample(128, 64), upsample(64, 32), conv(32, 3, 9, actn= False )] self.features = nn.Sequential(*features) def forward(self, x): return self.features(x) +``` + +#### Style Image [ [1:21:02](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h21m2s) ] + +So then we can bring in our starry night picture. + +``` + style_fn = PATH/'style'/'starry_night.jpg' style_img = open_image(style_fn) style_img.shape +``` + +``` + (1198, 1513, 3) +``` + +``` + plt.imshow(style_img); +``` + +![](../img/1_3QN8_RpikQBlk8wwjD9B3w.png) + +``` + h,w,_ = style_img.shape rat = max(sz/h,sz/h) res = cv2.resize(style_img, (int(w*rat), int(h*rat)), interpolation=cv2.INTER_AREA) resz_style = res[:sz,-sz:] +``` + +We can resize it. + +``` + plt.imshow(resz_style); +``` + +![](../img/1_CExkSLE8DWFZM0M3GPkHfg.png) + +We can throw it through our transformations + +``` + style_tfm,_ = tfms[1](resz_style,resz_style) +``` + +``` + style_tfm = np.broadcast_to(style_tfm[ None ], (bs,)+style_tfm.shape) +``` + +Just to make the method a little bit easier for my brain to handle, I took our transform style image which after transformations of 3 x 256 x 256, and I made a mini batch. My batch size is 24 — 24 copies of it. It just maeks it a little bit easier to do the kind of batch arithmetic without worrying about some of the broadcasting. They are not really 24 copies. I used `np.broadcast` to basically fake 24 pieces. + +``` + style_tfm.shape +``` + +``` + (24, 3, 256, 256) +``` + +#### Perceptual loss [ [1:21:51](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h21m51s) ] + +So just like before, we create a VGG, grab the last block. This time we are going to use all of these layers so we keep everything up to the 43rd layer. 
+ +``` + m_vgg = vgg16( True ) +``` + +``` + blocks = [i-1 for i,o in enumerate(children(m_vgg)) if isinstance(o,nn.MaxPool2d)] blocks, [m_vgg[i] for i in blocks[1:]] +``` + +``` + ([5, 12, 22, 32, 42], + [ReLU(inplace), ReLU(inplace), ReLU(inplace), ReLU(inplace)]) +``` + +``` + vgg_layers = children(m_vgg)[:43] m_vgg = nn.Sequential(*vgg_layers).cuda().eval() set_trainable(m_vgg, False ) +``` + +``` + def flatten(x): return x.view(x.size(0), -1) +``` + +``` + class SaveFeatures (): features= None def __init__(self, m): self.hook = m.register_forward_hook(self.hook_fn) def hook_fn(self, module, input, output): self.features = output def remove(self): self.hook.remove() +``` + +``` + def ct_loss(input, target): return F.mse_loss(input,target) +``` + +``` + def gram(input): b,c,h,w = input.size() x = input.view(b, c, -1) return torch.bmm(x, x.transpose(1,2))/(c*h*w)*1e6 +``` + +``` + def gram_loss(input, target): return F.mse_loss(gram(input), gram(target[:input.size(0)])) +``` + +So now our combined loss is going to add together a content loss for the third block plus the Gram loss for all of our blocks with different weights. Again, going back to everything being as normal as possible, I've gone back to using MSE above. Basically what happened was I had a lot of trouble getting this to train properly. So I gradually removed trick after trick and eventually just went “ok, I'm just gonna make it as bland as possible”. + +Last week's Gram matrix was wrong, by the way [ [1:22:37](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h22m37s) ]. It only worked for a batch size of one and we only had a batch size of one so that was fine. I was using matrix multiply which meant that every batch was being compared to every other batch. You actually need to use batch matrix multiple ( `torch.bmm` ) which does a matrix multiply per batch. So that's something to be aware of there. + +``` + class CombinedLoss (nn.Module): def __init__(self, m, layer_ids, style_im, ct_wgt, style_wgts): super().__init__() self.m,self.ct_wgt,self.style_wgts = m,ct_wgt,style_wgts self.sfs = [SaveFeatures(m[i]) for i in layer_ids] m(VV(style_im)) self.style_feat = [V(o.features.data.clone()) for o in self.sfs] +``` + +``` + def forward(self, input, target, sum_layers= True ): self.m(VV(target.data)) targ_feat = self.sfs[2].features.data.clone() self.m(input) inp_feat = [o.features for o in self.sfs] res = [ct_loss(inp_feat[2],V(targ_feat)) * self.ct_wgt] res += [gram_loss(inp,targ)*wgt for inp,targ,wgt in zip(inp_feat, self.style_feat, self.style_wgts)] if sum_layers: res = sum(res) return res def close(self): for o in self.sfs: o.remove() +``` + +So I've got Gram matrices, I do my MSE loss between the Gram matrices, I weight them by style weights, so I create that ResNet. + +``` + m = StyleResnet() m = to_gpu(m) +``` + +``` + learn = Learner(md, SingleModel(m), opt_fn=optim.Adam) +``` + +I create my combined loss passing in the VGG network, passing in the block IDs, passing in the transformed starry night image, and you'll see the the very start here, I do a forward pass through my VGG model with that starry night image in order that I can save the features for it. Notice, it's really important now that I don't do any data augmentation because I've saved the style features for a particular non-augmented version. So if I augmented it, it might make some minor problems. But that's fine because I've got all of ImageNet to deal with. I don't really need to do data augmentation anyway. 
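+
+Before constructing the criterion, the `torch.bmm` point above is worth a quick sanity check on a toy tensor. This standalone snippet is not from the notebook; it just confirms that `bmm` computes one Gram matrix per image in the batch, which is what the corrected `gram` function above relies on.
+
+```
+import torch
+
+feat = torch.randn(2, 3, 4, 4)           # toy "activations": batch=2, channels=3, 4x4
+b, c, h, w = feat.size()
+x = feat.view(b, c, -1)                  # (2, 3, 16)
+
+# torch.bmm: one matrix multiply per batch element -> per-image Gram matrices.
+grams = torch.bmm(x, x.transpose(1, 2)) / (c * h * w)   # (2, 3, 3)
+
+# Computing the first image on its own gives the same answer.
+gram0 = (x[0] @ x[0].t()) / (c * h * w)
+print(torch.allclose(grams[0], gram0))   # True
+```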
+ +``` + learn.crit = CombinedLoss(m_vgg, blocks[1:], style_tfm, 1e4, [0.025,0.275,5.,0.2]) +``` + +``` + wd=1e-7 +``` + +``` + learn.lr_find(wds=wd) learn.sched.plot(n_skip_end=1) +``` + +``` + 1%|▏ | 7/482 [00:04<05:32, 1.43it/s, loss=2.48e+04] 53%|█████▎ | 254/482 [02:27<02:12, 1.73it/s, loss=1.13e+12] +``` + +![](../img/1_sDFoltJE1s5ugWZndD4k0A.png) + +``` + lr=5e-3 +``` + +So I've got my loss function and I can go ahead and fit [ [1:24:06](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h24m6s) ]. And there is nothing clever here at all. + +``` + learn.fit(lr, 1, cycle_len=1, wds=wd, use_clr=(20,10)) +``` + +``` + epoch trn_loss val_loss 0 105.351372 105.833994 +``` + +``` + [array([105.83399])] +``` + +``` + learn.save('style-2') +``` + +``` + x,y=md.val_ds[201] +``` + +``` + learn.model.eval() preds = learn.model(VV(x[ None ])) x.shape,y.shape,preds.shape +``` + +``` + ((3, 256, 256), (3, 256, 256), torch.Size([1, 3, 256, 256])) +``` + +At the end, I have my `sum_layers=False` so I can see what each part looks like and see they are balanced. And I can finally pop it out + +``` + learn.crit(preds, VV(y[ None ]), sum_layers= False ) +``` + +``` + [Variable containing: + 53.2221 + [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: + 3.8336 + [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: + 4.0612 + [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: + 5.0639 + [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: + 53.0019 + [torch.cuda.FloatTensor of size 1 (GPU 0)]] +``` + +``` + learn.crit.close() +``` + +``` + _,axes=plt.subplots(1,2,figsize=(14,7)) show_img(x[ None ], 0, ax=axes[0]) show_img(preds, 0, ax=axes[1]) +``` + +![](../img/1_Hlb0cHXu_IdLZmfBxJIgJA.png) + +So I mentioned that should be pretty easy and yet it took me about 4 days because I just found this incredibly fiddly to actually get it to work [ [1:24:26](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h24m26s) ]. So when I finally got up in the morning I said to Rachel “guess what, it trained correctly.” Rachel said “I never thought that was going to happen.” It just looked awful all the time and it's really about getting the exact right mix of content loss and a style loss and the mix of the layers of the style loss. The worst part was it takes a really long time to train the darn CNN and I didn't really know how long to train it before I decided it wasn't doing well. Should I just train it for longer? And I don't know all the little details didn't seem to slightly change it but just it would totally fall apart all the time. So I kind of mentioned this partly to say just remember the final answer you see here is after me driving myself crazy all week of nearly always not working until finally the last minute it finally does. Even for things which just seemed like they couldn't possibly be difficult because that is combining two things we already have working. The other is to be careful about how we interpret what authors claim. + +![](../img/1_6AvLQSM40JcTWE26C4GafA.png) + +It was so fiddly getting this style transfer to work [ [1:26:10](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h26m10s) ]. After doing it, it left me thinking why did I bother because now I've got something that takes hours to create a network that can turn any kind of photo into one specific style. It just seems very unlikely I would want that for anything. The only reason I could think that being useful would be to do some art-y stuff on a video where I wanted to turn every frame into some style. 
It's incredibly niche thing to want to do. But when I looked at the paper, the table is saying “oh, we are a thousand times faster than the Gatys' approach which is just such an obviously meaningless thing to say. Such an incredibly misleading thing to say because it ignores all the hours of training for each individual style and I find this frustrating because groups like this Stanford group clearly know better or ought to know better, but still I guess the academic community encourages people to make these ridiculously grand claims. It also completely ignores this incredibly sensitive fiddly training process so this paper was just so well accepted when it came out. I remember everybody getting on Twitter and saying “wow, you know these Stanford people have found this way of doing style transfer a thousand times faster.” And clearly people saying this were top researchers in the field, clearly none of them actually understood it because nobody said “I don't see why this is remotely useful, and also I tried it and it was incredibly fiddly to get it all to work.” It's not until 18 months later I finally coming back to it and kind of thinking like “wait a minute, this is kind of stupid.” So this is the answer, I think, to the question of why haven't people done follow ups on this to create really amazing best practices and better approaches like with a super resolution part of the paper. And I think the answer is because it's dumb. So I think super resolution part of the paper is clearly not dumb. And it's been improved and improved and now we have great super resolution. And I think we can derive from that great noise reduction, great colorization, great slant removal, great interactive artifact removal, etc. So I think there's a lot of really cool techniques here. It's also leveraging a lot of stuff that we've been learning and getting better and better at. + +### Segmentation [ [1:29:13](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h29m13s) ] + +![](../img/1_6iq4f7DtfqsWxsE9ib0SPg.png) + +Finally, let's talk about segmentation. This is from the famous CamVid dataset which is a classic example of an academic segmentation dataset. Basically you can see what we do is we start with a picture (they are actually video frames in this dataset) and we have some labels where they are not actually colors — each one has an ID and the IDs are mapped to colors. So red might be 1, purple might be 2, light pink might be 3 and so all the buildings are one class, all the cars are another class, all the people are another class, all the road is another class, and so on. So what we are actually doing here is multi-class classification for every pixel. You can see, sometimes that multi-class classification really is quite tricky — like these branches. Although, sometimes the labels are really not that great. This is very coarse as you can see. So that's what we are going to do. + +We are going to do segmentation and so it's a lot like bounding boxes. But rather than just finding a box around each thing, we are actually going to label every single pixel with its class. Really, it's actually a lot easier because it fits our CNN style so nicely that we can create any CNN where the output is an N by M grid containing the integers from 0 to C where there are C categories. And then we can use cross-entropy loss with a softmax activation and we are done. I could actually stop the class there and you can go and use exactly the same approaches you've learnt in lesson 1 and 2 and you'll get a perfectly okay result. 
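+
+As a concrete sketch of that last point: with a recent PyTorch, per-pixel multi-class classification is literally just `nn.CrossEntropyLoss` applied to a grid of logits. The sizes below are made up; the pattern is the same as the CamVid-style setup described above.
+
+```
+import torch
+import torch.nn as nn
+
+# Logits: (batch, C, H, W); targets: (batch, H, W) integer class IDs per pixel.
+batch, C, H, W = 2, 32, 64, 64
+logits = torch.randn(batch, C, H, W)
+target = torch.randint(0, C, (batch, H, W))
+
+# CrossEntropyLoss applies a (log-)softmax over the class dimension at every pixel.
+loss = nn.CrossEntropyLoss()(logits, target)
+print(loss.item())
+```
+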
So the first thing to say is that this is not actually a terribly hard thing to do. But we are going to try to do it really well.
+
+#### Doing it the simple way [ [1:31:26](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h31m26s) ]
+
+[Notebook](https://github.com/fastai/fastai/blob/master/courses/dl2/carvana.ipynb)
+
+Let's start by doing it the really simple way. We are going to use the Kaggle [Carvana](https://www.kaggle.com/c/carvana-image-masking-challenge) competition, and you can download the data with the Kaggle API as usual.
+
+```
+%matplotlib inline
+%reload_ext autoreload
+%autoreload 2
+```
+
+```
+from fastai.conv_learner import *
+from fastai.dataset import *
+
+from pathlib import Path
+import json
+```
+
+#### Setup
+
+There is a `train` folder containing a bunch of images (the independent variable) and a `train_masks` folder containing the dependent variable, and they look like below.
+
+![](../img/1_3lO9olOrpojAt5C22B5CoQ.png)
+
+In this case, just like cats and dogs, we are keeping it simple: rather than doing multi-class classification, we are going to do binary classification. Of course, multi-class is just the more general version (categorical cross entropy rather than binary cross entropy); there is no difference conceptually. So the dependent variable is just zeros and ones, whereas the independent variable is a regular image.
+
+In order to do this well, it would really help to know what cars look like, because really what we want to do is figure out that this is a car and what orientation it is in, and then put white pixels where we expect the car to be, based on the picture and an understanding of what cars look like.
+
+```
+PATH = Path('data/carvana')
+list(PATH.iterdir())
+```
+
+```
+[PosixPath('data/carvana/train_masks.csv'),
+ PosixPath('data/carvana/train_masks-128'),
+ PosixPath('data/carvana/sample_submission.csv'),
+ PosixPath('data/carvana/train_masks_png'),
+ PosixPath('data/carvana/train.csv'),
+ PosixPath('data/carvana/train-128'),
+ PosixPath('data/carvana/train'),
+ PosixPath('data/carvana/metadata.csv'),
+ PosixPath('data/carvana/tmp'),
+ PosixPath('data/carvana/models'),
+ PosixPath('data/carvana/train_masks')]
+```
+
+```
+MASKS_FN = 'train_masks.csv'
+META_FN = 'metadata.csv'
+TRAIN_DN = 'train'
+MASKS_DN = 'train_masks'
+```
+
+```
+masks_csv = pd.read_csv(PATH/MASKS_FN)
+masks_csv.head()
+```
+
+![](../img/1_gSfi6-hcJG8YP3knRDQUAg.png)
+
+The original dataset came with these CSV files as well [ [1:32:44](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h32m44s) ]. I don't really use them for much other than getting the list of images from them.
+ +``` + meta_csv = pd.read_csv(PATH/META_FN) meta_csv.head() +``` + +![](../img/1_yBm5uRoK_sHQrS-ddATg6w.png) + +``` + def show_img(im, figsize= None , ax= None , alpha= None ): if not ax: fig,ax = plt.subplots(figsize=figsize) ax.imshow(im, alpha=alpha) ax.set_axis_off() return ax +``` + +``` + CAR_ID = '00087a6bd4dc' +``` + +``` + list((PATH/TRAIN_DN).iterdir())[:5] +``` + +``` + [PosixPath('data/carvana/train/5ab34f0e3ea5_15.jpg'), PosixPath('data/carvana/train/de3ca5ec1e59_07.jpg'), PosixPath('data/carvana/train/28d9a149cb02_13.jpg'), PosixPath('data/carvana/train/36a3f7f77e85_12.jpg'), PosixPath('data/carvana/train/843763f47895_08.jpg')] +``` + +``` + Image.open(PATH/TRAIN_DN/f' {CAR_ID} _01.jpg').resize((300,200)) +``` + +![](../img/1_KwDMh55dYxo5-ychRK2srQ.png) + +``` + list((PATH/MASKS_DN).iterdir())[:5] +``` + +``` + [PosixPath('data/carvana/train_masks/6c0cd487abcd_03_mask.gif'), PosixPath('data/carvana/train_masks/351c583eabd6_01_mask.gif'), PosixPath('data/carvana/train_masks/90fdd8932877_02_mask.gif'), PosixPath('data/carvana/train_masks/28d9a149cb02_10_mask.gif'), PosixPath('data/carvana/train_masks/88bc32b9e1d9_14_mask.gif')] +``` + +``` + Image.open(PATH/MASKS_DN/f' {CAR_ID} _01_mask.gif').resize((300,200)) +``` + +![](../img/1_zz4dhkhhF00W9VIQQiiGcg.png) + +Each image after the car ID has a 01, 02, etc of which I've printed out all 16 of them for one car and as you can see basically those numbers are the 16 orientations of one car [ [1:32:58](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h32m58s) ]. I don't think anybody in this competition actually used these orientation information. I believe they all kept the car's images just treated them separately. + +``` + ims = [open_image(PATH/TRAIN_DN/f' {CAR_ID} _{i+1:02d}.jpg') for i in range(16)] +``` + +``` + fig, axes = plt.subplots(4, 4, figsize=(9, 6)) for i,ax in enumerate(axes.flat): show_img(ims[i], ax=ax) plt.tight_layout(pad=0.1) +``` + +![](../img/1_ran3w5qDWsvfVaoPileaxw.png) + +#### Resize and convert [ [1:33:27](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h33m27s) ] + +These images are pretty big — over 1000 by 1000 in size and just opening the JPEGs and resizing them is slow. So I processed them all. Also OpenCV can't handle GIF files so I converted them. + +**Question** : How would somebody get these masks for training initially? [Mechanical turk](https://www.mturk.com/) or something [ [1:33:48](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h33m48s) ]? Yeah, just a lot of boring work. Probably there are some tools that help you with a bit of edge snapping so that the human can do it roughly and then just fine tune the bits it gets wrong. These kinds of labels are expensive. So one of the things I really want to work on is deep learning enhanced interactive labeling tools because that's clearly something that would help a lot of people. + +I've got a little section here that you can run if you want to. You probably want to. It converts the GIFs into PNGs so just open int up with PIL and then save it as PNG because OpenCV doesn't have GIF support. As per usual for this kind of stuff, I do it with a ThreadPool so I can take advantage of parallel processing. And then also create a separate directory `train-128` and `train_masks-128` which contains the 128 by 128 resized versions of them. + +This is the kind of stuff that keeps you sane if you do it early in the process. So anytime you get a new dataset, seriously think about creating a smaller version to make life fast. 
Anytime you find yourself waiting on your computer, try to think of a way to create a smaller version.
+
+```
+(PATH/'train_masks_png').mkdir(exist_ok=True)
+```
+
+```
+def convert_img(fn):
+    fn = fn.name
+    # OpenCV can't read GIFs, so open each mask with PIL and save it as a PNG
+    Image.open(PATH/'train_masks'/fn).save(PATH/'train_masks_png'/f'{fn[:-4]}.png')
+```
+
+```
+files = list((PATH/'train_masks').iterdir())
+with ThreadPoolExecutor(8) as e: e.map(convert_img, files)
+```
+
+```
+(PATH/'train_masks-128').mkdir(exist_ok=True)
+```
+
+```
+def resize_mask(fn):
+    Image.open(fn).resize((128,128)).save((fn.parent.parent)/'train_masks-128'/fn.name)
+
+files = list((PATH/'train_masks_png').iterdir())
+with ThreadPoolExecutor(8) as e: e.map(resize_mask, files)
+```
+
+```
+(PATH/'train-128').mkdir(exist_ok=True)
+```
+
+```
+def resize_img(fn):
+    Image.open(fn).resize((128,128)).save((fn.parent.parent)/'train-128'/fn.name)
+
+files = list((PATH/'train').iterdir())
+with ThreadPoolExecutor(8) as e: e.map(resize_img, files)
+```
+
+So after you grab the data from Kaggle, you probably want to run this stuff, go away, have lunch, and when it's done you'll have these smaller directories, which we are going to use below (128 by 128 to start with).
+
+#### Dataset [ [1:35:33](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h35m33s) ]
+
+```
+TRAIN_DN = 'train-128'
+MASKS_DN = 'train_masks-128'
+sz = 128
+bs = 64
+```
+
+```
+ims = [open_image(PATH/TRAIN_DN/f'{CAR_ID}_{i+1:02d}.jpg') for i in range(16)]
+im_masks = [open_image(PATH/MASKS_DN/f'{CAR_ID}_{i+1:02d}_mask.png') for i in range(16)]
+```
+
+So here is a cool trick. If you use the same axis object ( `ax` ) to plot an image twice, and the second time you pass `alpha` (which, as you might know, means transparency in the computer vision world), then you can plot the mask over the top of the photo. So here is a nice way to see all the masks on top of the photos for all of the cars in one group.
+
+```
+fig, axes = plt.subplots(4, 4, figsize=(9, 6))
+for i,ax in enumerate(axes.flat):
+    ax = show_img(ims[i], ax=ax)
+    show_img(im_masks[i][...,0], ax=ax, alpha=0.5)
+plt.tight_layout(pad=0.1)
+```
+
+![](../img/1_KuD6ZEQEbZH8Ka3u8tP5Ig.png)
+
+This is the same MatchedFilesDataset we've seen twice already; it's all the same code. Here is something important though: if the image on the left was in the training set and the image on the right was in the validation set, that would be kind of cheating, because it's the same car.
+
+![](../img/1_vUDf60cZZxP3gezhcP36-w.png)
+
+```
+class MatchedFilesDataset(FilesDataset):
+    def __init__(self, fnames, y, transform, path):
+        self.y=y
+        assert(len(fnames)==len(y))
+        super().__init__(fnames, transform, path)
+    def get_y(self, i): return open_image(os.path.join(self.path, self.y[i]))
+    def get_c(self): return 0
+```
+
+```
+x_names = np.array([Path(TRAIN_DN)/o for o in masks_csv['img']])
+y_names = np.array([Path(MASKS_DN)/f'{o[:-4]}_mask.png' for o in masks_csv['img']])
+```
+
+```
+len(x_names)//16//5*16
+```
+
+```
+1008
+```
+
+So we use a contiguous set of car IDs, and since each car comes as a set of 16 images, we make sure the split is evenly divisible by 16. That way our validation set contains different car IDs to our training set. This is the kind of thing you've got to be careful of. On Kaggle it's not so bad: you'll know about it because you'll submit your result and get a very different number on the leaderboard compared to your validation set. But in the real world,
you won't know until you put it in production and send your company bankrupt and lose your job. So you might want to think carefully about your validation set in that case. + +``` + val_idxs = list(range(1008)) ((val_x,trn_x),(val_y,trn_y)) = split_by_idx(val_idxs, x_names, y_names) len(val_x),len(trn_x) +``` + +``` + (1008, 4080) +``` + +Here we are going to use transform type classification ( `TfmType.CLASS` ) [ [1:37:03](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h37m3s) ]. It's basically the same as transform type pixel ( `TfmType.PIXEL` ) but if you think about it, with a pixel version if we rotate a little bit then we probably want to average the pixels in between the two, but the classification, obviously we don't. We use nearest neighbor. So there's slight difference there. Also for classification, lighting doesn't kick in, normalization doesn't kick in to the dependent variable. + +``` + aug_tfms = [RandomRotate(4, tfm_y=TfmType.CLASS), RandomFlip(tfm_y=TfmType.CLASS), RandomLighting(0.05, 0.05)] # aug_tfms = [] +``` + +They are already square images, so we don't have to do any cropping. + +``` + tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO, tfm_y=TfmType.CLASS, aug_tfms=aug_tfms) datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH) md = ImageData(PATH, datasets, bs, num_workers=8, classes= None ) +``` + +``` + denorm = md.trn_ds.denorm x,y = next(iter(md.aug_dl)) x = denorm(x) +``` + +So here you can see different versions of the augmented images — they are moving around a bit, and they are rotating a bit, and so forth. + +``` + fig, axes = plt.subplots(5, 6, figsize=(12, 10)) **for** i,ax **in** enumerate(axes.flat): ax=show_img(x[i], ax=ax) show_img(y[i], ax=ax, alpha=0.5) plt.tight_layout(pad=0.1) +``` + +![](../img/1_ak5NPATO_ayUjsc7UGfEdQ.png) + +I get a lot of questions during our study group about how do I debug things and fix things that aren't working. I never have a great answer other than every time I fix a problem is because of stuff like this that I do all the time. I just always print out everything as I go and then the one thing that I screw up always turns out to be the one thing that I forgot to check along the way. The more of this kind of thing you can do the better. If you are not looking at all of your intermediate results, you are going to have troubles. + +#### Model [ [1:38:30](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h38m30s) ] + +``` + class Empty (nn.Module): def forward(self,x): return x models = ConvnetBuilder(resnet34, 0, 0, 0, custom_head=Empty()) learn = ConvLearner(md, models) learn.summary() +``` + +``` + class StdUpsample (nn.Module): def __init__(self, nin, nout): super().__init__() self.conv = nn.ConvTranspose2d(nin, nout, 2, stride=2) self.bn = nn.BatchNorm2d(nout) def forward(self, x): return self.bn(F.relu(self.conv(x))) +``` + +``` + flatten_channel = Lambda( lambda x: x[:,0]) +``` + +``` + simple_up = nn.Sequential( nn.ReLU(), StdUpsample(512,256), StdUpsample(256,256), StdUpsample(256,256), StdUpsample(256,256), nn.ConvTranspose2d(256, 1, 2, stride=2), flatten_channel ) +``` + +Given that we want something that knows what cars look like, we probably want to start with a pre-trained ImageNet network. So we are going to start with ResNet34\. With `ConvnetBuilder` , we can grab our ResNet34 and we can add a custom head. 
The custom head is going to be something that upsamples a bunch of times and we are going to do things really dumb for now which is we're just going to do a ConvTranspose2d, batch norm, ReLU. + +This is what I am saying — any of you could have built this without looking at any of this notebook or at least you have the information from previous classes. There is nothing new at all. So at the very end, we have a single filter. Now that's going to give us something which is batch size by 1 by 128 by 128\. But we want something which is batch size by 128 by 128\. So we have to remove that unit axis so I've got a lambda layer here. Lambda layers are incredibly helpful because without the lambda layer here, which is simply removing that unit axis by just indexing it with a 0, without a lambda layer, I would have to have created a custom class with a custom forward method and so forth. But by creating a lambda layer that does the one custom bit, I can now just chuck it in the Sequential and so that makes life easier. + +PyTorch people are kind of snooty about this approach. Lambda layer is actually something that's a part of the fastai library not part of the PyTorch library. And literally people on PyTorch discussion board say “yes, we could give people this”, “yes it is only a single line of code” but they never encourage them to use sequential too often. So there you go. + +So this is our custom head [ [1:40:36](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h40m36s) ]. So we are going to have a ResNet 34 that goes downsample and then a really simple custom head that very quickly upsamples, and that hopefully will do something. And we are going to use accuracy with a threshold of 0.5 and print out metrics. + +``` + models = ConvnetBuilder(resnet34, 0, 0, 0, custom_head=simple_up) learn = ConvLearner(md, models) learn.opt_fn=optim.Adam learn.crit=nn.BCEWithLogitsLoss() learn.metrics=[accuracy_thresh(0.5)] +``` + +``` + learn.lr_find() learn.sched.plot() +``` + +``` + 94%|█████████▍| 30/32 [00:05<00:00, 5.48it/s, loss=10.6] +``` + +![](../img/1_0RoKSchCdIyFHGVjb7PXHA.png) + +``` + lr=4e-2 +``` + +``` + learn.fit(lr,1,cycle_len=5,use_clr=(20,5)) +``` + +``` + epoch trn_loss val_loss + 0 0.124078 0.133566 0.945951 + 1 0.111241 0.112318 0.954912 + 2 0.099743 0.09817 0.957507 + 3 0.090651 0.092375 0.958117 + 4 0.084031 0.086026 0.963243 +``` + +``` + [0.086025625, 0.96324310824275017] +``` + +After a few epochs, we've got 96 percent accurate. Is that good [ [1:40:56](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h40m56s) ]? Is 96% accurate good? And hopefully the answer to that question is it depends. What's it for? The answer is Carvana wanted this because they wanted to be able to take their car image and cut them out and paste them on exotic Monte Carlo backgrounds or whatever (that's Monte Carlo the place and not the simulation). To do that, you you need a really good mask. You don't want to leave the rearview mirrors behind, have one wheel missing, or include a little bit of background or something. That would look stupid. So you would need something very good. So only having 96% of the pixels correct doesn't sound great. But we won't really know until we look at it. So let's look at it. 
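+
+Before looking at the predictions, a quick aside on the Lambda layer mentioned above. fastai ships its own `Lambda`; this standalone sketch is just to show that the whole idea is a one-liner, and to make the shape bookkeeping explicit (the head's output of batch x 1 x 128 x 128 becomes batch x 128 x 128).
+
+```
+import torch
+import torch.nn as nn
+
+class Lambda(nn.Module):
+    """Wrap an arbitrary function as an nn.Module so it can live inside nn.Sequential."""
+    def __init__(self, fn):
+        super().__init__()
+        self.fn = fn
+    def forward(self, x): return self.fn(x)
+
+# Drop the unit channel axis, as the flatten_channel layer above does.
+flatten_channel = Lambda(lambda x: x[:, 0])
+
+out = torch.zeros(4, 1, 128, 128)      # what the final ConvTranspose2d produces
+print(flatten_channel(out).shape)      # torch.Size([4, 128, 128])
+```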
+
+```
+learn.save('tmp')
+```
+
+```
+learn.load('tmp')
+```
+
+```
+py,ay = learn.predict_with_targs()
+```
+
+```
+ay.shape
+```
+
+```
+(1008, 128, 128)
+```
+
+So here is the correct version, the mask we want to cut out [ [1:41:54](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h41m54s) ]:
+
+```
+show_img(ay[0]);
+```
+
+![](../img/1_ZuW4s04Ubneh3fFjQuQ3rQ.png)
+
+And that's the 96% accurate version. So when you look at it, you realize "oh yeah, getting 96% of the pixels right is actually easy, because all the outside is not car and all the inside is car; the really interesting bit is the edge." So we need to do better.
+
+```
+show_img(py[0]>0);
+```
+
+![](../img/1_cp-SvDXQPGdN6k-8JL2CMw.png)
+
+Let's unfreeze, because all we've done so far is train the custom head. Let's do more.
+
+```
+learn.unfreeze()
+```
+
+```
+learn.bn_freeze(True)
+```
+
+```
+lrs = np.array([lr/100,lr/10,lr])/4
+```
+
+```
+learn.fit(lrs,1,cycle_len=20,use_clr=(20,10))
+```
+
+```
+epoch      trn_loss   val_loss
+    0      0.06577    0.053292   0.972977
+    1      0.049475   0.043025   0.982559
+    2      0.039146   0.035927   0.98337
+    3      0.03405    0.031903   0.986982
+    4      0.029788   0.029065   0.987944
+    5      0.027374   0.027752   0.988029
+    6      0.026041   0.026718   0.988226
+    7      0.024302   0.025927   0.989512
+    8      0.022921   0.026102   0.988276
+    9      0.021944   0.024714   0.989537
+    10     0.021135   0.0241     0.990628
+    11     0.020494   0.023367   0.990652
+    12     0.01988    0.022961   0.990989
+    13     0.019241   0.022498   0.991014
+    14     0.018697   0.022492   0.990571
+    15     0.01812    0.021771   0.99105
+    16     0.017597   0.02183    0.991365
+    17     0.017192   0.021434   0.991364
+    18     0.016768   0.021383   0.991643
+    19     0.016418   0.021114   0.99173
+```
+
+```
+[0.021113895, 0.99172959849238396]
+```
+
+After a bit more training, we've got 99.1%. Is that good? I don't know; let's take a look.
+
+```
+learn.save('0')
+```
+
+```
+x,y = next(iter(md.val_dl))
+py = to_np(learn.model(V(x)))
+```
+
+Actually, no. It has totally missed the rearview mirror on the left and missed a lot of it on the right, and it's clearly got the edge wrong at the bottom. These things are totally going to matter when we try to cut the car out, so it's still not good enough.
+
+```
+ax = show_img(denorm(x)[0])
+show_img(py[0]>0, ax=ax, alpha=0.5);
+```
+
+![](../img/1_b4NbyzWojBS_6peHalw3tQ.png)
+
+```
+ax = show_img(denorm(x)[0])
+show_img(y[0], ax=ax, alpha=0.5);
+```
+
+![](../img/1_nh7F97XxSE1ZOcleTfoPeA.png)
+
+#### 512x512 [ [1:42:50](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h42m50s) ]
+
+Let's try upscaling. The nice thing is that when we go up to 512 by 512 (make sure you decrease the batch size, or you'll run out of memory), there is quite a lot more information for the model to go on, so our accuracy increases to 99.4% and things keep getting better.
+
+```
+TRAIN_DN = 'train'
+MASKS_DN = 'train_masks_png'
+sz = 512
+bs = 16
+```
+
+```
+x_names = np.array([Path(TRAIN_DN)/o for o in masks_csv['img']])
+y_names = np.array([Path(MASKS_DN)/f'{o[:-4]}_mask.png' for o in masks_csv['img']])
+```
+
+```
+((val_x,trn_x),(val_y,trn_y)) = split_by_idx(val_idxs, x_names, y_names)
+len(val_x),len(trn_x)
+```
+
+```
+(1008, 4080)
+```
+
+```
+tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO, tfm_y=TfmType.CLASS, aug_tfms=aug_tfms)
+datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH)
+md = ImageData(PATH, datasets, bs, num_workers=8, classes=None)
+```
+
+```
+denorm = md.trn_ds.denorm
+x,y = next(iter(md.aug_dl))
+x = denorm(x)
+```
+
+Here are the ground truth masks on top of the images.
+ +``` + fig, axes = plt.subplots(4, 4, figsize=(10, 10)) **for** i,ax **in** enumerate(axes.flat): ax=show_img(x[i], ax=ax) show_img(y[i], ax=ax, alpha=0.5) plt.tight_layout(pad=0.1) +``` + +![](../img/1_viBgn7WA9biBQ6BkzSEEnw.png) + +``` + simple_up = nn.Sequential( nn.ReLU(), StdUpsample(512,256), StdUpsample(256,256), StdUpsample(256,256), StdUpsample(256,256), nn.ConvTranspose2d(256, 1, 2, stride=2), flatten_channel ) +``` + +``` + models = ConvnetBuilder(resnet34, 0, 0, 0, custom_head=simple_up) learn = ConvLearner(md, models) learn.opt_fn=optim.Adam learn.crit=nn.BCEWithLogitsLoss() learn.metrics=[accuracy_thresh(0.5)] +``` + +``` + learn.load('0') +``` + +``` + learn.lr_find() learn.sched.plot() +``` + +``` + 85%|████████▌ | 218/255 [02:12<00:22, 1.64it/s, loss=8.91] +``` + +![](../img/1_hjhVP2TyYd8FZMvyGevPgA.png) + +``` + lr=4e-2 +``` + +``` + learn.fit(lr,1,cycle_len=5,use_clr=(20,5)) +``` + +``` + epoch trn_loss val_loss 0 0.02178 0.020653 0.991708 1 0.017927 0.020653 0.990241 2 0.015958 0.016115 0.993394 3 0.015172 0.015143 0.993696 4 0.014315 0.014679 0.99388 +``` + +``` + [0.014679321, 0.99388032489352751] +``` + +``` + learn.save('tmp') +``` + +``` + learn.load('tmp') +``` + +``` + learn.unfreeze() learn.bn_freeze( True ) +``` + +``` + lrs = np.array([lr/100,lr/10,lr])/4 +``` + +``` + learn.fit(lrs,1,cycle_len=8,use_clr=(20,8)) +``` + +``` + epoch trn_loss val_loss mask_acc 0 0.038687 0.018685 0.992782 1 0.024906 0.014355 0.994933 2 0.025055 0.014737 0.995526 3 0.024155 0.014083 0.995708 4 0.013446 0.010564 0.996166 5 0.01607 0.010555 0.996096 6 0.019197 0.010883 0.99621 7 0.016157 0.00998 0.996393 +``` + +``` + [0.0099797687, 0.99639255659920833] +``` + +``` + learn.save('512') +``` + +``` + x,y = next(iter(md.val_dl)) py = to_np(learn.model(V(x))) +``` + +``` + ax = show_img(denorm(x)[0]) show_img(py[0]>0, ax=ax, alpha=0.5); +``` + +![](../img/1__nnK8pvyBueihmtg6JhqPA.png) + +``` + ax = show_img(denorm(x)[0]) show_img(y[0], ax=ax, alpha=0.5); +``` + +![](../img/1_G5zNxkOplWvUIbGlSPs86Q.png) + +Things keep getting better but we've still got quite a few little black blocky bits. so let's go to 1024 by 1024\. + +#### 1024x1024 [ [1:43:17](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h43m17s) ] + +So let's go to 1024 by 1024, batch size down to 4\. This is pretty high res now, and train a bit more, 99.6, 99.8%! 
+ +``` + sz = 1024 bs = 4 +``` + +``` + tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO, tfm_y=TfmType.CLASS, aug_tfms=aug_tfms) datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH) md = ImageData(PATH, datasets, bs, num_workers=8, classes= None ) +``` + +``` + denorm = md.trn_ds.denorm x,y = next(iter(md.aug_dl)) x = denorm(x) y = to_np(y) +``` + +``` + fig, axes = plt.subplots(2, 2, figsize=(8, 8)) **for** i,ax **in** enumerate(axes.flat): show_img(x[i], ax=ax) show_img(y[i], ax=ax, alpha=0.5) plt.tight_layout(pad=0.1) +``` + +![](../img/1_4PrOwKZEYtXv7xdf9rPkhg.png) + +``` + simple_up = nn.Sequential( nn.ReLU(), StdUpsample(512,256), StdUpsample(256,256), StdUpsample(256,256), StdUpsample(256,256), nn.ConvTranspose2d(256, 1, 2, stride=2), flatten_channel, ) +``` + +``` + models = ConvnetBuilder(resnet34, 0, 0, 0, custom_head=simple_up) learn = ConvLearner(md, models) learn.opt_fn=optim.Adam learn.crit=nn.BCEWithLogitsLoss() learn.metrics=[accuracy_thresh(0.5)] +``` + +``` + learn.load('512') +``` + +``` + learn.lr_find() learn.sched.plot() +``` + +``` + 85%|████████▌ | 218/255 [02:12<00:22, 1.64it/s, loss=8.91] +``` + +![](../img/1_hjhVP2TyYd8FZMvyGevPgA.png) + +``` + lr=4e-2 +``` + +``` + learn.fit(lr,1,cycle_len=2,use_clr=(20,4)) +``` + +``` + epoch trn_loss val_loss + 0 0.01066 0.011119 0.996227 + 1 0.009357 0.009696 0.996553 +``` + +``` + [0.0096957013, 0.99655332546385511] +``` + +``` + learn.save('tmp') +``` + +``` + learn.load('tmp') +``` + +``` + learn.unfreeze() learn.bn_freeze( True ) +``` + +``` + lrs = np.array([lr/100,lr/10,lr])/8 +``` + +``` + learn.fit(lrs,1,cycle_len=40,use_clr=(20,10)) +``` + +``` + epoch trn_loss val_loss mask_acc + 0 0.015565 0.007449 0.997661 + 1 0.01979 0.008376 0.997542 + 2 0.014874 0.007826 0.997736 + 3 0.016104 0.007854 0.997347 + 4 0.023386 0.009745 0.997218 + 5 0.018972 0.008453 0.997588 + 6 0.013184 0.007612 0.997588 + 7 0.010686 0.006775 0.997688 + 8 0.0293 0.015299 0.995782 + 9 0.018713 0.00763 0.997638 + 10 0.015432 0.006575 0.9978 + 11 0.110205 0.060062 0.979043 + 12 0.014374 0.007753 0.997451 + 13 0.022286 0.010282 0.997587 + 14 0.015645 0.00739 0.997776 + 15 0.013821 0.00692 0.997869 + 16 0.022389 0.008632 0.997696 + 17 0.014607 0.00677 0.997837 + 18 0.018748 0.008194 0.997657 + 19 0.016447 0.007237 0.997899 + 20 0.023596 0.008211 0.997918 + 21 0.015721 0.00674 0.997848 + 22 0.01572 0.006415 0.998006 + 23 0.019519 0.007591 0.997876 + 24 0.011159 0.005998 0.998053 + 25 0.010291 0.005806 0.998012 + 26 0.010893 0.005755 0.998046 + 27 0.014534 0.006313 0.997901 + 28 0.020971 0.006855 0.998018 + 29 0.014074 0.006107 0.998053 + 30 0.01782 0.006561 0.998114 + 31 0.01742 0.006414 0.997942 + 32 0.016829 0.006514 0.9981 + 33 0.013148 0.005819 0.998033 + 34 0.023495 0.006261 0.997856 + 35 0.010931 0.005516 0.99812 + 36 0.015798 0.006176 0.998126 + 37 0.021636 0.005931 0.998067 + 38 0.012133 0.005496 0.998158 + 39 0.012562 0.005678 0.998172 +``` + +``` + [0.0056782686, 0.99817223208291195] +``` + +``` + learn.save('1024') +``` + +``` + x,y = next(iter(md.val_dl)) py = to_np(learn.model(V(x))) +``` + +``` + ax = show_img(denorm(x)[0]) show_img(py[0][0]>0, ax=ax, alpha=0.5); +``` + +![](../img/1_8kgJWpP6-nxlDfWT8N25_g.png) + +``` + ax = show_img(denorm(x)[0]) show_img(y[0,...,-1], ax=ax, alpha=0.5); +``` + +![](../img/1__-Kx9dC5aTgBSf_aSrBrNQ.png) + +``` + show_img(py[0][0]>0); +``` + +![](../img/1_G3C8DPyOB4BC3VLPGjbwcA.png) + +``` + show_img(y[0,...,-1]); +``` + 
![](../img/1_fJrWvuzyX0cG5ATWvqaDPg.png)

Now if we look at the masks, they are actually looking not bad. That's looking pretty good. So can we do better? The answer is yes, we can.

### U-Net [ [1:43:45](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h43m45s) ]

[Notebook](https://github.com/fastai/fastai/blob/master/courses/dl2/carvana-unet.ipynb) / [Paper](https://arxiv.org/abs/1505.04597)

The U-Net architecture is quite magnificent. With the previous approach, our pre-trained ImageNet network was squished all the way down to 7x7 and then expanded all the way back up to 224x224 (1024 gets squished down to something a fair bit bigger than 7x7). That means the network has to somehow store all the information about the much bigger version inside the small version, even though most of that information was sitting in the original picture anyway. So this squishing and un-squishing doesn't seem like a great approach.

![](../img/1_PvXW__XxRQIMoFoVFJq-Zw.png)

The U-Net idea comes from this fantastic paper, where it was literally invented in the very domain-specific area of biomedical image segmentation. But in fact, basically every Kaggle winner in anything even vaguely related to segmentation has ended up using U-Net. It's one of those things that everybody on Kaggle knows is best practice, yet in more academic circles, even though it has been around for a couple of years at least, a lot of people still don't realize it is by far the best approach.

![](../img/1_9nxe-lIVxXawNsLzItvqcg.png)

Here is the basic idea [ [1:45:10](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h45m10s) ]. On the left is the downward path, where we start at 572x572 (in this case) and halve the grid size 4 times; on the right is the upward path, where we double the grid size 4 times. The extra thing we do is that at every point where we halve the grid size, we copy those activations over to the upward path and concatenate them together.

You can see on the bottom right that the red arrows are max pooling operations, the green arrows are upsampling, and the gray arrows are copying. So we copy and concat. In other words, the input image, after a couple of convs, is copied over to the output and concatenated with it, so the final layers get to use everything that went all the way down and back up, plus a slightly modified version of the input pixels, plus a slightly modified version of the activations one step down from the input pixels, and so on. We have all the richness of going all the way down and up, but also a slightly less coarse version, a version less coarse than that, and finally the really simple version, and they can all be combined together. That's U-Net. It's such a cool idea.

Here we are in the carvana-unet notebook. All this is the same code as before.
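
To make the copy-and-concatenate idea concrete, here is a tiny PyTorch sketch (my own illustration with made-up channel sizes, not code from the notebook): an activation saved on the downward path is concatenated, channel-wise, with an upsampled activation on the way back up.

```
import torch
import torch.nn as nn

# Pretend these came from the downward (encoder) and upward (decoder) paths.
encoder_act = torch.randn(1, 256, 28, 28)   # saved just before a stride-2 downsampling
decoder_act = torch.randn(1, 512, 14, 14)   # current activation on the way back up

upsample = nn.ConvTranspose2d(512, 256, 2, stride=2)   # double the grid size
up = upsample(decoder_act)                  # -> (1, 256, 28, 28)

# The U-Net trick: concatenate the skip connection along the channel dimension.
combined = torch.cat([up, encoder_act], dim=1)
print(combined.shape)                       # torch.Size([1, 512, 28, 28])
```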

```
%matplotlib inline
%reload_ext autoreload
%autoreload 2
```

```
from fastai.conv_learner import *
from fastai.dataset import *
from fastai.models.resnet import vgg_resnet50
import json
```

```
torch.backends.cudnn.benchmark = True
```

### Data

```
PATH = Path('data/carvana')
MASKS_FN = 'train_masks.csv'
META_FN = 'metadata.csv'
masks_csv = pd.read_csv(PATH/MASKS_FN)
meta_csv = pd.read_csv(PATH/META_FN)
```

```
def show_img(im, figsize=None, ax=None, alpha=None):
    if not ax: fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(im, alpha=alpha)
    ax.set_axis_off()
    return ax
```

```
TRAIN_DN = 'train-128'
MASKS_DN = 'train_masks-128'
sz = 128
bs = 64
nw = 16
```

```
TRAIN_DN = 'train'
MASKS_DN = 'train_masks_png'
sz = 128
bs = 64
nw = 16
```

```
class MatchedFilesDataset(FilesDataset):
    def __init__(self, fnames, y, transform, path):
        self.y = y
        assert(len(fnames) == len(y))
        super().__init__(fnames, transform, path)
    def get_y(self, i):
        return open_image(os.path.join(self.path, self.y[i]))
    def get_c(self):
        return 0
```

```
x_names = np.array([Path(TRAIN_DN)/o for o in masks_csv['img']])
y_names = np.array([Path(MASKS_DN)/f'{o[:-4]}_mask.png' for o in masks_csv['img']])
```

```
val_idxs = list(range(1008))
((val_x,trn_x),(val_y,trn_y)) = split_by_idx(val_idxs, x_names, y_names)
```

```
aug_tfms = [RandomRotate(4, tfm_y=TfmType.CLASS),
            RandomFlip(tfm_y=TfmType.CLASS),
            RandomLighting(0.05, 0.05, tfm_y=TfmType.CLASS)]
```

```
tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO, tfm_y=TfmType.CLASS, aug_tfms=aug_tfms)
datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH)
md = ImageData(PATH, datasets, bs, num_workers=16, classes=None)
denorm = md.trn_ds.denorm
```

```
x,y = next(iter(md.trn_dl))
```

```
x.shape,y.shape
```

```
(torch.Size([64, 3, 128, 128]), torch.Size([64, 128, 128]))
```

### Simple upsample

At the start, I've got a simple upsample version just to show you the non-U-Net version again. This time, I'm going to add in something called the Dice metric. Dice is very similar to Jaccard, i.e. intersection over union (IoU); it's basically IoU with a minor tweak. The reason we are going to use Dice is that it's the metric the Kaggle competition used, and it's a little harder to get a high Dice score than a high accuracy because it really looks at the overlap between the correct pixels and your predicted pixels. But it's pretty similar.

In the Kaggle competition, people who were doing okay were getting about 99.6 Dice, and the winners were around 99.7 Dice.

```
f = resnet34
cut,lr_cut = model_meta[f]
```

```
def get_base():
    layers = cut_model(f(True), cut)
    return nn.Sequential(*layers)
```

```
def dice(pred, targs):
    pred = (pred>0).float()   # threshold the logits at 0, i.e. probability 0.5
    return 2. * (pred*targs).sum() / (pred+targs).sum()
```

Here is our standard upsample.

```
class StdUpsample(nn.Module):
    def __init__(self, nin, nout):
        super().__init__()
        self.conv = nn.ConvTranspose2d(nin, nout, 2, stride=2)
        self.bn = nn.BatchNorm2d(nout)
    def forward(self, x):
        return self.bn(F.relu(self.conv(x)))
```

This is all as before.
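
As a quick aside (my own worked example, not from the notebook), here is how Dice relates to IoU on a toy pair of masks. Dice is 2·|A∩B| / (|A| + |B|), which is a monotonic transform of IoU = |A∩B| / |A∪B|, so the two metrics rank models the same way; Dice just comes out a bit higher for partial overlaps.

```
import numpy as np

pred = np.array([1, 1, 0, 0])   # toy predicted mask (already thresholded)
targ = np.array([1, 0, 1, 0])   # toy target mask

inter = (pred * targ).sum()                    # |A ∩ B| = 1
union = ((pred + targ) > 0).sum()              # |A ∪ B| = 3

iou  = inter / union                           # 1/3
dice = 2 * inter / (pred.sum() + targ.sum())   # 2/4 = 0.5

print(iou, dice, 2*iou / (1 + iou))            # the last two are always equal
```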

```
class Upsample34(nn.Module):
    def __init__(self, rn):
        super().__init__()
        self.rn = rn
        self.features = nn.Sequential(
            rn, nn.ReLU(),
            StdUpsample(512,256),
            StdUpsample(256,256),
            StdUpsample(256,256),
            StdUpsample(256,256),
            nn.ConvTranspose2d(256, 1, 2, stride=2))
    def forward(self, x):
        return self.features(x)[:,0]
```

```
class UpsampleModel():
    def __init__(self, model, name='upsample'):
        self.model,self.name = model,name
    def get_layer_groups(self, precompute):
        lgs = list(split_by_idxs(children(self.model.rn), [lr_cut]))
        return lgs + [children(self.model.features)[1:]]
```

```
m_base = get_base()
```

```
m = to_gpu(Upsample34(m_base))
models = UpsampleModel(m)
```

```
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam
learn.crit = nn.BCEWithLogitsLoss()
learn.metrics = [accuracy_thresh(0.5), dice]
```

```
learn.freeze_to(1)
```

```
learn.lr_find()
learn.sched.plot()
```

```
86%|█████████████████████████████████████████████████████████████ | 55/64 [00:22<00:03, 2.46it/s, loss=3.21]
```

![](../img/1_X_dHSL-SZqkKw31hZgughg.png)

```
lr = 4e-2
wd = 1e-7
lrs = np.array([lr/100, lr/10, lr]) / 2
```

```
learn.fit(lr, 1, wds=wd, cycle_len=4, use_clr=(20,8))
```

```
epoch      trn_loss   val_loss   <lambda>   dice
    0      0.216882   0.133512   0.938017   0.855221
    1      0.169544   0.115158   0.946518   0.878381
    2      0.153114   0.099104   0.957748   0.903353
    3      0.144105   0.093337   0.964404   0.915084
```

```
[0.09333742126112893, 0.9644036065964472, 0.9150839788573129]
```

```
learn.save('tmp')
```

```
learn.load('tmp')
```

```
learn.unfreeze()
learn.bn_freeze(True)
```

```
learn.fit(lrs, 1, cycle_len=4, use_clr=(20,8))
```

```
epoch      trn_loss   val_loss   <lambda>   dice
    0      0.174897   0.061603   0.976321   0.94382
    1      0.122911   0.053625   0.982206   0.957624
    2      0.106837   0.046653   0.985577   0.965792
    3      0.099075   0.042291   0.986519   0.968925
```

```
[0.042291240323157536, 0.986519161670927, 0.9689251193924556]
```

Now we can check our Dice metric [ [1:48:00](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h48m) ]. You can see that on the Dice metric we are getting around 96.8 at 128x128, so that's not great.

```
learn.save('128')
```

```
x,y = next(iter(md.val_dl))
py = to_np(learn.model(V(x)))
```

```
show_img(py[0]>0);
```

![](../img/1_w6f-XvZMeLKt4Fc7O_S4EQ.png)

```
show_img(y[0]);
```

![](../img/1_SHntdwiyRvupP9SQO5BD5g.png)

#### U-net (ish) [ [1:48:16](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h48m16s) ]

So let's try U-Net. I'm calling it U-Net(ish) because, as per usual, I'm creating my own somewhat hacky version, trying to keep things as similar as possible to what you're used to and doing things that I think make sense. So there should be plenty of opportunity for you to make this more authentically U-Net: if you look at the exact grid sizes in the paper, you'll see that in the top-left convs the size goes down a little bit, so they are clearly not adding any padding, and then there is some cropping going on. There are a few differences like that. But one of the things is that, because I want to take advantage of transfer learning, I can't quite use the original U-Net.

Here is another big opportunity: what if you create the U-Net down path, add a classifier on the end, and train that on ImageNet? You would then have an ImageNet-trained classifier which is specifically designed to be a good backbone for U-Net. You should then be able to come back and get pretty close to winning this competition (which is actually fairly recent).
Because that pre-trained network didn't exist before. But if you think about what YOLO v3 did, it's basically that: they created DarkNet, pre-trained it on ImageNet, and then used it as the basis for their bounding boxes. So again, this idea of pre-training things which are designed not just for classification but for other tasks is something that nobody has really done yet. But as we've shown, you can now train ImageNet for $25 in three hours. If people in the community are interested in doing this, hopefully I'll have credits I can help you with as well; if you do the work to get it set up and give me a script, I can probably run it for you. For now though, we don't have that yet, so we are going to use ResNet.

```
class SaveFeatures():
    features = None
    def __init__(self, m):
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output):
        self.features = output
    def remove(self):
        self.hook.remove()
```

So we are basically going to start with `get_base` [ [1:50:37](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h50m37s) ]. Base is our base network, and that was defined back up in the first section.

![](../img/1_BDJmGsOK8kX9gHUyiQ3Xgw.png)

`get_base` is going to call whatever `f` is, and `f` is `resnet34`. So we grab our ResNet34, and `cut_model` is the first thing that our convnet builder does: it removes everything from the adaptive pooling onwards, which gives us back the backbone of ResNet34. So `get_base` returns the ResNet34 backbone.

```
class UnetBlock(nn.Module):
    def __init__(self, up_in, x_in, n_out):
        super().__init__()
        up_out = x_out = n_out//2
        self.x_conv = nn.Conv2d(x_in, x_out, 1)
        self.tr_conv = nn.ConvTranspose2d(up_in, up_out, 2, stride=2)
        self.bn = nn.BatchNorm2d(n_out)

    def forward(self, up_p, x_p):
        up_p = self.tr_conv(up_p)   # upsample the path coming up from below
        x_p = self.x_conv(x_p)      # 1x1 conv on the activations coming across
        cat_p = torch.cat([up_p, x_p], dim=1)
        return self.bn(F.relu(cat_p))
```

```
class Unet34(nn.Module):
    def __init__(self, rn):
        super().__init__()
        self.rn = rn
        # save the activations at the layers before each stride-2 convolution
        self.sfs = [SaveFeatures(rn[i]) for i in [2,4,5,6]]
        self.up1 = UnetBlock(512,256,256)
        self.up2 = UnetBlock(256,128,256)
        self.up3 = UnetBlock(256,64,256)
        self.up4 = UnetBlock(256,64,256)
        self.up5 = nn.ConvTranspose2d(256, 1, 2, stride=2)

    def forward(self, x):
        x = F.relu(self.rn(x))
        x = self.up1(x, self.sfs[3].features)
        x = self.up2(x, self.sfs[2].features)
        x = self.up3(x, self.sfs[1].features)
        x = self.up4(x, self.sfs[0].features)
        x = self.up5(x)
        return x[:,0]

    def close(self):
        for sf in self.sfs: sf.remove()
```

```
class UnetModel():
    def __init__(self, model, name='unet'):
        self.model,self.name = model,name
    def get_layer_groups(self, precompute):
        lgs = list(split_by_idxs(children(self.model.rn), [lr_cut]))
        return lgs + [children(self.model)[1:]]
```

Then we are going to take that ResNet34 backbone and turn it into what I call a Unet34 [ [1:51:17](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h51m17s) ]. It saves the ResNet we passed in, and then we use a forward hook, just like before, to save the results at the 2nd, 4th, 5th, and 6th blocks, which, as before, are the layers before each stride-2 convolution. Then we create a bunch of these things we are calling `UnetBlock` . We need to tell `UnetBlock` how many things are coming from the previous layer we are upsampling, how many are coming across, and how many we want to come out.
The amount coming across is entirely defined by whatever the base network was: however the downward path was built, we need that many channels. So this is a little bit awkward. Actually, one of our master's students here, Kerem, has created something called DynamicUnet, which you'll find in [fastai.model.DynamicUnet](https://github.com/fastai/fastai/blob/d3ef60a96cddf5b503361ed4c95d68dda4a873fc/fastai/models/unet.py); it calculates all of this for you and automatically creates the whole U-Net from your base model. It still has some minor quirks that I want to fix; by the time the video is out it will definitely be working, and I will at least have a notebook showing how to use it, and possibly an additional video. But for now, you'll just have to go through and do it yourself. You can easily see the numbers: once you've got a ResNet, just type in its name and it will print out the layers, so you can see how many activations there are in each block, or you can have it printed out for each block automatically. Anyway, I just did this manually.

![](../img/1_uJ4edTfPiSXfCXkFQ82Svg.png)

So the UnetBlock works like this [ [1:53:29](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h53m29s) ]:

* `up_in` : how many channels are coming up from the previous layer
* `x_in` : how many are coming across (hence `x` ) from the downward path
* `n_out` : how many we want coming out

What I do is say: we're going to create a certain number of channels from the upward path and a certain number from the cross path, and since I'm going to concatenate them together, let's divide the number we want out by 2. So the cross convolution takes the cross path and creates `n_out//2` channels, and the upward path goes through a `ConvTranspose2d` (because we want to upsample) which also outputs `n_out//2` channels ( `up_out` ). At the end, I just concatenate those together.

So I've got an upsample and a cross convolution, and I concatenate the two together. That's all a UnetBlock is, so it's actually a pretty easy module to create.

![](../img/1_cXPJlacjby171FsaalyHcQ.png)

Then in my forward path, I need to pass the upward path and the cross path to the forward of the UnetBlock [ [1:54:40](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h54m40s) ]. The upward path is just whatever I am up to so far, while the cross path is whatever activations I stored on the way down. As I come up, it's the last set of saved features that I need first, and as I gradually keep going up further and further, eventually it's the first set of features.

There are some more tricks we can do to make this a little bit better, but this is good stuff. So the simple upsampling approach looked horrible and had a dice of .968.
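
Before training the U-Net version, here is a quick sanity check of the shapes (my own sketch with dummy tensors, not code from the notebook) flowing through the first `UnetBlock(512, 256, 256)` defined above: the path coming up from below gets its grid size doubled, the saved activations come across through a 1x1 conv, and the two halves are concatenated to give `n_out` channels.

```
import torch

# Assumes the UnetBlock class defined above is in scope.
block = UnetBlock(512, 256, 256)

up_p = torch.randn(2, 512, 4, 4)   # what we are up to so far (e.g. ResNet output at 4x4)
x_p  = torch.randn(2, 256, 8, 8)   # saved activations from the downward path at 8x8

out = block(up_p, x_p)
print(out.shape)                   # torch.Size([2, 256, 8, 8]): doubled grid, n_out channels
```
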
A Unet with everything else identical except we've now got these UnetBlocks has a dice of … + +``` + m_base = get_base() m = to_gpu(Unet34(m_base)) models = UnetModel(m) +``` + +``` + learn = ConvLearner(md, models) learn.opt_fn=optim.Adam learn.crit=nn.BCEWithLogitsLoss() learn.metrics=[accuracy_thresh(0.5),dice] +``` + +``` + learn.summary() +``` + +``` + OrderedDict([('Conv2d-1', OrderedDict([('input_shape', [-1, 3, 128, 128]), ('output_shape', [-1, 64, 64, 64]), ('trainable', False), ('nb_params', 9408)])), ('BatchNorm2d-2', OrderedDict([('input_shape', [-1, 64, 64, 64]), ('output_shape', [-1, 64, 64, 64]), ('trainable', False), ('nb_params', 128)])), ('ReLU-3', OrderedDict([('input_shape', [-1, 64, 64, 64]), ('output_shape', [-1, 64, 64, 64]), ('nb_params', 0)])), ('MaxPool2d-4', OrderedDict([('input_shape', [-1, 64, 64, 64]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('Conv2d-5', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-6', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 128)])), ('ReLU-7', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('Conv2d-8', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-9', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 128)])), ('ReLU-10', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('BasicBlock-11', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('Conv2d-12', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-13', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 128)])), ('ReLU-14', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('Conv2d-15', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-16', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 128)])), ('ReLU-17', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('BasicBlock-18', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('Conv2d-19', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-20', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 128)])), ('ReLU-21', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('Conv2d-22', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-23', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('trainable', False), ('nb_params', 128)])), ('ReLU-24', OrderedDict([('input_shape', [-1, 64, 32, 32]), 
('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('BasicBlock-25', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 64, 32, 32]), ('nb_params', 0)])), ('Conv2d-26', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 73728)])), ('BatchNorm2d-27', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-28', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-29', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-30', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('Conv2d-31', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 8192)])), ('BatchNorm2d-32', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-33', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('BasicBlock-34', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-35', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-36', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-37', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-38', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-39', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-40', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('BasicBlock-41', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-42', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-43', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-44', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-45', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-46', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-47', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('BasicBlock-48', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-49', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-50', OrderedDict([('input_shape', [-1, 128, 
16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-51', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-52', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-53', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', False), ('nb_params', 256)])), ('ReLU-54', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('BasicBlock-55', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('nb_params', 0)])), ('Conv2d-56', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 294912)])), ('BatchNorm2d-57', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-58', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-59', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-60', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('Conv2d-61', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 32768)])), ('BatchNorm2d-62', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-63', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('BasicBlock-64', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-65', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-66', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-67', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-68', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-69', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-70', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('BasicBlock-71', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-72', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-73', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-74', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-75', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-76', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 
8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-77', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('BasicBlock-78', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-79', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-80', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-81', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-82', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-83', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-84', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('BasicBlock-85', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-86', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-87', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-88', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-89', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-90', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-91', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('BasicBlock-92', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-93', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-94', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-95', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-96', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-97', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', False), ('nb_params', 512)])), ('ReLU-98', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('BasicBlock-99', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('Conv2d-100', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1179648)])), ('BatchNorm2d-101', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-102', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('Conv2d-103', OrderedDict([('input_shape', [-1, 512, 4, 4]), 
('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-104', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1024)])), ('Conv2d-105', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 131072)])), ('BatchNorm2d-106', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-107', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('BasicBlock-108', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('Conv2d-109', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-110', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-111', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('Conv2d-112', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-113', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-114', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('BasicBlock-115', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('Conv2d-116', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-117', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-118', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('Conv2d-119', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-120', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-121', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('BasicBlock-122', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 512, 4, 4]), ('nb_params', 0)])), ('ConvTranspose2d-123', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 128, 8, 8]), ('trainable', True), ('nb_params', 262272)])), ('Conv2d-124', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 128, 8, 8]), ('trainable', True), ('nb_params', 32896)])), ('BatchNorm2d-125', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 8, 8]), ('trainable', True), ('nb_params', 512)])), ('UnetBlock-126', OrderedDict([('input_shape', [-1, 512, 4, 4]), ('output_shape', [-1, 256, 8, 8]), ('nb_params', 0)])), ('ConvTranspose2d-127', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 128, 16, 16]), ('trainable', True), ('nb_params', 131200)])), ('Conv2d-128', OrderedDict([('input_shape', [-1, 128, 16, 16]), ('output_shape', [-1, 128, 16, 16]), ('trainable', True), ('nb_params', 16512)])), ('BatchNorm2d-129', 
OrderedDict([('input_shape', [-1, 256, 16, 16]), ('output_shape', [-1, 256, 16, 16]), ('trainable', True), ('nb_params', 512)])), ('UnetBlock-130', OrderedDict([('input_shape', [-1, 256, 8, 8]), ('output_shape', [-1, 256, 16, 16]), ('nb_params', 0)])), ('ConvTranspose2d-131', OrderedDict([('input_shape', [-1, 256, 16, 16]), ('output_shape', [-1, 128, 32, 32]), ('trainable', True), ('nb_params', 131200)])), ('Conv2d-132', OrderedDict([('input_shape', [-1, 64, 32, 32]), ('output_shape', [-1, 128, 32, 32]), ('trainable', True), ('nb_params', 8320)])), ('BatchNorm2d-133', OrderedDict([('input_shape', [-1, 256, 32, 32]), ('output_shape', [-1, 256, 32, 32]), ('trainable', True), ('nb_params', 512)])), ('UnetBlock-134', OrderedDict([('input_shape', [-1, 256, 16, 16]), ('output_shape', [-1, 256, 32, 32]), ('nb_params', 0)])), ('ConvTranspose2d-135', OrderedDict([('input_shape', [-1, 256, 32, 32]), ('output_shape', [-1, 128, 64, 64]), ('trainable', True), ('nb_params', 131200)])), ('Conv2d-136', OrderedDict([('input_shape', [-1, 64, 64, 64]), ('output_shape', [-1, 128, 64, 64]), ('trainable', True), ('nb_params', 8320)])), ('BatchNorm2d-137', OrderedDict([('input_shape', [-1, 256, 64, 64]), ('output_shape', [-1, 256, 64, 64]), ('trainable', True), ('nb_params', 512)])), ('UnetBlock-138', OrderedDict([('input_shape', [-1, 256, 32, 32]), ('output_shape', [-1, 256, 64, 64]), ('nb_params', 0)])), ('ConvTranspose2d-139', OrderedDict([('input_shape', [-1, 256, 64, 64]), ('output_shape', [-1, 1, 128, 128]), ('trainable', True), ('nb_params', 1025)]))]) +``` + +``` + [o.features.size() for o in m.sfs] +``` + +``` + [torch.Size([3, 64, 64, 64]), + torch.Size([3, 64, 32, 32]), + torch.Size([3, 128, 16, 16]), + torch.Size([3, 256, 8, 8])] +``` + +``` + learn.freeze_to(1) +``` + +``` + learn.lr_find() learn.sched.plot() +``` + +``` + 0%| | 0/64 [00:00 dice + 0 0.12936 0.03934 0.988571 0.971385 + 1 0.098401 0.039252 0.990438 0.974921 + 2 0.087789 0.02539 0.990961 0.978927 + 3 0.082625 0.027984 0.988483 0.975948 + 4 0.079509 0.025003 0.99171 0.981221 + 5 0.076984 0.022514 0.992462 0.981881 + 6 0.076822 0.023203 0.992484 0.982321 + 7 0.075488 0.021956 0.992327 0.982704 +``` + +``` + [0.021955982234979434, 0.9923273126284281, 0.9827044502137199] +``` + +``` + learn.save('128urn-tmp') +``` + +``` + learn.load('128urn-tmp') +``` + +``` + learn.unfreeze() learn.bn_freeze( True ) +``` + +``` + learn.fit(lrs/4, 1, wds=wd, cycle_len=20,use_clr=(20,10)) +``` + +``` + 0%| | 0/64 [00:00 dice 0 0.073786 0.023418 0.99297 0.98283 1 0.073561 0.020853 0.992142 0.982725 2 0.075227 0.023357 0.991076 0.980879 3 0.074245 0.02352 0.993108 0.983659 4 0.073434 0.021508 0.993024 0.983609 5 0.073092 0.020956 0.993188 0.983333 6 0.073617 0.019666 0.993035 0.984102 7 0.072786 0.019844 0.993196 0.98435 8 0.072256 0.018479 0.993282 0.984277 9 0.072052 0.019479 0.993164 0.984147 10 0.071361 0.019402 0.993344 0.984541 11 0.070969 0.018904 0.993139 0.984499 12 0.071588 0.018027 0.9935 0.984543 13 0.070709 0.018345 0.993491 0.98489 14 0.072238 0.019096 0.993594 0.984825 15 0.071407 0.018967 0.993446 0.984919 16 0.071047 0.01966 0.993366 0.984952 17 0.072024 0.018133 0.993505 0.98497 18 0.071517 0.018464 0.993602 0.985192 19 0.070109 0.018337 0.993614 0.9852 +``` + +``` + [0.018336569653853538, 0.9936137114252362, 0.9852004420189631] +``` + +.985! That's like we halved the error with everything else exactly the same [ [1:55:42](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h55m42s) ]. And more the point, you can look at it. 

```
learn.save('128urn-0')
```

```
learn.load('128urn-0')
```

```
x,y = next(iter(md.val_dl))
py = to_np(learn.model(V(x)))
```

This is actually looking somewhat car-like, compared to our non-U-Net equivalent which was just a blob, because trying to do this through the down-and-up path alone is just asking too much. Whereas, when we actually provide the downward-path pixels at every point, it can start to create something car-ish.

```
show_img(py[0]>0);
```

![](../img/1_AuMRaTQP4gCUW0iHHvf2uQ.png)

```
show_img(y[0]);
```

![](../img/1_SHntdwiyRvupP9SQO5BD5g.png)

At the end of that, we'll call `m.close()` to remove those `sfs.features` that are taking up GPU memory.

```
m.close()
```

#### 512x512 [ [1:56:26](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h56m26s) ]

Go to a smaller batch size and a larger image size.

```
sz = 512
bs = 16
```

```
tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO, tfm_y=TfmType.CLASS, aug_tfms=aug_tfms)
datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH)
md = ImageData(PATH, datasets, bs, num_workers=4, classes=None)
denorm = md.trn_ds.denorm
```

```
m_base = get_base()
m = to_gpu(Unet34(m_base))
models = UnetModel(m)
```

```
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam
learn.crit = nn.BCEWithLogitsLoss()
learn.metrics = [accuracy_thresh(0.5), dice]
```

```
learn.freeze_to(1)
```

```
learn.load('128urn-0')
```

```
learn.fit(lr, 1, wds=wd, cycle_len=5, use_clr=(5,5))
```

```
epoch      trn_loss   val_loss   <lambda>   dice
    0      0.071421   0.02362    0.996459   0.991772
    1      0.070373   0.014013   0.996558   0.992602
    2      0.067895   0.011482   0.996705   0.992883
    3      0.070653   0.014256   0.996695   0.992771
    4      0.068621   0.013195   0.996993   0.993359
```

```
[0.013194938530288046, 0.996993034604996, 0.993358936574724]
```

You can see the dice coefficient really going up [ [1:56:30](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h56m30s) ]. Notice that above, I'm loading in the 128x128 version of the network; we are doing the progressive resizing trick again, and that gets us to .993.

```
learn.save('512urn-tmp')
```

```
learn.unfreeze()
learn.bn_freeze(True)
```

```
learn.load('512urn-tmp')
```

```
learn.fit(lrs/4, 1, wds=wd, cycle_len=8, use_clr=(20,8))
```

```
epoch      trn_loss   val_loss   <lambda>   dice
    0      0.06605    0.013602   0.997      0.993014
    1      0.066885   0.011252   0.997248   0.993563
    2      0.065796   0.009802   0.997223   0.993817
    3      0.065089   0.009668   0.997296   0.993744
    4      0.064552   0.011683   0.997269   0.993835
    5      0.065089   0.010553   0.997415   0.993827
    6      0.064303   0.009472   0.997431   0.994046
    7      0.062506   0.009623   0.997441   0.994118
```

```
[0.009623114736602894, 0.9974409020136273, 0.9941179137381296]
```

Then unfreezing gets us to .994.

```
learn.save('512urn')
```

```
learn.load('512urn')
```

```
x,y = next(iter(md.val_dl))
py = to_np(learn.model(V(x)))
```

And you can see, it's now looking pretty good.

```
show_img(py[0]>0);
```

![](../img/1_lW-LsQorUM1UUwRDJiiMKg.png)

```
show_img(y[0]);
```

![](../img/1_EdCvr3nZIJf6mhwgQActnQ.png)

```
m.close()
```

#### 1024x1024 [ [1:56:53](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h56m53s) ]

Go down to a batch size of 4 and a size of 1024.
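
The batch size keeps dropping as the images grow because activation memory scales roughly with the number of pixels per image. Here is a rough back-of-the-envelope check (my own sketch, not from the notebook), using the batch sizes actually used at each resolution in this notebook:

```
# Per-image activation memory grows roughly with the pixel count,
# so the batch size is reduced to keep the whole batch on the GPU.
for sz, bs in [(128, 64), (512, 16), (1024, 4)]:
    rel_per_image = (sz / 128) ** 2          # 1x, 16x, 64x the pixels of a 128x128 image
    rel_per_batch = rel_per_image * bs / 64  # relative to the (sz=128, bs=64) baseline
    print(f'sz={sz:4d}  bs={bs:2d}  per-image ~{rel_per_image:4.0f}x  per-batch ~{rel_per_batch:.0f}x')
```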
+ +``` + sz=1024 bs=4 +``` + +``` + tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO, tfm_y=TfmType.CLASS) datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x,trn_y), (val_x,val_y), tfms, path=PATH) md = ImageData(PATH, datasets, bs, num_workers=16, classes= None ) denorm = md.trn_ds.denorm +``` + +``` + m_base = get_base() m = to_gpu(Unet34(m_base)) models = UnetModel(m) +``` + +``` + learn = ConvLearner(md, models) learn.opt_fn=optim.Adam learn.crit=nn.BCEWithLogitsLoss() learn.metrics=[accuracy_thresh(0.5),dice] +``` + +Load in what we just saved with the 512\. + +``` + learn.load('512urn') +``` + +``` + learn.freeze_to(1) +``` + +``` + learn.fit(lr,1, wds=wd, cycle_len=2,use_clr=(5,4)) +``` + +``` + epoch trn_loss val_loss dice 0 0.007656 0.008155 0.997247 0.99353 1 0.004706 0.00509 0.998039 0.995437 +``` + +``` + [0.005090427414942828, 0.9980387706605215, 0.995437301104031] +``` + +That gets us to .995\. + +``` + learn.save('1024urn-tmp') +``` + +``` + learn.load('1024urn-tmp') +``` + +``` + learn.unfreeze() learn.bn_freeze( True ) +``` + +``` + lrs = np.array([lr/200,lr/30,lr]) +``` + +``` + learn.fit(lrs/10,1, wds=wd,cycle_len=4,use_clr=(20,8)) +``` + +``` + epoch trn_loss val_loss dice 0 0.005688 0.006135 0.997616 0.994616 1 0.004412 0.005223 0.997983 0.995349 2 0.004186 0.004975 0.99806 0.99554 3 0.004016 0.004899 0.99812 0.995627 +``` + +``` + [0.004898778487196458, 0.9981196409180051, 0.9956271404784823] +``` + +``` + learn.fit(lrs/10,1, wds=wd,cycle_len=4,use_clr=(20,8)) +``` + +``` + epoch trn_loss val_loss dice 0 0.004169 0.004962 0.998049 0.995517 1 0.004022 0.004595 0.99823 0.995818 2 0.003772 0.004497 0.998215 0.995916 3 0.003618 0.004435 0.998291 0.995991 +``` + +``` + [0.004434524739663753, 0.9982911745707194, 0.9959913929776539] +``` + +Unfreeze takes us to… we'll call that .996\. + +``` + learn.sched.plot_loss() +``` + +![](../img/1_b7J9qrMQ0OxTQgj0QebeFw.png) + +``` + learn.save('1024urn') +``` + +``` + learn.load('1024urn') +``` + +``` + x,y = next(iter(md.val_dl)) py = to_np(learn.model(V(x))) +``` + +As you can see, that actually looks good [ [1:57:17](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h57m17s) ]. In accuracy terms, 99.82%. You can see this is looking like something you could just about use to cut out. I think, at this point, there's a couple of minor tweaks we can do to get up to .997 but really the key thing then, I think, is just maybe to do a few bit of smoothing maybe or a little bit of post-processing. You can go and have a look at the Carvana winners' blogs and see some of these tricks, but as I say, the difference between where we are at .996 and what the winners got of .997, it's not heaps. So really that just the Unet on its own pretty much solves that problem. + +``` + show_img(py[0]>0); +``` + +![](../img/1_A6ghUxP4m0OMKyUZWnv3xQ.png) + +``` + show_img(y[0]); +``` + +![](../img/1_1eNTc9dNtmuxTryHf1XGpA.png) + +### Back to Bounding Box [ [1:58:15](https://youtu.be/nG3tT31nPmQ%3Ft%3D1h58m15s) ] + +Okay, so that's it. The last thing I wanted to mention is now to come all the way back to bounding boxes because you might remember, I said our bounding box model was still not doing very well on small objects. So hopefully you might be able to guess where I'm going to go with this which is that for the bounding box model, remember how we had at different grid cells we spat out outputs of the model. And it was those earlier ones with the small grid sizes that weren't very good. How do we fix it? U-Net it! 
Let's have an upward path with cross connections. So we are just going to do a U-Net and spit the outputs out of that, because now those finer grid cells have all of the information from every level of the downward path to leverage. Of course, this is deep learning, so that means you can't write a paper saying "we just used U-Net for bounding boxes"; you have to invent a new term, so this is called a feature pyramid network, or FPN. It was used in the RetinaNet paper and was created in an earlier paper specifically about FPNs. If memory serves correctly, they did briefly cite the U-Net paper, but they made it sound like some vaguely, slightly connected thing that maybe some people could consider slightly useful. But really, FPNs are U-Nets.

I don't have an implementation of it to show you, but it will be a fun thing for some of us to try, and I know some of the students have been trying to get it working well on the forums. So yes, it's an interesting thing to try. A couple of things to look at after this class, as well as the other things I mentioned, would be playing around with FPNs and also maybe trying Kerem's DynamicUnet. They would both be interesting things to look at.

So you have all been through 14 lessons of me talking at you now, and I'm sorry about that. Thanks for putting up with me. I think you're going to find it hard to find people who know as much about training neural networks in practice as you do. It will be really easy for you to overestimate how capable all these other people are and underestimate how capable you are. So the main thing I'd say is: please practice. Now that you don't have this constant thing getting you to come back here every Monday night, it's very easy to lose that momentum. So find ways to keep it: organize a study group or a book reading group, get together with some friends and work on a project, or do something more than just deciding "I want to keep working on X". Unless you are the kind of person who is super motivated and whenever you decide to do something, it happens. That's not me. I know that for something to happen, I have to say "yes, David, in October I will absolutely teach that course", and then it's like, okay, I'd better actually write some material. That's the only way I can get stuff to happen. We've got a great community there on the forums. If people have ideas for ways to make it better, please tell me. If you think you can help, if you want to create some new forum or moderate things in some different way or whatever, just let me know. You can always PM me, and there are a lot of projects going on through GitHub as well, lots of stuff. So I hope to see you all back here for something else, and thanks so much for joining me on this journey.
diff --git a/zh/dl2.md b/zh/dl2.md new file mode 100644 index 0000000000000000000000000000000000000000..ef30beedd3306347c7a91d4830e57f150d61656b --- /dev/null +++ b/zh/dl2.md @@ -0,0 +1,534 @@ +# 深度学习2:第1部分第2课 + +### [第2课](http://forums.fast.ai/t/wiki-lesson-2/9399/1) + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1.ipynb) + +#### 回顾上一课[ [01:02](https://youtu.be/JNxcznsrRb8%3Ft%3D1m2s) ] + +* 我们使用3行代码来构建图像分类器。 +* 为了训练模型,需要在`PATH`下以某种方式组织`data/dogscats/` (在本例中为`data/dogscats/` ): + +![](../img/1_DdsnEeT2DrnAAp_NrHc-jA.png) + +* 应该有`train`文件夹和`valid`文件夹,并且在每个文件夹下都有带有分类标签的文件夹(例如本例中的`cats` ),其中包含相应的图像。 +* 培训输出: _[_ `_epoch #_` _,_ `_training loss_` _,_ `_validation loss_` _,_ `_accuracy_` _]_ + +``` + _[ 0\. 0.04955 0.02605 0.98975]_ +``` + +#### **学习率[** [**4:54**](https://youtu.be/JNxcznsrRb8%3Ft%3D4m54s) **]** + +* 学习率的基本思想是,它将决定我们对解决方案的缩放/磨练速度。 + +![](../img/1_bl1EuPH_XEGvMcMW6ZloNg.jpeg) + +* 如果学习率太小,则需要很长时间才能达到最低点 +* 如果学习率太大,它可能会从底部摆动。 +* 学习速率查找器( `learn.lr_find` )将在每个小批量之后提高学习率。 最终,学习率太高,损失会变得更糟。 然后,我们查看学习率与损失的关系曲线,并确定最低点并返回一个幅度并选择它作为学习率(下例中为`1e-2` )。 +* 迷你批处理是我们每次查看的一组图像,因此我们有效地使用GPU的并行处理能力(通常一次64或128个图像) +* 在Python中: + +![](../img/1_3ZW61inLJykJLs0FntGrqA.png) + +![](../img/1_GgOPv2YCx3QOUpCwolFCyA.png) + +![](../img/1_5EdBB9JTXXf-5ccqzDr5Kg.png) + +* 通过调整这一个数字,您应该能够获得相当不错的结果。 fast.ai库为您选择其余的超参数。 但随着课程的进展,我们将了解到还有一些我们可以调整的东西可以获得更好的结果。 但学习率是我们设定的关键数字。 +* 学习速率查找器位于其他优化器(例如动量,亚当等)之上,并根据您正在使用的其他调整(例如高级优化器但不限于优化器)帮助您选择最佳学习速率。 +* 问题:在时代期间改变学习率的优化者会发生什么? 这个发现者是否选择了初始学习率?[ [14:05](https://youtu.be/JNxcznsrRb8%3Ft%3D14m5s) ]我们稍后会详细了解优化器,但基本答案是否定的。 甚至亚当的学习率除以平均先前的梯度以及最近的平方梯度之和。 即使那些所谓的“动态学习率”方法也具有学习率。 +* 使模型更好的最重要的事情是为它提供更多数据。 由于这些模型有数百万个参数,如果你训练它们一段时间,它们就会开始做所谓的“过度拟合”。 +* 过度拟合 - 模型开始在训练集中看到图像的具体细节,而不是学习可以传递到验证集的一般内容。 +* 我们可以收集更多数据,但另一种简单方法是数据增加。 + +#### 数据扩充[ [15:50](https://youtu.be/JNxcznsrRb8%3Ft%3D15m50s) ] + +![](../img/1_7LgHDHSM9jgRLUX6_vYYnA.png) + +* 每个时代,我们都会随机改变图像。 换句话说,该模型将在每个时期看到略有不同的图像版本。 +* 您希望对不同类型的图像使用不同类型的数据增强(水平翻转,垂直翻转,放大,缩小,改变对比度和亮度等等)。 + +#### 学习率查询器问题[ [19:11](https://youtu.be/JNxcznsrRb8%3Ft%3D19m11s) ]: + +* 为什么不选择底部? 损失最低的点是红色圆圈所在的位置。 但是那个学习率实际上太大了,不太可能收敛。 那么之前的那个将是一个更好的选择(选择一个小于太大的学习率总是更好) + +![](../img/1_vsXrd010HEYLVfoe2F-ZiQ.png) + +* 我们什么时候应该学习`lr_find` ? [ [23:02](https://youtu.be/JNxcznsrRb8%3Ft%3D23m2s) ]在开始时运行一次,也许在解冻图层后(我们稍后会学习)。 当我改变我正在训练的东西或改变我训练它的方式时。 运行它永远不会有任何伤害。 + +#### 返回数据扩充[ [24:10](https://youtu.be/JNxcznsrRb8%3Ft%3D24m10s) ] + +``` + tfms = tfms_from_model(resnet34, sz, **aug_tfms=transforms_side_on** , max_zoom=1.1) +``` + +* `transform_side_on` - 侧面照片的预定义变换集(还有`transform_top_down` )。 稍后我们将学习如何创建自定义转换列表。 +* 它不是完全创建新数据,而是允许卷积神经网络学习如何从不同的角度识别猫或狗。 + +``` + data = ImageClassifierData.from_paths(PATH, tfms= **tfms** ) learn = ConvLearner.pretrained(arch, data, precompute=True) +``` + +``` + learn.fit(1e-2, 1) +``` + +* 现在我们创建了一个包含扩充的新`data`对象。 最初,由于`precompute=True` ,增强实际上什么都不做。 +* 卷积神经网络将这些东西称为“激活”。激活是一个数字,表示“这个特征在这个地方具有这种置信水平(概率)”。 我们正在使用已经学会识别特征的预训练网络(即我们不想改变它所学习的超参数),所以我们可以做的是预先计算隐藏层的激活并只训练最后的线性部分。 + +![](../img/1_JxE9HYahNpcbW9mImEJRPA.png) + +* 这就是为什么当你第一次训练你的模型时,它需要更长的时间 - 它预先计算这些激活。 +* 即使我们每次尝试显示不同版本的猫,我们已经预先计算了特定版本猫的激活(即我们没有重新计算更改版本的激活)。 +* 要使用数据扩充,我们必须做`learn.precompute=False` : + +``` + learn.precompute=False +``` + +``` + learn.fit(1e-2, 3, **cycle_len=1** ) +``` + +``` + _[ 0\. 0.03597 0.01879 0.99365]_ _[ 1\. 0.02605 0.01836 0.99365]_ _[ 2\. 
0.02189 0.0196 0.99316]_ +``` + +* 坏消息是准确性没有提高。 培训损失正在减少,但验证损失不是,但我们并没有过度拟合。 当训练损失远低于验证损失时,过度拟合。 换句话说,当您的模型在训练集上做得比在验证集上做得好得多时,这意味着您的模型不是一般化的。 +* `cycle_len=1` [ [30:17](https://youtu.be/JNxcznsrRb8%3Ft%3D30m17s) ]:这样可以**通过重启(SGDR)**实现**随机梯度下降** 。 基本的想法是,当你以最小的损失越来越接近现场时,你可能想要开始降低学习率(采取较小的步骤),以便到达正确的位置。 +* 在训练时降低学习率的想法称为**学习率退火** ,这是非常常见的。 最常见和“hacky”的方法是在一段时间内训练具有一定学习速度的模型,当它停止改进时,手动降低学习速率(逐步退火)。 +* 一个更好的方法就是选择某种功能形式 - 结果是真正好的功能形式是cosign曲线的一半,它在开始时保持一段时间的高学习率,然后当你靠近时迅速下降。 + +![](../img/1_xmIlOee7PWLc6fa7xdfRkA.png) + +* 然而,我们可能会发现自己处于重量空间的一部分,这个重量空间不是很有弹性 - 也就是说,重量的微小变化可能会导致损失发生重大变化。 我们希望鼓励我们的模型找到既精确又稳定的重量空间部分。 因此,我们不时会增加学习率(这是'SGDR'中的'重新启动'),如果当前区域是“尖锐的”,这将迫使模型跳转到权重空间的不同部分。 这是一张如果我们将学习率重置3次(在[本文中](https://arxiv.org/abs/1704.00109)他们称之为“循环LR计划”)的情况下的图片: + +![](../img/1_TgAz1qaKu_SzuRmsO-6WGQ.png) + +* 重置学习速率之间的时期数由`cycle_len`设置,并且这种情况发生的次数称为_循环次数_ ,并且是我们实际传递的第二个参数`fit()` 。 所以这就是我们的实际学习率: + +![](../img/1_OKmsY6RR0DirLaLU2cIXtQ.png) + +* 问题:我们可以通过使用随机起点获得相同的效果吗? [ [35:40](https://youtu.be/JNxcznsrRb8%3Ft%3D35m40s) ]在SGDR创建之前,人们习惯于创建“合奏”,他们会重新整理一个新模型十次,希望其中一个最终会变得更好。 在SGDR中,一旦我们接近最佳且稳定的区域,重置实际上不会“重置”,但权重会更好。 因此,SGDR将为您提供更好的结果,而不是随机尝试几个不同的起点。 +* 选择学习率(这是SGDR使用的最高学习率)非常重要,该学习率足以允许重置跳转到功能的不同部分。 [ [37:25](https://youtu.be/JNxcznsrRb8%3Ft%3D37m25s) ] +* SGDR降低了每个小批量的学习率,并且每个`cycle_len`时期都会`cycle_len` (在这种情况下,它设置为1)。 +* 问题:我们的主要目标是概括而不是最终陷入狭隘的最佳状态。 在这种方法中,我们是否跟踪最小值并对它们求平均值并对它们进行整合? [ [39:27](https://youtu.be/JNxcznsrRb8%3Ft%3D39m27s) ]这是另一个复杂程度,你会在图中看到“快照合奏”。 我们目前没有这样做,但如果你想要更好地概括,你可以在重置之前保存权重并取平均值。 但就目前而言,我们只是选择最后一个。 +* 如果你想跳过,有一个叫做`cycle_save_name`的参数你可以添加,还有`cycle_len` ,它会在每个学习速率周期结束时保存一组权重,然后你可以将它们合奏[ [40:14](https://youtu.be/JNxcznsrRb8%3Ft%3D40m14s) ]。 + +#### 拯救模特[ [40:31](https://youtu.be/JNxcznsrRb8%3Ft%3D40m31s) ] + +``` + learn.save('224_lastlayer') +``` + +``` + learn.load('224_lastlayer') +``` + +* 当您预先计算激活或创建调整大小的图像(我们将很快了解它)时,会创建各种临时文件,您可以在`data/dogcats/tmp`文件夹中看到这些文件。 如果您遇到奇怪的错误,可能是因为预先计算的激活只完成了一半,或者在某种程度上与您正在做的事情不兼容。 所以你总是可以继续删除这个`/tmp`文件夹以查看它是否会使错误消失(fast.ai相当于将其关闭再打开)。 +* 您还会看到有一个名为`/models`的目录,当您说`learn.save`时,模型会被保存 + +![](../img/1_tzmWttjMDhvuAj1xZ7PeOw.png) + +#### 微调和差分学习率[ [43:49](https://youtu.be/JNxcznsrRb8%3Ft%3D43m49s) ] + +* 到目前为止,我们没有重新训练任何预先训练的特征 - 特别是卷积内核中的任何权重。 我们所做的就是在顶部添加了一些新图层,并学习如何混合和匹配预先训练的功能。 +* 卫星图像,CT扫描等图像具有完全不同的特征(与ImageNet图像相比),因此您需要重新训练多个图层。 +* 对于狗和猫,图像类似于模型预训练的图像,但我们仍然可能发现稍微调整一些后面的图层是有帮助的。 +* 以下是告诉学习者我们想要开始实际更改卷积过滤器的方法: + +``` + learn.unfreeze() +``` + +* “冻结”层是未被训练/更新的层。 `unfreeze`解冻所有图层。 +* 像第一层(检测对角线边缘或渐变)或第二层(识别角或曲线)的早期层可能不需要改变太多,如果有的话。 +* 后面的图层更有可能需要更多的学习。 因此,我们创建了一系列学习率(差异学习率): + +``` + lr=np.array([1e-4,1e-3,1e-2]) +``` + +* `1e-4` :前几层(基本几何特征) +* `1e-3` :用于中间层(复杂的卷积特征) +* `1e-2` :对于我们在顶部添加的图层 +* 为什么3? 实际上它们是3个ResNet块,但就目前而言,它被认为是一组层。 + +**问题** :如果我的图像比训练模型的图像大,怎么办? [ [50:30](https://youtu.be/JNxcznsrRb8%3Ft%3D50m30s) ]简短的回答是,通过我们使用的图书馆和现代建筑,我们可以使用任何我们喜欢的尺寸。 + +**问题** :我们可以解冻特定的图层吗? [ [51:03](https://youtu.be/JNxcznsrRb8%3Ft%3D51m3s) ]我们还没有这样做,但如果你想,你可以做`lean.unfreeze_to(n)` (它将从`n`层`lean.unfreeze_to(n)`解冻层)。 Jeremy几乎从未发现它有用,他认为这是因为我们使用的是差异学习率,优化器可以根据需要学习。 他发现它有用的一个地方是,如果他使用的是一个非常大的内存密集型模型,并且他的GPU耗尽,你解冻的层次越少,内存和时间就越少。 + +使用差异学习率,我们高达99.5%! [ [52:28](https://youtu.be/JNxcznsrRb8%3Ft%3D52m28s) ] + +``` + learn.fit(lr, 3, cycle_len=1, **cycle_mult** =2) +``` + +``` + _[ 0\. 0.04538 0.01965 0.99268]_ _[ 1\. 0.03385 0.01807 0.99268]_ _[ 2\. 0.03194 0.01714 0.99316]_ _[ 3\. 0.0358 0.0166 0.99463]_ _[ 4\. 0.02157 0.01504 0.99463]_ _[ 5\. 0.0196 0.0151 0.99512]_ _[ 6\. 
0.01356 0.01518 0.9956 ]_ +``` + +* 早些时候我们说`3`是时代的数量,但它实际上是**周期** 。 因此,如果`cycle_len=2` ,它将进行3个循环,其中每个循环是2个时期(即6个时期)。 那为什么7呢? 这是因为`cycle_mult` +* `cycle_mult=2` :这乘以每个周期后的周期长度(1个时期+ 2个时期+ 4个时期= 7个时期)。 + +![](../img/1_SA5MA3z-jOBwvzF2e6-E6Q.png) + +直觉地说[ [53:57](https://youtu.be/JNxcznsrRb8%3Ft%3D53m57s) ],如果周期太短,它会开始下降找到一个好位置,然后弹出,然后试图找到一个好位置并且弹出,并且从未真正找到一个好点。 早些时候,你希望它能做到这一点,因为它试图找到一个更平滑的点,但是稍后,你希望它做更多的探索。 这就是为什么`cycle_mult=2`似乎是一个好方法。 + +我们正在介绍越来越多的超级参数,告诉你没有多少。 你可以选择一个好的学习率,但随后添加这些额外的调整有助于在没有任何努力的情况下获得额外的升级。 一般来说,好的起点是: + +* `n_cycle=3, cycle_len=1, cycle_mult=2` +* `n_cycle=3, cycle_len=2` (没有`cycle_mult` ) + +问题:为什么更平滑的表面与更广义的网络相关? [ [55:28](https://youtu.be/JNxcznsrRb8%3Ft%3D55m28s) ] + +![](../img/1_fNvevN5qLDf9dgq4632J7A.png) + +说你有尖刻的东西(蓝线)。 当您更改此特定参数时,X轴显示识别狗与猫的有多好。 可以推广的东西意味着当我们给它一个稍微不同的数据集时我们希望它能够工作。 稍微不同的数据集可能在此参数与猫类与狗类之间的关系略有不同。 它可能看起来像红线。 换句话说,如果我们最终得到蓝色尖头部分,那么它就不会在这个稍微不同的数据集上做得很好。 或者,如果我们最终得到更广泛的蓝色部分,它仍将在红色数据集上做得很好。 + +* [这](http://forums.fast.ai/t/why-do-we-care-about-resilency-of-where-we-are-in-the-weight-space/7323)是关于尖尖极小的一些有趣的讨论。 + +#### 测试时间增加(TTA)[ [56:49](https://youtu.be/JNxcznsrRb8%3Ft%3D56m49s) ] + +我们的模型达到了99.5%。 但我们还能让它变得更好吗? 让我们来看看我们错误预测的图片: + +![](../img/1_5jSFmwaQRmn4HaMZm1qyZw.png) + +在这里,杰里米打印出了所有这些照片。 当我们进行验证集时,我们对模型的所有输入必须是正方形。 原因是一些细微的技术细节,但如果您对不同的图像有不同的尺寸,GPU不会很快。 它需要保持一致,以便GPU的每个部分都能做同样的事情。 这可能是可以修复的,但是现在这就是我们所拥有的技术状态。 + +为了使它成为正方形,我们只选择中间的正方形 - 如下所示,可以理解为什么这张图片被错误分类: + +![](../img/1_u8pjW6L-FhCn0DO-utX7cA.png) + +我们将进行所谓的“ **测试时间增强** ”。 这意味着我们将随机采取4个数据增强以及未增强的原始(中心裁剪)。 然后,我们将计算所有这些图像的预测,取平均值,并将其作为我们的最终预测。 请注意,这仅适用于验证集和/或测试集。 + +要做到这一点,你所要做的就是`learn.TTA()` - 它将精度提高到99.65%! + +``` + log_preds,y = **learn.TTA()** probs = np.mean(np.exp(log_preds),0) +``` + +``` + accuracy(probs, y) +``` + +``` + _0.99650000000000005_ +``` + +**关于增强方法的问题[** [**01:01:36**](https://youtu.be/JNxcznsrRb8%3Ft%3D1h1m36s) **]:**为什么没有边框或填充使它成为正方形? 通常Jeremy没有做太多填充,但他做了一点点**缩放** 。 有一种称为**反射填充**的东西适用于卫星图像。 一般来说,使用TTA加上数据增强,最好的办法是尝试尽可能使用大图像。 此外,对于TTA,固定裁剪位置加上随机对比度,亮度,旋转变化可能更好。 + +**问题:**非图像数据集的数据增强? [ [01:03:35](https://youtu.be/JNxcznsrRb8%3Ft%3D1h3m35s) ]似乎没有人知道。 看起来它会有所帮助,但是很少有例子。 在自然语言处理中,人们试图替换同义词,但总的来说,该领​​域正处于研究和开发之中。 + +**问** :fast.ai库是开源的吗?[ [01:05:34](https://youtu.be/JNxcznsrRb8%3Ft%3D1h5m34s) ]是的。 然后他介绍了[Fast.ai从Keras + TensorFlow切换到PyTorch的原因](http://www.fast.ai/2017/09/08/introducing-pytorch-for-fastai/) + +随机说明:PyTorch不仅仅是一个深度学习库。 它实际上让我们从头开始编写任意GPU加速算法 - Pyro是人们在深度学习之外使用PyTorch做的一个很好的例子。 + +#### 分析结果[ [01:11:50](https://youtu.be/JNxcznsrRb8%3Ft%3D1h11m50s) ] + +#### **混淆矩阵** + +查看分类结果的简单方法称为混淆矩阵 - 它不仅用于深度学习,而且用于任何类型的机器学习分类器。 特别是如果您有四到五个班级试图预测哪个班级最容易出问题。 + +![](../img/1_IeGRqM88ZW0-7V0Za_FaQw.png) + +``` + preds = np.argmax(probs, axis=1) probs = probs[:,1] +``` + +``` + **from** **sklearn.metrics** **import** confusion_matrix cm = confusion_matrix(y, preds) +``` + +``` + plot_confusion_matrix(cm, data.classes) +``` + +#### 让我们再看看这些照片[ [01:13:00](https://youtu.be/JNxcznsrRb8%3Ft%3D1h13m) ] + +大多数不正确的猫(只有左边两个不正确 - 默认显示4个): + +![](../img/1_IeVm7iR9u3Iy-NPIC73pLQ.png) + +最不正确的点: + +![](../img/1_UtNl3fx4vWnEdCL6Zed4Jw.png) + +### 回顾:培养世界级图像分类器的简单步骤[ [01:14:09](https://youtu.be/JNxcznsrRb8%3Ft%3D1h14m9s) ] + +1. 启用数据扩充,并`precompute=True` +2. 使用`lr_find()`找到最高学习率,其中损失仍在明显改善 +3. 从预先计算的激活训练最后一层1-2个时期 +4. 使用数据增强(即`precompute=False` )训练最后一层,持续2-3个时期,其中`cycle_len=1` +5. 解冻所有图层 +6. 将早期图层设置为比下一个更高层低3x-10x的学习率。 经验法则:ImageNet像图像10倍,卫星或医学成像3倍 +7. 再次使用`lr_find()` (注意:如果你调用`lr_find`设置差异学习率,它打印出来的是最后一层的学习率。) +8. 
使用`cycle_mult=2`训练完整网络,直到过度拟合 + +#### 让我们再做一次: [狗品种挑战](https://www.kaggle.com/c/dog-breed-identification) [ [01:16:37](https://youtu.be/JNxcznsrRb8%3Ft%3D1h16m37s) ] + +* 您可以使用[Kaggle CLI](https://github.com/floydwch/kaggle-cli)下载Kaggle比赛的数据 +* 笔记本电脑不是公开的,因为它是一个积极的竞争 + +``` + %reload_ext autoreload %autoreload 2 %matplotlib inline +``` + +``` + from fastai.imports import * from fastai.transforms import * from fastai.conv_learner import * from fastai.model import * from fastai.dataset import * from fastai.sgdr import * from fastai.plots import * +``` + +``` + PATH = 'data/dogbreed/' sz = 224 arch = resnext101_64 bs=16 +``` + +``` + label_csv = f'{PATH}labels.csv' n = len(list(open(label_csv)))-1 val_idxs = get_cv_idxs(n) +``` + +``` + !ls {PATH} +``` + +![](../img/1_RBwxSLtYOv61Ry13ayUv4w.png) + +这与我们之前的数据集略有不同。 而不是`train`文件夹,每个品种的狗都有一个单独的文件夹,它有一个带有正确标签的CSV文件。 我们将使用Pandas阅读CSV文件。 我们在Python中使用Pandas来进行像CSV这样的结构化数据分析,通常以`pd`格式导入: + +``` + label_df = pd.read_csv(label_csv) label_df.head() +``` + +![](../img/1_fTMGpsB_jK7Pp1cjErZ8vw.png) + +``` + label_df. **pivot_table** (index='breed', aggfunc=len).sort_values('id', ascending=False) +``` + +![](../img/1_0-PpgltveKXnyjaMHPhCPA.png) + +
每个品种各有多少张狗的图像
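下面给出一个最小的等价写法示意(这不是原笔记本里的代码;假设 `label_df` 就是上面读入的数据框,含 `id` 和 `breed` 两列):用 Pandas 的 `value_counts` 也能得到同样的"每个品种有多少张图"的计数,有助于理解上面那行 `pivot_table` 在做什么。

```
import pandas as pd

# 示意代码(非原笔记本内容):按品种计数,
# 结果与上面 pivot_table(index='breed', aggfunc=len) 再按 'id' 降序排序一致。
label_df = pd.read_csv('data/dogbreed/labels.csv')   # 两列:id, breed
breed_counts = label_df['breed'].value_counts()      # Series:品种 -> 图片数,默认按数量降序
print(breed_counts.head())
```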
+ + + +``` + tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1) +``` + +``` + data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', test_name='test', val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs) +``` + +* `max_zoom` - 我们将放大1.1倍 +* `ImageClassifierData.from_csv` - 上次我们使用了`from_paths`但由于标签是CSV文件,我们将调用`from_csv` 。 +* `test_name` - 如果要提交给Kaggle比赛,我们需要指定测试集的位置 +* `val_idx` - 没有`validation`文件夹,但我们仍想跟踪我们的本地性能有多好。 所以上面你会看到: + +`n = len(list(open(label_csv)))-1` :打开CSV文件,创建行列表,然后取长度。 `-1`因为第一行是标题。 因此`n`是我们拥有的图像数量。 + +`val_idxs = **get_cv_idxs** (n)` :“获取交叉验证索引” - 默认情况下,这将返回随机的20%的行(准确的索引)以用作验证集。 您也可以发送`val_pct`以获得不同的金额。 + +![](../img/1_ug-ihQFW21b4P68dJlADpg.png) + +* `suffix='.jpg'` - 文件名末尾有`.jpg` ,但CSV文件没有。 所以我们将设置`suffix`以便它知道完整的文件名。 + +``` + fn = PATH + data.trn_ds.fnames[0]; fn +``` + +``` + _'data/dogbreed/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg'_ +``` + +* 您可以通过说`data.trn_ds`和`trn_ds`包含很多内容来访问训练数据集,包括文件名( `fnames` ) + +``` + img = PIL.Image.open(fn); img +``` + +![](../img/1_1eb6vEpa8SOrxaoNNs7f0g.png) + +``` + img.size +``` + +``` + _(500, 375)_ +``` + +* 现在我们检查图像大小。 如果它们很大,那么你必须仔细考虑如何处理它们。 如果它们很小,那也很有挑战性。 大多数ImageNet模型都是通过224×224或299×299图像进行训练的 + +``` + size_d = {k: PIL.Image.open(PATH+k).size for k in data.trn_ds.fnames} +``` + +* 字典理解 - `key: name of the file` , `value: size of the file` + +``` + row_sz, col_sz = list(zip(*size_d.values())) +``` + +* `*size_d.values()`将解压缩列表。 `zip`将配对元组元素以创建元组列表。 + +``` + plt.hist(row_sz); +``` + +![](../img/1_KPYOb0uGgAmaqLr6JWZmSg.png) + +
图像行尺寸(row_sz)的直方图
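下面用一个极简的例子(非原笔记本代码,文件名和尺寸均为虚构)说明上面 `zip(*size_d.values())` 这个技巧:`*` 把字典的值解包成一个个尺寸元组,`zip` 再把它们按位置重新配对,拆成两个序列。

```
# 示意代码:zip(*...) 把若干 (x, y) 尺寸对拆成两个元组
sizes = {'a.jpg': (500, 375), 'b.jpg': (375, 500), 'c.jpg': (499, 375)}

row_sz, col_sz = zip(*sizes.values())
print(row_sz)   # (500, 375, 499)
print(col_sz)   # (375, 500, 375)
```

在笔记本里,得到的 `row_sz` 随后被传给 `plt.hist` 画出上面的直方图。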
+ + + +* 如果你在Python中进行任何类型的数据科学或机器学习,那么Matplotlib就是你非常熟悉的东西。 Matplotlib总是被称为`plt` 。 + +**问题** :我们应该使用多少图像作为验证集? [ [01:26:28](https://youtu.be/JNxcznsrRb8%3Ft%3D1h26m28s) ]使用20%是好的,除非数据集很小 - 那么20%是不够的。 如果您多次训练相同的模型并且验证集结果非常不同,则验证集太小。 如果验证集小于一千,则很难解释您的表现如何。 如果您关心精度的第三个小数位,并且验证集中只有一千个内容,则单个图像会改变精度。 如果您关心0.01和0.02之间的差异,您希望它代表10或20行。 通常20%似乎工作正常。 + +``` + def get_data(sz, bs): tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1) data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', test_name='test', num_workers=4, val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs) +``` + +``` + return data if sz>300 else data.resize(340, 'tmp') +``` + +* 这是常规的两行代码。 当我们开始使用新数据集时,我们希望一切都超级快。 因此,我们可以指定大小,并从64开始快速运行。 之后,我们将使用更大的图像和更大的架构,此时,您可能会耗尽GPU内存。 如果您看到CUDA内存不足错误,您需要做的第一件事就是重新启动内核(无法从中恢复),然后缩小批量。 + +``` + data = get_data(224, bs) +``` + +``` + learn = ConvLearner.pretrained(arch, data, precompute=True) +``` + +``` + learn.fit(1e-2, 5) +``` + +``` + _[0._ _1.99245 1.0733 0.76178]_ _[1._ _1.09107 0.7014 0.8181 ]_ _[2._ _0.80813 0.60066 0.82148]_ _[3._ _0.66967 0.55302 0.83125]_ _[4._ _0.57405 0.52974 0.83564]_ +``` + +* 120个班级的83%相当不错。 + +``` + **learn.precompute = False** +``` + +``` + learn.fit(1e-2, 5, **cycle_len** =1) +``` + +* 提醒:一个`epoch`是一次通过数据,一个`cycle`是你所说的`cycle`有多少个时代 + +``` + learn.save('224_pre') learn.load('224_pre') +``` + +#### 增加图像尺寸[1:32:55] + +``` + learn.set_data(get_data(299, bs)) +``` + +* 如果您在较小尺寸的图像上训练模型,则可以调用`learn.set_data`并传入更大尺寸的数据集。 这将是你的模型,但它已经训练到目前为止,它将让你继续训练更大的图像。 + +> 开始对几个时代的小图像进行训练,然后切换到更大的图像,继续训练是避免过度拟合的一种非常有效的方法。 + +``` + learn.fit(1e-2, 3, cycle_len=1) +``` + +``` + _[0._ _0.35614 0.22239 0.93018]_ _[1._ _0.28341 0.2274 0.92627]_ _[2._ **_0.28341_** **_0.2274_** _ 0.92627]_ +``` + +* 如您所见,验证集损失(0.2274)远低于训练集损失(0.28341) - 这意味着它**不合适** 。 当你处于拟合状态时,这意味着`cycle_len=1`太短(学习率在它有机会正确放大之前被重置)。 所以我们将添加`cycle_mult=2` (即第1个周期是1个时期,第2个周期是2个时期,第3个周期是4个时期) + +``` + learn.fit(1e-2, 3, cycle_len=1, **cycle_mult=2** ) +``` + +``` + [0. 0.27171 0.2118 0.93192] [1. 0.28743 0.21008 0.9324 ] [2. 0.25328 0.20953 0.93288] [3. 0.23716 0.20868 0.93001] [4. 0.23306 0.20557 0.93384] [5. 0.22175 0.205 0.9324 ] [6. 0.2067 0.20275 0.9348 ] +``` + +* 现在验证损失和培训损失大致相同 - 这是关于正确的轨道。 然后我们尝试`TTA` : + +``` + log_preds, y = learn.TTA() probs = np.exp(log_preds) accuracy(log_preds,y), metrics.log_loss(y, probs) +``` + +``` + _(0.9393346379647749, 0.20101565705592733)_ +``` + +其他尝试: + +* 尝试再运行2个时期的循环 +* 解冻(在这种情况下,训练卷积层没有丝毫帮助,因为图像实际上来自ImageNet) +* 删除验证集并重新运行相同的步骤,并提交 - 这使我们可以使用100%的数据。 + +**问题** :我们如何处理不平衡的数据集? [ [01:38:46](https://youtu.be/JNxcznsrRb8%3Ft%3D1h38m46s) ]这个数据集并不是完全平衡的(在60到100之间),但它并不是不平衡的,杰里米会给它第二个想法。 最近的一篇论文说,处理非常不平衡的数据集的最佳方法是制作罕见案例的副本。 + +**问题** : `precompute=True`和`unfreeze`之间的区别? + +* 我们从一个预先训练好的网络开始 +* 我们在它的末尾添加了几层,随机开始。 一切都冻结了, `precompute=True` ,我们所学的就是我们添加的层。 +* 使用`precompute=True` ,数据扩充不会执行任何操作,因为我们每次都显示完全相同的激活。 +* 然后我们设置`precompute=False` ,这意味着我们仍然只训练我们添加的层,因为它被冻结但数据增加现在正在工作,因为它实际上正在通过并从头开始重新计算所有激活。 +* 最后,我们解冻,说“好吧,现在你可以继续改变所有这些早期的卷积过滤器”。 + +**问题** :为什么不从头开始设置`precompute=False` ? `precompute=True`的唯一原因是它更快(10次或更多次)。 如果您正在使用相当大的数据集,它可以节省相当多的时间。 使用`precompute=True`没有准确的原因。 + +**获得良好结果的最小步骤:** + +1. 使用`lr_find()`找到最高学习率,其中损失仍在明显改善 +2. 使用数据增强(即`precompute=False` )训练最后一层,持续2-3个时期,其中`cycle_len=1` +3. 解冻所有图层 +4. 将早期图层设置为比下一个更高层低3x-10x的学习率 +5. 使用`cycle_mult=2`训练完整网络,直到过度拟合 + +**问题** :减小批量大小只会影响培训速度吗? 
[ [1:43:34](https://youtu.be/JNxcznsrRb8%3Ft%3D1h43m34s) ]是的, [差不多](https://youtu.be/JNxcznsrRb8%3Ft%3D1h43m34s) 。 如果每次显示较少的图像,那么它用较少的图像计算梯度 - 因此不太准确。 换句话说,知道去哪个方向以及朝这个方向走多远都不太准确。 因此,当您使批量较小时,您会使其更具波动性。 它会影响您需要使用的最佳学习率,但在实践中,将批量大小除以2而不是4似乎并没有太大改变。 如果您更改批量大小,可以重新运行学习速率查找器进行检查。 + +**问题:**灰色图像与右侧图像有什么关系? + +![](../img/1_IHxFF49erSrWw02s8H6BiQ.png) + +
[可视化和理解卷积网络](https://arxiv.org/abs/1311.2901)
+ + + +第1层,它们正是过滤器的样子。 它很容易可视化,因为它的输入是像素。 后来,它变得更难,因为输入本身是激活,这是激活的组合。 Zeiler和Fergus想出了一个聪明的技术来展示过滤器平均看起来像什么 - 称为**反卷积** (我们将在第2部分中学习)。 右侧的图像是高度激活该过滤器的图像块的示例。 + +**问题** :如果狗离开角落或很小的话,你会做什么(re:狗品种鉴定)? [ [01:47:16](https://youtu.be/JNxcznsrRb8%3Ft%3D1h47m16s) ]我们将在第2部分中了解它,但是有一种技术可以让你粗略地弄清楚图像的哪些部分最有可能包含有趣的东西。 然后你可以裁剪出那个区域。 + +#### 进一步改进[ [01:48:16](https://youtu.be/JNxcznsrRb8%3Ft%3D1h48m16s) ] + +我们可以立即采取两项措施来改善它: + +1. 假设您使用的图像大小小于您给出的图像的平均大小,则可以增加大小。 正如我们之前看到的,您可以在训练期间增加它。 +2. 使用更好的架构。 将卷积滤波器的大小和它们如何相互连接的方法有不同的方法,不同的架构具有不同的层数,内核大小,滤波器等。 + +我们一直在使用ResNet34 - 一个很好的起点,通常是一个很好的终点,因为它没有太多的参数,适用于小数据集。 另一个名为ResNext的架构在去年的ImageNet竞赛中获得第二名.ResNext50的内存比ResNet34长两倍,内存的2-4倍。 + +[这](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1-rxt50.ipynb)是与原始狗几乎完全相同的笔记本。 与猫。 它使用ResNext50,精度达到99.75%。 + +#### 卫星图像[01:53:01] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson2-image_models.ipynb) + +![](../img/1_oZ3NUOr9KNljuB4jkwWguQ.png) + +代码与我们之前看到的几乎相同。 以下是一些差异: + +* `transforms_top_down` - 由于它们是卫星图像,因此它们在垂直翻转时仍然有意义。 +* 更高的学习率 - 与此特定数据集有关 +* `lrs = np.array([lr/9,lr/3,lr])` - 差异学习率现在变化3倍,因为图像与ImageNet图像完全不同 +* `sz=64` - 这有助于避免过度拟合卫星图像,但他不会那样做狗。 与猫或狗品种(与ImageNet相似的图像),64乘64非常小,可能会破坏预先训练过的重量。 + +#### 如何获得您的AWS设置[ [01:58:54](https://youtu.be/JNxcznsrRb8%3Ft%3D1h58m54s) ] + +您可以关注视频,或者[这](https://github.com/reshamas/fastai_deeplearn_part1/blob/master/tools/aws_ami_gpu_setup.md)是一位学生写的很棒的文章。 diff --git a/zh/dl3.md b/zh/dl3.md new file mode 100644 index 0000000000000000000000000000000000000000..228c3763fd78a2e7190f7d481889dc12f2e0a9cd --- /dev/null +++ b/zh/dl3.md @@ -0,0 +1,604 @@ +# 深度学习2:第1部分第3课 + +### [第3课](http://forums.fast.ai/t/wiki-lesson-3/9401/1) + +#### 学生创造的有用材料: + +* [AWS操作方法](https://github.com/reshamas/fastai_deeplearn_part1/blob/master/tools/aws_ami_gpu_setup.md) +* [TMUX](https://github.com/reshamas/fastai_deeplearn_part1/blob/master/tools/tmux.md) +* [第2课总结](https://medium.com/%40apiltamang/case-study-a-world-class-image-classifier-for-dogs-and-cats-err-anything-9cf39ee4690e) +* [学习率查询器](https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0) +* [PyTorch](https://towardsdatascience.com/a-practitioners-guide-to-pytorch-1d0f6a238040) +* [学习率与批量大小](https://miguel-data-sc.github.io/2017-11-05-first/) +* [误差表面与泛化的平滑区域](https://medium.com/%40radekosmulski/do-smoother-areas-of-the-error-surface-lead-to-better-generalization-b5f93b9edf5b) +* [卷云神经网络在5分钟内完成](https://medium.com/%40init_27/convolutional-neural-network-in-5-minutes-8f867eb9ca39) +* [解码ResNet架构](http://teleported.in/posts/decoding-resnet-architecture/) +* [另一个ResNet教程](https://medium.com/%40apiltamang) + +#### 我们离开这里的地方: + +![](../img/1_w03TpHU-IgKy5GsLuMYxzw.png) + +### 回顾[ [08:24](https://youtu.be/9C06ZPF8Uuc%3Ft%3D8m24s) ]: + +#### Kaggle CLI:如何下载数据1: + +从Kaggle下载时, [Kaggle CLI](https://github.com/floydwch/kaggle-cli)是一个很好的工具。 因为它是从Kaggle网站下载数据(通过屏幕抓取),它会在网站更改时中断。 当发生这种情况时,运行`pip install kaggle-cli --upgrade` 。 + +然后你可以运行: + +``` + $ kg download -u -p -c +``` + +将`<username>` , `<password>`替换为您的凭证, `<competition>`是URL中的`/c/` 。 例如,如果您尝试从`https://www.kaggle.com **/c/** dog-breed-identification`下载狗品种数据,该命令将如下所示: + +``` + `$ kg download -u john.doe -p mypassword -c` dog-breed-identification +``` + +确保您曾从计算机上单击过“ `Download`按钮并接受以下规则: + +![](../img/1_NE_vFqUgrq_ZY-Ez8lYD1Q.png) + +#### CurWget(Chrome扩展程序):如何下载数据2: + +[![](../img/1_dpgcElfgbBLg-LKqyQHOBQ.png)](https://chrome.google.com/webstore/detail/curlwget/jmocjfidanebdlinpbcdkcmgdifblncg) + +#### 快狗与猫[ 
[13:39](https://youtu.be/9C06ZPF8Uuc%3Ft%3D13m39s) ] + +``` + **from** fastai.conv_learner **import** * PATH = 'data/dogscats/' sz=224; bs=64 +``` + +笔记本通常假设您的数据位于`data`文件夹中。 但也许你想把它们放在其他地方。 在这种情况下,您可以使用符号链接(简称符号链接): + +![](../img/1_f835x3bUfRPT9pFaqjvutw.png) + +这是一个端到端的过程,以获得狗与猫的最新结果: + +![](../img/1_ItxElIWV6hU9f_fwEZ9jMQ.png) + +
Quick Dogs v Cats
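关于上面提到的符号链接,这里是一个最小的 Python 版本示意(两个路径都是假设的;在 shell 里用 `ln -s` 也能完成同样的事):

```
import os

# 示意代码(路径为假设):把存放在别处的数据集链接到笔记本默认的 data/dogscats 路径
os.symlink('/storage/datasets/dogscats',   # 数据实际所在的位置
           'data/dogscats')                # 笔记本期望找到数据的位置
```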
+ + + +#### 进一步分析: + +``` + data = ImageClassifierData.from_paths(PATH, tfms= tfms, bs=bs, test_name='test') +``` + +* `from_paths` :表示子文件夹名称是标签。 如果您的`train`文件夹或`valid`文件夹具有不同的名称,则可以发送`trn_name`和`val_name`参数。 +* `test_name` :如果要提交给Kaggle竞赛,则需要填写测试集所在文件夹的名称。 + +``` + learn = ConvLearner.pretrained(resnet50, data) +``` + +* 请注意,我们没有设置`pre_compue=True` 。 它只是一个快捷方式,可以缓存一些不必每次都重新计算的中间步骤。 如果你对此感到困惑,你可以把它关掉。 +* 请记住,当`pre_compute=True` ,数据扩充不起作用。 + +``` + learn.unfreeze() learn. **bn_freeze** ( **True** ) %time learn.fit([1e-5, 1e-4,1e-2], 1, cycle_len=1) +``` + +* `bn_freeze` :如果你在一个与ImageNet非常相似的数据集上使用更大的更深层次的模型,如ResNet50或ResNext101(数字大于34的任何东西)(即标准对象的侧面照片,其大小类似于ImageNet在200-500之间像素),你应该添加这一行。 我们将在课程的后半部分了解更多信息,但这会导致批量标准化移动平均值无法更新。 + +#### [如何使用其他图书馆 - Keras](https://github.com/fastai/fastai/blob/master/courses/dl1/keras_lesson1.ipynb) [ [20:02](https://youtu.be/9C06ZPF8Uuc%3Ft%3D20m2s) ] + +了解如何使用Fast.ai以外的库非常重要。 Keras是一个很好的例子,因为就像Fast.ai一样位于PyTorch之上,它位于各种库之上,如TensorFlow,MXNet,CNTK等。 + +如果要运行[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/keras_lesson1.ipynb) ,请运行`pip install tensorflow-gpu keras` + +1. **定义数据生成器** + +``` + train_data_dir = f'{PATH}train' validation_data_dir = f'{PATH}valid' +``` + +``` + train_datagen = ImageDataGenerator(rescale=1\. / 255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True) +``` + +``` + test_datagen = ImageDataGenerator(rescale=1\. / 255) +``` + +``` + train_generator = train_datagen.flow_from_directory(train_data_dir, target_size=(sz, sz), batch_size=batch_size, class_mode='binary') +``` + +``` + validation_generator = test_datagen.flow_from_directory( validation_data_dir, shuffle=False, target_size=(sz, sz), batch_size=batch_size, class_mode='binary') +``` + +* 火车文件夹和验证文件夹以及带有标签名称的子文件夹的想法通常都已完成,Keras也是这样做的。 +* Keras需要更多的代码和更多的参数来设置。 +* 您可以在Keras中定义`DataGenerator` ,而不是创建单个数据对象,并指定我们希望它执行何种类型的数据扩充以及要执行的规范化。 换句话说,在Fast.ai中,我们可以说“无论ResNet50需要什么,请为我做这件事”,但在Keras中,你需要知道预期的结果。 没有标准的增强功能。 +* 然后,您必须创建一个验证数据生成器,您负责创建一个没有数据扩充的生成器。 而且你还必须告诉它不要对数据集进行洗牌以进行验证,否则你无法跟踪你的表现。 + +**2.创建模型** + +``` + base_model = ResNet50(weights='imagenet', include_top=False) x = base_model.output x = GlobalAveragePooling2D()(x) x = Dense(1024, activation='relu')(x) predictions = Dense(1, activation='sigmoid')(x) +``` + +* Jeremy使用ResNet50作为Quick Dogs和Cats的原因是因为Keras没有ResNet34。 我们想比较苹果和苹果。 +* 您不能要求它构建适合特定数据集的模型,因此您必须手动完成。 +* 首先创建基础模型,然后构建要在其上添加的图层。 + +**3.冻结图层并编译** + +``` + model = Model(inputs=base_model.input, outputs=predictions) +``` + +``` + for layer in base_model.layers: layer.trainable = False +``` + +``` + model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy']) +``` + +* 循环遍历图层并通过调用`layer.trainable=False`手动冻结它们 +* 您需要编译模型 +* 传递优化程序,损失和指标的类型 + +**适合** + +``` + model.fit_generator(train_generator, **train_generator.n//batch_size** , epochs=3, **workers=4** , validation_data=validation_generator, validation_steps=validation_generator.n // batch_size) +``` + +* Keras希望知道每个时期有多少批次。 +* `workers` :要使用多少处理器 + +**5.微调:解冻一些图层,编译,然后再适合** + +``` + split_at = 140 +``` + +``` + **for** layer **in** model.layers[:split_at]: layer.trainable = **False** **for** layer **in** model.layers[split_at:]: layer.trainable = **True** +``` + +``` + model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy']) +``` + +``` + %%time model.fit_generator(train_generator, train_generator.n // batch_size, epochs=1, workers=3, validation_data=validation_generator, validation_steps=validation_generator.n // batch_size) +``` + +**Pytorch** - 
如果你想部署到移动设备,PyTorch仍然很早。 + +**Tensorflow** - 如果你想要转换你在本课程中学到的东西,可以使用**Keras**做更多的工作,但这需要更多的工作并且很难获得相同级别的结果。 也许将来会有TensorFlow兼容的Fast.ai版本。 我们会看到。 + +#### 为Kaggle创建提交文件[ [32:45](https://youtu.be/9C06ZPF8Uuc%3Ft%3D32m45s) ] + +要创建提交文件,我们需要两条信息: + +* `data.classes` :包含所有不同的类 +* `data.test_ds.fnames` :测试文件名 + +``` + log_preds, y = learn.TTA(is_test=True) probs = np.exp(log_preds) +``` + +使用`TTA:`总是好主意`TTA:` + +* `is_test=True` :它将为您提供测试集的预测,而不是验证集 +* 默认情况下,PyTorch模型会返回预测日志,因此您需要执行`np.exp(log_preds)`来获取概率。 + +``` + ds = pd.DataFrame(probs) ds.columns = data.classes +``` + +* 创建Pandas `DataFrame` +* 将列名称设置为`data.classes` + +``` + ds.insert(0, 'id', [o[5:-4] **for** o **in** data.test_ds.fnames]) +``` + +* 在名为`id`零位置插入一个新列。 删除前5个和后4个字母,因为我们只需要ID(文件名看起来像`test/0042d6bf3e5f3700865886db32689436.jpg` ) + +``` + ds.head() +``` + +![](../img/1_S6mkbwDsXs2ERYI3Eygvig.png) + +``` + SUBM = f'{PATH}sub/' os.makedirs(SUBM, exist_ok= **True** ) ds.to_csv(f'{SUBM}subm.gz', compression='gzip', index= **False** ) +``` + +* 现在你可以调用`ds.to_csv`来创建一个CSV文件, `compression='gzip'`会在服务器上压缩它。 + +``` + FileLink(f'{SUBM}subm.gz') +``` + +* 您可以使用Kaggle CLI直接从服务器提交,也可以使用`FileLink` ,它将为您提供从服务器下载文件到计算机的链接。 + +#### 个人预测[ [39:32](https://youtu.be/9C06ZPF8Uuc%3Ft%3D39m32s) ] + +如果我们想通过模型运行单个图像来获得预测怎么办? + +``` + fn = data.val_ds.fnames[0]; fn +``` + +``` + _'train/_ `_001513dfcb2ffafc82cccf4d8bbaba97.jpg_` _'_ +``` + +``` + Image.open(PATH + fn) +``` + +![](../img/1_1eb6vEpa8SOrxaoNNs7f0g.png) + +* 我们将从验证集中选择第一个文件。 + +这是获得预测的最短途径: + +``` + trn_tfms, val_tfms = tfms_from_model(arch, sz) +``` + +``` + im = val_tfms(Image.open(PATH+fn) preds = learn.predict_array(im[None]) +``` + +``` + np.argmax(preds) +``` + +* 必须改变图像。 `tfms_from_model`返回训练变换和验证变换。 在这种情况下,我们将使用验证转换。 +* 传递给模型或从模型返回的所有内容通常都假定为小批量。 这里我们只有一个图像,但我们必须将其转换为单个图像的小批量。 换句话说,我们需要创建一个不仅仅是`[rows, columns, channels]` ,而是`[number of images, rows, columns, channels]` 。 +* `im[None]` :在开始时添加额外单位轴的Numpy技巧。 + +#### 理论:卷积神经网络在幕后实际发生了什么[ [42:17](https://youtu.be/9C06ZPF8Uuc%3Ft%3D42m17s) ] + +* 我们在第1课中看到了一点理论 - [http://setosa.io/ev/image-kernels/](http://setosa.io/ev/image-kernels/) +* 卷积是我们有一个小矩阵(在深度学习中几乎总是3x3)并将该矩阵的每个元素乘以图像的3x3部分的每个元素并将它们全部加在一起以在一个点获得该卷积的结果。 + +**Otavio的奇妙可视化(他创造了Word Lens):** + +**Jeremy的可视化:** [**电子表格**](https://github.com/fastai/fastai/blob/master/courses/dl1/excel/conv-example.xlsx) **[** [**49:51**](https://youtu.be/9C06ZPF8Uuc%3Ft%3D49m51s) **]** + +![](../img/1_AUQDWjcwS2Yt7Id0WyXCaQ.png) + +
我使用的是在线版 Excel:[https://office.live.com/start/Excel.aspx](https://office.live.com/start/Excel.aspx%3Fui%3Den-US%26rs%3DUS)
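下面是一个极简的 NumPy 示意(卷积核和像素值都是随意编的,并非电子表格里的数字),对应上面对卷积的描述:取图像中的一个 3x3 小块,与 3x3 的卷积核逐元素相乘再求和,得到该位置的一个激活值,然后经过 ReLU。

```
import numpy as np

# 示意代码:在一个 3x3 图像块上做一次卷积(逐元素相乘再求和),再过 ReLU
kernel = np.array([[-1., -1., -1.],
                   [ 0.,  0.,  0.],
                   [ 1.,  1.,  1.]])     # 一个 3x3 的滤波器/内核(示例数值)

patch = np.array([[10., 20., 30.],
                  [40., 50., 60.],
                  [70., 80., 90.]])      # 图像中的一个 3x3 像素块(示例数值)

activation = (kernel * patch).sum()      # 逐元素相乘并求和 -> 一个激活值,这里是 180.0
activation = max(0., activation)         # ReLU:把负数变成 0
print(activation)
```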
+ + + +* 该数据来自MNIST +* **激活** :通过对输入中的某些数字应用某种线性运算来计算的数字。 +* **整流线性单位(ReLU)** :抛弃负数 - 即MAX(0,x) +* **滤镜/内核:**用于卷积的3张3张3D张量 +* **张量:**多维数组或矩阵隐藏层既不是输入也不是输出的层 +* **最大合并:** A(2,2)最大合并将使高度和宽度的分辨率减半 - 将其视为摘要 +* **完全连接的层:**为每个单独的激活赋予权重并计算总和乘积。 权重矩阵与整个输入一样大。 +* 注意:在最大池层之后,您可以执行许多操作。 其中一个是在整个大小上做另一个最大池。 在旧架构或结构化数据中,我们完全连接层。 大量使用完全连接层的架构容易过度拟合而且速度较慢。 ResNet和ResNext不使用非常大的完全连接层。 + +**问题** :如果输入有3个通道会发生什么? [ [1:05:30](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h5m30s) ]它看起来类似于具有2个通道的Conv1层 - 因此,滤波器每个滤波器有2个通道。 预先训练的ImageNet模型使用3个通道。 当你有少于3个频道时,你可以使用的一些技巧是复制其中一个频道使其成为3,或者如果你有2,则获得平均值并将其视为第三个频道。 如果你有4个通道,你可以用卷全部零点为卷积内核添加额外的级别。 + +#### 接下来发生什么? [ [1:08:47](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h8m47s) ] + +我们已经达到完全连接层(它做经典矩阵产品)。 在excel表中,有一个激活。 如果我们想查看输入的十位数中的哪一位,我们实际上想要计算10个数字。 + +让我们看一个例子,我们试图预测图片是猫,狗,飞机,鱼,还是建筑物。 我们的目标是: + +1. 从完全连接的层获取输出(没有ReLU,因此可能有负片) +2. 计算5个数字,每个数字在0和1之间,它们加起来为1。 + +为此,我们需要一种不同类型的激活功能(一种应用于激活的功能)。 + +为什么我们需要非线性? 如果堆叠多个线性图层,它仍然只是一个线性图层。 通过添加非线性层,我们可以适应任意复杂的形状。 我们使用的非线性激活函数是ReLU。 + +#### Softmax [ [01:14:08](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h14m8s) ] + +Softmax仅出现在最后一层。 它输出0到1之间的数字,它们加起来为1.理论上,这并不是绝对必要的 - 我们可以要求神经网络学习一组内核,这些内核可以提供尽可能接近我们想要的概率的内核。 通常,通过深度学习,如果您可以构建您的体系结构,以便尽可能容易地表达所需的特征,您将获得更好的模型(更快速地学习并使用更少的参数)。 + +![](../img/1_YMxLuqkvhuR_ef_3K8iHyw.png) + +1. 通过`e^x`消除负数,因为我们不能有负概率。 它还强调了价值差异(2.85:4.08→17.25:59.03) + +您需要熟悉的所有数学知识来深入学习: + +![](../img/1_83US55BbMSX1dNuKCd8f1A.png) + +2.然后我们将`exp`列(182.75)相加,并将`e^x`除以总和。 结果总是积极的,因为我们将积极的积极分开。 每个数字介于0和1之间,总数将为1。 + +**问题** :如果想把图片分类为猫狗,我们使用什么样的激活功能? [ [1:20:27](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h20m27s) ]碰巧我们现在要这样做。 我们可能想要这样做的一个原因是进行多标签分类。 + +### 星球大赛[ [01:20:54](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h20m54s) ] + +[Notebook](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson2-image_models.ipynb) / [Kaggle页面](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space) + +> 我肯定会建议拟人化您的激活功能。 他们有个性。 [ [1:22:21](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h22m21s) ] + +Softmax不喜欢预测多个事物。 它想要选择一件事。 + +如果有多个标签,Fast.ai库将自动切换到多标签模式。 所以你不必做任何事情。 但这是幕后发生的事情: + +``` + **from** **planet** **import** f2 metrics=[f2] f_model = resnet34 +``` + +``` + label_csv = f' **{PATH}** train_v2.csv' n = len(list(open(label_csv)))-1 val_idxs = get_cv_idxs(n) +``` + +``` + **def** get_data(sz): tfms = tfms_from_model(f_model, sz, aug_tfms= **transforms_top_down** , max_zoom=1.05) +``` + +``` + **return** ImageClassifierData. 
**from_csv** (PATH, 'train-jpg', label_csv, tfms=tfms, suffix='.jpg', val_idxs=val_idxs, test_name='test-jpg') +``` + +``` + data = get_data(256) +``` + +* 使用Keras样式方法无法进行多标签分类,其中子文件夹是标签的名称。 所以我们使用`from_csv` +* `transform_top_down` :它不仅仅是垂直翻转。 一个正方形有8种可能的对称性 - 它可以旋转0度,90度,180度,270度,对于每个正方形,它可以翻转(八面**体的二面体** ) + +``` + x,y = next(iter(data.val_dl)) +``` + +* 我们已经看过`data.val_ds` , `test_ds` , `train_ds` ( `ds` :dataset),例如,你可以通过`data.train_ds[0]`获得单个图像。 +* `dl`是一个数据加载器,它将为您提供一个小批量,特别_转换的_小批量。 使用数据加载器,您不能要求特定的小批量; 你只能回到`next`小批量。 在Python中,它被称为“生成器”或“迭代器”。 PyTorch真正利用现代Python方法。 + +> [如果你熟悉Python,那么PyTorch非常自然。](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h27m45s) [如果你不熟悉Python,那么PyTorch就是学习Python的好理由。](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h27m45s) + +* `x` :一小批图像, `y` :一小批标签。 + +如果您不确定函数采用什么参数,请按`shift+tab` 。 + +``` + list(zip(data.classes, y[0])) _[('agriculture', 1.0),_ _('artisinal_mine', 0.0),_ _('bare_ground', 0.0),_ _('blooming', 0.0),_ _('blow_down', 0.0),_ _('clear', 1.0),_ _('cloudy', 0.0),_ _('conventional_mine', 0.0),_ _('cultivation', 0.0),_ _('habitation', 0.0),_ _('haze', 0.0),_ _('partly_cloudy', 0.0),_ _('primary', 1.0),_ _('road', 0.0),_ _('selective_logging', 0.0),_ _('slash_burn', 1.0),_ _('water', 1.0)]_ +``` + +在幕后,PyTorch和fast.ai正在将我们的标签变成一个热门编码的标签。 如果实际标签是狗,它将看起来像: + +![](../img/1_u6f0xuCoSDDIz5zDrdTO5A.png) + +我们取`actuals`和`softmax`之间的差值,将它们加起来说出有多少错误(即损失函数)[ [1:31:02](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h31m2s) ]。 + +单热编码对于存储是非常低效的,因此我们将存储索引值(单个整数)而不是0和1的目标值( `y` )[ [1:31:21](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h31m21s) ]。 如果你看一下狗品种比赛的`y`值,你实际上不会看到1和0的大名单,但你会得到一个整数。 在内部,PyTorch正在将索引转换为单热编码向量(即使你真的不会看到它)。 PyTorch具有不同的损失函数,对于一个热编码而另一些不是 - 但这些细节被fast.ai库隐藏,因此您不必担心它。 但要实现的很酷的事情是,我们对单标签分类和多标签分类都做了完全相同的事情。 + +**问题** :改变softmax的日志基数是否有意义?[ [01:32:55](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h32m55s) ]不,改变基数只是神经网络可以轻松学习的线性缩放: + +![](../img/1_WUqrfSxhd4dBKSgknPWV6w.png) + +``` + plt.imshow(data.val_ds.denorm(to_np(x))[0]*1.4); +``` + +![](../img/1_--s7Fu1xm8tYWTGgcE93lg.png) + +* `*1.4` :图像被洗掉,因此使其更加清晰可见(“稍微提亮一点”)。 图像只是数字的矩阵,所以我们可以做这样的事情。 +* 对这样的图像进行实验是很好的,因为这些图像完全不像ImageNet。 你做的涉及卷积神经网络的绝大多数事情实际上都不会像ImageNet(医学成像,分类不同种类的钢管,卫星图像等) + +``` + sz=64 +``` + +``` + data = get_data(sz) data = data.resize(int(sz*1.3), 'tmp') +``` + +* 我们不会将`sz=64`用于猫狗比赛,因为我们开始使用经过预先训练的ImageNet网络,这种网络几乎完美无缺。 如果我们用64乘64的图像重新训练整个集合,我们就会破坏已经非常好的权重。 请记住,大多数ImageNet模型都使用224 x 224或299 x 299图像进行训练。 +* ImageNet中没有与上面相似的图像。 只有前几层对我们有用。 因此,在这种情况下,从较小的图像开始效果很好。 + +``` + learn = ConvLearner.pretrained(f_model, data, metrics=metrics) +``` + +``` + lrf=learn.lr_find() learn.sched.plot() +``` + +![](../img/1_PyRr1RqxJkvTp0zX9xZkmA.png) + +``` + lr = 0.2 learn.fit(lr, 3, cycle_len=1, cycle_mult=2) +``` + +``` + _[ 0\. 0.14882 0.13552 0.87878]_ _[ 1\. 0.14237 0.13048 0.88251]_ _[ 2\. 0.13675 0.12779 0.88796]_ _[ 3\. 0.13528 0.12834 0.88419]_ _[ 4\. 0.13428 0.12581 0.88879]_ _[ 5\. 0.13237 0.12361 0.89141]_ _[ 6\. 0.13179 0.12472 0.8896 ]_ +``` + +``` + lrs = np.array( **[lr/9, lr/3, lr]** ) +``` + +``` + learn.unfreeze() learn.fit(lrs, 3, cycle_len=1, cycle_mult=2) +``` + +``` + _[ 0\. 0.12534 0.10926 0.90892]_ _[ 1\. 0.12035 0.10086 0.91635]_ _[ 2\. 0.11001 0.09792 0.91894]_ _[ 3\. 0.1144 0.09972 0.91748]_ _[ 4\. 0.11055 0.09617 0.92016]_ _[ 5\. 0.10348 0.0935 0.92267]_ _[ 6\. 
0.10502 0.09345 0.92281]_ +``` + +* `[lr/9, lr/3, lr]` - 这是因为图像与ImageNet图像不同,而早期的图层可能并不像它们需要的那样接近。 + +``` + learn.sched.plot_loss() +``` + +![](../img/1_3fdnGg4PsQq3mpbM943fmw.png) + +``` + **sz = 128** learn.set_data(get_data(sz)) learn.freeze() learn.fit(lr, 3, cycle_len=1, cycle_mult=2) +``` + +``` + _[ 0\. 0.09729 0.09375 0.91885]_ _[ 1\. 0.10118 0.09243 0.92075]_ _[ 2\. 0.09805 0.09143 0.92235]_ _[ 3\. 0.09834 0.09134 0.92263]_ _[ 4\. 0.096 0.09046 0.9231 ]_ _[ 5\. 0.09584 0.09035 0.92403]_ _[ 6\. 0.09262 0.09059 0.92358]_ +``` + +``` + learn.unfreeze() learn.fit(lrs, 3, cycle_len=1, cycle_mult=2) learn.save(f'{sz}') +``` + +``` + _[ 0\. 0.09623 0.08693 0.92696]_ _[ 1\. 0.09371 0.08621 0.92887]_ _[ 2\. 0.08919 0.08296 0.93113]_ _[ 3\. 0.09221 0.08579 0.92709]_ _[ 4\. 0.08994 0.08575 0.92862]_ _[ 5\. 0.08729 0.08248 0.93108]_ _[ 6\. 0.08218 0.08315 0.92971]_ +``` + +``` + **sz = 256** learn.set_data(get_data(sz)) learn.freeze() learn.fit(lr, 3, cycle_len=1, cycle_mult=2) +``` + +``` + _[ 0\. 0.09161 0.08651 0.92712]_ _[ 1\. 0.08933 0.08665 0.92677]_ _[ 2\. 0.09125 0.08584 0.92719]_ _[ 3\. 0.08732 0.08532 0.92812]_ _[ 4\. 0.08736 0.08479 0.92854]_ _[ 5\. 0.08807 0.08471 0.92835]_ _[ 6\. 0.08942 0.08448 0.9289 ]_ +``` + +``` + learn.unfreeze() learn.fit(lrs, 3, cycle_len=1, cycle_mult=2) learn.save(f'{sz}') +``` + +``` + _[ 0\. 0.08932 0.08218 0.9324 ]_ _[ 1\. 0.08654 0.08195 0.93313]_ _[ 2\. 0.08468 0.08024 0.93391]_ _[ 3\. 0.08596 0.08141 0.93287]_ _[ 4\. 0.08211 0.08152 0.93401]_ _[ 5\. 0.07971 0.08001 0.93377]_ _[ 6\. 0.07928 0.0792 0.93554]_ +``` + +``` + log_preds,y = learn.TTA() preds = np.mean(np.exp(log_preds),0) f2(preds,y) +``` + +``` + _0.93626519738612801_ +``` + +人们问过这个问题有几个问题[ [01:38:46](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h38m46s) ]: + +``` + data = data.resize(int(sz*1.3), 'tmp') +``` + +当我们指定要应用的变换时,我们发送一个大小: + +``` + tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down, max_zoom=1.05) +``` + +数据加载器所做的一件事就是按需调整图像大小。 这与`data.resize` 。 如果初始图像是1000乘1000,那么读取该JPEG并将其调整为64乘64会比训练卷积网花费更多时间。 `data.resize`告诉它我们不会使用大于`sz*1.3`图像,所以要经过一次并创建这个大小的新JPEG。 由于图像是矩形的,因此新的JPEG最小边缘为`sz*1.3` (中心裁剪)。 它会为你节省很多时间。 + +``` + metrics=[f2] +``` + +我们在这款笔记本上使用了[F-beta](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html)而不是`accuacy` - 它是一种权衡假阴性和误报的方法。 我们使用它的原因是因为这个特殊的Kaggle比赛想要使用它。 看看[planet.py](https://github.com/fastai/fastai/blob/master/courses/dl1/planet.py) ,了解如何创建自己的指标函数。 这是最后打印出来的`[ 0\. 0.08932 0.08218 **0.9324** ]` + +#### 多标签分类的激活功能[ [01:44:25](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h44m25s) ] + +用于多标签分类的激活函数称为**sigmoid。** + +![](../img/1_j7dLkIwvXr6bs6MUzFaN2w.png) + +![](../img/1_p8VFZrnPgWgVpUf62ZPuuA.png) + +**问题** :为什么我们不开始训练差异学习率而不是单独训练最后一层? [ [01:50:30](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h50m30s) ] + +![](../img/1_2Ocl12SOFKZ75iV4bqg-OQ.png) + +你可以跳过最后一层的训练,直接进入不同的学习率,但你可能不想这样做。 卷积层都包含预先训练的权重,因此它们不是随机的 - 对于接近ImageNet的东西,它们确实很好; 对于那些与ImageNet不相近的东西,它们总比没有好。 然而,我们所有完全连接的层都是完全随机的。 因此,您总是希望通过先训练它们来使完全连接的权重优于随机。 否则,如果你直接解冻,那么当你后来的那些仍然是随机的时候,你实际上是要摆弄那些早期的图层权重 - 这可能不是你想要的。 + +问题:当您使用差异学习率时,这三种学习率是否在各层之间均匀分布? 
[ [01:55:35](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h55m35s) ]我们将在后面的课程中详细讨论这个问题,但是fast.ai库中有一个“图层组”的概念。 在类似ResNet50的东西中,有数百个层,您可能不想写出数百个学习速率,因此库决定如何拆分它们,最后一个总是指我们随机初始化的完全连接的层并补充说。 + +#### 可视化图层[ [01:56:42](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h56m42s) ] + +``` + learn.summary() +``` + +``` + _[('Conv2d-1',_ _OrderedDict([('input_shape', [-1, 3, 64, 64]),_ _('output_shape', [-1, 64, 32, 32]),_ _('trainable', False),_ _('nb_params', 9408)])),_ _('BatchNorm2d-2',_ _OrderedDict([('input_shape', [-1, 64, 32, 32]),_ _('output_shape', [-1, 64, 32, 32]),_ _('trainable', False),_ _('nb_params', 128)])),_ _('ReLU-3',_ _OrderedDict([('input_shape', [-1, 64, 32, 32]),_ _('output_shape', [-1, 64, 32, 32]),_ _('nb_params', 0)])),_ _('MaxPool2d-4',_ _OrderedDict([('input_shape', [-1, 64, 32, 32]),_ _('output_shape', [-1, 64, 16, 16]),_ _('nb_params', 0)])),_ _('Conv2d-5',_ _OrderedDict([('input_shape', [-1, 64, 16, 16]),_ _('output_shape', [-1, 64, 16, 16]),_ _('trainable', False),_ _('nb_params', 36864)]))_ ... +``` + +* `'input_shape', [-1, **3, 64, 64** ]` - 1,3,64,64 `'input_shape', [-1, **3, 64, 64** ]` - PyTorch在图像大小之前列出通道。 一些GPU计算在按此顺序运行时运行得更快。 这是通过转换步骤在幕后完成的。 +* `-1` :表示批量大小。 Keras使用`None` 。 +* `'output_shape', [-1, 64, 32, 32]` - 64是内核的数量 + +**问题** :一个非常小的数据集的学习速率查找器返回了奇怪的数字并且情节是空的[ [01:58:57](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h58m57s) ] - 学习速率查找器将一次通过一个小批量。 如果你有一个很小的数据集,那么就没有足够的小批量。 所以诀窍是让您的批量大小非常小,如4或8。 + +### 结构化数据[ [01:59:48](https://youtu.be/9C06ZPF8Uuc%3Ft%3D1h59m48s) ] + +我们在机器学习中使用了两种类型的数据集: + +* **非结构化** - 音频,图像,自然语言文本,其中对象内的所有事物都是相同的东西 - 像素,波形的幅度或单词。 +* **结构化** - 损益表,关于Facebook用户的信息,每个列在结构上完全不同。 “结构化”是指您可能在数据库或电子表格中找到的柱状数据,其中不同的列表示不同类型的事物,每行代表一个观察。 + +结构化数据在学术界经常被忽略,因为如果你有更好的物流模型,很难在花哨的会议论文集中发表。 但这是让世界变得圆满,让每个人都有钱和效率的事情。 我们不会忽视它,因为我们正在进行实际的深度学习,而Kaggle也不会因为人们将奖金放在Kaggle来解决现实世界的问题: + +* [CorporaciónFavoritaGrocery销售预测](https://www.kaggle.com/c/favorita-grocery-sales-forecasting) - 目前正在运行 +* [罗斯曼商店销售](https://www.kaggle.com/c/rossmann-store-sales) - 几乎与上述相同但已完成竞争。 + +#### 罗斯曼商店促销[ [02:02:42](https://youtu.be/9C06ZPF8Uuc%3Ft%3D2h2m42s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb) + +``` + **from** **fastai.structured** **import** * **from** **fastai.column_data** **import** * np.set_printoptions(threshold=50, edgeitems=20) PATH='data/rossmann/' +``` + +* `fastai.structured` - 不是PyTorch特定的,也用于机器学习过程中做任何没有PyTorch的随机森林。 它可以单独使用而无需Fast.ai库的任何其他部分。 +* `fastai.column_data` - 允许我们使用柱状结构化数据执行Fast.ai和PyTorch。 +* 对于结构化数据需要大量使用**Pandas** 。 Pandas试图用Python复制R的数据框(如果你不熟悉Pandas,这里有一本好书 - [用于数据分析的Python,第2版](http://shop.oreilly.com/product/0636920050896.do) ) + +有很多数据预处理这个笔记本包含来自第三名获胜者( [实体嵌入分类变量](https://arxiv.org/abs/1604.06737) )的整个管道。 本课程不涉及数据处理,但在机器学习课程中有详细介绍,因为特征工程非常重要。 + +#### 查看CSV文件 + +``` + table_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test'] +``` + +``` + tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names] +``` + +``` + for t in tables: display(t.head()) +``` + +![](../img/1_v0D2IcWqBOVyhGRo8Mv1gA.png) + +* `StoreType` - 您经常会获得某些列包含“代码”的数据集。 代码意味着什么并不重要。 远离对它的过多学习,看看数据首先说的是什么。 + +#### 加入表格 + +这是一个关系数据集,你已经连接了很多表 - 这很容易与Pandas `merge` : + +``` + def join_df(left, right, left_on, right_on=None, suffix='_y'): if right_on is None: right_on = left_on return left. 
**merge** (right, how='left', left_on=left_on, right_on=right_on, suffixes=("", suffix)) +``` + +来自Fast.ai图书馆: + +``` + add_datepart(train, "Date", drop=False) +``` + +* 记下一个日期并拉出一堆列,例如“星期几”,“季度开始”,“一年中的一个月”等等,并将它们全部添加到数据集中。 +* 持续时间部分将计算诸如下一个假期之前的时间,自上次假期以来的持续时间等等。 + +``` + joined.to_feather(f'{PATH}joined') +``` + +* `to_feather` :将Pandas的数据帧保存为“羽状”格式,将其放在RAM中并将其转储到磁盘上。 所以真的很快。 厄瓜多尔杂货竞赛有3.5亿条记录,因此您需要关心保存需要多长时间。 + +#### 下周 + +* 将列拆分为两种类型:分类和连续。 分类列将表示为一个热编码,并且连续列按原样馈送到完全连接的层。 +* 分类:商店#1和商店#2在数量上并不相互关联。 类似地,星期一(第0天)和星期二(第1天)的星期几。 +* 连续:距离最接近的竞争对手的公里距离是我们用数字处理的数字。 +* `ColumnarModelData` diff --git a/zh/dl4.md b/zh/dl4.md new file mode 100644 index 0000000000000000000000000000000000000000..d6197fe0c4458dee399b7251df8849d62920bb54 --- /dev/null +++ b/zh/dl4.md @@ -0,0 +1,1023 @@ +# 深度学习2:第1部分第4课 + +### [第4课](http://forums.fast.ai/t/wiki-lesson-4/9402/1) + +学生用品: + +* [改善我们学习率的方式](https://techburst.io/improving-the-way-we-work-with-learning-rate-5e99554f163b) +* [循环学习率技术](http://teleported.in/posts/cyclic-learning-rate/) +* [使用重新启动(SGDR)探索随机梯度下降](https://medium.com/38th-street-studios/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e) +* [使用差异学习率转移学习](https://towardsdatascience.com/transfer-learning-using-differential-learning-rates-638455797f00) +* [让计算机看得比人类更好](https://medium.com/%40ArjunRajkumar/getting-computers-to-see-better-than-humans-346d96634f73) + +![](../img/1_D0WqPCX7RfOL47TOEfkzYg.png) + +#### 辍学[04:59] + +``` + learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True) +``` + +* `precompute=True` :预先计算从最后一个卷积层出来的激活。 请记住,激活是一个数字,它是根据构成内核/过滤器的一些权重/参数计算出来的,它们会应用于上一层的激活或输入。 + +``` + learn +``` + +``` + _Sequential(_ _(0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)_ _(1): Dropout(p=0.5)_ _(2): Linear(in_features=1024, out_features=512)_ _(3): ReLU()_ _(4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)_ _(5): Dropout(p=0.5)_ _(6): Linear(in_features=512, out_features=120)_ _(7): LogSoftmax()_ _)_ +``` + +`learn` - 这将显示我们最后添加的图层。 这些是我们在`precompute=True`时训练的层 + +(0),(4): `BatchNorm`将在上一课中介绍 + +(1),(5): `Dropout` + +(2): `Linear`层简单地表示矩阵乘法。 这是一个包含1024行和512列的矩阵,因此它将进行1024次激活并吐出512次激活。 + +(3): `ReLU` - 只需用零替换负数 + +(6): `Linear` - 第二个线性层,从前一个线性层获取512次激活并将它们通过一个新矩阵乘以512乘120并输出120次激活 + +(7): `Softmax` - 激活函数,返回最多为1的数字,每个数字在0和1之间: + +![](../img/1_PNRoFZeNc0DfGyqsq-S7sA.png) + +出于较小的数值精度原因,事实证明最好直接使用softmax的log而不是softmax [ [15:03](https://youtu.be/gbceqO8PpBg%3Ft%3D15m3s) ]。 这就是为什么当我们从模型中得到预测时,我们必须做`np.exp(log_preds)` 。 + +#### 什么是`Dropout` ,什么是`p` ? [ [08:17](https://youtu.be/gbceqO8PpBg%3Ft%3D8m17s) ] + +``` + _Dropout(p=0.5)_ +``` + +![](../img/1_iF4XC8gg608IUouSRI5VrA.png) + +如果我们将`p=0.5`压降应用于`Conv2`层,它将如上所示。 我们通过,选择激活,并以50%的几率删除它。 所以`p=0.5`是删除该单元格的概率。 输出实际上并没有太大变化,只是一点点。 + +随机丢弃一层激活的一半有一个有趣的效果。 需要注意的一件重要事情是,对于每个小批量,我们会丢弃该层中不同的随机半部分激活。 它迫使它不适合。 换句话说,当一个特定的激活只学习那只精确的狗或精确的猫被淘汰时,模型必须尝试找到一个表示,即使随机的一半激活每次被抛弃,它仍然继续工作。 + +这对于进行现代深度学习工作以及解决泛化问题至关重要。 Geoffrey Hinton和他的同事们提出了这个想法,这个想法受到大脑工作方式的启发。 + +* `p=0.01`将丢弃1%的激活。 它根本不会改变任何东西,也不会阻止过度拟合(不是一般化)。 +* `p=0.99`将抛弃99%的激活。 不会过度适应并且非常适合概括,但会破坏你的准确性。 +* 默认情况下,第一层为`0.25` ,第二层为`0.5` [17:54]。 如果你发现它过度拟合,就开始碰撞它 - 尝试将全部设置为`0.5` ,仍然过度拟合,尝试`0.7`等。如果你不合适,你可以尝试降低它,但你不太可能需要降低它。 +* ResNet34具有较少的参数,因此它不会过度匹配,但对于像ResNet50这样的更大的架构,您通常需要增加丢失。 + +你有没有想过为什么验证损失比培训早期的培训损失更好? [ [12:32](https://youtu.be/gbceqO8PpBg%3Ft%3D12m32s) ]这是因为我们在验证集上运行推理(即进行预测)时关闭了丢失。 我们希望尽可能使用最好的模型。 + +**问题** :你是否必须采取任何措施来适应你正在放弃激活的事实? 
[ [13:26](https://youtu.be/gbceqO8PpBg%3Ft%3D13m26s) ]我们没有,但是当你说`p=0.5`时,PyTorch会做两件事。 它抛弃了一半的激活,并且它已经存在的所有激活加倍,因此平均激活不会改变。 + +在Fast.ai中,您可以传入`ps` ,这是所有添加的图层的`p`值。 它不会改变预训练网络中的辍学率,因为它应该已经训练过一些适当的辍学水平: + +``` + learn = ConvLearner.pretrained(arch, data, **ps=0.5** , precompute=True) +``` + +您可以通过设置`ps=0.`来删除dropout `ps=0.` 但即使在几个时代之后,我们开始大规模过度拟合(训练损失«验证损失): + +``` + [2. **0.3521** **0.55247** 0.84189] +``` + +当`ps=0.` ,dropout图层甚至没有添加到模型中: + +``` + Sequential( (0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True) (1): Linear(in_features=4096, out_features=512) (2): ReLU() (3): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True) (4): Linear(in_features=512, out_features=120) (5): LogSoftmax() ) +``` + +你可能已经注意到,它已经添加了两个`Linear`层[ [16:19](https://youtu.be/gbceqO8PpBg%3Ft%3D16m19s) ]。 我们不必这样做。 您可以设置`xtra_fc`参数。 注意:您至少需要一个获取卷积层输出(本例中为4096)并将其转换为类数(120个品种)的一个: + +``` + learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True, **xtra_fc=[]** ); learn +``` + +``` + _Sequential(_ _(0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)_ _(1): Linear(in_features=1024, out_features=120)_ _(2): LogSoftmax()_ _)_ +``` + +``` + learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True, **xtra_fc=[700, 300]** ); learn +``` + +``` + _Sequential(_ _(0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)_ _(1): Linear(in_features=1024, out_features=_ **_700_** _)_ _(2): ReLU()_ _(3): BatchNorm1d(700, eps=1e-05, momentum=0.1, affine=True)_ _(4): Linear(in_features=700, out_features=_ **_300_** _)_ _(5): ReLU()_ _(6): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True)_ _(7): Linear(in_features=300, out_features=120)_ _(8): LogSoftmax()_ _)_ +``` + +**问题** :有没有特定的方法可以确定它是否过度装配? [ [19:53](https://youtu.be/gbceqO8PpBg%3Ft%3D19m53s) ]。 是的,您可以看到培训损失远低于验证损失。 你无法判断它是否_过度_装修。 零过度拟合通常不是最佳的。 您要做的唯一事情就是降低验证损失,因此您需要尝试一些事情,看看是什么导致验证损失很低。 对于你的特殊问题,你会有一种过度加工的感觉。 + +**问题** :为什么平均激活很重要? [ [21:15](https://youtu.be/gbceqO8PpBg%3Ft%3D21m15s) ]如果我们刚刚删除了一半的激活,那么将它们作为输入的下一次激活也将减半,之后的所有内容。 例如,如果蓬松的耳朵大于0.6,则蓬松的耳朵会蓬松,现在如果它大于0.3则只是蓬松 - 这改变了意义。 这里的目标是删除激活而不改变含义。 + +**问题** :我们可以逐层提供不同级别的辍学吗? [ [22:41](https://youtu.be/gbceqO8PpBg%3Ft%3D22m41s) ]是的,这就是它被称为`ps` : + +``` + learn = ConvLearner.pretrained(arch, data, ps=[0., 0.2], precompute=True, xtra_fc=[512]); learn +``` + +``` + _Sequential(_ _(0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True)_ _(1): Linear(in_features=4096, out_features=512)_ _(2): ReLU()_ _(3): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)_ _(4): Dropout(p=0.2)_ _(5): Linear(in_features=512, out_features=120)_ _(6): LogSoftmax()_ _)_ +``` + +* 当早期或晚期的图层应该具有不同的辍学量时,没有经验法则。 +* 如果有疑问,请为每个完全连接的层使用相同的压差。 +* 通常人们只会在最后一个线性层上投入辍学。 + +**问题** :为什么要监控损失而不是准确性? 
[ [23:53](https://youtu.be/gbceqO8PpBg%3Ft%3D23m53s) ]损失是我们唯一可以看到的验证集和训练集。 正如我们后来所了解的那样,损失是我们实际上正在优化的事情,因此更容易监控和理解这意味着什么。 + +**问题** :我们是否需要在添加辍学后调整学习率?[ [24:33](https://youtu.be/gbceqO8PpBg%3Ft%3D24m33s) ]它似乎不足以影响学习率。 理论上,它可能但不足以影响我们。 + +#### 结构化和时间序列数据[ [25:03](https://youtu.be/gbceqO8PpBg%3Ft%3D25m3s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb) / [Kaggle](https://www.kaggle.com/c/rossmann-store-sales) + +![](../img/1_-yc7uZaE44dDVOB850I9-A.png) + +列有两种类型: + +* 分类 - 它有许多“级别”,例如StoreType,Assortment +* 连续 - 它有一个数字,其中数字的差异或比率具有某种含义,例如竞争距离 + +``` + cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear', 'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw', 'SchoolHoliday_fw', 'SchoolHoliday_bw'] +``` + +``` + contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday'] +``` + +``` + n = len(joined); n +``` + +* 数字,如`Year` , `Month` ,虽然我们可以将它们视为连续的,但我们没有必要。 如果我们决定将`Year`作为一个分类变量,我们告诉我们的神经网络,对于`Year` (2000,2001,2002)的每个不同“级别”,你可以完全不同地对待它; 在哪里 - 如果我们说它是连续的,它必须提出某种平滑的功能来适应它们。 通常情况下,实际上是连续的但没有很多不同的级别(例如`Year` , `DayOfWeek` ),通常将它们视为分类更好。 +* 选择分类变量和连续变量是您要做出的建模决策。 总之,如果它在数据中是分类的,则必须是分类的。 如果它在数据中是连续的,您可以选择是在模型中使其连续还是分类。 +* 一般来说,浮点数难以分类,因为有很多级别(我们将级别数称为“ **基数** ” - 例如,星期几变量的基数为7)。 + +**问题** :你有没有对连续变量进行分类?[ [31:02](https://youtu.be/gbceqO8PpBg%3Ft%3D31m2s) ] Jeremy不会对变量进行分类,但我们可以做的一件事,比如最高温度,分为0-10,10-20,20-30,然后调用分类。 有趣的是,上周刚发表一篇论文,其中一组研究人员发现有时候分组可能会有所帮助。 + +**问题** :如果您将年份作为一个类别,当模型遇到一个前所未有的年份时会发生什么? [ [31:47](https://youtu.be/gbceqO8PpBg%3Ft%3D31m47s) ]我们会到达那里,但简短的回答是,它将被视为一个未知类别。 熊猫有一个特殊的类别叫做未知,如果它看到一个以前没见过的类别,它会被视为未知。 + +``` + for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered() +``` + +``` + for v in contin_vars: joined[v] = joined[v].astype('float32') +``` + +``` + dep = 'Sales' joined = joined[cat_vars+contin_vars+[dep, 'Date']].copy() +``` + +* 循环遍历`cat_vars`并将适用的数据框列转换为分类列。 +* 循环通过`contin_vars`并将它们设置为`float32` (32位浮点),因为这是PyTorch所期望的。 + +#### 从一个小样本开始[ [34:29](https://youtu.be/gbceqO8PpBg%3Ft%3D34m29s) ] + +``` + idxs = get_cv_idxs(n, val_pct=150000/n) joined_samp = joined.iloc[idxs].set_index("Date") samp_size = len(joined_samp); samp_size +``` + +![](../img/1_dHlXaLjRQSGyrG9pGkWMMQ.png) + +这是我们的数据。 即使我们将一些列设置为“类别”(例如'StoreType','Year'),Pandas仍然在笔记本中显示为字符串。 + +``` + df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True) yl = np.log(y) +``` + +`proc_df` (进程数据框) - Fast.ai中的一个函数,它执行以下操作: + +1. 拉出因变量,将其放入单独的变量中,并从原始数据框中删除它。 换句话说, `df`没有`Sales`列, `y`只包含`Sales`列。 +2. `do_scale` :神经网络真的希望所有输入数据都在零左右,标准偏差大约为1.因此,我们取数据,减去均值,然后除以标准偏差即可。 它返回一个特殊对象,用于跟踪它用于该规范化的均值和标准偏差,因此您可以稍后对测试集执行相同操作( `mapper` )。 +3. 
它还处理缺失值 - 对于分类变量,它变为ID:0,其他类别变为1,2,3等。 对于连续变量,它用中位数替换缺失值,并创建一个新的布尔列,说明它是否丢失。 + +![](../img/1_Zs6ASJF8iaAe3cduCmLYKw.png) + +在处理之后,2014年例如变为2,因为分类变量已经被从零开始的连续整数替换。 原因是,我们将在稍后将它们放入矩阵中,并且当它可能只是两行时,我们不希望矩阵长度为2014行。 + +现在我们有一个数据框,它不包含因变量,一切都是数字。 这就是我们需要深入学习的地方。 查看机器学习课程了解更多详情。 机器学习课程中涉及的另一件事是验证集。 在这种情况下,我们需要预测未来两周的销售情况,因此我们应该创建一个验证集,这是我们培训集的最后两周: + +``` + val_idx = np.flatnonzero((df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1))) +``` + +* [如何(以及为什么)创建一个好的验证集](http://www.fast.ai/2017/11/13/validation-sets/) + +#### 让我们直接进入深度学习行动[ [39:48](https://youtu.be/gbceqO8PpBg%3Ft%3D39m48s) ] + +对于任何Kaggle比赛,重要的是您要充分了解您的指标 - 您将如何评判。 在[本次比赛中](https://www.kaggle.com/c/rossmann-store-sales) ,我们将根据均方根百分比误差(RMSPE)进行判断。 + +![](../img/1_a7mJ5VCeuAxagGrHOq6ekQ.png) + +``` + def inv_y(a): return np.exp(a) +``` + +``` + def exp_rmspe(y_pred, targ): targ = inv_y(targ) pct_var = (targ - inv_y(y_pred))/targ return math.sqrt((pct_var**2).mean()) +``` + +``` + max_log_y = np.max(yl) y_range = (0, max_log_y*1.2) +``` + +* 当您获取数据的日志时,获得均方根误差实际上会得到均方根百分比误差。 + +``` + md = **ColumnarModelData.from_data_frame** (PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128, test_df=df_test) +``` + +* 按照惯例,我们将从创建模型数据对象开始,该对象具有内置于其中的验证集,训练集和可选测试集。 从那以后,我们将获得一个学习者,然后我们可以选择调用`lr_find` ,然后调用`learn.fit`等等。 +* 这里的区别是我们没有使用`ImageClassifierData.from_csv`或`.from_paths` ,我们需要一种名为`.from_paths`的不同类型的模型数据,我们调用`from_data_frame` 。 +* `PATH` :指定存储模型文件的位置等 +* `val_idx` :我们要放入验证集的行的索引列表 +* `df` :包含自变量的数据框 +* `yl` :我们取了`proc_df`返回的因变量`y`并记录了它的日志(即`np.log(y)` ) +* `cat_flds` : `cat_flds`哪些列视为分类。 请记住,到目前为止,一切都是一个数字,所以除非我们指定,否则它们将全部视为连续的。 + +现在我们有一个熟悉的标准模型数据对象,包含`train_dl` , `val_dl` , `train_ds` , `val_ds`等。 + +``` + m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range) +``` + +* 在这里,我们要求它创建一个适合我们的模型数据的学习者。 +* `0.04` :使用多少辍学 +* `[1000,500]` : `[1000,500]`激活多少次 +* `[0.001,0.01]` :在后续层使用多少辍学者 + +#### 关键新概念:嵌入[ [45:39](https://youtu.be/gbceqO8PpBg%3Ft%3D45m39s) ] + +我们暂时忘记分类变量: + +![](../img/1_T604NRtHHBkBWFvWoovlUw.png) + +请记住,您永远不想将ReLU放在最后一层,因为softmax需要负数来创建低概率。 + +#### **完全连接神经网络的简单视图[** [**49:13**](https://youtu.be/gbceqO8PpBg%3Ft%3D49m13s) **]:** + +![](../img/1_5D0_nDy0K0QLKFHTD07gcQ.png) + +对于回归问题(不是分类),您甚至可以跳过softmax图层。 + +#### 分类变量[ [50:49](https://youtu.be/gbceqO8PpBg%3Ft%3D50m49s) ] + +我们创建一个7行的新矩阵和我们选择的列数(例如4)并用浮点数填充它。 要使用连续变量将“星期日”添加到我们的等级1张量中,我们会查看此矩阵,它将返回4个浮点数,并将它们用作“星期日”。 + +![](../img/1_cAgCy5HfD0rvPDg2dQITeg.png) + +最初,这些数字是随机的。 但我们可以通过神经网络将它们更新,并以减少损失的方式更新它们。 换句话说,这个矩阵只是我们神经网络中的另一组权重。 这种类型的**矩阵**称为“ **嵌入矩阵** ”。 嵌入矩阵是我们从该类别的零和最大级别之间的整数开始的。 我们索引矩阵以找到一个特定的行,然后将它追加到我们所有的连续变量中,之后的所有内容与之前的相同(线性→ReLU→等)。 + +**问题** :这4个数字代表什么?[ [55:12](https://youtu.be/gbceqO8PpBg%3Ft%3D55m12s) ]当我们看协同过滤时,我们会更多地了解这一点,但就目前而言,它们只是我们正在学习的参数,最终会给我们带来很大的损失。 我们稍后会发现这些特定的参数通常是人类可解释的并且非常有趣,但这是副作用。 + +**问题** :您对嵌入矩阵的维数有很好的启发式吗? [ [55:57](https://youtu.be/gbceqO8PpBg%3Ft%3D55m57s) ]我确实做到了! 
让我们来看看。 + +``` + cat_sz = [(c, len(joined_samp[c].cat.categories)+1) **for** c **in** cat_vars] cat_sz +``` + +``` + _[('Store', 1116),_ _('DayOfWeek', 8),_ _('Year', 4),_ _('Month', 13),_ _('Day', 32),_ _('StateHoliday', 3),_ _('CompetitionMonthsOpen', 26),_ _('Promo2Weeks', 27),_ _('StoreType', 5),_ _('Assortment', 4),_ _('PromoInterval', 4),_ _('CompetitionOpenSinceYear', 24),_ _('Promo2SinceYear', 9),_ _('State', 13),_ _('Week', 53),_ _('Events', 22),_ _('Promo_fw', 7),_ _('Promo_bw', 7),_ _('StateHoliday_fw', 4),_ _('StateHoliday_bw', 4),_ _('SchoolHoliday_fw', 9),_ _('SchoolHoliday_bw', 9)]_ +``` + +* 以下是每个分类变量及其基数的列表。 +* 即使原始数据中没有缺失值,您仍然应该留出一个未知的,以防万一。 +* 确定嵌入大小的经验法则是基数大小除以2,但不大于50。 + +``` + emb_szs = [(c, min(50, (c+1)//2)) **for** _,c **in** cat_sz] emb_szs +``` + +``` + _[(1116, 50),_ _(8, 4),_ _(4, 2),_ _(13, 7),_ _(32, 16),_ _(3, 2),_ _(26, 13),_ _(27, 14),_ _(5, 3),_ _(4, 2),_ _(4, 2),_ _(24, 12),_ _(9, 5),_ _(13, 7),_ _(53, 27),_ _(22, 11),_ _(7, 4),_ _(7, 4),_ _(4, 2),_ _(4, 2),_ _(9, 5),_ _(9, 5)]_ +``` + +然后将嵌入大小传递给学习者: + +``` + m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range) +``` + +**问题** :有没有办法初始化嵌入矩阵除了随机? [ [58:14](https://youtu.be/gbceqO8PpBg%3Ft%3D58m14s) ]我们可能会在课程的后期讨论预训练,但基本的想法是,如果罗斯曼的其他人已经训练了一个神经网络来预测奶酪销售,你也可以从他们的嵌入矩阵开始商店预测酒类销售。 例如,在Pinterest和Instacart就会发生这种情况。 Instacart使用这种技术来路由他们的购物者,Pinterest使用它来决定在网页上显示什么。 他们嵌入了在组织中共享的产品/商店矩阵,因此人们无需培训新的产品/商店。 + +**问题** :使用嵌入矩阵优于单热编码有什么好处? [ [59:23](https://youtu.be/gbceqO8PpBg%3Ft%3D59m23s) ]对于上面一周的例子,我们可以很容易地传递7个数字(例如星期日的[ [0,1,0,0,0,0,0](https://youtu.be/gbceqO8PpBg%3Ft%3D59m23s) ]),而不是4个数字。 这也是一个浮动列表,这将完全起作用 - 这就是一般来说,分类变量多年来一直用于统计(称为“虚拟变量”)。 问题是,星期日的概念只能与单个浮点数相关联。 所以它得到了这种线性行为 - 它说周日或多或少只是一件事。 通过嵌入,星期日是四维空间的概念。 我们倾向于发现的是这些嵌入向量倾向于获得这些丰富的语义概念。 例如,如果事实证明周末有不同的行为,您往往会看到周六和周日会有更高的特定数字。 + +> 通过具有更高的维度向量而不仅仅是单个数字,它为深度学习网络提供了学习这些丰富表示的机会。 + +嵌入的想法是所谓的“分布式表示” - 神经网络的最基本概念。 这就是神经网络中的概念具有很难解释的高维表示的想法。 这个向量中的这些数字甚至不必只有一个含义。 它可能意味着一件事,如果这个是低的,一个是高的,如果那个是高的那个,而另一个是低的,因为它正在经历这个丰富的非线性函数。 正是这种丰富的表现形式使它能够学习这种有趣的关系。 + +**问题** :嵌入是否适合某些类型的变量? [ [01:02:45](https://youtu.be/gbceqO8PpBg%3Ft%3D1h2m45s) ]嵌入适用于任何分类变量。 它唯一不能很好地工作的是基数太高的东西。 如果您有600,000行且变量有600,000个级别,那么这不是一个有用的分类变量。 但总的来说,本次比赛的第三名获胜者确实认为所有基因都不是太高,他们都把它们都视为绝对的。 好的经验法则是,如果你可以创建一个分类变量,你也可以这样,因为它可以学习这种丰富的分布式表示; 如果你把它留在连续的地方,它最能做的就是试着找到一个适合它的单一功能形式。 + +#### 场景背后的矩阵代数[ [01:04:47](https://youtu.be/gbceqO8PpBg%3Ft%3D1h4m47s) ] + +查找具有索引的嵌入与在单热编码向量和嵌入矩阵之间进行矩阵乘积相同。 但这样做非常低效,因此现代库实现这一点,即采用整数并查看数组。 + +![](../img/1_psxpwtr5bw55lKxVV_y81w.png) + +**问题** :您能否触及使用日期和时间作为分类以及它如何影响季节性? 
[ [01:06:59](https://youtu.be/gbceqO8PpBg%3Ft%3D1h6m59s) ]有一个名为`add_datepart`的Fast.ai函数,它接受数据框和列名。 它可以选择从数据框中删除该列,并将其替换为代表该日期的所有有用信息的大量列,例如星期几,日期,月份等等(基本上是Pandas给我们的所有内容)。 + +``` + add_datepart(weather, "Date", drop=False) add_datepart(googletrend, "Date", drop=False) add_datepart(train, "Date", drop=False) add_datepart(test, "Date", drop=False) +``` + +![](../img/1_OJQ53sO6WXh0C-rzw1QyJg.png) + +因此,例如,星期几现在变为八行四列嵌入矩阵。 从概念上讲,这允许我们的模型创建一些有趣的时间序列模型。 如果有一个七天周期的周期在周一和周三上升,但仅限于每天和仅在柏林,它可以完全这样做 - 它拥有它需要的所有信息。 这是处理时间序列的绝佳方式。 您只需确保时间序列中的循环指示符作为列存在。 如果你没有一个名为day of week的列,那么神经网络很难学会做mod 7并在嵌入矩阵中查找。 这不是不可能,但真的很难。 如果你预测旧金山的饮料销售,你可能想要一个AT&T公园球赛开始时的清单,因为这会影响到SoMa有多少人在喝啤酒。 因此,您需要确保基本指标或周期性在您的数据中,并且只要它们在那里,神经网络将学会使用它们。 + +#### 学习者[ [01:10:13](https://youtu.be/gbceqO8PpBg%3Ft%3D1h10m13s) ] + +``` + m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range) lr = 1e-3 +``` + +* `emb_szs` :嵌入大小 +* `len(df.columns)-len(cat_vars)` :数据框中连续变量的数量 +* `0.04` :嵌入矩阵有自己的丢失,这是辍学率 +* `1` :我们想要创建多少输出(最后一个线性层的输出) +* `[1000, 500]` :第一线性层和第二线性层中的激活次数 +* `[0.001, 0.01]` :第一线性层和第二线性层中的脱落 +* `y_range` :我们暂时不担心 + +``` + m.fit(lr, 3, metrics=[exp_rmspe]) +``` + +``` + _A Jupyter Widget_ +``` + +``` + _[ 0\. 0.02479 0.02205_ **_0.19309_** _]_ _[ 1\. 0.02044 0.01751_ **_0.18301_** _]_ _[ 2\. 0.01598 0.01571_ **_0.17248_** _]_ +``` + +* `metrics` :这是一个自定义指标,它指定在每个纪元结束时调用的函数并打印出结果 + +``` + m.fit(lr, 1, metrics=[exp_rmspe], cycle_len=1) +``` + +``` + _[ 0\. 0.00676 0.01041 0.09711]_ +``` + +通过使用所有训练数据,我们实现了大约0.09711的RMSPE。 公共领导委员会和私人领导委员会之间存在很大差异,但我们肯定是本次竞赛的最高端。 + +所以这是一种处理时间序列和结构化数据的技术。 有趣的是,与使用这种技术的组( [分类变量的实体嵌入](https://arxiv.org/abs/1604.06737) )相比,第二名获胜者做了更多的特征工程。 本次比赛的获胜者实际上是物流销售预测的主题专家,因此他们有自己的代码来创建大量的功能。 Pinterest的人们为建议建立了一个非常相似的模型也表示,当他们从梯度增强机器转向深度学习时,他们的功能工程设计更少,而且模型更简单,需要的维护更少。 因此,这是使用这种深度学习方法的一大好处 - 您可以获得最先进的结果,但工作量却少得多。 + +**问题** :我们是否正在使用任何时间序列? [ [01:15:01](https://youtu.be/gbceqO8PpBg%3Ft%3D1h15m1s) ]间接地,是的。 正如我们刚刚看到的那样,我们的列中有一周中的一周,一年中的一些等,其中大多数都被视为类别,因此我们正在构建一月,周日等的分布式表示。 我们没有使用任何经典的时间序列技术,我们所做的只是在神经网络中真正完全连接的层。 嵌入矩阵能够以比任何标准时间序列技术更丰富的方式处理诸如星期几周期性之类的事情。 + +关于图像模型和这个模型之间差异的**问题** [ [01:15:59](https://youtu.be/gbceqO8PpBg%3Ft%3D1h15m59s) ]:我们调用`get_learner`的方式有所不同。 在成像中我们只是做了`Learner.trained`并传递数据: + +``` + learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True) +``` + +对于这些类型的模型,事实上对于许多模型,我们构建的模型取决于数据。 在这种情况下,我们需要知道我们有什么嵌入矩阵。 所以在这种情况下,数据对象创建了学习者(颠倒到我们之前看到的): + +``` + m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range) +``` + +**步骤摘要** (如果你想将它用于你自己的数据集)[ [01:17:56](https://youtu.be/gbceqO8PpBg%3Ft%3D1h17m56s) ]: + +**第1步** 。 列出分类变量名称,并列出连续变量名称,并将它们放在Pandas数据框中 + +**第2步** 。 在验证集中创建所需的行索引列表 + +**第3步** 。 调用这段确切的代码: + +``` + md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128, test_df=df_test) +``` + +**第4步** 。 创建一个列表,列出每个嵌入矩阵的大小 + +**第5步** 。 调用`get_learner` - 您可以使用这些确切的参数开头: + +``` + m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range) +``` + +**第6步** 。 打电话给`m.fit` + +**问题** :如何对此类数据使用数据扩充,以及丢失如何工作? [ [01:18:59](https://youtu.be/gbceqO8PpBg%3Ft%3D1h18m59s) ]不知道。 Jeremy认为它必须是针对特定领域的,但他从未见过任何论文或业内任何人使用结构化数据和深度学习进行数据增强。 他认为可以做到但没有看到它完成。 辍学者正在做什么与以前完全一样。 + +**问题** :缺点是什么? 几乎没有人使用这个。 为什么不? 
[ [01:20:41](https://youtu.be/gbceqO8PpBg%3Ft%3D1h20m41s) ]基本上答案就像我们之前讨论过的那样,学术界没有人差不多正在研究这个问题,因为这不是人们发表的内容。 结果,人们可以看到的并没有很好的例子,并且说“哦,这是一种运作良好的技术,让我们的公司实施它”。 但也许同样重要的是,到目前为止,使用这个Fast.ai库,还没有任何方法可以方便地进行。 如果您想要实现其中一个模型,则必须自己编写所有自定义代码。 有很多大的商业和科学机会来使用它并解决以前未能很好解决的问题。 + +### 自然语言处理[ [01:23:37](https://youtu.be/gbceqO8PpBg%3Ft%3D1h23m37s) ] + +最具前瞻性的深度学习领域,它落后于计算机视觉两三年。 软件的状态和一些概念远没有计算机视觉那么成熟。 您在NLP中找到的一件事是您可以解决的特殊问题,并且它们具有特定的名称。 NLP中存在一种称为“语言建模”的特殊问题,它有一个非常具体的定义 - 它意味着建立一个模型,只要给出一个句子的几个单词,你能预测下一个单词将会是什么。 + +#### 语言建模[ [01:25:48](https://youtu.be/gbceqO8PpBg%3Ft%3D1h25m48s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lang_model-arxiv.ipynb) + +这里我们有来自arXiv(arXiv.org)的18个月的论文,这是一个例子: + +``` + ' '.join(md.trn_ds[0].text[:150]) +``` + +``` + _' csni the exploitation of mm - wave bands is one of the key - enabler for 5 g mobile \n radio networks ._ _however , the introduction of mm - wave technologies in cellular \n networks is not straightforward due to harsh propagation conditions that limit \n the mm - wave access availability ._ _mm - wave technologies require high - gain antenna \n systems to compensate for high path loss and limited power ._ _as a consequence , \n directional transmissions must be used for cell discovery and synchronization \n processes : this can lead to a non - negligible access delay caused by the \n exploration of the cell area with multiple transmissions along different \n directions ._ _\n the integration of mm - wave technologies and conventional wireless access \n networks with the objective of speeding up the cell search process requires new \n'_ +``` + +* `<cat>` - 论文的类别。 CSNI是计算机科学和网络 +* `<summ>` - 论文摘要 + +以下是训练有素的语言模型的输出结果。 我们做了简单的小测试,在这些测试中你传递了一些启动文本,看看模型认为下一步应该是什么: + +``` + sample_model(m, " csni algorithms that") +``` + +``` + _...use the same network as a single node are not able to achieve the same performance as the traditional network - based routing algorithms ._ _in this paper , we propose a novel routing scheme for routing protocols in wireless networks ._ _the proposed scheme is based ..._ +``` + +它通过阅读arXiv论文得知,正在写关于计算机网络的人会这样说。 记住,它开始根本不懂英语。 它开始时是一个嵌入矩阵,用于英语中每个随机的单词。 通过阅读大量的arXiv论文,它学到了什么样的单词跟随他人。 + +在这里,我们尝试将类别指定为计算机视觉: + +``` + sample_model(m, " cscv algorithms that") +``` + +``` + _...use the same data to perform image classification are increasingly being used to improve the performance of image classification algorithms ._ _in this paper , we propose a novel method for image classification using a deep convolutional neural network ( cnn ) ._ _the proposed method is ..._ +``` + +它不仅学会了如何写好英语,而且在你说出“卷积神经网络”之后,你应该使用括号来指定首字母缩略词“(CNN)”。 + +``` + sample_model(m," cscv algorithms. on ") +``` + +``` + ...the performance of deep learning for image classification <eos> +``` + +``` + sample_model(m,"<CAT> csni <SUMM> algorithms. <TITLE> on ") +``` + +``` + ...the performance of wireless networks <eos> +``` + +``` + sample_model(m,"<CAT> cscv <SUMM> algorithms. <TITLE> towards ") +``` + +``` + ...a new approach to image classification <eos> +``` + +``` + sample_model(m,"<CAT> csni <SUMM> algorithms. <TITLE> towards ") +``` + +``` + ...a new approach to the analysis of wireless networks <eos> +``` + +A language model can be incredibly deep and subtle, so we are going to try and build that — not because we care about this at all, but because we are trying to create a pre-trained model which is used to do some other tasks. For example, given an IMDB movie review, we will figure out whether they are positive or negative. 
It is a lot like cats vs. dogs — a classification problem. So we would really like to use a pre-trained network which at least knows how to read English. So we will train a model that predicts a next word of a sentence (ie language model), and just like in computer vision, stick some new layers on the end and ask it to predict whether something is positive or negative. + +#### IMDB [ [1:31:11](https://youtu.be/gbceqO8PpBg%3Ft%3D1h31m11s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson4-imdb.ipynb) + +What we are going to do is to train a language model, making that the pre-trained model for a classification model. In other words, we are trying to leverage exactly what we learned in our computer vision which is how to do fine-tuning to create powerful classification models. + +**Question** : why would doing directly what you want to do not work? [ [01:31:34](https://youtu.be/gbceqO8PpBg%3Ft%3D1h31m34s) ] It just turns out it doesn't empirically. There are several reasons. First of all, we know fine-tuning a pre-trained network is really powerful. So if we can get it to learn some related tasks first, then we can use all that information to try and help it on the second task. The other is IMDB movie reviews are up to a thousands words long. So after reading a thousands words knowing nothing about how English is structured or concept of a word or punctuation, all you get is a 1 or a 0 (positive or negative). Trying to learn the entire structure of English and then how it expresses positive and negative sentiments from a single number is just too much to expect. + +**Question** : Is this similar to Char-RNN by Karpathy? [ [01:33:09](https://youtu.be/gbceqO8PpBg%3Ft%3D1h33m9s) ] This is somewhat similar to Char-RNN which predicts the next letter given a number of previous letters. Language model generally work at a word level (but they do not have to), and we will focus on word level modeling in this course. + +**Question** : To what extent are these generated words/sentences actual copies of what it found in the training set? [ [01:33:44](https://youtu.be/gbceqO8PpBg%3Ft%3D1h33m44s) ] Words are definitely words it has seen before because it is not a character level so it can only give us the word it has seen before. Sentences, there are rigorous ways of doing it but the easiest would be by looking at examples like above, you get a sense of it. Most importantly, when we train the language model, we will have a validation set so that we are trying to predict the next word of something that has never seen before. There are tricks to using language models to generate text like [beam search](http://forums.fast.ai/t/tricks-for-using-language-models-to-generate-text/8127/2) . + +Use cases of text classification: + +* For hedge fund, identify things in articles or Twitter that caused massive market drops in the past. +* Identify customer service queries which tend to be associated with people who cancel their contracts in the next month +* Organize documents into whether they are part of legal discovery or not. 
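Before diving into the code, here is a tiny illustration (not from the notebook; the tokens are borrowed from the first IMDB review shown a bit further down) of what "predicting the next word" means in practice: the target sequence of a language model is simply the input sequence shifted along by one token.

```
# Toy example: each input word's target is the word that follows it.
tokens = ['i', 'have', 'always', 'loved', 'this', 'story']
x = tokens[:-1]   # ['i', 'have', 'always', 'loved', 'this']
y = tokens[1:]    # ['have', 'always', 'loved', 'this', 'story']

for inp, target in zip(x, y):
    print(f"given '{inp}', predict '{target}'")
```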
+ +``` + from fastai.learner import * +``` + +``` + import torchtext from torchtext import vocab, data from torchtext.datasets import language_modeling +``` + +``` + from fastai.rnn_reg import * from fastai.rnn_train import * from fastai.nlp import * from fastai.lm_rnn import * +``` + +``` + import dill as pickle +``` + +* `torchtext` — PyTorch's NLP library + +#### Data [ [01:37:05](https://youtu.be/gbceqO8PpBg%3Ft%3D1h37m5s) ] + +IMDB [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) + +``` + PATH = 'data/aclImdb/' +``` + +``` + TRN_PATH = 'train/all/' VAL_PATH = 'test/all/' TRN = f'{PATH}{TRN_PATH}' VAL = f'{PATH}{VAL_PATH}' +``` + +``` + %ls {PATH} +``` + +``` + imdbEr.txt imdb.vocab models/ README test/ tmp/ train/ +``` + +We do not have separate test and validation in this case. Just like in vision, the training directory has bunch of files in it: + +``` + trn_files = !ls {TRN} trn_files[:10] ['0_0.txt', + '0_3.txt', + '0_9.txt', + '10000_0.txt', + '10000_4.txt', + '10000_8.txt', + '1000_0.txt', + '10001_0.txt', + '10001_10.txt', + '10001_4.txt'] +``` + +``` + review = !cat {TRN}{trn_files[6]} review[0] +``` + +``` + "I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out lines. I mean, some of it didn't make sense with the rest of the flick, but who cares when you're laughing so hard! All in all the film wasn't the greatest thing since sliced bread, but I wasn't expecting that. It was a Troma flick so I figured it would totally suck. It's nice when something surprises you but not totally sucking.<br /><br />Rent it if you want to get stoned on a Friday night and laugh with your buddies. Don't rent it if you are an uptight weenie or want a zombie movie with lots of flesh eating.<br /><br />PS Uwe Boil was a nice touch." +``` + +Now we will check how many words are in the dataset: + +``` + !find {TRN} -name '*.txt' | xargs cat | wc -w +``` + +``` + 17486581 +``` + +``` + !find {VAL} -name '*.txt' | xargs cat | wc -w +``` + +``` + 5686719 +``` + +Before we can do anything with text, we have to turn it into a list of tokens. Token is basically like a word. Eventually we will turn them into a list of numbers, but the first step is to turn it into a list of words — this is called “tokenization” in NLP. A good tokenizer will do a good job of recognizing pieces in your sentence. Each separated piece of punctuation will be separated, and each part of multi-part word will be separated as appropriate. Spacy does a lot of NLP stuff, and it has the best tokenizer Jeremy knows. 
So the fast.ai library is designed to work well with the spaCy tokenizer, as well as with torchtext.

#### Creating a field [ [01:41:01](https://youtu.be/gbceqO8PpBg%3Ft%3D1h41m1s) ]

A field is a definition of how to pre-process some text.

```
TEXT = data.Field(lower=True, tokenize=spacy_tok)
```

* `lower=True` — lowercase the text
* `tokenize=spacy_tok` — tokenize with `spacy_tok`

Now we create the usual fast.ai model data object:

```
bs=64; bptt=70
```

```
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
```

* `PATH` : as per usual, where the data is, where to save models, etc.
* `TEXT` : torchtext's Field definition
* `**FILES` : list of all of the files we have: training, validation, and test (to keep things simple, we do not have a separate validation and test set, so both point to the validation folder)
* `bs` : batch size
* `bptt` : Back Prop Through Time. It means how long a sentence we will stick on the GPU at once
* `min_freq=10` : In a moment, we are going to be replacing words with integers (a unique index for every word). If there are any words that occur fewer than 10 times, just call them unknown.

After building our `ModelData` object, it automatically fills the `TEXT` object with a very important attribute: `TEXT.vocab` . This is a _vocabulary_ , which stores which unique words (or _tokens_ ) have been seen in the text, and how each word will be mapped to a unique integer id.

```
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]
```

```
['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in']
```

```
# 'stoi': 'string-to-int'
TEXT.vocab.stoi['the']
```

```
2
```

`itos` is sorted by frequency except for the first two special tokens. Using `vocab` , torchtext will turn words into integer IDs for us:

```
md.trn_ds[0].text[:12]
```

```
['i',
 'have',
 'always',
 'loved',
 'this',
 'story',
 '-',
 'the',
 'hopeful',
 'theme',
 ',',
 'the']
```

```
TEXT.numericalize([md.trn_ds[0].text[:12]])
```

```
Variable containing:
    12
    35
   227
   480
    13
    76
    17
     2
  7319
   769
     3
     2
[torch.cuda.LongTensor of size 12x1 (GPU 0)]
```

**Question** : Is it common to do any stemming or lemmatizing? [ [01:45:47](https://youtu.be/gbceqO8PpBg%3Ft%3D1h45m47s) ] Not really, no. Generally tokenization is what we want. To keep it as general as possible, we want to know what is coming next, so whether it is future tense or past tense, plural or singular, we don't really know which things are going to be interesting and which are not, so it seems that it is generally best to leave the text alone as much as possible.

**Question** : When dealing with natural language, isn't context important? Why are we tokenizing and looking at individual words? [ [01:46:38](https://youtu.be/gbceqO8PpBg%3Ft%3D1h46m38s) ] No, we are not looking at individual words — they are still in order. Just because we replaced "I" with the number 12, it is still in that order. There is a different way of dealing with natural language called "bag of words", which does throw away the order and context. In the Machine Learning course, we will be learning about working with bag-of-words representations, but my belief is that they are no longer useful, or are on the verge of becoming no longer useful. We are starting to learn how to use deep learning to use context properly.
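To make the `itos` / `stoi` mapping described above concrete, here is a tiny hand-rolled version of the same idea with made-up tokens (torchtext builds this for us, including the `<unk>` and `<pad>` entries and the frequency ordering):

```
from collections import Counter

tokens = ['the', 'movie', 'was', 'the', 'best', 'movie', 'the', 'cast', 'loved']

# Most frequent words get the smallest ids, after the two special tokens
freq = Counter(tokens)
itos = ['<unk>', '<pad>'] + [w for w, c in freq.most_common()]
stoi = {w: i for i, w in enumerate(itos)}

ids = [stoi.get(w, 0) for w in tokens]   # unknown words would map to <unk> (id 0)
print(ids)                               # the order of the original tokens is preserved
print([itos[i] for i in ids])            # ...and we can map back again
```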
+ +#### Batch size and BPTT [ [01:47:40](https://youtu.be/gbceqO8PpBg%3Ft%3D1h47m40s) ] + +What happens in a language model is even though we have lots of movie reviews, they all get concatenated together into one big block of text. So we predict the next word in this huge long thing which is all of the IMDB movie reviews concatenated together. + +![](../img/1_O-Kq1qtgZmrShbKhaN3fTg.png) + +* We split up the concatenated reviews into batches. In this case, we will split it to 64 sections +* We then move each section underneath the previous one, and transpose it. +* We end up with a matrix which is 1 million by 64\. +* We then grab a little chunk at time and those chunk lengths are **approximately** equal to BPTT. Here, we grab a little 70 long section and that is the first thing we chuck into our GPU (ie the batch). + +``` + next(iter(md.trn_dl)) +``` + +``` + (Variable containing: + 12 567 3 ... 2118 4 2399 + _35 7 33_ ... 6 148 55 + 227 103 533 ... 4892 31 10 + ... ⋱ ... + 19 8879 33 ... 41 24 733 + 552 8250 57 ... 219 57 1777 + 5 19 2 ... 3099 8 48 + [torch.cuda.LongTensor of size 75x64 (GPU 0)], Variable containing: + **_35_** **_7_** **_33_** + ⋮ + _22_ + 3885 + 21587 + [torch.cuda.LongTensor of size 4800 (GPU 0)]) +``` + +* We grab our first training batch by wrapping data loader with `iter` then calling `next` . +* We got back a 75 by 64 tensor (approximately 70 rows but not exactly) +* A neat trick torchtext does is to randomly change the `bptt` number every time so each epoch it is getting slightly different bits of text — similar to shuffling images in computer vision. We cannot randomly shuffle the words because they need to be in the right order, so instead, we randomly move their breakpoints a little bit. +* The target value is also 75 by 64 but for minor technical reasons it is flattened out into a single vector. + +**Question** : Why not split by a sentence? [ [01:53:40](https://youtu.be/gbceqO8PpBg%3Ft%3D1h53m40s) ] Not really. Remember, we are using columns. So each of our column is of length about 1 million, so although it is true that those columns are not always exactly finishing on a full stop, they are so darn long we do not care. Each column contains multiple sentences. + +Pertaining to this question, Jeremy found what is in this language model matrix a little mind-bending for quite a while, so do not worry if it takes a while and you have to ask a thousands questions. + +#### Create a model [ [01:55:46](https://youtu.be/gbceqO8PpBg%3Ft%3D1h55m46s) ] + +Now that we have a model data object that can fee d us batches, we can create a model. First, we are going to create an embedding matrix. + +Here are the: # batches; # unique tokens in the vocab; length of the dataset; # of words + +``` + len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text) +``` + +``` + (4602, 34945, 1, 20621966) +``` + +This is our embedding matrix looks like: + +![](../img/1_6EHxqeSYMioiLEQ5ufrf_g.png) + +* It is a high cardinality categorical variable and furthermore, it is the only variable — this is typical in NLP +* The embedding size is 200 which is much bigger than our previous embedding vectors. Not surprising because a word has a lot more nuance to it than the concept of Sunday. 
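To picture what that embedding matrix is, here is a minimal PyTorch sketch. The sizes mirror the numbers above, but the weights are just random; fastai creates and trains the real one inside the model (in the PyTorch 0.3 used in these notes you would wrap the ids with `V(...)`, in later versions plain tensors work):

```
import torch
import torch.nn as nn

vocab_size, em_sz = 34945, 200          # number of unique tokens, embedding size
emb = nn.Embedding(vocab_size, em_sz)   # a 34945 x 200 matrix of learnable weights

# A "sentence" of 12 token ids, like the numericalized example above
ids = torch.LongTensor([[12, 35, 227, 480, 13, 76, 17, 2, 7319, 769, 3, 2]])
vectors = emb(ids)                      # each id is replaced by its 200-dimensional row
print(vectors.size())                   # torch.Size([1, 12, 200])
```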
**Generally, an embedding size for a word will be somewhere between 50 and 600.** + +``` + em_sz = 200 # size of each embedding vector nh = 500 # number of hidden activations per layer nl = 3 # number of layers +``` + +Researchers have found that large amounts of _momentum_ (which we'll learn about later) don't work well with these kinds of _RNN_ models, so we create a version of the _Adam_ optimizer with less momentum than its default of `0.9` . Any time you are doing NLP, you should probably include this line: + +``` + opt_fn = partial(optim.Adam, betas=(0.7, 0.99)) +``` + +Fast.ai uses a variant of the state of the art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through [Dropout](https://en.wikipedia.org/wiki/Convolutional_neural_network) . There is no simple way known (yet!) to find the best values of the dropout parameters below — you just have to experiment… + +However, the other parameters ( `alpha` , `beta` , and `clip` ) shouldn't generally need tuning. + +``` + learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05) learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1) learner.clip=0.3 +``` + +* In the last lecture, we will learn what the architecture is and what all these dropouts are. For now, just know it is the same as per usual, if you try to build an NLP model and you are under-fitting, then decrease all these dropouts, if overfitting, then increase all these dropouts in roughly this ratio. Since this is such a recent paper so there is not a lot of guidance but these ratios worked well — it is what Stephen has been using as well. +* There is another kind of way we can avoid overfitting that we will talk about in the last class. For now, `learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)` works reliably so all of your NLP models probably want this particular line. +* `learner.clip=0.3` : when you look at your gradients and you multiply them by the learning rate to decide how much to update your weights by, this will not allow them be more than 0.3\. This is a cool little trick to prevent us from taking too big of a step. +* Details do not matter too much right now, so you can use them as they are. + +**Question** : There are word embedding out there such as Word2vec or GloVe. How are they different from this? And why not initialize the weights with those initially? [ [02:02:29](https://youtu.be/gbceqO8PpBg%3Ft%3D2h2m29s) ] People have pre-trained these embedding matrices before to do various other tasks. They are not called pre-trained models; they are just a pre-trained embedding matrix and you can download them. There is no reason we could not download them. I found that building a whole pre-trained model in this way did not seem to benefit much if at all from using pre-trained word vectors; where else using a whole pre-trained language model made a much bigger difference. Maybe we can combine both to make them a little better still. + +**Question:** What is the architecture of the model? [ [02:03:55](https://youtu.be/gbceqO8PpBg%3Ft%3D2h3m55s) ] We will be learning about the model architecture in the last lesson but for now, it is a recurrent neural network using something called LSTM (Long Short Term Memory). 
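As an aside, the `learner.clip=0.3` line above corresponds to gradient clipping. In plain PyTorch the same idea looks roughly like the sketch below; this is not the fastai internals, just the general technique (the function is `clip_grad_norm_` in recent PyTorch, `clip_grad_norm` in the 0.3-era library used here), with a small linear model standing in for the real RNN:

```
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # stand-in for the real RNN
opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.7, 0.99))

x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y)

opt.zero_grad()
loss.backward()
# Rescale all gradients so their overall norm is at most 0.3,
# which stops any single update from being too large a step
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.3)
opt.step()
```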
#### Fitting [ [02:04:24](https://youtu.be/gbceqO8PpBg%3Ft%3D2h4m24s) ]

```
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)
```

```
learner.save_encoder('adam1_enc')
```

```
learner.fit(3e-3, 4, wds=1e-6, cycle_len=10, cycle_save_name='adam3_10')
```

```
learner.save_encoder('adam3_10_enc')
```

```
learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')
```

```
learner.load_cycle('adam3_20',0)
```

In the sentiment analysis section, we'll just need half of the language model - the _encoder_ , so we save that part.

```
learner.save_encoder('adam3_20_enc')
```

```
learner.load_encoder('adam3_20_enc')
```

Language modeling accuracy is generally measured using the metric _perplexity_ , which is simply `exp()` of the loss function we used.

```
math.exp(4.165)
```

```
64.3926824434624
```

```
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))
```

#### Testing [ [02:04:53](https://youtu.be/gbceqO8PpBg%3Ft%3D2h4m53s) ]

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

```
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
```

```
s = [spacy_tok(ss)]
t = TEXT.numericalize(s)
' '.join(s[0])
```

```
". So , it was n't quite was I was expecting , but I really liked it anyway ! The best"
```

We haven't yet added methods to make it easy to test a language model, so we'll need to manually go through the steps.

```
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs
```

Let's see what the top 10 predictions were for the next word after our short text:

```
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]
```

```
['film',
 'movie',
 'of',
 'thing',
 'part',
 '<unk>',
 'performance',
 'scene',
 ',',
 'actor']
```

…and let's see if our model can generate a bit more text all by itself!

```
print(ss,"\n")
for i in range(50):
    n = res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')
```

```
. So, it wasn't quite was I was expecting, but I really liked it anyway! The best
```

```
film ever ! <eos> i saw this movie at the toronto international film festival . i was very impressed . i was very impressed with the acting . i was very impressed with the acting . i was surprised to see that the actors were not in the movie . ...
```

#### Sentiment [ [02:05:09](https://youtu.be/gbceqO8PpBg%3Ft%3D2h5m9s) ]

So we had pre-trained a language model and now we want to fine-tune it to do sentiment classification.

To use the pre-trained model, we will need the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

```
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))
```

`sequential=False` tells torchtext that this field is not sequential text and should not be tokenized (in this case, we just want to store the single 'positive' or 'negative' label).
+ +``` + IMDB_LABEL = data.Field(sequential= False ) +``` + +This time, we need to not treat the whole thing as one big piece of text but every review is separate because each one has a different sentiment attached to it. + +`splits` is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at `lang_model-arxiv.ipynb` to see how to define your own fastai/torchtext datasets. + +``` + splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/') +``` + +``` + t = splits[0].examples[0] +``` + +``` + t.label, ' '.join(t.text[:16]) +``` + +``` + ('pos', 'ashanti is a very 70s sort of film ( 1979 , to be precise ) .') +``` + +fastai can create a `ModelData` object directly from torchtext `splits` . + +``` + md2 = TextData.from_splits(PATH, splits, bs) +``` + +Now you can go ahead and call `get_model` that gets us our learner. Then we can load into it the pre-trained language model ( `load_encoder` ). + +``` + m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3) +``` + +``` + m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1) +``` + +``` + m3\. load_encoder (f'adam3_20_enc') +``` + +Because we're fine-tuning a pretrained model, we'll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better. + +``` + m3.clip=25\. lrs=np.array([1e-4,1e-3,1e-2]) +``` + +``` + m3.freeze_to(-1) m3.fit(lrs/2, 1, metrics=[accuracy]) m3.unfreeze() m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1) +``` + +``` + [ 0\. 0.45074 0.28424 0.88458] +``` + +``` + [ 0\. 0.29202 0.19023 0.92768] +``` + +We make sure all except the last layer is frozen. Then we train a bit, unfreeze it, train it a bit. The nice thing is once you have got a pre-trained language model, it actually trains really fast. + +``` + m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2') +``` + +``` + [ 0\. 0.29053 0.18292 0.93241] [ 1\. 0.24058 0.18233 0.93313] [ 2\. 0.24244 0.17261 0.93714] [ 3\. 0.21166 0.17143 0.93866] [ 4\. 0.2062 0.17143 0.94042] [ 5\. 0.18951 0.16591 0.94083] [ 6\. 0.20527 0.16631 0.9393 ] [ 7\. 0.17372 0.16162 0.94159] [ 8\. 0.17434 0.17213 0.94063] [ 9\. 0.16285 0.16073 0.94311] [ 10\. 0.16327 0.17851 0.93998] [ 11\. 0.15795 0.16042 0.94267] [ 12\. 0.1602 0.16015 0.94199] [ 13\. 0.15503 0.1624 0.94171] +``` + +``` + m3.load_cycle('imdb2', 4) +``` + +``` + accuracy(*m3.predict_with_targs()) +``` + +``` + 0.94310897435897434 +``` + +A recent paper from Bradbury et al, [Learned in translation: contextualized word vectors](https://einstein.ai/research/learned-in-translation-contextualized-word-vectors) , has a handy summary of the latest academic research in solving this IMDB sentiment analysis problem. Many of the latest algorithms shown are tuned for this specific problem. + +![](../img/1_PotEPJjvS-R4C5OCMbw7Vw.png) + +As you see, we just got a new state of the art result in sentiment analysis, decreasing the error from 5.9% to 5.5%! You should be able to get similarly world-class results on other NLP classification problems using the same basic steps. + +There are many opportunities to further improve this, although we won't be able to get to them until part 2 of this course. 
+ +* For example we could start training language models that look at lots of medical journals and then we could make a downloadable medical language model that then anybody could use to fine-tune on a prostate cancer subset of medical literature. +* We could also combine this with pre-trained word vectors +* We could have pre-trained a Wikipedia corpus language model and then fine-tuned it into an IMDB language model, and then fine-tune that into an IMDB sentiment analysis model and we would have gotten something better than this. + +There is a really fantastic researcher called Sebastian Ruder who is the only NLP researcher who has been really writing a lot about pre-training, fine-tuning, and transfer learning in NLP. Jeremy was asking him why this is not happening more, and his view was it is because there is not a software to make it easy. Hopefully Fast.ai will change that. + +#### Collaborative Filtering Introduction [ [02:11:38](https://youtu.be/gbceqO8PpBg%3Ft%3D2h11m38s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb) + +Data available from [http://files.grouplens.org/datasets/movielens/ml-latest-small.zip](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) + +``` + path='data/ml-latest-small/' +``` + +``` + ratings = pd.read_csv(path+'ratings.csv') ratings.head() +``` + +The dataset looks like this: + +![](../img/1_Ev47i52AF-qIRHtYTOYm2Q.png) + +It contains ratings by users. Our goal will be for some user-movie combination we have not seen before, we have to predict a rating. + +``` + movies = pd.read_csv(path+'movies.csv') movies.head() +``` + +![](../img/1_cl9JWMSKPsrYf4hHsxNq-Q.png) + +To make it more interesting, we will also actually download a list of movies so that we can interpret what is actually in these embedding matrices. + +``` + g=ratings.groupby('userId')['rating'].count() topUsers=g.sort_values(ascending=False)[:15] +``` + +``` + g=ratings.groupby('movieId')['rating'].count() topMovies=g.sort_values(ascending=False)[:15] +``` + +``` + top_r = ratings.join(topUsers, rsuffix='_r', how='inner', on='userId') top_r = top_r.join(topMovies, rsuffix='_r', how='inner', on='movieId') +``` + +``` + pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum) +``` + +![](../img/1_f50pUlwGbsu85fVI-n9-MA.png) + +This is what we are creating — this kind of cross tab of users by movies. + +Feel free to look ahead and you will find that most of the steps are familiar to you already. 
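As a preview of where this is going, the core idea is that a crosstab like the one above can be approximated by the product of two much smaller matrices of learned factors, one per user and one per movie. A toy numpy sketch with made-up numbers (the real version, trained with gradient descent and with bias terms added, is worked through in the next lesson):

```
import numpy as np

np.random.seed(42)
n_users, n_movies, n_factors = 15, 15, 5

# One small vector of latent factors per user and per movie
user_factors = np.random.randn(n_users, n_factors)
movie_factors = np.random.randn(n_movies, n_factors)

# The predicted rating for user u and movie m is just a dot product
pred = user_factors @ movie_factors.T    # shape (15, 15), same as the crosstab
print(pred[3, 7])                        # predicted rating for user 3, movie 7
```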
diff --git a/zh/dl5.md b/zh/dl5.md new file mode 100644 index 0000000000000000000000000000000000000000..19fc5dc8eab693a97e15d830e9a7d43cd614711c --- /dev/null +++ b/zh/dl5.md @@ -0,0 +1,382 @@ +# 深度学习2:第1部分第5课 + +### [第5课](http://forums.fast.ai/t/wiki-lesson-5/9403) + +### 一,导言 + +没有足够的关于结构化深度学习的出版物,但它肯定发生在行业中: + +[**结构化深度学习**](https://towardsdatascience.com/structured-deep-learning-b8ca4138b848 "https://towardsdatascience.com/structured-deep-learning-b8ca4138b848")[ +](https://towardsdatascience.com/structured-deep-learning-b8ca4138b848 "https://towardsdatascience.com/structured-deep-learning-b8ca4138b848")[_作者:Kerem Turgutlu_朝向datascience.com](https://towardsdatascience.com/structured-deep-learning-b8ca4138b848 "https://towardsdatascience.com/structured-deep-learning-b8ca4138b848")[](https://towardsdatascience.com/structured-deep-learning-b8ca4138b848) + +您可以使用[此工具](https://github.com/hardikvasa/google-images-download)从Google下载图片并解决自己的问题: + +[**小图像数据集的乐趣(第2部分)**](https://towardsdatascience.com/fun-with-small-image-data-sets-part-2-54d683ca8c96 "https://towardsdatascience.com/fun-with-small-image-data-sets-part-2-54d683ca8c96")[ +](https://towardsdatascience.com/fun-with-small-image-data-sets-part-2-54d683ca8c96 "https://towardsdatascience.com/fun-with-small-image-data-sets-part-2-54d683ca8c96")[_作者:Nikhil B_ towardsdatascience.com](https://towardsdatascience.com/fun-with-small-image-data-sets-part-2-54d683ca8c96 "https://towardsdatascience.com/fun-with-small-image-data-sets-part-2-54d683ca8c96")[](https://towardsdatascience.com/fun-with-small-image-data-sets-part-2-54d683ca8c96) + +关于如何训练神经网络的介绍(一篇伟大的技术写作): + +[**我们如何“训练”神经网络?**](https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73 "https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73")[ +](https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73 "https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73")[_由Vitaly Bushaev_朝向dasatcience.com](https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73 "https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73")[](https://towardsdatascience.com/how-do-we-train-neural-networks-edd985562b73) + +学生们在[Kaggle幼苗分类比赛中](https://www.kaggle.com/c/plant-seedlings-classification/leaderboard)与Jeremy [竞争](https://www.kaggle.com/c/plant-seedlings-classification/leaderboard) 。 + +### II。 协作过滤 - 使用MovieLens数据集 + +讨论的笔记本可以在[这里](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)找到[(lesson5-movielens.ipynb)](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb) 。 + +我们来看看数据。 我们将使用`userId` (分类), `movieId` (分类)和`rating` (依赖)进行建模。 + +``` + ratings = pd.read_csv(path+'ratings.csv') ratings.head() +``` + +![](../img/1_p-154IwDcs32F5_betEmEw.png) + +#### **为Excel创建子集** + +我们创建了最受欢迎的电影和大多数电影上瘾用户的交叉表,我们将其复制到Excel中进行可视化。 + +``` + g=ratings.groupby('userId')['rating'].count() topUsers=g.sort_values(ascending=False)[:15] +``` + +``` + g=ratings.groupby('movieId')['rating'].count() topMovies=g.sort_values(ascending=False)[:15] +``` + +``` + top_r = ratings.join(topUsers, rsuffix='_r', how='inner', on='userId') top_r = top_r.join(topMovies, rsuffix='_r', how='inner', on='movieId') +``` + +``` + pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum) +``` + +![](../img/1_QO-Doqw_0YGOU-vmI-R5CA.png) + +[这](https://github.com/fastai/fastai/blob/master/courses/dl1/excel/collab_filter.xlsx)是包含上述信息的excel文件。 首先,我们将使用**矩阵分解/分解**而不是构建神经网络。 + 
+![](../img/1_ps-Mq2y88JBT3JsKBh-sKQ.png) + +* 蓝色细胞 - 实际评级 +* 紫色细胞 - 我们的预测 +* 红细胞 - 我们的损失函数即均方根误差(RMSE) +* 绿色单元格 - 电影嵌入(随机初始化) +* 橙色单元格 - 用户嵌入(随机初始化) + +每个预测是电影嵌入矢量和用户嵌入矢量的点积。 在线性代数项中,它等于矩阵乘积,因为一个是行,一个是列。 如果没有实际评级,我们将预测设置为零(将其视为测试数据 - 而不是训练数据)。 + +![](../img/1_2SeWMcKe9VCLkVQVuCvU8g.png) + +然后我们使用Gradient Descent来减少损失。 Microsoft Excel在加载项中有一个“求解器”,可以通过更改所选单元格来最小化变量( `GRG Nonlinear`是您要使用的方法)。 + +这可称为“浅学习”(与深度学习相反),因为没有非线性层或第二线性层。 那么我们直觉上做了什么呢? 每部电影的五个数字称为“嵌入”(潜在因素) - 第一个数字可能代表科幻和幻想的数量,第二个数字可能是电影使用了多少特效,第三个可能是如何类似地,每个用户还有5个数字,例如,表示用户喜欢幻想幻想,特效和电影中的对话驱动多少。 我们的预测是这些载体的交叉产物。 由于我们没有针对每个用户进行所有电影评论,因此我们试图找出哪些电影与这部电影相似,以及其他用户如何评价与此用户类似的其他电影为此电影评分(因此称为“协作”)。 + +我们如何处理新用户或新电影 - 我们是否需要重新培训模型? 我们现在没有时间来讨论这个问题,但基本上你需要有一个新的用户模型或最初会使用的新电影模型,随着时间的推移你需要重新训练模型。 + +#### **简单的Python版本[** [**26:03**](https://youtu.be/J99NV9Cr75I%3Ft%3D26m3s) **]** + +这应该看起来很熟悉了。 我们通过选择随机ID集来创建验证集。 `wd`是L2正则化的权重衰减, `n_factors`是我们想要的嵌入矩阵有多大。 + +``` + val_idxs = get_cv_idxs(len(ratings)) wd = 2e-4 n_factors = 50 +``` + +我们从CSV文件创建模型数据对象: + +``` + cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating') +``` + +然后我们得到一个适合模型数据的学习者,并适合模型: + +``` + learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam) +``` + +``` + learn.fit(1e-2, 2, wds=wd, cycle_len=1, cycle_mult=2) +``` + +![](../img/1_Xl9If92kjaI5OEIxKyNLiw.png) + +<figcaption class="imageCaption">输出MSE</figcaption> + + + +由于输出是Mean Squared Error,您可以通过以下方式获取RMSE: + +``` + math.sqrt(0.765) +``` + +产量约为0.88,优于0.91的基准。 + +您可以通过常规方式获得预测: + +``` + preds = learn.predict() +``` + +你也可以使用seaborn `sns` (建立在matplotlib之上): + +``` + y = learn.data.val_y sns.jointplot(preds, y, kind='hex', stat_func=None) +``` + +![](../img/1_cXAU8huHFkxKbJjZUwwxIA.png) + +#### **使用Python的Dot产品** + +![](../img/1_kSUYsjtdLbyn2SqW9cKiHA.jpeg) + +![](../img/1_H_VqypjqEku0QjLZ51rvKA.jpeg) + +`T`是火炬中的张量 + +``` + a = T([[1., 2], [3, 4]]) b = T([[2., 2], [10, 10]]) +``` + +当我们在numpy或PyTorch中的张量之间有一个数学运算符时,它将在元素方面假设它们都具有相同的维数。 下面是你如何计算两个向量的点积(例如(1,2)·(2,2)= 6 - 矩阵a和b的第一行): + +``` + (a*b).sum(1) +``` + +``` + 6 70 [torch.FloatTensor of size 2] +``` + +#### **构建我们的第一个自定义层(即PyTorch模块)[** [**33:55**](https://youtu.be/J99NV9Cr75I%3Ft%3D33m55s) **]** + +我们通过创建一个扩展`nn.Module`并覆盖`forward`函数的Python类来实现这一点。 + +``` + class DotProduct (nn.Module): def forward(self, u, m): return (u*m).sum(1) +``` + +现在我们可以调用它并得到预期的结果(注意我们不需要说`model.forward(a, b)`来调用`forward`函数 - 它是PyTorch魔法。)[ [40:14](https://youtu.be/J99NV9Cr75I%3Ft%3D40m14s) ]: + +``` + model = DotProduct() **model(a,b)** +``` + +``` + 6 70 [torch.FloatTensor of size 2] +``` + +#### **建造更复杂的模块[** [**41:31**](https://youtu.be/J99NV9Cr75I%3Ft%3D41m31s) **]** + +这个实现有两个`DotProduct`类的补充: + +* 两个`nn.Embedding`矩阵 +* 在上面的嵌入矩阵中查找我们的用户和电影 + +用户ID很可能不是连续的,这使得很难用作嵌入矩阵的索引。 因此,我们将首先创建从零开始并且连续的索引,并使用带有匿名函数`lambda` Panda的`apply`函数将`ratings.userId`列替换为索引,并对`ratings.movieId`执行相同的操作。 + +``` + u_uniq = ratings.userId.unique() user2idx = {o:i **for** i,o **in** enumerate(u_uniq)} ratings.userId = ratings.userId.apply( **lambda** x: user2idx[x]) +``` + +``` + m_uniq = ratings.movieId.unique() movie2idx = {o:i **for** i,o **in** enumerate(m_uniq)} ratings.movieId = ratings.movieId.apply( **lambda** x: movie2idx[x]) +``` + +``` + n_users=int(ratings.userId.nunique()) n_movies=int(ratings.movieId.nunique()) +``` + +_提示:_ `{o:i for i,o in enumerate(u_uniq)}`是一个方便的代码行保存在你的工具带中! 
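下面用一小份假设的数据演示这个重新编号的技巧（仅作示意，真实的 `ratings` 数据框要大得多）：

```
import pandas as pd

# 假设的一小份评分数据：userId 并不连续
ratings = pd.DataFrame({'userId': [17, 42, 17, 99, 42],
                        'movieId': [3, 3, 7, 7, 11]})

u_uniq = ratings.userId.unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}   # 把原始 id 映射到 0,1,2,...
ratings['userId'] = ratings.userId.apply(lambda x: user2idx[x])

print(user2idx)                  # {17: 0, 42: 1, 99: 2}
print(ratings.userId.tolist())   # [0, 1, 0, 2, 1]，现在可以直接作为嵌入矩阵的行索引
```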
+ +``` + class EmbeddingDot(nn.Module): def __init__(self, n_users, n_movies): super().__init__() self.u = nn.Embedding(n_users, n_factors) self.m = nn.Embedding(n_movies, n_factors) self.u.weight.data.uniform_(0,0.05) self.m.weight.data.uniform_(0,0.05) def forward(self, cats, conts): users,movies = cats[:,0],cats[:,1] u,m = self.u(users),self.m(movies) return (u*m).sum(1) +``` + +请注意, `__init__`是一个现在需要的构造函数,因为我们的类需要跟踪“状态”(多少部电影,多少用户,多少因素等)。 我们将权重初始化为0到0.05之间的随机数,你可以在这里找到关于权重初始化的标准算法的更多信息,“Kaiming Initialization”(PyTorch有He初始化实用函数,但是我们试图从头开始做事)[ [46 :58](https://youtu.be/J99NV9Cr75I%3Ft%3D46m58s) ]。 + +`Embedding`不是张量而是**变量** 。 变量执行与张量完全相同的操作,但它也可以自动区分。 要从变量中拉出张量,请调用`data`属性。 所有张量函数都有一个变量,尾随下划线(例如`uniform_` )将就地执行。 + +``` + x = ratings.drop(['rating', 'timestamp'],axis=1) y = ratings['rating'].astype(np.float32) data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64) +``` + +我们正在重用Rossmann笔记本中的ColumnarModelData(来自fast.ai库),这也是为什么在`EmbeddingDot`类[ [50:20](https://youtu.be/J99NV9Cr75I%3Ft%3D50m20s) ]中为`def forward(self, cats, conts)`函数存在分类和连续变量的原因。 由于在这种情况下我们没有连续变量,我们将忽略`conts`并使用`cats`的第一和第二列作为`users`和`movies` 。 请注意,它们是小批量的用户和电影。 重要的是不要手动循环小批量,因为你不会获得GPU加速,而是一次处理整个小批量,正如你在上面的`forward`功能的第3和第4行看到的那样[ [51](https://youtu.be/J99NV9Cr75I%3Ft%3D51m) : [00-52](https://youtu.be/J99NV9Cr75I%3Ft%3D51m) :05 ]。 + +``` + wd=1e-5 model = EmbeddingDot(n_users, n_movies).cuda() opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9) +``` + +`optim`是为PyTorch提供优化器的原因。 `model.parameters()`是从`nn.Modules`继承的函数之一,它为我们提供了更新/学习的`nn.Modules`重。 + +``` + fit(model, data, 3, opt, F.mse_loss) +``` + +这个函数来自fast.ai库[ [54:40](https://youtu.be/J99NV9Cr75I%3Ft%3D54m40s) ]并且比我们一直在使用的`learner.fit()`更接近常规的PyTorch方法。 它不会为您提供诸如“重启的随机梯度下降”或开箱即用的“差分学习率”等功能。 + +#### **让我们改进我们的模型** + +**偏见** - 适应普遍流行的电影或普遍热情的用户。 + +``` + min_rating,max_rating = ratings.rating.min(),ratings.rating.max() min_rating,max_rating +``` + +``` + def get_emb(ni,nf): e = nn.Embedding(ni, nf) e.weight.data.uniform_(-0.01,0.01) return e +``` + +``` + class EmbeddingDotBias(nn.Module): def __init__(self, n_users, n_movies): super().__init__() (self.u, self.m, **self.ub** , **self.mb** ) = [get_emb(*o) for o in [ (n_users, n_factors), (n_movies, n_factors), (n_users,1), (n_movies,1) ]] def forward(self, cats, conts): users,movies = cats[:,0],cats[:,1] um = (self.u(users)* self.m(movies)).sum(1) res = um + ** self.ub(users)** .squeeze() + **self.mb(movies)** .squeeze() res = F.sigmoid(res) * (max_rating-min_rating) + min_rating return res +``` + +`squeeze`是PyTorch版本的_广播_ [ [1:04:11](https://youtu.be/J99NV9Cr75I%3Ft%3D1h4m11s) ]以获取更多信息,请参阅机器学习课程或[numpy文档](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html) 。 + +我们可以压缩评级,使其在1到5之间吗? 是! 通过sigmoid函数进行预测将导致数字介于1和0之间。因此,在我们的情况下,我们可以将其乘以4并加1 - 这将导致1到5之间的数字。 + +![](../img/1_UYeXmpTtxA0pIkHJ8ETMUA.png) + +`F`是PyTorch函数( `torch.nn.functional` ),包含张量的所有函数,在大多数情况下作为`F`导入。 + +``` + wd=2e-4 model = EmbeddingDotBias(cf.n_users, cf.n_items).cuda() opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9) +``` + +``` + fit(model, data, 3, opt, F.mse_loss) [ 0\. 0.85056 0.83742] [ 1\. 0.79628 0.81775] [ 2\. 
0.8012 0.80994] +``` + +让我们来看看我们在**Simple Python版本中**使用的fast.ai代码[ [1:13:44](https://youtu.be/J99NV9Cr75I%3Ft%3D1h13m44s) ] **。** 在`column_data.py`文件中, `CollabFilterDataSet.get_leaner`调用`get_model`函数,该函数创建与我们创建的相同的`EmbeddingDotBias`类。 + +#### 神经网络版[ [1:17:21](https://youtu.be/J99NV9Cr75I%3Ft%3D1h17m21s) ] + +我们回到excel表来理解直觉。 请注意,我们创建user_idx以查找嵌入,就像我们之前在python代码中所做的那样。 如果我们对user_idx进行单热编码并将其乘以用户嵌入,我们将为用户获取适用的行。 如果它只是矩阵乘法,为什么我们需要嵌入? 它用于计算性能优化目的。 + +![](../img/1_0CRZIBnNzw1lT_9EHOyd5g.png) + +我们不是计算用户嵌入向量和电影嵌入向量的点积来得到预测,而是将两者连接起来并通过神经网络来提供它。 + +``` + class EmbeddingNet(nn.Module): def __init__(self, n_users, n_movies, **nh** =10, p1=0.5, p2=0.5): super().__init__() (self.u, self.m) = [get_emb(*o) for o in [ (n_users, n_factors), (n_movies, n_factors)]] self.lin1 = **nn.Linear** (n_factors*2, nh) self.lin2 = nn.Linear(nh, 1) self.drop1 = nn.Dropout(p1) self.drop2 = nn.Dropout(p2) def forward(self, cats, conts): users,movies = cats[:,0],cats[:,1] x = self.drop1(torch.cat([self.u(users),self.m(movies)], dim=1)) x = self.drop2(F.relu(self.lin1(x))) return F.sigmoid(self.lin2(x)) * (max_rating-min_rating+1) + min_rating-0.5 +``` + +请注意,我们不再有偏差项,因为PyTorch中的`Linear`层已经存在偏差。 `nh`是线性层创建的一些激活(Jeremy称之为“数字隐藏”)。 + +![](../img/1_EUxuR7ejeb1wJUib0GRr2g.jpeg) + +它只有一个隐藏层,所以可能不是“深层”,但这绝对是一个神经网络。 + +``` + wd=1e-5 model = EmbeddingNet(n_users, n_movies).cuda() opt = optim.Adam(model.parameters(), 1e-3, weight_decay=wd) fit(model, data, 3, opt, **F.mse_loss** ) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 0.88043 0.82363] [ 1\. 0.8941 0.81264] [ 2\. 0.86179 0.80706] +``` + +请注意,损失函数也在`F` (这里,它是均方损失)。 + +既然我们有神经网络,我们可以尝试很多东西: + +* 添加辍学者 +* 使用不同的嵌入大小进行用户嵌入和电影嵌入 +* 不仅是用户和电影嵌入,而且还附加来自原始数据的电影类型嵌入和/或时间戳。 +* 增加/减少隐藏层数和激活次数 +* 增加/减少正规化 + +#### **训练循环中发生了什么?** [ [1:33:21](https://youtu.be/J99NV9Cr75I%3Ft%3D1h33m21s) ] + +目前,我们正在将权重更新传递给PyTorch的优化器。 优化器有什么作用? 什么是`momentum` ? 
+ +``` + opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9) +``` + +我们将在excel表( [graddesc.xlsm](https://github.com/fastai/fastai/blob/master/courses/dl1/excel/graddesc.xlsm) )中实现梯度下降 - 从右到左看工作表。 首先我们创建一个随机_x_ ', _y_与_x_的线性相关(例如_y_ = _a * x_ + _b_ )。 通过使用_x_和_y_的集合,我们将尝试学习_a_和_b。_ + +![](../img/1_EyHgeFUNArZ3xZRbY507QQ.jpeg) + +![](../img/1_D_qMGnGAmQYMwsuBpBhhlQ.jpeg) + +要计算误差,我们首先需要预测,并将差异平方: + +![](../img/1_q7Fb4G2j2csZ7sS0tbi8zQ.png) + +为了减少误差,我们增加/减少_a_和_b_一点点,并找出会导致误差减小的原因。 这被称为通过有限差分找到导数。 + +![](../img/1_Z2NHeXo8RFIOwhCyuQb7oQ.jpeg) + +有限差分在高维空间中变得复杂[ [1:41:46](https://youtu.be/J99NV9Cr75I%3Ft%3D1h41m46s) ],并且它变得非常耗费内存并且需要很长时间。 所以我们想找到一些方法来更快地完成这项工作。 查找Jacobian和Hessian之类的东西是值得的(深度学习书: [第84页第4.3.1节](http://www.deeplearningbook.org/contents/numerical.html) )。 + +#### 链规则和反向传播 + +更快的方法是分析地做到这一点[ [1:45:27](https://youtu.be/J99NV9Cr75I%3Ft%3D1h45m27s) ]。 为此,我们需要一个链规则: + +![](../img/1_DS4ZfpUfsseOBayQMqS4Yw.png) + +<figcaption class="imageCaption">[链规则](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version)概述</figcaption> + + + +这是Chris Olah关于[反向传播作为连锁规则](http://colah.github.io/posts/2015-08-Backprop/)的伟大文章。 + +现在我们用[WolframAlpha](https://www.wolframalpha.com/)给出的实际导数替换有限差分(注意有限差分输出与实际导数非常接近,如果你需要计算自己的导数,那么做好快速健全性检查的好方法): + +![](../img/1_VHXoG1HpJxlR_y0yfrEz-g.jpeg) + +* “在线”培训 - 规模为1的小批量培训 + +这就是你如何使用excel表格进行SGD。 如果您要使用CNN电子表格的输出更改预测值,我们可以使用SGD训练CNN。 + +#### 动量[ [1:53:47](https://youtu.be/J99NV9Cr75I%3Ft%3D1h53m47s) ] + +> 来吧,采取一些暗示 - 这是一个很好的方向。 请继续这样做,但更多。 + +通过这种方法,我们将在当前的小批量衍生物和我们在最后一批小批量(单元格K9)之后采取的步骤(和方向)之间使用线性插值: + +![](../img/1_zvTMttj6h4iwFcxnt8zKyg.png) + +与其符号(+/-)是随机的_de_ / _db_相比,具有动量的那个将继续向同一方向移动一点点直到某一点。 这将减少培训所需的一些时期。 + +#### 亚当[ [1:59:04](https://youtu.be/J99NV9Cr75I%3Ft%3D1h59m4s) ] + +亚当的速度要快得多,但问题在于最终的预测并不像SGD那样有动力。 似乎这是由于亚当和体重衰减的联合使用。 解决此问题的新版本称为**AdamW** 。 + +![](../img/1_0yZ9Hbn2BPSNY9L-5jL0Tg.png) + +* `cell J8` :导数和前一个方向的线性插值(与我们在动量中的相同) +* `cell L8` :从最后一步( `cell L7` )的导数平方+导数平方的线性插值 +* 这个想法被称为“指数加权移动平均线”(换句话说,平均值与之前的值相乘) + +学习率比以前高得多,因为我们将它除以`L8`平方根。 + +如果你看一下fast.ai库(model.py),你会注意到在`fit`函数中,它不只是计算平均损失,而是计算损失的**指数加权移动平均值** 。 + +``` + avg_loss = avg_loss * avg_mom + loss * (1-avg_mom) +``` + +另一个有用的概念是每当你看到`α(...)+(1-α)(...)`时,立即想到**线性插值。** + +#### **一些直觉** + +* 我们计算了梯度平方的指数加权移动平均值,取其平方根,并将学习率除以它。 +* 渐变平方总是积极的。 +* 当梯度的方差很大时,梯度平方将很大。 +* 当梯度恒定时,梯度平方将很小。 +* 如果渐变变化很大,我们要小心并将学习率除以大数(减速) +* 如果梯度变化不大,我们将通过将学习率除以较小的数字来采取更大的步骤 +* **自适应学习率** - 跟踪梯度平方的平均值,并使用它来调整学习率。 因此,只有一种学习风格,但如果梯度是恒定的,每个时期的每个参数都会得到更大的跳跃; 否则会跳得更小。 +* 有两个瞬间 - 一个用于渐变,另一个用于渐变平方(在PyTorch中,它被称为beta,它是两个数字的元组) + +#### AdamW [ [2:11:18](https://youtu.be/J99NV9Cr75I%3Ft%3D2h11m18s) ] + +当参数多于数据点时,正则化变得很重要。 我们以前见过辍学,体重衰退是另一种正规化。 权重衰减(L2正则化)通过将平方权重(权重衰减乘数乘以)加到损失中来惩罚大权重。 现在损失函数想要保持较小的权重,因为增加权重会增加损失; 因此,只有当损失提高超过罚款时才这样做。 + +问题在于,由于我们将平方权重添加到损失函数,这会影响梯度的移动平均值和Adam的平方梯度的移动平均值。 这导致当梯度变化很大时减少重量衰减量,并且当变化很小时增加重量衰减量。 换句话说,“惩罚大重量,除非渐变变化很大”,这不是我们想要的。 AdamW从损失函数中删除了重量衰减,并在更新权重时直接添加它。 diff --git a/zh/dl6.md b/zh/dl6.md new file mode 100644 index 0000000000000000000000000000000000000000..5961fc528eca14daba7b08bee52ca06d9dbee543 --- /dev/null +++ b/zh/dl6.md @@ -0,0 +1,692 @@ +# 深度学习2:第1部分第6课 + +### [第6课](http://forums.fast.ai/t/wiki-lesson-6/9404) + +[**2017年深度学习重点的优化**](http://ruder.io/deep-learning-optimization-2017/index.html "http://ruder.io/deep-learning-optimization-2017/index.html")[ +](http://ruder.io/deep-learning-optimization-2017/index.html 
"http://ruder.io/deep-learning-optimization-2017/index.html")[_目录:深度学习最终是关于找到一个很好的概括 - 用..._ ruder.io的_奖励积分_](http://ruder.io/deep-learning-optimization-2017/index.html "http://ruder.io/deep-learning-optimization-2017/index.html")[](http://ruder.io/deep-learning-optimization-2017/index.html) + +上周回顾[ [2:15](https://youtu.be/sHcLkfRrgoQ%3Ft%3D2m15s) ] + +我们上周深入研究了协同过滤,最后我们在fast.ai库中重新创建了`EmbeddingDotBias`类( `column_data.py` )。 让我们看一下嵌入式的样子[ [笔记本](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb) ]。 + +在学习者`learn`内部,您可以通过调用`learn.model`来获取PyTorch模型。 `@property`看起来像常规函数,但在调用它时不需要括号。 + +``` + @property def model(self): return self.models.model +``` + +`learn.models`是`learn.models`的一个实例,它是PyTorch模型的一个薄包装器,它允许我们使用“层组”,这不是PyTorch中可用的概念,而fast.ai使用它将不同的学习速率应用于不同的层集(层组)。 + +PyTorch模型很好地打印出层,包括层名,这就是我们在代码中称之为的层名。 + +``` + m=learn.model; m +``` + +``` + _EmbeddingDotBias (_ _(u): Embedding(671, 50)_ _(i): Embedding(9066, 50)_ _(ub): Embedding(671, 1)_ _(ib): Embedding(9066, 1)_ _)_ +``` + +![](../img/1_4MrbqWktWz3oroYWn5Xh6w.png) + +`m.ib`是指项目偏差的嵌入层 - 在我们的例子中是电影偏见。 PyTorch模型和图层的好处是我们可以将它们称为函数。 因此,如果您想获得预测,则调用`m(...)`并传入变量。 + +图层需要变量而不是张量,因为它需要跟踪导数 - 这就是`V(...)`将张量转换为变量的原因。 PyTorch 0.4将摆脱变量,我们将能够直接使用张量。 + +``` + movie_bias = to_np(m.ib(V(topMovieIdx))) +``` + +`to_np`函数将采用变量或张量(无论是在CPU还是GPU上)并返回numpy数组。 Jeremy的方法[ [12:03](https://youtu.be/sHcLkfRrgoQ%3Ft%3D12m3s) ]是将numpy用于一切,除非他明确需要在GPU上运行某些东西或者需要它的衍生物 - 在这种情况下他使用PyTorch。 Numpy比PyTorch的使用时间更长,并且可以与OpenCV,Pandas等其他库一起使用。 + +有关生产中CPU与GPU的问题。 建议的方法是对CPU进行推理,因为它更具可扩展性,您无需批量生产。 您可以通过键入`m.cpu()`将模型移动到CPU上,类似于键入`V(topMovieIndex).cpu()`的变量(从CPU到GPU将是`m.cuda()` )。如果您的服务器没有GPU ,它会自动在CPU上运行推理。 要加载在GPU上训练过的已保存模型,请查看`torch_imports.py`以下代码`torch_imports.py` : + +``` + def load_model(m, p): m.load_state_dict(torch.load(p, map_location=lambda storage, loc: storage)) +``` + +现在我们对前3000部电影有电影偏见,让我们来看看收视率: + +``` + movie_ratings = [(b[0], movie_names[i]) **for** i,b **in** zip(topMovies,movie_bias)] +``` + +`zip`将允许您同时迭代多个列表。 + +#### 最糟糕的电影 + +关于排序键 - Python有`itemgetter`函数,但普通`lambda`只是一个字符。 + +``` + sorted(movie_ratings, key= **lambda** o: o[0])[:15] +``` + +``` + _[(-0.96070349, 'Battlefield Earth (2000)'),_ _(-0.76858485, 'Speed 2: Cruise Control (1997)'),_ _(-0.73675376, 'Wild Wild West (1999)'),_ _(-0.73655486, 'Anaconda (1997)'),_ _...]_ +``` + +``` + sorted(movie_ratings, key= **itemgetter** (0))[:15] +``` + +#### 最好的电影 + +``` + sorted(movie_ratings, key= **lambda** o: o[0], reverse= **True** )[:15] +``` + +``` + _[(1.3070084, 'Shawshank Redemption, The (1994)'),_ _(1.1196285, 'Godfather, The (1972)'),_ _(1.0844109, 'Usual Suspects, The (1995)'),_ _(0.96578616, "Schindler's List (1993)"),_ _...]_ +``` + +#### 嵌入式解释[ [18:42](https://youtu.be/sHcLkfRrgoQ%3Ft%3D18m42s) ] + +每部电影有50个嵌入,很难看到50维空间,所以我们将它变成一个三维空间。 我们可以使用几种技术压缩尺寸:主成分分析( [PCA](https://plot.ly/ipython-notebooks/principal-component-analysis/) )(Rachel的计算线性代数类详细介绍了这一点 - 几乎与奇异值分解(SVD)相同) + +``` + movie_emb = to_np(mi(V(topMovieIdx))) movie_emb.shape +``` + +``` + _(3000, 50)_ +``` + +``` + from sklearn.decomposition import PCA pca = PCA(n_components=3) movie_pca = pca.fit(movie_emb.T).components_ movie_pca.shape +``` + +``` + _(3, 3000)_ +``` + +我们将看看第一个维度“轻松观看与严肃”(我们不知道它代表什么但可以通过观察它们来推测): + +``` + fac0 = movie_pca[0] movie_comp = [(f, movie_names[i]) **for** f,i **in** zip(fac0, topMovies)] sorted(movie_comp, key=itemgetter(0), reverse=True)[:10] +``` + +``` + sorted(movie_comp, key=itemgetter(0), reverse=True)[:10] +``` + +``` + _[(0.06748189, 'Independence Day (aka ID4) 
(1996)'),_ _(0.061572548, 'Police Academy 4: Citizens on Patrol (1987)'),_ _(0.061050549, 'Waterworld (1995)'),_ _(0.057877172, 'Rocky V (1990)'),_ _..._ _]_ +``` + +``` + sorted(movie_comp, key=itemgetter(0))[:10] +``` + +``` + _[(-0.078433245, 'Godfather: Part II, The (1974)'),_ _(-0.072180331, 'Fargo (1996)'),_ _(-0.071351372, 'Pulp Fiction (1994)'),_ _(-0.068537779, 'Goodfellas (1990)'),_ _..._ _]_ +``` + +第二个维度“对话驱动与CGI” + +``` + fac1 = movie_pca[1] movie_comp = [(f, movie_names[i]) for f,i in zip(fac1, topMovies)] sorted(movie_comp, key=itemgetter(0), reverse=True)[:10] +``` + +``` + _[(0.058975246, 'Bonfire of the Vanities (1990)'),_ _(0.055992026, '2001: A Space Odyssey (1968)'),_ _(0.054682467, 'Tank Girl (1995)'),_ _(0.054429606, 'Purple Rose of Cairo, The (1985)'),_ _...]_ +``` + +``` + sorted(movie_comp, key=itemgetter(0))[:10] +``` + +``` + _[(-0.1064609, 'Lord of the Rings: The Return of the King, The (2003)'),_ _(-0.090635143, 'Aladdin (1992)'),_ _(-0.089208141, 'Star Wars: Episode V - The Empire Strikes Back (1980)'),_ _(-0.088854566, 'Star Wars: Episode IV - A New Hope (1977)'),_ _...]_ +``` + +情节 + +``` + idxs = np.random.choice(len(topMovies), 50, replace=False) X = fac0[idxs] Y = fac1[idxs] plt.figure(figsize=(15,15)) plt.scatter(X, Y) for i, x, y in zip(topMovies[idxs], X, Y): plt.text(x,y,movie_names[i], color=np.random.rand(3)*0.7, fontsize=11) plt.show() +``` + +![](../img/1_rH0bFyR8qSj6MuV0Rn-waA.png) + +当你说`learn.fit`时会发生什么? + +#### [实体嵌入分类变量](https://arxiv.org/pdf/1604.06737.pdf) [ [24:42](https://youtu.be/sHcLkfRrgoQ%3Ft%3D24m42s) ] + +第二篇论文谈论分类嵌入。 图。 1.标题应该听起来很熟悉,因为它们讨论了实体嵌入层如何等效于单热编码,然后是矩阵乘法。 + +![](../img/1_BgBtlqi7Ja6aQ8wGvWQbgQ.png) + +他们做的有趣的事情是,他们采用由神经网络训练的实体嵌入,用学习的实体嵌入替换每个分类变量,然后将其输入到梯度增强机(GBM),随机森林(RF)和KNN中 - 这减少了这个错误几乎与神经网络(NN)一样好。 这是一种很好的方式,可以在您的组织中提供神经网络的强大功能,而不必强迫其他人学习深度学习,因为他们可以继续使用他们当前使用的东西并使用嵌入作为输入。 GBM和RF列车比NN快得多。 + +![](../img/1_XYcNx7NmTyblDXa5diFMbg.png) + +他们还绘制了德国国家的嵌入,有趣的是(正如杰里米所说的那样“令人费解”)类似于实际的地图。 + +他们还绘制了物理空间和嵌入空间中商店的距离 - 这显示出美丽而清晰的相关性。 + +一周的天数或一年中的几个月之间似乎也存在相关性。 可视化嵌入可能很有趣,因为它向您显示您期望看到的内容或您未看到的内容。 + +#### 关于Skip-Gram生成嵌入的问题[ [31:31](https://youtu.be/sHcLkfRrgoQ%3Ft%3D31m31s) ] + +Skip-Gram特定于NLP。 将未标记的问题转变为标记问题的好方法是“发明”标签。 Word2Vec的方法是采用11个单词的句子,删除中间单词,并用随机单词替换它。 然后他们在原句中给出了标签1; 0到假一个,并建立了一个机器学习模型来查找假句子。 因此,他们现在可以将嵌入物用于其他目的。 如果你将它作为单个矩阵乘数(浅模型)而不是深度神经网络,你可以非常快速地训练 - 缺点是它是一个预测性较低的模型,但优点是你可以训练一个非常大的数据集更重要的是,最终的嵌入具有_线性特征_ ,允许我们很好地加,减或绘制。 在NLP中,我们应该超越Word2Vec和Glove(即基于线性的方法),因为这些嵌入不太具有预测性。 最先进的语言模型使用深度RNN。 + +#### 要学习任何类型的特征空间,您需要标记数据或者需要发明虚假任务[ [35:45](https://youtu.be/sHcLkfRrgoQ%3Ft%3D35m45s) ] + +* 一个假的任务比另一个好吗? 还没有很好的研究。 +* 直观地说,我们想要一个帮助机器学习你关心的各种关系的任务。 +* 在计算机视觉中,人们使用的一种虚假任务是应用虚幻和不合理的数据增强。 +* 如果你不能提出很棒的假任务,那就去使用糟糕的任务 - 你需要的很少,这通常是令人惊讶的。 +* **自动编码器** [ [38:10](https://youtu.be/sHcLkfRrgoQ%3Ft%3D38m10s) ] - 它最近赢得了[保险索赔竞赛](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629) 。 采取单一策略,通过神经网络运行,并让它重建自己(确保中间层的激活少于输入变量)。 基本上,这是一个任务,其输入=输出作为一个假任务令人惊讶地工作。 + +在计算机视觉中,您可以对猫和狗进行训练并将其用于CT扫描。 也许它可能适用于语言/ NLP! 
(未来的研究) + +#### [罗斯曼](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb) [ [41:04](https://youtu.be/sHcLkfRrgoQ%3Ft%3D41m4s) ] + +* 正确使用测试集的方法已添加到笔记本中。 +* 有关更详细的说明,请参阅机器学习课程。 +* `apply_cats(joined_test, joined)`用于确保测试集和训练集具有相同的分类代码。 +* 跟踪包含每个连续列的平均值和标准差的`mapper` ,并将相同的`mapper`应用于测试集。 +* 不要依赖Kaggle公共董事会 - 依靠您自己精心设计的验证集。 + +#### 为罗斯曼寻找一个好的[核心](https://www.kaggle.com/thie1e/exploratory-analysis-rossmann) + +* 周日对销售的影响 + +商店关闭前后的销售额有所增长。 第三名获胜者在开始任何分析之前删除了关闭的商店行。 + +> **不要触摸你的数据,除非你首先分析看你正在做什么是好的 - 没有假设。** + +#### Vim技巧[ [49:12](https://youtu.be/sHcLkfRrgoQ%3Ft%3D49m12s) ] + +* `:tag ColumnarModelData`将带您进入类定义 +* `ctrl + ]`将带您定义光标下的内容 +* `ctrl + t`回去 +* `*`找到光标下的内容的用法 +* 您可以使用`:tabn`选项在选项卡之间切换`:tabn`和`:tabp` ,使用`:tabe <filepath>`可以添加新选项卡; 并使用常规`:q`或`:wq`你关闭一个标签。 如果将`:tabn`和`:tabp`到F7 / F8键,则可以轻松地在文件之间切换。 + +#### [ColumnarModelData](https://youtu.be/sHcLkfRrgoQ%3Ft%3D51m1s)内部[ [51:01](https://youtu.be/sHcLkfRrgoQ%3Ft%3D51m1s) ] + +慢慢但肯定地,过去只是“神奇”的东西开始看起来很熟悉。 如您所见, `get_learner`返回`Learner` ,它是包装数据和PyTorch模型的fast.ai概念: + +![](../img/1_Fda8NgH2L9m3d_UIdNsKCQ.png) + +在`MixedInputModel`内部,您可以看到它是如何创建我们现在更了解的`Embedding` 。 `nn.ModuleList`用于注册层列表。 我们将在下周讨论`BatchNorm` ,但我们之前已经看过了休息。 + +![](../img/1_7E46VmEHXatQWNY2s7D9-g.png) + +同样,我们现在了解`forward`功能正在发生什么。 + +* 使用第_i_个分类变量调用嵌入层并将它们连接在一起 +* 通过辍学把它 +* 浏览每个线性图层,调用它,应用relu和dropout +* 然后最终线性层的大小为1 +* 如果`y_range` ,则应用sigmoid并将输出拟合到一个范围内(我们上周学到的) + +![](../img/1_Ry2bDxD36x8zV9KfH_IL9Q.png) + +#### [随机梯度下降 - 新元](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson6-sgd.ipynb) [ [59:56](https://youtu.be/sHcLkfRrgoQ%3Ft%3D59m56s) ] + +为了确保我们完全适应SGD,我们将用它来学习`_y = ax + b_` 。 如果我们可以用2个参数解决问题,我们可以使用相同的技术来解决1亿个参数。 + +``` + _# Here we generate some fake data_ **def** lin(a,b,x): **return** a*x+b **def** gen_fake_data(n, a, b): x = s = np.random.uniform(0,1,n) y = lin(a,b,x) + 0.1 * np.random.normal(0,3,n) **return** x, y x, y = gen_fake_data(50, 3., 8.) plt.scatter(x,y, s=8); plt.xlabel("x"); plt.ylabel("y"); +``` + +![](../img/1_28U8r1xSD7ODB9BZnGHNZg.png) + +首先,我们需要一个损失功能。 这是一个回归问题,因为输出是连续输出,最常见的损失函数是均方误差(MSE)。 + +> 回归 - 目标输出是实数或整数实数 + +> 分类 - 目标输出是类标签 + +``` + **def** **mse** (y_hat, y): **return** ((y_hat - y) ** 2).mean() +``` + +``` + **def** **mse_loss** (a, b, x, y): **return** mse(lin(a,b,x), y) +``` + +* `y_hat` - 预测 + +我们将制作10,000多个假数据并将它们转换为PyTorch变量,因为Jeremy不喜欢使用衍生物而PyTorch可以为他做到这一点: + +``` + x, y = gen_fake_data(10000, 3., 8.) 
x,y = V(x),V(y) +``` + +然后为`a`和`b`创建随机权重,它们是我们想要学习的变量,所以设置`requires_grad=True` 。 + +``` + a = V(np.random.randn(1), requires_grad= **True** ) b = V(np.random.randn(1), requires_grad= **True** ) +``` + +然后设置学习率并完成10000个完全梯度下降的时期(不是SGD,因为每个时期将查看所有数据): + +``` + learning_rate = 1e-3 **for** t **in** range(10000): _# Forward pass: compute predicted y using operations on Variables_ loss = mse_loss(a,b,x,y) **if** t % 1000 == 0: print(loss.data[0]) _# Computes the gradient of loss with respect to all Variables with requires_grad=True._ _# After this call a.grad and b.grad will be Variables holding the gradient_ _# of the loss with respect to a and b respectively_ loss.backward() _# Update a and b using gradient descent;_ _a.data and b.data are Tensors,_ _# a.grad and b.grad are Variables and a.grad.data and b.grad.data are Tensors_ a.data -= learning_rate * a.grad.data b.data -= learning_rate * b.grad.data _# Zero the gradients_ a.grad.data.zero_() b.grad.data.zero_() +``` + +![](../img/1_LRtxJiNrnAX1o6mEnaiUpA.png) + +* 计算损失(记住, `a`和`b`最初设置为随机) +* 不时(每1000个时期),打印出损失 +* `loss.backward()`将使用`requires_grad=True`计算所有变量的渐变,并填写`.grad`属性 +* 将`a`更新为减去LR * `grad` ( `.data`访问变量内部的张量) +* 当有多个损失函数或许多输出层对渐变有贡献时,PyTorch会将它们加在一起。 因此,您需要告诉何时将渐变设置回零( `_`中的`zero_()`表示变量就地更改)。 +* 最后4行代码包含在`optim.SGD.step`函数中 + +#### 让我们只用Numpy(没有PyTorch)[ [1:07:01](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h7m1s) ] + +我们实际上必须做微积分,但其他一切看起来应该相似: + +``` + x, y = gen_fake_data(50, 3., 8.) +``` + +``` + a_guess,b_guess = -1., 1. mse_loss(y, a_guess, b_guess, x) +``` + +``` + lr=0.01 **def** **upd** (): **global** a_guess, b_guess y_pred = lin(a_guess, b_guess, x) dydb = 2 * (y_pred - y) dyda = x*dydb a_guess -= lr*dyda.mean() b_guess -= lr*dydb.mean() +``` + +只是为了好玩,您可以使用`matplotlib.animation.FuncAnimation`来制作动画: + +![](../img/1_yGWe-bn7PoDqx0pZc2fjtg.png) + +提示:Fast.ai AMI没有附带`ffmpeg` 。 所以如果你看到`KeyError: 'ffmpeg'` + +* 运行`print(animation.writers.list())`并打印出可用的MovieWriters列表 +* 如果`ffmpeg`就在其中。 否则[安装它](https://github.com/adaptlearning/adapt_authoring/wiki/Installing-FFmpeg) 。 + +### [递归神经网络 - RNN](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson6-rnn.ipynb) [ [1:09:16](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h9m16s) ] + +让我们学习如何写尼采这样的哲学。 这类似于我们在第4课中学到的语言模型,但这一次,我们将一次完成一个角色。 RNN与我们已经学到的没什么不同。 + +![](../img/1_wIccxf1fG4jtSZhLHTtw-A.png) + +#### 一些例子: + +* [SwiftKey](https://blog.swiftkey.com/neural-networks-a-meaningful-leap-for-mobile-typing/) +* [Andrej Karpathy LaTex发电机](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) + +#### 具有单个隐藏层的基本NN + +所有形状都是激活(激活是由relu,矩阵产品等计算的数字)。 箭头是图层操作(可能不止一个)。 查看机器学习课程9-11,从头开始创建。 + +![](../img/1_vPfe01ALNgbxw8DP_4RFVw.png) + +#### 图像CNN具有单密集隐藏层 + +我们将介绍如何在下周更多地压平图层,但主要方法称为“自适应最大池” - 我们在高度和宽度上进行平均并将其转换为矢量。 + +![](../img/1_VEEVatttQmlWeI98vTO0iA.png) + +<figcaption class="imageCaption">`batch_size`维度和激活函数(例如relu,softmax)未在此处显示</figcaption> + + + +#### 使用字符1和2 [ [1:18:04](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h18m4s) ]预测字符3 + +我们将为NLP实现这一个。 + +* 输入可以是单热编码字符(向量的长度=唯一字符的数量)或单个整数,并假设它是使用嵌入层进行一次热编码。 +* 与CNN的不同之处在于添加了char 2输入。 + +![](../img/1_gc1z1R1d5zHkYc75iqSWtw.png) + +<figcaption class="imageCaption">层图操作未显示; 记住箭头代表层操作</figcaption> + + + +让我们在没有torchtext或fast.ai库的情况下实现这一点,以便我们可以看到。 + +* `set`将返回所有唯一字符。 + +``` + text = open(f'{PATH}nietzsche.txt').read() print(text[:400]) +``` + +``` + _'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then?_ _Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the 
terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman?_ _Certainly she has never allowed herself '_ +``` + +``` + chars = sorted(list(set(text))) vocab_size = len(chars)+1 print('total chars:', vocab_size) +``` + +``` + _total chars: 85_ +``` + +* 总是很好地为填充添加null或空字符。 + +``` + chars.insert(0, "\0") +``` + +将每个字符映射到唯一ID,以及字符的唯一ID + +``` + char_indices = dict((c, i) for i, c in enumerate(chars)) indices_char = dict((i, c) for i, c in enumerate(chars)) +``` + +现在我们可以使用其ID来表示文本: + +``` + **idx** = [char_indices[c] for c in text] idx[:10] +``` + +``` + _[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]_ +``` + +#### 问题:基于字符的模型与基于单词的模型[ [1:22:30](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h22m30s) ] + +* 通常,您希望将字符级别模型和字级别模型组合在一起(例如,用于翻译)。 +* 当词汇表包含不常用的单词时,字符级别模型很有用 - 单词级别模型将仅视为“未知”。 当您看到之前没有见过的单词时,可以使用字符级模型。 +* 在它们之间还有一种称为字节对编码(BPE)的东西,它查看n-gram字符。 + +#### 创建输入[ [1:23:48](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h23m48s) ] + +``` + cs = 3 c1_dat = [idx[i] for i in range(0, len(idx)-cs, cs)] c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)] c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)] c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)] +``` + +注意`c1_dat[n+1] == c4_dat[n]`因为我们跳过3( `range`的第三个参数) + +``` + x1 = np.stack(c1_dat) x2 = np.stack(c2_dat) x3 = np.stack(c3_dat) y = np.stack(c4_dat) +``` + +`x`是我们的输入, `y`是我们的目标值。 + +#### 建立模型[ [1:26:08](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h26m8s) ] + +``` + n_hidden = 256 n_fac = 42 +``` + +* `n_hiddein` - 图中的“ `n_hiddein` ”。 +* `n_fac` - 嵌入矩阵的大小。 + +这是上图的更新版本。 请注意,现在箭头已着色。 具有相同颜色的所有箭头将使用相同的权重矩阵。 这里的想法是,角色不具有不同的含义(语义上或概念上),这取决于它是序列中的第一个,第二个还是第三个项目,因此对它们的处理方式相同。 + +![](../img/1_9XXQ3J7G3rD92tFkusi4bA.png) + +``` + **class** **Char3Model** (nn.Module): **def** **__init__** (self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.l_in = nn.Linear(n_fac, n_hidden) self.l_hidden = nn.Linear(n_hidden, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) +``` + +``` + **def** **forward** (self, c1, c2, c3): in1 = F.relu(self.l_in(self.e(c1))) in2 = F.relu(self.l_in(self.e(c2))) in3 = F.relu(self.l_in(self.e(c3))) h = V(torch.zeros(in1.size()).cuda()) h = F.tanh(self.l_hidden(h+in1)) h = F.tanh(self.l_hidden(h+in2)) h = F.tanh(self.l_hidden(h+in3)) **return** F.log_softmax(self.l_out(h)) +``` + +![](../img/1_gBZslK323CITflsnXp-DSA.png) + +<figcaption class="imageCaption">[视频[1:27:57]](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h27m57s)</figcaption> + + + +* [ [1:29:58](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h29m58s) ]重要的是,这个`l_hidden`使用一个方形权重矩阵,其大小与`l_in`的输出相匹配。 然后`h`和`in2`将是相同的形状,允许我们在`self.l_hidden(h+in2)`看到它们的总和 +* `V(torch.zeros(in1.size()).cuda())`只是使三条线相同,以便以后更容易放入for循环。 + +``` + md = ColumnarModelData.from_arrays('.', [-1], np.stack( **[x1,x2,x3]** , axis=1), y, bs=512) +``` + +我们将重用[ColumnarModelData](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h32m20s) [ [1:32:20](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h32m20s) ]。 如果我们堆栈`x1` , `x2`和`x3` ,我们将在`forward`方法中得到`c1` , `c2` , `c3` 。 当你想用原始方法训练模型时, `ColumnarModelData.from_arrays`会派上用场,你放入`[x1, x2, x3]` ,你将在`**def** **forward** (self, c1, c2, c3)`返回`**def** **forward** (self, c1, c2, c3)` + +``` + m = Char3Model(vocab_size, n_fac).cuda() +``` + +* 我们创建了一个标准的PyTorch模型(不是`Learner` ) +* 因为它是标准的PyTorch模型,所以不要忘记`.cuda` + +``` + it = iter(md.trn_dl) *xs,yt = next(it) t = m(*V(xs) +``` + +* 它抓住了一个迭代器 +* `next`返回一个小批量 +* “变量” `xs`张量,并通过模型 - 这将给我们512x85张量包含预测(批量大小*唯一字符) + +``` + opt = 
optim.Adam(m.parameters(), 1e-2) +``` + +* 创建一个标准的PyTorch优化器 - 你需要传递一个要优化的东西列表,由`m.parameters()`返回 + +``` + fit(m, md, 1, opt, F.nll_loss) set_lrs(opt, 0.001) fit(m, md, 1, opt, F.nll_loss) +``` + +* 我们没有找到学习速率查找器和SGDR,因为我们没有使用`Learner` ,所以我们需要手动进行学习速率退火(将LR设置得稍低) + +#### 测试模型[ [1:35:58](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h35m58s) ] + +``` + **def** **get_next** (inp): idxs = T(np.array([char_indices[c] **for** c **in** inp])) p = m(*VV(idxs)) i = np.argmax(to_np(p)) **return** chars[i] +``` + +此函数需要三个字符并返回模型预测的第四个字符。 注意: `np.argmax`返回最大值的索引。 + +``` + get_next('y. ') _'T'_ +``` + +``` + _get_next('ppl')_ _'e'_ +``` + +``` + get_next(' th') _'e'_ +``` + +``` + get_next('and') ' ' +``` + +#### 让我们创建我们的第一个RNN [ [1:37:45](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h37m45s) ] + +我们可以简化上面的图表如下: + +![](../img/1_xF-ab5Hn_3FGZRZtEGwFtw.png) + +<figcaption class="imageCaption">使用字符1到n-1预测字符</figcaption> + + + +让我们实现这一点。 这次,我们将使用前8个字符来预测第9个字符。 以下是我们如何创建输入和输出,就像上次一样: + +``` + cs = 8 +``` + +``` + c_in_dat = [[idx[i+j] **for** i **in** range(cs)] **for** j **in** range(len(idx)-cs)] +``` + +``` + c_out_dat = [idx[j+cs] **for** j **in** range(len(idx)-cs)] +``` + +``` + xs = np.stack(c_in_dat, axis=0) +``` + +``` + y = np.stack(c_out_dat) +``` + +``` + xs[:cs,:cs] _array([[40, 42, 29, 30, 25, 27, 29, 1],_ _[42, 29, 30, 25, 27, 29, 1, 1],_ _[29, 30, 25, 27, 29, 1, 1, 1],_ _[30, 25, 27, 29, 1, 1, 1, 43],_ _[25, 27, 29, 1, 1, 1, 43, 45],_ _[27, 29, 1, 1, 1, 43, 45, 40],_ _[29, 1, 1, 1, 43, 45, 40, 40],_ _[ 1, 1, 1, 43, 45, 40, 40, 39]])_ +``` + +``` + y[:cs] _array([ 1, 1, 43, 45, 40, 40, 39, 43])_ +``` + +请注意它们是重叠的(即0-7预测8,1-8预测9)。 + +``` + val_idx = get_cv_idxs(len(idx)-cs-1) md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512) +``` + +#### 创建模型[ [1:43:03](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h43m3s) ] + +``` + **class** **CharLoopModel** (nn.Module): _# This is an RNN!_ **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.l_in = nn.Linear(n_fac, n_hidden) self.l_hidden = nn.Linear(n_hidden, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, *cs): bs = cs[0].size(0) h = V(torch.zeros(bs, n_hidden).cuda()) **for** c **in** cs: inp = F.relu(self.l_in(self.e(c))) h = F.tanh(self.l_hidden(h+inp)) **return** F.log_softmax(self.l_out(h), dim=-1) +``` + +大多数代码与以前相同。 您会注意到`forward`功能中有一个`for`循环。 + +> 双曲正切(Tanh)[ [1:43:43](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h43m43s) ] + +> 这是一个偏移的sigmoid。 通常在隐藏状态下使用双曲线tanh来隐藏状态转换,因为它会阻止它飞得太高或太低。 出于其他目的,relu更常见。 + +![](../img/1_EFvLR4S8KFKN9xTvTVMcng.png) + +现在这是一个非常深的网络,因为它使用8个字符而不是2个。随着网络越来越深入,它们变得越来越难以训练。 + +``` + m = CharLoopModel(vocab_size, n_fac).cuda() opt = optim.Adam(m.parameters(), 1e-2) fit(m, md, 1, opt, F.nll_loss) set_lrs(opt, 0.001) fit(m, md, 1, opt, F.nll_loss) +``` + +#### 添加与连续 + +我们现在将为`self.l_hidden( **h+inp** )` [inp](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h46m4s) `self.l_hidden( **h+inp** )` [ [1:46:04](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h46m4s) ]尝试别的东西。 原因是输入状态和隐藏状态在质量上是不同的。 输入是字符的编码,h是一系列字符的编码。 所以将它们加在一起,我们可能会丢失信息。 让我们将它们连接起来。 不要忘记更改输入以匹配形状( `n_fac+n_hidden`而不是`n_fac` )。 + +``` + **class** **CharLoopConcatModel** (nn.Module): **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.l_in = nn.Linear( **n_fac+n_hidden** , n_hidden) self.l_hidden = nn.Linear(n_hidden, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, *cs): bs = cs[0].size(0) h = V(torch.zeros(bs, n_hidden).cuda()) **for** c **in** 
cs: inp = **torch.cat** ((h, self.e(c)), 1) inp = F.relu(self.l_in(inp)) h = F.tanh(self.l_hidden(inp)) **return** F.log_softmax(self.l_out(h), dim=-1) +``` + +这提供了一些改进。 + +#### RNT与PyTorch [ [1:48:47](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h48m47s) ] + +PyTorch将自动为我们和线性输入层编写`for`循环。 + +``` + **class** **CharRnn** (nn.Module): **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) **self.rnn = nn.RNN(n_fac, n_hidden)** self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, *cs): bs = cs[0].size(0) h = V(torch.zeros(1, bs, n_hidden)) inp = self.e(torch.stack(cs)) **outp,h = self.rnn(inp, h)** **return** F.log_softmax(self.l_out( **outp[-1]** ), dim=-1) +``` + +* 由于稍后会变得明显的原因, `self.rnn`会返回输出,还会返回隐藏状态。 +* PyTorch的细微差别在于`self.rnn`会将一个新的隐藏状态附加到张量而不是替换(换句话说,它将返回图中的所有椭圆)。 我们只想要最后一个,所以我们做`outp[-1]` + +``` + m = CharRnn(vocab_size, n_fac).cuda() opt = optim.Adam(m.parameters(), 1e-3) +``` + +``` + ht = V(torch.zeros(1, 512,n_hidden)) outp, hn = m.rnn(t, ht) outp.size(), hn.size() _(torch.Size([8, 512, 256]), torch.Size([1, 512, 256]))_ +``` + +在PyTorch版本中,隐藏状态是等级3张量`h = V(torch.zeros(1, bs, n_hidden))` (在我们的版本中,它是等级2张量)[ [1:51:58](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h51m58s) ]。 我们稍后会详细了解这一点,但事实证明你可以拥有倒退的第二个RNN。 我们的想法是找到倒退的关系会更好 - 它被称为“双向RNN”。 您也可以向RNN提供RNN馈送,称为“多层RNN”。 对于这些RNN,您将需要张量中的附加轴来跟踪隐藏状态的其他层。 现在,我们只有1,然后回来1。 + +#### 测试模型 + +``` + **def** get_next(inp): idxs = T(np.array([char_indices[c] **for** c **in** inp])) p = m(*VV(idxs)) i = np.argmax(to_np(p)) **return** chars[i] +``` + +``` + **def** get_next_n(inp, n): res = inp **for** i **in** range(n): c = get_next(inp) res += c inp = inp[1:]+c **return** res +``` + +``` + get_next_n('for thos', 40) _'for those the same the same the same the same th'_ +``` + +这一次,我们每次循环`n`次调用`get_next` ,每次我们将通过删除第一个字符并添加我们刚预测的字符来替换输入。 + +对于一个有趣的家庭作业,尝试编写自己的`nn.RNN` “ `JeremysRNN` ”而不需要查看PyTorch源代码。 + +#### 多输出[ [1:55:31](https://youtu.be/sHcLkfRrgoQ%3Ft%3D1h55m31s) ] + +从上一个图中,我们可以通过将char 1与char 2相同地处理为n-1来进一步简化。 您注意到三角形(输出)也在循环内移动,换句话说,我们在每个字符后创建一个预测。 + +![](../img/1_0-XkFkCIatPvenvKPfe2_g.png) + +<figcaption class="imageCaption">使用字符1到n-1预测字符2到n</figcaption> + + + +我们可能希望这样做的原因之一是我们之前看到的冗余: + +``` + array([[40, 42, 29, 30, 25, 27, 29, 1], [42, 29, 30, 25, 27, 29, 1, 1], [29, 30, 25, 27, 29, 1, 1, 1], [30, 25, 27, 29, 1, 1, 1, 43], [25, 27, 29, 1, 1, 1, 43, 45], [27, 29, 1, 1, 1, 43, 45, 40], [29, 1, 1, 1, 43, 45, 40, 40], [ 1, 1, 1, 43, 45, 40, 40, 39]]) +``` + +我们可以通过这次采用**不重叠**的角色来提高效率。 因为我们正在进行多输出,对于输入字符0到7,输出将是char 1到8的预测。 + +``` + xs[:cs,:cs] +``` + +``` + array([[40, 42, 29, 30, 25, 27, 29, 1], [ 1, 1, 43, 45, 40, 40, 39, 43], [33, 38, 31, 2, 73, 61, 54, 73], [ 2, 44, 71, 74, 73, 61, 2, 62], [72, 2, 54, 2, 76, 68, 66, 54], [67, 9, 9, 76, 61, 54, 73, 2], [73, 61, 58, 67, 24, 2, 33, 72], [ 2, 73, 61, 58, 71, 58, 2, 67]]) +``` + +``` + ys[:cs,:cs] array([[42, 29, 30, 25, 27, 29, 1, 1], [ 1, 43, 45, 40, 40, 39, 43, 33], [38, 31, 2, 73, 61, 54, 73, 2], [44, 71, 74, 73, 61, 2, 62, 72], [ 2, 54, 2, 76, 68, 66, 54, 67], [ 9, 9, 76, 61, 54, 73, 2, 73], [61, 58, 67, 24, 2, 33, 72, 2], [73, 61, 58, 71, 58, 2, 67, 68]]) +``` + +这不会使我们的模型更准确,但我们可以更有效地训练它。 + +``` + **class** **CharSeqRnn** (nn.Module): **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.rnn = nn.RNN(n_fac, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, *cs): bs = cs[0].size(0) h = V(torch.zeros(1, bs, n_hidden)) inp = self.e(torch.stack(cs)) outp,h = 
self.rnn(inp, h) **return** F.log_softmax(self.l_out( **outp** ), dim=-1) +``` + +请注意,我们不再执行`outp[-1]`因为我们想保留所有这些。 但其他一切都是一样的。 一个复杂性[ [2:00:37](https://youtu.be/sHcLkfRrgoQ%3Ft%3D2h37s) ]是我们想要像以前一样使用负对数似然丢失函数,但它期望两个等级2张量(两个小批量向量)。 但在这里,我们有3级张量: + +* 8个字符(时间步长) +* 84个概率 +* 为512 minibatch + +#### 让我们写一个自定义的损失函数[ [2:02:10](https://youtu.be/sHcLkfRrgoQ%3Ft%3D2h2m10s) ]: + +``` + **def** nll_loss_seq(inp, targ): sl,bs,nh = inp.size() targ = targ.transpose(0,1).contiguous().view(-1) **return** F.nll_loss(inp.view(-1,nh), targ) +``` + +* `F.nll_loss`是PyTorch损失函数。 +* 展平我们的投入和目标。 +* 转置前两个轴,因为PyTorch期望1.序列长度(多少时间步长),2。批量大小,3。隐藏状态本身。 `yt.size()`是512乘8,而`sl, bs`是8乘512。 +* 当您执行“转置”之类的操作时,PyTorch通常不会实际调整内存顺序,而是保留一些内部元数据来将其视为转置。 当您转置矩阵时,PyTorch只会更新元数据。 如果您看到一个错误“此张量不连续”,请在其后添加`.contiguous()`并且错误消失。 +* `.view`与`np.reshape`相同。 `-1`表示只要它需要。 + +``` + fit(m, md, 4, opt, null_loss_seq) +``` + +请记住, `fit(...)`是实现训练循环的最低级别fast.ai抽象。 所以所有参数都是标准的PyTorch,除了`md` ,它是我们的模型数据对象,它包装了测试集,训练集和验证集。 + +问题[ [2:06:04](https://youtu.be/sHcLkfRrgoQ%3Ft%3D2h6m4s) ]:既然我们在循环中放了一个三角形,我们需要更大的序列大小吗? + +* 如果我们有一个像8这样的短序列,那么第一个字符就没有任何意义了。 它以空的隐藏状态零开始。 +* 我们将在下周学习如何避免这个问题。 +* 基本思想是“为什么我们每次都要将隐藏状态重置为零?”(参见下面的代码)。 如果我们能够以某种方式排列这些迷你批次,以便下一个小批量正确连接代表Nietsche作品中的下一个字母,那么我们可以将`h = V(torch.zeros(1, bs, n_hidden))`到构造函数中。 + +``` + **class** **CharSeqRnn** (nn.Module): **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.rnn = nn.RNN(n_fac, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, *cs): bs = cs[0].size(0) **h = V(torch.zeros(1, bs, n_hidden))** inp = self.e(torch.stack(cs)) outp,h = self.rnn(inp, h) **return** F.log_softmax(self.l_out(outp), dim=-1) +``` + +#### 渐变爆炸[ [2:08:21](https://youtu.be/sHcLkfRrgoQ%3Ft%3D2h8m21s) ] + +`self.rnn(inp, h)`是一个循环,一次又一次地应用相同的矩阵。 如果这个矩阵乘以每次都会增加激活次数,那么我们实际上就是以8的幂为例 - 我们称之为梯度爆炸。 我们希望确保初始`l_hidden`不会导致我们的激活平均增加或减少。 + +一个很好的矩阵就是这样称为单位矩阵: + +![](../img/1_MH5NhqJBth84L9ufaxJCig.jpeg) + +我们可以使用单位矩阵覆盖随机初始化的隐藏隐藏权重: + +``` + m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden)) +``` + +这是由Geoffrey Hinton等人介绍的。 人。 in 2015 ( [A Simple Way to Initialize Recurrent Networks of Rectified Linear Units](https://arxiv.org/abs/1504.00941) ) — after RNN has been around for decades. It works very well, and you can use higher learning rate since it is well behaved. 
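关于上面的单位矩阵初始化，下面给一个极简的示意（只是演示用的草图，忽略了 tanh 和每一步的输入，并非课程代码）：反复乘同一个随机矩阵会让隐藏状态的范数成倍变化，而乘单位矩阵则保持不变，这正是用 `torch.eye` 覆盖 `weight_hh_l0` 的动机。

```
import torch

n_hidden = 256
h = torch.randn(1, n_hidden)

w_rand = torch.randn(n_hidden, n_hidden) * 0.1   # 随机初始化的隐藏-隐藏权重
w_eye  = torch.eye(n_hidden)                     # 单位矩阵初始化

h_r, h_e = h.clone(), h.clone()
for _ in range(8):            # 模拟 bptt=8 的 8 个时间步
    h_r = h_r @ w_rand        # 范数每一步都被成倍放大
    h_e = h_e @ w_eye         # 范数保持不变
print(h_r.norm().item(), h_e.norm().item())
```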
diff --git a/zh/dl7.md b/zh/dl7.md new file mode 100644 index 0000000000000000000000000000000000000000..150511872657f3c987dfc4bc4790453b25aa4f5d --- /dev/null +++ b/zh/dl7.md @@ -0,0 +1,878 @@ +# 深度学习2:第1部分第7课 + +### [第7课](http://forums.fast.ai/t/wiki-lesson-7/9405) + +第1部分的主题是: + +* 深度学习的分类和回归 +* 识别和学习最佳和既定的实践 +* 重点是分类和回归,它预测“一件事”(例如一个数字,少数标签) + +课程的第2部分: + +* 重点是生成建模,这意味着预测“很多事情” - 例如,创建一个句子,如在神经翻译,图像字幕或问题回答中创建图像,如风格转移,超分辨率,分割等。 +* 不是最好的做法,而是从最近可能没有经过全面测试的论文中获得更多的推测。 + +#### 审查Char3Model [ [02:49](https://youtu.be/H3g26EVADgY%3Ft%3D2m49s) ] + +提醒:RNN没有任何不同或异常或神奇 - 只是一个标准的完全连接网络。 + +![](../img/1_9XXQ3J7G3rD92tFkusi4bA.png) + +<figcaption class="imageCaption">标准的全连接网络</figcaption> + + + +* 箭头表示一个或多个层操作 - 一般来说是线性的,后跟非线性函数,在这种情况下,矩阵乘法后跟`relu`或`tanh` +* 相同颜色的箭头表示使用的完全相同的重量矩阵。 +* 与之前的一个细微差别是第二和第三层有输入。 我们尝试了两种方法 - 将这些输入连接并添加到当前激活中。 + +``` + **class** **Char3Model** (nn.Module): **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) _# The 'green arrow' from our diagram_ self.l_in = nn.Linear(n_fac, n_hidden) _# The 'orange arrow' from our diagram_ self.l_hidden = nn.Linear(n_hidden, n_hidden) _# The 'blue arrow' from our diagram_ self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, c1, c2, c3): in1 = F.relu(self.l_in(self.e(c1))) in2 = F.relu(self.l_in(self.e(c2))) in3 = F.relu(self.l_in(self.e(c3))) h = V(torch.zeros(in1.size()).cuda()) h = F.tanh(self.l_hidden(h+in1)) h = F.tanh(self.l_hidden(h+in2)) h = F.tanh(self.l_hidden(h+in3)) **return** F.log_softmax(self.l_out(h)) +``` + +* 通过使用`nn.Linear`我们可以免费获得权重矩阵和偏置向量。 +* 为了处理第一个椭圆没有橙色箭头的事实,我们发明了一个空矩阵 + +``` + **class** **CharLoopModel** (nn.Module): _# This is an RNN!_ **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.l_in = nn.Linear(n_fac, n_hidden) self.l_hidden = nn.Linear(n_hidden, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, *cs): bs = cs[0].size(0) h = V(torch.zeros(bs, n_hidden).cuda()) **for** c **in** cs: inp = F.relu(self.l_in(self.e(c))) h = F.tanh(self.l_hidden(h+inp)) **return** F.log_softmax(self.l_out(h), dim=-1) +``` + +* 几乎相同,除了`for`循环 + +``` + **class** **CharRnn** (nn.Module): **def** __init__(self, vocab_size, n_fac): super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.rnn = nn.RNN(n_fac, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) **def** forward(self, *cs): bs = cs[0].size(0) h = V(torch.zeros(1, bs, n_hidden)) inp = self.e(torch.stack(cs)) outp,h = self.rnn(inp, h) **return** F.log_softmax(self.l_out(outp[-1]), dim=-1) +``` + +* PyTorch版本`nn.RNN`将创建循环并跟踪`h`跟踪。 +* 我们使用白色部分来预测绿色字符 - 这看起来很浪费,因为下一部分主要与当前部分重叠。 + +![](../img/1_4v68iwTS32RHplB8c-egmg.png) + +* 然后,我们尝试将其拆分为多输出模型中的非重叠部分: + +![](../img/1_5LY1Sdql1_VLHDfdd2e8lw.png) + +* 在这种方法中,我们在处理每个部分后开始抛弃我们的`h`激活并开始一个新的激活。 为了使用下一节中的第一个字符预测第二个字符,它没有任何内容可以继续,而是默认激活。 我们不要扔掉。 + +#### 有状态的RNN [ [08:52](https://youtu.be/H3g26EVADgY%3Ft%3D8m52s) ] + +``` + **class** **CharSeqStatefulRnn** (nn.Module): **def** __init__(self, vocab_size, n_fac, bs): self.vocab_size = vocab_size super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.rnn = nn.RNN(n_fac, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) **self.init_hidden(bs)** **def** forward(self, cs): bs = cs[0].size(0) **if** self.h.size(1) != bs: self.init_hidden(bs) outp,h = self.rnn(self.e(cs), self.h) **self.h = repackage_var(h)** **return** F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size) **def** init_hidden(self, bs): self.h = 
V(torch.zeros(1, bs, n_hidden)) +``` + +* 构造函数中的另一行。 `self.init_hidden(bs)`将`self.h`设置为一堆零。 +* **Wrinkle#1** [ [10:51](https://youtu.be/H3g26EVADgY%3Ft%3D10m51s) ] - 如果我们只是做`self.h = h` ,并且我们训练了一个长度为一百万字符的文档,那么RNN的展开版本的大小有一百万层(省略号) 。 一百万层完全连接的网络将占用大量内存,因为为了实现链规则,我们必须增加一百万个层,同时记住每批100万个梯度。 +* 为避免这种情况,我们会不时地忘记它的历史。 我们仍然可以记住状态(隐藏矩阵中的值)而不记得我们如何到达那里的一切。 + +``` + def repackage_var(h): return Variable(h.data) if type(h) == Variable else tuple(repackage_var(v) for v in h) +``` + +* 从`Variable` `h`取出张量(记住,张量本身没有任何历史概念),并从中创建一个新的`Variable` 。 新变量具有相同的值但没有操作历史记录,因此当它尝试反向传播时,它将停在那里。 +* `forward`将处理8个字符,然后返回传播通过8个层,跟踪隐藏状态中的值,但它将丢弃其操作历史。 这称为**反向支撑通过时间(bptt)** 。 +* 换句话说,在`for`循环之后,只需丢弃操作历史并重新开始。 所以我们保持隐藏的状态,但我们没有保持隐藏的状态历史。 +* 不通过太多层反向传播的另一个好理由是,如果你有任何类型的梯度不稳定性(例如梯度爆炸或梯度消失),你拥有的层数越多,网络训练越难(速度越慢,弹性越小) 。 +* 另一方面,较长的`bptt`意味着您可以显式捕获更长的内存和更多的状态。 +* **Wrinkle#2** [ [16:00](https://youtu.be/H3g26EVADgY%3Ft%3D16m) ] - 如何创建迷你批次。 我们不希望一次处理一个部分,而是一次并行处理一个部分。 +* 当我们第一次开始关注TorchText时,我们谈到了它如何创建这些迷你批次。 +* Jeremy说我们整整一份长文件,包括Nietzsche的整个作品或所有IMDB评论连在一起,我们把它分成64个相同大小的块(不是64块大小的块)。 + +![](../img/1_YOUoCz-p7semcNuDFZqp_w.png) + +* 对于长度为6400万个字符的文档,每个“块”将是100万个字符。 我们将它们堆叠在一起,现在用`bptt`分割它们 - 1个mini-bach由64个`bptt`矩阵组成。 +* 第二个块(第1,000,001个字符)的第一个字符可能位于句子的中间。 但它没关系,因为它每百万个字符只发生一次。 + +#### 问题:此类数据集的数据扩充? [ [20:34](https://youtu.be/H3g26EVADgY%3Ft%3D20m34s) ] + +没有已知的好方法。 有人最近通过数据增加赢得了Kaggle比赛,随机插入了不同行的部分 - 这样的东西可能会有用。 但是,最近没有任何最新的NLP论文正在进行这种数据增强。 + +#### 问题:我们如何选择bptt的大小? [ [21:36](https://youtu.be/H3g26EVADgY%3Ft%3D21m36s) ] + +有几件事要考虑: + +* 首先,迷你批量矩阵的大小为`bptt` (块数#),因此你的GPU RAM必须能够通过嵌入矩阵拟合。 因此,如果你得到CUDA内存不足错误,你需要减少其中一个。 +* 如果你的训练不稳定(例如你的`bptt`突然向NaN射击),那么你可以尝试减少你的`bptt`因为你有更少的层来渐变爆炸。 +* 如果它太慢[ [22:44](https://youtu.be/H3g26EVADgY%3Ft%3D22m44s) ],请尝试减少你的`bptt`因为它会一次执行其中一个步骤。 `for`循环无法并行化(对于当前版本)。 最近有一种称为QRNN(准递归神经网络)的东西,它将它并行化,我们希望在第2部分中介绍。 +* 所以选择满足所有这些的最高数字。 + +#### 有状态的RNN和TorchText [ [23:23](https://youtu.be/H3g26EVADgY%3Ft%3D23m23s) ] + +当使用希望数据为特定格式的现有API时,您可以更改数据以适合该格式,也可以编写自己的数据集子类来处理数据已经存在的格式。要么很好,要么在这种情况下,我们将以TorchText格式提供我们的数据。 围绕TorchText的Fast.ai包装器已经具有可以具有训练路径和验证路径的东西,并且每个路径中的一个或多个文本文件包含为您的语言模型连接在一起的一堆文本。 + +``` + **from** **torchtext** **import** vocab, data +``` + +``` + **from** **fastai.nlp** **import** * **from** **fastai.lm_rnn** **import** * +``` + +``` + PATH='data/nietzsche/' +``` + +``` + TRN_PATH = 'trn/' VAL_PATH = 'val/' TRN = f' **{PATH}{TRN_PATH}** ' VAL = f' **{PATH}{VAL_PATH}** ' +``` + +``` + %ls {PATH} _models/ nietzsche.txt trn/ val/_ +``` + +``` + %ls {PATH}trn _trn.txt_ +``` + +* 制作了Nietzsche文件的副本,粘贴到培训和验证目录中。 然后从训练集中删除了最后20%的行,并删除了验证集[ [25:15](https://youtu.be/H3g26EVADgY%3Ft%3D25m15s) ]中除最后20%之外的所有行。 +* 这样做的另一个好处是,拥有一个不是随机混乱的文本行集的验证集似乎更为现实,但它完全是语料库的一部分。 +* 在进行语言模型时,实际上并不需要单独的文件。 您可以拥有多个文件,但无论如何它们只是连在一起。 + +``` + TEXT = data.Field(lower= **True** , tokenize=list) bs=64; bptt=8; n_fac=42; n_hidden=256 FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH) md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3) len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text) _(963, 56, 1, 493747)_ +``` + +* 在TorchText中,我们将这个东西称为`Field` ,最初`Field`只是描述如何预处理文本。 +* `lower` - 我们告诉它要小写文本 +* `tokenize` - 上次,我们使用了一个在空格上分割的函数,它给了我们一个单词模型。 这次,我们需要一个字符模型,所以使用`list`函数来标记字符串。 请记住,在Python中, `list('abc')`将返回`['a', 'b', 'c']` 。 +* `bs` :批量大小, `bptt` :我们重命名为`cs` , `n_fac` :嵌入的大小, `n_hidden` :隐藏状态的大小 +* 我们没有单独的测试集,所以我们只使用验证集进行测试 +* TorchText每次将bptt的长度随机化一点。 它并不总能给我们准确的8个字符; 5%的时间,它会将它减少一半并加上一个小的标准偏差,使其略大于或小于8.我们无法对数据进行混洗,因为它需要连续,所以这是一种引入一些随机性的方法。 +* 问题[ 
[31:46](https://youtu.be/H3g26EVADgY%3Ft%3D31m46s) ]:每个小批量的尺寸是否保持不变? 是的,我们需要使用`h`权重矩阵进行矩阵乘法,因此小批量大小必须保持不变。 但序列长度可以改变没有问题。 +* `len(md.trn_dl)` :数据加载器的长度(即多少`md.nt`批量), `md.nt` :令牌数量(即词汇表中有多少独特的东西) +* 一旦运行`LanguageModelData.from_text_files` , `TEXT`将包含一个名为`vocab`的额外属性。 `TEXT.vocab.itos`词汇表中的唯一项目列表, `TEXT.vocab.stoi`是从每个项目到数字的反向映射。 + +``` + **class** **CharSeqStatefulRnn** (nn.Module): **def** __init__(self, vocab_size, n_fac, bs): self.vocab_size = vocab_size super().__init__() self.e = nn.Embedding(vocab_size, n_fac) self.rnn = nn.RNN(n_fac, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) self.init_hidden(bs) **def** forward(self, cs): bs = cs[0].size(0) **if self.h.size(1) != bs: self.init_hidden(bs)** outp,h = self.rnn(self.e(cs), self.h) self.h = repackage_var(h) **return** **F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)** **def** init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden)) +``` + +* **皱纹#3** [ [33:51](https://youtu.be/H3g26EVADgY%3Ft%3D33m51s) ]:Jeremy说他们说小批量的尺寸保持不变。 除非数据集完全被`bptt`乘以`bs`整除,否则最后一个小批量很可能比其他小批量短。 这就是为什么我们检查`self.h`的第二个维度是否与输入的`bs`相同。 如果不相同,请使用输入的`bs`将其设置为零。 这发生在纪元的末尾和纪元的开始(设置回完整的批量大小)。 +* **Wrinkle#4** [ [35:44](https://youtu.be/H3g26EVADgY%3Ft%3D35m44s) ]:最后的皱纹对于PyTorch来说有点糟糕,也许有人可以用PR来修复它。 损失函数不满意接收秩3张量(即三维阵列)。 没有特别的原因他们应该不乐意接受等级3张量(按结果的批量大小的序列长度 - 所以你可以只计算两个初始轴中每一个的损失)。 适用于等级2或4,但不适用于3。 +* `.view`将通过`vocab_size`将等级3张量重塑为`-1`等级2(无论多么大)。 TorchText会自动更改**目标**以使其变平,因此我们不需要为实际值执行此操作(当我们在第4课中查看小批量时,我们注意到它已被展平。杰里米说我们将了解为什么以后,所以后来现在。) +* PyTorch(截至0.3), `log_softmax`要求我们指定我们想要在哪个轴上执行softmax(即我们想要总和为哪个轴)。 在这种情况下,我们希望在最后一个轴上进行`dim = -1` 。 + +``` + m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda() opt = optim.Adam(m.parameters(), 1e-3) +``` + +``` + fit(m, md, 4, opt, F.nll_loss) +``` + +#### 让我们通过拆包RNN获得更多洞察力[ [42:48](https://youtu.be/H3g26EVADgY%3Ft%3D42m48s) ] + +我们删除了`nn.RNN`的使用并用`nn.RNNCell`替换它。 PyTorch源代码如下所示。 你应该能够阅读和理解(注意:它们不会连接输入和隐藏状态,但是它们将它们加在一起 ​​- 这是我们的第一种方法): + +``` + **def** RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh): **return** F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh)) +``` + +关于`tanh`问题[ [44:06](https://youtu.be/H3g26EVADgY%3Ft%3D44m6s) ]:正如我们上周看到的那样, `tanh`强迫值在-1和1之间。由于我们一次又一次地乘以这个权重矩阵,我们会担心`relu` (因为它是无界)可能有更多的梯度爆炸问题。 话虽如此,您可以指定`RNNCell`使用默认为`tanh`不同`nonlineality` ,并要求它使用`relu`如果您愿意)。 + +``` + **class** **CharSeqStatefulRnn2** (nn.Module): **def** __init__(self, vocab_size, n_fac, bs): super().__init__() self.vocab_size = vocab_size self.e = nn.Embedding(vocab_size, n_fac) self.rnn = **nn.RNNCell** (n_fac, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) self.init_hidden(bs) **def** forward(self, cs): bs = cs[0].size(0) **if** self.h.size(1) != bs: self.init_hidden(bs) outp = [] o = self.h **for** c **in** cs: o = self.rnn(self.e(c), o) outp.append(o) outp = self.l_out(torch.stack(outp)) self.h = repackage_var(o) **return** F.log_softmax(outp, dim=-1).view(-1, self.vocab_size) **def** init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden)) +``` + +* `for`循环返回并将线性函数的结果附加到列表中 - 最终将它们堆叠在一起。 +* fast.ai库实际上正是为了使用PyTorch不支持的正则化方法。 + +#### 门控经常性单位(GRU)[ [46:44](https://youtu.be/H3g26EVADgY%3Ft%3D46m44s) ] + +在实践中,没有人真正使用`RNNCell`因为即使是`tanh` ,梯度爆炸仍然是一个问题,我们需要使用低学习率和小`bptt`来让他们训练。 所以我们要做的是用`RNNCell`替换`GRUCell` 。 + +![](../img/1__29x3zNI1C0vM3fxiIpiVA.png) + +<figcaption 
class="imageCaption">[http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/)</figcaption> + + + +* 通常,输入乘以权重矩阵以创建新的激活`h`并立即添加到现有激活中。 那不是发生在这里。 +* 输入进入`h˜`并且它不仅仅被添加到先前的激活中,而是先前的激活乘以`r` (重置门),其值为0或1。 +* `r`计算如下 - 一些权重矩阵的矩阵乘法和我们先前的隐藏状态和新输入的串联。 换句话说,这是一个隐藏层神经网络。 它也通过sigmoid函数。 这个迷你神经网络学会确定要记住多少隐藏状态(当它看到一个完整停止字符时可能会忘记它 - 新句子的开头)。 +* `z` gate(更新门)确定使用`h˜` ~的程度(隐藏状态的新输入版本)以及隐藏状态与以前相同的程度。 + +![](../img/1_qzfburCutJ3p-FYu1T6Q3Q.png) + +<figcaption class="imageCaption">[http://colah.github.io/posts/2015-08-Understanding-LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)</figcaption> + + + +![](../img/1_M7ujxxzjQfL5e33BjJQViw.png) + +* 线性插值 + +``` + **def** GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh): gi = F.linear(input, w_ih, b_ih) gh = F.linear(hidden, w_hh, b_hh) i_r, i_i, i_n = gi.chunk(3, 1) h_r, h_i, h_n = gh.chunk(3, 1) resetgate = F.sigmoid(i_r + h_r) inputgate = F.sigmoid(i_i + h_i) newgate = F.tanh(i_n + resetgate * h_n) **return** newgate + inputgate * (hidden - newgate) +``` + +以上是`GRUCell`代码的样子,我们使用它的新模型如下: + +``` + **class** **CharSeqStatefulGRU** (nn.Module): **def** __init__(self, vocab_size, n_fac, bs): super().__init__() self.vocab_size = vocab_size self.e = nn.Embedding(vocab_size, n_fac) self.rnn = nn.GRU(n_fac, n_hidden) self.l_out = nn.Linear(n_hidden, vocab_size) self.init_hidden(bs) **def** forward(self, cs): bs = cs[0].size(0) **if** self.h.size(1) != bs: self.init_hidden(bs) outp,h = self.rnn(self.e(cs), self.h) self.h = repackage_var(h) **return** F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size) **def** init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden)) +``` + +结果,我们可以将损失降低到1.36( `RNNCell`一个是1.54)。 在实践中,GRU和LSTM是人们使用的。 + +#### 把它们放在一起:长期短期记忆[ [54:09](https://youtu.be/H3g26EVADgY%3Ft%3D54m9s) ] + +LSTM还有一个状态称为“单元状态”(不仅仅是隐藏状态),所以如果你使用LSTM,你必须在`init_hidden`返回一个矩阵元组(与隐藏状态完全相同): + +``` + **from** **fastai** **import** sgdr n_hidden=512 +``` + +``` + **class** **CharSeqStatefulLSTM** (nn.Module): **def** __init__(self, vocab_size, n_fac, bs, nl): super().__init__() self.vocab_size,self.nl = vocab_size,nl self.e = nn.Embedding(vocab_size, n_fac) self.rnn = nn.LSTM(n_fac, n_hidden, nl, **dropout** =0.5) self.l_out = nn.Linear(n_hidden, vocab_size) self.init_hidden(bs) **def** forward(self, cs): bs = cs[0].size(0) **if** self.h[0].size(1) != bs: self.init_hidden(bs) outp,h = self.rnn(self.e(cs), self.h) self.h = repackage_var(h) **return** F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size) **def** init_hidden(self, bs): **self.h = (V(torch.zeros(self.nl, bs, n_hidden)),** **V(torch.zeros(self.nl, bs, n_hidden)))** +``` + +代码与GRU代码相同。 添加的一件事是`dropout` ,它在每个时间步骤后都会辍学并将隐藏层加倍 - 希望它能够学到更多并且能够保持弹性。 + +#### 没有学习者课程的回调(特别是SGDR)[ [55:23](https://youtu.be/H3g26EVADgY%3Ft%3D55m23s) ] + +``` + m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda() lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5) +``` + +* 在创建标准PyTorch模型之后,我们通常会执行类似`opt = optim.Adam(m.parameters(), 1e-3)` 。 相反,我们将使用fast.ai `LayerOptimizer` ,它采用优化器`optim.Adam` ,我们的模型`m` ,学习率`1e-2` ,以及可选的权重衰减`1e-5` 。 +* `LayerOptimizer`存在的一个关键原因是差异学习率和差`LayerOptimizer`重衰减。 我们需要使用它的原因是fast.ai中的所有机制假设你有其中一个。 如果您想在代码中使用回调或SGDR而不使用Learner类,则需要使用它。 +* `lo.opt`返回优化器。 + +``` + on_end = **lambda** sched, cycle: save_model(m, f' **{PATH}** models/cyc_ **{cycle}** ') +``` + +``` + cb = [CosAnneal(lo, 
len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)] +``` + +``` + fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb) +``` + +* 当我们调用`fit` ,我们现在可以传递`LayerOptimizer`以及`callbacks` 。 +* 在这里,我们使用余弦退火回调 - 这需要一个`LayerOptimizer`对象。 它通过改变`lo`对象的学习率来进行余弦退火。 +* 概念:创建一个余弦退火回调,它将更新层优化器中的学习速率。 一个纪元的长度等于`len(md.trn_dl)` - 一个纪元中有多少`len(md.trn_dl)`批量是数据加载器的长度。 由于它正在进行余弦退火,因此需要知道复位的频率。 你可以用通常的方式传递`cycle_mult` 。 我们甚至可以像在`Learner.fit`使用`cycle_save_name`一样自动保存模型。 +* 我们可以在训练,纪元或批次开始时,或在训练,纪元或批次结束时进行回调。 +* 它已被用于`CosAnneal` (SGDR),去耦重量衰减(AdamW),时间损失图等。 + +#### 测试[ [59:55](https://youtu.be/H3g26EVADgY%3Ft%3D59m55s) ] + +``` + **def** get_next(inp): idxs = TEXT.numericalize(inp) p = m(VV(idxs.transpose(0,1))) r = **torch.multinomial(p[-1].exp(), 1)** **return** TEXT.vocab.itos[to_np(r)[0]] +``` + +``` + **def** get_next_n(inp, n): res = inp **for** i **in** range(n): c = get_next(inp) res += c inp = inp[1:]+c **return** res +``` + +``` + print(get_next_n('for thos', 400)) +``` + +``` + _for those the skemps), or imaginates, though they deceives._ _it should so each ourselvess and new present, step absolutely for the science." the contradity and measuring, the whole!_ +``` + +``` + _293\. perhaps, that every life a values of blood of intercourse when it senses there is unscrupulus, his very rights, and still impulse, love?_ _just after that thereby how made with the way anything, and set for harmless philos_ +``` + +* 在第6课中,当我们测试`CharRnn`模型时,我们注意到它一遍又一遍地重复。 在这个新版本中使用的`torch.multinomial`处理这个问题。 `p[-1]`得到最终输出(三角形), `exp`将log概率转换为概率。 然后我们使用`torch.multinomial`函数,它将使用给定的概率给我们一个样本。 如果概率是[0,1,0,0]并要求它给我们一个样本,它将始终返回第二个项目。 如果是[0.5,0,0.5],它将给出第一项50%的时间,第二项。 50%的时间( [多项分布审查](http://onlinestatbook.com/2/probability/multinomial.html) ) +* 要使用像这样训练基于角色的语言模型,尝试在不同的损失级别运行`get_next_n` ,以了解它的外观。 上面的例子是在1.25,但在1.3,它看起来像一个完全垃圾。 +* 当你在玩NLP时,特别是这样的生成模型,结果有点好但不是很好,不要灰心,因为这意味着你实际上非常非常接近! 
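针对上面 `torch.multinomial` 的讨论，这里补一个独立的小例子（数字完全是虚构的，仅作演示），对比 argmax 和按概率采样的区别：

```
import torch

log_p = torch.tensor([-0.69, -20.0, -0.69])   # 虚构的 log 概率，exp 后约为 [0.5, 0, 0.5]
probs = log_p.exp()                           # log 概率 -> 概率

print(probs.argmax().item())      # argmax 永远返回同一个下标，生成文本时容易一遍遍重复
for _ in range(5):
    # multinomial 按给定概率随机抽样：第 0 项和第 2 项各有约 50% 的机会被选中
    print(torch.multinomial(probs, 1).item())
```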
+ +### [回到计算机视觉:CIFAR 10](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson7-cifar10.ipynb) [ [1:01:58](https://youtu.be/H3g26EVADgY%3Ft%3D1h1m58s) ] + +CIFAR 10是学术界一个古老且众所周知的数据集 - 在ImageNet之前,有CIFAR 10.它在图像数量和图像大小方面都很小,这使它变得有趣和具有挑战性。 您可能会使用数千张图片而不是一百五十万张图片。 我们在医学成像中看到的很多东西,我们正在寻找有肺结节的特定区域,你可能最多看32×32像素。 + +它也运行得很快,所以测试我们的算法要好得多。 正如Ali Rahini在2017年NIPS中所提到的,Jeremy担心许多人没有仔细调整和深入学习实验,而是他们抛出大量的GPU和TPU或大量数据并考虑一天。 在CIFAR 10等数据集上测试算法的许多版本非常重要,而不是需要数周的ImageNet。 即使人们倾向于抱怨,MNIST也有利于研究和实验。 + +[此处](http://pjreddie.com/media/files/cifar.tgz)提供图像格式的CIFAR 10数据 + +``` + **from** **fastai.conv_learner** **import** * PATH = "data/cifar10/" os.makedirs(PATH,exist_ok= **True** ) +``` + +``` + classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck') stats = (np.array([ 0.4914 , 0.48216, 0.44653]), np.array([ 0.24703, 0.24349, 0.26159])) +``` + +``` + **def** get_data(sz,bs): tfms = **tfms_from_stats** (stats, sz, aug_tfms=[RandomFlipXY()], pad=sz//8) **return** ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs) +``` + +``` + bs=256 +``` + +* `classes` - 图像标签 +* `stats` - 当我们使用预先训练的模型时,您可以调用`tfms_from_model`来创建必要的变换,以根据训练过的原始模型中每个通道的均值和标准偏差将我们的数据集转换为标准化数据集。由于我们是从头开始训练模型,我们需要告诉它我们的数据的均值和标准偏差来规范它。 确保您可以计算每个通道的平均值和标准偏差。 +* `tfms` - 对于CIFAR 10数据增强,人们通常在边缘周围进行水平翻转和黑色填充,并在填充图像中随机选择32×32区域。 + +``` + data = get_data(32,bs) lr=1e-2 +``` + +来自我们的学生Kerem Turgutlu的[这本笔记本](https://github.com/KeremTurgutlu/deeplearning/blob/master/Exploring%2520Optimizers.ipynb) : + +``` + **class** **SimpleNet** (nn.Module): **def** __init__(self, layers): super().__init__() self.layers = **nn.ModuleList** ([ nn.Linear(layers[i], layers[i + 1]) **for** i **in** range(len(layers) - 1)]) **def** forward(self, x): x = x.view(x.size(0), -1) **for** l **in** self.layers: l_x = l(x) x = F.relu(l_x) **return** F.log_softmax(l_x, dim=-1) +``` + +* `nn.ModuleList` - 每当你在PyTorch中创建一个层列表时,你必须将它包装在`ModuleList`以将这些作为属性注册。 + +``` + learn = ConvLearner.from_model_data(SimpleNet([32*32*3, 40,10]), data) +``` + +* 现在我们提高了一级API - 而不是调用`fit`函数,我们_从自定义模型_创建一个`learn`对象。 `ConfLearner.from_model_data`采用标准的PyTorch模型和模型数据对象。 + +``` + learn, [o.numel() **for** o **in** learn.model.parameters()] +``` + +``` + _(SimpleNet(_ _(layers): ModuleList(_ _(0): Linear(in_features=3072, out_features=40)_ _(1): Linear(in_features=40, out_features=10)_ _)_ _), [122880, 40, 400, 10])_ +``` + +``` + learn.summary() +``` + +``` + _OrderedDict([('Linear-1',_ _OrderedDict([('input_shape', [-1, 3072]),_ _('output_shape', [-1, 40]),_ _('trainable', True),_ _('nb_params', 122920)])),_ _('Linear-2',_ _OrderedDict([('input_shape', [-1, 40]),_ _('output_shape', [-1, 10]),_ _('trainable', True),_ _('nb_params', 410)]))])_ +``` + +``` + learn.lr_find() +``` + +``` + learn.sched.plot() +``` + +![](../img/1__5sTAdoWHTBQUzbaVrc4HA.png) + +``` + %time learn.fit(lr, 2) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 1.7658 1.64148 0.42129] [ 1\. 1.68074 1.57897 0.44131] CPU times: user 1min 11s, sys: 32.3 s, total: 1min 44s Wall time: 55.1 s +``` + +``` + %time learn.fit(lr, 2, cycle_len=1) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 1.60857 1.51711 0.46631] [ 1\. 
1.59361 1.50341 0.46924] CPU times: user 1min 12s, sys: 31.8 s, total: 1min 44s Wall time: 55.3 s +``` + +通过一个简单的隐藏层模型,122,880个参数,我们达到了46.9%的准确率。 让我们改进这一点,逐步建立一个基本的ResNet架构。 + +#### CNN [ [01:12:30](https://youtu.be/H3g26EVADgY%3Ft%3D1h12m30s) ] + +* 让我们用卷积模型替换完全连接的模型。 完全连接的层只是做一个点积。 这就是权重矩阵很大的原因(3072输入* 40 = 122880)。 我们没有非常有效地使用这些参数,因为输入中的每个像素都具有不同的权重。 我们想要做的是一组3乘3像素,它们具有特定的模式(即卷积)。 +* 我们将使用具有三乘三内核的过滤器。 如果有多个过滤器,则输出将具有其他维度。 + +``` + **class** **ConvNet** (nn.Module): **def** __init__(self, layers, c): super().__init__() self.layers = nn.ModuleList([ **nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2)** **for** i **in** range(len(layers) - 1)]) self.pool = nn.AdaptiveMaxPool2d(1) self.out = nn.Linear(layers[-1], c) **def** forward(self, x): **for** l **in** self.layers: x = F.relu(l(x)) x = self.pool(x) x = x.view(x.size(0), -1) **return** F.log_softmax(self.out(x), dim=-1) +``` + +* 用`nn.Linear`替换`nn.Conv2d` +* 前两个参数与`nn.Linear` - 进入的`nn.Linear`数量以及出现的`nn.Linear`数量 +* `kernel_size=3` ,过滤器的大小 +* `stride=2`将使用每隔3乘3的区域,这将使每个维度的输出分辨率减半(即它具有与2乘2最大池相同的效果) + +``` + learn = ConvLearner.from_model_data(ConvNet([3, 20, 40, 80], 10), data) +``` + +``` + learn.summary() +``` + +``` + _OrderedDict([('Conv2d-1',_ _OrderedDict([('input_shape', [-1, 3, 32, 32]),_ _('output_shape', [-1, 20, 15, 15]),_ _('trainable', True),_ _('nb_params', 560)])),_ _('Conv2d-2',_ _OrderedDict([('input_shape', [-1, 20, 15, 15]),_ _('output_shape', [-1, 40, 7, 7]),_ _('trainable', True),_ _('nb_params', 7240)])),_ _('Conv2d-3',_ _OrderedDict([('input_shape', [-1, 40, 7, 7]),_ _('output_shape', [-1, 80, 3, 3]),_ _('trainable', True),_ _('nb_params', 28880)])),_ _('AdaptiveMaxPool2d-4',_ _OrderedDict([('input_shape', [-1, 80, 3, 3]),_ _('output_shape', [-1, 80, 1, 1]),_ _('nb_params', 0)])),_ _('Linear-5',_ _OrderedDict([('input_shape', [-1, 80]),_ _('output_shape', [-1, 10]),_ _('trainable', True),_ _('nb_params', 810)]))])_ +``` + +* `ConvNet([3, 20, 40, 80], 10)` - 它以3个RGB通道, `ConvNet([3, 20, 40, 80], 10)`个特征开始,然后是10个类来预测。 +* `AdaptiveMaxPool2d` - 接下来是一个线性层,是从3乘3到预测10个类中的一个的方法,现在是最先进算法的标准。 最后一层,我们执行一种特殊的max-pooling,您可以为其指定输出激活分辨率,而不是要调查的区域大小。 换句话说,这里我们做3乘3 max-pool,相当于1乘1 _自适应_ max-pool。 +* `x = x.view(x.size(0), -1)` - `x`的特征形状为1乘1,因此它将删除最后两层。 +* 这个模型被称为“完全卷积网络” - 每个层都是卷积的,除了最后一层。 + +``` + learn.lr_find( **end_lr=100** ) learn.sched.plot() +``` + +![](../img/1_YuNvyUac9HvAv0XZn08-3g.png) + +* `lr_find`尝试的默认最终学习速率为10.如果此时丢失仍然越来越好,则可以通过指定`end_lr`来覆盖。 + +``` + %time learn.fit(1e-1, 2) +``` + +``` + _A Jupyter Widget_ +``` + +``` + _[ 0\. 1.72594 1.63399 0.41338]_ _[ 1\. 1.51599 1.49687 0.45723]_ _CPU times: user 1min 14s, sys: 32.3 s, total: 1min 46s_ _Wall time: 56.5 s_ +``` + +``` + %time learn.fit(1e-1, 4, cycle_len=1) +``` + +``` + _A Jupyter Widget_ +``` + +``` + _[ 0\. 1.36734 1.28901 0.53418]_ _[ 1\. 1.28854 1.21991 0.56143]_ _[ 2\. 1.22854 1.15514 0.58398]_ _[ 3\. 
1.17904 1.12523 0.59922]_ _CPU times: user 2min 21s, sys: 1min 3s, total: 3min 24s_ _Wall time: 1min 46s_ +``` + +* 它平衡了约60%的准确度。 考虑到它使用了大约30,000个参数(相比之下,参数为122k,为47%) +* 每个时期的时间大致相同,因为它们的架构既简单又大部分时间花在进行内存传输上。 + +#### 重构[ [01:21:57](https://youtu.be/H3g26EVADgY%3Ft%3D1h21m57s) ] + +通过创建`ConvLayer` (我们的第一个自定义层!)简化`forward`功能。 在PyTorch中,层定义和神经网络定义是相同的。 任何时候你有一个图层,你可以将它用作神经网络,当你有神经网络时,你可以将它用作图层。 + +``` + **class** **ConvLayer** (nn.Module): **def** __init__(self, ni, nf): super().__init__() self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1) **def** forward(self, x): **return** F.relu(self.conv(x)) +``` + +* `padding=1` - 当你进行卷积时,图像每边缩小1个像素。 因此它不会从32乘32到16乘16但实际上是15乘15\. `padding`将添加边框,以便我们可以保留边缘像素信息。 对于一个大图像来说,这并不是什么大不了的事情,但是当它降到4比4时,你真的不想丢掉一整块。 + +``` + **class** **ConvNet2** (nn.Module): **def** __init__(self, layers, c): super().__init__() self.layers = nn.ModuleList([ConvLayer(layers[i], layers[i + 1]) **for** i **in** range(len(layers) - 1)]) self.out = nn.Linear(layers[-1], c) **def** forward(self, x): **for** l **in** self.layers: x = l(x) x = **F.adaptive_max_pool2d(x, 1)** x = x.view(x.size(0), -1) **return** F.log_softmax(self.out(x), dim=-1) +``` + +* 与上一个模型的另一个区别是`nn.AdaptiveMaxPool2d`没有任何状态(即没有权重)。 所以我们可以将它称为函数`F.adaptive_max_pool2d` 。 + +#### BatchNorm [ [1:25:10](https://youtu.be/H3g26EVADgY%3Ft%3D1h25m10s) ] + +* 最后一个模型,当我们尝试添加更多图层时,我们在训练时遇到了麻烦。 我们训练有困难的原因是,如果我们使用较大的学习率,那么它将用于NaN,如果我们使用较小的学习率,则需要永远并且没有机会正确探索 - 因此它没有弹性。 +* 为了使其具有弹性,我们将使用称为批量规范化的东西。 BatchNorm大约两年前推出,它具有很大的变革性,因为它突然使得培养更深层的网络变得非常容易。 +* 我们可以简单地使用`nn.BatchNorm`但要了解它,我们将从头开始编写它。 +* It is unlikely that the weight matrices on average are not going to cause your activations to keep getting smaller and smaller or keep getting bigger and bigger. It is important to keep them at reasonable scale. So we start things off with zero-mean standard deviation one by normalizing the input. What we really want to do is to do this for all layers, not just the inputs. + +``` + class BnLayer (nn.Module): def __init__(self, ni, nf, stride=2, kernel_size=3): super().__init__() self.conv = nn.Conv2d(ni, nf, kernel_size=kernel_size, stride=stride, bias= False , padding=1) self.a = nn.Parameter(torch.zeros(nf,1,1)) self.m = nn.Parameter(torch.ones(nf,1,1)) def forward(self, x): x = F.relu(self.conv(x)) x_chan = x.transpose(0,1).contiguous().view(x.size(1), -1) if self.training: self.means = x_chan.mean(1)[:,None,None] self.stds = x_chan.std (1)[:,None,None] return (x-self.means) / self.stds *self.m + self.a +``` + +* Calculate the mean of each channel or each filter and standard deviation of each channel or each filter. Then subtract the means and divide by the standard deviations. +* We no longer need to normalize our input because it is normalizing it per channel or for later layers it is normalizing per filter. +* Turns out this is not enough since SGD is bloody-minded [ [01:29:20](https://youtu.be/H3g26EVADgY%3Ft%3D1h29m20s) ]. If SGD decided that it wants matrix to be bigger/smaller overall, doing `(x=self.means) / self.stds` is not enough because SGD will undo it and try to do it again in the next mini-batch. So we will add two parameters: `a` — adder (initial value zeros) and `m` — multiplier (initial value ones) for each channel. +* `Parameter` tells PyTorch that it is allowed to learn these as weights. +* 为什么这样做? If it wants to scale the layer up, it does not have to scale up every single value in the matrix. 
It can just scale up this single trio of numbers `self.m` , if it wants to shift it all up or down a bit, it does not have to shift the entire weight matrix, they can just shift this trio of numbers `self.a` . Intuition: We are normalizing the data and then we are saying you can then shift it and scale it using far fewer parameters than would have been necessary if it were to actually shift and scale the entire set of convolutional filters. In practice, it allows us to increase our learning rates, it increase the resilience of training, and it allows us to add more layers and still train effectively. +* The other thing batch norm does is that it regularizes, in other words, you can often decrease or remove dropout or weight decay. The reason why is each mini-batch is going to have a different mean and a different standard deviation to the previous mini-batch. So they keep changing and it is changing the meaning of the filters in a subtle way acting as a noise (ie regularization). +* In real version, it does not use this batch's mean and standard deviation but takes an exponentially weighted moving average standard deviation and mean. +* `**if** self.training` — this is important because when you are going through the validation set, you do not want to be changing the meaning of the model. There are some types of layer that are actually sensitive to what the mode of the network is whether it is in training mode or evaluation/test mode. There was a bug when we implemented mini net for MovieLens that dropout was applied during the validation — which was fixed. In PyTorch, there are two such layer: dropout and batch norm. `nn.Dropout` already does the check. +* [ [01:37:01](https://youtu.be/H3g26EVADgY%3Ft%3D1h37m1s) ] The key difference in fast.ai which no other library does is that these means and standard deviations get updated in training mode in every other library as soon as you basically say I am training, regardless of whether that layer is set to trainable or not. With a pre-trained network, that is a terrible idea. If you have a pre-trained network for specific values of those means and standard deviations in batch norm, if you change them, it changes the meaning of those pre-trained layers. In fast.ai, always by default, it will not touch those means and standard deviations if your layer is frozen. As soon as you un-freeze it, it will start updating them unless you set `learn.bn_freeze=True` . In practice, this often seems to work a lot better for pre-trained models particularly if you are working with data that is quite similar to what the pre-trained model was trained with. +* Where do you put batch-norm layer? We will talk more in a moment, but for now, after `relu` + +#### Ablation Study [ [01:39:41](https://youtu.be/H3g26EVADgY%3Ft%3D1h39m41s) ] + +It is something where you try turning on and off different pieces of your model to see which bits make which impacts, and one of the things that wasn't done in the original batch norm paper was any kind of effective ablation. And one of the things therefore that was missing was this question which was just asked — where to put the batch norm. That oversight caused a lot of problems because it turned out the original paper did not actually put it in the best spot. Other people since then have now figured that out and when Jeremy show people code where it is actually in the spot that is better, people say his batch norm is in the wrong spot. 
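As a rough illustration of this ordering question (a sketch, not code from the notebook; it uses `nn.BatchNorm2d` rather than the hand-written `BnLayer`, and the layer sizes are arbitrary), the two placements being debated look like this:

```
import torch.nn as nn
import torch.nn.functional as F

class ConvReluBn(nn.Module):
    """conv -> relu -> bn: the placement used in this lesson."""
    def __init__(self, ni, nf):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(nf)
    def forward(self, x): return self.bn(F.relu(self.conv(x)))

class ConvBnRelu(nn.Module):
    """conv -> bn -> relu: the placement torchvision's ResNet uses."""
    def __init__(self, ni, nf):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(nf)
    def forward(self, x): return F.relu(self.bn(self.conv(x)))
```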
+ +* Try and always use batch norm on every layer if you can +* Don't stop normalizing your data so that people using your data will know how you normalized your data. Other libraries might not deal with batch norm for pre-trained models correctly, so when people start re-training, it might cause problems. + +``` + class ConvBnNet (nn.Module): def __init__(self, layers, c): super().__init__() self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2) self.layers = nn.ModuleList([ BnLayer (layers[i], layers[i + 1]) for i in range(len(layers) - 1)]) self.out = nn.Linear(layers[-1], c) def forward(self, x): x = self.conv1(x) for l in self.layers: x = l(x) x = F.adaptive_max_pool2d(x, 1) x = x.view(x.size(0), -1) return F.log_softmax(self.out(x), dim=-1) +``` + +* Rest of the code is similar — Using `BnLayer` instead of `ConvLayer` +* A single convolutional layer was added at the start trying to get closer to the modern approaches. It has a bigger kernel size and a stride of 1\. The basic idea is that we want the first layer to have a richer input. It does convolution using the 5 by 5 area which allows it to try and find more interesting richer features in that 5 by 5 area, then spit out bigger output (in this case, it's 10 by 5 by 5 filters). Typically it is 5 by 5 or 7 by 7, or even 11 by 11 convolution with quite a few filters coming out (eg 32 filters). +* Since `padding = kernel_size — 1 / 2` and `stride=1` , the input size is the same as the output size — just more filters. +* It is a good way of trying to create a richer starting point. + +#### Deep BatchNorm [ [01:50:52](https://youtu.be/H3g26EVADgY%3Ft%3D1h50m52s) ] + +Let's increase the depth of the model. We cannot just add more of stride 2 layers since it halves the size of the image each time. Instead, after each stride 2 layer, we insert a stride 1 layer. + +``` + class ConvBnNet2 (nn.Module): def __init__(self, layers, c): super().__init__() self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2) self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1]) for i in range(len(layers) - 1)]) self.layers2 = nn.ModuleList([BnLayer(layers[i+1], layers[i + 1], 1) for i in range(len(layers) - 1)]) self.out = nn.Linear(layers[-1], c) def forward(self, x): x = self.conv1(x) for l,l2 in zip(self.layers, self.layers2): x = l(x) x = l2(x) x = F.adaptive_max_pool2d(x, 1) x = x.view(x.size(0), -1) return F.log_softmax(self.out(x), dim=-1) +``` + +``` + learn = ConvLearner.from_model_data((ConvBnNet2([10, 20, 40, 80, 160], 10), data) +``` + +``` + %time learn.fit(1e-2, 2) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 1.53499 1.43782 0.47588] + [ 1\. 1.28867 1.22616 0.55537] + + CPU times: user 1min 22s, sys: 34.5 s, total: 1min 56s + Wall time: 58.2 s +``` + +``` + %time learn.fit(1e-2, 2, cycle_len=1) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 1.10933 1.06439 0.61582] + [ 1\. 1.04663 0.98608 0.64609] + + CPU times: user 1min 21s, sys: 32.9 s, total: 1min 54s + Wall time: 57.6 s +``` + +The accuracy remained the same as before. This is now 12 layers deep, and it is too deep even for batch norm to handle. It is possible to train 12 layer deep conv net but it starts to get difficult. And it does not seem to be helping much if at all. 
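A quick shape check (a standalone sketch, not part of the notebook) of why alternating stride-1 layers lets the network get twice as deep without shrinking the 32x32 input away:

```
import torch
import torch.nn as nn

x = torch.randn(1, 10, 32, 32)
stride2 = nn.Conv2d(10, 20, kernel_size=3, stride=2, padding=1)  # halves each spatial dim
stride1 = nn.Conv2d(20, 20, kernel_size=3, stride=1, padding=1)  # keeps the size unchanged

print(stride2(x).shape)             # torch.Size([1, 20, 16, 16])
print(stride1(stride2(x)).shape)    # torch.Size([1, 20, 16, 16])
```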
+ +#### ResNet [ [01:52:43](https://youtu.be/H3g26EVADgY%3Ft%3D1h52m43s) ] + +``` + class ResnetLayer (BnLayer): def forward(self, x): return x + super().forward(x) +``` + +``` + class Resnet (nn.Module): def __init__(self, layers, c): super().__init__() self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2) self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1]) for i in range(len(layers) - 1)]) self.layers2 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1) for i in range(len(layers) - 1)]) self.layers3 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1) for i in range(len(layers) - 1)]) self.out = nn.Linear(layers[-1], c) def forward(self, x): x = self.conv1(x) for l,l2,l3 in zip(self.layers, self.layers2, self.layers3): x = l3(l2(l(x))) x = F.adaptive_max_pool2d(x, 1) x = x.view(x.size(0), -1) return F.log_softmax(self.out(x), dim=-1) +``` + +* `ResnetLayer` inherit from `BnLayer` and override `forward` . +* Then add bunch of layers and make it 3 times deeper, ad it still trains beautifully just because of `x + super().forward(x)` . + +``` + learn = ConvLearner.from_model_data(Resnet([10, 20, 40, 80, 160], 10), data) +``` + +``` + wd=1e-5 +``` + +``` + %time learn.fit(1e-2, 2, wds=wd) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 1.58191 1.40258 0.49131] + [ 1\. 1.33134 1.21739 0.55625] + + CPU times: user 1min 27s, sys: 34.3 s, total: 2min 1s + Wall time: 1min 3s +``` + +``` + %time learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2, wds=wd) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 1.11534 1.05117 0.62549] + [ 1\. 1.06272 0.97874 0.65185] + [ 2\. 0.92913 0.90472 0.68154] + [ 3\. 0.97932 0.94404 0.67227] + [ 4\. 0.88057 0.84372 0.70654] + [ 5\. 0.77817 0.77815 0.73018] + [ 6\. 0.73235 0.76302 0.73633] + + CPU times: user 5min 2s, sys: 1min 59s, total: 7min 1s + Wall time: 3min 39s +``` + +``` + %time learn.fit(1e-2, 8, cycle_len=4, wds=wd) +``` + +``` + A Jupyter Widget +``` + +``` + [ 0\. 0.8307 0.83635 0.7126 ] + [ 1\. 0.74295 0.73682 0.74189] + [ 2\. 0.66492 0.69554 0.75996] + [ 3\. 0.62392 0.67166 0.7625 ] + [ 4\. 0.73479 0.80425 0.72861] + [ 5\. 0.65423 0.68876 0.76318] + [ 6\. 0.58608 0.64105 0.77783] + [ 7\. 0.55738 0.62641 0.78721] + [ 8\. 0.66163 0.74154 0.7501 ] + [ 9\. 0.59444 0.64253 0.78106] + [ 10\. 0.53 0.61772 0.79385] + [ 11\. 0.49747 0.65968 0.77832] + [ 12\. 0.59463 0.67915 0.77422] + [ 13\. 0.55023 0.65815 0.78106] + [ 14\. 0.48959 0.59035 0.80273] + [ 15\. 0.4459 0.61823 0.79336] + [ 16\. 0.55848 0.64115 0.78018] + [ 17\. 0.50268 0.61795 0.79541] + [ 18\. 0.45084 0.57577 0.80654] + [ 19\. 0.40726 0.5708 0.80947] + [ 20\. 0.51177 0.66771 0.78232] + [ 21\. 0.46516 0.6116 0.79932] + [ 22\. 0.40966 0.56865 0.81172] + [ 23\. 0.3852 0.58161 0.80967] + [ 24\. 0.48268 0.59944 0.79551] + [ 25\. 0.43282 0.56429 0.81182] + [ 26\. 0.37634 0.54724 0.81797] + [ 27\. 0.34953 0.54169 0.82129] + [ 28\. 0.46053 0.58128 0.80342] + [ 29\. 0.4041 0.55185 0.82295] + [ 30\. 0.3599 0.53953 0.82861] + [ 31\. 0.32937 0.55605 0.82227] + + CPU times: user 22min 52s, sys: 8min 58s, total: 31min 51s + Wall time: 16min 38s +``` + +**ResNet block** [ [01:53:18](https://youtu.be/H3g26EVADgY%3Ft%3D1h53m18s) ] + +`**return** **x + super().forward(x)**` + +_y = x + f(x)_ + +Where _x_ is prediction from the previous layer, _y_ is prediction from the current layer.Shuffle around the formula and we get:formula shuffle + +_f(x) = y − x_ + +The difference _y − x_ is **residual** . The residual is the error in terms of what we have calculated so far. 
What this is saying is that try to find a set of convolutional weights that attempts to fill in the amount we were off by. So in other words, we have an input, and we have a function which tries to predict the error (ie how much we are off by). Then we add a prediction of how much we were wrong by to the input, then add another prediction of how much we were wrong by that time, and repeat that layer after layer — zooming into the correct answer. This is based on a theory called **boosting** . + +* The full ResNet does two convolutions before it gets added back to the original input (we did just one here). +* In every block `x = l3(l2(l(x)))` , one of the layers is not a `ResnetLayer` but a standard convolution with `stride=2` — this is called a “bottleneck layer”. ResNet does not convolutional layer but a different form of bottleneck block which we will cover in Part 2\. + +![](../img/1_0_0J8BFYOTK4Mupk94Izrw.png) + +#### ResNet 2 [ [01:59:33](https://youtu.be/H3g26EVADgY%3Ft%3D1h59m33s) ] + +Here, we increased the size of features and added dropout. + +``` + class Resnet2 (nn.Module): def __init__(self, layers, c, p=0.5): super().__init__() self.conv1 = BnLayer(3, 16, stride=1, kernel_size=7) self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1]) for i in range(len(layers) - 1)]) self.layers2 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1) for i in range(len(layers) - 1)]) self.layers3 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1) for i in range(len(layers) - 1)]) self.out = nn.Linear(layers[-1], c) self.drop = nn.Dropout(p) def forward(self, x): x = self.conv1(x) for l,l2,l3 in zip(self.layers, self.layers2, self.layers3): x = l3(l2(l(x))) x = F.adaptive_max_pool2d(x, 1) x = x.view(x.size(0), -1) x = self.drop(x) return F.log_softmax(self.out(x), dim=-1) +``` + +``` + learn = ConvLearner.from_model_data(Resnet2([ 16, 32, 64, 128, 256 ], 10, 0.2), data) +``` + +``` + wd=1e-6 +``` + +``` + %time learn.fit(1e-2, 2, wds=wd) %time learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2, wds=wd) %time learn.fit(1e-2, 8, cycle_len=4, wds=wd) +``` + +``` + log_preds,y = learn.TTA() preds = np.mean(np.exp(log_preds),0) +``` + +``` + metrics.log_loss(y,preds), accuracy(preds,y) (0.44507397166057938, 0.84909999999999997) +``` + +85% was a state-of-the-art back in 2012 or 2013 for CIFAR 10\. Nowadays, it is up to 97% so there is a room for improvement but all based on these tecniques: + +* Better approaches to data augmentation +* Better approaches to regularization +* Some tweaks on ResNet + +Question [ [02:01:07](https://youtu.be/H3g26EVADgY%3Ft%3D2h1m7s) ]: Can we apply “training on the residual” approach for non-image problem? 是! But it has been ignored everywhere else. In NLP, “transformer architecture” recently appeared and was shown to be the state of the art for translation, and it has a simple ResNet structure in it. This general approach is called “skip connection” (ie the idea of skipping over a layer) and appears a lot in computer vision, but nobody else much seems to be using it even through there is nothing computer vision specific about it. Good opportunity! + +### [Dogs vs. Cats](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson7-CAM.ipynb) [ [02:02:03](https://youtu.be/H3g26EVADgY%3Ft%3D2h2m3s) ] + +Going back dogs and cats. We will create resnet34 (if you are interested in what the trailing number means, [see here](https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py) — just different parameters). 
+ +``` + PATH = "data/dogscats/" sz = 224 arch = resnet34 # <-- Name of the function bs = 64 +``` + +``` + m = arch(pretrained=True) # Get a model w/ pre-trained weight loaded m +``` + +``` + ResNet( + (conv1): Conv2d (3, 64, _kernel_size=(7, 7)_ , stride=(2, 2), padding=(3, 3), bias=False) + (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (maxpool): MaxPool2d(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1)) + ( _layer1_ ): Sequential( + (0): BasicBlock( + (conv1): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True) + _)_ + (1): BasicBlock( + (conv1): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True) + _)_ + (2): BasicBlock( + (conv1): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True) + _)_ _)_ + ( _layer2_ ): Sequential( + (0): BasicBlock( + (conv1): Conv2d (64, 128, kernel_size=(3, 3), _stride=(2, 2)_ , padding=(1, 1), bias=False) + (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + (downsample): Sequential( + (0): Conv2d (64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False) + (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + _)_ _)_ + (1): BasicBlock( + (conv1): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + _)_ + (2): BasicBlock( + (conv1): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + _)_ + (3): BasicBlock( + (conv1): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + (relu): ReLU(inplace) + (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) + (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) + _)_ _)_ +``` + +``` + _..._ +``` + +``` + (avgpool): AvgPool2d(kernel_size=7, stride=7, padding=0, ceil_mode=False, count_include_pad=True) + (fc): Linear(in_features=512, out_features=1000) + _)_ +``` + +Our ResNet model had Relu → BatchNorm. TorchVision does BatchNorm →Relu. 
There are three different versions of ResNet floating around, and the best one is PreAct ( [https://arxiv.org/pdf/1603.05027.pdf](https://arxiv.org/pdf/1603.05027.pdf) ). + +* Currently, the final layer has a thousands features because ImageNet has 1000 features, so we need to get rid of it. +* When you use fast.ai's `ConvLearner` , it deletes the last two layers for you. fast.ai replaces `AvgPool2d` with Adaptive Average Pooling and Adaptive Max Pooling and concatenate the two together. +* For this exercise, we will do a simple version. + +``` + m = nn.Sequential(*children(m)[:-2], nn.Conv2d(512, 2, 3, padding=1), nn.AdaptiveAvgPool2d(1), Flatten(), nn.LogSoftmax()) +``` + +* Remove the last two layers +* Add a convolution which just has 2 outputs. +* Do average pooling then softmax +* There is no linear layer at the end. This is a different way of producing just two numbers — which allows us to do CAM! + +``` + tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1) data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs) +``` + +``` + learn = ConvLearner.from_model_data (m, data) +``` + +``` + learn.freeze_to(-4) +``` + +``` + learn.fit(0.01, 1) learn.fit(0.01, 1, cycle_len=1) +``` + +* `ConvLearner.from_model` is what we learned about earlier — allows us to create a Learner object with custom model. +* Then freeze the layer except the ones we just added. + +#### Class Activation Maps (CAM) [ [02:08:55](https://youtu.be/H3g26EVADgY%3Ft%3D2h8m55s) ] + +We pick a specific image, and use a technique called CAM where we take a model and we ask it which parts of the image turned out to be important. + +![](../img/1_BrMBBupbny4CFsqBVjgcfA.png) + +![](../img/1_zayLvr0jvnUXe-G27odldQ.png) + +How did it do this? Let's work backwards. The way it did it was by producing this matrix: + +![](../img/1_DPIlEiNjJOeAbiIQUubNLg.png) + +Big numbers correspond to the cat. So what is this matrix? This matrix simply equals to the value of feature matrix `feat` times `py` vector: + +``` + f2=np.dot(np.rollaxis( feat ,0,3), py ) f2-=f2.min() f2/=f2.max() f2 +``` + +`py` vector is the predictions that says “I am 100% confident it's a cat.” `feat` is the values (2×7×7) coming out of the final convolutional layer (the `Conv2d` layer we added). If we multiply `feat` by `py` , we get all of the first channel and none of the second channel. Therefore, it is going to return the value of the last convolutional layers for the section which lines up with being a cat. In other words, if we multiply `feat` by `[0, 1]` , it will line up with being a dog. + +``` + sf = SaveFeatures(m[-4]) py = m(Variable(x.cuda())) sf.remove() py = np.exp(to_np(py)[0]); py +``` + +``` + array([ 1., 0.], dtype=float32) +``` + +``` + feat = np.maximum(0, sf.features[0]) feat.shape +``` + +Put it in another way, in the model, the only thing that happened after the convolutional layer was an average pooling layer. The average pooling layer took took the 7 by 7 grid and averaged out how much each part is “cat-like”. We then took the “cattyness” matrix, resized it to be the same size as the original cat image, and overlaid it on top, then you get the heat map. + +The way you can use this technique at home is + +1. when you have a large image, you can calculate this matrix on a quick small little convolutional net +2. zoom into the area that has the highest value +3. re-run it just on that part + +We skipped this over quickly as we ran out of time, but we will learn more about these kind of approaches in Part 2\. 
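As a rough sketch of that resize-and-overlay step (not the notebook's exact code; the stand-in arrays and the use of `scipy.ndimage.zoom` are just for illustration):

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import zoom

sz = 224
cam = np.random.rand(7, 7)         # stand-in for the normalized 7x7 "cattyness" matrix f2
img = np.random.rand(sz, sz, 3)    # stand-in for the de-normalized input image

plt.imshow(img)
plt.imshow(zoom(cam, sz / 7), alpha=0.5, cmap='hot')  # scale 7x7 up to 224x224, draw on top
plt.show()
```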
+ +“Hook” is the mechanism that lets us ask the model to return the matrix. `register_forward_hook` asks PyTorch that every time it calculates a layer it runs the function given — sort of like a callback that happens every time it calculates a layer. In the following case, it saves the value of the particular layer we were interested in: + +``` + class SaveFeatures (): features= None def __init__(self, m): self.hook = m.register_forward_hook(self.hook_fn) def hook_fn(self, module, input, output): self.features = to_np(output) def remove(self): self.hook.remove() +``` + +#### Questions to Jeremy [ [02:14:27](https://youtu.be/H3g26EVADgY%3Ft%3D2h14m27s) ]: “Your journey into Deep Learning” and “How to keep up with important research for practitioners” + +“If you intend to come to Part 2, you are expected to master all the techniques er have learned in Part 1”. Here are something you can do: + +1. Watch each of the video at least 3 times. +2. Make sure you can re-create the notebooks without watching the videos — maybe do so with different datasets to make it more interesting. +3. Keep an eye on the forum for recent papers, recent advances. +4. Be tenacious and keep working at it! diff --git a/zh/dl8.md b/zh/dl8.md new file mode 100644 index 0000000000000000000000000000000000000000..d06e2683cb8a16f44582c9e7222fa06c09351a5f --- /dev/null +++ b/zh/dl8.md @@ -0,0 +1,774 @@ +# 深度学习2:第2部分第8课 + +### 物体检测 + +[**论坛**](http://forums.fast.ai/t/part-2-lesson-8-in-class/13556/1) **/** [**视频**](https://youtu.be/Z0ssNAbe81M) **/** [**笔记本**](https://github.com/fastai/fastai/blob/master/courses/dl2/pascal.ipynb) **/** [**幻灯片**](https://github.com/fastai/fastai/blob/master/courses/dl2/ppt/lesson8.pptx) + +#### **我们在第1部分[** [**02:00**](https://youtu.be/Z0ssNAbe81M%3Ft%3D2m) **]中介绍的内容** + +![](../img/1_EDzZucEfAL2aeRCZKkdkag.png) + +**可区分层[** [**02:11**](https://youtu.be/Z0ssNAbe81M%3Ft%3D2m11s) **]** + +![](../img/1_SFx5Gwk9KRuOZRgiZTuvbg.png) + +Yann LeCun一直在宣传我们并不称之为“深度学习”,而是称之为“差异化编程”。 我们在第1部分中所做的只是设置一个可微分函数和一个描述参数有多好的损失函数,然后按下go并使其工作。 如果你可以配置一个损失函数来评估你的任务有多好,那么你就拥有了一个相当灵活的神经网络架构,你就完成了。 + +> 是的,可分辨编程只不过是对现代集合深度学习技术的重塑,就像深度学习是对具有两层以上神经网络的现代化身的重塑一样。 + +> 重要的是,人们现在正在通过组装参数化功能块的网络来构建一种新型软件,并通过使用某种形式的基于梯度的优化来训练它们。它实际上非常像常规程序,除了它的参数化,自动区分,可训练/可优化。 + +> - [FAIR主任Yann LeCun](https://www.facebook.com/yann.lecun/posts/10155003011462143) + +**2.转学习[** [**03:23**](https://youtu.be/Z0ssNAbe81M%3Ft%3D3m23s) **]** + +![](../img/1_sDycvgDmfivum0HQhHpOuA.png) + +转移学习是有效使用深度学习能够做的最重要的事情。 你几乎永远不会想要或者不需要从随机权重开始,除非没有人曾经在模糊的类似数据集上训练模型,甚至远程连接的问题要像你正在做的那样解决 - 这几乎从未发生过。 Fastai图书馆专注于转移学习,这使其与其他图书馆不同。 转学习的基本思想是: + +* 给定一个做事物A的网络,删除最后一层。 +* 最后用几个随机层替换它 +* 在利用原始网络学习的功能的同时,微调这些层以执行事物B. 
+* 然后可选择地对整个事物进行端到端微调,你现在可能会使用数量级更少的数据,更准确,并且训练速度更快。 + +**3.建筑设计[** [**05:17**](https://youtu.be/Z0ssNAbe81M%3Ft%3D5m17s) **]** + +![](../img/1_Dn8YbBY47oaDWG9KwEQkvw.png) + +有一小部分架构通常在很多时候都能很好地运行。 我们一直专注于将CNN用于通​​常固定大小的有序数据,RNN用于具有某种状态的序列。 我们还通过激活函数摆弄了一小部分 - 如果您有单一的分类结果,则为softmax;如果您有多个结果,则为sigmoid。 我们将在第2部分中学习的一些架构设计变得更有趣。 特别是关于对象检测的第一个会话。 但总的来说,我们花在讨论架构设计上的时间可能比大多数课程或论文少,因为它通常不是很难。 + +**4.处理过度贴合[** [**06:26**](https://youtu.be/Z0ssNAbe81M%3Ft%3D6m26s) **]** + +![](../img/1_Fg2M4xw2F2f1jZNOCg5Cpg.png) + +Jeremy喜欢建立模型的方式: + +* 创造一些绝对非常过度参数化的东西,它肯定会大量过度装备,训练它并确保它过度适应。 那时,你有一个能够反映训练集的模型。 然后就像做这些事情一样简单,以减少过度拟合。 + +如果你没有从过度拟合的东西开始,你就会迷失方向。 所以你先从过度拟合的东西开始,然后减少过度装备,你可以: + +* 添加更多数据 +* 添加更多数据扩充 +* 做更多批处理规范图层,密集网络或可以处理更少数据的各种事情。 +* 添加正规化,如重量衰减和辍学 +* 终于(这通常是人们先做的事情,但这应该是你最后做的事情)降低架构的复杂性。 具有较少的层数或较少的激活。 + +**5.嵌入[** [**07:46**](https://youtu.be/Z0ssNAbe81M%3Ft%3D7m46s) **]** + +![](../img/1_TGNCaF5RGYO8iylV43oSSg.png) + +我们已经谈了很多关于嵌入的内容 - 对于NLP和任何类型的分类数据的一般概念,你现在可以使用神经网络进行建模。 就在今年早些时候,几乎没有关于在深度学习中使用表格数据的例子,但是使用神经网络进行时间序列和表格数据分析正变得越来越流行。 + +#### **第1部分至第2部分[** [**08:54**](https://youtu.be/Z0ssNAbe81M%3Ft%3D8m54s) **]** + +![](../img/1_ZlspbXQEsEpBIgNqaq3KUg.png) + +第1部分真的是关于引入深度学习的最佳实践。 我们看到了足够成熟的技术,它们确实可以合理可靠地解决实际的现实问题。 Jeremy在相当长的一段时间内进行了充分的研究和调整,提出了一系列步骤,架构等,并以我们能够快速轻松地完成的方式将它们放入fastai库中。 + +第2部分是编码员的尖端深度学习,这意味着Jeremy通常不知道确切的最佳参数,架构细节等等来解决特定问题。 我们不一定知道它是否能够很好地解决问题,实际上是有用的。 它几乎肯定不会很好地集成到fastai或任何其他库中,你可以按几个按钮,它将开始工作。 Jeremy不会教它,除非他非常确信它现在或将很快是非常实用的技术。 但它需要经常进行大量调整并尝试使其能够解决您的特定问题,因为我们不了解细节以了解如何使其适用于每个数据集或每个示例。 + +这意味着而不是Fastai和PyTorch是你只知道这些食谱的晦涩的黑盒子,你将很好地了解它们的细节,你可以按照你想要的方式自定义它们,你可以调试它们,你可以阅读他们的源代码,看看发生了什么。 如果您对面向对象的Python没有信心,那么您希望在本课程中专注于学习,因为我们不会在课堂上介绍它。 但Jeremy将介绍一些他认为特别有用的工具,如Python调试器,如何使用编辑器跳过代码。 总的来说,将会有更详细和具体的代码演练,编码技术讨论,以及更详细的论文演练。 + +请注意示例代码[ [13:20](https://youtu.be/Z0ssNAbe81M%3Ft%3D13m20s) ]! 代码学者已经提出了与其他人在github上编写的论文或示例代码,Jeremy几乎总是发现存在一些巨大的关键缺陷,所以要小心从在线资源中获取代码并准备好进行一些调试。 + +**如何使用笔记本[** [**14:17**](https://youtu.be/Z0ssNAbe81M%3Ft%3D14m17s) **]** + +![](../img/1_Ank7Dub7DwvSlozpLr1Lqg.png) + +![](../img/1_GyRVknri5gUktxgDmnxo4A.png) + +<figcaption class="imageCaption">建立自己的盒子[ [16:50](https://youtu.be/Z0ssNAbe81M%3Ft%3D16m50s) ]</figcaption> + + + +![](../img/1__r-uV41M5zUGTdV26N9bsg.png) + +<figcaption class="imageCaption">阅读论文[ [21:37](https://youtu.be/Z0ssNAbe81M%3Ft%3D21m37s) ]</figcaption> + + + +每周,我们将实施一两篇论文。 左边是实现adam的纸张摘录(您还在电子表格中看到adam是一个excel公式)。 在学术论文中,人们喜欢使用希腊字母。 他们也讨厌重构,所以你经常会看到一个页面长的公式,当你仔细观察它时,你会发现相同的子方程出现了8次。 学术论文有点奇怪,但最终,它是研究界传达他们的发现的方式,所以我们需要学习阅读它们。 一件好事是拿一篇论文,努力去理解它,然后写一个博客,用代码和普通英语解释它。 许多这样做的人最终获得了相当多的关注,最终获得了一些相当不错的工作机会等等,因为这是一项非常有用的技能,能够表明您可以理解这些论文,在代码中实现它们并解释他们用英语。 很难阅读或理解你无法发声的东西。 所以学习希腊字母! 
+ +![](../img/1_LBOcbbeBFypTgQ2AOit1EQ.png) + +<figcaption class="imageCaption">更多机会[ [25:29](https://youtu.be/Z0ssNAbe81M%3Ft%3D25m29s) ]</figcaption> + + + +![](../img/1_cNNnbJwImpFbqSKdA5_RIQ.png) + +<figcaption class="imageCaption">第2部分的主题[ [30:12](https://youtu.be/Z0ssNAbe81M%3Ft%3D30m12s) ]</figcaption> + + + +**生成模型** + +在第1部分中,我们的神经网络的输出通常是一个数字或类别,在其他地方,第2部分中很多东西的输出将是很多东西,如: + +* 图像中每个对象的左上角和右下角位置以及该对象的位置 +* 完整的图片,包含该图片中每个像素的类别 +* 增强的输入图像的超分辨率版本 +* 将整个原始输入段翻译成法语 + +我们将要查看的大部分数据将是文本或图像数据。 + +我们将根据数据集中的对象数量和每个对象的大小来查看一些较大的数据集。 对于那些使用有限计算资源的人,请不要让它让你失望。 随意用更小更简单的东西替换它。 杰里米实际上写了大量的课程,没有互联网(在狮子座点)15英寸的表面书。 几乎所有这些课程都适用于笔记本电脑上的Windows。 您始终可以使用较小的批量大小,数据集的缩减版本。 但是,如果您拥有这些资源,那么当数据集可用时,您将获得更好的结果。 + +* * * + +#### 物体检测[ [35:32](https://youtu.be/Z0ssNAbe81M%3Ft%3D35m32s) ] + +![](../img/1_EGZrr-rVX29Ot1j20i116w.png) + +与我们习惯的两个主要区别: + +**我们有很多东西要分类。** + +这不是闻所未闻的 - 我们在第1部分的行星卫星数据中做到了这一点。 + +**2.围绕我们分类的边界框。** + +边界框有一个非常具体的定义,它是一个矩形,矩形的对象完全适合它,但它不比它必须大。 + +我们的工作是获取以这种方式标记的数据和未标记的数据,以生成对象的类和每个对象的边界框。 需要注意的一点是,标记此类数据通常更为昂贵[ [37:09](https://youtu.be/Z0ssNAbe81M%3Ft%3D37m09s) ]。 对于对象检测数据集,给注释器一个对象类列表,并要求它们查找图片中任何类型的每一个以及它们的位置。 在这种情况下,为什么没有标记树或跳跃? 这是因为对于这个特定的数据集,它们不是要求注释者找到的类之一,因此不是这个特定问题的一部分。 + +#### 阶段[ [38:33](https://youtu.be/Z0ssNAbe81M%3Ft%3D38m33s) ]: + +1. 对每个图像中的最大对象进行分类。 +2. 找到每个图像的最大对象的位置。 +3. 最后,我们将同时尝试两者(即标记它是什么以及图片中最大对象的位置)。 + +![](../img/1_RAxYMkvF3zHFe_cst52STA.png) + +#### [帕斯卡笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/pascal.ipynb) [ [40:06](https://youtu.be/Z0ssNAbe81M%3Ft%3D40m06s) ] + +``` + %matplotlib inline %reload_ext autoreload %autoreload 2 +``` + +``` + from fastai.conv_learner import * from fastai.dataset import * +``` + +``` + from pathlib import Path import json from PIL import ImageDraw, ImageFont from matplotlib import patches, patheffects # torch.cuda.set_device(1) +``` + +您可能会发现留下一行`torch.cuda.set_device(1)` ,如果您只有一个GPU,则会出现错误。 这是您在拥有多个GPU时选择GPU的方式,因此只需将其设置为零或完全取出该行即可。 + +有许多标准物体检测数据集,就像ImageNet是标准物体分类数据集[ [41:12](https://youtu.be/Z0ssNAbe81M%3Ft%3D41m12s) ]。 经典的ImageNet等效物是Pascal VOC。 + +#### 帕斯卡VOC + +我们将查看[Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/)数据集。 这很慢,所以你可能更喜欢从[这个镜像](https://pjreddie.com/projects/pascal-voc-dataset-mirror/)下载。 从2007年到2012年,有两个不同的竞争/研究数据集。我们将使用2007版本。 您可以使用更大的2012来获得更好的结果,甚至可以将它们组合起来[ [42:25](https://youtu.be/Z0ssNAbe81M%3Ft%3D42m25s) ](但是如果这样做,请注意避免验证集之间的数据泄漏)。 + +与之前的课程不同,我们使用python 3标准库`pathlib`来实现路径和文件访问。 请注意,它返回特定于操作系统的类(在Linux上, `PosixPath` ),因此您的输出可能看起来有点不同[ [44:50](https://youtu.be/Z0ssNAbe81M%3Ft%3D44m50s) ]。 将路径作为输入的大多数库可以采用pathlib对象 - 尽管有些(如`cv2` )不能,在这种情况下,您可以使用`str()`将其转换为字符串。 + +[Pathlib备忘单](http://pbpython.com/pathlib-intro.html) + +``` + PATH = Path('data/pascal') list(PATH.iterdir()) +``` + +``` + _[PosixPath('data/pascal/PASCAL_VOC.zip'),_ _PosixPath('data/pascal/VOCdevkit'),_ _PosixPath('data/pascal/VOCtrainval_06-Nov-2007.tar'),_ _PosixPath('data/pascal/pascal_train2012.json'),_ _PosixPath('data/pascal/pascal_val2012.json'),_ _PosixPath('data/pascal/pascal_val2007.json'),_ _PosixPath('data/pascal/pascal_train2007.json'),_ _PosixPath('data/pascal/pascal_test2007.json')]_ +``` + +**关于发电机[** [**43:23**](https://youtu.be/Z0ssNAbe81M%3Ft%3D43m23s) **]:** + +生成器是Python 3中可以迭代的东西。 + +* `for i in PATH.iterdir(): print(i)` +* `[i for i in PATH.iterdir()]` (列表理解) +* `list(PATH.iterdir())` (将生成器转换为列表) + +事物通常返回生成器的原因是,如果目录中有1000万个项目,则不一定需要1000万个列表。 Generator让你“懒洋洋地”做事。 + +#### 加载注释 + +除了图像之外,还有_注释_ - 显示每个对象所在位置的_边界框_ 。 这些是手工贴上的。 原始版本采用XML [ [47:59](https://youtu.be/Z0ssNAbe81M%3Ft%3D47m59s) 
],现在有点[难以使用](https://youtu.be/Z0ssNAbe81M%3Ft%3D47m59s) ,因此我们使用了最新的JSON版本,您可以从此[链接](https://storage.googleapis.com/coco-dataset/external/PASCAL_VOC.zip)下载。 + +您可以在此处看到`pathlib`如何包含打开文件的功能(以及许多其他功能)。 + +``` + trn_j = json.load((PATH/'pascal_train2007.json').open()) trn_j.keys() +``` + +``` + _dict_keys(['images', 'type', 'annotations', 'categories'])_ +``` + +这里`/`不是除以它是路径斜线[ [45:55](https://youtu.be/Z0ssNAbe81M%3Ft%3D45m55s) ]。 `PATH/`让你的孩子走在那条路上。 `PATH/'pascal_train2007.json'`返回一个具有`open`方法的`pathlib`对象。 此JSON文件不包含图像,而是包含边界框和对象的类。 + +``` + IMAGES,ANNOTATIONS,CATEGORIES = ['images', 'annotations', 'categories'] +``` + +``` + **trn_j[IMAGES]** [:5] +``` + +``` + _[{'file_name': '000012.jpg', 'height': 333, 'id': 12, 'width': 500}, {'file_name': '000017.jpg', 'height': 364, 'id': 17, 'width': 480}, {'file_name': '000023.jpg', 'height': 500, 'id': 23, 'width': 334}, {'file_name': '000026.jpg', 'height': 333, 'id': 26, 'width': 500}, {'file_name': '000032.jpg', 'height': 281, 'id': 32, 'width': 500}]_ +``` + +#### 注释[ [49:16](https://youtu.be/Z0ssNAbe81M%3Ft%3D49m16s) ] + +* `bbox` :列,行(左上角),高度,宽度 +* `image_id` :你可以用`trn_j[IMAGES]` (上面)加入这个来查找`file_name`等。 +* `category_id` :见`trn_j[CATEGORIES]` (下) +* `segmentation` :多边形分割(我们将使用它们) +* `ignore` :我们将忽略忽略标志 +* `iscrowd` :指定它是该对象的一群,而不仅仅是其中一个 + +``` + **trn_j[ANNOTATIONS]** [:2] +``` + +``` + _[{'area': 34104,_ _'bbox': [155, 96, 196, 174],_ _'category_id': 7,_ _'id': 1,_ _'ignore': 0,_ _'image_id': 12,_ _'iscrowd': 0,_ _'segmentation': [[155, 96, 155, 270, 351, 270, 351, 96]]},_ _{'area': 13110,_ _'bbox': [184, 61, 95, 138],_ _'category_id': 15,_ _'id': 2,_ _'ignore': 0,_ _'image_id': 17,_ _'iscrowd': 0,_ _'segmentation': [[184, 61, 184, 199, 279, 199, 279, 61]]}]_ +``` + +#### 分类[ [50:15](https://youtu.be/Z0ssNAbe81M%3Ft%3D50m15s) ] + +``` + **trn_j[CATEGORIES]** [:4] +``` + +``` + _[{'id': 1, 'name': 'aeroplane', 'supercategory': 'none'},_ _{'id': 2, 'name': 'bicycle', 'supercategory': 'none'},_ _{'id': 3, 'name': 'bird', 'supercategory': 'none'},_ _{'id': 4, 'name': 'boat', 'supercategory': 'none'}]_ +``` + +使用常量而不是字符串是有帮助的,因为我们得到制表符并且不会输错。 + +``` + FILE_NAME,ID,IMG_ID,CAT_ID,BBOX = 'file_name','id','image_id','category_id','bbox' +``` + +``` + cats = dict((o[ID], o['name']) for o in trn_j[CATEGORIES]) trn_fns = dict((o[ID], o[FILE_NAME]) for o in trn_j[IMAGES]) trn_ids = [o[ID] for o in trn_j[IMAGES]] +``` + +[**旁注**](https://youtu.be/Z0ssNAbe81M%3Ft%3D51m21s) **:** **当人们看到杰里米在看过他的班级时实时工作时,人们最评论的是什么[** [**51:21**](https://youtu.be/Z0ssNAbe81M%3Ft%3D51m21s) **]:** + +“哇,你其实不知道你在做什么,对吗”。 他所做的99%的事情都不起作用,而有效的事情的一小部分最终会在这里结束。 他之所以提到这一点,是因为机器学习,特别是深度学习令人难以置信的沮丧[ [51:45](https://youtu.be/Z0ssNAbe81M%3Ft%3D51m45s) ]。 从理论上讲,您只需定义正确的损失函数和足够灵活的架构,然后按下列车即可完成。 但如果真的那么多,那么任何事情都不会花费任何时间。 问题是沿途的所有步骤直到它工作,它不起作用。 就像它直接进入无限,崩溃时张力大小不正确等等。他会努力向你展示一些调试技巧,但它是最难教的东西之一。 它需要的主要是坚韧。 超级有效的人和那些似乎没有走得太远的人之间的区别从来都不是智力。 它始终是坚持它 - 基本上永不放弃。 这种深度学习尤为重要,因为你没有得到连续的奖励周期[ [53:04](https://youtu.be/Z0ssNAbe81M%3Ft%3D53m04s) ]。 这是一个不变的工作,不起作用,不起作用,直到最终它这样做它有点烦人。 + +#### 我们来看看图像[ [53:45](https://youtu.be/Z0ssNAbe81M%3Ft%3D53m45s) ] + +``` + list((PATH/'VOCdevkit'/'VOC2007').iterdir()) +``` + +``` + _[PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/SegmentationObject'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/ImageSets'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/SegmentationClass'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/Annotations')]_ +``` + +``` + JPEGS = 'VOCdevkit/VOC2007/JPEGImages' +``` + +``` + IMG_PATH = PATH/JPEGS 
list(IMG_PATH.iterdir())[:5] +``` + +``` + _[PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/007594.jpg'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/005682.jpg'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/005016.jpg'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/001930.jpg'),_ _PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/007666.jpg')]_ +``` + +#### 创建字典(关键字:图像ID,值:注释)[ [54:16](https://youtu.be/Z0ssNAbe81M%3Ft%3D54m16s) ] + +每张图片都有一个唯一的ID。 + +``` + im0_d = trn_j[IMAGES][0] im0_d[FILE_NAME],im0_d[ID] +``` + +``` + ('000012.jpg', 12) +``` + +只要你想拥有新密钥的默认字典条目[ [55:05](https://youtu.be/Z0ssNAbe81M%3Ft%3D55m05s) ], [defaultdict就很有用](https://youtu.be/Z0ssNAbe81M%3Ft%3D55m05s) 。 如果您尝试访问不存在的键,它会神奇地使其自身存在,并且它将自身设置为等于您指定的函数的返回值(在本例中为`lambda:[]` )。 + +在这里,我们创建一个从图像ID到注释列表的dict(边界框和类ID的元组)。 + +我们将VOC的高度/宽度转换为左上/右下,并将x / y坐标切换为与numpy一致。 如果给定的数据集是蹩脚的格式,请花一些时间使事情保持一致,并按照您希望的方式制作它们[ [1:01:24](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h1m24s) ] + +``` + trn_anno = **collections.defaultdict** (lambda:[]) for o in trn_j[ANNOTATIONS]: if not o['ignore']: bb = o[BBOX] bb = np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1]) trn_anno[o[IMG_ID]] **.append** ((bb,o[CAT_ID])) len(trn_anno) +``` + +``` + _2501_ +``` + +**变量命名,编码风格哲学等[** [**56:15**](https://youtu.be/Z0ssNAbe81M%3Ft%3D56m15s) **-** [**59:33**](https://youtu.be/Z0ssNAbe81M%3Ft%3D59m33s) **]** + +**例1** + +* `[ 96, 155, 269, 350]` [96,155,269,350](https://youtu.be/Z0ssNAbe81M%3Ft%3D59m53s) ]:一个边界框[ [59:53](https://youtu.be/Z0ssNAbe81M%3Ft%3D59m53s) ]。 如上所述,当我们创建边界框时,我们做了几件事。 首先是我们切换x和y坐标。 其原因在于计算机视觉领域,当你说“我的屏幕是640×480”时,它是高度的宽度。 或者,数学世界,当你说“我的数组是640乘480”时,它是逐列的。 所以枕头图像库倾向于按宽度或逐行逐行进行处理,而numpy则是相反的方式。 第二个是我们要通过描述左上角xy坐标和右下角xy坐标来做事情 - 而不是x,y,高度,宽度。 +* `7` :班级标签/类别 + +``` + im0_a = im_a[0]; im0_a +``` + +``` + _[(array(_ **_[ 96, 155, 269, 350]_** _),_ **_7_** _)]_ +``` + +``` + im0_a = im_a[0]; im0_a +``` + +``` + _(array([ 96, 155, 269, 350]), 7)_ +``` + +``` + cats[7] +``` + +``` + _'car'_ +``` + +**例2** + +``` + trn_anno[17] +``` + +``` + _[(array([61, 184, 198, 278]), 15), (array([77, 89, 335, 402]), 13)]_ +``` + +``` + cats[15],cats[13] +``` + +``` + _('person', 'horse')_ +``` + +有些lib采用VOC格式的边界框,所以这让我们在需要时转换回来[ [1:02:23](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h2m23s) ]: + +``` + def bb_hw(a): return np.array([a[1],a[0],a[3]-a[1],a[2]-a[0]]) +``` + +我们将使用fast.ai的`open_image`来显示它: + +``` + im = open_image(IMG_PATH/im0_d[FILE_NAME]) +``` + +#### **集成开发环境(IDE)简介[** [**1:03:13**](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h3m13s) **]** + +您可以使用[Visual Studio Code](https://code.visualstudio.com/) (vscode - 最新版本的Anaconda附带的开源编辑器,或者可以单独安装),或者大多数编辑器和IDE,来查找有关`open_image`函数的所有信息。 vscode要知道的事情: + +* 命令调色板( `Ctrl-shift-p` ) +* 选择口译员(适用于fastai env) +* 选择终端shell +* 转到符号( `Ctrl-t` ) +* 查找参考文献( `Shift-F12` ) +* 转到定义( `F12` ) +* 回去( `alt-left` ) +* 查看文档 +* 隐藏侧边栏( `Ctrl-b` ) +* 禅模式( `Ctrl-k,z` ) + +如果你像我一样在Mac上使用PyCharm专业版: + +* 命令调色板( `Shift-command-a` ) +* 选择解释器(用于fastai env)( `Shift-command-a`然后查找“interpreter”) +* 选择终端外壳( `Option-F12` ) +* 转到符号( `Option-command-shift-n`并键入类,函数等的名称。如果它在camelcase或下划线中分隔,则可以键入每个位的前几个字母) +* 查找引用( `Option-F7` `Option-command-⬇︎` ),下一次出现( `Option-command-⬇︎` ),上一次出现( `Option-command-⬆︎` ) +* 转到定义( `Command-b` ) +* 返回( `Option-command-⬅︎` ) +* 查看文档 +* Zen模式( `Control-`-4–2`或搜索“distraction free mode”) + +#### 我们来谈谈open_image [ [1:10:52](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h10m52s) ] + +Fastai使用OpenCV。 TorchVision使用PyTorch张量进行数据增强等。许多人使用Pillow `PIL` 。 Jeremy对所有这些进行了大量测试,他发现OpenCV比TorchVision快5到10倍。 
对于[行星卫星图像竞争](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space) [ [1:11:55](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h11m55s) ],TorchVision速度[太快](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h11m55s) ,因为他们只进行了大量的数据增加,因此只能获得25%的GPU利用率。 Profiler显示它全部在TorchVision中。 + +枕头速度要快得多,但它不如OpenCV快,而且也不像线程安全[ [1:12:19](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h12m19s) ]。 Python有这个称为全局解释器锁(GIL)的东西,这意味着两个线程不能同时做pythonic事情 - 这使得Python成为现代编程的一种糟糕的语言,但我们坚持使用它。 OpenCV发布了GIL。 fast.ai库如此之快的原因之一是因为它不像其他库那样使用多个处理器来进行数据扩充 - 它实际上是多线程的。 它可以做多线程的原因是因为它使用OpenCV。 不幸的是,OpenCV有一个难以理解的API,文档有点迟钝。 这就是杰里米试图做到这一点的原因,以至于没有人使用fast.ai需要知道它正在使用OpenCV。 您无需知道要打开图像的标志。 您不需要知道如果读取失败,则不会显示异常 - 它会以静默方式返回`None` 。 + +![](../img/1_afXUCCfpzM6E1anLo8bKxA.png) + +不要开始使用PyTorch进行数据扩充或开始引入Pillow - 你会发现突然发生的事情变得非常缓慢或多线程不再起作用。 您应该坚持使用OpenCV进行处理[ [1:14:10](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h14m10s) ] + +#### 更好地使用Matplotlib [ [1:14:45](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h14m45s) ] + +Matplotlib之所以如此命名是因为它最初是Matlab绘图库的克隆。 不幸的是,Matlab的绘图库并不是很好,但当时,这是每个人都知道的。 在某些时候,matplotlib人员意识到并添加了第二个API,这是一个面向对象的API。 不幸的是,因为最初学习matplotlib的人没有学过OO API,所以他们教会了下一代人的旧Matlab风格的API。 现在没有很多示例或教程使用更好,更容易理解和更简单的OO API。 因为绘图在深度学习中非常重要,所以我们将在本课程中学习的内容之一是如何使用此API。 + +**技巧1:plt.subplots [** [**1:16:00**](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h16m) **]** + +无论您是否有多`plt.subplots`图,Matplotlib的`plt.subplots`都是创建图的非常有用的包装器。 请注意,Matplotlib有一个可选的面向对象的API,我认为它更容易理解和使用(尽管很少有在线使用它的例子!) + +``` + def show_img(im, figsize=None, ax=None): if not ax: fig,ax = plt.subplots(figsize=figsize) ax.imshow(im) ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) return ax +``` + +它返回两件事 - 你可能不关心第一件事(图对象),第二件是Axes对象(或者它们的数组)。 基本上你曾经说过`plt.`任何地方`plt.` 什么,你现在说`ax.` 什么,它现在将绘制到特定的子图。 当您想要绘制多个绘图以便可以相互比较时,这非常有用。 + +**技巧2:无论背景颜色如何都可见文字[** [**1:17:59**](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h17m59s) **]** + +无论背景如何,使文本可见的简单但很少使用的技巧是使用带有黑色轮廓的白色文本,反之亦然。 这是在matplotlib中如何做到这一点。 + +``` + def draw_outline(o, lw): o.set_path_effects([patheffects.Stroke( linewidth=lw, foreground='black'), patheffects.Normal()]) +``` + +请注意,参数列表中的`*`是[splat运算符](https://stackoverflow.com/questions/5239856/foggy-on-asterisk-in-python) 。 在这种情况下,与写出`b[-2],b[-1]`相比,这是一个小捷径。 + +``` + def draw_rect(ax, b): patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill=False, edgecolor='white', lw=2)) draw_outline(patch, 4) +``` + +``` + def draw_text(ax, xy, txt, sz=14): text = ax.text(*xy, txt, verticalalignment='top', color='white', fontsize=sz, weight='bold') draw_outline(text, 1) +``` + +``` + ax = show_img(im) b = bb_hw(im0_a[0]) draw_rect(ax, b) draw_text(ax, b[:2], cats[im0_a[1]]) +``` + +![](../img/1_qAHYi8J1TjnAtuQ9IGGIYg.png) + +**打包它[** [**1:21:20**](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h21m20s) **]** + +``` + **def** draw_im(im, ann): ax = show_img(im, figsize=(16,8)) **for** b,c **in** ann: b = bb_hw(b) draw_rect(ax, b) draw_text(ax, b[:2], cats[c], sz=16) +``` + +``` + **def** draw_idx(i): im_a = trn_anno[i] im = open_image(IMG_PATH/trn_fns[i]) print(im.shape) draw_im(im, im_a) +``` + +``` + draw_idx(17) +``` + +![](../img/1_QrfruESADB4Fsbl1cvD8vA.png) + +当您使用新数据集时,达到可以快速探索它的点是值得的。 + +### 最大的物品分类[ [1:22:57](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h22m57s) ] + +不要试图一下子解决所有问题,而是让我们不断进步。 我们知道如何找到每个图像中最大的对象并对其进行分类,所以让我们从那里开始。 Jeremy每天参加Kaggle比赛的时间是半小时[ [1:24:00](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h24m) ]。 在那个半小时结束时,提交一些东西并尝试使它比昨天好一点。 + +我们需要做的第一件事是遍历图像中的每个边界框并获得最大的边界框。 _lambda函数_只是一种定义内联匿名函数的方法。 在这里,我们用它来描述如何为每个图像排序注释 - 通过限制框大小(降序)。 + +我们从右下角减去左上角并乘以( `np.product` )值得到一个区域`lambda x: np.product(x[0][-2:]-x[0][:2])` 。 + 
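用前面出现过的那个边界框 `[ 96, 155, 269, 350]` 简单验算一下这个 lambda（一个最小示意，数值仅供演示）：

```
import numpy as np

bb = np.array([96, 155, 269, 350])   # 左上(行,列)/右下(行,列) 格式
hw = bb[-2:] - bb[:2]                # [269-96, 350-155] = [173, 195]，即高和宽
area = np.product(hw)                # 173 * 195 = 33735（新版 NumPy 中写作 np.prod）
print(area)
```

下面的 `get_lrg` 就是按这个面积从大到小排序，然后取第一个。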
+``` + **def** get_lrg(b): if not b: raise Exception() b = sorted(b, key=lambda x: np.product(x[0][-2:]-x[0][:2]), reverse=True) **return** b[0] +``` + +**字典理解[** [**1:27:04**](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h27m04s) **]** + +``` + trn_lrg_anno = {a: get_lrg(b) for a,b in trn_anno.items()} +``` + +现在我们有一个从图像id到单个边界框的字典 - 这个图像的最大值。 + +``` + b,c = trn_lrg_anno[23] b = bb_hw(b) ax = show_img(open_image(IMG_PATH/trn_fns[23]), figsize=(5,10)) draw_rect(ax, b) draw_text(ax, b[:2], cats[c], sz=16) +``` + +![](../img/1_ncFID5QGdWrtpeIgVVhLeA.png) + +当你有任何处理管道[ [1:28:01](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h28m1s) ]时,你需要查看每个阶段。 假设你第一次做的一切都是错的。 + +``` + (PATH/'tmp').mkdir(exist_ok=True) CSV = PATH/'tmp/lrg.csv' +``` + +通常,最简单的方法是简单地创建要建模的数据的CSV,而不是尝试创建自定义数据集[ [1:29:06](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h29m06s) ]。 在这里,我们使用Pandas来帮助我们创建图像文件名和类的CSV。 `columns=['fn','cat']`因为字典没有订单而且列的顺序很重要。 + +``` + df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 'cat': [cats[trn_lrg_anno[o][1]] for o in trn_ids]}, columns=['fn','cat']) df.to_csv(CSV, index=False) +``` + +``` + f_model = resnet34 sz=224 bs=64 +``` + +从这里开始就像Dogs vs Cats! + +``` + tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_side_on, crop_type= **CropType.NO** ) md = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms) +``` + +#### **我们来看看[** [**1:30:48**](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h30m48s) **]** + +有一点不同的是`crop_type` 。 在fast.ai中创建224 x 224图像的默认策略是首先调整它的大小,使最小边为224.然后在训练期间采用随机平方裁剪。 在验证期间,除非我们使用数据扩充,否则我们采用中心作物。 + +对于边界框,我们不希望这样做,因为与图像网不同,我们关心的东西几乎在中间并且非常大,对象检测中的很多东西都很小并且接近边缘。 通过将`crop_type`设置为`CropType.NO` ,它将不会裁剪,因此,为了使其成为正方形,它会使它[ [1:32:09](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h32m9s) ]。 一般来说,如果你裁剪而不是挤压,许多计算机视觉模型的效果会好一点,但是如果你压扁它们仍然可以很好地工作。 在这种情况下,我们绝对不想裁剪,所以这完全没问题。 + +``` + x,y=next(iter(md.val_dl)) show_img(md.val_ds.denorm(to_np(x))[0]); +``` + +![](../img/1_bTBWgFXrJYD7sKtiPnCt5g.png) + +#### 数据加载器[ [1:33:04](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h33m4s) ] + +您已经知道在模型数据对象内部,我们有很多东西,包括训练数据加载器和训练数据集。 关于数据加载器的主要知识是它是一个迭代器,每次你从中获取下一个东西的迭代时,你得到一个迷你批处理。 您获得的迷你批量是您要求的任何大小,默认情况下批量大小为64.在Python中,您从迭代器中获取下一个东西的方式是下`next(md.trn_dl)`但您不能只做那。 你不能说这是因为你需要说“现在开始一个新纪元”。 通常,不仅在PyTorch中,而且对于任何Python迭代器,您都需要说“请从序列的开始处开始”。 你这么说就是使用`iter(md.trn_dl)`来获取`iter(md.trn_dl)`一个迭代器 - 特别是我们稍后会学到的,这意味着这个类必须定义一个`__iter__`方法,它返回一些不同的对象,然后有一个`__next__`方法。 + +如果你想只抓一个批处理,你就是这样做的( `x` :自变量, `y` :因变量): + +``` + x,y=next(iter(md.val_dl)) +``` + +我们不能直接发送到`show_image` [ [1:35:30](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h35m30s) ]。 例如, `x`不是一个numpy数组,不是在CPU上,并且形状都是错误的( `3x224x224` )。 此外,它们不是介于0和1之间的数字,因为所有标准的ImageNet预训练模型都希望我们的数据被归一化为零均值和1个标准差。 + +![](../img/1_CbjuSpn8ZnX6SMLNiBzoag.png) + +如您所见,已经对输入做了大量事情,以便将其传递给预先训练的模型。 因此我们有一个名为`denorm` for denormalize的函数,并且还修复了维度顺序等等。因为非规范化取决于变换[ [1:37:52](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h37m52s) ],并且数据集知道使用什么变换来创建它,所以这就是你必须要做的事情`md.val_ds.denorm`并将其转换为numpy数组后传递小批量: + +``` + show_img(md.val_ds.denorm(to_np(x))[0]); +``` + +#### 使用ResNet34进行培训[ [1:38:36](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h38m36s) ] + +``` + learn = ConvLearner.pretrained(f_model, md, metrics=[accuracy]) learn.opt_fn = optim.Adam +``` + +``` + lrf=learn.lr_find(1e-5,100) learn.sched.plot() +``` + +![](../img/1_oZe5esLqorSwfDyN9ld3Rw.png) + +我们故意删除前几个点和最后几个点[ [1:38:54](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h38m54s) ],因为通常最后几个点向无限远射高,你看不到任何东西,所以这通常是一个好主意。 但是当你的迷你批次非常少时,这不是一个好主意。 当您的LR查找器图形如上所示时,您可以在每一端要求更多点(您也可以使批量大小非常小): + +``` + learn.sched.plot(n_skip=5, n_skip_end=1) +``` + +![](../img/1_KhVaT1KpcVj6JsXzjMlxdw.png) + +``` + lr = 2e-2 
learn.fit(lr, 1, cycle_len=1) +``` + +``` + _epoch trn_loss val_loss accuracy_ _0 1.280753 0.604127 0.806941_ +``` + +解冻几层: + +``` + lrs = np.array([lr/1000,lr/100,lr]) learn.freeze_to(-2) learn.fit(lrs/5, 1, cycle_len=1) +``` + +``` + _epoch trn_loss val_loss accuracy_ _0 0.780925 0.575539 0.821064_ +``` + +解冻整个事情: + +``` + learn.unfreeze() learn.fit(lrs/5, 1, cycle_len=2) +``` + +``` + epoch trn_loss val_loss accuracy 0 0.676254 0.546998 0.834285 1 0.460609 0.533741 0.833233 +``` + +精度没有太大提高 - 因为许多图像有多个不同的对象,所以不可能准确。 + +#### 让我们来看看结果[ [1:40:48](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h40m48s) ] + +``` + fig, axes = plt.subplots(3, 4, figsize=(12, 8)) for i,ax in enumerate(axes.flat): ima=md.val_ds.denorm(x)[i] b = md.classes[preds[i]] ax = show_img(ima, ax=ax) draw_text(ax, (0,0), b) plt.tight_layout() +``` + +![](../img/1_0Tq4_OSCmZnT_TFyZ5JScg.png) + +如何理解不熟悉的代码: + +* 逐步运行每行代码,打印输入和输出。 + +**方法1** [ [1:42:28](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h42m28s) ]:你可以获取循环的内容,复制它,在它上面创建一个单元格,粘贴它,取消缩进它,设置`i=0`并将它们全部放在不同的单元格中。 + +![](../img/1_mOfiv9blUSSx5iFEArlZNw.png) + +**方法2** [ [1:43:04](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h43m4s) ]:使用Python调试器 + +您可以使用python调试器`pdb`来逐步执行代码。 + +* `pdb.set_trace()`设置断点 +* `%debug` magic跟踪错误(发生异常后) + +你需要知道的命令: + +* `h` (帮助) +* `s` (步入) +* `n` (下一行/步骤 - 你也可以点击输入) +* `c` (继续下一个断点) +* `u` (调用堆栈) +* `d` (调用堆栈下) +* `p` (打印) - 当有单个字母变量同时也是命令时强制打印。 +* `l` (列表) - 显示它上面和下面的行 +* `q` (退出) - 非常重要 + +**评论[** [**1:49:10**](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h49m10s) **]:** `[IPython.core.debugger](http://ipython.readthedocs.io/en/stable/api/generated/IPython.core.debugger.html)` (在右下方)使它非常漂亮: + +![](../img/1_4WryeZDtKFciD7qchVA6WQ.png) + +![](../img/1_aztHN3af_MxEhHS71_SUDQ.png) + +#### 创建边界框[ [1:52:51](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h52m51s) ] + +在最大的对象周围创建一个边界框可能看起来像你之前没有做过的事情,但实际上它完全是你以前做过的事情。 我们可以创建回归而不是分类神经网络。 分类神经网络是具有sigmoid或softmax输出的网络,我们使用交叉熵,二进制交叉熵或负对数似然丢失函数。 这基本上是什么使它成为分类器。 如果我们最后没有softmax或sigmoid并且我们使用均方误差作为损失函数,它现在是一个回归模型,它预测连续数而不是类别。 我们也知道我们可以像行星竞赛那样有多个输出(多重分类)。 如果我们结合这两个想法并进行多列回归怎么办? + +这就是你在考虑差异化编程的地方。 它不像“我如何创建边界框模型?”但它更像是: + +* 我们需要四个数字,因此,我们需要一个具有4个激活的神经网络 +* 对于损失函数,什么是函数,当它较低时意味着四个数字更好? 均方损失函数! 
+ +而已。 我们来试试吧。 + +#### 仅限Bbox [ [1:55:27](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h55m27s) ] + +现在我们将尝试找到最大对象的边界框。 这只是一个带有4个输出的回归。 因此,我们可以使用具有多个“标签”的CSV。 如果您记得第1部分要进行多标签分类,则多个标签必须以空格分隔,并且文件名以逗号分隔。 + +``` + BB_CSV = PATH/'tmp/bb.csv' bb = np.array([trn_lrg_anno[o][0] for o in trn_ids]) bbs = [' '.join(str(p) for p in o) for o in bb] +``` + +``` + df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 'bbox': bbs}, columns=['fn','bbox']) df.to_csv(BB_CSV, index=False) +``` + +``` + BB_CSV.open().readlines()[:5] +``` + +``` + _['fn,bbox\n',_ _'000012.jpg,96 155 269 350\n',_ _'000017.jpg,77 89 335 402\n',_ _'000023.jpg,1 2 461 242\n',_ _'000026.jpg,124 89 211 336\n']_ +``` + +#### 训练[ [1:56:11](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h56m11s) ] + +``` + f_model=resnet34 sz=224 bs=64 +``` + +设置`continuous=True`告诉fastai这是一个回归问题,这意味着它不会对标签进行单热编码,并将使用MSE作为默认暴击。 + +请注意,我们必须告诉变换构造函数我们的标签是坐标,以便它可以正确处理变换。 + +此外,我们使用CropType.NO,因为我们想要将矩形图像“挤压”成正方形而不是中心裁剪,这样我们就不会意外地裁掉一些对象。 (这在像imagenet这样的问题上不是一个问题,因为有一个对象可以分类,而且它通常很大且位于中心位置)。 + +``` + tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, **tfm_y=TfmType.COORD** ) md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, **continuous=True** ) +``` + +下周我们将看看`TfmType.COORD` ,但是现在,我们才意识到当我们进行缩放和数据扩充时,需要在边界框中进行,而不仅仅是图像。 + +``` + x,y=next(iter(md.val_dl)) +``` + +``` + ima=md.val_ds.denorm(to_np(x))[0] b = bb_hw(to_np(y[0])); b +``` + +``` + _array([ 49., 0., 131., 205.], dtype=float32)_ +``` + +``` + ax = show_img(ima) draw_rect(ax, b) draw_text(ax, b[:2], 'label') +``` + +![](../img/1_nahTyZS46y9PuRseHjNg4g.png) + +#### 让我们根据ResNet34 [ [1:56:57](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h56m57s) ]创建一个卷积网: + +fastai允许您使用`custom_head`在`custom_head`上添加自己的模块,而不是默认添加的自适应池和完全连接的网络。 在这种情况下,我们不想进行任何池化,因为我们需要知道每个网格单元的激活。 + +最后一层有4次激活,每个边界框坐标一次。 我们的目标是连续的,而不是分类的,因此使用的MSE损失函数不对模块输出执行任何sigmoid或softmax。 + +``` + head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088,4)) learn = ConvLearner.pretrained(f_model, md, **custom_head** =head_reg4) learn.opt_fn = optim.Adam learn.crit = nn.L1Loss() +``` + +* `Flatten()` :通常前一层在`7x7x512`中有7x7x512,因此将其展平为长度为2508的单个向量 +* `L1Loss` [ [1:58:22](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h58m22s) ]:不是将平方误差相加,而是将误差的绝对值相加。 这通常是你想要的,因为加上平方误差确实会对过多的错误进行惩罚。 所以L1Loss通常更适合使用。 + +``` + learn.lr_find(1e-5,100) learn.sched.plot(5) +``` + +``` + 78%|███████▊ | 25/32 [00:04<00:01, 6.16it/s, loss=395] +``` + +![](../img/1_aPhLm7eoGKPjlE1syDyQ-g.png) + +``` + lr = 2e-3 learn.fit(lr, 2, cycle_len=1, cycle_mult=2) +``` + +``` + _epoch trn_loss val_loss_ _0 49.523444 34.764141_ _1 36.864003 28.007317_ _2 30.925234 27.230705_ +``` + +``` + lrs = np.array([lr/100,lr/10,lr]) learn.freeze_to(-2) lrf=learn.lr_find(lrs/1000) learn.sched.plot(1) +``` + +![](../img/1_SBUJiX2JsdtzHrXRyEuknw.png) + +``` + learn.fit(lrs, 2, cycle_len=1, cycle_mult=2) +``` + +``` + _epoch trn_loss val_loss_ _0 25.616161 22.83597_ _1 21.812624 21.387115_ _2 17.867176 20.335539_ +``` + +``` + learn.freeze_to(-3) learn.fit(lrs, 1, cycle_len=2) +``` + +``` + _epoch trn_loss val_loss_ _0 16.571885 20.948696_ _1 15.072718 19.925312_ +``` + +验证损失是绝对值的平均值,像素被关闭。 + +``` + learn.save('reg4') +``` + +#### 看看结果[ [1:59:18](https://youtu.be/Z0ssNAbe81M%3Ft%3D1h59m18s) ] + +``` + x,y = next(iter(md.val_dl)) learn.model.eval() preds = to_np(learn.model(VV(x))) +``` + +``` + fig, axes = plt.subplots(3, 4, figsize=(12, 8)) **for** i,ax **in** enumerate(axes.flat): ima=md.val_ds.denorm(to_np(x))[i] b = bb_hw(preds[i]) ax = show_img(ima, ax=ax) draw_rect(ax, b) plt.tight_layout() +``` + 
+![](../img/1_xM98QR8U9kz3MJZDHfz7BA.png) + +我们将在下周对此进行更多修改。 在本课之前,如果你被问到“你知道如何创建一个边界框模型吗?”,你可能会说“不,没有人教过我”。 但问题实际上是: + +* 你能创建一个有4个连续输出的模型吗? 是。 +* 如果这4个输出接近4个其他数字,你能创建一个更低的损失函数吗? 是 + +然后你就完成了。 + +当你向下看时,它开始看起来有点蹩脚 - 任何时候我们有多个对象。 这并不奇怪。 总的来说,它做得非常好。 diff --git a/zh/dl9.md b/zh/dl9.md new file mode 100644 index 0000000000000000000000000000000000000000..5660b3b0061c612a6e8229ead93a3e0cac9104e1 --- /dev/null +++ b/zh/dl9.md @@ -0,0 +1,1180 @@ +# 深度学习2:第2部分第9课 + +### 链接 + +[**论坛**](http://forums.fast.ai/t/part-2-lesson-9-in-class/14028/1) **/** [**视频**](https://youtu.be/0frKXR-2PBY) + +### 评论 + +#### 从上周开始: + +* Pathlib; JSON +* 字典理解 +* Defaultdict +* 如何跳过fastai源 +* matplotlib OO API +* Lambda函数 +* 边界框坐标 +* 定制头; 边界框回归 + +![](../img/1_2nxK3zuKRnDCu_3qVhSMnw.png) + +![](../img/1_9G88jQ42l5RdwFi2Yr_h_Q.png) + +#### 从第1部分: + +* 如何从DataLoader查看模型输入 +* 如何查看模型输出 + +![](../img/1_E3Z5vKnp6ZkfuLR83979RA.png) + +### 数据增强和边界框[ [2:58](https://youtu.be/0frKXR-2PBY%3Ft%3D2m58s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/pascal.ipynb) + +**快餐的尴尬粗糙边缘:** +_分类器_是具有因变量的任何分类或二项式。 与_回归_相反,任何具有因变量的东西都是连续的。 命名有点令人困惑,但将来会被整理出来。 这里, `continuous`是`True`因为我们的因变量是边界框的坐标 - 因此这实际上是一个回归数据。 + +``` + tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, aug_tfms=augs) md = Image **Classifier** Data.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, **continuous=True** , bs=4) +``` + +#### 让我们创建一些数据扩充[ [4:40](https://youtu.be/0frKXR-2PBY%3Ft%3D4m40s) ] + +``` + augs = [RandomFlip(), RandomRotate(30), RandomLighting(0.1,0.1)] +``` + +通常,我们使用Jeremy为我们创建的这些快捷方式,但它们只是随机扩充的列表。 但是你可以很容易地创建自己的(大多数(如果不是全部)以“随机”开头)。 + +![](../img/1_lAIQHKT0GbjY0fRZKmpFaA.png) + +``` + tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, aug_tfms=augs) md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True, bs=4) +``` + +``` + idx=3 fig,axes = plt.subplots(3,3, figsize=(9,9)) for i,ax in enumerate(axes.flat): x,y=next(iter(md.aug_dl)) ima=md.val_ds.denorm(to_np(x))[idx] b = bb_hw(to_np(y[idx])) print(b) show_img(ima, ax=ax) draw_rect(ax, b) +``` + +``` + _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ _[ 115\. 63\. 240\. 311.]_ +``` + +![](../img/1_QMa_SUUVOypZHKaAuXDkSw.png) + +正如你所看到的,图像旋转并且光线变化,但是边界框_没有移动_并且_位于错误的位置_ [ [6:17](https://youtu.be/0frKXR-2PBY%3Ft%3D6m17s) ]。 当您的因变量是像素值或以某种方式连接到自变量时,这是数据增强的问题 - 它们需要一起扩充。 正如您在边界框坐标`[ 115\. 63\. 240\. 311.]`中所看到的,我们的图像是224乘224 - 所以它既不缩放也不裁剪。 因变量需要经历所有几何变换作为自变量。 + +要做到这一点[ [7:10](https://youtu.be/0frKXR-2PBY%3Ft%3D7m10s) ],每个转换都有一个可选的`tfm_y`参数: + +``` + augs = [RandomFlip(tfm_y=TfmType.COORD), RandomRotate(30, tfm_y=TfmType.COORD), RandomLighting(0.1,0.1, tfm_y=TfmType.COORD)] +``` + +``` + tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=TfmType.COORD, aug_tfms=augs) md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True, bs=4) +``` + +`TrmType.COORD`表示_y_值表示坐标。 这需要添加到所有扩充以及`tfms_from_model` ,后者负责裁剪,缩放,调整大小,填充等。 + +``` + idx=3 fig,axes = plt.subplots(3,3, figsize=(9,9)) for i,ax in enumerate(axes.flat): x,y=next(iter(md.aug_dl)) ima=md.val_ds.denorm(to_np(x))[idx] b = bb_hw(to_np(y[idx])) print(b) show_img(ima, ax=ax) draw_rect(ax, b) +``` + +``` + _[ 48\. 34\. 112\. 188.]_ _[ 65\. 36\. 107\. 185.]_ _[ 49\. 27\. 131\. 195.]_ _[ 24\. 18\. 147\. 204.]_ _[ 61\. 34\. 113\. 188.]_ _[ 55\. 31\. 121\. 191.]_ _[ 52\. 19\. 144\. 203.]_ _[ 7\. 0\. 193\. 
222.]_ _[ 52\. 38\. 105\. 182.]_ +``` + +![](../img/1__ge-RyZpEIQ5fiSvo207rA.png) + +现在,边界框随图像移动并位于正确的位置。 您可能会注意到,有时它看起来很奇怪,就像底行中间的那样。 这是我们所拥有信息的约束。 如果对象占据原始边界框的角,则在图像旋转后,新的边界框需要更大。 所以你必须**小心不要使用边界框进行太高的旋转,**因为没有足够的信息让它们保持准确。 如果我们在做多边形或分段,我们就不会遇到这个问题。 + +![](../img/1_4V4sjFZxn-y2cU9tCJPEUw.png) + +<figcaption class="imageCaption">这就是箱子变大的原因</figcaption> + + + +``` + tfm_y = TfmType.COORD augs = [RandomFlip(tfm_y=tfm_y), RandomRotate( **3** , **p=0.5** , tfm_y=tfm_y), RandomLighting(0.05,0.05, tfm_y=tfm_y)] +``` + +``` + tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=tfm_y, aug_tfms=augs) md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True) +``` + +所以在这里,我们最多进行3度旋转以避免这个问题[ [9:14](https://youtu.be/0frKXR-2PBY%3Ft%3D9m14s) ]。 它也只旋转了一半的时间( `p=0.5` )。 + +#### custom_head [ [9:34](https://youtu.be/0frKXR-2PBY%3Ft%3D9m34s) ] + +`learn.summary()`将通过模型运行一小批数据,并在每一层打印出张量的大小。 正如你所看到的,在`Flatten`层之前,张量的形状为512乘7乘7.所以如果它是1级张量(即单个向量),它的长度将是25088(512 * 7 * 7)并且这就是为什么我们的自定义标题的输入大小是25088.输出大小是4,因为它是边界框坐标。 + +``` + head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088,4)) learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4) learn.opt_fn = optim.Adam learn.crit = nn.L1Loss() +``` + +![](../img/1_o9NFGVz1ua60kOpIafe5Hg.png) + +#### 单个物体检测[ [10:35](https://youtu.be/0frKXR-2PBY%3Ft%3D10m35s) ] + +让我们将两者结合起来创建可以对每个图像中最大的对象进行分类和本地化的东西。 + +我们需要做三件事来训练神经网络: + +1. 数据 +2. 建筑 +3. 损失函数 + +#### 1.提供数据 + +我们需要一个`ModelData`对象,其独立变量是图像,而因变量是边界框坐标和类标签的元组。 有几种方法可以做到这一点,但这里有一个特别懒惰和方便的方法,Jeremy提出的方法是创建两个`ModelData`对象,表示我们想要的两个不同的因变量(一个带有边界框坐标,一个带有类)。 + +``` + f_model=resnet34 sz=224 bs=64 +``` + +``` + val_idxs = get_cv_idxs(len(trn_fns)) tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=TfmType.COORD, aug_tfms=augs) +``` + +``` + md = ImageClassifierData.from_csv(PATH, JPEGS, **BB_CSV** , tfms=tfms, continuous=True, val_idxs=val_idxs) +``` + +``` + md2 = ImageClassifierData.from_csv(PATH, JPEGS, **CSV** , tfms=tfms_from_model(f_model, sz)) +``` + +数据集可以是`__len__`和`__getitem__`任何数据集。 这是一个向现有数据集添加第二个标签的数据集: + +``` + **class** **ConcatLblDataset** (Dataset): **def** __init__(self, ds, y2): self.ds,self.y2 = ds,y2 **def** __len__(self): **return** len(self.ds) **def** __getitem__(self, i): x,y = self.ds[i] **return** (x, (y,self.y2[i])) +``` + +* `ds` :包含独立变量和因变量 +* `y2` :包含其他因变量 +* `(x, (y,self.y2[i]))` : `(x, (y,self.y2[i]))`返回一个自变量和两个因变量的组合。 + +我们将使用它将类添加到边界框标签。 + +``` + trn_ds2 = ConcatLblDataset(md.trn_ds, md2.trn_y) val_ds2 = ConcatLblDataset(md.val_ds, md2.val_y) +``` + +这是一个示例因变量: + +``` + val_ds2[0][1] +``` + +``` + _(array([ 0., 49., 205., 180.], dtype=float32), 14)_ +``` + +我们可以用这些新数据集替换数据加载器的数据集。 + +``` + md.trn_dl.dataset = trn_ds2 md.val_dl.dataset = val_ds2 +``` + +在绘制之前,我们必须对`denorm`的图像进行声明。 + +``` + x,y = next(iter(md.val_dl)) idx = 3 ima = md.val_ds.ds.denorm(to_np(x))[idx] b = bb_hw(to_np(y[0][idx])); b +``` + +``` + _array([ 52., 38., 106., 184.], dtype=float32)_ +``` + +``` + ax = show_img(ima) draw_rect(ax, b) draw_text(ax, b[:2], md2.classes[y[1][idx]]) +``` + +![](../img/1_6QqfOpqgyRogEiTCU8WZgQ.png) + +#### 2.选择建筑[ [13:54](https://youtu.be/0frKXR-2PBY%3Ft%3D13m54s) ] + +该体系结构将与我们用于分类器和边界框回归的体系结构相同,但我们将仅将它们组合在一起。 换句话说,如果我们有`c`类,那么我们在最后一层中需要的激活次数是4加`c` 。 4用于边界框坐标和`c`概率(每个类一个)。 + +这次我们将使用额外的线性层,加上一些辍学,以帮助我们训练更灵活的模型。 一般来说,我们希望我们的自定义头能够自己解决问题,如果它所连接的预训练骨干是合适的。 所以在这种情况下,我们试图做很多 - 分类器和边界框回归,所以只是单个线性层似乎不够。 如果您想知道为什么在第一个`ReLU`之后没有`BatchNorm1d` ,ResNet主干已经将`BatchNorm1d`作为其最后一层。 + +``` + head_reg4 = nn.Sequential( Flatten(), nn.ReLU(), 
nn.Dropout(0.5), nn.Linear(25088,256), nn.ReLU(), nn.BatchNorm1d(256), nn.Dropout(0.5), nn.Linear(256, **4+len(cats)** ), ) models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4) learn = ConvLearner(md, models) learn.opt_fn = optim.Adam +``` + +#### 3.损失函数[ [15:46](https://youtu.be/0frKXR-2PBY%3Ft%3D15m46s) ] + +损失函数需要查看这些`4 + len(cats)`激活并确定它们是否良好 - 这些数字是否准确反映了图像中最大对象的位置和类别。 我们知道如何做到这一点。 对于前4次激活,我们将像以前一样使用L1Loss(L1Loss就像均方误差 - 而不是平方误差之和,它使用绝对值之和)。 对于其余的激活,我们可以使用交叉熵损失。 + +``` + **def** detn_loss(input, target): bb_t,c_t = target bb_i,c_i = input[:, :4], input[:, 4:] bb_i = F.sigmoid(bb_i)*224 _# I looked at these quantities separately first then picked a_ _# multiplier to make them approximately equal_ **return** F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t)*20 +``` + +``` + **def** detn_l1(input, target): bb_t,_ = target bb_i = input[:, :4] bb_i = F.sigmoid(bb_i)*224 **return** F.l1_loss(V(bb_i),V(bb_t)).data +``` + +``` + **def** detn_acc(input, target): _,c_t = target c_i = input[:, 4:] **return** accuracy(c_i, c_t) +``` + +``` + learn.crit = detn_loss learn.metrics = [detn_acc, detn_l1] +``` + +* `input` :激活 +* `target` :基本事实 +* `bb_t,c_t = target` :我们的自定义数据集返回一个包含边界框坐标和类的元组。 这项任务将对它们进行解构。 +* `bb_i,c_i = input[:, :4], input[:, 4:]` :第一个`:`用于批量维度。 +* `b_i = F.sigmoid(bb_i)*224` :我们知道我们的图像是224乘`Sigmoid`将强制它在0和1之间,并将它乘以224以帮助我们的神经网络在它的范围内成为。 + +**问题:**作为一般规则,在ReLU [ [18:02](https://youtu.be/0frKXR-2PBY%3Ft%3D18m2s) ]之前或之后放置BatchNorm会更好吗? Jeremy建议将它放在ReLU之后,因为BathNorm意味着走向零均值的单标准偏差。 因此,如果你把ReLU放在它之后,你将它截断为零,这样就无法创建负数。 但是如果你把ReLU然后放入BatchNorm,它确实具有这种能力并且给出稍微好一些的结果。 话虽如此,无论如何都不是太大的交易。 你在课程的这一部分看到,大多数时候,Jeremy做了ReLU然后是BatchNorm,但是当他想要与论文保持一致时,有时则相反。 + +**问题** :BatchNorm之后使用dropout的直觉是什么? BatchNorm是否已经做好了正规化[ [19:12](https://youtu.be/0frKXR-2PBY%3Ft%3D19m12s) ]的工作? 
BatchNorm可以正常化,但如果你回想第1部分,我们讨论了一些事情,我们这样做是为了避免过度拟合,添加BatchNorm就像数据增加一样。 但你完全有可能仍然过度拟合。 关于辍学的一个好处是,它有一个参数来说明辍学的数量。 参数是特别重要的参数,决定了规则的多少,因为它可以让你构建一个漂亮的大参数化模型,然后决定规范它的程度。 Jeremy倾向于总是从`p=0`开始辍学,然后当他添加正则化时,他可以改变辍学参数而不用担心他是否保存了他想要能够加载它的模型,但如果他有在一个中丢弃层而在另一个中没有,它将不再加载。 所以这样,它保持一致。 + +现在我们有输入和目标,我们可以计算L1损失并添加交叉熵[ [20:39](https://youtu.be/0frKXR-2PBY%3Ft%3D20m39s) ]: + +`F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t)*20` + +这是我们的损失功能。 交叉熵和L1损失可能具有完全不同的尺度 - 在这种情况下,损失函数中较大的一个将占主导地位。 在这种情况下,杰里米打印出这些值,并发现如果我们将交叉熵乘以20会使它们的大小相同。 + +``` + lr=1e-2 learn.fit(lr, 1, cycle_len=3, use_clr=(32,5)) +``` + +``` + _epoch trn_loss val_loss detn_acc detn_l1_ _0 72.036466 45.186367 0.802133 32.647586_ _1 51.037587 36.34964 0.828425 25.389733_ _2 41.4235 35.292709 0.835637 24.343577_ +``` + +``` + _[35.292709, 0.83563701808452606, 24.343576669692993]_ +``` + +在训练时打印出信息很好,所以我们抓住L1损失并将其作为指标添加。 + +``` + learn.save('reg1_0') learn.freeze_to(-2) lrs = np.array([lr/100, lr/10, lr]) learn.fit(lrs/5, 1, cycle_len=5, use_clr=(32,10)) +``` + +``` + epoch trn_loss val_loss detn_acc detn_l1 0 34.448113 35.972973 0.801683 22.918499 1 28.889909 33.010857 0.830379 21.689888 2 24.237017 30.977512 0.81881 20.817996 3 21.132993 30.60677 0.83143 20.138552 4 18.622983 30.54178 0.825571 19.832196 +``` + +``` + [30.54178, 0.82557091116905212, 19.832195997238159] +``` + +``` + learn.unfreeze() learn.fit(lrs/10, 1, cycle_len=10, use_clr=(32,10)) +``` + +``` + epoch trn_loss val_loss detn_acc detn_l1 0 15.957164 31.111507 0.811448 19.970753 1 15.955259 32.597153 0.81235 20.111022 2 15.648723 32.231941 0.804087 19.522853 3 14.876172 30.93821 0.815805 19.226574 4 14.113872 31.03952 0.808594 19.155093 5 13.293885 29.736671 0.826022 18.761728 6 12.562566 30.000023 0.827524 18.82006 7 11.885125 30.28841 0.82512 18.904158 8 11.498326 30.070133 0.819712 18.635296 9 11.015841 30.213772 0.815805 18.551489 +``` + +``` + [30.213772, 0.81580528616905212, 18.551488876342773] +``` + +检测精度低至80,与以前相同。 这并不奇怪,因为ResNet旨在进行分类,因此我们不希望以这种简单的方式改进事物。 它当然不是为了进行边界框回归而设计的。 它显然实际上是以不关心几何的方式设计的 - 它需要最后7到7个激活网格并将它们平均放在一起扔掉所有关于来自何处的信息。 + +有趣的是,当我们同时进行准确性(分类)和边界框时,L1似乎比我们刚进行边界框回归时要好一些[ [22:46](https://youtu.be/0frKXR-2PBY%3Ft%3D22m46s) ]。 如果这对你来说是违反直觉的,那么这将是本课后要考虑的主要事项之一,因为这是一个非常重要的想法。 这个想法是这样的 - 弄清楚图像中的主要对象是什么,是一种困难的部分。 然后确定边界框的确切位置以及它的类别是一个简单的部分。 所以当你有一个网络既说对象是什么,对象在哪里时,它就会分享关于找到对象的所有计算。 所有共享计算都非常有效。 当我们返回传播类和地方中的错误时,这就是有助于计算找到最大对象的所有信息。 因此,只要您有多个任务分享这些任务完成工作所需要的概念,他们很可能应该至少共享网络的某些层。 今天晚些时候,我们将看一个模型,其中除了最后一层之外,大多数层都是共享的。 + +结果如下[ [24:34](https://youtu.be/0frKXR-2PBY%3Ft%3D24m34s) ]。 和以前一样,当图像中有单个主要对象时,它做得很好。 + +![](../img/1_g4JAJgAcDNDikhgwtLTcwQ.png) + +### 多标签分类[ [25:29](https://youtu.be/0frKXR-2PBY%3Ft%3D25m29s) ] + +[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/pascal-multi.ipynb) + +我们希望继续构建比上一个模型稍微复杂的模型,这样如果某些东西停止工作,我们就会确切地知道它在哪里破碎。 以下是以前笔记本的功能: + +``` + %matplotlib inline %reload_ext autoreload %autoreload 2 +``` + +``` + **from** **fastai.conv_learner** **import** * **from** **fastai.dataset** **import** * **import** **json** , **pdb** **from** **PIL** **import** ImageDraw, ImageFont **from** **matplotlib** **import** patches, patheffects torch.backends.cudnn.benchmark= **True** +``` + +#### 建立 + +``` + PATH = Path('data/pascal') trn_j = json.load((PATH / 'pascal_train2007.json').open()) IMAGES,ANNOTATIONS,CATEGORIES = ['images', 'annotations', 'categories'] FILE_NAME,ID,IMG_ID,CAT_ID,BBOX = 'file_name','id','image_id', 'category_id','bbox' cats = dict((o[ID], o['name']) **for** o **in** trn_j[CATEGORIES]) trn_fns = dict((o[ID], o[FILE_NAME]) 
**for** o **in** trn_j[IMAGES]) trn_ids = [o[ID] **for** o **in** trn_j[IMAGES]] JPEGS = 'VOCdevkit/VOC2007/JPEGImages' IMG_PATH = PATH/JPEGS +``` + +``` + **def** get_trn_anno(): trn_anno = collections.defaultdict( **lambda** :[]) **for** o **in** trn_j[ANNOTATIONS]: **if** **not** o['ignore']: bb = o[BBOX] bb = np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1]) trn_anno[o[IMG_ID]].append((bb,o[CAT_ID])) **return** trn_anno trn_anno = get_trn_anno() +``` + +``` + **def** show_img(im, figsize= **None** , ax= **None** ): **if** **not** ax: fig,ax = plt.subplots(figsize=figsize) ax.imshow(im) ax.set_xticks(np.linspace(0, 224, 8)) ax.set_yticks(np.linspace(0, 224, 8)) ax.grid() ax.set_yticklabels([]) ax.set_xticklabels([]) **return** ax **def** draw_outline(o, lw): o.set_path_effects([patheffects.Stroke( linewidth=lw, foreground='black'), patheffects.Normal()]) **def** draw_rect(ax, b, color='white'): patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill= **False** , edgecolor=color, lw=2)) draw_outline(patch, 4) **def** draw_text(ax, xy, txt, sz=14, color='white'): text = ax.text(*xy, txt, verticalalignment='top', color=color, fontsize=sz, weight='bold') draw_outline(text, 1) +``` + +``` + **def** bb_hw(a): **return** np.array([a[1],a[0],a[3]-a[1],a[2]-a[0]]) **def** draw_im(im, ann): ax = show_img(im, figsize=(16,8)) **for** b,c **in** ann: b = bb_hw(b) draw_rect(ax, b) draw_text(ax, b[:2], cats[c], sz=16) **def** draw_idx(i): im_a = trn_anno[i] im = open_image(IMG_PATH/trn_fns[i]) draw_im(im, im_a) +``` + +#### 多级[ [26:12](https://youtu.be/0frKXR-2PBY%3Ft%3D26m12s) ] + +``` + MC_CSV = PATH/'tmp/mc.csv' +``` + +``` + trn_anno[12] +``` + +``` + _[(array([ 96, 155, 269, 350]), 7)]_ +``` + +``` + mc = [set([cats[p[1]] **for** p **in** trn_anno[o]]) **for** o **in** trn_ids] mcs = [' '.join(str(p) **for** p **in** o) **for** o **in** mc] +``` + +``` + df = pd.DataFrame({'fn': [trn_fns[o] **for** o **in** trn_ids], 'clas': mcs}, columns=['fn','clas']) df.to_csv(MC_CSV, index= **False** ) +``` + +其中一名学生指出,通过使用Pandas,我们可以比使用`collections.defaultdict`更简单,并分享[这个要点](https://gist.github.com/binga/1bc4ebe5e41f670f5954d2ffa9d6c0ed) 。 你越了解熊猫,你越经常意识到它是解决许多不同问题的好方法。 + +**问题** :当您在较小的模型上逐步构建时,是否将它们重新用作预先训练过的权重? 或者你把它扔掉然后从头开始重新训练[ [27:11](https://youtu.be/0frKXR-2PBY%3Ft%3D27m11s) ]? 
当Jeremy在他这样做时想出东西时,他通常会倾向于扔掉,因为重复使用预先训练过的砝码会带来不必要的复杂性。 然而,如果他试图达到他可以在真正大的图像上进行训练的程度,他通常会从更小的角度开始,并且经常重新使用这些重量。 + +``` + f_model=resnet34 sz=224 bs=64 +``` + +``` + tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO) md = ImageClassifierData.from_csv(PATH, JPEGS, MC_CSV, tfms=tfms) +``` + +``` + learn = ConvLearner.pretrained(f_model, md) learn.opt_fn = optim.Adam +``` + +``` + lr = 2e-2 +``` + +``` + learn.fit(lr, 1, cycle_len=3, use_clr=(32,5)) +``` + +``` + _epoch trn_loss val_loss <lambda>_ _0 0.104836 0.085015 0.972356_ _1 0.088193 0.079739 0.972461_ _2 0.072346 0.077259 0.974114_ +``` + +``` + _[0.077258907, 0.9741135761141777]_ +``` + +``` + lrs = np.array([lr/100, lr/10, lr]) +``` + +``` + learn.freeze_to(-2) +``` + +``` + learn.fit(lrs/10, 1, cycle_len=5, use_clr=(32,5)) +``` + +``` + _epoch trn_loss val_loss <lambda>_ _0 0.063236 0.088847 0.970681_ _1 0.049675 0.079885 0.973723_ _2 0.03693 0.076906 0.975601_ _3 0.026645 0.075304 0.976187_ _4 0.018805 0.074934 0.975165_ +``` + +``` + _[0.074934497, 0.97516526281833649]_ +``` + +``` + learn.save('mclas') +``` + +``` + learn.load('mclas') +``` + +``` + y = learn.predict() x,_ = next(iter(md.val_dl)) x = to_np(x) +``` + +``` + fig, axes = plt.subplots(3, 4, figsize=(12, 8)) **for** i,ax **in** enumerate(axes.flat): ima=md.val_ds.denorm(x)[i] ya = np.nonzero(y[i]>0.4)[0] b = ' **\n** '.join(md.classes[o] **for** o **in** ya) ax = show_img(ima, ax=ax) draw_text(ax, (0,0), b) plt.tight_layout() +``` + +![](../img/1_2m1Qoq3NhsqdYBd4hUTR6A.png) + +多级分类非常简单[ [28:28](https://youtu.be/0frKXR-2PBY%3Ft%3D28m28s) ]。 一个小调整是在这一行中使用`set` ,以便每个对象类型出现一次: + +``` + mc = [ **set** ([cats[p[1]] **for** p **in** trn_anno[o]]) **for** o **in** trn_ids] +``` + +#### SSD和YOLO [ [29:10](https://youtu.be/0frKXR-2PBY%3Ft%3D29m10s) ] + +我们有一个输入图像,它通过一个转换网络,输出一个大小为`4+c`的向量,其中`c=len(cats)` 。 这为我们提供了一个最大物体的物体探测器。 现在让我们创建一个找到16个对象的对象。 显而易见的方法是采用最后一个线性层而不是`4+c`输出,我们可以有`16x(4+c)`输出。 这给了我们16组类概率和16组边界框坐标。 然后我们只需要一个损失函数来检查这16组边界框是否正确表示了图像中最多16个对象(我们稍后会进入损失函数)。 + +![](../img/1_fPHmCosDHcrHmtKvWFK9Mg.png) + +第二种方法是使用`nn.linear`而不是使用`nn.linear` ,如果相反,我们从ResNet卷积主干中获取并添加了一个`nn.Conv2d`和stride 2 [ [31:32](https://youtu.be/0frKXR-2PBY%3Ft%3D31m32s) ]? 这将给我们一个`4x4x[# of filters]`张量 - 这里让我们使它成为`4x4x(4+c)`这样我们得到一个张量,其中元素的数量正好等于我们想要的元素数量。 现在,如果我们创建了一个`4x4x(4+c)`张量的损失函数,并将其映射到图像中的16个对象,并检查每个对象是否通过这些`4+c`激活正确表示,这也可以。 事实证明,这两种方法实际上都在使用[ [33:48](https://youtu.be/0frKXR-2PBY%3Ft%3D33m48s) ]。 输出是来自完全连接的线性层的一个大长矢量的方法被称为[YOLO(You Only Look Once)](https://arxiv.org/abs/1506.02640)的一类模型使用,在其他地方,卷积激活的方法被以某些东西开始的模型使用称为[SSD(单发探测器)](https://arxiv.org/abs/1512.02325) 。 由于这些事情在2015年末非常相似,所以事情已经转向SSD。 所以今天早上, [YOLO版本3](https://pjreddie.com/media/files/papers/YOLOv3.pdf)出现了,现在正在做SSD,这就是我们要做的事情。 我们还将了解为什么这也更有意义。 + +#### 锚箱[ [35:04](https://youtu.be/0frKXR-2PBY%3Ft%3D35m04s) ] + +![](../img/1_8kpDP3FZFxW99IUQE0C8Xw.png) + +让我们假设我们有另一个`Conv2d(stride=2)`然后我们将有`2x2x(4+c)`张量。 基本上,它创建一个看起来像这样的网格: + +![](../img/1_uA-oJok4-Rng6mnHOOPyNQ.png) + +这就是第二额外卷积步幅2层的激活的几何形状。 请记住,步幅2卷积对激活的几何形状做同样的事情,如步幅1卷积,然后是假设填充正常的最大值。 + +我们来谈谈我们在这里可以做些什么[ [36:09](https://youtu.be/0frKXR-2PBY%3Ft%3D36m9s) ]。 我们希望这些网格单元中的每一个都负责查找图像该部分中的最大对象。 + +#### 感受野[ [37:20](https://youtu.be/0frKXR-2PBY%3Ft%3D37m20s) ] + +为什么我们关心的是我们希望每个卷积网格单元负责查找图像相应部分中的内容? 
原因是因为卷积网格单元的感知域。 基本思想是,在整个卷积层中,这些张量的每一部分都有一个感知场,这意味着输入图像的哪一部分负责计算该细胞。 像生活中的所有事情一样,最简单的方法就是用Excel [ [38:01](https://youtu.be/0frKXR-2PBY%3Ft%3D38m1s) ]。 + +![](../img/1_IgL2CMSit3Hh9N2Fq2Zlgg.png) + +进行一次激活(在这种情况下,在maxpool图层中)让我们看看它来自哪里[ [38:45](https://youtu.be/0frKXR-2PBY%3Ft%3D38m45s) ]。 在excel中,您可以执行公式→跟踪先例。 一直追溯到输入层,您可以看到它来自图像的这个6 x 6部分(以及过滤器)。 更重要的是,中间部分有很多重量从外面的细胞出来只有一个重量出来的地方。 因此,我们将这个6 x 6细胞称为我们选择的一次激活的感受野。 + +![](../img/1_cCBVbJ2WjiPMlqX4nA2bwA.png) + +<figcaption class="imageCaption">3x3卷积,不透明度为15% - 显然盒子的中心有更多的依赖性</figcaption> + + + +请注意,感知字段不只是说它是这个框,而且框的中心有更多的依赖关系[ [40:27](https://youtu.be/0frKXR-2PBY%3Ft%3D40m27s) ]这是一个非常重要的概念,当涉及到理解架构并理解为什么会员网以他们的方式工作时。 + +#### 建筑[ [41:18](https://youtu.be/0frKXR-2PBY%3Ft%3D41m18s) ] + +架构是,我们将有一个ResNet主干,然后是一个或多个2D卷积(现在一个),这将给我们一个`4x4`网格。 + +``` + **class** **StdConv** (nn.Module): **def** __init__(self, nin, nout, stride=2, drop=0.1): super().__init__() self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1) self.bn = nn.BatchNorm2d(nout) self.drop = nn.Dropout(drop) **def** forward(self, x): **return** self.drop(self.bn(F.relu(self.conv(x)))) **def** flatten_conv(x,k): bs,nf,gx,gy = x.size() x = x.permute(0,2,3,1).contiguous() **return** x.view(bs,-1,nf//k) +``` + +``` + **class** **OutConv** (nn.Module): **def** __init__(self, k, nin, bias): super().__init__() self.k = k self.oconv1 = nn.Conv2d(nin, (len(id2cat)+1)*k, 3, padding=1) self.oconv2 = nn.Conv2d(nin, 4*k, 3, padding=1) self.oconv1.bias.data.zero_().add_(bias) **def** forward(self, x): **return** [flatten_conv(self.oconv1(x), self.k), flatten_conv(self.oconv2(x), self.k)] +``` + +``` + **class** **SSD_Head** (nn.Module): **def** __init__(self, k, bias): super().__init__() self.drop = nn.Dropout(0.25) self.sconv0 = StdConv(512,256, stride=1) self.sconv2 = StdConv(256,256) self.out = OutConv(k, 256, bias) **def** forward(self, x): x = self.drop(F.relu(x)) x = self.sconv0(x) x = self.sconv2(x) **return** self.out(x) head_reg4 = SSD_Head(k, -3.) models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4) learn = ConvLearner(md, models) learn.opt_fn = optim.Adam +``` + +**SSD_Head** + +1. 我们从ReLU和辍学开始 +2. 然后迈步1卷积。 我们从步幅1卷积开始的原因是因为它根本不会改变几何 - 它只是让我们添加一个额外的计算层。 它让我们不仅创建一个线性层,而且现在我们在自定义头中有一个小的神经网络。 `StdConv`在上面定义 - 它执行卷积,ReLU,BatchNorm和dropout。 你看到的大多数研究代码都不会定义这样的类,而是一次又一次地写出整个事物。 不要那样。 重复的代码会导致错误和理解不足。 +3. 跨步2卷积[ [44:56](https://youtu.be/0frKXR-2PBY%3Ft%3D44m56s) ] +4. 最后,步骤3的输出为`4x4` ,并传递给`OutConv` 。 `OutConv`有两个独立的卷积层,每个卷层都是步长1,因此它不会改变输入的几何形状。 其中一个是类的数量的长度(现在忽略`k`而`+1`是“背景” - 即没有检测到对象),另一个的长度是4.而不是有一个输出`4+c`转换层,让我们有两个转换层,并在列表中返回它们的输出。 这允许这些层专门化一点点。 我们谈到了这个想法,当你有多个任务时,他们可以共享图层,但他们不必共享所有图层。 在这种情况下,我们创建分类器以及创建和创建边界框回归的两个任务共享除最后一个层之外的每个层。 +5. 
最后,我们弄平了卷积,因为杰里米写了损失函数,期望压低张量,但我们可以完全重写它不要那样做。 + +#### [Fastai编码风格](https://github.com/fastai/fastai/blob/master/docs/style.md) [ [42:58](https://youtu.be/0frKXR-2PBY%3Ft%3D42m58s) ] + +第一稿于本周发布。 它非常依赖于说明性编程的思想,即编程代码应该是一种可以用来解释一个想法的东西,理想情况下就像数学符号一样,对于理解你的编码方法的人来说。 这个想法可以追溯到很长一段时间,但最好的描述可能是杰里米最伟大的计算机科学英雄Ken Iverson在1979年的图灵奖演讲中。 他从1964年以来一直在研究它,但1964年是他发布的这种方法的第一个例子,即APL,25年后,他赢得了图灵奖。 然后他将接力棒传给了他的儿子Eric Iverson。 Fastai风格指南试图采用其中一些想法。 + +#### 损失函数[ [47:44](https://youtu.be/0frKXR-2PBY%3Ft%3D47m44s) ] + +损失函数需要查看这16组激活中的每一组,每组激活具有四个边界框坐标和`c+1`类概率,并确定这些激活是否离最近该网格单元的对象很近或远离在图像中。 如果没有,那么它是否正确预测背景。 事实证明这很难。 + +#### 匹配问题[ [48:43](https://youtu.be/0frKXR-2PBY%3Ft%3D48m43s) ] + +![](../img/1_2dqj3hivcOF6ThoL-nhMyA.png) + +损失函数需要获取图像中的每个对象并将它们与这些卷积网格单元中的一个匹配,以说“此网格单元负责此特定对象”,因此它可以继续说“好吧,有多接近4个坐标和类概率有多接近。 + +这是我们的目标[ [49:56](https://youtu.be/0frKXR-2PBY%3Ft%3D49m56s) ]: + +![](../img/1_8M9x-WgHNasmuLSJNbKoaQ.png) + +我们的因变量看起来像左边的变量,我们的最终卷积层将是`4x4x(c+1)`在这种情况下`c=20` 。 然后我们将其展平成一个向量。 我们的目标是提出一个函数,它接受一个因变量以及最终从模型中出来的一些特定的激活,并且如果这些激活不是地面实况边界框的良好反映,则返回更高的数字; 如果它是一个很好的反映,或更低的数字。 + +#### 测试[ [51:58](https://youtu.be/0frKXR-2PBY%3Ft%3D51m58s) ] + +``` + x,y = next(iter(md.val_dl)) x,y = V(x),V(y) learn.model.eval() batch = learn.model(x) b_clas,b_bb = batch b_clas.size(),b_bb.size() +``` + +``` + _(torch.Size([64, 16, 21]), torch.Size([64, 16, 4]))_ +``` + +确保这些形状有意义。 现在让我们看看基础事实[ [53:24](https://youtu.be/0frKXR-2PBY%3Ft%3D53m24s) ]: + +``` + idx=7 b_clasi = b_clas[idx] b_bboxi = b_bb[idx] ima=md.val_ds.ds.denorm(to_np(x))[idx] bbox,clas = get_y(y[0][idx], y[1][idx]) bbox,clas +``` + +``` + _(Variable containing:_ _0.6786 0.4866 0.9911 0.6250_ _0.7098 0.0848 0.9911 0.5491_ _0.5134 0.8304 0.6696 0.9063_ _[torch.cuda.FloatTensor of size 3x4 (GPU 0)], Variable containing:_ _8_ _10_ _17_ _[torch.cuda.LongTensor of size 3 (GPU 0)])_ +``` + +请注意,边界框坐标已缩放到0到1之间 - 基本上我们将图像视为1x1,因此它们相对于图像的大小。 + +我们已经有`show_ground_truth`函数。 这个`torch_gt` (gt:地面实况)函数只是将张量转换为numpy数组。 + +``` + **def** torch_gt(ax, ima, bbox, clas, prs= **None** , thresh=0.4): **return** show_ground_truth(ax, ima, to_np((bbox*224).long()), to_np(clas), to_np(prs) **if** prs **is** **not** **None** **else** **None** , thresh) +``` + +``` + fig, ax = plt.subplots(figsize=(7,7)) torch_gt(ax, ima, bbox, clas) +``` + +![](../img/1_Q3ZtSRtk-a2OwKfE1wa5zw.png) + +以上是一个基本事实。 这是我们最终卷积层的`4x4`网格单元格[ [54:44](https://youtu.be/0frKXR-2PBY%3Ft%3D54m44s) ]: + +``` + fig, ax = plt.subplots(figsize=(7,7)) torch_gt(ax, ima, anchor_cnr, b_clasi.max(1)[1]) +``` + +![](../img/1_xjKmShqdLnD_JX4Aj7U80g.png) + +每个方形盒子,不同的纸张都称它们为不同的东西。 您将听到的三个术语是:锚箱,先前的箱子或默认箱子。 我们将坚持使用术语锚箱。 + +我们要为这个损失函数做些什么,我们将要经历一个匹配问题,我们将采用这16个方框中的每一个,看看这三个地面实况对象中哪一个具有最大的重叠量给定方[ [55:21](https://youtu.be/0frKXR-2PBY%3Ft%3D55m21s) ]。 要做到这一点,我们必须有一些方法来测量重叠量,这个标准函数叫做[Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) (IoU)。 + +![](../img/1_10ORjq4HuOc0umcnojiDPA.png) + +我们将通过找到三个物体中的每一个与16个锚箱中的每一个的Jaccard重叠[ [57:11](https://youtu.be/0frKXR-2PBY%3Ft%3D57m11s) ]。 这将给我们一个`3x16`矩阵。 + +以下是我们所有锚箱的_坐标_ (中心,高度,宽度): + +``` + anchors +``` + +``` + _Variable containing:_ _0.1250 0.1250 0.2500 0.2500_ _0.1250 0.3750 0.2500 0.2500_ _0.1250 0.6250 0.2500 0.2500_ _0.1250 0.8750 0.2500 0.2500_ _0.3750 0.1250 0.2500 0.2500_ _0.3750 0.3750 0.2500 0.2500_ _0.3750 0.6250 0.2500 0.2500_ _0.3750 0.8750 0.2500 0.2500_ _0.6250 0.1250 0.2500 0.2500_ _0.6250 0.3750 0.2500 0.2500_ _0.6250 0.6250 0.2500 0.2500_ _0.6250 0.8750 0.2500 0.2500_ _0.8750 0.1250 0.2500 0.2500_ _0.8750 0.3750 0.2500 0.2500_ _0.8750 0.6250 0.2500 0.2500_ 
_0.8750 0.8750 0.2500 0.2500_ _[torch.cuda.FloatTensor of size 16x4 (GPU 0)]_ +``` + +以下是3个地面实况对象和16个锚箱之间的重叠量: + +``` + overlaps = jaccard(bbox.data, anchor_cnr.data) overlaps +``` + +``` + Columns 0 to 7 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 +``` + +``` + Columns 8 to 15 0.0000 0.0091 0.0922 0.0000 0.0000 0.0315 0.3985 0.0000 0.0356 0.0549 0.0103 0.0000 0.2598 0.4538 0.0653 0.0000 0.0000 0.0000 0.0000 0.1897 0.0000 0.0000 0.0000 0.0000 [torch.cuda.FloatTensor of size 3x16 (GPU 0)] +``` + +我们现在可以做的是我们可以采用维度1(行方向)的最大值,它将告诉我们每个地面实况对象,与某些网格单元重叠的最大量以及索引: + +``` + overlaps.max(1) +``` + +``` + _(_ _0.3985_ _0.4538_ _0.1897_ _[torch.cuda.FloatTensor of size 3 (GPU 0)],_ _14_ _13_ _11_ _[torch.cuda.LongTensor of size 3 (GPU 0)])_ +``` + +我们还将查看尺寸0(列方向)上的最大值,它将告诉我们所有地面[实例](https://youtu.be/0frKXR-2PBY%3Ft%3D59m8s)对象中每个网格单元的最大重叠量[ [59:08](https://youtu.be/0frKXR-2PBY%3Ft%3D59m8s) ]: + +``` + overlaps.max(0) +``` + +``` + _(_ _0.0000_ _0.0000_ _0.0000_ _0.0000_ _0.0000_ _0.0000_ _0.0000_ _0.0000_ _0.0356_ _0.0549_ _0.0922_ _0.1897_ _0.2598_ _0.4538_ _0.3985_ _0.0000_ _[torch.cuda.FloatTensor of size 16 (GPU 0)],_ _0_ _0_ _0_ _0_ _0_ _0_ _0_ _0_ _1_ _1_ _0_ _2_ _1_ _1_ _0_ _0_ _[torch.cuda.LongTensor of size 16 (GPU 0)])_ +``` + +这里特别有趣的是它告诉我们每个网格单元最重要的地面实况对象的索引是什么。 零在这里有点过载 - 零可能意味着重叠量为零或其最大重叠与对象索引为零。 事实证明不仅仅是问题,而是仅仅是因为。 + +有一个名为`map_to_ground_truth`的函数,我们现在不用担心[ [59:57](https://youtu.be/0frKXR-2PBY%3Ft%3D59m57s) ]。 它是超级简单的代码,但考虑起来有点尴尬。 基本上它的作用是它以SSD论文中描述的方式组合这两组重叠,以将每个锚盒分配给基础事实对象。 它分配的方式是三个(行方式最大)中的每一个按原样分配。 对于其余的锚箱,它们被分配给它们具有至少0.5的重叠的任何东西(逐列)。 If neither applies, it is considered to be a cell which contains background. + +``` + gt_overlap,gt_idx = map_to_ground_truth(overlaps) gt_overlap,gt_idx +``` + +``` + _(_ + 0.0000 + 0.0000 + 0.0000 + 0.0000 + 0.0000 + 0.0000 + 0.0000 + 0.0000 + 0.0356 + 0.0549 + 0.0922 + 1.9900 + 0.2598 + 1.9900 + 1.9900 + 0.0000 + [torch.cuda.FloatTensor of size 16 (GPU 0)], + _0_ _0_ _0_ _0_ _0_ _0_ _0_ _0_ _1_ _1_ _0_ _2_ _1_ _1_ _0_ _0_ + [torch.cuda.LongTensor of size 16 (GPU 0)]) +``` + +Now you can see a list of all the assignments [ [1:01:05](https://youtu.be/0frKXR-2PBY%3Ft%3D1h1m5s) ]. Anywhere that has `gt_overlap < 0.5` gets assigned background. The three row-wise max anchor box has high number to force the assignments. Now we can combine these values to classes: + +``` + gt_clas = clas[gt_idx]; gt_clas +``` + +``` + Variable containing: + _8_ _8_ _8_ _8_ _8_ _8_ _8_ _8_ _10_ _10_ _8_ _17_ _10_ _10_ _8_ _8_ + [torch.cuda.LongTensor of size 16 (GPU 0)] +``` + +Then add a threshold and finally comes up with the three classes that are being predicted: + +``` + thresh = 0.5 pos = gt_overlap > thresh pos_idx = torch.nonzero(pos)[:,0] neg_idx = torch.nonzero(1-pos)[:,0] pos_idx +``` + +``` + _11_ _13_ _14_ + [torch.cuda.LongTensor of size 3 (GPU 0)] +``` + +And here are what each of these anchor boxes is meant to be predicting: + +``` + gt_clas[1-pos] = len(id2cat) [id2cat[o] if o<len(id2cat) else 'bg' for o in gt_clas.data] +``` + +``` + ['bg', + 'bg', + 'bg', + 'bg', + 'bg', + 'bg', + 'bg', + 'bg', + 'bg', + 'bg', + 'bg', + 'sofa', + 'bg', + 'diningtable', + 'chair', + 'bg'] +``` + +So that was the matching stage [ [1:02:29](https://youtu.be/0frKXR-2PBY%3Ft%3D1h2m29s) ]. For L1 loss, we can: + +1. take the activations which matched ( `pos_idx = [11, 13, 14]` ) +2. subtract from those the ground truth bounding boxes +3. 
take the absolute value of the difference +4. take the mean of that. + +For classifications, we can just do a cross entropy + +``` + gt_bbox = bbox[gt_idx] loc_loss = ((a_ic[pos_idx] - gt_bbox[pos_idx]).abs()).mean() clas_loss = F.cross_entropy(b_clasi, gt_clas) loc_loss,clas_loss +``` + +``` + (Variable containing: + 1.00000e-02 * + 6.5887 + [torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing: + 1.0331 + [torch.cuda.FloatTensor of size 1 (GPU 0)]) +``` + +We will end up with 16 predicted bounding boxes, most of them will be background. If you are wondering what it predicts in terms of bounding box of background, the answer is it totally ignores it. + +``` + fig, axes = plt.subplots(3, 4, figsize=(16, 12)) for idx,ax in enumerate(axes.flat): ima=md.val_ds.ds.denorm(to_np(x))[idx] bbox,clas = get_y(y[0][idx], y[1][idx]) ima=md.val_ds.ds.denorm(to_np(x))[idx] bbox,clas = get_y(bbox,clas); bbox,clas a_ic = actn_to_bb(b_bb[idx], anchors) torch_gt(ax, ima, a_ic, b_clas[idx].max(1)[1], b_clas[idx].max(1)[0].sigmoid(), 0.01) plt.tight_layout() +``` + +![](../img/1_8azTUd1Ujf3FQSMBwIXgAw.png) + +#### Tweak 1\. How do we interpret the activations [ [1:04:16](https://youtu.be/0frKXR-2PBY%3Ft%3D1h4m16s) ]? + +The way we interpret the activation is defined here: + +``` + def actn_to_bb(actn, anchors): actn_bbs = torch.tanh(actn) actn_centers = (actn_bbs[:,:2]/2 * grid_sizes) + anchors[:,:2] actn_hw = (actn_bbs[:,2:]/2+1) * anchors[:,2:] return hw2corners(actn_centers, actn_hw) +``` + +We grab the activations, we stick them through `tanh` (remember `tanh` is the same shape as sigmoid except it is scaled to be between -1 and 1) which forces it to be within that range. We then grab the actual position of the anchor boxes, and we will move them around according to the value of the activations divided by two ( `actn_bbs[:,:2]/2` ). In other words, each predicted bounding box can be moved by up to 50% of a grid size from where its default position is. Ditto for its height and width — it can be up to twice as big or half as big as its default size. + +#### Tweak 2\. We actually use binary cross entropy loss instead of cross entropy [ [1:05:36](https://youtu.be/0frKXR-2PBY%3Ft%3D1h5m36s) ] + +``` + class BCE_Loss (nn.Module): def __init__(self, num_classes): super().__init__() self.num_classes = num_classes def forward(self, pred, targ): t = one_hot_embedding(targ, self.num_classes+1) t = V(t[:,:-1].contiguous()) #.cpu() x = pred[:,:-1] w = self.get_weight(x,t) return F.binary_cross_entropy_with_logits(x, t, w, size_average= False )/self.num_classes def get_weight(self,x,t): return None +``` + +Binary cross entropy is what we normally use for multi-label classification. Like in the planet satellite competition, each satellite image could have multiple things. If it has multiple things in it, you cannot use softmax because softmax really encourages just one thing to have the high number. In our case, each anchor box can only have one object associated with it, so it is not for that reason that we are avoiding softmax. It is something else — which is it is possible for an anchor box to have nothing associated with it. There are two ways to handle this idea of “background”; one would be to say background is just a class, so let's use softmax and just treat background as one of the classes that the softmax could predict. A lot of people have done it this way. 
But that is a really hard thing to ask neural network to do [ [1:06:52](https://youtu.be/0frKXR-2PBY%3Ft%3D1h5m52s) ] — it is basically asking whether this grid cell does not have any of the 20 objects that I am interested with Jaccard overlap of more than 0.5\. It is a really hard to thing to put into a single computation. On the other hand, what if we just asked for each class; “is it a motorbike?” “is it a bus?”, “ is it a person?” etc and if all the answer is no, consider that background. That is the way we do it here. It is not that we can have multiple true labels, but we can have zero. + +In `forward` : + +1. First we take the one hot embedding of the target (at this stage, we do have the idea of background) +2. Then we remove the background column (the last one) which results in a vector either of all zeros or one one. +3. Use binary cross-entropy predictions. + +This is a minor tweak, but it is the kind of minor tweak that Jeremy wants you to think about and understand because it makes a really big difference to your training and when there is some increment over a previous paper, it would be something like this [ [1:08:25](https://youtu.be/0frKXR-2PBY%3Ft%3D1h8m25s) ]. It is important to understand what this is doing and more importantly why. + +So now we have [ [1:09:39](https://youtu.be/0frKXR-2PBY%3Ft%3D1h9m39s) ]: + +* A custom loss function +* A way to calculate Jaccard index +* A way to convert activations to bounding box +* A way to map anchor boxes to ground truth + +Now all it's left is SSD loss function. + +#### SSD Loss Function [ [1:09:55](https://youtu.be/0frKXR-2PBY%3Ft%3D1h9m55s) ] + +``` + def ssd_1_loss(b_c,b_bb,bbox,clas,print_it= False ): bbox,clas = get_y(bbox,clas) a_ic = actn_to_bb(b_bb, anchors) overlaps = jaccard(bbox.data, anchor_cnr.data) gt_overlap,gt_idx = map_to_ground_truth(overlaps,print_it) gt_clas = clas[gt_idx] pos = gt_overlap > 0.4 pos_idx = torch.nonzero(pos)[:,0] gt_clas[1-pos] = len(id2cat) gt_bbox = bbox[gt_idx] loc_loss = ((a_ic[pos_idx] - gt_bbox[pos_idx]).abs()).mean() clas_loss = loss_f(b_c, gt_clas) return loc_loss, clas_loss def ssd_loss(pred,targ,print_it= False ): lcs,lls = 0.,0\. for b_c,b_bb,bbox,clas in zip(*pred,*targ): loc_loss,clas_loss = ssd_1_loss(b_c,b_bb,bbox,clas,print_it) lls += loc_loss lcs += clas_loss if print_it: print(f'loc: {lls.data[0]} , clas: {lcs.data[0]} ') return lls+lcs +``` + +The `ssd_loss` function which is what we set as the criteria, it loops through each image in the mini-batch and call `ssd_1_loss` function (ie SSD loss for one image). + +`ssd_1_loss` is where it is all happening. It begins by de-structuring `bbox` and `clas` . Let's take a closer look at `get_y` [ [1:10:38](https://youtu.be/0frKXR-2PBY%3Ft%3D1h10m38s) ]: + +``` + def get_y(bbox,clas): bbox = bbox.view(-1,4)/sz bb_keep = ((bbox[:,2]-bbox[:,0])>0).nonzero()[:,0] return bbox[bb_keep],clas[bb_keep] +``` + +A lot of code you find on the internet does not work with mini-batches. It only does one thing at a time which we don't want. In this case, all these functions ( `get_y` , `actn_to_bb` , `map_to_ground_truth` ) is working on, not exactly a mini-batch at a time, but a whole bunch of ground truth objects at a time. The data loader is being fed a mini-batch at a time to do the convolutional layers. 
Because we can have _different numbers of ground truth objects in each image_ but a tensor has to be the strict rectangular shape, fastai automatically pads it with zeros (any target values that are shorter) [ [1:11:08](https://youtu.be/0frKXR-2PBY%3Ft%3D1h11m8s) ]. This was something that was added recently and super handy, but that does mean that you then have to make sure that you get rid of those zeros. So `get_y` gets rid of any of the bounding boxes that are just padding. + +1. Get rid of the padding +2. Turn the activations to bounding boxes +3. Do the Jaccard +4. Do map_to_ground_truth +5. Check that there is an overlap greater than something around 0.4~0.5 (different papers use different values for this) +6. Find the indices of things that matched +7. Assign background class for the ones that did not match +8. Then finally get L1 loss for the localization, binary cross entropy loss for the classification, and return them which gets added in `ssd_loss` + +#### Training [ [1:12:47](https://youtu.be/0frKXR-2PBY%3Ft%3D1h12m47s) ] + +``` + learn.crit = ssd_loss lr = 3e-3 lrs = np.array([lr/100,lr/10,lr]) +``` + +``` + learn.lr_find(lrs/1000,1.) learn.sched.plot(1) +``` + +``` + _epoch trn_loss val_loss_ + 0 44.232681 21476.816406 +``` + +![](../img/1_V8J7FkreIVG7tKxGQQRV2Q.png) + +``` + learn.lr_find(lrs/1000,1.) learn.sched.plot(1) +``` + +``` + _epoch trn_loss val_loss_ + 0 86.852668 32587.789062 +``` + +![](../img/1_-q583mkIy-e3k6dz5HmkYw.png) + +``` + learn.fit(lr, 1, cycle_len=5, use_clr=(20,10)) +``` + +``` + _epoch trn_loss val_loss_ + 0 45.570843 37.099854 + 1 37.165911 32.165031 + 2 33.27844 30.990122 + 3 31.12054 29.804482 + 4 29.305789 28.943184 +``` + +``` + [28.943184] +``` + +``` + learn.fit(lr, 1, cycle_len=5, use_clr=(20,10)) +``` + +``` + _epoch trn_loss val_loss_ + 0 43.726979 33.803085 + 1 34.771754 29.012939 + 2 30.591864 27.132868 + 3 27.896905 26.151638 + 4 25.907382 25.739273 +``` + +``` + [25.739273] +``` + +``` + learn.save('0') +``` + +``` + learn.load('0') +``` + +#### Result [ [1:13:16](https://youtu.be/0frKXR-2PBY%3Ft%3D1h13m16s) ] + +![](../img/1_8azTUd1Ujf3FQSMBwIXgAw.png) + +In practice, we want to remove the background and also add some threshold for probabilities, but it is on the right track. The potted plant image, the result is not surprising as all of our anchor boxes were small (4x4 grid). To go from here to something that is going to be more accurate, all we are going to do is to create way more anchor boxes. + +**Question** : For the multi-label classification, why aren't we multiplying the categorical loss by a constant like we did before [ [1:15:20](https://youtu.be/0frKXR-2PBY%3Ft%3D1h15m20s) ]? Great question. It is because later on it will turn out we do not need to. + +#### More anchors! [ [1:14:47](https://youtu.be/0frKXR-2PBY%3Ft%3D1h14m47s) ] + +There are 3 ways to do this: + +1. Create anchor boxes of different sizes (zoom): + +![](../img/1_OtrTSJqBXyjeypKehik1CQ.png) + +![](../img/1_YG5bCP3O-jVhaQX_wuiSSg.png) + +![](../img/1_QCo0wOgJKXDBYNlmE7zUmA.png) + +<figcaption class="imageCaption" style="width: 301.205%; left: -201.205%;">From left (1x1, 2x2, 4x4 grids of anchor boxes). Notice that some of the anchor box is bigger than the original image.</figcaption> + + + +2\. Create anchor boxes of different aspect ratios: + +![](../img/1_ko8vZK4RD8H2l4u1hXCQZQ.png) + +![](../img/1_3rvuvY6Fu2S6eoN3nK1QWg.png) + +![](../img/1_bWZwFqf2Bv-ZbW-KedNO0Q.png) + +3\. 
3\. Use more convolutional layers as sources of anchor boxes (the boxes are randomly jittered so that we can see the ones that are overlapping [ [1:16:28](https://youtu.be/0frKXR-2PBY%3Ft%3D1h16m28s) ]):

![](../img/1_LwFOFtmawmpqp6VDc56RmA.png)

Combining these approaches, you can create lots of anchor boxes (Jeremy said he wouldn't print it, but here it is):

![](../img/1_ymt8L0CCKMd9SG82SemdIA.png)

```
anc_grids = [4, 2, 1]
anc_zooms = [0.75, 1., 1.3]
anc_ratios = [(1., 1.), (1., 0.5), (0.5, 1.)]
anchor_scales = [(anz*i, anz*j) for anz in anc_zooms for (i,j) in anc_ratios]
k = len(anchor_scales)
anc_offsets = [1/(o*2) for o in anc_grids]
```

```
anc_x = np.concatenate([np.repeat(np.linspace(ao, 1-ao, ag), ag)
                        for ao,ag in zip(anc_offsets, anc_grids)])
anc_y = np.concatenate([np.tile(np.linspace(ao, 1-ao, ag), ag)
                        for ao,ag in zip(anc_offsets, anc_grids)])
anc_ctrs = np.repeat(np.stack([anc_x, anc_y], axis=1), k, axis=0)
```

```
anc_sizes = np.concatenate([np.array([[o/ag, p/ag] for i in range(ag*ag)
                                      for o,p in anchor_scales])
                            for ag in anc_grids])
grid_sizes = V(np.concatenate([np.array([1/ag for i in range(ag*ag)
                                         for o,p in anchor_scales])
                               for ag in anc_grids]),
               requires_grad=False).unsqueeze(1)
anchors = V(np.concatenate([anc_ctrs, anc_sizes], axis=1), requires_grad=False).float()
anchor_cnr = hw2corners(anchors[:,:2], anchors[:,2:])
```

`anchors` : centers plus height and width

`anchor_cnr` : top left and bottom right corners

#### Review of key concepts [ [1:18:00](https://youtu.be/0frKXR-2PBY%3Ft%3D1h18m) ]

![](../img/1_C67J9RhTAiz9MCD-ebpp_w.png)

* We have a vector of ground truth (sets of 4 bounding box coordinates and a class)
* We have a neural net that takes some input and spits out some output activations
* We compare the activations and the ground truth, calculate a loss, find the derivative of that, and adjust the weights according to the derivative times a learning rate
* We need a loss function that can take the ground truth and the activations and spit out a number that says how good these activations are. To do this, we need to take each one of the `m` ground truth objects and decide which set of `(4+c)` activations is responsible for that object [ [1:21:58](https://youtu.be/0frKXR-2PBY%3Ft%3D1h21m58s) ] — which one we should be comparing against to decide whether the class is correct and the bounding box is close or not (the matching problem)
* Since we are using the SSD approach, it is not arbitrary which ones we match up [ [1:23:18](https://youtu.be/0frKXR-2PBY%3Ft%3D1h23m18s) ]. We want to match up the set of activations whose receptive field has the maximum density at the place where the real object is
* The loss function needs to be a consistent task. If in the first image the top left object corresponds with the first `4+c` activations, and in the second image we threw things around and suddenly it is now going with the last `4+c` activations, the neural net doesn't know what to learn
* Once the matching problem is resolved, the rest is just the same as single object detection

Architectures:

* YOLO — the last layer is fully connected (no concept of geometry)
* SSD — the last layer is convolutional

#### k (zooms x ratios) [ [1:29:39](https://youtu.be/0frKXR-2PBY%3Ft%3D1h29m39s) ]

For every grid cell (which can be of different sizes), we can have different orientations and zooms representing different anchor boxes, which are just conceptual ideas: every one of the anchor boxes is associated with one set of `4+c` activations in our model. So however many anchor boxes we have, we need that number times `(4+c)` activations. That does not mean that each convolutional layer needs that many activations, because the 4x4 convolutional layer already has 16 sets of activations, the 2x2 layer has 4 sets, and finally the 1x1 layer has one set. So we basically get 1 + 4 + 16 for free. We therefore only need to know `k` , where `k` is the number of zooms times the number of aspect ratios, whereas the grids we get for free through our architecture.
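As a quick sanity check of that arithmetic, here is a sketch using the `anc_grids` , `anc_zooms` , and `anc_ratios` values defined above (the class count `c` is left symbolic, since it depends on the dataset):

```
anc_grids  = [4, 2, 1]                     # grid resolutions taken from the conv layers
anc_zooms  = [0.75, 1., 1.3]
anc_ratios = [(1., 1.), (1., 0.5), (0.5, 1.)]

k = len(anc_zooms) * len(anc_ratios)       # 9 anchor shapes per grid cell
n_cells = sum(g * g for g in anc_grids)    # 16 + 4 + 1 = 21 cells, free from the architecture
n_anchors = n_cells * k                    # 189 anchor boxes in total

# the head therefore has to produce n_anchors sets of (4 + c) activations,
# where c is however many class scores we output per anchor
print(k, n_cells, n_anchors)               # 9 21 189
```

With this configuration there are 189 anchor boxes, which is the number of sets of activations the concatenation in `SSD_MultiHead` below ends up producing.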
#### Model Architecture [ [1:31:10](https://youtu.be/0frKXR-2PBY%3Ft%3D1h31m10s) ]

```
drop = 0.4

class SSD_MultiHead(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(drop)
        self.sconv0 = StdConv(512, 256, stride=1, drop=drop)
        self.sconv1 = StdConv(256, 256, drop=drop)
        self.sconv2 = StdConv(256, 256, drop=drop)
        self.sconv3 = StdConv(256, 256, drop=drop)
        self.out1 = OutConv(k, 256, bias)
        self.out2 = OutConv(k, 256, bias)
        self.out3 = OutConv(k, 256, bias)

    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)
        x = self.sconv1(x)
        o1c, o1l = self.out1(x)
        x = self.sconv2(x)
        o2c, o2l = self.out2(x)
        x = self.sconv3(x)
        o3c, o3l = self.out3(x)
        return [torch.cat([o1c, o2c, o3c], dim=1),
                torch.cat([o1l, o2l, o3l], dim=1)]

head_reg4 = SSD_MultiHead(k, -4.)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam
```

The model is nearly identical to what we had before, but we have a number of stride 2 convolutions which take us through 4x4, 2x2, and 1x1 (each stride 2 convolution halves our grid size in both directions).

* After we do our first convolution to get to 4x4, we grab a set of outputs from that, because we want to save away the 4x4 anchors.
* Once we get to 2x2, we grab another set, now of 2x2 anchors.
* Then finally we get to 1x1.
* We then concatenate them all together, which gives us the correct number of activations (one set of activations for every anchor box).

#### Training [ [1:32:50](https://youtu.be/0frKXR-2PBY%3Ft%3D1h32m50s) ]

```
learn.crit = ssd_loss
lr = 1e-2
lrs = np.array([lr/100, lr/10, lr])
```

```
learn.lr_find(lrs/1000, 1.)
learn.sched.plot(n_skip_end=2)
```

![](../img/1_jB_OxbaTmMXHbkeXE4G0SQ.png)

```
learn.fit(lrs, 1, cycle_len=4, use_clr=(20,8))
```

```
epoch      trn_loss    val_loss
0          15.124349   15.015433
1          13.091956   10.39855
2          11.643629   9.4289
3          10.532467   8.822998
```

```
[8.822998]
```

```
learn.save('tmp')
```

```
learn.freeze_to(-2)
learn.fit(lrs/2, 1, cycle_len=4, use_clr=(20,8))
```

```
epoch      trn_loss    val_loss
0          9.821056    10.335152
1          9.419633    11.834093
2          8.78818     7.907762
3          8.219976    7.456364
```

```
[7.4563637]
```

```
x, y = next(iter(md.val_dl))
y = V(y)
batch = learn.model(V(x))
b_clas, b_bb = batch
x = to_np(x)

fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for idx, ax in enumerate(axes.flat):
    ima = md.val_ds.ds.denorm(x)[idx]
    bbox, clas = get_y(y[0][idx], y[1][idx])
    a_ic = actn_to_bb(b_bb[idx], anchors)
    torch_gt(ax, ima, a_ic, b_clas[idx].max(1)[1],
             b_clas[idx].max(1)[0].sigmoid(), 0.2)
plt.tight_layout()
```

Here, we printed out those detections with a probability of at least `0.2` . Some of them look pretty hopeful, but others not so much.
![](../img/1_l168j5d3fWBZLST3XLPD6A.png)

### History of object detection [ [1:33:43](https://youtu.be/0frKXR-2PBY%3Ft%3D1h33m43s) ]

![](../img/1_bQPvoI0soxtlBt1cEZlzcQ.png)

[Scalable Object Detection using Deep Neural Networks](https://arxiv.org/abs/1312.2249)

* When people refer to the multi-box method, they are talking about this paper.
* This was the paper that came up with the idea that we can have a loss function with this matching process, and that we can then use that loss function to do object detection. So everything since then has been trying to figure out how to make this better.

[Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)

* In parallel, Ross Girshick was going down a totally different direction. He had a two-stage process where the first stage used classical computer vision approaches to find edges and changes of gradients to guess which parts of the image might represent distinct objects. Each of those was then fed into a convolutional neural network which was basically designed to figure out whether it is the kind of object we are interested in.
* R-CNN and Fast R-CNN are hybrids of traditional computer vision and deep learning.
* What Ross and his team then did was take the multibox idea and replace the traditional, non-deep-learning computer vision part of their two-stage process with a conv net. So now they have two conv nets: one for region proposals (all of the things that might be objects), and the second part was the same as his earlier work.

[You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)

[SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325)

* At around the same time, these papers came out. Both of them did something pretty cool: they achieved similar performance to Faster R-CNN but with one stage.
* They took the multibox idea and tried to figure out how to deal with its messy outputs. The basic ideas were to use, for example, hard negative mining, where they would go through and find all of the matches that did not look that good and throw them away, to use very tricky and complex data augmentation methods, and all kinds of hackery. But they got them to work pretty well.

[Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) (RetinaNet)

* Then something really cool happened late last year, which is this thing called focal loss.
* They actually realized why this messy thing wasn't working. When we look at an image, there are 3 different granularities of convolutional grid (4x4, 2x2, 1x1) [ [1:37:28](https://youtu.be/0frKXR-2PBY%3Ft%3D1h37m28s) ]. The 1x1 grid is quite likely to have a reasonable overlap with some object, because most photos have some kind of main subject. On the other hand, in the 4x4 grid, most of the 16 anchor boxes are not going to have much of an overlap with anything. So if somebody were to say to you “$20 bet, what do you reckon this little clip is?” and you were not sure, you would say “background”, because most of the time it is the background.

**Question** : I understand why we have a 4x4 grid of receptive fields with 1 anchor box each to coarsely localize objects in the image. But what I think I'm missing is why we need multiple receptive fields at different sizes. The first version already included 16 receptive fields, each with a single associated anchor box. With the additions, there are now many more anchor boxes to consider.
Is this because you constrained how much a receptive field could move or scale from its original size? Or is there another reason? [ [1:38:47](https://youtu.be/0frKXR-2PBY%3Ft%3D1h38m47s) ] It is kind of backwards. The reason Jeremy did the constraining was because he knew he was going to be adding more boxes later. But really, the reason is that the Jaccard overlap between one of those 4x4 grid cells and a picture in which a single object takes up most of the image is never going to reach 0.5. The intersection is much smaller than the union because the object is too big. So for this general idea to work, where we say you are responsible for something you have better than 50% overlap with, we need anchor boxes which will, on a regular basis, have a 50% or higher overlap, which means we need a variety of sizes, shapes, and scales. This all happens in the loss function. The vast majority of the interesting stuff in all of object detection is the loss function.

#### Focal Loss [ [1:40:38](https://youtu.be/0frKXR-2PBY%3Ft%3D1h40m38s) ]

![](../img/1_6Bood7G6dUuhigy9cxkZ-Q.png)

The key thing is this very first picture. The blue line is the binary cross entropy loss. If the answer is not a motorbike [ [1:41:46](https://youtu.be/0frKXR-2PBY%3Ft%3D1h41m46s) ], and I said “I think it's not a motorbike and I am 60% sure”, then with the blue line the loss is still about 0.5, which is pretty bad. So if we want to get our loss down, then for all these things which are actually background, we have to be saying “I am sure that is background”, “I am sure it's not a motorbike, or a bus, or a person” — because if I don't say I am sure it is not any of these things, then I still get loss.

That is why the motorbike example did not work [ [1:42:39](https://youtu.be/0frKXR-2PBY%3Ft%3D1h42m39s) ]. Even when it gets to the lower right corner and it wants to say “I think it's a motorbike”, there is no payoff for it to say so. If it is wrong, it gets killed. And the vast majority of the time, it is background. Even when it is not background, it is not enough just to say “it's not background” — you have to say which of the 20 things it is.

So the trick is to try to find a different loss function [ [1:44:00](https://youtu.be/0frKXR-2PBY%3Ft%3D1h44m) ] that looks more like the purple line. Focal loss is literally just a scaled cross entropy loss. Now if we say “I'm .6 sure it's not a motorbike”, then the loss function will say “good for you! no worries” [ [1:44:42](https://youtu.be/0frKXR-2PBY%3Ft%3D1h44m42s) ].

The actual contribution of this paper is to add `(1 − pt)^γ` to the start of the equation [ [1:45:06](https://youtu.be/0frKXR-2PBY%3Ft%3D1h45m6s) ], which sounds like nothing, but actually people have been trying to figure out this problem for years. When you come across a game-changing paper like this, you should not assume you are going to have to write thousands of lines of code. Very often it is one line of code, or the change of a single constant, or adding log to a single place.

A couple of terrific things about this paper [ [1:46:08](https://youtu.be/0frKXR-2PBY%3Ft%3D1h46m8s) ]:

* The equations are written in a simple manner
* They “refactor”

#### Implementing Focal Loss [ [1:49:27](https://youtu.be/0frKXR-2PBY%3Ft%3D1h49m27s) ]:

![](../img/1_wIp0HYEWPnkiuxLeCfEiAg.png)

Remember, -log(pt) is the cross entropy loss and focal loss is just a scaled version of it.
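As a tiny numeric sketch of that scaling (assuming `gamma = 2` and ignoring the `alpha` class weighting that the paper also uses):

```
import math

def focal(pt, gamma=2.):
    # plain cross entropy is -log(pt); focal loss scales it by (1 - pt)**gamma
    ce = -math.log(pt)
    return ce, (1 - pt)**gamma * ce

print(focal(0.9))   # easy example:  (0.105..., 0.001...) - its loss almost vanishes
print(focal(0.1))   # hard example:  (2.302..., 1.865...) - its loss is mostly kept
```

An easy, already-correct background prediction keeps almost none of its loss, while a confident mistake keeps most of it, which is the behaviour the purple line is getting at.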
When we defined the binary cross entropy loss, you may have noticed that there was a weight which by default was `None` :

```
class BCE_Loss(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def forward(self, pred, targ):
        t = one_hot_embedding(targ, self.num_classes+1)
        t = V(t[:,:-1].contiguous())  #.cpu()
        x = pred[:,:-1]
        w = self.get_weight(x, t)
        return F.binary_cross_entropy_with_logits(x, t, w,
                        size_average=False)/self.num_classes

    def get_weight(self, x, t): return None
```

When you call `F.binary_cross_entropy_with_logits` , you can pass in a weight. Since we just want to multiply the cross entropy by something, we can just define `get_weight` . Here is the entirety of focal loss [ [1:50:23](https://youtu.be/0frKXR-2PBY%3Ft%3D1h50m23s) ]:

```
class FocalLoss(BCE_Loss):
    def get_weight(self, x, t):
        alpha, gamma = 0.25, 2.
        p = x.sigmoid()
        pt = p*t + (1-p)*(1-t)
        w = alpha*t + (1-alpha)*(1-t)
        return w * (1-pt).pow(gamma)
```

If you were wondering why alpha and gamma are 0.25 and 2, here is another excellent thing about this paper: they tried lots of different values and found that these work well:

![](../img/1_qFPRvFHQMQplSJGp3QLiNA.png)

#### Training [ [1:51:25](https://youtu.be/0frKXR-2PBY%3Ft%3D1h51m25s) ]

```
learn.lr_find(lrs/1000, 1.)
learn.sched.plot(n_skip_end=2)
```

![](../img/1_lQPSR3V2IXbxOpcgNE-U-Q.png)

```
learn.fit(lrs, 1, cycle_len=10, use_clr=(20,10))
```

```
epoch      trn_loss    val_loss
0          24.263046   28.975235
1          20.459562   16.362392
2          17.880827   14.884829
3          15.956896   13.676485
4          14.521345   13.134197
5          13.460941   12.594139
6          12.651842   12.069849
7          11.944972   11.956457
8          11.385798   11.561226
9          10.988802   11.362164
```

```
[11.362164]
```

```
learn.save('fl0')
learn.load('fl0')
```

```
learn.freeze_to(-2)
learn.fit(lrs/4, 1, cycle_len=10, use_clr=(20,10))
```

```
epoch      trn_loss    val_loss
0          10.871668   11.615532
1          10.908461   11.604334
2          10.549796   11.486127
3          10.130961   11.088478
4          9.70691     10.72144
5          9.319202    10.600481
6          8.916653    10.358334
7          8.579452    10.624706
8          8.274838    10.163422
9          7.994316    10.108068
```

```
[10.108068]
```

```
learn.save('drop4')
learn.load('drop4')
```

```
plot_results(0.75)
```

![](../img/1_G4HCc1mpkvHFqbhrb5Uwpw.png)

This time things are looking quite a bit better. So our last step, for now, is to figure out how to pull out just the interesting ones.

#### Non Maximum Suppression [ [1:52:15](https://youtu.be/0frKXR-2PBY%3Ft%3D1h52m15s) ]

All we are going to do is go through every pair of these bounding boxes, and if they overlap by more than some amount (say 0.5, using Jaccard) and they are both predicting the same class, we are going to assume they are the same thing and pick the one with the higher `p` value.

It is really boring code; Jeremy didn't write it himself, he copied somebody else's. There is no particular reason to go through it.
```
def nms(boxes, scores, overlap=0.5, top_k=100):
    keep = scores.new(scores.size(0)).zero_().long()
    if boxes.numel() == 0: return keep
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    area = torch.mul(x2 - x1, y2 - y1)
    v, idx = scores.sort(0)  # sort in ascending order
    idx = idx[-top_k:]       # indices of the top-k largest vals
    xx1 = boxes.new()
    yy1 = boxes.new()
    xx2 = boxes.new()
    yy2 = boxes.new()
    w = boxes.new()
    h = boxes.new()

    count = 0
    while idx.numel() > 0:
        i = idx[-1]  # index of current largest val
        keep[count] = i
        count += 1
        if idx.size(0) == 1: break
        idx = idx[:-1]  # remove kept element from view
        # load bboxes of next highest vals
        torch.index_select(x1, 0, idx, out=xx1)
        torch.index_select(y1, 0, idx, out=yy1)
        torch.index_select(x2, 0, idx, out=xx2)
        torch.index_select(y2, 0, idx, out=yy2)
        # store element-wise max with next highest score
        xx1 = torch.clamp(xx1, min=x1[i])
        yy1 = torch.clamp(yy1, min=y1[i])
        xx2 = torch.clamp(xx2, max=x2[i])
        yy2 = torch.clamp(yy2, max=y2[i])
        w.resize_as_(xx2)
        h.resize_as_(yy2)
        w = xx2 - xx1
        h = yy2 - yy1
        # check sizes of xx1 and xx2.. after each iteration
        w = torch.clamp(w, min=0.0)
        h = torch.clamp(h, min=0.0)
        inter = w*h
        # IoU = i / (area(a) + area(b) - i)
        rem_areas = torch.index_select(area, 0, idx)  # load remaining areas
        union = (rem_areas - inter) + area[i]
        IoU = inter/union
        # keep only elements with an IoU <= overlap
        idx = idx[IoU.le(overlap)]
    return keep, count
```

```
def show_nmf(idx):
    ima = md.val_ds.ds.denorm(x)[idx]
    bbox, clas = get_y(y[0][idx], y[1][idx])
    a_ic = actn_to_bb(b_bb[idx], anchors)
    clas_pr, clas_ids = b_clas[idx].max(1)
    clas_pr = clas_pr.sigmoid()

    conf_scores = b_clas[idx].sigmoid().t().data

    out1, out2, cc = [], [], []
    for cl in range(0, len(conf_scores)-1):
        c_mask = conf_scores[cl] > 0.25
        if c_mask.sum() == 0: continue
        scores = conf_scores[cl][c_mask]
        l_mask = c_mask.unsqueeze(1).expand_as(a_ic)
        boxes = a_ic[l_mask].view(-1, 4)
        ids, count = nms(boxes.data, scores, 0.4, 50)
        ids = ids[:count]
        out1.append(scores[ids])
        out2.append(boxes.data[ids])
        cc.append([cl]*count)
    cc = T(np.concatenate(cc))
    out1 = torch.cat(out1)
    out2 = torch.cat(out2)

    fig, ax = plt.subplots(figsize=(8,8))
    torch_gt(ax, ima, out2, cc, out1, 0.1)
```

```
for i in range(12): show_nmf(i)
```

![](../img/1_MXk2chJJEcjOz8hMn1ZsOQ.png)

![](../img/1_Fj9fK3G6iXBsGI_XJrxXyg.png)

![](../img/1_6p3dm-i-YxC9QkxouHJdoA.png)

![](../img/1_nkEpAd2_H4lG1vQfnCJn4Q.png)

![](../img/1_THGq5C21NaP92vw5E_QNdA.png)

![](../img/1_0wckbiUSax2JpBlgJxJ05g.png)

![](../img/1_EWbNGEQFvYMgC4PSaLe8Ww.png)

![](../img/1_vTRCVjln4vkma1R6eBeSwA.png)

![](../img/1_3Q01FZuzfptkYrekJiGm1g.png)

![](../img/1_-cD3LQIG9FnyJbt0cnpbNg.png)

![](../img/1_Hkgs1u9PFH9ZrTKL8YBW2Q.png)

![](../img/1_uyTNlp61jcyaW9knbnNSEw.png)

There are some things still to fix here [ [1:53:43](https://youtu.be/0frKXR-2PBY%3Ft%3D1h53m43s) ]. The trick will be to use something called a feature pyramid. That is what we are going to do in lesson 14.

#### Talking a little more about the SSD paper [ [1:54:03](https://youtu.be/0frKXR-2PBY%3Ft%3D1h54m3s) ]

When this paper came out, Jeremy was excited because this and YOLO were the first kind of single-pass, good quality object detection methods to come along.
There has been a continuous repetition of history in the deep learning world: approaches that involve multiple passes over multiple different pieces, particularly where they involve some non-deep-learning pieces (as R-CNN did), always get turned, over time, into a single end-to-end deep learning model. So I tend to ignore them until that happens, because that is the point where people have figured out how to express the whole thing as a deep learning model, and as soon as they do that they generally end up with something much faster and much more accurate. So SSD and YOLO were really important.

The model is 4 paragraphs. Papers are really concise, which means you need to read them pretty carefully. Partly, though, you need to know which bits to read carefully. The bits where they say “here we are going to prove the error bounds on this model” you can ignore, because you don't care about proving error bounds. But the bit which says here is what the model is, you need to read really carefully.

Jeremy reads section **2.1 Model** [ [1:56:37](https://youtu.be/0frKXR-2PBY%3Ft%3D1h56m37s) ]

If you jump straight in and read a paper like this, these 4 paragraphs would probably make no sense. But now that we have gone through it, you read those and hopefully think “oh, that's just what Jeremy said, only they said it better than Jeremy and in fewer words” [ [2:00:37](https://youtu.be/0frKXR-2PBY%3Ft%3D2h37s) ]. If you start to read a paper and go “what the heck”, the trick is to start reading back over the citations.

Jeremy reads **Matching strategy** and **Training objective** (aka the loss function) [ [2:01:44](https://youtu.be/0frKXR-2PBY%3Ft%3D2h1m44s) ]

#### Some paper tips [ [2:02:34](https://youtu.be/0frKXR-2PBY%3Ft%3D2h2m34s) ]

[Scalable Object Detection using Deep Neural Networks](https://arxiv.org/pdf/1312.2249.pdf)

* “Training objective” is the loss function
* Double bars with the two 2's, like this, mean Mean Squared Error

![](../img/1_LubBtX9ODFMBgI34bFHtdw.png)

* log(c) and log(1-c), and x and (1-x) — they are all the pieces of binary cross entropy:

![](../img/1_3Xq3HB72jsVKI7uHOHzRDQ.png)

This week, go through the code and go through the paper and see what is going on. Remember what Jeremy did to make it easier for you: he took that loss function, copied it into a cell, and split it up so that each bit was in a separate cell. Then after every cell, he printed or plotted that value. Hopefully this is a good starting point.
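Coming back to those pieces: as a reading aid only (not code from the lesson notebook), here is a minimal sketch of how x, (1 - x), log(c), and log(1 - c) combine into binary cross entropy:

```
import math

def binary_cross_entropy(x, c):
    # x is the 0/1 label, c is the predicted confidence that the label is 1
    return -(x * math.log(c) + (1 - x) * math.log(1 - c))

print(binary_cross_entropy(1, 0.9))   # confident and right: ~0.105
print(binary_cross_entropy(0, 0.9))   # confident and wrong: ~2.303
```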