Commit af12ca66 authored by dyonghan, committed by Gitee

!29 resnet50

Merge pull request !29 from yangyaqin/resnet50
## Lab Introduction
This lab walks through training ResNet50 on the CIFAR-10 dataset with MindSpore. It uses the ResNet50 model definition from the MindSpore model_zoo, together with the training script from the MindSpore tutorial [Using MindSpore on the Cloud](https://www.mindspore.cn/tutorial/zh-CN/r0.5/advanced_use/use_on_the_cloud.html).
## Lab Objectives
## Lab Environment
- MindSpore 0.5.0 (MindSpore versions are updated periodically, and this guide is refreshed accordingly to stay in sync);
- Huawei Cloud ModelArts: a one-stop AI development platform for developers, with an integrated pool of Ascend AI processor resources, on which you can try out MindSpore.
## Lab Preparation
### Dataset Preparation
CIFAR-10 is an image-classification dataset of 60,000 32x32 color images: 50,000 for training and 10,000 for testing, spread over 10 classes with 6,000 images each.
- Option 1: download "CIFAR-10 binary version (suitable for C programs)" from the [CIFAR-10 website](http://www.cs.toronto.edu/~kriz/cifar.html) and unzip it locally.
- Option 2: download the [CIFAR-10 dataset](https://share-course.obs.cn-north-4.myhuaweicloud.com/dataset/cifar10.zip) from Huawei Cloud OBS and unzip it.
### Script Preparation
Download the relevant scripts from the [MindSpore tutorial repository](https://gitee.com/mindspore/docs/tree/r0.5/tutorials/tutorial_code/sample_for_cloud).
### Uploading Files
Organize the scripts and the dataset as follows and upload them to OBS:
```
experiment_3
├── dataset.py
├── resnet.py
├── resnet50_train.py
└── cifar10
    ├── batches.meta.txt
    ├── data_batch_1.bin
    ├── data_batch_2.bin
    ├── data_batch_3.bin
    ├── data_batch_4.bin
    ├── data_batch_5.bin
    ├── readme.html
    └── test_batch.bin
```
### Code Walkthrough
- resnet50_train.py: the main script, containing the `PerformanceCallback` performance monitor, the `get_lr` dynamic learning-rate generator, the `resnet50_train` execution function, and the main entry point;
- dataset.py: the data-processing script;
- resnet.py: the ResNet model-definition script, containing the `ResidualBlock` building-block class, the `ResNet` class, and the `resnet50`/`resnet101` constructor functions.
#### resnet50_train.py Walkthrough
`PerformanceCallback` subclasses MindSpore's `Callback` class and measures the latency of each training step.
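The full definition is not reproduced in this excerpt. A minimal sketch of such a callback, assuming a `batch_size` constructor argument and simplified print fields (the actual script also reports the cluster-wide samples-per-second figure seen in the logs below), might look like this:

```python
import time

from mindspore.train.callback import Callback


class PerformanceCallback(Callback):
    """Report the latency of each training step (illustrative sketch)."""

    def __init__(self, batch_size):
        super(PerformanceCallback, self).__init__()
        self.batch_size = batch_size
        self.step_start = None

    def step_begin(self, run_context):
        # record the wall-clock time when the step starts
        self.step_start = time.time()

    def step_end(self, run_context):
        # latency of this step, and the implied per-device throughput
        elapsed = time.time() - self.step_start
        print('one step time: {:.3f} ms, samples per second: {:.1f}'.format(
            elapsed * 1000, self.batch_size / elapsed))
```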
`get_lr` precomputes a learning-rate value for every training step and returns the whole schedule as a `learning_rate` array, which can then be wrapped in a `Tensor` and handed to the optimizer.
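A sketch of such a schedule, assuming a linear warmup followed by cosine decay (the script's exact constants and decay shape may differ):

```python
import numpy as np


def get_lr(global_step, total_epochs, steps_per_epoch,
           lr_init=0.01, lr_max=0.1, warmup_epochs=5):
    """Precompute a per-step learning rate: linear warmup, then cosine decay.

    A sketch only: parameter names mirror common MindSpore examples and are
    not guaranteed to match the tutorial script exactly.
    """
    total_steps = steps_per_epoch * total_epochs
    warmup_steps = steps_per_epoch * warmup_epochs
    lr_each_step = []
    for i in range(total_steps):
        if i < warmup_steps:
            # ramp up linearly from lr_init to lr_max
            lr = lr_init + (lr_max - lr_init) * i / warmup_steps
        else:
            # cosine decay from lr_max down to 0
            progress = (i - warmup_steps) / (total_steps - warmup_steps)
            lr = lr_max * 0.5 * (1.0 + np.cos(np.pi * progress))
        lr_each_step.append(lr)
    # drop steps already consumed, e.g. when resuming training
    learning_rate = np.array(lr_each_step[global_step:]).astype(np.float32)
    return learning_rate
```

The `resnet50_train` execution function wires these pieces together with the dataset and the network. A rough, hedged outline, assuming dataset.py exposes a `create_dataset(dataset_path, do_train, repeat_num, batch_size)` helper as in the official cloud tutorial, and using illustrative paths and batch size:

```python
from mindspore import context, Tensor
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.callback import LossMonitor

from dataset import create_dataset  # assumed helper name from dataset.py
from resnet import resnet50


def resnet50_train(args):
    epoch_size = args.num_epochs
    batch_size = 32          # illustrative value
    data_path = 'cifar10/'   # where the OBS data was copied (illustrative)

    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

    train_ds = create_dataset(data_path, do_train=True,
                              repeat_num=1, batch_size=batch_size)

    net = resnet50(class_num=10)
    lr = Tensor(get_lr(global_step=0, total_epochs=epoch_size,
                       steps_per_epoch=train_ds.get_dataset_size()))
    opt = nn.Momentum(net.trainable_params(), learning_rate=lr, momentum=0.9)
    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')

    model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
    model.train(epoch_size, train_ds,
                callbacks=[PerformanceCallback(batch_size), LossMonitor()],
                dataset_sink_mode=True)

    eval_ds = create_dataset(data_path, do_train=False,
                             repeat_num=1, batch_size=batch_size)
    print('Evaluation result:', model.eval(eval_ds))
```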
#### dataset.py Walkthrough
MindSpore can read the CIFAR-10 dataset directly:
```python
if device_num == 1 or not do_train:
    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle)
else:
    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle,
                           num_shards=device_num, shard_id=device_id)
```
Data augmentation, such as random cropping and random horizontal flipping, is then applied:
```python
# define map operations
random_crop_op = C.RandomCrop((32, 32), (4, 4, 4, 4))
random_horizontal_flip_op = C.RandomHorizontalFlip(device_id / (device_id + 1))

resize_op = C.Resize((resize_height, resize_width))
rescale_op = C.Rescale(rescale, shift)
normalize_op = C.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])

change_swap_op = C.HWC2CHW()

trans = []
if do_train:
    trans += [random_crop_op, random_horizontal_flip_op]

trans += [resize_op, rescale_op, normalize_op, change_swap_op]

type_cast_op = C2.TypeCast(mstype.int32)

ds = ds.map(input_columns="label", num_parallel_workers=8, operations=type_cast_op)
ds = ds.map(input_columns="image", num_parallel_workers=8, operations=trans)
```
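After the map operations, the pipeline is completed with batching and repetition before it is handed to the trainer. A sketch, assuming `batch_size` and `repeat_num` are parameters of the surrounding function:

```python
# form complete batches (dropping the remainder keeps step counts stable)
ds = ds.batch(batch_size, drop_remainder=True)
# repeat the dataset so Model.train can iterate for multiple passes
ds = ds.repeat(repeat_num)
```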
#### resnet.py Walkthrough
Every ResNet variant consists of five stages. For ResNet50 the structure is Conv×1 -> ResidualBlock×3 -> ResidualBlock×4 -> ResidualBlock×6 -> ResidualBlock×3 -> Pooling+FC.
![ResNet Architectures](images/resnet_archs.png)
[2] Figure from https://arxiv.org/pdf/1512.03385.pdf
The `ResidualBlock` (residual block) is the basic unit from which ResNet is assembled; it is defined as follows.
```python
class ResidualBlock(nn.Cell):
    """
    ResNet V1 residual block definition.

    Args:
        in_channel (int): Input channel.
        out_channel (int): Output channel.
        stride (int): Stride size for the first convolutional layer. Default: 1.

    Returns:
        Tensor, output tensor.

    Examples:
        >>> ResidualBlock(3, 256, stride=2)
    """
    expansion = 4

    def __init__(self,
                 in_channel,
                 out_channel,
                 stride=1):
        super(ResidualBlock, self).__init__()
        # ... the 1x1 -> 3x3 -> 1x1 convolution and batch-norm layers ...
        self.relu = nn.ReLU()

        self.down_sample = False
        if stride != 1 or in_channel != out_channel:
            self.down_sample = True

        self.down_sample_layer = None
        if self.down_sample:
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride),
                                                        _bn(out_channel)])

    def construct(self, x):
        identity = x
        # ... the main path (1x1 -> 3x3 -> 1x1 convolutions) computes `out` ...

        # ResNet50 downsamples the shortcut branch at the first block of a
        # stage, where stride != 1 or the channel count changes
        if self.down_sample:
            identity = self.down_sample_layer(identity)

        # `out` is the residual branch; `identity` is the shortcut branch
        out = self.add(out, identity)
        out = self.relu(out)

        return out
```
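For example, with the ResNet50 parameters used later in this file, the first block of stage 3 widens the channels from 512 to 1024 and halves the spatial resolution, which is exactly the case where the downsampling shortcut is built:

```python
# first block of stage 3 in ResNet50: stride 2 and a channel change,
# so down_sample is True and the 1x1-conv shortcut is created
block = ResidualBlock(512, 1024, stride=2)
```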
The `ResNet` class is defined below. Its constructor parameters are:

- layer_nums: number of ResidualBlocks in each stage (list)
- in_channels: input channel count of each stage (list)
- out_channels: output channel count of each stage (list)
- strides: stride of the convolution in each stage (list)
- num_classes: number of image classes (int)

> **Note:**
>
> - A "stage" here is not the literal depth of the network; it merely divides ResNet into parts, each made of several ResidualBlocks.
> - layer_nums, in_channels, out_channels, and strides must all have the same length.
> - Different parameters produce different networks; ResNet50 and ResNet101 are the typical ones, both defined in resnet.py. You can try designing a new network with custom parameters (see the sketch after the `resnet50` definition below).
```python
class ResNet(nn.Cell):
    """
    ResNet architecture.

    Args:
        block (Cell): Block for network.
        layer_nums (list): Numbers of block in different layers.
        in_channels (list): Input channel in each layer.
        out_channels (list): Output channel in each layer.
        strides (list): Stride size in each layer.
        num_classes (int): The number of classes that the training images are belonging to.

    Returns:
        Tensor, output tensor.

    Examples:
        >>> ResNet(ResidualBlock,
        >>>        [3, 4, 6, 3],
        >>>        [64, 256, 512, 1024],
        >>>        [256, 512, 1024, 2048],
        >>>        [1, 2, 2, 2],
        >>>        10)
    """

    def __init__(self,
                 block,
                 layer_nums,
                 in_channels,
                 out_channels,
                 strides,
                 num_classes):
        super(ResNet, self).__init__()

        if not len(layer_nums) == len(in_channels) == len(out_channels) == 4:
            raise ValueError("the length of layer_num, in_channels, out_channels list must be 4!")

        self.conv1 = _conv7x7(3, 64, stride=2)
        self.bn1 = _bn(64)
        self.relu = P.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same")

        self.layer1 = self._make_layer(block,
                                       layer_nums[0],
                                       in_channel=in_channels[0],
                                       out_channel=out_channels[0],
                                       stride=strides[0])
        self.layer2 = self._make_layer(block,
                                       layer_nums[1],
                                       in_channel=in_channels[1],
                                       out_channel=out_channels[1],
                                       stride=strides[1])
        self.layer3 = self._make_layer(block,
                                       layer_nums[2],
                                       in_channel=in_channels[2],
                                       out_channel=out_channels[2],
                                       stride=strides[2])
        self.layer4 = self._make_layer(block,
                                       layer_nums[3],
                                       in_channel=in_channels[3],
                                       out_channel=out_channels[3],
                                       stride=strides[3])

        self.mean = P.ReduceMean(keep_dims=True)
        self.flatten = nn.Flatten()
        self.end_point = _fc(out_channels[3], num_classes)

    def _make_layer(self, block, layer_num, in_channel, out_channel, stride):
        """
        Make stage network of ResNet.

        Args:
            block (Cell): Resnet block.
            layer_num (int): Layer number.
            in_channel (int): Input channel.
            out_channel (int): Output channel.
            stride (int): Stride size for the first convolutional layer.

        Returns:
            SequentialCell, the output layer.

        Examples:
            >>> _make_layer(ResidualBlock, 3, 128, 256, 2)
        """
        layers = []

        resnet_block = block(in_channel, out_channel, stride=stride)
        layers.append(resnet_block)

        for _ in range(1, layer_num):
            resnet_block = block(out_channel, out_channel, stride=1)
            layers.append(resnet_block)

        return nn.SequentialCell(layers)

    def construct(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        c1 = self.maxpool(x)

        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)

        out = self.mean(c5, (2, 3))
        out = self.flatten(out)
        out = self.end_point(out)

        return out
```
The `resnet50` constructor function is defined as follows:
```python
def resnet50(class_num=10):
    """
    Get ResNet50 neural network.

    Args:
        class_num (int): Class number.

    Returns:
        Cell, cell instance of ResNet50 neural network.

    Examples:
        >>> net = resnet50(10)
    """
    return ResNet(ResidualBlock,
                  [3, 4, 6, 3],
                  [64, 256, 512, 1024],
                  [256, 512, 1024, 2048],
                  [1, 2, 2, 2],
                  class_num)
```
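As noted earlier, you can design a new network just by changing the constructor arguments. A hypothetical shallower variant, for illustration only (not part of resnet.py), could reuse the same stage widths with fewer blocks per stage:

```python
def resnet26(class_num=10):
    """A hypothetical ResNet-26-style network for experimentation."""
    return ResNet(ResidualBlock,
                  [2, 2, 2, 2],          # fewer blocks per stage than ResNet50
                  [64, 256, 512, 1024],
                  [256, 512, 1024, 2048],
                  [1, 2, 2, 2],
                  class_num)
```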
### Adapting the Script for Training Jobs
When a training job is created, its runtime parameters are passed to the script as command-line arguments, so the script must parse them before the corresponding values can be used in code. For example, data_url and train_url correspond to the data-storage path (an OBS path) and the training-output path (an OBS path), respectively. The script parses the arguments into the `args` variable, which the subsequent code can then use.
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
parser.add_argument('--num_epochs', type=int, default=90, help='Number of training epochs.')
args, unknown = parser.parse_known_args()
```
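Note that the script calls `parse_known_args()` rather than `parse_args()`: ModelArts may append additional platform arguments to the launch command, and `parse_known_args()` collects the arguments the script declares while tolerating the rest instead of exiting with an error.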
MindSpore does not yet provide an interface for reading OBS data directly, so the script interacts with OBS through the API provided by MoXing.
**Option 1**
- Copy the dataset from an OBS bucket under your own account into the execution container:
```python
import moxing as mox
mox.file.copy_parallel(src_url=args.data_url, dst_url='cifar10/')
```
- To copy training outputs (such as model Checkpoints) from the execution container back to your own OBS bucket:
```python
import moxing as mox
mox.file.copy_parallel(src_url='output', dst_url='s3://OBS/PATH')
```
**Option 2**
- Copy a dataset from an OBS bucket under someone else's account. This requires that the bucket be set to public read / public read-write, and that you have the owner's Access Key, Secret Access Key, and the bucket Endpoint (OBS bucket -> Overview -> Basic Information -> Endpoint):
```python
import moxing as mox
# set moxing/obs auth info, ak:Access Key Id, sk:Secret Access Key, server:endpoint of obs bucket
mox.file.set_auth(ak='VCT2GKI3GJOZBQYJG5WM', sk='t1y8M4Z6bHLSAEGK2bCeRYMjo2S2u0QBqToYbxzB',
server="obs.cn-north-4.myhuaweicloud.com")
# copy dataset from obs bucket to container/cache
mox.file.copy_parallel(src_url="s3://share-course/dataset/cifar10/", dst_url='cifar10/')
```
- If you have set the other account's keys via set_auth(), call set_auth() again with your own account's keys before copying the training outputs:
```python
import os
import moxing as mox

mox.file.set_auth(ak='Your own Access Key', sk='Your own Secret Access Key',
                  server="obs.cn-north-4.myhuaweicloud.com")
mox.file.copy_parallel(src_url='ckpt', dst_url=os.path.join(args.train_url, 'ckpt'))
```
If you do not set your own account's keys, Checkpoints can only be copied into an OBS bucket under the other account.
### Creating the Training Job
1. Click Submit to start training;
2. The new training job appears in the training-job list, and version management is available on its page;
3. Click the running job to view its configuration and the training log, which refreshes continuously; after the job finishes you can also download the log for offline inspection;
4. The log contains fields such as `epoch 90 cost time = 27.328994035720825, train step num: 1562, one step time: 17.496154952446112 ms, train samples per second of cluster: 1829.0`, i.e., the performance data of the training run;
5. The log contains fields such as `epoch: 90 step 1562, loss is 0.0002547435578890145`, i.e., the loss during training;
6. The log contains the field `Evaluation result: {'acc': 0.9467147435897436}.`, i.e., the validation accuracy after training.
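A sample training log illustrating all of these fields: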
```
epoch 1 cost time = 156.34279108047485, train step num: 1562, one step time: 100.09141554447814 ms, train samples per second of cluster: 319.7
epoch: 1 step 1562, loss is 1.5020508766174316
Epoch time: 156343.661, per step time: 100.092, avg loss: 1.502
************************************************************
epoch 2 cost time = 27.33933186531067, train step num: 1562, one step time: 17.502773281248828 ms, train samples per second of cluster: 1828.3
epoch: 2 step 1562, loss is 1.612194299697876
Epoch time: 27339.779, per step time: 17.503, avg loss: 1.612
************************************************************
epoch 3 cost time = 27.33275270462036, train step num: 1562, one step time: 17.498561270563613 ms, train samples per second of cluster: 1828.7
epoch: 3 step 1562, loss is 1.0880045890808105
Epoch time: 27333.157, per step time: 17.499, avg loss: 1.088
************************************************************
...
...
...
epoch 50 cost time = 27.318379402160645, train step num: 1562, one step time: 17.48935941239478 ms, train samples per second of cluster: 1829.7
epoch: 50 step 1562, loss is 0.028316421434283257
Epoch time: 27318.783, per step time: 17.490, avg loss: 0.028
************************************************************
epoch 51 cost time = 27.317234992980957, train step num: 1562, one step time: 17.488626756069756 ms, train samples per second of cluster: 1829.8
epoch: 51 step 1562, loss is 0.09725271165370941
Epoch time: 27317.556, per step time: 17.489, avg loss: 0.097
************************************************************
...
...
...
************************************************************
epoch 88 cost time = 27.33049988746643, train step num: 1562, one step time: 17.497119006060455 ms, train samples per second of cluster: 1828.9
epoch: 88 step 1562, loss is 0.0008127370965667069
Epoch time: 27330.821, per step time: 17.497, avg loss: 0.001
************************************************************
epoch 89 cost time = 27.33343005180359, train step num: 1562, one step time: 17.498994911525987 ms, train samples per second of cluster: 1828.7
epoch: 89 step 1562, loss is 0.00029994442593306303
Epoch time: 27333.826, per step time: 17.499, avg loss: 0.000
************************************************************
epoch 90 cost time = 27.328994035720825, train step num: 1562, one step time: 17.496154952446112 ms, train samples per second of cluster: 1829.0
epoch: 90 step 1562, loss is 0.0002547435578890145
Epoch time: 27329.307, per step time: 17.496, avg loss: 0.000
************************************************************
Start run evaluation.
Evaluation result: {'acc': 0.9467147435897436}.
```
## Lab Conclusion