Commit af12ca66 authored by dyonghan, committed by Gitee

!29 resnet50

Merge pull request !29 from yangyaqin/resnet50
## Lab Introduction
This lab walks through training ResNet50 on the CIFAR-10 dataset with MindSpore. It uses the ResNet50 model definition from the MindSpore model_zoo, together with the training script from the MindSpore tutorial [Using MindSpore on the Cloud](https://www.mindspore.cn/tutorial/zh-CN/r0.5/advanced_use/use_on_the_cloud.html).
## Lab Objectives
## Lab Environment
- MindSpore 0.5.0 (MindSpore versions are updated periodically, and this guide is refreshed accordingly to stay in sync);
- Huawei Cloud ModelArts: a one-stop AI development platform for developers, with an integrated pool of Ascend AI processor resources, on which you can try out MindSpore.
## Lab Preparation
### Dataset Preparation
CIFAR-10 is an image-classification dataset of 60,000 32x32 color images: 50,000 for training and 10,000 for testing, spread over 10 classes with 6,000 images each.
- Option 1: download "CIFAR-10 binary version (suitable for C programs)" from the [CIFAR-10 website](http://www.cs.toronto.edu/~kriz/cifar.html) and unzip it locally.
- Option 2: download the [CIFAR-10 dataset](https://share-course.obs.cn-north-4.myhuaweicloud.com/dataset/cifar10.zip) from Huawei Cloud OBS and unzip it.
### Script Preparation
Download the relevant scripts from the [MindSpore tutorial repository](https://gitee.com/mindspore/docs/tree/r0.5/tutorials/tutorial_code/sample_for_cloud).
### Uploading Files
Organize the scripts and the dataset as follows and upload them to OBS:
```
experiment_3
├── dataset.py
├── resnet.py
├── resnet50_train.py
└── cifar10
    ├── batches.meta.txt
    ├── data_batch_1.bin
    ├── data_batch_2.bin
    ├── data_batch_3.bin
    ├── data_batch_4.bin
    ├── data_batch_5.bin
    ├── readme.html
    └── test_batch.bin
```
### Code Walkthrough
- resnet50_train.py: the main script, containing the `PerformanceCallback` performance monitor, the `get_lr` dynamic learning-rate generator, the `resnet50_train` execution function, and the main entry point;
- dataset.py: the data-processing script;
- resnet.py: the ResNet model-definition script, containing the `ResidualBlock` building-block class, the `ResNet` class, and the `resnet50`/`resnet101` constructor functions.
#### resnet50_train.py Walkthrough
`PerformanceCallback` subclasses MindSpore's `Callback` class and measures the latency of each training step.
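The full definition is not reproduced in this excerpt. A minimal sketch of such a callback, assuming a `batch_size` constructor argument and simplified print fields (the actual script also reports the cluster-wide samples-per-second figure seen in the logs below), might look like this:

```python
import time

from mindspore.train.callback import Callback


class PerformanceCallback(Callback):
    """Report the latency of each training step (illustrative sketch)."""

    def __init__(self, batch_size):
        super(PerformanceCallback, self).__init__()
        self.batch_size = batch_size
        self.step_start = None

    def step_begin(self, run_context):
        # record the wall-clock time when the step starts
        self.step_start = time.time()

    def step_end(self, run_context):
        # latency of this step, and the implied per-device throughput
        elapsed = time.time() - self.step_start
        print('one step time: {:.3f} ms, samples per second: {:.1f}'.format(
            elapsed * 1000, self.batch_size / elapsed))
```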
`get_lr` precomputes a learning-rate value for every training step and returns the whole schedule as a `learning_rate` array, which can then be wrapped in a `Tensor` and handed to the optimizer.
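A sketch of such a schedule, assuming a linear warmup followed by cosine decay (the script's exact constants and decay shape may differ):

```python
import numpy as np


def get_lr(global_step, total_epochs, steps_per_epoch,
           lr_init=0.01, lr_max=0.1, warmup_epochs=5):
    """Precompute a per-step learning rate: linear warmup, then cosine decay.

    A sketch only: parameter names mirror common MindSpore examples and are
    not guaranteed to match the tutorial script exactly.
    """
    total_steps = steps_per_epoch * total_epochs
    warmup_steps = steps_per_epoch * warmup_epochs
    lr_each_step = []
    for i in range(total_steps):
        if i < warmup_steps:
            # ramp up linearly from lr_init to lr_max
            lr = lr_init + (lr_max - lr_init) * i / warmup_steps
        else:
            # cosine decay from lr_max down to 0
            progress = (i - warmup_steps) / (total_steps - warmup_steps)
            lr = lr_max * 0.5 * (1.0 + np.cos(np.pi * progress))
        lr_each_step.append(lr)
    # drop steps already consumed, e.g. when resuming training
    learning_rate = np.array(lr_each_step[global_step:]).astype(np.float32)
    return learning_rate
```

The `resnet50_train` execution function wires these pieces together with the dataset and the network. A rough, hedged outline, assuming dataset.py exposes a `create_dataset(dataset_path, do_train, repeat_num, batch_size)` helper as in the official cloud tutorial, and using illustrative paths and batch size:

```python
from mindspore import context, Tensor
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.callback import LossMonitor

from dataset import create_dataset  # assumed helper name from dataset.py
from resnet import resnet50


def resnet50_train(args):
    epoch_size = args.num_epochs
    batch_size = 32          # illustrative value
    data_path = 'cifar10/'   # where the OBS data was copied (illustrative)

    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

    train_ds = create_dataset(data_path, do_train=True,
                              repeat_num=1, batch_size=batch_size)

    net = resnet50(class_num=10)
    lr = Tensor(get_lr(global_step=0, total_epochs=epoch_size,
                       steps_per_epoch=train_ds.get_dataset_size()))
    opt = nn.Momentum(net.trainable_params(), learning_rate=lr, momentum=0.9)
    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')

    model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
    model.train(epoch_size, train_ds,
                callbacks=[PerformanceCallback(batch_size), LossMonitor()],
                dataset_sink_mode=True)

    eval_ds = create_dataset(data_path, do_train=False,
                             repeat_num=1, batch_size=batch_size)
    print('Evaluation result:', model.eval(eval_ds))
```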
#### dataset.py Walkthrough
MindSpore can read the CIFAR-10 dataset directly:
```python
if device_num == 1 or not do_train:
    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle)
else:
    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle,
                           num_shards=device_num, shard_id=device_id)
```
Data augmentation, such as random cropping and random horizontal flipping, is then applied:
```python
# define map operations
random_crop_op = C.RandomCrop((32, 32), (4, 4, 4, 4))
random_horizontal_flip_op = C.RandomHorizontalFlip(device_id / (device_id + 1))

resize_op = C.Resize((resize_height, resize_width))
rescale_op = C.Rescale(rescale, shift)
normalize_op = C.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])

change_swap_op = C.HWC2CHW()

trans = []
if do_train:
    trans += [random_crop_op, random_horizontal_flip_op]

trans += [resize_op, rescale_op, normalize_op, change_swap_op]

type_cast_op = C2.TypeCast(mstype.int32)

ds = ds.map(input_columns="label", num_parallel_workers=8, operations=type_cast_op)
ds = ds.map(input_columns="image", num_parallel_workers=8, operations=trans)
```
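After the map operations, the pipeline is completed with batching and repetition before it is handed to the trainer. A sketch, assuming `batch_size` and `repeat_num` are parameters of the surrounding function:

```python
# form complete batches (dropping the remainder keeps step counts stable)
ds = ds.batch(batch_size, drop_remainder=True)
# repeat the dataset so Model.train can iterate for multiple passes
ds = ds.repeat(repeat_num)
```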
#### resnet.py Walkthrough
Every ResNet variant consists of five stages. For ResNet50 the structure is Conv×1 -> ResidualBlock×3 -> ResidualBlock×4 -> ResidualBlock×6 -> ResidualBlock×3 -> Pooling+FC.
![ResNet Architectures](images/resnet_archs.png)
[2] Figure from https://arxiv.org/pdf/1512.03385.pdf
The `ResidualBlock` (residual block) is the basic unit from which ResNet is assembled; it is defined as follows.
```python
class ResidualBlock(nn.Cell):
    """
    ResNet V1 residual block definition.

    Args:
        in_channel (int): Input channel.
        out_channel (int): Output channel.
        stride (int): Stride size for the first convolutional layer. Default: 1.

    Returns:
        Tensor, output tensor.

    Examples:
        >>> ResidualBlock(3, 256, stride=2)
    """
    expansion = 4

    def __init__(self,
                 in_channel,
                 out_channel,
                 stride=1):
        super(ResidualBlock, self).__init__()
        # ... the 1x1 -> 3x3 -> 1x1 convolution and batch-norm layers ...
        self.relu = nn.ReLU()

        self.down_sample = False
        if stride != 1 or in_channel != out_channel:
            self.down_sample = True

        self.down_sample_layer = None
        if self.down_sample:
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride),
                                                        _bn(out_channel)])

    def construct(self, x):
        identity = x
        # ... the main path (1x1 -> 3x3 -> 1x1 convolutions) computes `out` ...

        # ResNet50 downsamples the shortcut branch at the first block of a
        # stage, where stride != 1 or the channel count changes
        if self.down_sample:
            identity = self.down_sample_layer(identity)

        # `out` is the residual branch; `identity` is the shortcut branch
        out = self.add(out, identity)
        out = self.relu(out)

        return out
```
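For example, with the ResNet50 parameters used later in this file, the first block of stage 3 widens the channels from 512 to 1024 and halves the spatial resolution, which is exactly the case where the downsampling shortcut is built:

```python
# first block of stage 3 in ResNet50: stride 2 and a channel change,
# so down_sample is True and the 1x1-conv shortcut is created
block = ResidualBlock(512, 1024, stride=2)
```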
The `ResNet` class is defined below. Its constructor parameters are:

- layer_nums: number of ResidualBlocks in each stage (list)
- in_channels: input channel count of each stage (list)
- out_channels: output channel count of each stage (list)
- strides: stride of the convolution in each stage (list)
- num_classes: number of image classes (int)

> **Note:**
>
> - A "stage" here is not the literal depth of the network; it merely divides ResNet into parts, each made of several ResidualBlocks.
> - layer_nums, in_channels, out_channels, and strides must all have the same length.
> - Different parameters produce different networks; ResNet50 and ResNet101 are the typical ones, both defined in resnet.py. You can try designing a new network with custom parameters (see the sketch after the `resnet50` definition below).
```python
class ResNet(nn.Cell):
    """
    ResNet architecture.

    Args:
        block (Cell): Block for network.
        layer_nums (list): Numbers of block in different layers.
        in_channels (list): Input channel in each layer.
        out_channels (list): Output channel in each layer.
        strides (list): Stride size in each layer.
        num_classes (int): The number of classes that the training images are belonging to.

    Returns:
        Tensor, output tensor.

    Examples:
        >>> ResNet(ResidualBlock,
        >>>        [3, 4, 6, 3],
        >>>        [64, 256, 512, 1024],
        >>>        [256, 512, 1024, 2048],
        >>>        [1, 2, 2, 2],
        >>>        10)
    """

    def __init__(self,
                 block,
                 layer_nums,
                 in_channels,
                 out_channels,
                 strides,
                 num_classes):
        super(ResNet, self).__init__()

        if not len(layer_nums) == len(in_channels) == len(out_channels) == 4:
            raise ValueError("the length of layer_num, in_channels, out_channels list must be 4!")

        self.conv1 = _conv7x7(3, 64, stride=2)
        self.bn1 = _bn(64)
        self.relu = P.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same")

        self.layer1 = self._make_layer(block,
                                       layer_nums[0],
                                       in_channel=in_channels[0],
                                       out_channel=out_channels[0],
                                       stride=strides[0])
        self.layer2 = self._make_layer(block,
                                       layer_nums[1],
                                       in_channel=in_channels[1],
                                       out_channel=out_channels[1],
                                       stride=strides[1])
        self.layer3 = self._make_layer(block,
                                       layer_nums[2],
                                       in_channel=in_channels[2],
                                       out_channel=out_channels[2],
                                       stride=strides[2])
        self.layer4 = self._make_layer(block,
                                       layer_nums[3],
                                       in_channel=in_channels[3],
                                       out_channel=out_channels[3],
                                       stride=strides[3])

        self.mean = P.ReduceMean(keep_dims=True)
        self.flatten = nn.Flatten()
        self.end_point = _fc(out_channels[3], num_classes)

    def _make_layer(self, block, layer_num, in_channel, out_channel, stride):
        """
        Make stage network of ResNet.

        Args:
            block (Cell): Resnet block.
            layer_num (int): Layer number.
            in_channel (int): Input channel.
            out_channel (int): Output channel.
            stride (int): Stride size for the first convolutional layer.

        Returns:
            SequentialCell, the output layer.

        Examples:
            >>> _make_layer(ResidualBlock, 3, 128, 256, 2)
        """
        layers = []

        resnet_block = block(in_channel, out_channel, stride=stride)
        layers.append(resnet_block)

        for _ in range(1, layer_num):
            resnet_block = block(out_channel, out_channel, stride=1)
            layers.append(resnet_block)

        return nn.SequentialCell(layers)

    def construct(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        c1 = self.maxpool(x)

        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)

        out = self.mean(c5, (2, 3))
        out = self.flatten(out)
        out = self.end_point(out)

        return out
```
The `resnet50` constructor function is defined as follows:
```python
def resnet50(class_num=10):
    """
    Get ResNet50 neural network.

    Args:
        class_num (int): Class number.

    Returns:
        Cell, cell instance of ResNet50 neural network.

    Examples:
        >>> net = resnet50(10)
    """
    return ResNet(ResidualBlock,
                  [3, 4, 6, 3],
                  [64, 256, 512, 1024],
                  [256, 512, 1024, 2048],
                  [1, 2, 2, 2],
                  class_num)
```
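As noted earlier, you can design a new network just by changing the constructor arguments. A hypothetical shallower variant, for illustration only (not part of resnet.py), could reuse the same stage widths with fewer blocks per stage:

```python
def resnet26(class_num=10):
    """A hypothetical ResNet-26-style network for experimentation."""
    return ResNet(ResidualBlock,
                  [2, 2, 2, 2],          # fewer blocks per stage than ResNet50
                  [64, 256, 512, 1024],
                  [256, 512, 1024, 2048],
                  [1, 2, 2, 2],
                  class_num)
```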
### Adapting the Script for Training Jobs
When a training job is created, its runtime parameters are passed to the script as command-line arguments, so the script must parse them before the corresponding values can be used in code. For example, data_url and train_url correspond to the data-storage path (an OBS path) and the training-output path (an OBS path), respectively. The script parses the arguments into the `args` variable, which the subsequent code can then use.
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
parser.add_argument('--num_epochs', type=int, default=90, help='Number of training epochs.')
args, unknown = parser.parse_known_args()
```
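Note that the script calls `parse_known_args()` rather than `parse_args()`: ModelArts may append additional platform arguments to the launch command, and `parse_known_args()` collects the arguments the script declares while tolerating the rest instead of exiting with an error.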
MindSpore does not yet provide an interface for reading OBS data directly, so the script interacts with OBS through the API provided by MoXing.
**Option 1**
- Copy the dataset from an OBS bucket under your own account into the execution container:
```python
import moxing as mox
mox.file.copy_parallel(src_url=args.data_url, dst_url='cifar10/')
```
- To copy training outputs (such as model Checkpoints) from the execution container back to your own OBS bucket:
```python
import moxing as mox
mox.file.copy_parallel(src_url='output', dst_url='s3://OBS/PATH')
```
**Option 2**
- Copy a dataset from an OBS bucket under someone else's account. This requires that the bucket be set to public read / public read-write, and that you have the owner's Access Key, Secret Access Key, and the bucket Endpoint (OBS bucket -> Overview -> Basic Information -> Endpoint):
```python
import moxing as mox
# set moxing/obs auth info, ak:Access Key Id, sk:Secret Access Key, server:endpoint of obs bucket
mox.file.set_auth(ak='VCT2GKI3GJOZBQYJG5WM', sk='t1y8M4Z6bHLSAEGK2bCeRYMjo2S2u0QBqToYbxzB',
server="obs.cn-north-4.myhuaweicloud.com")
# copy dataset from obs bucket to container/cache
mox.file.copy_parallel(src_url="s3://share-course/dataset/cifar10/", dst_url='cifar10/')
```
- If you have set the other account's keys via set_auth(), call set_auth() again with your own account's keys before copying the training outputs:
```python
import os
import moxing as mox

mox.file.set_auth(ak='Your own Access Key', sk='Your own Secret Access Key',
                  server="obs.cn-north-4.myhuaweicloud.com")
mox.file.copy_parallel(src_url='ckpt', dst_url=os.path.join(args.train_url, 'ckpt'))
```
If you do not set your own account's keys, Checkpoints can only be copied into an OBS bucket under the other account.
### Creating the Training Job
1. Click Submit to start training;
2. The new training job appears in the training-job list, and version management is available on its page;
3. Click the running job to view its configuration and the training log, which refreshes continuously; after the job finishes you can also download the log for offline inspection;
4. The log contains fields such as `epoch 90 cost time = 27.328994035720825, train step num: 1562, one step time: 17.496154952446112 ms, train samples per second of cluster: 1829.0`, i.e., the performance data of the training run;
5. The log contains fields such as `epoch: 90 step 1562, loss is 0.0002547435578890145`, i.e., the loss during training;
6. The log contains the field `Evaluation result: {'acc': 0.9467147435897436}.`, i.e., the validation accuracy after training.
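A sample training log illustrating all of these fields: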
```
epoch 1 cost time = 156.34279108047485, train step num: 1562, one step time: 100.09141554447814 ms, train samples per second of cluster: 319.7
epoch: 1 step 1562, loss is 1.5020508766174316
Epoch time: 156343.661, per step time: 100.092, avg loss: 1.502
************************************************************
epoch 2 cost time = 27.33933186531067, train step num: 1562, one step time: 17.502773281248828 ms, train samples per second of cluster: 1828.3
epoch: 2 step 1562, loss is 1.612194299697876
Epoch time: 27339.779, per step time: 17.503, avg loss: 1.612
************************************************************
epoch 3 cost time = 27.33275270462036, train step num: 1562, one step time: 17.498561270563613 ms, train samples per second of cluster: 1828.7
epoch: 3 step 1562, loss is 1.0880045890808105
Epoch time: 27333.157, per step time: 17.499, avg loss: 1.088
************************************************************
...
...
...
epoch 50 cost time = 27.318379402160645, train step num: 1562, one step time: 17.48935941239478 ms, train samples per second of cluster: 1829.7
epoch: 50 step 1562, loss is 0.028316421434283257
Epoch time: 27318.783, per step time: 17.490, avg loss: 0.028
************************************************************
epoch 51 cost time = 27.317234992980957, train step num: 1562, one step time: 17.488626756069756 ms, train samples per second of cluster: 1829.8
epoch: 51 step 1562, loss is 0.09725271165370941
Epoch time: 27317.556, per step time: 17.489, avg loss: 0.097
************************************************************
...
...
...
************************************************************
epoch 88 cost time = 27.33049988746643, train step num: 1562, one step time: 17.497119006060455 ms, train samples per second of cluster: 1828.9
epoch: 88 step 1562, loss is 0.0008127370965667069
Epoch time: 27330.821, per step time: 17.497, avg loss: 0.001
************************************************************
epoch 89 cost time = 27.33343005180359, train step num: 1562, one step time: 17.498994911525987 ms, train samples per second of cluster: 1828.7
epoch: 89 step 1562, loss is 0.00029994442593306303
Epoch time: 27333.826, per step time: 17.499, avg loss: 0.000
************************************************************
epoch 90 cost time = 27.328994035720825, train step num: 1562, one step time: 17.496154952446112 ms, train samples per second of cluster: 1829.0
epoch: 90 step 1562, loss is 0.0002547435578890145
Epoch time: 27329.307, per step time: 17.496, avg loss: 0.000
************************************************************
Start run evaluation.
Evaluation result: {'acc': 0.9467147435897436}.
```
## Lab Conclusion