diff --git a/doc/.DS_Store b/doc/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..e62e3f4b9d9e91b952298d3179e46687fbf38763 Binary files /dev/null and b/doc/.DS_Store differ diff --git a/doc/fluid/.DS_Store b/doc/fluid/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..191078cf89246126f6000ef369c2e56d02559551 Binary files /dev/null and b/doc/fluid/.DS_Store differ diff --git a/doc/fluid/release_note_cn.md b/doc/fluid/release_note_cn.md index 44748ac6685f4dec4891d1547708c235f3f8dc3c..34bafda25e716feae3e920d1af1602aee2c59682 100644 --- a/doc/fluid/release_note_cn.md +++ b/doc/fluid/release_note_cn.md @@ -3,323 +3,336 @@ Release Notes ============== ## 重要更新 -本版本对框架功能层面进行了重点增强,预测部署能力全面提升,分布式发布PLSC支持超大规模分类,并对参数服务器模式进行优化整合。对编译选项、编译依赖以及代码库进行了全面清理优化。模型库持续完善,优化了整体层次结构,增加了动态图模型实现。端到端开发套件和工具组件进一步完善。 +本版本对框架功能层面进行了重点增强,预测部署能力全面提升,分布式训练发布PLSC支持超大规模分类,并对参数服务器模式进行优化整合。对编译选项、编译依赖以及代码库进行了全面清理优化。模型库持续完善,优化了整体层次结构,增加了动态图模型实现。端到端开发套件和工具组件进一步完善。 **训练框架**:增加自动混合精度训练AMP接口和新控制流接口;优化Tensor使用方式和显存分配策略;新增支持Nvidia DALI GPU数据预处理库;持续优化基础OP的功能和性能;动态图的功能进一步完善,性能大幅提升,对data independent的动态图模型提供转为静态图可预测部署模型的功能;框架调试分析功能和易用性全面提升。 -**预测部署**:服务器端预测库的Python API大幅优化,新增R语言、Go语言调用预测库的使用方法和示例,强化了量化支持能力;Paddle Lite支持无校准数据的训练后量化方法生成的模型,加强对OpenCL的支持,支持昆仑XPU的预测;模型压缩库PaddleSlim重构裁剪、量化、蒸馏、搜索接口,新增大规模可扩展知识蒸馏框架 Pantheon,与模型库充分打通。 +**预测部署**:服务器端预测库的Python API大幅优化,新增R语言、Go语言调用预测库的使用方法和示例,强化了量化支持能力;Paddle Lite支持无校准数据的训练后量化方法生成的模型,加强对OpenCL的支持,支持昆仑XPU的预测;模型压缩库PaddleSlim重构裁剪、量化、蒸馏、搜索接口,与模型库充分打通,新增大规模可扩展知识蒸馏框架 Pantheon。 -**分布式方面**:参数服务器模式下针对transpiler的同步、半异步、全异步三种模式,后端实现上统一到communicator中,前端接口统一到fleet中,通过fleet strategy灵活选择不同模式;发布大规模分类库PLSC,通过模型并行支持超多类别的分类任务。 +**分布式训练**:参数服务器模式下针对transpiler半异步、全异步、GEO三种模式,后端实现上统一到communicator中,前端接口统一到fleet中,通过fleet strategy灵活选择不同模式;发布大规模分类库PLSC,通过模型并行支持超多类别的分类任务。 **基础模型库**:发布语音合成库Parakeet,包括多个前沿合成算法;PaddleCV新增14个图像分类预训练模型,3D和跟踪方向模型持续丰富;PaddleNLP的分词和词性标注模型支持jieba分词;PaddleRec增加多任务模型MMoE。模型库整体增加了广泛的动态图模型实现。模型库整体层次结构做了调整优化。 
**端到端开发套件**:PaddleDetection和PaddleSeg新增大量模型实现及预训练模型,典型模型的训练速度和精度提升,模型压缩和部署能力大幅提升,使用体验全面优化。发布ElasticRec推荐排序系统,通过K8S进行部署,支持流式训练和在线预测服务。 -**工具组件**:PaddleHub新增52个预训练模型,总数超过100,功能和体验持续优化;多任务学习框架PALM升级内核,开放API调用,支持更多的任务类型;联邦学习PaddleFL新增公开数据集。 +**工具组件**:PaddleHub新增52个预训练模型,总数超过100,功能和体验持续优化;多任务学习框架PALM升级内核,开放API调用,支持更多的任务类型;联邦学习PaddleFL新增公开数据集。深度强化学习框架PARL和飞桨图学习框架PGL也对应版本升级,支持更多功能,k开放更多算法和基线。 ## 训练框架 - API - - 增加自动混合精度训练AMP接口:能以通用的方式把一个网络转成混合精度训练,同时保证精度波动在正常范围内 - - 增加新的控制流接口并推荐使用:新增while_loop(循环控制功能)、cond(条件分支功能)、case和switch_case(分支控制功能)4个控制流OP,更加易用,且支持如下新增功能: - - 支持使用python callable作为控制条件或执行体 - - 支持控制流中的不同分支使用不同loss或optimizer - - 支持控制流中的condition部分使用CPU数据或GPU数据 - - 部分API参数支持使用变量列表:针对部分API的parameter_list或no_grad_set参数只支持使用字符串列表的情况,增加对变量列表的支持,使用如下API时不再需要提前获取相关变量的name属性: - - fluid.backward.append_backward(loss, parameter_list=None, no_grad_set=None, callbacks=None) - - fluid.backward.gradients(targets, inputs, target_gradients=None, no_grad_set=None) - - 各种Optimizer的minimize方法,如Adam的minimize:minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) +- 增加自动混合精度训练AMP接口:能以通用的方式把一个网络转成混合精度训练,同时保证精度波动在正常范围内 +- 增加新的控制流接口并推荐使用:新增while_loop(循环控制功能)、cond(条件分支功能)、case和switch_case(分支控制功能)4个控制流OP,更加易用,且支持如下新增功能: +- 支持使用python callable作为控制条件或执行体 +- 支持控制流中的不同分支使用不同loss或optimizer +- 支持控制流中的condition部分使用CPU数据或GPU数据 +- 部分API参数支持使用变量列表:针对部分API的parameter_list或no_grad_set参数只支持使用字符串列表的情况,增加对变量列表的支持,使用如下API时不再需要提前获取相关变量的name属性: +- fluid.backward.append_backward(loss, parameter_list=None, no_grad_set=None, callbacks=None) +- fluid.backward.gradients(targets, inputs, target_gradients=None, no_grad_set=None) +- 各种Optimizer的minimize方法,如Adam的minimize:minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) - 基础功能优化 - - 支持使用numpy的float16类型设置Tensor数据,无需先转换为uint16类型。 - - 支持直接使用负号,得到Tensor的相反数。 - - 显存分配策略: - - 默认策略变为AutoGrowth:在不影响训练速度的情况下,按需申请显存。规避之前的默认显存预分配策略下难以在同一张GPU卡上再起新任务的问题。 - - 
多卡任务显存分配调整:将不同GPU卡上的显存分配器设置为Lazy初始化的方式。若用户不使用某张卡,则不会在该卡上申请显存。避免当其他GPU卡上有显存占用时,在空闲GPU卡上跑任务若不设置CUDA_VISIBLE_DEVICES导致显存OOM的问题。 - - OP功能升级 - - elu:该激活函数支持计算二阶梯度。 - - prroi_pool:rois参数可以接受Tensor或LoDTensor类型。 - - conv2d,pool2d,batch_norm,lrn:反向计算全部支持使用MKL-DNN高性能计算库。 - - argsort:支持降序排序(新增descending参数,默认值False)。 +- 支持使用numpy的float16类型设置Tensor数据,无需先转换为uint16类型。 +- 支持直接使用负号,得到Tensor的相反数。 +- 显存分配策略: +- 默认策略变为AutoGrowth:在不影响训练速度的情况下,按需申请显存。规避之前的默认显存预分配策略下难以在同一张GPU卡上再起新任务的问题。 +- 多卡任务显存分配调整:将不同GPU卡上的显存分配器设置为Lazy初始化的方式。若用户不使用某张卡,则不会在该卡上申请显存。避免当其他GPU卡上有显存占用时,在空闲GPU卡上跑任务若不设置CUDA_VISIBLE_DEVICES导致显存OOM的问题。 +- OP功能升级 +- elu:该激活函数支持计算二阶梯度。 +- prroi_pool:rois参数可以接受Tensor或LoDTensor类型。 +- conv2d,pool2d,batch_norm,lrn:反向计算全部支持使用MKL-DNN高性能计算库。 +- argsort:支持降序排序(新增descending参数,默认值False)。 - 基础性能优化 - - DALI预处理加速 - - 增加对Nvidia DALI GPU数据预处理库的支持,可用于加速图片,视频,语音等数据预处理。 - - 自动混合精度训练优化 - - 实现如下优化策略,并配合DALI数据预处理,ResNet50模型训练吞吐大幅提升:V100单卡混合精度训练吞吐从600+ images/sec提升到1000+ images/sec;单机8卡吞吐达到7840 image/sec,4机32卡吞吐达到28594 images/sec。 - - 增强batch_norm和conv2d等op对NHWC数据布局输入的支持,以使用Tensor Core技术加速fp16计算。 - - 基于IR Pass机制对模型中的部分op pattern进行融合,如batch_norm和relu等。 - - 优化elementwise(add,mul)等op的计算kernel。 - - 优化RecomputeOptimizer提升batchsize, 在Bert-large模型上最大batchsize比不使用RecomputeOptimizer增大533.62%,比上一版本提升一倍。 - - OP性能优化 - - 实现embedding和sequence_pool的融合算子fuse_emb_seq_pool,优化bloom_filter中的murmurhash3_x64_128,有效提升部分NLP模型的训练速度。 - - 优化了mean op的GPU性能,输入数据为32*32*8*8的Tensor时,前向计算速度提升2.7倍。 - - 优化assign、lod_reset op,避免不需要的显存拷贝和data transform。 - - 优化了stack OP的kernel实现,XLnet/Ernie模型GPU单卡性能提升4.1%。 +- DALI预处理加速 +- 增加对Nvidia DALI GPU数据预处理库的支持,可用于加速图片,视频,语音等数据预处理。 +- 自动混合精度训练优化 +- 实现如下优化策略,并配合DALI数据预处理,ResNet50模型训练吞吐大幅提升:V100单卡混合精度训练吞吐从600+ images/sec提升到1000+ images/sec;单机8卡吞吐达到7840 image/sec,4机32卡吞吐达到28594 images/sec。 +- 增加batch_norm和conv2d等op对NHWC数据布局输入的支持,以使用Tensor Core加速fp16计算或减少访存耗时。 +- 基于IR Pass机制对模型中的部分op pattern进行融合,如batch_norm和relu等。 +- 优化elementwise(add,mul)等op的计算kernel。 +- 优化RecomputeOptimizer提升batchsize, 
在Bert-large模型上最大batchsize比不使用RecomputeOptimizer增大533.62%,比上一版本提升一倍。 +- OP性能优化 +- 实现embedding和sequence_pool的融合算子fuse_emb_seq_pool,优化bloom_filter中的murmurhash3_x64_128,有效提升部分NLP模型的训练速度。 +- 优化了mean op的GPU性能,输入数据为32*32*8*8的Tensor时,前向计算速度提升2.7倍。 +- 优化assign、lod_reset op,避免不需要的显存拷贝和data transform。 +- 优化了stack OP的kernel实现,XLnet/Ernie模型GPU单卡性能提升4.1%。 - 动态图 - - 功能优化 - - 移除了动态图Layers 中的 name_scope 参数,使得用户更方便继承和调用。 - - 移除to_variable接口中的block参数,简化了API的使用。 - - 针对模型参数依赖数据的问题,移除了 build_once设计,使得Layers在 **init** 执行完成之后就可以获取到所有的参数表,方便save load、参数初始化、参数debug、参数优化等。 - - 完善自动剪枝,方便用户组网并减少反向计算量。 - - 支持 SelectedRows 操作,使 Embedding 层支持单卡的稀疏更新。 - - 针对框架缺少容器类的问题,新增ParameterList、LayerList、Sequencial功能,方便用户组网。 - - 支持named_sublayers、named_parameters功能,方便用户编程。 - - 支持Linear lr warmup decay策略。 - - 性能优化 - - 优化了python 与c++ 交互,GradMaker、OperatorBase、allocator等。基于LSTM的语言模型任务p在P40机器上性能提升提升270%。 - - 针对optimize中多次调用optimized_guard无用代码导致的性能问题,移除了冗余代码。Transformer模型(batch_size=64)在P40机器上,SGD、Adam等优化器有5%~8%%的性能提升。 - - 针对AdamOptimizer中额外添加scale_op更新beta参数对性能的影响,将beta更新逻辑融合到adam_op中,减少op kernel调用开销。Dialogue-PLATO模型P40机器上性能提升9.67%。 - - 优化动态图异步DataLoader,在Mnist、ResNet、等模型上整体训练速度提升约30%。 - - 新增numpy bridge功能,支持在cpu模式下Tensor和ndarray之间共享底层数据,避免创建Variable时numpy输入需要拷贝的问题,提升效率。 - - 显存优化:提前删除反向不需要Tensor Buffer的前向变量空间的优化策略,在ResNet等模型上最大batch size提升20%-30%以上。 - - 动态图部署 - - 支持TracedLayer接口,实现 data independent的动态图模型转为静态图可预测部署的模型。 +- 功能优化 +- 移除了动态图Layers 中的 name_scope 参数,使得用户更方便继承和调用。 +- 移除to_variable接口中的block参数,简化了API的使用。 +- 针对模型参数依赖数据的问题,移除了 build_once设计,使得Layers在 **init** 执行完成之后就可以获取到所有的参数表,方便save load、参数初始化、参数debug、参数优化等。 +- 完善自动剪枝,方便用户组网并减少反向计算量。 +- 支持 SelectedRows 操作,使 Embedding 层支持单卡的稀疏更新。 +- 针对框架缺少容器类的问题,新增ParameterList、LayerList、Sequencial功能,方便用户组网。 +- 支持named_sublayers、named_parameters功能,方便用户编程。 +- 支持Linear lr warmup decay策略。 +- 性能优化 +- 优化了python 与c++ 交互,GradMaker、OperatorBase、allocator等。基于LSTM的语言模型任务p在P40机器上性能提升提升270%。 +- 
针对optimize中多次调用optimized_guard无用代码导致的性能问题,移除了冗余代码。Transformer模型(batch_size=64)在P40机器上,SGD、Adam等优化器有5%~8%%的性能提升。 +- 针对AdamOptimizer中额外添加scale_op更新beta参数对性能的影响,将beta更新逻辑融合到adam_op中,减少op kernel调用开销。Dialogue-PLATO模型P40机器上性能提升9.67%。 +- 优化动态图异步DataLoader,对于Mnist、ResNet等CV模型任务在P40机器上单卡训练速度提升超过40%。 +- 新增numpy bridge功能,支持在cpu模式下Tensor和ndarray之间共享底层数据,避免创建Variable时numpy输入需要拷贝的问题,提升效率。 +- 显存优化:提前删除反向不需要Tensor Buffer的前向变量空间的优化策略,在ResNet等模型上最大batch size提升20%-30%以上。 +- 动态图部署 +- 支持TracedLayer接口,实现 data independent的动态图模型转为静态图可预测部署的模型。 - 调试分析 - - 报错信息优化 :对框架报错信息整体归类,实现报错信息的体系化,同时完成文案优化,帮助用户更快速、准确的定位和解决问题。 - - 优化性能分析profile 功能 - - 增强profiler的功能和准确性,支持不同级别的profile选项,能够在profile数据中记录事件的调用关系并打印出来。 - - 优化nan inf检查调试(通过FLAGS_check_nan_inf生效),性能、功能及输出信息均有较大提升: - - 速度上,v100测试ResNet50模型相比原工具组件约有1000倍性能提升,保持正常训练80%以上的效率。 - - 功能上,增加fp16的支持,可设置环境变量跳过op、op_role、op_var的检查,方便fp16模型的调试。 - - 输出信息更加翔实,除出错的op及tensor名称外,还会打印出错的nan、inf及正常数值的数量以便于调试。 +- 报错信息优化 :对框架报错信息整体归类,实现报错信息的体系化,同时完成文案优化,帮助用户更快速、准确的定位和解决问题。 +- 优化性能分析profile 功能 +- 增强profiler的功能和准确性,支持不同级别的profile选项,能够在profile数据中记录事件的调用关系并打印出来。 +- 优化nan inf检查调试(通过FLAGS_check_nan_inf生效),性能、功能及输出信息均有较大提升: +- 速度上,v100测试ResNet50模型相比原工具组件约有1000倍性能提升,保持正常训练80%以上的效率。 +- 功能上,增加fp16的支持,可设置环境变量跳过op、op_role、op_var的检查,方便fp16模型的调试。 +- 输出信息更加翔实,除出错的op及tensor名称外,还会打印出错的nan、inf及正常数值的数量以便于调试。 - 发布cpu训练和预测的轻量级安装包paddlepaddle-tiny,支持window/linux/Mac操作系统以及python27/python35/python36/python37: - - 编译选项:no avx, no ml, no gpu, no unittest - - 裁剪掉slim和部分dataset。 - - linux包体积从90M减小到37M;windows包体积从50.8M减小到9.6M;mac包体积从59M减小到19.8M。 - - 安装requirements依赖从15个减小到7个。 +- 编译选项:no avx, no ml, no gpu, no unittest +- 裁剪掉slim和部分dataset。 +- linux包体积从90M减小到37M;windows包体积从50.8M减小到9.6M;mac包体积从59M减小到19.8M。 +- 安装requirements依赖从15个减小到7个。 ## 预测部署 - 服务器端预测库 - - Python API - - 支持从内存读写模型,以满足模型加密的需求。 - - 不再在预测模型最后添加 Scale 算子。 - - 新增对ZeroCopy预测的支持,与C++接口基本一致,支持以numpy.ndarray作为输入和输出,在Python端使用更加方便。 - - 在AnalysisConfig中增加多个接口,完整覆盖C++预测的功能,包括删除pass、禁用预测glog等。 - - 其他编程语言的支持 - - 新增R语言、Go语言调用预测库的使用方法和示例 - - 
对外提供 ProtoBuf 对应的头文件,方便用户解析模型结构的需求。 - - 带TRT编译的预测库不再从thrid_party中提供TensorRT库,需要用户自行到https://developer.nvidia.com/tensorrt 下载 - - 功能增强: - - 打通Paddle Lite以子图方式接入,已验证 ResNet50。 - - 新增MKL-DNN FC INT8 kernel的支持 - - Paddle-TensorRT支持Ernie模型,Ernie模型(seq length=128) 在T4卡上fp16预测速度为3.6ms, 比fp32加速37%。 - - 量化:在ERNIE INT8精度相比于FP32 精度提升2%下,ERNIE INT8在第二代至强可扩展平台6271上单线程性能优化提升2.70倍,多线程性能提升1.79倍 +- Python API +- 支持从内存读写模型,以满足模型加密的需求。 +- 不再在预测模型最后添加 Scale 算子。 +- 新增ZeroCopy API,与C++接口基本一致,支持以numpy.ndarray作为输入和输出,在Python端使用更加方便。 +- 在AnalysisConfig中增加多个接口,完整覆盖C++预测的功能,包括删除pass、禁用预测glog等。 +- 其他编程语言的支持 +- 新增R语言、Go语言调用预测库的使用方法和示例 +- 对外提供 ProtoBuf 对应的头文件,方便用户解析模型结构的需求。 +- 带TRT编译的预测库不再从thrid_party中提供TensorRT库,需要用户自行到https://developer.nvidia.com/tensorrt 下载 +- 功能增强: +- 打通Paddle Lite以子图方式接入,已验证 ResNet50。 +- 新增MKL-DNN FC INT8 kernel的支持 +- Paddle-TensorRT支持Ernie模型,Ernie模型(seq length=128) 在T4卡上fp16预测速度为3.6ms, 比fp32加速37%。 +- 量化:ERNIE INT8精度相比于FP32 精度略有下降,但其在第二代至强可扩展平台6271上单线程性能优化提升2.70倍,多线程性能提升1.79倍 - 移动/嵌入式端Paddle Lite(https://github.com/PaddlePaddle/Paddle-Lite) - - 对应发布v2.3版本。 - - model_optimize_tool多项功能升级。 - - 支持“无校准数据的训练后量化方法”,减小模型存储空间(2~4倍)。 - - OpenCL:完成30个Image2D Kernel迁移,涵盖14个OP。 - - 对FPGA、NPU的支持进一步加强;支持昆仑XPU的预测。 - - 发布全新官网文档;新增“无校准数据的训练后量化方法”使用文档。 +- 对应发布v2.3版本。 +- model_optimize_tool多项功能升级。 +- 支持“无校准数据的训练后量化方法”,模型存储空间可减少2~4倍。 +- OpenCL:完成30个Image2D Kernel迁移,涵盖14个OP。 +- 对FPGA、NPU的支持进一步加强;支持昆仑XPU的预测。 +- 发布全新官网文档;新增“无校准数据的训练后量化方法”使用文档。 - Paddle Serving(https://github.com/PaddlePaddle/Serving): - - 发布bert类语义理解模型的远程文本向量表示预测服务。 - - 发布了paddle-gpu-serving whl包,通过pip安装和Python代码即可部署和使用预测服务; - - 支持Paddlehub中的13种语义理解模型,支持单机多卡,使用Ernie_tiny模型在单张P4 GPU下平均样本长度为7时预测速度为869.56样本每秒。 +- 发布bert类语义理解模型的远程文本向量表示预测服务。 +- 发布了paddle-gpu-serving whl包,通过pip安装和Python代码即可部署和使用预测服务; +- 支持Paddlehub中的13种语义理解模型,支持单机多卡,使用Ernie_tiny模型在单张P4 GPU下平均样本长度为7时预测速度为869.56样本每秒。 - PaddleSlim(https://github.com/PaddlePaddle/PaddleSlim): - - 拆分PaddleSlim为独立repo。 - - 重构裁剪、量化、蒸馏、搜索接口,对用户开放底层接口。 - - 量化: - - 
新增基于KL散度的离线量化功能,支持对Embedding层量化。 - - 新增对FC的QAT MKL-DNN量化策略支持 - - 新增PostTrainingQuantization,完整实现训练后量化功能:支持量化30种OP,支持灵活设置需要量化的OP,生成统一格式的量化模型,具有耗时短、易用性强、精度损失较小的优点。 - - 量化训练支持设定需要量化的OP类型。 - - 裁剪: 重构剪裁实现,方便扩展支持更多类型的网络。 - - 搜索: - - 支持SA搜索,增加更多的搜索空间,支持用户自定义搜索空间。 - - 新增one-shot搜索算法,搜索速度比上个版本快20倍。 - - 新增大规模可扩展知识蒸馏框架 Pantheon - - student 与 teacher 、teacher与 teacher 模型之间充分解耦,可分别独立运行在不同的物理设备上,便于充分利用计算资源; - - 支持 teacher 模型的单节点多设备大规模预测,在 BERT 等模型上测试加速比达到线性; - - 用 TCP/IP 协议实现在线蒸馏模式的通信,支持在同一网络环境下,运行在任意两个物理设备上的 teacher 模型和 student 模型之间进行知识传输; - - 统一在线和离线两种蒸馏模式的 API 接口,不同的 teacher 模型可以工作在不同的模式下; - - 在 student 端自动完成知识的归并与知识数据的 batch 重组,便于多 teacher 模型的知识融合。 - - 模型库: - - 发布ResNet50、MobileNet模型的压缩benchmark - - 打通检测库,并发布YOLOv3系列模型的压缩benchmark - - 打通分割库,并发布Deepabv3+系列分割模型的压缩benchmark - - 完善文档: - - 补充API文档;新增入门教程和高级教程;增加ModelZoo文档,覆盖分类、检测、分割任务。所有文档包含中、英文。 +- 拆分PaddleSlim为独立repo。 +- 重构裁剪、量化、蒸馏、搜索接口,对用户开放底层接口。 +- 量化: +- 新增基于KL散度的离线量化功能,支持对Embedding层量化。 +- 新增对FC的QAT MKL-DNN量化策略支持 +- 新增PostTrainingQuantization,完整实现训练后量化功能:支持量化30种OP,支持灵活设置需要量化的OP。 +- 量化训练支持设定需要量化的OP类型。 +- 裁剪: 重构剪裁实现,方便扩展支持更多类型的网络。 +- 网络结构搜索: +- 支持SA搜索,增加更多的搜索空间,支持用户自定义搜索空间。 +- 新增one-shot搜索算法,搜索速度比上个版本快20倍。 +- 新增大规模可扩展知识蒸馏框架 Pantheon +- student 与 teacher 、teacher与 teacher 模型之间充分解耦,可分别独立运行在不同的物理设备上,便于充分利用计算资源; +- 支持 teacher 模型的单节点多设备大规模预测,在 BERT 等模型上测试加速比达到线性; +- 用 TCP/IP 协议实现在线蒸馏模式的通信,支持在同一网络环境下,运行在任意两个物理设备上的 teacher 模型和 student 模型之间进行知识传输; +- 统一在线和离线两种蒸馏模式的 API 接口,不同的 teacher 模型可以工作在不同的模式下; +- 在 student 端自动完成知识的归并与知识数据的 batch 重组,便于多 teacher 模型的知识融合。 +- 模型库: +- 发布ResNet50、MobileNet模型的压缩benchmark +- 打通检测库,并发布YOLOv3系列模型的压缩benchmark +- 打通分割库,并发布Deepabv3+系列分割模型的压缩benchmark +- 完善文档: +- 补充API文档;新增入门教程和高级教程;增加ModelZoo文档,覆盖分类、检测、分割任务。所有文档包含中、英文。 ## 分布式 - 参数服务器模式: - - 大幅降低训练过程中的内存占用,在1亿规模embedding任务上,Trainer端内存可以降低90% - - 大幅降低分布式保存模型、加载模型的内存占用, Pserver端内存峰值最大可降低为原先的$1/N,N$为Pserver节点个数。 - - 优化GEO-SGD 稠密参数通信 - - 支持分布式AUC指标计算 - - 新增分布式Barrier功能 - - 非Fleet的transpiler API加入过期警示, 该API计划在PaddlePaddle-Fluid 2.0中移除 - - Communicator加入半异步模式和同步模式 - 
- TrainFromDataset训练接口支持半异步模式和同步模式 - - Fleet加入DistributedStrategy, 进一步提升分布式易用性, 整合目前分布式相关FLAG - - Fleet pslib模式支持一个program多loss训练,优化训练性能 - - 千亿稀疏模式支持k8s环境。 +- 大幅降低训练过程中的内存占用,在1亿规模embedding任务上,Trainer端内存可以降低90% +- 大幅降低分布式保存模型、加载模型的内存占用, Pserver端内存峰值最大可降低为原先的$1/N,N$为Pserver节点个数。 +- 优化GEO模式 稠密参数通信 +- 支持分布式AUC指标计算 +- 新增分布式Barrier功能 +- 非Fleet的transpiler API加入过期警示, 该API计划在下一个版本中移除 +- Communicator加入半异步模式 +- TrainFromDataset训练接口支持半异步模式 +- Fleet加入DistributedStrategy, 进一步提升分布式易用性, 整合目前分布式相关FLAG +- Fleet pslib模式支持一个program多loss训练,优化训练性能 +- 千亿稀疏模式支持k8s环境。 - 大规模分类库PLSC:支持受限于显存容量数据并行无法处理的大规模分类问题(https://github.com/PaddlePaddle/PLSC) - - 内建ResNet50、ResNet101和ResNet152三种模型,并支持自定义模型;单机8张V100 GPU配置下,ResNet50模型百万类别训练速度2,122.56 images/s,相比标准ResNet50模型加速倍1.3倍; - - 发布模型在线预测服务plsc-serving whl包,预测人脸识别模型的图片语义向量表示,支持使用用户训练的模型进行预测。ResNet50模型(batch size=256)在单张V100 GPU下预测速度为523.47 images/s; - - 发布基于ResNet50网络和MS1M-ArcFace数据集的预训练模型:https://plsc.bj.bcebos.com/pretrained_model/resnet50_distarcface_ms1mv2.tar.gz。 +- 内建ResNet50、ResNet101和ResNet152三种模型,并支持自定义模型;单机8张V100 GPU配置下,ResNet50模型百万类别训练速度2,122.56 images/s,相比标准ResNet50模型加速倍1.3倍; +- 发布模型在线预测服务plsc-serving whl包,预测人脸识别模型的图片语义向量表示,支持使用用户训练的模型进行预测。ResNet50模型(batch size=256)在单张V100 GPU下预测速度为523.47 images/s; +- 发布基于ResNet50网络和MS1M-ArcFace数据集的预训练模型:https://plsc.bj.bcebos.com/pretrained_model/resnet50_distarcface_ms1mv2.tar.gz。 - 发布ResNet50混合精度训练benchmark(单卡、多卡、多机)。 ## 基础模型库 (https://github.com/PaddlePaddle/models) - PaddleNLP - - seq2seq支持RL和GAN等训练模式 - - 发布分词和词性标注训练模型,利用知识蒸馏框架 Pantheon,在自有数据集上比paddleNLP上LAC上F1值提升1%;合入jieba分词,通过加入use_paddle标签来开启深度学习模型模式;并在在jieba加入paddle版本检测和回退机制,保障用户体验。 - - 增加动态图模型实现:word2vec、senta、transformer、bert、seq2seq、LAC。 +- seq2seq支持RL和GAN等训练模式 +- 发布分词和词性标注训练模型,利用知识蒸馏框架 Pantheon,在自有数据集上比PaddleNLP上LAC上F1值提升1%;合入jieba分词,通过加入use_paddle标签来开启深度学习模型模式;并在在jieba加入paddle版本检测和回退机制,保障用户体验。 +- 增加动态图模型实现:word2vec、senta、transformer、bert、seq2seq、LAC。 - PaddleSpeech - - 语音合成:发布合成库Parakeet - - 实现语音合成模型数据预处理、训练和合成等的标准工作流 - - 提供对常见数据集的开箱即用的预处理实现 
- - 提供语音合成领域常用模型组件,为实现模型提供支持 - - 发布语音合成模型 DeepVoice3、ClarinNet 、TransformerTTS、FastSpeech、WaveNet、WaveFlow +- 发布语音合成库Parakeet (Paddle PARAllel text-to-speech toolkit) +- 实现语音合成模型数据预处理、训练和合成等的标准工作流 +- 提供对常见数据集的开箱即用的预处理实现 +- 提供语音合成领域常用模型组件,为实现模型提供支持 +- 发布语音合成模型 DeepVoice3、ClarinNet 、TransformerTTS、FastSpeech、WaveNet、WaveFlow - PaddleCV - - 图像分类: - - 新增预训练模型SENet-vd、Res2Net、HRNet系列模型总共14个: - - SE_ResNet18_vd,SE_ResNet34_vd,SE_ResNeXt50_vd_32x4d,ResNeXt152_vd_32x4d - - Res2Net50_26w_4s,Res2Net50_14w_8s,Res2Net50_vd_26w_4s - - HRNet_W18_C,HRNet_W30_C,HRNet_W32_C,HRNet_W40_C,HRNet_W44_C,HRNet_W48_C,HRNet_W64_C - - 支持使用DALI加速数据预处理,在ImageNet训练上获得1.5倍(ResNet50) 至3倍以上(ShuffleNet))加速,并大幅提升GPU利用率。 - - 3D方向: - - 发布模型PointNet++、PointRCNN。 - - 跟踪模型库 : - - 发布模型SiamFC、SiamRPN、SiamMASK、ATOM、ATP。 - - 增加动态图模型实现: MobileNet-v1/v2、YOLOv3、FasterRCNN、MaskRCNN、视频分类TSM模型、视频动作定位BMN模型。 +- 图像分类: +- 新增预训练模型SENet-vd、Res2Net、HRNet系列模型总共14个: +- SE_ResNet18_vd,SE_ResNet34_vd,SE_ResNeXt50_vd_32x4d,ResNeXt152_vd_32x4d +- Res2Net50_26w_4s,Res2Net50_14w_8s,Res2Net50_vd_26w_4s +- HRNet_W18_C,HRNet_W30_C,HRNet_W32_C,HRNet_W40_C,HRNet_W44_C,HRNet_W48_C,HRNet_W64_C +- 支持使用DALI加速数据预处理,在ImageNet训练上获得1.5倍(ResNet50) 至3倍以上(ShuffleNet)加速,并大幅提升GPU利用率。 +- 3D方向: +- 发布模型PointNet++、PointRCNN。 +- 跟踪模型库 : +- 发布模型SiamFC、ATOM。 +- 增加动态图模型实现: MobileNet-v1/v2、YOLOv3、FasterRCNN、MaskRCNN、视频分类TSM模型、视频动作定位BMN模型。 - PaddleRec - - 发布推荐领域多任务模型MMoE, 适用于工业界大规模多任务联合训练。 - - 增加动态图模型实现:gru4rec、deepfm。 +- 发布推荐领域多任务模型MMoE, 适用于工业界大规模多任务联合训练。 +- 增加动态图模型实现:gru4rec、deepfm。 ## 端到端开发套件 - PaddleDetection(https://github.com/PaddlePaddle/PaddleDetection) - - 进一步提升YOLOv3模型精度,COCO数据上精度达到43.2%,相比上个版本绝对提升1.4%。 - - 新增模型实现及预训练模型: - - 新增Google AI Open Images 2019-Object Detction比赛中的最佳单模型CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd,同时也发布此算法基于Objects365数据的预训练模型。 - - 新增backbone为CBResNet、Res2Net、HRNet的系列预训练模型。 - - 新增LibraRCNN算法及预训练模型。 - - FasterRCNN R50 FPN模型新增基于GIoU、DIoU、CIoU loss的预训练模型,不降低预测速度的情况下,在COCO数据上精度分别提升1.1%,0.9%,1.3%。 - - 新增模块: - - 主干网络: 
新增CBResNet、Res2Net、HRNet。 - - Loss模块: 新增GIoU loss、 DIoU loss、CIoU loss,以及Libra loss,YOLOv3的loss支持细粒度op组合。 - - 后处理模块: 新增softnms,DIOU nms模块。 - - 正则模块: 新增DropBlock模块。 - - 功能优化和改进: - - 加速YOLOv3数据预处理,整体训练提速40%。 - - 优化数据预处理逻辑。 - - 增加人脸检测预测benchmark数据。 - - 增加Paddle预测库Python API下的预测示例。 - - 检测模型压缩 : - - 裁剪: 发布MobileNet-YOLOv3裁剪方案和模型,在VOC数据集上FLOPs - 69.6%, mAP + 1.4%,在COCO数据集上FLOPS-28.8%, mAP + 0.9%; 发布ResNet50vd-dcn-YOLOv3裁剪方案和模型,在COCO数据集上FLOPS - 18.4%, mAP + 0.8%。 - - 蒸馏: 发布MobileNet-YOLOv3蒸馏方案和模型,在VOC数据上mAP + 2.8%,在COCO数据上mAP + 2.1%。 - - 量化: 发布YOLOv3-MobileNet和BlazeFace的量化模型。 - - 裁剪+蒸馏: 发布MobileNet-YOLOv3裁剪+蒸馏方案和模型,在COCO数据集上FLOPS - 69.6%,GPU下预测加速64.5%,mAP - 0.3 %; 发布ResNet50vd-dcn-YOLOv3裁剪+蒸馏方案和模型,基于COCO数据FLOPS - 43.7%,GPU下预测加速24.0%,mAP + 0.6 %。 - - 搜索: 开源BlazeFace-Nas的完整搜索方案。 - - 预测部署: - - 适配Paddle预测库对TensorRT的支持、对FP16精度的支持。 - - 文档: - - 新增数据预处理模块介绍文档、实现自定义数据Reader的文档。 - - 新增如何新增算法模型的文档。 - - 文档部署到网站: https://paddledetection.readthedocs.io/zh/latest/ +- 进一步提升YOLOv3模型精度,COCO数据上精度达到43.2%,相比上个版本绝对提升1.4%。 +- 新增模型实现及预训练模型: +- 新增Google AI Open Images 2019-Object Detction比赛中的最佳单模型CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd,同时也发布此算法基于Objects365数据的预训练模型。 +- 新增backbone为CBResNet、Res2Net、HRNet的系列预训练模型。 +- 新增LibraRCNN算法及预训练模型。 +- FasterRCNN R50 FPN模型新增基于GIoU、DIoU、CIoU loss的预训练模型,不降低预测速度的情况下,在COCO数据上精度分别提升1.1%,0.9%,1.3%。 +- 新增模块: +- 主干网络: 新增CBResNet、Res2Net、HRNet。 +- Loss模块: 新增GIoU loss、 DIoU loss、CIoU loss,以及Libra loss,YOLOv3的loss支持细粒度op组合。 +- 后处理模块: 新增softnms,DIOU nms模块。 +- 正则模块: 新增DropBlock模块。 +- 功能优化和改进: +- 加速YOLOv3数据预处理,整体训练提速40%。 +- 优化数据预处理逻辑。 +- 增加人脸检测预测benchmark数据。 +- 增加Paddle预测库Python API下的预测示例。 +- 检测模型压缩 : +- 裁剪: 发布MobileNet-YOLOv3裁剪方案和模型,在VOC数据集上FLOPs - 69.6%, mAP + 1.4%,在COCO数据集上FLOPS-28.8%, mAP + 0.9%; 发布ResNet50vd-dcn-YOLOv3裁剪方案和模型,在COCO数据集上FLOPS - 18.4%, mAP + 0.8%。 +- 蒸馏: 发布MobileNet-YOLOv3蒸馏方案和模型,在VOC数据上mAP + 2.8%,在COCO数据上mAP + 2.1%。 +- 量化: 发布YOLOv3和BlazeFace的量化模型。 +- 裁剪+蒸馏: 发布MobileNet-YOLOv3裁剪+蒸馏方案和模型,在COCO数据集上FLOPS - 69.6%,GPU下预测加速64.5%,mAP - 0.3 %; 
发布ResNet50vd-dcn-YOLOv3裁剪+蒸馏方案和模型,基于COCO数据FLOPS - 43.7%,GPU下预测加速24.0%,mAP + 0.6 %。 +- 搜索: 开源BlazeFace-Nas的完整搜索方案。 +- 预测部署: +- 适配Paddle预测库对TensorRT的支持、对FP16精度的支持。 +- 文档: +- 新增数据预处理模块介绍文档、实现自定义数据Reader的文档。 +- 新增如何新增算法模型的文档。 +- 文档部署到网站: https://paddledetection.readthedocs.io/zh/latest/ - PaddleSeg(https://github.com/PaddlePaddle/PaddleSeg) - - 新增模型 - - 适用于车道线分割场景的LaneNet模型。 - - 适用于轻量级Fast-SCNN模型。 - - 适用于高精度场景的HRNet语义分割模型 。 - - 发布基于PaddleSlim的多种模型压缩方案: - - 基于Cityscape的Fast-SCNN裁剪方案和模型。 - - 基于Cityscape的Deeplabv3p-Xception和Deeplabv3p-MobilenetV2蒸馏方案。 - - 基于Cityscape的Deeplabv3p-MobilenetV2搜索方案。 - - 基于Cityscape的Deeplabv3p-Mobilenet量化方案和模型。 - - 预测部署能力提升 - - 新增Python轻量级部署。 - - 新增对 FP16、Int8量化模型的TensorRT预测加速支持。 - - 新增DeepLabv3p-MobileNetV2的人像分割Paddle-Lite移动端部署教程和案例。 - - 优化模型导出环节,支持图像预处理和后处理的GPU化,性能提升10%~20%。 - - 提供U-Net, ICNet, PSPNet, DeepLabv3+等模型的在不同尺寸图像的预测性能Benchmark,便于用户根据性能进行模型选型。 - - 体验优化 - - 新增学习率warmup功能,支持与不同的学习率Decay策略配合使用,提升Fine-tuning的稳定性。 - - 支持对标注图使用伪彩色图像格式的保存,提升标注图片的预览体验。 - - 新增自动保存mIoU最优模型的功能。 - - 全面优化文档逻辑,提供如工业质检、眼底筛查等工业场景的AIStudio实战教程。 +- 新增模型 +- 适用于车道线分割场景的LaneNet模型。 +- 适用于轻量级Fast-SCNN模型。 +- 适用于高精度场景的HRNet语义分割模型 。 +- 发布基于PaddleSlim的多种模型压缩方案: +- 基于Cityscape的Fast-SCNN裁剪方案和模型。 +- 基于Cityscape的Deeplabv3p-Xception和Deeplabv3p-MobilenetV2蒸馏方案。 +- 基于Cityscape的Deeplabv3p-MobilenetV2搜索方案。 +- 基于Cityscape的Deeplabv3p-Mobilenet量化方案和模型。 +- 预测部署能力提升 +- 新增Python轻量级部署。 +- 新增对 FP16、Int8量化模型的TensorRT预测加速支持。 +- 新增DeepLabv3p-MobileNetV2的人像分割Paddle-Lite移动端部署教程和案例。 +- 优化模型导出环节,支持图像预处理和后处理的GPU化,性能提升10%~20%。 +- 提供U-Net, ICNet, PSPNet, DeepLabv3+等模型的在不同尺寸图像的预测性能Benchmark,便于用户根据性能进行模型选型。 +- 体验优化 +- 新增学习率warmup功能,支持与不同的学习率Decay策略配合使用,提升Fine-tuning的稳定性。 +- 支持对标注图使用伪彩色图像格式的保存,提升标注图片的预览体验。 +- 新增自动保存mIoU最优模型的功能。 +- 全面优化文档逻辑,提供如工业质检、眼底筛查等工业场景的AIStudio实战教程。 - ElasticRec(https://github.com/PaddlePaddle/ElasticRec) - - - - 发布了ElasticRec推荐排序系统,通过K8S进行部署,支持流式训练和在线预测服务。 +- +- 发布了ElasticRec推荐排序系统,通过K8S进行部署,支持流式训练和在线预测服务。 ## 工具组件 - PaddleHub(https://github.com/PaddlePaddle/PaddleHub) - - 
预训练模型丰富,新增52个预训练模型,目前预训练模型总数100+: - - 语义模型:新增RoBERTa_wwm、BERT_wwm、ERNIE-Tiny等5个语义模型 - - 文本分类:新增黄反鉴别模型3个。 - - 图像分类:新增ResNext-WSL、EfficientNet等共36个图像分类模型。 - - 目标检测:新增行人检测,车辆检测等共5个检测模型。 - - 关键点检测:新增人脸关键点检测和人体姿态关键点检测模型2个。 - - 人脸口罩检测:新增基于PyramidBox-Lite的人脸口罩检测模型2个。 - - 通用人脸检测:新增Ultra Light Fast Generic Face Detector、PyramidBox-Lite等通用人脸检测模型4个。 - - 功能: - - 新增基于Paddle Serving的Bert Service文本向量表示服务。 - - Task灵活性增强,新增Hook机制可以支持用户自定义代码加载。 - - 新增彩色Colorlog,修复日志重复打印问题。 - - 优化代码结果,命令行执行速度提升50% 。 - - 重构Dataset、Reader,适配自定义数据集代码量降低60%。 - - 优化AutoFinetune接口,支持多实验的可视化效果显示。 - - 体验优化 - - 逻辑全面优化,新增丰富的AIStudio教程内容。 - - 官网落地页全新升级,提供在线快速体验和教程指导的功能。 +- 预训练模型丰富,新增52个预训练模型,目前预训练模型总数100+: +- 语义模型:新增RoBERTa_wwm、BERT_wwm、ERNIE-Tiny等5个语义模型 +- 文本分类:新增黄反鉴别模型3个。 +- 图像分类:新增ResNext-WSL、EfficientNet等共36个图像分类模型。 +- 目标检测:新增行人检测,车辆检测等共5个检测模型。 +- 关键点检测:新增人脸关键点检测和人体姿态关键点检测模型2个。 +- 人脸口罩检测:新增基于PyramidBox-Lite的人脸口罩检测模型2个。 +- 通用人脸检测:新增Ultra Light Fast Generic Face Detector、PyramidBox-Lite等通用人脸检测模型4个。 +- 功能: +- 新增基于Paddle Serving的Bert Service文本向量表示服务。 +- Task灵活性增强,新增Hook机制可以支持用户自定义代码加载。 +- 新增彩色Colorlog,修复日志重复打印问题。 +- 优化代码结果,命令行执行速度提升50% 。 +- 重构Dataset、Reader,适配自定义数据集代码量降低60%。 +- 优化AutoFinetune接口,支持多实验的可视化效果显示。 +- 体验优化 +- 逻辑全面优化,新增丰富的AIStudio教程内容。 +- 官网落地页全新升级,提供在线快速体验和教程指导的功能。 - 多任务学习框架PALM(https://github.com/PaddlePaddle/PALM) - - 支持python3和windows - - 升级框架内核和多任务底层机制,开放API调用 - - 灵活的模型保存机制,支持单任务保存和全图保存 - - 支持连续训练和连续预测,单次执行下可自由切换数据集文件 - - 新增模型定制化/自定义功能 - - 重构多任务底层kernel,修复若干影响通用性和稳定性的bugs - - 强化多任务学习能力 - - 支持多任务场景下每个任务有不同的batch size和sequence length - - 修复了多任务多卡训练时,各个显卡上任务不一致的问题 - - 优化了多任务学习调度和终止策略,普遍提升模型泛化能力 - - 强化支持的任务的功能和类型 - - 匹配任务支持增强,支持pairwise learning和多类别(如NLI句子关系判断)。 - - 机器阅读理解任务支持增强,新增用户可控的预处理超参数。 - - 新增支持序列标注任务。 - - 强化大规模训练/推理能力 - - 新增自动多卡预测能力 - - 重构异步reader,多卡场景下支持变长padding - - 新增预训练模型管理和下载模块 - - 支持BERT、ERNIE、RoBERTa等各预训练模型的管理和下载 - - 新增RoBERTa中文预训练模型 +- 支持python3和windows +- 升级框架内核和多任务底层机制,开放API调用 +- 灵活的模型保存机制,支持单任务保存和全图保存 +- 支持连续训练和连续预测,单次执行下可自由切换数据集文件 +- 新增模型定制化/自定义功能 +- 
重构多任务底层kernel,修复若干影响通用性和稳定性的bugs +- 强化多任务学习能力 +- 支持多任务场景下每个任务有不同的batch size和sequence length +- 修复了多任务多卡训练时,各个显卡上任务不一致的问题 +- 优化了多任务学习调度和终止策略,普遍提升模型泛化能力 +- 强化支持的任务的功能和类型 +- 匹配任务支持增强,支持pairwise learning和多类别(如NLI句子关系判断)。 +- 机器阅读理解任务支持增强,新增用户可控的预处理超参数。 +- 新增支持序列标注任务。 +- 强化大规模训练/推理能力 +- 新增自动多卡预测能力 +- 重构异步reader,多卡场景下支持变长padding +- 新增预训练模型管理和下载模块 +- 支持BERT、ERNIE、RoBERTa等各预训练模型的管理和下载 +- 新增RoBERTa中文预训练模型 - 联邦学习PaddleFL(https://github.com/PaddlePaddle/PaddleFL): - - 新增scheduler与submitter功能:scheduler可用于在训练过程中控制trainer是否参加更新 。submitter可用于完成在MPI集群提交paddleFL任务的功能 - - 新增LEAF dataset联邦学习公开数据集,并添加api,用于设置benchmark。支持图像分类,情感分析,字符预测等领域的经典数据集,如MNIST,Sentiment140 - - 根据新增组件,在example中修改了原有的样例,并添加了femnist_demo, submitter_demo样例 - - 优化fl_distribute_transpiler,使FedAvg strategy新增对adam optimizer支持; - - 新增SecAgg strategy(Secure Aggregation),用于实现安全的参数聚合; +- 新增scheduler与submitter功能:scheduler可用于在训练过程中控制trainer是否参加更新 。submitter可用于完成在MPI集群提交paddleFL任务的功能 +- 新增LEAF dataset联邦学习公开数据集,并添加api,用于设置benchmark。支持图像分类,情感分析,字符预测等领域的经典数据集,如MNIST,Sentiment140 +- 根据新增组件,在example中修改了原有的样例,并添加了femnist_demo, submitter_demo样例 +- 优化fl_distribute_transpiler,使FedAvg strategy新增对adam optimizer支持; +- 新增SecAgg strategy(Secure Aggregation),用于实现安全的参数聚合; + +- 深度强化学习框架PARL(https://github.com/PaddlePaddle/PARL) +- 发布v1.3版。 +- 新增对Multi-Agent RL算法支持,包括MADDPG。 +- 新增对多卡训练的支持,发布多卡DQN算法示例。 +- 开源连续控制领域的SOTA算法TD3和SAC。 +- 开源NeurIPS2019强化学习挑战赛事冠军模型实现和训练方案,开放训练好的模型(可考虑公开课) +- 飞桨图学习框架PGL(https://github.com/PaddlePaddle/PGL) +- 发布v1.1版: +- 新增对权威图学习数据集OGB的支持,全面支持nodepropered、linkpred、graphpropered三大类型任务,并发布SOTA基线。 +- 发布图推荐解决方案PGL-Rec和知识图嵌入算法集PGL-KE。 +- 易用化改进,发布PGL高阶API。 +- 其他升级点:多进程图采样优化,加速GraphSAGE类模型3倍;新增基于Lod Tensor的Graph Batch算子,Graph Pooling算子;Model Zoo新增模型,包括分布式异构图算法、GraphZoom、PinSage等。 ## 代码重构和升级 - 编译 - - 增加WITH_NCCL编译选项,单卡用户可显示指定WITH_NCCL=OFF加速编译。 - - 新增编译选项WITH_TP_CACHE,缓存第三方源码,避免重复下载,Windows用户可将其设置为ON,加快编译速度并提高编译稳定性。 - - `CUDA_ARCH_NAME`默认值设成`Auto`(`All`表示编译所有gpu架构,`Auto`表示只编译当前机器gpu架构),对开发者来说,使用`Auto`比`All`节省非常多的编译时间,提高开发效率。 - - 
减少了冗余的link环节与产物、多余的文件拷贝,加快了Windows下的编译速度。 +- 增加WITH_NCCL编译选项,单卡用户可显示指定WITH_NCCL=OFF加速编译。 +- 新增编译选项WITH_TP_CACHE,缓存第三方源码,避免重复下载,Windows用户可将其设置为ON,加快编译速度并提高编译稳定性。 +- `CUDA_ARCH_NAME`默认值设成`Auto`(`All`表示编译所有gpu架构,`Auto`表示只编译当前机器gpu架构),对开发者来说,使用`Auto`比`All`节省非常多的编译时间,提高开发效率。 +- 减少了冗余的link环节与产物、多余的文件拷贝,加快了Windows下的编译速度。 - 外部依赖库 - - 升级MKL-DNN到最新1.1版本。 - - 将预测库与`third_party` 解耦,重构了28个第三方依赖的编译代码,便于统一管理外部依赖。 - - 移除了第三方依赖的私人仓库2个、无用依赖1个、无用的patch下代码2000+行,提高仓库质量。 +- 升级MKL-DNN到最新1.1版本。 +- 将预测库与`third_party` 解耦,重构了28个第三方依赖的编译代码,便于统一管理外部依赖。 +- 移除了第三方依赖的私人仓库2个、无用依赖1个、无用的patch下代码2000+行,提高仓库质量。 - 代码清理、重构和优化 - - 去掉无用的`contrib/float16`目录,删除BRPC下无用的snappy/snappystream依赖。 - - 从 `python/paddle/fluid/layers/nn.py`中,根据API功能拆出`loss.py`和`sequence_lod.py`,减少`nn.py`的代码量,便于阅读。 - - 修复`-Wno-error=sign-compare`的warning对应的代码(共100多处),后续所有该类warning会在编译时报错,提高代码质量 - - 去掉WindowsMSVC编译的`WarningLnk4006/WarningLnk4221`(共约300处),提高仓库质量。 - - 减少reduce_op, expand_op, expand_as_op模版类数量,加速GPU编译和减少whl包70M的空间。 - - 动态图下通过代码自动生成每个OP的pybind函数,用于在layers中直接调用,提高动态图性能并减少与静态图的耦合度。 +- 去掉无用的`contrib/float16`目录,删除BRPC下无用的snappy/snappystream依赖。 +- 从 `python/paddle/fluid/layers/nn.py`中,根据API功能拆出`loss.py`和`sequence_lod.py`,减少`nn.py`的代码量,便于阅读。 +- 修复`-Wno-error=sign-compare`的warning对应的代码(共100多处),后续所有该类warning会在编译时报错,提高代码质量 +- 去掉WindowsMSVC编译的`WarningLnk4006/WarningLnk4221`(共约300处),提高仓库质量。 +- 减少reduce_op, expand_op, expand_as_op模版类数量,加速GPU编译和减少whl包70M的空间。 +- 动态图下通过代码自动生成每个OP的pybind函数,用于在layers中直接调用,提高动态图性能并减少与静态图的耦合度。 ## BUG修复 @@ -332,9 +345,10 @@ Release Notes - 修复一些 GFLAGS 不能在预测库外进行指定的问题。 - 修复 Analysistor 多线程下若干 Pass 导致预测随机 core 的问题。(fc_gru_fuse_pass,seqconv_eltadd_relu_fuse_pass,attention_lstm_fuse_pass,embedding_fc_lstm_fuse_pass,fc_lstm_fuse_pass,seq_concat_fc_fuse_pass) - 修复了在使用 NativePredictor 指定使用 CPU 预测后,在同一进程内使用 AnalysisConfig 指定 GPU 不生效的错误。 -- 修复-DWITH_MKL=OFF时编译报错(setup.py拷贝与op_function_cmd出错)的bug。 +- 修复Windows上-DWITH_MKL=OFF时编译报错的bug。 - 修复py_func OP无法输入tuple(Variable) 的bug,新增如何写PythonOP的代码示例。 - 修复sigmoid cudnn 
kernel错调用成tanh cudnn kernel的问题。 -- 修复部分动态图模式下reshape、depthwiseconv相关的bug;修复网络中部分参数无梯度,导致程序crash 的bug。 +- 修复部分动态图模式下reshape、Conv2D相关的bug;修复网络中部分参数无梯度,导致程序crash 的bug。 - 修复GradientClip在参数服务器模式下运行错误的BUG。 - 修复参数服务器全异步模式下内存泄露的问题。 + diff --git a/doc/fluid/release_note_en.md b/doc/fluid/release_note_en.md index d48271b633acb52bfdd93d261eb0fba28041d310..bc734ebe6ead541f60bb0650a2292296076a659a 100644 --- a/doc/fluid/release_note_en.md +++ b/doc/fluid/release_note_en.md @@ -1,165 +1,203 @@ Release Notes ============== - ## Important Updates -In this version, the authors focus on enhancing the framework function level, the forecast deployment capability is fully improved, the distributed release PLSC supports the super-large-scale classification, and the parameter server mode is optimized and integrated. The compilation options, the compilation dependence, and the code library are fully cleaned up and optimized. The model library is continuously improved, the overall hierarchy is optimized, and the implementation of the dynamic graph model is added. The end-to-end development kits and utility components are further perfected. +This version focuses on enhancing the framework functions, including improving the inference deployment capability, releasing PLSC for super-large-scale classification training tasks, and optimizing the parameter server mode. In addition, the compilation options, compilation dependencies and code library are fully cleaned up and optimized. The model library is optimized by adjusting the structure and adding dynamic graph models. The development kits and utility components are upgraded. + +**Training Framework**: + +- Adds AMP (Automatic Mixed Precision) interfaces and control flow interfaces. +- Optimizes tensor usage and the GPU memory allocation strategy. +- Supports the Nvidia DALI GPU data preprocessing library.
+- Optimizes the functions and performance of basic Ops. +- Enhances the functions of dynamic graph models, including performance improvements and new APIs that convert a data-independent dynamic graph model into a static graph model. +- Improves the user experience of debug functions. + +**Inference Deployment**: + +- Paddle Serving +- Optimizes the Python API. +- Supports APIs for additional programming languages, such as R and Go. +- Enhances the quantization capability. +- Paddle Lite +- Supports deploying the model generated by the post-training quantization method without calibration data. +- Enhances the OpenCL capability. +- Supports Kunlun XPU. +- PaddleSlim +- Optimizes the pruning, quantization, distillation and NAS (Network Architecture Search) APIs to integrate with the model library. +- Adds a large-scale knowledge distillation framework called Pantheon. -**Training Framework**: An AMP interface and a new control flow interface are added. The tensor usage method and the GPU memory allocation strategy are optimized. A library that supports the Nvidia DALI GPU data preprocessing is added. The function and performance of the basic OP are continually optimized. The function of the dynamic graph is further perfected and the performance is greatly improved. A function that converts the data independent dynamic graph model into the static graph predictable deployment model is provided. The framework debugging analysis function and the ease of use are fully enhanced. +**Distributed Training**: -**Forecast Deployment**: The Python API of the server-side forecast library is significantly optimized. A usage method and example of the R language and Go language call forecast library are added. The quantification support capability is strengthened. Paddle Lite supports a model generated by the post-training quantification method without calibration data.
Tailoring, quantification, distillation, and search interfaces are reconstructed for the model compression library PaddleSlim. A large-scale scalable knowledge distillation framework Pantheon is added to fully connect to the model library. +- Unifies the implementation of the semi-asynchronous, fully asynchronous and GEO modes in parameter server mode. The back-end is unified into communicator. The front-end interface is unified into fleet. Different modes can be selected by configuring the fleet strategy. +- Releases PLSC for super-large-scale classification training tasks. -**Distributed Aspect**: In parameter server mode, the back-end implementation is united into the communicator and the front-end interface is united into the fleet for the synchronous, semi-asynchronous, and fully asynchronous modes of the transpiler. Different modes are flexibly selected using the fleet strategy. A large-scale classification library PLSC is released and the classification tasks of a great many classes are supported using model parallel. +**Model Construction**: -**Basic Model Library**: A speech synthesis library Parakeet is released, including several leading-edge synthesis algorithms. 14 image classification pre-training models are added in PaddleCV. The 3D and tracking direction model continues to be enriched. The participle and part-of-speech tagging model of PaddleNLP supports a jieba participle. A multi-task model MMoE is added in PaddleRec. Extensive dynamic graph model implementations are added in the model library as a whole. The overall hierarchy of the model library is adjusted and optimized. +- Releases the text-to-speech model library called Parakeet, including several leading-edge text-to-speech algorithms. +- Adds 14 image classification pre-trained models in PaddleCV and enriches the 3D and tracking models. +- Supports Jieba word segmentation in PaddleNLP. +- Adds a multi-task model called MMoE in PaddleRec. +- Adds more dynamic graph models.
+- Adjusts and optimizes the structure of the model library. -**End-to-End Development Kits**: A large number of model implementations and pre-training models are added in PaddleDetection and PaddleSeg. The training speed and accuracy of typical models are enhanced. The model compression and deployment capabilities are significantly improved. The user experience is fully optimized. A recommended sorting system ElasticRec is released. Deployment is performed via K8S. Streaming training and online forecast services are supported. +**Development Kits**: -**Utility Components**: 52 pre-training models are added in PaddleHub, with a total of more than 100. The function and experience are continuously optimized. The kernel of the multi-task learning framework PALM is upgraded. The API call is open. More task types are supported. An open dataset is added in the federated learning PaddleFL. +- Optimizes PaddleDetection and PaddleSeg by adding a large number of models and pre-trained models, improving the training speed and accuracy of typical models, and strengthening the model compression and deployment capabilities. +- Releases the recommendation ranking system ElasticRec, which is deployed via K8S and supports streaming training and online inference services. + +**Utility Components**: + +- Adds 52 pre-trained models to PaddleHub, bringing the total to more than 100, and improves its functions and user experience. +- Upgrades the kernel of PALM, opens its API, and supports more task types. +- Adds an open dataset in PaddleFL (federated learning framework). +- Upgrades PARL (deep reinforcement learning framework) and PGL (graph learning framework), which support more functions and open more algorithms and baselines. ## Training Framework - API - - An AMP interface is added: A network can be converted into mixed accuracy training in a general way while the accuracy fluctuation is ensured to be within the normal range.
- - A new control flow interface is added and recommended: Four control flow Ops including while\_loop (loop control function), cond (conditional branch function), case, and switch\_case (branch control function) are added for the ease of use and the following new functions are supported: - - Python callable is used as a control condition or executive. - - Different branches in the control flow use different losses or optimizers. - - Conditions in the control flow partially use CPU or GPU data. - - Parameters of some APIs support the use of a variable list: Support for a variable list is added according to the case that the parameter\_list or no\_grad\_set parameter of some APIs supports only the use of a string list. It is no longer necessary to obtain the name attribute of related variables in advance when using the following APIs: - - fluid.backward.append\_backward(loss, parameter\_list=None, no\_grad\_set=None, callbacks=None) - - fluid.backward.gradients(targets, inputs, target\_gradients=None, no\_grad\_set=None) - - The minimize methods of various optimizers, such as Adam’s minimize: minimize(loss, startup\_program=None, parameter\_list=None, no\_grad\_set=None, grad\_clip=None) +- Adds AMP (Automatic Mixed Precision) APIs, which convert a network to mixed precision training in a general way while keeping the accuracy fluctuation within the normal range. +- Adds the control flow OPs while_loop, cond, case and switch_case. The new APIs are recommended because they are easier to use. They support the following new functions: +- Supports using a python callable as the control condition or the executed body. +- Supports using different losses or optimizers in different branches of the control flow. +- Supports using CPU data or GPU data in the condition of the control flow. + +- Supports variable lists as the ‘parameter_list’ or ‘no_grad_set’ parameter of some APIs, which previously accepted only string lists.
It is no longer necessary to obtain the ‘name’ attribute of variables in advance when using the following APIs: +- fluid.backward.append_backward(loss, parameter_list=None, no_grad_set=None, callbacks=None) +- fluid.backward.gradients(targets, inputs, target_gradients=None, no_grad_set=None) +- The minimize methods of optimizers, such as Adam’s minimize: minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None) - Basic Function Optimization - - The float16 type of numpy is used to set to Tensor data without the necessity of conversion into the uint16 type first. - - The minus sign is directly used to get the opposite number of Tensor. - - GPU memory Allocation Strategy: - - The default policy is changed to AutoGrowth: The GPU memory is applied for as needed without affecting the training speed. This avoids the problem that it is difficult to restart a new task on the same GPU card under the previous default GPU memory pre-allocation strategy. - - GPU memory allocation adjustment for multi-card tasks: The GPU memory allocators on different GPU cards are set to the Lazy initialization mode. If a user does not use a card, no GPU memory will be applied for on this card. This avoids the GPU memory OOM problem caused by running tasks on idle GPU cards without setting CUDA\_VISIBLE\_DEVICES when any GPU memory is occupied on other GPU cards. - - OP Function Upgrade - - elu: This activation function supports the calculation of second-order gradients. - - Prroi\_pool: The rois parameter may accept the Tensor or LoDTensor type. - - Conv2d, pool2d, batch\_norm, LRN: All reverse calculations support the use of the MKL-DNN high-performance calculation library. - - argsort: The descending sort is supported (A descending parameter is added. The default is False). +- Supports setting tensor data with the numpy float16 data type, with no need to convert to the uint16 type first. +- Supports using the minus sign to get the opposite of a tensor.
+- GPU memory Allocation Strategy: +- Changes the default policy to ‘AutoGrowth’. With this policy, GPU memory is allocated on demand without affecting the training speed. This avoids the problem that, under the previous default pre-allocation strategy, it was difficult to start another task on the same GPU. +- Adjusts the GPU memory allocation for multi-card tasks: the GPU memory allocators on different GPU cards are set to the ‘Lazy’ initialization mode. If a card is not used, no GPU memory is allocated on it. This avoids the OOM problem caused by running tasks on idle GPU cards without setting CUDA_VISIBLE_DEVICES when GPU memory is occupied on other GPU cards. +- OP Function Upgrade +- elu: This activation function supports the calculation of second-order gradients. +- Prroi_pool: The parameter ‘rois’ supports the ‘Tensor’ or ‘LoDTensor’ type. +- Conv2d, pool2d, batch_norm, lrn: Support using the MKL-DNN library for the gradient calculation of these OPs. +- argsort: Supports descending order. A new parameter ‘descending’ is added, with a default value of ‘False’. + - Basic Performance Optimization - - DALI Preprocessing Acceleration - - The support for the Nvidia DALI GPU data preprocessing library is added, which can be used to accelerate the preprocessing of data such as images, videos, and speeches. - - Automatic Mixed Precision Training Optimization - - With the implementation of the following optimization strategy as well as DALI data preprocessing, the training throughput of the ResNet50 model is increased substantially: The mixed accuracy training throughput of a single V100 card is increased to 1,000+ images/s from 600+ images/s. The throughput of 8 cards for a single machine is 7,840 image/s. The throughput of 32 cards for 4 machines is 28,594 images/s.
- - The support of batch\_norm, conv2d, and other ops for NHWC data layout input is enhanced to accelerate fp16 calculation using Tensor Core technology. - - Some op patterns in the model such as batch\_norm and relu are fused based on the IR Pass mechanism. - - The kernel of elementwise (add, mul) and other ops is optimized. - - RecomputeOptimizer is optimized to improve the batchsize. In the bert-large model, the maximum batchsize is increased by 533.62% compared with that without using RecomputeOptimizer, doubling the maximum batchsize of the previous version. - - OP Performance Optimization - - The fusion operator fuse\_emb\_seq\_pool of embedding and sequence\_pool is implemented and murmurhash3\_x64\_128 in bloom\_filter is optimized. The training speed of some NLP models is effectively improved. - - The GPU performance of mean op is optimized. When the input data is 32328\*8 Tensor, the forward calculation speed is increased by 2.7 times. - - Optimize assign and lod\_reset op are optimized to avoid unwanted GPU memory copy and data transform. - - The kernel implementation of stack OP is optimized. The performance of a single card of GPU in the XLnet/Ernie model is improved by 4.1%. +- DALI Preprocessing Acceleration +- Supports the Nvidia DALI GPU data preprocessing library, which can be used to accelerate the preprocessing speed of data such as images, videos, and speeches. +- Automatic Mixed Precision Training Optimization +- Implements the following optimization strategies to increase the training throughput of the ResNet50 model, along with the DALI data preprocessing module. The mixed accuracy training throughput of a single V100 card is increased from 600+ images/s to 1,000+ images/s. The throughput of 8 cards for a single machine is increased to 7,840 image/s. The throughput of 32 cards for 4 machines is increased to 28,594 images/s. +- Supports NHWC data layout inputs for some OPs such as batch_norm, conv2d. 
Accelerates fp16 calculation speed by using Tensor Core technology. +- Fuses some op patterns in the model, such as batch_norm and relu, based on the IR Pass mechanism. +- Optimizes the kernels of some elementwise OPs, such as add and mul. +- Optimizes the ‘RecomputeOptimizer’ to enable a bigger batch size. The maximum batch size of the Bert-large model increases by 533.62% when using the ‘RecomputeOptimizer’. +- OP Performance Optimization +- Implements the fusion operator ‘fuse_emb_seq_pool’ of ‘embedding’ and ‘sequence_pool’, and optimizes ‘murmurhash3_x64_128’ in ‘bloom_filter’. These optimizations increase the training speed of some NLP models. +- Optimizes the GPU performance of the ‘mean’ op. When the input is a 32\*32\*8\*8 tensor, the forward calculation speed is increased by 2.7 times. +- Optimizes the ‘assign’ and ‘lod_reset’ OPs to avoid unnecessary GPU memory copy and data transform. +- Optimizes the kernel implementation of the stack OP. The single-card GPU performance of the XLnet/Ernie model is improved by 4.1%. - Dynamic Graph - - Function Optimization - - The name\_scope parameter in the dynamic graph Layers is removed to make it easier for users to inherit and call. - - The block parameter in the to\_variable interface is removed to simplify the use of the API. - - As for the problem that model parameters depend on data, the build\_once design is removed so that Layers can get all the parameter tables at the end of **init** execution, which is convenient for load saving, parameter initialization, parameter debugging, and parameter optimization. - - Automatic pruning is improved to facilitate user networking and reduce the reverse calculation amount. - - The SelectedRows operation is supported so that the Embedding layer supports sparse update of a single card. - - As for the problem that the framework lacks containers, ParameterList, LayerList, and Sequencial functions are added to facilitate user networking.
- - Named\_sublayers and named\_parameters functions are supported to facilitate user programming. - - The Linear lr warmup decay strategy is supported. - - Performance Optimization - - The interaction of python with c++, GradMaker, OperatorBase, and allocator are optimized. For the LSTM-based language model task p on the P40 machine, the performance is improved by 270%. - - Redundant codes are removed for performance problems caused by calling dead codes of optimized\_guard in optimize for many times. For the Transformer model (batch\_size=64) on the P40 machine, the performance of optimizers such as SGD and Adam is improved by 5% to 8%. - - For the performance impact caused by adding scale\_op extra to update the beta parameter in AdamOptimizer, the beta updating logic is fused into adam\_op to reduce the call overhead of the op kernel. For the Dialogue-PLATO model on the P40 machine, the performance is improved by 9.67%. - - The asynchronous DataLoader of the dynamic graph is optimized. The overall training speed is improved by about 30% in the Mnist, ResNet, and other models. - - The numpy bridge function is added. Sharing the underlying data between Tensor and ndarray in CPU mode is supported to avoid the problem of needing to copy a numpy input when creating variables, and to improve efficiency. - - GPU memory optimization: Optimization strategy of deleting in advance the forward variable space that does not require Tensor Buffer in reverse. The maximum batch size is increased by more than 20%-30% in the ResNet and other models. - - Dynamic Graph Deployment - - The TracedLayer interface is supported. The conversion of the dynamic graph model into the static graph predictable deployment model is implemented. +- Function Optimization +- Removes the ‘name_scope’ parameter in ‘Layers’ to make it easier to inherit and call. +- Removes the ‘block’ parameter in the ‘to_variable’ to simplify the use of the API. 
+- Removes ‘build_once’ to solve the problem that model parameters depend on data, so that ‘Layers’ can get all the parameter tables at the end of ‘\_\_init\_\_’ execution. This is convenient for saving and loading, parameter initialization, parameter debugging, and parameter optimization. +- Optimizes the automatic pruning function to facilitate user networking and reduce the reverse calculation amount. +- Supports the ‘SelectedRows’ operation so that the Embedding layer supports sparse update of a single card. +- Adds containers such as ParameterList, LayerList, and Sequential, which the framework previously lacked, to make networking more convenient. +- Supports functions such as named_sublayers and named_parameters to facilitate programming. +- Supports the ‘Linear lr warmup decay’ strategy. +- Performance Optimization +- Optimizes the interaction of python with c++, GradMaker, OperatorBase, and allocator. The performance is improved by 270% for the LSTM-based language model task on the P40 machine. +- Removes redundant code to fix the performance problem caused by repeatedly calling dead code in ‘optimized_guard’. The performance of optimizers such as SGD and Adam is improved by 5% to 8% for the Transformer model (batch_size=64) on the P40 machine. +- Integrates the updating logic of ‘beta’ into ‘adam_op’ to remove the performance impact of the extra ‘scale_op’ that ‘AdamOptimizer’ added to update the beta parameter, which reduces the cost of calling op kernels. The performance of the Dialogue-PLATO model is improved by 9.67% on the P40 machine. +- Optimizes the asynchronous DataLoader of the dynamic graph. For Mnist, ResNet and other CV models, the single-card training speed is improved by more than 40% on the P40 machine. +- Adds a numpy bridge function to support sharing the underlying data between Tensor and ndarray in CPU mode.
This can avoid the copy problem of numpy input when creating variables, and improve efficiency. +- Optimizes the GPU memory by the forward variable space strategy, which can delete the Tensor Buffer not required in reverse calculation in advance. The maximum batch size is increased by more than 20%-30% in some models such as ResNet. +- Dynamic Graph Deployment +- Supports the ‘TracedLayer’ interface to convert the dynamic graph model into the static graph. - Debugging Analysis - - Error message optimization: Framework error messages are classified as a whole to achieve the , systematization of error messages. Copywriting optimization is finished to help users locate and solve problems more quickly and accurately. - - Optimization of the Performance Analysis Profile Function - - The function and accuracy of the profiler is enhanced. Profile options at different levels are supported. The call relation of events can be recorded in the profile data and printed. - - The nan inf check and debugging are optimized (effective through FLAGS\_check\_nan\_inf) and the performance, function, and output information are all greatly improved: - - In terms of speed, the v100 test ResNet50 model has a performance improvement of about 1000 times compared with the original utility components, and maintains an over 80% efficiency for normal training. - - In terms of function, the support for fp16 is added and environment variables can be set to skip the inspection of op, op\_role, and op\_var to facilitate the debugging of the fp16 model. - - The output information is detailed and accurate. Besides wrong op and tensor names, the quantity of wrong nan, inf, and normal numerical values are printed to facilitate debugging. 
-- A lightweight installation package paddlepaddle-tiny for CPU training and forecast is released and the window/linux/Mac operating system and python27/python35/python36/python37 are supported: - - The following options are compiled: no avx, no ml, no gpu, no unittest - - The slim and some datasets are pruned off. - - The Linux package size is reduced to 37 M from 90 M. The Windows package size is reduced to 9.6 M from 50.8 M. The MAC package size is reduced to 19.8 M from 59 M. - - The number of installation requirement dependencies are reduced to 7 from 15. - -## Forecast Deployment +- Optimizes error messages. Classifies the framework error messages and improves their wording so that problems can be located and solved more quickly and accurately. +- Optimizes the performance analysis profile function. +- Enhances the functions and accuracy of the profiler. Supports profile options at different levels. The call relation of events can be recorded in the profile data and printed. +- Optimizes the checking and debugging functions of ‘nan inf’, which are enabled through ‘FLAGS_check_nan_inf’. The performance, function, and output information are all greatly improved. +- In terms of speed, the ResNet50 model tested on V100 is about 1000 times faster than with the original tool, and keeps over 80% of the efficiency of normal training. +- In terms of function, the support for fp16 is added and environment variables can be set to skip the inspection of op, op_role, and op_var to facilitate the debugging of the fp16 model. +- The output information is detailed and accurate. Besides wrong op and tensor names, the quantities of wrong nan, inf, and normal numerical values are printed to facilitate debugging. +- Releases the lightweight installation package ‘paddlepaddle-tiny’ for CPU training and inference, which can be installed on Windows/Linux/Mac OS with python27/python35/python36/python37.
+- Supports the following compile options: no avx, no ml, no gpu, no unittest. +- Removes slim and some datasets. +- Reduces the Linux package size from 90M to 37M, the Windows package size from 50.8M to 9.6M, and the MAC package size from 59M to 19.8M. +- Reduces the number of installation requirement dependencies from 15 to 7. + +## Inference Deployment - Server-side Forecast Library - - Python API - - The read and write model from the memory is supported to meet the model encryption requirements. - - The Scale operator is no longer added at the end of the forecast model. - - The support for ZeroCopy forecast is added. The interface is basically the same as the C++ interface and supports numpy.ndarray as input and output. It is easier to use on the Python side. - - Multiple interfaces are added in AnalysisConfig to completely cover the C++ forecast functions, including removing pass and disabling forecast glog. - - Support for Other Programming Languages - - The usage method and example of the R language and Go language call forecast library are added. - - The corresponding header file of ProtoBuf is provided to external users to facilitate users to analyze the requirements for the model structure. - - For a forecast library with TRT compilation, a TensorRT library is not provided from thrid\_party any more and needs to be downloaded by users at https://developer.nvidia.com/tensorrt. - - Function Enhancement: - - Access to Paddle Lite using a submap is achieved and ResNet50 has been verified. - - The support for MKL-DNN FC INT8 kernel is added. - - Paddle-TensorRT supports the Ernie model. For the Ernie model (seq length = 128) on the T4 card, the fp16 forecast speed is 3.6 ms, which is faster than the fp32 forecast speed by 37%.
- - Quantification: Under the 2% improvement of the ERNIE INT8 accuracy compared with the FP32 accuracy, the single-threaded performance and the multi-threaded performance are improved by 2.79 times and 1.79 times for ERNIE INT8 on the second-generation Xeon scalable platform 6271 respectively. +- Python API +- Supports reading and writing models from memory to meet the model encryption requirements. +- The Scale operator is no longer added at the end of the inference model. +- Adds the ZeroCopy API, which is basically the same as the C++ API. It supports numpy.ndarray as input and output, which is convenient on the Python side. +- Adds several interfaces in AnalysisConfig to completely cover the C++ inference functions, including removing pass and disabling inference glog. +- Support for Other Programming Languages +- Adds inference APIs for R and Go, with related usage methods and examples. +- Provides the corresponding header file of ProtoBuf to help users parse the structure of models. +- For an inference library compiled with TRT, the TensorRT library is no longer provided from third_party and needs to be downloaded by users at https://developer.nvidia.com/tensorrt. +- Functional Enhancement: +- Supports accessing Paddle Lite in subgraph mode, verified on ResNet50. +- Supports the MKL-DNN FC INT8 kernel. +- Supports the Ernie model in Paddle-TensorRT. For the Ernie model (seq length = 128) on the T4 card, the latency of fp16 inference is 3.6 ms, which is 37% faster than fp32 inference. +- Quantization: For ERNIE INT8 on the second-generation Xeon scalable platform 6271, the single-threaded and multi-threaded performance are improved by 2.79 times and 1.79 times respectively, with only a slight loss of precision compared with the FP32 model. - Mobile/Embedded End-side Paddle Lite (https://github.com/PaddlePaddle/Paddle-Lite) - - Version v2.3 is released.
- - Multiple functions of Model\_optimize\_tool are upgraded. - - “The post-training quantification method without calibration data” is supported. The model storage space is reduced (by 2 to 4 times). - - OpenCL: The migration of 30 Image2D Kernels are finished and 14 Ops are covered. - - The support for FPGA and NPU is further strengthened. The forecast of Kunlun XPU is supported. - - A new official website document is released. A "post-training quantification method without calibration data" usage document is added. +- Releases version v2.3. +- Upgrades the functions of Model_optimize_tool. +- Supports the “post-training quantization method without calibration data”. The model storage space can be reduced by 2 to 4 times. +- OpenCL: The migration of 30 Image2D Kernels is finished and 14 Ops are covered. +- Strengthens the support for FPGA and NPU. Supports Kunlun XPU for inference. +- Releases a new official website document. Adds a usage document for the “post-training quantization method without calibration data”. - Paddle Serving (https://github.com/PaddlePaddle/Serving): - - The forecast service of remote text vector representation of the bert-type semantic understanding model is released. - - A paddle-gpu-serving WHL package is released. The forecast service can be deployed and used through pip installation and Python codes. - - 13 semantic understanding models in Paddlehub are supported. The single-machine multi-card mode is supported. The forecast speed is 869.56 samples/s when the average sample length is 7 under a single P4 GPU using the Ernie\_tiny model. +- Releases the inference service of remote text vector representation for BERT-type semantic understanding models. +- Releases the paddle-gpu-serving WHL package. The inference service can be deployed and used through pip installation and Python code. +- Supports 13 semantic understanding models in PaddleHub. Supports the single-machine multi-card mode.
The forecast speed is 869.56 samples per second using the Ernie_tiny model, when the average sample length is 7 under a single P4 GPU. - PaddleSlim (https://github.com/PaddlePaddle/PaddleSlim): - - PaddleSlim is split into independent repo. - - The tailoring, quantification, distillation and search interfaces are reconstructed. The underlying interfaces are open to users. - - Quantification: - - An offline quantification function based on KL divergence is added. The quantification of the Embedding layer is supported. - - The QAT MKL-DNN quantification strategy support for FC is added. - - PostTrainingQuantization is added to fully implement the post-training quantification function: The quantization of 30 kinds of Ops is supported. The flexible setting of OPs to be quantified is supported. Quantitative models are generated in a unified format . It has the advantages of short time consumption, ease of use, and small precision loss. - - Quantitative training supports setting the type of OP to be quantified. - - Tailoring: The tailoring implementation is reconstructed to support more types of networks. - - Search: - - SA search is supported. More search space is added. User-defined search space is supported. - - A one-shot search algorithm is added. The search speed is 20 times faster than that of the previous version. - - A large-scale scalable knowledge distillation framework Pantheon is added. - - Full decoupling is achieved between student and teacher models and between teacher models. They can independently run on different physical devices respectively to make full use of computing resources. - - The single-node multi-device large-scale forecast of the teacher model is supported. The acceleration ratio is tested to be linear on BERT and other models. - - TCP/IP protocol is used to achieve communication in online distillation mode. 
Knowledge transmission between teacher and student models running on any two physical devices in the same network environment is supported. - - API interfaces in online and offline distillation modes are unified. Different teacher models may operate in different modes. - - The merging of knowledge and the batch reorganization of knowledge data are completed automatically on the student side to facilitate the knowledge fusion of the multi-teacher model. - - Model Library: - - The compression benchmark of ResNet50 and MobileNet models is released. - - The detection library is connected and the compression benchmark for the YOLOv3 series of models is released. - - The segmentation library is connected and the compression benchmark for the Deepabv3+ series of segmentation models is released. - - Document Improvement: - - An API document is supplemented. An introductory tutorial and an advanced tutorial are added. A ModelZoo document is added to cover classification, detection, and segmentation tasks. All documents contain Chinese and English. +- Moves PaddleSlim to an independent repo. +- Refactors the pruning, quantization, distillation and NAS APIs. Provides more low-level APIs for developers. +- Quantization: +- Adds a post-training quantization strategy based on KL divergence. Supports quantization of the embedding layer. +- Supports quantization of the MKL-DNN FC layer based on QAT. +- Adds post-training quantization that supports 30 kinds of operators. Supports skipping quantization for selected operators. +- Supports skipping some operators in the quantization-aware training strategy. +- Pruning: Refactors and enhances the pruning code to support more kinds of networks. +- NAS: +- Supports NAS based on simulated annealing. Provides more predefined search spaces and supports custom search spaces. +- Adds a one-shot algorithm for NAS. The search speed is 20 times faster than that of the previous version. +- Releases the large-scale scalable knowledge distillation framework called Pantheon.
+- Achieves full decoupling between student and teacher models and among teacher models. They can run independently on different physical devices to make full use of computing resources. +- Supports multi-device large-scale inference of the teacher model on a single node. The acceleration ratio is tested to be linear on BERT-like complex models. +- Supports knowledge transmission between teacher and student models running on any two physical devices in the same network environment, using the TCP/IP protocol for communication in online distillation mode. +- Unifies the API interfaces for online and offline distillation modes, enabling different teacher models to operate in different distillation modes. +- The merging of knowledge and the batch reorganization of knowledge data are completed automatically on the student side to facilitate the knowledge fusion of multi-teacher models. + +- Model Zoo: +- Releases benchmarks of image classification models such as ResNet50 and MobileNet. +- Adapts the PaddleDetection library and releases benchmarks of YOLOv3 models with different backbones. +- Adapts the PaddleSeg library and releases benchmarks of DeepLabv3+ models with different backbones. +- Document Refinement: +- Refines the API documents. Adds QuickStart tutorials and advanced tutorials. Adds a model zoo document covering models for image classification, object detection, and semantic segmentation. Translates all documents into English. ## Distributed - Parameter Server Mode: - - The memory usage is greatly reduced during training. On 100 million embedding tasks, the Trainer-side memory can be reduced by 90%. - - The memory usage of distributed saving and loading models is greatly reduced. The Pserver-side memory peak value can be minimized to $1/N of the original value, where N$ is the number of Pserver nodes. - - The geo-sgd dense parameter communication is optimized. - - The distributed AUC index calculation is supported. - - A distributed barrier function is added.
- - An overdue warning is added in the non-Fleet transpiler API. This API is planned to be removed in PaddlePaddle-Fluid 2.0。 - - Semi-asynchronous and synchronous modes are added in Communicator. - - The TrainFromDataset training interface supports semi-asynchronous and synchronous modes. - - DistributedStrategy is added in Fleet to further improve the distributed ease of use and integrate the current distributed related flags. - - The Fleet pslib mode supports single-program multi-loss training to optimize the training performance. - - 100 billion sparse mode supports the k8s environment. +- Greatly reduces the memory usage during training. On 100 million embedding training tasks, the Trainer-side memory can be reduced by 90%. +- Greatly reduces the memory usage of distributed saving and loading of models. The Pserver-side memory peak value can be minimized to 1/N of the original value, where N is the number of Pserver nodes. +- Optimizes the dense parameter communication in GEO mode. +- Supports distributed AUC index calculation. +- Adds distributed barrier functions. +- Adds the semi-asynchronous mode in Communicator. +- Supports the semi-asynchronous mode in the ‘TrainFromDataset’ training interface. +- Adds ‘DistributedStrategy’ in ‘Fleet’ to improve ease of use and integrate the current distributed-related flags. +- Supports single-program multi-loss training in ‘Fleet pslib’ to optimize the training performance. +- Supports the k8s environment in 100 billion sparse mode. - Large-scale classification library PLSC: It supports the large-scale classification problem that data parallel cannot solve due to the limitation of video memory capacity (https://github.com/PaddlePaddle/PLSC). - - Three built-in models ResNet50, ResNet101, and ResNet152 are available and User-defined models are supported.
Under the single-machine eight-V100 GPU configuration, the ResNet50 model has a million-class training speed of 2,122.56 images/s, which is 1.3 times faster than that of the standard ResNet50 model. - - A plsc-serving whl package for model online forecast service is released to forecasts the image semantic vector representation of the face recognition model. Making a forecast using a user-trained model is supported. The forecast speed of the ResNet50 model (batch size=256) under a single V100 GPU is 523.47 images/s. - - A pre-training model based on the ResNet50 network and the MS1M-ArcFace dataset is released: https://plsc.bj.bcebos.com/pretrained\_model/resnet50\_distarcface\_ms1mv2.tar.gz. -- The benchmark for ResNet50 mixed precision training (single-card, multi-card, and multi-machine) is released. +- Provides three built-in models: ResNet50, ResNet101, and ResNet152. Supports user-defined models. Under the single-machine eight-V100 GPU configuration, the ResNet50 model has a million-class training speed of 2,122.56 images/s, which is 1.3 times faster than that of the standard ResNet50 model. +- Releases a ‘plsc-serving whl’ package for the online model inference service. It can predict the image semantic vector representation of the face recognition model. Supports inference with a user-trained model. The inference speed of the ResNet50 model (batch size=256) is 523.47 images/s under a single V100 GPU. +- Releases the pre-training models based on the ResNet50 network and the MS1M-ArcFace dataset: https://plsc.bj.bcebos.com/pretrained_model/resnet50_distarcface_ms1mv2.tar.gz
+- Releases the benchmark for ResNet50 mixed precision training (single-card, multi-card, and multi-machine). ## Basic Model Library @@ -167,184 +205,196 @@ In this version, the authors focus on enhancing the framework function level, th - PaddleNLP - - Seq2seq supports training modes such as RL and GAN. - - A training model for participle and part-of-speech tagging is released. A knowledge distillation framework Pantheon is used. The F1 value for its own dataset is 1% more than that of paddleNLP LAC. Jieba participles are incorporated. The deep learning model mode is enabled by adding a use\_paddle label. In addition, the paddle version detection and rollback mechanism is added in jieba to ensure user experience. - - Dynamic graph model implementations are added: word2vec, senta, transformer, Bert, seq2seq, LAC. +- Seq2seq supports training modes such as RL and GAN in the static graph mode of Paddle. +- A training model for word segmentation and part-of-speech tagging is released. With the knowledge distillation framework Pantheon, the F1 score of this model on its own dataset is 1% higher than that of PaddleNLP LAC. This model is merged into the jieba repo, with a use_paddle flag added to enable the deep learning model mode. In addition, the paddle version detection and rollback mechanism is added in jieba to ensure user experience. +- Adds dynamic graph model implementations for these models: word2vec, senta, transformer, Bert, seq2seq, and LAC. - PaddleSpeech - - Speech synthesis: A synthesis library Parakeet is released. - - A standard workflow for data preprocessing, training, and synthesis of the speech synthesis model is implemented. - - The out-of-the-box pre-processing implementation of typical datasets is provided. - - Commonly-used model components in the speech synthesis field are provided to support the model implementation. - - Speech synthesis models DeepVoice3, ClarinNet, TransformerTTS, FastSpeech, WaveNet, and WaveFlow are released.
+- Releases the text-to-speech toolkit Parakeet (Paddle PARAllel text-to-speech toolkit). +- Implements the standard workflow for data preprocessing, training, and synthesis of the TTS models. +- Provides out-of-the-box pre-processing implementations for typical datasets. +- Provides commonly-used model components in the TTS field to facilitate model implementation. +- Releases the TTS models DeepVoice3, ClariNet, TransformerTTS, FastSpeech, WaveNet, and WaveFlow. - PaddleCV - - Image Classification: - - A total of 14 pre-training models including SENet-vd, Res2Net, and HRNet series of models are added: - - SE\_ResNet18\_vd, SE\_ResNet34\_vd, SE\_ResNeXt50\_vd\_32x4d, ResNeXt152\_vd\_32x4d - - Res2Net50\_26w\_4s, Res2Net50\_14w\_8s, Res2Net50\_vd\_26w\_4s - - HRNet\_W18\_C, HRNet\_W30\_C, HRNet\_W32\_C, HRNet\_W40\_C, HRNet\_W44\_C, HRNet\_W48\_C, HRNet\_W64\_C - - Accelerating data preprocessing by using DALI is supported. On the ImageNet training, 1.5 times (ResNet50) to more than 3 times (ShuffleNet) the acceleration is obtained and the GPU utilization is greatly improved. - - 3D Direction: - - The models PointNet++ and PointRCNN are released. - - Tracking Model Library: - - The models SiamFC, SiamRPN, SiamMASK, ATOM, and ATP are released. - - Dynamic graph model implementations are added: MobileNet-v1/v2, YOLOv3, FasterRCNN, MaskRCNN, video classification TSM model, and video motion positioning BMN model. +- Image Classification: +- Adds 14 pre-training models including SENet-vd, Res2Net, and HRNet series of models: +- SE_ResNet18_vd, SE_ResNet34_vd, SE_ResNeXt50_vd_32x4d, ResNeXt152_vd_32x4d +- Res2Net50_26w_4s, Res2Net50_14w_8s, Res2Net50_vd_26w_4s +- HRNet_W18_C, HRNet_W30_C, HRNet_W32_C, HRNet_W40_C, HRNet_W44_C, HRNet_W48_C, HRNet_W64_C +- Supports accelerating data preprocessing by using DALI. On ImageNet training, a speedup of 1.5 times (ResNet50) to more than 3 times (ShuffleNet) is obtained and the GPU utilization is greatly improved.
+- 3D Vision: +- Releases the PointNet++ and PointRCNN models. +- Tracking Model Library: +- Releases the SiamFC and ATOM models. +- Adds dynamic graph model implementations for the following models: MobileNet-v1/v2, YOLOv3, FasterRCNN, MaskRCNN, video classification TSM model, and video motion positioning BMN model. - PaddleRec - - A multi-task model MMoE for the recommended field is released and applies to large-scale multi-task joint training in the industrial circles. - - Dynamic graph model implementations are added: gru4rec, deepfm. +- Releases a multi-task model called MMoE for the recommendation field. It can be applied to large-scale multi-task joint training in industry. +- Adds dynamic graph model implementations for the following models: gru4rec, deepfm. ## End-To-End Development Kits - PaddleDetection (https://github.com/PaddlePaddle/PaddleDetection) - - The precision of the YOLOv3 model is further improved. The precision for the COCO data reaches 43.2%, an absolute increase of 1.4% compared with the previous version. - - Model implementations and pre-training models are added: - - The best single model CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd in the Google AI Open Images 2019-Object Detction competition is added. A pre-training model of this algorithm based on Objects365 data is also released. - - Backbone is added as CBResNet, Res2Net, and HRNet series of pre-training models. - - A LibraRCNN algorithm and a pre-training model are added. - - GIoU, DIoU, and CIoU loss-based pre-training models are added in the FasterRCNN R50 FPN model. Without reducing the forecast speed, the precision for the COCO data is improved by 1.1%, 0.9%, and 1.3% respectively. - - Added Modules: - - Backbone network: CBResNet, Res2Net, and HRNet are added. - - Loss modules: GIoU loss, DIoU loss, and CIoU loss are added. Libra loss and YOLOv3 loss support a fine-grained op combination. - - Postprocessing modules: The softnms and DIOU nms modules are added.
- - Regular module: A DropBlock module is added. - - Functional Optimization and Improvement: - - YOLOv3 data preprocessing is accelerated. The overall training speeds up by 40%. - - The data preprocessing logic is optimized. - - The benchmark data for face detection forecast is added. - - Forecast examples under the Paddle forecast library Python API are added. - - Detection Model Compression: - - Tailoring: A Mobilenet-yolov3MobileNet-YOLOv3 tailoring solution and model are released, with FLOPs - 69.6%, mAP + 1.4% for the VOC dataset, and FLOPS - 28.8%, mAP + 0.9% for the COCO dataset. A ResNet50vd-dcn-YOLOv3 tailoring solution and model are released, with FLOPs - 18.4%, mAP + 0.8% for the COCO dataset. - - Distillation: A MobileNet-YOLOv3 distillation solution and model are released, with mAP + 2.8% for the VOC data and mAP + 2.1% for the COCO data. - - Quantification: YOLOv3-MobileNet and BlazeFace quantitative models are released. - - Tailoring + Distillation: A MobileNet-YOLOv3 tailoring + distillation solution and model are released, with FLOPS - 69.6%, forecast speedup 64.5% under the GPU, mAP - 0.3 % for the COCO dataset. A ResNet50vd-dcn-YOLOv3 tailoring + distillation solution and model are released, with FLOPS - 43.7%, forecast speedup 24.0% under the GPU, mAP + 0.6 % based on the COCO data. - - Search: A complete search solution for the open source blazeface-nas. - - Forecast Deployment: - - The support of the Paddle forecast library for TensorRT and FP16 precision is adapted. - - Documents: - - A document for introducing the data preprocessing module and a document for implementing the user-defined data Reader are added. - - A document about how to add an algorithm model is added. - - Documents are deployed to the website: https://paddledetection.readthedocs.io/zh/latest/ +- The precision of the YOLOv3 model is further improved. 
The precision for the COCO data reaches 43.2%, an absolute increase of 1.4% compared with the previous version. +- Adds the following model implementations and pre-training models: +- Adds the best single model, CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd, from the Google AI Open Images 2019-Object Detection competition. Releases a pre-training model of this algorithm based on Objects365 data. +- Adds a series of CBResNet, Res2Net, and HRNet pre-training models. +- Adds a LibraRCNN algorithm and the pre-training models. +- Adds GIoU, DIoU, and CIoU loss-based pre-training models in the FasterRCNN R50 FPN model. Without reducing the inference speed, the precision for the COCO data is improved by 1.1%, 0.9%, and 1.3% respectively. +- Added Modules: +- Backbone network: CBResNet, Res2Net, and HRNet are added. +- Loss modules: GIoU loss, DIoU loss, and CIoU loss are added. Libra loss and YOLOv3 loss support a fine-grained op combination. +- Postprocessing modules: The softnms and DIOU nms modules are added. +- Regularization module: A DropBlock module is added. +- Functional Optimization and Improvement: +- YOLOv3 data preprocessing is accelerated. The overall training speeds up by 40%. +- The data preprocessing logic is optimized. +- The benchmark data for face detection inference is added. +- Inference examples under the Paddle inference library Python API are added. +- Detection Model Compression: +- Pruning: A MobileNet-YOLOv3 pruning solution and model are released, with FLOPs - 69.6%, mAP + 1.4% for the VOC dataset, and FLOPs - 28.8%, mAP + 0.9% for the COCO dataset.
A ResNet50vd-dcn-YOLOv3 pruning solution and model are released, with FLOPs - 18.4%, mAP + 0.8% for the COCO dataset. +- Distillation: A MobileNet-YOLOv3 distillation solution and model are released, with mAP + 2.8% for the VOC data and mAP + 2.1% for the COCO data. +- Quantization: YOLOv3 and BlazeFace quantized models are released. +- Pruning + Distillation: A MobileNet-YOLOv3 pruning + distillation solution and model are released, with FLOPs - 69.6%, inference speedup 64.5% under the GPU, mAP - 0.3% for the COCO dataset. A ResNet50vd-dcn-YOLOv3 pruning + distillation solution and model are released, with FLOPs - 43.7%, inference speedup 24.0% under the GPU, mAP + 0.6% for the COCO dataset. +- Search: The complete search solution for BlazeFace-NAS is open-sourced. +- Inference Deployment: +- Adapts to the Paddle inference library's support for TensorRT and FP16 precision. +- Documents: +- Adds a document introducing the data preprocessing module and a document on implementing user-defined data Readers. +- Adds a document about how to add an algorithm model. +- Documents are deployed to the website: https://paddledetection.readthedocs.io/zh/latest/ - PaddleSeg (https://github.com/PaddlePaddle/PaddleSeg) - - Added Models - - LaneNet model applicable to lane segmentation scenarios. - - Fast-SCNN model applicable to the lightweight. - - HRNet semantic segmentation model applicable to high-precision scenarios. - - Multiple PaddleSlim-based model compression solutions are released: - - Cityscape-based Fast-SCNN tailoring solution and model. - - Cityscape-based Deeplabv3p-Xception and Deeplabv3p-MobilenetV2 distillation solutions. - - Cityscape-based Deeplabv3p-MobilenetV2 search solution. - - Cityscape-based Deeplabv3p-Mobilenet quantitative solution and model. - - Enhancement of the Forecast Deployment Capability - - Lightweight deployment of Python is added.
- - The TensorRT forecast acceleration support for FP16 and Int8 quantitative models is added. - - Tutorials and cases for portrait segmentation Paddle-Lite mobile-side deployment of DeepLabv3p-MobileNetV2 are added. - - Model export is optimized. GPU implementation of image preprocessing and postprocessing is supported. The performance is improved by 10%-20%. - - The benchmark for the forecast performance of U-Net, ICNet, PSPNet, DeepLabv3+, and other models for images of different sizes is provided to facilitate users to select models based on performance. - - Experience Optimization - - A learning rate warmup function is added. It supports the use with different learning rate decay strategies to improve Fine-tuning stability. - - Marked imaged can be saved in pseudo-color image format to improve their preview experience. - - The function of automatically saving an optimal mIoU model is added. - - The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided. +- Adds Models +- LaneNet model applicable to lane segmentation scenarios. +- Lightweight Fast-SCNN model applicable to high performance scenarios. +- HRNet semantic segmentation model applicable to high-precision scenarios. +- Releases multiple PaddleSlim-based model compression solutions: +- Fast-SCNN pruning solution and model on the Cityscapes dataset. +- Deeplabv3p-Xception and Deeplabv3p-MobilenetV2 distillation solutions on the Cityscapes dataset. +- Deeplabv3p-MobilenetV2 search solution on the Cityscapes dataset. +- Deeplabv3p-Mobilenet quantization solution and model on the Cityscapes dataset. +- Enhances the deployment capability +- Adds the lightweight deployment of Python. +- Adds the TensorRT acceleration support for FP16 and Int8 quantized models.
+- Adds the tutorials for Paddle-Lite mobile deployment of DeepLabv3p-MobileNetV2 for portrait segmentation. +- Optimizes the model export step. Supports GPU implementation of image preprocessing and postprocessing. The performance is improved by 10%-20%. +- Provides the benchmark for the prediction performance of U-Net, ICNet, PSPNet, DeepLabv3+, and other models on images of different sizes to help users select models based on performance. +- Experience Optimization +- Adds a learning rate warmup function. It can be used together with different learning rate decay strategies to improve fine-tuning stability. +- Annotated images can be saved in pseudo-color image format to improve their preview experience. +- Adds the function of automatically saving an optimal mIoU model. +- The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided. - ElasticRec (https://github.com/PaddlePaddle/ElasticRec) - - - An ElasticRec recommended sorting system is released. It is deployed through K8S. Streaming training and online forecast service are supported. +- An ElasticRec recommendation ranking system is released. It is deployed through K8S. Streaming training and online inference service are supported. ## Utility Components - PaddleHub (https://github.com/PaddlePaddle/PaddleHub) - - The pre-training models are rich, with 52 added pre-training models. Currently, the total number of pre-training models is 100+: - - Semantic models: Five semantic models such as RoBERTa\_wwm, BERT\_wwm, and ERNIE-Tiny are added. - - Text classification: Three yellow anti-identification models are added. - - Image classification: A total of 36 image classification models such as ResNext-WSL and EfficientNet are added.
- - Target detection: Five detection models such as pedestrian detection and vehicle detection are added. - - Key point detection: Two models for key point detection of face and body posture are added. - - Face mask detection: Two PyramidBox-Lite-based face mask detection models are added. - - Universal face detection: Four universal Face detection models such as Ultra Light Fast Generic Face Detector and PyramidBox-Lite are added. - - Function: - - A Bert Service text vector representation service based on Paddle Serving is added. - - Task flexibility is enhanced. An added hook mechanism supports the loading of user-defined codes. - - A color Colorlog is added. The problem on the repeated printing of logs is fixed. - - Code results are optimized. The command line execution speed is increased by 50%. - - Dataset and Reader are reconstructed. The quantity of adaptive user-defined dataset codes is reduced by 60%. - - The AutoFinetune interface is optimized. Multi-experiment visualization effect display is supported. - - Experience Optimization - - The logic is fully optimized. Rich AIStudio tutorial contents are added. - - The landing page of the official website has been fully upgraded to provide the function of quick online experience and tutorial guidance. +- 52 new pre-trained models are added. Currently, the total number of pre-trained models is 100+: +- Semantic models: Five semantic models such as RoBERTa_wwm, BERT_wwm, and ERNIE-Tiny are added. +- Text classification: Three pornographic-content detection models are added. +- Image classification: A total of 36 image classification models such as ResNext-WSL and EfficientNet are added. +- Object detection: Five detection models such as pedestrian detection and vehicle detection are added. +- Key point detection: Two models for key point detection of face and body posture are added. +- Face mask detection: Two PyramidBox-Lite-based face mask detection models are added.
+- Universal face detection: Four universal face detection models such as Ultra Light Fast Generic Face Detector and PyramidBox-Lite are added. +- Function: +- Bert Service, a text vector representation service based on Paddle Serving, is added. +- Task flexibility is enhanced. A hook mechanism that supports loading user-defined code is added. +- Code results are optimized. The command line execution speed is increased by 50%. +- The amount of code needed to adapt a user-defined dataset is reduced by 60%. +- The AutoFinetune interface is optimized. Multi-experiment visualization effect display is supported. +- Experience Optimization +- The logic is fully optimized. Rich AIStudio tutorial contents are added. +- The landing page of the official website has been fully upgraded to provide the function of quick online experience and tutorial guidance. - Multi-task learning framework PALM (https://github.com/PaddlePaddle/PALM) - - Python3 and Windows are supported. - - The framework kernel and the multi-tasking underlying mechanism, are upgraded. The API call is open. - - The flexible model saving mechanism supports single-task saving and full-image saving. - - Continuous training and forecast are supported. Dataset files can be switched over freely under a single execution. - - A model customization/self-definition function is added. - - The multi-task underlying kernel is reconstructed. Some bugs that affect universality and stability are fixed. - - The multi-task learning ability is strengthened. - - It is supported that every task has a different batch size and sequence length under a multi-task scenario. - - The problem on inconsistent tasks on each video card during multi-task multi-card training is fixed. - - The multi-task learning scheduling and termination strategies are optimized to generally improve the model generalization ability. - - The function and type of supported tasks are strengthened. - - Matching task support is enhanced.
Pairwise learning and multiple categories (e.g. NLI sentence relation judgment) are supported. - - The support for machine reading comprehension tasks is enhanced. User controllable preprocessing hyper-parameters are added. - - The support for sequence labeling tasks is added. - - The large-scale training/inferential capability is strengthened. - - The automatic multi-card forecast capability is added. - - An asynchronous reader is supported. A variable-length padding is supported in multi-card scenarios. - - A module for the management and downloading of pre-training models is added. - - The management and downloading of pre-training models such as BERT, ERNIE, and RoBERTa are supported. - - A RoBERTa Chinese pre-training model is added. - +- Python3 and Windows are supported. +- Upgrades the multi-task learning kernel and opens the APIs. +- Supports independent task savers. +- Continuous training and inference are supported. Dataset files can be switched over freely under a single execution. +- Supports model customization. +- Refactors the multi-task learning kernel and fixes some bugs. +- Upgrades the multi-task learning ability. +- Supports independent settings of batch size and sequence length across tasks. +- Fixes the problem of inconsistent tasks across GPUs. +- The multi-task learning scheduling and termination strategies are optimized to generally improve the model generalization ability. +- Upgrades the ability and types of pre-defined tasks. +- Upgrades the matching task. Adds support for pairwise learning and multiple categories. +- The support for machine reading comprehension tasks is enhanced. User controllable preprocessing hyper-parameters are added. +- The support for sequence labeling tasks is added.
+- The large-scale training/inference capability is strengthened. +- Adds automatic multi-GPU inference. +- Refactors the asynchronous reader. Supports dynamic padding length for multi-task learning running on multiple GPUs. +- A module for the management and downloading of pre-training models is added. +- The management and downloading of pre-training models such as BERT, ERNIE, and RoBERTa are supported. +- A RoBERTa Chinese pre-training model is added. - Federated Learning PaddleFL (https://github.com/PaddlePaddle/PaddleFL): - - The scheduler and submitter functions are added: The scheduler is used to control whether the trainer participates in update during training. The submitter is used to complete the function of submitting paddleFL tasks in the MPI cluster. - - A LEAF dataset federated learning open dataset is added. An API is added to set a benchmark. Classical datasets in the image classification, emotion analysis, character forecast, and other fields , such as MNIST and Sentiment140, are supported. - - According to the added components, the original samples are modified in example and the femnist\_demo and submitter\_demo examples are added - - Fl\_distribute\_transpiler is optimized to add the support of FedAvg strategy for the adam optimizer. - - SecAgg strategy (Secure Aggregation) is added to achieve secure parameter aggregation. +- The scheduler and submitter functions are added: The scheduler is used to control whether the trainer participates in update during training. The submitter is used to complete the function of submitting paddleFL tasks in the MPI cluster. +- LEAF, an open dataset for federated learning, is added. An API is added to set a benchmark.
Classical datasets in the image classification, sentiment analysis, character prediction, and other fields, such as MNIST and Sentiment140, are supported. +- According to the added components, the original samples are modified in example and the femnist_demo and submitter_demo examples are added. +- Fl_distribute_transpiler is optimized to add the support of FedAvg strategy for the adam optimizer. +- SecAgg strategy (Secure Aggregation) is added to achieve secure parameter aggregation. + +- Deep Reinforcement Learning Framework PARL (https://github.com/PaddlePaddle/PARL) +- Version v1.3 is released. +- The support for the Multi-Agent RL algorithm including MADDPG is added. +- The support for multi-card training is added. An example of a multi-card DQN algorithm is released. +- The SOTA algorithms TD3 and SAC in the continuous control field are open-sourced. +- The implementation and training solution for the NeurIPS2019 reinforcement learning challenge champion model are open-sourced. Trained models are open (Consideration can be given to open class) +- Paddle Graph Learning Framework PGL (https://github.com/PaddlePaddle/PGL) +- Version v1.1 is released: +- The support for the authoritative graph learning database OGB is added. Three types of tasks including nodeproppred, linkproppred, and graphproppred are fully supported. A SOTA baseline is released. +- A graph solution PGL-Rec and a knowledge graph embedding algorithm set PGL-KE are released. +- An improvement on ease of use is made. A high-order API of PGL is released.
+- Other upgrade points: Multi-process graph sampling is optimized and GraphSAGE-like models are accelerated by three times. Lod Tensor-based Graph Batch and Graph Pooling operators are added. Models including distributed heterogeneous task graph algorithm, GraphZoom, and PinSage are added for Model Zoo. + ## Code Reconstruction and Upgrade - Compilation - - A compilation option WITH\_NCCL is added. Single-card users can display and specify WITH\_NCCL=OFF to accelerate compilation. - - A compilation option WITH\_TP\_CACHE is added to cache third-party source codes to avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability. - - The `CUDA_ARCH_NAME` default value is set to `Auto` (`All` indicates compiling all GPU architectures and `Auto` indicates compiling only the current machine GPU architecture). For developers, a lot of compilation time is saved using `Auto` than using `All`, thus improving development efficiency. - - Redundant links and products and needless file copying are reduced, thus speeding up the compilation in Windows. +- A compilation option WITH_NCCL is added. Single-card users can explicitly specify WITH_NCCL=OFF to accelerate compilation. +- A compilation option WITH_TP_CACHE is added to cache third-party source codes to avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability. +- The CUDA_ARCH_NAME default value is set to Auto (All indicates compiling all GPU architectures and Auto indicates compiling only the current machine GPU architecture). For developers, using Auto saves a lot of compilation time compared with using All, thus improving development efficiency.
+- Redundant links and products and needless file copying are reduced, thus speeding up the compilation in Windows. - External Dependency Library - - MKL-DNN is upgraded to the latest Version 1.1. - - The forecast library is decoupled from `third_party` and 28 third-party-dependent compilation codes are refactored to facilitate the unified management of external dependencies. - - Two third-party-dependent private warehouses, one unnecessary dependency, and 2000+ lines of unnecessary codes under the patch are removed to improve the warehouse quality. +- MKL-DNN is upgraded to the latest Version 1.1. +- The inference library is decoupled from third_party and 28 third-party-dependent compilation codes are refactored to facilitate the unified management of external dependencies. +- Two third-party-dependent private code repositories, one unnecessary external dependency, and 2000+ lines of unnecessary codes under the patch are removed to improve the code repository quality. - Code Cleanup, Refactoring, and Optimization - - The unnecessary `contrib/float16` directory is removed. The unnecessary snappy/snappystream dependency under the BRPC is deleted. - - `loss.py` and `sequence_lod.py` are split out of `python/paddle/fluid/layers/nn.py` according to the API functions, thus reducing the code quantity of `nn.py` and facilitating reading. - - The codes corresponding to the warnings of `-Wno-error=sign-compare` (at a total of more than 100 points) are fixed. An error will be reported for all subsequent warnings of this kind during compilation, thus improving the code quality. - - `WarningLnk4006/WarningLnk4221` compiled by WindowsMSVC (at a total of about 300 points) is removed to improve the warehouse quality. - - The quantity of reduce\_op, expand\_op, and expand\_as\_op templates is reduced to accelerate GPU compilation and reduce whl package space by 70 M.
- - The pybind function of every OP is automatically generated under the dynamic graph using codes and directly called in layers to improve the dynamic graph performance and reduce the coupling degree with the static graph. +- The unnecessary contrib/float16 directory is removed. The unnecessary snappy/snappystream dependency under the BRPC is deleted. +- loss.py and sequence_lod.py are split out of python/paddle/fluid/layers/nn.py according to the API functions, thus reducing the code quantity of nn.py and facilitating reading. +- The codes corresponding to the warnings of -Wno-error=sign-compare (at a total of more than 100 points) are fixed. An error will be reported for all subsequent warnings of this kind during compilation, thus improving the code quality. +- The WarningLnk4006/WarningLnk4221 warnings (about 300 in total) produced by Windows MSVC compilation are removed to improve the code repository quality. +- The quantity of reduce_op, expand_op, and expand_as_op templates is reduced to accelerate GPU compilation and reduce whl package space by 70 M. +- The pybind function of every OP is automatically generated under the dynamic graph using codes and directly called in layers to improve the dynamic graph performance and reduce the coupling degree with the static graph. ## Bug Fixes -- Fix the problem of MKL-DNN error when PaddleDetection-based Faster-RCNN uses the Python API to make a forecast. +- Fix the problem of MKL-DNN error when PaddleDetection-based Faster-RCNN uses the Python API to run inference. - Fix the problem of training suspension in the GPU implementation of sum op because some Tensors are not initialized. -- Fix the problem of precision loss when the value in fill\_constant is set to a large integer. -- Fix the problem of precision inconsistency of softmax\_with\_cross\_entropy\_op with regard to the CUDA. -- Fix the problem that when a clone program is fixed, the stop\_gradient attribute in the program can not be copied to a new program.
-- Fix the problem of precision loss of elementwise\_pow op with regard to integers.
-- Fixed the problem that some GFLAGSs cannot perform specifying outside the forecast library.
-- Fix the problem of random forecast core caused by some passes in Analysistor multithreading. (fc\_gru\_fuse\_pass, seqconv\_eltadd\_relu\_fuse\_pass, attention\_lstm\_fuse\_pass, embedding\_fc\_lstm\_fuse\_pass, fc\_lstm\_fuse\_pass, seq\_concat\_fc\_fuse\_pass)
-- Fix the error that specifying a GPU in the same process using AnalysisConfig does not take effect after NativePredictor is used to specify the use of CPU forecast.
-- Fix the bug of compilation error (setup.py copy and op\_function\_cmd error) in the case of -DWITH\_MKL=OFF.
-- Fix the bug that tuple (Variable) cannot be entered in the py\_func OP; add an example of how to write PythonOP codes.
+- Fix the problem of precision loss when the value in fill_constant is set to a large integer.
+- Fix the problem of precision inconsistency of softmax_with_cross_entropy_op on CUDA.
+- Fix the problem that when a program is cloned, the stop_gradient attribute in the program cannot be copied to the new program.
+- Fix the problem of precision loss of elementwise_pow op with regard to integers.
+- Fix the problem that some GFLAGS cannot be specified from outside the inference library.
+- Fix the problem of random inference core dumps caused by some passes in Analysistor multithreading. (fc_gru_fuse_pass, seqconv_eltadd_relu_fuse_pass, attention_lstm_fuse_pass, embedding_fc_lstm_fuse_pass, fc_lstm_fuse_pass, seq_concat_fc_fuse_pass)
+- Fix the error that specifying a GPU in the same process using AnalysisConfig does not take effect after NativePredictor is used to specify the use of CPU inference.
+- Fix the bug of compilation error in the case of -DWITH_MKL=OFF on Windows.
+- Fix the bug that tuple (Variable) cannot be input into the py_func OP; add a code example of how to write a Python OP.
 - Fix the problem of the sigmoid cudnn kernel being called as the tanh cudnn kernel by mistake.
-- Fix some bugs related to reshape and depthwiseconv in dynamic graph mode; fix the problem of some parameters in the network having no gradient, causing the bug of program crash.
+- Fix some bugs related to reshape and Conv2D depthwiseconv in dynamic graph mode; fix the program crash caused by some parameters in the network having no gradient.
 - Fix the bug of running error of GradientClip in parameter server mode.
-- Fix the problem of memory leak in full asynchronous mode of of the parameter server.
+- Fix the problem of memory leak in full asynchronous mode of the parameter server.
diff --git a/doc/fluid/user_guides/.DS_Store b/doc/fluid/user_guides/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..bac0678d4e876d4c09bc86b05ee899bde334070e
Binary files /dev/null and b/doc/fluid/user_guides/.DS_Store differ
diff --git a/doc/fluid/user_guides/cv_case/.DS_Store b/doc/fluid/user_guides/cv_case/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..f800718233277d9753c9e4a3d02e0f45815cc2d0
Binary files /dev/null and b/doc/fluid/user_guides/cv_case/.DS_Store differ
diff --git a/doc/fluid/user_guides/simple_case/image_classification/.gitignore b/doc/fluid/user_guides/cv_case/image_classification/.gitignore
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/.gitignore
rename to doc/fluid/user_guides/cv_case/image_classification/.gitignore
diff --git a/doc/fluid/user_guides/simple_case/image_classification/.run_ce.sh b/doc/fluid/user_guides/cv_case/image_classification/.run_ce.sh
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/.run_ce.sh
rename to doc/fluid/user_guides/cv_case/image_classification/.run_ce.sh
diff --git a/doc/fluid/user_guides/simple_case/image_classification/README.cn.md
b/doc/fluid/user_guides/cv_case/image_classification/README.cn.md
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/README.cn.md
rename to doc/fluid/user_guides/cv_case/image_classification/README.cn.md
diff --git a/doc/fluid/user_guides/simple_case/image_classification/README.md b/doc/fluid/user_guides/cv_case/image_classification/README.md
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/README.md
rename to doc/fluid/user_guides/cv_case/image_classification/README.md
diff --git a/doc/fluid/user_guides/simple_case/image_classification/_ce.py b/doc/fluid/user_guides/cv_case/image_classification/_ce.py
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/_ce.py
rename to doc/fluid/user_guides/cv_case/image_classification/_ce.py
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/cifar.png b/doc/fluid/user_guides/cv_case/image_classification/image/cifar.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/cifar.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/cifar.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/dog.png b/doc/fluid/user_guides/cv_case/image_classification/image/dog.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/dog.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/dog.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/dog_cat.png b/doc/fluid/user_guides/cv_case/image_classification/image/dog_cat.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/dog_cat.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/dog_cat.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/fea_conv0.png b/doc/fluid/user_guides/cv_case/image_classification/image/fea_conv0.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/fea_conv0.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/fea_conv0.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/flowers.png b/doc/fluid/user_guides/cv_case/image_classification/image/flowers.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/flowers.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/flowers.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/googlenet.jpeg b/doc/fluid/user_guides/cv_case/image_classification/image/googlenet.jpeg
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/googlenet.jpeg
rename to doc/fluid/user_guides/cv_case/image_classification/image/googlenet.jpeg
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/ilsvrc.png b/doc/fluid/user_guides/cv_case/image_classification/image/ilsvrc.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/ilsvrc.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/ilsvrc.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/inception.png b/doc/fluid/user_guides/cv_case/image_classification/image/inception.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/inception.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/inception.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/inception_en.png b/doc/fluid/user_guides/cv_case/image_classification/image/inception_en.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/inception_en.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/inception_en.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/lenet.png b/doc/fluid/user_guides/cv_case/image_classification/image/lenet.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/lenet.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/lenet.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/lenet_en.png b/doc/fluid/user_guides/cv_case/image_classification/image/lenet_en.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/lenet_en.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/lenet_en.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/plot.png b/doc/fluid/user_guides/cv_case/image_classification/image/plot.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/plot.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/plot.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/plot_en.png b/doc/fluid/user_guides/cv_case/image_classification/image/plot_en.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/plot_en.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/plot_en.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/resnet.png b/doc/fluid/user_guides/cv_case/image_classification/image/resnet.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/resnet.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/resnet.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/resnet_block.jpg b/doc/fluid/user_guides/cv_case/image_classification/image/resnet_block.jpg
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/resnet_block.jpg
rename to doc/fluid/user_guides/cv_case/image_classification/image/resnet_block.jpg
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/train_and_test.png b/doc/fluid/user_guides/cv_case/image_classification/image/train_and_test.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/train_and_test.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/train_and_test.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/variations.png b/doc/fluid/user_guides/cv_case/image_classification/image/variations.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/variations.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/variations.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/variations_en.png b/doc/fluid/user_guides/cv_case/image_classification/image/variations_en.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/variations_en.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/variations_en.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/image/vgg16.png b/doc/fluid/user_guides/cv_case/image_classification/image/vgg16.png
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/image/vgg16.png
rename to doc/fluid/user_guides/cv_case/image_classification/image/vgg16.png
diff --git a/doc/fluid/user_guides/simple_case/image_classification/index.cn.html b/doc/fluid/user_guides/cv_case/image_classification/index.cn.html
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/index.cn.html
rename to doc/fluid/user_guides/cv_case/image_classification/index.cn.html
diff --git a/doc/fluid/user_guides/simple_case/image_classification/index.html
b/doc/fluid/user_guides/cv_case/image_classification/index.html
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/index.html
rename to doc/fluid/user_guides/cv_case/image_classification/index.html
diff --git a/doc/fluid/user_guides/simple_case/image_classification/resnet.py b/doc/fluid/user_guides/cv_case/image_classification/resnet.py
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/resnet.py
rename to doc/fluid/user_guides/cv_case/image_classification/resnet.py
diff --git a/doc/fluid/user_guides/simple_case/image_classification/train.py b/doc/fluid/user_guides/cv_case/image_classification/train.py
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/train.py
rename to doc/fluid/user_guides/cv_case/image_classification/train.py
diff --git a/doc/fluid/user_guides/simple_case/image_classification/vgg.py b/doc/fluid/user_guides/cv_case/image_classification/vgg.py
similarity index 100%
rename from doc/fluid/user_guides/simple_case/image_classification/vgg.py
rename to doc/fluid/user_guides/cv_case/image_classification/vgg.py
diff --git a/doc/fluid/user_guides/cv_case/index_cn.rst b/doc/fluid/user_guides/cv_case/index_cn.rst
index 8efd1d4d8dced4ee736f0a4522d69a31771ebeb3..6d1b108f45407bd37111bd611e3a5c0663609c32 100644
--- a/doc/fluid/user_guides/cv_case/index_cn.rst
+++ b/doc/fluid/user_guides/cv_case/index_cn.rst
@@ -2,9 +2,13 @@
 Computer Vision
 ################
+.. todo::
+
+Computer vision is the study of how to use cameras and computers to obtain the data and information we need about a photographed subject. PaddlePaddle provides two computer vision tutorials here:
 .. toctree::
     :titlesonly:
+    image_classification/README.cn.md
     gan/README.cn.md
diff --git a/doc/fluid/user_guides/cv_case/index_en.rst b/doc/fluid/user_guides/cv_case/index_en.rst
index fa597ee13258083f41ab5f31f5d111e2583314b6..0523bfb6bd5455557ef3282bdfbcf63b1bccac7f 100644
--- a/doc/fluid/user_guides/cv_case/index_en.rst
+++ b/doc/fluid/user_guides/cv_case/index_en.rst
@@ -6,5 +6,6 @@ Computer Vision
 .. toctree::
     :titlesonly:
+    image_classification/README.md
     gan/README.md
diff --git a/doc/fluid/user_guides/index_cn.rst b/doc/fluid/user_guides/index_cn.rst
index 20fdb1b9d10d942558b5d785d62f56becd61646c..be7d31882574834be02c9f4a74cc8ee4a5a1ebd9 100644
--- a/doc/fluid/user_guides/index_cn.rst
+++ b/doc/fluid/user_guides/index_cn.rst
@@ -21,8 +21,6 @@
 - `Natural Language Processing <../user_guides/nlp_case/index_cn.html>`_: introduces cases of using Paddle for natural language processing
 - `Recommendation <../user_guides/rec_case/index_cn.html>`_: introduces cases of using Paddle for recommendation tasks
-
-- `Model Library <../user_guides/models/index_cn.html>`_: introduces Paddle's classic model library
 - `Tools <../user_guides/tools/index_cn.html>`_: introduces use cases of Paddle tool components
@@ -33,7 +31,6 @@
     cv_case/index_cn.rst
     nlp_case/index_cn.rst
     rec_case/index_cn.rst
-    models/index_cn.rst
     tools/index_cn.rst
diff --git a/doc/fluid/user_guides/nlp_case/index_cn.rst b/doc/fluid/user_guides/nlp_case/index_cn.rst
index 8905cb8fc100dfe1a64ed9cde05efeaa88615413..dc001b9fc4734c4c9df59c7bcce7c7f7deffa782 100644
--- a/doc/fluid/user_guides/nlp_case/index_cn.rst
+++ b/doc/fluid/user_guides/nlp_case/index_cn.rst
@@ -2,6 +2,10 @@
 Natural Language Processing
 ###########################
+.. todo::
+
+Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies how to process and use natural language, in particular how to program computers to successfully process large amounts of natural language data. PaddlePaddle provides three NLP tutorials here:
+
 .. toctree::
     :titlesonly:
diff --git a/doc/fluid/user_guides/rec_case/index_cn.rst b/doc/fluid/user_guides/rec_case/index_cn.rst
index ee8e55abb7f6c00a4f306113ab5bb367456fbdf2..0894710bf03dce07f72ac071e471811ac222a5c7 100644
--- a/doc/fluid/user_guides/rec_case/index_cn.rst
+++ b/doc/fluid/user_guides/rec_case/index_cn.rst
@@ -2,6 +2,10 @@
 Recommendation
 ################
+.. todo::
+
+A recommender system uses an e-commerce website to provide customers with product information and suggestions, helping users decide what to buy and simulating a salesperson guiding the customer through the purchase. PaddlePaddle provides one detailed case study on personalized recommendation here:
+
 .. toctree::
     :titlesonly:
diff --git a/doc/fluid/user_guides/simple_case/.DS_Store b/doc/fluid/user_guides/simple_case/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..ae81c89f33e7122996d6b5799a7369a0ec866b37
Binary files /dev/null and b/doc/fluid/user_guides/simple_case/.DS_Store differ
diff --git a/doc/fluid/user_guides/simple_case/index_cn.rst b/doc/fluid/user_guides/simple_case/index_cn.rst
index 4d44cf7e7f2c66890c7ad8b5f5e4a62fe326a06d..4b954ebfdc2449a33e7607d7516ac64c4cce3543 100644
--- a/doc/fluid/user_guides/simple_case/index_cn.rst
+++ b/doc/fluid/user_guides/simple_case/index_cn.rst
@@ -2,10 +2,13 @@
 Simple Case
 ################
+.. todo::
+
+Here are simple introductory deep learning cases implemented with PaddlePaddle, which help you quickly learn how to use PaddlePaddle and solve simple deep learning problems. The detailed cases are as follows:
+
 .. toctree::
     :titlesonly:
     fit_a_line/README.cn.md
     recognize_digits/README.cn.md
-    image_classification/README.cn.md
     word2vec/README.cn.md
diff --git a/doc/fluid/user_guides/simple_case/index_en.rst b/doc/fluid/user_guides/simple_case/index_en.rst
index 91ebfc14573d5ea04a0263cbceb31636928ffe88..bccd0a4b83aed64ab7501460fb6770767755ec0a 100644
--- a/doc/fluid/user_guides/simple_case/index_en.rst
+++ b/doc/fluid/user_guides/simple_case/index_en.rst
@@ -8,6 +8,5 @@ Simple Case
     fit_a_line/README.md
     recognize_digits/README.md
-    image_classification/README.md
     word2vec/README.md
diff --git a/doc/fluid/user_guides/tools/.DS_Store b/doc/fluid/user_guides/tools/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..9754f3b21bd199fa7505d7b125895b78c1b3bf47
Binary files /dev/null and b/doc/fluid/user_guides/tools/.DS_Store differ
diff --git a/doc/fluid/user_guides/tools/index_cn.rst b/doc/fluid/user_guides/tools/index_cn.rst
index 8481b7ef5151e2ebf8540cb4928f2f65e48f534d..f80c3fc7a3107e699533941f97d183f80acc767b 100644
--- a/doc/fluid/user_guides/tools/index_cn.rst
+++ b/doc/fluid/user_guides/tools/index_cn.rst
@@ -2,7 +2,12 @@
 Tools
 ################
+.. todo::
+
+PaddlePaddle provides one case article here: one-click deployment of the CTR prediction task with distributed training and the Serving workflow on Baidu Cloud
+
+
 .. toctree::
     :titlesonly:
-    elastic_ctr/deploy_ctr_on_baidu_cloud_cn.md
+    deploy_ctr_on_baidu_cloud_cn.rst