Release Notes

ARM CPU

  • Reconstructed and optimized the sgemm and sgemv matrix operation routines, yielding a 10%~100% performance speedup on most models.
  • Added 19 new operators, including while, sequence_expand, sequence_pool, sequence_softmax, gru_unit, beam_search, and beam_search_decode, along with extensive optimization work, enabling inference for attention-based end-to-end models such as those used in NLP and OCR.
  • Added an ARMv8 Winograd implementation, which achieves higher inference performance on iOS and other ARMv8 hardware; Winograd is also supported under operator fusion, so fused operators retain its efficiency (a worked transform example follows this list).
  • Added a direct sliding-window convolution for 3x3 kernels, which is more efficient than Winograd and GEMM when the number of channels is small (a scalar sketch follows this list).
  • Reconstructed and optimized 3x3 depthwise convolution: unlike previous versions it supports arbitrary padding, and it delivers better performance and more reliable results (see the depthwise sketch after this list).
  • Added an ARMv8 implementation of 5x5 depthwise convolution, speeding up inference of NAS models by more than 30%.
  • Added NEON optimization of col2im, improving the efficiency of deconvolution (conv2d_transpose); a col2im sketch follows this list.
  • Added a lean memory-reuse strategy based on graph optimization; with it, most models cut memory usage by nearly 50%. It is enabled automatically on ARM CPU (FPGA and GPU are not yet supported); a buffer-assignment sketch follows this list.
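
The Winograd item above trades multiplications for cheaper additions. Below is a minimal 1-D F(2,3) worked example in C++: two outputs of a 3-tap filter computed from a 4-sample tile with four multiplies instead of six. The 2-D F(2x2,3x3) kernel nests this transform over rows and columns; the code is illustrative only, not Paddle-mobile's actual kernel.

    #include <array>

    // 1-D Winograd F(2,3): y = A^T [(G g) .* (B^T d)].
    std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                      const std::array<float, 3>& g) {
      // Filter transform G*g (precomputable once per filter).
      float u0 = g[0];
      float u1 = 0.5f * (g[0] + g[1] + g[2]);
      float u2 = 0.5f * (g[0] - g[1] + g[2]);
      float u3 = g[2];
      // Input transform B^T*d.
      float v0 = d[0] - d[2];
      float v1 = d[1] + d[2];
      float v2 = d[2] - d[1];
      float v3 = d[1] - d[3];
      // Element-wise products: the only four multiplies.
      float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
      // Output transform A^T*m yields the two convolution outputs.
      return {m0 + m1 + m2, m1 - m2 - m3};
    }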
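
The direct 3x3 convolution above simply slides the kernel window across the input, so there is no transform overhead (as in Winograd) and no im2col buffer (as in GEMM), which is why it wins when channel counts are small. A minimal scalar sketch with illustrative names rather than Paddle-mobile's real API (the shipped kernel is NEON-vectorized):

    // Direct 3x3 convolution, stride 1, no padding, CHW layout.
    void conv3x3_direct(const float* input, const float* weight, float* output,
                        int in_c, int in_h, int in_w, int out_c) {
      const int out_h = in_h - 2, out_w = in_w - 2;
      for (int oc = 0; oc < out_c; ++oc)
        for (int oh = 0; oh < out_h; ++oh)
          for (int ow = 0; ow < out_w; ++ow) {
            float acc = 0.f;
            for (int ic = 0; ic < in_c; ++ic) {
              const float* in = input + (ic * in_h + oh) * in_w + ow;
              const float* w = weight + (oc * in_c + ic) * 9;
              // 3x3 window read directly from the feature map.
              for (int kh = 0; kh < 3; ++kh)
                for (int kw = 0; kw < 3; ++kw)
                  acc += in[kh * in_w + kw] * w[kh * 3 + kw];
            }
            output[(oc * out_h + oh) * out_w + ow] = acc;
          }
    }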
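
The depthwise variant convolves each channel with its own single 3x3 filter. A scalar sketch covering the arbitrary padding and stride mentioned above (hypothetical names; out-of-bounds taps read as zero):

    // 3x3 depthwise convolution with arbitrary zero padding and stride.
    void depthwise_conv3x3(const float* input, const float* weight,
                           float* output, int channels, int in_h, int in_w,
                           int pad_h, int pad_w, int stride) {
      const int out_h = (in_h + 2 * pad_h - 3) / stride + 1;
      const int out_w = (in_w + 2 * pad_w - 3) / stride + 1;
      for (int c = 0; c < channels; ++c) {
        const float* in = input + c * in_h * in_w;
        const float* w = weight + c * 9;  // one filter per channel
        float* out = output + c * out_h * out_w;
        for (int oh = 0; oh < out_h; ++oh)
          for (int ow = 0; ow < out_w; ++ow) {
            float acc = 0.f;
            for (int kh = 0; kh < 3; ++kh)
              for (int kw = 0; kw < 3; ++kw) {
                const int ih = oh * stride - pad_h + kh;
                const int iw = ow * stride - pad_w + kw;
                if (ih >= 0 && ih < in_h && iw >= 0 && iw < in_w)
                  acc += in[ih * in_w + iw] * w[kh * 3 + kw];
              }
            out[oh * out_w + ow] = acc;
          }
      }
    }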
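
conv2d_transpose is commonly computed as a GEMM followed by col2im, which scatter-adds every column of the GEMM result back into the overlapping image window it maps to; that accumulation step is what the release vectorizes with NEON. A plain scalar version for orientation:

    #include <cstring>

    // col2im: accumulate columns back into a CHW image (zero-padded edges).
    void col2im(const float* col, float* im, int channels, int h, int w,
                int kh, int kw, int pad_h, int pad_w, int stride) {
      const int out_h = (h + 2 * pad_h - kh) / stride + 1;
      const int out_w = (w + 2 * pad_w - kw) / stride + 1;
      std::memset(im, 0, sizeof(float) * channels * h * w);
      for (int c = 0; c < channels; ++c)
        for (int i = 0; i < kh; ++i)
          for (int j = 0; j < kw; ++j)
            for (int oh = 0; oh < out_h; ++oh)
              for (int ow = 0; ow < out_w; ++ow) {
                const int ih = oh * stride - pad_h + i;
                const int iw = ow * stride - pad_w + j;
                if (ih >= 0 && ih < h && iw >= 0 && iw < w)
                  im[(c * h + ih) * w + iw] +=
                      col[(((c * kh + i) * kw + j) * out_h + oh) * out_w + ow];
              }
    }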
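
As for the memory-reuse strategy, one common graph-level formulation is to compute each tensor's live range over the topologically sorted operators and let a new tensor adopt the buffer of any tensor that is already dead. The sketch below illustrates that general technique under assumed data structures; it is not Paddle-mobile's actual pass.

    #include <cstddef>
    #include <vector>

    struct TensorInfo {
      std::size_t bytes;
      int first_use;    // index of the op that produces the tensor
      int last_use;     // index of the last op that reads it
      int buffer = -1;  // assigned buffer id
    };

    // Greedy buffer sharing; `tensors` must be sorted by first_use.
    void assign_buffers(std::vector<TensorInfo>& tensors,
                        std::vector<std::size_t>& buffer_sizes) {
      std::vector<int> free_at;  // op index after which each buffer is free
      for (auto& t : tensors) {
        for (std::size_t b = 0; b < free_at.size(); ++b)
          if (free_at[b] < t.first_use) {  // live ranges do not overlap
            t.buffer = static_cast<int>(b);
            if (buffer_sizes[b] < t.bytes) buffer_sizes[b] = t.bytes;
            break;
          }
        if (t.buffer < 0) {  // nothing reusable: allocate a fresh buffer
          t.buffer = static_cast<int>(buffer_sizes.size());
          buffer_sizes.push_back(t.bytes);
          free_at.push_back(0);
        }
        free_at[t.buffer] = t.last_use;
      }
    }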

ARM GPU

  • Optimized convolution for 1x1 kernels; MobileNet v1 gains an average of 35% in inference performance on Qualcomm Adreno GPUs (a sketch of the 1x1-convolution-as-GEMM view follows).
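
A 1x1 convolution over a CHW tensor is exactly a matrix multiply, output[oc][p] = sum over ic of weight[oc][ic] * input[ic][p] with p ranging over the h*w pixels, so optimizing it reduces to tuning a GEMM (on Adreno, a tuned GPU kernel). A scalar C++ sketch of that view, for illustration only:

    // 1x1 convolution expressed as an (out_c x in_c) * (in_c x hw) GEMM.
    void conv1x1_as_gemm(const float* weight,  // out_c x in_c
                         const float* input,   // in_c x hw, hw = h * w
                         float* output,        // out_c x hw
                         int out_c, int in_c, int hw) {
      for (int oc = 0; oc < out_c; ++oc)
        for (int p = 0; p < hw; ++p) {
          float acc = 0.f;
          for (int ic = 0; ic < in_c; ++ic)
            acc += weight[oc * in_c + ic] * input[ic * hw + p];
          output[oc * hw + p] = acc;
        }
    }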
