
New Features

  • Enable CUDA algorithms for NCHW quantized data.
  • Update conv1x1 to support gemv.
  • NCHW4 layout is now supported on CPU (x86 and ARM).
  • Optimized nchw44-dot layout support is added for the Armv8.2-a+dotprod instruction set.
  • nchw44 layout is introduced to optimize float computation, covering direct convolution, channel-wise convolution, hybrid-layout convolution, Winograd, and mk4-matmul, along with optimized pooling and elemwise algorithms.
  • Graph optimization is reorganized so that common conversions are supported in both the runtime and dump phases.
  • Synchronized BN statistics are now available on multi-device training tasks.
  • PackAllReducePass is introduced into graph optimization on multi-device training; it packs the parameters that need AllReduce to reduce the number of inter-device communications.
  • Calibration quantization training interface is now available.
  • QAT quantization training updates: ConvBn now supports the composed operation of BN and fake quantization/observer; quantization is enabled for Conv and Linear; quantize_qat can skip user-specified Modules.
  • API adjustments: F.eye is moved from functional/nn.py to core/tensor_factory.py; F.add_axis and F.remove_axis now accept only an axis of int type and no longer accept a list.
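The tightened axis typing for F.add_axis / F.remove_axis can be illustrated with a minimal sketch. This is plain Python, not MegEngine's implementation; the check_axis helper and its error message are hypothetical stand-ins for the new argument validation:

```python
def check_axis(axis):
    """Validate an axis argument in the spirit of the adjusted API:
    a single int is accepted; a list of ints is rejected."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(axis, bool) or not isinstance(axis, int):
        raise TypeError("axis must be an int; a list of axes is no longer accepted")
    return axis

check_axis(0)            # fine: a single int axis
try:
    check_axis([0, 1])   # rejected under the new restriction
except TypeError as e:
    print("rejected:", e)
```

Under the restriction, an operation over several axes is expressed as repeated single-axis calls rather than one call taking a list.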

Bug Fix

  • The HSwish activation function is supported in the FuseConvBiasWithZ pass, and QFUSE_ADD_H_SWISH is folded into the conv bias operator to improve performance.
  • Fix the CUDA 'invalid parameter' error raised by cuda-TopK when the batch size exceeds 65535 and the grid's y-dimension limit is violated.
  • Drop path restriction of in cuda-stub.
  • Fix a performance issue caused by conv1x1 mistakenly using the is_prefered method of its base class.
  • Fix an issue in ConvDirectUnrollBuffer where data fetched while loading src unexpectedly becomes 0; inserting printf statements or removing the loop unroll optimization avoids the problem.
  • Fix an issue where endpoint was replaced twice when paramfuse processed midconst.
  • Fix a bug in gopt's ReorderArithChainPass present since 8.3.0 (inclusive).
  • Fix cond op not supporting empty shapes.
  • Fix an issue when SetMeshIndexing uses multiple axes for indexing.
  • Fix a typo in the assertion assert locator.device < sd.MAX_NR_DEVICE in CompNode (@zjd1988).
  • Fix typos in voc and objects365.
  • Fix incorrect class name in voc.
  • Fix default_comp_graph of Tensor.
  • Fix a graph optimization failure caused by saved_tensors in Function being uncopyable under static graphs.
  • Fix the API documentation of scatter to avoid an error on GPU.
  • Fix issues of unused var ins.
  • Fix the non-str key error for fields in Module.
  • Fix unexpected eval-mode scale and zero_point updates in models trained by QAT.
  • Disable the image algorithm on all Mali-series devices.
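The eval-mode QAT fix above can be sketched with a simplified observer. This is illustrative plain Python, not MegEngine's code; the MinMaxObserver class, its fields, and the affine quantization-parameter formula are assumptions made for the sketch:

```python
class MinMaxObserver:
    """Tracks the min/max of observed values to derive scale and zero_point.
    After the fix, statistics are frozen outside training mode."""

    def __init__(self):
        self.training = True
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        # The essence of the fix: skip statistic updates when not in
        # training mode, so eval no longer shifts scale and zero_point.
        if not self.training:
            return
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def get_qparams(self, qmin=0, qmax=255):
        # Standard affine quantization parameters from the running range.
        scale = (self.max_val - self.min_val) / (qmax - qmin)
        zero_point = round(qmin - self.min_val / scale)
        return scale, zero_point

obs = MinMaxObserver()
obs.observe([-1.0, 0.5, 3.0])   # training: statistics update
obs.training = False             # switch to eval mode
obs.observe([-100.0, 100.0])     # eval: statistics stay frozen
print(obs.get_qparams())
```

Before the fix, the eval-mode observation would have widened the running range and silently changed the quantization parameters of an already-calibrated model.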

Thanks to our Contributors

  • Many thanks to @zjd1988 for submitting a PR to this release; we look forward to more developers joining us to build MegEngine together!

