v0.5.0 · Tags · MegEngine 天元 / MegEngine

v0.5.0

New Features

Add CUDA support for computation on NCHW quantized data.
Add gemv support to conv1x1.
Add support for the NCHW4 layout on CPU (x86, ARM).
Add optimized support for the nchw44-dot layout on the Armv8.2-a+dotprod instruction set.
Add the nchw44 layout to optimize float computation, including direct convolution, channel-wise convolution, mixed-layout convolution, winograd, mk4-matmul, and optimized algorithms for pooling, elemwise, and similar operators.
Reorganize graph optimization so that some general-purpose conversions work in both the runtime and dump phases.
Add support for synchronizing BN statistics in multi-device training.
Add PackAllReducePass to graph optimization for multi-device training; it packs the parameters that need AllReduce to reduce the number of inter-device communications.
Add a Calibration interface for quantization training.
QAT quantization training improvements: ConvBn now supports a composite operation of BN and fake quantization/observer; quantization operations are added for Conv and Linear; quantize_qat can now skip user-specified Modules.
API adjustments: F.eye moves from functional/nn.py to core/tensor_factory.py. F.add_axis and F.remove_axis now accept only an int axis; passing a list is no longer allowed.
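
Because F.add_axis and F.remove_axis now take a single int, callers that used to pass a list have to issue one call per axis. The sketch below illustrates one way to order those calls. It works on plain shape tuples, and the helpers add_axis, remove_axis, add_axes and remove_axes are hypothetical stand-ins written for this example, not MegEngine functions. It assumes the axes refer to positions in the expanded shape.

```python
# Stand-in helpers on shape tuples; these are NOT MegEngine APIs.


def _check_int_axis(axis):
    """Reject anything that is not a plain int, mirroring the new restriction."""
    if isinstance(axis, bool) or not isinstance(axis, int):
        raise TypeError("axis must be an int, got %s" % type(axis).__name__)


def add_axis(shape, axis):
    """Insert a size-1 dimension at position `axis`."""
    _check_int_axis(axis)
    if not 0 <= axis <= len(shape):
        raise ValueError("axis out of range")
    return shape[:axis] + (1,) + shape[axis:]


def remove_axis(shape, axis):
    """Drop the size-1 dimension at position `axis`."""
    _check_int_axis(axis)
    if not 0 <= axis < len(shape):
        raise ValueError("axis out of range")
    if shape[axis] != 1:
        raise ValueError("only a dimension of size 1 can be removed")
    return shape[:axis] + shape[axis + 1:]


def add_axes(shape, axes):
    """Replace one list-style call with several int calls, lowest axis first."""
    for axis in sorted(axes):
        shape = add_axis(shape, axis)
    return shape


def remove_axes(shape, axes):
    """Remove the highest axis first so earlier removals do not shift later ones."""
    for axis in sorted(axes, reverse=True):
        shape = remove_axis(shape, axis)
    return shape


expanded = add_axes((2, 3), [0, 3])
restored = remove_axes(expanded, [0, 3])
print(expanded, restored)
```

Removing in descending order matters: dropping axis 0 first would shift the remaining size-1 dimension from position 3 to position 2, so a later call with axis 3 would fail.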

Add HSwish activation support to the FuseConvBiasWithZ pass, folding QFUSE_ADD_H_SWISH into the conv bias operator to improve performance.
Fix the CUDA 'invalid parameter' error raised by cuda-TopK when the batch size exceeds 65535, which pushed the y dimension of the grid past its limit.
Remove the path restriction on libcuda.so in cuda-stub.
Fix a performance issue caused by conv1x1 mistakenly using the is_prefered method of its base class.
In the ConvDirectUnrollBuffer algorithm, data fetched while loading src could turn into 0; adding a printf statement or removing the loop unroll optimization avoids the problem.
Fix an issue where paramfuse replaced an endpoint twice while processing midconst.
Fix a bug in gopt's ReorderArithChainPass that appears from 8.3.0 (inclusive) onward.
Fix cond op not supporting empty shapes.
Fix an issue in SetMeshIndexing when indexing with multiple axes.
Fix a typo in the assertion locator.device < sd.MAX_NR_DEVICE in CompNode (@zjd1988).
Fix typos in voc and objects365.
Fix an incorrect class name in voc.
Fix the use of default_comp_graph in Tensor.
Fix a graph optimization failure that occurred when saved_tensors in Function could not be copied in a static graph.
Fix the API documentation of scatter to avoid an error on GPU.
Fix an unused var ins issue.
Fix an error caused by non-str keys in Module fields.
Fix models trained with QAT still updating scale and zero_point in eval mode.
Disable the image algorithm on all Mali-series devices.
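
The cuda-TopK fix above concerns CUDA's limit of 65535 on the y dimension of a launch grid. The release notes do not say how the kernel was changed. The sketch below only illustrates one common way to stay within that limit, assuming one grid row per batch element: split the batch into several launches of at most 65535 rows each. The function name split_batch_for_grid_y is hypothetical.

```python
MAX_GRID_Y = 65535  # CUDA's upper bound for gridDim.y


def split_batch_for_grid_y(batch, max_grid_y=MAX_GRID_Y):
    """Return (start, count) chunks covering `batch` rows, each count <= max_grid_y.

    Each chunk would correspond to one kernel launch whose gridDim.y equals
    `count`; `start` is the offset of the first batch row handled by it.
    """
    if batch < 0:
        raise ValueError("batch must be non-negative")
    chunks = []
    start = 0
    while start < batch:
        count = min(max_grid_y, batch - start)
        chunks.append((start, count))
        start += count
    return chunks


chunks = split_batch_for_grid_y(140000)
for start, count in chunks:
    print("launch rows [%d, %d) with gridDim.y = %d" % (start, start + count, count))
```

An alternative with the same effect is to map the batch onto the x dimension of the grid, which has a much larger limit on current GPUs.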