Created by: GaoWei8
3 Tensors of shape [1280, 2560], 3 sum op calls. Before optimization: 33.1969 ms; after optimization: 26.3337 ms; improvement: 20.67%
- Before optimization
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by in descending order in the same thread
| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|---|---|
| thread0::fill_constant | 3 | 190.968 | 190.968050 (1.000000) | 0.000000 (0.000000) | 2.81928 | 185.274 | 63.656 | 0.825612 |
| thread0::sum | 3 | 33.1969 | 33.196943 (1.000000) | 0.000000 (0.000000) | 11.0564 | 11.0796 | 11.0656 | 0.14352 |
| thread0::fetch | 1 | 7.13995 | 7.139947 (1.000000) | 0.000000 (0.000000) | 7.13995 | 7.13995 | 7.13995 | 0.0308681 |
- After optimization
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|---|---|
| thread0::fill_constant | 3 | 173.255 | 173.254770 (1.000000) | 0.000000 (0.000000) | 2.7889 | 167.637 | 57.7516 | 0.839231 |
| thread0::sum | 3 | 26.3337 | 26.333724 (1.000000) | 0.000000 (0.000000) | 8.76836 | 8.7875 | 8.77791 | 0.127558 |
| thread0::fetch | 1 | 6.85611 | 6.856106 (1.000000) | 0.000000 (0.000000) | 6.85611 | 6.85611 | 6.85611 | 0.0332104 |
4 Tensors of shape [1280, 2560], 4 sum op calls. Before optimization: 62.8654 ms; after optimization: 43.336761 ms; improvement: 31.06%
- Before optimization
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by in descending order in the same thread
| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|---|---|
| thread0::fill_constant | 3 | 190.968 | 190.968050 (1.000000) | 0.000000 (0.000000) | 2.81928 | 185.274 | 63.656 | 0.825612 |
| thread0::sum | 3 | 33.1969 | 33.196943 (1.000000) | 0.000000 (0.000000) | 11.0564 | 11.0796 | 11.0656 | 0.14352 |
| thread0::fetch | 1 | 7.13995 | 7.139947 (1.000000) | 0.000000 (0.000000) | 7.13995 | 7.13995 | 7.13995 | 0.0308681 |
- After optimization
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|---|---|
| thread0::fill_constant | 4 | 183.522 | 183.522086 (1.000000) | 0.000000 (0.000000) | 2.77571 | 175.089 | 45.8805 | 0.785169 |
| thread0::sum | 4 | 43.3368 | 43.336761 (1.000000) | 0.000000 (0.000000) | 10.7804 | 10.9012 | 10.8342 | 0.185409 |
| thread0::fetch | 1 | 6.87684 | 6.876843 (1.000000) | 0.000000 (0.000000) | 6.87684 | 6.87684 | 6.87684 | 0.0294215 |
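
The improvement percentages quoted for the two cases appear to be the relative reduction in total `sum` time, i.e. (before - after) / before. The short check below is only a sketch for reproducing that arithmetic from the quoted totals; the helper name `improvement_percent` and the `cases` dictionary are illustrative and are not part of the change itself.

```python
def improvement_percent(before_ms: float, after_ms: float) -> float:
    """Relative reduction in total time, as a percentage of the baseline."""
    return (before_ms - after_ms) / before_ms * 100.0


# Total sum-op times (ms) as quoted in the two summary lines.
cases = {
    "3 tensors": (33.1969, 26.3337),
    "4 tensors": (62.8654, 43.336761),
}

for name, (before, after) in cases.items():
    print(f"{name}: {improvement_percent(before, after):.2f}%")
# 3 tensors: 20.67%
# 4 tensors: 31.06%
```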