1. 16 Jul, 2020 (1 commit)
  2. 13 Jul, 2020 (1 commit)
      xrt support TensorRT int8 (#2637) · d4d84e60
      Committed by Houjiang Chen
      * Add tensorrt int8 calibrator (see the calibrator sketch after this entry)
      
      * Generate calibration correctly.
      
      * Refine xrt int8 and readme
      
      * Update readme
      
      * Add xrt int8 unittest
      
      * merge develop
      
      * leaky relu test
      
      * function->global_function
      
      * fix LookupOrCreate
      
      * OF_CHECK->CHECK_OR_RETURN
      Co-authored-by: Nguo-ran <360112263@qq.com>
      d4d84e60
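The INT8 support above hinges on a calibrator that feeds TensorRT representative batches while the engine is built, so per-tensor dynamic ranges can be measured. Below is a minimal sketch of such a calibrator, assuming TensorRT's nvinfer1::IInt8EntropyCalibrator2 interface with TensorRT 7-era virtual signatures (TensorRT 8 adds noexcept); the class name, data source, and cache file are illustrative and are not OneFlow's actual xrt code.

```cpp
// Minimal INT8 entropy calibrator sketch (illustrative; not OneFlow's xrt implementation).
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

class ToyInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator2 {
 public:
  ToyInt8Calibrator(int batch_size, size_t batch_bytes, int num_batches)
      : batch_size_(batch_size), batch_bytes_(batch_bytes), num_batches_(num_batches) {
    cudaMalloc(&device_input_, batch_bytes_);
  }
  ~ToyInt8Calibrator() { cudaFree(device_input_); }

  int getBatchSize() const override { return batch_size_; }

  // Called repeatedly during calibration: fill `bindings` with device pointers
  // holding one representative batch, and return false once data is exhausted.
  bool getBatch(void* bindings[], const char* /*names*/[], int /*nbBindings*/) override {
    if (cursor_ >= num_batches_) { return false; }
    std::vector<float> host_batch(batch_bytes_ / sizeof(float), 0.5f);  // stand-in data
    cudaMemcpy(device_input_, host_batch.data(), batch_bytes_, cudaMemcpyHostToDevice);
    bindings[0] = device_input_;
    ++cursor_;
    return true;
  }

  // Persist the calibration table so later builds can skip re-calibration.
  const void* readCalibrationCache(size_t& length) override {
    std::ifstream in("calibration.cache", std::ios::binary);
    cache_.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    length = cache_.size();
    return cache_.empty() ? nullptr : cache_.data();
  }
  void writeCalibrationCache(const void* cache, size_t length) override {
    std::ofstream out("calibration.cache", std::ios::binary);
    out.write(static_cast<const char*>(cache), static_cast<std::streamsize>(length));
  }

 private:
  int batch_size_;
  size_t batch_bytes_;
  int num_batches_;
  int cursor_ = 0;
  void* device_input_ = nullptr;
  std::string cache_;
};
```

During engine build, an instance of such a calibrator would be attached through the builder config (setInt8Calibrator) together with the INT8 builder flag; how OneFlow's xrt layer wires this in is only summarized by the commit messages above.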
  3. 05 Feb, 2020 (1 commit)
  4. 26 Dec, 2019 (1 commit)
      XRT: XLA + TensorRT (#2525) · 8f3dcf94
      Committed by Houjiang Chen
      * Enable multiple definition for xla compilation in oneflow
      
      * Realize running an executable
      
      * Abstract and gather the resources needed for compilation (client, builder, etc.) into a CompilationResourceStore
      
      * Implement a separate xla allocator to avoid introducing too many tensorflow objects
      
      * Define CompilationContext separately
      
      * Running XLA by CPU mode is OK now
      
      * Make the result shape after running the executable a tuple, and refine comments
      
      * Add compilation cache to avoid recompiling every time (see the cache sketch at the end of this log)
      
      * Resolve InferSbpSignature in XlaLaunchOp
      
      * Resolve executing on a specified cuda stream
      
      * Refine XlaLaunch parallel conf, add batch matmul op
      
      * Refactor job rebuilding and fixup time shape
      
      * Update batch_dim_lbis field if XlaLaunch has any output which has batch dim
      
      * Resolve cluster rings after clustering, taking sbp policy and time shape into consideration
      
      * Add reshape op
      
      * Fix bugs
      
      * Rename CompilationContext to XlaLaunchContext, add XlaRuntimeScope to swap stream handle
      
      * Fix bugs
      
      * Update cmake to compile with xla optionally
      
      * Support more ops
      
      * Add more ops, and fix bugs
      
      * Implement XLA allocator and internal memory pool
      
      * Adaptively resize allocator memory size
      
      * Refine memory allocator
      
      * Block host if running cpu executable
      
      * Fix bug for getting scalar value
      
      * Fix result layout bug. This bug caused wrong results for transpose
      
      * Refine gelu backward
      
      * Of xla sx (#1990)
      
      * add identity xla op
      
      * Add batch gather op
      
      * Refine batch gather
      
      * fix batch gather bug and add gather op, mv identity op to unary_op
      
      * Add softmax and gather/batch_gather
      
      * Add xla softmax_grad op
      
      * Add xla layer normalization op
      
      * Add xla layer norm backward op
      
      * Alias inputs and outputs to compute in-place
      
      * Reuse output buffers when running the xla executable. Zero-copying results brings about a 10% speedup for bert on a single gpu
      
      
      * Refine xla allocator
      
      * Refine code style
      
      * Add xla reduce_sum op
      
      * Rewrite model update op to optimizer graph
      
      * Fix hang bugs
      
      * Fix input whose body is disabled in xla launch kernel
      
      * Fix self control in
      
      * Fix self control in
      
      * Add fake consume op
      
      * Fix HasAttr bug for optional field
      
      * Refine AdamOptimizer
      
      * Fix xla AdamOptimizer bugs
      
      * Add meta data in HLO instruction, and refine
      
      * Fix bugs
      
      * add reduce sum and split normal model update (#2040)
      
      * remove append_func_to_list
      
      * Rm deprecated model update and save code (#1958)
      
      * remove code
      
      * mv random gen to kernel
      
      * mk seed required
      
      * address reviews
      
      * fix unused warning
      
      * address reviews
      
      * check in more deprecation
      
      * remove ModelSaveOpConf
      
      * move out ops and modify item (#1962)
      
      * ModelInit.__oneflow_input_remote_blobs__
      
      * fix cpu only query & add error info (#1964)
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * modify check_point and add test check_point (#1963)
      
      * fix misuse of Scope/raii
      
      * op_name2variable_blob
      
      * add sigmoid test and tanh test (#1966)
      
      * add op matmul and matmul test (#1967)
      
      * rename oneflow.val to oneflow.input_blob_def
      
      * support auto var for convolution (#1972)
      
      * add op add and test add (#1973)
      
      * mv deprecated.pb_util to lib.core.pb_util
      
      * add op get_variable and get_variable test (#1975)
      
      * add op get_variable and get_variable test
      
      * modify shape extend
      
      * AllReduceSequencePass (#1976)
      
      * python2 compatibility for check_point
      
      * fix "return (blob_a, blob_b)" bug
      
      * rename: arg_passing => arg_pass
      
      * shared regst blob header between jobs (#1919)
      
      * half impl
      
      * register manager handles memory sharing for separated memory
      
      * set separated memory shared id for shared regst between jobs
      
      * half impl of python for blob
      
      * fix BUG of pod ToProto() when proto has been inited
      
      * fix BUG of infer dim0_inner_shape() in foreign_input_op
      
      * 1. PushJob copy from python can infer dim0_valid_num
      
      * add test for dynamic relu
      
      * refine test file
      
      * refine code
      
      * refine note
      
      * update test file for new interface
      
      * rename separated_header* (#1979)
      
      * some bug fixes for a train&eval job (#1978)
      
      * debugging alex net
      
      * check in test pull_multiple_blob.py
      
      * stricter check
      
      * fix bias in conv
      
      * fix various bugs
      
      * rm file
      
      * op_name in different jobs can be overloaded
      
      * fix compile bug in job_set_compile_ctx
      
      * rm cmake code for building oneflow binary
      
      * check in script (#1980)
      
      * check in script
      
      * rm used import
      
      * CudaCurrentDeviceGuard (#1977)
      
      * fix val (#1981)
      
      * Merge job set and split fw bw (#1982)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spare space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather,gather_grad,gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting unwanted files (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot deleting unwanted files  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * Merge job set and split fw bw (#1983)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spare space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather,gather_grad,gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting unwanted files (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot deleting unwanted files  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * CudaCurrentDeviceGuard (#1977)
      
      * delete tmp_split_fw_bw_train_conf (#1985)
      
      * delete tmp_split_fw_bw_train_conf
      
      * delete useless comments
      
      * fix refactor bug in layer_norm_op
      
      * minor fixes
      
      * update py script
      
      * remove code could be misleading
      
      * Fix all reduce mem sharing (#1986)
      
      * fix all reduce mem sharing
      
      * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
      
      * remove obsolete task_graph optimization
      
      * no arg_pass_job for variable_op
      
      * merge memory block id between jobs (#1910)
      
      * refine MemBlock and CriticalSection
      
      * job memory sharing strategy
      
      * revert diff in CriticalSectionDesc
      
      * Merge memory block between sub plans
      
      * Get mutual exclusion job groups
      
      * forgot to consider that memory merge happens only within the same machine
      
      * memory zone unique id
      
      * Merge Done;  merge memory block id from right to left; get memory block ids info
      
      * revert MemBlock
      
      * generate mutual exclusion job groups Done.
      
      * update for proto
      
      * add JobMemSharingStrategy in python interface
      
      * remove memorycase hash
      
      * move JobMemSharingStrategy to JobSetProto
      
      * using default strategy = parallel priority strategy
      
      * update interface of flow.job_mem_sharing_strategy
      
      * InterJobMemSharingUtil and PlanUtil
      
      * revert oneflow.h
      
      * fix bug
      
      * New implementation of merging memory block id between jobs
      
      * refine code
      
      * fix a fatal bug in std::hash<oneflow::Shape>
      
      * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
      
      * unlock critical sections as much as possible (#1994)
      
      * Bugfix actor case (#1995)
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * Bugfix actor case (#1996)
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * small regst_num for reentrant_lock (#1997)
      
      * fmt dev_job_set(#1999)
      
      * double buffer for tick_op
      
      * tick is cpu op
      
      * speedup compile time (#2000)
      
      * only merge mem_block_id between user job (#1993)
      
      * Fix keep header only (#2001)
      
      * speedup compile time
      
      * fix keep header only
      
      * remove shared model (#2003)
      
      * remove blob_mem_sharing (#2005)
      
      * No copyhd for output (#2006)
      
      * no cpu tick
      
      * no copyhd for output_op/switch_output_op
      
      * remove temp comments
      
      * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
      
      * remove clone_id (#2007)
      
      * layer norm auto var (#2004)
      
      * layer norm auto var
      
      * make of_format
      
      * bn sbp (#2008)
      
      * Refactor job completer (#1998)
      
      * fmt
      
      * refactor GenerateOpConf4Trainning
      
      * more refactor
      
      * refactor SetCtrlInOpName4VariableOp
      
      * use uniq ptr
      
      * refactor RewriteBoxingWithAllReduce
      
      * refactor MakeAllReduceSequence
      
      * refactor auto_mixed_precision
      
      * refactor DumpLogicalBlobDescAndSbpSignature
      
      * refactor group_boxing_by_dst_parallel
      
      * refactor add_keep_header_only_op_conf
      
      * refactor AutoSourceTick
      
      * refactor AddTickForTimeShape
      
      * refactor AutoSinkTick
      
      * refactor AddGlobalOutputCriticalSections
      
      * refactor SetOpTimeShape7BatchDimLbis
      
      * fix a bug in IsInterfaceTask (#2009)
      
      * Bugfix is interface task (#2010)
      
      * fix a bug in IsInterfaceTask
      
      * IsOutputInterfaceTask
      
      * copyhd-free output_op task_node
      
      * Dev job set config util (#2011)
      
      * add more if in JobConfigProtoBuilder
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * remove total batch num in config util
      
      * remove clone_id
      
      * assert has train_conf
      
      * rm debug info
      
      * Dev job set bert (#2013)
      
      * support bert
      
      * mv into bert
      
      * manual format
      
      * fix adam (#2015)
      
      * fix adam
      
      * div batch instance num before update model
      
      * remove outdate code in oneflow.cpp (#2017)
      
      * Dev split like (#2016)
      
      * no total_instance_num
      
      * add auto grad for concat
      
      * check in impl
      
      * check in bug fixes
      
      * fix bugs for split_like
      
      * split_like_op.cpp format
      
      * add normalization_autovar
      
      * Update op_conf.proto
      
      * address reviews
      
      * fix typo
      
      * constant ref
      
      * rm forward_loss_instance_num (#2018)
      
      * Bugfix job set multi device (#2019)
      
      * sbp for tick input bn
      
      * interface_blob_conf for output_op/switch_output_op
      
      * set sbp conf for tuple identity op
      
      * fix bugs when merge main plan
      
      * delete useless code
      
      * address review
      
      * fix error use of GenRepeatedBn()
      
      * ForEachConnectedComponent is easily misused
      
      * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
      
      * only for return output_op
      
      * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
      
      * return op instead of output op acts as part of user job
      
      * enable_all_reduce_group
      
      * bugfix: init RuntimeBuffersScope before Runtime
      
      * demo python scripts for enable_all_reduce_group
      
      * remove wrong optimization code
      
      * constant_conf for enable_all_reduce_group.py test
      
      * fix interface op parallel conf
      
      * fix reduce concat kernel (#2020)
      
      * binary program oneflow_worker
      
      * user_job_completer
      
      * remove unused code loss_print
      
      * rm unused code loss_acc
      
      * remove unused accuracy_acc and accuracy_print
      
      * remove input_diff/output_diff/model_diff bns
      
      * remove unused bns in gdb util
      
      * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
      
      * support mpi using style
      
      * Bugfix put job conf into plan (#2023)
      
      * put job_conf into plan
      
      * use job_name to judge isPullJob/isPushJob
      
      * fix wrong job_id error
      
      * model_init is a push job; model_save is a pull job
      
      * make cmake more reasonable (#2024)
      
      * Restructure python module and minimum setup.py (#2026)
      
      * check in updated paths
      
      * check in minimum setup tool
      
      * Dev python init multi unit (#2022)
      
      * init multi-unit by sending the oneflow_worker binary and ConfigProto to worker machines
      
      * refine var name
      
      * refine code
      
      * compile user/main job only on master
      
      * bert multi machine test code
      
      * fix bugs
      
      * JobConfs
      
      * fix bugs under WITH_RDMA
      
      * fix multi-machine bugs
      
      * delete useless code
      
      * Add xla reduce_sum op
      
      * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
      
      * feat: init_worker can run without scp-ing the binary and without uuid (#2029)
      
      * half impl of without scp bin
      
      * feat: init_worker can run without scp-ing the binary and without uuid
      
      * check in fixes (#2030)
      
      * fixbug of delete worker (#2033)
      
      * Dev dot plan (#2035)
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * Check in bug fix and multi node script (#2032)
      
      * check in fixes
      
      * check in script
      
      * fix boxing bug when setting conf with sbp
      
      * flag for iter
      
      * fixbug of delete worker
      
      * fix delete worker in script
      
      * address review, add exclusive or check
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * fix and add flags
      
      * fmt
      
      * rm debug output
      
      * more flags
      
      * check Activation
      
      * fix fc bug when num axes > 2
      
      * reverse change
      
      * fix next_batch_num (#2036)
      
      * upgrade nccl to 2.4.8 (#2037)
      
      * fix shape of fc in_diff (#2038)
      
      * Rewrite model update op to optimizer graph
      
      * Update oneflow.cmake (#2041)
      
      * better looking merged_plan to dot v1 (#2039)
      
      * better looking and more information in merged_plan.dot
      
      * refine color
      
      * Fix tick in multi node parallel (#2042) (#2047)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * Dev train conf builder (#2046)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * check in impl
      
      * fix data dir (#2054)
      
      * fix data dir
      
      * rm model load path
      
      * AssignOp (#2058)
      
      * AssignOp
      
      * remove useless code
      
      * Python ops gather and unit test (#2053)
      
      * python_ops gather and unit test
      
      * format
      
      * minor mod
      
      * SnapshotOp (#2060)
      
      * magical add and fix bug (#2061)
      
      * check in impl
      
      * add todo
      
      * Dev jxf python pooling (#2056)
      
      * run max_pool_2d without bug
      
      * correct max_pool_2d
      
      * correct average_pool_2d
      
      * minor refine
      
      * final version
      
      * rename to nn.py
      
      * add name arg to pool1d ops
      
      * refine by review
      
      * rename to _GetSequence and move it to the end of file (#2063)
      
      * fix BindInterfaceMemBlockId (#2065)
      
      * mark py file generated (#2066)
      
      * Dev gracious exit (#2057)
      
      * add more checks
      
      * make language more consistent
      
      * better error info for worker init
      
      * better error
      
      * Update setup.py (#2068)
      
      * Refine Infer APIs by returning Maybe<void> type (#2051)
      
      * Refine Infer APIs by returning Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * fix bug for split like op (#2070)
      
      * fix snapshot path (#2071)
      
      * Dev job set fix infer apis (#2072)
      
      * Refine Infer APIs by returning Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * update
      
      * add AutoGlobalStep (#2073)
      
      * rm default_initializer_conf in train conf (#2075)
      
      * Fix sigmoid op (#2076)
      
      * fix sigmoid op bug
      
      * fix bug for split like op
      
      * add sigmoid grad op
      
      * Fix bn (#2077)
      
      * fix bn
      
      * return Maybe<void> OK in lambda
      
      * fix typo
      
      * fix SigmoidGradOp (#2078)
      
      * Dev python merge job set (#2081)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix gcc warning in release (#2080)
      
      * fix gcc version in release
      
      * fix empty line
      
      * Fix adam mv initilizer (#2082)
      
      * zero constant initilzer for adam m and v
      
      * make of_format
      
      * init adam m v beta1_t and beta2_t
      
      * use value instead of initializer
      
      * const float& -> const float
      
      * update
      
      * LearningRateScheduleOp (#2079)
      
      * matmul (#2084)
      
      * matmul
      
      * np.allclose
      
      * Fix hang bugs
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
      
      * refine code for read
      
      * check py if and test
      
      * prelu (#2086)
      
      * prelu
      
      * fix
      
      * fix
      
      * template for either ptr cast (#2088)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * add template for cast
      
      * rename
      
      * Dev build and infer ctx (#2089)
      
      * add job_build_and_infer_ctx interface
      
      * lbn_with_split_hint
      
      * fix maybe macro
      
      * fix signature of Maybe<T>::Error()
      
      * job_build_and_infer_if
      
      * add c_api_util wrapper for job_build_and_infer_ctx
      
      * implement python/job_build_and_infer interface
      
      * CurJobBuildAndInferCtx_AddPlacementGroup
      
      * BuildJobAndInferCtx and Mgr c++ implementation (#2074)
      
      * job_build_and_infer_ctx_mgr
      
      * refine interface of infer_ctx_mgr
      
      * JobBuildInferCtx set job conf; add and refine error type
      
      * revert job.proto
      
      * half impl of add op in build_infer_ctx
      
      * generate op produced empty logical blob desc ; infer out blob desc interface
      
      * job_build_and_infer_ctx VERSION 1
      
      * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
      
      * maybe return
      
      * job_set hold by job_build_and_infer_ctx_mgr
      
      * check placement when infer ctx mgr leave cur job
      
      * Global New/Delete JobBuildAndInferCtxMgr
      
      * add JUST when ctx add op
      
      * remove unused job_conf.arg_op_name
      
      * fix bugs caused by python new api
      
      * fix bugs caused by lack of Global<JobDesc>
      
      * fix bugs caused by new api
      
      * refactor compiler.Compile
      
      * merge dev_python
      
      * remove unused message proto
      
      * rename api
      
      * Fix input whose body is disabled in xla launch kernel
      
      * add RemoteBlob.shape and RemoteBlob.dtype
      
      * Fix data type set default variable (#2092)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix default data type
      
      * Add conf axis for bias_add for any axis channel (#2093)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Dev jxf python initializer (#2090)
      
      * oneflow initializer
      
      * update
      
      * Fix self control in
      
      * Bugfix python alexnet (#2096)
      
      * bugfix_python_alexnet
      
      * fix
      
      * Add fake consume op
      
      * Dev global step (#2100)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * Fix optimizer initializer (#2095)
      
      * fix optimizer initializer
      
      * rename lars data temp bn
      
      * fix job_type (#2102)
      
      * Dev alexnet new api (#2094)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * check in softmax loss
      
      * nn.conv2d and nn.bias_add
      
      * fix opname
      
      * fix merge conflict
      
      * fix name
      
      * dense (#2097)
      
      * Fix jxf dense v2 (#2098)
      
      * dense
      
      * minor fix
      
      * alexnet
      
      * fix conf
      
      * quick fix
      
      * transpose
      
      * fix layers
      
      * add transpose
      
      * fix fc
      
      * fix
      
      * fix
      
      * fix data load
      
      * params check and format
      
      * rm activation in op conf
      
      * save workaround
      
      * fix avg pool 2d
      
      * fix max pool 2d
      
      * remove fc3 relu
      
      * alexnet eval
      
      * minor
      
      * replace has_batch_dim with batch_axis (#2104)
      
      * replace has_batch_dim with batch_axis
      
      * refactor OrderValue4HasBatchAxis
      
      * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
      
      * no CHECK in MatmulOp::InferBatchAxis
      
      * infer op by op_conf and  parallel_conf
      
      * wrapper Error for ErrorProto
      
      * replace ErrorUtil with Error
      
      * add OF_CHECK (#2110)
      
      * optional split_axis (#2113)
      
      * Fix HasAttr bug for optional field
      
      * undefined (#2116)
      
      * merge reduce xxx (#2119)
      
      * Update GetSbpSig() with Maybe (#2118)
      
      * fix several ops
      
      * modify all ops
      
      * format
      
      * update complete
      
      * Refine AdamOptimizer
      
      * fix (#2120)
      
      * Fix xla AdamOptimizer bugs
      
      * support scalar for reduce_xxx axis args (#2122)
      
      * Dev opt split axis (#2121)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * fix autovar split_axis (#2125)
      
      * Dev model init op (#2117)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
      
      * fix (#2127)
      
      * rm stale alexnet script (#2129)
      
      * Dev plain maybe (#2126)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * Dev simple checkpoint manager (#2128)
      
      * SimpleCheckPointManager
      
      * makedirs
      
      * fix path
      
      * save
      
      * refine
      
      * refine
      
      * fix path to numpy (#2130)
      
      * Dev plain maybe (#2132)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
      
      * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
      
      * Dev jxf merge general ops (#2131)
      
      * merge some general ops to dev_python
      
      * dense demo
      
      * rm print in test
      
      * new line at the end of file
      
      * format
      
      * fix check point
      
      * update alexnet
      
      * broadcast_xxx (#2134)
      
      * broadcast_xxx
      
      * typo
      
      * typo
      
      * rm job_conf.num_of_batches_in_snapshot
      
      * fix args (#2136)
      
      * fix proto if (#2138)
      
      * pass name to inner function (#2139)
      
      * check dropout if (#2140)
      
      * check dropout if
      
      * fix typo
      
      * Dev merge math ops (#2143)
      
      * merge math ops
      
      * new line at the end of file
      
      * merge layer norm (#2144)
      
      * variable_scope (#2141)
      
      * variable_scope
      
      * revert format
      
      * add check
      
      * Merge dropout if (#2145)
      
      * check dropout if
      
      * fix typo
      
      * fix typo
      
      * slice (#2142)
      
      * slice
      
      * add check and docstring
      
      * minor
      
      * minor
      
      * add const (#2146)
      
      * add const
      
      * fix indentation
      
      * address review
      
      * fmt
      
      * rm redundant
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * add more activations to math_ops (#2147)
      
      * fix bug (#2149)
      
      * truncated normal for bert (#2150)
      
      * Update bert for dev python (#2151)
      
      * truncated normal for bert
      
      * bert support
      
      * math.dropout to nn.dropout (#2153)
      
      * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
      
      * allow export multiple interfaces in oneflow_export decorator (#2154)
      
      * refactor job_build_and_infer_if.h
      
      * update oneflow_internal.h to use Maybe (#2135)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
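
Several entries above are about carrying errors across the `oneflow_internal` C boundary instead of crashing (`Maybe::GetDataAndSerializedErrorProto`, `OF_ERROR_STR_CHECK`, `OFStrCat()`). Below is a hedged sketch of that general pattern, with a plain struct standing in for the real `ErrorProto` protobuf message and all helper names assumed rather than taken from the codebase.

```cpp
#include <string>
#include <utility>

struct ErrorProto { std::string error_summary; };  // stand-in for the protobuf message

template<typename T>
struct Maybe {
  bool ok;
  T data;            // valid only when ok == true
  ErrorProto error;  // valid only when ok == false
};

inline std::string Serialize(const ErrorProto& e) { return e.error_summary; }

// Returns (data, serialized error); the Python side inspects the error string first
// and only then trusts the data value.
template<typename T>
std::pair<T, std::string> GetDataAndSerializedErrorProto(const Maybe<T>& maybe,
                                                         const T& default_val) {
  if (maybe.ok) { return {maybe.data, std::string("")}; }
  return {default_val, Serialize(maybe.error)};
}
```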
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp
      
      * Fix python scripts
      
      * Dev nc of internal (#2155)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      * fix: fix ctor bug
      
      * fix config_proto
      
      * rename c_api_util.Init => c_api_util.InitEnvironment
      
      * refactor compile_context.cur_job => compile_context.cur_job_conf
      
      * remove FixPackedBlobDescOfProducedRegst (#2156)
      
      * Fix snapshot root path empty log (#2158)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * Fix snapshot root path empty log
      
      * fix channel last (#2157)
      
      * fix channel last
      
      * minor
      
      * merge pb_message
      
      * add cudnn conv force algo (#2159)
      
      * Update bert for dev python (#2160)
      
      * remove old bert
      
      * set data_part_num in decoder
      
      * support model load/save args
      
      * Dev flow function (#2152)
      
      * add of.function, refactor init, refine session, and refine runtime
      
      * rm useless code
      
      * rename
      
      * update
      
      * add test
      
      * @oneflow_export JobConfigProto and Trainconf (#2162)
      
      * @oneflow_export JobConfigProto and Trainconf
      
      * remove unused config in config_util.py
      
      * remove oneflow.get_cur_job_conf_builder
      
      * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161)
      
      * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
      
      * fix config.train.model_update_conf
      
      * _GetJobConfAttr
      
      * update alexnet (#2166)
      
      * Update alexnet (#2167)
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * more reasonable conf
      
      * get variable in py layer norm
      
      * replace val in pb msg;  decode lbn string with split hint (#2165)
      
      * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
      
      * Add meta data in HLO instruction, and refine
      
      * python model parallel (#2103)
      
      * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op
      
      * merge placement group
      
      * refine code in AddAndInferOp
      
      * auto merge placement group when add op; remove mergeplacementgroup interface
      
      * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx
      
      * python blob add interface for model parallel
      
      * refine code of python blob split
      
      * remove interface of has/get_split_axis in python blob
      
      * remove interface of has_batch_dim in python blob
      
      * add check that blob split_axis can be divided by parallel num
      
      * refine code for maybe get/infer sbp
      
      * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc.
      
      * fix for plain point maybe
      
      * fix bug: add repeated placement group, remove add placement interface in hand
      
      * fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel
      
      * dev_python model parallel runnable and check correct
      
      * remove add placement group when placment scope exit
      
      * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
      
      * bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done
      
      * refine python blob_desc.split implement
      
      * refine interface decode lbn to split hint
      
      * refine auto add placement group
      
      * refine lbn with split hint decode
      
      * refine code for review
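
The model-parallel work above repeatedly mentions decoding a logical blob name (lbn) that carries a split hint. The exact hint syntax is not given in this log; the sketch below assumes a hypothetical `":S(axis)"` / `":B"` suffix purely to illustrate the idea of stripping the hint off the lbn and turning it into an SBP request.

```cpp
#include <optional>
#include <string>

struct SbpHint {
  bool is_split = false;  // true => S(axis), false => broadcast
  int split_axis = -1;
};

struct DecodedLbn {
  std::string lbn;              // plain logical blob name, hint stripped
  std::optional<SbpHint> hint;  // empty if no hint was attached
};

inline DecodedLbn DecodeLbnWithSplitHint(const std::string& lbn_with_hint) {
  DecodedLbn ret;
  const size_t pos = lbn_with_hint.rfind(':');
  if (pos == std::string::npos) { ret.lbn = lbn_with_hint; return ret; }
  ret.lbn = lbn_with_hint.substr(0, pos);
  const std::string tail = lbn_with_hint.substr(pos + 1);
  SbpHint hint;
  if (tail == "B") {
    ret.hint = hint;  // broadcast
  } else if (tail.size() >= 4 && tail[0] == 'S') {
    hint.is_split = true;
    hint.split_axis = std::stoi(tail.substr(2, tail.size() - 3));  // "S(0)" -> 0
    ret.hint = hint;
  } else {
    ret.lbn = lbn_with_hint;  // not a hint, keep the raw string untouched
  }
  return ret;
}
```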
      
      * remove AutoVar related code (#2168)
      
      * feat: remove all autovar
      
      * fix and format
      
      * fix: fix op::InferBlobDesc
      
      * add prototype (#2172)
      
      * add prototype
      
      * infer blob desc with sbp_signature
      
      * `str_a is not str_b' is buggy, use `str_a != str_b' instead
      
      * Update snapshot.cpp (#2174)
      
      * remove useless lines (#2176)
      
      * Fix bert multi nodes (#2177)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * CHECK_JUST for InferBlobDescsIf (#2178)
      
      * Fix bert multi nodes (#2180)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * config_proto -> default_config_proto
      
      * delete worker
      
      * update alexnet
      
      * remove unused op (#2182)
      
      * remove parallel_ctx when kernel init (#2185)
      
      * InferOpSbpSignature in op_graph and infer_ctx (#2175)
      
      * InferOpSbpSignature in op_graph and infer_ctx
      
      * bugfix: lambda lifetime; gen job build error with location info
      
      * refine error generation and return
      
      * refine check that lbi is valid and exists
      
      * remove parallel num in decode_of_record op/kernel (#2186)
      
      * Fix bugs
      
      * delete GlobalJobDesc() in operator/ (#2188)
      
      * rm unused test file
      
      * Refine
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Remove fake consume op
      
      * Support enable/disable XLA by set env
      
      * Merge callback, limit max operator count for each XLA subgraph
      
      * CudaEventPool
      
      * fix vector
      
      * refine
      
      * Support in-place update for optimizer
      
      * Add alias input and output to prevent reusing input with other temp buffers
      
      * Refine code style
      
      * Remove unused code
      
      * Of xla (#2237)
      
      * mv deprecated.pb_util to lib.core.pb_util
      
      * add op get_variable and get_variable test (#1975)
      
      * add op get_variable and get_variable test
      
      * modify shape extend
      
      * AllReduceSequencePass (#1976)
      
      * python2 compatibility for check_point
      
      * fix "return (blob_a, blob_b)" bug
      
      * rename: arg_passing => arg_pass
      
      * shared regst blob header between jobs (#1919)
      
      * half impl
      
      * register manager handles memory sharing for separated memory
      
      * set separated memory shared id for shared regst between jobs
      
      * half impl of python for blob
      
      * fix BUG of pod ToProto() when proto has been initialized
      
      * fix BUG of infer dim0_inner_shape() in foreign_input_op
      
      * 1. PushJob copy from python can infer dim0_valid_num
      
      * add test for dynamic relu
      
      * refine test file
      
      * refine code
      
      * refine note
      
      * update test file for new interface
      
      * rename separated_header* (#1979)
      
      * some bugs fixes for a train&eval job (#1978)
      
      * debugging alex net
      
      * check in test pull_multiple_blob.py
      
      * stricter check
      
      * fix bias in conv
      
      * fix various bugs
      
      * rm file
      
      * op_name in different jobs can be overloaded
      
      * fix compile bug in job_set_compile_ctx
      
      * rm cmake code for building oneflow binary
      
      * check in script (#1980)
      
      * check in script
      
      * rm used import
      
      * CudaCurrentDeviceGuard (#1977)
      
      * fix val (#1981)
      
      * Merge job set and split fw bw (#1982)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
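
For the `std::enable_if` item above, a self-contained illustration of dispatching one helper on the element type without partially specializing the whole class. `half` is a dummy struct here so the sketch compiles without CUDA, and the arithmetic is simplified; none of this is the real dropout kernel.

```cpp
#include <cstddef>
#include <type_traits>

struct half { unsigned short bits; };  // stand-in for CUDA __half

// Selected for ordinary floating-point element types.
template<typename T>
typename std::enable_if<std::is_floating_point<T>::value>::type
DropoutForward(std::size_t n, const T* in, const char* mask, T scale, T* out) {
  for (std::size_t i = 0; i < n; ++i) { out[i] = in[i] * static_cast<T>(mask[i]) * scale; }
}

// Selected only for the half type; the real kernel converts with
// __half2float/__float2half, faked here to keep the sketch host-only.
template<typename T>
typename std::enable_if<std::is_same<T, half>::value>::type
DropoutForward(std::size_t n, const T* in, const char* mask, float scale, T* out) {
  for (std::size_t i = 0; i < n; ++i) { out[i] = mask[i] ? in[i] : half{0}; }
  (void)scale;
}
```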
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
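
The auto-mixed-precision entries above describe a graph pass that marks list-approved ops as half precision and inserts casts where float and half regions meet. Below is a deliberately tiny sketch of that idea; the real pass also propagates through "non-list" nodes and groups inserted casts by lbn, and the node/edge types here are toy stand-ins.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Node { int id; std::string op_type; };
struct Edge { int src; int dst; };

struct Graph {
  std::vector<Node> nodes;
  std::vector<Edge> edges;
};

void InsertCastOps(Graph* graph, const std::set<std::string>& white_list) {
  // 1) Mark nodes whose op type is on the white list as half precision.
  std::set<int> half_nodes;
  for (const Node& n : graph->nodes) {
    if (white_list.count(n.op_type)) { half_nodes.insert(n.id); }
  }
  // 2) Splice a cast node into every edge crossing the float/half boundary.
  std::vector<Edge> new_edges;
  int next_id = static_cast<int>(graph->nodes.size());
  for (const Edge& e : graph->edges) {
    const bool src_half = half_nodes.count(e.src) > 0;
    const bool dst_half = half_nodes.count(e.dst) > 0;
    if (src_half == dst_half) { new_edges.push_back(e); continue; }
    const int cast_id = next_id++;
    graph->nodes.push_back({cast_id, dst_half ? "cast_f2h" : "cast_h2f"});
    new_edges.push_back({e.src, cast_id});
    new_edges.push_back({cast_id, e.dst});
  }
  graph->edges = std::move(new_edges);
}
```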
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
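
The NewKernelUtil refactor above splits device helpers into small per-domain interfaces. The sketch below mirrors only that composition pattern (a facade inheriting `DnnIf`/`BlasIf`/`ArithemeticIf` per device); the function names and bodies are invented for illustration and are not the real API.

```cpp
#include <cstddef>

enum class DeviceType { kCPU, kGPU };

template<DeviceType device> struct DnnIf;
template<DeviceType device> struct BlasIf;
template<DeviceType device> struct ArithemeticIf;

template<> struct BlasIf<DeviceType::kCPU> {
  static void Axpy(std::size_t n, float alpha, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i) { y[i] += alpha * x[i]; }
  }
};

template<> struct ArithemeticIf<DeviceType::kCPU> {
  static void Fill(std::size_t n, float value, float* dst) {
    for (std::size_t i = 0; i < n; ++i) { dst[i] = value; }
  }
};

template<> struct DnnIf<DeviceType::kCPU> {
  static void Relu(std::size_t n, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i) { y[i] = x[i] > 0.f ? x[i] : 0.f; }
  }
};

// The facade simply aggregates the per-domain interfaces for one device, so callers
// write NewKernelUtil<device>::Relu(...) without caring which interface provides it.
template<DeviceType device>
struct NewKernelUtil : public DnnIf<device>, public BlasIf<device>, public ArithemeticIf<device> {};
```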
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK CUDA version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting files unwantedly (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph, JobCompleter and Operator  2. fix bug of Snapshot deleting files unwantedly  3. refine README
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
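
The kernel-registry rework above is a classic self-registration pattern: kernels register a factory keyed by op type at static-initialization time, and a lookup creates them later. The sketch below shows the generic pattern only, not the actual OneFlow registry (whose keys also involve device type and data type).

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct Kernel { virtual ~Kernel() = default; };

using KernelCreator = std::function<std::unique_ptr<Kernel>()>;

// Meyers-singleton map so registration order across translation units is safe.
inline std::map<std::string, KernelCreator>& KernelRegistry() {
  static std::map<std::string, KernelCreator> registry;
  return registry;
}

struct KernelRegistrar {
  KernelRegistrar(const std::string& op_type, KernelCreator creator) {
    KernelRegistry()[op_type] = std::move(creator);
  }
};

#define REGISTER_KERNEL(op_type, KernelClass)                                  \
  static KernelRegistrar g_##KernelClass##_registrar(                          \
      op_type, []() { return std::unique_ptr<Kernel>(new KernelClass()); })

inline std::unique_ptr<Kernel> CreateKernel(const std::string& op_type) {
  auto it = KernelRegistry().find(op_type);
  return it == KernelRegistry().end() ? nullptr : it->second();
}

// Example registration; the runtime later calls CreateKernel("matmul").
struct MatmulKernel : public Kernel {};
REGISTER_KERNEL("matmul", MatmulKernel);
```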
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * Merge job set and split fw bw (#1983)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK CUDA version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting files unwantedly (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph, JobCompleter and Operator  2. fix bug of Snapshot deleting files unwantedly  3. refine README
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * CudaCurrentDeviceGuard (#1977)
      
      * delete tmp_split_fw_bw_train_conf (#1985)
      
      * delete tmp_split_fw_bw_train_conf
      
      * delete useless comments
      
      * fix refactor bug in layer_norm_op
      
      * minor fixes
      
      * update py script
      
      * remove code that could be misleading
      
      * Fix all reduce mem sharing (#1986)
      
      * fix all reduce mem sharing
      
      * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
      
      * remove obsolete task_graph optimization
      
      * no arg_pass_job for variable_op
      
      * merge memory block id between jobs (#1910)
      
      * refine MemBlock and CriticalSection
      
      * job memory sharing strategy
      
      * revert diff in CriticalSectionDesc
      
      * Merge memory block between sub plans
      
      * Get mutual exclusion job groups
      
      * forget to consider memory merge only in same machine
      
      * memory zone unique id
      
      * Merge Done;  merge memory block id from right to left; get memory block ids info
      
      * revert MemBlock
      
      * generate mutual exclusion job groups Done.
      
      * update for proto
      
      * add JobMemSharingStrategy in python interface
      
      * remove memorycase hash
      
      * move JobMemSharingStrategy to JobSetProto
      
      * using default strategy = parallel priority strategy
      
      * update interface of flow.job_mem_sharing_strategy
      
      * InterJobMemSharingUtil and PlanUtil
      
      * revert oneflow.h
      
      * fix bug
      
      * New implementation of merging memory block ids between jobs
      
      * refine code
      
      * fix a fatal bug in std::hash<oneflow::Shape>
      
      * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
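
One entry above fixes "a fatal bug in std::hash<oneflow::Shape>", though the log does not say what the bug was. For orientation only, here is a generic `std::hash` specialization for a Shape-like type that folds every dimension into the hash, which is the property such a hash needs so that distinct shapes rarely collide in hash-based containers.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Shape { std::vector<int64_t> dim_vec; };

inline bool operator==(const Shape& a, const Shape& b) { return a.dim_vec == b.dim_vec; }

namespace std {
template<>
struct hash<Shape> {
  size_t operator()(const Shape& shape) const {
    size_t seed = shape.dim_vec.size();
    for (int64_t d : shape.dim_vec) {
      // boost-style hash_combine over every dimension
      seed ^= std::hash<int64_t>()(d) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};
}  // namespace std
```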
      
      * unlock critical sections as more as possible (#1994)
      
      * Bugfix actor case (#1995)
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * Bugfix actor case (#1996)
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * small regst_num for reentrant_lock (#1997)
      
      * fmt dev_job_set (#1999)
      
      * double buffer for tick_op
      
      * tick is cpu op
      
      * speedup compile time (#2000)
      
      * only merge mem_block_id between user job (#1993)
      
      * Fix keep header only (#2001)
      
      * speedup compile time
      
      * fix keep header only
      
      * remove shared model (#2003)
      
      * remove blob_mem_sharing (#2005)
      
      * No copyhd for output (#2006)
      
      * no cpu tick
      
      * no copyhd for output_op/switch_output_op
      
      * remove temp comments
      
      * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
      
      * remove clone_id (#2007)
      
      * layer norm auto var (#2004)
      
      * layer norm auto var
      
      * make of_format
      
      * bn sbp (#2008)
      
      * Refactor job completer (#1998)
      
      * fmt
      
      * refactor GenerateOpConf4Trainning
      
      * more refactor
      
      * refactor SetCtrlInOpName4VariableOp
      
      * use uniq ptr
      
      * refactor RewriteBoxingWithAllReduce
      
      * refactor MakeAllReduceSequence
      
      * refactor auto_mixed_precision
      
      * refactor DumpLogicalBlobDescAndSbpSignature
      
      * refactor group_boxing_by_dst_parallel
      
      * refactor add_keep_header_only_op_conf
      
      * refactor AutoSourceTick
      
      * refactor AddTickForTimeShape
      
      * refactor AutoSinkTick
      
      * refactor AddGlobalOutputCriticalSections
      
      * refactor SetOpTimeShape7BatchDimLbis
      
      * fix a bug in IsInterfaceTask (#2009)
      
      * Bugfix is interface task (#2010)
      
      * fix a bug in IsInterfaceTask
      
      * IsOutputInterfaceTask
      
      * copyhd-free output_op task_node
      
      * Dev job set config util (#2011)
      
      * add more if in JobConfigProtoBuilder
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * remove total batch num in config util
      
      * remove clone_id
      
      * assert has train_conf
      
      * rm debug info
      
      * Dev job set bert (#2013)
      
      * support bert
      
      * mv into bert
      
      * manual format
      
      * fix adam (#2015)
      
      * fix adam
      
      * divide by batch instance num before updating model
      
      * remove outdate code in oneflow.cpp (#2017)
      
      * Dev split like (#2016)
      
      * no total_instance_num
      
      * add auto grad for concat
      
      * check in impl
      
      * check in bug fixes
      
      * fix bugs for split_like
      
      * split_like_op.cpp format
      
      * add normalization_autovar
      
      * Update op_conf.proto
      
      * address reviews
      
      * fix typo
      
      * constant ref
      
      * rm forward_loss_instance_num (#2018)
      
      * Bugfix job set multi device (#2019)
      
      * sbp for tick input bn
      
      * interface_blob_conf for output_op/switch_output_op
      
      * set sbp conf for tuple identity op
      
      * fix bugs when merge main plan
      
      * delete useless code
      
      * address review
      
      * fix erroneous use of GenRepeatedBn()
      
      * ForEachConnectedComponent is easily misused
      
      * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
      
      * only for return output_op
      
      * refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
      
      * return op instead of output op acts as part of user job
      
      * enable_all_reduce_group
      
      * bugfix: init RuntimeBuffersScope before Runtime
      
      * demo python scripts for enable_all_reduce_group
      
      * remove wrong optimization code
      
      * constant_conf for enable_all_reduce_group.py test
      
      * fix interface op parallel conf
      
      * fix reduce concat kernel (#2020)
      
      * binary program oneflow_worker
      
      * user_job_completer
      
      * remove unused code loss_print
      
      * rm unused code loss_acc
      
      * remove unused accuracy_acc and accuracy_print
      
      * remove input_diff/output_diff/model_diff bns
      
      * remove unused bns in gdb util
      
      * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
      
      * support mpi using style
      
      * Bugfix put job conf into plan (#2023)
      
      * put job_conf into plan
      
      * use job_name to judge isPullJob/isPushJob
      
      * fix wrong job_id error
      
      * model_init is a push job; model_save is a pull job
      
      * make cmake more reasonable (#2024)
      
      * Restructure python module and minimum setup.py (#2026)
      
      * check in updated paths
      
      * check in minimum setup tool
      
      * Dev python init multi unit (#2022)
      
      * init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine
      
      * refine var name
      
      * refine code
      
      * compile user/main job only on master
      
      * bert multi machine test code
      
      * fix bugs
      
      * JobConfs
      
      * fix bugs under WITH_RDMA
      
      * fix multi-machine bugs
      
      * delete useless code
      
      * Add xla reduce_sum op
      
      * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
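
On the overflow fix above: if a memory-zone unique id is composed from a job id and a zone id in 32-bit arithmetic, large ids get mangled, and widening to int64_t removes the problem. The packing scheme below is hypothetical and only demonstrates the arithmetic, not the actual id layout.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Packing job_id and a per-device zone id into one integer (hypothetical scheme).
  const int64_t job_id = 30000;
  const int64_t mem_zone_id = 7;
  const int64_t wide = job_id * 100000 + mem_zone_id;  // 3000000007, fits comfortably in 64 bits
  const int32_t narrow = static_cast<int32_t>(wide);   // does not fit in 32 bits: the value is mangled
  std::cout << "64-bit id: " << wide << ", squeezed into 32 bits: " << narrow << "\n";
}
```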
      
      * feat: init_worker can run without scp'ing the binary and without uuid (#2029)
      
      * half impl of running without scp'ing the binary
      
      * feat: init_worker can run without scp'ing the binary and without uuid
      
      * check in fixes (#2030)
      
      * fix bug of delete worker (#2033)
      
      * Dev dot plan (#2035)
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * Check in bug fix and multi node script (#2032)
      
      * check in fixes
      
      * check in script
      
      * fix boxing bug when setting conf with sbp
      
      * flag for iter
      
      * fix bug of delete worker
      
      * fix delete worker in script
      
      * address review, add exclusive or check
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * fix and add flags
      
      * fmt
      
      * rm debug output
      
      * more flags
      
      * check Activation
      
      * fix fc bug when num axes > 2
      
      * reverse change
      
      * fix next_batch_num (#2036)
      
      * upgrade nccl to 2.4.8 (#2037)
      
      * fix shape of fc in_diff (#2038)
      
      * Rewrite model update op to optimizer graph
      
      * Update oneflow.cmake (#2041)
      
      * better looking merged_plan to dot v1 (#2039)
      
      * better looking and more infomation of merged_plan.dot
      
      * refine color
      
      * Fix tick in multi node parallel (#2042) (#2047)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * Dev train conf builder (#2046)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * check in impl
      
      * fix data dir (#2054)
      
      * fix data dir
      
      * rm model load path
      
      * AssignOp (#2058)
      
      * AssignOp
      
      * remove useless code
      
      * Python ops gather and unit test (#2053)
      
      * python_ops gather and unit test
      
      * format
      
      * minor mod
      
      * SnapshotOp (#2060)
      
      * magical add and fix bug (#2061)
      
      * check in impl
      
      * add todo
      
      * Dev jxf python pooling (#2056)
      
      * run max_pool_2d without bug
      
      * correct max_pool_2d
      
      * correct average_pool_2d
      
      * minor refine
      
      * final version
      
      * rename to nn.py
      
      * add name arg to pool1d ops
      
      * refine by review
      
      * rename to _GetSequence and move it to the end of file (#2063)
      
      * fix BindInterfaceMemBlockId (#2065)
      
      * mark py file generated (#2066)
      
      * Dev gracious exit (#2057)
      
      * add more checks
      
      * make language more consistant
      
      * better error info for worker init
      
      * better error
      
      * Update setup.py (#2068)
      
      * Refine Infer APIs by return Maybe<void> type (#2051)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
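
The "Refine Infer APIs by return Maybe<void> type" series above replaces hard CHECK failures with recoverable errors that callers can propagate. Below is a toy sketch of that calling convention with stand-in `Maybe`, `CHECK_OR_RETURN`, and `JUST`; these are not the real macros, only an illustration of the shape of the API.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

struct Error { std::string msg; };

template<typename T = void>
struct Maybe {
  std::shared_ptr<Error> error;  // null means OK
  bool IsOk() const { return error == nullptr; }
  static Maybe Ok() { return Maybe{nullptr}; }
  static Maybe Err(std::string msg) {
    return Maybe{std::make_shared<Error>(Error{std::move(msg)})};
  }
};

// Instead of crashing, a failed check becomes an error value returned to the caller.
#define CHECK_OR_RETURN(cond) \
  if (!(cond)) return Maybe<>::Err("check failed: " #cond)

// Callers propagate failures upward without inspecting them.
#define JUST(expr)                                  \
  do {                                              \
    auto maybe_ = (expr);                           \
    if (!maybe_.IsOk()) { return maybe_; }          \
  } while (0)

struct BlobDesc { std::vector<int64_t> shape; };

Maybe<> InferMatmulBlobDesc(const BlobDesc& a, const BlobDesc& b, BlobDesc* out) {
  CHECK_OR_RETURN(a.shape.size() == 2 && b.shape.size() == 2);
  CHECK_OR_RETURN(a.shape[1] == b.shape[0]);
  out->shape = {a.shape[0], b.shape[1]};
  return Maybe<>::Ok();
}

Maybe<> InferModel(const BlobDesc& x, const BlobDesc& w, BlobDesc* y) {
  JUST(InferMatmulBlobDesc(x, w, y));  // propagate failures to the caller
  return Maybe<>::Ok();
}
```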
      
      * fix bug for split like op (#2070)
      
      * fix snapshot path (#2071)
      
      * Dev job set fix infer apis (#2072)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * update
      
      * add AutoGlobalStep (#2073)
      
      * rm default_initializer_conf in train conf (#2075)
      
      * Fix sigmoid op (#2076)
      
      * fix sigmoid op bug
      
      * fix bug for split like op
      
      * add sigmoid grad op
      
      * Fix bn (#2077)
      
      * fix bn
      
      * return Maybe<void> OK in lambda
      
      * fix typo
      
      * fix SigmoidGradOp (#2078)
      
      * Dev python merge job set (#2081)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix gcc warning in release (#2080)
      
      * fix gcc version in release
      
      * fix empty line
      
      * Fix adam mv initilizer (#2082)
      
      * zero constant initilzer for adam m and v
      
      * make of_format
      
      * init adam m v beta1_t and beta2_t
      
      * use value instead of initializer
      
      * const float& -> const float
      
      * update
      
      * LearningRateScheduleOp (#2079)
      
      * matmul (#2084)
      
      * matmul
      
      * np.allclose
      
      * Fix hang bugs
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
      
      * refine code for read
      
      * check py if and test
      
      * prelu (#2086)
      
      * prelu
      
      * fix
      
      * fix
      
      * template for either ptr cast (#2088)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * add template for cast
      
      * rename
      
      * Dev build and infer ctx (#2089)
      
      * add job_build_and_infer_ctx interface
      
      * lbn_with_split_hint
      
      * fix maybe macro
      
      * fix signature of Maybe<T>::Error()
      
      * job_build_and_infer_if
      
      * add c_api_util wrapper for job_build_and_infer_ctx
      
      * implement python/job_build_and_infer interface
      
      * CurJobBuildAndInferCtx_AddPlacementGroup
      
      * BuildJobAndInferCtx and Mgr C++ implementation (#2074)
      
      * job_build_and_infer_ctx_mgr
      
      * refine interface of infer_ctx_mgr
      
      * JobBuildInferCtx set job conf; add and refine error type
      
      * revert job.proto
      
      * half impl of add op in build_infer_ctx
      
      * generate op-produced empty logical blob desc; infer out blob desc interface
      
      * job_build_and_infer_ctx VERSION 1
      
      * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
      
      * maybe return
      
      * job_set held by job_build_and_infer_ctx_mgr
      
      * check placement when infer ctx mgr leaves cur job
      
      * Global New/Delete JobBuildAndInferCtxMgr
      
      * add JUST when ctx add op
      
      * remove unused job_conf.arg_op_name
      
      * fix bugs caused by python new api
      
      * fix bugs caused by lack of Global<JobDesc>
      
      * fix bugs caused by new api
      
      * refactor compiler.Compile
      
      * merge dev_python
      
      * remove unused message proto
      
      * rename api
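
The build-and-infer context described above incrementally adds ops and caches inferred blob descriptions by logical blob name, so later ops and the Python frontend can query shapes before anything runs. The sketch below is a heavily reduced version of that flow with toy types and a placeholder inference rule; it is not the real `JobBuildAndInferCtx`.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct BlobDesc { std::vector<long long> shape; };

struct OpConf {
  std::string name;
  std::vector<std::string> input_lbns;  // logical blob names this op consumes
  std::vector<std::string> output_bns;  // output blob names, scoped by op name
};

class JobBuildAndInferCtx {
 public:
  // Returns the inferred descs of the op's outputs; unknown inputs are an error.
  std::vector<BlobDesc> AddAndInferOp(const OpConf& op) {
    std::vector<const BlobDesc*> inputs;
    for (const std::string& lbn : op.input_lbns) {
      auto it = lbn2desc_.find(lbn);
      if (it == lbn2desc_.end()) { throw std::runtime_error("unknown lbn: " + lbn); }
      inputs.push_back(&it->second);
    }
    std::vector<BlobDesc> outputs;
    for (const std::string& obn : op.output_bns) {
      // Placeholder inference: outputs copy the first input's shape (or are scalars).
      BlobDesc desc = inputs.empty() ? BlobDesc{{1}} : *inputs.front();
      lbn2desc_[op.name + "/" + obn] = desc;
      outputs.push_back(desc);
    }
    return outputs;
  }

  const BlobDesc& LookupBlobDesc(const std::string& lbn) const { return lbn2desc_.at(lbn); }

 private:
  std::map<std::string, BlobDesc> lbn2desc_;
};
```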
      
      * Fix input which body is disabled in xla launch kernel
      
      * add RemoteBlob.shape and RemoteBlob.dtype
      
      * Fix data type set default variable (#2092)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix default data type
      
      * Add conf axis for bias_add for any axis channel (#2093)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Dev jxf python initializer (#2090)
      
      * oneflow initializer
      
      * update
      
      * Fix self control in
      
      * Bugfix python alexnet (#2096)
      
      * bugfix_python_alexnet
      
      * fix
      
      * Add fake consume op
      
      * Dev global step (#2100)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
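
The global-step and learning-rate-schedule entries above boil down to computing the current learning rate from `train_step` inside a dedicated op/kernel. The schedule below (linear warmup followed by exponential decay) is a generic illustration of that computation, not the schedule the commits actually implemented; all parameter names are invented.

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

double ScheduledLearningRate(int64_t train_step, double base_lr, int64_t warmup_steps,
                             double decay_rate, int64_t decay_steps) {
  if (train_step < warmup_steps) {
    // Linear warmup from 0 to base_lr.
    return base_lr * static_cast<double>(train_step + 1) / static_cast<double>(warmup_steps);
  }
  // Exponential decay after warmup.
  const double decay_exponent =
      static_cast<double>(train_step - warmup_steps) / static_cast<double>(decay_steps);
  return base_lr * std::pow(decay_rate, decay_exponent);
}

int main() {
  for (int64_t step : {0, 500, 1000, 5000, 20000}) {
    std::cout << "step " << step << " -> lr "
              << ScheduledLearningRate(step, 1e-4, 1000, 0.9, 10000) << "\n";
  }
}
```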
      
      * Fix optimizer initializer (#2095)
      
      * fix optimizer initializer
      
      * rename lars data temp bn
      
      * fix job_type (#2102)
      
      * Dev alexnet new api (#2094)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * check in softmax loss
      
      * nn.conv2d and nn.bias_add
      
      * fix opname
      
      * fix merge conflict
      
      * fix name
      
      * dense (#2097)
      
      * Fix jxf dense v2 (#2098)
      
      * dense
      
      * minor fix
      
      * alexnet
      
      * fix conf
      
      * quick fix
      
      * transpose
      
      * fix layers
      
      * add transpose
      
      * fix fc
      
      * fix
      
      * fix
      
      * fix data laod
      
      * params check and format
      
      * rm activation in op conf
      
      * save workaround
      
      * fix avg pool 2d
      
      * fix max pool 2d
      
      * remove fc3 relu
      
      * alexnet eval
      
      * minor
      
      * replace has_batch_dim with batch_axis (#2104)
      
      * replace has_batch_dim with batch_axis
      
      * refactor OrderValue4HasBatchAxis
      
      * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
      
      * no CHECK in MatmulOp::InferBatchAxis
      
      * infer op by op_conf and  parallel_conf
      
      * wrapper Error for ErrorProto
      
      * replace ErrorUtil with Error
      
      * add OF_CHECK (#2110)
      
      * optional split_axis (#2113)
      
      * Fix HasAttr bug for optional field
      
      * undefined (#2116)
      
      * merge reduce xxx (#2119)
      
      * Update GetSbpSig() with Maybe (#2118)
      
      * fix several ops
      
      * modify all ops
      
      * format
      
      * update complete
      
      * Refine AdamOptimizer
      
      * fix (#2120)
      
      * Fix xla AdamOptimizer bugs
      
      * support scalar for reduce_xxx axis args (#2122)
      
      * Dev opt split axis (#2121)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * fix autovar split_axis (#2125)
      
      * Dev model init op (#2117)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
      
      * fix (#2127)
      
      * rm stale alexnet script (#2129)
      
      * Dev plain maybe (#2126)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * Dev simple checkpoint manager (#2128)
      
      * SimpleCheckPointManager
      
      * makedirs
      
      * fix path
      
      * save
      
      * refine
      
      * refine
      
      * fix path to numpy (#2130)
      
      * Dev plain maybe (#2132)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
      
      * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
      
      * Dev jxf merge general ops (#2131)
      
      * merge some general ops to dev_python
      
      * dense demo
      
      * rm print in test
      
      * new line at the end of file
      
      * format
      
      * fix check point
      
      * update alexnet
      
      * broadcast_xxx (#2134)
      
      * broadcast_xxx
      
      * typo
      
      * typo
      
      * rm job_conf.num_of_batches_in_snapshot
      
      * fix args (#2136)
      
      * fix proto if (#2138)
      
      * pass name to inner function (#2139)
      
      * check dropout if (#2140)
      
      * check dropout if
      
      * fix typo
      
      * Dev merge math ops (#2143)
      
      * merge math ops
      
      * new line at the end of file
      
      * merge layer norm (#2144)
      
      * variable_scope (#2141)
      
      * variable_scope
      
      * revert format
      
      * add check
      
      * Merge dropout if (#2145)
      
      * check dropout if
      
      * fix typo
      
      * fix typo
      
      * slice (#2142)
      
      * slice
      
      * add check and docstring
      
      * minor
      
      * minor
      
      * add const (#2146)
      
      * add const
      
      * fix indentation
      
      * address review
      
      * fmt
      
      * rm redundant
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * add more activations to math_ops (#2147)
      
      * fix bug (#2149)
      
      * truncated normal for bert (#2150)
      
      * Update bert for dev python (#2151)
      
      * truncated normal for bert
      
      * bert support
      
      * math.dropout to nn.dropout (#2153)
      
      * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
      
      * allow export multiple interfaces in oneflow_export decorator (#2154)
      
      * refactor job_build_and_infer_if.h
      
      * update oneflow_internal.h to use Maybe (#2135)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp
      
      * Fix python scripts
      
      * Dev nc of internal (#2155)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      * fix: fix ctor bug
      
      * fix config_proto
      
      * rename c_api_util.Init => c_api_util.InitEnvironment
      
      * refactor compile_context.cur_job => compile_context.cur_job_conf
      
      * remove FixPackedBlobDescOfProducedRegst (#2156)
      
      * Fix snapshot root path empty log (#2158)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * Fix snapshot root path empty log
      
      * fix channel last (#2157)
      
      * fix channel last
      
      * minor
      
      * merge pb_message
      
      * add cudnn conv force algo (#2159)
      
      * Update bert for dev python (#2160)
      
      * remove old bert
      
      * set data_part_num in decoder
      
      * support model load/save args
      
      * Dev flow function (#2152)
      
      * add of.function, refactor init, refine session, and refine runtime
      
      * rm useless code
      
      * rename
      
      * update
      
      * add test
      
      * @oneflow_export JobConfigProto and Trainconf (#2162)
      
      * @oneflow_export JobConfigProto and Trainconf
      
      * remove unused config in config_util.py
      
      * remove oneflow.get_cur_job_conf_builder
      
      * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161)
      
      * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
      
      * fix config.train.model_update_conf
      
      * _GetJobConfAttr
      
      * update alexnet (#2166)
      
      * Update alexnet (#2167)
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * more reasonable conf
      
      * get variable in py layer norm
      
      * replace val in pb msg;  decode lbn string with split hint (#2165)
      
      * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
      
      * Add meta data in HLO instruction, and refine
      
      * python model parallel (#2103)
      
      * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op
      
      * merge placement group
      
      * refine code in AddAndInferOp
      
      * auto merge placement group when add op; remove mergeplacementgroup interface
      
      * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx
      
      * python blob add interface for model parallel
      
      * refine code of python blob split
      
      * remove interface of has/get_split_axis in python blob
      
      * remove interface of has_batch_dim in python blob
      
      * add check blob split_axis can be divide by parallel num
      
      * refine code for maybe get/infer sbp
      
      * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc etc.
      
      * fix for plain point maybe
      
      * fix bug: add repeated placement group, remove add placement interface in hand
      
      * fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel
      
      * dev_python model parallel runnable and check correct
      
      * remove add placement group when placement scope exit
      
      * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
      
      * bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done
      
      * refine python blob_desc.split implement
      
      * refine interface decode lbn to split hint
      
      * refine auto add placement group
      
      * refine lbn with split hint decode
      
      * refine code for review
      
      * remove AutoVar related code (#2168)
      
      * feat: remove all autovar
      
      * fix and format
      
      * fix: fix op::InferBlobDesc
      
      * add prototype (#2172)
      
      * add prototype
      
      * infer blob desc with sbp_signature
      
      * `str_a is not str_b' is buggy, use `str_a != str_b' instead
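
      A minimal Python sketch of the pitfall this commit describes (the strings here are illustrative, not taken from the code being fixed): `is` / `is not` compare object identity, while `==` / `!=` compare values, so two equal strings can still fail an identity check.

      ```python
      a = "split:0"
      b = "".join(["split:", "0"])  # equal value, but a distinct string object

      print(a == b)      # True  -- value comparison, the intended check
      print(a is b)      # False -- identity comparison, depends on object reuse/interning
      print(a is not b)  # True even though the values match, which is the bug
      ```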
      
      * Update snapshot.cpp (#2174)
      
      * remove useless lines (#2176)
      
      * Fix bert multi nodes (#2177)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * CHECK_JUST for InferBlobDescsIf (#2178)
      
      * Fix bert multi nodes (#2180)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * config_proto -> default_config_proto
      
      * delete worker
      
      * update alexnet
      
      * remove unused op (#2182)
      
      * remove parallel_ctx when kernel init (#2185)
      
      * InferOpSbpSignature in op_graph and infer_ctx (#2175)
      
      * InferOpSbpSignature in op_graph and infer_ctx
      
      * bugfix: lambda lifetime; gen job build error add location info
      
      * refine error generation and return
      
      * refine check lbi valid and exists
      
      * remove parallel num in decode_of_record op/kernel (#2186)
      
      * Fix bugs
      
      * delete GlobalJobDesc() in operator/ (#2188)
      
      * rm unused test file
      
      * Refine
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Remove fake consume op
      
      * Support enable/disable XLA by set env
      
      * Merge callback, limit max operator count for each XLA subgraph
      
      * CudaEventPool
      
      * fix vector
      
      * refine
      
      * Support in-place update for optimizer
      
      * Add alias input and output to prevent reusing input with other temp buffers
      
      * Refine code style
      
      * Remove unused code
      
      * Fix static cublas library and xla link conflict
      
      * Fix cublas link conflict with tensorflow
      
      * Fix different connection kinds for multiple gpu cards (#2282)
      
      * Refine xla cluster algo (#2289)
      
      * Fix different connection kinds for multiple gpu cards
      
      * Fix bug for multiple outputs consumed by one node
      
      * Refine cluster algo
      
      * Refine MarkClusterId pass and ReduceSplit task node (#2314)
      
      * Fix different connection kinds for multiple gpu cards
      
      * Fix bug for multiple outputs consumed by one node
      
      * Refine cluster algo
      
      * Determine fusion disabled edges
      
      * update
      
      * Produce multiple registers on edges for ReduceSplit task node.
      Fix new allocator by stream id.
      
      * Refine MarkClusterId pass
      
      * Clustering subgraph with reverse ordering is better
      
      * Support strict clustering by taking dependencies into consideration
      
      * Translate rebuild job and rewrite optimizer into passes, and refine code style
      
      * Fix spell error
      
      * Update cmake
      
      * Merge branch dev_python (#2321)
      
      * Dev res50 new api (#2173)
      
      * check in script
      
      * runable
      
      * fix multinode
      
      * fix and real train
      
      * fix param data_format
      
      * fix truncated normal
      
      * quick fix multi node launch (#2193)
      
      * Dev reshape sbp (#2192)
      
      * reshape sbp
      
      * more check for reshape conf
      
      * fix error CHECK
      
      * refactor reshape
      
      * fix reshape like op
      
      * support naive case of s0
      
      * refine
      
      * rm redundant code
      
      * more generous check for equal element cnt
      
      * restore empty line
      
      * add GatherMs0Grad op (#2191)
      
      * support for gather with s(0) `in'
      
      * add gather_ms0_op
      
      * fix bugs in message GatherMs0OpConf and GatherMs0Kernel
      
      * only (B, S(0)) -> P supported for gather_ms0 op
      
      * add GatherMs0Grad op
      
      * minor fix
      
      * refine code
      
      * bugfix and update gather test case
      
      * add concat op and pass the test (#2067)
      
      * add concat op and pass the test
      
      * add vgg job_conf
      
      * model compared and confirmed to be the same as the old one
      
      * rm unnecessary file
      
      * Update array_ops.py
      
      * mv file
      
      * get rid of ternary operator (#2195)
      
      * Dev reshape util struct (#2194)
      
      * check in changes
      
      * rm file
      
      * minor fix
      
      * Merge network files of 2 cnns (#2196)
      
      * add inceptionV3
      
      * check in vgg16
      
      * add cnns test scripts for dev_python (#2170)
      
      * add cnns test scripts for dev_python
      
      * add alexnet test scripts
      
      * add resnet50
      
      * add inceptionv3
      
      * add resnet50
      
      * add vgg16
      
      * first version of run_cnns_test.py
      
      * remove old files
      
      * unsorted_segment_sum (#2198)
      
      * oneflow.unsorted_segment_sum (#2199)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * Dev batch unsorted segment sum (#2200)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * rename UnsortedSegmentSum to BatchUnsortedSegmentSum
      
      * rename: batch_unsorted_* => unsorted_batch_*
      
      * unsorted_segment_sum (#2201)
      
      * unsorted_segment_sum
      
      * fix job_completer/unsorted_segment_sum_grad.cpp
      
      * more check for unsorted_segment_sum batch_axis
      
      * remove FixParallelDesc (#2202)
      
      * rm KernelIfWithModel KernelIfWithActivation (#2203)
      
      * remove KernelIfWithActivation
      
      * remove KernelIfWithModel
      
      * rm blob header kLossInstanceNum (#2204)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * fix warning: return string reference to temporary (#2212)
      
      * docker build support (#2002)
      
      * update cmake files
      
      * check in files
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * shrink ctx size
      
      * fix script
      
      * fix wheel build
      
      * fix wheel build not adding .so (#2052)
      
      * lower cmake version bar
      
      * rm more files
      
      * keep build dir
      
      * check in test bash script
      
      * fix
      
      * Dev docker sx (#2124)
      
      * add python2 docker env
      
      * rm old docker files
      
      * update repository
      
      * add ARG CUDA and USE_PYTHON_3_OR_2
      
      * reform files
      
      * update
      
      * rm log doesn't print when there is cache
      
      * use default arg in dockerfile
      
      * better py 2 or 3 condition
      
      * add default
      
      * use if
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * add resnet50 in model (#2217)
      
      * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)
      
      * remove parallel policy
      
      * rm FC/rnn/embedding_look_up op/kernel
      
      * add check data parallel for conv/layer_norm op
      
      * bugfix: bias add + use math_add when batch size = 1
      
      * fix InferBatchAxis (#2220)
      
      * sync with bert_benchmark (#2221)
      
      * sync with bert_benchmark
      
      * rename run.sh
      
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
      * delete unnecessary multiply_grad class
      
      * speed up generate time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel to high performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
      * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
      * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
      
      * Fix random decode (#2252)
      
      * add decode random
      
      * fix decode random actor
      
      * Dev pr boxing v2 (#2248)
      
      * NcclDeviceCtx
      
      * include naive_actor
      
      * refine
      
      * use_boxing_v2
      
      * config.use_boxing_v2
      
      * SubTskGphBuilder
      
      * fix
      
      * hash<oneflow::MemoryCase>
      
      * Maybe<void>
      
      * ChainSubTskGphBuilder
      
      * SliceBoxingOp
      
      * return ok
      
      * SliceBoxingKernel
      
      * SliceBoxingActor
      
      * kSliceBoxing
      
      * nccl boxing op
      
      * nccl actor
      
      * REGISTER_OP
      
      * GetMsgFromCustomizedConf
      
      * NcclBoxingTaskNode
      
      * BldSubTskGphByBoxingV2
      
      * NcclBoxingSubTskGphBuilder
      
      * fix
      
      * fix
      
      * NcclKernel
      
      * ParallelContext
      
      * REGISTER_ACTOR
      
      * fix rank set
      
      * IsNcclTaskType
      
      * limit
      
      * 1024
      
      * multi thread reader
      
      * thread_num
      
      * IsKernelLaunchSynchronized
      
      * refine
      
      * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx
      
      * MakeHostMemCase
      
      * NcclBldSubTskGph
      
      * remove useless code
      
      * use_boxing_v2
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * cmake find python note when version less than 3.14 (#2286)
      
      * fix bug: reduce split kernel inplace (#2297)
      
      * Dev bias add (#2299)
      
      * use bias add
      
      * fix
      
      * bias_add
      
      * bias add half
      
      * fix
      
      * reinterpret_cast
      
      * fix half
      
      * HALF
      
      * fix
      
      * ADD_DEFAULT_KERNEL_CREATOR
      
      * fix
      
      * format
      
      * Fix dev python test (#2294)
      
      * add decode random
      
      * fix decode random actor
      
      * fix dev_python test scripts
      
      * fix batch_size test scripts
      
      * fix
      
      * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
      
      * MemBlockProto and ChunkProto
      
      * create mem block and chunk after improver
      
      * interface merge mem block and chunk between sub plans
      
      * merge chunk between jobs for memory reuse
      
      * use memory zone unique id to replace memory case hash
      
      * merge interface op mem block between jobs for mem shared
      
      * gen GlobalCriticalSection by mem block id and chunk id
      
      * check mem block and chunk valid before runtime
      
      * Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst
      
      * fix bug; and pass test
      
      * fix bug: init chunk_id_count in id_manager
      
      * reuse copyHd out mem between jobs
      
      * PushPlan and PullPlan for memblock and chunk
      
      * refine merge mem block / chunk in oneflow.cpp
      
      * at(i);
      
      * GetOpName2JobId2TaskProtos functional
      
      * using output ptr; pass test AlexNet and Resnet
      
      * Fix xla reshape op
      
      * Merge upstream of_xla (#2322)
      
      * Dev res50 new api (#2173)
      
      * check in script
      
      * runable
      
      * fix multinode
      
      * fix and real train
      
      * fix param data_format
      
      * fix truncated normal
      
      * quick fix multi node launch (#2193)
      
      * Dev reshape sbp (#2192)
      
      * reshape sbp
      
      * more check for reshape conf
      
      * fix error CHECK
      
      * refactor reshape
      
      * fix reshape like op
      
      * support naive case of s0
      
      * refine
      
      * rm redundant code
      
      * more generous check for equal element cnt
      
      * restore empty line
      
      * add GatherMs0Grad op (#2191)
      
      * support for gather with s(0) `in'
      
      * add gather_ms0_op
      
      * fix bugs in message GatherMs0OpConf and GatherMs0Kernel
      
      * only (B, S(0)) -> P supported for gather_ms0 op
      
      * add GatherMs0Grad op
      
      * minor fix
      
      * refine code
      
      * bugfix and update gather test case
      
      * add concat op and pass the test (#2067)
      
      * add concat op and pass the test
      
      * add vgg job_conf
      
      * model compared and confirmed to be the same as the old one
      
      * rm unnecessary file
      
      * Update array_ops.py
      
      * mv file
      
      * get rid of ternary operator (#2195)
      
      * Dev reshape util struct (#2194)
      
      * check in changes
      
      * rm file
      
      * minor fix
      
      * Merge network files of 2 cnns (#2196)
      
      * add inceptionV3
      
      * check in vgg16
      
      * add cnns test scripts for dev_python (#2170)
      
      * add cnns test scripts for dev_python
      
      * add alexnet test scripts
      
      * add resnet50
      
      * add inceptionv3
      
      * add resnet50
      
      * add vgg16
      
      * first version of run_cnns_test.py
      
      * remove old files
      
      * unsorted_segment_sum (#2198)
      
      * oneflow.unsorted_segment_sum (#2199)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * Dev batch unsorted segment sum (#2200)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * rename UnsortedSegmentSum to BatchUnsortedSegmentSum
      
      * rename: batch_unsorted_* => unsorted_batch_*
      
      * unsorted_segment_sum (#2201)
      
      * unsorted_segment_sum
      
      * fix job_completer/unsorted_segment_sum_grad.cpp
      
      * more check for unsorted_segment_sum batch_axis
      
      * remove FixParallelDesc (#2202)
      
      * rm KernelIfWithModel KernelIfWithActivation (#2203)
      
      * remove KernelIfWithActivation
      
      * remove KernelIfWithModel
      
      * rm blob header kLossInstanceNum (#2204)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * fix warning: return string reference to temporary (#2212)
      
      * docker build support (#2002)
      
      * update cmake files
      
      * check in files
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * shrink ctx size
      
      * fix script
      
      * fix wheel build
      
      * fix wheel build not adding .so (#2052)
      
      * lower cmake version bar
      
      * rm more files
      
      * keep build dir
      
      * check in test bash script
      
      * fix
      
      * Dev docker sx (#2124)
      
      * add python2 docker env
      
      * rm old docker files
      
      * update repository
      
      * add ARG CUDA and USE_PYTHON_3_OR_2
      
      * reform files
      
      * update
      
      * rm log doesn't print when there is cache
      
      * use default arg in dockerfile
      
      * better py 2 or 3 condition
      
      * add default
      
      * use if
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * add resnet50 in model (#2217)
      
      * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)
      
      * remove parallel policy
      
      * rm FC/rnn/embedding_look_up op/kernel
      
      * add check data parallel for conv/layer_norm op
      
      * bugfix: bias add + use math_add when batch size = 1
      
      * fix InferBatchAxis (#2220)
      
      * sync with bert_benchmark (#2221)
      
      * sync with bert_benchmark
      
      * rename run.sh
      
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
      * delete unnecessary multiply_grad class
      
      * speed up generate time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel to high performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
      * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
      * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
      
      * Fix random decode (#2252)
      
      * add decode random
      
      * fix decode random actor
      
      * Dev pr boxing v2 (#2248)
      
      * NcclDeviceCtx
      
      * include naive_actor
      
      * refine
      
      * use_boxing_v2
      
      * config.use_boxing_v2
      
      * SubTskGphBuilder
      
      * fix
      
      * hash<oneflow::MemoryCase>
      
      * Maybe<void>
      
      * ChainSubTskGphBuilder
      
      * SliceBoxingOp
      
      * return ok
      
      * SliceBoxingKernel
      
      * SliceBoxingActor
      
      * kSliceBoxing
      
      * nccl boxing op
      
      * nccl actor
      
      * REGISTER_OP
      
      * GetMsgFromCustomizedConf
      
      * NcclBoxingTaskNode
      
      * BldSubTskGphByBoxingV2
      
      * NcclBoxingSubTskGphBuilder
      
      * fix
      
      * fix
      
      * NcclKernel
      
      * ParallelContext
      
      * REGISTER_ACTOR
      
      * fix rank set
      
      * IsNcclTaskType
      
      * limit
      
      * 1024
      
      * multi thread reader
      
      * thread_num
      
      * IsKernelLaunchSynchronized
      
      * refine
      
      * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx
      
      * MakeHostMemCase
      
      * NcclBldSubTskGph
      
      * remove useless code
      
      * use_boxing_v2
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * cmake find python note when version less than 3.14 (#2286)
      
      * fix bug: reduce split kernel inplace (#2297)
      
      * Dev bias add (#2299)
      
      * use bias add
      
      * fix
      
      * bias_add
      
      * bias add half
      
      * fix
      
      * reinterpret_cast
      
      * fix half
      
      * HALF
      
      * fix
      
      * ADD_DEFAULT_KERNEL_CREATOR
      
      * fix
      
      * format
      
      * Fix dev python test (#2294)
      
      * add decode random
      
      * fix decode random actor
      
      * fix dev_python test scripts
      
      * fix batch_size test scripts
      
      * fix
      
      * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
      
      * MemBlockProto and ChunkProto
      
      * create mem block and chunk after improver
      
      * interface merge mem block and chunk between sub plans
      
      * merge chunk between jobs for memory reuse
      
      * use memory zone unique id to replace memory case hash
      
      * merge interface op mem block between jobs for mem shared
      
      * gen GlobalCriticalSection by mem block id and chunk id
      
      * check mem block and chunk valid before runtime
      
      * Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst
      
      * fix bug; and pass test
      
      * fix bug: init chunk_id_count in id_manager
      
      * reuse copyHd out mem between jobs
      
      * PushPlan and PullPlan for memblock and chunk
      
      * refine merge mem block / chunk in oneflow.cpp
      
      * at(i);
      
      * GetOpName2JobId2TaskProtos functional
      
      * using output ptr; pass test AlexNet and Resnet
      
      * Dev cuda 9 arch 70 (#2318)
      
      * kCudaAlignSize = 256
      
      * always compute_70
      
      * __CUDA_API_VERSION >= 10000
      
      * __CUDA_API_VERSION >= 10000
      
      * disable_all_reduce_sequence
      
      * Fix xla reshape op
      
      * Fix compilation without xla
      
      * Remove useless code and fix data type mismatch in field desc (#2326)
      
      * Remove useless code
      
      * Refine code style
      
      * Fix data type mismatch in field desc
      
      * Update README.md (#2335)
      
      * Refine code style (#2336)
      
      * Update XLA usage document (#2337)
      
      * Update XLA usage document
      
      * Fix mistakes
      
      * Add xla clang-format and format codestyle (#2340)
      
      * Revert "Add xla clang-format and format codestyle (#2340)" (#2341)
      
      This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.
      
      * Add xla clang-format and format codestyle (#2342)
      
      * Add xla clang-format and format codestyle
      
      * Fix header file missing
      
      * Of xla sx (#2334)
      
      * add gather grad op and pass testing
      
      * rm check
      
      * done batch gather grad
      
      * pass test
      
      * modify according to the review
      
      * add unsorted_segment_sum and refine unsorted_batch_segment_sum
      
      * reform according to review
      
      * reformat according to the clang-format and rm reference to the temp object
      
      * Pick step0 and step1 new commits (#2346)
      
      * Add xla clang-format and format codestyle
      
      * Fix header file missing
      
      * Modify codes to support XLA
      
      Conflicts:
      	oneflow/core/job/job_builder.cpp
      	oneflow/core/job/job_builder.h
      	oneflow/core/operator/op_conf.proto
      
      * Fix a bug for building subgraph although it won't lead to wrong results (#2347)
      
      * Fix setting is_mutable in xla launch op (#2349)
      
      * Change directory xla to xrt, apply patch if building with xla
      
      * Refactor
      
      * Add infer shape pass, and Refactor launch kernel, graph compiler
      
      * Refine code style, add xla executable and graph compiler
      
      * Rename platform.proto as types.proto
      
      * change OpCompiler to OpKernel, complete xla graph compiler
      
      * Fix compilation bugs and add allocator, now xla compilation is ok
      
      * Add xla executable runtime
      
      * Add executable run scope to support launch kernel on specific stream.
      
      * Fix infer shape pass, and revert cuda event pool
      
      * Refactor graph building with attaching argument metadata.
      
      * Set mutability if rebuilding job
      
      * Set device ordinal correctly
      
      * Refine DelOps
      
      * Refine Argument definition and abstract function as subgraph
      
      * Fix infer shape in xrt launch op and launch kernel.
      
      * Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.
      
      * Refine code style
      
      * Rename xla Operand as XlaValue.
      
      * Complete TensorRT compiler and builder, Refine OpKernel
      
      * Pick public code changes from the new tensorrt branch.
      
      * Fix tensorrt compilation
      
      * Fake implementation of trt executable
      
      * Support selecting engine in launch kernel, refine trt executable
      
      * Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfix.
      
      * Support train phase setting for registered op kernel
      
      * Remove RewriteOptimizer pass, update xla optimizer op.
      
      * Format job builder .h and .cpp files.
      
      * Remove RewriteOptimizer pass, update xla optimizer op.
      
      * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
      
      * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
      
      * Refine code style and comment.
      
      * Refine model update inference for launch op.
      
      * Refine
      
      * Refine code style and comment.
      
      * Refine model update inference for launch op.
      
      Conflicts:
      	oneflow/xrt/kernel/op_kernel.h
      	oneflow/xrt/node_util.cpp
      	oneflow/xrt/node_util.h
      	oneflow/xrt/passes/cluster.h
      	oneflow/xrt/passes/mark_cluster_id_pass.cpp
      	oneflow/xrt/passes/rebuild_job_pass.cpp
      	oneflow/xrt/types.h
      
      * Add xrt README.md
      
      * Add use_xla_jit and use_tensorrt options in job proto
      
      * Refine code style
      
      * Fix BlobDesc getter and xla LayerNorm op for FP16
      
      * Make use_xla_jit and use_tensorrt configurable from python config and env variables.
      
      * Update benchmark
      
      * Refine xrt README and rename compile_with_xrt.h file
      
      * Update README
      
      * Revert tensorrt
      
      * Fix absl missing if building with TensorRT but without XLA
      
      * Update xrt benchmark
      
      * Disable WITH_XLA by default
      
      * Update xrt benchmark
      
      * Format xrt as core
      
      * add activation op
      
      * add softmax op
      
      * Refine code style, remove unused code
      
      * Remove duplication of XLA usage
      
      * test pass
      
      * pooling test pass
      
      * add concat op, not tested
      
      * add activation ops, test not passed
      
      * Add xla gelu unittest
      
      * add  activation op, and test  passed
      
      * add pooling op, and test passed
      
      * Fix int64 env variable
      
      * Export float16 for python
      
      * Add xla relu unittest
      
      * try to solve conv bug
      
      * add elementwise add op, test passed
      
      * add concat op, test passed
      
      * Bugfix: transfer weights from gpu to host since tensorrt requires host weights.
      
      * add op unit tests
      
      * resolve conflicts and fix softmax bug
      
      * add identity op and topk op, to test
      
      * Add xla bias add and reshape unittests
      
      * Add xla identity unittest
      
      * Add xla cast and scalar op unittests
      
      * Add xla broadcast op and transpose unittests
      
      * Add xla add, sigmoid and tanh unittests
      
      * add reduce mean op, test passed
      
      * format ops, add CHECKs, and optimize function structure
      
      * Add xla gather and batch_gather unittests
      
      * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
      
      * add trt gather op and unit test
      
      * Add xla reduce_sum unittest, and support keep_dims for xla reduce
      
      * Add xla layer_norm unittest, and refine xla layer norm op
      
      * Add reshape_like unittest, and export reshape_like api
      
      * Refine xrt unittest code style
      
      * Export softmax_grad op, add softmax_grad unittest
      
      * Export tanh_grad op and add xla unittest
      
      * Export gelu_grad op, and add xla unittest
      
      * add conv unit test
      
      * reformat
      
      * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
      
      * Commit to merge upstream of_xrt
      
      * check files
      
      * modify files according to review advice.
      
      * Add xrt unittests (#2483)
      
      * Revert tensorrt
      
      * Fix absl missing if building with TensorRT but without XLA
      
      * Update xrt benchmark
      
      * Add xla gelu unittest
      
      * Fix int64 env variable
      
      * Export float16 for python
      
      * Add xla relu unittest
      
      * Add xla bias add and reshape unittests
      
      * Add xla identity unittest
      
      * Add xla cast and scalar op unittests
      
      * Add xla broadcast op and transpose unittests
      
      * Add xla add, sigmoid and tanh unittests
      
      * Add xla gather and batch_gather unittests
      
      * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
      
      * Add xla reduce_sum unittest, and support keep_dims for xla reduce
      
      * Add xla layer_norm unittest, and refine xla layer norm op
      
      * Add reshape_like unittest, and export reshape_like api
      
      * Refine xrt unittest code style
      
      * Export softmax_grad op, add softmax_grad unittest
      
      * Export tanh_grad op and add xla unittest
      
      * Export gelu_grad op, and add xla unittest
      
      * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
      
      * Commit to merge upstream of_xrt
      
      * Fix reduce_mean facade bug if keep_dims is true.
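
      For reference, a small numpy illustration of the keep_dims semantics exercised by this fix (numpy stands in for the OneFlow reduce op; this is not the fixed code itself):

      ```python
      import numpy as np

      x = np.ones((2, 3), dtype=np.float32)

      # Without keepdims the reduced axis disappears.
      print(np.mean(x, axis=1).shape)                 # (2,)

      # With keepdims the reduced axis is kept as size 1, which is the shape
      # a reduce_mean facade should report when keep_dims is true.
      print(np.mean(x, axis=1, keepdims=True).shape)  # (2, 1)
      ```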
      
      * Refine tensorrt unittests
      
      * Check failed if full reduce without keep dimension.
      
      * add pooling unit test
      
      * Add tensorrt bias_add and reshape op, and their unittests.
      
      * Support fp16 for tensorrt.
      
      * Add tensorrt transpose op and unittest.
      
      * add unit test conv_2d
      
      * add unit test concat
      
      * Fix concat if axis is -1.
      
      * Refine tensorrt conv2d unittest
      
      * Fix padding mode for conv2d and pooling, refine unittests.
      
      * Refine tensorrt concat unittest
      
      * Add convert api from string engine to XrtEngine.
      
      * Revert tensorrt, and merge of_xrt branch
      
      * Remove some comments.
      
      * Refine tensorrt unittests
      
      * Add XrtConfig to deal with xla and tensorrt configurations.
      
      Conflicts:
      	oneflow/xrt/api.cpp
      
      * Update tensorflow.cmake to avoid applying the patch repeatedly.
      
      * Remove XrtConfig Option, and fix xrt unittests
      
      * Add tensorrt batch norm (#2516)
      
      * Refine xrt signatrue hash, and fix python configuration (#2520)
      
      * Fix XrtCompilationEnabled returns (#2524)
      
      * Fix compilation after merge dev_python
      
      * Update xrt unittests
      
      * Revert protobuf version
      
      * Remove comment FOR_RANGE
      
      * Remove unused code
      
      * Reformat
      
      * Refine job builder
      
      * Disable dump job if not debug mode
      Co-authored-by: NSnow <snow3s@qq.com>
      Co-authored-by: NJuncheng <liujuncheng1022@gmail.com>
      8f3dcf94