    XRT: XLA + TensorRT (#2525) · 8f3dcf94
    Committed by Houjiang Chen
    * Enable multiple definition for xla compilation in oneflow
    
    * Realize running an executable
    
    * Abstract and gather the resources needed for compilation (such as client, builder, etc.) into CompilationResourceStore
    
    * Implement a separate xla allocator to avoid introducing too many tensorflow objects
    
    * Define CompilationContext separately
    
    * Running XLA in CPU mode is OK now
    
    * Make the result shape after running the executable a tuple, and refine comments
    
    * Add compilation cache to avoid recompiling every time
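
    (For illustration: a minimal sketch of what such a compilation cache could look like, assuming executables are looked up by a signature string built from the op graph and input shapes. The CompilationCache and Executable names below are placeholders for this sketch, not OneFlow's actual types.)

    ```cpp
    // Sketch of a compilation cache keyed by a signature string; a cache hit
    // returns the previously compiled executable instead of recompiling.
    #include <functional>
    #include <memory>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    struct Executable {};  // placeholder for a compiled XLA executable

    class CompilationCache {
     public:
      std::shared_ptr<Executable> GetOrCompile(
          const std::string& signature,
          const std::function<std::shared_ptr<Executable>()>& compile_fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(signature);
        if (it != cache_.end()) { return it->second; }  // hit: reuse executable
        auto executable = compile_fn();                 // miss: compile once
        cache_.emplace(signature, executable);
        return executable;
      }

     private:
      std::mutex mutex_;
      std::unordered_map<std::string, std::shared_ptr<Executable>> cache_;
    };
    ```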
    
    * Resolve InferSbpSignature in XlaLaunchOp
    
    * Resolve executing on a specified cuda stream
    
    * Refine XlaLaunch parallel conf, add batch matmul op
    
    * Refactor job rebuilding and fixup time shape
    
    * Update batch_dim_lbis field if XlaLaunch has any output which has batch dim
    
    * Resolve cluster rings after clustering, taking sbp policy and time shape into consideration
    
    * Add reshape op
    
    * Fix bugs
    
    * Rename CompilationContext to XlaLaunchContext, add XlaRuntimeScope to swap stream handle
    
    * Fix bugs
    
    * Update cmake to compile with xla optionally
    
    * Support more ops
    
    * Add more ops, and fix bugs
    
    * Implement XLA allocator and internal memory pool
    
    * Adaptively resize allocator memory size
    
    * Refine memory allocator
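
    (For illustration: the items above describe an XLA allocator backed by an internal memory pool whose size adapts to demand. Below is a rough sketch of such a pool, assuming it is re-reserved between executions based on the previous run's high-water mark; the real allocator's growth policy and fallback path may differ, and error handling is omitted.)

    ```cpp
    // Bump-pointer pool whose backing buffer grows between runs.
    #include <algorithm>
    #include <cstddef>
    #include <cuda_runtime.h>

    class AdaptivePool {
     public:
      // Call once before each execution; grows the buffer if the last run needed more.
      void PrepareForRun() {
        if (high_water_ > capacity_) {
          if (base_ != nullptr) { cudaFree(base_); }
          capacity_ = high_water_ * 3 / 2;  // grow geometrically to converge quickly
          cudaMalloc(&base_, capacity_);
        }
        offset_ = 0;
      }

      void* Allocate(size_t size) {
        size_t aligned = (size + 255) & ~size_t(255);  // 256-byte alignment
        high_water_ = std::max(high_water_, offset_ + aligned);
        if (offset_ + aligned > capacity_) { return nullptr; }  // caller falls back to cudaMalloc
        void* ptr = static_cast<char*>(base_) + offset_;
        offset_ += aligned;
        return ptr;
      }

     private:
      void* base_ = nullptr;
      size_t capacity_ = 0, offset_ = 0, high_water_ = 0;
    };
    ```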
    
    * Block host if running cpu executable
    
    * Fix bug for getting scalar value
    
    * Fix result layout bug, which caused wrong results for transpose
    
    * Refine gelu backward
    
    * Of xla sx (#1990)
    
    * add identity xla op
    
    * Add batch gather op
    
    * Refine batch gather
    
    * fix batch gather bug and add gather op, mv identity op to unary_op
    
    * Add softmax and gather/batch_gather
    
    * Add xla softmax_grad op
    
    * Add xla layer normalization op
    
    * Add xla layer norm backward op
    
    * Alias inputs and outputs to compute in-place
    
    * Reuse output buffers when running xla executable. It brings about a 10%
    speedup for bert on a single gpu by zero-copying results
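
    (For illustration: the zero-copy idea is to point the executable's result buffers directly at the launch kernel's output blobs instead of copying results after the run. LaunchOutput and RunOptions below are hypothetical names used only for this sketch, not OneFlow's API.)

    ```cpp
    // Build run options whose result buffers alias the output blobs' device memory.
    #include <cstddef>
    #include <vector>

    struct LaunchOutput {   // one output blob of the XlaLaunch kernel
      void* dptr;           // device memory owned by the blob
      size_t size_in_bytes;
    };

    struct RunOptions {
      std::vector<void*> result_buffers;  // where the executable writes its outputs
    };

    RunOptions MakeZeroCopyRunOptions(const std::vector<LaunchOutput>& outputs) {
      RunOptions options;
      options.result_buffers.reserve(outputs.size());
      for (const LaunchOutput& out : outputs) {
        options.result_buffers.push_back(out.dptr);  // reuse the blob buffer, no post-run copy
      }
      return options;
    }
    ```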
    
    * Refine xla allocator
    
    * Refine code style
    
    * Add xla reduce_sum op
    
    * Rewrite model update op to optimizer graph
    
    * Fix hang bugs
    
    * Fix input whose body is disabled in xla launch kernel
    
    * Fix self control in
    
    * Add fake consume op
    
    * Fix HasAttr bug for optional field
    
    * Refine AdamOptimizer
    
    * Fix xla AdamOptimizer bugs
    
    * Add metadata in HLO instruction, and refine
    
    * Fix bugs
    
    * add reduce sum and split normal model update (#2040)
    
    * remove append_func_to_list
    
    * Rm deprecated model update and save code (#1958)
    
    * remove code
    
    * mv random gen to kernel
    
    * mk seed required
    
    * address reviews
    
    * fix unused warning
    
    * address reviews
    
    * check in more deprecation
    
    * remove ModelSaveOpConf
    
    * move out ops and modify item (#1962)
    
    * ModelInit.__oneflow_input_remote_blobs__
    
    * fix cpu only query & add error info (#1964)
    
    * NumaAwareCudaMallocHost (#1959)
    
    * NumaAwareCudaMallocHost
    
    * add conf
    
    * modify check_point and add test check_point (#1963)
    
    * fix misuse of Scope/raii
    
    * op_name2variable_blob
    
    * add sigmoid test and tanh test (#1966)
    
    * add op matmul and matmul test (#1967)
    
    * rename oneflow.val to oneflow.input_blob_def
    
    * support auto var for convolution (#1972)
    
    * add op add and test add (#1973)
    
    * mv deprecated.pb_util to lib.core.pb_util
    
    * add op get_variable and get_variable test (#1975)
    
    * add op get_variable and get_variable test
    
    * modify shape extend
    
    * AllReduceSequencePass (#1976)
    
    * python2 compatibility for check_point
    
    * fix "return (blob_a, blob_b)" bug
    
    * rename: arg_passing => arg_pass
    
    * shared regst blob header between jobs (#1919)
    
    * half impl
    
    * register manager handle memory shared for separated memory
    
    * set separated memory shared id for shared regst between jobs
    
    * half impl of python for blob
    
    * fix BUG of pod ToProto() when proto has inited
    
    * fix BUG of infer dim0_inner_shape() in foreign_input_op
    
    * 1. PushJob copy from python can infer dim0_valid_num
    
    * add test for dynamic relu
    
    * refine test file
    
    * refine code
    
    * refine note
    
    * update test file for new interface
    
    * rename separated_header* (#1979)
    
    * some bug fixes for a train&eval job (#1978)
    
    * debugging alex net
    
    * check in test pull_multiple_blob.py
    
    * stricter check
    
    * fix bias in conv
    
    * fix various bugs
    
    * rm file
    
    * op_name in different jobs can be overloaded
    
    * fix compile bug in job_set_compile_ctx
    
    * rm cmake code for building oneflow binary
    
    * check in script (#1980)
    
    * check in script
    
    * rm unused import
    
    * CudaCurrentDeviceGuard (#1977)
    
    * fix val (#1981)
    
    * Merge job set and split fw bw (#1982)
    
    * add MemoryCopier and TensorSliceCopier (#1901)
    
    * add MemoryCopier and TensorSliceCopier
    
    * Index=>NdIndex
    
    * refine
    
    * refine
    
    * fix addition error checking (#1911)
    
    * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
    
    * update binary_func.h
    
    * update
    
    * update ndarray
    
    * update
    
    * update
    
    * update
    
    * update
    
    * refactor(data_type.h): better representation
    
    * fix(unary_func.h): fix typo
    
    * style(data_type.h): format
    
    * Merge dev_mixed_precision: Part-2 (#1907)
    
    * feat: add NewKernelUtil
    
    * fix typos
    
    * feat: add cublas_tensor_op_math_handle()
    
    * add gemm (#1860)
    
    * add gemm
    
    * save
    
    * add blobgemm
    
    * update
    
    * update
    
    * fix cu
    
    * update cpp
    
    * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
    
    * feat: update FullyConnectedKernel to use NewKernelUtil
    
    * Dev sx mixed precision (#1861)
    
    * add gemm
    
    * save
    
    * add blobgemm
    
    * update
    
    * update
    
    * fix cu
    
    * update cpp
    
    * save cpp
    
    * save
    
    * add relu and relu_backward
    
    * remove spared space
    
    * add explicit declaration
    
    * rename
    
    * feat: update ConvKernel to support half
    
    * add sigmoid and tanh (#1867)
    
    * add axpy (#1866)
    
    * style: formatting
    
    * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
    
    * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
    
    * refine(new_kernel_util.h)
    
    * refine(new_kernel_util.cu)
    
    * feat(new_kernel_util): add OFBatchedGemm()
    
    * feat: update MatMulKernel to support half
    
    * feat: update ConvData/Bias/FilterGradKernel to support half
    
    * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
    
    * feat: support loss scale
    
    * fix(operator): :bug:add InferHasBatchDim()
    
    * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
    
    * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
    
    * style(kernel/cast_kernel.cpp): formatting
    
    * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
    
    * style(cast_kernel.cpp): formatting
    
    * feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil
    
    * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
    
    * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
    
    * refactor(dropout_kernel): remove backward funcs
    
    * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
    
    * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
    
    * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix: fix little bugs
    
    * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
    
    * feat: support half for bias_add_kernel
    
    * fix(bias_add_op): remove data type check
    
    * feat(relu_kernel): support half
    
    * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
    
    * fix: typos
    
    * feat(pooling_kernel): support half
    
    * fix: remove CHECK_EQ of default data type
    
    * feat(pooling_grad_kernel): support half
    
    * feat: support half in ofrecord_encoder (TODO)
    
    * fix
    
    * feat: support half in sparse_cross_entropy_kernel
    
    * debug grad op (#1883)
    
    * Dev debug op mixed precision (#1884)
    
    * debug grad op
    
    * do nothing instead of UNIMPLEMENTED
    
    * fix(dropout_kernel): add tmp_split_fw_bw condition
    
    * build(half.cmake): https->http
    
    * fix(record_load_kernel): support total_batch_num
    
    * fix pooling (#1885)
    
    * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix: add GetCudnnScalingParameters() to fix scaling params
    
    * fix: add enable_true_half_config_when_conf() into config and update related code
    
    * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
    
    * refactor(matmul_kernel): remove Backward()
    
    * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
    
    * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
    
    * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
    
    * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
    
    * feat(job_conf.proto): add enable_auto_mixed_precision field
    
    * feat(auto_mixed_precision_lists): add amp_lists
    
    * feat(auto_mixed_precision): build the skeleton
    
    * feat(auto_mixed_precision): almost finish amp graph pass
    
    * feat(auto_mixed_precision.cpp): complete InsertCastOp()
    
    * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
    
    * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
    
    * refine(auto_mixed_precision.cpp): refine LOG
    
    * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
    
    * Dev half ndarray (#1886)
    
    * debug grad op
    
    * ZeroVal => GetZeroVal; OneVal => GetOneVal
    
    * MaxVal => GetMaxVal; MinVal => GetMinVal
    
    * check data type
    
    * DevDType
    
    * move function template to struct template for BinaryFunc* and UnaryFunc*
    
    * support half for reduce_sum_kernel
    
    * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
    
    * half for NdarrayUtil
    
    * OF_DEVICE_FUNC is always inline
    
    * half for NdarrayApplyUnary
    
    * simplify usage of NdarrayUtil
    
    * UnaryFuncExp
    
    * add VarNdarrayBuilder and ValNdarrayBuilder
    
    * simplify NdarrayUtil in layer_norm_param_grad_kernel
    
    * InplaceBroadcast
    
    * remove SoftmaxKernelUtil
    
    * half for softmax_kernel
    
    * fix improper use of __CUDA_ARCH__
    
    * disable sm_30,sm_52
    
    * refine(conv_kernel.cu): fix typo
    
    * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
    
    * fix: fix typos of GetOneVal
    
    * fix(auto_mixed_precision.cpp): allocate for shared_ptr
    
    * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
    
    * fix(auto_mixed_precision.cpp): fix typo
    
    * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
    
    * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
    
    * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
    
    * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
    
    * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
    
    * feat(auto_mixed_precision.cpp): more logs
    
    * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
    
    * fix(bias_add_op.cpp): fix bias_multiplier shape
    
    * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
    
    * feat: update MatmulKernel and new_kernel_util to support half
    
    * refactor(auto_mixed_precision): add ClearList and refine code
    
    * feat(tanh_*_kernel): support half
    
    * feat(add_kernel): support half
    
    * update binary_func.h
    
    * update
    
    * update ndarray
    
    * update
    
    * update
    
    * update
    
    * update
    
    * refactor(data_type.h): better representation
    
    * fix(unary_func.h): fix typo
    
    * style(data_type.h): format
    
    * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
    
    * style(CMakeLists.txt): fix typo
    
    * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
    
    * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
    
    * fix get one ptr (#1913)
    
    * fix(layer_norm): add LayerNormOp to grey_list and support the half
    
    * fix(layer_norm about): fix it to run when amp
    
    * fix: move fix sbp signature from OpNode to OpGraph
    
    * Dev new kernel util (#1925)
    
    * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
    
    * refactor(kernel/util): add BlasIf
    
    * refactor(kernel/util): add ArithemeticIf
    
    * refactor(kernel/util): add cuda_kernel_util.*
    
    * refactor: refactor NewKernelUtil
    
    * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
    
    * refactor(new_kernel_util.h): remove unused header files
    
    * refactor: refactor loop include
    
    * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
    
    * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
    
    * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
    
    * CHECK cuda version > 10.0 when using auto_mixed_precision
    
    * Fix bug of Snapshot deleting files unwantedly (#1937)
    
    * fix link BUG of release version (#1938)
    
    * delete redundant code in OpGraph JobCompleter and Operator (#1927)
    
    * 1. delete redundant code in OpGraph, JobCompleter and Operator; 2. fix bug of Snapshot deleting files unwantedly; 3. refine README
    
    * revert README change
    
    * split 2 pull request
    
    * Refactor Kernel Registry V2: The clear & easy Way (#1941)
    
    * refactor(resource.proto): move DeviceType to common/device_type.proto
    
    * feat(kernel_registration): add kernel_registration.h/cpp
    
    * feat(kernel_registration): update matmul_kernel to support new registration
    
    * feat: add CreateKernel for new registry
    
    * feat: update registry of cast conf
    
    * refactor(kernel_registration): remove KernelRegMap
    
    * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
    
    * grpc SetMaxMessageSize(INT_MAX) (#1950)
    
    * fix bug of Graph::ForEachConnectedComponent (#1952)
    
    * Grpc set max size (#1953)
    
    * grpc SetMaxMessageSize(INT_MAX)
    
    * set max msg len for ctrl service
    
    * code for test grpc max msg size
    
    * remove test code
    
    * NumaAwareCudaMallocHost (#1959)
    
    * NumaAwareCudaMallocHost
    
    * add conf
    
    * AllReduceSequencePass (#1976)
    
    * Merge job set and split fw bw (#1983)
    
    * delete tmp_split_fw_bw_train_conf (#1985)
    
    * delete tmp_split_fw_bw_train_conf
    
    * delete useless comments
    
    * fix refactor bug in layer_norm_op
    
    * minor fixes
    
    * update py script
    
    * remove code could be misleading
    
    * Fix all reduce mem sharing (#1986)
    
    * fix all reduce mem sharing
    
    * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
    
    * remove obsolete task_graph optimization
    
    * no arg_pass_job for variable_op
    
    * merge memory block id between jobs (#1910)
    
    * refine MemBlock and CriticalSection
    
    * job memory sharing strategy
    
    * revert diff in CriticalSectionDesc
    
    * Merge memory block between sub plans
    
    * Get mutual exclusion job groups
    
    * forgot to consider that memory merge only happens on the same machine
    
    * memory zone unique id
    
    * Merge Done;  merge memory block id from right to left; get memory block ids info
    
    * revert MemBlock
    
    * generate mutual exclusion job groups Done.
    
    * update for proto
    
    * add JobMemSharingStrategy in python interface
    
    * remove memorycase hash
    
    * move JobMemSharingStrategy to JobSetProto
    
    * using default strategy = parallel priority strategy
    
    * update interface of flow.job_mem_sharing_strategy
    
    * InterJobMemSharingUtil and PlanUtil
    
    * revert oneflow.h
    
    * fix bug
    
    * New implement of Merge memory block id between jobs
    
    * refine code
    
    * fix a fatal bug in std::hash<oneflow::Shape>
    
    * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
    
    * unlock critical sections as much as possible (#1994)
    
    * Bugfix actor case (#1995)
    
    * unlock critical sections as much as possible
    
    * consumed and produced regst of actor 'case' are customized
    
    * refine code
    
    * Bugfix actor case (#1996)
    
    * unlock critical sections as much as possible
    
    * consumed and produced regst of actor 'case' are customized
    
    * refine code
    
    * small regst_num for reentrant_lock (#1997)
    
    * fmt dev_job_set (#1999)
    
    * double buffer for tick_op
    
    * tick is cpu op
    
    * speedup compile time (#2000)
    
    * only merge mem_block_id between user job (#1993)
    
    * Fix keep header only (#2001)
    
    * speedup compile time
    
    * fix keep header only
    
    * remove shared model (#2003)
    
    * remove blob_mem_sharing (#2005)
    
    * No copyhd for output (#2006)
    
    * no cpu tick
    
    * no copyhd for output_op/switch_output_op
    
    * remove temp comments
    
    * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
    
    * remove clone_id (#2007)
    
    * layer norm auto var (#2004)
    
    * layer norm auto var
    
    * make of_format
    
    * bn sbp (#2008)
    
    * Refactor job completer (#1998)
    
    * fmt
    
    * refactor GenerateOpConf4Trainning
    
    * more refactor
    
    * refactor SetCtrlInOpName4VariableOp
    
    * use uniq ptr
    
    * refactor RewriteBoxingWithAllReduce
    
    * refactor MakeAllReduceSequence
    
    * refactor auto_mixed_precision
    
    * refactor DumpLogicalBlobDescAndSbpSignature
    
    * refactor group_boxing_by_dst_parallel
    
    * refactor add_keep_header_only_op_conf
    
    * refactor AutoSourceTick
    
    * refactor AddTickForTimeShape
    
    * refactor AutoSinkTick
    
    * refactor AddGlobalOutputCriticalSections
    
    * refactor SetOpTimeShape7BatchDimLbis
    
    * fix a bug in IsInterfaceTask (#2009)
    
    * Bugfix is interface task (#2010)
    
    * fix a bug in IsInterfaceTask
    
    * IsOutputInterfaceTask
    
    * copyhd-free output_op task_node
    
    * Dev job set config util (#2011)
    
    * add more if in JobConfigProtoBuilder
    
    * unlock critical sections as much as possible
    
    * consumed and produced regst of actor 'case' are customized
    
    * remove total batch num in config util
    
    * remove clone_id
    
    * assert has train_conf
    
    * rm debug info
    
    * Dev job set bert (#2013)
    
    * support bert
    
    * mv into bert
    
    * manual format
    
    * fix adam (#2015)
    
    * fix adam
    
    * div batch instance num before update model
    
    * remove outdate code in oneflow.cpp (#2017)
    
    * Dev split like (#2016)
    
    * no total_instance_num
    
    * add auto grad for concat
    
    * check in impl
    
    * check in bug fixes
    
    * fix bugs for split_like
    
    * split_like_op.cpp format
    
    * add normalization_autovar
    
    * Update op_conf.proto
    
    * address reviews
    
    * fix typo
    
    * constant ref
    
    * rm forward_loss_instance_num (#2018)
    
    * Bugfix job set multi device (#2019)
    
    * sbp for tick input bn
    
    * interface_blob_conf for output_op/switch_output_op
    
    * set sbp conf for tuple identity op
    
    * fix bugs when merge main plan
    
    * delete useless code
    
    * address review
    
    * fix error use of GenRepeatedBn()
    
    * ForEachConnectedComponent is easily misused
    
    * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
    
    * only for return output_op
    
    * refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
    
    * return op instead of output op acts as part of user job
    
    * enable_all_reduce_group
    
    * bugfix: init RuntimeBuffersScope before Runtime
    
    * demo python scripts for enable_all_reduce_group
    
    * remove wrong optimization code
    
    * constant_conf for enable_all_reduce_group.py test
    
    * fix interface op parallel conf
    
    * fix reduce concat kernel (#2020)
    
    * binary program oneflow_worker
    
    * user_job_completer
    
    * remove unused code loss_print
    
    * rm unused code loss_acc
    
    * remove unused accuracy_acc and accuracy_print
    
    * remove input_diff/output_diff/model_diff bns
    
    * remove unused bns in gdb util
    
    * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
    
    * support mpi using style
    
    * Bugfix put job conf into plan (#2023)
    
    * put job_conf into plan
    
    * use job_name to judge isPullJob/isPushJob
    
    * fix wrong job_id error
    
    * model_init is a push job; model_save is a pull job
    
    * make cmake more reasonable (#2024)
    
    * Restructure python module and minimum setup.py (#2026)
    
    * check in updated paths
    
    * check in minimum setup tool
    
    * Dev python init multi unit (#2022)
    
    * init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine
    
    * refine var name
    
    * refine code
    
    * compile user/main job only on master
    
    * bert multi machine test code
    
    * fix bugs
    
    * JobConfs
    
    * fix bugs under WITH_RDMA
    
    * fix multi-machine bugs
    
    * delete useless code
    
    * Add xla reduce_sum op
    
    * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
    
    * feat: init_worker can run without scp binary and without uuid (#2029)
    
    * half impl of without scp bin
    
    * feat: init_worker can run without scp binary and without uuid
    
    * check in fixes (#2030)
    
    * fixbug of delete worker (#2033)
    
    * Dev dot plan (#2035)
    
    * reuse plan to dot file
    
    * refine plan dot
    
    * Check in bug fix and multi node script (#2032)
    
    * check in fixes
    
    * check in script
    
    * fix boxing bug when setting conf with sbp
    
    * flag for iter
    
    * fixbug of delete worker
    
    * fix delete worker in script
    
    * address review, add exclusive or check
    
    * reuse plan to dot file
    
    * refine plan dot
    
    * fix and add flags
    
    * fmt
    
    * rm debug output
    
    * more flags
    
    * check Activation
    
    * fix fc bug when num axes > 2
    
    * reverse change
    
    * fix next_batch_num (#2036)
    
    * upgrade nccl to 2.4.8 (#2037)
    
    * fix shape of fc in_diff (#2038)
    
    * Rewrite model update op to optimizer graph
    
    * Update oneflow.cmake (#2041)
    
    * better looking merged_plan to dot v1 (#2039)
    
    * better looking and more information in merged_plan.dot
    
    * refine color
    
    * Fix tick in multi node parallel (#2042) (#2047)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * Dev train conf builder (#2046)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * check in impl
    
    * fix data dir (#2054)
    
    * fix data dir
    
    * rm model load path
    
    * AssignOp (#2058)
    
    * AssignOp
    
    * remove useless code
    
    * Python ops gather and unit test (#2053)
    
    * python_ops gather and unit test
    
    * format
    
    * minor mod
    
    * SnapshotOp (#2060)
    
    * magical add and fix bug (#2061)
    
    * check in impl
    
    * add todo
    
    * Dev jxf python pooling (#2056)
    
    * run max_pool_2d without bug
    
    * correct max_pool_2d
    
    * correct average_pool_2d
    
    * minor refine
    
    * final version
    
    * rename to nn.py
    
    * add name arg to pool1d ops
    
    * refine by review
    
    * rename to _GetSequence and move it to the end of file (#2063)
    
    * fix BindInterfaceMemBlockId (#2065)
    
    * mark py file generated (#2066)
    
    * Dev gracious exit (#2057)
    
    * add more checks
    
    * make language more consistent
    
    * better error info for worker init
    
    * better error
    
    * Update setup.py (#2068)
    
    * Refine Infer APIs by returning Maybe<void> type (#2051)
    
    * Refine Infer APIs by returning Maybe<void> type
    
    * Fix return type
    
    * Fix code style
    
    * Replace CHECK macros in the implementation of infer APIs
    
    * Revert IsOk
    
    * fix bug for split like op (#2070)
    
    * fix snapshot path (#2071)
    
    * Dev job set fix infer apis (#2072)
    
    * Refine Infer APIs by returning Maybe<void> type
    
    * Fix return type
    
    * Fix code style
    
    * Replace CHECK macros in the implementation of infer APIs
    
    * Revert IsOk
    
    * update
    
    * add AutoGlobalStep (#2073)
    
    * rm default_initializer_conf in train conf (#2075)
    
    * Fix sigmoid op (#2076)
    
    * fix sigmoid op bug
    
    * fix bug for split like op
    
    * add sigmoid grad op
    
    * Fix bn (#2077)
    
    * fix bn
    
    * return Maybe<void> OK in lambda
    
    * fix typo
    
    * fix SigmoidGradOp (#2078)
    
    * Dev python merge job set (#2081)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * fix gcc warning in release (#2080)
    
    * fix gcc version in release
    
    * fix empty line
    
    * Fix adam mv initilizer (#2082)
    
    * zero constant initializer for adam m and v
    
    * make of_format
    
    * init adam m v beta1_t and beta2_t
    
    * use value instead of initializer
    
    * const float& -> const float
    
    * update
    
    * LearningRateScheduleOp (#2079)
    
    * matmul (#2084)
    
    * matmul
    
    * np.allclose
    
    * Fix hang bugs
    
    * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
    
    * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
    
    * refine code for read
    
    * check py if and test
    
    * prelu (#2086)
    
    * prelu
    
    * fix
    
    * fix
    
    * template for either ptr cast (#2088)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * add template for cast
    
    * rename
    
    * Dev build and infer ctx (#2089)
    
    * add job_build_and_infer_ctx interface
    
    * lbn_with_split_hint
    
    * fix maybe macro
    
    * fix signature of Maybe<T>::Error()
    
    * job_build_and_infer_if
    
    * add c_api_util wrapper for job_build_and_infer_ctx
    
    * implement python/job_build_and_infer interface
    
    * CurJobBuildAndInferCtx_AddPlacementGroup
    
    * BuildJobAndInferCtx and Mgr c++ implementation (#2074)
    
    * job_build_and_infer_ctx_mgr
    
    * refine interface of infer_ctx_mgr
    
    * JobBuildInferCtx set job conf; add and refine error type
    
    * revert job.proto
    
    * half impl of add op in build_infer_ctx
    
    * generate op produced empty logical blob desc; infer out blob desc interface
    
    * job_build_and_infer_ctx VERSION 1
    
    * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
    
    * maybe return
    
    * job_set hold by job_build_and_infer_ctx_mgr
    
    * check placement when infer ctx mgr leave cur job
    
    * Global New/Delete JobBuildAndInferCtxMgr
    
    * add JUST when ctx add op
    
    * remove unused job_conf.arg_op_name
    
    * fix bugs caused by python new api
    
    * fix bugs caused by lack of Global<JobDesc>
    
    * fix bugs caused by new api
    
    * refactor compiler.Compile
    
    * merge dev_python
    
    * remove unused message proto
    
    * rename api
    
    * Fix input whose body is disabled in xla launch kernel
    
    * add RemoteBlob.shape and RemoteBlob.dtype
    
    * Fix data type set default variable (#2092)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * fix default data type
    
    * Add conf axis for bias_add for any axis channel (#2093)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * bias_add completion
    
    * follow comment
    
    * make conf axis required
    
    * Dev jxf python initializer (#2090)
    
    * oneflow initializer
    
    * update
    
    * Fix self control in
    
    * Bugfix python alexnet (#2096)
    
    * bugfix_python_alexnet
    
    * fix
    
    * Add fake consume op
    
    * Dev global step (#2100)
    
    * assign op
    
    
    AddGlobalStepOpConf
    
    
    fix
    
    
    ARITHMETIC_DATA_TYPE_SEQ
    
    
    identity_op_conf
    
    
    add ops
    
    
    GenNewSnapshotName
    
    
    SnapshotOp
    
    
    cleanup
    
    
    blob name
    
    
    LearningRateScheduleOp
    
    
    LearningRateScheduleKernel
    
    
    LearningRateScheduleKernel
    
    
    AddLearningRateScheduleOpConf
    
    
    learning rate
    
    
    cleanup
    
    
    fix
    
    
    fix
    
    * remove total_mbn_num
    
    * date time format
    
    * save
    
    * refine
    
    * refine
    
    * revert
    
    * refine snapshot
    
    * fix
    
    * refine
    
    * AutoGlobalStep
    
    * refine
    
    * GenLogicalBlobName
    
    * AutoLearningRate
    
    * remove JobDesc lr
    
    * fix snapshot path
    
    * Maybe<void>
    
    * learning_rate blob
    
    * remove next_model_vid
    
    
    fix
    
    
    fix 
    
    
    fix
    
    
    learning_rate
    
    * train_conf
    
    * fix for global step on multi nodes
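
    (For illustration: the commits above add AutoGlobalStep, LearningRateScheduleOp/Kernel and a learning_rate blob. A rough sketch of how a schedule kernel might derive the current learning rate from the train step is shown below, assuming linear warmup followed by exponential decay; the schedules and conf fields actually supported by OneFlow differ.)

    ```cpp
    // Compute the learning rate for a given train step (warmup + exponential decay).
    #include <cmath>
    #include <cstdint>

    float ScheduledLearningRate(int64_t train_step, float base_lr, int64_t warmup_steps,
                                float decay_rate, int64_t decay_steps) {
      if (train_step < warmup_steps) {
        // linear warmup from ~0 to base_lr
        return base_lr * static_cast<float>(train_step + 1) / warmup_steps;
      }
      // exponential decay after warmup
      int64_t steps_after_warmup = train_step - warmup_steps;
      return base_lr * std::pow(decay_rate, static_cast<float>(steps_after_warmup) / decay_steps);
    }
    ```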
    
    * Fix optimizer initializer (#2095)
    
    * fix optimizer initializer
    
    * rename lars data temp bn
    
    * fix job_type (#2102)
    
    * Dev alexnet new api (#2094)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * check in softmax loss
    
    * nn.conv2d and nn.bias_add
    
    * fix opname
    
    * fix merge conflict
    
    * fix name
    
    * dense (#2097)
    
    * Fix jxf dense v2 (#2098)
    
    * dense
    
    * minor fix
    
    * alexnet
    
    * fix conf
    
    * quick fix
    
    * transpose
    
    * fix layers
    
    * add transpose
    
    * fix fc
    
    * fix
    
    * fix
    
    * fix data load
    
    * params check and format
    
    * rm activation in op conf
    
    * save workaround
    
    * fix avg pool 2d
    
    * fix max pool 2d
    
    * remove fc3 relu
    
    * alexnet eval
    
    * minor
    
    * replace has_batch_dim with batch_axis (#2104)
    
    * replace has_batch_dim with batch_axis
    
    * refactor OrderValue4HasBatchAxis
    
    * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
    
    * no CHECK in MatmulOp::InferBatchAxis
    
    * infer op by op_conf and  parallel_conf
    
    * wrapper Error for ErrorProto
    
    * replace ErrorUtil with Error
    
    * add OF_CHECK (#2110)
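
    (For illustration: a simplified sketch of the idea behind a CHECK-style macro that reports failure through a Maybe-like return value instead of aborting, so infer APIs can propagate errors to the caller. OneFlow's real OF_CHECK/Maybe machinery carries error protos, location info and JUST propagation; the types below are stand-ins.)

    ```cpp
    #include <memory>
    #include <sstream>
    #include <string>

    // Simplified stand-in for Maybe<void>: either OK or an error message.
    class MaybeVoid {
     public:
      static MaybeVoid Ok() { return MaybeVoid(nullptr); }
      static MaybeVoid Error(std::string msg) {
        return MaybeVoid(std::make_shared<std::string>(std::move(msg)));
      }
      bool IsOk() const { return error_ == nullptr; }
      const std::string& error() const { return *error_; }

     private:
      explicit MaybeVoid(std::shared_ptr<std::string> e) : error_(std::move(e)) {}
      std::shared_ptr<std::string> error_;
    };

    // CHECK that returns an error instead of aborting the process.
    #define SKETCH_CHECK(expr)                                               \
      if (!(expr)) {                                                         \
        std::ostringstream oss;                                              \
        oss << __FILE__ << ":" << __LINE__ << " check failed: " << #expr;    \
        return MaybeVoid::Error(oss.str());                                  \
      }

    MaybeVoid InferSomething(int num_axes) {
      SKETCH_CHECK(num_axes > 0);  // on failure the caller receives the error
      return MaybeVoid::Ok();
    }
    ```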
    
    * optional split_axis (#2113)
    
    * Fix HasAttr bug for optional field
    
    * undefined (#2116)
    
    * merge reduce xxx (#2119)
    
    * Update GetSbpSig() with Maybe (#2118)
    
    * fix several ops
    
    * modify all ops
    
    * format
    
    * update complete
    
    * Refine AdamOptimizer
    
    * fix (#2120)
    
    * Fix xla AdamOptimizer bugs
    
    * support scalar for reduce_xxx axis args (#2122)
    
    * Dev opt split axis (#2121)
    
    * optional split_axis
    
    * backup
    
    * VariableConf::(OptInt64 split_axis)
    
    * backup
    
    * fix autovar split_axis (#2125)
    
    * Dev model init op (#2117)
    
    * assign op
    
    
    AddGlobalStepOpConf
    
    
    fix
    
    
    ARITHMETIC_DATA_TYPE_SEQ
    
    
    identity_op_conf
    
    
    add ops
    
    
    GenNewSnapshotName
    
    
    SnapshotOp
    
    
    cleanup
    
    
    blob name
    
    
    LearningRateScheduleOp
    
    
    LearningRateScheduleKernel
    
    
    LearningRateScheduleKernel
    
    
    AddLearningRateScheduleOpConf
    
    
    learning rate
    
    
    cleanup
    
    
    fix
    
    
    fix
    
    * remove total_mbn_num
    
    * date time format
    
    * save
    
    * refine
    
    * refine
    
    * revert
    
    * refine snapshot
    
    * fix
    
    * refine
    
    * AutoGlobalStep
    
    * refine
    
    * GenLogicalBlobName
    
    * AutoLearningRate
    
    * remove JobDesc lr
    
    * fix snapshot path
    
    * Maybe<void>
    
    * learning_rate blob
    
    * remove next_model_vid
    
    
    fix
    
    
    fix 
    
    
    fix
    
    
    learning_rate
    
    * train_conf
    
    * fix for global step on multi nodes
    
    * SnapshotReader
    
    
    snapshot writer
    
    
    model init op
    
    
    fix
    
    
    refine
    
    
    init
    
    
    InitializeFromSnapshotConf
    
    
    model io job
    
    
    ModelLoadOp
    
    
    ModelLoadKernel
    
    
    MakeModelLoadJob
    
    
    ModelSaveOp
    
    
    fix
    
    
    InterUserJobInfo
    
    
    _MakeModelLoadJobFunc
    
    
    MutModelLoadOpConTickInputHelper
    
    
    fix
    
    
    refine
    
    
    init/load/save
    
    
    set_default_variable
    
    * remove SnapshotMgr
    
    * snapshot.h
    
    * delete model_init_job.cpp
    
    
    foreign_input_op_conf
    
    
    fix
    
    
    snapshot path
    
    
    set path
    
    
    op_conf
    
    
    fix
    
    
    fix CopyFromNdarray
    
    
    to bytes c
    
    
    use uint8
    
    
    char2uint8
    
    * model init
    
    * model io
    
    * fix
    
    * ModelSaveKernel
    
    * mutable_batch_axis()->Clear()
    
    * InferBatchAxis
    
    * fix
    
    * refine
    
    * job set
    
    * MakeModelIoJobs
    
    * fix
    
    * jobs
    
    * fix
    
    * model io job
    
    * GenOutputOpConf
    
    * refine snapshot
    
    * refine
    
    * fix
    
    * refine CheckPoint
    
    * remove session
    
    * refine
    
    * refine
    
    * refine
    
    * remove keyword.h/cpp
    
    * refine
    
    * global_step=>train_step
    
    * GetSbpSignatures
    
    * ModelInitOp
    
    * fix (#2127)
    
    * rm stale alexnet script (#2129)
    
    * Dev plain maybe (#2126)
    
    * optional split_axis
    
    * backup
    
    * VariableConf::(OptInt64 split_axis)
    
    * backup
    
    * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
    
    * SharedOrPlain
    
    * const std::shared_ptr<T>& => std::shared_ptr<T>
    
    * Dev simple checkpoint manager (#2128)
    
    * SimpleCheckPointManager
    
    * makedirs
    
    * fix path
    
    * save
    
    * refine
    
    * refine
    
    * fix path to numpy (#2130)
    
    * Dev plain maybe (#2132)
    
    * optional split_axis
    
    * backup
    
    * VariableConf::(OptInt64 split_axis)
    
    * backup
    
    * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
    
    * SharedOrPlain
    
    * const std::shared_ptr<T>& => std::shared_ptr<T>
    
    * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
    
    * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
    
    * Dev jxf merge general ops (#2131)
    
    * merge some general ops to dev_python
    
    * dense demo
    
    * rm print in test
    
    * new line at the end of file
    
    * format
    
    * fix check point
    
    * update alexnet
    
    * broadcast_xxx (#2134)
    
    * broadcast_xxx
    
    * typo
    
    * typo
    
    * rm job_conf.num_of_batches_in_snapshot
    
    * fix args (#2136)
    
    * fix proto if (#2138)
    
    * pass name to inner function (#2139)
    
    * check dropout if (#2140)
    
    * check dropout if
    
    * fix typo
    
    * Dev merge math ops (#2143)
    
    * merge math ops
    
    * new line at the end of file
    
    * merge layer norm (#2144)
    
    * variable_scope (#2141)
    
    * variable_scope
    
    * revert format
    
    * add check
    
    * Merge dropout if (#2145)
    
    * check dropout if
    
    * fix typo
    
    * fix typo
    
    * slice (#2142)
    
    * slice
    
    * add check and docstring
    
    * minor
    
    * minor
    
    * add const (#2146)
    
    * add const
    
    * fix indentation
    
    * address review
    
    * fmt
    
    * rm redundant
    
    * Update array_ops.py
    
    * Update array_ops.py
    
    * Update array_ops.py
    
    * add more activations to math_ops (#2147)
    
    * fix bug (#2149)
    
    * truncated normal for bert (#2150)
    
    * Update bert for dev python (#2151)
    
    * truncated normal for bert
    
    * bert support
    
    * math.dropout to nn.dropout (#2153)
    
    * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
    
    * allow export multiple interfaces in oneflow_export decorator (#2154)
    
    * refactor job_build_and_infer_if.h
    
    * update oneflow_internal.h to use Maybe (#2135)
    
    * Fix python internal (#2133)
    
    * Return error message in oneflow_internal
    
    * Refine environment_objects_scope
    
    * add OF_ERROR_STR_CHECK and OFStrCat()
    
    * format
    
    * fix based on review
    
    * fix(oneflow_internal.h): add undef
    
    * fix: expr -> (expr)
    
    * feat: update oneflow_internal_helper to use func
    
    *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
    
    *  Transfer data_part_num to DecodeOp and RecordLoadOp
    
    * Fix python scripts
    
    * Dev nc of internal (#2155)
    
    * Fix python internal (#2133)
    
    * Return error message in oneflow_internal
    
    * Refine environment_objects_scope
    
    * add OF_ERROR_STR_CHECK and OFStrCat()
    
    * format
    
    * fix based on review
    
    * fix(oneflow_internal.h): add undef
    
    * fix: expr -> (expr)
    
    * feat: update oneflow_internal_helper to use func
    
    * fix: fix ctor bug
    
    * fix config_proto
    
    * rename c_api_util.Init => c_api_util.InitEnvironment
    
    * refactor compile_context.cur_job => compile_context.cur_job_conf
    
    * remove FixPackedBlobDescOfProducedRegst (#2156)
    
    * Fix snapshot root path empty log (#2158)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * Fix snapshot root path empty log
    
    * fix channel last (#2157)
    
    * fix channel last
    
    * minor
    
    * merge pb_message
    
    * add cudnn conv force algo (#2159)
    
    * Update bert for dev python (#2160)
    
    * remove old bert
    
    * set data_part_num in decoder
    
    * support model load/save args
    
    * Dev flow function (#2152)
    
    * add of.function, refactor init, refine session, and refine runtime
    
    * rm useless code
    
    * rename
    
    * update
    
    * add test
    
    * @oneflow_export JobConfigProto and Trainconf (#2162)
    
    * @oneflow_export JobConfigProto and Trainconf
    
    * remove unused config in config_util.py
    
    * remove oneflow.get_cur_job_conf_builder
    
    * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161)
    
    * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
    
    * fix config.train.model_update_conf
    
    * _GetJobConfAttr
    
    * update alexnet (#2166)
    
    * Update alexnet (#2167)
    
    * update alexnet
    
    * update for bert
    
    * 15->16
    
    * more reasonable conf
    
    * get variable in py layer norm
    
    * replace val in pb msg;  decode lbn string with split hint (#2165)
    
    * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
    
    * Add metadata in HLO instruction, and refine
    
    * python model parallel (#2103)
    
    * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op
    
    * merge placement group
    
    * refine code in AddAndInferOp
    
    * auto merge placement group when add op; remove mergeplacementgroup interface
    
    * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx
    
    * python blob add interface for model parallel
    
    * refine code of python blob split
    
    * remove interface of has/get_split_axis in python blob
    
    * remove interface of has_batch_dim in python blob
    
    * add check that blob split_axis can be divided by parallel num
    
    * refine code for maybe get/infer sbp
    
    * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc.
    
    * fix for plain pointer Maybe
    
    * fix bug: add repeated placement group, remove add placement interface in hand
    
    * fix bug: python/blob_desc, temp impl of not deepcopy; feat: dense layer supports model parallel
    
    * dev_python model parallel runnable and check correct
    
    * remove add placement group when placement scope exits
    
    * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
    
    * bugfix: bias_add backward inferred sbp wrong; model parallel bias add debug done
    
    * refine python blob_desc.split implement
    
    * refine interface decode lbn to split hint
    
    * refine auto add placement group
    
    * refine lbn with split hint decode
    
    * refine code for review
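
    (For illustration: a tiny sketch of the divisibility check mentioned above, i.e. a blob may only be split along split_axis if that dimension divides evenly across the parallel devices. The Shape alias and function name are simplified for this sketch.)

    ```cpp
    #include <stdexcept>
    #include <vector>

    using Shape = std::vector<long long>;  // per-axis dimensions of a blob

    void CheckSplitAxis(const Shape& shape, int split_axis, int parallel_num) {
      if (split_axis < 0 || split_axis >= static_cast<int>(shape.size())) {
        throw std::invalid_argument("split_axis out of range");
      }
      if (shape[split_axis] % parallel_num != 0) {
        throw std::invalid_argument("dim at split_axis is not divisible by parallel_num");
      }
    }
    ```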
    
    * remove AutoVar related code (#2168)
    
    * feat: remove all autovar
    
    * fix and format
    
    * fix: fix op::InferBlobDesc
    
    * add prototype (#2172)
    
    * add prototype
    
    * infer blob desc with sbp_signature
    
    * `str_a is not str_b' is buggy, use `str_a != str_b' instead
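    
    For context on why the commit above matters: `is'/`is not' compares object identity, not value, and CPython only sometimes reuses string objects (constant folding, interning), so identity checks on strings pass or fail unpredictably. A minimal sketch, assuming plain CPython:
    
        # Equal string values do not imply the same string object.
        a = "hello world"
        b = "".join(["hello", " ", "world"])  # built at runtime: a distinct object
    
        print(a == b)       # True  -> value equality, always reliable
        print(a is b)       # usually False, even though the values are equal
    
        # So `str_a is not str_b' can be True for equal strings; compare with != instead.
        if a != b:
            print("different values")
        else:
            print("same value")   # this branch runs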
    
    * Update snapshot.cpp (#2174)
    
    * remove useless lines (#2176)
    
    * Fix bert multi nodes (#2177)
    
    * remove useless lines
    
    * fix bert and init_cluster_env for multi nodes
    
    * CHECK_JUST for InferBlobDescsIf (#2178)
    
    * Fix bert multi nodes (#2180)
    
    * remove useless lines
    
    * fix bert and init_cluster_env for multi nodes
    
    * config_proto -> default_config_proto
    
    * delete worker
    
    * update alexnet
    
    * remove unused op (#2182)
    
    * remove parallel_ctx when kernel init (#2185)
    
    * InferOpSbpSignature in op_graph and infer_ctx (#2175)
    
    * InferOpSbpSignature in op_graph and infer_ctx
    
    * bugfix: lambda lifetime; gen job build error add location info
    
    * refine error generation and return
    
    * refine check lbi valid and exists
    
    * remove parallel num in decode_of_record op/kernel (#2186)
    
    * Fix bugs
    
    * delete GlobalJobDesc() in operator/ (#2188)
    
    * rm unused test file
    
    * Refine
    
    * Add assign ops behind adam optimizer to update model and momentum etc.
    
    * Add assign ops behind adam optimizer to update model and momentum etc.
    
    * Remove fake consume op
    
    * Support enable/disable XLA by set env
    
    * Merge callback, limit max operator count for each XLA subgraph
    
    * CudaEventPool
    
    * fix vector
    
    * refine
    
    * Support in-place update for optimizer
    
    * Add alias input and output to prevent reusing input with other temp buffers
    
    * Refine code style
    
    * Remove unused code
    
    * Of xla (#2237)
    
    * mv deprecated.pb_util to lib.core.pb_util
    
    * add op get_variable and get_variable test (#1975)
    
    * add op get_variable and get_variable test
    
    * modify shape extend
    
    * AllReduceSequencePass (#1976)
    
    * python2 compatibility for check_point
    
    * fix "return (blob_a, blob_b)" bug
    
    * rename: arg_passing => arg_pass
    
    * shared regst blob header between jobs (#1919)
    
    * half impl
    
    * register manager handle memory shared for separated memory
    
    * set separated memory shared id for shared regst between jobs
    
    * half impl of python for blob
    
    * fix BUG of pod ToProto() when proto has inited
    
    * fix BUG of infer dim0_inner_shape() in foreign_input_op
    
    * 1. PushJob copy from python can infer dim0_valid_num
    
    * add test for dynamic relu
    
    * refine test file
    
    * refine code
    
    * refine note
    
    * update test file for new interface
    
    * rename separated_header* (#1979)
    
    * some bugs fixes for a train&eval job (#1978)
    
    * debugging alex net
    
    * check in test pull_multiple_blob.py
    
    * stricter check
    
    * fix bias in conv
    
    * fix various bugs
    
    * rm file
    
    * op_name in different jobs can be overloaded
    
    * fix compile bug in job_set_compile_ctx
    
    * rm cmake code for building oneflow binary
    
    * check in script (#1980)
    
    * check in script
    
    * rm used import
    
    * CudaCurrentDeviceGuard (#1977)
    
    * fix val (#1981)
    
    * Merge job set and split fw bw (#1982)
    
    * add MemoryCopier and TensorSliceCopier (#1901)
    
    * add MemoryCopier and TensorSliceCopier
    
    * Index=>NdIndex
    
    * refine
    
    * refine
    
    * fix addition error checking (#1911)
    
    * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
    
    * update binary_func.h
    
    * update
    
    * update ndarray
    
    * update
    
    * update
    
    * update
    
    * update
    
    * refactor(data_type.h): better representation
    
    * fix(unary_func.h): fix typo
    
    * style(data_type.h): format
    
    * Merge dev_mixed_precision: Part-2 (#1907)
    
    * feat: add NewKernelUtil
    
    * fix typos
    
    * feat: add cublas_tensor_op_math_handle()
    
    * add gemm (#1860)
    
    * add gemm
    
    * save
    
    * add blobgemm
    
    * update
    
    * update
    
    * fix cu
    
    * update cpp
    
    * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
    
    * feat: update FullyConnectedKernel to use NewKernelUtil
    
    * Dev sx mixed precision (#1861)
    
    * add gemm
    
    * save
    
    * add blobgemm
    
    * update
    
    * update
    
    * fix cu
    
    * update cpp
    
    * save cpp
    
    * save
    
    * add relu and relu_backward
    
    * remove spared space
    
    * add explicit declaration
    
    * rename
    
    * feat: update ConvKernel to support half
    
    * add sigmoid and tanh (#1867)
    
    * add axpy (#1866)
    
    * style: formatting
    
    * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
    
    * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
    
    * refine(new_kernel_util.h)
    
    * refine(new_kernel_util.cu)
    
    * feat(new_kernel_util): add OFBatchedGemm()
    
    * feat: update MatMulKernel to support half
    
    * feat: update ConvData/Bias/FilterGradKernel to support half
    
    * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
    
    * feat: support loss scale
    
    * fix(operator): :bug:add InferHasBatchDim()
    
    * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
    
    * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
    
    * style(kernel/cast_kernel.cpp): formatting
    
    * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
    
    * style(cast_kernel.cpp): formatting
    
    * feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil
    
    * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
    
    * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
    
    * refactor(dropout_kernel): remove backward funcs
    
    * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
    
    * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
    
    * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix: fix little bugs
    
    * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
    
    * feat: support half for bias_add_kernel
    
    * fix(bias_add_op): remove data type check
    
    * feat(relu_kernel): support half
    
    * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
    
    * fix: typos
    
    * feat(pooling_kernel): support half
    
    * fix: remove CHECK_EQ of default data type
    
    * feat(pooling_grad_kernel): support half
    
    * feat: support half in ofrecord_encoder (TODO)
    
    * fix
    
    * feat: support half in sparse_cross_entropy_kernel
    
    * debug grad op (#1883)
    
    * Dev debug op mixed precision (#1884)
    
    * debug grad op
    
    * do nothing instead of UNIMPLEMENTED
    
    * fix(dropout_kernel): add tmp_split_fw_bw condition
    
    * build(half.cmake): https->http
    
    * fix(record_load_kernel): support total_batch_num
    
    * fix pooling (#1885)
    
    * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix: add GetCudnnScalingParameters() to fix scaling params
    
    * fix: add enable_true_half_config_when_conf() into config and update related code
    
    * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
    
    * refactor(matmul_kernel): remove Backward()
    
    * feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx()
    
    * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
    
    * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
    
    * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
    
    * feat(job_conf.proto): add enable_auto_mixed_precision field
    
    * feat(auto_mixed_precision_lists): add amp_lists
    
    * feat(auto_mixed_precision): build the skeleton
    
    * feat(auto_mixed_precision): almost finish amp graph pass
    
    * feat(auto_mixed_precision.cpp): complete InsertCastOp()
    
    * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
    
    * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
    
    * refine(auto_mixed_precision.cpp): refine LOG
    
    * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
    
    * Dev half ndarray (#1886)
    
    * debug grad op
    
    * ZeroVal => GetZeroVal; OneVal => GetOneVal
    
    * MaxVal => GetMaxVal; MinVal => GetMinVal
    
    * check data type
    
    * DevDType
    
    * move function template to struct template for BinaryFunc* and UnaryFunc*
    
    * support half for reduce_sum_kernel
    
    * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
    
    * half for NdarrayUtil
    
    * OF_DEVICE_FUNC is always inline
    
    * half for NdarrayApplyUnary
    
    * simplify usage of NdarrayUtil
    
    * UnaryFuncExp
    
    * add VarNdarrayBuilder and ValNdarrayBuilder
    
    * simplify NdarrayUtil in layer_norm_param_grad_kernel
    
    * InplaceBroadcast
    
    * remove SoftmaxKernelUtil
    
    * half for softmax_kernel
    
    * fix improper use of __CUDA_ARCH__
    
    * disable sm_30,sm_52
    
    * refine(conv_kernel.cu): fix typo
    
    * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
    
    * fix: fix typos of GetOneVal
    
    * fix(auto_mixed_precision.cpp): allocate for shared_ptr
    
    * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
    
    * fix(auto_mixed_precision.cpp): fix typo
    
    * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
    
    * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
    
    * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
    
    * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
    
    * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
    
    * feat(auto_mixed_precision.cpp): more logs
    
    * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
    
    * fix(bias_add_op.cpp): fix bias_multiplier shape
    
    * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
    
    * feat: update MatmulKernel and new_kernel_util to support half
    
    * refactor(auto_mixed_precision): add ClearList and refine code
    
    * feat(tanh_*_kernel): support half
    
    * feat(add_kernel): support half
    
    * update binary_func.h
    
    * update
    
    * update ndarray
    
    * update
    
    * update
    
    * update
    
    * update
    
    * refactor(data_type.h): better representation
    
    * fix(unary_func.h): fix typo
    
    * style(data_type.h): format
    
    * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
    
    * style(CMakeLists.txt): fix typo
    
    * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
    
    * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
    
    * fix get one ptr (#1913)
    
    * fix(layer_norm): add LayerNormOp to grey_list and support the half
    
    * fix(layer_norm about): fix it to run when amp
    
    * fix: move fix sbp signature from OpNode to OpGraph
    
    * Dev new kernel util (#1925)
    
    * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
    
    * refactor(kernel/util): add BlasIf
    
    * refactor(kernel/util): add ArithemeticIf
    
    * refactor(kernel/util): add cuda_kernel_util.*
    
    * refactor: refactor NewKernelUtil
    
    * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
    
    * refactor(new_kernel_util.h): remove unused header files
    
    * refactor: refactor loop include
    
    * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
    
    * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
    
    * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
    
    * CHECK cuda version > 10.0 when using auto_mixed_precision
    
    * Fix bug of Snapshot delete file Unwanted (#1937)
    
    * fix link BUG of release version (#1938)
    
    * delete redundant code in OpGraph JobCompleter and Operator (#1927)
    
    * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe
    
    * revert README change
    
    * split 2 pull request
    
    * Refactor Kernel Registry V2: The clear & easy Way (#1941)
    
    * refactor(resource.proto): move DeviceType to common/device_type.proto
    
    * feat(kernel_registration): add kernel_registration.h/cpp
    
    * feat(kernel_registration): update matmul_kernel to support new registration
    
    * feat: add CreateKernel for new registry
    
    * feat: update registry of cast conf
    
    * refactor(kernel_registration): remove KernelRegMap
    
    * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
    
    * grpc SetMaxMessageSize(INT_MAX) (#1950)
    
    * fix bug of Graph::ForEachConnectedComponent (#1952)
    
    * Grpc set max size (#1953)
    
    * grpc SetMaxMessageSize(INT_MAX)
    
    * set max msg len for ctrl service
    
    * code for test grpc max msg size
    
    * remove test code
    
    * NumaAwareCudaMallocHost (#1959)
    
    * NumaAwareCudaMallocHost
    
    * add conf
    
    * AllReduceSequencePass (#1976)
    
    * Merge job set and split fw bw (#1983)
    
    * add MemoryCopier and TensorSliceCopier (#1901)
    
    * add MemoryCopier and TensorSliceCopier
    
    * Index=>NdIndex
    
    * refine
    
    * refine
    
    * fix addition error checking (#1911)
    
    * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
    
    * update binary_func.h
    
    * update
    
    * update ndarray
    
    * update
    
    * update
    
    * update
    
    * update
    
    * refactor(data_type.h): better representation
    
    * fix(unary_func.h): fix typo
    
    * style(data_type.h): format
    
    * Merge dev_mixed_precision: Part-2 (#1907)
    
    * feat: add NewKernelUtil
    
    * fix typos
    
    * feat: add cublas_tensor_op_math_handle()
    
    * add gemm (#1860)
    
    * add gemm
    
    * save
    
    * add blobgemm
    
    * update
    
    * update
    
    * fix cu
    
    * update cpp
    
    * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
    
    * feat: update FullyConnectedKernel to use NewKernelUtil
    
    * Dev sx mixed precision (#1861)
    
    * add gemm
    
    * save
    
    * add blobgemm
    
    * update
    
    * update
    
    * fix cu
    
    * update cpp
    
    * save cpp
    
    * save
    
    * add relu and relu_backward
    
    * remove spared space
    
    * add explicit declaration
    
    * rename
    
    * feat: update ConvKernel to support half
    
    * add sigmoid and tanh (#1867)
    
    * add axpy (#1866)
    
    * style: formatting
    
    * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
    
    * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
    
    * refine(new_kernel_util.h)
    
    * refine(new_kernel_util.cu)
    
    * feat(new_kernel_util): add OFBatchedGemm()
    
    * feat: update MatMulKernel to support half
    
    * feat: update ConvData/Bias/FilterGradKernel to support half
    
    * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
    
    * feat: support loss scale
    
    * fix(operator): :bug:add InferHasBatchDim()
    
    * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
    
    * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
    
    * style(kernel/cast_kernel.cpp): formatting
    
    * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
    
    * style(cast_kernel.cpp): formatting
    
    * feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil
    
    * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
    
    * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
    
    * refactor(dropout_kernel): remove backward funcs
    
    * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
    
    * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
    
    * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix: fix little bugs
    
    * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
    
    * feat: support half for bias_add_kernel
    
    * fix(bias_add_op): remove data type check
    
    * feat(relu_kernel): support half
    
    * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
    
    * fix: typos
    
    * feat(pooling_kernel): support half
    
    * fix: remove CHECK_EQ of default data type
    
    * feat(pooling_grad_kernel): support half
    
    * feat: support half in ofrecord_encoder (TODO)
    
    * fix
    
    * feat: support half in sparse_cross_entropy_kernel
    
    * debug grad op (#1883)
    
    * Dev debug op mixed precision (#1884)
    
    * debug grad op
    
    * do nothing instead of UNIMPLEMENTED
    
    * fix(dropout_kernel): add tmp_split_fw_bw condition
    
    * build(half.cmake): https->http
    
    * fix(record_load_kernel): support total_batch_num
    
    * fix pooling (#1885)
    
    * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
    
    * fix: add GetCudnnScalingParameters() to fix scaling params
    
    * fix: add enable_true_half_config_when_conf() into config and update related code
    
    * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
    
    * refactor(matmul_kernel): remove Backward()
    
    * feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx()
    
    * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
    
    * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
    
    * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
    
    * feat(job_conf.proto): add enable_auto_mixed_precision field
    
    * feat(auto_mixed_precision_lists): add amp_lists
    
    * feat(auto_mixed_precision): build the skeleton
    
    * feat(auto_mixed_precision): almost finish amp graph pass
    
    * feat(auto_mixed_precision.cpp): complete InsertCastOp()
    
    * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
    
    * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
    
    * refine(auto_mixed_precision.cpp): refine LOG
    
    * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
    
    * Dev half ndarray (#1886)
    
    * debug grad op
    
    * ZeroVal => GetZeroVal; OneVal => GetOneVal
    
    * MaxVal => GetMaxVal; MinVal => GetMinVal
    
    * check data type
    
    * DevDType
    
    * move function template to struct template for BinaryFunc* and UnaryFunc*
    
    * support half for reduce_sum_kernel
    
    * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
    
    * half for NdarrayUtil
    
    * OF_DEVICE_FUNC is always inline
    
    * half for NdarrayApplyUnary
    
    * simplify usage of NdarrayUtil
    
    * UnaryFuncExp
    
    * add VarNdarrayBuilder and ValNdarrayBuilder
    
    * simplify NdarrayUtil in layer_norm_param_grad_kernel
    
    * InplaceBroadcast
    
    * remove SoftmaxKernelUtil
    
    * half for softmax_kernel
    
    * fix improper use of __CUDA_ARCH__
    
    * disable sm_30,sm_52
    
    * refine(conv_kernel.cu): fix typo
    
    * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
    
    * fix: fix typos of GetOneVal
    
    * fix(auto_mixed_precision.cpp): allocate for shared_ptr
    
    * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
    
    * fix(auto_mixed_precision.cpp): fix typo
    
    * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
    
    * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
    
    * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
    
    * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
    
    * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
    
    * feat(auto_mixed_precision.cpp): more logs
    
    * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
    
    * fix(bias_add_op.cpp): fix bias_multiplier shape
    
    * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
    
    * feat: update MatmulKernel and new_kernel_util to support half
    
    * refactor(auto_mixed_precision): add ClearList and refine code
    
    * feat(tanh_*_kernel): support half
    
    * feat(add_kernel): support half
    
    * update binary_func.h
    
    * update
    
    * update ndarray
    
    * update
    
    * update
    
    * update
    
    * update
    
    * refactor(data_type.h): better representation
    
    * fix(unary_func.h): fix typo
    
    * style(data_type.h): format
    
    * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
    
    * style(CMakeLists.txt): fix typo
    
    * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
    
    * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
    
    * fix get one ptr (#1913)
    
    * fix(layer_norm): add LayerNormOp to grey_list and support the half
    
    * fix(layer_norm about): fix it to run when amp
    
    * fix: move fix sbp signature from OpNode to OpGraph
    
    * Dev new kernel util (#1925)
    
    * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
    
    * refactor(kernel/util): add BlasIf
    
    * refactor(kernel/util): add ArithemeticIf
    
    * refactor(kernel/util): add cuda_kernel_util.*
    
    * refactor: refactor NewKernelUtil
    
    * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
    
    * refactor(new_kernel_util.h): remove unused header files
    
    * refactor: refactor loop include
    
    * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
    
    * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
    
    * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
    
    * CHECK cuda version > 10.0 when using auto_mixed_precision
    
    * Fix bug of Snapshot delete file Unwanted (#1937)
    
    * fix link BUG of release version (#1938)
    
    * delete redundant code in OpGraph JobCompleter and Operator (#1927)
    
    * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe
    
    * revert README change
    
    * split 2 pull request
    
    * Refactor Kernel Registry V2: The clear & easy Way (#1941)
    
    * refactor(resource.proto): move DeviceType to common/device_type.proto
    
    * feat(kernel_registration): add kernel_registration.h/cpp
    
    * feat(kernel_registration): update matmul_kernel to support new registration
    
    * feat: add CreateKernel for new registry
    
    * feat: update registry of cast conf
    
    * refactor(kernel_registration): remove KernelRegMap
    
    * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
    
    * grpc SetMaxMessageSize(INT_MAX) (#1950)
    
    * fix bug of Graph::ForEachConnectedComponent (#1952)
    
    * Grpc set max size (#1953)
    
    * grpc SetMaxMessageSize(INT_MAX)
    
    * set max msg len for ctrl service
    
    * code for test grpc max msg size
    
    * remove test code
    
    * NumaAwareCudaMallocHost (#1959)
    
    * NumaAwareCudaMallocHost
    
    * add conf
    
    * AllReduceSequencePass (#1976)
    
    * CudaCurrentDeviceGuard (#1977)
    
    * delete tmp_split_fw_bw_train_conf (#1985)
    
    * delete tmp_split_fw_bw_train_conf
    
    * delete useless comments
    
    * fix refactor bug in layer_norm_op
    
    * minor fixes
    
    * update py script
    
    * remove code could be misleading
    
    * Fix all reduce mem sharing (#1986)
    
    * fix all reduce mem sharing
    
    * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
    
    * remove obsolete task_graph optimization
    
    * no arg_pass_job for variable_op
    
    * merge memory block id between jobs (#1910)
    
    * refine MemBlock and CriticalSection
    
    * job memory sharing strategy
    
    * revert diff in CriticalSectionDesc
    
    * Merge memory block between sub plans
    
    * Get mutual exclusion job groups
    
    * forget to consider memory merge only in same machine
    
    * memory zone unique id
    
    * Merge Done;  merge memory block id from right to left; get memory block ids info
    
    * revert MemBlock
    
    * generate mutual exclusion job groups Done.
    
    * update for proto
    
    * add JobMemSharingStrategy in python interface
    
    * remove memorycase hash
    
    * move JobMemSharingStrategy to JobSetProto
    
    * using default strategy = parallel priority strategy
    
    * update interface of flow.job_mem_sharing_strategy
    
    * InterJobMemSharingUtil and PlanUtil
    
    * revert oneflow.h
    
    * fix bug
    
    * New implement of Merge memory block id between jobs
    
    * refine code
    
    * fix a fatal bug in std::hash<oneflow::Shape>
    
    * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
    
    * unlock critical sections as much as possible (#1994)
    
    * Bugfix actor case (#1995)
    
    * unlock critical sections as much as possible
    
    * consumed and produced regst of actor 'case' are customized
    
    * refine code
    
    * Bugfix actor case (#1996)
    
    * unlock critical sections as much as possible
    
    * consumed and produced regst of actor 'case' are customized
    
    * refine code
    
    * small regst_num for reentrant_lock (#1997)
    
    * fmt dev_job_set (#1999)
    
    * double buffer for tick_op
    
    * tick is cpu op
    
    * speedup compile time (#2000)
    
    * only merge mem_block_id between user job (#1993)
    
    * Fix keep header only (#2001)
    
    * speedup compile time
    
    * fix keep header only
    
    * remove shared model (#2003)
    
    * remove blob_mem_sharing (#2005)
    
    * No copyhd for output (#2006)
    
    * no cpu tick
    
    * no copyhd for output_op/swith_output_op
    
    * remove temp comments
    
    * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
    
    * remove clone_id (#2007)
    
    * layer norm auto var (#2004)
    
    * layer norm auto var
    
    * make of_format
    
    * bn sbp (#2008)
    
    * Refactor job completer (#1998)
    
    * fmt
    
    * refactor GenerateOpConf4Trainning
    
    * more refactor
    
    * refactor SetCtrlInOpName4VariableOp
    
    * use uniq ptr
    
    * refactor RewriteBoxingWithAllReduce
    
    * refactor MakeAllReduceSequence
    
    * refactor auto_mixed_precision
    
    * refactor DumpLogicalBlobDescAndSbpSignature
    
    * refactor group_boxing_by_dst_parallel
    
    * refactor add_keep_header_only_op_conf
    
    * refactor AutoSourceTick
    
    * refactor AddTickForTimeShape
    
    * refactor AutoSinkTick
    
    * refactor AddGlobalOutputCriticalSections
    
    * refactor SetOpTimeShape7BatchDimLbis
    
    * fix a bug in IsInterfaceTask (#2009)
    
    * Bugfix is interface task (#2010)
    
    * fix a bug in IsInterfaceTask
    
    * IsOutputInterfaceTask
    
    * copyhd-free output_op task_node
    
    * Dev job set config util (#2011)
    
    * add more if in JobConfigProtoBuilder
    
    * unlock critical sections as much as possible
    
    * consumed and produced regst of actor 'case' are customized
    
    * remove total batch num in config util
    
    * remove clone_id
    
    * assert has train_conf
    
    * rm debug info
    
    * Dev job set bert (#2013)
    
    * support bert
    
    * mv into bert
    
    * manual format
    
    * fix adam (#2015)
    
    * fix adam
    
    * div batch instance num before update model
    
    * remove outdate code in oneflow.cpp (#2017)
    
    * Dev split like (#2016)
    
    * no total_instance_num
    
    * add auto grad for concat
    
    * check in impl
    
    * check in bug fixes
    
    * fix bugs for split_like
    
    * split_like_op.cpp format
    
    * add normalization_autovar
    
    * Update op_conf.proto
    
    * address reviews
    
    * fix typo
    
    * constant ref
    
    * rm forward_loss_instance_num (#2018)
    
    * Bugfix job set multi device (#2019)
    
    * sbp for tick input bn
    
    * interface_blob_conf for output_op/switch_output_op
    
    * set sbp conf for tuple identity op
    
    * fix bugs when merge main plan
    
    * delete useless code
    
    * address review
    
    * fix error use of GenRepeatedBn()
    
    * ForEachConnectedComponent is easily misused
    
    * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
    
    * only for return output_op
    
    * refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
    
    * return op instead of output op acts as part of user job
    
    * enable_all_reduce_group
    
    * bugfix: init RuntimeBuffersScope before Runtime
    
    * demo python scripts for enable_all_reduce_group
    
    * remove wrong optimization code
    
    * constant_conf for enable_all_reduce_group.py test
    
    * fix interface op parallel conf
    
    * fix reduce concat kernel (#2020)
    
    * binary program oneflow_worker
    
    * user_job_completer
    
    * remove unused code loss_print
    
    * rm unused code loss_acc
    
    * remove unused accuracy_acc and accuracy_print
    
    * remove input_diff/output_diff/model_diff bns
    
    * remove unused bns in gdb util
    
    * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
    
    * support mpi using style
    
    * Bugfix put job conf into plan (#2023)
    
    * put job_conf into plan
    
    * use job_name to judge isPullJob/isPushJob
    
    * fix wrong job_id error
    
    * model_init is a push job; model_save is a pull job
    
    * make cmake more reasonable (#2024)
    
    * Restructure python module and minimum setup.py (#2026)
    
    * check in updated paths
    
    * check in minimum setup tool
    
    * Dev python init multi unit (#2022)
    
    * init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine
    
    * refine var name
    
    * refine code
    
    * compile user/main job only on master
    
    * bert multi machine test code
    
    * fix bugs
    
    * JobConfs
    
    * fix bugs under WITH_RDMA
    
    * fix multi-machine bugs
    
    * delete useless code
    
    * Add xla reduce_sum op
    
    * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
    
    * feat: init_worker can run without scp-ing the binary and without using uuid (#2029)
    
    * half impl of without scp bin
    
    * feat: init_worker can run without scp-ing the binary and without using uuid
    
    * check in fixes (#2030)
    
    * fixbug of delete worker (#2033)
    
    * Dev dot plan (#2035)
    
    * reuse plan to dot file
    
    * refine plan dot
    
    * Check in bug fix and multi node script (#2032)
    
    * check in fixes
    
    * check in script
    
    * fix boxing bug when setting conf with sbp
    
    * flag for iter
    
    * fixbug of delete worker
    
    * fix delete worker in script
    
    * address review, add exclusive or check
    
    * reuse plan to dot file
    
    * refine plan dot
    
    * fix and add flags
    
    * fmt
    
    * rm debug output
    
    * more flags
    
    * check Activation
    
    * fix fc bug when num axes > 2
    
    * reverse change
    
    * fix next_batch_num (#2036)
    
    * upgrade nccl to 2.4.8 (#2037)
    
    * fix shape of fc in_diff (#2038)
    
    * Rewrite model update op to optimizer graph
    
    * Update oneflow.cmake (#2041)
    
    * better looking merged_plan to dot v1 (#2039)
    
    * better looking and more information of merged_plan.dot
    
    * refine color
    
    * Fix tick in multi node parallel (#2042) (#2047)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * Dev train conf builder (#2046)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * check in impl
    
    * fix data dir (#2054)
    
    * fix data dir
    
    * rm model load path
    
    * AssignOp (#2058)
    
    * AssignOp
    
    * remove useless code
    
    * Python ops gather and unit test (#2053)
    
    * python_ops gather and unit test
    
    * format
    
    * minor mod
    
    * SnapshotOp (#2060)
    
    * magical add and fix bug (#2061)
    
    * check in impl
    
    * add todo
    
    * Dev jxf python pooling (#2056)
    
    * run max_pool_2d without bug
    
    * correct max_pool_2d
    
    * correct average_pool_2d
    
    * minor refine
    
    * final version
    
    * rename to nn.py
    
    * add name arg to pool1d ops
    
    * refine by review
    
    * rename to _GetSequence and move it to the end of file (#2063)
    
    * fix BindInterfaceMemBlockId (#2065)
    
    * mark py file generated (#2066)
    
    * Dev gracious exit (#2057)
    
    * add more checks
    
    * make language more consistent
    
    * better error info for worker init
    
    * better error
    
    * Update setup.py (#2068)
    
    * Refine Infer APIs by return Maybe<void> type (#2051)
    
    * Refine Infer APIs by return Maybe<void> type
    
    * Fix return type
    
    * Fix code style
    
    * Replace CHECK macros in the implementation of infer APIs
    
    * Revert IsOk
    
    * fix bug for split like op (#2070)
    
    * fix snapshot path (#2071)
    
    * Dev job set fix infer apis (#2072)
    
    * Refine Infer APIs by return Maybe<void> type
    
    * Fix return type
    
    * Fix code style
    
    * Replace CHECK macros in the implementation of infer APIs
    
    * Revert IsOk
    
    * update
    
    * add AutoGlobalStep (#2073)
    
    * rm default_initializer_conf in train conf (#2075)
    
    * Fix sigmoid op (#2076)
    
    * fix sigmoid op bug
    
    * fix bug for split like op
    
    * add sigmoid grad op
    
    * Fix bn (#2077)
    
    * fix bn
    
    * return Maybe<void> OK in lambda
    
    * fix typo
    
    * fix SigmoidGradOp (#2078)
    
    * Dev python merge job set (#2081)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * fix gcc warning in release (#2080)
    
    * fix gcc version in release
    
    * fix empty line
    
    * Fix adam mv initilizer (#2082)
    
    * zero constant initializer for adam m and v
    
    * make of_format
    
    * init adam m v beta1_t and beta2_t
    
    * use value instead of initializer
    
    * const float& -> const float
    
    * update
    
    * LearningRateScheduleOp (#2079)
    
    * matmul (#2084)
    
    * matmul
    
    * np.allclose
    
    * Fix hang bugs
    
    * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
    
    * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
    
    * refine code for read
    
    * check py if and test
    
    * prelu (#2086)
    
    * prelu
    
    * fix
    
    * fix
    
    * template for either ptr cast (#2088)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * add template for cast
    
    * rename
    
    * Dev build and infer ctx (#2089)
    
    * add job_build_and_infer_ctx interface
    
    * lbn_with_split_hint
    
    * fix maybe macro
    
    * fix signature of Maybe<T>::Error()
    
    * job_build_and_infer_if
    
    * add c_api_util wrapper for job_build_and_infer_ctx
    
    * implement python/job_build_and_infer interface
    
    * CurJobBuildAndInferCtx_AddPlacementGroup
    
    * BuildJobAndInferCtx  and  Mgr  c++ implement (#2074)
    
    * job_build_and_infer_ctx_mgr
    
    * refine interface of infer_ctx_mgr
    
    * JobBuildInferCtx set job conf; add and refine error type
    
    * revert job.proto
    
    * half impl of add op in build_infer_ctx
    
    * generate op produced empty logical blob desc ; infer out blob desc interface
    
    * job_build_and_infer_ctx VERSION 1
    
    * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
    
    * maybe return
    
    * job_set hold by job_build_and_infer_ctx_mgr
    
    * check placement when infer ctx mgr leave cur job
    
    * Global New/Delete JobBuildAndInferCtxMgr
    
    * add JUST when ctx add op
    
    * remove unused job_conf.arg_op_name
    
    * fix bugs caused by python new api
    
    * fix bugs caused by lack of Global<JobDesc>
    
    * fix bugs caused by new api
    
    * refactor compiler.Compile
    
    * merge dev_python
    
    * remove unused message proto
    
    * rename api
    
    * Fix input which body is disabled in xla launch kernel
    
    * add RemoteBlob.shape and RemoteBlob.dtype
    
    * Fix data type set default variable (#2092)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * fix default data type
    
    * Add conf axis for bias_add for any axis channel (#2093)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * bias_add completion
    
    * follow comment
    
    * make conf axis required
    
    * Dev jxf python initializer (#2090)
    
    * oneflow initializer
    
    * update
    
    * Fix self control in
    
    * Bugfix python alexnet (#2096)
    
    * bugfix_python_alexnet
    
    * fix
    
    * Add fake consume op
    
    * Dev global step (#2100)
    
    * assign op
    
    
    AddGlobalStepOpConf
    
    
    fix
    
    
    ARITHMETIC_DATA_TYPE_SEQ
    
    
    identity_op_conf
    
    
    add ops
    
    
    GenNewSnapshotName
    
    
    SnapshotOp
    
    
    cleanup
    
    
    blob name
    
    
    LearningRateScheduleOp
    
    
    LearningRateScheduleKernel
    
    
    LearningRateScheduleKernel
    
    
    AddLearningRateScheduleOpConf
    
    
    learning rate
    
    
    cleanup
    
    
    fix
    
    
    fix
    
    * remove total_mbn_num
    
    * date time format
    
    * save
    
    * refine
    
    * refine
    
    * revert
    
    * refine snapshot
    
    * fix
    
    * refine
    
    * AutoGlobalStep
    
    * refine
    
    * GenLogicalBlobName
    
    * AutoLearningRate
    
    * remove JobDesc lr
    
    * fix snapshot path
    
    * Maybe<void>
    
    * learning_rate blob
    
    * remove next_model_vid
    
    
    fix
    
    
    fix 
    
    
    fix
    
    
    learning_rate
    
    * train_conf
    
    * fix for global step on multi nodes
    
    * Fix optimizer initializer (#2095)
    
    * fix optimizer initializer
    
    * rename lars data temp bn
    
    * fix job_type (#2102)
    
    * Dev alexnet new api (#2094)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * check in softmax loss
    
    * nn.conv2d and nn.bias_add
    
    * fix opname
    
    * fix merge conflict
    
    * fix name
    
    * dense (#2097)
    
    * Fix jxf dense v2 (#2098)
    
    * dense
    
    * minor fix
    
    * alexnet
    
    * fix conf
    
    * quick fix
    
    * transpose
    
    * fix layers
    
    * add transpose
    
    * fix fc
    
    * fix
    
    * fix
    
    * fix data load
    
    * params check and format
    
    * rm activation in op conf
    
    * save workaround
    
    * fix avg pool 2d
    
    * fix max pool 2d
    
    * remove fc3 relu
    
    * alexnet eval
    
    * minor
    
    * replace has_batch_dim with batch_axis (#2104)
    
    * replace has_batch_dim with batch_axis
    
    * refactor OrderValue4HasBatchAxis
    
    * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
    
    * no CHECK in MatmulOp::InferBatchAxis
    
    * infer op by op_conf and  parallel_conf
    
    * wrapper Error for ErrorProto
    
    * replace ErrorUtil with Error
    
    * add OF_CHECK (#2110)
    
    * optional split_axis (#2113)
    
    * Fix HasAttr bug for optional field
    
    * undefined (#2116)
    
    * merge reduce xxx (#2119)
    
    * Update GetSbpSig() with Maybe (#2118)
    
    * fix sveral ops
    
    * modify all ops
    
    * format
    
    * update complete
    
    * Refine AdamOptimizer
    
    * fix (#2120)
    
    * Fix xla AdamOptimizer bugs
    
    * support scalar for reduce_xxx axis args (#2122)
    
    * Dev opt split axis (#2121)
    
    * optional split_axis
    
    * backup
    
    * VariableConf::(OptInt64 split_axis)
    
    * backup
    
    * fix autovar split_axis (#2125)
    
    * Dev model init op (#2117)
    
    * assign op
    
    
    AddGlobalStepOpConf
    
    
    fix
    
    
    ARITHMETIC_DATA_TYPE_SEQ
    
    
    identity_op_conf
    
    
    add ops
    
    
    GenNewSnapshotName
    
    
    SnapshotOp
    
    
    cleanup
    
    
    blob name
    
    
    LearningRateScheduleOp
    
    
    LearningRateScheduleKernel
    
    
    LearningRateScheduleKernel
    
    
    AddLearningRateScheduleOpConf
    
    
    learning rate
    
    
    cleanup
    
    
    fix
    
    
    fix
    
    * remove total_mbn_num
    
    * date time format
    
    * save
    
    * refine
    
    * refine
    
    * revert
    
    * refine snapshot
    
    * fix
    
    * refine
    
    * AutoGlobalStep
    
    * refine
    
    * GenLogicalBlobName
    
    * AutoLearningRate
    
    * remove JobDesc lr
    
    * fix snapshot path
    
    * Maybe<void>
    
    * learning_rate blob
    
    * remove next_model_vid
    
    
    fix
    
    
    fix 
    
    
    fix
    
    
    learning_rate
    
    * train_conf
    
    * fix for global step on multi nodes
    
    * SnapshotReader
    
    
    snapshot writer
    
    
    model init op
    
    
    fix
    
    
    refine
    
    
    init
    
    
    InitializeFromSnapshotConf
    
    
    model io job
    
    
    ModelLoadOp
    
    
    ModelLoadKernel
    
    
    MakeModelLoadJob
    
    
    ModelSaveOp
    
    
    fix
    
    
    InterUserJobInfo
    
    
    _MakeModelLoadJobFunc
    
    
    MutModelLoadOpConTickInputHelper
    
    
    fix
    
    
    refine
    
    
    init/load/save
    
    
    set_default_variable
    
    * remove SnapshotMgr
    
    * snapshot.h
    
    * delete model_init_job.cpp
    
    
    foreign_input_op_conf
    
    
    fix
    
    
    snapshot path
    
    
    set path
    
    
    op_conf
    
    
    fix
    
    
    fix CopyFromNdarray
    
    
    to bytes c
    
    
    use uint8
    
    
    char2uint8
    
    * model init
    
    * model io
    
    * fix
    
    * ModelSaveKernel
    
    * mutable_batch_axis()->Clear()
    
    * InferBatchAxis
    
    * fix
    
    * refine
    
    * job set
    
    * MakeModelIoJobs
    
    * fix
    
    * jobs
    
    * fix
    
    * model io job
    
    * GenOutputOpConf
    
    * refine snapshot
    
    * refine
    
    * fix
    
    * refine CheckPoint
    
    * remove session
    
    * refine
    
    * refine
    
    * refine
    
    * remove keyword.h/cpp
    
    * refine
    
    * global_step=>train_step
    
    * GetSbpSignatures
    
    * ModelInitOp
    
    * fix (#2127)
    
    * rm stale alextnet script (#2129)
    
    * Dev plain maybe (#2126)
    
    * optional split_axis
    
    * backup
    
    * VariableConf::(OptInt64 split_axis)
    
    * backup
    
    * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
    
    * SharedOrPlain
    
    * const std::shared_ptr<T>& => std::shared_ptr<T>
    
    * Dev simple checkpoint manager (#2128)
    
    * SimpleCheckPointManager
    
    * makedirs
    
    * fix path
    
    * save
    
    * refine
    
    * refine
    
    * fix path to numpy (#2130)
    
    * Dev plain maybe (#2132)
    
    * optional split_axis
    
    * backup
    
    * VariableConf::(OptInt64 split_axis)
    
    * backup
    
    * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
    
    * SharedOrPlain
    
    * const std::shared_ptr<T>& => std::shared_ptr<T>
    
    * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
    
    * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
    
    * Dev jxf merge general ops (#2131)
    
    * merge some general ops to dev_python
    
    * dense demo
    
    * rm print in test
    
    * new line at the end of file
    
    * format
    
    * fix check point
    
    * update alexnet
    
    * broadcast_xxx (#2134)
    
    * broadcast_xxx
    
    * typo
    
    * typo
    
    * rm job_conf.num_of_batches_in_snapshot
    
    * fix args (#2136)
    
    * fix proto if (#2138)
    
    * pass name to inner function (#2139)
    
    * check dropout if (#2140)
    
    * check dropout if
    
    * fix typo
    
    * Dev merge math ops (#2143)
    
    * merge math ops
    
    * new line at the end of file
    
    * merge layer norm (#2144)
    
    * variable_scope (#2141)
    
    * variable_scope
    
    * revert format
    
    * add check
    
    * Merge dropout if (#2145)
    
    * check dropout if
    
    * fix typo
    
    * fix typo
    
    * slice (#2142)
    
    * slice
    
    * add check and docstring
    
    * minor
    
    * minor
    
    * add const (#2146)
    
    * add const
    
    * fix indentation
    
    * address review
    
    * fmt
    
    * rm redundant
    
    * Update array_ops.py
    
    * Update array_ops.py
    
    * Update array_ops.py
    
    * add more activations to math_ops (#2147)
    
    * fix bug (#2149)
    
    * truncated normal for bert (#2150)
    
    * Update bert for dev python (#2151)
    
    * truncated normal for bert
    
    * bert support
    
    * math.dropout to nn.dropout (#2153)
    
    * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
    
    * allow export multiple interfaces in oneflow_export decorator (#2154)
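    
    The general pattern behind a decorator like the one above is registering one function under several public names. A minimal, hypothetical sketch of the idea (the registry and names below are illustrative only, not the actual oneflow_export implementation):
    
        # Hypothetical: one callable reachable under multiple exported names.
        _EXPORTS = {}
    
        def export(*api_names):
            def decorator(func):
                for name in api_names:
                    _EXPORTS[name] = func
                return func
            return decorator
    
        @export("math.relu", "nn.relu")
        def relu(x):
            return max(x, 0)
    
        print(_EXPORTS["math.relu"] is _EXPORTS["nn.relu"])  # True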
    
    * refactor job_build_and_infer_if.h
    
    * update oneflow_internal.h to use Maybe (#2135)
    
    * Fix python internal (#2133)
    
    * Return error message in oneflow_internal
    
    * Refine environment_objects_scope
    
    * add OF_ERROR_STR_CHECK and OFStrCat()
    
    * format
    
    * fix based on review
    
    * fix(oneflow_internal.h): add undef
    
    * fix: expr -> (expr)
    
    * feat: update oneflow_internal_helper to use func
    
    *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
    
    *  Transfer data_part_num to DecodeOp and RecordLoadOp
    
    * Fix python scripts
    
    * Dev nc of internal (#2155)
    
    * Fix python internal (#2133)
    
    * Return error message in oneflow_internal
    
    * Refine environment_objects_scope
    
    * add OF_ERROR_STR_CHECK and OFStrCat()
    
    * format
    
    * fix based on review
    
    * fix(oneflow_internal.h): add undef
    
    * fix: expr -> (expr)
    
    * feat: update oneflow_internal_helper to use func
    
    * fix: fix ctor bug
    
    * fix config_proto
    
    * rename c_api_util.Init => c_api_util.InitEnvironment
    
    * refactor compile_context.cur_job => compile_context.cur_job_conf
    
    * remove FixPackedBlobDescOfProducedRegst (#2156)
    
    * Fix snapshot root path empty log (#2158)
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * fix 121 for tick (#2069)
    
    * Fix snapshot root path empty log
    
    * fix channel last (#2157)
    
    * fix channel last
    
    * minor
    
    * merge pb_message
    
    * add cudnn conv force algo (#2159)
    
    * Update bert for dev python (#2160)
    
    * remove old bert
    
    * set data_part_num in decoder
    
    * support model load/save args
    
    * Dev flow function (#2152)
    
    * add of.function, refactor init, refine session, and refine runtime
    
    * rm useless code
    
    * rename
    
    * update
    
    * add test
    
    * @oneflow_export JobConfigProto and Trainconf (#2162)
    
    * @oneflow_export JobConfigProto and Trainconf
    
    * remove unused config in config_util.py
    
    * remove oneflow.get_cur_job_conf_builder
    
    * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161)
    
    * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
    
    * fix config.train.model_update_conf
    
    * _GetJobConfAttr
    
    * update alexnet (#2166)
    
    * Update alexnet (#2167)
    
    * update alexnet
    
    * update for bert
    
    * 15->16
    
    * more reasonable conf
    
    * get variable in py layer norm
    
    * replace val in pb msg;  decode lbn string with split hint (#2165)
    
    * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
    
    * Add meta data in HLO instruction, and refine
    
    * python model parallel (#2103)
    
    * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op
    
    * merge placement group
    
    * refine code in AddAndInferOp
    
    * auto merge placement group when add op; remove mergeplacementgroup interface
    
    * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx
    
    * python blob add interface for model parallel
    
    * refine code of python blob split
    
    * remove interface of has/get_split_axis in python blob
    
    * remove interface of has_batch_dim in python blob
    
    * add check that blob split_axis can be divided by parallel num
    
    * refine code for maybe get/infer sbp
    
    * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc etc.
    
    * fix for plain point maybe
    
    * fix bug: add repeated placement group, remove add placement interface in hand
    
    * fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel
    
    * dev_python model parallel runnable and check correct
    
    * remove add placement group when placement scope exits
    
    * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
    
    * bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done
    
    * refine python blob_desc.split implement
    
    * refine interface decode lbn to split hint
    
    * refine auto add placement group
    
    * refine lbn with split hint decode
    
    * refine code for review
    
    * remove AutoVar related code (#2168)
    
    * feat: remove all autovar
    
    * fix and format
    
    * fix: fix op::InferBlobDesc
    
    * add prototype (#2172)
    
    * add prototype
    
    * infer blob desc with sbp_signature
    
    * `str_a is not str_b' is buggy, use `str_a != str_b' instead
    
    * Update snapshot.cpp (#2174)
    
    * remove useless lines (#2176)
    
    * Fix bert multi nodes (#2177)
    
    * remove useless lines
    
    * fix bert and init_cluster_env for multi nodes
    
    * CHECK_JUST for InferBlobDescsIf (#2178)
    
    * Fix bert multi nodes (#2180)
    
    * remove useless lines
    
    * fix bert and init_cluster_env for multi nodes
    
    * config_proto -> default_config_proto
    
    * delete worker
    
    * update alexnet
    
    * remove unused op (#2182)
    
    * remove parallel_ctx when kernel init (#2185)
    
    * InferOpSbpSignature in op_graph and infer_ctx (#2175)
    
    * InferOpSbpSignature in op_graph and infer_ctx
    
    * bugfix: lambda lifetime; gen job build error add location info
    
    * refine error generation and return
    
    * refine check lbi valid and exists
    
    * remove parallel num in decode_of_record op/kernel (#2186)
    
    * Fix bugs
    
    * delete GlobalJobDesc() in operator/ (#2188)
    
    * rm unused test file
    
    * Refine
    
    * Add assign ops behind adam optimizer to update model and momentum etc.
    
    * Add assign ops behind adam optimizer to update model and momentum etc.
    
    * Remove fake consume op
    
    * Support enable/disable XLA by set env
    
    * Merge callback, limit max operator count for each XLA subgraph
    
    * CudaEventPool
    
    * fix vector
    
    * refine
    
    * Support in-place update for optimizer
    
    * Add alias input and output to prevent reusing input with other temp buffers
    
    * Refine code style
    
    * Remove unused code
    
    * Fix static cublas library and xla link conflict
    
    * Fix cublas link conflict with tensorflow
    
    * Fix different connection kinds for multiple gpu cards (#2282)
    
    * Refine xla cluster algo (#2289)
    
    * Fix different connection kinds for multiple gpu cards
    
    * Fix bug for multiple outputs consumed by one node
    
    * Refine cluster algo
    
    * Refine MarkClusterId pass and ReduceSplit task node (#2314)
    
    * Fix different connection kinds for multiple gpu cards
    
    * Fix bug for multiple outputs consumed by one node
    
    * Refine cluster algo
    
    * Determine fusion disabled edges
    
    * update
    
    * Produce multiple registers on edges for ReduceSplit task node.
    Fix new allocator by stream id.
    
    * Refine MarkClusterId pass
    
    * Clustering subgraph with reverse ordering is better
    
    * Support strict clustering by taking dependencies into consideration
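    
    A sketch of the dependency constraint behind the "strict clustering" above: two connected nodes may only be fused into one XRT cluster if no other path still connects them, otherwise the fused cluster would both feed and depend on that outside path, i.e. the clustered graph would contain a cycle. The graph representation below is an assumption for illustration, not the OneFlow pass itself.
    
    ```python
    # Sketch of the "strict clustering" idea: merging two nodes into one cluster
    # is only allowed if no *other* path connects them, otherwise the fused
    # cluster and the outside nodes would depend on each other (a cycle).
    from collections import defaultdict
    
    def reachable(graph, src, dst, banned_edge):
        """DFS from src to dst while ignoring the direct edge banned_edge."""
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n in seen:
                continue
            seen.add(n)
            for m in graph[n]:
                if (n, m) != banned_edge:
                    stack.append(m)
        return False
    
    def can_fuse(graph, a, b):
        # a -> b is a direct edge; fusing is unsafe if b is also reachable from a
        # through some longer path (that path would leave and re-enter the fused
        # cluster, creating a cycle in the clustered graph).
        return not reachable(graph, a, b, banned_edge=(a, b))
    
    g = defaultdict(list, {"a": ["b", "c"], "c": ["b"]})
    print(can_fuse(g, "a", "b"))   # False: a -> c -> b exists, fusing a,b makes a cycle
    ```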
    
    * Translate rebuild job and rewrite optimizer into passes, and refine code style
    
    * Fix spelling error
    
    * Update cmake
    
    * Merge branch dev_python (#2321)
    
    * Dev res50 new api (#2173)
    
    * check in script
    
    * runnable
    
    * fix multinode
    
    * fix and real train
    
    * fix param data_format
    
    * fix truncated normal
    
    * quick fix multi node launch (#2193)
    
    * Dev reshape sbp (#2192)
    
    * reshape sbp
    
    * more check for reshape conf
    
    * fix error CHECK
    
    * refactor reshape
    
    * fix reshape like op
    
    * support naive case of s0
    
    * refine
    
    * rm redundant code
    
    * more generous check for equal element cnt
    
    * restore empty line
    
    * add GatherMs0Grad op (#2191)
    
    * support for gather with s(0) `in`
    
    * add gather_ms0_op
    
    * fix bugs in message GatherMs0OpConf and GatherMs0Kernel
    
    * only (B, S(0)) -> P supported for gather_ms0 op
    
    * add GatherMs0Grad op
    
    * minor fix
    
    * refine code
    
    * bugfix and update gather test case
    
    * add concat op and pass the test (#2067)
    
    * add concat op and pass the test
    
    * add vgg job_conf
    
    * model compared to be the same as the old one
    
    * rm unnecessary file
    
    * Update array_ops.py
    
    * mv file
    
    * get rid of ternary operator (#2195)
    
    * Dev reshape util struct (#2194)
    
    * check in changes
    
    * rm file
    
    * minor fix
    
    * Merge network files of 2 cnns (#2196)
    
    * add inceptionV3
    
    * check in vgg16
    
    * add cnns test scripts for dev_python (#2170)
    
    * add cnns test scripts for dev_python
    
    * add alexnet test scripts
    
    * add resnet50
    
    * add inceptionv3
    
    * add resnet50
    
    * add vgg16
    
    * first version of run_cnns_test.py
    
    * remove old files
    
    * unsorted_segment_sum (#2198)
    
    * oneflow.unsorted_segment_sum (#2199)
    
    * oneflow.unsorted_segment_sum
    
    * remove unused import
    
    * remove unused import
    
    * Dev batch unsorted segment sum (#2200)
    
    * oneflow.unsorted_segment_sum
    
    * remove unused import
    
    * remove unused import
    
    * rename UnsortedSegmentSum to BatchUnsortedSegmentSum
    
    * rename: batch_unsorted_* => unsorted_batch_*
    
    * unsorted_segment_sum (#2201)
    
    * unsorted_segment_sum
    
    * fix job_completer/unsorted_segment_sum_grad.cpp
    
    * more check for unsorted_segment_sum batch_axis
    
    * remove FixParallelDesc (#2202)
    
    * rm KernelIfWithModel KernelIfWithActivation (#2203)
    
    * remove KernelIfWithActivation
    
    * remove KernelIfWithModel
    
    * rm blob header kLossInstanceNum (#2204)
    
    * rm ActivationType from op/kernel (#2205)
    
    * refactor sigmoid_cross_entropy_loss
    
    * fix SigmoidGrad::InferBatchAxis
    
    * support part_name_prefix and part_name_suffix_length (#2208)
    
    * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
    
    * oneflow.watch for debug
    
    * Dev decode batch size (#2206)
    
    * rm batch_size and piece_size
    
    * merge dev_python
    
    * Update reshape_like_op.cpp (#2213)
    
    * oneflow.parallel (#2211)
    
    * oneflow.parallel
    
    * refactor split_axis => parallel
    
    * rename parallel => distribute
    
    * fix typo: *Parallel => *Distribute
    
    * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
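    
    A toy model (assumptions only, not the OneFlow implementation) of what the two helpers above are presumed to do: return a blob description annotated with an SBP hint, `S(axis)` for a split placement and `B` for broadcast.
    
    ```python
    # Toy model, not OneFlow source: a blob description that can be annotated
    # with a distribute hint via with_split_distribute / with_broadcast_distribute.
    from dataclasses import dataclass, replace
    from typing import Optional
    
    @dataclass(frozen=True)
    class ToyBlobDesc:
        lbn: str                          # logical blob name
        distribute: Optional[str] = None  # e.g. "S(0)" or "B"
    
        def with_split_distribute(self, axis: int) -> "ToyBlobDesc":
            return replace(self, distribute=f"S({axis})")
    
        def with_broadcast_distribute(self) -> "ToyBlobDesc":
            return replace(self, distribute="B")
    
    w = ToyBlobDesc("dense/weight")
    print(w.with_split_distribute(0).distribute)     # S(0) -> split along axis 0
    print(w.with_broadcast_distribute().distribute)  # B    -> broadcast to all devices
    ```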
    
    * fix warning: return string reference to temporary (#2212)
    
    * docker build support (#2002)
    
    * update cmake files
    
    * check in files
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * shrink ctx size
    
    * fix script
    
    * fix wheel build
    
    * fix wheel build not adding .so (#2052)
    
    * lower cmake version bar
    
    * rm more files
    
    * keep build dir
    
    * check in test bash script
    
    * fix
    
    * Dev docker sx (#2124)
    
    * add python2 docker env
    
    * rm old docker files
    
    * update repository
    
    * add ARG CUDA and USE_PYTHON_3_OR_2
    
    * reform files
    
    * update
    
    * rm log that doesn't print when there is cache
    
    * use default arg in dockerfile
    
    * better py 2 or 3 condition
    
    * add default
    
    * use if
    
    * update alexnet
    
    * update for bert
    
    * 15->16
    
    * add resnet50 in model (#2217)
    
    * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)
    
    * remove parallel policy
    
    * rm FC/rnn/embedding_look_up op/kernel
    
    * add check data parallel for conv/layer_norm op
    
    * bugfix: bias add + use math_add when batch size = 1
    
    * fix InferBatchAxis (#2220)
    
    * sync with bert_benchmark (#2221)
    
    * sync with bert_benchmark
    
    * rename run.sh
    
    * Dev actor msg queue (#2225)
    
    * async msg queue
    
    * EnqueueAsyncMsg
    
    * Merge wnd python (#2226)
    
    * not ready yet
    
    * segment fix
    
    * fix segment_sum bugs
    
    * 1st wide_n_deep push
    
    * Fix tick in multi node parallel (#2042)
    
    * check in fixes
    
    * fix by adding boxing method
    
    * register tick op
    
    * move code and add more check
    
    * fix typo
    
    * fix bug when filtering op nodes before adding tick
    
    * fix wheel build not adding .so (#2052)
    
    * color plan dot VERSION-2 (#2045)
    
    * run successfully on single GPU
    
    * fix 121 for tick (#2069)
    
    * delete unnecessary multiply_grad class
    
    * speed up generate time for dot2svg (#2083)
    
    * Add axis conf to bias_add for any axis channel (#2087)
    
    * bias_add completion
    
    * follow comment
    
    * make conf axis required
    
    * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
    
    This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
    
    * updated
    
    * fix segment_sum_grad
    
    * fix sbp
    
    * fix segment_sum impl for data parallel
    
    * fix
    
    * remove useless code in segment_kernel_util.h
    
    * add python interface
    
    * fix sigmoid conf
    
    * fix naming error
    
    * fix typo
    
    * temp mod loss sbp
    
    * add LazyAdam
    
    * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
    
    * rm useless code
    
    * unsorted_segment_sum
    
    * refactor sigmoid_cross_entropy_loss_kernel to high performance
    
    * Improve sigmoid cross entropy loss grad (#2207)
    
    * remove for loop called cuda kernel
    
    * minor fix
    
    * ../oneflow/python/ops/data_ops.py (#2209)
    
    * fix lazy_adam
    
    * Merge wnd and python (#2214)
    
    * rm ActivationType from op/kernel (#2205)
    
    * refactor sigmoid_cross_entropy_loss
    
    * fix SigmoidGrad::InferBatchAxis
    
    * support part_name_prefix and part_name_suffix_length (#2208)
    
    * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
    
    * oneflow.watch for debug
    
    * Dev decode batch size (#2206)
    
    * rm batch_size and piece_size
    
    * merge dev_python
    
    * Update reshape_like_op.cpp (#2213)
    
    * oneflow.parallel (#2211)
    
    * oneflow.parallel
    
    * refactor split_axis => parallel
    
    * rename parallel => distribute
    
    * fix typo: *Parallel => *Distribute
    
    * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
    
    * merge dev_python
    
    * fix boxing: P->S(0)
    
    * check in docker build scripts (#2216)
    
    * Dev python widedeep docker (#2218)
    
    * check in docker build scripts
    
    * check in .dockerignore
    
    * rm oneflow.segment_sum
    
    * remove segment_sum
    
    * rm unused file
    
    * rm debug code
    
    * rm debug code
    
    * rm double empty lines
    
    * remove useless comments
    
    * fix send msg (#2227)
    
    * fix reduction_coefficient (#2228)
    
    * refactor ndarray for eq/ne/...
    
    * Dev kernel launch synchronized (#2230)
    
    * IsKernelLaunchSynchronized
    
    * virtual
    
    * refine
    
    * refine
    
    * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
    
    * more static_assert
    
    * remove unused task related dot function (#2236)
    
    * remove unused task related dot function
    
    * do not output dot rank info
    
    * Dev non distributed optimizer js (#2234)
    
    * op&kernel&actor
    
    * job
    
    * job_completer
    
    * graph
    
    * format
    
    * fix pd
    
    * fix
    
    * ignore DelPlacementByOpName
    
    * fix auto tick
    
    * JobBuilder
    
    * fix
    
    * config util
    
    * fix
    
    * fix opgrade
    
    * broadcast tick
    
    * fix allreduce
    
    * balance by model size
    
    * GetSoleOutBlobSize
    
    * async_actor_msg_deque
    
    * group
    
    * AddOrMutOpsOnlyOnce
    
    * fix NcclTupleBroadcastGrad
    
    * order
    
    * set nccl order hint
    
    * op_conf
    
    * grad hint
    
    * NcclTupleBroadcastReduceSequencePass
    
    * add missed mutops
    
    * order fix
    
    * try kMdUpdtArea
    
    * fix nccl_order_hint
    
    * fix
    
    * add ti
    
    * tuple_identity_op
    
    * remove useless
    
    * group
    
    * fix deadlock
    
    * force ctrl in
    
    * sc broadcast
    
    * sort obn
    
    * group nccl
    
    * config group_size_mbyte
    
    * non_distributed_optimizer_group_size_mbyte
    
    * format
    
    * stop check
    
    * rm message sending optimization
    
    * refine lazy adam (#2244)
    
    * refine lazy adam
    
    * update
    
    * memory version 2 step 1: replace original concept about mem sharing (#2242)
    
    * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
    
    * memory version 2 step 1: replace original concept about mem sharing
    
    * record reader multi thread (#2246)
    
    * multi thread
    
    * ComputeThreadPoolSize
    
    * python api
    
    * Fix random decode (#2252)
    
    * add decode random
    
    * fix decode random actor
    
    * Dev pr boxing v2 (#2248)
    
    * NcclDeviceCtx
    
    * include naive_actor
    
    * refine
    
    * use_boxing_v2
    
    * config.use_boxing_v2
    
    * SubTskGphBuilder
    
    * fix
    
    * hash<oneflow::MemoryCase>
    
    * Maybe<void>
    
    * ChainSubTskGphBuilder
    
    * SliceBoxingOp
    
    * return ok
    
    * SliceBoxingKernel
    
    * SliceBoxingActor
    
    * kSliceBoxing
    
    * nccl boxing op
    
    * nccl actor
    
    * REGISTER_OP
    
    * GetMsgFromCustomizedConf
    
    * NcclBoxingTaskNode
    
    * BldSubTskGphByBoxingV2
    
    * NcclBoxingSubTskGphBuilder
    
    * fix
    
    * fix
    
    * NcclKernel
    
    * ParallelContext
    
    * REGISTER_ACTOR
    
    * fix rank set
    
    * IsNcclTaskType
    
    * limit
    
    * 1024
    
    * multi thread reader
    
    * thread_num
    
    * IsKernelLaunchSynchronized
    
    * refine
    
    * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx
    
    * MakeHostMemCase
    
    * NcclBldSubTskGph
    
    * remove useless code
    
    * use_boxing_v2
    
    * refine
    
    * refine
    
    * refine
    
    * refine
    
    * refine
    
    * cmake find python note when version is less than 3.14 (#2286)
    
    * fix bug: reduce split kernel inplace (#2297)
    
    * Dev bias add (#2299)
    
    * use bias add
    
    * fix
    
    * bias_add
    
    * bias add half
    
    * fix
    
    * reinterpret_cast
    
    * fix half
    
    * HALF
    
    * fix
    
    * ADD_DEFAULT_KERNEL_CREATOR
    
    * fix
    
    * format
    
    * Fix dev python test (#2294)
    
    * add decode random
    
    * fix decode random actor
    
    * fix dev_python test scripts
    
    * fix batch_size test scripts
    
    * fix
    
    * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
    
    * MemBlockProto and ChunkProto
    
    * create mem block and chunk after improver
    
    * interface merge mem block and chunk between sub plans
    
    * merge chunk between jobs for memory reuse
    
    * using memory zone unique id to replace memory case hash
    
    * merge interface op mem block between jobs for mem sharing
    
    * gen GlobalCriticalSection by mem block id and chunk id
    
    * check mem block and chunk valid before runtime
    
    * Refactor: RegstMgr; allocate memory by mem block and chunk instead of regst
    
    * fix bug; and pass test
    
    * fix bug: init chunk_id_count in id_manager
    
    * reuse copyHd out mem between jobs
    
    * PushPlan and PullPlan for memblock and chunk
    
    * refine merge mem block / chunk in oneflow.cpp
    
    * at(i);
    
    * GetOpName2JobId2TaskProtos functional
    
    * using output ptr; pass test AlexNet and Resnet
    
    * Fix xla reshape op
    
    * Merge upstream of_xla (#2322)
    
    * Dev cuda 9 arch 70 (#2318)
    
    * kCudaAlignSize = 256
    
    * always compute_70
    
    * __CUDA_API_VERSION >= 10000
    
    * __CUDA_API_VERSION >= 10000
    
    * disable_all_reduce_sequence
    
    * Fix xla reshape op
    
    * Fix compilation without xla
    
    * Remove useless code and fix data type mismatch in field desc (#2326)
    
    * Remove useless code
    
    * Refine code style
    
    * Fix data type mismatch in field desc
    
    * Update README.md (#2335)
    
    * Refine code style (#2336)
    
    * Update XLA usage document (#2337)
    
    * Update XLA usage document
    
    * Fix mistakes
    
    * Add xla clang-format and format codestyle (#2340)
    
    * Revert "Add xla clang-format and format codestyle (#2340)" (#2341)
    
    This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.
    
    * Add xla clang-format and format codestyle (#2342)
    
    * Add xla clang-format and format codestyle
    
    * Fix header file missing
    
    * Of xla sx (#2334)
    
    * add gather grad op and pass testing
    
    * rm check
    
    * done batch gather grad
    
    * pass test
    
    * modify according to the review
    
    * add unsorted_segment_sum and refine unsorted_batch_segment_sum
    
    * reform according to review
    
    * reformat according to clang-format and rm reference to the temp object
    
    * Pick step0 and step1 new commits (#2346)
    
    * Add xla clang-format and format codestyle
    
    * Fix header file missing
    
    * Modify codes to support XLA
    
    Conflicts:
    	oneflow/core/job/job_builder.cpp
    	oneflow/core/job/job_builder.h
    	oneflow/core/operator/op_conf.proto
    
    * Fix a bug in building the subgraph, although it won't lead to wrong results (#2347)
    
    * Fix setting is_mutable in xla launch op (#2349)
    
    * Change directory xla to xrt, apply patch if building with xla
    
    * Refactor
    
    * Add infer shape pass, and Refactor launch kernel, graph compiler
    
    * Refine code style, add xla executable and graph compiler
    
    * Rename platform.proto as types.proto
    
    * change OpCompiler to OpKernel, complete xla graph compiler
    
    * Fix compilation bugs and add allocator, now xla compilation is ok
    
    * Add xla executable runtime
    
    * Add executable run scope to support launch kernel on specific stream.
    
    * Fix infer shape pass, and revert cuda event pool
    
    * Refactor graph building with attaching argument metadata.
    
    * Set mutability if rebuilding job
    
    * Set device ordinal correctly
    
    * Refine DelOps
    
    * Refine Argument definition and abstract function as subgraph
    
    * Fix infer shape in xrt launch op and launch kernel.
    
    * Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.
    
    * Refine code style
    
    * Rename xla Operand as XlaValue.
    
    * Complete TensorRT compiler and builder, Refine OpKernel
    
    * Pick public code changes from the new tensorrt branch.
    
    * Fix tensorrt compilation
    
    * Fake implementation of trt executable
    
    * Support selecting engine in launch kernel, refine trt executable
    
    * Use global logger required by tensorrt, rebuild engine if batch size is larger than the default max batch size, and other bugfixes.
    
    * Support train phase setting for registered op kernel
    
    * Remove RewriteOptimizer pass, update xla optimizer op.
    
    * Format job builder .h and .cpp files.
    
    * Remove RewriteOptimizer pass, update xla optimizer op.
    
    * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
    
    * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
    
    * Refine code style and comment.
    
    * Refine model update inference for launch op.
    
    * Refine
    
    * Refine code style and comment.
    
    * Refine model update inference for launch op.
    
    Conflicts:
    	oneflow/xrt/kernel/op_kernel.h
    	oneflow/xrt/node_util.cpp
    	oneflow/xrt/node_util.h
    	oneflow/xrt/passes/cluster.h
    	oneflow/xrt/passes/mark_cluster_id_pass.cpp
    	oneflow/xrt/passes/rebuild_job_pass.cpp
    	oneflow/xrt/types.h
    
    * Add xrt README.md
    
    * Add use_xla_jit and use_tensorrt options in job proto
    
    * Refine code style
    
    * Fix BlobDesc getter and xla LayerNorm op for FP16
    
    * Make use_xla_jit and use_tensorrt configurable from python config and env variables.
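    
    A hedged sketch of the kind of plumbing this commit describes: a boolean switch that can come either from the python-level config or from an environment variable. The variable names `ONEFLOW_USE_XLA_JIT` / `ONEFLOW_USE_TENSORRT` and the helper below are illustrative assumptions, not the exact OneFlow API.
    
    ```python
    # Illustration only: read an on/off switch from an env variable, falling back
    # to a python-level config default. Names are hypothetical.
    import os
    
    def _parse_bool(text: str) -> bool:
        return text.strip().lower() in ("1", "true", "yes", "on")
    
    def use_xla_jit(config_default: bool = False) -> bool:
        value = os.getenv("ONEFLOW_USE_XLA_JIT")   # hypothetical variable name
        return _parse_bool(value) if value is not None else config_default
    
    def use_tensorrt(config_default: bool = False) -> bool:
        value = os.getenv("ONEFLOW_USE_TENSORRT")  # hypothetical variable name
        return _parse_bool(value) if value is not None else config_default
    
    print(use_xla_jit(), use_tensorrt())
    ```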
    
    * Update benchmark
    
    * Refine xrt README and rename compile_with_xrt.h file
    
    * Update README
    
    * Revert tensorrt
    
    * Fix absl missing if building with TensorRT but without XLA
    
    * Update xrt benchmark
    
    * Disable WITH_XLA by default
    
    * Update xrt benchmark
    
    * Format xrt as core
    
    * add activation op
    
    * add softmax op
    
    * Refine code style, remove unused code
    
    * Remove duplication of XLA usage
    
    * test pass
    
    * pooling test pass
    
    * add concat op, not tested
    
    * add activation ops, test not passed
    
    * Add xla gelu unittest
    
    * add activation op, and test passed
    
    * add pooling op, and test passed
    
    * Fix int64 env variable
    
    * Export float16 for python
    
    * Add xla relu unittest
    
    * try to solve conv bug
    
    * add elementwise add op, test passed
    
    * add concat op, test passed
    
    * Bugfix: transfer weights from gpu to host since tensorrt requires host weights.
    
    * add op unit tests
    
    * resolve conflicts and fix softmax bug
    
    * add identity op and topk op, to test
    
    * Add xla bias add and reshape unittests
    
    * Add xla identity unittest
    
    * Add xla cast and scalar op unittests
    
    * Add xla broadcast op and transpose unittests
    
    * Add xla add, sigmoid and tanh unittests
    
    * add reduce mean op, test passed
    
    * format ops, add CHECKs, and optimize function structure
    
    * Add xla gather and batch_gather unittests
    
    * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
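    
    A NumPy sketch of the bug class mentioned above (not the actual XLA kernel): a softmax hard-wired to the last dimension produces wrong values whenever the requested axis is not the last one; reducing along the requested axis (or transposing it to the back first) is the general fix.
    
    ```python
    # Softmax along an arbitrary axis: reduce along the requested axis rather
    # than assuming the last dimension.
    import numpy as np
    
    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)      # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)
    
    x = np.random.randn(2, 3, 4).astype(np.float32)
    ok = softmax(x, axis=1)
    wrong = softmax(x, axis=-1)                      # what a last-dim-only kernel computes
    print(np.allclose(ok, wrong))                    # False: the axis matters
    print(np.allclose(ok.sum(axis=1), 1.0))          # True: values along axis 1 sum to 1
    ```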
    
    * add trt gather op and unit test
    
    * Add xla reduce_sum unittest, and support keep_dims for xla reduce
    
    * Add xla layer_norm unittest, and refine xla layer norm op
    
    * Add reshape_like unittest, and export reshape_like api
    
    * Refine xrt unittest code style
    
    * Export softmax_grad op, add softmax_grad unittest
    
    * Export tanh_grad op and add xla unittest
    
    * Export gelu_grad op, and add xla unittest
    
    * add conv unit test
    
    * reformat
    
    * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
    
    * Commit to merge upstream of_xrt
    
    * check files
    
    * modify files according to review advice.
    
    * Add xrt unittests (#2483)
    
    * Revert tensorrt
    
    * Fix absl missing if building with TensorRT but without XLA
    
    * Update xrt benchmark
    
    * Add xla gelu unittest
    
    * Fix int64 env variable
    
    * Export float16 for python
    
    * Add xla relu unittest
    
    * Add xla bias add and reshape unittests
    
    * Add xla identity unittest
    
    * Add xla cast and scalar op unittests
    
    * Add xla broadcast op and transpose unittests
    
    * Add xla add, sigmoid and tanh unittests
    
    * Add xla gather and batch_gather unittests
    
    * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
    
    * Add xla reduce_sum unittest, and support keep_dims for xla reduce
    
    * Add xla layer_norm unittest, and refine xla layer norm op
    
    * Add reshape_like unittest, and export reshape_like api
    
    * Refine xrt unittest code style
    
    * Export softmax_grad op, add softmax_grad unittest
    
    * Export tanh_grad op and add xla unittest
    
    * Export gelu_grad op, and add xla unittest
    
    * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
    
    * Commit to merge upstream of_xrt
    
    * Fix reduce_mean facade bug if keep_dims is true.
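    
    A quick NumPy reminder of the keep_dims contract the reduce_mean facade has to honour (illustration only, not the fix itself): reduced axes are kept with size 1 instead of being squeezed away.
    
    ```python
    # keepdims semantics: with keep_dims=True the reduced axes remain with size 1.
    import numpy as np
    
    x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
    print(np.mean(x, axis=(1, 2)).shape)                 # (2,)
    print(np.mean(x, axis=(1, 2), keepdims=True).shape)  # (2, 1, 1)
    ```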
    
    * Refine tensorrt unittests
    
    * Check failed if full reduce without keep dimension.
    
    * add pooling unit test
    
    * Add tensorrt bias_add and reshape op, and their unittests.
    
    * Support fp16 for tensorrt.
    
    * Add tensorrt transpose op and unittest.
    
    * add unit test conv_2d
    
    * add unit test concat
    
    * Fix concat if axis is -1.
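    
    A generic sketch of the usual handling for a negative concat axis (an assumption about the approach, not the TensorRT bridge code): normalize the axis by the rank before passing it to a backend that expects non-negative axes.
    
    ```python
    # Normalize a possibly-negative axis against the tensor rank.
    import numpy as np
    
    def normalize_axis(axis: int, num_axes: int) -> int:
        return axis + num_axes if axis < 0 else axis
    
    a = np.ones((2, 3), dtype=np.float32)
    b = np.zeros((2, 5), dtype=np.float32)
    axis = normalize_axis(-1, a.ndim)                 # -1 -> 1
    print(np.concatenate([a, b], axis=axis).shape)    # (2, 8)
    ```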
    
    * Refine tensorrt conv2d unittest
    
    * Fix padding mode for conv2d and pooling, refine unittests.
    
    * Refine tensorrt concat unittest
    
    * Add convert api from string engine to XrtEngine.
    
    * Revert tensorrt, and merge of_xrt branch
    
    * Remove some comments.
    
    * Refine tensorrt unittests
    
    * Add XrtConfig to deal with xla and tensorrt configurations.
    
    Conflicts:
    	oneflow/xrt/api.cpp
    
    * Update tensorflow.cmake to avoid applying the patch repeatedly.
    
    * Remove XrtConfig Option, and fix xrt unittests
    
    * Add tensorrt batch norm (#2516)
    
    * Refine xrt signature hash, and fix python configuration (#2520)
    
    * Fix XrtCompilationEnabled returns (#2524)
    
    * Fix compilation after merge dev_python
    
    * Update xrt unittests
    
    * Revert protobuf version
    
    * Remove comment FOR_RANGE
    
    * Remove unused code
    
    * Reformat
    
    * Refine job builder
    
    * Disable dump job if not debug mode
    Co-authored-by: Snow <snow3s@qq.com>
    Co-authored-by: Juncheng <liujuncheng1022@gmail.com>