1. 16 Jul, 2020 (1 commit)
  2. 13 Jul, 2020 (1 commit)
      xrt support TensorRT int8 (#2637) · d4d84e60
      Committed by Houjiang Chen
      * Add tensorrt int8 calibrator (see the calibrator sketch after this entry)
      
      * Generate calibration correctly.
      
      * Refine xrt int8 and readme
      
      * Update readme
      
      * Add xrt int8 unittest
      
      * merge develop
      
      * leaky relu test
      
      * function->global_function
      
      * fix LookupOrCreate
      
      * OF_CHECK->CHECK_OR_RETURN
      Co-authored-by: Nguo-ran <360112263@qq.com>
      d4d84e60
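The INT8 support above hinges on a calibrator that feeds TensorRT representative batches while the engine is built, so per-tensor dynamic ranges can be measured. Below is a minimal sketch of such a calibrator, assuming TensorRT's nvinfer1::IInt8EntropyCalibrator2 interface with TensorRT 7-era virtual signatures (TensorRT 8 adds noexcept); the class name, data source, and cache file are illustrative and are not OneFlow's actual xrt code.

```cpp
// Minimal INT8 entropy calibrator sketch (illustrative; not OneFlow's xrt implementation).
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

class ToyInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator2 {
 public:
  ToyInt8Calibrator(int batch_size, size_t batch_bytes, int num_batches)
      : batch_size_(batch_size), batch_bytes_(batch_bytes), num_batches_(num_batches) {
    cudaMalloc(&device_input_, batch_bytes_);
  }
  ~ToyInt8Calibrator() { cudaFree(device_input_); }

  int getBatchSize() const override { return batch_size_; }

  // Called repeatedly during calibration: fill `bindings` with device pointers
  // holding one representative batch, and return false once data is exhausted.
  bool getBatch(void* bindings[], const char* /*names*/[], int /*nbBindings*/) override {
    if (cursor_ >= num_batches_) { return false; }
    std::vector<float> host_batch(batch_bytes_ / sizeof(float), 0.5f);  // stand-in data
    cudaMemcpy(device_input_, host_batch.data(), batch_bytes_, cudaMemcpyHostToDevice);
    bindings[0] = device_input_;
    ++cursor_;
    return true;
  }

  // Persist the calibration table so later builds can skip re-calibration.
  const void* readCalibrationCache(size_t& length) override {
    std::ifstream in("calibration.cache", std::ios::binary);
    cache_.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    length = cache_.size();
    return cache_.empty() ? nullptr : cache_.data();
  }
  void writeCalibrationCache(const void* cache, size_t length) override {
    std::ofstream out("calibration.cache", std::ios::binary);
    out.write(static_cast<const char*>(cache), static_cast<std::streamsize>(length));
  }

 private:
  int batch_size_;
  size_t batch_bytes_;
  int num_batches_;
  int cursor_ = 0;
  void* device_input_ = nullptr;
  std::string cache_;
};
```

During engine build, an instance of such a calibrator would be attached through the builder config (setInt8Calibrator) together with the INT8 builder flag; how OneFlow's xrt layer wires this in is only summarized by the commit messages above.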
  3. 05 Feb, 2020 (1 commit)
  4. 26 Dec, 2019 (1 commit)
      XRT: XLA + TensorRT (#2525) · 8f3dcf94
      Committed by Houjiang Chen
      * Enable multiple definition for xla compilation in oneflow
      
      * Realize running an executable
      
      * Abstract and gather the resources needed for compilation (client, builder, etc.) into a CompilationResourceStore
      
      * Implement a separate xla allocator to avoid introducing too many tensorflow objects
      
      * Define CompilationContext separately
      
      * Running XLA by CPU mode is OK now
      
      * Make the result shape after running the executable a tuple, and refine comments
      
      * Add compilation cache to avoid recompiling every time (see the cache sketch at the end of this log)
      
      * Resolve InferSbpSignature in XlaLaunchOp
      
      * Resolve executing on a specified cuda stream
      
      * Refine XlaLaunch parallel conf, add batch matmul op
      
      * Refactor job rebuilding and fixup time shape
      
      * Update batch_dim_lbis field if XlaLaunch has any output which has batch dim
      
      * Resolve cluster rings after clustering, taking sbp policy and time shape into consideration
      
      * Add reshape op
      
      * Fix bugs
      
      * Rename CompilationContext to XlaLaunchContext, add XlaRuntimeScope to swap stream handle
      
      * Fix bugs
      
      * Update cmake to compile with xla optionally
      
      * Support more ops
      
      * Add more ops, and fix bugs
      
      * Implement XLA allocator and internal memory pool
      
      * Adaptively resize allocator memory size
      
      * Refine memory allocator
      
      * Block host if running cpu executable
      
      * Fix bug for getting scalar value
      
      * Fix result layout bug. This bug caused wrong results for transpose
      
      * Refine gelu backward
      
      * Of xla sx (#1990)
      
      * add identity xla op
      
      * Add batch gather op
      
      * Refine batch gather
      
      * fix batch gather bug and add gather op, mv identity op to unary_op
      
      * Add softmax and gather/batch_gather
      
      * Add xla softmax_grad op
      
      * Add xla layer normalization op
      
      * Add xla layer norm backward op
      
      * Alias inputs and outputs to compute in-place
      
      * Reuse output buffers when running the xla executable. Zero-copying results brings about a 10% speedup for bert on a single gpu
      
      
      * Refine xla allocator
      
      * Refine code style
      
      * Add xla reduce_sum op
      
      * Rewrite model update op to optimizer graph
      
      * Fix hang bugs
      
      * Fix input whose body is disabled in xla launch kernel
      
      * Fix self control in
      
      * Fix self control in
      
      * Add fake consume op
      
      * Fix HasAttr bug for optional field
      
      * Refine AdamOptimizer
      
      * Fix xla AdamOptimizer bugs
      
      * Add meta data in HLO instruction, and refine
      
      * Fix bugs
      
      * add reduce sum and split normal model update (#2040)
      
      * remove append_func_to_list
      
      * Rm deprecated model update and save code (#1958)
      
      * remove code
      
      * mv random gen to kernel
      
      * mk seed required
      
      * address reviews
      
      * fix unused warning
      
      * address reviews
      
      * check in more deprecation
      
      * remove ModelSaveOpConf
      
      * move out ops and modify item (#1962)
      
      * ModelInit.__oneflow_input_remote_blobs__
      
      * fix cpu only query & add error info (#1964)
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * modify check_point and add test check_point (#1963)
      
      * fix misuse of Scope/raii
      
      * op_name2variable_blob
      
      * add sigmoid test and tanh test (#1966)
      
      * add op matmul and matmul test (#1967)
      
      * rename oneflow.val to oneflow.input_blob_def
      
      * support auto var for convolution (#1972)
      
      * add op add and test add (#1973)
      
      * mv deprecated.pb_util to lib.core.pb_util
      
      * add op get_variable and get_variable test (#1975)
      
      * add op get_variable and get_variable test
      
      * modify shape extend
      
      * AllReduceSequencePass (#1976)
      
      * python2 compatibility for check_point
      
      * fix "return (blob_a, blob_b)" bug
      
      * rename: arg_passing => arg_pass
      
      * shared regst blob header between jobs (#1919)
      
      * half impl
      
      * register manager handles memory sharing for separated memory
      
      * set separated memory shared id for shared regst between jobs
      
      * half impl of python for blob
      
      * fix BUG of pod ToProto() when proto has been inited
      
      * fix BUG of infer dim0_inner_shape() in foreign_input_op
      
      * 1. PushJob copy from python can infer dim0_valid_num
      
      * add test for dynamic relu
      
      * refine test file
      
      * refine code
      
      * refine note
      
      * update test file for new interface
      
      * rename separated_header* (#1979)
      
      * some bug fixes for a train&eval job (#1978)
      
      * debugging alex net
      
      * check in test pull_multiple_blob.py
      
      * stricter check
      
      * fix bias in conv
      
      * fix various bugs
      
      * rm file
      
      * op_name in different jobs can be overloaded
      
      * fix compile bug in job_set_compile_ctx
      
      * rm cmake code for building oneflow binary
      
      * check in script (#1980)
      
      * check in script
      
      * rm used import
      
      * CudaCurrentDeviceGuard (#1977)
      
      * fix val (#1981)
      
      * Merge job set and split fw bw (#1982)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spare space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather,gather_grad,gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting unwanted files (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot deleting unwanted files  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * Merge job set and split fw bw (#1983)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spare space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather,gather_grad,gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting unwanted files (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot deleting unwanted files  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * CudaCurrentDeviceGuard (#1977)
      
      * delete tmp_split_fw_bw_train_conf (#1985)
      
      * delete tmp_split_fw_bw_train_conf
      
      * delete useless comments
      
      * fix refactor bug in layer_norm_op
      
      * minor fixes
      
      * update py script
      
      * remove code could be misleading
      
      * Fix all reduce mem sharing (#1986)
      
      * fix all reduce mem sharing
      
      * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
      
      * remove obsolete task_graph optimization
      
      * no arg_pass_job for variable_op
      
      * merge memory block id between jobs (#1910)
      
      * refine MemBlock and CriticalSection
      
      * job memory sharing strategy
      
      * revert diff in CriticalSectionDesc
      
      * Merge memory block between sub plans
      
      * Get mutual exclusion job groups
      
      * forgot to consider that memory merge happens only within the same machine
      
      * memory zone unique id
      
      * Merge Done;  merge memory block id from right to left; get memory block ids info
      
      * revert MemBlock
      
      * generate mutual exclusion job groups Done.
      
      * update for proto
      
      * add JobMemSharingStrategy in python interface
      
      * remove memorycase hash
      
      * move JobMemSharingStrategy to JobSetProto
      
      * using default strategy = parallel priority strategy
      
      * update interface of flow.job_mem_sharing_strategy
      
      * InterJobMemSharingUtil and PlanUtil
      
      * revert oneflow.h
      
      * fix bug
      
      * New implementation of merging memory block id between jobs
      
      * refine code
      
      * fix a fatal bug in std::hash<oneflow::Shape>
      
      * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
      
      * unlock critical sections as much as possible (#1994)
      
      * Bugfix actor case (#1995)
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * Bugfix actor case (#1996)
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * small regst_num for reentrant_lock (#1997)
      
      * fmt dev_job_set(#1999)
      
      * double buffer for tick_op
      
      * tick is cpu op
      
      * speedup compile time (#2000)
      
      * only merge mem_block_id between user job (#1993)
      
      * Fix keep header only (#2001)
      
      * speedup compile time
      
      * fix keep header only
      
      * remove shared model (#2003)
      
      * remove blob_mem_sharing (#2005)
      
      * No copyhd for output (#2006)
      
      * no cpu tick
      
      * no copyhd for output_op/switch_output_op
      
      * remove temp comments
      
      * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
      
      * remove clone_id (#2007)
      
      * layer norm auto var (#2004)
      
      * layer norm auto var
      
      * make of_format
      
      * bn sbp (#2008)
      
      * Refactor job completer (#1998)
      
      * fmt
      
      * refactor GenerateOpConf4Trainning
      
      * more refactor
      
      * refactor SetCtrlInOpName4VariableOp
      
      * use uniq ptr
      
      * refactor RewriteBoxingWithAllReduce
      
      * refactor MakeAllReduceSequence
      
      * refactor auto_mixed_precision
      
      * refactor DumpLogicalBlobDescAndSbpSignature
      
      * refactor group_boxing_by_dst_parallel
      
      * refactor add_keep_header_only_op_conf
      
      * refactor AutoSourceTick
      
      * refactor AddTickForTimeShape
      
      * refactor AutoSinkTick
      
      * refactor AddGlobalOutputCriticalSections
      
      * refactor SetOpTimeShape7BatchDimLbis
      
      * fix a bug in IsInterfaceTask (#2009)
      
      * Bugfix is interface task (#2010)
      
      * fix a bug in IsInterfaceTask
      
      * IsOutputInterfaceTask
      
      * copyhd-free output_op task_node
      
      * Dev job set config util (#2011)
      
      * add more if in JobConfigProtoBuilder
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * remove total batch num in config util
      
      * remove clone_id
      
      * assert has train_conf
      
      * rm debug info
      
      * Dev job set bert (#2013)
      
      * support bert
      
      * mv into bert
      
      * manual format
      
      * fix adam (#2015)
      
      * fix adam
      
      * div batch instance num before update model
      
      * remove outdate code in oneflow.cpp (#2017)
      
      * Dev split like (#2016)
      
      * no total_instance_num
      
      * add auto grad for concat
      
      * check in impl
      
      * check in bug fixes
      
      * fix bugs for split_like
      
      * split_like_op.cpp format
      
      * add normalization_autovar
      
      * Update op_conf.proto
      
      * address reviews
      
      * fix typo
      
      * constant ref
      
      * rm forward_loss_instance_num (#2018)
      
      * Bugfix job set multi device (#2019)
      
      * sbp for tick input bn
      
      * interface_blob_conf for output_op/switch_output_op
      
      * set sbp conf for tuple identity op
      
      * fix bugs when merge main plan
      
      * delete useless code
      
      * address review
      
      * fix error use of GenRepeatedBn()
      
      * ForEachConnectedComponent is easily misused
      
      * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
      
      * only for return output_op
      
      * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
      
      * return op instead of output op acts as part of user job
      
      * enable_all_reduce_group
      
      * bugfix: init RuntimeBuffersScope before Runtime
      
      * demo python scripts for enable_all_reduce_group
      
      * remove wrong optimization code
      
      * constant_conf for enable_all_reduce_group.py test
      
      * fix interface op parallel conf
      
      * fix reduce concat kernel (#2020)
      
      * binary program oneflow_worker
      
      * user_job_completer
      
      * remove unused code loss_print
      
      * rm unused code loss_acc
      
      * remove unused accuracy_acc and accuracy_print
      
      * remove input_diff/output_diff/model_diff bns
      
      * remove unused bns in gdb util
      
      * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
      
      * support mpi using style
      
      * Bugfix put job conf into plan (#2023)
      
      * put job_conf into plan
      
      * use job_name to judge isPullJob/isPushJob
      
      * fix wrong job_id error
      
      * model_init is a push job; model_save is a pull job
      
      * make cmake more reasonable (#2024)
      
      * Restructure python module and minimum setup.py (#2026)
      
      * check in updated paths
      
      * check in minimum setup tool
      
      * Dev python init multi unit (#2022)
      
      * init multi-unit by sending the oneflow_worker binary and ConfigProto to worker machines
      
      * refine var name
      
      * refine code
      
      * compile user/main job only on master
      
      * bert multi machine test code
      
      * fix bugs
      
      * JobConfs
      
      * fix bugs under WITH_RDMA
      
      * fix multi-machine bugs
      
      * delete useless code
      
      * Add xla reduce_sum op
      
      * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
      
      * feat: init_worker can run without scp-ing the binary and without uuid (#2029)
      
      * half impl of without scp bin
      
      * feat: init_worker can run without scp-ing the binary and without uuid
      
      * check in fixes (#2030)
      
      * fixbug of delete worker (#2033)
      
      * Dev dot plan (#2035)
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * Check in bug fix and multi node script (#2032)
      
      * check in fixes
      
      * check in script
      
      * fix boxing bug when setting conf with sbp
      
      * flag for iter
      
      * fixbug of delete worker
      
      * fix delete worker in script
      
      * address review, add exclusive or check
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * fix and add flags
      
      * fmt
      
      * rm debug output
      
      * more flags
      
      * check Activation
      
      * fix fc bug when num axes > 2
      
      * reverse change
      
      * fix next_batch_num (#2036)
      
      * upgrade nccl to 2.4.8 (#2037)
      
      * fix shape of fc in_diff (#2038)
      
      * Rewrite model update op to optimizer graph
      
      * Update oneflow.cmake (#2041)
      
      * better looking merged_plan to dot v1 (#2039)
      
      * better looking and more information in merged_plan.dot
      
      * refine color
      
      * Fix tick in multi node parallel (#2042) (#2047)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * Dev train conf builder (#2046)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * check in impl
      
      * fix data dir (#2054)
      
      * fix data dir
      
      * rm model load path
      
      * AssignOp (#2058)
      
      * AssignOp
      
      * remove useless code
      
      * Python ops gather and unit test (#2053)
      
      * python_ops gather and unit test
      
      * format
      
      * minor mod
      
      * SnapshotOp (#2060)
      
      * magical add and fix bug (#2061)
      
      * check in impl
      
      * add todo
      
      * Dev jxf python pooling (#2056)
      
      * run max_pool_2d without bug
      
      * correct max_pool_2d
      
      * correct average_pool_2d
      
      * minor refine
      
      * final version
      
      * rename to nn.py
      
      * add name arg to pool1d ops
      
      * refine by review
      
      * rename to _GetSequence and move it to the end of file (#2063)
      
      * fix BindInterfaceMemBlockId (#2065)
      
      * mark py file generated (#2066)
      
      * Dev gracious exit (#2057)
      
      * add more checks
      
      * make language more consistent
      
      * better error info for worker init
      
      * better error
      
      * Update setup.py (#2068)
      
      * Refine Infer APIs by returning Maybe<void> type (#2051)
      
      * Refine Infer APIs by returning Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * fix bug for split like op (#2070)
      
      * fix snapshot path (#2071)
      
      * Dev job set fix infer apis (#2072)
      
      * Refine Infer APIs by returning Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * update
      
      * add AutoGlobalStep (#2073)
      
      * rm default_initializer_conf in train conf (#2075)
      
      * Fix sigmoid op (#2076)
      
      * fix sigmoid op bug
      
      * fix bug for split like op
      
      * add sigmoid grad op
      
      * Fix bn (#2077)
      
      * fix bn
      
      * return Maybe<void> OK in lambda
      
      * fix typo
      
      * fix SigmoidGradOp (#2078)
      
      * Dev python merge job set (#2081)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix gcc warning in release (#2080)
      
      * fix gcc version in release
      
      * fix empty line
      
      * Fix adam mv initilizer (#2082)
      
      * zero constant initilzer for adam m and v
      
      * make of_format
      
      * init adam m v beta1_t and beta2_t
      
      * use value instead of initializer
      
      * const float& -> const float
      
      * update
      
      * LearningRateScheduleOp (#2079)
      
      * matmul (#2084)
      
      * matmul
      
      * np.allclose
      
      * Fix hang bugs
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
      
      * refine code for read
      
      * check py if and test
      
      * prelu (#2086)
      
      * prelu
      
      * fix
      
      * fix
      
      * template for either ptr cast (#2088)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * add template for cast
      
      * rename
      
      * Dev build and infer ctx (#2089)
      
      * add job_build_and_infer_ctx interface
      
      * lbn_with_split_hint
      
      * fix maybe macro
      
      * fix signature of Maybe<T>::Error()
      
      * job_build_and_infer_if
      
      * add c_api_util wrapper for job_build_and_infer_ctx
      
      * implement python/job_build_and_infer interface
      
      * CurJobBuildAndInferCtx_AddPlacementGroup
      
      * BuildJobAndInferCtx and Mgr c++ implementation (#2074)
      
      * job_build_and_infer_ctx_mgr
      
      * refine interface of infer_ctx_mgr
      
      * JobBuildInferCtx set job conf; add and refine error type
      
      * revert job.proto
      
      * half impl of add op in build_infer_ctx
      
      * generate op produced empty logical blob desc ; infer out blob desc interface
      
      * job_build_and_infer_ctx VERSION 1
      
      * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
      
      * maybe return
      
      * job_set hold by job_build_and_infer_ctx_mgr
      
      * check placement when infer ctx mgr leave cur job
      
      * Global New/Delete JobBuildAndInferCtxMgr
      
      * add JUST when ctx add op
      
      * remove unused job_conf.arg_op_name
      
      * fix bugs caused by python new api
      
      * fix bugs caused by lack of Global<JobDesc>
      
      * fix bugs caused by new api
      
      * refactor compiler.Compile
      
      * merge dev_python
      
      * remove unused message proto
      
      * rename api
      
      * Fix input whose body is disabled in xla launch kernel
      
      * add RemoteBlob.shape and RemoteBlob.dtype
      
      * Fix data type set default variable (#2092)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix default data type
      
      * Add conf axis for bias_add for any axis channel (#2093)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Dev jxf python initializer (#2090)
      
      * oneflow initializer
      
      * update
      
      * Fix self control in
      
      * Bugfix python alexnet (#2096)
      
      * bugfix_python_alexnet
      
      * fix
      
      * Add fake consume op
      
      * Dev global step (#2100)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * Fix optimizer initializer (#2095)
      
      * fix optimizer initializer
      
      * rename lars data temp bn
      
      * fix job_type (#2102)
      
      * Dev alexnet new api (#2094)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * check in softmax loss
      
      * nn.conv2d and nn.bias_add
      
      * fix opname
      
      * fix merge conflict
      
      * fix name
      
      * dense (#2097)
      
      * Fix jxf dense v2 (#2098)
      
      * dense
      
      * minor fix
      
      * alexnet
      
      * fix conf
      
      * quick fix
      
      * transpose
      
      * fix layers
      
      * add transpose
      
      * fix fc
      
      * fix
      
      * fix
      
      * fix data load
      
      * params check and format
      
      * rm activation in op conf
      
      * save workaround
      
      * fix avg pool 2d
      
      * fix max pool 2d
      
      * remove fc3 relu
      
      * alexnet eval
      
      * minor
      
      * replace has_batch_dim with batch_axis (#2104)
      
      * replace has_batch_dim with batch_axis
      
      * refactor OrderValue4HasBatchAxis
      
      * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
      
      * no CHECK in MatmulOp::InferBatchAxis
      
      * infer op by op_conf and  parallel_conf
      
      * wrapper Error for ErrorProto
      
      * replace ErrorUtil with Error
      
      * add OF_CHECK (#2110)
      
      * optional split_axis (#2113)
      
      * Fix HasAttr bug for optional field
      
      * undefined (#2116)
      
      * merge reduce xxx (#2119)
      
      * Update GetSbpSig() with Maybe (#2118)
      
      * fix several ops
      
      * modify all ops
      
      * format
      
      * update complete
      
      * Refine AdamOptimizer
      
      * fix (#2120)
      
      * Fix xla AdamOptimizer bugs
      
      * support scalar for reduce_xxx axis args (#2122)
      
      * Dev opt split axis (#2121)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * fix autovar split_axis (#2125)
      
      * Dev model init op (#2117)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
      
      * fix (#2127)
      
      * rm stale alexnet script (#2129)
      
      * Dev plain maybe (#2126)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * Dev simple checkpoint manager (#2128)
      
      * SimpleCheckPointManager
      
      * makedirs
      
      * fix path
      
      * save
      
      * refine
      
      * refine
      
      * fix path to numpy (#2130)
      
      * Dev plain maybe (#2132)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
      
      * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
      
      * Dev jxf merge general ops (#2131)
      
      * merge some general ops to dev_python
      
      * dense demo
      
      * rm print in test
      
      * new line at the end of file
      
      * format
      
      * fix check point
      
      * update alexnet
      
      * broadcast_xxx (#2134)
      
      * broadcast_xxx
      
      * typo
      
      * typo
      
      * rm job_conf.num_of_batches_in_snapshot
      
      * fix args (#2136)
      
      * fix proto if (#2138)
      
      * pass name to inner function (#2139)
      
      * check dropout if (#2140)
      
      * check dropout if
      
      * fix typo
      
      * Dev merge math ops (#2143)
      
      * merge math ops
      
      * new line at the end of file
      
      * merge layer norm (#2144)
      
      * variable_scope (#2141)
      
      * variable_scope
      
      * revert format
      
      * add check
      
      * Merge dropout if (#2145)
      
      * check dropout if
      
      * fix typo
      
      * fix typo
      
      * slice (#2142)
      
      * slice
      
      * add check and docstring
      
      * minor
      
      * minor
      
      * add const (#2146)
      
      * add const
      
      * fix indentation
      
      * address review
      
      * fmt
      
      * rm redundant
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * add more activations to math_ops (#2147)
      
      * fix bug (#2149)
      
      * truncated normal for bert (#2150)
      
      * Update bert for dev python (#2151)
      
      * truncated normal for bert
      
      * bert support
      
      * math.dropout to nn.dropout (#2153)
      
      * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
      
      * allow export multiple interfaces in oneflow_export decorator (#2154)
      
      * refactor job_build_and_infer_if.h
      
      * update oneflow_internal.h to use Maybe (#2135)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
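
Several entries above are about carrying errors across the `oneflow_internal` C boundary instead of crashing (`Maybe::GetDataAndSerializedErrorProto`, `OF_ERROR_STR_CHECK`, `OFStrCat()`). Below is a hedged sketch of that general pattern, with a plain struct standing in for the real `ErrorProto` protobuf message and all helper names assumed rather than taken from the codebase.

```cpp
#include <string>
#include <utility>

struct ErrorProto { std::string error_summary; };  // stand-in for the protobuf message

template<typename T>
struct Maybe {
  bool ok;
  T data;            // valid only when ok == true
  ErrorProto error;  // valid only when ok == false
};

inline std::string Serialize(const ErrorProto& e) { return e.error_summary; }

// Returns (data, serialized error); the Python side inspects the error string first
// and only then trusts the data value.
template<typename T>
std::pair<T, std::string> GetDataAndSerializedErrorProto(const Maybe<T>& maybe,
                                                         const T& default_val) {
  if (maybe.ok) { return {maybe.data, std::string("")}; }
  return {default_val, Serialize(maybe.error)};
}
```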
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp
      
      * Fix python scripts
      
      * Dev nc of internal (#2155)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      * fix: fix ctor bug
      
      * fix config_proto
      
      * rename c_api_util.Init => c_api_util.InitEnvironment
      
      * refactor compile_context.cur_job => compile_context.cur_job_conf
      
      * remove FixPackedBlobDescOfProducedRegst (#2156)
      
      * Fix snapshot root path empty log (#2158)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * Fix snapshot root path empty log
      
      * fix channel last (#2157)
      
      * fix channel last
      
      * minor
      
      * merge pb_message
      
      * add cudnn conv force algo (#2159)
      
      * Update bert for dev python (#2160)
      
      * remove old bert
      
      * set data_part_num in decoder
      
      * support model load/save args
      
      * Dev flow function (#2152)
      
      * add of.function, refactor init, refine session, and refine runtime
      
      * rm useless code
      
      * rename
      
      * update
      
      * add test
      
      * @oneflow_export JobConfigProto and Trainconf (#2162)
      
      * @oneflow_export JobConfigProto and Trainconf
      
      * remove unused config in config_util.py
      
      * remove oneflow.get_cur_job_conf_builder
      
      * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161)
      
      * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
      
      * fix config.train.model_update_conf
      
      * _GetJobConfAttr
      
      * update alexnet (#2166)
      
      * Update alexnet (#2167)
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * more reasonable conf
      
      * get variable in py layer norm
      
      * replace val in pb msg;  decode lbn string with split hint (#2165)
      
      * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
      
      * Add meta data in HLO instruction, and refine
      
      * python model parallel (#2103)
      
      * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op
      
      * merge placement group
      
      * refine code in AddAndInferOp
      
      * auto merge placement group when add op; remove mergeplacementgroup interface
      
      * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx
      
      * python blob add interface for model parallel
      
      * refine code of python blob split
      
      * remove interface of has/get_split_axis in python blob
      
      * remove interface of has_batch_dim in python blob
      
      * add check that blob split_axis can be divided by parallel num
      
      * refine code for maybe get/infer sbp
      
      * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc.
      
      * fix for plain point maybe
      
      * fix bug: add repeated placement group, remove add placement interface in hand
      
      * fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel
      
      * dev_python model parallel runnable and check correct
      
      * remove add placement group when placment scope exit
      
      * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
      
      * bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done
      
      * refine python blob_desc.split implement
      
      * refine interface decode lbn to split hint
      
      * refine auto add placement group
      
      * refine lbn with split hint decode
      
      * refine code for review
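
The model-parallel work above repeatedly mentions decoding a logical blob name (lbn) that carries a split hint. The exact hint syntax is not given in this log; the sketch below assumes a hypothetical `":S(axis)"` / `":B"` suffix purely to illustrate the idea of stripping the hint off the lbn and turning it into an SBP request.

```cpp
#include <optional>
#include <string>

struct SbpHint {
  bool is_split = false;  // true => S(axis), false => broadcast
  int split_axis = -1;
};

struct DecodedLbn {
  std::string lbn;              // plain logical blob name, hint stripped
  std::optional<SbpHint> hint;  // empty if no hint was attached
};

inline DecodedLbn DecodeLbnWithSplitHint(const std::string& lbn_with_hint) {
  DecodedLbn ret;
  const size_t pos = lbn_with_hint.rfind(':');
  if (pos == std::string::npos) { ret.lbn = lbn_with_hint; return ret; }
  ret.lbn = lbn_with_hint.substr(0, pos);
  const std::string tail = lbn_with_hint.substr(pos + 1);
  SbpHint hint;
  if (tail == "B") {
    ret.hint = hint;  // broadcast
  } else if (tail.size() >= 4 && tail[0] == 'S') {
    hint.is_split = true;
    hint.split_axis = std::stoi(tail.substr(2, tail.size() - 3));  // "S(0)" -> 0
    ret.hint = hint;
  } else {
    ret.lbn = lbn_with_hint;  // not a hint, keep the raw string untouched
  }
  return ret;
}
```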
      
      * remove AutoVar related code (#2168)
      
      * feat: remove all autovar
      
      * fix and format
      
      * fix: fix op::InferBlobDesc
      
      * add prototype (#2172)
      
      * add prototype
      
      * infer blob desc with sbp_signature
      
      * `str_a is not str_b' is buggy, use `str_a != str_b' instead
      
      * Update snapshot.cpp (#2174)
      
      * remove useless lines (#2176)
      
      * Fix bert multi nodes (#2177)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * CHECK_JUST for InferBlobDescsIf (#2178)
      
      * Fix bert multi nodes (#2180)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * config_proto -> default_config_proto
      
      * delete worker
      
      * update alexnet
      
      * remove unused op (#2182)
      
      * remove parallel_ctx when kernel init (#2185)
      
      * InferOpSbpSignature in op_graph and infer_ctx (#2175)
      
      * InferOpSbpSignature in op_graph and infer_ctx
      
      * bugfix: lambda lifetime; gen job build error with location info
      
      * refine error generation and return
      
      * refine check that lbi is valid and exists
      
      * remove parallel num in decode_of_record op/kernel (#2186)
      
      * Fix bugs
      
      * delete GlobalJobDesc() in operator/ (#2188)
      
      * rm unused test file
      
      * Refine
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Remove fake consume op
      
      * Support enable/disable XLA by set env
      
      * Merge callback, limit max operator count for each XLA subgraph
      
      * CudaEventPool
      
      * fix vector
      
      * refine
      
      * Support in-place update for optimizer
      
      * Add alias input and output to prevent reusing input with other temp buffers
      
      * Refine code style
      
      * Remove unused code
      
      * Of xla (#2237)
      
      * mv deprecated.pb_util to lib.core.pb_util
      
      * add op get_variable and get_variable test (#1975)
      
      * add op get_variable and get_variable test
      
      * modify shape extend
      
      * AllReduceSequencePass (#1976)
      
      * python2 compatibility for check_point
      
      * fix "return (blob_a, blob_b)" bug
      
      * rename: arg_passing => arg_pass
      
      * shared regst blob header between jobs (#1919)
      
      * half impl
      
      * register manager handles memory sharing for separated memory
      
      * set separated memory shared id for shared regst between jobs
      
      * half impl of python for blob
      
      * fix BUG of pod ToProto() when proto has been initialized
      
      * fix BUG of infer dim0_inner_shape() in foreign_input_op
      
      * 1. PushJob copy from python can infer dim0_valid_num
      
      * add test for dynamic relu
      
      * refine test file
      
      * refine code
      
      * refine note
      
      * update test file for new interface
      
      * rename separated_header* (#1979)
      
      * some bugs fixes for a train&eval job (#1978)
      
      * debugging alex net
      
      * check in test pull_multiple_blob.py
      
      * stricter check
      
      * fix bias in conv
      
      * fix various bugs
      
      * rm file
      
      * op_name in different jobs can be overloaded
      
      * fix compile bug in job_set_compile_ctx
      
      * rm cmake code for building oneflow binary
      
      * check in script (#1980)
      
      * check in script
      
      * rm used import
      
      * CudaCurrentDeviceGuard (#1977)
      
      * fix val (#1981)
      
      * Merge job set and split fw bw (#1982)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
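
For the `std::enable_if` item above, a self-contained illustration of dispatching one helper on the element type without partially specializing the whole class. `half` is a dummy struct here so the sketch compiles without CUDA, and the arithmetic is simplified; none of this is the real dropout kernel.

```cpp
#include <cstddef>
#include <type_traits>

struct half { unsigned short bits; };  // stand-in for CUDA __half

// Selected for ordinary floating-point element types.
template<typename T>
typename std::enable_if<std::is_floating_point<T>::value>::type
DropoutForward(std::size_t n, const T* in, const char* mask, T scale, T* out) {
  for (std::size_t i = 0; i < n; ++i) { out[i] = in[i] * static_cast<T>(mask[i]) * scale; }
}

// Selected only for the half type; the real kernel converts with
// __half2float/__float2half, faked here to keep the sketch host-only.
template<typename T>
typename std::enable_if<std::is_same<T, half>::value>::type
DropoutForward(std::size_t n, const T* in, const char* mask, float scale, T* out) {
  for (std::size_t i = 0; i < n; ++i) { out[i] = mask[i] ? in[i] : half{0}; }
  (void)scale;
}
```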
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
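
The auto-mixed-precision entries above describe a graph pass that marks list-approved ops as half precision and inserts casts where float and half regions meet. Below is a deliberately tiny sketch of that idea; the real pass also propagates through "non-list" nodes and groups inserted casts by lbn, and the node/edge types here are toy stand-ins.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Node { int id; std::string op_type; };
struct Edge { int src; int dst; };

struct Graph {
  std::vector<Node> nodes;
  std::vector<Edge> edges;
};

void InsertCastOps(Graph* graph, const std::set<std::string>& white_list) {
  // 1) Mark nodes whose op type is on the white list as half precision.
  std::set<int> half_nodes;
  for (const Node& n : graph->nodes) {
    if (white_list.count(n.op_type)) { half_nodes.insert(n.id); }
  }
  // 2) Splice a cast node into every edge crossing the float/half boundary.
  std::vector<Edge> new_edges;
  int next_id = static_cast<int>(graph->nodes.size());
  for (const Edge& e : graph->edges) {
    const bool src_half = half_nodes.count(e.src) > 0;
    const bool dst_half = half_nodes.count(e.dst) > 0;
    if (src_half == dst_half) { new_edges.push_back(e); continue; }
    const int cast_id = next_id++;
    graph->nodes.push_back({cast_id, dst_half ? "cast_f2h" : "cast_h2f"});
    new_edges.push_back({e.src, cast_id});
    new_edges.push_back({cast_id, e.dst});
  }
  graph->edges = std::move(new_edges);
}
```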
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
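
The NewKernelUtil refactor above splits device helpers into small per-domain interfaces. The sketch below mirrors only that composition pattern (a facade inheriting `DnnIf`/`BlasIf`/`ArithemeticIf` per device); the function names and bodies are invented for illustration and are not the real API.

```cpp
#include <cstddef>

enum class DeviceType { kCPU, kGPU };

template<DeviceType device> struct DnnIf;
template<DeviceType device> struct BlasIf;
template<DeviceType device> struct ArithemeticIf;

template<> struct BlasIf<DeviceType::kCPU> {
  static void Axpy(std::size_t n, float alpha, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i) { y[i] += alpha * x[i]; }
  }
};

template<> struct ArithemeticIf<DeviceType::kCPU> {
  static void Fill(std::size_t n, float value, float* dst) {
    for (std::size_t i = 0; i < n; ++i) { dst[i] = value; }
  }
};

template<> struct DnnIf<DeviceType::kCPU> {
  static void Relu(std::size_t n, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i) { y[i] = x[i] > 0.f ? x[i] : 0.f; }
  }
};

// The facade simply aggregates the per-domain interfaces for one device, so callers
// write NewKernelUtil<device>::Relu(...) without caring which interface provides it.
template<DeviceType device>
struct NewKernelUtil : public DnnIf<device>, public BlasIf<device>, public ArithemeticIf<device> {};
```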
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK CUDA version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting files unwantedly (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph, JobCompleter and Operator  2. fix bug of Snapshot deleting files unwantedly  3. refine README
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
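
The kernel-registry rework above is a classic self-registration pattern: kernels register a factory keyed by op type at static-initialization time, and a lookup creates them later. The sketch below shows the generic pattern only, not the actual OneFlow registry (whose keys also involve device type and data type).

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct Kernel { virtual ~Kernel() = default; };

using KernelCreator = std::function<std::unique_ptr<Kernel>()>;

// Meyers-singleton map so registration order across translation units is safe.
inline std::map<std::string, KernelCreator>& KernelRegistry() {
  static std::map<std::string, KernelCreator> registry;
  return registry;
}

struct KernelRegistrar {
  KernelRegistrar(const std::string& op_type, KernelCreator creator) {
    KernelRegistry()[op_type] = std::move(creator);
  }
};

#define REGISTER_KERNEL(op_type, KernelClass)                                  \
  static KernelRegistrar g_##KernelClass##_registrar(                          \
      op_type, []() { return std::unique_ptr<Kernel>(new KernelClass()); })

inline std::unique_ptr<Kernel> CreateKernel(const std::string& op_type) {
  auto it = KernelRegistry().find(op_type);
  return it == KernelRegistry().end() ? nullptr : it->second();
}

// Example registration; the runtime later calls CreateKernel("matmul").
struct MatmulKernel : public Kernel {};
REGISTER_KERNEL("matmul", MatmulKernel);
```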
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * Merge job set and split fw bw (#1983)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * do not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK CUDA version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot deleting files unwantedly (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph, JobCompleter and Operator  2. fix bug of Snapshot deleting files unwantedly  3. refine README
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * CudaCurrentDeviceGuard (#1977)
      
      * delete tmp_split_fw_bw_train_conf (#1985)
      
      * delete tmp_split_fw_bw_train_conf
      
      * delete useless comments
      
      * fix refactor bug in layer_norm_op
      
      * minor fixes
      
      * update py script
      
      * remove code that could be misleading
      
      * Fix all reduce mem sharing (#1986)
      
      * fix all reduce mem sharing
      
      * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
      
      * remove obsolete task_graph optimization
      
      * no arg_pass_job for variable_op
      
      * merge memory block id between jobs (#1910)
      
      * refine MemBlock and CriticalSection
      
      * job memory sharing strategy
      
      * revert diff in CriticalSectionDesc
      
      * Merge memory block between sub plans
      
      * Get mutual exclusion job groups
      
      * forget to consider memory merge only in same machine
      
      * memory zone unique id
      
      * Merge Done;  merge memory block id from right to left; get memory block ids info
      
      * revert MemBlock
      
      * generate mutual exclusion job groups Done.
      
      * update for proto
      
      * add JobMemSharingStrategy in python interface
      
      * remove memorycase hash
      
      * move JobMemSharingStrategy to JobSetProto
      
      * using default strategy = parallel priority strategy
      
      * update interface of flow.job_mem_sharing_strategy
      
      * InterJobMemSharingUtil and PlanUtil
      
      * revert oneflow.h
      
      * fix bug
      
      * New implementation of merging memory block ids between jobs
      
      * refine code
      
      * fix a fatal bug in std::hash<oneflow::Shape>
      
      * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
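
One entry above fixes "a fatal bug in std::hash<oneflow::Shape>", though the log does not say what the bug was. For orientation only, here is a generic `std::hash` specialization for a Shape-like type that folds every dimension into the hash, which is the property such a hash needs so that distinct shapes rarely collide in hash-based containers.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Shape { std::vector<int64_t> dim_vec; };

inline bool operator==(const Shape& a, const Shape& b) { return a.dim_vec == b.dim_vec; }

namespace std {
template<>
struct hash<Shape> {
  size_t operator()(const Shape& shape) const {
    size_t seed = shape.dim_vec.size();
    for (int64_t d : shape.dim_vec) {
      // boost-style hash_combine over every dimension
      seed ^= std::hash<int64_t>()(d) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};
}  // namespace std
```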
      
      * unlock critical sections as more as possible (#1994)
      
      * Bugfix actor case (#1995)
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * Bugfix actor case (#1996)
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * small regst_num for reentrant_lock (#1997)
      
      * fmt dev_job_set (#1999)
      
      * double buffer for tick_op
      
      * tick is cpu op
      
      * speedup compile time (#2000)
      
      * only merge mem_block_id between user job (#1993)
      
      * Fix keep header only (#2001)
      
      * speedup compile time
      
      * fix keep header only
      
      * remove shared model (#2003)
      
      * remove blob_mem_sharing (#2005)
      
      * No copyhd for output (#2006)
      
      * no cpu tick
      
      * no copyhd for output_op/switch_output_op
      
      * remove temp comments
      
      * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
      
      * remove clone_id (#2007)
      
      * layer norm auto var (#2004)
      
      * layer norm auto var
      
      * make of_format
      
      * bn sbp (#2008)
      
      * Refactor job completer (#1998)
      
      * fmt
      
      * refactor GenerateOpConf4Trainning
      
      * more refactor
      
      * refactor SetCtrlInOpName4VariableOp
      
      * use uniq ptr
      
      * refactor RewriteBoxingWithAllReduce
      
      * refactor MakeAllReduceSequence
      
      * refactor auto_mixed_precision
      
      * refactor DumpLogicalBlobDescAndSbpSignature
      
      * refactor group_boxing_by_dst_parallel
      
      * refactor add_keep_header_only_op_conf
      
      * refactor AutoSourceTick
      
      * refactor AddTickForTimeShape
      
      * refactor AutoSinkTick
      
      * refactor AddGlobalOutputCriticalSections
      
      * refactor SetOpTimeShape7BatchDimLbis
      
      * fix a bug in IsInterfaceTask (#2009)
      
      * Bugfix is interface task (#2010)
      
      * fix a bug in IsInterfaceTask
      
      * IsOutputInterfaceTask
      
      * copyhd-free output_op task_node
      
      * Dev job set config util (#2011)
      
      * add more if in JobConfigProtoBuilder
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * remove total batch num in config util
      
      * remove clone_id
      
      * assert has train_conf
      
      * rm debug info
      
      * Dev job set bert (#2013)
      
      * support bert
      
      * mv into bert
      
      * manual format
      
      * fix adam (#2015)
      
      * fix adam
      
      * divide by batch instance num before updating model
      
      * remove outdate code in oneflow.cpp (#2017)
      
      * Dev split like (#2016)
      
      * no total_instance_num
      
      * add auto grad for concat
      
      * check in impl
      
      * check in bug fixes
      
      * fix bugs for split_like
      
      * split_like_op.cpp format
      
      * add normalization_autovar
      
      * Update op_conf.proto
      
      * address reviews
      
      * fix typo
      
      * constant ref
      
      * rm forward_loss_instance_num (#2018)
      
      * Bugfix job set multi device (#2019)
      
      * sbp for tick input bn
      
      * interface_blob_conf for output_op/switch_output_op
      
      * set sbp conf for tuple identity op
      
      * fix bugs when merge main plan
      
      * delete useless code
      
      * address review
      
      * fix erroneous use of GenRepeatedBn()
      
      * ForEachConnectedComponent is easily misused
      
      * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
      
      * only for return output_op
      
      * refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
      
      * return op instead of output op acts as part of user job
      
      * enable_all_reduce_group
      
      * bugfix: init RuntimeBuffersScope before Runtime
      
      * demo python scripts for enable_all_reduce_group
      
      * remove wrong optimization code
      
      * constant_conf for enable_all_reduce_group.py test
      
      * fix interface op parallel conf
      
      * fix reduce concat kernel (#2020)
      
      * binary program oneflow_worker
      
      * user_job_completer
      
      * remove unused code loss_print
      
      * rm unused code loss_acc
      
      * remove unused accuracy_acc and accuracy_print
      
      * remove input_diff/output_diff/model_diff bns
      
      * remove unused bns in gdb util
      
      * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
      
      * support mpi using style
      
      * Bugfix put job conf into plan (#2023)
      
      * put job_conf into plan
      
      * use job_name to judge isPullJob/isPushJob
      
      * fix wrong job_id error
      
      * model_init is a push job; model_save is a pull job
      
      * make cmake more reasonable (#2024)
      
      * Restructure python module and minimum setup.py (#2026)
      
      * check in updated paths
      
      * check in minimum setup tool
      
      * Dev python init multi unit (#2022)
      
      * init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine
      
      * refine var name
      
      * refine code
      
      * compile user/main job only on master
      
      * bert multi machine test code
      
      * fix bugs
      
      * JobConfs
      
      * fix bugs under WITH_RDMA
      
      * fix multi-machine bugs
      
      * delete useless code
      
      * Add xla reduce_sum op
      
      * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
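
On the overflow fix above: if a memory-zone unique id is composed from a job id and a zone id in 32-bit arithmetic, large ids get mangled, and widening to int64_t removes the problem. The packing scheme below is hypothetical and only demonstrates the arithmetic, not the actual id layout.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Packing job_id and a per-device zone id into one integer (hypothetical scheme).
  const int64_t job_id = 30000;
  const int64_t mem_zone_id = 7;
  const int64_t wide = job_id * 100000 + mem_zone_id;  // 3000000007, fits comfortably in 64 bits
  const int32_t narrow = static_cast<int32_t>(wide);   // does not fit in 32 bits: the value is mangled
  std::cout << "64-bit id: " << wide << ", squeezed into 32 bits: " << narrow << "\n";
}
```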
      
      * feat: init_worker can run without scp'ing the binary and without uuid (#2029)
      
      * half impl of running without scp'ing the binary
      
      * feat: init_worker can run without scp'ing the binary and without uuid
      
      * check in fixes (#2030)
      
      * fix bug of delete worker (#2033)
      
      * Dev dot plan (#2035)
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * Check in bug fix and multi node script (#2032)
      
      * check in fixes
      
      * check in script
      
      * fix boxing bug when setting conf with sbp
      
      * flag for iter
      
      * fix bug of delete worker
      
      * fix delete worker in script
      
      * address review, add exclusive or check
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * fix and add flags
      
      * fmt
      
      * rm debug output
      
      * more flags
      
      * check Activation
      
      * fix fc bug when num axes > 2
      
      * reverse change
      
      * fix next_batch_num (#2036)
      
      * upgrade nccl to 2.4.8 (#2037)
      
      * fix shape of fc in_diff (#2038)
      
      * Rewrite model update op to optimizer graph
      
      * Update oneflow.cmake (#2041)
      
      * better looking merged_plan to dot v1 (#2039)
      
      * better looking and more infomation of merged_plan.dot
      
      * refine color
      
      * Fix tick in multi node parallel (#2042) (#2047)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * Dev train conf builder (#2046)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * check in impl
      
      * fix data dir (#2054)
      
      * fix data dir
      
      * rm model load path
      
      * AssignOp (#2058)
      
      * AssignOp
      
      * remove useless code
      
      * Python ops gather and unit test (#2053)
      
      * python_ops gather and unit test
      
      * format
      
      * minor mod
      
      * SnapshotOp (#2060)
      
      * magical add and fix bug (#2061)
      
      * check in impl
      
      * add todo
      
      * Dev jxf python pooling (#2056)
      
      * run max_pool_2d without bug
      
      * correct max_pool_2d
      
      * correct average_pool_2d
      
      * minor refine
      
      * final version
      
      * rename to nn.py
      
      * add name arg to pool1d ops
      
      * refine by review
      
      * rename to _GetSequence and move it to the end of file (#2063)
      
      * fix BindInterfaceMemBlockId (#2065)
      
      * mark py file generated (#2066)
      
      * Dev gracious exit (#2057)
      
      * add more checks
      
      * make language more consistant
      
      * better error info for worker init
      
      * better error
      
      * Update setup.py (#2068)
      
      * Refine Infer APIs by return Maybe<void> type (#2051)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
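
The "Refine Infer APIs by return Maybe<void> type" series above replaces hard CHECK failures with recoverable errors that callers can propagate. Below is a toy sketch of that calling convention with stand-in `Maybe`, `CHECK_OR_RETURN`, and `JUST`; these are not the real macros, only an illustration of the shape of the API.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

struct Error { std::string msg; };

template<typename T = void>
struct Maybe {
  std::shared_ptr<Error> error;  // null means OK
  bool IsOk() const { return error == nullptr; }
  static Maybe Ok() { return Maybe{nullptr}; }
  static Maybe Err(std::string msg) {
    return Maybe{std::make_shared<Error>(Error{std::move(msg)})};
  }
};

// Instead of crashing, a failed check becomes an error value returned to the caller.
#define CHECK_OR_RETURN(cond) \
  if (!(cond)) return Maybe<>::Err("check failed: " #cond)

// Callers propagate failures upward without inspecting them.
#define JUST(expr)                                  \
  do {                                              \
    auto maybe_ = (expr);                           \
    if (!maybe_.IsOk()) { return maybe_; }          \
  } while (0)

struct BlobDesc { std::vector<int64_t> shape; };

Maybe<> InferMatmulBlobDesc(const BlobDesc& a, const BlobDesc& b, BlobDesc* out) {
  CHECK_OR_RETURN(a.shape.size() == 2 && b.shape.size() == 2);
  CHECK_OR_RETURN(a.shape[1] == b.shape[0]);
  out->shape = {a.shape[0], b.shape[1]};
  return Maybe<>::Ok();
}

Maybe<> InferModel(const BlobDesc& x, const BlobDesc& w, BlobDesc* y) {
  JUST(InferMatmulBlobDesc(x, w, y));  // propagate failures to the caller
  return Maybe<>::Ok();
}
```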
      
      * fix bug for split like op (#2070)
      
      * fix snapshot path (#2071)
      
      * Dev job set fix infer apis (#2072)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * update
      
      * add AutoGlobalStep (#2073)
      
      * rm default_initializer_conf in train conf (#2075)
      
      * Fix sigmoid op (#2076)
      
      * fix sigmoid op bug
      
      * fix bug for split like op
      
      * add sigmoid grad op
      
      * Fix bn (#2077)
      
      * fix bn
      
      * return Maybe<void> OK in lambda
      
      * fix typo
      
      * fix SigmoidGradOp (#2078)
      
      * Dev python merge job set (#2081)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix gcc warning in release (#2080)
      
      * fix gcc version in release
      
      * fix empty line
      
      * Fix adam mv initilizer (#2082)
      
      * zero constant initilzer for adam m and v
      
      * make of_format
      
      * init adam m v beta1_t and beta2_t
      
      * use value instead of initializer
      
      * const float& -> const float
      
      * update
      
      * LearningRateScheduleOp (#2079)
      
      * matmul (#2084)
      
      * matmul
      
      * np.allclose
      
      * Fix hang bugs
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
      
      * refine code for read
      
      * check py if and test
      
      * prelu (#2086)
      
      * prelu
      
      * fix
      
      * fix
      
      * template for either ptr cast (#2088)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * add template for cast
      
      * rename
      
      * Dev build and infer ctx (#2089)
      
      * add job_build_and_infer_ctx interface
      
      * lbn_with_split_hint
      
      * fix maybe macro
      
      * fix signature of Maybe<T>::Error()
      
      * job_build_and_infer_if
      
      * add c_api_util wrapper for job_build_and_infer_ctx
      
      * implement python/job_build_and_infer interface
      
      * CurJobBuildAndInferCtx_AddPlacementGroup
      
      * BuildJobAndInferCtx and Mgr C++ implementation (#2074)
      
      * job_build_and_infer_ctx_mgr
      
      * refine interface of infer_ctx_mgr
      
      * JobBuildInferCtx set job conf; add and refine error type
      
      * revert job.proto
      
      * half impl of add op in build_infer_ctx
      
      * generate op-produced empty logical blob desc; infer out blob desc interface
      
      * job_build_and_infer_ctx VERSION 1
      
      * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
      
      * maybe return
      
      * job_set held by job_build_and_infer_ctx_mgr
      
      * check placement when infer ctx mgr leaves cur job
      
      * Global New/Delete JobBuildAndInferCtxMgr
      
      * add JUST when ctx add op
      
      * remove unused job_conf.arg_op_name
      
      * fix bugs caused by python new api
      
      * fix bugs caused by lack of Global<JobDesc>
      
      * fix bugs caused by new api
      
      * refactor compiler.Compile
      
      * merge dev_python
      
      * remove unused message proto
      
      * rename api
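
The build-and-infer context described above incrementally adds ops and caches inferred blob descriptions by logical blob name, so later ops and the Python frontend can query shapes before anything runs. The sketch below is a heavily reduced version of that flow with toy types and a placeholder inference rule; it is not the real `JobBuildAndInferCtx`.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct BlobDesc { std::vector<long long> shape; };

struct OpConf {
  std::string name;
  std::vector<std::string> input_lbns;  // logical blob names this op consumes
  std::vector<std::string> output_bns;  // output blob names, scoped by op name
};

class JobBuildAndInferCtx {
 public:
  // Returns the inferred descs of the op's outputs; unknown inputs are an error.
  std::vector<BlobDesc> AddAndInferOp(const OpConf& op) {
    std::vector<const BlobDesc*> inputs;
    for (const std::string& lbn : op.input_lbns) {
      auto it = lbn2desc_.find(lbn);
      if (it == lbn2desc_.end()) { throw std::runtime_error("unknown lbn: " + lbn); }
      inputs.push_back(&it->second);
    }
    std::vector<BlobDesc> outputs;
    for (const std::string& obn : op.output_bns) {
      // Placeholder inference: outputs copy the first input's shape (or are scalars).
      BlobDesc desc = inputs.empty() ? BlobDesc{{1}} : *inputs.front();
      lbn2desc_[op.name + "/" + obn] = desc;
      outputs.push_back(desc);
    }
    return outputs;
  }

  const BlobDesc& LookupBlobDesc(const std::string& lbn) const { return lbn2desc_.at(lbn); }

 private:
  std::map<std::string, BlobDesc> lbn2desc_;
};
```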
      
      * Fix input which body is disabled in xla launch kernel
      
      * add RemoteBlob.shape and RemoteBlob.dtype
      
      * Fix data type set default variable (#2092)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix default data type
      
      * Add conf axis for bias_add for any axis channel (#2093)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Dev jxf python initializer (#2090)
      
      * oneflow initializer
      
      * update
      
      * Fix self control in
      
      * Bugfix python alexnet (#2096)
      
      * bugfix_python_alexnet
      
      * fix
      
      * Add fake consume op
      
      * Dev global step (#2100)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
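
The global-step and learning-rate-schedule entries above boil down to computing the current learning rate from `train_step` inside a dedicated op/kernel. The schedule below (linear warmup followed by exponential decay) is a generic illustration of that computation, not the schedule the commits actually implemented; all parameter names are invented.

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

double ScheduledLearningRate(int64_t train_step, double base_lr, int64_t warmup_steps,
                             double decay_rate, int64_t decay_steps) {
  if (train_step < warmup_steps) {
    // Linear warmup from 0 to base_lr.
    return base_lr * static_cast<double>(train_step + 1) / static_cast<double>(warmup_steps);
  }
  // Exponential decay after warmup.
  const double decay_exponent =
      static_cast<double>(train_step - warmup_steps) / static_cast<double>(decay_steps);
  return base_lr * std::pow(decay_rate, decay_exponent);
}

int main() {
  for (int64_t step : {0, 500, 1000, 5000, 20000}) {
    std::cout << "step " << step << " -> lr "
              << ScheduledLearningRate(step, 1e-4, 1000, 0.9, 10000) << "\n";
  }
}
```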
      
      * Fix optimizer initializer (#2095)
      
      * fix optimizer initializer
      
      * rename lars data temp bn
      
      * fix job_type (#2102)
      
      * Dev alexnet new api (#2094)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * check in softmax loss
      
      * nn.conv2d and nn.bias_add
      
      * fix opname
      
      * fix merge conflict
      
      * fix name
      
      * dense (#2097)
      
      * Fix jxf dense v2 (#2098)
      
      * dense
      
      * minor fix
      
      * alexnet
      
      * fix conf
      
      * quick fix
      
      * transpose
      
      * fix layers
      
      * add transpose
      
      * fix fc
      
      * fix
      
      * fix
      
      * fix data laod
      
      * params check and format
      
      * rm activation in op conf
      
      * save workaround
      
      * fix avg pool 2d
      
      * fix max pool 2d
      
      * remove fc3 relu
      
      * alexnet eval
      
      * minor
      
      * replace has_batch_dim with batch_axis (#2104)
      
      * replace has_batch_dim with batch_axis
      
      * refactor OrderValue4HasBatchAxis
      
      * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
      
      * no CHECK in MatmulOp::InferBatchAxis
      
      * infer op by op_conf and  parallel_conf
      
      * wrapper Error for ErrorProto
      
      * replace ErrorUtil with Error
      
      * add OF_CHECK (#2110)
      
      * optional split_axis (#2113)
      
      * Fix HasAttr bug for optional field
      
      * undefined (#2116)
      
      * merge reduce xxx (#2119)
      
      * Update GetSbpSig() with Maybe (#2118)
      
      * fix several ops
      
      * modify all ops
      
      * format
      
      * update complete
      
      * Refine AdamOptimizer
      
      * fix (#2120)
      
      * Fix xla AdamOptimizer bugs
      
      * support scalar for reduce_xxx axis args (#2122)
      
      * Dev opt split axis (#2121)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * fix autovar split_axis (#2125)
      
      * Dev model init op (#2117)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
      
      * fix (#2127)
      
      * rm stale alexnet script (#2129)
      
      * Dev plain maybe (#2126)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * Dev simple checkpoint manager (#2128)
      
      * SimpleCheckPointManager
      
      * makedirs
      
      * fix path
      
      * save
      
      * refine
      
      * refine
      
      * fix path to numpy (#2130)
      
      * Dev plain maybe (#2132)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
      
      * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
      
      * Dev jxf merge general ops (#2131)
      
      * merge some general ops to dev_python
      
      * dense demo
      
      * rm print in test
      
      * new line at the end of file
      
      * format
      
      * fix check point
      
      * update alexnet
      
      * broadcast_xxx (#2134)
      
      * broadcast_xxx
      
      * typo
      
      * typo
      
      * rm job_conf.num_of_batches_in_snapshot
      
      * fix args (#2136)
      
      * fix proto if (#2138)
      
      * pass name to inner function (#2139)
      
      * check dropout if (#2140)
      
      * check dropout if
      
      * fix typo
      
      * Dev merge math ops (#2143)
      
      * merge math ops
      
      * new line at the end of file
      
      * merge layer norm (#2144)
      
      * variable_scope (#2141)
      
      * variable_scope
      
      * revert format
      
      * add check
      
      * Merge dropout if (#2145)
      
      * check dropout if
      
      * fix typo
      
      * fix typo
      
      * slice (#2142)
      
      * slice
      
      * add check and docstring
      
      * minor
      
      * minor
      
      * add const (#2146)
      
      * add const
      
      * fix indentation
      
      * address review
      
      * fmt
      
      * rm redundant
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * add more activations to math_ops (#2147)
      
      * fix bug (#2149)
      
      * truncated normal for bert (#2150)
      
      * Update bert for dev python (#2151)
      
      * truncated normal for bert
      
      * bert support
      
      * math.dropout to nn.dropout (#2153)
      
      * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
      
      * allow export multiple interfaces in oneflow_export decorator (#2154)
      
      * refactor job_build_and_infer_if.h
      
      * update oneflow_internal.h to use Maybe (#2135)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp
      
      * Fix python scripts
      
      * Dev nc of internal (#2155)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      * fix: fix ctor bug
      
      * fix config_proto
      
      * rename c_api_util.Init => c_api_util.InitEnvironment
      
      * refactor compile_context.cur_job => compile_context.cur_job_conf
      
      * remove FixPackedBlobDescOfProducedRegst (#2156)
      
      * Fix snapshot root path empty log (#2158)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * Fix snapshot root path empty log
      
      * fix channel last (#2157)
      
      * fix channel last
      
      * minor
      
      * merge pb_message
      
      * add cudnn conv force algo (#2159)
      
      * Update bert for dev python (#2160)
      
      * remove old bert
      
      * set data_part_num in decoder
      
      * support model load/save args
      
      * Dev flow function (#2152)
      
      * add of.function, refactor init, refine session, and refine runtime
      
      * rm useless code
      
      * rename
      
      * update
      
      * add test
      
      * @oneflow_export JobConfigProto and Trainconf (#2162)
      
      * @oneflow_export JobConfigProto and Trainconf
      
      * remove unused config in config_util.py
      
      * remove oneflow.get_cur_job_conf_builder
      
      * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161)
      
      * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
      
      * fix config.train.model_update_conf
      
      * _GetJobConfAttr
      
      * update alexnet (#2166)
      
      * Update alexnet (#2167)
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * more reasonable conf
      
      * get variable in py layer norm
      
      * replace val in pb msg;  decode lbn string with split hint (#2165)
      
      * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
      
      * Add meta data in HLO instruction, and refine
      
      * python model parallel (#2103)
      
      * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op
      
      * merge placement group
      
      * refine code in AddAndInferOp
      
      * auto merge placement group when add op; remove mergeplacementgroup interface
      
      * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx
      
      * python blob add interface for model parallel
      
      * refine code of python blob split
      
      * remove interface of has/get_split_axis in python blob
      
      * remove interface of has_batch_dim in python blob
      
      * add check blob split_axis can be divide by parallel num
      
      * refine code for maybe get/infer sbp
      
      * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc etc.
      
      * fix for plain point maybe
      
      * fix bug: add repeated placement group, remove add placement interface in hand
      
      * fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel
      
      * dev_python model parallel runnable and check correct
      
      * remove add placement group when placement scope exit
      
      * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
      
      * bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done
      
      * refine python blob_desc.split implement
      
      * refine interface decode lbn to split hint
      
      * refine auto add placement group
      
      * refine lbn with split hint decode
      
      * refine code for review
      
      * remove AutoVar related code (#2168)
      
      * feat: remove all autovar
      
      * fix and format
      
      * fix: fix op::InferBlobDesc
      
      * add prototype (#2172)
      
      * add prototype
      
      * infer blob desc with sbp_signature
      
      * `str_a is not str_b' is buggy, use `str_a != str_b' instead
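
      A minimal Python sketch of the pitfall this commit describes (the strings here are illustrative, not taken from the code being fixed): `is` / `is not` compare object identity, while `==` / `!=` compare values, so two equal strings can still fail an identity check.

      ```python
      a = "split:0"
      b = "".join(["split:", "0"])  # equal value, but a distinct string object

      print(a == b)      # True  -- value comparison, the intended check
      print(a is b)      # False -- identity comparison, depends on object reuse/interning
      print(a is not b)  # True even though the values match, which is the bug
      ```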
      
      * Update snapshot.cpp (#2174)
      
      * remove useless lines (#2176)
      
      * Fix bert multi nodes (#2177)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * CHECK_JUST for InferBlobDescsIf (#2178)
      
      * Fix bert multi nodes (#2180)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * config_proto -> default_config_proto
      
      * delete worker
      
      * update alexnet
      
      * remove unused op (#2182)
      
      * remove parallel_ctx when kernel init (#2185)
      
      * InferOpSbpSignature in op_graph and infer_ctx (#2175)
      
      * InferOpSbpSignature in op_graph and infer_ctx
      
      * bugfix: lambda lifetime; gen job build error add location info
      
      * refine error generation and return
      
      * refine check lbi valid and exists
      
      * remove parallel num in decode_of_record op/kernel (#2186)
      
      * Fix bugs
      
      * delete GlobalJobDesc() in operator/ (#2188)
      
      * rm unused test file
      
      * Refine
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Remove fake consume op
      
      * Support enable/disable XLA by set env
      
      * Merge callback, limit max operator count for each XLA subgraph
      
      * CudaEventPool
      
      * fix vector
      
      * refine
      
      * Support in-place update for optimizer
      
      * Add alias input and output to prevent reusing input with other temp buffers
      
      * Refine code style
      
      * Remove unused code
      
      * Fix static cublas library and xla link conflict
      
      * Fix cublas link conflict with tensorflow
      
      * Fix different connection kinds for multiple gpu cards (#2282)
      
      * Refine xla cluster algo (#2289)
      
      * Fix different connection kinds for multiple gpu cards
      
      * Fix bug for multiple outputs consumed by one node
      
      * Refine cluster algo
      
      * Refine MarkClusterId pass and ReduceSplit task node (#2314)
      
      * Fix different connection kinds for multiple gpu cards
      
      * Fix bug for multiple outputs consumed by one node
      
      * Refine cluster algo
      
      * Determine fusion disabled edges
      
      * update
      
      * Produce multiple registers on edges for ReduceSplit task node.
      Fix new allocator by stream id.
      
      * Refine MarkClusterId pass
      
      * Clustering subgraph with reverse ordering is better
      
      * Support strict clustering by taking dependencies into consideration
      
      * Translate rebuild job and rewrite optimizer into passes, and refine code style
      
      * Fix spell error
      
      * Update cmake
      
      * Merge branch dev_python (#2321)
      
      * Dev res50 new api (#2173)
      
      * check in script
      
      * runable
      
      * fix multinode
      
      * fix and real train
      
      * fix param data_format
      
      * fix truncated normal
      
      * quick fix multi node launch (#2193)
      
      * Dev reshape sbp (#2192)
      
      * reshape sbp
      
      * more check for reshape conf
      
      * fix error CHECK
      
      * refactor reshape
      
      * fix reshape like op
      
      * support naive case of s0
      
      * refine
      
      * rm redundant code
      
      * more generous check for equal element cnt
      
      * restore empty line
      
      * add GatherMs0Grad op (#2191)
      
      * support for gather with s(0) `in'
      
      * add gather_ms0_op
      
      * fix bugs in message GatherMs0OpConf and GatherMs0Kernel
      
      * only (B, S(0)) -> P supported for gather_ms0 op
      
      * add GatherMs0Grad op
      
      * minor fix
      
      * refine code
      
      * bugfix and update gather test case
      
      * add concat op and pass the test (#2067)
      
      * add concat op and pass the test
      
      * add vgg job_conf
      
      * model compared and confirmed to be the same as the old one
      
      * rm unnecessary file
      
      * Update array_ops.py
      
      * mv file
      
      * get rid of ternary operator (#2195)
      
      * Dev reshape util struct (#2194)
      
      * check in changes
      
      * rm file
      
      * minor fix
      
      * Merge network files of 2 cnns (#2196)
      
      * add inceptionV3
      
      * check in vgg16
      
      * add cnns test scripts for dev_python (#2170)
      
      * add cnns test scripts for dev_python
      
      * add alexnet test scripts
      
      * add resnet50
      
      * add inceptionv3
      
      * add resnet50
      
      * add vgg16
      
      * first version of run_cnns_test.py
      
      * remove old files
      
      * unsorted_segment_sum (#2198)
      
      * oneflow.unsorted_segment_sum (#2199)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * Dev batch unsorted segment sum (#2200)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * rename UnsortedSegmentSum to BatchUnsortedSegmentSum
      
      * rename: batch_unsorted_* => unsorted_batch_*
      
      * unsorted_segment_sum (#2201)
      
      * unsorted_segment_sum
      
      * fix job_completer/unsorted_segment_sum_grad.cpp
      
      * more check for unsorted_segment_sum batch_axis
      
      * remove FixParallelDesc (#2202)
      
      * rm KernelIfWithModel KernelIfWithActivation (#2203)
      
      * remove KernelIfWithActivation
      
      * remove KernelIfWithModel
      
      * rm blob header kLossInstanceNum (#2204)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * fix warning: return string reference to temporary (#2212)
      
      * docker build support (#2002)
      
      * update cmake files
      
      * check in files
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * shrink ctx size
      
      * fix script
      
      * fix wheel build
      
      * fix wheel build not adding .so (#2052)
      
      * lower cmake version bar
      
      * rm more files
      
      * keep build dir
      
      * check in test bash script
      
      * fix
      
      * Dev docker sx (#2124)
      
      * add python2 docker env
      
      * rm old docker files
      
      * update repository
      
      * add ARG CUDA and USE_PYTHON_3_OR_2
      
      * reform files
      
      * update
      
      * rm log doesn't print when there is cache
      
      * use default arg in dockerfile
      
      * better py 2 or 3 condition
      
      * add default
      
      * use if
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * add resnet50 in model (#2217)
      
      * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)
      
      * remove parallel policy
      
      * rm FC/rnn/embedding_look_up op/kernel
      
      * add check data parallel for conv/layer_norm op
      
      * bugfix: bias add + use math_add when batch size = 1
      
      * fix InferBatchAxis (#2220)
      
      * sync with bert_benchmark (#2221)
      
      * sync with bert_benchmark
      
      * rename run.sh
      
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
      * delete unnecessary multiply_grad class
      
      * speed up generate time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel to high performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
      * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
      * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
      
      * Fix random decode (#2252)
      
      * add decode random
      
      * fix decode random actor
      
      * Dev pr boxing v2 (#2248)
      
      * NcclDeviceCtx
      
      * include naive_actor
      
      * refine
      
      * use_boxing_v2
      
      * config.use_boxing_v2
      
      * SubTskGphBuilder
      
      * fix
      
      * hash<oneflow::MemoryCase>
      
      * Maybe<void>
      
      * ChainSubTskGphBuilder
      
      * SliceBoxingOp
      
      * return ok
      
      * SliceBoxingKernel
      
      * SliceBoxingActor
      
      * kSliceBoxing
      
      * nccl boxing op
      
      * nccl actor
      
      * REGISTER_OP
      
      * GetMsgFromCustomizedConf
      
      * NcclBoxingTaskNode
      
      * BldSubTskGphByBoxingV2
      
      * NcclBoxingSubTskGphBuilder
      
      * fix
      
      * fix
      
      * NcclKernel
      
      * ParallelContext
      
      * REGISTER_ACTOR
      
      * fix rank set
      
      * IsNcclTaskType
      
      * limit
      
      * 1024
      
      * multi thread reader
      
      * thread_num
      
      * IsKernelLaunchSynchronized
      
      * refine
      
      * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx
      
      * MakeHostMemCase
      
      * NcclBldSubTskGph
      
      * remove useless code
      
      * use_boxing_v2
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * cmake find python note when version less than 3.14 (#2286)
      
      * fix bug: reduce split kernel inplace (#2297)
      
      * Dev bias add (#2299)
      
      * use bias add
      
      * fix
      
      * bias_add
      
      * bias add half
      
      * fix
      
      * reinterpret_cast
      
      * fix half
      
      * HALF
      
      * fix
      
      * ADD_DEFAULT_KERNEL_CREATOR
      
      * fix
      
      * format
      
      * Fix dev python test (#2294)
      
      * add decode random
      
      * fix decode random actor
      
      * fix dev_python test scripts
      
      * fix batch_size test scripts
      
      * fix
      
      * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
      
      * MemBlockProto and ChunkProto
      
      * create mem block and chunk after improver
      
      * interface merge mem block and chunk between sub plans
      
      * merge chunk between jobs for memory reuse
      
      * use memory zone unique id to replace memory case hash
      
      * merge interface op mem block between jobs for mem shared
      
      * gen GlobalCriticalSection by mem block id and chunk id
      
      * check mem block and chunk valid before runtime
      
      * Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst
      
      * fix bug; and pass test
      
      * fix bug: init chunk_id_count in id_manager
      
      * reuse copyHd out mem between jobs
      
      * PushPlan and PullPlan for memblock and chunk
      
      * refine merge mem block / chunk in oneflow.cpp
      
      * at(i);
      
      * GetOpName2JobId2TaskProtos functional
      
      * using output ptr; pass test AlexNet and Resnet
      
      * Fix xla reshape op
      
      * Merge upstream of_xla (#2322)
      
      * Dev res50 new api (#2173)
      
      * check in script
      
      * runable
      
      * fix multinode
      
      * fix and real train
      
      * fix param data_format
      
      * fix truncated normal
      
      * quick fix multi node launch (#2193)
      
      * Dev reshape sbp (#2192)
      
      * reshape sbp
      
      * more check for reshape conf
      
      * fix error CHECK
      
      * refactor reshape
      
      * fix reshape like op
      
      * support naive case of s0
      
      * refine
      
      * rm redundant code
      
      * more generous check for equal element cnt
      
      * restore empty line
      
      * add GatherMs0Grad op (#2191)
      
      * support for gather with s(0) `in'
      
      * add gather_ms0_op
      
      * fix bugs in message GatherMs0OpConf and GatherMs0Kernel
      
      * only (B, S(0)) -> P supported for gather_ms0 op
      
      * add GatherMs0Grad op
      
      * minor fix
      
      * refine code
      
      * bugfix and update gather test case
      
      * add concat op and pass the test (#2067)
      
      * add concat op and pass the test
      
      * add vgg job_conf
      
      * model compared and confirmed to be the same as the old one
      
      * rm unnecessary file
      
      * Update array_ops.py
      
      * mv file
      
      * get rid of ternary operator (#2195)
      
      * Dev reshape util struct (#2194)
      
      * check in changes
      
      * rm file
      
      * minor fix
      
      * Merge network files of 2 cnns (#2196)
      
      * add inceptionV3
      
      * check in vgg16
      
      * add cnns test scripts for dev_python (#2170)
      
      * add cnns test scripts for dev_python
      
      * add alexnet test scripts
      
      * add resnet50
      
      * add inceptionv3
      
      * add resnet50
      
      * add vgg16
      
      * first version of run_cnns_test.py
      
      * remove old files
      
      * unsorted_segment_sum (#2198)
      
      * oneflow.unsorted_segment_sum (#2199)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * Dev batch unsorted segment sum (#2200)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * rename UnsortedSegmentSum to BatchUnsortedSegmentSum
      
      * rename: batch_unsorted_* => unsorted_batch_*
      
      * unsorted_segment_sum (#2201)
      
      * unsorted_segment_sum
      
      * fix job_completer/unsorted_segment_sum_grad.cpp
      
      * more check for unsorted_segment_sum batch_axis
      
      * remove FixParallelDesc (#2202)
      
      * rm KernelIfWithModel KernelIfWithActivation (#2203)
      
      * remove KernelIfWithActivation
      
      * remove KernelIfWithModel
      
      * rm blob header kLossInstanceNum (#2204)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * fix warning: return string reference to temporary (#2212)
      
      * docker build support (#2002)
      
      * update cmake files
      
      * check in files
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * shrink ctx size
      
      * fix script
      
      * fix wheel build
      
      * fix wheel build not adding .so (#2052)
      
      * lower cmake version bar
      
      * rm more files
      
      * keep build dir
      
      * check in test bash script
      
      * fix
      
      * Dev docker sx (#2124)
      
      * add python2 docker env
      
      * rm old docker files
      
      * update repository
      
      * add ARG CUDA and USE_PYTHON_3_OR_2
      
      * reform files
      
      * update
      
      * rm log doesn't print when there is cache
      
      * use default arg in dockerfile
      
      * better py 2 or 3 condition
      
      * add default
      
      * use if
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * add resnet50 in model (#2217)
      
      * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)
      
      * remove parallel policy
      
      * rm FC/rnn/embedding_look_up op/kernel
      
      * add check data parallel for conv/layer_norm op
      
      * bugfix: bias add + use math_add when batch size = 1
      
      * fix InferBatchAxis (#2220)
      
      * sync with bert_benchmark (#2221)
      
      * sync with bert_benchmark
      
      * rename run.sh
      
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
      * delete unnecessary multiply_grad class
      
      * speed up generate time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel to high performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
      * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
      * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
      
      * Fix random decode (#2252)
      
      * add decode random
      
      * fix decode random actor
      
      * Dev pr boxing v2 (#2248)
      
      * NcclDeviceCtx
      
      * include naive_actor
      
      * refine
      
      * use_boxing_v2
      
      * config.use_boxing_v2
      
      * SubTskGphBuilder
      
      * fix
      
      * hash<oneflow::MemoryCase>
      
      * Maybe<void>
      
      * ChainSubTskGphBuilder
      
      * SliceBoxingOp
      
      * return ok
      
      * SliceBoxingKernel
      
      * SliceBoxingActor
      
      * kSliceBoxing
      
      * nccl boxing op
      
      * nccl actor
      
      * REGISTER_OP
      
      * GetMsgFromCustomizedConf
      
      * NcclBoxingTaskNode
      
      * BldSubTskGphByBoxingV2
      
      * NcclBoxingSubTskGphBuilder
      
      * fix
      
      * fix
      
      * NcclKernel
      
      * ParallelContext
      
      * REGISTER_ACTOR
      
      * fix rank set
      
      * IsNcclTaskType
      
      * limit
      
      * 1024
      
      * multi thread reader
      
      * thread_num
      
      * IsKernelLaunchSynchronized
      
      * refine
      
      * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx
      
      * MakeHostMemCase
      
      * NcclBldSubTskGph
      
      * remove useless code
      
      * use_boxing_v2
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * cmake find python note when version less than 3.14 (#2286)
      
      * fix bug: reduce split kernel inplace (#2297)
      
      * Dev bias add (#2299)
      
      * use bias add
      
      * fix
      
      * bias_add
      
      * bias add half
      
      * fix
      
      * reinterpret_cast
      
      * fix half
      
      * HALF
      
      * fix
      
      * ADD_DEFAULT_KERNEL_CREATOR
      
      * fix
      
      * format
      
      * Fix dev python test (#2294)
      
      * add decode random
      
      * fix decode random actor
      
      * fix dev_python test scripts
      
      * fix batch_size test scripts
      
      * fix
      
      * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
      
      * MemBlockProto and ChunkProto
      
      * create mem block and chunk after improver
      
      * interface merge mem block and chunk between sub plans
      
      * merge chunk between jobs for memory reuse
      
      * use memory zone unique id to replace memory case hash
      
      * merge interface op mem block between jobs for mem shared
      
      * gen GlobalCriticalSection by mem block id and chunk id
      
      * check mem block and chunk valid before runtime
      
      * Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst
      
      * fix bug; and pass test
      
      * fix bug: init chunk_id_count in id_manager
      
      * reuse copyHd out mem between jobs
      
      * PushPlan and PullPlan for memblock and chunk
      
      * refine merge mem block / chunk in oneflow.cpp
      
      * at(i);
      
      * GetOpName2JobId2TaskProtos functional
      
      * using output ptr; pass test AlexNet and Resnet
      
      * Dev cuda 9 arch 70 (#2318)
      
      * kCudaAlignSize = 256
      
      * always compute_70
      
      * __CUDA_API_VERSION >= 10000
      
      * __CUDA_API_VERSION >= 10000
      
      * disable_all_reduce_sequence
      
      * Fix xla reshape op
      
      * Fix compilation without xla
      
      * Remove useless code and fix data type mismatch in field desc (#2326)
      
      * Remove useless code
      
      * Refine code style
      
      * Fix data type mismatch in field desc
      
      * Update README.md (#2335)
      
      * Refine code style (#2336)
      
      * Update XLA usage document (#2337)
      
      * Update XLA usage document
      
      * Fix mistakes
      
      * Add xla clang-format and format codestyle (#2340)
      
      * Revert "Add xla clang-format and format codestyle (#2340)" (#2341)
      
      This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.
      
      * Add xla clang-format and format codestyle (#2342)
      
      * Add xla clang-format and format codestyle
      
      * Fix header file missing
      
      * Of xla sx (#2334)
      
      * add gather grad op and pass testing
      
      * rm check
      
      * done batch gather grad
      
      * pass test
      
      * modify according to the review
      
      * add unsorted_segment_sum and refine unsorted_batch_segment_sum
      
      * reform according to review
      
      * reformat according to the clang-format and rm reference to the temp object
      
      * Pick step0 and step1 new commits (#2346)
      
      * Add xla clang-format and format codestyle
      
      * Fix header file missing
      
      * Modify codes to support XLA
      
      Conflicts:
      	oneflow/core/job/job_builder.cpp
      	oneflow/core/job/job_builder.h
      	oneflow/core/operator/op_conf.proto
      
      * Fix a bug for building subgraph although it won't lead to wrong results (#2347)
      
      * Fix setting is_mutable in xla launch op (#2349)
      
      * Change directory xla to xrt, apply patch if building with xla
      
      * Refactor
      
      * Add infer shape pass, and Refactor launch kernel, graph compiler
      
      * Refine code style, add xla executable and graph compiler
      
      * Rename platform.proto as types.proto
      
      * change OpCompiler to OpKernel, complete xla graph compiler
      
      * Fix compilation bugs and add allocator, now xla compilation is ok
      
      * Add xla executable runtime
      
      * Add executable run scope to support launch kernel on specific stream.
      
      * Fix infer shape pass, and revert cuda event pool
      
      * Refactor graph building with attaching argument metadata.
      
      * Set mutability if rebuilding job
      
      * Set device ordinal correctly
      
      * Refine DelOps
      
      * Refine Argument definition and abstract function as subgraph
      
      * Fix infer shape in xrt launch op and launch kernel.
      
      * Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.
      
      * Refine code style
      
      * Rename xla Operand as XlaValue.
      
      * Complete TensorRT compiler and builder, Refine OpKernel
      
      * Pick public code changes from the new tensorrt branch.
      
      * Fix tensorrt compilation
      
      * Fake implementation of trt executable
      
      * Support selecting engine in launch kernel, refine trt executable
      
      * Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfix.
      
      * Support train phase setting for registered op kernel
      
      * Remove RewriteOptimizer pass, update xla optimizer op.
      
      * Format job builder .h and .cpp files.
      
      * Remove RewriteOptimizer pass, update xla optimizer op.
      
      * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
      
      * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
      
      * Refine code style and comment.
      
      * Refine model update inference for launch op.
      
      * Refine
      
      * Refine code style and comment.
      
      * Refine model update inference for launch op.
      
      Conflicts:
      	oneflow/xrt/kernel/op_kernel.h
      	oneflow/xrt/node_util.cpp
      	oneflow/xrt/node_util.h
      	oneflow/xrt/passes/cluster.h
      	oneflow/xrt/passes/mark_cluster_id_pass.cpp
      	oneflow/xrt/passes/rebuild_job_pass.cpp
      	oneflow/xrt/types.h
      
      * Add xrt README.md
      
      * Add use_xla_jit and use_tensorrt options in job proto
      
      * Refine code style
      
      * Fix BlobDesc getter and xla LayerNorm op for FP16
      
      * Make use_xla_jit and use_tensorrt configurable from python config and env variables.
      
      * Update benchmark
      
      * Refine xrt README and rename compile_with_xrt.h file
      
      * Update README
      
      * Revert tensorrt
      
      * Fix absl missing if building with TensorRT but without XLA
      
      * Update xrt benchmark
      
      * Disable WITH_XLA by default
      
      * Update xrt benchmark
      
      * Format xrt as core
      
      * add activation op
      
      * add softmax op
      
      * Refine code style, remove unused code
      
      * Remove duplication of XLA usage
      
      * test pass
      
      * pooling test pass
      
      * add concat op, not tested
      
      * add activation ops, test not passed
      
      * Add xla gelu unittest
      
      * add  activation op, and test  passed
      
      * add pooling op, and test passed
      
      * Fix int64 env variable
      
      * Export float16 for python
      
      * Add xla relu unittest
      
      * try to solve conv bug
      
      * add elementwise add op, test passed
      
      * add concat op, test passed
      
      * Bugfix: transfer weights from gpu to host since tensorrt requires host weights.
      
      * add op unit tests
      
      * resolve conflicts and fix softmax bug
      
      * add identity op and topk op, to test
      
      * Add xla bias add and reshape unittests
      
      * Add xla identity unittest
      
      * Add xla cast and scalar op unittests
      
      * Add xla broadcast op and transpose unittests
      
      * Add xla add, sigmoid and tanh unittests
      
      * add reduce mean op, test passed
      
      * format ops, add CHECKs, and optimize function structure
      
      * Add xla gather and batch_gather unittests
      
      * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
      
      * add trt gather op and unit test
      
      * Add xla reduce_sum unittest, and support keep_dims for xla reduce
      
      * Add xla layer_norm unittest, and refine xla layer norm op
      
      * Add reshape_like unittest, and export reshape_like api
      
      * Refine xrt unittest code style
      
      * Export softmax_grad op, add softmax_grad unittest
      
      * Export tanh_grad op and add xla unittest
      
      * Export gelu_grad op, and add xla unittest
      
      * add conv unit test
      
      * reformat
      
      * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
      
      * Commit to merge upstream of_xrt
      
      * check files
      
      * modify files according to review advice.
      
      * Add xrt unittests (#2483)
      
      * Revert tensorrt
      
      * Fix absl missing if building with TensorRT but without XLA
      
      * Update xrt benchmark
      
      * Add xla gelu unittest
      
      * Fix int64 env variable
      
      * Export float16 for python
      
      * Add xla relu unittest
      
      * Add xla bias add and reshape unittests
      
      * Add xla identity unittest
      
      * Add xla cast and scalar op unittests
      
      * Add xla broadcast op and transpose unittests
      
      * Add xla add, sigmoid and tanh unittests
      
      * Add xla gather and batch_gather unittests
      
      * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
      
      * Add xla reduce_sum unittest, and support keep_dims for xla reduce
      
      * Add xla layer_norm unittest, and refine xla layer norm op
      
      * Add reshape_like unittest, and export reshape_like api
      
      * Refine xrt unittest code style
      
      * Export softmax_grad op, add softmax_grad unittest
      
      * Export tanh_grad op and add xla unittest
      
      * Export gelu_grad op, and add xla unittest
      
      * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
      
      * Commit to merge upstream of_xrt
      
      * Fix reduce_mean facade bug if keep_dims is true.
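
      For reference, a small numpy illustration of the keep_dims semantics exercised by this fix (numpy stands in for the OneFlow reduce op; this is not the fixed code itself):

      ```python
      import numpy as np

      x = np.ones((2, 3), dtype=np.float32)

      # Without keepdims the reduced axis disappears.
      print(np.mean(x, axis=1).shape)                 # (2,)

      # With keepdims the reduced axis is kept as size 1, which is the shape
      # a reduce_mean facade should report when keep_dims is true.
      print(np.mean(x, axis=1, keepdims=True).shape)  # (2, 1)
      ```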
      
      * Refine tensorrt unittests
      
      * Check failed if full reduce without keep dimension.
      
      * add pooling unit test
      
      * Add tensorrt bias_add and reshape op, and their unittests.
      
      * Support fp16 for tensorrt.
      
      * Add tensorrt transpose op and unittest.
      
      * add unit test conv_2d
      
      * add unit test concat
      
      * Fix concat if axis is -1.
      
      * Refine tensorrt conv2d unittest
      
      * Fix padding mode for conv2d and pooling, refine unittests.
      
      * Refine tensorrt concat unittest
      
      * Add convert api from string engine to XrtEngine.
      
      * Revert tensorrt, and merge of_xrt branch
      
      * Remove some comments.
      
      * Refine tensorrt unittests
      
      * Add XrtConfig to deal with xla and tensorrt configurations.
      
      Conflicts:
      	oneflow/xrt/api.cpp
      
      * Update tensorflow.cmake to avoid applying the patch repeatedly.
      
      * Remove XrtConfig Option, and fix xrt unittests
      
      * Add tensorrt batch norm (#2516)
      
      * Refine xrt signatrue hash, and fix python configuration (#2520)
      
      * Fix XrtCompilationEnabled returns (#2524)
      
      * Fix compilation after merge dev_python
      
      * Update xrt unittests
      
      * Revert protobuf version
      
      * Remove comment FOR_RANGE
      
      * Remove unused code
      
      * Reformat
      
      * Refine job builder
      
      * Disable dump job if not debug mode
      Co-authored-by: NSnow <snow3s@qq.com>
      Co-authored-by: NJuncheng <liujuncheng1022@gmail.com>
      8f3dcf94