XRT: XLA + TensorRT (#2525)

* Enable multiple definitions for XLA compilation in OneFlow
* Realize running an executable
* Gather the resources needed for compilation (client, builder, etc.) into an abstract CompilationResourceStore
* Implement a separate XLA allocator to avoid pulling in too many TensorFlow objects
* Define CompilationContext separately
* Running XLA in CPU mode is OK now
* Make the result shape after running the executable a tuple, and refine comments
* Add a compilation cache to avoid recompiling every time
* Resolve InferSbpSignature in XlaLaunchOp
* Resolve executing on a specified CUDA stream
* Refine XlaLaunch parallel conf, add batch matmul op
* Refactor job rebuilding and fix up time shape
* Update batch_dim_lbis field if XlaLaunch has any output with a batch dim
* Resolve cluster rings after clustering, taking SBP policy and time shape into consideration
* Add reshape op
* Fix bugs
* Rename CompilationContext to XlaLaunchContext, add XlaRuntimeScope to swap stream handles
* Fix bugs
* Update cmake to optionally compile with XLA
* Support more ops
* Add more ops, and fix bugs
* Implement XLA allocator and internal memory pool
* Adaptively resize allocator memory size
* Refine memory allocator
* Block the host if running a CPU executable
* Fix bug for getting scalar value
* Fix result layout bug that caused wrong results for transpose
* Refine gelu backward
* Of xla sx (#1990)
* add identity xla op
* Add batch gather op
* Refine batch gather
* fix batch gather bug and add gather op, mv identity op to unary_op
* Add softmax and gather/batch_gather
* Add xla softmax_grad op
* Add xla layer normalization op
* Add xla layer norm backward op
* Alias inputs and outputs to compute in-place
* Reuse output buffers when running the XLA executable; zero-copying results brings about a 10% speedup for BERT on a single GPU
* Refine xla allocator
* Refine code style
* Add xla reduce_sum op
* Rewrite model update op to optimizer graph
* Fix hang bugs
* Fix input whose body is disabled in xla launch kernel
* Fix self control in
* Add fake consume op
* Fix HasAttr bug for optional field
* Refine AdamOptimizer
* Fix xla AdamOptimizer bugs
* Add metadata in HLO instruction, and refine
* Fix bugs
* add reduce sum and split normal model update (#2040)
* remove append_func_to_list
* Rm deprecated model update and save code (#1958)
* remove code
* mv random gen to kernel
* make seed required
* address reviews
* fix unused warning
* address reviews
* check in more deprecation
* remove ModelSaveOpConf
* move out ops and modify item (#1962)
* ModelInit.__oneflow_input_remote_blobs__
* fix cpu-only query & add error info (#1964)
* NumaAwareCudaMallocHost (#1959)
* NumaAwareCudaMallocHost
* add conf
* modify check_point and add test check_point (#1963)
* fix misuse of Scope/RAII
* op_name2variable_blob
* add sigmoid test and tanh test (#1966)
* add op matmul and matmul test (#1967)
* rename oneflow.val to oneflow.input_blob_def
* support auto var for convolution (#1972)
* add op add and test add (#1973)
* mv deprecated.pb_util to lib.core.pb_util
* add op get_variable and get_variable test (#1975)
* add op get_variable and get_variable test
* modify shape extend
* AllReduceSequencePass (#1976)
* python2 compatibility for check_point
* fix "return (blob_a, blob_b)" bug
* rename: arg_passing => arg_pass
* shared regst blob header between jobs (#1919)
* half impl
* register manager handles memory sharing for separated memory
* set separated memory shared id for shared regst between jobs
* half impl of python for blob
* fix bug of pod ToProto() when proto is already inited
* fix bug of infer dim0_inner_shape() in foreign_input_op
* 1. PushJob copy from python can infer dim0_valid_num
* add test for dynamic relu
* refine test file
* refine code
* refine note
* update test file for new interface
* rename separated_header* (#1979)
* some bug fixes for a train & eval job (#1978)
* debugging alexnet
* check in test pull_multiple_blob.py
* stricter check
* fix bias in conv
* fix various bugs
* rm file
* op_name in different jobs can be overloaded
* fix compile bug in job_set_compile_ctx
* rm cmake code for building oneflow binary
* check in script (#1980)
* check in script
* rm unused import
* CudaCurrentDeviceGuard (#1977)
* fix val (#1981)
* Merge job set and split fw bw (#1982)
* add MemoryCopier and TensorSliceCopier (#1901)
* add MemoryCopier and TensorSliceCopier
* Index => NdIndex
* refine
* refine
* fix addition error checking (#1911)
* Merge dev_mixed_precision into dev_split_fw_bw (#1904)
* update binary_func.h
* update
* update ndarray
* update
* update
* update
* update
* refactor(data_type.h): better representation
* fix(unary_func.h): fix typo
* style(data_type.h): format
* Merge dev_mixed_precision: Part-2 (#1907)
* feat: add NewKernelUtil
* fix typos
* feat: add cublas_tensor_op_math_handle()
* add gemm (#1860)
* add gemm
* save
* add blobgemm
* update
* update
* fix cu
* update cpp
* feat: NewKernelUtil -> NewKernelUtil<DeviceType>
* feat: update FullyConnectedKernel to use NewKernelUtil
* Dev sx mixed precision (#1861)
* add gemm
* save
* add blobgemm
* update
* update
* fix cu
* update cpp
* save cpp
* save
* add relu and relu_backward
* remove spare space
* add explicit declaration
* rename
* feat: update ConvKernel to support half
* add sigmoid and tanh (#1867)
* add axpy (#1866)
* style: formatting
* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
* refine(new_kernel_util.h)
* refine(new_kernel_util.cu)
* feat(new_kernel_util): add OFBatchedGemm()
* feat: update MatMulKernel to support half
* feat: update ConvData/Bias/FilterGradKernel to support half
* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
* feat: support loss scale
* fix(operator): :bug: add InferHasBatchDim()
* feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
* refactor(cast_kernel.h/cpp): :recycle: update cast_kernel.cpp to support float2half and half2float
* style(kernel/cast_kernel.cpp): formatting
* fix(cuda_device_context.h): :bug: add cublas_tensor_op_math_handle()
* style(cast_kernel.cpp): formatting
* feat(new_kernel_util): :sparkles: support Transpose in NewKernelUtil
* refactor(transpose_kernel): :recycle: use NewKernelUtil instead of KernelUtil
* feat(dropout_kernel): :sparkles: update DropoutKernel to support half
* refactor(dropout_kernel): remove backward funcs
* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support half
* fix(conv_op.cpp): :bug: add InferHasBatchDim() and GetSbpSignatures() (only simple)
* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
* fix: fix little bugs
* fix(conv_data/filter_grad_op): min byte size of buf blob is 1
* feat: support half for bias_add_kernel
* fix(bias_add_op): remove data type check
* feat(relu_kernel): support half
* refactor: add ADD_GPU_HALF_KERNEL_CREATOR
* fix: typos
* feat(pooling_kernel): support half
* fix: remove CHECK_EQ of default data type
* feat(pooling_grad_kernel): support half
* feat: support half in ofrecord_encoder (TODO)
* fix
* feat: support half in sparse_cross_entropy_kernel
* debug grad op (#1883)
* Dev debug op mixed precision (#1884)
* debug grad op
* do nothing instead of UNIMPLEMENTED
* fix(dropout_kernel): add tmp_split_fw_bw condition
* build(half.cmake): https -> http
* fix(record_load_kernel): support total_batch_num
* fix pooling (#1885)
* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
* fix: add GetCudnnScalingParameters() to fix scaling params
* fix: add enable_true_half_config_when_conf() into config and update related code
* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
* refactor(matmul_kernel): remove Backward()
* feat(new_kernel_util): support HGemmWithFloat(), which uses cublasSgemmEx()
* feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
* refactor(new_kernel_util.cu): remove static from funcs in anonymous namespace
* feat(job_conf.proto): add enable_auto_mixed_precision field
* feat(auto_mixed_precision_lists): add amp_lists
* feat(auto_mixed_precision): build the skeleton
* feat(auto_mixed_precision): almost finish amp graph pass
* feat(auto_mixed_precision.cpp): complete InsertCastOp()
* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
* refine(auto_mixed_precision.cpp): refine LOG
* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
* Dev half ndarray (#1886)
* debug grad op
* ZeroVal => GetZeroVal; OneVal => GetOneVal
* MaxVal => GetMaxVal; MinVal => GetMinVal
* check data type
* DevDType
* move function template to struct template for BinaryFunc* and UnaryFunc*
* support half for reduce_sum_kernel
* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
* half for NdarrayUtil
* OF_DEVICE_FUNC is always inline
* half for NdarrayApplyUnaray
* simplify usage of NdarrayUtil
* UnaryFuncExp
* add VarNdarrayBuilder and ValNdarrayBuilder
* simplify NdarrayUtil in layer_norm_param_grad_kernel
* InplaceBroadcast
* remove SoftmaxKernelUtil
* half for softmax_kernel
* fix improper use of __CUDA_ARCH__
* disable sm_30, sm_52
* refine(conv_kernel.cu): fix typo
* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
* fix: fix typos of GetOneVal
* fix(auto_mixed_precision.cpp): allocate for shared_ptr
* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
* fix(auto_mixed_precision.cpp): fix typo
* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modifying the ibn of PrintOp
* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
* feat(auto_mixed_precision.cpp): more logs
* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
* fix(bias_add_op.cpp): fix bias_multiplier shape
* feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
* feat: update MatmulKernel and new_kernel_util to support half
* refactor(auto_mixed_precision): add ClearList and refine code
* feat(tanh_*_kernel): support half
* feat(add_kernel): support half
* update binary_func.h
* update
* update ndarray
* update
* update
* update
* update
* refactor(data_type.h): better representation
* fix(unary_func.h): fix typo
* style(data_type.h): format
* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
* style(CMakeLists.txt): fix typo
* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
* fix(auto_mixed_precision.cpp): group inserted cast ops by lbn
* fix get one ptr (#1913)
* fix(layer_norm): add LayerNormOp to grey_list and support half
* fix(layer_norm about): fix it to run with amp
* fix: move fix sbp signature from OpNode to OpGraph
* Dev new kernel util (#1925)
* refactor(kernel/util): refactor NewKernelUtil and add DnnIf
* refactor(kernel/util): add BlasIf
* refactor(kernel/util): add ArithemeticIf
* refactor(kernel/util): add cuda_kernel_util.*
* refactor: refactor NewKernelUtil
* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
* refactor(new_kernel_util.h): remove unused header files
* refactor: refactor loop include
* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
* do not compile CUDA_NVCC_FLAGS arch=compute_70 when CUDA_VERSION… (#1936)
* do not compile CUDA_NVCC_FLAGS arch=compute_70 when CUDA
* CHECK cuda version > 10.0 when using auto_mixed_precision
* Fix bug of Snapshot deleting unwanted files (#1937)
* fix link bug of release version (#1938)
* delete redundant code in OpGraph, JobCompleter and Operator (#1927)
* 1. delete redundant code in OpGraph, JobCompleter and Operator 2. fix bug of Snapshot deleting unwanted files 3. refine ReadMe
* revert README change
* split into 2 pull requests
* Refactor Kernel Registry V2: The clear & easy Way (#1941)
* refactor(resource.proto): move DeviceType to common/device_type.proto
* feat(kernel_registration): add kernel_registration.h/cpp
* feat(kernel_registration): update matmul_kernel to support new registration
* feat: add CreateKernel for new registry
* feat: update registry of cast conf
* refactor(kernel_registration): remove KernelRegMap
* fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
* grpc SetMaxMessageSize(INT_MAX) (#1950)
* fix bug of Graph::ForEachConnectedComponent (#1952)
* Grpc set max size (#1953)
* grpc SetMaxMessageSize(INT_MAX)
* set max msg len for ctrl service
* code for testing grpc max msg size
* remove test code
* Merge job set and split fw bw (#1983) (same commit list as #1982 above, plus CudaCurrentDeviceGuard (#1977))
* delete tmp_split_fw_bw_train_conf (#1985)
* delete tmp_split_fw_bw_train_conf
* delete useless comments
* fix refactor bug in layer_norm_op
* minor fixes
* update py script
* remove code that could be misleading
* Fix all reduce mem sharing (#1986)
* fix all reduce mem sharing
* ByteSizeOfDataContentField => ByteSizeOfBlobBody
* remove obsolete task_graph optimization
* no arg_pass_job for variable_op
* merge memory block id between jobs (#1910)
* refine MemBlock and CriticalSection
* job memory sharing strategy
* revert diff in CriticalSectionDesc
* Merge memory block between sub plans
* Get mutual exclusion job groups
* forgot to consider memory merge only on the same machine
* memory zone unique id
* Merge done; merge memory block ids from right to left; get memory block ids info
* revert MemBlock
* generate mutual exclusion job groups: done
* update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implement of Merge memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as more as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set(#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/swith_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in 
IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdate code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace 
data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more infomation of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before 
adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistant * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by return Maybe<void> type (#2051) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug 
when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initilizer (#2082) * zero constant initilzer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implement (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc ; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global 
New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes *
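The global-step and learning-rate commits above (AutoGlobalStep, AutoLearningRate, LearningRateScheduleOp, the learning_rate blob) revolve around one idea: the train step is a counter updated every iteration, and the learning rate is recomputed from it by a schedule op. A minimal sketch of that idea, with illustrative names rather than OneFlow's real API:

```python
# Sketch of a train-step counter driving a learning-rate schedule,
# in the spirit of AutoGlobalStep / LearningRateScheduleOp.
# All names here are illustrative, not OneFlow's real API.

def exponential_decay(base_lr, decay_rate, decay_steps, step):
    # lr = base_lr * decay_rate ** (step / decay_steps)
    return base_lr * decay_rate ** (step / decay_steps)

class TrainStepCounter:
    """Stands in for the train_step blob that an assign op increments."""
    def __init__(self):
        self.step = 0

    def next(self):
        self.step += 1
        return self.step

counter = TrainStepCounter()
lrs = [exponential_decay(0.1, 0.5, 10, counter.next()) for _ in range(20)]
```

With `decay_steps=10` and `decay_rate=0.5`, the rate halves every ten steps: 0.05 at step 10, 0.025 at step 20.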
Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * check in softmax loss * nn.conv2d and nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data load * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix several ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alexnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge
general ops (#2131) * merge some general ops to dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * truncated normal for bert (#2150) * Update bert for dev python (#2151) * truncated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error message in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/save args * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implementation of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge
placement group when add op; remove MergePlacementGroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check that blob split_axis can be divided by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc etc. * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fix bug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placement scope exit * 1. fix bug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placement group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) * 
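The model-parallel commits above hinge on one invariant: before a blob can be scattered across devices, its split_axis dimension must be divisible by the parallel num (and a dense layer with model_split_axis=0 splits its weight's output dimension). A hedged sketch of that check; `split_blob_shape` is a hypothetical helper, not OneFlow's implementation:

```python
# Sketch of the "split_axis must be divisible by parallel num" check
# from the model-parallel commits. split_blob_shape is a hypothetical
# helper, not OneFlow's implementation.

def split_blob_shape(shape, split_axis, parallel_num):
    """Return the per-device shape when splitting `shape` on `split_axis`."""
    if shape[split_axis] % parallel_num != 0:
        raise ValueError(
            "dim %d on axis %d is not divisible by parallel num %d"
            % (shape[split_axis], split_axis, parallel_num))
    per_device = list(shape)
    per_device[split_axis] //= parallel_num
    return tuple(per_device)

# A dense layer weight split on axis 0 across 4 devices:
weight_shape = (1024, 768)
per_device_shape = split_blob_shape(weight_shape, 0, 4)  # (256, 768)
```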
InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda lifetime; gen job build error add location info * refine error generation and return * refine check lbi valid and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. * Remove fake consume op * Support enable/disable XLA by set env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Of xla (#2237) * 1.
PushJob copy from python can infer dim0_valid_num * add test for dynamic relu * refine test file * refine code * refine note * update test file for new interface * rename separated_header* (#1979) * some bug fixes for a train&eval job (#1978) * debugging alex net * check in test pull_multiple_blob.py * stricter check * fix bias in conv * fix various bugs * rm file * op_name in different jobs can be overloaded * fix compile bug in job_set_compile_ctx * rm cmake code for building oneflow binary * check in script (#1980) * check in script * rm unused import * CudaCurrentDeviceGuard (#1977) * fix val (#1981) * Merge job set and split fw bw (#1982) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * update * update ndarray * update * update * update * update * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support 
half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half * feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and 
GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complete InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnary * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> 
CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() to modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * update * update ndarray * update * update * update * update * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * 
refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when using auto_mixed_precision * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: update registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * Merge job set and split fw bw (#1983) * CudaCurrentDeviceGuard (#1977) * delete tmp_split_fw_bw_train_conf (#1985) * delete tmp_split_fw_bw_train_conf * delete useless comments * fix refactor bug in layer_norm_op * minor fixes * update py script * remove code that could be misleading * Fix all reduce mem sharing (#1986) * fix all reduce mem sharing * ByteSizeOfDataContentField=>ByteSizeOfBlobBody * remove obsolete task_graph optimization * no arg_pass_job for variable_op * merge memory block id between jobs (#1910) * refine MemBlock and CriticalSection * job memory sharing strategy * revert diff in CriticalSectionDesc * Merge memory block between sub plans * Get mutual exclusion job groups * forget to consider memory merge only in same machine * memory zone unique id * Merge Done; merge memory block id from right to left; get memory block ids info * revert MemBlock * generate mutual exclusion job groups Done.
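The inter-job memory-sharing commits above ("merge memory block id between jobs", "Get mutual exclusion job groups") reduce to one idea: jobs that can never run concurrently may reuse the same memory block, so their block ids are merged. A toy sketch of that merge; the data layout and helper name are illustrative, not OneFlow's InterJobMemSharingUtil:

```python
# Toy sketch of merging memory block ids between mutually exclusive
# jobs, in the spirit of the inter-job memory-sharing commits.
# Data layout and names are illustrative, not OneFlow's.

def merge_block_ids(job_mem, exclusion_groups):
    """job_mem: {job_name: block_id}. Jobs in one exclusion group never
    run concurrently, so they can all reuse the smallest block id."""
    merged = dict(job_mem)
    for group in exclusion_groups:
        shared = min(merged[j] for j in group)
        for j in group:
            merged[j] = shared
    return merged

job_mem = {"train": 0, "eval": 1, "model_save": 2}
# Assumption for illustration: train and eval never run at the same time.
merged = merge_block_ids(job_mem, [["train", "eval"]])
```

After the merge, train and eval share block 0 while model_save keeps its own block.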
* update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implementation of merging memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as much as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as much as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as much as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set (#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/switch_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in
IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as much as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdated code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace
data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more information of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before
adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistent * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by returning Maybe<void> type (#2051) * Refine Infer APIs by returning Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by returning Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug
when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initializer (#2082) * zero constant initializer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implementation (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global
New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * 
Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * check in softmax loss * nn.conv2d and nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data load * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix several ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove
total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alexnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge
general ops (#2131) * merge some general ops to dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * truncated normal for bert (#2150) * Update bert for dev python (#2151) * truncated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error message in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * Fix python internal (#2133) * Return error message in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) *
feat: update oneflow_internal_helper to use func * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/save args * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implementation of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge
placement group when add op; remove mergeplacementgroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check blob split_axis can be divided by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc. * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fixbug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placement scope exit * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placement group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * remove useless lines * fix bert and init_cluster_env for multi nodes * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) *
InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda lifetime; gen job build error add location info * refine error generation and return * refine check lbi valid and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. * Remove fake consume op * Support enable/disable XLA by setting env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Fix static cublas library and xla link conflict * Fix cublas link conflict with tensorflow * Fix different connection kinds for multiple gpu cards (#2282) * Refine xla cluster algo (#2289) * Fix different connection kinds for multiple gpu cards * Fix bug for multiple outputs consumed by one node * Refine cluster algo * Refine MarkClusterId pass and ReduceSplit task node (#2314) * Fix different connection kinds for multiple gpu cards * Fix bug for multiple outputs consumed by one node * Refine cluster algo * Determine fusion disabled edges * update * Produce multiple registers on edges for ReduceSplit task node. Fix new allocator by stream id.
* Refine MarkClusterId pass * Clustering subgraph with reverse ordering is better * Support strict clustering by taking dependencies into consideration * Translate rebuild job and rewrite optimizer into passes, and refine code style * Fix spell error * Update cmake * Merge branch dev_python (#2321) * Dev res50 new api (#2173) * check in script * runnable * fix multinode * fix and real train * fix param data_format * fix truncated normal * quick fix multi node launch (#2193) * Dev reshape sbp (#2192) * reshape sbp * more check for reshape conf * fix error CHECK * refactor reshape * fix reshape like op * support naive case of s0 * refine * rm redundant code * more generous check for equal element cnt * restore empty line * add GatherMs0Grad op (#2191) * support for gather with s(0) `in' * add gather_ms0_op * fix bugs in message GatherMs0OpConf and GatherMs0Kernel * only (B, S(0)) -> P supported for gather_ms0 op * add GatherMs0Grad op * minor fix * refine code * bugfix and update gather test case * add concat op and pass the test (#2067) * add concat op and pass the test * add vgg job_conf * model compared to be same as the old one * rm unnecessary file * Update array_ops.py * mv file * get rid of ternary operator (#2195) * Dev reshape util struct (#2194) * check in changes * rm file * minor fix * Merge network files of 2 cnns (#2196) * add inceptionV3 * check in vgg16 * add cnns test scripts for dev_python (#2170) * add cnns test scripts for dev_python * add alexnet test scripts * add resnet50 * add inceptionv3 * add resnet50 * add vgg16 * first version of run_cnns_test.py * remove old files * unsorted_segment_sum (#2198) * oneflow.unsorted_segment_sum (#2199) * oneflow.unsorted_segment_sum * remove unused import * remove unused import * Dev batch unsorted segment sum (#2200) * oneflow.unsorted_segment_sum * remove unused import * remove unused import * rename UnsortedSegmentSum to BatchUnsortedSegmentSum * rename: batch_unsorted_* => unsorted_batch_* *
unsorted_segment_sum (#2201) * unsorted_segment_sum * fix job_completer/unsorted_segment_sum_grad.cpp * more check for unsorted_segment_sum batch_axis * remove FixParallelDesc (#2202) * rm KernelIfWithModel KernelIfWithActivation (#2203) * remove KernelIfWithActivation * remove KernelIfWithModel * rm blob header kLossInstanceNum (#2204) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * fix warning: return string reference to temporary (#2212) * docker build support (#2002) * update cmake files * check in files * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * shrink ctx size * fix script * fix wheel build * fix wheel build not adding .so (#2052) * lower cmake version bar * rm more files * keep build dir * check in test bash script * fix * Dev docker sx (#2124) * add python2 docker env * rm old docker files * update repository * add ARG CUDA and USE_PYTHON_3_OR_2 * reform files * update * rm log doesn't print when there is cache * use default arg in dockerfile * better py 2 or 3 condition * add default * use if * update alexnet * update for bert * 15->16 * add resnet50 in model (#2217) * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215) * remove parallel policy * rm FC/rnn/embedding_look_up op/kernel * add check data parallel for 
conv/layer_norm op * bugfix: bias add + use math_add when batch size = 1 * fix InferBatchAxis (#2220) * sync with bert_benchmark (#2221) * sync with bert_benchmark * rename run.sh * Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * run successfully on single GPU * fix 121 for tick (#2069) * delete unnecessary multiply_grad class * speed up generate time for dot2svg (#2083) * Add axis conf to bias_add for any axis channel (#2087) * bias_add completion * follow comment * make conf axis required * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091) This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
* updated * fix segment_sum_grad * fix sbp * fix segment_sum impl for data parallel * fix * remove useless code in segment_kernel_util.h * add python interface * fix sigmoid conf * fix naming error * fix typo * temp mod loss sbp * add LazyAdam * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep * rm useless code * unsorted_segment_sum * refactor sigmoid_cross_entropy_loss_kernel to high performance * Improve sigmoid cross entropy loss grad (#2207) * remove for loop called cuda kernel * minor fix * ../oneflow/python/ops/data_ops.py (#2209) * fix lazy_adam * Merge wnd and python (#2214) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * merge dev_python * fix boxing: P->S(0) * check in docker build scripts (#2216) * Dev python widedeep docker (#2218) * check in docker build scripts * check in .dockerignore * rm oneflow.segment_sum * remove segment_sum * rm unused file * rm debug code * rm debug code * rm double empty lines * remove useless comments * fix send msg (#2227) * fix reduction_coefficient (#2228) * refactor ndarray for eq/ne/... 
* Dev kernel launch synchronized (#2230) * IsKernelLaunchSynchronized * virtual * refine * refine * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC * more static_assert * remove unused task related dot function (#2236) * remove unused task related dot function * do not output dot rank info * Dev non distributed optimizer js (#2234) * op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix deadlock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization * refine lazy adam (#2244) * refine lazy adam * update * memory version 2 step 1: replace original concept about mem sharing (#2242) * mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem * memory version 2 step 1: replace original concept about mem sharing * record reader multi thread (#2246) * multi thread * ComputeThreadPoolSize * python api * Fix random decode (#2252) * add decode random * fix decode random actor * Dev pr boxing v2 (#2248) * NcclDeviceCtx * include naive_actor * refine * use_boxing_v2 * config.use_boxing_v2 * SubTskGphBuilder * fix * hash<oneflow::MemoryCase> * Maybe<void> * ChainSubTskGphBuilder * SliceBoxingOp * return ok * SliceBoxingKernel * SliceBoxingActor * kSliceBoxing * nccl boxing op * nccl actor * REGISTER_OP * GetMsgFromCustomizedConf * NcclBoxingTaskNode * BldSubTskGphByBoxingV2 *
NcclBoxingSubTskGphBuilder * fix * fix * NcclKernel * ParallelContext * REGISTER_ACTOR * fix rank set * IsNcclTaskType * limit * 1024 * multi thread reader * thread_num * IsKernelLaunchSynchronized * refine * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx * MakeHostMemCase * NcclBldSubTskGph * remove use less code * use_boxing_v2 * refine * refine * refine * refine * refine * cmake find python note when version less 3.14 (#2286) * fix bug: reduce split kernel inplace (#2297) * Dev bias add (#2299) * use bias add * fix * bias_add * bias add half * fix * reinterpret_cast * fix half * HALF * fix * ADD_DEFAULT_KERNEL_CREATOR * fix * format * Fix dev python test (#2294) * add decode random * fix decode random actor * fix dev_python test scripts * fix batch_size test scripts * fix * Memory Version 2.0 Step 2: MemSharedAndReused between jobs (#2267) * MemBlockProto and ChunkProto * create mem block and chunk after improver * interface merge mem block and chunk between sub plans * merge chunk between jobs for memory reuse * using memory zone unique id replace memory case hash * merge interface op mem block between jobs for mem shared * gen GlobalCriticalSection by mem block id and chunk id * check mem block and chunk valid before runtime * Refactor: RegstMgr ; allocate memory by mem block and chunk instead of regst * fix bug; and pass test * fig bug: init chunk_id_count in id_manager * reuse copyHd out mem between jobs * PushPlan and PullPlan for memblock and chunk * refine merge mem block / chunk in oneflow.cpp * at(i); * GetOpName2JobId2TaskProtos functional * using output ptr; pass test AlexNet and Resnet * Fix xla reshape op * Merge upstream of_xla (#2322) * Dev res50 new api (#2173) * check in script * runable * fix multinode * fix and real train * fix param data_format * fix truncated normal * quick fix multi node launch (#2193) * Dev reshape sbp (#2192) * reshape sbp * more check for reshape conf * fix error CHECK * refactor reshape * fix reshape like op * 
support naive case of s0 * refine * rm redundant code * more generous check for equal element cnt * restore empty line * add GatherMs0Grad op (#2191) * support for gather with s(0) `in' * add gather_ms0_op * fix bugs in message GatherMs0OpConf and GatherMs0Kernel * only (B, S(0)) -> P supported for gather_ms0 op * add GatherMs0Grad op * minor fix * refine code * bugfix and update gather test case * add concat op and pass the test (#2067) * add concat op and pass the test * add vgg job_conf * model compared to be same as the old one * rm unnecessary file * Update array_ops.py * mv file * get rid of ternary operator (#2195) * Dev reshape util struct (#2194) * check in changes * rm file * minor fix * Merge network files of 2 cnns (#2196) * add inceptionV3 * check in vgg16 * add cnns test scripts for dev_python (#2170) * add cnns test scripts for dev_python * add alexnet test scripts * add resnet50 * add inceptionv3 * add resnet50 * add vgg16 * first version of run_cnns_test.py * remove old files * unsorted_segment_sum (#2198) * oneflow.unsorted_segment_sum (#2199) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * Dev batch unsorted segment sum (#2200) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * rename UnsortedSegmentSum to BatchUnsortedSegmentSum * rename: batch_unsorted_* => unsorted_batch_* * unsorted_segment_sum (#2201) * unsorted_segment_sum * fix job_completer/unsorted_segment_sum_grad.cpp * more check for unsorted_segment_sum batch_axis * remove FixParallelDesc (#2202) * rm KernelIfWithModel KernelIfWithActivation (#2203) * remove KernelIfWithActivation * remove KernelIfWithModel * rm blob header kLossInstanceNum (#2204) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch 
size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * fix warning: return string reference to temporary (#2212) * docker build support (#2002) * update cmake files * check in files * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * shrink ctx size * fix script * fix wheel build * fix wheel build not adding .so (#2052) * lower cmake version bar * rm more files * keep build dir * check in test bash script * fix * Dev docker sx (#2124) * add python2 docker env * rm old docker files * update repository * add ARG CUDA and USE_PYTHON_3_OR_2 * reform files * update * rm log doesn't print when there is cache * use default arg in dockerfile * better py 2 or 3 condition * add default * use if * update alexnet * update for bert * 15->16 * add resnet50 in model (#2217) * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215) * remove parallel policy * rm FC/rnn/embedding_look_up op/kernel * add check data parallel for conv/layer_norm op * bugfix: bias add + use math_add when batch size = 1 * fix InferBatchAxis (#2220) * sync with bert_benchamrk (#2221) * sync with bert_benchamrk * rename run.sh * Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) 
* run successfully on single GPU
* fix 121 for tick (#2069)
* delete unnecessary multiply_grad class
* speed up generate time for dot2svg (#2083)
* Add axis conf to bias_add for any axis channel (#2087)
* bias_add completion
* follow comment
* make conf axis required
* Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091). This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
* Dev cuda 9 arch 70 (#2318)
* kCudaAlignSize = 256
* always compute_70
* __CUDA_API_VERSION >= 10000
* __CUDA_API_VERSION >= 10000
* disable_all_reduce_sequence
* Fix xla reshape op
* Fix compilation without xla
* Remove useless code and fix data type mismatch in field desc (#2326)
* Remove useless code
* Refine code style
* Fix data type mismatch in field desc
* Update README.md (#2335)
* Refine code style (#2336)
* Update XLA usage document (#2337)
* Update XLA usage document
* Fix mistakes
* Add xla clang-format and format codestyle (#2340)
* Revert "Add xla clang-format and format codestyle (#2340)" (#2341). This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.
* Add xla clang-format and format codestyle (#2342)
* Add xla clang-format and format codestyle
* Fix header file missing
* Of xla sx (#2334)
* add gather grad op and pass testing
* rm check
* done batch gather grad
* pass test
* modify according to the review
* add unsorted_segment_sum and refine unsorted_batch_segment_sum
* reform according to review
* reformat according to clang-format and rm reference to the temp object
* Pick step0 and step1 new commits (#2346)
* Add xla clang-format and format codestyle
* Fix header file missing
* Modify codes to support XLA. Conflicts: oneflow/core/job/job_builder.cpp, oneflow/core/job/job_builder.h, oneflow/core/operator/op_conf.proto
* Fix a bug for building subgraph although it won't lead to wrong results (#2347)
* Fix setting is_mutable in xla launch op (#2349)
* Change directory xla to xrt, apply patch if building with xla
* Refactor
* Add infer shape pass, and refactor launch kernel, graph compiler
* Refine code style, add xla executable and graph compiler
* Rename platform.proto as types.proto
* change OpCompiler to OpKernel, complete xla graph compiler
* Fix compilation bugs and add allocator, now xla compilation is ok
* Add xla executable runtime
* Add executable run scope to support launch kernel on specific stream.
* Fix infer shape pass, and revert cuda event pool
* Refactor graph building with attaching argument metadata.
* Set mutability if rebuilding job
* Set device ordinal correctly
* Refine DelOps
* Refine Argument definition and abstract function as subgraph
* Fix infer shape in xrt launch op and launch kernel.
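The graph-compiler commits above build on the compilation cache mentioned earlier ("Add compilation cache to solve recompiling every time", later refined as "Refine xrt signature hash"). A minimal sketch of that idea: key each compiled executable by a signature hash of the launch op and its input shapes/dtypes, so a launch op is only recompiled when its argument signature changes. All class and function names here are illustrative, not OneFlow's actual API.

```python
import hashlib

class CompilationCache:
    """Caches compiled executables keyed by a signature of the inputs,
    so the same subgraph is not recompiled on every iteration.
    Illustrative sketch only, not the real XRT classes."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # builds an executable from a signature
        self._cache = {}

    @staticmethod
    def signature(op_name, shapes, dtypes):
        # Hash the launch op name plus every input shape/dtype; a new
        # combination means the executable must be rebuilt.
        h = hashlib.sha256(op_name.encode())
        for shape, dtype in zip(shapes, dtypes):
            h.update(repr(shape).encode())
            h.update(dtype.encode())
        return h.hexdigest()

    def get_or_compile(self, op_name, shapes, dtypes):
        key = self.signature(op_name, shapes, dtypes)
        if key not in self._cache:
            self._cache[key] = self._compile_fn(op_name, shapes, dtypes)
        return self._cache[key]

compilations = []
cache = CompilationCache(
    lambda *sig: compilations.append(sig) or f"exe-{len(compilations)}")
cache.get_or_compile("launch_0", [(32, 128)], ["float32"])
cache.get_or_compile("launch_0", [(32, 128)], ["float32"])  # cache hit
print(len(compilations))  # compiled only once
```

Hashing the full signature rather than comparing shapes element by element keeps the cache lookup cheap even for launch ops with many inputs.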
* Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.
* Refine code style
* Rename xla Operand as XlaValue.
* Complete TensorRT compiler and builder, refine OpKernel
* Pick public code changes from the new tensorrt branch.
* Fix tensorrt compilation
* Fake implementation of trt executable
* Support selecting engine in launch kernel, refine trt executable
* Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfix.
* Support train phase setting for registered op kernel
* Remove RewriteOptimizer pass, update xla optimizer op.
* Format job builder .h and .cpp files.
* Remove RewriteOptimizer pass, update xla optimizer op.
* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
* Refine code style and comment.
* Refine model update inference for launch op.
* Refine
* Refine code style and comment.
* Refine model update inference for launch op. Conflicts: oneflow/xrt/kernel/op_kernel.h, oneflow/xrt/node_util.cpp, oneflow/xrt/node_util.h, oneflow/xrt/passes/cluster.h, oneflow/xrt/passes/mark_cluster_id_pass.cpp, oneflow/xrt/passes/rebuild_job_pass.cpp, oneflow/xrt/types.h
* Add xrt README.md
* Add use_xla_jit and use_tensorrt options in job proto
* Refine code style
* Fix BlobDesc getter and xla LayerNorm op for FP16
* Make use_xla_jit and use_tensorrt configurable from python config and env variables.
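One item above notes that the TensorRT executable is rebuilt when the runtime batch size exceeds the engine's default max batch size. A toy model of that rebuild policy, with growth headroom so a slightly larger batch does not trigger a rebuild on every call; the class and names are illustrative, not the real XRT implementation:

```python
class TrtExecutable:
    """Toy model of the rebuild policy: the engine is built for a default
    max batch size and rebuilt with a larger limit when a bigger batch
    arrives. Illustrative only, not the real XRT TrtExecutable."""

    def __init__(self, build_fn, max_batch_size=32):
        self._build_fn = build_fn
        self.max_batch_size = max_batch_size
        self.engine = build_fn(max_batch_size)

    def run(self, batch_size):
        if batch_size > self.max_batch_size:
            # Double the limit until the batch fits, so repeated small
            # increases do not each force an expensive engine rebuild.
            while self.max_batch_size < batch_size:
                self.max_batch_size *= 2
            self.engine = self._build_fn(self.max_batch_size)
        return self.engine

builds = []
exe = TrtExecutable(lambda mbs: builds.append(mbs) or f"engine(max_batch={mbs})")
exe.run(16)    # fits the default limit, no rebuild
exe.run(100)   # exceeds 32, rebuilt with max_batch=128
print(builds)  # [32, 128]
```

Engine rebuilds are expensive, which is why the commit grows the limit beyond the observed batch size instead of rebuilding to the exact value.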
* Update benchmark
* Refine xrt README and rename compile_with_xrt.h file
* Update README
* Revert tensorrt
* Fix absl missing if building with TensorRT but without XLA
* Update xrt benchmark
* Disable WITH_XLA by default
* Update xrt benchmark
* Format xrt as core
* add activation op
* add softmax op
* Refine code style, remove unused code
* Remove duplication of XLA usage
* test pass
* pooling test pass
* add concat op, not tested
* add activation ops, test not passed
* Add xla gelu unittest
* add activation op, and test passed
* add pooling op, and test passed
* Fix int64 env variable
* Export float16 for python
* Add xla relu unittest
* try to solve conv bug
* add elementwise add op, test passed
* add concat op, test passed
* Bugfix: transfer weights from gpu to host since tensorrt requires host weights.
* add op unit tests
* resolve conflicts and fix softmax bug
* add identity op and topk op, to test
* Add xla bias add and reshape unittests
* Add xla identity unittest
* Add xla cast and scalar op unittests
* Add xla broadcast op and transpose unittests
* Add xla add, sigmoid and tanh unittests
* add reduce mean op, test passed
* format ops, add CHECKs, and optimize function structure
* Add xla gather and batch_gather unittests
* Add xla softmax unittest and fix softmax bug if axis is not the last dim.
* add trt gather op and unit test
* Add xla reduce_sum unittest, and support keep_dims for xla reduce
* Add xla layer_norm unittest, and refine xla layer norm op
* Add reshape_like unittest, and export reshape_like api
* Refine xrt unittest code style
* Export softmax_grad op, add softmax_grad unittest
* Export tanh_grad op and add xla unittest
* Export gelu_grad op, and add xla unittest
* add conv unit test
* reformat
* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
* Commit to merge upstream of_xrt
* check files
* modify files according to review advice.
* Add xrt unittests (#2483)
* Revert tensorrt
* Fix absl missing if building with TensorRT but without XLA
* Update xrt benchmark
* Add xla gelu unittest
* Fix int64 env variable
* Export float16 for python
* Add xla relu unittest
* Add xla bias add and reshape unittests
* Add xla identity unittest
* Add xla cast and scalar op unittests
* Add xla broadcast op and transpose unittests
* Add xla add, sigmoid and tanh unittests
* Add xla gather and batch_gather unittests
* Add xla softmax unittest and fix softmax bug if axis is not the last dim.
* Add xla reduce_sum unittest, and support keep_dims for xla reduce
* Add xla layer_norm unittest, and refine xla layer norm op
* Add reshape_like unittest, and export reshape_like api
* Refine xrt unittest code style
* Export softmax_grad op, add softmax_grad unittest
* Export tanh_grad op and add xla unittest
* Export gelu_grad op, and add xla unittest
* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
* Commit to merge upstream of_xrt
* Fix reduce_mean facade bug if keep_dims is true.
* Refine tensorrt unittests
* Check failed if full reduce without keep dimension.
* add pooling unit test
* Add tensorrt bias_add and reshape op, and their unittests.
* Support fp16 for tensorrt.
* Add tensorrt transpose op and unittest.
* add unit test conv_2d
* add unit test concat
* Fix concat if axis is -1.
* Refine tensorrt conv2d unittest
* Fix padding mode for conv2d and pooling, refine unittests.
* Refine tensorrt concat unittest
* Add convert api from string engine to XrtEngine.
* Revert tensorrt, and merge of_xrt branch
* Remove some comments.
* Refine tensorrt unittests
* Add XrtConfig to deal with xla and tensorrt configurations. Conflicts: oneflow/xrt/api.cpp
* Update tensorflow.cmake to avoid applying the patch repeatedly.
* Remove XrtConfig Option, and fix xrt unittests
* Add tensorrt batch norm (#2516)
* Refine xrt signature hash, and fix python configuration (#2520)
* Fix XrtCompilationEnabled returns (#2524)
* Fix compilation after merge dev_python
* Update xrt unittests
* Revert protobuf version
* Remove comment FOR_RANGE
* Remove unused code
* Reformat
* Refine job builder
* Disable dump job if not debug mode

Co-authored-by: NSnow <snow3s@qq.com>
Co-authored-by: NJuncheng <liujuncheng1022@gmail.com>
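Several commits above make `use_xla_jit` and `use_tensorrt` configurable from both the python config and environment variables. A minimal sketch of that resolution order, where an explicit config value wins over the environment; the variable names (`FLAGS_use_xla_jit`, `FLAGS_use_tensorrt`) and the function are placeholders, not OneFlow's actual flags:

```python
import os

def env_bool(name, default=False):
    """Parse a boolean-ish environment variable ("1"/"true"/"on"/"yes")."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "on", "yes")

def resolve_engine(config_use_xla_jit=None, config_use_tensorrt=None):
    """An explicit python config wins; otherwise fall back to env variables.
    Sketch only: names and the one-engine restriction are assumptions."""
    use_xla = config_use_xla_jit if config_use_xla_jit is not None \
        else env_bool("FLAGS_use_xla_jit")
    use_trt = config_use_tensorrt if config_use_tensorrt is not None \
        else env_bool("FLAGS_use_tensorrt")
    if use_xla and use_trt:
        raise ValueError("choose at most one engine per launch op")
    return "XLA" if use_xla else ("TENSORRT" if use_trt else "DEFAULT")

os.environ["FLAGS_use_xla_jit"] = "true"
print(resolve_engine())                          # "XLA" (from env)
print(resolve_engine(config_use_xla_jit=False))  # "DEFAULT" (config overrides env)
```

Parsing env values through a helper like `env_bool` also sidesteps the "Fix int64 env variable" class of bug, where raw strings are compared or cast inconsistently.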
New files added in this PR (0 → 100644):
* cmake/third_party/absl.cmake
* cmake/third_party/tensorrt.cmake
* oneflow/xrt/README.md
* oneflow/xrt/any.h
* oneflow/xrt/api.cpp
* oneflow/xrt/api.h
* oneflow/xrt/argument.h
* oneflow/xrt/build_graph.cpp
* oneflow/xrt/build_graph.h
* oneflow/xrt/compilation_cache.cpp
* oneflow/xrt/compilation_cache.h
* oneflow/xrt/executable.h
* oneflow/xrt/graph/algorithm.h
* oneflow/xrt/graph/graph.cpp
* oneflow/xrt/graph/graph.h
* oneflow/xrt/graph/node.cpp
* oneflow/xrt/graph/node.h
* oneflow/xrt/graph_compiler.h
* oneflow/xrt/kernel/op_context.h
* oneflow/xrt/kernel/op_kernel.h
* oneflow/xrt/launch_kernel.cpp
* oneflow/xrt/launch_kernel.h
* oneflow/xrt/launch_op.cpp
* oneflow/xrt/launch_op.h
* oneflow/xrt/node_util.cpp
* oneflow/xrt/node_util.h
* oneflow/xrt/parameter.h
* oneflow/xrt/passes/cluster.cpp
* oneflow/xrt/passes/cluster.h
* oneflow/xrt/passes/pass.cpp
* oneflow/xrt/passes/pass.h
* oneflow/xrt/patches/xla.patch
* oneflow/xrt/platform.cpp
* oneflow/xrt/platform.h
* oneflow/xrt/tensorrt/README.md
* oneflow/xrt/tensorrt/trt_logger.h
* oneflow/xrt/tensorrt/trt_shape.h
* oneflow/xrt/tensorrt/trt_value.h
* oneflow/xrt/tests/README.md
* oneflow/xrt/types.h
* oneflow/xrt/types.proto
* oneflow/xrt/utility/env.h
* oneflow/xrt/utility/registry.h
* oneflow/xrt/utility/stl.h
* oneflow/xrt/xla/README.md
* oneflow/xrt/xla/ops/add_op.cpp
* oneflow/xrt/xla/ops/binary_op.h
* oneflow/xrt/xla/ops/cast_op.cpp
* oneflow/xrt/xla/ops/fc_op.cpp
* oneflow/xrt/xla/ops/gather.cpp
* oneflow/xrt/xla/ops/matmul_op.cpp
* oneflow/xrt/xla/ops/op_context.h
* oneflow/xrt/xla/ops/op_kernel.h
* oneflow/xrt/xla/ops/reduce_op.cpp
* oneflow/xrt/xla/ops/unary_op.cpp
* oneflow/xrt/xla/ops/unary_op.h
* oneflow/xrt/xla/xla_allocator.cpp
* oneflow/xrt/xla/xla_allocator.h
* oneflow/xrt/xla/xla_data_type.cpp
* oneflow/xrt/xla/xla_data_type.h
* oneflow/xrt/xla/xla_executable.h
* oneflow/xrt/xla/xla_helpers.cpp
* oneflow/xrt/xla/xla_helpers.h
* oneflow/xrt/xla/xla_macro.h
* oneflow/xrt/xla/xla_shape.cpp
* oneflow/xrt/xla/xla_shape.h
* oneflow/xrt/xrt.proto