XRT: XLA + TensorRT (#2525)

* Enable multiple definitions for XLA compilation in OneFlow
* Realize running an executable
* Gather the resources needed for compilation (client, builder, etc.) into an abstract CompilationResourceStore
* Implement a separate XLA allocator to avoid pulling in too many TensorFlow objects
* Define CompilationContext separately
* Running XLA in CPU mode is OK now
* Make the result shape after running the executable a tuple, and refine comments
* Add a compilation cache to avoid recompiling every time
* Resolve InferSbpSignature in XlaLaunchOp
* Resolve executing on a specified CUDA stream
* Refine XlaLaunch parallel conf, add batch matmul op
* Refactor job rebuilding and fix up time shape
* Update batch_dim_lbis field if XlaLaunch has any output with a batch dim
* Resolve cluster rings after clustering, taking SBP policy and time shape into consideration
* Add reshape op
* Fix bugs
* Rename CompilationContext to XlaLaunchContext, add XlaRuntimeScope to swap stream handles
* Fix bugs
* Update cmake to optionally compile with XLA
* Support more ops
* Add more ops, and fix bugs
* Implement XLA allocator and internal memory pool
* Adaptively resize allocator memory size
* Refine memory allocator
* Block the host if running a CPU executable
* Fix bug for getting scalar value
* Fix result layout bug that caused wrong results for transpose
* Refine gelu backward
* Of xla sx (#1990)
* add identity xla op
* Add batch gather op
* Refine batch gather
* fix batch gather bug and add gather op, mv identity op to unary_op
* Add softmax and gather/batch_gather
* Add xla softmax_grad op
* Add xla layer normalization op
* Add xla layer norm backward op
* Alias inputs and outputs to compute in-place
* Reuse output buffers when running the XLA executable; zero-copying results brings about a 10% speedup for BERT on a single GPU
* Refine xla allocator
* Refine code style
* Add xla reduce_sum op
* Rewrite model update op to optimizer graph
* Fix hang bugs
* Fix input whose body is disabled in xla launch kernel
* Fix self control in
* Add fake consume op
* Fix HasAttr bug for optional field
* Refine AdamOptimizer
* Fix xla AdamOptimizer bugs
* Add metadata in HLO instruction, and refine
* Fix bugs
* add reduce sum and split normal model update (#2040)
* remove append_func_to_list
* Rm deprecated model update and save code (#1958)
* remove code
* mv random gen to kernel
* make seed required
* address reviews
* fix unused warning
* address reviews
* check in more deprecation
* remove ModelSaveOpConf
* move out ops and modify item (#1962)
* ModelInit.__oneflow_input_remote_blobs__
* fix cpu-only query & add error info (#1964)
* NumaAwareCudaMallocHost (#1959)
* NumaAwareCudaMallocHost
* add conf
* modify check_point and add test check_point (#1963)
* fix misuse of Scope/RAII
* op_name2variable_blob
* add sigmoid test and tanh test (#1966)
* add op matmul and matmul test (#1967)
* rename oneflow.val to oneflow.input_blob_def
* support auto var for convolution (#1972)
* add op add and test add (#1973)
* mv deprecated.pb_util to lib.core.pb_util
* add op get_variable and get_variable test (#1975)
* add op get_variable and get_variable test
* modify shape extend
* AllReduceSequencePass (#1976)
* python2 compatibility for check_point
* fix "return (blob_a, blob_b)" bug
* rename: arg_passing => arg_pass
* shared regst blob header between jobs (#1919)
* half impl
* register manager handles memory sharing for separated memory
* set separated memory shared id for shared regst between jobs
* half impl of python for blob
* fix bug of pod ToProto() when proto is already inited
* fix bug of infer dim0_inner_shape() in foreign_input_op
* 1. PushJob copy from python can infer dim0_valid_num
* add test for dynamic relu
* refine test file
* refine code
* refine note
* update test file for new interface
* rename separated_header* (#1979)
* some bug fixes for a train & eval job (#1978)
* debugging alexnet
* check in test pull_multiple_blob.py
* stricter check
* fix bias in conv
* fix various bugs
* rm file
* op_name in different jobs can be overloaded
* fix compile bug in job_set_compile_ctx
* rm cmake code for building oneflow binary
* check in script (#1980)
* check in script
* rm unused import
* CudaCurrentDeviceGuard (#1977)
* fix val (#1981)
* Merge job set and split fw bw (#1982)
* add MemoryCopier and TensorSliceCopier (#1901)
* add MemoryCopier and TensorSliceCopier
* Index => NdIndex
* refine
* refine
* fix addition error checking (#1911)
* Merge dev_mixed_precision into dev_split_fw_bw (#1904)
* update binary_func.h
* update
* update ndarray
* update
* update
* update
* update
* refactor(data_type.h): better representation
* fix(unary_func.h): fix typo
* style(data_type.h): format
* Merge dev_mixed_precision: Part-2 (#1907)
* feat: add NewKernelUtil
* fix typos
* feat: add cublas_tensor_op_math_handle()
* add gemm (#1860)
* add gemm
* save
* add blobgemm
* update
* update
* fix cu
* update cpp
* feat: NewKernelUtil -> NewKernelUtil<DeviceType>
* feat: update FullyConnectedKernel to use NewKernelUtil
* Dev sx mixed precision (#1861)
* add gemm
* save
* add blobgemm
* update
* update
* fix cu
* update cpp
* save cpp
* save
* add relu and relu_backward
* remove spare space
* add explicit declaration
* rename
* feat: update ConvKernel to support half
* add sigmoid and tanh (#1867)
* add axpy (#1866)
* style: formatting
* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
* refine(new_kernel_util.h)
* refine(new_kernel_util.cu)
* feat(new_kernel_util): add OFBatchedGemm()
* feat: update MatMulKernel to support half
* feat: update ConvData/Bias/FilterGradKernel to support half
* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
* feat: support loss scale
* fix(operator): :bug: add InferHasBatchDim()
* feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
* refactor(cast_kernel.h/cpp): :recycle: update cast_kernel.cpp to support float2half and half2float
* style(kernel/cast_kernel.cpp): formatting
* fix(cuda_device_context.h): :bug: add cublas_tensor_op_math_handle()
* style(cast_kernel.cpp): formatting
* feat(new_kernel_util): :sparkles: support Transpose in NewKernelUtil
* refactor(transpose_kernel): :recycle: use NewKernelUtil instead of KernelUtil
* feat(dropout_kernel): :sparkles: update DropoutKernel to support half
* refactor(dropout_kernel): remove backward funcs
* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support half
* fix(conv_op.cpp): :bug: add InferHasBatchDim() and GetSbpSignatures() (only simple)
* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
* fix: fix little bugs
* fix(conv_data/filter_grad_op): min byte size of buf blob is 1
* feat: support half for bias_add_kernel
* fix(bias_add_op): remove data type check
* feat(relu_kernel): support half
* refactor: add ADD_GPU_HALF_KERNEL_CREATOR
* fix: typos
* feat(pooling_kernel): support half
* fix: remove CHECK_EQ of default data type
* feat(pooling_grad_kernel): support half
* feat: support half in ofrecord_encoder (TODO)
* fix
* feat: support half in sparse_cross_entropy_kernel
* debug grad op (#1883)
* Dev debug op mixed precision (#1884)
* debug grad op
* do nothing instead of UNIMPLEMENTED
* fix(dropout_kernel): add tmp_split_fw_bw condition
* build(half.cmake): https -> http
* fix(record_load_kernel): support total_batch_num
* fix pooling (#1885)
* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
* fix: add GetCudnnScalingParameters() to fix scaling params
* fix: add enable_true_half_config_when_conf() into config and update related code
* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
* refactor(matmul_kernel): remove Backward()
* feat(new_kernel_util): support HGemmWithFloat(), which uses cublasSgemmEx()
* feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
* refactor(new_kernel_util.cu): remove static from funcs in anonymous namespace
* feat(job_conf.proto): add enable_auto_mixed_precision field
* feat(auto_mixed_precision_lists): add amp_lists
* feat(auto_mixed_precision): build the skeleton
* feat(auto_mixed_precision): almost finish amp graph pass
* feat(auto_mixed_precision.cpp): complete InsertCastOp()
* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
* refine(auto_mixed_precision.cpp): refine LOG
* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
* Dev half ndarray (#1886)
* debug grad op
* ZeroVal => GetZeroVal; OneVal => GetOneVal
* MaxVal => GetMaxVal; MinVal => GetMinVal
* check data type
* DevDType
* move function template to struct template for BinaryFunc* and UnaryFunc*
* support half for reduce_sum_kernel
* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
* half for NdarrayUtil
* OF_DEVICE_FUNC is always inline
* half for NdarrayApplyUnaray
* simplify usage of NdarrayUtil
* UnaryFuncExp
* add VarNdarrayBuilder and ValNdarrayBuilder
* simplify NdarrayUtil in layer_norm_param_grad_kernel
* InplaceBroadcast
* remove SoftmaxKernelUtil
* half for softmax_kernel
* fix improper use of __CUDA_ARCH__
* disable sm_30, sm_52
* refine(conv_kernel.cu): fix typo
* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
* fix: fix typos of GetOneVal
* fix(auto_mixed_precision.cpp): allocate for shared_ptr
* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
* fix(auto_mixed_precision.cpp): fix typo
* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modifying the ibn of PrintOp
* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
* feat(auto_mixed_precision.cpp): more logs
* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
* fix(bias_add_op.cpp): fix bias_multiplier shape
* feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
* feat: update MatmulKernel and new_kernel_util to support half
* refactor(auto_mixed_precision): add ClearList and refine code
* feat(tanh_*_kernel): support half
* feat(add_kernel): support half
* update binary_func.h
* update
* update ndarray
* update
* update
* update
* update
* refactor(data_type.h): better representation
* fix(unary_func.h): fix typo
* style(data_type.h): format
* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
* style(CMakeLists.txt): fix typo
* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
* fix(auto_mixed_precision.cpp): group inserted cast ops by lbn
* fix get one ptr (#1913)
* fix(layer_norm): add LayerNormOp to grey_list and support half
* fix(layer_norm about): fix it to run with amp
* fix: move fix sbp signature from OpNode to OpGraph
* Dev new kernel util (#1925)
* refactor(kernel/util): refactor NewKernelUtil and add DnnIf
* refactor(kernel/util): add BlasIf
* refactor(kernel/util): add ArithemeticIf
* refactor(kernel/util): add cuda_kernel_util.*
* refactor: refactor NewKernelUtil
* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
* refactor(new_kernel_util.h): remove unused header files
* refactor: refactor loop include
* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
* do not compile CUDA_NVCC_FLAGS arch=compute_70 when CUDA_VERSION… (#1936)
* do not compile CUDA_NVCC_FLAGS arch=compute_70 when CUDA
* CHECK cuda version > 10.0 when using auto_mixed_precision
* Fix bug of Snapshot deleting unwanted files (#1937)
* fix link bug of release version (#1938)
* delete redundant code in OpGraph, JobCompleter and Operator (#1927)
* 1. delete redundant code in OpGraph, JobCompleter and Operator 2. fix bug of Snapshot deleting unwanted files 3. refine ReadMe
* revert README change
* split into 2 pull requests
* Refactor Kernel Registry V2: The clear & easy Way (#1941)
* refactor(resource.proto): move DeviceType to common/device_type.proto
* feat(kernel_registration): add kernel_registration.h/cpp
* feat(kernel_registration): update matmul_kernel to support new registration
* feat: add CreateKernel for new registry
* feat: update registry of cast conf
* refactor(kernel_registration): remove KernelRegMap
* fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
* grpc SetMaxMessageSize(INT_MAX) (#1950)
* fix bug of Graph::ForEachConnectedComponent (#1952)
* Grpc set max size (#1953)
* grpc SetMaxMessageSize(INT_MAX)
* set max msg len for ctrl service
* code for testing grpc max msg size
* remove test code
* Merge job set and split fw bw (#1983) (same commit list as #1982 above, plus CudaCurrentDeviceGuard (#1977))
* delete tmp_split_fw_bw_train_conf (#1985)
* delete tmp_split_fw_bw_train_conf
* delete useless comments
* fix refactor bug in layer_norm_op
* minor fixes
* update py script
* remove code that could be misleading
* Fix all reduce mem sharing (#1986)
* fix all reduce mem sharing
* ByteSizeOfDataContentField => ByteSizeOfBlobBody
* remove obsolete task_graph optimization
* no arg_pass_job for variable_op
* merge memory block id between jobs (#1910)
* refine MemBlock and CriticalSection
* job memory sharing strategy
* revert diff in CriticalSectionDesc
* Merge memory block between sub plans
* Get mutual exclusion job groups
* forgot to consider memory merge only on the same machine
* memory zone unique id
* Merge done; merge memory block ids from right to left; get memory block ids info
* revert MemBlock
* generate mutual exclusion job groups: done
* update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implement of Merge memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as more as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set(#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/swith_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in 
IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdate code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace 
data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more infomation of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before 
adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistant * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by return Maybe<void> type (#2051) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug 
when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initilizer (#2082) * zero constant initilzer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implement (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc ; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global 
New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes *
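The global-step and learning-rate commits above (AutoGlobalStep, AutoLearningRate, LearningRateScheduleOp, the learning_rate blob) revolve around one idea: the train step is a counter updated every iteration, and the learning rate is recomputed from it by a schedule op. A minimal sketch of that idea, with illustrative names rather than OneFlow's real API:

```python
# Sketch of a train-step counter driving a learning-rate schedule,
# in the spirit of AutoGlobalStep / LearningRateScheduleOp.
# All names here are illustrative, not OneFlow's real API.

def exponential_decay(base_lr, decay_rate, decay_steps, step):
    # lr = base_lr * decay_rate ** (step / decay_steps)
    return base_lr * decay_rate ** (step / decay_steps)

class TrainStepCounter:
    """Stands in for the train_step blob that an assign op increments."""
    def __init__(self):
        self.step = 0

    def next(self):
        self.step += 1
        return self.step

counter = TrainStepCounter()
lrs = [exponential_decay(0.1, 0.5, 10, counter.next()) for _ in range(20)]
```

With `decay_steps=10` and `decay_rate=0.5`, the rate halves every ten steps: 0.05 at step 10, 0.025 at step 20.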
Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * check in softmax loss * nn.conv2d and nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data load * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix several ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alexnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge
general ops (#2131) * merge some general ops to dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * truncated normal for bert (#2150) * Update bert for dev python (#2151) * truncated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error message in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/save args * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implementation of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge
placement group when add op; remove MergePlacementGroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check that blob split_axis can be divided by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc etc. * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fix bug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placement scope exit * 1. fix bug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placement group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) * 
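The model-parallel commits above hinge on one invariant: before a blob can be scattered across devices, its split_axis dimension must be divisible by the parallel num (and a dense layer with model_split_axis=0 splits its weight's output dimension). A hedged sketch of that check; `split_blob_shape` is a hypothetical helper, not OneFlow's implementation:

```python
# Sketch of the "split_axis must be divisible by parallel num" check
# from the model-parallel commits. split_blob_shape is a hypothetical
# helper, not OneFlow's implementation.

def split_blob_shape(shape, split_axis, parallel_num):
    """Return the per-device shape when splitting `shape` on `split_axis`."""
    if shape[split_axis] % parallel_num != 0:
        raise ValueError(
            "dim %d on axis %d is not divisible by parallel num %d"
            % (shape[split_axis], split_axis, parallel_num))
    per_device = list(shape)
    per_device[split_axis] //= parallel_num
    return tuple(per_device)

# A dense layer weight split on axis 0 across 4 devices:
weight_shape = (1024, 768)
per_device_shape = split_blob_shape(weight_shape, 0, 4)  # (256, 768)
```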
InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda lifetime; gen job build error add location info * refine error generation and return * refine check lbi valid and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. * Remove fake consume op * Support enable/disable XLA by set env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Of xla (#2237) * 1.
PushJob copy from python can infer dim0_valid_num * add test for dynamic relu * refine test file * refine code * refine note * update test file for new interface * rename separated_header* (#1979) * some bug fixes for a train&eval job (#1978) * debugging alex net * check in test pull_multiple_blob.py * stricter check * fix bias in conv * fix various bugs * rm file * op_name in different jobs can be overloaded * fix compile bug in job_set_compile_ctx * rm cmake code for building oneflow binary * check in script (#1980) * check in script * rm unused import * CudaCurrentDeviceGuard (#1977) * fix val (#1981) * Merge job set and split fw bw (#1982) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * update * update ndarray * update * update * update * update * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support 
half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half * feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and 
GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complete InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnary * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> 
CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() to modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * update * update ndarray * update * update * update * update * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * 
refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when using auto_mixed_precision * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: update registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * Merge job set and split fw bw (#1983) * CudaCurrentDeviceGuard (#1977) * delete tmp_split_fw_bw_train_conf (#1985) * delete tmp_split_fw_bw_train_conf * delete useless comments * fix refactor bug in layer_norm_op * minor fixes * update py script * remove code that could be misleading * Fix all reduce mem sharing (#1986) * fix all reduce mem sharing * ByteSizeOfDataContentField=>ByteSizeOfBlobBody * remove obsolete task_graph optimization * no arg_pass_job for variable_op * merge memory block id between jobs (#1910) * refine MemBlock and CriticalSection * job memory sharing strategy * revert diff in CriticalSectionDesc * Merge memory block between sub plans * Get mutual exclusion job groups * forget to consider memory merge only in same machine * memory zone unique id * Merge Done; merge memory block id from right to left; get memory block ids info * revert MemBlock * generate mutual exclusion job groups Done.
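The inter-job memory-sharing commits above ("merge memory block id between jobs", "Get mutual exclusion job groups") reduce to one idea: jobs that can never run concurrently may reuse the same memory block, so their block ids are merged. A toy sketch of that merge; the data layout and helper name are illustrative, not OneFlow's InterJobMemSharingUtil:

```python
# Toy sketch of merging memory block ids between mutually exclusive
# jobs, in the spirit of the inter-job memory-sharing commits.
# Data layout and names are illustrative, not OneFlow's.

def merge_block_ids(job_mem, exclusion_groups):
    """job_mem: {job_name: block_id}. Jobs in one exclusion group never
    run concurrently, so they can all reuse the smallest block id."""
    merged = dict(job_mem)
    for group in exclusion_groups:
        shared = min(merged[j] for j in group)
        for j in group:
            merged[j] = shared
    return merged

job_mem = {"train": 0, "eval": 1, "model_save": 2}
# Assumption for illustration: train and eval never run at the same time.
merged = merge_block_ids(job_mem, [["train", "eval"]])
```

After the merge, train and eval share block 0 while model_save keeps its own block.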
* update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implementation of merging memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as much as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as much as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as much as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set (#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/switch_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in
IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as much as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdated code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace
data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more information of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before
adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistent * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by returning Maybe<void> type (#2051) * Refine Infer APIs by returning Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by returning Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug
when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initializer (#2082) * zero constant initializer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implementation (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global
New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * 
Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * check in softmax loss * nn.conv2d and nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data load * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix several ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove
total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alexnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge
general ops (#2131) * merge some general ops to dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * truncated normal for bert (#2150) * Update bert for dev python (#2151) * truncated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error message in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * Fix python internal (#2133) * Return error message in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) *
feat: update oneflow_internal_helper to use func * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/save args * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implementation of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge
placement group when add op; remove mergeplacementgroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check blob split_axis can be divided by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc. * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fixbug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placement scope exit * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placement group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * remove useless lines * fix bert and init_cluster_env for multi nodes * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) *
InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda lifetime; gen job build error add location info * refine error generation and return * refine check lbi valid and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. * Remove fake consume op * Support enable/disable XLA by setting env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Fix static cublas library and xla link conflict * Fix cublas link conflict with tensorflow * Fix different connection kinds for multiple gpu cards (#2282) * Refine xla cluster algo (#2289) * Fix different connection kinds for multiple gpu cards * Fix bug for multiple outputs consumed by one node * Refine cluster algo * Refine MarkClusterId pass and ReduceSplit task node (#2314) * Fix different connection kinds for multiple gpu cards * Fix bug for multiple outputs consumed by one node * Refine cluster algo * Determine fusion disabled edges * update * Produce multiple registers on edges for ReduceSplit task node. Fix new allocator by stream id.
* Refine MarkClusterId pass * Clustering subgraph with reverse ordering is better * Support strict clustering by taking dependencies into consideration * Translate rebuild job and rewrite optimizer into passes, and refine code style * Fix spell error * Update cmake * Merge branch dev_python (#2321) * Dev res50 new api (#2173) * check in script * runnable * fix multinode * fix and real train * fix param data_format * fix truncated normal * quick fix multi node launch (#2193) * Dev reshape sbp (#2192) * reshape sbp * more check for reshape conf * fix error CHECK * refactor reshape * fix reshape like op * support naive case of s0 * refine * rm redundant code * more generous check for equal element cnt * restore empty line * add GatherMs0Grad op (#2191) * support for gather with s(0) `in' * add gather_ms0_op * fix bugs in message GatherMs0OpConf and GatherMs0Kernel * only (B, S(0)) -> P supported for gather_ms0 op * add GatherMs0Grad op * minor fix * refine code * bugfix and update gather test case * add concat op and pass the test (#2067) * add concat op and pass the test * add vgg job_conf * model compared to be same as the old one * rm unnecessary file * Update array_ops.py * mv file * get rid of ternary operator (#2195) * Dev reshape util struct (#2194) * check in changes * rm file * minor fix * Merge network files of 2 cnns (#2196) * add inceptionV3 * check in vgg16 * add cnns test scripts for dev_python (#2170) * add cnns test scripts for dev_python * add alexnet test scripts * add resnet50 * add inceptionv3 * add resnet50 * add vgg16 * first version of run_cnns_test.py * remove old files * unsorted_segment_sum (#2198) * oneflow.unsorted_segment_sum (#2199) * oneflow.unsorted_segment_sum * remove unused import * remove unused import * Dev batch unsorted segment sum (#2200) * oneflow.unsorted_segment_sum * remove unused import * remove unused import * rename UnsortedSegmentSum to BatchUnsortedSegmentSum * rename: batch_unsorted_* => unsorted_batch_* *
unsorted_segment_sum (#2201) * unsorted_segment_sum * fix job_completer/unsorted_segment_sum_grad.cpp * more check for unsorted_segment_sum batch_axis * remove FixParallelDesc (#2202) * rm KernelIfWithModel KernelIfWithActivation (#2203) * remove KernelIfWithActivation * remove KernelIfWithModel * rm blob header kLossInstanceNum (#2204) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * fix warning: return string reference to temporary (#2212) * docker build support (#2002) * update cmake files * check in files * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * shrink ctx size * fix script * fix wheel build * fix wheel build not adding .so (#2052) * lower cmake version bar * rm more files * keep build dir * check in test bash script * fix * Dev docker sx (#2124) * add python2 docker env * rm old docker files * update repository * add ARG CUDA and USE_PYTHON_3_OR_2 * reform files * update * rm log doesn't print when there is cache * use default arg in dockerfile * better py 2 or 3 condition * add default * use if * update alexnet * update for bert * 15->16 * add resnet50 in model (#2217) * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215) * remove parallel policy * rm FC/rnn/embedding_look_up op/kernel * add check data parallel for 
conv/layer_norm op * bugfix: bias add + use math_add when batch size = 1 * fix InferBatchAxis (#2220) * sync with bert_benchmark (#2221) * sync with bert_benchmark * rename run.sh * Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * run successfully on single GPU * fix 121 for tick (#2069) * delete unnecessary multiply_grad class * speed up generate time for dot2svg (#2083) * Add axis conf to bias_add for any axis channel (#2087) * bias_add completion * follow comment * make conf axis required * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091) This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
* updated * fix segment_sum_grad * fix sbp * fix segment_sum impl for data parallel * fix * remove useless code in segment_kernel_util.h * add python interface * fix sigmoid conf * fix naming error * fix typo * temp mod loss sbp * add LazyAdam * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep * rm useless code * unsorted_segment_sum * refactor sigmoid_cross_entropy_loss_kernel to high performance * Improve sigmoid cross entropy loss grad (#2207) * remove for loop called cuda kernel * minor fix * ../oneflow/python/ops/data_ops.py (#2209) * fix lazy_adam * Merge wnd and python (#2214) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * merge dev_python * fix boxing: P->S(0) * check in docker build scripts (#2216) * Dev python widedeep docker (#2218) * check in docker build scripts * check in .dockerignore * rm oneflow.segment_sum * remove segment_sum * rm unused file * rm debug code * rm debug code * rm double empty lines * remove useless comments * fix send msg (#2227) * fix reduction_coefficient (#2228) * refactor ndarray for eq/ne/... 
* Dev kernel launch synchronized (#2230) * IsKernelLaunchSynchronized * virtual * refine * refine * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC * more static_assert * remove unused task related dot function (#2236) * remove unused task related dot function * do not output dot rank info * Dev non distributed optimizer js (#2234) * op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix deadlock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization * refine lazy adam (#2244) * refine lazy adam * update * memory version 2 step 1: replace original concept about mem sharing (#2242) * mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem * memory version 2 step 1: replace original concept about mem sharing * record reader multi thread (#2246) * multi thread * ComputeThreadPoolSize * python api * Fix random decode (#2252) * add decode random * fix decode random actor * Dev pr boxing v2 (#2248) * NcclDeviceCtx * include naive_actor * refine * use_boxing_v2 * config.use_boxing_v2 * SubTskGphBuilder * fix * hash<oneflow::MemoryCase> * Maybe<void> * ChainSubTskGphBuilder * SliceBoxingOp * return ok * SliceBoxingKernel * SliceBoxingActor * kSliceBoxing * nccl boxing op * nccl actor * REGISTER_OP * GetMsgFromCustomizedConf * NcclBoxingTaskNode * BldSubTskGphByBoxingV2 *
NcclBoxingSubTskGphBuilder * fix * fix * NcclKernel * ParallelContext * REGISTER_ACTOR * fix rank set * IsNcclTaskType * limit * 1024 * multi thread reader * thread_num * IsKernelLaunchSynchronized * refine * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx * MakeHostMemCase * NcclBldSubTskGph * remove use less code * use_boxing_v2 * refine * refine * refine * refine * refine * cmake find python note when version less 3.14 (#2286) * fix bug: reduce split kernel inplace (#2297) * Dev bias add (#2299) * use bias add * fix * bias_add * bias add half * fix * reinterpret_cast * fix half * HALF * fix * ADD_DEFAULT_KERNEL_CREATOR * fix * format * Fix dev python test (#2294) * add decode random * fix decode random actor * fix dev_python test scripts * fix batch_size test scripts * fix * Memory Version 2.0 Step 2: MemSharedAndReused between jobs (#2267) * MemBlockProto and ChunkProto * create mem block and chunk after improver * interface merge mem block and chunk between sub plans * merge chunk between jobs for memory reuse * using memory zone unique id replace memory case hash * merge interface op mem block between jobs for mem shared * gen GlobalCriticalSection by mem block id and chunk id * check mem block and chunk valid before runtime * Refactor: RegstMgr ; allocate memory by mem block and chunk instead of regst * fix bug; and pass test * fig bug: init chunk_id_count in id_manager * reuse copyHd out mem between jobs * PushPlan and PullPlan for memblock and chunk * refine merge mem block / chunk in oneflow.cpp * at(i); * GetOpName2JobId2TaskProtos functional * using output ptr; pass test AlexNet and Resnet * Fix xla reshape op * Merge upstream of_xla (#2322) * Dev res50 new api (#2173) * check in script * runable * fix multinode * fix and real train * fix param data_format * fix truncated normal * quick fix multi node launch (#2193) * Dev reshape sbp (#2192) * reshape sbp * more check for reshape conf * fix error CHECK * refactor reshape * fix reshape like op * 
support naive case of s0 * refine * rm redundant code * more generous check for equal element cnt * restore empty line * add GatherMs0Grad op (#2191) * support for gather with s(0) `in' * add gather_ms0_op * fix bugs in message GatherMs0OpConf and GatherMs0Kernel * only (B, S(0)) -> P supported for gather_ms0 op * add GatherMs0Grad op * minor fix * refine code * bugfix and update gather test case * add concat op and pass the test (#2067) * add concat op and pass the test * add vgg job_conf * model compared to be same as the old one * rm unnecessary file * Update array_ops.py * mv file * get rid of ternary operator (#2195) * Dev reshape util struct (#2194) * check in changes * rm file * minor fix * Merge network files of 2 cnns (#2196) * add inceptionV3 * check in vgg16 * add cnns test scripts for dev_python (#2170) * add cnns test scripts for dev_python * add alexnet test scripts * add resnet50 * add inceptionv3 * add resnet50 * add vgg16 * first version of run_cnns_test.py * remove old files * unsorted_segment_sum (#2198) * oneflow.unsorted_segment_sum (#2199) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * Dev batch unsorted segment sum (#2200) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * rename UnsortedSegmentSum to BatchUnsortedSegmentSum * rename: batch_unsorted_* => unsorted_batch_* * unsorted_segment_sum (#2201) * unsorted_segment_sum * fix job_completer/unsorted_segment_sum_grad.cpp * more check for unsorted_segment_sum batch_axis * remove FixParallelDesc (#2202) * rm KernelIfWithModel KernelIfWithActivation (#2203) * remove KernelIfWithActivation * remove KernelIfWithModel * rm blob header kLossInstanceNum (#2204) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch 
size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * fix warning: return string reference to temporary (#2212) * docker build support (#2002) * update cmake files * check in files * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * shrink ctx size * fix script * fix wheel build * fix wheel build not adding .so (#2052) * lower cmake version bar * rm more files * keep build dir * check in test bash script * fix * Dev docker sx (#2124) * add python2 docker env * rm old docker files * update repository * add ARG CUDA and USE_PYTHON_3_OR_2 * reform files * update * rm log doesn't print when there is cache * use default arg in dockerfile * better py 2 or 3 condition * add default * use if * update alexnet * update for bert * 15->16 * add resnet50 in model (#2217) * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215) * remove parallel policy * rm FC/rnn/embedding_look_up op/kernel * add check data parallel for conv/layer_norm op * bugfix: bias add + use math_add when batch size = 1 * fix InferBatchAxis (#2220) * sync with bert_benchamrk (#2221) * sync with bert_benchamrk * rename run.sh * Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) 
* run successfully on single GPU
* fix 121 for tick (#2069)
* delete unnecessary multiply_grad class
* speed up generate time for dot2svg (#2083)
* Add axis conf to bias_add for any axis channel (#2087)
* bias_add completion
* follow comment
* make conf axis required
* Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091). This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
* Dev cuda 9 arch 70 (#2318)
* kCudaAlignSize = 256
* always compute_70
* __CUDA_API_VERSION >= 10000
* __CUDA_API_VERSION >= 10000
* disable_all_reduce_sequence
* Fix xla reshape op
* Fix compilation without xla
* Remove useless code and fix data type mismatch in field desc (#2326)
* Remove useless code
* Refine code style
* Fix data type mismatch in field desc
* Update README.md (#2335)
* Refine code style (#2336)
* Update XLA usage document (#2337)
* Update XLA usage document
* Fix mistakes
* Add xla clang-format and format codestyle (#2340)
* Revert "Add xla clang-format and format codestyle (#2340)" (#2341). This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.
* Add xla clang-format and format codestyle (#2342)
* Add xla clang-format and format codestyle
* Fix header file missing
* Of xla sx (#2334)
* add gather grad op and pass testing
* rm check
* done batch gather grad
* pass test
* modify according to the review
* add unsorted_segment_sum and refine unsorted_batch_segment_sum
* reform according to review
* reformat according to clang-format and rm reference to the temp object
* Pick step0 and step1 new commits (#2346)
* Add xla clang-format and format codestyle
* Fix header file missing
* Modify codes to support XLA. Conflicts: oneflow/core/job/job_builder.cpp, oneflow/core/job/job_builder.h, oneflow/core/operator/op_conf.proto
* Fix a bug for building subgraph although it won't lead to wrong results (#2347)
* Fix setting is_mutable in xla launch op (#2349)
* Change directory xla to xrt, apply patch if building with xla
* Refactor
* Add infer shape pass, and refactor launch kernel, graph compiler
* Refine code style, add xla executable and graph compiler
* Rename platform.proto as types.proto
* change OpCompiler to OpKernel, complete xla graph compiler
* Fix compilation bugs and add allocator, now xla compilation is ok
* Add xla executable runtime
* Add executable run scope to support launch kernel on specific stream.
* Fix infer shape pass, and revert cuda event pool
* Refactor graph building with attaching argument metadata.
* Set mutability if rebuilding job
* Set device ordinal correctly
* Refine DelOps
* Refine Argument definition and abstract function as subgraph
* Fix infer shape in xrt launch op and launch kernel.
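The graph-compiler commits above build on the compilation cache mentioned earlier ("Add compilation cache to solve recompiling every time", later refined as "Refine xrt signature hash"). A minimal sketch of that idea: key each compiled executable by a signature hash of the launch op and its input shapes/dtypes, so a launch op is only recompiled when its argument signature changes. All class and function names here are illustrative, not OneFlow's actual API.

```python
import hashlib

class CompilationCache:
    """Caches compiled executables keyed by a signature of the inputs,
    so the same subgraph is not recompiled on every iteration.
    Illustrative sketch only, not the real XRT classes."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # builds an executable from a signature
        self._cache = {}

    @staticmethod
    def signature(op_name, shapes, dtypes):
        # Hash the launch op name plus every input shape/dtype; a new
        # combination means the executable must be rebuilt.
        h = hashlib.sha256(op_name.encode())
        for shape, dtype in zip(shapes, dtypes):
            h.update(repr(shape).encode())
            h.update(dtype.encode())
        return h.hexdigest()

    def get_or_compile(self, op_name, shapes, dtypes):
        key = self.signature(op_name, shapes, dtypes)
        if key not in self._cache:
            self._cache[key] = self._compile_fn(op_name, shapes, dtypes)
        return self._cache[key]

compilations = []
cache = CompilationCache(
    lambda *sig: compilations.append(sig) or f"exe-{len(compilations)}")
cache.get_or_compile("launch_0", [(32, 128)], ["float32"])
cache.get_or_compile("launch_0", [(32, 128)], ["float32"])  # cache hit
print(len(compilations))  # compiled only once
```

Hashing the full signature rather than comparing shapes element by element keeps the cache lookup cheap even for launch ops with many inputs.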
* Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.
* Refine code style
* Rename xla Operand as XlaValue.
* Complete TensorRT compiler and builder, refine OpKernel
* Pick public code changes from the new tensorrt branch.
* Fix tensorrt compilation
* Fake implementation of trt executable
* Support selecting engine in launch kernel, refine trt executable
* Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfix.
* Support train phase setting for registered op kernel
* Remove RewriteOptimizer pass, update xla optimizer op.
* Format job builder .h and .cpp files.
* Remove RewriteOptimizer pass, update xla optimizer op.
* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
* Refine code style and comment.
* Refine model update inference for launch op.
* Refine
* Refine code style and comment.
* Refine model update inference for launch op. Conflicts: oneflow/xrt/kernel/op_kernel.h, oneflow/xrt/node_util.cpp, oneflow/xrt/node_util.h, oneflow/xrt/passes/cluster.h, oneflow/xrt/passes/mark_cluster_id_pass.cpp, oneflow/xrt/passes/rebuild_job_pass.cpp, oneflow/xrt/types.h
* Add xrt README.md
* Add use_xla_jit and use_tensorrt options in job proto
* Refine code style
* Fix BlobDesc getter and xla LayerNorm op for FP16
* Make use_xla_jit and use_tensorrt configurable from python config and env variables.
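One item above notes that the TensorRT executable is rebuilt when the runtime batch size exceeds the engine's default max batch size. A toy model of that rebuild policy, with growth headroom so a slightly larger batch does not trigger a rebuild on every call; the class and names are illustrative, not the real XRT implementation:

```python
class TrtExecutable:
    """Toy model of the rebuild policy: the engine is built for a default
    max batch size and rebuilt with a larger limit when a bigger batch
    arrives. Illustrative only, not the real XRT TrtExecutable."""

    def __init__(self, build_fn, max_batch_size=32):
        self._build_fn = build_fn
        self.max_batch_size = max_batch_size
        self.engine = build_fn(max_batch_size)

    def run(self, batch_size):
        if batch_size > self.max_batch_size:
            # Double the limit until the batch fits, so repeated small
            # increases do not each force an expensive engine rebuild.
            while self.max_batch_size < batch_size:
                self.max_batch_size *= 2
            self.engine = self._build_fn(self.max_batch_size)
        return self.engine

builds = []
exe = TrtExecutable(lambda mbs: builds.append(mbs) or f"engine(max_batch={mbs})")
exe.run(16)    # fits the default limit, no rebuild
exe.run(100)   # exceeds 32, rebuilt with max_batch=128
print(builds)  # [32, 128]
```

Engine rebuilds are expensive, which is why the commit grows the limit beyond the observed batch size instead of rebuilding to the exact value.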
* Update benchmark
* Refine xrt README and rename compile_with_xrt.h file
* Update README
* Revert tensorrt
* Fix absl missing if building with TensorRT but without XLA
* Update xrt benchmark
* Disable WITH_XLA by default
* Update xrt benchmark
* Format xrt as core
* add activation op
* add softmax op
* Refine code style, remove unused code
* Remove duplication of XLA usage
* test pass
* pooling test pass
* add concat op, not tested
* add activation ops, test not passed
* Add xla gelu unittest
* add activation op, and test passed
* add pooling op, and test passed
* Fix int64 env variable
* Export float16 for python
* Add xla relu unittest
* try to solve conv bug
* add elementwise add op, test passed
* add concat op, test passed
* Bugfix: transfer weights from gpu to host since tensorrt requires host weights.
* add op unit tests
* resolve conflicts and fix softmax bug
* add identity op and topk op, to test
* Add xla bias add and reshape unittests
* Add xla identity unittest
* Add xla cast and scalar op unittests
* Add xla broadcast op and transpose unittests
* Add xla add, sigmoid and tanh unittests
* add reduce mean op, test passed
* format ops, add CHECKs, and optimize function structure
* Add xla gather and batch_gather unittests
* Add xla softmax unittest and fix softmax bug if axis is not the last dim.
* add trt gather op and unit test
* Add xla reduce_sum unittest, and support keep_dims for xla reduce
* Add xla layer_norm unittest, and refine xla layer norm op
* Add reshape_like unittest, and export reshape_like api
* Refine xrt unittest code style
* Export softmax_grad op, add softmax_grad unittest
* Export tanh_grad op and add xla unittest
* Export gelu_grad op, and add xla unittest
* add conv unit test
* reformat
* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
* Commit to merge upstream of_xrt
* check files
* modify files according to review advice.
* Add xrt unittests (#2483)
* Revert tensorrt
* Fix absl missing if building with TensorRT but without XLA
* Update xrt benchmark
* Add xla gelu unittest
* Fix int64 env variable
* Export float16 for python
* Add xla relu unittest
* Add xla bias add and reshape unittests
* Add xla identity unittest
* Add xla cast and scalar op unittests
* Add xla broadcast op and transpose unittests
* Add xla add, sigmoid and tanh unittests
* Add xla gather and batch_gather unittests
* Add xla softmax unittest and fix softmax bug if axis is not the last dim.
* Add xla reduce_sum unittest, and support keep_dims for xla reduce
* Add xla layer_norm unittest, and refine xla layer norm op
* Add reshape_like unittest, and export reshape_like api
* Refine xrt unittest code style
* Export softmax_grad op, add softmax_grad unittest
* Export tanh_grad op and add xla unittest
* Export gelu_grad op, and add xla unittest
* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
* Commit to merge upstream of_xrt
* Fix reduce_mean facade bug if keep_dims is true.
* Refine tensorrt unittests
* Check failed if full reduce without keep dimension.
* add pooling unit test
* Add tensorrt bias_add and reshape op, and their unittests.
* Support fp16 for tensorrt.
* Add tensorrt transpose op and unittest.
* add unit test conv_2d
* add unit test concat
* Fix concat if axis is -1.
* Refine tensorrt conv2d unittest
* Fix padding mode for conv2d and pooling, refine unittests.
* Refine tensorrt concat unittest
* Add convert api from string engine to XrtEngine.
* Revert tensorrt, and merge of_xrt branch
* Remove some comments.
* Refine tensorrt unittests
* Add XrtConfig to deal with xla and tensorrt configurations. Conflicts: oneflow/xrt/api.cpp
* Update tensorflow.cmake to avoid applying the patch repeatedly.
* Remove XrtConfig Option, and fix xrt unittests
* Add tensorrt batch norm (#2516)
* Refine xrt signature hash, and fix python configuration (#2520)
* Fix XrtCompilationEnabled returns (#2524)
* Fix compilation after merge dev_python
* Update xrt unittests
* Revert protobuf version
* Remove comment FOR_RANGE
* Remove unused code
* Reformat
* Refine job builder
* Disable dump job if not debug mode

Co-authored-by: NSnow <snow3s@qq.com>
Co-authored-by: NJuncheng <liujuncheng1022@gmail.com>
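Several commits above make `use_xla_jit` and `use_tensorrt` configurable from both the python config and environment variables. A minimal sketch of that resolution order, where an explicit config value wins over the environment; the variable names (`FLAGS_use_xla_jit`, `FLAGS_use_tensorrt`) and the function are placeholders, not OneFlow's actual flags:

```python
import os

def env_bool(name, default=False):
    """Parse a boolean-ish environment variable ("1"/"true"/"on"/"yes")."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "on", "yes")

def resolve_engine(config_use_xla_jit=None, config_use_tensorrt=None):
    """An explicit python config wins; otherwise fall back to env variables.
    Sketch only: names and the one-engine restriction are assumptions."""
    use_xla = config_use_xla_jit if config_use_xla_jit is not None \
        else env_bool("FLAGS_use_xla_jit")
    use_trt = config_use_tensorrt if config_use_tensorrt is not None \
        else env_bool("FLAGS_use_tensorrt")
    if use_xla and use_trt:
        raise ValueError("choose at most one engine per launch op")
    return "XLA" if use_xla else ("TENSORRT" if use_trt else "DEFAULT")

os.environ["FLAGS_use_xla_jit"] = "true"
print(resolve_engine())                          # "XLA" (from env)
print(resolve_engine(config_use_xla_jit=False))  # "DEFAULT" (config overrides env)
```

Parsing env values through a helper like `env_bool` also sidesteps the "Fix int64 env variable" class of bug, where raw strings are compared or cast inconsistently.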
New files added in this PR (0 → 100644):
* cmake/third_party/absl.cmake
* cmake/third_party/tensorrt.cmake
* oneflow/xrt/README.md
* oneflow/xrt/any.h
* oneflow/xrt/api.cpp
* oneflow/xrt/api.h
* oneflow/xrt/argument.h
* oneflow/xrt/build_graph.cpp
* oneflow/xrt/build_graph.h
* oneflow/xrt/compilation_cache.cpp
* oneflow/xrt/compilation_cache.h
* oneflow/xrt/executable.h
* oneflow/xrt/graph/algorithm.h
* oneflow/xrt/graph/graph.cpp
* oneflow/xrt/graph/graph.h
* oneflow/xrt/graph/node.cpp
* oneflow/xrt/graph/node.h
* oneflow/xrt/graph_compiler.h
* oneflow/xrt/kernel/op_context.h
* oneflow/xrt/kernel/op_kernel.h
* oneflow/xrt/launch_kernel.cpp
* oneflow/xrt/launch_kernel.h
* oneflow/xrt/launch_op.cpp
* oneflow/xrt/launch_op.h
* oneflow/xrt/node_util.cpp
* oneflow/xrt/node_util.h
* oneflow/xrt/parameter.h
* oneflow/xrt/passes/cluster.cpp
* oneflow/xrt/passes/cluster.h
* oneflow/xrt/passes/pass.cpp
* oneflow/xrt/passes/pass.h
* oneflow/xrt/patches/xla.patch
* oneflow/xrt/platform.cpp
* oneflow/xrt/platform.h
* oneflow/xrt/tensorrt/README.md
* oneflow/xrt/tensorrt/trt_logger.h
* oneflow/xrt/tensorrt/trt_shape.h
* oneflow/xrt/tensorrt/trt_value.h
* oneflow/xrt/tests/README.md
* oneflow/xrt/types.h
* oneflow/xrt/types.proto
* oneflow/xrt/utility/env.h
* oneflow/xrt/utility/registry.h
* oneflow/xrt/utility/stl.h
* oneflow/xrt/xla/README.md
* oneflow/xrt/xla/ops/add_op.cpp
* oneflow/xrt/xla/ops/binary_op.h
* oneflow/xrt/xla/ops/cast_op.cpp
* oneflow/xrt/xla/ops/fc_op.cpp
* oneflow/xrt/xla/ops/gather.cpp
* oneflow/xrt/xla/ops/matmul_op.cpp
* oneflow/xrt/xla/ops/op_context.h
* oneflow/xrt/xla/ops/op_kernel.h
* oneflow/xrt/xla/ops/reduce_op.cpp
* oneflow/xrt/xla/ops/unary_op.cpp
* oneflow/xrt/xla/ops/unary_op.h
* oneflow/xrt/xla/xla_allocator.cpp
* oneflow/xrt/xla/xla_allocator.h
* oneflow/xrt/xla/xla_data_type.cpp
* oneflow/xrt/xla/xla_data_type.h
* oneflow/xrt/xla/xla_executable.h
* oneflow/xrt/xla/xla_helpers.cpp
* oneflow/xrt/xla/xla_helpers.h
* oneflow/xrt/xla/xla_macro.h
* oneflow/xrt/xla/xla_shape.cpp
* oneflow/xrt/xla/xla_shape.h
* oneflow/xrt/xrt.proto