Commit 8f3dcf94 authored by Houjiang Chen, committed by cheng cheng

XRT: XLA + TensorRT (#2525)

* Enable multiple definition for xla compilation in oneflow

* Realize running an executable

* Abstract and gather the resources needed for compilation (client, builder, etc.) into CompilationResourceStore

* Implement a separate xla allocator to avoid introducing too many tensorflow objects

* Define CompilationContext separately

* Running XLA in CPU mode is OK now

* Make the result shape after running the executable a tuple, and refine comments

* Add a compilation cache to avoid recompiling every time
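
A minimal sketch of the caching idea (illustrative Python only; the class and key layout are hypothetical stand-ins, not the actual XRT types):

```python
# Illustrative compilation cache: reuse a compiled executable whenever the
# same op config and input shapes show up again (all names hypothetical).
class CompilationCache:
    def __init__(self):
        self._cache = {}  # signature -> compiled executable

    def get_or_compile(self, op_conf, input_shapes, compile_fn):
        signature = (op_conf, tuple(tuple(s) for s in input_shapes))
        if signature not in self._cache:
            self._cache[signature] = compile_fn(op_conf, input_shapes)
        return self._cache[signature]
```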

* Resolve InferSbpSignature in XlaLaunchOp

* Resolve executing on a specified cuda stream

* Refine XlaLaunch parallel conf, add batch matmul op

* Refactor job rebuilding and fixup time shape

* Update batch_dim_lbis field if XlaLaunch has any output that has a batch dim

* Resolve cluster rings after clustering, taking sbp policy and time shape into consideration

* Add reshape op

* Fix bugs

* Rename CompilationContext to XlaLaunchContext, add XlaRuntimeScope to swap stream handle

* Fix bugs

* Update cmake to compile with xla optionally

* Support more ops

* Add more ops, and fix bugs

* Implement XLA allocator and internal memory pool

* Adaptively resize allocator memory size
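
A sketch of the adaptive-resize idea (illustrative Python; the real allocator is C++ inside XRT, and the doubling growth policy here is an assumption):

```python
# Illustrative growable arena: bump-allocate from one buffer and resize it
# adaptively when a request no longer fits (growth policy is hypothetical).
class Arena:
    def __init__(self, capacity=1 << 20):
        self._buf = bytearray(capacity)
        self._offset = 0

    def allocate(self, size):
        if self._offset + size > len(self._buf):
            # grow to the larger of 2x the current capacity or the exact need
            new_cap = max(2 * len(self._buf), self._offset + size)
            self._buf.extend(b"\x00" * (new_cap - len(self._buf)))
        start = self._offset
        self._offset += size
        return start  # caller addresses self._buf[start:start + size]
```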

* Refine memory allocator

* Block the host when running a cpu executable

* Fix bug for getting scalar value

* Fix result layout bug that caused wrong results for transpose

* Refine gelu backward

* Of xla sx (#1990)

* add identity xla op

* Add batch gather op

* Refine batch gather

* fix batch gather bug and add gather op, mv identity op to unary_op

* Add softmax and gather/batch_gather

* Add xla softmax_grad op

* Add xla layer normalization op

* Add xla layer norm backward op

* Alias inputs and outputs to compute in-place

* Reuse output buffers when running the xla executable. This brings about a 10%
speedup for bert on a single gpu by zero-copying results
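
A minimal sketch of the buffer-reuse idea (illustrative Python; the run interface and names are assumptions, not the actual XRT executable API):

```python
import numpy as np

# Illustrative zero-copy output reuse: allocate an executable's output
# buffers once, let every run write into them in place, and hand the same
# arrays back to the caller without copying.
_output_buffers = {}  # executable id -> list of preallocated arrays

def run_with_reused_outputs(exe_id, run_fn, output_shapes, dtype=np.float32):
    bufs = _output_buffers.setdefault(
        exe_id, [np.empty(s, dtype=dtype) for s in output_shapes])
    run_fn(bufs)   # the executable writes its results in place
    return bufs    # same buffers every call: zero copy on the result path
```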

* Refine xla allocator

* Refine code style

* Add xla reduce_sum op

* Rewrite model update op to optimizer graph

* Fix hang bugs

* Fix input whose body is disabled in xla launch kernel

* Fix self control in

* Add fake consume op

* Fix HasAttr bug for optional field

* Refine AdamOptimizer

* Fix xla AdamOptimizer bugs

* Add meta data in HLO instruction, and refine

* Fix bugs

* add reduce sum and split normal model update (#2040)

* remove append_func_to_list

* Rm deprecated model update and save code (#1958)

* remove code

* mv random gen to kernel

* mk seed required

* address reviews

* fix unused warning

* address reviews

* check in more deprecation

* remove ModelSaveOpConf

* move out ops and modify item (#1962)

* ModelInit.__oneflow_input_remote_blobs__

* fix cpu only query & add error info (#1964)

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* modify check_point and add test check_point (#1963)

* fix misuse of Scope/raii

* op_name2variable_blob

* add sigmoid test and tanh test (#1966)

* add op matmul and matmul test (#1967)

* rename oneflow.val to oneflow.input_blob_def

* support auto var for convolution (#1972)

* add op add and test add (#1973)

* mv deprecated.pb_util to lib.core.pb_util

* add op get_variable and get_variable test (#1975)

* add op get_variable and get_variable test

* modify shape extend

* AllReduceSequencePass (#1976)

* python2 compatibility for check_point

* fix "return (blob_a, blob_b)" bug

* rename: arg_passing => arg_pass

* shared regst blob header between jobs (#1919)

* half impl

* register manager handle memory shared for separated memory

* set separated memory shared id for shared regst between jobs

* half impl of python for blob

* fix BUG of pod ToProto() when proto has already been inited

* fix BUG of infer dim0_inner_shape() in foreign_input_op

* 1. PushJob copy from python can infer dim0_valid_num

* add test for dynamic relu

* refine test file

* refine code

* refine note

* update test file for new interface

* rename separated_header* (#1979)

* some bug fixes for a train&eval job (#1978)

* debugging alex net

* check in test pull_multiple_blob.py

* stricter check

* fix bias in conv

* fix various bugs

* rm file

* op_name in different jobs can be overloaded

* fix compile bug in job_set_compile_ctx

* rm cmake code for building oneflow binary

* check in script (#1980)

* check in script

* rm unused import

* CudaCurrentDeviceGuard (#1977)

* fix val (#1981)

* Merge job set and split fw bw (#1982)

* add MemoryCopier and TensorSliceCopier (#1901)

* add MemoryCopier and TensorSliceCopier

* Index=>NdIndex

* refine

* refine

* fix addition error checking (#1911)

* Merge dev_mixed_precision into dev_split_fw_bw (#1904)

* update binary_func.h

* update

* update ndarray

* update

* update

* update

* update

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* Merge dev_mixed_precision: Part-2 (#1907)

* feat: add NewKernelUtil

* fix typos

* feat: add cublas_tensor_op_math_handle()

* add gemm (#1860)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* feat: NewKernelUtil -> NewKernelUtil<DeviceType>

* feat: update FullyConnectedKernel to use NewKernelUtil

* Dev sx mixed precision (#1861)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* save cpp

* save

* add relu and relu_backward

* remove spared space

* add explicit declaration

* rename

* feat: update ConvKernel to support half

* add sigmoid and tanh (#1867)

* add axpy (#1866)

* style: formatting

* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>

* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle

* refine(new_kernel_util.h)

* refine(new_kernel_util.cu)

* feat(new_kernel_util): add OFBatchedGemm()

* feat: update MatMulKernel to support half

* feat: update ConvData/Bias/FilterGradKernel to support half

* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out

* feat: support loss scale

* fix(operator): :bug:add InferHasBatchDim()

* feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()

* refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float

* style(kernel/cast_kernel.cpp): formatting

* fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()

* style(cast_kernel.cpp): formatting

* feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil

* refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil

* feat(dropout_kernel): :sparkles:update DropoutKernel to support half

* refactor(dropout_kernel): remove backward funcs

* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support

* fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)

* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()

* fix: fix little bugs

* fix(conv_data/filter_grad_op): min byte size of buf blob is 1

* feat: support half for bias_add_kernel

* fix(bias_add_op): remove data type check

* feat(relu_kernel): support half

* refactor: add ADD_GPU_HALF_KERNEL_CREATOR

* fix: typos

* feat(pooling_kernel): support half

* fix: remove CHECK_EQ of default data type

* feat(pooling_grad_kernel): support half

* feat: support half in ofrecord_encoder (TODO)

* fix

* feat: support half in sparse_cross_entropy_kernel

* debug grad op (#1883)

* Dev debug op mixed precision (#1884)

* debug grad op

* do nothing instead of UNIMPLEMENTED

* fix(dropout_kernel): add tmp_split_fw_bw condition

* build(half.cmake): https->http

* fix(record_load_kernel): support total_batch_num

* fix pooling (#1885)

* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()

* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()

* fix: add GetCudnnScalingParameters() to fix scaling params

* fix: add enable_true_half_config_when_conf() into config and update related code

* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization

* refactor(matmul_kernel): remove Backward()

* feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()

* feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()

* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr

* refactor(new_kernel_util.cu): remove static of func in anonymous namespace

* feat(job_conf.proto): add enable_auto_mixed_precision field

* feat(auto_mixed_precision_lists): add amp_lists

* feat(auto_mixed_precision): build the skeleton

* feat(auto_mixed_precision): almost finish amp graph pass

* feat(auto_mixed_precision.cpp): complete InsertCastOp()

* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG

* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)

* refine(auto_mixed_precision.cpp): refine LOG

* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()

* Dev half ndarray (#1886)

* debug grad op

* ZeroVal => GetZeroVal; OneVal => GetOneVal

* MaxVal => GetMaxVal; MinVal => GetMinVal

* check data type

* DevDType

* move function template to struct template for BinaryFunc* and UnaryFunc*

* support half for reduce_sum_kernel

* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr

* half for NdarrayUtil

* OF_DEVICE_FUNC is always inline

* half for NdarrayApplyUnaray

* simplify usage of NdarrayUtil

* UnaryFuncExp

* add VarNdarrayBuilder and ValNdarrayBuilder

* simplify NdarrayUtil in layer_norm_param_grad_kernel

* InplaceBroadcast

* remove SoftmaxKernelUtil

* half for softmax_kernel

* fix improper use of __CUDA_ARCH__

* disable sm_30,sm_52

* refine(conv_kernel.cu): fix typo

* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix: fix typos of GetOneVal

* fix(auto_mixed_precision.cpp): allocate for shared_ptr

* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding

* fix(auto_mixed_precision.cpp): fix typo

* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge

* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()

* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>

* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp

* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs

* feat(auto_mixed_precision.cpp): more logs

* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal

* fix(bias_add_op.cpp): fix bias_multiplier shape

* feat(gather_xxx): update gather,gather_grad,gather_kernel_util to support half

* feat: update MatmulKernel and new_kernel_util to support half

* refactor(auto_mixed_precision): add ClearList and refine code

* feat(tanh_*_kernel): support half

* feat(add_kernel): support half

* update binary_func.h

* update

* update ndarray

* update

* update

* update

* update

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF

* style(CMakeLists.txt): fix typo

* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix(auto_mixed_precision.cpp): group inserted cast op by lbn

* fix get one ptr (#1913)

* fix(layer_norm): add LayerNormOp to grey_list and support the half

* fix(layer_norm about): fix it to run when amp

* fix: move fix sbp signature from OpNode to OpGraph

* Dev new kernel util (#1925)

* refactor(kernel/util): refactor NewKernelUtil and add DnnIf

* refactor(kernel/util): add BlasIf

* refactor(kernel/util): add ArithemeticIf

* refactor(kernel/util): add cuda_kernel_util.*

* refactor: refactor NewKernelUtil

* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid circular includes

* refactor(new_kernel_util.h): remove unused header files

* refactor: refactor circular includes

* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA

* CHECK cuda version > 10.0 when using auto_mixed_precision

* Fix bug of Snapshot deleting unwanted files (#1937)

* fix link BUG of release version (#1938)

* delete redundant code in OpGraph JobCompleter and Operator (#1927)

* 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot deleting unwanted files  3. refine ReadMe

* revert README change

* split into 2 pull requests

* Refactor Kernel Registry V2: The clear & easy Way (#1941)

* refactor(resource.proto): move DeviceType to common/device_type.proto

* feat(kernel_registration): add kernel_registration.h/cpp

* feat(kernel_registration): update matmul_kernel to support new registration

* feat: add CreateKernel for new registry

* feat: update registry of cast conf

* refactor(kernel_registration): remove KernelRegMap

* fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)

* grpc SetMaxMessageSize(INT_MAX) (#1950)

* fix bug of Graph::ForEachConnectedComponent (#1952)

* Grpc set max size (#1953)

* grpc SetMaxMessageSize(INT_MAX)

* set max msg len for ctrl service

* code for test grpc max msg size

* remove test code

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* AllReduceSequencePass (#1976)

* Merge job set and split fw bw (#1983) (same commit list as the #1982 merge above)

* CudaCurrentDeviceGuard (#1977)

* delete tmp_split_fw_bw_train_conf (#1985)

* delete tmp_split_fw_bw_train_conf

* delete useless comments

* fix refactor bug in layer_norm_op

* minor fixes

* update py script

* remove code that could be misleading

* Fix all reduce mem sharing (#1986)

* fix all reduce mem sharing

* ByteSizeOfDataContentField=>ByteSizeOfBlobBody

* remove obsolete task_graph optimization

* no arg_pass_job for variable_op

* merge memory block id between jobs (#1910)

* refine MemBlock and CriticalSection

* job memory sharing strategy

* revert diff in CriticalSectionDesc

* Merge memory block between sub plans

* Get mutual exclusion job groups

* forgot to consider that memory merge happens only within the same machine

* memory zone unique id

* Merge Done;  merge memory block id from right to left; get memory block ids info

* revert MemBlock

* generate mutual exclusion job groups Done.

* update for proto

* add JobMemSharingStrategy in python interface

* remove memorycase hash

* move JobMemSharingStrategy to JobSetProto

* using default strategy = parallel priority strategy

* update interface of flow.job_mem_sharing_strategy

* InterJobMemSharingUtil and PlanUtil

* revert oneflow.h

* fix bug

* New implementation of merging memory block ids between jobs

* refine code

* fix a fatal bug in std::hash<oneflow::Shape>

* +REGISTER_INDEPENDENT_THREAD_NUM for print task_node

* unlock critical sections as much as possible (#1994)

* Bugfix actor case (#1995)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* Bugfix actor case (#1996)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* small regst_num for reentrant_lock (#1997)

* fmt dev_job_set (#1999)

* double buffer for tick_op

* tick is cpu op

* speedup compile time (#2000)

* only merge mem_block_id between user job (#1993)

* Fix keep header only (#2001)

* speedup compile time

* fix keep header only

* remove shared model (#2003)

* remove blob_mem_sharing (#2005)

* No copyhd for output (#2006)

* no cpu tick

* no copyhd for output_op/swith_output_op

* remove temp comments

* rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo

* remove clone_id (#2007)

* layer norm auto var (#2004)

* layer norm auto var

* make of_format

* bn sbp (#2008)

* Refactor job completer (#1998)

* fmt

* refactor GenerateOpConf4Trainning

* more refactor

* refactor SetCtrlInOpName4VariableOp

* use uniq ptr

* refactor RewriteBoxingWithAllReduce

* refactor MakeAllReduceSequence

* refactor auto_mixed_precision

* refactor DumpLogicalBlobDescAndSbpSignature

* refactor group_boxing_by_dst_parallel

* refactor add_keep_header_only_op_conf

* refactor AutoSourceTick

* refactor AddTickForTimeShape

* refactor AutoSinkTick

* refactor AddGlobalOutputCriticalSections

* refactor SetOpTimeShape7BatchDimLbis

* fix a bug in IsInterfaceTask (#2009)

* Bugfix is interface task (#2010)

* fix a bug in IsInterfaceTask

* IsOutputInterfaceTask

* copyhd-free output_op task_node

* Dev job set config util (#2011)

* add more if in JobConfigProtoBuilder

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* remove total batch num in config util

* remove clone_id

* assert has train_conf

* rm debug info

* Dev job set bert (#2013)

* support bert

* mv into bert

* manual format

* fix adam (#2015)

* fix adam

* div batch instance num before update model

* remove outdate code in oneflow.cpp (#2017)

* Dev split like (#2016)

* no total_instance_num

* add auto grad for concat

* check in impl

* check in bug fixes

* fix bugs for split_like

* split_like_op.cpp format

* add normalization_autovar

* Update op_conf.proto

* address reviews

* fix typo

* constant ref

* rm forward_loss_instance_num (#2018)

* Bugfix job set multi device (#2019)

* sbp for tick input bn

* interface_blob_conf for output_op/switch_output_op

* set sbp conf for tuple identity op

* fix bugs when merge main plan

* delete useless code

* address review

* fix error use of GenRepeatedBn()

* ForEachConnectedComponent is easily misused

* 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil

* only for return output_op

* refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name

* return op instead of output op acts as part of user job

* enable_all_reduce_group

* bugfix: init RuntimeBuffersScope before Runtime

* demo python scripts for enable_all_reduce_group

* remove wrong optimization code

* constant_conf for enable_all_reduce_group.py test

* fix interface op parallel conf

* fix reduce concat kernel (#2020)

* binary program oneflow_worker

* user_job_completer

* remove unused code loss_print

* rm unused code loss_acc

* remove unused accuracy_acc and accuracy_print

* remove input_diff/output_diff/model_diff bns

* remove unused bns in gdb util

* replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns

* support mpi using style

* Bugfix put job conf into plan (#2023)

* put job_conf into plan

* use job_name to judge isPullJob/isPushJob

* fix wrong job_id error

* model_init is a push job; model_save is a pull job

* make cmake more reasonable (#2024)

* Restructure python module and minimum setup.py (#2026)

* check in updated paths

* check in minimum setup tool

* Dev python init multi unit (#2022)

* init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine

* refine var name

* refine code

* compile user/main job only on master

* bert multi machine test code

* fix bugs

* JobConfs

* fix bugs under WITH_RDMA

* fix multi-machine bugs

* delete useless code

* Add xla reduce_sum op

* fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)

* feat: init_worker can run without scp'ing the binary and without using uuid (#2029)

* half impl of without scp bin

* feat: init_worker can run without scp'ing the binary and without using uuid

* check in fixes (#2030)

* fixbug of delete worker (#2033)

* Dev dot plan (#2035)

* reuse plan to dot file

* refine plan dot

* Check in bug fix and multi node script (#2032)

* check in fixes

* check in script

* fix boxing bug when setting conf with sbp

* flag for iter

* fixbug of delete worker

* fix delete worker in script

* address review, add exclusive or check

* reuse plan to dot file

* refine plan dot

* fix and add flags

* fmt

* rm debug output

* more flags

* check Activation

* fix fc bug when num axes > 2

* reverse change

* fix next_batch_num (#2036)

* upgrade nccl to 2.4.8 (#2037)

* fix shape of fc in_diff (#2038)

* Rewrite model update op to optimizer graph

* Update oneflow.cmake (#2041)

* better looking merged_plan to dot v1 (#2039)

* better looking and more information in merged_plan.dot

* refine color

* Fix tick in multi node parallel (#2042) (#2047)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* Dev train conf builder (#2046)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* check in impl

* fix data dir (#2054)

* fix data dir

* rm model load path

* AssignOp (#2058)

* AssignOp

* remove useless code

* Python ops gather and unit test (#2053)

* python_ops gather and unit test

* format

* minor mod

* SnapshotOp (#2060)

* magical add and fix bug (#2061)

* check in impl

* add todo

* Dev jxf python pooling (#2056)

* run max_pool_2d without bug

* correct max_pool_2d

* correct average_pool_2d

* minor refine

* final version

* rename to nn.py

* add name arg to pool1d ops

* refine by review

* rename to _GetSequence and move it to the end of file (#2063)

* fix BindInterfaceMemBlockId (#2065)

* mark py file generated (#2066)

* Dev gracious exit (#2057)

* add more checks

* make language more consistent

* better error info for worker init

* better error

* Update setup.py (#2068)

* Refine Infer APIs by return Maybe<void> type (#2051)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk

* fix bug for split like op (#2070)

* fix snapshot path (#2071)

* Dev job set fix infer apis (#2072)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk

* update

* add AutoGlobalStep (#2073)

* rm default_initializer_conf in train conf (#2075)

* Fix sigmoid op (#2076)

* fix sigmoid op bug

* fix bug for split like op

* add sigmoid grad op

* Fix bn (#2077)

* fix bn

* return Maybe<void> OK in lambda

* fix typo

* fix SigmoidGradOp (#2078)

* Dev python merge job set (#2081)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix gcc warning in release (#2080)

* fix gcc version in release

* fix empty line

* Fix adam mv initializer (#2082)

* zero constant initializer for adam m and v

* make of_format

* init adam m v beta1_t and beta2_t

* use value instead of initializer

* const float& -> const float

* update

* LearningRateScheduleOp (#2079)

* matmul (#2084)

* matmul

* np.allclose

* Fix hang bugs

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape

* refine code for readability

* check py if and test

* prelu (#2086)

* prelu

* fix

* fix

* template for either ptr cast (#2088)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* add template for cast

* rename

* Dev build and infer ctx (#2089)

* add job_build_and_infer_ctx interface

* lbn_with_split_hint

* fix maybe macro

* fix signature of Maybe<T>::Error()

* job_build_and_infer_if

* add c_api_util wrapper for job_build_and_infer_ctx

* implement python/job_build_and_infer interface

* CurJobBuildAndInferCtx_AddPlacementGroup

* BuildJobAndInferCtx and Mgr c++ implementation (#2074)

* job_build_and_infer_ctx_mgr

* refine interface of infer_ctx_mgr

* JobBuildInferCtx set job conf; add and refine error type

* revert job.proto

* half impl of add op in build_infer_ctx

* generate op produced empty logical blob desc; infer out blob desc interface

* job_build_and_infer_ctx VERSION 1

* add InferOutBlobDesc for conv op; remove record_piece_size in interface op

* maybe return

* job_set hold by job_build_and_infer_ctx_mgr

* check placement when infer ctx mgr leave cur job

* Global New/Delete JobBuildAndInferCtxMgr

* add JUST when ctx add op

* remove unused job_conf.arg_op_name

* fix bugs caused by python new api

* fix bugs caused by lack of Global<JobDesc>

* fix bugs caused by new api

* refactor compiler.Compile

* merge dev_python

* remove unused message proto

* rename api

* Fix input whose body is disabled in xla launch kernel

* add RemoteBlob.shape and RemoteBlob.dtype

* Fix data type set default variable (#2092)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix default data type

* Add conf axis for bias_add for any axis channel (#2093)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* bias_add completion

* follow comment

* make conf axis required

* Dev jxf python initializer (#2090)

* oneflow initializer

* update

* Fix self control in

* Bugfix python alexnet (#2096)

* bugfix_python_alexnet

* fix

* Add fake consume op

* Dev global step (#2100)

* assign op


AddGlobalStepOpConf


fix


ARITHMETIC_DATA_TYPE_SEQ


identity_op_conf


add ops


GenNewSnapshotName


SnapshotOp


cleanup


blob name


LearningRateScheduleOp


LearningRateScheduleKernel


LearningRateScheduleKernel


AddLearningRateScheduleOpConf


learning rate


cleanup


fix


fix

* remove total_mbn_num

* date time format

* save

* refine

* refine

* revert

* refine snapshot

* fix

* refine

* AutoGlobalStep

* refine

* GenLogicalBlobName

* AutoLearningRate

* remove JobDesc lr

* fix snapshot path

* Maybe<void>

* learning_rate blob

* remove next_model_vid


fix


fix 


fix


learning_rate

* train_conf

* fix for global step on multi nodes

* Fix optimizer initializer (#2095)

* fix optimizer initializer

* rename lars data temp bn

* fix job_type (#2102)

* Dev alexnet new api (#2094)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* check in softmax loss

* nn.conv2d and nn.bias_add

* fix opname

* fix merge conflict

* fix name

* dense (#2097)

* Fix jxf dense v2 (#2098)

* dense

* minor fix

* alexnet

* fix conf

* quick fix

* transpose

* fix layers

* add transpose

* fix fc

* fix

* fix

* fix data load

* params check and format

* rm activation in op conf

* save workaround

* fix avg pool 2d

* fix max pool 2d

* remove fc3 relu

* alexnet eval

* minor

* replace has_batch_dim with batch_axis (#2104)

* replace has_batch_dim with batch_axis

* refactor OrderValue4HasBatchAxis

* fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp

* no CHECK in MatmulOp::InferBatchAxis

* infer op by op_conf and  parallel_conf

* wrapper Error for ErrorProto

* replace ErrorUtil with Error

* add OF_CHECK (#2110)

* optional split_axis (#2113)

* Fix HasAttr bug for optional field

* undefined (#2116)

* merge reduce xxx (#2119)

* Update GetSbpSig() with Maybe (#2118)

* fix several ops

* modify all ops

* format

* update complete

* Refine AdamOptimizer

* fix (#2120)

* Fix xla AdamOptimizer bugs

* support scalar for reduce_xxx axis args (#2122)
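
The usual normalization for a scalar-or-list axis argument looks roughly like this (a sketch, not the exact OneFlow helper):

```python
# Illustrative axis normalization: accept an int, a list/tuple, or None and
# always hand the reduce kernel a list of non-negative axes.
def normalize_axis(axis, num_axes):
    if axis is None:
        return list(range(num_axes))
    if isinstance(axis, int):
        axis = [axis]
    return [a + num_axes if a < 0 else a for a in axis]

assert normalize_axis(1, 4) == [1]
assert normalize_axis(-1, 4) == [3]
assert normalize_axis(None, 2) == [0, 1]
```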

* Dev opt split axis (#2121)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* fix autovar split_axis (#2125)

* Dev model init op (#2117)

* assign op


AddGlobalStepOpConf


fix


ARITHMETIC_DATA_TYPE_SEQ


identity_op_conf


add ops


GenNewSnapshotName


SnapshotOp


cleanup


blob name


LearningRateScheduleOp


LearningRateScheduleKernel


LearningRateScheduleKernel


AddLearningRateScheduleOpConf


learning rate


cleanup


fix


fix

* remove total_mbn_num

* date time format

* save

* refine

* refine

* revert

* refine snapshot

* fix

* refine

* AutoGlobalStep

* refine

* GenLogicalBlobName

* AutoLearningRate

* remove JobDesc lr

* fix snapshot path

* Maybe<void>

* learning_rate blob

* remove next_model_vid


fix


fix 


fix


learning_rate

* train_conf

* fix for global step on multi nodes

* SnapshotReader


snapshot writer


model init op


fix


refine


init


InitializeFromSnapshotConf


model io job


ModelLoadOp


ModelLoadKernel


MakeModelLoadJob


ModelSaveOp


fix


InterUserJobInfo


_MakeModelLoadJobFunc


MutModelLoadOpConTickInputHelper


fix


refine


init/load/save


set_default_variable

* remove SnapshotMgr

* snapshot.h

* delete model_init_job.cpp


foreign_input_op_conf


fix


snapshot path


set path


op_conf


fix


fix CopyFromNdarray


to bytes c


use uint8


char2uint8

* model init

* model io

* fix

* ModelSaveKernel

* mutable_batch_axis()->Clear()

* InferBatchAxis

* fix

* refine

* job set

* MakeModelIoJobs

* fix

* jobs

* fix

* model io job

* GenOutputOpConf

* refine snapshot

* refine

* fix

* refine CheckPoint

* remove session

* refine

* refine

* refine

* remove keyword.h/cpp

* refine

* global_step=>train_step

* GetSbpSignatures

* ModelInitOp

* fix (#2127)

* rm stale alexnet script (#2129)

* Dev plain maybe (#2126)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* Dev simple checkpoint manager (#2128)

* SimpleCheckPointManager

* makedirs

* fix path

* save

* refine

* refine

* fix path to numpy (#2130)

* Dev plain maybe (#2132)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()

* refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>

* Dev jxf merge general ops (#2131)

* merge some general ops to dev_python

* dense demo

* rm print in test

* new line at the end of file

* format

* fix check point

* update alexnet

* broadcast_xxx (#2134)

* broadcast_xxx

* typo

* typo

* rm job_conf.num_of_batches_in_snapshot

* fix args (#2136)

* fix proto if (#2138)

* pass name to inner function (#2139)

* check dropout if (#2140)

* check dropout if

* fix typo

* Dev merge math ops (#2143)

* merge math ops

* new line at the end of file

* merge layer norm (#2144)

* variable_scope (#2141)

* variable_scope

* revert format

* add check

* Merge dropout if (#2145)

* check dropout if

* fix typo

* fix typo

* slice (#2142)

* slice

* add check and docstring

* minor

* minor

* add const (#2146)

* add const

* fix indentation

* address review

* fmt

* rm redundant

* Update array_ops.py

* Update array_ops.py

* Update array_ops.py

* add more activations to math_ops (#2147)

* fix bug (#2149)

* truncated normal for bert (#2150)

* Update bert for dev python (#2151)

* truncated normal for bert

* bert support

* math.dropout to nn.dropout (#2153)

* refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto

* allow export multiple interfaces in oneflow_export decorator (#2154)

* refactor job_build_and_infer_if.h

* update oneflow_internal.h to use Maybe (#2135)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

*  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)

*  Transfer data_part_num to DecodeOp and RecordLoadOp

* Fix python scripts

* Dev nc of internal (#2155)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

* fix: fix ctor bug

* fix config_proto

* rename c_api_util.Init => c_api_util.InitEnvironment

* refactor compile_context.cur_job => compile_context.cur_job_conf

* remove FixPackedBlobDescOfProducedRegst (#2156)

* Fix snapshot root path empty log (#2158)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* Fix snapshot root path empty log

* fix channel last (#2157)

* fix channel last

* minor

* merge pb_message

* add cudnn conv force algo (#2159)

* Update bert for dev python (#2160)

* remove old bert

* set data_part_num in decoder

* support model load/save args

* Dev flow function (#2152)

* add of.function, refactor init, refine session, and refine runtime

* rm useless code

* rename

* update

* add test

* @oneflow_export JobConfigProto and Trainconf (#2162)

* @oneflow_export JobConfigProto and Trainconf

* remove unused config in config_util.py

* remove oneflow.get_cur_job_conf_builder

* bugfix: bias_add op and reduce_sum op infer sbp, and implementation of bias_add kernel (#2161)

* 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf

* fix config.train.model_update_conf

* _GetJobConfAttr

* update alexnet (#2166)

* Update alexnet (#2167)

* update alexnet

* update for bert

* 15->16

* more reasonable conf

* get variable in py layer norm

* replace val in pb msg;  decode lbn string with split hint (#2165)

* bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)

* Add meta data in HLO instruction, and refine

* python model parallel (#2103)

* decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when adding op

* merge placement group

* refine code in AddAndInferOp

* auto merge placement group when adding op; remove mergeplacementgroup interface

* infer sbp parallel when adding op; impl Get/Has split axis in infer_ctx

* python blob add interface for model parallel

* refine code of python blob split

* remove interface of has/get_split_axis in python blob

* remove interface of has_batch_dim in python blob

* add check that blob split_axis can be divided evenly by parallel num
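
A sketch of the check being described (illustrative Python; the real check lives in the C++ infer context):

```python
# Illustrative check: a blob split along `split_axis` must divide evenly
# across the parallel devices, otherwise the placement is rejected.
def check_split_axis(shape, split_axis, parallel_num):
    dim = shape[split_axis]
    if dim % parallel_num != 0:
        raise ValueError(
            "dim %d at split_axis %d is not divisible by parallel num %d"
            % (dim, split_axis, parallel_num))

check_split_axis((1024, 768), split_axis=1, parallel_num=4)  # ok: 768 % 4 == 0
```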

* refine code for maybe get/infer sbp

* fix bugs: 1. python generates parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc.

* fix for plain point maybe

* fix bug: add repeated placement group, remove add placement interface in hand

* fixbug: python/blob_desc, temp impl without deepcopy; feat: dense layer supports model parallel

* dev_python model parallel runnable and check correct

* remove add placement group when placment scope exit

* 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel

* bugfix: bias_add backward inferred sbp wrongly; model parallel bias_add debugging done

* refine python blob_desc.split implement

* refine interface decode lbn to split hint

* refine auto add placement group

* refine lbn with split hint decode

* refine code for review

* remove AutoVar related code (#2168)

* feat: remove all autovar

* fix and format

* fix: fix op::InferBlobDesc

* add prototype (#2172)

* add prototype

* infer blob desc with sbp_signature

* `str_a is not str_b' is buggy, use `str_a != str_b' instead
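
The reason: `is` compares object identity, not value, so it can pass for short interned strings and silently fail for equal strings built at runtime. For example:

```python
a = "split_axis"
b = "".join(["split_", "axis"])  # equal value, but a distinct object
print(a != b)      # False: the values are equal, the correct comparison
print(a is not b)  # True: different objects, so the identity test misfires
```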

* Update snapshot.cpp (#2174)

* remove useless lines (#2176)

* Fix bert multi nodes (#2177)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* CHECK_JUST for InferBlobDescsIf (#2178)

* Fix bert multi nodes (#2180)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* config_proto -> default_config_proto

* delete worker

* update alexnet

* remove unused op (#2182)

* remove parallel_ctx when kernel init (#2185)

* InferOpSbpSignature in op_graph and infer_ctx (#2175)

* InferOpSbpSignature in op_graph and infer_ctx

* bugfix: lambda lifetime; job build error generation adds location info

* refine error generation and return

* refine check that lbi is valid and exists

* remove parallel num in decode_of_record op/kernel (#2186)

* Fix bugs

* delete GlobalJobDesc() in operator/ (#2188)

* rm unused test file

* Refine

* Add assign ops behind adam optimizer to update model and momentum etc.

* Remove fake consume op

* Support enable/disable XLA by setting env

* Merge callback, limit max operator count for each XLA subgraph

* CudaEventPool

* fix vector

* refine

* Support in-place update for optimizer

* Add alias input and output to prevent reusing input with other temp buffers

* Refine code style

* Remove unused code

* Of xla (#2237)

* mv deprecated.pb_util to lib.core.pb_util

* add op get_variable and get_variable test (#1975)

* add op get_variable and get_variable test

* modify shape extend

* AllReduceSequencePass (#1976)

* python2 compatibility for check_point

* fix "return (blob_a, blob_b)" bug

* rename: arg_passing => arg_pass

* shared regst blob header between jobs (#1919)

* half impl

* register manager handle memory shared for separated memory

* set separated memory shared id for shared regst between jobs

* half impl of python for blob

* fix BUG of pod ToProto() when proto has inited

* fix BUG of infer dim0_inner_shape() in foreign_input_op

* 1. PushJob copy from python can infer dim0_valid_num

* add test for dynamic relu

* refine test file

* refine code

* refine note

* update test file for new interface

* rename separated_header* (#1979)

* some bugs fixes for a train&eval job (#1978)

* debugging alex net

* check in test pull_multiple_blob.py

* strcter check

* fix bias in conv

* fix various bugs

* rm file

* op_name in different jobs can be overloaded

* fix compile bug in job_set_compile_ctx

* rm cmake code for building oneflow binary

* check in script (#1980)

* check in script

* rm used import

* CudaCurrentDeviceGuard (#1977)

* fix val (#1981)

* Merge job set and split fw bw (#1982)

* add MemoryCopier and TensorSliceCopier (#1901)

* add MemoryCopier and TensorSliceCopier

* Index=>NdIndex

* refine

* refine

* fix addition error checking (#1911)

* Merge dev_mixed_precision into dev_split_fw_bw (#1904)

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* Merge dev_mixed_precision: Part-2 (#1907)

* feat: add NewKernelUtil

* fix typos

* feat: add cublas_tensor_op_math_handle()

* add gemm (#1860)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* feat: NewKernelUtil -> NewKernelUtil<DeviceType>

* feat: update FullyConnectedKernel to use NewKernelUtil

* Dev sx mixed precision (#1861)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* save cpp

* save

* add relu and relu_backward

* remove spared space

* add explicit declaration

* rename

* feat: update ConvKernel to support half

* add sigmoid and tanh (#1867)

* add axpy (#1866)

* style: formatting

* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>

* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle

* refine(new_kernel_util.h)

* refine(new_kernel_util.cu)

* feat(new_kernel_util): add OFBatchedGemm()

* feat: update MatMulKernel to support half

* feat: update ConvData/Bias/FilterGradKernel to support half

* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out

* feat: support loss scale

* fix(operator): :bug:add InferHasBatchDim()

* feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu()

* refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float

* style(kernel/cast_kernel.cpp): formatting

* fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()

* style(cast_kernel.cpp): formatting

* feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil

* refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil

* feat(dropout_kernel): :sparkles:update DropoutKernel to support half

* refactor(dropout_kernel): remove backward funcs

* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support

* fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)

* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()

* fix: fix little bugs

* fix(conv_data/filter_grad_op): min byte size of buf blob is 1

* feat: support half for bias_add_kernel

* fix(bias_add_op): remove data type check

* feat(relu_kernel): support half

* refactor: add ADD_GPU_HALF_KERNEL_CREATOR

* fix: typos

* feat(pooling_kernel): support half

* fix: remove CHECK_EQ of default data type

* feat(pooling_grad_kernel): support half

* feat: support half in ofrecord_encoder (TODO)

* fix

* feat: support half in sparse_cross_entropy_kernel

* debug grad op (#1883)

* Dev debug op mixed precision (#1884)

* debug grad op

* do nothing instead of UNIMPLEMENTED

* fix(dropout_kernel): add tmp_split_fw_bw condition

* build(half.cmake): https->http

* fix(record_load_kernel): support total_batch_num

* fix pooling (#1885)

* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()

* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()

* fix: add GetCudnnScalingParameters() to fix scaling params

* fix: add enable_true_half_config_when_conf() into config and update related code

* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization

* refactor(matmul_kernel): remove Backward()

* feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()

* feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat()

* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr

* refactor(new_kernel_util.cu): remove static of func in anonymous namespace

* feat(job_conf.proto): add enable_auto_mixed_precision field

* feat(auto_mixed_precision_lists): add amp_lists

* feat(auto_mixed_precision): build the skeleton

* feat(auto_mixed_precision): almost finish amp graph pass

* feat(auto_mixed_precision.cpp): complte InsertCastOp()

* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG

* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)

* refine(auto_mixed_precision.cpp): refine LOG

* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()

* Dev half ndarray (#1886)

* debug grad op

* ZeroVal => GetZeroVal; OneVal => GetOneVal

* MaxVal => GetMaxVal; MinVal => GetMinVal

* check data type

* DevDType

* move function template to struct template for BinaryFunc* and UnaryFunc*

* support half for reduce_sum_kernel

* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr

* half for NdarrayUtil

* OF_DEVICE_FUNC is always inline

* half for NdarrayApplyUnaray

* simplify usage of NdarrayUtil

* UnaryFuncExp

* add VarNdarrayBuilder and ValNdarrayBuilder

* simplify NdarrayUtil in layer_norm_param_grad_kernel

* InplaceBroadcast

* remove SoftmaxKernelUtil

* half for softmax_kernel

* fix improper use of __CUDA_ARCH__

* disable sm_30,sm_52

* refine(conv_kernel.cu): fix typo

* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix: fix typos of GetOneVal

* fix(auto_mixed_precision.cpp): allocate for shared_ptr

* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding

* fix(auto_mixed_precision.cpp): fix typo

* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge

* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()

* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>

* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp

* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs

* feat(auto_mixed_precision.cpp): more logs

* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal

* fix(bias_add_op.cpp): fix bias_multiplier shape

* feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half

* feat: update MatmulKernel and new_kernel_util to support half

* refactor(auto_mixed_precision): add ClearList and refine code

* feat(tanh_*_kernel): support half

* feat(add_kernel): support half

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF

* style(CMakeLists.txt): fix typo

* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix(auto_mixed_precision.cpp): group inserted cast op by lbn

* fix get one ptr (#1913)

* fix(layer_norm): add LayerNormOp to grey_list and support the half

* fix(layer_norm about): fix it to run when amp

* fix: move fix sbp signature from OpNode to OpGraph

* Dev new kernel util (#1925)

* refactor(kernel/util): refactor NewKernelUtil and add DnnIf

* refactor(kernel/util): add BlasIf

* refactor(kernel/util): add ArithemeticIf

* refactor(kernel/util): add cuda_kernel_util.*

* refactor: refactor NewKernelUtil

* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including

* refactor(new_kernel_util.h): remove unused header files

* refactor: refactor loop include

* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA

* CHECK cuda version > 10.0 when use auto_mixed_presion

* Fix bug of Snapshot delete file Unwanted (#1937)

* fix link BUG of release version (#1938)

* delete redundant code in OpGraph JobCompleter and Operator (#1927)

* 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe

* revert README change

* split 2 pull request

* Refactor Kernel Registry V2: The clear & easy Way (#1941)

* refactor(resource.proto): move DeviceType to common/device_type.proto

* feat(kernel_registration): add kernel_registration.h/cpp

* feat(kernel_registration): update matmul_kernel to support new registration

* feat: add CreateKernel for new registry

* feat: update registry of cast conf

* refactor(kernel_registration): remove KernelRegMap

* fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)

* grpc SetMaxMessageSize(INT_MAX) (#1950)

* fix bug of Graph::ForEachConnectedComponent (#1952)

* Grpc set max size (#1953)

* grpc SetMaxMessageSize(INT_MAX)

* set max msg len for ctrl service

* code for test grpc max msg size

* remove test code
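
For context, the gRPC limit these commits lift looks roughly like the sketch below. This is a hedged illustration, not OneFlow's actual CtrlService code; modern gRPC spells the cap as separate send/receive limits, while the commits above used the older SetMaxMessageSize.

```cpp
#include <climits>
#include <memory>
#include <string>
#include <grpcpp/grpcpp.h>

// Sketch: lift gRPC's default 4 MB message cap to INT_MAX so large
// control-plane messages (e.g. a big Plan proto) are not rejected.
void ConfigureMaxMessageSize(grpc::ServerBuilder* builder) {
  builder->SetMaxReceiveMessageSize(INT_MAX);  // server-side receive limit
  builder->SetMaxSendMessageSize(INT_MAX);     // server-side send limit
}

std::shared_ptr<grpc::Channel> MakeUnlimitedChannel(const std::string& addr) {
  grpc::ChannelArguments args;
  args.SetMaxReceiveMessageSize(INT_MAX);      // client-side receive limit
  args.SetMaxSendMessageSize(INT_MAX);         // client-side send limit
  return grpc::CreateCustomChannel(addr, grpc::InsecureChannelCredentials(), args);
}
```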

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* AllReduceSequencePass (#1976)

* Merge job set and split fw bw (#1983)

* add MemoryCopier and TensorSliceCopier (#1901)

* add MemoryCopier and TensorSliceCopier

* Index=>NdIndex

* refine

* refine

* fix addition error checking (#1911)

* Merge dev_mixed_precision into dev_split_fw_bw (#1904)

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* Merge dev_mixed_precision: Part-2 (#1907)

* feat: add NewKernelUtil

* fix typos

* feat: add cublas_tensor_op_math_handle()

* add gemm (#1860)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* feat: NewKernelUtil -> NewKernelUtil<DeviceType>

* feat: update FullyConnectedKernel to use NewKernelUtil

* Dev sx mixed precision (#1861)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* save cpp

* save

* add relu and relu_backward

* remove spare spaces

* add explicit declaration

* rename

* feat: update ConvKernel to support half

* add sigmoid and tanh (#1867)

* add axpy (#1866)

* style: formatting

* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>

* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle

* refine(new_kernel_util.h)

* refine(new_kernel_util.cu)

* feat(new_kernel_util): add OFBatchedGemm()

* feat: update MatMulKernel to support half

* feat: update ConvData/Bias/FilterGradKernel to support half

* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out

* feat: support loss scale

* fix(operator): :bug:add InferHasBatchDim()

* feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()

* refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float

* style(kernel/cast_kernel.cpp): formatting

* fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()

* style(cast_kernel.cpp): formatting

* feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil

* refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil

* feat(dropout_kernel): :sparkles:update DropoutKernel to support half

* refactor(dropout_kernel): remove backward funcs

* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
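
A minimal illustration of the technique this commit names, with hypothetical functions (not the dropout kernel itself): selecting an overload with std::enable_if rather than partially specializing a class template.

```cpp
#include <type_traits>

// Hypothetical example: pick an implementation at compile time with
// std::enable_if rather than a partially specialized class template.
template<typename T>
typename std::enable_if<std::is_floating_point<T>::value, T>::type Halve(T x) {
  return x * static_cast<T>(0.5);  // floating-point path
}

template<typename T>
typename std::enable_if<std::is_integral<T>::value, T>::type Halve(T x) {
  return x / 2;  // integral path
}
```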

* fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)

* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()

* fix: fix little bugs

* fix(conv_data/filter_grad_op): min byte size of buf blob is 1

* feat: support half for bias_add_kernel

* fix(bias_add_op): remove data type check

* feat(relu_kernel): support half

* refactor: add ADD_GPU_HALF_KERNEL_CREATOR

* fix: typos

* feat(pooling_kernel): support half

* fix: remove CHECK_EQ of default data type

* feat(pooling_grad_kernel): support half

* feat: support half in ofrecord_encoder (TODO)

* fix

* feat: support half in sparse_cross_entropy_kernel

* debug grad op (#1883)

* Dev debug op mixed precision (#1884)

* debug grad op

* do nothing instead of UNIMPLEMENTED

* fix(dropout_kernel): add tmp_split_fw_bw condition

* build(half.cmake): https->http

* fix(record_load_kernel): support total_batch_num

* fix pooling (#1885)

* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()

* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()

* fix: add GetCudnnScalingParameters() to fix scaling params

* fix: add enable_true_half_config_when_conf() into config and update related code

* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization

* refactor(matmul_kernel): remove Backward()

* feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx()

* feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()

* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr

* refactor(new_kernel_util.cu): remove static of func in anonymous namespace

* feat(job_conf.proto): add enable_auto_mixed_precision field

* feat(auto_mixed_precision_lists): add amp_lists

* feat(auto_mixed_precision): build the skeleton

* feat(auto_mixed_precision): almost finish amp graph pass

* feat(auto_mixed_precision.cpp): complete InsertCastOp()

* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG

* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)

* refine(auto_mixed_precision.cpp): refine LOG

* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()

* CudaCurrentDeviceGuard (#1977)
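
A current-device guard of this kind is usually a small RAII class; a minimal sketch (error handling elided, the real class may differ):

```cpp
#include <cuda_runtime.h>

// RAII guard: remember the current CUDA device, switch to `device_id`,
// and restore the previous device when the guard leaves scope.
class CudaCurrentDeviceGuard {
 public:
  explicit CudaCurrentDeviceGuard(int device_id) {
    cudaGetDevice(&saved_device_id_);
    cudaSetDevice(device_id);
  }
  ~CudaCurrentDeviceGuard() { cudaSetDevice(saved_device_id_); }

 private:
  int saved_device_id_ = -1;
};
```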

* delete tmp_split_fw_bw_train_conf (#1985)

* delete tmp_split_fw_bw_train_conf

* delete useless comments

* fix refactor bug in layer_norm_op

* minor fixes

* update py script

* remove code could be misleading

* Fix all reduce mem sharing (#1986)

* fix all reduce mem sharing

* ByteSizeOfDataContentField=>ByteSizeOfBlobBody

* remove obsolete task_graph optimization

* no arg_pass_job for variable_op

* merge memory block id between jobs (#1910)

* refine MemBlock and CriticalSection

* job memory sharing strategy

* revert diff in CriticalSectionDesc

* Merge memory block between sub plans

* Get mutual exclusion job groups

* forget to consider memory merge only in same machine

* memory zone unique id

* Merge Done;  merge memory block id from right to left; get memory block ids info

* revert MemBlock

* generate mutual exclusion job groups Done.

* update for proto

* add JobMemSharingStrategy in python interface

* remove memorycase hash

* move JobMemSharingStrategy to JobSetProto

* using default strategy = parallel priority strategy

* update interface of flow.job_mem_sharing_strategy

* InterJobMemSharingUtil and PlanUtil

* revert oneflow.h

* fix bug

* New implement of Merge memory block id between jobs

* refine code

* fix a fatal bug in std::hash<oneflow::Shape>
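
The commit does not say what the bug was; for reference, a well-formed std::hash specialization for a shape-like type mixes every dimension into the seed, roughly as in this sketch (the Shape interface here is an assumption for illustration):

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

namespace oneflow {
// Assumed minimal Shape interface, for illustration only.
class Shape {
 public:
  explicit Shape(std::vector<int64_t> dims) : dims_(std::move(dims)) {}
  int64_t NumAxes() const { return static_cast<int64_t>(dims_.size()); }
  int64_t At(int64_t i) const { return dims_.at(i); }

 private:
  std::vector<int64_t> dims_;
};
}  // namespace oneflow

namespace std {
template<>
struct hash<oneflow::Shape> {
  size_t operator()(const oneflow::Shape& shape) const {
    size_t seed = 0;
    // boost-style hash_combine: shapes differing in any axis hash differently
    for (int64_t i = 0; i < shape.NumAxes(); ++i) {
      seed ^= hash<int64_t>()(shape.At(i)) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};
}  // namespace std
```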

* +REGISTER_INDEPENDENT_THREAD_NUM for print task_node

* unlock critical sections as much as possible (#1994)

* Bugfix actor case (#1995)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* Bugfix actor case (#1996)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* small regst_num for reentrant_lock (#1997)

* fmt dev_job_set (#1999)

* double buffer for tick_op

* tick is cpu op

* speedup compile time (#2000)

* only merge mem_block_id between user job (#1993)

* Fix keep header only (#2001)

* speedup compile time

* fix keep header only

* remove shared model (#2003)

* remove blob_mem_sharing (#2005)

* No copyhd for output (#2006)

* no cpu tick

* no copyhd for output_op/switch_output_op

* remove temp comments

* rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo

* remove clone_id (#2007)

* layer norm auto var (#2004)

* layer norm auto var

* make of_format

* bn sbp (#2008)

* Refactor job completer (#1998)

* fmt

* refactor GenerateOpConf4Trainning

* more refactor

* refactor SetCtrlInOpName4VariableOp

* use uniq ptr

* refactor RewriteBoxingWithAllReduce

* refactor MakeAllReduceSequence

* refactor auto_mixed_precision

* refactor DumpLogicalBlobDescAndSbpSignature

* refactor group_boxing_by_dst_parallel

* refactor add_keep_header_only_op_conf

* refactor AutoSourceTick

* refactor AddTickForTimeShape

* refactor AutoSinkTick

* refactor AddGlobalOutputCriticalSections

* refactor SetOpTimeShape7BatchDimLbis

* fix a bug in IsInterfaceTask (#2009)

* Bugfix is interface task (#2010)

* fix a bug in IsInterfaceTask

* IsOutputInterfaceTask

* copyhd-free output_op task_node

* Dev job set config util (#2011)

* add more if in JobConfigProtoBuilder

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* remove total batch num in config util

* remove clone_id

* assert has train_conf

* rm debug info

* Dev job set bert (#2013)

* support bert

* mv into bert

* manual format

* fix adam (#2015)

* fix adam

* div batch instance num before update model

* remove outdate code in oneflow.cpp (#2017)

* Dev split like (#2016)

* no total_instance_num

* add auto grad for concat

* check in impl

* check in bug fixes

* fix bugs for split_like

* split_like_op.cpp format

* add normalization_autovar

* Update op_conf.proto

* address reviews

* fix typo

* constant ref

* rm forward_loss_instance_num (#2018)

* Bugfix job set multi device (#2019)

* sbp for tick input bn

* interface_blob_conf for output_op/switch_output_op

* set sbp conf for tuple identity op

* fix bugs when merge main plan

* delete useless code

* address review

* fix error use of GenRepeatedBn()

* ForEachConnectedComponent is easily misused

* 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil

* only for return output_op

* refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name

* return op instead of output op acts as part of user job

* enable_all_reduce_group

* bugfix: init RuntimeBuffersScope before Runtime

* demo python scripts for enable_all_reduce_group

* remove wrong optimization code

* constant_conf for enable_all_reduce_group.py test

* fix interface op parallel conf

* fix reduce concat kernel (#2020)

* binary program oneflow_worker

* user_job_completer

* remove unused code loss_print

* rm unused code loss_acc

* remove unused accuracy_acc and accuracy_print

* remove input_diff/output_diff/model_diff bns

* remove unused bns in gdb util

* replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns

* support mpi using style

* Bugfix put job conf into plan (#2023)

* put job_conf into plan

* use job_name to judge isPullJob/isPushJob

* fix wrong job_id error

* model_init is a push job; model_save is a pull job

* make cmake more reasonable (#2024)

* Restructure python module and minimum setup.py (#2026)

* check in updated paths

* check in minimum setup tool

* Dev python init multi unit (#2022)

* init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine

* refine var name

* refine code

* compile user/main job only on master

* bert multi machine test code

* fix bugs

* JobConfs

* fix bugs under WITH_RDMA

* fix multi-machine bugs

* delete useless code

* Add xla reduce_sum op

* fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)

* feat: init_worker can work without scp binary and without uuid (#2029)

* half impl of without scp bin

* feat: init_worker can work without scp binary and without uuid

* check in fixes (#2030)

* fixbug of delete worker (#2033)

* Dev dot plan (#2035)

* reuse plan to dot file

* refine plan dot

* Check in bug fix and multi node script (#2032)

* check in fixes

* check in script

* fix boxing bug when setting conf with sbp

* flag for iter

* fixbug of delete worker

* fix delete worker in script

* address review, add exclusive or check

* reuse plan to dot file

* refine plan dot

* fix and add flags

* fmt

* rm debug output

* more flags

* check Activation

* fix fc bug when num axes > 2

* reverse change

* fix next_batch_num (#2036)

* upgrade nccl to 2.4.8 (#2037)

* fix shape of fc in_diff (#2038)

* Rewrite model update op to optimizer graph

* Update oneflow.cmake (#2041)

* better looking merged_plan to dot v1 (#2039)

* better looking and more information in merged_plan.dot

* refine color

* Fix tick in multi node parallel (#2042) (#2047)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* Dev train conf builder (#2046)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* check in impl

* fix data dir (#2054)

* fix data dir

* rm model load path

* AssignOp (#2058)

* AssignOp

* remove useless code

* Python ops gather and unit test (#2053)

* python_ops gather and unit test

* format

* minor mod

* SnapshotOp (#2060)

* magical add and fix bug (#2061)

* check in impl

* add todo

* Dev jxf python pooling (#2056)

* run max_pool_2d without bug

* correct max_pool_2d

* correct average_pool_2d

* minor refine

* final version

* rename to nn.py

* add name arg to pool1d ops

* refine by review

* rename to _GetSequence and move it to the end of file (#2063)

* fix BindInterfaceMemBlockId (#2065)

* mark py file generated (#2066)

* Dev gracious exit (#2057)

* add more checks

* make language more consistent

* better error info for worker init

* better error

* Update setup.py (#2068)

* Refine Infer APIs by return Maybe<void> type (#2051)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk
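
To illustrate the shape these infer APIs take after the refactor, a heavily simplified sketch; OneFlow's real Maybe, JUST and CHECK_OR_RETURN machinery is richer than these stand-ins:

```cpp
#include <iostream>
#include <string>
#include <utility>

// Simplified stand-in for oneflow::Maybe<void>.
struct MaybeVoid {
  bool ok = true;
  std::string error;
  static MaybeVoid Ok() { return {}; }
  static MaybeVoid Error(std::string msg) { return {false, std::move(msg)}; }
};

// CHECK_OR_RETURN-style macro: a failed check becomes an error value
// instead of a process abort.
#define OF_CHECK_OR_RETURN(cond) \
  if (!(cond)) return MaybeVoid::Error("check failed: " #cond)

// An InferBlobDescs-style API now reports failure through its return value.
MaybeVoid InferOutShape(int in_num_axes) {
  OF_CHECK_OR_RETURN(in_num_axes >= 2);
  // ... fill the output blob desc here ...
  return MaybeVoid::Ok();
}

int main() {
  MaybeVoid ret = InferOutShape(1);
  if (!ret.ok) { std::cerr << ret.error << std::endl; }
  return 0;
}
```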

* fix bug for split like op (#2070)

* fix snapshot path (#2071)

* Dev job set fix infer apis (#2072)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk

* update

* add AutoGlobalStep (#2073)

* rm default_initializer_conf in train conf (#2075)

* Fix sigmoid op (#2076)

* fix sigmoid op bug

* fix bug for split like op

* add sigmoid grad op

* Fix bn (#2077)

* fix bn

* return Maybe<void> OK in lambda

* fix typo

* fix SigmoidGradOp (#2078)

* Dev python merge job set (#2081)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix gcc warning in release (#2080)

* fix gcc version in release

* fix empty line

* Fix adam mv initializer (#2082)

* zero constant initializer for adam m and v

* make of_format

* init adam m v beta1_t and beta2_t

* use value instead of initializer

* const float& -> const float

* update

* LearningRateScheduleOp (#2079)

* matmul (#2084)

* matmul

* np.allclose

* Fix hang bugs

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape

* refine code for read

* check py if and test

* prelu (#2086)

* prelu

* fix

* fix

* template for either ptr cast (#2088)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* add template for cast

* rename

* Dev build and infer ctx (#2089)

* add job_build_and_infer_ctx interface

* lbn_with_split_hint

* fix maybe macro

* fix signature of Maybe<T>::Error()

* job_build_and_infer_if

* add c_api_util wrapper for job_build_and_infer_ctx

* implement python/job_build_and_infer interface

* CurJobBuildAndInferCtx_AddPlacementGroup

* BuildJobAndInferCtx and Mgr C++ implementation (#2074)

* job_build_and_infer_ctx_mgr

* refine interface of infer_ctx_mgr

* JobBuildInferCtx set job conf; add and refine error type

* revert job.proto

* half impl of add op in build_infer_ctx

* generate op-produced empty logical blob desc; infer out blob desc interface

* job_build_and_infer_ctx VERSION 1

* add InferOutBlobDesc for conv op; remove record_piece_size in interface op

* maybe return

* job_set hold by job_build_and_infer_ctx_mgr

* check placement when infer ctx mgr leave cur job

* Global New/Delete JobBuildAndInferCtxMgr

* add JUST when ctx add op

* remove unused job_conf.arg_op_name

* fix bugs caused by python new api

* fix bugs caused by lack of Global<JobDesc>

* fix bugs caused by new api

* refactor compiler.Compile

* merge dev_python

* remove unused message proto

* rename api

* Fix input which body is disabled in xla launch kernel

* add RemoteBlob.shape and RemoteBlob.dtype

* Fix data type set default variable (#2092)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix default data type

* Add conf axis for bias_add for any axis channel (#2093)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* bias_add completion

* follow comment

* make conf axis required

* Dev jxf python initializer (#2090)

* oneflow initializer

* update

* Fix self control in

* Bugfix python alexnet (#2096)

* bugfix_python_alexnet

* fix

* Add fake consume op

* Dev global step (#2100)

* assign op

* AddGlobalStepOpConf

* fix

* ARITHMETIC_DATA_TYPE_SEQ

* identity_op_conf

* add ops

* GenNewSnapshotName

* SnapshotOp

* cleanup

* blob name

* LearningRateScheduleOp

* LearningRateScheduleKernel

* LearningRateScheduleKernel

* AddLearningRateScheduleOpConf

* learning rate

* cleanup

* fix

* fix

* remove total_mbn_num

* date time format

* save

* refine

* refine

* revert

* refine snapshot

* fix

* refine

* AutoGlobalStep

* refine

* GenLogicalBlobName

* AutoLearningRate

* remove JobDesc lr

* fix snapshot path

* Maybe<void>

* learning_rate blob

* remove next_model_vid

* fix

* fix

* fix

* learning_rate

* train_conf

* fix for global step on multi nodes

* Fix optimizer initializer (#2095)

* fix optimizer initializer

* rename lars data temp bn

* fix job_type (#2102)

* Dev alexnet new api (#2094)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* check in softmax loss

* nn.conv2d and nn.bias_add

* fix opname

* fix merge conflict

* fix name

* dense (#2097)

* Fix jxf dense v2 (#2098)

* dense

* minor fix

* alexnet

* fix conf

* quick fix

* transpose

* fix layers

* add transpose

* fix fc

* fix

* fix

* fix data load

* params check and format

* rm activation in op conf

* save workaround

* fix avg pool 2d

* fix max pool 2d

* remove fc3 relu

* alexnet eval

* minor

* replace has_batch_dim with batch_axis (#2104)

* replace has_batch_dim with batch_axis

* refactor OrderValue4HasBatchAxis

* fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp

* no CHECK in MatmulOp::InferBatchAxis

* infer op by op_conf and  parallel_conf

* wrapper Error for ErrorProto

* replace ErrorUtil with Error

* add OF_CHECK (#2110)

* optional split_axis (#2113)

* Fix HasAttr bug for optional field

* undefined (#2116)

* merge reduce xxx (#2119)

* Update GetSbpSig() with Maybe (#2118)

* fix several ops

* modify all ops

* format

* update complete

* Refine AdamOptimizer

* fix (#2120)

* Fix xla AdamOptimizer bugs

* support scalar for reduce_xxx axis args (#2122)

* Dev opt split axis (#2121)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* fix autovar split_axis (#2125)

* Dev model init op (#2117)

* SnapshotReader

* snapshot writer

* model init op

* fix

* refine

* init

* InitializeFromSnapshotConf

* model io job

* ModelLoadOp

* ModelLoadKernel

* MakeModelLoadJob

* ModelSaveOp

* fix

* InterUserJobInfo

* _MakeModelLoadJobFunc

* MutModelLoadOpConTickInputHelper

* fix

* refine

* init/load/save

* set_default_variable

* remove SnapshotMgr

* snapshot.h

* delete model_init_job.cpp

* foreign_input_op_conf

* fix

* snapshot path

* set path

* op_conf

* fix

* fix CopyFromNdarray

* to bytes c

* use uint8

* char2uint8

* model init

* model io

* fix

* ModelSaveKernel

* mutable_batch_axis()->Clear()

* InferBatchAxis

* fix

* refine

* job set

* MakeModelIoJobs

* fix

* jobs

* fix

* model io job

* GenOutputOpConf

* refine snapshot

* refine

* fix

* refine CheckPoint

* remove session

* refine

* refine

* refine

* remove keyword.h/cpp

* refine

* global_step=>train_step

* GetSbpSignatures

* ModelInitOp

* fix (#2127)

* rm stale alexnet script (#2129)

* Dev plain maybe (#2126)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* Dev simple checkpoint manager (#2128)

* SimpleCheckPointManager

* makedirs

* fix path

* save

* refine

* refine

* fix path to numpy (#2130)

* Dev plain maybe (#2132)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()

* refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>

* Dev jxf merge general ops (#2131)

* merge some general ops to dev_python

* dense demo

* rm print in test

* new line at the end of file

* format

* fix check point

* update alexnet

* broadcast_xxx (#2134)

* broadcast_xxx

* typo

* typo

* rm job_conf.num_of_batches_in_snapshot

* fix args (#2136)

* fix proto if (#2138)

* pass name to inner function (#2139)

* check dropout if (#2140)

* check dropout if

* fix typo

* Dev merge math ops (#2143)

* merge math ops

* new line at the end of file

* merge layer norm (#2144)

* variable_scope (#2141)

* variable_scope

* revert format

* add check

* Merge dropout if (#2145)

* check dropout if

* fix typo

* fix typo

* slice (#2142)

* slice

* add check and docstring

* minor

* minor

* add const (#2146)

* add const

* fix indentation

* address review

* fmt

* rm redundant

* Update array_ops.py

* Update array_ops.py

* Update array_ops.py

* add more activations to math_ops (#2147)

* fix bug (#2149)

* truncated normal for bert (#2150)

* Update bert for dev python (#2151)

* truncated normal for bert

* bert support

* math.dropout to nn.dropout (#2153)

* refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto

* allow export multiple interfaces in oneflow_export decorator (#2154)

* refactor job_build_and_infer_if.h

* update oneflow_internal.h to use Maybe (#2135)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

*  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)

*  Transfer data_part_num to DecodeOp and RecordLoadOp

* Fix python scripts

* Dev nc of internal (#2155)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

* fix: fix ctor bug

* fix config_proto

* rename c_api_util.Init => c_api_util.InitEnvironment

* refactor compile_context.cur_job => compile_context.cur_job_conf

* remove FixPackedBlobDescOfProducedRegst (#2156)

* Fix snapshot root path empty log (#2158)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* Fix snapshot root path empty log

* fix channel last (#2157)

* fix channel last

* minor

* merge pb_message

* add cudnn conv force algo (#2159)

* Update bert for dev python (#2160)

* remove old bert

* set data_part_num in decoder

* support model load/save args

* Dev flow function (#2152)

* add of.function, refactor init, refine session, and refine runtime

* rm useless code

* rename

* update

* add test

* @oneflow_export JobConfigProto and Trainconf (#2162)

* @oneflow_export JobConfigProto and Trainconf

* remove unused config in config_util.py

* remove oneflow.get_cur_job_conf_builder

* bugfix: bias_add op and reduce_sum op infer sbp, and implementation of bias_add kernel (#2161)

* 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf

* fix config.train.model_update_conf

* _GetJobConfAttr

* update alexnet (#2166)

* Update alexnet (#2167)

* update alexnet

* update for bert

* 15->16

* more reasonable conf

* get variable in py layer norm

* replace val in pb msg;  decode lbn string with split hint (#2165)

* bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)

* Add meta data in HLO instruction, and refine

* python model parallel (#2103)

* decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op

* merge placement group

* refine code in AddAndInferOp

* auto merge placement group when add op; remove mergeplacementgroup interface

* infer sbp parallel when add op; impl Get/Has split axis in infer_ctx

* python blob add interface for model parallel

* refine code of python blob split

* remove interface of has/get_split_axis in python blob

* remove interface of has_batch_dim in python blob

* add check that blob split_axis can be divided by parallel num

* refine code for maybe get/infer sbp

* fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc.

* fix for plain point maybe

* fix bug: add repeated placement group, remove add placement interface in hand

* fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel

* dev_python model parallel runnable and check correct

* remove add placement group when placment scope exit

* 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel

* bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done

* refine python blob_desc.split implement

* refine interface decode lbn to split hint

* refine auto add placment group

* refine lbn with split hint decode

* refine code for review

* remove AutoVar related code (#2168)

* feat: remove all autovar

* fix and format

* fix: fix op::InferBlobDesc

* add prototype (#2172)

* add prototype

* infer blob desc with sbp_signature

* `str_a is not str_b` is buggy, use `str_a != str_b` instead

* Update snapshot.cpp (#2174)

* remove useless lines (#2176)

* Fix bert multi nodes (#2177)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* CHECK_JUST for InferBlobDescsIf (#2178)

* Fix bert multi nodes (#2180)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* config_proto -> default_config_proto

* delete worker

* update alexnet

* remove unused op (#2182)

* remove parallel_ctx when kernel init (#2185)

* InferOpSbpSignature in op_graph and infer_ctx (#2175)

* InferOpSbpSignature in op_graph and infer_ctx

* bugfix: lambda lifetime; add location info to generated job build errors

* refine error generation and return

* refine check that lbi is valid and exists

* remove parallel num in decode_of_record op/kernel (#2186)

* Fix bugs

* delete GlobalJobDesc() in operator/ (#2188)

* rm unused test file

* Refine

* Add assign ops behind adam optimizer to update model and momentum etc.

* Remove fake consume op

* Support enable/disable XLA by setting env variable

* Merge callback, limit max operator count for each XLA subgraph

* CudaEventPool

* fix vector

* refine

* Support in-place update for optimizer

* Add alias input and output to prevent reusing input with other temp buffers
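
XLA exposes this aliasing through XlaBuilder; a sketch of the call follows (indices illustrative, assuming the TF XLA client API of that era):

```cpp
#include "tensorflow/compiler/xla/client/xla_builder.h"

// Sketch: declare that output {0} of the computation may share a buffer with
// parameter 0, so the runtime can reuse the input buffer in place instead of
// pairing the input with an unrelated temp buffer.
void AliasFirstOutputWithFirstParam(xla::XlaBuilder* builder) {
  builder->SetUpAlias(/*output_index=*/{0},
                      /*param_number=*/0,
                      /*param_index=*/{});
}
```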

* Refine code style

* Remove unused code

* Fix static cublas library and xla link conflict

* Fix cublas link conflict with tensorflow

* Fix different connection kinds for multiple gpu cards (#2282)

* Refine xla cluster algo (#2289)

* Fix different connection kinds for multiple gpu cards

* Fix bug for multiple outputs consumed by one node

* Refine cluster algo

* Refine MarkClusterId pass and ReduceSplit task node (#2314)

* Fix different connection kinds for multiple gpu cards

* Fix bug for multiple outputs consumed by one node

* Refine cluster algo

* Determine fusion disabled edges

* update

* Produce multiple registers on edges for ReduceSplit task node.

* Fix new allocator by stream id.

* Refine MarkClusterId pass

* Clustering subgraph with reverse ordering is better

* Support strict clustering by taking dependencies into consideration

* Translate rebuild job and rewrite optimizer into passes, and refine code style

* Fix spell error

* Update cmake

* Merge branch dev_python (#2321)

* Dev res50 new api (#2173)

* check in script

* runable

* fix multinode

* fix and real train

* fix param data_format

* fix truncated normal

* quick fix multi node launch (#2193)

* Dev reshape sbp (#2192)

* reshape sbp

* more check for reshape conf

* fix error CHECK

* refactor reshape

* fix reshape like op

* support naive case of s0

* refine

* rm redundant code

* more generous check for equal element cnt

* restore empty line

* add GatherMs0Grad op (#2191)

* support for gather with s(0) `in`

* add gather_ms0_op

* fix bugs in message GatherMs0OpConf and GatherMs0Kernel

* only (B, S(0)) -> P supported for gather_ms0 op

* add GatherMs0Grad op

* minor fix

* refine code

* bugfix and update gather test case

* add concat op and pass the test (#2067)

* add concat op and pass the test

* add vgg job_conf

* model compared and confirmed to be the same as the old one

* rm unnecessary file

* Update array_ops.py

* mv file

* get rid of ternary operator (#2195)

* Dev reshape util struct (#2194)

* check in changes

* rm file

* minor fix

* Merge network files of 2 cnns (#2196)

* add inceptionV3

* check in vgg16

* add cnns test scripts for dev_python (#2170)

* add cnns test scripts for dev_python

* add alexnet test scripts

* add resnet50

* add inceptionv3

* add resnet50

* add vgg16

* first version of run_cnns_test.py

* remove old files

* unsorted_segment_sum (#2198)

* oneflow.unsorted_segment_sum (#2199)

* oneflow.unsorted_segment_sum

* remove unused import

* remove unused import

* Dev batch unsorted segment sum (#2200)

* oneflow.unsorted_segment_sum

* remove unused import

* remove unused import

* rename UnsortedSegmentSum to BatchUnsortedSegmentSum

* rename: batch_unsorted_* => unsorted_batch_*

* unsorted_segment_sum (#2201)

* unsorted_segment_sum

* fix job_completer/unsorted_segment_sum_grad.cpp

* more check for unsorted_segment_sum batch_axis

* remove FixParallelDesc (#2202)

* rm KernelIfWithModel KernelIfWithActivation (#2203)

* remove KernelIfWithActivation

* remove KernelIfWithModel

* rm blob header kLossInstanceNum (#2204)

* rm ActivationType from op/kernel (#2205)

* refactor sigmoid_cross_entropy_loss

* fix SigmoidGrad::InferBatchAxis

* support part_name_prefix and part_name_suffix_length (#2208)

* rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus

* oneflow.watch for debug

* Dev decode batch size (#2206)

* rm batch_size and piece_size

* merge dev_python

* Update reshape_like_op.cpp (#2213)

* oneflow.parallel (#2211)

* oneflow.parallel

* refactor split_axis => parallel

* rename parallel => distribute

* fix typo: *Parallel => *Distribute

* add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()

* fix warning: return string reference to temporary (#2212)

* docker build support (#2002)

* update cmake files

* check in files

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* shrink ctx size

* fix script

* fix wheel build

* fix wheel build not adding .so (#2052)

* lower cmake version bar

* rm more files

* keep build dir

* check in test bash script

* fix

* Dev docker sx (#2124)

* add python2 docker env

* rm old docker files

* update repository

* add ARG CUDA and USE_PYTHON_3_OR_2

* reform files

* update

* rm log that doesn't print when there is cache

* use default arg in dockerfile

* better py 2 or 3 condition

* add default

* use if

* update alexnet

* update for bert

* 15->16

* add resnet50 in model (#2217)

* remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)

* remove parallel policy

* rm FC/rnn/embedding_look_up op/kernel

* add check data parallel for conv/layer_norm op

* bugfix: bias add + use math_add when batch size = 1

* fix InferBatchAxis (#2220)

* sync with bert_benchmark (#2221)

* sync with bert_benchmark

* rename run.sh

* Dev actor msg queue (#2225)

* async msg queue

* EnqueueAsyncMsg

* Merge wnd python (#2226)

* not ready yet

* segment fix

* fix segment_sum bugs

* 1st wide_n_deep push

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* run successfully on single GPU

* fix 121 for tick (#2069)

* delete unnecessary multiply_grad class

* speed up generation time for dot2svg (#2083)

* Add axis conf to bias_add for any axis channel (#2087)

* bias_add completion

* follow comment

* make conf axis required

* Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)

This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.

* updated

* fix segment_sum_grad

* fix sbp

* fix segment_sum impl for data parallel

* fix

* remove useless code in segment_kernel_util.h

* add python interface

* fix sigmoid conf

* fix naming error

* fix typo

* temp mod loss sbp

* add LazyAdam

* Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep

* rm useless code

* unsorted_segment_sum

* refactor sigmoid_cross_entropy_loss_kernel for high performance

* Improve sigmoid cross entropy loss grad (#2207)

* remove for loop called cuda kernel

* minor fix

* ../oneflow/python/ops/data_ops.py (#2209)

* fix lazy_adam

* Merge wnd and python (#2214)

* rm ActivationType from op/kernel (#2205)

* refactor sigmoid_cross_entropy_loss

* fix SigmoidGrad::InferBatchAxis

* support part_name_prefix and part_name_suffix_length (#2208)

* rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus

* oneflow.watch for debug

* Dev decode batch size (#2206)

* rm batch_size and piece_size

* merge dev_python

* Update reshape_like_op.cpp (#2213)

* oneflow.parallel (#2211)

* oneflow.parallel

* refactor split_axis => parallel

* rename parallel => distribute

* fix typo: *Parallel => *Distribute

* add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()

* merge dev_python

* fix boxing: P->S(0)

* check in docker build scripts (#2216)

* Dev python widedeep docker (#2218)

* check in docker build scripts

* check in .dockerignore

* rm oneflow.segment_sum

* remove segment_sum

* rm unused file

* rm debug code

* rm debug code

* rm double empty lines

* remove useless comments

* fix send msg (#2227)

* fix reduction_coefficient (#2228)

* refactor ndarray for eq/ne/...

* Dev kernel launch synchronized (#2230)

* IsKernelLaunchSynchronized

* virtual

* refine

* refine

* separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC

* more static_assert

* remove unused task related dot function (#2236)

* remove unused task related dot function

* do not output dot rank info

* Dev non distributed optimizer js (#2234)

* op&kernel&actor

* job

* job_completer

* graph

* format

* fix pd

* fix

* ignore DelPlacementByOpName

* fix auto tick

* JobBuilder

* fix

* config util

* fix

* fix opgrade

* broadcast tick

* fix allreduce

* balance by model size

* GetSoleOutBlobSize

* async_actor_msg_deque

* group

* AddOrMutOpsOnlyOnce

* fix NcclTupleBroadcastGrad

* order

* set nccl order hint

* op_conf

* grad hint

* NcclTupleBroadcastReduceSequencePass

* add missed mutops

* order fix

* try kMdUpdtArea

* fix nccl_order_hint

* fix

* add ti

* tuple_identity_op

* remove useless

* group

* fix dead lock

* force ctrl in

* sc broadcast

* sort obn

* group nccl

* config group_size_mbyte

* non_distributed_optimizer_group_size_mbyte

* format

* stop check

* rm message sending optimization

* refine lazy adam (#2244)

* refine lazy adam

* update

* memory version 2 step 1: replace original concept about mem sharing (#2242)

* mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem

* memory version 2 step 1: replace original concept about mem sharing

* record reader multi thread (#2246)

* multi thread

* ComputeThreadPoolSize

* python api

* Fix random decode (#2252)

* add decode random

* fix decode random actor

* Dev pr boxing v2 (#2248)

* NcclDeviceCtx

* include naive_actor

* refine

* use_boxing_v2

* config.use_boxing_v2

* SubTskGphBuilder

* fix

* hash<oneflow::MemoryCase>

* Maybe<void>

* ChainSubTskGphBuilder

* SliceBoxingOp

* return ok

* SliceBoxingKernel

* SliceBoxingActor

* kSliceBoxing

* nccl boxing op

* nccl actor

* REGISTER_OP

* GetMsgFromCustomizedConf

* NcclBoxingTaskNode

* BldSubTskGphByBoxingV2

* NcclBoxingSubTskGphBuilder

* fix

* fix

* NcclKernel

* ParallelContext

* REGISTER_ACTOR

* fix rank set

* IsNcclTaskType

* limit

* 1024

* multi thread reader

* thread_num

* IsKernelLaunchSynchronized

* refine

* NcclTupleReduce/BroadcastKernel use NcclDeviceCtx

* MakeHostMemCase

* NcclBldSubTskGph

* remove useless code

* use_boxing_v2

* refine

* refine

* refine

* refine

* refine

* cmake find python note when version less than 3.14 (#2286)

* fix bug: reduce split kernel inplace (#2297)

* Dev bias add (#2299)

* use bias add

* fix

* bias_add

* bias add half

* fix

* reinterpret_cast

* fix half

* HALF

* fix

* ADD_DEFAULT_KERNEL_CREATOR

* fix

* format

* Fix dev python test (#2294)

* add decode random

* fix decode random actor

* fix dev_python test scripts

* fix batch_size test scripts

* fix

* Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)

* MemBlockProto and ChunkProto

* create mem block and chunk after improver

* interface merge mem block and chunk between sub plans

* merge chunk between jobs for memory reuse

* use memory zone unique id to replace memory case hash

* merge interface op mem block between jobs for mem shared

* gen GlobalCriticalSection by mem block id and chunk id

* check mem block and chunk valid before runtime

* Refactor: RegstMgr; allocate memory by mem block and chunk instead of regst

* fix bug; and pass test

* fix bug: init chunk_id_count in id_manager

* reuse copyHd out mem between jobs

* PushPlan and PullPlan for memblock and chunk

* refine merge mem block / chunk in oneflow.cpp

* at(i);

* GetOpName2JobId2TaskProtos functional

* using output ptr; pass test AlexNet and Resnet

* Fix xla reshape op

* Merge upstream of_xla (#2322)

* Dev cuda 9 arch 70 (#2318)

* kCudaAlignSize = 256

* always compute_70

* __CUDA_API_VERSION >= 10000

* __CUDA_API_VERSION >= 10000

* disable_all_reduce_sequence

* Fix xla reshape op

* Fix compilation without xla

* Remove useless code and fix data type mismatch in field desc (#2326)

* Remove useless code

* Refine code style

* Fix data type mismatch in field desc

* Update README.md (#2335)

* Refine code style (#2336)

* Update XLA usage document (#2337)

* Update XLA usage document

* Fix mistakes

* Add xla clang-format and format codestyle (#2340)

* Revert "Add xla clang-format and format codestyle (#2340)" (#2341)

This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.

* Add xla clang-format and format codestyle (#2342)

* Add xla clang-format and format codestyle

* Fix header file missing

* Of xla sx (#2334)

* add gather grad op and pass testing

* rm check

* done batch gather grad

* pass test

* modify according to the review

* add unsorted_segment_sum and refine unsorted_batch_segment_sum

* reform according to review

* reformat according to clang-format and rm reference to the temp object

* Pick step0 and step1 new commits (#2346)

* Add xla clang-format and format codestyle

* Fix header file missing

* Modify codes to support XLA

Conflicts:
	oneflow/core/job/job_builder.cpp
	oneflow/core/job/job_builder.h
	oneflow/core/operator/op_conf.proto

* Fix a bug for building subgraph although it won't lead to wrong results (#2347)

* Fix setting is_mutable in xla launch op (#2349)

* Change directory xla to xrt, apply patch if building with xla

* Refactor

* Add infer shape pass, and Refactor launch kernel, graph compiler

* Refine code style, add xla executable and graph compiler

* Rename platform.proto as types.proto

* change OpCompiler to OpKernel, complete xla graph compiler

* Fix compilation bugs and add allocator, now xla compilation is ok

* Add xla executable runtime

* Add executable run scope to support launch kernel on specific stream.

* Fix infer shape pass, and revert cuda event pool

* Refactor graph building with attaching argument metadata.

* Set mutability if rebuilding job

* Set device ordinal correctly

* Refine DelOps

* Refine Argument definition and abstract function as subgraph

* Fix infer shape in xrt launch op and launch kernel.

* Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.

* Refine code style

* Rename xla Operand as XlaValue.

* Complete TensorRT compiler and builder, Refine OpKernel

* Pick public code changes from the new tensorrt branch.

* Fix tensorrt compilation

* Fake implementation of trt executable

* Support selecting engine in launch kernel, refine trt executable

* Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfix.

* Support train phase setting for registered op kernel

* Remove RewriteOptimizer pass, update xla optimizer op.

* Format job builder .h and .cpp files.

* Remove RewriteOptimizer pass, update xla optimizer op.

* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.

* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.

* Refine code style and comment.

* Refine model update inference for launch op.

* Refine

* Refine code style and comment.

* Refine model update inference for launch op.

Conflicts:
	oneflow/xrt/kernel/op_kernel.h
	oneflow/xrt/node_util.cpp
	oneflow/xrt/node_util.h
	oneflow/xrt/passes/cluster.h
	oneflow/xrt/passes/mark_cluster_id_pass.cpp
	oneflow/xrt/passes/rebuild_job_pass.cpp
	oneflow/xrt/types.h

* Add xrt README.md

* Add use_xla_jit and use_tensorrt options in job proto

* Refine code style

* Fix BlobDesc getter and xla LayerNorm op for FP16

* Make use_xla_jit and use_tensorrt configurable from python config and env variables.

* Update benchmark

* Refine xrt README and rename compile_with_xrt.h file

* Update README

* Revert tensorrt

* Fix absl missing if building with TensorRT but without XLA

* Update xrt benchmark

* Disable WITH_XLA by default

* Update xrt benchmark

* Format xrt as core

* add activation op

* add softmax op

* Refine code style, remove unused code

* Remove duplication of XLA usage

* test pass

* pooling test pass

* add concat op, not tested

* add activation ops, test not passed

* Add xla gelu unittest

* add activation op, and test passed

* add pooling op, and test passed

* Fix int64 env variable

* Export float16 for python

* Add xla relu unittest

* try to solve conv bug

* add elementwise add op, test passed

* add concat op, test passed

* Bugfix: transfer weights from gpu to host since tensorrt requires host weights.

* add op unit tests

* resolve conflicts and fix softmax bug

* add identity op and topk op, to test

* Add xla bias add and reshape unittests

* Add xla identity unittest

* Add xla cast and scalar op unittests

* Add xla broadcast op and transpose unittests

* Add xla add, sigmoid and tanh unittests

* add reduce mean op, test passed

* format ops, add CHECKs, and optimize function structure

* Add xla gather and batch_gather unittests

* Add xla softmax unittest and fix softmax bug if axis is not the last dim.

* add trt gather op and unit test

* Add xla reduce_sum unittest, and support keep_dims for xla reduce

* Add xla layer_norm unittest, and refine xla layer norm op

* Add reshape_like unittest, and export reshape_like api

* Refine xrt unittest code style

* Export softmax_grad op, add softmax_grad unittest

* Export tanh_grad op and add xla unittest

* Export gelu_grad op, and add xla unittest

* add conv unit test

* reformat

* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests

* Commit to merge upstream of_xrt

* check files

* modify files according to review advice.

* Add xrt unittests (#2483)

* Revert tensorrt

* Fix absl missing if building with TensorRT but without XLA

* Update xrt benchmark

* Add xla gelu unittest

* Fix int64 env variable

* Export float16 for python

* Add xla relu unittest

* Add xla bias add and reshape unittests

* Add xla identity unittest

* Add xla cast and scalar op unittests

* Add xla broadcast op and transpose unittests

* Add xla add, sigmoid and tanh unittests

* Add xla gather and batch_gather unittests

* Add xla softmax unittest and fix softmax bug if axis is not the last dim.

* Add xla reduce_sum unittest, and support keep_dims for xla reduce

* Add xla layer_norm unittest, and refine xla layer norm op

* Add reshape_like unittest, and export reshape_like api

* Refine xrt unittest code style

* Export softmax_grad op, add softmax_grad unittest

* Export tanh_grad op and add xla unittest

* Export gelu_grad op, and add xla unittest

* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests

* Commit to merge upstream of_xrt

* Fix reduce_mean facade bug if keep_dims is true.

* Refine tensorrt unittests

* Check failed if full reduce without keep dimension.

* add pooling unit test

* Add tensorrt bias_add and reshape op, and their unittests.

* Support fp16 for tensorrt.

* Add tensorrt transpose op and unittest.

* add unit test conv_2d

* add unit test concat

* Fix concat if axis is -1.

* Refine tensorrt conv2d unittest

* Fix padding mode for conv2d and pooling, refine unittests.

* Refine tensorrt concat unittest

* Add convert api from string engine to XrtEngine.

* Revert tensorrt, and merge of_xrt branch

* Remove some comments.

* Refine tensorrt unittests

* Add XrtConfig to deal with xla and tensorrt configurations.

Conflicts:
	oneflow/xrt/api.cpp

* Update tensorflow.cmake to avoid applying the patch repeatedly.

* Remove XrtConfig Option, and fix xrt unittests

* Add tensorrt batch norm (#2516)

* Refine xrt signature hash, and fix python configuration (#2520)

* Fix XrtCompilationEnabled returns (#2524)

* Fix compilation after merge dev_python

* Update xrt unittests

* Revert protobuf version

* Remove comment FOR_RANGE

* Remove unused code

* Reformat

* Refine job builder

* Disable dump job if not debug mode
Co-authored-by: Snow <snow3s@qq.com>
Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Parent: 465ee822
@@ -8,6 +8,8 @@ option(BUILD_RDMA "" OFF)
 option(BUILD_CUDA "" ON)
 option(RELEASE_VERSION "" ON)
 option(PY3 "" OFF)
+option(WITH_XLA "Option to build with XLA" OFF)
+option(WITH_TENSORRT "Option to build with TensorRT" OFF)
 if(NOT RELEASE_VERSION)
   set(CUDNN_STATIC OFF CACHE BOOL "")
@@ -20,6 +22,13 @@ else()
   project(oneflow C CXX)
 endif()
+if (WITH_XLA)
+  add_definitions(-DWITH_XLA)
+endif()
+if (WITH_TENSORRT)
+  add_definitions(-DWITH_TENSORRT)
+endif()
 enable_testing()
 set(CMAKE_CXX_STANDARD 11)
 set(CMAKE_POSITION_INDEPENDENT_CODE ON)
@@ -65,7 +74,7 @@ if(WIN32)
   #set(CMAKE_EXE_LINKER_FLAGS_DEBUG "${CMAKE_EXE_LINKER_FLAGS} /DEBUG:FASTLINK")
   set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /D_ITERATOR_DEBUG_LEVEL=0")
 else()
-  list(APPEND CUDA_NVCC_FLAGS -std=c++11 -w -Wno-deprecated-gpu-targets)
+  list(APPEND CUDA_NVCC_FLAGS -w -Wno-deprecated-gpu-targets)
   # half is not fully supported when __CUDA_ARCH__ < 530
   # list(APPEND __cuda_nvcc_gencodes "arch=compute_30,code=sm_30")
   # list(APPEND __cuda_nvcc_gencodes "arch=compute_30,code=compute_30")
@@ -85,10 +94,12 @@ else()
   foreach(CUDA_NVCC_GENCODE ${CUDA_NVCC_GENCODES})
     list(APPEND CUDA_NVCC_FLAGS -gencode ${CUDA_NVCC_GENCODE})
   endforeach()
-  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -Wall -Wno-sign-compare -Wno-unused-function -fPIC")
+  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -std=c++11 -Wall -Wno-sign-compare -Wno-unused-function -fPIC")
   if (RELEASE_VERSION)
     list(APPEND CUDA_NVCC_FLAGS -O3)
-    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -DNDEBUG")
+  else()
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O0")
   endif()
 endif()
@@ -97,4 +108,5 @@ if (THIRD_PARTY)
   set(THIRD_PARTY OFF CACHE BOOL "" FORCE)
 else()
   include(oneflow)
+  configure_file(${PROJECT_SOURCE_DIR}/setup.py.in ${PROJECT_BINARY_DIR}/setup.py)
 endif()
@@ -42,3 +42,76 @@ or you can just clone source code and submodules step by step
 ```
 cmake -DTHIRD_PARTY=OFF .. && make -j
 ```
+### Build with XLA
+- Install bazel
+  Download and install bazel from [here](https://docs.bazel.build/versions/1.0.0/bazel-overview.html); version 0.24.1 is recommended. You can confirm that bazel is installed successfully by running:
+  ```shell
+  bazel version
+  ```
+- Update cmake
+  This step is needed only if the installed CMake cannot download a .tgz file from an https URL. Skip it, and come back to reinstall CMake only if you hit a download error while building the third-party dependencies.
+  Download CMake (>= 3.7) from [here](https://cmake.org/download/), then configure and install it with the following commands:
+  ```shell
+  # Install the curl development toolkit
+  sudo yum install libcurl-devel
+  # Install cmake
+  cd cmake && ./bootstrap --system-curl --prefix=$your_path && make install
+  ```
+- Build third-parties
+  Run the following commands to build the third-party dependencies:
+  ```shell
+  cd build && cmake -DWITH_XLA=ON -DTHIRD_PARTY=ON ..
+  make -j$(nproc)
+  ```
+  If a download error occurs, go back to the previous step to reinstall CMake, then delete CMakeCache.txt and build the third-parties again.
+- Build OneFlow
+  ```shell
+  cmake .. \
+    -DWITH_XLA=ON \
+    -DPYTHON_LIBRARY=your_python_lib_path \
+    -DPYTHON_INCLUDE_DIR=your_python_include_dir \
+    -DPython_NumPy_INCLUDE_DIRS=your_numpy_include_dir
+  make -j$(nproc)
+  ```
+- XLA documents
+  You can check this [doc](./oneflow/xrt/README.md) for more details about how to use XLA.
+### Build with TensorRT
+- Build third-parties
+  Run the following commands to build the third-party dependencies:
+  ```shell
+  cd build && cmake -DWITH_TENSORRT=ON -DTHIRD_PARTY=ON ..
+  make -j$(nproc)
+  ```
+- Build OneFlow
+  ```shell
+  cmake .. \
+    -DWITH_TENSORRT=ON \
+    -DPYTHON_LIBRARY=your_python_lib_path \
+    -DPYTHON_INCLUDE_DIR=your_python_include_dir \
+    -DPython_NumPy_INCLUDE_DIRS=your_numpy_include_dir
+  make -j$(nproc)
+  ```
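Note that the TensorRT find script added in this commit searches `TENSORRT_ROOT` (as a CMake variable or an environment variable) and the third-party directory; if TensorRT is installed somewhere else, pass `-DTENSORRT_ROOT=your_tensorrt_path` to the cmake commands above, otherwise configuration fails with "TensorRT was not found".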
@@ -45,6 +45,24 @@ foreach(oneflow_hdr_to_be_expanded ${oneflow_all_hdr_to_be_expanded})
 endforeach()
 file(GLOB_RECURSE oneflow_all_src "${PROJECT_SOURCE_DIR}/oneflow/core/*.*" "${PROJECT_SOURCE_DIR}/oneflow/python/*.*")
+if (WITH_XLA OR WITH_TENSORRT)
+  file(GLOB_RECURSE oneflow_xrt_src "${PROJECT_SOURCE_DIR}/oneflow/xrt/*.*")
+  if (NOT WITH_XLA)
+    file(GLOB_RECURSE xla_removing_src "${PROJECT_SOURCE_DIR}/oneflow/xrt/xla/*.*")
+  endif ()
+  if (NOT WITH_TENSORRT)
+    file(GLOB_RECURSE trt_removing_src "${PROJECT_SOURCE_DIR}/oneflow/xrt/tensorrt/*.*")
+  endif ()
+  list(APPEND xrt_removing_srcs ${xla_removing_src})
+  list(APPEND xrt_removing_srcs ${trt_removing_src})
+  # message(STATUS "removing_srcs: ${xrt_removing_srcs}")
+  foreach (removing_file ${xrt_removing_srcs})
+    list(REMOVE_ITEM oneflow_xrt_src ${removing_file})
+  endforeach ()
+  list(APPEND oneflow_all_src ${oneflow_xrt_src})
+endif()
 foreach(oneflow_single_file ${oneflow_all_src})
   # Verify whether this file is for other platforms
   set(exclude_this OFF)
@@ -70,33 +88,33 @@ foreach(oneflow_single_file ${oneflow_all_src})
     set(group_this ON)
   endif()
-  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.h$")
+  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.h$")
     list(APPEND of_all_obj_cc ${oneflow_single_file})
     set(group_this ON)
   endif()
-  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.cuh$")
+  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.cuh$")
     if(BUILD_CUDA)
       list(APPEND of_all_obj_cc ${oneflow_single_file})
     endif()
     set(group_this ON)
   endif()
-  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.cu$")
+  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.cu$")
     if(BUILD_CUDA)
      list(APPEND of_all_obj_cc ${oneflow_single_file})
    endif()
    set(group_this ON)
  endif()
-  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.proto$")
+  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.proto$")
     list(APPEND of_all_proto ${oneflow_single_file})
     #list(APPEND of_all_obj_cc ${oneflow_single_file}) # include the proto file in the project
     set(group_this ON)
   endif()
-  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.cpp$")
+  if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.cpp$")
-    if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*_test\\.cpp$")
+    if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*_test\\.cpp$")
       # test file
       # list(APPEND of_all_test_cc ${oneflow_single_file})
     else()
...
@@ -15,6 +15,17 @@ include(cocoapi)
 include(half)
 include(json)
+if (WITH_XLA)
+  include(tensorflow)
+endif()
+if (WITH_TENSORRT)
+  if (NOT WITH_XLA)
+    include(absl)
+  endif()
+  include(tensorrt)
+endif()
 if (BUILD_CUDA)
   set(CUDA_SEPARABLE_COMPILATION ON)
   find_package(CUDA REQUIRED)
@@ -114,6 +125,11 @@ if (BUILD_CUDA)
   include(cub)
   include(nccl)
+  if (WITH_XLA)
+    # Fix conflicts between tensorflow cublas dso and oneflow static cublas.
+    # TODO(hjchen2) Should commit a issue about this fix.
+    list(APPEND oneflow_third_party_libs -Wl,--whole-archive ${cuda_lib_dir}/libcublas_static.a -Wl,--no-whole-archive)
+  endif()
   list(APPEND oneflow_third_party_libs ${CUDA_LIBRARIES})
   list(APPEND oneflow_third_party_libs ${CUDNN_LIBRARIES})
   list(APPEND oneflow_third_party_libs ${NCCL_STATIC_LIBRARIES})
@@ -150,6 +166,17 @@ if(BUILD_RDMA)
   endif()
 endif()
+if(WITH_XLA)
+  list(APPEND oneflow_third_party_libs ${TENSORFLOW_XLA_LIBRARIES})
+endif()
+if(WITH_TENSORRT)
+  if (NOT WITH_XLA)
+    list(APPEND oneflow_third_party_libs ${ABSL_LIBRARIES})
+  endif()
+  list(APPEND oneflow_third_party_libs ${TENSORRT_LIBRARIES})
+endif()
 message(STATUS "oneflow_third_party_libs: " ${oneflow_third_party_libs})
 add_definitions(-DHALF_ENABLE_CPP11_USER_LITERALS=0)
+include (ExternalProject)
+SET(ABSL_PROJECT absl)
+SET(ABSL_GIT_URL https://github.com/abseil/abseil-cpp.git)
+SET(ABSL_GIT_TAG 43ef2148c0936ebf7cb4be6b19927a9d9d145b8f)
+SET(ABSL_SOURCE_DIR ${CMAKE_CURRENT_BINARY_DIR}/third_party/absl)
+SET(ABSL_INSTALL_DIR ${THIRD_PARTY_DIR}/absl)
+SET(ABSL_INCLUDE_DIR ${ABSL_INSTALL_DIR}/include CACHE PATH "" FORCE)
+SET(ABSL_LIBRARY_DIR ${ABSL_INSTALL_DIR}/lib CACHE PATH "" FORCE)
+INCLUDE_DIRECTORIES(${ABSL_INCLUDE_DIR})
+LINK_DIRECTORIES(${ABSL_LIBRARY_DIR})
+SET(ABSL_LIBRARIES
+  ${ABSL_LIBRARY_DIR}/libabsl_base.a
+  ${ABSL_LIBRARY_DIR}/libabsl_spinlock_wait.a
+  ${ABSL_LIBRARY_DIR}/libabsl_dynamic_annotations.a
+  ${ABSL_LIBRARY_DIR}/libabsl_malloc_internal.a
+  ${ABSL_LIBRARY_DIR}/libabsl_throw_delegate.a
+  ${ABSL_LIBRARY_DIR}/libabsl_int128.a
+  ${ABSL_LIBRARY_DIR}/libabsl_strings.a
+  ${ABSL_LIBRARY_DIR}/libabsl_str_format_internal.a
+  ${ABSL_LIBRARY_DIR}/libabsl_time.a
+  ${ABSL_LIBRARY_DIR}/libabsl_bad_optional_access.a)
+if (THIRD_PARTY)
+  ExternalProject_Add(${ABSL_PROJECT}
+    PREFIX ${ABSL_SOURCE_DIR}
+    GIT_REPOSITORY ${ABSL_GIT_URL}
+    GIT_TAG ${ABSL_GIT_TAG}
+    UPDATE_COMMAND ""
+    CMAKE_ARGS
+      -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
+      -DBUILD_SHARED_LIBS:BOOL=OFF
+      -DCMAKE_CXX_FLAGS:STRING=${CMAKE_CXX_FLAGS}
+      -DCMAKE_CXX_FLAGS_DEBUG:STRING=${CMAKE_CXX_FLAGS_DEBUG}
+      -DCMAKE_CXX_FLAGS_RELEASE:STRING=${CMAKE_CXX_FLAGS_RELEASE}
+    CMAKE_CACHE_ARGS
+      -DCMAKE_INSTALL_PREFIX:PATH=${ABSL_INSTALL_DIR}
+      -DCMAKE_INSTALL_LIBDIR:PATH=${ABSL_LIBRARY_DIR}
+      -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
+      -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
+  )
+endif(THIRD_PARTY)
@@ -3,9 +3,18 @@ include (ExternalProject)
 set(EIGEN_INCLUDE_DIR ${THIRD_PARTY_DIR}/eigen/include/eigen3)
 set(EIGEN_INSTALL_DIR ${THIRD_PARTY_DIR}/eigen)
-set(EIGEN_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/eigen/src/eigen)
+if(WITH_XLA)
+  #set(EIGEN_URL "https://storage.googleapis.com/mirror.tensorflow.org/bitbucket.org/eigen/eigen/get/8071cda5714d.tar.gz")
+  set(EIGEN_URL "https://bitbucket.org/eigen/eigen/get/8071cda5714d.tar.gz")
+else()
+  set(EIGEN_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/eigen/src/eigen)
+endif()
-add_definitions(-DEIGEN_NO_AUTOMATIC_RESIZING -DEIGEN_NO_MALLOC -DEIGEN_USE_GPU)
+add_definitions(-DEIGEN_NO_AUTOMATIC_RESIZING -DEIGEN_USE_GPU)
+if (NOT WITH_XLA)
+  add_definitions(-DEIGEN_NO_MALLOC)
+endif()
+#add_definitions(-DEIGEN_NO_AUTOMATIC_RESIZING -DEIGEN_NO_MALLOC -DEIGEN_USE_GPU)
 if (THIRD_PARTY)
...
@@ -5,7 +5,11 @@ set(PROTOBUF_LIBRARY_DIR ${THIRD_PARTY_DIR}/protobuf/lib)
 set(PROTOBUF_BINARY_DIR ${THIRD_PARTY_DIR}/protobuf/bin)
 set(PROTOBUF_SRC_DIR ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/src)
-set(PROTOBUF_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/protobuf/src/protobuf)
+if(WITH_XLA)
+  set(PROTOBUF_URL "https://storage.googleapis.com/mirror.tensorflow.org/github.com/protocolbuffers/protobuf/archive/310ba5ee72661c081129eb878c1bbcec936b20f0.tar.gz")
+else()
+  set(PROTOBUF_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/protobuf/src/protobuf)
+endif()
 if(WIN32)
   set(PROTOBUF_BUILD_LIBRARY_DIR ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/${CMAKE_BUILD_TYPE})
...
+include (ExternalProject)
+if (WITH_XLA)
+  list(APPEND TENSORFLOW_BUILD_CMD --define with_xla_support=true)
+  if (RELEASE_VERSION)
+    list(APPEND TENSORFLOW_BUILD_CMD -c opt)
+    set(TENSORFLOW_GENFILE_DIR k8-opt)
+  else()
+    list(APPEND TENSORFLOW_BUILD_CMD --copt=-g -c dbg)
+    set(TENSORFLOW_GENFILE_DIR k8-dbg)
+  endif()
+  set(TF_WITH_CUDA ON)
+  if (TF_WITH_CUDA)
+    set(CUDA_COMPUTE_CAPABILITIES "6.0,6.1")
+    if (NOT CUDA_VERSION VERSION_LESS "10.0")
+      set(CUDA_COMPUTE_CAPABILITIES "${CUDA_COMPUTE_CAPABILITIES},7.0")
+    endif()
+    list(APPEND TENSORFLOW_BUILD_CMD --config=cuda)
+    list(APPEND TENSORFLOW_BUILD_CMD --action_env TF_NEED_CUDA=1)
+    list(APPEND TENSORFLOW_BUILD_CMD --action_env TF_CUDA_COMPUTE_CAPABILITIES=${CUDA_COMPUTE_CAPABILITIES})
+  endif()
+  message(STATUS ${TENSORFLOW_BUILD_CMD})
+  set(TENSORFLOW_PROJECT tensorflow)
+  set(TENSORFLOW_GIT_URL https://github.com/tensorflow/tensorflow.git)
+  #set(TENSORFLOW_GIT_TAG master)
+  set(TENSORFLOW_GIT_TAG 80c04b80ad66bf95aa3f41d72a6bba5e84a99622)
+  set(TENSORFLOW_SOURCES_DIR ${THIRD_PARTY_DIR}/tensorflow)
+  set(TENSORFLOW_SRCS_DIR ${TENSORFLOW_SOURCES_DIR}/src/tensorflow)
+  set(TENSORFLOW_INC_DIR ${TENSORFLOW_SOURCES_DIR}/src/tensorflow)
+  set(PATCHES_DIR ${PROJECT_SOURCE_DIR}/oneflow/xrt/patches)
+  set(TENSORFLOW_JIT_DIR ${TENSORFLOW_SRCS_DIR}/tensorflow/compiler/jit)
+  set(TENSORFLOW_GEN_DIR ${TENSORFLOW_SRCS_DIR}/bazel-out/${TENSORFLOW_GENFILE_DIR}/genfiles)
+  set(TENSORFLOW_EXTERNAL_DIR ${TENSORFLOW_SRCS_DIR}/bazel-tensorflow/external)
+  set(THIRD_ABSL_DIR ${TENSORFLOW_EXTERNAL_DIR}/com_google_absl)
+  set(THIRD_PROTOBUF_DIR ${TENSORFLOW_EXTERNAL_DIR}/com_google_protobuf/src)
+  set(THIRD_BORINGSSL_DIR ${TENSORFLOW_EXTERNAL_DIR}/boringssl/src)
+  set(THIRD_SNAPPY_DIR ${TENSORFLOW_EXTERNAL_DIR}/snappy)
+  list(APPEND TENSORFLOW_XLA_INCLUDE_DIR
+    ${TENSORFLOW_INC_DIR}
+    ${TENSORFLOW_GEN_DIR}
+    ${THIRD_ABSL_DIR}
+    ${THIRD_PROTOBUF_DIR}
+    ${THIRD_BORINGSSL_DIR}
+    ${THIRD_SNAPPY_DIR}
+  )
+  include_directories(${TENSORFLOW_XLA_INCLUDE_DIR})
+  list(APPEND TENSORFLOW_XLA_LIBRARIES libtensorflow_framework.so.1)
+  list(APPEND TENSORFLOW_XLA_LIBRARIES libxla_core.so)
+  link_directories(
+    ${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow
+    ${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow/compiler/jit/xla_lib
+  )
+  if (THIRD_PARTY)
+    ExternalProject_Add(${TENSORFLOW_PROJECT}
+      PREFIX ${TENSORFLOW_SOURCES_DIR}
+      GIT_REPOSITORY ${TENSORFLOW_GIT_URL}
+      GIT_TAG ${TENSORFLOW_GIT_TAG}
+      PATCH_COMMAND patch -Np1 < ${PATCHES_DIR}/xla.patch
+      CONFIGURE_COMMAND ""
+      BUILD_COMMAND cd ${TENSORFLOW_SRCS_DIR} &&
+        bazel build ${TENSORFLOW_BUILD_CMD} -j 20 //tensorflow/compiler/jit/xla_lib:libxla_core.so
+      INSTALL_COMMAND ""
+    )
+  endif(THIRD_PARTY)
+  set(TENSORFLOW_XLA_FRAMEWORK_LIB ${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow/libtensorflow_framework.so.1)
+  set(TENSORFLOW_XLA_CORE_LIB ${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow/compiler/jit/xla_lib/libxla_core.so)
+endif(WITH_XLA)
+include (ExternalProject)
+if (WITH_TENSORRT)
+  find_path(TENSORRT_INCLUDE_DIR NvInfer.h
+    PATHS ${TENSORRT_ROOT} ${TENSORRT_ROOT}/include
+          $ENV{TENSORRT_ROOT} $ENV{TENSORRT_ROOT}/include
+          ${THIRD_PARTY_DIR}/tensorrt/include)
+  find_library(TENSORRT_LIBRARIES NAMES libnvinfer.so libnvinfer.a
+    PATHS ${TENSORRT_ROOT} ${TENSORRT_ROOT}/lib
+          $ENV{TENSORRT_ROOT} $ENV{TENSORRT_ROOT}/lib
+          ${THIRD_PARTY_DIR}/tensorrt/lib)
+  if (TENSORRT_INCLUDE_DIR AND TENSORRT_LIBRARIES)
+  else()
+    message(FATAL_ERROR "TensorRT was not found. You can set TENSORRT_ROOT to specify the search path.")
+  endif()
+  message(STATUS "TensorRT Include: ${TENSORRT_INCLUDE_DIR}")
+  message(STATUS "TensorRT Lib: ${TENSORRT_LIBRARIES}")
+  include_directories(${TENSORRT_INCLUDE_DIR})
+endif(WITH_TENSORRT)
#include "oneflow/core/common/protobuf.h" #include "oneflow/core/common/protobuf.h"
#include "oneflow/core/common/shape.pb.h"
#include "oneflow/core/common/str_util.h" #include "oneflow/core/common/str_util.h"
#include "oneflow/core/register/blob_desc.pb.h" #include "oneflow/core/register/blob_desc.pb.h"
#include <google/protobuf/io/coded_stream.h> #include <google/protobuf/io/coded_stream.h>
...@@ -88,6 +89,11 @@ int32_t GetEnumFromPbMessage(const PbMessage& msg, const std::string& field_name ...@@ -88,6 +89,11 @@ int32_t GetEnumFromPbMessage(const PbMessage& msg, const std::string& field_name
OF_PP_FOR_EACH_TUPLE(DEFINE_SET_VAL_IN_PBMESSAGE, PROTOBUF_BASIC_DATA_TYPE_SEQ) OF_PP_FOR_EACH_TUPLE(DEFINE_SET_VAL_IN_PBMESSAGE, PROTOBUF_BASIC_DATA_TYPE_SEQ)
const PbMessage& GetMessageInPbMessage(const PbMessage& msg, const std::string& field_name) {
PROTOBUF_REFLECTION(msg, field_name);
return r->GetMessage(msg, fd);
}
PbMessage* MutableMessageInPbMessage(PbMessage* msg, const std::string& field_name) { PbMessage* MutableMessageInPbMessage(PbMessage* msg, const std::string& field_name) {
PROTOBUF_REFLECTION((*msg), field_name); PROTOBUF_REFLECTION((*msg), field_name);
return r->MutableMessage(msg, fd); return r->MutableMessage(msg, fd);
...@@ -115,6 +121,67 @@ PbMessage* MutableMessageInPbMessage(PbMessage* msg, int field_index) { ...@@ -115,6 +121,67 @@ PbMessage* MutableMessageInPbMessage(PbMessage* msg, int field_index) {
return r->MutableMessage(msg, fd); return r->MutableMessage(msg, fd);
} }
#define DECLARE_GETTER_FUNC_HEADER(type) \
template<> \
type GetValFromPbMessage<type>(const PbMessage& msg, const std::string& field_name)
#define DECLARE_SETTER_FUNC_HEADER(type) \
template<> \
void SetValInPbMessage<type>(PbMessage * msg, const std::string& field_name, const type& val)
#define DEFINE_MESSAGE_VAL_GETTER_AND_SETTER(message_type) \
DECLARE_GETTER_FUNC_HEADER(message_type) { \
PROTOBUF_REFLECTION(msg, field_name); \
return *dynamic_cast<const message_type*>(&r->GetMessage(msg, fd)); \
} \
DECLARE_SETTER_FUNC_HEADER(message_type) { \
PROTOBUF_REFLECTION((*msg), field_name); \
r->MutableMessage(msg, fd)->CopyFrom(val); \
}
DEFINE_MESSAGE_VAL_GETTER_AND_SETTER(ShapeProto);
#define DEFINE_ENUM_VAL_GETTER_AND_SETTER(enum_type) \
DECLARE_GETTER_FUNC_HEADER(enum_type) { \
PROTOBUF_REFLECTION(msg, field_name); \
return static_cast<enum_type>(r->GetEnumValue(msg, fd)); \
} \
DECLARE_SETTER_FUNC_HEADER(enum_type) { \
PROTOBUF_REFLECTION((*msg), field_name); \
r->SetEnumValue(msg, fd, val); \
}
DEFINE_ENUM_VAL_GETTER_AND_SETTER(DataType);
#define DEFINE_VECTOR_VAL_GETTER_AND_SETTER(vec_type, vec_type_name) \
DECLARE_GETTER_FUNC_HEADER(vec_type) { \
PROTOBUF_REFLECTION(msg, field_name); \
int32_t field_size = r->FieldSize(msg, fd); \
vec_type retval(field_size); \
for (int i = 0; i < field_size; ++i) { retval[i] = r->Get##vec_type_name(msg, fd, i); } \
return std::move(retval); \
} \
DECLARE_SETTER_FUNC_HEADER(vec_type) { \
PROTOBUF_REFLECTION((*msg), field_name); \
for (int i = 0; i < val.size(); ++i) { r->Set##vec_type_name(msg, fd, i, val[i]); } \
}
#define MAKE_REPEATED_TUPLE_SEQ(type, type_name) \
OF_PP_MAKE_TUPLE_SEQ(std::vector<type>, Repeated##type_name)
#define PROTOBUF_BASIC_REPEATED_DATA_TYPE_SEQ \
MAKE_REPEATED_TUPLE_SEQ(std::string, String) \
MAKE_REPEATED_TUPLE_SEQ(int32_t, Int32) \
MAKE_REPEATED_TUPLE_SEQ(uint32_t, UInt32) \
MAKE_REPEATED_TUPLE_SEQ(int64_t, Int64) \
MAKE_REPEATED_TUPLE_SEQ(uint64_t, UInt64) \
MAKE_REPEATED_TUPLE_SEQ(float, Float) \
MAKE_REPEATED_TUPLE_SEQ(double, Double) \
MAKE_REPEATED_TUPLE_SEQ(int16_t, EnumValue) \
MAKE_REPEATED_TUPLE_SEQ(bool, Bool)
OF_PP_FOR_EACH_TUPLE(DEFINE_VECTOR_VAL_GETTER_AND_SETTER, PROTOBUF_BASIC_REPEATED_DATA_TYPE_SEQ);
#define DEFINE_ADD_VAL_IN_PBRF(cpp_type, pb_type_name) \ #define DEFINE_ADD_VAL_IN_PBRF(cpp_type, pb_type_name) \
template<> \ template<> \
void AddValInPbRf(PbMessage* msg, const std::string& field_name, const cpp_type& val) { \ void AddValInPbRf(PbMessage* msg, const std::string& field_name, const cpp_type& val) { \
......
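Since the macros above are dense, here is a brief sketch of how the resulting specializations can be called. It is not part of the commit; the field names (`shape`, `dim`, `data_type`) are illustrative assumptions about the message being inspected, with `dim` chosen to match the repeated int64 field of oneflow's ShapeProto.

```c++
#include <cstdint>
#include <vector>

#include "oneflow/core/common/protobuf.h"
#include "oneflow/core/common/shape.pb.h"

namespace oneflow {

// Hypothetical caller of the new reflection helpers (illustrative only).
void InspectMessage(const PbMessage& msg) {
  // Message-typed getter: copies the reflected sub-message out as a ShapeProto.
  ShapeProto shape = GetValFromPbMessage<ShapeProto>(msg, "shape");
  // Repeated-field getter: copies the repeated int64 field "dim" into a vector.
  std::vector<int64_t> dims = GetValFromPbMessage<std::vector<int64_t>>(shape, "dim");
  // Enum getter: reads an enum field through the reflection API as a DataType;
  // the matching SetValInPbMessage<DataType> writes one back.
  DataType dtype = GetValFromPbMessage<DataType>(msg, "data_type");
  (void)dims;
  (void)dtype;
}

}  // namespace oneflow
```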
@@ -36,6 +36,7 @@ using PbMd = google::protobuf::util::MessageDifferencer;
   OF_PP_MAKE_TUPLE_SEQ(int64_t, Int64) \
   OF_PP_MAKE_TUPLE_SEQ(uint64_t, UInt64) \
   OF_PP_MAKE_TUPLE_SEQ(float, Float) \
+  OF_PP_MAKE_TUPLE_SEQ(double, Double) \
   OF_PP_MAKE_TUPLE_SEQ(int16_t, EnumValue) \
   OF_PP_MAKE_TUPLE_SEQ(bool, Bool)
@@ -92,6 +93,7 @@ template<typename T>
 void SetValInPbMessage(PbMessage* msg, const std::string& field_name, const T& val);
 const PbMessage& GetMessageInPbMessage(const PbMessage& msg, int field_index);
+const PbMessage& GetMessageInPbMessage(const PbMessage& msg, const std::string& field_name);
 PbMessage* MutableMessageInPbMessage(PbMessage*, const std::string& field_name);
 PbMessage* MutableMessageInPbMessage(PbMessage*, int field_index);
...
 #ifndef ONEFLOW_CORE_REGISTER_SHAPE_VIEW_H_
 #define ONEFLOW_CORE_REGISTER_SHAPE_VIEW_H_
+#include "oneflow/core/common/util.h"
 #include "oneflow/core/common/shape_vec.h"
 namespace oneflow {
...
@@ -35,7 +35,9 @@ void NormalForwardCompTaskNode::ProduceAllRegstsAndBindEdges() {
 }
 void NormalForwardCompTaskNode::ConsumeAllRegsts() {
-  ForEachInDataEdge([&](TaskEdge* edge) { ConsumeRegst("in", edge->GetSoleRegst()); });
+  ForEachInDataEdge([&](TaskEdge* edge) {
+    for (const auto& regst : edge->GetRegsts()) { ConsumeRegst("in", regst); }
+  });
 }
 bool NormalForwardCompTaskNode::IsReadyForBuild() {
...
@@ -4,7 +4,9 @@
 namespace oneflow {
 void OptimizerCompTaskNode::ConsumeAllRegsts() {
-  ForEachInDataEdge([&](TaskEdge* edge) { ConsumeRegst("in", edge->GetSoleRegst()); });
+  ForEachInDataEdge([&](TaskEdge* edge) {
+    for (const auto& regst : edge->GetRegsts()) { ConsumeRegst("in", regst); }
+  });
 }
 void OptimizerCompTaskNode::ProduceAllRegstsAndBindEdges() { ProduceRegst("tmp", false, 1, 1); }
...
@@ -5,48 +5,28 @@
 namespace oneflow {
-namespace {
-int32_t GetDataRegstDescCnt(
-    const HashMap<std::string, std::shared_ptr<RegstDesc>> name2regst_desc) {
-  size_t cnt = 0;
-  for (const auto& pair : name2regst_desc) {
-    cnt += pair.second->regst_desc_type().has_data_regst_desc();
-  }
-  return cnt;
-}
-}  // namespace
 void ReduceSplitCompTaskNode::ProduceAllRegstsAndBindEdges() {
-  std::vector<EdgeInfo> edge_infos;
-  std::shared_ptr<Operator> reduce_split_op = this->logical_node()->SoleOp();
   HashMap<LogicalBlobId, int32_t> lbi2order;
+  std::shared_ptr<Operator> reduce_split_op = this->logical_node()->SoleOp();
   FOR_RANGE(int32_t, idx, 0, reduce_split_op->output_bns().size()) {
+    ProduceRegst("out_" + std::to_string(idx), false, 1, 1);
     const auto& lbi = reduce_split_op->BnInOp2Lbi(reduce_split_op->output_bns().Get(idx));
     CHECK(lbi2order.emplace(lbi, idx).second);
   }
   ForEachOutDataEdge([&](TaskEdge* edge) {
     TaskNode* dst_node = edge->dst_node();
     CHECK(edge->dst_node()->GetTaskType() == TaskType::kOptimizer
           || edge->dst_node()->GetTaskType() == TaskType::kNormalForward);
     CompTaskNode* mdupdt_node = dynamic_cast<CompTaskNode*>(dst_node);
     std::shared_ptr<Operator> mdupdt_op = mdupdt_node->logical_node()->SoleOp();
-    int32_t order = -1;
     for (const std::string& ibn : mdupdt_op->input_bns()) {
       const auto& order_it = lbi2order.find(mdupdt_op->BnInOp2Lbi(ibn));
-      if (order_it != lbi2order.end()) { order = order_it->second; }
+      if (order_it != lbi2order.end()) {
+        BindEdgeWithProducedRegst(edge, "out_" + std::to_string(order_it->second));
+      }
     }
-    CHECK_NE(order, -1);
-    EdgeInfo edge_info{edge, order};
-    edge_infos.emplace_back(edge_info);
   });
-  SortEdges(&edge_infos);
-  FOR_RANGE(size_t, idx, 0, edge_infos.size()) {
-    std::string out_regst_name = "out_" + std::to_string(idx);
-    std::shared_ptr<RegstDesc> out_regst = ProduceRegst(out_regst_name, false, 1, 1);
-    edge_infos[idx].edge->AddRegst(out_regst_name, out_regst);
-  }
 }
 void ReduceSplitCompTaskNode::ConsumeAllRegsts() {
@@ -68,22 +48,23 @@ void ReduceSplitCompTaskNode::BuildExecGphAndRegst() {
   node->BindBnWithRegst(reduce_split_op->SoleIbn(), GetSoleConsumedRegst("in"));
   FOR_RANGE(size_t, i, 0, reduce_split_op->output_bns().size()) {
-    std::shared_ptr<RegstDesc> out_regst = GetProducedRegst("out_" + std::to_string(i));
+    std::string blob_name = "out_" + std::to_string(i);
+    std::shared_ptr<RegstDesc> out_regst = GetProducedRegst(blob_name);
     CHECK(out_regst.get() != nullptr);
-    out_regst->AddLbi(reduce_split_op->BnInOp2Lbi(reduce_split_op->output_bns().Get(i)));
+    out_regst->AddLbi(reduce_split_op->BnInOp2Lbi(blob_name));
-    node->BindBnWithRegst(reduce_split_op->output_bns().Get(i), out_regst);
+    node->BindBnWithRegst(blob_name, out_regst);
   }
   node->InferBlobDescs(parallel_ctx());
 }
 void ReduceSplitCompTaskNode::EnableMemSharingInReduce(const ReduceMemSharingCtx& ctx) {
   CHECK_EQ(GetRankCtx().TotalSegmentCount(), 1);
-  size_t split_num = GetDataRegstDescCnt(produced_regsts());
+  std::shared_ptr<Operator> reduce_split_op = this->logical_node()->SoleOp();
   int64_t offset = 0;
-  FOR_RANGE(int32_t, idx, 0, split_num) {
-    RegstDesc* split_out_regst = GetProducedRegst("out_" + std::to_string(idx)).get();
-    ctx.EnableMemSharing4Regst(split_out_regst, offset);
-    offset += InferRegstSize(*split_out_regst);
+  for (int i = 0; i < reduce_split_op->output_bns().size(); ++i) {
+    RegstDesc* out_regst = GetProducedRegst("out_" + std::to_string(i)).get();
+    ctx.EnableMemSharing4Regst(out_regst, offset);
+    offset += InferRegstSize(*out_regst);
   }
 }
...
@@ -46,6 +46,20 @@ message MemoryAllocationAlgorithmConf {
   optional bool use_time_line_algo = 3 [default = false];
 }
+message XrtConfig {
+  message XlaConfig {
+    // TODO
+  }
+  message TensorRTConfig {
+    optional bool use_fp16 = 1 [default = false];
+    optional bool use_int8 = 2 [default = false];
+  }
+  optional bool use_xla_jit = 1 [default = false];
+  optional bool use_tensorrt = 2 [default = false];
+  optional XlaConfig xla_config = 3;
+  optional TensorRTConfig tensorrt_config = 4;
+}
 message JobConfigProto {
   required string job_name = 1;
@@ -65,6 +79,8 @@ message JobConfigProto {
   optional bool use_memory_allocation_algorithm_v2 = 101 [default = true];
   optional MemoryAllocationAlgorithmConf memory_allocation_algorithm_conf = 102;
+  optional XrtConfig xrt_config = 103;
   optional bool enable_cudnn = 200 [default = true];
   optional int64 cudnn_buf_limit_mbyte = 201 [default = 1024]; // 1GByte
   optional int32 cudnn_conv_force_fwd_algo = 202;
...
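For concreteness, a small sketch of how these new fields could be set through the generated protobuf API; the setter names follow mechanically from the message definitions above, while the generated header path is an assumption.

```c++
#include <string>

#include "oneflow/core/job/job_conf.pb.h"  // assumed path of the generated header

oneflow::JobConfigProto MakeTrtFp16JobConf(const std::string& job_name) {
  oneflow::JobConfigProto job_conf;
  job_conf.set_job_name(job_name);
  // XrtConfig is the new optional field 103 of JobConfigProto.
  oneflow::XrtConfig* xrt = job_conf.mutable_xrt_config();
  xrt->set_use_tensorrt(true);
  // TensorRTConfig.use_fp16 defaults to false; enable it explicitly.
  xrt->mutable_tensorrt_config()->set_use_fp16(true);
  return job_conf;
}
```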
@@ -30,6 +30,17 @@ JobBuilder::JobBuilder(Job* job) : job_(job) {
         op_name2parallel_conf_.emplace(op_name, placemnt_group->mutable_parallel_conf()).second);
   }
 }
+  auto* sbp_conf = job->mutable_sbp_conf();
+  for (auto& pair : *(sbp_conf->mutable_op_name2sbp_signature_conf())) {
+    op_name2sbp_signature_conf_.emplace(pair.first, &pair.second);
+  }
+  for (auto& pair : *(job->mutable_helper()->mutable_lbn2batch_axis())) {
+    lbn2batch_axis_.emplace(pair.first, &pair.second);
+  }
+  auto* helper_conf = job->mutable_helper();
+  for (auto& pair : *(helper_conf->mutable_op_name2op_time_shape())) {
+    op_name2time_shapes_.emplace(pair.first, &pair.second);
+  }
   FOR_RANGE(int32_t, i, 0, job->placement().blob_placement_group_size()) {
     auto* blob_pg = job->mutable_placement()->mutable_blob_placement_group(i);
     for (const auto& lbi : blob_pg->lbi()) {
@@ -38,12 +49,14 @@ JobBuilder::JobBuilder(Job* job) : job_(job) {
   }
 }
-const OperatorConf& JobBuilder::OpConf4OpName(const std::string& op_name) const {
+OperatorConf* JobBuilder::MutableOpConf4OpName(const std::string& op_name) {
-  return *op_name2op_conf_.at(op_name);
+  const auto& it = op_name2op_conf_.find(op_name);
+  CHECK(it != op_name2op_conf_.end());
+  return it->second;
 }
-const ParallelConf& JobBuilder::ParallelConf4OpName(const std::string& op_name) const {
+const OperatorConf& JobBuilder::OpConf4OpName(const std::string& op_name) const {
-  return *op_name2parallel_conf_.at(op_name);
+  return *op_name2op_conf_.at(op_name);
 }
 const ParallelConf& JobBuilder::ParallelConf4Lbi(const LogicalBlobId& lbi) const {
@@ -89,15 +102,69 @@ void JobBuilder::MutParallelConfOnlyOnce(const std::string& op_name,
   *placement_group->mutable_parallel_conf() = parallel_conf;
 }
-void JobBuilder::DelOps(const std::vector<OperatorConf>& op_confs) {
+void JobBuilder::RemoveOpByName(const std::string& op_name) {
-  for (const auto& op_conf : op_confs) {
+  RemoveOpByName(std::unordered_set<std::string>{op_name});
-    const std::string& op_name = op_conf.name();
+}
-    op_name2op_conf_.erase(op_name);
-    auto* op_list = job_->mutable_net()->mutable_op();
+void JobBuilder::RemoveOpByName(const std::unordered_set<std::string>& removing_names) {
-    auto it = std::remove_if(op_list->begin(), op_list->end(),
+  // Update net
-                             [&](const OperatorConf& conf) { return conf.name() == op_name; });
+  DLNetConf net = job_->net();
-    if (it != op_list->end()) { op_list->erase(it); }
+  job_->mutable_net()->clear_op();
+  for (const OperatorConf& op_conf : net.op()) {
+    if (removing_names.count(op_conf.name()) == 0) { *(job_->mutable_net()->add_op()) = op_conf; }
+  }
+  // Update placement
+  auto placement_group = job_->placement().placement_group();
+  job_->mutable_placement()->clear_placement_group();
+  for (const PlacementGroup& place : placement_group) {
+    PlacementGroup p;
+    OpNameSet* op_set = p.mutable_op_set();
+    for (const std::string& name : place.op_set().op_name()) {
+      if (removing_names.count(name) == 0) { op_set->add_op_name(name); }
+    }
+    *(p.mutable_parallel_conf()) = place.parallel_conf();
+    if (op_set->op_name().size() > 0) { *(job_->mutable_placement()->add_placement_group()) = p; }
+  }
+  auto* sbp_conf = job_->mutable_sbp_conf()->mutable_op_name2sbp_signature_conf();
+  auto* time_shape_conf = job_->mutable_helper()->mutable_op_name2op_time_shape();
+  for (const std::string& op_name : removing_names) {
+    // Update Sbp
+    if (sbp_conf->count(op_name) > 0) { sbp_conf->erase(op_name); }
+    // Update time shape
+    if (time_shape_conf->count(op_name) > 0) { time_shape_conf->erase(op_name); }
+  }
+  // Update batch dim lbis
+  // Update identical sbp oba pairs
+  if (job_->helper().has_identical_sbp_oba_pairs()) {
+    auto identical_sbp_oba_pairs = job_->helper().identical_sbp_oba_pairs().pair();
+    job_->mutable_helper()->mutable_identical_sbp_oba_pairs()->clear_pair();
+    for (const auto& pair : identical_sbp_oba_pairs) {
+      if (removing_names.count(pair.first().op_name()) == 0
+          && removing_names.count(pair.second().op_name()) == 0) {
+        *(job_->mutable_helper()->mutable_identical_sbp_oba_pairs()->mutable_pair()->Add()) = pair;
+      }
+    }
+  }
+  // Update builder
+  JobBuilder builder(job_);
+  op_name2op_conf_.swap(builder.op_name2op_conf_);
+  op_name2parallel_conf_.swap(builder.op_name2parallel_conf_);
+  op_name2sbp_signature_conf_.swap(builder.op_name2sbp_signature_conf_);
+  lbn2batch_axis_.swap(builder.lbn2batch_axis_);
+}
+void JobBuilder::DelOps(const std::vector<std::string>& op_names) {
+  std::unordered_set<std::string> removing_names;
+  for (const auto& op_name : op_names) { removing_names.insert(op_name); }
+  RemoveOpByName(removing_names);
+}
+void JobBuilder::DelOps(const std::vector<OperatorConf>& op_confs) {
+  std::unordered_set<std::string> removing_names;
+  for (const auto& op_conf : op_confs) { removing_names.insert(op_conf.name()); }
+  RemoveOpByName(removing_names);
 }
 void JobBuilder::MutOpsOnlyOnce(const std::vector<OperatorConf>& op_confs) {
@@ -130,6 +197,22 @@ void JobBuilder::ForEachOperator(const std::function<void(const Operator&)>& Han
   }
 }
+const ParallelConf& JobBuilder::ParallelConf4OpName(const std::string& op_name) const {
+  return *op_name2parallel_conf_.at(op_name);
+}
+void JobBuilder::AddParallelConf4OpName(const std::string& op_name,
+                                        const ParallelConf& parallel_conf) {
+  bool update = (op_name2parallel_conf_.count(op_name) == 0);
+  if (update) {
+    // update `op_name2parallel_conf_`
+    PlacementGroup* group = job_->mutable_placement()->add_placement_group();
+    group->mutable_op_set()->add_op_name(op_name);
+    *(group->mutable_parallel_conf()) = parallel_conf;
+    op_name2parallel_conf_[op_name] = group->mutable_parallel_conf();
+  }
+}
 SbpParallel* JobBuilder::MutSbpParallel4Oba(const OpBlobArg& oba) const {
   auto* sbp_sig = &(*job_->mutable_sbp_conf()->mutable_op_name2sbp_signature_conf())[oba.op_name()];
   return &(*sbp_sig->mutable_bn_in_op2sbp_parallel())[oba.bn_in_op()];
@@ -141,4 +224,54 @@ void JobBuilder::BindIdenticalSbpOpBlobArgPair(const OpBlobArg& first, const OpB
   *pair->mutable_second() = second;
 }
+const SbpSignature& JobBuilder::SbpSignature4OpName(const std::string& op_name) const {
+  const auto& it = op_name2sbp_signature_conf_.find(op_name);
+  CHECK(it != op_name2sbp_signature_conf_.end());
+  return *(it->second);
+}
+void JobBuilder::AddSbpSignature4OpName(const std::string& op_name,
+                                        const SbpSignature& sbp_signature) {
+  const auto& it = op_name2sbp_signature_conf_.find(op_name);
+  if (it != op_name2sbp_signature_conf_.end()) {
+    *(it->second) = sbp_signature;
+    return;
+  }
+  auto* op_name2sbp_signature_conf = job_->mutable_sbp_conf()->mutable_op_name2sbp_signature_conf();
+  (*op_name2sbp_signature_conf)[op_name] = sbp_signature;
+  op_name2sbp_signature_conf_.emplace(op_name, &(*op_name2sbp_signature_conf)[op_name]);
+}
+const OpTimeShape& JobBuilder::TimeShape4OpName(const std::string& op_name) const {
+  const auto& it = op_name2time_shapes_.find(op_name);
+  CHECK(it != op_name2time_shapes_.end());
+  return *(it->second);
+}
+void JobBuilder::AddTimeShape4OpName(const std::string& op_name, const OpTimeShape& time_shape) {
+  bool update = (op_name2time_shapes_.count(op_name) == 0);
+  if (update) {
+    auto* time_shape_conf = job_->mutable_helper()->mutable_op_name2op_time_shape();
+    (*time_shape_conf)[op_name] = time_shape;
+    op_name2time_shapes_[op_name] = &((*time_shape_conf)[op_name]);
+  }
+}
+const OptInt64& JobBuilder::BatchAxis4Lbn(const std::string& lbn) const {
+  const auto& it = lbn2batch_axis_.find(lbn);
+  CHECK(it != lbn2batch_axis_.end());
+  return *(it->second);
+}
+void JobBuilder::AddBatchAxis4Lbn(const std::string& lbn, const OptInt64& axis) {
+  bool update =
+      (lbn2batch_axis_.count(lbn) == 0) || (lbn2batch_axis_[lbn]->value() != axis.value());
+  if (update) {
+    auto* batch_axis = job_->mutable_helper()->mutable_lbn2batch_axis();
+    (*batch_axis)[lbn] = axis;
+    lbn2batch_axis_[lbn] = &((*batch_axis)[lbn]);
+  }
+}
 } // namespace oneflow
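To make the new JobBuilder surface concrete, here is a hedged sketch (not from the commit) of the pattern an XRT rebuild pass can follow: capture the placement and time shape of an anchor op, remove the clustered ops in one call, then re-insert a single launch op. All names here are illustrative.

```c++
#include <string>
#include <unordered_set>

#include "oneflow/core/job/job_builder.h"  // assumed header for JobBuilder

namespace oneflow {

// Illustrative only: fuse `cluster_ops` into one launch op.
void FuseClusterIntoLaunchOp(Job* job, const std::unordered_set<std::string>& cluster_ops,
                             const OperatorConf& launch_op_conf) {
  JobBuilder builder(job);
  const std::string& anchor = *cluster_ops.begin();
  // Copy these before removal, since RemoveOpByName erases the entries.
  const ParallelConf parallel_conf = builder.ParallelConf4OpName(anchor);
  const OpTimeShape time_shape = builder.TimeShape4OpName(anchor);
  // Drops the ops from the net, placement, sbp signatures and time shapes at once.
  builder.RemoveOpByName(cluster_ops);
  // Re-insert the fused op with the preserved placement and time shape.
  builder.AddOps(parallel_conf, {launch_op_conf});
  builder.AddTimeShape4OpName(launch_op_conf.name(), time_shape);
}

}  // namespace oneflow
```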
@@ -26,19 +26,37 @@ class JobBuilder final {
   SbpConf* mutable_sbp_conf() { return job_->mutable_sbp_conf(); }
   const OperatorConf& OpConf4OpName(const std::string& op_name) const;
-  const ParallelConf& ParallelConf4OpName(const std::string& op_name) const;
+  OperatorConf* MutableOpConf4OpName(const std::string& op_name);
-  const ParallelConf& ParallelConf4Lbi(const LogicalBlobId& lbi) const;
   void AddOps(const ParallelConf& parallel_conf, const std::vector<OperatorConf>& op_confs);
   void MutOpsOnlyOnce(const std::vector<OperatorConf>& op_confs);
   void MutParallelConfOnlyOnce(const std::string& op_name, const ParallelConf& parallel_conf);
   void AddOrMutOpsOnlyOnce(const ParallelConf& parallel_conf,
                            const std::vector<OperatorConf>& op_confs);
+  void RemoveOpByName(const std::string& op_name);
+  void RemoveOpByName(const std::unordered_set<std::string>& removing_names);
+  void DelOps(const std::vector<std::string>& op_names);
   void DelOps(const std::vector<OperatorConf>& op_confs);
   SbpParallel* MutSbpParallel4Oba(const OpBlobArg& oba) const;
   void BindIdenticalSbpOpBlobArgPair(const OpBlobArg& first, const OpBlobArg& second);
   void ForEachOperator(const std::function<void(const Operator&)>& Handler) const;
+  const ParallelConf& ParallelConf4Lbi(const LogicalBlobId& lbi) const;
+  const ParallelConf& ParallelConf4OpName(const std::string& op_name) const;
+  void AddParallelConf4OpName(const std::string& op_name, const ParallelConf& parallel_conf);
+  const SbpSignature& SbpSignature4OpName(const std::string& op_name) const;
+  void AddSbpSignature4OpName(const std::string& op_name, const SbpSignature& sbp_signature);
+  const OpTimeShape& TimeShape4OpName(const std::string& op_name) const;
+  void AddTimeShape4OpName(const std::string& op_name, const OpTimeShape& time_shape);
+  const OptInt64& BatchAxis4Lbn(const std::string& lbn) const;
+  void AddBatchAxis4Lbn(const std::string& lbn, const OptInt64& axis);
 private:
   PlacementGroup* FindPlacementGroup(const std::string& op_name) const;
@@ -48,6 +66,10 @@ class JobBuilder final {
   HashMap<LogicalBlobId, ParallelConf*> lbi2blob_parallel_conf_;
   HashSet<std::string> modified_op_conf_op_names_;
   HashSet<std::string> modified_parallel_conf_op_names_;
+  HashMap<std::string, SbpSignature*> op_name2sbp_signature_conf_;
+  HashMap<std::string, OpTimeShape*> op_name2time_shapes_;
+  HashMap<std::string, OptInt64*> lbn2batch_axis_;
 };
 } // namespace oneflow
...
@@ -64,6 +64,9 @@ class JobDesc final {
   bool all_reduce_fp16() const;
   int64_t cudnn_buf_limit_mbyte() const { return job_conf_.cudnn_buf_limit_mbyte(); }
+  bool has_xrt_config() const { return job_conf_.has_xrt_config(); }
+  const XrtConfig& xrt_config() const { return job_conf_.xrt_config(); }
 #define DEFINE_FUNCTION_CONFIG_GETTER(T, func_name, field_name) \
   T func_name(const std::string& field_name) const { \
     const UserOpAttrVal& attr_val = GetFunctionFlagVal(field_name); \
...
@@ -16,6 +16,8 @@
 #include "oneflow/core/job_completer/add_lbi_diff_watcher.h"
 #include "oneflow/core/framework/config_def.h"
+#include "oneflow/core/job_completer/xrt_compilation.h"
 namespace oneflow {
 namespace {
@@ -356,6 +358,15 @@ void JobCompleter::Complete(Job* job) const {
   WithOpGraphAndMutJobBuilder(job, &AddGlobalOutputCriticalSections);
   WithOpGraphAndMutJobBuilder(job, &DumpLogicalBlobDescAndSbpSignature);
   WithOpGraphAndMutJobBuilder(job, &SetOpTimeShape7BatchAxisLbis);
+  if (XrtCompilationEnabled(GlobalJobDesc())) {
+#ifdef OF_WITH_XRT
+    WithOpGraphAndMutJob(job, &RebuildXrtCompiledJob);
+#else
+    LOG(WARNING) << "It will not use XLA or TensorRT since WITH_XLA or "
+                    "WITH_TENSORRT was not enabled when compiling the project.";
+#endif  // OF_WITH_XRT
+  }
   CheckOpGraph(OpGraph(*job));
 }
...
@@ -29,6 +29,7 @@ void GenerateFacadeImplOpConf(const OpNode& op_node, JobBuilder* job_builder) {
   *reduce_sum_conf->mutable_axis() = reduce_mean_conf.axis();
   reduce_sum_conf->set_keep_dims(reduce_mean_conf.keep_dims());
   reduce_sum_conf->set_out("out");
+  if (reduce_mean_conf.keep_dims()) { reduce_sum_conf->set_keep_dims(true); }
   job_builder->MutOpsOnlyOnce({reduce_sum_op_conf});
   const auto& in_blob = op_node.LogicalBlobDesc4Lbi(GenLogicalBlobId(reduce_mean_conf.in()));
...
+#include <string>
+#include "oneflow/core/common/util.h"
+#include "oneflow/core/graph/op_graph.h"
+#include "oneflow/core/job/job_desc.h"
+#if defined(WITH_XLA) || defined(WITH_TENSORRT)
+#include "oneflow/xrt/api.h"
+#define OF_WITH_XRT
+#endif  // WITH_XLA || WITH_TENSORRT
+namespace oneflow {
+inline void RebuildXrtCompiledJob(const OpGraph& op_graph, Job* job) {
+#ifdef OF_WITH_XRT
+  const auto& job_desc = GlobalJobDesc();
+  if (Global<ResourceDesc>::Get()->enable_debug_mode()) {
+    TeePersistentLogStream::Create("job_without_xrt_" + std::to_string(job_desc.job_id()))
+        ->Write(*job);
+  }
+  // Run compilation time passes currently include `MarkClusterId`, `BuildSubGraph`
+  // and `RebuildCompiledJob`.
+  xrt::RunCompilationTimeXrtPasses(op_graph, job, job_desc.IsTrain());
+  if (Global<ResourceDesc>::Get()->enable_debug_mode()) {
+    TeePersistentLogStream::Create("job_with_xrt_" + std::to_string(job_desc.job_id()))
+        ->Write(*job);
+  }
+#endif  // OF_WITH_XRT
+}
+inline bool XrtCompilationEnabled(const JobDesc& job_desc) {
+  if (!job_desc.has_xrt_config()) { return xrt::XrtCompilationEnabled(); }
+  const XrtConfig& config = job_desc.xrt_config();
+#ifdef OF_WITH_XRT
+  xrt::InitXrtConfigurations(config);
+  return xrt::XrtCompilationEnabled();
+#else
+  return (config.has_use_xla_jit() && config.use_xla_jit())
+         || (config.has_use_tensorrt() && config.use_tensorrt());
+#endif  // OF_WITH_XRT
+}
+}  // namespace oneflow
@@ -6,8 +6,9 @@ namespace oneflow {
 template<DeviceType device_type, typename T>
 void ConcatKernel<device_type, T>::ForwardDataContent(
     const KernelCtx& ctx, std::function<Blob*(const std::string&)> BnInOp2Blob) const {
-  const int32_t axis = this->op_conf().concat_conf().axis();
   Blob* out_blob = BnInOp2Blob("out");
+  int32_t axis = this->op_conf().concat_conf().axis();
+  if (axis < 0) { axis += out_blob->shape().NumAxes(); }
   const int64_t row_num = out_blob->shape().elem_cnt() / out_blob->shape().Count(axis);
   const int64_t out_col_num = out_blob->shape().Count(axis);
   int64_t out_col_offset = 0;
...
@@ -160,6 +160,10 @@ message NcclTupleBroadcastConf {
   required ParallelContext parallel_ctx = 1;
 }
+message XrtLaunchKernelConf {
+  required ParallelContext parallel_ctx = 1;
+}
 message KernelConf {
   required OpAttribute op_attribute = 1;
   required DataType data_type = 2;
@@ -182,6 +186,8 @@ message KernelConf {
     MaxPoolingKernelConf max_pooling_conf = 205;
     LocalResponseNormalizationKernelConf local_response_normalization_conf = 300;
     ReduceGatherKernelConf reduce_gather_conf = 350;
+    XrtLaunchKernelConf xrt_launch_conf = 353;
     AccuracyKernelConf accuracy_conf = 401;
     SliceKernelConf slice_conf = 402;
     ConstantKernelConf constant_conf = 403;
...
@@ -55,8 +55,8 @@ Maybe<void> ConcatOp::GetSbpSignatures(
int32_t ConcatOp::FixAxis(const int32_t axis, const int64_t num_axes) const {
  int32_t ret = axis;
  if (axis < 0) { ret += num_axes; }
  CHECK_GE(ret, 0);
  CHECK_LT(ret, num_axes);
  return ret;
}
......
@@ -7,6 +7,7 @@ import "oneflow/core/record/image.proto";
import "oneflow/core/record/record.proto";
import "oneflow/core/job/resource.proto";
import "oneflow/core/register/logical_blob_id.proto";
import "oneflow/core/job/sbp_parallel.proto";

enum ActivationType {
  kNone = 0;
@@ -1517,6 +1518,32 @@ message LearningRateScheduleOpConf {
  optional WarmupConf warmup_conf = 5;
}
message XrtLaunchOpConf {
  message Argument {
    required string name = 1;
    required string value = 2;
    required DeviceType device_type = 3;
  }
  message Function {
    repeated Argument argument = 1;
    repeated OperatorConf node = 2;
  }
  repeated string in = 1;
  repeated string out = 2;
  required Function function = 3;
  // Function execution engine. Only "XLA" and "TensorRT" are currently supported.
  required string engine = 4;
  // Input mutability.
  map<string, bool> input_mutability = 5;
  // Maps the launch op's input and output names to the names used inside the function.
  map<string, string> input_output_mapping = 6;
  map<string, OptInt64> batch_axis = 7;
  // SBP signature of each function node.
  map<string, SbpSignature> sbp_signatures = 8;
  optional bool model_update = 9 [default = false];
}
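For orientation, a minimal, hypothetical sketch of filling in this message from Python through the generated op_conf_pb2 bindings; the op name is invented, and the repeated field `in` collides with a Python keyword, which is why wrappers elsewhere in this diff go through setattr/getattr:

from oneflow.core.operator.op_conf_pb2 import OperatorConf

op_conf = OperatorConf()
op_conf.name = "xrt_launch_0"  # hypothetical op name
launch = op_conf.xrt_launch_conf
getattr(launch, "in").append("op_a/out_0")  # `in` is a Python keyword
launch.out.append("out_0")
launch.engine = "XLA"  # or "TensorRT"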
message NcclBoxingReduceScatterOpConf {
  required LogicalBlobId lbi = 1;
}
@@ -1690,6 +1717,8 @@ message OperatorConf {
    SigmoidCrossEntropyLossGradOpConf sigmoid_cross_entropy_loss_grad_conf = 317;
    ParallelCastOpConf parallel_cast_conf = 336;
    XrtLaunchOpConf xrt_launch_conf = 410;

    // math op
    BroadcastAddOpConf broadcast_add_conf = 500;
    BroadcastSubOpConf broadcast_sub_conf = 501;
......
@@ -7,6 +7,7 @@ float = data_type_pb2.kFloat
float32 = float
double = data_type_pb2.kDouble
float64 = double
float16 = data_type_pb2.kFloat16
int8 = data_type_pb2.kInt8
int32 = data_type_pb2.kInt32
int64 = data_type_pb2.kInt64
@@ -19,6 +20,7 @@ _OF_BLOB_DTYPE2NUMPY_DTYPE = {
    data_type_pb2.kUInt8: np.uint8,
    data_type_pb2.kFloat: np.float32,
    data_type_pb2.kDouble: np.double,
    data_type_pb2.kFloat16: np.float16,
    # could be np.ubyte on some platforms
    data_type_pb2.kChar: np.byte,
}
......
@@ -277,6 +277,24 @@ def set_default_placement(func_desc, value):
    assert isinstance(value, placement_ctx.PlacementScope)
    func_desc.function_attribute.default_placement_scope = value

@oneflow_function_config('use_xla_jit')
def set_use_xla_jit(func_desc, value=True):
    func_desc.job_config_proto.xrt_config.use_xla_jit = value

@oneflow_function_config('use_tensorrt')
def set_use_tensorrt(func_desc, value=True):
    func_desc.job_config_proto.xrt_config.use_tensorrt = value

@oneflow_function_config('tensorrt.use_fp16')
def set_tensorrt_use_fp16(func_desc, value=True):
    set_use_tensorrt(func_desc, True)
    func_desc.job_config_proto.xrt_config.tensorrt_config.use_fp16 = value

@oneflow_function_config('tensorrt.use_int8')
def set_tensorrt_use_int8(func_desc, value=True):
    set_use_tensorrt(func_desc, True)
    func_desc.job_config_proto.xrt_config.tensorrt_config.use_int8 = value
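These setters are reached through flow.function_config(); a usage sketch mirroring the tests added later in this diff, assuming an XRT-enabled build and that dotted config names such as 'tensorrt.use_fp16' nest as attributes:

import oneflow as flow

config = flow.function_config()
config.use_xla_jit(True)          # cluster and compile subgraphs with XLA
# config.use_tensorrt(True)       # or hand clustered subgraphs to TensorRT
# config.tensorrt.use_fp16(True)  # also flips use_tensorrt on, per the setter above

@flow.function(config)
def add_job(x=flow.FixedTensorDef((2, 5), dtype=flow.float32),
            y=flow.FixedTensorDef((2, 5), dtype=flow.float32)):
    return x + y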
@oneflow_function_config('default_distribute_strategy')
def set_default_distribute_strategy(func_desc, value):
    assert isinstance(value, distribute_ctx.DistributeStrategy)
......
@@ -43,6 +43,20 @@ def gelu(x, name=None):
    return remote_blob_util.RemoteBlob(lbi)

@oneflow_export('keras.activations.gelu_grad')
def gelu_grad(x, dy):
    op_conf = op_conf_util.OperatorConf()
    op_conf.name = id_util.UniqueStr('GeluGrad_')
    setattr(op_conf.gelu_grad_conf, 'x', x.logical_blob_name)
    setattr(op_conf.gelu_grad_conf, 'dy', dy.logical_blob_name)
    op_conf.gelu_grad_conf.dx = "dx"
    compile_context.CurJobAddOp(op_conf)
    lbi = logical_blob_id_util.LogicalBlobId()
    lbi.op_name = op_conf.name
    lbi.blob_name = "dx"
    return remote_blob_util.RemoteBlob(lbi)

@oneflow_export("keras.activations.tanh")
def tanh(x, name=None):
    op_conf = op_conf_util.OperatorConf()
@@ -57,6 +71,20 @@ def tanh(x, name=None):
    return remote_blob_util.RemoteBlob(lbi)

@oneflow_export('keras.activations.tanh_grad')
def tanh_grad(y, dy):
    op_conf = op_conf_util.OperatorConf()
    op_conf.name = id_util.UniqueStr('TanhGrad_')
    setattr(op_conf.tanh_grad_conf, 'y', y.logical_blob_name)
    setattr(op_conf.tanh_grad_conf, 'dy', dy.logical_blob_name)
    op_conf.tanh_grad_conf.dx = "dx"
    compile_context.CurJobAddOp(op_conf)
    lbi = logical_blob_id_util.LogicalBlobId()
    lbi.op_name = op_conf.name
    lbi.blob_name = "dx"
    return remote_blob_util.RemoteBlob(lbi)

@oneflow_export("keras.activations.sigmoid")
def sigmoid(x, name=None):
    op_conf = op_conf_util.OperatorConf()
......
@@ -100,6 +100,18 @@ def reshape(x, shape, name=None):
    lbi.blob_name = "out"
    return remote_blob_util.RemoteBlob(lbi)

@oneflow_export("reshape_like")
def reshape_like(x, like, name=None):
    op_conf = op_conf_util.OperatorConf()
    # Honor a caller-supplied name instead of silently ignoring it.
    op_conf.name = name if name is not None else id_util.UniqueStr("ReshapeLike_")
    setattr(op_conf.reshape_like_conf, "x", x.logical_blob_name)
    setattr(op_conf.reshape_like_conf, "like", like.logical_blob_name)
    op_conf.reshape_like_conf.y = "y"
    compile_context.CurJobAddOp(op_conf)
    lbi = logical_blob_id_util.LogicalBlobId()
    lbi.op_name = op_conf.name
    lbi.blob_name = "y"
    return remote_blob_util.RemoteBlob(lbi)

@oneflow_export("dynamic_reshape")
def dynamic_reshape(x, shape, name=None):
......
@@ -207,6 +207,66 @@ def layer_norm(
    setattr(out_lbi, "blob_name", "out")
    return remote_blob_util.RemoteBlob(out_lbi)

@oneflow_export("layers.layer_norm_grad")
def layer_norm_grad(
    dy,
    x,
    mean,
    inv_variance,
    begin_norm_axis=1,
    name=None,
):
    op_conf = op_conf_util.OperatorConf()
    name = name if name is not None else id_util.UniqueStr("LayerNormGrad_")
    setattr(op_conf, "name", name)
    setattr(op_conf.layer_norm_grad_conf, "dy", dy.logical_blob_name)
    setattr(op_conf.layer_norm_grad_conf, "x", x.logical_blob_name)
    setattr(op_conf.layer_norm_grad_conf, "mean", mean.logical_blob_name)
    setattr(op_conf.layer_norm_grad_conf, "inv_variance", inv_variance.logical_blob_name)
    setattr(op_conf.layer_norm_grad_conf, "dx", "dx")
    setattr(op_conf.layer_norm_grad_conf, "begin_norm_axis", begin_norm_axis)
    setattr(op_conf.layer_norm_grad_conf, "epsilon", 1e-5)
    compile_context.CurJobAddOp(op_conf)
    out_lbi = logical_blob_id_util.LogicalBlobId()
    setattr(out_lbi, "op_name", op_conf.name)
    setattr(out_lbi, "blob_name", "dx")
    return remote_blob_util.RemoteBlob(out_lbi)

@oneflow_export("layers.layer_norm_param_grad")
def layer_norm_param_grad(
    dy,
    norm,
    gamma,
    begin_params_axis=-1,
    name=None,
):
    op_conf = op_conf_util.OperatorConf()
    name = name if name is not None else id_util.UniqueStr("LayerNormParamGrad_")
    setattr(op_conf, "name", name)
    setattr(op_conf.layer_norm_param_grad_conf, "dy", dy.logical_blob_name)
    setattr(op_conf.layer_norm_param_grad_conf, "normalized", norm.logical_blob_name)
    setattr(op_conf.layer_norm_param_grad_conf, "gamma", gamma.logical_blob_name)
    setattr(op_conf.layer_norm_param_grad_conf, "begin_params_axis", begin_params_axis)
    setattr(op_conf.layer_norm_param_grad_conf, "normalized_diff", "normalized_diff")
    setattr(op_conf.layer_norm_param_grad_conf, "beta_diff", "beta_diff")
    setattr(op_conf.layer_norm_param_grad_conf, "gamma_diff", "gamma_diff")
    compile_context.CurJobAddOp(op_conf)
    normalized_diff_lbi = logical_blob_id_util.LogicalBlobId()
    beta_diff_lbi = logical_blob_id_util.LogicalBlobId()
    gamma_diff_lbi = logical_blob_id_util.LogicalBlobId()
    setattr(normalized_diff_lbi, "op_name", op_conf.name)
    setattr(beta_diff_lbi, "op_name", op_conf.name)
    setattr(gamma_diff_lbi, "op_name", op_conf.name)
    setattr(normalized_diff_lbi, "blob_name", "normalized_diff")
    setattr(beta_diff_lbi, "blob_name", "beta_diff")
    setattr(gamma_diff_lbi, "blob_name", "gamma_diff")
    return (remote_blob_util.RemoteBlob(normalized_diff_lbi),
            remote_blob_util.RemoteBlob(beta_diff_lbi),
            remote_blob_util.RemoteBlob(gamma_diff_lbi))

@oneflow_export("layers.batch_normalization")
def batch_normalization(
......
@@ -258,6 +258,40 @@ def softmax(logits, axis=None, name=None):
    lbi.blob_name = "out"
    return remote_blob_util.RemoteBlob(lbi)

@oneflow_export("nn.softmax_grad")
def softmax_grad(y, dy, axis=None, name=None):
    if axis is None:
        axis = -1
    assert type(axis) is int
    op_conf = op_conf_util.OperatorConf()
    name_prefix = name if name is not None else id_util.UniqueStr("SoftmaxGrad_")
    setattr(op_conf, "name", name_prefix)
    # The kernel differentiates along the last axis only, so any other axis is
    # swapped to the back, differentiated, then swapped back on the way out.
    need_transpose = False
    permute = [i for i in range(len(y.shape))]
    if axis > 0 and axis != len(y.shape) - 1:
        need_transpose = True
        permute[axis] = permute[-1]
        permute[-1] = axis
    if need_transpose:
        y = oneflow.transpose(y, perm=permute)
        dy = oneflow.transpose(dy, perm=permute)
    setattr(op_conf.softmax_grad_conf, "y", y.logical_blob_name)
    setattr(op_conf.softmax_grad_conf, "dy", dy.logical_blob_name)
    op_conf.softmax_grad_conf.axis = -1
    op_conf.softmax_grad_conf.dx = "dx"
    compile_context.CurJobAddOp(op_conf)
    lbi = logical_blob_id_util.LogicalBlobId()
    lbi.op_name = op_conf.name
    lbi.blob_name = "dx"
    dx = remote_blob_util.RemoteBlob(lbi)
    if need_transpose:
        dx = oneflow.transpose(dx, perm=permute)
    return dx
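The same permute list is applied on the way in and the way out because swapping `axis` with the last dimension is its own inverse; a standalone numpy check of that assumption:

import numpy as np

axis, ndim = 1, 3
permute = list(range(ndim))
permute[axis], permute[-1] = permute[-1], permute[axis]

x = np.random.random((2, 3, 4))
# Applying the swap permutation twice restores the original layout.
assert np.array_equal(np.transpose(np.transpose(x, permute), permute), x)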
@oneflow_export("nn.sparse_softmax_cross_entropy_with_logits") @oneflow_export("nn.sparse_softmax_cross_entropy_with_logits")
def sparse_softmax_cross_entropy_with_logits( def sparse_softmax_cross_entropy_with_logits(
......
@@ -5,6 +5,7 @@ from oneflow.core.operator.op_conf_pb2 import OperatorConf

def IsOpConfOnlyCpuSupported(op_conf):
    assert isinstance(op_conf, OperatorConf)
    # The old pure-Python lookup is kept below, disabled, for reference:
    """
    global _cpu_only_op_type_cases
    if _cpu_only_op_type_cases == None:
        _cpu_only_op_type_cases = set()
@@ -13,4 +14,9 @@ def IsOpConfOnlyCpuSupported(op_conf):
            _cpu_only_op_type_cases.add(field.number)
    op_type_field = op_conf.WhichOneof("op_type")
    return OperatorConf.DESCRIPTOR.fields_by_name[op_type_field].number in _cpu_only_op_type_cases
    """
    op_type_field = op_conf.WhichOneof("op_type")
    field_number = OperatorConf.DESCRIPTOR.fields_by_name[op_type_field].number
    return c_api_util.IsOpTypeCaseCpuSupportOnly(field_number)

# _cpu_only_op_type_cases = None
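The oneof reflection used above is plain protobuf; a small sketch, where 410 is the xrt_launch_conf field number added to OperatorConf earlier in this diff:

from oneflow.core.operator.op_conf_pb2 import OperatorConf

op_conf = OperatorConf()
op_conf.xrt_launch_conf.engine = "XLA"  # setting a subfield selects the oneof
which = op_conf.WhichOneof("op_type")   # -> "xrt_launch_conf"
number = OperatorConf.DESCRIPTOR.fields_by_name[which].number  # -> 410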
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(x_shape, y_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                y=flow.FixedTensorDef(y_shape, dtype=dtype)):
        return x + y + x
    return add_job

def make_xla_job(x_shape, y_shape, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                    y=flow.FixedTensorDef(y_shape, dtype=dtype)):
        return x + y + x
    return xla_add_job

def make_trt_job(x_shape, y_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(True)
    @flow.function(config)
    def trt_add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                    y=flow.FixedTensorDef(y_shape, dtype=dtype)):
        return x + y + x
    return trt_add_job

class TestAdd(unittest.TestCase):
    def _test_body(self, x, y, dtype=np.float32):
        f1 = make_job(x.shape, y.shape, dtype=flow.float32)
        f2 = make_xla_job(x.shape, y.shape, dtype=flow.float32)
        f3 = make_trt_job(x.shape, y.shape, dtype=flow.float32)
        a = f1(x, y).get()
        b = f2(x, y).get()
        c = f3(x, y).get()
        print("without xla: ", a)
        print("with xla: ", b)
        print("with tensorrt: ", c)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, x_shape, y_shape, dtype=np.float32):
        x = np.ones(x_shape, dtype=dtype)
        y = np.ones(y_shape, dtype=dtype)
        self._test_body(x, y, dtype=dtype)

    def _test_random_body(self, x_shape, y_shape, dtype=np.float32):
        x = np.random.random(x_shape).astype(dtype)
        y = np.random.random(y_shape).astype(dtype)
        self._test_body(x, y, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 10), (1, 10))
        self._test_ones_body((2, 10, 2), (2, 10, 2))
        self._test_ones_body((2, 5, 2, 2), (2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1, 10), (1, 10))
        self._test_random_body((2, 10, 2), (2, 10, 2))
        self._test_random_body((2, 5, 2, 2), (2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
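The remaining test files follow this same template: build a baseline job plus an XLA and/or TensorRT job from the shared flow.function_config(), feed identical inputs to each, and require agreement within rtol=1e-03, atol=1e-05 before clearing the default session.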
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(input_shape, axis, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def batch_norm_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.layers.batch_normalization(x, axis=axis)
    return batch_norm_job

def make_xla_job(input_shape, axis, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_batch_norm_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.layers.batch_normalization(x, axis=axis)
    return xla_batch_norm_job

def make_trt_job(input_shape, axis, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(True)
    @flow.function(config)
    def trt_batch_norm_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.layers.batch_normalization(x, axis=axis)
    return trt_batch_norm_job

class TestBatchNorm(unittest.TestCase):
    def _test_body(self, x, axis, dtype=np.float32):
        f1 = make_job(x.shape, axis, dtype=flow.float32)
        f2 = make_xla_job(x.shape, axis, dtype=flow.float32)
        a = f1(x).get()
        b = f2(x).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()
        f3 = make_trt_job(x.shape, axis, dtype=flow.float32)
        c = f3(x).get()
        print("with tensorrt: ", c)
        self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, axis, dtype=np.float32):
        x = np.ones(shape, dtype=dtype)
        self._test_body(x, axis, dtype=dtype)

    def _test_random_body(self, shape, axis, dtype=np.float32):
        x = np.random.random(shape).astype(dtype)
        self._test_body(x, axis, dtype=dtype)

    # TensorRT batch norm only supports 4-d tensors (NCHW).
    def test_ones_input(self):
        self._test_ones_body((2, 1, 2, 2), 1)
        self._test_ones_body((2, 5, 2, 2), 1)

    def test_random_input(self):
        self._test_random_body((2, 1, 2, 2), 1)
        self._test_random_body((2, 5, 2, 2), 1)

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(x_shape, b_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def bias_add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                     bias=flow.FixedTensorDef(b_shape, dtype=dtype)):
        return flow.nn.bias_add(x, bias)
    return bias_add_job

def make_xla_job(x_shape, b_shape, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_bias_add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                         bias=flow.FixedTensorDef(b_shape, dtype=dtype)):
        return flow.nn.bias_add(x, bias)
    return xla_bias_add_job

def make_trt_job(x_shape, b_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(True)
    @flow.function(config)
    def trt_bias_add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                         bias=flow.FixedTensorDef(b_shape, dtype=dtype)):
        return flow.nn.bias_add(x, bias)
    return trt_bias_add_job

class TestBiasAdd(unittest.TestCase):
    def _test_body(self, x, bias, dtype=np.float32):
        f1 = make_job(x.shape, bias.shape, dtype=flow.float32)
        f2 = make_xla_job(x.shape, bias.shape, dtype=flow.float32)
        a = f1(x, bias).get()
        b = f2(x, bias).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()
        f3 = make_trt_job(x.shape, bias.shape, dtype=flow.float32)
        c = f3(x, bias).get()
        print("with tensorrt: ", c)
        self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, x_shape, bias_shape, dtype=np.float32):
        x = np.ones(x_shape, dtype=dtype)
        b = np.ones(bias_shape, dtype=dtype)
        self._test_body(x, b, dtype=dtype)

    def _test_random_body(self, x_shape, bias_shape, dtype=np.float32):
        x = np.random.random(x_shape).astype(dtype)
        b = np.random.random(bias_shape).astype(dtype)
        self._test_body(x, b, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 10), (10,))
        self._test_ones_body((2, 10, 2), (10,))
        self._test_ones_body((2, 5, 2, 2), (5,))

    def test_random_input(self):
        self._test_random_body((1, 10), (10,))
        self._test_random_body((2, 10, 2), (10,))
        self._test_random_body((2, 5, 2, 2), (5,))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

class TestBroadcastOp(unittest.TestCase):
    run_test = False

    def _test_body(self, x, y, dtype=np.float32):
        if not self.run_test:
            return
        f1 = self.make_job(x.shape, y.shape, dtype=flow.float32)
        f2 = self.make_xla_job(x.shape, y.shape, dtype=flow.float32)
        a = f1(x, y).get()
        b = f2(x, y).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, x_shape, y_shape, dtype=np.float32):
        x = np.ones(x_shape, dtype=dtype)
        y = np.ones(y_shape, dtype=dtype)
        self._test_body(x, y, dtype=dtype)

    def _test_random_body(self, x_shape, y_shape, dtype=np.float32):
        x = np.random.random(x_shape).astype(dtype)
        y = np.random.random(y_shape).astype(dtype)
        self._test_body(x, y, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 10), (1, 1))
        self._test_ones_body((2, 10, 2), (2, 1, 2))
        self._test_ones_body((2, 5, 2, 2), (1, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1, 10), (1, 1))
        self._test_random_body((2, 10, 2), (2, 1, 2))
        self._test_random_body((2, 5, 2, 2), (1, 5, 2, 2))

class TestBroadcastAddOp(TestBroadcastOp):
    run_test = True

    def make_job(self, x_shape, y_shape, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(False)
        @flow.function(config)
        def broadcast_add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                              y=flow.FixedTensorDef(y_shape, dtype=dtype)):
            return flow.math.add(x, y)
        return broadcast_add_job

    def make_xla_job(self, x_shape, y_shape, dtype=flow.float32):
        config.use_xla_jit(True)
        config.use_tensorrt(False)
        @flow.function(config)
        def xla_broadcast_add_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                                  y=flow.FixedTensorDef(y_shape, dtype=dtype)):
            return flow.math.add(x, y)
        return xla_broadcast_add_job

class TestBroadcastMulOp(TestBroadcastOp):
    run_test = True

    def make_job(self, x_shape, y_shape, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(False)
        @flow.function(config)
        def broadcast_mul_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                              y=flow.FixedTensorDef(y_shape, dtype=dtype)):
            return flow.math.multiply(x, y)
        return broadcast_mul_job

    def make_xla_job(self, x_shape, y_shape, dtype=flow.float32):
        config.use_xla_jit(True)
        config.use_tensorrt(False)
        @flow.function(config)
        def xla_broadcast_mul_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                                  y=flow.FixedTensorDef(y_shape, dtype=dtype)):
            return flow.math.multiply(x, y)
        return xla_broadcast_mul_job

class TestBroadcastDivOp(TestBroadcastOp):
    run_test = True

    def make_job(self, x_shape, y_shape, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(False)
        @flow.function(config)
        def broadcast_div_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                              y=flow.FixedTensorDef(y_shape, dtype=dtype)):
            return flow.math.divide(x, y)
        return broadcast_div_job

    def make_xla_job(self, x_shape, y_shape, dtype=flow.float32):
        config.use_xla_jit(True)
        config.use_tensorrt(False)
        @flow.function(config)
        def xla_broadcast_div_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                                  y=flow.FixedTensorDef(y_shape, dtype=dtype)):
            return flow.math.divide(x, y)
        return xla_broadcast_div_job

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(input_shape, dtype=flow.float32, target_dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def cast_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.cast(x, dtype=target_dtype)
    return cast_job

def make_xla_job(input_shape, dtype=flow.float32, target_dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_cast_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.cast(x, dtype=target_dtype)
    return xla_cast_job

class TestCast(unittest.TestCase):
    def _test_body(self, x, dtype=flow.float32, target_dtype=flow.float32):
        f1 = make_job(x.shape, dtype=dtype, target_dtype=target_dtype)
        f2 = make_xla_job(x.shape, dtype=dtype, target_dtype=target_dtype)
        a = f1(x).get()
        b = f2(x).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        # b = trt_cast_job(x).get()
        # print("with tensorrt", b)
        # self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, dtype=flow.float32, target_dtype=flow.float32):
        np_dtype = flow.convert_of_dtype_to_numpy_dtype(dtype)
        x = np.ones(shape, dtype=np_dtype)
        self._test_body(x, dtype=dtype, target_dtype=target_dtype)

    def _test_random_body(self, shape, dtype=flow.float32, target_dtype=flow.float32):
        np_dtype = flow.convert_of_dtype_to_numpy_dtype(dtype)
        x = (1000 * np.random.random(shape)).astype(np_dtype)
        self._test_body(x, dtype=dtype, target_dtype=target_dtype)

    def test_ones_input(self):
        self._test_ones_body((1,), flow.float32, flow.int32)
        self._test_ones_body((1, 10), flow.int32, flow.float32)

    def test_random_input(self):
        self._test_random_body((1,), flow.float32, flow.int32)
        self._test_random_body((1, 10), flow.int32, flow.float32)

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(a_shape, b_shape, axis, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def concat_job(x=flow.FixedTensorDef(a_shape, dtype=dtype),
                   y=flow.FixedTensorDef(b_shape, dtype=dtype)):
        return flow.concat([x, y], axis=axis)
    return concat_job

def make_trt_job(a_shape, b_shape, axis, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(True)
    @flow.function(config)
    def trt_concat_job(x=flow.FixedTensorDef(a_shape, dtype=dtype),
                       y=flow.FixedTensorDef(b_shape, dtype=dtype)):
        return flow.concat([x, y], axis=axis)
    return trt_concat_job

class TestConcat(unittest.TestCase):
    def _test_body(self, x, y, axis, dtype=np.float32):
        f1 = make_job(x.shape, y.shape, axis, dtype=flow.float32)
        f2 = make_trt_job(x.shape, y.shape, axis, dtype=flow.float32)
        a = f1(x, y).get()
        b = f2(x, y).get()
        print("without xla: ", a)
        print("with tensorrt: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, a_shape, b_shape, axis, dtype=np.float32):
        x = np.ones(a_shape, dtype=dtype)
        y = np.ones(b_shape, dtype=dtype)
        self._test_body(x, y, axis, dtype=dtype)

    def _test_random_body(self, a_shape, b_shape, axis, dtype=np.float32):
        x = np.random.random(a_shape).astype(dtype)
        y = np.random.random(b_shape).astype(dtype)
        self._test_body(x, y, axis, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((5, 2), (5, 3), axis=1)
        self._test_ones_body((5, 2), (5, 3), axis=-1)
        self._test_ones_body((5, 1, 2), (5, 1, 2), axis=1)
        self._test_ones_body((5, 1, 2), (5, 1, 2), axis=2)

    def test_random_input(self):
        self._test_random_body((5, 2), (5, 3), axis=1)
        self._test_random_body((5, 2), (5, 3), axis=-1)
        self._test_random_body((5, 1, 2), (5, 1, 2), axis=1)
        self._test_random_body((5, 3, 2), (5, 3, 2), axis=2)

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(x_shape, w_shape, kernel_size=None, strides=None,
             padding="valid", data_format="NCHW", dilation_rate=None, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def conv2d_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                   weight=flow.FixedTensorDef(w_shape, dtype=dtype)):
        return flow.nn.conv2d(x, weight, strides, padding, data_format, dilation_rate)
    return conv2d_job

def make_trt_job(x_shape, w_shape, kernel_size=None, strides=None,
                 padding="valid", data_format="NCHW", dilation_rate=None, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(True)
    @flow.function(config)
    def trt_conv2d_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                       weight=flow.FixedTensorDef(w_shape, dtype=dtype)):
        return flow.nn.conv2d(x, weight, strides, padding, data_format, dilation_rate)
    return trt_conv2d_job

class TestConv2d(unittest.TestCase):
    def make_filter_shape(self, shape, filters, kernel_size, data_format):
        if data_format == "NCHW":
            return [filters, shape[1], kernel_size, kernel_size]
        else:
            return [filters, kernel_size, kernel_size, shape[3]]

    def _test_body(self, x, filters, kernel_size, strides, padding, data_format,
                   dilation_rate, dtype=np.float32):
        f1 = make_job(x.shape, filters.shape, kernel_size, strides, padding,
                      data_format, dilation_rate, dtype=flow.float32)
        f2 = make_trt_job(x.shape, filters.shape, kernel_size, strides, padding,
                          data_format, dilation_rate, dtype=flow.float32)
        a = f1(x, filters).get()
        b = f2(x, filters).get()
        print("without xla: ", a)
        print("with tensorrt: ", b)
        self.assertTrue(a.shape == b.shape)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, filters, kernel_size, strides,
                        padding, data_format, dilation_rate, dtype=np.float32):
        assert len(shape) == 4
        x = np.ones(shape, dtype=dtype)
        w_shape = self.make_filter_shape(shape, filters, kernel_size, data_format)
        weight = np.random.random(w_shape).astype(dtype)
        self._test_body(x, weight, kernel_size=kernel_size,
                        strides=strides, padding=padding, data_format=data_format,
                        dilation_rate=dilation_rate)

    def _test_random_body(self, shape, filters, kernel_size, strides,
                          padding, data_format, dilation_rate, dtype=np.float32):
        assert len(shape) == 4
        x = np.random.random(shape).astype(dtype)
        w_shape = self.make_filter_shape(shape, filters, kernel_size, data_format)
        weight = np.random.random(w_shape).astype(dtype)
        self._test_body(x, weight, kernel_size=kernel_size,
                        strides=strides, padding=padding, data_format=data_format,
                        dilation_rate=dilation_rate)

    def test_ones_kernel_1x1(self):
        self._test_ones_body(shape=[1, 1, 1, 1], filters=1, kernel_size=1, strides=1,
                             padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_ones_body(shape=[1, 3, 1, 1], filters=1, kernel_size=1, strides=1,
                             padding="SAME", data_format="NCHW", dilation_rate=1)
        self._test_ones_body(shape=[1, 1, 5, 5], filters=1, kernel_size=1, strides=1,
                             padding="VALID", data_format="NCHW", dilation_rate=1)
        # self._test_ones_body(shape=[3, 1, 1, 5], filters=1, kernel_size=1, strides=1,
        #                      padding="SAME", data_format="NHWC", dilation_rate=1)
        self._test_ones_body(shape=[3, 3, 5, 5], filters=1, kernel_size=1, strides=1,
                             padding="VALID", data_format="NCHW", dilation_rate=1)

    def test_random_kernel_1x1(self):
        self._test_random_body(shape=[1, 1, 1, 1], filters=1, kernel_size=1, strides=1,
                               padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 3, 1, 1], filters=1, kernel_size=1, strides=1,
                               padding="SAME", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 1, 5, 5], filters=1, kernel_size=1, strides=1,
                               padding="VALID", data_format="NCHW", dilation_rate=1)
        # self._test_random_body(shape=[3, 1, 1, 5], filters=1, kernel_size=1, strides=1,
        #                        padding="SAME", data_format="NHWC", dilation_rate=1)
        self._test_random_body(shape=[3, 3, 5, 5], filters=1, kernel_size=1, strides=1,
                               padding="VALID", data_format="NCHW", dilation_rate=1)

    def test_ones_kernel_3x3(self):
        self._test_ones_body(shape=[1, 1, 3, 3], filters=1, kernel_size=3, strides=1,
                             padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_ones_body(shape=[1, 3, 5, 5], filters=1, kernel_size=3, strides=1,
                             padding="SAME", data_format="NCHW", dilation_rate=1)
        self._test_ones_body(shape=[1, 5, 3, 3], filters=1, kernel_size=3, strides=1,
                             padding="VALID", data_format="NCHW", dilation_rate=1)
        # self._test_ones_body(shape=[1, 3, 3, 7], filters=1, kernel_size=3, strides=1,
        #                      padding="SAME", data_format="NHWC", dilation_rate=1)

    def test_random_kernel_3x3(self):
        self._test_random_body(shape=[1, 1, 3, 3], filters=1, kernel_size=3, strides=1,
                               padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 3, 3, 3], filters=1, kernel_size=3, strides=1,
                               padding="SAME", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 3, 3, 3], filters=1, kernel_size=3, strides=1,
                               padding="SAME", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 3, 3, 3], filters=1, kernel_size=3, strides=1,
                               padding="SAME", data_format="NCHW", dilation_rate=1)

    def test_ones_kernel_11x11(self):
        self._test_ones_body(shape=[1, 3, 24, 24], filters=3, kernel_size=11,
                             strides=4, padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_ones_body(shape=[1, 3, 24, 24], filters=3, kernel_size=11,
                             strides=4, padding="SAME", data_format="NCHW", dilation_rate=1)
        self._test_ones_body(shape=[1, 3, 27, 27], filters=3, kernel_size=11,
                             strides=4, padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_ones_body(shape=[1, 3, 27, 27], filters=3, kernel_size=11,
                             strides=4, padding="SAME", data_format="NCHW", dilation_rate=1)

    def test_random_kernel_11x11(self):
        self._test_random_body(shape=[1, 3, 24, 24], filters=3, kernel_size=11,
                               strides=4, padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 3, 24, 24], filters=3, kernel_size=11,
                               strides=4, padding="SAME", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 3, 27, 27], filters=3, kernel_size=11,
                               strides=4, padding="VALID", data_format="NCHW", dilation_rate=1)
        self._test_random_body(shape=[1, 3, 27, 27], filters=3, kernel_size=11,
                               strides=4, padding="SAME", data_format="NCHW", dilation_rate=1)

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

class TestGather(unittest.TestCase):
    def _test_body(self, x, indices, axis, dtype=flow.float32):
        indices = np.array(indices).astype(np.int32)
        f1 = self.make_job(x.shape, indices.shape, axis, dtype=dtype)
        f2 = self.make_xla_job(x.shape, indices.shape, axis, dtype=dtype)
        a = f1(x, indices).get()
        b = f2(x, indices).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def make_job(self, input_shape, indices_shape, axis, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(False)
        @flow.function(config)
        def gather_job(x=flow.FixedTensorDef(input_shape, dtype=dtype),
                       indices=flow.FixedTensorDef(indices_shape, dtype=flow.int32)):
            return flow.gather(x, indices, axis=axis)
        return gather_job

    def make_xla_job(self, input_shape, indices_shape, axis, dtype=flow.float32):
        config.use_xla_jit(True)
        config.use_tensorrt(False)
        @flow.function(config)
        def xla_gather_job(x=flow.FixedTensorDef(input_shape, dtype=dtype),
                           indices=flow.FixedTensorDef(indices_shape, dtype=flow.int32)):
            return flow.gather(x, indices, axis=axis)
        return xla_gather_job

    def _test_ones_body(self, shape, indices, axis, dtype=flow.float32):
        np_dtype = flow.convert_of_dtype_to_numpy_dtype(dtype)
        x = np.ones(shape, dtype=np_dtype)
        self._test_body(x, indices, axis, dtype=dtype)

    def _test_random_body(self, shape, indices, axis, dtype=flow.float32):
        np_dtype = flow.convert_of_dtype_to_numpy_dtype(dtype)
        x = np.random.random(shape).astype(np_dtype)
        self._test_body(x, indices, axis, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 1), [0], 0)
        self._test_ones_body((2, 2), [0, 0], 0)
        self._test_ones_body((1, 10), [[0], [0]], 0)
        self._test_ones_body((1, 10), [[0, 1, 2], [2, 3, 4]], 1)
        self._test_ones_body((2, 10, 2), [[0, 1], [2, 3], [4, 5]], 1)
        self._test_ones_body((2, 5, 2, 2), [[0, 0], [1, 1]], 3)

    def test_random_input(self):
        self._test_random_body((1, 1), [0], 0)
        self._test_random_body((2, 2), [0, 0], 0)
        self._test_random_body((1, 10), [[0], [0]], 0)
        self._test_random_body((1, 10), [[0, 1, 2], [2, 3, 4]], 1)
        self._test_random_body((2, 10, 2), [[0, 1], [2, 3], [4, 5]], 1)
        self._test_random_body((2, 5, 2, 2), [[0, 0], [1, 1]], 3)

class TestBatchGather(TestGather):
    def make_job(self, input_shape, indices_shape, axis, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(False)
        @flow.function(config)
        def batch_gather_job(x=flow.FixedTensorDef(input_shape, dtype=dtype),
                             indices=flow.FixedTensorDef(indices_shape, dtype=flow.int32)):
            return flow.gather(x, indices, batch_dims=axis)
        return batch_gather_job

    def make_xla_job(self, input_shape, indices_shape, axis, dtype=flow.float32):
        config.use_xla_jit(True)
        config.use_tensorrt(False)
        @flow.function(config)
        def xla_batch_gather_job(x=flow.FixedTensorDef(input_shape, dtype=dtype),
                                 indices=flow.FixedTensorDef(indices_shape, dtype=flow.int32)):
            return flow.gather(x, indices, batch_dims=axis)
        return xla_batch_gather_job

    def test_ones_input(self):
        # batch_dims should be Dims(indices) - 1 and batch_dims > 0
        self._test_ones_body((2, 3, 2), [[0], [1]], 1)
        self._test_ones_body((2, 3, 2), [[0, 1], [1, 0]], 1)
        self._test_ones_body((2, 3, 2, 2), [[[0], [0], [0]], [[1], [1], [1]]], 2)

    def test_random_input(self):
        # batch_dims should be Dims(indices) - 1 and batch_dims > 0
        self._test_random_body((2, 3, 2), [[0], [1]], 1)
        self._test_random_body((2, 3, 2), [[0, 1], [1, 2]], 1)
        self._test_random_body((2, 3, 2, 2), [[[0], [0], [0]], [[1], [1], [1]]], 2)

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(input_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def gelu_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.keras.activations.gelu(x)
    return gelu_job

def make_xla_job(input_shape, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_gelu_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.keras.activations.gelu(x)
    return xla_gelu_job

class TestGelu(unittest.TestCase):
    def _test_body(self, x, dtype=np.float32):
        f1 = make_job(x.shape, dtype=flow.float32)
        f2 = make_xla_job(x.shape, dtype=flow.float32)
        a = f1(x).get()
        b = f2(x).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, dtype=np.float32):
        x = np.ones(shape, dtype=dtype)
        self._test_body(x, dtype=dtype)

    def _test_random_body(self, shape, dtype=np.float32):
        x = np.random.random(shape).astype(dtype)
        self._test_body(x, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1,))
        self._test_ones_body((1, 10))
        self._test_ones_body((2, 10, 2))
        self._test_ones_body((2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1,))
        self._test_random_body((1, 10))
        self._test_random_body((2, 10, 2))
        self._test_random_body((2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def gelu_grad_job(x=flow.FixedTensorDef(shape, dtype=dtype),
                      dy=flow.FixedTensorDef(shape, dtype=dtype)):
        return flow.keras.activations.gelu_grad(x, dy)
    return gelu_grad_job

def make_xla_job(shape, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_gelu_grad_job(x=flow.FixedTensorDef(shape, dtype=dtype),
                          dy=flow.FixedTensorDef(shape, dtype=dtype)):
        return flow.keras.activations.gelu_grad(x, dy)
    return xla_gelu_grad_job

class TestGeluGrad(unittest.TestCase):
    def _test_body(self, x, dy, dtype=np.float32):
        f1 = make_job(x.shape, dtype=flow.float32)
        f2 = make_xla_job(x.shape, dtype=flow.float32)
        a = f1(x, dy).get()
        b = f2(x, dy).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, dtype=np.float32):
        x = np.ones(shape, dtype=dtype)
        dy = np.ones(shape, dtype=dtype)
        self._test_body(x, dy, dtype=dtype)

    def _test_random_body(self, shape, dtype=np.float32):
        x = np.random.random(shape).astype(dtype)
        dy = np.random.random(shape).astype(dtype)
        self._test_body(x, dy, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1,))
        self._test_ones_body((1, 10))
        self._test_ones_body((2, 10, 2))
        self._test_ones_body((2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1,))
        self._test_random_body((1, 10))
        self._test_random_body((2, 10, 2))
        self._test_random_body((2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(input_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def identity_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.identity(x)
    return identity_job

def make_xla_job(input_shape, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_identity_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.identity(x)
    return xla_identity_job

def make_trt_job(input_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(True)
    @flow.function(config)
    def trt_identity_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.identity(x)
    return trt_identity_job

class TestIdentity(unittest.TestCase):
    def _test_body(self, x, dtype=np.float32):
        f1 = make_job(x.shape, dtype=flow.float32)
        f2 = make_xla_job(x.shape, dtype=flow.float32)
        f3 = make_trt_job(x.shape, dtype=flow.float32)
        a = f1(x).get()
        b = f2(x).get()
        c = f3(x).get()
        print("without xla: ", a)
        print("with xla: ", b)
        print("with tensorrt: ", c)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, dtype=np.float32):
        x = np.ones(shape, dtype=dtype)
        self._test_body(x, dtype=dtype)

    def _test_random_body(self, shape, dtype=np.float32):
        x = np.random.random(shape).astype(dtype)
        self._test_body(x, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1,))
        self._test_ones_body((1, 10))
        self._test_ones_body((2, 10, 2))
        self._test_ones_body((2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1,))
        self._test_random_body((1, 10))
        self._test_random_body((2, 10, 2))
        self._test_random_body((2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(input_shape, norm_axis, params_axis, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def layer_norm_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.layers.layer_norm(x, begin_norm_axis=norm_axis,
                                      begin_params_axis=params_axis)
    return layer_norm_job

def make_xla_job(input_shape, norm_axis, params_axis, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_layer_norm_job(x=flow.FixedTensorDef(input_shape, dtype=dtype)):
        return flow.layers.layer_norm(x, begin_norm_axis=norm_axis,
                                      begin_params_axis=params_axis)
    return xla_layer_norm_job

class TestLayerNorm(unittest.TestCase):
    def _test_body(self, x, norm_axis, params_axis, dtype=np.float32):
        f1 = make_job(x.shape, norm_axis, params_axis, dtype=flow.float32)
        f2 = make_xla_job(x.shape, norm_axis, params_axis, dtype=flow.float32)
        a = f1(x).get()
        b = f2(x).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, norm_axis=-1, params_axis=-1, dtype=np.float32):
        x = np.ones(shape, dtype=dtype)
        self._test_body(x, norm_axis, params_axis, dtype=dtype)

    def _test_random_body(self, shape, norm_axis=-1, params_axis=-1, dtype=np.float32):
        x = (10 * np.random.random(shape)).astype(dtype)
        self._test_body(x, norm_axis, params_axis, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 10))
        self._test_ones_body((2, 10, 2))
        self._test_ones_body((2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1, 10))
        self._test_random_body((2, 10, 2))
        self._test_random_body((2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(shape, mean_shape, norm_axis, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def layer_norm_grad_job(dy=flow.FixedTensorDef(shape, dtype=dtype),
                            x=flow.FixedTensorDef(shape, dtype=dtype),
                            mean=flow.FixedTensorDef(mean_shape, dtype=dtype),
                            inv_variance=flow.FixedTensorDef(mean_shape, dtype=dtype)):
        return flow.layers.layer_norm_grad(dy, x, mean, inv_variance,
                                           begin_norm_axis=norm_axis)
    return layer_norm_grad_job

def make_xla_job(shape, mean_shape, norm_axis, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_layer_norm_grad_job(dy=flow.FixedTensorDef(shape, dtype=dtype),
                                x=flow.FixedTensorDef(shape, dtype=dtype),
                                mean=flow.FixedTensorDef(mean_shape, dtype=dtype),
                                inv_variance=flow.FixedTensorDef(mean_shape, dtype=dtype)):
        return flow.layers.layer_norm_grad(dy, x, mean, inv_variance,
                                           begin_norm_axis=norm_axis)
    return xla_layer_norm_grad_job

class TestLayerNormGrad(unittest.TestCase):
    def _test_body(self, dy, x, mean, inv_variance, norm_axis, dtype=np.float32):
        f1 = make_job(x.shape, mean.shape, norm_axis, dtype=flow.float32)
        f2 = make_xla_job(x.shape, mean.shape, norm_axis, dtype=flow.float32)
        a = f1(dy, x, mean, inv_variance).get()
        b = f2(dy, x, mean, inv_variance).get()
        print("without xla: ", a)
        print("with xla: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, norm_axis=-1, dtype=np.float32):
        dy = np.ones(shape, dtype=dtype)
        x = np.ones(shape, dtype=dtype)
        if norm_axis < 0:
            norm_axis += len(shape)
        mean_shape = shape[:norm_axis]
        mean = np.ones(mean_shape, dtype=dtype)
        inv_variance = np.ones(mean_shape, dtype=dtype)
        self._test_body(dy, x, mean, inv_variance, norm_axis, dtype=dtype)

    def _test_random_body(self, shape, norm_axis=-1, dtype=np.float32):
        dy = np.random.random(shape).astype(dtype)
        x = np.random.random(shape).astype(dtype)
        if norm_axis < 0:
            norm_axis += len(shape)
        mean_shape = shape[:norm_axis]
        mean = np.random.random(mean_shape).astype(dtype)
        inv_variance = np.random.random(mean_shape).astype(dtype)
        self._test_body(dy, x, mean, inv_variance, norm_axis, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 10))
        self._test_ones_body((2, 10, 2))
        self._test_ones_body((2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1, 10))
        self._test_random_body((2, 10, 2))
        self._test_random_body((2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(shape, gamma_shape, params_axis, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def layer_norm_param_grad_job(dy=flow.FixedTensorDef(shape, dtype=dtype),
                                  norm=flow.FixedTensorDef(shape, dtype=dtype),
                                  gamma=flow.FixedTensorDef(gamma_shape, dtype=dtype)):
        return flow.layers.layer_norm_param_grad(
            dy, norm, gamma, begin_params_axis=params_axis)
    return layer_norm_param_grad_job

def make_xla_job(shape, gamma_shape, params_axis, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_layer_norm_param_grad_job(dy=flow.FixedTensorDef(shape, dtype=dtype),
                                      norm=flow.FixedTensorDef(shape, dtype=dtype),
                                      gamma=flow.FixedTensorDef(gamma_shape, dtype=dtype)):
        return flow.layers.layer_norm_param_grad(
            dy, norm, gamma, begin_params_axis=params_axis)
    return xla_layer_norm_param_grad_job

class TestLayerNormParamGrad(unittest.TestCase):
    def _test_body(self, dy, norm, gamma, params_axis, dtype=np.float32):
        f1 = make_job(dy.shape, gamma.shape, params_axis, dtype=flow.float32)
        f2 = make_xla_job(dy.shape, gamma.shape, params_axis, dtype=flow.float32)
        (d_norm1, d_beta1, d_gamma1) = f1(dy, norm, gamma).get()
        (d_norm2, d_beta2, d_gamma2) = f2(dy, norm, gamma).get()
        print("normalize diff:")
        print(" without xla: ", d_norm1)
        print(" with xla: ", d_norm2)
        print("beta diff:")
        print(" without xla: ", d_beta1)
        print(" with xla: ", d_beta2)
        print("gamma diff:")
        print(" without xla: ", d_gamma1)
        print(" with xla: ", d_gamma2)
        self.assertEqual(d_norm1.shape, d_norm2.shape)
        self.assertEqual(d_beta1.shape, d_beta2.shape)
        self.assertEqual(d_gamma1.shape, d_gamma2.shape)
        self.assertTrue(np.allclose(d_norm1.ndarray(), d_norm2.ndarray(), rtol=1e-03, atol=1e-05))
        self.assertTrue(np.allclose(d_beta1.ndarray(), d_beta2.ndarray(), rtol=1e-03, atol=1e-05))
        self.assertTrue(np.allclose(d_gamma1.ndarray(), d_gamma2.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, params_axis=-1, dtype=np.float32):
        dy = np.ones(shape, dtype=dtype)
        norm = np.ones(shape, dtype=dtype)
        if params_axis < 0:
            params_axis += len(shape)
        gamma_shape = shape[params_axis:]
        if len(gamma_shape) == 0:
            gamma_shape = [1]
        gamma = np.ones(gamma_shape, dtype=dtype)
        self._test_body(dy, norm, gamma, params_axis, dtype=dtype)

    def _test_random_body(self, shape, params_axis=-1, dtype=np.float32):
        dy = np.random.random(shape).astype(dtype)
        norm = np.random.random(shape).astype(dtype)
        if params_axis < 0:
            params_axis += len(shape)
        gamma_shape = shape[params_axis:]
        if len(gamma_shape) == 0:
            gamma_shape = [1]
        gamma = np.random.random(gamma_shape).astype(dtype)
        self._test_body(dy, norm, gamma, params_axis, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 10))
        self._test_ones_body((2, 10, 2))
        self._test_ones_body((2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1, 10))
        self._test_random_body((2, 10, 2))
        self._test_random_body((2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(a_shape, b_shape, trans_a=False, trans_b=False, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def matmul_job(a=flow.FixedTensorDef(a_shape, dtype=dtype),
                   b=flow.FixedTensorDef(b_shape, dtype=dtype)):
        return flow.matmul(a, b, transpose_a=trans_a, transpose_b=trans_b)
    return matmul_job

def make_xla_job(a_shape, b_shape, trans_a=False, trans_b=False, dtype=flow.float32):
    config.use_xla_jit(True)
    config.use_tensorrt(False)
    @flow.function(config)
    def xla_matmul_job(a=flow.FixedTensorDef(a_shape, dtype=dtype),
                       b=flow.FixedTensorDef(b_shape, dtype=dtype)):
        return flow.matmul(a, b, transpose_a=trans_a, transpose_b=trans_b)
    return xla_matmul_job

class TestMatmul(unittest.TestCase):
    def make_shape(self, m, n, transpose):
        if transpose:
            return (n, m)
        else:
            return (m, n)

    def _test_body(self, a, b, trans_a, trans_b, dtype=np.float32):
        f1 = make_job(a.shape, b.shape, trans_a, trans_b)
        f2 = make_xla_job(a.shape, b.shape, trans_a, trans_b)
        x = f1(a, b).get()
        y = f2(a, b).get()
        print("without xla: ", x)
        print("with xla: ", y)
        self.assertTrue(np.allclose(x.ndarray(), y.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, m, k, n, trans_a, trans_b, dtype=np.float32):
        shape_a = self.make_shape(m, k, trans_a)
        shape_b = self.make_shape(k, n, trans_b)
        a = np.ones(shape_a, dtype=dtype)
        b = np.ones(shape_b, dtype=dtype)
        self._test_body(a, b, trans_a, trans_b, dtype=dtype)

    def _test_random_body(self, m, k, n, trans_a, trans_b, dtype=np.float32):
        shape_a = self.make_shape(m, k, trans_a)
        shape_b = self.make_shape(k, n, trans_b)
        a = np.random.random(shape_a).astype(dtype)
        b = np.random.random(shape_b).astype(dtype)
        self._test_body(a, b, trans_a, trans_b, dtype=dtype)

    def test_ones1x1x1_input(self):
        print("run test_ones1x1x1_input: ")
        self._test_ones_body(1, 1, 1, False, False)
        self._test_ones_body(1, 1, 1, False, True)
        self._test_ones_body(1, 1, 1, True, False)
        self._test_ones_body(1, 1, 1, True, True)

    def test_random1x1x1_input(self):
        print("test_random1x1x1_input: ")
        self._test_random_body(1, 1, 1, False, False)
        self._test_random_body(1, 1, 1, False, True)
        self._test_random_body(1, 1, 1, True, False)
        self._test_random_body(1, 1, 1, True, True)

    def test_ones1x10x1_input(self):
        print("test_ones1x10x1_input: ")
        self._test_ones_body(1, 10, 1, False, False)
        self._test_ones_body(1, 10, 1, False, True)
        self._test_ones_body(1, 10, 1, True, False)
        self._test_ones_body(1, 10, 1, True, True)

    def test_random1x10x1_input(self):
        print("test_random1x10x1_input: ")
        self._test_random_body(1, 10, 1, False, False)
        self._test_random_body(1, 10, 1, False, True)
        self._test_random_body(1, 10, 1, True, False)
        self._test_random_body(1, 10, 1, True, True)

    def test_ones10x10x2_input(self):
        print("test_ones10x10x2_input: ")
        self._test_ones_body(10, 10, 2, False, False)
        self._test_ones_body(10, 10, 2, False, True)
        self._test_ones_body(10, 10, 2, True, False)
        self._test_ones_body(10, 10, 2, True, True)

    def test_random10x10x2_input(self):
        print("run test_random10x10x2_input: ")
        self._test_random_body(10, 10, 2, False, False)
        self._test_random_body(10, 10, 2, False, True)
        self._test_random_body(10, 10, 2, True, False)
        self._test_random_body(10, 10, 2, True, True)

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

def make_job(x_shape, y_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(False)
    @flow.function(config)
    def multiply_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                     y=flow.FixedTensorDef(y_shape, dtype=dtype)):
        return flow.math.multiply(x, y)
    return multiply_job

def make_trt_job(x_shape, y_shape, dtype=flow.float32):
    config.use_xla_jit(False)
    config.use_tensorrt(True)
    @flow.function(config)
    def trt_multiply_job(x=flow.FixedTensorDef(x_shape, dtype=dtype),
                         y=flow.FixedTensorDef(y_shape, dtype=dtype)):
        return flow.math.multiply(x, y)
    return trt_multiply_job

class TestMultiply(unittest.TestCase):
    def _test_body(self, x, y, dtype=np.float32):
        f1 = make_job(x.shape, y.shape, dtype=flow.float32)
        f2 = make_trt_job(x.shape, y.shape, dtype=flow.float32)
        a = f1(x, y).get()
        b = f2(x, y).get()
        print("without xla: ", a)
        print("with tensorrt: ", b)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, x_shape, y_shape, dtype=np.float32):
        x = np.ones(x_shape, dtype=dtype)
        y = np.ones(y_shape, dtype=dtype)
        self._test_body(x, y, dtype=dtype)

    def _test_random_body(self, x_shape, y_shape, dtype=np.float32):
        x = np.random.random(x_shape).astype(dtype)
        y = np.random.random(y_shape).astype(dtype)
        self._test_body(x, y, dtype=dtype)

    def test_ones_input(self):
        self._test_ones_body((1, 10), (1, 10))
        self._test_ones_body((2, 10, 2), (2, 10, 2))
        self._test_ones_body((2, 5, 2, 2), (2, 5, 2, 2))

    def test_random_input(self):
        self._test_random_body((1, 10), (1, 10))
        self._test_random_body((2, 10, 2), (2, 10, 2))
        self._test_random_body((2, 5, 2, 2), (2, 5, 2, 2))

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow

config = flow.function_config()

class TestPooling(unittest.TestCase):
    run_test = False

    def _test_body(self, x, ksize, strides, padding, data_format, dtype=np.float32):
        if not self.run_test:
            return
        f1 = self.make_job(x.shape, ksize, strides, padding, data_format,
                           dtype=flow.float32)
        f2 = self.make_trt_job(x.shape, ksize, strides, padding, data_format,
                               dtype=flow.float32)
        a = f1(x).get()
        b = f2(x).get()
        print("without trt: ", a)
        print("with tensorrt: ", b)
        self.assertTrue(a.shape == b.shape)
        self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
        flow.clear_default_session()

    def _test_ones_body(self, shape, ksize, strides, padding, data_format,
                        dtype=np.float32):
        x = np.ones(shape, dtype=dtype)
        self._test_body(x, ksize=ksize, strides=strides,
                        padding=padding, data_format=data_format, dtype=dtype)

    def _test_random_body(self, shape, ksize, strides, padding, data_format,
                          dtype=np.float32):
        x = np.random.random(shape).astype(dtype)
        self._test_body(x, ksize=ksize, strides=strides, padding=padding,
                        data_format=data_format, dtype=dtype)

    def test_ones_input(self):
        print("test ones input: ")
        self._test_ones_body((1, 1, 6, 6), 1, 1, "VALID", "NCHW")
        self._test_ones_body((1, 3, 6, 6), 3, 2, "SAME", "NCHW")
        self._test_ones_body((1, 1, 3, 3), 1, 1, "VALID", "NCHW")
        self._test_ones_body((1, 5, 9, 9), 3, 1, "SAME", "NCHW")
        self._test_ones_body((1, 7, 9, 9), 1, 1, "SAME", "NCHW")
        self._test_ones_body((1, 5, 3, 3), 1, 1, "VALID", "NCHW")
        self._test_ones_body((1, 1, 6, 6), 2, 2, "SAME", "NCHW")
        self._test_ones_body((1, 1, 6, 6), 2, 2, "VALID", "NCHW")
        self._test_ones_body((1, 1, 9, 9), 2, 2, "SAME", "NCHW")
        self._test_ones_body((1, 1, 9, 9), 2, 2, "VALID", "NCHW")
        # self._test_ones_body((1, 224, 224, 3), 3, 2, "VALID", "NHWC")
        # self._test_ones_body((1, 224, 224, 1), 2, 1, "SAME", "NHWC")

    def test_random_input(self):
        print("test random input: ")
        self._test_random_body((1, 1, 6, 6), 1, 1, "VALID", "NCHW")
        self._test_random_body((1, 3, 6, 6), 3, 2, "SAME", "NCHW")
        self._test_random_body((1, 5, 6, 6), 3, 2, "VALID", "NCHW")
        self._test_random_body((1, 7, 6, 6), 3, 2, "SAME", "NCHW")
        self._test_random_body((1, 3, 3, 3), 1, 1, "VALID", "NCHW")
        self._test_random_body((1, 3, 6, 6), 3, 2, "SAME", "NCHW")
        self._test_random_body((1, 1, 6, 6), 2, 2, "SAME", "NCHW")
        self._test_random_body((1, 1, 6, 6), 2, 2, "VALID", "NCHW")
        self._test_random_body((1, 1, 9, 9), 2, 2, "SAME", "NCHW")
        self._test_random_body((1, 1, 9, 9), 2, 2, "VALID", "NCHW")
        # self._test_random_body((1, 224, 224, 3), 3, 2, "VALID", "NHWC")
        # self._test_random_body((1, 224, 224, 1), 2, 1, "SAME", "NHWC")

class TestMaxPooling(TestPooling):
    run_test = True

    def make_job(self, x_shape, ksize, strides, padding, data_format, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(False)
        @flow.function(config)
        def max_pooling_job(x=flow.FixedTensorDef(x_shape, dtype=dtype)):
            return flow.nn.max_pool2d(x, ksize=ksize, strides=strides,
                                      padding=padding, data_format=data_format)
        return max_pooling_job

    def make_trt_job(self, x_shape, ksize, strides, padding, data_format, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(True)
        @flow.function(config)
        def trt_max_pooling_job(x=flow.FixedTensorDef(x_shape, dtype=dtype)):
            return flow.nn.max_pool2d(x, ksize=ksize, strides=strides,
                                      padding=padding, data_format=data_format)
        return trt_max_pooling_job

class TestAveragePooling(TestPooling):
    run_test = True

    def make_job(self, x_shape, ksize, strides, padding, data_format, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(False)
        @flow.function(config)
        def avg_pooling_job(x=flow.FixedTensorDef(x_shape, dtype=dtype)):
            return flow.nn.avg_pool2d(x, ksize=ksize, strides=strides,
                                      padding=padding, data_format=data_format)
        return avg_pooling_job

    def make_trt_job(self, x_shape, ksize, strides, padding, data_format, dtype=flow.float32):
        config.use_xla_jit(False)
        config.use_tensorrt(True)
        @flow.function(config)
        def trt_avg_pooling_job(x=flow.FixedTensorDef(x_shape, dtype=dtype)):
            return flow.nn.avg_pool2d(x, ksize=ksize, strides=strides,
                                      padding=padding, data_format=data_format)
        return trt_avg_pooling_job

if __name__ == '__main__':
    unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
class TestReduce(unittest.TestCase):
run_test = False
def _test_body(self, x, axis, keepdims, dtype=np.float32):
if not self.run_test:
return
f1 = self.make_job(x.shape, axis, keepdims, dtype=flow.float32)
f2 = self.make_xla_job(x.shape, axis, keepdims, dtype=flow.float32)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(a.shape == b.shape)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = self.make_trt_job(x.shape, axis, keepdims, dtype=flow.float32)
c = f3(x).get()
print("with tensorrt: ", c)
self.assertTrue(a.shape == c.shape)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, axis, keepdims, dtype=np.float32):
x = np.ones(shape, dtype=dtype)
self._test_body(x, axis, keepdims, dtype=dtype)
def _test_random_body(self, shape, axis, keepdims, dtype=np.float32):
x = np.random.random(shape).astype(dtype)
self._test_body(x, axis, keepdims, dtype=dtype)
def test_ones_input(self):
# self._test_ones_body((1), [0], False)
self._test_ones_body((1), [0], True)
self._test_ones_body((1, 10), [1], False)
self._test_ones_body((1, 10), [1], True)
# self._test_ones_body((1, 10), [0, 1], False)
self._test_ones_body((1, 10), [0, 1], True)
self._test_ones_body((2, 10, 2), [1, 2], False)
self._test_ones_body((2, 10, 2), [1, 2], True)
def test_random_input(self):
# self._test_random_body((1), [0], False)
self._test_random_body((1), [0], True)
self._test_random_body((1, 10), [1], False)
self._test_random_body((1, 10), [1], True)
# self._test_random_body((1, 10), [0, 1], False)
self._test_random_body((1, 10), [0, 1], True)
self._test_random_body((2, 10, 2), [1, 2], False)
self._test_random_body((2, 10, 2), [1, 2], True)
class TestReduceSum(TestReduce):
run_test = True
def make_job(self, x_shape, axis, keepdims, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def reduce_sum_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.reduce_sum(x, axis=axis, keepdims=keepdims)
return reduce_sum_job
def make_xla_job(self, x_shape, axis, keepdims, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_reduce_sum_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.reduce_sum(x, axis=axis, keepdims=keepdims)
return xla_reduce_sum_job
def make_trt_job(self, x_shape, axis, keepdims, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_reduce_sum_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.reduce_sum(x, axis=axis, keepdims=keepdims)
return trt_reduce_sum_job
# XLA does not support ReduceMean, so it will fall back to OneFlow automatically.
class TestReduceMean(TestReduce):
run_test = True
def make_job(self, x_shape, axis, keepdims, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def reduce_mean_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.reduce_mean(x, axis=axis, keepdims=keepdims)
return reduce_mean_job
def make_xla_job(self, x_shape, axis, keepdims, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_reduce_mean_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.reduce_mean(x, axis=axis, keepdims=keepdims)
return xla_reduce_mean_job
def make_trt_job(self, x_shape, axis, keepdims, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_reduce_mean_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.reduce_mean(x, axis=axis, keepdims=keepdims)
return trt_reduce_mean_job
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(input_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def relu_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.relu(x)
return relu_job
def make_xla_job(input_shape, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_relu_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.relu(x)
return xla_relu_job
def make_trt_job(input_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_relu_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.relu(x)
return trt_relu_job
class TestRelu(unittest.TestCase):
def _test_body(self, x, dtype=np.float32):
f1 = make_job(x.shape, dtype=flow.float32)
f2 = make_xla_job(x.shape, dtype=flow.float32)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = make_trt_job(x.shape, dtype=flow.float32)
c = f3(x).get()
print("with tensorrt: ", c)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, dtype=np.float32):
x = np.ones(shape, dtype=dtype)
self._test_body(x, dtype=dtype)
def _test_random_body(self, shape, dtype=np.float32):
x = np.random.random(shape).astype(dtype)
self._test_body(x, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1))
self._test_ones_body((1, 10))
self._test_ones_body((2, 10, 2))
self._test_ones_body((2, 5, 2, 2))
def test_random_input(self):
self._test_random_body((1))
self._test_random_body((1, 10))
self._test_random_body((2, 10, 2))
self._test_random_body((2, 5, 2, 2))
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(x_shape, shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def reshape_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.reshape(x, shape)
return reshape_job
def make_xla_job(x_shape, shape, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_reshape_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.reshape(x, shape)
return xla_reshape_job
def make_trt_job(x_shape, shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_reshape_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.reshape(x, shape)
return trt_reshape_job
class TestReshape(unittest.TestCase):
def _test_body(self, x, shape, dtype=np.float32):
f1 = make_job(x.shape, shape, dtype=flow.float32)
f2 = make_xla_job(x.shape, shape, dtype=flow.float32)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(a.shape == b.shape)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = make_trt_job(x.shape, shape, dtype=flow.float32)
c = f3(x).get()
print("with tensorrt: ", c)
self.assertTrue(a.shape == c.shape)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, x_shape, shape, dtype=np.float32):
x = np.ones(x_shape, dtype=dtype)
self._test_body(x, shape, dtype=dtype)
def _test_random_body(self, x_shape, shape, dtype=np.float32):
x = np.random.random(x_shape).astype(dtype)
self._test_body(x, shape, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1, 10), (10,))
self._test_ones_body((2, 10, 2), (4, 10))
self._test_ones_body((2, 5, 2, 2), (2, 5, 4))
def test_random_input(self):
self._test_random_body((1, 10), (10,))
self._test_random_body((2, 10, 2), (4, 10))
self._test_random_body((2, 5, 2, 2), (2, 5, 4))
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(x_shape, like_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def reshape_like_job(x = flow.FixedTensorDef(x_shape, dtype=dtype),
like = flow.FixedTensorDef(like_shape, dtype=dtype)):
return flow.reshape_like(x, like)
return reshape_like_job
def make_xla_job(x_shape, like_shape, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_reshape_like_job(x = flow.FixedTensorDef(x_shape, dtype=dtype),
like = flow.FixedTensorDef(like_shape, dtype=dtype)):
return flow.reshape_like(x, like)
return xla_reshape_like_job
def make_trt_job(x_shape, like_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_reshape_like_job(x = flow.FixedTensorDef(x_shape, dtype=dtype),
like = flow.FixedTensorDef(like_shape, dtype=dtype)):
return flow.reshape_like(x, like)
return trt_reshape_like_job
class TestReshapeLike(unittest.TestCase):
def _test_body(self, x, like, dtype=np.float32):
f1 = make_job(x.shape, like.shape, dtype=flow.float32)
f2 = make_xla_job(x.shape, like.shape, dtype=flow.float32)
a = f1(x, like).get()
b = f2(x, like).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(a.shape == b.shape)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = make_trt_job(x.shape, like.shape, dtype=flow.float32)
c = f3(x, like).get()
print("with tensorrt: ", c)
self.assertTrue(a.shape == c.shape)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, x_shape, like_shape, dtype=np.float32):
x = np.ones(x_shape, dtype=dtype)
like = np.ones(like_shape, dtype=dtype)
self._test_body(x, like, dtype=dtype)
def _test_random_body(self, x_shape, like_shape, dtype=np.float32):
x = np.random.random(x_shape).astype(dtype)
like = np.random.random(like_shape).astype(dtype)
self._test_body(x, like, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1, 10), (10,))
self._test_ones_body((2, 10, 2), (4, 10))
self._test_ones_body((2, 5, 2, 2), (2, 5, 4))
def test_random_input(self):
self._test_random_body((1, 10), (10,))
self._test_random_body((2, 10, 2), (4, 10))
self._test_random_body((2, 5, 2, 2), (2, 5, 4))
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
class TestScalarOp(unittest.TestCase):
run_test = False
def _test_body(self, x, scalar, dtype=np.float32):
if not self.run_test:
return
f1 = self.make_job(x.shape, scalar, dtype=flow.float32)
f2 = self.make_xla_job(x.shape, scalar, dtype=flow.float32)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla", b)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, x_shape, scalar, dtype=np.float32):
x = np.ones(x_shape, dtype=dtype)
self._test_body(x, scalar, dtype=dtype)
def _test_random_body(self, x_shape, scalar, dtype=np.float32):
x = np.random.random(x_shape).astype(dtype)
self._test_body(x, scalar, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1, 10), 2.0)
self._test_ones_body((2, 10, 2), 2.0)
self._test_ones_body((2, 5, 2, 2), 2.0)
def test_random_input(self):
self._test_random_body((1, 10), 2.0)
self._test_random_body((2, 10, 2), 2.0)
self._test_random_body((2, 5, 2, 2), 2.0)
class TestScalarAddOp(TestScalarOp):
run_test = True
def make_job(self, x_shape, scalar, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def scalar_add_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.add(x, scalar)
return scalar_add_job
def make_xla_job(self, x_shape, scalar, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_scalar_add_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.add(x, scalar)
return xla_scalar_add_job
class TestScalarMulOp(TestScalarOp):
run_test = True
def make_job(self, x_shape, scalar, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def scalar_mul_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.multiply(x, scalar)
return scalar_mul_job
def make_xla_job(self, x_shape, scalar, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_scalar_mul_job(x = flow.FixedTensorDef(x_shape, dtype=dtype)):
return flow.math.multiply(x, scalar)
return xla_scalar_mul_job
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(input_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def sigmoid_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.sigmoid(x)
return sigmoid_job
def make_xla_job(input_shape, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_sigmoid_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.sigmoid(x)
return xla_sigmoid_job
def make_trt_job(input_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_sigmoid_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.sigmoid(x)
return trt_sigmoid_job
class TestSigmoid(unittest.TestCase):
def _test_body(self, x, dtype=np.float32):
f1 = make_job(x.shape, dtype=flow.float32)
f2 = make_xla_job(x.shape, dtype=flow.float32)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = make_trt_job(x.shape, dtype=flow.float32)
c = f3(x).get()
print("with tensorrt: ", c)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, dtype=np.float32):
x = np.ones(shape, dtype=dtype)
self._test_body(x, dtype=dtype)
def _test_random_body(self, shape, dtype=np.float32):
x = np.random.random(shape).astype(dtype)
self._test_body(x, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1))
self._test_ones_body((1, 10))
self._test_ones_body((2, 10, 2))
self._test_ones_body((2, 5, 2, 2))
def test_random_input(self):
self._test_random_body((1))
self._test_random_body((1, 10))
self._test_random_body((2, 10, 2))
self._test_random_body((2, 5, 2, 2))
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(input_shape, axis, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def softmax_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.nn.softmax(x, axis=axis)
return softmax_job
def make_xla_job(input_shape, axis, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_softmax_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.nn.softmax(x, axis=axis)
return xla_softmax_job
def make_trt_job(input_shape, axis, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_softmax_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.nn.softmax(x, axis=axis)
return trt_softmax_job
class TestSoftmax(unittest.TestCase):
def _test_body(self, x, axis, dtype=np.float32):
f1 = make_job(x.shape, axis, dtype=flow.float32)
f2 = make_xla_job(x.shape, axis, dtype=flow.float32)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = make_trt_job(x.shape, axis, dtype=flow.float32)
c = f3(x).get()
print("with tensorrt: ", c)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, axis, dtype=np.float32):
x = np.ones(shape, dtype=dtype)
self._test_body(x, axis, dtype=dtype)
def _test_random_body(self, shape, axis, dtype=np.float32):
x = np.random.random(shape).astype(dtype)
self._test_body(x, axis, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((2, 5), axis=1)
self._test_ones_body((2, 5), axis=-1)
self._test_ones_body((1, 5, 2), axis=1)
self._test_ones_body((1, 5, 2), axis=2)
def test_random_input(self):
self._test_random_body((2, 5), axis=1)
self._test_random_body((2, 5), axis=-1)
self._test_random_body((1, 5, 2), axis=1)
self._test_random_body((1, 5, 2), axis=2)
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(shape, axis, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def softmax_grad_job(y=flow.FixedTensorDef(shape, dtype=dtype),
dy=flow.FixedTensorDef(shape, dtype=dtype)):
return flow.nn.softmax_grad(y, dy, axis=axis)
return softmax_grad_job
def make_xla_job(shape, axis, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_softmax_grad_job(y=flow.FixedTensorDef(shape, dtype=dtype),
dy=flow.FixedTensorDef(shape, dtype=dtype)):
return flow.nn.softmax_grad(y, dy, axis=axis)
return xla_softmax_grad_job
class TestSoftmaxGrad(unittest.TestCase):
def _test_body(self, y, dy, axis, dtype=np.float32):
f1 = make_job(y.shape, axis, dtype=flow.float32)
f2 = make_xla_job(y.shape, axis, dtype=flow.float32)
a = f1(y, dy).get()
b = f2(y, dy).get()
print("without xla: ", a)
print("with xla", b)
self.assertTrue(a.shape == b.shape)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, axis, dtype=np.float32):
y = np.ones(shape, dtype=dtype)
dy = np.ones(shape, dtype=dtype)
self._test_body(y, dy, axis, dtype=dtype)
def _test_random_body(self, shape, axis, dtype=np.float32):
y = np.random.random(shape).astype(dtype)
dy = np.random.random(shape).astype(dtype)
self._test_body(y, dy, axis, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((2, 5), axis=1)
self._test_ones_body((2, 5), axis=-1)
self._test_ones_body((1, 5, 2), axis=1)
self._test_ones_body((1, 5, 2), axis=2)
def test_random_input(self):
self._test_random_body((2, 5), axis=1)
self._test_random_body((2, 5), axis=-1)
self._test_random_body((1, 5, 2), axis=1)
self._test_random_body((1, 5, 2), axis=2)
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(input_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def tanh_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.tanh(x)
return tanh_job
def make_xla_job(input_shape, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_tanh_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.tanh(x)
return xla_tanh_job
def make_trt_job(input_shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_tanh_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.keras.activations.tanh(x)
return trt_tanh_job
class TestTanh(unittest.TestCase):
def _test_body(self, x, dtype=np.float32):
f1 = make_job(x.shape, dtype=flow.float32)
f2 = make_xla_job(x.shape, dtype=flow.float32)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = make_trt_job(x.shape, dtype=flow.float32)
c = f3(x).get()
print("with tensorrt: ", c)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, dtype=np.float32):
x = np.ones(shape, dtype=dtype)
self._test_body(x, dtype=dtype)
def _test_random_body(self, shape, dtype=np.float32):
x = np.random.random(shape).astype(dtype)
self._test_body(x, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1))
self._test_ones_body((1, 10))
self._test_ones_body((2, 10, 2))
self._test_ones_body((2, 5, 2, 2))
def test_random_input(self):
self._test_random_body((1))
self._test_random_body((1, 10))
self._test_random_body((2, 10, 2))
self._test_random_body((2, 5, 2, 2))
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(shape, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def tanh_grad_job(y = flow.FixedTensorDef(shape, dtype=dtype),
dy = flow.FixedTensorDef(shape, dtype=dtype)):
return flow.keras.activations.tanh_grad(y, dy)
return tanh_grad_job
def make_xla_job(shape, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_tanh_grad_job(y = flow.FixedTensorDef(shape, dtype=dtype),
dy = flow.FixedTensorDef(shape, dtype=dtype)):
return flow.keras.activations.tanh_grad(y, dy)
return xla_tanh_grad_job
class TestTanhGrad(unittest.TestCase):
def _test_body(self, y, dy, dtype=np.float32):
f1 = make_job(y.shape, dtype=flow.float32)
f2 = make_xla_job(y.shape, dtype=flow.float32)
a = f1(y, dy).get()
b = f2(y, dy).get()
print("without xla: ", a)
print("with xla", b)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, dtype=np.float32):
y = np.ones(shape, dtype=dtype)
dy = np.ones(shape, dtype=dtype)
self._test_body(y, dy, dtype=dtype)
def _test_random_body(self, shape, dtype=np.float32):
y = np.random.random(shape).astype(dtype)
dy = np.random.random(shape).astype(dtype)
self._test_body(y, dy, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1))
self._test_ones_body((1, 10))
self._test_ones_body((2, 10, 2))
self._test_ones_body((2, 5, 2, 2))
def test_random_input(self):
self._test_random_body((1))
self._test_random_body((1, 10))
self._test_random_body((2, 10, 2))
self._test_random_body((2, 5, 2, 2))
if __name__ == '__main__':
unittest.main()
import unittest
import numpy as np
import oneflow as flow
config = flow.function_config()
def make_job(input_shape, permute, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(False)
@flow.function(config)
def transpose_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.transpose(x, perm=permute)
return transpose_job
def make_xla_job(input_shape, permute, dtype=flow.float32):
config.use_xla_jit(True)
config.use_tensorrt(False)
@flow.function(config)
def xla_transpose_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.transpose(x, perm=permute)
return xla_transpose_job
def make_trt_job(input_shape, permute, dtype=flow.float32):
config.use_xla_jit(False)
config.use_tensorrt(True)
@flow.function(config)
def trt_transpose_job(x = flow.FixedTensorDef(input_shape, dtype=dtype)):
return flow.transpose(x, perm=permute)
return trt_transpose_job
class TestTranspose(unittest.TestCase):
def _test_body(self, x, permute, dtype=flow.float32):
f1 = make_job(x.shape, permute, dtype=dtype)
f2 = make_xla_job(x.shape, permute, dtype=dtype)
a = f1(x).get()
b = f2(x).get()
print("without xla: ", a)
print("with xla: ", b)
self.assertTrue(a.shape == b.shape)
self.assertTrue(np.allclose(a.ndarray(), b.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
f3 = make_trt_job(x.shape, permute, dtype=dtype)
c = f3(x).get()
print("with tensorrt: ", c)
self.assertTrue(a.shape == c.shape)
self.assertTrue(np.allclose(a.ndarray(), c.ndarray(), rtol=1e-03, atol=1e-05))
flow.clear_default_session()
def _test_ones_body(self, shape, permute, dtype=flow.float32):
np_dtype = flow.convert_of_dtype_to_numpy_dtype(dtype)
x = np.ones(shape, dtype=np_dtype)
self._test_body(x, permute, dtype=dtype)
def _test_random_body(self, shape, permute, dtype=flow.float32):
np_dtype = flow.convert_of_dtype_to_numpy_dtype(dtype)
x = np.random.random(shape).astype(np_dtype)
self._test_body(x, permute, dtype=dtype)
def test_ones_input(self):
self._test_ones_body((1, 2), (1, 0))
self._test_ones_body((2, 2, 2), (0, 2, 1))
self._test_ones_body((2, 2, 2), (1, 0, 2))
self._test_ones_body((2, 2, 2), (1, 2, 0))
def test_random_input(self):
self._test_random_body((1, 2), (1, 0))
self._test_random_body((2, 2, 2), (0, 2, 1))
self._test_random_body((2, 2, 2), (1, 0, 2))
self._test_random_body((2, 2, 2), (1, 2, 0))
if __name__ == '__main__':
unittest.main()
## XRT (X-Runtime)
XRT is a runtime acceleration library that supports multiple computation engines at once; TensorFlow XLA and NVIDIA TensorRT are currently integrated as backend engines. XLA fully supports training and inference, while TensorRT supports inference as well as training for a subset of operators. For a single computation graph, XRT allows several engines to be used together to obtain a better speedup.
For any backend engine, XRT executes in the following four steps (a sketch of the compile-time half follows the list):
1. Computation graph conversion
2. Engine-agnostic optimization
3. Generation of an engine-specific Executable
4. Execution of the Executable
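The first two steps run when the job graph is compiled. Below is a condensed, illustrative sketch; the wrapper name `CompileJobWithXrt` is hypothetical, and the calls mirror `RunCompilationTimeXrtPasses` defined later in this PR:
```cpp
#include "oneflow/xrt/api.h"

namespace oneflow {
namespace xrt {

// Steps 1 and 2: convert the op graph, then run the engine-agnostic passes.
void CompileJobWithXrt(const OpGraph &op_graph, Job *job, bool train_phase) {
  auto graph = BuildXrtGraph(&op_graph);                    // step 1: graph conversion
  auto options = CreateDefaultXrtPassOptions(train_phase);  // clustering flags and engines
  RunXrtPass("MarkClusterId", graph.get(), options);        // step 2: cluster compilable nodes
  RunXrtPass("BuildSubGraph", graph.get(), options);        // fold clusters into subgraphs
  // Rewrite the job so each subgraph is folded into a Launch node. Steps 3
  // and 4, building and running the engine-specific Executable, happen at
  // runtime.
  RunXrtPass("RebuildCompiledJob", graph.get(), options, job);
}

}  // namespace xrt
}  // namespace oneflow
```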
### Engine-agnostic optimization
- Subgraph partitioning
Nodes are clustered according to a set of attributes of each compute node, such as whether it is compilable, its device, and its SBP policy. The clustered nodes are folded into a new node (a Launch node), inside which the subgraph is rebuilt, and the backend engine that will execute the subgraph is determined at the same time.
If multiple backend engines are enabled, subgraph partitioning is performed for each engine in priority order. The current priorities are:
- During training, XLA subgraphs are partitioned first, then TensorRT subgraphs.
- During inference, TensorRT subgraphs are partitioned first, then XLA subgraphs.
[Subgraph partitioning](https://github.com/Oneflow-Inc/oneflow-issue/issues/44) is fully automatic, but the result can be tuned through the following environment variables.
```shell
export FLAGS_clustering_minimum_nodes=1
export FLAGS_clustering_maximum_nodes=100
export FLAGS_strict_clustering=true
```
- FLAGS_clustering_minimum_nodes
Sets the minimum number of nodes in each clustered subgraph. A clustered subgraph containing fewer nodes than this value is released.
- FLAGS_clustering_maximum_nodes
Sets the maximum number of nodes in each clustered subgraph. XRT guarantees that no clustered subgraph contains more nodes than this value.
- FLAGS_strict_clustering
Clustering nodes may break dependencies between them and change when a node is executed. Setting the environment variable FLAGS_strict_clustering=true avoids this behavior and guarantees that execution timing is unchanged after clustering.
Note that FLAGS_strict_clustering=true tends to produce smaller subgraphs, which may cost the backend engine some optimization opportunities. FLAGS_strict_clustering defaults to true.
- ...
### Generating the Executable
At runtime, each subgraph can be compiled into an engine-specific Executable.
For subgraphs with static shapes, the compilation cache ensures that each subgraph is compiled only once at runtime. A subgraph with dynamic shapes, however, may need to be recompiled on every run, so XRT is currently not recommended when the computation graph contains dynamically shaped nodes. A sketch of the cache lookup follows.
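The sketch below uses `ComputeSignature` and `CompilationCache` from `oneflow/xrt/compilation_cache.h` in this PR; `BuildExecutable` and `GetOrCompile` are hypothetical stand-ins for the engine-specific compiler and its caller:
```cpp
#include "oneflow/xrt/compilation_cache.h"

namespace oneflow {
namespace xrt {

// Hypothetical engine-specific compiler, declared only for this sketch.
std::shared_ptr<Executable> BuildExecutable();

Executable *GetOrCompile(CompilationCache *cache, const std::string &name,
                         int device_ordinal,
                         const std::vector<Parameter> &entry_params) {
  // The signature keys on the builder name, device ordinal and all entry
  // shapes, so a static-shape subgraph is compiled at most once.
  Signature signature = ComputeSignature(name, device_ordinal, entry_params);
  Executable *executable = cache->GetRecord(signature);
  if (executable == nullptr) {
    std::shared_ptr<Executable> result = BuildExecutable();  // recompile on miss
    cache->Record(signature, result);
    executable = result.get();
  }
  return executable;
}

}  // namespace xrt
}  // namespace oneflow
```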
### Executing the Executable
When an Executable is executed, it calls the execution interface provided by its backend engine and returns the results once finished. On GPU the execution interface is asynchronous, while on CPU it is synchronous, as sketched below.
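For illustration, a minimal sketch of driving an `Executable` (its interface appears in `oneflow/xrt/executable.h` below); `RunOnStream` is a hypothetical helper name:
```cpp
#include "glog/logging.h"
#include "oneflow/xrt/executable.h"

namespace oneflow {
namespace xrt {

std::vector<Parameter> RunOnStream(Executable *executable,
                                   const std::vector<Parameter> &inputs,
                                   void *stream /* nullptr = default stream */) {
  ExecutableRunOptions options;
  options.stream = stream;
  // Blocking is only required when the results are consumed immediately on
  // the host; on GPU the engine may otherwise keep running asynchronously.
  bool ok = executable->Run(inputs, options, /*block_until_done=*/true);
  CHECK(ok) << "Failed to run executable.";
  return executable->Results();
}

}  // namespace xrt
}  // namespace oneflow
```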
- Temporary memory management
XLA currently manages temporary memory through an automatically growing buffer pool and supports reusing the output buffers, which reduces device memory usage and enables in-place computation.
For TensorRT, the maximum temporary buffer size in bytes can be set through an environment variable.
```shell
export FLAGS_max_workspace_bytes=10000
```
- Max batch size
TensorRT requires a maximum supported batch size to be set at execution time; XRT lets the user set it through an environment variable:
```shell
export FLAGS_max_batch_size=10
```
If the actual batch size at runtime exceeds the configured maximum, XRT allows the TensorRT Executable to adjust its max batch size automatically and still execute correctly (this automatic adjustment incurs some overhead).
### Using XRT in OneFlow
First, OneFlow must be built with the WITH_XLA or WITH_TENSORRT option enabled.
XRT is disabled by default in OneFlow. XLA and TensorRT can each be enabled or disabled through the Python front end or through environment variables, with the Python interface taking priority over the environment variables.
- Configuration through the Python interface
```python
import oneflow as flow
# Configure XLA.
# True enables XLA, False disables it; the default is an undefined state.
flow.config.use_xla_jit(True)
# Configure TensorRT.
# True enables TensorRT, False disables it; the default is an undefined state.
flow.config.use_tensorrt(True)
```
- Configuration through environment variables
```shell
# Effective only when the state is left undefined by the Python front end
export FLAGS_use_xla_jit=true # true to enable, false to disable
export FLAGS_use_tensorrt=true # true to enable, false to disable
```
### Benchmark
- BERT base (batch size = 60)
>| RTX 2080Ti, 1 GPU | FP32 | | FP16 mixed precision | |
>| ---------------------- | ----------- | ----------- | ------------ | ----------- |
>| | oneflow | oneflow-xla | oneflow | oneflow-xla |
>| loss (100 batches) | 8.85063839 | 8.850635529 | 8.850672722 | 8.850834847 |
>| s/batch | 0.57 | 0.45 | 0.31 | 0.19 |
>| GPU memory | 8669MiB | 8685MiB | 7009MiB | 7041MiB |
>| Throughput (sentences/s) | 105.2631579 | 133.3333333 | 193.5483871 | 315.7894737 |
>| Speedup | 1 | 1.266666667 | 1 | 1.631578947 |

>| RTX 2080Ti, 2 GPUs | FP32 | | FP16 mixed precision | |
>| ---------------------- | ----------- | ----------- | ------------ | ----------- |
>| | oneflow | oneflow-xla | oneflow | oneflow-xla |
>| loss (100 batches) | 8.806107521 | 8.806109428 | 8.806120873 | 8.806238174 |
>| s/batch | 0.596 | 0.485 | 0.353 | 0.241 |
>| GPU memory | 9147MiB | 9149MiB | 7669MiB | 7675MiB |
>| Throughput (sentences/s) | 201.3422819 | 247.4226804 | 339.9433428 | 497.9253112 |
>| Speedup | 1 | 1.228865979 | 1 | 1.46473029 |

>| RTX 2080Ti, 4 GPUs | FP32 | | FP16 mixed precision | |
>| ---------------------- | ----------- | ----------- | ------------ | ----------- |
>| | oneflow | oneflow-xla | oneflow | oneflow-xla |
>| loss (100 batches) | 8.730175972 | 8.730184555 | 8.730111122 | 8.729899406 |
>| s/batch | 0.61 | 0.495 | 0.376 | 0.252 |
>| GPU memory | 9147MiB | 9149MiB | 7669MiB | 7675MiB |
>| Throughput (sentences/s) | 393.442623 | 484.8484848 | 638.2978723 | 952.3809524 |
>| Speedup | 1 | 1.232323232 | 1 | 1.492063492 |
- BERT base (batch size = 40)
>| RTX 2080Ti, 1 GPU | FP32 | | | | FP16 mixed precision | | | |
>| ---------------------- | --------- | ----------- | ---------- | -------------- | ------------ | ----------- | ---------- | -------------- |
>| | oneflow | oneflow-xla | tensorflow | tensorflow-xla | oneflow | oneflow-xla | tensorflow | tensorflow-xla |
>| Throughput (sentences/s) | 99.276 | 125.708 | 91.4 | 119.1 | 170.731 | 288.511 | 202.2 | 309.5 |
>| Speedup | 1 | 1.26625 | 1 | 1.30306 | 1 | 1.690 | 1 | 1.53066 |

>| RTX 2080Ti, 2 GPUs | FP32 | | | | FP16 mixed precision | | | |
>| ---------------------- | --------- | ----------- | ---------- | -------------- | ------------ | ----------- | ---------- | -------------- |
>| | oneflow | oneflow-xla | tensorflow | tensorflow-xla | oneflow | oneflow-xla | tensorflow | tensorflow-xla |
>| Throughput (sentences/s) | 188.476 | 223.643 | 173.6 | 196.2 | 290.946 | 431.241 | 307.8 | 376.1 |
>| Speedup | 1 | 1.18659 | 1 | 1.13018 | 1 | 1.482 | 1 | 1.22190 |
#ifndef ONEFLOW_XRT_ANY_H_
#define ONEFLOW_XRT_ANY_H_
#include <functional>
#include <type_traits>
#include <typeinfo>
#include "glog/logging.h"
namespace oneflow {
namespace xrt {
class Any {
public:
inline Any() = default;
inline Any(Any &&other);
inline Any(const Any &other);
template<typename T>
inline Any(T &&value);
inline virtual ~Any();
inline Any &operator=(Any &&other);
inline Any &operator=(const Any &other);
template<typename T>
inline Any &operator=(T &&value);
inline void Swap(Any &other);
template<typename T>
inline const T &Cast() const;
template<typename T>
inline T &Cast();
template<typename T>
inline friend const T &any_cast(const Any &any);
template<typename T>
inline friend T &any_cast(Any &any);
private:
struct AnyType {
// Initialize to nullptr so a default-constructed Any has a valid state.
const std::type_info *ptype_info = nullptr;
};
struct AnyData {
virtual ~AnyData() = default;
virtual const void *Ptr() { return nullptr; };
std::function<AnyData *()> clone;
};
template<typename T>
struct AnyDataImpl : public AnyData {
T data_content;
explicit AnyDataImpl(const T &value);
const void *Ptr() override { return &data_content; }
};
template<typename T>
inline AnyType TypeInfo() const;
template<typename T>
inline bool CheckType() const;
private:
AnyType type_;
AnyData *data_ = nullptr;
};
template<typename T>
Any::AnyDataImpl<T>::AnyDataImpl(const T &value) : data_content(value) {
this->clone = [this]() -> Any::AnyDataImpl<T> * {
return new AnyDataImpl<T>(this->data_content);
};
}
void Any::Swap(Any &other) {
std::swap(type_, other.type_);
std::swap(data_, other.data_);
}
Any::Any(Any &&other) { this->Swap(other); }
Any::Any(const Any &other) {
type_ = other.type_;
if (other.data_) { data_ = other.data_->clone(); }
}
Any::~Any() {
if (data_) delete data_;
data_ = nullptr;
}
template<typename T>
Any::AnyType Any::TypeInfo() const {
Any::AnyType type;
type.ptype_info = &typeid(T);
return std::move(type);
}
template<typename T>
Any::Any(T &&value) {
typedef typename std::decay<T>::type DT;
if (std::is_same<DT, Any>::value) {
*this = std::move(value);
} else {
type_ = TypeInfo<T>();
data_ = new AnyDataImpl<T>(value);
}
}
Any &Any::operator=(Any &&other) {
Any(std::move(other)).Swap(*this);
return *this;
}
Any &Any::operator=(const Any &other) {
Any(other).Swap(*this);
return *this;
}
template<typename T>
Any &Any::operator=(T &&value) {
Any(std::move(value)).Swap(*this);
return *this;
}
template<typename T>
bool Any::CheckType() const {
if (typeid(T).hash_code() != type_.ptype_info->hash_code()) {
LOG(FATAL) << "Could not cast type " << type_.ptype_info->name() << " to type "
<< typeid(T).name();
return false;
}
return true;
}
template<typename T>
const T &Any::Cast() const {
CheckType<T>();
return *reinterpret_cast<const T *>(data_->Ptr());
}
template<typename T>
T &Any::Cast() {
CheckType<T>();
return *const_cast<T *>(reinterpret_cast<const T *>(data_->Ptr()));
}
template<typename T>
const T &any_cast(const Any &any) {
return any.Cast<T>();
}
template<typename T>
T &any_cast(Any &any) {
return any.Cast<T>();
}
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_ANY_H_
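A minimal usage sketch for the `Any` type above; the include path is assumed from the header guard:
```cpp
#include <string>
#include <vector>

#include "oneflow/xrt/any.h"  // assumed path, from the guard ONEFLOW_XRT_ANY_H_

int main() {
  oneflow::xrt::Any attr = std::vector<int>{1, 2, 3};
  // any_cast returns a reference to the stored value; casting to a different
  // type would abort through LOG(FATAL) in Any::CheckType.
  auto &dims = oneflow::xrt::any_cast<std::vector<int>>(attr);
  dims.push_back(4);
  attr = std::string("time_shape");  // rebinding releases the old value
  return oneflow::xrt::any_cast<std::string>(attr).size() == 10 ? 0 : 1;
}
```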
#include "oneflow/xrt/api.h"
#include "glog/logging.h"
#include "oneflow/core/operator/operator.h" // GenLogicalBlobName, GenLogicalBlobId
#include "oneflow/xrt/build_graph.h"
#include "oneflow/xrt/utility/env.h"
DEFINE_int32(clustering_minimum_nodes, EnvToInt(FLAGS_clustering_minimum_nodes, 1),
"Minimum nodes of a cluster after clustering.");
DEFINE_int32(clustering_maximum_nodes, EnvToInt(FLAGS_clustering_maximum_nodes, 1000),
"Maximum nodes of a cluster after clustering.");
DEFINE_bool(strict_clustering, EnvToBool(FLAGS_strict_clustering, true),
"Option to clustering with strict dependencies analysis.");
// DEFINE_string(engine, EnvToString(FLAGS_engine, "XLA"),
// "Which third party engine to be used. XLA and TENSORRT are "
// "valid, Default means using no engine.");
DEFINE_bool(use_xla_jit, EnvToBool(FLAGS_use_xla_jit, false), "Whether to use the XLA JIT engine.");
DEFINE_bool(use_tensorrt, EnvToBool(FLAGS_use_tensorrt, false), "Whether to use the TensorRT engine.");
DEFINE_bool(tensorrt_fp16, EnvToBool(FLAGS_tensorrt_fp16, false),
"Enable fp16 precision for TENSORRT engine.");
DEFINE_bool(tensorrt_int8, EnvToBool(FLAGS_tensorrt_int8, false),
"Enable int8 precision for TENSORRT engine.");
namespace oneflow {
namespace xrt {
#define OP_TYPE_CASE(op) OperatorConf::k##op##Conf
static std::unordered_map<int32_t, std::string> op_type2string_map = {
{OP_TYPE_CASE(Matmul), "MatMul"},
{OP_TYPE_CASE(Relu), "Relu"},
{OP_TYPE_CASE(Conv2D), "Conv2D"},
{OP_TYPE_CASE(Multiply), "Multiply"},
// {OP_TYPE_CASE(FullyConnected), "FullyConnected"},
{OP_TYPE_CASE(BiasAdd), "BiasAdd"},
{OP_TYPE_CASE(Reshape), "Reshape"},
{OP_TYPE_CASE(Identity), "Identity"},
{OP_TYPE_CASE(ReshapeLike), "ReshapeLike"},
{OP_TYPE_CASE(Cast), "Cast"},
{OP_TYPE_CASE(Concat), "Concat"},
{OP_TYPE_CASE(ScalarAdd), "ScalarAdd"},
{OP_TYPE_CASE(ScalarMul), "ScalarMul"},
{OP_TYPE_CASE(Transpose), "Transpose"},
{OP_TYPE_CASE(BroadcastAdd), "BcastAdd"},
{OP_TYPE_CASE(BroadcastMul), "BcastMul"},
{OP_TYPE_CASE(BroadcastDiv), "BcastDiv"},
{OP_TYPE_CASE(Add), "Add"},
{OP_TYPE_CASE(Sigmoid), "Sigmoid"},
{OP_TYPE_CASE(Tanh), "Tanh"},
{OP_TYPE_CASE(TanhGrad), "TanhGrad"},
{OP_TYPE_CASE(Gelu), "Gelu"},
{OP_TYPE_CASE(GeluGrad), "GeluGrad"},
{OP_TYPE_CASE(Gather), "Gather"},
{OP_TYPE_CASE(BatchGather), "BatchGather"},
{OP_TYPE_CASE(Softmax), "Softmax"},
{OP_TYPE_CASE(SoftmaxGrad), "SoftmaxGrad"},
{OP_TYPE_CASE(LayerNorm), "LayerNorm"},
{OP_TYPE_CASE(LayerNormParamGrad), "LayerNormParamGrad"},
{OP_TYPE_CASE(LayerNormGrad), "LayerNormGrad"},
{OP_TYPE_CASE(ReduceSum), "ReduceSum"},
{OP_TYPE_CASE(ReduceMean), "ReduceMean"},
{OP_TYPE_CASE(AdamModelUpdate), "AdamOptimizer"},
{OP_TYPE_CASE(MaxPooling2D), "MaxPooling2D"},
{OP_TYPE_CASE(AveragePooling2D), "AveragePooling2D"},
{OP_TYPE_CASE(Normalization), "Normalization"},
// {OP_TYPE_CASE(ReduceConcat), "ReduceConcat"},
// {OP_TYPE_CASE(ReduceSplit), "ReduceSplit"},
// TODO(hjchen2)
};
std::string ExtractOpTypeAsString(const OperatorConf &conf) {
const auto it = op_type2string_map.find(conf.op_type_case());
if (it != op_type2string_map.end()) {
return it->second;
} else {
// Return empty if the operator is not in the translation map
return std::string("");
}
}
XrtDevice DeviceTypeToXrtDevice(const DeviceType &device_type) {
switch (device_type) {
case DeviceType::kGPU: return XrtDevice::GPU_CUDA;
case DeviceType::kCPU: return XrtDevice::CPU_X86;
default:
DLOG(WARNING) << "Meet invalid device type (" << device_type
<< "). Use the default xrt device instead.";
return XrtDevice::CPU_X86;
}
}
DeviceType XrtDeviceToDeviceType(const XrtDevice &device) {
if (device == XrtDevice::GPU_CUDA) {
return DeviceType::kGPU;
} else if (device == XrtDevice::CPU_X86) {
return DeviceType::kCPU;
} else {
LOG(FATAL) << "Can not convert xrt device (" << device << ") to device type.";
return DeviceType::kCPU;
}
}
XrtEngine StringToXrtEngine(const std::string &engine) {
if (engine == "XLA") {
return xrt::XrtEngine::XLA;
} else if (engine == "TENSORRT") {
return xrt::XrtEngine::TENSORRT;
} else {
LOG(FATAL) << "Unknown engine: " << engine;
}
}
std::string BlobIdToName(const LogicalBlobId &lbi) {
CHECK_EQ(lbi.has_op_name(), true);
CHECK_EQ(lbi.has_blob_name(), true);
if (lbi.op_name() == "") { return lbi.blob_name(); }
return GenLogicalBlobName(lbi);
}
LogicalBlobId BlobNameToId(const std::string &blob_name) {
size_t pos = blob_name.find('/');
if (pos == std::string::npos) {
return GenLogicalBlobId("/" + blob_name);
} else {
return GenLogicalBlobId(blob_name);
}
}
std::shared_ptr<XrtGraph> BuildXrtGraph(const OpGraph *op_graph) {
return graph_builder::BuildGraph(op_graph);
}
std::shared_ptr<XrtGraph> BuildXrtGraph(const XrtLaunchOpConf::Function &function,
const DeviceType &device_type, const JobDesc &job_desc) {
return graph_builder::BuildGraph(function, device_type, job_desc);
}
void InitXrtConfigurations(const XrtConfig &config) {
if (config.has_use_xla_jit()) { FLAGS_use_xla_jit = config.use_xla_jit(); }
if (config.has_use_tensorrt()) { FLAGS_use_tensorrt = config.use_tensorrt(); }
// Set TensorRT configurations.
if (config.has_tensorrt_config()) {
const XrtConfig::TensorRTConfig &trt_config = config.tensorrt_config();
if (trt_config.has_use_fp16()) { FLAGS_tensorrt_fp16 = trt_config.use_fp16(); }
if (trt_config.has_use_int8()) { FLAGS_tensorrt_int8 = trt_config.use_int8(); }
}
}
bool XrtCompilationEnabled() { return FLAGS_use_xla_jit || FLAGS_use_tensorrt; }
XrtPassOptions CreateDefaultXrtPassOptions(bool train_phase) {
ClusteringOptions options;
options.minimum_nodes = FLAGS_clustering_minimum_nodes;
options.maximum_nodes = FLAGS_clustering_maximum_nodes;
options.strict_clustering = FLAGS_strict_clustering;
options.train_phase = train_phase;
// TODO(hjchen2)
options.engine = (1U << XrtEngineOptionBit::kUseDefault);
if (FLAGS_use_xla_jit) { options.engine |= (1U << XrtEngineOptionBit::kUseXlaJit); }
if (FLAGS_use_tensorrt) { options.engine |= (1U << XrtEngineOptionBit::kUseTensorRT); }
XrtPassOptions xrt_options;
xrt_options.clustering_options = options;
return xrt_options;
}
void RunCompilationTimeXrtPasses(const OpGraph &op_graph, Job *job, bool train_phase) {
auto graph = BuildXrtGraph(&op_graph);
// Create options to run xrt passes.
auto options = CreateDefaultXrtPassOptions(train_phase);
RunXrtPass("MarkClusterId", graph.get(), options);
RunXrtPass("BuildSubGraph", graph.get(), options);
// Rebuild Job
RunXrtPass("RebuildCompiledJob", graph.get(), options, job);
}
} // namespace xrt
} // namespace oneflow
#ifndef ONEFLOW_XRT_API_H_
#define ONEFLOW_XRT_API_H_
#include "oneflow/core/common/shape.h"
#include "oneflow/core/graph/op_graph.h"
#include "oneflow/core/job/job_desc.h"
#include "oneflow/core/operator/op_conf.pb.h"
#include "oneflow/core/register/blob.h"
#include "oneflow/core/register/logical_blob_id.pb.h"
#include "oneflow/xrt/graph/graph.h"
#include "oneflow/xrt/parameter.h"
#include "oneflow/xrt/passes/pass.h"
namespace oneflow {
namespace xrt {
std::string ExtractOpTypeAsString(const OperatorConf &conf);
XrtDevice DeviceTypeToXrtDevice(const DeviceType &device_type);
DeviceType XrtDeviceToDeviceType(const XrtDevice &device);
XrtEngine StringToXrtEngine(const std::string &engine);
std::string BlobIdToName(const LogicalBlobId &lbi);
LogicalBlobId BlobNameToId(const std::string &blob_name);
template<typename T>
inline Shape AsShape(const std::vector<T> &dim_vec) {
return Shape(DimVector(dim_vec.begin(), dim_vec.end()));
}
// Build an xrt graph from launch conf.
std::shared_ptr<XrtGraph> BuildXrtGraph(const XrtLaunchOpConf::Function &function,
const DeviceType &device_type, const JobDesc &job_desc);
// Build an xrt graph from op graph.
std::shared_ptr<XrtGraph> BuildXrtGraph(const OpGraph *op_graph);
void InitXrtConfigurations(const XrtConfig &config);
bool XrtCompilationEnabled();
// Create a default options for xrt pass.
// If environment variables FLAGS_clustering_minimum_nodes,
// FLAGS_clustering_maximum_nodes, and FLAGS_strict_clustering have been set,
// then it will be filled by these values.
XrtPassOptions CreateDefaultXrtPassOptions(bool train_phase = false);
// Run an xrt pass with fixed parameters.
// args:
//   pass    "Pass type, such as \"BuildSubGraph\"."
//   graph   "The XRT graph to which the pass is applied."
//   options "Options that affect the pass results."
inline void RunXrtPass(const std::string &pass, XrtGraph *graph, const XrtPassOptions &options) {
return RunPassImpl(pass, graph, options);
}
// Run an xrt pass with unfixed parameters.
template<typename... Args>
inline void RunXrtPass(const std::string &pass, XrtGraph *graph, const XrtPassOptions &options,
Args &&... args) {
return RunPassImpl(pass, graph, options, std::forward<Args>(args)...);
}
void RunCompilationTimeXrtPasses(const OpGraph &op_graph, Job *job, bool train_phase);
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_API_H_
#ifndef ONEFLOW_XRT_ARGUMENT_H_
#define ONEFLOW_XRT_ARGUMENT_H_
#include <string>
#include "oneflow/core/common/data_type.pb.h"
#include "oneflow/core/common/shape.h"
namespace oneflow {
namespace xrt {
// Each data flow binds two keys: a produce_key and a consume_key.
// For example, given nodes A and B with a data flow named `a_output`
// on the edge A->B:
// node A {
//   in: "a_input"
//   out: "a_output"
// }
// node B {
//   in: "a_output"
//   out: "b_output"
// }
// Here the data flow named `a_output` has a `produce_key` named "out",
// produced by node A, and a `consume_key` named "in", consumed by node B.
struct ArgumentMetaData {
std::string produce_key;
std::string consume_key;
};
// Descriptor of a data flow on a graph edge, including its name, shape and
// data type. It may also carry metadata giving the producing and
// consuming keys.
class Argument {
public:
Argument() : initialized_(false) {}
explicit Argument(const std::string &name) : Argument(name, ArgumentMetaData()) {}
explicit Argument(const std::string &name, const Shape &shape, const DataType &data_type)
: Argument(name, shape, data_type, ArgumentMetaData()) {}
explicit Argument(const std::string &name, const ArgumentMetaData &meta_data)
: arg_name_(name), meta_data_(meta_data), initialized_(true) {}
explicit Argument(const std::string &name, const Shape &shape, const DataType &data_type,
const ArgumentMetaData &meta_data)
: arg_name_(name),
shape_(shape),
data_type_(data_type),
meta_data_(meta_data),
initialized_(true) {}
const std::string &name() const { return arg_name_; }
const Shape &shape() const { return shape_; }
const DataType &data_type() const { return data_type_; }
void set_meta_data(const ArgumentMetaData &meta_data) { meta_data_ = meta_data; }
const ArgumentMetaData &meta_data() const { return meta_data_; }
bool initialized() const { return initialized_; }
bool operator==(const Argument &rhs) const {
return arg_name_ == rhs.arg_name_ && shape_ == rhs.shape_ && data_type_ == rhs.data_type_;
}
private:
std::string arg_name_{""};
Shape shape_;
DataType data_type_;
ArgumentMetaData meta_data_;
bool initialized_ = false;
};
} // namespace xrt
} // namespace oneflow
namespace std {
template<>
struct hash<oneflow::xrt::Argument> {
size_t operator()(const oneflow::xrt::Argument &arg) const {
return std::hash<std::string>()(arg.name());
}
};
} // namespace std
#endif  // ONEFLOW_XRT_ARGUMENT_H_
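A short sketch constructing the `a_output` argument from the comment above; `MakeAOutputArgument`, the shape and the data type are illustrative, and `AsShape` comes from `oneflow/xrt/api.h` in this PR:
```cpp
#include <vector>

#include "oneflow/xrt/api.h"       // for AsShape
#include "oneflow/xrt/argument.h"  // assumed path, from the header guard

namespace oneflow {
namespace xrt {

Argument MakeAOutputArgument() {
  ArgumentMetaData meta;
  meta.produce_key = "out";  // key under which node A produces the blob
  meta.consume_key = "in";   // key under which node B consumes it
  return Argument("a_output", AsShape(std::vector<int64_t>{2, 5}),
                  DataType::kFloat, meta);
}

}  // namespace xrt
}  // namespace oneflow
```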
#include "oneflow/xrt/build_graph.h"
#include "oneflow/xrt/api.h"
namespace oneflow {
namespace xrt {
namespace graph_builder {
const Shape &InputTimeShape(const OpNode *op_node) {
CHECK_NOTNULL(op_node);
return *(op_node->GetInputBlobFastestTimeShape());
}
const Shape &OutputTimeShape(const OpNode *op_node) {
CHECK_NOTNULL(op_node);
return *(op_node->out_blob_time_shape());
}
const SbpParallel &BlobSbpPolicy(const OpNode *op_node, const std::string &name) {
CHECK_NOTNULL(op_node);
LogicalBlobId lbi = BlobNameToId(name);
return op_node->SbpParallel4Lbi(lbi);
}
GraphBuilder::GraphBuilder(const OpGraph *op_graph) : graph_(std::make_shared<XrtGraph>()) {
op_graph->TopoForEachNode([&](const OpNode *op_node) {
const Operator *op = &op_node->op();
XrtNode *node = graph_->AddNode(op->op_conf());
SetupXrtNode(node, op->op_conf());
auto &input_output_keys = node_info_[node].input_output_keys;
for (const std::string &bn : op->output_bns()) {
std::string output = BlobIdToName(op->BnInOp2Lbi(bn));
producers_[output] = node;
input_output_keys[output] = bn;
}
for (const std::string &bn : op->input_bns()) {
std::string input = BlobIdToName(op->BnInOp2Lbi(bn));
input_output_keys[input] = bn;
node_info_[node].inputs.insert(input);
}
node_info_[node].op_node = op_node;
});
}
GraphBuilder::GraphBuilder(const XrtLaunchOpConf::Function &function, const DeviceType &device_type,
const JobDesc &job_desc)
: graph_(std::make_shared<XrtGraph>()) {
for (const auto &arg_conf : function.argument()) {
XrtNode *node = graph_->AddNode(arg_conf);
SetupXrtNode(node, arg_conf);
if (node->IsInArgumentNode()) {
producers_[arg_conf.value()] = node;
} else {
node_info_[node].inputs.insert(arg_conf.value());
}
auto &input_output_keys = node_info_[node].input_output_keys;
input_output_keys = {{arg_conf.value(), "value"}};
}
for (const auto &node_conf : function.node()) {
XrtNode *node = graph_->AddNode(node_conf);
SetupXrtNode(node, node_conf);
auto &input_output_keys = node_info_[node].input_output_keys;
auto op = ConstructOp(node_conf, device_type, &job_desc);
for (const std::string &bn : op->output_bns()) {
std::string output = BlobIdToName(op->BnInOp2Lbi(bn));
producers_[output] = node;
input_output_keys[output] = bn;
}
for (const std::string &bn : op->input_bns()) {
std::string input = BlobIdToName(op->BnInOp2Lbi(bn));
input_output_keys[input] = bn;
node_info_[node].inputs.insert(input);
}
}
}
void GraphBuilder::MakeMetaData(const XrtNode *start, const XrtNode *end,
const std::string &arg_name, ArgumentMetaData *meta_data) {
const auto &prod_keys = node_info_.at(start).input_output_keys;
const auto &cons_keys = node_info_.at(end).input_output_keys;
meta_data->produce_key = prod_keys.at(arg_name);
meta_data->consume_key = cons_keys.at(arg_name);
}
void GraphBuilder::BuildGraphEdges() {
for (const auto &p : node_info_) {
const XrtNode *node = p.first;
const util::Set<std::string> &inputs = p.second.inputs;
for (const std::string &input : inputs) {
const auto &it = producers_.find(input);
if (it != producers_.end() && it->second != node) {
ArgumentMetaData meta;
MakeMetaData(it->second, node, input, &meta);
Argument argument(input, meta);
graph_->Connect(it->second, node, argument);
}
}
}
}
void GraphBuilder::SetupGraphEdges() {
for (XrtEdge *edge : graph_->Edges()) {
const OpNode *src = node_info_.at(edge->start()).op_node;
const OpNode *dst = node_info_.at(edge->end()).op_node;
const std::string &name = edge->argument().name();
if (nullptr == src || nullptr == dst) { continue; }
// Set time shape
std::vector<Shape> time_shape;
time_shape.push_back(OutputTimeShape(src));
time_shape.push_back(InputTimeShape(dst));
edge->SetAttr("time_shape", time_shape);
// Set sbp policy
std::vector<SbpParallel> sbp_policy;
sbp_policy.push_back(BlobSbpPolicy(src, name));
sbp_policy.push_back(BlobSbpPolicy(dst, name));
edge->SetAttr("sbp_policy", sbp_policy);
}
}
std::shared_ptr<XrtGraph> BuildGraph(const XrtLaunchOpConf::Function &function,
const DeviceType &device_type, const JobDesc &job_desc) {
return GraphBuilder(function, device_type, job_desc).Build();
}
std::shared_ptr<XrtGraph> BuildGraph(const OpGraph *op_graph) {
return GraphBuilder(op_graph).Build();
}
} // namespace graph_builder
} // namespace xrt
} // namespace oneflow
#ifndef ONEFLOW_XRT_BUILD_GRAPH_H_
#define ONEFLOW_XRT_BUILD_GRAPH_H_
#include "oneflow/core/graph/op_graph.h"
#include "oneflow/core/job/job_desc.h"
#include "oneflow/core/operator/op_conf.pb.h"
#include "oneflow/xrt/api.h"
#include "oneflow/xrt/graph/graph.h"
#include "oneflow/xrt/types.h"
namespace oneflow {
namespace xrt {
namespace graph_builder {
class GraphBuilder {
public:
GraphBuilder() = delete;
explicit GraphBuilder(const OpGraph *op_graph);
explicit GraphBuilder(const XrtLaunchOpConf::Function &function, const DeviceType &device_type,
const JobDesc &job_desc);
std::shared_ptr<XrtGraph> Build() {
BuildGraphEdges();
SetupGraphEdges();
return graph_;
}
struct NodeInfo {
util::Set<std::string> inputs;
util::Map<std::string, std::string> input_output_keys;
const OpNode *op_node = nullptr;
};
private:
void SetupXrtNode(XrtNode *node, const OperatorConf &node_conf) const {
node->set_name(node_conf.name());
node->set_type(ExtractOpTypeAsString(node_conf));
node->set_device(DeviceTypeToXrtDevice(node_conf.device_type()));
}
void SetupXrtNode(XrtNode *node, const XrtLaunchOpConf::Argument &arg_conf) const {
node->set_name(arg_conf.name());
node->set_type(_ArgumentOpType);
node->set_device(DeviceTypeToXrtDevice(arg_conf.device_type()));
}
void MakeMetaData(const XrtNode *start, const XrtNode *end, const std::string &arg_name,
ArgumentMetaData *meta_data);
void BuildGraphEdges();
void SetupGraphEdges();
private:
std::shared_ptr<XrtGraph> graph_;
util::Map<std::string, const XrtNode *> producers_;
util::Map<const XrtNode *, NodeInfo> node_info_;
};
std::shared_ptr<XrtGraph> BuildGraph(const XrtLaunchOpConf::Function &function,
const DeviceType &device_type, const JobDesc &job_desc);
std::shared_ptr<XrtGraph> BuildGraph(const OpGraph *op_graph);
} // namespace graph_builder
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_BUILD_GRAPH_H_
#include "oneflow/xrt/compilation_cache.h"
namespace oneflow {
namespace xrt {
bool operator==(const Signature &lhs, const Signature &rhs) {
return lhs.builder_name == rhs.builder_name && lhs.device_ordinal == rhs.device_ordinal
&& lhs.entry_shapes == rhs.entry_shapes;
}
size_t SignatureHash::operator()(const Signature &signature) const {
size_t hash_val =
std::hash<std::string>()(signature.builder_name) ^ std::hash<int>()(signature.device_ordinal);
for (const auto &shape : signature.entry_shapes) { hash_val ^= std::hash<Shape>()(shape); }
return hash_val;
}
Signature ComputeSignature(const std::string &name, const int device_ordinal,
const std::vector<Parameter> &entry_params) {
Signature signature;
signature.builder_name = name;
signature.device_ordinal = device_ordinal;
signature.entry_shapes.resize(entry_params.size());
for (int i = 0; i < entry_params.size(); ++i) {
signature.entry_shapes[i] = entry_params[i].shape();
}
return signature;
}
Executable *CompilationCache::GetRecord(const Signature &signature) const {
Executable *record = nullptr;
// std::shared_lock<std::shared_mutex> lock(mutex_);
std::lock_guard<std::mutex> lock(mutex_);
const auto &it = records_.find(signature);
if (it != records_.end()) { record = it->second.get(); }
return record;
}
void CompilationCache::Record(const Signature &signature,
const std::shared_ptr<Executable> &result) {
// std::unique_lock<std::shared_mutex> lock(mutex_);
std::lock_guard<std::mutex> lock(mutex_);
records_.emplace(signature, result);
}
void CompilationCache::Release() {
util::Map<Signature, std::shared_ptr<Executable>, SignatureHash> empty_records;
records_.swap(empty_records);
}
} // namespace xrt
} // namespace oneflow
#ifndef ONEFLOW_XRT_COMPILATION_CACHE_H_
#define ONEFLOW_XRT_COMPILATION_CACHE_H_
#include <memory>
#include <mutex>
#include <string>
#include <vector>
//#include "oneflow/core/common/data_type.pb.h"
#include "oneflow/core/common/shape.h"
#include "oneflow/xrt/executable.h"
#include "oneflow/xrt/parameter.h"
#include "oneflow/xrt/utility/stl.h"
namespace oneflow {
namespace xrt {
struct Signature {
// Builder name
std::string builder_name;
// Device ordinal
int device_ordinal;
// std::vector<Shape> entry_data_types;
// The cached record becomes invalid if the entry shapes change.
std::vector<Shape> entry_shapes;
};
bool operator==(const Signature &lhs, const Signature &rhs);
struct SignatureHash {
size_t operator()(const Signature &signature) const;
};
Signature ComputeSignature(const std::string &name, const int device_ordinal,
const std::vector<xrt::Parameter> &entry_params);
class CompilationCache {
public:
Executable *GetRecord(const Signature &signature) const;
void Record(const Signature &signature, const std::shared_ptr<Executable> &result);
void Release();
private:
// static std::shared_mutex mutex_;
mutable std::mutex mutex_;
util::Map<Signature, std::shared_ptr<Executable>, SignatureHash> records_;
};
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_COMPILATION_CACHE_H_
#ifndef ONEFLOW_XRT_EXECUTABLE_H_
#define ONEFLOW_XRT_EXECUTABLE_H_
#include <vector>
#include "oneflow/xrt/parameter.h"
#include "oneflow/xrt/xrt.pb.h"
namespace oneflow {
namespace xrt {
struct ExecutableRunOptions {
// Specify stream if the engine supports multiple computation streams.
// It will use the default computation stream if `stream` is not set.
void *stream = nullptr;
int32_t device_ordinal = -1;
// Number of host threads to use.
int32_t host_num_threads = -1;
// Limit memory footprint.
int64_t host_memory_limit = -1;
int64_t device_memory_limit = -1;
// Random seed.
int64_t random_seed = -1;
// Maximum batch size for TensorRT.
int32_t max_batch_size = 1;
// Enable TensorRT mixed precision (fp16).
bool tensorrt_fp16 = false;
// Enable TensorRT int8
bool tensorrt_int8 = false;
// Feed the return parameters to reuse their storage while running
// the executable.
std::vector<Parameter> return_params;
};
class Executable {
public:
explicit Executable(const XrtEngine &engine) : engine_(engine) {}
virtual ~Executable() = default;
const XrtEngine &engine() const { return engine_; }
virtual bool Run(const std::vector<Parameter> &inputs, const ExecutableRunOptions &run_options,
bool block_until_done = true) = 0;
bool RunAsync(const std::vector<Parameter> &inputs, const ExecutableRunOptions &run_options) {
return Run(inputs, run_options, false);
}
const std::vector<Parameter> &Results() const { return results_; }
protected:
XrtEngine engine_;
std::vector<Parameter> results_;
};
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_EXECUTABLE_H_
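A hedged sketch of driving an Executable with the options above; the stream handle and the input/output parameters are assumed to come from the launch kernel, CHECK is the usual glog macro, and only fields declared in ExecutableRunOptions are used.
void RunOnStream(Executable *executable, void *stream_handle,
                 const std::vector<Parameter> &inputs,
                 const std::vector<Parameter> &outputs) {
  ExecutableRunOptions options;
  options.stream = stream_handle;   // e.g. a cudaStream_t owned by the runtime
  options.device_ordinal = 0;
  options.max_batch_size = 32;      // consulted by TensorRT engines only
  options.tensorrt_fp16 = true;     // request fp16 mixed precision on TensorRT
  options.return_params = outputs;  // reuse output storage for zero-copy results
  CHECK(executable->Run(inputs, options, /*block_until_done=*/true));
  const std::vector<Parameter> &results = executable->Results();
  (void)results;  // results alias the storage fed via return_params
}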
#ifndef ONEFLOW_XRT_GRAPH_ALGORITHM_H_
#define ONEFLOW_XRT_GRAPH_ALGORITHM_H_
#include "oneflow/xrt/utility/stl.h"
namespace oneflow {
namespace xrt {
namespace algorithm {
template<typename GraphType>
struct GraphTypeTrait {
typedef typename GraphType::NodeType *pNodeType;
typedef typename GraphType::EdgeType *pEdgeType;
};
template<typename NodeType>
struct NodeTypeTrait {
typedef typename NodeType::EdgeType *pEdgeType;
};
template<typename GraphType, typename UserFunc>
inline void TopologyVisit(GraphType &graph, UserFunc func) {
typedef typename GraphTypeTrait<GraphType>::pNodeType pNodeType;
typedef typename GraphTypeTrait<GraphType>::pEdgeType pEdgeType;
util::Set<pNodeType> visited;
util::Queue<pNodeType> visit_queue;
for (pNodeType node : graph.Nodes()) {
if (node->IsSourceNode()) {
visit_queue.push(node);
visited.insert(node);
}
}
auto IsAllInputsVisited = [&](pNodeType node) -> bool {
for (pEdgeType edge : node->in_edges()) {
pNodeType start = edge->start();
if (visited.count(start) == 0) { return false; }
}
return true;
};
while (!visit_queue.empty()) {
pNodeType node = visit_queue.front();
visit_queue.pop();
func(node);  // Run the user function
for (pEdgeType edge : node->out_edges()) {
pNodeType end = edge->end();
if (IsAllInputsVisited(end) && visited.insert(end).second) { visit_queue.push(end); }
}
}
}
template<typename NodeType>
inline bool IsReachable(NodeType *start, NodeType *dest) {
typedef NodeType *pNodeType;
typedef typename NodeTypeTrait<NodeType>::pEdgeType pEdgeType;
util::Set<pNodeType> visited_nodes;
util::Stack<pNodeType> stack;
for (pEdgeType edge : start->out_edges()) { stack.push(edge->end()); }
while (!stack.empty()) {
pNodeType node = stack.top();
stack.pop();
if (node == dest) { return true; }
for (pEdgeType edge : node->out_edges()) {
pNodeType end = edge->end();
if (visited_nodes.insert(end).second) { stack.push(end); }
}
}
return false;
}
} // namespace algorithm
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_GRAPH_ALGORITHM_H_
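For illustration, assuming the XrtGraph/XrtNode trait specializations declared in graph/graph.h and graph/node.h, the visitor can walk a graph in topological order:
#include <iostream>
// Print every node in topological order; works on const graphs through
// GraphTypeTrait<const XrtGraph>.
void DumpTopology(const XrtGraph &graph) {
  algorithm::TopologyVisit(graph, [](const XrtNode *node) {
    std::cout << node->unique_id() << ": " << node->name() << "\n";
  });
}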
#include "oneflow/xrt/graph/graph.h"
#include "oneflow/xrt/argument.h"
namespace oneflow {
namespace xrt {
XrtEdge *XrtGraph::Connect(const XrtNode *start, const XrtNode *end) {
XrtEdge *edge = AddEdge(start, end);
const_cast<XrtNode *>(start)->AddOutEdge(edge);
const_cast<XrtNode *>(end)->AddInEdge(edge);
return edge;
}
XrtEdge *XrtGraph::Connect(const XrtNode *start, const XrtNode *end, const Argument &arg) {
XrtEdge *edge = Connect(start, end);
edge->SetArgument(arg);
return edge;
}
void XrtGraph::Disconnect(const XrtEdge *edge) {
const_cast<XrtNode *>(edge->start())->EraseOutEdge(edge);
const_cast<XrtNode *>(edge->end())->EraseInEdge(edge);
}
XrtNode *XrtGraph::Node(int64_t node_id) {
DCHECK_LT(node_id, nodes_.size());
return nodes_.at(node_id);
}
const XrtNode *XrtGraph::Node(int64_t node_id) const {
DCHECK_LT(node_id, nodes_.size());
return nodes_.at(node_id);
}
XrtNode *XrtGraph::AddNode() {
std::unique_ptr<XrtNode> node(new XrtNode);
node->unique_id_ = nodes_.size();
nodes_.push_back(node.get());
allocated_nodes_.push_back(std::move(node));
return nodes_.back();
}
XrtNode *XrtGraph::AddNode(const google::protobuf::Message &param) {
std::unique_ptr<XrtNode> node(new XrtNode(param));
node->unique_id_ = nodes_.size();
nodes_.push_back(node.get());
allocated_nodes_.push_back(std::move(node));
return nodes_.back();
}
XrtEdge *XrtGraph::AddEdge() {
std::unique_ptr<XrtEdge> edge(new XrtEdge);
edge->unique_id_ = edges_.size();
edges_.push_back(edge.get());
allocated_edges_.push_back(std::move(edge));
return edges_.back();
}
XrtEdge *XrtGraph::AddEdge(const XrtNode *start, const XrtNode *end) {
std::unique_ptr<XrtEdge> edge(new XrtEdge(start, end));
edge->unique_id_ = edges_.size();
edges_.push_back(edge.get());
allocated_edges_.push_back(std::move(edge));
return edges_.back();
}
XrtGraph *XrtGraph::AddSubgraph(int64_t node_id) {
std::unique_ptr<XrtGraph> subgraph(new XrtGraph);
nodes_[node_id]->sub_graph_ = subgraph.get();
subgraphs_[node_id] = std::move(subgraph);
return nodes_.at(node_id)->sub_graph_;
}
std::vector<Argument> XrtGraph::Arguments() const {
std::vector<Argument> arguments;
for (const XrtEdge *edge : edges_) {
if (edge && edge->argument().initialized()) { arguments.push_back(edge->argument()); }
}
return arguments;
}
std::string XrtGraph::ToDot() const {
std::stringstream ost;
ost << "digraph {\n";
for (const XrtNode *node : this->Nodes()) {
if (node == nullptr) { continue; }  // removed nodes leave nullptr slots
ost << "\"" << node->unique_id() << "\" [label=\"" << node->name() << "\"]\n";
}
for (const XrtEdge *edge : edges_) {
if (edge == nullptr) { continue; }  // removed edges leave nullptr slots
ost << "\"" << edge->start()->unique_id() << "\" -> "
<< "\"" << edge->end()->unique_id() << "\"\n";
}
ost << "}";
return ost.str();
}
} // namespace xrt
} // namespace oneflow
#ifndef ONEFLOW_XRT_GRAPH_GRAPH_H_
#define ONEFLOW_XRT_GRAPH_GRAPH_H_
#include <google/protobuf/message.h>
#include <memory>
#include <string>
#include <vector>
#include "oneflow/xrt/argument.h"
#include "oneflow/xrt/graph/algorithm.h"
#include "oneflow/xrt/graph/node.h"
#include "oneflow/xrt/utility/attribute_map.h"
namespace oneflow {
namespace xrt {
class XrtGraph : public util::AttributeMap {
public:
XrtGraph() = default;
virtual ~XrtGraph() = default;
XrtNode *Node(int64_t node_id);
const XrtNode *Node(int64_t node_id) const;
XrtNode *AddNode();
XrtNode *AddNode(const google::protobuf::Message &param);
XrtEdge *AddEdge();
XrtEdge *AddEdge(const XrtNode *start, const XrtNode *end);
XrtEdge *Connect(const XrtNode *start, const XrtNode *end);
XrtEdge *Connect(const XrtNode *start, const XrtNode *end, const Argument &arg);
void Disconnect(const XrtEdge *edge);
// Create a subgraph for the node whose unique id is `node_id`
XrtGraph *AddSubgraph(int64_t node_id);
const std::vector<XrtNode *> &Nodes() const { return nodes_; }
std::vector<XrtNode *> &Nodes() { return nodes_; }
const std::vector<XrtEdge *> &Edges() const { return edges_; }
std::vector<XrtEdge *> &Edges() { return edges_; }
std::string ToDot() const;
std::vector<Argument> Arguments() const;
protected:
std::vector<XrtNode *> nodes_;
// All allocated nodes in the graph. A node's unique id equals its index in
// this vector. An entry in `nodes_` can be nullptr because a node keeps its
// slot even after it has been removed from the graph.
std::vector<std::unique_ptr<XrtNode>> allocated_nodes_;
std::vector<XrtEdge *> edges_;
// All allocated edges in the graph. An edge's unique id equals its index in
// this vector, and an entry in `edges_` can likewise be nullptr.
std::vector<std::unique_ptr<XrtEdge>> allocated_edges_;
// All allocated subgraphs. The map key is the node's unique id, and the value
// is the subgraph owned by that node.
util::Map<int64_t, std::unique_ptr<XrtGraph>> subgraphs_;
};
namespace algorithm {
template<>
struct GraphTypeTrait<XrtGraph> {
typedef XrtNode *pNodeType;
typedef XrtEdge *pEdgeType;
};
template<>
struct GraphTypeTrait<const XrtGraph> {
typedef const XrtNode *pNodeType;
typedef const XrtEdge *pEdgeType;
};
} // namespace algorithm
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_GRAPH_GRAPH_H_
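A minimal construction sketch; node names are illustrative, and in the real builder nodes are created from operator protos via AddNode(param):
#include <iostream>
void BuildToyGraph() {
  XrtGraph graph;
  XrtNode *producer = graph.AddNode();
  producer->set_name("producer");
  XrtNode *consumer = graph.AddNode();
  consumer->set_name("consumer");
  // No Argument attached, so the new edge reports IsControlEdge() == true.
  graph.Connect(producer, consumer);
  std::cout << graph.ToDot() << std::endl;
}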
#include "absl/strings/str_split.h"
#include "oneflow/xrt/graph/algorithm.h"
#include "oneflow/xrt/graph/node.h"
namespace oneflow {
namespace xrt {
void XrtNode::AddInEdge(const XrtEdge *edge) { in_edges_.push_back(const_cast<XrtEdge *>(edge)); }
void XrtNode::AddOutEdge(const XrtEdge *edge) { out_edges_.push_back(const_cast<XrtEdge *>(edge)); }
void XrtNode::EraseInEdge(const XrtEdge *edge) {
in_edges_.remove_if(
[&](const XrtEdge *e) -> bool { return e->unique_id() == edge->unique_id(); });
}
void XrtNode::EraseOutEdge(const XrtEdge *edge) {
out_edges_.remove_if(
[&](const XrtEdge *e) -> bool { return e->unique_id() == edge->unique_id(); });
}
bool XrtNode::IsSourceNode() const { return in_edges_.empty(); }
bool XrtNode::IsFinishNode() const { return out_edges_.empty(); }
bool XrtNode::IsArgumentNode() const { return type_ == _ArgumentOpType; }
bool XrtNode::IsInArgumentNode() const {
return IsArgumentNode() && absl::StartsWith(name_, _XrtInArgumentPrefix);
}
bool XrtNode::IsOutArgumentNode() const {
return IsArgumentNode() && absl::StartsWith(name_, _XrtOutArgumentPrefix);
}
bool XrtNode::IsReachable(const XrtNode &dst_node) const {
return algorithm::IsReachable(this, &dst_node);
}
} // namespace xrt
} // namespace oneflow
#ifndef ONEFLOW_XRT_GRAPH_NODE_H_
#define ONEFLOW_XRT_GRAPH_NODE_H_
#include <google/protobuf/message.h>
#include <string>
#include "oneflow/xrt/argument.h"
#include "oneflow/xrt/graph/algorithm.h"
#include "oneflow/xrt/types.h"
#include "oneflow/xrt/utility/attribute_map.h"
#include "oneflow/xrt/utility/stl.h"
namespace oneflow {
namespace xrt {
class XrtNode;
class XrtGraph;
class XrtEdge : public util::AttributeMap {
public:
XrtNode *start() const { return start_; }
XrtNode *end() const { return end_; }
const Argument &argument() const { return arg_; }
Argument &argument() { return arg_; }
void SetStartNode(const XrtNode *start) { start_ = const_cast<XrtNode *>(start); }
void SetEndNode(const XrtNode *end) { end_ = const_cast<XrtNode *>(end); }
void SetArgument(const Argument &arg) { arg_ = arg; }
int64_t unique_id() const { return unique_id_; }
bool IsControlEdge() const { return !arg_.initialized(); }
virtual ~XrtEdge() = default;
friend class XrtGraph;
protected:
XrtEdge() = default;
XrtEdge(const XrtNode *start, const XrtNode *end)
: start_(const_cast<XrtNode *>(start)), end_(const_cast<XrtNode *>(end)) {}
protected:
XrtNode *start_ = nullptr;
XrtNode *end_ = nullptr;
Argument arg_;
int64_t unique_id_ = -1;
};
// XRT node
class XrtNode : public util::AttributeMap {
public:
const util::List<XrtEdge *> &in_edges() const { return in_edges_; }
const util::List<XrtEdge *> &out_edges() const { return out_edges_; }
util::List<XrtEdge *> &in_edges() { return in_edges_; }
util::List<XrtEdge *> &out_edges() { return out_edges_; }
void AddInEdge(const XrtEdge *edge);
void AddOutEdge(const XrtEdge *edge);
void EraseInEdge(const XrtEdge *edge);
void EraseOutEdge(const XrtEdge *edge);
void ClearInEdges() { in_edges_.clear(); }
void ClearOutEdges() { out_edges_.clear(); }
int64_t unique_id() const { return unique_id_; }
const XrtDevice &device() const { return device_; }
const std::string &type() const { return type_; }
const std::string &name() const { return name_; }
const google::protobuf::Message &param() const { return *param_; }
XrtGraph *sub_graph() const { return sub_graph_; }
void set_device(const XrtDevice &device) { device_ = device; }
void set_type(const std::string &type) { type_ = type; }
void set_name(const std::string &name) { name_ = name; }
bool IsSourceNode() const;
bool IsFinishNode() const;
bool IsArgumentNode() const;
bool IsInArgumentNode() const;
bool IsOutArgumentNode() const;
bool IsReachable(const XrtNode &dst_node) const;
virtual ~XrtNode() {}
friend class XrtGraph;
protected:
XrtNode() = default;
// An XrtNode can only be created through XrtGraph
explicit XrtNode(const google::protobuf::Message &param)
: param_(&param), unique_id_(-1), sub_graph_(nullptr) {}
protected:
util::List<XrtEdge *> in_edges_;
util::List<XrtEdge *> out_edges_;
const google::protobuf::Message *param_ = nullptr;
// Each node has a unique id equal to its index in the graph's nodes
int64_t unique_id_ = -1;
// Backend device, such as X86, CUDA or ARM
XrtDevice device_;
// Operator type string, such as "Conv2d" or "Matmul"
std::string type_;
// Operator name
std::string name_;
// A subgraph is built for XRT launch nodes. Note that `sub_graph_` is created
// and owned by the graph, not by the node
XrtGraph *sub_graph_ = nullptr;
};
namespace algorithm {
template<>
struct NodeTypeTrait<XrtNode> {
typedef XrtEdge *pEdgeType;
};
template<>
struct NodeTypeTrait<const XrtNode> {
typedef const XrtEdge *pEdgeType;
};
} // namespace algorithm
} // namespace xrt
} // namespace oneflow
#endif // ONEFLOW_XRT_GRAPH_NODE_H_
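And a small sketch of the reachability predicate in use; a clustering pass can combine it with the argument-node checks when deciding which nodes are safe to reorder:
// Two nodes are independent iff neither can reach the other; only then may a
// pass schedule them in either order.
bool IndependentNodes(const XrtNode &a, const XrtNode &b) {
  return !a.IsReachable(b) && !b.IsReachable(a);
}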