  1. 10 Jun 2023 (1 commit)
  2. 07 Jun 2023 (1 commit)
  3. 31 May 2023 (1 commit)
  4. 16 May 2023 (1 commit)
    • [CPU] Support Intel CPU inference (#3041) · 1f72082f
      Ma, Guokai authored
      * add fallback path for kernels used in megatron
      
      * temporary numactl WA for SPR 56core
      
      * adapt core allocation according to number of ranks
      
      * add switch to turn on numactl
      
      * detect number of cores on the system
      
      * allow selecting a subset of the cores on the system to bind
      
      * remove unneeded changes
      
      * add ccl backend
      
      * change nccl to ccl
      
      * remove unused code
      
      * add comm/ccl to ops
      
      * initial ccl comm support
      
      * first broadcast case passed
      
      * add CCL_Backend to DeepSpeed
      
      * support comm timer for CPU
      
      * support barrier for comm backend
      
      * support specifying the master address from the deepspeed command line
      
      * support pytorch 2.0
      
      * remove 'block' from api
      
      * Tweak for debug
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Remove unnecessary directory
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Add bf16 kernel support for inference
      
      * Add temporary torch implementation for cpu inference
      
      * Add softmax ops cpu fallback for inference
      
      * bind cores to numa domain as well
      
      * merge latest change in gma/numactl
      
      * initial bf16 kernel support with fallback path
      
      * initial fallback path for bloom kernel injection
      
      * fix softmax attn mask
      
      * check KMP_AFFINITY to avoid conflict with numactl
      
      * New CCLBackend which utilizes TorchBackend for initialization
      
      * rollback last change because there is result error
      
      * fix issue where TP could not work with the bloom injection policy (a usage sketch follows this entry):

      injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
      
      * Use TorchBackend to initialize CCLBackend, make behavior consistent
      
      * remove comm under deepspeed/ops
      
      * add license header
      
      * code clean up
      
      * fix format issue
      
      * remove magic number in main address
      
      * add caching support but not turn on by default
      
      * change name of inference_cuda_module to inference_module
      
      * Check for is_synchronized_device in accelerator before getting Event
      
      * fix typo
      
      * Fix fallback path of softmax kernel on CUDA device for BF16 data type: because CUDA tril does not support BF16, enforce fp32
      
      * add cpu backend files
      
      * change CPU_Accelerator op_builder_dir
      
      * remove cpu_kernel_path
      
      * using CPU_Accelerator on non-cuda device
      
      * fix deepspeed.op_builder => deepspeed.ops.op_builder
      
      * add alias for num_gpus: num_accelerators
      
      * allow loading cpu_builder in build stage
      
      * Assume cuda available if torch not installed
      
      * add oneccl_binding_pt to requirements
      
      * move oneccl-binding-pt to separate requirements-cpu.txt
      
      * add missing file
      
      * use dependency_links in setuptools.setup() call for additional dependency links
      
      * install oneccl_bind_pt in workflows
      
      * change oneccl_bind_pt's version from 1.13 to 2.0
      
      * use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used
      
      * Add indicator for Accelerator used
      
      * change foo.c to foo.cpp
      
      * exclude 'cpu' directory in CUDA op builder reflection
      
      * add a cpu-inference workflow
      
      * run cpu-inference workflow on self-hosted instance
      
      * change cpu runs-on node to v100 node
      
      * print out python version in workflow
      
      * add verbose in pip command to understand oneccl_bind_pt install issue
      
      * update cpu-inference workflow
      
      * add a stage to detect instance instruction sets
      
      * add back bf16 support for CPU inference
      
      * enable autoTP for bloom
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * update workflow to detect cpu instruction sets
      
      * temporary WA for Intel Extension for PyTorch AVX2 instruction set detection
      
      * change cpu-inference workflow machine to ubuntu-20.04
      
      * add sharded checkpoint loading for the AutoTP path to reduce peak memory in the initialization stage
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * enable policy for llama
      
      * use a special build ipex to test avx2 detection fix
      
      * fix format
      
      * fix test failure issue
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix gptj sharded checkpoint loading problem
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * return a not-implemented builder in get_op_builder in cpu_backend
      
      * support cpu device in tests
      
      * use cpuinfo to extract number of CPUs
      
      * use ~/tmp as transformer cache rather than /blob/
      
      * Add support for mpich launcher with prefer_deepspeed_comm
      
      * add missing modification in accelerator
      
      * enable IMPI launcher
      
      * remove unused file and fix formatting
      
      * clean up ccl.cpp
      
      * Less confusing error message when certain op builders are not implemented
      
      * Fix license header
      
      * Add license header
      
      * add license headers
      
      * add license header
      
      * fix cuda specific code in test
      
      * update CPU workflow
      
      * use numactl to bind to core
      
      * allow bind_cores_to_rank in multi-node impi runner
      
      * fix format error
      
      * Remove InferenceBuilder
      
      * fix format error in numa.py
      
      * check whether op is in installed ops in ds_report.py
      
      * allow overriding the accelerator with DS_ACCELERATOR='cuda', 'cpu' or 'xpu'
      
      * lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
      
      * put short path in the beginning in real_accelerator.py
      
      * device_count returns number of NUMA nodes
      
      * fix typo
      
      * install numactl in cpu workflow
      
      * Follow comments
      
      * Better implementation of device_count() and current_device()
      
      * remove dependency_link for Intel Extension for DeepSpeed
      
      * check is_synchronized_device in timer only once
      
      * remove env mapping WA in cpu_accelerator
      
      * fix duplicate definition
      
      * fix format error
      
      * refine ccl backend selection
      
      * move comments to the right place
      
      * remove prefer_deepspeed_comm, use CCLBackend by default
      
      * refactor fallback path
      
      * Fix execution failure in kernel injection path
      
      * do not refactor the kernel injection fallback path in residual_add because it contains a function call with side effects
      
      * guard residual_add fallback path with environ DS_KI_FALLBACK=True
      
      * fix format error
      
      * add test for allreduce on CPU workflow
      
      * fix format error
      
      * Fall back to TorchBackend if CCLBackend kernels are not implemented
      
      * Update Intel Extension for PyTorch installation link
      
      * Don't specify version number of Intel Extension for PyTorch
      
      * install oneCCL for CCLBackend
      
      * fix link path for CPU comm kernels
      
      * fix source oneCCL environment
      
      * source oneCCL env before run UT
      
      * Give more specific instructions when CCL_ROOT is not defined
      
      ---------
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
      Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
      Co-authored-by: baodii <di.bao@intel.com>
      Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: jianan-gu <jianan.gu@intel.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
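      A minimal usage sketch of what this commit enables: CPU inference through
      the CCL backend with the bloom injection policy fixed above. The model
      name, world-size handling and launch flags are illustrative assumptions,
      not code from this PR; it presumes intel_extension_for_pytorch and the
      oneCCL bindings are installed, which is how CPU_Accelerator gets selected.

      import os
      import torch
      import deepspeed
      from transformers import AutoModelForCausalLM
      from transformers.models.bloom.modeling_bloom import BloomBlock

      # CCLBackend (added here) is initialized via TorchBackend and falls back
      # to TorchBackend for collectives it does not implement.
      deepspeed.init_distributed(dist_backend="ccl")

      model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m",
                                                   torch_dtype=torch.bfloat16)
      engine = deepspeed.init_inference(
          model,
          mp_size=int(os.getenv("WORLD_SIZE", "1")),
          dtype=torch.bfloat16,  # bf16 kernels with CPU fallback path
          injection_policy={BloomBlock: ("self_attention.dense",
                                         "mlp.dense_4h_to_h")},
      )

      Launched, for example, with "deepspeed --bind_cores_to_rank run.py" so
      each rank is pinned to its own cores and NUMA domain (see #2881 below).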
  5. 22 Apr 2023 (1 commit)
  6. 13 Apr 2023 (1 commit)
    • [CPU support] Optionally bind each rank to different cores on host (#2881) · 0b5252bb
      Ma, Guokai authored
      * add fallback path for kernels used in megatron
      
      * temporary numactl WA for SPR 56core
      
      * adapt core allocation according to number of ranks
      
      * add switch to turn on numactl
      
      * detect number of cores on the system
      
      * allow selecting a subset of the cores on the system to bind
      
      * remove unneeded changes
      
      * use current_env to set OMP_NUM_THREADS in subprocess
      
      * add test for ds_arguments
      
      * change --bind_cores_to_rank option to store_true
      
      * add test for parse_range_list
      
      * add comment for parse range list
      
      * add test for parse range list, rewrite parse_range_list (a sketch of such a parser follows this entry)
      
      * fix format error
      
      * fix format
      
      * add -m parameter to numactl when necessary
      
      * Check KMP_AFFINITY to avoid conflict with numactl
      
      * fix format
      
      * negative case for parse_range_list
      
      * detect whether numactl is installed before using numactl to bind cores

      * check for numactl with the distro's package manager
      
      ---------
      Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
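      A sketch of the range-list parsing tested in this commit, plus how the
      parsed cores might feed numactl. parse_range_list is the PR's real test
      target, but the body below is a reconstruction under assumed semantics
      ("0-7,16-23"-style strings), not the PR's code.

      def parse_range_list(spec):
          """Parse a core-list string such as "0-7,16-23" into sorted core ids."""
          cores = set()
          for part in spec.split(","):
              part = part.strip()
              if "-" in part:
                  lo, hi = part.split("-")
                  cores.update(range(int(lo), int(hi) + 1))
              elif part:
                  cores.add(int(part))
          return sorted(cores)

      # Hypothetical use: prefix a rank's command with numactl, adding -m for
      # NUMA memory binding when necessary, as the entries above describe.
      cores = parse_range_list("0-27")
      numactl_prefix = ["numactl", "-C", ",".join(map(str, cores)), "-m", "0"]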
  7. 31 Mar 2023 (1 commit)
  8. 27 Mar 2023 (1 commit)
  9. 02 Mar 2023 (1 commit)
    • Add MPICH Multinode Runner (#2839) · 8d53ac0c
      mzl authored
      * MPICH support (a condensed runner sketch follows this entry)
      
      * MPICH changes
      
      * MPICH changes
      
      * MPICH changes
      
      * MPICH changes
      
      * accelerator runtime modifications
      
      * Accelerator runtime changes
      
      * Accelerator runtime modifications
      
      * Remove redundant print from single node
      
      * Move hostfile to tmp
      
      * Code cleanup for MPICH class
      
      * Code cleanup, rm whitespace
      
      * Removing mpiexec environment check details
      
      * No need for tmp hostfile as it is passed directly
      
      * Remove debugging comments
      
      * rm print statement
      
      * Revert comm changes as WA not needed
      
      * Use MPICHRunner name for class
      
      * Use MPICHRunner as class name
      
      * No need to use args.force_multi and args.launcher.

      These should be set in the DeepSpeedExamples gpt-3.6b .sh script as:
      launcher=MPICH
      run_cmd="deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"
      
      * Adhere to code pattern
      
      * Rm empty lines in MPICHRunner class
      
      * Uncomment check for num nodes and workers when hostfile_deepspeed is used in gpt-3.6b.sh
      
      * pass MPICH hostfile through launcher_args in gpt-3.6b.sh
      
      * Clean code and remove args hostfile
      
      * fix merge
      
      * fix merge
      
      ---------
      Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
      
      * clean up and fix format
      
      * add ut
      
      ---------
      Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
      Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
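      A condensed sketch of what an MPICH multinode runner assembles (the real
      MPICHRunner lives in deepspeed/launcher/multinode_runner.py; this class
      body and method name are illustrative, though -n/-ppn/-f are standard
      MPICH Hydra options):

      import sys

      class MPICHRunnerSketch:
          def __init__(self, total_procs, procs_per_node, hostfile):
              self.total_procs = total_procs
              self.procs_per_node = procs_per_node
              self.hostfile = hostfile

          def get_cmd(self, user_script, user_args):
              # mpiexec launches total_procs ranks, procs_per_node per host,
              # reading the host list from the hostfile
              return ["mpiexec", "-n", str(self.total_procs),
                      "-ppn", str(self.procs_per_node),
                      "-f", self.hostfile,
                      sys.executable, "-u", user_script] + list(user_args)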
  10. 26 Jan 2023 (1 commit)
    • Abstract accelerator (step 3) (#2677) · 98cc35b6
      Ma, Guokai authored
      * Integrate accelerator abstraction interface into deepspeed/ (a usage sketch follows this entry)
      
      * Fix error message in fp16/fused_optimizer
      
      * fix error message in fp16/unfused_optimizer.py
      
      * assign get_accelerator().pin_memory() result to input Tensor name
      
      * no need to check cuda and whether nvtx is supported
      
      * move try-except into innermost block
      
      * call Event() and Stream() in get_accelerator() for data type
      
      * Make Stream and Event properties of the abstract interface so they can be used as data types in deepspeed
      
      * Apply op_builder backend api change from #2705 from @jeffra
      
      * fix tests where Builder NAME is used
      
      * keep original ...Builder.NAME interface instead of ...Builder().NAME interface
      
      * fix builder closure for installation
      
      * fix randomltd builder
      
      * add comments to clarify create_op_builder and get_op_builder
      
      * fix compatibility with pip install -e
      Co-authored-by: Cheng Li <pistasable@gmail.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
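      The pattern this commit applies across deepspeed/ is replacing direct
      torch.cuda calls with the accelerator interface. get_accelerator,
      pin_memory, Stream and Event are the real interface; the snippet itself
      is only a sketch of the idiom.

      import torch
      from deepspeed.accelerator import get_accelerator

      accel = get_accelerator()
      buf = accel.pin_memory(torch.empty(1024))  # assign the result back, as fixed above
      stream = accel.Stream()                    # Stream/Event exposed as properties,
      start = accel.Event(enable_timing=True)    # so they can serve as data types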
  11. 04 Jan 2023 (1 commit)
  12. 21 Dec 2022 (1 commit)
  13. 20 Dec 2022 (1 commit)
  14. 05 Nov 2022 (1 commit)
  15. 27 Oct 2022 (1 commit)
  16. 14 Oct 2022 (1 commit)
  17. 30 Jul 2022 (1 commit)
  18. 28 Jul 2022 (1 commit)
    • Trajepl/nebula ckpt engine (#2085) · e669aaf5
      trajep authored
      * enable checkpoint engine
      
      * separated nebula config
      
      * add __init__.py for nebula importing
      
      * linter fix
      
      * fix: ds_config is None
      
      * fix: ds config
      
      * fix: get sd loader fix
      
      * align the API with torch raw code
      
      * linter fix
      
      * remove duplicate tag params
      
      * make checkpoint_engine a required arg
      
      * fix args
      
      * extract parameters out to config
      
      * fix: load state dict
      
      * separate load engine
      
      * linter fix
      
      * extract checkpoint engine into an abstract class (a sketch of its shape follows this entry)
      
      * linter fix
      
      * construct function args fix
      
      * add docs for dev/customers
      
      * linter fix
      
      * remove load engine
      
      * print->log_dist
      
      * linter fix
      
      * add tag flag to distinguish the loading order
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
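      The shape of the abstract class this commit extracts. The method names
      follow DeepSpeed's CheckpointEngine interface; treat the body as a
      sketch rather than the exact file contents.

      class CheckpointEngine(object):
          def __init__(self, config_params=None):
              pass

          def create(self, tag):
              # called once per checkpoint tag before any save()
              raise NotImplementedError

          def save(self, state_dict, path):
              raise NotImplementedError

          def load(self, path, map_location=None):
              raise NotImplementedError

          def commit(self, tag):
              # signals that every file for `tag` is persisted; an asynchronous
              # engine such as Nebula can use this to seal the version
              raise NotImplementedError

      A torch-based engine implements save/load with torch.save/torch.load,
      while the Nebula engine is selected through the separated "nebula"
      block in ds_config mentioned above.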
  19. 19 May 2022 (1 commit)
  20. 15 Mar 2022 (1 commit)
  21. 09 Feb 2022 (1 commit)
  22. 28 Jan 2022 (1 commit)
  23. 27 Jan 2022 (1 commit)
  24. 20 Jan 2022 (1 commit)
  25. 13 Jan 2022 (1 commit)
  26. 18 Nov 2021 (1 commit)
  27. 13 Nov 2021 (1 commit)
  28. 03 Nov 2021 (1 commit)
  29. 02 Oct 2021 (1 commit)
  30. 21 Apr 2021 (1 commit)
  31. 19 Apr 2021 (2 commits)
  32. 14 Apr 2021 (1 commit)
  33. 07 Apr 2021 (1 commit)
  34. 17 Mar 2021 (1 commit)
  35. 10 Mar 2021 (1 commit)
  36. 09 Mar 2021 (1 commit)
  37. 16 Jan 2021 (1 commit)
  38. 18 Dec 2020 (1 commit)
  39. 10 Sep 2020 (1 commit)