  1. 10 Jun 2023 (1 commit)
  2. 07 Jun 2023 (1 commit)
  3. 31 May 2023 (1 commit)
  4. 16 May 2023 (1 commit)
    • [CPU] Support Intel CPU inference (#3041) · 1f72082f
      Ma, Guokai authored
      * add fallback path for kernels used in megatron
      
      * temporary numactl WA for SPR 56core
      
      * adapt core allocation according to number of ranks
      
      * add switch to turn on numactl
      
      * detect number of cores on the system
      
      * allow selecting a subset of the cores on the system to bind
      
      * remove unneeded changes
      
      * add ccl backend
      
      * change nccl to ccl
      
      * remove unused code
      
      * add comm/ccl to ops
      
      * initial ccl comm support
      
      * first broadcast case passed
      
      * add CCL_Backend to DeepSpeed
      
      * support comm timer for CPU
      
      * support barrier for comm backend
      
      * support specifying the master address from the deepspeed command line
      
      * support pytorch 2.0
      
      * remove 'block' from api
      
      * Tweak for debug
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Remove unnecessary directory
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Add bf16 kernel support for inference
      
      * Add temporary torch implementation for cpu inference
      
      * Add softmax ops cpu fallback for inference
      
      * bind cores to numa domain as well
      
      * merge latest change in gma/numactl
      
      * initial bf16 kernel support with fallback path
      
      * initial fallback path for bloom kernel injection
      
      * fix softmax attn mask
      
      * check KMP_AFFINITY to avoid conflict with numactl
      
      * New CCLBackend which utilizes TorchBackend for initialization
      
      * rollback last change because there is result error
      
      * fix issue where TP could not work with the bloom injection policy (a usage sketch follows this entry):

      injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
      
      * Use TorchBackend to initialize CCLBackend, make behavior consistent
      
      * remove comm under deepspeed/ops
      
      * add license header
      
      * code clean up
      
      * fix format issue
      
      * remove magic number in main address
      
      * add caching support but not turn on by default
      
      * change name of inference_cuda_module to inference_module
      
      * Check for is_synchronized_device in accelerator before getting Event
      
      * fix typo
      
      * Fix fallback path of softmax kernel on CUDA device for BF16 data type: because CUDA tril does not support BF16, enforce fp32
      
      * add cpu backend files
      
      * change CPU_Accelerator op_builder_dir
      
      * remove cpu_kernel_path
      
      * using CPU_Accelerator on non-cuda device
      
      * fix deepspeed.op_builder => deepspeed.ops.op_builder
      
      * add alias for num_gpus: num_accelerators
      
      * allow loading cpu_builder in build stage
      
      * Assume cuda available if torch not installed
      
      * add oneccl_binding_pt to requirements
      
      * move oneccl-binding-pt to separate requirements-cpu.txt
      
      * add missing file
      
      * use dependency_links in setuptools.setup() call for additional dependency links
      
      * install oneccl_bind_pt in workflows
      
      * change oneccl_bind_pt's version from 1.13 to 2.0
      
      * use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used
      
      * Add indicator for Accelerator used
      
      * change foo.c to foo.cpp
      
      * exclude 'cpu' directory in CUDA op builder reflection
      
      * add a cpu-inference workflow
      
      * run cpu-inference workflow on self-hosted instance
      
      * change cpu runs-on node to v100 node
      
      * print out python version in workflow
      
      * add verbose in pip command to understand oneccl_bind_pt install issue
      
      * update cpu-inference workflow
      
      * add a stage to detect instance instruction sets
      
      * add back bf16 support for CPU inference
      
      * enable autoTP for bloom
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * update workflow to detect cpu instruction sets
      
      * temporary WA for Intel Extension for PyTorch AVX2 instruction set detection
      
      * change cpu-inference workflow machine to ubuntu-20.04
      
      * add sharded checkpoint loading for the AutoTP path to reduce peak memory in the initialization stage
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * enable policy for llama
      
      * use a special build ipex to test avx2 detection fix
      
      * fix format
      
      * fix test failure issue
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix gptj sharded checkpoint loading problem
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * return a not-implemented builder in get_op_builder in cpu_backend
      
      * support cpu device in tests
      
      * use cpuinfo to extract number of CPUs
      
      * use ~/tmp as transformer cache rather than /blob/
      
      * Add support for mpich launcher with prefer_deepspeed_comm
      
      * add missing modification in accelerator
      
      * enable IMPI launcher
      
      * remove unused file and fix formatting
      
      * clean up ccl.cpp
      
      * Less confusing error message when certain op builders are not implemented
      
      * Fix license header
      
      * Add license header
      
      * add license headers
      
      * add license header
      
      * fix cuda specific code in test
      
      * update CPU workflow
      
      * use numactl to bind to core
      
      * allow bind_cores_to_rank in multi-node impi runner
      
      * fix format error
      
      * Remove InferenceBuilder
      
      * fix format error in numa.py
      
      * check whether op is in installed ops in ds_report.py
      
      * allow overriding the accelerator with DS_ACCELERATOR='cuda', 'cpu' or 'xpu'
      
      * lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
      
      * put short path in the beginning in real_accelerator.py
      
      * device_count returns number of NUMA nodes
      
      * fix typo
      
      * install numactl in cpu workflow
      
      * Follow comments
      
      * Better implementation of device_count() and current_device()
      
      * remove dependency_link for Intel Extension for DeepSpeed
      
      * check is_synchronized_device in timer only once
      
      * remove env mapping WA in cpu_accelerator
      
      * fix duplicate definition
      
      * fix format error
      
      * refine ccl backend selection
      
      * move comments to the right place
      
      * remove prefer_deepspeed_comm, use CCLBackend by default
      
      * refactor fallback path
      
      * Fix execution failure in kernel injection path
      
      * do not refactor the kernel injection fallback path in residual_add because it contains a function call with side effects
      
      * guard residual_add fallback path with environ DS_KI_FALLBACK=True
      
      * fix format error
      
      * add test for allreduce on CPU workflow
      
      * fix format error
      
      * Fall back to TorchBackend if CCLBackend kernels are not implemented
      
      * Update Intel Extension for PyTorch installation link
      
      * Don't specify version number of Intel Extension for PyTorch
      
      * install oneCCL for CCLBackend
      
      * fix link path for CPU comm kernels
      
      * fix source oneCCL environment
      
      * source oneCCL env before run UT
      
      * Give more specific instructions when CCL_ROOT is not defined
      
      ---------
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
      Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
      Co-authored-by: baodii <di.bao@intel.com>
      Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: jianan-gu <jianan.gu@intel.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
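      A minimal usage sketch of what this commit enables: CPU inference through
      the CCL backend with the bloom injection policy fixed above. The model
      name, world-size handling and launch flags are illustrative assumptions,
      not code from this PR; it presumes intel_extension_for_pytorch and the
      oneCCL bindings are installed, which is how CPU_Accelerator gets selected.

      import os
      import torch
      import deepspeed
      from transformers import AutoModelForCausalLM
      from transformers.models.bloom.modeling_bloom import BloomBlock

      # CCLBackend (added here) is initialized via TorchBackend and falls back
      # to TorchBackend for collectives it does not implement.
      deepspeed.init_distributed(dist_backend="ccl")

      model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m",
                                                   torch_dtype=torch.bfloat16)
      engine = deepspeed.init_inference(
          model,
          mp_size=int(os.getenv("WORLD_SIZE", "1")),
          dtype=torch.bfloat16,  # bf16 kernels with CPU fallback path
          injection_policy={BloomBlock: ("self_attention.dense",
                                         "mlp.dense_4h_to_h")},
      )

      Launched, for example, with "deepspeed --bind_cores_to_rank run.py" so
      each rank is pinned to its own cores and NUMA domain (see #2881 below).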
  5. 22 Apr 2023 (1 commit)
  6. 13 Apr 2023 (1 commit)
    • [CPU support] Optionally bind each rank to different cores on host (#2881) · 0b5252bb
      Ma, Guokai authored
      * add fallback path for kernels used in megatron
      
      * temporary numactl WA for SPR 56core
      
      * adapt core allocation according to number of ranks
      
      * add switch to turn on numactl
      
      * detect number of cores on the system
      
      * allow selecting a subset of the cores on the system to bind
      
      * remove unneeded changes
      
      * use current_env to set OMP_NUM_THREADS in subprocess
      
      * add test for ds_arguments
      
      * change --bind_cores_to_rank option to store_true
      
      * add test for parse_range_list
      
      * add comment for parse range list
      
      * add test for parse range list, rewrite parse_range_list (a sketch of such a parser follows this entry)
      
      * fix format error
      
      * fix format
      
      * add -m parameter to numactl when necessary
      
      * Check KMP_AFFINITY to avoid conflict with numactl
      
      * fix format
      
      * negative case for parse_range_list
      
      * detect whether numactl is installed before using numactl to bind cores

      * check for numactl with the distro's package manager
      
      ---------
      Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
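      A sketch of the range-list parsing tested in this commit, plus how the
      parsed cores might feed numactl. parse_range_list is the PR's real test
      target, but the body below is a reconstruction under assumed semantics
      ("0-7,16-23"-style strings), not the PR's code.

      def parse_range_list(spec):
          """Parse a core-list string such as "0-7,16-23" into sorted core ids."""
          cores = set()
          for part in spec.split(","):
              part = part.strip()
              if "-" in part:
                  lo, hi = part.split("-")
                  cores.update(range(int(lo), int(hi) + 1))
              elif part:
                  cores.add(int(part))
          return sorted(cores)

      # Hypothetical use: prefix a rank's command with numactl, adding -m for
      # NUMA memory binding when necessary, as the entries above describe.
      cores = parse_range_list("0-27")
      numactl_prefix = ["numactl", "-C", ",".join(map(str, cores)), "-m", "0"]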
  7. 31 Mar 2023 (1 commit)
  8. 27 Mar 2023 (1 commit)
  9. 02 Mar 2023 (1 commit)
    • Add MPICH Multinode Runner (#2839) · 8d53ac0c
      mzl authored
      * MPICH support (a condensed runner sketch follows this entry)
      
      * MPICH changes
      
      * MPICH changes
      
      * MPICH changes
      
      * MPICH changes
      
      * accelerator runtime modifications
      
      * Accelerator runtime changes
      
      * Accelerator runtime modifications
      
      * Remove redundant print from single node
      
      * Move hostfile to tmp
      
      * Code cleanup for MPICH class
      
      * Code cleanup, rm whitespace
      
      * Removing mpiexec environment check details
      
      * No need for tmp hostfile as it is passed directly
      
      * Remove debugging comments
      
      * rm print statement
      
      * Revert comm changes as WA not needed
      
      * Use MPICHRunner name for class
      
      * Use MPICHRunner as class name
      
      * No need to use args.force_multi and args.launcher.

      These should be set in the DeepSpeedExamples gpt-3.6b .sh script as:
      launcher=MPICH
      run_cmd="deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"
      
      * Adhere to code pattern
      
      * Rm empty lines in MPICHRunner class
      
      * Uncomment check for num nodes and workers when hostfile_deepspeed is used in gpt-3.6b.sh
      
      * pass MPICH hostfile through launcher_args in gpt-3.6b.sh
      
      * Clean code and remove args hostfile
      
      * fix merge
      
      * fix merge
      
      ---------
      Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
      
      * clean up and fix format
      
      * add ut
      
      ---------
      Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
      Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
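      A condensed sketch of what an MPICH multinode runner assembles (the real
      MPICHRunner lives in deepspeed/launcher/multinode_runner.py; this class
      body and method name are illustrative, though -n/-ppn/-f are standard
      MPICH Hydra options):

      import sys

      class MPICHRunnerSketch:
          def __init__(self, total_procs, procs_per_node, hostfile):
              self.total_procs = total_procs
              self.procs_per_node = procs_per_node
              self.hostfile = hostfile

          def get_cmd(self, user_script, user_args):
              # mpiexec launches total_procs ranks, procs_per_node per host,
              # reading the host list from the hostfile
              return ["mpiexec", "-n", str(self.total_procs),
                      "-ppn", str(self.procs_per_node),
                      "-f", self.hostfile,
                      sys.executable, "-u", user_script] + list(user_args)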
  10. 26 Jan 2023 (1 commit)
    • Abstract accelerator (step 3) (#2677) · 98cc35b6
      Ma, Guokai authored
      * Integrate accelerator abstraction interface into deepspeed/ (a usage sketch follows this entry)
      
      * Fix error message in fp16/fused_optimizer
      
      * fix error message in fp16/unfused_optimizer.py
      
      * assign get_accelerator().pin_memory() result to input Tensor name
      
      * no need to check cuda and whether nvtx is supported
      
      * move try-except into innermost block
      
      * call Event() and Stream() in get_accelerator() for data type
      
      * Make Stream and Event properties of the abstract interface so they can be used as data types in deepspeed
      
      * Apply op_builder backend api change from #2705 from @jeffra
      
      * fix tests where Builder NAME is used
      
      * keep original ...Builder.NAME interface instead of ...Builder().NAME interface
      
      * fix builder closure for installation
      
      * fix randomltd builder
      
      * add comments to clarify create_op_builder and get_op_builder
      
      * fix compatibility with pip install -e
      Co-authored-by: Cheng Li <pistasable@gmail.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
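      The pattern this commit applies across deepspeed/ is replacing direct
      torch.cuda calls with the accelerator interface. get_accelerator,
      pin_memory, Stream and Event are the real interface; the snippet itself
      is only a sketch of the idiom.

      import torch
      from deepspeed.accelerator import get_accelerator

      accel = get_accelerator()
      buf = accel.pin_memory(torch.empty(1024))  # assign the result back, as fixed above
      stream = accel.Stream()                    # Stream/Event exposed as properties,
      start = accel.Event(enable_timing=True)    # so they can serve as data types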
  11. 04 Jan 2023 (1 commit)
  12. 21 Dec 2022 (1 commit)
  13. 20 Dec 2022 (1 commit)
  14. 05 Nov 2022 (1 commit)
  15. 27 Oct 2022 (1 commit)
  16. 14 Oct 2022 (1 commit)
  17. 30 Jul 2022 (1 commit)
  18. 28 Jul 2022 (1 commit)
    • Trajepl/nebula ckpt engine (#2085) · e669aaf5
      trajep authored
      * enable checkpoint engine
      
      * separated nebula config
      
      * add __init__.py for nebula importing
      
      * linter fix
      
      * fix: ds_config is None
      
      * fix: ds config
      
      * fix: get sd loader fix
      
      * align the API with torch raw code
      
      * linter fix
      
      * remove duplicate tag params
      
      * make checkpoint_engine a required arg
      
      * fix args
      
      * extract parameters out to config
      
      * fix: load state dict
      
      * separate load engine
      
      * linter fix
      
      * extract checkpoint engine into an abstract class (a sketch of its shape follows this entry)
      
      * linter fix
      
      * construct function args fix
      
      * add docs for dev/customers
      
      * linter fix
      
      * remove load engine
      
      * print->log_dist
      
      * linter fix
      
      * add tag flag to distinguish the loading order
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
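      The shape of the abstract class this commit extracts. The method names
      follow DeepSpeed's CheckpointEngine interface; treat the body as a
      sketch rather than the exact file contents.

      class CheckpointEngine(object):
          def __init__(self, config_params=None):
              pass

          def create(self, tag):
              # called once per checkpoint tag before any save()
              raise NotImplementedError

          def save(self, state_dict, path):
              raise NotImplementedError

          def load(self, path, map_location=None):
              raise NotImplementedError

          def commit(self, tag):
              # signals that every file for `tag` is persisted; an asynchronous
              # engine such as Nebula can use this to seal the version
              raise NotImplementedError

      A torch-based engine implements save/load with torch.save/torch.load,
      while the Nebula engine is selected through the separated "nebula"
      block in ds_config mentioned above.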
  19. 19 May 2022 (1 commit)
  20. 15 Mar 2022 (1 commit)
  21. 09 Feb 2022 (1 commit)
  22. 28 Jan 2022 (1 commit)
  23. 27 Jan 2022 (1 commit)
  24. 20 Jan 2022 (1 commit)
  25. 13 Jan 2022 (1 commit)
  26. 18 Nov 2021 (1 commit)
  27. 13 Nov 2021 (1 commit)
  28. 03 Nov 2021 (1 commit)
  29. 02 Oct 2021 (1 commit)
  30. 21 Apr 2021 (1 commit)
  31. 19 Apr 2021 (2 commits)
  32. 14 Apr 2021 (1 commit)
  33. 07 Apr 2021 (1 commit)
  34. 17 Mar 2021 (1 commit)
  35. 10 Mar 2021 (1 commit)
  36. 09 Mar 2021 (1 commit)
  37. 16 Jan 2021 (1 commit)
  38. 18 Dec 2020 (1 commit)
  39. 10 Sep 2020 (1 commit)