Commits · 3f5e4931098bf533f8217afb6d986c90f81aed80 · Greenplum / DeepSpeed

16 Jun, 2023 · 2 commits
- fix ccl_backend and residual_add problems (#3642) · 3f5e4931
  Committed by Dino Chen on Jun 16, 2023
```
* fix ccl_backend path when it should fallback

* fix residual_add fallback when only one kernel is ready

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```
- Account for expert parameters when calculating the total number of parameters in the model (#3720) · 9a8f6a1d
  Committed by Alejandro Dubrovsky on Jun 15, 2023
```
Co-authored-by: Alex Dubrovsky <dubro@amazon.com>
```
15 Jun, 2023 · 2 commits
- add Chinese Zhihu social account (#3755) · b692d236
  Committed by Conglong Li on Jun 14, 2023
- remove UtilsBuilder load, use torch (un)flatten ops (#3728) · 5a5340d0
  Committed by mzl on Jun 15, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
14 Jun, 2023 · 5 commits

Fix output transpose dimension bugs (#3747) · cd911f9a
Committed by Logan Adams on Jun 13, 2023

fix hybrid engine mlp module (#3736) · 45466afa
Committed by tensor-tang on Jun 14, 2023

* fix gated_mlp.py

* fix hybrid_engine.py

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Include cublas error details when getting cublas handle fails (#3695) · 46bb08c2
Committed by john li on Jun 13, 2023

* include cublas error details when getting cublas handle fails

* run clang-format

* just use raw enum value to avoid depending on minimum cuda version

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Fix autotuner get_gas_from_user_config (#3664) · 09332dbf
Committed by StrayWarrior on Jun 14, 2023

Co-authored-by: Feng Zhoutian <fengzhoutian@meituan.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Fix apex install bugs (#3741) · 1b401823
Committed by Logan Adams on Jun 13, 2023

* Fix apex installation

* Switch install flag from build-opt to global-opt to fix missing cpp_ext

* Try installing with support for newer pip

* Add build packaging

* Update to latest

* Pin to specific commit while pyproject.toml is fixed

13 Jun, 2023 · 1 commit
- FP8 unittest for H100 (#3731) · 6f4fc30b
  Committed by Joe Mayer on Jun 12, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
10 Jun, 2023 · 5 commits
- Documentation for DeepSpeed Accelerator Abstraction Interface (#3184) · 5289d691
  Committed by Ma, Guokai on Jun 10, 2023
```
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- bump to 0.9.5 · 54bd9e29
  Committed by Jeff Rasley on Jun 09, 2023
- Update Dockerfile with newer cuda and torch. (#3716) · a65f6b9e
  Committed by Logan Adams on Jun 09, 2023
```
* Add non-interactive prompt, causing issues for some users

* Update pytorch version too
```
- single node pdsh sigkill (#3730) · 26b3e732
  Committed by Abhilash Majumder on Jun 10, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```
- [Bugfix][CPU] Remove C++ version in CPU OpBuilder (#3643) · 8bfbb0e3
  Committed by Ma, Guokai on Jun 10, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
09 Jun, 2023 · 3 commits

Increase tensor creator coverage (#3684) · 046afced
Committed by Olatunji Ruwase on Jun 08, 2023

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Fix typo in name of hybrid engine function (#3704) · fc8e5c88
Committed by Logan Adams on Jun 08, 2023
```
* Fix typo in name of hybrid engine function

* Fix
```

zero3 performance optimizations (#3622) · 0977106a
Committed by hablb on Jun 08, 2023

* Remove dead code

params_already_reduced is not used

* Prevent evaluation of debug strings

Debug strings are evaluated even when logging is disabled

* Use contiguous gradients tensor reduce scatter between ranks

Use allreduce instead of reduce scatter. lower cpu overhead.

* move overflow tracker to optimizer.step

Don't check overflow in gradients for every bucket.
Do overflow check once on grad flat buffer just before optimizer step

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
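The "Prevent evaluation of debug strings" item in #3622 describes a general logging pitfall: a message built with an f-string is formatted before the logging call runs, even if that level is disabled. The sketch below illustrates the pitfall with Python's standard `logging` module only. It is not the DeepSpeed change itself, and the `Expensive` class and the `lazy_debug_demo` logger name are hypothetical, introduced purely for illustration.

```python
import logging

logger = logging.getLogger("lazy_debug_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)                   # DEBUG records are disabled


class Expensive:
    """Hypothetical object that counts how often it is rendered as a string."""

    def __init__(self):
        self.format_calls = 0

    def __str__(self):
        self.format_calls += 1
        return "expensive-summary"


eager = Expensive()
lazy = Expensive()
guarded = Expensive()

# Eager: the f-string is built before logger.debug() is even called,
# so the formatting cost is paid although nothing is logged.
logger.debug(f"state: {eager}")

# Lazy: %-style arguments are only formatted if the record is emitted.
logger.debug("state: %s", lazy)

# Guarded: skip building the message entirely when DEBUG is disabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug(f"state: {guarded}")

print(eager.format_calls, lazy.format_calls, guarded.format_calls)
```

Only the eager variant pays the formatting cost while DEBUG is off. Which of the two cheaper forms the actual commit uses is not stated in the log above.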

08 Jun, 2023 · 6 commits

DeepSpeed overview in Japanese (#3709) · df425097
Committed by Conglong Li on Jun 07, 2023
```
* DeepSpeed overview in Japanese

* DeepSpeed overview in Japanese
```

Small tweak on cuda version mismatch documentation (#3706) · d414678d
Committed by john li on Jun 07, 2023

* Small tweak on cuda version mismatch documentation

* clarify minor versions should also match

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Fix unit test typo in tests/unit/ops/transformer/inference (#3697) · fb2b4ab1
Committed by Michael Wyatt on Jun 07, 2023

* fix typo and missing epsilon value

* Touch file to re-build

* revert changes

* Touch file to re-build

* Format

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

change partititon_name to partition_name (#3700) · c5edc91e
Committed by digger yu on Jun 08, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```

Fix gpt-j inference issue (#3639) · 34a9fbf1
Committed by Reza Yazdani on Jun 07, 2023

* fix gpt-j inference issue for mlp_gemm_func call

* bring back the gpt-j inference-test

* fix formatting

* fix the neox and pythia injection issue

Revert "fix typo name (#3689)" (#3702) · 7e59ef12
Committed by Logan Adams on Jun 07, 2023
```
This reverts commit f2f5f21b.
```

07 Jun, 2023 · 5 commits

fix typo name (#3689) · f2f5f21b
Committed by tensor-tang on Jun 07, 2023

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

Fix incorrectly formatted f string (#3698) · d8aaa581
Committed by Logan Adams on Jun 06, 2023
```
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
```

Correct world_size/backend for mpi (#3694) · c17313fb
Committed by Abhilash Majumder on Jun 07, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```

Fix local rank mismatch for heterogeneous nodes (#3409) · b7f463dd
Committed by Byungsoo Oh on Jun 07, 2023

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

non-JIT build fix on ROCm (#3638) · 4cd0a003
Committed by Ramya Ramineni on Jun 06, 2023

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

06 Jun, 2023 · 3 commits
- Update README to add ICS'23 paper (#3687) · 2d737edd
  Committed by Siddharth Singh on Jun 06, 2023
- Use logger in accelerator (#3682) · e5fe5f65
  Committed by Olatunji Ruwase on Jun 05, 2023
```
* Use logger in accelerator

* Handle pre-build cases

* Explain possible import failure
```
- fix some typo (#3675) · 3fb3cfdc
  Committed by digger yu on Jun 06, 2023
```
* fix typo deepspeed/runtime

* fix some typo

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
05 Jun, 2023 · 1 commit

[MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding (#3440) · c88af214
Committed by Zhen Zhang on Jun 04, 2023

* fix mics save checkpoint hanging

* MiCS load_checkpoint

* copyright

* fix for torch-1.9.0

all_reduce_coalesced api does not support nccl backend

* Naming alignment

* adding more test conditions for mics shard size

* test with different shard sizes

* adding assertion for better error msg

---------
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>

03 Jun, 2023 · 3 commits

bump to 0.9.4 · f483c034
Committed by Jeff Rasley on Jun 02, 2023

Refactor check_enabled root validator in DeepSpeedMonitorConfig (#3616) · 4559aa9b
Committed by Buğra on Jun 02, 2023

* Refactor check_enabled root validator in DeepSpeedMonitorConfig

* formatting

* formatting

---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>

fix typo deepspeed/runtime (#3663) · 5d14afd2
Committed by digger yu on Jun 03, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

02 Jun, 2023 · 4 commits

flops_profiler: add option recompute_fwd_factor for the case of activation recompute (#3362) · 460bec46
Committed by 郭叶军 on Jun 02, 2023

When activation checkpointing is enabled, most of the forward pass is recomputed,
so the FLOPS calculation should be updated with recompute_fwd_factor=1.0.

I couldn't find a way to pass the option from the model script to the DeepSpeed engine,
so the option is added directly to flops_profiler.

Co-authored-by: Cheng Li <pistasable@gmail.com>

fix typo with deepspeed/ (#3547) · cd4e473e
Committed by digger yu on Jun 02, 2023

* fix spelling error with deepspeed/runtime/

* fix typo docs/

* fix typo in comments with deepspeed/

* fix typo deepspeed/

* Update constants.py

Remove the space after nebula

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

allow dict datatype for checkpoints (#3007) · da8f4e01
Committed by Michael Wyatt on Jun 01, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

Fix RuntimeError when using ZeRO Stage3 with mpu: #3564 (#3565) · f5dde36c
Committed by Haodong Lyu on Jun 02, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

Greenplum / DeepSpeed · last synced 11 months ago
