OpenDocCN / pytorch-doc-zh

Commit 4c00d2f7 — authored 2024-02-05 13:39:19 by 绝不原创的飞龙 (parent commit 3db0d3af).
Showing 6 changed files with 2255 additions and 0 deletions (+2255 −0).
totrans/doc22_046.yaml  +782 −0
totrans/doc22_047.yaml  +118 −0
totrans/doc22_048.yaml  +44 −0
totrans/doc22_049.yaml  +959 −0
totrans/doc22_050.yaml  +236 −0
totrans/doc22_051.yaml  +116 −0
totrans/doc22_046.yaml @ 4c00d2f7 (diff collapsed)
totrans/doc22_047.yaml @ 4c00d2f7
- en: Generic Join Context Manager
  id: totrans-0
  prefs:
  - PREF_H1
  type: TYPE_NORMAL
  zh: 通用加入上下文管理器
- en: 原文:[https://pytorch.org/docs/stable/distributed.algorithms.join.html](https://pytorch.org/docs/stable/distributed.algorithms.join.html)
  id: totrans-1
  prefs:
  - PREF_BQ
  type: TYPE_NORMAL
  zh: 原文:[https://pytorch.org/docs/stable/distributed.algorithms.join.html](https://pytorch.org/docs/stable/distributed.algorithms.join.html)
- en: 'The generic join context manager facilitates distributed training on uneven
    inputs. This page outlines the API of the relevant classes: `Join`, `Joinable`,
    and `JoinHook`. For a tutorial, see [Distributed Training with Uneven Inputs Using
    the Join Context Manager](https://pytorch.org/tutorials/advanced/generic_join.html).'
  id: totrans-2
  prefs: []
  type: TYPE_NORMAL
  zh: 通用加入上下文管理器促进了不均匀输入的分布式训练。本页概述了相关类的API:`Join`、`Joinable`和`JoinHook`。有关教程,请参阅[使用加入上下文管理器进行不均匀输入的分布式训练](https://pytorch.org/tutorials/advanced/generic_join.html)。
- en: '[PRE0]'
  id: totrans-3
  prefs: []
  type: TYPE_PRE
  zh: '[PRE0]'
- en: This class defines the generic join context manager, which allows custom hooks
    to be called after a process joins.
  id: totrans-4
  prefs: []
  type: TYPE_NORMAL
  zh: 此类定义了通用加入上下文管理器,允许在进程加入后调用自定义钩子。
- en: These hooks should shadow the collective communications of non-joined processes
    to prevent hanging and erroring and to ensure algorithmic correctness. Refer to
    [`JoinHook`](#torch.distributed.algorithms.JoinHook "torch.distributed.algorithms.JoinHook")
    for details about the hook definition.
  id: totrans-5
  prefs: []
  type: TYPE_NORMAL
  zh: 这些钩子应该遮蔽未加入进程的集体通信,以防止挂起和出错,并确保算法的正确性。有关钩子定义的详细信息,请参阅[`JoinHook`](#torch.distributed.algorithms.JoinHook
    "torch.distributed.algorithms.JoinHook")。
- en: Warning
  id: totrans-6
  prefs: []
  type: TYPE_NORMAL
  zh: 警告
- en: The context manager requires each participating [`Joinable`](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable") to call the method [`notify_join_context()`](#torch.distributed.algorithms.Join.notify_join_context
    "torch.distributed.algorithms.Join.notify_join_context") before its own per-iteration
    collective communications to ensure correctness.
  id: totrans-7
  prefs: []
  type: TYPE_NORMAL
  zh: 上下文管理器要求每个参与的[`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")在自己的每次迭代集体通信之前调用方法[`notify_join_context()`](#torch.distributed.algorithms.Join.notify_join_context
    "torch.distributed.algorithms.Join.notify_join_context")以确保正确性。
- en: Warning
  id: totrans-8
  prefs: []
  type: TYPE_NORMAL
  zh: 警告
- en: The context manager requires that all `process_group` attributes in the [`JoinHook`](#torch.distributed.algorithms.JoinHook
    "torch.distributed.algorithms.JoinHook") objects are the same. If there are multiple
    [`JoinHook`](#torch.distributed.algorithms.JoinHook "torch.distributed.algorithms.JoinHook")
    objects, then the `device` of the first is used. The process group and device
    information is used for checking for non-joined processes and for notifying processes
    to throw an exception if `throw_on_early_termination` is enabled, both of which
    use an all-reduce.
  id: totrans-9
  prefs: []
  type: TYPE_NORMAL
  zh: 上下文管理器要求[`JoinHook`](#torch.distributed.algorithms.JoinHook "torch.distributed.algorithms.JoinHook")对象中的所有`process_group`属性都相同。如果有多个[`JoinHook`](#torch.distributed.algorithms.JoinHook
    "torch.distributed.algorithms.JoinHook")对象,则使用第一个的`device`。进程组和设备信息用于检查未加入的进程,并通知进程在启用`throw_on_early_termination`时抛出异常,两者都使用全局归约。
- en: Parameters
  id: totrans-10
  prefs: []
  type: TYPE_NORMAL
  zh: 参数
- en: '**joinables** (*List**[*[*Joinable*](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable")*]*) – a list of the participating [`Joinable`](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable") s; their hooks are iterated over in the
    given order.'
  id: totrans-11
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**joinables**(*List**[*[*Joinable*](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable")*]*) - 参与的[`Joinable`](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable")对象的列表;它们的钩子按给定顺序迭代。'
- en: '**enable** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")) – a flag enabling uneven input detection; setting to `False`
    disables the context manager’s functionality and should only be set when the user
    knows the inputs will not be uneven (default: `True`).'
  id: totrans-12
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**enable**([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")) - 一个标志,用于启用不均匀输入检测;设置为`False`会禁用上下文管理器的功能,只有在用户知道输入不会不均匀时才应设置(默认值:`True`)。'
- en: '**throw_on_early_termination** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")) – a flag controlling whether to throw an exception upon
    detecting uneven inputs (default: `False`).'
  id: totrans-13
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**throw_on_early_termination**([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")) - 一个控制是否在检测到不均匀输入时抛出异常的标志(默认值:`False`)。'
- en: 'Example:'
  id: totrans-14
  prefs: []
  type: TYPE_NORMAL
  zh: 示例:
- en: '[PRE1]'
  id: totrans-15
  prefs: []
  type: TYPE_PRE
  zh: '[PRE1]'
- en: '[PRE2]'
  id: totrans-16
  prefs: []
  type: TYPE_PRE
  zh: '[PRE2]'
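The `[PRE1]`/`[PRE2]` markers above are placeholders for code blocks not reproduced in this diff. As a language-agnostic illustration of the protocol the `Join` context manager implements, here is a toy, thread-based sketch (all names below are hypothetical stand-ins, not the PyTorch API): each rank first all-reduces an "I still have data" flag (the `notify_join_context()` analogue), and ranks that have run out of inputs shadow the data all-reduce with a zero contribution so no rank hangs.

```python
import threading

class ToyAllReduce:
    """A toy all-reduce over threads: every rank contributes a value
    and everyone reads back the sum. If any rank skipped a round, the
    barrier would hang -- exactly the problem the join protocol solves."""
    def __init__(self, world_size):
        self.lock = threading.Lock()
        self.barrier = threading.Barrier(world_size)
        self.acc = 0

    def all_reduce(self, rank, value):
        with self.lock:
            self.acc += value
        self.barrier.wait()      # all contributions are in
        result = self.acc
        self.barrier.wait()      # everyone has read the result
        if rank == 0:
            self.acc = 0         # reset for the next round
        self.barrier.wait()      # reset is visible before reuse
        return result

def run_rank(rank, num_inputs, comm, sums):
    step = 0
    while True:
        alive = step < num_inputs
        # notify_join_context() analogue: count the non-joined ranks.
        remaining = comm.all_reduce(rank, 1 if alive else 0)
        if remaining == 0:       # every rank has joined: all exit together
            break
        # Joined ranks "shadow" the collective with a zero contribution.
        total = comm.all_reduce(rank, 1.0 if alive else 0.0)
        if alive:
            sums[rank] += total
        step += 1

# Rank 0 has 2 inputs, rank 1 has 4: uneven inputs, yet no rank hangs.
comm = ToyAllReduce(world_size=2)
sums = [0.0, 0.0]
threads = [threading.Thread(target=run_rank, args=(r, n, comm, sums))
           for r, n in enumerate((2, 4))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sums)
```

In the real API the same pattern is driven by `with Join([model]):` around the training loop; the shadowing is performed by each `Joinable`'s registered `JoinHook`.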
- en: Notifies the join context manager that the calling process has not yet joined.
  id: totrans-17
  prefs: []
  type: TYPE_NORMAL
  zh: 通知加入上下文管理器,调用进程尚未加入。
- en: Then, if `throw_on_early_termination=True`, checks if uneven inputs have been
    detected (i.e. if one process has already joined) and throws an exception if so.
  id: totrans-18
  prefs: []
  type: TYPE_NORMAL
  zh: 然后,如果`throw_on_early_termination=True`,则检查是否检测到不均匀的输入(即如果一个进程已经加入),如果是,则抛出异常。
- en: This method should be called from a [`Joinable`](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable") object before its per-iteration collective
    communications. For example, this should be called at the beginning of the forward
    pass in `DistributedDataParallel`.
  id: totrans-19
  prefs: []
  type: TYPE_NORMAL
  zh: 此方法应该在[`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")对象的每次迭代集体通信之前调用。例如,在`DistributedDataParallel`的前向传递开始时应调用此方法。
- en: Only the first [`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")
    object passed into the context manager performs the collective communications
    in this method, and for the others, this method is vacuous.
  id: totrans-20
  prefs: []
  type: TYPE_NORMAL
  zh: 只有第一个传递到上下文管理器的[`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")对象在此方法中执行集体通信,对于其他对象,此方法为空。
- en: Parameters
  id: totrans-21
  prefs: []
  type: TYPE_NORMAL
  zh: 参数
- en: '**joinable** ([*Joinable*](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable"))
    – the [`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")
    object calling this method.'
  id: totrans-22
  prefs: []
  type: TYPE_NORMAL
  zh: '**joinable**([*Joinable*](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable"))
    - 调用此方法的[`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")对象。'
- en: Returns
  id: totrans-23
  prefs: []
  type: TYPE_NORMAL
  zh: 返回
- en: An async work handle for the all-reduce meant to notify the context manager
    that the process has not yet joined if `joinable` is the first one passed into
    the context manager; `None` otherwise.
  id: totrans-24
  prefs: []
  type: TYPE_NORMAL
  zh: 一个用于全局归约的异步工作句柄,用于通知上下文管理器进程尚未加入,如果`joinable`是传递到上下文管理器的第一个;否则为`None`。
- en: '[PRE3]'
  id: totrans-25
  prefs: []
  type: TYPE_PRE
  zh: '[PRE3]'
- en: This defines an abstract base class for joinable classes.
  id: totrans-26
  prefs: []
  type: TYPE_NORMAL
  zh: 这为可加入类定义了一个抽象基类。
- en: A joinable class (inheriting from [`Joinable`](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable")) should implement [`join_hook()`](#torch.distributed.algorithms.Joinable.join_hook
    "torch.distributed.algorithms.Joinable.join_hook"), which returns a [`JoinHook`](#torch.distributed.algorithms.JoinHook
    "torch.distributed.algorithms.JoinHook") instance, in addition to [`join_device()`](#torch.distributed.algorithms.Joinable.join_device
    "torch.distributed.algorithms.Joinable.join_device") and [`join_process_group()`](#torch.distributed.algorithms.Joinable.join_process_group
    "torch.distributed.algorithms.Joinable.join_process_group") that return device
    and process group information, respectively.
  id: totrans-27
  prefs: []
  type: TYPE_NORMAL
  zh: 一个可加入的类(从[`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")继承)应该实现[`join_hook()`](#torch.distributed.algorithms.Joinable.join_hook
    "torch.distributed.algorithms.Joinable.join_hook"),它返回一个[`JoinHook`](#torch.distributed.algorithms.JoinHook
    "torch.distributed.algorithms.JoinHook")实例,另外还应该实现[`join_device()`](#torch.distributed.algorithms.Joinable.join_device
    "torch.distributed.algorithms.Joinable.join_device")和[`join_process_group()`](#torch.distributed.algorithms.Joinable.join_process_group
    "torch.distributed.algorithms.Joinable.join_process_group")来分别返回设备和进程组信息。
- en: '[PRE4]'
  id: totrans-28
  prefs: []
  type: TYPE_PRE
  zh: '[PRE4]'
- en: Return the device from which to perform collective communications needed by
    the join context manager.
  id: totrans-29
  prefs: []
  type: TYPE_NORMAL
  zh: 返回执行加入上下文管理器所需的集体通信的设备。
- en: '[PRE5]'
  id: totrans-30
  prefs: []
  type: TYPE_PRE
  zh: '[PRE5]'
- en: Return a [`JoinHook`](#torch.distributed.algorithms.JoinHook "torch.distributed.algorithms.JoinHook")
    instance for the given [`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable").
  id: totrans-31
  prefs: []
  type: TYPE_NORMAL
  zh: 为给定的[`Joinable`](#torch.distributed.algorithms.Joinable "torch.distributed.algorithms.Joinable")返回一个[`JoinHook`](#torch.distributed.algorithms.JoinHook
    "torch.distributed.algorithms.JoinHook")实例。
- en: Parameters
  id: totrans-32
  prefs: []
  type: TYPE_NORMAL
  zh: 参数
- en: '**kwargs** ([*dict*](https://docs.python.org/3/library/stdtypes.html#dict
    "(in Python v3.12)")) – a [`dict`](https://docs.python.org/3/library/stdtypes.html#dict
    "(in Python v3.12)") containing any keyword arguments to modify the behavior of
    the join hook at run time; all [`Joinable`](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable") instances sharing the same join context
    manager are forwarded the same value for `kwargs`.'
  id: totrans-33
  prefs: []
  type: TYPE_NORMAL
  zh: '**kwargs**([*dict*](https://docs.python.org/3/library/stdtypes.html#dict
    "(in Python v3.12)")) - 包含任何关键字参数以在运行时修改加入钩子行为的[`dict`](https://docs.python.org/3/library/stdtypes.html#dict
    "(in Python v3.12)");所有共享相同加入上下文管理器的[`Joinable`](#torch.distributed.algorithms.Joinable
    "torch.distributed.algorithms.Joinable")实例将被转发相同的`kwargs`值。'
- en: Return type
  id: totrans-34
  prefs: []
  type: TYPE_NORMAL
  zh: 返回类型
- en: '[*JoinHook*](#torch.distributed.algorithms.JoinHook "torch.distributed.algorithms.join.JoinHook")'
  id: totrans-35
  prefs: []
  type: TYPE_NORMAL
  zh: '[*JoinHook*](#torch.distributed.algorithms.JoinHook "torch.distributed.algorithms.join.JoinHook")'
- en: '[PRE6]'
  id: totrans-36
  prefs: []
  type: TYPE_PRE
  zh: '[PRE6]'
- en: Returns the process group for the collective communications needed by the join
    context manager itself.
  id: totrans-37
  prefs: []
  type: TYPE_NORMAL
  zh: 返回加入上下文管理器本身所需的集体通信的进程组。
- en: '[PRE7]'
  id: totrans-38
  prefs: []
  type: TYPE_PRE
  zh: '[PRE7]'
- en: This defines a join hook, which provides two entry points in the join context
    manager.
  id: totrans-39
  prefs: []
  type: TYPE_NORMAL
  zh: 这定义了一个加入钩子,在加入上下文管理器中提供了两个入口点。
- en: 'Entry points: a main hook, which is called repeatedly while there exists a
    non-joined process, and a post-hook, which is called once all processes have joined.'
  id: totrans-40
  prefs: []
  type: TYPE_NORMAL
  zh: 入口点:一个主要的钩子,当存在一个未加入的进程时会被重复调用,以及一个后置钩子,当所有进程都已加入时会被调用一次。
- en: To implement a join hook for the generic join context manager, define a class
    that inherits from [`JoinHook`](#torch.distributed.algorithms.JoinHook "torch.distributed.algorithms.JoinHook")
    and override `main_hook()` and `post_hook()` as appropriate.
  id: totrans-41
  prefs: []
  type: TYPE_NORMAL
  zh: 要为通用加入上下文管理器实现一个加入钩子,需要定义一个从[`JoinHook`](#torch.distributed.algorithms.JoinHook
    "torch.distributed.algorithms.JoinHook")继承的类,并适当地重写`main_hook()`和`post_hook()`。
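To illustrate the shape of the hook API described above, here is a minimal pure-Python sketch (hypothetical stand-in names, not the `torch.distributed` implementation): a driver loop plays the role of the context manager, calling each joined rank's main hook while any rank is still working, then firing every post-hook exactly once.

```python
from abc import ABC, abstractmethod

class SketchJoinHook(ABC):
    """Mirror of the JoinHook interface: a main hook called repeatedly
    while some process has not joined, and a post-hook called once."""
    @abstractmethod
    def main_hook(self):
        ...

    def post_hook(self, is_last_joiner):
        pass

class CountingHook(SketchJoinHook):
    """Records how often its hooks fire, standing in for real shadowing."""
    def __init__(self):
        self.main_calls = 0
        self.was_last_joiner = None

    def main_hook(self):
        self.main_calls += 1          # e.g. shadow one all-reduce

    def post_hook(self, is_last_joiner):
        self.was_last_joiner = is_last_joiner

def drive(hooks, steps_per_rank):
    """Toy driver: rank r is 'joined' after steps_per_rank[r] steps;
    joined ranks run their main hook on each remaining step."""
    last = max(steps_per_rank)
    for step in range(last):
        for rank, hook in enumerate(hooks):
            if step >= steps_per_rank[rank]:   # this rank has joined
                hook.main_hook()
    for rank, hook in enumerate(hooks):
        hook.post_hook(is_last_joiner=(steps_per_rank[rank] == last))

hooks = [CountingHook(), CountingHook()]
drive(hooks, steps_per_rank=[2, 5])
```

Rank 0 runs out of data after 2 steps, so its main hook fires for the 3 remaining steps; rank 1 is the last joiner and its post-hook is told so.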
- en: '[PRE8]'
  id: totrans-42
  prefs: []
  type: TYPE_PRE
  zh: '[PRE8]'
- en: Call this hook while there exists a non-joined process to shadow collective
    communications in a training iteration.
  id: totrans-43
  prefs: []
  type: TYPE_NORMAL
  zh: 在训练迭代中,当存在一个未加入的进程时调用此钩子以隐藏集体通信。
- en: Training iteration, i.e., one forward pass, backward pass, and optimizer step.
  id: totrans-44
  prefs: []
  type: TYPE_NORMAL
  zh: 训练迭代,即在一个前向传播、反向传播和优化器步骤中。
- en: '[PRE9]'
  id: totrans-45
  prefs: []
  type: TYPE_PRE
  zh: '[PRE9]'
- en: Call hook after all processes have joined.
  id: totrans-46
  prefs: []
  type: TYPE_NORMAL
  zh: 在所有进程都已加入后调用钩子。
- en: It is passed an additional `bool` argument `is_last_joiner`, which indicates
    if the rank is one of the last to join.
  id: totrans-47
  prefs: []
  type: TYPE_NORMAL
  zh: 它接受一个额外的`bool`参数`is_last_joiner`,指示该排名是否是最后加入的之一。
- en: Parameters
  id: totrans-48
  prefs: []
  type: TYPE_NORMAL
  zh: 参数
- en: '**is_last_joiner** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")) – `True` if the rank is one of the last to join; `False`
    otherwise.'
  id: totrans-49
  prefs: []
  type: TYPE_NORMAL
  zh: '**is_last_joiner**([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")) - 如果排名是最后加入的之一,则为`True`;否则为`False`。'
totrans/doc22_048.yaml @ 4c00d2f7
- en: Torch Distributed Elastic
  id: totrans-0
  prefs:
  - PREF_H1
  type: TYPE_NORMAL
  zh: Torch分布式弹性
- en: 原文:[https://pytorch.org/docs/stable/distributed.elastic.html](https://pytorch.org/docs/stable/distributed.elastic.html)
  id: totrans-1
  prefs:
  - PREF_BQ
  type: TYPE_NORMAL
  zh: 原文:[https://pytorch.org/docs/stable/distributed.elastic.html](https://pytorch.org/docs/stable/distributed.elastic.html)
- en: Makes distributed PyTorch fault-tolerant and elastic.
  id: totrans-2
  prefs: []
  type: TYPE_NORMAL
  zh: 使分布式PyTorch具有容错性和弹性。
- en: Get Started
  id: totrans-3
  prefs:
  - PREF_H2
  type: TYPE_NORMAL
  zh: 入门
- en: Usage
  id: totrans-4
  prefs: []
  type: TYPE_NORMAL
  zh: 用法
- en: '[Quickstart](elastic/quickstart.html)'
  id: totrans-5
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[快速入门](elastic/quickstart.html)'
- en: '[Train script](elastic/train_script.html)'
  id: totrans-6
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[训练脚本](elastic/train_script.html)'
- en: '[Examples](elastic/examples.html)'
  id: totrans-7
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[示例](elastic/examples.html)'
- en: Documentation
  id: totrans-8
  prefs:
  - PREF_H2
  type: TYPE_NORMAL
  zh: 文档
- en: API
  id: totrans-9
  prefs: []
  type: TYPE_NORMAL
  zh: API
- en: '[torchrun (Elastic Launch)](elastic/run.html)'
  id: totrans-10
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[torchrun(弹性启动)](elastic/run.html)'
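As a pointer to what the torchrun entry above covers, a typical single-node invocation looks like the following (a sketch after the quickstart pattern; `train_script.py` is a placeholder for your own training script):

```shell
# Launch 4 local workers under an elastic agent on a single node.
# --standalone starts a local rendezvous, so no external backend is needed.
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_script.py
```

The agent restarts workers on failure and re-runs rendezvous when nodes join or leave, which is what makes the job fault-tolerant and elastic.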
- en: '[Elastic Agent](elastic/agent.html)'
  id: totrans-11
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[弹性代理](elastic/agent.html)'
- en: '[Multiprocessing](elastic/multiprocessing.html)'
  id: totrans-12
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[多进程](elastic/multiprocessing.html)'
- en: '[Error Propagation](elastic/errors.html)'
  id: totrans-13
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[错误传播](elastic/errors.html)'
- en: '[Rendezvous](elastic/rendezvous.html)'
  id: totrans-14
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[会合](elastic/rendezvous.html)'
- en: '[Expiration Timers](elastic/timer.html)'
  id: totrans-15
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[过期计时器](elastic/timer.html)'
- en: '[Metrics](elastic/metrics.html)'
  id: totrans-16
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[指标](elastic/metrics.html)'
- en: '[Events](elastic/events.html)'
  id: totrans-17
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[事件](elastic/events.html)'
- en: Advanced
  id: totrans-18
  prefs: []
  type: TYPE_NORMAL
  zh: 高级
- en: '[Customization](elastic/customization.html)'
  id: totrans-19
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[定制](elastic/customization.html)'
- en: Plugins
  id: totrans-20
  prefs: []
  type: TYPE_NORMAL
  zh: 插件
- en: '[TorchElastic Kubernetes](elastic/kubernetes.html)'
  id: totrans-21
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '[TorchElastic Kubernetes](elastic/kubernetes.html)'
totrans/doc22_049.yaml @ 4c00d2f7 (diff collapsed)
totrans/doc22_050.yaml @ 4c00d2f7 (diff collapsed)
totrans/doc22_051.yaml @ 4c00d2f7
- en: Tensor Parallelism - torch.distributed.tensor.parallel
  id: totrans-0
  prefs:
  - PREF_H1
  type: TYPE_NORMAL
- en: 原文:[https://pytorch.org/docs/stable/distributed.tensor.parallel.html](https://pytorch.org/docs/stable/distributed.tensor.parallel.html)
  id: totrans-1
  prefs:
  - PREF_BQ
  type: TYPE_NORMAL
- en: 'Tensor Parallelism(TP) is built on top of the PyTorch DistributedTensor ([DTensor](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md))
    and provides different parallelism styles: Colwise and Rowwise Parallelism.'
  id: totrans-2
  prefs: []
  type: TYPE_NORMAL
- en: Warning
  id: totrans-3
  prefs: []
  type: TYPE_NORMAL
- en: Tensor Parallelism APIs are experimental and subject to change.
  id: totrans-4
  prefs: []
  type: TYPE_NORMAL
- en: 'The entrypoint to parallelize your `nn.Module` using Tensor Parallelism is:'
  id: totrans-5
  prefs: []
  type: TYPE_NORMAL
- en: '[PRE0]'
  id: totrans-6
  prefs: []
  type: TYPE_PRE
  zh: '[PRE0]'
- en: Apply Tensor Parallelism in PyTorch by parallelizing modules or sub-modules
    based on a user-specified plan.
  id: totrans-7
  prefs: []
  type: TYPE_NORMAL
- en: We parallelize module or sub_modules based on a parallelize_plan. The parallelize_plan
    contains `ParallelStyle`, which indicates how user wants the module or sub_module
    to be parallelized.
  id: totrans-8
  prefs: []
  type: TYPE_NORMAL
- en: User can also specify different parallel style per module fully qualified name
    (FQN).
  id: totrans-9
  prefs: []
  type: TYPE_NORMAL
- en: Note that `parallelize_module` only accepts a 1-D `DeviceMesh`, if you have
    a 2-D or N-D `DeviceMesh`, slice the DeviceMesh to a 1-D sub DeviceMesh first
    then pass to this API(i.e. `device_mesh["tp"]`)
  id: totrans-10
  prefs: []
  type: TYPE_NORMAL
- en: Parameters
  id: totrans-11
  prefs: []
  type: TYPE_NORMAL
- en: '**module** (`nn.Module`) – Module to be parallelized.'
  id: totrans-12
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
- en: '**device_mesh** (`DeviceMesh`) – Object which describes the mesh topology
    of devices for the DTensor.'
  id: totrans-13
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
- en: '**parallelize_plan** (*Union**[**ParallelStyle**,* *Dict**[**str**,* *ParallelStyle**]**]*)
    – The plan used to parallelize the module. It can be either a `ParallelStyle`
    object which contains how we prepare input/output for Tensor Parallelism or it
    can be a dict of module FQN and its corresponding `ParallelStyle` object.'
  id: totrans-14
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
- en: '**tp_mesh_dim** ([*int*](https://docs.python.org/3/library/functions.html#int
    "(in Python v3.12)")*,* *deprecated*) – The dimension of `device_mesh` where we
    perform Tensor Parallelism on, this field is deprecated and will be removed in
    future. If you have a 2-D or N-D `DeviceMesh`, consider passing in device_mesh[“tp”]'
  id: totrans-15
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
- en: Returns
  id: totrans-16
  prefs: []
  type: TYPE_NORMAL
- en: A `nn.Module` object parallelized.
  id: totrans-17
  prefs: []
  type: TYPE_NORMAL
- en: Return type
  id: totrans-18
  prefs: []
  type: TYPE_NORMAL
- en: '[*Module*](generated/torch.nn.Module.html#torch.nn.Module "torch.nn.modules.module.Module")'
  id: totrans-19
  prefs: []
  type: TYPE_NORMAL
- en: 'Example::'
  id: totrans-20
  prefs: []
  type: TYPE_NORMAL
- en: '[PRE1]'
  id: totrans-21
  prefs: []
  type: TYPE_PRE
  zh: '[PRE1]'
- en: Note
  id: totrans-22
  prefs: []
  type: TYPE_NORMAL
- en: For complex module architecture like Attention, MLP layers, we recommend composing
    different ParallelStyles together (i.e. `ColwiseParallel` and `RowwiseParallel`)
    and pass as a parallelize_plan, to achieve the desired sharding computation.
  id: totrans-23
  prefs: []
  type: TYPE_NORMAL
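The note above about pairing `ColwiseParallel` with `RowwiseParallel` can be checked numerically in a plain NumPy sketch (an illustration of the sharding arithmetic only, not the PyTorch API; in the real API this pairing is expressed as a plan such as `{"w1": ColwiseParallel(), "w2": RowwiseParallel()}` passed to `parallelize_module`): split the first weight by columns and the second by rows, so each "rank" can run both matmuls locally, with a single sum standing in for the final all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))    # input:  batch x d
W1 = rng.standard_normal((6, 8))   # first linear:  d x h
W2 = rng.standard_normal((8, 6))   # second linear: h x d

# Colwise: shard W1 by columns across 2 "ranks" (each keeps d x h/2).
W1_shards = np.split(W1, 2, axis=1)
# Rowwise: shard W2 by rows across the same ranks (each keeps h/2 x d).
W2_shards = np.split(W2, 2, axis=0)

# Each rank computes its partial output. Because the colwise output
# shard is exactly the rowwise input shard, no communication is needed
# between the two matmuls.
partials = [(X @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# One all-reduce (here: a plain sum) recovers the unsharded result.
Y = sum(partials)
```

`Y` matches the unsharded `X @ W1 @ W2` exactly, which is why an MLP needs only one collective per layer pair under this plan.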
- en: 'Tensor Parallelism supports the following parallel styles:'
  id: totrans-24
  prefs: []
  type: TYPE_NORMAL
- en: '[PRE2]'
  id: totrans-25
  prefs: []
  type: TYPE_PRE
  zh: '[PRE2]'
- en: Partition a compatible nn.Module in a column-wise fashion. Currently supports
    nn.Linear and nn.Embedding. Users can compose it together with RowwiseParallel
    to achieve the sharding of more complicated modules. (i.e. MLP, Attention)
  id: totrans-26
  prefs: []
  type: TYPE_NORMAL
- en: Keyword Arguments
  id: totrans-27
  prefs: []
  type: TYPE_NORMAL
- en: '**input_layouts** (*Placement**,* *optional*) – The DTensor layout of input
    tensor for the nn.Module, this is used to annotate the input tensor to become
    a DTensor. If not specified, we assume the input tensor to be replicated.'
  id: totrans-28
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
- en: '**output_layouts** (*Placement**,* *optional*) – The DTensor layout of the
    output for the nn.Module, this is used to ensure the output of the nn.Module
    with the user desired layout. If not specified, the output tensor is sharded on
    the last dimension.'
  id: totrans-29
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
- en: '**use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")*,* *optional*) – Whether to use local [`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor") instead of `DTensor` for the module output, default: True.'
  id: totrans-30
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
- en: Returns
  id: totrans-31
  prefs: []
  type: TYPE_NORMAL
- en: A `ParallelStyle` object that represents Colwise sharding of the nn.Module.
  id: totrans-32
  prefs: []
  type: TYPE_NORMAL
- en: 'Example::'
  id: totrans-33
  prefs: []
  type: TYPE_NORMAL
- en: '[PRE3]'
  id: totrans-34
  prefs: []
  type: TYPE_PRE
  zh: '[PRE3]'
- en: Note
  id: totrans-35
  prefs: []
  type: TYPE_NORMAL
- en: By default `ColwiseParallel` output is sharded on the last dimension if the
    `output_layouts` not specified, if there’re operators that require specific tensor
    shape (i.e. before the paired `RowwiseParallel`), keep in mind that if the output
    is sharded the operator might need to be adjusted to the sharded size.
  id: totrans-36
  prefs: []
  type: TYPE_NORMAL
- en: '[PRE4]'
  id: totrans-37
  prefs: []
  type: TYPE_PRE
  zh: '[PRE4]'
- en: Partition a compatible nn.Module in a row-wise fashion. Currently supports nn.Linear
    only. Users can compose it with ColwiseParallel to achieve the sharding of more
    complicated modules. (i.e. MLP, Attention)
  id: totrans-38
  prefs: []
  type: TYPE_NORMAL
  zh: 将兼容的nn.Module按行划分。目前仅支持nn.Linear。用户可以将其与ColwiseParallel组合,以实现更复杂模块的分片(即MLP,Attention)
- en: Keyword Arguments
  id: totrans-39
  prefs: []
  type: TYPE_NORMAL
  zh: 关键字参数
- en: '**input_layouts** (*Placement**,* *optional*) – The DTensor layout of input
    tensor for the nn.Module, this is used to annotate the input tensor to become
    a DTensor. If not specified, we assume the input tensor to be sharded on the last
    dimension.'
  id: totrans-40
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**input_layouts** (*Placement**,* *optional*) – nn.Module的输入张量的DTensor布局,用于注释输入张量以成为DTensor。如果未指定,我们假定输入张量在最后一个维度上被分片。'
- en: '**output_layouts** (*Placement**,* *optional*) – The DTensor layout of the
    output for the nn.Module, this is used to ensure the output of the nn.Module
    with the user desired layout. If not specified, the output tensor is replicated.'
  id: totrans-41
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**output_layouts** (*Placement**,* *optional*) – nn.Module输出的DTensor布局,用于确保nn.Module的输出具有用户期望的布局。如果未指定,则输出张量将被复制。'
- en: '**use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")*,* *optional*) – Whether to use local [`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor") instead of `DTensor` for the module output, default: True.'
  id: totrans-42
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")*,* *optional*) – 是否使用本地[`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor")而不是`DTensor`作为模块输出,默认值为True。'
- en: Returns
  id: totrans-43
  prefs: []
  type: TYPE_NORMAL
  zh: 返回
- en: A `ParallelStyle` object that represents Rowwise sharding of the nn.Module.
  id: totrans-44
  prefs: []
  type: TYPE_NORMAL
  zh: 代表nn.Module的Rowwise分片的`ParallelStyle`对象。
- en: 'Example::'
  id: totrans-45
  prefs: []
  type: TYPE_NORMAL
  zh: '示例::'
- en: '[PRE5]'
  id: totrans-46
  prefs: []
  type: TYPE_PRE
  zh: '[PRE5]'
- en: 'To simply configure the nn.Module’s inputs and outputs with DTensor layouts
    and perform necessary layout redistributions, without distributing the module
    parameters to DTensors, the following classes can be used in the `parallelize_plan`
    of `parallelize_module`:'
  id: totrans-47
  prefs: []
  type: TYPE_NORMAL
  zh: 要简单配置nn.Module的输入和输出以及执行必要的布局重分配,而不将模块参数分发到DTensors,可以在`parallelize_module`的`parallelize_plan`中使用以下类:
- en: '[PRE6]'
  id: totrans-48
  prefs: []
  type: TYPE_PRE
  zh: '[PRE6]'
- en: Configure the nn.Module’s inputs to convert the input tensors of the nn.Module
    to DTensors at runtime according to `input_layouts`, and perform layout redistribution
    according to the `desired_input_layouts`.
  id: totrans-49
  prefs: []
  type: TYPE_NORMAL
  zh: 根据`input_layouts`配置nn.Module的输入,根据`desired_input_layouts`执行布局重分配,将nn.Module的输入张量转换为DTensors。
- en: Keyword Arguments
  id: totrans-50
  prefs: []
  type: TYPE_NORMAL
  zh: 关键字参数
- en: '**input_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*) –
    The DTensor layouts of input tensors for the nn.Module, this is used to convert
    the input tensors to DTensors. If some inputs are not torch.Tensor or no need
    to convert to DTensors, `None` need to be specified as a placeholder.'
  id: totrans-51
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**input_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*) –
    nn.Module的输入张量的DTensor布局,用于将输入张量转换为DTensors。如果某些输入不是torch.Tensor或不需要转换为DTensors,则需要指定`None`作为占位符。'
- en: '**desired_input_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*)
    – The desired DTensor layout of input tensors for the nn.Module, this is used
    to ensure the inputs of the nn.Module have the desired DTensor layouts. This argument
    needs to have the same length with `input_layouts`.'
  id: totrans-52
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**desired_input_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*)
    – nn.Module输入张量的期望DTensor布局,用于确保nn.Module的输入具有期望的DTensor布局。此参数需要与`input_layouts`具有相同的长度。'
- en: '**use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")*,* *optional*) – Whether to use local [`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor") instead of `DTensor` for the module inputs, default: False.'
  id: totrans-53
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")*,* *optional*) – 是否使用本地[`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor")而不是`DTensor`作为模块输入,默认值为False。'
- en: Returns
  id: totrans-54
  prefs: []
  type: TYPE_NORMAL
  zh: 返回
- en: A `ParallelStyle` object that prepares the sharding layouts of the nn.Module’s
    inputs.
  id: totrans-55
  prefs: []
  type: TYPE_NORMAL
  zh: 准备nn.Module输入的分片布局的`ParallelStyle`对象。
- en: 'Example::'
  id: totrans-56
  prefs: []
  type: TYPE_NORMAL
  zh: '示例::'
- en: '[PRE7]'
  id: totrans-57
  prefs: []
  type: TYPE_PRE
  zh: '[PRE7]'
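The convert-then-redistribute behavior described above can be sketched in plain NumPy from the point of view of a single rank (hypothetical helper names; the real `PrepareModuleInput` operates on DTensors over a `DeviceMesh`): each argument is paired with a source layout and a desired layout, and `None` marks arguments to pass through untouched.

```python
import numpy as np

def redistribute(full, src, dst, rank, world):
    """Toy redistribution for one rank: 'replicate' -> 'shard0' takes
    the local slice along dim 0; identical layouts are a no-op."""
    if src == dst:
        return full
    if src == "replicate" and dst == "shard0":
        return np.split(full, world, axis=0)[rank]
    raise NotImplementedError(f"{src} -> {dst}")

def prepare_module_input(args, input_layouts, desired_input_layouts,
                         rank, world):
    """Sketch of the PrepareModuleInput contract: a None layout means
    'this argument is not a tensor, leave it alone'."""
    out = []
    for a, src, dst in zip(args, input_layouts, desired_input_layouts):
        out.append(a if src is None else redistribute(a, src, dst, rank, world))
    return out

x = np.arange(8.0).reshape(4, 2)   # a replicated activation
flag = True                        # a non-tensor argument
local_x, same_flag = prepare_module_input(
    [x, flag], ["replicate", None], ["shard0", None], rank=1, world=2)
```

Rank 1 of 2 ends up holding rows 2 and 3 of `x`, while the non-tensor `flag` passes through unchanged; the real class does the same annotation and redistribution with DTensor placements at runtime.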
- en: '[PRE8]'
  id: totrans-58
  prefs: []
  type: TYPE_PRE
  zh: '[PRE8]'
- en: Configure the nn.Module’s outputs to convert the output tensors of the nn.Module
    to DTensors at runtime according to `output_layouts`, and perform layout redistribution
    according to the `desired_output_layouts`.
  id: totrans-59
  prefs: []
  type: TYPE_NORMAL
  zh: 根据`output_layouts`配置nn.Module的输出,根据`desired_output_layouts`执行布局重分配,将nn.Module的输出张量转换为DTensors。
- en: Keyword Arguments
  id: totrans-60
  prefs: []
  type: TYPE_NORMAL
  zh: 关键字参数
- en: '**output_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*)
    – The DTensor layouts of output tensors for the nn.Module, this is used to convert
    the output tensors to DTensors if they are [`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor"). If some outputs are not torch.Tensor or no need to convert to
    DTensors, `None` need to be specified as a placeholder.'
  id: totrans-61
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**output_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*)
    – nn.Module输出张量的DTensor布局,用于将输出张量转换为DTensors(如果它们是[`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor"))。如果某些输出不是torch.Tensor或不需要转换为DTensors,则需要指定`None`作为占位符。'
- en: '**desired_output_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*)
    – The desired DTensor layouts of output tensors for the nn.Module, this is used
    to ensure the outputs of the nn.Module have the desired DTensor layouts.'
  id: totrans-62
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**desired_output_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*)
    – nn.Module输出张量的期望DTensor布局,用于确保nn.Module的输出具有期望的DTensor布局。'
- en: '**use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")*,* *optional*) – Whether to use local [`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor") instead of `DTensor` for the module outputs, default: False.'
  id: totrans-63
  prefs:
  - PREF_UL
  type: TYPE_NORMAL
  zh: '**use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool
    "(in Python v3.12)")*,* *optional*) – 是否使用本地[`torch.Tensor`](tensors.html#torch.Tensor
    "torch.Tensor")而不是`DTensor`作为模块输出,默认值为False。'
- en: Returns
  id: totrans-64
  prefs: []
  type: TYPE_NORMAL
  zh: 返回
- en: A ParallelStyle object that prepares the sharding layouts of the nn.Module’s
    outputs.
  id: totrans-65
  prefs: []
  type: TYPE_NORMAL
  zh: 准备nn.Module输出的分片布局的`ParallelStyle`对象。
- en: 'Example::'
  id: totrans-66
  prefs: []
  type: TYPE_NORMAL
  zh: '示例::'
- en: '[PRE9]'
  id: totrans-67
  prefs: []
  type: TYPE_PRE
  zh: '[PRE9]'
- en: For models like Transformer, we recommend users to use `ColwiseParallel` and
    `RowwiseParallel` together in the parallelize_plan to achieve the desired sharding
    for the entire model (i.e. Attention and MLP).
  id: totrans-68
  prefs: []
  type: TYPE_NORMAL
  zh: 对于Transformer等模型,我们建议用户在`parallelize_plan`中同时使用`ColwiseParallel`和`RowwiseParallel`来实现整个模型的期望分片(即Attention和MLP)。