Commit 67bcaafa authored by liqingping

docs: complete english docs

Parent b2168a29
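......@@ -18,7 +18,7 @@ di-operator is responsible for orchestrating DIJob in the K8s system, using the K8s [operator pattern](ht
### API Definition
According to the characteristics of the DI-engine framework, we use a K8s Custom Resource to define the DIJob resource, which describes the desired state of an RL job, including the image, startup command, mounted storage, and number of workers.
According to the characteristics of the DI-engine framework, we use a K8s Custom Resource to define the DIJob resource, which describes the desired state of a DI-engine reinforcement learning (RL) job, including the image, startup command, mounted storage, and number of workers.
The definition and meaning of each field in DIJobSpec: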
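......@@ -169,5 +169,5 @@ DI Orchestrator provides the DI-engine framework with a K8s-based container
1. Encapsulation. Relying on the Operator's orchestration capabilities, the details of deploying DI-engine distributed RL training (including pod creation and service discovery) are transparent to users. Based on the deployment requirements of the DI-engine framework for distributed RL training, the Operator creates workers for each job and records the status of each worker in the DIJob status. The Operator also maintains the DIJob life cycle and shows users the status of the DIJob at each stage.
2. Ease of use. Users only need to define the job configuration in the DIJob yaml file and submit it to the K8s cluster with one click; the Operator then completes the deployment, freeing users from the complexity of deploying distributed RL training in a K8s cluster. A DIJob can also be submitted with one click using the command line tool.
3. Robustness. Relying on the Operator's restart mechanism, workers are restarted automatically if they exit unexpectedly.
4. Dynamic expansion. The workers required by a DIJob change dynamically, so the Server provides an http interface for dynamically adjusting the number of workers, allowing a DIJob to adjust the number of workers according to its own needs and optimize throughput.
4. Dynamic expansion. The workers required by a DIJob change dynamically, so users can modify the DIJob directly through a K8s client to change the number of workers; in addition, the Server provides an HTTP interface for dynamically adjusting the number of workers. Dynamic expansion lets users adjust the number of workers according to their own needs and optimize throughput.
5. Dynamic scheduling. Relying on the Operator's sub-component Allocator, dynamic scheduling for DI-engine jobs becomes simple. The Allocator provides scheduling strategies for single and multiple jobs, which can optimize the global job completion time without affecting normal training.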
......@@ -17,7 +17,7 @@ DI Operator is responsible for orchestrating DIJob in K8s system, using K8s [ope
### API Definitions
According to the characteristics of DI-engine framework, we use K8s Custom Resource to define the DIJob resource, which is used to define the desired state of a DI-engine job, including images, startup commands, mount volumes, and the number of workers, etc.
According to the characteristics of DI-engine framework, we use K8s Custom Resource to define the DIJob resource, which is used to define the desired state of a DI-engine Reinforcement Learning (RL) job, including images, startup commands, mount volumes, and the number of workers, etc.
Definition and meaning of each field in DIJobSpec is as follows:
......@@ -171,8 +171,8 @@ Jobs submitted run in the cluster according to the process in the following figu
DI Orchestrator provides a K8s-based container-orchestration solution for DI-engine framework in a distributed scenario. For a DIJob, Operator is responsible for orchestrating DI-engine workers so that each worker can run normally and perform training tasks. The sub-module Allocator in Operator provides DI-engine framework with the ability to dynamically allocate and schedule resources. By calling Server's HTTP interface, users are given the functions of adding, deleting, and querying workers for each job. In summary, DI Orchestrator provides the following advantages:
1. Encapsulation. Depending on the orchestration capabilities of Operator, details of deploying DI-engine distributed RL training jobs(including pod creation, service discovery) are transparent to users. According to the deployment requirements of DI-engine jobs for distributed RL training, Operator creates workers for jobs, and write the status of each worker to DIJob status. The life cycle of DIJob is also maintained by Operator, providing us with status of DIJob in different stages.
2. Ease of use. The user only needs to define the configuration of the task in the yaml file of DIJob and submit it to the K8s cluster with one click, and the operator will be responsible for completing the deployment work, freeing the user from the complex distributed RL training deployment in the K8s cluster. At the same time, DIJob can be submitted with one click with the help of command line tools.
3. Robustness. Rely on the operator's restart mechanism to ensure that workers can automatically restart in the case of unexpected exit.
4. Dynamic expansion. The workers required by DIJob change dynamically, so the server provides the http interface to dynamically adjust the number of workers, so that DIJob can adjust the number of workers according to its own needs and optimize throughput.
5. Dynamic scheduling. By relying on the Operator subcomponent Allocator, dynamic scheduling for DI-engine tasks becomes simple. Allocator provides scheduling strategies for single-task and multi-task, which can optimize the global task completion time without affecting normal training.
\ No newline at end of file
1. Encapsulation. Depending on the orchestration capabilities of Operator, details of deploying DI-engine distributed RL training jobs (including pod creation and service discovery) are transparent to users. According to the deployment requirements of DI-engine jobs for distributed RL training, Operator creates workers for jobs and writes the status of each worker to DIJob status. The life cycle of DIJob is also maintained by Operator, which provides users with the status of DIJob in different stages.
2. Ease of use. Users only need to define the configuration of the DI-engine job in the yaml file of DIJob and submit it to the K8s cluster with one click, and Operator will be responsible for completing the deployment work, freeing users from the complex distributed RL training deployment in the K8s cluster. At the same time, DIJob can be submitted with one click with the help of command line tools.
3. Robustness. Relying on Operator's restart mechanism, workers can automatically restart in the case of unexpected exit.
4. Dynamic expansion. The number of workers required by DIJob changes dynamically, so users can directly modify DIJob through the K8s client to change the number of workers; at the same time, Server provides HTTP interfaces to dynamically adjust the number of workers. Dynamic expansion allows users to adjust the number of workers according to their own needs and optimize throughput.
5. Dynamic scheduling. By relying on Operator's sub-module Allocator, dynamic scheduling for DI-engine jobs becomes simple. Allocator provides scheduling strategies for single-job and multi-job scenarios, which can optimize the global job completion time without affecting normal training.
\ No newline at end of file
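
Reviewer note on the "Dynamic expansion" item: the updated text says users can change the number of workers by modifying the DIJob through a K8s client. A minimal sketch of what such a modification could look like is given below, assuming the change is sent as a JSON merge patch (RFC 7396), which K8s accepts for custom resources. The diff does not show the DIJobSpec fields, so `apiVersion`, `image`, `command`, and `workers` here are hypothetical placeholders, and `make_dijob`, `scale_patch`, and `apply_merge_patch` are illustrative helpers rather than part of DI Orchestrator.

```python
import json
from copy import deepcopy


def make_dijob(name, image, command, workers):
    """Build a DIJob-like manifest as a plain dict.

    The apiVersion and spec field names are hypothetical placeholders,
    not the real DIJobSpec schema.
    """
    return {
        "apiVersion": "example.org/v1",  # placeholder, not the real group/version
        "kind": "DIJob",
        "metadata": {"name": name},
        "spec": {"image": image, "command": command, "workers": workers},
    }


def scale_patch(workers):
    """Return a JSON merge patch body that changes only the worker count."""
    if workers < 0:
        raise ValueError("workers must be non-negative")
    return {"spec": {"workers": workers}}


def apply_merge_patch(target, patch):
    """Apply an RFC 7396 merge patch locally.

    Dicts are merged recursively, a None value deletes the key,
    and any other value replaces the target.
    """
    if not isinstance(patch, dict):
        return deepcopy(patch)
    result = deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = apply_merge_patch(result.get(key), value)
    return result


if __name__ == "__main__":
    job = make_dijob("demo", "img:latest", ["python", "main.py"], 2)
    scaled = apply_merge_patch(job, scale_patch(4))
    print(json.dumps(scaled["spec"], indent=2))
```

In a real cluster the body returned by `scale_patch` would be sent by a K8s client with the merge-patch content type; the local `apply_merge_patch` only illustrates that the rest of the spec is left untouched.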