Unverified commit 3e0c808c, authored by EastInsure, committed by GitHub

Merge pull request #7 from konnase/hotfix/rename

refactor: file rename
......@@ -10,11 +10,11 @@ run:
tests: true
skip-dirs:
- manifests # deploy phoenix-rubber yaml
- manifests
- third_party # from go-ethereum
- _out #phoenix-rubber executable binary file
- _out
- doc # user tutorial
- deployment # deploy phoenix-rubber yaml
- deployment
- config # the crd config yaml
- cluster # the logging bash
- vendor # the third library
......
# Build the di-operator binary
FROM golang:1.14 as builder
FROM golang:1.15 as builder
WORKDIR /workspace
# Copy the Go Modules manifests
......
......@@ -24,19 +24,19 @@ di-server-7b86ff8df4-jfgmp 1/1 Running 0 59s
Install global components of DIJob defined in AggregatorConfig:
```bash
kubectl create -f examples/di_v1alpha1_agconfig.yaml -n di-system
kubectl create -f config/samples/agconfig.yaml -n di-system
```
### Submit DIJob
```bash
# submit DIJob
$ kubectl create -f examples/di_v1alpha1_dijob.yaml
$ kubectl create -f config/samples/dijob-cartpole.yaml
# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod
# get logs of coordinator
$ kubectl logs dijob-example-coordinator
$ kubectl logs cartpole-dqn-coordinator
```
## User Guide
......
......@@ -38,7 +38,7 @@ type DIJobSpec struct {
// CleanPodPolicy defines the policy to clean pods after DIJob completed
CleanPodPolicy CleanPodPolicy `json:"cleanPodPolicy,omitempty"`
// Volumes defines the shared volumes for di components
// Volumes defines the shared volumes for DI-engine components
Volumes []corev1.Volume `json:"volumes,omitempty"`
Coordinator CoordinatorSpec `json:"coordinator"`
......
......@@ -73,7 +73,8 @@ var _ = BeforeSuite(func() {
},
}
cfg, err := testEnv.Start()
var err error
cfg, err = testEnv.Start()
Expect(err).NotTo(HaveOccurred())
Expect(cfg).NotTo(BeNil())
......
......@@ -19748,7 +19748,7 @@ spec:
description: Priority labels the priority of DIJob
type: string
volumes:
description: Volumes defines the shared volumes for di components
description: Volumes defines the shared volumes for DI-engine components
items:
description: Volume represents a named volume in a pod that may
be accessed by any container in the pod.
"""
Copyright 2020 Sensetime X-lab. All Rights Reserved
Copyright 2020 OpenDILab. All Rights Reserved
"""
from typing import Union, Mapping, List, NamedTuple, Tuple, Callable, Optional, Any
import copy
......
"""
Copyright 2020 Sensetime X-lab. All Rights Reserved
Copyright 2020 OpenDILab. All Rights Reserved
"""
import os
import socket
......
# DI Orchestrator Architecture
The DI framework consists of 3 important modules, namely coordinator, collector and learner. In general, a DI training job has only one coordinator, while the number of learners and collectors can vary. The roles of the three modules are:
The DI-engine framework consists of 3 important modules, namely coordinator, collector and learner. In general, a DI-engine training job has only one coordinator, while the number of learners and collectors can vary. The roles of the three modules are:
- coordinator: maintains connections with collectors and learners, accepts their requests to fetch and push meta-info, and dispatches tasks to collectors and learners.
- collector: obtains the storage location of the RL model from coordinator, loads the RL model, then generates data frames by acting on the model's decisions in the environments it constructs, stores the data frames back into the storage middleware, and reports the data frames' meta-info (storage path, size, etc.) to coordinator.
- learner: obtains the data frames' storage location from coordinator, loads the data frames from the storage middleware, and trains the RL model; after training it stores the model into the middleware and reports the model's meta-info (storage path, size, etc.) to coordinator. Since the learner side often involves the additional distributed mechanism of data-parallel training, to avoid confusion we call the module that interacts with coordinator the logic learner, the basic unit to which coordinator issues tasks; a single learner process inside data-parallel training is called a ddp learner, and multiple ddp learner processes provide the data-parallel service. One logic learner can correspond to one ddp learner (single GPU) or multiple ddp learners (multi GPU). Providing data-parallel training also requires an extra aggregator module: the aggregator summarizes the training results of multiple ddp learners and sends them to coordinator, i.e. the aggregator and multiple ddp learners together form a logic learner, and coordinator only interacts with logic learners.
For a detailed introduction to DI, see the [DI developer tutorial](https://opendilab.github.io/DI-engine/tutorial_dev/index.html).
For a detailed introduction to DI-engine, see the [DI-engine developer tutorial](https://opendilab.github.io/DI-engine/tutorial_dev/index.html).
To support running DI in Kubernetes (K8s), we designed DI Orchestrator. This document explains how, with DI Orchestrator, each DI module is created on K8s, how the modules discover each other, how training starts, and so on. The architecture of DI Orchestrator is shown in the figure below:
To support running DI-engine in Kubernetes (K8s), we designed DI Orchestrator. This document explains how, with DI Orchestrator, each DI-engine module is created on K8s, how the modules discover each other, how training starts, and so on. The architecture of DI Orchestrator is shown in the figure below:
![](images/di-arch.png)
![](images/di-arch.svg)
The system consists of two main modules: `di-server` and `di-operator`. `DDPL` stands for ddp learner, `Lm` for learner, `Cn` for collector, and `Aggregator+DDPL` forms a logic learner. The following sections first describe how DI Orchestrator creates and starts each DI module (each one a [pod](https://kubernetes.io/docs/concepts/workloads/pods/) in K8s) after a DI job is submitted to K8s, and then introduce di-server and di-operator.
The system consists of two main modules: `di-server` and `di-operator`. `DDPL` stands for ddp learner, `Lm` for learner, `Cn` for collector, and `Aggregator+DDPL` forms a logic learner. The following sections first describe how DI Orchestrator creates and starts each DI-engine module (each one a [pod](https://kubernetes.io/docs/concepts/workloads/pods/) in K8s) after a DI-engine job is submitted to K8s, and then introduce di-server and di-operator.
## Job creation process
This section describes the job creation process, illustrating the entire life cycle of a DI job in K8s from creation to completion.
This section describes the job creation process, illustrating the entire life cycle of a DI-engine job in K8s from creation to completion.
- Write the AggregatorConfig yaml file to define the aggregator template; it will be used to create aggregators when a DIJob is created later. Aggregators provide data-parallel training services for the training side.
- Write the DIJob yaml file to define the templates of coordinator, collector and learner, and submit it to the K8s cluster.
- di-operator watches the submission of the DIJob, creates the coordinator, and creates an accessible domain name for the coordinator.
......@@ -41,7 +41,7 @@ type DIJobSpec struct {
// CleanPodPolicy defines the policy to clean pods after DIJob completed
CleanPodPolicy CleanPodPolicy `json:"cleanPodPolicy,omitempty"`
// Volumes defines the shared volumes for DI components
// Volumes defines the shared volumes for DI-engine components
Volumes []corev1.Volume `json:"volumes,omitempty"`
Coordinator CoordinatorSpec `json:"coordinator"`
......@@ -60,7 +60,7 @@ type AggregatorConfigSpec struct {
```
> **Why is aggregator defined separately?**
The aggregator is common to all jobs that use the DI framework for RL training, so we define it as a global, shared resource called AggregatorConfig; after RL jobs are submitted, di-server creates aggregators by reading the cluster's single AggregatorConfig. Note that the aggregator only targets the most common data-parallel training; other parallel training methods require defining a new Custom Resource.
The aggregator is common to all jobs that use the DI-engine framework for RL training, so we define it as a global, shared resource called AggregatorConfig; after RL jobs are submitted, di-server creates aggregators by reading the cluster's single AggregatorConfig. Note that the aggregator only targets the most common data-parallel training; other parallel training methods require defining a new Custom Resource.
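For orientation, a minimal AggregatorConfig manifest might look like the sketch below. The group/version follows the `diengine.opendilab.org` CRD name that appears later on this page; the field layout is inferred from the AggregatorConfigSpec excerpt above, and the image and command are illustrative assumptions rather than the repo's actual sample.
```yaml
# Hypothetical AggregatorConfig sketch; apiVersion follows the CRD name
# dijobs.diengine.opendilab.org seen below, other values are assumptions.
apiVersion: diengine.opendilab.org/v1alpha1
kind: AggregatorConfig
metadata:
  name: aggregator-config
  namespace: di-system
spec:
  aggregator:
    template:
      spec:
        containers:
        - name: di-container
          image: opendilab/ding:latest    # assumed image
          command: ["/bin/bash", "-c"]
          args: ["ding -m aggregator"]    # assumed startup command
```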
### Status definition
After a user submits a DIJob, di-operator takes over the management of its life cycle. To help users understand the DIJob's status, we define the following phases:
......@@ -103,7 +103,7 @@ func (r *DIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl
After a user submits a DIJob, the informer receives the submission event and triggers the handler, after which the Reconcile function is called. In Reconcile, listing the job's pods reveals that the coordinator has not been created, so the coordinator template defined in the DIJob is read and the corresponding coordinator pod (in which the coordinator process runs) and service (used for inter-pod communication) are created, with some environment variables written into the pod, including the pod's name, the pod's namespace, and the URL for accessing the coordinator.
Each module of the DI framework occupies a port with a default value, as shown below:
Each module of the DI-engine framework occupies a port with a default value, as shown below:
```go
DefaultCollectorPort = 22270
......@@ -122,7 +122,7 @@ After the coordinator is created, di-operator watches the pods' status and updates the DIJob status
di-operator implements webhook verification: a MutatingWebhook is created to set default values for DIJob, and a ValidatingWebhook is created to verify the DIJob's correctness. For example, for the `CleanPodPolicy` field, the MutatingWebhook sets its default value to `Running`, meaning that all running pods are deleted after the DIJob completes; the ValidatingWebhook checks the value of the `CleanPodPolicy` field, and if the user-set value is not one of `None`, `ALL` or `Running`, the DIJob is rejected.
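As a rough sketch of the defaulting side, in the style kubebuilder scaffolds for MutatingWebhooks; the type, method and constant names are assumptions, not the repo's exact code:
```go
// Hypothetical MutatingWebhook defaulting logic; names are assumed.
func (r *DIJob) Default() {
	// CleanPodPolicy defaults to Running: delete pods that are still
	// running once the DIJob completes.
	if r.Spec.CleanPodPolicy == "" {
		r.Spec.CleanPodPolicy = CleanPodPolicyRunning
	}
}
```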
## DI Server
di-server is an http server customized for the DI framework, providing the ability to add, delete and query collectors, learners and aggregators. Through these interfaces, di-server gives DIJob the ability to dynamically add and remove collectors and learners. The following briefly introduces di-server's design: the local cache that stores AggregatorConfig, DIJob and all of DIJob's pods, and the http interfaces for dynamically adding, deleting and querying collectors, learners and aggregators.
di-server is an http server customized for the DI-engine framework, providing the ability to add, delete and query collectors, learners and aggregators. Through these interfaces, di-server gives DIJob the ability to dynamically add and remove collectors and learners. The following briefly introduces di-server's design: the local cache that stores AggregatorConfig, DIJob and all of DIJob's pods, and the http interfaces for dynamically adding, deleting and querying collectors, learners and aggregators.
### Local cache
To reduce the frequency of queries between di-server and the K8s api server, and thereby lighten the api server's load, we use the informer mechanism of [client-go](https://github.com/kubernetes/client-go) to store AggregatorConfig, DIJob and all of DIJob's pods in a local cache, as shown in the following figure.
......@@ -158,8 +158,8 @@ genericInformer.Informer().GetIndexer().GetByKey(key)
## Advantages of DI Orchestrator
DI Orchestrator provides a K8s-based containerized solution for running the DI framework in distributed scenarios. For a user-submitted DIJob, di-operator orchestrates DI's modules so that each module runs normally and performs its training task. By calling di-server's interfaces, the coordinator is given the ability to add, delete and query all of its collectors, learners and aggregators, improving the DI framework's capacity for dynamic resource allocation. In summary, DI Orchestrator provides the following advantages:
1. Encapsulation. Relying on di-operator's orchestration, the details of deploying DI distributed RL training (including pod creation and service discovery) are transparent to the user. Following the DI framework's deployment requirements for distributed RL training, di-operator creates the coordinator, the coordinator then asks di-server to create the other modules, and di-operator records each module's pod status into the DIJob's status. The DIJob's life cycle is also maintained by di-operator, presenting the DIJob's status at different stages to the user.
DI Orchestrator provides a K8s-based containerized solution for running the DI-engine framework in distributed scenarios. For a user-submitted DIJob, di-operator orchestrates DI-engine's modules so that each module runs normally and performs its training task. By calling di-server's interfaces, the coordinator is given the ability to add, delete and query all of its collectors, learners and aggregators, improving the DI-engine framework's capacity for dynamic resource allocation. In summary, DI Orchestrator provides the following advantages:
1. Encapsulation. Relying on di-operator's orchestration, the details of deploying DI-engine distributed RL training (including pod creation and service discovery) are transparent to the user. Following the DI-engine framework's deployment requirements for distributed RL training, di-operator creates the coordinator, the coordinator then asks di-server to create the other modules, and di-operator records each module's pod status into the DIJob's status. The DIJob's life cycle is also maintained by di-operator, presenting the DIJob's status at different stages to the user.
2. Ease of use. Users only need to define the configuration of coordinator, collector and learner in the DIJob yaml file and submit it to the K8s cluster with one click; di-operator takes care of the deployment, freeing users from the complexity of deploying distributed RL training in a K8s cluster.
3. Robustness. Relying on K8s's pod restart mechanism, pods automatically restart after an unexpected exit, and the coordinator responds quickly and reconnects.
4. Dynamic scaling. The collectors/learners/aggregators that a DIJob needs change dynamically, so di-server provides http interfaces for dynamically adjusting the number of collectors/learners, allowing a DIJob to tune the ratio of collectors to learners to its own needs and optimize throughput; a hypothetical request is sketched below.
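As a rough illustration of such a scaling call; the host, port, path and payload below are assumptions for illustration only, not di-server's documented API.
```bash
# Hypothetical scaling request; endpoint and payload are assumed.
curl -X POST http://di-server.di-system:8080/v1alpha1/replicas \
  -H "Content-Type: application/json" \
  -d '{"namespace": "default", "coordinator": "cartpole-dqn-coordinator", "collectors": 4, "learners": 2}'
```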
# DI Operator architecture
The DI framework consists of 3 important modules, namely coordinator, collector and learner. In general, a DI training job has only one coordinator, and the number of learners and collectors can vary. The roles of the three modules are:
The DI-engine framework consists of 3 important modules, namely coordinator, collector and learner. In general, a DI-engine training job has only one coordinator, and the number of learners and collectors can vary. The roles of the three modules are:
- Coordinator. Maintain connections with collectors and learners, accept meta-info requests and posts from collectors and learners, and send tasks to collectors and learners.
- Collector. Request the path to the RL model stored in storage middleware from coordinator, load the RL model, and then generate data frames by stepping the environment according to the RL model's decisions. Store the data frames back to the storage middleware, and report meta-info (the storage path, size, etc.) of the data frames to coordinator.
- Learner. Request the data frames' storage path from coordinator and load the data frames from storage middleware to start training the RL model. After the training is completed, store the model into the storage middleware, and report model meta-info (storage path, size, etc.) to coordinator. Because the learner side often involves the additional distributed mechanism of data-parallel training, to avoid confusion we call the module interacting with coordinator the logic learner, which is the basic unit for coordinator to issue tasks; the single learner process in data-parallel training is called a ddp learner, and multiple ddp learner processes provide data-parallel services. One logic learner can correspond to one ddp learner (single-gpu) or multiple ddp learners (multi-gpu). In addition, to provide data-parallel training services, an additional aggregator module needs to be introduced. The aggregator is responsible for summarizing the training results of multiple ddp learners and sending them to coordinator. That is, the aggregator and multiple ddp learners form a logic learner, and coordinator will only interact with logic learners.
For an introduction to DI, please refer to the [DI developer tutorial](https://opendilab.github.io/DI-engine/tutorial_dev/index.html).
For an introduction to DI-engine, please refer to the [DI-engine developer tutorial](https://opendilab.github.io/DI-engine/tutorial_dev/index.html).
In order to provide running support for DI in Kubernetes (K8s), we designed `DI Orchestrator`. This article will explain how to use DI Orchestrator, how the modules of DI are created on K8s and discover each other, how to start training, etc. The architecture of DI Orchestrator is shown in the figure below:
In order to provide running support for DI-engine in Kubernetes (K8s), we designed `DI Orchestrator`. This article will explain how to use DI Orchestrator, how the modules of DI-engine are created on K8s and discover each other, how to start training, etc. The architecture of DI Orchestrator is shown in the figure below:
![](images/di-arch.png)
![](images/di-arch.svg)
There are two main modules, namely `di-server` and `di-operator`.
`DDPL` represents ddp learner, `Lm` represents logic learner, `Cn` represents collector, and `Aggregator+DDPL` constructs a logic learner. In the following pages, we will first introduce how `DI Orchestrator` creates and starts each module of DI after a DI job is submitted to K8s, and then introduces the architecture of `di-server` and `di-operator`.
`DDPL` represents ddp learner, `Lm` represents logic learner, `Cn` represents collector, and `Aggregator+DDPL` constructs a logic learner. In the following pages, we will first introduce how `DI Orchestrator` creates and starts each module of DI-engine after a DI-engine job is submitted to K8s, and then introduces the architecture of `di-server` and `di-operator`.
## Job creation process
Here is a description of the job creation process, illustrating the entire life cycle of a DI job from creation to execution in K8s.
Here is a description of the job creation process, illustrating the entire life cycle of a DI-engine job from creation to execution in K8s.
- Edit the AggregatorConfig yaml file to define the aggregator template, which will be used to create aggregators when a DIJob is created later. Aggregators provide data-parallel training services.
- Edit the DIJob yaml file to define the template of coordinator, collector and learner, and submit it to K8s.
- After di-operator receives the event of DIJob submission, it creates a coordinator, and creates an accessible domain name for the coordinator.
......@@ -25,7 +25,7 @@ Here is a description of the job creation process, illustrating the entire life
- When the training is completed, di-operator will delete all collectors and learners by default, while the coordinator will be kept for users to view logs and perform other operations. A minimal DIJob manifest is sketched below.
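To make the flow concrete, this is roughly what such a manifest could look like. The group/version follows the `diengine.opendilab.org` CRD name that appears elsewhere on this page, the job name echoes the `cartpole-dqn-coordinator` pod seen earlier, and the image and command are illustrative assumptions rather than the repo's actual sample.
```yaml
# Hypothetical minimal DIJob; layout follows the DIJobSpec excerpts on
# this page (cleanPodPolicy, volumes, coordinator), values are assumed.
apiVersion: diengine.opendilab.org/v1alpha1
kind: DIJob
metadata:
  name: cartpole-dqn
spec:
  cleanPodPolicy: Running
  coordinator:
    template:
      spec:
        containers:
        - name: di-container
          image: opendilab/ding:latest    # assumed image
          command: ["/bin/bash", "-c"]
          args: ["ding -m coordinator"]   # assumed startup command
```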
## DI Operator
Ding-operator is a component responsible for orchestrating DIJob in K8s. It uses K8s [operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) to monitor the status of DIJob objects in K8s cluster through the control loop in [controller pattern](https://kubernetes.io/docs/concepts/architecture/controller/), and to update the status of DIJob when necessary. The status is modified so that the actual status of DIJob is as consistent as possible with our predefined status.
Di-operator is a component responsible for orchestrating DIJob in K8s. It uses K8s [operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) to monitor the status of DIJob objects in K8s cluster through the control loop in [controller pattern](https://kubernetes.io/docs/concepts/architecture/controller/), and to update the status of DIJob when necessary. The status is modified so that the actual status of DIJob is as consistent as possible with our predefined status.
### API definition
According to the characteristics of each module, we have defined two Custom Resources, namely DIJob and AggregatorConfig. The former is used to define the prerequisites for coordinator, collector and learner to start running, including docker images, startup commands, computing and storage resources, environment variables, etc. The latter is used to define the prerequisites for aggregator.
......@@ -42,7 +42,7 @@ type DIJobSpec struct {
// CleanPodPolicy defines the policy to clean pods after DIJob completed
CleanPodPolicy CleanPodPolicy `json:"cleanPodPolicy,omitempty"`
// Volumes defines the shared volumes for DI components
// Volumes defines the shared volumes for DI-engine components
Volumes []corev1.Volume `json:"volumes,omitempty"`
Coordinator CoordinatorSpec `json:"coordinator"`
......@@ -61,7 +61,7 @@ type AggregatorConfigSpec struct {
```
> **Why should aggregator be defined alone?**
Aggregator is a common module for all RL training jobs using the DI framework, so we define the aggregator as a global and shared resource named AggregatorConfig. After RL jobs are submitted, di-server will read the global AggregatorConfig in the K8s cluster to create aggregators for these RL jobs. In addition, aggregator is only for the most common data-parallel training. You need to define a new Custom Resource if other parallel training methods are used.
Aggregator is a common module for all RL training jobs using the DI-engine framework, so we define the aggregator as a global and shared resource named AggregatorConfig. After RL jobs are submitted, di-server will read the global AggregatorConfig in the K8s cluster to create aggregators for these RL jobs. In addition, aggregator is only for the most common data-parallel training. You need to define a new Custom Resource if other parallel training methods are used.
### Status definition
After a DIJob is submitted, di-operator takes over the management of its life cycle. To give users a better view of the DIJob's status, we define the following phases:
......@@ -105,7 +105,7 @@ func (r *DIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl
When a DIJob is submitted, we first list the pods that belong to the DIJob in the Reconcile function and find that the coordinator has not been created. Then we read the coordinator template defined in the DIJob and create the corresponding coordinator pod (used to run the coordinator main process) and service (used for inter-pod communication), and write some environment variables into the pod, including the name of the pod, the namespace of the pod, the port which coordinator listens to, and the URL to access the coordinator.
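A compressed sketch of that control flow in controller-runtime style, assuming `DIJobReconciler` embeds controller-runtime's `client.Client` as kubebuilder scaffolds it; the label key and the helpers `hasCoordinator` and `buildCoordinatorPodAndService` are assumptions, not the repo's code:
```go
// Hypothetical sketch; only the Reconcile signature matches the excerpt above.
func (r *DIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var job v1alpha1.DIJob
	if err := r.Get(ctx, req.NamespacedName, &job); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// List the pods that belong to this DIJob.
	var pods corev1.PodList
	if err := r.List(ctx, &pods, client.InNamespace(req.Namespace),
		client.MatchingLabels{"dijob": job.Name}); err != nil { // label key assumed
		return ctrl.Result{}, err
	}

	// Coordinator not created yet: build its pod and service from the DIJob
	// template; the (assumed) helper injects the pod name, namespace and
	// coordinator URL as environment variables.
	if !hasCoordinator(pods.Items) {
		pod, svc := buildCoordinatorPodAndService(&job)
		if err := r.Create(ctx, pod); err != nil {
			return ctrl.Result{}, err
		}
		if err := r.Create(ctx, svc); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```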
The port occupied by each module of the DI framework has a default value, as shown below:
The port occupied by each module of the DI-engine framework has a default value, as shown below:
```go
DefaultCollectorPort = 22270
......@@ -124,7 +124,7 @@ To achieve the above goals, we can configure webhooks in K8s. K8s webhook consis
The webhook verification is implemented in di-operator. A MutatingWebhook is created to set default values for DIJob; a ValidatingWebhook is created to verify the correctness of DIJob. For example, for the `CleanPodPolicy` field in DIJob, we set its default value in the MutatingWebhook to `Running`, which means that all running pods will be deleted after the DIJob is completed. We verify the value of the `CleanPodPolicy` field in the ValidatingWebhook: if the value set by the user is not equal to any of `None`, `ALL`, or `Running`, the DIJob will be rejected.
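A minimal sketch of the validating side, using the `ValidateCreate` hook that kubebuilder scaffolds for ValidatingWebhooks; the constant names are assumptions:
```go
// Hypothetical ValidatingWebhook check; constant names are assumed.
func (r *DIJob) ValidateCreate() error {
	switch r.Spec.CleanPodPolicy {
	case CleanPodPolicyNone, CleanPodPolicyALL, CleanPodPolicyRunning:
		return nil
	default:
		return fmt.Errorf("cleanPodPolicy must be one of None, ALL or Running, got %q",
			r.Spec.CleanPodPolicy)
	}
}
```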
## DI Server
Ding-server is an http server customized for the DI framework, providing APIs for adding, deleting, and querying collectors, learners, and aggregators. By calling these APIs, di-server can provide DIJob with the ability to dynamically scale collectors and learners. The following will briefly introduce the design of di-server, including the local cache for storing AggregatorConfig, DIJob and all pods of DIJob, and the http interface design for dynamically adding, deleting and querying collectors, learners and aggregators.
Di-server is an http server customized for the DI-engine framework, providing APIs for adding, deleting, and querying collectors, learners, and aggregators. By calling these APIs, di-server can provide DIJob with the ability to dynamically scale collectors and learners. The following will briefly introduce the design of di-server, including the local cache for storing AggregatorConfig, DIJob and all pods of DIJob, and the http interface design for dynamically adding, deleting and querying collectors, learners and aggregators.
### Local cache
In order to reduce the frequency of queries between di-server and the K8s api server, thereby reducing the burden on the K8s api server, we use the informer mechanism of [client-go](https://github.com/kubernetes/client-go) to store AggregatorConfig, DIJob and all pods of DIJob in a local cache, as shown in the following figure; a client-go sketch of such a cache read follows.
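In client-go terms, a cache read might look like the sketch below; the `GetByKey` call mirrors the fragment visible in a hunk earlier on this page, while the surrounding setup is an assumption:
```go
// Hypothetical sketch of reading pods from the informer-backed local cache.
import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func podFromCache(cs kubernetes.Interface, stopCh <-chan struct{}, key string) (interface{}, bool, error) {
	factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	// Keys are "namespace/name"; the read hits the local store,
	// not the K8s api server.
	return podInformer.GetIndexer().GetByKey(key)
}
```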
......@@ -147,7 +147,7 @@ In order to support dynamic scaling of collectors/learners for DIJobs, di-server
![](images/di-api.png)
The following interfaces are provided:
The following http interfaces are provided:
| method | path | description |
|---|---|---|
......@@ -161,8 +161,8 @@ In order to support dynamic scaling of collectors/learners for DIJobs, di-server
## Advantages of DI Orchestrator
DI Orchestrator provides a K8s-based container-orchestration solution for the DI framework in a distributed scenario. For a DIJob, di-operator is responsible for arranging the various modules of DI so that each module can run normally and perform training tasks. By calling di-server's HTTP interface, coordinator is given the ability to add, delete, and query all its collectors, learners and aggregators, improving the dynamic allocation of DI framework resources. In summary, DI Orchestrator provides the following advantages:
1. Encapsulation. Relying on the orchestration capabilities of di-operator, deploying DI distributed RL training (including pod creation and service discovery) are transparent to us. According to the deployment requirements of the DI framework for distributed RL training, di-operator will create coordinator, and then the coordinator will request di-server to create other modules. Ding-operator will record the status of the pod of each module into the status of the DIJob. The life cycle of DIJob is also maintained by di-operator, providing us with status of DIJob in different stages.
2. Ease of use. We only need to define the configuration of coordinator, collector, and learner in the yaml file of DIJob, and submit them to K8s cluster with one click. Ding-operator will be responsible for deploying DI RL trainings and liberating us from the complex distributed RL deployments in K8s cluster.
3. Robustness. Relying on the pod restart mechanism of K8s, it ensures that pods can automatically restart in the event of an unexpected exit, and the coordinator can respond quickly and reconnect.
4. Dynamic expansion. Collectors/learners required by DIJob is dynamically changing, so di-server provides HTTP interfaces to allow us to dynamically adjust the number of collectors/learners, so that DIJob can adjust the ratio of collectors and learners according to its own needs to optimize throughput.
DI Orchestrator provides a K8s-based container-orchestration solution for the DI-engine framework in a distributed scenario. For a DIJob, di-operator is responsible for arranging the various modules of DI-engine so that each module can run normally and perform training tasks. By calling di-server's HTTP interface, coordinator is given the ability to add, delete, and query all its collectors, learners and aggregators, improving the dynamic allocation of DI-engine framework resources. In summary, DI Orchestrator provides the following advantages:
1. Encapsulation. Relying on the orchestration capabilities of di-operator, deploying DI-engine distributed RL training (including pod creation and service discovery) is transparent to us. According to the deployment requirements of the DI-engine framework for distributed RL training, di-operator will create coordinator, and then the coordinator will request di-server to create other modules. Di-operator will record the status of the pod of each module into the status of the DIJob. The life cycle of DIJob is also maintained by di-operator, providing us with status of DIJob in different stages.
2. Ease of use. We only need to define the configuration of coordinator, collector, and learner in the yaml file of DIJob, and submit them to K8s cluster with one click. Di-operator will be responsible for deploying DI-engine RL training and liberating us from the complex distributed RL deployments in K8s cluster.
3. Robustness. Relying on the pod restart mechanism of K8s ensures that pods can automatically restart in the event of an unexpected exit, and the coordinator can respond quickly and reconnect.
4. Dynamic expansion. Collectors/learners required by DIJob are dynamically changing, so di-server provides HTTP interfaces to allow us to dynamically adjust the number of collectors/learners, so that DIJob can adjust the ratio of collectors and learners according to its own needs to optimize throughput.
# developer guide
# Developer Guide
## prerequisites
## Prerequisites
- a well-prepared kubernetes cluster. Follow the [instructions](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/) to create a kubernetes cluster, or create a local kubernetes node using [kind](https://kind.sigs.k8s.io/docs/user/quick-start/) or [minikube](https://minikube.sigs.k8s.io/docs/start/)
- kustomize, installed via the following command
```bash
......@@ -11,7 +11,7 @@ kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
```bash
kubectl create -f ./config/certmanager/cert-manager.yaml
```
## project initialization
## Project Initialization
This project is based on [kubebuilder v3](https://github.com/kubernetes-sigs/kubebuilder/releases/download/v3.0.0/kubebuilder_linux_amd64), since CRDs generated by kubebuilder v2 are not compatible with kubernetes v1.20.
```bash
kubebuilder init --domain opendilab.org --license apache2 --owner "The OpenDILab authors"
......@@ -32,26 +32,28 @@ make manifests
```
New CRD files will be generated in [./config/crd/bases](./config/crd/bases)
## controller logic
## Controller Logic
Refer to [controllers](./controllers)
## di-server logic
## DI Server Logic
Refer to [server](./server)
## Installation
Run the following command in the project root directory.
```bash
# build images. If you are not working in Linux, here you should use `make docker-build`
make dev-images
# build images.
make docker-build
make docker-push
# deploy di-operator and server to cluster
make dev-deploy
```
Since the CustomResourceDefinitions are too long, you will probably find the following error:
![](docs/images/deploy-failed.png)
```bash
The CustomResourceDefinition "dijobs.diengine.opendilab.org" is invalid: metadata.annotations: Too long: must have at most 262144 bytes
```
Then run the following command will solve the problem:
Then running the following command will solve the problem:
```bash
kustomize build config/crd | kubectl create -f -
```
......@@ -66,5 +68,5 @@ di-server-7b86ff8df4-jfgmp 1/1 Running 0 59s
Install global components of DIJob defined in AggregatorConfig:
```bash
kubectl create -f examples/di-mock-agconfig.yaml -n di-system
kubectl create -f config/samples/agconfig.yaml -n di-system
```