Commit 8598fb0f, author: liqingping

docs: update architecture-cn

Parent 3046d211
......@@ -115,7 +115,7 @@ type Policy interface {
Users can implement their own scheduling algorithm to suit their needs.
When `job.spec.preemptible==false`, the Allocator does not schedule the job; it only allocates a fixed number of workers according to `job.spec.minReplicas` and writes the result to `job.status.replicas`. Users can still change the job's number of workers by modifying `job.status.replicas`.
> Note: `job.status.replicas` cannot be modified directly with `kubectl apply` or `kubectl edit`, because `job.status` is defined as a SubResource and every PUT and POST request for a DIJob ignores the `job.status` field. See [Kubernetes API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status)
> Note: `job.status.replicas` cannot be modified directly with `kubectl apply` or `kubectl edit`, because `job.status` is defined as a SubResource and every PUT and POST request for a DIJob ignores the `job.status` field; see [Kubernetes API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status). To change the replicas, run `go run ./hack/update_replicas.go --ns [your-job-namespace] --n [your-job-name] --r [expected-replicas]`.
#### Controller control loop
The Controller control loop reconciles the state of a DIJob, including lifecycle management and the creation and deletion of workers, following the state transition diagram described earlier.
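......@@ -145,6 +145,23 @@ The job_id is composed of the triple `namespace.name.generation`.
Jobs submitted by users run in the cluster according to the following workflow: the Allocator schedules, the Controller orchestrates containers, and the Server reports job profilings.
![](images/di-engine-schedule.png)
1. The user submits a DIJob to the K8s cluster.
2. The Allocator performs the initial allocation:
   1. For a non-preemptible job, it sets `job.status.replicas` according to `job.spec.minReplicas`.
   2. For a preemptible job, it sets `job.status.allocation` according to `job.spec.minReplicas`.
3. The Controller picks up the job changes in the K8s cluster.
4. The Controller creates the corresponding number of replicas.
   1. For a non-preemptible job, it creates the number of replicas given by `job.status.replicas`.
   2. For a preemptible job, it creates the number of replicas given by `job.status.allocation` and assigns each replica the node it should run on.
5. The replicas start and begin training; after a while they report the collected profilings data to the Server.
6. The Server writes the profilings data to `job.status.profilings`.
7. At every fixed scheduling interval, the Allocator reschedules all jobs:
   1. Non-preemptible jobs are not rescheduled.
   2. Preemptible jobs are scheduled globally according to the scheduling policy defined in the Allocator Policy, and each job's `job.status.allocation` is updated.
8. The Controller picks up the job changes in the K8s cluster.
9. The Controller creates the corresponding number of replicas.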
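
The initial allocation (step 2) and the replica count the Controller derives from it (step 4) can be sketched roughly as below. This is only an illustration: the `Job`, `JobSpec`, and `JobStatus` types are simplified, hypothetical stand-ins for the real DIJob API, `allocation` is assumed to be a list of node names (one per replica), and the round-robin node choice is a placeholder rather than the actual Allocator Policy.

```go
package main

import "fmt"

// JobSpec and JobStatus are hypothetical, simplified stand-ins for the
// DIJob fields named in the workflow; they are not the real API types.
type JobSpec struct {
	Preemptible bool
	MinReplicas int
}

type JobStatus struct {
	Replicas   int      // consulted for non-preemptible jobs
	Allocation []string // assumed shape: one node name per replica, for preemptible jobs
}

type Job struct {
	Name   string
	Spec   JobSpec
	Status JobStatus
}

// InitialAllocate mirrors step 2: it fills the status from spec.minReplicas.
// The round-robin node choice is only an illustrative placeholder.
func InitialAllocate(job *Job, nodes []string) {
	if !job.Spec.Preemptible {
		job.Status.Replicas = job.Spec.MinReplicas
		return
	}
	alloc := make([]string, 0, job.Spec.MinReplicas)
	for i := 0; i < job.Spec.MinReplicas && len(nodes) > 0; i++ {
		alloc = append(alloc, nodes[i%len(nodes)])
	}
	job.Status.Allocation = alloc
}

// DesiredReplicas mirrors step 4: the number of replicas the Controller
// would create for the job.
func DesiredReplicas(job *Job) int {
	if job.Spec.Preemptible {
		return len(job.Status.Allocation)
	}
	return job.Status.Replicas
}

func main() {
	nodes := []string{"node-a", "node-b"}
	fixed := &Job{Name: "fixed", Spec: JobSpec{Preemptible: false, MinReplicas: 2}}
	elastic := &Job{Name: "elastic", Spec: JobSpec{Preemptible: true, MinReplicas: 3}}
	for _, j := range []*Job{fixed, elastic} {
		InitialAllocate(j, nodes)
		fmt.Printf("%s: replicas=%d allocation=%v\n", j.Name, DesiredReplicas(j), j.Status.Allocation)
	}
}
```

Keeping the two branches in separate status fields means the Controller can tell from the status alone whether it must also pin replicas to nodes.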
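## Advantages of DI Orchestrator
DI Orchestrator provides the DI-engine framework with a K8s-based container runtime solution for distributed scenarios. For a DIJob submitted by a user, the Operator orchestrates the DI-engine modules so that each one runs properly and carries out the training job; its Allocator submodule gives DI-engine dynamic resource allocation and scheduling; and the Server's interface lets users add, delete, and query a job's workers. In summary, DI Orchestrator offers the following advantages: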
......
docs/images/di-engine-schedule.png (binary image updated: 210.2 KB to 221.9 KB)
......@@ -3,8 +3,7 @@ module opendilab.org/di-orchestrator
go 1.16
require (
github.com/deckarep/golang-set v1.7.1
github.com/gin-gonic/gin v1.7.7 // indirect
github.com/gin-gonic/gin v1.7.7
github.com/go-logr/logr v0.4.0
github.com/onsi/ginkgo v1.16.4
github.com/onsi/gomega v1.15.0
......
......@@ -111,8 +111,6 @@ github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSs
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/daviddengcn/go-colortext v0.0.0-20160507010035-511bcaf42ccd/go.mod h1:dv4zxwHi5C/8AeI+4gX4dCWOIvNi7I6JCSX0HvlKPgE=
github.com/deckarep/golang-set v1.7.1 h1:SCQV0S6gTtp6itiFrTqI+pfmJ4LN85S1YzhDf9rTHJQ=
github.com/deckarep/golang-set v1.7.1/go.mod h1:93vsz/8Wt4joVM7c2AVqh+YRMiUSc14yDtF28KmMOgQ=
github.com/dgrijalva/jwt-go v3.2.0+incompatible/go.mod h1:E3ru+11k8xSBh+hMPgOLZmtrrCbhqsmaPHjLKYnJCaQ=
github.com/dgryski/go-sip13 v0.0.0-20181026042036-e10d5fee7954/go.mod h1:vAd38F8PWV+bWy6jNmig1y/TA+kYO4g3RSRF0IAv0no=
github.com/docker/distribution v2.7.1+incompatible/go.mod h1:J2gT2udsDAN96Uj4KfcMRqY0/ypR+oyYUYmja8H+y+w=
......@@ -208,6 +206,7 @@ github.com/go-openapi/swag v0.19.5/go.mod h1:POnQmlKehdgb5mhVOsnJFsivZCEZ/vjK9gh
github.com/go-openapi/validate v0.18.0/go.mod h1:Uh4HdOzKt19xGIGm1qHf/ofbX1YQ4Y+MYsct2VUrAJ4=
github.com/go-openapi/validate v0.19.2/go.mod h1:1tRCw7m3jtI8eNWEEliiAqUIcBztB2KDnRCRMUi7GTA=
github.com/go-openapi/validate v0.19.8/go.mod h1:8DJv2CVJQ6kGNpFW6eV9N3JviE1C85nY1c2z52x1Gk4=
github.com/go-playground/assert/v2 v2.0.1 h1:MsBgLAaY856+nPRTKrp3/OZK38U/wa0CcBYNjji3q3A=
github.com/go-playground/assert/v2 v2.0.1/go.mod h1:VDjEfimB/XKnb+ZQfWdccd7VUvScMdVu0Titje2rxJ4=
github.com/go-playground/locales v0.13.0 h1:HyWk6mgj5qFqCT5fjGBuRArbVDfE4hi8+e8ceBS/t7Q=
github.com/go-playground/locales v0.13.0/go.mod h1:taPMhCMXrRLJO55olJkUXHZBHCxTMfnGwq/HNwmWNS8=
......
package main

import (
	"context"
	"flag"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	ctrl "sigs.k8s.io/controller-runtime"

	"opendilab.org/di-orchestrator/pkg/api/v2alpha1"
)

var (
	namespace string
	jobname   string
	replicas  int
)

func main() {
	flag.StringVar(&namespace, "ns", "default", "The namespace of the scaling job.")
	flag.StringVar(&jobname, "n", "gobigger-test", "The name of the scaling job.")
	flag.IntVar(&replicas, "r", 1, "The number of replicas for the job.")
	flag.Parse()

	cfg, err := ctrl.GetConfig()
	if err != nil {
		log.Fatalf("Failed to get kubeconfig: %v", err)
	}

	// Create a dynamic client scoped to the dijobs resource.
	dclient := dynamic.NewForConfigOrDie(cfg)
	gvr := schema.GroupVersionResource{
		Group:    v2alpha1.GroupVersion.Group,
		Version:  v2alpha1.GroupVersion.Version,
		Resource: "dijobs",
	}
	diclient := dclient.Resource(gvr)

	unjob, err := diclient.Namespace(namespace).Get(context.Background(), jobname, metav1.GetOptions{})
	if err != nil {
		log.Fatalf("Failed to get job with dynamic client: %v", err)
	}

	// Set job.status.replicas to the requested value.
	// Unstructured content stores integers as int64.
	err = unstructured.SetNestedField(unjob.Object, int64(replicas), "status", "replicas")
	if err != nil {
		log.Fatalf("Failed to set nested field status.replicas: %v", err)
	}

	// Write through the status subresource; a plain Update would ignore job.status.
	_, err = diclient.Namespace(namespace).UpdateStatus(context.Background(), unjob, metav1.UpdateOptions{})
	if err != nil {
		log.Fatalf("Failed to update status: %v", err)
	}
	log.Printf("Successfully updated dijob %s/%s replicas to %d", namespace, jobname, replicas)
}