Update doc

02d581bb · TeslaZhao · 2b8393e7
5 changed files
--- a/doc/Offical_Docs/7-0_Python_Pipeline_Int_CN.md
+++ b/doc/Offical_Docs/7-0_Python_Pipeline_Int_CN.md
@@ -4,8 +4,9 @@

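 Paddle Serving provides Python Pipeline, a general-purpose programming framework for multi-model composite services. It addresses the pain points above, substantially improves GPU utilization, and is easy to develop and maintain.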

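-Read the following to learn the basic functions, design, and usage guide of the Python Pipeline framework.
- [Python Pipeline Basic Functions]()
- [Python Pipeline Usage Examples]()
- [Python Pipeline Advanced Usage]()
- [Python Pipeline Optimization Guide]()
+For a Python Pipeline usage example, see [Python Pipeline Quick Deployment Example](./3-2_QuickStart_Pipeline_OCR_CN.md).
+
+Read the following to learn about the design, advanced usage, and optimization guide of Python Pipeline.
+- [Python Pipeline Framework Design](7-1_Python_Pipeline_Design_CN.md)
+- [Python Pipeline Advanced Usage](7-2_Python_Pipeline_Senior_CN.md)
+- [Python Pipeline Optimization Guide](7-3_Python_Pipeline_Optimize_CN.md)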
--- a/doc/Offical_Docs/7-1_Python_Pipeline_Basic_CN.md
+++ b/doc/Offical_Docs/7-1_Python_Pipeline_Basic_CN.md
--- a/doc/Offical_Docs/7-3_Python_Pipeline_Senior_CN.md
+++ b/doc/Offical_Docs/7-3_Python_Pipeline_Senior_CN.md
 # Python Pipeline Advanced Usage

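-Advanced usage applies to complex scenarios and enables more customization, including skipping an OP in the DAG, custom data transfer structures, and multi-GPU inference.
+In complex business scenarios, the standard features may not meet requirements. This document introduces several advanced usages:
+- Skipping an Op in the DAG
+- Batch inference
+- Single-machine multi-GPU inference
+- Inference on multiple types of compute chips
+- Low-precision inference
+- TensorRT inference acceleration
+- MKLDNN inference acceleration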

-## Skipping an OP in the DAG

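-To skip an OP in the DAG, what is actually skipped is that OP's process stage: make the decision in preprocess, skip the process stage, and return directly after postprocess.
-The second element of the result list returned by preprocess, `is_skip_process=True`, indicates whether to skip the current OP's process stage and go directly to postprocess.
+**1. Skipping an Op in the DAG**
+
+This scenario typically arises when an Op's pre- or post-processing contains an if condition: when the condition is not met, the subsequent processing is skipped. What is actually skipped is that Op's process stage: make the decision in preprocess, skip the process stage, and return directly after postprocess.
+The second element of the result list returned by preprocess, `is_skip_process=True`, indicates whether to skip the current Op's process stage and go directly to postprocess.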

 ```python
 def preprocess(self, input_dicts, data_id, log_id):
@@ -35,32 +43,8 @@ def preprocess(self, input_dicts, data_id, log_id):

 ```
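
A minimal, framework-free sketch of the skip logic described above. It assumes the four-element return shape `(data, is_skip_process, error_code, error_info)`; the `"skip"` input key and the `run_op` driver are hypothetical and only stand in for a real Op and the framework that runs it.

```python
def preprocess(input_dicts, data_id, log_id):
    """Decide in preprocess whether the process stage should run."""
    # Take the single upstream Op's output (hypothetical input layout).
    (_, input_dict), = input_dicts.items()
    # Hypothetical condition: the request asks to bypass inference.
    is_skip_process = bool(input_dict.get("skip", False))
    # Assumed return shape: (data, is_skip_process, error_code, error_info).
    return input_dict, is_skip_process, None, ""


def run_op(input_dicts, data_id=0, log_id=0):
    """Toy driver: preprocess, then optionally process, then postprocess."""
    data, is_skip_process, err_code, err_info = preprocess(
        input_dicts, data_id, log_id)
    stages = ["preprocess"]
    if not is_skip_process:
        stages.append("process")  # model inference would happen here
    stages.append("postprocess")
    return stages
```

Calling `run_op({"read": {"skip": True}})` lists only the preprocess and postprocess stages, which is the behavior the text describes.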

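-## Customizing the Request and Response Structures in the proto
-
-When the default proto structure does not meet business needs, modify the Request and Response message structures in the proto of both files below, keeping them consistent.
-
-> pipeline/gateway/proto/gateway.proto
-
-> pipeline/proto/pipeline_service.proto
-
-Then recompile Serving Server.
-
-
-## Custom URL
-The grpc gateway handles POST requests; the default `method` is `prediction`, for example 127.0.0.1:8080/ocr/prediction. Users can customize name and method, so services with existing URLs can switch over seamlessly.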
-
-```proto
-service PipelineService {
-  rpc inference(Request) returns (Response) {
-    option (google.api.http) = {
-      post : "/{name=*}/{method=*}"
-      body : "*"
-    };
-  }
-};
-```
+**2. Batch Inference**

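-## Batch Inference
 Pipeline supports batch inference; increasing the batch size can improve GPU utilization. Python Pipeline supports 3 forms of batching, with the applicable scenarios as follows:
 - Scenario 1: a single inference request contains batched data (batch)
  - Each sample has a fixed length, the batch size is variable, and the data is converted to BCHW format
@@ -76,11 +60,12 @@ Pipeline supports batch inference; increasing the batch size can improve GPU utilization
 | :------------------------------------------: | :-----------------------------------------: |
 |  batch | The client sends batched data; set batch=True in client.predict |
 | mini-batch | preprocess returns a list; see the preprocess of RecOp in the OCR example |
-| auto-batching | Set batch_size and auto_batching_timeout at the OP level in config.yml |
+| auto-batching | Set batch_size and auto_batching_timeout at the Op level in config.yml |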
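
The auto-batching row can be illustrated with a small standalone sketch. It is not the framework's implementation: `collect_batch` is a hypothetical helper, and treating the timeout as milliseconds is an assumption. It shows why `batch_size` needs a timeout, since a partial batch is returned once the deadline passes instead of blocking forever.

```python
import queue
import time


def collect_batch(q, batch_size, timeout_ms):
    """Gather up to batch_size items, waiting at most timeout_ms in total."""
    batch = []
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: return a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing arrived before the deadline
    return batch
```

With three queued requests and `batch_size=2`, the first call returns a full batch of two and the second returns the single remaining request after the timeout.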


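-### 4.6 Single-Machine Multi-GPU
-For single-machine multi-GPU inference, M OP processes are bound to N GPU cards. Three parameters in `config.yml` are involved: select process mode, set the concurrency (which is the number of processes), and set devices to the GPU card IDs. Binding works by cycling through the GPU card IDs as processes start. For example, if 7 OP processes are started and `config.yml` sets devices:0,1,2, then the 1st, 4th, and 7th processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.
+**3. Single-Machine Multi-GPU Inference**
+
+For single-machine multi-GPU inference, M Op processes are bound to N GPU cards. Three parameters in `config.yml` are involved: select process mode, set the concurrency (which is the number of processes), and set devices to the GPU card IDs. Binding works by cycling through the GPU card IDs as processes start. For example, if 7 Op processes are started and `config.yml` sets devices:0,1,2, then the 1st, 4th, and 7th processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.
 - Process ID: 0 is bound to GPU card 0
 - Process ID: 1 is bound to GPU card 1
 - Process ID: 2 is bound to GPU card 2
@@ -94,3 +79,44 @@ Pipeline supports batch inference; increasing the batch size can improve GPU utilization
 # Compute hardware IDs. When devices is "" or omitted, CPU inference is used; when devices is "0" or "0,1,2", GPU inference is used on the listed GPU cards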
 devices: "0,1,2"
 ```
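
The round-robin binding described in this section can be sketched as follows. `bind_processes_to_gpus` is a hypothetical helper written for illustration, not a function of the framework; it only reproduces the mapping from 0-based process IDs to card IDs.

```python
def bind_processes_to_gpus(num_processes, devices):
    """Map each process ID to a GPU card ID in round-robin order."""
    # devices uses the same comma-separated form as config.yml, e.g. "0,1,2".
    gpu_ids = [int(d) for d in devices.split(",") if d.strip()]
    if not gpu_ids:
        return {}  # empty devices means CPU inference; nothing to bind
    return {pid: gpu_ids[pid % len(gpu_ids)] for pid in range(num_processes)}
```

For 7 processes and `devices="0,1,2"`, process IDs 0, 3, and 6 map to card 0, IDs 1 and 4 to card 1, and IDs 2 and 5 to card 2.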
+
+
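+**4. Inference on Multiple Types of Compute Chips**
+
+Besides CPU and GPU inference, Pipeline also supports inference deployment on a variety of other compute hardware. The hardware is selected in `config.yml` by `device_type` and `devices`. `device_type` takes priority; when it is omitted, the type is determined from `devices`. The `device_type` values are:
+- CPU (Intel) : 0
+- GPU (Jetson/Hygon DCU) : 1
+- TensorRT : 2
+- CPU (Arm) : 3
+- XPU : 4
+- Ascend310 : 5
+- Ascend910 : 6
+
+Hardware configuration in config.yml: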
+```
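+# Compute hardware type: when omitted, determined by devices (CPU/GPU); 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+device_type: 0
+
+# Compute hardware IDs; the hardware type is determined by device_type first. When devices is "" or omitted, CPU inference is used; when it is "0" or "0,1,2", GPU inference is used on the listed GPU cards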
+devices: "" # "0,1"
+```
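
The precedence between `device_type` and `devices` can be sketched as below. This is an illustrative reading of the rule stated above, not the framework's code: `resolve_device` and the `DEVICE_TYPES` labels are hypothetical names, and the config is assumed to be an already-parsed dict.

```python
# Labels follow the device_type list in this section (assumed spelling).
DEVICE_TYPES = {
    0: "cpu", 1: "gpu", 2: "tensorrt", 3: "arm cpu",
    4: "xpu", 5: "ascend310", 6: "ascend910",
}


def resolve_device(config):
    """Pick the hardware type: device_type first, then fall back to devices."""
    device_type = config.get("device_type")
    devices = config.get("devices") or ""
    if device_type is not None:
        return DEVICE_TYPES[device_type]
    # device_type omitted: non-empty devices means GPU, otherwise CPU.
    return "gpu" if devices.strip() else "cpu"
```

For example, `resolve_device({"devices": "0,1"})` selects GPU, while an explicit `device_type: 3` selects Arm CPU regardless of `devices`.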
+           
+**5. Low-Precision Inference**
+Pipeline Serving supports low-precision inference. The precision types supported on CPU, GPU, and TensorRT are listed below:
+
+- CPU
+  - fp32(default)
+  - fp16
+  - bf16(mkldnn)
+- GPU
+  - fp32(default)
+  - fp16
+  - int8
+- TensorRT
+  - fp32(default)
+  - fp16
+  - int8
+
+When using int8, enable use_calib: True.
+
+See the [simple_web_service](../../examples/Pipeline/simple_web_service) example.
--- a/doc/Offical_Docs/7-2_Python_Pipeline_Usage_CN.md
+++ b/doc/Offical_Docs/7-2_Python_Pipeline_Usage_CN.md
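-# Python Pipeline Usage Examples
-
-Deploying a Python Pipeline example takes 4 steps: download the model, configure, write code, and run inference tests.
-
-All Pipeline examples are under the [examples/Pipeline/](../../examples/Pipeline) directory. There are currently 7 types of model examples: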
- [PaddleClas](../../examples/Pipeline/PaddleClas) 
- [Detection](../../examples/Pipeline/PaddleDetection)  
- [bert](../../examples/Pipeline/PaddleNLP/bert)
- [imagenet](../../examples/Pipeline/PaddleClas/imagenet)
- [imdb_model_ensemble](../../examples/Pipeline/imdb_model_ensemble)
- [ocr](../../examples/Pipeline/PaddleOCR/ocr)
- [simple_web_service](../../examples/Pipeline/simple_web_service)
-
-The following uses imdb_model_ensemble to show how to use Pipeline Serving. The code is in the `Serving/examples/Pipeline/imdb_model_ensemble` folder, and the Server-side structure of the example is shown in the figure below:
-
-<div align=center>
-<img src='../images/pipeline_serving-image4.png' height = "200" align="middle"/>
-</div>
-
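-**Files Required for Deployment**
-Five kinds of files are required. The model files, configuration file, and server code are the three essentials for building a Pipeline service; the test client and test dataset are for testing.
- Model files
- Configuration file (config.yml)
-  - Service level: service port, number of gRPC threads, service timeout, number of retries, etc.
-  - DAG level: resource type, enabling Trace, performance profiling
-  - OP level: model path, concurrency, inference mode, compute hardware, inference timeout, auto-batching, etc.
- Server (web_server.py)
-  - Service level: define the service name, read the configuration file, start the service
-  - DAG level: specify the topology among multiple OPs
-  - OP level: override the OP's pre- and post-processing
- Test client
-  - Correctness checks
-  - Stress tests
- Test dataset
-  - Images, text, speech, etc.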
-
-
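-## Getting the Model
-
-The example fetches the model files with `get_data.sh`. The model files in the example already have the Feed/Fetch Var parameters saved; if yours do not, go to the [Saving Serving Deployment Parameters]() step.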
-```shell
-cd Serving/examples/Pipeline/imdb_model_ensemble
-sh get_data.sh
-```
-
-## Create config.yml
-This example uses the brpc client connection type; grpc or local_predictor can also be chosen.
-```yaml
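-# RPC port. rpc_port and http_port must not both be empty. When rpc_port is empty and http_port is not, rpc_port is automatically set to http_port+1
-rpc_port: 18070
-
-# HTTP port. rpc_port and http_port must not both be empty. When rpc_port is available and http_port is empty, http_port is not generated automatically
-http_port: 18071
-
-# worker_num: maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each building its own gRPC server and DAG
-# When build_dag_each_worker=False, the framework sets max_workers=worker_num for the gRPC thread pool of the main thread
-worker_num: 4
-
-# build_dag_each_worker: False, the framework creates a single DAG within the process; True, the framework creates independent DAGs in each process
-build_dag_each_worker: False
-
-dag:
-    # Op resource type: True for the thread model; False for the process model
-    is_thread_op: True
-
-    # Number of retries
-    retry: 1
-
-    # Profiling: True generates Timeline performance data, with some impact on performance; False disables it
-    use_profile: False
-
-    # Maximum channel length, default 0
-    channel_size: 0
-
-    # tracer: tracks framework throughput and the working state of each OP and channel. No data is generated without tracer
-    tracer:
-        # Interval between traces, in seconds
-        interval_s: 10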
-op:
-    bow:
-        # Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
-        concurrency: 1
-
-        # Client connection type: brpc, grpc, or local_predictor
-        client_type: brpc
-
-        # Number of retries for Serving interaction; no retry by default
-        retry: 1
-
-        # Timeout for Serving interaction, in ms
-        timeout: 3000
-
-        # Serving IPs
-        server_endpoints: ["127.0.0.1:9393"]
-
-        # Client-side config of the bow model
-        client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
-
-        # Fetch result list, based on the alias_name of fetch_var in client_config
-        fetch_list: ["prediction"]
-
-        # Number of requests batched per Serving query, default 1. When batch_size>1, set auto_batching_timeout; otherwise the Op blocks until batch_size is reached
-        batch_size: 2
-
-        # Batch query timeout, used together with batch_size
-        auto_batching_timeout: 2000
-    cnn:
-        # Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
-        concurrency: 1
-
-        # Client connection type: brpc
-        client_type: brpc
-
-        # Number of retries for Serving interaction; no retry by default
-        retry: 1
-
-        # Inference timeout, in ms
-        timeout: 3000
-
-        # Serving IPs
-        server_endpoints: ["127.0.0.1:9292"]
-
-        # Client-side config of the cnn model
-        client_config: "imdb_cnn_client_conf/serving_client_conf.prototxt"
-
-        # Fetch result list, based on the alias_name of fetch_var in client_config
-        fetch_list: ["prediction"]
-        
-        # Number of requests batched per Serving query, default 1.
-        batch_size: 2
-
-        # Batch query timeout, used together with batch_size
-        auto_batching_timeout: 2000
-    combine:
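-        # Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
-        concurrency: 1
-
-        # Number of retries for Serving interaction; no retry by default
-        retry: 1
-
-        # Inference timeout, in ms
-        timeout: 3000
-
-        # Number of requests batched per Serving query, default 1.
-        batch_size: 2
-
-        # Batch query timeout, used together with batch_size
-        auto_batching_timeout: 2000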
-```
-
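-## Writing the Server Code
-
-In the code example, pay particular attention to the preprocess and postprocess handling of the 3 custom Ops, and to the Combine Op initializer argument input_ops=[bow_op, cnn_op], which sets the list of predecessor OPs of the Combine Op.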
-
-```python
-from paddle_serving_server.pipeline import Op, RequestOp, ResponseOp
-from paddle_serving_server.pipeline import PipelineServer
-from paddle_serving_server.pipeline.proto import pipeline_service_pb2
-from paddle_serving_server.pipeline.channel import ChannelDataEcode
-import numpy as np
-from paddle_serving_app.reader import IMDBDataset
-
-class ImdbRequestOp(RequestOp):
-    def init_op(self):
-        self.imdb_dataset = IMDBDataset()
-        self.imdb_dataset.load_resource('imdb.vocab')
-
-    def unpack_request_package(self, request):
-        dictdata = {}
-        for idx, key in enumerate(request.key):
-            if key != "words":
-                continue
-            words = request.value[idx]
-            word_ids, _ = self.imdb_dataset.get_words_and_label(words)
-            dictdata[key] = np.array(word_ids)
-        return dictdata
-
-
-class CombineOp(Op):
-    def preprocess(self, input_data):
-        combined_prediction = 0
-        for op_name, data in input_data.items():
-            combined_prediction += data["prediction"]
-        data = {"prediction": combined_prediction / 2}
-        return data
-
-
-read_op = ImdbRequestOp()
-bow_op = Op(name="bow",
-            input_ops=[read_op],
-            server_endpoints=["127.0.0.1:9393"],
-            fetch_list=["prediction"],
-            client_config="imdb_bow_client_conf/serving_client_conf.prototxt",
-            concurrency=1,
-            timeout=-1,
-            retry=1)
-cnn_op = Op(name="cnn",
-            input_ops=[read_op],
-            server_endpoints=["127.0.0.1:9292"],
-            fetch_list=["prediction"],
-            client_config="imdb_cnn_client_conf/serving_client_conf.prototxt",
-            concurrency=1,
-            timeout=-1,
-            retry=1)
-combine_op = CombineOp(
-    name="combine",
-    input_ops=[bow_op, cnn_op],
-    concurrency=5,
-    timeout=-1,
-    retry=1)
-
-# use default ResponseOp implementation
-response_op = ResponseOp(input_ops=[combine_op])
-
-server = PipelineServer()
-server.set_response_op(response_op)
-server.prepare_server('config.yml')
-server.run_server()
-```
-
-## Start the Service and Verify
-
-```python
-from paddle_serving_client.pipeline import PipelineClient
-import numpy as np
-
-client = PipelineClient()
-client.connect(['127.0.0.1:18080'])
-
-words = 'i am very sad | 0'
-
-futures = []
-for i in range(3):
-    futures.append(
-        client.predict(
-            feed_dict={"words": words},
-            fetch=["prediction"],
-            asyn=True))
-
-for f in futures:
-    res = f.result()
-    if res["ecode"] != 0:
-        print(res)
-        exit(1)
-```
--- a/doc/Offical_Docs/7-4_Python_Pipeline_Optimize_CN.md
+++ b/doc/Offical_Docs/7-4_Python_Pipeline_Optimize_CN.md