# Serving Configuration

(简体中文|[English](./Serving_Configure_EN.md))

## Introduction

This document describes the configuration of C++ Serving and Python Pipeline:

- [Model Configuration File](#model-configuration-file): generated automatically when the model is converted; describes the model's inputs and outputs
- [C++ Serving](#c-serving): for high-performance scenarios; covers quick start and custom configuration
- [Python Pipeline](#python-pipeline): for composing multiple single-operator models into one service

## Model Configuration File

Before going into the server configuration, let's look at the model configuration file. When a model is converted into a PaddleServing model, the corresponding serving_client_conf.prototxt and serving_server_conf.prototxt files are generated. Both files have identical content: they describe the model's input and output parameters so that users can easily assemble requests. They are used by both the Server and the Client and do not need to be edited by the user. For the conversion procedure, see [How to save a model for Paddle Serving](./Save_CN.md). The protobuf schema is defined in `core/configure/proto/general_model_config.proto`.

Example:

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
```

Fields:

- feed_var: model input
- fetch_var: model output
- name: variable name
- alias_name: alias, mapped to the variable name
- is_lod_tensor: whether the variable is a LoD tensor; see [LoD field description](./LOD_CN.md)
- feed_type: data type, see the table below

|feed_type|type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
|20|STRING|

- shape: tensor dimensions
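To illustrate how these fields map to a client request, here is a minimal sketch using the `paddle_serving_client` API. It assumes a C++ Serving instance is already serving this model on 127.0.0.1:9393 (adjust the address and port for your setup); the feed key is the `alias_name` of `feed_var`, and the fetch names are the `alias_name` values of `fetch_var`.

```python
# Minimal client sketch -- assumes a running C++ Serving instance at 127.0.0.1:9393.
import numpy as np
from paddle_serving_client import Client

client = Client()
client.load_client_config("serving_client_conf.prototxt")  # the file described above
client.connect(["127.0.0.1:9393"])

# Feed key "x" comes from feed_var.alias_name (FLOAT32, shape 13);
# fetch name "concat_1.tmp_0" comes from fetch_var.alias_name.
# Depending on the Serving version, a leading batch dimension may be required.
data = np.random.rand(13).astype("float32")
fetch_map = client.predict(feed={"x": data}, fetch=["concat_1.tmp_0"])
print(fetch_map)
```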
## C++ Serving

### 1. Quick start and shutdown

A service can be started quickly by specifying only the model and the port:

```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```

This command generates the configuration files automatically and starts C++ Serving with them. For example, the command above creates a workdir_9393 directory with the following structure:

```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```

The full list of startup arguments:

| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `--thread` | int | `2` | Number of brpc service threads |
| `--runtime_thread_num` | int[]| `0` | Thread number for each model in asynchronous mode |
| `--batch_infer_size` | int[]| `32` | Batch number for each model in asynchronous mode |
| `--gpu_ids` | str[]| `"-1"` | GPU card ids for each model |
| `--port` | int | `9292` | Exposed port of the current service |
| `--model` | str[]| `""` | Path of the Paddle model directory to be served |
| `--mem_optim_off` | - | - | Disable memory / graphics memory optimization |
| `--ir_optim` | bool | False | Enable analysis and optimization of the computation graph |
| `--use_mkl` (Only for cpu version) | - | - | Run inference with MKL. Requires ir_optim. |
| `--use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Requires ir_optim. |
| `--use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Requires ir_optim. |
| `--use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim. |
| `--precision` | str | FP32 | Precision mode, supports FP32, FP16, INT8 |
| `--use_calib` | bool | False | Use TRT int8 calibration |
| `--gpu_multi_stream` | bool | False | Enable GPU multi-stream mode to obtain higher QPS |
| `--use_ascend_cl` | bool | False | Enable for Ascend 910; use together with use_lite for Ascend 310 |
| `--request_cache_size` | int | `0` | Size of the request cache in bytes. By default the cache is disabled |
| `--enable_prometheus` | bool | False | Enable Prometheus |
| `--prometheus_port` | int | 19393 | Port of the Prometheus exporter |
| `--use_dist_model` | bool | False | Whether to use a distributed model |
| `--dist_carrier_id` | str | "" | Carrier id of the distributed model |
| `--dist_cfg_file` | str | "" | Config file of the distributed model |
| `--dist_endpoints` | str | "" | Endpoints of the distributed model, separated by commas |
| `--dist_nranks` | int | 0 | Number of ranks in the distributed model |
| `--dist_subgraph_index` | int | -1 | Subgraph index of the distributed model |
| `--dist_master_serving` | bool | False | Whether this instance is the master serving of distributed inference |
| `--min_subgraph_size` | str | "" | Minimum subgraph size |
| `--gpu_memory_mb` | int | 50 | Initial GPU memory allocation in MB (default 50 MB) |
| `--cpu_math_thread_num` | int | 1 | Number of CPU computation threads |
| `--trt_workspace_size` | int | 33554432| TensorRT workspace size in bytes (default 1 << 25) |
| `--trt_use_static` | bool | False | Initialize TRT with static data |

#### When you want to deploy one model on multiple GPU cards

```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```

#### When one service deploys two models

```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```

#### When you want to stop the Serving service (run the following command in the directory where Serving was started, or under the path given by the SERVING_HOME environment variable)

```BASH
python3 -m paddle_serving_server.serve stop
```

The `stop` argument sends SIGINT to C++ Serving; replacing it with `kill` sends SIGKILL instead.

### 2. Starting with a custom configuration

In most cases the automatically generated configuration covers the common scenarios. For special scenarios, users can also write the configuration files themselves. These files include service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt, and proj.conf. The start command is:

```BASH
/bin/serving --flagfile=proj.conf
```

#### 2.1 proj.conf

proj.conf passes the service parameters and specifies the paths of the other configuration files. If a parameter is passed more than once, the last value wins.

```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```

The parameters and their default values:

| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision mode, supports FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit on requests processed in parallel, 0: unlimited|
|num_threads|10|Number of brpc service threads|
|bthread_concurrency|10|Number of bthreads|
|max_body_size|536870912|Maximum size of a brpc message|
|inferservice_path|"conf"|Path of the inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of the inferservice conf|
|resource_path|"conf"|Path of the resource conf|
|resource_file|"resource.prototxt"|Filename of the resource conf|
|workflow_path|"conf"|Path of the workflow conf|
|workflow_file|"workflow.prototxt"|Filename of the workflow conf|

#### 2.2 service.prototxt

service.prototxt configures the list of services mounted by a Paddle Serving instance. Its load path is given by `--inferservice_path` and `--inferservice_file`. The protobuf schema is `InferServiceConf` in `core/configure/server_configure.protobuf`. Example:

```
port: 8010
services {
  name: "GeneralModelService"
  workflows: "workflow1"
}
```

Where:

- port: the port the Serving instance listens on.
- services: use the default configuration and do not modify it. name specifies the service name, and workflow1 is defined in workflow.prototxt.

#### 2.3 workflow.prototxt

workflow.prototxt describes the concrete workflows. Its load path is given by `--workflow_path` and `--workflow_file`. The protobuf schema is the `Workflow` type in `configure/server_configure.protobuf`.

In the following example the workflow consists of three OPs: GeneralReaderOp reads the input data, GeneralInferOp depends on GeneralReaderOp and runs inference, and GeneralResponseOp returns the inference results:

```
workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "general_reader_0"
    type: "GeneralReaderOp"
  }
  nodes {
    name: "general_infer_0"
    type: "GeneralInferOp"
    dependencies {
      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
    name: "general_response_0"
    type: "GeneralResponseOp"
    dependencies {
      name: "general_infer_0"
      mode: "RO"
    }
  }
}
```

Where:

- name: workflow name, used to look up the workflow from service.prototxt
- workflow_type: only "Sequence" is supported
- nodes: the nodes that make up the workflow; multiple nodes can be configured and are chained through their dependencies
- node.name: corresponds one-to-one with node.type; see `python/paddle_serving_server/dag.py`
- node.type: class name of the OP executed by this node, matching the OP classes under serving/op/
- node.dependencies: list of upstream nodes this node depends on
- node.dependencies.name: must match the name of a node in the same workflow
- node.dependencies.mode: RO - Read Only, RW - Read Write
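For reference, the reader → infer → response sequence shown above is the default workflow that the launcher builds. The sketch below assembles roughly the same sequence with the `paddle_serving_server` builder API; treat it as an illustration under assumptions rather than a definitive recipe, since the exact op identifiers and method names can differ between Serving versions, and `serving_model`, `workdir_9393`, and port 9393 are placeholders taken from the quick-start example.

```python
# Rough sketch: build the default general_reader -> general_infer -> general_response
# workflow programmatically. Op identifiers ("general_reader", ...) and method names
# follow the paddle_serving_server builder API and may vary across versions.
from paddle_serving_server import OpMaker, OpSeqMaker, Server

op_maker = OpMaker()
read_op = op_maker.create("general_reader")
infer_op = op_maker.create("general_infer")
response_op = op_maker.create("general_response")

op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(infer_op)
op_seq_maker.add_op(response_op)

server = Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.set_num_threads(10)
server.load_model_config("serving_model")       # placeholder model directory
server.prepare_server(workdir="workdir_9393",   # writes the prototxt configs here
                      port=9393, device="cpu")
server.run_server()
```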
#### 2.4 resource.prototxt

resource.prototxt specifies the model configuration files. Its load path is given by `--resource_path` and `--resource_file`. The protobuf schema is `ResourceConf` in `core/configure/proto/server_configure.proto`. Example:

```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```

Where:

- model_toolkit_path: directory containing model_toolkit.prototxt
- model_toolkit_file: filename of model_toolkit.prototxt
- general_model_path: directory containing general_model.prototxt
- general_model_file: filename of general_model.prototxt

#### 2.5 model_toolkit.prototxt

model_toolkit.prototxt configures the model information and the inference engine. The protobuf schema is `ModelToolkitConf` in `core/configure/proto/server_configure.proto`. The on-disk path of model_toolkit.prototxt cannot be overridden from the command line. Example:

```
engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "uci_housing_model"
  gpu_ids: -1
  enable_memory_optimization: true
  enable_ir_optimization: false
  use_trt: false
  use_lite: false
  use_xpu: false
  use_gpu: false
  combined_model: false
  gpu_multi_stream: false
  use_ascend_cl: false
  runtime_thread_num: 0
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}
```

Where:

- name: engine name; matches node.name in workflow.prototxt and the name of the corresponding directory
- type: inference engine type; currently only "PADDLE_INFER" is supported
- reloadable_meta: the file's content is currently meaningless; its mtime is used to decide whether the reload threshold has been exceeded
- reloadable_type: reload condition: timestamp_ne/timestamp_gt/md5sum/revision/none

|reloadable_type|Meaning|
|---------------|----|
|timestamp_ne|the mtime of the file given by reloadable_meta has changed|
|timestamp_gt|the mtime of the file given by reloadable_meta is greater than or equal to the mtime recorded at the last check|
|md5sum|currently unused; if configured, the engine is never reloaded|
|revision|currently unused; if configured, the engine is never reloaded|

- model_dir: model file path
- gpu_ids: GPU device ids used by the engine at runtime; multiple ids can be specified, for example:

```
# use GPU 0, 1 and 2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```

- enable_memory_optimization: enable memory optimization
- enable_ir_optimization: enable IR optimization
- use_trt: enable TensorRT; use_gpu must be enabled as well
- use_lite: enable PaddleLite
- use_xpu: use Kunlun XPU
- use_gpu: use GPU
- combined_model: use a combined model file
- gpu_multi_stream: enable GPU multi-stream mode
- use_ascend_cl: use Ascend; enable it alone for Ascend 910, or together with use_lite for Ascend 310
- runtime_thread_num: if greater than 0, asynchronous (Async) mode is enabled and that many predictor instances are created
- batch_infer_size: maximum batch size in Async mode
- enable_overrun: in Async mode, always put the whole task into the task queue
- allow_split_request: in Async mode, allow a task to be split

#### 2.6 general_model.prototxt

general_model.prototxt has the same content as the model configuration file serving_server_conf.prototxt and describes the model's input and output parameters. Example:

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}
```

## Python Pipeline

### Quick start and shutdown

Python Pipeline is started with:

```BASH
python3 web_service.py
```

To stop the Serving service, run the following command in the directory where the Pipeline was started, or under the path given by the SERVING_HOME environment variable:

```BASH
python3 -m paddle_serving_server.serve stop
```

The `stop` argument sends SIGINT to Pipeline Serving; replacing it with `kill` sends SIGKILL instead.
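For context, `web_service.py` typically defines the service's ops and wires them into a DAG with the Pipeline web-service API. The sketch below is a minimal illustration only: the op and service names (`det`, `ocr`) and the config file name follow the configuration example in the next section, a real service would override `preprocess`/`postprocess` to decode inputs and parse model outputs, and only one of the two ops from that example is wired for brevity.

```python
# web_service.py -- minimal Pipeline service sketch (illustrative, not complete).
from paddle_serving_server.web_service import WebService, Op


class DetOp(Op):
    # A real op would override preprocess()/postprocess() here to turn request
    # data into a feed dict and to post-process the fetched tensors.
    pass


class OcrService(WebService):
    def get_pipeline_response(self, read_op):
        # Chain ops: built-in request reader -> "det" (must match an entry under op: in config.yaml)
        det_op = DetOp(name="det", input_ops=[read_op])
        return det_op


service = OcrService(name="ocr")
service.prepare_pipeline_config("config.yaml")
service.run_service()
```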
### Configuration file

Python Pipeline provides a user-friendly programming framework for composing multiple models into one service and is intended for multi-model composition scenarios.

Its configuration file is in YAML format, usually named config.yaml by default. Example:

```YAML
#rpc port; rpc_port and http_port must not both be empty. When rpc_port is empty and http_port is not,
#rpc_port is automatically set to http_port + 1
rpc_port: 18090

#http port; rpc_port and http_port must not both be empty. When rpc_port is usable and http_port is empty,
#no http_port is generated automatically
http_port: 9999

#worker_num, maximum concurrency.
#When build_dag_each_worker=True, the framework creates worker_num processes, each with its own gRPC server and DAG.
#When build_dag_each_worker=False, the framework sets max_workers=worker_num for the gRPC thread pool of the main thread.
worker_num: 20

#build_dag_each_worker: False, the framework builds a single DAG inside the process;
#True, the framework builds an independent DAG in each worker process
build_dag_each_worker: false

dag:
    #op resource type: True for the thread model, False for the process model
    is_thread_op: False

    #number of retries
    retry: 1

    #profiling: True generates Timeline performance data (with some performance overhead); False disables it
    use_profile: false
    tracer:
        interval_s: 10

    #client type: brpc, grpc or local_predictor. local_predictor does not start a Serving service; inference runs in-process
    #client_type: local_predictor

    #maximum channel size, 0 by default
    #channel_size: 0

    #for tensor-parallel distributed large-model scenarios: keep the first result that arrives and discard the rest to improve speed
    #channel_recv_frist_arrive: False

op:
    det:
        #concurrency: number of threads when is_thread_op=True, otherwise number of processes
        concurrency: 6

        #Serving IPs
        #server_endpoints: ["127.0.0.1:9393"]

        #fetch list; the names are the alias_name values of fetch_var in client_config
        #fetch_list: ["concat_1.tmp_0"]

        #client config of the det model
        #client_config: serving_client_conf.prototxt

        #timeout of the interaction with Serving, in ms
        #timeout: 3000

        #number of retries of the interaction with Serving; no retry by default
        #retry: 1

        #batch size of the queries sent to Serving, 1 by default. When batch_size > 1, auto_batching_timeout must be set,
        #otherwise requests block until batch_size is reached
        #batch_size: 2

        #batching timeout, used together with batch_size
        #auto_batching_timeout: 2000

        #when the op has no server_endpoints, the local service configuration is read from local_service_conf
        local_service_conf:
            #client type: brpc, grpc or local_predictor. local_predictor does not start a Serving service; inference runs in-process
            client_type: local_predictor

            #path of the det model
            model_config: ocr_det_model

            #fetch list; the names are the alias_name values of fetch_var in client_config
            fetch_list: ["concat_1.tmp_0"]

            #device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0

            #device ids; when devices is "" or omitted, inference runs on CPU; when devices is "0" or "0,1,2",
            #inference runs on the listed GPU cards
            devices: ""

            #use_mkldnn; when mkldnn is enabled, ir_optim=True must be set as well, otherwise it has no effect
            #use_mkldnn: True

            #ir_optim; when TensorRT is enabled, ir_optim=True must be set as well, otherwise it has no effect
            ir_optim: True

            #number of CPU computation threads; setting it reduces per-request latency in CPU scenarios
            #thread_num: 10

            #precision; lowering the precision speeds up inference
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); "int8" is not supported
            precision: "fp32"

            #mem_optim, memory / graphic memory optimization
            #mem_optim: True

            #use_calib, use TRT int8 calibration
            #use_calib: False

            #use_mkldnn, use mkldnn for CPU
            #use_mkldnn: False

            #cache capacity of different input shapes for mkldnn
            #mkldnn_cache_capacity: 0

            #mkldnn_op_list, op list accelerated with MKLDNN, None by default
            #mkldnn_op_list: []

            #mkldnn_bf16_op_list, op list accelerated with MKLDNN bf16, None by default
            #mkldnn_bf16_op_list: []

            #min_subgraph_size, minimal subgraph size for enabling TensorRT optimization, 3 by default
            #min_subgraph_size: 3

    rec:
        #concurrency: number of threads when is_thread_op=True, otherwise number of processes
        concurrency: 3

        #timeout, in ms
        timeout: -1

        #number of retries of the interaction with Serving; no retry by default
        retry: 1

        #when the op has no server_endpoints, the local service configuration is read from local_service_conf
        local_service_conf:
            #client type: brpc, grpc or local_predictor. local_predictor does not start a Serving service; inference runs in-process
            client_type: local_predictor

            #path of the rec model
            model_config: ocr_rec_model

            #fetch list; the names are the alias_name values of fetch_var in client_config
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]

            #device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0

            #device ids; when devices is "" or omitted, inference runs on CPU; when devices is "0" or "0,1,2",
            #inference runs on the listed GPU cards
            devices: ""

            #use_mkldnn; when mkldnn is enabled, ir_optim=True must be set as well, otherwise it has no effect
            #use_mkldnn: True

            #ir_optim; when TensorRT is enabled, ir_optim=True must be set as well, otherwise it has no effect
            ir_optim: True

            #number of CPU computation threads; setting it reduces per-request latency in CPU scenarios
            #thread_num: 10

            #precision; lowering the precision speeds up inference
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); "int8" is not supported
            precision: "fp32"
```
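With this configuration the service exposes an RPC port (18090) and an HTTP port (9999). As a rough illustration, a client could call the RPC port with `PipelineClient` as sketched below; the feed key `image` and fetch name `res` are placeholders that depend on how the service's ops pre- and post-process data.

```python
# Rough client sketch for the pipeline configured above (rpc_port: 18090).
# "image", "res" and "test.jpg" are placeholders for this illustration.
import base64
from paddle_serving_server.pipeline import PipelineClient

client = PipelineClient()
client.connect(["127.0.0.1:18090"])

with open("test.jpg", "rb") as f:
    image = base64.b64encode(f.read()).decode("utf8")

ret = client.predict(feed_dict={"image": image}, fetch=["res"])
print(ret)
```

Alternatively, the HTTP port (9999) can be called directly with an HTTP POST request; see the Pipeline examples in the repository for the exact request format.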
### Multiple GPUs on a single machine

For multi-GPU inference on a single machine, M op processes are bound to N GPU cards, which requires three parameters in config.yaml. First choose the process model, so that the concurrency equals the number of processes, then configure devices. Binding is done by iterating over the GPU card ids as processes start: for example, with 7 op processes and the device ids 0,1,2, the 1st, 4th and 7th processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.

```YAML
#op resource type: True for the thread model, False for the process model
is_thread_op: False

#concurrency: number of threads when is_thread_op=True, otherwise number of processes
concurrency: 7

devices: "0,1,2"
```

### Heterogeneous hardware

Besides CPU and GPU, Python Pipeline supports deployment on several kinds of heterogeneous hardware, controlled by device_type and devices in config.yaml. device_type takes precedence; when it is left empty, the type is inferred from devices. device_type values:

- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
- Ascend310(Arm) : 5
- Ascend910(Arm) : 6

Hardware configuration in config.yaml:

```YAML
#device type: when empty, it is decided by devices (CPU/GPU); 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
device_type: 0

#device ids; the hardware type is determined by device_type first. When devices is "" or omitted, inference runs on CPU;
#when it is "0" or "0,1,2", inference runs on the listed GPU cards
devices: "" # "0,1"
```

### Low-precision inference

Python Pipeline supports low-precision inference. The precision types supported on CPU, GPU and TensorRT are listed below:

- CPU
  - fp32(default)
  - fp16
  - bf16(mkldnn)
- GPU
  - fp32(default)
  - fp16(only valid with TensorRT)
  - int8
- TensorRT
  - fp32(default)
  - fp16
  - int8

```YAML
#precision; lowering the precision speeds up inference
#GPU supports: "fp32"(default), "fp16"(TensorRT), "int8";
#CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); "int8" is not supported
precision: "fp32"

#use_calib, enable it when using int8
use_calib: True
```