Unverified · Commit 8c23752c · Author: Wing · Committer: GitHub

Refine backend (#6972)

Parent 104f398b
# Advanced deployment
OAP servers inter communicate with each other in a cluster environment.
OAP servers communicate with each other in a cluster environment.
In the cluster mode, you could run in different roles.
- Mixed(default)
- Receiver
- Aggregator
In some time, users want to deploy cluster nodes with explicit role. Then could use this.
Sometimes users may wish to deploy cluster nodes with a clearly defined role. They could then use this function.
## Mixed
Default role, the OAP should take responsibilities of
1. Receive agent traces or metrics.
1. Do L1 aggregation
1. Internal communication(send/receive)
1. Do L2 aggregation
By default, the OAP is responsible for:
1. Receiving agent traces or metrics.
1. L1 aggregation
1. Internal communication (sending/receiving)
1. L2 aggregation
1. Persistence
1. Alarm
## Receiver
The OAP should take responsibilities of
1. Receive agent traces or metrics.
1. Do L1 aggregation
1. Internal communication(send)
The OAP is responsible for:
1. Receiving agent traces or metrics.
1. L1 aggregation
1. Internal communication (sending)
## Aggregator
The OAP should take responsibilities of
The OAP is responsible for:
1. Internal communication(receive)
1. Do L2 aggregation
1. L2 aggregation
1. Persistence
1. Alarm
___
These roles are designed for complex deployment requirements based on security and network policy.
These roles are designed for complex deployment requirements on security and network policy.
## Kubernetes
If you are using our native [Kubernetes coordinator](backend-cluster.md#kubernetes), the `labelSelector`
setting is used for `Aggregator` choose rules. Choose the right OAP deployment based on your requirements.
\ No newline at end of file
setting is used for `Aggregator` role selection rules. Choose the right OAP deployment based on your needs.
......@@ -10,11 +10,11 @@ For example, if T is 1.2 seconds and a response completes in 0.5 seconds, then t
greater than 1.2 seconds dissatisfy the user. Responses greater than 4.8 seconds frustrate the user.
The apdex threshold T can be configured in `service-apdex-threshold.yml` file or via [Dynamic Configuration](dynamic-config.md).
The `default` item will be apply to a service isn't defined in this configuration as the default threshold.
The `default` item will apply to a service that isn't defined in this configuration as the default threshold.
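The `default` fallback and the way T splits responses into bands can be illustrated with a minimal Python sketch. It is not SkyWalking's implementation: the dictionary, the service name `checkout-service`, and the function names are hypothetical, and the scoring formula is the standard Apdex definition rather than anything stated in this file.

```python
# Hypothetical thresholds in milliseconds, mirroring the shape of
# service-apdex-threshold.yml; the service name is made up.
THRESHOLDS_MS = {"default": 500, "checkout-service": 1200}


def threshold_for(service: str) -> int:
    """Return the service's threshold T, falling back to the `default` item."""
    return THRESHOLDS_MS.get(service, THRESHOLDS_MS["default"])


def classify(service: str, latency_ms: float) -> str:
    """Classify one response as satisfied (<= T), tolerating (<= 4T) or frustrated."""
    t = threshold_for(service)
    if latency_ms <= t:
        return "satisfied"
    if latency_ms <= 4 * t:
        return "tolerating"
    return "frustrated"


def apdex(service: str, latencies_ms: list[float]) -> float:
    """Standard Apdex score: (satisfied + tolerating / 2) / total samples."""
    labels = [classify(service, ms) for ms in latencies_ms]
    return (labels.count("satisfied") + labels.count("tolerating") / 2) / len(labels)
```

With T = 1200 ms, a 500 ms response is satisfied and a 5000 ms response (above 4T = 4800 ms) is frustrated, which matches the 1.2 s example above. A service missing from the mapping is scored against the `default` value.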
## Configuration Format
The configuration content includes the service' names and their threshold:
The configuration content includes the names and thresholds of the services:
```yml
# default threshold is 500ms
......
# Alarm
Alarm core is driven by a collection of rules, which are defined in `config/alarm-settings.yml`.
There are three parts in alarm rule definition.
1. [Alarm rules](#rules). They define how metrics alarm should be triggered, what conditions should be considered.
1. [Webhooks](#webhook). The list of web service endpoint, which should be called after the alarm is triggered.
1. [gRPCHook](#gRPCHook). The host and port of remote gRPC method, which should be called after the alarm is triggered.
1. [Alarm rules](#rules). They define how metrics alarm should be triggered and what conditions should be considered.
1. [Webhooks](#webhook). The list of web service endpoints, which should be called after the alarm is triggered.
1. [gRPCHook](#gRPCHook). The host and port of the remote gRPC method, which should be called after the alarm is triggered.
## Entity name
Define the relation between scope and entity name.
Defines the relation between scope and entity name.
- **Service**: Service name
- **Instance**: {Instance name} of {Service name}
- **Endpoint**: {Endpoint name} in {Service name}
......@@ -16,49 +16,44 @@ Define the relation between scope and entity name.
- **Endpoint Relation**: {Source endpoint name} in {Source Service name} to {Dest endpoint name} in {Dest service name}
## Rules
**There are two types of rules, individual rule and composite rule, composite rule is the combination of individual rules**
**There are two types of rules: individual rules and composite rules. A composite rule is a combination of individual rules.**
### Individual rules
Alarm rule is constituted by following keys
- **Rule name**. Unique name, show in alarm message. Must end with `_rule`.
- **Metrics name**. A.K.A. metrics name in oal script. Only long, double, int types are supported. See
[List of all potential metrics name](#list-of-all-potential-metrics-name).
- **Include names**. The following entity names are included in this rule. Please follow [Entity name define](#entity-name).
- **Exclude names**. The following entity names are excluded in this rule. Please follow [Entity name define](#entity-name).
- **Include names regex**. Provide a regex to include the entity names. If both setting the include name list and include name regex, both rules will take effect.
- **Exclude names regex**. Provide a regex to exclude the entity names. If both setting the exclude name list and exclude name regex, both rules will take effect.
- **Include labels**. The following labels of the metric are included in this rule.
- **Exclude labels**. The following labels of the metric are excluded in this rule.
- **Include labels regex**. Provide a regex to include labels. If both setting the include label list and include label regex, both rules will take effect.
- **Exclude labels regex**. Provide a regex to exclude labels. If both setting the exclude label list and exclude label regex, both rules will take effect.
- **Tags**. Tags are key/value pairs that are attached to alarms. Tags are intended to be used to specify identifying attributes of alarms that are meaningful and relevant to users. If you want to these tags searchable on SkyWalking UI, you need to set the tag keys in `core/default/searchableAlarmTags`, or through system environment variable, `SW_SEARCHABLE_ALARM_TAG_KEYS`. The default supported key is `level`.
An alarm rule is made up of the following elements:
- **Rule name**. A unique name shown in the alarm message. It must end with `_rule`.
- **Metrics name**. This is also the metrics name in the OAL script. Only long, double, int types are supported. See the
[list of all potential metrics name](#list-of-all-potential-metrics-name).
- **Include names**. Entity names which are included in this rule. Please follow the [entity name definitions](#entity-name).
- **Exclude names**. Entity names which are excluded from this rule. Please follow the [entity name definitions](#entity-name).
- **Include names regex**. A regex that includes entity names. If both include-name list and include-name regex are set, both rules will take effect.
- **Exclude names regex**. A regex that excludes entity names. If both exclude-name list and exclude-name regex are set, both rules will take effect.
- **Include labels**. Metric labels which are included in this rule.
- **Exclude labels**. Metric labels which are excluded from this rule.
- **Include labels regex**. A regex that includes labels. If both include-label list and include-label regex are set, both rules will take effect.
- **Exclude labels regex**. A regex that excludes labels. If both the exclude-label list and exclude-label regex are set, both rules will take effect.
- **Tags**. Tags are key/value pairs that are attached to alarms. Tags are used to specify distinguishing attributes of alarms that are meaningful and relevant to users. If you would like to make these tags searchable on the SkyWalking UI, you may set the tag keys in `core/default/searchableAlarmTags`, or through system environment variable `SW_SEARCHABLE_ALARM_TAG_KEYS`. The key `level` is supported by default.
*The settings of labels is required by meter-system which intends to store metrics from label-system platform, just like Prometheus, Micrometer, etc.
The function supports the above four settings should implement `LabeledValueHolder`.*
*Label settings are required by the meter-system, which stores metrics from label-based platforms such as Prometheus and Micrometer.
A metrics function that supports the four label settings above must implement `LabeledValueHolder`.*
- **Threshold**. The target value.
For multiple values metrics, such as **percentile**, the threshold is an array. Described like `value1, value2, value3, value4, value5`.
Each value could the threshold for each value of the metrics. Set the value to `-` if don't want to trigger alarm by this or some of the values.
Such as in **percentile**, `value1` is threshold of P50, and `-, -, value3, value4, value5` means, there is no threshold for P50 and P75 in percentile alarm rule.
- **OP**. Operator, support `>`, `>=`, `<`, `<=`, `=`. Welcome to contribute all OPs.
- **Period**. How long should the alarm rule should be checked. This is a time window, which goes with the
backend deployment env time.
- **Count**. In the period window, if the number of **value**s over threshold(by OP), reaches count, alarm
should send.
- **Only as condition**. Specify if the rule can send notification or just as an condition of composite rule.
- **Silence period**. After alarm is triggered in Time-N, then keep silence in the **TN -> TN + period**.
By default, it is as same as **Period**, which means in a period, same alarm(same ID in same
metrics name) will be trigger once.
For multiple-value metrics, such as **percentile**, the threshold is an array. It is described as: `value1, value2, value3, value4, value5`.
Each value may serve as the threshold for each value of the metrics. Set the value to `-` if you do not wish to trigger the alarm by one or more of the values.
For example in **percentile**, `value1` is the threshold of P50, and `-, -, value3, value4, value5` means that there is no threshold for P50 and P75 in the percentile alarm rule.
- **OP**. The operator. It supports `>`, `>=`, `<`, `<=`, `=`. We welcome contributions of all OPs.
- **Period**. The length of the time window over which the alarm rule is checked. The window follows the clock of the backend deployment environment.
- **Count**. Within a period window, if the number of **value**s that cross the threshold (as judged by OP) reaches `count`, an alarm is sent.
- **Only as condition**. Indicates whether the rule can send notifications itself, or only serves as a condition of a composite rule.
- **Silence period**. After an alarm is triggered at Time-N (TN), it stays silent during **TN -> TN + period**.
By default, the silence period is equal to **Period**, so the same alarm (having the same ID in the same metrics name) is triggered only once within a period.
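How **Threshold**, **OP**, **Period**, **Count**, and **Silence period** interact can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the alarm core's code: the class name `RuleWindow` is hypothetical, one value is assumed to arrive per minute, and the silence period is counted in the same unit.

```python
import operator
from collections import deque

# The operators listed above, mapped to Python comparisons.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


class RuleWindow:
    """Toy model of one rule for one entity; one value arrives per minute."""

    def __init__(self, threshold, op, period, count, silence_period=None):
        self.threshold = threshold
        self.compare = OPS[op]
        self.count = count
        # Keep only the most recent `period` values (the time window).
        self.values = deque(maxlen=period)
        # Assumed default: the silence period equals the period.
        self.silence_period = period if silence_period is None else silence_period
        self.silence_left = 0

    def add(self, value) -> bool:
        """Record one value and return True if an alarm would be sent now."""
        self.values.append(value)
        if self.silence_left > 0:
            # Still inside TN -> TN + silence period: stay quiet.
            self.silence_left -= 1
            return False
        matched = sum(1 for v in self.values if self.compare(v, self.threshold))
        if matched >= self.count:
            self.silence_left = self.silence_period
            return True
        return False
```

For example, `RuleWindow(threshold=1000, op=">", period=3, count=2)` fires once two of the last three values exceed 1000, then stays silent for the next three values even if they also exceed the threshold.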
### Composite rules
**NOTE**. Composite rules only work for alarm rules targeting the same entity level, such as alarm rules of the service level.
For example, `service_percent_rule && service_resp_time_percentile_rule`. You shouldn't compose alarm rules of different entity levels.
such as one alarm rule of the service metrics with another rule of the endpoint metrics.
**NOTE**: Composite rules are only applicable to alarm rules targeting the same entity level, such as service-level alarm rules (`service_percent_rule && service_resp_time_percentile_rule`). Do not compose alarm rules of different entity levels, such as an alarm rule of the service metrics with another rule of the endpoint metrics.
Composite rule is constituted by the following keys
- **Rule name**. Unique name, show in alarm message. Must end with `_rule`.
- **Expression**. Specify how to compose rules, support `&&`, `||`, `()`.
- **Message**. Specify the notification message when rule triggered.
- **Tags**. Tags are key/value pairs that are attached to alarms. Tags are intended to be used to specify identifying attributes of alarms that are meaningful and relevant to users.
A composite rule is made up of the following elements:
- **Rule name**. A unique name shown in the alarm message. Must end with `_rule`.
- **Expression**. Specifies how to compose rules, and supports `&&`, `||`, and `()`.
- **Message**. The notification message to be sent out when the rule is triggered.
- **Tags**. Tags are key/value pairs that are attached to alarms. Tags are used to specify distinguishing attributes of alarms that are meaningful and relevant to users.
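To make the **Expression** element concrete, here is a minimal Python sketch that evaluates an expression built from rule names, `&&`, `||`, and `()` against the set of individual rules that are currently triggered. The function names are hypothetical, and giving `&&` higher precedence than `||` is an assumption of this sketch, not something stated above.

```python
import re

TOKEN = re.compile(r"\s*(&&|\|\||\(|\)|\w+)")


def tokenize(expression: str) -> list[str]:
    """Split an expression into rule names, `&&`, `||`, `(` and `)`."""
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression: str, triggered: set[str]) -> bool:
    """Evaluate the expression; a rule name is True if it is in `triggered`."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "||":
            pos += 1
            right = parse_and()  # always parse, then combine
            result = result or right
        return result

    def parse_and() -> bool:
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&&":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&&", "||", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in triggered

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result
```

Calling `evaluate("service_percent_rule && service_resp_time_percentile_rule", triggered)` returns `True` only when both rule names are in `triggered`, which is the situation in which the composite rule's message would be sent.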
```yaml
rules:
# Rule unique name, must be ended with `_rule`.
......@@ -124,33 +119,32 @@ composite-rules:
### Default alarm rules
We provided a default `alarm-setting.yml` in our distribution only for convenience, which including following rules
1. Service average response time over 1s in last 3 minutes.
1. Service success rate lower than 80% in last 2 minutes.
1. Percentile of service response time is over 1s in last 3 minutes
1. Service Instance average response time over 1s in last 2 minutes, and the instance name matches the regex.
1. Endpoint average response time over 1s in last 2 minutes.
1. Database access average response time over 1s in last 2 minutes.
1. Endpoint relation average response time over 1s in last 2 minutes.
For convenience, we provide a default `alarm-settings.yml` in our release. It includes the following rules:
1. Service average response time over 1s in the last 3 minutes.
1. Service success rate lower than 80% in the last 2 minutes.
1. Percentile of service response time over 1s in the last 3 minutes.
1. Service Instance average response time over 1s in the last 2 minutes, and the instance name matches the regex.
1. Endpoint average response time over 1s in the last 2 minutes.
1. Database access average response time over 1s in the last 2 minutes.
1. Endpoint relation average response time over 1s in the last 2 minutes.
### List of all potential metrics name
The metrics names are defined in official [OAL scripts](../../guides/backend-oal-scripts.md), right now
metrics from **Service**, **Service Instance**, **Endpoint**, **Service Relation**, **Service Instance Relation**, **Endpoint Relation** scopes could be used in Alarm, and the **Database access** same with **Service** scope.
The metrics names are defined in the official [OAL scripts](../../guides/backend-oal-scripts.md). Currently, metrics from the **Service**, **Service Instance**, **Endpoint**, **Service Relation**, **Service Instance Relation**, and **Endpoint Relation** scopes can be used in Alarm, and the **Database access** scope is treated the same as **Service**.
Submit issue or pull request if you want to support any other scope in alarm.
Submit an issue or a pull request if you want to support any other scopes in alarm.
## Webhook
Webhook requires the peer is a web container. The alarm message will send through HTTP post by `application/json` content type. The JSON format is based on `List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>` with following key information.
- **scopeId**, **scope**. All scopes are defined in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine.
- **name**. Target scope entity name. Please follow [Entity name define](#entity-name).
- **id0**. The ID of the scope entity matched the name. When using relation scope, it is the source entity ID.
- **id1**. When using relation scope, it will be the dest entity ID. Otherwise, it is empty.
- **ruleName**. The rule name you configured in `alarm-settings.yml`.
- **alarmMessage**. Alarm text message.
- **startTime**. Alarm time measured in milliseconds, between the current time and midnight, January 1, 1970 UTC.
- **tags**. The tags you configured in `alarm-settings.yml`.
The Webhook requires the peer to be a web container. The alarm message is sent via HTTP POST with the `application/json` content type. The JSON format is based on `List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>` with the following key information:
- **scopeId**, **scope**. All scopes are defined in `org.apache.skywalking.oap.server.core.source.DefaultScopeDefine`.
- **name**. Target scope entity name. Please follow the [entity name definitions](#entity-name).
- **id0**. The ID of the scope entity that matches with the name. When using the relation scope, it is the source entity ID.
- **id1**. When using the relation scope, it is the destination entity ID. Otherwise, it is empty.
- **ruleName**. The rule name configured in `alarm-settings.yml`.
- **alarmMessage**. The alarm text message.
- **startTime**. The alarm time, in milliseconds elapsed since midnight, January 1, 1970 UTC.
- **tags**. The tags configured in `alarm-settings.yml`.
Example as following
See the following example:
```json
[{
"scopeId": 1,
......@@ -182,10 +176,10 @@ Example as following
```
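On the receiving side, a peer only needs to accept the POST and decode a JSON array. The following is a minimal sketch using Python's standard library. The handler name, the `summarize` helper, the `serve` function, and port 8080 are hypothetical choices for illustration; only the field names come from the list above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(body: bytes) -> list[str]:
    """Turn the posted JSON array of alarm messages into one-line summaries."""
    alarms = json.loads(body)
    return [
        f"[{a.get('scope')}] {a.get('name')}: {a.get('alarmMessage')} "
        f"(rule={a.get('ruleName')}, startTime={a.get('startTime')})"
        for a in alarms
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever, accepting alarm webhooks on the given port."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()
```

Calling `serve()` starts the listener. The URL of this endpoint would then be listed under the webhooks in `alarm-settings.yml`.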
## gRPCHook
The alarm message will send through remote gRPC method by `Protobuf` content type.
The message format with following key information which are defined in `oap-server/server-alarm-plugin/src/main/proto/alarm-hook.proto`.
The alarm message will be sent through a remote gRPC method with the `Protobuf` content type.
The message contains key information, which is defined in `oap-server/server-alarm-plugin/src/main/proto/alarm-hook.proto`.
Part of protocol looks as following:
Part of the protocol looks like this:
```protobuf
message AlarmMessage {
int64 scopeId = 1;
......@@ -211,9 +205,9 @@ message KeyStringValuePair {
```
## Slack Chat Hook
To do this you need to follow the [Getting Started with Incoming Webhooks guide](https://api.slack.com/messaging/webhooks) and create new Webhooks.
Follow the [Getting Started with Incoming Webhooks guide](https://api.slack.com/messaging/webhooks) and create new Webhooks.
The alarm message will send through HTTP post by `application/json` content type if you configured Slack Incoming Webhooks as following:
The alarm message will be sent via HTTP POST with the `application/json` content type if you have configured Slack Incoming Webhooks as follows:
```yml
slackHooks:
textTemplate: |-
......@@ -229,8 +223,8 @@ slackHooks:
```
## WeChat Hook
Note, only WeCom(WeChat Company Edition) supports webhook. To use the WeChat webhook you need to follow the [Wechat Webhooks guide](https://work.weixin.qq.com/help?doc_id=13376).
The alarm message would send through HTTP post by `application/json` content type after you set up Wechat Webhooks as following:
Note that only the WeChat Company Edition (WeCom) supports webhooks. To use the WeChat webhook, follow the [WeChat Webhooks guide](https://work.weixin.qq.com/help?doc_id=13376).
The alarm message will be sent via HTTP POST with the `application/json` content type after you have set up WeChat webhooks as follows:
```yml
wechatHooks:
textTemplate: |-
......@@ -245,9 +239,9 @@ wechatHooks:
```
## Dingtalk Hook
To do this you need to follow the [Dingtalk Webhooks guide](https://ding-doc.dingtalk.com/doc#/serverapi2/qf2nxq/uKPlK) and create new Webhooks.
For security issue, you can config optional secret for individual webhook url.
The alarm message will send through HTTP post by `application/json` content type if you configured Dingtalk Webhooks as following:
Follow the [Dingtalk Webhooks guide](https://ding-doc.dingtalk.com/doc#/serverapi2/qf2nxq/uKPlK) and create new Webhooks.
For security purposes, you can configure an optional secret for each individual webhook URL.
The alarm message will be sent via HTTP POST with the `application/json` content type if you have configured Dingtalk Webhooks as follows:
```yml
dingtalkHooks:
textTemplate: |-
......@@ -263,10 +257,10 @@ dingtalkHooks:
```
## Feishu Hook
To do this you need to follow the [Feishu Webhooks guide](https://www.feishu.cn/hc/zh-cn/articles/360024984973) and create new Webhooks.
For security issue, you can config optional secret for individual webhook url.
if you want to at someone, you can config `ats` which is the feishu's user_id and separated by "," .
The alarm message will send through HTTP post by `application/json` content type if you configured Feishu Webhooks as following:
Follow the [Feishu Webhooks guide](https://www.feishu.cn/hc/zh-cn/articles/360024984973) and create new Webhooks.
For security purposes, you can configure an optional secret for each individual webhook URL.
If you would like to mention (@) specific users, configure `ats` with their Feishu user_ids, separated by `,`.
The alarm message will be sent via HTTP POST with the `application/json` content type if you have configured Feishu Webhooks as follows:
```yml
feishuHooks:
textTemplate: |-
......@@ -283,8 +277,8 @@ feishuHooks:
```
## WeLink Hook
To do this you need to follow the [WeLink Webhooks guide](https://open.welink.huaweicloud.com/apiexplorer/#/apiexplorer?type=internal&method=POST&path=/welinkim/v1/im-service/chat/group-chat) and create new Webhooks.
The alarm message will send through HTTP post by `application/json` content type if you configured WeLink Webhooks as following:
Follow the [WeLink Webhooks guide](https://open.welink.huaweicloud.com/apiexplorer/#/apiexplorer?type=internal&method=POST&path=/welinkim/v1/im-service/chat/group-chat) and create new Webhooks.
The alarm message will be sent via HTTP POST with the `application/json` content type if you have configured WeLink Webhooks as follows:
```yml
welinkHooks:
textTemplate: "Apache SkyWalking Alarm: \n %s."
......@@ -304,6 +298,6 @@ welinkHooks:
Since 6.5.0, the alarm settings can be updated dynamically at runtime by [Dynamic Configuration](dynamic-config.md),
which will override the settings in `alarm-settings.yml`.
In order to determine that whether an alarm rule is triggered or not, SkyWalking needs to cache the metrics of a time window for
each alarm rule, if any attribute (`metrics-name`, `op`, `threshold`, `period`, `count`, etc.) of a rule is changed,
In order to determine whether an alarm rule is triggered or not, SkyWalking needs to cache the metrics of a time window for
each alarm rule. If any attribute (`metrics-name`, `op`, `threshold`, `period`, `count`, etc.) of a rule is changed,
the sliding window will be destroyed and re-created, causing the alarm of this specific rule to start over.