# Meter Analysis Language

The meter system provides a functional analysis language called MAL (Meter Analysis Language) that lets users analyze and
aggregate meter data in the OAP streaming system. The result of an expression can be ingested by either the agent analyzer
or the OC/Prometheus analyzer.

## Language data type

In MAL, an expression or sub-expression can evaluate to one of the following two types:

 - **Sample family**:  A set of samples (metrics) containing a range of metrics whose names are identical.
 - **Scalar**: A simple numeric value that supports integer/long and floating/double.

## Sample family

A set of samples, which acts as the basic unit in MAL. For example:

```
instance_trace_count
```

The sample family above may contain the following samples which are provided by external modules, such as the agent analyzer:

```
instance_trace_count{region="us-west",az="az-1"} 100
instance_trace_count{region="us-east",az="az-3"} 20
instance_trace_count{region="asia-north",az="az-1"} 33
```

### Tag filter

MAL supports four types of operations to filter samples in a sample family by tag:

 - tagEqual: Keeps samples whose tag value exactly equals the string provided.
 - tagNotEqual: Keeps samples whose tag value does not equal the string provided.
 - tagMatch: Keeps samples whose tag value regex-matches the string provided.
 - tagNotMatch: Keeps samples whose tag value does not regex-match the string provided.

For example, this keeps all `instance_trace_count` samples whose region is us-west or asia-north and whose az is az-1:

```
instance_trace_count.tagMatch("region", "us-west|asia-north").tagEqual("az", "az-1")
```
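
The effect of chaining tag filters can be modeled in a few lines of Python. This is only an illustrative sketch, not how the OAP implements MAL: the `(tags, value)` representation and the helper names `tag_equal` and `tag_match` are assumptions, as is the choice to require the regex to match the whole tag value.

```python
import re

# Samples modeled as (tags, value) pairs, mirroring instance_trace_count above.
samples = [
    ({"region": "us-west", "az": "az-1"}, 100),
    ({"region": "us-east", "az": "az-3"}, 20),
    ({"region": "asia-north", "az": "az-1"}, 33),
]

def tag_equal(samples, key, expected):
    """Keep samples whose tag `key` is exactly `expected`."""
    return [(t, v) for t, v in samples if t.get(key) == expected]

def tag_match(samples, key, pattern):
    """Keep samples whose tag `key` fully matches the regex (assumed semantics)."""
    return [(t, v) for t, v in samples
            if key in t and re.fullmatch(pattern, t[key])]

# Equivalent of .tagMatch("region", "us-west|asia-north").tagEqual("az", "az-1")
result = tag_equal(tag_match(samples, "region", "us-west|asia-north"), "az", "az-1")
for tags, value in result:
    print(tags, value)
```

Under these assumptions, only the us-west and asia-north samples remain, because the us-east sample fails the region filter.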
### Value filter

MAL supports six types of operations to filter samples in a sample family by value:

- valueEqual: Keeps samples whose value exactly equals the value provided.
- valueNotEqual: Keeps samples whose value does not equal the value provided.
- valueGreater: Keeps samples whose value is greater than the value provided.
- valueGreaterEqual: Keeps samples whose value is greater than or equal to the value provided.
- valueLess: Keeps samples whose value is less than the value provided.
- valueLessEqual: Keeps samples whose value is less than or equal to the value provided.

For example, this filters all instance_trace_count samples for values >= 33:

```
instance_trace_count.valueGreaterEqual(33)
```
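
The six value filters correspond to the six standard comparison operators. The Python sketch below models that mapping; the `VALUE_FILTERS` table and `value_filter` helper are hypothetical names used for illustration and are not part of MAL.

```python
import operator

# Samples modeled as (tags, value) pairs, mirroring instance_trace_count above.
samples = [
    ({"region": "us-west", "az": "az-1"}, 100),
    ({"region": "us-east", "az": "az-3"}, 20),
    ({"region": "asia-north", "az": "az-1"}, 33),
]

# Each MAL value filter name mapped to the comparison it performs.
VALUE_FILTERS = {
    "valueEqual": operator.eq,
    "valueNotEqual": operator.ne,
    "valueGreater": operator.gt,
    "valueGreaterEqual": operator.ge,
    "valueLess": operator.lt,
    "valueLessEqual": operator.le,
}

def value_filter(samples, name, threshold):
    """Keep samples whose value satisfies the named comparison against `threshold`."""
    keep = VALUE_FILTERS[name]
    return [(t, v) for t, v in samples if keep(v, threshold)]

# Equivalent of instance_trace_count.valueGreaterEqual(33)
result = value_filter(samples, "valueGreaterEqual", 33)
for tags, value in result:
    print(tags, value)
```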
### Tag manipulator
MAL allows tag manipulators to change (i.e. add/delete/update) tags and their values.

#### K8s
MAL supports using the metadata of K8s to manipulate the tags and their values.
This feature requires authorizing the OAP Server to access K8s's `API Server`.

##### retagByK8sMeta
`retagByK8sMeta(newLabelName, K8sRetagType, existingLabelName, namespaceLabelName)`: Adds a new tag to the sample family based on the value of an existing tag. Several internal conversion types are provided, including:

- K8sRetagType.Pod2Service

Adds a tag to the sample using `service` as the key and `$serviceName.$namespace` as the value. The service is resolved from the value of the given tag key, which holds the name of a pod.

For example:
```
container_cpu_usage_seconds_total{namespace=default, container=my-nginx, cpu=total, pod=my-nginx-5dc4865748-mbczh} 2
```
Expression:
```
container_cpu_usage_seconds_total.retagByK8sMeta('service' , K8sRetagType.Pod2Service , 'pod' , 'namespace')
```
Output:
```
container_cpu_usage_seconds_total{namespace=default, container=my-nginx, cpu=total, pod=my-nginx-5dc4865748-mbczh, service='nginx-service.default'} 2
```
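
The retagging step can be pictured as a lookup from pod to service followed by adding one tag. The sketch below is a rough model only: the real feature queries the K8s `API Server`, whereas here a hard-coded dictionary (`POD_TO_SERVICE`, a hypothetical name) stands in for that lookup, and leaving the sample unchanged when no service is found is an assumption.

```python
# Hypothetical stand-in for the K8s API lookup: (namespace, pod) -> service name.
POD_TO_SERVICE = {
    ("default", "my-nginx-5dc4865748-mbczh"): "nginx-service",
}

def retag_pod2service(tags, new_tag, pod_tag, namespace_tag):
    """Return a copy of `tags` with `new_tag` set to '<service>.<namespace>'."""
    namespace = tags[namespace_tag]
    service = POD_TO_SERVICE.get((namespace, tags[pod_tag]))
    if service is None:
        # Assumption: a sample whose pod has no known service is left unchanged.
        return dict(tags)
    return {**tags, new_tag: f"{service}.{namespace}"}

sample = {
    "namespace": "default",
    "container": "my-nginx",
    "cpu": "total",
    "pod": "my-nginx-5dc4865748-mbczh",
}

# Equivalent of .retagByK8sMeta('service', K8sRetagType.Pod2Service, 'pod', 'namespace')
retagged = retag_pod2service(sample, "service", "pod", "namespace")
print(retagged["service"])
```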

### Binary operators

The following binary arithmetic operators are available in MAL:

 - \+ (addition)
 - \- (subtraction)
 - \* (multiplication)
 - / (division)

Binary operators are defined between scalar/scalar, sampleFamily/scalar and sampleFamily/sampleFamily value pairs.

Between two scalars, the operator evaluates to another scalar that is the result of applying the operator to both operands:

```
1 + 2
```

Between a sample family and a scalar, the operator is applied to the value of every sample in the sample family. For example:

```
instance_trace_count + 2
``` 

or 

```
2 + instance_trace_count
``` 

results in

```
instance_trace_count{region="us-west",az="az-1"} 102 // 100 + 2
instance_trace_count{region="us-east",az="az-3"} 22 // 20 + 2
instance_trace_count{region="asia-north",az="az-1"} 35 // 33 + 2
```
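
A sample family combined with a scalar behaves like a map over the sample values, with tags left untouched. A minimal Python sketch of that idea follows; the `(tags, value)` representation and the `apply_scalar` helper are assumptions made for illustration.

```python
import operator

# Samples modeled as (tags, value) pairs, mirroring instance_trace_count above.
samples = [
    ({"region": "us-west", "az": "az-1"}, 100),
    ({"region": "us-east", "az": "az-3"}, 20),
    ({"region": "asia-north", "az": "az-1"}, 33),
]

def apply_scalar(samples, op, scalar):
    """Apply `op(value, scalar)` to every sample, keeping its tags unchanged."""
    return [(tags, op(value, scalar)) for tags, value in samples]

# Equivalent of instance_trace_count + 2
result = apply_scalar(samples, operator.add, 2)
for tags, value in result:
    print(tags, value)
```

This sketch only covers the `sampleFamily <op> scalar` order. For non-commutative operators such as subtraction and division, the operand order matters and would need a second helper.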

Between two sample families, a binary operator is applied to each sample in the sample family on the left and
its matching sample in the sample family on the right. A new sample family with an empty name is generated.
Only the matched tags are preserved. Samples on either side that have no matching sample on the other side do not appear in the result.

Another sample family `instance_trace_analysis_error_count` is 

```
instance_trace_analysis_error_count{region="us-west",az="az-1"} 20
instance_trace_analysis_error_count{region="asia-north",az="az-1"} 11 
```

Example expression:

```
instance_trace_analysis_error_count / instance_trace_count
```

This returns a sample family containing the error rate of trace analysis. The sample with region us-east and az az-3
has no match and does not show up in the result:

```
{region="us-west",az="az-1"} 0.2  // 20 / 100
{region="asia-north",az="az-1"} 0.3333  // 11 / 33
```
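
The matching behavior resembles an inner join on tags. The Python sketch below models the division of the two example families, assuming that two samples match when their tag sets are identical; the actual matching rule in the OAP may be more permissive, and the `divide` helper is a hypothetical name.

```python
# Both families modeled as (tags, value) pairs, mirroring the examples above.
trace_count = [
    ({"region": "us-west", "az": "az-1"}, 100),
    ({"region": "us-east", "az": "az-3"}, 20),
    ({"region": "asia-north", "az": "az-1"}, 33),
]
error_count = [
    ({"region": "us-west", "az": "az-1"}, 20),
    ({"region": "asia-north", "az": "az-1"}, 11),
]

def divide(left, right):
    """Divide each left sample by the right sample that has identical tags."""
    right_by_tags = {frozenset(t.items()): v for t, v in right}
    out = []
    for tags, value in left:
        key = frozenset(tags.items())
        if key in right_by_tags:  # unmatched samples are dropped
            out.append((tags, value / right_by_tags[key]))
    return out

# Equivalent of instance_trace_analysis_error_count / instance_trace_count
result = divide(error_count, trace_count)
for tags, value in result:
    print(tags, round(value, 4))
```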

### Aggregation Operation

Sample family supports the following aggregation operations, which aggregate the samples of a single sample family
into a new sample family with fewer samples (sometimes just a single sample) and aggregated values:

 - sum (calculate sum over dimensions)
 - min (select minimum over dimensions)
 - max (select maximum over dimensions)
 - avg (calculate the average over dimensions)

These operations can aggregate over all label dimensions, or preserve distinct dimensions when the `by` parameter is provided.

```
<aggr-op>(by: <tag1, tag2, ...>)
```

Example expression:

```
instance_trace_count.sum(by: ['az'])
```

This outputs the following result:

```
instance_trace_count{az="az-1"} 133 // 100 + 33
instance_trace_count{az="az-3"} 20
```
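
Aggregating with `by` amounts to grouping samples by the listed tags and reducing each group. A minimal Python sketch for `sum` follows; the `sum_by` helper is a hypothetical name, and it assumes every sample carries all the tags listed in `by`.

```python
from collections import defaultdict

# Samples modeled as (tags, value) pairs, mirroring instance_trace_count above.
samples = [
    ({"region": "us-west", "az": "az-1"}, 100),
    ({"region": "us-east", "az": "az-3"}, 20),
    ({"region": "asia-north", "az": "az-1"}, 33),
]

def sum_by(samples, by):
    """Sum values per distinct combination of the tags named in `by`."""
    totals = defaultdict(int)
    for tags, value in samples:
        key = tuple((k, tags[k]) for k in by)  # only the `by` tags survive
        totals[key] += value
    return [(dict(key), total) for key, total in totals.items()]

# Equivalent of instance_trace_count.sum(by: ['az'])
result = sum_by(samples, ["az"])
for tags, value in result:
    print(tags, value)
```

Passing an empty `by` list collapses the whole family into a single sample with no tags.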

### Function

`Duration` is a textual representation of a time range. The accepted formats are based on the ISO-8601 duration format `PnDTnHnMn.nS`, where a day is regarded as exactly 24 hours.

Examples:
 - "PT20.345S" -- parses as "20.345 seconds"
 - "PT15M"     -- parses as "15 minutes" (where a minute is 60 seconds)
 - "PT10H"     -- parses as "10 hours" (where an hour is 3600 seconds)
 - "P2D"       -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
 - "P2DT3H4M"  -- parses as "2 days, 3 hours and 4 minutes"
 - "PT-6H3M"   -- parses as "-6 hours and +3 minutes"
 - "-PT6H3M"   -- parses as "-6 hours and -3 minutes"
 - "-PT-6H+3M" -- parses as "+6 hours and -3 minutes"
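
To make the format concrete, the following Python sketch converts such strings to seconds. It assumes the semantics match `java.time.Duration.parse`, whose documentation the description above appears to follow; the regex and the `parse_seconds` helper are illustrative and not part of MAL.

```python
import re

# Optional leading sign, then P, optional days, optional T-prefixed time part.
_DURATION = re.compile(
    r"(?P<sign>[+-]?)P"
    r"(?:(?P<days>[+-]?\d+)D)?"
    r"(?:T(?:(?P<hours>[+-]?\d+)H)?"
    r"(?:(?P<minutes>[+-]?\d+)M)?"
    r"(?:(?P<seconds>[+-]?\d+(?:\.\d+)?)S)?)?",
    re.IGNORECASE,
)

def parse_seconds(text):
    """Convert a PnDTnHnMn.nS string to seconds (a day is exactly 24 hours)."""
    m = _DURATION.fullmatch(text)
    parts = ("days", "hours", "minutes", "seconds")
    if m is None or not any(m.group(p) for p in parts):
        raise ValueError(f"not a valid duration: {text!r}")
    total = (
        int(m.group("days") or 0) * 86400
        + int(m.group("hours") or 0) * 3600
        + int(m.group("minutes") or 0) * 60
        + float(m.group("seconds") or 0)
    )
    # A leading minus negates the whole duration, including signed components.
    return -total if m.group("sign") == "-" else total

print(parse_seconds("PT15M"))
print(parse_seconds("P2DT3H4M"))
print(parse_seconds("-PT-6H+3M"))
```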

#### increase
`increase(Duration)`: Calculates the increase in the time range.

#### rate
`rate(Duration)`: Calculates the per-second average rate of increase in the time range.
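
The relationship between `increase` and `rate` can be sketched in Python as follows. This is a simplified model under stated assumptions: data points are `(timestamp_seconds, value)` pairs of a monotonically increasing counter, the increase is the difference between the last and first points inside the window, and the rate divides by the window length. The OAP may compute the divisor differently, for example from the elapsed time between the two points.

```python
def increase(points, now, window_seconds):
    """Difference between the last and first counter values inside the window."""
    in_window = [(t, v) for t, v in points if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return 0.0  # assumption: not enough data means no measurable increase
    return float(in_window[-1][1] - in_window[0][1])

def rate(points, now, window_seconds):
    """Per-second average rate of increase over the window."""
    return increase(points, now, window_seconds) / window_seconds

# A counter sampled once a minute, growing by 60 per minute.
points = [(0, 100), (60, 160), (120, 220)]

print(increase(points, now=120, window_seconds=120))
print(rate(points, now=120, window_seconds=120))
```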

#### irate
`irate()`: Calculates the per-second instant rate of increase in the time range.

#### tag
`tag({allTags -> })`: Updates the tags of samples. Users can add, drop, rename, and update tags.

#### histogram
`histogram(le: '<the tag name of le>')`: Transforms less-than-or-equal (`le`) based histogram buckets into meter-system histogram buckets.
The `le` parameter is the name of the tag that identifies the bucket.

#### histogram_percentile
`histogram_percentile([<p scalar>])`: Instructs the meter-system to calculate the p-percentile (0 ≤ p ≤ 100) from the buckets.

#### time
`time()`: Returns the number of seconds since January 1, 1970 UTC.

## Down Sampling Operation
MAL should instruct the meter-system on how to downsample metrics. Downsampling refers not only to aggregating raw samples to
the `minute` level, but also to rolling `minute` data up to higher levels, such as `hour` and `day`.

The downsampling function is called `downsampling` in MAL, and it accepts the following types:

 - AVG
 - SUM
 - LATEST
 - MIN (TODO)
 - MAX (TODO)
 - MEAN (TODO)
 - COUNT (TODO)

The default type is `AVG`.

For example, to get the latest time from `last_server_state_sync_time_in_seconds`:

```
last_server_state_sync_time_in_seconds.tagEqual('production', 'catalog').downsampling(LATEST)
```
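
The difference between the supported downsampling types can be illustrated with a small Python sketch that collapses the raw values received within one time bucket into a single value. The `downsample` helper and the sample numbers are assumptions for illustration, and only the three types not marked TODO are modeled.

```python
def downsample(values, kind="AVG"):
    """Collapse the raw values of one time bucket into a single value."""
    if not values:
        raise ValueError("no values to downsample")
    if kind == "AVG":
        return sum(values) / len(values)
    if kind == "SUM":
        return sum(values)
    if kind == "LATEST":
        return values[-1]  # values are assumed to be in arrival order
    raise ValueError(f"unsupported downsampling type: {kind}")

# Raw values that arrived within the same minute, oldest first.
raw = [10, 20, 60]

print(downsample(raw))            # default type is AVG
print(downsample(raw, "SUM"))
print(downsample(raw, "LATEST"))
```

For a timestamp metric such as the one above, `LATEST` is the natural choice, since averaging or summing timestamps would not yield a meaningful time.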

## Metric level function

Metrics have three levels: service, instance, and endpoint. Each level function extracts the level-relevant labels from the metric labels, then informs the meter-system of the level to which the metric belongs.

 - `service([svc_label1, svc_label2...])` extracts service level labels from the array argument.
 - `instance([svc_label1, svc_label2...], [ins_label1, ins_label2...])` extracts service level labels from the first array argument
   and instance level labels from the second array argument.
 - `endpoint([svc_label1, svc_label2...], [ep_label1, ep_label2...])` extracts service level labels from the first array argument
   and endpoint level labels from the second array argument.

## More Examples

Please refer to [OAP Self-Observability](../../../oap-server/server-bootstrap/src/main/resources/fetcher-prom-rules/self.yaml)