This RFC proposes a new metric API to replace the old perf-counter API.
## Motivation
The perf-counter API uses a naming convention that is hard to parse and query from external monitoring systems such as [Prometheus](https://prometheus.io/) or [open-falcon](https://open-falcon.org/).
Here are some examples of the perf-counter it exposes:
In our experience, these naming rules have the following problems:
- The names include characters that are invalid in Prometheus: '*', '.', '@', '#'. They are legal in open-falcon, but to support Prometheus we must convert all of them to '_'. As a result, Pegasus has to expose different styles of counter names to the two monitoring systems.
- Strange and meaningless words such as "zion" and "eon" are included in the counter names.
- Information such as "table name" and "replica id" is encoded into the counter name following obscure rules, such as "`#<table name>`" and "`@<replica id>`". This complicates the engineering around perf-counters, because additional parsing is needed to recover that information.
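To illustrate the first problem, here is a minimal sketch of the kind of conversion needed to make a counter name acceptable to Prometheus. The helper `to_prometheus_name` and the sample counter name used in the note below are hypothetical and are not taken from the Pegasus code base; the sketch only assumes the Prometheus rule that metric names match `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```cpp
#include <string>

// Hypothetical helper (not part of Pegasus): replaces every character that
// Prometheus does not allow in a metric name with '_'.
std::string to_prometheus_name(const std::string &counter_name)
{
    std::string result = counter_name;
    for (char &c : result) {
        const bool is_alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        const bool is_digit = (c >= '0' && c <= '9');
        if (!is_alpha && !is_digit && c != '_' && c != ':') {
            c = '_';
        }
    }
    // A metric name must not start with a digit either.
    if (!result.empty() && result[0] >= '0' && result[0] <= '9') {
        result.insert(0, "_");
    }
    return result;
}
```

With this sketch, a made-up name such as `replica*app.pegasus*get_qps@1.2` becomes `replica_app_pegasus_get_qps_1_2`. The separators that used to mark the section and the replica ID are all flattened into '_', so the converted name can no longer be split back into its parts reliably.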
## Design
### Naming
First, let's take a look at the new naming. For the above perf-counters:
- I apply [the naming rules of Prometheus](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels) and use only underscores '_' as word separators.
- "Tags"/"labels" (optional key-value pairs that describe the attributes of each metric) are put to better use:
  - "`#<table name>`" is replaced with the tag "`table=<table name>`".
  - "`@<replica id>`" is replaced with the tags "`table=<table id>,partition=<partition id>`".
- I introduce a new tag called "entity", which identifies the level of the perf-counter. There are 3 types of entity in Pegasus:
  - entity=server
  - entity=table
  - entity=replica
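The mapping from the old encoding to tags can be sketched as follows. This is an illustration only: the `metric_id` struct and both helpers are hypothetical, and the sketch assumes that a replica ID is written as `<table id>.<partition id>` (for example `1.2`), which this RFC does not spell out.

```cpp
#include <map>
#include <string>

// Hypothetical representation of a metric in the new naming scheme.
struct metric_id
{
    std::string name;
    std::map<std::string, std::string> tags;
};

// Hypothetical helper: splits a legacy replica-level counter name of the
// assumed form "<name>@<table id>.<partition id>" into a name plus tags.
// Returns false if the input does not follow that layout.
bool parse_replica_counter(const std::string &legacy, metric_id &out)
{
    const auto at = legacy.find('@');
    if (at == std::string::npos || at == 0) {
        return false;
    }
    const auto dot = legacy.find('.', at + 1);
    if (dot == std::string::npos || dot == at + 1 || dot + 1 == legacy.size()) {
        return false;
    }
    out.name = legacy.substr(0, at);
    out.tags = {{"entity", "replica"},
                {"table", legacy.substr(at + 1, dot - at - 1)},
                {"partition", legacy.substr(dot + 1)}};
    return true;
}

// Renders the metric as name{key="value",...}, similar to how Prometheus
// displays labels. std::map iterates its keys in alphabetical order.
std::string render(const metric_id &id)
{
    std::string out = id.name + "{";
    bool first = true;
    for (const auto &kv : id.tags) {
        if (!first) {
            out += ",";
        }
        out += kv.first + "=\"" + kv.second + "\"";
        first = false;
    }
    return out + "}";
}
```

Under these assumptions, `get_qps@1.2` is rendered as `get_qps{entity="replica",partition="2",table="1"}`. The metric name stays the same for every replica, and a monitoring system can filter or aggregate by tag without parsing the name.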
### API
Since the naming has changed, the perf-counter API must evolve accordingly. The result is the new metric API.
To declare a perf-counter, take `get_qps` of a replica as an example. With the old API, it looked like this:
```cpp
class rocksdb_store {
private:
  perf_counter_wrapper _get_qps;
};

rocksdb_store::rocksdb_store() {
  _get_qps = init_app_counter(
      "app.pegasus",
      fmt::format("get_qps@{}", str_gpid),
      COUNTER_TYPE_RATE,
      "statistic the qps of GET request");
}
```
After using the new API:
```cpp
METRIC_DEFINE_meter(server, get_qps, kRequestsPerSecond, "the qps of GET requests")
```
The code may look more complicated, but it is in fact simpler:
1. Each instance of `METRIC_ENTITY_replica` is attributed with a replica's ID. Once instantiated, the instance (`_replica_entity`) can be used to create every metric that belongs to this replica. With the old API, every perf-counter had to encode the ID into its name (`fmt::format("get_qps@{}", str_gpid)`).
2. "app.pegasus" in the old API is called a "section". Remembering what a section name means is unfriendly to newcomers, so we now use entities instead.
3. Metric definition and metric instantiation are decoupled. A metric is defined in global scope, and its instantiation in the constructor of `rocksdb_store` is reduced to one line of code.
The API is inspired by [kudu metrics](https://github.com/apache/kudu/blob/master/src/kudu/util/metrics.h), which is a well-documented and mature implementation of metrics in C++.
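To make the decoupling of definition and instantiation more concrete, here is a self-contained sketch of how entities and metric prototypes could fit together. It is not the actual implementation: the classes `counter`, `metric_entity`, `entity_prototype` and `metric_prototype` are simplified stand-ins, the globals are written by hand rather than generated by `METRIC_DEFINE_*` macros, a plain counter takes the place of a meter, and `put_qps` is a hypothetical second metric added only to show the entity being reused.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for a metric value (a real meter would compute a rate).
class counter {
public:
  void increment() { ++_value; }
  std::int64_t value() const { return _value; }
private:
  std::int64_t _value = 0;
};
using counter_ptr = std::shared_ptr<counter>;

// An entity owns all metrics that share the same attributes (type and ID).
class metric_entity {
public:
  metric_entity(std::string type, std::string id)
      : _type(std::move(type)), _id(std::move(id)) {}
  const std::string &type() const { return _type; }
  const std::string &id() const { return _id; }
  // Returns the metric registered under `name`, creating it on first use.
  counter_ptr find_or_create(const std::string &name) {
    counter_ptr &slot = _metrics[name];
    if (!slot) {
      slot = std::make_shared<counter>();
    }
    return slot;
  }
private:
  std::string _type;
  std::string _id;
  std::map<std::string, counter_ptr> _metrics;
};
using metric_entity_ptr = std::shared_ptr<metric_entity>;

// Defined once in global scope; instantiated once per replica.
class entity_prototype {
public:
  explicit entity_prototype(std::string type) : _type(std::move(type)) {}
  metric_entity_ptr instantiate(const std::string &id) const {
    return std::make_shared<metric_entity>(_type, id);
  }
private:
  std::string _type;
};

// Defined once in global scope; instantiated once per entity.
class metric_prototype {
public:
  metric_prototype(std::string name, std::string desc)
      : _name(std::move(name)), _desc(std::move(desc)) {}
  counter_ptr instantiate(const metric_entity_ptr &entity) const {
    return entity->find_or_create(_name);
  }
private:
  std::string _name;
  std::string _desc;
};

// Hand-written globals standing in for what the definition macros might produce.
const entity_prototype METRIC_ENTITY_replica("replica");
const metric_prototype METRIC_get_qps("get_qps", "the qps of GET requests");
const metric_prototype METRIC_put_qps("put_qps", "the qps of PUT requests");

class rocksdb_store {
public:
  // The replica ID is attached to the entity once; no metric name encodes it.
  explicit rocksdb_store(const std::string &str_gpid)
      : _replica_entity(METRIC_ENTITY_replica.instantiate(str_gpid)),
        _get_qps(METRIC_get_qps.instantiate(_replica_entity)),
        _put_qps(METRIC_put_qps.instantiate(_replica_entity)) {}
  void on_get() { _get_qps->increment(); }
  void on_put() { _put_qps->increment(); }
  std::int64_t get_count() const { return _get_qps->value(); }
  const metric_entity_ptr &entity() const { return _replica_entity; }
private:
  metric_entity_ptr _replica_entity;  // declared first, so initialized first
  counter_ptr _get_qps;
  counter_ptr _put_qps;
};
```

In this sketch, constructing `rocksdb_store store("1.2")` creates one entity of type `replica` with ID `1.2`, and each additional metric costs a single line in the constructor. Instantiating the same prototype twice on the same entity returns the same underlying metric, which is one possible design choice rather than a documented guarantee of the new API.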
## Notes
The documentation and monitoring templates must be updated to use the new metric names when this refactoring is released.