Files
mattermost/go.tools.mod
Agniva De Sarker 7b1dee4936 MM-33818: Add replica lag metric (#17148)
* MM-33818: Add replica lag metric

We add two new metrics for monitoring replica lag:
- Monitor absolute lag based on binlog distance/transaction queue length.
- Monitor time taken for the replica to catch up.

To achieve this, we add a config setting to run a user defined SQL query
on the database.

We need to specify a separate datasource field as part of the config because
in some databases, querying the replica lag value requires elevated credentials
which are not needed for usual running of the application, and can even be a security risk.

Arguably, a peculiar part of the design is the requirement of the query output to be in a (node, value)
format. But since from the application, the SQL query is a black box and the user can set any query
they want, we cannot, in any way templatize this.

And as an extra note, freno also does it in a similar way.

The last bit is because we need to have a separate datasources, now we consume one extra connection
rather than sharing it with the pool. This is an unfortunate result of the design, and while one extra
connection doesn't make much of a difference in a single-tenant scenario. It does make so, in a multi-tenant scenario.
But in a multi-tenant scenario, the expectation would already be to use a connection pool. So this
is not a big concern.

https://mattermost.atlassian.net/browse/MM-33818

```release-note
Two new gauge metrics were added:
mattermost_db_replica_lag_abs and mattermost_db_replica_lag_time, both
containing a label of "node", signifying which db host is the metric from.

These metrics signify the replica lag in absolute terms and in the time dimension
capturing the whole picture of replica lag.

To use these metrics, a separate config section ReplicaLagSettings was added
under SqlSettings. This is an array of maps which contain three keys: DataSource,
QueryAbsoluteLag, and QueryTimeLag. Each map entry is for a single replica instance.

DataSource contains the DB credentials to connect to the replica instance.

QueryAbsoluteLag is a plain SQL query that must return a single row of which the first column
must be the node value of the Prometheus metric, and the second column must be the value of the lag.

QueryTimeLag is the same as above, but used to measure the time lag.

As an example, for AWS Aurora instances, the QueryAbsoluteLag can be:

select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>

and QueryTimeLag can be:

select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>

For MySQL Group Replication, the absolute lag can be measured from the number of pending transactions
in the applier queue:

select member_id, count_transaction_remote_in_applier_queue FROM performance_schema.replication_group_member_stats where member_id=<>

Overall, what query to choose is left to the administrator, and depending on the database and need, an appropriate
query can be chosen.
```

* Trigger CI

* Fix tests

* address review comments

* Remove t.Parallel

It was spawning too many connections,
and overloading the docker container.

Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
2021-03-29 21:35:24 +05:30

18 lines
721 B
Modula-2

module github.com/mattermost/mattermost-server/v5
go 1.14
require (
github.com/go-bindata/go-bindata v3.1.2+incompatible // indirect
github.com/golang-migrate/migrate/v4 v4.14.1 // indirect
github.com/jstemmer/go-junit-report v0.9.1 // indirect
github.com/jteeuwen/go-bindata v3.0.7+incompatible // indirect
github.com/mattermost/mattermost-utilities/mmgotool v0.0.0-20210309083648-c1e5575135f9 // indirect
github.com/philhofer/fwd v1.0.0 // indirect
github.com/reflog/struct2interface v0.6.1 // indirect
github.com/spf13/cobra v1.1.3 // indirect
github.com/tinylib/msgp v1.1.2 // indirect
github.com/ttacon/chalk v0.0.0-20160626202418-22c06c80ed31 // indirect
github.com/vektra/mockery v1.1.2 // indirect
)