mirror of
https://github.com/mattermost/mattermost.git
synced 2025-02-25 18:55:24 -06:00
* MM-33818: Add replica lag metric

We add two new metrics for monitoring replica lag:
- Monitor absolute lag based on binlog distance/transaction queue length.
- Monitor time taken for the replica to catch up.

To achieve this, we add a config setting to run a user-defined SQL query on the database.

We need to specify a separate datasource field as part of the config because in some databases, querying the replica lag value requires elevated credentials which are not needed for the usual running of the application, and can even be a security risk.

Arguably, a peculiar part of the design is the requirement that the query output be in a (node, value) format. But since, from the application's side, the SQL query is a black box and the user can set any query they want, we cannot templatize this in any way. As an extra note, freno also does it in a similar way.

The last bit is that, because we need a separate datasource, we now consume one extra connection rather than sharing it with the pool. This is an unfortunate result of the design. One extra connection doesn't make much of a difference in a single-tenant scenario, but it does in a multi-tenant scenario. In a multi-tenant scenario, however, the expectation would already be to use a connection pool, so this is not a big concern.

https://mattermost.atlassian.net/browse/MM-33818

```release-note
Two new gauge metrics were added: mattermost_db_replica_lag_abs and mattermost_db_replica_lag_time, both carrying a "node" label that identifies which DB host the metric comes from.

These metrics report the replica lag in absolute terms and in the time dimension, capturing the whole picture of replica lag.

To use these metrics, a separate config section ReplicaLagSettings was added under SqlSettings. This is an array of maps which contain three keys: DataSource, QueryAbsoluteLag, and QueryTimeLag. Each map entry is for a single replica instance.

DataSource contains the DB credentials to connect to the replica instance.

QueryAbsoluteLag is a plain SQL query that must return a single row, of which the first column must be the node value of the Prometheus metric and the second column must be the value of the lag.

QueryTimeLag is the same as above, but used to measure the time lag.

As an example, for AWS Aurora instances, the QueryAbsoluteLag can be:

    select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>

and QueryTimeLag can be:

    select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>

For MySQL Group Replication, the absolute lag can be measured from the number of pending transactions in the applier queue:

    select member_id, count_transaction_remote_in_applier_queue FROM performance_schema.replication_group_member_stats where member_id=<>

Overall, which query to choose is left to the administrator; an appropriate query can be chosen depending on the database and need.
```

* Trigger CI

* Fix tests

* address review comments

* Remove t.Parallel

It was spawning too many connections, and overloading the docker container.

Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
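The release note describes ReplicaLagSettings as an array of maps with three keys. As a rough illustration of that shape only, the sketch below parses such a section from JSON. The struct types, the `parseReplicaLagSettings` helper, the sample DataSource string, and the non-empty DataSource check are all assumptions made for this example and are not taken from the Mattermost code base. The queries are the Aurora examples from the release note, with the `<>` placeholder left unfilled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReplicaLagSettings is a simplified, hypothetical stand-in for one entry of
// the ReplicaLagSettings array. The three field names come from the release
// note; the plain string types are an assumption.
type ReplicaLagSettings struct {
	DataSource       string
	QueryAbsoluteLag string
	QueryTimeLag     string
}

// sqlSettings models only the part of SqlSettings relevant here (hypothetical).
type sqlSettings struct {
	ReplicaLagSettings []ReplicaLagSettings
}

// sampleConfig holds one entry for a single replica. The DataSource value is
// a made-up placeholder; the two queries are the Aurora examples from the
// release note, with server_id still to be filled in by the administrator.
const sampleConfig = `{
  "ReplicaLagSettings": [
    {
      "DataSource": "postgres://user:password@replica-host:5432/mattermost",
      "QueryAbsoluteLag": "select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>",
      "QueryTimeLag": "select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>"
    }
  ]
}`

// parseReplicaLagSettings decodes the section and rejects entries without a
// DataSource. That check is an assumption of this sketch, not a documented rule.
func parseReplicaLagSettings(raw []byte) ([]ReplicaLagSettings, error) {
	var s sqlSettings
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, fmt.Errorf("parsing SqlSettings: %w", err)
	}
	for i, entry := range s.ReplicaLagSettings {
		if entry.DataSource == "" {
			return nil, fmt.Errorf("entry %d: DataSource is empty", i)
		}
	}
	return s.ReplicaLagSettings, nil
}

func main() {
	entries, err := parseReplicaLagSettings([]byte(sampleConfig))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, entry := range entries {
		fmt.Println("absolute lag query:", entry.QueryAbsoluteLag)
		fmt.Println("time lag query:", entry.QueryTimeLag)
	}
}
```

Each array entry corresponds to one replica instance, so monitoring several replicas would mean repeating the map with a different DataSource and `server_id` filter per entry.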
18 lines
721 B
Modula-2
module github.com/mattermost/mattermost-server/v5

go 1.14

require (
	github.com/go-bindata/go-bindata v3.1.2+incompatible // indirect
	github.com/golang-migrate/migrate/v4 v4.14.1 // indirect
	github.com/jstemmer/go-junit-report v0.9.1 // indirect
	github.com/jteeuwen/go-bindata v3.0.7+incompatible // indirect
	github.com/mattermost/mattermost-utilities/mmgotool v0.0.0-20210309083648-c1e5575135f9 // indirect
	github.com/philhofer/fwd v1.0.0 // indirect
	github.com/reflog/struct2interface v0.6.1 // indirect
	github.com/spf13/cobra v1.1.3 // indirect
	github.com/tinylib/msgp v1.1.2 // indirect
	github.com/ttacon/chalk v0.0.0-20160626202418-22c06c80ed31 // indirect
	github.com/vektra/mockery v1.1.2 // indirect
)