* MM-34389: Reliable websockets: First commit
This is the first commit which makes some basic changes
to get it ready for the actual implementation.
Changes include:
- A config field to conditionally enable it.
- Moving the WriteMessage call, along with setting the deadline,
into a separate method.
The basic idea is that the client sends the connection_id
and sequence_number either during the handshake (via query params),
or during challenge_auth (via added parameters in the map).
If the connection_id is empty, then we create a new one and set it.
Otherwise, we get the queues from a connection manager (TBD)
and attach them to WebConn.
```release-note
NONE
```
https://mattermost.atlassian.net/browse/MM-34389
* Incorporate review comments
* Trigger CI
* removing telemetry
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* MM-33818: Add replica lag metric
We add two new metrics for monitoring replica lag:
- Monitor absolute lag based on binlog distance/transaction queue length.
- Monitor time taken for the replica to catch up.
To achieve this, we add a config setting to run a user defined SQL query
on the database.
We need to specify a separate datasource field as part of the config because
in some databases, querying the replica lag value requires elevated credentials
which are not needed for usual running of the application, and can even be a security risk.
Arguably, a peculiar part of the design is the requirement that the query output be in a (node, value)
format. But since the SQL query is a black box to the application and the user can set any query
they want, we cannot templatize this in any way.
And as an extra note, freno also does it in a similar way.
Lastly, because we need a separate datasource, we now consume one extra connection
rather than sharing it with the pool. This is an unfortunate result of the design. While one extra
connection doesn't make much of a difference in a single-tenant scenario, it does in a multi-tenant scenario.
But in a multi-tenant scenario, the expectation would already be to use a connection pool, so this
is not a big concern.
https://mattermost.atlassian.net/browse/MM-33818
```release-note
Two new gauge metrics were added:
mattermost_db_replica_lag_abs and mattermost_db_replica_lag_time, both
containing a label of "node", signifying which DB host the metric is from.
These metrics signify the replica lag in absolute terms and in the time dimension,
capturing the whole picture of replica lag.
To use these metrics, a separate config section ReplicaLagSettings was added
under SqlSettings. This is an array of maps which contain three keys: DataSource,
QueryAbsoluteLag, and QueryTimeLag. Each map entry is for a single replica instance.
DataSource contains the DB credentials to connect to the replica instance.
QueryAbsoluteLag is a plain SQL query that must return a single row of which the first column
must be the node value of the Prometheus metric, and the second column must be the value of the lag.
QueryTimeLag is the same as above, but used to measure the time lag.
As an example, for AWS Aurora instances, the QueryAbsoluteLag can be:
select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>
and QueryTimeLag can be:
select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>
For MySQL Group Replication, the absolute lag can be measured from the number of pending transactions
in the applier queue:
select member_id, count_transaction_remote_in_applier_queue FROM performance_schema.replication_group_member_stats where member_id=<>
Overall, what query to choose is left to the administrator, and depending on the database and need, an appropriate
query can be chosen.
```
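To illustrate the shape of the config section described in the release note, here is a minimal sketch of how it could be modelled and decoded. The three key names come from the release note. The Go type layout, the parseSqlSettings helper and the placeholder DataSource value are assumptions, and the `<>` in the queries is the same unfilled placeholder as above.

```go
package main

import "encoding/json"

// ReplicaLagSettings describes one replica instance. The key names follow
// the release note; the exact Go types in the server may differ.
type ReplicaLagSettings struct {
	DataSource       string // DB credentials for connecting to the replica
	QueryAbsoluteLag string // must return one row: (node, absolute lag)
	QueryTimeLag     string // must return one row: (node, time lag)
}

// SqlSettings is trimmed down to the single field relevant here.
type SqlSettings struct {
	ReplicaLagSettings []ReplicaLagSettings
}

// exampleConfig uses the Aurora queries from the release note. The
// DataSource value and the server_id filter are placeholders.
const exampleConfig = `{
  "ReplicaLagSettings": [
    {
      "DataSource": "<replica DSN>",
      "QueryAbsoluteLag": "select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>",
      "QueryTimeLag": "select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>"
    }
  ]
}`

// parseSqlSettings decodes the JSON form of the settings (hypothetical helper).
func parseSqlSettings(data []byte) (SqlSettings, error) {
	var s SqlSettings
	err := json.Unmarshal(data, &s)
	return s, err
}
```

Each array entry maps to one replica, so the number of extra connections grows with the number of entries.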
* Trigger CI
* Fix tests
* address review comments
* Remove t.Parallel
It was spawning too many connections,
and overloading the docker container.
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* Add feature flag for apps
* Update default to false
* Add plugin version Feature Flag
* Fix typo
* Only force shutdown, and leave the enabled status dependent on the user (defaulting to enabled)
* Remove unneeded tracking of status
* Handle plugin init on startup for locally installed plugin
App contains server.
Server contains WebsocketRouter.
There is no need for WebsocketRouter to contain
server too, as is evident from the fact that the code never used it.
```release-note
NONE
```
* MM-34171: Fix racy test TestSentry
The NewServer call sets the global mlog Info variable.
If we call UpdateConfig through the functional options before that happens,
the store.Set method calls mlog.Info before
it has been set.
To fix that, we prepare the updated config
and pass that directly to NewServer to avoid
having to call UpdateConfig.
https://mattermost.atlassian.net/browse/MM-34171
```release-note
NONE
```
* disable watcher
* Trying yet again
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* MM-33893: Disable TCP_NO_DELAY for websocket connections
In very large installations, websocket messages cause traffic
congestion by sending many packets that are too small, which causes
a drop in throughput.
To counter this, we disable the TCP_NO_DELAY flag for websocket
connections. This has been shown to give noticeable improvements in
load tests.
We wrap this in a feature flag for now to let it soak in Community
first.
```release-note
NONE
```
https://mattermost.atlassian.net/browse/MM-33893
* fix gorilla specific conn
test-server-race wasn't using the same set of steps
that the test-server step did. Therefore one test was failing.
Refactored it so that scripts/test.sh can be used to run
both normal and race tests.
```release-note
NONE
```
* MM-34080: Removing sqlite entirely
The initial commit missed removing this blank import.
So the library still remained in our vendor directory.
Removing it for good now.
Bye bye sqlite.
https://mattermost.atlassian.net/browse/MM-34080
* fix go.mod
* MM-34000: Use non-epoll mode for TLS connections
A *tls.Conn (from crypto/tls) does not expose the underlying TCP connection
or even a File method to get the underlying file descriptor
the way a *net.TCPConn does. Therefore the netpoll code would
fail to get the file descriptor.
Relevant issue here: https://github.com/mailru/easygo/issues/3
It is indeed possible to use reflect black magic to get the unexported
member, but I have seen unexpected errors when writing to the websocket
after getting the file descriptor this way. I do not want to spend time investigating
this, especially since this is already released.
Once this is out, we can decide on the right way to fix this, most probably
by proposing to expose the File method or some other way.
https://mattermost.atlassian.net/browse/MM-34000
```release-note
Fix an issue where websockets wouldn't work with TLS connections.
In that case, we just fall back to the way it works for Windows machines,
which is to use a separate goroutine for reading from the connection.
```
* Ignore logging errors on non-epoll
On non-epoll systems, we needed to return an error
to break from the loop. But in that case, there is no
need to log the error.