* MM-34002: Improve AddUserToChannel
When adding a user to a channel, we check whether the user
has been removed from that team. During LDAP sync, this check
is not required because the team member has just been created.
Hence, we pass a boolean flag to bypass the check.
With that done, we can freely query the replica.
https://mattermost.atlassian.net/browse/MM-34002
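A minimal sketch of the flag-bypass pattern, with hypothetical names and stand-in types (the real server types and signatures differ):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the server's team member model, for illustration only.
type TeamMember struct {
	UserID   string
	DeleteAt int64 // non-zero means the user was removed from the team
}

var teamMembers = map[string]*TeamMember{
	"alice": {UserID: "alice", DeleteAt: 0},
}

func getTeamMember(userID string) (*TeamMember, error) {
	tm, ok := teamMembers[userID]
	if !ok {
		return nil, errors.New("team member not found")
	}
	return tm, nil
}

// addUserToChannel sketches the idea: during LDAP sync the team member
// was just created, so the integrity check can be skipped and the
// (possibly stale) replica never has to be consulted.
func addUserToChannel(userID string, skipTeamMemberIntegrityCheck bool) error {
	if !skipTeamMemberIntegrityCheck {
		tm, err := getTeamMember(userID) // may hit the read replica
		if err != nil {
			return err
		}
		if tm.DeleteAt != 0 {
			return errors.New("user was removed from the team")
		}
	}
	// ... insert the channel member ...
	return nil
}

func main() {
	fmt.Println(addUserToChannel("alice", false)) // <nil>
	fmt.Println(addUserToChannel("bob", true))    // <nil>: check bypassed
}
```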
```release-note
NONE
```
* Refactor code
* Rename a struct field
* fix double negative
During LDAP sync, we would call AddTeamMember, which had a read-after-write issue:
we would create a team member and then immediately query for it.
The same pattern was found in:
- AddTeamMember
- AddTeamMembers
- AddTeamMemberByToken
To fix this, we just return the inserted team member from AddUserToTeam and use that
instead of calling GetTeamMember again.
```release-note
NONE
```
https://mattermost.atlassian.net/browse/MM-33913
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
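The fix above can be sketched as follows; the types and signatures here are simplified stand-ins for the server's actual model:

```go
package main

import "fmt"

// Stand-in for the server's team member model.
type TeamMember struct {
	TeamID string
	UserID string
}

// addUserToTeam sketches the fix: return the freshly inserted member so
// callers never issue a follow-up read that could race against
// replication lag on a replica.
func addUserToTeam(teamID, userID string) (*TeamMember, error) {
	member := &TeamMember{TeamID: teamID, UserID: userID}
	// ... persist member on the writer node ...
	return member, nil // before the fix, callers re-queried via GetTeamMember here
}

func main() {
	tm, err := addUserToTeam("t1", "u1")
	fmt.Println(tm.TeamID, tm.UserID, err) // t1 u1 <nil>
}
```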
Remote Cluster Service
- provides ability for multiple Mattermost cluster instances to create a trusted connection with each other and exchange messages
- trusted connections are managed via slash commands (for now)
- facilitates features requiring inter-cluster communication, such as Shared Channels
Shared Channels Service
- provides ability to share channels between one or more Mattermost cluster instances (using trusted connections)
- sharing/unsharing of channels is managed via slash commands (for now)
* MM-34487: Add dead queue
Just a circular buffer to store dead messages for now.
This is not controlled via a config flag because it has no
effect other than using some additional memory per connection.
```release-note
NONE
```
https://mattermost.atlassian.net/browse/MM-34487
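The dead queue can be sketched as a fixed-size ring buffer; the type and method names below are hypothetical, not the server's actual implementation:

```go
package main

import "fmt"

// deadQueue is a minimal fixed-size ring buffer in the spirit of the
// change: it retains only the most recent messages for a connection,
// overwriting the oldest once the buffer wraps.
type deadQueue struct {
	buf  []string
	next int  // index of the next write
	full bool // whether the buffer has wrapped at least once
}

func newDeadQueue(size int) *deadQueue {
	return &deadQueue{buf: make([]string, size)}
}

func (q *deadQueue) push(msg string) {
	q.buf[q.next] = msg
	q.next = (q.next + 1) % len(q.buf)
	if q.next == 0 {
		q.full = true
	}
}

// messages returns the buffered messages, oldest first.
func (q *deadQueue) messages() []string {
	if !q.full {
		return append([]string(nil), q.buf[:q.next]...)
	}
	out := make([]string, 0, len(q.buf))
	out = append(out, q.buf[q.next:]...)
	out = append(out, q.buf[:q.next]...)
	return out
}

func main() {
	q := newDeadQueue(3)
	for _, m := range []string{"a", "b", "c", "d"} {
		q.push(m)
	}
	fmt.Println(q.messages()) // [b c d]
}
```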
* Wrap with config
* fix test
* MM-34086: Fix flaky test StartServerTLSOverwriteCipher
This suffered from the same race condition as
https://github.com/mattermost/mattermost-server/pull/17215.
We apply the same approach of using a memstore instead
of a file based config store.
It is likely that the test was also failing because of
the same race, although it is not obvious how: the race
happens in mlog.Info, while the test only fails if the
config could not be set properly. So while not strictly
related, the two are probably connected.
If it fails again, we can look into it further. But this
at least fixes the race.
While here, we also apply some cleanup to the code.
https://mattermost.atlassian.net/browse/MM-34086
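The gist of the memstore approach, as a loose sketch with entirely hypothetical names: an in-memory config store has no file watcher goroutine, so nothing runs concurrently with test setup and logs through mlog.

```go
package main

import (
	"fmt"
	"sync"
)

// memStore is a hypothetical in-memory config store. Unlike a file-based
// store, it spawns no watcher goroutine, so setting config in a test
// cannot race with background activity.
type memStore struct {
	mu  sync.RWMutex
	cfg map[string]string
}

func newMemStore() *memStore {
	return &memStore{cfg: map[string]string{}}
}

func (s *memStore) Set(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cfg[key] = val
}

func (s *memStore) Get(key string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cfg[key]
}

func main() {
	s := newMemStore()
	s.Set("ServiceSettings.SiteURL", "https://example.com")
	fmt.Println(s.Get("ServiceSettings.SiteURL"))
}
```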
```release-note
NONE
```
* Remove race test
* MM-34389: Reliable websockets: First commit
This is the first commit; it makes some basic changes
to prepare for the actual implementation.
Changes include:
- A config field to conditionally enable it.
- Refactoring WriteMessage, along with setting the deadline,
into a separate method.
The basic idea is that the client sends the connection_id
and sequence_number either during the handshake (via query params),
or during challenge_auth (via added parameters in the map).
If the conn_id is empty, then we create a new one and set it.
Otherwise, we get the queues from a connection manager (TBD)
and attach them to WebConn.
```release-note
NONE
```
https://mattermost.atlassian.net/browse/MM-34389
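The handshake logic described above can be sketched like this; the function name, the parameter names beyond connection_id/sequence_number, and the placeholder ID are assumptions, not the actual server code:

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// attachParams sketches the handshake: if the client sent no
// connection_id, mint a fresh one; otherwise reuse it together with the
// last-seen sequence number so queued events can be reattached.
func attachParams(rawQuery string) (connID string, seq int, fresh bool, err error) {
	vals, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", 0, false, err
	}
	connID = vals.Get("connection_id")
	if connID == "" {
		connID = "conn-123" // placeholder: real code would generate a new ID
		return connID, 0, true, nil
	}
	seq, err = strconv.Atoi(vals.Get("sequence_number"))
	if err != nil {
		return "", 0, false, err
	}
	return connID, seq, false, nil
}

func main() {
	id, seq, fresh, _ := attachParams("connection_id=abc&sequence_number=7")
	fmt.Println(id, seq, fresh) // abc 7 false
}
```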
* Incorporate review comments
* Trigger CI
* removing telemetry
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* MM-33818: Add replica lag metric
We add two new metrics for monitoring replica lag:
- Monitor absolute lag based on binlog distance/transaction queue length.
- Monitor time taken for the replica to catch up.
To achieve this, we add a config setting to run a user-defined SQL query
on the database.
We need to specify a separate datasource field as part of the config because
in some databases, querying the replica lag value requires elevated credentials
which are not needed for usual running of the application, and can even be a security risk.
Arguably, a peculiar part of the design is the requirement that the query output be in (node, value)
format. But since the SQL query is a black box to the application and the user can set any query
they want, there is no way to templatize this.
As an extra note, freno also does it in a similar way.
Lastly, because we need a separate datasource, we now consume one extra connection
rather than sharing the pool. This is an unfortunate result of the design; while one extra
connection doesn't make much of a difference in a single-tenant scenario, it does in a multi-tenant one.
But in a multi-tenant scenario, the expectation would already be to use a connection pool, so this
is not a big concern.
https://mattermost.atlassian.net/browse/MM-33818
```release-note
Two new gauge metrics were added:
mattermost_db_replica_lag_abs and mattermost_db_replica_lag_time, both
containing a "node" label signifying which DB host the metric is from.
These metrics capture the replica lag in absolute terms and in the time
dimension, giving the whole picture of replica lag.
To use these metrics, a separate config section ReplicaLagSettings was added
under SqlSettings. This is an array of maps which contain three keys: DataSource,
QueryAbsoluteLag, and QueryTimeLag. Each map entry is for a single replica instance.
DataSource contains the DB credentials to connect to the replica instance.
QueryAbsoluteLag is a plain SQL query that must return a single row of which the first column
must be the node value of the Prometheus metric, and the second column must be the value of the lag.
QueryTimeLag is the same as above, but used to measure the time lag.
As an example, for AWS Aurora instances, the QueryAbsoluteLag can be:
select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>
and QueryTimeLag can be:
select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>
For MySQL Group Replication, the absolute lag can be measured from the number of pending transactions
in the applier queue:
select member_id, count_transaction_remote_in_applier_queue FROM performance_schema.replication_group_member_stats where member_id=<>
Overall, the choice of query is left to the administrator; an appropriate one
can be chosen depending on the database and the need.
```
* Trigger CI
* Fix tests
* address review comments
* Remove t.Parallel
It was spawning too many connections,
and overloading the docker container.
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* Add feature flag for apps
* Update default to false
* Add plugin version Feature Flag
* Fix typo
* Only force shutdown, and leave the enable status dependent on the user (defaulting to enabled)
* Remove unneeded tracking of status
* Handle plugin init on startup for locally installed plugin
App contains server.
Server contains WebsocketRouter.
There is no need for WebsocketRouter to contain the
server too, as evidenced by the fact that the code never used it.
```release-note
NONE
```
* MM-34171: Fix racy test TestSentry
The NewServer call sets the global mlog Info variable.
But if we call UpdateConfig with the functional options
before that, the store.Set method will call mlog.Info
before the logger has been set.
To fix that, we prepare the updated config
and pass that directly to NewServer to avoid
having to call UpdateConfig.
https://mattermost.atlassian.net/browse/MM-34171
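The shape of the fix, as a loose sketch with hypothetical names: apply all functional options to the config up front, inside the constructor, so nothing needs to mutate config (and touch the logger) afterwards.

```go
package main

import "fmt"

// Hypothetical minimal config for illustration.
type Config struct {
	SiteURL string
}

// Option is the usual functional-option pattern.
type Option func(*Config)

func WithSiteURL(u string) Option {
	return func(c *Config) { c.SiteURL = u }
}

// newServer sketches the fix: the prepared config is built before any
// logging can happen, so no post-construction UpdateConfig call is needed.
func newServer(opts ...Option) *Config {
	cfg := &Config{SiteURL: "http://localhost"}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	cfg := newServer(WithSiteURL("https://example.com"))
	fmt.Println(cfg.SiteURL) // https://example.com
}
```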
```release-note
NONE
```
* disable watcher
* Trying yet again
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* MM-33893: Disable TCP_NODELAY for websocket connections
In very large installations, websocket messages cause too much
traffic congestion by sending too-small packets, thereby causing
a drop in throughput.
To counter this, we disable the TCP_NODELAY flag for websocket
connections. This has been shown to give noticeable improvements in
load tests.
We wrap this in a feature flag for now to let it soak in Community
first.
```release-note
NONE
```
https://mattermost.atlassian.net/browse/MM-33893
* fix gorilla specific conn
test-server-race wasn't using the same set of steps
that the test-server step did, so one test was failing.
Refactored it so that scripts/test.sh can be used to run
both normal and race tests.
```release-note
NONE
```