* MM-33818: Add replica lag metric
We add two new metrics for monitoring replica lag:
- Monitor absolute lag based on binlog distance/transaction queue length.
- Monitor time taken for the replica to catch up.
To achieve this, we add a config setting to run a user defined SQL query
on the database.
We need to specify a separate datasource field as part of the config because
in some databases, querying the replica lag value requires elevated credentials
which are not needed for usual running of the application, and can even be a security risk.
Arguably, a peculiar part of the design is the requirement of the query output to be in a (node, value)
format. But since from the application, the SQL query is a black box and the user can set any query
they want, we cannot, in any way templatize this.
And as an extra note, freno also does it in a similar way.
The last bit is because we need to have a separate datasources, now we consume one extra connection
rather than sharing it with the pool. This is an unfortunate result of the design, and while one extra
connection doesn't make much of a difference in a single-tenant scenario. It does make so, in a multi-tenant scenario.
But in a multi-tenant scenario, the expectation would already be to use a connection pool. So this
is not a big concern.
https://mattermost.atlassian.net/browse/MM-33818
```release-note
Two new gauge metrics were added:
mattermost_db_replica_lag_abs and mattermost_db_replica_lag_time, both
containing a label of "node", signifying which db host is the metric from.
These metrics signify the replica lag in absolute terms and in the time dimension
capturing the whole picture of replica lag.
To use these metrics, a separate config section ReplicaLagSettings was added
under SqlSettings. This is an array of maps which contain three keys: DataSource,
QueryAbsoluteLag, and QueryTimeLag. Each map entry is for a single replica instance.
DataSource contains the DB credentials to connect to the replica instance.
QueryAbsoluteLag is a plain SQL query that must return a single row of which the first column
must be the node value of the Prometheus metric, and the second column must be the value of the lag.
QueryTimeLag is the same as above, but used to measure the time lag.
As an example, for AWS Aurora instances, the QueryAbsoluteLag can be:
select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>
and QueryTimeLag can be:
select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>
For MySQL Group Replication, the absolute lag can be measured from the number of pending transactions
in the applier queue:
select member_id, count_transaction_remote_in_applier_queue FROM performance_schema.replication_group_member_stats where member_id=<>
Overall, what query to choose is left to the administrator, and depending on the database and need, an appropriate
query can be chosen.
```
* Trigger CI
* Fix tests
* address review comments
* Remove t.Parallel
It was spawning too many connections,
and overloading the docker container.
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
This is a deadlock due to reversed locking order of the SidebarChannels
and SidebarCategories table.
I could not find the exact culprit query from the deadlock output
because it only shows the last query a transaction is running. And from looking
at the code, the only query that runs "UPDATE SidebarCategories SET DisplayName = ?, Sorting = ? WHERE Id = ?"
is UpdateSidebarCategories. But for the deadlock to happen, it has to lock
SidebarChannels _first_, and then _then_ lock SidebarCategories.
Looking a bit more throughly, I found that DeleteSidebarCategory does indeed
lock the tables in an inverse way and if DeleteSidebarCategory runs concurrently with
UpdateSidebarCategories, they will deadlock.
Here's how it will happen.
```
tx1
DELETE FROM SidebarChannels WHERE CategoryId = 'xx';
tx2
UPDATE SidebarCategories SET DisplayName='dn' WHERE Id='xx';
tx2
DELETE FROM SidebarChannels WHERE (ChannelId IN ('yy') AND CategoryId = 'xx');
tx1
DELETE FROM SidebarCategories WHERE Id = 'xx';
```
And then we see:
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
To fix this, we simply reorder the Delete query to lock the SidebarCategories first,
and then SidebarChannels.
In fact, any transaction updating/deleting rows from those two tables should always operate on that order
if possible.
https://mattermost.atlassian.net/browse/MM-31396
```release-note
Fixed a database deadlock that can happen if a sidebar category is updated and deleted at the same time.
```
* MM-31094: Adds tooling to develop and test using a MySQL instance with replication lag. Adds some lazy lookups to fallback to master if results are not found.
* MM-31094: Removes mysql-read-replica from default docker services.
* MM-31094: Switches (store..SessionStore).Get and (store.TeamStore).GetMember to using context.Context.
* MM-31094: Updates (store.UsersStore).Get to use context.
* MM-31094: Updates (store.PostStore).Get to use context.
* MM-31094: Removes feature flag and config setting.
* MM-31094: Rolls back some master reads.
* MM-31094: Rolls a non-cache read.
* MM-31094: Removes feature flag from the store.
* MM-31094: Removes unused constant and struct field.
* MM-31094: Removes some old feature flag references.
* MM-31094: Fixes some tests.
* MM-31094: App layers fix.
* MM-31094: Fixes mocks.
* MM-31094: Don't reparse flag.
* MM-31094: No reparse.
* MM-31094: Removed unused FeatureFlags field.
* MM-31094: Removes unnecessary feature flags variable declarations.
* MM-31094: Fixes copy-paste error.
* MM-31094: Fixes logical error.
* MM-30194: Removes test method from store.
* Revert "MM-30194: Removes test method from store."
This reverts commit d5a6e8529b.
* MM-31094: Conforming to make's strange syntax.
* MM-31094: Configures helper for read replica with option.
* MM-31094: Adds some missing ctx's.
* MM-31094: WIP
* MM-31094: Updates test names.
* MM-31094: WIP
* MM-31094: Removes unnecessary master reads.
* MM-31094: ID case changes out of scope.
* MM-31094: Removes unused context.
* MM-31094: Switches to a helper. Removes some var naming changes. Fixes a merge error.
* MM-31094: Removes SQLITE db driver ref.
* MM-31094: Layer generate fix.
* MM-31094: Removes unnecessary changes.
* MM-31094: Moves test method.
* MM-31094: Re-add previous fix.
* MM-31094: Removes make command for dev.
* MM-31094: Fix for login.
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* Adding bulk-indexing and improving a bit the name indexing for bleve and elasticsearch
* Update services/searchengine/bleveengine/bleve.go
Co-authored-by: Mario de Frutos Dieguez <mario@defrutos.org>
* Update store/sqlstore/file_info_store.go
Co-authored-by: Mario de Frutos Dieguez <mario@defrutos.org>
* Update store/sqlstore/file_info_store.go
Co-authored-by: Mario de Frutos Dieguez <mario@defrutos.org>
* Adding tests requested in the PR review
* fixing tests
* Adding a feature flag to avoid indexing files before the feature is released
* Fixing i18n
Co-authored-by: Mario de Frutos Dieguez <mario@defrutos.org>
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* MM-30882: Fix read-after-write issue for demoting user
In (*App).DemoteUserToGuest, we would demote a user, and then immediately
read it back to do future operations from the user. This reading back
of the user had the effect of sticking the old value into the cache
after which it would never be updated.
There was another issue along with this, which was when the invalidation
message would broadcast across the cluster, it would hit the cache invalidation
problem where an unrelated store call would miss the cache because
it was invalidated, and then again read from replica and stick the old value.
To fix all these, we return the new value directly from the store method
to avoid having the app to read it again.
And we add a map in the localcache layer which tracks invalidations made,
and then switch to use master if it's true.
The core change is fairly limited, but due to changing the store method signatures,
a lot of code needed to be updated to pass "context.Background". Therefore the PR
just "appears" to be big, but the main changes are limited to app/user.go,
sqlstore/user_store.go and user_layer.go
https://mattermost.atlassian.net/browse/MM-30882
```release-note
Fix an issue where demoting a user to guest would not take effect in
an environment with read replicas.
```
* Fix concurrent map access
* Fixing mistakes
* fix tests
* Optimise creation of dm
* Handle direct channels with the same user
* Cover GetMany with specs and add it on tha cache layer as well
* Fix specs by handling user dming themselves
* Apply PR suggestions
* Apply PR suggestions
* Use require.NoError instead of require.Nil on userstore test
* Improve readability of GetOrCreateDirectChannel
* Apply PR suggestions
* Update layers
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* Replacing require.Nil with require.NoError
* More replacements
* More Nil/NotNill to NoError/Error
* Other detected errors
* renaming apperr to err
* Removed not needed line
* Rename old appErr variables that are no longer model.AppError values
* Fixing tiny typo
* Reverting changes outside the store (accidentally added)
* Apply suggestions from code review
Co-authored-by: Doug Lauder <wiggin77@warpmail.net>
Co-authored-by: Doug Lauder <wiggin77@warpmail.net>
* Update config.go
* Update telemetry.go
* Update store and store test
* Update settings.go
* use new error code
* Trigger CI
Co-authored-by: Haardik Dharma <dharmahaardik08@gmail.com>
* MM-30892: Avoid unnecessary table scan in AnalyticsPostCount.
[AnalyticsPostCount](d9e8402dc2/store/sqlstore/post_store.go (L1513-L1541)) unconditionally joins on the `Channels` table even when not filtering to a team.
This is a proposal to stop doing this, reducing this to a simple index query when no team is specified. Note that this is technically a subtle semantic difference, since orphaned posts (from deleted or invalid channel ids) are currently excluded. But I submit the benefits outweigh the technical differences here.
Fixes: https://mattermost.atlassian.net/browse/MM-30892
* Update store/sqlstore/post_store.go
Co-authored-by: Claudio Costa <cstcld91@gmail.com>
* Update store/sqlstore/post_store.go
Co-authored-by: Claudio Costa <cstcld91@gmail.com>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>
* Add search engine support for files
* Fixing i18n
* Fix golangci-lint
* Fix consistency problem in the Search receiver functio of the SqlFileStore
* Fixing some tests
* Fixing test
* Apply suggestions from code review
Co-authored-by: Mario de Frutos Dieguez <mario@defrutos.org>
* Addressing PR review comments
* Removing some empty lines
* Address PR review comments
* Fixing problem after merge master
* Fixing spelling problem
* Add missed translations
* Fixing certain global variable usages after merge master
* Fixing some constants usage
* Fixing goimports order
Co-authored-by: Mario de Frutos Dieguez <mario@defrutos.org>
* format using `goimports -local github.com/mattermost/mattermost-server/v5 -w`
* added goimports lint check to .golangci.yml
* format using `goimports -local github.com/mattermost/mattermost-server/v5 -w` for a corner case
* make app-layers, *-mocks and store-layers for ci check
Co-authored-by: Mahmudul Haque <mahmudulhaque@protonmail.com>
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* MM-31063: Change constants to use CamelCase
* store package
* change allcaps to camel case (#16615)
* New tools.mod
Co-authored-by: Ibrahim Serdar Acikgoz <serdaracikgoz86@gmail.com>
* MM-31356: Add a minimum required version check for Postgres
To keep conformance with our failing fast and obvious philosophy,
we add a check to prevent Mattermost server from starting
if the postgres version is below 10.0.
This gives customers a chance to upgrade their database before upgrading
their Mattermost version, than to run into weird compatibility issues
after they have finished the upgrade.
https://mattermost.atlassian.net/browse/MM-31356
```release-note
NONE
```
* fix lint errors
* Use a function to pretty-print version string
* rectify comment
* Implement unzip function
* Implement FileSize method
* Implement path rewriting for bulk import
* Small improvements
* Add ImportSettings to config
* Implement ListImports API endpoint
* Enable uploading import files
* Implement import process job
* Add missing license headers
* Address reviews
* Make path sanitization a bit smarter
* Clean path before calculating Dir
* [MM-30008] Add mmctl support for file imports (#16301)
* Add mmctl support for import files
* Improve test
* Remove unnecessary handlers
* Use th.TestForSystemAdminAndLocal
* Make nouser id a constant
* Removing supplier concept from the sql store
* Removing other metions to supplier
* Fixing gofmt
* Fixing gofmt
* Renaming NewSqlStore to New
* Fixing tests
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* Make UpdateSidebarCategories return the original categories
* MM-20897 Add category muting
* Prevent muting the DMs category
* Fix muted state not being stored in the database
* Address feedback
* Address some feedback
* Fix unit tests
* MM-20897 Mute/unmute channels in the database in bulk
* Satisfy golangci-lint
* MM-28067: Optimize app server startup in tests
For every test, we would wipe out the database and then start
the server again. This would run through all the migrations
and create new rows in Systems and Roles table every time.
As a result, this was consuming a lot of setup time for every test.
We optimize this by preloading the DB with dummy data in the roles
and systems table so that the server code just skips over those migrations.
This is completely forward compatible and adding new migrations does not
need to generate the sql files again. Only in case of schema changes to the
roles or systems table, this would need to be done.
It is unlikely the Systems table schema will get changed. The Roles table
might change in future but it's a comparatively rare event. Given the reduction
in CI runtime we are seeing, it's a worthy optimization.
We also apply some more optimizations:
- Coalesce multiple UpdateConfig calls into a single one. Each UpdateConfig call
has to do a json marshal which would take precious CPU cycles. It's much more
efficient to do everything in a single call.
- Remove unnecessary debug.FreeOSMemory in reload config. This was an artifact
from old days and is no longer required.
Numbers:
Results show a full **2 minutes** shaved off the test runtime. Earlier, tests
would take around 16 minutes. Now they take 14 minutes.
```release-note
NONE
```
* fix app package
* Refactor apply multi role filters and add role filters to get all profiles
* Add some tests
* Fix tests
* Fix lint
* Trigger CI
* Rename param to make more sense
* Tie get filtered user stats to usermanagement read users
* Dont filter out other system roles when searching for team members or team admins only filter out system admins
* add new permissions
* add migration
* fix test
* remove system roles as default permissions
* implement changes discussed with dennis
* add read only and fix i18n
* use model consts instead of strings
* turn the permissions into pseudo constants
* Update read only default permissions
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
Co-authored-by: Hossein Ahmadian-Yazdi <hyazdi1997@gmail.com>
* MM-30026: Use DB master when getting team members from a session
A race condition happens when the read-replica isn't updated yet
by the time a session expiry message reaches another node in the cluster.
Here is the sequence of events that can cause it:
- Server1 gets any request which has to wipe session cache.
- The SQL query is written to DB master, and a cluster message is propagated
to clear the session cache for that user.
- Now before the read-replica is updated with the master’s update,
the cluster message reaches Server2. The session cache is wiped out for that user.
- _Any random_ request for that user hits Server2. Does NOT have to be
the update team name request. The request does not find the value
in session cache, because it’s wiped off, and picks it up from the DB.
Surprise surprise, it gets the stale value. Sticks it into the cache.
By now, the read-replica is updated. But guess what, we aren’t going to
ask the DB anymore, because we have it in the cache. And the cache has the stale value.
We use a temporary approach for now by introducing a context in the DB calls so that
the useMaster information can be easily passed. And this has the added advantage of
reusing the same context for future DB calls in case it happens. And we can also
add more context keys as needed.
A proper approach needs some architectural changes. See the issue for more details.
```release-note
Fixed a bug where a session will hold on to a cached value
in an HA setup with read-replicas configured.
```
* incorporate review comments
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
* Add the content field to FileInfo
* Fixing the upgrade code
* Trying to fix the text-scheme
* Fixing test-schema
* Fixing test-schema
* Moving the migration to the next version