Agniva De Sarker ef85001523 MM-25710: Use an efficient cache serialization algorithm (#14826)
* MM-25710: Use an efficient cache serialization algorithm

We investigated 3 packages to select a suitable replacement
for gob encoding. We chose msgpack, which gives
a decent boost over the standard gob encoding.
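
For context, here is a minimal sketch of the kind of gob round trip being replaced. The `Status` struct and the helper names are hypothetical stand-ins, not the actual cache code. Note that every fresh gob encoder writes a description of the type before the value, which is one source of the per-item overhead.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Status is a hypothetical stand-in for a cached model struct.
type Status struct {
	UserID string
	Status string
	Manual bool
}

// encodeGob serializes v with a fresh gob encoder. A fresh encoder
// emits the type description before the value on every call.
func encodeGob(v interface{}) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeGob deserializes data into out, which must be a pointer.
func decodeGob(data []byte, out interface{}) error {
	return gob.NewDecoder(bytes.NewReader(data)).Decode(out)
}

func main() {
	in := Status{UserID: "u1", Status: "online", Manual: true}

	data, err := encodeGob(in)
	if err != nil {
		panic(err)
	}

	var out Status
	if err := decodeGob(data, &out); err != nil {
		panic(err)
	}
	fmt.Printf("round trip ok: %v, encoded size: %d bytes\n", out == in, len(data))
}
```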

External schema-dependent formats like protobuf, FlatBuffers, Avro, and
Cap'n Proto were not considered, as that would entail converting the model structs
into separate schema objects and then code-generating the Go structs.
In theory, that could be done at a later stage, specifically for structs
that are in the hot path. This is a general solution for now.

The packages considered were:
- github.com/tinylib/msgp
- github.com/ugorji/go/codec
- github.com/vmihailenco/msgpack/v5

msgp uses code generation to generate encoding/decoding code without the reflection overhead.
In theory, therefore, it should give the fastest performance. However, a major
flaw in that package is that it works only at a file/directory level, not at a package level.
Therefore, for structs which are spread across multiple files, it becomes nearly impossible
to chase down all transitive dependencies to generate the code. Even if that's done, it fails
on some complex types like xml.Name and time.Time. (See: https://github.com/tinylib/msgp/issues/274#issuecomment-643654611)

That leaves us with 2 choices. Both of them use the same underlying algorithm,
but msgpack/v5 wraps the encoders/decoders in a sync.Pool. To make an apples-to-apples
comparison, I wrote a sync.Pool for ugorji/go/codec too and compared performance.
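
The pooling idea can be sketched with the standard library alone. This is not the wrapper written for ugorji/go/codec: it uses encoding/json as a stand-in encoder so that the example has no third-party dependencies, and `pooledEncoder`, `encoderPool`, and `encodePooled` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sync"
)

// pooledEncoder bundles a reusable buffer with an encoder that writes
// into it, so neither has to be allocated per call.
type pooledEncoder struct {
	buf bytes.Buffer
	enc *json.Encoder // stand-in for a msgpack encoder
}

var encoderPool = sync.Pool{
	New: func() interface{} {
		pe := &pooledEncoder{}
		pe.enc = json.NewEncoder(&pe.buf)
		return pe
	},
}

// encodePooled borrows an encoder from the pool, encodes v, and returns
// a copy of the bytes so the buffer can be reused safely afterwards.
func encodePooled(v interface{}) ([]byte, error) {
	pe := encoderPool.Get().(*pooledEncoder)
	defer encoderPool.Put(pe)

	pe.buf.Reset() // drop whatever the previous user left behind
	if err := pe.enc.Encode(v); err != nil {
		return nil, err
	}
	out := make([]byte, pe.buf.Len())
	copy(out, pe.buf.Bytes())
	return out, nil
}

func main() {
	for i := 0; i < 3; i++ {
		data, err := encodePooled(map[string]int{"n": i})
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s", data)
	}
}
```

The copy before returning matters: handing out `pe.buf.Bytes()` directly would let a later caller overwrite data that an earlier caller still holds.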

msgpack/v5 came out to be the fastest by a small margin.

benchstat master.txt ugorji.txt
name               old time/op    new time/op    delta
LRU/simple=new-8     5.62µs ± 3%    3.68µs ± 2%  -34.64%  (p=0.000 n=10+10)
LRU/complex=new-8    38.4µs ± 2%     9.1µs ± 2%  -76.38%  (p=0.000 n=9+9)
LRU/User=new-8       75.8µs ± 2%    23.5µs ± 2%  -69.01%  (p=0.000 n=10+10)
LRU/Post=new-8        125µs ± 2%      21µs ± 3%  -82.92%  (p=0.000 n=9+10)
LRU/Status=new-8     27.6µs ± 1%     5.4µs ± 4%  -80.34%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
LRU/simple=new-8     3.20kB ± 0%    1.60kB ± 0%  -49.97%  (p=0.000 n=10+10)
LRU/complex=new-8    15.7kB ± 0%     4.4kB ± 0%  -71.89%  (p=0.000 n=9+10)
LRU/User=new-8       33.5kB ± 0%     9.2kB ± 0%  -72.48%  (p=0.000 n=10+8)
LRU/Post=new-8       38.7kB ± 0%     4.8kB ± 0%  -87.48%  (p=0.000 n=10+10)
LRU/Status=new-8     10.6kB ± 0%     1.7kB ± 0%  -83.50%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
LRU/simple=new-8       46.0 ± 0%      20.0 ± 0%  -56.52%  (p=0.000 n=10+10)
LRU/complex=new-8       324 ± 0%        48 ± 0%  -85.19%  (p=0.000 n=10+10)
LRU/User=new-8          622 ± 0%       108 ± 0%  -82.64%  (p=0.000 n=10+10)
LRU/Post=new-8          902 ± 0%        74 ± 0%  -91.80%  (p=0.000 n=10+10)
LRU/Status=new-8        242 ± 0%        22 ± 0%  -90.91%  (p=0.000 n=10+10)

benchstat master.txt vmi.txt
name               old time/op    new time/op    delta
LRU/simple=new-8     5.62µs ± 3%    3.68µs ± 3%  -34.59%  (p=0.000 n=10+10)
LRU/complex=new-8    38.4µs ± 2%     8.7µs ± 3%  -77.45%  (p=0.000 n=9+10)
LRU/User=new-8       75.8µs ± 2%    20.9µs ± 1%  -72.45%  (p=0.000 n=10+10)
LRU/Post=new-8        125µs ± 2%      21µs ± 2%  -83.08%  (p=0.000 n=9+10)
LRU/Status=new-8     27.6µs ± 1%     5.1µs ± 3%  -81.66%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
LRU/simple=new-8     3.20kB ± 0%    1.60kB ± 0%  -49.89%  (p=0.000 n=10+10)
LRU/complex=new-8    15.7kB ± 0%     4.6kB ± 0%  -70.87%  (p=0.000 n=9+8)
LRU/User=new-8       33.5kB ± 0%    10.3kB ± 0%  -69.40%  (p=0.000 n=10+9)
LRU/Post=new-8       38.7kB ± 0%     6.0kB ± 0%  -84.62%  (p=0.000 n=10+10)
LRU/Status=new-8     10.6kB ± 0%     1.9kB ± 0%  -82.41%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
LRU/simple=new-8       46.0 ± 0%      20.0 ± 0%  -56.52%  (p=0.000 n=10+10)
LRU/complex=new-8       324 ± 0%        46 ± 0%  -85.80%  (p=0.000 n=10+10)
LRU/User=new-8          622 ± 0%       106 ± 0%  -82.96%  (p=0.000 n=10+10)
LRU/Post=new-8          902 ± 0%        89 ± 0%  -90.13%  (p=0.000 n=10+10)
LRU/Status=new-8        242 ± 0%        23 ± 0%  -90.50%  (p=0.000 n=10+10)

In general, we can see that the faster marshal/unmarshal pays off more as the size of the struct
increases: the time saved grows from about 35% for the simple struct to about 83% for Post.

We can see that msgpack/v5 is slightly faster on CPU but very slightly heavier on memory.
Since we are most interested in speed, we choose msgpack/v5.
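
To make the trade-off concrete, the two candidates can be compared directly using the "new" columns of the two benchstat tables above. The sketch below only redoes that arithmetic. The inputs are the rounded values as printed, so the percentages are approximate, and `row` and `relDelta` are hypothetical helper names.

```go
package main

import "fmt"

// row holds one benchmark's "new" value from each benchstat table.
type row struct {
	name        string
	ugorji      float64
	vmihailenco float64
}

// relDelta returns the percentage change from base to candidate.
func relDelta(base, candidate float64) float64 {
	return (candidate - base) / base * 100
}

func main() {
	// time/op in microseconds, copied from the tables above.
	timeUs := []row{
		{"simple", 3.68, 3.68},
		{"complex", 9.1, 8.7},
		{"User", 23.5, 20.9},
		{"Post", 21, 21},
		{"Status", 5.4, 5.1},
	}
	// alloc/op in kB, copied from the tables above.
	allocKB := []row{
		{"simple", 1.60, 1.60},
		{"complex", 4.4, 4.6},
		{"User", 9.2, 10.3},
		{"Post", 4.8, 6.0},
		{"Status", 1.7, 1.9},
	}

	fmt.Println("time/op, msgpack/v5 relative to ugorji:")
	for _, r := range timeUs {
		fmt.Printf("  %-8s %+6.1f%%\n", r.name, relDelta(r.ugorji, r.vmihailenco))
	}
	fmt.Println("alloc/op, msgpack/v5 relative to ugorji:")
	for _, r := range allocKB {
		fmt.Printf("  %-8s %+6.1f%%\n", r.name, relDelta(r.ugorji, r.vmihailenco))
	}
}
```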

As a future optimization, we can use a mix of msgpack and msgp for hot structs.
To do that, we would need to shuffle around some code so that for the hot struct,
all its dependencies are in the same file.

Let's use this in production for some time, watch the Grafana graphs for the hottest caches,
and come back to optimizing this further once we have more data.

Side note: we have to make do with micro-benchmarks for the time being, because not all the caches
have been migrated to the cache2 interface yet. Once that's done, we can run some actual load tests
and do comparisons.
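
As a rough illustration of what such a micro-benchmark looks like, here is a self-contained sketch that times a gob round trip for a small and a larger struct. It is not the benchmark from this change: the struct shapes, the `roundTrip` helper, and the case names are assumptions. It also calls `testing.Benchmark` from `main` so that it runs standalone, whereas in a real `_test.go` file these would more likely be `b.Run` sub-benchmarks whose saved output is fed to benchstat.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"testing"
)

// simple and complexValue are hypothetical stand-ins for cached structs.
type simple struct {
	ID string
	N  int
}

type complexValue struct {
	ID     string
	Tags   []string
	Props  map[string]string
	Nested []simple
}

// roundTrip encodes in and decodes the result into out (a pointer).
func roundTrip(in, out interface{}) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return err
	}
	return gob.NewDecoder(&buf).Decode(out)
}

func benchSimple(b *testing.B) {
	b.ReportAllocs()
	in := simple{ID: "id", N: 42}
	for i := 0; i < b.N; i++ {
		var out simple
		if err := roundTrip(&in, &out); err != nil {
			b.Fatal(err)
		}
	}
}

func benchComplex(b *testing.B) {
	b.ReportAllocs()
	in := complexValue{
		ID:     "id",
		Tags:   []string{"a", "b", "c"},
		Props:  map[string]string{"k1": "v1", "k2": "v2"},
		Nested: []simple{{ID: "x", N: 1}, {ID: "y", N: 2}},
	}
	for i := 0; i < b.N; i++ {
		var out complexValue
		if err := roundTrip(&in, &out); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	cases := []struct {
		name string
		fn   func(*testing.B)
	}{
		{"simple", benchSimple},
		{"complex", benchComplex},
	}
	for _, c := range cases {
		res := testing.Benchmark(c.fn)
		fmt.Printf("LRU/%s\t%s\t%s\n", c.name, res.String(), res.MemString())
	}
}
```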

* Bring back missing import

* Fix tests
2020-06-18 17:21:39 +05:30