mirror of https://github.com/mattermost/mattermost.git (synced 2025-02-25 18:55:24 -06:00)
* MM-25710: Use an efficient cache serialization algorithm

We investigated 3 packages to select a suitable replacement for gob encoding. The algorithm chosen was msgpack, which gives a decent boost over the standard gob encoding. Schema-dependent formats like Protobuf, FlatBuffers, Avro and Cap'n Proto were not considered, as they would entail converting the model structs into separate schema objects and then code-generating the Go structs. That could theoretically be done at a later stage for the structs which are in the hot path; this is a general solution for now.

The packages considered were:
- github.com/tinylib/msgp
- github.com/ugorji/go/codec
- github.com/vmihailenco/msgpack/v5

msgp uses code generation to produce encoding/decoding code without the reflection overhead, so in theory it should give the fastest performance. However, a major flaw in that package is that it works only at a file/directory level, not at a package level. For structs which are spread across multiple files, it therefore becomes nearly impossible to chase down all transitive dependencies to generate the code. Even when that is done, it fails on some complex types like xml.Name and time.Time. (See: https://github.com/tinylib/msgp/issues/274#issuecomment-643654611)

That leaves 2 choices. Both use the same underlying algorithm, but msgpack/v5 wraps its encoders/decoders in a sync.Pool. To make an apples-to-apples comparison, I wrote a sync.Pool wrapper for ugorji/go/codec too and compared performance. msgpack/v5 came out the fastest by a small margin.
benchstat master.txt ugorji.txt:

name                 old time/op    new time/op    delta
LRU/simple=new-8     5.62µs ± 3%    3.68µs ± 2%   -34.64%  (p=0.000 n=10+10)
LRU/complex=new-8    38.4µs ± 2%     9.1µs ± 2%   -76.38%  (p=0.000 n=9+9)
LRU/User=new-8       75.8µs ± 2%    23.5µs ± 2%   -69.01%  (p=0.000 n=10+10)
LRU/Post=new-8        125µs ± 2%      21µs ± 3%   -82.92%  (p=0.000 n=9+10)
LRU/Status=new-8     27.6µs ± 1%     5.4µs ± 4%   -80.34%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
LRU/simple=new-8     3.20kB ± 0%    1.60kB ± 0%   -49.97%  (p=0.000 n=10+10)
LRU/complex=new-8    15.7kB ± 0%     4.4kB ± 0%   -71.89%  (p=0.000 n=9+10)
LRU/User=new-8       33.5kB ± 0%     9.2kB ± 0%   -72.48%  (p=0.000 n=10+8)
LRU/Post=new-8       38.7kB ± 0%     4.8kB ± 0%   -87.48%  (p=0.000 n=10+10)
LRU/Status=new-8     10.6kB ± 0%     1.7kB ± 0%   -83.50%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
LRU/simple=new-8       46.0 ± 0%      20.0 ± 0%   -56.52%  (p=0.000 n=10+10)
LRU/complex=new-8       324 ± 0%        48 ± 0%   -85.19%  (p=0.000 n=10+10)
LRU/User=new-8          622 ± 0%       108 ± 0%   -82.64%  (p=0.000 n=10+10)
LRU/Post=new-8          902 ± 0%        74 ± 0%   -91.80%  (p=0.000 n=10+10)
LRU/Status=new-8        242 ± 0%        22 ± 0%   -90.91%  (p=0.000 n=10+10)

benchstat master.txt vmi.txt:

name                 old time/op    new time/op    delta
LRU/simple=new-8     5.62µs ± 3%    3.68µs ± 3%   -34.59%  (p=0.000 n=10+10)
LRU/complex=new-8    38.4µs ± 2%     8.7µs ± 3%   -77.45%  (p=0.000 n=9+10)
LRU/User=new-8       75.8µs ± 2%    20.9µs ± 1%   -72.45%  (p=0.000 n=10+10)
LRU/Post=new-8        125µs ± 2%      21µs ± 2%   -83.08%  (p=0.000 n=9+10)
LRU/Status=new-8     27.6µs ± 1%     5.1µs ± 3%   -81.66%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
LRU/simple=new-8     3.20kB ± 0%    1.60kB ± 0%   -49.89%  (p=0.000 n=10+10)
LRU/complex=new-8    15.7kB ± 0%     4.6kB ± 0%   -70.87%  (p=0.000 n=9+8)
LRU/User=new-8       33.5kB ± 0%    10.3kB ± 0%   -69.40%  (p=0.000 n=10+9)
LRU/Post=new-8       38.7kB ± 0%     6.0kB ± 0%   -84.62%  (p=0.000 n=10+10)
LRU/Status=new-8     10.6kB ± 0%     1.9kB ± 0%   -82.41%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
LRU/simple=new-8       46.0 ± 0%      20.0 ± 0%   -56.52%  (p=0.000 n=10+10)
LRU/complex=new-8       324 ± 0%        46 ± 0%   -85.80%  (p=0.000 n=10+10)
LRU/User=new-8          622 ± 0%       106 ± 0%   -82.96%  (p=0.000 n=10+10)
LRU/Post=new-8          902 ± 0%        89 ± 0%   -90.13%  (p=0.000 n=10+10)
LRU/Status=new-8        242 ± 0%        23 ± 0%   -90.50%  (p=0.000 n=10+10)

In general, we can see that the improved marshal/unmarshal time pays off more as the size of the struct increases. msgpack/v5 is faster on CPU but very slightly heavier on memory; since we are primarily interested in speed, we choose msgpack/v5.

As a future optimization, we can use a mix of msgpack and msgp for hot structs. To do that, we would need to shuffle some code around so that, for each hot struct, all its dependencies live in the same file. Let's use this in production for some time, watch the Grafana graphs for the hottest caches, and come back to optimizing further once we have more data.

Side note: we have to make do with micro-benchmarks for the time being, because not all the caches have been migrated to the cache2 interface yet. Once that's in, we can run some actual load tests and do comparisons.

* Bring back missing import
* Fix tests
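The micro-benchmarks referenced above follow the standard Go testing.B shape. As a rough, hypothetical stand-in (the `post` struct and `benchGobEncode` name are illustrative, not the real Mattermost benchmarks, which exercise the actual model structs), gob encoding, i.e. the "old" column of the benchstat tables, can even be benchmarked outside `go test` via testing.Benchmark:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"testing"
)

// post is a hypothetical stand-in for model.Post.
type post struct {
	Id      string
	Message string
	Props   map[string]string
}

// benchGobEncode measures gob encoding of a post, analogous to the
// "old" (master) side of the comparisons above.
func benchGobEncode(b *testing.B) {
	p := post{
		Id:      "abc123",
		Message: "hello world",
		Props:   map[string]string{"from_webhook": "true"},
	}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(&p); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside `go test`,
	// which is handy for quick one-off measurements.
	res := testing.Benchmark(benchGobEncode)
	fmt.Println(res)
}
```

Running the same loop body against the msgpack encoder and feeding both outputs through benchstat is what produces tables like the ones above.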