Update on my NEO/go work

Author:	Kirill Smelkov
Date:	20 Mar 2018

tl;dr A draft proof-of-concept read-only NEO storage was implemented in Go. The storage speaks NEO protocol on the wire and is thus network-compatible with current NEO/py implementation¹. FileStorage or SQLite database can be used as data backend which makes it disk-compatible with ZEO/py and NEO/py/sqlite. Benchmarks comparing ZEO/py, NEO/py/{sqlite,sql} and NEO/go/{fs1,sqlite} were performed on wendelin.core-like and regular ERP5-like datasets. They show that, compared to NEO/py, NEO/go storage works ~ 2x-4.5x faster for single-client serial workloads, and on systems with enough processors works ~ 10x-30x faster for 16-clients parallel workload, thus providing better scaling of database service.

Below is example picture that compares storages performance on our shuttle machine for the case when both client and server runs on the same host:

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/rio-wczblk1.svg

The implementation is very rough at many edges, but more or less works.

The code is located at https://lab.nexedi.com/kirr/neo.git, branches y/go - for clean ready-to-go parts, and t - for still draft stuff.

[1]	small protocol changes are currently needed about how message ids are allocated.

Contents

The need for faster storage
Choice of Go
Development overview
Performance tests
- Measurements stability
  - CPU frequency (P-states)
  - CPU idle (C-states)
- Results and discussion
Current problems
Resume
Appendix I. SSD latency
Appendix II. CPU C-states
Appendix III. All benchmarks

The need for faster storage

Before showing my work, let's first go on why faster ZODB storage is needed. The reason is: in the era when mechanical hard-drives were commonly used, their random access time was ~ 8 - 12 milliseconds. The time to handle read request for objects not already in server cache was thus dominated by IO time, as usually microsecond-range software processing time was negligible to that. With SSD however the picture is completely different: for SSD random 4K-read access latency is usually in the range of 100-200 microseconds, and so the time storage software frontend processing adds to that becomes important to keep low in order to be able to keep up providing SSD-based speed. For example if SSD read latency is 100µs and software-only latency of ZODB server frontend is 300µs, the latency to read objects not in disk cache will be 400µs, thus 4x slower compared to what SSD could natively provide. Please see Appendix I. SSD latency for details on SSD latency topic.

Currently, even with fast Go client (Results and discussion shows Python ZEO and NEO clients add significant latency on their own) both ZEO/py and NEO/py server request handling latency is in the order of several hundreds of µs, and so there is room for improvement here.

Another area where storage improvement is needed is being able to serve many request in parallel. There are several reasons here:

modern SSD are often multi-queue - they can handle several IO requests in parallel to each other.
in situation when client needs to load several objects, some of them could be already in disk cache on the server while some not. If objects are requested serially, it might be that loading objects that are already in the server cache will have to wait for objects that need actual disk IO to load first. If objects are loaded serially one by one, this effectively blocks requests that could be processed just from disk cache. Similarly if most objects are in server disk cache, loading them one-by-one just wastes time, as even if underlying SSD is not multiqueue, they still can be all read in parallel.

The situation is not limited to only one client as several objects could be requested by many clients from the same storage at the same time and the same arguments apply: if storage handles requests serially, the more clients and simultaneous requests are there, the more the latency will be for one individual object to be read.

Currently both ZEO/py and NEO/py handle 1 request at a time only.

Choice of Go

To work on the improved storage, I choosed to use the Go language. My rationale was:

Go is similar to Python in many aspects.
contrary to Python, Go provides lightweight threads called goroutines. The implementation of many concurrent and parallel algorithms is thus naturally expressed in terms of sequential processes that communicate over channels, instead of asynchronous callbacks and state-machines spaghetti.

Internally many goroutines can be multiplexed by Go runtime on top of a single OS thread. The way multiplexing is done is similar to async/await for in-process communication (represented via channels in go) and by way of using epoll network puller for communications via network. But to user it all looks like that every goroutine is separate sequential process. It is possible to create lots of goroutines because they are much more lightweight compared to threads.

In the end goroutines provide similar scalability to systems that are explicitly "asynchronously" programmed around epoll, but for a human it all looks like as if every goroutine executes sequentially and blocks on IO Send/Recv. Go runtime also manages pool of OS threads and schedules goroutines on them, so that whole processing can be using all CPUs available (and inter-goroutine communication via channels is always safe regardless of whether two G are running on the same OS thread or on two OS threads, potentially executing in parallel to each other).
contrary to Python, Go is statically typed and compiled. Being compiled provides better serial performance compared to CPython who is bytecode interpreter internally. On the other hand the compilation time is usually tiny or small this way not loosing fast edit/try property. Being statically typed on one hand helps for being compiled, and on the other hand does not put much mental overhead on the programmer as type inference is widespread. Static types also drive people to work-out interfaces more well, which should be good for project structure, with Go providing good interfaces support at the language level.
contrary to Python, Go has no GIL and leverages multicore by default.

Go is not ideal. However it looks to be a good fit for modern networked services.

Development overview

For development I choosed to initially implement simple case of a read-only single storage node that does not support NEO replication (so n(replica) = 1, n(partition) = 1), but that can work together with existing NEO/py master and NEO/py client on the wire in backward-compatible manner. The choise for being read-only initially is for simplicity and to get things started incrementally. No replication initially also helps to bootstrap because as e.g. protocols like Paxos and Raft show, reaching consensus in between several nodes on committed data can be tricky in various nodes or network failure scenarious and hard to get right. It thus requires proper protocol logic specification in order to be able to analyze it for correctness. Unfortunately, to my knowledge, we currently don't have that.

The work on NEO/go started a bit from a side - by drafting ZODB/go first (1, 2, 3, 4, 5). The reason here is that having Go ZODB client around helps development, because then there can be e.g. in-process unit tests, and also because ZODB/go client will be also needed for Go projects that need to work with data in ZODB. A note goes right away that by insisting on specifying ZODB/go interfaces right, a performance bug in current ZODB/py was discovered: in ZODB/py loadBefore, in addition to serial, also returns serial_next, which constraints storage implementations unnecessarily and is used only in client cache. In ZODB/go client cache implementation shows that it is possible to build efficient client cache without serial_next returned from Load and for this reason in ZODB/go Load interface specification comes without specifying serial_next return. This in turn paves the way for how even NEO/py could be reworked to use 2x less SQL queries for a load object request.

Next, even though NEO/py uses SQLite or MariaDB backends only, keeping in mind current widespread usage of FileStorage in ZODB-based projects, a FileStorage/go was implemented (1, 2, 3, 4, 5). This gives us flexibility to test Go bits not only on SQL data, but also on existing data.fs databases, and actually one of NEO/go storage backend uses FileStorage for that. Additionally fs1/go comes also with several FileStorage related maintenance tools (1, 2, 3), that together with unit-tests demonstrate that data is read bit-to-bit identical compared to python FileStorage implementation, and that might be also useful sometimes on the field. Closing ZODB/go client overview there is also port of zodbtools to Go (1, 2, 3, 4) for completeness.

On server-side NEO/go work started by first implementing messages serialization in exactly the same wire format as NEO/py does. This was needed because NEO/py uses custom encoding format instead of something well-established, as e.g. protocol buffers, with cross-language support. NEO/py takes runtime introspection approach with messages classes defining their encodings in struct module format with then common encoder/decoder using them with the help of small python-based domain-specific language. NEO/go takes similar, though a bit different approach: messages are defined as Go structures with corresponding encoding and decoding routines generated for them at build time. This works well in practice because in Go there is a good library support to parse and typecheck Go sources, and this way it is relatively easy to build custom tools that either generate additional code based on main source, or analyze the program in some custom way, or whatever... For the reference the generated code for messages serialization on NEO/go side is this.

Then comes the link layer which provides service to exchange messages over network. In current NEO/py every message has msg_id field, that similarly to ZEO/py marks a request with serial number with requester then waiting for corresponding answer to come back with the same message id. Even though there might be several reply messages coming back to a single request, as e.g. NEO/py asynchronous replication code shows, this approach is still similar to ZEO/py remote procedure call (RPC) model because of single request semantic. One of the places where this limitation shows is the same replicator code where transactions metadata is fetched first with first series of RPC calls, and only then object data is fetched with the second series of RPC calls. This could be not very good e.g. in case when there is a lot of transactions/data to synchronize, because 1) it puts assumption on, and so constraints, the storage backend model on how data is stored (separate SQL tables for metadata and data), and 2) no data will be synchronized at all until all transactions are synchronized first. The second point prevents for example the syncing storage in turn to provide, even if read-only, service for the already fetched data. What would be maybe more useful is for requester to send request that it wants to fetch ZODB data in tid_min..tid_max range and then the sender sending intermixed stream of metadata/data in zodbdump-like format.

Keeping in mind this, and other examples, NEO/go shifts from thinking about protocol logic as RPC to thinking of it as more general network protocol and settles to provide general connection-oriented message exchange service: whenever a message with new msg_id is sent, a new connection is established multiplexed on top of a single node-node TCP link. Then it is possible to send/receive arbitrary messages over back and forth until so established connection is closed. This works transparently to NEO/py who still thinks it operates in simple RPC mode because of the way messages are put on the wire and because simple RPC is subset of a general exchange. The neonet module also provides DialLink and ListenLink primitives that works similarly to standard Go net.Dial and net.Listen but wraps so created link into the multiplexing layer. What is actually done this way is very similar to HTTP/2 which also provides multiple general streams multiplexing on top of a single TCP connection (1, 2). However if connection ids (sent in place of msg_id on the wire) are assigned arbitrary, there could be a case when two nodes could try to initiate two new different connections to each other with the same connection id. To prevent such kind of conflict a simple rule to allocate connection ids either even or odd, depending on the role peer played while establishing the link, could be used. HTTP/2 takes similar approach where "Streams initiated by a client MUST use odd-numbered stream identifiers; those initiated by the server MUST use even-numbered stream identifiers." with NEO/go doing the same corresponding to who was originally dialer and who was a listener. However it requires small patch to be applied on NEO/py side to increment msg_id by 2 instead of 1.

NEO/py currently explicitly specifies msg_id for an answer in only limited set of cases, by default assuming a reply comes to the last received message whose msg_id it remembers globally per TCP-link. This approach is error-prone and cannot generally work in cases where several simultaneous requests are received over single link. This way NEO/go does not maintain any such global per-link knowledge and handles every request by always explicitly using corresponding connection object created at request reception time.

Then comes the main cluster logic with NEO/go master, client and storage implementation. Like it was already said I needed NEO/go client to be there to help development by being able to use it in unit tests, with the same applying to NEO/go master as well. The NEO/go client will be also helpful for benchmarking (see below). Even though current NEO/go POC was targeting simple case of 1 storage node, there is preliminary support for Node Table and Partition Table which are tried to be updated from master messages in accordance to general NEO protocol. Please note that contrary to previous parts, the main neo server Go code is still very draft and dirty.

For storage there is two backends: one that uses FileStorage directly, and another one that works with NEO/py SQLite data. The FileStorage backend is currently more well-tested and optimized with several performance fruits still being there not taken on the sqlite side.

It should be good to note on how cluster logic is tested: in general cluster is concurrent system with multiple nodes that might be doing several things at the same time, so traditional approach of testing serial systems cannot be applied here easily. What I emerged with is the way to do it via synchronous tracing: first Go tracing package and corresponding gotrace tool are introduced that help do Linux-style trace-events in Go programs. Basically one can add special "//trace:event traceMyEvent(...)" comments into the sources and add traceMyEvent(...) calls at appropriate places. The calls will be no-op by default, but one can attach probes to trace events via tracing API with arbitrary Go code, and while such probes are run the original code execution is paused. This builds the foundation to collect events in synchronous way during execution of a system.

On top of that package tracetest provides infrastructure for testing concurrent systems. Some excerpt from its documentation follows:

A serial system can be verified by checking that its execution produces expected serial stream of events. But concurrent systems cannot be verified by exactly this way because events are only partly-ordered with respect to each other by causality or so called happens-before relation.

However in a concurrent system one can decompose all events into serial streams in which events are strictly ordered by causality with respect to each other. This decomposition in turn allows to verify that in every stream events happenned as expected.

Verification of events for all streams can be done by one sequential process:

if events A and B in different streams are unrelated to each other by causality, the sequence of checks models a particular possible flow of time. Notably since events are delivered synchronously and sender is blocked until receiver/checker explicitly confirms event has been processed, by checking either A then B, or B then A allows to check for a particular race-condition.

if events A and B in different streams are related to each other by causality (i.e. there is some happens-before relation for them) the sequence of checking should represent that ordering relation.

The actual test code that uses this is currently all dirty and in flux, but here is one example how it is all used in practice for verifying two processes that are not ordered by causality with respect to each other:

client asks master for partition table.
master notifies client with node table information.

Last note learned while doing NEO/go is about prefetching objects: whenever there is non-small ListBox to show or RSS feed to generate on ERP5 side, there are many objects that need to be loaded from ZODB. If we load this objects from ZODB one by one, the performance will be low no matter how fast storage frontend is, because even if round-trip time is small, it will be multiplied by overall number of objects to load. What thus makes sense it to prefetch objects in advance in parallel to each other. Prefetching in turn is tightly related to client cache, because after prefetching you need to store data somewhere until it is needed. Let's consider the steps that are performed when an object is loaded regularly:

start loading object into cache,
wait for the loading to complete.

This way Prefetch is naturally only "1" - start loading object into cache but do not wait for the loading to be complete. Go's goroutines naturally help here where we can spawn every such loading into its own goroutine instead of explicitly programming loading in terms of a state machine, and this way ZODB/go always provides working prefetch. We don't have working prefetch implementation for Python clients, hopefully yet.

Performance tests

We frequently run NEO or ZEO storages on the same machine as main ERP5 service. But this is not always the case and, especially for big data projects, the performance when clients and server run on different hosts interconnected by network is interesting. This installment, however, covers only so-called localhost scenario - the case when both client and server are run on the same computer².

The way performance is tested is by neotest driver via launching various storages and then running several test programs on them to see how fast it works:

zhash.py: this program computes hash of whole latest objects stream in a ZODB database with using Python storage client.
zhash.go: this program computes hash of whole latest objects stream in a ZODB database with using Go storage client.
zwrk.go: this program, similarly to HTTP wrk, benchmarks database under parallel load from multiple clients. During a benchmark it is run with several ·n covering from 1 to 16 clients.

zwrk.go deserves a note: initially I was checking how storages perform under parallel load by way of running multiple zhash.py or zhash.go processes which put heavy weight on the OS and do not allow to simulate many clients as those processes are fighting with each other and with server process for CPU with context switches etc. As the result timings were not stable, noisy and not repeatable from run to run. Zwrk.go instead, similarly to wrk, simulates many clients from inside one process by leveraging Go support for goroutines. Side note is that due to ZEO performance being also interesting, simple ZEO/go client had to be quickly made.

Two test datasets are used: wczblk1-8 - the dataset with wendelin.core ZBlk1 objects covering 8M array, and prod1-1024 - synthethic dataset that tries to represent regular ERP5 instance. Both datasets are very small and so we can assume they reside completely in server disk cache while running benchmarks. Benchmark timings will thus give pure storage software processing latency, as pagecache hit time is on par, or less, to 1µs.

For NEO benchmarks, compression was disabled while generating the datasets. This is needed for fairness because NEO uses zlib compression, and e.g. for wczblk1-8 it takes ~ 20µs to decompress data on client side after receiving it from server. Thus if compression is not disabled we'll be measuring server reply latency adjusted by zlib timings³.

To verify how various storages perform I used 4 systems:

deco (i7-6600U @2.60GHz from 2015 https://www.cpubenchmark.net/cpu.php?&id=2608) Fturbo ~ 3.4GHz
rio (i7-3770S @3.10GHz from 2012 https://www.cpubenchmark.net/cpu.php?id=897) Fturbo ~ 3.9GHz
neo1 (i7-860 @2.80GHz from 2009 https://www.cpubenchmark.net/cpu.php?id=6) (no Fturbo)
z6001 (Xeon-X5650 @2.67GHz from 2010 https://www.cpubenchmark.net/cpu.php?id=1304) (no Fturbo)

[2]	the networked case is partly analyzed and understood, but in order to release this update today I will omit it for now.

[3]	on a related note zlib is not the best compression library out there today and eventually it makes sense to switch to more modern ones that provide faster decompression.

Measurements stability

Modern hardware and operating systems have lots of fluid state that might be preventing one to get stable timings while doing benchmarks. For example by default on Linux the CPU frequencies might be changing depending on the load, on which CPUs OS scheduler decided to put processes, etc.

Below is a list of factors that were took into account as ones that might affect benchmark timings and corresponding settings used to avoid instabilities. Some of the recipes, as the following text shows, might turn out to be useful to get better regular performance on our systems, not only for benchmarking.

CPU frequency (P-states)

As it was already said, the OS might be changing CPU frequencies all the time. To avoid timing instabilities caused by this, we set CPU frequency to a fixed value - the maximum frequency this particular CPU was designed to be running in the long term. The "long term" is important, as modern CPU can provide operations on faster then normal frequencies (so called "turbo" frequencies), but only for a limited time - until they get overheated.

For example my notebook has i7-6600U which identifies itself as "Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz" but actually allows clock speeds up to 3.4GHz. However the frequencies in (2.6GHz, 3.4GHz] range are turbo frequencies and cannot be used under intensive load for more than ~ (30 seconds ... 1 minute) - the time after which CPU just gets overheated and either its frequency is forcibly decreased, or its internal protection in the form of thermal throttling gets in forcing the CPU to skip doing work regularly to get its temperature to normal range.

So for my notebook setting CPU frequency to fixed is done by:

# cpupower frequency-set -g performance
# cpupower frequency-set --min 2.6Ghz --max 2.6GHz

CPU idle (C-states)

However CPU frequency is not the only source of timing instability. Modern OS also try to save power by putting CPU into sleep when there is no immediate work to do. There are several CPU sleep states (so called C-states) from shallow to the most deepest one and the deeper a state is the less power CPU is using while being in it. Entering and exiting such states is not free however and comes with the cost of energy and time latency. Below is list of C-states and corresponding exit latencies for Xeon-X5650 used in HP Z600 workstation:

C1(3µs) C1E(10µs) C3(20µs) C6(200µs)

As you can see if C6 is decided to be used, the latency to wake up might be up to 200µs which is rather big time in our context.

Linux by default uses predictor to estimate for how long the CPU in question will not be needed before putting it into sleep. The predictor is not always working well, thus causing addition time latencies and slowdowns when it fails.

This way all benchmarks are run with C-states greater than C1 disabled to avoid timing instabilities⁴:

# cpupower idle-set --disable-by-latency 5      ; C1 < latency <= C1E

Please see Appendix II. CPU C-states for more details on the topic.

Results and discussion

Let's first consider serial zhash.{py,go} performance. Below are some relevant timings from deco:

cluster:deco dataset:wczblk1-8
zeo/py/fs1-zhash.py                                     376µs ± 3%
zeo/py/fs1-zhash.go                                     130µs ± 1%
neo/py(!log)/sqlite-zhash.py                            329µs ± 6%
neo/py(!log)/sqlite-zhash.go                            147µs ± 3%
neo/py(!log)/sql-zhash.py                               375µs ± 3%
neo/py(!log)/sql-zhash.go                               183µs ± 3%
neo/go/fs1-zhash.py                                     226µs ± 3%
neo/go/fs1-zhash.go                                    55.9µs ± 1%
neo/go/sqlite-zhash.py                                  249µs ± 2%
neo/go/sqlite-zhash.go                                 72.8µs ± 1%

cluster:deco dataset:prod1-1024
zeo/py/fs1-zhash.py                                     345µs ± 3%
zeo/py/fs1-zhash.go                                     105µs ± 1%
neo/py(!log)/sqlite-zhash.py                            328µs ± 6%
neo/py(!log)/sqlite-zhash.go                            140µs ± 1%
neo/py(!log)/sql-zhash.py                               373µs ± 3%
neo/py(!log)/sql-zhash.go                               181µs ± 1%
neo/go/fs1-zhash.py                                     229µs ± 1%
neo/go/fs1-zhash.go                                    43.5µs ± 0%
neo/go/sqlite-zhash.py                                  257µs ± 1%
neo/go/sqlite-zhash.go                                 65.9µs ± 1%

What can be observer right away is that for both ZEO and NEO Python clients add significant latency by themselves. This is true qualitatively for all 4 hosts, and so should be investigated, at least for NEO/py.

Another thing to note is that the SHA1 used in NEO for checksumming is heavy-weight:

node:deco
crc32/py/4K                                            5.72µs ± 0%
crc32/go/4K                                             279ns ± 0%
sha1/py/4K                                             6.70µs ± 0%
sha1/go/4K                                             5.59µs ± 0%
unzlib/py/wczdata                                      27.1µs ± 1%
unzlib/go/wczdata                                      26.7µs ± 0%
unzlib/py/prod1-avg                                    5.12µs ± 2%
unzlib/go/prod1-avg                                    5.25µs ± 0%

cluster:deco dataset:wczblk1-8
neo/go/fs1-zhash.go                                    55.9µs ± 1%
neo/go/fs1(!sha1)-zhash.go(!sha1)                      39.3µs ± 1%
neo/go/sqlite-zhash.go                                 72.8µs ± 1%
neo/go/sqlite-zhash.go(!sha1)                          66.2µs ± 2%

cluster:deco dataset:prod1-1024
neo/go/fs1-zhash.go                                    43.5µs ± 0%
neo/go/fs1(!sha1)-zhash.go(!sha1)                      38.1µs ± 1%
neo/go/sqlite-zhash.go                                 65.9µs ± 1%
neo/go/sqlite-zhash.go(!sha1)                          63.3µs ± 1%

The time for SHA1(4K) is ~ 6µs here. However it is more than 10% of whole time for neo/go/fs1-zhash.go. If the checksum is needed for data integrity verification only, what makes more sense is to use Castagnoli CRC32 (the one used by iSCSI which is more robust to errors compared to IEEE CRC32) because modern processors can compute both CRC32 in an order of magnitude smaller time compared to SHA1 (notice the crc32/go/4K timing above). For this reasons I do benchmarks for Go servers with both SHA1 normally computed and checked, and also with SHA1 computation skipped - to see how much improvements could be there by switching to more lightweight checksum.

Neotest emits timings in Go benchmarking format with the idea for that data to be easily analyzed and compared in between runs with Go tools like benchstat (see the pretty output above). However for scalability plotting it is more natural to leverage Python stack and matplotlib & numpy in particular, and also I was keeping in mind the idea of eventually ingesting output from benchmark runs into ERP5 and then being able to analyze data there on ERP5 side. This way I first did benchlib.py Python module that allows to read benchmarking data on Python side, and then a (very hacky) benchplot utility to do scalability graphs for zwrk runs.

Below is such graph for Z6001 computer which has 24 CPU (12 cores x 2 hyper-threads):

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/z6001-wczblk1.svg

For brevity I let the picture speak for itself and we can also discuss things in followup exchanges.

The data and graphs for all benchmarks performed are showed in Appendix III. All benchmarks

Current problems

There are some known problems not yet analyzed:

prefetch not as fast as it should be.
NEO/go storage is read-only without write mode support.
NEO/go storage does not support MariaDB backend.

Resume

Proof-of-concept NEO/go storage was presented which shows faster serial speeds and scales more well compared to NEO/py.

In the closing I'd like to thank Ivan, Rafael, Hardik and Jean-Paul for their help.

Kirill

Appendix I. SSD latency

Let's check SSD of deco to get more sense of SSD latencies:

root@deco:~# hdparm -I /dev/sda |grep -i model
        Model Number:       SanDisk X400 M.2 2280 512GB

The random direct (no kernel cache) 4K-read SSD latency is ~ 105μs on deco on average:

kirr@deco:~/src/tools/fs/ioping$ ./ioping -D -i 0ms -s 4k -S 1024M -w 10s -q -k .

--- . (ext4 /dev/sda3) ioping statistics ---
96 k requests completed in 9.96 s, 375 MiB read, 9.64 k iops, 37.6 MiB/s
generated 96.0 k requests in 10.0 s, 375.0 MiB, 9.60 k iops, 37.5 MiB/s
min/avg/max/mdev = 96.3 us / 103.8 us / 325.4 us / 3.46 us
< 101.1 us      388     |
< 102.6 us      23263   | ************
< 104.1 us      38810   | ********************
< 105.6 us      28913   | ***************
< 107.1 us      3267    | *
< 108.6 us      364     |
< 110.1 us      75      |
< 111.6 us      90      |
< 113.1 us      87      |
< 114.6 us      78      |
< 116.1 us      71      |
< 117.6 us      60      |
< 119.1 us      55      |
< 120.6 us      49      |
< 122.1 us      37      |
< 123.6 us      37      |
< 125.1 us      45      |
< 126.6 us      30      |
< 128.1 us      18      |
< 129.6 us      12      |
< 131.1 us      12      |
<       +∞      139     |

If we don't disable kernel cache the random 4K cached disk read latency for disk read requests that hit kernel cache is ~ 0.6μs on deco on average:

kirr@deco:~/src/tools/fs/ioping$ ./ioping -C -i 0ms -s 4k -S 1024M -w 10s -q -k .

--- . (ext4 /dev/sda3) ioping statistics ---
15.3 M requests completed in 9.14 s, 58.3 GiB read, 1.67 M iops, 6.38 GiB/s
generated 15.3 M requests in 10.0 s, 58.3 GiB, 1.53 M iops, 5.83 GiB/s
min/avg/max/mdev = 221 ns / 597 ns / 313.9 us / 325 ns
< 487 ns        155684  |
< 516 ns        1765442 | *****
< 546 ns        2679487 | ********
< 575 ns        3654902 | ***********
< 605 ns        3453224 | ***********
< 634 ns        1507604 | ****
< 664 ns        949194  | ***
< 693 ns        229810  |
< 723 ns        124053  |
< 752 ns        80134   |
< 782 ns        98617   |
< 812 ns        114873  |
< 841 ns        138809  |
< 871 ns        114586  |
< 900 ns        63176   |
< 930 ns        43167   |
< 959 ns        14998   |
< 989 ns        7450    |
< 1.02 us       2875    |
< 1.05 us       1279    |
< 1.08 us       1101    |
<       +∞      88638   |

(ioping was patched a bit to display latency histogram)

Appendix II. CPU C-states

As it was said CPU sleep states can be causing timing instabilities and slowdowns. Let's look into more details on this topic.

On Linux the part of the OS that is responsible for putting CPU into sleep is the so-called "cpuidle" subsystem. In short there are idle drivers that know specifics of a CPU and how to enter it into particular sleep state, and idle governors that make the decision into which sleep state to enter the CPU when there is no work to do.

There are two well-established idle governors:

"ladder" - the one that is documented to be better suited for systems with periodic timer tick, and
"menu" - the one that is documented to be better suited for tickless systems.

With menu being commonly used everywhere by default.

Since now we are talking about tickless systems, in order to make the decision the governor needs to predict the future - it needs to know for how long time the CPU will not be needed. Knowing this is required because entering/exiting a sleep state itself needs energy and the overall savings could be reached only if CPU will reside in the state for long enough. This way every C-state, in addition to exit latency, has target_residency property and idle governor selects deepest state with residency time < expected idle time. Below is list of exit-latency/target-residency for above mentioned Xeon-X5650:

C1(3µs/6µs) C1E(10µs/20µs) C3(20µs/80µs) C6(200µs/800µs)

Predicting the future, however, is not always easy - menu idle governor originally started by using next scheduled on this CPU timer as the only prediction source. This was putting systems into deeper than necessary sleeps and was harming performance. The performance was eventually fixed by introducing correction factors that take current CPU load and IO waiters into account with the idea that the more loaded system is, the less deeper C-states should be used in order not to cause slowdown. Then menu was amended with repeating pattern detector to be able to adjust itself to events, like incoming network traffic, that come from outside and do not correlate with local timer. This detector was later amended to filter out outliers for more stable detection. And that is menu state today.

Let us put it again: whenever there is no immediate work for a CPU to do, the system is using predictor that estimates the time this CPU won't be needed, corrects the estimation by this-CPU load factor, and based on the result selects the deepest sleep state that fits into target window. And even if the predictor is not working correctly, for a busy server it should not matter much because correction by load should lower allowed target window significantly. Everything should be working ok, right?

Except that it is not always the case:

Let's consider 2 processes running on the same machine, one modelling ERP5 worker thread and another - database, with worker constantly serially sending read requests to the database and waiting for it to reply. Given enough CPUs on the system the OS scheduler will be trying to keep both processes on separate CPUs, and this in turn means that the load factor for worker CPU might be small: there is no IO on it (the IO counted in load-factor is only for disk, not network IO), and if most of the time is spent waiting for the database to reply, the load of worker CPU itself will be small. This in turn means that prediction errors of putting worker CPU into sleep won't be corrected, and so if there is indeed prediction errors and there are C-states with long-enough exit latencies, the worker CPU might be put into deep sleep while waiting for database reply. When the reply comes, addition time penalty will be paid waiting for the worker CPU to wake up.

This effect can be observed with real slowdowns on two kind of our machines: on the "ASUS" machines with i7-860 CPU (identifies as "Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz"), and on above-mentioned HP Z600 with Xeon-X5650:

"ASUS" (full benchmark):

cluster:neo1 dataset:wczblk1-8
zeo/py/fs1-zhash.py                                         516µs ± 8%               459µs ± 0%  -11.00%  (p=0.016 n=5+4)
zeo/py/fs1-zwrk.go·1                                        195µs ± 1%               172µs ± 1%  -11.98%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zhash.py                                692µs ± 5%               390µs ± 1%  -43.67%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zwrk.go·1                               230µs ± 1%               191µs ± 2%  -16.86%  (p=0.008 n=5+5)
neo/py(!log)/sql-zhash.py                                   938µs ±50%               459µs ± 1%  -51.03%  (p=0.016 n=5+4)
neo/py(!log)/sql-zwrk.go·1                                  305µs ± 7%               244µs ± 6%  -19.90%  (p=0.008 n=5+5)

cluster:neo1 dataset:prod1-1024
zeo/py/fs1-zhash.py                                         499µs ± 2%               436µs ± 2%  -12.61%  (p=0.008 n=5+5)
zeo/py/fs1-zwrk.go·1                                        170µs ± 1%               146µs ± 1%  -13.66%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zhash.py                                678µs ± 3%               387µs ± 5%  -42.91%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zwrk.go·1                               209µs ± 3%               180µs ± 3%  -14.26%  (p=0.008 n=5+5)
neo/py(!log)/sql-zhash.py                                   922µs ±53%               477µs ±40%  -48.31%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·1                                  294µs ± 4%               230µs ± 2%  -21.61%  (p=0.008 n=5+5)

Z600 (full benchmark):

cluster:z6001 dataset:wczblk1-8
zeo/py/fs1-zhash.py                                         675µs ± 5%               590µs ± 4%  -12.51%  (p=0.008 n=5+5)
zeo/py/fs1-zwrk.go·1                                        217µs ± 3%               195µs ± 1%  -10.29%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zhash.py                                673µs ± 8%               482µs ± 4%  -28.32%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zwrk.go·1                               306µs ± 6%               220µs ± 1%  -27.99%  (p=0.008 n=5+5)
neo/py(!log)/sql-zhash.py                                   958µs ±28%               529µs ± 4%  -44.74%  (p=0.016 n=5+4)
neo/py(!log)/sql-zwrk.go·1                                  395µs ±11%               276µs ± 3%  -30.18%  (p=0.008 n=5+5)

cluster:z6001 dataset:prod1-1024
zeo/py/fs1-zhash.py                                         646µs ± 3%               555µs ± 2%  -14.12%  (p=0.008 n=5+5)
zeo/py/fs1-zwrk.go·1                                        186µs ± 4%               163µs ± 1%  -12.16%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zhash.py                                653µs ± 7%               467µs ± 4%  -28.50%  (p=0.008 n=5+5)
neo/py(!log)/sqlite-zwrk.go·1                               273µs ± 4%               204µs ± 1%  -25.07%  (p=0.008 n=5+5)
neo/py(!log)/sql-zhash.py                                   961µs ±38%               518µs ± 2%  -46.11%  (p=0.016 n=5+4)
neo/py(!log)/sql-zwrk.go·1                                  365µs ±11%               275µs ± 7%  -24.47%  (p=0.008 n=5+5)

The timings above compare latency to read 1 object from database for runs with C-states enabled to runs where C-states greater C1 were disabled⁴. There were 5 iterations of each benchmark and in particular -zwrk.go ones were running each iteration for 10 seconds in C-states-enabled case to make sure CPU loading factor gets warmed up⁵.

On these machines the slowdown effect does not even completely go away if there are many workers simultaneously accessing the database, and the database process itself acts as RPC-like frontend to MariaDB process (notice no neo/py/sqlite here, as sqlite backend is implemented by way of using libsqlite in the same process, while sql backend talks to separate MariaDB server process over unix socket):

"ASUS":

cluster:neo1 dataset:wczblk1-8
neo/py(!log)/sql-zwrk.go·2                                  491µs ± 2%               412µs ± 3%  -15.94%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·3                                  740µs ± 2%               625µs ± 4%  -15.61%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·4                                  976µs ± 2%               837µs ± 1%  -14.28%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·8                                 1.98ms ± 2%              1.69ms ± 7%  -14.60%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·12                                2.98ms ± 2%              2.57ms ± 6%  -13.56%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·16                                4.03ms ± 3%              3.42ms ± 9%  -14.95%  (p=0.008 n=5+5)

cluster:neo1 dataset:prod1-1024
neo/py(!log)/sql-zwrk.go·2                                  483µs ± 2%               405µs ± 4%  -16.22%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·3                                  728µs ± 1%               640µs ± 5%  -11.98%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·4                                  978µs ± 3%               837µs ± 4%  -14.42%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·8                                 1.92ms ± 2%              1.63ms ± 1%  -15.10%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·12                                2.94ms ± 2%              2.52ms ± 3%  -14.18%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·16                                3.87ms ± 3%              3.27ms ± 3%  -15.56%  (p=0.008 n=5+5)

Z600:

cluster:z6001 dataset:wczblk1-8
neo/py(!log)/sql-zwrk.go·2                                  514µs ± 5%               470µs ± 4%   -8.50%  (p=0.016 n=5+5)
neo/py(!log)/sql-zwrk.go·3                                  741µs ± 3%               699µs ± 0%   -5.64%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·4                                 1.02ms ± 7%              0.95ms ± 5%   -6.66%  (p=0.032 n=5+5)
neo/py(!log)/sql-zwrk.go·8                                 2.00ms ± 7%              1.91ms ± 1%   -4.52%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·12                                2.94ms ± 3%              2.87ms ± 3%     ~     (p=0.056 n=5+5)
neo/py(!log)/sql-zwrk.go·16                                4.04ms ±11%              3.80ms ± 1%   -5.78%  (p=0.008 n=5+5)

cluster:z6001 dataset:prod1-1024
neo/py(!log)/sql-zwrk.go·2                                  524µs ± 5%               460µs ± 0%  -12.15%  (p=0.016 n=5+4)
neo/py(!log)/sql-zwrk.go·3                                  754µs ±10%               689µs ± 1%   -8.64%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·4                                  975µs ±11%               927µs ± 0%   -4.98%  (p=0.016 n=5+4)
neo/py(!log)/sql-zwrk.go·8                                 1.92ms ± 2%              1.83ms ± 1%   -4.53%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·12                                2.95ms ± 9%              2.73ms ± 1%   -7.65%  (p=0.008 n=5+5)
neo/py(!log)/sql-zwrk.go·16                                4.12ms ± 8%              3.69ms ± 1%  -10.36%  (p=0.008 n=5+5)

The slowdown effect is however less easily observed on our "Shuttle" machines with i7-3770S CPU (identifies as "Intel(R) Core(TM) i7-3770S CPU @ 3.10GHz"), and on my notebook - usually showing slighter changes which are harder (and more time consuming) to distinguish from noise.

Let's try to understand what is going on. Below is the table that compares CPU of the machines:

Machine	N(CPU)	C-states (exit-latency / target-residency)
Z600	24	C1(3µs/6µs) C1E(10µs/20µs) C3(20µs/80µs) C6(200µs/800µs)
"ASUS"	8	C1(3µs/6µs) C1E(10µs/20µs) C3(20µs/80µs) C6(200µs/800µs)
"Shuttle"	8	C1(1µs/1µs) C1E(10µs/20µs) C3(59µs/156µs) C6(80µs/300µs)
deco	4	C1(2µs/2µs) C1E(10µs/20µs) C3(70µs/100µs) C6(85µs/200µs) C7s(124µs/800µs) C8(200µs/800µs) C9(480µs/5000µs) C10(890µs/5000µs)

Comparing "ASUS" to "Shuttle" it can be seen that even though there is the same number of processors, the worst exit latency for "Shuttle" is significantly less compared to "ASUS". This way misprediction errors has lower time penalties. It should be also taken into account that the time-latencies reported by CPU is the maximum time latencies and they usually correspond to the time when whole CPU package has to be brought up from that sleep state. When it is only one core that was sleeping the exit latency is usually smaller.

However deco case is different. It presents the CPU with very deep sleep-states available, but there is no easily seen related performance penalty. But deco has only 2 cores (so 4 CPUs with hyper-threading) and this way there is simply no room to replay 3-processes ERP5-NEO-MariaDB scenario. Deco has also much faster single-thread performance (e.g. ~ 214K pystone/s at no-turbo frequency) compared to all other machines (~ 140K (shuttle), ~ 115K (Z600), ~ 105K (ASUS) pystone/s). This way typical request-reply cycle on deco could be taking less time with cpuidle governor thus tending to predict smaller sleep time windows.

I had not investigated "Shuttle" and Deco cases till the end. However what is clear is:

The cpuidle predictor in Linux is not perfect and can make mistakes. This in turn can lead to slowdowns in RPC-like systems and also sometimes noise in benchmarks. In any case for timings stability it makes sense to disable C-states while running benchmarks.
The C-states slowdown effect tends to be more visible in the following circumstances:
- system has many CPUs,
- there are deep-enough C-states,
- the system load is RPC-like with lot of request-reply in between different processes.

This way all the benchmarks were run C-states > C1 disabled:

# cpupower idle-set --disable-by-latency 5      ; C1 < latency <= C1E

For the reference there is also many technical resources on the network that recommend disabling C-states to get better performance of low-latency applications. (e.g. one from HP).

[4]	(1, 2) C1 is needed for hyper-threading - if all C-states were disabled, the kernel would just busy-loop if there is no work to do, and this can be taking resources from other threads on the same CPU core. Exit-latency for C1 is usually relatively small, so leaving C1 enabled can be tolerated.

[5]	menu source explicitly states that CPU load factor has few seconds window. I also verified via monitoring watch -n1 -d cat /proc/sched_debug that corresponding OS-scheduler runqueue load adjusts to actual load changes in no more than 1 second.

Appendix III. All benchmarks

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/deco-wczblk1.svg

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/deco-prod1.svg

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/neo1-wczblk1.svg

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/neo1-prod1.svg

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/rio-prod1.svg

https://lab.nexedi.com/kirr/misc/raw/c8f923a2/t/z6001-prod1.svg