I present a Rust-specific sequel to my previous benchmark of two Kotlin microservices and a Rust one
— it’s hard to resist one’s own curiosity and popular demand, especially when you’ve been
nerd-sniped.
Let’s stress-test two prominent Rust web frameworks: Actix Web and Rocket.
In addition to stable “threads & blocking calls” Rocket v0.4,
I have included a development snapshot of in-the-works Rocket v0.5,
which is async and no longer requires nightly Rust.
Impatient? Jump to the results.
Preamble
I’ll lean on the previous article, which fully describes the aspects that apply equally well to this round:
- How this differs from the TechEmpower benchmarks. TL;DR: we want to capture finer nuances and test idiomatic implementations with error reporting, logging, etc. — rather than highly optimised ones.
- The microservice we’re benchmarking. TL;DR: a simple endpoint that makes one call to an Elasticsearch server.
- The testing methodology. TL;DR: repeated runs of a Python script that spins up the microservice Docker container and exposes it to an increasing number of concurrent connections using wrk.
- Runtime Environment. TL;DR: we limit the microservice to 1.5 CPU cores and 512 MiB memory using Docker.
- Hardware. TL;DR: Google Cloud Platform VM with 4-core AMD Epyc Rome CPU for the microservice + 12-core machine for the Elasticsearch server.
We still compile in release mode, target skylake, and utilize cheap performance tricks.
What changed is the Rust version: we have to use nightly because of Rocket v0.4.
More specifically, all implementations are compiled using rustc 1.47.0-nightly (2d8a3b918 2020-08-26).1
Actix v3.0
Code as benchmarked: locations-rs tag actix-v30.
Uses Actix Web 3.0.2.
Just some small things have changed since the version described in the last post:
- We now use version 3.0, but note that its performance matches v2.0 in our case.
- OpenAPI (Swagger) support is reintroduced as I was able to make Paperclip support v3.0, too.
Rocket v0.4
Code as benchmarked: locations-rs-rocket tag rocket-v04.
Uses stable Rocket 0.4.5.
Porting from Actix to Rocket v0.4 was a matter of one +175 -150 lines commit.2
It looks a bit scary but was mostly mechanical: converting Actix types to Rocket ones, and then fixing all compiler errors — I love how rustc essentially works as your to-do list.
There was only one major hurdle:
Calling Async Functions from Blocking Handlers
Rocket v0.4 handlers are classic blocking (sync) functions, but Reqwest-based elasticsearch-rs only provides an async API. Whoops.
The general advice is to propagate the asynchronicity up the caller stack instead of trying to call async functions from sync code. But what if we really want to? These are our options:
1. global-rt: launch a global threaded Tokio runtime alongside the Rocket workers. Then call the runtime’s Handle::block_on() in endpoint handlers to delegate the work to the global async runtime, pausing the Rocket worker until the future resolves.
2. per-worker-rt: create a basic single-threaded Tokio runtime per each Rocket worker.3 In endpoint handlers, call Runtime::block_on(), which here has different semantics (!) than the Handle::block_on() above: the future, and any other spawned async tasks, actually run within this Rocket worker thread.4 This comes with a caveat: the reqwest Client seems to attach itself to the async runtime it is first used in. I had to make the Elasticsearch client also local to each Rocket worker; otherwise, I got deadlocks or the problems described in hyper issue #2112. (A code sketch of this approach follows the list.)
3. per-req-rt: launch a fresh basic Tokio runtime per each request. It feels wrong, and it is wrong. I’ve tried and benchmarked this so that we know how wrong it is.
4. Patch elasticsearch-rs to provide a blocking API — by employing reqwest’s optional blocking API. That would be futile, and essentially a sophisticated variant of 2., given the implementation details of the blocking client.
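To make the per-worker-rt idea concrete, here is a minimal sketch of one possible wiring — not the benchmarked code. It assumes the Tokio 0.2 API (the generation used by hyper 0.13.x and reqwest at the time), and fetch_something() is an invented stand-in for the real elasticsearch-rs call:

```rust
use std::cell::RefCell;

use tokio::runtime::{Builder, Runtime};

thread_local! {
    // One basic (single-threaded) runtime per Rocket worker thread, created
    // lazily on first use. Any client that binds itself to a runtime (e.g.
    // the reqwest-based Elasticsearch client) should be made worker-local
    // in the same way.
    static WORKER_RT: RefCell<Runtime> = RefCell::new(
        Builder::new()
            .basic_scheduler()
            .enable_all()
            .build()
            .expect("failed to build per-worker Tokio runtime"),
    );
}

// Stand-in for the real async Elasticsearch call.
async fn fetch_something() -> String {
    "hello".to_string()
}

// Body of a blocking Rocket v0.4 handler: the future (and any tasks it
// spawns) runs right here, on the Rocket worker thread.
fn blocking_handler_body() -> String {
    WORKER_RT.with(|rt| rt.borrow_mut().block_on(fetch_something()))
}
```

The global-rt variant has a similar shape, except that a single shared multi-threaded runtime is created up front and handlers pause on its Handle::block_on() instead.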
Here are the results of benchmarking the first three approaches. Source code of each variant is available under the respective tag in the locations-rs-rocket repository.
You can almost hear the server crying as it tries to cope with the inefficiency of per-req-rt: it is more than 7✕ less efficient than the best-performing variant.
The other two, more realistic variants are close to each other. per-worker-rt has a slight edge in peak performance and a clear edge in efficiency, especially for low connection counts. It is therefore proclaimed the winner of this qualification round and represents Rocket v0.4 in later benchmarks.
It is not a surprise that highly-optimised Actix uses a similar approach: independent single-threaded Tokio runtimes per each worker instead of a global work-stealing threaded one.
Keep-Alive Connections
Keep-Alive (persistent) connections save latency and resources when a client makes more than one request to a given server, especially when the connection is secured by TLS. Unfortunately, Rocket v0.4 is not their friend.
First, to the best of my knowledge, persistent connections don’t work at all in Rocket v0.4 — the server closes the connection before reading a second request. I’ve traced the problem down to a bug in BufReader in old hyper 0.10.x and submitted a fix. In the 0.11.x branch, the same bug was fixed long ago and released with 0.11.0 in June 2017. My pull request was closed without merging, as the maintainers were (understandably) not keen on releasing a new version of a legacy branch that was superseded 3 years ago. In other words, Rocket v0.4 depends on unmaintained hyper for its HTTP handling.
Second, even if the bug in hyper is patched, keep-alive connections in hyper 0.10.x are implemented naïvely: the worker thread is kept busy waiting for the client on the persistent connection, unable to process other requests. It is therefore easy to (accidentally) trigger a denial of service by opening more persistent connections than there are available workers, even with the default keep-alive timeout of 5 s.5 Note that the first problem prevents this second problem from happening. ;-)
Both issues have long been resolved in the more modern hyper 0.11+, and therefore in Rocket v0.5-dev, which I happen to benchmark too.
If you run Rocket v0.4 in production, I recommend turning off persistent connections in the Rocket config (set keep-alive to 0) — while most clients gracefully retry the second request that fails on a persistent connection, at least some versions of Python requests and urllib3 were raising exceptions instead.
If you care about latency, I suggest putting an HTTP load-balancer in front of the Rocket v0.4 server to reintroduce persistent connections at least on the client <-> load-balancer hop.
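For illustration, here is a minimal sketch of disabling keep-alive programmatically in Rocket v0.4; the health endpoint is an invented example, and the same effect can be achieved with keep_alive = 0 in Rocket.toml:

```rust
#![feature(proc_macro_hygiene, decl_macro)] // Rocket v0.4 still needs nightly

#[macro_use]
extern crate rocket;

use rocket::config::{Config, Environment};

#[get("/health")]
fn health() -> &'static str {
    "ok"
}

fn main() {
    let config = Config::build(Environment::Production)
        .keep_alive(0) // 0 disables persistent connections entirely
        .finalize()
        .expect("valid Rocket config");

    rocket::custom(config).mount("/", routes![health]).launch();
}
```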
The benchmarks show that disabling keep-alive causes only a mild performance hit in our case.
The red oneshot-* line is Rocket with keep-alive disabled and 16 workers, while the other persistent-* lines represent Rocket with patched hyper, keep-alive enabled, and 16, 32 and 64 workers.
It can be seen that disabling keep-alive hurts latency, but not necessarily throughput — if the number of connections can be increased. Note that the effect will be more pronounced in reality, where real network latencies are much more significant than that of our loopback interface.
Unfortunately, wrk does not indicate when some of its concurrent connections have failed to connect to the server at all.
But another demonstration of the denial-of-service behaviour is present when keep-alive is enabled:
Notice how the latencies of the 16-, 32-, and 64-worker instances of Rocket cut off at 16, 32, and 64 concurrent connections, respectively.
When such saturation happens, it is indeed impossible to make a new connection to the Rocket instance using e.g. curl.
Because of these two problems, Rocket v0.4 has keep-alive disabled in all other benchmarks.
Tuning The Number of Workers
If you want to squeeze the highest possible efficiency from a Rocket v0.4 instance, you should tweak the number of its worker threads. The optimal count will depend mainly on the number of available CPU cores and on the ratio of time your endpoints spend CPU-crunching versus waiting for I/O.
Here I have benchmarked worker counts from 8 to 256.
The instance with 16 workers is the most efficient, although the differences are small. The most notable variance is, as expected, in memory consumption. Keeping in mind that 1.5 CPUs are available to the microservice, we arrive at around 10 workers per core. Rocket’s default is a more conservative 2 workers per core.
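As an illustration of the knob being tuned, here is a hedged sketch (not the benchmark harness itself) that reads the worker count from an environment variable and passes it to Rocket’s config builder; the WORKERS variable name and the fallback of 16 are my own choices. Rocket v0.4 can also pick the value up directly from ROCKET_WORKERS or from Rocket.toml.

```rust
use rocket::config::{Config, Environment};

/// Build a Rocket v0.4 config with a tunable worker count, so different
/// counts can be benchmarked without recompiling.
fn tuned_config() -> Config {
    // ~10 workers per core was the sweet spot at 1.5 CPU cores here.
    let workers: u16 = std::env::var("WORKERS")
        .ok()
        .and_then(|value| value.parse().ok())
        .unwrap_or(16);

    Config::build(Environment::Production)
        .workers(workers)
        .keep_alive(0) // see the keep-alive section above
        .finalize()
        .expect("valid Rocket config")
}
```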
Rocket v0.5-dev
Big fat warning: Rocket v0.5 is still under development. The version tested in this post is its 1369dc4 commit. A lot of things may change in the final release, including all observations here. You can track Rocket v0.5 progress in its async migration tracking issue and the related GitHub milestone.
Code as benchmarked: locations-rs-rocket tag rocket-v05-dev.
Porting from async Actix to async Rocket v0.5-dev was even easier than to Rocket v0.4. Here is the 147 insertions, 140 deletions commit that did the job.2
Compared to v0.4, Rocket v0.5-dev is boring, in the best possible sense of the word. Persistent connections work without problems. There is no need to fiddle with the number of workers. I attribute this to the port to up-to-date hyper 0.13.x and the async Rust ecosystem.
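To show why the port was straightforward, here is a minimal sketch of what an async handler looks like. It uses the released Rocket v0.5 API (the exact launch API of the benchmarked dev snapshot differed slightly), and the route path and fetch_something() stand-in are invented for the example:

```rust
#[macro_use]
extern crate rocket;

// Stand-in for the real async Elasticsearch call.
async fn fetch_something() -> String {
    "hello".to_string()
}

// Handlers are async fns now, so the async client is awaited directly;
// no Runtime::block_on() bridging is needed any more.
#[get("/hello")]
async fn hello() -> String {
    fetch_something().await
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![hello])
}
```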
Results
All graphs below are interactive and infinitely scalable SVGs — zoom in if necessary.