Async vs Threaded Client

Notes on comparing async vs threaded ZEO clients

Sun Oct 23 2016

Based on some early measurements I'd made, I'd gotten the impression that there was significant overhead in asyncio's call_soon_threadsafe mechanism. This is important, because:

  • A common server architecture uses an asynchronous I/O library for I/O and a thread pool for blocking operations, such as computation or blocking database access.
  • ZEO clients use a separate thread for interacting with ZEO servers, and application requests communicate with the I/O thread using call_soon_threadsafe.
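To make that pattern concrete, here is a minimal sketch (the call_in_io_thread helper is hypothetical, not ZEO's actual code) of an application thread handing work to an I/O thread with call_soon_threadsafe and blocking on a concurrent.futures.Future for the reply:

```python
import asyncio
import concurrent.futures
import threading

# An event loop running in a dedicated I/O thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

def call_in_io_thread(func, *args):
    """Run func(*args) on the I/O thread; block the caller until it's done."""
    future = concurrent.futures.Future()

    def run():
        try:
            future.set_result(func(*args))
        except Exception as exc:
            future.set_exception(exc)

    # This is the cross-thread handoff in question: it appends `run` to the
    # loop's queue and wakes the loop by writing to a socket.
    loop.call_soon_threadsafe(run)
    return future.result()  # the application thread blocks here

print(call_in_io_thread(lambda: 6 * 7))  # -> 42
```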

First, I did a quick hack to (mostly) implement an asyncio event loop using synchronous I/O and threads. :)

With the standard asyncio event loop, call_soon_threadsafe works by adding call information to a queue and waking an event loop by writing to a socket. With the threaded event loop, a simple lock is used.
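The hack itself isn't reproduced here, but the idea can be sketched (illustrative only): with a "loop" built on synchronous I/O and threads there is no selector to wake, so a cross-thread call reduces to taking a lock:

```python
import threading

class ThreadedLoopSketch:
    """Illustrative only: in a loop built on synchronous I/O and threads,
    cross-thread calls need no wakeup socket."""

    def __init__(self):
        self._lock = threading.Lock()  # serializes work with the I/O code

    def call_soon_threadsafe(self, callback, *args):
        # No queue append, no socket write, no extra syscall: just take
        # the lock and run the callback in the calling thread.
        with self._lock:
            callback(*args)
```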

I then ran a benchmark script with and without the threaded event loop. Because I was focused on client performance, I used a single client and otherwise default options. The best-of-3 results on my Mac (2.3 GHz Core i7, four cores) for Python 3.5:

| Test     | Asyncore | Threaded |
|----------|---------:|---------:|
| add      | 2779     | 2308     |
| update   | 2729     | 2488     |
| cached   | 189      | 187      |
| read     | 1167     | 799      |
| prefetch | 839      | 550      |

Values are times per transaction in microseconds.

I also tried this with uvloop, which is quite a bit faster than the standard event loop, and using the experimental byteserver:

| Test     | Asyncore | Threaded | Threaded+byteserver |
|----------|---------:|---------:|--------------------:|
| add      | 2612     | 2514     | 1545                |
| update   | 2693     | 3330     | 1504                |
| cached   | 190      | 193      | 191                 |
| read     | 896      | 762      | 637                 |
| prefetch | 565      | 566      | 540                 |

I also tested with Python 2.7:

| Test     | Asyncore | Threaded | Threaded+byteserver |
|----------|---------:|---------:|--------------------:|
| add      | 16519    | 6778     | 6130                |
| update   | 13231    | 4912     | 4448                |
| cached   | 177      | 182      | 178                 |
| read     | 5045     | 5160     | 5373                |
| prefetch | 1897     | 2104     | 663                 |

Some things to note:

  • Cached times are meaningless, as they don't touch the network at all. They are merely included for completeness.
  • The threaded results in the uvloop table used uvloop on the server.
  • Update times can be highly variable, for some reason.
  • I'm most interested in read times. Add, update, and prefetch operations all involve a lot of asynchronous network messages, which should make them less sensitive to the impact of call_soon_threadsafe. Also, read operations are a lot more common.

Using synchronous I/O is significantly faster for Python 3.5, although the advantage shrinks when uvloop is used. This suggests some benefit in pursuing the threaded event loop.

The results for Python 2.7 were more mixed and rather surprising. I wasn't sure what to make of these results. Trollius has been a somewhat problematic dependency, so removing it as a client dependency by using the threaded event loop might be a big win in and of itself.

The extreme slowness with threaded Python 2.7 was driving me nuts.

So, after some digging, I discovered that the slowdown was due to waiting on futures with a timeout, which is wildly expensive in Python 2. If, when waiting for results, I don't supply a timeout, I get much better results:

| Test     | Asyncore | Threaded | Threaded+byteserver |
|----------|---------:|---------:|--------------------:|
| add      | 3002     | 2278     | 1574                |
| update   | 3204     | 2759     | 1490                |
| cached   | 178      | 178      | 176                 |
| read     | 1314     | 896      | 717                 |
| prefetch | 912      | 717      | 585                 |
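For context on why the timeout is so costly: CPython 2's Condition.wait (which Event.wait and future-result waits use under the hood) spin-waits with short sleeps when given a timeout, but blocks directly on a lock when not. A rough way to see the difference (a sketch, meant for Python 2):

```python
import threading
import time

def bench_wait(timeout, reps=200):
    # Average time for a waiter to notice an Event set by another thread.
    total = 0.0
    for _ in range(reps):
        event = threading.Event()
        setter = threading.Thread(target=event.set)
        start = time.time()
        setter.start()       # sets the event almost immediately
        event.wait(timeout)  # with a timeout, Python 2 polls in a sleep loop
        total += time.time() - start
        setter.join()
    return total / reps * 1000000  # microseconds per wait

print('wait()  ', bench_wait(None))  # blocks on a lock; wakes promptly
print('wait(30)', bench_wait(30))    # polls; wakeup latency is much higher
```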

Based on my recollection of some informal tests, I'd expected a much bigger win from using a threaded event loop. I decided to try to reproduce some earlier tests.

I used the following script:

```python
from ZODB.utils import z64, maxtid
import threading
import time
import ZEO

reps = 1000

a, s = ZEO.server()  # start an in-process ZEO server; a is its address

conn = ZEO.connection(a)
conn.root.x = 'a'*999
conn.transaction_manager.commit()

storage = conn.db().storage
expected = storage.loadBefore(z64, maxtid)

# Way 1: call across threads, as an application would.  storage._call
# hands the call to the I/O thread via call_soon_threadsafe and blocks
# for the result.
start = time.time()
for i in range(reps):
    assert storage._call('loadBefore', z64, maxtid) == expected
print('storage', (time.time() - start) / reps * 1000000)

runner = storage._server
loop = runner.loop
protocol = runner.client.protocol
event = threading.Event()

# Way 2: run entirely within the I/O loop, chaining each call from the
# previous call's done callback.  The decorator schedules run() on the
# loop via call_soon_threadsafe (once, not per call).
@loop.call_soon_threadsafe
def run():
    i = [reps]
    start = time.time()
    def done(f):
        if f is not None:
            assert f.result() == expected

        if i[0] > 0:  # issue exactly reps calls
            i[0] -= 1
            f = protocol.load_before(z64, maxtid)
            f.add_done_callback(done)
        else:
            print('async', (time.time() - start) / reps * 1000000)
            event.set()

    done(None)

event.wait()
```

This script calls loadBefore on the server 1000 times consecutively in 2 ways (both of which bypass the ZEO client cache):

  1. It calls across threads using call_soon_threadsafe. (In the script above, this is a consequence of calling storage._call.)
  2. It calls loadBefore from within the I/O loop by recursively applying futures.

Here are best-of-three results (times in microseconds per call):

| Test          | asyncio | asyncio w/ uvloop | threaded (uvloop server) |
|---------------|--------:|------------------:|-------------------------:|
| cross-thread  | 213     | 120               | 121                      |
| within thread | 152     | 98                | 102                      |

And using byteserver:

| Test          | asyncio | asyncio w/ uvloop | threaded |
|---------------|--------:|------------------:|---------:|
| cross-thread  | 152     | 115               | 80       |
| within thread | 98      | 72                | 61       |

So the benefit of the threaded event loop is smaller than I remembered.

Note that these tests were done with Python 3.5.

  • There is a performance benefit to using synchronous I/O on the client.

    • A secondary benefit is that this would allow us to stop using Trollius on the client.

      Note that byteserver can also serve Python 2 clients, which would take Trollius out of the picture on the server side as well.

    • There does seem to be some cost associated with using call_soon_threadsafe; it's not as large as I feared, but it's still likely worth avoiding.

  • In looking at this, I found a huge performance fix for Python 2.

    This needs a bit more thought. My quick hack was to get rid of some timeouts, but we need the timeouts, so I'll need to find a way to implement them more efficiently. (One possible approach is sketched at the end of these notes.)

  • As expected (based on some previous measurements), the new byteserver is a win. (I think it will be a bigger win when we test with high concurrency, as I think/hope it will scale better.)
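As promised above, one possible way to keep timeouts without paying the per-wait cost (a sketch under my own assumptions, not ZEO's eventual fix): let waiters block on Event.wait() with no timeout, and have a single watchdog thread enforce deadlines:

```python
import threading
import time

class Waiter(object):
    """One pending call.  The caller blocks on event.wait() with no timeout."""
    def __init__(self):
        self.event = threading.Event()
        self.timed_out = False

class Watchdog(threading.Thread):
    """A single coarse timer for all waiters, so individual waits can use
    the cheap no-timeout form of Event.wait()."""

    def __init__(self, resolution=0.1):
        threading.Thread.__init__(self)
        self.daemon = True
        self.resolution = resolution
        self.lock = threading.Lock()
        self.pending = []  # list of (deadline, waiter)

    def add(self, waiter, timeout):
        with self.lock:
            self.pending.append((time.time() + timeout, waiter))

    def run(self):
        while True:
            time.sleep(self.resolution)  # one coarse timer for everyone
            now = time.time()
            with self.lock:
                expired = [w for d, w in self.pending if d <= now]
                self.pending = [(d, w) for d, w in self.pending if d > now]
            for waiter in expired:
                waiter.timed_out = True
                waiter.event.set()  # wake the blocked caller with a flag set
            # A real implementation would also remove waiters on normal
            # completion; here a completed waiter just expires harmlessly.

# Usage sketch:
# watchdog = Watchdog(); watchdog.start()
# waiter = Waiter(); watchdog.add(waiter, timeout=30)
# ... arrange for the I/O thread to call waiter.event.set() on success ...
# waiter.event.wait()               # no timeout: cheap on Python 2
# if waiter.timed_out:
#     raise Exception('timed out')
```

The trade-off is that timeout resolution is limited to the watchdog's polling interval, but timeouts are an error path, so coarse deadlines should be acceptable.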