Async vs Threaded Client

Notes on comparing async vs threaded ZEO clients

Sun Oct 23 2016

Based on some early measurements I'd made, I'd gotten the impression that there was significant overhead in asyncio's call_soon_threadsafe mechanism. This is important, because:

  • A common server architecture uses an asynchronous I/O library for I/O and a thread pool for blocking operations, such as computation or blocking database access.
  • ZEO clients use a separate thread for interacting with ZEO servers, and application requests communicate with the I/O thread using call_soon_threadsafe.
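To make that pattern concrete, here is a minimal sketch (the call_in_io_thread helper is hypothetical, not ZEO's actual code) of an application thread handing work to an I/O thread with call_soon_threadsafe and blocking on a concurrent.futures.Future for the reply:

```python
import asyncio
import concurrent.futures
import threading

# An event loop running in a dedicated I/O thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

def call_in_io_thread(func, *args):
    """Run func(*args) on the I/O thread; block the caller until it's done."""
    future = concurrent.futures.Future()

    def run():
        try:
            future.set_result(func(*args))
        except Exception as exc:
            future.set_exception(exc)

    # This is the cross-thread handoff in question: it appends `run` to the
    # loop's queue and wakes the loop by writing to a socket.
    loop.call_soon_threadsafe(run)
    return future.result()  # the application thread blocks here

print(call_in_io_thread(lambda: 6 * 7))  # -> 42
```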

First, I did a quick hack to (mostly) implement an asyncio event loop using synchronous I/O and threads. :)

With the standard asyncio event loop, call_soon_threadsafe works by adding call information to a queue and waking an event loop by writing to a socket. With the threaded event loop, a simple lock is used.
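The hack itself isn't reproduced here, but the idea can be sketched (illustrative only): with a "loop" built on synchronous I/O and threads there is no selector to wake, so a cross-thread call reduces to taking a lock:

```python
import threading

class ThreadedLoopSketch:
    """Illustrative only: in a loop built on synchronous I/O and threads,
    cross-thread calls need no wakeup socket."""

    def __init__(self):
        self._lock = threading.Lock()  # serializes work with the I/O code

    def call_soon_threadsafe(self, callback, *args):
        # No queue append, no socket write, no extra syscall: just take
        # the lock and run the callback in the calling thread.
        with self._lock:
            callback(*args)
```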

I then ran a benchmark script with and without the threaded event loop. Because I was focused on client performance, I used a single client and otherwise default options. The best-of-3 results on my Mac (2.3 GHz Core i7, four cores) for Python 3.5:

| Test     | Asyncore | Threaded |
|----------|---------:|---------:|
| add      | 2779     | 2308     |
| update   | 2729     | 2488     |
| cached   | 189      | 187      |
| read     | 1167     | 799      |
| prefetch | 839      | 550      |

Values are times per transaction in microseconds.

I also tried this with uvloop, which is quite a bit faster than the standard event loop, and using the experimental byteserver:

| Test     | Asyncore | Threaded | Threaded+byteserver |
|----------|---------:|---------:|--------------------:|
| add      | 2612     | 2514     | 1545                |
| update   | 2693     | 3330     | 1504                |
| cached   | 190      | 193      | 191                 |
| read     | 896      | 762      | 637                 |
| prefetch | 565      | 566      | 540                 |

I also tested with Python 2.7:

| Test     | Asyncore | Threaded | Threaded+byteserver |
|----------|---------:|---------:|--------------------:|
| add      | 16519    | 6778     | 6130                |
| update   | 13231    | 4912     | 4448                |
| cached   | 177      | 182      | 178                 |
| read     | 5045     | 5160     | 5373                |
| prefetch | 1897     | 2104     | 663                 |

Some things to note:

  • Cached times are meaningless, as they don't touch the network at all. They are merely included for completeness.
  • The threaded results in the uvloop table used uvloop on the server.
  • Update times can be highly variable, for some reason.
  • I'm most interested in read times. Add, update, and prefetch operations all involve a lot of asynchronous network messages, which should make them less sensitive to the impact of call_soon_threadsafe. Also, read operations are a lot more common.

Using synchronous I/O is significantly faster for Python 3.5, although the advantage shrinks when uvloop is used. This suggests some benefit in pursuing the threaded event loop.

The results for Python 2.7 were more mixed and rather surprising. I wasn't sure what to make of these results. Trollius has been a somewhat problematic dependency, so removing it as a client dependency by using the threaded event loop might be a big win in and of itself.

The extreme slowness with threaded Python 2.7 was driving me nuts.

So, after some digging, I discovered that the slowdown was due to waiting on futures with a timeout, which is wildly expensive in Python 2. If, when waiting for results, I don't supply a timeout, I get much better results:

| Test     | Asyncore | Threaded | Threaded+byteserver |
|----------|---------:|---------:|--------------------:|
| add      | 3002     | 2278     | 1574                |
| update   | 3204     | 2759     | 1490                |
| cached   | 178      | 178      | 176                 |
| read     | 1314     | 896      | 717                 |
| prefetch | 912      | 717      | 585                 |
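For context on why the timeout is so costly: CPython 2's Condition.wait (which Event.wait and future-result waits use under the hood) spin-waits with short sleeps when given a timeout, but blocks directly on a lock when not. A rough way to see the difference (a sketch, meant for Python 2):

```python
import threading
import time

def bench_wait(timeout, reps=200):
    # Average time for a waiter to notice an Event set by another thread.
    total = 0.0
    for _ in range(reps):
        event = threading.Event()
        setter = threading.Thread(target=event.set)
        start = time.time()
        setter.start()       # sets the event almost immediately
        event.wait(timeout)  # with a timeout, Python 2 polls in a sleep loop
        total += time.time() - start
        setter.join()
    return total / reps * 1000000  # microseconds per wait

print('wait()  ', bench_wait(None))  # blocks on a lock; wakes promptly
print('wait(30)', bench_wait(30))    # polls; wakeup latency is much higher
```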

Based on my recollection of some informal tests, I'd expected a much bigger win from using a threaded event loop. I decided to try to reproduce some earlier tests.

I used the following script:

```python
from ZODB.utils import z64, maxtid
import threading
import time
import ZEO

reps = 1000

a, s = ZEO.server()  # start an in-process ZEO server; a is its address

conn = ZEO.connection(a)
conn.root.x = 'a'*999
conn.transaction_manager.commit()

storage = conn.db().storage
expected = storage.loadBefore(z64, maxtid)

# Way 1: call across threads, as an application would.  storage._call
# hands the call to the I/O thread via call_soon_threadsafe and blocks
# for the result.
start = time.time()
for i in range(reps):
    assert storage._call('loadBefore', z64, maxtid) == expected
print('storage', (time.time() - start) / reps * 1000000)

runner = storage._server
loop = runner.loop
protocol = runner.client.protocol
event = threading.Event()

# Way 2: run entirely within the I/O loop, chaining each call from the
# previous call's done callback.  The decorator schedules run() on the
# loop via call_soon_threadsafe (once, not per call).
@loop.call_soon_threadsafe
def run():
    i = [reps]
    start = time.time()
    def done(f):
        if f is not None:
            assert f.result() == expected

        if i[0] > 0:  # issue exactly reps calls
            i[0] -= 1
            f = protocol.load_before(z64, maxtid)
            f.add_done_callback(done)
        else:
            print('async', (time.time() - start) / reps * 1000000)
            event.set()

    done(None)

event.wait()
```

This script calls loadBefore on the server 1000 times consecutively in 2 ways (both of which bypass the ZEO client cache):

  1. It calls across threads using call_soon_threadsafe. (In the script above, this is a consequence of calling storage._call.)
  2. It calls loadBefore from within the I/O loop by recursively applying futures.

Here are best-of-three results (times in microseconds per call):

| Test          | asyncio | asyncio w/ uvloop | threaded (uvloop server) |
|---------------|--------:|------------------:|-------------------------:|
| cross-thread  | 213     | 120               | 121                      |
| within thread | 152     | 98                | 102                      |

And using byteserver:

| Test          | asyncio | asyncio w/ uvloop | threaded |
|---------------|--------:|------------------:|---------:|
| cross-thread  | 152     | 115               | 80       |
| within thread | 98      | 72                | 61       |

So the benefit of the threaded event loop is smaller than I remembered.

Note that these tests were done with Python 3.5.

  • There is a performance benefit to using synchronous I/O on the client.

    • A secondary benefit is that this would allow us to stop using Trollius on the client.

      Note that byteserver can also serve Python 2 clients, which would take Trollius out of the picture on the server side as well.

    • There does seem to be some cost associated with using call_soon_threadsafe; it's not as large as I feared, but it's still likely worth avoiding.

  • In looking at this, I found a huge performance fix for Python 2.

    This needs a bit more thought. My quick hack was to get rid of some timeouts, but we need the timeouts, so I'll need to find a way to implement them more efficiently. (One possible approach is sketched at the end of these notes.)

  • As expected (based on some previous measurements), the new byteserver is a win. (I think it will be a bigger win when we test with high concurrency, as I think/hope it will scale better.)
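As promised above, one possible way to keep timeouts without paying the per-wait cost (a sketch under my own assumptions, not ZEO's eventual fix): let waiters block on Event.wait() with no timeout, and have a single watchdog thread enforce deadlines:

```python
import threading
import time

class Waiter(object):
    """One pending call.  The caller blocks on event.wait() with no timeout."""
    def __init__(self):
        self.event = threading.Event()
        self.timed_out = False

class Watchdog(threading.Thread):
    """A single coarse timer for all waiters, so individual waits can use
    the cheap no-timeout form of Event.wait()."""

    def __init__(self, resolution=0.1):
        threading.Thread.__init__(self)
        self.daemon = True
        self.resolution = resolution
        self.lock = threading.Lock()
        self.pending = []  # list of (deadline, waiter)

    def add(self, waiter, timeout):
        with self.lock:
            self.pending.append((time.time() + timeout, waiter))

    def run(self):
        while True:
            time.sleep(self.resolution)  # one coarse timer for everyone
            now = time.time()
            with self.lock:
                expired = [w for d, w in self.pending if d <= now]
                self.pending = [(d, w) for d, w in self.pending if d > now]
            for waiter in expired:
                waiter.timed_out = True
                waiter.event.set()  # wake the blocked caller with a flag set
            # A real implementation would also remove waiters on normal
            # completion; here a completed waiter just expires harmlessly.

# Usage sketch:
# watchdog = Watchdog(); watchdog.start()
# waiter = Waiter(); watchdog.add(waiter, timeout=30)
# ... arrange for the I/O thread to call waiter.event.set() on success ...
# waiter.event.wait()               # no timeout: cheap on Python 2
# if waiter.timed_out:
#     raise Exception('timed out')
```

The trade-off is that timeout resolution is limited to the watchdog's polling interval, but timeouts are an error path, so coarse deadlines should be acceptable.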