Async vs Threaded Client
Sun Oct 23 2016
Based on some early measurements I'd made, I'd gotten the impression that there was significant overhead in asyncio's `call_soon_threadsafe` mechanism. This is important, because:
- A common server architecture uses an asynchronous I/O library for I/O and a thread pool for blocking operations, such as computation or blocking database access.
- ZEO clients use a separate thread for interacting with ZEO servers, and application requests communicate with the I/O thread using `call_soon_threadsafe`.
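The first bullet describes the familiar loop-plus-thread-pool pattern. A minimal sketch in modern asyncio spelling (the `blocking_query` function is a made-up stand-in for blocking work, not anything from ZEO):

```python
import asyncio
import time

def blocking_query():
    # Stand-in for a blocking operation such as a synchronous
    # database call; run directly, it would stall the event loop.
    time.sleep(0.01)
    return 'row'

async def handle_request():
    loop = asyncio.get_running_loop()
    # Hand the blocking call to the default thread pool so the
    # event loop stays free to service other I/O.
    return await loop.run_in_executor(None, blocking_query)

print(asyncio.run(handle_request()))
```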
First, I did a quick hack to (mostly) implement an asyncio event loop using synchronous I/O and threads. :)
With the standard asyncio event loop, `call_soon_threadsafe` works by adding call information to a queue and waking the event loop by writing to a socket. With the threaded event loop, a simple lock is used.
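To make the mechanism concrete, here's a minimal cross-thread scheduling example (modern asyncio spelling; the worker thread and `results` list are mine, for illustration only):

```python
import asyncio
import threading

results = []

def worker(loop, done):
    # Runs in a plain OS thread.  call_soon_threadsafe puts each
    # callback on the loop's queue and wakes the loop by writing a
    # byte to its internal self-pipe socket -- the overhead being
    # discussed here.
    loop.call_soon_threadsafe(results.append, 'from-worker')
    loop.call_soon_threadsafe(done.set)

async def main():
    loop = asyncio.get_running_loop()
    done = asyncio.Event()
    threading.Thread(target=worker, args=(loop, done)).start()
    await done.wait()  # released once the worker's callbacks have run

asyncio.run(main())
print(results)
```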
I then ran a benchmark script with and without the threaded event loop. Because I was focused on client performance, I used a single client and otherwise default options. The best-of-3 results on my mac (2.3 GHz Core i7, four cores) for Python 3.5:
Test | Asyncore | Threaded |
---|---|---|
add | 2779 | 2308 |
update | 2729 | 2488 |
cached | 189 | 187 |
read | 1167 | 799 |
prefetch | 839 | 550 |
Values are times per transaction in microseconds.
I also tried this with uvloop, which is quite a bit faster than the standard event loop, and using the experimental byteserver:
Test | Asyncore | Threaded | Threaded+byteserver |
---|---|---|---|
add | 2612 | 2514 | 1545 |
update | 2693 | 3330 | 1504 |
cached | 190 | 193 | 191 |
read | 896 | 762 | 637 |
prefetch | 565 | 566 | 540 |
I also tested with Python 2.7:
Test | Asyncore | Threaded | Threaded+byteserver |
---|---|---|---|
add | 16519 | 6778 | 6130 |
update | 13231 | 4912 | 4448 |
cached | 177 | 182 | 178 |
read | 5045 | 5160 | 5373 |
prefetch | 1897 | 2104 | 663 |
Some things to note:
- Cached times are meaningless, as they don't touch the network at all. They are merely included for completeness.
- The threaded results in the uvloop table used uvloop on the server.
- Update times can be highly variable, for some reason.
- I'm most interested in read times. Add, update, and prefetch operations all involve a lot of asynchronous network messages, which should be less sensitive to the impact of `call_soon_threadsafe`. Also, read operations are a lot more common.
Using synchronous I/O is significantly faster for Python 3.5, although the advantage is smaller when uvloop is used. This suggests some benefit in pursuing the threaded event loop.
The results for Python 2.7 were more mixed and rather surprising. I wasn't sure what to make of these results. Trollius has been a somewhat problematic dependency, so removing it as a client dependency by using the threaded event loop might be a big win in and of itself.
The extreme slowness with threaded Python 2.7 was driving me nuts.
So, after some digging, I discovered that the slowdown was due to waiting on futures with a timeout, which is wildly expensive in Python 2. If, when waiting for results, I don't supply a timeout, I get much better results:
Test | Asyncore | Threaded | Threaded+byteserver |
---|---|---|---|
add | 3002 | 2278 | 1574 |
update | 3204 | 2759 | 1490 |
cached | 178 | 178 | 176 |
read | 1314 | 896 | 717 |
prefetch | 912 | 717 | 585 |
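The Python 2 cost comes from the way timed waits are implemented there: `threading.Condition.wait(timeout)` spins in a sleep/retry loop rather than blocking on the lock, and `Future.result(timeout=...)` inherits that cost. One hedged sketch of keeping deadlines without a per-call timeout is to wait untimed and let a single watchdog thread fail overdue futures (the `Watchdog` class and its 100 ms tick are hypothetical, not ZEO code):

```python
import threading
import time
from concurrent.futures import Future

class Watchdog:
    """One coarse timer for all waiters, so callers can use the
    cheap untimed Future.result()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []  # list of (deadline, future) pairs
        t = threading.Thread(target=self._run)
        t.daemon = True
        t.start()

    def watch(self, future, timeout):
        with self._lock:
            self._pending.append((time.time() + timeout, future))

    def _run(self):
        while True:
            time.sleep(0.1)  # single coarse tick for every waiter
            now = time.time()
            with self._lock:
                due = [f for d, f in self._pending if d <= now]
                self._pending = [
                    (d, f) for d, f in self._pending if d > now]
            for f in due:
                # Real code would need to handle the race with a
                # future that completes at the same moment.
                if not f.done():
                    f.set_exception(RuntimeError('timed out'))

watchdog = Watchdog()

f = Future()
watchdog.watch(f, timeout=5.0)
f.set_result(42)
print(f.result())  # cheap untimed wait; the watchdog owns deadlines
```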
Based on my recollection of some informal tests, I'd expected a much bigger win from using a threaded event loop. I decided to try to reproduce some earlier tests.
I used the following script:
```python
from ZODB.utils import z64, maxtid
import threading
import time
import ZEO

reps = 1000

a, s = ZEO.server()
conn = ZEO.connection(a)
conn.root.x = 'a'*999
conn.transaction_manager.commit()
storage = conn.db().storage
expected = storage.loadBefore(z64, maxtid)

start = time.time()
for i in range(reps):
    assert storage._call('loadBefore', z64, maxtid) == expected
print('storage', (time.time() - start) / reps * 1000000)

runner = storage._server
loop = runner.loop
protocol = runner.client.protocol
event = threading.Event()

@loop.call_soon_threadsafe
def run():
    i = [reps]
    start = time.time()

    def done(f):
        if f is not None:
            assert f.result() == expected
        if i[0] >= 0:
            i[0] -= 1
            f = protocol.load_before(z64, maxtid)
            f.add_done_callback(done)
        else:
            print('async', (time.time() - start) / reps * 1000000)
            event.set()

    done(None)

event.wait()
```
This script calls `loadBefore` on the server 1000 times consecutively in two ways (both of which bypass the ZEO client cache):
- It calls across threads using `call_soon_threadsafe`. (In the script above, this is a consequence of calling `storage._call`.)
- It calls `loadBefore` from within the I/O loop by recursively applying futures.
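The same two modes can be sketched with plain asyncio and no ZEO at all: chaining `call_soon` callbacks inside the loop versus round-tripping through `call_soon_threadsafe` from another thread. This is a rough, illustrative analogue (`REPS` and the output format are mine); its timings will differ from the tables below:

```python
import asyncio
import threading
import time

REPS = 1000

def cross_thread(loop, out):
    # Each iteration pays call_soon_threadsafe's queue-plus-wakeup cost.
    start = time.time()
    for _ in range(REPS):
        event = threading.Event()
        loop.call_soon_threadsafe(event.set)
        event.wait()
    out.append((time.time() - start) / REPS * 1e6)

async def main():
    loop = asyncio.get_running_loop()

    # Within-loop: recursively chained callbacks, like the
    # load_before/add_done_callback chain in the script above.
    start = time.time()
    remaining = [REPS]
    done = asyncio.Event()

    def step():
        if remaining[0]:
            remaining[0] -= 1
            loop.call_soon(step)
        else:
            done.set()

    step()
    await done.wait()
    within = (time.time() - start) / REPS * 1e6

    out = []
    t = threading.Thread(target=cross_thread, args=(loop, out))
    t.start()
    # Keep the loop running while the worker thread round-trips.
    while not out:
        await asyncio.sleep(0.01)
    t.join()
    return within, out[0]

within_us, cross_us = asyncio.run(main())
print('within thread %.1f us, cross-thread %.1f us' % (within_us, cross_us))
```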
Here are best-of-three results:
test | asyncio | asyncio w uvloop | threaded (uvloop server) |
---|---|---|---|
cross-thread | 213 | 120 | 121 |
within thread | 152 | 98 | 102 |
And using byteserver:
test | asyncio | asyncio w uvloop | threaded |
---|---|---|---|
cross-thread | 152 | 115 | 80 |
within thread | 98 | 72 | 61 |
So the benefit of the threaded event loop is less than I remembered.
Note that these tests were done with Python 3.5.
- There is a performance benefit to using synchronous I/O on the client.
- A secondary benefit is that this would allow us to stop using Trollius on the client. Note that when byteserver is used, it can be used to serve Python 2, which will take Trollius out of the picture.
- While there does seem to be some cost associated with using `call_soon_threadsafe`, it's not as large as I feared, but it's still likely worth avoiding.
- In looking at this, I found a huge performance fix for Python 2. This needs a bit more thought. My quick hack was to get rid of some timeouts, but we need the timeouts, so I'll need to find a way to implement them more efficiently.
- As expected (based on some previous measurements), the new byteserver is a win. (I think it will be a bigger win when we test with high concurrency, as I think/hope it will scale better.)