
Use io_uring to batch handle clients' pending writes to reduce SYSCALL count. #112

Open · wants to merge 6 commits into base: unstable

Conversation

lipzhu
Contributor

@lipzhu lipzhu commented Apr 1, 2024

Description

This patch uses the io_uring batching feature to reduce the SYSCALL count in valkey's handleClientsWithPendingWrites (a minimal sketch of the batching idea follows the list below).
With this patch, we can observe more than a 6% perf gain for SET/GET.
The patch was implemented based on the discussion below during the review:

  1. Introduce an io_uring.h to hold the io_uring related API and split it from the server logic.
  2. Make io_uring.h independent of server.h.
  3. Only use io_uring to gain performance when writing the client static buffer.
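
For illustration, here is a minimal sketch of the batching idea, assuming an illustrative client struct and function names; only the liburing calls are real API, and this is not the PR's actual code:

/* Sketch: queue one send per client with pending output in its static
 * buffer, then submit the whole batch with a single io_uring_enter()
 * instead of one write() syscall per client. */
#include <liburing.h>

#define QUEUE_DEPTH 256

typedef struct client {
    int fd;
    char buf[16 * 1024]; /* static reply buffer */
    int bufpos;          /* bytes pending in buf */
    int sentlen;         /* bytes already written */
} client;

static struct io_uring ring;

int ioUringInit(void) {
    /* Plain ring, no SQPOLL flag: no kernel-side busy-polling thread. */
    return io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
}

void batchWritePendingClients(client **clients, int nclients) {
    int queued = 0;
    for (int i = 0; i < nclients; i++) {
        client *c = clients[i];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        if (!sqe) break; /* ring full: remaining clients fall back to write() */
        io_uring_prep_send(sqe, c->fd, c->buf + c->sentlen, c->bufpos - c->sentlen, 0);
        io_uring_sqe_set_data(sqe, c);
        queued++;
    }
    if (queued == 0) return;

    /* One syscall submits and waits for all queued writes. */
    io_uring_submit_and_wait(&ring, queued);

    struct io_uring_cqe *cqe;
    unsigned head, seen = 0;
    io_uring_for_each_cqe(&ring, head, cqe) {
        client *done = io_uring_cqe_get_data(cqe);
        if (cqe->res > 0) done->sentlen += cqe->res;
        /* A negative cqe->res is -errno; real code handles short and failed writes. */
        seen++;
    }
    io_uring_cq_advance(&ring, seen);
}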

Benchmark Result

Test Env

  • OPERATING SYSTEM: Ubuntu 22.04.4 LTS
  • Kernel: 5.15.0-116-generic
  • PROCESSOR: Intel Xeon Platinum 8380
  • Base: 5b9fc46
  • Server and client run on the same CPU socket.

Test Steps

  1. Start valkey-server with the config below.
taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
  2. Start valkey-benchmark so that valkey-server's CPU utilization is 1 (fully utilized).
taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4

Test Result

QPS of SET and GET increases by 6.5% and 6.6% respectively.

Perf Stat

The perf stat output shows that only 1 CPU was used during the test and that IPC also increased by 6%, so the gain does not come from more CPU resources.

perf stat -p `pidof valkey-server` sleep 10

# w/o io_uring
 Performance counter stats for process id '2267781':

          9,993.95 msec task-clock                #    0.999 CPUs utilized
               625      context-switches          #   62.538 /sec
                 0      cpu-migrations            #    0.000 /sec
            94,933      page-faults               #    9.499 K/sec
    33,894,880,825      cycles                    #    3.392 GHz
    39,284,579,699      instructions              #    1.16  insn per cycle
     7,750,350,988      branches                  #  775.504 M/sec
        73,791,242      branch-misses             #    0.95% of all branches
   169,474,584,465      slots                     #   16.958 G/sec
    39,212,071,735      topdown-retiring          #     23.1% retiring
    11,962,902,869      topdown-bad-spec          #      7.1% bad speculation
    43,199,367,984      topdown-fe-bound          #     25.5% frontend bound
    75,159,711,305      topdown-be-bound          #     44.3% backend bound

      10.001262795 seconds time elapsed

# w/ io_uring
 Performance counter stats for process id '2273716':

          9,970.38 msec task-clock                #    0.997 CPUs utilized
             1,077      context-switches          #  108.020 /sec
                 1      cpu-migrations            #    0.100 /sec
           124,080      page-faults               #   12.445 K/sec
    33,813,062,268      cycles                    #    3.391 GHz
    41,455,816,158      instructions              #    1.23  insn per cycle
     8,063,017,730      branches                  #  808.697 M/sec
        68,008,453      branch-misses             #    0.84% of all branches
   169,066,451,360      slots                     #   16.957 G/sec
    38,077,547,648      topdown-retiring          #     22.0% retiring
    28,509,121,765      topdown-bad-spec          #     16.5% bad speculation
    41,083,738,441      topdown-fe-bound          #     23.8% frontend bound
    65,062,545,805      topdown-be-bound          #     37.7% backend bound

      10.001785198 seconds time elapsed

NOTE

  • Since io_uring was introduced in kernel 5.1, if the kernel doesn't support io_uring, the original implementation is used.
  • This patch introduces a dependency on liburing, which is installed in my local environment. To keep things simple, I didn't include the liburing dependency in this patch, so the CI build may fail.

@zuiderkwast
Contributor

zuiderkwast commented Apr 2, 2024

If you merge latest unstable, the spellcheck is fixed.

Can you add a check for <liburing.h> in Makefile, something like this:

HAS_LIBURING := $(shell sh -c 'echo "$(NUMBER_SIGN_CHAR)include <liburing.h>" > foo.c; \
	$(CC) -E foo.c > /dev/null 2>&1 && echo yes; \
	rm foo.c')
ifeq ($(HAS_LIBURING),yes)
	...
else
	...
endif

@PingXie
Member

PingXie commented Apr 15, 2024

I have not taken a closer look at the PR, but at a minimum I think we need a config to opt in to io_uring.

I also suspect that the gain is likely coming from tapping into additional cores that couldn't be utilized efficiently by the current io-threads. If so, this leads to two questions:

  1. On lower-spec machines with fewer cores, say 2 or 4, will we see similar improvements?

  2. Where do we see io_uring's place in light of the planned multi-threading improvements? That work would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, leaving not much room for io_uring.

Lastly, I haven't checked recently, but in the past io_uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.

@lipzhu
Contributor Author

lipzhu commented Apr 15, 2024

  1. On lower-spec machines with fewer cores, say 2 or 4, will we see similar improvements?

The number of cores will not affect the perf gain. You can refer to the server config I posted in the top comment: io-threads is disabled, so the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As described in the top comment, the gain comes from the reduced SYSCALL count.

  2. Where do we see io_uring's place in light of the planned multi-threading improvements? That work would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, leaving not much room for io_uring.

Could you share more context about the planned multi-threading improvements from the community?

Lastly, I haven't checked recently, but in the past io_uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.

I am not a security expert. Can you give more details about your concern? Will this be a blocker for the community to adopt io_uring?

@PingXie
Member

PingXie commented Apr 15, 2024

The number of cores will not affect the perf gain. You can refer to the server config I posted in the top comment: io-threads is disabled, so the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As described in the top comment, the gain comes from the reduced SYSCALL count.

Start valkey-benchmark taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4 to ensure valkey-server CPU utilization is 1 (fully utilized).

io_uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that, or just the CPU cycles accumulated by the Valkey threads? Going back to your original post, it seems to indicate this is just the Valkey CPU usage. I think a more deterministic setup would be using a 2/4-core machine.

More context about the multi threading improvements from community?

#22

I am not a security expert, can you give more details about your concern

https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html

Will this be a blocker for the community to adopt io_uring?

I would say this is a serious concern for me. There are two possible outcomes:

  1. the vulns are mostly in the kernel and there isn't much application developers (us) can do.
  2. the vulns can be mitigated by changes in the application, which is Valkey here.

Outcome 1 would reduce the reach of this feature; outcome 2 would add more work for the Valkey team.

@lipzhu
Contributor Author

lipzhu commented Apr 16, 2024

io_uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that, or just the CPU cycles accumulated by the Valkey threads?

Actually, I didn't use the busy-polling mode of io_uring in this patch; no io_uring background threads are started, so all the cycles generated by io_uring are attributed to the Valkey server. As posted in the top comment, the perf gain comes from the batching feature described in https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#batching

Going back to your original post, it seems to indicate this is just the Valkey CPU usage. I think a more deterministic setup would be using a 2/4-core machine.

I can get a similar result with 2-4 CPUs allocated.

I would say this is a serious concern for me. There are two possible outcomes:

  1. the vulns are mostly in the kernel and there isn't much application developers (us) can do.

  2. the vulns can be mitigated by changes in the application, which is Valkey here.

Outcome 1 would reduce the reach of this feature; outcome 2 would add more work for the Valkey team.

I just glanced at the vulnerability list (https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring); it seems most of the fixes happened on the kernel side, and there isn't much application developers can do. I think we can keep supporting io_uring, and users can disable it if they want.

@zuiderkwast
Contributor

I think this is really good stuff. Performance is one of the areas we should prioritize IMO.

@lipzhu
Contributor Author

lipzhu commented Apr 17, 2024

I think this is really good stuff. Performance is one of the areas we should prioritize IMO.

Thanks @zuiderkwast, the question is how we can push this patch forward?

@zuiderkwast
Contributor

@lipzhu We are a new team, a new project, on our first release, and we are still busy rebranding from Redis to Valkey and getting a website up. I think you just need some patience so that other team members have time to look at this and think about it.

I'll add it to the backlog for Valkey 8. It will not be forgotten.

@PingXie
Member

PingXie commented Apr 18, 2024

Yeah, this kind of change requires the reviewers to block off a decent amount of time and really think through it holistically. This week is really busy for the team as many of us are having in-person engagements with the OSS community. We really appreciate your patience.

@Wenwen-Chen
Contributor

@lipzhu

Description

This patch uses the io_uring batching feature to reduce the SYSCALL count in Valkey's handleClientsWithPendingWrites. With this patch, we can observe more than a 4% perf gain for SET/GET, and didn't see an obvious performance regression.

As far as I know, io_uring is a highly efficient IO engine.
Do you have any plans to optimize Valkey's other modules with io_uring?
For example, the ae framework or snapshot operations.

@lipzhu
Contributor Author

lipzhu commented Apr 25, 2024

As far as I know, io_uring is a highly efficient IO engine. Do you have any plans to optimize Valkey's other modules with io_uring? For example, the ae framework or snapshot operations.

Sure, but when we first decided to introduce io_uring, we wanted to find the scenarios where io_uring really helps performance, and this patch is the most straightforward one.
Another scenario I came up with is disk-related operations; I opened #255 to work out the details.
For the ae framework, it needs a lot of work to replace the epoll and sync workflow, per my understanding. I did a POC before and could observe a performance gain, but the cost was more CPU resources allocated.
So I want to integrate io_uring incrementally, and I also need help from the community; as you know, they are currently busy rebranding :)

@PingXie
Member

PingXie commented Apr 26, 2024

Thanks @lipzhu!

I am generally aligned with the high level idea (and good to know that you don't use polling).

I do have some high-level feedback on the code structure, which I will list here:

  1. we should go with an opt-in approach and keep io_uring off by default
  2. I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation.
  3. the io_uring support seems incomplete: we are missing support for scatter/gather IOs; also, I'm not sure about the rationale behind excluding the replication stream.

BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.

@PingXie
Member

PingXie commented Apr 27, 2024

@lipzhu, looking at your results above, the number of read calls jumps out too. It would be great if you could apply io_uring to the query path as well.

[screenshot: syscall counts]

@lipzhu
Contributor Author

lipzhu commented Apr 28, 2024

@PingXie Thanks for your comments :)

@lipzhu, looking at your results above, the number of read calls jumps out too.

[screenshot: syscall counts]

The counter data is collected over a fixed duration (10s), and each query pairs with one readQueryFromClient call, so I think the increase in the read SYSCALL count makes sense because the QPS increased too.

It would be great if you could apply io_uring to the query path as well.

@PingXie I have done this before, but I found some issues:

  1. I didn't find a batching opportunity on the read query path: if I simply use io_uring_prep_read followed by io_uring_submit_and_wait to replace the read, the SYSCALL count doesn't drop, and io_uring_enter is more expensive than read (see the sketch after this list).
  2. I prefer a small PR; each PR should focus on only one thing.
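
To make point 1 concrete, here is a minimal sketch of the naive 1:1 replacement described above (illustrative names, apart from the liburing calls): each query still needs its own submit, so io_uring_enter simply replaces read one-for-one and nothing is batched.

#include <liburing.h>
#include <sys/types.h>

/* Sketch only: one SQE per readQueryFromClient call means one
 * io_uring_enter() per query, so the syscall count does not drop. */
ssize_t readQueryViaIOUring(struct io_uring *ring, int fd, void *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;
    io_uring_prep_read(sqe, fd, buf, len, 0);
    io_uring_submit_and_wait(ring, 1); /* still one syscall per query */

    struct io_uring_cqe *cqe;
    if (io_uring_peek_cqe(ring, &cqe) != 0) return -1;
    ssize_t nread = cqe->res;          /* negative value is -errno */
    io_uring_cqe_seen(ring, cqe);
    return nread;
}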

@lipzhu
Contributor Author

lipzhu commented Apr 28, 2024

Thanks @lipzhu!

I am generally aligned with the high level idea (and good to know that you don't use polling).

I do have some high-level feedback on the code structure, which I will list here:

  1. we should go with an opt-in approach and keep io-uring off by default

Ok, I will introduce a new config like io-uring (default off) in valkey.conf.

  2. I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation.

I will refactor the code to explore a cleaner separation.

  3. the io_uring support seems incomplete: we are missing support for scatter/gather IOs; also, I'm not sure about the rationale behind excluding the replication stream.

The reason I didn't handle scatter/gather IOs is that, as I recall, I didn't observe a perf gain there with io_uring; I will double-check this later. For the replication stream, thanks for pointing it out, I will check it later.

BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.

Sure, thanks for your guidance and patience.


codecov bot commented May 21, 2024

Codecov Report

Attention: Patch coverage is 25.30120% with 62 lines in your changes missing coverage. Please review.

Project coverage is 70.53%. Comparing base (20d583f) to head (efc4fe4).

Files with missing lines Patch % Lines
src/networking.c 30.30% 46 Missing ⚠️
src/io_uring.c 0.00% 12 Missing ⚠️
src/server.c 20.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #112      +/-   ##
============================================
- Coverage     70.54%   70.53%   -0.02%     
============================================
  Files           114      115       +1     
  Lines         61644    61718      +74     
============================================
+ Hits          43488    43530      +42     
- Misses        18156    18188      +32     
Files with missing lines Coverage Δ
src/config.c 78.69% <ø> (ø)
src/server.h 100.00% <ø> (ø)
src/server.c 88.49% <20.00%> (-0.11%) ⬇️
src/io_uring.c 0.00% <0.00%> (ø)
src/networking.c 86.78% <30.30%> (-1.70%) ⬇️

... and 10 files with indirect coverage changes

@lipzhu
Contributor Author

lipzhu commented May 30, 2024

Update for this patch:

  1. Introduce a new config io_uring (yes|no) to let users decide whether to enable io_uring, and then validate whether the running system supports io_uring; when both conditions are met, take the io_uring code path, otherwise fall back (a rough sketch follows below).
  2. Separate the sync write path from io_uring; most of the logic moved to io_uring.c.
  3. Measured scatter/gather IOs and replica clients: I didn't observe a perf gain in those two scenarios with io_uring, only some regressions. A quick analysis shows this is because the write SYSCALL count is not high in those two scenarios, so the cycle ratio of the SYSCALL is not high either, and per my measurement a single io_uring_enter SYSCALL is more expensive than a write SYSCALL. I think it is not a good idea to use io_uring batch handling for those two scenarios.

I suggest using io_uring only for the client static buffer write in the first phase: most static response buffers are small, which makes the total syscall count high and the cycle ratio of the write SYSCALL correspondingly high, so the perf gain there is indeed significant.
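
As a rough illustration of item 1, the opt-in plus runtime probe could look something like the sketch below; the flag and function names here are placeholders, not necessarily what the patch uses.

#include <liburing.h>

/* Illustrative flag, set from an "io-uring yes|no" option in valkey.conf. */
static int config_io_uring_enabled = 0; /* opt-in: default off */
static int io_uring_usable = 0;

void ioUringSetupIfEnabled(void) {
    if (!config_io_uring_enabled) return;
    struct io_uring probe;
    /* io_uring_queue_init() fails on kernels without io_uring support
     * (pre-5.1), in which case we silently fall back to plain write(). */
    if (io_uring_queue_init(8, &probe, 0) == 0) {
        io_uring_queue_exit(&probe);
        io_uring_usable = 1;
    }
}

int canUseIOUring(void) {
    return io_uring_usable;
}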

@PingXie @zuiderkwast What do you think?

@lipzhu lipzhu force-pushed the io_uring branch 2 times, most recently from 958ff60 to 0e9afa9 on May 30, 2024 10:01
@lipzhu lipzhu requested a review from PingXie June 3, 2024 01:40
Contributor

@zuiderkwast zuiderkwast left a comment


I think this looks mostly good now. It has some refactorings that will conflict with the async IO threading feature, so I think we should merge the async IO threading first.

The error messages and log messages can probably be improved, but I will review those later.

@lipzhu
Contributor Author

lipzhu commented Jul 30, 2024

@PingXie @zuiderkwast I saw that the async IO threading is already merged. I rebased the io_uring batch optimization onto the unstable branch; shall we resume this pull request?

@lipzhu lipzhu force-pushed the io_uring branch 4 times, most recently from 82c669d to cbe6361 on July 31, 2024 04:55
---------

Signed-off-by: Lipeng Zhu <[email protected]>
Co-authored-by: Wangyang Guo <[email protected]>
@secwall
Contributor

secwall commented Aug 3, 2024

It seems that this change (even without uring enabled) breaks operation with multiple threads.
Just running valkey-benchmark against an instance with io-threads set to 4 makes valkey fail:

6966:M 03 Aug 2024 18:34:36.931 # === ASSERTION FAILED ===
6966:M 03 Aug 2024 18:34:36.931 # ==> io_threads.c:384 'c->clients_pending_write_node.prev == NULL && c->clients_pending_write_node.next == NULL' is not true

The reason is simple: trySendWriteToIOThreads expects ln to be already unlinked. A simple patch like this makes the benchmark pass:

--- a/src/networking.c
+++ b/src/networking.c
@@ -2540,14 +2540,18 @@ int handleClientsWithPendingWrites(void) {
         }

         /* If we can send the client to the I/O thread, let it handle the write. */
-        if (trySendWriteToIOThreads(c) == C_OK) {
+        if (server.io_threads_num > 1) {
             listUnlinkNode(server.clients_pending_write, ln);
-            continue;
+            if (trySendWriteToIOThreads(c) == C_OK) {
+                continue;
+            }
         }

         /* We can't write to the client while IO operation is in progress. */
         if (c->io_write_state != CLIENT_IDLE || c->io_read_state != CLIENT_IDLE) {
-            listUnlinkNode(server.clients_pending_write, ln);
+            if (server.io_threads_num == 1) {
+                listUnlinkNode(server.clients_pending_write, ln);
+            }
             continue;
         }

@@ -2559,7 +2563,9 @@ int handleClientsWithPendingWrites(void) {
                 continue;
             }
         } else {
-            listUnlinkNode(server.clients_pending_write, ln);
+            if (server.io_threads_num == 1) {
+                listUnlinkNode(server.clients_pending_write, ln);
+            }
             /* Try to write buffers to the client socket. */
             if (writeToClient(c) == C_ERR) continue;

The second issue: enabling both io_uring and TLS makes even a simple INFO from the CLI fail:

./src/valkey-cli --tls --cacert tests/tls/ca.crt
127.0.0.1:6379> info
Error: Success

It seems that we should not try to use io_uring for TLS-enabled clients, like this:

--- a/src/networking.c
+++ b/src/networking.c
@@ -2429,7 +2429,7 @@ int processIOThreadsWriteDone(void) {
 static inline int _canWriteUsingIOUring(client *c) {
     if (server.io_uring_enabled && server.io_threads_num == 1) {
         /* Currently, we only use io_uring to handle the static buffer write requests. */
-        return getClientType(c) != CLIENT_TYPE_REPLICA && listLength(c->reply) == 0 && c->bufpos > 0;
+        return connIsTLS(c->conn) == 0 && getClientType(c) != CLIENT_TYPE_REPLICA && listLength(c->reply) == 0 && c->bufpos > 0;
     }
     return 0;
 }

@lipzhu lipzhu force-pushed the io_uring branch 2 times, most recently from b3169ab to f6c6dd7 on August 5, 2024 09:47
@PingXie
Member

PingXie commented Aug 12, 2024

I saw that the async IO threading is already merged. I rebased the io_uring batch optimization onto the unstable branch; shall we resume this pull request?

@lipzhu - are the performance numbers in the PR description updated? If not, can you help re-benchmark the improvements? Let's make sure there is still a meaningful improvement with the async IO changes merged before diving into the code review.

@lipzhu
Contributor Author

lipzhu commented Aug 12, 2024

I saw that the async IO threading is already merged. I rebased the io_uring batch optimization onto the unstable branch; shall we resume this pull request?

@lipzhu - are the performance numbers in the PR description updated? If not, can you help re-benchmark the improvements? Let's make sure there is still a meaningful improvement with the async IO changes merged before diving into the code review.

@PingXie Thanks, I just updated the performance numbers in the top comment; we still get a 6% performance boost on the SET/GET benchmark.

@PingXie
Member

PingXie commented Aug 12, 2024

Thanks, @lipzhu!

Sorry I didn't make it clear earlier.

I don't think the current test setup (controlling CPU allocation via taskset) represents a real-world workload. The reason is that, with this test setup, the server can "steal" compute through io_uring from CPUs not explicitly allocated by taskset, while in the async IO case the server sticks to the CPUs allocated; therefore, the results are not apples-to-apples. In my opinion, for this test to be valid, we would need to separate the client and the server onto two different machines and allow the server to use all the CPUs for io-threading. Then we toggle io_uring on and off and compare the two sets of performance numbers.

@lipzhu
Contributor Author

lipzhu commented Aug 13, 2024

@PingXie I set up an environment that separates the server and client and double-checked the perf boost. Below is a brief summary of my local test env.
Both server and client have 8 CPUs (Intel(R) Xeon(R) Platinum 8380 CPU) enabled, and they are connected through an Ethernet Controller XXV710 NIC for 25GbE SFP28. Running the same commands without taskset, we observe a ~5% perf boost.

Start server.

~/valkey/src/valkey-server /tmp/valkey.conf

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""

Start client.

~/valkey/src/valkey-benchmark -h 192.168.2.1 -p 9001 -t set,get -d 100 -r 1000000 -n 5000000 -c 50 --threads 4

Signed-off-by: Lipeng Zhu <[email protected]>
@PingXie
Member

PingXie commented Aug 13, 2024

Both server and client have 8 CPUs (Intel(R) Xeon(R) Platinum 8380 CPU) enabled

Just a quick confirmation - there were 8 CPUs in total and 8 io-threads in these tests?

@lipzhu
Contributor Author

lipzhu commented Aug 13, 2024

Both server and client have 8 CPUs (Intel(R) Xeon(R) Platinum 8380 CPU) enabled

Just a quick confirmation - there were 8 CPUs in total and 8 io-threads in these tests?

Both server and client have 8 CPUs. I didn't enable io-threads for this test; I don't quite understand why io-threads should be enabled here, because this optimization only works for the main thread.

Signed-off-by: Lipeng Zhu <[email protected]>
@PingXie
Member

PingXie commented Aug 13, 2024

I didn't enable io-threads for this test; I don't quite understand why io-threads should be enabled here, because this optimization only works for the main thread.

Io-threading is important because both io-threading and io-uring are targeted at the same problem, which is how to better utilize the CPUs on the system. It is not a fair comparison when one test can use only one CPU (when io-uring is off) while the other can use other CPUs via io-uring.

In a broader sense, io-uring is essentially a more generic form of io-threading done in the kernel.

Do you mind trying out the tests one more time but with 8 io-threads?

@lipzhu
Contributor Author

lipzhu commented Aug 13, 2024

Io-threading is important because both io-threading and io-uring are targeted at the same problem, which is how to better utilize the CPUs on the system. It is not a fair comparison when one test can use only one CPU (when io-uring is off) while the other can use other CPUs via io-uring.

Actually, io_uring will not steal CPU resources in this scenario. Maybe you are thinking of the async worker offload that can be requested with IOSQE_ASYNC, but we don't use that feature.
As the title says, the perf boost mainly comes from the reduced number of write SYSCALLs.
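
For context, offloading a request to the kernel's async workers would require explicitly setting IOSQE_ASYNC on the SQE, roughly as in the sketch below (an illustrative helper, not code from the patch); since the patch never sets this flag, no extra kernel worker threads are involved in these writes.

#include <liburing.h>
#include <stddef.h>

/* Sketch only: forcing a send onto the kernel's io-wq worker pool. */
void sendViaKernelWorkers(struct io_uring *ring, int fd, const void *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, fd, buf, len, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC); /* offload to io-wq workers */
    io_uring_submit(ring);
}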

The perf stats also show that only 1 CPU was used during the test and that IPC increased by 6%, so the gain is not from more CPU resources.
Another experiment that can prove this: we started Valkey on a server that has only 1 CPU and ran the test there; not sure if this can dispel your concern.

perf stat -p `pidof valkey-server` sleep 10

# w/o io_uring
 Performance counter stats for process id '2267781':

          9,993.95 msec task-clock                #    0.999 CPUs utilized
               625      context-switches          #   62.538 /sec
                 0      cpu-migrations            #    0.000 /sec
            94,933      page-faults               #    9.499 K/sec
    33,894,880,825      cycles                    #    3.392 GHz
    39,284,579,699      instructions              #    1.16  insn per cycle
     7,750,350,988      branches                  #  775.504 M/sec
        73,791,242      branch-misses             #    0.95% of all branches
   169,474,584,465      slots                     #   16.958 G/sec
    39,212,071,735      topdown-retiring          #     23.1% retiring
    11,962,902,869      topdown-bad-spec          #      7.1% bad speculation
    43,199,367,984      topdown-fe-bound          #     25.5% frontend bound
    75,159,711,305      topdown-be-bound          #     44.3% backend bound

      10.001262795 seconds time elapsed

# w/ io_uring
 Performance counter stats for process id '2273716':

          9,970.38 msec task-clock                #    0.997 CPUs utilized
             1,077      context-switches          #  108.020 /sec
                 1      cpu-migrations            #    0.100 /sec
           124,080      page-faults               #   12.445 K/sec
    33,813,062,268      cycles                    #    3.391 GHz
    41,455,816,158      instructions              #    1.23  insn per cycle
     8,063,017,730      branches                  #  808.697 M/sec
        68,008,453      branch-misses             #    0.84% of all branches
   169,066,451,360      slots                     #   16.957 G/sec
    38,077,547,648      topdown-retiring          #     22.0% retiring
    28,509,121,765      topdown-bad-spec          #     16.5% bad speculation
    41,083,738,441      topdown-fe-bound          #     23.8% frontend bound
    65,062,545,805      topdown-be-bound          #     37.7% backend bound

      10.001785198 seconds time elapsed

@PingXie
Member

PingXie commented Aug 14, 2024

Maybe you are thinking of the async worker offload that can be requested with IOSQE_ASYNC, but we don't use that feature.

I, again, forgot this point. I am convinced by your test results. Will find time next to resume the code review :). Thanks a lot for your patience, @lipzhu!

@PingXie
Member

PingXie commented Aug 14, 2024

The perf stats also show that only 1 CPU was used during the test and that IPC increased by 6%, so the gain is not from more CPU resources.

Whenever you get a chance, can you incorporate these performance numbers along with your test setup into the PR description so they are more discoverable?

@lipzhu
Contributor Author

lipzhu commented Aug 14, 2024

Thanks @PingXie.

Maybe you are thinking of the async worker offload that can be requested with IOSQE_ASYNC, but we don't use that feature.

I, again, forgot this point. I am convinced by your test results. Will find time next to resume the code review :). Thanks a lot for your patience, @lipzhu!

Really appreciate your effort on this :).

Whenever you get a chance, can you incorporate these performance numbers along with your test setup into the PR description so they are more discoverable?

Done.

@lipzhu
Contributor Author

lipzhu commented Sep 9, 2024

Kindly ping @PingXie @zuiderkwast.

Signed-off-by: Lipeng Zhu <[email protected]>