Parallelism problem and read/write performance in ZNS mode #164

ZhuWeiLin0 opened this issue Nov 14, 2024 · 0 comments

ZhuWeiLin0 commented Nov 14, 2024

Hello,
I noticed that physical pages always land in Channel 0 and Channel 1 no matter how many channels I actually configure, because CH_BITS is set to 1. I think this hampers parallelism and read/write performance, so I changed CH_BITS to 3 and reduced rsv to 6 to utilize all 8 channels. But I ran into two very confusing problems.

#define CH_BITS     (1)

struct ppa {
    union {
        struct {
            uint64_t spg  : SPG_BITS; // sub page
            uint64_t pg   : PG_BITS;
            uint64_t blk  : BLK_BITS;
            uint64_t fc   : FC_BITS;
            uint64_t pl   : PL_BITS;
            uint64_t ch   : CH_BITS;
            uint64_t V    : 1;        // padding page or not
            uint64_t rsv  : 8;
        } g;

        uint64_t ppa;
    };
};
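
For clarity, here is the modified layout I tested; only the channel and reserved widths change, so the bitfield still sums to 64 bits:

#define CH_BITS     (3)

            // ... other fields unchanged ...
            uint64_t ch   : CH_BITS;  // now 3 bits, enough to address 8 channels
            uint64_t V    : 1;        // padding page or not
            uint64_t rsv  : 6;        // shrunk from 8 to 6 to keep the total at 64 bits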

My ZNS configuration is:
LOGICAL_PAGE_SIZE = ZNS_PAGE_SIZE = 4KB
8 channels, 4 chips/channel, 2 planes/chip , 32 blocks/plane
1 GB / zone, 32 zones in total.
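
(For reference, if my arithmetic is right and the zones exactly cover the flash: 8 channels × 4 chips × 2 planes × 32 blocks = 2048 blocks, and 32 zones × 1 GB = 32 GB total, so each block is 32 GB / 2048 = 16 MB, i.e. 4096 pages of 4 KB.)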

My first question is: why does improving parallelism seem to hamper read performance while helping write performance?

I used fio to test the read/write performance; my fio commands are:
fio --ioengine=psync --direct=1 --filename=/dev/nvme0n1 --rw=write --iodepth=16 --bs=32k --group_reporting --zonemode=zbd --name=seqwrite --offset_increment=0z --size=16z

fio --ioengine=psync --direct=1 --filename=/dev/nvme0n1 --rw=read --offset_increment=0z --size=2z --group_reporting --zonemode=zbd --bs=32k --name=seqread --numjobs=8

With CH_BITS=1, the read/write performance is shown below: write bandwidth is only 19.6 MB/s, while read bandwidth is 241 MB/s.

**seqwrite**: (g=0): rw=write, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=16
fio-3.38-4-gcd56
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
^Cbs: 1 (f=1): [W(1)][0.8%][w=20.0MiB/s][w=640 IOPS][eta 13m:56s]
fio: terminating on signal 2

seqwrite: (groupid=0, jobs=1): err= 0: pid=1111: Thu Nov 14 07:34:59 2024
  write: IOPS=626, BW=19.6MiB/s (20.5MB/s)(150MiB/7661msec); 0 zone resets
    clat (usec): min=29, max=68982, avg=1591.88, stdev=8555.80
     lat (usec): min=30, max=68984, avg=1593.26, stdev=8555.78
    clat percentiles (usec):
     |  1.00th=[   40],  5.00th=[   40], 10.00th=[   41], 20.00th=[   41],
     | 30.00th=[   42], 40.00th=[   47], 50.00th=[   55], 60.00th=[   57],
     | 70.00th=[   59], 80.00th=[   59], 90.00th=[   63], 95.00th=[   77],
     | 99.00th=[49021], 99.50th=[49021], 99.90th=[49021], 99.95th=[49021],
     | 99.99th=[68682]
   bw (  KiB/s): min=18432, max=20480, per=100.00%, avg=20206.93, stdev=637.92, samples=15
   iops        : min=  576, max=  640, avg=631.47, stdev=19.94, samples=15
  lat (usec)   : 50=45.74%, 100=50.78%, 250=0.31%, 500=0.02%
  lat (msec)   : 50=3.12%, 100=0.02%
  cpu          : usr=0.61%, sys=2.13%, ctx=4585, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4801,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=19.6MiB/s (20.5MB/s), 19.6MiB/s-19.6MiB/s (20.5MB/s-20.5MB/s), io=150MiB (157MB), run=7661-7661msec

Disk stats (read/write):
  nvme0n1: ios=51/4800, sectors=2112/307200, merge=0/0, ticks=6/7514, in_queue=7519, util=98.69%


**seqread**: (g=0): rw=read, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=1
...
fio-3.38-4-gcd56
Starting 8 processes
Jobs: 3 (f=3): [R(1),_(4),R(2),_(1)][88.5%][r=1047MiB/s][r=33.5k IOPS][eta 00m:09s]
seqread: (groupid=0, jobs=8): err= 0: pid=1092: Thu Nov 14 07:30:57 2024
  read: IOPS=7699, BW=241MiB/s (252MB/s)(16.0GiB/68095msec)
    clat (usec): min=13, max=29836, avg=137.70, stdev=290.41
     lat (usec): min=13, max=29836, avg=137.92, stdev=290.52
    clat percentiles (usec):
     |  1.00th=[   20],  5.00th=[   22], 10.00th=[   23], 20.00th=[   24],
     | 30.00th=[   26], 40.00th=[   28], 50.00th=[   31], 60.00th=[   37],
     | 70.00th=[   47], 80.00th=[  420], 90.00th=[  445], 95.00th=[  478],
     | 99.00th=[  498], 99.50th=[  506], 99.90th=[  586], 99.95th=[ 1057],
     | 99.99th=[15270]
   bw (  KiB/s): min= 2048, max=3113088, per=100.00%, avg=349634.07, stdev=69274.13, samples=729
   iops        : min=   64, max=97284, avg=10925.81, stdev=2164.80, samples=729
  lat (usec)   : 20=1.65%, 50=69.44%, 100=3.46%, 250=0.38%, 500=24.37%
  lat (usec)   : 750=0.64%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.71%, sys=4.32%, ctx=858341, majf=0, minf=154
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=241MiB/s (252MB/s), 241MiB/s-241MiB/s (252MB/s-252MB/s), io=16.0GiB (17.2GB), run=68095-68095msec

Disk stats (read/write):
  nvme0n1: ios=516011/0, sectors=33024704/0, merge=0/0, ticks=60955/0, in_queue=60954, util=99.96%

Then I changed CH_BITS to 3 and ran the same fio experiments as above. The results show that write bandwidth rises to 72.3 MB/s as expected, but read bandwidth falls to 78.2 MB/s. Why would this happen?
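
(A rough sanity check on the write side: 19.6 MB/s over 2 channels is about 9.8 MB/s per channel, so scaling to 8 channels should give roughly 8 × 9.8 ≈ 78 MB/s, which is close to the 72.3 MB/s I measured. So writes scale with the channel count as expected; it is the read drop that I cannot explain.)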

**seqwrite**: (g=0): rw=write, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=16
fio-3.38-4-gcd56
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [W(1)][100.0%][w=72.0MiB/s][w=2304 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=1048: Thu Nov 14 07:58:45 2024
  write: IOPS=2312, BW=72.3MiB/s (75.8MB/s)(16.0GiB/226764msec); 0 zone resets
    clat (usec): min=26, max=31048, avg=429.29, stdev=2160.66
     lat (usec): min=27, max=31050, avg=430.43, stdev=2160.66
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   38], 10.00th=[   38], 20.00th=[   38],
     | 30.00th=[   38], 40.00th=[   38], 50.00th=[   38], 60.00th=[   39],
     | 70.00th=[   39], 80.00th=[   41], 90.00th=[   58], 95.00th=[   67],
     | 99.00th=[12256], 99.50th=[12256], 99.90th=[12387], 99.95th=[12387],
     | 99.99th=[22152]
   bw (  KiB/s): min=67584, max=75927, per=100.00%, avg=74078.05, stdev=1383.76, samples=452
   iops        : min= 2112, max= 2372, avg=2314.90, stdev=43.23, samples=452
  lat (usec)   : 50=87.74%, 100=9.01%, 250=0.09%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=3.13%, 50=0.01%
  cpu          : usr=1.45%, sys=7.37%, ctx=489934, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=72.3MiB/s (75.8MB/s), 72.3MiB/s-72.3MiB/s (75.8MB/s-75.8MB/s), io=16.0GiB (17.2GB), run=226764-226764msec

Disk stats (read/write):
  nvme0n1: ios=50/524000, sectors=2104/33536000, merge=0/0, ticks=2/218353, in_queue=218355, util=100.00%

**seqread**: (g=0): rw=read, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=1
...
fio-3.38-4-gcd56
Starting 8 processes
Jobs: 1 (f=1): [_(3),R(1),_(4)][98.1%][r=70.4MiB/s][r=2252 IOPS][eta 00m:04s]                    
seqread: (groupid=0, jobs=8): err= 0: pid=1066: Thu Nov 14 08:02:46 2024
  read: IOPS=2502, BW=78.2MiB/s (82.0MB/s)(16.0GiB/209537msec)
    clat (usec): min=26, max=23734, avg=449.98, stdev=264.21
     lat (usec): min=26, max=23734, avg=450.45, stdev=264.22
    clat percentiles (usec):
     |  1.00th=[  371],  5.00th=[  396], 10.00th=[  404], 20.00th=[  420],
     | 30.00th=[  429], 40.00th=[  433], 50.00th=[  437], 60.00th=[  445],
     | 70.00th=[  465], 80.00th=[  482], 90.00th=[  494], 95.00th=[  498],
     | 99.00th=[  523], 99.50th=[  537], 99.90th=[  586], 99.95th=[ 1172],
     | 99.99th=[16057]
   bw (  KiB/s): min= 3328, max=574879, per=100.00%, avg=170196.40, stdev=18692.56, samples=1631
   iops        : min=  104, max=17964, avg=5318.13, stdev=584.13, samples=1631
  lat (usec)   : 50=0.04%, 100=0.01%, 250=0.01%, 500=95.30%, 750=4.60%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.03%, 50=0.01%
  cpu          : usr=0.62%, sys=3.03%, ctx=999070, majf=0, minf=151
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=78.2MiB/s (82.0MB/s), 78.2MiB/s-78.2MiB/s (82.0MB/s-82.0MB/s), io=16.0GiB (17.2GB), run=209537-209537msec

Disk stats (read/write):
  nvme0n1: ios=523929/0, sectors=33531456/0, merge=0/0, ticks=210808/0, in_queue=210808, util=100.00%

My second question is: why is the performance improvement limited when I try to improve read/write performance?
For example, if we managed to compress the data of multiple LPNs into 1 PPN (roughly the scheme sketched after the list below), we should definitely improve write performance. But as far as I can tell, the improvement is limited to around 5x even when the compression ratio is really high.
I base this on two observations:

  1. When I compress 118 LPNs into 1 PPN, the write bandwidth improvement is 5.6x. When I set the compressed data size to zero, which means the whole dataset would be written into 1 PPN no matter how large it actually is, the write bandwidth improvement is still 5.6x.
  2. The maximum improvement is the same whether CH_BITS is set to 1 or to 3, even though more improvement is expected with CH_BITS=3 because it can utilize all 8 channels.
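
To make the scheme concrete, here is a minimal sketch of the LPN-to-PPN mapping I mean (the identifiers and structure are my own illustration, not FEMU code): several logical pages share one physical page, so one NAND program should cover many host writes.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LPNS      (1u << 20)   // hypothetical table size for this sketch
#define LPNS_PER_PPN  118          // compression ratio used in my test

struct cmap_entry {
    uint64_t ppn;    // physical page that holds the compressed data
    uint32_t slot;   // position of this LPN inside that page
};

static struct cmap_entry l2p[NUM_LPNS];  // logical-to-physical table
static uint64_t cur_ppn;                 // currently open physical page
static uint32_t cur_slot;                // next free slot in that page

// Pack an incoming LPN into the currently open physical page; only when
// that page fills up does a real NAND program need to be issued, so the
// number of programs should drop by roughly LPNS_PER_PPN.
static bool map_compressed_lpn(uint64_t lpn)
{
    l2p[lpn].ppn  = cur_ppn;
    l2p[lpn].slot = cur_slot++;
    if (cur_slot == LPNS_PER_PPN) {   // physical page is full
        cur_slot = 0;
        cur_ppn++;
        return true;                  // caller issues one NAND program
    }
    return false;
}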

These two problems have been confusing me for a long time. I'd be very grateful if you would kindly answer them. Thanks a lot!
