CRIU dumps triggers COW on all memory in all child processes #2386

Open
Tianyang-Zhang opened this issue Apr 11, 2024 · 20 comments
Labels: bug-mem, no-auto-close (Don't auto-close as a stale issue)


@Tianyang-Zhang

Description

I have a process that allocates 3 GB of memory and then calls fork(). The child process never touches the memory, so no copy-on-write occurs and total memory usage stays at 3 GB. When I checkpoint that process tree, the checkpoint size is 6 GB, and system memory usage increases during checkpointing. When checkpointing with the --leave-running flag, the total memory usage of the processes has doubled to 6 GB after the checkpoint finishes.

I read https://criu.org/Copy-on-write_memory. It looks like CRIU should have supported such "forked COW memory" since v0.3.

CRIU parses /proc/pid/smaps to get the VMA type. In this case, I think CRIU gets the wrong VMA type. CRIU reads the permission field (rw-p below) to decide whether the VMA is private or shared. However, the "forked memory" is marked as private in smaps, although it is actually anonymous shared. You can see this in the smaps output below:

7f1dc68a2000-7f1e868a6000 rw-p 00000000 00:00 0
Size:            3145744 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:             3145744 kB
Pss:             1572874 kB
Shared_Clean:          0 kB
Shared_Dirty:    3145740 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:      3145744 kB
Anonymous:       3145744 kB
LazyFree:              0 kB
AnonHugePages:   3143680 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
ProtectionKey:         0
VmFlags: rd wr mr mw me ac
...
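As a rough cross-check of the numbers above, here is a minimal sketch (not CRIU code) of the two things being read from this entry: the private/shared letter in the permission field, and the relationship between Shared_Dirty, Private_Dirty, and Pss. The function names are hypothetical, and the Pss formula assumes every shared page is mapped by the same number of processes, which is a simplification of the per-page accounting the kernel does.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: returns true when the permission field of a
// /proc/<pid>/smaps header line ends in 'p' (private) rather than 's' (shared).
bool is_private_mapping(const std::string &header_line) {
  std::istringstream in(header_line);
  std::string range, perms;
  if (!(in >> range >> perms) || perms.size() != 4)
    return false;
  return perms[3] == 'p';
}

// Hypothetical helper: the Pss (in kB) a region would report if every shared
// page were mapped by exactly `sharers` processes. Assumption: uniform sharing.
uint64_t expected_pss_kb(uint64_t shared_kb, uint64_t private_kb,
                         unsigned sharers) {
  if (sharers == 0)
    return private_kb;
  return private_kb + shared_kb / sharers;
}
```

With the values from the entry above, `expected_pss_kb(3145740, 4, 2)` gives 1572874, which matches the reported Pss and is consistent with the pages being shared by two processes while the VMA itself is flagged `rw-p`.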

In this case, CRIU treats this VMA as VMA_ANON_PRIVATE, and eventually the parasite thread calls vmsplice() to transfer the pages to a pipe. The vmsplice() syscall then somehow triggers copy-on-write and causes memory usage to increase. You can verify this vmsplice() behavior with an app that does the following:

  1. creates a pipe
  2. allocates X bytes of memory
  3. forks
  4. in the child, calls vmsplice() to write the X bytes of memory to the pipe
  5. in the parent, creates a small buffer and reads the X bytes out of the pipe

You will see that the process memory usage has doubled once the data transfer is done, and that the anonymous shared VMA has become anonymous private.
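The five steps above could be sketched roughly as follows. This is an illustrative reconstruction, not the exact program used for the observation; the function name `vmsplice_after_fork` and the 4 KiB read buffer are assumptions. It only performs the transfer and does not measure memory itself, so to observe the effect you would pause before the final `munmap()` and inspect /proc/<pid>/smaps of both processes.

```cpp
#include <fcntl.h>      // vmsplice (g++ defines _GNU_SOURCE by default)
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

// Returns the number of bytes the parent read from the pipe, or -1 on error.
long vmsplice_after_fork(size_t bytes) {
  int fds[2];
  if (pipe(fds) != 0)                                    // step 1
    return -1;

  void *mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  // step 2
  if (mem == MAP_FAILED) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }
  memset(mem, 0x5a, bytes);  // fault in every page before forking

  pid_t pid = fork();                                    // step 3
  if (pid < 0) {
    munmap(mem, bytes);
    close(fds[0]);
    close(fds[1]);
    return -1;
  }

  if (pid == 0) {                                        // step 4 (child)
    close(fds[0]);
    char *cur = static_cast<char *>(mem);
    size_t left = bytes;
    while (left > 0) {
      struct iovec iov = {cur, left};
      ssize_t n = vmsplice(fds[1], &iov, 1, 0);  // may splice only part
      if (n <= 0)
        _exit(1);
      cur += n;
      left -= static_cast<size_t>(n);
    }
    _exit(0);
  }

  close(fds[1]);                                         // step 5 (parent)
  char buf[4096];
  long total = 0;
  ssize_t n;
  while ((n = read(fds[0], buf, sizeof buf)) > 0)
    total += n;
  close(fds[0]);

  int status = 0;
  waitpid(pid, &status, 0);
  munmap(mem, bytes);
  if (n < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
    return -1;
  return total;
}
```

Calling `vmsplice_after_fork(X)` from a small `main()` with a large X, and comparing the smaps entries of parent and child before and after, should be enough to check whether the doubling described above occurs on a given kernel.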

I then looked for a code path that handles this "forked memory" but couldn't find one. It looks like all VMA_ANON_SHARED handling eventually needs to find a corresponding entry in /proc/self/map_files/, but the forked memory doesn't map to any file.

I would greatly appreciate any help. I want to know whether this is a limitation of CRIU or a bug. If a process that uses 1 GB forks 100 times and none of the children touch the memory, system usage is 1 GB but will increase to 100 GB after the checkpoint.

Steps to reproduce the issue:

  1. Allocate 2 GB of memory in a process.
  2. Call fork() 19 times and put all processes to sleep.
  3. Check system memory usage; it should be 2 GB.
  4. Dump the process tree with --leave-running.
  5. Check system memory usage during the dump and after the dump finishes.
  6. Check the checkpoint image size.
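For steps 3 and 5, one way to measure the memory actually attributable to the process tree is to sum the Pss of every process, since Pss divides shared pages among the processes that map them. The sketch below is an assumption about how one might do this, not something from the original report: it reads /proc/<pid>/smaps_rollup (available on kernels that provide that file), and the function names are hypothetical.

```cpp
#include <sys/types.h>
#include <cstdint>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Extracts the "Pss:" value (in kB) from smaps_rollup-formatted text.
// Returns 0 when the field is missing or the stream cannot be read.
uint64_t pss_kb(std::istream &rollup) {
  std::string line;
  while (std::getline(rollup, line)) {
    std::istringstream fields(line);
    std::string key;
    uint64_t kb = 0;
    // Match the key exactly so that Pss_Anon, Pss_File, etc. are skipped.
    if ((fields >> key >> kb) && key == "Pss:")
      return kb;
  }
  return 0;
}

// Sums Pss over the given processes. With 20 processes sharing 2 GB of
// untouched forked memory, the total should stay close to 2 GB.
uint64_t total_pss_kb(const std::vector<pid_t> &pids) {
  uint64_t total = 0;
  for (pid_t pid : pids) {
    std::ifstream f("/proc/" + std::to_string(pid) + "/smaps_rollup");
    total += pss_kb(f);
  }
  return total;
}
```

Running `total_pss_kb` over the root PID and its children before the dump and again after it finishes would show whether the pages are still shared or have been duplicated.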

Describe the results you received:
System memory usage increases after the entire process tree is seized, eventually reaching 40 GB. The checkpoint image size is 40 GB.

Describe the results you expected:
System memory usage should not increase. The process tree should still use only 2 GB after the dump. The checkpoint image size should be 2 GB, because all 20 processes share the same physical pages and no CoW has occurred.

Additional information you deem important (e.g. issue happens only occasionally):
Relevant code: the sys_vmsplice() call in parasite.c::dump_pages(), and proc_parse.c::parse_smaps().

Also, I'm wondering whether bit 61 of /proc/pid/pagemap can be used to determine whether a page is anonymous shared (see https://www.kernel.org/doc/Documentation/vm/pagemap.txt) so that this case can be handled specially. I see that CRIU uses that bit to check whether a page is a file page.
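To make the pagemap idea concrete, here is a small sketch of how a 64-bit pagemap entry could be read and decoded, following the bit layout in the kernel document linked above. The struct and function names are hypothetical. Besides bit 61, the sketch exposes bit 56 ("page exclusively mapped"), which the same document lists; whether either bit reliably identifies pages shared through fork() is untested here and remains an open question.

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdint>
#include <string>

// Flag bits of one /proc/<pid>/pagemap entry, per the kernel's pagemap.txt.
struct PagemapEntry {
  bool present;              // bit 63: page present in RAM
  bool swapped;              // bit 62: page is swapped out
  bool file_or_shared_anon;  // bit 61: file page or shared-anon
  bool exclusively_mapped;   // bit 56: page mapped by exactly one process
  bool soft_dirty;           // bit 55: PTE is soft-dirty
};

PagemapEntry decode_pagemap(uint64_t raw) {
  PagemapEntry e;
  e.present = (raw >> 63) & 1;
  e.swapped = (raw >> 62) & 1;
  e.file_or_shared_anon = (raw >> 61) & 1;
  e.exclusively_mapped = (raw >> 56) & 1;
  e.soft_dirty = (raw >> 55) & 1;
  return e;
}

// Reads the raw entry for the page containing `vaddr` in process `pid`.
// Each virtual page has one 8-byte entry, indexed by virtual page number.
bool read_pagemap_entry(pid_t pid, uintptr_t vaddr, uint64_t *raw) {
  std::string path = "/proc/" + std::to_string(pid) + "/pagemap";
  int fd = open(path.c_str(), O_RDONLY);
  if (fd < 0)
    return false;
  long page_size = sysconf(_SC_PAGESIZE);
  off_t offset = static_cast<off_t>(vaddr / page_size) * sizeof(uint64_t);
  ssize_t n = pread(fd, raw, sizeof(*raw), offset);
  close(fd);
  return n == static_cast<ssize_t>(sizeof(*raw));
}
```

Decoding the entries for a few addresses inside the 3 GB VMA, in both parent and child, would show which of these bits actually differ between forked-but-untouched pages and pages that have already been copied.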

CRIU logs and information:
(The dump log is too long, so I will paste only some of the VMA-related parts. Please let me know if anything else is needed.)

CRIU full dump/restore logs:

(00.007397) ========================================
(00.007412) Dumping task (pid: 8235 comm: fork_malloc)
(00.007415) ========================================
(00.007416) Obtaining task stat ... 
(00.007446) 
(00.007448) Collecting mappings (pid: 8235)
(00.007450) ----------------------------------------
(00.007502) Handling VMA with the following smaps entry: 00400000-00401000 r--p 00000000 103:02 4204064                           /home/ec2-user/fork_malloc
(00.007523) Found regular file mapping, OK
(00.007598) Dumping path for -3 fd via self 12 [/home/ec2-user/fork_malloc]
(00.007686) Handling VMA with the following smaps entry: 00401000-00404000 r-xp 00001000 103:02 4204064                           /home/ec2-user/fork_malloc
(00.007694) vma 401000 borrows vfi from previous 400000
(00.007700) Handling VMA with the following smaps entry: 00404000-00406000 r--p 00004000 103:02 4204064                           /home/ec2-user/fork_malloc
(00.007706) vma 404000 borrows vfi from previous 401000
(00.007712) Handling VMA with the following smaps entry: 00406000-00407000 r--p 00005000 103:02 4204064                           /home/ec2-user/fork_malloc
(00.007718) vma 406000 borrows vfi from previous 404000
(00.007723) Handling VMA with the following smaps entry: 00407000-00408000 rw-p 00006000 103:02 4204064                           /home/ec2-user/fork_malloc
(00.007729) vma 407000 borrows vfi from previous 406000
(00.007735) Handling VMA with the following smaps entry: 014a0000-014c1000 rw-p 00000000 00:00 0                                  [heap]
(00.030446) Handling VMA with the following smaps entry: 7f1dc68a2000-7f1e868a6000 rw-p 00000000 00:00 0 
(00.030483) Handling VMA with the following smaps entry: 7f1e868a6000-7f1e868d2000 r--p 00000000 103:02 12597426                  /usr/lib64/libc.so.6
(00.030505) Found regular file mapping, OK
(00.030555) Dumping path for -3 fd via self 12 [/usr/lib64/libc.so.6]
(00.030636) Handling VMA with the following smaps entry: 7f1e868d2000-7f1e86a48000 r-xp 0002c000 103:02 12597426                  /usr/lib64/libc.so.6
(00.030647) vma 7f1e868d2000 borrows vfi from previous 7f1e868a6000
(00.030654) Handling VMA with the following smaps entry: 7f1e86a48000-7f1e86a9c000 r--p 001a2000 103:02 12597426                  /usr/lib64/libc.so.6
(00.030660) vma 7f1e86a48000 borrows vfi from previous 7f1e868d2000
(00.030666) Handling VMA with the following smaps entry: 7f1e86a9c000-7f1e86a9d000 ---p 001f6000 103:02 12597426                  /usr/lib64/libc.so.6
(00.030672) vma 7f1e86a9c000 borrows vfi from previous 7f1e86a48000
(00.030678) Handling VMA with the following smaps entry: 7f1e86a9d000-7f1e86aa0000 r--p 001f6000 103:02 12597426                  /usr/lib64/libc.so.6
(00.030702) vma 7f1e86a9d000 borrows vfi from previous 7f1e86a9c000
(00.030734) Handling VMA with the following smaps entry: 7f1e86aa0000-7f1e86aa3000 rw-p 001f9000 103:02 12597426                  /usr/lib64/libc.so.6
(00.030739) vma 7f1e86aa0000 borrows vfi from previous 7f1e86a9d000
(00.030745) Handling VMA with the following smaps entry: 7f1e86aa3000-7f1e86ab0000 rw-p 00000000 00:00 0 
(00.030759) Handling VMA with the following smaps entry: 7f1e86ab0000-7f1e86ab3000 r--p 00000000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(00.030769) Found regular file mapping, OK
(00.030793) Dumping path for -3 fd via self 12 [/usr/lib64/libgcc_s-11-20220421.so.1]
(00.030825) Handling VMA with the following smaps entry: 7f1e86ab3000-7f1e86ac5000 r-xp 00003000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(00.030832) vma 7f1e86ab3000 borrows vfi from previous 7f1e86ab0000
(00.030837) Handling VMA with the following smaps entry: 7f1e86ac5000-7f1e86ac8000 r--p 00015000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(00.030843) vma 7f1e86ac5000 borrows vfi from previous 7f1e86ab3000
(00.030870) Handling VMA with the following smaps entry: 7f1e86ac8000-7f1e86ac9000 ---p 00018000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(00.030876) vma 7f1e86ac8000 borrows vfi from previous 7f1e86ac5000
(00.030881) Handling VMA with the following smaps entry: 7f1e86ac9000-7f1e86aca000 r--p 00018000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(00.030886) vma 7f1e86ac9000 borrows vfi from previous 7f1e86ac8000
(00.030892) Handling VMA with the following smaps entry: 7f1e86aca000-7f1e86acb000 rw-p 00019000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(00.030897) vma 7f1e86aca000 borrows vfi from previous 7f1e86ac9000
(00.030903) Handling VMA with the following smaps entry: 7f1e86acb000-7f1e86ada000 r--p 00000000 103:02 12597429                  /usr/lib64/libm.so.6
(00.030915) Found regular file mapping, OK
(00.030935) Dumping path for -3 fd via self 12 [/usr/lib64/libm.so.6]
(00.030965) Handling VMA with the following smaps entry: 7f1e86ada000-7f1e86b4a000 r-xp 0000f000 103:02 12597429                  /usr/lib64/libm.so.6
(00.030971) vma 7f1e86ada000 borrows vfi from previous 7f1e86acb000
(00.030977) Handling VMA with the following smaps entry: 7f1e86b4a000-7f1e86ba4000 r--p 0007f000 103:02 12597429                  /usr/lib64/libm.so.6
(00.030982) vma 7f1e86b4a000 borrows vfi from previous 7f1e86ada000
(00.031039) Handling VMA with the following smaps entry: 7f1e86ba4000-7f1e86ba5000 r--p 000d8000 103:02 12597429                  /usr/lib64/libm.so.6
(00.031044) vma 7f1e86ba4000 borrows vfi from previous 7f1e86b4a000
(00.031054) Handling VMA with the following smaps entry: 7f1e86ba5000-7f1e86ba6000 rw-p 000d9000 103:02 12597429                  /usr/lib64/libm.so.6
(00.031060) vma 7f1e86ba5000 borrows vfi from previous 7f1e86ba4000
(00.031065) Handling VMA with the following smaps entry: 7f1e86ba6000-7f1e86c3f000 r--p 00000000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(00.031078) Found regular file mapping, OK
(00.031098) Dumping path for -3 fd via self 12 [/usr/lib64/libstdc++.so.6.0.29]
(00.031127) Handling VMA with the following smaps entry: 7f1e86c3f000-7f1e86d49000 r-xp 00099000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(00.031134) vma 7f1e86c3f000 borrows vfi from previous 7f1e86ba6000
(00.031139) Handling VMA with the following smaps entry: 7f1e86d49000-7f1e86dbb000 r--p 001a3000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(00.031145) vma 7f1e86d49000 borrows vfi from previous 7f1e86c3f000
(00.031150) Handling VMA with the following smaps entry: 7f1e86dbb000-7f1e86dbc000 ---p 00215000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(00.031156) vma 7f1e86dbb000 borrows vfi from previous 7f1e86d49000
(00.031179) Handling VMA with the following smaps entry: 7f1e86dbc000-7f1e86dc9000 r--p 00215000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(00.031188) vma 7f1e86dbc000 borrows vfi from previous 7f1e86dbb000
(00.031193) Handling VMA with the following smaps entry: 7f1e86dc9000-7f1e86dca000 rw-p 00222000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(00.031198) vma 7f1e86dc9000 borrows vfi from previous 7f1e86dbc000
(00.031203) Handling VMA with the following smaps entry: 7f1e86dca000-7f1e86dcd000 rw-p 00000000 00:00 0 
(00.031215) Handling VMA with the following smaps entry: 7f1e86dd1000-7f1e86dd3000 rw-p 00000000 00:00 0 
(00.031222) Handling VMA with the following smaps entry: 7f1e86dd3000-7f1e86dd5000 r--p 00000000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(00.031232) Found regular file mapping, OK
(00.031252) Dumping path for -3 fd via self 12 [/usr/lib64/ld-linux-x86-64.so.2]
(00.031307) Handling VMA with the following smaps entry: 7f1e86dd5000-7f1e86dfb000 r-xp 00002000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(00.031315) vma 7f1e86dd5000 borrows vfi from previous 7f1e86dd3000
(00.031320) Handling VMA with the following smaps entry: 7f1e86dfb000-7f1e86e06000 r--p 00028000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(00.031326) vma 7f1e86dfb000 borrows vfi from previous 7f1e86dd5000
(00.031331) Handling VMA with the following smaps entry: 7f1e86e07000-7f1e86e09000 r--p 00033000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(00.031336) vma 7f1e86e07000 borrows vfi from previous 7f1e86dfb000
(00.031342) Handling VMA with the following smaps entry: 7f1e86e09000-7f1e86e0b000 rw-p 00035000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(00.031348) vma 7f1e86e09000 borrows vfi from previous 7f1e86e07000
(00.031354) Handling VMA with the following smaps entry: 7ffe3a6a9000-7ffe3a6ca000 rw-p 00000000 00:00 0                          [stack]
(00.031366) Handling VMA with the following smaps entry: 7ffe3a7a4000-7ffe3a7a8000 r--p 00000000 00:00 0                          [vvar]
(00.031382) Handling VMA with the following smaps entry: 7ffe3a7a8000-7ffe3a7aa000 r-xp 00000000 00:00 0                          [vdso]
(00.031391) Handling VMA with the following smaps entry: ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(00.031402) Collected, longest area occupies 786436 pages
(00.031405) 0x400000-0x401000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x1
(00.031409) 0x401000-0x404000 (12K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x1000 reg fp  shmid: 0x1
(00.031412) 0x404000-0x406000 (8K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x4000 reg fp  shmid: 0x1
(00.031414) 0x406000-0x407000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x5000 reg fp  shmid: 0x1
(00.031417) 0x407000-0x408000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x6000 reg fp  shmid: 0x1
(00.031419) 0x14a0000-0x14c1000 (132K) prot 0x3 flags 0x22 fdflags 0 st 0x221 off 0 reg heap ap  shmid: 0
(00.031422) 0x7f1dc68a2000-0x7f1e868a6000 (3145744K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(00.031425) 0x7f1e868a6000-0x7f1e868d2000 (176K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x2
(00.031427) 0x7f1e868d2000-0x7f1e86a48000 (1496K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x2c000 reg fp  shmid: 0x2
(00.031430) 0x7f1e86a48000-0x7f1e86a9c000 (336K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1a2000 reg fp  shmid: 0x2
(00.031432) 0x7f1e86a9c000-0x7f1e86a9d000 (4K) prot 0 flags 0x2 fdflags 0 st 0x41 off 0x1f6000 reg fp  shmid: 0x2
(00.031435) 0x7f1e86a9d000-0x7f1e86aa0000 (12K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1f6000 reg fp  shmid: 0x2
(00.031437) 0x7f1e86aa0000-0x7f1e86aa3000 (12K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x1f9000 reg fp  shmid: 0x2
(00.031440) 0x7f1e86aa3000-0x7f1e86ab0000 (52K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(00.031442) 0x7f1e86ab0000-0x7f1e86ab3000 (12K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x3
(00.031444) 0x7f1e86ab3000-0x7f1e86ac5000 (72K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x3000 reg fp  shmid: 0x3
(00.031450) 0x7f1e86ac5000-0x7f1e86ac8000 (12K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x15000 reg fp  shmid: 0x3
(00.031452) 0x7f1e86ac8000-0x7f1e86ac9000 (4K) prot 0 flags 0x2 fdflags 0 st 0x41 off 0x18000 reg fp  shmid: 0x3
(00.031455) 0x7f1e86ac9000-0x7f1e86aca000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x18000 reg fp  shmid: 0x3
(00.031457) 0x7f1e86aca000-0x7f1e86acb000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x19000 reg fp  shmid: 0x3
(00.031460) 0x7f1e86acb000-0x7f1e86ada000 (60K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x4
(00.031462) 0x7f1e86ada000-0x7f1e86b4a000 (448K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0xf000 reg fp  shmid: 0x4
(00.031464) 0x7f1e86b4a000-0x7f1e86ba4000 (360K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x7f000 reg fp  shmid: 0x4
(00.031467) 0x7f1e86ba4000-0x7f1e86ba5000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0xd8000 reg fp  shmid: 0x4
(00.031469) 0x7f1e86ba5000-0x7f1e86ba6000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0xd9000 reg fp  shmid: 0x4
(00.031471) 0x7f1e86ba6000-0x7f1e86c3f000 (612K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x5
(00.031474) 0x7f1e86c3f000-0x7f1e86d49000 (1064K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x99000 reg fp  shmid: 0x5
(00.031476) 0x7f1e86d49000-0x7f1e86dbb000 (456K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1a3000 reg fp  shmid: 0x5
(00.031479) 0x7f1e86dbb000-0x7f1e86dbc000 (4K) prot 0 flags 0x2 fdflags 0 st 0x41 off 0x215000 reg fp  shmid: 0x5
(00.031481) 0x7f1e86dbc000-0x7f1e86dc9000 (52K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x215000 reg fp  shmid: 0x5
(00.031483) 0x7f1e86dc9000-0x7f1e86dca000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x222000 reg fp  shmid: 0x5
(00.031486) 0x7f1e86dca000-0x7f1e86dcd000 (12K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(00.031488) 0x7f1e86dd1000-0x7f1e86dd3000 (8K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(00.031491) 0x7f1e86dd3000-0x7f1e86dd5000 (8K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x6
(00.031493) 0x7f1e86dd5000-0x7f1e86dfb000 (152K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x2000 reg fp  shmid: 0x6
(00.031495) 0x7f1e86dfb000-0x7f1e86e06000 (44K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x28000 reg fp  shmid: 0x6
(00.031498) 0x7f1e86e07000-0x7f1e86e09000 (8K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x33000 reg fp  shmid: 0x6
(00.031500) 0x7f1e86e09000-0x7f1e86e0b000 (8K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x35000 reg fp  shmid: 0x6
(00.031503) 0x7ffe3a6a9000-0x7ffe3a6ca000 (132K) prot 0x3 flags 0x122 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(00.031506) 0x7ffe3a7a4000-0x7ffe3a7a8000 (16K) prot 0x1 flags 0x22 fdflags 0 st 0x1201 off 0 reg vvar ap  shmid: 0
(00.031509) 0x7ffe3a7a8000-0x7ffe3a7aa000 (8K) prot 0x5 flags 0x22 fdflags 0 st 0x209 off 0 reg vdso ap  shmid: 0
(00.031511) 0xffffffffff600000-0xffffffffff601000 (4K) prot 0x5 flags 0x22 fdflags 0 st 0x204 off 0 vsys ap  shmid: 0
(00.031514) ----------------------------------------


(03.692000) Dumping task (pid: 8237 comm: fork_malloc)
(03.692002) ========================================
(03.692004) Obtaining task stat ... 
(03.692049) 
(03.692051) Collecting mappings (pid: 8237)
(03.692053) ----------------------------------------
(03.692111) Handling VMA with the following smaps entry: 00400000-00401000 r--p 00000000 103:02 4204064                           /home/ec2-user/fork_malloc
(03.692126) Found regular file mapping, OK
(03.692154) Handling VMA with the following smaps entry: 00401000-00404000 r-xp 00001000 103:02 4204064                           /home/ec2-user/fork_malloc
(03.692161) vma 401000 borrows vfi from previous 400000
(03.692168) Handling VMA with the following smaps entry: 00404000-00406000 r--p 00004000 103:02 4204064                           /home/ec2-user/fork_malloc
(03.692174) vma 404000 borrows vfi from previous 401000
(03.692191) Handling VMA with the following smaps entry: 00406000-00407000 r--p 00005000 103:02 4204064                           /home/ec2-user/fork_malloc
(03.692198) vma 406000 borrows vfi from previous 404000
(03.692204) Handling VMA with the following smaps entry: 00407000-00408000 rw-p 00006000 103:02 4204064                           /home/ec2-user/fork_malloc
(03.692210) vma 407000 borrows vfi from previous 406000
(03.692217) Handling VMA with the following smaps entry: 014a0000-014c1000 rw-p 00000000 00:00 0                                  [heap]
(03.692614) Handling VMA with the following smaps entry: 7f1dc68a2000-7f1e868a6000 rw-p 00000000 00:00 0 
(03.692628) Handling VMA with the following smaps entry: 7f1e868a6000-7f1e868d2000 r--p 00000000 103:02 12597426                  /usr/lib64/libc.so.6
(03.692642) Found regular file mapping, OK
(03.692663) Handling VMA with the following smaps entry: 7f1e868d2000-7f1e86a48000 r-xp 0002c000 103:02 12597426                  /usr/lib64/libc.so.6
(03.692669) vma 7f1e868d2000 borrows vfi from previous 7f1e868a6000
(03.692676) Handling VMA with the following smaps entry: 7f1e86a48000-7f1e86a9c000 r--p 001a2000 103:02 12597426                  /usr/lib64/libc.so.6
(03.692682) vma 7f1e86a48000 borrows vfi from previous 7f1e868d2000
(03.692689) Handling VMA with the following smaps entry: 7f1e86a9c000-7f1e86a9d000 ---p 001f6000 103:02 12597426                  /usr/lib64/libc.so.6
(03.692695) vma 7f1e86a9c000 borrows vfi from previous 7f1e86a48000
(03.692702) Handling VMA with the following smaps entry: 7f1e86a9d000-7f1e86aa0000 r--p 001f6000 103:02 12597426                  /usr/lib64/libc.so.6
(03.692707) vma 7f1e86a9d000 borrows vfi from previous 7f1e86a9c000
(03.692741) Handling VMA with the following smaps entry: 7f1e86aa0000-7f1e86aa3000 rw-p 001f9000 103:02 12597426                  /usr/lib64/libc.so.6
(03.692747) vma 7f1e86aa0000 borrows vfi from previous 7f1e86a9d000
(03.692753) Handling VMA with the following smaps entry: 7f1e86aa3000-7f1e86ab0000 rw-p 00000000 00:00 0 
(03.692766) Handling VMA with the following smaps entry: 7f1e86ab0000-7f1e86ab3000 r--p 00000000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(03.692777) Found regular file mapping, OK
(03.692798) Handling VMA with the following smaps entry: 7f1e86ab3000-7f1e86ac5000 r-xp 00003000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(03.692804) vma 7f1e86ab3000 borrows vfi from previous 7f1e86ab0000
(03.692810) Handling VMA with the following smaps entry: 7f1e86ac5000-7f1e86ac8000 r--p 00015000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(03.692816) vma 7f1e86ac5000 borrows vfi from previous 7f1e86ab3000
(03.692841) Handling VMA with the following smaps entry: 7f1e86ac8000-7f1e86ac9000 ---p 00018000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(03.692847) vma 7f1e86ac8000 borrows vfi from previous 7f1e86ac5000
(03.692853) Handling VMA with the following smaps entry: 7f1e86ac9000-7f1e86aca000 r--p 00018000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(03.692860) vma 7f1e86ac9000 borrows vfi from previous 7f1e86ac8000
(03.692866) Handling VMA with the following smaps entry: 7f1e86aca000-7f1e86acb000 rw-p 00019000 103:02 12978283                  /usr/lib64/libgcc_s-11-20220421.so.1
(03.692873) vma 7f1e86aca000 borrows vfi from previous 7f1e86ac9000
(03.692880) Handling VMA with the following smaps entry: 7f1e86acb000-7f1e86ada000 r--p 00000000 103:02 12597429                  /usr/lib64/libm.so.6
(03.692893) Found regular file mapping, OK
(03.692911) Handling VMA with the following smaps entry: 7f1e86ada000-7f1e86b4a000 r-xp 0000f000 103:02 12597429                  /usr/lib64/libm.so.6
(03.692918) vma 7f1e86ada000 borrows vfi from previous 7f1e86acb000
(03.692924) Handling VMA with the following smaps entry: 7f1e86b4a000-7f1e86ba4000 r--p 0007f000 103:02 12597429                  /usr/lib64/libm.so.6
(03.692931) vma 7f1e86b4a000 borrows vfi from previous 7f1e86ada000
(03.692968) Handling VMA with the following smaps entry: 7f1e86ba4000-7f1e86ba5000 r--p 000d8000 103:02 12597429                  /usr/lib64/libm.so.6
(03.692975) vma 7f1e86ba4000 borrows vfi from previous 7f1e86b4a000
(03.692981) Handling VMA with the following smaps entry: 7f1e86ba5000-7f1e86ba6000 rw-p 000d9000 103:02 12597429                  /usr/lib64/libm.so.6
(03.692987) vma 7f1e86ba5000 borrows vfi from previous 7f1e86ba4000
(03.692994) Handling VMA with the following smaps entry: 7f1e86ba6000-7f1e86c3f000 r--p 00000000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(03.693007) Found regular file mapping, OK
(03.693024) Handling VMA with the following smaps entry: 7f1e86c3f000-7f1e86d49000 r-xp 00099000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(03.693030) vma 7f1e86c3f000 borrows vfi from previous 7f1e86ba6000
(03.693036) Handling VMA with the following smaps entry: 7f1e86d49000-7f1e86dbb000 r--p 001a3000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(03.693042) vma 7f1e86d49000 borrows vfi from previous 7f1e86c3f000
(03.693048) Handling VMA with the following smaps entry: 7f1e86dbb000-7f1e86dbc000 ---p 00215000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(03.693054) vma 7f1e86dbb000 borrows vfi from previous 7f1e86d49000
(03.693077) Handling VMA with the following smaps entry: 7f1e86dbc000-7f1e86dc9000 r--p 00215000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(03.693083) vma 7f1e86dbc000 borrows vfi from previous 7f1e86dbb000
(03.693089) Handling VMA with the following smaps entry: 7f1e86dc9000-7f1e86dca000 rw-p 00222000 103:02 12597619                  /usr/lib64/libstdc++.so.6.0.29
(03.693094) vma 7f1e86dc9000 borrows vfi from previous 7f1e86dbc000
(03.693100) Handling VMA with the following smaps entry: 7f1e86dca000-7f1e86dcd000 rw-p 00000000 00:00 0 
(03.693112) Handling VMA with the following smaps entry: 7f1e86dd1000-7f1e86dd3000 rw-p 00000000 00:00 0 
(03.693120) Handling VMA with the following smaps entry: 7f1e86dd3000-7f1e86dd5000 r--p 00000000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(03.693134) Found regular file mapping, OK
(03.693180) Handling VMA with the following smaps entry: 7f1e86dd5000-7f1e86dfb000 r-xp 00002000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(03.693189) vma 7f1e86dd5000 borrows vfi from previous 7f1e86dd3000
(03.693195) Handling VMA with the following smaps entry: 7f1e86dfb000-7f1e86e06000 r--p 00028000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(03.693201) vma 7f1e86dfb000 borrows vfi from previous 7f1e86dd5000
(03.693207) Handling VMA with the following smaps entry: 7f1e86e07000-7f1e86e09000 r--p 00033000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(03.693213) vma 7f1e86e07000 borrows vfi from previous 7f1e86dfb000
(03.693220) Handling VMA with the following smaps entry: 7f1e86e09000-7f1e86e0b000 rw-p 00035000 103:02 12597422                  /usr/lib64/ld-linux-x86-64.so.2
(03.693226) vma 7f1e86e09000 borrows vfi from previous 7f1e86e07000
(03.693233) Handling VMA with the following smaps entry: 7ffe3a6a9000-7ffe3a6ca000 rw-p 00000000 00:00 0                          [stack]
(03.693247) Handling VMA with the following smaps entry: 7ffe3a7a4000-7ffe3a7a8000 r--p 00000000 00:00 0                          [vvar]
(03.693265) Handling VMA with the following smaps entry: 7ffe3a7a8000-7ffe3a7aa000 r-xp 00000000 00:00 0                          [vdso]
(03.693274) Handling VMA with the following smaps entry: ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(03.693286) Collected, longest area occupies 786436 pages
(03.693289) 0x400000-0x401000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x1
(03.693292) 0x401000-0x404000 (12K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x1000 reg fp  shmid: 0x1
(03.693296) 0x404000-0x406000 (8K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x4000 reg fp  shmid: 0x1
(03.693298) 0x406000-0x407000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x5000 reg fp  shmid: 0x1
(03.693301) 0x407000-0x408000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x6000 reg fp  shmid: 0x1
(03.693304) 0x14a0000-0x14c1000 (132K) prot 0x3 flags 0x22 fdflags 0 st 0x221 off 0 reg heap ap  shmid: 0
(03.693307) 0x7f1dc68a2000-0x7f1e868a6000 (3145744K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(03.693310) 0x7f1e868a6000-0x7f1e868d2000 (176K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x2
(03.693313) 0x7f1e868d2000-0x7f1e86a48000 (1496K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x2c000 reg fp  shmid: 0x2
(03.693316) 0x7f1e86a48000-0x7f1e86a9c000 (336K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1a2000 reg fp  shmid: 0x2
(03.693319) 0x7f1e86a9c000-0x7f1e86a9d000 (4K) prot 0 flags 0x2 fdflags 0 st 0x41 off 0x1f6000 reg fp  shmid: 0x2
(03.693321) 0x7f1e86a9d000-0x7f1e86aa0000 (12K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1f6000 reg fp  shmid: 0x2
(03.693324) 0x7f1e86aa0000-0x7f1e86aa3000 (12K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x1f9000 reg fp  shmid: 0x2
(03.693327) 0x7f1e86aa3000-0x7f1e86ab0000 (52K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(03.693330) 0x7f1e86ab0000-0x7f1e86ab3000 (12K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x3
(03.693332) 0x7f1e86ab3000-0x7f1e86ac5000 (72K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x3000 reg fp  shmid: 0x3
(03.693335) 0x7f1e86ac5000-0x7f1e86ac8000 (12K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x15000 reg fp  shmid: 0x3
(03.693338) 0x7f1e86ac8000-0x7f1e86ac9000 (4K) prot 0 flags 0x2 fdflags 0 st 0x41 off 0x18000 reg fp  shmid: 0x3
(03.693340) 0x7f1e86ac9000-0x7f1e86aca000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x18000 reg fp  shmid: 0x3
(03.693343) 0x7f1e86aca000-0x7f1e86acb000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x19000 reg fp  shmid: 0x3
(03.693346) 0x7f1e86acb000-0x7f1e86ada000 (60K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x4
(03.693349) 0x7f1e86ada000-0x7f1e86b4a000 (448K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0xf000 reg fp  shmid: 0x4
(03.693351) 0x7f1e86b4a000-0x7f1e86ba4000 (360K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x7f000 reg fp  shmid: 0x4
(03.693354) 0x7f1e86ba4000-0x7f1e86ba5000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0xd8000 reg fp  shmid: 0x4
(03.693359) 0x7f1e86ba5000-0x7f1e86ba6000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0xd9000 reg fp  shmid: 0x4
(03.693362) 0x7f1e86ba6000-0x7f1e86c3f000 (612K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x5
(03.693364) 0x7f1e86c3f000-0x7f1e86d49000 (1064K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x99000 reg fp  shmid: 0x5
(03.693367) 0x7f1e86d49000-0x7f1e86dbb000 (456K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1a3000 reg fp  shmid: 0x5
(03.693370) 0x7f1e86dbb000-0x7f1e86dbc000 (4K) prot 0 flags 0x2 fdflags 0 st 0x41 off 0x215000 reg fp  shmid: 0x5
(03.693372) 0x7f1e86dbc000-0x7f1e86dc9000 (52K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x215000 reg fp  shmid: 0x5
(03.693375) 0x7f1e86dc9000-0x7f1e86dca000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x222000 reg fp  shmid: 0x5
(03.693378) 0x7f1e86dca000-0x7f1e86dcd000 (12K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(03.693381) 0x7f1e86dd1000-0x7f1e86dd3000 (8K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(03.693384) 0x7f1e86dd3000-0x7f1e86dd5000 (8K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x6
(03.693387) 0x7f1e86dd5000-0x7f1e86dfb000 (152K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x2000 reg fp  shmid: 0x6
(03.693390) 0x7f1e86dfb000-0x7f1e86e06000 (44K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x28000 reg fp  shmid: 0x6
(03.693393) 0x7f1e86e07000-0x7f1e86e09000 (8K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x33000 reg fp  shmid: 0x6
(03.693396) 0x7f1e86e09000-0x7f1e86e0b000 (8K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x35000 reg fp  shmid: 0x6
(03.693399) 0x7ffe3a6a9000-0x7ffe3a6ca000 (132K) prot 0x3 flags 0x122 fdflags 0 st 0x201 off 0 reg ap  shmid: 0
(03.693402) 0x7ffe3a7a4000-0x7ffe3a7a8000 (16K) prot 0x1 flags 0x22 fdflags 0 st 0x1201 off 0 reg vvar ap  shmid: 0
(03.693406) 0x7ffe3a7a8000-0x7ffe3a7aa000 (8K) prot 0x5 flags 0x22 fdflags 0 st 0x209 off 0 reg vdso ap  shmid: 0
(03.693409) 0xffffffffff600000-0xffffffffff601000 (4K) prot 0x5 flags 0x22 fdflags 0 st 0x204 off 0 vsys ap  shmid: 0

@rst0git
Member

rst0git commented Apr 12, 2024

  1. allocate 2GB memory in a process
  2. fork() 19 times and put all processes to sleep

@Tianyang-Zhang Would you be able to share an example code snippet for a test program that could be used to reproduce this problem?

@Tianyang-Zhang
Author

@Tianyang-Zhang Would you be able to share an example code snippet for a test program that could be used to reproduce this problem?

Sure, here is a minimal test program to reproduce the issue. The CRIU I'm using is v3.19.

#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <cstdlib>

#define PAGE_SIZE_4K 4096
const uint64_t GB = 1024ULL * 1024ULL * 1024ULL;

int main(int argc, char *argv[]) {
  uint64_t n_GB = 1 * GB;
  int n_forks = 19;
  if (argc > 1) {
    n_GB = strtoul(argv[1], nullptr, 10) * GB;
  }

  if (argc > 2) {
    n_forks = strtoul(argv[2], nullptr, 10);
  }

  // Allocate memory.
  void *ptr = mmap(NULL, n_GB, PROT_READ | PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  if (ptr == MAP_FAILED) {
    perror("mmap failed");
    return -1;
  }

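  // Populate all pages so that each one is backed by physical memory.
  for (uint64_t off = 0; off < n_GB; off += PAGE_SIZE_4K) {
    int val = random();
    // Cast to char * first: arithmetic on void * is not standard C++.
    memset(static_cast<char *>(ptr) + off, val, PAGE_SIZE_4K);
  }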

  printf("Root process PID: %d\n", getpid());

  // Forks.
  for (int i = 0; i < n_forks; i++) {
    pid_t pid = fork();
    if (pid < 0) {
      perror("fork failed");
      return -1;
    }

    if (pid == 0) {
      // Child process: stop forking and fall through to the sleep loop.
      printf("Forked %d processes\n", i + 1);
      break;
    }
  }

  // Put all processes to sleep.
  while (true)
    sleep(1000);
}

Compile and run:

g++ -std=c++11 -O0 -g -o ./fork_malloc ./fork_malloc.cc

./fork_malloc 2 19
Root process PID: 9038
Forked 1 processes
Forked 2 processes
Forked 3 processes
...

Dump:

criu dump --tree 9038 --images-dir /data/ckpt --track-mem --shell-job -v4 -o ./dump.log --leave-running

Please let me know if you need anything else. @rst0git
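
One way to check whether a dump broke COW is to compare the per-process smaps counters before and after running criu dump. The sketch below sums the Rss, Pss, Shared_Dirty and Private_Dirty fields over all VMAs of smaps-formatted text. The names SmapsTotals and sum_smaps are made up for this illustration and are not part of CRIU.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Totals, in kB, of the smaps counters that matter for COW accounting.
// (Hypothetical helper type for this sketch.)
struct SmapsTotals {
  uint64_t rss_kb = 0;
  uint64_t pss_kb = 0;
  uint64_t shared_dirty_kb = 0;
  uint64_t private_dirty_kb = 0;
};

// Sum the selected "Key:   value kB" lines across all VMAs in the input.
SmapsTotals sum_smaps(std::istream &in) {
  SmapsTotals totals;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string key;
    uint64_t value = 0;
    // VMA header lines and non-numeric lines (e.g. VmFlags) fail to parse
    // as "key number" and are skipped.
    if (!(fields >> key >> value))
      continue;
    if (key == "Rss:")
      totals.rss_kb += value;
    else if (key == "Pss:")
      totals.pss_kb += value;
    else if (key == "Shared_Dirty:")
      totals.shared_dirty_kb += value;
    else if (key == "Private_Dirty:")
      totals.private_dirty_kb += value;
  }
  return totals;
}
```

To use it, open /proc/<pid>/smaps with std::ifstream and pass the stream to sum_smaps() for each process in the tree. If COW is intact, a forked child should show most of the region under Shared_Dirty. If the dump broke COW, that amount moves to Private_Dirty.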

@rst0git
Member

rst0git commented Apr 13, 2024

Describe the results you received:
System memory usage increases after the entire process tree is seized. Eventually it reaches 40GB.

Describe the results you expected:
System memory should not increase. The process should still only use 2GB after the dump.

@Tianyang-Zhang Would it be possible to confirm if the memory utilisation remains increased after the checkpoint has been created, or it is increased only when criu dump is running?

What is the kernel version and CPU architecture of your system?

@rst0git
Member

rst0git commented Apr 13, 2024

the "forked memory" is marked as private in smaps, although it's actually anonymous shared.

Have you tried changing the following line in the example above?

-void *ptr = mmap(NULL, n_GB, PROT_READ | PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+void *ptr = mmap(NULL, n_GB, PROT_READ | PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);

@Tianyang-Zhang
Author

Would it be possible to confirm if the memory utilization remains increased after the checkpoint has been created, or it is increased only when criu dump is running?

@rst0git Thanks for helping! The memory usage remains increased after the checkpoint (using the --leave-running option).

What is the kernel version and CPU architecture of your system?

I'm using the AWS r5.8xlarge EC2 instance with the following:

Kernel Version: 5.14.0-58.el9.x86_64
Architecture: x86
CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
OS: CentOS Stream 9

Have you tried changing the following line in the example above?

Using MAP_SHARED doesn't trigger this issue.

@Tianyang-Zhang Tianyang-Zhang changed the title CRIU dumps duplicated non-COWed memory in all child processes CRIU dumps triggers COW on all memory in all child processes Apr 16, 2024
@avagin avagin self-assigned this Apr 16, 2024
@Tianyang-Zhang
Author

It looks like this is why vmsplice() triggers COW: https://lwn.net/Articles/849638/

@avagin
Member

avagin commented Apr 18, 2024

It looks like this is why vmsplice() triggers COW: https://lwn.net/Articles/849638/

You are right. I have found the same thing while investigating this issue:

                  - __get_user_pages
                     - 36.53% handle_mm_fault
                        - 35.53% __handle_mm_fault
                           - 32.33% do_wp_page

Both vmsplice and process_vm_readv use get_user_pages.

@avagin
Member

avagin commented Apr 18, 2024

@Tianyang-Zhang I don't see a better solution than using write instead of vmsplice. Could you try out the following patch:

diff --git a/criu/page-xfer.c b/criu/page-xfer.c
index 94f477414..7c73acd67 100644
--- a/criu/page-xfer.c
+++ b/criu/page-xfer.c
@@ -828,7 +828,20 @@ int page_xfer_predump_pages(int pid, struct page_xfer *xfer, struct page_pipe *p
 
                bufvec.iov_base = userbuf;
                bufvec.iov_len = bytes_read;
-               ret = vmsplice(ppb->p[1], &bufvec, 1, SPLICE_F_NONBLOCK | SPLICE_F_GIFT);
+               if (0) {
+                       ret = vmsplice(ppb->p[1], &bufvec, 1, SPLICE_F_NONBLOCK | SPLICE_F_GIFT);
+               } else {
+                       long off;
+
+                       for (off = 0; off < bytes_read; ) {
+                               ret = write(ppb->p[1], userbuf + off, bytes_read - off);
+                               if (ret == -1)
+                                       break;
+                               off += ret;
+                       }
+                       if (ret != -1)
+                               ret = bytes_read;
+               }
 
                if (ret == -1 || ret != bytes_read) {
                        pr_err("vmsplice: Failed to splice user buffer to pipe %ld\n", ret);

Ideally, we need to detect write-protected COW pages so that they are dumped only once.

@avagin avagin added the bug-mem label Apr 18, 2024
@rst0git
Member

rst0git commented Apr 18, 2024

It seems like this change was introduced with torvalds/linux@17839856fd58

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

@sirius8050

That is because your kernel version is too old. You can update your kernel to a version newer than 5.15, and it will no longer trigger COW.

@Tianyang-Zhang
Author

I don't see a better solution than using write instead of vmsplice. Could you try out the following patch:

I just tried the patch and COW is still triggered. I thought the trigger would be the vmsplice() call on the target process side, that is, the sys_vmsplice() in parasite.c::dump_pages().

@avagin
Member

avagin commented Apr 20, 2024

@Tianyang-Zhang Silly me. I patched the wrong vmsplice. Here is the right patch:

diff --git a/compel/arch/x86/plugins/std/syscalls/syscall_32.tbl b/compel/arch/x86/plugins/std/syscalls/syscall_32.tbl
index ab36a5cd6..6bb368d64 100644
--- a/compel/arch/x86/plugins/std/syscalls/syscall_32.tbl
+++ b/compel/arch/x86/plugins/std/syscalls/syscall_32.tbl
@@ -37,6 +37,7 @@ __NR_mprotect         125             sys_mprotect            (const void *addr, unsigned long len, unsigned
 __NR_getpgid           132             sys_getpgid             (pid_t pid)
 __NR_personality       136             sys_personality         (unsigned int personality)
 __NR_flock             143             sys_flock               (int fd, unsigned long cmd)
+__NR_writev            146             sys_writev              (unsigned int fd, const struct iovec *iov, unsigned long nr_segs)
 __NR_getsid            147             sys_getsid              (void)
 __NR_sched_setscheduler        156             sys_sched_setscheduler  (int pid, int policy, struct sched_param *p)
 __NR_nanosleep         162             sys_nanosleep           (struct timespec *rqtp, struct timespec *rmtp)
diff --git a/compel/arch/x86/plugins/std/syscalls/syscall_64.tbl b/compel/arch/x86/plugins/std/syscalls/syscall_64.tbl
index 4e843bee9..88e4e4812 100644
--- a/compel/arch/x86/plugins/std/syscalls/syscall_64.tbl
+++ b/compel/arch/x86/plugins/std/syscalls/syscall_64.tbl
@@ -18,6 +18,7 @@ __NR_rt_sigprocmask           14              sys_sigprocmask         (int how, k_rtsigset_t *set, k_rtsigse
 __NR_rt_sigreturn              15              sys_rt_sigreturn        (void)
 __NR_ioctl                     16              sys_ioctl               (unsigned int fd, unsigned int cmd, unsigned long arg)
 __NR_pread64                   17              sys_pread               (unsigned int fd, char *buf, size_t count, loff_t pos)
+__NR_writev                    20              sys_writev              (unsigned int fd, const struct iovec *iov, unsigned long nr_segs)
 __NR_mremap                    25              sys_mremap              (unsigned long addr, unsigned long old_len, unsigned long new_len, unsigned long flags, unsigned long new_addr)
 __NR_mincore                   27              sys_mincore             (void *addr, unsigned long size, unsigned char *vec)
 __NR_madvise                   28              sys_madvise             (unsigned long start, size_t len, int behavior)
diff --git a/criu/pie/parasite.c b/criu/pie/parasite.c
index e151ed656..2f9d34698 100644
--- a/criu/pie/parasite.c
+++ b/criu/pie/parasite.c
@@ -86,7 +86,8 @@ static int dump_pages(struct parasite_dump_pages_args *args)
        if (nr_segs > UIO_MAXIOV)
                nr_segs = UIO_MAXIOV;
        while (1) {
-               ret = sys_vmsplice(p, &iovs[args->off + off], nr_segs, SPLICE_F_GIFT | SPLICE_F_NONBLOCK);
+       //      ret = sys_vmsplice(p, &iovs[args->off + off], nr_segs, SPLICE_F_GIFT | SPLICE_F_NONBLOCK);
+               ret = sys_writev(p, &iovs[args->off + off], nr_segs);
                if (ret < 0) {
                        sys_close(p);
                        pr_err("Can't splice pages to pipe (%d/%d/%d)\n", ret, nr_segs, args->off + off);

@avagin
Member

avagin commented Apr 20, 2024

It seems like this change was introduced with torvalds/linux@17839856fd58

This patch was introduced in v5.8 and then it was rolled back in v5.9 (torvalds/linux@a308c71).

I messed up my environment and thought the issue existed in new kernels (6.8+), but it actually doesn't.

@Tianyang-Zhang could you verify that you can reproduce the issue on a non-rhel kernel?

@Tianyang-Zhang
Author

Tianyang-Zhang commented Apr 22, 2024

could you verify that you can reproduce the issue on a non-rhel kernel?

@avagin I tried on an Ubuntu host with a 5.15 kernel, and the issue is gone. I also tried CentOS 7 with a 3.10 kernel, and there is no issue there either. I haven't found an environment to try 5 ~ 5.8 kernels yet.

The patch somehow causes CRIU to hang when transferring pages. In any case, write() should be a lot slower than vmsplice(), so a kernel upgrade is probably the better choice for us.

Here is the dump log from the patch:

...
pie: 8312: Daemon waits for command
(26.298370) PPB: 1024 pages 1 segs 1024 pipe 205 off
(26.298375) Sent msg to daemon 66 0 0
(26.298379) Wait for ack 66 on daemon socket
pie: 8312: __fetched msg: 66 0 0
pie: 8312: __sent ack msg: 66 66 0
pie: 8312: Daemon waits for command
(26.299607) Fetched ack: 66 66 0
(26.299610) PPB: 1024 pages 1 segs 1024 pipe 206 off
(26.299614) Sent msg to daemon 66 0 0
(26.299618) Wait for ack 66 on daemon socket
pie: 8312: __fetched msg: 66 0 0
pie: 8312: __sent ack msg: 66 66 0
(26.300824) Fetched ack: 66 66 0
pie: 8312: Daemon waits for command
(26.300829) page-xfer: Transferring pages:
(26.300831) page-xfer:  buf 0/0
(26.300832) page-xfer:  buf 1024/1
(26.300834) page-xfer:  p 0x7fc479101000 [1024]
(26.303072) page-xfer:  buf 1024/1
(26.303079) page-xfer:  p 0x7fc479501000 [1024] --> hangs here

Ideally, we need to detect write-protected COW pages so that they are dumped only once.

Besides the vmsplice() COW issue, the write-protected COW pages are dumped multiple times. I thought this had been handled long ago, according to https://criu.org/Copy-on-write_memory

Would that be an easy fix? Thanks for helping!

@rst0git
Member

rst0git commented Apr 23, 2024

the write-protected COW pages are dumped multiple times.

I believe this is because the mappings are private (MAP_PRIVATE) and child processes created by fork() inherit copies of these mappings. Thus, CRIU saves a copy of the private mappings for each child process. In contrast, if the mappings are shared (MAP_SHARED), it would save a single copy that is used by all child processes.
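
The difference between the two mapping types after fork() can be seen with a small experiment. This is only a sketch of the kernel semantics and says nothing about how CRIU stores the pages. The function name parent_sees_child_write is made up for the example.

```cpp
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Map one anonymous page with the given flags (MAP_PRIVATE or MAP_SHARED),
// fork, let the child write to it, and report whether the parent sees the
// write. Returns 1 if visible, 0 if not, -1 on error.
// (Hypothetical helper for illustration.)
int parent_sees_child_write(int map_flags) {
  void *mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   map_flags | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED)
    return -1;
  volatile char *page = static_cast<volatile char *>(mem);
  page[0] = 0;  // Populate the page before forking.

  pid_t pid = fork();
  if (pid < 0) {
    munmap(mem, 4096);
    return -1;
  }
  if (pid == 0) {
    // MAP_PRIVATE: this write breaks COW and the child gets its own copy.
    // MAP_SHARED: this write lands in the page both processes share.
    page[0] = 1;
    _exit(0);
  }

  int status = 0;
  if (waitpid(pid, &status, 0) < 0) {
    munmap(mem, 4096);
    return -1;
  }
  int seen = (page[0] == 1) ? 1 : 0;
  munmap(mem, 4096);
  return seen;
}
```

With MAP_SHARED the parent observes the child's write. With MAP_PRIVATE it does not, even though both processes were backed by the same physical page until the write happened.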

@Tianyang-Zhang
Author

the mappings are private (MAP_PRIVATE) and child processes created by fork() inherit copies of these mappings

Is there any plan to support the MAP_PRIVATE case in the future? Or is CRIU unable to tell which pages have already been COWed and which are still write-protected, so that it has to dump everything when the mappings are private?
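
For reference, one signal the kernel exposes about this is bit 56 of a /proc/<pid>/pagemap entry ("page exclusively mapped", available since Linux 4.2). A present private anonymous page that is not exclusively mapped is still shared with another process, for example after fork(). The sketch below only decodes that bit. Whether it is sufficient or practical for CRIU is not established in this thread, and the names PagemapFlags, decode_pagemap_entry and read_pagemap_entry are hypothetical.

```cpp
#include <cstdint>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Flag bits of a 64-bit pagemap entry, as documented for /proc/<pid>/pagemap.
struct PagemapFlags {
  bool present;     // bit 63: page is in RAM
  bool swapped;     // bit 62: page is in swap
  bool exclusive;   // bit 56: page is mapped by exactly one process
  bool soft_dirty;  // bit 55: PTE is soft-dirty
};

PagemapFlags decode_pagemap_entry(uint64_t entry) {
  PagemapFlags flags;
  flags.present = ((entry >> 63) & 1) != 0;
  flags.swapped = ((entry >> 62) & 1) != 0;
  flags.exclusive = ((entry >> 56) & 1) != 0;
  flags.soft_dirty = ((entry >> 55) & 1) != 0;
  return flags;
}

// Read the pagemap entry covering `vaddr` from a pagemap file such as
// "/proc/self/pagemap". Returns false if the file cannot be read.
bool read_pagemap_entry(const char *pagemap_path, uintptr_t vaddr,
                        uint64_t *entry) {
  long page_size = sysconf(_SC_PAGESIZE);
  if (page_size <= 0)
    return false;
  int fd = open(pagemap_path, O_RDONLY);
  if (fd < 0)
    return false;
  // One 8-byte entry per virtual page, indexed by virtual page number.
  off_t offset = static_cast<off_t>(
      (vaddr / static_cast<uintptr_t>(page_size)) * sizeof(uint64_t));
  ssize_t n = pread(fd, entry, sizeof(*entry), offset);
  close(fd);
  return n == static_cast<ssize_t>(sizeof(*entry));
}
```

In the reproducer from this issue, one would expect the populated pages of the 2GB region to read as present and not exclusive in every process while COW is intact, and as exclusive once a page has been copied.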

@Tianyang-Zhang
Author

Hi, could you please confirm if there is a plan to support the MAP_PRIVATE case? We can conclude and close the issue after that last question, thanks!

@avagin
Member

avagin commented May 7, 2024

the mappings are private (MAP_PRIVATE) and child processes created by fork() inherit copies of these mappings

Is there any plan to support the MAP_PRIVATE case in the future? Or is CRIU unable to tell which pages have already been COWed and which are still write-protected, so that it has to dump everything when the mappings are private?

I am trying to figure out how we can do that. Any ideas are welcome.


github-actions bot commented Jun 7, 2024

A friendly reminder that this issue had no activity for 30 days.

@avagin avagin added no-auto-close Don't auto-close as a stale issue and removed stale-issue labels Jun 8, 2024
@Tianyang-Zhang
Copy link
Author

Hi @avagin @rst0git, thanks a lot for your previous help! May I ask whether the COW mechanism described on the CRIU wiki is implemented? https://criu.org/Copy-on-write_memory Are there any limitations to that method? It looks like the idea has been around for a long time.
