Memory leak on context switches #65

Closed
civodul opened this issue Oct 29, 2022 · 9 comments

civodul commented Oct 29, 2022

As discussed in the context of the Shepherd, Fibers 1.1.1 leaks memory on each context switch (!). The simplest reproducer (with Guile 3.0.8 or even 2.2.7) is this:

(use-modules (fibers)
             (fibers channels)
             ((fibers scheduler) #:select (yield-current-task))
             (ice-9 rdelim)
             (statprof))

(run-fibers
 (lambda ()
   (define channel
     (make-channel))
   (define leave-channel
     (make-channel))

   (spawn-fiber
    (lambda ()
      (sleep 10)
      (put-message leave-channel 'leave)))
   (spawn-fiber
    (lambda ()
      (let loop ()
        (yield-current-task)
        ;; (sleep 10)
        (loop))))
   (spawn-fiber
    (lambda ()
      (let loop ()
        (pk 'heap-size (assoc-ref (gc-stats) 'heap-size))
        (sleep 2)
        (loop))))
   (get-message leave-channel))
 ;; #:drain? #t
 #:parallelism 1                                  ;don't create POSIX threads
 #:hz 0)

This leak is proportional to the number of context switches: replace (yield-current-task) with (sleep 1) or similar, and everything's fine.
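That is, this variant of the spinning fiber keeps the heap stable, with the rest of the program unchanged:

;; Non-leaking variant: suspend via the scheduler instead of yielding.
(spawn-fiber
 (lambda ()
   (let loop ()
     (sleep 1)
     (loop))))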

civodul commented Oct 29, 2022

With Fibers 1.0.0 (replacing yield-current-task with yield-current-fiber from (fibers internal)), the problem doesn't seem to occur.
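Concretely, the Fibers 1.0.0 variant of the spinning fiber looks like this (a sketch, assuming, per the above, that 1.0.0 exports yield-current-fiber from (fibers internal)):

(use-modules ((fibers internal) #:select (yield-current-fiber)))

;; Fibers 1.0.0 variant of the spinning fiber from the reproducer.
(spawn-fiber
 (lambda ()
   (let loop ()
     (yield-current-fiber)
     (loop))))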

civodul commented Oct 29, 2022

Heap size becomes stable if we change the second fiber above to:

   (spawn-fiber
    (lambda ()
      (let loop ()
        ;; (yield-current-task)
        (suspend-current-task schedule-task)
        (loop))))

The difference between (yield-current-task) and (suspend-current-task schedule-task) is that the former returns #t after its call to abort-to-prompt, leaving the schedule-task call in non-tail position. And indeed, if we give the schedule-task call a continuation, the leak is back:

   (spawn-fiber
    (lambda ()
      (let loop ()
        ;; (yield-current-task)
        (suspend-current-task
         (lambda (sched k)
           ;; When the 'schedule-task' call is not in tail position, memory is leaked. 
           (->bool (schedule-task sched k))))
        (loop))))
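For reference, yield-current-task in (fibers scheduler) is roughly the following (a from-memory paraphrase, not the verbatim Fibers source); the trailing #t is what keeps schedule-task out of tail position:

(use-modules (ice-9 match))

;; Approximate paraphrase of yield-current-task from (fibers
;; scheduler); details may differ from the real definition.
(define (yield-current-task)
  (match (current-scheduler)
    (#f #f)
    (sched
     (suspend-current-task
      (lambda (sched k)
        ;; schedule-task is not in tail position: the handler still
        ;; returns #t afterwards, so a continuation frame survives.
        (schedule-task sched k)
        #t)))))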

civodul added the bug label on Oct 29, 2022

civodul commented Oct 30, 2022

The program below has a stable heap size when built with -O1, but not when built with -O1 -Ocps or -O2 (with Guile 3.0.8):

(use-modules (fibers)
             (fibers channels)
             (fibers scheduler)
             (statprof))

(run-fibers
 (lambda ()
   (define channel
     (make-channel))
   (define leave-channel
     (make-channel))

   (spawn-fiber
    (lambda ()
      (sleep 20)
      (put-message leave-channel 'leave)))
   (spawn-fiber
    (lambda ()
      (let loop ()
        (let ((tag ((@ (fibers scheduler) scheduler-prompt-tag) (current-scheduler))))
          (abort-to-prompt tag schedule-task))

        (loop))))
   (spawn-fiber
    (lambda ()
      (let loop ()
        (pk 'heap-size (assoc-ref (gc-stats) 'heap-size))
        (sleep 2)
        (loop))))
   (get-message leave-channel))

 #:parallelism 1                                  ;don't create POSIX threads
 #:hz 0)
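For anyone reproducing this from the REPL rather than via guild compile: in Guile 3.0, compile-file from (system base compile) takes an #:optimization-level keyword. A sketch only, with "repro.scm" as a hypothetical file name (the -O1 -Ocps and -O2 variants above were passed as guild command-line flags):

(use-modules (system base compile))

;; Compile the reproducer at -O1, then run the resulting bytecode.
;; "repro.scm" is a hypothetical file name for the program above.
(compile-file "repro.scm"
              #:output-file "repro.go"
              #:optimization-level 1)
(load-compiled "repro.go")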

civodul commented Oct 31, 2022

The culprit appears to be 84addfb. Reverting it gives us back stable heap usage in the example above. (Many thanks to @cwebber for telling me to just bisect — that it didn't occur to me is embarrassing :-))

cwebber commented Oct 31, 2022

Even more than that, it's the very same commit I found to be the problem when I bisected #58!

That commit caused bugs in both space and time, it appears. :)

civodul commented Nov 1, 2022

> The culprit appears to be 84addfb. Reverting it gives us back stable heap usage in the example above.

Instead of reverting it, a possible fix is:

diff --git a/fibers.scm b/fibers.scm
index 6e81f98..11c5389 100644
--- a/fibers.scm
+++ b/fibers.scm
@@ -150,7 +150,9 @@ parameter mutations to the fiber."
         (with-dynamic-state dynamic-state thunk))))
   (define (create-fiber sched thunk)
     (schedule-task sched
-                   (capture-dynamic-state thunk)))
+                   (capture-dynamic-state
+                    (lambda ()
+                      (%start-stack #t thunk)))))
   (cond
    (scheduler
     ;; When a scheduler is passed explicitly, it could be there is no

(That %start-stack call was removed in 84addfb.)

Surprising, because %start-stack does very little, and what it does only affects the stacks.c API.
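For reference, %start-stack is roughly the following (a from-memory paraphrase of its definition in Guile's boot-9.scm; treat the details as approximate). What matters here is that it wraps the thunk in a with-fluids dynamic extent:

;; Approximate paraphrase of %start-stack from Guile's boot-9.scm.
;; %stacks is a fluid consulted by the stack-capture machinery
;; (stacks.c, make-stack); note that the thunk runs inside a
;; with-fluids dynamic extent, just like in the patch below.
(define (%start-stack tag thunk)
  (let ((stacks (or (fluid-ref %stacks) '())))
    (with-fluids ((%stacks (acons tag #t stacks)))
      (thunk))))

In fact, the following patch has the same effect: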

diff --git a/fibers.scm b/fibers.scm
index 6e81f98..23d07fd 100644
--- a/fibers.scm
+++ b/fibers.scm
@@ -150,7 +150,11 @@ parameter mutations to the fiber."
         (with-dynamic-state dynamic-state thunk))))
   (define (create-fiber sched thunk)
     (schedule-task sched
-                   (capture-dynamic-state thunk)))
+                   (capture-dynamic-state
+                    (let ((f (make-fluid)))
+                      (lambda ()
+                        (with-fluids ((f 42))     ;WTF?
+                          (thunk)))))))
   (cond
    (scheduler
     ;; When a scheduler is passed explicitly, it could be there is no

So I suspect this whole thing exhibits a bug in dynamic stack capture or something.

civodul commented Nov 4, 2022

Escalating! 👉 https://issues.guix.gnu.org/59021

civodul commented Nov 8, 2022

The change in libguile/vm.c discussed in the Guile issue did fix a leak, but I have another one shown by this reproducer:

(use-modules (fibers)
             (fibers channels)
             (fibers scheduler)
             (statprof)
             (ice-9 match))

(include "heap-profiler.scm")

(run-fibers
 (lambda ()
   (define leave-channel
     (make-channel))

   (spawn-fiber
    (lambda ()
      (sleep 60)
      (put-message leave-channel 'leave)))
   (let* ((sock (socket AF_INET
                        (logior SOCK_STREAM SOCK_NONBLOCK SOCK_CLOEXEC)
                        IPPROTO_IP))
          (port 4567))
     (bind sock (make-socket-address AF_INET INADDR_LOOPBACK port))
     (listen sock 4)
     (format #t "listening on port ~a~%" port)
     (format #t "In Bash, run:
   while : ; do echo foo> /dev/tcp/localhost/~a ; done~%~%" port)

     (spawn-fiber
      (lambda ()
        (let loop ()
          (match (accept sock (logior SOCK_NONBLOCK SOCK_CLOEXEC))
            ((connection . peer)
             (close-port connection))
            (_ #f))
          (loop)))))
   (spawn-fiber
    (lambda ()
      (define delay 2)
      (let loop ((previous (assoc-ref (gc-stats) 'heap-total-allocated)))
        (let* ((stats (gc-stats))
               (allocated (assoc-ref stats 'heap-total-allocated)))
          (pk 'heap-size (assoc-ref stats 'heap-size)
              'MiB/s (/ (- allocated previous) (expt 2. 20) delay))
          ;; (profile-heap)
          (sleep delay)
          (loop allocated)))))
   (get-message leave-channel)

   (profile-heap))

 #:parallelism 1                                  ;don't create POSIX threads
 #:hz 0)

It's a relatively slow leak, as the allocation rate on my laptop is only ~1 MiB/s, but it is a leak nonetheless.

With the heap profiler called at the end, I get:

;;; (heap-size 7622656 MiB/s 1.01300048828125)
  %   type                               self (bytes)  avg obj size (bytes)
 33.9 program                                931936    32.4
 25.7 pair                                   705696    16.0
  9.0 unknown                                248000   361.0
  5.0 symbol                                 138112    32.0
  4.3 struct                                 117088    97.7
  3.8 vector                                 103219    30.7
  3.0 stringbuf                               82224    48.3
  2.7 pointer                                 75136    16.0
  1.8 string                                  49488    32.0
  1.7 bytevector                              48032    59.2
  1.6 variable                                42816    20.3
  1.4 smob                                    38192    38.3
  1.2 hash-table                              34272    32.0
  1.1 heap-number                             30928    32.0
  0.9 port                                    24992    32.0
  0.8 weak-table                              21088    28.5
  0.7 vm-continuation                         19760    30.9
  0.7 atomic-box                              18160    32.0
  0.3 partial-continuation                     8832    32.0
  0.2 primitive                                4848    16.0
  0.1 weak-vector                              3424    18.9
  0.0 dynamic-state                            1296    16.8
  0.0 primitive-generic                        1280    32.0
  0.0 keyword                                   864    16.0
  0.0 frame                                     272    34.0
  0.0 fluid                                     256    18.3
  0.0 bitvector                                 176    35.2
  0.0 foreign-program                            96    32.0
  0.0 array                                      48    48.0
  0.0 weak-set                                   48    48.0
sampled heap: 2.62316 MiB (heap size: 7.26953 MiB)

... suggesting we've allocated and retained many closures ("program" in the table above).

civodul closed this as completed in a9c2dbb on Nov 9, 2022

civodul commented Nov 9, 2022

Turns out the leak was introduced in c25dcb9. Fixed now!
