Memory jump in high-load scenario with nsc/nse release/v1.11.2+ #703

denis-tingaikin · 2024-09-09T09:47:56Z

Sometimes the memory consumption is uncontrollable increasing in the nsmgr.

Reproduced with release v1.14.0-rc.2

szvincze · 2024-09-23T07:01:19Z

We tested with custom clients and endpoints that used SDK v1.13.2 where we observed the above described behavior.
One time we tried to let the nsmgrs reach the memory limit and to be restarted. Surprisingly, after the nsmgr processes were OOMKilled then came up properly and did not show any significant increase of memory consumption. On this picture the restarts happened at 0:00 and 05:00.

After that we also tried to restart nsmgr manually to see if it had effect on the increasing memory consumption. We restarted nsmgrs in different ways in the different tests to see if the restart method matters, but haven't seen difference when we killed the process or deleted the pod etc, it always flattened the memory diagram.

Here I add some more pictures we took during our tests.

Each test showed the same behavior. For some reason the restart of nsmgr completely eliminated the memory increase.

This last one shows the test when we let the system continue to run after the restart to see if it starts growing some time later but nothing suspicious happened for more that 11 hours:

Do you have any idea what could cause this behavior?

denis-tingaikin · 2024-09-23T14:27:10Z

Do you have any idea what could cause this behavior?

My guess the problem is in the diff in the nsc from v1.11.2 to main. Most likely, monitor streams are not closing. Currently, I am working with quick reproduction; I will inform you when we get more information.

szvincze · 2024-09-24T07:36:16Z

Do you have any idea what could cause this behavior?

My guess the problem is in the diff in the nsc from v1.11.2 to main. Most likely, monitor streams are not closing. Currently, I am working with quick reproduction; I will inform you when we get more information.

Would it be an explanation why the memory graph flattens after restarting nsmgr?

szvincze · 2024-09-24T09:56:51Z

Just a small addition. It seems when we stop the test and delete the respective test namespace nsmgr gives back the memory it leaked, which was not the case previously.

zolug · 2024-10-01T16:04:25Z

Hi @denis-tingaikin,

edit: I think I might have found a possible src of resource leaking. In our NSC the streams returned by MonitorConnections linger on, as they are using the main context... After running a test with fixed context handling for the streaming RPC, the steady increase in memory usage disappeared.

The NSC used in Meridio: https://github.com/Nordix/Meridio/blob/master/cmd/proxy/main.go#L243
(There's another NSC as well, but a memory increase is visible when using only the NSC above.)

Simply on a Kind cluster I also noticed the differences Szilard reported (by leaving the cluster intact without any traffic). The nsmgrs with the most memory hosted more (custom) NSEs (in both cases below 1 nsmgr had 3 NSEs, 1 nsmgr had 1 NSE, and 2 nsmgrs run without NSEs).

Meridio 1.1.1

Meridio 1.1.2

denis-tingaikin · 2024-10-09T15:27:08Z

Hello @zolug,

I mentioned this problem: #675 on the last call. The problem actually can be on the client side. At this moment we're avoiding it on the server side, which can be not a perfect solution. Currently planned to have a look a bit deeply into code on the client side.

denis-tingaikin · 2024-11-05T16:14:06Z

@szvincze Could we test ghcr.io/networkservicemesh/ci/cmd-nsmgr:d1e85e8 when you get a time?

Note: Focus on nsmgr mem/fd consumption. I expected that its not leaking.

szvincze · 2024-11-08T09:36:21Z

@szvincze Could we test ghcr.io/networkservicemesh/ci/cmd-nsmgr:d1e85e8 when you get a time?

Note: Focus on nsmgr mem/fd consumption. I expected that its not leaking.

We just ran a 6 hours test and focused on file descriptors. It looks very similar to our previous tests. There are a bit more FDs for nsmgr after the test, but that was the case with the workarounds too. Even though the figures are similar from memory consumption point of view, if we should also compare it to the v1.14.1 release then I assume we should run another 60 hours test with this build.

denis-tingaikin · 2024-11-08T09:52:04Z

It looks good; it seems like we are already able to get rid of using workarounds in the next releases. Let's keep the open this ticket till next RC.

denis-tingaikin changed the title ~~Memory jump in hightload~~ Memory jump in high-load scenario Sep 9, 2024

denis-tingaikin added this to Release v1.14.0 Sep 10, 2024

denis-tingaikin self-assigned this Sep 10, 2024

NikitaSkrynnik moved this to In Progress in Release v1.14.0 Sep 10, 2024

denis-tingaikin moved this from In Progress to Blocked in Release v1.14.0 Sep 13, 2024

denis-tingaikin moved this from Blocked to In Progress in Release v1.14.0 Sep 20, 2024

denis-tingaikin changed the title ~~Memory jump in high-load scenario~~ Memory jump in high-load scenario with nsc/nse release/v1.11.2+ Sep 20, 2024

denis-tingaikin added this to Release v1.14.1 Sep 20, 2024

Ex4amp1e moved this to In Progress in Release v1.14.1 Sep 24, 2024

denis-tingaikin mentioned this issue Sep 24, 2024

NSMgr registry streams could leak networkservicemesh/sdk#1660

Open

denis-tingaikin moved this from In Progress to Moved to next release in Release v1.14.0 Sep 24, 2024

denis-tingaikin moved this from In Progress to Blocked in Release v1.14.1 Sep 26, 2024

Ex4amp1e moved this from Blocked to In Progress in Release v1.14.1 Oct 3, 2024

Ex4amp1e moved this from In Progress to Blocked in Release v1.14.1 Oct 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory jump in high-load scenario with nsc/nse release/v1.11.2+ #703

Memory jump in high-load scenario with nsc/nse release/v1.11.2+ #703

denis-tingaikin commented Sep 9, 2024 •

edited

Loading

szvincze commented Sep 23, 2024

denis-tingaikin commented Sep 23, 2024

szvincze commented Sep 24, 2024

szvincze commented Sep 24, 2024

zolug commented Oct 1, 2024 •

edited

Loading

denis-tingaikin commented Oct 9, 2024

denis-tingaikin commented Nov 5, 2024 •

edited

Loading

szvincze commented Nov 8, 2024

denis-tingaikin commented Nov 8, 2024

Memory jump in high-load scenario with nsc/nse release/v1.11.2+ #703

Memory jump in high-load scenario with nsc/nse release/v1.11.2+ #703

Comments

denis-tingaikin commented Sep 9, 2024 • edited Loading

szvincze commented Sep 23, 2024

denis-tingaikin commented Sep 23, 2024

szvincze commented Sep 24, 2024

szvincze commented Sep 24, 2024

zolug commented Oct 1, 2024 • edited Loading

denis-tingaikin commented Oct 9, 2024

denis-tingaikin commented Nov 5, 2024 • edited Loading

szvincze commented Nov 8, 2024

denis-tingaikin commented Nov 8, 2024

denis-tingaikin commented Sep 9, 2024 •

edited

Loading

zolug commented Oct 1, 2024 •

edited

Loading

denis-tingaikin commented Nov 5, 2024 •

edited

Loading