Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory jump in high-load scenario with nsc/nse release/v1.11.2+ #703

Open
denis-tingaikin opened this issue Sep 9, 2024 · 9 comments
Open
Assignees

Comments

@denis-tingaikin
Copy link
Member

denis-tingaikin commented Sep 9, 2024

Sometimes the memory consumption is uncontrollable increasing in the nsmgr.

Reproduced with release v1.14.0-rc.2

image

@denis-tingaikin denis-tingaikin changed the title Memory jump in hightload Memory jump in high-load scenario Sep 9, 2024
@denis-tingaikin denis-tingaikin self-assigned this Sep 10, 2024
@NikitaSkrynnik NikitaSkrynnik moved this to In Progress in Release v1.14.0 Sep 10, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.14.0 Sep 13, 2024
@denis-tingaikin denis-tingaikin moved this from Blocked to In Progress in Release v1.14.0 Sep 20, 2024
@denis-tingaikin denis-tingaikin changed the title Memory jump in high-load scenario Memory jump in high-load scenario with nsc/nse release/v1.11.2+ Sep 20, 2024
@szvincze
Copy link

We tested with custom clients and endpoints that used SDK v1.13.2 where we observed the above described behavior.
One time we tried to let the nsmgrs reach the memory limit and to be restarted. Surprisingly, after the nsmgr processes were OOMKilled then came up properly and did not show any significant increase of memory consumption. On this picture the restarts happened at 0:00 and 05:00.

Screenshot from 2024-09-23 08-37-31

After that we also tried to restart nsmgr manually to see if it had effect on the increasing memory consumption. We restarted nsmgrs in different ways in the different tests to see if the restart method matters, but haven't seen difference when we killed the process or deleted the pod etc, it always flattened the memory diagram.

Here I add some more pictures we took during our tests.
nsmgr-memory-after-restart-1

Screenshot from 2024-09-23 08-36-34

Screenshot from 2024-09-23 08-35-59

Screenshot from 2024-09-23 08-35-14

Screenshot from 2024-09-23 08-34-43

Each test showed the same behavior. For some reason the restart of nsmgr completely eliminated the memory increase.

This last one shows the test when we let the system continue to run after the restart to see if it starts growing some time later but nothing suspicious happened for more that 11 hours:
Screenshot from 2024-09-23 08-34-00

Do you have any idea what could cause this behavior?

@denis-tingaikin
Copy link
Member Author

Do you have any idea what could cause this behavior?

My guess the problem is in the diff in the nsc from v1.11.2 to main. Most likely, monitor streams are not closing. Currently, I am working with quick reproduction; I will inform you when we get more information.

@szvincze
Copy link

Do you have any idea what could cause this behavior?

My guess the problem is in the diff in the nsc from v1.11.2 to main. Most likely, monitor streams are not closing. Currently, I am working with quick reproduction; I will inform you when we get more information.

Would it be an explanation why the memory graph flattens after restarting nsmgr?

@Ex4amp1e Ex4amp1e moved this to In Progress in Release v1.14.1 Sep 24, 2024
@szvincze
Copy link

Just a small addition. It seems when we stop the test and delete the respective test namespace nsmgr gives back the memory it leaked, which was not the case previously.
Screenshot from 2024-09-24 11-54-18

@denis-tingaikin denis-tingaikin moved this from In Progress to Moved to next release in Release v1.14.0 Sep 24, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.14.1 Sep 26, 2024
@zolug
Copy link

zolug commented Oct 1, 2024

Hi @denis-tingaikin,

edit: I think I might have found a possible src of resource leaking. In our NSC the streams returned by MonitorConnections linger on, as they are using the main context... After running a test with fixed context handling for the streaming RPC, the steady increase in memory usage disappeared.

The NSC used in Meridio: https://github.com/Nordix/Meridio/blob/master/cmd/proxy/main.go#L243
(There's another NSC as well, but a memory increase is visible when using only the NSC above.)

Simply on a Kind cluster I also noticed the differences Szilard reported (by leaving the cluster intact without any traffic). The nsmgrs with the most memory hosted more (custom) NSEs (in both cases below 1 nsmgr had 3 NSEs, 1 nsmgr had 1 NSE, and 2 nsmgrs run without NSEs).

Meridio 1.1.1
Screenshot from 2024-10-01 17-48-16

Meridio 1.1.2
Screenshot from 2024-10-01 13-29-58

@Ex4amp1e Ex4amp1e moved this from Blocked to In Progress in Release v1.14.1 Oct 3, 2024
@Ex4amp1e Ex4amp1e moved this from In Progress to Blocked in Release v1.14.1 Oct 4, 2024
@denis-tingaikin
Copy link
Member Author

Hello @zolug,

I mentioned this problem: #675 on the last call. The problem actually can be on the client side. At this moment we're avoiding it on the server side, which can be not a perfect solution. Currently planned to have a look a bit deeply into code on the client side.

@denis-tingaikin
Copy link
Member Author

denis-tingaikin commented Nov 5, 2024

@szvincze Could we test ghcr.io/networkservicemesh/ci/cmd-nsmgr:d1e85e8 when you get a time?

Note: Focus on nsmgr mem/fd consumption. I expected that its not leaking.

@szvincze
Copy link

szvincze commented Nov 8, 2024

@szvincze Could we test ghcr.io/networkservicemesh/ci/cmd-nsmgr:d1e85e8 when you get a time?

Note: Focus on nsmgr mem/fd consumption. I expected that its not leaking.

We just ran a 6 hours test and focused on file descriptors. It looks very similar to our previous tests. There are a bit more FDs for nsmgr after the test, but that was the case with the workarounds too. Even though the figures are similar from memory consumption point of view, if we should also compare it to the v1.14.1 release then I assume we should run another 60 hours test with this build.

FDs-before-test
FDs-after-3h
FDs-after-6h
FDs-after-test

@denis-tingaikin
Copy link
Member Author

It looks good; it seems like we are already able to get rid of using workarounds in the next releases. Let's keep the open this ticket till next RC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Moved to next release
Status: Blocked
Development

No branches or pull requests

3 participants