Replies: 9 comments 16 replies
-
I can confirm this issue too. In our testing environment we had a t4g.medium, and when the instance ran out of memory (during an Ansible run) the whole instance just froze and could never recover on its own. When I was on the instance before it froze, I could see that it was always copying S3 files and I observed very high I/O. This was happening every day for us (we provision our test environment every day) until we upgraded the instance to m6g.large. It was not running out of disk space.
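A hedged sketch of what one could watch during such a provisioning run to catch the freeze in the act, assuming the sysstat tools are installed:

```sh
# Per-process and per-device I/O, refreshed every 5 seconds (sysstat package)
pidstat -d 5
iostat -xz 5

# Memory headroom and any OOM-killer activity so far
free -m
sudo dmesg -T | grep -iE 'out of memory|oom-kill|killed process'
```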
-
Hey, thank you for reporting this issue to us. A few questions:
- Can you elaborate on the instance type you are using in your environment? Are you seeing this behavior on a specific instance type/size, or are all instances impacted?
- Have you seen this behavior since you migrated to AL2023, or is it something you could only notice on recent AL2023 kernels?
- What are the exact kernel versions you are using on AL2023 and Amazon Linux 2018?
- Do you have any way by which we can reproduce this behavior locally?
- Can you share logs from the impacted AL2023 instances that may show stack traces when the instance goes unresponsive?
- Will you be able to share the EC2 instance ID for the impacted instance and a timestamp in UTC for when it got impaired?
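For reference, a sketch of one way to collect most of that information from an impacted instance (standard commands and the IMDSv2 endpoint; adjust to your environment):

```sh
# Kernel and OS versions
uname -r
cat /etc/os-release

# EC2 instance ID via IMDSv2
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

# Kernel messages / stack traces; with persistent journaling, -b -1 shows
# the boot during which the hang occurred after a reboot
sudo dmesg -T | tail -n 200
sudo journalctl -k -b -1 --no-pager | tail -n 500
```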
-
I think you could try DAMON_RECLAIM and/or DAMON_LRU_SORT. You could also utilize these features in a more flexible and finely tuned way using damo. For more information and resources on DAMON, see the project site.
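A minimal sketch of turning on the in-kernel DAMON_RECLAIM module, assuming the running kernel was built with CONFIG_DAMON_RECLAIM (damo, installable via pip, is the more flexible userspace route):

```sh
# Is DAMON_RECLAIM available in this kernel?
ls /sys/module/damon_reclaim/parameters/ 2>/dev/null \
  || echo "DAMON_RECLAIM not available in this kernel"

# Enable proactive reclaim; it only kicks in between the module's
# free-memory watermarks, so it can be left running
echo Y | sudo tee /sys/module/damon_reclaim/parameters/enabled

# Confirm it is running (kdamond_pid is -1 when inactive)
cat /sys/module/damon_reclaim/parameters/enabled
cat /sys/module/damon_reclaim/parameters/kdamond_pid
```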
-
I'm seeing a similar issue on a variety of nodes/workloads that have little in common other than observed memory pressure:
- The OOM killer does not step in when it should.
- Memory usage spikes, either suddenly or gradually depending on the workload.
- Can't SSH into the machine.
- The machine stops responding in general, or responds only sporadically.
- The kernel eventually stops responding to ARP packets, which causes EC2 status check failures.
This seems to be caused by excessive application process memory consumption; I've seen it triggered by our GitLab containers, Firefox/Geckodriver containers, and possibly a Zulip process running on Ubuntu. The common denominator is that the application process starts to use more memory than expected, but instead of being OOM-killed, the machine freezes up. It sort of sounds like this, but it feels to me that this wasn't such a problem with earlier kernels: https://news.ycombinator.com/item?id=28771484 One of the better comments in that thread is near the end:
I can't really tell whether these sysctls are going to be of any use, however. These days most people don't permit direct root login, so admin_reserve_kbytes is not much use. user_reserve_kbytes only takes effect in never-overcommit mode. I can't tell whether min_free_kbytes will make things better or worse. Best bet might be to install …
The machine that most recently fell over was an ECS host running only containerized workloads that should have been fairly light on disk usage - they don't really use the container filesystem except for maybe a bit of logging (they shouldn't even be doing that, but it can be hard to get rid of all of it in an off-the-shelf container).
This seems to also affect recent Ubuntu 22.04 AMIs, and it definitely affects ECS-Optimized Amazon Linux 2023, on Intel and AMD hardware. Ubuntu 22.04 has also recently migrated to 6.x kernels, so I'm starting to suspect an upstream kernel bug.
All very disappointing. Cgroups are supposed to protect the machine as a whole from going south if one container starts to thrash, but the default settings are not remotely good enough to accomplish that: no resource protection for sshd or the ECS agent, and the kernel can't even protect itself enough to respond to ARP! As a possible "solution", I've found myself reducing the memory.low protections on the containers that might not need as much, in order to make more resources available to the process that started to thrash.
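On the "no resource protection for sshd or the ECS agent" point, here is a hedged sketch of giving those services a cgroup-v2 memory floor via systemd drop-ins (AL2023 uses cgroup v2 by default; the values and the ecs.service unit name are assumptions to adjust for your hosts):

```sh
# Protection is hierarchical, so the parent slice needs a budget too,
# otherwise the children's MemoryMin is capped at 0.
sudo mkdir -p /etc/systemd/system/system.slice.d
cat <<'EOF' | sudo tee /etc/systemd/system/system.slice.d/10-memory-protect.conf
[Slice]
MemoryMin=512M
EOF

# Keep sshd reachable under global memory pressure (values are illustrative).
sudo mkdir -p /etc/systemd/system/sshd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/sshd.service.d/10-memory-protect.conf
[Service]
MemoryMin=64M
EOF

# Same idea for the ECS agent (unit name assumed to be ecs.service).
sudo mkdir -p /etc/systemd/system/ecs.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ecs.service.d/10-memory-protect.conf
[Service]
MemoryMin=128M
EOF

sudo systemctl daemon-reload
sudo systemctl restart sshd ecs
```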
-
On a troubling instance, does it make any difference if …
Run …
-
Experiencing the same issue, with … and output of uname -r ….
We also experience the same with another instance, which is on Graviton. Both instances are different types.
-
@abuehaze14 enabling zswap will steal some CPU cycles, but you'll be able to log in to your busy VM. The tradeoff is reasonable, IMO.
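For anyone who wants to try that, a small sketch of checking and enabling zswap (a runtime toggle plus a persistent kernel argument via grubby, matching the grubby usage elsewhere in this thread; zswap still needs a backing swap device or file):

```sh
# Current state
cat /sys/module/zswap/parameters/enabled

# Enable at runtime
echo 1 | sudo tee /sys/module/zswap/parameters/enabled

# Persist across reboots
sudo grubby --update-kernel=ALL --args="zswap.enabled=1"

# zswap only compresses pages on their way to swap, so make sure
# some swap actually exists
swapon --show
```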
-
I would recommend configuring at least 1-2 GB of swap space to help the kernel with its memory management responsibilities (e.g. reclaim) and combining it with one of the userspace OOM daemon solutions out there (e.g. earlyoom, systemd-oomd).
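For example, a minimal sketch of adding a 2 GB swap file and a userspace OOM daemon (earlyoom is shown; its availability in your repositories, and whether you prefer systemd-oomd instead, are up to you):

```sh
# Create and enable a 2 GB swap file
sudo dd if=/dev/zero of=/swapfile bs=1M count=2048 status=progress
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab

# Install and enable a userspace OOM killer (package availability is an assumption)
sudo dnf install -y earlyoom
sudo systemctl enable --now earlyoom
```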
Useful references: Pressure Stall Information (PSI), systemd-oomd, earlyoom, nohang, oomd.
Additional tuning:
sudo dnf install --allowerasing -y grubby dbus dbus-daemon polkit systemd-container systemd-pam udisks2
# set default governor to performance and enable psi
sudo grubby --update-kernel=ALL --args="cpufreq.default_governor=performance psi=1"
# sysctl tuning
sudo mkdir -p /etc/sysctl.d
cat <<'EOF' | sudo tee /etc/sysctl.d/99-custom-tuning.conf
# Adjust the kernel printk to minimize serial console logging.
# The defaults are very verbose and they can have a performance impact.
# Note: 4 4 1 7 should also be fine, just not debug i.e. 7
kernel.printk=3 4 1 7
# This setting enables better interactivity for desktop workloads and is
# not typically suitable for most server type workloads e.g. postgresdb.
#
# This feature is aimed at improving system responsiveness under load by
# automatically grouping task groups with similar execution patterns.
# While beneficial for desktop responsiveness, in server environments,
# especially those running Kubernetes, this behavior might not always
# be desirable as it could lead to uneven distribution of CPU resources
# among pods.
#
# man 7 sched
#
# The use of the cgroups(7) cpu controller to place processes in cgroups
# other than the root cpu cgroup overrides the effect of auto-grouping.
#
# https://cateee.net/lkddb/web-lkddb/SCHED_AUTOGROUP.html
# https://www.postgresql.org/message-id/[email protected]
kernel.sched_autogroup_enabled=0
# Specifies the minimum number of kilobytes to keep free across the system.
# This is used to determine an appropriate value for each low memory zone,
# each of which is assigned a number of reserved free pages in proportion
# to their size.
#
# Setting vm.min_free_kbytes to an extremely low value prevents the system from
# reclaiming memory, which can result in system hangs and oom-killing processes.
# Setting min_free_kbytes too high e.g. 5–10% of total system memory can cause
# the system to enter an out-of-memory state immediately, resulting in the
# system spending too much time trying to reclaim memory.
#
# As a rule of thumb, set this value to between 1-3% of available system
# memory and adjust this value up or down to meet the needs of your application
# workload.
#
# Ensure that the reserved kernel memory is sufficient to sustain a high
# rate of packet buffer allocations as the default value may be too small.
vm.min_free_kbytes=1048576
# Ensure that the instance does not try to swap too early.
vm.swappiness=10
EOF
# Apply tuning
sudo sysctl --system
sudo systemctl reboot
-
Summary
Experiencing a critical issue where Amazon Linux 2023 EC2 instances become non-responsive under high memory load. This behavior contrasts with Amazon Linux 2018, where the system handles similar scenarios without becoming unresponsive.
Description
We run a Node.js application on Amazon Linux 2023 that utilizes AWS Systems Manager (SSM) and EBS volumes. The application spawns 60 child processes that process data, potentially causing significant memory usage spikes. When memory utilization approaches 95%, the entire EC2 instance becomes unresponsive: no application or system logs are accessible, SSH connections are refused, and AWS marks the instance as impaired.
Simultaneously, there is a noticeable spike in read operations as observed in EBS metrics, though write operations do not show a similar increase. Monitoring with iotop has shown unusual spikes in read operations across multiple processes right before the system becomes non-responsive.
Hypothesis
It appears the system may be dumping the file cache due to memory pressure, causing all file operations to revert to disk and leading to severe performance degradation. Enabling swap seems to mitigate the issue, suggesting that memory management differences between Amazon Linux 2023 and 2018 are potentially responsible.
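One way to test this hypothesis on a reproducing instance is to watch memory pressure stall information and page-cache activity as utilization climbs (a sketch; PSI must be enabled in the kernel, which the psi=1 boot argument mentioned earlier in this thread takes care of if it is compiled in but default-off):

```sh
# Stall time spent waiting on memory (rising "full" numbers = thrashing)
watch -n 2 cat /proc/pressure/memory

# Page cache collapsing while the workload runs
watch -n 2 'grep -E "MemAvailable|^Cached|SwapFree" /proc/meminfo'

# Major page faults and reclaim activity per process / system-wide (sysstat)
pidstat -r 5
sar -B 5
```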
Comparison with Amazon Linux 2018
The same application on Amazon Linux 2018 handles memory pressure differently by possibly sacrificing high-memory child processes to maintain overall system stability.
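If the goal is to get AL2023 to behave more like AL2018 did (kill the hungry children rather than stall the host), two hedged options are nudging the children's OOM score and/or capping the whole worker tree with a cgroup limit (the process pattern, unit name, path, and the 28G figure are placeholders):

```sh
# Option 1: make the worker processes preferred OOM-kill victims
for pid in $(pgrep -f 'node .*worker'); do   # pattern is a placeholder
  sudo choom -p "$pid" -n 500                # choom ships with util-linux
done

# Option 2: run the application under a transient systemd unit with a hard
# memory ceiling, so the OOM killer fires inside that cgroup only
sudo systemd-run --unit=node-workers \
  -p MemoryMax=28G -p MemorySwapMax=0 \
  node /opt/app/app.js                        # path is a placeholder

# Watch for oom_kill events in that cgroup
cat /sys/fs/cgroup/system.slice/node-workers.service/memory.events
```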
Discussion Points
Are there sysctl or other kernel parameters that could be tuned to improve how memory pressure is handled?
Attachments
[1] Screenshot of read operation spike right before the stuck behavior.
[2] Screenshot of increased disk read operations right before the stuck behavior by different processes.
[3] Output of sysctl -a from AML 2023: sysctl-2023.txt
[4] Output of sysctl -a from AML 2018: sysctl-2018.txt