Replies: 9 comments 16 replies
-
I can confirm this issue too. In our testing environment we had a t4g.medium, and when the instance ran out of memory (during an Ansible run) the whole instance just froze and could never recover on its own. When I was on the instance before it froze, I could see that it was always copying S3 files and I observed very high I/O. This was happening every day for us (we provision our test environment every day) until we upgraded the instance to m6g.large. It was not running out of disk space.
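A hedged sketch of what one could watch during such a provisioning run to catch the freeze in the act, assuming the sysstat tools are installed:

```sh
# Per-process and per-device I/O, refreshed every 5 seconds (sysstat package)
pidstat -d 5
iostat -xz 5

# Memory headroom and any OOM-killer activity so far
free -m
sudo dmesg -T | grep -iE 'out of memory|oom-kill|killed process'
```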
-
Hey, thank you for reporting this issue to us. A few questions:
- Can you elaborate on the instance type you are using in your environment? Are you seeing this behavior on a specific instance type/size, or are all instances impacted?
- Have you seen this behavior since you migrated to AL2023, or is it something you could only notice on recent AL2023 kernels?
- What are the exact kernel versions you are using on AL2023 and Amazon Linux 2018?
- Do you have any way by which we can reproduce this behavior locally?
- Can you share logs from the impacted AL2023 instances that may show stack traces when the instance goes unresponsive?
- Will you be able to share the EC2 instance ID for the impacted instance and a timestamp in UTC for when it got impaired?
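For reference, a sketch of one way to collect most of that information from an impacted instance (standard commands and the IMDSv2 endpoint; adjust to your environment):

```sh
# Kernel and OS versions
uname -r
cat /etc/os-release

# EC2 instance ID via IMDSv2
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

# Kernel messages / stack traces; with persistent journaling, -b -1 shows
# the boot during which the hang occurred after a reboot
sudo dmesg -T | tail -n 200
sudo journalctl -k -b -1 --no-pager | tail -n 500
```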
-
I think you could try DAMON_RECLAIM and/or DAMON_LRU_SORT. You could also utilize these features in a more flexible and finely tuned way using damo. For more information and resources on DAMON, see the project site.
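A minimal sketch of turning on the in-kernel DAMON_RECLAIM module, assuming the running kernel was built with CONFIG_DAMON_RECLAIM (damo, installable via pip, is the more flexible userspace route):

```sh
# Is DAMON_RECLAIM available in this kernel?
ls /sys/module/damon_reclaim/parameters/ 2>/dev/null \
  || echo "DAMON_RECLAIM not available in this kernel"

# Enable proactive reclaim; it only kicks in between the module's
# free-memory watermarks, so it can be left running
echo Y | sudo tee /sys/module/damon_reclaim/parameters/enabled

# Confirm it is running (kdamond_pid is -1 when inactive)
cat /sys/module/damon_reclaim/parameters/enabled
cat /sys/module/damon_reclaim/parameters/kdamond_pid
```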
-
I'm seeing a similar issue on a variety of nodes/workloads that have little in common other than observed memory pressure:
- The OOM killer does not step in when it should.
- Memory usage spikes, either suddenly or gradually depending on the workload.
- Can't SSH into the machine.
- The machine stops responding in general, or responds only sporadically.
- The kernel eventually stops responding to ARP packets, which causes EC2 status check failures.
This seems to be caused by excessive application process memory consumption; I've seen it triggered by our GitLab containers, Firefox/Geckodriver containers, and possibly a Zulip process running on Ubuntu. The common denominator is that the application process starts to use more memory than expected, but instead of being OOM-killed, the machine freezes up. It sort of sounds like this, but it feels to me that this wasn't such a problem with earlier kernels: https://news.ycombinator.com/item?id=28771484 One of the better comments in that thread is near the end:
I can't really tell whether these sysctls are going to be of any use, however. These days most people don't permit direct root login, so admin_reserve_kbytes is not much use. user_reserve_kbytes only takes effect in never-overcommit mode. I can't tell whether min_free_kbytes will make things better or worse. Best bet might be to install …
The machine that most recently fell over was an ECS host running only containerized workloads that should have been fairly light on disk usage - they don't really use the container filesystem except for maybe a bit of logging (they shouldn't even be doing that, but it can be hard to get rid of all of it in an off-the-shelf container).
This seems to also affect recent Ubuntu 22.04 AMIs, and it definitely affects ECS-Optimized Amazon Linux 2023, on Intel and AMD hardware. Ubuntu 22.04 has also recently migrated to 6.x kernels, so I'm starting to suspect an upstream kernel bug.
All very disappointing. Cgroups are supposed to protect the machine as a whole from going south if one container starts to thrash, but the default settings are not remotely good enough to accomplish that: no resource protection for sshd or the ECS agent, and the kernel can't even protect itself enough to respond to ARP! As a possible "solution", I've found myself reducing the memory.low protections on the containers that might not need as much, in order to make more resources available to the process that started to thrash.
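On the "no resource protection for sshd or the ECS agent" point, here is a hedged sketch of giving those services a cgroup-v2 memory floor via systemd drop-ins (AL2023 uses cgroup v2 by default; the values and the ecs.service unit name are assumptions to adjust for your hosts):

```sh
# Protection is hierarchical, so the parent slice needs a budget too,
# otherwise the children's MemoryMin is capped at 0.
sudo mkdir -p /etc/systemd/system/system.slice.d
cat <<'EOF' | sudo tee /etc/systemd/system/system.slice.d/10-memory-protect.conf
[Slice]
MemoryMin=512M
EOF

# Keep sshd reachable under global memory pressure (values are illustrative).
sudo mkdir -p /etc/systemd/system/sshd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/sshd.service.d/10-memory-protect.conf
[Service]
MemoryMin=64M
EOF

# Same idea for the ECS agent (unit name assumed to be ecs.service).
sudo mkdir -p /etc/systemd/system/ecs.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ecs.service.d/10-memory-protect.conf
[Service]
MemoryMin=128M
EOF

sudo systemctl daemon-reload
sudo systemctl restart sshd ecs
```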
-
On a troubling instance, does it make any difference if …
Run …
-
Experiencing the same issue, with … and output of uname -r ….
We also experience the same with another instance, which is on Graviton. Both instances are different types.
-
@abuehaze14 enabling zswap will steal some CPU cycles, but you'll be able to log in to your busy VM. The tradeoff is reasonable, IMO.
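For anyone who wants to try that, a small sketch of checking and enabling zswap (a runtime toggle plus a persistent kernel argument via grubby, matching the grubby usage elsewhere in this thread; zswap still needs a backing swap device or file):

```sh
# Current state
cat /sys/module/zswap/parameters/enabled

# Enable at runtime
echo 1 | sudo tee /sys/module/zswap/parameters/enabled

# Persist across reboots
sudo grubby --update-kernel=ALL --args="zswap.enabled=1"

# zswap only compresses pages on their way to swap, so make sure
# some swap actually exists
swapon --show
```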
-
I would recommend configuring at least 1-2 GB of swap space to help the kernel with its memory management responsibilities (e.g. reclaim) and combining it with one of the userspace OOM daemon solutions out there (e.g. earlyoom, systemd-oomd).
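For example, a minimal sketch of adding a 2 GB swap file and a userspace OOM daemon (earlyoom is shown; its availability in your repositories, and whether you prefer systemd-oomd instead, are up to you):

```sh
# Create and enable a 2 GB swap file
sudo dd if=/dev/zero of=/swapfile bs=1M count=2048 status=progress
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab

# Install and enable a userspace OOM killer (package availability is an assumption)
sudo dnf install -y earlyoom
sudo systemctl enable --now earlyoom
```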
Useful references: Pressure Stall Information (PSI), systemd-oomd, earlyoom, nohang, oomd.
Additional tuning:
sudo dnf install --allowerasing -y grubby dbus dbus-daemon polkit systemd-container systemd-pam udisks2
# set default governor to performance and enable psi
sudo grubby --update-kernel=ALL --args="cpufreq.default_governor=performance psi=1"
# sysctl tuning
sudo mkdir -p /etc/sysctl.d
cat <<'EOF' | sudo tee /etc/sysctl.d/99-custom-tuning.conf
# Adjust the kernel printk to minimize serial console logging.
# The defaults are very verbose and they can have a performance impact.
# Note: 4 4 1 7 should also be fine, just not debug i.e. 7
kernel.printk=3 4 1 7
# This setting enables better interactivity for desktop workloads and is
# not typically suitable for most server type workloads e.g. postgresdb.
#
# This feature is aimed at improving system responsiveness under load by
# automatically grouping task groups with similar execution patterns.
# While beneficial for desktop responsiveness, in server environments,
# especially those running Kubernetes, this behavior might not always
# be desirable as it could lead to uneven distribution of CPU resources
# among pods.
#
# man 7 sched
#
# The use of the cgroups(7) cpu controller to place processes in cgroups
# other than the root cpu cgroup overrides the effect of auto-grouping.
#
# https://cateee.net/lkddb/web-lkddb/SCHED_AUTOGROUP.html
# https://www.postgresql.org/message-id/[email protected]
kernel.sched_autogroup_enabled=0
# Specifies the minimum number of kilobytes to keep free across the system.
# This is used to determine an appropriate value for each low memory zone,
# each of which is assigned a number of reserved free pages in proportion
# to their size.
#
# Setting vm.min_free_kbytes to an extremely low value prevents the system from
# reclaiming memory, which can result in system hangs and oom-killing processes.
# Setting min_free_kbytes too high e.g. 5–10% of total system memory can cause
# the system to enter an out-of-memory state immediately, resulting in the
# system spending too much time trying to reclaim memory.
#
# As a rule of thumb, set this value to between 1-3% of available system
# memory and adjust this value up or down to meet the needs of your application
# workload.
#
# Ensure that the reserved kernel memory is sufficient to sustain a high
# rate of packet buffer allocations as the default value may be too small.
vm.min_free_kbytes=1048576
# Ensure that the instance does not try to swap too early.
vm.swappiness=10
EOF
# Apply tuning
sudo sysctl --system
sudo systemctl reboot
-
Summary
Experiencing a critical issue where Amazon Linux 2023 EC2 instances become non-responsive under high memory load. This behavior contrasts with Amazon Linux 2018, where the system handles similar scenarios without becoming unresponsive.
Description
We run a Node.js application on Amazon Linux 2023 that utilizes AWS Systems Manager (SSM) and EBS volumes. The application spawns 60 child processes that process data, potentially causing significant memory usage spikes. When memory utilization approaches 95%, the entire EC2 instance becomes unresponsive: no application or system logs are accessible, SSH connections are refused, and AWS marks the instance as impaired.
Simultaneously, there is a noticeable spike in read operations as observed in EBS metrics, though write operations do not show a similar increase. Monitoring with iotop has shown unusual spikes in read operations across multiple processes right before the system becomes non-responsive.
Hypothesis
It appears the system may be dumping the file cache due to memory pressure, causing all file operations to revert to disk and leading to severe performance degradation. Enabling swap seems to mitigate the issue, suggesting that memory management differences between Amazon Linux 2023 and 2018 are potentially responsible.
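One way to test this hypothesis on a reproducing instance is to watch memory pressure stall information and page-cache activity as utilization climbs (a sketch; PSI must be enabled in the kernel, which the psi=1 boot argument mentioned earlier in this thread takes care of if it is compiled in but default-off):

```sh
# Stall time spent waiting on memory (rising "full" numbers = thrashing)
watch -n 2 cat /proc/pressure/memory

# Page cache collapsing while the workload runs
watch -n 2 'grep -E "MemAvailable|^Cached|SwapFree" /proc/meminfo'

# Major page faults and reclaim activity per process / system-wide (sysstat)
pidstat -r 5
sar -B 5
```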
Comparison with Amazon Linux 2018
The same application on Amazon Linux 2018 handles memory pressure differently by possibly sacrificing high-memory child processes to maintain overall system stability.
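If the goal is to get AL2023 to behave more like AL2018 did (kill the hungry children rather than stall the host), two hedged options are nudging the children's OOM score and/or capping the whole worker tree with a cgroup limit (the process pattern, unit name, path, and the 28G figure are placeholders):

```sh
# Option 1: make the worker processes preferred OOM-kill victims
for pid in $(pgrep -f 'node .*worker'); do   # pattern is a placeholder
  sudo choom -p "$pid" -n 500                # choom ships with util-linux
done

# Option 2: run the application under a transient systemd unit with a hard
# memory ceiling, so the OOM killer fires inside that cgroup only
sudo systemd-run --unit=node-workers \
  -p MemoryMax=28G -p MemorySwapMax=0 \
  node /opt/app/app.js                        # path is a placeholder

# Watch for oom_kill events in that cgroup
cat /sys/fs/cgroup/system.slice/node-workers.service/memory.events
```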
Discussion Points
Are there sysctl or other kernel parameters that could be tuned to improve how memory pressure is handled?
Attachments
[1] Screenshot of read operation spike right before the stuck behavior.
[2] Screenshot of increased disk read operations right before the stuck behavior by different processes.
[3] Output of sysctl -a from AML 2023: sysctl-2023.txt
[4] Output of sysctl -a from AML 2018: sysctl-2018.txt