
Module stops working under high load #6

Open
lensen opened this issue Jun 19, 2023 · 9 comments

lensen commented Jun 19, 2023

Hi,

Every time our web servers get a large volume of requests (between 500 and 1000 req/sec), the module stops working.
The logs show the following:

[Mon Jun 19 12:11:28.975463 2023] [:error] [pid 244:tid 140027053811456] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:29.112630 2023] [:error] [pid 22:tid 140026914522880] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:29.314719 2023] [:error] [pid 23:tid 140026922915584] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:29.393069 2023] [:notice] [pid 26:tid 140027045418752] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:0 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:29.471197 2023] [:error] [pid 187:tid 140027037026048] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:29.581255 2023] [:error] [pid 39:tid 140026838988544] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:29.658611 2023] [:error] [pid 22:tid 140026897737472] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:29.764263 2023] [:error] [pid 29:tid 140027037026048] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:29.954129 2023] [:notice] [pid 30:tid 140026847381248] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:0 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:30.283096 2023] [:error] [pid 187:tid 140027045418752] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:30.356176 2023] [:notice] [pid 55:tid 140026939700992] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:22 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:30.362993 2023] [:notice] [pid 55:tid 140026931308288] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:22 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:30.373114 2023] [:notice] [pid 55:tid 140026922915584] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:22 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:30.461654 2023] [:notice] [pid 93:tid 140026486494976] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:0 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:30.774836 2023] [:error] [pid 290:tid 140026931308288] [mod_prometheus_status][src/mod_prometheus_status.c:155] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:30.784886 2023] [:error] [pid 93:tid 140026503280384] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:30.987457 2023] [:error] [pid 74:tid 140026906130176] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.264254 2023] [:error] [pid 244:tid 140026889344768] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.364170 2023] [:notice] [pid 26:tid 140026906130176] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:28 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:31.483681 2023] [:error] [pid 182:tid 140026906130176] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.563502 2023] [:notice] [pid 185:tid 140026914522880] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:28 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:31.565036 2023] [:notice] [pid 185:tid 140027045418752] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:28 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:31.570348 2023] [:error] [pid 185:tid 140026939700992] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.571591 2023] [:error] [pid 187:tid 140026939700992] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.576137 2023] [:notice] [pid 23:tid 140026847381248] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:0 errno:9 (Bad file descriptor)
[Mon Jun 19 12:11:31.671526 2023] [:error] [pid 27:tid 140026822203136] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.673610 2023] [:error] [pid 27:tid 140026822203136] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.751870 2023] [:error] [pid 23:tid 140026914522880] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.760623 2023] [:error] [pid 26:tid 140026897737472] [mod_prometheus_status][src/mod_prometheus_status.c:149] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.776093 2023] [:error] [pid 182:tid 140027045418752] [mod_prometheus_status][src/mod_prometheus_status.c:155] setsockopt failed: socket:/tmp/mtr.mIeBhw fd:0 errno:88 (Socket operation on non-socket)
[Mon Jun 19 12:11:31.784187 2023] [:notice] [pid 74:tid 140027028633344] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:25 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.784685 2023] [:notice] [pid 74:tid 140026897737472] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:25 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.787175 2023] [:notice] [pid 23:tid 140026461316864] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:24 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.788300 2023] [:notice] [pid 74:tid 140026939700992] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:25 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.788426 2023] [:notice] [pid 74:tid 140026906130176] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:25 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.789363 2023] [:notice] [pid 30:tid 140026931308288] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:23 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.791295 2023] [:notice] [pid 74:tid 140027037026048] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:25 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.796406 2023] [:notice] [pid 30:tid 140026830595840] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:23 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.805731 2023] [:notice] [pid 22:tid 140026855773952] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:27 errno:107 (Transport endpoint is not connected)
[Mon Jun 19 12:11:31.805990 2023] [:notice] [pid 27:tid 140026729883392] [mod_prometheus_status][src/mod_prometheus_status.c:187] failed to send to metrics collector: socket:/tmp/mtr.mIeBhw fd:26 errno:107 (Transport endpoint is not connected)

When curling the /metrics endpoint I've set up, most of the time it responds with a 500 error instead of returning the metrics.

I'm running Apache/2.4.6 (CentOS), module version 0.3.0.


sni commented Jun 19, 2023

Is there anything before all those error messages?


lensen commented Jun 19, 2023

Aside from the start-up messages, no:

[Mon Jun 19 12:10:37.837541 2023] [:notice] [pid 13:tid 140027232766080] [mod_prometheus_status][src/mod_prometheus_status.c:488] prometheus_status_init: version 0.3.0 - using tmp socket /tmp/mtr.mIeBhw
[Mon Jun 19 12:10:37.840010 2023] [:notice] [pid 14:tid 140027232766080] [mod_prometheus_status][src/mod_prometheus_status.c:377] prometheus_status_load_gomodule gomodule loaded
[2023-06-19 12:10:37.844797][mod_prometheus_status][Debug][pid:14][module.go:86] InitMetricsCollector: /tmp/mtr.mIeBhw (uid: 1001, gid: 1001)
[2023-06-19 12:10:37.845119][mod_prometheus_status][Debug][pid:14][module.go:101] listening on metricsSocket: /tmp/mtr.mIeBhw
[2023-06-19 12:10:37.845227][mod_prometheus_status][Info][pid:14][module.go:81] mod_prometheus_status v0.3.0 initialized - socket:/tmp/mtr.mIeBhw - uid:1001 - gid:1001 - build:0.3.0
[Mon Jun 19 12:10:37.904612 2023] [mpm_worker:notice] [pid 13:tid 140027232766080] AH00292: Apache/2.4.6 (CentOS) configured -- resuming normal operations
[Mon Jun 19 12:10:37.904645 2023] [core:notice] [pid 13:tid 140027232766080] AH00094: Command line: 'httpd -D FOREGROUND'


sni commented Jun 19, 2023

What does the socket (/tmp/mtr.*) look like when that happens? Is it gone?


lensen commented Jun 19, 2023

Nope, the file is still there, with the correct permissions:

bash-4.2$ ls -la /tmp/mtr.mIeBhw
srwxr-xr-x 1 apache apache 0 Jun 19 12:10 /tmp/mtr.mIeBhw


Chronos85 commented Jun 20, 2023

I have attached strace to the process, which might help debug this. The trace below is from a single process. Everything goes well up until a certain point: the socket() call always gets fd 17 in this case. Then suddenly it uses fd 31 and fails.

What is interesting is that in the strace the socket() call always comes first, handing out fd 17. But in the case where it goes wrong, the write() comes BEFORE the socket() call. Might this be a threading race condition?

socket(AF_UNIX, SOCK_STREAM, 0)         = 17
connect(17, {sa_family=AF_UNIX, sun_path="/tmp/mtr.wVhAvO"}, 110) = 0
setsockopt(17, SOL_SOCKET, SO_RCVTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(17, SOL_SOCKET, SO_SNDTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
close(17)                               = 0
socket(AF_UNIX, SOCK_STREAM, 0)         = 17
connect(17, {sa_family=AF_UNIX, sun_path="/tmp/mtr.wVhAvO"}, 110) = 0
setsockopt(17, SOL_SOCKET, SO_RCVTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(17, SOL_SOCKET, SO_SNDTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
close(17)                               = 0
socket(AF_UNIX, SOCK_STREAM, 0)         = 17
connect(17, {sa_family=AF_UNIX, sun_path="/tmp/mtr.wVhAvO"}, 110) = 0
setsockopt(17, SOL_SOCKET, SO_RCVTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(17, SOL_SOCKET, SO_SNDTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
close(17)                               = 0
socket(AF_UNIX, SOCK_STREAM, 0)         = 17
connect(17, {sa_family=AF_UNIX, sun_path="/tmp/mtr.wVhAvO"}, 110) = 0
setsockopt(17, SOL_SOCKET, SO_RCVTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(17, SOL_SOCKET, SO_SNDTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
close(17)                               = 0
socket(AF_UNIX, SOCK_STREAM, 0)         = 17
connect(17, {sa_family=AF_UNIX, sun_path="/tmp/mtr.wVhAvO"}, 110) = 0
setsockopt(17, SOL_SOCKET, SO_RCVTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(17, SOL_SOCKET, SO_SNDTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
close(17)                               = 0
write(31, "request:promRequests;1;test.doma"..., 47) = -1 ENOTCONN (Transport endpoint is not connected)
close(31)                               = 0
socket(AF_UNIX, SOCK_STREAM, 0)         = 31
connect(31, {sa_family=AF_UNIX, sun_path="/tmp/mtr.wVhAvO"}, 110) = 0
setsockopt(0, SOL_SOCKET, SO_RCVTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = -1 ENOTSOCK (Socket operation on non-socket)
socket(AF_UNIX, SOCK_STREAM, 0)         = 22
connect(22, {sa_family=AF_UNIX, sun_path="/tmp/mtr.wVhAvO"}, 110) = 0
setsockopt(17, SOL_SOCKET, SO_RCVTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(17, SOL_SOCKET, SO_SNDTIMEO, "\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
write(17, "request:promResponseSize;0;test."..., 51) = -1 ENOTCONN (Transport endpoint is not connected)
close(17)                               = 0
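
For illustration, here is a minimal, hypothetical C sketch of the kind of interleaving suspected above. It is not the module's actual code; the variable name, the flow and the socket path are assumptions. Two threads share one descriptor variable without a lock, so one thread can close and reset it while the other is between its socket()/connect() and setsockopt()/write() calls. The later calls then land on fd 0 (ENOTSOCK) or on a stale, already re-assigned descriptor (ENOTCONN/EBADF), which matches the errno mix in the trace.

/* race_sketch.c - build with: gcc -pthread race_sketch.c
 * Hypothetical reconstruction, NOT the module source. Error handling is
 * stripped down to keep the interleaving visible. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/un.h>
#include <unistd.h>

static int shared_fd = 0;                         /* shared by all threads, no mutex */
static const char *sock_path = "/tmp/mtr.demo";   /* assumed path, not the real one */

static void *send_metric(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        if (shared_fd <= 0) {                     /* racy check-then-act */
            int fd = socket(AF_UNIX, SOCK_STREAM, 0);
            struct sockaddr_un addr = { .sun_family = AF_UNIX };
            strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
            connect(fd, (struct sockaddr *)&addr, sizeof(addr));
            shared_fd = fd;                       /* may overwrite another thread's fd */
        }
        struct timeval tv = { .tv_sec = 3 };
        /* if the other thread closed shared_fd meanwhile, these calls hit a
         * stale or wrong descriptor: ENOTSOCK on fd 0, EBADF/ENOTCONN otherwise */
        if (setsockopt(shared_fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0 ||
            write(shared_fd, "request:promRequests;1;demo\n", 28) < 0) {
            fprintf(stderr, "fd:%d errno:%d (%s)\n",
                    shared_fd, errno, strerror(errno));
            close(shared_fd);
            shared_fd = 0;                        /* this reset races as well */
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, send_metric, NULL);
    pthread_create(&t2, NULL, send_metric, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Run against any listening Unix stream socket at the assumed path, two threads hammering this loop tend to reproduce the same write-before-socket ordering and the errno 88/107/9 pattern shown in the strace.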

@Chronos85

During some testing with ss and looking at the socket connections, I found that the Recv-Q on the LISTENING socket reached as high as 4096, after which the connection errors occurred.

Setting net.core.somaxconn to a higher number indeed delayed the onset of the high-load issue, and backing off the load test reduced the Recv-Q on the socket. This seems to indicate it's something you need to balance and tune if you run into high-load traffic.
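
For context, the real listener is on the Go side (module.go), so the following is only a hedged C illustration of the mechanics, with an assumed socket path. The backlog a listener requests is silently capped by the kernel at net.core.somaxconn, and pending, not-yet-accept()ed connections queue up to that cap; that queue is what ss reports as Recv-Q on the LISTENING socket. Once the collector cannot accept() and drain fast enough, the queue stays at the cap and new connect() attempts from the httpd threads start failing, which is why raising the sysctl only buys time rather than fixing anything.

/* listener_sketch.c - illustrative only, assumed path */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/mtr.demo";
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    unlink(path);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* whatever backlog is requested here, the kernel clamps it to
     * net.core.somaxconn; connections waiting in this queue show up as
     * Recv-Q in ss -xl until accept() takes them off */
    listen(fd, SOMAXCONN);

    for (;;) {
        int c = accept(fd, NULL, NULL);  /* if this loop falls behind, Recv-Q grows */
        if (c >= 0)
            close(c);
    }
}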


Chronos85 commented Sep 1, 2023

Hello. I'd like to re-open this.

We've been trying to troubleshoot this issue, as ever-increasing net.core.somaxconn is not really an option.

I've modified my version of the module to include a lot more debug lines for opening/closing/re-using connections, and I've come to the conclusion that there is a multi-threading issue: file descriptors seem to be getting mixed up between threads.

{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:144] connect socket:/run/httpd/mtr.OWE7rV fd:15","@fields":{"timestamp":"Fri Sep 01 14:36:30.412660 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(144)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:154] socket opt 1:/run/httpd/mtr.OWE7rV fd:15","@fields":{"timestamp":"Fri Sep 01 14:36:30.412712 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(154)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:160] socket opt 2:/run/httpd/mtr.OWE7rV fd:15","@fields":{"timestamp":"Fri Sep 01 14:36:30.412722 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(160)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:134] reuse socket:/run/httpd/mtr.OWE7rV fd:15","@fields":{"timestamp":"Fri Sep 01 14:36:30.413888 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(134)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:134] reuse socket:/run/httpd/mtr.OWE7rV fd:15","@fields":{"timestamp":"Fri Sep 01 14:36:30.414047 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(134)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:357] close from prometheus_status_counter socket:/run/httpd/mtr.OWE7rV fd:15","@fields":{"timestamp":"Fri Sep 01 14:36:30.414064 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(357)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:175] socket close:/run/httpd/mtr.OWE7rV fd:15","@fields":{"timestamp":"Fri Sep 01 14:36:30.414069 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(175)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:134] reuse socket:/run/httpd/mtr.OWE7rV fd:20","@fields":{"timestamp":"Fri Sep 01 14:36:37.265318 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(134)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:134] reuse socket:/run/httpd/mtr.OWE7rV fd:20","@fields":{"timestamp":"Fri Sep 01 14:36:37.271246 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(134)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:201] failed to send to metrics collector: socket:/run/httpd/mtr.OWE7rV fd:20 errno:9 (Bad file descriptor)","@fields":{"timestamp":"Fri Sep 01 14:36:37.282769 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(201)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:175] socket close:/run/httpd/mtr.OWE7rV fd:20","@fields":{"timestamp":"Fri Sep 01 14:36:37.282779 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(175)","error_code":"-"}}
{"type":"apache_error_json","@message":"[mod_prometheus_status][src/mod_prometheus_status.c:177] failed to close socket:/run/httpd/mtr.OWE7rV fd:20 errno:9 (Bad file descriptor)","@fields":{"timestamp":"Fri Sep 01 14:36:37.282786 2023","module":"-","level":"notice","pid":"34939","tid":"139720864335616","client":"-","file":"src/mod_prometheus_status.c(177)","error_code":"-"}}

This log shows thread 139720864335616 in process 34939 closing fd 15. In the code the fd should then get reset to 0 (metric_socket_fd = 0;). But the next log line shows it trying to reuse fd 20?! That fails, it tries to close, it can't, and so on.
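
If the descriptor really is shared across threads like that, two common C-side mitigations would be to give each worker thread its own connection (per-thread storage, e.g. __thread or an APR thread-private key) or to serialize every use of the shared descriptor behind a mutex. Below is a hedged sketch of the mutex variant only; it assumes a single per-process metric_socket_fd-style descriptor and is not a patch against the module's actual code.

/* fix_sketch.c - hedged sketch, not the module source */
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int metric_socket_fd = -1;   /* single per-process connection (assumption) */
static pthread_mutex_t metric_fd_lock = PTHREAD_MUTEX_INITIALIZER;

/* connect (if needed), send, and on error close+reset -- all under one lock,
 * so no other thread can close or replace the descriptor mid-use */
static ssize_t send_metric_locked(const char *sock_path, const char *buf, size_t len)
{
    ssize_t rc = -1;
    pthread_mutex_lock(&metric_fd_lock);
    if (metric_socket_fd < 0) {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
        metric_socket_fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (connect(metric_socket_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(metric_socket_fd);
            metric_socket_fd = -1;
        }
    }
    if (metric_socket_fd >= 0) {
        rc = send(metric_socket_fd, buf, len, 0);
        if (rc < 0) {                /* drop the broken connection */
            close(metric_socket_fd);
            metric_socket_fd = -1;   /* safe: the lock is still held */
        }
    }
    pthread_mutex_unlock(&metric_fd_lock);
    return rc;
}

The lock turns the close/reset and the subsequent reconnect into one atomic step per metric write; the trade-off is that all request threads in a process funnel their metric updates through one connection. A per-thread descriptor avoids the lock at the cost of one connection per worker thread, which matters given the accept-queue pressure discussed earlier in this thread.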

With this idea, I switched from the worker MPM (multi-process/multi-threaded) to the prefork MPM (multi-process, single-threaded) in Apache. And lo and behold, the module works like a charm: no errors, no drops, nothing.

I am not well versed in C so I can't really make any suggestions, but it seems to me that some threading fixes are required.

Interested to hear what your thoughts are. If you require any more information, please don't hesitate to contact me!


sni commented Sep 1, 2023

It's hard to add my 2 cents without a deep dive here. Setting net.core.somaxconn does not solve anything here anyway; it just delays the issue a bit. But it might indicate a file descriptor leak.
Could you have a look at whether the number of open files is high from the start or steadily growing?

@Chronos85

Oh, it's definitely a descriptor leak. Sorry, that would have been good information to add.

Here you can see the open_files from our pods. The see-saw pattern is from descriptors increasing and us reloading the Apache process. On the right you can see the impact of switching to the prefork module. These are ALL open descriptors on the system, by the way, so the right graph is higher in general due to more processes instead of threads. But it flatlines nicely.

[graph: open file descriptors on the system over time, before and after switching to the prefork MPM]

It's actually quite easy to reproduce in a Docker container: just a normal httpd install with a single vhost. I did an ab -c 100 -n 10000; if you watch the logs and the socket connections with ss -x, you can see it losing control of the sockets.

The Docker image was built from a CentOS 7 base and used CentOS-provided packages. No custom builds, no other modules apart from the normally provided ones.
