[BUG] Unstable clusterer module (repeated messages Node [X] is UP) #3542

Open
kertor opened this issue Dec 13, 2024 · 13 comments
@kertor

kertor commented Dec 13, 2024

OpenSIPS version you are running

version: opensips 3.5.2 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: b16e49c98
main.c compiled on  with gcc 11

Describe the bug

I am observing unstable behavior in the clusterer module.
There are two nodes (usrloc replication):

# On main node:
{
    "Clusters": [
        {
            "cluster_id": 1,
            "Nodes": [
                {
                    "node_id": 4,
                    "db_id": 0,
                    "url": "bin:10.0.0.184:5501",
                    "link_state": "Up",
                    "state": "enabled",
                    "next_hop": "4",
                    "description": "node-2-reserve"
                }
            ]
        }
    ]
}

# On secondary node
{
    "Clusters": [
        {
            "cluster_id": 1,
            "Nodes": [
                {
                    "node_id": 3,
                    "db_id": 0,
                    "url": "bin:10.0.0.183:5501",
                    "link_state": "Up",
                    "state": "enabled",
                    "next_hop": "3",
                    "description": "node-2-main"
                }
            ]
        }
    ]
}
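
For context, the per-node listings above are presumably the output of the clusterer_list MI command run on each node, e.g.:

opensips-cli -x mi clusterer_list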

Sometimes it works, then I see these repeated messages on "node-2-reserve":

Dec 13 10:53:46 node-2-reserve /usr/sbin/opensips[875256]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [3] is possibly down, retrying
Dec 13 10:53:46 node-2-reserve /usr/sbin/opensips[875253]: INFO:clusterer:handle_pong: Node [3] is UP

I understand that this was already fixed in 4c38ceb, but I am running the latest build "b16e49c98" and still observe the problem.

To Reproduce

It is an intermittent problem; at the moment I do not know how to reproduce it.
However, running the clusterer_reload command fixes it.
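
For reference, a minimal way to trigger that reload through the MI interface (assuming opensips-cli is available on the host):

opensips-cli -x mi clusterer_reload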

Expected behavior

A stable clusterer, with no logs like:

Dec 13 10:53:08 node-2-main /usr/sbin/opensips[418257]: INFO:usrloc:replicate_ucontact_update: All destinations in cluster: 1 are down or probing
Dec 13 10:53:08 node-2-main /usr/sbin/opensips[418257]: ERROR:usrloc:replicate_ucontact_update: replicate ucontact update failed

Relevant System Logs

Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: Using the global ( per process ) buff 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_parse_headers: received a COMPLETE message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: completely received a message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: Nothing more to read on TCP conn 0x7f1b6c902258, currently in state 0 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:call_callbacks: binary Packet CMD: 1. Capability: clusterer-internal
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:bin_rcv_cl_packets: received clusterer message from: 10.0.0.183:53379 with source id: 3 and cluster id: 1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:handle_internal_msg: Received ping reply from node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:set_link_w_neigh: setting link with neighbour [3] to state <0>
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: con found in state 0
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: tcp connection found (0x7f1b6c903448), acquiring fd
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: c= 0x7f1b6c903448, n=16, Usock=289
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875265]: DBG:core:handle_worker: read response= 7f1b6c903448, 1, fd -1 from 14 (875253)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7f1b6c903448 n=8 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: sending via fd 10...
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tsend_stream_async: Async successful write from first try on 0x7f1b6c903448
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: after write: c= 0x7f1b6c903448 n/len=86/86 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:send_full_top_update: Sent topology update to node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: con found in state 0
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: tcp connection found (0x7f1b6c903448), acquiring fd
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: c= 0x7f1b6c903448, n=16, Usock=289
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875265]: DBG:core:handle_worker: read response= 7f1b6c903448, 1, fd -1 from 14 (875253)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7f1b6c903448 n=8 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: sending via fd 10...
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:tsend_stream_async: Async successful write from first try on 0x7f1b6c903448
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:proto_bin_send: after write: c= 0x7f1b6c903448 n/len=99/99 fd=10
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:send_cap_update: Sent capability update to node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: INFO:clusterer:handle_pong: Node [3] is UP
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:evi_param_set: set int cluster_id=1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:evi_param_set: set int node_id=3
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:evi_param_set: set int new_state=1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7f1bef51b070
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7f1bef51b070
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:destroy_avp_list: destroying list (nil)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:sr_add_report: adding report to identifier [node_states] group [clusterer] on idx 18 [19,18]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: tcp_read_req end
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: Using the global ( per process ) buff 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_parse_headers: received a COMPLETE message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: completely received a message
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_handle_req: Nothing more to read on TCP conn 0x7f1b6c902258, currently in state 0 
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:call_callbacks: binary Packet CMD: 9. Capability: clusterer-internal
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:bin_rcv_cl_packets: received clusterer message from: 10.0.0.183:53379 with source id: 3 and cluster id: 1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:clusterer:handle_internal_msg: Received capability update with source [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:core:bin_pop_str: Popped: 'usrloc-contact-repl' [19]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875253]: DBG:proto_bin:bin_read_req: tcp_read_req end

Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7f1b6c903448 n=8 fd=5
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:proto_bin:proto_bin_send: sending via fd 5...
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:tsend_stream_async: Async successful write from first try on 0x7f1b6c903448
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:proto_bin:proto_bin_send: after write: c= 0x7f1b6c903448 n/len=309/309 fd=5
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:clusterer:msg_send_retry: sent bin packet to node [3]
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:registrar:build_contact: created Contact HF: Contact: <sip:[email protected]>;expires=60

Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:MD5StringArray: MD5 calculated: 9fb93ca357c3189b8747a9751f9dd969
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:parse_headers_aux: flags=ffffffffffffffff
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:cleanup_uac_timers: RETR/FR timers reset
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:insert_timer_unsafe: [2]: 0x7f1b6c8ef798 (162816)
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:_reply_light: reply sent out. buf=0x7f1bef51e4d8: SIP/2.0 2..., shmem=0x7f1b6c8fd648: SIP/2.0 2
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:_reply_light: finished
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:comp_scriptvar: int 20: 60 / 0
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: INFO:[REGISTER] Registered successfully ([email protected] ci=0bb129639bd2e2bd942247bbe69)
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:parse_headers_aux: flags=ffffffffffffffff
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:comp_scriptvar: str 20: INSPECTOR-VI
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:tm:t_unref: UNREF_UNSAFE: [0x7f1b6c8ef718] after is 0
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:destroy_avp_list: destroying list (nil)
Dec 13 10:59:24 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:receive_msg: cleaning up
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [3] is possibly down, retrying
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875265]: DBG:core:handle_worker: read response= 7f1b6c903448, 1, fd -1 from 12 (875251)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:clusterer:do_action_trans_2: Sent ping retry to node [3]
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:clusterer:set_link_w_neigh: setting link with neighbour [3] to state <5>
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:evi_param_set: set int cluster_id=1
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:evi_param_set: set int node_id=3
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:evi_param_set: set int new_state=0
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7f1bef51afd8
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7f1bef51afd8
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:destroy_avp_list: destroying list (nil)
Dec 13 10:59:25 node-2-reserve /usr/sbin/opensips[875251]: DBG:core:sr_add_report: adding report to identifier [node_states] group [clusterer] on idx 17 [18,17]

OS/environment information

  • Operating System: Ubuntu 22
  • OpenSIPS installation: source
  • other relevant information:

Additional context

Could you please check it?


Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Dec 29, 2024
@ewadii

ewadii commented Dec 30, 2024

same "issue" after upgrading from 3.4.8 to 3.4.10, no routing issues just a bothersome log message on the secondary box during low traffic conditions
basic usage of clusterer module without usrloc. "INFO:clusterer:do_action_trans_2: Ping reply not received, node [1] is possibly down, retrying"

@stale stale bot removed the stale label Dec 30, 2024

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Jan 16, 2025
@kertor
Author

kertor commented Jan 16, 2025

Up

@baevga

baevga commented Jan 22, 2025

Hello OpenSIPS Team,

I’ve encountered the same issue after updating from version 3.3.8~20230418~2045436d8-1 to 3.5.3~20250119~a6c390aac-1.

Are there any updates on this issue?

Thank you!

@Shkiperon
Contributor

I have the same issue, with different details: my clusterer module configuration has no DB connection and some fine-tuning:

udp_workers=8
tcp_workers=12

socket=bin:10.15.80.206

socket=udp:10.15.20.152:5060 frag
socket=udp:10.15.20.152:5061 frag
socket=udp:10.15.20.152:5062 frag
socket=tcp:10.15.20.152:5060
socket=tls:10.15.20.152:5061
socket=wss:10.15.20.152:5443

socket=udp:1.1.1.1:5060 frag
socket=tls:1.1.1.1:5061
socket=wss:1.1.1.1:5443

socket=hep_udp:10.15.20.152:6000

#### CLUSTERER module
loadmodule "clusterer.so"
modparam("clusterer", "ping_interval", 4)
modparam("clusterer", "ping_timeout", 2000)
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "my_node_info", "cluster_id=1, url=bin:10.15.80.206, sip_addr=10.15.20.152, flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1, node_id=2, url=bin:10.15.80.30, sip_addr=10.15.20.152")
modparam("clusterer", "sharing_tag", "reg/1=active")

#### BIN protocol
loadmodule "proto_bin.so"
modparam("proto_bin", "bin_send_timeout", 200)
modparam("proto_bin", "bin_max_msg_chunks", 1024)
modparam("proto_bin", "bin_async_max_postponed_chunks", 1024)
modparam("proto_bin", "bin_async_local_connect_timeout", 200)
modparam("proto_bin", "bin_async_local_write_timeout", 100)
modparam("proto_bin", "bin_async", 1)

The OpenSIPS I am using is built from git sources (an "updated" revision of 3.4.10) on Debian 11:

version: opensips 3.4.10 (x86_64/linux)
flags: STATS: On, SHM_EXTRA_STATS, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 905daa129
main.c compiled on 12:30:50 Jan 22 2025 with gcc 10

I found that if timer_workers is not configured (the default is 1), I get many ping failures during a 30-minute test:

[2025-01-31 04:23:49.652259+00:00] INFO[1573982] INFO:core:evi_publish_event: Registered event <E_UL_LATENCY_UPDATE(17)>
[2025-01-31 04:23:50.952110+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:31:22.472075+00:00] INFO[1574033] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:31:22.473356+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:31:34.422320+00:00] INFO[1574027] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:31:34.423427+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:34:50.240390+00:00] INFO[1573995] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:34:50.241364+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:40:29.367509+00:00] INFO[1574040] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:40:29.369081+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:41:00.166541+00:00] INFO[1574061] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:41:00.168524+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:44:01.083718+00:00] INFO[1574049] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:44:01.084926+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:44:29.945205+00:00] INFO[1574047] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:44:29.946859+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:44:59.773480+00:00] INFO[1574059] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:44:59.774660+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:45:29.618426+00:00] INFO[1574049] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:45:29.619701+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:45:59.441217+00:00] INFO[1574061] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:45:59.441917+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:46:29.283520+00:00] INFO[1574027] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:46:29.284703+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:47:00.068918+00:00] INFO[1574057] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:47:00.070450+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:47:29.938571+00:00] INFO[1574055] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:47:29.941813+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:47:41.828111+00:00] INFO[1574035] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:47:41.828898+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:47:59.759697+00:00] INFO[1574040] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:47:59.761175+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:48:29.577608+00:00] INFO[1574040] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:48:29.579053+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:48:59.400814+00:00] INFO[1574055] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:48:59.401583+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:51:22.484276+00:00] INFO[1574059] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:51:22.485940+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:55:03.203736+00:00] INFO[1574053] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:55:03.204774+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 04:58:43.895450+00:00] INFO[1574043] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 04:58:43.896706+00:00] INFO[1574037] INFO:clusterer:handle_pong: Node [2] is UP

If I add some additional timer workers (timer_workers = 4 in my case), I get fewer ping problems, but the problem still exists:

[2025-01-31 08:57:26.970570+00:00] INFO[1639007] INFO:usrloc:replicate_urecord_insert: All destinations in cluster: 1 are down or probing
[2025-01-31 08:57:26.970850+00:00] INFO[1639007] INFO:usrloc:replicate_ucontact_insert: All destinations in cluster: 1 are down or probing
[2025-01-31 08:57:27.947441+00:00] INFO[1639040] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 08:59:04.397070+00:00] INFO[1639055] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 08:59:04.398025+00:00] INFO[1639040] INFO:clusterer:handle_pong: Node [2] is UP
[2025-01-31 09:33:40.503483+00:00] INFO[1639049] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
[2025-01-31 09:33:40.504715+00:00] INFO[1639040] INFO:clusterer:handle_pong: Node [2] is UP

After these tests I checked a tcpdump capture, and in both cases I see the same conditions around the ping problems.
Merged data from the log and the tcpdump:

--> 2025-01-31 08:58:55.674840+00:00 - [PSH,ACK] clusterer-internal (ping)
--> 2025-01-31 08:58:55.674960+00:00 - [ACK] on clusterer-internal (ping)
--> 2025-01-31 08:58:56.423115+00:00 - [PSH,ACK] clusterer-internal (ping)
--> 2025-01-31 08:58:56.423323+00:00 - [ACK] on clusterer-internal (ping)
--> 2025-01-31 08:58:56.988544+00:00 - [PSH,ACK] usrloc-contact-repl contact1
--> 2025-01-31 08:58:56.988850+00:00 - [ACK] on previous
--> 2025-01-31 08:58:59.694331+00:00 - [PSH,ACK] clusterer-internal (ping)
--> 2025-01-31 08:58:59.694444+00:00 - [ACK] on clusterer-internal (ping)
--> 2025-01-31 08:59:02.265824+00:00 - [PSH,ACK] usrloc-contact-repl contact2
--> 2025-01-31 08:59:02.265963+00:00 - [ACK] on previous
--> 2025-01-31 08:59:04.396934+00:00 - [PSH,ACK] clusterer-internal (ping)
[2025-01-31 08:59:04.397070+00:00] INFO[1639055] INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is possibly down, retrying
--> 2025-01-31 08:59:04.397103+00:00 - [ACK] on clusterer-internal (ping)
--> 2025-01-31 08:59:04.397665+00:00 - [PSH,ACK] clusterer-internal (ping)
--> 2025-01-31 08:59:04.397718+00:00 - [PSH,ACK] Many repl commands
--> 2025-01-31 08:59:04.397748+00:00 - [ACK] on clusterer-internal (ping)
--> 2025-01-31 08:59:04.397788+00:00 - [ACK] on many repl commands
[2025-01-31 08:59:04.398025+00:00] INFO[1639040] INFO:clusterer:handle_pong: Node [2] is UP

Looks like there are two problems:

  1. The timer worker cannot keep up with what the OpenSIPS user expects (when there are many UDP, TCP and other workers busy with dialog replication, usrloc replication, etc.), so modparam("clusterer", "ping_interval", 4) can effectively become 4.5 seconds on a real system.
  2. The clusterer module ignores the fact that we have a ping_timeout. I think it should alert only after about 6 seconds (ping_interval + ping_timeout), with some deviation (because the timer worker cannot trigger events with high precision), not ~4.5 seconds after the ping attempt has merely been started (see the annotated excerpt below).
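
To make the expected timing explicit, here is an annotated excerpt of the relevant parameters from the configuration above (the 6-second figure is simply ping_interval plus ping_timeout, under the interpretation in point 2):

modparam("clusterer", "ping_interval", 4)    # seconds between pings to the neighbour
modparam("clusterer", "ping_timeout", 2000)  # milliseconds to wait for the pong reply
# under this reading, a "possibly down" alert is expected no earlier than
# 4 s + 2 s = 6 s after the last pong (plus timer jitter), not ~4.5 s after
# the ping attempt was merely scheduled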

@liviuchircu liviuchircu self-assigned this Feb 3, 2025
@liviuchircu liviuchircu added this to the 3.4.11 milestone Feb 3, 2025
@liviuchircu
Member

Thank you for the reports -- perhaps commit 7d74f3d introduced a regression during periods of low traffic, where node state can now flip-flop. Currently treating this issue with raised priority, so the fix makes it into the next stable minor release with some added testing from your side as well.

@Shkiperon
Contributor

@liviuchircu are there any updates on this issue? Is there a chance that the problem will be solved in the 3.4.11 release?

@liviuchircu
Member

Hey @Shkiperon! I tried to reproduce the issue here using your settings, but failed to do so (both in high traffic conditions: 500 CPS, or while idling). Any tips on reproducing this are appreciated -- perhaps I'm missing a module, setting or network setup which induces all these issues.

@kertor
Author

kertor commented Feb 14, 2025

Hello @liviuchircu!
For example, here are the settings (OpenSIPS 3.5) where I see the problem:

########################################################################
loadmodule "clusterer.so"
########################################################################
modparam("clusterer", "db_url", "text:///etc/opensips/db")
modparam("clusterer", "my_node_id", ID_OF_NODE)
modparam("clusterer", "ping_interval", 1)
modparam("clusterer", "ping_timeout", 500)
modparam("clusterer", "node_timeout", 10)

########################################################################
loadmodule "tm.so"
########################################################################
modparam("tm", "fr_inv_timeout", 240)
modparam("tm", "restart_fr_on_each_reply", 0)
modparam("tm", "timer_partitions", 4)
modparam("tm", "auto_100trying", 0)
modparam("tm", "ruri_matching", 0)
modparam("tm", "tm_replication_cluster", 1)
modparam("tm", "via1_matching", 0)
modparam("tm", "cluster_param", "tmid")



id(int,auto) cluster_id(int) node_id(int) url(string) state(int) no_ping_retries(int) priority(int) sip_addr(string,null) flags(string,null) description(string,null)
0:1:1:bin\:10.10.10.11\:11111:1:3:50:::balancer-1.lan
0:1:2:bin\:10.10.10.12\:11111:1:3:50:::balancer-2.lan
0:1:3:bin\:10.10.10.13\:11111:1:3:50:::balancer-3.lan

{
    "Clusters": [
        {
            "cluster_id": 1,
            "Nodes": [
                {
                    "node_id": 3,
                    "db_id": 0,
                    "url": "bin:10.10.10.13:11111",
                    "link_state": "Up",
                    "state": "enabled",
                    "next_hop": "3",
                    "description": "balancer-3.lan"
                },
                {
                    "node_id": 2,
                    "db_id": 0,
                    "url": "bin:10.10.10.12:11111",
                    "link_state": "Up",
                    "state": "enabled",
                    "next_hop": "2",
                    "description": "balancer-2.lan"
                }
            ]
        }
    ]
}

{
    "Clusters": [
        {
            "cluster_id": 1,
            "Capabilities": [
                {
                    "name": "tm-repl",
                    "state": "Ok",
                    "enabled": "yes"
                }
            ]
        }
    ]
}

I observe the issue approximately 6–8 hours after the load has been applied.

@Shkiperon
Contributor

Hi @liviuchircu ! I have extra tuning in my config file and some modules configured in cluster mode:

log_level=3
xlog_level=3
syslog_enabled=yes
syslog_facility=LOG_LOCAL7
xlog_buf_size=8192

udp_workers=8
tcp_workers=12
timer_workers=4

auto_aliases=no

exec_dns_threshold = 50000
exec_msg_threshold = 50000
tcp_threshold = 50000

open_files_limit=8192
tcp_connect_timeout=700
tcp_connection_lifetime=3605
tcp_max_connections=4096
maxbuffer=65536

restart_persistency_cache_file="/tmp/opensips/.restart_persistency.cache"

tos=0x88

...

#### Local Cache module
loadmodule "cachedb_local.so"
modparam("cachedb_local", "cachedb_url", "local:///dns")
modparam("cachedb_local", "cachedb_url", "local:///dialog")
modparam("cachedb_local", "cache_collections", "dns/r; dialog = 12; default = 4")
modparam("cachedb_local", "cache_clean_period", 300)
modparam("cachedb_local", "cluster_id", 1)
modparam("cachedb_local", "cluster_persistency", "sync-from-cluster")
modparam("cachedb_local", "enable_restart_persistency", 1)

#### DNS Cache module
loadmodule "dns_cache.so"
modparam("dns_cache", "cachedb_url", "local:///dns")
modparam("dns_cache", "blacklist_timeout",1800)

#### Transaction Module
loadmodule "tm.so"
modparam("tm", "fr_timeout", 5)
modparam("tm", "fr_inv_timeout", 120)
modparam("tm", "restart_fr_on_each_reply", 0)
modparam("tm", "onreply_avp_mode", 1)
modparam("tm", "auto_100trying", 0)

#### Record Route Module
loadmodule "rr.so"
modparam("rr", "append_fromtag", 0)

#### MAX ForWarD module
loadmodule "maxfwd.so"

#### SIP MSG OPerationS module
loadmodule "sipmsgops.so"

#### Management Interface (FIFO+HTTP)
loadmodule "mi_fifo.so"
modparam("mi_fifo", "fifo_name", "/run/opensips/opensips_fifo")
modparam("mi_fifo", "fifo_mode", 0660)

loadmodule "httpd.so"
modparam("httpd", "ip", "127.0.0.1")
modparam("httpd", "port", 8080)
modparam("httpd", "post_buf_size", 4096)

loadmodule "mi_http.so"
modparam("mi_http", "root", "mirpc")

#### MySQL module
loadmodule "db_mysql.so"

#### DbText module
loadmodule "db_text.so"
modparam("db_text", "db_mode", 1)

#### Redis module
loadmodule "cachedb_redis.so"
modparam("cachedb_redis", "cachedb_url", "redis:read://:[email protected]:6380/0")
modparam("cachedb_redis", "cachedb_url", "redis:write://:[email protected]:6381/0")

#### Permissions module
loadmodule "permissions.so"
modparam("permissions", "db_url", "text:///etc/opensips/dbtext")

#### AVPops module
loadmodule "avpops.so"
modparam("avpops", "db_url", "mysql://opensips:opensipsrw@localhost/opensips")

#### CLUSTERER module
loadmodule "clusterer.so"
modparam("clusterer", "ping_interval", 4)
modparam("clusterer", "ping_timeout", 2000)
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "my_node_info", "cluster_id=1, url=bin:10.15.80.206, sip_addr=10.15.20.152, flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1, node_id=2, url=bin:10.15.80.30, sip_addr=10.15.20.152")
modparam("clusterer", "sharing_tag", "reg/1=active")

#### BIN protocol
loadmodule "proto_bin.so"
modparam("proto_bin", "bin_send_timeout", 200)
modparam("proto_bin", "bin_max_msg_chunks", 1024)
modparam("proto_bin", "bin_async_max_postponed_chunks", 1024)
modparam("proto_bin", "bin_async_local_connect_timeout", 200)
modparam("proto_bin", "bin_async_local_write_timeout", 100)
modparam("proto_bin", "bin_async", 1)

#### USeR LOCation module
loadmodule "usrloc.so"
modparam("usrloc", "nat_bflag", "SIP_PING")
modparam("usrloc", "working_mode_preset", "full-sharing-cluster")
modparam("usrloc", "db_url", "mysql://opensips:opensipsrw@localhost/opensips")
modparam("usrloc", "location_cluster", 1)
modparam("usrloc", "hash_size", 9)
modparam("usrloc", "use_domain", 1)
modparam("usrloc", "pinging_mode", "ownership")

#### RTPEngine module
loadmodule "rtpengine.so"
modparam("rtpengine", "db_url", "text:///etc/opensips/dbtext")

#### DIALOG module
loadmodule "dialog.so"
modparam("dialog", "dlg_match_mode", 1)
modparam("dialog", "default_timeout", 7200)
modparam("dialog", "db_mode", 0)
modparam("dialog", "cachedb_url", "local:///dialog")
modparam("dialog", "dialog_replication_cluster", 1)
modparam("dialog", "profile_replication_cluster", 1)
modparam("dialog", "options_ping_interval", 10)
modparam("dialog", "profiles_with_value", "userCalls/s; caller/s; callee/s;")
modparam("dialog", "replicate_profiles_buffer", 800)

#### Dynamic Routing module
loadmodule "drouting.so"
modparam("drouting", "db_url", "mysql://opensips:opensipsrw@localhost/opensips")
modparam("drouting", "ruri_avp", "$avp(dr_ruri)")
modparam("drouting", "gw_id_avp", "$avp(dr_gwid)")
modparam("drouting", "carrier_id_avp", "$avp(dr_carrierid)")
modparam("drouting", "probing_method", "OPTIONS")
modparam("drouting", "probing_interval", 30)
modparam("drouting", "probing_from", "sip:ping@tester")
modparam("drouting", "probing_socket", "udp:10.15.20.152:5060")
modparam("drouting", "gw_socket_filter_mode", "matched-only")
modparam("drouting", "cluster_id", 1)
modparam("drouting", "cluster_sharing_tag", "reg")
modparam("drouting", "cluster_probing_mode", "by-shtag")

...

#### TRACER and HEP modules
loadmodule "proto_hep.so"
modparam("proto_hep", "hep_capture_id", 2001)
modparam("proto_hep", "homer5_on", 1)
modparam("proto_hep", "hep_id", "[hid]10.10.99.21:9060;transport=udp;version=3")

loadmodule "tracer.so"
modparam("tracer", "trace_id", "[tid]uri=hep:hid")

....

#### TCP protocol
loadmodule "proto_tcp.so"
modparam("proto_tcp", "trace_destination", "hid")
modparam("proto_tcp", "trace_on", 1)

#### PROTO TLS
loadmodule "proto_tls.so"
modparam("proto_tls", "trace_destination", "hid")
modparam("proto_tls", "trace_on", 1)
modparam("proto_tls", "tls_handshake_timeout", 400)
modparam("proto_tls", "tls_send_timeout", 300)

loadmodule "proto_wss.so"
modparam("proto_wss", "wss_port", 9443)
modparam("proto_wss", "wss_resource", "/ws")
modparam("proto_wss", "wss_handshake_timeout", 300)
modparam("proto_wss", "cert_check_on_conn_reusage", 1)
modparam("proto_wss", "trace_destination", "hid")
modparam("proto_wss", "trace_on", 1)
modparam("proto_wss", "require_origin", 1)

loadmodule "tls_wolfssl.so"

loadmodule "tls_mgm.so"
modparam("tls_mgm", "tls_library", "wolfssl")
....

This is the part of my config with the modules that work with the clusterer or over TCP connections.

@Shkiperon
Contributor

@liviuchircu maybe something is different in my build options (#3542 (comment))?

@Shkiperon
Contributor

I found a modparam change that suppresses the "node down <-> node up" flapping: in my configuration (#3542 (comment)) I changed the ping timeout:

modparam("clusterer", "ping_timeout", 2500)

After that, the flapping between the two virtual machines stopped. The latency between them is low; here are the statistics for 100 ICMP packets:

rtt min/avg/max/mdev = 0.155/0.217/0.306/0.031 ms
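
(For completeness, that rtt summary is the closing line of a plain ICMP test between the two hosts; an assumed invocation, with the peer address taken from the neighbor_node_info above, would be: ping -c 100 10.15.80.30)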

On OpenSIPS 3.3 I did not have this problem.
