Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] [worker server] worker server connection has already been closed #15666

Closed
2 of 3 tasks
q4q5q6qw opened this issue Mar 4, 2024 · 1 comment
Closed
2 of 3 tasks
Labels
bug Something isn't working

Comments

@q4q5q6qw
Copy link

q4q5q6qw commented Mar 4, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

After the system runs properly for a period of time, port 1234 of the worker node is disabled,the master is normal, I checked the worker configuration item registry-disconnect-strategy.strategy=waiting, registry-disconnect-strategy.max-waiting-time=100s by referring to issue:12414, and the following error information is displayed in the log:

[INFO] 2024-03-03 08:07:42.285 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.75.195.147:54462, server: 10.75.194.59/10.75.194.59:2181 [WARN] 2024-03-03 08:07:42.286 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.194.59/10.75.194.59:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282) [INFO] 2024-03-03 08:07:42.856 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[52] - [WorkflowInstance-0][TaskInstance-0] - The current server status is WAITING, will not write heartBeatInfo into registry [ERROR] 2024-03-03 08:07:43.475 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[80] - [WorkflowInstance-0][TaskInstance-0] - Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleException: Waiting to reconnect to registry in PT1M40S failed at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:208) at com.sun.proxy.$Proxy110.disconnect(Unknown Source) at org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:53) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:43) at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281) at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43) at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.dolphinscheduler.registry.api.RegistryException: Cannot connect to the Zookeeper registry in 100 s at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.connectUntilTimeout(ZookeeperRegistry.java:133) at org.apache.dolphinscheduler.registry.api.RegistryClient.connectUntilTimeout(RegistryClient.java:72) at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:71) ... 20 common frames omitted [INFO] 2024-03-03 08:07:43.668 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 10.75.195.160/10.75.195.160:2181. [INFO] 2024-03-03 08:07:43.668 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error) [WARN] 2024-03-03 08:07:43.679 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.195.160/10.75.195.160:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282) [INFO] 2024-03-03 08:07:44.683 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 10.75.195.174/10.75.195.174:2181. [INFO] 2024-03-03 08:07:44.684 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error) [INFO] 2024-03-03 08:07:44.684 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.75.195.147:39546, server: 10.75.195.174/10.75.195.174:2181 [WARN] 2024-03-03 08:07:44.685 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.195.174/10.75.195.174:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282) [INFO] 2024-03-03 08:07:45.217 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 10.75.194.59/10.75.194.59:2181. [INFO] 2024-03-03 08:07:45.217 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error) [INFO] 2024-03-03 08:07:45.217 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.75.195.147:54470, server: 10.75.194.59/10.75.194.59:2181 [WARN] 2024-03-03 08:07:45.218 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.194.59/10.75.194.59:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282) [INFO] 2024-03-03 08:07:46.479 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[126] - [WorkflowInstance-0][TaskInstance-0] - Worker server is stopping, current cause : Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server [INFO] 2024-03-03 08:07:46.479 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[146] - [WorkflowInstance-0][TaskInstance-0] - Worker begin to kill all cache task, task size: 2 [INFO] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[92] - [WorkflowInstance-18203][TaskInstance-18204] - Begin kill task instance, processId: 0 [ERROR] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[95] - [WorkflowInstance-18203][TaskInstance-18204] - Task instance kill failed, processId is not exist [INFO] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[92] - [WorkflowInstance-18206][TaskInstance-18206] - Begin kill task instance, processId: 0 [ERROR] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[95] - [WorkflowInstance-18206][TaskInstance-18206] - Task instance kill failed, processId is not exist [INFO] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[159] - [WorkflowInstance-0][TaskInstance-0] - Worker after kill all cache task, task size: 2, killed number: 0 [WARN] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[72] - [WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask finished... [INFO] 2024-03-03 08:07:46.515 +0800 org.apache.curator.framework.imps.CuratorFrameworkImpl:[998] - [WorkflowInstance-0][TaskInstance-0] - backgroundOperationsLoop exiting [WARN] 2024-03-03 08:07:47.007 +0800 org.apache.zookeeper.ClientCnxn:[1286] - [WorkflowInstance-0][TaskInstance-0] - An exception was thrown while closing send thread for session 0x0. java.io.IOException: Connection has already been closed and reconnection is not allowed at org.apache.zookeeper.ClientCnxn$SendThread.changeZkState(ClientCnxn.java:990) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1141) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1200) [INFO] 2024-03-03 08:07:47.116 +0800 org.apache.zookeeper.ClientCnxn:[568] - [WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 0x0 [INFO] 2024-03-03 08:07:47.116 +0800 org.apache.zookeeper.ZooKeeper:[1232] - [WorkflowInstance-0][TaskInstance-0] - Session: 0x0 closed [INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[128] - [WorkflowInstance-0][TaskInstance-0] - Worker registry client closed [INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[60] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing [INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[62] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed [INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[133] - [WorkflowInstance-0][TaskInstance-0] - Worker server stopped, current cause: Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server

  • [ ]

What you expected to happen

Normal scheduling

How to reproduce

Reference Log Information

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@q4q5q6qw q4q5q6qw added bug Something isn't working Waiting for reply Waiting for reply labels Mar 4, 2024
@q4q5q6qw q4q5q6qw changed the title [Bug] [worker server] Bug title [Bug] [worker server] worker server connection has already been closed Mar 4, 2024
@SbloodyS SbloodyS removed the Waiting for reply Waiting for reply label Jul 16, 2024
@SbloodyS
Copy link
Member

Fixed in #16180.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants