Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 BUG: Nebula nodes cannot ping each other , however they can ping the lighthouse vpn IP #1165

Open
Jahazee opened this issue Jun 14, 2024 · 14 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@Jahazee
Copy link

Jahazee commented Jun 14, 2024

What version of nebula are you using? (nebula -version)

1.9.2

What operating system are you using?

Linux, windows,android

Describe the Bug

I have three nodes in the network. 2 nodes as client and 1 node as light house. Lighthouse is running in public cloud.
The client nodes are running windows and android platform.
Windows node and android platform nodes is successfully connected to the lighthouse. Light house VPN IP is reachable from both of these nodes.

Looks peer to peer tunnel is not established . Any help on this would be appreciated.

However unfortunately, I cannot ping android node from windows node or vice-versa.
android node vpn ip : 10.0.0.50
windows node vpn ip : 10.0.0.2
lighthouse node vpn ip : 10.0.0.100

Logs on Android node

time="2024-06-14T13:55:19Z" level=debug msg="Packet store" length=2 localIndex=2056711493 remoteIndex=0 stored=true vpnIp=10.0.0.2
time="2024-06-14T13:55:19Z" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2056711493 localIndex=2056711493 remoteIndex=0 udpAddrs="[]" vpnIp=10.0.0.2
time="2024-06-14T13:55:20Z" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]"

window node note public ip masked as x.x.x.x

017791302 tunnelCheck="map[method:passive state:alive]" vpnIp=10.0.0.100
time="2024-06-14T20:12:57+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:12:57+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:12:58+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:12:58+05:30" level=debug msg="Packet store" length=2 localIndex=895826622 remoteIndex=0 stored=true vpnIp=10.0.0.50
time="2024-06-14T20:12:59+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:13:00+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:13:00+05:30" level=debug msg="Packet store" length=3 localIndex=895826622 remoteIndex=0 stored=true vpnIp=10.0.0.50
time="2024-06-14T20:13:01+05:30" level=info msg="Handshake timed out" durationNs=5669465500 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[157.45.145.140:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:13:01+05:30" level=debug msg="Pending hostmap hostInfo deleted" hostMap="map[indexNumber:895826622 mapTotalSize:0 remoteIndexNumber:0 vpnIp:10.0.0.50]"

Logs from affected hosts

Logs on Android node

time="2024-06-14T13:55:19Z" level=debug msg="Packet store" length=2 localIndex=2056711493 remoteIndex=0 stored=true vpnIp=10.0.0.2
time="2024-06-14T13:55:19Z" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2056711493 localIndex=2056711493 remoteIndex=0 udpAddrs="[]" vpnIp=10.0.0.2
time="2024-06-14T13:55:20Z" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]"

window node note public ip masked as x.x.x.x

017791302 tunnelCheck="map[method:passive state:alive]" vpnIp=10.0.0.100
time="2024-06-14T20:12:57+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:12:57+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:12:58+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:12:58+05:30" level=debug msg="Packet store" length=2 localIndex=895826622 remoteIndex=0 stored=true vpnIp=10.0.0.50
time="2024-06-14T20:12:59+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:13:00+05:30" level=debug msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[x.x.x.x:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:13:00+05:30" level=debug msg="Packet store" length=3 localIndex=895826622 remoteIndex=0 stored=true vpnIp=10.0.0.50
time="2024-06-14T20:13:01+05:30" level=info msg="Handshake timed out" durationNs=5669465500 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=895826622 localIndex=895826622 remoteIndex=0 udpAddrs="[157.45.145.140:43159]" vpnIp=10.0.0.50
time="2024-06-14T20:13:01+05:30" level=debug msg="Pending hostmap hostInfo deleted" hostMap="map[indexNumber:895826622 mapTotalSize:0 remoteIndexNumber:0 vpnIp:10.0.0.50]"

Config files from affected hosts


@Jahazee
Copy link
Author

Jahazee commented Jun 16, 2024

Enabled relay on the light house and updated the config on the nodes to use the nebula ip of light house as relay. Still nodes are unable to connect.

@tokudan
Copy link

tokudan commented Aug 4, 2024

I think I have seen the same thing. both nodes are behind NAT. Connecting to the lighthouse worked fine for each node, but they were unable to talk to eachother.
After rebooting one of the nodes, it immediately started working, though the node will have received new IPv4 and IPv6 addresses due to the reboot.
Versions:
rebooted node: 1.6.1+dfsg-3+b4
other node: 1.9.3
punchy enabled on both nodes.

@Jahazee
Copy link
Author

Jahazee commented Aug 9, 2024

@tokudan thanks for your response. Let me try . In my setup , I have android device and other one is windows 11. Both are running with nebula version 1.9.3 . I would have started the nebula process multiple time , not sure what reboot would make difference here.

@Cyberes
Copy link

Cyberes commented Aug 12, 2024

I have dozens of hosts on my Nebula network and I encounter a similar issue once every few months or so. I don't need to reboot, just restarting the Nebula service on the problem host fixes everything.

When one of my hosts is affected by this issue, it's only that single host. Everyone else works as expected.

Working host:

level=info msg="Attempt to relay through hosts" localIndex=2552419571 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=172.0.3.101
level=info msg="send CreateRelayRequest" initiatorRelayIndex=170893945 relay=172.0.0.2 relayFrom=172.0.2.120 relayTo=172.0.3.101
level=info msg="send CreateRelayRequest" initiatorRelayIndex=1306741978 relay=172.0.0.3 relayFrom=172.0.2.120 relayTo=172.0.3.101
level=info msg="Re-send CreateRelay request" localIndex=2552419571 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="send CreateRelayRequest" initiatorRelayIndex=170893945 relay=172.0.0.2 relayFrom=172.0.2.120 relayTo=172.0.3.101
level=info msg="Re-send CreateRelay request" localIndex=2552419571 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="send CreateRelayRequest" initiatorRelayIndex=1306741978 relay=172.0.0.3 relayFrom=172.0.2.120 relayTo=172.0.3.101
level=info msg=handleCreateRelayResponse initiatorRelayIndex=170893945 relayFrom=172.0.2.120 relayTo=172.0.3.101 responderRelayIndex=1591794049 vpnIp=172.0.0.2
level=info msg=handleCreateRelayResponse initiatorRelayIndex=1306741978 relayFrom=172.0.2.120 relayTo=172.0.3.101 responderRelayIndex=217037255 vpnIp=172.0.0.3
level=info msg="Attempt to relay through hosts" localIndex=2552419571 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=2552419571 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=2552419571 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=2552419571 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=2552419571 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Handshake message received" certName=... durationNs=312059865 fingerprint=...
level=info msg="Handshake timed out" durationNs=6761641802 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3581883771 localIndex=3581883771 remoteIndex=0

Problem host:

level=info msg="Attempt to relay through hosts" localIndex=4235701552 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0
level=info msg="send CreateRelayRequest" initiatorRelayIndex=3506794252 relay=172.0.0.2 relayFrom=172.0.3.101 relayTo=172.0.2.120
level=info msg="send CreateRelayRequest" initiatorRelayIndex=3382568272 relay=172.0.0.3 relayFrom=172.0.3.101 relayTo=172.0.2.120
level=info msg="Re-send CreateRelay request" localIndex=4235701552 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.2.120
level=info msg="send CreateRelayRequest" initiatorRelayIndex=3506794252 relay=172.0.0.2 relayFrom=172.0.3.101 relayTo=172.0.2.12
level=info msg="Re-send CreateRelay request" localIndex=4235701552 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.2.120
level=info msg="send CreateRelayRequest" initiatorRelayIndex=3382568272 relay=172.0.0.3 relayFrom=172.0.3.101 relayTo=172.0.2.120
etc

@cameronbraid
Copy link

I get a similar thing all the time.

I have a linux lighthouse on the internet and a linux server in my lan. Which are both fixed and can always talk to each other fine.

I have a mac laptop which works perfectly when connected to the lan - it can connect to the linux server. When I pair it with my phone to use its mobile connection I have to restart nebula on the linux server to be able to connect to it.

@johnmaguire johnmaguire added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 29, 2024
@johnmaguire
Copy link
Collaborator

@tokudan & @cameronbraid The other two users in this thread reported having their Lighthouse configured as a relay. There are many connectivity issues that can occur with Nebula but it is surprising to hear that they can ping the Lighthouse from two nodes, the Lighthouse is configured as a relay, and yet the nodes cannot reach each other. Can you confirm whether each of you are also using the Lighthouse as a relay in your scenarios?

@Jahazee & @Cyberes At the risk of asking a dumb question, can you explain how relays are configured? Note that in addition to am_relay: true on the Lighthouse, and use_relays: true on the nodes, you must also specify the Lighthouse as a possible relay in the nodes' config: https://nebula.defined.net/docs/config/relay/#relayrelays

To any user in this thread, sharing configs of the affected hosts (Lighthouse and both nodes) as well as logs around the event will be immensely helpful in tracking down the issue. Thanks!

@Cyberes
Copy link

Cyberes commented Aug 30, 2024

@johnmaguire This is my standard config for my non-relay/non-lighthouse nodes:

relay:
  am_relay: false
  relays:
  - 172.0.0.2
  - 172.0.0.3
  use_relays: true
lighthouse:
  am_lighthouse: false
  hosts:
  - 172.0.0.2
  - 172.0.0.3
  interval: 60

My lighthouse looks like this:

lighthouse:
  am_lighthouse: true
  interval: 60
punchy:
  delay: 1s
  punch: true
  punch_back: true
  respond: true
relay:
  am_relay: true
  use_relays: false

You should have megabytes of logs from me at this point for this issue but let me know if you need anything more.

@kha84
Copy link

kha84 commented Sep 1, 2024

I can back the above issues. I do have a lighthouse configured on Linux VPS (Debian 11, Nebula version is 1.6.1 - installed from standard debian repo), with public IPv4 which acts as a relay. And there're two nodes:

  • one is my Linux machine at home behind my own NAT built-in to WiFi router which is behind cgNAT of my ISP. It is running Nebula of version 1.9.3
  • another is my android phone with Nebula app installed, and it's definitely not having public IPv4:
App version 0.4.1-77 (sha: 7a048d8)
Nebula version: 1.8.1 (go1.20.10)
Flutter version: 3.3.5
Dart version 2.18.2

From time to time I cannot connect/ping from my phone to my Linux machine at home, but they both can connect to the lighthouse. What helps to fix this is to restart Nebula either on home machine, phone or both.

Not sure if it's some new bug related to latest version or it's just me being unlucky - my experience with Nebula is just few days and it's always been like this.

you must also specify the Lighthouse as a possible relay in the nodes' config: https://nebula.defined.net/docs/config/relay/#relayrelays

I haven't found this kind of setting in Android app. Does it exist or I am just being blind?

@kha84
Copy link

kha84 commented Sep 1, 2024

I managed to find steps to reproduce this issue:

  1. to have my phone in the same home WiFi network as my Linux machine and have Nebula restarted on both. At this point both hosts are able to connect to each other via Nebula IP addresses

  2. disable WiFi on my phone and let it use cellular network. Even though Nebula VPN is still shown as running, phone cannot talk to Linux machine via Nebula IP address any longer, but still can connect to Nebula Lighthouse using Nebula IP. When I attempt to connect to Linux PC, Nebula app on phone complains with "Fail to send handshake", Nebula service on Linux PC complains with "Existing relay mismatch with CreateRelayRequest"

  3. re-establish VPN connection on phone - turn "Status" for my network off and back on in app. No result. Same messages in phone and Linux PC logs as in previous step. Phone still can connect to Lighthouse using Nebula IP, but not the Linux machine at home. Subsequent restarts of Nebula on phone don't help as well. Even with closing app.

  4. restart Nebula service on a home Linux PC. This fixes the issue and phone can again connect to my home Linux machine.

  5. enable WiFi on phone to allow it to connect to home WiFi network. Phone still can talk to Linux PC!

  6. disable WiFi on phone again - and it still can connect to Linux PC. Looks like some routes has been cached at step 4 and now being reused. Not sure how long this effect lasts.

So at least one case is clear when this happens: when phone begins to roam and goes from one network (home WiFi) to another (cellular) for the very first time. What's bad is that in order to reestablish connection, I need to reboot Nebula service on a destination host I want access to - which can be tricky when you're away with your phone :)

@Cyberes
Copy link

Cyberes commented Sep 4, 2024

@johnmaguire it happened again so more logs for you. I think I have a different issue than what the others are facing since mine is about losing connection.

Problem host:

level=info msg="send CreateRelayRequest" initiatorRelayIndex=2208677988 relayFrom=172.0.3.101 relayTo=172.0.2.120 responderRelayIndex=0 vpnIp=172.0.0.2
level=info msg=handleCreateRelayResponse initiatorRelayIndex=2208677988 relayFrom=172.0.3.101 relayTo=172.0.2.120 responderRelayIndex=1035013250 vpnIp=172.0.0.2
level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.3 vpnIp=172.0.2.120
level=info msg="Handshake message received" certName=nginx fingerprint=... handshake="map[stage:1 style:ix_psk0]" initiatorIndex=697670109 issuer=... remoteIndex=0 responderIndex=0 udpAddr="<nil>" vpnIp=172.0.2.120
level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.3 vpnIp=172.0.2.120
level=info msg="Handshake message received" certName=nginx fingerprint=... handshake="map[stage:1 style:ix_psk0]" initiatorIndex=697670109 issuer=... remoteIndex=0 responderIndex=0 udpAddr="<nil>" vpnIp=172.0.2.120
level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120
level=info msg="Handshake message received" certName=nginx fingerprint=... handshake="map[stage:1 style:ix_psk0]" initiatorIndex=697670109 issuer=... remoteIndex=0 responderIndex=0 udpAddr="<nil>" vpnIp=172.0.2.120
level=info msg="Handshake message sent" certName=nginx fingerprint=... handshake="map[stage:2 style:ix_psk0]" initiatorIndex=697670109 issuer=... relay=172.0.0.2 remoteIndex=0 responderIndex=65839047 vpnIp=172.0.2.120
level=info msg="Taking new handshake" certName=nginx localIndex=1535021440 remoteIndex=1412676492 vpnIp=172.0.2.120
level=info msg="Handshake message received" certName=nginx fingerprint=... handshake="map[stage:1 style:ix_psk0]" initiatorIndex=697670109 issuer=... remoteIndex=0 responderIndex=0 udpAddr="<nil>" vpnIp=172.0.2.120
level=info msg="send CreateRelayResponse" initiatorRelayIndex=197443235 relayFrom=172.0.2.120 relayTo=172.0.3.101 responderRelayIndex=1993822362 vpnIp=172.0.0.3
level=info msg=handleCreateRelayRequest initiatorRelayIndex=197443235 relayFrom=172.0.2.120 relayTo=172.0.3.101 vpnIp=172.0.0.3
level=info msg="send CreateRelayResponse" initiatorRelayIndex=197443235 relayFrom=172.0.2.120 relayTo=172.0.3.101 responderRelayIndex=1993822362 vpnIp=172.0.0.3
level=info msg=handleCreateRelayRequest initiatorRelayIndex=197443235 relayFrom=172.0.2.120 relayTo=172.0.3.101 vpnIp=172.0.0.3
level=info msg="Handshake message received" certName=nginx durationNs=123336364 fingerprint=... handshake="map[stage:2 style:ix_psk0]" initiatorIndex=1526267344 issuer=... remoteIndex=1526267344 responderIndex=2242841832 sentCachedPackets=0 udpAddr="<nil>" vpnIp=172.0.2.120
level=info msg="Send handshake via relay" localIndex=1526267344 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.2.120
level=info msg="Send handshake via relay" localIndex=1526267344 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.2.120
level=info msg="Send handshake via relay" localIndex=1526267344 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.2.120
level=info msg="Send handshake via relay" localIndex=1526267344 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.2.120
level=info msg="Attempt to relay through hosts" localIndex=1526267344 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=172.0.2.120
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1526267344 localIndex=1526267344 remoteIndex=0 udpAddrs="[]" vpnIp=172.0.2.120
level=info msg="Re-handshaking with remote" reason="local certificate is not current" vpnIp=172.0.2.120
level=info msg="send CreateRelayRequest" initiatorRelayIndex=3421498576 relayFrom=172.0.3.101 relayTo=172.0.2.120 responderRelayIndex=0 vpnIp=172.0.0.3
level=info msg="Tunnel status" certName=nginx localIndex=1535021440 remoteIndex=1412676492 tunnelCheck="map[method:active state:dead]" vpnIp=172.0.2.120
level=info msg="Tunnel status" certName=nginx localIndex=65839047 remoteIndex=697670109 tunnelCheck="map[method:active state:dead]" vpnIp=172.0.2.120
level=info msg="sendNoMetrics failed to find HostInfo" certName=nginx error="unable to find host with relay" localIndex=1526267344 relay=172.0.0.3 remoteIndex=2242841832 vpnIp=172.0.2.120
level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.3 vpnIp=172.0.2.120
level=info msg="Handshake message received" certName=nginx fingerprint=... handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2270976162 issuer=... remoteIndex=0 responderIndex=0 udpAddr="<nil>" vpnIp=172.0.2.120
level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.3 vpnIp=172.0.2.120
level=info msg="Handshake message received" certName=nginx fingerprint=... handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2270976162 issuer=... remoteIndex=0 responderIndex=0 udpAddr="<nil>" vpnIp=172.0.2.120
level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.3 vpnIp=172.0.2.120
["Handshake message received" and "Handshake message sent" repeats forever]

Working host:

level=info msg="Tunnel status" certName=archive localIndex=2226883942 remoteIndex=1620137054 tunnelCheck="map[method:active state:dead]" vpnIp=172.0.3.101
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3427860552 localIndex=3427860552 remoteIndex=0 udpAddrs="[...]" vpnIp=172.0.3.101
level=info msg="Attempt to relay through hosts" localIndex=3427860552 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=172.0.3.101
level=info msg="send CreateRelayRequest" initiatorRelayIndex=2388030379 relay=172.0.0.2 relayFrom=172.0.2.120 relayTo=172.0.3.101
level=info msg="Send handshake via relay" localIndex=3427860552 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Re-send CreateRelay request" localIndex=3427860552 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="send CreateRelayRequest" initiatorRelayIndex=2388030379 relay=172.0.0.2 relayFrom=172.0.2.120 relayTo=172.0.3.101
level=info msg="Send handshake via relay" localIndex=3427860552 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.101
level=info msg=handleCreateRelayResponse initiatorRelayIndex=2388030379 relayFrom=172.0.2.120 relayTo=172.0.3.101 responderRelayIndex=3272475814 vpnIp=172.0.0.2
level=info msg="Attempt to relay through hosts" localIndex=3427860552 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=3427860552 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=3427860552 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=3427860552 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Send handshake via relay" localIndex=3427860552 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.101
level=info msg="Attempt to relay through hosts" localIndex=3427860552 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=172.0.3.101
["Send handshake via relay" line repeats]
level=info msg="Handshake timed out" durationNs=6824371247 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1271254101 localIndex=1271254101 remoteIndex=0 udpAddrs="[...]" vpnIp=172.0.3.101
["Send handshake via relay" process restarts]

@johnmaguire
Copy link
Collaborator

@Cyberes One log line that jumps out to me is this one:

level=info msg="sendNoMetrics failed to find HostInfo" certName=nginx error="unable to find host with relay" localIndex=1526267344 relay=172.0.0.3 remoteIndex=2242841832 vpnIp=172.0.2.120

This is similar but different from another type of log we've seen a bit lately, on relays:

Nov 11 16:10:12 linux.dsrgthyz.com nebula[1147]: time="2024-11-11T16:10:12Z" level=info msg="Failed to find target host info by ip" certName=android error="unable to find host with relay" localIndex=2672096877 relayTo=192.168.100.3 remoteIndex=442491203 vpnIp=192.168.100.6
Nov 11 16:10:13 linux.dsrgthyz.com nebula[1147]: time="2024-11-11T16:10:13Z" level=info msg="Failed to find target host info by ip" certName=android error="unable to find host with relay" localIndex=2672096877 relayTo=192.168.100.3 remoteIndex=442491203 vpnIp=192.168.100.6
Nov 11 16:10:14 linux.dsrgthyz.com nebula[1147]: time="2024-11-11T16:10:14Z" level=info msg="Failed to find target host info by ip" certName=android error="unable to find host with relay" localIndex=2672096877 relayTo=192.168.100.3 remoteIndex=442491203 vpnIp=192.168.100.6

Are you able to check your Lighthouse logs and see if you're seeing this message as well? I encourage anyone else in this ticket experiencing issues to check for this issue. This bug is being fixed in #1270.

@Cyberes
Copy link

Cyberes commented Nov 12, 2024

I checked my logs and 99% of the unable to find host with relay messages appear to come from my lighthouse and are associated with the same handful of high-traffic nodes that other nodes talk to a lot (nginx proxy, syslog, etc).

image

@johnmaguire
Copy link
Collaborator

@Cyberes "unable to find host with relay" will appear on relay hosts (I expect your Lighthouses are configured to act as relays.) This log message is indicative of the problem in #1270 which occurs when the "responder" node loses connection to the relay. I'm sharing this to try to determine if the connectivity issues you shared earlier in the thread are related to this issue. Next time you experience a connectivity loss you could check for this log line.

@Cyberes
Copy link

Cyberes commented Nov 14, 2024

Will do. Should I continue to post in this issue or is there somewhere else more appropriate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants