VPN Routing Issue After 0.7.0-beta Update #1079

Closed
kvecchione opened this issue Dec 7, 2021 · 16 comments · Fixed by #1094

@kvecchione

Rancher Desktop Version

0.7.0-beta.1-32-g4c29159

Rancher Desktop K8s Version

1.21.7

What operating system are you using?

macOS

Operating System / Build Version

11.6.1

What CPU architecture are you using?

x64

Windows User Only

No response

Actual Behavior

After the update to 0.7.0-beta I'm seeing issues with network routing while using a VPN service, forcing me to add a route to the VM.

It appears to be related to adding a routable IP: 4363d48

lima-rancher-desktop:~$ ip route show
default via 192.168.205.1 dev rd0 metric 201
default via 192.168.5.2 dev eth0 metric 202
10.42.0.0/24 dev cni0 scope link src 10.42.0.1
100.64.1.0/24 via 192.168.5.2 dev eth0 <-- Needed to add a custom route to pull images over VPN

Steps to Reproduce

Install Rancher Desktop 0.7.0-beta application.
Connect to VPN service.
Attempt to pull an image behind the VPN using nerdctl.

Result

❯ nerdctl pull example.com/docker/base-image/image:base-latest
INFO[0000] trying next host                              error="failed to do request: Head \"https://example.com/v2/docker/base-image/image/manifests/base-latest\": dial tcp 100.1.1.48:443: connect: connection refused" host=example.com
FATA[0000] failed to resolve reference "example.com/docker/base-image/image:base-latest": failed to do request: Head "https://example.com/v2/docker/base-image/image/manifests/base-latest": dial tcp 100.1.1.48:443: connect: connection refused

Expected Behavior

nerdctl pull example.com/docker/base-image/image:base-latest
example.com/docker/base-image/image:base-latest:                 resolved       |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:5f97a633e9cea2e6ca473fc71054d43123c3761e4dc358e6b9393f9ca308f9ff: done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:0e5d5f686006d794fdb0a6b8bb9608902bafe8d7c53c9b90d5928cb718229f58:   done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:fa1d984e6771414963f1d224b3a46d25349f3b0ecdcb8e0431e5dee93867e361:    downloading    |+++++---------------------------------|  4.0 MiB/27.9 MiB
layer-sha256:249318e32f30e79fe2dfe3504c0018b6651d1ea91e7fd76ff3cb7566656e1e68:    downloading    |++++++++------------------------------|  6.0 MiB/26.7 MiB
elapsed: 5.0 s                                                                    total:  10.0 M (2.0 MiB/s)

Additional Information

This was working with 0.6.1 and is a regression with the current 0.7.0 candidate.

@kvecchione kvecchione added the kind/bug Something isn't working label Dec 7, 2021
@kvecchione kvecchione changed the title from "Network Bridging" to "Network Routing After 0.7.0-beta Update" Dec 7, 2021
@kvecchione kvecchione changed the title from "Network Routing After 0.7.0-beta Update" to "Network Routing Issue After 0.7.0-beta Update" Dec 7, 2021
@kvecchione kvecchione changed the title from "Network Routing Issue After 0.7.0-beta Update" to "VPN Routing Issue After 0.7.0-beta Update" Dec 7, 2021
@mook-as mook-as added the component/lima Issues related to lima and qemu label Dec 7, 2021
@jandubois
Member

I do not understand this:

  • The rd0 interface goes to a NATed network on the host, so should be able to route to any address the host can route to, just like eth0.

  • The error says "connection refused", which I think means the route works fine, but the endpoint refused the connection.

    dial tcp 100.1.1.48:443: connect: connection refused
    
  • The route 100.64.1.0/24 via 192.168.5.2 dev eth0 does not cover the endpoint address, so why does it make a difference?
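The third point can be checked mechanically. Below is a minimal sketch using Python's standard `ipaddress` module, with the route and endpoint address exactly as posted in the report. The helper name `covers` is illustrative and not part of any tool mentioned here.

```python
import ipaddress

def covers(route: str, address: str) -> bool:
    """Return True if `address` falls inside the network named by `route`."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(route)

# Custom route from the report vs. the endpoint in the error message.
print(covers("100.64.1.0/24", "100.1.1.48"))  # False
```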

I've tested routing over an OpenVPN connection, and it worked fine via rd0. Confirmed via traceroute that it was using the rd0 IP address.

Now, we are aware that traffic over rd0 experiences extreme packet loss, slowing connections down until they are virtually unusable for pulling images (see #1070), but that should not produce a "connection refused" error.

Can you please clarify?

@jandubois jandubois self-assigned this Dec 8, 2021
@jandubois jandubois added this to the v0.7.0 milestone Dec 8, 2021
@kvecchione
Author

kvecchione commented Dec 8, 2021

Sorry, I had modified some of the IP addresses in the example (missed one), but the route I added did cover the actual IP of the target. In my case I'm using a different type of VPN (Zscaler) which rd0 doesn't seem to be able to route.

I'm not exactly sure why it doesn't route on rd0, but here's another example of the issue using tcptraceroute. If you have anything else you want me to run to diagnose, I'd be happy to do so.

lima-rancher-desktop:~# tcptraceroute intranet.url.com 443
Selected device rd0, address 192.168.205.2, port 60701 for outgoing packets
Tracing the path to intranet.url.com (100.64.1.48) on TCP port 443 (https), 30 hops max
 1  * * *
 2  * * *
 3  intranet.url.com (100.64.1.48) [closed]  0.788 ms  0.785 ms  0.956 ms
 
lima-rancher-desktop:~# ip route add 100.64.1.0/24 via 192.168.5.2

lima-rancher-desktop:~# tcptraceroute intranet.url.com 443
Selected device eth0, address 192.168.5.15, port 46523 for outgoing packets
Tracing the path to intranet.url.com (100.64.1.48) on TCP port 443 (https), 30 hops max
 1  intranet.url.com (100.64.1.48) [open]  0.947 ms  0.619 ms  0.780 ms

@jandubois
Member

Thanks for the update. Is this purely a VPN configuration, or does it involve proxies as well?

Is there any special routing on the host that only forwards packets from certain interfaces to the VPN?

@kvecchione
Author

I'm unfortunately limited in understanding how this product actually functions under the hood. It's definitely different than a traditional OpenVPN-type VPN and I'm fairly certain it involves proxies.

@kvecchione
Author

kvecchione commented Dec 8, 2021

I took a couple pcaps with and without the route added to force it back to using the eth0 interface. It seems like the rd0 network behaves entirely differently when interacting with the VPN:

Without route:
[image: packet capture screenshot]

With route:
[image: packet capture screenshot]

@kvecchione
Author

Can confirm 0.7.0-beta.1-57-g30155bd resolves this issue. Thank you!

@jandubois
Member

Unfortunately removing the local route for rd0 broke the external IP; packets would no longer get delivered via that interface, so we had to revert that change: #1107.

I've not found a way to split ingress/egress between 2 interfaces with just routing table entries; maybe it needs some iptables rules, but we are out of time to figure it out for the 0.7.0 release.

So downloading a 4GB image will slow down from 13s to 90s (but thankfully no longer to 75m). For smaller images the slowdown may be barely noticeable. But if this is a problem for you, then you can work around it by adding the provisioning script to an override.yaml file (new lima feature, only available in the latest builds).

override.yaml uses the same schema as lima.yaml, and its provisioning scripts will execute before the scripts from Rancher Desktop. The full path is ~/Library/Application Support/rancher-desktop/lima/_config/override.yaml. The file does not exist by default; you have to create it.

@jandubois jandubois reopened this Dec 15, 2021
@jandubois jandubois modified the milestones: v0.7.0, v1.0.0 Dec 15, 2021
@kvecchione
Author

@jandubois Would it cause issues to reverse the priority of the default routes so eth0 is preferred over rd0? In my testing, this would fix my issue as well.

lima-rancher-desktop:~# ip route show | grep default
default via 192.168.205.1 dev rd0  metric 201
default via 192.168.5.2 dev eth0  metric 202

lima-rancher-desktop:~# curl intranet.url.com
curl: (7) Failed to connect to intranet.url.com port 80 after 10 ms: Connection refused

lima-rancher-desktop:~# ip route del default via 192.168.205.1
lima-rancher-desktop:~# ip route add default via 192.168.205.1 metric 203

lima-rancher-desktop:~# ip route show | grep default
default via 192.168.5.2 dev eth0  metric 202
default via 192.168.205.1 dev rd0  metric 203

lima-rancher-desktop:~# curl intranet.url.com
<html>
...

https://github.com/rancher-sandbox/rancher-desktop/blob/v0.7.0-beta.1/src/assets/lima-config.yaml#L47-L60

@jandubois
Member

> Would it cause issues to reverse the priority of the default routes so eth0 is preferred over rd0? In my testing, this would fix my issue as well.

Yes, it works when rd0 is a "shared" network, but not when it is "bridged", which is another change we made so that the "external" address is actually accessible from outside the host. A "shared" network is NATed on the host.

Example routes when using a "bridged" network:

lima-rancher-desktop:~$ ip route
default via 192.168.5.2 dev eth0  metric 202
default via 192.168.17.1 dev rd0  metric 203
10.42.0.0/24 dev cni0 scope link  src 10.42.0.1
192.168.0.0/16 dev rd0 scope link  src 192.168.18.110
192.168.5.0/24 dev eth0 scope link  src 192.168.5.15

The higher metric for the default route doesn't matter if you have a more specific route with a longer prefix match. So in the example above, all connections to 192.168/16 will go through rd0. And if I delete that route, then 192.168.18.110 is no longer routable from the host or externally (I'm not sure why; I suspect because return packets will be sent over the default route, and then there is an address mismatch, but I haven't done any packet tracing yet).
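The selection rule described here (longest matching prefix first, metric only as a tie-breaker) can be sketched in a few lines of Python. This is a simplified model for illustration, not how the Linux kernel implements route lookup; the `Route` type and `select_route` helper are hypothetical names, and the table is copied from the bridged-network listing above with `default` written as `0.0.0.0/0`.

```python
import ipaddress
from typing import NamedTuple, Optional

class Route(NamedTuple):
    prefix: str      # destination network; "default" is written as 0.0.0.0/0
    dev: str         # outgoing interface
    metric: int = 0  # lower wins among routes with the same prefix length

# Routes copied from the bridged-network listing above.
ROUTES = [
    Route("0.0.0.0/0", "eth0", 202),
    Route("0.0.0.0/0", "rd0", 203),
    Route("10.42.0.0/24", "cni0"),
    Route("192.168.0.0/16", "rd0"),
    Route("192.168.5.0/24", "eth0"),
]

def select_route(dest: str, routes=ROUTES) -> Optional[Route]:
    """Pick the matching route with the longest prefix; break ties by lowest metric."""
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    return min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, r.metric),
    )

print(select_route("192.168.18.110").dev)  # rd0: the /16 beats both default routes
print(select_route("192.168.5.2").dev)     # eth0: the /24 beats the /16
print(select_route("100.64.1.48").dev)     # eth0: only defaults match, metric 202 wins
```

In this model, lowering the metric on a default route only changes the outcome for destinations that no more specific route covers.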

Eventually all the network settings should be exposed via the UI, but for now you would have to manually configure them via YAML files. Here is how you can revert rd0 to a "shared" network:

$ cat "$HOME/Library/Application Support/rancher-desktop/lima/_config/override.yaml"
networks:
- lima: shared
  interface: rd0
# restart Rancher Desktop after setting up override.yaml
$ rdlima ip route
default via 192.168.5.2 dev eth0  metric 202
default via 192.168.205.1 dev rd0  metric 203
192.168.5.0/24 dev eth0 scope link  src 192.168.5.15
192.168.205.0/24 dev rd0 scope link  src

Since the shared network doesn't define (or need) a specific route for the local network, even local connections go over slirp and have the old speed. But of course you can't access 192.168.205.2 from outside the host.
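For anyone who prefers to script the file creation, here is a minimal Python sketch that writes the same override.yaml shown above. The function name `write_override` and its parameters are assumptions made for illustration; the default directory is the macOS path quoted earlier in this thread, and nothing here is part of Rancher Desktop or lima itself.

```python
from pathlib import Path

# Default location quoted in this thread (macOS).
DEFAULT_CONFIG_DIR = (
    Path.home() / "Library" / "Application Support"
    / "rancher-desktop" / "lima" / "_config"
)

# Same contents as the override.yaml shown above.
SHARED_RD0 = "networks:\n- lima: shared\n  interface: rd0\n"

def write_override(config_dir: Path = DEFAULT_CONFIG_DIR, overwrite: bool = False) -> Path:
    """Create override.yaml in config_dir; refuse to clobber an existing file unless asked."""
    target = config_dir / "override.yaml"
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} already exists")
    config_dir.mkdir(parents=True, exist_ok=True)
    target.write_text(SHARED_RD0)
    return target
```

Calling `write_override()` with no arguments targets the path above. Rancher Desktop still has to be restarted afterwards for the change to take effect.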

And I'm sorry, I just realized that I've been mostly talking about the interface speed, and not the VPN issue. So it is possible that the bridged network will not interfere with your VPN because connections to local resources should not go over the VPN, and connections to VPN addresses should go over the SLIRP interface.

If you want to double-check, I've been testing this with the build for Merge pull request #1107, which is Rancher Desktop-0.7.0-beta.1-103-g29dd7f2.dmg.

@kvecchione
Author

Rancher Desktop-0.7.0-beta.1-103-g29dd7f2.dmg works for my use case. Thanks for providing an example override.yaml, this will be helpful in the interim.

@jandubois
Member

> Rancher Desktop-0.7.0-beta.1-103-g29dd7f2.dmg works for my use case. Thanks for providing an example override.yaml, this will be helpful in the interim.

Thanks for letting me know! Was the override.yaml necessary for your VPN to work, or just for the speedup to access local resources?

@kvecchione
Author

I didn't need override.yaml for either to work with the new build. My understanding was that this is not being included in the 0.7.0 release and I will temporarily need it to make our VPN work, right?

@jandubois
Member

> My understanding was that this is not being included in the 0.7.0 release and I will temporarily need it to make our VPN work, right?

No, the build you tested is extremely similar to the imminent 0.7.0 release, so I'm happy to hear it works for you as-is.

Switching to the shared network would still speed up access to resources in your local network, but maybe you don't access anything from there, so it doesn't matter?

@kvecchione
Author

I'm getting reasonable pull speeds with the defaults on 0.7.0-beta.1-103-g29dd7f2 (with a full factory reset), so I think we're in good shape if this is similar to what 0.7.0 will be.

@kvecchione
Author

I have tested with the latest official 0.7.1 release and everything looks good, closing the issue.

@robrecord

robrecord commented Jul 21, 2023

Hi guys, I've been having issues with Lima not working properly if I have the firewall provided by my VPN service switched on (iVPN) - issues created here and here - just wondering, since Rancher appears to use Lima, is there any straightforward advice someone could offer for how to sidestep these issues? Perhaps a distillation of the above thread in the form of config file suggestions? I'm not familiar with Lima and certainly not Rancher. Many thanks!

Should I create an override.yaml file? What should go in it?

And is this really a lima issue? Because if not then I'm in the wrong place :)
