
Forwarder processes requests too slowly when there are a lot of clients #1666

NikitaSkrynnik opened this issue Sep 17, 2024 · 11 comments
NikitaSkrynnik commented Sep 17, 2024

Description

After some analysis we found out that the forwarder processes requests too slowly. Here are the top 5 slowest places:

  1. discoverforwarder - up to 6s
  2. discoverendpoint - up to 4s
  3. roundrobin - up to 1s
  4. postpone - up to 900ms
  5. Closes in many sdk-vpp chain elements - can take up to tens of seconds

discoverforwarder and discoverendpoint

The root cause of these issues is probably the slow registry-k8s.

Issues:
networkservicemesh/sdk-k8s#512


roundrobin

Needs more investigation...


postpone

The root cause of postpone being too slow is improper use of contexts in some places. trace relies heavily on context.WithValue, and a lot of other chain elements use this function as well.
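A minimal sketch (plain Go, not NSM code) of why heavy use of context.WithValue gets expensive: every call wraps the parent context, so a deep chain turns every Value lookup into a walk over hundreds of wrappers, similar to the ~1200 context fields observed with TRACE logging.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type key int

func main() {
	ctx := context.Background()
	// Simulate a long chain of elements, each storing one value in the context.
	for i := 0; i < 1200; i++ {
		ctx = context.WithValue(ctx, key(i), i)
	}

	start := time.Now()
	for i := 0; i < 10000; i++ {
		_ = ctx.Value(key(0)) // the oldest key: the lookup walks the whole chain
	}
	fmt.Println("10k lookups of the deepest key took", time.Since(start))
}
```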

Issues:
#1665
#1667


Closes in many sdk-vpp chain elements

Clients can wait for an error from the forwarder much longer than the request timeout because some chain elements call Close if the Request fails.
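A self-contained sketch (assumed timings, not sdk-vpp code) of the effect: if a chain element reacts to a failed Request by running a Close/cleanup step with its own deadline, the caller's total wait becomes the request timeout plus the cleanup time.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// request blocks until the work finishes or the context expires.
func request(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(5 * time.Second): // the "work" is slower than the timeout
		return nil
	}
}

// closeConn stands in for a slow cleanup (e.g. vpp resource teardown).
func closeConn() { time.Sleep(3 * time.Second) }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	start := time.Now()
	if err := request(ctx); errors.Is(err, context.DeadlineExceeded) {
		closeConn() // cleanup runs after the deadline while the caller is still waiting
	}
	fmt.Println("client waited", time.Since(start), "with a 1s request timeout")
}
```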

Issues:
networkservicemesh/sdk-vpp#851

@NikitaSkrynnik

@denis-tingaikin, @szvincze


NikitaSkrynnik commented Sep 19, 2024

The current plan is to investigate why nsmgr is slow. It looks like the problems in the forwarder and nsmgr are the same:

  • Remove all limits from nsmgr
  • Measure the request processing time on nsmgr without traces (see the timing sketch below)
  • Check the size of the context without traces and how fast it works
  • Check the size of the context with traces and how fast it works
  • If the context doesn't affect the speed, then investigate the trace chain element
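For the measurement step, a rough sketch of a gRPC unary interceptor that logs per-request processing time; the interceptor name and wiring are assumptions, not existing nsmgr code.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// timingInterceptor logs how long each unary call spends inside the server.
func timingInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	log.Printf("%s took %s", info.FullMethod, time.Since(start))
	return resp, err
}

func main() {
	srv := grpc.NewServer(grpc.UnaryInterceptor(timingInterceptor))
	_ = srv // register the nsmgr services and serve as usual
}
```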


NikitaSkrynnik commented Sep 20, 2024

Some statistics after rc.7 testing (40 clients, min and max request processing time, context size), only local cases:

| # | TELEMETRY | LOG LEVEL | TIME MIN | TIME MAX | CONTEXT SIZE | FIELDS |
|---|-----------|-----------|----------|----------|--------------|--------|
| 1 | FALSE | INFO | 300ms | 9s | 187 fields | fields_1.txt |
| 2 | TRUE | INFO | 2s | 15s | 569 fields | fields_2.txt |
| 3 | TRUE | TRACE | 10s | 40s | 1239 fields | fields_3.txt |

@NikitaSkrynnik

Did some analysis:

  1. These lines consume up to 10 seconds
  2. These lines take up to 7 seconds


NikitaSkrynnik commented Sep 23, 2024

Current plan:

  • Find out why NSM without traces can spend 9 seconds processing a request

@denis-tingaikin denis-tingaikin moved this from In Progress to Moved to next release in Release v1.14.0 Sep 24, 2024

NikitaSkrynnik commented Sep 25, 2024

The dial chain element can consume up to 1s. We need to investigate what exactly affects the performance: grpc or the unix socket.

Current plan:

  • Create a new application that makes a lot of find requests and counts the average dial time (a rough measurement sketch follows below)
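A rough sketch of such a measurement: time a bare unix-socket connect and a gRPC dial over the same socket side by side, to separate the two factors. The socket path is an assumption, not the actual nsmgr socket.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const socket = "/var/lib/networkservicemesh/nsm.io.sock" // assumption: adjust to the real socket path

func main() {
	const n = 100
	var rawTotal, grpcTotal time.Duration

	for i := 0; i < n; i++ {
		// Plain unix-socket connect, no gRPC involved.
		start := time.Now()
		if c, err := net.DialTimeout("unix", socket, time.Second); err == nil {
			rawTotal += time.Since(start)
			_ = c.Close()
		}

		// gRPC dial over the same socket, blocking until the connection is ready.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		start = time.Now()
		if cc, err := grpc.DialContext(ctx, "unix://"+socket,
			grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock()); err == nil {
			grpcTotal += time.Since(start)
			_ = cc.Close()
		}
		cancel()
	}
	fmt.Printf("avg raw dial %v, avg grpc dial %v\n", rawTotal/n, grpcTotal/n)
}
```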


NikitaSkrynnik commented Sep 30, 2024

After removing all resource limits from nse, forwarder, nsmgr and registry, the request time without traces is between 200ms and 1.5s. Almost all of that time is consumed by the forwarder.

The most time-consuming elements:

| CHAIN ELEMENT | MIN TIME | MAX TIME |
|---|---|---|
| discover (dial) | 3ms | 37ms |
| discover (find) | 3ms | 40ms |
| discover (recv) | 19ms | 117ms |
| authorize | 20ms | 150ms |
| kernelTapClient/Server | 10ms | 150ms |
| tagClient/Server | < 1ms | 50ms |
| upClient/Server | < 1ms | 50ms |
| mtuClient/Server | < 1ms | 50ms |
| l2XconnectServer | 15ms | 50ms |
| ipaddressClient | < 1ms | 20ms |
| pinggrouprangeServer | 1ms | 15ms |

Other sdk-vpp chain elements can also take up to 50ms, but this is rare.

Testing of the dial chain element didn't show any valuable results. Dial time varies from 1ms to 1.7s regardless of the number of clients.

@denis-tingaikin

> Testing of the dial chain element didn't show any valuable results. Dial time varies from 1ms to 1.7s regardless of the number of clients.

I think the main question for now is whether the 1.7s dial is related to the unix sockets or to some problem in grpc that we should fix.


NikitaSkrynnik commented Oct 3, 2024

Collected more statistics without resource limits on the forwarder, nse, nsmgr and registry pods (30 clients, Azure cluster):

| # | TELEMETRY | LOG LEVEL | TIME MIN | TIME MAX | CONTEXT SIZE | FIELDS |
|---|-----------|-----------|----------|----------|--------------|--------|
| 1 | FALSE | INFO | 200ms | 1.5s | 187 fields | fields_1.txt |
| 2 | TRUE | INFO | 300ms | 1.8s | 569 fields | fields_2.txt |
| 3 | TRUE | TRACE | 400ms | 2.1s | 1239 fields | fields_3.txt |

Testing dial on the Azure cluster with a lot of find requests didn't show anything special. It looks like the 1.7s dial happened only once on my local machine. The average dial time for one application that spams find requests is 5ms, with a maximum of 70ms. For 10 applications the average dial time is 25ms, with a maximum of 300ms.

Testing on unix sockets without grpc and NSM has shown that dial time can randomly be 10x the average value. The same is true with grpc and NSM.


NikitaSkrynnik commented Oct 3, 2024

Conclusion

The scenario with 30-40 clients and 1 endpoint on a single-node cluster works well.

Current plan

  • Add a cache to discoverForwarder and discoverEndpoint (see the cache sketch below)
  • Fix authorize
  • Refactor the trace chain element
  • Test remote scenarios with one endpoint and 40 clients (traces enabled)
  • Test remote scenarios with 30 endpoints and 30 clients. Endpoints should have a CIDR of size 2 and be scaled from 15 to 30 repeatedly
  • Test remote scenarios with 30 endpoints and 30 clients. Endpoints should have a CIDR of size 2; clients should be scaled from 0 to 30
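A minimal TTL-cache sketch of the kind of client-side cache that could front the registry Find calls in discoverForwarder/discoverEndpoint; the type and method names here are assumptions, not the actual sdk change.

```go
package main

import (
	"sync"
	"time"
)

type entry[V any] struct {
	value   V
	expires time.Time
}

// TTLCache caches values for a fixed duration; expired entries are refetched.
type TTLCache[K comparable, V any] struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[K]entry[V]
}

func NewTTLCache[K comparable, V any](ttl time.Duration) *TTLCache[K, V] {
	return &TTLCache[K, V]{ttl: ttl, m: make(map[K]entry[V])}
}

// GetOrFetch returns the cached value for key, or calls fetch and stores the result.
func (c *TTLCache[K, V]) GetOrFetch(key K, fetch func() (V, error)) (V, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.m[key]; ok && time.Now().Before(e.expires) {
		return e.value, nil
	}
	v, err := fetch()
	if err != nil {
		return v, err
	}
	c.m[key] = entry[V]{value: v, expires: time.Now().Add(c.ttl)}
	return v, nil
}

func main() {
	cache := NewTTLCache[string, []string](5 * time.Second)
	// e.g. key = network service name, fetch = the registry Find call
	_, _ = cache.GetOrFetch("my-service", func() ([]string, error) {
		return []string{"nse-1"}, nil
	})
}
```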


denis-tingaikin commented Oct 4, 2024

Decomposition:

positive scenario: 7d

  • simplify/refactor trace
  • add a client-side cache for nse/ns
  • fix authorize
  • run tests

negative scenario: 15d

  • simplify/refactor trace
  • add a client-side cache for nse/ns
  • fix authorize
  • run tests
  • get metrics for remote cases
  • get metrics for each chain element
  • fix chain elements

vpp elements should be considered in a separate issue.

@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.15.0 Nov 5, 2024