
Forwarder processes requests too slowly when there are a lot of clients #1666

NikitaSkrynnik opened this issue Sep 17, 2024 · 11 comments
NikitaSkrynnik commented Sep 17, 2024

Description

After some analysis we found out that the forwarder processes requests too slowly. Here are the top 5 slowest places:

  1. discoverforwarder - up to 6s
  2. discoverendpoint - up to 4s
  3. roundrobin - up to 1s
  4. postpone - up to 900ms
  5. Closes in many sdk-vpp chain elements - can take up to tens of seconds

discoverforwarder and discoverendpoint

The root cause of these issues is probably the slow registry-k8s.

Issues:
networkservicemesh/sdk-k8s#512


roundrobin

Needs more investigation...


postpone

The root cause of postpone being too slow is improper use of contexts in some places. trace relies heavily on context.WithValue, and a lot of other chain elements use this function as well.
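A minimal sketch (plain Go, not NSM code) of why heavy use of context.WithValue gets expensive: every call wraps the parent context, so a deep chain turns every Value lookup into a walk over hundreds of wrappers, similar to the ~1200 context fields observed with TRACE logging.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type key int

func main() {
	ctx := context.Background()
	// Simulate a long chain of elements, each storing one value in the context.
	for i := 0; i < 1200; i++ {
		ctx = context.WithValue(ctx, key(i), i)
	}

	start := time.Now()
	for i := 0; i < 10000; i++ {
		_ = ctx.Value(key(0)) // the oldest key: the lookup walks the whole chain
	}
	fmt.Println("10k lookups of the deepest key took", time.Since(start))
}
```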

Issues:
#1665
#1667


Closes in many sdk-vpp chain elements

Clients can wait for an error from the forwarder much longer than the request timeout because some chain elements call Close if the Request fails.
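A self-contained sketch (assumed timings, not sdk-vpp code) of the effect: if a chain element reacts to a failed Request by running a Close/cleanup step with its own deadline, the caller's total wait becomes the request timeout plus the cleanup time.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// request blocks until the work finishes or the context expires.
func request(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(5 * time.Second): // the "work" is slower than the timeout
		return nil
	}
}

// closeConn stands in for a slow cleanup (e.g. vpp resource teardown).
func closeConn() { time.Sleep(3 * time.Second) }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	start := time.Now()
	if err := request(ctx); errors.Is(err, context.DeadlineExceeded) {
		closeConn() // cleanup runs after the deadline while the caller is still waiting
	}
	fmt.Println("client waited", time.Since(start), "with a 1s request timeout")
}
```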

Issues:
networkservicemesh/sdk-vpp#851

@NikitaSkrynnik

@denis-tingaikin, @szvincze


NikitaSkrynnik commented Sep 19, 2024

The current plan is to investigate why nsmgr is slow. It looks like the problems in the forwarder and nsmgr are the same:

  • Remove all limits from nsmgr
  • Measure the request processing time on nsmgr without traces (see the timing sketch below)
  • Check the size of the context without traces and how fast it works
  • Check the size of the context with traces and how fast it works
  • If the context doesn't affect the speed, then investigate the trace chain element
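For the measurement step, a rough sketch of a gRPC unary interceptor that logs per-request processing time; the interceptor name and wiring are assumptions, not existing nsmgr code.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// timingInterceptor logs how long each unary call spends inside the server.
func timingInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	log.Printf("%s took %s", info.FullMethod, time.Since(start))
	return resp, err
}

func main() {
	srv := grpc.NewServer(grpc.UnaryInterceptor(timingInterceptor))
	_ = srv // register the nsmgr services and serve as usual
}
```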


NikitaSkrynnik commented Sep 20, 2024

Some statistics after rc.7 testing (40 clients, min and max request processing time, context size), only local cases:

| # | TELEMETRY | LOG LEVEL | TIME MIN | TIME MAX | CONTEXT SIZE | FIELDS |
|---|-----------|-----------|----------|----------|--------------|--------|
| 1 | FALSE | INFO | 300ms | 9s | 187 fields | fields_1.txt |
| 2 | TRUE | INFO | 2s | 15s | 569 fields | fields_2.txt |
| 3 | TRUE | TRACE | 10s | 40s | 1239 fields | fields_3.txt |

@NikitaSkrynnik

Did some analysis:

  1. These lines consume up to 10 seconds
  2. These lines take up to 7 seconds


NikitaSkrynnik commented Sep 23, 2024

Current plan:

  • Find out why NSM without traces can spend 9 seconds processing a request

@denis-tingaikin denis-tingaikin moved this from In Progress to Moved to next release in Release v1.14.0 Sep 24, 2024

NikitaSkrynnik commented Sep 25, 2024

The dial chain element can consume up to 1s. We need to investigate what exactly affects the performance: grpc or the unix socket.

Current plan:

  • Create a new application that makes a lot of find requests and counts the average dial time (a rough measurement sketch follows below)
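A rough sketch of such a measurement: time a bare unix-socket connect and a gRPC dial over the same socket side by side, to separate the two factors. The socket path is an assumption, not the actual nsmgr socket.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const socket = "/var/lib/networkservicemesh/nsm.io.sock" // assumption: adjust to the real socket path

func main() {
	const n = 100
	var rawTotal, grpcTotal time.Duration

	for i := 0; i < n; i++ {
		// Plain unix-socket connect, no gRPC involved.
		start := time.Now()
		if c, err := net.DialTimeout("unix", socket, time.Second); err == nil {
			rawTotal += time.Since(start)
			_ = c.Close()
		}

		// gRPC dial over the same socket, blocking until the connection is ready.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		start = time.Now()
		if cc, err := grpc.DialContext(ctx, "unix://"+socket,
			grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock()); err == nil {
			grpcTotal += time.Since(start)
			_ = cc.Close()
		}
		cancel()
	}
	fmt.Printf("avg raw dial %v, avg grpc dial %v\n", rawTotal/n, grpcTotal/n)
}
```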


NikitaSkrynnik commented Sep 30, 2024

After removing all resource limits from nse, forwarder, nsmgr and registry, the request time without traces is between 200ms and 1.5s. Almost all of that time is consumed by the forwarder.

The most time-consuming elements:

| CHAIN ELEMENT | MIN TIME | MAX TIME |
|---|---|---|
| discover (dial) | 3ms | 37ms |
| discover (find) | 3ms | 40ms |
| discover (recv) | 19ms | 117ms |
| authorize | 20ms | 150ms |
| kernelTapClient/Server | 10ms | 150ms |
| tagClient/Server | < 1ms | 50ms |
| upClient/Server | < 1ms | 50ms |
| mtuClient/Server | < 1ms | 50ms |
| l2XconnectServer | 15ms | 50ms |
| ipaddressClient | < 1ms | 20ms |
| pinggrouprangeServer | 1ms | 15ms |

Other sdk-vpp chain elements can also take up to 50ms, but this is rare.

Testing of the dial chain element didn't show any valuable results. Dial time varies from 1ms to 1.7s regardless of the number of clients.

@denis-tingaikin

> Testing of the dial chain element didn't show any valuable results. Dial time varies from 1ms to 1.7s regardless of the number of clients.

I think the main question for now is whether the 1.7s dial is related to the unix sockets or to some problem in grpc that we should fix.


NikitaSkrynnik commented Oct 3, 2024

Collected more statistics without resource limits on the forwarder, nse, nsmgr and registry pods (30 clients, Azure cluster):

| # | TELEMETRY | LOG LEVEL | TIME MIN | TIME MAX | CONTEXT SIZE | FIELDS |
|---|-----------|-----------|----------|----------|--------------|--------|
| 1 | FALSE | INFO | 200ms | 1.5s | 187 fields | fields_1.txt |
| 2 | TRUE | INFO | 300ms | 1.8s | 569 fields | fields_2.txt |
| 3 | TRUE | TRACE | 400ms | 2.1s | 1239 fields | fields_3.txt |

Testing dial on the Azure cluster with a lot of find requests didn't show anything special. It looks like the 1.7s dial happened only once on my local machine. The average dial time for one application that spams find requests is 5ms, with a maximum of 70ms. For 10 applications the average dial time is 25ms, with a maximum of 300ms.

Testing on unix sockets without grpc and NSM has shown that dial time can randomly be 10x the average value. The same is true with grpc and NSM.


NikitaSkrynnik commented Oct 3, 2024

Conclusion

The scenario with 30-40 clients and 1 endpoint on a single-node cluster works well.

Current plan

  • Add a cache to discoverForwarder and discoverEndpoint (see the cache sketch below)
  • Fix authorize
  • Refactor the trace chain element
  • Test remote scenarios with one endpoint and 40 clients (traces enabled)
  • Test remote scenarios with 30 endpoints and 30 clients. Endpoints should have a CIDR of size 2 and be scaled from 15 to 30 repeatedly
  • Test remote scenarios with 30 endpoints and 30 clients. Endpoints should have a CIDR of size 2; clients should be scaled from 0 to 30
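A minimal TTL-cache sketch of the kind of client-side cache that could front the registry Find calls in discoverForwarder/discoverEndpoint; the type and method names here are assumptions, not the actual sdk change.

```go
package main

import (
	"sync"
	"time"
)

type entry[V any] struct {
	value   V
	expires time.Time
}

// TTLCache caches values for a fixed duration; expired entries are refetched.
type TTLCache[K comparable, V any] struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[K]entry[V]
}

func NewTTLCache[K comparable, V any](ttl time.Duration) *TTLCache[K, V] {
	return &TTLCache[K, V]{ttl: ttl, m: make(map[K]entry[V])}
}

// GetOrFetch returns the cached value for key, or calls fetch and stores the result.
func (c *TTLCache[K, V]) GetOrFetch(key K, fetch func() (V, error)) (V, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.m[key]; ok && time.Now().Before(e.expires) {
		return e.value, nil
	}
	v, err := fetch()
	if err != nil {
		return v, err
	}
	c.m[key] = entry[V]{value: v, expires: time.Now().Add(c.ttl)}
	return v, nil
}

func main() {
	cache := NewTTLCache[string, []string](5 * time.Second)
	// e.g. key = network service name, fetch = the registry Find call
	_, _ = cache.GetOrFetch("my-service", func() ([]string, error) {
		return []string{"nse-1"}, nil
	})
}
```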


denis-tingaikin commented Oct 4, 2024

Decomposition:

positive scenario: 7d

  • simplify/refactor trace
  • add a client-side cache for nse/ns
  • fix authorize
  • run tests

negative scenario: 15d

  • simplify/refactor trace
  • add a client-side cache for nse/ns
  • fix authorize
  • run tests
  • get metrics for remote cases
  • get metrics for each chain element
  • fix chain elements

vpp elements should be considered in a separate issue.

@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.15.0 Nov 5, 2024