Add admin-port to let administrator connect to the server even maxclient number is reached #1120

hwware · 2024-10-03T07:19:18Z

Background:

The parameter "maxclient" default value is 10000. Client administrators could set higher value to accept more client connections.
However, there is one situation: If the connection number of client reaches the maxclient, then even administrators have no chance to connect to the server to process some urgent issues or update some configurations.

To solve above case, by connecting to the admin port, an administrator or cloud provider has the chance to connect to the server even the max number of connected clients reaches

Of course, there is one risk by introducing the admin port: hacker has one more chance to invade the Valkey server, but
I think this kind of risky is less than what we thought. Because in standlone mode, primary node could expose ports to multiply replicas and sentinel nodes, and in cluster mode. primary node has port to connect with other nodes. Thus one more port to connect with administrator is not a big issue.

codecov · 2024-10-03T07:35:24Z

Codecov Report

Attention: Patch coverage is 65.21739% with 8 lines in your changes missing coverage. Please review.

Project coverage is 70.68%. Comparing base (6b3e122) to head (8ac9019).
Report is 1 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/networking.c	36.36%	7 Missing ⚠️
src/server.c	91.66%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1120      +/-   ##
============================================
- Coverage     70.83%   70.68%   -0.16%     
============================================
  Files           118      118              
  Lines         63561    63583      +22     
============================================
- Hits          45025    44943      -82     
- Misses        18536    18640     +104

Files with missing lines	Coverage Δ
src/config.c	`77.63% <ø> (ø)`
src/server.h	`100.00% <ø> (ø)`
src/server.c	`87.40% <91.66%> (-0.02%)`	⬇️
src/networking.c	`88.29% <36.36%> (-0.24%)`	⬇️

... and 13 files with indirect coverage changes

soloestoy · 2024-10-09T03:47:54Z

src/networking.c

    /* Limit the number of connections we take at the same time.
     *
     * Admission control will happen before a client is created and connAccept()
     * called, because we don't want to even start transport-level negotiation
     * if rejected. */
-    if (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients) {
+    if (port != server.admin_port && (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients)) {


connections from admin-port can consume clients out of maxclients, but the eventloop's size is fixed, do we need to make eventloop to be auto resizing?

Hi @soloestoy,

Yes, you’re correct—the event loop size is fixed. However, since the port is known only to the administrator and accepts only local connections, we don't anticipate a large number of clients on the port. Additionally, we've reserved 96 extra FDs in the event loop, which I believe should be sufficient to handle the expected client load on the port. ref:

valkey/src/server.c

Line 2634 in 9b8a061

server.el = aeCreateEventLoop(server.maxclients + CONFIG_FDSET_INCR);

That said, I think I overlooked this. Do you think we should track and add a condition for the number of clients connected to the port?

We can add a fixed number and limit the number of admin connections. We don't need to count the admin connections separately. We can just accept connections on the admin port as long as the total number of connections is less than for example server.maxclients + 20.

Added a check for the number of connections accepted on the port. Currently, it's set to 20, but I believe 10 would be sufficient. Additionally, the event loop size also been updated based on the maximum number of clients accepted on the port.

I think we will also need to track admin clients separately from server.clients or we are eating into the actual end-user client quote, something like below, assuming server.clients is the total

if (port != server.admin_port && (listLength(server.clients) - listLength(server.admin_clients) + getClusterConnectionsCount() >= server.maxclients))

At amazon we have a fixed number of clients ontop of what is requested as well (although I believe we chose 256). It's worth mentioning that clusterbus connections also eat into the overhead.

I think this feature could solve part of this problem in cluster mode.

At amazon we have a fixed number of clients ontop of what is requested as well (although I believe we chose 256). It's worth mentioning that clusterbus connections also eat into the overhead.

We also have a similar CONFIG_EXTRA_RESERVED_FDS, the value is 500. we are doing some whitelist-ips(and localhost) to bypass the maxclients check. I am worry that admin-port consume a listen port and in here it only used for bypass the maxclients check. I am hoping there is a way that admin can do more, for example, when the server is blocked, admin can still has the way to connect in. Also port management is suck in some cases.

I am hoping there is a way that admin can do more,

Yeah we discussed in yesterday's core team meeting that the fd limit needs to be taken care as well. In addition, we should not evict "admin clients" either. I also think we shouldn't allow a "normal" or "standard" client to kill "admin" clients. Another related question would be whether we should count the "admin" clients when reporting "connected_clients" in INFO.

For the fd limit, I think we could avoid hard-code values by resizing the eventloop->events array dynamically in aeCreateFileEvent when the current size is about to be exceeded. The main benefit though would be simplified error handling - see #528

In addition, we should not evict "admin clients" either. I also think we shouldn't allow a "normal" or "standard" client to kill "admin" clients. Another related question would be whether we should count the "admin" clients when reporting "connected_clients" in INFO.

this feels a bit overkill, i agree this is better. A admin should be very fast, and the number should be quite small. Normally, we should try to avoid using the admin port. Ps, we marked client kill as a admin command, so a normal user is unlikely to call it to kill the admin client (we can add a type filter for sure)

For the fd limit, I think we could avoid hard-code values by resizing the eventloop->events array dynamically in aeCreateFileEvent when the current size is about to be exceeded.

yes, i agree dynamically is better in this case.

PingXie

Should we close the discussion at valkey-io/valkey-rfc#3 (comment)?

PingXie · 2024-10-14T04:40:45Z

src/networking.c

    /* Limit the number of connections we take at the same time.
     *
     * Admission control will happen before a client is created and connAccept()
     * called, because we don't want to even start transport-level negotiation
     * if rejected. */
-    if (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients) {
+    if (port != server.admin_port && (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients)) {


I think we will also need to track admin clients separately from server.clients or we are eating into the actual end-user client quote, something like below, assuming server.clients is the total

if (port != server.admin_port && (listLength(server.clients) - listLength(server.admin_clients) + getClusterConnectionsCount() >= server.maxclients))

Signed-off-by: hwware <[email protected]>

hwware · 2024-12-12T18:58:06Z

Should we close the discussion at valkey-io/valkey-rfc#3 (comment)?

It is closed right now.

madolson

Core team discussion:

Wen will followup if just doing a local connection is sufficient for his use case. That simplifies the management of this new port. If Huawei does want a remote connection, we need to discuss if we will support TLS and how we will do that.

PingXie · 2024-12-17T00:51:29Z

Core team discussion:

Wen will followup if just doing a local connection is sufficient for his use case. That simplifies the management of this new port. If Huawei does want a remote connection, we need to discuss if we will support TLS and how we will do that.

I still think the value of introducing an admin port is limited. When the admin port idea was initially proposed, it was intended to address two scenarios, IIRC:

Providing more privilege to clients connecting through a special port:

I believe the correct way to handle this use case is through ACLs.

Working around resource constraints such as maxclients:

While maxclients is just one example, a harder challenge is when the server’s CPUs are overwhelmed by client traffic, making operator connections unreliable.

Looking at this from the operator's perspective, I see two reasons why an operator needs to connect to the server:

To run commands like INFO to collect metrics.

In our case at GCP, we’ve moved to a PUSH model where the server periodically exports its metrics to an external file. This approach avoids the need for direct connections during high-load situations.

To perform administrative operations (e.g., moving slots, adding replicas).

These operations are generally less time-sensitive and can afford to wait so the priority of ensuring immediate connectivity under resource constraints is lower IMO.

I can create a separate issue to describe the PUSH model in more detail, if folks are interested.

PingXie · 2024-12-23T08:32:00Z

[core team discussion]

being able to reliably to connect to the server is in general desired, not just for the metrics but also for management scenarios like auto-scaling

need to patch "maxclients", "max fd", and "client eviction" policies.

admin port is a general construct that could be reused by multiple parties.
security can/should still be achieved via tying the port to the ACLs, somehow.
another point to explore is whether these can be achieved in a module.

PingXie · 2024-12-24T02:50:05Z

[decision time]

@valkey-io/core-team, please 👍 or 👎

Added an admin port dedicated to providing reliable connections only.
Clients connecting to the admin port do not have any special privileges.

It is the operator's responsibility to determine whether the admin port is firewalled off from the applications.

PingXie · 2024-12-24T02:51:07Z

Separately, if we approve this feature, can we consider renaming the port since it doesn’t provide additional privileges? I’m thinking of names like “service port,” “management port,” or perhaps “aux port.”

murphyjacob4 · 2024-12-24T06:07:05Z

Just throwing this out there -

One thing I was thinking is if the problem we actually want to solve is "per port configurations". Or perhaps "per endpoint configurations", if we decouple from just TCP.

If we did this, the case of admin port would simply be setting up an additional endpoint and configuring it with no max clients limit.

Something like:

valkey.conf
...

configure-endpoint type plaintext cluster-broadcast yes maxclients 1000  # user plaintext port

configure-endpoint type tls cluster-broadcast yes maxclients 1000 # user tls port

configure-endpoint type tls maxclients -1 cluster-broadcast no  # admin port

I think this opens it up to future improvements (e.g. perhaps prioritization capabilities to support management connections from being starved by user traffic). Or we can combine ACLs into it, e.g. only a certain endpoint can authenticate to certain ACLs, or an endpoint can automatically authenticate to a certain ACL (if the operator wants to use a firewall as their security perimeter). Obviously - we could choose to get as complicated or simple as we'd like with it.

In my head, I'm thinking of Envoy, where you can set up many "listeners", which each has its own configuration.

As a side benefit, it also opens us up to a mechanism for supporting module owned endpoints, where the module is responsible for the whole connection flow, something like:

module load my_module.so
configure-endpoint type my_module_conn_type

Obviously, it is a superset of the problem solved by just the admin port alone, so that comes with its own tradeoffs. But it feels like a more flexible direction then having a prescriptive set of configurations attached to an admin port IMO

PingXie · 2024-12-24T07:18:13Z

One thing I was thinking is if the problem we actually want to solve is "per port configurations".

Spot on. And the file descriptor limit could be solved independently without introducing a new policy (i.e., dynamically extending the events array)

configure-endpoint type tls maxclients -1 cluster-broadcast no # admin port

The only potential downside is that this will need to be a SPECIAL config and we will have to provide custom parsing logic but I feel that the additional complexity is justified.

hwware · 2024-12-24T16:11:13Z

Just throwing this out there -

One thing I was thinking is if the problem we actually want to solve is "per port configurations". Or perhaps "per endpoint configurations", if we decouple from just TCP.

If we did this, the case of admin port would simply be setting up an additional endpoint and configuring it with no max clients limit.

Something like:
valkey.conf
...

configure-endpoint type plaintext cluster-broadcast yes maxclients 1000  # user plaintext port

configure-endpoint type tls cluster-broadcast yes maxclients 1000 # user tls port

configure-endpoint type tls maxclients -1 cluster-broadcast no  # admin port
I think this opens it up to future improvements (e.g. perhaps prioritization capabilities to support management connections from being starved by user traffic). Or we can combine ACLs into it, e.g. only a certain endpoint can authenticate to certain ACLs, or an endpoint can automatically authenticate to a certain ACL (if the operator wants to use a firewall as their security perimeter). Obviously - we could choose to get as complicated or simple as we'd like with it.

In my head, I'm thinking of Envoy, where you can set up many "listeners", which each has its own configuration.

As a side benefit, it also opens us up to a mechanism for supporting module owned endpoints, where the module is responsible for the whole connection flow, something like:
module load my_module.so
configure-endpoint type my_module_conn_type
Obviously, it is a superset of the problem solved by just the admin port alone, so that comes with its own tradeoffs. But it feels like a more flexible direction then having a prescriptive set of configurations attached to an admin port IMO

I really like this idea, we should have it in our agenda.

madolson · 2025-01-03T23:57:50Z

If we did this, the case of admin port would simply be setting up an additional endpoint and configuring it with no max clients limit.

The syntax seems a bit atrocious, but the idea seems sound to me. I think another thread it was mentioned it would be nice if we could have a native map configuration. So it could instead look something like:

configure-endpoint name=admin type=plaintext cluster-broadcast=yes maxclients=1000

Which comes across as so much more readable to me :) That gets away from the issue Ping mentioned about it having to be special. (but we have more specialness like having more than one of the configuration). Well, we can hash through the syntax.

soloestoy · 2025-01-06T05:51:22Z

configure-endpoint name=admin type=plaintext cluster-broadcast=yes maxclients=1000

I have two questions:

Will the maxclients config be changed to limit individual endpoints rather than the entire server? If so, the original maxclients config should be deprecated, which is a significant change, and we need to consider issues related to upgrading from older versions.
The semantics of the cluster-broadcast attribute confuse me. Is it intended to describe the cluster-bus-port? If so, it should be given a more accurate name since the listening events on the cluster bus are different from the services provided to users. Perhaps a name like protocol=resp|cluster could be used.

murphyjacob4 · 2025-01-10T01:20:12Z

Will the maxclients config be changed to limit individual endpoints rather than the entire server?

I think it is up for discussion. Probably for the time being - having maxclients-enforcement=off would be more usable. Having per-endpoint resource restriction can be an iterative improvement (or we may want to do something like resource groups, allowing lumping of multiple endpoints into one shared limit, kinda like cgroups for linux)

The semantics of the cluster-broadcast attribute confuse me

The idea is that we broadcast two ports currently to the cluster: port and tls-port. So the idea with this configuration is you would configure at least one port to be the cluster-broadcast port, or optionally one plaintext and one TLS port.

Given the one-to-one mapping and Madelyn's improved syntax proposal, it would probably make more sense like:

configure-endpoint name=user-tcp type=plaintext maxclients-enforcement=on
configure-endpoint name=user-tls type=tls maxclients-enforcement=on
configure-endpoint name=admin type=plaintext maxclients-enforcement=off

cluster-announce-plaintext-endpoint "user-tcp"
cluster-announce-tls-endpoint "user-tls"

hwware force-pushed the admin-port-imple branch from a1932d5 to 6092230 Compare October 3, 2024 07:20

hwware requested review from madolson, zuiderkwast and PingXie October 3, 2024 07:40

hwware force-pushed the admin-port-imple branch from 6092230 to bccac8c Compare October 7, 2024 06:08

soloestoy reviewed Oct 9, 2024

View reviewed changes

PingXie reviewed Oct 14, 2024

View reviewed changes

madolson added the major-decision-pending Major decision pending by TSC team label Oct 14, 2024

hwware force-pushed the admin-port-imple branch from cbea31d to 55a35c2 Compare October 16, 2024 00:44

hwware force-pushed the admin-port-imple branch 2 times, most recently from 632e9e0 to e12dc1e Compare November 3, 2024 07:00

hwware force-pushed the admin-port-imple branch 3 times, most recently from 23e662e to 66129c6 Compare November 13, 2024 15:51

hwware force-pushed the admin-port-imple branch 2 times, most recently from 08aebb5 to 8ac9019 Compare December 5, 2024 20:06

hwware changed the title ~~Add admin-port feature~~ Add admin-port to let administrator connect to the server even maxclient number is reached Dec 5, 2024

hwware mentioned this pull request Dec 6, 2024

Introduce admin port parameter to differentiate the admin client connection and general client connection RFC valkey-io/valkey-rfc#3

Closed

hwware force-pushed the admin-port-imple branch from 8ac9019 to e94b3f7 Compare December 11, 2024 17:11

hwware added 5 commits December 11, 2024 17:16

Add admin-port feature

5f68664

Signed-off-by: hwware <[email protected]>

Add example to indicate how to use admin port

11365cc

Signed-off-by: hwware <[email protected]>

Make a little bit adjustment

733c08b

Signed-off-by: hwware <[email protected]>

remove unrelated codes and files

20900e8

Signed-off-by: hwware <[email protected]>

Adjust format

e40f4c2

Signed-off-by: hwware <[email protected]>

hwware force-pushed the admin-port-imple branch from 1b29430 to e40f4c2 Compare December 11, 2024 17:18

madolson reviewed Dec 16, 2024

View reviewed changes

update valkey.conf for admin port comment

e4ac7b2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add admin-port to let administrator connect to the server even maxclient number is reached #1120

Add admin-port to let administrator connect to the server even maxclient number is reached #1120

hwware commented Oct 3, 2024 •

edited

Loading

codecov bot commented Oct 3, 2024 •

edited

Loading

soloestoy Oct 9, 2024

hwware Oct 10, 2024

zuiderkwast Oct 10, 2024

hwware Oct 11, 2024

PingXie Oct 14, 2024

madolson Dec 10, 2024

hwware Dec 12, 2024

enjoy-binbin Dec 23, 2024 •

edited

Loading

PingXie Dec 24, 2024

enjoy-binbin Dec 24, 2024

PingXie left a comment

PingXie Oct 14, 2024

hwware commented Dec 12, 2024

madolson left a comment

PingXie commented Dec 17, 2024

PingXie commented Dec 23, 2024

PingXie commented Dec 24, 2024

PingXie commented Dec 24, 2024

murphyjacob4 commented Dec 24, 2024

PingXie commented Dec 24, 2024

hwware commented Dec 24, 2024

madolson commented Jan 3, 2025

soloestoy commented Jan 6, 2025

murphyjacob4 commented Jan 10, 2025 •

edited

Loading

Add admin-port to let administrator connect to the server even maxclient number is reached #1120

Are you sure you want to change the base?

Add admin-port to let administrator connect to the server even maxclient number is reached #1120

Conversation

hwware commented Oct 3, 2024 • edited Loading

codecov bot commented Oct 3, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

enjoy-binbin Dec 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PingXie left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hwware commented Dec 12, 2024

madolson left a comment

Choose a reason for hiding this comment

PingXie commented Dec 17, 2024

PingXie commented Dec 23, 2024

PingXie commented Dec 24, 2024

PingXie commented Dec 24, 2024

murphyjacob4 commented Dec 24, 2024

PingXie commented Dec 24, 2024

hwware commented Dec 24, 2024

madolson commented Jan 3, 2025

soloestoy commented Jan 6, 2025

murphyjacob4 commented Jan 10, 2025 • edited Loading

hwware commented Oct 3, 2024 •

edited

Loading

codecov bot commented Oct 3, 2024 •

edited

Loading

enjoy-binbin Dec 23, 2024 •

edited

Loading

murphyjacob4 commented Jan 10, 2025 •

edited

Loading