-
Notifications
You must be signed in to change notification settings - Fork 729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add admin-port to let administrator connect to the server even maxclient number is reached #1120
base: unstable
Are you sure you want to change the base?
Conversation
a1932d5
to
6092230
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1120 +/- ##
============================================
- Coverage 70.83% 70.68% -0.16%
============================================
Files 118 118
Lines 63561 63583 +22
============================================
- Hits 45025 44943 -82
- Misses 18536 18640 +104
|
6092230
to
bccac8c
Compare
/* Limit the number of connections we take at the same time. | ||
* | ||
* Admission control will happen before a client is created and connAccept() | ||
* called, because we don't want to even start transport-level negotiation | ||
* if rejected. */ | ||
if (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients) { | ||
if (port != server.admin_port && (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connections from admin-port
can consume clients out of maxclients
, but the eventloop's size is fixed, do we need to make eventloop to be auto resizing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @soloestoy,
Yes, you’re correct—the event loop size is fixed. However, since the port is known only to the administrator and accepts only local connections, we don't anticipate a large number of clients on the port. Additionally, we've reserved 96 extra FDs in the event loop, which I believe should be sufficient to handle the expected client load on the port. ref:
Line 2634 in 9b8a061
server.el = aeCreateEventLoop(server.maxclients + CONFIG_FDSET_INCR); |
That said, I think I overlooked this. Do you think we should track and add a condition for the number of clients connected to the port?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a fixed number and limit the number of admin connections. We don't need to count the admin connections separately. We can just accept connections on the admin port as long as the total number of connections is less than for example server.maxclients + 20
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check for the number of connections accepted on the port. Currently, it's set to 20, but I believe 10 would be sufficient. Additionally, the event loop size also been updated based on the maximum number of clients accepted on the port.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will also need to track admin
clients separately from server.clients
or we are eating into the actual end-user client quote, something like below, assuming server.clients
is the total
if (port != server.admin_port &&
(listLength(server.clients) - listLength(server.admin_clients) + getClusterConnectionsCount() >= server.maxclients))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At amazon we have a fixed number of clients ontop of what is requested as well (although I believe we chose 256). It's worth mentioning that clusterbus connections also eat into the overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this feature could solve part of this problem in cluster mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At amazon we have a fixed number of clients ontop of what is requested as well (although I believe we chose 256). It's worth mentioning that clusterbus connections also eat into the overhead.
We also have a similar CONFIG_EXTRA_RESERVED_FDS, the value is 500. we are doing some whitelist-ips(and localhost) to bypass the maxclients check. I am worry that admin-port consume a listen port and in here it only used for bypass the maxclients check. I am hoping there is a way that admin can do more, for example, when the server is blocked, admin can still has the way to connect in. Also port management is suck in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hoping there is a way that admin can do more,
Yeah we discussed in yesterday's core team meeting that the fd limit needs to be taken care as well. In addition, we should not evict "admin clients" either. I also think we shouldn't allow a "normal" or "standard" client to kill "admin" clients. Another related question would be whether we should count the "admin" clients when reporting "connected_clients" in INFO
.
For the fd limit, I think we could avoid hard-code values by resizing the eventloop->events
array dynamically in aeCreateFileEvent
when the current size is about to be exceeded. The main benefit though would be simplified error handling - see #528
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, we should not evict "admin clients" either. I also think we shouldn't allow a "normal" or "standard" client to kill "admin" clients. Another related question would be whether we should count the "admin" clients when reporting "connected_clients" in INFO.
this feels a bit overkill, i agree this is better. A admin should be very fast, and the number should be quite small. Normally, we should try to avoid using the admin port. Ps, we marked client kill as a admin command, so a normal user is unlikely to call it to kill the admin client (we can add a type filter for sure)
For the fd limit, I think we could avoid hard-code values by resizing the eventloop->events array dynamically in aeCreateFileEvent when the current size is about to be exceeded.
yes, i agree dynamically is better in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we close the discussion at valkey-io/valkey-rfc#3 (comment)?
/* Limit the number of connections we take at the same time. | ||
* | ||
* Admission control will happen before a client is created and connAccept() | ||
* called, because we don't want to even start transport-level negotiation | ||
* if rejected. */ | ||
if (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients) { | ||
if (port != server.admin_port && (listLength(server.clients) + getClusterConnectionsCount() >= server.maxclients)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will also need to track admin
clients separately from server.clients
or we are eating into the actual end-user client quote, something like below, assuming server.clients
is the total
if (port != server.admin_port &&
(listLength(server.clients) - listLength(server.admin_clients) + getClusterConnectionsCount() >= server.maxclients))
cbea31d
to
55a35c2
Compare
632e9e0
to
e12dc1e
Compare
23e662e
to
66129c6
Compare
08aebb5
to
8ac9019
Compare
8ac9019
to
e94b3f7
Compare
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
1b29430
to
e40f4c2
Compare
It is closed right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Core team discussion:
- Wen will followup if just doing a local connection is sufficient for his use case. That simplifies the management of this new port. If Huawei does want a remote connection, we need to discuss if we will support TLS and how we will do that.
I still think the value of introducing an admin port is limited. When the admin port idea was initially proposed, it was intended to address two scenarios, IIRC:
I believe the correct way to handle this use case is through ACLs.
While Looking at this from the operator's perspective, I see two reasons why an operator needs to connect to the server:
In our case at GCP, we’ve moved to a PUSH model where the server periodically exports its metrics to an external file. This approach avoids the need for direct connections during high-load situations.
These operations are generally less time-sensitive and can afford to wait so the priority of ensuring immediate connectivity under resource constraints is lower IMO. I can create a separate issue to describe the PUSH model in more detail, if folks are interested. |
[core team discussion]
need to patch "maxclients", "max fd", and "client eviction" policies.
|
[decision time] @valkey-io/core-team, please 👍 or 👎
It is the operator's responsibility to determine whether the admin port is firewalled off from the applications. |
Separately, if we approve this feature, can we consider renaming the port since it doesn’t provide additional privileges? I’m thinking of names like “service port,” “management port,” or perhaps “aux port.” |
Just throwing this out there - One thing I was thinking is if the problem we actually want to solve is "per port configurations". Or perhaps "per endpoint configurations", if we decouple from just TCP. If we did this, the case of admin port would simply be setting up an additional endpoint and configuring it with no max clients limit. Something like:
I think this opens it up to future improvements (e.g. perhaps prioritization capabilities to support management connections from being starved by user traffic). Or we can combine ACLs into it, e.g. only a certain endpoint can authenticate to certain ACLs, or an endpoint can automatically authenticate to a certain ACL (if the operator wants to use a firewall as their security perimeter). Obviously - we could choose to get as complicated or simple as we'd like with it. In my head, I'm thinking of Envoy, where you can set up many "listeners", which each has its own configuration. As a side benefit, it also opens us up to a mechanism for supporting module owned endpoints, where the module is responsible for the whole connection flow, something like:
Obviously, it is a superset of the problem solved by just the admin port alone, so that comes with its own tradeoffs. But it feels like a more flexible direction then having a prescriptive set of configurations attached to an admin port IMO |
Spot on. And the file descriptor limit could be solved independently without introducing a new policy (i.e., dynamically extending the
The only potential downside is that this will need to be a SPECIAL config and we will have to provide custom parsing logic but I feel that the additional complexity is justified. |
I really like this idea, we should have it in our agenda. |
The syntax seems a bit atrocious, but the idea seems sound to me. I think another thread it was mentioned it would be nice if we could have a native map configuration. So it could instead look something like:
Which comes across as so much more readable to me :) That gets away from the issue Ping mentioned about it having to be special. (but we have more specialness like having more than one of the configuration). Well, we can hash through the syntax. |
I have two questions:
|
I think it is up for discussion. Probably for the time being - having
The idea is that we broadcast two ports currently to the cluster: Given the one-to-one mapping and Madelyn's improved syntax proposal, it would probably make more sense like:
|
Background:
The parameter "maxclient" default value is 10000. Client administrators could set higher value to accept more client connections.
However, there is one situation: If the connection number of client reaches the maxclient, then even administrators have no chance to connect to the server to process some urgent issues or update some configurations.
To solve above case, by connecting to the admin port, an administrator or cloud provider has the chance to connect to the server even the max number of connected clients reaches
Of course, there is one risk by introducing the admin port: hacker has one more chance to invade the Valkey server, but
I think this kind of risky is less than what we thought. Because in standlone mode, primary node could expose ports to multiply replicas and sentinel nodes, and in cluster mode. primary node has port to connect with other nodes. Thus one more port to connect with administrator is not a big issue.