The Knative stack (Eventing and Serving) is pretty awesome. Overall it makes infrastructure easy to use for our devs and ML engineers. On Kubernetes I don't want to miss it anymore. It has advanced scaling capabilities based on concurrency and handles scaling very well in the following two scenarios:
microservices which answer very quickly (milliseconds)
long-running jobs which take more than a couple of minutes, using the JobSink approach
The scenarios in between, from multiple seconds up to around a minute, seem to be a blind spot, especially when combining Knative Eventing and Serving.
Possible configurations
First of all, here are a couple of (central) configurations and behaviors (if something is wrong or missing, please let me know!):
kubectl get cm -n knative-eventing config-kafka-broker-data-plane -o yaml
max.poll.records=50
This controls roughly how many messages are pulled in parallel per partition.
It is a central configuration and is set at Kubernetes cluster level (more specifically, at the knative-eventing-kafka installation level).
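For reference, here is a minimal sketch of how this could look in that ConfigMap. The data key name (config-kafka-broker-consumer.properties) is an assumption and may differ between knative-eventing-kafka-broker versions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-kafka-broker-data-plane
  namespace: knative-eventing
data:
  # consumer properties shared by every Broker/Trigger on the cluster
  config-kafka-broker-consumer.properties: |
    # maximum number of records a single consumer poll() may return
    max.poll.records=50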
kubectl get cm -n knative-eventing kafka-broker-config -o yaml
default.topic.partitions: "10"
The partitions act as a multiplier: max.poll.records x default.topic.partitions = 50 x 10 = 500 messages are pulled roughly in parallel.
This configuration is on Broker level but requires at least an additional kafka-broker-config-* ConfigMap.
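As a hedged illustration, one could create a second kafka-broker-config-* ConfigMap with fewer partitions and point a dedicated Broker at it. The ConfigMap name, Broker name and bootstrap.servers address below are made up:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-broker-config-low-parallelism   # hypothetical name
  namespace: knative-eventing
data:
  default.topic.partitions: "1"
  default.topic.replication.factor: "3"
  bootstrap.servers: "my-cluster-kafka-bootstrap.kafka:9092"   # placeholder address
---
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: slow-services-broker                  # hypothetical name
  namespace: default
  annotations:
    eventing.knative.dev/broker.class: Kafka
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config-low-parallelism
    namespace: knative-eventing

Keep in mind that the partition count is only applied when the Broker's topic is created, which is part of why this knob is so static.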
Another possibility to restrict traffic is using ordered delivery: "An ordered consumer is a per-partition blocking consumer that waits for a successful response from the CloudEvent subscriber before it delivers the next message of the partition." This means with 1 partition it handles only 1 message at a time; with 10 partitions it handles 10 messages at a time.
Another important variable is the DeliverySpec.Timeout, which can be set on Broker and Trigger level (which makes it independent for every Sink/microservice). I could not find the default value, but I assume it is 30s.
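Putting the last two points together, a Trigger sketch could look like the following (names are placeholders; depending on the Eventing version the delivery timeout may still sit behind the delivery-timeout feature flag):

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: example-trigger                      # placeholder name
  annotations:
    # per-partition blocking consumer: waits for a successful response
    # before delivering the next message of the partition
    kafka.eventing.knative.dev/delivery.order: ordered
spec:
  broker: default
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: slow-python-service              # placeholder sink
  delivery:
    timeout: PT5M          # ISO-8601 duration, per-request timeout toward the sink
    retry: 3
    backoffPolicy: exponential
    backoffDelay: PT1S

The resulting configuration can be inspected with kubectl get trigger example-trigger -o yaml.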
Concurrency determines the number of simultaneous requests that can be processed by each replica of an application at any given time.
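On the Serving side, the hard limit is just a field on the revision template. A minimal sketch (service name and image are placeholders):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: slow-python-service                  # placeholder name
spec:
  template:
    spec:
      # hard limit: queue-proxy admits at most 2 concurrent requests per replica
      containerConcurrency: 2
      # maximum duration of a single request before the revision times out
      timeoutSeconds: 300
      containers:
        - image: registry.example.com/vision-worker:latest   # placeholder image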
Background
I am working in a Python world with AI. Python is not good at doing things fast and concurrently. AI workloads (we do a lot in the vision area, with segmentation and so forth) need a lot of processing time and are therefore slow. We also often hit a tipping point: when a Python FastAPI/Flask service is overloaded, its throughput drops sharply. Hence, with a lower concurrency it can achieve the highest throughput.
Scenario and possible tweaks
Let's assume I have a microservice which can handle 2 messages in parallel and needs 10s per message. Hence the throughput is around 12 messages per minute per replica. Keeping the default values of max.poll.records=50, default.topic.partitions: "10" and timeout: PT30S (unordered), the microservice would be overloaded immediately if we get 500 messages, and nothing would get processed correctly.
Now, we could set containerConcurrency: 2 (hard limit), which ensures that only 2 messages reach the FastAPI/Flask level at once and gives us the maximum performance, but this limit is enforced at the queue-proxy level. The queue-proxy has to buffer all the other messages and also produces timeouts after 30s (I could not even identify how to increase the timeout there).
If we decreased max.poll.records to 2, it would throttle all the other microservices in our Kubernetes cluster, which could otherwise handle a very high throughput.
We could set a combination of default.topic.partitions: "10" (another number is possible) and kafka.eventing.knative.dev/delivery.order: ordered. But the partition configuration is on Broker level and is quite static.
All this leads to timeouts, which trigger retries, which flood the system even more.
I know knative-eventing is in the end an abstraction over Kafka. But would it be possible to implement an intelligent mechanism so that knative-eventing respects and is aware of the knative-serving concurrency, and only pulls as many messages out of Kafka as needed to utilize the microservices as well as possible (but not more)? While also scaling up more replicas if the demand requires it?
Expected behavior
So the expected behavior would be:
Let's assume we have 500 messages in the queue and only 1 replica which can handle only 2 concurrent requests; then it should not send more than 2 concurrent requests to that replica (hard limit)
Then it should scale up a couple of replicas accordingly to deal with the traffic (a purely illustrative sketch of what this could look like follows below)
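To make the request concrete, here is a purely hypothetical sketch; the sink-aware annotation below does not exist today and is only meant to illustrate the desired behavior:

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: example-trigger
  annotations:
    # HYPOTHETICAL, does not exist: the dispatcher would read the sink's
    # containerConcurrency (2) and current replica count and keep at most
    # containerConcurrency x replicas requests in flight, leaving the rest
    # of the 500 messages in Kafka instead of buffering them in queue-proxy
    kafka.eventing.knative.dev/sink-aware-concurrency: "true"
spec:
  broker: default
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: slow-python-service

As the autoscaler adds replicas, the in-flight budget would grow with them, so the backlog drains as fast as the sinks can actually handle it.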
In my opinion this is the last missing puzzle piece for Knative in terms of scalability.
I am fully open to hop on a call for any discussion =)