The Knative stack (Eventing and Serving) is pretty awesome. Overall it makes infrastructure easy to use for our devs and ML engineers. On Kubernetes I don't want to miss it anymore. It has advanced scaling capabilities based on concurrency and handles scaling very well in the following two scenarios:
microservices which answer very quickly (milliseconds)
long-running jobs which take more than a couple of minutes, using the JobSink approach
The scenarios in between, from multiple seconds up to around a minute, seem to be a blind spot, especially when combining Knative Eventing and Serving.
Possible configurations
First of all, here are a couple of (central) configurations and behaviors (if something is wrong or missing, please let me know!):
kubectl get cm -n knative-eventing config-kafka-broker-data-plane -o yaml
max.poll.records=50
This controls roughly how many messages are pulled in parallel per partition.
It is a central configuration and is set at Kubernetes cluster level (more specifically, at the knative-eventing-kafka installation level).
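For reference, here is a minimal sketch of how this could look in that ConfigMap. The data key name (config-kafka-broker-consumer.properties) is an assumption and may differ between knative-eventing-kafka-broker versions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-kafka-broker-data-plane
  namespace: knative-eventing
data:
  # consumer properties shared by every Broker/Trigger on the cluster
  config-kafka-broker-consumer.properties: |
    # maximum number of records a single consumer poll() may return
    max.poll.records=50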
kubectl get cm -n knative-eventing kafka-broker-config -o yaml
default.topic.partitions: "10"
The partitions act as a multiplier: max.poll.records x default.topic.partitions = 50 x 10 = 500 messages are pulled roughly in parallel.
This configuration is on Broker level but requires at least an additional kafka-broker-config-* ConfigMap.
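As a hedged illustration, one could create a second kafka-broker-config-* ConfigMap with fewer partitions and point a dedicated Broker at it. The ConfigMap name, Broker name and bootstrap.servers address below are made up:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-broker-config-low-parallelism   # hypothetical name
  namespace: knative-eventing
data:
  default.topic.partitions: "1"
  default.topic.replication.factor: "3"
  bootstrap.servers: "my-cluster-kafka-bootstrap.kafka:9092"   # placeholder address
---
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: slow-services-broker                  # hypothetical name
  namespace: default
  annotations:
    eventing.knative.dev/broker.class: Kafka
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config-low-parallelism
    namespace: knative-eventing

Keep in mind that the partition count is only applied when the Broker's topic is created, which is part of why this knob is so static.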
Another possibility to restrict traffic is using ordered delivery: "An ordered consumer is a per-partition blocking consumer that waits for a successful response from the CloudEvent subscriber before it delivers the next message of the partition." This means with 1 partition it handles only 1 message at a time; with 10 partitions it handles 10 messages at a time.
Another important variable is the DeliverySpec.Timeout, which can be set on Broker and Trigger level (which makes it independent for every Sink/microservice). I could not find the default value, but I assume it is 30s.
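Putting the last two points together, a Trigger sketch could look like the following (names are placeholders; depending on the Eventing version the delivery timeout may still sit behind the delivery-timeout feature flag):

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: example-trigger                      # placeholder name
  annotations:
    # per-partition blocking consumer: waits for a successful response
    # before delivering the next message of the partition
    kafka.eventing.knative.dev/delivery.order: ordered
spec:
  broker: default
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: slow-python-service              # placeholder sink
  delivery:
    timeout: PT5M          # ISO-8601 duration, per-request timeout toward the sink
    retry: 3
    backoffPolicy: exponential
    backoffDelay: PT1S

The resulting configuration can be inspected with kubectl get trigger example-trigger -o yaml.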
Concurrency determines the number of simultaneous requests that can be processed by each replica of an application at any given time.
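On the Serving side, the hard limit is just a field on the revision template. A minimal sketch (service name and image are placeholders):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: slow-python-service                  # placeholder name
spec:
  template:
    spec:
      # hard limit: queue-proxy admits at most 2 concurrent requests per replica
      containerConcurrency: 2
      # maximum duration of a single request before the revision times out
      timeoutSeconds: 300
      containers:
        - image: registry.example.com/vision-worker:latest   # placeholder image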
Background
I am working in a Python world with AI. Python is not good at doing things fast and concurrently. AI workloads (we do a lot in the vision area, with segmentation and so forth) need a lot of processing time and are therefore slow. We also often hit a tipping point: when a Python FastAPI/Flask service is overloaded, its throughput drops sharply. Hence, with a lower concurrency it can achieve the highest throughput.
Scenario and possible tweaks
Let's assume I have a microservice which can handle 2 messages in parallel and needs 10s per message. Hence the throughput is around 12 messages per minute per replica. Keeping the default values of max.poll.records=50, default.topic.partitions: "10" and timeout: PT30S (unordered), the microservice would be overloaded immediately if we get 500 messages, and nothing would get processed correctly.
Now, we could set containerConcurrency: 2 (hard limit), which ensures that only 2 messages reach the FastAPI/Flask level at once and gives us the maximum performance, but this limit is enforced at the queue-proxy level. The queue-proxy has to buffer all the other messages and also produces timeouts after 30s (I could not even identify how to increase the timeout there).
If we decreased max.poll.records to 2, it would throttle all the other microservices in our Kubernetes cluster, which could otherwise handle a very high throughput.
We could set a combination of default.topic.partitions: "10" (another number is possible) and kafka.eventing.knative.dev/delivery.order: ordered. But the partition configuration is on Broker level and is quite static.
All this leads to timeouts, which trigger retries, which flood the system even more.
I know knative-eventing is in the end an abstraction over Kafka. But would it be possible to implement an intelligent mechanism so that knative-eventing respects and is aware of the knative-serving concurrency, and only pulls as many messages out of Kafka as needed to utilize the microservices as well as possible (but not more)? While also scaling up more replicas if the demand requires it?
Expected behavior
So the expected behavior would be:
Let's assume we have 500 messages in the queue and only 1 replica which can handle only 2 concurrent requests; then it should not send more than 2 concurrent requests to that replica (hard limit)
Then it should scale up a couple of replicas accordingly to deal with the traffic (a purely illustrative sketch of what this could look like follows below)
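To make the request concrete, here is a purely hypothetical sketch; the sink-aware annotation below does not exist today and is only meant to illustrate the desired behavior:

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: example-trigger
  annotations:
    # HYPOTHETICAL, does not exist: the dispatcher would read the sink's
    # containerConcurrency (2) and current replica count and keep at most
    # containerConcurrency x replicas requests in flight, leaving the rest
    # of the 500 messages in Kafka instead of buffering them in queue-proxy
    kafka.eventing.knative.dev/sink-aware-concurrency: "true"
spec:
  broker: default
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: slow-python-service

As the autoscaler adds replicas, the in-flight budget would grow with them, so the backlog drains as fast as the sinks can actually handle it.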
In my opinion this is the last missing puzzle piece for Knative in terms of scalability.
I am fully open to hop on a call for any discussion =)