Basic concurrency support for transaction prover service #908

Open · igamigo opened this issue Oct 7, 2024 · 13 comments

@igamigo (Collaborator) commented Oct 7, 2024

The current version of the transaction proving service processes only one transaction at a time. To improve performance and scalability, we want to introduce concurrency while ensuring the system remains resilient against potential DDoS attacks and resource exhaustion.

Basic desired features:

  • Task queue: We want to manage multiple transaction proofs concurrently without overwhelming the system, so we should introduce a basic task queue where incoming valid requests are placed and processed by any available worker (a rough sketch follows this list).
    • Concurrency control: A first basic approach would be to define the number of available workers based on system resources; once no workers are available, additional transactions should be put on hold or rejected.
    • Investigate options for distributing workers across nodes.
  • Rate limiting: We should reduce the attack surface for DDoS attacks by rate limiting incoming requests, probably based on the incoming request IPs.
  • Timeouts: I'm unsure if there is any simple way in which an incoming transaction could stall a worker, but there should be timeout mechanisms. API clients could also implement exponential backoff retries.
  • Simple monitoring: It would be nice to implement an endpoint with (initially basic) metrics related to the service health and status.
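
A rough sketch of how the concurrency-control point could work on a single machine, assuming a Tokio-based service; the types and the proving function below are placeholders, not actual prover code:

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;

// Placeholder types standing in for the real request/proof types.
struct ProofRequest;
struct Proof;

async fn prove_transaction(_req: ProofRequest) -> Proof {
    // The actual (CPU-heavy) proving would run here, likely via spawn_blocking.
    Proof
}

/// Handles a request if a worker slot is free; rejects it otherwise.
async fn handle_request(
    permits: Arc<Semaphore>,
    req: ProofRequest,
) -> Result<Proof, &'static str> {
    // `try_acquire` returns immediately instead of queueing, so excess
    // requests are rejected rather than piling up without bound.
    match permits.try_acquire() {
        Ok(_permit) => Ok(prove_transaction(req).await),
        Err(_) => Err("server busy, try again later"),
    }
}

#[tokio::main]
async fn main() {
    // Number of concurrent proofs, e.g. derived from available cores/memory.
    let permits = Arc::new(Semaphore::new(4));
    let _ = handle_request(permits.clone(), ProofRequest).await;
}
```
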
@bobbinth (Contributor) commented Oct 8, 2024

Looks great! A couple of additional comments:

  • We should assume that there is only one worker per machine right now. As a worker comes online, it could ping the coordinator to notify it that it is available, and the coordinator would then add the worker to the pool of available workers.
  • I think we should have a configurable timeout. If proof generation takes longer than this timeout, the coordinator should assume that the worker is dead and remove it from the pool. If the worker is not in fact dead, it should be able to detect this somehow (see the sketch after this list).
  • It would be really nice if the coordinator could somehow request an increase in capacity (i.e., in the number of workers). The default implementation of this could do nothing, but in the future it could be configured to request a new AWS spot instance, for example.
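
To illustrate the registration and timeout idea, the coordinator-side bookkeeping could look roughly like the sketch below; all names here are hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical coordinator-side view of the worker pool.
struct WorkerPool {
    /// Worker address -> time at which its current job started (if any).
    workers: HashMap<String, Option<Instant>>,
    /// If a job runs longer than this, the worker is assumed dead.
    job_timeout: Duration,
}

impl WorkerPool {
    /// Called when a worker pings the coordinator on startup.
    fn register(&mut self, addr: String) {
        self.workers.insert(addr, None);
    }

    /// Called when a job is dispatched to a worker.
    fn mark_busy(&mut self, addr: &str) {
        if let Some(started) = self.workers.get_mut(addr) {
            *started = Some(Instant::now());
        }
    }

    /// Removes workers whose current job exceeded the configured timeout.
    fn evict_stalled(&mut self) {
        let timeout = self.job_timeout;
        self.workers.retain(|_, started| match started {
            Some(t) => t.elapsed() <= timeout,
            None => true,
        });
    }
}
```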

And the last point is that ideally we'd not implement it ourselves but would be able to use an already existing component/framework.

bobbinth added this to the v0.7 milestone Oct 8, 2024
bobbinth modified the milestones: v0.7, v0.6 Oct 17, 2024
@SantiagoPittella (Collaborator)

Hello! I've been researching this issue a bit, in particular Cloudflare's Pingora crates. I think they fit as a solution for all our problems here.

  • Task queue: We want to manage multiple transaction proofs concurrently without overwhelming the system, so we should introduce a basic task queue where incoming valid requests are placed and processed by any available worker.
    Concurrency control: A first basic approach would be to define the number of available workers based on system resources; once no workers are available, additional transactions should be put on hold or rejected.
    Investigate options for distributing workers across nodes.

This can be tackled with Pingora's LoadBalancer, setting the config to 1 thread per service. I can put together a PoC of this with a simple "hello world" server shortly.
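
Roughly, the PoC would follow the shape of Pingora's quickstart example, something like the sketch below; the worker addresses and ports are placeholders, and details may differ from the final code:

```rust
use std::sync::Arc;

use async_trait::async_trait;
use pingora::prelude::*;

// Round-robin load balancer over the prover workers.
pub struct LB(Arc<LoadBalancer<RoundRobin>>);

#[async_trait]
impl ProxyHttp for LB {
    type CTX = ();
    fn new_ctx(&self) -> Self::CTX {}

    async fn upstream_peer(&self, _session: &mut Session, _ctx: &mut ()) -> Result<Box<HttpPeer>> {
        // The hash key is irrelevant for round robin.
        let upstream = self.0.select(b"", 256).unwrap();
        // Plain HTTP to the worker, so no TLS and an empty SNI.
        Ok(Box::new(HttpPeer::new(upstream, false, String::new())))
    }
}

fn main() {
    let mut server = Server::new(None).unwrap();
    server.bootstrap();

    // Placeholder worker addresses.
    let upstreams = LoadBalancer::try_from_iter(["127.0.0.1:50051", "127.0.0.1:50052"]).unwrap();

    let mut lb = http_proxy_service(&server.configuration, LB(Arc::new(upstreams)));
    lb.add_tcp("0.0.0.0:6188");

    server.add_service(lb);
    server.run_forever();
}
```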

Related to this is also the creation and destruction of workers. At least in my first approach, I was thinking of manually running instances of the server and adding their endpoints to the upstream configuration of the load balancer; this would also work for decommissioning workers (remove the server from the list of upstreams, reload, turn off the prover server). This can benefit from the graceful upgrade that Pingora supports.

  • Rate limiting: We should reduce the attack surface for DDoS attacks by rate limiting incoming requests, probably based on the incoming request IPs.

For this we can use the rate limiting that the crate provides out of the box, just setting the maximum number of requests per user per second as described here.
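
To illustrate what the per-IP limit does conceptually (in practice we would use the crate's limiter out of the box rather than hand-rolling one):

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Minimal fixed-window per-IP limiter, just to illustrate the idea.
struct IpRateLimiter {
    window: Duration,
    max_per_window: u32,
    counts: HashMap<IpAddr, (Instant, u32)>,
}

impl IpRateLimiter {
    /// Returns true if the request from `ip` is allowed in the current window.
    fn allow(&mut self, ip: IpAddr) -> bool {
        let now = Instant::now();
        let entry = self.counts.entry(ip).or_insert((now, 0));
        if now.duration_since(entry.0) > self.window {
            *entry = (now, 0); // start a new window
        }
        entry.1 += 1;
        entry.1 <= self.max_per_window
    }
}
```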

  • Timeouts: I'm unsure if there is any simple way in which an incoming transaction could stall a worker but there should be timeout mechanisms. API clients could also implement exponential backoff retries.

We can also use Pingora's timeouts out of the box for this, which are easily configurable.
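
On the client side, the exponential backoff mentioned in the original issue could look roughly like this; `request_proof` is a placeholder for the real prover call:

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Placeholder for the real gRPC/HTTP request to the proving service.
async fn request_proof() -> Result<Vec<u8>, &'static str> {
    Err("worker busy")
}

/// Retries with a per-attempt timeout and exponential backoff between attempts.
async fn prove_with_retries(max_attempts: u32) -> Result<Vec<u8>, &'static str> {
    let mut delay = Duration::from_millis(500);
    for attempt in 0..max_attempts {
        match timeout(Duration::from_secs(60), request_proof()).await {
            Ok(Ok(proof)) => return Ok(proof),
            // Either the request failed or it timed out; back off and retry.
            Ok(Err(_)) | Err(_) => {
                if attempt + 1 < max_attempts {
                    sleep(delay).await;
                    delay *= 2;
                }
            }
        }
    }
    Err("all attempts failed")
}
```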

  • Simple monitoring: It would be nice to implement an endpoint with (initially basic) metrics related to the service health and status.

It also has a built-in Prometheus server that can be used for that purpose.
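
If I remember the API correctly (exact paths and names may differ slightly), wiring it up is roughly:

```rust
use pingora::prelude::*;
use pingora::services::listening::Service as ListeningService;

fn main() {
    let mut server = Server::new(None).unwrap();
    server.bootstrap();

    // Separate port serving Prometheus-format metrics for the proxy.
    let mut prometheus_service = ListeningService::prometheus_http_service();
    prometheus_service.add_tcp("127.0.0.1:6192");
    server.add_service(prometheus_service);

    server.run_forever();
}
```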

If you think that this is OK, I can proceed in the following way:

  1. Creating the proof of concept of the load balancer (I also want to check a couple of things about this).
  2. Adding the real load balancer with a couple of proving servers.
  3. Rate limiting & timeouts.
  4. Metrics.

All of these can be done in separate issues, and if you agree I can start immediately.

@bobbinth (Contributor)

This sounds great! Let's start with the PoC to see if we hit any roadblocks.

In terms of the setup, I was thinking the load balancer could run on a relatively small machine (e.g., 4 cores or something like that), and the provers would run on big machines (e.g., 64 cores). Is that how you are thinking about it too?

@SantiagoPittella (Collaborator) commented Oct 22, 2024

This sounds great! Let's start with the PoC to see if we hit any roadblocks.

In terms of the setup, I was thinking the load balancer could run on a relatively small machine (e.g., 4 cores or something like that), and the provers would run on big machines (e.g., 64 cores). Is that how you are thinking about it too?

Sorry for the late reply. Considering that the prover takes advantage of concurrency, that sounds great and should work more than OK. If I can get access to one of those machines, I might run a couple of tests. As for the load balancer, 4 cores should be OK too; it is not supposed to be a heavy workload.

@SantiagoPittella (Collaborator)

PoC update

I implemented a minimal version of the load balancer with Pingora and a "Hello world!" server; everything works smoothly out of the box. I've been thinking that we could make a primitive first version of the queue by deploying workers and adding them to the load balancer, and then letting the round robin work as a kind of queue. In the ideal case, where every proof takes the same time to run, this will behave like a queue of size N, where N is the number of workers.

Then, I thought about adding queues built into each worker, increasing the proving capacity to N*M, where N is the number of workers and M is the size of each queue. I'm not sure yet how to implement these queues.
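For example, with N = 4 workers and a queue of size M = 8 on each, up to 32 requests could be admitted before new ones have to be put on hold or rejected.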

This approach lets us split the issue into two tasks that can be performed in parallel:

  1. Queues in the proving service.
  2. Pingora's server, including the load balancer, timeouts, rate limiting, gRPC support, and metrics. This task might as well be split once we have a basic structure in place.

I think we could start with the proxy server, which allows us to deploy multiple workers in case we need them, and then (or simultaneously) start working on the queue.

Also, I'm still thinking about an automated way to deploy/kill workers and gracefully restart the proxy. Any ideas on this or on the queue implementation would be welcome.

I am starting with the Pingora server, but let me know if we prefer to start with the queue and I will switch to that.

@bobbinth (Contributor) commented Oct 22, 2024

Question: could the load balancer not maintain the queue itself? Basically, can we avoid having queues in the proving service and rely on the load balancer to manage the jobs? Or does it not work this way?

Basically, the ideal approach in my mind would be:

  1. We start out with a load balancer and a single worker. The load balancer just manages the job queue for the worker.
  2. We expand to multiple workers. Here, the load balancer would just assign jobs from the queue to the workers.
  3. We make worker procurement/shutdown dynamic. This could be done much later (I'd even say that only the first point is a high priority for now).

If this works, we won't need to modify the current prover service at all - but let me know if that's not a viable approach.

@SantiagoPittella (Collaborator)

Following your last question, I've been doing a bit more research on Pingora's load balancing strategies. It looks like we can achieve that functionality by defining our own strategy. Adding the queue doesn't seem that hard; I'm not sure yet how to tell when to remove an element from the queue, and I'm looking into that now. In general this sounds like a good and doable approach, and it shouldn't raise the requirements for the server running the proxy.

@SantiagoPittella (Collaborator)

I did most of the work in #930. We are still missing support for queues, but all the other features are in place.

I'm now working on that; let me know @bobbinth if I should stall the reviews on the PR until queues are in place or if I can do it in another PR.

@bobbinth (Contributor)

I did most of the work in #930. We are still missing support for queues, but all the other features are in place.

I'm now working on that; let me know @bobbinth if I should stall the reviews on the PR until queues are in place or if I can do it in another PR.

Nice! So, right now as the tasks arrive they will be assigned to queues of individual workers in round-robin fashion, right? And so, if there is one worker, it'll basically just have a single queue of tasks which it'll work on one-by-one, right? If so, this may be good enough - though, of course, if we get a single queue working, it'll be even better.

@SantiagoPittella (Collaborator)

I did most of the work in #930. We are still missing support for queues, but all the other features are in place.
I'm now working on that; let me know @bobbinth if I should stall the reviews on the PR until queues are in place or if I can do it in another PR.

Nice! So, right now as the tasks arrive they will be assigned to queues of individual workers in round-robin fashion, right? And so, if there is one worker, it'll basically just have a single queue of tasks which it'll work on one-by-one, right? If so, this may be good enough - though, of course, if we get a single queue working, it'll be even better.

Right now there is no queue, just round robin across the workers. I'm currently adding queues. The design that I'm working on is one queue per backend, with the requests evenly distributed using round robin.
Basically, this means that whenever we get a request, we use round robin to pick the corresponding backend, and then the request is enqueued and waits until it is next in that backend's queue.
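
A rough sketch of that design (names are placeholders, not the actual PR code): each backend gets a bounded queue, and the proxy round-robins incoming requests into them.

```rust
use tokio::sync::mpsc;

// Placeholder for the real request type.
struct ProofRequest;

/// One bounded queue per backend (worker).
struct Backend {
    tx: mpsc::Sender<ProofRequest>,
}

/// Spawns a task that drains one backend's queue, one request at a time.
fn spawn_backend(queue_size: usize) -> Backend {
    let (tx, mut rx) = mpsc::channel::<ProofRequest>(queue_size);
    tokio::spawn(async move {
        while let Some(_req) = rx.recv().await {
            // forward the request to the worker and wait for the proof here
        }
    });
    Backend { tx }
}

/// Round-robin selection, then enqueue; a full queue means the request is rejected.
fn dispatch(
    backends: &[Backend],
    next: &mut usize,
    req: ProofRequest,
) -> Result<(), &'static str> {
    let backend = &backends[*next % backends.len()];
    *next += 1;
    backend.tx.try_send(req).map_err(|_| "backend queue is full")
}
```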

@bobbinth (Contributor)

Right now there is no queue, just round robin across the workers. I'm currently adding queues. The design that I'm working on is one queue per backend, with the requests evenly distributed using round robin.
Basically, this means that whenever we get a request, we use round robin to pick the corresponding backend, and then the request is enqueued and waits until it is next in that backend's queue.

What does "backend" mean here? Is it the same as "worker"?

@SantiagoPittella (Collaborator)

Right now there is no queue, just round robin across the workers. I'm currently adding queues. The design that I'm working on is one queue per backend, with the requests evenly distributed using round robin.
Basically, this means that whenever we get a request, we use round robin to pick the corresponding backend, and then the request is enqueued and waits until it is next in that backend's queue.

What does "backend" mean here? Is it the same as "worker"?

Yes, the same as worker.

@bobbinth (Contributor)

Makes sense. Yes, this should be fine as an initial implementation, and we can improve it later on to make the distribution more balanced and fault-tolerant.
