Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New job queue: worker registration and leader election #3307

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

sandhose
Copy link
Member

@sandhose sandhose commented Oct 7, 2024

This adds the base for the new job queue system, with a simple worker registration system, as well as a leader election system.

The worker registration is meant to be used to detect lost workers and reschedule dead tasks they locked.
The leader election system is meant to have one leader performing all the maintenance work, like rescheduling tasks.

Part of #2785

@sandhose sandhose added the A-Jobs Related to asynchronous jobs label Oct 7, 2024
Copy link

cloudflare-workers-and-pages bot commented Oct 7, 2024

Deploying matrix-authentication-service-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 76afd6a
Status: ✅  Deploy successful!
Preview URL: https://410771fa.matrix-authentication-service-docs.pages.dev
Branch Preview URL: https://quenting-new-queue-initial.matrix-authentication-service-docs.pages.dev

View logs

Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable, but it also seems quite intricate and would benefit from a careful review, including aspects like whether it's robust against clock drift, node failures, etc — that sort of thing.

I would also want to carefully review what happens whether there are any problems if the current leader loses connection but still believes it is the leader.

-- The leader is responsible for running maintenance tasks
CREATE UNLOGGED TABLE queue_leader (
-- This makes the row unique
active BOOLEAN NOT NULL DEFAULT TRUE UNIQUE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it sounds silly, but I'd make this a PRIMARY KEY — maybe this sounds dogmatic? But there a handful of tools are not happy with tables that don't have a primary key, e.g. logical replication in Postgres by default, I'd say it's worth always using it instead of UNIQUE etc.

// If no row was updated, the worker was shutdown so we return an error
DatabaseError::ensure_affected_rows(&res, 1)?;

Ok(worker)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docstring says this returns the modified worker, but I don't see us modifying it.

I would expect the worker to track its own validity timestamps, but I guess the critical thing here is just that we 'take away' the Worker if we can't renew it?

clock: &dyn Clock,
threshold: Duration,
) -> Result<(), Self::Error> {
let now = clock.now();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it reasonable to rely on the system clock (which could drift between servers)?

I suppose we could use the Postgres database's clock alternatively. But I don't know which one is best, mostly just interested in considering it carefully

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, although I'd imagine multiple servers in the same datacenter usually have the same time source/NTP server?

The clock is abstracted through a trait though, so maybe at some point we can take into account time drift, and regularly sync the local system clock with the database or something, but I wouldn't worry too much about it for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Jobs Related to asynchronous jobs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants