
[16.0][IMP] queue_job: HA job runner using session level advisory lock #668

Open
sbidoul wants to merge 4 commits into base 16.0 from 16.0-ha-runner-sbi
Conversation

@sbidoul (Member) commented Jul 2, 2024

Another attempt.

closes #422

@OCA-git-bot (Contributor)

Hi @guewen,
some modules you are maintaining are being modified, check this out!

@sbidoul force-pushed the 16.0-ha-runner-sbi branch 3 times, most recently from 02ef89b to deecd27, on July 2, 2024 at 12:00
@sbidoul (Member, Author) commented Jul 2, 2024

Yep, this should work.

@guewen (Member) left a comment


Yes!

@sbidoul (Member, Author) commented Jul 4, 2024

@PCatinean do you know who we should ping on the odoo.sh team to get an opinion on this approach?

@PCatinean (Contributor)

Hi @sbidoul, the only two people I know of around this topic are @amigrave, who gave the initial feedback on the advisory lock MR in #256, and @sts-odoo, who also provided some feedback on the pg_application_name approach.

@sbidoul (Member, Author) commented Jul 4, 2024

@amigrave @sts-odoo so the TL;DR here is that we have one long-lived connection to the database on which we take a session-level advisory lock and do a LISTEN. There is no long-lived transaction, so this should not impact replication.
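
For readers less familiar with the PostgreSQL side, here is a minimal sketch of that mechanism, assuming psycopg2; the lock key, channel name and connection details are illustrative assumptions, not this PR's actual code:

```python
import select

import psycopg2

ADVISORY_LOCK_KEY = 957483122  # hypothetical key shared by all Odoo instances

conn = psycopg2.connect("dbname=odoo")
conn.set_session(autocommit=True)  # no long-lived transaction is ever opened
cr = conn.cursor()

# Session-level advisory lock: it is held for the lifetime of this connection,
# so exactly one runner instance per database becomes the active leader.
cr.execute("SELECT pg_try_advisory_lock(%s)", (ADVISORY_LOCK_KEY,))
is_leader = cr.fetchone()[0]

if is_leader:
    # LISTEN on the same long-lived connection to be woken up when jobs change.
    cr.execute("LISTEN queue_job")
    while True:
        # Wait up to 60s for a notification, then drain pending notifications.
        if select.select([conn], [], [], 60.0)[0]:
            conn.poll()
            while conn.notifies:
                conn.notifies.pop(0)
                # ...dispatch ready jobs here...
```

Because the lock is session-level rather than transaction-level, it is released automatically if the connection drops, at which point another instance can acquire it and take over as runner.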

I plan to deploy this on an odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.


github-actions bot commented Nov 3, 2024

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

@github-actions bot added the "stale" label (PR/Issue without recent activity, it'll be soon closed automatically) on Nov 3, 2024
@sbidoul removed the "stale" label on Nov 3, 2024
@sbidoul changed the title from "[IMP] queue_job: HA job runner using session level advisory lock" to "[16.0][IMP] queue_job: HA job runner using session level advisory lock" on Dec 4, 2024
@simahawk (Contributor) commented Dec 6, 2024

I plan to deploy this on an odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.

@sbidoul any feedback?

@sbidoul force-pushed the 16.0-ha-runner-sbi branch from b65bbc6 to ffb27a4 on December 6, 2024 at 08:40
@sbidoul (Member, Author) commented Dec 6, 2024

Feedback given in #673 (comment).

And rebased.

@AnizR (Contributor) left a comment


Code LGTM.
I'm going to install it on one of my projects and battle test it.

@0yik commented Jan 9, 2025

Sorry, why is this not merged yet?

@luke-stdev001 commented Jan 20, 2025

@simahawk @sbidoul ,

I plan to deploy this on an odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.

@sbidoul any feedback?

I'd like to run this on my staging and production GKE cluster. I would especially like to test the scaling capabilities of this in my staging environment. If I deploy this to my staging env, would either of you like the keys to my staging env and the GKE staging cluster to kick the tires and load test this with K6 or similar tools?

I would love to see this merged, and would be happy to run this in production after some load testing in staging and report back on results or allow you to monitor.

I can reach out to you via email to get this going through your company's official channels if this is something you'd like to explore.

@sbidoul (Member, Author) commented Jan 20, 2025

Hi everyone. This is not merged precisely because we would like more feedback from actual deployments. Tests are ongoing at Acsone, and I would encourage others to do the same.

@luke-stdev001

Hi everyone. This is not merged precisely because we would like more feedback from actual deployments. Tests are ongoing at Acsone, and I would encourage others to do the same.

Thanks. I'll get this into staging and then production and report back with findings.

@luke-stdev001

@sbidoul, extremely silly question from my end, but is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?

I'm pulling this into our staging environment now, but if all goes well I do plan to run this in production over a few weeks and report back if I encounter issues.

@AnizR (Contributor) commented Feb 28, 2025

Code LGTM. I'm going to install it on one of my projects and battle test it.

It has been deployed in production for almost 3 weeks and I haven't had any issues to report.

Without this, we leak connections to databases that don't have queue_job installed.
Without this, we risk connection leaks in case of exceptions in the constructor.
@sbidoul (Member, Author) commented Feb 28, 2025

is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?

@luke-stdev001 yes it should be safe. I just rebased.

@luke-stdev001 commented Mar 1, 2025

is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?

@luke-stdev001 yes it should be safe. I just rebased.

Thank you.

#422 (comment)

Yes, same config on all instances will work.

@sbidoul ,

Seems to be working well from initial load testing in staging, thank you.

I'd like to rearchitect our GKE-based HA deployment of Odoo:

  • User-facing app instances: server-wide modules without queue_job, with HTTP workers
  • A larger, vertically scaled instance for cron workers, with no HTTP workers (in our current architecture, queue jobs run on this same server for the same reasons)
  • Queue-job-only instances: server-wide modules with queue_job, no HTTP workers, in a dedicated node pool so queue instances scale out separately from the user-facing instances

My understanding is that, with the DB managing leader election, it should be perfectly acceptable to have a dedicated auto-scaling pool of queue_job-only instances to distribute jobs to, scaling up and down with demand, while leaving user-facing instances unaffected performance-wise.

If you wouldn't mind, could you confirm whether that assumption is correct? My apologies if there are any fundamental misunderstandings on my side about how this works.

Once I've had a week to toy with the concept in staging I'll deploy to production and advise on progress.

@sbidoul (Member, Author) commented Mar 2, 2025

My understanding is that, with the DB managing leader election, it should be perfectly acceptable to have a dedicated auto-scaling pool of queue_job-only instances to distribute jobs to, scaling up and down with demand, while leaving user-facing instances unaffected performance-wise.

You can do that yes. I'm curious about the metrics you plan to use for auto scaling.

Note this was already feasible without this PR, with a single dedicated pod with --load=queue_job sending the requests to run jobs to the pool. This PR simplifies configuration, helps avoid configuration mistakes, and also helps in situations where you cannot have instances with different configs, such as on odoo.sh.
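
For context, here is a rough sketch of the pre-PR split setup described above, as two separate odoo.conf files; the module lists and worker counts are illustrative assumptions, not values from this PR:

```ini
# odoo.conf on the single dedicated job-runner pod
# (equivalent to starting Odoo with --load including queue_job)
[options]
server_wide_modules = base,web,queue_job
workers = 2
```

```ini
# odoo.conf on the user-facing pods: queue_job is not loaded server-wide here,
# so these instances never start a job runner and only execute the job
# requests the runner sends to the pool
[options]
server_wide_modules = base,web
workers = 8
```

With this PR, the same configuration (queue_job in server_wide_modules everywhere) can be deployed on every instance, and the session-level advisory lock decides which instance actually runs the job runner.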

@luke-stdev001 commented Mar 3, 2025

My understanding is that, with the DB managing leader election, it should be perfectly acceptable to have a dedicated auto-scaling pool of queue_job-only instances to distribute jobs to, scaling up and down with demand, while leaving user-facing instances unaffected performance-wise.

You can do that yes. I'm curious about the metrics you plan to use for auto scaling.

Note this was already feasible without this PR, with a single dedicated pod with --load=queue_job sending the requests to run jobs to the pool. This PR simplifies configuration, helps avoid configuration mistakes, and also helps in situations where you cannot have instances with different configs, such as on odoo.sh.

Thanks, I wasn't aware of that ability previously; I'll take a look.

To be perfectly honest, when it comes to auto-scaling metrics we will be figuring it out as we go along, playing with what works and learning what doesn't. I'm happy to report back here with our own internal notes and would love to hear from anyone else who has any advice.

At present I am thinking of WSGI request queue length, request rate, request duration/latency, the ratio of busy workers to total workers, and DB connection pool saturation.

I'm happy to report back that this PR is working fine in production, and has been for a few days. I will monitor closely for issues, but so far I have not encountered any hiccups.

Development

Successfully merging this pull request may close these issues.

[Question] queue_job when multiple odoo servers are used with load balancing (and single postgres)
8 participants