Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue: Job Processing Loop Fails Due to Database Connection Timeout #30668

Closed
fabrizzio-dotCMS opened this issue Nov 15, 2024 · 2 comments · Fixed by #30710
Closed

Issue: Job Processing Loop Fails Due to Database Connection Timeout #30668

fabrizzio-dotCMS opened this issue Nov 15, 2024 · 2 comments · Fixed by #30710

Comments

@fabrizzio-dotCMS
Copy link
Contributor

Parent Issue

#29498

Problem Statement

The job processing loop in JobQueueManagerAPIImpl is failing intermittently with a DotDataException caused by a database connection timeout. This results in an inability to fetch the next job for processing and potentially disrupts job queue management. The error logs indicate the following stack trace:

23:43:06.004  ERROR api.JobQueueManagerAPIImpl - Unexpected error in job processing loop: Error fetching next job
com.dotmarketing.exception.DotDataException: Error fetching next job
	at com.dotcms.jobs.business.api.JobQueueManagerAPIImpl.processNextJob(JobQueueManagerAPIImpl.java:613) ~[?:?]
	at com.dotcms.jobs.business.api.JobQueueManagerAPIImpl.processJobs(JobQueueManagerAPIImpl.java:573) ~[?:?]
	at com.dotcms.jobs.business.api.JobQueueManagerAPIImpl.lambda$start$0(JobQueueManagerAPIImpl.java:217) ~[?:?]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317) ~[?:?]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: com.dotcms.jobs.business.queue.error.JobQueueDataException: Database error while fetching next job
	at com.dotcms.jobs.business.queue.PostgresJobQueue.nextJob(PostgresJobQueue.java:573) ~[?:?]
	at com.dotcms.jobs.business.api.JobQueueManagerAPIImpl.processNextJob(JobQueueManagerAPIImpl.java:604) ~[?:?]
	... 8 more
Caused by: com.dotmarketing.exception.DotDataException: jdbc/dotCMSPool - Connection is not available, request timed out after 5991ms.{
  "SQL": ["UPDATE job_queue SET state = ? WHERE id = (SELECT id FROM job_queue WHERE state = ? ORDER BY priority DESC, created_at ASC LIMIT 1 FOR UPDATE SKIP LOCKED) RETURNING *"],
  "maxRows": [-1],
  "offest": [0],
  "params": [
    "RUNNING",
    "PENDING"
  ]
}
	at com.dotmarketing.common.db.DotConnect.loadResult(DotConnect.java:310) ~[?:?]
	at com.dotmarketing.common.db.DotConnect.loadObjectResults(DotConnect.java:997) ~[?:?]
	at com.dotcms.jobs.business.queue.PostgresJobQueue.nextJob(PostgresJobQueue.java:562) ~[?:?]
	at com.dotcms.jobs.business.api.JobQueueManagerAPIImpl.processNextJob(JobQueueManagerAPIImpl.java:604) ~[?:?]
	... 8 more
Caused by: com.dotmarketing.exception.DotRuntimeException: jdbc/dotCMSPool - Connection is not available, request timed out after 5991ms.
	at com.dotmarketing.db.DbConnectionFactory.getConnection(DbConnectionFactory.java:236) ~[?:?]
	at com.dotmarketing.common.db.DotConnect.executeQuery(DotConnect.java:599) ~[?:?]
	at com.dotmarketing.common.db.DotConnect.loadResult(DotConnect.java:308) ~[?:?]
	at com.dotmarketing.common.db.DotConnect.loadObjectResults(DotConnect.java:997) ~[?:?]
	at com.dotcms.jobs.business.queue.PostgresJobQueue.nextJob(PostgresJobQueue.java:562) ~[?:?]
	at com.dotcms.jobs.business.api.JobQueueManagerAPIImpl.processNextJob(JobQueueManagerAPIImpl.java:604) ~[?:?]
	... 8 more
Caused by: java.sql.SQLTransientConnectionException: jdbc/dotCMSPool - Connection is not available, request timed out after 5991ms.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:689) ~[?:?]
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:196) ~[?:?]
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:161) ~[?:?]
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100) ~[?:?]
	at com.dotmarketing.db.DbConnectionFactory.getConnection(DbConnectionFactory.java:223) ~[?:?]
	at com.dotmarketing.common.db.DotConnect.executeQuery(DotConnect.java:599) ~[?:?]
	at com.dotmarketing.common.db.DotConnect.loadResult(DotConnect.java:308) ~[?:?]
	at com.dotmarketing.common.db.DotConnect.loadObjectResults(DotConnect.java:997) ~[?:?]
	at com.dotcms.jobs.business.queue.PostgresJobQueue.nextJob(PostgresJobQueue.java:562) ~[?:?]
	at com.dotcms.jobs.business.api.JobQueueManagerAPIImpl.processNextJob(JobQueueManagerAPIImpl.java:604) ~[?:?]
	... 8 more
23:43:06.004  ERROR license.LicenseManager - Could not detect if server 5b7f00e4-49c5-48a0-8bd5-712cbfd0153c is duplicated
com.dotmarketing.exception.DotDataException: jdbc/dotCMSPool - Connection is not available, request timed out after 5991ms.{
  "SQL": ["SELECT id FROM sitelic WHERE (serverid = ? OR license = ?) AND startup_time != ?"],
  "maxRows": [-1],
  "offest": [0],
  "params": [
    "5b7f00e4-49c5-48a0-8bd5-712cbfd0153c",
    "0toaiuQe4vxQiX/5ewQQj9N6JGBqmb3WUdbMBmCUC17vYK86AX3HJ6PZoNI70BfRKzMBhRurLeICbPxZIOYhrwJKk7lgT39564Rt56FDOZAhvEidREV3YKZZrdffGGgF1sLGilVWKlU69i0wEoCD+SbtmT5DEdFamklkfvUtyKMAAAAIRGV2IFRlc3QAAAAIAAABWbfTJtkAAAAIAAAAAAAAAAAAAAAEAAAB9AAAACRjNmVhYzkzZS0wOGNiLTRjOTItYmFlMi05NmNmYjY4Y2ZjMDMAAAAEcHJvZAAAAAQAAAABAAAABAAAAAAAAAAEAAABLAAAACQxNzUwOTlkMS1iMzM1LTRmYzYtOTYzYS0wMGM5YTdhYTcyYzQ=",
    1731626407189
  ]
}

This problem arose with very little load

Steps to Reproduce

Unfortunately, I do not have a pattern to reproduce this
It started happening on a long-running instance. Like the next day

Acceptance Criteria

We need to revise the Database connection pool configuration (dotCMSPool) and determine if any different configuration can resolve the problem or ensure recovery. Perhaps a dedicated pool or something.

dotCMS Version

main

Proposed Objective

Technical User Experience

Proposed Priority

Priority 2 - Important

External Links... Slack Conversations, Support Tickets, Figma Designs, etc.

No response

Assumptions & Initiation Needs

No response

Quality Assurance Notes & Workarounds

No response

Sub-Tasks & Estimates

No response

@fabrizzio-dotCMS
Copy link
Contributor Author

After testing the jobs queue over the past few days, I can confirm that the connection issue has not reoccurred.

@fabrizzio-dotCMS fabrizzio-dotCMS moved this from Internal QA to QA - Backlog in dotCMS - Product Planning Nov 25, 2024
@fabrizzio-dotCMS fabrizzio-dotCMS removed their assignment Nov 25, 2024
@bryanboza
Copy link
Contributor

Unable to reproduce anymore! We can pass for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants