Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

liveness probe queries causing high database IO usage on airflow 2.5.3 #829

Closed
2 tasks done
bitsofdave opened this issue Feb 19, 2024 · 1 comment · Fixed by #853
Closed
2 tasks done

liveness probe queries causing high database IO usage on airflow 2.5.3 #829

bitsofdave opened this issue Feb 19, 2024 · 1 comment · Fixed by #853
Labels
kind/bug kind - things not working properly
Milestone

Comments

@bitsofdave
Copy link
Contributor

Checks

Chart Version

8.7.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.17-eks-5e0fdde", GitCommit:"221f136737816ce068aa975f082cd0bf81c7ad5f", GitTreeState:"clean", BuildDate:"2024-01-02T20:35:57Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.8.0", GitCommit:"d14138609b01886f544b2025f5000351c9eb092e", GitTreeState:"clean", GoVersion:"go1.17.6"}

Description

This change to the liveness probes in support of airflow 2.6.0 is causing high usage database on IO when using airflow 2.5.3.

From what I can tell, the part of the query that is causing poor performance is this change:

-                          .query(SchedulerJob) \
+                          .query(Job) \
+                          .filter_by(job_type=SchedulerJobRunner.job_type) 

I believe the issue is that the SchedulerJob class in 2.5.3 doesn't set a value for job_type.

The new query doesn't correctly filter by job_type (notice the filter clause is job.job_type = job.job_type):

SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
FROM job 
WHERE job.job_type = job.job_type AND job.hostname = %(hostname_1)s ORDER BY job.latest_heartbeat DESC 
 LIMIT %(param_1)s

Old query used in helm chart v8.7.0 and earlier used to filter by job_type:

SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
FROM job 
WHERE job.hostname = %(hostname_1)s AND job.job_type IN (__[POSTCOMPILE_job_type_1]) ORDER BY job.latest_heartbeat DESC 
 LIMIT %(param_1)s

While the new query does work, its inefficiency is can cause database IO to be throttled, resulting in liveness probe failures due to query timeouts.

Relevant Logs

Warning  Unhealthy         7m29s (x5 over 10m)    kubelet             Liveness probe failed: The SchedulerJob (id=97704647) for hostname 'airflow-scheduler-8586fc7d9f-br7vb' is not alive

Custom Helm Values

No response

@bitsofdave bitsofdave added the kind/bug kind - things not working properly label Feb 19, 2024
@thesuperzapper thesuperzapper added this to the airflow-8.8.1 milestone Apr 24, 2024
@thesuperzapper
Copy link
Member

@bitsofdave this was a great find!

It will be resolved by #853, which will be part of the 8.9.0 version of the chart.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug kind - things not working properly
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants