This document is a guide to the day-to-day operations of the admin team.
- Grafana dashboards
- Node details
- This dashboard gives you an overview of all the nodes (headnode, worker nodes, etc.)
- Storage (ZFS server)
- This dashboard gives you an overview of the ZFS storage server, its load, performance, availability, etc.
- VGCN monitoring
- This dashboard gives you information about the worker nodes. It helps to find worker nodes that are no longer connected to HTCondor but are still running in the BWCloud, and to spot worker nodes that are stuck. In such cases, have a look at the remote logs of the worker nodes (`/var/log/remote/<worker_node_name_here>/`) available on the maintenance server (maintenance.galaxyproject.eu).
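As a concrete example, the remote logs of a suspect worker can be inspected directly on the maintenance server. This is only a sketch; the node name is hypothetical and the exact file layout under the directory may differ:

```bash
# On maintenance.galaxyproject.eu; the node name below is a placeholder
ls /var/log/remote/vgcnbwc-worker-0001/
# Skim the most recent entries of whatever log files are there
tail -n 200 /var/log/remote/vgcnbwc-worker-0001/*.log
```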
- CVMFS
- This dashboard gives you an overview of the CVMFS stratum 1 availability and the repo availability.
- Galaxy
- This dashboard gives you an overview of condor job states and which tools are currently used
- Jobs-Dashboard
- shows Galaxy's job states rather than the Condor job states
- Alerts
- NOTE: There is a new WIP dashboard, Node details, that groups and summarizes this information
- Sentry: Check for new issues
- RabbitMQ: Dashboard (check for connection errors and have a look at the queues)
- Celery Flower: Dashboard (check if workers are offline and whether the number of failing tasks is increasing; in that case, check the recently failed tasks)
- To connect:
- Install the Tailscale client
- Log in using your GitHub credentials
- Select the usegalaxy-eu organization (you need to be a member of usegalaxy-eu/admin)
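A minimal sketch of the client-side steps; the exact invocation depends on your OS and Tailscale packaging, and the GitHub login itself happens in the browser:

```bash
# Bring the Tailscale client up; a browser window opens for the GitHub login,
# where you pick the usegalaxy-eu organization
sudo tailscale up
# Verify that the connection is established
tailscale status
```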
- On the headnode:
- Check the server load: `top`, `htop`; in `top`, especially the `wa` (waiting for I/O) value might be interesting. It should not exceed 8.0 and can indicate a storage problem or a handler misconfiguration.
- Check storage availability of: JWDs, root partition, etc. (only if not available in Grafana or further investigation is needed)
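A non-interactive way to capture the same numbers, e.g. for pasting into an issue or a chat message (the JWD paths are placeholders for the actual mount points):

```bash
# CPU summary including the "wa" (I/O wait) value; it should stay well below 8.0
top -bn1 | grep 'Cpu(s)'
# Storage availability of the root partition and the job working directories
df -h /
df -h /data/jwd*    # placeholder paths; use the actual JWD mount points
```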
- Check the idle, running, and held jobs in the condor queue: `condor_q`
- If jobs are in the held state, investigate them and try to release them.
- To get the list of jobs in the held state and the reason: `condor_q -hold`
- If you see the reason `SYSTEM_PERIODIC_HOLD`, the job was held by the periodic hold policy. This policy holds jobs that have been in the queue or running for more than 2592000 seconds (720 hrs, i.e. 30 days).
- In such cases, have a look at the job details (use `gxadmin` queries): who submitted it and which tool it is. If you know that the tool does not take that long, something is wrong, so investigate it, or simply release the job by running `condor_release <job_id>` and monitor it for a while.
- You can better analyze the job: `condor_q --better-analyze <job_id>`
- To release the job: `condor_release <job_id>`
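Putting the held-job steps together, a typical session might look like the following sketch. The job id is a placeholder, and the `gxadmin` query shown is just one way to find the submitting user and tool:

```bash
condor_q -hold -autoformat ClusterId HoldReason        # held jobs with their reasons
condor_q --better-analyze 1234567                      # placeholder cluster id
gxadmin tsvquery queue-detail --all | grep 1234567     # matching Galaxy job, tool and user
condor_release 1234567                                 # release it and keep monitoring
```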
- If many jobs are in the `idle` state, the cluster is under heavy load.
- Check if there are any `Unclaimed/Idle` slots available: `condor_status --compact | (head -n 2; grep Ui)`
- Check VGCN monitoring dashboard for any issues related to the availability of the worker nodes: Dashboard
- Get more information about the idle jobs (`JobStatus` 1 means idle): `condor_q -autoformat:t ClusterId JobDescription RequestMemory RequestCpus JobStatus | grep -P "\t1$"`
- Have a brief look at the better analysis of the jobs: `condor_q --better-analyze <job_id>`, and check the resource requirements of the job.
- Fixing any issues with the availability of the worker nodes usually resolves this.
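For a quick overall picture before digging into individual jobs, the totals and slot overview are usually enough (a generic sketch, nothing here is specific to usegalaxy.eu):

```bash
condor_q -totals            # idle / running / held job counts at a glance
condor_status --compact     # per-node slot overview (look for Unclaimed/Idle slots)
```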
- Gxadmin count of new, queued, and running jobs: `gxadmin tsvquery queue-detail --all | awk '{print $1}' | sort | uniq -c`
- Watch the new and queued jobs to find out whether they are getting picked up by the handlers and getting condor job ids:
- `watchendnew`: an alias that watches the end of the new queue. This helps to find out whether the jobs are getting picked up by the handlers or not.
- `watchendqueue`: an alias that watches the end of the queue. This helps to find out whether the jobs are getting assigned the condor ids or not.
- `highscore`: an alias that shows the number of jobs submitted by each user.
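The three names above are shell aliases defined on the headnode rather than standalone tools; if they are missing from your session or you want to see exactly what they run, inspect them first:

```bash
# Show the definitions of the helper aliases in the current shell
alias | grep -E 'watchendnew|watchendqueue|highscore'
```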
- Check handler logs: `glh` or `journalctl -fu galaxy-handler@<handler_number_here>`, and `glg` or `journalctl -fu galaxy-gunicorn@<handler_number_here>`
- Check when the web handlers last wrote some logs: `gxadmin gunicorn lastlog` (should be as recent as possible; if not, the web handlers have some issues)
- Internal user (Freiburg) requests
- External user requests (via pings in Gitter channels or direct messages)
- Pull requests in various repositories and Issues
- Requests for TIaaS
- Requests for quota increase
- Projects are converted to tasks; priorities are defined for each task and documented here: Projects
- Projects typically belong to the following:
- Quality of Life projects
- New feature projects
- Infrastructure improvement projects
- Documentation improvement projects
- Projects related to maintenance, updates, migration, monitoring, and testing