A batch job monitoring solution for PBS Professional by Altair, built by the Center for Research Computing and Data (CRCD) for the Metis Supercomputing Cluster at NIU.
This tool monitors jobs by parsing and storing the output of the `jobstat -anL` and `jmanl <username> year raw` commands in a persistent database, and lets you view this data via a convenient web application.
Authentication and data access are based on the user's actual login to the Metis cluster, which is verified securely and remotely over SSH. Commands are likewise run over SSH, which means this software can be deployed anywhere.
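As a rough illustration of the remote execution model, the following sketch uses the `openssh` crate (described further below) to run a PBS command on the cluster; the hostname, username, and error handling are placeholders, not the application's actual code:

```rust
use openssh::{KnownHosts, Session};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a multiplexed SSH session to the cluster login node.
    // "user@metis.example.edu" is a placeholder, not the real host.
    let session = Session::connect("user@metis.example.edu", KnownHosts::Strict).await?;

    // Run a command remotely and capture its output for later parsing.
    let output = session.command("jobstat").arg("-anL").output().await?;
    println!("{}", String::from_utf8_lossy(&output.stdout));

    session.close().await?;
    Ok(())
}
```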
This web application and its backend are written entirely in Rust. Rust was selected because parsing sensitive data at this scale requires a highly performant and secure solution, both of which are strong selling points of the language.
One of the newer innovations in frontend development is Server-Side Rendering (SSR). It has many benefits, one of the largest being faster load times. Hundreds or even thousands of jobs may be displayed on a single page; with Client-Side Rendering (CSR), this would normally be handled with an API call after the page loads. That call can slow down the browser and leave the user staring at a blank page while the data arrives.
By using the Askama templating crate for Rust, this data can be injected and pre-rendered directly into the returned HTML.
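As a minimal sketch of this approach (the template, struct, and job fields are hypothetical, not the application's real ones), Askama compiles the template at build time and renders the job data straight into the HTML response:

```rust
use askama::Template;

// Hypothetical job row; the real application's data model will differ.
struct Job {
    id: String,
    owner: String,
    state: String,
}

// The template source is inlined here for brevity; a real project would
// typically point at a file in the `templates/` directory instead.
#[derive(Template)]
#[template(
    source = "<table>{% for job in jobs %}<tr><td>{{ job.id }}</td><td>{{ job.owner }}</td><td>{{ job.state }}</td></tr>{% endfor %}</table>",
    ext = "html"
)]
struct JobsTemplate {
    jobs: Vec<Job>,
}

fn main() -> Result<(), askama::Error> {
    let page = JobsTemplate {
        jobs: vec![Job {
            id: "12345.metis".into(),
            owner: "someuser".into(),
            state: "R".into(),
        }],
    };
    // The data is rendered directly into the HTML string, so the browser
    // never needs a follow-up API call to fill the table.
    println!("{}", page.render()?);
    Ok(())
}
```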
The web server for this application is built on the Axum Rust framework. Axum is well-suited for creating safe, highly parallel, and extremely performant web servers.
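A bare-bones Axum server, sketched with a placeholder route rather than the application's actual ones, looks roughly like this:

```rust
use axum::{routing::get, Router};

// Placeholder handler; the real application returns pre-rendered Askama HTML.
async fn health() -> &'static str {
    "ok"
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new().route("/health", get(health));

    // Port 5777 mirrors the default this application exposes.
    let listener = tokio::net::TcpListener::bind("0.0.0.0:5777").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```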
Authentication is handled by remotely executing an `expect` script for the `su` command over SSH, done with the `openssh` crate. Sessions are stored with the `tower-sessions` crate. Because of the extremely sensitive nature of the credentials, both the credentials themselves and the sessions are stored only in memory, and sessions expire after 30 minutes.
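The session side of this can be sketched with `tower-sessions` as follows; this is an illustrative wiring of an in-memory store with a 30-minute inactivity expiry, not the application's exact setup (it also assumes the `time` crate for the duration type):

```rust
use axum::{routing::get, Router};
use time::Duration;
use tower_sessions::{Expiry, MemoryStore, SessionManagerLayer};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sessions live only in memory, so they vanish when the process stops,
    // and each session expires after 30 minutes of inactivity.
    let session_store = MemoryStore::default();
    let session_layer = SessionManagerLayer::new(session_store)
        .with_expiry(Expiry::OnInactivity(Duration::minutes(30)));

    let app = Router::new()
        .route("/", get(|| async { "ok" }))
        .layer(session_layer);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:5777").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```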
Command execution is done remotely over SSH, after which the command output is parsed with the `regex` crate.
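As a rough idea of the parsing step, here is a `regex` sketch against a made-up line format; the real `jobstat` output and capture groups will differ:

```rust
use regex::Regex;

fn main() {
    // Hypothetical line format, for illustration only; real `jobstat -anL`
    // output has more columns and a different layout.
    let line = "12345.metis  someuser  short  my_job  R  01:23:45";

    // Capture job id, owner, queue, name, and state from the line.
    let re = Regex::new(
        r"^(?P<id>\S+)\s+(?P<owner>\S+)\s+(?P<queue>\S+)\s+(?P<name>\S+)\s+(?P<state>\S+)",
    )
    .expect("valid regex");

    if let Some(caps) = re.captures(line) {
        println!("job {} owned by {} is {}", &caps["id"], &caps["owner"], &caps["state"]);
    }
}
```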
Data from `jobstat`, `jmanl`, and `groups` is stored persistently in a SQLite database via the `rusqlite` crate. Commands are run in parallel using the asynchronous Rust runtime Tokio.
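A minimal `rusqlite` sketch of persisting parsed job data might look like the following; the table name and columns are invented for illustration:

```rust
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    // Open (or create) the database file; the path mirrors the DB_PATH default.
    let conn = Connection::open("data.db")?;

    // Hypothetical schema, not the application's actual one.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs (
            id    TEXT PRIMARY KEY,
            owner TEXT NOT NULL,
            state TEXT NOT NULL
        )",
        [],
    )?;

    // Upsert a parsed job row.
    conn.execute(
        "INSERT OR REPLACE INTO jobs (id, owner, state) VALUES (?1, ?2, ?3)",
        params!["12345.metis", "someuser", "R"],
    )?;

    Ok(())
}
```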
This application and its dependencies are declaratively defined using the Nix package manager and hash-locked using Nix Flakes. You can enter the development environment with `nix develop .#hawkeye`, or build the application with `nix build .#hawkeye`.
This application is built into a reproducible Docker container image using GitHub Actions (tutorial here), and is publicly available at `ghcr.io/hiibolt/hawkeye:latest`.
This application can be deployed either with the standalone container image at `ghcr.io/hiibolt/hawkeye:latest` or with Docker Compose.
By default, this application listens on port 5777 and uses the host network.
First, clone this repository and move into it:
```bash
git clone https://github.com/hiibolt/hawkeye.git
cd hawkeye
```
Next, create a `.env` file and add the following variables:
Required Variables
- `REMOTE_USERNAME` - The SSH username you'd like to use to log into the remote machine
- `REMOTE_HOSTNAME` - The SSH hostname of the remote machine
- `DB_PATH` - The path of the database to open, relative to the `data` volume. You can leave this as `data.db` if you're unsure; a new database will be created for you.
Optional Variables
- `RUST_LOG` - The maximum logging level to use. Some options are `info`, `warn`, and `error`. I suggest `warn`, as there is a staggering amount of output at the `info` level. If you wish to debug, use selective levels.
- `GROUPS_DAEMON_PERIOD` - The time in seconds between each groups daemon run. The default is an hour.
- `JOBS_DAEMON_PERIOD` - The time in seconds between each data gathering run (`jobstat`). The default is every 5 minutes.
- `OLD_JOBS_DAEMON_PERIOD` - The time in seconds between each data verification run (`jmanl`). The default is every 30 minutes.
Deploying is as simple as running `docker compose up -d`. Please note that it may take substantial time to pull the image for the first time.
Next, you'll need to enter the container and generate an SSH keyfile:
```bash
docker exec -it hawkeye-hawkeye-1 /bin/sh
ssh-keygen
ssh-copy-id <remote_username>@<remote_hostname>
exit
```
Finally, restart the stack to include the new SSH login:
```bash
docker compose restart
```