Updating HTCondor metrics #23

Open
sebastian-luna-valero opened this issue Nov 22, 2024 · 0 comments
sebastian-luna-valero commented Nov 22, 2024

Hi @pauldg and @sanjaysrikakulam

We currently do the metascheduling using the total number of unclaimed CPUs and the total amount of unclaimed RAM across the entire HTCondor cluster. However, @abdulrahmanazab pointed out that this can be problematic and suggested reporting metrics that specify the type of unclaimed slots.

To make the switch, I suggest we move from the current bash script to the Python bindings.

Here is a quick and dirty script for you to test:

import collections

import htcondor

collector = htcondor.Collector()

# Count unclaimed slots, grouped by their (Cpus, Memory) combination
unclaimed_slots_summary = collections.Counter()
for slot in collector.query(
    htcondor.AdTypes.Startd,
    constraint='State == "Unclaimed"',
    projection=['Cpus', 'Memory'],
):
    unclaimed_slots_summary[(slot['Cpus'], slot['Memory'])] += 1

# Report one line per slot type
for (cpus, memory), count in unclaimed_slots_summary.items():
    print("There are {} slots with {} CPUs and {} memory".format(count, cpus, memory))

You need to install the htcondor package. Could you please run it on your end and let me know the results?

Here is a sample output on the EGI pulsar endpoint:

$ python mon.py
There are 5 slots with 8 CPUs and 15736 memory
There are 1 slots with 6 CPUs and 7800 memory
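To illustrate why cluster-wide totals can mislead, here is a minimal sketch that uses the slot counts from the sample output above. The job request (16 CPUs, 32000 memory) is a made-up example, and the check assumes a job has to fit inside a single unclaimed slot, which may not hold for every slot configuration.

```python
# Illustrative only: slot counts copied from the sample output above.
# Keys are (Cpus, Memory) pairs, values are the number of unclaimed slots.
unclaimed = {(8, 15736): 5, (6, 7800): 1}

# Cluster-wide totals, which is what the current metrics report
total_cpus = sum(cpus * n for (cpus, _), n in unclaimed.items())
total_mem = sum(mem * n for (_, mem), n in unclaimed.items())


def fits_in_aggregate(req_cpus, req_mem):
    """Check the request against the cluster-wide totals only."""
    return req_cpus <= total_cpus and req_mem <= total_mem


def fits_in_some_slot(req_cpus, req_mem):
    """Check whether at least one slot type can hold the whole request."""
    return any(req_cpus <= cpus and req_mem <= mem for cpus, mem in unclaimed)


print(total_cpus, total_mem)  # 46 86480
# Hypothetical job request: 16 CPUs and 32000 memory
print(fits_in_aggregate(16, 32000), fits_in_some_slot(16, 32000))  # True False
```

The totals suggest the job would fit, but no single slot type is large enough, which is the kind of case per-slot-type metrics would expose.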

If it works as expected, we just need to decide how to represent this properly in the InfluxDB line protocol format.
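As a starting point for that discussion, here is one possible representation, sketched under assumptions: the measurement name `htcondor_unclaimed_slots`, the tag keys `cpus` and `memory`, and the field key `count` are all placeholders I made up, not an agreed schema. Each slot type becomes one line, with the slot shape as tags and the number of slots as an integer field (the trailing `i` marks an integer in line protocol). The timestamp is omitted so the receiver assigns it.

```python
def to_line_protocol(slot_counts, measurement="htcondor_unclaimed_slots"):
    """Format {(cpus, memory): count} as InfluxDB line protocol strings.

    The measurement name, tag keys and field key are hypothetical placeholders.
    """
    lines = []
    for (cpus, memory), count in sorted(slot_counts.items()):
        # measurement,tag=value,tag=value field=value
        lines.append(f"{measurement},cpus={cpus},memory={memory} count={count}i")
    return lines


# Slot counts copied from the sample output on the EGI pulsar endpoint
sample = {(8, 15736): 5, (6, 7800): 1}
for line in to_line_protocol(sample):
    print(line)
```

Putting the slot shape in tags keeps one series per slot type, so the count can be tracked over time. An alternative would be to report `cpus` and `memory` as fields instead, if the number of distinct slot types turns out to be large.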
