Ongoing transfers after a job is killed #920

Open

tonyhutter opened this issue May 26, 2020 · 2 comments

@tonyhutter

Question:

Some questions that came up while testing the BB API:

Q1:
Let's say I start a job that initiates a big 1TB transfer between the SSD and GPFS. Then the job is killed off. If I knew the transfer handle (let's say I saved it), could I cancel the transfer even though I had a new job ID?

If the answer is yes:
What would happen if I was able to guess the transfer handle of another user? Could I cancel their transfers as well? Or is that not allowed?

If the answer is no:
How would I kill off my old transfer? Assume I'm the same user and don't have root.

Q2:
What happens if User1 requests the entire burst buffer on a node, fills it with a single file, starts transferring the file from the burst buffer to GPFS, and then hits a segfault (so the job dies, but the transfer is still going)? Then, User2 gets assigned the same node and burst buffer, and does the same thing. That is, User2 writes a single file to the entire burst buffer, overwriting User1's extents. Does User1's ongoing "zombie transfer" then get corrupted with User2's new data? Or is User1's BB transfer automatically cancelled when a new user gets their node?

Answer:

Place the final answer here.

Approach:

Here, replace this with a short summary of how you addressed the problem. In the comments, place step-by-step notes of progress as you go.

What is next:

Define the next steps and follow up here

@tgooding (Contributor)

A1:
If you are referring to a JOB:
When you submit your job, you can specify two different stage-out scripts. The two scripts execute at different points of stage-out job teardown. The Stageout1 script executes before transfers are queried; this stage can be used to initiate new transfers from SSD to GPFS or to cancel existing ones. The Stageout2 script executes after all transfers have completed; this stage is intended for updating any checkpoint-tracking information that you might have in SCR.
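
As a rough illustration of the first case, here is a minimal C sketch of what a Stageout1-style drain could look like through the bbAPI. The call names (BB_InitLibrary, BB_CreateTransferDef, BB_AddFiles, BB_GetTransferHandle, BB_StartTransfer) are taken from bbapi.h; the paths, tag value, and contributor list are illustrative assumptions, not values from this issue.

```c
/* Minimal sketch: queue one SSD->GPFS transfer during Stageout1.
 * Assumes the bbAPI declarations in bbapi.h; the paths, tag, and
 * contributor id below are hypothetical examples. */
#include <stdint.h>
#include <stdio.h>
#include "bbapi.h"

int main(void)
{
    BBTransferDef_t*   xfer   = NULL;
    BBTransferHandle_t handle = 0;
    uint32_t           contrib[1] = { 0 };   /* single contributor (rank 0) */

    int rc = BB_InitLibrary(0, BBAPI_CLIENTVERSIONSTR);
    if (rc) { fprintf(stderr, "BB_InitLibrary rc=%d\n", rc); return 1; }

    rc = BB_CreateTransferDef(&xfer);
    if (!rc)
        rc = BB_AddFiles(xfer,
                         "/mnt/bb_ssd/ckpt.0",   /* example source on the SSD */
                         "/p/gpfs1/me/ckpt.0",   /* example target on GPFS    */
                         0);                     /* default flags             */
    if (!rc)
        /* A tag plus the contributor list identifies the transfer handle. */
        rc = BB_GetTransferHandle(1234 /* example tag */, 1, contrib, &handle);
    if (!rc)
        rc = BB_StartTransfer(xfer, handle);

    if (rc)
        fprintf(stderr, "stage-out transfer did not start, rc=%d\n", rc);
    else
        printf("queued transfer, handle=%llu (save this for later queries)\n",
               (unsigned long long)handle);

    if (xfer) BB_FreeTransferDef(xfer);
    BB_TerminateLibrary();
    return rc ? 1 : 0;
}
```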

If you are referring to a JOBSTEP:
The orchestrating job script can query and cancel transfers using the handle(s), even if they occur under different job steps. Jobstep=0 is special and can be used to query all transfers within the job.
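
A hedged sketch of what that query-and-cancel flow could look like from the orchestrating script, again assuming the bbapi.h interfaces (BB_GetTransferList, BB_GetTransferInfo, BB_CancelTransfer) and that the status/scope constant names used below match the installed headers:

```c
/* Sketch: list this job's transfer handles and cancel any still in flight.
 * Assumes bbapi.h; the constant names (BBALL, BBINPROGRESS, BBSCOPETRANSFER)
 * are assumptions about the installed bbAPI headers. */
#include <stdint.h>
#include <stdio.h>
#include "bbapi.h"

int cancel_inflight_transfers(void)
{
    BBTransferHandle_t handles[256];
    uint64_t numHandles = 256;   /* in: array capacity, out: handles returned */
    uint64_t numAvail   = 0;     /* out: handles that matched the filter      */

    /* Match any status so completed and in-flight handles are both returned. */
    int rc = BB_GetTransferList(BBALL, &numHandles, handles, &numAvail);
    if (rc) return rc;

    for (uint64_t i = 0; i < numHandles; ++i) {
        BBTransferInfo_t info;
        if (BB_GetTransferInfo(handles[i], &info) == 0 &&
            info.status == BBINPROGRESS) {
            /* Scope the cancel to this one handle rather than the whole tag. */
            BB_CancelTransfer(handles[i], BBSCOPETRANSFER);
        }
    }
    return 0;
}
```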

Job metadata leverages the file system and is stored as the user of the job, so if another user were to be given (or guess) the job/jobstep/handle, they would not have sufficient permissions to inspect or cancel the transfers.

@tgooding (Contributor)

A2:
No. Each job has its own XFS file system in an LVM logical volume (which is kind of like a dynamic partition). The size of the LV is determined by the bsub command-line request.

In your scenario, User1's job would retain its LV storage for the entire node; this should prevent LSF from scheduling another job on that node (no free space) until the LV is destroyed during the stage-out phase.

For the case where two or more LVs exist simultaneously (e.g., the aggregate request is less than 1.6T), the jobs will reside in separate LVs and therefore at different physical locations on the SSD. bbServer tracks extents by SSD physical LBA (i.e., not relative to the LV, XFS, or a hash of paths/filenames).
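
To connect this back to Q1/Q2: if the old job saved its transfer handle, a later check could poll that handle and confirm whether the transfer drained to completion from its own LV's extents. A small sketch, again assuming the bbapi.h calls and that the BBSTATUS enum names below exist as spelled:

```c
/* Sketch: check whether a previously saved handle finished cleanly.
 * Assumes bbapi.h; BBFULLSUCCESS/BBINPROGRESS/BBCANCELED are assumptions
 * about the BBSTATUS enum names in the installed headers. */
#include <stdio.h>
#include "bbapi.h"

void report_transfer_state(BBTransferHandle_t savedHandle)
{
    BBTransferInfo_t info;
    int rc = BB_GetTransferInfo(savedHandle, &info);
    if (rc) {
        fprintf(stderr, "BB_GetTransferInfo rc=%d\n", rc);
        return;
    }

    if (info.status == BBFULLSUCCESS)
        printf("handle %llu: all extents landed on GPFS\n",
               (unsigned long long)savedHandle);
    else if (info.status == BBINPROGRESS)
        printf("handle %llu: still draining from the old LV's extents\n",
               (unsigned long long)savedHandle);
    else if (info.status == BBCANCELED)
        printf("handle %llu: was cancelled\n",
               (unsigned long long)savedHandle);
    else
        printf("handle %llu: status=%d\n",
               (unsigned long long)savedHandle, (int)info.status);
}
```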
