Ongoing transfers after a job is killed #920

Open

tonyhutter opened this issue May 26, 2020 · 2 comments

@tonyhutter

Question:

Some questions that came up while testing the BB API:

Q1:
Let's say I start a job that initiates a big 1TB transfer between the SSD and GPFS. Then the job is killed off. If I knew the transfer handle (let's say I saved it), could I cancel the transfer even though I had a new job ID?

If the answer is yes:
What would happen if I was able to guess the transfer handle of another user? Could I cancel their transfers as well? Or is that not allowed?

If the answer is no:
How would I kill off my old transfer? Assume I'm the same user and don't have root.

Q2:
What happens if User1 requests the entire burst buffer on a node, fills it with a single file, starts transferring the file from the burst buffer to GPFS, and then hits a segfault (so the job dies, but the transfer is still going)? Then, User2 gets assigned the same node and burst buffer, and does the same thing. That is, User2 writes a single file to the entire burst buffer, overwriting User1's extents. Does User1's ongoing "zombie transfer" then get corrupted with User2's new data? Or is User1's BB transfer automatically cancelled when a new user gets their node?

Answer:

Place the final answer here.

Approach:

Here, replace this with a short summary of how you addressed the problem. In the comments, place step-by-step notes of progress as you go.

What is next:

Define the next steps and follow up here

@tgooding (Contributor)

A1:
If you are referring to a JOB:
When you submit your job, you can specify two different stage-out scripts. The two scripts execute at different points of stage-out job teardown. The Stageout1 script executes before transfers are queried; this stage can be used to initiate new transfers from SSD to GPFS or to cancel existing ones. The Stageout2 script executes after all transfers have completed; this stage is intended for updating any checkpoint-tracking information that you might have in SCR.
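
As a rough illustration of the first case, here is a minimal C sketch of what a Stageout1-style drain could look like through the bbAPI. The call names (BB_InitLibrary, BB_CreateTransferDef, BB_AddFiles, BB_GetTransferHandle, BB_StartTransfer) are taken from bbapi.h; the paths, tag value, and contributor list are illustrative assumptions, not values from this issue.

```c
/* Minimal sketch: queue one SSD->GPFS transfer during Stageout1.
 * Assumes the bbAPI declarations in bbapi.h; the paths, tag, and
 * contributor id below are hypothetical examples. */
#include <stdint.h>
#include <stdio.h>
#include "bbapi.h"

int main(void)
{
    BBTransferDef_t*   xfer   = NULL;
    BBTransferHandle_t handle = 0;
    uint32_t           contrib[1] = { 0 };   /* single contributor (rank 0) */

    int rc = BB_InitLibrary(0, BBAPI_CLIENTVERSIONSTR);
    if (rc) { fprintf(stderr, "BB_InitLibrary rc=%d\n", rc); return 1; }

    rc = BB_CreateTransferDef(&xfer);
    if (!rc)
        rc = BB_AddFiles(xfer,
                         "/mnt/bb_ssd/ckpt.0",   /* example source on the SSD */
                         "/p/gpfs1/me/ckpt.0",   /* example target on GPFS    */
                         0);                     /* default flags             */
    if (!rc)
        /* A tag plus the contributor list identifies the transfer handle. */
        rc = BB_GetTransferHandle(1234 /* example tag */, 1, contrib, &handle);
    if (!rc)
        rc = BB_StartTransfer(xfer, handle);

    if (rc)
        fprintf(stderr, "stage-out transfer did not start, rc=%d\n", rc);
    else
        printf("queued transfer, handle=%llu (save this for later queries)\n",
               (unsigned long long)handle);

    if (xfer) BB_FreeTransferDef(xfer);
    BB_TerminateLibrary();
    return rc ? 1 : 0;
}
```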

If you are referring to a JOBSTEP:
The orchestrating job script can query and cancel transfers using the handle(s), even if they occur under different job steps. Jobstep=0 is special and can be used to query all transfers within the job.
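
A hedged sketch of what that query-and-cancel flow could look like from the orchestrating script, again assuming the bbapi.h interfaces (BB_GetTransferList, BB_GetTransferInfo, BB_CancelTransfer) and that the status/scope constant names used below match the installed headers:

```c
/* Sketch: list this job's transfer handles and cancel any still in flight.
 * Assumes bbapi.h; the constant names (BBALL, BBINPROGRESS, BBSCOPETRANSFER)
 * are assumptions about the installed bbAPI headers. */
#include <stdint.h>
#include <stdio.h>
#include "bbapi.h"

int cancel_inflight_transfers(void)
{
    BBTransferHandle_t handles[256];
    uint64_t numHandles = 256;   /* in: array capacity, out: handles returned */
    uint64_t numAvail   = 0;     /* out: handles that matched the filter      */

    /* Match any status so completed and in-flight handles are both returned. */
    int rc = BB_GetTransferList(BBALL, &numHandles, handles, &numAvail);
    if (rc) return rc;

    for (uint64_t i = 0; i < numHandles; ++i) {
        BBTransferInfo_t info;
        if (BB_GetTransferInfo(handles[i], &info) == 0 &&
            info.status == BBINPROGRESS) {
            /* Scope the cancel to this one handle rather than the whole tag. */
            BB_CancelTransfer(handles[i], BBSCOPETRANSFER);
        }
    }
    return 0;
}
```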

Job metadata leverages the file system and is stored as the user of the job, so if another user were to be given (or guess) the job/jobstep/handle, they would not have sufficient permissions to inspect or cancel the transfers.

@tgooding (Contributor)

A2:
No. Each job has its own XFS file system in an LVM logical volume (which is kind of like a dynamic partition). The size of the LV is determined by the bsub command-line request.

In your scenario, User1's job would retain its LV storage for the entire node; this should prevent LSF from scheduling another job on that node (no free space) until the LV is destroyed during the stage-out phase.

For the case where two or more LVs exist simultaneously (e.g., the aggregate request is less than 1.6T), the jobs will reside in separate LVs and therefore at different physical locations on the SSD. bbServer tracks extents by SSD physical LBA (i.e., not relative to the LV, XFS, or a hash of paths/filenames).
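
To connect this back to Q1/Q2: if the old job saved its transfer handle, a later check could poll that handle and confirm whether the transfer drained to completion from its own LV's extents. A small sketch, again assuming the bbapi.h calls and that the BBSTATUS enum names below exist as spelled:

```c
/* Sketch: check whether a previously saved handle finished cleanly.
 * Assumes bbapi.h; BBFULLSUCCESS/BBINPROGRESS/BBCANCELED are assumptions
 * about the BBSTATUS enum names in the installed headers. */
#include <stdio.h>
#include "bbapi.h"

void report_transfer_state(BBTransferHandle_t savedHandle)
{
    BBTransferInfo_t info;
    int rc = BB_GetTransferInfo(savedHandle, &info);
    if (rc) {
        fprintf(stderr, "BB_GetTransferInfo rc=%d\n", rc);
        return;
    }

    if (info.status == BBFULLSUCCESS)
        printf("handle %llu: all extents landed on GPFS\n",
               (unsigned long long)savedHandle);
    else if (info.status == BBINPROGRESS)
        printf("handle %llu: still draining from the old LV's extents\n",
               (unsigned long long)savedHandle);
    else if (info.status == BBCANCELED)
        printf("handle %llu: was cancelled\n",
               (unsigned long long)savedHandle);
    else
        printf("handle %llu: status=%d\n",
               (unsigned long long)savedHandle, (int)info.status);
}
```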
