
Allow user to delete an instance while it is still spawning #334

Closed
bodom0015 wants to merge 1 commit

Conversation

@bodom0015 (Member) commented Dec 13, 2018

Problem

Fixes #330

When a user spawns an instance, a job can (rarely) fail and leave the instance spinning indefinitely. In that case, it would be helpful to be able to delete the faulty instance.

Approach

We currently hide the Delete button for Tales in the LAUNCHING state, but the API does not prevent operating on these objects (a minimal sketch of this is shown after the test case below). If we always show the Delete button, the user can reset their own state without needing to contact support.

Test Case

  1. Check out and run this branch
  2. Open the WholeTale Dashboard and navigate to /browse
  3. Choose any Tale from the list on the left and click "Launch"
    • You should see your Instance appear in the "Launched Tales" panel on the right
    • You should see a spinner at the center of the new Instance in the list, indicating that the Instance is "Launching"
    • You should see that the Delete (X) icon appears at the top-right of the new Instance in the list at all times
  4. Before the Tale finishes spawning (if you see the spinner disappear, you were too late), click the Delete button
    • You should be prompted for confirmation via a modal window
  5. Confirm deletion of the Tale in the modal
    • You should see the Instance disappear from the "Launched Tales" panel
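For reference, the same deletion can be driven directly against the API, since (as noted above) the API never blocked these operations. A minimal sketch using girder_client; the API key and instance id are placeholders, not values from this thread:

# Minimal sketch: delete an instance via the Girder API regardless of its state.
# The API key and instance id below are placeholders.
from girder_client import GirderClient

gc = GirderClient(apiUrl='https://girder.dev.wholetale.org/api/v1')
gc.authenticate(apiKey='<your-api-key>')

instance_id = '<id-of-a-still-launching-instance>'
# The backend does not reject this while the instance is still LAUNCHING,
# which is exactly what this PR relies on.
gc.delete('/instance/%s' % instance_id)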

@ThomasThelen (Member) commented:

I was able to get the instance list into a weird state:

  1. Launched a Tale and let it fully launch
  2. Launched a second tale
  3. As the spinner was spinning I switched to the run page
  4. I deleted the second Tale
  5. After deleting it, there was a slot in the Tale list (like the one described in issue #213, "Deleted instances still show up in Launched Tales tab"), except that it was blank. There was a border for the Tale, and I could mouse over it to highlight it.
  6. I deleted the first Tale
  7. I returned to the browse page
  8. See the screenshots and console log below

I saw that there were two calls to instance, and I thought it was interesting that one returned nothing while the other returned []. There was also a call to job.

screen shot 2018-12-13 at 12 19 56 pm

screen shot 2018-12-13 at 12 20 22 pm

Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number
error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number error.js:40
Error: Ember Data Request DELETE https://girder.dev.wholetale.org/api/v1/instance/5c117b0e3fea9e0001a54999 returned a 400
Payload (Empty Content-Type)
[object Object] error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number
error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number

screen shot 2018-12-13 at 12 22 22 pm

screen shot 2018-12-13 at 12 24 27 pm

screen shot 2018-12-13 at 12 25 29 pm

After writing all of that and refreshing the page, the dashboard looks like it's back in its normal state, so maybe it's a non-issue. I'm going to leave all of this here because it may be useful for issue #213.

@ThomasThelen (Member) commented:

While looking at the jobs in the issue above, I noticed some errors. In the image below, I've launched a Tale and deleted it before completion multiple times. In some cases I may have deleted the Tale before the Create Tale Data Volume job finished.
screen shot 2018-12-13 at 12 32 57 pm

Starting from the bottom and going up the list:
The first two Spawn Instance jobs don't have logs.
The third Spawn Instance job (5c12c0ff2b0fa10001d8d2b9), the one with the spinner, has the following log:

HttpError: HTTP error 400: GET https://girder.dev.wholetale.org/api/v1//instance/5c12c0f42b0fa10001d8d2b2
Response text: {"message": "Invalid instance id (5c12c0f42b0fa10001d8d2b2).", "type": "rest"}
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/girder_worker/task.py", line 148, in __call__
    results = super(Task, self).__call__(*_t_args, **_t_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/gwvolman/gwvolman/tasks.py", line 139, in launch_container
    self.girder_client, payload['instanceId'])
  File "/gwvolman/gwvolman/utils.py", line 86, in _get_user_and_instance
    instance = girder_client.get('/instance/' + instanceId)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 503, in get
    return self.sendRestRequest('GET', path, parameters, jsonResp=jsonResp)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 497, in sendRestRequest
    response=result)
(the same HttpError and traceback appear a second time in the log)

The Create Tale Data Volume task with the spinner (5c12c16a2b0fa10001d8d2c1) had the following log:

[2018-12-13 20:30:54,217] WARNING: Failed to post job: HTTP error 500: POST https://girder.dev.wholetale.org/api/v1/job?public=False&title=Spawn+Instance&type=celery&args=%5B%7B%22sessionId%22%3A+%225c12c16e2b0fa10001d8d2c4%22%2C+%22nodeId%22%3A+%22kfim4zvd540hhinoxh2ey4doi%22%2C+%22mountPoint%22%3A+%22%2Fvar%2Flib%2Fdocker%2Fvolumes%2F59a657b2615678000159c2bd_thelen_duJgWs%2F_data%22%2C+%22instanceId%22%3A+%225c12c16a2b0fa10001d8d2c0%22%2C+%22volumeName%22%3A+%2259a657b2615678000159c2bd_thelen_duJgWs%22%7D%5D&otherFields=%7B%22celeryParentTaskId%22%3A+%2271f7a9ba-41ae-46d9-a2b5-1ed1bbf14810%22%2C+%22celeryTaskId%22%3A+%223b720f0a-366f-430a-9506-1893d0eed566%22%7D&handler=celery_handler
Response text: {"message": "TypeError: TypeError(\"'NoneType' object does not support item assignment\",)", "trace": ["<FrameSummary file /girder/girder/api/rest.py, line 630 in endpointDecorator>", "<FrameSummary file /girder/girder/api/rest.py, line 1224 in POST>", "<FrameSummary file /girder/girder/api/rest.py, line 967 in handleRoute>", "<FrameSummary file /girder/girder/api/rest.py, line 445 in wrapped>", "<FrameSummary file /girder/girder/api/access.py, line 88 in wrapped>", "<FrameSummary file /girder/girder/api/describe.py, line 709 in wrapped>", "<FrameSummary file /girder/plugins/jobs/server/job_rest.py, line 105 in createJob>", "<FrameSummary file /girder/plugins/jobs/server/models/job.py, line 283 in createJob>", "<FrameSummary file /girder/plugins/jobs/server/models/job.py, line 303 in save>", "<FrameSummary file /girder/girder/models/model_base.py, line 530 in save>", "<FrameSummary file /girder/girder/events.py, line 314 in trigger>", "<FrameSummary file /girder/plugins/worker/server/__init__.py, line 205 in attachJobInfoSpec>", "<FrameSummary file /girder/plugins/jobs/server/models/job.py, line 442 in updateJob>", "<FrameSummary file /girder/girder/events.py, line 314 in trigger>", "<FrameSummary file /girder/plugins/wholetale/server/models/instance.py, line 196 in finalizeInstance>", "<FrameSummary file /girder/plugins/wholetale/server/models/instance.py, line 141 in updateInstance>"], "type": "internal"}

The Create Tale Data Volume job with the red X (5c12c1c02b0fa10001d8d2cc) had the following log:

HttpError: HTTP error 400: GET https://girder.dev.wholetale.org/api/v1/folder?offset=0&parentId=596444f3043dea00010d0c74&parentType=folder&limit=50
Response text: {"field": "id", "message": "No such folder: 596444f3043dea00010d0c74", "type": "validation"}
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/girder_worker/task.py", line 148, in __call__
    results = super(Task, self).__call__(*_t_args, **_t_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/gwvolman/gwvolman/tasks.py", line 96, in create_volume
    for folder in self.girder_client.listFolder(tale['folderId'])
  File "/gwvolman/gwvolman/tasks.py", line 95, in <listcomp>
    {'itemId': folder['_id'], 'mountPath': '/' + folder['name']}
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 578, in listResource
    records = self.get(path, params)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 503, in get
    return self.sendRestRequest('GET', path, parameters, jsonResp=jsonResp)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 497, in sendRestRequest
    response=result)

@bodom0015 (Member, Author) commented:

Good find! I certainly had not seen that behavior, but I was indeed able to reproduce it. Something wonky is happening with the state here; I'll need to take a closer look tomorrow morning.

I can maybe answer some of the oddities above:

  • Looking at your screenshots, you didn't actually have multiple calls to /instance: you had an OPTIONS preflight call (a CORS mechanism that tells your client what the server allows; see the sketch below), followed by the actual GET request.
  • The job log stuff is a mystery to me. Here be dragons.
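To illustrate what that preflight looks like, here is a minimal sketch using the Python requests library; the Origin value is an assumption about where the dashboard is served from, not something taken from the screenshots:

# Minimal sketch: reproduce the OPTIONS preflight a browser sends before a
# cross-origin GET. The Origin below is an assumed dashboard host.
import requests

resp = requests.options(
    'https://girder.dev.wholetale.org/api/v1/instance',
    headers={
        'Origin': 'https://dashboard.dev.wholetale.org',
        'Access-Control-Request-Method': 'GET',
        'Access-Control-Request-Headers': 'girder-token',
    },
)
# The Access-Control-Allow-* response headers tell the browser whether the
# real GET (with the Girder-Token header) is permitted.
print(resp.status_code)
print(resp.headers.get('Access-Control-Allow-Methods'))
print(resp.headers.get('Access-Control-Allow-Headers'))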

@ThomasThelen (Member) commented Dec 14, 2018

Oops, you're right about the preflight call, so that stuff is probably perfectly normal.

I think it might be some sort of race condition. We spawn two async jobs for creating instances here. When we delete instances here, we spawn a new job to do the cleanup and add it to the Celery job queue.

I think that when destroyRecord is called on the instance, it ends up calling deleteInstance (linked above). At this point, it's possible to have three jobs running (Spawn, Create Volume, Shutdown), with Shutdown wiping things that Create Volume is still using (job 5c12c1c02b0fa10001d8d2cc failed because a folder no longer existed).

@Xarthisius may have a better idea of what's going on. If it is a race condition, we could fix it by letting the user know we'll delete the instance once it's done spawning, or we could make sure we kill any instance-creation jobs before we attempt to run the Shutdown Instance job (a rough sketch of the latter is below).
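A rough sketch of that second option, under two assumptions: that the Celery task ids for the creation jobs can be looked up (the job log above does show a celeryTaskId in the job's otherFields), and that the tasks tolerate being revoked, which, per the discussion further down, is not the case today:

# Rough illustrative sketch only -- not existing WholeTale code. Revoke any
# still-pending instance-creation tasks before queuing the shutdown job, so
# Shutdown cannot race against Create Tale Data Volume.
from celery import Celery

app = Celery(broker='redis://localhost:6379/0')  # broker URL is a placeholder


def delete_instance(creation_task_ids, shutdown_task_signature):
    """creation_task_ids: Celery ids of the Spawn Instance / Create Tale Data
    Volume tasks (assumed to be recorded on the job documents)."""
    for task_id in creation_task_ids:
        # terminate=True also stops a task that has already started running
        app.control.revoke(task_id, terminate=True)
    # Only queue the cleanup once the creation tasks are out of the way.
    return shutdown_task_signature.delay()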

Edit:
This may be related.

Side note: is that an extra slash in the first error? It may not matter.
https://girder.dev.wholetale.org/api/v1//instance/5c12c0f42b0fa10001d8d2b2
api/v1//instance

@bodom0015 (Member, Author) commented Dec 17, 2018

Interesting... I obviously didn't alter the spawning logic here, and I also hadn't noticed the extra slash. In any case, it seems innocuous, since the request still makes it to the correct endpoint:

$ curl https://girder.dev.wholetale.org/api/v1//instance/5c17d4ca2b0fa10001d8d395 -H "Girder-Token: <valid-token-scraped-from-browser-cookies>"
{"_id": "5c17d4ca2b0fa10001d8d395", "access": {"groups": [], "users": [{"flags": [], "id": "5bdb63506141bc0001cc1b55", "level": 2}]}, "created": "2018-12-17T16:54:34.880000+00:00", "creatorId": "5bdb63506141bc0001cc1b55", "iframe": true, "lastActivity": "2018-12-17T16:54:34.880000+00:00", "name": "LIGO Tutorial", "status": 0, "taleId": "59f0b91584b7920001b46f2e", "updated": "2018-12-17T16:54:42.947000+00:00"}

Regarding the Jobs, I still need to dig into the code, but I'm just realizing that I've actually never seen the Job statuses portion of our Girder dev instance, since I am apparently not yet an admin there... would @Xarthisius or some other kind soul be willing to add me? 😅

Is the "Jobs" view in Girder one that I should be visiting and/or validating regularly? Or is this just used as supplemental information when debugging?

EDIT: A strange (and probably unrelated) thing I just noticed: Girder seems to pick up the token from headers, but not from cookies? I thought both were accepted, but apparently not:

$ curl https://girder.dev.wholetale.org/api/v1//instance/5c17d4ca2b0fa10001d8d395 --cookie "girderToken: <valid-token-scraped-from-browser-cookies>"
{"message": "You must be logged in.", "type": "access"}

@Xarthisius (Contributor) commented Dec 17, 2018

Regarding the Jobs, I still need to dig into the code but I am just realizing that I've actually never seen the Job statuses portion on our Girder dev instance, since I am apparently not yet an admin there... would @Xarthisius or some other kind soul be willing to add me?

Is the "Jobs" view in Girder one that I should be visiting and/or validating regularly? Or is this just used as supplemental information when debugging?

I just made you a site admin. However, the job view doesn't require elevated privileges:

https://girder.dev.wholetale.org/#jobs/user/5bdb63506141bc0001cc1b55

^^ should show you your jobs. If anything goes wrong, it will probably show up there.
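The same listing is also available over REST; a minimal sketch with girder_client, assuming the standard jobs plugin listing endpoint (GET /job) and using the user id from the URL above (the API key is a placeholder):

# Minimal sketch: list a user's jobs via the API instead of the web UI.
# Assumes the jobs plugin's standard GET /job listing endpoint.
from girder_client import GirderClient

gc = GirderClient(apiUrl='https://girder.dev.wholetale.org/api/v1')
gc.authenticate(apiKey='<your-api-key>')

jobs = gc.get('/job', parameters={
    'userId': '5bdb63506141bc0001cc1b55',
    'limit': 25,
})
for job in jobs:
    print(job['_id'], job.get('title'), job.get('status'))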

Answering @ThomasThelen's question: yeah, currently it's impossible to cleanly shut down the tasks related to instance spawning (they depend on each other, are not wrapped as cancelable tasks, etc.). We'll need to work on handling this more gracefully on the backend side.

@bodom0015 (Member, Author) commented Dec 17, 2018

I just made you the site admin. However, the job view doesn't require elevated privileges

On other instances, the Jobs view does require admin access:

Failed to load resource: the server responded with a status of 403 ()
girder_lib.min.js:1 403 error {"message": "Administrator access required.", "type": "access"}
error @ girder_lib.min.js:1
api/v1/job/typeandstatus/all:1 Failed to load resource: the server responded with a status of 403 ()
girder_lib.min.js:1 403 error {"message": "Administrator access required.", "type": "access"}

But maybe WholeTale is configured differently? To get to this view, I need to go to the Admin Console > Jobs... Is there another way to get there, besides knowing to alter the address bar?

EDIT: Scratch that, I just found the "My jobs" link on the account dropdown at the top-right... I can't believe that I have literally never seen this before :o

@ThomasThelen (Member) left a review comment:

LGTM!

@craig-willis (Contributor) commented:

Per the 1/17 discussion with @Xarthisius and @bodom0015, the main issue here is that we are missing an event handler for volume creation on the backend, so the error status is not set properly. If the error status were propagated correctly, this PR would be unnecessary. Sorry, Mike.
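For context, the backend fix being described would look roughly like the sketch below. This is purely illustrative: the event name, status constants, and field names are assumptions, not the actual WholeTale plugin code.

# Purely illustrative sketch of the missing piece: watch for failed volume
# creation jobs and mark the owning instance as errored, so the dashboard sees
# a failed (and deletable) instance instead of one that spins forever.
# The event name, status values, and field names are assumptions.
from girder import events

JOB_STATUS_ERROR = 4        # assumed value for a failed job
INSTANCE_STATUS_ERROR = 2   # assumed value for a failed instance


def _on_job_update(event):
    job = event.info.get('job', event.info)
    if job.get('title') != 'Create Tale Data Volume':
        return
    if job.get('status') == JOB_STATUS_ERROR:
        # the job args in the log above carry the instanceId
        instance_id = (job.get('args') or [{}])[0].get('instanceId')
        # look up the instance model and persist the error status here
        print('instance %s should be marked as status %d'
              % (instance_id, INSTANCE_STATUS_ERROR))


# the event name is an assumption about which jobs-plugin event to bind to
events.bind('jobs.job.update.after', 'wholetale_sketch', _on_job_update)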

@craig-willis (Contributor) commented:

Per the 3/20 discussion with @bodom0015 and @Xarthisius, we're leaving this open until we resolve the underlying problem.

@bodom0015 closed this May 13, 2019
Successfully merging this pull request may close these issues.

Unable to close/destroy instance while it's spawning