
Allow user to delete an instance while it is still spawning #334

Closed
bodom0015 wants to merge 1 commit

Conversation

@bodom0015 (Member) commented Dec 13, 2018

Problem

Fixes #330

When a user spawns an instance, a job can (rarely) fail and leave the instance spinning indefinitely. In that case, it would be helpful to be able to delete the faulty instance.

Approach

We currently hide the Delete button for Tales in the LAUNCHING state, but the API does not prevent operating on these objects (a minimal sketch of this is shown after the test case below). If we always show the Delete button, the user can reset their own state without needing to contact support.

Test Case

  1. Check out and run this branch
  2. Open the WholeTale Dashboard and navigate to /browse
  3. Choose any Tale from the list on the left and click "Launch"
    • You should see your Instance appear in the "Launched Tales" panel on the right
    • You should see a spinner at the center of the new Instance in the list, indicating that the Instance is "Launching"
    • You should see that the Delete (X) icon appears at the top-right of the new Instance in the list at all times
  4. Before the Tale finishes spawning (if you see the spinner disappear, you were too late), click the Delete button
    • You should be prompted for confirmation via a modal window
  5. Confirm deletion of the Tale in the modal
    • You should see the Instance disappear from the "Launched Tales" panel
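For reference, the same deletion can be driven directly against the API, since (as noted above) the API never blocked these operations. A minimal sketch using girder_client; the API key and instance id are placeholders, not values from this thread:

# Minimal sketch: delete an instance via the Girder API regardless of its state.
# The API key and instance id below are placeholders.
from girder_client import GirderClient

gc = GirderClient(apiUrl='https://girder.dev.wholetale.org/api/v1')
gc.authenticate(apiKey='<your-api-key>')

instance_id = '<id-of-a-still-launching-instance>'
# The backend does not reject this while the instance is still LAUNCHING,
# which is exactly what this PR relies on.
gc.delete('/instance/%s' % instance_id)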

@ThomasThelen (Member) commented:

I was able to get the instance list into a weird state:

  1. Launched a Tale and let it fully launch
  2. Launched a second tale
  3. As the spinner was spinning I switched to the run page
  4. I deleted the second Tale
  5. After deleting it, there was a slot in the Tale list (like the one described in issue #213, "Deleted instances still show up in Launched Tales tab"), except that it was blank. There was a border for the Tale, and I could mouse over it to highlight it.
  6. I deleted the first Tale
  7. I returned to the browse page
  8. See the screenshots and console log below

I saw that there were two calls to instance, and I thought it was interesting that one returned nothing while the other returned []. There was also a call to job.

screen shot 2018-12-13 at 12 19 56 pm

screen shot 2018-12-13 at 12 20 22 pm

Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number
error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number error.js:40
Error: Ember Data Request DELETE https://girder.dev.wholetale.org/api/v1/instance/5c117b0e3fea9e0001a54999 returned a 400
Payload (Empty Content-Type)
[object Object] error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number
error.js:40
Error: Assertion Failed: `id` passed to `findRecord()` has to be non-empty string or number

screen shot 2018-12-13 at 12 22 22 pm

screen shot 2018-12-13 at 12 24 27 pm

screen shot 2018-12-13 at 12 25 29 pm

After writing all of that and refreshing the page, the dashboard looks like it's back in its normal state, so maybe it's a non-issue. I'm going to leave all of this here because it may be useful for issue #213.

@ThomasThelen (Member) commented:

While looking at the jobs in the issue above, I noticed some errors. In the image below, I've launched a Tale and deleted it before completion multiple times. In some cases I may have deleted the Tale before the Create Tale Data Volume job finished.
screen shot 2018-12-13 at 12 32 57 pm

Starting from the bottom and going up the list:
The first two Spawn Instance jobs don't have logs.
The third Spawn Instance job (5c12c0ff2b0fa10001d8d2b9), the one with the spinner, has the following log:

HttpError: HTTP error 400: GET https://girder.dev.wholetale.org/api/v1//instance/5c12c0f42b0fa10001d8d2b2
Response text: {"message": "Invalid instance id (5c12c0f42b0fa10001d8d2b2).", "type": "rest"}
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/girder_worker/task.py", line 148, in __call__
    results = super(Task, self).__call__(*_t_args, **_t_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/gwvolman/gwvolman/tasks.py", line 139, in launch_container
    self.girder_client, payload['instanceId'])
  File "/gwvolman/gwvolman/utils.py", line 86, in _get_user_and_instance
    instance = girder_client.get('/instance/' + instanceId)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 503, in get
    return self.sendRestRequest('GET', path, parameters, jsonResp=jsonResp)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 497, in sendRestRequest
    response=result)
(the same HttpError and traceback appear a second time in the log)

The Create Tale Data Volume task with the spinner (5c12c16a2b0fa10001d8d2c1) had the following log:

[2018-12-13 20:30:54,217] WARNING: Failed to post job: HTTP error 500: POST https://girder.dev.wholetale.org/api/v1/job?public=False&title=Spawn+Instance&type=celery&args=%5B%7B%22sessionId%22%3A+%225c12c16e2b0fa10001d8d2c4%22%2C+%22nodeId%22%3A+%22kfim4zvd540hhinoxh2ey4doi%22%2C+%22mountPoint%22%3A+%22%2Fvar%2Flib%2Fdocker%2Fvolumes%2F59a657b2615678000159c2bd_thelen_duJgWs%2F_data%22%2C+%22instanceId%22%3A+%225c12c16a2b0fa10001d8d2c0%22%2C+%22volumeName%22%3A+%2259a657b2615678000159c2bd_thelen_duJgWs%22%7D%5D&otherFields=%7B%22celeryParentTaskId%22%3A+%2271f7a9ba-41ae-46d9-a2b5-1ed1bbf14810%22%2C+%22celeryTaskId%22%3A+%223b720f0a-366f-430a-9506-1893d0eed566%22%7D&handler=celery_handler
Response text: {"message": "TypeError: TypeError(\"'NoneType' object does not support item assignment\",)", "trace": ["<FrameSummary file /girder/girder/api/rest.py, line 630 in endpointDecorator>", "<FrameSummary file /girder/girder/api/rest.py, line 1224 in POST>", "<FrameSummary file /girder/girder/api/rest.py, line 967 in handleRoute>", "<FrameSummary file /girder/girder/api/rest.py, line 445 in wrapped>", "<FrameSummary file /girder/girder/api/access.py, line 88 in wrapped>", "<FrameSummary file /girder/girder/api/describe.py, line 709 in wrapped>", "<FrameSummary file /girder/plugins/jobs/server/job_rest.py, line 105 in createJob>", "<FrameSummary file /girder/plugins/jobs/server/models/job.py, line 283 in createJob>", "<FrameSummary file /girder/plugins/jobs/server/models/job.py, line 303 in save>", "<FrameSummary file /girder/girder/models/model_base.py, line 530 in save>", "<FrameSummary file /girder/girder/events.py, line 314 in trigger>", "<FrameSummary file /girder/plugins/worker/server/__init__.py, line 205 in attachJobInfoSpec>", "<FrameSummary file /girder/plugins/jobs/server/models/job.py, line 442 in updateJob>", "<FrameSummary file /girder/girder/events.py, line 314 in trigger>", "<FrameSummary file /girder/plugins/wholetale/server/models/instance.py, line 196 in finalizeInstance>", "<FrameSummary file /girder/plugins/wholetale/server/models/instance.py, line 141 in updateInstance>"], "type": "internal"}

The Create Tale Data Volume job with the red X (5c12c1c02b0fa10001d8d2cc) had the following log:

HttpError: HTTP error 400: GET https://girder.dev.wholetale.org/api/v1/folder?offset=0&parentId=596444f3043dea00010d0c74&parentType=folder&limit=50
Response text: {"field": "id", "message": "No such folder: 596444f3043dea00010d0c74", "type": "validation"}
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/girder_worker/task.py", line 148, in __call__
    results = super(Task, self).__call__(*_t_args, **_t_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/gwvolman/gwvolman/tasks.py", line 96, in create_volume
    for folder in self.girder_client.listFolder(tale['folderId'])
  File "/gwvolman/gwvolman/tasks.py", line 95, in <listcomp>
    {'itemId': folder['_id'], 'mountPath': '/' + folder['name']}
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 578, in listResource
    records = self.get(path, params)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 503, in get
    return self.sendRestRequest('GET', path, parameters, jsonResp=jsonResp)
  File "/usr/local/lib/python3.5/dist-packages/girder_client/__init__.py", line 497, in sendRestRequest
    response=result)

@bodom0015 (Member, Author) commented:

Good find! I certainly had not seen that behavior, but I was indeed able to reproduce it. Something wonky is happening with the state here; I'll need to take a closer look tomorrow morning.

I can maybe answer some of the oddities above:

  • Looking at your screenshots, you didn't actually have multiple calls to /instance: you had an OPTIONS preflight call (a CORS mechanism that tells your client what the server allows; see the sketch below), followed by the actual GET request.
  • The job log stuff is a mystery to me. Here be dragons.
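To illustrate what that preflight looks like, here is a minimal sketch using the Python requests library; the Origin value is an assumption about where the dashboard is served from, not something taken from the screenshots:

# Minimal sketch: reproduce the OPTIONS preflight a browser sends before a
# cross-origin GET. The Origin below is an assumed dashboard host.
import requests

resp = requests.options(
    'https://girder.dev.wholetale.org/api/v1/instance',
    headers={
        'Origin': 'https://dashboard.dev.wholetale.org',
        'Access-Control-Request-Method': 'GET',
        'Access-Control-Request-Headers': 'girder-token',
    },
)
# The Access-Control-Allow-* response headers tell the browser whether the
# real GET (with the Girder-Token header) is permitted.
print(resp.status_code)
print(resp.headers.get('Access-Control-Allow-Methods'))
print(resp.headers.get('Access-Control-Allow-Headers'))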

@ThomasThelen (Member) commented Dec 14, 2018

Oops, you're right about the preflight call, so that stuff is probably perfectly normal.

I think it might be some sort of race condition. We spawn two async jobs for creating instances here. When we delete instances here, we spawn a new job to do the cleanup and add it to the Celery job queue.

I think that when destroyRecord is called on the instance, it ends up calling deleteInstance (linked above). At this point, it's possible to have three jobs running (Spawn, Create Volume, Shutdown), with Shutdown wiping things that Create Volume is still using (job 5c12c1c02b0fa10001d8d2cc failed because a folder no longer existed).

@Xarthisius may have a better idea of what's going on. If it is a race condition, we could fix it by letting the user know we'll delete the instance once it's done spawning, or we could make sure we kill any instance-creation jobs before we attempt to run the Shutdown Instance job (a rough sketch of the latter is below).
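A rough sketch of that second option, under two assumptions: that the Celery task ids for the creation jobs can be looked up (the job log above does show a celeryTaskId in the job's otherFields), and that the tasks tolerate being revoked, which, per the discussion further down, is not the case today:

# Rough illustrative sketch only -- not existing WholeTale code. Revoke any
# still-pending instance-creation tasks before queuing the shutdown job, so
# Shutdown cannot race against Create Tale Data Volume.
from celery import Celery

app = Celery(broker='redis://localhost:6379/0')  # broker URL is a placeholder


def delete_instance(creation_task_ids, shutdown_task_signature):
    """creation_task_ids: Celery ids of the Spawn Instance / Create Tale Data
    Volume tasks (assumed to be recorded on the job documents)."""
    for task_id in creation_task_ids:
        # terminate=True also stops a task that has already started running
        app.control.revoke(task_id, terminate=True)
    # Only queue the cleanup once the creation tasks are out of the way.
    return shutdown_task_signature.delay()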

Edit:
This may be related.

Side note: is that an extra slash in the first error? It may not matter.
https://girder.dev.wholetale.org/api/v1//instance/5c12c0f42b0fa10001d8d2b2
api/v1//instance

@bodom0015 (Member, Author) commented Dec 17, 2018

Interesting... I obviously didn't alter the spawning logic here, and I also hadn't noticed the extra slash. In any case, it seems innocuous, since the request still makes it to the correct endpoint:

$ curl https://girder.dev.wholetale.org/api/v1//instance/5c17d4ca2b0fa10001d8d395 -H "Girder-Token: <valid-token-scraped-from-browser-cookies>"
{"_id": "5c17d4ca2b0fa10001d8d395", "access": {"groups": [], "users": [{"flags": [], "id": "5bdb63506141bc0001cc1b55", "level": 2}]}, "created": "2018-12-17T16:54:34.880000+00:00", "creatorId": "5bdb63506141bc0001cc1b55", "iframe": true, "lastActivity": "2018-12-17T16:54:34.880000+00:00", "name": "LIGO Tutorial", "status": 0, "taleId": "59f0b91584b7920001b46f2e", "updated": "2018-12-17T16:54:42.947000+00:00"}

Regarding the Jobs, I still need to dig into the code, but I'm just realizing that I've actually never seen the Job statuses portion of our Girder dev instance, since I am apparently not yet an admin there... would @Xarthisius or some other kind soul be willing to add me? 😅

Is the "Jobs" view in Girder one that I should be visiting and/or validating regularly? Or is this just used as supplemental information when debugging?

EDIT: A strange (and probably unrelated) thing I just noticed: Girder seems to pick up the token from headers, but not from cookies? I thought both were accepted, but apparently not:

$ curl https://girder.dev.wholetale.org/api/v1//instance/5c17d4ca2b0fa10001d8d395 --cookie "girderToken: <valid-token-scraped-from-browser-cookies>"
{"message": "You must be logged in.", "type": "access"}

@Xarthisius (Contributor) commented Dec 17, 2018

Regarding the Jobs, I still need to dig into the code but I am just realizing that I've actually never seen the Job statuses portion on our Girder dev instance, since I am apparently not yet an admin there... would @Xarthisius or some other kind soul be willing to add me?

Is the "Jobs" view in Girder one that I should be visiting and/or validating regularly? Or is this just used as supplemental information when debugging?

I just made you a site admin. However, the job view doesn't require elevated privileges:

https://girder.dev.wholetale.org/#jobs/user/5bdb63506141bc0001cc1b55

^^ should show you your jobs. If anything goes wrong, it will probably show up there.
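The same listing is also available over REST; a minimal sketch with girder_client, assuming the standard jobs plugin listing endpoint (GET /job) and using the user id from the URL above (the API key is a placeholder):

# Minimal sketch: list a user's jobs via the API instead of the web UI.
# Assumes the jobs plugin's standard GET /job listing endpoint.
from girder_client import GirderClient

gc = GirderClient(apiUrl='https://girder.dev.wholetale.org/api/v1')
gc.authenticate(apiKey='<your-api-key>')

jobs = gc.get('/job', parameters={
    'userId': '5bdb63506141bc0001cc1b55',
    'limit': 25,
})
for job in jobs:
    print(job['_id'], job.get('title'), job.get('status'))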

Answering @ThomasThelen's question: yeah, currently it's impossible to cleanly shut down the tasks related to instance spawning (they depend on each other, are not wrapped as cancelable tasks, etc.). We'll need to work on handling this more gracefully on the backend side.

@bodom0015 (Member, Author) commented Dec 17, 2018

I just made you the site admin. However, the job view doesn't require elevated privileges

On other instances, the Jobs view does require admin access:

Failed to load resource: the server responded with a status of 403 ()
girder_lib.min.js:1 403 error {"message": "Administrator access required.", "type": "access"}
error @ girder_lib.min.js:1
api/v1/job/typeandstatus/all:1 Failed to load resource: the server responded with a status of 403 ()
girder_lib.min.js:1 403 error {"message": "Administrator access required.", "type": "access"}

But maybe WholeTale is configured differently? To get to this view, I need to go to the Admin Console > Jobs... Is there another way to get there, besides knowing to alter the address bar?

EDIT: Scratch that, I just found the "My jobs" link on the account dropdown at the top-right... I can't believe that I have literally never seen this before :o

@ThomasThelen (Member) left a review comment:

LGTM!

@craig-willis (Contributor) commented:

Per the 1/17 discussion with @Xarthisius and @bodom0015, the main issue here is that we are missing an event handler for volume creation on the backend, so the error status is not set properly. If the error status were propagated correctly, this PR would be unnecessary. Sorry, Mike.
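For context, the backend fix being described would look roughly like the sketch below. This is purely illustrative: the event name, status constants, and field names are assumptions, not the actual WholeTale plugin code.

# Purely illustrative sketch of the missing piece: watch for failed volume
# creation jobs and mark the owning instance as errored, so the dashboard sees
# a failed (and deletable) instance instead of one that spins forever.
# The event name, status values, and field names are assumptions.
from girder import events

JOB_STATUS_ERROR = 4        # assumed value for a failed job
INSTANCE_STATUS_ERROR = 2   # assumed value for a failed instance


def _on_job_update(event):
    job = event.info.get('job', event.info)
    if job.get('title') != 'Create Tale Data Volume':
        return
    if job.get('status') == JOB_STATUS_ERROR:
        # the job args in the log above carry the instanceId
        instance_id = (job.get('args') or [{}])[0].get('instanceId')
        # look up the instance model and persist the error status here
        print('instance %s should be marked as status %d'
              % (instance_id, INSTANCE_STATUS_ERROR))


# the event name is an assumption about which jobs-plugin event to bind to
events.bind('jobs.job.update.after', 'wholetale_sketch', _on_job_update)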

@craig-willis (Contributor) commented:

Per the 3/20 discussion with @bodom0015 and @Xarthisius, we're leaving this open until we resolve the underlying problem.

@bodom0015 closed this May 13, 2019
Successfully merging this pull request may close these issues.

Unable to close/destroy instance while it's spawning