Fatal error during training #89

Closed
cooperlab opened this issue Feb 13, 2024 · 9 comments

@cooperlab

The system crashes when retraining after labeling several chips in the active learning task. Girder is trying to os.unlink something that does not exist.

[2024-02-07 21:52:01,937] ERROR: Failed to delete file /assetstore/4f/23/4f23e7e90af0798b8f38c7afa294c722320357446bf80409eb040371f450f6f998bc8ec87de6ab1679429e4800aba1a2d0b70477cd4694526f1508c6d8a23459
Traceback (most recent call last):
  File "/opt/girder/girder/utility/filesystem_assetstore_adapter.py", line 306, in deleteFile
    os.unlink(path)
FileNotFoundError: [Errno 2] No such file or directory: '/assetstore/4f/23/4f23e7e90af0798b8f38c7afa294c722320357446bf80409eb040371f450f6f998bc8ec87de6ab1679429e4800aba1a2d0b70477cd4694526f1508c6d8a23459'
Additional info:
  Request URL: PUT http://127.0.0.1:8080/api/v1/folder/65c2f5d5201f96fa9ca19598/yaml_config/.histomicsui_config.yaml
  Query string:
  Remote IP: 172.25.0.1
  Request UID: 6f42862a-8b20-4f46-bc2b-c84ca093cfc6

[2024-02-07 21:59:20,409] ERROR: Failed to delete file /assetstore/94/98/9498c946becfd894633f47c3eccb0e61c7ba216d4576b485f57f682747c22ba34e3e2a6e0a152037a4a9738c42292b9978a013b062daf6ba32a5f1b072dd3ccb
Traceback (most recent call last):
  File "/opt/girder/girder/utility/filesystem_assetstore_adapter.py", line 306, in deleteFile
    os.unlink(path)
FileNotFoundError: [Errno 2] No such file or directory: '/assetstore/94/98/9498c946becfd894633f47c3eccb0e61c7ba216d4576b485f57f682747c22ba34e3e2a6e0a152037a4a9738c42292b9978a013b062daf6ba32a5f1b072dd3ccb'
Additional info:
  Request URL: PUT http://127.0.0.1:8080/api/v1/folder/65c2f5d5201f96fa9ca19598/yaml_config/.histomicsui_config.yaml
  Query string:
  Remote IP: 172.25.0.1
  Request UID: 752968ce-1106-4f6f-8ccf-49a5cc90eae3
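
For readers unfamiliar with this failure mode: the traceback above shows `os.unlink` raising `FileNotFoundError` because the path is already gone. A minimal sketch of a delete that tolerates a missing file follows. The helper name `delete_if_present` is hypothetical and this is not Girder's code; it only illustrates the exception being raised and one way a caller might treat it as a no-op.

```python
import os
import tempfile
from pathlib import Path


def delete_if_present(path):
    """Delete `path`; return True if removed, False if it was already gone.

    Hypothetical helper for illustration only, not part of Girder.
    """
    try:
        os.unlink(path)
    except FileNotFoundError:
        # Same exception type as in the traceback; treated here as "nothing to do".
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "blob"
    target.write_bytes(b"config")
    first = delete_if_present(target)   # file exists, so it is removed
    second = delete_if_present(target)  # file is already gone
    print(first, second)
```

The second call is the situation in the log: a delete request for a path that no longer exists.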

@bnmajor
Collaborator

bnmajor commented Feb 13, 2024

It looks like the config file update is failing because the file doesn't exist for some reason... I will try to reproduce this locally and push a fix! In the meantime, can you clarify a few things for me?

  1. What state was this starting from after the latest changes were pulled? Was this newly created? Started up from the setup (labeling) step? Or started up from the predictions (chip labeling) step?
  2. Is there an item named .histomicsui_config.yaml at the top level of the project? And if so:
    1. Is there only one file inside the item?
    2. Was it edited by hand at all?

@cooperlab
Author

This is a fresh deployment with d5f41b7.

The workflow to trigger this was retraining after initial labeling -> train -> active learning.

We have two identical files under .histomicsui_config.yaml, neither generated by hand.

@bnmajor
Collaborator

bnmajor commented Feb 13, 2024

> We have two identical files under .histomicsui_config.yaml, neither generated by hand.

I haven't been able to reproduce the error yet so this is very helpful, thank you!

Can you please try to open both and make sure they are both valid yaml files? In the meantime you should be able to safely delete both files and this should clear up the error (the config will be recreated when the UI is accessed again).

@cooperlab
Author

It's a valid yaml. Both files have identical sha512.

Thanks for the quick workaround. Could this be a permissions problem?
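
For anyone wanting to repeat the duplicate check, the sha512 comparison mentioned above can be sketched with the standard library alone. The file names and the helpers `sha512_of` and `all_identical` are hypothetical; the sketch assumes local copies of the two config files have been downloaded, and it does not validate YAML syntax (that would need a YAML parser).

```python
import hashlib
import tempfile
from pathlib import Path


def sha512_of(path, chunk_size=65536):
    """Return the hex sha512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths):
    """True when every file in `paths` has the same sha512 digest."""
    return len({sha512_of(p) for p in paths}) <= 1


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the two downloaded copies plus one edited copy.
    a = Path(tmp) / "copy_a.yaml"
    b = Path(tmp) / "copy_b.yaml"
    c = Path(tmp) / "edited.yaml"
    a.write_text("key: value\n")
    b.write_text("key: value\n")
    c.write_text("key: other\n")
    same = all_identical([a, b])
    different = all_identical([a, c])
    digest_len = len(sha512_of(a))
```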

@bnmajor
Collaborator

bnmajor commented Feb 14, 2024

If it were a permissions issue I would expect an "access denied" error or something along those lines... The PUT request should create or reuse the .histomicsui_config.yaml item, then create the file from the yaml input, and delete the existing file if there was one. My assumption is that the delete step is where things went wrong (based on the error when trying to delete and the fact that two files were left in the item).

Have you seen this with any other projects?

@bnmajor
Collaborator

bnmajor commented Feb 14, 2024

@manthey I cannot reproduce this behavior. Do you have any ideas on how we may have ended up in this state? It seems like there was an attempt to remove the old file, but for some reason it didn't exist in the assetstore, and we ended up with two copies of the config and a broken project...

@manthey
Contributor

manthey commented Feb 15, 2024

I have a theory about what is happening: if two requests to write a config file happen in a short enough time span, we could end up with two yaml files in the same girder item. The solution is to add a guard to prevent this (since Mongo doesn't have cross-collection transactions) or to fix it after the fact. There will be a PR in large_image to address this condition.
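
To make the suspected interleaving concrete, here is a minimal in-memory sketch. It is an assumption-laden model, not Girder or large_image code: `Item`, `snapshot`, `commit`, `guarded_write`, and `keep_newest` are hypothetical names, files are plain strings, and the guard is an in-process `threading.Lock`, which would not by itself protect a deployment with several server processes. Two writers that both see the same old file reproduce both symptoms from this issue: a failed delete and two files left in the item.

```python
import threading


class Item:
    """Hypothetical stand-in for an item holding config files."""

    def __init__(self, files=()):
        self.files = list(files)
        self.lock = threading.Lock()


def snapshot(item):
    """Step 1 of a write: note which files currently exist."""
    return list(item.files)


def commit(item, old_files, new_file):
    """Step 2: add the new file, then delete the files seen in step 1.

    Returns the old files that could not be deleted because they were gone.
    """
    missing = []
    item.files.append(new_file)
    for old in old_files:
        if old in item.files:
            item.files.remove(old)
        else:
            missing.append(old)  # analogous to the FileNotFoundError in the log
    return missing


def guarded_write(item, new_file):
    """Guard: make snapshot and commit one atomic step per item."""
    with item.lock:
        return commit(item, snapshot(item), new_file)


def keep_newest(item):
    """Cleanup after the fact: drop all but the most recently added file."""
    del item.files[:-1]


# Unguarded: both requests take their snapshot before either commits.
racy = Item(["v0"])
seen_a, seen_b = snapshot(racy), snapshot(racy)
missing_a = commit(racy, seen_a, "v1")
missing_b = commit(racy, seen_b, "v2")
racy_files_before_cleanup = list(racy.files)
keep_newest(racy)

# Guarded: concurrent writers always leave exactly one file.
safe = Item(["v0"])
threads = [threading.Thread(target=guarded_write, args=(safe, f"v{i}")) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the unguarded run the second commit cannot find `v0` and the item ends up holding both `v1` and `v2`; `keep_newest` corresponds to the "fix it after the fact" option, and `guarded_write` to the "add a guard" option.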

@cooperlab
Author

@bnmajor provided an easy workaround for the time being. Not urgent to address.

@manthey
Contributor

manthey commented Feb 15, 2024

I think this will have been eliminated via girder/large_image#1467. If you have the latest DSA containers and you see it again, please reopen the issue.

manthey closed this as completed Feb 15, 2024