This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Folder and disk allocation issues around priming #199

Open
GrubLord opened this issue Jan 13, 2016 · 5 comments

Comments

@GrubLord

Hi there,

We've had a great deal of trouble with GMS not re-assessing available folders and free space in the process of starting new runs. The issues we've discovered are twofold:

a) When the amount of free space available on a disk changes, GMS does not update its information.

We have noticed that GMS will doggedly abide by the "remaining space" measurement stored in its "Volume" object, as set when the machine was primed, even if the actual space on the device is greater than this.

In order to run anything, we have had to manually edit the _get_allocation_without_lock_impl() method in Creator.pm so that it calls the Volume object's sync_total_kb() method inside the for-loop, before checking for free space on candidate volumes.
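To illustrate the shape of that change, here is a minimal sketch in Python rather than the Perl of Creator.pm. The `Volume` class, its fields, and `pick_volume` are hypothetical stand-ins, not the GMS API; only the name `sync_total_kb` is borrowed from the real code.

```python
import shutil
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical stand-in for the GMS Volume object (not the real API)."""
    mount_path: str
    total_kb: int          # value cached when the machine was primed
    allocated_kb: int = 0  # sum of allocations recorded against this volume

    def sync_total_kb(self) -> None:
        # Re-read the device size from the OS instead of trusting the cache.
        self.total_kb = shutil.disk_usage(self.mount_path).total // 1024

    def unallocated_kb(self) -> int:
        return self.total_kb - self.allocated_kb


def pick_volume(candidates, requested_kb):
    """Return the first candidate with enough room, refreshing each size first."""
    for volume in candidates:
        volume.sync_total_kb()  # the extra call made inside the loop
        if volume.unallocated_kb() >= requested_kb:
            return volume
    return None
```

Without the `sync_total_kb()` call in the loop, a volume whose cached size is stale would be rejected even though the device has room.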

b) When the /tmp folder is replaced with a symlink, GMS begins to write to the home directory instead.

We swapped the /tmp folder with a symlink to /opt/tmp-store, because our tmp disk seemed insufficient for the large amount of scratch space GMS appears to require. However, doing this caused it to switch its tmp directory to the user's home directory, causing some poor performance and root-drive filling issues.

We reverted our change, and the /tmp directory is back to normal, but the issue with /home/user being used in place of /tmp persists, and appears to be a configuration that cannot be changed by the end-user.

The temp directory appears to be set when the system is primed, and re-priming or editing this configuration seems to be impossible. Since it has changed on us anyway, we would like to know how to get into the database or other config location to revert the change. We've checked /etc/genome.conf, and the path to the temp directory is not stored there. We would appreciate a solution.

Regards,

-- Liviu & Shu at Garvan

@sakoht
Contributor

sakoht commented Jan 14, 2016

We should really update the docs for this. Sorry you are stuck. This is an unfortunate side-effect of us trying to make a standalone deploy simple.

For a production install, you really need to attach volumes that are exclusively used for disk allocations. Putting data on the system disk has some issues as you have found.

This ensures that:

  1. If your system disk grows, you won't be wrong about available space.
  2. If you overflow a data disk, you won't crash your OS.

The "free space" on a volume is calculated as its real size minus the size of all allocations. Many of those allocations may be larger than the consumed space, since pipeline steps allocate before adding data. So simply looking at space consumed doesn't work.
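As a rough illustration of that arithmetic (a sketch in Python, not GMS code; the `Allocation` class and both function names are made up for this example), compare free space computed from reserved sizes with free space computed from what has been written so far:

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    reserved_kb: int  # size requested up front by a pipeline step
    consumed_kb: int  # KB actually written so far


def free_by_allocation(total_kb, allocations):
    # Real volume size minus the size of all allocations.
    return total_kb - sum(a.reserved_kb for a in allocations)


def free_by_consumption(total_kb, allocations):
    # What a plain "disk used" measurement would suggest is free.
    return total_kb - sum(a.consumed_kb for a in allocations)


# One step has reserved 400 KB but written only 50 KB so far.
allocations = [Allocation(400, 50), Allocation(300, 300)]
print(free_by_allocation(1000, allocations))   # 300
print(free_by_consumption(1000, allocations))  # 650
```

The second number overstates what is safe to hand out, because the first step is still entitled to write another 350 KB.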

It would seem possible to allow non-allocated data on the disk and to track it dynamically. In practice, measuring disk consumption by directory is slow; it would slow the system considerably, would still be prone to race conditions, and would still expose you to the two situations above.

If you have just one large expensive disk with the software/OS, and really want to use it for your data too, we recommend making two partitions. This lets the OS enforce a hard boundary.

If you need a fast fix, or really want to track non-GMS space dynamically in some way instead of partitioning:

  1. estimate the amount of the disk consumed by non-GMS data
  2. create a single allocation of this size, and leave it empty

This last workaround is a band-aid, though, and not recommended long-term, but it will allow you to control things until you can re-partition.
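One way to arrive at the size for that placeholder allocation, sketched in Python under stated assumptions: the function name is hypothetical, and `gms_consumed_kb` (the space already occupied by GMS allocations on the volume) has to come from your own bookkeeping. This is only the estimate; creating the allocation itself is still done through GMS.

```python
import shutil


def placeholder_allocation_kb(mount_path, gms_consumed_kb):
    """Estimate the KB on the volume that are used by non-GMS data.

    gms_consumed_kb is an assumption supplied by the caller: the space
    already occupied by GMS allocations on this volume.
    """
    used_kb = shutil.disk_usage(mount_path).used // 1024
    # Never return a negative size if the caller's figure is too high.
    return max(0, used_kb - gms_consumed_kb)
```

Rounding the result up leaves some headroom for non-GMS data that grows after the estimate is taken.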

@gatoravi
Contributor

Pasting from #162 to keep this in the same thread

"""
Thanks kindly, Avi... this issue is rather odd to be sure. Generally speaking we've found GMS to behave rather unexpectedly when it comes to handling storage... and this was an outgrowth of that - perhaps caused by my trying to do a fresh GMS setup over the drives of our old instance.

While I have done as fresh an install as possible, with new drives, and managed to circumvent this - it's the other issues we have reported recently that have been real show-stoppers - particularly the fact that without us making direct changes to the "Creator.pm" code to invoke Volume.pm's "sync_total_kb()" method, our disk is considered to run out of space long before the drive actually fills up...

It was this issue that caused us to have to rebuild our GMS setup altogether, along with issues around re-priming, and an unfortunate problem where GMS started writing its temporary files to "/home/ubuntu" rather than /tmp (due to an accessibility problem caused by replacing /tmp with a symlink, which we then reverted). Despite re-creating the original setup as best we could, and various attempts using environment variables and poking around in the code, we were unable to get GMS to write to /tmp any more, and every job would fail due to lack of space in /home/ubuntu.

While I think we can consider this issue closed, I would very much appreciate your assistance with these other two - particularly as your colleague appears to have misunderstood our setup (we've been using an 80 GB root drive, and two additional drives of 4 TB and 7 TB, mounted at /tmp and /opt respectively, which we understand to be your recommended setup).

Also, the replacement of /tmp with a symlink was done because we would keep getting disk space errors, and believed that perhaps consolidating to a single larger /opt disk and symlinking /tmp onto this drive would ensure that we never run out of temporary space due to a mismatch in size between the two volumes. While this was perhaps ill-advised, due to how GMS handles storage, it did bring to light the rather show-stopping issues that occur when /tmp becomes unavailable.

Would very much appreciate your assistance with the /tmp and Volume.pm issues any time you can manage.
"""

@gatoravi
Contributor

We do have an instance of the SGMS here where /tmp is a symlink. Specifically, it looks like this when I run ls -lhd:

ls -lhd /tmp
lrwxrwxrwx 1 root root 12 May  2  2015 /tmp -> /opt/gms/tmp

ls -lhd /opt/gms/tmp
drwxrwxrwx 22 root root 20K Jan 15 14:48 /opt/gms/tmp

This seems to work fine as far as I can tell. When I look at workflow-server.err in a build directory, for example, it says:

2016-01-15 14:25:10-0600 localhost: 2016/01/15 14:25:10 passthru Announcing we are at localhost:40687 to /tmp/kJlvZogGt8/hub_location

which indicates that /tmp is being used.

Could you check and let us know what the permissions for your symlink and the destination folder under /opt/ are?
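If it is easier to script than to eyeball, a small Python sketch along these lines would report the same information as ls -lhd for both the link and its destination. The helper names `describe` and `report` are made up for this example, and it assumes a POSIX system.

```python
import os
import stat


def describe(path):
    """Return an ls-style mode string for path without following symlinks."""
    info = os.lstat(path)  # lstat inspects the link itself, not its target
    mode = stat.filemode(info.st_mode)
    if stat.S_ISLNK(info.st_mode):
        return f"{mode} {path} -> {os.readlink(path)}"
    return f"{mode} {path}"


def report(path):
    """Describe path and, if it is a symlink, the directory it resolves to."""
    lines = [describe(path)]
    if os.path.islink(path):
        target = os.path.realpath(path)
        lines.append(describe(target))
        writable = os.access(target, os.W_OK | os.X_OK)
        lines.append(f"writable by current user: {writable}")
    return lines
```

Calling `print("\n".join(report("/tmp")))` as the user that runs the builds shows whether that user can actually create files in the destination.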

@GrubLord
Author

Not certain what they started as. During troubleshooting, I did a chmod a+rwx on the symlink - but perhaps it was too late. Once it had started writing to /home/ubuntu, it never stopped doing so, even when we restored the original /tmp partition (with full permissions for the ubuntu user, naturally). That's why we had to reinstall the whole machine: seemed to be no way to get it to use /tmp again.

The destination folder under /opt was /opt/tmp-store, and similarly world-writeable.

-- Liviu


@gatoravi
Contributor

I wonder if something like TMPDIR=/opt/gms/tmp/ genome model build start would work? The code that determines the temp file paths is here, I believe, but I haven't tried this solution.
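For launching from a script, the same idea can be sketched in Python. This is untested against GMS, and whether GMS honors TMPDIR is exactly the open question; `run_with_tmpdir` is a hypothetical helper, not part of GMS.

```python
import os
import subprocess
import sys


def run_with_tmpdir(command, tmpdir):
    """Run command with TMPDIR pointing at tmpdir, after a sanity check."""
    if not (os.path.isdir(tmpdir) and os.access(tmpdir, os.W_OK | os.X_OK)):
        raise ValueError(f"{tmpdir} is not a writable directory")
    # Copy the current environment and override only TMPDIR.
    env = dict(os.environ, TMPDIR=tmpdir)
    return subprocess.run(
        command, env=env, check=True, capture_output=True, text=True
    )


if __name__ == "__main__":
    # Harmless demonstration: the child process just echoes its TMPDIR.
    result = run_with_tmpdir(
        [sys.executable, "-c", "import os; print(os.environ['TMPDIR'])"],
        os.getcwd(),
    )
    print(result.stdout.strip())
```

With GMS, the command list would be `["genome", "model", "build", "start"]` plus whatever arguments you normally pass, and the directory would be something like /opt/gms/tmp. Checking the directory up front avoids a silent fallback to some other location if the path is missing or not writable.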
