Request: Cache whole file #10
Open

christianreiss opened this issue May 12, 2017 · 5 comments

Comments

@christianreiss

Hey!

Great job! :) In the age of cloud drives this will be very handy (and already is). Would it be possible to add a switch to make backfs cache entire files (regardless of size) upon first read?

Other people will surely have this use case too, if they don't already.
Cheers,
-Chris.

@wfraser
Owner

wfraser commented May 24, 2017

Thanks!

Unfortunately, adding such functionality would be hard to do correctly. Currently, each read blocks the caller while it fetches the block. Because the block size is typically quite small, this delay isn't very noticeable. But if we block the first read on fetching the entire file, which might be very large, then the stall would be very noticeable.

Of course, the proper fix would be to fetch the data asynchronously and complete the first read once the first block has been fetched, but this is complex to implement.
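In the meantime, you can approximate that from userspace by kicking off a background read of the whole file as soon as you open it; the background read warms blocks ahead of whatever you're doing in the foreground. A rough example (the path is just a placeholder):

  cat /mnt/backfs/path/to/file > /dev/null &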

I'm curious what your use case is. When I've needed to ensure entire files are loaded into the cache, I just do something like cat file > /dev/null beforehand. Would that not be sufficient for your use case?
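For a whole directory tree, the same idea works (the mount point here is hypothetical):

  find /mnt/backfs -type f -exec cat {} + > /dev/null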

@christianreiss
Author

Hey,

thanks for replying; I'm just about to head out, so I must be brief.
I am using Google Drive and Amazon Cloud Drive, which I mount on my home file server. I then use backfs to accelerate those mounts, which I then share out. So one use case is "remote" ISO files that need to be pulled over; that one I could work around by simply copying the file to my local PC.

I also have a set-top box that records TV locally and pushes all recordings into the cloud during the night. The STB has the cloud mounted via said server over CIFS. This setup would greatly benefit if read-ahead or cache-whole-file were implemented. The read-ahead would need to take effect upon the first access of the file. Read-ahead would be preferable, as caching the whole file would really be bad: you would open/stream a video file, backfs would start caching the whole (10 GB) file, and after 2-3 minutes you'd notice you have already seen this show, quit, open the next one, and another full cache would start...

The current state of backfs would not help at all with this setup:

  • I would need read-ahead of some sort
  • Acceleration must occur upon first access
  • Re-reading the same file will essentially never happen
  • Due to the immense size of the folder, it would be awesome if the directory hierarchy were cached locally, too.

Thanks for reading and your consideration!
-Chris.

@wfraser
Owner

wfraser commented May 24, 2017

As a quick workaround, you could also experiment with increasing the block size. The default is 128 KiB (0x20000 bytes); you could bump this up to several megabytes and effectively get very large read-ahead, though it may still stall periodically as I mentioned, depending on how userspace does its reads.

You can specify -o block_size=$((10*1024*1024)) when mounting backfs to get a 10 MiB block size, for example.

(note that you will have to delete your cache every time you change the block size)
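For reference, a full mount command would then look something along these lines (paths are placeholders, and option names other than block_size are from memory, so double-check against the README):

  backfs -o cache=/var/cache/backfs,block_size=$((10*1024*1024)) /path/to/clouddrive /mnt/backfs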

@christianreiss
Author

Hey,

Will try that when I get home. Would this affect first access as well, or 'only' subsequent accesses to the files?

Cheers,
-Chris

@wfraser
Owner

wfraser commented May 25, 2017

It'll mean the first access of each block will take much longer.

Say you set the block size to 10M. The kernel still issues individual read calls with pretty small buffers -- usually on the order of a few kbytes. Imagine a program reading a file sequentially from the start to the end. It'll go like this:

  1. first read, 4K at offset 0. This causes the first 10M of the file to be fetched, and the call blocks for a while until this is done.
  2. subsequent reads of 4K from offsets 4K to (10M - 4K): these are cache hits and complete basically immediately.
  3. next read of 4K at offset 10M. This causes the next 10M to be fetched, and blocks for a while.
    ... etc.

Of course this is just the first time you read that file. Afterwards, it'll be in the cache, and all these calls will be cache hits and it'll be read very quickly.
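If you want to observe that pattern, here's a rough way to do it from the shell with dd (paths are placeholders, and kernel readahead may blur the boundaries a bit):

  # First 4K at offset 0: fetches block 0 (10M), so this one stalls
  time dd if=/mnt/backfs/big.iso of=/dev/null bs=4k count=1
  # 4K at offset 400K: still inside block 0, cache hit, returns right away
  time dd if=/mnt/backfs/big.iso of=/dev/null bs=4k count=1 skip=100
  # 4K at offset 10M: fetches block 1, stalls again
  time dd if=/mnt/backfs/big.iso of=/dev/null bs=4k count=1 skip=2560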
