This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Filesystem Package Managers and rsync #81

Open
aschmahmann opened this issue Jul 23, 2019 · 12 comments

@aschmahmann commented Jul 23, 2019

We seem to be running into issues when using rsync on top of IPFS to load filesystem package manager registries into IPFS. These issues largely stem from IPFS simply being a different application than the OS's file system. There are a number of potential avenues to explore here, each of which could solve this problem.

  1. Just as we have ipget (https://github.com/ipfs/ipget) as an IPFS-aware wget, we could have an ipsync.
  2. We could create/utilize a filesystem layer to emulate the filesystem so that native tools like rsync work better:
    • FUSE: Because rsync relies on filesystem metadata, that metadata needs to be available either as FUSE-internal metadata or within the IPLD objects describing the directories + files that FUSE is "rendering".
    • Filestore: This is similar to FUSE with internal metadata, since we can just leave the file metadata as it is. Currently we are not able to simply replace files stored in the Filestore and have everything work. That seems reasonable given that IPFS is an application and the OS's default filesystem is a separate application that is totally unaware of IPFS. However, we could add some basic tooling to enable updating data stored via the Filestore API if we wanted to.
      • Strawman: ipfs pin add --recursive --best-effort --force <path>, which would change the folder from being strongly pinned (data is definitely kept) to weakly pinned (it's probably there, but no promises). Then run rsync, then ipfs pin add --recursive <path> to strongly re-pin the data (sketched just below this list).
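
A minimal shell sketch of that strawman workflow; the --best-effort and --force flags on ipfs pin add are hypothetical (they don't exist today), and the rsync source and <path> are placeholders:

ipfs pin add --recursive --best-effort --force <path>   # hypothetical: weaken the pin so the underlying data may change
rsync -a <upstream-registry>/ <path>/                   # update the registry contents with plain rsync
ipfs pin add --recursive <path>                         # strongly re-pin the updated tree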

Notably, each of these approaches basically amounts to writing an IPFS application that properly handles rsync, whether we do it explicitly (ipsync, or a shell script) or implicitly (FUSE). When deciding a path forward we'll need to take into account performance, DX for package managers, and reusability beyond package managers.

Note: See #21 for more info

@andrew (Collaborator) commented Jul 23, 2019

Related to #74 and #71

@meiqimichelle (Contributor)

Notes from sprint planning: @djdv to write a quick note here to relate this to some of his work, and @andrew may pull this in to a summary doc he's started in #78.

@djdv commented Jul 23, 2019

Notably, each of these approaches basically amounts to writing an IPFS application that properly handles rsync, whether we do it explicitly (ipsync, or a shell script) or implicitly (FUSE). When deciding a path forward we'll need to take into account performance, DX for package managers, and reusability beyond package managers.

👍

The work being done around the overarching FS API effort (#71) should likely tie into this.

It seems likely that there will be a lot of overlap between a purpose-made syncing tool and a performant FS API.
Ideally, work done around FS APIs would allow the core of these components to be shared.
In the tool scenario you can imagine FS-like manipulations being done through the same API we'd use to implement mount.

Semi-related: #74 (comment)

@djdv commented Aug 7, 2019

During our meeting we talked about the idea of taking existing benchmarking tools that are meant for traditional filesystems and pointing them at our mount implementation, as well as the idea of building harnesses that simply act as 9P clients attaching to our FS server (on the daemon).

This would also give us an idea of the overhead associated with using the FS protocol, compared against benchmarks that use the underlying APIs directly.

The nice thing about this is that it should come naturally as the implementation progresses. As it becomes more technically correct (spec compliant), we can simply target it with existing testing software.
Previously, I've used (a fork of) FSX, as well as other tools, to help debug the past FUSE implementations.
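
A rough sketch of what that could look like once the daemon exposes /ipfs over FUSE; the CID is a placeholder and this isn't a tested recipe:

ipfs daemon &                                  # requires FUSE to be set up on the host
ipfs mount                                     # mounts /ipfs and /ipns
dd if=/ipfs/<pinned-cid> of=/dev/null bs=1M    # crude sequential-read throughput check
# fsx-/fio-style POSIX test suites can then target the same mountpoint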

@kevincox

I think that the performance problems are simple enough to find. For example, just reading a locally cached file is incredibly slow and the ipfs daemon burns 500% CPU.

% pv /ipfs/QmaUXCVgQgC9b6TPJ1ZAZqWXsoW7vaUwR7f3tadFBx839R >/dev/null
 175MiB 0:02:38 [1.63MiB/s] [=======>                          ] 26% ETA 0:07:24
% ipfs version
ipfs version 0.4.23

@andrew (Collaborator) commented Mar 13, 2020

@kevincox I believe there have been a number of performance improvements made recently, although they have not yet been released as 0.5.0; it might be worth testing against master of go-ipfs as well.
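
(A rough sketch of trying master, assuming the standard go-ipfs Makefile targets:)

git clone https://github.com/ipfs/go-ipfs
cd go-ipfs
make build                 # builds ./cmd/ipfs/ipfs
./cmd/ipfs/ipfs version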

@kevincox

I tried again on the latest git, and while the results were better they weren't great. It was now using <400% CPU and transferring a bit faster. I think the gap here is more than "performance improvement" range; someone needs to take a look at the architecture and make fundamental changes.

% ipfs version
ipfs version 0.5.0-dev
╭(10:57:58)─(0)─(~)
╰% pv /ipfs/QmaUXCVgQgC9b6TPJ1ZAZqWXsoW7vaUwR7f3tadFBx839R >/dev/null
39.9MiB 0:00:24 [1.82MiB/s] [>                                 ]  5% ETA 0:06:19

@aschmahmann (Author)

@kevincox do you mind giving a little more information on what you are doing?

  • How big is the file you've added and are reading?
  • Are you using FUSE to read the data? If so, have you tried using the IPFS CLI instead (e.g. ipfs get or ipfs cat)?
  • Are you using an HDD or SSD?
  • Have you tried using the Badger datastore? (One way to enable it is sketched below.)
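
(For reference, a minimal sketch of trying Badger with a fresh repo; converting an existing repo needs ipfs-ds-convert and isn't shown here:)

ipfs init --profile=badgerds   # new repo backed by the Badger datastore
ipfs daemon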

@kevincox

  • The file is large. This example was 670MiB.
  • Yes, I am talking about FUSE. ipfs cat is acceptably fast (>100MiB/s); see the measurement sketch after this list.
  • SSD
  • I have no idea what that is. I can try looking into it.
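
(For comparison, the CLI read path for the same CID can be timed the same way the FUSE path was above; just a sketch:)

ipfs cat QmaUXCVgQgC9b6TPJ1ZAZqWXsoW7vaUwR7f3tadFBx839R | pv > /dev/null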

@kevincox

One other note: I tried with the default fixed-size chunking and with rabin, and the speed was roughly the same in either case (the add commands are sketched below).
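
(A sketch of the two add-time chunkers being compared; the file name is a placeholder:)

ipfs add --chunker=size-262144 <file>   # default fixed-size chunker
ipfs add --chunker=rabin <file>         # content-defined (Rabin) chunking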

@aschmahmann (Author)

@kevincox the FUSE implementation definitely needs some reworking. A contributor was working on an upgrade in #74, but unfortunately got pulled away before it could be completed.

If you're interested in helping out, take a look; I'm sure the help would be appreciated. I know the contributor is interested in continuing work on the implementation as well once he's available.

@djdv commented Mar 14, 2020

[This post is meant to give some insight as to what those branches are, and my intent on what to do with them.]

I would advise against focusing on the experimental mount branches, as they are complete rewrites that were not well received. They covered more platforms and were more performant when functioning, but they are a practical dead end due to their complexity and breadth.

Focusing on improving the existing implementation seems more likely to have impact in the short term (albeit at the cost of being platform-bound).
However, this is something I can't help with, since the existing implementation was not referenced at all during the rewrites. (It was meant to be replaced, so there was little value in poring over something that's antithetical to both goals.)

At the moment I am working on a branch which combines those efforts and orchestrates them via the mount command, using the appropriate file system provider/implementation for the platform
(i.e. the 9P protocol on Linux and cgo-fuse on everything else, with a common interface beneath them, which also allows more platforms to be supported later by implementing their preferred provider).
But there's no expectation of that work making it into mainline, as it's completely different from the existing solution and would require a large amount of coordination on a low-priority item. I already tried that twice and failed to meet the quality bar both times.
The effort being spent is strictly out of necessity: I and other users have wanted this feature for a long time and are willing to accept something that improves incrementally over nothing. As far as I know, nobody else is working on this.
Feedback from users on both branches was very positive, and people remarked on the perceptible performance improvements without even being asked. So I feel it's worth finishing, but again, I don't think it's possible to get it into a state where it would be deemed officially acceptable (without leaning heavily on the engineers of the project to guide an amateur).
Something new by someone more experienced is likely to be the best solution; I'm just a fallback.

As for my availability, I'm struggling with some personal difficulty that could go either way. I'd honestly rather work on this but I don't know when I'll be freed up, if at all.
I'll try my best ┐('~`;)┌
