pfcon file transmission overhaul #27
jennydaman started this conversation in Ideas
Replies: 1 comment
-
Minor comment on "The idea of JPI is that a single-root DAG of ds-plugins can be executed by the remote compute resource without egress of intermediate results." This is not completely correct: the idea was more about keeping data on the remote to avoid retransmitting intermediate results back to the compute. Data would still egress from individual plugins, so the full experience is still represented.
-
Current Behavior
Currently, pfcon sends and receives a ZIP file for plugin instance data.
Using a ZIP file is inefficient.
Successive plugin instances can benefit from a file cache.
Data in ChRIS is, for the most part, write-once, read-many (WORM). A plugin instance's output files are deleted and then downloaded again even though they never change.
The originally proposed solution to this was "Jorge's pipeline idea" (JPI), which is implemented in CUBE but was never supported by pfcon. The idea of JPI is that a single-root DAG of ds-plugins can be executed by the remote compute resource without egress of intermediate results.
The Idea: Pull-Into-Cache (PIC) Pfcon
Remote File Cache
Observation: "swift" file path names uniquely identify data files from a given ChRIS instance. Thus, they can be used as cache keys.
Improved idea: (5) for each file name: if pfcon has previously cached the file, use the cached file; otherwise, pull the file from CUBE.
Pfcon will maintain a file cache where (CUBEInstance, FileName) -> file data.
Caching optimizes successive plugin instance data ingress.
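The cache lookup described above can be sketched in Python. This is a minimal illustration, not pfcon's actual implementation; the cache directory, `cache_path`, and `get_file` names, and the injected `pull` callback are all hypothetical.

```python
from pathlib import Path

CACHE_DIR = Path("/var/cache/pfcon")  # hypothetical cache location


def cache_path(cube_url: str, swift_path: str) -> Path:
    """Map a (CUBE instance, file name) cache key to a local path.

    Swift file paths uniquely identify immutable (WORM) file data
    within a given CUBE instance, so the pair is a valid cache key.
    """
    # Flatten both key components into filesystem-safe names.
    host = cube_url.replace("://", "_").replace("/", "_")
    name = swift_path.replace("/", "_")
    return CACHE_DIR / host / name


def get_file(cube_url: str, swift_path: str, pull) -> bytes:
    """Return file data, pulling from CUBE only on a cache miss.

    ``pull`` is a hypothetical callable (cube_url, swift_path) -> bytes
    that fetches the file over the network.
    """
    local = cache_path(cube_url, swift_path)
    if local.is_file():
        return local.read_bytes()      # cache hit: no ingress needed
    data = pull(cube_url, swift_path)  # cache miss: fetch from CUBE
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(data)
    return data
```

Because the data is WORM, the cache never needs invalidation logic: a path either exists with its final contents or has not been pulled yet.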
The performance of creating plugin instances will be equivalent to that of JPI, whether or not the plugin instances are part of a pipeline. In other words, plugin instances will be efficient by nature, and this idea replaces JPI.
Sequence Diagram
Inspirations and Alternatives
IPFS and git-annex use content-based addressing to deduplicate data across distributed storage. Content-based addressing is better for reproducibility, but it is not performant: every file must be hashed in full before it can be addressed. In ChRIS, filenames already uniquely identify file data, so we exploit that property by using filenames as cache keys.
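The contrast between the two addressing schemes can be shown in a few lines. This is an illustrative sketch; the example URL and path are hypothetical.

```python
import hashlib

data = b"plugin output bytes"

# Content-based addressing (IPFS / git-annex style): the key is derived
# from the bytes themselves, so identical data deduplicates regardless
# of its name -- but every file must be fully hashed before lookup.
content_key = hashlib.sha256(data).hexdigest()

# ChRIS-style name-based addressing: the swift path already uniquely
# identifies immutable (WORM) file data, so the name itself is a cheap
# cache key -- no hashing pass over the data is needed.
name_key = ("https://cube.example.org", "chris/feed_1/data.txt")
```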
Tangent: Deduplicating data in CUBE
CUBE should use content-based addressing to exploit the WORM property of data in ChRIS.
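A content-addressed store that exploits the WORM property could look like the following minimal sketch, assuming an in-memory blob store and index (both hypothetical; CUBE's actual storage backend is swift-like object storage).

```python
import hashlib

# Hypothetical content-addressed store: file bytes are kept once under
# their digest, while many ChRIS paths may point at the same blob.
blobs: dict[str, bytes] = {}  # digest -> file data (stored once)
index: dict[str, str] = {}    # ChRIS path -> digest


def put(path: str, data: bytes) -> None:
    """Store a file; identical contents are deduplicated (WORM)."""
    digest = hashlib.sha256(data).hexdigest()
    blobs.setdefault(digest, data)  # no-op if the blob already exists
    index[path] = digest


def get(path: str) -> bytes:
    """Resolve a ChRIS path to its (shared) blob."""
    return blobs[index[path]]
```

Two plugin instances producing byte-identical outputs would then consume storage only once, with the per-path index preserving the familiar filename view.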
Thought Experiment: Federated ChRIS as Compute Resources
A sophisticated idea for CUBE federation would be that other CUBEs can be registered to a CUBE as remote compute environments. CUBE already has a file storage feature, which a federated peer could serve with.