docs: retrieval research rfc #623

Merged
merged 14 commits into from
Dec 17, 2024
93 changes: 93 additions & 0 deletions rfcs/002-data-retrieval.md
@@ -0,0 +1,93 @@
# RFC-002 Data Retrieval

Author: Rok Černič — @cernicc
Date: 16/12/24

This document aggregates research done on the topic of retrieving stored data
from the storage providers. It does not cover the retrieval markets and assumes
that no additional incentives are needed for being part of the retrieval network
and that every storage provider also provides a retrieval service. It also
covers the main protocols for transfer — Bitswap and GraphSync — recommending
the implementation of Bitswap as a first instance due to the lower upfront
effort required.

## Problem statement

We would like to enable 3rd parties to retrieve data from the storage providers.
To retrieve the stored data, they should provide the payload CID (the CID of the
root node of the CAR file).

## Storage provider server

The current server implementation exposes an endpoint that allows 3rd parties
to upload content. The content is encoded as a CAR file, which is packed into an
unsealed sector and stored on disk. The unsealed sector is then sealed, and the
sealed sector is stored on disk as well. A sealed sector is unreadable, so for
the purposes of data retrieval we either keep the unsealed sector around, at the
cost of storage capacity, or unseal sectors on demand. For the purposes of the
POC, we expect the unsealed sector to always be available.
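
The trade-off above can be illustrated with a minimal sketch. The type and function names (`SectorCopy`, `RetrievalError`, `readable_bytes`) are hypothetical and not taken from the actual server code; the sketch only shows that a read succeeds when the unsealed copy is kept and otherwise signals that unsealing would be needed.

```rust
/// Hypothetical view of what the provider keeps on disk for one sector.
#[derive(Debug, Clone, PartialEq)]
enum SectorCopy {
    /// Only the sealed replica exists; its contents cannot be read directly.
    SealedOnly,
    /// The unsealed copy is kept next to the sealed one (the POC assumption).
    SealedAndUnsealed { unsealed: Vec<u8> },
}

/// Hypothetical error returned when a sector cannot be read as-is.
#[derive(Debug, PartialEq)]
enum RetrievalError {
    /// The sector would have to be unsealed on demand before serving data.
    UnsealRequired,
}

/// Return the readable bytes of a sector, if an unsealed copy is available.
fn readable_bytes(sector: &SectorCopy) -> Result<&[u8], RetrievalError> {
    match sector {
        SectorCopy::SealedAndUnsealed { unsealed } => Ok(unsealed.as_slice()),
        SectorCopy::SealedOnly => Err(RetrievalError::UnsealRequired),
    }
}
```

Under the POC assumption every sector would be in the `SealedAndUnsealed` state, so the error branch only matters once on-demand unsealing is considered.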

### Local Index Directory

The storage server needs a [local index directory](https://boost.filecoin.io/deployment/local-index-directory)
subsystem. Sectors are opaque, meaning they don't contain metadata that
indicates where their files start and end. This is where the index comes in: it
maps the contents of each sector, enabling retrieval of individual files.
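
As a rough illustration of the idea, the sketch below maps a payload CID to a location inside an unsealed sector and slices the bytes out. Everything here is an assumption made for the example: the names `Location` and `LocalIndex`, the use of plain strings as CIDs, and the in-memory `HashMap` standing in for sectors on disk. It is not the layout used by Boost or by the storage server.

```rust
use std::collections::HashMap;

/// Hypothetical location of one piece of content inside an unsealed sector.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Location {
    sector_id: u64,
    offset: usize,
    len: usize,
}

/// Toy local index: payload CID (a plain string here) -> location.
#[derive(Default)]
struct LocalIndex {
    entries: HashMap<String, Location>,
}

impl LocalIndex {
    /// Record where the content identified by `cid` lives.
    fn insert(&mut self, cid: &str, location: Location) {
        self.entries.insert(cid.to_string(), location);
    }

    /// Look up `cid` and slice its bytes out of the matching unsealed sector.
    /// Returns `None` if the CID, the sector, or the byte range is missing.
    fn read<'a>(&self, cid: &str, sectors: &'a HashMap<u64, Vec<u8>>) -> Option<&'a [u8]> {
        let loc = self.entries.get(cid)?;
        let sector = sectors.get(&loc.sector_id)?;
        let end = loc.offset.checked_add(loc.len)?;
        sector.get(loc.offset..end)
    }
}
```

Without the index, the same sector bytes would be available but there would be no way to tell which range belongs to which upload.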

### Storage provider server - retrieval provider

The retrieval process requires access to the local index and the unsealed
sectors, so it must be co-located with the storage process.

#### Bitswap

[Bitswap](https://docs.ipfs.tech/concepts/bitswap/) is used to exchange blocks
of data between peers. In short, it works on a "question and answer" basis:
the client requests the data for a given CID and the server replies with
that data, be it more CIDs or an actual block of data. When coupled with IPLD
graphs, this approach becomes "chatty" for large files, since the first rounds
of the protocol will usually consist of requesting a CID and getting N CIDs
back, then requesting each of those CIDs and getting more back, until reaching
the actual data blocks.
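
The chattiness can be made concrete with a toy model. The sketch below is not the Bitswap wire protocol (it ignores want-lists, `want-have`/`want-block` messages and multiple peers); it assumes a hypothetical in-memory block store and simply counts how many request/response rounds a client needs when it can only ask for CIDs it has already learned about.

```rust
use std::collections::HashMap;

/// A block is either an interior node listing child CIDs or a leaf with data.
enum Block {
    Links(Vec<&'static str>),
    Data(Vec<u8>),
}

/// Fetch everything under `root`, one "question and answer" round at a time.
/// Each round asks for all CIDs learned in the previous round.
/// Returns the number of rounds and the leaf bytes in the order received.
fn fetch_by_rounds(
    store: &HashMap<&'static str, Block>,
    root: &'static str,
) -> (usize, Vec<u8>) {
    let mut wanted = vec![root];
    let mut rounds = 0;
    let mut payload = Vec::new();
    while !wanted.is_empty() {
        rounds += 1; // one request/response exchange for the current CIDs
        let mut next = Vec::new();
        for cid in wanted {
            match store.get(cid) {
                // The answer is more CIDs: they can only be requested next round.
                Some(Block::Links(children)) => next.extend(children.iter().copied()),
                // The answer is actual data.
                Some(Block::Data(bytes)) => payload.extend_from_slice(bytes),
                // The peer does not have this block; skip it in this sketch.
                None => {}
            }
        }
        wanted = next;
    }
    (rounds, payload)
}
```

In this model the number of rounds equals the depth of the graph, which is why deep IPLD graphs for large files cost many round trips before the first data block arrives.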

#### GraphSync

[GraphSync](https://ipld.io/specs/transport/graphsync/) is used to synchronize
graphs across peers. It uses IPLD selectors to efficiently transfer graphs (or
selections of parts of graphs) with a minimal number of independent requests. It
improves on Bitswap's chattiness, because the client only needs to send a single
query request. The server then knows that the block is part of some tree and
returns all related blocks back to the client.
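
A minimal sketch of the single-request idea follows. It assumes a hypothetical in-memory block store, and the `Selector` enum is a toy stand-in: real IPLD selectors are far more expressive, and this is not the GraphSync message format. The point is only that the traversal happens on the server, so one request yields every matching block.

```rust
use std::collections::HashMap;

/// A block is either an interior node listing child CIDs or a leaf with data.
enum Block {
    Links(Vec<&'static str>),
    Data(Vec<u8>),
}

/// Toy stand-in for an IPLD selector (an assumption made for this example).
enum Selector {
    /// Walk the whole graph under the root.
    All,
    /// Follow links at most this many levels below the root.
    MaxDepth(usize),
}

/// Server side: answer ONE request by walking the graph locally and
/// returning the CID of every matched block, in depth-first order.
fn serve(
    store: &HashMap<&'static str, Block>,
    root: &'static str,
    selector: &Selector,
) -> Vec<&'static str> {
    let limit = match selector {
        Selector::All => usize::MAX,
        Selector::MaxDepth(depth) => *depth,
    };
    let mut response = Vec::new();
    let mut stack = vec![(root, 0usize)];
    while let Some((cid, depth)) = stack.pop() {
        if let Some(block) = store.get(cid) {
            response.push(cid);
            if let Block::Links(children) = block {
                if depth < limit {
                    // Push in reverse so children are visited left to right.
                    stack.extend(children.iter().rev().map(|child| (*child, depth + 1)));
                }
            }
        }
    }
    response
}
```

The client sends the root CID and the selector once; the number of round trips no longer depends on the depth of the graph.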

## Retrieval client

Retrieval clients enable 3rd parties to easily retrieve stored data. A good
example of a retrieval client is [Lassie](https://github.com/filecoin-project/lassie),
which can be used as a CLI, a library or an HTTP server.

When used, the client temporarily becomes a node in the same network as the
retrieval provider above. The client first queries the indexer (not to be
confused with the Local Index Directory) for retrieval candidates (storage
providers). After it receives one or more candidates, it sends a retrieval
request to those providers. The request is made over the P2P network using
GraphSync or Bitswap, depending on which protocol the provider supports.
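
The protocol choice in the last step can be sketched as follows. This is not Lassie's actual selection logic; `Candidate`, `Protocol`, `plan_requests` and the provider names are hypothetical, and the indexer response is reduced to a list of providers with the protocols each one advertises.

```rust
/// Transfer protocols a provider may advertise.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Protocol {
    Bitswap,
    GraphSync,
}

/// Hypothetical retrieval candidate, as returned by the indexer.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    provider: &'static str,
    protocols: Vec<Protocol>,
}

/// For each candidate, pick the first protocol in `preferred` order that the
/// provider supports. Candidates with no usable protocol are dropped.
fn plan_requests(
    candidates: &[Candidate],
    preferred: &[Protocol],
) -> Vec<(&'static str, Protocol)> {
    candidates
        .iter()
        .filter_map(|candidate| {
            preferred
                .iter()
                .copied()
                .find(|protocol| candidate.protocols.contains(protocol))
                .map(|protocol| (candidate.provider, protocol))
        })
        .collect()
}
```

A client built for the POC would pass only `Protocol::Bitswap` in `preferred`; adding GraphSync later would then be a matter of extending that list.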

## Conclusion

This document covered the key technical challenge when implementing retrievals
for a system like Polka Storage, as well as the main contenders for the
retrieval protocol — Bitswap and GraphSync.

Both protocols have their strengths:

1. Bitswap is widely implemented, including in Rust, making it a practical
choice for quick implementation in a proof-of-concept (POC).
2. GraphSync offers more efficient graph synchronization and selective data
retrieval, which could be beneficial for larger files, but there is currently
no sufficiently mature implementation in Rust.

For the immediate future and POC development, implementing Bitswap appears to be
the most pragmatic approach due to its existing Rust implementation and
straightforward nature. However, as the system evolves and scales, it may be
worthwhile to consider implementing GraphSync for its advanced features and
efficiency.