Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the ByteRangeRequest more literate #2437

Open
d-v-b opened this issue Oct 24, 2024 · 10 comments
Open

Make the ByteRangeRequest more literate #2437

d-v-b opened this issue Oct 24, 2024 · 10 comments

Comments

@d-v-b
Copy link
Contributor

d-v-b commented Oct 24, 2024

Continuing a conversation from #1661

We have IO operations that can fetch ranges of bytes. That byte range parameter is parameterized by an (optional) 2-tuple of nullable ints, i.e. tuple[int | None, int | None] | None. The tuple part is ambiguous between two similar specifications of a range of integers -- (start, step) and (start, stop); this ambiguity is bad.

We should rework this type so that it's harder to be confused about these two possibilities. One option suggested by @kylebarron would be a dataclass, e.g. :

@dataclasses.dataclass
class ByteRangeRequest:
  start: int | None = 0
  step: int | None = None

It would also be nice if there was an instance of this class (e.g., the default) that conveyed "fetch the whole thing", and then byte_range wouldn't need to be optional in all the store methods. I'm also not sure we get much from None in this type, since there are integers that can express "the end" of a range, e.g. -1. Curious to hear other ideas here.

@paraseba
Copy link

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 24, 2024

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes". Maybe @normanrz can correct me if this is wrong.

@kylebarron
Copy link

@paraseba Suffix requests are already used (#1661 (comment)), we just want to ensure the type is well defined so that semantics are clear.

Most (all?) object stores, for example, allow that type of query.

All except Azure.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes".

From here it's (-100, None) that would express the last 100 bytes:

index_bytes = await byte_getter.get(
prototype=numpy_buffer_prototype(), byte_range=(-shard_index_size, None)
)

@normanrz
Copy link
Contributor

Yes, it is (-100, None)

@paraseba
Copy link

That is great, then making this type more clear is even more important.

@normanrz
Copy link
Contributor

So, the dataclass would be like this?

@dataclass
class ByteRangeRequest:
  start: int # can be negative for indexing from the back
  length: int | None # None = read to the end

@kylebarron
Copy link

No it would be more like a union:

class RangeRequest(TypedDict):
	"""A request with a start and end in the file"
	start: int
	"""The first byte of the range"""

	end: int
	"""The end of the byte range (exclusive)"""

class SuffixRequest(TypedDict):
	"""A range that fetches the last N bytes from a file"""

	size: int
	"""The number of suffix bytes to fetch"""

class OffsetRequest(TypedDict):
	"""A request for the first N bytes from a file"""

	size: int
	"""The number of bytes to fetch"""

ByteRangeRequest = Union[RangeRequest, SuffixRequest, OffsetRequest]

I think here the semantics are very clear, even if the union makes it slightly verbose. And here I used typed dicts but that wouldn't fully work as shown because there's no way to decipher between the suffix and offset request (if they both were to use the "size" dict key), but that's easily fixed.

@kylebarron
Copy link

In case you're curious, in developmentseed/obstore#67 I updated obstore to allow either a tuple of two non-negative integers, or a dict with offset or suffix keys:
image

@rabernat
Copy link
Contributor

rabernat commented Nov 1, 2024

Dataclasses are heavier than tuples. Is there any significant performance consequence (memory and / or runtime) of making this change? What if we are fetching 100_000 chunks?

@kylebarron
Copy link

kylebarron commented Nov 2, 2024

In obstore I'm still using a tuple of two numbers for the most common case of a range with a known, positive start and end. I think the semantics there are pretty clear (same as range(start, end)). It's only for suffix and offset requests that I'm requiring dict input for clarity.

In [1]: %timeit (100, 1000)
4.07 ns ± 0.429 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [3]: %timeit {"suffix": 100}
31.2 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

If you made 100,000 suffix requests, it would be 2.7ms slower from this.

I figure that suffix requests make up a small part of normal usage anyways, and that this makes the API a lot clearer. But if people make issues in obstore in the future saying that this is a perf obstacle I could add more supported inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants