Make the `ByteRangeRequest` more literate #2437

d-v-b · 2024-10-24T14:25:09Z

Continuing a conversation from #1661

We have IO operations that can fetch ranges of bytes. That byte range parameter is parameterized by an (optional) 2-tuple of nullable ints, i.e. tuple[int | None, int | None] | None. The tuple part is ambiguous between two similar specifications of a range of integers -- (start, step) and (start, stop); this ambiguity is bad.

We should rework this type so that it's harder to be confused about these two possibilities. One option suggested by @kylebarron would be a dataclass, e.g. :

@dataclasses.dataclass
class ByteRangeRequest:
  start: int | None = 0
  step: int | None = None

It would also be nice if there was an instance of this class (e.g., the default) that conveyed "fetch the whole thing", and then byte_range wouldn't need to be optional in all the store methods. I'm also not sure we get much from None in this type, since there are integers that can express "the end" of a range, e.g. -1. Curious to hear other ideas here.

The text was updated successfully, but these errors were encountered:

paraseba · 2024-10-24T14:29:24Z

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

d-v-b · 2024-10-24T14:31:59Z

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes". Maybe @normanrz can correct me if this is wrong.

kylebarron · 2024-10-24T14:32:55Z

@paraseba Suffix requests are already used (#1661 (comment)), we just want to ensure the type is well defined so that semantics are clear.

Most (all?) object stores, for example, allow that type of query.

All except Azure.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes".

From here it's (-100, None) that would express the last 100 bytes:

zarr-python/src/zarr/codecs/sharding.py

Lines 700 to 702 in bc588a7

    
           index_bytes = await byte_getter.get( 
        
               prototype=numpy_buffer_prototype(), byte_range=(-shard_index_size, None) 
        
           )

normanrz · 2024-10-24T14:34:07Z

Yes, it is (-100, None)

paraseba · 2024-10-24T14:34:30Z

That is great, then making this type more clear is even more important.

normanrz · 2024-10-24T14:35:21Z

So, the dataclass would be like this?

@dataclass
class ByteRangeRequest:
  start: int # can be negative for indexing from the back
  length: int | None # None = read to the end

kylebarron · 2024-10-25T01:50:06Z

No it would be more like a union:

class RangeRequest(TypedDict):
	"""A request with a start and end in the file"
	start: int
	"""The first byte of the range"""

	end: int
	"""The end of the byte range (exclusive)"""

class SuffixRequest(TypedDict):
	"""A range that fetches the last N bytes from a file"""

	size: int
	"""The number of suffix bytes to fetch"""

class OffsetRequest(TypedDict):
	"""A request for the first N bytes from a file"""

	size: int
	"""The number of bytes to fetch"""

ByteRangeRequest = Union[RangeRequest, SuffixRequest, OffsetRequest]

I think here the semantics are very clear, even if the union makes it slightly verbose. And here I used typed dicts but that wouldn't fully work as shown because there's no way to decipher between the suffix and offset request (if they both were to use the "size" dict key), but that's easily fixed.

kylebarron · 2024-11-01T20:16:15Z

In case you're curious, in developmentseed/obstore#67 I updated obstore to allow either a tuple of two non-negative integers, or a dict with offset or suffix keys:

rabernat · 2024-11-01T22:58:07Z

Dataclasses are heavier than tuples. Is there any significant performance consequence (memory and / or runtime) of making this change? What if we are fetching 100_000 chunks?

kylebarron · 2024-11-02T00:38:47Z

In obstore I'm still using a tuple of two numbers for the most common case of a range with a known, positive start and end. I think the semantics there are pretty clear (same as range(start, end)). It's only for suffix and offset requests that I'm requiring dict input for clarity.

In [1]: %timeit (100, 1000)
4.07 ns ± 0.429 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [3]: %timeit {"suffix": 100}
31.2 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

If you made 100,000 suffix requests, it would be 2.7ms slower from this.

I figure that suffix requests make up a small part of normal usage anyways, and that this makes the API a lot clearer. But if people make issues in obstore in the future saying that this is a perf obstacle I could add more supported inputs.

d-v-b mentioned this issue Oct 24, 2024

Prototype of object-store-based Store implementation #1661

Open

8 tasks

kylebarron mentioned this issue Oct 25, 2024

remove range from GetOptions for now developmentseed/obstore#51

Merged

This was referenced Nov 1, 2024

Monthly issue metrics report #2455

Closed

Monthly issue metrics report MSanKeys963/zarr-python#3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the `ByteRangeRequest` more literate #2437

Make the `ByteRangeRequest` more literate #2437

d-v-b commented Oct 24, 2024

paraseba commented Oct 24, 2024

d-v-b commented Oct 24, 2024

kylebarron commented Oct 24, 2024

normanrz commented Oct 24, 2024

paraseba commented Oct 24, 2024

normanrz commented Oct 24, 2024

kylebarron commented Oct 25, 2024

kylebarron commented Nov 1, 2024

rabernat commented Nov 1, 2024

kylebarron commented Nov 2, 2024 •

edited

Loading

Make the ByteRangeRequest more literate #2437

Make the ByteRangeRequest more literate #2437

Comments

d-v-b commented Oct 24, 2024

paraseba commented Oct 24, 2024

d-v-b commented Oct 24, 2024

kylebarron commented Oct 24, 2024

normanrz commented Oct 24, 2024

paraseba commented Oct 24, 2024

normanrz commented Oct 24, 2024

kylebarron commented Oct 25, 2024

kylebarron commented Nov 1, 2024

rabernat commented Nov 1, 2024

kylebarron commented Nov 2, 2024 • edited Loading

Make the `ByteRangeRequest` more literate #2437

Make the `ByteRangeRequest` more literate #2437

kylebarron commented Nov 2, 2024 •

edited

Loading