Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Extensions to object store GetOptions #7170

Merged
merged 3 commits into from
Feb 27, 2025

Conversation

crepererum
Copy link
Contributor

Which issue does this PR close?

Closes #7155.

Rationale for this change

See #7155.

What changes are included in this PR?

The Extensions type and the new GetOptions::extensions field.

Are there any user-facing changes?

Adding a field to GetOptions is technically a breaking change.

@github-actions github-actions bot added the object-store Object Store Interface label Feb 21, 2025
@tustvold
Copy link
Contributor

I like this approach and think it would be fine. I do, however, wonder if we're somewhat overloading the ObjectStore interface a bit much. I've filed #7171 to articulate this and perhaps explore the design space a bit more

@waynr
Copy link

waynr commented Feb 21, 2025

As the author of #7155 where my intent was to end up with something that could be compatible with datafusion's SessionConfig (ie such that we could transfer opaque trait objects from a SessionConfig instance to an object store extensions instance), this doesn't look like it will satisfy the need I've described.

With the API surface you've provided for inserting new elements into Extensions, it would be impossible to insert items from a SessionConfig's extensions into the new Extensions described here and still be able to retrieve them by their original type.

Let's say that SessionConfig's extensions carry an IOx Span (which I described in #7135 and #7155 as a motivating use case). In datafusion, I would like to be able to pass on all of the SessionConfig extensions to the GetOptions extensions such that a Span instance is retrievable from the latter. So if the Arc<dyn ObjectStore> receiving such a modified GetOptions needed a parent Span (and its associated span recording mechanisms), it could attempt to retrieve one using `opts.get::()'.

The problem is that the Span type stored in the SessionConfig extensions for influxdb is stored as an opaque trait object -- and we can't downcast them in datafusion without adding a dependency to the shared IOx core crates where the Span type is defined. If we were to try iterating over SessionConfig extensions to insert them with Extensions.set, they would be indexed by their opaque trait object type -- Arc<dyn Any + Send + Sync + 'static> (and if there were more than one extension, I think only the last one would remain albeit irretrievable by its original type id).

A impl From<HashMap<TypeId, Arc<dyn Any + Send + Sync + 'static>>> for Extensions should address this need. Or maybe a impl Extend<(TypeId, Arc<dyn Any + Send + Sync + 'static>)> for Extensions.

@tustvold
Copy link
Contributor

With the API surface you've provided for inserting new elements into Extensions, it would be impossible to insert items from a SessionConfig's extensions into the new Extensions described here and still be able to retrieve them by their original type.

You could add DataFusion's entire Extensions bundle here, whilst you would have two indirections, it would be the least magical mechanism for doing this.

That being said, I am increasingly of the opinion the ObjectStore trait is being overloaded for something it fundamentally is not designed to be, and I would be interested in your thoughts on #7171. It seems really odd to me that we need to extend the ObjectStore trait, a generic interface to object stores, to support propagating DF context through to an IOx subsystem. This screams to me of a missing / incorrect layer of abstraction. I wonder if instead IOx could be using a less generic trait, e.g. overriding AsyncFileReaderFactory, or similar?

@waynr
Copy link

waynr commented Feb 21, 2025

You could add DataFusion's entire Extensions bundle here, whilst you would have two indirections, it would be the least magical mechanism for doing this.

Do you mean that instead of the trait Extension approach shown in this PR we do exactly as is done for extensions in datafusion, ie just use a HashMap<TypeId, Arc<dyn Any + Send + Sync + 'static>>?

That being said, I am increasingly of the opinion the ObjectStore trait is being overloaded for something it fundamentally is not designed to be, and I would be interested in your thoughts on #7171.

Yeah I'm taking a look at that now, I'll see if I can think of any useful feedback for you there.

It seems really odd to me that we need to extend the ObjectStore trait, a generic interface to object stores, to support propagating DF context through to an IOx subsystem. This screams to me of a missing / incorrect layer of abstraction. I wonder if instead IOx could be using a less generic trait, e.g. overriding AsyncFileReaderFactory, or similar?

I'm still pretty new to datafusion ecosystem but it seems to me like using it the way influxdb is more or less requires passing in some kind of Arc<dyn ObjectStore> so I don't yet see how overriding AsyncFileReaderFactory would be a viable alternative here -- unless there is some logic in the datafusion execution path I missed that would prefer that over an object store. Or are you suggesting a refactor of datafusion abstractions to support that approach?

@tustvold
Copy link
Contributor

Do you mean that instead of the trait Extension approach shown in this PR we do exactly as is done for extensions in datafusion, ie just use a HashMap<TypeId, Arc<dyn Any + Send + Sync + 'static>>?

No I mean registering the entire DF's Extensions bundle onto the ObjectStore Extensions bundle

like using it the way influxdb is more or less requires passing in some kind of Arc so I don't yet see how overriding AsyncFileReaderFactory would be a viable alternative here

It would entail making changes to how InfluxDB is using this, correct. I believe it shouldn't require changes to DF, given that AsyncFileReaderFactory was added to enable using DF with other IO abstractions such as OpenDAL.

@waynr
Copy link

waynr commented Feb 21, 2025

I believe it shouldn't require changes to DF, given that AsyncFileReaderFactory was added to enable using DF with other IO abstractions such as OpenDAL.

The reason I'm thinking it might require changes to DF is that InfluxDB registers object stores with what looks to me like a fairly core type: https://github.com/apache/datafusion/blob/22156b2a6862e68495a82bd2579d3ba22c6c5cc0/datafusion/execution/src/runtime_env.rs#L78

Is this not actually required? I've observed it in use in a few places, but maybe those are edge cases that aren't applicable to InfluxDB.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a pragmatic solution, and would be happy to merge it.

That being said, I wonder what you think of using http::Extensions instead.

This would lose debug support, and making http a required dependency, but would have the advantage of being a standard ecosystem abstraction, and would allow for a pretty compelling way to pass through any extensions to the HttpClient implementation (#7183)

@crepererum
Copy link
Contributor Author

That being said, I wonder what you think of using http::Extensions instead.

This would lose debug support, and making http a required dependency, but would have the advantage of being a standard ecosystem abstraction, and would allow for a pretty compelling way to pass through any extensions to the HttpClient implementation (#7183)

I think http is already a dependency (indirectly), so the dependency argument doesn't count. I could live w/ loosing debugging support TBH and in vein of #7183 I think this is the better / more standard interface.

FWIW InfluxData already uses http extension interface to pass tracing contexts through tower service layers and into tonic service implementations. So I think that somewhat confirms that using a common interface might be good.

I'll leave this PR open for now and rebase & adjust it after #7183 gets merged.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we wire these through to HttpClient as part of this?

@crepererum
Copy link
Contributor Author

Should we wire these through to HttpClient as part of this?

I can. Another question: should we make the http requirement mandatory or should the field on GetOptions be optional and depend on the said requirement? I would strongly prefer the first option.

@tustvold
Copy link
Contributor

I would prefer to make it mandatory, it is a relatively simple crate and don't foresee any issues

Comment on lines +727 to +728
version: _,
head: _,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to unpack the struct to make the forwarding here easier to maintain (so new fields are not forgotten). However we seem to have missed these two.

@tustvold is this a bug? Shouldn't we at least wire up version? And head should probably change the method, but I'm not sure how much will break if we do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version is object store specific what it means and so is handled that way, head requests is likely historical.

I wouldn't object to finding a way to roll the logic into here, but it isn't entirely straightforward to do

@tustvold
Copy link
Contributor

I presume you will follow this up with PRs for the other Options structs

@tustvold tustvold added the api-change Changes to the arrow API label Feb 27, 2025
@crepererum crepererum merged commit 8df892b into apache:main Feb 27, 2025
15 checks passed
crepererum added a commit to crepererum/arrow-rs that referenced this pull request Feb 27, 2025
@crepererum
Copy link
Contributor Author

I presume you will follow this up with PRs for the other Options structs

Done in #7213.

crepererum added a commit that referenced this pull request Feb 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce Extensions concept to object_store::GetOptions and object_store::PutOptions
3 participants