Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataAPI.levels method? #66

Open
dmbates opened this issue May 4, 2021 · 6 comments
Open

DataAPI.levels method? #66

dmbates opened this issue May 4, 2021 · 6 comments

Comments

@dmbates
Copy link
Contributor

dmbates commented May 4, 2021

In the Arrow.jl package we added a method

function DataAPI.levels(x::DictEncoded)
    rp = DataAPI.refpool(x)   # may contain missing values
    Missing <: eltype(rp) || return rp
    convert(AbstractArray{nonmissingtype(eltype(rp))}, deleteat!(rp, ismissing.(rp)))
end

Should there be a similar method for PooledArray?

At present the default method for levels ends up sorting the pool.

@bkamins
Copy link
Member

bkamins commented May 5, 2021

At present the default method for levels ends up sorting the pool.

This is what the DataAPI.levels API requires:

Values are returned in the preferred order for the collection, with the result of sort as a default.


As for PooledArray we could write collect(skipmissing(DataAPI.refpool(pa))) instead which should be faster. The question is what meaning levels for PooledArray should have, i.e. in what situations someone would use levels on a PooledArray (I do not think it is useful for this type).

@dmbates
Copy link
Contributor Author

dmbates commented May 5, 2021

At present the default method for levels ends up sorting the pool.

This is what the DataAPI.levels API requires:

Values are returned in the preferred order for the collection, with the result of sort as a default.

I would say that for a PooledArray the order in the refpool is the preferred order.

My understanding of the API is that refpool can return nothing if the argument is not a factor-like object such as PooledArray, CategoricalArray, Arrow.DictEncoded, etc. But levels should always return a set of potential levels for the argument. If the result of refpool is nothing then the levels are determined by a sorting operation that eliminates the duplicates.

So for factor-like arrays I would want the result of levels to be the same as the result of refpool without any missing.

As for PooledArray we could write collect(skipmissing(DataAPI.refpool(pa))) instead which should be faster. The question is what meaning levels for PooledArray should have, i.e. in what situations someone would use levels on a PooledArray (I do not think it is useful for this type).

I am happy to create a PR with that implementation.

Perhaps I could explain why I want to add this definition. In the StatsModels package when contrasts like EffectsCoding are applied to a term in a formula the order of the levels determines the interpretation of the coefficients. The levels can be specified in the call to the coding specification but they default to the value of DataAPI.levels.

An Arrow representation of a DictEncoded type can be flagged as ordered. In that case, the order should be preserved all the way to constructing contrasts for a model. Right now, copying an Arrow.DictEncoded object creates a PooledArray which is why I want to ensure that levels applied to a PooledArray returns the refpool modulo removal of any missing values.

@bkamins
Copy link
Member

bkamins commented May 5, 2021

if the argument is not a factor-like object such as PooledArray

I would say that PooledArray is not a factor-like object. The fact that it returns refpool is only performance and storage related. From user's perspective PooledArray and Vector should be indistinguishable.

I am not sure how it is the case for Arrow.DictEncoded. For sure CategoricalArray is factor-like.

The case of Arrow.DictEncoded is that DataAPI.refpool can contain duplicates AFAICT, as opposed to PooledArray and CategoricalArray. The question then is if levels for Arrow.DictEncoded should return duplicates. I think not as the DataAPI.levels contract is:

a vector of unique values which occur or could occur in collection x

and it explicitly requires uniqueness

Right now, copying an Arrow.DictEncoded object creates a PooledArray which is why I want to ensure that levels applied to a PooledArray returns the refpool modulo removal of any missing values.

@nalimilan - so we have a challenge here. Ideally a CategoricalArray should be created instead of PooledArray (so that in particular the ordered flag is retained). I understand that the challenge is that Arrow.DictEncoded supports a wider array of types than CategoricalArray does - is this correct?

If we do not resolve the issue cleanly then considering PooledArray to be a factor-like container is a second best option (but first I would prefer to investigate if we can use CategoricalArray instead as conceptually PooledArray is just a vector without any notion of being a factor).

@dmbates - note that currently even different PooledArrays share the same DataAPI.refpool for efficiency (so really the fact that there is a DataAPI.refpool for PooledArray should be considered an implementation detail). Here is an example:

julia> using PooledArrays, DataAPI

julia> x = PooledArray(1:5);

julia> y = x[1:3];

julia> z = copy(x);

julia> DataAPI.refpool(x) === DataAPI.refpool(y) === DataAPI.refpool(z)
true

@palday
Copy link

palday commented May 5, 2021

I would be tempted to define DataAPI.levels(::PooledArray) to just be a very efficient version of unique(::PooledArray) (or if there is already specialization of unique to just set const DataAPI.levels(::PooledArray) = unique(::PooledArray)). This would be useful in many contexts and at the same time provide the functionality that @dmbates wants.

@bkamins
Copy link
Member

bkamins commented May 5, 2021

Two things:

  1. we would still need to drop missing (this is easy)
  2. If I understand things correctly this is not what @dmbates wants as he needs levels to inherit the order from source Arrow.DictEncoded that can be arbitrary and even contain values not present it the array

@nalimilan
Copy link
Member

I agree with @bkamins that levels(x::Array) should return the same thing as levels(PooledArray(x)). This can indeed be made faster than the fallback method by calling sort!(unique(x)), but that's an implementation detail.

(I don't know what's most appropriate for Arrow.DictEncoded.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants