Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bool values sometimes converted to Integer #118

Open
GregLawson opened this issue Jan 14, 2023 · 6 comments
Open

Bool values sometimes converted to Integer #118

GregLawson opened this issue Jan 14, 2023 · 6 comments

Comments

@GregLawson
Copy link

GregLawson commented Jan 14, 2023

dictionary([:isvalid => true, :value => 0])
2-element Dictionary{Symbol, Int64}
:isvalid │ 1
:value │ 0
dictionary([:isvalid => true, :value => "0"])
2-element Dictionary{Symbol, Any}
:isvalid │ true
:value │ "0"
Dict works as expected.
Dict(:isvalid => true, :value => 0)
Dict{Symbol, Integer} with 2 entries:
:isvalid => true
:value => 0
Perhaps it is because Bool <: Integer but not Bool <: Int64?
This seems to occur when there is at least one Integer and Bool value.
Is this a bug? Is there a workaround?

@andyferris
Copy link
Owner

andyferris commented Jan 14, 2023

Yes, we use promote to harmonise the elements of a homogenous dictionary, like how for arrays [true, 0] gives Int[1, 0]. It's likely Dict uses promote_jointype instead (but generally the AbstractDictionary interface mirrors the AbstractArray one as much as possible).

The workaround is to specify the type explicty. Dictionary{Symbol, Union{Int64, Bool}}([:isvalid, :value], Union{Int64, Bool}[true, 0]) should work. We do need to improve dictionary so we can specify the key and value types - do you have any thoughts for improvements there?

@GregLawson
Copy link
Author

I'll try your work-around, and hopefully report back. As for suggestions on the dictionary interface, I don't feel I understand Julia well enough to be helpful. I'm still struggling with Abstract data types and generic functions. Each time I try, I end up particularizing the code to get it to work. I'm still learning Julia.
It might be helpful for me to describe why I hope Dictionary is preferable to Dict. I would call my use case Exploratory Parsing as a front-end to Exploratory Data Analysis. The first step is to transform a poorly documented (almost all are) data source into a universal, anonymous, lightweight data structure. For example, JSON.parse returns a tree of nested Arrays and Dicts. You probably really want structs, tables, DataFrames, Third Normal Form, Tidy Data, and your Strings converted to Dates, Times, numbers, etc. This conversion probably can't be automated, the data was created by programmers like us, so we can probably guess what they were thinking. While probably a theoretically intractable problem, it strikes me that Julia is the perfect tool for exploring whether your hopes about data structure and data quality are justified. Julia statistical tools, type inference, Regexes, @Assert, AbstractTable, and Exceptions-as-data can go a long way toward specifying the actual data structure and meaning.
Dictionary and Dict provide a universal, anonymous, lightweight data structure for heterogeneous data and Arrays provide one for homogeneous data. I want both, so the idea of making the Dictionary interface more like the Array interface is appealing, so that tree processing algorithms can be more polymorphic. Uniform indices of nested Dictionary / Arrays allows a find function to return the indices of a piece of data you are sure exists in a complex tree data structure (e.g. an HTML or PDF file). The indices can then be iterated to extract the rows and columns of a table.
Additional advantages of Dictionary over Dict accrue from better control of the ordering of keys. This should be helpful in parsing because order is often meaningful. Controlling order of keys is also useful for making Dictionary print-outs more readable. It is also desirable to control the order of keys when Dictionaries are appended into DataFrames.

@GregLawson
Copy link
Author

GregLawson commented Jan 16, 2023

There seem three possible solutions to having type instability or mangling:

  1. don't use the dictionary function,
  2. change promotion to promote_jointype,
  3. change the dictionary function to a generic parameterized by the desire key and value types. The default types could be the current default.
    1 discourages Dictionary adoption for codebases containing lots of Dict pair initialization. It also loses the asymmetric syntactic sugar of the => pair operator.
    I don't know why AbstractArray likes to promote Bools to Int64, but I suspect the reason doesn't make sense for heterogeneous data structures like Dict and Dictionary.
    3 seems the least disruptive to any code depending on Bools to be treated as Int64 (which I suspect doesn't exist)
    So I vote for 2, but would like to hear arguments for other approaches.

@GregLawson
Copy link
Author

At a minimum, I think promotion should be optional. Unexpected, automatic conversions can be quite confusing to debug. Promotion can provide a performance advantage for Arrays by avoiding an indirection, but this seems less useful for Dictionaries.
When dealing with data that might contain errors, even with Arrays of numbers, one might want to return a number or error information in the form of an Exception, nothing, missing, @Assert failure, overflow, saturation or error message. Only after careful consideration of error modes should you address performance optimizations.

@andyferris
Copy link
Owner

andyferris commented Jan 21, 2023

Promotion is optional: if you specify the dictionary type precisely and use it's constructor there is no promotion. The issue here is taking a list of pairs and constructing a dictionary with a certian element type (i.e. the dictionary function needs improving or replacing).

Keep in mind in the first example the behavior isn't due to Dictionaries.jl but is due to Julia:

julia> [:isvalid => true, :value => 0]
2-element Vector{Pair{Symbol, Int64}}:
 :isvalid => 1
   :value => 0

The dictionary function receives data which has already been promoted.

Promotion can provide a performance advantage for Arrays by avoiding an indirection, but this seems less useful for Dictionaries.

This isn't true. Harmonizing the types makes arrays faster, and also makes dictionaries faster (which in many use cases for Dictionaries.jl like iteration/mapping/broadcasting/etc are the same speed as arrays, for much the same design decisions).

but I suspect the reason doesn't make sense for heterogeneous data structures like Dict and Dictionary

Again I feel there is a misunderstanding here. Dictionaries in Julia are for homogenous data, just like arrays. On top of that one nice thing you can choose to do in Julia because of it's type system is if you want heterogenous data you can use Vector{Any}, Dict{Any, Any}, Dictionary{Any, Any}, etc (and if you have heterogenous data of "known shape" you can of course use a struct, tuple or named tuple).

There is a valid issue with the dictionary(...) function here since you can't specify the keytype, valtype, or dictionary type. We may need to replace it or change how the dictionary constructors work (a breaking change, by the way, but probably worth it since I have this struggle when using Dictionaries.jl myself). However, please respect that Dictionaries.jl will continue to perform promotion for the same reason as Vector and Dict and that's not really the debate (there might be some wiggle room to switch to promote_typejoin on the keys/indices instead of promote, but that would need to be argued on its merits, and in any case would only affect inputs which are generators and similar).

@GregLawson
Copy link
Author

GregLawson commented Jan 31, 2023

Thanks for the detailed response.
How about
function dict(pairs...)
Dictionary(pairs)
end
dict(:isvalid => true, :value => 0)
2-element Dictionary{Int64, Pair{Symbol}}
1 │ :isvalid => true
2 │ :value => 0
The dict function is more interoperable with Dict. Hopefully preserves order.
It could be called dictionary if it avoids breaking changes.
No promotion occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants