Replies: 4 comments 6 replies
-
function number_comparator(a, b)
if type(a) == type(b) then
return a > b
end
if type(a) == 'double' or type(b) == 'double' then -- pseudocode, not real
return double(a) > double(b)
end
-- etc
end also we should define rules for comparing positive ints vs uints |
Beta Was this translation helpful? Give feedback.
-
at the point we CAN prefer signed or unsigned value by space:format.field.type or always prefer unsigned |
Beta Was this translation helpful? Give feedback.
-
I lean towards the first option. It's better not to introduce any behavior changes. Besides, it seems logical that we have an index type matching each possible field type. AFAICS double field type was added for the sake of SQL DOUBLE. Is there any place where SQL relies on the fact that numbers are compared exactly like doubles?
|
Beta Was this translation helpful? Give feedback.
-
Does it mean that 51af059 is going to be reverted? If not, I don't understand why we can't convert 'double' to 'number' without index rebuild: 'double' stores |
Beta Was this translation helpful? Give feedback.
-
Reviewers
Tickets
Changelog
v2:
double
index field type has a new preferrable solution.Summary
The document explains solution for a problem that Tarantool is facing at the moment of writing about incorrect sorting and hashing of numbers in indexes. It resulted in unaccessible values in unique/non-unique indexes and duplicates in unique indexes, both memtx and vinyl, both hash and tree index types.
Initially the issue was thought to be small and only about suboptimal MessagePack handling. Like when a number is stored not in the most compact way, or when a positive integer was stored with an
MP_INT
header.But later the problem revealed a larger scale. Here all the known untrivial issues are listed with solutions. Some of them have multiple solutions, unclear which to choose. Hence the RFC. To gain a consensus on those uncertainties.
The issues
✅ Suboptimal MessagePack
Some unsigned numbers can be stored in 9 different ways in MessagePack. Only one of them being the most optimal and taking a single byte of space. All numbers in
[int32_min, uint32_max]
range can be stored in more than one way.The most compact way is obviously better to use, because it takes less space in memory and on disk. However:
[128, int32_max]
can be stored in equal number of bytes asMP_INT
andMP_UINT
. The argument of compactness doesn't work here.The decision was to support the suboptimal MessagePack. All the comparators and hash functions must provide the same results regardless of how numbers are encoded.
✅ Float values in hash index
The same value stored as
double
andfloat
had different hashes. One could argument for this as something intentional, but there is virtually no case when a value stored infloat
wouldn't be also available indouble
type. The difference is the same as storing a number in 5-byteMP_UINT
vs 9-byteMP_UINT
.The decision is to hash all floating numbers as
double
.✅ Hash index uses different hashing algorithm for some field types
Example:
This is broken, because a single
unsigned
field uses one hashing algorithm, and a single field of any other type uses another algorithm.One solution could be that the hash index always uses the same algo for a single field, regardless of its type. But that would complicate the code, and according to the benches isn't super beneficial.
The decision is to remove the templated optimization for hash indexes having up to 3 string/unsigned fields.
❓Index behaviour depending on field type
There is quite an untrivial bug, affecting both hash and tree indexes in memtx at least. The problem is that some indexes' behaviour depends on their field types.
For example, if an indexed field type is
double
, then all its values are forcefully cast todouble
before comparison/hashing. This was done so people, having a field of typedouble
, could insert there also integers literally asMP_INT
andMP_UINT
, but they would "behave" like doubles. From user's PoV it would look like the numbers were cast todouble
before being saved.That in turn breaks other assumptions. For instance, there is an assumption, that
double
is included intonumber
type completely. One would expect that conversion of adouble
field into anumber
field doesn't need any checks. But it can break the index. Because in anumber
index the comparison/hashing is based entirely on field values, without any lossy casting. A specific buggy example:Initially the unique index contains 2 unique integer numbers. But then the field type is changed to
double
. If Tarantooldouble
is treated as C/C++double
, thenunsigned
->double
field type change shouldn't have worked. Because the different unsigned values stored in the index atm are the samedouble
value.get
, only visible viaselect
/pairs
).Another example:
This is completely nonsense which is hard to explain. But it seems to be related to the same problem.
Another example:
Same value is found or not only depending on the field type. Nothing else changes.
The bug has 3 possible solutions:
double
type work entirely the same as C/C++double
when it comes to indexes. That is, index field type alterunsigned
->double
would cause index rebuild. Besides, if the index is unique, this field alter might fail if some integers get converted to the samedouble
. The conversiondouble
<->number
will do the same. Because it changes the ordering and changes the definition of duplicate values. This solution wouldn't break anything that isn't already broken.double
type work same asnumber
withoutdecimal
. This solution can breakget
s in unique indexes when people expect integers not fitting thedouble
type to be cast todouble
. So they intentionally make lookups of value by not matching keys. But that is highly unlikely.double
type an alias fornumber
. Same as (2) but even simpler.The decision is to follow number 1. I.e. make index compare fields using the index's field type. Not the space format or value type. And make sure the alter respects that (i.e. rebuilds the index when needed, and makes no-duplicate checks).
Beta Was this translation helpful? Give feedback.
All reactions