Non-deterministic schema generation for Optional[Union[...]] types #691
Comments
Seems I'm being bested by this! Even after swapping In a few cases, I've found setting |
Which Python version are you using locally and in the CI? Are you installing extra packages during the CI that you do not have locally? |
This is with Python 3.11.8 and notably |
Okay, I've managed to now reproduce the issue within the same environment, locally. This is where:
|
Comically, I thought I could be clever and get away with forcing the
But even more annoyingly, and also |
Okay, I think I see where this is going wrong. At this point of the UnionField handling, we call
The Python docs for typing.get_args have this relevant tidbit:
This is happening specifically for the I suspect if we simply apply some sort of deterministic sorting to the types coming into that tuple, we would be able to ensure this doesn't get jumbled between runs. The subsequent code to ensure the default is at the front would still function fine regardless of the order coming in. dataclasses-avroschema/dataclasses_avroschema/fields/fields.py Lines 318 to 331 in 52ac9b4
|
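To make the suspected cache interaction concrete, here is a minimal sketch using only the standard library. It assumes (as described above, not verified against every Python version) that `typing` may reuse a previously built `List[Union[...]]` object for a later union with the same members in a different order. The names `first` and `second` are illustrative only.

```python
from typing import List, Union, get_args

# Two annotations whose unions have the same members in a different order.
# Union members compare equal regardless of order, so a cache keyed on the
# union may hand back the object that was built first.
first = List[Union[str, None, int]]
second = List[Union[str, int, None]]

# get_args(List[X]) returns (X,), and get_args(Union[...]) returns the members.
first_args = get_args(get_args(first)[0])
second_args = get_args(get_args(second)[0])

print("first: ", first_args)
print("second:", second_args)
# Whether these two tuples come back in the same order depends on the
# Python version and on what was evaluated earlier in the process.
print("same order:", first_args == second_args)
```

The members are always the same set; only their order is in question, which is why a schema generator that emits them in `get_args` order can vary between entrypoints.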
I'll work on a PR for ^ While I'm here, any chance you'd also be able to loosen some dependency version requirements? It'd be great to shift something like |
Python doesn't seem to have good facilities for sorting types, so this would require a custom comparison function for Python types. I'm not too keen on that given the complexity, but it would accomplish the goals. What do you think @marcosschroh? I've prepped some tests here which more-or-less exhibits the issue where different |
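As a rough idea of what such a custom ordering could look like, here is a sketch of a sort key built from each type's module and qualified name, recursing into generic arguments. `type_sort_key` and `sorted_union_args` are hypothetical helpers, not part of dataclasses-avroschema, and the sketch only handles `typing.Union` (not `X | Y` unions or special cases such as `Literal`).

```python
from typing import Dict, Optional, Union, get_args, get_origin


def type_sort_key(tp):
    """Build a comparable key (module, qualified name, child keys) for a type."""
    origin = get_origin(tp)
    if origin is None:
        # Plain class such as str, int or NoneType.
        return (
            getattr(tp, "__module__", ""),
            getattr(tp, "__qualname__", repr(tp)),
            (),
        )
    child_keys = tuple(type_sort_key(arg) for arg in get_args(tp))
    if origin is Union:
        # Nested unions are themselves order-insensitive.
        child_keys = tuple(sorted(child_keys))
    return (
        getattr(origin, "__module__", ""),
        getattr(origin, "__qualname__", repr(origin)),
        child_keys,
    )


def sorted_union_args(tp):
    """Return the members of a Union in a deterministic order."""
    return tuple(sorted(get_args(tp), key=type_sort_key))


a = sorted_union_args(Union[str, int, None])
b = sorted_union_args(Union[None, int, str])
print(a == b)

nested = sorted_union_args(Optional[Union[str, Dict[str, Optional[str]]]])
print(nested)
```

Sorting this way changes the member order the user wrote, so any logic that moves the default's type to the front would still need to run afterwards.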
@lgo I do not think

from typing import Optional, Dict, Union, get_args

types = Optional[
    Dict[str, Optional[Union[str, int, float, bool, Dict[str, Optional[str]]]]]
]

result = get_args(types)
for _ in range(1, 1000):
    if result != get_args(types):
        raise ValueError

I always get the same result. The schema is also the same; it is deterministic:

from dataclasses_avroschema import AvroModel

class Foo(AvroModel):
    value: Optional[
        Dict[str, Optional[Union[str, int, float, bool, Dict[str, Optional[str]]]]]
    ]

result = Foo.avro_schema()
for _ in range(1, 1000):
    if result != Foo.avro_schema():
        raise ValueError

The result is always the same:

{"type": "record", "name": "Foo", "fields": [{"name": "value", "type": [{"type": "map", "values": ["string", "long", "double", "boolean", {"type": "map", "values": ["string", "null"], "name": "value"}, "null"], "name": "value"}, "null"]}]}

I think your problem is in the way that you are comparing the |
To clarify, the issue I'm facing is execution of the same Python code (/schemas) via different entrypoints, where unrelated code that is loaded affects the results. Sorry for the confusion! I got this minimal repro working now that I've also understood this more (:

import os.path
import os
from typing import List, Union
from dataclasses_avroschema import AvroModel
import json
import sys

if len(sys.argv) != 2:
    print("Usage: repro.py <temp schema filename>")
    sys.exit(1)
schema_file = sys.argv[1]

# Uncomment this after the first run.
#
# some_other_val: List[Union[str, None, int]] = {}

class Foo(AvroModel):
    value: List[Union[str, int, None]]

result = Foo.avro_schema()

# On first run, write the schema.
if not os.path.isfile(schema_file):
    with open(schema_file, 'w') as f:
        f.write(json.dumps(result))
    sys.exit(0)

# Otherwise, compare it.
with open(schema_file, 'r') as f:
    prev_result = json.loads(f.read())

if prev_result != result:
    print("Previous: ", prev_result)
    print("Current: ", result)
    sys.exit(1)

$ python /Users/joeyp/Downloads/repro.py /tmp/schema_file.json
# Uncomment the indicated `some_other_val`
$ python /Users/joeyp/Downloads/repro.py /tmp/schema_file.json
Previous: {"type": "record", "name": "Foo", "fields": [{"name": "value", "type": {"type": "array", "items": ["string", "long", "null"], "name": "value"}}]}
Current: {"type": "record", "name": "Foo", "fields": [{"name": "value", "type": {"type": "array", "items": ["string", "null", "long"], "name": "value"}}]}

And we can see the first run has Relating this back to the mention on the In practice, this has happened as I have some code like |
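One possible stopgap for the stale-schema check, sketched here under the assumption that union member order is the only thing that differs between runs, is to normalize union ordering before comparing. `normalize` and `UNION_KEYS` are hypothetical names for illustration. The sketch treats any list found under `type`, `items` or `values` as a union and leaves other lists, such as `fields`, in their original order. It is meant for comparison only, not for rewriting the schema that gets published.

```python
import json

# Keys whose value, when it is a list, is assumed to be an Avro union.
UNION_KEYS = ("type", "items", "values")


def normalize(node, is_union=False):
    """Return a copy of a parsed schema with union members in sorted order."""
    if isinstance(node, dict):
        return {
            key: normalize(value, is_union=key in UNION_KEYS)
            for key, value in node.items()
        }
    if isinstance(node, list):
        items = [normalize(item) for item in node]
        if is_union:
            # Sort on a stable JSON rendering so dict members are comparable.
            items.sort(key=lambda item: json.dumps(item, sort_keys=True))
        return items
    return node


previous = json.loads(
    '{"type": "record", "name": "Foo", "fields": [{"name": "value", "type": '
    '{"type": "array", "items": ["string", "long", "null"], "name": "value"}}]}'
)
current = json.loads(
    '{"type": "record", "name": "Foo", "fields": [{"name": "value", "type": '
    '{"type": "array", "items": ["string", "null", "long"], "name": "value"}}]}'
)

print(previous == current)
print(normalize(previous) == normalize(current))
```

With the two schemas from the repro output, the raw comparison reports a difference while the normalized comparison does not.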
Hi @lgo, thanks for the clarification. I think we cannot do anything about it, as the Python cache works this way. I found the Python issue related to this behavior, and it seems that they are working on it. This is also affecting pydantic. It seems that you will have to find a workaround and load your in a different way and wait for the fix in Python. Regarding |
Describe the bug
I've been hitting some odd non-determinism for some complex schemas. I haven't quite narrowed it down or reached a reproduction as it also seems to be dependent on some import load, e.g. some code in one module loading schemas sees one order (str, null, int, float, ...) and code in another module run separately sees (str, int, float, ..., null).
This is resulting in some non-determinism across code that writes schemas, and code that checks for stale schemas.
Here's the specific schema field I'm dealing with in case the surrounding bits are also relevant, although I believe the problem is specifically with just the Optional[Union[...]] with several sub-values in particular.
I'll try to reproduce this more reliably later, but for the moment I've found that swapping this to Union[None, ...] forces the ordering to be consistent.
To Reproduce
Steps to reproduce the behavior
Expected behavior
A clear and concise description of what you expected to happen.