cursor.close() can throw UnicodeDecodeError #569

Open
csringhofer opened this issue Feb 6, 2025 · 3 comments

@csringhofer
Collaborator

This was witnessed during #566.
The issue only happened with Python 3.9. It is probably related to using the native vs. the interpreted Thrift protocol implementation: when I set fallback=False while creating TBinaryProtocolAccelerated, the tests failed early in my Python 3.9 environment, while other Python versions had no issues. Changing to TBinaryProtocol led to the error on all 3.* Python versions (but not on Python 2.7).

It is not clear at this moment whether this is an issue in my Ubuntu 20.04 environment or whether something is wrong with the Thrift package on Python 3.9. I tried changing from thrift 0.16.0 to newer versions, but the issue still occurred.

I see several questions here:

  • why is there an invalid utf8 string in TCloseSessionResp?
  • why does TBinaryProtocolAccelerated work differently than TBinaryProtocol?
  • why does this only occur on Python 3.9?

Callstack:
impala/hiveserver2.py:308: in close
    self.session.close()
impala/hiveserver2.py:1313: in close
    self._rpc('CloseSession', req, False)
impala/hiveserver2.py:1179: in _rpc
    response = self._execute(func_name, request, safe_to_retry)
impala/hiveserver2.py:1199: in _execute
    return func(request)
impala/_thrift_gen/TCLIService/TCLIService.py:228: in CloseSession
    return self.recv_CloseSession()
impala/_thrift_gen/TCLIService/TCLIService.py:247: in recv_CloseSession
    result.read(iprot)
impala/_thrift_gen/TCLIService/TCLIService.py:1550: in read
    self.success.read(iprot)
impala/_thrift_gen/TCLIService/ttypes.py:3390: in read
    iprot.skip(ftype)
.tox/py39/lib/python3.9/site-packages/thrift/protocol/TProtocol.py:214: in skip
    self.skip(ttype)
.tox/py39/lib/python3.9/site-packages/thrift/protocol/TProtocol.py:214: in skip
    self.skip(ttype)
.tox/py39/lib/python3.9/site-packages/thrift/protocol/TProtocol.py:207: in skip
    self.readString()
.tox/py39/lib/python3.9/site-packages/thrift/protocol/TProtocol.py:185: in readString
    return binary_to_str(self.readBinary())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

bin_val = b'\xc8\xbcJsL\xb3F\xcc\x00\x00\x00\x00\x08!\x8b\x9f'

    def binary_to_str(bin_val):
>       return bin_val.decode('utf8')
E       UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5: invalid start byte
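
The failing decode can be reproduced in isolation with the bin_val shown in the traceback. This is only a sketch of the decode step, not thrift code; describe_decode is a hypothetical helper, and the observation that the value is 16 bytes long (which would fit a binary identifier better than text) is a reading of the data, not something the traceback states.

```python
# bin_val copied from the traceback above.
bin_val = b'\xc8\xbcJsL\xb3F\xcc\x00\x00\x00\x00\x08!\x8b\x9f'


def describe_decode(raw):
    """Try to decode raw bytes as UTF-8 and report where decoding fails.

    Returns None on success, else (offset, offending byte, reason).
    """
    try:
        raw.decode('utf8')
    except UnicodeDecodeError as exc:
        return (exc.start, hex(raw[exc.start]), exc.reason)
    return None


print(len(bin_val))
print(describe_decode(bin_val))
```

The first two bytes happen to form a valid two-byte UTF-8 sequence, which is why decoding only fails at position 5.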

@paulmayer
Contributor

paulmayer commented Feb 6, 2025

I can reproduce this under 3.9 as well, and likewise under 3.8.20, 3.10.16, 3.12.8, and 3.13.1; only 3.11.11 passes reliably.

git rev-parse HEAD # 335c742bb069dbead90b9d57f5c27b2a9e9fcd39
uv venv 39 --python cpython-3.9.21
source 39/bin/activate 
uv pip install . "pytest>=6,<7" "sqlalchemy>=2" "requests"
uv pip freeze

Using Python 3.9.21 environment at: 39
attrs==25.1.0
bitarray==2.9.3
certifi==2025.1.31
charset-normalizer==3.4.1
greenlet==3.1.1
idna==3.10
impyla @ file:///home/paul/dev/tmp/impyla
iniconfig==2.0.0
packaging==24.2
pluggy==1.5.0
pure-sasl==0.6.2
py==1.11.0
pytest==6.2.5
requests==2.32.3
six==1.17.0
sqlalchemy==2.0.37
thrift==0.16.0
thrift-sasl==0.4.3
toml==0.10.2
typing-extensions==4.12.2
urllib3==2.3.0

uv run pytest  impala/tests/test_dbapi_connect.py

I re-ran thrift-gen via impala/thrift/process_thrift.sh without no_utf8strings, but the generated code is virtually the same for py3 (no difference in the result).

@paulmayer
Contributor

paulmayer commented Feb 6, 2025

Looking slightly deeper, this appears to be related to which TBinaryProtocolAccelerated._fast_decode implementation is in effect. With 3.11, it uses fastbinary; with 3.10 it doesn't (to pick one example: there I ended up with a thrift==0.16.0 installation for which the fastbinary build failed during setup.py). Forcing my 3.11 installation to not use fastbinary also results in the error.

Corollary: something appears to be wrong in the generated code path that doesn't use fastbinary (or fastbinary is more robust to certain edge cases). Potentially something with the struct definitions in the thrift files? Maybe even something server-side.
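
To tell which of the two paths a given environment ends up on, one can check whether the C extension is importable at all. A minimal sketch, assuming the extension lives at thrift.protocol.fastbinary (the module name used above); fastbinary_available is a hypothetical helper, not part of thrift or impyla.

```python
import importlib.util


def fastbinary_available():
    """Return True if the thrift C extension module can be located."""
    try:
        spec = importlib.util.find_spec("thrift.protocol.fastbinary")
    except ModuleNotFoundError:
        # thrift itself is not installed in this environment.
        return False
    return spec is not None


print("fastbinary available:", fastbinary_available())
```

If this prints False, the pure-Python decoding path is presumably the one in use, which is the case where the error shows up.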


When calling cur.execute with configuration={"debug_action":"IMPALAD_LOAD_TABLES_DELAY:SLEEP@4000"} (note: not with "debug_action":"IMPALAD_LOAD_TABLES_DELAY:SLEEP@400"), then the parsing of structs eventually trips up at TCloseSessionResp:

  • we first build a TStatus object TStatus(statusCode=0, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None)
  • The struct that is being read still contains a second field of TType.STRUCT (which shouldn't be there).
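
The skip -> skip -> readString chain from the callstack can be imitated without thrift. The following is a rough sketch of a binary-protocol skip, not the library's code: the payload layout (an unexpected field 2 holding a struct, which holds a struct with one 16-byte string/binary field) is an assumption chosen to match the depth of the traceback, and field, string, struct_, skip and try_skip are hypothetical names. The decode_strings flag mimics a skip that decodes string values as UTF-8 versus one that only consumes the bytes.

```python
import io
import struct

# Thrift binary-protocol type ids used in this sketch.
T_STOP, T_I32, T_STRING, T_STRUCT = 0, 8, 11, 12

# Bytes taken from the traceback in this issue.
GUID = b'\xc8\xbcJsL\xb3F\xcc\x00\x00\x00\x00\x08!\x8b\x9f'


def field(ftype, fid, payload):
    """Encode one field: 1-byte type, 2-byte field id, then the payload."""
    return struct.pack(">bh", ftype, fid) + payload


def string(raw):
    """Encode a string/binary value: 4-byte length, then the bytes."""
    return struct.pack(">i", len(raw)) + raw


def struct_(*fields):
    """Encode a struct: its fields followed by a STOP byte."""
    return b"".join(fields) + bytes([T_STOP])


def skip(buf, ttype, decode_strings):
    """Consume one value of type ttype from buf without keeping it."""
    if ttype == T_I32:
        buf.read(4)
    elif ttype == T_STRING:
        (length,) = struct.unpack(">i", buf.read(4))
        raw = buf.read(length)
        if decode_strings:
            raw.decode("utf8")  # fails on arbitrary binary data
    elif ttype == T_STRUCT:
        while True:
            (ftype,) = struct.unpack(">b", buf.read(1))
            if ftype == T_STOP:
                break
            buf.read(2)  # field id, not needed for skipping
            skip(buf, ftype, decode_strings)
    else:
        raise ValueError("type %d not handled in this sketch" % ttype)


# Field 1: a status struct (statusCode=0). Field 2: the unexpected struct.
status = struct_(field(T_I32, 1, struct.pack(">i", 0)))
unexpected = struct_(field(T_STRUCT, 1, struct_(field(T_STRING, 1, string(GUID)))))
response = struct_(field(T_STRUCT, 1, status), field(T_STRUCT, 2, unexpected))


def try_skip(decode_strings):
    """Skip the whole response and report the outcome as a string."""
    buf = io.BytesIO(response)
    try:
        skip(buf, T_STRUCT, decode_strings)
    except UnicodeDecodeError as exc:
        return "UnicodeDecodeError at position %d" % exc.start
    return "ok, %d bytes left" % len(buf.read())


print(try_skip(decode_strings=False))
print(try_skip(decode_strings=True))
```

For brevity the sketch skips the entire response instead of parsing the known status field first; the point is only that a skip which decodes strings fails on binary content, while a byte-level skip does not.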

@csringhofer
Collaborator Author

I wrote about the cause of the issue in #566 (comment).
Keeping this issue open, as behavior that differs depending on the Thrift protocol doesn't seem OK.
