Handle chunked responses when downloading resources #1442

Open
0x64746b opened this issue Jan 3, 2025 · 5 comments

0x64746b commented Jan 3, 2025

Describe the bug
stanza.download() fails to download resources from a host that sends a chunked response.

In [1]: import stanza

In [2]: stanza.download('en')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-c2f724e525cb> in <module>
----> 1 stanza.download('en')

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download(lang, model_dir, package, processors, logging_level, verbose, resources_url, resources_branch, resources_version, model_url, proxies, download_json)
    577         if not download_json:
    578             logger.warning("Asked to skip downloading resources.json, but the file does not exist.  Downloading anyway")
--> 579         download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath=None, proxies=proxies)
    580
    581     resources = load_resources_json(model_dir)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath, proxies)
    457         resources_filepath = os.path.join(model_dir, 'resources.json')
    458     # make request
--> 459     request_file(
    460         resources_url,
    461         resources_filepath,

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in request_file(url, path, proxies, md5, raise_for_status, log_info, alternate_md5)
    155     with tempfile.TemporaryDirectory(dir=basedir) as temp:
    156         temppath = os.path.join(temp, os.path.split(path)[-1])
--> 157         download_file(url, temppath, proxies, raise_for_status)
    158         os.replace(temppath, path)
    159     assert_file_exists(path, md5, alternate_md5)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_file(url, path, proxies, raise_for_status)
    121         r.raise_for_status()
    122     with open(path, 'wb') as f:
--> 123         file_size = int(r.headers.get('content-length'))
    124         default_chunk_size = 131072
    125         desc = 'Downloading ' + url

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

In [3]:

The problem can occur when downloading a resource via stanza.download() from a custom host (set via STANZA_RESOURCES_URL):

In [3]: import os

In [4]: not os.environ['STANZA_RESOURCES_URL'].startswith('https://raw.githubusercontent.com/')
Out[4]: True

In [5]:

download_file() unconditionally parses the HTTP Content-Length header into an integer in order to size an optional progress bar. However, a server that sends a chunked response (Transfer-Encoding: chunked) must not also send a Content-Length header, so r.headers.get('content-length') returns None. Passing that None into int() raises the TypeError above.
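
One way to tolerate the missing header is to treat the size as optional. The following is only a minimal sketch of that idea, not the fix that was actually committed; the helper name parse_content_length is hypothetical and does not exist in stanza.

```python
from typing import Mapping, Optional


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the declared body size, or None if it is absent or unusable.

    Hypothetical helper for illustration. It assumes a lowercase key lookup,
    which works with the case-insensitive headers object that requests
    provides, but a plain dict must already use the lowercase key.
    """
    raw = headers.get("content-length")
    if raw is None:
        # Chunked responses carry no Content-Length at all.
        return None
    try:
        size = int(raw)
    except ValueError:
        # A malformed value is treated the same as a missing one.
        return None
    return size if size >= 0 else None
```

A caller could then pass the result on as the progress bar total; with None, the bar would simply show the bytes received so far without a percentage.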

To Reproduce
Steps to reproduce the behavior:

  1. Define a server via STANZA_RESOURCES_URL that sends the resources_1.x.y.json in a chunked response
  2. python3 -c 'import stanza; stanza.download("en")'
  3. See stack trace
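
For readers without access to such a server, step 1 could be approximated locally. The sketch below uses only the Python standard library to serve a small body with chunked transfer encoding and then fetches it again to show that no Content-Length header arrives. The payload, the request path, and the names ChunkedHandler and fetch_from_chunked_server are placeholders chosen for illustration; whether stanza accepts such a server as STANZA_RESOURCES_URL depends on the URL layout it expects, which is not covered here.

```python
import http.server
import threading
import urllib.request

PAYLOAD = b'{"en": {}}'  # placeholder body, not a real resources file


class ChunkedHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked encoding only exists in HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        # Each chunk is: hex length, CRLF, data, CRLF.
        for i in range(0, len(PAYLOAD), 4):
            piece = PAYLOAD[i:i + 4]
            self.wfile.write(b"%x\r\n%s\r\n" % (len(piece), piece))
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the body

    def log_message(self, *args):
        pass  # suppress per-request logging


def fetch_from_chunked_server():
    """Start the server on a free port, fetch once, and shut it down."""
    server = http.server.HTTPServer(("127.0.0.1", 0), ChunkedHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/resources.json" % server.server_address[1]
        with urllib.request.urlopen(url) as resp:
            return (
                resp.headers.get("Content-Length"),
                resp.headers.get("Transfer-Encoding"),
                resp.read(),
            )
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_from_chunked_server() returns the Content-Length header value, the Transfer-Encoding header value, and the decoded body, so the missing header can be observed directly.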

Expected behavior
Downloads of resources should work from HTTP/1.1 compliant servers.

Environment (please complete the following information):

  • OS: Ubuntu 20.04, macOS 15.2
  • Python version: Python 3.8.10, Python 3.13.1
  • Stanza version: 1.4.0, 1.10.1

Additional context
We are using stanza in an enterprise setting and can only download resources from a centralized caching server.

0x64746b added the bug label Jan 3, 2025
@AngledLuffa
Collaborator

AngledLuffa commented Jan 3, 2025 via email

AngledLuffa added a commit that referenced this issue Jan 4, 2025
@AngledLuffa
Collaborator

Have pushed a fix to the dev branch.

How urgent is this fix, and can you use the dev branch in your environment, or are only released versions an option? It's actually not super stressful to make a new release as long as the models haven't changed, and nothing has changed in the last week or so.

I'm looking into adding a test server framework which will let us unit test the downloads to catch things like this, possibly https://github.com/csernazs/pytest-httpserver

@0x64746b
Author

0x64746b commented Jan 4, 2025

Thanks for the super quick response!

Do you have an example server I could try to verify the fix?

Unfortunately, the server I'm using is only reachable from within the company's network. However, I should be able to validate the fix in a non-production environment next week.

> How urgent is this fix

I have implemented a hacky workaround (by preloading all resources via cURL and disabling all downloads through stanza), so we currently aren't blocked and are happy to wait for the proper release.

Thanks again for the great response!
dtk

@0x64746b
Author

0x64746b commented Jan 6, 2025

I can indeed confirm that the fix works for us:

(stanza-dev) stanza@f66684eeb8db:/tmp$ python3
Python 3.8.10 (default, Nov  7 2024, 13:10:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stanza
>>> stanza.download('en')
Downloading https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 25.6MB/s]
2025-01-06 11:34:11 INFO: Downloaded file to /home/stanza/stanza_resources/resources.json
2025-01-06 11:34:11 INFO: Downloading default packages for language: en (English) ...
Downloading https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip: 100%|███████████████████████████████████████████████████████████████████| 479M/479M [00:47<00:00, 10.2MB/s]
2025-01-06 11:35:00 INFO: Downloaded file to /home/stanza/stanza_resources/en/default.zip
2025-01-06 11:35:03 INFO: Finished downloading models and saved to /home/stanza/stanza_resources

That is for both chunked and un-chunked transfers (note the progress bar for the model download above):

>>> resources_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json')
>>> resources_response.headers.get('transfer-encoding')
'chunked'
>>> 'content-length' in resources_response.headers
False
>>> models_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip')
>>> models_response.headers.get('content-length')
'479293702'
>>>
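
The two progress displays in the session above (a plain byte counter for the chunked resources file, a percentage bar for the model archive) follow from whether a total size is known. The following is a rough sketch of that behavior under the assumption that the total is passed as an optional value; copy_with_progress and format_progress are hypothetical names and do not reflect how stanza or its progress bar library implement this.

```python
import io
from typing import BinaryIO, Callable, Iterable, Optional


def format_progress(written: int, total: Optional[int]) -> str:
    """Show a percentage only when the total size is known."""
    if total:
        return f"{written}/{total} bytes ({100 * written // total}%)"
    return f"{written} bytes"


def copy_with_progress(
    chunks: Iterable[bytes],
    out: BinaryIO,
    total: Optional[int],
    report: Callable[[int, Optional[int]], None],
) -> int:
    """Write every non-empty chunk to `out` and report the running byte count."""
    written = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        out.write(chunk)
        written += len(chunk)
        report(written, total)
    return written


# Usage: a chunked transfer has no known total, so None is passed.
buffer = io.BytesIO()
copy_with_progress(
    [b"ab", b"", b"cde"],
    buffer,
    None,
    lambda written, total: print(format_progress(written, total)),
)
```

With a total of None the report degrades to a running byte count, which matches the kind of output seen for the chunked resources download.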

Thank you!

@AngledLuffa
Collaborator

Excellent, glad to hear it
