Handle chunked responses when downloading resources #1442

0x64746b · 2025-01-03T14:06:35Z

Describe the bug
stanza.download() fails to download resources from a host that sends a chunked response.

In [1]: import stanza

In [2]: stanza.download('en')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-c2f724e525cb> in <module>
----> 1 stanza.download('en')

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download(lang, model_dir, package, processors, logging_level, verbose, resources_url, resources_branch, resources_version, model_url, proxies, download_json)
    577         if not download_json:
    578             logger.warning("Asked to skip downloading resources.json, but the file does not exist.  Downloading anyway")
--> 579         download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath=None, proxies=proxies)
    580
    581     resources = load_resources_json(model_dir)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath, proxies)
    457         resources_filepath = os.path.join(model_dir, 'resources.json')
    458     # make request
--> 459     request_file(
    460         resources_url,
    461         resources_filepath,

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in request_file(url, path, proxies, md5, raise_for_status, log_info, alternate_md5)
    155     with tempfile.TemporaryDirectory(dir=basedir) as temp:
    156         temppath = os.path.join(temp, os.path.split(path)[-1])
--> 157         download_file(url, temppath, proxies, raise_for_status)
    158         os.replace(temppath, path)
    159     assert_file_exists(path, md5, alternate_md5)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_file(url, path, proxies, raise_for_status)
    121         r.raise_for_status()
    122     with open(path, 'wb') as f:
--> 123         file_size = int(r.headers.get('content-length'))
    124         default_chunk_size = 131072
    125         desc = 'Downloading ' + url

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

In [3]:

The problem can occur when downloading a resource via stanza.download() from a custom host (set via STANZA_RESOURCES_URL)

In [3]: import os

In [4]: not os.environ['STANZA_RESOURCES_URL'].startswith('https://raw.githubusercontent.com/')
Out[4]: True

In [5]:

download_file() unconditionally parses the HTTP Content-Length header into an integer, which is only used to optionally display a progress bar. However, if the server sends a chunked response, that response does not contain a Content-Length header, because the two are mutually exclusive. r.headers.get('content-length') then returns None, and passing None into int() raises the above TypeError.

- Since requests is an HTTP/1.1-only client, chunked responses cannot be avoided by making HTTP/1.0 requests.
- All HTTP/1.1 compliant clients are required to handle chunked responses. They cannot be disabled.
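
One way to tolerate the missing header is to treat the size as optional. The following is only a minimal sketch, not the code in stanza: parse_content_length is a hypothetical helper name, and it assumes the caller passes a mapping with lower-case keys (the headers object in requests is case-insensitive, a plain dict is not).

```python
from typing import Mapping, Optional


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the declared body size in bytes, or None if it is unknown.

    Hypothetical helper for illustration; not part of stanza or requests.
    """
    raw = headers.get("content-length")
    if raw is None:
        # Chunked responses carry no Content-Length header.
        return None
    try:
        size = int(raw)
    except ValueError:
        # A malformed value is treated the same as a missing one.
        return None
    return size if size >= 0 else None


# Size known: a progress bar can show a percentage.
print(parse_content_length({"content-length": "479293702"}))
# Size unknown: the download can still proceed, just without a total.
print(parse_content_length({"transfer-encoding": "chunked"}))
```

A None result can then be passed on as the progress bar total; tqdm accepts total=None and shows only the running byte count in that case.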

To Reproduce
Steps to reproduce the behavior:

1. Define a server via STANZA_RESOURCES_URL that sends the resources_1.x.y.json in a chunked response
2. python3 -c 'import stanza; stanza.download("en")'
3. See stack trace
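
For step 1, a local stand-in server could look roughly like the sketch below. It uses only the Python standard library and makes several assumptions: the payload is a placeholder rather than a real resources file, the handler ignores the request path, and the names ChunkedHandler, serve_in_background and fetch are made up for this example. Whether stanza accepts this exact URL layout has not been verified here; the sketch only shows that a client sees no Content-Length on such a response.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b'{"en": {}}'  # placeholder payload, not a real resources file


class ChunkedHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked encoding requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Transfer-Encoding", "chunked")
        self.send_header("Connection", "close")
        self.end_headers()
        # Each chunk is: hex length, CRLF, data, CRLF.
        for start in range(0, len(BODY), 4):
            chunk = BODY[start:start + 4]
            self.wfile.write(b"%x\r\n%s\r\n" % (len(chunk), chunk))
        # A zero-length chunk terminates the body.
        self.wfile.write(b"0\r\n\r\n")

    def log_message(self, format, *args):
        pass  # keep the console quiet


def serve_in_background():
    """Start the server on a free local port and return it."""
    server = HTTPServer(("127.0.0.1", 0), ChunkedHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url):
    """Return (Content-Length header or None, decoded body)."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        return response.headers.get("Content-Length"), response.read()


if __name__ == "__main__":
    server = serve_in_background()
    port = server.server_address[1]
    print(fetch("http://127.0.0.1:%d/resources.json" % port))
    server.shutdown()
    server.server_close()
```

The header value returned by fetch is None for this server, which is the same condition that makes int(r.headers.get('content-length')) fail in the traceback above.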

Expected behavior
Downloads of resources should work from HTTP/1.1 compliant servers.

Environment (please complete the following information):

OS: Ubuntu 20.04, MacOS 15.2
Python version: Python 3.8.10, Python 3.13.1
Stanza version: 1.4.0, 1.10.1

Additional context
We are using stanza in an enterprise setting and can only download resources from a centralized caching server.


AngledLuffa · 2025-01-03T19:04:12Z

Thank you for reporting. So basically the constant is unnecessary aside from display? Should be an easy fix. Do you have an example server I could try to verify the fix?


AngledLuffa · 2025-01-04T06:56:22Z

Have pushed a fix to the dev branch.

How urgent is this fix, and can you use the dev branch in your environment, or are only released versions possible? It's actually not super stressful to make a new release as long as the models haven't changed, and nothing's changed in the last week or so.

I'm looking into adding a test server framework which will let us unit test the downloads to catch things like this, possibly https://github.com/csernazs/pytest-httpserver

0x64746b · 2025-01-04T10:11:44Z

Thanks for the super quick response!

> Do you have an example server I could try to verify the fix?

Unfortunately, the server I'm using is only reachable from within the company's network. However, I should be able to validate the fix in a non-productive environment next week.

> How urgent is this fix

I have implemented a hacky workaround (by preloading all resources via cURL and disabling all downloads through stanza), so we currently aren't blocked and are happy to wait for the proper release.

Thanks again for the great response!
dtk

0x64746b · 2025-01-06T12:30:52Z

I can indeed confirm that the fix works for us:

(stanza-dev) stanza@f66684eeb8db:/tmp$ python3
Python 3.8.10 (default, Nov  7 2024, 13:10:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stanza
>>> stanza.download('en')
Downloading https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 25.6MB/s]
2025-01-06 11:34:11 INFO: Downloaded file to /home/stanza/stanza_resources/resources.json
2025-01-06 11:34:11 INFO: Downloading default packages for language: en (English) ...
Downloading https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip: 100%|███████████████████████████████████████████████████████████████████| 479M/479M [00:47<00:00, 10.2MB/s]
2025-01-06 11:35:00 INFO: Downloaded file to /home/stanza/stanza_resources/en/default.zip
2025-01-06 11:35:03 INFO: Finished downloading models and saved to /home/stanza/stanza_resources

That is for both chunked and un-chunked transfers (note the progress bar for the model download above):

>>> resources_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json')
>>> resources_response.headers.get('transfer-encoding')
'chunked'
>>> 'content-length' in resources_response.headers
False
>>> models_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip')
>>> models_response.headers.get('content-length')
'479293702'
>>>

Thank you!

AngledLuffa · 2025-01-06T18:36:59Z

Excellent, glad to hear it

0x64746b added the bug label Jan 3, 2025

AngledLuffa added a commit that referenced this issue Jan 4, 2025

file_size is only used for making a pretty tqdm, so make it so the download succeeds even if it isn't present. #1442

744ea93
