Handle chunked responses when downloading resources #1442
Thank you for reporting. So basically the constant is unnecessary aside
from display? Should be an easy fix.
Do you have an example server I could try to verify the fix?
…On Fri, Jan 3, 2025, 8:06 AM dtk ***@***.***> wrote:
*Describe the bug*
stanza.download() fails to download resources from a host that sends a chunked
response <https://en.wikipedia.org/wiki/Chunked_transfer_encoding>.
In [1]: import stanza
In [2]: stanza.download('en')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-c2f724e525cb> in <module>
----> 1 stanza.download('en')

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download(lang, model_dir, package, processors, logging_level, verbose, resources_url, resources_branch, resources_version, model_url, proxies, download_json)
    577 if not download_json:
    578 logger.warning("Asked to skip downloading resources.json, but the file does not exist. Downloading anyway")
--> 579 download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath=None, proxies=proxies)
    580
    581 resources = load_resources_json(model_dir)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath, proxies)
    457 resources_filepath = os.path.join(model_dir, 'resources.json')
    458 # make request
--> 459 request_file(
    460 resources_url,
    461 resources_filepath,

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in request_file(url, path, proxies, md5, raise_for_status, log_info, alternate_md5)
    155 with tempfile.TemporaryDirectory(dir=basedir) as temp:
    156 temppath = os.path.join(temp, os.path.split(path)[-1])
--> 157 download_file(url, temppath, proxies, raise_for_status)
    158 os.replace(temppath, path)
    159 assert_file_exists(path, md5, alternate_md5)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_file(url, path, proxies, raise_for_status)
    121 r.raise_for_status()
    122 with open(path, 'wb') as f:
--> 123 file_size = int(r.headers.get('content-length'))
    124 default_chunk_size = 131072
    125 desc = 'Downloading ' + url

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
In [3]:
The problem can occur when downloading a resource via stanza.download()
from a custom host (set via STANZA_RESOURCES_URL)
In [3]: import os
In [4]: not os.environ['STANZA_RESOURCES_URL'].startswith('https://raw.githubusercontent.com/')
Out[4]: True
In [5]:
download_file()
<https://github.com/stanfordnlp/stanza/blob/main/stanza/resources/common.py#L114> unconditionally
parses the HTTP Content-Length header into an integer
<https://github.com/stanfordnlp/stanza/blob/main/stanza/resources/common.py#L123>
to optionally visualize a progress bar. However, if the server chose to
send a chunked response, it cannot and therefore does not contain a
Content-Length header. Passing the None value into int() leads to the
above TypeError.
- Since requests is an HTTP/1.1-only client
<psf/requests#5512 (comment)>,
chunked responses cannot be avoided by making HTTP/1.0 requests
- All HTTP/1.1 compliant clients are required to handle chunked
responses
<https://stackoverflow.com/questions/31969990/how-to-tell-the-http-server-to-not-send-chunked-encoding/31970668#31970668>.
They cannot be disabled.
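A minimal sketch of a tolerant way to read the header, assuming the size is only needed for the progress display. The helper name `parse_content_length` is hypothetical and is not part of stanza; this is not necessarily how the library resolves the problem.

```python
from typing import Mapping, Optional


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the declared body size, or None if it is absent or unusable.

    Assumption: `headers` is looked up with a lowercase key. The headers
    object of a `requests` response is case-insensitive; a plain dict must
    already use lowercase keys.
    """
    raw = headers.get("content-length")
    if raw is None:
        # Typical for a response sent with Transfer-Encoding: chunked.
        return None
    try:
        size = int(raw)
    except ValueError:
        # Malformed header value: treat the size as unknown.
        return None
    return size if size >= 0 else None


# A chunked response carries no Content-Length, so the size is unknown.
chunked_size = parse_content_length({"transfer-encoding": "chunked"})
# A regular response declares its size up front.
known_size = parse_content_length({"content-length": "479293702"})
```

The `None` result can be handed to a progress display that accepts an unknown total; `tqdm`, for instance, takes `total=None` and then shows only the byte count and rate instead of a bar.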
*To Reproduce*
Steps to reproduce the behavior:
1. Define a server via STANZA_RESOURCES_URL that sends the
resources_1.x.y.json in a chunked response
2. python3 -c 'import stanza; stanza.download("en")'
3. See stack trace
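For step 1, a local stand-in for such a server can be sketched with the standard library only. Everything here is an assumption made for illustration: the handler name `ChunkedHandler`, the placeholder `BODY`, and the request path. A real reproduction would have to serve an actual resources file at the path stanza requests, which is not shown.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BODY = b'{"placeholder": true}'  # stand-in payload, not a real resources file


class ChunkedHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked transfer encoding needs HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()  # note: no Content-Length header is sent
        # Each chunk is: size in hex, CRLF, data, CRLF.
        for start in range(0, len(BODY), 8):
            piece = BODY[start:start + 8]
            self.wfile.write(b"%x\r\n%s\r\n" % (len(piece), piece))
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the body

    def log_message(self, *args):
        pass  # silence per-request logging


def fetch_once():
    """Start the server on a free port, fetch one response, and shut down."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), ChunkedHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", "/resources.json")  # hypothetical path
        resp = conn.getresponse()
        headers = {name.lower(): value for name, value in resp.getheaders()}
        body = resp.read()  # http.client reassembles the chunks
        conn.close()
        return headers, body
    finally:
        server.shutdown()
        server.server_close()


headers, body = fetch_once()
```

The returned `headers` contain `transfer-encoding: chunked` and no `content-length`, which is the condition that triggers the `TypeError` above.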
*Expected behavior*
Downloads of resources should work from HTTP/1.1 compliant servers.
*Environment (please complete the following information):*
- OS: Ubuntu 20.04, MacOS 15.2
- Python version: Python 3.8.10, Python 3.13.1
- Stanza version: 1.4.0, 1.10.1
*Additional context*
We are using stanza in an enterprise setting and can only download
resources from a centralized caching server
<https://jfrog.com/de/artifactory/>.
…wnload succeeds even if it isn't present. #1442
I have pushed a fix to the dev branch. How urgent is this fix? Can you use the dev branch in your environment, or are only released versions possible? It's actually not super stressful to make a new release as long as the models haven't changed, and nothing's changed in the last week or so.
I'm looking into adding a test server framework which will let us unit test the downloads to catch things like this, possibly https://github.com/csernazs/pytest-httpserver
Thanks for the super quick response!
Unfortunately, the server I'm using is only reachable from within the company's network. However, I should be able to validate the fix in a non-productive environment next week.
I have implemented a hacky workaround (by preloading all resources via cURL and disabling all downloads through
Thanks again for the great response!
I can indeed confirm that the fix works for us:
(stanza-dev) stanza@f66684eeb8db:/tmp$ python3
Python 3.8.10 (default, Nov 7 2024, 13:10:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stanza
>>> stanza.download('en')
Downloading https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 25.6MB/s]
2025-01-06 11:34:11 INFO: Downloaded file to /home/stanza/stanza_resources/resources.json
2025-01-06 11:34:11 INFO: Downloading default packages for language: en (English) ...
Downloading https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip: 100%|███████████████████████████████████████████████████████████████████| 479M/479M [00:47<00:00, 10.2MB/s]
2025-01-06 11:35:00 INFO: Downloaded file to /home/stanza/stanza_resources/en/default.zip
2025-01-06 11:35:03 INFO: Finished downloading models and saved to /home/stanza/stanza_resources
That is for both chunked and un-chunked transfers (note the progress bar for the model download above):
>>> resources_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json')
>>> resources_response.headers.get('transfer-encoding')
'chunked'
>>> 'content-length' in resources_response.headers
False
>>> models_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip')
>>> models_response.headers.get('content-length')
'479293702'
>>>
Thank you!
Excellent, glad to hear it