-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Base moodle's file_exist method on the file content instead of headers #6841
Conversation
7a883af
to
5ecbd43
Compare
lms/services/moodle.py
Outdated
# We don't want to download the full file so we'll do a GET request for the first bytes | ||
# and check if the files is indeed a PDF. | ||
# Most error response end up being JSON but not all of them are, some are HTML. | ||
LOG.info("Content from Moodle file check %s", response.text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adapting the log line added yesterday to response.text, you can't have too much logging.
lms/services/moodle.py
Outdated
@@ -137,14 +137,15 @@ def list_files(self, course_id: int): | |||
def file_exists(self, file_id) -> bool: | |||
"""Check if the file exists in the course.""" | |||
# Moodle file IDs are URLs, but they need the token to be accessible | |||
response = self._http.request("HEAD", f"{file_id}&token={self.token}") | |||
response = self._http.request( | |||
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-5"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to google the Range
header. Leaving the docs here for future readers context https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested an approach in which we still check for Content-type
and only if the header does not exist, we inspect the body contents. Unless we think checking for '%PDF'
is reliable enough.
lms/services/moodle.py
Outdated
# and check if the files is indeed a PDF. | ||
# Most error response end up being JSON but not all of them are, some are HTML. | ||
LOG.info("Content from Moodle file check %s", response.text) | ||
return "%PDF" in response.text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could still check for the content-type header, and fall back to the response body just if Content-type doesn't exist?
Headers are at the very beginning of the response message payload, so I would expect them to be available in this case as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noted a possible issue re. testing for ASCII bytes vs Unicode characters.
lms/services/moodle.py
Outdated
# and check if the files is indeed a PDF. | ||
# Most error response end up being JSON but not all of them are, some are HTML. | ||
LOG.info("Content from Moodle file check %s", response.text) | ||
return "%PDF" in response.text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does response.text
do if the file is not valid UTF-8, which is entirely possible here? You should be checking for the ASCII bytes "%PDF" rather than Unicode characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. UTF-8 is not actually the default assumed encoding because these are old standards. I'm not clear if response.text
can fail, but nevertheless treating the response as binary would be the correct thing to do.
Edit: It looks like response.text
will always use substitutions if errors are encountered rather than failing - https://github.com/psf/requests/blob/a6cf27a77f6f5dd6116096e95c16e7c1a616b419/src/requests/models.py#L910.
lms/services/moodle.py
Outdated
@@ -137,14 +137,15 @@ def list_files(self, course_id: int): | |||
def file_exists(self, file_id) -> bool: | |||
"""Check if the file exists in the course.""" | |||
# Moodle file IDs are URLs, but they need the token to be accessible | |||
response = self._http.request("HEAD", f"{file_id}&token={self.token}") | |||
response = self._http.request( | |||
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-5"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard check is to look for "%PDF" in the first 1024 bytes of the file. Most PDFs will have these characters at the very start, but there may be files that do not, yet which work in other viewers because.
In moodle we can't make a API call to check if a file exists and we have to rely on the URL where file will be downloaded. When a problem occurs we get a 200 response with a JSON document with the reason of the error. We don't want to always download the document as it will actually download the full PDF file for success cases, the most common case. Until now we been relying on the headers of a HEAD HTTP request, interpreting JSON responses as the file being missing. We have found at least one school that doesn't include the content-type header on the response so he are switch the approach to inspect the first bytes of the response and check if we are getting a PDF back from Moodle.
5ecbd43
to
9139a7d
Compare
Made some changes based on the suggestions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
In the present, I agree. My comment was mostly about doing the "proper" thing to avoid nasty surprises in future because our code is improperly treating binary data as text and the handling of this changes (see eg. hypothesis/h#8953) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. In the process of reviewing this I found that some PDF viewers (eg. Acrobat) have tightened up the rule about allowing PDF files with junk before the %PDF
magic. Being tolerant here is probably still the right thing to do though.
LOG.info("Headers from Moodle file check %s", response.headers) | ||
return not response.headers["content-type"].startswith("application/json") | ||
response = self._http.request( | ||
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-1024"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue already existed before, but can we be certain that self.token
will never contain characters that need to be escaped in a query param?
LOG.info("Headers from Moodle file check %s", response.headers) | ||
return not response.headers["content-type"].startswith("application/json") | ||
response = self._http.request( | ||
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-1024"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out my information about Adobe requiring "%PDF" in the first 1024 bytes of the file is out of date, per https://helpx.adobe.com/acrobat/kb/pdf-error-1015-11001-update.html. I expect other PDF viewers are still more liberal though.
In moodle we can't make a API call to check if a file exists and we have to rely on the URL where file will be downloaded.
When a problem occurs we get a 200 response with a JSON document with the reason of the error.
We don't want to always download the document as it will actually download the full PDF file for success cases, the most common case.
Until now we been relying on the headers of a HEAD HTTP request, interpreting JSON responses as the file being missing.
We have found at least one school that doesn't include the content-type header on the response so he are switch the approach to inspect the first bytes of the response and check if we are getting a PDF back from Moodle.
Here's the headers from the file in the Moodle instance that doesn't include content-type:
Testing
Open:
localhost - Moodle File
localhost - Deleted Moodle file
Getting a file on the first and an error in the second.