Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base moodle's file_exist method on the file content instead of headers #6841

Merged
merged 1 commit into from
Nov 6, 2024

Conversation

marcospri
Copy link
Member

In moodle we can't make a API call to check if a file exists and we have to rely on the URL where file will be downloaded.

When a problem occurs we get a 200 response with a JSON document with the reason of the error.

We don't want to always download the document as it will actually download the full PDF file for success cases, the most common case.

Until now we been relying on the headers of a HEAD HTTP request, interpreting JSON responses as the file being missing.

We have found at least one school that doesn't include the content-type header on the response so he are switch the approach to inspect the first bytes of the response and check if we are getting a PDF back from Moodle.

Here's the headers from the file in the Moodle instance that doesn't include content-type:

 {'Date': 'Tue, 05 Nov 2024 14:42:55 GMT', 'Content-Length': '694351', 'Connection': 'keep-alive', 'Expires': 'Tue, 05 Nov 2024 16:43:00 GMT', 'Cache-Control': 'private, max-age=7205', 'Last-Modified': 'Fri, 01 Nov 2024 10:38:06 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'x-trace': 'ip-172-31-65-239->amazon_172_31_65_222', 'Age': '0', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains', 'X-Cache': 'MISS'}

Testing

Open:

localhost - Moodle File

localhost - Deleted Moodle file

Getting a file on the first and an error in the second.

# We don't want to download the full file so we'll do a GET request for the first bytes
# and check if the files is indeed a PDF.
# Most error response end up being JSON but not all of them are, some are HTML.
LOG.info("Content from Moodle file check %s", response.text)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapting the log line added yesterday to response.text, you can't have too much logging.

@@ -137,14 +137,15 @@ def list_files(self, course_id: int):
def file_exists(self, file_id) -> bool:
"""Check if the file exists in the course."""
# Moodle file IDs are URLs, but they need the token to be accessible
response = self._http.request("HEAD", f"{file_id}&token={self.token}")
response = self._http.request(
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-5"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to google the Range header. Leaving the docs here for future readers context https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range

Copy link
Contributor

@acelaya acelaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested an approach in which we still check for Content-type and only if the header does not exist, we inspect the body contents. Unless we think checking for '%PDF' is reliable enough.

# and check if the files is indeed a PDF.
# Most error response end up being JSON but not all of them are, some are HTML.
LOG.info("Content from Moodle file check %s", response.text)
return "%PDF" in response.text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could still check for the content-type header, and fall back to the response body just if Content-type doesn't exist?

Headers are at the very beginning of the response message payload, so I would expect them to be available in this case as well.

Copy link
Member

@robertknight robertknight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noted a possible issue re. testing for ASCII bytes vs Unicode characters.

# and check if the files is indeed a PDF.
# Most error response end up being JSON but not all of them are, some are HTML.
LOG.info("Content from Moodle file check %s", response.text)
return "%PDF" in response.text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does response.text do if the file is not valid UTF-8, which is entirely possible here? You should be checking for the ASCII bytes "%PDF" rather than Unicode characters.

Copy link
Member

@robertknight robertknight Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. UTF-8 is not actually the default assumed encoding because these are old standards. I'm not clear if response.text can fail, but nevertheless treating the response as binary would be the correct thing to do.

Edit: It looks like response.text will always use substitutions if errors are encountered rather than failing - https://github.com/psf/requests/blob/a6cf27a77f6f5dd6116096e95c16e7c1a616b419/src/requests/models.py#L910.

@@ -137,14 +137,15 @@ def list_files(self, course_id: int):
def file_exists(self, file_id) -> bool:
"""Check if the file exists in the course."""
# Moodle file IDs are URLs, but they need the token to be accessible
response = self._http.request("HEAD", f"{file_id}&token={self.token}")
response = self._http.request(
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-5"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard check is to look for "%PDF" in the first 1024 bytes of the file. Most PDFs will have these characters at the very start, but there may be files that do not, yet which work in other viewers because.

In moodle we can't make a API call to check if a file exists and we have
to rely on the URL where file will be downloaded.

When a problem occurs we get a 200 response with  a JSON document
with the reason of the error.

We don't want to always download the document as it will actually
download the full PDF file for success cases, the most common case.

Until now we been relying on the headers of a HEAD HTTP request,
interpreting JSON responses as the file being missing.

We have found at least one school that doesn't include the content-type
header on the response so he are switch the approach to inspect the
first bytes of the response and check if we are getting a PDF back from
Moodle.
@marcospri
Copy link
Member Author

@robertknight @acelaya

Made some changes based on the suggestions:

  • We keep the old behaviour when we do get headers, that's being working for over a year so let's keep it that way. Use the PDF detection as a fallback when there's no headers.
  • Now we read the first 1024 bytes of the document instead of just 5
  • We check response.content instead of response.text. I don't think it would make a big difference but we are looking for bytes so that more explicit.

Copy link
Contributor

@acelaya acelaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robertknight
Copy link
Member

We check response.content instead of response.text. I don't think it would make a big difference but we are looking for bytes so that more explicit.

In the present, I agree. My comment was mostly about doing the "proper" thing to avoid nasty surprises in future because our code is improperly treating binary data as text and the handling of this changes (see eg. hypothesis/h#8953)

Copy link
Member

@robertknight robertknight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. In the process of reviewing this I found that some PDF viewers (eg. Acrobat) have tightened up the rule about allowing PDF files with junk before the %PDF magic. Being tolerant here is probably still the right thing to do though.

LOG.info("Headers from Moodle file check %s", response.headers)
return not response.headers["content-type"].startswith("application/json")
response = self._http.request(
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-1024"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue already existed before, but can we be certain that self.token will never contain characters that need to be escaped in a query param?

LOG.info("Headers from Moodle file check %s", response.headers)
return not response.headers["content-type"].startswith("application/json")
response = self._http.request(
"GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-1024"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out my information about Adobe requiring "%PDF" in the first 1024 bytes of the file is out of date, per https://helpx.adobe.com/acrobat/kb/pdf-error-1015-11001-update.html. I expect other PDF viewers are still more liberal though.

@marcospri marcospri merged commit eb21f29 into main Nov 6, 2024
9 checks passed
@marcospri marcospri deleted the moodle-is-file-pdf branch November 6, 2024 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants