Base moodle's file_exist method on the file content instead of headers #6841

marcospri · 2024-11-05T14:43:46Z

In moodle we can't make a API call to check if a file exists and we have to rely on the URL where file will be downloaded.

When a problem occurs we get a 200 response with a JSON document with the reason of the error.

We don't want to always download the document as it will actually download the full PDF file for success cases, the most common case.

Until now we been relying on the headers of a HEAD HTTP request, interpreting JSON responses as the file being missing.

We have found at least one school that doesn't include the content-type header on the response so he are switch the approach to inspect the first bytes of the response and check if we are getting a PDF back from Moodle.

Here's the headers from the file in the Moodle instance that doesn't include content-type:

 {'Date': 'Tue, 05 Nov 2024 14:42:55 GMT', 'Content-Length': '694351', 'Connection': 'keep-alive', 'Expires': 'Tue, 05 Nov 2024 16:43:00 GMT', 'Cache-Control': 'private, max-age=7205', 'Last-Modified': 'Fri, 01 Nov 2024 10:38:06 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'x-trace': 'ip-172-31-65-239->amazon_172_31_65_222', 'Age': '0', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains', 'X-Cache': 'MISS'}

Testing

Open:

localhost - Moodle File

localhost - Deleted Moodle file

Getting a file on the first and an error in the second.

marcospri · 2024-11-05T14:47:02Z

lms/services/moodle.py

+        # We don't want to download the full file so we'll do a GET request for the first bytes
+        # and check if the files is indeed a PDF.
+        # Most error response end up being JSON but not all of them are, some are HTML.
+        LOG.info("Content from Moodle file check %s", response.text)


Adapting the log line added yesterday to response.text, you can't have too much logging.

acelaya · 2024-11-05T14:51:26Z

lms/services/moodle.py

@@ -137,14 +137,15 @@ def list_files(self, course_id: int):
    def file_exists(self, file_id) -> bool:
        """Check if the file exists in the course."""
        # Moodle file IDs are URLs, but they need the token to be accessible
-        response = self._http.request("HEAD", f"{file_id}&token={self.token}")
+        response = self._http.request(
+            "GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-5"}


I had to google the Range header. Leaving the docs here for future readers context https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range

acelaya

I suggested an approach in which we still check for Content-type and only if the header does not exist, we inspect the body contents. Unless we think checking for '%PDF' is reliable enough.

acelaya · 2024-11-05T14:54:07Z

lms/services/moodle.py

+        # and check if the files is indeed a PDF.
+        # Most error response end up being JSON but not all of them are, some are HTML.
+        LOG.info("Content from Moodle file check %s", response.text)
+        return "%PDF" in response.text


Maybe we could still check for the content-type header, and fall back to the response body just if Content-type doesn't exist?

Headers are at the very beginning of the response message payload, so I would expect them to be available in this case as well.

robertknight

I noted a possible issue re. testing for ASCII bytes vs Unicode characters.

robertknight · 2024-11-05T14:59:22Z

lms/services/moodle.py

+        # and check if the files is indeed a PDF.
+        # Most error response end up being JSON but not all of them are, some are HTML.
+        LOG.info("Content from Moodle file check %s", response.text)
+        return "%PDF" in response.text


What does response.text do if the file is not valid UTF-8, which is entirely possible here? You should be checking for the ASCII bytes "%PDF" rather than Unicode characters.

Oh, right. UTF-8 is not actually the default assumed encoding because these are old standards. I'm not clear if response.text can fail, but nevertheless treating the response as binary would be the correct thing to do.

Edit: It looks like response.text will always use substitutions if errors are encountered rather than failing - https://github.com/psf/requests/blob/a6cf27a77f6f5dd6116096e95c16e7c1a616b419/src/requests/models.py#L910.

robertknight · 2024-11-05T15:00:22Z

lms/services/moodle.py

@@ -137,14 +137,15 @@ def list_files(self, course_id: int):
    def file_exists(self, file_id) -> bool:
        """Check if the file exists in the course."""
        # Moodle file IDs are URLs, but they need the token to be accessible
-        response = self._http.request("HEAD", f"{file_id}&token={self.token}")
+        response = self._http.request(
+            "GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-5"}


The standard check is to look for "%PDF" in the first 1024 bytes of the file. Most PDFs will have these characters at the very start, but there may be files that do not, yet which work in other viewers because.

In moodle we can't make a API call to check if a file exists and we have to rely on the URL where file will be downloaded. When a problem occurs we get a 200 response with a JSON document with the reason of the error. We don't want to always download the document as it will actually download the full PDF file for success cases, the most common case. Until now we been relying on the headers of a HEAD HTTP request, interpreting JSON responses as the file being missing. We have found at least one school that doesn't include the content-type header on the response so he are switch the approach to inspect the first bytes of the response and check if we are getting a PDF back from Moodle.

marcospri · 2024-11-06T09:49:24Z

@robertknight @acelaya

Made some changes based on the suggestions:

We keep the old behaviour when we do get headers, that's being working for over a year so let's keep it that way. Use the PDF detection as a fallback when there's no headers.
Now we read the first 1024 bytes of the document instead of just 5
We check response.content instead of response.text. I don't think it would make a big difference but we are looking for bytes so that more explicit.

acelaya

LGTM

robertknight · 2024-11-06T10:45:10Z

We check response.content instead of response.text. I don't think it would make a big difference but we are looking for bytes so that more explicit.

In the present, I agree. My comment was mostly about doing the "proper" thing to avoid nasty surprises in future because our code is improperly treating binary data as text and the handling of this changes (see eg. hypothesis/h#8953)

robertknight

LGTM. In the process of reviewing this I found that some PDF viewers (eg. Acrobat) have tightened up the rule about allowing PDF files with junk before the %PDF magic. Being tolerant here is probably still the right thing to do though.

robertknight · 2024-11-06T10:46:59Z

lms/services/moodle.py

-        LOG.info("Headers from Moodle file check %s", response.headers)
-        return not response.headers["content-type"].startswith("application/json")
+        response = self._http.request(
+            "GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-1024"}


This issue already existed before, but can we be certain that self.token will never contain characters that need to be escaped in a query param?

robertknight · 2024-11-06T10:52:31Z

lms/services/moodle.py

-        LOG.info("Headers from Moodle file check %s", response.headers)
-        return not response.headers["content-type"].startswith("application/json")
+        response = self._http.request(
+            "GET", f"{file_id}&token={self.token}", headers={"Range": "bytes=0-1024"}


It turns out my information about Adobe requiring "%PDF" in the first 1024 bytes of the file is out of date, per https://helpx.adobe.com/acrobat/kb/pdf-error-1015-11001-update.html. I expect other PDF viewers are still more liberal though.

marcospri force-pushed the moodle-is-file-pdf branch from 7a883af to 5ecbd43 Compare November 5, 2024 14:46

marcospri commented Nov 5, 2024

View reviewed changes

marcospri requested a review from acelaya November 5, 2024 14:47

acelaya reviewed Nov 5, 2024

View reviewed changes

acelaya approved these changes Nov 5, 2024

View reviewed changes

robertknight requested changes Nov 5, 2024

View reviewed changes

marcospri force-pushed the moodle-is-file-pdf branch from 5ecbd43 to 9139a7d Compare November 6, 2024 09:44

marcospri requested review from acelaya and robertknight November 6, 2024 09:49

acelaya approved these changes Nov 6, 2024

View reviewed changes

robertknight approved these changes Nov 6, 2024

View reviewed changes

marcospri merged commit eb21f29 into main Nov 6, 2024
9 checks passed

marcospri deleted the moodle-is-file-pdf branch November 6, 2024 12:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Base moodle's file_exist method on the file content instead of headers #6841

Base moodle's file_exist method on the file content instead of headers #6841

marcospri commented Nov 5, 2024

marcospri Nov 5, 2024

acelaya Nov 5, 2024

acelaya left a comment

acelaya Nov 5, 2024

robertknight left a comment

robertknight Nov 5, 2024

robertknight Nov 5, 2024 •

edited

Loading

robertknight Nov 5, 2024

marcospri commented Nov 6, 2024

acelaya left a comment

robertknight commented Nov 6, 2024

robertknight left a comment

robertknight Nov 6, 2024

robertknight Nov 6, 2024

Base moodle's file_exist method on the file content instead of headers #6841

Base moodle's file_exist method on the file content instead of headers #6841

Conversation

marcospri commented Nov 5, 2024

Testing

marcospri Nov 5, 2024

Choose a reason for hiding this comment

acelaya Nov 5, 2024

Choose a reason for hiding this comment

acelaya left a comment

Choose a reason for hiding this comment

acelaya Nov 5, 2024

Choose a reason for hiding this comment

robertknight left a comment

Choose a reason for hiding this comment

robertknight Nov 5, 2024

Choose a reason for hiding this comment

robertknight Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

robertknight Nov 5, 2024

Choose a reason for hiding this comment

marcospri commented Nov 6, 2024

acelaya left a comment

Choose a reason for hiding this comment

robertknight commented Nov 6, 2024

robertknight left a comment

Choose a reason for hiding this comment

robertknight Nov 6, 2024

Choose a reason for hiding this comment

robertknight Nov 6, 2024

Choose a reason for hiding this comment

robertknight Nov 5, 2024 •

edited

Loading