Vague suggestion: Utilities for parsing strings #166

Open
Aran-Fey opened this issue Jan 14, 2025 · 3 comments

@Aran-Fey

Parsing strings has turned out to be unexpectedly challenging. I've spent the last hour trying to figure out which text encoding is used in multipart form data, and I still haven't got a clue. There's no Content-Type/charset header anywhere to be found, and some sources say it's UTF-8 while others say HTTP requests are ISO-8859-1.

So, it would be very nice if this module had built-in support for parsing strings, ideally in a way that works together with the ListTarget so that we can also parse lists of strings.

@siddhantgoel
Owner

I think the encoding might depend on a bunch of different things, at least going by the RFC. Could you post the raw request body that you're working with?

@Aran-Fey
Author

Aran-Fey commented Jan 14, 2025

Here's an example request where the file_names parameter should be "Eine größere Textdatei.txt":

b'------geckoformboundaryc1734bfb1ebb04d62438bb4100c2be6\r\nContent-Disposition: form-data; name="file_names"\r\n\r\nEine gr\xc3\xb6\xc3\x9fere Textdatei.txt\r\n------geckoformboundaryc1734bfb1ebb04d62438bb4100c2be6\r\nContent-Disposition: form-data; name="file_types"\r\n\r\ntext/plain\r\n------geckoformboundaryc1734bfb1ebb04d62438bb4100c2be6\r\nContent-Disposition: form-data; name="file_sizes"\r\n\r\n17\r\n------geckoformboundaryc1734bfb1ebb04d62438bb4100c2be6\r\nContent-Disposition: form-data; name="file_streams"; filename="Eine gr\xc3\xb6\xc3\x9fere Textdatei.txt"\r\nContent-Type: text/plain\r\n\r\nM\xc3\xa4use \xc3\xbcberleben\r\n------geckoformboundaryc1734bfb1ebb04d62438bb4100c2be6\r\nContent-Disposition: form-data; name="dummy"\r\n\r\ndummy\r\n------geckoformboundaryc1734bfb1ebb04d62438bb4100c2be6--\r\n'

It seems to use UTF-8, which matches my website's document.characterSet. I'm not sure whether that's a coincidence.
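
To check the UTF-8 guess against the bytes above, here is a minimal sketch. `decode_field` is a hypothetical helper, not part of this library; it assumes UTF-8 as the first guess and ISO-8859-1 as a last resort, which is only one possible policy.

```python
def decode_field(raw: bytes, charset: str = "utf-8",
                 fallback: str = "iso-8859-1") -> str:
    """Decode a form field value, trying `charset` strictly first.

    Hypothetical helper: the default and fallback encodings are
    assumptions, not something the request itself tells us.
    """
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this never raises,
        # but it may produce mojibake if the guess is wrong.
        return raw.decode(fallback)


# Value of the "file_names" part, copied from the request body above.
raw_value = b"Eine gr\xc3\xb6\xc3\x9fere Textdatei.txt"

print(decode_field(raw_value))
```

Strict UTF-8 decoding succeeds on these bytes and yields the expected file name, which is consistent with the browser encoding the form using the page's character set. Decoding the same bytes as ISO-8859-1 also "succeeds" but gives garbled text, so a successful decode alone does not prove the guess was right.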

@siddhantgoel
Owner

I guess if there's no hint anywhere as to how the browser/client encoded the data, it's hard to say how the string should be decoded on the server side. The RFC has the following piece of text that I found relevant:

> In practice, many widely deployed implementations do not supply a
> charset parameter in each part, but rather, they rely on the notion
> of a "default charset" for a multipart/form-data instance.
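
One way to act on the "default charset" idea is to pick an encoding per part in a fixed order of precedence. The sketch below is an assumption about how that could look, not this library's behaviour: `pick_charset` and `charset_from_content_type` are hypothetical names, and the `_charset_` form field is the convention described in RFC 7578, which a form only sends if it includes such a hidden field.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent


def pick_charset(part_content_type: Optional[str],
                 charset_field: Optional[bytes],
                 default: str = "utf-8") -> str:
    """Choose an encoding for one part (hypothetical precedence order).

    1. charset parameter on the part's own Content-Type header
    2. value of a "_charset_" form field, if the form sent one
    3. an application-level default (assumed to be UTF-8 here)
    """
    from_part = charset_from_content_type(part_content_type)
    if from_part:
        return from_part
    if charset_field:
        from_field = charset_field.decode("ascii", "replace").strip()
        if from_field:
            return from_field
    return default


# The "file_names" part in the request above has no Content-Type at all,
# and the form sent no "_charset_" field, so the default applies.
print(pick_charset(None, None))
```

With this ordering, the application-level default only matters when the client gives no hint at all, which is the situation in the request posted above.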
