download_url should make use of the encoding in the Content-Type header #2

fawkesley · 2013-11-14T14:06:21Z

At the moment we return a file object built from response.content, which throws away what the server told us about the file's character encoding in the Content-Type header:

Content-Type: text/html; charset=UTF-8

We can cunningly wrap the returned file handle using the codecs module:

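>>> import codecs
>>> from cStringIO import StringIO
>>> a = u'Marat\xf3n'  # assumed definition; `a` was not defined in the original snippet
>>> f = StringIO(a.encode('utf-8'))
>>> f.read()
'Marat\xc3\xb3n'
>>> f.seek(0)
>>> g = codecs.getreader('utf-8')(f)
>>> print g.read()
Maratón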


scraperdragon · 2013-12-02T12:11:30Z

The critical bit of code is:

g = codecs.getreader('utf-8')(f)
g.read()

f is a file handle containing UTF-8 bytes, but g.read() returns correctly decoded unicode.
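
To tie this back to the Content-Type header, here is a rough sketch of how the charset could be pulled out of the header and passed to codecs.getreader. It is written for Python 3 (io.BytesIO instead of cStringIO), and the names charset_from_content_type, decoding_reader and fallback are made up for illustration; they are not part of download_url or of requests. Falling back to UTF-8 when the server sends no charset is also just an assumption.

```python
import codecs
import io
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or `default`.

    Hypothetical helper: uses the stdlib email parser to read the
    header parameters rather than splitting the string by hand.
    """
    message = Message()
    message['Content-Type'] = content_type
    # get_content_charset() lowercases the value, e.g. 'UTF-8' -> 'utf-8'.
    return message.get_content_charset(default)


def decoding_reader(byte_stream, content_type, fallback='utf-8'):
    """Wrap a binary file object so that read() returns decoded text.

    `fallback` is an assumed default for responses with no charset.
    """
    charset = charset_from_content_type(content_type, default=fallback)
    return codecs.getreader(charset)(byte_stream)


# Stand-in for the file object currently built from response.content.
body = io.BytesIO('Maratón'.encode('utf-8'))
reader = decoding_reader(body, 'text/html; charset=UTF-8')
text = reader.read()
print(text)
```

An unknown charset name makes codecs.getreader raise LookupError, so a real implementation would probably want to catch that and decide what to do.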

StevenMaude mentioned this issue May 28, 2014

If server doesn't specify an encoding, requests always tries to guess it #16

Closed
