requests.Session() how do you set 'replace' on the encoding?
Veek M
vek.m1234 at gmail.com
Mon Jul 6 05:36:29 EDT 2015
dieter wrote:
> Veek M <vek.m1234 at gmail.com> writes:
>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>> position 8: illegal multibyte sequence
>
> You give us very little context.
It's a longish chunk of code: basically, I'm trying to download using
'requests.Session', and that should give me Unicode once it's told
which encoding is being used ('gbk').
import codecs
import requests

def get_page(s, url):
    print(url)
    r = s.get(url, headers={
        'User-Agent': user_agent,
        'Keep-Alive': '3600',
        'Connection': 'keep-alive',
    })
    # encoding is an attribute of the Response, not of the Session
    r.encoding = 'gbk'
    text = r.text
    return text

# Open output file
fh = codecs.open('/tmp/out', 'wb')
fh.write(header)

# Download
s = requests.Session()
------------
If 'text' is NOT proper Unicode because the server introduced some junk,
then when I do anchor.getparent() on my 'text' it'll traceback.
Ergo the question: how do I set a replacement char within 'requests'?
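
A note on direction: the traceback quoted above is a UnicodeEncodeError, so it is raised when Unicode text is turned back into gbk bytes (printing, writing a file), not while the download is being decoded. As far as I can tell, Response.text already decodes with errors='replace'. Below is a minimal stdlib-only sketch of the two directions; the byte strings and variable names are made up for illustration and no network access is involved.

```python
# Hypothetical sample data: the gbk bytes for two Chinese characters,
# followed by one junk byte that is not valid gbk.
raw = b'\xc4\xe3\xba\xc3' + b'\xff'

# Decoding (bytes -> Unicode): 'replace' substitutes U+FFFD for bad input.
text = raw.decode('gbk', errors='replace')

# Encoding (Unicode -> bytes): this is the direction that raised in the post.
sample = u'a\xa0b'  # contains a no-break space, which gbk cannot encode
try:
    sample.encode('gbk')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With an error handler the same call succeeds; the bad character becomes '?'.
replaced = sample.encode('gbk', errors='replace')
```

So a replacement character on input is available through the errors argument of decode(), while the error in the traceback would need an errors argument on whatever encodes the output.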
> In general: when you need control over encoding handling because
> deep in a framework an encoding causes problems (as apparently in
> your case), you can usually first take the plain text,
> fix any encoding problems and only then pass the fixed text to
> your framework.
>
>> I'm doing:
>> s = requests.Session()
>> to suck data in, so.. how do i 'replace' chars that fit gbk
>
> It does not seem that the problem occurs inside the "requests" module.
> Thus, you have a chance to "intercept" the downloaded text
> and fix encoding problems.
Okay, so I should use the 'raw' method in requests, clean up the
raw text, and then convert that to Unicode, vs. trying to do it using
'requests'? The thing is, 'codecs' has xmlcharrefreplace_errors(...) etc., so
I figured if output has clean-up, input ought to have it :p
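
The "intercept, fix, then pass on" approach could be sketched roughly as follows. This is only an illustration under assumptions: decode_lenient and write_lenient are hypothetical helper names, the byte string stands in for what Response.content would return, and the output goes to a temporary directory rather than /tmp/out. Note that 'replace' works for both decoding and encoding, whereas 'xmlcharrefreplace' is documented as an encode-only handler, which is why it appears only on the output side.

```python
import io
import os
import tempfile

def decode_lenient(raw_bytes, encoding='gbk'):
    """Decode downloaded bytes, replacing undecodable input with U+FFFD."""
    return raw_bytes.decode(encoding, errors='replace')

def write_lenient(path, text, encoding='gbk'):
    """Write text; characters the codec cannot encode become &#NNN; references."""
    with io.open(path, 'w', encoding=encoding,
                 errors='xmlcharrefreplace') as fh:
        fh.write(text)

# Stand-in for r.content: valid gbk bytes followed by one junk byte.
raw = b'\xc4\xe3\xba\xc3\xff'

# Input side: junk is replaced instead of raising UnicodeDecodeError.
# A no-break space is appended to mimic the character from the traceback.
text = decode_lenient(raw) + u'\xa0'

# Output side: unencodable characters are written as numeric references.
out_path = os.path.join(tempfile.mkdtemp(), 'out')
write_lenient(out_path, text)

with open(out_path, 'rb') as fh:
    written = fh.read()
```

The cleaned 'text' is what would be handed to the parser; the file keeps every character in some form instead of aborting on the first one gbk cannot represent.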