requests.Session() how do you set 'replace' on the encoding?

Veek M vek.m1234 at gmail.com
Mon Jul 6 11:36:29 CEST 2015


dieter wrote:

> Veek M <vek.m1234 at gmail.com> writes:
>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>> position 8: illegal multibyte sequence
> 
> You give us very little context.

It's a longish chunk of code: basically, I'm trying to download using 
'requests.Session', and that should give me Unicode once it's told 
which encoding is being used ('gbk').

def get_page(s, url):
    print(url)
    r = s.get(url, headers = {
          'User-Agent' : user_agent,
          'Keep-Alive' : '3600',
          'Connection' : 'keep-alive',
          })
    r.encoding = 'gbk'  # the encoding belongs on the response, not the session
    text = r.text
    return text

# Open output file
fh=codecs.open('/tmp/out', 'wb')
fh.write(header)

# Download
s = requests.Session()
------------

If 'text' is NOT proper Unicode because the server introduced some junk, 
then when I do anchor.getparent() on my 'text' it'll raise a traceback.
Ergo the question: how do I set a replacement char within 'requests'?
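
For what it's worth, here is a minimal sketch of doing the replacement by hand, assuming the undecoded bytes are available (with requests that would be r.content). The helper name decode_gbk_lenient and the sample payload are made up for illustration; only bytes.decode() with errors='replace' is standard Python.

```python
def decode_gbk_lenient(raw_bytes):
    """Decode GBK bytes, substituting U+FFFD for anything undecodable."""
    return raw_bytes.decode('gbk', errors='replace')

# Hypothetical payload: valid GBK text with one junk byte (0xff) in the middle.
good = u'\u4e2d\u6587'.encode('gbk')
payload = good + b'\xff' + b'abc'
text = decode_gbk_lenient(payload)
```

With a response r from requests, the call would be decode_gbk_lenient(r.content) in place of reading r.text.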

> In general: when you need control over encoding handling because
> deep in a framework an encoding causes problems (as apparently in
> your case), you can usually first take the plain text,
> fix any encoding problems and only then pass the fixed text to
> your framework.
> 
>> I'm doing:
>> s = requests.Session()
>> to suck data in, so.. how do i 'replace' chars that don't fit gbk
> 
> It does not seem that the problem occurs inside the "requests" module.
> Thus, you have a chance to "intercept" the downloaded text
> and fix encoding problems.

Okay, so I should get the raw bytes from requests, clean those up, and 
then convert them to Unicode myself, rather than trying to do it inside 
'requests'? The thing is, 'codecs' has xmlcharrefreplace_errors(...) etc., so 
I figured if output has cleanup, input ought to have it too :p
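
On the output side, the original UnicodeEncodeError can be avoided by giving the file an explicit encoding and error handler. This is only a sketch under assumptions: the helper write_gbk, the temp path and the sample string are invented for illustration, and it uses io.open rather than the codecs.open call in my script.

```python
import io
import os
import tempfile

def write_gbk(path, text):
    """Write text as GBK, turning unencodable chars into &#NNN; references."""
    with io.open(path, 'w', encoding='gbk', errors='xmlcharrefreplace') as fh:
        fh.write(text)

# u'\xa0' (no-break space) is the character from the traceback above.
sample = u'price:\xa0100'
path = os.path.join(tempfile.mkdtemp(), 'out')
write_gbk(path, sample)

# Read the file back as bytes to see what actually landed on disk.
with io.open(path, 'rb') as fh:
    data = fh.read()
```

Swapping 'xmlcharrefreplace' for 'replace' would write a plain '?' instead of a character reference.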


