[Tutor] Problems with encoding in BeautifulSoup

Eduardo Vieira eduardo.susan at gmail.com
Tue Aug 18 17:01:19 CEST 2009


On Tue, Aug 18, 2009 at 5:59 AM, Kent Johnson<kent37 at tds.net> wrote:
> On Tue, Aug 18, 2009 at 12:18 AM, Mal Wanstall<m.wanstall at gmail.com> wrote:
>> On Tue, Aug 18, 2009 at 9:00 AM, Eduardo Vieira<eduardo.susan at gmail.com> wrote:
>
>>> Here is the Error output:
>>> utf-8
>>> Traceback (most recent call last):
>>>  File "C:\myscripts\encondingproblem.py", line 13, in <module>
>>>    print companies[:4]
>>> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
>>> position 373: ordinal not in range(128)
>>
>> It's caused by Python not wanting to send non-ASCII characters to your
>> terminal. To override this you need to create a sitecustomize.py file
>> in your /usr/lib/python/ folder and put the following in it:
>>
>> import sys
>> sys.setdefaultencoding("utf-8")
>>
>> This will set the default encoding in Python to UTF8 and you should
>> stop getting these parsing errors. I dealt with this recently when I
>> was playing around with some international data.
>
> Eduardo is on Windows so his terminal encoding is probably not utf-8.
> More likely it is cp437.
>
> Setting sys.setdefaultencoding() affects all scripts you run and will
> make scripts that you write non-portable. A better solution is to
> properly encode the output, for example
> for company in companies[:4]: # assuming companies is a list
>  print company.encode('cp437')
>
> Kent
>

So, I gather that none of you get this error, then?
Anyway, running sys.getdefaultencoding() gives me 'ascii'.
The test example from the BeautifulSoup docs behaves differently depending on
whether I run it in IDLE or in the cmd window on my Windows XP machine. The example is:
latin1word = 'Sacr\xe9 bleu!'
unicodeword = unicode(latin1word, 'latin-1')
print unicodeword

In IDLE I get "Sacré bleu!"; in the Windows shell I get "Sacr\xe9 bleu!".
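
One way to compare the two environments might be to ask Python which
encodings it reports in each of them. This is only a rough sketch using the
standard sys module; the helper name report_encodings is made up for
illustration, and the values will differ between IDLE, the cmd window, and
Python versions.

```python
import sys

def report_encodings():
    """Return (default codec, encoding reported by standard output)."""
    # sys.stdout may have been replaced by an object without an
    # .encoding attribute (for example when output is redirected),
    # so fall back to None instead of raising AttributeError.
    return sys.getdefaultencoding(), getattr(sys.stdout, "encoding", None)

default_enc, stdout_enc = report_encodings()
print("default: %s, stdout: %s" % (default_enc, stdout_enc))
```

Running the same snippet in IDLE and in the cmd window should show whether
the two report different output encodings.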

I found out that simply looping over the list avoids the error:
for company in companies[:10]:
    print company

However, the accents are not displayed properly that way. This is corrected
when I encode each string explicitly with company.encode("iso-8859-1"); then
I get the right results. Thanks for pointing me to this.
What remains a question is why printing the list raises an error
while printing the individual strings in the list does not.
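
On the encoding part, here is a small sketch of what the explicit encode
calls do to the test word from the BeautifulSoup docs. It does not answer the
list question; it only shows that the same character maps to different bytes
in cp437 and iso-8859-1, and that the 'ascii' codec cannot represent it at
all. The variable names are my own.

```python
# The test word as a unicode string; \xe9 is the accented "e".
word = u'Sacr\xe9 bleu!'

# The same character becomes a different byte in each codec.
latin1_bytes = word.encode('iso-8859-1')  # accented e is byte 0xE9
cp437_bytes = word.encode('cp437')        # accented e is byte 0x82

# The ascii codec has no byte for the character, so encoding fails
# with the same kind of error as in the traceback.
try:
    word.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(latin1_bytes))
print(repr(cp437_bytes))
print(ascii_ok)
```

Which codec gives readable accents depends on what the terminal expects, so
the one that works on one machine may not work on another.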

Regards,
Eduardo

