[Tutor] Beautiful Soup / Unicode problem?
grouch at gmail.com
Fri Aug 26 18:57:44 CEST 2005
>This is the first question in the BeautifulSoup FAQ at
>Unfortunately the author of BS considers this a problem with your Python installation! So it
>seems he doesn't have a good understanding of Python and Unicode. (OK, I can forgive him
>that, I think there are only a handful of people who really do understand it completely.)
>The first fix given doesn't work. The second fix works but it is not a good idea to change the
>default encoding for your Python install. There is a hack you can use to change the default
>encoding just for one program; in your program put
> reload(sys); sys.setdefaultencoding('utf-8')
>This seems to fix the problem you are having.
I did read the FAQ before posting, honest :) But it does seem to be
addressing a different issue.
He says to try:
>>> latin1word = 'Sacr\xe9 bleu!'
>>> unicodeword = unicode(latin1word, 'latin-1')
>>> print unicodeword
Which worked fine for me. And then he gives a solution for fixing
-display- problems on the terminal. For instance, his first solution is:
"The easy way is to remap standard output to a converter that's not
afraid to send ISO-Latin-1 or UTF-8 characters to the terminal."
But I avoided displaying anything in my original example, because I
didn't want to confuse the issue. It's also why I didn't mention the
damning FAQ entry:
>>> y = results.a.fetchText(re.compile('.+'))
Is all I am trying to do.
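
For reference, the FAQ's "remap standard output to a converter" suggestion quoted above can be sketched roughly as follows. This is only an illustration, not the FAQ's actual code: the helper name make_utf8_writer is made up here, and an in-memory buffer stands in for the terminal so the result can be inspected.

```python
import codecs
import io

def make_utf8_writer(byte_stream):
    """Wrap a byte stream so Unicode text written to it is encoded as UTF-8.

    Hypothetical helper name; codecs.getwriter is the standard-library call
    doing the work.
    """
    return codecs.getwriter('utf-8')(byte_stream)

# An in-memory buffer stands in for the terminal. For real output the byte
# stream would be sys.stdout on Python 2 (sys.stdout.buffer on Python 3).
buf = io.BytesIO()
out = make_utf8_writer(buf)

# Decode the Latin-1 bytes to Unicode first, then let the writer encode them.
out.write(b'Sacr\xe9 bleu!'.decode('latin-1'))
encoded = buf.getvalue()
```

Note that this only affects how text is written out. It does not change how byte strings and Unicode strings are combined inside a program, which is the part this thread is about.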
I don't expect non-ASCII characters to display correctly; however, I
was surprised when I tried "print x" in my original example and it
printed. I would have expected to have to do something like:
>>> print x.encode("utf8")
Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>
I've just looked, and I have to do this explicit encoding under Python
2.3.4, but not under 2.4.1. So perhaps 2.4 is smarter (or less afraid)
about converting non-ASCII characters for display on the terminal.
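
A version-independent way to do the explicit encoding is to ask the output stream which encoding it declares and fall back to UTF-8 when it declares none. The sketch below assumes that approach; encode_for_stream and FakeStream are hypothetical names used only for illustration, and nothing here is confirmed to be what either Python version does internally.

```python
def encode_for_stream(text, stream, fallback='utf-8'):
    """Encode Unicode text with the stream's declared encoding, if it has one."""
    # Streams such as pipes may have no 'encoding' attribute, or have it set to None.
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the target encoding cannot represent.
    return text.encode(encoding, 'replace')

class FakeStream(object):
    """Hypothetical stand-in for a terminal stream, used only for this demo."""
    def __init__(self, encoding):
        self.encoding = encoding

title = b'Postneo 2.0 \xc2\xbb Blog Archive'.decode('utf-8')

as_utf8 = encode_for_stream(title, FakeStream(None))      # no declared encoding
as_ascii = encode_for_stream(title, FakeStream('ascii'))  # ASCII-only terminal
```

With a real terminal you would pass sys.stdout as the stream and print (or write) the resulting bytes.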
Either way, I don't -think- that's my problem with Beautiful Soup.
Changing my default encoding does indeed fix it, but that may just
reflect the author making bad assumptions because his own default
was set to utf-8. I'm not really experienced enough to tell what is
going on in his code, but I've been trying. Having to change the
default does seem to defeat the point of Unicode, however.
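
One likely reason the default encoding matters at all: when Python 2 mixes a byte string with a Unicode string, it decodes the byte string implicitly using the default encoding, which is normally ASCII. The sketch below performs that decoding step explicitly. Whether Beautiful Soup's code actually hits this path is an assumption, and try_decode is a hypothetical helper.

```python
# UTF-8 encoded bytes for u'Sacr\xe9 bleu!'
utf8_bytes = b'Sacr\xc3\xa9 bleu!'

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# What an implicit conversion does with the stock ASCII default: it fails.
with_ascii = try_decode(utf8_bytes, 'ascii')

# What it does once the default is switched to utf-8: it succeeds.
with_utf8 = try_decode(utf8_bytes, 'utf-8')
```

If that is what is happening, changing the default only hides the implicit conversion. It would also give wrong results for input that is not UTF-8, so decoding explicitly with the page's real encoding is the safer fix.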