[Tutor] string encoding

Rick Pasotto rick at niof.net
Fri Jun 18 06:21:25 CEST 2010


On Fri, Jun 18, 2010 at 12:24:25PM +1000, Lie Ryan wrote:
> On 06/18/10 06:41, Rick Pasotto wrote:
> > I'm using BeautifulSoup to process a webpage. One of the fields has a
> > unicode character in it. (It's the 'registered trademark' symbol.) When
> > I try to write this string to another file I get this error:
> > 
> > UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)
> > 
> > In the interpreter the offending string portion shows as: 'Realtors\xc2\xae'.
> > 
> > How can I deal with this single string? The rest of the document works
> > fine.
> 
> You need to tell BeautifulSoup the encoding of the HTML document. You
> can declare it in either of two places:
> 
> - (preferred) Externally, in the HTTP Content-Type header, e.g.:
> Content-Type: text/html; charset=utf-8
> 
> - In the HTML Content-Type meta declaration, e.g.:
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The document has:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

When I look at the document in vim and when I 'print' in Python I see
two characters: an accented capital A followed by the circled 'r'.
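
Those two characters are consistent with UTF-8 bytes being read as
ISO-8859-1: the registered sign U+00AE is the two bytes C2 AE in UTF-8,
and Latin-1 maps each of those bytes to its own character. A minimal
sketch of that reading, assuming the page is really UTF-8 despite its
meta tag (Python 3 syntax; the variable names are illustrative only):

```python
# The registered sign (U+00AE) encoded as UTF-8 is two bytes: C2 AE.
utf8_bytes = u"Realtors\u00ae".encode("utf-8")

# Decoding those bytes as ISO-8859-1 turns each byte into its own
# character: an accented capital A followed by the circled r.
misdecoded = utf8_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 gives the single intended character.
correct = utf8_bytes.decode("utf-8")

# A string that was already mis-decoded can be repaired by reversing
# the wrong step and then applying the right one.
repaired = misdecoded.encode("iso-8859-1").decode("utf-8")

print(repr(misdecoded))
print(repr(correct))
```

If that assumption holds, the fix belongs at the decoding step (telling
the parser the page is UTF-8) rather than at the point of writing.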

> latin1word = 'Sacr\xe9 bleu!'
> unicodeword = unicode(latin1word, 'latin-1')
> print unicodeword

TypeError: decoding Unicode is not supported
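
One possible reading of this error, offered as an assumption: in
Python 2, unicode(obj, encoding) raises this TypeError when obj is
already a unicode object, since only byte strings can be decoded. A
rough sketch of a guard that decodes only when needed follows. It uses
Python 3 syntax, where str() behaves analogously, and to_text is a
hypothetical helper name, not part of Beautiful Soup or the standard
library:

```python
def to_text(value, encoding="latin-1"):
    """Return text, decoding only when value is still a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text: asking for a second decode is what raises TypeError.
    return value


# Decoding a byte string works as in the quoted example.
from_bytes = to_text(b"Sacr\xe9 bleu!")

# Text passes through unchanged instead of raising.
from_text = to_text(u"Sacr\xe9 bleu!")

# Calling the constructor with an encoding on text reproduces the failure.
try:
    str(u"Sacr\xe9 bleu!", "latin-1")
    failed = False
except TypeError:
    failed = True
```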

> If this works but Beautiful Soup doesn't, there's probably a bug in
> Beautiful Soup. However, if this doesn't work, the problem's with your
> Python setup. Python is playing it safe and not sending non-ASCII
> characters to your terminal. There are two ways to override this behavior.
> 
> 1. The easy way is to remap standard output to a converter that's not
> afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.
> 
> import codecs
> import sys
> streamWriter = codecs.lookup('utf-8')[-1]
> sys.stdout = streamWriter(sys.stdout)
> 
> codecs.lookup returns a number of bound methods and other objects
> related to a codec. The last one is a StreamWriter object capable of
> wrapping an output stream.

Those four lines executed but I still get

TypeError: decoding Unicode is not supported

> Remember, even if your terminal display is restricted to ASCII, you can
> still use Beautiful Soup to parse, process, and write documents in UTF-8
> and other encodings. You just can't print certain strings with print.

I can print the string fine. It's f.write(string_with_unicode) that fails with:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)

Shouldn't I be able to f.write() *any* 8-bit byte(s)?

repr() gives: u"Realtors\\xc2\\xae"
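
Since repr() shows a u"..." string, the value is unicode text rather
than bytes. In Python 2, writing such text to a plain file object
encodes it implicitly with the ASCII codec, which would explain the
error. A minimal sketch of two ways to choose the encoding explicitly.
It assumes UTF-8 output is wanted, uses Python 3 syntax, and writes to a
throwaway temporary path that is purely illustrative:

```python
import codecs
import os
import tempfile

text = u"Realtors\u00ae"

# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Option 1: encode explicitly and write the resulting bytes.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: open the file through codecs so that write() encodes the text.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes back to see what landed on disk.
with open(path, "rb") as f:
    written = f.read()
```

Either way the file receives bytes in a known encoding, so write() never
has to fall back on ASCII.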

BTW, I'm running Python 2.5.5 on Debian Linux.

-- 
"Making fun of born-again christians is like hunting dairy cows with a
 high powered rifle and scope." -- P.J. O'Rourke
    Rick Pasotto    rick at niof.net    http://www.niof.net

