[Tutor] string encoding

Lie Ryan lie.1296 at gmail.com
Fri Jun 18 04:24:25 CEST 2010


On 06/18/10 06:41, Rick Pasotto wrote:
> I'm using BeautifulSoup to process a webpage. One of the fields has a
> unicode character in it. (It's the 'registered trademark' symbol.) When
> I try to write this string to another file I get this error:
> 
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)
> 
> In the interpreter the offending string portion shows as: 'Realtors\xc2\xae'.
> 
> How can I deal with this single string? The rest of the document works
> fine.

You need to tell BeautifulSoup the encoding of the HTML document. The
encoding can be declared in any of these places:

- (preferred) The HTTP header Content-Type declaration, which specifies
the encoding externally to the document, e.g.:
Content-Type: text/html; charset=utf-8

- The HTML Content-Type declaration (a meta tag), e.g.:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

- The XML declaration, for an XHTML document that will be parsed with an
XML parser (hint: BeautifulSoup isn't an XML/XHTML parser), e.g.:
<?xml version="1.0" encoding="utf-8"?>
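
As a rough sketch of the first option, assuming you already have the
header value and the raw bytes of the page in hand, the declared charset
can be pulled out with the standard library and used to decode the bytes
before they reach the parser. The names header_value, raw_html and
charset_from_content_type are made up for illustration, the code is
written for Python 3, and it is not BeautifulSoup's own detection logic:

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper for illustration; falls back to `default`
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# Assumed inputs: a header value and the raw bytes of the page.
header_value = "text/html; charset=utf-8"
raw_html = b"<p>Realtors\xc2\xae</p>"

encoding = charset_from_content_type(header_value)
text = raw_html.decode(encoding)  # decoded text, ready to hand to a parser
```

Decoding up front like this means the parser never has to guess.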

However, BeautifulSoup also uses some heuristics to *guess* the
encoding of a tag soup that doesn't declare a proper encoding.
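
To see why a wrong guess matters, here is a small Python 3 sketch. It
assumes (the original post does not confirm this) that the UTF-8 bytes
for the trademark symbol were decoded as Latin-1 somewhere along the way,
which would explain an error that names two character positions for a
single symbol:

```python
raw = b"Realtors\xc2\xae"  # UTF-8 bytes for "Realtors" followed by U+00AE

right = raw.decode("utf-8")    # one character after "Realtors"
wrong = raw.decode("latin-1")  # two characters: U+00C2 and U+00AE

try:
    wrong.encode("ascii")
    error_span = None
except UnicodeEncodeError as exc:
    # exc.start and exc.end delimit the run of unencodable characters
    error_span = (exc.start, exc.end)
```

With the wrong decoding, the string is one character longer and the
ASCII codec rejects a two-character run rather than a single character.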

So, the most likely reason is this, from Beautiful Soup's FAQ:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Why can't Beautiful Soup print out the non-ASCII characters I gave it?
"""
Why can't Beautiful Soup print out the non-ASCII characters I gave it?

If you're getting errors that say: "'ascii' codec can't encode character
'x' in position y: ordinal not in range(128)", the problem is probably
with your Python installation rather than with Beautiful Soup. Try
printing out the non-ASCII characters without running them through
Beautiful Soup and you should have the same problem. For instance, try
running code like this:

latin1word = 'Sacr\xe9 bleu!'
unicodeword = unicode(latin1word, 'latin-1')
print unicodeword

If this works but Beautiful Soup doesn't, there's probably a bug in
Beautiful Soup. However, if this doesn't work, the problem's with your
Python setup. Python is playing it safe and not sending non-ASCII
characters to your terminal. There are two ways to override this behavior.

1. The easy way is to remap standard output to a converter that's not
afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.

import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)

codecs.lookup returns a number of bound methods and other objects
related to a codec. The last one is a StreamWriter object capable of
wrapping an output stream.

2. The hard way is to create a sitecustomize.py file in your Python
installation which sets the default encoding to ISO-Latin-1 or to UTF-8.
Then all your Python programs will use that encoding for standard
output, without you having to do something for each program. In my
installation, I have a /usr/lib/python/sitecustomize.py which looks like
this:

import sys
sys.setdefaultencoding("utf-8")

For more information about Python's Unicode support, look at Unicode for
Programmers or End to End Unicode Web Applications in Python. Recipes
1.20 and 1.21 in the Python cookbook are also very helpful.

Remember, even if your terminal display is restricted to ASCII, you can
still use Beautiful Soup to parse, process, and write documents in UTF-8
and other encodings. You just can't print certain strings with print.
"""
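
For the original problem of writing the string to a file, a minimal
sketch is to open the output with an explicit encoding instead of relying
on the ASCII default. io.open accepts an encoding argument on both
Python 2.6+ and Python 3; the file name and the sample string here are
placeholders, not taken from the original script:

```python
import io
import os
import tempfile

text = u"Realtors\u00ae"  # the u"" prefix is valid on Python 2 and 3.3+

# Placeholder output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The encoding argument makes the file object encode to UTF-8 on write.
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back as raw bytes to inspect what was stored.
with io.open(out_path, "rb") as f:
    written = f.read()
```

The bytes on disk are then the UTF-8 form of the string, and no implicit
ASCII conversion takes place.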


