Encoding/decoding: Still don't get it :-/

Peter Otten __peter__ at web.de
Fri Mar 13 09:24:52 EDT 2009


Gilles Ganault wrote:

> I must be dense, but I still don't understand 1) why Python sometimes
> barfs out this type of error when displaying text that might not be
> Unicode-encoded, 2) whether I should use encode() or decode() to solve
> the issue, or even 3) if this is a Python issue or due to the APSW SQLite
> wrapper that I'm using:
> 
> ======
> sql = 'SELECT id,address FROM companies'
> rows=list(cursor.execute(sql))
> 
> for row in rows:
>         id = row[0]
> 
>         #could be 'utf-8', 'iso8859-1' or 'cp1252'
>         try:
>                 address = row[1]

Assuming row is a tuple with len(row) >= 2, the above line can never fail:
plain indexing does no decoding, so the except clauses are never reached.
Therefore you can rewrite the loop as

for row in rows:
    id, address = row[:2]
    print id, address

>         except UnicodeDecodeError:
>                 try:
>                         address = row[1].decode('iso8859-1')
>                 except UnicodeDecodeError:
>                         address = row[1].decode('cp1252')
> 
>         print id,address
> ======
> 152 Traceback (most recent call last):
>   File "C:\zip.py", line 28, in <module>
>     print id,address
>   File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\xc8' in
> position 24: character maps to <undefined>

It seems the database gives you the strings as unicode. When a unicode
string is printed, Python tries to encode it using sys.stdout.encoding
before writing it to stdout. As you run your script on the Windows command
line, that encoding seems to be cp437. Unfortunately, your database contains
characters that cannot be expressed in that encoding. One workaround is to
replace these characters with "?":

import sys

# sys.stdout.encoding can be None (e.g. when output is piped), hence the fallback
encoding = sys.stdout.encoding or "ascii"
for row in rows:
    id, address = row[:2]
    print id, address.encode(encoding, "replace")
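
To see why cp437 fails where other encodings would not, here is a minimal
sketch that checks whether a string fits into a given encoding. The helper
name can_encode and the sample address are made up for illustration; only
the character u'\xc8' is taken from the traceback. It is written so that it
should also run unchanged under Python 3, where encode() returns bytes.

```python
# Hypothetical sample address containing U+00C8 (E with grave accent),
# the character named in the traceback.
text = u"12 RUE DE L'\xc8GLISE"


def can_encode(s, encoding):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


in_cp437 = can_encode(text, "cp437")    # console encoding from the traceback
in_cp1252 = can_encode(text, "cp1252")  # common Windows GUI encoding

# With "replace", characters missing from the target encoding become "?"
fallback = text.encode("cp437", "replace")
```

Here in_cp437 comes out False and in_cp1252 True, so the data itself is
fine; it is only the console encoding that lacks the character.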


Example:

>>> u"ähnlich lölich üblich".encode("ascii", "replace")
'?hnlich l?lich ?blich'
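
"replace" is only one of the standard error handlers. If the "?" loses too
much information for your purposes, the following sketch compares a few
alternatives. The variable names are just for illustration, and the string
is written with an escape so the source file itself stays ASCII.

```python
text = u"\xe4hnlich"  # u"ähnlich"

replaced = text.encode("ascii", "replace")            # ?hnlich
ignored = text.encode("ascii", "ignore")              # hnlich
escaped = text.encode("ascii", "backslashreplace")    # \xe4hnlich (a literal backslash)
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # &#228;hnlich
```

"backslashreplace" and "xmlcharrefreplace" keep enough information to
recover the original character later, which "replace" and "ignore" do not.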

Peter



