Totally confused by Python's string thing.

Mon Dec 16 14:38:11 EST 2002

"Doru-Catalin Togea" wrote:

> 1) According to
> http://www.cl.cam.ac.uk/~mgk25/ucs/CP1252.html, the 1252 extension
> extends ISO 8859-1. Now ISO 8859-1 already contains the Norwegian
> characters, at least according to
> http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html
>
> So what is my problem, actually?

you're trying to encode a string that's already encoded.  to do this, Python
tries to *decode* it first, using the default encoding (ASCII), and that
decode step fails as soon as it hits a non-ASCII byte.
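
a minimal sketch of that implicit step, written out explicitly so that it also
runs on Python 3 (where byte strings have no encode method at all).  the sample
bytes and the helper name encode_encoded are made up for illustration:

```python
# Sample data (an assumption): "blåbær" already encoded as ISO 8859-1.
raw = b"bl\xe5b\xe6r"

def encode_encoded(data, encoding):
    """Hypothetical helper: mimic encoding an already-encoded string."""
    # Step 1 (done implicitly by old Pythons): decode with the default
    # encoding, ASCII.  Any byte >= 0x80 raises UnicodeDecodeError here.
    text = data.decode("ascii")
    # Step 2: encode the decoded text with the requested encoding.
    return text.encode(encoding)

try:
    encode_encoded(raw, "cp1252")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode step failed:", exc)

# Pure ASCII data survives the round trip, which is why the problem
# only shows up once non-ASCII characters appear.
ascii_result = encode_encoded(b"abc", "cp1252")
```

the error comes from step 1, not from the encoding you asked for, which is why
the message talks about ASCII even though the call names cp1252.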

> 2) How do I set up my system to deal correctly and robustly with the ISO
> 8859-1 character set? How about the ISO 8859-2 character set?

convert all text to Unicode strings on the way in, and to the appropriate
encoding on the way out.

to convert from encoded data to Unicode text, use:

    txt = raw.decode(encoding)

or

    txt = unicode(raw, encoding)

to convert from Unicode text to encoded data, use:

    raw = txt.encode(encoding)

(where "raw" is an encoded string, and "txt" is a unicode string)
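
a rough sketch of the decode-on-input, encode-on-output pattern, in Python 3
spelling (bytes for encoded data, str for Unicode text).  the helper names and
the sample bytes are assumptions for illustration, not part of any library:

```python
def read_text(raw, encoding):
    """Decode encoded bytes into Unicode text (on the way in)."""
    return raw.decode(encoding)

def write_text(txt, encoding):
    """Encode Unicode text into bytes (on the way out)."""
    return txt.encode(encoding)

# Sample input: "blåbær" as ISO 8859-1 bytes, "Łódź" as ISO 8859-2 bytes.
norwegian = read_text(b"bl\xe5b\xe6r", "iso-8859-1")
polish = read_text(b"\xa3\xf3d\xbc", "iso-8859-2")

# Inside the program both are plain Unicode text and can be combined.
combined = norwegian + " " + polish

# On the way out, pick an encoding that can represent every character.
utf8_out = write_text(combined, "utf-8")

# ISO 8859-1 has no Polish letters, so this fails loudly instead of
# silently producing garbage.
try:
    write_text(combined, "iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

the point of the design is that the encoding is named exactly twice, once at
each boundary, and everything in between only ever sees Unicode text.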

> 3) Is there any INTRODUCTORY documentation about Python's internal string
> thing?

very brief, but may help:

    http://effbot.org/zone/unicode-objects.htm

see also:

    http://www.jorendorff.com/articles/unicode/python.html
    http://www.reportlab.com/i18n/python_unicode_tutorial.html
    http://www.python.org/doc/current/ref/strings.html

> 4) What kind of string objects does pyXML employ, since I can parse XML
> with Norwegian content and call doEncode on strings returned from my XML
> file, without any Unicode crash?

Unicode strings.

(some XML libraries may return ASCII-only 8-bit strings for pure ASCII
data, and Unicode strings for everything else, but I don't think PyXML
does that by default.  You can usually mix Unicode strings and pure-
ASCII strings freely)

</F>
