Totally confused by Python's string thing.
Fredrik Lundh
fredrik at pythonware.com
Mon Dec 16 14:38:11 EST 2002
"Doru-Catalin Togea" wrote:
> 1) According to
> http://www.cl.cam.ac.uk/~mgk25/ucs/CP1252.html, the 1252 extension
> extends ISO 8859-1. Now ISO 8859-1 already contains the Norwegian
> characters, at least according to
> http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html
>
> So what is my problem, actually?
you're trying to encode a string that's already encoded. to do this, Python
tries to *decode* it first, using the default encoding (ASCII), and that
fails as soon as the string contains a non-ASCII byte.
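a rough sketch of that failure, for illustration only. it assumes a current Python 3 interpreter, where the implicit ASCII decode no longer exists, so the hidden step is written out by hand. the helper name and the sample text are made up for this example.

```python
# Python 2 silently ran raw.decode("ascii") before encoding an 8-bit string.
# Python 3 has no such implicit step, so it is spelled out explicitly here.

def encode_already_encoded(raw, target):
    # hypothetical helper: mimic the implicit ASCII decode, then encode
    return raw.decode("ascii").encode(target)

# "blåbær" as ISO 8859-1 bytes (sample data, assumed for illustration)
raw = u"bl\u00e5b\u00e6r".encode("iso-8859-1")

try:
    encode_already_encoded(raw, "cp1252")
    failed = False
except UnicodeDecodeError:
    # 0xE5 is not an ASCII byte, so the *decode* step fails,
    # even though the caller only asked for an encode
    failed = True

# a pure ASCII byte string survives the same round trip unchanged
ascii_only = encode_already_encoded(b"blabar", "cp1252")
```

the point is that the error comes from the decode half, which is why the traceback mentions ASCII even though nobody asked for it.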
> 2) How do I set up my system to deal correctly and robustly with the ISO
> 8859-1 character set? How about the ISO 8859-2 character set?
convert all text to Unicode strings on the way in, and to the appropriate
encoding on the way out.
to convert from encoded data to Unicode text, use:
    txt = raw.decode(encoding)
or
    txt = unicode(raw, encoding)
to convert from Unicode text to encoded data, use:
    raw = txt.encode(encoding)
(where "raw" is an encoded string, and "txt" is a unicode string)
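the decode-in, encode-out pattern above might look like this in practice. this is only a sketch, written so it runs on a current Python 3 interpreter, where "raw" is a bytes object and "txt" is a str; the unicode(raw, encoding) spelling exists only in Python 2. the recode helper and the sample byte strings are assumptions for illustration.

```python
def recode(raw, source_encoding, target_encoding):
    # hypothetical helper: decode on the way in, encode on the way out
    txt = raw.decode(source_encoding)
    return txt.encode(target_encoding)

# "æøå" as ISO 8859-1 bytes (sample data, assumed for illustration)
raw_latin1 = b"\xe6\xf8\xe5"

# way in: encoded data -> Unicode text
txt = raw_latin1.decode("iso-8859-1")

# way out: Unicode text -> encoded data, here UTF-8
raw_utf8 = txt.encode("utf-8")

# the same pattern for ISO 8859-2 input; these bytes spell "łódź"
raw_from_latin2 = recode(b"\xb3\xf3\x64\xbc", "iso-8859-2", "utf-8")
```

only the encoding name changes between the ISO 8859-1 and ISO 8859-2 cases; the text in the middle is plain Unicode either way.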
> 3) Is there any INTRODUCTORY documentation about Python's internal string
> thing?
very brief, but may help:
http://effbot.org/zone/unicode-objects.htm
see also:
http://www.jorendorff.com/articles/unicode/python.html
http://www.reportlab.com/i18n/python_unicode_tutorial.html
http://www.python.org/doc/current/ref/strings.html
> 4) What kind of string objects does pyXML employ, since I can parse XML
> with Norwegian content and call doEncode on strings returned from my XML
> file, without any Unicode crash?
Unicode strings.
(some XML libraries may return ASCII-only 8-bit strings for pure ASCII
data, and Unicode strings for everything else, but I don't think PyXML
does that by default. You can usually mix Unicode strings and pure-
ASCII strings freely)
</F>
<!-- (the eff-bot guide to) the python standard library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->