Puzzled by code pages

John Machin sjmachin at lexicon.net
Sat May 15 17:07:20 EDT 2010


Adam Tauno Williams <awilliam <at> whitemice.org> writes:

> On Fri, 2010-05-14 at 20:27 -0400, Adam Tauno Williams wrote:
> > I'm trying to process OpenStep plist files in Python.  I have a parser
> > which works, but only for strict ASCII.  However plist files may contain
> > accented characters - equivalent to ISO-8859-2 (I believe).  For example
> > I read in the line:
> >
> > '    "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
> > NSFileName;\n'
> > What is the correct way to re-encode this data into UTF-8 so I can use
> > unicode strings, and then write the output back to ISO8859-?
>
> Buried in the parser is a str(...) call.  Replacing that with
> unicode(...) and now the OpenSTEP plist parser is working with Italian
> plists.

Some observations:

Italian text is much more likely to be encoded in ISO-8859-1 than ISO-8859-2.
The latter covers eastern European languages (e.g. Polish, Czech, Hungarian)
that use the Latin alphabet with many "decorations" not found in western 
alphabets.

Let's look at the 'localit\xc3\xa0' example. Using ISO-8859-2, that decodes to
u'localit\u0102\xa0'. The second-last character is LATIN CAPITAL LETTER A WITH
BREVE (according to unicodedata.name()). The last character is NO-BREAK SPACE.
Doesn't look like an Italian word to me.

However, using UTF-8, that decodes to u'localit\xe0'. The last character is
LATIN SMALL LETTER A WITH GRAVE. Looks like a plausible Italian word to me. Also
to Wikipedia: "A località (literally "locality"; plural località) is the name
given in Italian administrative law to a type of territorial subdivision of a
comune ..."
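
The comparison above can be reproduced with a short script. This is only a
sketch: it is written in Python 3 syntax (the thread itself shows Python 2
reprs), and the helper name describe() is made up for illustration.

```python
import unicodedata

raw = b"localit\xc3\xa0"  # the bytes from the example line


def describe(data, codec):
    """Decode data with codec and name each non-ASCII character (hypothetical helper)."""
    text = data.decode(codec)
    return text, [unicodedata.name(ch) for ch in text if ord(ch) > 127]


for codec in ("iso-8859-2", "utf-8"):
    text, names = describe(raw, codec)
    # ascii() keeps the output readable on any terminal
    print(codec, ascii(text), names)
```

Once the bytes have been decoded with the right codec, writing the text back
out in a single-byte encoding is a matter of calling, for example,
text.encode("iso-8859-1"), assuming every character exists in that encoding.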

Conclusions:

It's worth closely scrutinising "accented characters - equivalent to ISO-8859-2
(I believe)". Which variety of "OpenStep plist files" are you looking at:
NeXTSTEP, GNUstep, or Mac OS X? If the last, it's evidently an XML document, and
you should be letting the XML parser decode it for you. In any case, as an XML
document it's most likely UTF-8, not ISO-8859-2.
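
For the XML case, here is a rough sketch of letting the parser do the decoding,
using the standard library's plistlib in Python 3. The document below is a
made-up minimal plist (the key "name" is invented for illustration), and this
does not apply to old-style, non-XML OpenStep plists like the line quoted
above.

```python
import plistlib

# Hypothetical minimal XML plist; the bytes are UTF-8, as the declaration says.
xml_plist = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<plist version="1.0"><dict>'
    b'<key>name</key><string>localit\xc3\xa0</string>'
    b'</dict></plist>'
)

# The XML parser reads the declared encoding and hands back decoded text,
# so no manual decode() call is needed.
data = plistlib.loads(xml_plist)
print(ascii(data["name"]))
```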

It's worth examining your definition of "working".




