[Python-Dev] Python3 "complexity"

Kristján Valur Jónsson kristjan at ccpgames.com
Thu Jan 9 15:24:00 CET 2014

> -----Original Message-----
> From: Victor Stinner [mailto:victor.stinner at gmail.com]
> Sent: 9. janúar 2014 13:51
> To: Kristján Valur Jónsson
> Cc: Antoine Pitrou; python-dev at python.org
> Subject: Re: [Python-Dev] Python3 "complexity"
> 2014/1/9 Kristján Valur Jónsson <kristjan at ccpgames.com>:
> > This definition is funny, because according to Wikipedia, it is a
> > "superset" of 8859-1 (latin1)
> Bytes 0x80..0x9f are unassigned in ISO/IEC 8859-1... but are assigned in
> (IANA's) ISO-8859-1.
> Python implements the latter, ISO-8859-1.
> Wikipedia says "This encoding is a superset of ISO 8859-1, but differs from
> the IANA's ISO-8859-1".

Thanks.  That's entirely non-confusing :)
" ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429."

So anyway, yes, Python's "latin1" encoding does cover the entire range of 256 byte values.  But on Windows we use cp1252 instead, which does not:
it defines useful and common Windows characters in many of the control character slots and leaves a few bytes unassigned.
Hence the need for "surrogateescape" to be able to round-trip every byte.
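
To make the cp1252 point concrete, here is a minimal sketch, assuming the cp1252 codec that ships with CPython 3. It lists the byte values that a strict cp1252 decode rejects, then shows that "surrogateescape" lets all 256 byte values survive a decode/encode round trip. The variable names are mine, purely for illustration.

```python
# All 256 possible byte values, in order.
all_bytes = bytes(range(256))

# Collect the byte values that a strict cp1252 decode refuses.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])

# With surrogateescape, undecodable bytes become lone surrogates
# (U+DC80..U+DCFF) on decode and are turned back into the original
# bytes on encode, so the round trip is lossless.
text = all_bytes.decode("cp1252", errors="surrogateescape")
round_tripped = text.encode("cp1252", errors="surrogateescape")
print(round_tripped == all_bytes)
```

The same error handler has to be passed on both the decode and the encode side. Encoding the escaped text with the default strict handler raises UnicodeEncodeError on the surrogates.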

Again, this is non-obvious.  Given my experience with cp1252, I had no way of guessing that the "subset", i.e. latin1, would in fact cover the whole range.  So I have learned two things since my initial foray into parsing ASCII files with Python 3:  surrogateescape, and "latin1 in Python == IANA's ISO-8859-1, which does indeed define the whole 8-bit range".
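
For the latin1 half of that, a short sketch (again assuming CPython 3, with illustrative names of my own) showing that the "latin1" codec maps byte value n straight to code point U+00n, so a strict decode never fails and the round trip is exact without any error handler:

```python
import unicodedata

all_bytes = bytes(range(256))

# Strict decode: no error handler is needed, every byte is defined.
as_text = all_bytes.decode("latin1")

# Byte n becomes code point n, so the mapping is the identity on 0..255.
code_points = [ord(ch) for ch in as_text]
print(code_points == list(range(256)))

# Encoding gives back exactly the original bytes.
print(as_text.encode("latin1") == all_bytes)

# The 0x80..0x9f block decodes to C1 control characters (category "Cc"),
# whereas cp1252 puts printable characters such as the euro sign there.
print(unicodedata.category(as_text[0x85]))
print(b"\x80".decode("latin1") == "\u0080", b"\x80".decode("cp1252") == "\u20ac")
```

This is why decoding as latin1 is a workable way to read "mostly ASCII" data whose real encoding is unknown: nothing is lost, although any non-ASCII bytes are not necessarily displayed as the characters the author intended.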

