[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Thu Sep 13 23:12:04 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

 >> Of course, if the input data already contains PUA characters,
 >> there would be an ambiguity. We can rule this out for most codecs,
 >> as they don't support PUA characters. The major exception would
 >> be UTF-8,

 > Most codecs other than UTF-8 don't have this problem.

All Japanese codecs do.  Corporate variants of JIS remain alive and
well.  They're not limited to Microsoft and Apple: IBM, Fujitsu/Sun,
Hitachi, and NEC software also allow entry of characters not in the
JIS sets.

 > Unicode people are generally allergic to any non-standard variants of
 > Unicode specifications, and feel that this is a heresy. I experimentally
 > and optionally use U+0000 escaping, but I'm not convinced that anything
 > like this is a good idea, and it should probably not be enabled by
 > default.

-1

Heresy, no.  But that doesn't make it anything like a good idea.
There are plenty of character sets, even ISO 2022 compatible ones,
with undefined code points.  Such code points regularly appear in
text whose coded character set is either incorrectly specified or
ambiguous.  This means that a way of handling such points is very
useful, and as long as there's enough PUA space, the approach I
suggested can handle all of these various issues.  Any application
where there won't be enough PUA space is very special: it either
demands more than 2 planes' worth of private space (planes 15 and
16), or demands very high efficiency (everything needs to fit in the
BMP private space).  The approach I suggest has the advantage that
applications with small PUA usage (the BMP offers 6400 PUA code
points, U+E000..U+F8FF) will have string length == character count.
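
As a rough illustration of the PUA idea above, here is a minimal
sketch in Python.  It is not the exact scheme proposed in this thread:
the error-handler name "pua-escape", the base U+E000, and the helper
names are all assumptions made for the example.  It maps each
undecodable byte to one BMP PUA code point and reverses the mapping on
output.

```python
import codecs

# Assumption: park raw bytes at U+E000..U+E0FF, inside the BMP Private
# Use Area (U+E000..U+F8FF). The base is arbitrary for this sketch.
PUA_BASE = 0xE000


def pua_escape(exc):
    """Hypothetical error handler: one PUA code point per bad byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position to resume decoding.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape", pua_escape)


def decode_os_string(raw, encoding="utf-8"):
    """Decode bytes, escaping undecodable bytes into the PUA."""
    return raw.decode(encoding, errors="pua-escape")


def encode_os_string(text, encoding="utf-8"):
    """Reverse decode_os_string: escaped code points become raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode(encoding)
    return bytes(out)


def utf16_units(text):
    """Number of UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


raw = b"caf\xe9.txt"              # Latin-1 bytes, invalid as UTF-8
text = decode_os_string(raw)      # the \xe9 byte becomes U+E0E9
roundtrip = encode_os_string(text)
```

With a BMP base, every escaped byte costs one UTF-16 code unit, so
string length still equals character count; a base in plane 15 or 16
would need a surrogate pair per escaped byte.  The sketch also shows
the ambiguity raised earlier: input that already contains U+E000
through U+E0FF as genuine characters will not round-trip.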

 > the contexts we are talking about don't allow U+0000 anyway.

zsh at least allows you to type ^V^SPC to enter an ASCII NUL character
on the command line, and to assign a string containing NULs to an
environment variable.


