[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull turnbull at sk.tsukuba.ac.jp
Thu Sep 1 10:33:50 CEST 2011


Glenn Linderman writes:

 > I found your discussion of streams versus arrays, as separate concepts
 > related to Unicode, along with Terry's bisect indexing implementation,
 > to be rather inspiring.  Just because Unicode defines streams of code
 > units of various sizes (UTF-8, UTF-16, UTF-32) to represent characters
 > when processes communicate and for storage (which is one way processes
 > communicate), that doesn't imply that the internal representation of
 > character strings in a programming language must use exactly that
 > representation.

That is true, and Unicode is *very* careful to define its requirements
so that is true.  That doesn't mean using an alternative
representation is an improvement, though.
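
For concreteness, here is a minimal sketch of the bisect-style index
Glenn alludes to (not Terry's actual code; the class name, the
checkpoint interval, and the API are all invented here): record the
byte offset of every Nth character of a UTF-8 buffer, then bisect to
turn a character index into a byte offset.

from bisect import bisect_right

class Utf8Index:
    """Sparse character-position index over a UTF-8 byte buffer (sketch)."""

    def __init__(self, data, step=64):
        self.data = data
        self.char_marks = [0]   # character indices of the checkpoints
        self.byte_marks = [0]   # byte offsets where those characters start
        chars = 0
        for offset, byte in enumerate(data):
            if byte & 0xC0 != 0x80:            # lead byte, not a continuation
                if chars and chars % step == 0:
                    self.char_marks.append(chars)
                    self.byte_marks.append(offset)
                chars += 1
        self.length = chars                     # total character count

    def byte_offset(self, char_index):
        """Byte offset at which character number char_index starts."""
        i = bisect_right(self.char_marks, char_index) - 1
        chars, offset = self.char_marks[i], self.byte_marks[i]
        while chars < char_index:               # bounded scan, under step chars
            offset += 1
            while offset < len(self.data) and self.data[offset] & 0xC0 == 0x80:
                offset += 1                     # skip continuation bytes
            chars += 1
        return offset

Slicing then works on byte offsets, e.g.
idx.data[idx.byte_offset(10):idx.byte_offset(20)].decode("utf-8") for
characters 10-19; indexing costs a binary search plus a short forward
scan rather than a constant-time array access, which is exactly the
trade-off being debated here.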

 > I'm unaware of any current Python implementation that has chosen to
 > use UTF-8 as the internal representation of character strings
 > (though I'm aware that Perl has made that choice), yet UTF-8 is one
 > of the commonly recommended character representations on the Linux
 > platform, from what I read.

There are two reasons for that.  First, widechar representations are
right out for anything related to the file system or OS, unless you
are prepared to translate before passing to the OS.  If you use UTF-8,
then asking the user to use a UTF-8 locale to communicate with your
app is a plausible way to eliminate any translation in your app.  (The
original moniker for UTF-8 was FSS-UTF, where FSS stands for "file
system safe.")

Second, much text processing is stream-oriented and one-pass.  In
those cases, the variable-width nature of UTF-8 doesn't cost you
anything.  E.g., this is why the common GUIs for Unix (X.org, GTK+,
and Qt) either provide or require UTF-8 encoding for their text.  It
costs *them* nothing and is file-system-safe.
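
A trivial sketch of the one-pass case (the function is invented for
illustration): counting the characters in a UTF-8 stream needs only
sequential decoding, so the variable width never gets in the way.

# One-pass, stream-oriented processing: no random access into the text
# is ever needed, so UTF-8's variable width costs nothing here.
import codecs

def count_chars(binary_stream, chunk_size=4096):
    """Count the characters in a stream of UTF-8 bytes in a single pass."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    total = 0
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            total += len(decoder.decode(b"", final=True))
            return total
        total += len(decoder.decode(chunk))

Used as, e.g., "with open('somefile.txt', 'rb') as f: count_chars(f)",
it runs in constant memory however large the file is.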

 > So in that sense, Python has rejected the idea of using the
 > "native" or "OS configured" representation as its internal
 > representation.

I can't agree with that characterization.  POSIX defines the concept
of *locale* precisely because the "native" representation of text in
Unix is ASCII.  Obviously that won't fly, so they solved the problem
in the worst possible way<wink/>:  they made the representation
variable!

It is the *variability* of text representation that Python rejects,
just as Emacs and Perl do.  They happen to have chosen six different
representations.[1]
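
(A two-line illustration of that variability, for anyone who wants to
see it on their own machine: the "preferred" text encoding is whatever
the user's locale environment happens to declare.)

import locale

locale.setlocale(locale.LC_ALL, "")        # adopt the user's locale settings
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8', 'ISO8859-1', or
                                           # 'ANSI_X3.4-1968' in the C locale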

 > So why, then, must one choose from a repertoire of Unicode-defined
 > stream representations if they don't meet the goal of efficient
 > length, indexing, or slicing operations on actual characters?

One need not.  But why do anything else?  It's not as though the
authors of that standard paid no attention to concerns about
efficiency and backward compatibility!  That's the question you have
not answered, and at present I lack any data suggesting I'll ever need
the facilities you propose.

Footnotes: 
[1]  Emacs recently changed its mind.  Originally it used the
so-called MULE encoding; now it uses an extension of UTF-8 that
differs from Perl's.  Of course, Python beats that, with narrow, wide,
and now PEP-393 representations!<wink />


