[Python-Dev] Bytes path support

Stephen J. Turnbull stephen at xemacs.org
Thu Aug 28 04:04:01 CEST 2014


Glenn Linderman writes:
 > On 8/27/2014 5:16 AM, Nick Coghlan wrote:

 > > Choosing UTF-8 aims to treat formatting text for communication with 
 > > the user as "just a display issue". It's a low impact design that will 
 > > "just work" for a lot of software, but it comes at a price:
 > >
 > >   * because encoding consistency checks are mostly avoided, data in
 > >     different encodings may be freely concatenated and passed on to
 > >     other applications. Such data is typically not usable by the
 > >     receiving application.
 > 
 > I don't believe this is a necessary result of using UTF-8.

No, it's not, but if you're going to do the same kind of checks that
are necessary for transcoding UTF-8 to abstract Unicode, there's no
benefit to using UTF-8 internally, and you lose a lot.  The only
operations that you can do efficiently are concatenation and
iteration.  I've worked with a UTF-8-like internal encoding for 20
years now -- it's a huge cost.

 > Python3 could have evolved to using UTF-8 as its underlying data
 > format, and obtained equal encoding consistency as it has today.

Thank heaven it didn't!

 > One of the choices of Python3, was to retain character indexing as an 
 > underlying arithmetic implementation citing algorithmic speed, but that 
 > is a seldom needed operation,

That simply isn't true.  The negative effects of algorithmic slowness
in Emacsen are visible both as annoying user delays, and as excessive
developer concentration on optimizing a fundamentally insufficient
data structure.

 > and of limited general applicability when considering grapheme
 > clusters.  An iterator based approach can solve both problems,

On the contrary, grapheme clusters are the relatively rare use case in
textual computing, at least currently, that can be optimized for when
necessary.  There's no problem with creating iterators from arrays,
but making an iterator behave like a array ... well, that involves
creating the array.

 > Such solutions could still be implemented as options.

Sure, but the problems to be solved in that implementation are not due
to Python 3's internal representation.  A lot of painstaking (and
possibly hard?) work remains to be done.

 > A high-performance implementation would likely need to be
 > implemented at least partly in C rather than CPython,

That's how Emacs did it, and (a) over the decades it has involved an
inordinate amount of effort compared to rewriting the text-handling
functions for an array, (b) is fragile, and (c) performance sucks in
practice.

Unicode, not UTF-8, is the central component of the solution.  The
various UTFs are application-specific implementations of Unicode.
UTF-8 is an excellent solution for text streams, such as disk files
and network communication.  Fixed-width representations (ISO-8859-1,
UCS-2, UTF-32, PEP-393) are useful for applications of large buffers
that need O(1) "random" access, and can trivially be iterated for
stream applications.

Steve


More information about the Python-Dev mailing list