[Python-Dev] Bytes path support

Glenn Linderman v+python at g.nevcal.com
Wed Aug 27 20:18:11 CEST 2014


On 8/27/2014 5:16 AM, Nick Coghlan wrote:
> On 27 August 2014 08:52, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy at udel.edu> wrote:
>>> Nick, I think the first half of your post is one of the clearest
>>> expositions yet of 'why Python 3' (in particular, the str to unicode
>>> change).  It is worthy of wider distribution and without much change, it
>>> would be a great blog post.
>> Indeed, I had the same idea - I had been assuming users already understood
>> this context, which is almost certainly an invalid assumption.
>>
>> The blog post version is already mostly written, but I ran out of weekend.
>> Will hopefully finish it up and post it some time in the next few days :)
> Aaand, it's up:
> http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
>
> Cheers,
> Nick.
>

Indeed, I also enjoyed your response to this issue and found it
enlightening, including the broader historical context. I remember when
Unicode was first published back in 1991, and it sounded interesting,
but far removed from the reality of implementations of the day. I was
intrigued by UTF-8 at the time, and even wrote an encoder and decoder
for it for a software package that never reached any real customers.

Your blog post says:
>
> Choosing UTF-8 aims to treat formatting text for communication with 
> the user as "just a display issue". It's a low impact design that will 
> "just work" for a lot of software, but it comes at a price:
>
>   * because encoding consistency checks are mostly avoided, data in
>     different encodings may be freely concatenated and passed on to
>     other applications. Such data is typically not usable by the
>     receiving application.
>

I don't believe this is a necessary result of using UTF-8. It is a
possible result, and I guess some implementations are using it this way,
but a well-designed language could still provide and/or require proper
usage of UTF-8 data through its type system, just as Python3 is doing
with PEP 393.  In fact, if it were not for the requirement to support
passing character strings in other formats (UTF-16, UTF-32) to
historical APIs (in CPython add-on packages) and the resulting practical
performance cost of converting to/from UTF-8 repeatedly when calling
those APIs, Python3 could have evolved to use UTF-8 as its underlying
data format, and obtained the same encoding consistency it has today.
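
As a rough sketch of that idea (the Utf8Text class and its method names
are hypothetical, made up for illustration, and not an existing Python
API), a text type can store UTF-8 internally and still refuse to mix in
bytes of unknown encoding:

```python
class Utf8Text:
    """Hypothetical text type whose storage is always valid UTF-8."""

    def __init__(self, data: bytes):
        # Validate once at the boundary; invalid UTF-8 raises
        # UnicodeDecodeError here instead of leaking downstream.
        data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def from_encoded(cls, data: bytes, encoding: str) -> "Utf8Text":
        # Other encodings must be transcoded explicitly.
        return cls(data.decode(encoding).encode("utf-8"))

    def __add__(self, other):
        if not isinstance(other, Utf8Text):
            # Raw bytes or anything else: let Python raise TypeError.
            return NotImplemented
        # Concatenating two valid UTF-8 sequences stays valid UTF-8.
        return Utf8Text(self._data + other._data)

    def to_bytes(self) -> bytes:
        return self._data


greeting = Utf8Text("caf\u00e9 ".encode("utf-8"))
legacy = Utf8Text.from_encoded("na\u00efve".encode("latin-1"), "latin-1")
combined = greeting + legacy  # fine: both sides are known to be UTF-8
```

With this sketch, greeting + b"na\xefve" raises TypeError, and
Utf8Text(b"na\xefve") raises UnicodeDecodeError, so mixed-encoding data
cannot be concatenated by accident.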

Of course, nothing can be "required" if the user chooses to continue
operating in the encoded domain, and manipulate data using the necessary
byte-oriented features of whatever language is in use.

One of the choices of Python3 was to retain character indexing,
implemented as simple arithmetic underneath, citing algorithmic speed.
But indexing is a seldom-needed operation, and of limited general
applicability when considering grapheme clusters. An iterator-based
approach can solve both problems, but would have been best introduced as
part of Python 3.0, although it may have made 2to3 harder, and may have
made it less practical to implement six and other "run on both Py2 and
Py3" type solutions, without introducing those same iterative solutions
into Python 2.6 or 2.7.
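
To illustrate why code point indexing is of limited use with grapheme
clusters, here is a small sketch. The helper iter_simple_clusters is
hypothetical and deliberately simplified: it only groups a base
character with the combining marks that follow it, which is far short of
the full Unicode segmentation rules.

```python
import unicodedata


def iter_simple_clusters(text):
    """Yield each base character together with its combining marks.

    Simplified assumption: a cluster is one code point followed by any
    code points whose canonical combining class is nonzero.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            # A new base character starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


word = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
last_by_index = word[-1]                     # only the accent
clusters = list(iter_simple_clusters(word))  # accent stays with its 'e'
```

Here len(word) is 5 and word[-1] is the bare combining accent, while
the iterator yields four units, the last being the accented "e" as a
whole.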

Such solutions could still be implemented as options. Even PEP 393
grudgingly supports some use of UTF-8 when requested by the user, as I
understand it. Whether such an implementation would be better based on
bytes or str is uncertain without further analysis, although type
checking would probably be easier if based on str. A high-performance
implementation would likely need to be implemented at least partly in C
rather than Python, although it could be prototyped in Python for proof
of functionality. The iterators could obviously be implemented to work
on top of solutions such as PEP 393, by simply using indexing
underneath when fixed-width characters are available, and other
techniques when UTF-8 is the only available format (rather than
converting from UTF-8 to fixed-width characters because of calling the
iterator).
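
A pure-Python prototype of such an iterator might look like the sketch
below. The function name iter_code_points is hypothetical; it indexes
directly when given a str (fixed-width storage) and walks the lead bytes
when given UTF-8 bytes, so the whole buffer is never converted up front.

```python
def iter_code_points(source):
    """Yield one-character strings from a str or from UTF-8 bytes."""
    if isinstance(source, str):
        # Fixed-width characters are available: plain indexing.
        for i in range(len(source)):
            yield source[i]
        return

    data = bytes(source)
    pos = 0
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells us how many bytes this code point uses.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % pos)
        # Decode only this one code point; decode() also validates the
        # continuation bytes and rejects truncated sequences.
        yield data[pos:pos + width].decode("utf-8")
        pos += width


sample = "a\u00e9\u20ac\U0001F600"  # 1-, 2-, 3- and 4-byte code points
from_str = list(iter_code_points(sample))
from_utf8 = list(iter_code_points(sample.encode("utf-8")))
```

Both calls produce the same sequence of four characters, so callers do
not need to know which representation sits underneath.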