[Python-Dev] Bytes path support
Glenn Linderman
v+python at g.nevcal.com
Wed Aug 27 20:18:11 CEST 2014
On 8/27/2014 5:16 AM, Nick Coghlan wrote:
> On 27 August 2014 08:52, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> On 27 Aug 2014 02:52, "Terry Reedy" <tjreedy at udel.edu> wrote:
>>> Nick, I think the first half of your post is one of the clearest
>>> expositions yet of 'why Python 3' (in particular, the str to unicode
>>> change). It is worthy of wider distribution and without much change, it
>>> would be a great blog post.
>> Indeed, I had the same idea - I had been assuming users already understood
>> this context, which is almost certainly an invalid assumption.
>>
>> The blog post version is already mostly written, but I ran out of weekend.
>> Will hopefully finish it up and post it some time in the next few days :)
> Aaand, it's up:
> http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
>
> Cheers,
> Nick.
>
Indeed, I also enjoyed and found enlightening your response to this
issue, including the broader historical context. I remember when Unicode
was first published back in 1991, and it sounded interesting, but far
removed from the reality of implementations of the day. I was intrigued
by UTF-8 at the time, and even wrote an encoder and decoder for it for a
software package that ultimately never reached any real customers.
Your blog post says:
>
> Choosing UTF-8 aims to treat formatting text for communication with
> the user as "just a display issue". It's a low impact design that will
> "just work" for a lot of software, but it comes at a price:
>
> * because encoding consistency checks are mostly avoided, data in
> different encodings may be freely concatenated and passed on to
> other applications. Such data is typically not usable by the
> receiving application.
>
I don't believe this is a necessary result of using UTF-8. It is a
possible result, and I guess some implementations are using it this way,
but a proper language could still provide and/or require proper usage of
UTF-8 data through its type system just as Python3 is doing with PEP
393. In fact, if it were not for the requirement to support passing
character strings in other formats (UTF-16, UTF-32) to historical APIs
(in CPython add-on packages) and the resulting practical performance
considerations of converting to/from UTF-8 repeatedly when calling those
APIs, Python3 could have evolved to using UTF-8 as its underlying data
format, and achieved the same encoding consistency it has today.
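To make the type-system point concrete, here is a minimal sketch of a text type that stores UTF-8 internally yet still enforces encoding consistency. The class name Utf8Text and its methods are hypothetical, invented for illustration; nothing like it exists in CPython, and a real design would need far more than this.

```python
class Utf8Text:
    """Hypothetical text type: UTF-8 storage, validated at the boundary."""

    __slots__ = ("_data",)

    def __init__(self, data: bytes, encoding: str = "utf-8"):
        # Decode from the declared encoding, then store as UTF-8.
        # Malformed input raises UnicodeDecodeError here, not later.
        self._data = data.decode(encoding).encode("utf-8")

    def __add__(self, other):
        if not isinstance(other, Utf8Text):
            # Raw bytes of unknown encoding are refused (TypeError).
            return NotImplemented
        result = Utf8Text.__new__(Utf8Text)
        # Both operands are known-valid UTF-8, so plain byte
        # concatenation yields valid UTF-8 with no re-validation.
        result._data = self._data + other._data
        return result

    def __bytes__(self):
        return self._data

    def __str__(self):
        return self._data.decode("utf-8")
```

With this, text that arrived as Latin-1 and text that arrived as UTF-8 can be joined safely, while joining a Utf8Text with undeclared bytes fails immediately instead of producing mixed-encoding data.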
Of course, nothing can be "required" if the user chooses to continue
operating in the encoded domain, and manipulate data using the necessary
byte-oriented features of whatever language is in use.
One of the choices of Python3 was to retain character indexing as an
underlying arithmetic implementation, citing algorithmic speed, but
indexing is a seldom needed operation, and of limited general
applicability when considering grapheme clusters. An iterator-based
approach can solve both problems, but would have been best introduced as
part of Python3.0, although it may have made 2to3 harder, and may have
made six and other "run on both Py2 and Py3" type solutions less
practical to implement, without introducing those same iterative
solutions into Python 2.6 or 2.7.
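As a small illustration of why code point indexing is of limited use for grapheme clusters, the sketch below groups each base character with the combining marks that follow it. The function name iter_clusters is hypothetical, and the rule used (unicodedata.combining) is a deliberate simplification, not the full Unicode segmentation algorithm.

```python
import unicodedata


def iter_clusters(text):
    """Yield each base character plus any combining marks after it.

    Simplified approximation only: real grapheme cluster rules
    (Unicode UAX #29) cover many more cases than combining marks.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            # A new base character starts the next cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


text = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
clusters = list(iter_clusters(text))
```

Here len(text) is 3 and text[0] is a bare "e" with its accent stripped off, while the iterator yields two items, the first being the accented letter as the user would perceive it.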
Such solutions could still be implemented as options. Even PEP 393
grudgingly supports some use of UTF-8 when requested by the user, as I
understand it. Whether such an implementation would be better based on
bytes or str is uncertain without further analysis, although type
checking would probably be easier if based on str. A high-performance
implementation would likely need to be implemented at least partly in C
rather than Python, although it could be prototyped in Python for proof
of functionality. The iterators could obviously be implemented to work
on top of solutions such as PEP 393, by simply using indexing
underneath, when fixed-width characters are available, and other
techniques when UTF-8 is the only available format (rather than
converting from UTF-8 to fixed-width characters because of calling the
iterator).
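A rough Python prototype of that dual strategy might look like the following. The function name iter_code_points is hypothetical; it assumes str stands in for the fixed-width case and bytes for the UTF-8-only case, and it leans on the built-in UTF-8 decoder to validate each individual sequence rather than converting the whole buffer up front.

```python
def iter_code_points(data):
    """Yield one-character strings from a str or from UTF-8 bytes."""
    if isinstance(data, str):
        # Fixed-width representation available: plain indexing.
        for i in range(len(data)):
            yield data[i]
        return

    # UTF-8 only: walk the buffer one encoded sequence at a time.
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        if first < 0x80:
            length = 1              # 0xxxxxxx
        elif first >> 5 == 0b110:
            length = 2              # 110xxxxx
        elif first >> 4 == 0b1110:
            length = 3              # 1110xxxx
        elif first >> 3 == 0b11110:
            length = 4              # 11110xxx
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        # Decode just this sequence; the decoder rejects truncated
        # or malformed continuation bytes.
        yield data[i:i + length].decode("utf-8")
        i += length
```

Both branches yield the same items for the same text, so callers need not know which representation sits underneath; a C implementation could keep that interface while avoiding the per-character slicing overhead.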