On 5 June 2014 22:01, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Aside from some of the POSIX locale handling issues on Linux, many of the concerns are with the usability of bytes and bytearray, not with str - that's why binary interpolation is coming back in 3.5, and there will likely be other usability tweaks for those types as well.
All these changes are what let me dream on and speculate about the possibility that Python 4 could offer an encoding-neutral string type (which means one based on bytes), while moving unicode back to an explicit type that is used only when needed (bloated frameworks like Django can force it on users anyway, but that would be forcing at the framework level, not at the language level, which is what people rebel against). People can dream, right?
If you don't model strings as arrays of code points, or at least assume a particular universal encoding (like UTF-8), you have to give up string concatenation in order to tolerate arbitrary encodings. Otherwise you end up with unintelligible data that nobody can decode, because it switches encodings without notice. Assuming a universal encoding is a viable model if your OS guarantees it (Mac OS X does, for example, so Python 3 assumes UTF-8 for all OS interfaces there), but Linux currently has no such guarantee. Many runtimes just decide they don't care and assume UTF-8 anyway. Python 3 may even join them some day, due to the problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen, and the C.UTF-8 locale may even become widespread before we get there.

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
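
P.S. As a rough illustration of the concatenation problem described above, here is a minimal sketch. The sample strings and the try_decode helper are made up for this example and don't come from any real code base; the point is only that bytes concatenation does not track encodings, while decoding to str first does.

```python
# Two byte strings that hold text in *different* encodings.
latin1_part = "café ".encode("latin-1")   # b'caf\xe9 '
utf8_part = "naïve".encode("utf-8")       # b'na\xc3\xafve'

# Concatenating bytes succeeds silently: nothing records which
# encoding each half was written in.
mixed = latin1_part + utf8_part


def try_decode(data, encoding):
    """Hypothetical helper: decoded text, or None if data is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# As UTF-8 the result is invalid (0xE9 is not followed by continuation bytes).
as_utf8 = try_decode(mixed, "utf-8")

# As Latin-1 it decodes, but the UTF-8 half turns into mojibake.
as_latin1 = try_decode(mixed, "latin-1")

# Decoding each piece with its own encoding *before* concatenating
# (i.e. working with code points) gives the intended text.
clean = latin1_part.decode("latin-1") + utf8_part.decode("utf-8")

print(as_utf8)
print(as_latin1)
print(clean)
```

The first result is None and the second is garbled, whereas the str-based concatenation round-trips correctly.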