On Sat, Jan 11, 2014 at 04:15:35PM +0100, M.-A. Lemburg wrote:
> I think we need to step back a little from the purist view of things and give more emphasis on the "practicality beats purity" Zen.
> I completely agree with Stephen that bytes are in fact often an encoding of text. If that text is ASCII compatible, I don't see any reason why we should not continue to expose the C lib standard string APIs available for text manipulations on bytes.
Later in your post, you talk about the masses of broken encodings found everywhere (not just in your spam folder). How do the C lib standard string APIs help programmers to avoid broken encodings?
> We don't have to be pedantic about the bytes/text separation. It doesn't help in real life.
On the contrary, it helps a lot. To the extent that people keep that clean bytes/text separation, it helps avoid bugs. It prevents problems like this Python 2 nonsense:
    s = "Straße"
    assert len(s) == 6  # fails
    assert s[5] == 'e'  # fails
Most problematic, printing s may (depending on your terminal settings) actually look like "StraÃŸe".
Not only is having a clean bytes/text separation the pedantic thing to do, it's also the right thing to do nearly always (notwithstanding a few exceptions, allegedly).
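For comparison, here is a minimal Python 3 sketch of the same checks. It assumes UTF-8 for the byte form and cp1252 as the mismatched codec; both choices, and the variable names, are illustrative only.

```python
# Python 3: str holds code points, bytes holds encoded octets.
s = "Straße"
assert len(s) == 6         # six characters
assert s[5] == "e"         # indexing a str yields a character

data = s.encode("utf-8")   # explicit step from text to bytes
assert len(data) == 7      # ß occupies two bytes in UTF-8
assert data[5] == 0x9F     # indexing bytes yields an int (second byte of ß)

# Decoding with the wrong codec gives mojibake; the right one round-trips.
assert data.decode("cp1252") == "StraÃŸe"
assert data.decode("utf-8") == s
```

The character-level assertions hold because the text object never exposes its encoding; the byte count only appears after an explicit encode().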
> If you give programmers the choice they will - most of the time - do the right thing.
Unicode has been available in Python since version 2.2, more than a decade ago. And yet here we are, five point releases later (2.7), and the majority of text processing code is still using bytes.
I'm not just pointing the finger at others. My 2.x-only code almost always uses byte strings for text processing, and not always because it was old code I wrote before I knew better. The coders I work with do the same, only you can remove the word "almost". The code I see posted on comp.lang.python and Reddit and the tutor mailing list invariably uses byte strings. The beginners on the tutor list at least have an excuse that they are beginners.
A quarter of a century after Unicode was first published, nearly 28 years since IBM first introduced the concept of "code pages" to PC users, and we still have programmers writing ASCII-only string-handling code that, if it works with extended character sets, only works by accident. The majority of programmers still have *no idea* of even the most basic parts of Unicode. They've had the right tools for a decade, and ignored them.
Python 3 forces the issue, and my code is better for it.
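As a small sketch of what "forces the issue" means in practice: Python 3 raises TypeError when bytes and str are combined, rather than guessing an encoding. The names below are hypothetical and UTF-8 is an assumed encoding.

```python
# Python 3 refuses to mix text and bytes implicitly.
text = "Straße"
raw = text.encode("utf-8")

try:
    combined = raw + text   # bytes + str
except TypeError:
    combined = None         # no implicit conversion is attempted

# The programmer has to name the encoding to combine the two.
combined_text = raw.decode("utf-8") + text
assert combined is None
assert combined_text == "StraßeStraße"
```

In Python 2 the equivalent str + unicode expression triggers an implicit ASCII decode, which works for pure-ASCII data and fails with UnicodeDecodeError only once non-ASCII bytes show up.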
> bytes already have most of the 8-bit string methods from Python 2, so it doesn't hurt adding some more of the missing features from Python 2 on top to make life easier for people dealing with multiple/unknown encoding data.
I personally think it was a mistake to keep text operations like upper() and lower() on bytes. I think it will compound the mistake to add even more text operations.
-- Steven