[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Sat Jan 11 19:49:25 CET 2014

On Sat, Jan 11, 2014 at 04:15:35PM +0100, M.-A. Lemburg wrote:

> I think we need to step back a little from the purist view
> of things and give more emphasis on the "practicality beats
> purity" Zen.
> 
> I completely agree with Stephen, that bytes are in fact often
> an encoding of text. If that text is ASCII compatible, I don't
> see any reason why we should not continue to expose the C lib
> standard string APIs available for text manipulations on bytes.

Later in your post, you talk about the masses of broken encodings found 
everywhere (not just in your spam folder). How do the C lib standard 
string APIs help programmers to avoid broken encodings?

> We don't have to be pedantic about the bytes/text separation.
> It doesn't help in real life.

On the contrary, it helps a lot. To the extent that people keep that 
clean bytes/text separation, it helps avoid bugs. It prevents problems 
like this Python 2 nonsense:

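s = "Straße"        # Python 2 byte string (assuming a UTF-8 source file)
assert len(s) == 6  # fails: len(s) is 7, since 'ß' is two bytes in UTF-8
assert s[5] == 'e'  # fails: s[5] is '\x9f', the second byte of 'ß'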

Most problematic, printing s may (depending on your terminal settings) 
actually look like "Straße".

Not only is having a clean bytes/text separation the pedantic thing to 
do, it's also the right thing to do nearly always (notwithstanding a 
few exceptions, allegedly).
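
For comparison, here is a minimal sketch of the same checks in Python 3, 
assuming UTF-8 as the encoding chosen for the byte form. The name "data" 
is just for illustration:

```python
# Python 3: str holds code points, bytes holds encoded octets.
s = "Straße"
assert len(s) == 6          # six characters
assert s[5] == "e"          # indexing works on characters

data = s.encode("utf-8")    # explicit step from text to bytes
assert len(data) == 7       # 'ß' takes two bytes in UTF-8
assert data[5] == 0x9F      # second byte of 'ß', not 'e'
assert data.decode("utf-8") == s
```

Both views of the data are still available, but you have to say which 
one you mean.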

> If you give programmers the choice they will - most of the time -
> do the right thing. 

Unicode has been available in Python since version 2.0, more than a 
decade ago. And yet here we are, seven point releases later (2.7), and 
the majority of text processing code is still using bytes. I'm not just 
pointing the finger at others. My 2.x-only code almost always uses byte 
strings for text processing, and not always because it was old code I 
wrote before I knew better. The coders I work with do the same, only you 
can remove the word "almost". The code I see posted on comp.lang.python 
and Reddit and the tutor mailing list invariably uses byte strings. The 
beginners on the tutor list at least have an excuse that they are 
beginners.

A quarter of a century after Unicode was first published, nearly 
28 years since IBM first introduced the concept of "code pages" 
to PC users, and we still have programmers writing ASCII-only 
string-handling code that, if it works with extended character sets, 
only works by accident. The majority of programmers still have *no idea* 
of even the most basic parts of Unicode. They've had the right tools 
for a decade, and ignored them.

Python 3 forces the issue, and my code is better for it.

> bytes already have most of the 8-bit string methods from Python 2,
> so it doesn't hurt adding some more of the missing features
> from Python 2 on top to make life easier for people dealing
> with multiple/unknown encoding data.

I personally think it was a mistake to keep text operations like upper() 
and lower() on bytes. I think it will compound the mistake to add even 
more text operations.
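
To illustrate what I mean, here is a small Python 3 sketch of how upper() 
behaves on bytes compared with str. The sample word and variable names 
are only for illustration:

```python
text = "straße"
raw = text.encode("utf-8")

# str.upper() applies Unicode case mapping: 'ß' becomes 'SS'.
upper_text = text.upper()
assert upper_text == "STRASSE"

# bytes.upper() only touches ASCII letters; other bytes pass through.
upper_raw = raw.upper()
assert upper_raw == b"STRA\xc3\x9fE"
assert upper_raw.decode("utf-8") == "STRAßE"

# The same holds for a single-byte encoding such as Latin-1.
latin1 = text.encode("latin-1")
assert latin1.upper() == b"STRA\xdfE"
```

The bytes method gives an answer that looks plausible for ASCII input and 
silently differs from the text answer for anything else.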

-- 
Steven