[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Steven D'Aprano
steve at pearwood.info
Sat Jan 11 19:49:25 CET 2014
On Sat, Jan 11, 2014 at 04:15:35PM +0100, M.-A. Lemburg wrote:
> I think we need to step back a little from the purist view
> of things and give more emphasis on the "practicality beats
> purity" Zen.
>
> I completely agree with Stephen, that bytes are in fact often
> an encoding of text. If that text is ASCII compatible, I don't
> see any reason why we should not continue to expose the C lib
> standard string APIs available for text manipulations on bytes.
Later in your post, you talk about the masses of broken encodings found
everywhere (not just in your spam folder). How do the C lib standard
string APIs help programmers to avoid broken encodings?
> We don't have to be pedantic about the bytes/text separation.
> It doesn't help in real life.
On the contrary, it helps a lot. To the extent that people keep that
clean bytes/text separation, it helps avoid bugs. It prevents problems
like this Python 2 nonsense:
    s = "Straße"
    assert len(s) == 6  # fails
    assert s[5] == 'e'  # fails
Most problematic, printing s may (depending on your terminal settings)
actually look like "Straße".
Not only is having a clean bytes/text separation the pedantic thing to
do, it's also nearly always the right thing to do (notwithstanding a
few alleged exceptions).
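For comparison, here is a minimal sketch of the same example under
Python 3, where the text and its encoded bytes are separate types. The
variable names, and the choice of UTF-8 and cp1252 as the two codecs,
are assumptions made for illustration only.

```python
# Python 3: text (str) and its encoded form (bytes) are distinct types.
text = "Straße"
encoded = text.encode("utf-8")  # assumption: UTF-8 as the wire encoding

assert len(text) == 6       # six characters, as a reader would expect
assert text[5] == "e"

assert len(encoded) == 7    # 'ß' occupies two bytes in UTF-8
assert encoded[5] == 0x9F   # indexing bytes yields an int, not a character

# Decoding with the wrong codec reproduces the garbled output shown above.
mojibake = encoded.decode("cp1252")
assert mojibake != text
```

The two assertions that fail in Python 2 pass here because len() and
indexing operate on characters, and the mismatch only appears once you
explicitly decode with the wrong codec.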
> If you give programmers the choice they will - most of the time -
> do the right thing.
Unicode has been available in Python since version 2.2, more than a
decade ago. And yet here we are, five point releases later (2.7), and
the majority of text processing code is still using bytes. I'm not just
pointing the finger at others. My 2.x-only code almost always uses byte
strings for text processing, and not always because it was old code I
wrote before I knew better. The coders I work with do the same, only you
can remove the word "almost". The code I see posted on comp.lang.python
and Reddit and the tutor mailing list invariably uses byte strings. The
beginners on the tutor list at least have an excuse that they are
beginners.
A quarter of a century after Unicode was first published, nearly
28 years since IBM first introduced the concept of "code pages"
to PC users, and we still have programmers writing ASCII-only
string-handling code that, if it works with extended character sets,
only works by accident. The majority of programmers still have *no idea*
of even the most basic parts of Unicode. They've had the right tools
for a decade, and ignored them.
Python 3 forces the issue, and my code is better for it.
> bytes already have most of the 8-bit string methods from Python 2,
> so it doesn't hurt adding some more of the missing features
> from Python 2 on top to make life easier for people dealing
> with multiple/unknown encoding data.
I personally think it was a mistake to keep text operations like upper()
and lower() on bytes. I think it will compound the mistake to add even
more text operations.
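To illustrate the concern, here is a small sketch of how upper() on
bytes behaves in Python 3. The sample string is my own, chosen only
because it mixes ASCII and non-ASCII letters.

```python
# bytes.upper() only touches ASCII letters; str.upper() knows about Unicode.
text = "straße café"
data = text.encode("utf-8")

upper_text = text.upper()
upper_data = data.upper()

assert upper_text == "STRASSE CAFÉ"                  # full Unicode case mapping
assert upper_data.decode("utf-8") == "STRAßE CAFé"   # non-ASCII left untouched
```

The bytes result is neither an error nor correct upper-casing of the
text, which is the kind of silent partial success that hides bugs.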
--
Steven