[Python-Dev] Pre-PEP: The "bytes" object

Mon Feb 27 21:14:00 CET 2006

Neil Schemenauer wrote:
> Ron Adam <rrr at ronadam.com> wrote:
>> Why was it decided that the unicode encoding argument should be ignored
>> if the first argument is a string?  Wouldn't an exception be better
>> rather than give the impression it does something when it doesn't?
>
>From the PEP:
>
>     There is no sane meaning that the encoding can have in that
>     case.  str objects *are* byte arrays and they know nothing about
>     the encoding of character data they contain.  We need to assume
>     that the programmer has provided str object that already uses
>     the desired encoding.
>
> Raising an exception would be a valid option.  However, passing the
> string through unchanged makes the transition from str to bytes
> easier.

Does it?

I am quite certain the bytes PEP is dead wrong on this.  It should be changed.

Suppose I have code like this:

    def faz(s):
        return s.encode('utf-16be')

If I want to transition from str to bytes, how should I change this code?

    def faz(s):
        return bytes(s, 'utf-16be')  # OOPS - subtle bug

This silently does the wrong thing when s is a str.  If I hadn't read
the PEP, I would confidently assume that bytes(str, encoding) ==
bytes(unicode, encoding), modulo the default encoding.  I'd be wrong. 
But there's a really good reason to think this.  Wherever a unicode
argument is expected in Python 2.x, you can pass a str and it'll be
silently decoded.  This is an extremely strong convention.  It's even
embedded in PyArg_ParseTuple().  I can't think of any exceptions to
the rule, offhand.

Is this special case special enough to break the rules?  Arguable.  I
suspect not.  But even if so, allowing the breakage to pass silently
is surely a mistake.  It should just refuse the temptation to guess,
and throw an exception--right?

Now you may be thinking:  the str/unicode duality of text, and the
bytes/text duality of the "str" type, are *bad* things, and we're
trying to get rid of them.  True.  My view is, we'll be rid of them in
3.0 regardless.  In the meantime, there is no point trying to pretend
that 2.0 "str" is bytes and not text.  It just ain't so; you'll only
succeed in confusing people and causing bugs.  (And in 3.0 you're
going to turn around and tell them "str" *is* text!)

Good APIs make simple, sensible, comprehensible promises.  I like
these promises:
  - bytes(arg) works like array.array('b', arg)
  - bytes(arg1, arg2) works like bytes(arg1.encode(arg2))

I dislike these promises:
  - bytes(s, [ignored]), where s is a str, works like array.array('b', s)
  - bytes(u, [encoding]), where u is a unicode,
        works like bytes(u.encode(encoding))

It seems more Pythonic to differentiate based on the number of
arguments, rather than the type.

-j

P.S.  As someone who gets a bit agitated when the word "Pythonic" or
the Zen of Python is taken in vain, I'd like to know if anyone feels
I've done so here, so I can properly apologize.  Thanks.