[Python-Dev] Pre-PEP: The "bytes" object
bokr at oz.net
Fri Feb 17 07:24:57 CET 2006
On Thu, 16 Feb 2006 12:47:22 -0800, Guido van Rossum <guido at python.org> wrote:
>On 2/15/06, Neil Schemenauer <nas at arctrix.com> wrote:
>> This could be a replacement for PEP 332. At least I hope it can
>> serve to summarize the previous discussion and help focus on the
>> currently undecided issues.
>> I'm too tired to dig up the rules for assigning it a PEP number.
>> Also, there are probably silly typos, etc. Sorry.
>I may check it in for you, although right now it would be good if we
>had some more feedback.
>I noticed one behavior in your pseudo-code constructor that seems
>questionable: while in the Q&A section you explain why the encoding is
>ignored when the argument is a str instance, in fact you require an
>encoding (and one that's not "ascii") if the str instance contains any
>non-ASCII bytes. So bytes("\xff") would fail, but bytes("\xff",
>"blah") would succeed. I think that's a bit strange -- if you ignore
>the encoding, you should always ignore it. So IMO bytes("\xff") and
>bytes("\xff", "ascii") should both return the same as bytes([255]).
>Also, there's a code path where the initializer is a unicode instance
>and its encode() method is called with None as the argument. I think
>both could be fixed by setting the encoding to
>sys.getdefaultencoding() if it is None and the argument is a unicode
>instance:
>    def bytes(initialiser=[], encoding=None):
>        if isinstance(initialiser, basestring):
>            if isinstance(initialiser, unicode):
>                if encoding is None:
>                    encoding = sys.getdefaultencoding()
>                initialiser = initialiser.encode(encoding)
>            initialiser = [ord(c) for c in initialiser]
>        elif encoding is not None:
>            raise TypeError("explicit encoding invalid for non-string "
>                            "initialiser")
>        # create bytes object and fill with integers from initialiser
>        # return bytes object
As the above shows, str is encoding-agnostic and passes through
unmodified to bytes (except by ord).
I am wondering what it would hurt to allow the same for unicode ords,
since unicode is also encoding-agnostic. Please read  before
deciding that you have already decided this ;-)
The beauty of a unicode literal IMO is that it launders away
the source encoding into a coding-agnostic character sequence
that has stable ords across the universe, so why not use them?
It also solves a lot of escaping grief. But see 
After all, in either case, an encoding can be specified if so desired. Thus
def bytes(initialiser=[], encoding=None):
    if isinstance(initialiser, basestring):
        if encoding is not None:
            initialiser = initialiser.encode(encoding) # XXX for str ?? see 
        initialiser = [ord(c) for c in initialiser]
    elif encoding is not None:
        raise TypeError("explicit encoding invalid for non-string "
                        "initialiser")
    # create bytes object and fill with integers from initialiser
    # return bytes object
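To make the proposal concrete, here is a rough sketch of the rule in present-day Python 3 notation, where str is what this thread calls unicode. The name make_bytes is made up for illustration and is not part of any proposal; the built-in bytes type is used only as a container for the result.

```python
def make_bytes(initialiser=(), encoding=None):
    """Hypothetical stand-in for the constructor discussed above."""
    if isinstance(initialiser, str):
        if encoding is not None:
            # An explicit encoding was given: encode the text normally.
            return initialiser.encode(encoding)
        # No encoding: pass each character's ordinal straight through.
        # bytes() raises ValueError for any ordinal above 255.
        return bytes([ord(c) for c in initialiser])
    if encoding is not None:
        raise TypeError("explicit encoding invalid for non-string initialiser")
    # Any other initialiser is taken to be an iterable of integers.
    return bytes(initialiser)


plain = make_bytes("abc\xf6")             # ordinals used as byte values
encoded = make_bytes("abc\xf6", "utf-8")  # explicit encoding still available
```

With no encoding the hex escape \xf6 ends up as the single byte 0xf6; with "utf-8" the same character becomes the two bytes 0xc3 0xb6.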
One thing I wonder is where sys.getdefaultencoding() gets its info, and whether
a module_encoding is also necessary for str arguments with encoding.
E.g. if the source encoding is utf-8, and you want to end up in sys.getdefaultencoding(),
don't you first have to decode from the source encoding, rather than
let the default decoding assumption for that be ascii? E.g. for utf-8 source,
encoding a str literal that contains non-ascii bytes bombs, because it tries
to do .decode('ascii') in place of .decode('utf-8').
Notice where the following fails (where utf-8 source is written to tutf8.py
by tutf.py and using latin-1 as standin for sys.getdefaultencoding())
----< tutf.py >-------------------------------------------
latin_1_src = """\
# -*- coding: utf-8 -*-
print '\\nfrom tutf8 import:'
print map(hex,map(ord, 'abc\xf6'))
if __name__ == '__main__':
    print '\ntutf8.py utf-8 binary line reprs:'
    print '\n'.join(repr(L) for L in open('tutf8.py','rb').read().splitlines())
[20:17] C:\pywk\pydev\pep0332>py24 tutf.py
tutf8.py utf-8 binary line reprs:
'# -*- coding: utf-8 -*-'
"print '\\nfrom tutf8 import:'"
"print map(hex,map(ord, 'abc\xc3\xb6'))"
from tutf8 import:
['0x61', '0x62', '0x63', '0xc3', '0xb6']
['0x61', '0x62', '0x63', '0xf6']
Traceback (most recent call last):
File "tutf.py", line 15, in ?
File "C:\pywk\pydev\pep0332\tutf8.py", line 5, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
I.e., if you leave out encoding for a str, you apparently get the native source
str representation of the literal, so it would seem that that must be undone
if you want to re-encode to anything else.
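The same byte values can be reproduced without writing any files. The following is only a sketch in Python 3 notation (the session above is Python 2.4), and the helper name hexes is made up for illustration. It shows the utf-8 bytes a utf-8 source file stores for the literal, the latin-1 bytes obtained by decoding from the source encoding first, and the failure when ascii is assumed instead.

```python
def hexes(data):
    """Return the hex form of each byte value in data."""
    return [hex(b) for b in data]


text = "abc\xf6"                    # the literal as typed in the source
utf8_bytes = text.encode("utf-8")   # what a utf-8 source file stores
# Undo the source encoding first, then re-encode to the target encoding.
latin1_bytes = utf8_bytes.decode("utf-8").encode("latin-1")

print(hexes(utf8_bytes))    # ['0x61', '0x62', '0x63', '0xc3', '0xb6']
print(hexes(latin1_bytes))  # ['0x61', '0x62', '0x63', '0xf6']

# Assuming ascii for the decode step fails at the first non-ascii byte.
ascii_error = None
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
    print(exc)
```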
Should there be tutf8.__encoding__ available for this after import tutf8?
But that's interesting when str becomes unicode, and all literals will presumably have
an internal uniform unicode encoding, so the 'literal'.decode(source_encoding) will in effect already
have been done. What does a decode mean on unicode? It seems to mean blow up on non-ascii, so
that's not very portable. Why not use latin-1 as the default intermediate str representation when
doing a u'something'.decode(enc) ? The restriction to ascii in that context seems artificial.
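To illustrate why latin-1 is a natural intermediate: it maps every byte value 0..255 to the code point with the same number, so arbitrary bytes survive a round trip, whereas ascii gives up at 0x80. A small sketch in Python 3 notation, with variable names chosen just for this example:

```python
all_bytes = bytes(range(256))

# latin-1 decodes every byte to the code point with the same ordinal ...
as_text = all_bytes.decode("latin-1")
ordinals = [ord(c) for c in as_text]

# ... and encoding back gives the original bytes unchanged.
round_trip = as_text.encode("latin-1")

# ascii, by contrast, rejects anything outside range(128).
try:
    all_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```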
IMHO and with all due respect ISTM the pain of all these considerations is not worth it when
the simple practicality of just prefixing a "u" on any ascii literal freely sprinkled
with escapes gets you exactly the bytes values you specify in any hex escapes. That's normally
what you want.
If by 'abc\xf6' you really mean the character with ord value 0xf6 in some encoding, then
bytes('abc\xf6'.decode(someenc), destenc) would be the way, so no one is stuck.
One danger is that someone is writing in an incomplete source character set and
wants to stick in some byte values in hex, happily sticking to the ascii subset
plus escapes, but a decode from the source encoding can fail on a non-existent character
if the "ascii escape" is not in the source character set. E.g., cp1252 is pretty complete,
but decoding the byte 0x81 fails:
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "d:\python-2.4b1\lib\encodings\cp1252.py", line 22, in decode
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
This can't happen with the same literal of ascii plus escapes passed as a unicode literal, given
that map(ord, literal) is done on it to get bytes when no encoding is specified. You just get what you expect.
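A short sketch of the contrast, again in Python 3 notation and with illustrative variable names only: decoding the byte 0x81 through cp1252 fails because that position is undefined in the code page, while taking ords of the equivalent escaped literal simply yields the byte value.

```python
raw = b"\x81"

# cp1252 has no character at 0x81, so the decode raises.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Taking ords of the escaped literal needs no codec at all.
from_ords = bytes(ord(c) for c in "\x81")
```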
It seems practical to me. I'm really trying to help, not piss you off ;-)
BTW, I recently posted re str.translate vs unicode.translate, which has some tie-in with this, since
I anticipate that bytes.translate would be a useful thing in the absence of str.translate.
unicode.translate won't do all one might like to do with bytes.translate, I believe. Both
>BTW, for folks who want to experiment, it's quite simple to create a
>working bytes implementation by inheriting from array.array. Here's a
>quick draft (which only takes str instance arguments):
>    from array import array
>    class bytes(array):
>        def __new__(cls, data=None):
>            b = array.__new__(cls, "B")
>            if data is not None:
>                b.fromstring(data)
>            return b
>        def __str__(self):
>            return self.tostring()
>        def __repr__(self):
>            return "bytes(%s)" % repr(list(self))
>        def __add__(self, other):
>            if isinstance(other, array):
>                return bytes(super(bytes, self).__add__(other))
>            return NotImplemented
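For anyone trying the same experiment on a current interpreter, here is a rough adaptation of the draft above to Python 3. It is an assumption-laden sketch, not the draft itself: the class is renamed ByteArray (a made-up name) to avoid shadowing the built-in bytes, the array methods are frombytes and tobytes instead of fromstring and tostring, and __bytes__ replaces __str__ because __str__ must return text in Python 3.

```python
from array import array


class ByteArray(array):
    """Hypothetical Python 3 variant of the array-based draft."""

    def __new__(cls, data=None):
        b = array.__new__(cls, "B")   # "B" = unsigned char, one byte per item
        if data is not None:
            b.frombytes(data)         # accepts any bytes-like object
        return b

    def __bytes__(self):
        return self.tobytes()

    def __repr__(self):
        return "ByteArray({!r})".format(list(self))

    def __add__(self, other):
        if isinstance(other, array):
            # array concatenation returns a plain array, so re-wrap it.
            return ByteArray(super().__add__(other))
        return NotImplemented


joined = ByteArray(b"ab") + ByteArray(b"c")
```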