[Python-Dev] Byte string class hierarchy
Jack Jansen
Jack.Jansen at cwi.nl
Thu Aug 19 00:16:33 CEST 2004
I may have missed a crucial bit of the discussion, having been away, so
if this is completely beside the point let me know. But my feeling is
that the crucial bit is the type inheritance graph of all the byte and
string types. And I wonder whether the following graph would help us
solve most problems (aside from introducing one new one, that may be a
showstopper):
genericbytes
    mutablebytes
    bytes
        genericstring
            string
            unicode
The basic type for all bytes, buffers and strings is genericbytes. This
abstract base type is neither mutable nor immutable, and has the
interface that all of the types would share. Mutablebytes adds slice
assignment and such. Bytes, on the other hand, adds hashing.
genericstring is the magic stuff that's there already that makes
unicode and string interoperable for hashing and dict keys and such.
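
As a rough illustration only, the graph above could be modelled with ordinary classes. This is a sketch in present-day Python, not a proposed implementation: the capitalised class names are stand-ins for the types named in the graph, and storing the payload in a plain bytes/bytearray attribute called _data is an assumption made for brevity.

```python
class GenericBytes:
    """Abstract base: the interface every byte and string type shares."""

    def __init__(self, data=b""):
        self._data = bytes(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


class MutableBytes(GenericBytes):
    """Adds item and slice assignment; explicitly unhashable."""

    __hash__ = None  # mutable objects must not be usable as dict keys

    def __init__(self, data=b""):
        self._data = bytearray(data)

    def __setitem__(self, index, value):
        self._data[index] = value


class Bytes(GenericBytes):
    """Immutable, so it can add hashing."""

    def __hash__(self):
        return hash(self._data)

    def __eq__(self, other):
        return isinstance(other, Bytes) and self._data == other._data


class GenericString(Bytes):
    """Placeholder for the string/unicode interoperability layer."""


class String(GenericString):
    pass


class Unicode(GenericString):
    pass
```

With this layout, issubclass(String, Bytes) is true, so treating a string as bytes needs no conversion, while MutableBytes sits on a separate branch and cannot be used as a dictionary key.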
Casting to a basetype is always free and doesn't copy anything, i.e.
the bits stay the same. 'foo' in source code is a string, and if you
cast it to bytes you'll just get the bits, which is pretty much the
same as what you get now. If you really want to make sure you get an
8-bit ASCII representation even if you run in an interpreter built with
UCS4 as the default character set you must use
bytes('foo'.encode('ascii')).
Casting to a subtype may cause a copy, but does not modify the bits.
Casting sideways copies, and may modify the bits too: this is the current
unicode encode/decode stuff. These two rules mean that unicode('foo') is
something different from unicode(bytes('foo')), and probably illegal to
boot, but I don't think that's too much of a problem: you shouldn't
explicitly cast to bytes() unless you really want uninterpreted bits.
Operations like concatenation return the most specialised class.
Mutablebytes is the only problem case here; we should probably forbid
concatenating these with the others. The alternatives (return
mutablebytes, return the other one, return the type of the first
operand) all seem somewhat random.
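
The concatenation rule can be sketched as a small function that picks the result type. This is a hedged illustration in present-day Python: the stub classes mirror the graph, concat_result_type is a hypothetical helper name, and rejecting sibling types such as string and unicode is an assumption of the sketch, since the rule as stated does not settle that case.

```python
# Stub classes mirroring the proposed graph (names are stand-ins).
class GenericBytes: pass
class MutableBytes(GenericBytes): pass
class Bytes(GenericBytes): pass
class GenericString(Bytes): pass
class String(GenericString): pass
class Unicode(GenericString): pass


def concat_result_type(left, right):
    """Return the class that concatenating the two operand types would yield."""
    if left is right:
        return left
    # Mixing mutablebytes with anything else is forbidden outright.
    if MutableBytes in (left, right):
        raise TypeError("cannot concatenate mutablebytes with other types")
    # Otherwise the more specialised of the two classes wins.
    if issubclass(left, right):
        return left
    if issubclass(right, left):
        return right
    # Assumption: siblings (e.g. String and Unicode) are left undefined here.
    raise TypeError("no most-specialised class for these operands")
```

For example, concat_result_type(Bytes, String) and concat_result_type(String, Bytes) both give String, so the result does not depend on operand order.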
Read() is guaranteed only to return genericbytes, but if you open a
file in text mode it will return strings, and we should add the ability
to open files for unicode and probably mutablebytes too. I'm not sure
about socket.recv() and such, but something similar probably holds.
Readline() really shouldn't be allowed on files open in binary mode,
but that may be a bit too much.
Write and friends accept genericbytes, and binary files will just dump
the bits. Files open in text mode may convert unicode and string
objects between representations.
The bad news (aside from any glaring holes I may have overlooked in the
above: shoot away!) is that I don't know what to do for hash on bytes
objects. On the one hand I would like hash('foo') ==
hash(bytes('foo')). But that leads to also wanting hash(u'foo') ==
hash(bytes(u'foo')), and we can't really have that because hash('foo')
== hash(u'foo') is needed to make string/unicode interoperability for
dictionaries work. Note that for the value 'foo' this isn't a problem,
but for 'föö' (that's F O-UMLAUT O-UMLAUT) it is. So it seems that
making hash('foo') != hash(bytes('foo')) is the only reasonable
solution (and probably also a good idea with the future in mind:
explicit is better than implicit, so just put a cast there if you want
the binary bits to be interpreted as an ASCII or Unicode string!), but
it will probably break existing code.
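
One way to see why 'foo' is harmless and 'föö' is not: the raw bits of non-ASCII text depend on the representation chosen, so no single bytes value exists whose hash could match every form of the same text. The snippet below is a small illustration in present-day Python, using Latin-1 and UTF-8 as two example 8-bit representations; it is not the interpreter's internal layout.

```python
ascii_text = "foo"
umlaut_text = "f\u00f6\u00f6"  # F O-UMLAUT O-UMLAUT

# ASCII-only text has the same bits in both representations.
ascii_latin1 = ascii_text.encode("latin-1")
ascii_utf8 = ascii_text.encode("utf-8")

# Non-ASCII text does not: Latin-1 uses 1 byte per o-umlaut, UTF-8 uses 2.
umlaut_latin1 = umlaut_text.encode("latin-1")
umlaut_utf8 = umlaut_text.encode("utf-8")

print(ascii_latin1 == ascii_utf8)            # True
print(umlaut_latin1 == umlaut_utf8)          # False
print(len(umlaut_latin1), len(umlaut_utf8))  # 3 5
```

Since the two byte sequences for 'föö' differ, a hash computed over the uninterpreted bits cannot agree with both, which is the conflict described above.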
--
Jack Jansen, <Jack.Jansen at cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma
Goldman