[Python-Dev] s/bytes/octet/ [Was:Re: bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]]

Mon Feb 20 12:52:22 CET 2006

On Sat, 18 Feb 2006 09:59:38 +0100, "Martin v. Löwis" <martin at v.loewis.de> wrote:

>Aahz wrote:
>> The problem is that they don't understand that "Martin v. Löwis" is not
>> Unicode -- once all strings are Unicode, this is guaranteed to work.
Well, after all the "string" literal escapes that were being used
to define byte values are all rewritten, yes, I'll believe the guarantee ;-)
(BTW, are there plans for migration tools?)

Ok, now back to the s/bytes/octet/ topic:
>
>This specific call, yes. I don't think the problem will go away as long
>as both encode and decode are available for both strings and byte
>arrays.
>
>> While it's not absolutely true, my experience of watching Unicode
>> confusion is that the simplest approach for newbies is: encode FROM
>> Unicode, decode TO Unicode.
>
>I think this is what should be in-grained into the library, also. It
>shouldn't try to give additional meaning to these terms.
>
Thinking about bytes recently, it occurs to me that bytes are really not intrinsically
numeric in nature. They don't necessarily represent uint8's. E.g., a binary file is
really a sequence of bit octets in its most primitive and abstract sense.

So I'm wondering if we shouldn't have an octet type analogous to unicode, and instances of octet
would be vectors of octets as abstract 8-bit bit vectors, like instances of unicode are vectors of abstract characters.

If you wanted integers, you could map ord over an octet instance; the results would be guaranteed to be in range(256).
The constructor would naturally take any suitable integer sequence so octet([65,66,67]) would work.

In general, all encode methods would produce an octet instance, e.g. unicode.encode.
octet.decode(octet_instance, 'src_encoding') or octet_instance.decode('src_encoding') would do
all the familiar character code sequence decoding,
e.g., octet.decode(oseq, 'utf-8') or oseq.decode('utf-8') to make a unicode instance.

Going from unicode, unicode.encode(uinst, 'utf-8') or uinst.encode('utf-8') would produce an octet instance.
I think this is conceptually purer than the current bytes idea, since the result really has no arithmetic significance.

Also, ord would work on a length-one octet instance, and produce the unsigned integer value you'd expect, but would fail
if not length-one, like ord on unicode (or current str).
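
A minimal sketch of the proposed type, written in present-day Python 3 rather than the 2006 language. The names Octet, octet_ord and encode_text are hypothetical stand-ins (the builtin ord cannot be extended to a user-defined class), and wrapping the builtin bytes type is just one convenient way to store the 8-bit values:

```python
class Octet:
    """Hypothetical sketch: an immutable vector of abstract 8-bit values."""

    def __init__(self, values=()):
        # Accept any iterable of integers in range(256).
        # bytes() raises ValueError for out-of-range integers.
        self._data = bytes(list(values))

    def __len__(self):
        return len(self._data)

    def __eq__(self, other):
        return isinstance(other, Octet) and self._data == other._data

    def __hash__(self):
        return hash(self._data)

    def __repr__(self):
        # Hex is used as the text representation of the octets.
        return "octet('%s')" % self._data.hex()

    def decode(self, encoding):
        # octets -> characters
        return self._data.decode(encoding)


def octet_ord(o):
    """Stand-in for ord(): only defined for a length-one Octet."""
    if len(o) != 1:
        raise TypeError("expected a length-one Octet, got length %d" % len(o))
    return o._data[0]


def encode_text(text, encoding):
    """Stand-in for unicode.encode(): characters -> octets."""
    return Octet(text.encode(encoding))
```

With this sketch, Octet([65, 66, 67]).decode('ascii') gives 'ABC', and octet_ord fails on anything that is not exactly one octet long.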

Thus octet would replace bytes as the binary info container, and would not have any presumed arithmetic
significance, either as integer or as character-of-current-source-encoding-inferred-from-integer-value-as-ord.

To get a text representation of octets, hex is natural, e.g., octet('6162 6380') # spaces ignored
so repr(octet('a deaf bee')) => "octet('adeafbee')" and octet('616263').decode('ascii') => u'abc' and
back: u'abc'.encode('ascii') => octet('616263'). The base64 codec looks conceptually cleaner too, so long
as you keep in mind base64 as a character subset of unicode and the name of the transformation function pair.
octet('616263').decode('base64') => u'YWJj\n' # octets -> characters
u'YWJj\n'.encode('base64') => octet('616263') # characters -> octets
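
The hex and base64 round trips above can be approximated with today's standard library. This is only a sketch: plain bytes stands in for the proposed octet type, and the four helper names are hypothetical. Whitespace is stripped by hand so that a string like 'a deaf bee' is accepted however the digits are grouped.

```python
import base64


def octets_from_hex(text):
    # Drop all whitespace first, then parse pairs of hex digits.
    return bytes.fromhex("".join(text.split()))


def octets_repr(data):
    # Text representation in the style proposed above.
    return "octet('%s')" % data.hex()


def to_base64_text(data):
    # octets -> characters (base64 viewed as a character subset)
    return base64.encodebytes(data).decode("ascii")


def from_base64_text(text):
    # characters -> octets
    return base64.decodebytes(text.encode("ascii"))


print(octets_repr(octets_from_hex("a deaf bee")))
print(repr(to_base64_text(octets_from_hex("616263"))))
```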

If you wanted integer-nature bytes, you could have octet codecs for uint8 and int8, e.g., octseq.decode('int8')
could produce a list of signed integers all in range(-128,128). Or maybe map(dec_int8, octseq). The array
module could easily be a target for octet.decode, e.g., octseq.decode('array_B') or octet.decode(octseq, 'array_B'),
and octet(array_instance) the other way.
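
As a rough illustration of integer-nature views, the array module can already reinterpret a run of octets as signed or unsigned 8-bit integers. The function names below are hypothetical, and bytes again stands in for the proposed octet type:

```python
from array import array


def decode_int8(data):
    # Reinterpret each octet as a signed integer in range(-128, 128).
    return list(array("b", data))


def decode_uint8(data):
    # Reinterpret each octet as an unsigned integer in range(256).
    return list(array("B", data))


def decode_array_B(data):
    # octets -> array of unsigned bytes
    return array("B", data)


def octets_from_array(arr):
    # array -> octets (the other direction)
    return arr.tobytes()


raw = bytes([0xFF, 0x80, 0x7F])
print(decode_int8(raw))
print(decode_uint8(raw))
```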

Likewise, other types could be destination for octet.decode.

E.g., if you had an abstraction for a display image one could have 'gif' and 'png' and 'bmp' etc
be like 'cp437', 'latin-1', and 'utf-8' etc are for decoding octets to unicode, and write stuff like

    o_seq = open('pic.gif','rb').read()  # makes octet instance
    img = o_seq.decode('gif89')   # => img is abstract, internally represented suitably but hidden, like unicode.
    open('pic.png', 'wb').write(img.encode('png'))

Unless I am mistaken, PIL has this functionality, if not as encode/decode methods.

Similarly, there could be an abstract archive container, and you could have

    arch = open('tree.tgz','rb').read().decode('tgz') # => might do lazy things waiting for encode
    egg_octets = arch.encode('python_egg')  # convert to egg format?? (just hand-waving ;-)

Probably all it would take is to wrap some things in abstract-container (AC) types, to enforce the protocol.
Image(octet_seq, 'gif') might produce an AC that only saved a (octet_seq, 'gif') internally, or it might
do eager conversion per optional additional args. Certainly .bmp without rle can be hugely wasteful.

For flexibility like eager vs not, or perhaps returning an iterator instead of a byte sequence,
I guess the encode/decode signatures should be (enc, *args, **kw) and pass those things on to
the worker functions? An abstract container could have a "pack" codec to do serial composition/decomposition.
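
One way the (enc, *args, **kw) signature could pass options through to worker functions is a small registry keyed by codec name. Everything here is a hypothetical sketch (the register/decode/encode names, the 'uint8' codec and its lazy option), not an existing API:

```python
_DECODERS = {}
_ENCODERS = {}


def register(name, decoder, encoder):
    # Associate a codec name with its pair of worker functions.
    _DECODERS[name] = decoder
    _ENCODERS[name] = encoder


def decode(data, enc, *args, **kw):
    # Extra positional and keyword arguments go straight to the worker.
    return _DECODERS[enc](data, *args, **kw)


def encode(obj, enc, *args, **kw):
    return _ENCODERS[enc](obj, *args, **kw)


def _uint8_decode(data, lazy=False):
    # lazy=True returns an iterator instead of an eagerly built list.
    return iter(data) if lazy else list(data)


def _uint8_encode(values):
    return bytes(values)


register("uint8", _uint8_decode, _uint8_encode)

print(decode(b"abc", "uint8"))
print(encode([97, 98, 99], "uint8"))
```

The caller chooses eager or lazy behaviour per call, e.g. decode(b"abc", "uint8", lazy=True), without the dispatcher knowing anything about the option.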

I'm sure Mal has all this stuff one way or another, but I wanted the conceptual purity of AC instances ac in
ac = octet_seq.decode('src_enc'); octet_seq  = ac.encode('dst_enc') ;-)

Bottom line thought: binary octets aren't numeric ;-)

Regards,
Bengt Richter