[Python-Dev] methods on the bytes object

Mon May 1 21:02:21 CEST 2006

Before I get into my reply, I'm going to start out by defining a new
term:

operationX - the operation of interpreting information differently than
how it is presented, generally by constructing a data structure based on
the input information.
    e.g., programming language source file -> parse tree,
          natural language -> parse tree or otherwise,
          structured data file -> data structure (tree, dictionary, etc.),
          etc.
synonyms: parsing, unmarshalling, interpreting, ...

Any time I would previously describe something as some variant of 'parse',
replace that with 'operationX'.  I will do that in all of my further
replies.

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> Josiah Carlson wrote:
> > Certainly that is the case.  But how would you propose embedded bytes
> > data be represented? (I talk more extensively about this particular
> > issue later).
> 
> Can't answer: I don't know what "embedded bytes data" are.

I described this before as the output of img2py from wxPython.  Here's a
sample which includes the py.ico from Python 2.3.

#----------------------------------------------------------------------
# This file was generated by C:\Python23\Scripts\img2py
#
from wx import ImageFromStream, BitmapFromImage
import cStringIO, zlib

def getData():
    return zlib.decompress(
'x\xda\x01\x14\x02\xeb\xfd\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00 \
\x00\x00\x00 \x08\x06\x00\x00\x00szz\xf4\x00\x00\x00\x04sBIT\x08\x08\x08\x08\
|\x08d\x88\x00\x00\x01\xcbIDATX\x85\xb5W\xd1\xb6\x84 \x08\x1ct\xff;\xf6\xc3\
\x93\xfb`\x18\x92f\xb6^:\x9e\xd4\x94\x19\x04\xd4\x88BDO$\xed\x02\x00\x14"i]\
\xdb\xddI\x93B=\x02\x92va\xceu\xe6\\T\x98\xd7\x91h\x12\xb0\xd6z\xb1\xa4V\x90\
\xf8\xf4>\xc8A\x81\xe8\xack\xdb\xae\xc6\xbf\x11\xc8`\x02\x80L\x1d\xa5\xbdJ\
\xc2Rm\xab\t\x88PU\xb7m\xe0>V^\x13(\xa9G\xa7\xbf\xb5n\xfd\xafoI\xbbhyC\xa0\
\xc4\x80*h\x05\xd8]\xd0\xd5\xe9y\xee\x1bO\t\x10\x85X\xe5\xfc\x8c\xf0\x06\xa0\
\x91\x153)\x1af\xc1y\xab\xdf\x906\x81\xa7.)1\xe0w\xba\x1e\xb0\xaf\xff*C\x02\
\x17`\xc2\xa3\xad\xe0\xe9*\x04U\xec\x97\xb6i\xb1\x02\x0b\xc0_\xd3\xf7C2\xe6\
\xe9#\x05\xc6bfG\xc8\xcc-\xa4\xcc\xd8Q0\x06\n\x91\x14@\x15To;\xdd\x125s<\xf0\
\x8c\x94\xd3\xd0\xfa\xab\xb5\xeb{\xcb\xcb\x1d\xe1\xd7\x15\xf0\x1d\x1e\x9c9wz\
p\x0f\xfa\x06\x1cp\xa7a\x01?\x82\x8c7\x80\xf5\xe3\xa1J\x95\xaa\xf5\xdc\x00\
\x9f\x91\xe2\x82\xa4g\x80\x0f\xc8\x06p9\x0f\xb66\xf8\xccNH\x14\xe2\t\xde\x1a\
`\x14\x8d|>\x0b\x0e\x00\x9f\x94v\t!RJ\xbb\xf4&VV/\x04\x97\xb4K\xe5\x82\xe0&\
\x97\xcc\x18X\xfd\x16\x1cxx+\x06\xfa\xfeVp+\x17\xb7\xb9~\xd5\xcd<\xb8\x13V\
\xdb\xf1\r\xf8\xf54\xcc\xee\xbc\x18\xc1\xd7;G\x93\x80\x0f\xb6.\xc1\x06\xf8\
\xd9\x7f=\xe6[c\xbb\xff\x05O\x97\xff\xadh\xcct]\xb0\xf2\xcc/\xc6\x98mV\xe3\
\xe1\xf1\xb5\xbcGhDT--\x87\x9e\xdb\xca\xa7\xb2\xe0"\xe6~\xd0\xfb6LM\n\xb1[\
\x90\xef\n\xe5a>j\x19R\xaaq\xae\xdc\xe9\xad\xca\xdd\xef\xb9\xaeD\x83\xf4\xb2\
\xff\xb3?\x1c\xcd1U-7%\x96\x00\x00\x00\x00IEND\xaeB`\x82\xdf\x98\xf1\x8f' )

def getBitmap():
    return BitmapFromImage(getImage())

def getImage():
    stream = cStringIO.StringIO(getData())
    return ImageFromStream(stream)

That data is non-textual.  It is bytes within a string literal.  And it
is embedded (within a .py file).

> > I am apparently not communicating this particular idea effectively
> > enough.  How would you propose that I store parsing literals for
> > non-textual data, and how would you propose that I set up a dictionary
> > to hold some non-trivial number of these parsing literals?
> 
> I can't answer that question: I don't know what a "parsing literal
> for non-textual data" is. If you are asking how you represent bytes
> object in source code: I would encode them as a list of integers,
> then use, say,
> 
>   parsing_literal = bytes([3,5,30,99])

An operationX literal is a symbol that describes how to interpret the
subsequent or previous data.  For an example of this, see the pickle
module (portions of which I include below).

> > From what I understand, it would seem that you would suggest that I use
> > something like the following...
> > 
> > handler = {bytes('...', encoding=...).encode('latin-1'): ...,
> >            #or
> >            '\uXXXX\uXXXX...': ...,
> >            #or even without bytes/str
> >            (0xXX, 0xXX, ...): ..., }
> > 
> > Note how two of those examples have non-textual data inside of a Python
> > 3.x string?  Yeah.
> 
> Unfortunately, I don't notice. I assume you don't mean a literal '...';
> if this is what you represent, I would write
> 
> handler = { '...': "some text" }
> 
> But I cannot guess what you want to put into '...' instead.

I described before how you would use this kind of thing to perform
operationX on structured information.  It turns out that pickle (in
Python) uses a dictionary mapping operationX symbols/literals to unbound
instance methods to perform operationX on the pickled representation of
Python objects (the literals are defined as XXXX = '...', and the
symbols are the XXXX names bound to them).  The relevant code for
unpickling is the 'while 1:' loop in the following.

    def load(self):
        """Read a pickled object representation from the open file.

        Return the reconstituted object hierarchy specified in the file.
        """
        self.mark = object() # any new unique object
        self.stack = []
        self.append = self.stack.append
        read = self.read
        dispatch = self.dispatch
        try:
            while 1:
                key = read(1)
                dispatch[key](self)
        except _Stop, stopinst:
            return stopinst.value
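
To make the dispatch idea concrete without quoting all of pickle.py, here
is a minimal sketch of the same pattern in Python 3 syntax.  Everything in
it is made up for illustration: the opcodes PUSH_BYTE, ADD and STOP, the
ToyLoader class and its load_* methods are hypothetical and are not
pickle's.  The only point is that the dictionary keys are single-byte
literals that are not text.

```python
# Hypothetical opcodes for a toy stack machine; these are NOT pickle's opcodes.
PUSH_BYTE = b'\x01'   # push the next byte of input as an int
ADD       = b'\x02'   # pop two ints, push their sum
STOP      = b'\xff'   # stop and return the top of the stack


class _Stop(Exception):
    """Raised by the STOP handler to carry the result out of the loop."""
    def __init__(self, value):
        self.value = value


class ToyLoader:
    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.stack = []

    def read(self, n):
        # Return the next n bytes of input and advance the position.
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

    def load_push_byte(self):
        self.stack.append(self.read(1)[0])

    def load_add(self):
        b = self.stack.pop()
        a = self.stack.pop()
        self.stack.append(a + b)

    def load_stop(self):
        raise _Stop(self.stack.pop())

    # operationX literal -> plain function taking the loader instance
    dispatch = {
        PUSH_BYTE: load_push_byte,
        ADD: load_add,
        STOP: load_stop,
    }

    def load(self):
        try:
            while True:
                key = self.read(1)
                self.dispatch[key](self)
        except _Stop as stopinst:
            return stopinst.value
```

Feeding it b'\x01\x02\x01\x03\x02\xff' pushes 2, pushes 3, adds them and
stops, so load() returns their sum.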

> >>> We've not removed the problem, only changed it from being contained 
> >>> in non-unicode
> >>> strings to be contained in unicode strings (which are 2 or 4 times larger
> >>> than their non-unicode counterparts).
> >> We have removed the problem.
> > 
> > Excuse me?  People are going to use '...' to represent literals of all
> > different kinds.
> 
> In Python 3, '...' will be a character string. You can't use it to
> represent anything else but characters.
> 
> Can you give examples of actual source code where people use '...'
> for something other than text?

For an example of where people use '...' to represent non-textual
information in a literal, see the '# Protocol 2' section of pickle.py ...

# Protocol 2

PROTO           = '\x80'  # identify pickle protocol
NEWOBJ          = '\x81'  # build object by applying cls.__new__ to argtuple
EXT1            = '\x82'  # push object from extension registry; 1-byte index
EXT2            = '\x83'  # ditto, but 2-byte index
EXT4            = '\x84'  # ditto, but 4-byte index
TUPLE1          = '\x85'  # build 1-tuple from stack top
TUPLE2          = '\x86'  # build 2-tuple from two topmost stack items
TUPLE3          = '\x87'  # build 3-tuple from three topmost stack items
NEWTRUE         = '\x88'  # push True
NEWFALSE        = '\x89'  # push False
LONG1           = '\x8a'  # push long from < 256 bytes
LONG4           = '\x8b'  # push really big long

Also look at the getData() function I defined earlier in this post for
non-text string literals.

> > What does pickle.load(...) do to the files that are passed into it?  It
> > reads the (possibly binary) data from a file (or file-like object),
> > performing a particular operation based on a dictionary of expected
> > tokens in the file, producing a Python object. I would say that pickle
> > 'parses' the content of a file, which I presume isn't necessarily
> > text.
> 
> Right. I think pickle can be implemented with just the bytes type.
> There is a textual and a binary version of the pickle format. The
> textual should be read as text; the binary version using bytes.
> (I *think* the encoding of the textual version is ASCII, but one
> would have to check).

The point of this example was to show that operationX isn't necessarily
the processing of text, but may in fact be the interpretation of binary
data. It was also supposed to show how one may need to define symbols
for such interpretation via literals of some kind.  In the pickle module,
this is done in two parts: XXX = <literal>; dispatch[XXX] = fcn.  I've
also seen it as dispatch = {<literal>: fcn}

With regard to text pickles using text as input and binary pickles
using bytes as input, I would remind you that bytes are mutable, so a
bytes object presumably could not itself serve as a dictionary key.  This
isn't quite as big a deal with pickle, but one would need to either add
both the '...' and int literals to the dictionary:
    dispatch[SYMBOL] = fcn
    dispatch[SYMBOL.encode('latin-1')[0]] = fcn

Or always use text _or_ integers in the decoding:

#decoding bytes when only str in dispatch...

    dispatch[bytesX.decode('latin-1')](self)

#encoding str when only int in dispatch...

    dispatch[strX.encode('latin-1')[0]](self)
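
A minimal sketch of both options follows.  It assumes bytearray as a
stand-in for a mutable bytes type, and the names handle_proto,
dispatch_both, dispatch_str, incoming and is_hashable are hypothetical,
chosen only for this illustration.

```python
SYMBOL = '\x80'   # a text literal, in the style of the pickle.py constants


def handle_proto():
    # Hypothetical handler; a real one would act on the loader's state.
    return 'PROTO'


# Option 1: register both the text key and the matching integer key.
dispatch_both = {}
dispatch_both[SYMBOL] = handle_proto
dispatch_both[SYMBOL.encode('latin-1')[0]] = handle_proto

# Option 2: keep only text keys and decode each incoming byte before lookup.
dispatch_str = {SYMBOL: handle_proto}

# Assumption: bytearray plays the role of the mutable bytes object here.
incoming = bytearray(b'\x80')


def is_hashable(obj):
    """Return True if obj can be used as a dictionary key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True
```

The mutable object itself cannot be the key, so the lookup goes through
either incoming[0] (an int, with dispatch_both) or
incoming.decode('latin-1') (a str, with dispatch_str).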

> > Replace pickle with a structured data storage format of your choice.
> > It's still parsing (at least according to my grasp of English, which
> > could certainly be flawed (my wife says as much on a daily basis)).
> 
> Well, I have no "native language" intuition with respect to these
> words - I only use them in the way I see them used elsewhere. I see
> "parsing" typically associated with textual data, and "unmarshalling"
> with binary data.

Before Python, I hadn't heard of 'marshal' in relation to the storage or
retrieval of data structures as data, and at least in my discussions
about such topics with my friends and colleagues, unmarshalling is an
example of parsing.

In any case, you can replace many of my uses of 'parsing' with
'unmarshalling', or even 'operationX', which I defined at the beginning
of this email.

> >>> Look how successful and
> >>> effective it has been so far in the history of Python.  In order to make
> >>> the bytes object be as effective in 3.x, one would need to add basically
> >>> all of the Python 2.x string methods to it
> >> The precondition of this clause is misguided: the bytes type doesn't
> >> need to be as effective, since the string type is as effective in 2.3,
> >> so you can do all parsing based on strings.
> > 
> > Not if those strings contain binary data.
> 
> I couldn't (until right now) understand what you mean by "parsing binary
> data". I still doubt that you typically need the same operations for
> unmarshalling that you need for parsing.
> 
> In parsing, you need to look (scan) for separators, follow a possibly
> recursive grammar, and so on. For unmarshalling, you typically have
> a TLV (tag-length-value) structure, where you read a tag (of fixed
> size), then the length, then the value (of the size indicated in the
> length). There are variations, of course, but you typically don't
> need .find, .startswith, etc.
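
For reference, the TLV reading described above can be sketched in a few
lines.  This is only an illustration under stated assumptions: the record
layout (a 1-byte tag, a 2-byte big-endian length, then the value) and the
function name read_tlv are hypothetical and do not come from any
particular format.

```python
import struct


def read_tlv(data, offset=0):
    """Read one record starting at offset.

    Assumed layout: 1-byte tag, 2-byte big-endian length, then the value.
    Returns (tag, value, offset_of_next_record).
    """
    tag, length = struct.unpack_from('>BH', data, offset)
    start = offset + 3                      # skip the 3-byte header
    value = data[start:start + length]
    if len(value) != length:
        raise ValueError('truncated value')
    return tag, value, start + length
```

No searching is involved; the length field says exactly how far to read.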

See any line-based socket protocol for where .find() is useful.  Take
telnet, for example.  A telnet line is terminated by a CRLF pair, and
being able to interpret incoming information on a line-by-line basis is
sufficient to support all of the required portions of the RFC.  In
addition, telnet lines may contain non-text information as escapes
(byte values 0-8, 11-12, 14-31, 128-255), regardless of line buffering.
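
As a rough sketch of what I mean, here is line extraction from a receive
buffer using .find() on a bytes object.  The function name split_lines and
the buffer handling are my own illustration, not code from any telnet
implementation, and it deliberately does nothing about the escapes beyond
passing them through untouched.

```python
def split_lines(buffer):
    """Split a receive buffer into complete CRLF-terminated lines.

    Returns (lines, remainder), where remainder is the trailing partial
    line that should be kept until more data arrives.
    """
    lines = []
    start = 0
    while True:
        end = buffer.find(b'\r\n', start)   # locate the next terminator
        if end == -1:
            break
        lines.append(buffer[start:end])     # line without its CRLF
        start = end + 2                     # step over the CRLF pair
    return lines, buffer[start:]
```

The lines that come back may well contain bytes such as 0xff that are not
text, which is exactly why searching is needed on the bytes themselves
rather than on a decoded string.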

 - Josiah