[Python-3000] immutable bytes in py3k (was: [Python-Dev] methods on the bytes object)
Josiah Carlson
jcarlson at uci.edu
Mon May 1 23:54:54 CEST 2006
"Martin v. Löwis" <martin at v.loewis.de> wrote:
>
> Josiah Carlson wrote:
> > Before I get into my reply, I'm going to start out by defining a new
> > term:
> >
> > operationX - the operation of interpreting information differently than
> > how it is presented, generally by constructing a data structure based on
> > the input information.
> > eg; programming language source file -> parse tree,
> > natural language -> parse tree or otherwise,
> > structured data file -> data structure (tree, dictionary, etc.),
> > etc.
> > synonyms: parsing, unmarshalling, interpreting, ...
> >
> > Any time I would previously describe something as some variant of 'parse',
> > replace that with 'operationX'. I will do that in all of my further
> > replies.
> >
> >>> Certainly that is the case. But how would you propose embedded bytes
> >>> data be represented? (I talk more extensively about this particular
> >>> issue later).
> >> Can't answer: I don't know what "embedded bytes data" are.
>
> Ok. I think I would use base64, of possibly compressed content. It's
> more compact than your representation, as it only uses 1.3 characters
> per byte, instead of the up-to-four bytes that the img2py uses.
I never said it was the most efficient representation, just one that was
being used (and one whose definition I had no control over).
What I provided was automatically generated by a script provided with
wxPython.
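
To put rough numbers on the size comparison, here is a minimal Python 3
sketch. The payload is a made-up stand-in rather than the wxPython data, and
repr() is only assumed to approximate the escaped literal that img2py emits.

```python
import base64

# Hypothetical payload standing in for image bytes (every byte value, 4 times).
raw = bytes(range(256)) * 4

# base64 produces 4 output characters for every 3 input bytes.
encoded = base64.b64encode(raw)

# Escaped-literal form: the text between b' and ' in the repr.
# Printable bytes cost 1 character, others up to 4 (for example \x89).
escaped = repr(raw)[2:-1]

base64_ratio = len(encoded) / len(raw)
escaped_ratio = len(escaped) / len(raw)

print("base64 chars per byte: %.2f" % base64_ratio)
print("escaped chars per byte: %.2f" % escaped_ratio)
```

The exact escaped ratio depends on how many bytes of the payload happen to be
printable, so it varies from one file to the next.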
> If ease-of-porting is an issue, img2py should just put an
> .encode("latin-1") at the end of the string.
Ultimately, this is still the storage of bytes in a textual string. It
may be /encoded/ as text, but it is still conceptually bytes in text,
which is at least as confusing as text in bytes.
> > return zlib.decompress(
> > 'x\xda\x01\x14\x02\xeb\xfd\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00 \
> [...]
>
> > That data is non-textual. It is bytes within a string literal. And it
> > is embedded (within a .py file).
>
> In Python 2.x, it is that, yes. In Python 3, it is a (meaningless)
> text.
Toss in an .encode('latin-1'), and it isn't meaningless.
    >>> type(x)
    <type 'unicode'>
    >>> zlib.decompress(x.encode('latin-1'))[:4]
    '\x89PNG'
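
The same round trip can be sketched in Python 3 terms. The payload below is a
hypothetical stand-in for the image data; the point is only that Latin-1 maps
code points 0..255 one-to-one onto byte values 0..255, so nothing is lost.

```python
import zlib

# Hypothetical payload: a PNG-style signature followed by arbitrary bytes.
payload = b"\x89PNG\r\n\x1a\n" + bytes(range(256))
compressed = zlib.compress(payload)

# Carry the compressed bytes around as text, one code point per byte.
as_text = compressed.decode("latin-1")

# Encoding back with the same codec restores the original bytes exactly.
recovered = zlib.decompress(as_text.encode("latin-1"))
print(recovered[:4])
```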
> >>> I am apparently not communicating this particular idea effectively
> >>> enough. How would you propose that I store parsing literals for
> >>> non-textual data, and how would you propose that I set up a dictionary
> >>> to hold some non-trivial number of these parsing literals?
> >
> > An operationX literal is a symbol that describes how to interpret the
> > subsequent or previous data. For an example of this, see the pickle
> > module (portions of which I include below).
>
> I don't think there can be, or should be, a general solution for
> all operationX literals, because the different applications of
> operationX all have different requirements wrt. their literals.
>
> In binary data, integers are the most obvious choice for
> operationX literals. In text data, string literals are.
[snip]
> Yes. For pickle, the ordinals of the type code make good operationX
> literals.
But, as I brought up before, while single integers are sufficient for
some operationX literals, that may not be the case for others. Say, for
example, a tool which discovers the various blobs from quicktime .mov
files (movie portions, audio portions, images, etc.). I don't remember
all of the precise names to parse, but I do remember that they were all
4 bytes long.
This means that we would generally use the following...
    dispatch = {(ord(ch), ord(ch), ord(ch), ord(ch)): ...,
                # or
                tuple(ord(i) for i in '...'): ...,
               }
And in the actual operationX process...
    # if we are reading bytes...
    key = tuple(read(4))
    # if we are reading str...
    key = tuple(bytes(read(4), 'latin-1'))
    # or tuple(read(4).encode('latin-1'))
    # or tuple(ord(i) for i in read(4))
There are, of course, other options which could use struct and 8, 16, 32,
and/or 64 bit integers (with masks and/or shifts), for the dispatch = ...
or key = ... cases, but those, again, would rely on using Python 3.x
strings as a container for non-text data.
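
A runnable sketch of the tuple-keyed dispatch, written for Python 3 where
iterating over bytes yields integers. The two four-byte tags are chosen only
for illustration, and the handler functions and sample stream are
hypothetical.

```python
import io
import struct

# Hypothetical handlers; a real tool would parse the blob contents.
def on_movie():
    return "movie header"

def on_media():
    return "media data"

# Keys are 4-tuples of integers, one per byte of the tag.
dispatch = {
    tuple(b"moov"): on_movie,
    tuple(b"mdat"): on_media,
}

# Hypothetical input: a tag followed by padding.
stream = io.BytesIO(b"mdat" + b"\x00" * 8)
key = tuple(stream.read(4))
result = dispatch[key]()

# The struct alternative: fold the same 4 bytes into one 32-bit integer.
int_key = struct.unpack(">I", b"moov")[0]
print(key, result, hex(int_key))
```

Either key form works as a dictionary key; the tuple form keeps the tag
readable at the cost of a conversion on every read.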
> > I described before how you would use this kind of thing to perform
> > operationX on structured information. It turns out that pickle (in
> > Python) uses a dictionary of operationX symbols/literals -> unbound
> > instance methods to perform operationX on the pickled representation of
> > Python objects (literals where XXXX = '...' are defined, and symbols
> > using the XXXX names). The relevant code for unpickling is the while 1:
> > section of the following.
>
> Right. I would convert the top of pickle.py to read
>
> MARK = ord('(')
> STOP = ord('.')
> ...
>
> > For an example of where people use '...' to represent non-textual
> > information in a literal, see the '# Protocol 2' section of pickle.py ...
>
> Right.
>
> > # Protocol 2
> >
> > PROTO = '\x80' # identify pickle protocol
>
> This should be changed to
>
> PROTO = 0x80 # identify pickle protocol
> etc.
I see that you don't see ord(...) as a case where strings are being used
to hold bytes data. I would disagree, in much the same way that I would
disagree with the idea that bytes.encode('base64') only holds text. But
then again, I also see that the majority of this "rethink your data
structures and dispatching" would be unnecessary if there were an
immutable bytes literal in Python 3.x. People could then use...
    MARK = b'('
    STOP = b'.'
    ...
    PROTO = b'\x80'
    ...
    dispatch = {b'...': fcn}
    key = read(X)
    dispatch[key](self)
    # regardless of X
... etc., as they have already been doing (only without the 'b' or other
prefix).
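
As a sketch of what that style could look like, the following assumes an
immutable, hashable bytes literal spelled b'...' (which is how Python 3 was
eventually released). TinyReader and its methods are hypothetical and only
mimic the shape of the pickle dispatch loop; this is not pickle's actual code.

```python
import io

MARK = b"("
STOP = b"."
PROTO = b"\x80"

class TinyReader:
    """Hypothetical reader that dispatches on single-byte opcodes."""

    def __init__(self, data):
        self.read = io.BytesIO(data).read
        self.events = []

    def load_proto(self):
        # The byte after PROTO is the protocol version.
        self.events.append(("proto", self.read(1)[0]))

    def load_mark(self):
        self.events.append(("mark",))

    def load_stop(self):
        self.events.append(("stop",))
        return True  # tells the loop to finish

    # Bytes keys map straight to plain functions, with no ord() needed.
    dispatch = {PROTO: load_proto, MARK: load_mark, STOP: load_stop}

    def run(self):
        while True:
            key = self.read(1)
            if not key:
                raise EOFError("ran out of input before STOP")
            if self.dispatch[key](self):
                return self.events

events = TinyReader(b"\x80\x02(.").run()
print(events)
```

Because the keys are whatever read() returns, the same table works for
one-byte and multi-byte keys alike.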
> > key = read(1)
>
> and then this to
> key = ord(read(1))
This, of course, presumes that read() will return Python 3.x strings,
which may be ambiguous and/or an error in the binary pickle case
(especially if people don't pass 'latin-1' as the encoding to the open()
call).
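
A small Python 3 sketch of that ambiguity, using in-memory streams in place
of open(). The three bytes are a hypothetical fragment shaped like the start
of a binary pickle; nothing here is unpickled.

```python
import io

# Hypothetical input: a 0x80 opcode, a version byte, and a '.' byte.
data = b"\x80\x02."

# Binary read: indexing bytes already yields an integer, no ord() needed.
binary_key = io.BytesIO(data).read(1)[0]

# Text read with Latin-1: ord() recovers the byte value.
latin1_stream = io.TextIOWrapper(io.BytesIO(data), encoding="latin-1")
text_key = ord(latin1_stream.read(1))

# Text read with UTF-8: 0x80 is not a valid start byte, so decoding fails.
utf8_stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
try:
    utf8_stream.read(1)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(binary_key, text_key, utf8_failed)
```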
> > See any line-based socket protocol for where .find() is useful.
>
> Any line-based protocol is textual, usually based on ASCII.
Not all of ASCII 0...127 is text (the control characters are not), and the
RFC for telnet describes how byte values 128...255 can be used as optional
extensions. Further, the FTP protocol defines a mechanism whereby in STREAM
or BLOCK modes, data can be terminated by an EOR, EOF, or even a different
specified marker (could be multiple contiguous bytes).
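
To illustrate why .find() matters on byte data, here is a minimal sketch that
splits a buffer on a multi-byte marker. split_records is a hypothetical
helper, and the marker value is a placeholder rather than one taken from any
RFC.

```python
def split_records(buffer, marker):
    """Split buffer on marker; return (complete records, unterminated tail)."""
    records = []
    start = 0
    while True:
        end = buffer.find(marker, start)
        if end == -1:
            break
        records.append(buffer[start:end])
        start = end + len(marker)
    # Whatever follows the last marker is an incomplete record.
    return records, buffer[start:]

# Placeholder two-byte marker, including a byte that is not ASCII text.
MARKER = b"\xff\x01"
records, tail = split_records(b"abc\xff\x01de\xff\x01f", MARKER)
print(records, tail)
```

The tail would normally be kept and prepended to the next chunk read from the
socket.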
It is also the case that some filesystems, among other things, define
file names as null-terminated strings in a variable-length record, or
even pad the remaining portion of the field with nulls (which one would
presumably .rstrip('\0') in Python 2.x). (On occasion, I have found the
need to write a filesystem explorer which opens volumes raw, especially
on platforms without drivers for that particular filesystem).
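
A sketch of both cases on bytes in Python 3. The record layout (a 2-byte id
followed by a 14-byte name field) is hypothetical and does not correspond to
any particular filesystem.

```python
import struct

# Hypothetical directory record: little-endian 2-byte id, 14-byte name field.
RECORD = struct.Struct("<H14s")

# struct pads 's' fields with null bytes when the value is shorter.
raw = RECORD.pack(7, b"README.TXT")
ident, field = RECORD.unpack(raw)

# Null-padded field: strip the trailing padding.
padded_name = field.rstrip(b"\0")

# Null-terminated field: keep everything before the first null.
terminated_name = field.split(b"\0", 1)[0]

print(ident, padded_name, terminated_name)
```

The two approaches differ when stale bytes follow the first null, in which
case only the split form gives the intended name.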
- Josiah