[Python-Dev] methods on the bytes object

Mon May 1 10:55:52 CEST 2006

Josiah Carlson wrote:
> Certainly that is the case.  But how would you propose embedded bytes
> data be represented? (I talk more extensively about this particular
> issue later).

Can't answer: I don't know what "embedded bytes data" are.

> Um...struct.unpack() already works on unicode...
>     >>> struct.unpack('>L', u'work')
>     (2003792491L,)
> As does array.fromstring...
>     >>> a = array.array('B')
>     >>> a.fromstring(u'work')
>     >>> a
>     array('B', [119, 111, 114, 107])
> 
> ... assuming that all characters are in the 0...127 range.  But that's a
> different discussion.

Yes, it applies the default encoding. This is unfortunate: it shouldn't
have worked in the first place. I hope this gives a type error in Python
3, with the default encoding gone.
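
For illustration, a minimal sketch of the behaviour hoped for here. It
assumes a Python 3 interpreter in which struct.unpack accepts only
bytes-like input; the helper name unpack_u32_be is made up for this
example.

```python
import struct

def unpack_u32_be(data):
    """Unpack one big-endian unsigned 32-bit integer from bytes-like data."""
    return struct.unpack('>L', data)[0]

# Bytes input works, with no default encoding involved.
value = unpack_u32_be(b'work')

# A character string is rejected instead of being silently encoded.
try:
    unpack_u32_be('work')
    str_rejected = False
except TypeError:
    str_rejected = True
```

The point of the sketch is only that the text/bytes boundary becomes
explicit: the caller has to encode a string before unpacking it.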

> I am apparently not communicating this particular idea effectively
> enough.  How would you propose that I store parsing literals for
> non-textual data, and how would you propose that I set up a dictionary
> to hold some non-trivial number of these parsing literals?

I can't answer that question: I don't know what a "parsing literal
for non-textual data" is. If you are asking how to represent a bytes
object in source code: I would encode it as a list of integers,
then use, say,

  parsing_literal = bytes([3,5,30,99])
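
A dictionary of such literals could be sketched as follows. This
assumes that bytes objects are immutable and hashable, so that they can
serve as dictionary keys; the tags and handler names are hypothetical
and chosen only for illustration.

```python
# Hypothetical handlers; a real unmarshaller would build objects here.
def handle_header(payload):
    return ("header", payload)

def handle_footer(payload):
    return ("footer", payload)

# Binary tags written as lists of integers, used directly as keys.
HANDLERS = {
    bytes([3, 5, 30, 99]): handle_header,
    bytes([0xFF, 0x00]): handle_footer,
}

def dispatch(tag, payload):
    """Look up the handler for a binary tag and apply it to the payload."""
    # bytes(tag) makes a hashable copy if a mutable buffer is passed in.
    return HANDLERS[bytes(tag)](payload)
```

No character string is involved at any point; the keys are plain
sequences of small integers.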

> From what I understand, it would seem that you would suggest that I use
> something like the following...
> 
> handler = {bytes('...', encoding=...).encode('latin-1'): ...,
>            #or
>            '\uXXXX\uXXXX...': ...,
>            #or even without bytes/str
>            (0xXX, 0xXX, ...): ..., }
> 
> Note how two of those examples have non-textual data inside of a Python
> 3.x string?  Yeah.

Unfortunately, I don't notice. I assume you don't mean a literal '...';
if that is what you mean, I would write

handler = { '...': "some text" }

But I cannot guess what you want to put into '...' instead.

>>> We've not removed the problem, only changed it from being contained
>>> in non-unicode strings to be contained in unicode strings (which are
>>> 2 or 4 times larger than their non-unicode counterparts).
>> We have removed the problem.
> 
> Excuse me?  People are going to use '...' to represent literals of all
> different kinds.

In Python 3, '...' will be a character string. You can't use it to
represent anything else but characters.

Can you give examples of actual source code where people use '...'
for something other than text?

> What does pickle.load(...) do to the files that are passed into it?  It
> reads the (possibly binary) data it reads in from a file (or file-like
> object), performing a particular operation based on a dictionary
> of expected tokens in the file, producing a Python object. I would say
> that pickle 'parses' the content of a file, which I presume isn't
> necessarily text.

Right. I think pickle can be implemented with just the bytes type.
There is a textual and a binary version of the pickle format. The
textual version should be read as text, the binary version as bytes.
(I *think* the encoding of the textual version is ASCII, but one
would have to check).

> Replace pickle with a structured data storage format of your choice.
> It's still parsing (at least according to my grasp of English, which
> could certainly be flawed (my wife says as much on a daily basis)).

Well, I have no "native language" intuition with respect to these
words - I only use them in the way I see them used elsewhere. I see
"parsing" typically associated with textual data, and "unmarshalling"
with binary data.

>>> Look how successful and
>>> effective it has been so far in the history of Python.  In order to make
>>> the bytes object be as effective in 3.x, one would need to add basically
>>> all of the Python 2.x string methods to it
>> The precondition of this clause is misguided: the bytes type doesn't
>> need to be as effective, since the string type is as effective in 2.3,
>> so you can do all parsing based on strings.
> 
> Not if those strings contain binary data.

I couldn't (until right now) understand what you mean by "parsing binary
data". I still doubt that you typically need the same operations for
unmarshalling that you need for parsing.

In parsing, you need to look (scan) for separators, follow a possibly
recursive grammar, and so on. For unmarshalling, you typically have
a TLV (tag-length-value) structure, where you read a tag (of fixed
size), then the length, then the value (of the size indicated in the
length). There are variations, of course, but you typically don't
need .find, .startswith, etc.
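
A minimal sketch of such a TLV reader, under assumed field sizes (a
1-byte tag and a 2-byte big-endian length; real formats differ). The
function names are made up for this example.

```python
import struct

def read_tlv(buf, offset=0):
    """Read one record: 1-byte tag, 2-byte big-endian length, then value.

    Returns (tag, value, offset_of_next_record).
    """
    tag, length = struct.unpack_from('>BH', buf, offset)
    start = offset + 3          # skip the fixed-size tag and length fields
    end = start + length
    if end > len(buf):
        raise ValueError("truncated value")
    return tag, buf[start:end], end

def read_all(buf):
    """Unmarshal every record in buf, in order."""
    records = []
    offset = 0
    while offset < len(buf):
        tag, value, offset = read_tlv(buf, offset)
        records.append((tag, value))
    return records

# Two records: tag 1 with value b'hi', tag 2 with an empty value.
sample = bytes([1, 0, 2]) + b'hi' + bytes([2, 0, 0])
records = read_all(sample)
```

Only indexing, slicing and fixed-size unpacking are used; nothing here
scans for separators.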

Regards,
Martin