[Python-Dev] methods on the bytes object
"Martin v. Löwis"
martin at v.loewis.de
Mon May 1 10:55:52 CEST 2006
Josiah Carlson wrote:
> Certainly that is the case. But how would you propose embedded bytes
> data be represented? (I talk more extensively about this particular
> issue later).
I can't answer that: I don't know what "embedded bytes data" is.
> Um...struct.unpack() already works on unicode...
> >>> struct.unpack('>L', u'work')
> (2003792491L,)
> As does array.fromstring...
> >>> a = array.array('B')
> >>> a.fromstring(u'work')
> >>> a
> array('B', [119, 111, 114, 107])
> ... assuming that all characters are in the 0...127 range. But that's a
> different discussion.
Yes, it applies the default encoding. This is unfortunate: it shouldn't
have worked in the first place. I hope this gives a type error in Python
3, with the default encoding gone.
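For illustration, here is a minimal sketch of the behaviour hoped for above, assuming a Python 3 interpreter as it was later released, where struct.unpack accepts only bytes-like input. The helper name unpack_u32 is hypothetical.

```python
import struct

def unpack_u32(data):
    # Interpret exactly four bytes as one big-endian unsigned 32-bit integer.
    return struct.unpack(">L", data)[0]

# Bytes input works and yields the same number as the 2.x example.
value = unpack_u32(b"work")

# A character string is rejected instead of being silently encoded.
try:
    unpack_u32("work")
    rejected = False
except TypeError:
    rejected = True
```

With no default encoding to fall back on, the caller has to choose one explicitly, for example unpack_u32("work".encode("ascii")).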
> I am apparently not communicating this particular idea effectively
> enough. How would you propose that I store parsing literals for
> non-textual data, and how would you propose that I set up a dictionary
> to hold some non-trivial number of these parsing literals?
I can't answer that question: I don't know what a "parsing literal
for non-textual data" is. If you are asking how to represent a bytes
object in source code: I would encode it as a list of integers,
then use, say,
parsing_literal = bytes([3,5,30,99])
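A dictionary of such literals could be sketched as follows. This assumes that bytes objects are immutable and hashable (as they are in released Python 3); the tag values, the descriptions, and the helper name describe are made up for illustration.

```python
# Hypothetical tags, written as lists of integers rather than as text.
TAG_INT = bytes([3, 5, 30, 99])
TAG_STR = bytes([3, 5, 30, 100])

# Assumption: bytes is hashable, so it can serve as a dictionary key.
handlers = {
    TAG_INT: "integer record",
    TAG_STR: "string record",
}

def describe(tag):
    # Convert to bytes first so that mutable buffers (bytearray) work too.
    return handlers.get(bytes(tag), "unknown record")
```

No character string appears anywhere in the keys, so no encoding is involved in the lookup.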
> From what I understand, it would seem that you would suggest that I use
> something like the following...
> handler = {bytes('...', encoding=...).encode('latin-1'): ...,
> #or
> '\uXXXX\uXXXX...': ...,
> #or even without bytes/str
> (0xXX, 0xXX, ...): ..., }
> Note how two of those examples have non-textual data inside of a Python
> 3.x string? Yeah.
Unfortunately, I don't notice. I assume you don't mean a literal '...';
if that is what you mean, I would write
handler = { '...': "some text" }
But I cannot guess what you want to put into '...' instead.
>>> We've not removed the problem, only changed it from being contained
>>> in non-unicode strings to be contained in unicode strings (which are
>>> 2 or 4 times larger than their non-unicode counterparts).
>> We have removed the problem.
> Excuse me? People are going to use '...' to represent literals of all
> different kinds.
In Python 3, '...' will be a character string. You can't use it to
represent anything else but characters.
Can you give examples of actual source code where people use '...'
for something other than text?
> What does pickle.load(...) do to the files that are passed into it? It
> reads the (possibly binary) data it reads in from a file (or file-like
> object), performing a particular operation based on a dictionary
> of expected tokens in the file, producing a Python object. I would say
> that pickle 'parses' the content of a file, which I presume isn't
> necessarily text.
Right. I think pickle can be implemented with just the bytes type.
There is a textual and a binary version of the pickle format. The
textual version should be read as text; the binary version should be
read as bytes. (I *think* the encoding of the textual version is ASCII,
but one would have to check.)
> Replace pickle with a structured data storage format of your choice.
> It's still parsing (at least according to my grasp of English, which
> could certainly be flawed (my wife says as much on a daily basis)).
Well, I have no "native language" intuition with respect to these
words - I only use them in the way I see them used elsewhere. I see
"parsing" typically associated with textual data, and "unmarshalling"
with binary data.
>>> Look how successful and
>>> effective it has been so far in the history of Python. In order to make
>>> the bytes object be as effective in 3.x, one would need to add basically
>>> all of the Python 2.x string methods to it
>> The precondition of this clause is misguided: the bytes type doesn't
>> need to be as effective, since the string type is as effective in 2.3,
>> so you can do all parsing based on strings.
> Not if those strings contain binary data.
I couldn't (until right now) understand what you mean by "parsing binary
data". I still doubt that you typically need the same operations for
unmarshalling that you need for parsing.
In parsing, you need to look (scan) for separators, follow a possibly
recursive grammar, and so on. For unmarshalling, you typically have
a TLV (tag-length-value) structure, where you read a tag (of fixed
size), then the length, then the value (of the size indicated in the
length). There are variations, of course, but you typically don't
need .find, .startswith, etc.
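The TLV reading described above can be sketched as follows. The layout is an assumption chosen for illustration (a 1-byte tag followed by a 2-byte big-endian length), and the names read_tlv and read_all are hypothetical; real formats vary.

```python
import struct

def read_tlv(data, offset=0):
    """Read one record: 1-byte tag, 2-byte big-endian length, then the value.

    Returns (tag, value, offset of the next record).
    """
    # Fixed-size header: no searching for separators is needed.
    tag, length = struct.unpack_from(">BH", data, offset)
    start = offset + 3
    end = start + length
    if end > len(data):
        raise ValueError("truncated value")
    return tag, data[start:end], end

def read_all(data):
    """Read consecutive TLV records until the input is exhausted."""
    records = []
    offset = 0
    while offset < len(data):
        tag, value, offset = read_tlv(data, offset)
        records.append((tag, value))
    return records

# Three records: tag 1 with a two-byte value, tag 2 with an empty value,
# and tag 3 with the single byte 255.
sample = bytes([1, 0, 2]) + b"hi" + bytes([2, 0, 0]) + bytes([3, 0, 1, 255])
records = read_all(sample)
```

The whole reader uses only indexing, slicing, and fixed-size unpacking; nothing like .find or .startswith is called.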
