Add encoding attribute to bytes
A Python interpreter has one encoding for floats, ints, and strings. sys.float_info and sys.int_info give details about the first two, although they are mostly invisible to user code. (I presume they are attached to sys rather than float and int precisely because of this.) A couple of recent posts have discussed making the unicode encoding (UCS2 vs. UCS4) both less visible and more discoverable to extensions.
Bytes are nearly always an encoding of *something*, but the particular encoding used is instance-specific. As Guido has said, the programmer must keep track. But how? In an OO language, one obvious way is as an attribute of the instance. That would be carried with the instance and make it self-identifying.
What I do not know is whether it is feasible to give an immutable instance of a builtin class a mutable attribute slot. If it were, I think this could make 3.x bytes easier and more transparent to use. When a string is encoded to bytes, the attribute would be set. If it were then pickled, the attribute would be stored with it and restored with it, and less easily lost. If it were then decoded, the attribute would be used. If it were sent to the net, the attribute would be used to set the appropriate headers. The reverse process would apply from net to bytes to (unicode) text.
Bytes representing other types of data, such as media, could also be tagged, not just those representing text.
This would be a proposal for 3.3 at the earliest. It would involve revising stdlib modules, as appropriate, to use the new info.
Terry Jan Reedy
You said "give an immutable instance of a builtin class a mutable attribute slot". Why would the slot be mutable? Surely if the attribute said that the bytes represented a certain type of data then you shouldn't be able to change it. ("The attribute says that the bytes are UTF-8, but I'm going to change it so that it says they are ISO-8859-1.") I think that the attribute should be immutable.
MRAB writes:
You said "give an immutable instance of a builtin class a mutable attribute slot". Why would the slot be mutable?
I think the idea is that in many cases you won't know what the encoding is until after you've read the bytes. But I don't really see this idea as that useful either way.
The obvious use case for me would be in the email module. So you read in a message and create a bytes object, which you stash away for later use as necessary. The header and the body, each MIME part, each MIME part header and payload, and so on recursively are identified as slices of the BigBytesObject you read in at the beginning, which is implicitly a binary blob and doesn't need an encoding (strike one). Each header identifies the encoding (which here would have to refer ambiguously to Content-Type or Content-Transfer-Encoding, strike two) of the corresponding payload. And you'll need to deal with cases where Content-Type and Content-Transfer-Encoding are both relevant, strike three. You may as well keep the various layers of encoding explicitly in email-specific objects, so use case: email strikes out.
That's only one use case, of course. But we can see what a use case would have to look like: you read in a bytes object, just enough to enable you to accurately parse the rest of the stream in the same way and tag each bytes part with an appropriate encoding. What are they?
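The layering described above can be seen with the standard email package. This is only a minimal sketch: the sample message and the variable names are made up for illustration. It shows that one payload carries two separate "encodings", a transfer encoding and a character set, so a single attribute on the bytes would have to pick one.

```python
import base64
import email

# Hypothetical sample: Latin-1 text, wrapped in base64 for transport.
body = base64.b64encode("caf\u00e9".encode("iso-8859-1"))
raw = (
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + body + b"\r\n"
)

msg = email.message_from_bytes(raw)

# Two different layers of encoding apply to the same payload.
transfer_encoding = msg["Content-Transfer-Encoding"]
charset = msg.get_content_charset()

# decode=True undoes only the transfer encoding; the result is still bytes.
payload = msg.get_payload(decode=True)

# The character set is applied as a second, separate step.
text = payload.decode(charset)
print(transfer_encoding, charset, payload, text)
```

Here the message object, not the bytes, is what keeps track of both layers.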
MRAB wrote:
You said "give an immutable instance of a builtin class a mutable attribute slot". Why would the slot be mutable?
As Stephen said, in case the info is initially missing or determined to be erroneous.
Surely if the attribute said that the bytes represented a certain type of data then you shouldn't be able to change it. ("The attribute says that the bytes are UTF-8, but I'm going to change it so that it says they are ISO-8859-1.") I think that the attribute should be immutable.
Encoding set by unicode.encode or a wrapper thereof is definitionally correct and should not be changed. Encoding inferred by mimetype header or file extension might be erroneous. I had in mind that the difference might be indicated somehow: 'utf8' versus 'utf8?', for instance. Terry Jan Reedy
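One way to read the 'utf8' versus 'utf8?' idea is as a small tag format with a trailing marker for inferred encodings. The sketch below is an assumption about what that could look like; parse_encoding_tag is a hypothetical helper, not an existing API, and only codecs.lookup comes from the standard library.

```python
import codecs


def parse_encoding_tag(tag):
    """Split a hypothetical tag such as 'utf8?' into (codec_name, is_certain)."""
    # A trailing '?' marks an encoding that was inferred rather than known.
    is_certain = not tag.endswith("?")
    name = tag if is_certain else tag[:-1]
    # codecs.lookup raises LookupError for unknown codec names
    # and reports the normalized name of known ones.
    return codecs.lookup(name).name, is_certain


print(parse_encoding_tag("utf8"))
print(parse_encoding_tag("utf8?"))
```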
Terry Reedy wrote:
MRAB wrote:
You said "give an immutable instance of a builtin class a mutable attribute slot". Why would the slot be mutable?
As Stephen said, in case the info is initially missing or determined to be erroneous.
Surely if the attribute said that the bytes represented a certain type of data then you shouldn't be able to change it. ("The attribute says that the bytes are UTF-8, but I'm going to change it so that it says they are ISO-8859-1.") I think that the attribute should be immutable.
Encoding set by unicode.encode or a wrapper thereof is definitionally correct and should not be changed. Encoding inferred by mimetype header or file extension might be erroneous. I had in mind that the difference might be indicated somehow: 'utf8' versus 'utf8?', for instance.
I was thinking more along the lines of saying that the attribute (default None) is specified when the bytes object is created. You wouldn't be able to change it, but you could create a new bytes object with a different attribute:
    new_bytes = bytes(old_bytes, "utf8")
The actual bytes themselves wouldn't need to be copied; they could be safely shared because 'bytes' objects are immutable. There then comes the question of whether new_bytes == old_bytes.
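The equality question can be explored today with a subclass. The following is a rough sketch under stated assumptions: TaggedBytes is a hypothetical class, not a proposed implementation, and unlike the idea above it copies the data rather than sharing it. The property only discourages reassignment; the underlying instance attribute is still reachable.

```python
class TaggedBytes(bytes):
    """Hypothetical bytes subclass whose encoding tag is fixed at creation."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self._encoding = encoding
        return self

    @property
    def encoding(self):
        # No setter is defined, so assigning to .encoding raises AttributeError.
        return self._encoding


old_bytes = TaggedBytes(b"abc")
new_bytes = TaggedBytes(old_bytes, "utf8")

# Equality and hashing are inherited from bytes, so the tag is ignored.
print(new_bytes == old_bytes)
print(hash(new_bytes) == hash(old_bytes))
print(old_bytes.encoding, new_bytes.encoding)
```

With inherited comparison the two objects are equal and interchangeable as dict keys, so a tag could be silently lost whenever one instance replaces the other.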
Terry Reedy wrote:
Bytes are nearly always an encoding of *something*, but the particular encoding used is instance-specific. As Guido has said, the programmer must keep track. But how? In an OO language, one obvious way is as an attribute of the instance. That would be carried with the instance and make it self-identifying.
I work in comms and spend a lot of time shuttling bytes from one place to another without caring in the least about the encoding. Caring about that kind of detail is application layer stuff and belongs in application layer objects. More importantly, such an attribute implies a defined responsibility for keeping it accurate. For application layer objects, it is possible to define that. For a low level data structure like bytes, it isn't. Attaching metadata to something without defining a responsible entity for keeping that metadata accurate and up to date is a recipe for trouble. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
Terry Reedy wrote:
What I do not know is whether it is feasible to give an immutable instance of a builtin class a mutable attribute slot.
As soon as you can mutate an instance, it is not an immutable type anymore. Calling it "immutable" anyway will cause trouble. (The same bytes instance could be used somewhere else transparently, e.g. as a function default argument, or cached as a constant local.)
As for the usefulness, I often have to work with proprietary communication protocols between computer and devices, and there the bytes have no encoding whatsoever (though I agree that most bytes do have a meaningful encoding). However, a class as fundamental as "bytes" should not be burdened with an attribute that may not even apply -- it's easy to make a custom class to represent a (bytes, encoding) pair.
Georg
-- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
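A (bytes, encoding) pair of the kind mentioned above could be as small as a named tuple. This is a minimal sketch, assuming a hypothetical EncodedBytes class with data and encoding fields; none of these names come from the thread or the standard library.

```python
from typing import NamedTuple, Optional


class EncodedBytes(NamedTuple):
    """Hypothetical immutable pair of raw bytes and an optional encoding name."""

    data: bytes
    encoding: Optional[str] = None

    def decode(self) -> str:
        # Device or protocol data may have no text encoding at all.
        if self.encoding is None:
            raise ValueError("no encoding recorded for this payload")
        return self.data.decode(self.encoding)


text_pair = EncodedBytes("caf\u00e9".encode("utf-8"), "utf-8")
device_pair = EncodedBytes(b"\x02\x10\xff\x03")

print(text_pair.decode())
print(device_pair.encoding)
```

Because the pair is a tuple, it stays immutable and hashable, and the plain bytes type is left untouched.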
Georg Brandl wrote:
What I do not know is whether it is feasible to give an immutable instance of a builtin class a mutable attribute slot.
As soon as you can mutate an instance, it is not an immutable type anymore. Calling it "immutable" anyway will cause trouble. (The same bytes instance could be used somewhere else transparently, e.g. as a function default argument, or cached as a constant local.)
OK, scratch that implementation of my idea.
As for the usefulness, I often have to work with proprietary communication protocols between computer and devices, and there the bytes have no encoding whatsoever
Random bits? It seems to me that protocol means some sort of encoding, formatting, or structuring, some sort of agreed on interpretation, even if private.
(though I agree that most bytes do have a meaningful encoding). However, a class as fundamental as "bytes" should not be burdened with an attribute that may not even apply -- it's easy to make a custom class to represent a (bytes, encoding) pair.
The fundamental problem I am interested in is the separation of raw data from how to use it info. Text encoding of bytes is only one instance, though the most common one that pops up on Python list. I had also thought of something like (incomplete):
    class Textbytes:
        def __init__(self, text, code):
            # Accept either str (encoded here) or already-encoded bytes.
            if type(text) is str:
                text = text.encode(code)
            if type(text) is bytes:
                self.text = text
                self.code = code
            else:
                raise ValueError()
        def __str__(self):
            return self.text.decode(self.code)

    b = Textbytes('abc', 'utf8')
    print(b)
One problem is that it is a lot bulkier than a raw bytes. Leaving that aside, a custom class is just that: custom. Stdlib modules will neither accept nor produce such a wrapper rather than bytes.
My underlying idea is that maybe the standard Python distribution should promote encapsulation of encoding info with raw bytes to make bug-free usage easier. Adding an attribute was one implementation idea. Adding a standardized wrapper class (at least in a module) would be another.
Terry Jan Reedy
Terry Reedy writes:
The fundamental problem I am interested in is the separation of raw data from how to use it info.
But this is ambiguous. Take reStructuredText. It *is* text/plain. But it also *is* application/x-structuredtext. Not to forget application/octet-stream. An MUA will treat it as the first, docutils as the second, and gzip as the third.
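A small sketch of that ambiguity, using only the standard library. The sample document and the crude title check are illustrative stand-ins (the check is not how docutils parses reStructuredText); the point is that the same bytes support three readings at once.

```python
import gzip

# Hypothetical sample document.
data = b"Title\n=====\n\nSome *emphasised* text.\n"

# Reading 1: text/plain. Decode and display, markup included.
as_text = data.decode("ascii")

# Reading 2: structured text. A crude stand-in for a real parser:
# treat a line underlined with '=' as the document title.
lines = as_text.splitlines()
title = lines[0] if set(lines[1]) == {"="} else None

# Reading 3: application/octet-stream. gzip never looks at the meaning.
compressed = gzip.compress(data)
round_tripped = gzip.decompress(compressed)

print(title)
print(round_tripped == data)
```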
My underlying idea is that maybe the standard Python distribution should promote encapsulation of encoding info with raw bytes to make bug-free usage easier.
I think you will find that every use case makes different demands on this feature, and that it typically interacts with higher-level needs of the application. There's a reason that ASN.1 is insanely complex and only applications that really need it ever use it. This feature will either be too simple to serve most practical needs, or too complex to serve most practical programmers.<wink>
And "bug-free" usage is hopeless. Much, perhaps the vast majority, of the coding information will be automatically derived from sources you deprecate as "heuristic", like MIME Content-Type headers. It will get attached to the bytes as an attribute, and after that you can't know how reliable it is.
If you have a practical example of such a simple class (bytes + encoding attribute) that serves as a base for more complex applications, I'd really like to see it. But until there are real use cases on the table, I have to say I can't see the proposed facility as being particularly useful to the email package, for example.
Terry Reedy wrote:
Georg Brandl wrote:
As for the usefulness, I often have to work with proprietary communication protocols between computer and devices, and there the bytes have no encoding whatsoever
Random bits? It seems to me that protocol means some sort of encoding, formatting, or structuring, some sort of agreed on interpretation, even if private.
Sure, but nothing you could map entirely onto a string of Unicode characters. Georg
Georg Brandl wrote:
Terry Reedy wrote:
Random bits? It seems to me that protocol means some sort of encoding, formatting, or structuring, some sort of agreed on interpretation, even if private.
Sure, but nothing you could map entirely onto a string of Unicode characters.
My idea is not limited to unicode encodings. But I see that one field/attribute can be either too many or too few, and hence not a universal solution.
Terry Reedy wrote:
As for the usefulness, I often have to work with proprietary communication protocols between computer and devices, and there the bytes have no encoding whatsoever
Random bits? It seems to me that protocol means some sort of encoding, formatting, or structuring, some sort of agreed on interpretation, even if private.
This is true, but the encoding scheme *isn't* a property of the binary data in and of itself. It's metadata about it that guides the application as to how the stream should be interpreted.
For a lot of the things I've done in the past, I haven't cared at all about the encoding of binary data - I've just been schlepping bits from point A to point B and back without caring what they actually *meant*. Other times I didn't have to guess or pass any metadata around because the comms port was hardwired to a particular device that only knew one way of communicating - the definition of the protocol was implicit in the implementation of the interface software.
In fact, one of the key features typically desired in a communications protocol is for it to be content neutral: you push binary data in one end and get the same binary data out of the other end. Peer applications using the channel to communicate with each other don't need to care what the channel is doing with the data, but equally importantly, the software implementing the comms channel doesn't need to know how to interpret the bits it is transporting*.
For other applications, the Unicode encoding might be important to know. Some will care more about the MIME type, or use some other defined binary encoding (what is the Unicode encoding of an sqlite or bsddb database file?). Other applications may be interested in a proprietary binary format that is formally defined solely by the code that knows how to read and write it.
Can bytes be used to store encoded Unicode data? Sure they can. But they can be used for a whole host of other things as well, so burdening them with an attribute that is occasionally helpful, but more often dead weight or even outright misleading, would be a mistake.
Cheers, Nick.
* Sometimes a bit more coupling makes sense when there are engineering advantages to be had, but this is usually an application specific thing (e.g. IP has a protocol field that identifies different application layer protocols such as TCP, UDP and ESP which have different network performance expectations. This allows IP network routers to apply different rules without having to peek inside the payload of each IP packet).
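The footnote's IP example can be sketched with the struct module. This is an illustration under assumptions: the packet is hand-built with made-up addresses and an uncomputed checksum, and ip_protocol is a hypothetical helper. The protocol numbers 6, 17 and 50 are the IANA-assigned values for TCP, UDP and ESP.

```python
import struct

# IANA-assigned protocol numbers for the three protocols named in the footnote.
PROTOCOL_NAMES = {6: "TCP", 17: "UDP", 50: "ESP"}


def ip_protocol(packet):
    """Return the protocol name from an IPv4 header without reading the payload."""
    # The protocol number is the single byte at offset 9 of the IPv4 header.
    number = packet[9]
    return PROTOCOL_NAMES.get(number, "protocol %d" % number)


# Hypothetical packet: a minimal 20-byte IPv4 header followed by 4 payload bytes.
header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45,                    # version 4, header length of 5 32-bit words
    0,                       # type of service
    20 + 4,                  # total length: header plus payload
    0,                       # identification
    0,                       # flags and fragment offset
    64,                      # time to live
    17,                      # protocol: UDP
    0,                       # header checksum (not computed in this sketch)
    bytes([192, 0, 2, 1]),   # source address (documentation range)
    bytes([192, 0, 2, 2]),   # destination address (documentation range)
)
packet = header + b"\xde\xad\xbe\xef"

print(ip_protocol(packet))
```

The metadata lives in the stream itself, at a fixed position defined by the protocol, rather than on the Python object holding the bytes.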
Your experience has been different from mine. Thanks for the exposition. I can see why you prefer metadata to either be in the stream itself or as part of a wrapper object. Terry Jan Reedy
Terry Reedy wrote:
Your experience has been different from mine. Thanks for the exposition. I can see why you prefer metadata to either be in the stream itself or as part of a wrapper object.
One of the things I've learned on python-list/-dev/-ideas is that the *kind* of software one writes regularly makes a big difference to what seems like a good idea. I tend to write fairly low level hardware control code, so that's the way I tend to think. Others come from the financial world or from an academic/scientific background or are interested in Python for education purposes or in building big frameworks that try to solve the world (or at least a particular problem space within it ;). It says a lot about Python's flexibility as a language that it applies so well to so many different problem domains, but it can lead to some interesting discussions when we try to align the interests of all those different ill-defined groups :) Cheers, Nick.
Nick Coghlan wrote:
It says a lot about Python's flexibility as a language that it applies so well to so many different problem domains, but it can lead to some interesting discussions when we try to align the interests of all those different ill-defined groups :)
Yes, and I think that because of this diversity of requirements, it's very important to keep the basic building blocks of the language as simple and focused as possible. The fundamental types should each concentrate on doing just one thing and doing it well. Seems to me the bytes type is just right as it is -- basic raw data that you can use any way you see fit. Anything more specialised should be built by the user to suit their use case. -- Greg
On Thu, Nov 5, 2009 at 8:15 PM, Terry Reedy <tjreedy@udel.edu> wrote:
A Python interpreter has one encoding for floats, ints, and strings. sys.float_info and sys.int_info give details about the first two.
(Instead of changing bytes,) This suggests a sys.string_info that contains information about the default string representation --including whether the internal encoding is UCS2 or UCS4 or something else. That should at least make it possible to give better diagnostic messages. -jJ
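For comparison, here is what the interpreter already exposes. This is a small sketch using only existing sys attributes; unicode_build is an illustrative variable, not an API. On interpreters of the era discussed here, sys.maxunicode distinguished narrow (UCS2) from wide (UCS4) builds; since Python 3.3 it is always 0x10FFFF.

```python
import sys

# Interpreter-wide details about the float and int representations.
print(sys.float_info.mant_dig, sys.float_info.max)
print(sys.int_info.bits_per_digit, sys.int_info.sizeof_digit)

# The closest existing hint about the string representation:
# 0xFFFF on narrow builds, 0x10FFFF on wide builds and on 3.3 or later.
unicode_build = "UCS4" if sys.maxunicode > 0xFFFF else "UCS2"
print(hex(sys.maxunicode), unicode_build)
```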
Jim Jewett wrote:
On Thu, Nov 5, 2009 at 8:15 PM, Terry Reedy <tjreedy@udel.edu> wrote:
A Python interpreter has one encoding for floats, ints, and strings. sys.float_info and sys.int_info give details about the first two.
(Instead of changing bytes,)
This suggests a sys.string_info that contains information about the default string representation --including whether the internal encoding is UCS2 or UCS4 or something else.
That should at least make it possible to give better diagnostic messages.
What to do about interpreter-wide unicode string info, if anything, is related but separate from what to do about instance-specific bytes info.
participants (7)
- Georg Brandl
- Greg Ewing
- Jim Jewett
- MRAB
- Nick Coghlan
- Stephen J. Turnbull
- Terry Reedy