[Python-ideas] Add encoding attribute to bytes

Tue Nov 10 03:54:45 CET 2009

Terry Reedy wrote:
> MRAB wrote:
>> Terry Reedy wrote:
>>> A Python interpreter has one encoding for floats, ints, and strings. 
>>> sys.float_info and sys.int_info give details about the first two, 
>>> although they are mostly invisible to user code. (I presume they are 
>>> attached to sys rather than float and int precisely because of this.) A 
>>> couple of recent posts have discussed making the unicode encoding 
>>> (UCS2 v 4) both less visible and more discoverable to extensions.
>>>
>>> Bytes are nearly always an encoding of *something*, but the 
>>> particular encoding used is instance-specific. As Guido has said, the 
>>> programmer must keep track. But how? In an OO language, one obvious 
>>> way is as an attribute of the instance. That would be carried with 
>>> the instance and make it self-identifying.
>>>
>>> What I do not know is whether it is feasible to give an immutable 
>>> instance of a builtin class a mutable attribute slot. If it were, I think this 
>>> could make 3.x bytes easier and more transparent to use. When a 
>>> string is encoded to bytes, the attribute would be set. If it were 
>>> then pickled, the attribute would be stored with it and restored with 
>>> it, and less easily lost. If it were then decoded, the attribute 
>>> would be used. If it were sent to the net, the attribute would be 
>>> used to set the appropriate headers. The reverse process would apply 
>>> from net to bytes to (unicode) text.
>>>
>>> Bytes representing other types of data, such as media, could also be 
>>> tagged, not just those representing text.
>>>
>>> This would be a proposal for 3.3 at the earliest. It would involve 
>>> revising stdlib modules, as appropriate, to use the new info.
>>>
>> You said "give an immutable instance of a builtin class a mutable
>> attribute slot". Why would the slot be mutable? 
> 
> As Stephen said, in case the info is initially missing or determined to 
> be erroneous.
> 
>> Surely if the attribute
>> said that the bytes represented a certain type of data then you
>> shouldn't be able to change it. ("The attribute says that the bytes are
>> UTF-8, but I'm going to change it so that it says they are ISO-8859-1.")
>> I think that the attribute should be immutable.
> 
> Encoding set by unicode.encode or a wrapper thereof is definitionally 
> correct and should not be changed. Encoding inferred by mimetype header 
> or file extension might be erroneous. I had in mind that the difference 
> might be indicated somehow: 'utf8' versus 'utf8?', for instance.
> 
I was thinking more along the lines of saying that the attribute
(default None) is specified when the bytes object is created. You
wouldn't be able to change it, but you could create a new bytes object
with a different attribute:

     new_bytes = bytes(old_bytes, "utf8")

The actual bytes themselves wouldn't need to be copied; they could be
safely shared because 'bytes' objects are immutable.

There then comes the question of whether new_bytes == old_bytes.
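
To make the idea concrete, here is a rough sketch of how it might look
today as a pure-Python subclass. Everything in it is an assumption for
illustration only: the name TaggedBytes, the retag() and text() helpers,
and the choice to leave == as plain bytes comparison are all made up and
are not an existing or proposed API. Unlike the builtin version imagined
above, a subclass copies the data rather than sharing it.

```python
class TaggedBytes(bytes):
    """Hypothetical bytes subclass carrying a read-only encoding tag."""

    def __new__(cls, data=b"", encoding=None):
        # bytes is immutable, so the payload must be supplied in __new__.
        self = super().__new__(cls, data)
        # Kept in the instance __dict__; the property below exposes it
        # without a setter (assumption: read-only by convention only).
        self._encoding = encoding
        return self

    @property
    def encoding(self):
        """The encoding given at creation time, or None if unknown."""
        return self._encoding

    def retag(self, encoding):
        """Return a new object with the same bytes and a different tag."""
        return TaggedBytes(self, encoding)

    def text(self):
        """Decode using the recorded encoding; fail if none was recorded."""
        if self._encoding is None:
            raise ValueError("no encoding recorded for these bytes")
        return self.decode(self._encoding)


# Tag set at creation time, as if by an encode() wrapper.
old_bytes = TaggedBytes("caf\u00e9".encode("utf-8"), "utf-8")

# "Changing" the tag means building a new object.
new_bytes = old_bytes.retag("latin-1")

# Equality is inherited from bytes, so only the payload is compared.
same_payload = new_bytes == old_bytes
```

In this sketch new_bytes == old_bytes is true even though the tags
differ, and the two decode to different text. That is only one possible
answer to the question; comparing the tag as well would make the two
unequal but would also change how tagged objects compare with ordinary
bytes.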