[Python-ideas] Add encoding attribute to bytes

Tue Nov 10 03:22:14 CET 2009

MRAB wrote:
> Terry Reedy wrote:
>> A Python interpreter has one encoding for floats, ints, and strings. 
>> sys.float_info and sys.int_info give details about the first two,
>> although they are mostly invisible to user code. (I presume they are
>> attached to sys rather than float and int precisely because of this.) A
>> couple of recent posts have discussed making the unicode encoding 
>> (UCS2 vs. UCS4) both less visible and more discoverable to extensions.
>>
>> Bytes are nearly always an encoding of *something*, but the particular 
>> encoding used is instance-specific. As Guido has said, the programmer 
>> must keep track. But how? In an OO language, one obvious way is as an 
>> attribute of the instance. That would be carried with the instance and 
>> make it self-identifying.
>>
>> What I do not know is whether it is feasible to give an immutable
>> instance of a builtin class a mutable attribute slot. If it were, I
>> think this
>> could make 3.x bytes easier and more transparent to use. When a string 
>> is encoded to bytes, the attribute would be set. If it were then 
>> pickled, the attribute would be stored with it and restored with it, 
>> and less easily lost. If it were then decoded, the attribute would be 
>> used. If it were sent to the net, the attribute would be used to set 
>> the appropriate headers. The reverse process would apply from net to 
>> bytes to (unicode) text.
>>
>> Bytes representing other types of data, such as media, could also be
>> tagged, not just those representing text.
>>
>> This would be a proposal for 3.3 at the earliest. It would involve
>> revising stdlib modules, as appropriate, to use the new info.
>>
> You said "give an immutable instance of a builtin class a mutable
> attribute slot". Why would the slot be mutable? 

As Stephen said, in case the info is initially missing or determined to 
be erroneous.
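
To make the "mutable slot on an immutable instance" question concrete, here
is a rough sketch of what current Python allows. A builtin bytes object
rejects new attributes, but an instance of a plain subclass carries a
__dict__ and accepts them while the byte content stays immutable. The names
TaggedBytes and can_set_attribute are made up for illustration; nothing like
them exists in the stdlib, and the proposal itself concerns builtin bytes,
not a subclass.

```python
class TaggedBytes(bytes):
    """Hypothetical bytes subclass; its instances get a __dict__."""


def can_set_attribute(obj):
    """Return True if obj accepts a new 'encoding' attribute."""
    try:
        obj.encoding = "utf-8"
    except AttributeError:
        return False
    return True


plain = b"caf\xc3\xa9"
tagged = TaggedBytes(plain)

builtin_ok = can_set_attribute(plain)    # builtin bytes has no attribute slot
subclass_ok = can_set_attribute(tagged)  # subclass instance accepts the attribute

# The attribute can be rebound later, e.g. if the first guess proved wrong;
# the bytes themselves are unchanged.
tagged.encoding = "latin-1"
```

The subclass instance still compares equal to, and hashes the same as, the
plain bytes it was built from, so the tag is extra information only.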

> Surely if the attribute
> said that the bytes represented a certain type of data then you
> shouldn't be able to change it. ("The attribute says that the bytes are
> UTF-8, but I'm going to change it so that it says they are ISO-8859-1.")
> I think that the attribute should be immutable.

Encoding set by unicode.encode or a wrapper thereof is definitionally 
correct and should not be changed. Encoding inferred by mimetype header 
or file extension might be erroneous. I had in mind that the difference 
might be indicated somehow: 'utf8' versus 'utf8?', for instance.
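
A minimal sketch of how the 'utf8' versus 'utf8?' distinction might look,
written as a bytes subclass since builtin bytes has no such attribute.
Everything here (EncodedBytes, encode_tagged, from_header, is_certain,
to_text) is hypothetical naming for illustration, not an existing or
proposed API, and the trailing "?" is just the marker suggested above.

```python
class EncodedBytes(bytes):
    """Hypothetical bytes subclass carrying an 'encoding' tag."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding  # ordinary, rebindable instance attribute
        return self

    @property
    def is_certain(self):
        """True when the tag was set by encoding, not inferred."""
        return self.encoding is not None and not self.encoding.endswith("?")

    def to_text(self):
        """Decode using the tag, ignoring the trailing '?' marker."""
        if self.encoding is None:
            raise ValueError("no encoding recorded for these bytes")
        return self.decode(self.encoding.rstrip("?"))


def encode_tagged(text, encoding):
    """Encode text; the tag is correct by construction."""
    return EncodedBytes(text.encode(encoding), encoding)


def from_header(data, charset):
    """Wrap received bytes; the charset is only a claim, so mark it with '?'."""
    return EncodedBytes(data, charset + "?")


sent = encode_tagged("caf\u00e9", "utf8")
received = from_header(b"caf\xe9", "utf8")  # header claims utf8, data is latin-1
```

Here sent.to_text() round-trips, while received.to_text() raises
UnicodeDecodeError because the inferred tag is wrong. Rebinding
received.encoding to "latin-1?" then lets the decode succeed, which is the
case for keeping the slot mutable.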

Terry Jan Reedy