[Python-ideas] Add encoding attribute to bytes
tjreedy at udel.edu
Tue Nov 10 03:22:14 CET 2009
> Terry Reedy wrote:
>> A Python interpreter has one encoding for floats, ints, and strings.
>> sys.float_info and sys.int_info give details about the first two,
>> although they are mostly invisible to user code. (I presume they are
>> attached to sys rather than float and int precisely because of this.) A
>> couple of recent posts have discussed making the unicode encoding
>> (UCS2 v 4) both less visible and more discoverable to extensions.
>> Bytes are nearly always an encoding of *something*, but the particular
>> encoding used is instance-specific. As Guido has said, the programmer
>> must keep track. But how? In an OO language, one obvious way is as an
>> attribute of the instance. That would be carried with the instance and
>> make it self-identifying.
>> What I do not know is whether it is feasible to give an immutable
>> instance of a builtin class a mutable attribute slot. If it were, I think this
>> could make 3.x bytes easier and more transparent to use. When a string
>> is encoded to bytes, the attribute would be set. If it were then
>> pickled, the attribute would be stored with it and restored with it,
>> and less easily lost. If it were then decoded, the attribute would be
>> used. If it were sent to the net, the attribute would be used to set
>> the appropriate headers. The reverse process would apply from net to
>> bytes to (unicode) text.
>> Bytes representing other types of data, such as media, could also be
>> tagged, not just those representing text.
>> This would be a proposal for 3.3 at the earliest. It would involve
>> revising stdlib modules, as appropriate, to use the new info.
> You said "give an immutable instance of a builtin class a mutable
> attribute slot". Why would the slot be mutable?
As Stephen said, in case the info is initially missing or determined to
> Surely if the attribute
> said that the bytes represented a certain type of data then you
> shouldn't be able to change it. ("The attribute says that the bytes are
> UTF-8, but I'm going to change it so that it says they are ISO-8859-1.")
> I think that the attribute should be immutable.
Encoding set by unicode.encode or a wrapper thereof is definitionally
correct and should not be changed. Encoding inferred by mimetype header
or file extension might be erroneous. I had in mind that the difference
might be indicated somehow: 'utf8' versus 'utf8?', for instance.
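
To make the idea concrete, here is a minimal sketch using a subclass of bytes rather than a change to the builtin itself. The names TaggedBytes, from_text, to_text, label and certain are all hypothetical, invented for illustration; nothing like them exists in the stdlib. The sketch assumes the 'utf8?' convention simply means "append a question mark when the encoding was inferred rather than set by encode".

```python
class TaggedBytes(bytes):
    """Hypothetical bytes subclass that remembers how its content was encoded."""

    def __new__(cls, data=b"", encoding=None, certain=True):
        # The byte content stays immutable; the tag lives in the instance
        # __dict__, so it can still be filled in or corrected later.
        self = super().__new__(cls, data)
        self.encoding = encoding
        self.certain = certain
        return self

    @classmethod
    def from_text(cls, text, encoding="utf-8"):
        # An encoding set by encoding the text ourselves is correct by definition.
        return cls(text.encode(encoding), encoding, certain=True)

    @property
    def label(self):
        # 'utf-8' when the encoding is known, 'utf-8?' when it was only inferred.
        if self.encoding is None:
            return None
        return self.encoding if self.certain else self.encoding + "?"

    def to_text(self):
        # Decode using the recorded encoding instead of making the caller remember it.
        if self.encoding is None:
            raise ValueError("no encoding recorded for these bytes")
        return self.decode(self.encoding)


known = TaggedBytes.from_text("caf\u00e9")
guessed = TaggedBytes(b"abc", "latin-1", certain=False)  # e.g. taken from a header
print(known.label, guessed.label)
```

One limitation of doing this in a subclass: slicing or concatenating a TaggedBytes yields a plain bytes object, so the tag is lost on most operations. That is part of why I think real support would have to come from the builtin type and the stdlib modules that produce bytes.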
Terry Jan Reedy