[Python-Dev] email package status in 3.X
P.J. Eby
pje at telecommunity.com
Mon Jun 21 19:17:57 CEST 2010
At 11:43 AM 6/21/2010 -0400, Barry Warsaw wrote:
>On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
> >Something that may make sense to ease the porting process is for some
> >of these "on the boundary" I/O related string manipulation functions
> >(such as os.path.join) to grow "encoding" keyword-only arguments. The
> >recommended approach would be to provide all strings, but bytes could
> >also be accepted if an encoding was specified. (If you want to mix
> >encodings - tough, do the decoding yourself).
>
>This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz
>for it.
>
>Would it make sense to have "encoding-carrying" bytes and str types?
It's not a stupid idea, and could potentially work. It also might
have a better chance of being able to actually be *implemented* in
3.x than my idea.
>Basically, I'm thinking of types (maybe even the current ones) that carry
>around a .encoding attribute so that they can be automatically encoded and
>decoded where necessary. This at least would simplify APIs that need to do
>the conversion.
I'm not really sure how much use the encoding is on a unicode object
- what would it actually mean?

Hm. I suppose it would effectively mean "this string can be
represented in this encoding" -- which is useful, in that you could
fail operations when combining with bytes of a different encoding.

Hm... no, in that case you should just encode the string to the
bytes' encoding, and let that throw an error if it fails. So,
really, there's no reason for a string to know its encoding. All you
need is the bytes type to have an encoding attribute, and when doing
mixed-type operations between bytes and strings, coerce to *bytes of
the same encoding*.

However, if .encoding is None, then coercion would follow the same
rules as now -- i.e., convert the bytes to unicode, assuming an ascii
encoding. (This would be different than setting an encoding of
'ascii', because in that case, it means you want cross-type
operations to result in ascii bytes, rather than a unicode string,
and to fail if the unicode part can't be encoded appropriately. The
'None' setting is effectively a nod to compatibility with prior 3.x
versions, since I assume we can't just throw out the old coercion behavior.)
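
A minimal sketch of the coercion rule described above, assuming a hypothetical bytes subclass named `ebytes` (the name is borrowed from the end of this message; no such type exists in the stdlib). Only `+` is shown. Note that in released Python 3 versions `bytes + str` raises TypeError, so the `None` branch follows the description in the paragraph above rather than actual interpreter behavior.

```python
class ebytes(bytes):
    """Hypothetical bytes subclass carrying an optional .encoding attribute."""

    def __new__(cls, data=b"", encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            if self.encoding is None:
                # No encoding set: decode the bytes as ASCII and return a
                # unicode string, as the message describes for this case.
                return bytes(self).decode("ascii") + other
            # Encoding known: coerce the str to bytes of the same encoding.
            # Raises UnicodeEncodeError if the text cannot be represented.
            other = other.encode(self.encoding)
        # bytes + bytes; the result keeps the left operand's encoding.
        return ebytes(bytes(self) + other, self.encoding)


tagged = ebytes(b"caf", "utf-8") + "\u00e9"   # ebytes, still tagged utf-8
legacy = ebytes(b"abc") + "\u00e9"            # plain str, since .encoding is None
```

With `encoding="ascii"` the same addition would raise UnicodeEncodeError instead of returning a string, which is the distinction between `'ascii'` and `None` drawn above.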
Then, a few more changes to the bytes type would round out the implementation:

* Allow .decode() to be called without an encoding, unless .encoding is None
* Add back in the missing string methods (e.g. .encode()), since you
  can transparently upgrade to a string
* Smart __str__, as shown in your proposal.
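
The three changes listed above might look roughly like this. This is a sketch under stated assumptions: `ebytes` is a hypothetical name, the `__str__` behavior is a guess (the proposal it refers to is not quoted in this message), and `.encode()` is assumed to mean "decode with the stored encoding, then re-encode".

```python
class ebytes(bytes):
    """Hypothetical bytes subclass carrying an optional .encoding attribute."""

    def __new__(cls, data=b"", encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def decode(self, encoding=None, errors="strict"):
        # The encoding argument becomes optional when .encoding is set.
        encoding = encoding or self.encoding
        if encoding is None:
            raise TypeError("decode() needs an encoding when .encoding is None")
        return super().decode(encoding, errors)

    def encode(self, encoding, errors="strict"):
        # String method "added back": upgrade to str, then re-encode,
        # returning bytes tagged with the new encoding.
        return ebytes(self.decode().encode(encoding, errors), encoding)

    def __str__(self):
        # Assumed "smart" behavior: return the text when the encoding is
        # known, otherwise fall back to the usual b'...' form.
        if self.encoding is None:
            return repr(bytes(self))
        return self.decode()


latin = ebytes("caf\u00e9".encode("latin-1"), "latin-1")
text = latin.decode()          # no encoding argument needed
utf8 = latin.encode("utf-8")   # transcoded, tagged utf-8
```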
>Would it be feasible? Dunno.
Probably, although it might mean adding back in special cases that
were previously taken out, and a few new ones.
> Would it help ease the bytes/str confusion? Dunno.
Not sure what confusion you mean -- Web-SIG and I at least are not
confused about the difference between bytes and str, or we wouldn't
be having an issue. ;-) Or maybe you mean the stdlib's API
confusion? In which case, yes, definitely!
> But I think it would help make APIs easier to design and use because
>it would cut down on the encoding-keyword function signature infection.
Not only that, but I believe it would also retroactively make the
stdlib's implementation of those APIs "correct" again, and give us
One Obvious Way to work with bytes of a known encoding, while
constraining any unicode that gets combined with those bytes to be
validly encodable. It also gives you an idempotent constructor for
bytes of a specified encoding, that can take either a bytes of
unspecified encoding, a bytes of the correct encoding, or a string
that can be encoded as such.
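
The idempotent constructor could be sketched as follows, again with a hypothetical `ebytes` class. Two details are assumptions that the message does not settle: bytes already tagged with a *different* encoding are rejected, and untagged bytes are tagged without being validated.

```python
class ebytes(bytes):
    """Hypothetical bytes type whose constructor is idempotent per encoding."""

    def __new__(cls, data, encoding):
        if isinstance(data, str):
            # A string must be encodable in the target encoding;
            # otherwise UnicodeEncodeError propagates.
            data = data.encode(encoding)
        elif isinstance(data, ebytes) and data.encoding is not None:
            if data.encoding == encoding:
                return data  # already bytes of the correct encoding
            # Assumption: mixing encodings is an error, not a transcode.
            raise ValueError(
                "bytes are tagged %r, not %r" % (data.encoding, encoding)
            )
        # Bytes of unspecified encoding: tag them (no validation here).
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self


from_text = ebytes("caf\u00e9", "utf-8")
same = ebytes(from_text, "utf-8")               # returns from_text itself
from_raw = ebytes(b"caf\xc3\xa9", "utf-8")      # plain bytes get tagged
```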
In short, +1. (I wish it were possible to go back and make bytes
non-strings and have only this ebytes or bstr or whatever type have
string methods, but I'm pretty sure that ship has already sailed.)