[Python-Dev] email package status in 3.X

Mon Jun 21 19:17:57 CEST 2010

At 11:43 AM 6/21/2010 -0400, Barry Warsaw wrote:
>On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
> >Something that may make sense to ease the porting process is for some
> >of these "on the boundary" I/O related string manipulation functions
> >(such as os.path.join) to grow "encoding" keyword-only arguments. The
> >recommended approach would be to provide all strings, but bytes could
> >also be accepted if an encoding was specified. (If you want to mix
> >encodings - tough, do the decoding yourself).
>
>This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz
>for it.
>
>Would it make sense to have "encoding-carrying" bytes and str types?

It's not a stupid idea, and could potentially work.  It also might 
have a better chance of being able to actually be *implemented* in 
3.x than my idea.

>Basically, I'm thinking of types (maybe even the current ones) that carry
>around a .encoding attribute so that they can be automatically encoded and
>decoded where necessary.  This at least would simplify APIs that need to do
>the conversion.

I'm not really sure how much use the encoding is on a unicode object 
- what would it actually mean?

Hm. I suppose it would effectively mean "this string can be 
represented in this encoding" -- which is useful, in that you could 
fail operations when combining with bytes of a different encoding.

Hm... no, in that case you should just encode the string to the 
bytes' encoding, and let that throw an error if it fails.  So, 
really, there's no reason for a string to know its encoding.  All you 
need is the bytes type to have an encoding attribute, and when doing 
mixed-type operations between bytes and strings, coerce to *bytes of 
the same encoding*.

However, if .encoding is None, then coercion would follow the same 
rules as now -- i.e., convert the bytes to unicode, assuming an ascii 
encoding.  (This would be different than setting an encoding of 
'ascii', because in that case, it means you want cross-type 
operations to result in ascii bytes, rather than a unicode string, 
and to fail if the unicode part can't be encoded appropriately.  The 
'None' setting is effectively a nod to compatibility with prior 3.x 
versions, since I assume we can't just throw out the old coercion behavior.)
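
To make the coercion rules above concrete, here is a rough sketch, not a 
worked-out design.  The name "ebytes", its constructor signature, and the 
choice to cover only "+" are all assumptions made for illustration; nothing 
like this exists in the stdlib.  The .encoding-is-None branch follows the 
rule as described above (decode as ascii, return a string); note that the 
stock bytes type raises TypeError when concatenated with a str.

```python
class ebytes(bytes):
    """Hypothetical bytes subclass that carries an optional .encoding."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            if self.encoding is None:
                # No encoding known: upgrade the bytes to a string.
                return self.decode("ascii") + other
            # Encoding known: coerce the string *down* to bytes of the
            # same encoding; UnicodeEncodeError if it cannot be encoded.
            other = other.encode(self.encoding)
        return ebytes(super().__add__(other), self.encoding)

    def __radd__(self, other):
        if isinstance(other, str):
            if self.encoding is None:
                return other + self.decode("ascii")
            other = other.encode(self.encoding)
        return ebytes(other + bytes(self), self.encoding)


tagged = ebytes(b"caf", "utf-8") + "\u00e9"   # ebytes, still utf-8
plain = ebytes(b"caf") + "\u00e9"             # str, because .encoding is None
```

With an encoding of 'ascii' the first expression would raise 
UnicodeEncodeError instead, which is the difference from None described above.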

Then, a few more changes to the bytes type would round out the implementation:

* Allow .decode() to not specify an encoding, unless .encoding is None

* Add back in the missing string methods (e.g. .encode()), since you 
can transparently upgrade to a string

* Smart __str__, as shown in your proposal.
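
A minimal sketch of those three changes, again on a hypothetical "ebytes" 
subclass (the name and constructor are assumptions).  The __str__ here is 
only a guess at what "smart" would mean, namely decoding with the carried 
encoding; it is not taken from the quoted proposal.

```python
class ebytes(bytes):
    """Hypothetical bytes subclass that carries an optional .encoding."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def decode(self, encoding=None, errors="strict"):
        # The encoding argument may be omitted only if one is carried.
        if encoding is None:
            if self.encoding is None:
                raise TypeError("decode() needs an encoding when .encoding is None")
            encoding = self.encoding
        return super().decode(encoding, errors)

    def encode(self, encoding, errors="strict"):
        # "Upgrade" to a string, then re-encode to the requested encoding.
        return ebytes(self.decode().encode(encoding, errors), encoding)

    def __str__(self):
        # Assumed behaviour: real text if the encoding is known,
        # the usual b'...' repr otherwise.
        if self.encoding is None:
            return repr(self)
        return self.decode()


e = ebytes(b"caf\xc3\xa9", "utf-8")
text = str(e)                # decoded using the carried encoding
latin = e.encode("latin-1")  # ebytes tagged as latin-1
```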

>Would it be feasible?  Dunno.

Probably, although it might mean adding back in special cases that 
were previously taken out, and a few new ones.

>   Would it help ease the bytes/str confusion?  Dunno.

Not sure what confusion you mean -- Web-SIG and I at least are not 
confused about the difference between bytes and str, or we wouldn't 
be having an issue.  ;-)  Or maybe you mean the stdlib's API 
confusion?  In which case, yes, definitely!

>   But I think it would help make APIs easier to design and use because
>it would cut down on the encoding-keyword function signature infection.

Not only that, but I believe it would also retroactively make the 
stdlib's implementation of those APIs "correct" again, and give us 
One Obvious Way to work with bytes of a known encoding, while 
constraining any unicode that gets combined with those bytes to be 
validly encodable.  It also gives you an idempotent constructor for 
bytes of a specified encoding, that can take either a bytes of 
unspecified encoding, a bytes of the correct encoding, or a string 
that can be encoded as such.
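
The idempotent constructor could look roughly like this.  It is a sketch 
under assumptions: the "ebytes" name is hypothetical, plain bytes are taken 
on trust rather than validated, and rejecting an ebytes of a *different* 
encoding with ValueError is my own choice, since that case isn't covered 
above.  codecs.lookup() is used only to normalize spellings such as "UTF8".

```python
import codecs


class ebytes(bytes):
    """Hypothetical bytes type that always knows its encoding."""

    def __new__(cls, value, encoding):
        name = codecs.lookup(encoding).name  # e.g. "UTF8" -> "utf-8"
        if isinstance(value, ebytes):
            if value.encoding == name:
                return value  # already the right thing: idempotent
            # Assumption: a mismatch is an error, not a silent transcode.
            raise ValueError(
                "bytes are %s, not %s" % (value.encoding, name))
        if isinstance(value, str):
            # A string must be encodable, else UnicodeEncodeError.
            value = value.encode(name)
        # Plain bytes of unspecified encoding are simply tagged.
        self = super().__new__(cls, value)
        self.encoding = name
        return self


a = ebytes("h\u00e9", "utf-8")   # from a string
b = ebytes(b"raw", "utf-8")      # from bytes of unspecified encoding
c = ebytes(a, "UTF8")            # from bytes of the correct encoding
```

Here "c is a" holds, so an API can call the constructor on whatever it is 
handed without caring which of the three cases it got.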

In short, +1.  (I wish it were possible to go back and make bytes 
non-strings and have only this ebytes or bstr or whatever type have 
string methods, but I'm pretty sure that ship has already sailed.)