[Python-Dev] email package status in 3.X

Mon Jun 21 18:34:04 CEST 2010

On Mon, Jun 21, 2010 at 11:43:07AM -0400, Barry Warsaw wrote:
> On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
> 
> >Something that may make sense to ease the porting process is for some
> >of these "on the boundary" I/O related string manipulation functions
> >(such as os.path.join) to grow "encoding" keyword-only arguments. The
> >recommended approach would be to provide all strings, but bytes could
> >also be accepted if an encoding was specified. (If you want to mix
> >encodings - tough, do the decoding yourself).
> 
> This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz
> for it.
> 
> Would it make sense to have "encoding-carrying" bytes and str types?
> Basically, I'm thinking of types (maybe even the current ones) that carry
> around a .encoding attribute so that they can be automatically encoded and
> decoded where necessary.  This at least would simplify APIs that need to do
> the conversion.
> 
> By default, the .encoding attribute would be some marker to indicated "I have
> no idea, do it explicitly" and if you combine ebytes or estrs that have
> incompatible encodings, you'd either throw an exception or reset the .encoding
> to IAmConfuzzled.  But say you had an email header like:
> 
> =?euc-jp?b?pc+l7aG8pe+hvKXrpcmhqg==?=
> 
> And code like the following (made less crappy):
> 
> -----snip snip-----
> class ebytes(bytes):
>     encoding = 'ascii'
> 
>     def __str__(self):
>         s = estr(self.decode(self.encoding))
>         s.encoding = self.encoding
>         return s
> 
> 
> class estr(str):
>     encoding = 'ascii'
> 
> 
> s = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa', 'euc-jp')
> b = bytes(s, 'euc-jp')
> 
> eb = ebytes(b)
> eb.encoding = 'euc-jp'
> es = str(eb)
> print(repr(eb), es, es.encoding)
> -----snip snip-----
> 
> Running this you get:
> 
> b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa' ハローワールド！ euc-jp
> 
> Would it be feasible?  Dunno.  Would it help ease the bytes/str confusion?
> Dunno.  But I think it would help make APIs easier to design and use because
> it would cut down on the encoding-keyword function signature infection.
> 
I like the idea of having encoding information carried with the data.
I don't think that an ebytes type that can *optionally* have an encoding
attribute makes the situation less confusing, though.  To me the biggest
problem with python-2.x's unicode/bytes handling was not that it threw
exceptions but that it didn't always throw exceptions.  You might test this
in python2::
    t = u'cafe'
    function(t)

And say, ah my code works.  Then a user gives it this::
    t = u'café'
    function(t)

And get a unicode error because the function only works with unicode in the
ascii range.

ebytes seems to have the same pitfall where the code path exercised by your
tests could work with::
    eb = ebytes(b)
    eb.encoding = 'euc-jp'
    function(eb)

but the user exercises a code path that does this and fails::
    eb = ebytes(b)
    function(eb)

What do you think of making the encoding attribute a mandatory part of
creating an ebyte object?  (ex: ``eb = ebytes(b, 'euc-jp')``).

-Toshio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20100621/27ffdbf0/attachment.pgp>