[Email-SIG] [Python-3000] Questions about email bytes/str (python 3000)

Barry Warsaw barry at python.org
Tue Aug 14 17:39:29 CEST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 13, 2007, at 10:22 PM, Victor Stinner wrote:

> After many tests, I'm unable to convert the email module to Python 3000.
> I'm also unable to decide on the best type for some contents.

I made a lot of progress on the email package while I was traveling,  
though I haven't checked things in yet.  I probably will very soon,  
even if I haven't yet fixed the last few remaining problems.  I'm  
down to 7 failures, 9 errors of 247 tests.

> (1) Email parts should be stored as byte or character string?

Strings.  Email messages are conceptually strings so I think it makes  
sense to represent them internally as such.  The FeedParser should  
expect strings and the Generator should output strings.  One place  
where I think bytes should show up would be in decoded payloads, but  
in that case I really want to make an API change so that
.get_payload(decode=True) is deprecated in favor of a separate method.
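
As a minimal sketch of where bytes show up, the following uses the email
package as it ships in current Python 3 releases, which is an assumption
and not necessarily the state of the patch discussed here.  The sample
message is made up for illustration.

```python
import email

# A tiny message whose body is base64-encoded ASCII text.
raw = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8gd29ybGQ=\n"
)
msg = email.message_from_string(raw)

# Without decoding, the payload is the transfer-encoded text (a str).
encoded_payload = msg.get_payload()

# With decode=True, the transfer encoding is undone and bytes come back.
decoded_payload = msg.get_payload(decode=True)
```

The message itself stays a string; only the decoded payload is bytes.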

I'm proposing other API changes to make things work better, a few of  
which are in my current patch, but others I want to defer if they  
don't directly contribute to getting these tests to pass.

> Related methods: Generator class, Message.get_payload(),  
> Message.as_string().
>
> Let's take an example: multipart (MIME) email with latin-1 and base64
> (ascii) sections. Mix latin-1 and ascii => mix bytes. So the best type
> should be bytes.
>
> => bytes

Except that by the time they're parsed into an email message, they  
must be ascii, either encoded as base64 or quoted-printable.  We also  
have to know at that point the charset being used, so I think it  
makes sense to keep everything as strings.

> (2) Parsing file (raw string): use bytes or str in parsing?
>
> The parser uses methods related to str like splitlines(), lower(),
> strip(). But it should be easy to rewrite/avoid these methods. I think
> that low-level parsing should be done on bytes. At the end, or when we
> know the charset, we can convert to str.
>
> => bytes

Maybe, though I'm not totally convinced.  It's certainly easier to  
get the tests to pass if we stick with parsing strings.   
email.message_from_string() should continue to accept strings,  
otherwise obviously it would have to be renamed, but also because  
its primary use case is turning a triple-quoted string literal into
an email message.
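
That use case looks roughly like this; the addresses and text are
placeholders.

```python
import email

# Parse a message written inline as a triple-quoted string literal.
msg = email.message_from_string("""\
From: sender@example.com
To: recipient@example.com
Subject: Hello

Body text.
""")

subject = msg["Subject"]
body = msg.get_payload()
```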

I alluded to the one crufty part of this in a separate thread.  In  
order to accept universal newlines but preserve end-of-line  
characters, you currently have to open files in binary mode.  Then,  
because my parser works on strings you have to convert those bytes to  
strings, which I am successfully doing now, but which I suspect is  
ultimately error prone.  I would like to see a flag to preserve line  
endings on files opened in text + universal newlines mode, and then I  
think the hack for Parser.parse() would go away.  We'd define how  
files passed to this method must be opened.  Besides, I think it is  
much more common to be parsing strings into email messages anyway.
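
For comparison, the text I/O layer in current Python 3 releases accepts
newline="" to recognize all line endings while returning them
untranslated.  The sketch below assumes that behavior and uses an
in-memory buffer in place of a real file; it makes no claim about how
Parser.parse() ended up being implemented.

```python
import io

raw = b"Subject: test\r\n\r\nline one\r\nline two\n"

# Default text mode: universal newlines, every ending translated to "\n".
translated = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii").read()

# newline="": endings are still recognized, but passed through untouched.
preserved_stream = io.TextIOWrapper(
    io.BytesIO(raw), encoding="ascii", newline=""
)
preserved_lines = preserved_stream.readlines()
```

The same newline argument is accepted by the built-in open().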

> About base64, I agree with Bill Janssen:
>  - base64MIME.decode converts string to bytes
>  - base64MIME.encode converts bytes to string

I agree.

> But decode may accept bytes as input (as the base64 module does): use
> str(value, 'ascii', 'ignore') or str(value, 'ascii', 'strict').

Hmm, I'm not sure about this, but I think that .encode() may have to  
accept strings.
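
A rough sketch of the direction being discussed, built on the standard
base64 module.  The helper names encode_to_str and decode_to_bytes are
hypothetical and are not the base64MIME API.

```python
import base64


def encode_to_str(data: bytes) -> str:
    """Hypothetical helper: bytes in, ASCII str out."""
    return base64.b64encode(data).decode("ascii")


def decode_to_bytes(text) -> bytes:
    """Hypothetical helper: str (or ASCII bytes) in, bytes out."""
    if isinstance(text, bytes):
        # Follows the quoted suggestion of converting bytes input first.
        text = str(text, "ascii", "strict")
    return base64.b64decode(text)


encoded = encode_to_str(b"\x00\x01binary")
decoded = decode_to_bytes(encoded)
```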

> I wrote 4 different (non-working) patches. So if you want to work on the
> email module and Python 3000, please contact me first. When I get a
> better patch, I will submit it.

Like I said, I also have an extensive patch that gets me most of the  
way there.  I don't want to have dueling patches, so I think what
I'll do is put a branch in the sandbox and apply my changes there for  
now.  Then we will have real code to discuss.

A few other things from my notes and diff:

Do we need email.message_from_bytes() and Message.as_bytes()?  While  
I'm (currently <wink>) pretty well convinced that email messages  
should be strings, the use case for bytes includes reading them
directly from, or writing them directly to, sockets.  In that case,
though, because the RFCs generally require ascii with encodings and
charsets clearly described, I think a bytes-to-string wrapper may
suffice.
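
Such a wrapper could be as thin as the sketch below.  The function names
are hypothetical, and the approach assumes the wire data really is pure
ascii.  (Later Python 3 releases do provide email.message_from_bytes()
and Message.as_bytes().)

```python
import email


def message_from_ascii_bytes(data: bytes):
    """Hypothetical wrapper: decode ASCII wire data, then parse as str."""
    return email.message_from_string(data.decode("ascii"))


def message_as_ascii_bytes(msg) -> bytes:
    """Hypothetical wrapper: flatten to str, then encode for the wire."""
    return msg.as_string().encode("ascii")


wire = b"Subject: ping\r\n\r\npong\r\n"   # e.g. data read from a socket
msg = message_from_ascii_bytes(wire)
out = message_as_ascii_bytes(msg)
```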

Charset class: How do we do conversions from input charset to output  
charset?  This is required by e.g. Japanese to go from euc-jp to  
iso-2022-jp IIUC.  Currently I have to use a crufty string-to-bytes  
converter like so:

 >>> bytes(ord(c) for c in s)

rather than just bytes(s).  I'm sure there's a better way I haven't  
found yet.
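
One alternative, sketched here under the assumption that the string only
holds code points below 256, is to encode with latin-1, which maps each
character to the byte of the same value.  The charset conversion itself
can be written as a decode followed by an encode; the Japanese sample
text is illustrative.

```python
s = "caf\xe9"

# The generator-based converter and a latin-1 encode agree byte for byte.
via_ord = bytes(ord(c) for c in s)
via_latin1 = s.encode("latin-1")

# Converting from an input charset to an output charset goes via str.
euc_bytes = "日本語".encode("euc_jp")
jis_bytes = euc_bytes.decode("euc_jp").encode("iso2022_jp")
```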

Generator._write_headers() and the _is8bitstring() test aren't really  
appropriate or correct now that everything's unicode.  This
affected quite a few tests because long headers that previously were  
getting split were now not getting split.  I ended up ditching the  
_is8bitstring() test, but that led me into an API change for
Message.__str__() and Message.as_string(), which I've long wanted to  
do anyway.  First, Message.__str__() no longer includes the Unix-From
header, but more importantly, .as_string() takes the maxheaderlen as  
an argument and defaults to no header wrapping.  By changing various  
related tests to call .as_string(maxheaderlen=78), these split header  
tests can be made to pass again.  I think these changes make
str(some_message) saner and more explicit (because it does not split
headers) but these may be controversial in the email-sig.
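
The difference can be sketched as follows, assuming the behavior of
Message.as_string() in current Python 3 releases with the default
policy.  The subject text is filler chosen only to exceed 78 columns.

```python
from email.message import Message

msg = Message()
msg["Subject"] = " ".join(["word"] * 30)   # well over 78 characters
msg.set_payload("body\n")

# Default: headers are emitted as-is, with no wrapping.
unwrapped = msg.as_string()

# Explicit limit: long headers are folded at whitespace.
wrapped = msg.as_string(maxheaderlen=78)

longest_unwrapped = max(len(line) for line in unwrapped.splitlines())
longest_wrapped = max(len(line) for line in wrapped.splitlines())
```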

You asked earlier about decode_header().  This should definitely  
return a list of tuples of (bytes, charset|None).
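
For an encoded-word header, that shape looks like the sketch below.  It
assumes the decode_header() that ships in current Python 3 releases; the
header value is an invented example.

```python
from email.header import decode_header

# One RFC 2047 encoded word: UTF-8, quoted-printable ("q") encoding.
parts = decode_header("=?utf-8?q?caf=C3=A9?=")

# Each part is a (decoded_bytes, charset) pair.
raw, charset = parts[0]
text = raw.decode(charset)
```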

Header is going to need some significant revision.  First, there's the
whole mess of .encode() vs. __str__() vs. __unicode__() to sort out.   
It's insane that the latter two had different semantics w.r.t.  
whitespace preservation between encoded words, so let's fix that.   
Also, if the common use case is to do something like this:

 >>> msg['subject'] = 'a subject string'

then I wonder if we shouldn't be doing more sanity checking on the  
header value.  For example, if the value had a non-ascii character in  
it, then what should we do?  One way would be to throw an exception,  
requiring the use of something like:

 >>> msg['subject'] = Header('a \xfc subject', 'utf-8')

or we could do the most obvious thing and try to convert to 'ascii'  
then 'utf-8' if no charset is given explicitly.  I thought about  
always turning headers into Header instances, but I think that might  
break some common use cases.  It might be possible to define equality  
and other operations on Header instances so that these common cases  
continue to work.  The email-sig can address that later.
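
The "ascii, then utf-8" fallback could be sketched as below.  The helper
name header_value is hypothetical and this is only one possible reading
of the idea, not a decided design.

```python
from email.header import Header, decode_header
from email.message import Message


def header_value(value: str, charset=None):
    """Hypothetical helper: plain str if ASCII, else a utf-8 Header."""
    if charset is not None:
        return Header(value, charset)
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return Header(value, "utf-8")
    return value


plain = header_value("a subject string")
fancy = header_value("a \xfc subject")

msg = Message()
msg["Subject"] = fancy

# The Header flattens to RFC 2047 encoded words, which are pure ASCII.
encoded = fancy.encode()
raw, charset = decode_header(encoded)[0]
```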

However, if all Header instances are unicode and have a valid  
charset, I wonder if the splittable tests are still relevant, and  
whether we can simplify header splitting.  I have to think about this  
some more.

As for the remaining failures and errors, they come down to  
simplifying the splittable logic, dealing with Message.__str__() vs.  
Message.__unicode__(), verifying that the UnicodeErrors some tests
expect to be raised don't make sense any more, and fixing a couple of
other small issues I haven't gotten to yet.

I will create a sandbox branch and apply my changes later today so we  
have something concrete to look at.

Cheers,
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRsHMsXEjvBPtnXfVAQLfCwP8CeHi9RBW5ULri3w6sBz5a1fkdVCftk71
uW8q0LercTJSa2ewvtrlWdKm9F403IabYjh2Bg8cZfHmYyZ+/b18oU64zzkZylo/
pHw9Iyvk9ZW6G7mwJRwpV9c6JXJNvsQtKRWipuue0ZMagI5OJBXR8vhRIDGkt+NC
ARhIrHXPEW8=
=DBLp
-----END PGP SIGNATURE-----

