[Email-SIG] [Python-3000] Questions about email bytes/str (python 3000)
Barry Warsaw
barry at python.org
Tue Aug 14 17:39:29 CEST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Aug 13, 2007, at 10:22 PM, Victor Stinner wrote:
> After many tests, I'm unable to convert email module to Python 3000.
> I'm also unable to take decision of the best type for some contents.
I made a lot of progress on the email package while I was traveling,
though I haven't checked things in yet. I probably will very soon,
even if I haven't yet fixed the last few remaining problems. I'm
down to 7 failures, 9 errors of 247 tests.
> (1) Email parts should be stored as byte or character string?
Strings. Email messages are conceptually strings so I think it makes
sense to represent them internally as such. The FeedParser should
expect strings and the Generator should output strings. One place
where I think bytes should show up would be in decoded payloads, but
in that case I really want to make an API change so that
.get_payload(decode=True) is deprecated in favor of a separate method.
I'm proposing other API changes to make things work better, a few of
which are in my current patch, but others I want to defer if they
don't directly contribute to getting these tests to pass.
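To make the str-versus-bytes split concrete, here is a minimal sketch of a message parsed from a string whose decoded payload comes back as bytes. It reflects how the email package behaves in current Python 3 releases, which is an assumption relative to the patch discussed here; the address and body text are made up for illustration.

```python
import base64
import email

# Build a small message whose body is Latin-1 text, base64-encoded for transport.
body_bytes = "caf\xe9".encode("iso-8859-1")
encoded_body = base64.b64encode(body_bytes).decode("ascii")
raw_message = (
    "From: sender@example.com\n"
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded_body + "\n"
)

msg = email.message_from_string(raw_message)
transfer_payload = msg.get_payload()             # str, still base64 text
decoded_payload = msg.get_payload(decode=True)   # bytes, transfer encoding undone
text = decoded_payload.decode(msg.get_content_charset())
```

The message itself stays a string throughout; bytes only appear once the transfer encoding is undone, and the declared charset is what turns them back into text.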
> Related methods: Generator class, Message.get_payload(),
> Message.as_string().
>
> Let's take an example: multipart (MIME) email with latin-1 and base64
> (ascii) sections. Mix latin-1 and ascii => mix bytes. So the best type
> should be bytes.
>
> => bytes
Except that by the time they're parsed into an email message, they
must be ascii, either encoded as base64 or quoted-printable. We also
have to know at that point the charset being used, so I think it
makes sense to keep everything as strings.
> (2) Parsing file (raw string): use bytes or str in parsing?
>
> The parser use methods related to str like splitlines(), lower(),
> strip(). But it should be easy to rewrite/avoid these methods. I think
> that low-level parsing should be done on bytes. At the end, or when we
> know the charset, we can convert to str.
>
> => bytes
Maybe, though I'm not totally convinced. It's certainly easier to
get the tests to pass if we stick with parsing strings.
email.message_from_string() should continue to accept strings,
otherwise obviously it would have to be renamed, but also because
its primary use case is turning a triple quoted string literal into
an email message.
I alluded to the one crufty part of this in a separate thread. In
order to accept universal newlines but preserve end-of-line
characters, you currently have to open files in binary mode. Then,
because my parser works on strings you have to convert those bytes to
strings, which I am successfully doing now, but which I suspect is
ultimately error prone. I would like to see a flag to preserve line
endings on files opened in text + universal newlines mode, and then I
think the hack for Parser.parse() would go away. We'd define how
files passed to this method must be opened. Besides, I think it is
much more common to be parsing strings into email messages anyway.
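For reference, the text I/O layer in later Python 3 releases does offer this kind of flag: passing newline="" to open() or io.TextIOWrapper recognizes all line-ending styles but returns them untranslated. The sketch below uses an in-memory stream rather than a real file, and the sample data is invented.

```python
import io

data = b"Subject: hi\r\n\r\nbody\n"

# newline="" recognizes \n, \r and \r\n as line ends but returns them untranslated.
preserved = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline="")
preserved_lines = preserved.readlines()

# The default (newline=None) translates every line ending to "\n".
translated = io.TextIOWrapper(io.BytesIO(data), encoding="ascii")
translated_lines = translated.readlines()
```

With newline="" the parser sees strings and still gets the original end-of-line characters, so no binary-mode workaround is needed.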
> About base64, I agree with Bill Janssen:
> - base64MIME.decode converts string to bytes
> - base64MIME.encode converts bytes to string
I agree.
> But decode may accept bytes as input (as base64 modules does): use
> str(value, 'ascii', 'ignore') or str(value, 'ascii', 'strict').
Hmm, I'm not sure about this, but I think that .encode() may have to
accept strings.
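A rough sketch of the direction being agreed on, using the standard base64 module rather than base64MIME itself. The helper names b64_encode_to_str and b64_decode_to_bytes are hypothetical and exist only for this illustration.

```python
import base64

def b64_encode_to_str(data: bytes) -> str:
    """Encode raw bytes; the result is ASCII-only, so return it as str."""
    return base64.b64encode(data).decode("ascii")

def b64_decode_to_bytes(text) -> bytes:
    """Decode base64 text to bytes; also accept ASCII bytes as input."""
    if isinstance(text, bytes):
        # Strict ASCII conversion, as suggested in the quoted message.
        text = str(text, "ascii", "strict")
    return base64.b64decode(text)

encoded = b64_encode_to_str(b"\x00\xffpayload")
round_trip = b64_decode_to_bytes(encoded)
round_trip_from_bytes = b64_decode_to_bytes(encoded.encode("ascii"))
```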
> I wrote 4 differents (non-working) patches. So I you want to work on
> email module and Python 3000, please first contact me. When I will get
> a better patch, I will submit it.
Like I said, I also have an extensive patch that gets me most of the
way there. I don't want to have dueling patches, so I think what
I'll do is put a branch in the sandbox and apply my changes there for
now. Then we will have real code to discuss.
A few other things from my notes and diff:
Do we need email.message_from_bytes() and Message.as_bytes()? While
I'm (currently <wink>) pretty well convinced that email messages
should be strings, the use case for bytes includes reading them
directly to or from sockets, though in this case because the RFCs
generally require ascii with encodings and charsets clearly
described, I think a bytes-to-string wrapper may suffice.
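A minimal sketch of what such a bytes-to-string wrapper might look like, assuming the wire data really is pure ASCII as the RFCs require. The function name message_from_ascii_bytes is hypothetical, and the sample data is invented.

```python
import email

def message_from_ascii_bytes(data: bytes):
    """Hypothetical wrapper: decode wire bytes as ASCII, then parse as a string."""
    # Strict decoding raises UnicodeDecodeError on any non-ASCII byte.
    return email.message_from_string(data.decode("ascii"))

wire_data = b"Subject: hello\r\n\r\nbody text\r\n"
msg = message_from_ascii_bytes(wire_data)
subject = msg["Subject"]
```

Strict decoding is a deliberate choice in this sketch: raw 8-bit data would fail loudly instead of being silently mangled.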
Charset class: How do we do conversions from input charset to output
charset? This is required by e.g. Japanese to go from euc-jp to
iso-2022-jp IIUC. Currently I have to use a crufty string-to-bytes
converter like so:
>>> bytes(ord(c) for c in s)
rather than just bytes(s). I'm sure there's a better way I haven't
found yet.
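For what it's worth, the bytes(ord(c) for c in s) idiom gives the same result as encoding with latin-1 whenever every code point is below 256, and a charset conversion can be sketched as a decode followed by an encode. The convert_charset helper is hypothetical, and this is not necessarily how the Charset class should do it.

```python
s = "caf\xe9"
# Equivalent only while every character has a code point below 256.
same_bytes = bytes(ord(c) for c in s) == s.encode("latin-1")

def convert_charset(data: bytes, input_charset: str, output_charset: str) -> bytes:
    """Hypothetical helper: re-encode bytes from one charset to another."""
    return data.decode(input_charset).encode(output_charset)

japanese = "\u65e5\u672c\u8a9e"            # three kanji, written as escapes
euc_bytes = japanese.encode("euc_jp")
jis_bytes = convert_charset(euc_bytes, "euc_jp", "iso2022_jp")
```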
Generator._write_headers() and the _is8bitstring() test aren't really
appropriate or correct now that everything's a unicode string. This
affected quite a few tests because long headers that previously were
getting split were now not getting split. I ended up ditching the
_is8bitstring() test, but that led me into an API change for
Message.__str__() and Message.as_string(), which I've long wanted to
do anyway. First Message.__str__() no longer includes the Unix-From
header, but more importantly, .as_string() takes the maxheaderlen as
an argument and defaults to no header wrapping. By changing various
related tests to call .as_string(maxheaderlen=78), these split header
tests can be made to pass again. I think these changes make
str(some_message) saner and more explicit (because it does not split
headers) but these may be controversial in the email-sig.
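A small sketch of the proposed behavior, written against the Message API as it exists in current Python 3 releases (where as_string() accepts maxheaderlen and defaults to 0). That the sandbox patch behaves identically is an assumption, and the subject text is invented.

```python
from email.message import Message

msg = Message()
msg["Subject"] = " ".join(["word"] * 40)   # roughly 200 characters
msg.set_payload("body\n")

unwrapped = msg.as_string()                # default: headers are not wrapped
wrapped = msg.as_string(maxheaderlen=78)   # explicitly ask for wrapping at 78
longest_wrapped_line = max(len(line) for line in wrapped.splitlines())
as_str = str(msg)                          # no Unix-From line at the top
```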
You asked earlier about decode_header(). This should definitely
return a list of tuples of (bytes, charset|None).
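As an illustration, this is what decode_header() returns for an encoded word in current Python 3 releases. The join at the end is just one assumed way a caller might turn the pieces back into text.

```python
from email.header import decode_header

parts = decode_header("=?iso-8859-1?q?p=F6stal?=")

# Each item is (value, charset); decode bytes values with their charset.
text = "".join(
    value.decode(charset or "ascii") if isinstance(value, bytes) else value
    for value, charset in parts
)
```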
Header is going to need some significant revision. First, there's the
whole mess of .encode() vs. __str__() vs. __unicode__() to sort out.
It's insane that the latter two had different semantics w.r.t.
whitespace preservation between encoded words, so let's fix that.
Also, if the common use case is to do something like this:
>>> msg['subject'] = 'a subject string'
then I wonder if we shouldn't be doing more sanity checking on the
header value. For example, if the value had a non-ascii character in
it, then what should we do? One way would be to throw an exception,
requiring the use of something like:
>>> msg['subject'] = Header('a \xfc subject', 'utf-8')
or we could do the most obvious thing and try to convert to 'ascii'
then 'utf-8' if no charset is given explicitly. I thought about
always turning headers into Header instances, but I think that might
break some common use cases. It might be possible to define equality
and other operations on Header instances so that these common cases
continue to work. The email-sig can address that later.
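The "try ascii, then utf-8" idea could look roughly like this. The header_value helper is hypothetical, it is applied by the caller rather than inside Message.__setitem__, and the X-Note header name is made up.

```python
from email.header import Header
from email.message import Message

def header_value(value: str):
    """Hypothetical helper: plain str if ASCII, otherwise a utf-8 Header."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return Header(value, "utf-8")
    return value

msg = Message()
msg["Subject"] = header_value("a subject string")   # stays a plain string
msg["X-Note"] = header_value("a \xfc subject")      # becomes a Header instance
encoded_note = msg["X-Note"].encode()               # RFC 2047 encoded word
```

Plain ASCII values stay ordinary strings, so the common msg['subject'] == 'some string' comparisons keep working in this sketch.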
However, if all Header instances are unicode and have a valid
charset, I wonder if the splittable tests are still relevant, and
whether we can simplify header splitting. I have to think about this
some more.
As for the remaining failures and errors, they come down to
simplifying the splittable logic, dealing with Message.__str__() vs.
Message.__unicode__(), verifying that the UnicodeErrors some tests
expect to get raised don't make sense any more, and fixing a couple of
other small issues I haven't gotten to yet.
I will create a sandbox branch and apply my changes later today so we
have something concrete to look at.
Cheers,
- -Barry
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
iQCVAwUBRsHMsXEjvBPtnXfVAQLfCwP8CeHi9RBW5ULri3w6sBz5a1fkdVCftk71
uW8q0LercTJSa2ewvtrlWdKm9F403IabYjh2Bg8cZfHmYyZ+/b18oU64zzkZylo/
pHw9Iyvk9ZW6G7mwJRwpV9c6JXJNvsQtKRWipuue0ZMagI5OJBXR8vhRIDGkt+NC
ARhIrHXPEW8=
=DBLp
-----END PGP SIGNATURE-----