[Email-SIG] Problem Report for email.Utils.decode_rfc2231

Wed Jul 19 06:16:03 CEST 2006

On Jul 17, 2006, at 8:35 PM, Mark Sapiro wrote:

> I just looked at the fix in SVN, and I think there is still a problem.
> I don't think the RFC 2231 encodings that produce the error are
> 'buggy'. There are two independent things going on in RFC 2231 - the
> charset and language encoding and the splitting of the parameter into
> multiple pieces, e.g. filename*0=, filename*1=, etc.
>
> The problem with email.utils.decode_params() is it doesn't distinguish
> between these cases. The charset/language information is only present
> if there is a * immediately preceding the = as in
>
> filename*=charset'language'value
>
> or
>
> filename*0*=charset'language'value
> ...
>
> in these cases, a compliant value must not contain '
>
> However, if the parameter is
>
> filename*0=value_part_0
> filename*1=value_part_1
> ...
>
> these value_parts may contain any number of ' characters and they don't
> delimit charset and language information.
>
> See my suggested patch attached to
> <http://mail.python.org/pipermail/email-sig/2006-July/000293.html>.

Mark, I think you're right in your diagnosis.  I've gone back and re-read
RFC 2231 and I agree that we need to distinguish between the two
segment types, which I'll call encoded (name ends in *) and non-encoded
(no * at end of name).
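
As a rough illustration of that distinction, the sketch below classifies a
parameter name by its optional section index and trailing *. The regular
expression and the classify_segment() helper are hypothetical names made up
for this example; they are not part of the email package.

```python
import re

# Hypothetical pattern: base name, optional "*N" section index,
# optional trailing "*" marking an encoded segment.
_SEGMENT_NAME = re.compile(
    r"^(?P<base>[^*]+)(?:\*(?P<index>\d+))?(?P<encoded>\*)?$"
)


def classify_segment(name):
    """Return (base, index, encoded) for a name such as 'filename*0*'."""
    match = _SEGMENT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid parameter name: {name!r}")
    raw_index = match.group("index")
    index = int(raw_index) if raw_index is not None else None
    return match.group("base"), index, match.group("encoded") is not None
```

With this helper, 'filename*0*' and 'filename*' count as encoded, while
'filename*1' and plain 'filename' do not.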

The way I read the RFC however, I don't think the patch is quite  
right.  Specifically, you can mix encoded and non-encoded segments in  
an extended parameter, like so:

filename*0*="This is%20encoded"
filename*1="This is%20not encoded"

I believe this should end up with a 'filename' parameter with a value:

This is encodedThis is%20not encoded
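
A minimal sketch of that mixing rule, assuming the segments have already been
parsed into (index, encoded, value) tuples: only segments whose name ends in *
are %-decoded before concatenation. The join_segments() helper is hypothetical
and is not the code in the attached patch.

```python
from urllib.parse import unquote


def join_segments(segments):
    """Concatenate segments in index order, %-decoding only encoded ones.

    segments is an iterable of (index, encoded, value) tuples.
    """
    parts = []
    for _index, encoded, value in sorted(segments, key=lambda seg: seg[0]):
        # latin-1 maps each decoded byte to one character, so no charset
        # assumption is made at this stage.
        parts.append(unquote(value, encoding="latin-1") if encoded else value)
    return "".join(parts)


segments = [
    (1, False, "This is%20not encoded"),  # filename*1
    (0, True, "This is%20encoded"),       # filename*0*
]
result = join_segments(segments)
```

Here result is the string shown above: the %20 in the encoded segment becomes
a space, and the %20 in the non-encoded segment is left alone.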

Further, if any segment name ends in a *, then the charset and language
information must appear at the front of the value, but it is extracted
only after the encoded segments are %-decoded and all the segments are
concatenated together. (The RFC appears to be a bit ambiguous here,
but this is the only interpretation that makes sense to me.)
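
One way to sketch this ordering, under the interpretation described above:
%-decode the encoded segments, concatenate everything, and only then split
off charset and language at the first two ' characters. The decode_extended()
helper and its fallback behaviour are assumptions made for illustration, not
the patched decode_rfc2231().

```python
from urllib.parse import unquote


def decode_extended(segments):
    """segments is an ordered list of (encoded, value) pairs.

    Returns a plain string when no segment is encoded, otherwise a
    (charset, language, text) tuple.
    """
    joined = "".join(
        unquote(value, encoding="latin-1") if encoded else value
        for encoded, value in segments
    )
    if not any(encoded for encoded, _value in segments):
        # All segments non-encoded: any ' characters are ordinary data.
        return joined
    pieces = joined.split("'", 2)
    if len(pieces) != 3:
        # Assumed failsafe: no charset/language prefix found.
        return None, None, joined
    charset, language, text = pieces
    return charset, language, text


# The mixed example from RFC 2231, section 4.1.
mixed = decode_extended([
    (True, "us-ascii'en'This%20is%20even%20more%20"),
    (True, "%2A%2A%2Afun%2A%2A%2A%20"),
    (False, "isn't it!"),
])
plain = decode_extended([(False, "Frank's"), (False, " Document")])
```

Because the split is limited to the first two ' characters, the apostrophe in
"isn't" stays part of the value.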

Both of these changes caused many failures in the test suite, but I  
believe that's because many of the tests were incorrect.  Some broke  
because they were using all non-encoded segments yet were expecting  
Message.get_param() to return a 3-tuple.  That interface, while  
yucky, seems clear that when all non-encoded segments are used, the  
return value should be a simple string.
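
The two return shapes can be seen with the email package as it ships in
current Python releases. This sketch assumes the default (compat32) policy
used by email.message_from_string(); the header values are made-up examples.

```python
import email

# All segments non-encoded: the ' is ordinary data.
unencoded = email.message_from_string(
    'Content-Type: application/x-foo; name*0="Frank\'s"; name*1=" Document"\n\n'
)
plain_value = unencoded.get_param("name")

# Encoded segments: charset and language lead the first segment.
encoded = email.message_from_string(
    "Content-Type: application/x-foo;\n"
    "\tname*0*=\"us-ascii'en-us'Frank's\";\n"
    "\tname*1*=\" Document\"\n\n"
)
tuple_value = encoded.get_param("name")
```

With non-encoded segments only, plain_value should be a simple string; with
encoded segments, tuple_value should be a (charset, language, value) 3-tuple.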

The other breakage was that non-encoded segments should not be %-decoded,
but there were many cases where they were still being decoded.

I believe the attached patch fixes all these cases, and yet retains  
the failsafe checks in decode_rfc2231() -- be liberal in what you  
accept, blah, blah, blah.  The patch also updates all the affected  
tests.  This patch is against the Python trunk.  Please let me know  
what you think!  If it looks good, I'll commit it and back port the  
whole schmere to the earlier email package versions.

-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: email.diff
Type: application/octet-stream
Size: 10097 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20060719/42621e24/attachment.obj 