[issue19806] smtpd crashes when a multi-byte UTF-8 sequence is split between consecutive data packets

Mon Jul 14 15:09:49 CEST 2014

R. David Murray added the comment:

As Milan said, the problem doesn't arise in 3.5 with decode_data=False, since there's no decoding.  His patch doesn't actually fix the bug for the decode_data=True case, though, since the bug is triggered by a *valid* UTF-8 sequence getting split across TCP buffers.
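
To make the failure concrete, here is a minimal sketch of what happens when each buffer is decoded on its own. The sample string and the split point are just assumptions chosen for illustration; this is not the smtpd code itself.

```python
# Illustrative only: the sample text and split point are made up.
data = "café".encode("utf-8")      # b'caf\xc3\xa9', a valid UTF-8 sequence
chunks = [data[:4], data[4:]]      # split between the two bytes of 'é'

def decode_each(chunks):
    """Decode every chunk independently, as per-buffer decoding would."""
    return [chunk.decode("utf-8") for chunk in chunks]

try:
    decode_each(chunks)
    per_chunk_ok = True
except UnicodeDecodeError:
    # The first chunk ends in the middle of a multi-byte character.
    per_chunk_ok = False

# Decoding the reassembled bytes works, because the data was valid all along.
whole = b"".join(chunks).decode("utf-8")
```

Here per-chunk decoding fails even though the joined bytes decode cleanly, which is why an error message blaming invalid data would be misleading.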

To fix it, we would need to change the implementation of decode_data.  Instead of conditionally decoding in collect_incoming_data, we'd need to postpone decoding to found_terminator.  This would have the undesirable effect of changing what is in the received_lines attribute, which is why we didn't do it in the decode_data patch.  Using an incremental decoder won't solve that problem, since it too would change what gets stored in received_lines.
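
A rough sketch of the two alternatives, to show why both change what ends up being stored. DeferredCollector is a hypothetical stand-in written for this illustration, not the actual SMTPChannel code, and the chunk contents are made up.

```python
import codecs

class DeferredCollector:
    """Hypothetical stand-in for a channel; NOT the smtpd implementation."""

    def __init__(self):
        self.received_lines = []          # holds raw bytes until the terminator

    def collect_incoming_data(self, data):
        # No decoding here, so a split multi-byte sequence is harmless.
        self.received_lines.append(data)

    def found_terminator(self):
        # Decode only once the complete line has been reassembled.
        line = b"".join(self.received_lines).decode("utf-8")
        self.received_lines = []
        return line

chunks = [b"caf\xc3", b"\xa9"]            # 'é' split across two buffers

collector = DeferredCollector()
for chunk in chunks:
    collector.collect_incoming_data(chunk)
stored = list(collector.received_lines)   # bytes objects, not str
line = collector.found_terminator()

# An incremental decoder holds back the incomplete trailing byte, so the
# pieces it yields do not correspond one-to-one with the incoming buffers.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
```

With deferred decoding, received_lines would hold bytes where existing code sees str; with the incremental decoder it would hold str pieces whose boundaries differ from the buffers that arrived. Either way the contents of the attribute change.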

Since decode_data=True is really not a legitimate mode for smtpd (it is an historical accident/bug) and we are planning on removing it eventually, I think we should go ahead and apply Milan's patch as is, since it does improve the error reporting.  The message would need to be adjusted though, since it can trigger on valid utf-8 data.  It should say that smtpd should be run with decode_data=False in order to fix the decode problem.

That would leave the bug as-is in 3.4, but a similar patch with an error message suggesting an upgrade to 3.5/decode_data=False could be applied.  That feels a little weird, though :).

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue19806>
_______________________________________