[Python-Dev] Multilingual programming article on the Red Hat Developer blog
R. David Murray
rdmurray at bitdance.com
Tue Sep 16 19:46:50 CEST 2014
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico <rosuav at gmail.com> wrote:
> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> > That isn't the case in the email package. The smuggled bytes are not
> > errors[*], they are literally smuggled bytes.
>
> But they're not characters, which is what Stephen and I were saying -
> and contrary to what Jim said about treating them as characters. At
> best, they represent characters but in some encoding other than the
> one you're using, and you have no idea how many bytes form a character
> or anything. So you can't, for instance, word-wrap the text, because
> you can't know how wide these unknown bytes are, whether they
> represent spaces (wrap points), or newlines, or anything like that.
> You can't treat them as characters, so while you have them in your
> string, you can't treat it as a pure Unicode string - it's a Unicode
> string with smuggled bytes.
Well, except that I do. The email header parsing algorithms all work
fine if I treat the surrogate escaped bytes as 'unknown junk' and just
parse based on the valid Unicode. (Unless the header is so garbled that
it can't be parsed, of course, at which point it becomes an invalid
header).
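
To make the 'unknown junk' idea concrete, here is a minimal sketch using
only the standard surrogateescape error handler. It is not the email
package's actual parser; the sample header bytes and the variable names
are assumptions made up for illustration. It shows that a non-ASCII byte
survives decoding as a lone surrogate while the structure around it can
still be parsed.

```python
# Hypothetical raw header containing one undecodable byte (0xE9).
raw = b"Subject: caf\xe9 menu"

# surrogateescape maps each undecodable byte 0x80-0xFF to U+DC80-U+DCFF.
text = raw.decode("ascii", errors="surrogateescape")

# Structural parsing only needs the valid part of the string.
name, _, value = text.partition(": ")

# The smuggled bytes can be located, but they are not real characters.
smuggled = [ch for ch in value if "\udc80" <= ch <= "\udcff"]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

Here name is "Subject", smuggled holds the single surrogate U+DCE9, and
round_tripped equals raw.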
You are right about the wrapping, though. If a header with invalid
bytes (and in this scenario we *are* talking about errors) needs to
be wrapped, we have to first decode the smuggled bytes and turn it
into an 'unknown-8bit' encoded word before we can wrap the header.
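
A rough sketch of that step, assuming the value holds only ASCII plus
surrogate-escaped bytes: recover the original bytes, then wrap them in an
RFC 2047 encoded word labelled unknown-8bit. The helper name
to_unknown_8bit_word is hypothetical and hand-rolled for illustration; it
is not how the email package implements this internally.

```python
import base64

def to_unknown_8bit_word(value: str) -> str:
    """Turn a surrogate-escaped string into one base64 encoded word.

    Assumes value contains only ASCII and surrogate-escaped bytes;
    genuine non-ASCII characters would raise UnicodeEncodeError.
    """
    # Recover the exact bytes that were smuggled through decoding.
    raw = value.encode("ascii", errors="surrogateescape")
    payload = base64.b64encode(raw).decode("ascii")
    # RFC 2047 layout: =?charset?encoding?payload?=
    return "=?unknown-8bit?b?" + payload + "?="

word = to_unknown_8bit_word("caf\udce9")
```

The result is plain ASCII with a known length, so it can be folded like
any other encoded word.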
--David