[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Tue Sep 16 20:02:11 CEST 2014

On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray <rdmurray at bitdance.com> wrote:
>> You can't treat them as characters, so while you have them in your
>> string, you can't treat it as a pure Unicode string - it''s a Unicode
>> string with smuggled bytes.
>
> Well, except that I do.  The email header parsing algorithms all work
> fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> parse based on the valid unicode.  (Unless the header is so garbled that
> it can't be parsed, of course, at which point it becomes an invalid
> header).

Do what, exactly? As I understand you, you treat the unknown bytes as
completely opaque, not representing any characters at all. Which is
what I'm saying: those are not characters.

If you, instead, represented the header as a list with some str
elements and some bytes, it would be just as valid (though much harder
to work with); all your manipulations are done on the str parts, and
the bytes just tag along for the ride.

> You are right about the wrapping, though.  If a header with invalid
> bytes (and in this scenario we *are* talking about errors) needs to
> be wrapped, we have to first decode the smuggled bytes and turn it
> into an 'unknown-8bit' encoded word before we can wrap the header.

Yeah, and that's going to be a bit messy. If you get 60 characters
followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
the middle of the smuggled section?

ChrisA