[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Chris Angelico rosuav at gmail.com
Tue Sep 16 20:02:11 CEST 2014


On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray <rdmurray at bitdance.com> wrote:
>> You can't treat them as characters, so while you have them in your
>> string, you can't treat it as a pure Unicode string - it''s a Unicode
>> string with smuggled bytes.
>
> Well, except that I do.  The email header parsing algorithms all work
> fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> parse based on the valid unicode.  (Unless the header is so garbled that
> it can't be parsed, of course, at which point it becomes an invalid
> header).

Do what, exactly? As I understand you, you treat the unknown bytes as
completely opaque, not representing any characters at all. Which is
what I'm saying: those are not characters.

If you, instead, represented the header as a list with some str
elements and some bytes, it would be just as valid (though much harder
to work with); all your manipulations are done on the str parts, and
the bytes just tag along for the ride.

> You are right about the wrapping, though.  If a header with invalid
> bytes (and in this scenario we *are* talking about errors) needs to
> be wrapped, we have to first decode the smuggled bytes and turn it
> into an 'unknown-8bit' encoded word before we can wrap the header.

Yeah, and that's going to be a bit messy. If you get 60 characters
followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
the middle of the smuggled section?

ChrisA


More information about the Python-Dev mailing list