[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Tue Sep 16 21:29:30 CEST 2014

On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico <rosuav at gmail.com> wrote:
> On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> >> You can't treat them as characters, so while you have them in your
> >> string, you can't treat it as a pure Unicode string - it's a Unicode
> >> string with smuggled bytes.
> >
> > Well, except that I do.  The email header parsing algorithms all work
> > fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> > parse based on the valid unicode.  (Unless the header is so garbled that
> > it can't be parsed, of course, at which point it becomes an invalid
> > header).
> 
> Do what, exactly? As I understand you, you treat the unknown bytes as
> completely opaque, not representing any characters at all. Which is
> what I'm saying: those are not characters.

Yes.  I thought you were saying that one could not treat the string with
smuggled bytes as if it were a string.  (It's a string that can't be
encoded unless you use the surrogateescape error handler, but it is
still a string from Python's POV, which is the point of the error
handler).
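
A minimal sketch of that point, using a made-up header line (the sample
bytes and variable names are illustrative, not taken from the email
package):

```python
# Hypothetical input: an ASCII header line with one stray Latin-1 byte.
raw = b"Subject: caf\xe9 menu"

# Decode as ASCII; the undecodable byte 0xE9 is smuggled in as the lone
# surrogate U+DCE9 instead of raising UnicodeDecodeError.
text = raw.decode("ascii", errors="surrogateescape")

# The result is an ordinary str, so normal string operations apply.
name, _, value = text.partition(": ")

# A strict encode refuses the lone surrogate...
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# ...but the same codec and error handler restore the original bytes.
restored = text.encode("ascii", errors="surrogateescape")
```

Here `strict_encode_ok` ends up False and `restored` equals `raw`.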

Or, to put it another way, your implication was that there were no
string operations that could be usefully applied to a string containing
smuggled bytes, but that is not the case.  (I may well have read an
implication that was not there; if so I apologize and you can ignore the
rest of this :)  Basically, we are pretending that each smuggled
byte is a single character for string parsing purposes...but they don't
match any of our parsing constants.  They are all "any character" matches
in the regexes and what have you.  Of course, this only works in
contexts where we can ignore or "carry along" the smuggled bytes as
being components of "arbitrary text" portions of the syntax, and we must
take care to either replace them with valid unicode error glyphs or turn
the string of which they are a part into binary using the same codec and
error handler as we used to ingest them to begin with before emitting
them.  And, of course, we can't *modify* the sections containing the
smuggled bytes, only the syntax-matched sections that surround them; and
things like line wrapping are just an invitation to ugliness and bugs
even if you kept the smuggled bytes sections internally intact.
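
As a rough illustration of the "any character" idea (this is a toy
regex, not the parser the email package actually uses, and the sample
header is invented):

```python
import re

# Hypothetical header whose display name contains one undecodable byte.
raw = b"From: Andr\xe9 <andre@example.com>"
text = raw.decode("ascii", errors="surrogateescape")

# The syntax constants ("From: ", "<", ">") are plain ASCII. The display
# name is "arbitrary text", so the smuggled byte is just matched by ".".
m = re.match(r"From: (?P<name>.*?) <(?P<addr>[^>]+)>", text)
name = m.group("name")
addr = m.group("addr")

# Before emitting, either replace the smuggled bytes with the Unicode
# replacement character (U+FFFD) for display...
display = name.encode("ascii", errors="surrogateescape").decode(
    "ascii", errors="replace"
)

# ...or turn the string back into bytes with the same codec and handler
# that were used to ingest it.
name_bytes = name.encode("ascii", errors="surrogateescape")
```

The address is extracted without ever inspecting the smuggled byte, and
`name_bytes` carries the original byte through unchanged.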

Finally, to explain what I meant by "except that I do": when I added
back binary support to the email package in Python 3, initially I *did
not change the parsing algorithms* in the code.  I just smuggled the
bytes, and then dealt with the encoding/decoding at the API boundaries.
This is the same principle used when dealing with filenames in the API
of Python itself.  *Except* at that boundary, I do not need to worry
about whether a particular string contains smuggled bytes or not.[*]
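
For comparison, a small sketch of the filename case.  It assumes a POSIX
system, where os.fsdecode() and os.fsencode() apply the surrogateescape
handler; the file name is made up and no file is actually touched:

```python
import os

# Hypothetical file name containing a byte that is not valid UTF-8.
raw_name = b"report-\xff.txt"

# Boundary in: bytes become a str (with a smuggled byte on POSIX).
name = os.fsdecode(raw_name)

# In between, ordinary str-based path handling works as usual.
stem, ext = os.path.splitext(name)

# Boundary out: the original bytes come back unchanged.
back = os.fsencode(name)
```

Only the two boundary calls need to know that the name may not be
clean text.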

> If you, instead, represented the header as a list with some str
> elements and some bytes, it would be just as valid (though much harder
> to work with); all your manipulations are done on the str parts, and
> the bytes just tag along for the ride.

Quite a bit harder, which is why I don't do that.

> > You are right about the wrapping, though.  If a header with invalid
> > bytes (and in this scenario we *are* talking about errors) needs to
> > be wrapped, we have to first decode the smuggled bytes and turn it
> > into an 'unknown-8bit' encoded word before we can wrap the header.
> 
> Yeah, and that's going to be a bit messy. If you get 60 characters
> followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
> the middle of the smuggled section?

The point of RFC2047 encoded words is that they are an ASCII
representation of binary data, so once the bytes are "properly" Content
Transfer Encoded (as being in an unknown charset) the string contains no
smuggled bytes and can be wrapped.
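
A hand-rolled sketch of that conversion, to show why the result is safe
to wrap.  The helper name is hypothetical, this is not the email
package's own code path, and it assumes the text is ASCII apart from the
smuggled bytes; it also does not enforce RFC 2047's 75-character limit
on an encoded word:

```python
import base64


def to_unknown_8bit_word(text: str) -> str:
    """Turn text with smuggled bytes into one base64 encoded word."""
    # Recover the original bytes with the codec/handler used on ingest.
    raw = text.encode("ascii", errors="surrogateescape")
    # Base64 output is pure ASCII, so no smuggled bytes remain.
    payload = base64.b64encode(raw).decode("ascii")
    return f"=?unknown-8bit?b?{payload}?="


# Hypothetical header value containing one undecodable byte.
smuggled = b"caf\xe9 au lait".decode("ascii", errors="surrogateescape")
word = to_unknown_8bit_word(smuggled)

# The encoded word now encodes strictly as ASCII.
ascii_bytes = word.encode("ascii")
```

Once the value is in this form, wrapping only has to avoid splitting an
individual encoded word.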

--David

[*] I worried a lot that this was re-introducing the bytes/string
problem from Python 2.  The difference is that if the smuggled bytes
escape from the email API, that's a bug in the email package.  So user
code using the library is *not* in danger of getting mysterious encoding
errors when one day the input is international where before it was all
ASCII.  (Absent bugs in the library.)