[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 06:42:56 CEST 2014

On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmurray at bitdance.com> wrote:

> > Basically, we are pretending that each smuggled
> > byte is a single character for string parsing purposes...but they don't
> > match any of our parsing constants.  They are all "any character" matches
> > in the regexes and what have you.
> 
> This is slightly iffy, as you can't be sure that one byte represents
> one character, but as long as you don't much care about that, it's not
> going to be an issue.

This discussion would probably be a lot easier to follow, with fewer 
miscommunications, if there were some examples. Here is my example; 
perhaps someone can tell me if I'm understanding it correctly.

I want to send an email including the header line:

'Subject: “NOBODY expects the Spanish Inquisition!”'

Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I 
do the right thing and encode it as UTF-8:

b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

but my mail package, not being written in a language as awesome as 
Python, is just riddled with bugs, and somehow I end up with this 
corrupted byte-string instead:

b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

Note that the bytes of the first curly quote are in the wrong order, 
but the second is okay. (Like I said, it's just *riddled* with bugs.) 
That means that trying to decode those bytes as UTF-8 will fail in Python:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 9: 
invalid start byte
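
For anyone who wants to reproduce that failure, here is a minimal sketch. 
The names `corrupted` and `failure` are mine, purely for illustration:

```python
# The corrupted header from the example; only the first quote is damaged.
corrupted = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

try:
    corrupted.decode('utf-8')
except UnicodeDecodeError as err:
    # err.start is the offset of the first undecodable byte.
    failure = (err.start, corrupted[err.start:err.start + 1], err.reason)
    print(failure)  # (9, b'\x9c', 'invalid start byte')
```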

but it's not up to Python's email package to throw those invalid bytes 
out or permanently replace them with something else. Also, we want to 
work with Unicode strings, not byte strings, so there has to be a way to 
smuggle those three bytes into Unicode, without ending up with either 
the replacement characters (U+FFFD):

# using the 'replace' error handler
'Subject: ���NOBODY expects the Spanish Inquisition!”'

or incorrectly interpreting them as valid, but wrong, code points. (If 
we do the second, for example by decoding as Latin-1, we end up with two 
control characters "\x9c\x80" followed by "â".) We want to be able to 
round-trip back to the same bytes we received.
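
Both unwanted outcomes can be sketched as follows. Using Latin-1 for the 
"valid, but wrong" case is my assumption; it is simply one codec that 
produces the control characters described above:

```python
corrupted = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# 'replace' substitutes U+FFFD for each undecodable byte; the originals are lost.
replaced = corrupted.decode('utf-8', errors='replace')
print(ascii(replaced[9:12]))

# Latin-1 maps every byte to some code point, so nothing fails, but the
# text is wrong, and the valid closing quote is mangled as well.
mojibake = corrupted.decode('latin-1')
print(ascii(mojibake[9:12]))
print(ascii(mojibake[-3:]))

# Encoding either result as UTF-8 does not give back the bytes we received.
print(replaced.encode('utf-8') == corrupted)   # False
print(mojibake.encode('utf-8') == corrupted)   # False
```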

Am I right so far?

So the email package uses the surrogate-escape error handler and ends up 
with this Unicode string:

'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

which can be encoded back to the bytes we started with.
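
A minimal sketch of that round trip, calling the codec machinery directly. 
I pair the handler with the utf-8 codec only to match the string shown 
above; I'm not asserting that this is what the email package does 
internally:

```python
corrupted = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
smuggled = corrupted.decode('utf-8', errors='surrogateescape')
print(ascii(smuggled[9:12]))      # '\udc9c\udc80\udce2'
print(smuggled[-1] == '\u201d')   # True: the closing quote decoded normally

# Encoding with the same handler turns the surrogates back into the bytes.
restored = smuggled.encode('utf-8', errors='surrogateescape')
print(restored == corrupted)      # True
```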

Note that technically those three \u... code points are NOT classified 
as "noncharacters". They are actually surrogate code points:

http://www.unicode.org/faq/private_use.html#nonchar4
http://www.unicode.org/glossary/#surrogate_code_point

and they're supposed to be reserved for UTF-16. I'm not sure of the 
implication of that.
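
One implication that can at least be checked from Python: the strict 
UTF-8 encoder rejects lone surrogates, so a string carrying smuggled 
bytes cannot be encoded without naming the handler again. A small sketch 
(the shortened `smuggled` string is just for illustration):

```python
import unicodedata

smuggled = 'Subject: \udc9c\udc80\udce2NOBODY'

# General category 'Cs' is "surrogate", which is distinct from the noncharacters.
categories = [unicodedata.category(ch) for ch in smuggled[9:12]]
print(categories)  # ['Cs', 'Cs', 'Cs']

# Without errors='surrogateescape' the encoder refuses the string.
try:
    smuggled.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```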

> I'm fairly sure you're never going to find an
> encoding in which one unknown byte represents two characters,

There are encodings which use a "shift" mechanism, whereby a byte X 
represents one character by default, and a different character after a 
shift code has been seen. But I don't think that matters, since we're not able to 
interpret those bytes. If we were, we'd just decode them to a text 
string and be done with it.
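
To illustrate the shift idea, here is a sketch using ISO-2022-JP, which I 
picked as one example of such an encoding. In it the shifted state reads 
bytes in pairs, so it is not exactly the single-byte case described 
above, but the same bytes still mean different things depending on the 
state:

```python
# ESC $ B shifts into JIS X 0208; ESC ( B shifts back to ASCII.
encoded = 'あ'.encode('iso2022_jp')
payload = encoded[3:-3]   # the bytes between the two escape sequences

print(encoded)
# Read in the unshifted (ASCII) state, the same bytes are ordinary punctuation.
print(payload.decode('ascii'))
# Read with the shift codes in place, they are one Japanese character.
print(encoded.decode('iso2022_jp'))
```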

> but
> there are cases where it takes more than one byte to make up a
> character (or the bytes are just shift codes or something). 

Multi-byte encodings are very common. All the Unicode encodings are 
multi-byte. So are many East Asian encodings.
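
A quick check of how many bytes a single character takes in a few of 
these encodings (the choice of characters and codecs is mine):

```python
# The opening curly quote from the subject line, in three Unicode encodings.
widths = {codec: len('\u201c'.encode(codec))
          for codec in ('utf-8', 'utf-16-le', 'utf-32-le')}
print(widths)  # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}

# A hiragana letter in two common Japanese encodings.
east_asian = {codec: len('あ'.encode(codec))
              for codec in ('shift_jis', 'euc_jp')}
print(east_asian)  # {'shift_jis': 2, 'euc_jp': 2}
```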

> Does that
> ever throw off your regexes? It wouldn't be an issue to a .* between
> two character markers, but if you ever say .{5} then it might match
> incorrectly.

I don't think the idea is to match on these smuggled bytes specifically. 
I think the idea is to match *around* them. In the example above, we 
might match everything from "Subject: " to the end of the line. So long 
as we never end up with a situation where the smuggled bytes are 
replaced by something else, or shuffled around into different positions, 
we should be fine.
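
Here is a sketch of matching around the smuggled bytes. The header 
pattern is hypothetical, not one taken from the email package:

```python
import re

corrupted = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
smuggled = corrupted.decode('utf-8', errors='surrogateescape')

# '.' matches a lone surrogate like any other character.
match = re.match(r'(?P<name>[^:]+): (?P<value>.*)', smuggled)
value = match.group('value')
print(match.group('name'))  # Subject

# The value still carries the escaped bytes in their original positions.
value_bytes = value.encode('utf-8', errors='surrogateescape')
print(value_bytes == corrupted[len(b'Subject: '):])  # True

# Chris's concern: .{5} counts each escaped byte as one "character".
first_five = re.match(r'.{5}', value).group()
print(ascii(first_five))
```

As long as the pattern only needs to carry the escaped bytes along, the 
count doesn't matter; it would only matter if the pattern relied on .{5} 
meaning five real characters.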

David, is my understanding correct?

-- 
Steven