[Python-Dev] Multilingual programming article on the Red Hat Developer blog
Steven D'Aprano
steve at pearwood.info
Wed Sep 17 06:42:56 CEST 2014
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> > Basically, we are pretending that the each smuggled
> > byte is single character for string parsing purposes...but they don't
> > match any of our parsing constants. They are all "any character" matches
> > in the regexes and what have you.
>
> This is slightly iffy, as you can't be sure that one byte represents
> one character, but as long as you don't much care about that, it's not
> going to be an issue.
This discussion would probably be a lot easier to follow, with fewer
miscommunications, if there were some examples. Here is my example;
perhaps someone can tell me if I'm understanding it correctly.
I want to send an email including the header line:
'Subject: “NOBODY expects the Spanish Inquisition!”'
Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I
do the right thing and encode it as UTF-8:
b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
but my mail package, not being written in a language as awesome as
Python, is just riddled with bugs, and somehow I end up with this
corrupted byte-string instead:
b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
Note that the bytes of the first curly quote are in the wrong
order, but the second is okay. (Like I said, it's just *riddled* with
bugs.) That means that trying to decode those bytes will fail in Python:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 9:
invalid start byte
but it's not up to Python's email package to throw those invalid bytes
out or permanently replace them with something else. Also, we want to work
with Unicode strings, not byte strings, so there has to be a way to
smuggle those three bytes into Unicode, without ending up with either
the replacement characters:
# using the 'replace' error handler
'Subject: ���NOBODY expects the Spanish Inquisition!”'
or incorrectly interpreting them as valid, but wrong, code points. (If
we do the second, say by decoding as Latin-1, we end up with two control
characters "\x9c\x80" followed by "â".) We want to be able to round-trip
back to the same bytes we received.
Am I right so far?
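For concreteness, here is a minimal sketch of the three outcomes described so far, assuming Python 3 and the corrupted header from my example. The variable names are just for illustration.

```python
# The corrupted header from the example above.
raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# 1. Strict decoding fails on the first out-of-place byte.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    strict_error = err
    print(err)

# 2. The 'replace' handler substitutes U+FFFD for each bad byte,
#    so the original bytes cannot be recovered afterwards.
replaced = raw.decode('utf-8', errors='replace')
print(replaced.encode('utf-8') == raw)

# 3. Latin-1 never fails, but it reads every byte as its own code point,
#    so the smuggled bytes become valid-but-wrong characters.
latin1 = raw.decode('latin-1')
print(ascii(latin1[9:12]))
```

Note that the Latin-1 route does round-trip the bytes, but only by giving up on decoding the rest of the header correctly: the intact closing quote is mangled as well.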
So the email package uses the surrogate-escape error handler and ends up
with this Unicode string:
'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
which can be encoded back to the bytes we started with.
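A short sketch of that round trip, assuming Python 3. It uses the standard 'surrogateescape' error handler directly on the example bytes, and says nothing about where the email package applies it internally.

```python
raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Undecodable bytes 0x80-0xFF are mapped to the lone surrogates
# U+DC80-U+DCFF; everything else is decoded as ordinary UTF-8.
text = raw.decode('utf-8', errors='surrogateescape')
print(ascii(text[9:12]))

# Encoding with the same handler turns the surrogates back into
# the original bytes.
roundtrip = text.encode('utf-8', errors='surrogateescape')
print(roundtrip == raw)
```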
Note that technically those three \u... code points are NOT classified
as "noncharacters". They are actually surrogate code points:
http://www.unicode.org/faq/private_use.html#nonchar4
http://www.unicode.org/glossary/#surrogate_code_point
and they're supposed to be reserved for UTF-16. I'm not sure of the
implication of that.
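One practical consequence, at least, can be checked directly. The sketch below (assuming Python 3) confirms that the smuggled code points are in the surrogate category, and that a strict UTF-8 encode refuses them, so such a string cannot be written out without the same error handler.

```python
import unicodedata

smuggled = '\udc9c\udc80\udce2'

# 'Cs' is the Unicode general category for surrogate code points.
categories = [unicodedata.category(ch) for ch in smuggled]
print(categories)

# surrogateescape only ever produces code points in U+DC80-U+DCFF.
in_escape_range = all(0xDC80 <= ord(ch) <= 0xDCFF for ch in smuggled)
print(in_escape_range)

# A strict encode rejects lone surrogates.
try:
    smuggled.encode('utf-8')
except UnicodeEncodeError as err:
    strict_encode_error = err
    print(err)
```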
> I'm fairly sure you're never going to find an
> encoding in which one unknown byte represents two characters,
There are encodings which use a "shift" mechanism, whereby a byte X
represents one character by default, and a different character after a
shift code. But I don't think that matters, since we're not able to
interpret those bytes. If we were, we'd just decode them to a text
string and be done with it.
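As an illustration of the shift idea, here is a sketch using ISO-2022-JP. That encoding is my choice of example, not one named in this thread, and here it is a pair of bytes rather than a single byte that changes meaning after the escape sequence.

```python
# Escape sequences that switch ISO-2022-JP between character sets.
ESC_TO_JIS = b'\x1b$B'    # shift into JIS X 0208
ESC_TO_ASCII = b'\x1b(B'  # shift back to ASCII

payload = b'$"'

# In the default (ASCII) state the two bytes are two ASCII characters.
default_reading = payload.decode('iso2022_jp')

# After the shift, the same two bytes are one Japanese character.
shifted_reading = (ESC_TO_JIS + payload + ESC_TO_ASCII).decode('iso2022_jp')

print(ascii(default_reading), ascii(shifted_reading))
```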
> but
> there are cases where it takes more than one byte to make up a
> character (or the bytes are just shift codes or something).
Multi-byte encodings are very common. All the Unicode encodings (UTF-8,
UTF-16, UTF-32) are multi-byte. So are many East Asian encodings.
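A quick way to see this, assuming Python 3 and its bundled codecs. The sample characters and the list of encodings are arbitrary picks for illustration.

```python
encodings = ('utf-8', 'utf-16-le', 'utf-32-le', 'shift_jis', 'euc_jp')

# Number of bytes each encoding needs for one ASCII letter
# and for one CJK character.
widths = {
    enc: (len('A'.encode(enc)), len('日'.encode(enc)))
    for enc in encodings
}

for enc, (ascii_width, cjk_width) in widths.items():
    print(f'{enc:10} A={ascii_width} 日={cjk_width}')
```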
> Does that
> ever throw off your regexes? It wouldn't be an issue to a .* between
> two character markers, but if you ever say .{5} then it might match
> incorrectly.
I don't think the idea is to match on these smuggled bytes specifically.
I think the idea is to match *around* them. In the example above, we
might match everything from "Subject: " to the end of the line. So long
as we never end up with a situation where the smuggled bytes are
replaced by something else, or shuffled around into different positions,
we should be fine.
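Here is a sketch of matching around the smuggled bytes, assuming Python 3. The regexes are hypothetical and are not taken from the email package.

```python
import re

raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
text = raw.decode('utf-8', errors='surrogateescape')

# Match everything after the header name. '.' treats each smuggled
# surrogate as an ordinary single character.
match = re.match(r'Subject: (.*)$', text)
value = match.group(1)

# Reassembling and encoding gives back the bytes we received.
restored = ('Subject: ' + value).encode('utf-8', errors='surrogateescape')
print(restored == raw)

# A counted match is where it can go wrong: the three smuggled bytes
# count as three characters, although they were meant to be one.
first_five = re.match(r'Subject: (.{5})', text).group(1)
print(ascii(first_five))
```

So a counted match such as .{5} does come up short by two intended characters here, which is the case Chris was asking about.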
David, is my understanding correct?
--
Steven