[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 09:02:33 CEST 2014

On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano <steve at pearwood.info> wrote:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> 
> > > Basically, we are pretending that each smuggled
> > > byte is a single character for string parsing purposes...but they don't
> > > match any of our parsing constants.  They are all "any character" matches
> > > in the regexes and what have you.
> > 
> > This is slightly iffy, as you can't be sure that one byte represents
> > one character, but as long as you don't much care about that, it's not
> > going to be an issue.
> 
> This discussion would probably be a lot easier to follow, with fewer 
> miscommunications, if there were some examples. Here is my example, 
> perhaps someone can tell me if I'm understanding it correctly.
> 
> I want to send an email including the header line:
> 
> 'Subject: “NOBODY expects the Spanish Inquisition!”'
> 
> Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I 
> do the right thing and encode it as UTF-8:
> 
> b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

That won't work until email supports RFC 6532.  Until then, you can only
use ASCII and encoded words successfully.  So a program that emits the raw
curly quotes is already buggy enough.

> but it's not up to Python's email package to throw those invalid bytes 
> out or permanently replace them with something else. Also, we want to work 
> with Unicode strings, not byte strings, so there has to be a way to 
> smuggle those three bytes into Unicode, without ending up with either 
> the replacement bytes:
> 
> # using the 'replace' error handler
> 'Subject: ï¿½ï¿½ï¿½NOBODY expects the Spanish Inquisition!â€'

What you'll get if you request a text copy of that header is

  'Subject: ï¿½ï¿½ï¿½NOBODY expects the Spanish Inquisition!ï¿½ï¿½ï¿½'
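
A rough codec-level illustration of why both quotes turn into
replacement characters. This is only a sketch using plain bytes.decode
with the ASCII codec; it is not the email package's own code path, and
the variable names are made up for the example:

```python
# The raw header bytes from the example, with UTF-8 encoded curly quotes.
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Decoding as ASCII with the 'replace' handler turns every non-ASCII
# byte into U+FFFD, so each three-byte quote becomes three of them.
text = raw.decode('ascii', errors='replace')

print(text.count('\ufffd'))  # 6: three per curly quote
```

The original bytes cannot be recovered from this string, which is why
it is only suitable as a display copy.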

> Am I right so far?
> 
> So the email package uses the surrogate-escape error handler and ends up 
> with this Unicode string:
> 
> 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

Except that it encodes the closing quote, too :)

> which can be encoded back to the bytes we started with.

Right.  If you serialize the message as bytes, the bytes are recovered
and output when that header is output.
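
The round trip can be sketched at the codec level like this. Again,
this is only an illustration with bytes.decode and str.encode, under
the assumption that the header is ingested as ASCII; the variable names
are hypothetical. Note that the escapes come out in byte order, so the
\xe2 byte maps to \udce2 first:

```python
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN instead of discarding or replacing it.
smuggled = raw.decode('ascii', errors='surrogateescape')

# 'Subject: ' is 9 characters, so the opening quote's bytes sit at 9..11.
print(ascii(smuggled[9:12]))  # '\udce2\udc80\udc9c'

# Encoding with the same handler restores the original bytes exactly.
recovered = smuggled.encode('ascii', errors='surrogateescape')
print(recovered == raw)  # True
```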

Now, once we support RFC 6532, you will be exactly right, as we will
then have the option of handling utf-8 encoded headers, and we will do
that using the utf-8 codec to ingest headers, and the surrogateescape
error handler to handle exactly the kind of bad data you postulate.
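
As a sketch of what that combination does at the codec level (not the
eventual email package implementation; the stray \xff byte and the
variable names are assumptions made up for the example):

```python
# Valid UTF-8 curly quotes followed by one byte that is never valid UTF-8.
raw = b'Subject: \xe2\x80\x9cNOBODY\xe2\x80\x9d \xff'

# The utf-8 codec decodes the well-formed sequences normally, and
# surrogateescape smuggles only the undecodable byte as U+DCFF.
text = raw.decode('utf-8', errors='surrogateescape')

print(ascii(text))  # 'Subject: \u201cNOBODY\u201d \udcff'

# Serializing with the same codec and handler gives back the input bytes.
print(text.encode('utf-8', errors='surrogateescape') == raw)  # True
```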

--David