[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 09:20:50 CEST 2014

Sorry for the mojibake.  I've not yet gotten around to actually using
the email package to write a smarter replacement for nmh, which is what
I use for email, and I always forget that I need to manually tell nmh
when there is non-ascii in the message...

On Wed, 17 Sep 2014 03:02:33 -0400, "R. David Murray" <rdmurray at bitdance.com> wrote:
> On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano <steve at pearwood.info> wrote:
> > On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> > 
> > > > Basically, we are pretending that each smuggled
> > > > byte is a single character for string parsing purposes...but they don't
> > > > match any of our parsing constants.  They are all "any character" matches
> > > > in the regexes and what have you.
> > > 
> > > This is slightly iffy, as you can't be sure that one byte represents
> > > one character, but as long as you don't much care about that, it's not
> > > going to be an issue.
> > 
> > This discussion would probably be a lot easier to follow, with fewer 
> > miscommunications, if there were some examples. Here is my example, 
> > perhaps someone can tell me if I'm understanding it correctly.
> > 
> > I want to send an email including the header line:
> > 
> > 'Subject: “NOBODY expects the Spanish Inquisition!”'
> > 
> > Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I 
> > do the right thing and encode it as UTF-8:
> > 
> > b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
> 
> That won't work until email supports RFC 6532.  Until then, you can only
> use ascii and encoded words successfully.  So just having the curly
> quotes is a buggy enough program.
> 
> > but it's not up to Python's email package to throw those invalid bytes 
> > out or permanently replace them with something else. Also, we want to work 
> > with Unicode strings, not byte strings, so there has to be a way to 
> > smuggle those three bytes into Unicode, without ending up with either 
> > the replacement bytes:
> > 
> > # using the 'replace' error handler
> > 'Subject: ï¿½ï¿½ï¿½NOBODY expects the Spanish Inquisition!â€'
> 
> What you'll get if you request a text copy of that header is
> 
>   'Subject: ï¿½ï¿½ï¿½NOBODY expects the Spanish Inquisition!ï¿½ï¿½ï¿½'
> 
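
For illustration, here is a minimal sketch of what the 'replace' error
handler does to those bytes. It uses the plain ascii codec directly rather
than the email package, so it only approximates what the package does
internally; the variable names are mine.

```python
# The UTF-8 encoded header from the example above.
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Decoding as ascii with 'replace' turns every non-ascii byte into U+FFFD.
text = raw.decode('ascii', errors='replace')

# Three bytes per curly quote, two quotes.
print(text.count('\ufffd'))  # → 6

# The original bytes are gone for good: re-encoding cannot recover them.
lost = text.encode('utf-8') != raw
print(lost)  # → True
```

Both quotes are replaced, not just the opening one, and nothing in the
resulting string records which bytes were there originally.
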
> > Am I right so far?
> > 
> > So the email package uses the surrogate-escape error handler and ends up 
> > with this Unicode string:
> > 
> > 'Subject: \udce2\udc80\udc9cNOBODY expects the Spanish Inquisition!”'
> 
> Except that it encodes the closing quote, too :)
> 
> > which can be encoded back to the bytes we started with.
> 
> Right.  If you serialize the message as bytes, the bytes are recovered
> and output when that header is output.
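
A rough sketch of that round trip, again using the ascii codec directly
instead of going through the email package (so this shows the mechanism,
not the package's actual code path; names are illustrative):

```python
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN, in the same order as the bytes appeared.
text = raw.decode('ascii', errors='surrogateescape')
print(ascii(text[9:12]))  # → '\udce2\udc80\udc9c'

# One code point per byte, so string parsing sees one "character" each.
print(len(text) == len(raw))  # → True

# Encoding with the same handler gives back exactly the original bytes.
roundtrip = text.encode('ascii', errors='surrogateescape')
print(roundtrip == raw)  # → True
```

Note that a plain text.encode('utf-8') raises UnicodeEncodeError here,
because lone surrogates are not valid in strict UTF-8. The smuggled bytes
only come back out if you ask for surrogateescape on the way out too.
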
> 
> Now, once we support RFC 6532, you will be exactly right, as we will
> then have the option of handling utf-8 encoded headers, and we will do
> that using the utf-8 codec to ingest headers, and the surrogateescape
> error handler to handle exactly the kind of bad data you postulate.
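
To make the difference concrete, here is a small sketch of the utf-8 codec
combined with surrogateescape. The "bad" header is a made-up example of
mine (Windows-1252 curly quotes sent without any encoding), not something
from the thread, and this is only the codec behaviour, not the email
package's eventual implementation.

```python
# Valid UTF-8: the curly quotes decode to real characters.
good = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
good_text = good.decode('utf-8', errors='surrogateescape')
print(good_text)  # → Subject: “NOBODY expects the Spanish Inquisition!”

# Hypothetical bad data: 0x93 and 0x94 are not valid UTF-8 on their own.
bad = b'Subject: \x93NOBODY expects the Spanish Inquisition!\x94'
bad_text = bad.decode('utf-8', errors='surrogateescape')
print(ascii(bad_text[9]), ascii(bad_text[-1]))  # → '\udc93' '\udc94'

# Only the invalid bytes are smuggled, and they still round-trip.
print(bad_text.encode('utf-8', errors='surrogateescape') == bad)  # → True
```

So with the utf-8 codec, well-formed headers become ordinary text, and the
surrogates show up only where the input really was broken.
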
> 
> --David