[Python-Dev] Multilingual programming article on the Red Hat Developer blog
R. David Murray
rdmurray at bitdance.com
Wed Sep 17 09:20:50 CEST 2014
Sorry for the mojibake. I've not yet gotten around to actually using
the email package to write a smarter replacement for nmh, which is what
I use for email, and I always forget that I need to manually tell nmh
when there is non-ascii in the message...
On Wed, 17 Sep 2014 03:02:33 -0400, "R. David Murray" <rdmurray at bitdance.com> wrote:
> On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano <steve at pearwood.info> wrote:
> > On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> >
> > > > Basically, we are pretending that each smuggled
> > > > byte is a single character for string parsing purposes...but they don't
> > > > match any of our parsing constants. They are all "any character" matches
> > > > in the regexes and what have you.
> > >
> > > This is slightly iffy, as you can't be sure that one byte represents
> > > one character, but as long as you don't much care about that, it's not
> > > going to be an issue.
> >
> > This discussion would probably be a lot easier to follow, with fewer
> > miscommunications, if there were some examples. Here is my example,
> > perhaps someone can tell me if I'm understanding it correctly.
> >
> > I want to send an email including the header line:
> >
> > 'Subject: “NOBODY expects the Spanish Inquisition!”'
> >
> > Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I
> > do the right thing and encode it as UTF-8:
> >
> > b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
>
> That won't work until email supports RFC 6532. Until then, you can only
> use ascii and encoded words successfully. So just having the curly
> quotes is a buggy enough program.
>
> > but it's not up to Python's email package to throw those invalid bytes
> > out or permanently replace them with something else. Also, we want to work
> > with Unicode strings, not byte strings, so there has to be a way to
> > smuggle those three bytes into Unicode, without ending up with either
> > the replacement bytes:
> >
> > # using the 'replace' error handler
> > 'Subject: ���NOBODY expects the Spanish Inquisition!”'
>
> What you'll get if you request a text copy of that header is
>
> 'Subject: ���NOBODY expects the Spanish Inquisition!���'
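
The difference is easy to see with plain bytes.decode, leaving the email
package out of it. This is only a sketch: `raw` is Steven's header from
above, and decoding with the ascii codec is an assumption standing in for
whatever the parser does internally.

```python
# Steven's header, encoded as UTF-8 (illustrative input only).
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# The 'replace' handler substitutes U+FFFD for every undecodable byte,
# so both curly quotes (three bytes each) are lost, not just the first.
replaced = raw.decode('ascii', errors='replace')

# Count the replacement characters that ended up in the text.
print(replaced.count('\ufffd'))
```

The original bytes cannot be recovered from `replaced`, which is why the
replacement is only suitable for a display copy.
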
>
> > Am I right so far?
> >
> > So the email package uses the surrogate-escape error handler and ends up
> > with this Unicode string:
> >
> > 'Subject: \udce2\udc80\udc9cNOBODY expects the Spanish Inquisition!”'
>
> Except that it encodes the closing quote, too :)
>
> > which can be encoded back to the bytes we started with.
>
> Right. If you serialize the message as bytes, the bytes are recovered
> and output when that header is output.
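
A minimal sketch of that round trip, again using bytes.decode directly
as a stand-in for the parser (an assumption, not the email package's
actual code). Each non-ascii byte 0xNN becomes the lone surrogate
U+DCNN, and encoding with the same handler turns it back into the byte.

```python
import re

# Steven's header, encoded as UTF-8 (illustrative input only).
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Smuggle the undecodable bytes into a str as lone surrogates.
smuggled = raw.decode('ascii', errors='surrogateescape')

# The three bytes of the opening quote, one code point per byte.
opening = smuggled[len('Subject: '):len('Subject: ') + 3]
print(ascii(opening))

# For parsing purposes each smuggled byte is just "some character":
# '.' matches it like anything else.
value = re.fullmatch(r'Subject: (.*)', smuggled).group(1)

# Serializing with the same handler recovers the original bytes.
recovered = smuggled.encode('ascii', errors='surrogateescape')
print(recovered == raw)
```

Note that printing `smuggled` itself to a strict utf-8 terminal would
raise UnicodeEncodeError, hence the ascii() call.
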
>
> Now, once we support RFC 6532, you will be exactly right, as we will
> then have the option of handling utf-8 encoded headers, and we will do
> that using the utf-8 codec to ingest headers, and the surrogateescape
> error handler to handle exactly the kind of bad data you postulate.
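
Roughly what that combination looks like at the codec level. This is a
sketch under assumptions: the trailing 0xFF byte is made-up bad data,
and nothing here is the email package's API.

```python
# Valid UTF-8 curly quotes plus one byte (0xFF) that can never occur
# in UTF-8; the input is hypothetical, chosen for illustration.
raw = b'Subject: \xe2\x80\x9cNOBODY\xe2\x80\x9d \xff'

# The utf-8 codec decodes the quotes properly; only the bad byte is
# smuggled through as a lone surrogate (U+DCFF).
text = raw.decode('utf-8', errors='surrogateescape')
print(ascii(text))

# Encoding with the same codec and handler restores every byte.
restored = text.encode('utf-8', errors='surrogateescape')
print(restored == raw)
```
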
>
> --David