Python Unicode handling wins again -- mostly
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sat Nov 30 02:11:59 EST 2013
On Fri, 29 Nov 2013 23:00:27 -0700, Ian Kelly wrote:
> On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <roy at panix.com> wrote:
>> I was speaking specifically of "ligatures like fi" (or, if you prefer,
>> "ligatures like ό"). By which I mean those things printers invented
>> because some letter combinations look funny when typeset as two
>> distinct letters.
>
> I think the encoding of your email is incorrect, because GREEK SMALL
> LETTER OMICRON WITH TONOS is not a ligature.
Roy's post, which was sent via Usenet, not email, doesn't have an encoding
set. Since he's sending from a Mac, his software may believe that the
entire universe understands the Mac Roman encoding, which makes a certain
amount of sense since if I recall correctly the fi and fl ligatures
originally appeared in early Mac fonts.
I'm going to give Roy the benefit of the doubt and assume he actually
entered the fi ligature at his end. If his software was using Mac Roman,
it would insert a single byte DE into the message:
py> '\N{LATIN SMALL LIGATURE FI}'.encode('macroman')
b'\xde'
But that's not what his post includes. The message actually includes two
bytes CF8C, in other words:
'\N{LATIN SMALL LIGATURE FI}'.encode('who the hell knows')
=> b'\xCF\x8C'
Since nearly all of his post is in single bytes, it's some variable-width
encoding, but not UTF-8.
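
For anyone who wants to check the "not UTF-8" claim, here is a rough sketch.
It prints what UTF-8 would actually produce for the ligature, then tries a
handful of codecs to see whether any of them yields the two observed bytes.
The CANDIDATES shortlist and the matching_codecs helper are my own
assumptions for illustration, not a claim about what Roy's software uses.

```python
LIGATURE = '\N{LATIN SMALL LIGATURE FI}'
OBSERVED = b'\xcf\x8c'

# Hypothetical shortlist of codecs to try; extend it as you see fit.
CANDIDATES = ['utf-8', 'mac_roman', 'latin-1', 'cp1252',
              'shift_jis', 'euc_jp', 'big5', 'gb2312']

def matching_codecs(text, observed, codecs):
    """Return the codec names that encode `text` to exactly `observed`."""
    matches = []
    for name in codecs:
        try:
            encoded = text.encode(name)
        except UnicodeEncodeError:
            # This codec has no representation for the character at all.
            continue
        if encoded == observed:
            matches.append(name)
    return matches

# UTF-8 needs three bytes for the ligature, not two.
print(LIGATURE.encode('utf-8'))
print(matching_codecs(LIGATURE, OBSERVED, CANDIDATES))
```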
With no encoding set, our newsreader software starts off assuming that
the post uses UTF-8 ('cos that's the only sensible default), and those
two bytes happen to decode to ό GREEK SMALL LETTER OMICRON WITH TONOS.
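
A quick way to confirm what a UTF-8 newsreader makes of those two bytes,
using only the standard library (the variable names are mine):

```python
import unicodedata

observed = b'\xcf\x8c'

# Decode the two bytes the way a UTF-8-assuming newsreader would.
decoded = observed.decode('utf-8')

# One code point, U+03CC, and its official Unicode name.
print(len(decoded), hex(ord(decoded)), unicodedata.name(decoded))
```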
I'm not surprised that Roy has a somewhat jaundiced view of Unicode, when
the tools he uses are apparently so broken. But it isn't Unicode's fault,
it's the tools.
The really bizarre thing is that apparently Roy's software, MT-
NewsWatcher, knows enough Unicode to normalise ﬄ LATIN SMALL LIGATURE FFL
(sent in UTF-8 and therefore appearing as bytes b'\xef\xac\x84') to the
ASCII letters "ffl". That's astonishingly weird.
I suppose it is not entirely impossible that the software is actually
being clever rather than dumb. Having
correctly decoded the UTF-8 bytes, perhaps it realised that there was no
glyph for the ligature, and rather than display a MISSING CHAR glyph
(usually one of those empty boxes you sometimes see), it normalized it to
ASCII. But if it's that clever, why the hell doesn't it set an encoding
line in posts?????
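
For what it's worth, the "clever" behaviour I'm guessing at is easy to
reproduce. This is only a sketch of what such a fallback might look like,
assuming compatibility normalisation (NFKC); I have no idea what
MT-NewsWatcher really does internally.

```python
import unicodedata

wire_bytes = b'\xef\xac\x84'

# Step 1: decode the UTF-8 bytes correctly.
ligature = wire_bytes.decode('utf-8')
print(unicodedata.name(ligature))

# Step 2: no glyph available? Fall back to the compatibility
# decomposition, which turns the ligature into plain letters.
fallback = unicodedata.normalize('NFKC', ligature)

# The result survives a strict ASCII encode, so it is pure ASCII.
print(fallback, fallback.encode('ascii'))
```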
--
Steven