encoding problems (é and è)

John Machin sjmachin at lexicon.net
Fri Mar 24 05:43:23 CET 2006

On 24/03/2006 2:19 PM, Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sjmachin at lexicon.net> wrote:
>> On 24/03/2006 8:36 AM, Peter Otten wrote:
>>> John Machin wrote:
>>>> You can replace ALL of this upshifting and accent removal in one blow by
>>>> using the string translate() method with a suitable table.
>>> Only if you convert to unicode first or if your data maintains 1 byte == 1
>>> character, in particular it is not UTF-8.
>> I'm sorry, I forgot that there were people who are unaware that
>> variable-length gizmos like UTF-8 and various legacy CJK encodings are
>> for storage & transmission, and are better changed to a
>> one-character-per-storage-unit representation before *ANY* data
>> processing is attempted.
> Unfortunately, unicode only appears to solve this problem in a sane 
> manner.  Most people conveniently forget (or never learn in the first 
> place) about combining sequences and denormalized forms.  Consider 
> u'e\u0301', u'U\u0301', or u'C\u0327'.

Yes, and many people don't even bother to look at their data. If they 
did, and found combining forms, then they would treat them, as I said, as 
"variable-length gizmos" which are "better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted."

In any case, as the OP is upshifting and stripping accents [presumably 
as elementary preparation for some sort of fuzzy matching], all that is 
needed is to throw away the combining accents (0301, 0327, etc).
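
A minimal sketch of that idea in present-day Python 3 (where str is already unicode), not necessarily what the OP should use verbatim. It assumes the text has already been decoded, and it first decomposes precomposed letters with unicodedata.normalize("NFD", ...) so that an accent is always a separate combining character that can be dropped. The function name is made up for illustration.

```python
import unicodedata

def strip_accents_upper(text):
    """Hypothetical helper: drop combining marks, then upshift."""
    # NFD splits precomposed letters (e.g. U+00E9) into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()

print(strip_accents_upper("\u00e9l\u00e8ve"))   # precomposed input
print(strip_accents_upper("e\u0301le\u0300ve"))  # already-decomposed input
```

Both calls should give the same result, which is the point: after decomposition, precomposed and combining-sequence spellings are handled identically.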

> These difficulties can be
> mitigated to some degree via normalization (see unicodedata.normalize), 
> but this step is often forgotten

It's not a matter of forgetting or not. People should bother to examine 
their data and see what characters are in use; then they would know 
whether they had a problem or not.
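
Examining the data need not be laborious. The following is a rough sketch (Python 3; the function name and the choice to report only non-ASCII characters are my assumptions) that tallies which characters actually occur, so that combining marks such as 0301 show up by name.

```python
from collections import Counter
import unicodedata

def non_ascii_census(text):
    """Hypothetical helper: list (code point, name, count) for non-ASCII chars."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]

for row in non_ascii_census("e\u0301te\u0301 caf\u00e9"):
    print(row)
```

If a COMBINING entry appears in the listing, the data contains combining sequences and needs the treatment described above.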

> and, for things like u'\u0565\u0582' 
> (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work.

Sorry, I don't understand.
0565 is stand-alone ECH
0582 is stand-alone YIWN
0587 is the ligature.
What doesn't work? At first guess, in the absence of an Armenian 
informant, for pre-matching normalisation, I'd replace 0587 by the two 
constituents -- just like 00DF would be expanded to "ss" (before 
upshifting and before not caring too much about differences caused by 
doubled letters).
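
As a sketch of that replacement step (Python 3, where translate() on a str accepts a dict mapping code points to replacement strings), with the usual caveat that the table below holds only the two examples mentioned here and is not a complete list. The names EXPANSIONS and expand_for_matching are invented for illustration.

```python
# Assumed, deliberately tiny table: extend it after examining the real data.
EXPANSIONS = {
    0x0587: "\u0565\u0582",  # ARMENIAN SMALL LIGATURE ECH YIWN -> ECH + YIWN
    0x00DF: "ss",            # LATIN SMALL LETTER SHARP S -> "ss"
}

def expand_for_matching(text):
    """Hypothetical helper: expand ligatures first, then upshift."""
    return text.translate(EXPANSIONS).upper()

print(expand_for_matching("stra\u00dfe"))
print(expand_for_matching("\u0587") == expand_for_matching("\u0565\u0582"))
```

After expansion the ligature and its two stand-alone constituents compare equal, which is all that pre-matching normalisation needs.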
