encoding problems (é and è)
John Machin
sjmachin at lexicon.net
Thu Mar 23 23:43:23 EST 2006
On 24/03/2006 2:19 PM, Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sjmachin at lexicon.net>
> wrote:
>
>> On 24/03/2006 8:36 AM, Peter Otten wrote:
>>
>>> John Machin wrote:
>>>
>>>> You can replace ALL of this upshifting and accent removal in one
>>>> blow by using the string translate() method with a suitable table.
>>>
>>>
>>> Only if you convert to unicode first or if your data maintains
>>> 1 byte == 1 character, in particular it is not UTF-8.
>>>
>>
>> I'm sorry, I forgot that there were people who are unaware that
>> variable-length gizmos like UTF-8 and various legacy CJK encodings are
>> for storage & transmission, and are better changed to a
>> one-character-per-storage-unit representation before *ANY* data
>> processing is attempted.
>
>
> Unfortunately, unicode only appears to solve this problem in a sane
> manner. Most people conveniently forget (or never learn in the first
> place) about combining sequences and denormalized forms. Consider
> u'e\u0301', u'U\u0301', or u'C\u0327'.
Yes, and many people don't even bother to look at their data. If they
did, and found combining forms, then they would treat them, as I said, as
"variable-length gizmos" which are "better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted."
In any case, as the OP is upshifting and stripping accents [presumably
as elementary preparation for some sort of fuzzy matching], all that is
needed is to throw away the combining accents (0301, 0327, etc).
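
A minimal sketch of that idea, assuming a Python where str is a Unicode
string (Python 3; in older versions you would work on unicode objects
instead). The function name strip_accents_upper is mine, purely for
illustration. It decomposes first so that precomposed letters such as
u'\u00e9' are handled the same way as u'e\u0301', then drops every
character with a non-zero combining class:

```python
import unicodedata

def strip_accents_upper(text):
    # NFD splits precomposed letters into base letter + combining marks,
    # so u'\u00e9' and u'e\u0301' end up as the same sequence.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero class for combining marks (0301, 0327, etc.).
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.upper()
```

This throws away every combining mark, which suits fuzzy matching but is
too blunt for scripts where the marks carry essential information.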
> These difficulties can be
> mitigated to some degree via normalization (see unicodedata.normalize),
> but this step is often forgotten
It's not a matter of forgetting or not. People should bother to examine
their data and see what characters are in use; then they would know
whether they had a problem or not.
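
Examining the data need not be much work. As a rough sketch (the
function name non_ascii_inventory is hypothetical, and I'm assuming the
text has already been decoded to Unicode), something like this lists
every non-ASCII character in use, with its name, combining class and
count:

```python
import unicodedata
from collections import Counter

def non_ascii_inventory(text):
    # Count each character outside the ASCII range.
    counts = Counter(ch for ch in text if ord(ch) > 127)
    # Report code point, Unicode name, combining class and frequency;
    # a non-zero combining class flags a combining mark.
    return [
        (
            "U+%04X" % ord(ch),
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.combining(ch),
            n,
        )
        for ch, n in sorted(counts.items())
    ]
```

Any row with a non-zero combining class tells you that the data
contains combining sequences and needs the extra step.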
> and, for things like u'\u0565\u0582'
> (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work.
Sorry, I don't understand.
0565 is stand-alone ECH
0582 is stand-alone YIWN
0587 is the ligature.
What doesn't work? At first guess, in the absence of an Armenian
informant, for pre-matching normalisation, I'd replace 0587 by the two
constituents -- just like 00DF would be expanded to "ss" (before
upshifting and before not caring too much about differences caused by
doubled letters).
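
For what it's worth, here is a sketch of that expansion step, again
assuming Unicode strings whose translate() method accepts a dict mapping
code points to replacement strings. The table EXPANSIONS and the
function expand_then_upper are illustrative names only, and the table
holds just the two cases mentioned above:

```python
# Map each ligature-like character to its constituents.
EXPANSIONS = {
    0x0587: "\u0565\u0582",  # ARMENIAN SMALL LIGATURE ECH YIWN -> ECH, YIWN
    0x00DF: "ss",            # LATIN SMALL LETTER SHARP S -> "ss"
}

def expand_then_upper(text):
    # Expand first, then upshift, so the constituents get upshifted too.
    return text.translate(EXPANSIONS).upper()
```

A real table would be built from whatever an inspection of the data
actually turns up.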