encoding problems (é and è)

Fri Mar 24 06:40:44 EST 2006

On 24/03/2006 8:11 PM, Duncan Booth wrote:
> Peter Otten wrote:
> 
> 
>>>You can replace ALL of this upshifting and accent removal in one blow
>>>by using the string translate() method with a suitable table.
>>
>>Only if you convert to unicode first or if your data maintains 1 byte
>>== 1 character, in particular it is not UTF-8. 
>>
> 
> 
> There's a nice little codec from Skip Montanaro for removing accents from 

For the benefit of those who may read only this far, it is NOT nice.

> latin-1 encoded strings. It also has an error handler so you can convert 
> from unicode to ascii and strip all the accents as you do so:
> 
> http://orca.mojam.com/~skip/python/latscii.py
> 
> 
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> 
> E
> 
> 
> So Bussiere could replace a large chunk of his code with:

Could, but definitely shouldn't.

> 
>     ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
>     ligneA = ligneA.upper()
> 
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem) 
> his files are actually in some different encoding.
> 
> Unfortunately, just as I finished writing this I discovered that the 
> latscii module isn't as robust as I thought, it blows up on consecutive 
> accented characters. 
> 
>  :(
> 
Some of the transformations are a little unfortunate :-(
     0x00d0: ord('D'), # Ð
     0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
The Icelandic thorn letters become P & p (based on physical appearance), 
when they should become Th and th.
The German letter Eszett (00DF) becomes B (appearance) when it should be ss.
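
For what it's worth, here is a minimal sketch of the sort of table I'd rather see, assuming Python 3 (where str is already Unicode and translate() accepts string replacements). The names SPECIAL and strip_accents are made up for illustration, and mapping small eth to d is my assumption. Letters with no canonical decomposition get explicit entries; everything else is handled by decomposing and dropping the combining marks.

```python
import unicodedata

# Hypothetical table: letters that have no canonical decomposition
# and therefore need an explicit (possibly multi-character) replacement.
SPECIAL = {
    0x00D0: "D",   # Ð capital eth
    0x00F0: "d",   # ð small eth (assumed replacement)
    0x00DE: "Th",  # Þ capital thorn
    0x00FE: "th",  # þ small thorn
    0x00DF: "ss",  # ß Eszett
}

def strip_accents(text):
    """Return text with accents removed; non-letters are left alone."""
    # First handle the letters that decomposition cannot help with.
    text = text.translate(SPECIAL)
    # Split each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Consecutive accented characters are not a problem here, and the result can still be passed through upper() afterwards.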
Creating alphabetics out of punctuation and symbols is scarcely something 
that Bussiere should be interested in:
     0x00a2: ord('c'), # ¢
     0x00a4: ord('o'), # ¤
     0x00a5: ord('Y'), # ¥
     0x00a7: ord('S'), # §
     0x00a9: ord('c'), # ©
     0x00ae: ord('R'), # ®
     0x00b6: ord('P'), # ¶
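
One way to keep such entries out of a table is to filter it by Unicode category. This is only a sketch, assuming Python 3; is_letter and letters_only are hypothetical helper names, not anything in latscii.

```python
import unicodedata

# The code points listed above: none of them is a letter.
SUSPECT = [0x00A2, 0x00A4, 0x00A5, 0x00A7, 0x00A9, 0x00AE, 0x00B6]

def is_letter(cp):
    """True if the code point's general category is a letter (Lu, Ll, ...)."""
    return unicodedata.category(chr(cp)).startswith("L")

def letters_only(table):
    """Return a copy of a translate() table keeping only letter keys."""
    return {cp: repl for cp, repl in table.items() if is_letter(cp)}
```

Applied to a table that contains the entries above plus 0x00c9, only the 0x00c9 entry survives.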