encoding problems (é and è)
Peter Otten
__peter__ at web.de
Fri Mar 24 06:16:51 EST 2006
Duncan Booth wrote:
> There's a nice little codec from Skip Montanaro for removing accents from
> latin-1 encoded strings. It also has an error handler so you can convert
> from unicode to ascii and strip all the accents as you do so:
>
> http://orca.mojam.com/~skip/python/latscii.py
>
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
> >>>
>
> So Bussiere could replace a large chunk of his code with:
>
> ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
> ligneA = ligneA.upper()
>
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
> his files are actually in some different encoding.
>
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought: it blows up on consecutive
> accented characters.
>
> :(
You made me look into it -- and I found that reusing the decoding map as the
encoding map lets you write
>>> u"Élève ééé".encode("latscii")
'Eleve eee'
without relying on the faulty error handler. I tried to fix the handler,
too:
>>> u"Élève ééé".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"möglich ähnlich üblich ááá" + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'
No real testing was performed.
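For readers on Python 3, the encoding-map idea can be sketched roughly as follows. This is only an illustration under assumptions: the three-entry table stands in for the real latscii table, `latscii_encode` is a hypothetical helper name, and the result is a `bytes` object rather than a Python 2 `str`.

```python
import codecs

# Hypothetical subset of a latscii-style table: code point -> ASCII code point.
# The real module covers far more characters; these entries are illustrative.
decoding_map = codecs.make_identity_dict(range(128))
decoding_map.update({
    0xC9: ord("E"),  # É
    0xE8: ord("e"),  # è
    0xE9: ord("e"),  # é
})

# Reuse the decoding table for encoding, as the patch does.
encoding_map = decoding_map


def latscii_encode(text, errors="strict"):
    """Encode text to ASCII bytes through the shared character map."""
    data, _consumed = codecs.charmap_encode(text, errors, encoding_map)
    return data


print(latscii_encode("Élève ééé"))  # b'Eleve eee'
```

Characters missing from the table are still rejected under "strict", so this route only helps for characters the table actually lists.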
Peter
--- latscii_old.py 2006-03-24 11:45:22.580588520 +0100
+++ latscii.py 2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@
 ### Encoding Map
-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map
 ### From Martin Blais
@@ -166,9 +166,9 @@
 ## ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-    key = ord(uerr.object[uerr.start:uerr.end])
+    key = ord(uerr.object[uerr.start])
     try:
-        return unichr(decoding_map[key]), uerr.end
+        return unichr(decoding_map[key]), uerr.start + 1
     except KeyError:
         handler = codecs.lookup_error('replace')
         return handler(uerr)
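
The patched handler can be tried on its own with a sketch along these lines. It is a Python 3 adaptation (`chr` instead of `unichr`, `bytes` results), and the small `decoding_map` here is an assumed stand-in for the full latscii table, so treat it as an illustration of the one-character-at-a-time idea rather than the module itself.

```python
import codecs

# Hypothetical subset of the latscii table: code point -> ASCII code point.
decoding_map = {
    0xC9: ord("E"),  # É
    0xE1: ord("a"),  # á
    0xE4: ord("a"),  # ä
    0xE8: ord("e"),  # è
    0xE9: ord("e"),  # é
    0xF6: ord("o"),  # ö
    0xFC: ord("u"),  # ü
}


def latscii_error(uerr):
    """Replace only the first offending character, then resume right after it."""
    if not isinstance(uerr, UnicodeEncodeError):
        raise uerr
    # uerr.start:uerr.end may span several characters, so look at one only.
    key = ord(uerr.object[uerr.start])
    try:
        return chr(decoding_map[key]), uerr.start + 1
    except KeyError:
        # Same fallback as the patch: let 'replace' handle the reported run.
        handler = codecs.lookup_error("replace")
        return handler(uerr)


codecs.register_error("replacelatscii", latscii_error)

gamma = "\N{GREEK CAPITAL LETTER GAMMA}"
text = "möglich ähnlich üblich ááá" + gamma * 3
print(text.encode("ascii", "replacelatscii"))  # b'moglich ahnlich ublich aaa???'
```

One caveat of the fallback branch: it hands the whole reported run to 'replace', so a mapped character that directly follows an unmapped one in the same run also comes out as '?'.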