Replace accented chars with unaccented ones

Jeff Epler jepler at unpythonic.net
Mon Mar 15 18:55:18 EST 2004


You have two options.  First, convert the string to Unicode and use code
like the following:

    # (accented, unaccented) pairs; extend the list as needed
    replacements = [(u'\xe9', u'e'), ...]
    def remove_accents(u):
        # Apply each substitution in turn
        for a, b in replacements:
            u = u.replace(a, b)
        return u

>>> remove_accents(u'\xe9')
u'e'
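
For anyone reading this with Python 3, where every str is already
Unicode and the u prefix is optional, the first option might look
roughly like the sketch below.  The entries in REPLACEMENTS are only
illustrative assumptions, since the original list is elided.

```python
# Illustrative table only: these pairs are assumptions, not the full
# list from the post.
REPLACEMENTS = [
    ("\xe9", "e"),  # e with acute accent
    ("\xe8", "e"),  # e with grave accent
    ("\xe0", "a"),  # a with grave accent
    ("\xf1", "n"),  # n with tilde
]


def remove_accents(text):
    """Return text with each listed accented character replaced."""
    for accented, plain in REPLACEMENTS:
        text = text.replace(accented, plain)
    return text


result = remove_accents("caf\xe9 ma\xf1ana")
print(result)
```

Each pair costs one pass over the string, which is fine for a short
table; characters that are not listed pass through unchanged.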

Second, if you are using a single-byte encoding (iso8859-1, for
instance), you can work with byte strings directly:

    import string
    # Both arguments must have the same length
    replacement_map = string.maketrans('\xe9...', 'e...')
    def remove_accents(s):
        return s.translate(replacement_map)

>>> remove_accents('\xe9')
'e'
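
In Python 3, string.maketrans no longer exists; the byte-string
version of this idea would use bytes.maketrans instead.  A minimal
sketch, assuming iso8859-1 data and a small made-up set of characters
(the original arguments are elided):

```python
# Assumed sample table: byte values are the iso8859-1 codes for
# e-acute, e-grave, a-grave and n-tilde.  Both arguments must have the
# same length.
REPLACEMENT_MAP = bytes.maketrans(b"\xe9\xe8\xe0\xf1", b"eean")


def remove_accents_bytes(data):
    """Translate iso8859-1 encoded bytes one byte at a time."""
    return data.translate(REPLACEMENT_MAP)


encoded = "caf\xe9".encode("iso8859-1")
plain = remove_accents_bytes(encoded)
print(plain)
```

This only works for single-byte encodings.  In utf-8 an accented
character occupies more than one byte, so a byte-for-byte table cannot
describe the replacement.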

If you want to write literals like u'é' in your programs, you have to
include a line on the first or second line of the source file that
tells Python the file's encoding, like this:

    # -*- coding: utf-8 -*-

(If your editor saves files in an encoding other than utf-8, name that
encoding instead.)  See http://python.org/peps/pep-0263.html

Once you've done that, you can write
    replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.
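
As a small check that the literal and the escape really are the same
string, here is a sketch written for Python 3, where utf-8 is already
the default source encoding and the declaration is only needed for
other encodings.  It assumes the file is saved as utf-8; the variable
names are made up for the example.

```python
# -*- coding: utf-8 -*-
# With the declaration above (or Python 3's utf-8 default), the literal
# and the escape spell the same one-character string.
literal = "é"
escaped = "\xe9"
same = (literal == escaped)

# Only the first pair is shown; the rest of the original list is elided.
replacements = [(literal, "e")]

print(same, len(literal))
```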
    
Jeff



