Replace accented chars with unaccented ones

Josiah Carlson jcarlson at nospam.uci.edu
Mon Mar 15 21:19:00 EST 2004


Jeff Epler wrote:

> You have two options.  First, convert the string to Unicode and use code
> like the following:
> 
>     replacements = [(u'\xe9', 'e'), ...]
>     def remove_accents(u):
>         for a, b in replacements:
>             u = u.replace(a, b)
>         return u
> 
> 
> >>> remove_accents(u'\xe9')
> u'e'
> 
> Second, if you are using a single-byte encoding (iso8859-1, for
> instance), then work with a byte string:
>     import string
>     replacement_map = string.maketrans('\xe9...', 'e...')
>     def remove_accents(s):
>         return s.translate(replacement_map)
> 
> 
> >>> remove_accents('\xe9')
> 'e'
> 
> If you want to have strings like u'é' in your programs, you have to
> include a line at the top of the source file that tells Python the
> encoding, like the following line does:
>     # -*- coding: utf-8 -*-
> (except you have to name the encoding your editor uses, if it's not
> utf-8). See http://python.org/peps/pep-0263.html
> 
> Once you've done that, you can write
>     replacements = [(u'é', 'e'), ...]
> instead of using the \xXX escape for it.

Translating the replacement pairs into a dictionary would give a
significant speedup for large numbers of replacements, provided every
key is a single character (the lookup below works one character at a
time).

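# replacement_pairs is the list of (accented, plain) tuples,
# called 'replacements' in the quoted code above.
mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    # Look up each character; unmapped characters pass through unchanged.
    return u''.join([mapping.get(i, i) for i in inp])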

A single pass over the input gives an O(len(inp)) algorithm, which is
much better (in running time) than the replace()-in-a-loop method, which
as given runs in O(len(inp) * len(replacement_pairs)) time.
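
For anyone who wants to check that the two approaches agree, here is a
minimal sketch. The four pairs and the sample string are made up for
illustration (the original list is elided with "..."), and it assumes
every key is exactly one character long. The u'' literals are kept so
the code reads the same as the snippets above.

```python
# Hypothetical sample pairs; the real list in the thread is elided.
replacement_pairs = [
    (u'\xe9', u'e'),  # e with acute accent
    (u'\xe8', u'e'),  # e with grave accent
    (u'\xe0', u'a'),  # a with grave accent
    (u'\xfc', u'u'),  # u with diaeresis
]

def remove_accents(u):
    # Loop-based version: one full scan of the string per pair.
    for a, b in replacement_pairs:
        u = u.replace(a, b)
    return u

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    # Dictionary version: one scan, one dict lookup per character.
    return u''.join([mapping.get(ch, ch) for ch in inp])

# Made-up input containing all four accented characters.
sample = u'd\xe9j\xe0 vu, tr\xe8s m\xfcde'
result = multi_replace(sample)
```

Both functions return the same string for this input. They would stop
agreeing if a key were longer than one character, since the dictionary
version only ever looks at single characters.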

  - Josiah


