unicode speed

Neil Hodgson nyamatongwe+thunder at gmail.com
Tue Nov 29 11:14:26 CET 2005

David Siroky:

>     output = ''

    I suspect you really want "output = u''" here.

>     for c in line:
>         if not unicodedata.combining(c):
>             output += c

    This creates as many as 50000 new string objects of increasing 
size, because strings are immutable and each += builds a fresh one. To 
build large strings, two common faster techniques are to collect the 
characters in a list and call join on it at the end, or to accumulate 
the characters in a cStringIO.

    This is about 10 times faster for me:

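import unicodedata

def no_diacritics(line):
     # Decode byte strings so that we always work on unicode
     if not isinstance(line, unicode):
         line = unicode(line, 'utf-8')

     # NFKD splits accented characters into base character + combining marks
     line = unicodedata.normalize('NFKD', line)

     # Collect the non-combining characters and join them once at the end
     output = []
     for c in line:
         if not unicodedata.combining(c):
             output.append(c)
     return u''.join(output)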
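
    The other technique mentioned above, accumulating into a string 
buffer, might look like the sketch below. It is only an illustration 
and assumes Python 3, where io.StringIO takes the place of cStringIO 
and str is already Unicode; the name no_diacritics_buffer is made up 
for this example. No timing is claimed for it.

```python
import io
import unicodedata

def no_diacritics_buffer(line):
    # Assumption: Python 3, so only bytes input needs decoding.
    if isinstance(line, bytes):
        line = line.decode('utf-8')

    # NFKD splits accented characters into base character + combining marks.
    line = unicodedata.normalize('NFKD', line)

    # Write the non-combining characters into an in-memory text buffer
    # instead of growing a string with +=.
    buf = io.StringIO()
    for c in line:
        if not unicodedata.combining(c):
            buf.write(c)
    return buf.getvalue()
```

    Either way the result string is assembled once at the end, rather 
than being rebuilt for every character.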


