unicode speed
Neil Hodgson
nyamatongwe+thunder at gmail.com
Tue Nov 29 05:14:26 EST 2005
David Siroky:
> output = ''
I suspect you really want "output = u''" here.
> for c in line:
>     if not unicodedata.combining(c):
>         output += c
This creates as many as 50000 new string objects of increasing size,
since each += builds a new string from everything accumulated so far.
To build large strings, two common faster techniques are to append the
characters to a list and call join on it once at the end, or to
accumulate the characters in a cStringIO.
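As a rough sketch of those two techniques side by side (the function names and the sample string are made up for illustration, and io.StringIO is used here in place of cStringIO on the assumption that the code should also run on a current Python):

```python
import io
import unicodedata


def strip_with_join(text):
    # Technique 1: collect the characters to keep in a list,
    # then build the result with a single join at the end.
    kept = []
    for c in text:
        if not unicodedata.combining(c):
            kept.append(c)
    return u"".join(kept)


def strip_with_stringio(text):
    # Technique 2: write the characters to keep into an in-memory
    # text buffer and read the whole value back once.
    buf = io.StringIO()
    for c in text:
        if not unicodedata.combining(c):
            buf.write(c)
    return buf.getvalue()


# Hypothetical sample input, decomposed first so that accents
# become separate combining characters.
sample = unicodedata.normalize("NFKD", u"caf\u00e9 na\u00efve")
print(strip_with_join(sample))
print(strip_with_stringio(sample))
```

Both functions should return the same text; which one is faster is worth measuring on your own data rather than assuming.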
This is about 10 times faster for me:
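import unicodedata

def no_diacritics(line):
    # Decode byte strings; unicode input passes through unchanged.
    if not isinstance(line, unicode):
        line = unicode(line, 'utf-8')
    # NFKD splits accented characters into base + combining marks.
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    # Join once at the end instead of growing a string with +=.
    return u''.join(output)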
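For anyone reading this on Python 3, where str is already Unicode and the unicode type no longer exists, the same idea might look roughly like the following. The name no_diacritics_py3 and the encoding parameter are hypothetical, added only for this sketch:

```python
import unicodedata


def no_diacritics_py3(line, encoding="utf-8"):
    # Hypothetical Python 3 port: bytes are decoded, str is used as-is.
    if isinstance(line, bytes):
        line = line.decode(encoding)
    # NFKD splits accented characters into a base character
    # followed by its combining marks.
    line = unicodedata.normalize("NFKD", line)
    # Keep only characters whose combining class is 0, joining once.
    return "".join(c for c in line if not unicodedata.combining(c))


print(no_diacritics_py3("caf\u00e9"))      # cafe
print(no_diacritics_py3(b"na\xc3\xafve"))  # naive
```

Note that NFKD also applies compatibility decompositions, so a ligature such as U+FB01 comes back as the two letters "fi". If only accents should be affected, NFD may be the better choice.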
Neil