*firstname*nlsnews at georgea*lastname*.com
Tue Nov 29 21:50:19 CET 2005
In article <pan.2005.11.29.08.48.15.951250 at email.cz>,
David Siroky <dsiroky at email.cz> wrote:
> I need to enlighten myself in Python unicode speed and implementation.
> My platform is AMD Athlon at 1300 (x86-32), Debian, Python 2.4.
> First a simple example (and time results):
> x = "a"*50000000
> real 0m0.195s
> user 0m0.144s
> sys 0m0.046s
> x = u"a"*50000000
> real 0m2.477s
> user 0m2.119s
> sys 0m0.225s
> So my first question is why creation of a unicode string takes more than 10x
> longer than a non-unicode string?
Your first example uses about 50 MB. Your second uses about 200 MB (or
100 MB if your Python is compiled with 2-byte Unicode characters). Check
the size of Unicode characters with:
>>> import sys
>>> hex(sys.maxunicode)
If it says '0x10ffff', each unichar uses 4 bytes; if it says '0xffff',
each unichar uses 2 bytes.
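
The arithmetic behind those figures can be sketched as follows. This is only a rough estimate that ignores per-object overhead; the function name estimated_size_mb is made up for illustration, and it takes maxunicode as a parameter because on Python 3.3 and later sys.maxunicode is always 0x10ffff and strings use a variable-width layout, so the 2-byte/4-byte rule applies only to older builds like the 2.4 one discussed here.

```python
def estimated_size_mb(n_chars, maxunicode):
    """Rough payload size in MB (10**6 bytes) of a unicode string on an
    old fixed-width build, ignoring object headers.

    maxunicode is the value reported by sys.maxunicode on that build.
    """
    # Wide builds (0x10ffff) store 4 bytes per character,
    # narrow builds (0xffff) store 2 bytes per character.
    bytes_per_char = 4 if maxunicode > 0xffff else 2
    return n_chars * bytes_per_char / 1e6


n = 50000000
byte_string_mb = n * 1 / 1e6                  # "a" * n: 1 byte per character
wide_mb = estimated_size_mb(n, 0x10ffff)      # u"a" * n on a wide build
narrow_mb = estimated_size_mb(n, 0xffff)      # u"a" * n on a narrow build
```

With n = 50000000 this gives 50 MB for the byte string, 200 MB on a wide build and 100 MB on a narrow build.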
> Another situation: speed problem with long strings
> I have a simple function for removing diacritics from a string:
> # -*- coding: UTF-8 -*-
> import unicodedata
> def no_diacritics(line):
>     if type(line) != unicode:
>         line = unicode(line, 'utf-8')
>     line = unicodedata.normalize('NFKD', line)
>     output = ''
>     for c in line:
>         if not unicodedata.combining(c):
>             output += c
>     return output
> Now the calling sequence (and time results):
> for i in xrange(1):
>     x = u"a"*50000
>     y = no_diacritics(x)
> real 0m17.021s
> user 0m11.139s
> sys 0m5.116s
> for i in xrange(5):
>     x = u"a"*10000
>     y = no_diacritics(x)
> real 0m0.548s
> user 0m0.502s
> sys 0m0.004s
> In both cases the total amount of data is equal but when I use shorter strings
> it is much faster. Maybe it has nothing to do with Python unicode but I would
> like to know the reason.
It has to do with how strings (of either kind) are implemented. Strings
are immutable, so string concatenation is done by making a new string
that holds the concatenated value and assigning it to the left-hand side.
Each "output += c" therefore copies everything accumulated so far, and
the total work grows roughly with the square of the string length.
Often it is faster (but more memory-intensive) to append to a list and
then, at the end, do a u''.join(mylist). See GvR's essay on optimization.
Alternatively, you could use array.array from the Python Library (it's
easy) to get something "just as good as" mutable strings.
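
Here is a minimal sketch of the list-and-join approach applied to your function. It is untested against your data and written for a current Python 3, where str is already Unicode; on Python 2.4 you would keep your unicode(line, 'utf-8') check and use u''.join(...) instead.

```python
import unicodedata


def no_diacritics(line):
    """Return line with combining marks (diacritics) removed."""
    # Accept UTF-8 encoded bytes as well as text, like the original function.
    if isinstance(line, bytes):
        line = line.decode('utf-8')
    # NFKD splits accented characters into a base character
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize('NFKD', line)
    kept = []
    for c in decomposed:
        if not unicodedata.combining(c):
            kept.append(c)  # appending to a list does not copy the rest
    # One join at the end builds the result string a single time.
    return ''.join(kept)
```

Calling no_diacritics("Příliš žluťoučký kůň") should give "Prilis zlutoucky kun", and the running time should now grow roughly in proportion to the input length rather than with its square.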
TonyN.:' *firstname*nlsnews at georgea*lastname*.com