unicode() vs. s.decode()

garabik-news-2005-05 at kassiopeia.juls.savba.sk
Fri Aug 7 07:49:05 EDT 2009


Thorsten Kampe <thorsten at thorstenkampe.de> wrote:
> * Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
>> What if you're writing a loop which takes one million different lines of 
>> text and decodes them once each?
>> 
>> >>> setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
>> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>> >>> t1.timeit(number=1)
>> 5.6751680374145508
>> >>> t2.timeit(number=1)
>> 2.6822888851165771
>> 
>> Seems like a pretty meaningful difference to me.
> 
> Bollocks. No one will even notice whether a code sequence runs 2.7 or 
> 5.7 seconds. That's completely artificial benchmarking.
>

For a real-life example: I often have a file with one word per line, and
I run Python scripts to apply some (sometimes fairly trivial)
transformation over it. A REAL example: reading lines with word, lemma and
tag separated by tabs from stdin and writing the word to stdout, unless it
starts with '<' (~6e5 lines, python2.5, user times, warm cache; I hope
the comments are self-explanatory):

no unicode
user    0m2.380s

decode('utf-8'), encode('utf-8')
user    0m3.560s

sys.stdout = codecs.getwriter('utf-8')(sys.stdout);sys.stdin = codecs.getreader('utf-8')(sys.stdin)
user    0m6.180s

unicode(line, 'utf8'), encode('utf-8')
user    0m3.820s

unicode(line, 'utf-8'), encode('utf-8')
user    0m2.880s

python3.1
user    0m1.560s

Since I have something like 18 million words in my current project (and
more than 600 million overall) and I often tweak some parameters and re-run
the transformations, the differences are pretty significant.
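
The script behind the timings above was not posted, so what follows is only
a sketch of what three of the variants might look like. It is written for
Python 3, where bytes.decode() and str(b, 'utf-8') take the place of the
Python 2 s.decode() and unicode(s, 'utf-8'). The function names, the run()
helper and the sample lines are made up for illustration.

```python
import codecs
import io


def filter_bytes(inp, out):
    """'no unicode' variant: stay in raw bytes from input to output."""
    for line in inp:
        word = line.split(b"\t", 1)[0].rstrip(b"\n")
        if not word.startswith(b"<"):
            out.write(word + b"\n")


def filter_decode(inp, out):
    """Decode every input line and encode every output line by hand."""
    for line in inp:
        text = line.decode("utf-8")  # str(line, "utf-8") is the other spelling
        word = text.split("\t", 1)[0].rstrip("\n")
        if not word.startswith("<"):
            out.write((word + "\n").encode("utf-8"))


def filter_codecs(inp, out):
    """Wrap both byte streams with codecs.getreader / codecs.getwriter."""
    reader = codecs.getreader("utf-8")(inp)
    writer = codecs.getwriter("utf-8")(out)
    for line in reader:
        word = line.split("\t", 1)[0].rstrip("\n")
        if not word.startswith("<"):
            writer.write(word + "\n")


# Made-up sample in the word<TAB>lemma<TAB>tag layout, with markup lines.
SAMPLE = "<s>\nkôň\tkôň\tN\nbeží\tbežať\tV\n</s>\n".encode("utf-8")


def run(func, data=SAMPLE):
    """Feed data through one variant and return the bytes it wrote."""
    out = io.BytesIO()
    func(io.BytesIO(data), out)
    return out.getvalue()
```

In a real script, inp and out would be sys.stdin.buffer and
sys.stdout.buffer rather than in-memory buffers.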

Personally, I have been surprised by:
1) bad performance of the codecs wrapper (I expected it to be on par with 
   unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) good performance of python3.1 (utf-8 locale)
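
Point 1 is easy to check on your own machine. Below is a minimal sketch of
such a comparison for Python 3, using timeit on synthetic in-memory data.
The data and the function names are assumptions, not the original corpus or
script, and the numbers it prints will not match the ones quoted above.

```python
import codecs
import io
import timeit

# Synthetic "word<TAB>lemma<TAB>tag" lines (hypothetical, not the real corpus).
DATA = "".join("slovo%d\tslovo\tN\n" % n for n in range(20000)).encode("utf-8")


def per_line_decode(data=DATA):
    """Decode each line by hand; return the total number of characters."""
    return sum(len(line.decode("utf-8")) for line in io.BytesIO(data))


def codecs_reader(data=DATA):
    """Let a codecs.getreader wrapper decode; return the same total."""
    reader = codecs.getreader("utf-8")(io.BytesIO(data))
    return sum(len(line) for line in reader)


def compare(repeat=3):
    """Return the best wall-clock time in seconds for each approach."""
    return {
        func.__name__: min(timeit.repeat(func, number=1, repeat=repeat))
        for func in (per_line_decode, codecs_reader)
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print("%-16s %.3fs" % (name, seconds))
```

Note that this measures wall-clock time inside one process, not the user
time reported by the shell for a whole script run.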


-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


