unicode() vs. s.decode()
garabik-news-2005-05 at kassiopeia.juls.savba.sk
Fri Aug 7 07:49:05 EDT 2009
Thorsten Kampe <thorsten at thorstenkampe.de> wrote:
> * Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
>> What if you're writing a loop which takes one million different lines of
>> text and decodes them once each?
>>
>> >>> setup = 'L = ["abc"*(n%100) for n in xrange(1000000)]'
>> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>> >>> t1.timeit(number=1)
>> 5.6751680374145508
>> >>> t2.timeit(number=1)
>> 2.6822888851165771
>>
>> Seems like a pretty meaningful difference to me.
>
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.
>
For a real-life example, I often have a file with one word per line, and
I run Python scripts to apply some (sometimes fairly trivial)
transformation over it. A REAL example: reading lines with word, lemma,
tag separated by tabs from stdin and writing the word to stdout, unless it
starts with '<' (~6e5 lines, python2.5, user times, warm cache; I hope
the comments are self-explanatory):
no unicode
user 0m2.380s
decode('utf-8'), encode('utf-8')
user 0m3.560s
sys.stdout = codecs.getwriter('utf-8')(sys.stdout);sys.stdin = codecs.getreader('utf-8')(sys.stdin)
user 0m6.180s
unicode(line, 'utf8'), encode('utf-8')
user 0m3.820s
unicode(line, 'utf-8'), encode('utf-8')
user 0m2.880s
python3.1
user 0m1.560s
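For reference, a minimal sketch of the kind of script timed above. It is
written in Python 3 terms, where str(raw, 'utf-8') is the counterpart of
the 2.x unicode(line, 'utf-8'); the function name extract_words and the
sample lines are made up for illustration and are not the script that
produced the timings.

```python
import io


def extract_words(lines, decode=lambda raw: raw.decode("utf-8")):
    """Yield the word (first tab-separated field) of each input line,
    re-encoded as UTF-8, skipping lines that start with '<'.

    `lines` is any iterable of bytes lines; `decode` is the decoding
    variant under test (hypothetical parameter, for illustration only).
    """
    for raw in lines:
        line = decode(raw)
        if line.startswith("<"):
            continue  # markup line, not a word/lemma/tag record
        word = line.rstrip("\n").split("\t")[0]
        yield (word + "\n").encode("utf-8")


# Made-up sample input: word, lemma, tag separated by tabs.
sample_bytes = "<doc>\nmačka\tmačka\tN\npsy\tpes\tN\n</doc>\n".encode("utf-8")

# Variant 1: bytes.decode('utf-8')
result_decode = b"".join(extract_words(io.BytesIO(sample_bytes)))

# Variant 2: str(raw, 'utf-8'), the Python 3 spelling of unicode(line, 'utf-8')
result_str = b"".join(
    extract_words(io.BytesIO(sample_bytes), decode=lambda raw: str(raw, "utf-8"))
)
```

In a real run the lines would come from sys.stdin.buffer and go to
sys.stdout.buffer instead of the in-memory sample.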
Since I have something like 18 million words in my current project (and
more than 600 million overall) and I often tweak some parameters and re-run
the transformations, the differences are pretty significant.
Personally, I have been surprised by:
1) the bad performance of the codecs wrapper (I expected it to be on par with
unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) the good performance of python3.1 (utf-8 locale)
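Anyone who wants to repeat the comparison on a current interpreter could
start from something like the sketch below. It is a Python 3 adaptation,
not the 2.5 setup measured above: DATA is synthetic, the three function
names are invented for this example, and the numbers will depend on the
machine and the Python version.

```python
import codecs
import io
import timeit

# Synthetic input (an assumption, not the real corpus): word, lemma, tag per line.
DATA = "".join("slovo%d\tlemma\tTAG\n" % n for n in range(20000)).encode("utf-8")


def via_decode():
    # Explicit per-line bytes.decode('utf-8')
    return [line.decode("utf-8") for line in io.BytesIO(DATA)]


def via_str():
    # str(line, 'utf-8'), the Python 3 counterpart of unicode(line, 'utf-8')
    return [str(line, "utf-8") for line in io.BytesIO(DATA)]


def via_codecs_reader():
    # Wrap the byte stream in a codecs StreamReader and iterate over it
    return list(codecs.getreader("utf-8")(io.BytesIO(DATA)))


# Best of three repeats, five passes over DATA per repeat.
timings = {
    func.__name__: min(timeit.repeat(func, number=5, repeat=3))
    for func in (via_decode, via_str, via_codecs_reader)
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print("%-20s %.4f s" % (name, seconds))
```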
--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!