speeding up string.split()
Fredrik Aronsson
d98aron at dtek.chalmers.se
Fri May 25 16:40:34 EDT 2001
In article <m2n182cs9c.fsf at phosphorus.tucc.uab.edu>,
Chris Green <cmg at uab.edu> writes:
> Is there any way to speed up the following code? Speed doesn't matter
> terribly much to me but this seems to be a fair example of what I need
> to do.
>
> In the real data, I will be able to have a large array and use
> map rather than do it line by line but, I doubt this will change
> things much for the better.
In my case, it actually made it worse... and the reason is probably
that map and list comprehensions build a new list.
So, if you are only going to extract data and not build a new list,
it's probably faster with normal indexing.
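
A minimal sketch of the difference described above, assuming Python 3
(the function names and the 1000-row list are illustrative, not from the
original post). One version only extracts data inside a plain loop; the
other builds the complete list of split rows before walking it. Note that
in Python 3 `map` returns a lazy iterator, so the extra-list cost applies
to the list comprehension here (and to `map` in Python 2).

```python
# Illustrative sketch; names and row count are assumptions, not the poster's code.
LINE = 'xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'
LINES = [LINE] * 1000


def loop_only(rows):
    """Split each row and keep only a running count; no result list is built."""
    total_fields = 0
    for row in rows:
        fields = row.split()
        total_fields += len(fields)
    return total_fields


def build_list(rows):
    """Build the full list of split rows first, then iterate over it."""
    total_fields = 0
    for fields in [row.split() for row in rows]:
        total_fields += len(fields)
    return total_fields
```

Both functions return the same count; the second just holds every split
row in memory at once, which is wasted work if the rows are used once.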
> I've tried w/ python 2 and 1.5.2 and the differences between perl and
> python remain huge ( about 5s:1s python:perl ).
Yep, perl is optimized for text processing.
From "man perl":
Perl is a language optimized for scanning arbitrary text
files, extracting information from those text files, and
printing reports based on that information....
>
> The string is +'d together for usenet purposes
>
> #!/usr/bin/python
> from string import split
>
> for i in range(300000):
>     array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6' +
>                   '1064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
[perl snipped]
The best speedup I saw came from using string methods; time was cut
down to about half. YMMV
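
A small sketch of the string-method form, assuming Python 3, where the
`split` function of the `string` module no longer exists and the method is
the only spelling. It also shows how splitting on an explicit ' ' differs
from the default whitespace split. The variable names are illustrative.

```python
# Illustrative sketch (Python 3 assumed); variable names are not from the post.
LINE = 'xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'

# Call the method on the string object itself.
fields = LINE.split()

# With no argument, runs of whitespace count as a single separator.
collapsed = 'a  b'.split()

# With an explicit ' ', every single space is a separator, so a double
# space yields an empty string between the two fields.
explicit = 'a  b'.split(' ')
```

For single-space-separated data like the sample line, both forms give the
same fields.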
/Fredrik
-- results --
running on a constant string 100000 times...
original 4.63118143082 [4.543, 4.645, 4.638, 4.670, 4.659]
in_func 4.38402023315 [4.317, 4.408, 4.414, 4.390, 4.391]
splitting_on_space 4.64088983536 [4.643, 4.647, 4.640, 4.642, 4.633]
running on a real 10000 item array...
normal_for 15.3752954006 [15.440, 15.350, 15.388, 15.346, 15.353]
index_for 16.7716764212 [16.876, 16.843, 16.685, 16.771, 16.683]
index_for_using_xrange 16.8590580225 [16.830, 16.781, 16.769, 16.780, 17.135]
index_for_local_var 15.8590993881 [15.728, 15.895, 15.892, 15.883, 15.896]
map_split 22.4902464151 [22.262, 22.339, 22.727, 23.051, 22.073]
map_split_local_var 22.2637700081 [22.089, 22.436, 22.830, 21.799, 22.166]
string_method 7.49720318317 [7.569, 7.486, 7.481, 7.481, 7.469]
list_comp_split 19.7473443985 [19.551, 19.909, 20.321, 19.301, 19.653]
-- code (not usenet friendly... long lines) --
from time import time
# probably better to use the profile module... but this is simple.
from string import split
# create long list...
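# (the brackets matter: without them this is one long string, and the loops below would iterate over single characters)
large_list = ['xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'] * 10000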
# Functions
def in_func():
    for i in range(100000):
        array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')

def splitting_on_space():
    for i in range(100000):
        array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2',' ')

def normal_for():
    for i in large_list:
        array = split(i)

def index_for():
    for i in range(len(large_list)):
        array = split(large_list[i])

def index_for_using_xrange():
    for i in xrange(len(large_list)):
        array = split(large_list[i])

def index_for_local_var():
    # local names avoid repeated global lookups inside the loop
    mylist = large_list
    mysplit = split
    for i in range(len(mylist)):
        array = mysplit(mylist[i])

def map_split():
    for array in map(split,large_list):
        pass

def map_split_local_var():
    mysplit = split
    mylist = large_list
    for array in map(mysplit,mylist):
        pass

def string_method():
    for i in large_list:
        array = i.split()

def list_comp_split():
    for array in [l.split() for l in large_list]:
        pass

funcs = [in_func,splitting_on_space,
         normal_for,index_for,index_for_using_xrange,index_for_local_var,
         map_split,map_split_local_var,string_method,list_comp_split]

#Timings...
def avg(list):
    from operator import add
    return reduce(add,list)/len(list)

times = {}

def process_time(name,diff):
    # record this run's elapsed time, then print the running average and all runs so far
    times.setdefault(name,[]).append(diff)
    print name, avg(times[name]),
    print "[" + ", ".join(["%.3f" % t for t in times[name]]) + "]"

for l in range(5):
    print "\n\nRun no.",l+1
    start = time()
    for i in range(100000):
        array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
    end = time()
    process_time("original",end-start)
    for func in funcs:
        start = time()
        func()
        end = time()
        process_time(func.func_name,end-start)
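
For anyone repeating this today, the hand-rolled start/end bookkeeping
could be replaced with the standard `timeit` module. The following is only
a sketch under stated assumptions: it targets Python 3, re-creates just two
of the variants, and the helper `time_funcs` and its `repeat`/`number`
defaults are made up for illustration. It does not reproduce the numbers
reported above.

```python
import timeit

# Illustrative sketch (Python 3 assumed); time_funcs is a hypothetical helper.
LINE = 'xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'
LINES = [LINE] * 10000


def string_method():
    for row in LINES:
        fields = row.split()


def list_comp_split():
    for fields in [row.split() for row in LINES]:
        pass


def time_funcs(funcs, repeat=5, number=10):
    """Return {function name: [total seconds for each of `repeat` runs]}."""
    results = {}
    for func in funcs:
        # timeit.repeat accepts a callable and calls it `number` times per run.
        results[func.__name__] = timeit.repeat(func, repeat=repeat, number=number)
    return results


if __name__ == "__main__":
    for name, runs in time_funcs([string_method, list_comp_split]).items():
        print(name, min(runs), "[" + ", ".join("%.3f" % t for t in runs) + "]")
```

Taking the minimum of the runs, rather than the average, is the usual
`timeit` advice, since slower runs mostly reflect other activity on the
machine.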