speed of string chunks file parsing
andrew cooke
andrew at acooke.org
Mon Apr 6 12:09:42 EDT 2009
[disclaimer - this is just guessing from general knowledge of regular
expressions; i don't know any details of python's regexp engine]
if your regular expression is the bottleneck, rewrite it to avoid lazy
matching, back-references, groups, lookbehinds, and perhaps even counted repeats.
with a little thought you can do almost everything using just choices
'(a|b)' and repeat 'a*'. even if the expression is longer, it will
probably be faster.
character ranges - either explicit '[a-z]' or predefined '\w' (even '.') -
should be fine, but try to avoid having multiple occurrences of ".*".
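to make that concrete, here is a small sketch (the patterns are made up,
not from the original post): a lazy repeat like '.+?' can usually be
replaced by an explicit character class, which matches the same text
without any lazy backtracking.

```python
import re

# hypothetical patterns for illustration; neither is from the original post
lazy = re.compile(r'<(.+?)>')      # lazy repeat - engine must try, backtrack, retry
plain = re.compile(r'<([^>]*)>')   # character class - matches greedily, no laziness needed

text = '<keyword> and some trailing text'
print(lazy.match(text).group(1))   # keyword
print(plain.match(text).group(1))  # keyword
```

both patterns extract the same tag name, but the second one never has to
backtrack: '[^>]*' can only ever stop at the '>'.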
see the timeit module for testing the speed of small chunks of code.
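for example, something like this compares two compiled expressions
directly (again, the patterns and counts here are invented for the sake
of the sketch):

```python
import re
import timeit

# hypothetical patterns, just to show the measurement technique
lazy = re.compile(r'<(.+?)>')
plain = re.compile(r'<([^>]*)>')
text = '<keyword>' + 'x' * 100

# time each compiled pattern over many repetitions
t_lazy = timeit.timeit(lambda: lazy.match(text), number=10000)
t_plain = timeit.timeit(lambda: plain.match(text), number=10000)
print('lazy:', t_lazy, 'plain:', t_plain)
```

the absolute numbers depend on the machine; what matters is the ratio
between the two timings for your real pattern and real input lines.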
andrew
Hyunchul Kim wrote:
> Hi, all
>
> I have a simple script.
> Can you improve the algorithm of the following 10-line script, with an
> eye to speed?
> It does exactly what I want, but I want it to run faster.
>
> It parses a file and accumulates lines until a line matches a given
> regular expression.
> When a line matches, the function yields the lines accumulated before
> the matched line.
>
> ****************
> import re
>
> def chunks(inputfile):
>     resultlist = []
>     cp_regularexpression = re.compile('^a complex regular expression here$')
>     for line in open(inputfile):
>         if cp_regularexpression.match(line):
>             if resultlist != []:
>                 yield resultlist
>                 resultlist = []
>         resultlist.append(line)
>     yield resultlist
> ****************
>
> Thank you in advance,
>
> Hyunchul
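for reference, here is a self-contained, runnable version of the quoted
approach (function and variable names are mine, and the pattern and data
are invented for the demonstration):

```python
import io
import re

def chunks(lines, pattern):
    """yield lists of lines; each new list starts at a line matching pattern.

    a sketch of the quoted script's logic, parameterised for testing."""
    cp = re.compile(pattern)
    resultlist = []
    for line in lines:
        if cp.match(line) and resultlist:
            yield resultlist       # emit the chunk before the matched line
            resultlist = []
        resultlist.append(line)
    if resultlist:
        yield resultlist           # emit the final chunk

data = io.StringIO('HEADER 1\nbody a\nHEADER 2\nbody b\n')
for chunk in chunks(data, r'^HEADER'):
    print(chunk)
# -> ['HEADER 1\n', 'body a\n']
# -> ['HEADER 2\n', 'body b\n']
```

using io.StringIO instead of a filename keeps the example testable; the
generator works the same over any iterable of lines, including a file object.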