speed of string chunks file parsing

andrew cooke andrew at acooke.org
Mon Apr 6 12:09:42 EDT 2009


[disclaimer - this is just guessing from general knowledge of regular
expressions; i don't know any details of python's regexp engine]

if your regular expression is the bottleneck, rewrite it to avoid lazy
matching, backreferences, groups, lookbehinds, and perhaps even counted
repeats.  with a little thought you can do almost everything using just
choices '(a|b)' and repeat 'a*'.  even if the expression is longer, it will
probably be faster.
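
a rough sketch of that kind of rewrite.  both patterns are made up for
illustration (they are not from the original script) and are only meant to
accept the same strings; whether the second is actually faster in python's
engine is something you would have to measure.

```python
import re

# Hypothetical patterns for illustration; neither comes from the original post.
# Both accept "ID" followed by two or more digits, e.g. "ID42" or "ID2009".
counted = re.compile(r'^ID(\d){2,}?$')  # capturing group + lazy counted repeat
simple = re.compile(r'^ID\d\d\d*$')     # only literals, a class, and a plain '*'

# Check that the rewrite accepts and rejects the same sample strings.
samples = ['ID42', 'ID2009', 'ID7', 'XD42', 'ID']
agree = all(bool(counted.match(s)) == bool(simple.match(s)) for s in samples)
print(agree)
```

checking the two versions against sample lines like this is a cheap way to
make sure the rewrite did not change what gets matched.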

character ranges - either explicit '[a-z]' or predefined '\w' (even '.') -
should be fine, but try to avoid having multiple occurrences of ".*".
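
for example (again with invented patterns, assuming a line made of
colon-separated fields), several '.*' can often be replaced by a character
class that stops at the delimiter.  note the two are not strictly
equivalent, so this is a sketch rather than a drop-in replacement.

```python
import re

# Hypothetical example: a line with three colon-separated fields.
greedy = re.compile(r'^.*:.*:.*$')             # three '.*' that can each backtrack
classes = re.compile(r'^[^:]*:[^:]*:[^:]*$')   # each field stops at the next ':'

line = 'alpha:beta:gamma'
both_match = bool(greedy.match(line)) and bool(classes.match(line))

# The patterns differ on lines with extra colons: the '.*' version still
# matches, while the character-class version requires exactly two colons.
extra = 'a:b:c:d'
greedy_extra = greedy.match(extra) is not None
classes_extra = classes.match(extra) is not None
```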

see the timeit module for testing the speed of small chunks of code.
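
a minimal sketch of such a measurement, assuming two candidate patterns and
a sample line that are invented here for illustration.  the timings depend
on the machine, the python version and the input, so no result is implied.

```python
import re
import timeit

# Hypothetical patterns and input line, not taken from the original script.
lazy = re.compile(r'"(.*?)"')    # lazy repeat with a capturing group
plain = re.compile(r'"[^"]*"')   # character class with a plain repeat
line = 'key = "some value" # trailing text'

# timeit.timeit accepts a callable and runs it `number` times,
# returning the total elapsed time in seconds.
lazy_time = timeit.timeit(lambda: lazy.search(line), number=10000)
plain_time = timeit.timeit(lambda: plain.search(line), number=10000)

print('lazy: %.4fs  plain: %.4fs' % (lazy_time, plain_time))
```

use lines from the real input file rather than a toy string, since the
relative cost of two patterns can change with the data.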

andrew



Hyunchul Kim wrote:
> Hi, all
>
> I have a simple script.
> Can you improve the algorithm of the following 10-line script with
> regard to speed?
> The script does exactly what I want, but I want to make it faster.
>
> It parses a file and accumulates lines until a line matches a given
> regular expression.
> Then, when a line matches, the function yields the lines accumulated
> before the matching line.
>
> ****************
> import re
> resultlist = []
> cp_regularexpression = re.compile('^a complex regular expression here$')
> for line in file(inputfile):
>         if cp_regularexpression.match(line):
>                 if resultlist != []:
>                         yield resultlist
>                         resultlist = []
>         resultlist.append(line)
> yield resultlist
> ****************
>
> Thank you in advance,
>
> Hyunchul




