performance problem with time.strptime()

Thu Jul 2 09:00:44 EDT 2009

Nils Rüttershoff wrote:
> Hi everyone,
>
> In my free time I translate scripts from open source projects or write
> my own, to train my python skills. ATM I convert the aplogmerge.pl from
> awstats. It merges multiple apache logfiles and sort the output by the
> timestamps of each line.  My first version of this script hasn't a good
> performance, so I started profiling. It turned out that the script spend
> much time for converting the timestamps of the line into a struct_time
> object. Here a code example (Python 2.6.2 on Ubuntu 7.10):
>
> Rec = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2})\s\+\d{4}\].*")
> Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"'
>
> def strptime():
>     m = Rec.match(Line)
>     if m:
>         date_string = m.group(1)
> 	# date_string example: '02/Jul/2009:01:50:26'
>         return time.strptime(date_string, "%d/%b/%Y:%H:%M:%S")
>
> with timeit this functions takes approximate 125 sec and 29.004.081
> function calls (I've configured timeit with 1.000.000 runs). A look at
> the output of cProfile told me that more than the half time is spent in
> locale.py:
>
>   1000002   11.712    0.000   19.592    0.000 locale.py:316(normalize)
>
>   1000002    3.639    0.000   23.231    0.000 locale.py:382(_parse_localename)
>
>   1000002    5.162    0.000   30.271    0.000 locale.py:481(getlocale)
>
>
>
> I studied the time documentation and thought that I had to set TZ in os
> environ:
>
> os.environ['TZ'] = 'Europe/Berlin'
>
> time.set()
>
>
> but that had no effect. I don't know why time.strptime() looks every
> time for my locale. Maybe it's a problem with my OS locale... However.
> I've introduced a work around, which works perfectly for my problem. For
> time comparison I could use any sort of time representation and so I
> convert to epoch:
>
> # needed to translate month str to dig repr
>
> Shortmonth = {'Jan' : '01',
>
>               'Feb' : '02',
>
>               'Mar' : '03',
>
>               'Apr' : '04',
>
>               'May' : '05',
>
>               'Jun' : '06',
>
>               'Jul' : '07',
>
>               'Aug' : '08',
>
>               'Sep' : '09',
>
>               'Oct' : '10',
>
>               'Nov' : '11',
>
>               'Dec' : '12'}
>
> Rec = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(?P<day>\d{2})/(?P<month>\w+)/(?P<year>\d{4}):(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})\s\+\d{4}\].*")
>
> Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"'
>
> def epoch():
>
>     m = Rec.match(Line)
>
>     if m:
>
>         result = m.groupdict()
>
>         date_tuple = (result["year"], Shortmonth[result["month"]], result["day"], result["hour"], result["min"], result["sec"], -1, -1, -1)
>
>         date_tuple = map(int,date_tuple)
>
>         return time.mktime(date_tuple)
>
>
> with this workaround I had a speed up to 4 times; it tooks only 31 sec
> with only 5.000.009 function calls. Maybe this helps some of you, who
> had the similar problems with time conversion....
>
> ...But one big question remains: Why time.strptime() checks everytime
> the locale? had I missed something or could I have a problem with my OS
> locale?
>
> With python 3.1 there is no difference, unless that time.strptime() took
> approx 12 sec longer... :(
>
> regards, Nils
>
>
>
> Here a complete test script:
>
> #!/opt/python/2.6.2/bin/python
>
> import time
>
> import timeit
>
> import cProfile
>
> import re
>
> # needed to tranlate month str to dig repr
>
> Shortmonth = {'Jan' : '01',
>
>               'Feb' : '02',
>
>               'Mar' : '03',
>
>               'Apr' : '04',
>
>               'May' : '05',
>
>               'Jun' : '06',
>
>               'Jul' : '07',
>
>               'Aug' : '08',
>
>               'Sep' : '09',
>
>               'Oct' : '10',
>
>               'Nov' : '11',
>
>               'Dec' : '12'}
>
> Rec1 = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(?P<day>\d{2})/(?P<month>\w+)/(?P<year>\d{4}):(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})\s\+\d{4}\].*")
>
> Rec2 = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2})\s\+\d{4}\].*")
>
> Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"'
>
> def epoch():
>
>     m = Rec1.match(Line)
>
>     if m:
>
>         result = m.groupdict()
>
>         date_tuple = (result["year"], Shortmonth[result["month"]], result["day"], result["hour"], result["min"], result["sec"], -1, -1, -1)
>
>         date_tuple = map(int,date_tuple)
>
>         return time.mktime(date_tuple)
>
> def strptime():
>
>     m = Rec2.match(Line)
>
>     if m:
>
>         date_string = m.group(1)
>
>         return time.strptime(date_string, "%d/%b/%Y:%H:%M:%S")
>
> if __name__ == "__main__":
>
>     t1 = timeit.Timer("epoch()","from __main__ import epoch")
>
>     t2 = timeit.Timer("strptime()", "from __main__ import strptime")
>
>     cProfile.run("t1.timeit();print")
>
>     print ""
>
>     cProfile.run("t2.timeit();print")
>
>
>
>   
As you say, if you don't actually need the datetime fields, why waste 
time generating them.  You gained a lot more than the time spent in 
locale(), so your algorithm must be faster than the one in strptime().  
That frequently happens when you write a special case for code which is 
otherwise general.

Notice that only a quarter of the time is spent in locale logic (30.271 
secs).  You can't add up the three terms since they're measuring the 
same thing.  Every call to getlocale() results in a call to

_parse_localename(), so the cumulative times of the latter are already included in the cumulative times in getlocale().

Why does it need to call locale() ?  How else is it going to know how to spell the 11th month?