Need a specific sort of string modification. Can someone help?
roy at panix.com
Sun Jan 6 18:28:55 CET 2013
In article <roy-103D43.15470305012013 at news.panix.com>,
Roy Smith <roy at panix.com> wrote:
> It's rare to find applications these days that are truly CPU bound.
> Once you've used some reasonable algorithm, i.e. not done anything in
> O(n^2) that could have been done in O(n) or O(n log n), you will more
> often run up against I/O speed, database speed, network latency, memory
> exhaustion, or some such as the reason your code is too slow.
Well, I just found a counter-example :-)
I've been doing some log analysis. It's been taking a grovelingly long
time, so I decided to fire up the profiler and see what's taking so
long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
might be (looking up IP addresses in the geolocation database, or
producing some pretty pictures using matplotlib). It was just a matter
of figuring out which it was.
As with most attempts to out-guess the profiler, I was totally,
absolutely, and embarrassingly wrong.
It turns out we were spending most of the time parsing timestamps!
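For anyone who wants to repeat the exercise, here is a minimal sketch of that kind of profiling run using the standard library's cProfile and pstats. The names analyze(), parse_field() and profile_report() are hypothetical stand-ins for illustration, not the actual log-analysis code.

```python
import cProfile
import io
import pstats


def parse_field(line):
    # Hypothetical per-line work standing in for the real parsing step.
    return sum(ord(ch) for ch in line)


def analyze(lines):
    # Hypothetical top-level job: process every line of the log.
    return [parse_field(line) for line in lines]


def profile_report(func, *args, top=10):
    """Run func under cProfile; return its result and a text report
    of the `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


sample = ["2013-01-06T18:28:55+00:00"] * 1000
result, report = profile_report(analyze, sample)
print(report)
```

Sorting by cumulative time puts the functions that account for the most total time (including their callees) at the top, which is usually where a wrong guess about the hotspot gets exposed.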
Since there's no convenient way (I don't consider strptime() to be
convenient) to parse isoformat strings in the standard library, our
habit has been to use the oh-so-simple parser from the third-party
dateutil package. Well, it turns out that's slow as all get-out
(probably because it's trying to be smart about auto-recognizing
formats). For the test I ran (on a few percent of the real data), we
spent 90 seconds in parse().
OK, so I dragged out the strptime() docs and built the stupid format
string (%Y-%m-%dT%H:%M:%S+00:00). That got the parsing time down to 25 seconds.
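A sketch of what that looks like with datetime.strptime() and the format string above. The helper name parse_utc_timestamp() and the sample timestamp are made up for illustration. Note that the "+00:00" in the format is matched as literal text, so this only accepts timestamps written with a zero offset, and the result is a naive datetime.

```python
from datetime import datetime

# The "+00:00" suffix is literal text here, not a parsed UTC offset.
ISO_UTC_FORMAT = "%Y-%m-%dT%H:%M:%S+00:00"


def parse_utc_timestamp(text):
    """Parse an isoformat-style string that ends in a literal +00:00."""
    return datetime.strptime(text, ISO_UTC_FORMAT)


stamp = parse_utc_timestamp("2013-01-06T18:28:55+00:00")
print(stamp.hour, stamp.minute)

# Any other offset fails to match the literal suffix.
try:
    parse_utc_timestamp("2013-01-06T18:28:55+01:00")
    rejected = False
except ValueError:
    rejected = True
```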
But, I could also see it was spending a significant amount in routines
that looked like they were computing things like day of the week that we
didn't need. For what I was doing, we only really needed the hour and
minute. So I tried:

    t_hour = int(date[11:13])
    t_minute = int(date[14:16])

That got us down to 12 seconds overall (including the geolocation and
the matplotlib plotting).
I think it turns out we never do anything with the hour and minute other
than print them back out, so just

    t_hour_minute = date[11:16]

would probably be good enough, but I think I'm going to stop where I am
and declare victory :-)
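For comparison, a rough sketch that times the slicing approach against strptime() on a single sample timestamp using timeit. The function names and the sample string are assumptions for illustration, and the numbers will vary by machine and Python version. The slicing version only works if every timestamp has exactly this fixed layout.

```python
import timeit
from datetime import datetime

SAMPLE = "2013-01-06T18:28:55+00:00"
FORMAT = "%Y-%m-%dT%H:%M:%S+00:00"


def hour_minute_by_strptime(date):
    # Full parse, then pick out the two fields we care about.
    parsed = datetime.strptime(date, FORMAT)
    return parsed.hour, parsed.minute


def hour_minute_by_slicing(date):
    # Relies on the fixed-width layout: hour at [11:13], minute at [14:16].
    return int(date[11:13]), int(date[14:16])


def hour_minute_text(date):
    # If the values are only printed back out, skip int() entirely.
    return date[11:16]


for func in (hour_minute_by_strptime, hour_minute_by_slicing, hour_minute_text):
    seconds = timeit.timeit(lambda: func(SAMPLE), number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10,000 calls")
```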