Match beginning of two strings
rxs141 at cwru.edu
Sun Aug 3 05:46:26 CEST 2003
> I don't think that os.path.commonprefix was designed with 200Gb of
> data in mind. Inspection of Lib/*path.py gives one the impression that
> it spends a lot of time discovering that the first element is a prefix
> of itself.
> Ravi, You may need to drop down to C to get the speed you want for
> your requirement to find the longest common prefix of two strings. Two
> things puzzling me: (1) how you would do this with regular expressions
> (2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
> after a year you have almost 9000Gb; where are you putting it all?
> BTW, I do hope your algorithm is not O(N**2) ...
Well, I timed os.path.commonprefix with some typical data and it's
pulling about 40usec per loop. So I did what I hated and coded a little
function in C. Goes something like this. My reasoning is that usually
the point where the strings start to differ is in the 30 - 50 character
range. Basically the idea is the same as a binary search on a sorted
array. Divide and conquer by going halfway each time.
Read in both strings.
Check to see if the first character matches.
Check halfway through the string and see if that character matches
Repeatedly check halfway until the difference point is found.
Go back through from the difference point backwards and make sure
the characters match from the start to the difference point.
I timed it, and it seems to be doing about 3.5usec per loop. Using
pipes, I can feed it directly into the processing program. Good enough
As for the data, it's data from a radio telescope that's being recorded.
We do pattern analysis and reduce it to strings. By examining these
strings (more analysis than just the first common bit of course), we can
determine what data should be looked at further and what data is
garbage. The 9000GB problem isn't all that bad, the stuff compresses
extremely well, down to about 700GB for a year's worth. A couple of RAID
arrays makes quick work of that.
More information about the Python-list