Match beginning of two strings

Ravi rxs141 at cwru.edu
Mon Aug 4 18:53:27 EDT 2003


Ravi wrote:
> Hi,
> 
> I have about 200GB of data that I need to go through and extract the 
> common first part of a line. Something like this.
> 
>  >>> a = "abcdefghijklmnopqrstuvwxyz"
>  >>> b = "abcdefghijklmnopBHLHT"
>  >>> c = extract(a,b)
>  >>> print c
> "abcdefghijklmnop"
> 
> Here I want to extract the common string "abcdefghijklmnop". Basically I 
> need a fast way to do that for any two given strings. For my situation, 
> the common string will always be at the beginning of both strings. I can 
> use regular expressions to do this, but from what I understand there is 
> a lot of overhead. New data is being generated at the rate of about 1GB 
> per hour, so this needs to be reasonably fast while leaving CPU time for 
> other processes.
> 
> Thanks
> Ravi
> 

I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie 
and Bengt. However, I have this problem taken care of now. It took around 
6 hours to run on a 2.8 GHz P4 with 1.0 GB of DDR RAM (I suspect I/O 
limitations). As for the data, if you want to know about it just for the 
sake of an optimized algorithm, there are no NUL (\0) characters in the 
strings (actually they're Base64), and I've included a typical pair of 
strings. The version I used was Andrew's.

Someone suggested that this would be better done in larger sets than 
just pairs. That's not suitable because of the structure of the data: 
the two strings in a pair might be highly correlated, but they are 
probably quite different from any other pair of strings. Perhaps more 
significantly, correlation in sets of more than two has no physical 
significance to the experiment.

I grabbed this pair from a typical data file. From it, I would want to 
extract 'A832nv81a':
"
A832nv81a81nW103v9c24jgpy92T
A832nv81aTyqiep4v9c324jgpy92T
"

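For anyone reading this later, here is a minimal sketch of the kind of 
extraction described above. It is not necessarily the version Andrew 
posted; the function names common_prefix and common_prefix_loop are made 
up for illustration. The first relies on os.path.commonprefix from the 
standard library, which compares character by character, and the second 
is an explicit loop for comparison.

```python
import os.path


def common_prefix(a, b):
    """Return the longest string that both a and b start with."""
    # os.path.commonprefix works character by character (not by path
    # component), so it can be used on arbitrary strings.
    return os.path.commonprefix([a, b])


def common_prefix_loop(a, b):
    """Same result as common_prefix, written as an explicit loop."""
    limit = min(len(a), len(b))
    i = 0
    # Advance until the strings differ or the shorter one runs out.
    while i < limit and a[i] == b[i]:
        i += 1
    return a[:i]


if __name__ == "__main__":
    first = "A832nv81a81nW103v9c24jgpy92T"
    second = "A832nv81aTyqiep4v9c324jgpy92T"
    print(common_prefix(first, second))
    print(common_prefix_loop(first, second))
```

Both functions only look at a single pair of strings at a time, which 
matches the pairwise structure of the data described above. Which one is 
faster on a given machine would need to be measured.
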
Thanks for your help, everyone. Coming from a Perl world (it's a 
four-letter word to me :), I'm very impressed by how helpful all of you 
are.

Ravi




