Match beginning of two strings
Ravi
rxs141 at cwru.edu
Mon Aug 4 18:53:27 EDT 2003
Ravi wrote:
> Hi,
>
> I have about 200GB of data that I need to go through and extract the
> common first part of a line. Something like this.
>
> >>> a = "abcdefghijklmnopqrstuvwxyz"
> >>> b = "abcdefghijklmnopBHLHT"
> >>> c = extract(a, b)
> >>> print c
> abcdefghijklmnop
>
> Here I want to extract the common string "abcdefghijklmnop". Basically I
> need a fast way to do that for any two given strings. For my situation,
> the common string will always be at the beginning of both strings. I can
> use regular expressions to do this, but from what I understand there is
> a lot of overhead. New data is being generated at the rate of about 1GB
> per hour, so this needs to be reasonably fast while leaving CPU time for
> other processes.
>
> Thanks
> Ravi
>
I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie
and Bengt. However, I have this problem taken care of now. The run took
around 6 hours on a 2.8 GHz P4 with 1.0 GB of DDR RAM (I suspect I/O
limitations). As for the data, in case you want to know about it for the
sake of an optimized algorithm: there are no null (\0) characters in the
strings (they're actually Base64), and I've included a typical pair of
strings below. The version I used was Andrew's.
Someone suggested that this would be better done in larger sets than
just pairs. That's not suitable because of the structure of the data:
two strings in a pair might be highly correlated, but they are probably
quite different from any other pair of strings. Perhaps more
significantly, correlation in sets of more than two has no physical
significance to the experiment.
I grabbed the pair below from a typical data file. From it, I would want
to extract 'A832nv81a'.
"
A832nv81a81nW103v9c24jgpy92T
A832nv81aTyqiep4v9c324jgpy92T
"
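For anyone finding this thread later, here is a minimal sketch of pairwise
prefix extraction on the pair above. It is not necessarily the version of
Andrew's that I used; the name common_prefix is just an illustrative
choice. It compares the two strings character by character and stops at
the first mismatch. The standard library's os.path.commonprefix does the
same character-wise comparison on a list of strings, so it is shown
alongside as a cross-check.

```python
import os.path


def common_prefix(a, b):
    """Return the longest string that both a and b start with.

    Hypothetical helper for illustration; not the code from the thread.
    """
    limit = min(len(a), len(b))
    i = 0
    # Advance while the characters at the same position agree.
    while i < limit and a[i] == b[i]:
        i += 1
    return a[:i]


first = "A832nv81a81nW103v9c24jgpy92T"
second = "A832nv81aTyqiep4v9c324jgpy92T"

print(common_prefix(first, second))
# os.path.commonprefix works on characters, not path components.
print(os.path.commonprefix([first, second]))
```

Both calls should print A832nv81a for this pair. For long strings that
share most of their length, a pure-Python loop like this may be slow, so
treat it as a reference for correctness rather than a tuned solution.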
Thanks for your help, everyone. Coming from a Perl world (it's a
four-letter word to me :), I'm very impressed by how helpful all of you are.
Ravi