prefix search on a large file
js
ebgssth at gmail.com
Thu Oct 12 04:45:27 EDT 2006
Hello, list.
I have a list of sentences in text files that I use to filter out some data.
I managed the list so badly that it has become a real mess.
Let's say the list contains the sentences below:
1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "
2. "Python has been an important part of Google"
3. "important part of Google"
As you can see, sentence 2 is a prefix of sentence 1,
so I don't need to have sentence 1 on the list.
(For some reason, it's no problem to have sentence 3.
The only sentences I want to remove are those that begin with
another sentence on the list.)
So I decided to clean up the list.
I tried to do this in a simple brute-force manner, like
---------------------------------------------------------------------------
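# Brute force: every line is compared with every other line (quadratic time).
with open('thelist') as f:
    sorted_list = sorted((l.rstrip('\n') for l in f), key=len)
for line in sorted_list[:]:
    # Longer lines that start with `line` are redundant; `line` itself stays.
    unneeded = [line2 for line2 in sorted_list
                if line2 != line and line2.startswith(line)]
    sorted_list = list(set(sorted_list) - set(unneeded))
....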
---------------------------------------------------------------------------
This is too slow to be useful because the list is
so big (more than 100 MB, about 3 million lines)
and I have more than 100 lists.
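
One possible way around the quadratic cost, as a minimal sketch rather than
a tested solution: after a plain lexicographic sort, every line that starts
with a given sentence ends up directly behind that sentence, so a single pass
that compares each line with the last kept line is enough. The function name
drop_prefixed is hypothetical, and the sketch assumes that one list fits in
memory and that blank lines can be ignored.

```python
def drop_prefixed(lines):
    """Return the lines that do not start with another line of the input.

    Hypothetical helper: trailing newlines are stripped, and duplicates
    and empty lines are dropped (an assumption, not a stated requirement).
    """
    cleaned = {line.rstrip("\n") for line in lines}
    cleaned.discard("")  # an empty line would be a prefix of everything

    kept = []
    # Lexicographic order places every extension of a sentence right after it.
    for line in sorted(cleaned):
        if kept and line.startswith(kept[-1]):
            continue  # begins with a sentence we already keep: redundant
        kept.append(line)
    return kept


if __name__ == "__main__":
    sentences = [
        "Python has been an important part of Google since the beginning, "
        "and remains so as the system grows and evolves. ",
        "Python has been an important part of Google",
        "important part of Google",
    ]
    for sentence in drop_prefixed(sentences):
        print(sentence)
```

On the three example sentences this should keep sentences 2 and 3 and drop
sentence 1. The sort dominates the cost (roughly n log n comparisons), and
passing an open file object instead of a list should also work, since the
function only iterates over its argument.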
I'm not familiar with algorithms, data structures, or large-scale data processing,
so any advice, suggestions, or recommendations will be appreciated.
Thank you in advance.