prefix search on a large file
John Machin
sjmachin at lexicon.net
Thu Oct 12 06:08:57 EDT 2006
js wrote:
> Hello, list.
>
> I have a list of sentence in text files that I use to filter-out some data.
> I managed the list so badly that now it's become literally a mess.
>
> Let's say the list has a sentence below
>
> 1. "Python has been an important part of Google since the beginning,
> and remains so as the system grows and evolves. "
>
> 2. "Python has been an important part of Google"
>
> 3. "important part of Google"
>
> As you see sentence 2 is a subset of sentence 1
> so I don't need to have sentence 1 on the list.
Are you 100% sure that you wnat to throw away the *longer* sentence?
> (For some reason, it's no problem to have sentence 3.
> Only sentence that has the "same prefix part" is the one I want to remove)
>
> So I decided to clean up the list.
>
> I tried to do this simple brute-force manner, like
Please don't waste time and bandwith with "like"; post the exact code
that you executed.
>
> ---------------------------------------------------------------------------
> sorted_list = sorted(file('thelist'), key=len)
> for line in sorted_list[:]
# won't compile, missing a colon at end
> unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
> sorted_list = list(set(sorted_list) - (unneeded))
# can't do set - list
# Presuming you mean - set(unneeded), but then produces an empty list
(see later).
# Naming the output "sorted_list" is misleading advertising.
# With sorted_list[:] in an inner loop, it's no wonder that it runs
slowly.
# Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data.
# Be careful of end cases; note that in my solutions below, the first
or last item needs special handling.
> ....
> ---------------------------------------------------------------------------
>
> This is so slow and not so helpful because the list is
> so big(more than 100M bytes and has about 3 million lines)
> and I have more than 100 lists.
Here's one way of doing it, tested to the extent shown:
C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]
"""
> sorted_list = sorted(file('thelist'), key=len)
> for line in sorted_list[:]
> unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
> sorted_list = list(set(sorted_list) - set(unneeded))
"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)
ilist = sorted(data)
print "sorted input"; pp(ilist)
olist = []
for i in xrange(len(ilist)-1, 0, -1):
if ilist[i].startswith(ilist[i-1]):
continue
olist.append(ilist[i])
olist.append(ilist[0])
print "output (keeping shorter)"; pp(olist)
olist = []
for i in xrange(len(ilist)-1):
if ilist[i+1].startswith(ilist[i]):
continue
olist.append(ilist[i])
olist.append(ilist[-1])
print "output (keeping longer)"; pp(olist)
C:\junk>prefixdel.py
OP trial 1 input
['foo',
'food',
'food',
'zzzz',
'xyzzy',
'plugh',
'foo bar',
'foo bar baz',
'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
'foo bar',
'foo bar baz',
'food',
'food',
'plugh',
'xyzzy',
'xyzzy and plugh are magic',
'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']
HTH,
John
More information about the Python-list
mailing list