How to remove subset from a file efficiently?
Tim Williams (gmail)
tdwdotnet at gmail.com
Thu Jan 12 12:19:37 EST 2006
On 12 Jan 2006 09:04:21 -0800, fynali <iladijas at gmail.com> wrote:
>
> Hi all,
>
> I have two files:
>
> - PSP0000320.dat (quite a large list of mobile numbers),
> - CBR0000319.dat (a subset of the above, a list of barred numbers)
>
> # head PSP0000320.dat CBR0000319.dat
> ==> PSP0000320.dat <==
> 96653696338
> 96653766996
> 96654609431
> 96654722608
> 96654738074
> 96655697044
> 96655824738
> 96656190117
> 96656256762
> 96656263751
>
> ==> CBR0000319.dat <==
> 96651131135
> 96651131135
> 96651420412
> 96651730095
> 96652399117
> 96652399142
> 96652399142
> 96652399142
> 96652399160
> 96652399271
>
> Objective: to remove the numbers present in barred-list from the
> PSPfile.
>
> $ ls -lh PSP0000320.dat CBR0000319.dat
> ... 56M Dec 28 19:41 PSP0000320.dat
> ... 8.6M Dec 28 19:40 CBR0000319.dat
>
> $ wc -l PSP0000320.dat CBR0000319.dat
> 4,462,603 PSP0000320.dat
> 693,585 CBR0000319.dat
>
> I wrote the following in python to do it:
>
> #: c01:rmcommon.py
> barredfile = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
> postfile = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
> outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
>
> # reading it all in one go, so as to avoid frequent disk accesses
> # (assume machine has plenty memory)
> barrlist = barredfile.readlines()
> postlist = postfile.readlines()
>
> # keep only the numbers that are not in the barred list
> for number in postlist:
>     if number in barrlist:
>         pass
>     else:
>         outfile.write(number)
>
> barredfile.close(); postfile.close(); outfile.close()
> #:~
>
> The above code simply takes too long to complete. If I were to do a
> diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
> sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
> complete.
It should be quicker to drop the empty branch and test the negation directly:
#
for number in postlist:
    if number not in barrlist:
        outfile.write(number)
and quicker still to build the result first and write it out in one call:
#
numbers = [number for number in postlist if number not in barrlist]
outfile.write(''.join(numbers))
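One further option, which is not in the snippets above: `in` on a list scans the list for every number, so with about 4.4 million numbers checked against about 690 thousand barred entries, most of the time probably goes into the membership test itself. A minimal sketch, assuming the barred numbers fit comfortably in memory, is to load them into a set, where lookups take roughly constant time on average. The function name `remove_barred` and its parameters are made up for illustration.

```python
def remove_barred(post_path, barred_path, out_path):
    """Copy lines of post_path that do not appear in barred_path to out_path.

    Returns the number of lines written. Hypothetical helper, for illustration.
    """
    with open(barred_path) as barred_file:
        # A set collapses duplicate barred numbers and gives average O(1)
        # membership tests, instead of a full scan as with a list.
        barred = {line.strip() for line in barred_file}

    kept = 0
    with open(post_path) as post_file, open(out_path, "w") as out_file:
        # Iterate the large file line by line; only the barred set is
        # held in memory.
        for line in post_file:
            if line.strip() not in barred:
                out_file.write(line)
                kept += 1
    return kept
```

It could be called as `remove_barred('PSP0000320.dat', 'CBR0000319.dat', 'PSP-CBR.dat')`. Stripping each line before the comparison is a design choice so that a missing trailing newline on the last line does not cause a mismatch.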
HTH