How to remove subset from a file efficiently?
Tim Williams (gmail)
tdwdotnet at gmail.com
Thu Jan 12 12:35:57 EST 2006
On 12/01/06, Tim Williams (gmail) <tdwdotnet at gmail.com> wrote:
>
>
>
> On 12 Jan 2006 09:04:21 -0800, fynali <iladijas at gmail.com> wrote:
> >
> > Hi all,
> >
> > I have two files:
> >
> > - PSP0000320.dat (quite a large list of mobile numbers),
> > - CBR0000319.dat (a subset of the above, a list of barred numbers)
> >
> > # head PSP0000320.dat CBR0000319.dat
> > ==> PSP0000320.dat <==
> > 96653696338
> > 96653766996
> > 96654609431
> > 96654722608
> > 96654738074
> > 96655697044
> > 96655824738
> > 96656190117
> > 96656256762
> > 96656263751
> >
> > ==> CBR0000319.dat <==
> > 96651131135
> > 96651131135
> > 96651420412
> > 96651730095
> > 96652399117
> > 96652399142
> > 96652399142
> > 96652399142
> > 96652399160
> > 96652399271
> >
> > Objective: to remove the numbers present in barred-list from the
> > PSPfile.
> >
> > $ ls -lh PSP0000320.dat CBR0000319.dat
> > ... 56M Dec 28 19:41 PSP0000320.dat
> > ... 8.6M Dec 28 19:40 CBR0000319.dat
> >
> > $ wc -l PSP0000320.dat CBR0000319.dat
> > 4,462,603 PSP0000320.dat
> > 693,585 CBR0000319.dat
> >
> > I wrote the following in python to do it:
> >
> > #: c01:rmcommon.py
> > barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
> > postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
> > outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
> >
> > # reading it all in one go, so as to avoid frequent disk accesses
> > # (assume machine has plenty memory)
> > barredlist.read()
> > postlist.read()
> >
> > #
> > for number in postlist:
> >     if number in barredlist:
> >         pass
> >     else:
> >         outfile.write(number)
> >
> > barredlist.close(); postlist.close(); outfile.close()
> > #:~
> >
> > The above code simply takes too long to complete. If I were to do a
> > diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
> > sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
> > complete.
>
>
>
> It should be quicker to do this
>
> #
> for number in postlist:
>     if not number in barredlist:
>         outfile.write(number)
>
>
> and quicker doing this
>
> #
> numbers = [number for number in postlist if not number in barredlist]
> outfile.writelines(numbers)
>
I forgot to add this one:

for num in (number for number in postlist if not number in barredlist):
    outfile.write(num)
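
All of the variants above still test membership with "in" against a file object or a list, which scans the barred list once for every number in the PSP file. A minimal sketch of a set-based alternative follows. This is a swapped-in technique, not something from the original code; the function name remove_barred and its three path parameters are made up for illustration, and it assumes one number per line and enough memory to hold the barred list.

```python
def remove_barred(post_path, barred_path, out_path):
    """Copy post_path to out_path, skipping lines that appear in barred_path.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the number of lines written.
    """
    # Load the barred numbers into a set once. Membership tests on a set
    # take roughly constant time, and duplicates in the file collapse.
    with open(barred_path) as barred_file:
        barred = {line.strip() for line in barred_file}

    kept = 0
    with open(post_path) as post_file, open(out_path, "w") as out_file:
        # Stream the large file line by line instead of reading it whole.
        for line in post_file:
            number = line.strip()
            if number and number not in barred:
                out_file.write(number + "\n")
                kept += 1
    return kept
```

With the files from the original post this would be called as remove_barred('PSP0000320.dat', 'CBR0000319.dat', 'PSP-CBR.dat'). Stripping each line before the comparison avoids mismatches caused by a missing newline on the last line of either file.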