How to remove subset from a file efficiently?

Tim Williams (gmail) tdwdotnet at gmail.com
Thu Jan 12 12:19:37 EST 2006


On 12 Jan 2006 09:04:21 -0800, fynali <iladijas at gmail.com> wrote:
>
> Hi all,
>
> I have two files:
>
>   - PSP0000320.dat (quite a large list of mobile numbers),
>   - CBR0000319.dat (a subset of the above, a list of barred numbers)
>
>     # head PSP0000320.dat CBR0000319.dat
>     ==> PSP0000320.dat <==
>     96653696338
>     96653766996
>     96654609431
>     96654722608
>     96654738074
>     96655697044
>     96655824738
>     96656190117
>     96656256762
>     96656263751
>
>     ==> CBR0000319.dat <==
>     96651131135
>     96651131135
>     96651420412
>     96651730095
>     96652399117
>     96652399142
>     96652399142
>     96652399142
>     96652399160
>     96652399271
>
> Objective: to remove the numbers present in barred-list from the
> PSPfile.
>
>     $ ls -lh PSP0000320.dat CBR0000319.dat
>     ...  56M Dec 28 19:41 PSP0000320.dat
>     ... 8.6M Dec 28 19:40 CBR0000319.dat
>
>     $ wc -l PSP0000320.dat CBR0000319.dat
>      4,462,603 PSP0000320.dat
>        693,585 CBR0000319.dat
>
> I wrote the following in python to do it:
>
>     #: c01:rmcommon.py
>     barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
>     postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
>     outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
>
>     # reading it all in one go, so as to avoid frequent disk accesses
>     # (assume machine has plenty memory)
>     barrlist = barredlist.readlines()
>     # postlist is left unread here so the loop below can iterate over it
>
>     #
>     for number in postlist:
>             if number in barrlist:
>                     pass
>             else:
>                     outfile.write(number)
>
>     barredlist.close(); postlist.close(); outfile.close()
>     #:~
>
> The above code simply takes too long to complete.  If I were to do a
> diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
> sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
> complete.



It should be a little quicker to drop the empty "pass" branch and test
for absence directly:

   #
   for number in postlist:
           if number not in barrlist:
                   outfile.write(number)


and quicker still to build the output with a list comprehension and
write it in a single call:

   #
   numbers = [number for number in postlist if number not in barrlist]
   outfile.write(''.join(numbers))
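
If barrlist is a list, each "in" test scans it from the start, so the
total work grows with the product of the two file lengths. A possible
alternative, sketched here under the assumption that the barred numbers
fit in memory, is to hold them in a set, where membership tests take
roughly constant time on average. The function name remove_barred and
its parameters are made up for illustration; they do not appear in the
original script.

```python
def remove_barred(post_lines, barred_lines):
    """Return the lines of post_lines that do not appear in barred_lines."""
    # A set gives average constant-time membership tests and also
    # collapses the duplicate entries seen in the barred file.
    barred = set(barred_lines)
    return [line for line in post_lines if line not in barred]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two .dat files, using a few of
    # the sample numbers quoted above (arrangement is illustrative).
    post = ["96653696338\n", "96651131135\n", "96654609431\n"]
    barred = ["96651131135\n", "96651131135\n"]
    print("".join(remove_barred(post, barred)), end="")
```

With the real files, the open file objects could be passed straight in,
for example outfile.writelines(remove_barred(postlist, barredlist)).
Lines are compared including their trailing newline, so a last line
without one would not match its counterpart in the other file.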

HTH