How to remove subset from a file efficiently?

Fredrik Lundh fredrik at pythonware.com
Thu Jan 12 12:34:05 EST 2006


"fynali" wrote:

> > Objective: to remove the numbers present in barred-list from the
> > PSPfile.
> >
> >     $ ls -lh PSP0000320.dat CBR0000319.dat
> >     ...  56M Dec 28 19:41 PSP0000320.dat
> >     ... 8.6M Dec 28 19:40 CBR0000319.dat
> >
> >    $ wc -l PSP0000320.dat CBR0000319.dat
> >      4,462,603 PSP0000320.dat
> >        693,585 CBR0000319.dat
> >
> > I wrote the following in python to do it:
> >
> >     #: c01:rmcommon.py
> >     barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
> >     postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
> >     outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
> >
> >     # reading it all in one go, so as to avoid frequent disk accesses
> >     (assume machine has plenty memory)
> >     barredlist.read()
> >     postlist.read()
> >
> >     #
> >     for number in postlist:
> >             if number in barrlist:
> >                     pass
> >             else:
> >                     outfile.write(number)
> >
> >     barredlist.close(); postlist.close(); outfile.close()
> >     #:~
> >
> > The above code simply takes too long to complete.

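the above code doesn't even run: the loop tests against "barrlist",
a name that is never defined (the file object is called "barredlist"),
and the wrapped comment line is a syntax error.  and even with those
fixed, the two read() calls throw their results away and leave postlist
at end of file, so the loop would have nothing to iterate over.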

(why is it that nobody remembers how to use cut and paste these
days?  has it perhaps been banned in some part of the world,
without me noticing?)

this might work a little better:

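        # a set gives fast membership tests; each element is one whole line
        barred = set(open('/home/sjd/python/wip/CBR0000319.dat'))

        infile = open('/home/sjd/python/wip/PSP0000320.dat')
        outfile = open('/home/sjd/python/wip/PSP-CBR.dat', 'w')

        # stream the big file line by line; keep only the lines not barred
        for number in infile:
            if number not in barred:
                outfile.write(number)

        infile.close(); outfile.close()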

if you feel adventurous, you can replace the for/if loop with

        outfile.writelines(number for number in infile if number not in barred)
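
a minimal, self-contained sketch of the same set-based filter, for trying the idea without the real files.  the sample numbers and the remove_barred name are made up for illustration.  it also strips line endings before comparing, which is an addition of mine: the lines read from a file include the trailing newline, so a last line without one would otherwise never match.

```python
import io


def remove_barred(barred_lines, post_lines, out):
    """Write every line of post_lines that is not in barred_lines to out.

    Hypothetical helper: line endings are stripped before comparing so that
    a final line without a newline still matches.
    """
    barred = {line.rstrip("\n") for line in barred_lines}
    kept = 0
    for line in post_lines:
        number = line.rstrip("\n")
        if number not in barred:
            out.write(number + "\n")
            kept += 1
    return kept


# Invented sample data standing in for the two .dat files.
barred_file = io.StringIO("1002\n1004")  # note: no trailing newline
post_file = io.StringIO("1001\n1002\n1003\n1004\n1005\n")
out_file = io.StringIO()

kept = remove_barred(barred_file, post_file, out_file)
print(kept, repr(out_file.getvalue()))
```

any iterable of lines works here, so open file objects can be passed in place of the StringIO stand-ins.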

:::

tim wrote:

> It should be quicker to do this
>
>    #
>    for number in postlist:
>            if not number in barrlist:
>                    outfile.write(number)
>
>
> and quicker doing this
>
>    #
> numbers =  [number for number in postlist if not number in barrlist]
> outfile.write(''.join(numbers))

looks like premature non-optimization to me...
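
a rough sketch of why neither variant helps, assuming "barrlist" was meant to be a list of lines: every "in" test against a list still walks the list, no matter how the surrounding loop is spelled.  the Counted class and the data below are hypothetical, made up only to count comparisons.

```python
class Counted:
    """Hypothetical helper: wraps a line and counts == comparisons."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.comparisons += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)


def comparisons_needed(container, probes):
    """Run `probe in container` for each probe; return the == count."""
    Counted.comparisons = 0
    for probe in probes:
        _ = probe in container
    return Counted.comparisons


# Invented data: 1000 barred lines, 10 probes that are not barred.
barred_items = [Counted("%d\n" % n) for n in range(1000)]
probes = [Counted("%d\n" % n) for n in range(5000, 5010)]

list_cost = comparisons_needed(barred_items, probes)
set_cost = comparisons_needed(set(barred_items), probes)
print(list_cost, set_cost)
```

the list needs one comparison per barred entry for every miss, while the set goes by hash and needs next to none.  with the file sizes quoted above, that difference is what matters, not the spelling of the loop.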

</F>





