How to remove subset from a file efficiently?
Steve Holden
steve at holdenweb.com
Thu Jan 12 13:48:00 EST 2006
Fredrik Lundh wrote:
> "fynali" wrote:
>
>
>>>Objective: to remove the numbers present in barred-list from the
>>>PSPfile.
>>>
>>> $ ls -lh PSP0000320.dat CBR0000319.dat
>>> ... 56M Dec 28 19:41 PSP0000320.dat
>>> ... 8.6M Dec 28 19:40 CBR0000319.dat
>>>
>>> $ wc -l PSP0000320.dat CBR0000319.dat
>>> 4,462,603 PSP0000320.dat
>>> 693,585 CBR0000319.dat
>>>
>>>I wrote the following in python to do it:
>>>
>>> #: c01:rmcommon.py
>>> barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
>>> postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
>>> outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
>>>
>>> # reading it all in one go, so as to avoid frequent disk accesses
>>> (assume machine has plenty memory)
>>> barredlist.read()
>>> postlist.read()
>>>
>>> #
>>> for number in postlist:
>>>     if number in barrlist:
>>>         pass
>>>     else:
>>>         outfile.write(number)
>>>
>>> barredlist.close(); postlist.close(); outfile.close()
>>> #:~
>>>
>>>The above code simply takes too long to complete.
>
>
> the above code doesn't even run.
>
> (why is it that nobody remembers how to use cut and paste these
> days? has it perhaps been banned in some part of the world,
> without me noticing)
>
> this might work a little better:
>
> barred = set(open('/home/sjd/python/wip/CBR0000319.dat'))
>
> infile = open('/home/sjd/python/wip/PSP0000320.dat')
> outfile = open('/home/sjd/python/wip/PSP-CBR.dat', 'w')
>
> for number in infile:
>     if number not in barred:
>         outfile.write(number)
>
> if you feel adventurous, you can replace the for/if loop with
>
> outfile.writelines(number for number in infile if number not in barred)
>
> :::
>
> tim wrote:
>
>
>>It should be quicker to do this
>>
>> #
>> for number in postlist:
>>     if not number in barrlist:
>>         outfile.write(number)
>>
>>
>>and quicker doing this
>>
>> #
>> numbers = [number for number in postlist if not number in barrlist]
>> outfile.write(''.join(numbers))
>
>
> looks like premature non-optimization to me...
>
It might be quicker to establish a dict whose keys are the barred
numbers and use that, rather than a list, to determine whether the input
numbers should make it through.
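
A rough sketch of what I mean, in case it helps. The function name
remove_barred and the sample data are made up for illustration, and
in-memory StringIO objects stand in for the real files. Lines are
compared exactly as read, trailing newline included, so a final line
without a newline would need stripping first.

```python
import io

def remove_barred(barred_lines, post_lines, out):
    """Write each line of post_lines that is not in barred_lines to out.

    Returns the number of lines written.
    """
    # Build a dict keyed by the barred numbers; the values are unused.
    # A membership test on a dict is a hash lookup, not a linear scan.
    barred = dict.fromkeys(barred_lines)
    kept = 0
    for number in post_lines:
        if number not in barred:
            out.write(number)
            kept += 1
    return kept

# Tiny in-memory stand-ins for the real files (made-up data).
barred_file = io.StringIO("222\n444\n")
post_file = io.StringIO("111\n222\n333\n444\n555\n")
result = io.StringIO()
kept = remove_barred(barred_file, post_file, result)
```

With real data you would pass the open file objects instead. A set
would do the same job as the dict here, since only the keys matter.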
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/