<br><br><div><span class="gmail_quote">On 12 Jan 2006 09:04:21 -0800, <b class="gmail_sendername">fynali</b> <<a href="mailto:iladijas@gmail.com">iladijas@gmail.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi all,<br><br>I have two files:<br><br> - PSP0000320.dat (quite a large list of mobile numbers),<br> - CBR0000319.dat (a subset of the above, a list of barred numbers)<br><br> # head PSP0000320.dat CBR0000319.dat<br>
==> PSP0000320.dat <==<br> 96653696338<br> 96653766996<br> 96654609431<br> 96654722608<br> 96654738074<br> 96655697044<br> 96655824738<br> 96656190117<br> 96656256762<br> 96656263751
<br><br> ==> CBR0000319.dat <==<br> 96651131135<br> 96651131135<br> 96651420412<br> 96651730095<br> 96652399117<br> 96652399142<br> 96652399142<br> 96652399142<br> 96652399160<br> 96652399271
<br><br>Objective: to remove the numbers present in barred-list from the<br>PSPfile.<br><br> $ ls -lh PSP0000320.dat CBR0000319.dat<br> ... 56M Dec 28 19:41 PSP0000320.dat<br> ... 8.6M Dec 28 19:40 CBR0000319.dat
<br><br> $ wc -l PSP0000320.dat CBR0000319.dat<br> 4,462,603 PSP0000320.dat<br> 693,585 CBR0000319.dat<br><br>I wrote the following in python to do it:<br><br> #: c01:rmcommon.py<br> barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
<br> postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')<br> outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')<br><br> # reading it all in one go, so as to avoid frequent disk accesses<br>(assume machine has plenty memory)
<br> barredlist.read()<br> postlist.read()<br><br> #<br> for number in postlist:<br> if number in barrlist:<br> pass<br> else:<br> outfile.write(number)
<br><br> barredlist.close(); postlist.close(); outfile.close()<br> #:~<br><br>The above code simply takes too long to complete. If I were to do a<br>diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
<br>sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to<br>complete.</blockquote><div><br>
<br>
It should be quicker to drop the empty pass branch and test for absence directly:<br>
<br>
#<br>
for number in postlist:<br>
&nbsp;&nbsp;&nbsp;&nbsp;if number not in barredlist:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;outfile.write(number)<br>
</div><br>
<br>
and quicker still to collect the output with a list comprehension and write it in one go:<br>
<br>
#<br>
numbers = [number for number in postlist if number not in barredlist]<br>
outfile.write(''.join(numbers))<br>
<br>
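Both versions above still test membership against a sequence, which means a linear scan for every number. A different technique, not part of the snippets above, is to load the barred numbers into a set first so each lookup is a hash lookup. Here is a minimal sketch, assuming one number per line and that both files fit in memory; remove_barred and filter_file are names made up for illustration:<br>

```python
def remove_barred(post_lines, barred_lines):
    """Return the lines of post_lines whose number is not in barred_lines.

    Hypothetical helper: both arguments are iterables of text lines,
    one number per line (assumption based on the sample data).
    """
    # Build the set once; duplicates in the barred list collapse automatically.
    # Stripping whitespace makes "123\n" and "123" compare equal.
    barred = {line.strip() for line in barred_lines}
    # Each "not in barred" test is a hash lookup instead of a scan.
    return [line for line in post_lines if line.strip() not in barred]


def filter_file(post_path, barred_path, out_path):
    """Write the lines of post_path that are not barred to out_path."""
    with open(barred_path) as barred_file, \
            open(post_path) as post_file, \
            open(out_path, "w") as out_file:
        out_file.writelines(remove_barred(post_file, barred_file))
```

It would be called as filter_file('PSP0000320.dat', 'CBR0000319.dat', 'PSP-CBR.dat'). I have not timed it against your files.<br>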
HTH <br>
</div>