How to remove subset from a file efficiently?
fynali
iladijas at gmail.com
Thu Jan 12 12:04:21 EST 2006
Hi all,
I have two files:
- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred numbers)
# head PSP0000320.dat CBR0000319.dat
==> PSP0000320.dat <==
96653696338
96653766996
96654609431
96654722608
96654738074
96655697044
96655824738
96656190117
96656256762
96656263751
==> CBR0000319.dat <==
96651131135
96651131135
96651420412
96651730095
96652399117
96652399142
96652399142
96652399142
96652399160
96652399271
Objective: to remove the numbers present in the barred list from the
PSP file.
$ ls -lh PSP0000320.dat CBR0000319.dat
... 56M Dec 28 19:41 PSP0000320.dat
... 8.6M Dec 28 19:40 CBR0000319.dat
$ wc -l PSP0000320.dat CBR0000319.dat
4,462,603 PSP0000320.dat
693,585 CBR0000319.dat
I wrote the following in python to do it:
#: c01:rmcommon.py
barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
# reading it all in one go, so as to avoid frequent disk accesses
# (assume machine has plenty memory)
barred = barredlist.readlines()
numbers = postlist.readlines()
#
for number in numbers:
    # membership test against a list scans it from the start every time
    if number in barred:
        pass
    else:
        outfile.write(number)
barredlist.close(); postlist.close(); outfile.close()
#:~
The above code simply takes too long to complete. By contrast, if I do a
diff -y PSP0000320.dat CBR0000319.dat, catch the '<' lines & clean them up
with sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat, it takes <4 minutes to
complete.
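For comparison, here is a minimal sketch of the same filtering with the barred numbers held in a set rather than a list, so each lookup is a hash probe instead of a scan of the whole barred list. It assumes a current Python 3 interpreter and that both files fit in memory; the function names remove_barred and filter_file are hypothetical, chosen only for illustration.

```python
def remove_barred(post_lines, barred_lines):
    """Return the lines of post_lines whose number is not barred."""
    # Strip newlines/whitespace so that a missing final newline or
    # trailing spaces do not prevent a match.
    barred = {line.strip() for line in barred_lines}
    # Set membership is O(1) on average; order of post_lines is preserved.
    return [line for line in post_lines if line.strip() not in barred]


def filter_file(post_path, barred_path, out_path):
    """Write post_path minus the barred numbers to out_path (hypothetical helper)."""
    with open(barred_path) as barred_file, open(post_path) as post_file:
        kept = remove_barred(post_file, barred_file)
    with open(out_path, "w") as out_file:
        out_file.writelines(kept)
```

It could be called as filter_file('PSP0000320.dat', 'CBR0000319.dat', 'PSP-CBR.dat'). Duplicates in the barred list collapse into a single set entry, so they cost nothing extra.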
I wrote the following in bash to do the same:
#!/bin/bash
ARGS=2
if [ $# -ne $ARGS ] # takes two arguments
then
    echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
    echo; echo " eg.: `basename $0` PSP0000320.dat CBR0000319.dat"; echo;
    echo "NOTE: first argument: PSP file, second: CBR file";
    echo " this script _does_ _no_ input validation!"
    exit 1
fi;
# fix prefix; cost: 12.587 secs
cat $1 | sed -e 's/^0*/966/' > $1.good
cat $2 | sed -e 's/^0*/966/' > $2.good
# sort/save files; for the 4,462,603 lines, cost: 36.589 secs
sort $1.good > $1.sorted
sort $2.good > $2.sorted
# diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
diff -y $1.sorted $2.sorted | grep "<" > $1.filtered
# remove trailing junk [spaces & <]; cost: 1 min 3 secs
cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned
# remove intermediate files, good, sorted, filtered
rm -f *.good *.sorted *.filtered
#:~
...but strangely though, there's a discrepancy, the reason for which I
can't figure out!
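One way to narrow down a discrepancy like this is to count a few things about the two lists directly. The sketch below is only a diagnostic under assumptions, not a confirmed explanation: the function name describe_lists and the keys it returns are hypothetical. Two possible causes worth checking are duplicates in the barred list (the head output above shows 96651131135 twice and 96652399142 three times) and the fact that diff -y marks lines it pairs up as changed with '|' rather than '<', so grep "<" would not catch them.

```python
def describe_lists(post_lines, barred_lines):
    """Return counts that help explain a mismatch between two filtering runs."""
    # Ignore blank lines and surrounding whitespace.
    post = [line.strip() for line in post_lines if line.strip()]
    barred = [line.strip() for line in barred_lines if line.strip()]
    post_set = set(post)
    barred_set = set(barred)
    kept = [number for number in post if number not in barred_set]
    return {
        "post_total": len(post),
        "post_duplicates": len(post) - len(post_set),
        "barred_total": len(barred),
        "barred_unique": len(barred_set),
        # Barred numbers that never occur in the PSP list at all.
        "barred_not_in_post": len(barred_set - post_set),
        # Size the filtered output should have.
        "kept": len(kept),
    }
```

If "kept" differs from the line count of the file the shell pipeline produced, the other counts indicate where to look first.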
Needless to say, I'm utterly new to python and my programming skills &
know-how are rudimentary.
Any help will be genuinely appreciated.
--
fynali