How to remove subset from a file efficiently?
Mike Meyer
mwm at mired.org
Thu Jan 12 17:50:45 EST 2006
"fynali" <iladijas at gmail.com> writes:
> Hi all,
>
> I have two files:
Others have pointed out the Python solution - use a set instead of a
list for membership testing. I want to point out a better Unix
solution ('cause I probably wouldn't have written a Python program to
do this):
> Objective: to remove the numbers present in barred-list from the
> PSPfile.
>
> $ ls -lh PSP0000320.dat CBR0000319.dat
> ... 56M Dec 28 19:41 PSP0000320.dat
> ... 8.6M Dec 28 19:40 CBR0000319.dat
>
> $ wc -l PSP0000320.dat CBR0000319.dat
> 4,462,603 PSP0000320.dat
> 693,585 CBR0000319.dat
>
> I wrote the following in bash to do the same:
>
> #!/bin/bash
>
> ARGS=2
>
> if [ $# -ne $ARGS ] # takes two arguments
> then
> echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
> echo; echo " eg.: `basename $0` PSP0000320.dat
> CBR0000319.dat"; echo;
> echo "NOTE: first argument: PSP file, second: CBR file";
> echo " this script _does_ no_ input validation!"
> exit 1
> fi;
>
> # fix prefix; cost: 12.587 secs
> cat $1 | sed -e 's/^0*/966/' > $1.good
> cat $2 | sed -e 's/^0*/966/' > $2.good
>
> # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
> sort $1.good > $1.sorted
> sort $2.good > $2.sorted
>
> # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
> diff -y $1.sorted $2.sorted | grep "<" > $1.filtered
>
> # remove trailing junk [spaces & <]; cost: 1 min 3 secs
> cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned
>
> # remove intermediate files, good, sorted, filtered
> rm -f *.good *.sorted *.filtered
> #:~
>
> ...but strangely though, there's a discrepancy, the reason for which I
> can't figure out!
The above script can be shortened quite a bit:
#!/bin/bash
# bash rather than sh: process substitution <(...) is not a POSIX sh feature
comm -23 <(sed 's/^0*/966/' "$1" | sort) <(sed 's/^0*/966/' "$2" | sort)
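This outputs only the lines that occur in $1 and not in $2 (comm -23
suppresses the lines unique to the second input and the lines common
to both). It also runs the seds and sorts in parallel, which can make
a significant difference in the clock time it takes to get the job
done.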
The Python version is probably faster, since it doesn't sort the
data.
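For comparison, here is a minimal sketch of the set-based approach the
others described. It is not anyone's posted code: the function names
and the default "966" prefix argument are my own choices for
illustration, and it assumes one number per line, as in the files
above.

```python
def normalize(line, prefix="966"):
    """Mimic sed 's/^0*/966/': drop leading zeros, then prepend the prefix."""
    return prefix + line.rstrip("\n").lstrip("0")


def remove_barred(psp_lines, cbr_lines, prefix="966"):
    """Yield normalized PSP numbers that are not in the barred (CBR) list."""
    # Build the set once; each membership test is then O(1) on average,
    # instead of a linear scan through a list.
    barred = {normalize(line, prefix) for line in cbr_lines}
    for line in psp_lines:
        number = normalize(line, prefix)
        if number not in barred:
            yield number


# Hypothetical usage with the file names from the original post:
#   with open("PSP0000320.dat") as psp, open("CBR0000319.dat") as cbr:
#       for number in remove_barred(psp, cbr):
#           print(number)
```

Unlike the comm pipeline, this keeps the PSP file's original order
rather than sorted order, and a barred number that appears several
times in the PSP file is dropped every time it appears.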
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.