How to remove subset from a file efficiently?

Thu Jan 12 17:50:45 EST 2006

"fynali" <iladijas at gmail.com> writes:

> Hi all,
>
> I have two files:

Others have pointed out the Python solution - use a set instead of a
list for membership testing. I want to point out a better Unix
solution ('cause I probably wouldn't have written a Python program to
do this):

> Objective: to remove the numbers present in barred-list from the
> PSPfile.
>
>     $ ls -lh PSP0000320.dat CBR0000319.dat
>     ...  56M Dec 28 19:41 PSP0000320.dat
>     ... 8.6M Dec 28 19:40 CBR0000319.dat
>
>     $ wc -l PSP0000320.dat CBR0000319.dat
>      4,462,603 PSP0000320.dat
>        693,585 CBR0000319.dat
>
> I wrote the following in bash to do the same:
>
>     #!/bin/bash
>
>     ARGS=2
>
>     if [ $# -ne $ARGS ]     # takes two arguments
>     then
>         echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
>         echo; echo "    eg.: `basename $0` PSP0000320.dat CBR0000319.dat"; echo;
>         echo "NOTE: first argument: PSP file, second: CBR file";
>         echo "      this script _does_ no_ input validation!"
>         exit 1
>     fi;
>
>     # fix prefix; cost: 12.587 secs
>     cat $1 | sed -e 's/^0*/966/' > $1.good
>     cat $2 | sed -e 's/^0*/966/' > $2.good
>
>     # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
>     sort $1.good > $1.sorted
>     sort $2.good > $2.sorted
>
>     # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
>     diff -y $1.sorted $2.sorted | grep "<" > $1.filtered
>
>      # remove trailing junk [spaces & <]; cost: 1 min 3 secs
>     cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned
>
>     # remove intermediate files, good, sorted, filtered
>      rm -f *.good *.sorted *.filtered
>     #:~
>
> ...but strangely though, there's a discrepancy, the reason for which I
> can't figure out!

The above script can be shortened quite a bit:

#!/bin/bash
# Process substitution, <(...), needs bash, ksh or zsh rather than plain sh.

comm -23 <(sed 's/^0*/966/' "$1" | sort) <(sed 's/^0*/966/' "$2" | sort)

This will output only the lines that occur in $1 but not in $2. It
also runs the seds and sorts in parallel, which can make a significant
difference in the clock time it takes to get the job done.

The Python version is probably faster, since it doesn't sort the
data.
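
For comparison, here is a minimal sketch of the set-based approach the
others suggested. It is not anyone's posted code: the names normalize
and remove_barred are made up for illustration, and it assumes one
number per line. Unlike comm, it keeps the PSP file's original order
and drops every occurrence of a barred number.

```python
def normalize(line):
    """Mirror sed 's/^0*/966/': drop leading zeros, then prepend 966."""
    return "966" + line.rstrip("\n").lstrip("0")


def remove_barred(psp_lines, cbr_lines):
    """Return normalized PSP numbers that are not in the barred list.

    Both arguments are iterables of lines, e.g. open file objects.
    """
    # Build the set once; each membership test is then O(1) on average,
    # instead of a linear scan through a 693,585-element list.
    barred = {normalize(line) for line in cbr_lines}
    return [n for n in map(normalize, psp_lines) if n not in barred]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: script.py PSP0000320.dat CBR0000319.dat
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as psp, open(sys.argv[2]) as cbr:
            for number in remove_barred(psp, cbr):
                print(number)
```

Only the barred list has to fit in memory as a set; the PSP file could
just as well be streamed line by line.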

        <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.