looking for speed-up ideas

William Park opengeometry at yahoo.ca
Mon Feb 3 20:22:32 EST 2003


Ram Bhamidipaty <ramb at sonic.net> wrote:
> I have some python code that processes a large file. I want to see how
> much faster this code can get. Mind you, I don't _need_ the code to go
> faster - but it sure would be nice if it were faster...
> 
> Here is a specification of the input:
> 1. Lines start with T, S or F
> 2. The first line of the file starts with
>    T, all other lines start with S or F.
> 3. F lines look like "F/<number>/string"
> 4. S lines look like "S/string/<number>/<number>"
> 
> Here is a sample:
> 
> T /remote 0
> S/name/0/1
> S/joe/1/2
> S/bob/1/3
> F/3150900/big_file.tar.gz
> S/testing/3/4
> F/414/.envrc
> F/276/BUILD_FLAGS
> F/36505/make.incl
> F/3861/build_envrc
> 
> In case you are curious the file is a dump of a file system. F lines
> specify a file name and file size. S lines specify a directory. The
> numbers on an S line represent a directory number and a directory
> parent number. All the F lines under an S line are files in a
> particular directory.
> 
> My script reads the file and prints out the 200 largest files.

Behold:
    egrep '^F' dumpfile | sort -t '/' -n -k 2,2 | tail -200
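
egrep keeps only the F lines, sort orders them numerically on the
second '/'-separated field (the size), and tail keeps the last,
i.e. largest, 200.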

How fast does it run?
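
If you'd rather stay in Python, the same one-pass idea works with
heapq.nlargest, which only ever holds the top n entries in memory.
A minimal sketch, assuming a Python whose standard heapq module has
nlargest (2.4+), with 'dumpfile' standing in for your real file name:

    import heapq

    def largest_files(path, n=200):
        # Yield (size, name) for each F line.  Split on the first
        # two slashes only; the name is everything after the size.
        def sizes(f):
            for line in f:
                if line.startswith('F/'):
                    _, size, name = line.split('/', 2)
                    yield int(size), name.rstrip('\n')
        f = open(path)
        try:
            # nlargest keeps at most n entries in memory, so the
            # whole 43 Meg dump streams through in a single pass.
            return heapq.nlargest(n, sizes(f))
        finally:
            f.close()

    for size, name in largest_files('dumpfile'):
        print size, name

Since only 200 entries are ever kept, memory stays flat no matter
how big the dump grows.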

> 
> I am currently using the heapcq module written by John Eikenberry. I
> downloaded it from here: http://zhar.net/projects/python/
> 
> There is an edited version of the script at the end of this message.
> 
> My script currently processes 300,000 lines in about 18 seconds
> on a Sun Ultra 60. To make performance testing easier, the
> script limits itself to reading just the first 300,000 lines.
> The wc program can read the same 300k lines in around 0.4 seconds.
> 
> The full input file is around 43 Meg with around 2.2 million lines.

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
Linux solution for data management and processing. 
