YATOPS (Yet Another Thread on Python's Speed) (was Re: HELP! Must choose language!)

Branimir Petrovic BranimirPetrovic at yahoo.com
Thu Jan 9 23:04:01 EST 2003


David Bolen <db3l at fitlinxx.com> wrote in message news:<uu1gh6gz3.fsf at fitlinxx.com>...
> BranimirPetrovic at yahoo.com (Branimir Petrovic) writes:
> 
> > Andy, would you mind showing the snippet of the critical part of your
> > "
just loads up each file and writes it to dest." statement. Not quite
> > see/understand what you mean (and how).
> 
> I expect from his comment that Andy literally reads the entire file
> into memory (e.g., open()s it and just calls read()), and then writes
> it out.  So a single iteration through Python code, but potentially a
> lot of memory usage.

That's how it appeared to me too, but wasn't really sure. Being new to
Python I am still on 'wobbly legs' when it comes to Pythonic (or
proper)
way of thinking. Now that I see it - I know exactly how :)

	while 1:
		block = source.read(BLOCKSIZE)
		if not block: break
		dest.write(block)



> > Having fair amount of scripting experience of what's (not) achievable
> > using M$ pathetic scripting approach (FileSystemsObject & cousins), 
> > I had impression that beating robocopy.exe using (any) scripting 
> > language is just not in cards.
> 
> Well, using the very high level COM scripting approach is going to
> have reasonably high overhead to instantiate the various objects along
> the way and then to actually issue the various calls amongst the
> objects.
> 

COM way of dealing with file system is not too bad providing simple 
requirements. As soon as anything more is expected - like large folder
structure traversal say looking for pattern matching or time stamp 
and size sampling, performance goes 'out of Window(s)'. Mandatory 
COM, variants, interpreted nature of program execution all that adds 
up killing any chance of getting acceptable (if not good) performance.
For instance - comparing and synchronizing two almost identical folder
structures not much more than 100-200 MB each containing few tens of
thousands files takes way, way too long while CPU 'sweats bullets' 
(at %100 usage). 

Very same task using compiled utility (like robocopy) does the job
so much faster. Did not take exact figures, but the difference feels
as if hundred times.

Wonderful thing about Python is that is as 'close to the metal' as
scripting language can possibly be by allowing API calls. I am aware 
that we are talking about Win32 subset, but nevertheless...


> For example, the following routine scans a directory tree to summarize
> numbers and sizes of contained files.  Originally, I implemented this
> with the more generic os.path.walk routine.  By moving to the more
> direct Win32 calls, and also avoiding some of the higher level path
> manipulation calls in favor of direct code, I was able to obtain an
> over 4 fold increase in performance.  It can process a filesystem with
> about 140,000 files (spread over a large number of directories) in
> about 4 minutes.  Or around 550-600 files/s including directory
> traversal.  Generally speaking - in a transfer process that's probably
> fast enough to keep your pipes busy :-)

Now that's a quite a performance for interpreted language! I will 
definitely try your sample and see how it compares.

I would prefer using Python if script performance can be tweaked
to get to within 1/2 or 1/3 of speed achievable using 'robocopy'. 
You see robocopy.exe is part of so called Server Resource Kit, and 
as such can not be freely distributed or used on all machines on
the network (1 system 1 license is the motto), otherwise I wouldn't
be shy to resort to simplest solution possible that gets the job 
done. Over time I adopted and perfected what I call shameless 
approach to scripting where as much as possible (and aesthetically 
digestible) I tend to use compiled utilities to do 'heavy lifting'. 
Libraries that come with Python look very very good, so I might
be able to kick this questionable habit after all.

David thank you very much for samples, explanations and the time
you spent doing it. 

Branimir




More information about the Python-list mailing list