When Python outruns C++ ...

Julian Tibble chasm at rift.sytes.net
Tue Apr 1 15:41:37 CEST 2003


In article <3E890322.4B003C9C at engcorp.com>, Peter Hansen wrote:
> Julian Tibble wrote:
>> 
>> p.s. maybe my python version is not very good, feel free to make
>> it go faster:
> 
> I'd love to try improving it, and there are certainly many
> possibilities for improvement, but the requirements are 
> inadequately specified.  Specifically, what is the nature of
> the input?  Is it supposed to work with any kind of file,
> including binaries, or only with files containing, for example,
> text with basic punctuation and numbers? (In other words, 
> exactly what character set is involved, and what is the nature
> of the input file?)
>

If some of the design decisions seem a bit strange, it's because I wrote
it to produce exactly the same output as my C program so that it was a
fair test.

It is supposed to work with all binary files, although in practice it
would be used on text files - documentation, etc.


>> # Regular expression to match a word
>> re_word = re.compile('[a-z]+')
> 
> Regular expressions are often not the fastest way to approach
> a problem.

Well, I thought they might not be, but I couldn't think of any
alternatives.
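
One alternative worth trying (a sketch, not from the original program) is
to translate every non-letter byte to a space and let str.split() do the
tokenizing, which avoids the regex engine in the inner loop.  In modern
Python it would look like:

```python
import re
import string

# Translation table mapping every character outside a-z to a space.
non_letters = ''.join(chr(c) for c in range(256)
                      if chr(c) not in string.ascii_lowercase)
table = str.maketrans(non_letters, ' ' * len(non_letters))

line = 'hello, world: foo-bar baz!'

# Both approaches produce the same tokens on lowercased input.
print(line.translate(table).split())
print(re.findall('[a-z]+', line))
```

Whether this actually beats re.findall() depends on the input; it would
need timing on the real 80MB file.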

> 
>> # Read in whole file and split into words
>> try:
>>         while 1:
>>                 line = raw_input().lower()
> 
> Dealing with data one line at a time is often not the 
> fastest approach.  It does suggest that your data is
> line-oriented, however, but is it necessarily so?  For
> example, could newlines be replaced with spaces and the
> program would still work correctly?

Yes, I did try reading the whole file into memory (that's what the C
version does), but the process kept getting killed for using too much
memory, even though I have 256MB of RAM and the input file is ~80MB.
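
A middle ground between line-at-a-time and slurping the whole file is to
read fixed-size chunks, taking care that a word straddling a chunk
boundary is not split in two.  A sketch (function name, chunk size, and
the use of dict.get() are my assumptions, not the original code):

```python
import re

re_word = re.compile('[a-z]+')
re_tail = re.compile(r'[a-z]+\Z')  # trailing letters at a chunk boundary


def count_words(path, chunk_size=1 << 20):
    """Count word frequencies by reading fixed-size chunks."""
    wordcount = {}
    leftover = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk.lower()
            # A word may straddle the boundary: carry any trailing
            # letters over into the next chunk.
            m = re_tail.search(chunk)
            if m:
                leftover = chunk[m.start():]
                chunk = chunk[:m.start()]
            else:
                leftover = ''
            for word in re_word.findall(chunk):
                wordcount[word] = wordcount.get(word, 0) + 1
    for word in re_word.findall(leftover):
        wordcount[word] = wordcount.get(word, 0) + 1
    return wordcount
```

This keeps memory bounded by the chunk size regardless of file size,
at the cost of a little boundary bookkeeping.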


>>                 for word in re_word.findall(line):
>>                         if wordcount.has_key(word):
>>                                 wordcount[word] += 1
>>                         else:
>>                                 wordcount[word] = 1
> 
> Among other things, it's possible that catching a KeyError
> for the case where the key does not yet exist is faster,
> and there is also dict.setdefault() which might be of help.

setdefault() actually made the running time worse; catching the KeyError
did not.  This is probably because the input file is the same file
repeated a number of times, so exceptions are only raised during the
first "section".  This shouldn't make too much difference, as the C
version should also degrade in speed when there are more unique words,
due to hash collisions.

Also, the program would probably only prove useful on text files, and
there are only ~45,000 words in /usr/dict/words IIRC, so large numbers
of KeyErrors shouldn't occur.
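
The trade-off - exceptions are cheap only when they are rare - is easy
to check with timeit.  A sketch comparing the three counting styles
discussed above (the synthetic word list and function names are mine;
actual numbers will vary by machine and input):

```python
import timeit

# Mostly-repeated words with a small unique tail, mimicking the input.
words = ['the', 'of', 'and'] * 10000 + ['unique%d' % i for i in range(100)]


def with_except(words):
    wc = {}
    for w in words:
        try:
            wc[w] += 1
        except KeyError:
            wc[w] = 1
    return wc


def with_setdefault(words):
    wc = {}
    for w in words:
        wc[w] = wc.setdefault(w, 0) + 1
    return wc


def with_get(words):
    wc = {}
    for w in words:
        wc[w] = wc.get(w, 0) + 1
    return wc


for fn in (with_except, with_setdefault, with_get):
    print(fn.__name__, round(timeit.timeit(lambda: fn(words), number=20), 3))
```

All three produce identical counts; only the per-lookup cost differs.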


>> for n,w in sortable:
>>         print w,n
> 
> Since you're not using the output here, why include it in the
> timing?  Seems like it could skew the result under certain 
> conditions...

As I said, I wanted it to be identical to the C program.
Also, by _far_ the majority of the time is spent in the loop.
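
For reference, the sort-and-print stage being kept in the timing
presumably looks something like this (a reconstruction - only the final
loop was quoted, and the exact ordering is an assumption):

```python
wordcount = {'the': 120, 'of': 77, 'and': 75}

# (count, word) pairs sort by frequency first; reverse=True prints the
# most frequent words at the top.
sortable = sorted(((n, w) for w, n in wordcount.items()), reverse=True)
for n, w in sortable:
    print(w, n)
```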


Julian




More information about the Python-list mailing list