Newbie completely confused
Roel Schroeven
rschroev_nospam_ml at fastmail.fm
Sat Sep 22 04:47:36 EDT 2007
Jeroen Hegeman schreef:
> ...processing all 2 files found
> --> 1/2: ./test_file0.txt
> Now reading ...
> DEBUG readLines A took 0.093 s
> ...took 8.85717201233 seconds
> --> 2/2: ./test_file0.txt
> Now reading ...
> DEBUG readLines A took 3.917 s
> ...took 12.8725550175 seconds
>
> So the first time around the file gets read in in ~0.1 seconds, the
> second time around it needs almost four seconds! As far as I can see
> this is related to 'something in memory being copied around' since if
> I replace the 'alternative 1' by the 'alternative 2', basically
> making sure that my classes are not used, reading time the second
> time around drops back to normal (= roughly what it is the first pass).
(First, I had to add timing code to ReadClasses: the code you posted
doesn't include them, and only shows timings for ReadLines.)
Your program uses quite a bit of memory. I guess it gets harder and
harder to allocate the required amounts of memory.
If I change this line in ReadClasses:
built_classes[len(built_classes)] = HugeClass(long_line)
to
dummy = HugeClass(long_line)
then both times the files are read and your data structures are built,
but after each run the data structure is freed. The result is that both
runs are equally fast.
Also, if I run the first version (without the dummy) on a computer with
a bit more memory (1 GiB), it seems there is no problem allocating
memory: both runs are equally fast.
I'm not sure how to speed things up here... you're doing much processing
on a lot of small chunks of data. I have a number of observations and
possible improvements though, and some might even speed things up a bit.
You read the files, but don't use the contents; instead you use
long_line over and over. I suppose you do that because this is a test,
not your actual code?
__init__() with nothing (or only return) in it is not useful; better to
just leave it out.
You have a number of return statements that don't do anything (i.e. they
return nothing (None actually) at the end of the function). A function
without return automatically returns None at the end, so it's better to
leave them out.
Similarly you don't need to call sys.exit(): the script will terminate
anyway if it reaches the end. Better leave it out.
LongClass.clear() doesn't do anything and isn't called anyway; leave it out.
ModerateClass.__del__() doesn't do anything either. I'm not sure how it
affects what happens if ModerateClass gets freed, but I suggest you
don't start messing with __del__() until you have more Python knowledge
and experience. I'm not sure why you think you need to implement that
method.
The same goes for HugeClass.__del__(). It does delete self.B4v, but the
default behavior will do that too. Again, I don't get why you want to
override the default behavior.
In a number of cases, you use a dict like this:
built_classes = {}
for i in LINES:
built_classes[len(built_classes)] = ...
So you're using the indices 0, 1, 2, ... as the keys. That's not what
dictionaries are made for; lists are much better for that:
built_classes = []
for i in LINES:
built_classes.append(...)
HugeClass.B4v isn't used, so you can safely remove it.
Your readLines() function reads a whole file into memory. If you're
working with large files, that's not such a good idea. It's better to
load one line at a time into memory and work on that. I would even
completely remove readLines() and restructure ReadClasses() like this:
def ReadClasses(filename):
print 'Now reading ...'
built_classes = []
# Open file
in_file = open(filename, 'r')
# Read lines and interpret them.
time_a = time.time()
for i in in_file:
## This is alternative 1.
built_classes.append(HugeClass(long_line))
## The next line is alternative 2.
## built_classes[len(built_classes)] = long_line
in_file.close()
time_b = time.time()
print "DEBUG readClasses took %.3f s" % (time_b - time_a)
Personally I only use 'i' for integer indices (as in 'for i in
range(10)'); for other use I prefer more descriptive names:
for line in in_file: ...
But I guess that's up to personal preference. Also you used LINES to
store the file contents; the convention is that names with all capitals
are used for constants, not for things that change.
In ProcessList(), you keep the index in a separate variable. Python has
a trick so you don't have to do that yourself:
nfiles = len(input_files)
for file_index, i in enumerate(input_files):
print "--> %i/%i: %s" % (file_index + 1, nfiles, i)
ReadClasses(i)
Instead of item0, item1, ... , it's generally better to use a list, so
you can use item[0], item[1], ...
And finally, John Machin's suggestion looks like a good way to
restructure that long sequence of conversions and assignments in HugeClass.
--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov
Roel Schroeven
More information about the Python-list
mailing list