Memory woes under Windows

[Noah, I'm wondering whether this is related to our W98 NatSpeak woes -- Python grows its lists much like a certain product we both work on <ahem> grows its arrays ...]
Here's a simple test case:
from time import clock
def run():
    n = 1
    while n < 4000000:
        a = []
        push = a.append
        start = clock()
        for i in xrange(n):
            push(1)
        finish = clock()
        print "%10d push %10.3f" % (n, round(finish - start, 3))
        n = n + n

for i in (1, 2, 3):
    try:
        run()
    except MemoryError:
        print "Got a memory error"
So run() builds a number of power-of-2 sized lists, each by appending one element at a time. It prints the list length and elapsed time to build each one (on Windows, this is basically wall-clock time, and is derived from the Pentium's high-resolution cycle timer). The driver simply runs this 3 times, reporting any MemoryError that pops up.
The largest array constructed has 2M elements, so consumes about 8Mb (2M 4-byte pointers) -- no big deal on most machines these days.
Here's what happens on my new laptop (damn, this thing is fast! -- usually):
Win98 (Second Edition)
600MHz Pentium III
160Mb RAM
Python 1.6a2 from python.org, via the Windows installer
    length      run 1      run 2      run 3
         1      0.000      0.000      0.000
         2      0.000      0.000      0.000
         4      0.000      0.000      0.000
         8      0.000      0.000      0.000
        16      0.000      0.000      0.000
        32      0.000      0.000      0.000
        64      0.000      0.000      0.000
       128      0.000      0.000      0.000
       256      0.001      0.001      0.001
       512      0.001      0.001      0.001
      1024      0.003      0.003      0.003
      2048      0.011      0.007      0.007
      4096      0.020      0.014      0.014
      8192      0.053      0.029      0.028
     16384      0.074      0.057      0.057
     32768      0.163      0.116      0.115
     65536      0.262      0.231      0.232
    131072      0.514      0.474      0.462
    262144      0.713      2.361      2.349
    524288      1.440     24.059     23.982
   1048576      2.961     67.492     67.257

(push times in seconds; each run then ended with "Got a memory error" while growing the 2M-element list)
Commentary: The first time it runs, the timing behavior is indistinguishable from O(N). But realloc returns NULL at some point when growing the 2M array! There "should be" huge gobs of memory available.
The 2nd and 3rd runs are very similar to each other: both blow up at about the same point, but run *very* much slower than the 1st run once the list size gets non-trivial -- and, while the output doesn't show this, the disk starts thrashing too.
It's *not* the case that Win98 won't give Python more than 8Mb of memory. For example,
a = [1]*30000000 # that's 30M
works fine and fast on this machine, with no visible disk traffic [Noah, that line sucks up about 120Mb from malloc in one shot].
So, somehow or other, masses of allocations are confusing the system memory manager nearly to death (implying we should use Vladimir's PyMalloc under Windows after grabbing every byte the machine has <0.6 wink>).
My belief is that the Windows 1.6a2 from python.org was compiled with VC6, yes? Scream if that's wrong.
This particular test case doesn't run any better under my Win95 (original) P5-166 with 32Mb RAM using Python 1.5.2. But at work, we've got a (unfortunately huge, and C++) program that runs much slower on a large-memory W98 machine than a small-memory W95 one, due to disk thrashing. It's a mystery! If anyone has a clue about any of this, spit it out <wink>.
[Noah, I watched the disk cache size while running the above, and it's not the problem -- while W98 had allocated about 100Mb for disk cache at the start, it gracefully gave that up as the program's memory demands increased]
just-another-day-with-windows-ly y'rs - tim

This is definitely weird! As you only mentioned Win9x, I thought I would give it a go on Win2k.
This is from a CVS update of only a few days ago, but it is a non-debug build. PII266 with 196MB RAM:
         1 push      0.001
         2 push      0.000
         4 push      0.000
         8 push      0.000
        16 push      0.000
        32 push      0.000
        64 push      0.000
       128 push      0.001
       256 push      0.001
       512 push      0.003
      1024 push      0.006
      2048 push      0.011
      4096 push      0.040
      8192 push      0.043
     16384 push      0.103
     32768 push      0.203
     65536 push      0.583
Things are looking OK to here - the behaviour Tim expected. But then things seem to start going a little wrong:
    131072 push      1.456
    262144 push      4.763
    524288 push     16.119
   1048576 push     60.765
All of a sudden we seem to hit N*N behaviour?
I gave up waiting for the next one. Performance monitor was showing CPU at 100%, but the Python process was only sitting on around 15MB of RAM (and growing _very_ slowly - at the rate you would expect). The machine had tons of RAM showing as available, and the disk was not thrashing - i.e., Windows definitely had lots of memory available, and I have no reason to believe that a malloc() would fail here - but certainly no one would ever want to wait and see :-)
This was all definitely built with MSVC6, SP3.
no-room-should-ever-have-more-than-one-windows-ly y'rs
Mark.

[Mark Hammond]
This is definitely weird! As you only mentioned Win9x, I thought I would give it a go on Win2k.
Thanks, Mark! I've only got W9X machines at home.
This is from a CVS update of only a few days ago, but it is a non-debug build. PII266 with 196MB RAM:
         1 push      0.001
         2 push      0.000
         4 push      0.000
         8 push      0.000
        16 push      0.000
        32 push      0.000
        64 push      0.000
       128 push      0.001
       256 push      0.001
       512 push      0.003
      1024 push      0.006
      2048 push      0.011
      4096 push      0.040
      8192 push      0.043
     16384 push      0.103
     32768 push      0.203
     65536 push      0.583
Things are looking OK to here - the behaviour Tim expected. But then things seem to start going a little wrong:
    131072 push      1.456
    262144 push      4.763
    524288 push     16.119
   1048576 push     60.765
So that acts like my Win95 (which I didn't show), and somewhat like my 2nd & 3rd Win98 runs.
All of a sudden we seem to hit N*N behaviour?
*That* part really isn't too surprising. Python "overallocates", but by a fixed amount independent of the current size. This leads to quadratic-time behavior "in theory" once a vector gets large enough. Guido's cultural myth for why that theory shouldn't matter is that if you keep appending to the same vector, the OS will eventually move it to the end of the address space, whereupon further growth simply boosts the VM high-water mark without actually moving anything. I call that "a cultural myth" because some flavors of Unix really did work that way once, and some may still -- I doubt it's ever been a valid argument under Windows, though. (You, of all people, know how much Python's internal strategies were informed by machines nobody uses <wink>.)
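To make the arithmetic concrete, here's a toy model -- my sketch, not Python's actual resize code, and the pad of 100 and the doubling factor are illustrative assumptions -- counting how many elements get copied while growing a list to length n, on the pessimistic assumption that every resize forces realloc to move the block:

def copies_fixed(n, pad=100):
    # capacity grows by a fixed pad -> ~n/pad moves, each copying
    # the whole list so far -> O(n**2/pad) elements copied in all
    total, cap = 0, 0
    for size in range(1, n + 1):
        if size > cap:
            total = total + (size - 1)
            cap = size + pad
    return total

def copies_doubling(n):
    # capacity doubles -> O(log n) moves, O(n) elements copied in all
    total, cap = 0, 1
    for size in range(1, n + 1):
        if size > cap:
            total = total + (size - 1)
            cap = cap * 2
    return total

for n in (10000, 100000):
    print("%8d: fixed %12d doubling %8d" % (n, copies_fixed(n), copies_doubling(n)))

Growing n tenfold multiplies the fixed-pad total by about 100 but the doubling total by less than 10: quadratic vs linear, whatever the constants.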
So I was more surprised up to this point by the supernatural linearity of my first W98 run (which is reproducible, btw). But my 2nd & 3rd W98 runs (also reproducible), unlike your W2K run, show *worse* than quadratic behavior.
I gave up waiting for the next one.
Under both W98 and W95, the next one does eventually hit the MemoryError for me, but it does take a long time. If I thought it would help, I'd measure it. And *this* one is surprising, because, as you say:
Performance monitor was showing CPU at 100%, but the Python process was only sitting on around 15MB of RAM (and growing _very_ slowly - at the rate you would expect). The machine had tons of RAM showing as available, and the disk was not thrashing - i.e., Windows definitely had lots of memory available, and I have no reason to believe that a malloc() would fail here - but certainly no one would ever want to wait and see :-)
How long did you wait? If less than 10 minutes, perhaps not long enough. I certainly didn't expect a NULL return either, even on my tiny machine, and certainly not on the box with 20x more RAM than the list needs.
This was all definitely built with MSVC6, SP3.
Again good to know. I'll chew on this, but don't expect a revelation soon.
no-room-should-ever-have-more-than-one-windows-ly y'rs
Hmm. I *did* run these in different rooms <wink>.
no-accounting-for-windows-ly y'rs - tim

Just a brief note on the little list-grower I posted. Upon more digging this doesn't appear to have any relation to Dragon's Win98 headaches, so I haven't looked at it much more. Two data points:
1. Gordon McM and I both tried it under NT 4 systems (thanks, G!), and those are the only Windows platforms under which no MemoryError is raised. But the runtime behavior is very clearly quadratic-time (in the ultimate length of the list) under NT.
2. Win98 comes with very few diagnostic tools useful at this level. The Python process does *not* grow to an unreasonable size. However, using a freeware heap walker I soon determined that Python sprays data *all over* its entire 2Gb virtual heap space while running this thing, and then the memory error occurs. The dump file for the system heap memory blocks (just listing the start address, length & status of each block) is about 128Kb, and I haven't had time to analyze it. It's clearly terribly fragmented, though. The mystery here is why Win98 isn't coalescing all the gazillions of free areas to come up with a big-enough contiguous chunk to satisfy the request (according to me <wink>, the program doesn't create any long-lived data other than the list -- it appends "1" each time, and uses xrange).
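For anyone who wants to poke at this without a heap walker, here's a crude probe -- my sketch, no relation to the heap-walker dump, and the ceiling is an arbitrary guess at "more than enough" -- that bisects for the largest list malloc will still grant in one contiguous chunk. Running it before and after the list-grower should show the biggest grantable chunk collapsing, if fragmentation is really the culprit:

def largest_list(hi=64 * 1024 * 1024):
    # bisect for the largest [None]*k that succeeds; the result is
    # in list elements (4-byte slots on Win32), not bytes
    lo = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        try:
            a = [None] * mid
            del a
            lo = mid        # success: the true max is at least mid
        except MemoryError:
            hi = mid - 1    # failure: the true max is below mid
    return lo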
Dragon's Win98 woes appear due to something else: right after a Win98 system w/ 64Mb RAM is booted, about half the memory is already locked (not just committed)! Dragon's product needs more than the remaining 32Mb to avoid thrashing. Even stranger, killing every process after booting releases an insignificant amount of that locked memory. Strange too: on my Win98 box w/ 160Mb of RAM, a massive 50Mb is locked right after booting. This is insane, and we haven't been able to figure out on whose behalf all this memory is being allocated.
personally-like-win98-a-lot-but-then-i-bought-a-lot-of-ram-ly y'rs - tim

Just polishing part of this off, for the curious:
... Dragon's Win98 woes appear due to something else: right after a Win98 system w/ 64Mb RAM is booted, about half the memory is already locked (not just committed)! Dragon's product needs more than the remaining 32Mb to avoid thrashing. Even stranger, killing every process after booting releases an insignificant amount of that locked memory. ...
That turned out to be (mostly) irrelevant, and even if it were relevant it turns out you can reduce the locked memory (to what appears to be an undocumented minimum) and the file-cache size (to what is a documented minimum) just by malloc'ing, zero'ing and free'ing a few giant arrays (Windows malloc() -- unlike Linux's -- returns a pointer to committed memory; Windows has other calls if you really want memory you can't trust <0.5 wink>).
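In Python terms, the scrub amounts to something like the following sketch; the block count and size are pulled out of thin air -- use whatever it takes on a given box:

def scrub(nblocks=4, size=16 * 1024 * 1024):
    # malloc, touch and free a few giant arrays: a list of zeroes is
    # committed, touched memory the moment it's built
    for dummy in range(nblocks):
        try:
            a = [0] * size   # one ~64Mb grab (4-byte slots on Win32)
        except MemoryError:
            break            # take what we can get
        del a                # hand it straight back to malloc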
The next red herring was much funnier: we couldn't reproduce the problem when running the recognizer by hand (from a DOS box cmdline)! But, run it as Research did, system()'ed from a small Perl script, and it magically ran 3x slower, with monstrous disk thrashing. So I had a great time besmirching Perl's reputation <wink>.
Alas, it turned out the *real* trigger was something else entirely, one we've known about for years but have never understood: from inside the Perl script, people used UNC paths to various network locations. Like
\\earwig\research2\data5\natspeak\testk\big55.voc
Exactly the same locations were referenced when people ran it "by hand", but when people do it by hand, they naturally map a drive letter first, in order to reduce typing. Like
net use N: \\earwig\research2\data5\natspeak
once and then
N:\testk\big55.voc
in their command lines.
This difference alone can make a *huge* timing difference! Like I said, we've never understood why. Could simply be a bug in Dragon's out-of-control network setup, or a bug in MS's networking code, or a bug in Novell's server code -- I don't think we'll ever know. The number of IQ-hours that have gone into *trying* to figure this out over the years could probably have carried several startups to successful IPOs <0.9 wink>.
One last useless clue: do all this on a Win98 with 128Mb RAM, and the timing difference goes away. Ditto Win95, but much less RAM is needed. It sometimes acts like a UNC path consumes 32Mb of dedicated RAM!
Apart from this UNC-vs-mapped-drive issue, over many hours of dead-end scenarios I was pleased to see that Win98 appears to do a good job of reallocating physical RAM in response to changing demands, & in particular better than Win95. There's no problem here at all!
The original test case I posted -- showing massive heap fragmentation under Win95, Win98 and W2K (but not NT) when growing a large Python list one element at a time -- remains an as-yet unstudied mystery. I can easily make *that* problem go away by, e.g., doing
a = [1]*3000000
del a
from time to time, apparently just to convince the Windows malloc that it would be a wise idea to allocate a lot more than it thinks it needs. This suggests (untested) that it *could* be a huge win for huge lists under Windows to overallocate them by more than Python does today. I'll look into that "someday".
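Pending that, there's a user-level dodge whenever the final size is known (or can be overestimated): preallocate the whole list in one grab and fill it in place, sidestepping append's fixed overallocation entirely. A sketch (mine, not anything in the core):

def build(n):
    a = [None] * n          # one big malloc up front
    for i in range(n):      # fill in place: no resizes, no copying
        a[i] = 1
    return a

For the 1M case above, that's a single 4Mb allocation instead of thousands of realloc's.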