[Python-bugs-list] [ python-Bugs-495401 ] Build troubles: --with-pymalloc

noreply@sourceforge.net noreply@sourceforge.net
Wed, 10 Apr 2002 23:26:37 -0700


Bugs item #495401, was opened at 2001-12-20 14:24
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=495401&group_id=5470

Category: Build
Group: Platform-specific
Status: Open
Resolution: None
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Martin v. Löwis (loewis)
Summary: Build troubles: --with-pymalloc

Initial Comment:
The build process segfaults with the current CVS 
version when using --with-pymalloc

System is SuSE Linux 7.0

> uname -a
Linux amazonas 2.2.16-SMP #1 SMP Wed Aug 2 20:01:21 
GMT 2000 i686 unknown
> gcc -v
Reading specs from /usr/lib/gcc-lib/i486-suse-
linux/2.95.2/specs
gcc version 2.95.2 19991024 (release)

Attached is the complete build log.


----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2002-04-11 08:26

Message:
Logged In: YES 
user_id=21627

I've revised the patch unicode3.diff to the current code
base (it still includes the assertion at the end); I've also
added a new timing test (utim3.py) which considers the
following cases:
- a string consisting of only spaces
- a string consisting of only spaces, and a single character
that needs three bytes in UTF-8
- a string consisting only of characters that need three bytes.

For all three cases, it tests various sizes of the string,
both below and above the pymalloc threshold. For the current
CVS (unicodeobject.c 2.136: MAL's change to use a variable
overalloc), I get

10 spaces                      20.060
100 spaces                     2.600
200 spaces                     2.030
1000 spaces                    0.930
10000 spaces                   0.690
10 spaces, 3 bytes             23.520
100 spaces, 3 bytes            3.730
200 spaces, 3 bytes            2.470
1000 spaces, 3 bytes           0.980
10000 spaces, 3 bytes          0.690
30 bytes                       24.800
300 bytes                      5.220
600 bytes                      3.830
3000 bytes                     2.480
30000 bytes                    2.230

With unicode3.diff, I get

10 spaces                      19.940
100 spaces                     3.260
200 spaces                     2.340
1000 spaces                    1.650
10000 spaces                   1.450
10 spaces, 3 bytes             21.420
100 spaces, 3 bytes            3.410
200 spaces, 3 bytes            2.420
1000 spaces, 3 bytes           1.660
10000 spaces, 3 bytes          1.450
30 bytes                       22.260
300 bytes                      5.830
600 bytes                      4.700
3000 bytes                     3.740
30000 bytes                    3.540

So it appears that unicode3.diff is more efficient for short
strings, and overallocation is more efficient for long
strings. Since overallocation may waste memory, I still
recommend integrating unicode3.diff.
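The attached scripts themselves aren't reproduced in the tracker, but a minimal harness in the spirit of utim3.py might look like the following (hypothetical sketch, not the actual attachment; it uses time.perf_counter(), the modern replacement for time.clock()):

```python
import time

def bench(s, loops=10000):
    # Encode the same unicode string repeatedly and report elapsed time.
    start = time.perf_counter()
    for _ in range(loops):
        s.encode('utf-8')
    return time.perf_counter() - start

cases = {
    '10 spaces':          ' ' * 10,
    '10 spaces, 3 bytes': ' ' * 10 + '\u20ac',  # U+20AC needs 3 UTF-8 bytes
    '30 bytes':           '\u20ac' * 10,        # 10 chars * 3 bytes each
}
for name, s in cases.items():
    print('%-22s %.3f' % (name, bench(s)))
```

The absolute numbers depend entirely on compiler, chip, and allocator, which is exactly why the thread keeps comparing per-machine results.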

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-13 13:39

Message:
Logged In: YES 
user_id=38388

Tim, I ran the test on my notebook and guess what:

when compiling Python with -mcpu=pentium I get
    58 vs. 59.8 (with / without patch)

when compiling Python with -mcpu=i686, it's
    65.8 vs. 60.17 (with / without patch)

No idea what chip is used in the notebook.
It's pretty old, though. I used gcc 2.95.2 and
some old SuSE Linux version (glibc 2).

Would be interesting to check this on a 
modern pentium 4 machine and maybe
a more recent sun sparc machine. Oh yes,
and your Cray, of course ;-)


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-02-11 22:06

Message:
Logged In: YES 
user_id=31435

MAL, cool -- I saw a major slowdown using the patch too, 
but not nearly as dramatic as you saw, so was curious about 
what could account for that.  Chip, compiler and OS can all 
have major effects.

I assume Martin is using a Pentium box, so assuming your 
notebook is running Linux too, it would be good to get 
another LinTel datapoint.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-11 21:50

Message:
Logged In: YES 
user_id=38388

Tim: Yes, I'm still all AMD based... it's an Athlon 1200 I'm 
running. PGCC (the Pentium GCC group's version) has
a special AMD optimization mode for Athlon which is
what I'm using. Somebody has to hold up the flag against
the Wintel camp ;-)

Hmm, I could do the same tests on my notebook which
runs on one of those Inteliums. Maybe tomorrow...

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-02-11 21:06

Message:
Logged In: YES 
user_id=31435

time.time() sucks for benchmarking on Windows (updates at 
about 18Hz).  Running the test as-is, MSVC6 and Win98SE, 
it's 1.3 seconds with current CVS, and 1.7 with 
unicode3.diff.

The quantization error in Windows time.time() is > 0.05 
seconds, so no point pretending there are 3 significant 
digits there; luckily(?), it's apparent there's a major 
difference with just 2 digits.
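The quantization Tim describes can be measured directly: watch a clock until its return value changes, and the smallest observed delta approximates its tick size. A sketch for a modern Python (time.clock() is long gone; time.perf_counter() stands in for it here):

```python
import time

def timer_resolution(clock, samples=100):
    # Estimate the smallest observable tick of a clock by timing
    # how long it takes for its return value to change.
    deltas = []
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:          # spin until the clock ticks
            t1 = clock()
        deltas.append(t1 - t0)
    return min(deltas)

# On Win9x-era systems time.time() ticked at ~18.2 Hz (~0.055 s);
# modern clocks are far finer, so this mostly illustrates the method.
print('time.time():         %g' % timer_resolution(time.time))
print('time.perf_counter(): %g' % timer_resolution(time.perf_counter))
```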

MAL, are you still using an AMD box?  In a decade, nobody 
else has ever reproduced the timing results you see <wink>.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-11 19:42

Message:
Logged In: YES 
user_id=38388

Ok, with 100000 loops and time.clock() I get:
4.690 - 4.710 without your patch,
9.560 - 9.570 with your patch
(again, without pymalloc and the same compiler/machine
on SuSE 7.1).


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-02-11 19:04

Message:
Logged In: YES 
user_id=21627

time.clock vs. time.time does not make a big difference on
an unloaded machine (except time.time has a higher resolution).
Can you please run the test 10x more often? I then get
12.520 clocks with CVS python, glibc 2.2.4, gcc 2.95, and
10.890 with my patch.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-11 18:49

Message:
Logged In: YES 
user_id=38388

I get different timings (note that you have to use time.clock() for 
benchmarks, not time.time()):

without your patch: 0.470 seconds
with your patch: 0.960 seconds

This is on Linux with pgcc 2.95.2, glibc 2.2, without pymalloc
(which is the normal configuration).

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-02-11 18:06

Message:
Logged In: YES 
user_id=21627

Marc: Please have a look at pymalloc; it cannot be "fixed".
It is in the nature of a pool allocator that you have to
copy when you want to move between pools - or you have to
waste the extra space.

I agree that UTF-8 coding needs to be fast; that's why I
wrote this patch. I've revised it to fit the current
implementation, and to add the assert that Tim has
requested (unicode3.diff).

For the test case time_utf8.zip (which is a UTF-8 converted
Misc/ACKS), the version that first counts the size is about
10% faster on my system (Linux glibc 2.2.4) (see timings
inside time_utf8.py; #592 is the patched version). So the
price for counting the size turns out to be negligible, and
is more than compensated for by
the reduction of calls to the memory management system.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-08 23:27

Message:
Logged In: YES 
user_id=38388

Good news, Walter.

Martin:
As I explained in an earlier comment, pymalloc needs to be fixed to better address overallocation. The two-pass 
logic would avoid overallocation, but at a high price. Copying memory (if at all needed) is likely to be *much* faster. 
The UTF-8 codec has to be as fast as possible since it is one of the most used codecs in Python's Unicode 
implementation. Also note that I have reduced overallocation to 2*size in the codec.

I suggest closing the bug report.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-02-07 23:59

Message:
Logged In: YES 
user_id=21627

I still think the current algorithm has serious problems as 
it is based on overallocation, and that it should be 
replaced with an algorithm that counts the memory 
requirements upfront. This will be particularly important 
for pymalloc, but will also avoid unnecessary copies for 
many other malloc implementations.
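The counting pass Martin describes is done in C inside unicodeobject.c; a pure-Python sketch of the same idea (an assumption about the patch's shape, not its actual code) shows how cheap the bookkeeping is:

```python
def utf8_size(s):
    # First pass: count exactly how many bytes the UTF-8 form needs,
    # so the output buffer can be allocated once, with no overallocation.
    n = 0
    for ch in map(ord, s):
        if ch < 0x80:
            n += 1          # ASCII
        elif ch < 0x800:
            n += 2          # two-byte sequence
        elif ch < 0x10000:
            n += 3          # three-byte sequence
        else:
            n += 4          # supplementary planes (UCS-4 builds)
    return n

s = 'abc\u20ac\U00010000'
assert utf8_size(s) == len(s.encode('utf-8'))
```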

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-02-07 10:58

Message:
Logged In: YES 
user_id=89016

I tried the current CVS and make altinstall runs to 
completion.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-06 19:12

Message:
Logged In: YES 
user_id=38388

I've checked in a patch which fixes the memory allocation problem. Please give it a try and tell me whether this 
fixes your problem, Walter. If so, I'd suggest closing the bug.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-31 20:36

Message:
Logged In: YES 
user_id=31435

Martin, I like your second patch fine, but would like it a 
lot better with

assert(p - PyString_AS_STRING(v) == cbAllocated);

at the end.  Else a piddly change in either loop can cause 
more miserably hard-to-track-down problems.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-31 14:46

Message:
Logged In: YES 
user_id=21627

MAL,
I'm 100% positive that the crash on my system was caused by
the UTF-8 encoding; I've seen it in the debugger overwrite
memory that it doesn't own.

As for unicode.diff: Tim has proposed that this should not
be done, but that 4*size should be allocated in advance.
What do you think?

On unicode2.diff: If pymalloc was changed to shrink the
memory, it would have to copy the original string, since it
would likely be in a different size class. This is less
efficient than the approach taken in unicode2.diff. What
specifically is it that you dislike about first counting the
memory requirements? 

It actually simplifies the code. Notice that the current
code is still buggy with regard to surrogates. If there is a
high surrogate, but not a low one, it will write bogus UTF-8
 (with no lead byte). This is fixed in unicode2.diff as well.
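The lone-surrogate case can be illustrated from a modern Python, where the encoder rejects it outright instead of writing a truncated sequence (the behavior of the 2001 code under discussion was to emit the bogus bytes):

```python
# A lone high surrogate (U+D800..U+DBFF with no low surrogate after it)
# has no valid UTF-8 form; a correct encoder must reject it or emit a
# replacement, never a continuation byte without a lead byte.
try:
    '\ud800'.encode('utf-8')
    print('encoded (unexpected)')
except UnicodeEncodeError as e:
    print('rejected:', e.reason)
```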

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-12-31 13:48

Message:
Logged In: YES 
user_id=38388

I like the unicode.diff, but not the unicode2.diff. Instead of fixing the UTF-8 codec here we should fix the 
pymalloc implementation, since overallocation is a common thing to do and not only used in codecs. (Note 
that all Python extensions will start to use pymalloc too once we enable it by default.)

I don't know whether it's relevant, but I found that core dumps can easily be triggered by mixing the various 
memory allocation APIs in Python and the C lib. The bug may not only be related to the UTF-8 codec but 
may also linger in some other extension modules.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-30 11:26

Message:
Logged In: YES 
user_id=21627

I disabled threading, which fortunately gave me memory
watchpoints back. Then I noticed that the final *p=0
corrupted a non-NULL freeblock pointer, slightly decreasing
it. Then following the freeblock pointer, freeblock was
changed to a bogus block, which had its next pointer as
garbage. I had to trace this from the back, of course.

As for overallocation, I wonder whether the UTF-8 encoding
should overallocate at all. unicode2.diff is a modification
where it runs over the string twice, counting the number of
needed bytes the first time. This is likely slower (at least
if no reallocations occur), but doesn't waste that much
memory (I notice that pymalloc will never copy objects when
they shrink, to return the extra space).

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-30 04:54

Message:
Logged In: YES 
user_id=31435

Good eye, Martin!  It's clearly possible for the unpatched 
code to write beyond the memory allocated.  The one thing 
that doesn't jibe is that you said bp is 0x2, which means 3 
of its 4 bytes are 0x00, but UTF-8 doesn't produce 0 bytes 
except for one per \u0000 input character.  Right?  So, if 
this routine is the cause, where are the 0 bytes coming 
from?  (It could be test_unicode sets up a UTF-8 encoding 
case with several \u0000 characters, but if so I didn't 
stumble into it.)  Plausible:  when a new pymalloc "page" 
is allocated, the 40-byte chunks in it are *not* linked 
together at the start.  Instead a NULL pointer is stored at 
just the start of "the first" 40-byte chunk, and pymalloc-- 
on subsequent mallocs --finds that NULL and incrementally 
carves out additional 40-byte chunks.  So as a startup-- 
but not a steady-state --condition, the "next free block" 
pointers will very often be NULLs, and then if this is a 
little-endian machine, writing a single 0x02 byte at the start 
of a free block would lead to a bogus pointer value of 0x2.

About a fix, I'm in favor of junking all the cleverness 
here, by allocating size*4 bytes from the start.  It's 
overallocating in all normal cases already, so we're going 
to incur the expense of cutting the result string back 
anyway; how *much* we overallocate doesn't matter to speed, 
except that if we don't have to keep checking inside the 
loop, the code gets simpler and quicker and more robust.  
The loop should instead merely assert that cbWritten <= 
cbAllocated at the end of each trip.  Had this been done 
from the start, a debug build would have assert-failed a 
few nanoseconds after the wild store.
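Tim's suggested shape (reserve the worst case up front, write without per-character bounds checks, assert, trim once at the end) can be sketched in Python; this is an illustration of the pattern, not the C code that was eventually committed:

```python
def encode_utf8_overalloc(s):
    # Reserve the worst case: 4 output bytes per input character.
    buf = bytearray(4 * len(s))
    n = 0
    for ch in s:
        b = ch.encode('utf-8')      # stand-in for the hand-written C loop
        buf[n:n + len(b)] = b
        n += len(b)
        assert n <= len(buf)        # the cheap per-trip sanity check
    # One trim at the end, instead of reallocation checks inside the loop.
    return bytes(buf[:n])

assert encode_utf8_overalloc('abc\u20ac') == 'abc\u20ac'.encode('utf-8')
```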

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-30 02:44

Message:
Logged In: YES 
user_id=21627

I've found one source of troubles, see the attached
unicode.diff. Guido's intuition was right; it was a UCS-4
problem: EncodeUTF8 would over-allocate 3*size bytes, but
can actually write 4*size in the worst case, which occurs in
test_unicode.

I'll leave the patch for review and experiments; it fixes
the problem for me.

The existing adjustment for surrogates is pointless, IMO:
for the surrogate pair, it will allocate 6 bytes UTF-8 in
advance, which is more than actually needed.
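The worst case Martin found is easy to confirm from any modern Python: every character at or above U+10000 needs 4 UTF-8 bytes, so on a UCS-4 build a 3*size allocation can be overrun.

```python
# Supplementary-plane characters each take 4 UTF-8 bytes, so the
# worst-case output for a UCS-4 string is 4*size bytes, not 3*size.
s = '\U0010fffd' * 5
assert len(s.encode('utf-8')) == 4 * len(s)
```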

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-30 00:13

Message:
Logged In: YES 
user_id=21627

It's a Heisenbug. I found that even slightest modifications
to the Python source make it come and go, or appear at
different places. On my system, the crashes normally occur
in the first run (.pyc). So I don't think the order of make
invocations is the source of the problem. It's likely as Tim
says: somebody overwrites memory somewhere.

Unfortunately, even though I can reproduce crashes for the
same pool, for some reason, my gdb memory watches don't
trigger. Tim's approach of checking whether the value came
from following the free list did not help, either: the bug
disappeared under the change.


----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2001-12-29 23:01

Message:
Logged In: YES 
user_id=6656

Hmm.  I now think that the stuff about extension modules is 
almost certainly a red herring.  What I said about "make && 
make altinstall" vs "make altinstall" still seems to be true, 
though.

If you compile with --with-pydebug, you crash right at the 
end of the second (-O) run of compileall.py -- I suspect this 
is something else, but it might not be.


----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2001-12-29 22:29

Message:
Logged In: YES 
user_id=6656

I don't know if these are helpful observations or not, but 
anyway:

(1) it doesn't core without the --enable-unicode=ucs4 option
(2) if you just run "make altinstall"  the library files are 
installed *and compiled* before the dynamically linked 
modules are built.  Then we don't crash.
(3) if you run "make altinstall" again, we crash.
If you initially ran "make && make install", we crash.
(4) when we crash, it's not long after the unicode tests are 
compiled.

Are these real clues or just red herrings?  I'm afraid I 
can't tell :(


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-29 19:43

Message:
Logged In: YES 
user_id=31435

Ouch.  Boosted priority back to 5, since Martin can 
reproduce it.  Alas, where pymalloc got called *from* is 
almost certainly irrelevant -- we're seeing the end result 
of earlier corruption.

Note that pymalloc is unusually sensitive to off-by-1 
stores, since the chunks it hands out are contiguous 
(there's no hidden bookkeeping padding between them).  
Plausible:  an earlier bogus store went beyond the end of 
its allocated chunk, overwriting the "next free block" 
pointer at the start of a previously free()'ed chunk of the 
same size (rounded up to a multiple of 8; 40 bytes in this 
case).

At the time this blows up, bp is supposed to point to a 
previously free()'ed chunk of size 40 bytes (if there were 
none free()'ed and available, the earlier "pool !=
pool->nextpool" guard should have failed).  The first 4 bytes 
(let's simplify by assuming this is a 32-bit box) of the 
free chunks link the free chunks together, most recently 
free()'ed at the start of the (singly linked) list.  So the 
code at this point is intent on returning bp, and
"pool->freeblock = *(block **)bp" is setting the 40-byte-chunk 
list header's idea of the *next* available 40-byte chunk.

But bp is bogus.  The value of bp is gotten out of the free 
list headers, the static array usedpools.  This mechanism 
is horridly obscure, an array of pointer pairs that, in 
effect, capture just the first two members of the 
pool_header struct, once for each chunk size.  It's 
possible that someone is overwriting
usedpools[4 + 4]->freeblock directly with 2, but that seems unlikely.

More likely is that a free() operation linked a 40-byte 
chunk into the list headed at usedpools[4+4]->freeblock 
correctly, and a later bad store overwrote the first 4 
bytes of the free()'ed block with 2.  Then the
"pool->freeblock = *(block **)bp" near the start of an 
unexceptional pymalloc call would copy the 2 into the list 
header's freeblock without complaint.  The error wouldn't 
show up until a subsequent malloc tried to use it.

So that's one idea to get closer to the cause:  add code to 
dereference pool->freeblock, before the "return (void *)bp".
If that blows up earlier, then the first four bytes 
of bp were corrupted, and that gives you a useful data 
breakpoint address for the next run.  If it doesn't blow up 
earlier, the corruption will be harder to find, but let's 
count on being lucky at first <wink>.
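Tim's little-endian hypothesis is easy to simulate: a freshly carved free chunk starts with a NULL "next free block" pointer, and a single wild store of 0x02 at its first byte yields exactly the bp value Martin saw in gdb.

```python
import struct

# Simulate the hypothesized corruption: a free chunk begins with a
# NULL next-pointer; a 1-byte wild store of 0x02 at its start turns
# that pointer into 0x00000002 when read back on a little-endian box.
chunk = bytearray(struct.pack('<I', 0))  # NULL "next free block" pointer
chunk[0] = 0x02                          # off-by-one store from a neighbour
(bogus,) = struct.unpack('<I', chunk)
assert bogus == 0x2                      # the bp value seen in the crash
```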

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-29 17:00

Message:
Logged In: YES 
user_id=21627

Ok, I can reproduce it now; I did not 'make install' before.
Here is a gdb back trace

#0  _PyCore_ObjectMalloc (nbytes=33) at Objects/obmalloc.c:417
#1  0x805885c in PyString_FromString (str=0x816c6e8
"checkJoin") at Objects/stringobject.c:136
#2  0x805d772 in PyString_InternFromString (cp=0x816c6e8
"checkJoin") at Objects/stringobject.c:3640
#3  0x807c9c6 in com_addop_varname (c=0xbfffe87c, kind=0,
name=0x816c6e8 "checkJoin")
    at Python/compile.c:939
#4  0x807de24 in com_atom (c=0xbfffe87c, n=0x816c258) at
Python/compile.c:1478
#5  0x807f01c in com_power (c=0xbfffe87c, n=0x816c8b8) at
Python/compile.c:1846
#6  0x807f545 in com_factor (c=0xbfffe87c, n=0x816c898) at
Python/compile.c:1975
#7  0x807f56c in com_term (c=0xbfffe87c, n=0x816c878) at
Python/compile.c:1985
#8  0x807f6bc in com_arith_expr (c=0xbfffe87c, n=0x816c858)
at Python/compile.c:2020
#9  0x807f7dc in com_shift_expr (c=0xbfffe87c, n=0x816c838)
at Python/compile.c:2046
#10 0x807f8fc in com_and_expr (c=0xbfffe87c, n=0x816c818) at
Python/compile.c:2072
#11 0x807fa0c in com_xor_expr (c=0xbfffe87c, n=0x816c7f8) at
Python/compile.c:2094
...

The access that crashes is *(block **)bp, since bp is 0x2.

Not sure how that happens; I'll investigate (but would
appreciate a clue). It seems that the pool chain got corrupted.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-29 15:52

Message:
Logged In: YES 
user_id=6380

Aha!  The --enable-unicode=ucs4 is more suspicious than the
--with-pymalloc. I had missed that info when this was first
reported.

Not that I'm any closer to solving it... :-(

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-12-29 11:53

Message:
Logged In: YES 
user_id=89016

OK, I did a "make distclean" which removed .o files and
the build directory and redid a "./configure --enable-
unicode=ucs4 --with-pymalloc && make && make altinstall".

The build process still crashes in the same spot:
Compiling /usr/local/lib/python2.2/test/test_urlparse.py ...
make: *** [libinstall] Segmentation fault

I also retried with a fresh untarred Python-2.2.tgz. This 
shows the same behaviour.



----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-29 10:23

Message:
Logged In: YES 
user_id=21627

At least I cannot reproduce it, on SuSE 7.3. Can you retry
this, building from a clean source tree (no .o files, no
build directory)?

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-28 23:30

Message:
Logged In: YES 
user_id=6380

My prediction: this is irreproducible. Lowering the priority
accordingly.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-12-23 13:24

Message:
Logged In: YES 
user_id=89016

Unfortunately no core file was generated. Can I somehow 
force core file generation?

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-22 15:42

Message:
Logged In: YES 
user_id=21627

Did that produce a core file? If so, can you attach a gdb
backtrace as well?

----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=495401&group_id=5470