Does anyone have local access to a sparc machine to try to track down the ongoing buildbot failures in test_subprocess?
(I think the problem is specific to 3.x builds on sparc machines, but I haven't checked the buildbots all that closely - that assessment is just based on what I recall of the buildbot failure emails).
Cheers, Nick.
Here is what I found just by analyzing the logs. It seems the first failures appeared after this change:
http://svn.python.org/view/python/branches/release30-maint/Objects/object.c?...
The logs of failing test runs all show the same error message:
[31481 refs]
* ob
object : <refcnt 0 at 0x3a97728>
type : str
refcount: 0
address : 0x3a97728
* op->_ob_prev->_ob_next
object : <refcnt 0 at 0x3a97728>
type : str
refcount: 0
address : 0x3a97728
* op->_ob_next->_ob_prev
object : [31776 refs]
This is the output of _Py_ForgetReference (which calls _PyObject_Dump) called either from _PyUnicode_New or unicode_subtype_new. In both cases, this implies PyObject_MALLOC returned NULL when allocating the internal array of a str object. However, I have no idea why malloc() is failing there.
By counting the number of [reftotal] printed in the log, I found that the failing test could be one of the following: test_invalid_args, test_invalid_bufsize, test_list2cmdline, test_no_leaking. Looking at the tests, it seems only test_no_leaking could be problematic:
* test_list2cmdline checks that the subprocess.list2cmdline function works correctly; only Python code is involved here.
* test_invalid_args checks that using an option unsupported by a platform raises an exception; only Python code is involved here.
* test_invalid_bufsize only checks whether Popen rejects a non-integer bufsize; only Python code is involved here.
And unsurprisingly, that is the failing test:
test test_subprocess failed -- Traceback (most recent call last):
  File "/home/pybot/buildarea-sid/3.0.klose-debian-sparc/build/Lib/test/test_subprocess.py", line 423, in test_no_leaking
    data = p.communicate(b"lime")[0]
  File "/home/pybot/buildarea-sid/3.0.klose-debian-sparc/build/Lib/subprocess.py", line 671, in communicate
    return self._communicate(input)
  File "/home/pybot/buildarea-sid/3.0.klose-debian-sparc/build/Lib/subprocess.py", line 1171, in _communicate
    bytes_written = os.write(self.stdin.fileno(), chunk)
OSError: [Errno 32] Broken pipe
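For anyone who wants to repeat the marker counting on another buildbot log, here is a rough sketch of how it could be automated. The helper names and the "* ob" dump marker are my own assumptions for illustration; the mapping from a count to a particular test method still has to be worked out by hand from test_subprocess.py.

```python
import re

# A debug-build interpreter prints a "[N refs]" line when it exits.
REFS_MARKER = re.compile(r"\[(\d+) refs\]")


def count_ref_markers(log_text):
    """Return how many '[N refs]' markers appear in log_text."""
    return len(REFS_MARKER.findall(log_text))


def markers_before_dump(log_text, dump_marker="* ob"):
    """Count the markers printed before the first refchain dump.

    Returns None when the log contains no dump at all.
    (dump_marker is a hypothetical default chosen to match the logs above.)
    """
    pos = log_text.find(dump_marker)
    if pos == -1:
        return None
    return count_ref_markers(log_text[:pos])


sample_log = "[31481 refs]\n* ob\nobject : <refcnt 0 at 0x3a97728>\n[31776 refs]\n"
print(count_ref_markers(sample_log), markers_before_dump(sample_log))
```

The second number is the one that matters: it says how many interpreter exits were logged before the first dump appeared.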
It seems one of the spawned processes runs out of memory while allocating a new PyUnicode object. I believe we don't see the usual MemoryError because the parent process captures the stderr and stdout of its children.
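A minimal sketch of the capturing effect, assuming a child that simply raises MemoryError (this is not the real test code, and run_child is a made-up helper): the traceback ends up in the pipe held by the parent, not on the console.

```python
import subprocess
import sys


def run_child(code):
    """Run `code` in a child interpreter with stdout and stderr captured."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    return proc.returncode, out, err


# The child's traceback is only visible if the parent inspects `err`.
returncode, out, err = run_child("raise MemoryError")
print(returncode, err.decode(errors="replace"))
```

Unless the parent prints or checks `err`, all the test log shows is the secondary failure, such as the broken pipe above.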
Also, only klose-*-sparc buildbots are failing this way; loewis-sun is failing too but for a different reason. So, how much memory is available on this machine (or actually, on this virtual machine)?
Now, I wonder why manipulating the GIL caused the bug to appear in 3.0, but not in 2.x. Maybe it is related to the new I/O library in Python 3.0.
Regards, -- Alexandre
On Tue, Dec 30, 2008 at 4:20 PM, Nick Coghlan ncoghlan@gmail.com wrote:
Does anyone have local access to a sparc machine to try to track down the ongoing buildbot failures in test_subprocess?
(I think the problem is specific to 3.x builds on sparc machines, but I haven't checked the buildbots all that closely - that assessment is just based on what I recall of the buildbot failure emails).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Alexandre Vassalotti wrote:
The logs of failing test runs all show the same error message:
[31481 refs]
* ob
object : <refcnt 0 at 0x3a97728>
type : str
refcount: 0
address : 0x3a97728
* op->_ob_prev->_ob_next
object : <refcnt 0 at 0x3a97728>
type : str
refcount: 0
address : 0x3a97728
* op->_ob_next->_ob_prev
object : [31776 refs]
A reliable way to get that in a --with-pydebug build seems to be:
~/py3k$ ./python -c "import locale; locale.format_string(1,1)"
* ob
object : <refcnt 0 at 0x825c76c>
type : tuple
refcount: 0
address : 0x825c76c
* op->_ob_prev->_ob_next
NULL
* op->_ob_next->_ob_prev
object : <refcnt 0 at 0x825c76c>
type : tuple
refcount: 0
address : 0x825c76c
Fatal Python error: UNREF invalid object
TypeError: expected string or buffer
Aborted
Found using Fusil in a very quick run on top of: Python 3.1a0 (py3k:68055M, Dec 31 2008, 01:34:52) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
So kudos to Victor again :)
HTH, Daniel
On Tue, Dec 30, 2008 at 10:41 PM, Daniel (ajax) Diniz ajaksu@gmail.com wrote:
A reliable way to get that in a --with-pydebug build seems to be:
~/py3k$ ./python -c "import locale; locale.format_string(1,1)"
* ob
object : <refcnt 0 at 0x825c76c>
type : tuple
refcount: 0
address : 0x825c76c
* op->_ob_prev->_ob_next
NULL
* op->_ob_next->_ob_prev
object : <refcnt 0 at 0x825c76c>
type : tuple
refcount: 0
address : 0x825c76c
Fatal Python error: UNREF invalid object
TypeError: expected string or buffer
Aborted
Nice catch! I reduced your example to: "import _sre; _sre.compile(0, 0, [])". And, it doesn't seem to be an input validation problem with _sre. From what I saw, it's actually a bug in Py_TRACE_REFS's code. Now, it's getting interesting!
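To check the reduced case against several builds without the abort taking down the calling shell, something like the sketch below could be used. The classify helper and its status strings are made up for illustration, and the negative-returncode convention for signals applies to POSIX only. On a build without Py_TRACE_REFS the child should simply exit with an ordinary exception.

```python
import subprocess
import sys

# The reduced reproducer from this message.
REDUCED_CASE = "import _sre; _sre.compile(0, 0, [])"


def classify(code, python=sys.executable):
    """Run `code` under `python` and describe how the child ended."""
    proc = subprocess.run([python, "-c", code], capture_output=True)
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a signal
        # (an abort after "Fatal Python error" would show up this way).
        status = "killed by signal %d" % -proc.returncode
    else:
        status = "exited with status %d" % proc.returncode
    return status, proc.stderr.decode(errors="replace")


if __name__ == "__main__":
    status, stderr_text = classify(REDUCED_CASE)
    print(status)
    print(stderr_text)
```

Pass the path to a --with-pydebug interpreter as the `python` argument to test a specific build.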
It seems something is breaking the refchain. However, I don't know what is causing the problem exactly.
Found using Fusil in a very quick run on top of: Python 3.1a0 (py3k:68055M, Dec 31 2008, 01:34:52) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
So kudos to Victor again :)
Could you share the details on how you used Fusil to find another crasher? It sounds like a useful tool.
Thanks!
-- Alexandre
Alexandre Vassalotti schrieb:
On Tue, Dec 30, 2008 at 10:41 PM, Daniel (ajax) Diniz ajaksu@gmail.com wrote:
A reliable way to get that in a --with-pydebug build seems to be:
~/py3k$ ./python -c "import locale; locale.format_string(1,1)"
* ob
object : <refcnt 0 at 0x825c76c>
type : tuple
refcount: 0
address : 0x825c76c
* op->_ob_prev->_ob_next
NULL
* op->_ob_next->_ob_prev
object : <refcnt 0 at 0x825c76c>
type : tuple
refcount: 0
address : 0x825c76c
Fatal Python error: UNREF invalid object
TypeError: expected string or buffer
Aborted
Nice catch! I reduced your example to: "import _sre; _sre.compile(0, 0, [])". And, it doesn't seem to be an input validation problem with _sre. From what I saw, it's actually a bug in Py_TRACE_REFS's code. Now, it's getting interesting!
It seems something is breaking the refchain. However, I don't know what is causing the problem exactly.
This only occurs --with-pydebug, I assume?
It is the same basic problem as in http://bugs.python.org/issue3299, which I analysed some time ago. Simply put, it is caused by the object allocation and deallocation scheme that _sre uses: if _compile's argument processing raises an error, PyObject_DEL is called, which frees the object without removing it from the refchain.
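To illustrate the mechanism, here is a toy model in pure Python. It is not the CPython code: Obj, RefChain, and simulate are made-up names, and the consistency check is only modelled on the labels in the dump (op->_ob_prev->_ob_next and op->_ob_next->_ob_prev). Re-adding the same Obj stands in for the allocator handing the freed address to a new object.

```python
class Obj:
    """Stand-in for an object header carrying the two refchain links."""

    def __init__(self, name):
        self.name = name
        self.prev = None
        self.next = None


class RefChain:
    """Circular doubly linked list with a sentinel, like the debug refchain."""

    def __init__(self):
        self.head = Obj("<refchain>")
        self.head.prev = self.head.next = self.head

    def add(self, op):
        # Link op right after the sentinel (what object creation does).
        op.next = self.head.next
        op.prev = self.head
        self.head.next.prev = op
        self.head.next = op

    def forget(self, op):
        # Neighbours must point back at op, otherwise the chain is corrupt.
        if op is self.head or op.prev.next is not op or op.next.prev is not op:
            raise RuntimeError("UNREF invalid object: " + op.name)
        op.next.prev = op.prev
        op.prev.next = op.next
        op.prev = op.next = None


def simulate(unlink_before_reuse):
    """Return the error raised when 'a' is later deallocated, or None."""
    chain = RefChain()
    a, b = Obj("a"), Obj("b")
    chain.add(a)
    chain.add(b)
    if unlink_before_reuse:
        chain.forget(b)      # normal deallocation path
    # else: b is freed without being unlinked (the PyObject_DEL case)
    chain.add(b)             # the same storage is reused for a new object
    try:
        chain.forget(a)
    except RuntimeError as exc:
        return str(exc)
    return None


print(simulate(True))
print(simulate(False))
```

In this model the failure shows up when an unrelated neighbour is deallocated later, which may be why the crash appears far from the code that skipped the unlink.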
Georg
Georg Brandl wrote:
This only occurs --with-pydebug, I assume?
For me, on 32-bit Linux, yes, only --with-pydebug.
It is the same basic problem as in http://bugs.python.org/issue3299, which I analysed some time ago.
Yes, I guess my 'catch' is exactly that. But it might be a red herring (sorry if that's the case): is the correlation with sparc and/or rev.67888 real?
Regards, Daniel
Daniel (ajax) Diniz wrote:
Georg Brandl wrote:
This only occurs --with-pydebug, I assume?
For me, on 32-bit Linux, yes, only --with-pydebug.
It is the same basic problem as in http://bugs.python.org/issue3299, which I analysed some time ago.
Yes, I guess my 'catch' is exactly that. But it might be a red herring (sorry if that's the case): is the correlation with sparc and/or rev.67888 real?
The correlation with sparc probably isn't real (that was just a subjective impression on my part based on the buildbot failure emails). When --with-pydebug is enabled, I can reproduce the fault (as posted by Alexandre) on 32-bit x86 Linux.
There may be a specific issue with the klose buildbots, but the crash in the object deallocation is obscuring the original problem.
I'll add further comments to the issue Georg linked.
Cheers, Nick.