test_subprocess and sparc buildbots
Does anyone have local access to a sparc machine to try to track down the ongoing buildbot failures in test_subprocess? (I think the problem is specific to 3.x builds on sparc machines, but I haven't checked the buildbots all that closely - that assessment is just based on what I recall of the buildbot failure emails). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
Here is what I found just by analyzing the logs. It seems the first failures appeared after this change:

http://svn.python.org/view/python/branches/release30-maint/Objects/object.c?rev=67888&view=diff&r1=67888&r2=67887&p1=python/branches/release30-maint/Objects/object.c&p2=/python/branches/release30-maint/Objects/object.c

The logs of the failing test runs all show the same error message:

    [31481 refs]
    * ob
    object : <refcnt 0 at 0x3a97728>
    type : str
    refcount: 0
    address : 0x3a97728
    * op->_ob_prev->_ob_next
    object : <refcnt 0 at 0x3a97728>
    type : str
    refcount: 0
    address : 0x3a97728
    * op->_ob_next->_ob_prev
    object : [31776 refs]

This is the output of _Py_ForgetReference (which calls _PyObject_Dump), called either from _PyUnicode_New or from unicode_subtype_new. In both cases, this implies PyObject_MALLOC returned NULL when allocating the internal array of a str object. However, I have no idea why malloc() is failing there.

By counting the number of [reftotal] lines printed in the log, I found that the failing test could be one of the following: test_invalid_args, test_invalid_bufsize, test_list2cmdline, test_no_leaking. Looking at the tests, it seems only test_no_leaking could be problematic:

* test_list2cmdline checks whether the subprocess.list2cmdline function works correctly; only Python code is involved here;
* test_invalid_args checks whether using an option unsupported by a platform raises an exception; only Python code is involved here;
* test_invalid_bufsize only checks whether Popen rejects a non-integer bufsize; only Python code is involved here.
And unsurprisingly, that is the failing test:

    test test_subprocess failed -- Traceback (most recent call last):
      File "/home/pybot/buildarea-sid/3.0.klose-debian-sparc/build/Lib/test/test_subprocess.py", line 423, in test_no_leaking
        data = p.communicate(b"lime")[0]
      File "/home/pybot/buildarea-sid/3.0.klose-debian-sparc/build/Lib/subprocess.py", line 671, in communicate
        return self._communicate(input)
      File "/home/pybot/buildarea-sid/3.0.klose-debian-sparc/build/Lib/subprocess.py", line 1171, in _communicate
        bytes_written = os.write(self.stdin.fileno(), chunk)
    OSError: [Errno 32] Broken pipe

It seems one of the spawned processes runs out of memory while allocating a new PyUnicode object. I believe we don't see the usual MemoryError because the parent process captures the stderr and stdout of the children. Also, only the klose-*-sparc buildbots are failing this way; loewis-sun is failing too, but for a different reason. So, how much memory is available on this machine (or actually, on this virtual machine)?

Now, I wonder why manipulating the GIL caused the bug to appear in 3.0, but not in 2.x. Maybe it is related to the new I/O library in Python 3.0.

Regards,
-- Alexandre

On Tue, Dec 30, 2008 at 4:20 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Does anyone have local access to a sparc machine to try to track down the ongoing buildbot failures in test_subprocess?
(I think the problem is specific to 3.x builds on sparc machines, but I haven't checked the buildbots all that closely - that assessment is just based on what I recall of the buildbot failure emails).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev
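The "Broken pipe" symptom discussed above can be illustrated in isolation. The following is only a minimal sketch, assuming a POSIX system: a child process dies before reading its stdin, the parent's write fails with EPIPE, and whatever the child printed on stderr stays in the parent's pipe instead of appearing on the console. The function name write_to_dead_child and the "simulated MemoryError" text are made up for this illustration and are not taken from test_subprocess.

```python
import os
import subprocess
import sys


def write_to_dead_child():
    """Write to the stdin of a child that has already exited (POSIX sketch)."""
    # The child only simulates a failure: it prints a message on stderr and
    # exits without ever reading its stdin.
    child_code = (
        "import sys; "
        "sys.stderr.write('simulated MemoryError\\n'); "
        "sys.exit(1)"
    )
    p = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    p.wait()  # make sure the read end of the stdin pipe is closed

    error = None
    try:
        # Same kind of call as in the traceback: a raw write on the pipe.
        os.write(p.stdin.fileno(), b"lime")
    except OSError as exc:
        error = exc

    # The child's error message was captured by the parent, not shown to us.
    captured = p.stderr.read()
    for stream in (p.stdin, p.stdout, p.stderr):
        stream.close()
    return error, captured


if __name__ == "__main__":
    error, captured = write_to_dead_child()
    print("write failed with:", repr(error))
    print("captured child stderr:", captured)
```

In this sketch the parent only learns about the child's real problem if it inspects the captured stderr, which is consistent with the guess that the original error message is hidden behind the OSError.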
Alexandre Vassalotti wrote:
The logs of failing test runs all shows the same error message:
    [31481 refs]
    * ob
    object : <refcnt 0 at 0x3a97728>
    type : str
    refcount: 0
    address : 0x3a97728
    * op->_ob_prev->_ob_next
    object : <refcnt 0 at 0x3a97728>
    type : str
    refcount: 0
    address : 0x3a97728
    * op->_ob_next->_ob_prev
    object : [31776 refs]
A reliable way to get that in a --with-pydebug build seems to be:

    ~/py3k$ ./python -c "import locale; locale.format_string(1,1)"
    * ob
    object : <refcnt 0 at 0x825c76c>
    type : tuple
    refcount: 0
    address : 0x825c76c
    * op->_ob_prev->_ob_next
    NULL
    * op->_ob_next->_ob_prev
    object : <refcnt 0 at 0x825c76c>
    type : tuple
    refcount: 0
    address : 0x825c76c
    Fatal Python error: UNREF invalid object
    TypeError: expected string or buffer
    Aborted

Found using Fusil in a very quick run on top of:

    Python 3.1a0 (py3k:68055M, Dec 31 2008, 01:34:52)
    [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2

So kudos to Victor again :)

HTH,
Daniel
On Tue, Dec 30, 2008 at 10:41 PM, Daniel (ajax) Diniz <ajaksu@gmail.com> wrote:
A reliable way to get that in a --with-pydebug build seems to be:
    ~/py3k$ ./python -c "import locale; locale.format_string(1,1)"
    * ob
    object : <refcnt 0 at 0x825c76c>
    type : tuple
    refcount: 0
    address : 0x825c76c
    * op->_ob_prev->_ob_next
    NULL
    * op->_ob_next->_ob_prev
    object : <refcnt 0 at 0x825c76c>
    type : tuple
    refcount: 0
    address : 0x825c76c
    Fatal Python error: UNREF invalid object
    TypeError: expected string or buffer
    Aborted
Nice catch! I reduced your example to: "import _sre; _sre.compile(0, 0, [])". And, it doesn't seem to be an input validation problem with _sre. From what I saw, it's actually a bug in Py_TRACE_REFS's code. Now, it's getting interesting! It seems something is breaking the refchain. However, I don't know what is causing the problem exactly.
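For anyone who wants to check their own build, the two one-liners quoted in this thread can be run in a child interpreter so that a crash does not take down the calling process. This is only a sketch: the helper name run_snippet and the result keys are hypothetical, and it relies on the documented POSIX behaviour of the subprocess module, where a negative return code means the child was killed by a signal (as an abort would do). On a build that is not affected, both snippets are expected to end with an ordinary TypeError.

```python
import subprocess
import sys

# The two reproducers mentioned in this thread.
SNIPPETS = [
    "import locale; locale.format_string(1,1)",
    "import _sre; _sre.compile(0, 0, [])",
]


def run_snippet(code, python=sys.executable):
    """Run `code` with `python -c` and summarise how the child ended."""
    proc = subprocess.run([python, "-c", code], capture_output=True, text=True)
    stderr_lines = proc.stderr.strip().splitlines()
    return {
        "code": code,
        "returncode": proc.returncode,
        # On POSIX, a negative value is the number of the fatal signal.
        "killed_by_signal": proc.returncode < 0,
        "last_stderr_line": stderr_lines[-1] if stderr_lines else "",
    }


if __name__ == "__main__":
    for snippet in SNIPPETS:
        print(run_snippet(snippet))
```

Passing the path of a --with-pydebug interpreter as the python argument would make it possible to compare builds side by side.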
Found using Fusil in a very quick run on top of: Python 3.1a0 (py3k:68055M, Dec 31 2008, 01:34:52) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
So kudos to Victor again :)
Could you share the details on how you used Fusil to find another crasher? It sounds like a useful tool. Thanks! -- Alexandre
Alexandre Vassalotti schrieb:
On Tue, Dec 30, 2008 at 10:41 PM, Daniel (ajax) Diniz <ajaksu@gmail.com> wrote:
A reliable way to get that in a --with-pydebug build seems to be:
    ~/py3k$ ./python -c "import locale; locale.format_string(1,1)"
    * ob
    object : <refcnt 0 at 0x825c76c>
    type : tuple
    refcount: 0
    address : 0x825c76c
    * op->_ob_prev->_ob_next
    NULL
    * op->_ob_next->_ob_prev
    object : <refcnt 0 at 0x825c76c>
    type : tuple
    refcount: 0
    address : 0x825c76c
    Fatal Python error: UNREF invalid object
    TypeError: expected string or buffer
    Aborted
Nice catch! I reduced your example to: "import _sre; _sre.compile(0, 0, [])". And, it doesn't seem to be an input validation problem with _sre. From what I saw, it's actually a bug in Py_TRACE_REFS's code. Now, it's getting interesting!
It seems something is breaking the refchain. However, I don't know what is causing the problem exactly.
This only occurs --with-pydebug, I assume? It is the same basic problem as in http://bugs.python.org/issue3299, which I analysed some time ago. Simply speaking, it is caused by the object allocation and deallocation scheme that _sre chooses: if _compile's argument processing raises an error, PyObject_DEL is called, which frees the object but doesn't remove it from the refchain.

Georg

-- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
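The mechanism described above can be pictured with a toy model. The sketch below is not CPython code: Obj, RefChain, raw_free and demo are hypothetical names, and clearing the links in raw_free is merely an assumption standing in for "the freed memory gets overwritten". It only shows why a doubly linked chain of all live objects fails its consistency check once one member is released without being unlinked first.

```python
class Obj:
    """Toy stand-in for an object header that carries refchain links."""

    def __init__(self, name):
        self.name = name
        self.ob_prev = None
        self.ob_next = None


class RefChain:
    """Circular doubly linked list of live objects with a sentinel head."""

    def __init__(self):
        self.head = Obj("<head>")
        self.head.ob_prev = self.head.ob_next = self.head

    def new_reference(self, op):
        # Link a freshly allocated object in right after the head.
        op.ob_next = self.head.ob_next
        op.ob_prev = self.head
        self.head.ob_next.ob_prev = op
        self.head.ob_next = op

    def forget_reference(self, op):
        # Both neighbours must still point back at op before it is unlinked.
        if op.ob_prev.ob_next is not op or op.ob_next.ob_prev is not op:
            raise RuntimeError("UNREF invalid object: " + op.name)
        op.ob_next.ob_prev = op.ob_prev
        op.ob_prev.ob_next = op.ob_next
        op.ob_prev = op.ob_next = None


def raw_free(op):
    # Assumption for this model: releasing the memory clobbers the links,
    # but nothing tells the chain that the object is gone.
    op.ob_prev = op.ob_next = None


def demo(unlink_first):
    chain = RefChain()
    args = Obj("tuple")
    chain.new_reference(args)
    pattern = Obj("pattern")
    chain.new_reference(pattern)  # head <-> pattern <-> tuple <-> head
    if unlink_first:
        chain.forget_reference(pattern)  # the step a proper dealloc performs
    raw_free(pattern)
    try:
        chain.forget_reference(args)  # later, the neighbour is deallocated
    except RuntimeError as exc:
        return str(exc)
    return "ok"


if __name__ == "__main__":
    print(demo(unlink_first=False))
    print(demo(unlink_first=True))
```

In this model the error is reported for the innocent neighbour (the tuple), not for the object that was freed incorrectly, which loosely mirrors the fact that the dump in this thread shows a tuple.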
Georg Brandl wrote:
This only occurs --with-pydebug, I assume?
For me, on 32-bit Linux, yes, only --with-pydebug*.
It is the same basic problem as in http://bugs.python.org/issue3299, which I analysed some time ago.
Yes, I guess my 'catch' is exactly that. But it might be a red herring (sorry if that's the case): is the correlation with sparc and/or rev.67888 real? Regards, Daniel
Daniel (ajax) Diniz wrote:
Georg Brandl wrote:
This only occurs --with-pydebug, I assume?
For me, on 32 bits Linux, yes, only --with-pydebug*.
It is the same basic problem as in http://bugs.python.org/issue3299, which I analysed some time ago.
Yes, I guess my 'catch' is exactly that. But it might be a red herring (sorry if that's the case): is the correlation with sparc and/or rev.67888 real?
The correlation with sparc probably isn't real (that was just a subjective impression on my part based on the buildbot failure emails). When --with-pydebug is enabled, I can reproduce the fault (as posted by Alexandre) on 32-bit x86 Linux.

There may be a specific issue with the klose buildbots, but the crash in the object deallocation is obscuring the original problem. I'll add further comments to the issue Georg linked.

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
participants (4)
- Alexandre Vassalotti
- Daniel (ajax) Diniz
- Georg Brandl
- Nick Coghlan