Re: [Python-Dev] Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)

On 2013-02-26, 16:25 GMT, Terry Reedy wrote:
So, in the end, I have went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode related. Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything). Any thoughts? Matěj P.S.: Crossposting to python-devel in hope there would be somebody understanding more about that particular commit. For that I have also intentionally not trim the original messages to preserve context. -- http://www.ceplovi.cz/matej/, Jabber: mcepl<at>ceplovi.cz GPG Finger: 89EF 4BC6 288A BF43 1BAB 25C3 E09F EF25 D964 84AC When you're happy that cut and paste actually works I think it's a sign you've been using X-Windows for too long. -- from /. discussion on poor integration between KDE and GNOME

On Wed, 06 Mar 2013 14:09:54 +0100, =?UTF-8?Q?Mat=C4=9Bj?= Cepl <mcepl@redhat.com> wrote:
If no issue number is mentioned in the commit message, then chances are there's no specific issue in the tracker related to that particular commit. Normally there will be an issue, but sometimes things are done without one (a practice we should maybe think about changing). Most likely the commit's author, Victor Stinner, will see your message or this one and respond. That particular change recently came up (by implication) in another context (unicode singletons...) --David

Hi, 2013/3/6 Matěj Cepl <mcepl@redhat.com>
I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed... Does it change something if you use "==" instead? -- Amaury Forgeot d'Arc

Hi, In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version. Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits. Copy/paste of the CIA IRC bot: 19:30 < irker555> cpython: Victor Stinner 3.3 * 82517:3dd2fa78fb89 / Objects/unicodeobject.c: _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/3dd2fa78fb89 19:30 < irker032> cpython: Victor Stinner default * 82518:fa59a85b373f / Objects/unicodeobject.c: (Merge 3.3) _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/fa59a85b373f Victor 2013/3/6 Amaury Forgeot d'Arc <amauryfa@gmail.com>:

On 2013-03-06, 18:34 GMT, Victor Stinner wrote:
I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail. But, Amaury is right: the function doesn't make much sense. However, ... when I have “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95 def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c is not ' ' and c is not ' ': return c is ' ' return line to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c != ' ' and c != ' ': return c == ' ' return line tests on ALL versions of Python are suddenly failing ... https://travis-ci.org/mcepl/html2text/builds/5288190 Curiouser and curiouser! At least, I seem to have the point, where things are breaking, but I have to admit that condition really doesn’t make any sense to me.
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.
Well, whatever is the problem, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045) as I can see on my computer. Actually, good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world). Any thoughts? Matěj

On 2013-03-07, at 11:08 , Matej Cepl wrote:
The second test looks like some kind of corruption, it's supposedly iterating on the characters of a line yet testing for two spaces? Is it possible that the original was a literal tab embedded in the source code (instead of '\t') and that got broken at some point? According to its name + docstring, the implementation of this method should really be replaced by `return line and line.isspace()` (the first part being to handle the case of an empty line: in the current implementation the line will be returned directly if no whitespace is found, which will be "negative" for an empty line, and ''.isspace() -> false). Does that fix the failing tests?

You should try to write a simple test not using your library (just copy/paste code) reproducing the issue. If you can do that, please fill an issue on bugs.python.org. Victor 2013/3/7 Matej Cepl <mcepl@redhat.com>:

Hi Matej, On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl <mcepl@redhat.com> wrote:
if c is not ' ' and c is not ' ': if c != ' ' and c != ' ':
Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character ' '. A bientôt, Armin.

----- Original Message -----
Comments on https://github.com/mcepl/html2text/commit/f511f3c78e60d7734d677f8945580f52ef... (perhaps in https://github.com/aaronsw/html2text/pull/77) are more than welcome. When using SPACE_RE = re.compile(r'\s\+') for checking, whole onlywhite function is not needed anymore (and it still made me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though. Best, Matěj

On 05/05/2013 23:01, Matej Cepl wrote:
That will match a whitespace character followed by a '+'.
for checking, whole onlywhite function is not needed anymore (and it still made me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though.
What do you mean by "doesn't work"?

On Wed, 06 Mar 2013 14:09:54 +0100, =?UTF-8?Q?Mat=C4=9Bj?= Cepl <mcepl@redhat.com> wrote:
If no issue number is mentioned in the commit message, then chances are there's no specific issue in the tracker related to that particular commit. Normally there will be an issue, but sometimes things are done without one (a practice we should maybe think about changing). Most likely the commit's author, Victor Stinner, will see your message or this one and respond. That particular change recently came up (by implication) in another context (unicode singletons...) --David

Hi, 2013/3/6 Matěj Cepl <mcepl@redhat.com>
I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed... Does it change something if you use "==" instead? -- Amaury Forgeot d'Arc

Hi, In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version. Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits. Copy/paste of the CIA IRC bot: 19:30 < irker555> cpython: Victor Stinner 3.3 * 82517:3dd2fa78fb89 / Objects/unicodeobject.c: _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/3dd2fa78fb89 19:30 < irker032> cpython: Victor Stinner default * 82518:fa59a85b373f / Objects/unicodeobject.c: (Merge 3.3) _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/fa59a85b373f Victor 2013/3/6 Amaury Forgeot d'Arc <amauryfa@gmail.com>:

On 2013-03-06, 18:34 GMT, Victor Stinner wrote:
I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail. But, Amaury is right: the function doesn't make much sense. However, ... when I have “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95 def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c is not ' ' and c is not ' ': return c is ' ' return line to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c != ' ' and c != ' ': return c == ' ' return line tests on ALL versions of Python are suddenly failing ... https://travis-ci.org/mcepl/html2text/builds/5288190 Curiouser and curiouser! At least, I seem to have the point, where things are breaking, but I have to admit that condition really doesn’t make any sense to me.
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.
Well, whatever is the problem, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045) as I can see on my computer. Actually, good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world). Any thoughts? Matěj

On 2013-03-07, at 11:08 , Matej Cepl wrote:
The second test looks like some kind of corruption, it's supposedly iterating on the characters of a line yet testing for two spaces? Is it possible that the original was a literal tab embedded in the source code (instead of '\t') and that got broken at some point? According to its name + docstring, the implementation of this method should really be replaced by `return line and line.isspace()` (the first part being to handle the case of an empty line: in the current implementation the line will be returned directly if no whitespace is found, which will be "negative" for an empty line, and ''.isspace() -> false). Does that fix the failing tests?

You should try to write a simple test not using your library (just copy/paste code) reproducing the issue. If you can do that, please fill an issue on bugs.python.org. Victor 2013/3/7 Matej Cepl <mcepl@redhat.com>:

Hi Matej, On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl <mcepl@redhat.com> wrote:
if c is not ' ' and c is not ' ': if c != ' ' and c != ' ':
Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character ' '. A bientôt, Armin.

----- Original Message -----
Comments on https://github.com/mcepl/html2text/commit/f511f3c78e60d7734d677f8945580f52ef... (perhaps in https://github.com/aaronsw/html2text/pull/77) are more than welcome. When using SPACE_RE = re.compile(r'\s\+') for checking, whole onlywhite function is not needed anymore (and it still made me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though. Best, Matěj

On 05/05/2013 23:01, Matej Cepl wrote:
That will match a whitespace character followed by a '+'.
for checking, whole onlywhite function is not needed anymore (and it still made me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though.
What do you mean by "doesn't work"?
participants (9)
-
Amaury Forgeot d'Arc
-
Armin Rigo
-
Georg Brandl
-
Matej Cepl
-
Matěj Cepl
-
MRAB
-
R. David Murray
-
Victor Stinner
-
Xavier Morel