Re: [Python-Dev] Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)

On 2013-02-26, 16:25 GMT, Terry Reedy wrote:
On 2/21/2013 4:22 PM, Matej Cepl wrote:
as my method to commemorate Aaron Swartz, I have decided to port his html2text to work fully with the latest python 3.3. After some time dealing with various bugs, I have now in my repo https://github.com/mcepl/html2text (branch python3) working solution which works all the way to python 3.2 (inclusive; https://travis-ci.org/mcepl/html2text). However, the last problem remains. This
<li>Run this command: <pre>ls -l *.html</pre></li> <li>?</li>
should lead to
Run this command:
ls -l *.html
?
but it doesn’t. It leads to this (with python 3.3 only)
* Run this command: ls -l *.html * ?
Does anybody know about something which changed in modules re or http://docs.python.org/3.3/whatsnew/changelog.html between 3.2 and 3.3, which could influence this script?
Search the changelob or 3.3 misc/News for items affecting those two modules. There are at least 4. http://docs.python.org/3.3/whatsnew/changelog.html
It is faintly possible that the switch from narrow/wide builds to unified builds somehow affected that. Have you tested with 2.7/3.2 on both narrow and wide unicode builds?
So, in the end, I have went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode related.
Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).
Any thoughts?
Matěj
P.S.: Crossposting to python-devel in hope there would be somebody understanding more about that particular commit. For that I have also intentionally not trim the original messages to preserve context.

On Wed, 06 Mar 2013 14:09:54 +0100, =?UTF-8?Q?Mat=C4=9Bj?= Cepl mcepl@redhat.com wrote:
So, in the end, I have went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode related.
Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).
If no issue number is mentioned in the commit message, then chances are there's no specific issue in the tracker related to that particular commit. Normally there will be an issue, but sometimes things are done without one (a practice we should maybe think about changing).
Most likely the commit's author, Victor Stinner, will see your message or this one and respond. That particular change recently came up (by implication) in another context (unicode singletons...)
--David

Hi,
2013/3/6 Matěj Cepl mcepl@redhat.com
On 2013-02-26, 16:25 GMT, Terry Reedy wrote:
On 2/21/2013 4:22 PM, Matej Cepl wrote:
as my method to commemorate Aaron Swartz, I have decided to port his html2text to work fully with the latest python 3.3. After some time dealing with various bugs, I have now in my repo https://github.com/mcepl/html2text (branch python3) working solution which works all the way to python 3.2 (inclusive; https://travis-ci.org/mcepl/html2text). However, the last problem remains. This
<li>Run this command: <pre>ls -l *.html</pre></li> <li>?</li>
should lead to
Run this command:
ls -l *.html
?
but it doesn’t. It leads to this (with python 3.3 only)
* Run this command: ls -l *.html * ?
Does anybody know about something which changed in modules re or http://docs.python.org/3.3/whatsnew/changelog.html between 3.2 and 3.3, which could influence this script?
Search the changelob or 3.3 misc/News for items affecting those two modules. There are at least 4. http://docs.python.org/3.3/whatsnew/changelog.html
It is faintly possible that the switch from narrow/wide builds to unified builds somehow affected that. Have you tested with 2.7/3.2 on both narrow and wide unicode builds?
So, in the end, I have went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode related.
Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).
I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed...
Does it change something if you use "==" instead?

On 2013-03-06 14:18, Amaury Forgeot d'Arc wrote:
Hi,
2013/3/6 Matěj Cepl <mcepl@redhat.com mailto:mcepl@redhat.com>
On 2013-02-26, 16:25 GMT, Terry Reedy wrote: > On 2/21/2013 4:22 PM, Matej Cepl wrote: >> as my method to commemorate Aaron Swartz, I have decided to port his >> html2text to work fully with the latest python 3.3. After some time >> dealing with various bugs, I have now in my repo >> https://github.com/mcepl/html2text (branch python3) working solution >> which works all the way to python 3.2 (inclusive; >> https://travis-ci.org/mcepl/html2text). However, the last problem >> remains. This >> >> <li>Run this command: >> <pre>ls -l *.html</pre></li> >> <li>?</li> >> >> should lead to >> >> * Run this command: >> >> ls -l *.html >> >> * ? >> >> but it doesn’t. It leads to this (with python 3.3 only) >> >> * Run this command: >> ls -l *.html >> >> * ? >> >> Does anybody know about something which changed in modules re or >> http://docs.python.org/3.3/whatsnew/changelog.html between 3.2 and >> 3.3, which could influence this script? > > Search the changelob or 3.3 misc/News for items affecting those two > modules. There are at least 4. > http://docs.python.org/3.3/whatsnew/changelog.html > > It is faintly possible that the switch from narrow/wide builds to > unified builds somehow affected that. Have you tested with 2.7/3.2 on > both narrow and wide unicode builds? So, in the end, I have went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode related. Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).
I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed...
Does it change something if you use "==" instead?
That function looks a little odd to me. Maybe I just don't understand what it's doing! :-)

Hi,
In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits. Copy/paste of the CIA IRC bot:
19:30 < irker555> cpython: Victor Stinner 3.3 * 82517:3dd2fa78fb89 / Objects/unicodeobject.c: _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/3dd2fa78fb89 19:30 < irker032> cpython: Victor Stinner default * 82518:fa59a85b373f / Objects/unicodeobject.c: (Merge 3.3) _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/fa59a85b373f
Victor
2013/3/6 Amaury Forgeot d'Arc amauryfa@gmail.com:
So, in the end, I have went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode related.
Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).
I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed...
Does it change something if you use "==" instead?
-- Amaury Forgeot d'Arc

On 2013-03-06, 18:34 GMT, Victor Stinner wrote:
In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.
I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail.
But, Amaury is right: the function doesn't make much sense. However, ...
when I have “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95
def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c is not ' ' and c is not ' ': return c is ' ' return line
to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90
def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c != ' ' and c != ' ': return c == ' ' return line
tests on ALL versions of Python are suddenly failing ... https://travis-ci.org/mcepl/html2text/builds/5288190
Curiouser and curiouser! At least, I seem to have the point, where things are breaking, but I have to admit that condition really doesn’t make any sense to me.
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.
Well, whatever is the problem, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045) as I can see on my computer. Actually, good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world).
Any thoughts?
Matěj

On 2013-03-07, at 11:08 , Matej Cepl wrote:
On 2013-03-06, 18:34 GMT, Victor Stinner wrote:
In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.
I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail.
But, Amaury is right: the function doesn't make much sense. However, ...
when I have “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95
def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c is not ' ' and c is not ' ': return c is ' ' return line
to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90
def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c != ' ' and c != ' ': return c == ' ' return line
The second test looks like some kind of corruption, it's supposedly iterating on the characters of a line yet testing for two spaces? Is it possible that the original was a literal tab embedded in the source code (instead of '\t') and that got broken at some point?
According to its name + docstring, the implementation of this method should really be replaced by `return line and line.isspace()` (the first part being to handle the case of an empty line: in the current implementation the line will be returned directly if no whitespace is found, which will be "negative" for an empty line, and ''.isspace() -> false). Does that fix the failing tests?

You should try to write a simple test not using your library (just copy/paste code) reproducing the issue. If you can do that, please fill an issue on bugs.python.org.
Victor
2013/3/7 Matej Cepl mcepl@redhat.com:
On 2013-03-06, 18:34 GMT, Victor Stinner wrote:
In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.
I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail.
But, Amaury is right: the function doesn't make much sense. However, ...
when I have “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95
def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c is not ' ' and c is not ' ': return c is ' ' return line
to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90
def onlywhite(line): """Return true if the line does only consist of whitespace characters.""" for c in line: if c != ' ' and c != ' ': return c == ' ' return line
tests on ALL versions of Python are suddenly failing ... https://travis-ci.org/mcepl/html2text/builds/5288190
Curiouser and curiouser! At least, I seem to have the point, where things are breaking, but I have to admit that condition really doesn’t make any sense to me.
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.
Well, whatever is the problem, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045) as I can see on my computer. Actually, good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world).
Any thoughts?
Matěj _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com

Am 07.03.2013 11:08, schrieb Matej Cepl:
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.
Well, whatever is the problem, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045) as I can see on my computer. Actually, good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world).
It's not a "fix", it's an optimization. Please understand that using the "is" operator on strings is entirely wrong.
Georg

Hi Matej,
On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl mcepl@redhat.com wrote:
if c is not ' ' and c is not ' ': if c != ' ' and c != ' ':
Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character ' '.
A bientôt,
Armin.

----- Original Message -----
From: "Armin Rigo" arigo@tunes.org To: "Matej Cepl" mcepl@redhat.com Cc: python-dev@python.org Sent: Saturday, May 4, 2013 11:59:42 AM Subject: Re: [Python-Dev] Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)
Hi Matej,
On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl mcepl@redhat.com wrote:
if c is not ' ' and c is not ' ': if c != ' ' and c != ' ':
Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character '
Comments on https://github.com/mcepl/html2text/commit/f511f3c78e60d7734d677f8945580f52ef... (perhaps in https://github.com/aaronsw/html2text/pull/77) are more than welcome. When using
SPACE_RE = re.compile(r'\s+')
for checking, whole onlywhite function is not needed anymore (and it still made me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though.
Best,
Matěj

On 05/05/2013 23:01, Matej Cepl wrote:
----- Original Message -----
From: "Armin Rigo" arigo@tunes.org To: "Matej Cepl" mcepl@redhat.com Cc: python-dev@python.org Sent: Saturday, May 4, 2013 11:59:42 AM Subject: Re: [Python-Dev] Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)
Hi Matej,
On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl mcepl@redhat.com wrote:
if c is not ' ' and c is not ' ': if c != ' ' and c != ' ':
Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character '
Comments on https://github.com/mcepl/html2text/commit/f511f3c78e60d7734d677f8945580f52ef... (perhaps in https://github.com/aaronsw/html2text/pull/77) are more than welcome. When using
SPACE_RE = re.compile(r'\s+')
That will match a whitespace character followed by a '+'.
for checking, whole onlywhite function is not needed anymore (and it still made me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though.
What do you mean by "doesn't work"?
participants (9)
-
Amaury Forgeot d'Arc
-
Armin Rigo
-
Georg Brandl
-
Matej Cepl
-
Matěj Cepl
-
MRAB
-
R. David Murray
-
Victor Stinner
-
Xavier Morel