Re: [Python-Dev] Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)

Matěj Cepl

6 Mar 2013

1:09 p.m.

On 2013-02-26, 16:25 GMT, Terry Reedy wrote:

...

On 2/21/2013 4:22 PM, Matej Cepl wrote:

...
as my method to commemorate Aaron Swartz, I have decided to port his html2text to work fully with the latest python 3.3. After some time dealing with various bugs, I have now in my repo https://github.com/mcepl/html2text (branch python3) working solution which works all the way to python 3.2 (inclusive; https://travis-ci.org/mcepl/html2text). However, the last problem remains. This

<li>Run this command: <pre>ls -l *.html</pre></li> <li>?</li>

should lead to

* Run this command:

ls -l *.html

* ?

but it doesn’t. It leads to this (with python 3.3 only)

* Run this command:
ls -l *.html

* ?

Does anybody know about something which changed in modules re or http://docs.python.org/3.3/whatsnew/changelog.html between 3.2 and 3.3, which could influence this script?

Search the changelog or 3.3 Misc/NEWS for items affecting those two modules. There are at least 4. http://docs.python.org/3.3/whatsnew/changelog.html

It is faintly possible that the switch from narrow/wide builds to unified builds somehow affected that. Have you tested with 2.7/3.2 on both narrow and wide unicode builds?

So, in the end, I went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode-related. Unfortunately, that doesn't tell me what exactly is broken (is it a known regression?) or whether there is a known workaround. Could anybody suggest how to find bugs on http://bugs.python.org related to a particular commit? (A plain search for 123f2dc0 didn’t find anything.)

Any thoughts?

Matěj

P.S.: Crossposting to python-devel in the hope that somebody there understands more about that particular commit. For that reason I have also intentionally not trimmed the original messages, to preserve context.

--
http://www.ceplovi.cz/matej/, Jabber: mcepl<at>ceplovi.cz
GPG Finger: 89EF 4BC6 288A BF43 1BAB 25C3 E09F EF25 D964 84AC

When you're happy that cut and paste actually works I think it's a sign you've been using X-Windows for too long.
-- from /. discussion on poor integration between KDE and GNOME

R. David Murray

6 Mar

1:51 p.m.

New subject: Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)

On Wed, 06 Mar 2013 14:09:54 +0100, Matěj Cepl wrote:

...

So, in the end, I went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode-related.

Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).

If no issue number is mentioned in the commit message, then chances are there's no specific issue in the tracker related to that particular commit. Normally there will be an issue, but sometimes things are done without one (a practice we should maybe think about changing).

Most likely the commit's author, Victor Stinner, will see your message or this one and respond. That particular change recently came up (by implication) in another context (unicode singletons...)

--David

Amaury Forgeot d'Arc

2:18 p.m.

Hi, 2013/3/6 Matěj Cepl

...

On 2013-02-26, 16:25 GMT, Terry Reedy wrote:

...
On 2/21/2013 4:22 PM, Matej Cepl wrote:

...
as my method to commemorate Aaron Swartz, I have decided to port his html2text to work fully with the latest python 3.3. After some time dealing with various bugs, I have now in my repo https://github.com/mcepl/html2text (branch python3) working solution which works all the way to python 3.2 (inclusive; https://travis-ci.org/mcepl/html2text). However, the last problem remains. This

<li>Run this command: <pre>ls -l *.html</pre></li> <li>?</li>

should lead to

* Run this command:

ls -l *.html

* ?

but it doesn’t. It leads to this (with python 3.3 only)

* Run this command:
ls -l *.html

* ?

Does anybody know about something which changed in modules re or http://docs.python.org/3.3/whatsnew/changelog.html between 3.2 and 3.3, which could influence this script?

Search the changelog or 3.3 Misc/NEWS for items affecting those two modules. There are at least 4. http://docs.python.org/3.3/whatsnew/changelog.html

It is faintly possible that the switch from narrow/wide builds to unified builds somehow affected that. Have you tested with 2.7/3.2 on both narrow and wide unicode builds?

So, in the end, I went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode-related.

Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).

I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed...

Does it change something if you use "==" instead?

-- Amaury Forgeot d'Arc
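To make the distinction concrete, here is a minimal sketch of value comparison versus identity comparison for strings. The helper name `same_text` and the sample strings are illustrative assumptions, not code from html2text.

```python
def same_text(a, b):
    """Compare two strings by value, which is what '==' does."""
    return a == b


# Build an equal string at runtime, so it need not be the same object
# as the literal below.
built = "".join(["ls", " ", "-l"])
literal = "ls -l"

# Value comparison: defined by the language, identical on every implementation.
print(same_text(built, literal))

# Identity comparison: whether two equal strings share one object is an
# implementation detail, so this result must not be relied upon.
print(built is literal)
```

Only the first result is something a program can depend on; the second may differ between interpreters and between versions of the same interpreter.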

MRAB

4:22 p.m.

On 2013-03-06 14:18, Amaury Forgeot d'Arc wrote:

...

Hi,

2013/3/6 Matěj Cepl <mcepl@redhat.com>

On 2013-02-26, 16:25 GMT, Terry Reedy wrote:
> On 2/21/2013 4:22 PM, Matej Cepl wrote:
>> as my method to commemorate Aaron Swartz, I have decided to port his
>> html2text to work fully with the latest python 3.3. After some time
>> dealing with various bugs, I have now in my repo
>> https://github.com/mcepl/html2text (branch python3) working solution
>> which works all the way to python 3.2 (inclusive;
>> https://travis-ci.org/mcepl/html2text). However, the last problem
>> remains. This
>>
>> <li>Run this command:
>> <pre>ls -l *.html</pre></li>
>> <li>?</li>
>>
>> should lead to
>>
>> * Run this command:
>>
>> ls -l *.html
>>
>> * ?
>>
>> but it doesn’t. It leads to this (with python 3.3 only)
>>
>> * Run this command:
>> ls -l *.html
>>
>> * ?
>>
>> Does anybody know about something which changed in modules re or
>> http://docs.python.org/3.3/whatsnew/changelog.html between 3.2 and
>> 3.3, which could influence this script?
>
> Search the changelog or 3.3 Misc/NEWS for items affecting those two
> modules. There are at least 4.
> http://docs.python.org/3.3/whatsnew/changelog.html
>
> It is faintly possible that the switch from narrow/wide builds to
> unified builds somehow affected that. Have you tested with 2.7/3.2 on
> both narrow and wide unicode builds?

So, in the end, I went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode-related.

Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).

I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed...

Does it change something if you use "==" instead?

That function looks a little odd to me. Maybe I just don't understand what it's doing! :-)

Victor Stinner

6:34 p.m.

Hi,

In short, Unicode was rewritten in Python 3.3 for PEP 393. It's not surprising that minor details like singletons differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.

Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits. Copy/paste of the CIA IRC bot:

19:30 < irker555> cpython: Victor Stinner 3.3 * 82517:3dd2fa78fb89 / Objects/unicodeobject.c: _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/3dd2fa78fb89

19:30 < irker032> cpython: Victor Stinner default * 82518:fa59a85b373f / Objects/unicodeobject.c: (Merge 3.3) _PyUnicode_Writer() now also reuses Unicode singletons: empty string and latin1 single character http://hg.python.org/cpython/rev/fa59a85b373f

Victor

2013/3/6 Amaury Forgeot d'Arc:

...

...
So, in the end, I went the long way and bisected cpython to find the commit which broke my tests, and it seems that the culprit is http://hg.python.org/cpython/rev/123f2dc08b3e so it is clearly something Unicode-related.

Unfortunately, it really doesn't tell me what exactly is broken (is it a known regression) and if there is known workaround. Could anybody suggest a way how to find bugs on http://bugs.python.org related to some particular commit (plain search for 123f2dc0 didn’t find anything).

I strongly suspect an incorrect usage of the "is" operator: https://github.com/mcepl/html2text/blob/master/html2text.py#L95 Identity of strings is not guaranteed...

Does it change something if you use "==" instead?

-- Amaury Forgeot d'Arc

Matej Cepl

7 Mar

10:08 a.m.

On 2013-03-06, 18:34 GMT, Victor Stinner wrote:

...

In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.

I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail.

But, Amaury is right: the function doesn't make much sense. However, ...

when I “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95

    def onlywhite(line):
        """Return true if the line does only consist of whitespace characters."""
        for c in line:
            if c is not ' ' and c is not '  ':
                return c is ' '
        return line

to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90

    def onlywhite(line):
        """Return true if the line does only consist of whitespace characters."""
        for c in line:
            if c != ' ' and c != ' ':
                return c == ' '
        return line

tests on ALL versions of Python are suddenly failing ... https://travis-ci.org/mcepl/html2text/builds/5288190

Curiouser and curiouser! At least I seem to have found the point where things are breaking, but I have to admit that the condition really doesn’t make any sense to me.

...

Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.

Well, whatever the problem is, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045 and as I can see on my computer). Actually, the good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world).

Any thoughts?

Matěj

Xavier Morel

10:31 a.m.

On 2013-03-07, at 11:08 , Matej Cepl wrote:

...

On 2013-03-06, 18:34 GMT, Victor Stinner wrote:

...
In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.

I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail.

But, Amaury is right: the function doesn't make much sense. However, ...

when I have “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95

    def onlywhite(line):
        """Return true if the line does only consist of whitespace characters."""
        for c in line:
            if c is not ' ' and c is not '  ':
                return c is ' '
        return line

to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90

    def onlywhite(line):
        """Return true if the line does only consist of whitespace characters."""
        for c in line:
            if c != ' ' and c != ' ':
                return c == ' '
        return line

The second test looks like some kind of corruption: it's supposedly iterating over the characters of a line, yet it tests for two spaces? Is it possible that the original was a literal tab embedded in the source code (instead of '\t') and that got broken at some point?

According to its name and docstring, the implementation of this function should really be replaced by `return line and line.isspace()`. The first part handles the case of an empty line: in the current implementation the line is returned directly if no non-space character is found, which is "negative" for an empty line, and ''.isspace() -> False.

Does that fix the failing tests?
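A minimal sketch of the replacement suggested above, written out as a full function. It follows the one-line suggestion literally and is not the code that ended up in html2text; the sample inputs are made up for illustration.

```python
def onlywhite(line):
    """Return a true value only if line is non-empty and all whitespace."""
    # For an empty string the 'and' short-circuits and returns '' (falsy);
    # otherwise the result is the boolean from str.isspace().
    return line and line.isspace()


for sample in ["", "   ", "\t \n", "  ls -l *.html"]:
    print(repr(sample), bool(onlywhite(sample)))
```

Note that `str.isspace()` accepts any whitespace character (tabs, newlines and so on), so this is broader than a test for the space character alone.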

Victor Stinner

1:34 p.m.

You should try to write a simple test that reproduces the issue without using your library (just copy/paste the code). If you can do that, please file an issue on bugs.python.org.

Victor

2013/3/7 Matej Cepl:

...

On 2013-03-06, 18:34 GMT, Victor Stinner wrote:

...
In short, Unicode was rewritten in Python 3.3 for the PEP 393. It's not surprising that minor details like singleton differ. You should not use "is" to compare strings in Python, or your program will fail on other Python implementations (like PyPy, IronPython, or Jython) or even on a different CPython version.

I am sorry, I don't understand what you are saying. Even though this has been changed to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90 the tests still fail.

But, Amaury is right: the function doesn't make much sense. However, ...

when I have “fixed it” from https://github.com/mcepl/html2text/blob/master/html2text.py#L95

    def onlywhite(line):
        """Return true if the line does only consist of whitespace characters."""
        for c in line:
            if c is not ' ' and c is not '  ':
                return c is ' '
        return line

to https://github.com/mcepl/html2text/blob/fix_tests/html2text.py#L90

    def onlywhite(line):
        """Return true if the line does only consist of whitespace characters."""
        for c in line:
            if c != ' ' and c != ' ':
                return c == ' '
        return line

tests on ALL versions of Python are suddenly failing ... https://travis-ci.org/mcepl/html2text/builds/5288190

Curiouser and curiouser! At least, I seem to have the point, where things are breaking, but I have to admit that condition really doesn’t make any sense to me.

...
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.

Well, whatever is the problem, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045) as I can see on my computer. Actually, good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world).

Any thoughts?

Matěj

Georg Brandl

8:20 p.m.

Am 07.03.2013 11:08, schrieb Matej Cepl:

...

...
Anyway, you spotted a missed optimization: it's now "fixed" in Python 3.3 and 3.4 by the following commits.

Well, whatever is the problem, it is not fixed in python 3.3.0 (as you can see in https://travis-ci.org/mcepl/html2text/builds/4969045) as I can see on my computer. Actually, good news is that it seems to be fixed in the master branch of cpython (or the tip, as they say in the Mercurial world).

It's not a "fix", it's an optimization. Please understand that using the "is" operator on strings is entirely wrong.

Georg

Armin Rigo

4 May

9:59 a.m.

Hi Matej, On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl wrote:

...

    if c is not ' ' and c is not '  ':
    if c != ' ' and c != ' ':

Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character '  '.

A bientôt,

Armin.
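The difference between the two conditions can be checked directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the second literal in the original is taken to be two spaces (as described above), and `is not` is replaced by `!=` in both so that only the literals differ.

```python
def original_condition(c):
    # Literals as in the original code: one space, then two spaces.
    # Assumption: compared by value here instead of with 'is not'.
    return c != ' ' and c != '  '


def posted_fix_condition(c):
    # Literals as in the posted "fix": the one-space string twice.
    return c != ' ' and c != ' '


for sample in [' ', 'x', '\t', '  ']:
    print(repr(sample), original_condition(sample), posted_fix_condition(sample))
```

Under value comparison the two conditions disagree only for the two-space string. Iterating over a `str` yields one-character strings, so inside `for c in line` that case would not arise; this suggests the dropped literal by itself may not account for the behavior change, although the thread does not settle that question.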

Matej Cepl

5 May

10:01 p.m.

----- Original Message -----

...

From: "Armin Rigo"
To: "Matej Cepl"
Cc: python-dev@python.org
Sent: Saturday, May 4, 2013 11:59:42 AM
Subject: Re: [Python-Dev] Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)

Hi Matej,

On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl wrote:

...
    if c is not ' ' and c is not '  ':
    if c != ' ' and c != ' ':

Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character '

Comments on https://github.com/mcepl/html2text/commit/f511f3c78e60d7734d677f8945580f52ef... (perhaps in https://github.com/aaronsw/html2text/pull/77) are more than welcome. When using

SPACE_RE = re.compile(r'\s\+')

for checking, the whole onlywhite function is not needed anymore (and it still makes me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though.

Best,

Matěj

MRAB

10:20 p.m.

On 05/05/2013 23:01, Matej Cepl wrote:

...

----- Original Message -----

...
From: "Armin Rigo"
To: "Matej Cepl"
Cc: python-dev@python.org
Sent: Saturday, May 4, 2013 11:59:42 AM
Subject: Re: [Python-Dev] Difference in RE between 3.2 and 3.3 (or Aaron Swartz memorial)

Hi Matej,

On Thu, Mar 7, 2013 at 11:08 AM, Matej Cepl wrote:

...
    if c is not ' ' and c is not '  ':
    if c != ' ' and c != ' ':

Sorry for the delay in answering, but I just noticed what is wrong in this "fix": it compares c with the same single-character ' ' twice, whereas the original compared it with ' ' and with the two-character '

Comments on https://github.com/mcepl/html2text/commit/f511f3c78e60d7734d677f8945580f52ef... (perhaps in https://github.com/aaronsw/html2text/pull/77) are more than welcome. When using

SPACE_RE = re.compile(r'\s\+')

That will match a whitespace character followed by a '+'.
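A short sketch of what that means in practice. The pattern `r'\s\+'` is the one quoted above; `r'\s+'` is shown only as the presumably intended form, which is an assumption, and the constant names are hypothetical.

```python
import re

# Pattern as posted: one whitespace character followed by a literal '+'.
ESCAPED_PLUS_RE = re.compile(r'\s\+')

# Presumably intended form (assumption): one or more whitespace characters.
ONE_OR_MORE_RE = re.compile(r'\s+')

for sample in [' +', '   ', ' \t\n', '']:
    print(
        repr(sample),
        bool(ESCAPED_PLUS_RE.fullmatch(sample)),
        bool(ONE_OR_MORE_RE.fullmatch(sample)),
    )
```

Inside a pattern, the backslash turns `+` from a quantifier into an ordinary character, so the posted pattern cannot match a line that consists only of whitespace.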

...

for checking, the whole onlywhite function is not needed anymore (and it still makes me wonder what Aaron meant when he wrote it). Why line.isspace() doesn't work is weird though.

What do you mean by "doesn't work"?


participants (9)

Amaury Forgeot d'Arc
Armin Rigo
Georg Brandl
Matej Cepl
Matěj Cepl
MRAB
R. David Murray
Victor Stinner
Xavier Morel
