
my latest changes fixed a couple of things, but broke one of the old RE tests, namely:

    re.match('\\x00ffffffffffffff', '\377') != None

or in other words, long hexadecimal escapes are cast down to 8-bit characters in RE. in SRE (after the latest change), they're cast down to the size of the engine's internal word size (currently 16 bits).

is the old behaviour worth keeping? I'd rather not make the engine dependent on string types; it shouldn't really matter if you're using unicode patterns on 8-bit target strings, or vice versa.

</F>
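(As a minimal sketch of the two behaviours described above -- nothing here is the real engine code, and fold_hex_escape is a hypothetical helper -- the "cast down" amounts to masking the escape's value to a fixed width:)

    def fold_hex_escape(digits, width_bits):
        # interpret all the hex digits of an \x escape, then keep only the
        # low bits: width_bits=8 models the old 8-bit re behaviour,
        # width_bits=16 models a 16-bit internal engine word
        return int(digits, 16) & ((1 << width_bits) - 1)

    assert fold_hex_escape("00ffffffffffffff", 8) == 0xFF      # old re: '\377'
    assert fold_hex_escape("00ffffffffffffff", 16) == 0xFFFF   # sre: 16-bit word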

On Fri, Jun 30, 2000 at 04:18:13PM +0200, Fredrik Lundh wrote:
re.match('\\x00ffffffffffffff', '\377') != None

or in other words, long hexadecimal escapes are cast down to 8-bit characters in RE.
This is for compatibility with Python string literals:

    kronos Python-1.6>./python
    >>> '\x00fffffff'
    '\377'
    >>> u'\x00fffffff'
    u'\uFFFF'

(Where do these semantics come from, BTW? C's \x seems to take any number of hex digits but then reports an error if the character is greater than 256, too large to fit into a byte.)

Note that the \u escape for Unicode characters uses exactly 4 digits, no more, no less. It would certainly be simpler and clearer to only support a fixed number of digits with \x, since I find the casting down behaviour is magical and not obvious. But I don't know if we want to make that change now. (Guido now realizes the downside to numbering it 2.0, as everyone hurries to suggest their favorite backward-incompatible change.)

That doesn't help with regexes, of course, since a pattern might be written as a regular string but be intended to match Unicode. Maybe the simplest rule is the best; always take 4 digits, even if it winds up being incompatible with the \x in string literals.

--amk
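(To make the contrast concrete, here is a small illustrative sketch -- the two regexes below only model the lexing rule amk describes, they are not CPython's tokenizer -- showing \u taking exactly four hex digits while \x swallows every hex digit that follows:)

    import re

    X_ESCAPE = re.compile(r"\\x([0-9a-fA-F]+)")      # variable width: greedy
    U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")    # fixed width: exactly four

    print(X_ESCAPE.match(r"\x00fffffff").group(1))   # '00fffffff' -- all digits
    print(U_ESCAPE.match(r"\uFFFF0123").group(1))    # 'FFFF' -- four and no more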

[Andrew Kuchling]
... This is for compatibility with Python string literals:
    kronos Python-1.6>./python
    >>> '\x00fffffff'
    '\377'
    >>> u'\x00fffffff'
    u'\uFFFF'
(Where do these semantics come from, BTW? C's \x seems to take any number of hex digits but then reports an error if the character is greater than 256, too large to fit into a byte.)
The behavior of \x in C is mostly implementation-defined. The committee knew that C had to do *something* to support "large characters" down the road, but in those early days they had no clear idea exactly what. So, rather than do something sensible <0.5 wink>, they invented a perfectly general mechanism without portable semantics. "C itself" isn't complaining if the character "is greater than 256", it's the specific implementation of C you're using that's complaining. A different implementation is free to (& probably will!) do something different. Guido adopted the most commonly implemented semantics (ignore all but the last byte) in Python, apparently under the delusion that this would be a Good Thing <wink>. Marc-Andre followed suit by generalizing this madness to Unicode.
Note that the \u escape for Unicode characters uses exactly 4 digits, no more, no less.
I pushed for that obnoxiously. Glad you appreciate it <wink>. Java does the same.
It would certainly be simpler and clearer to only support a fixed number of digits with \x, since I find the casting down behaviour is magical and not obvious.
Yes, it's basically nuts.
But I don't know if we want to make that change now.
No from me, because it may break stuff. Wait for Python 2.0 <ahem>.
(Guido now realizes the downside to numbering it 2.0, as everyone hurries to suggest their favorite backward-incompatible change.)
Guido always realized that, I believe. It's a "least of evils" kind of thing, mixed with a celebration, not a pure win.
That doesn't help with regexes, of course, since a pattern might be written as a regular string but be intended to match Unicode. Maybe the simplest rule is the best; always take 4 digits, even if it winds up being incompatible with the \x in string literals.
I vote for backward compatibility for now, and not only because that will irritate /F the most.

tim wrote:
That doesn't help with regexes, of course, since a pattern might be written as a regular string but be intended to match Unicode. Maybe the simplest rule is the best; always take 4 digits, even if it winds up being incompatible with the \x in string literals.
I vote for backward compatibility for now, and not only because that will irritate /F the most.
backward compatibility with what? 8-bit string literals or unicode string literals?

the problem here is that the pattern is compiled once (from either 8-bit or unicode strings), and can then be used on either 8-bit or unicode targets. to be fully backwards compatible, this means that the compiler should use 8 bits, no matter what string type you're using.

another solution would be to use the type of the pattern string to choose between 8 and 16 bits. I almost implemented that, before I realized that it broke the following rather nice property:

    sre.compile("some pattern") == sre.compile(u"some pattern")

(well, the pattern type doesn't implement __cmp__, but you get the idea). the current implementation guarantees "==", but I'm planning to change that to "is" (!).

anyway, I suspect it's too late to change this in 2.0b1. if enough people complain about this, we can always label it a "critical bug", and do something about it in b2.

</F>
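(A toy sketch of the "==" versus "is" distinction /F mentions, assuming compilation goes through a cache keyed on the pattern text; cached_compile and its tuple "program" are hypothetical stand-ins, not sre's real machinery:)

    _cache = {}

    def cached_compile(pattern_text):
        # stand-in for a real compiler; the tuple models the compiled program
        program = ("compiled-program-for", pattern_text)
        return _cache.setdefault(pattern_text, program)

    a = cached_compile("some pattern")
    b = cached_compile("some pattern")
    assert a == b    # equal pattern text always yields equal programs
    assert a is b    # with the cache, it yields the very same program object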

[Tim]
I vote for backward compatibility for now, and not only because that will irritate /F the most.
[/F]
backward compatibility with what?
1.5.2.
8-bit string literals
At least, because they were in 1.5.2.
or unicode string literals?
I'm sorry \x escapes are even allowed in those -- \x notation is a gimmick for making strings hold arbitrary binary data, which we're trying to get away from. To the extent that they make any sense at all in Unicode strings, \u should be used instead.
the problem here is that the pattern is compiled once (from either 8-bit or unicode strings), and can then be used on either 8-bit or unicode targets. to be fully backwards compatible, this means that the compiler should use 8 bits, no matter what string type you're using.
Unicode strings weren't in 1.5.2, so there can't possibly be a backwards compatibility issue with them -- at least not in the sense I'm using the phrase here.
another solution would be to use the type of the pattern string to choose between 8 and 16 bits. I almost implemented that, before I realized that it broke the following rather nice property:
sre.compile("some pattern") == sre.compile(u"some pattern")
(well, the pattern type doesn't implement __cmp__, but you get the idea). the current implementation guarantees "==", but I'm planning to change that to "is" (!).
Do you mean that, e.g., sre.compile("\u0045") == sre.compile(u"\u0045") too? If so, that doesn't make any sense to me (interpreting \u in 8-bit strings is even more confused than interpreting \x in Unicode strings). But if you didn't mean to include this case, then the equality doesn't actually hold now, so there's nothing to preserve <wink>.
anyway, I suspect it's too late to change this in 2.0b1. if enough people complain about this, we can always label it a "critical bug", and do something about it in b2.
I think the real problem here was MAL's generalization of \x to 2-byte stuff in Unicode strings. If Unicode strings *have* to support \x, then \x0123456789abcdef in Unicode strings should act like \u00ef in Unicode strings, and SRE should play along with that too. \x was broken to begin with; better to wipe it out than try to generalize it. OTOH, I didn't get much sleep last night <0.8 wink>.
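(A quick worked check of that equivalence, reading \x as "keep only the last byte": the low byte of 0x0123456789abcdef is 0xef, i.e. the character \u00ef. In present-day Python:)

    assert int("0123456789abcdef", 16) & 0xFF == 0xEF
    assert chr(0xEF) == "\u00ef"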

tim wrote:
to be fully backwards compatible, this means that the compiler should use 8 bits, no matter what string type you're using.
...
I think the real problem here was MAL's generalization of \x to 2-byte stuff in Unicode strings. If Unicode strings *have* to support \x, then
\x0123456789abcdef
in Unicode strings should act like
\u00ef
in Unicode strings, and SRE should play along with that too. \x was broken to begin with; better to wipe it out than try to generalize it.
I think this means that we agree -- \x is a wart that can only be used to embed *binary bytes* in a string. </F>

[/F]
I think this means that we agree -- \x is a wart that can only be used to embed *binary bytes* in a string.
We certainly agree about that part! I thought my
I'm sorry \x escapes are even allowed in [u-strings] -- \x notation is a gimmick for making strings hold arbitrary binary data, which we're trying to get away from. To the extent that they make any sense at all in Unicode strings, \u should be used instead.
was pretty explicit <wink>. What we may still disagree on is how SRE should deal with the \x mess. I'm in favor of making \x mean "just the last byte" in plain and Unicode strings -- do the least harm with this (mis)feature. Making \x mean anything other than that for plain strings, regardless of context, is not backward compatible (with 1.5.2). And since Unicode strings haven't been released yet, it's not too late to change what they do with \x. That would make SRE's job clear here, yes? And in a way that allows the now-failing test to pass again?

tim wrote:
That would make SRE's job clear here, yes? And in a way that allows the now-failing test to pass again?
sure. quoting from python-checkins:

    RCS file: /cvsroot/python/python/dist/src/Lib/test/output/test_sre,v
    ...
    test_sre
    === Failed incorrectly ('\\x00ffffffffffffff', '\377', 0, 'found', '\377')
    === Failed incorrectly ('^(.+)?B', 'AB', 0, 'g1', 'A')
    ...

still messes up on nested repetitions, but that's an entirely different problem...

</F>

my latest changes fixed a couple of things, but broke one of the old RE tests, namely:
re.match('\\x00ffffffffffffff', '\377') != None
or in other words, long hexadecimal escapes are cast down to 8-bit characters in RE.
in SRE (after the latest change), they're cast down to the size of the engine's internal word size (currently 16 bits).
is the old behaviour worth keeping? I'd rather not make the engine dependent on string types; it shouldn't really matter if you're using unicode patterns on 8-bit target strings, or vice versa.
To someone familiar with '\x00ffffffffffffff' == '\377', the failure is surprising. What Would Larry Do? (I.e. is this in Perl?)

Maybe make it dependent on the type of the searched string ('\377') rather than on the type of the pattern?

--Guido van Rossum (home page: http://www.python.org/~guido/)

On Fri, Jun 30, 2000 at 11:07:16AM -0500, Guido van Rossum wrote:
To someone familiar with '\x00ffffffffffffff' == '\377', the failure is surprising. What Would Larry Do? (I.e. is this in Perl?)
It uses two digits: "\x00ffff" is the string "<binary 0>ffff".
Maybe make it dependent on the type of the searched string ('\377') rather than on the type of the pattern?
Won't work; you could just be compiling a pattern to make a regex object, and have no idea what you're matching against. --amk
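(The usage shape amk is pointing at, sketched with a placeholder pattern; the only point is that compilation happens before any target string exists:)

    import re

    pat = re.compile(r"spam\d+")   # compiled once; the target's type is unknown here

    def find_spam(text):
        # called much later, with whatever string the caller happens to have
        return pat.search(text)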

On Fri, Jun 30, 2000 at 11:07:16AM -0500, Guido van Rossum wrote:
To someone familiar with '\x00ffffffffffffff' == '\377', the failure is surprising. What Would Larry Do? (I.e. is this in Perl?)
It uses two digits: "\x00ffff" is the string "<binary 0>ffff".
Maybe make it dependent on the type of the searched string ('\377') rather than on the type of the pattern?
Won't work; you could just be compiling a pattern to make a regex object, and have no idea what you're matching against.
OK. Let's change our spec. --Guido van Rossum (home page: http://www.python.org/~guido/)

I don't know if this is related, exactly, but there is some kind of problem with the current test. When I run make test, I see:

    test test_sre crashed -- exceptions.SyntaxError: inconsistent use of tabs and spaces in indentation

tabnanny thinks test_sre.py is fine, so I'm not sure what the problem is.

Jeremy
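(One way to see which checker is objecting -- a rough sketch, assuming test_sre.py sits in the current directory -- is to run both tabnanny and the byte-code compiler over the same file:)

    import tabnanny

    tabnanny.check("test_sre.py")   # prints a diagnostic if tabnanny objects

    with open("test_sre.py") as f:
        # the compiler itself raises SyntaxError if it considers
        # the tab/space mixture inconsistent
        compile(f.read(), "test_sre.py", "exec")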
participants (5)
- Andrew Kuchling
- Fredrik Lundh
- Guido van Rossum
- Jeremy Hylton
- Tim Peters