[Python-Dev] SRE incompatibility

Tim Peters tpeters@beopen.com
Fri, 30 Jun 2000 12:38:21 -0400


[Andrew Kuchling]
> ...
> This is for compatibility with Python string literals:
>
> kronos Python-1.6>./python
> >>> '\x00fffffff'
> '\377'
> >>> u'\x00fffffff'
> u'\uFFFF'
>
> (Where do these semantics come from, BTW?  C's \x seems to take any
> number of hex digits but then reports an error if the character is
> greater than 256, too large to fit into a byte.)

The behavior of \x in C is mostly implementation-defined.  The committee
knew that C had to do *something* to support "large characters" down the
road, but in those early days they had no clear idea exactly what.  So,
rather than do something sensible <0.5 wink>, they invented a perfectly
general mechanism without portable semantics.  "C itself" isn't complaining
if the character "is greater than 256", it's the specific implementation of
C you're using that's complaining.  A different implementation is free to
(and probably will!) do something different.

Guido adopted the most commonly implemented semantics (ignore all but the
last byte) in Python, apparently under the delusion that this would be a
Good Thing <wink>.  Marc-Andre followed suit by generalizing this madness to
Unicode.
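
To make the "ignore all but the last byte" rule concrete, here is a minimal sketch in modern Python.  It is not the interpreter's or SRE's actual code: the helper name decode_greedy_x and its bits parameter are hypothetical, and it assumes \x swallows every hex digit that follows and then keeps only the low-order bits (8 for byte strings, 16 for the Unicode strings of that era).

```python
import re

# Hypothetical helper, not part of Python or SRE: it models the
# "swallow all hex digits, keep the low-order bits" reading of \x.
_HEX_RUN = re.compile(r"\\x([0-9a-fA-F]+)")


def decode_greedy_x(text, bits=8):
    """Expand \\x escapes greedily, keeping only the low `bits` bits."""
    mask = (1 << bits) - 1

    def expand(match):
        # Parse the whole run of hex digits, then truncate the value.
        return chr(int(match.group(1), 16) & mask)

    return _HEX_RUN.sub(expand, text)
```

Under those assumptions, decode_greedy_x(r"\x00fffffff") gives the single character '\xff', and the same call with bits=16 gives '\uffff', matching the two interpreter results quoted above.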

> Note that the \u escape for Unicode characters uses exactly 4 digits,
> no more, no less.

I pushed for that obnoxiously.  Glad you appreciate it <wink>.  Java does
the same.
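
For contrast, a fixed-width escape can be sketched the same way.  Again, this is only an illustration under stated assumptions: decode_fixed_u is a hypothetical name, and the choice to raise ValueError on a short escape is mine, not something the thread specifies.

```python
import re

# Hypothetical helper: \u takes exactly four hex digits, no more, no less.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{0,4})")


def decode_fixed_u(text):
    """Expand \\u escapes that carry exactly four hex digits."""

    def expand(match):
        digits = match.group(1)
        if len(digits) != 4:
            # Assumed policy: a short escape is an error, not a guess.
            raise ValueError("\\u needs exactly 4 hex digits, got %r" % digits)
        return chr(int(digits, 16))

    return _U_ESCAPE.sub(expand, text)
```

Because the width is fixed, hex-looking text after the escape is left alone: decode_fixed_u(r"\u0041BCD") gives 'ABCD', where a greedy rule would have consumed the trailing BCD as well.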

> It would certainly be simpler and clearer to only support a fixed
> number of digits with \x, since I find the casting down behaviour is
> magical and not obvious.

Yes, it's basically nuts.

> But I don't know if we want to make that change now.

No from me, because it may break stuff.  Wait for Python 2.0 <ahem>.

> (Guido now realizes the downside to numbering it 2.0, as everyone
> hurries to suggest their favorite backward-incompatible change.)

Guido always realized that, I believe.  It's a "least of evils" kind of
thing, mixed with a celebration, not a pure win.

> That doesn't help with regexes, of course, since a pattern might be
> written as a regular string but be intended to match Unicode.  Maybe
> the simplest rule is the best; always take 4 digits, even if it winds
> up being incompatible with the \x in string literals.

I vote for backward compatibility for now, and not only because that will
irritate /F the most.