[Python-Dev] SRE incompatibility

Tim Peters tpeters@beopen.com
Fri, 30 Jun 2000 14:20:46 -0400


[Tim]
> I vote for backward compatibility for now, and not only because
> that will irritate /F the most.

[/F]
> backward compatibility with what?

1.5.2.

> 8-bit string literals

At least, because they were in 1.5.2.

> or unicode string literals?

I'm sorry \x escapes are even allowed in those -- \x notation is a gimmick
for making strings hold arbitrary binary data, which we're trying to get
away from.  To the extent that they make any sense at all in Unicode
strings, \u should be used instead.

> the problem here is that the pattern is compiled once (from either
> 8-bit or unicode strings), and can then be used on either 8-bit or
> unicode targets.  to be fully backwards compatible, this means that
> the compiler should use 8 bits, no matter what string type you're
> using.

Unicode strings weren't in 1.5.2, so there can't possibly be a backwards
compatibility issue with them -- at least not in the sense I'm using the
phrase here.

> another solution would be to use the type of the pattern string to
> choose between 8 and 16 bits.  I almost implemented that, before
> I realized that it broke the following rather nice property:
>
>     sre.compile("some pattern") == sre.compile(u"some pattern")
>
> (well, the pattern type doesn't implement __cmp__, but you get the
> idea).  the current implementation guarantees "==", but I'm planning
> to change that to "is" (!).

Do you mean that, e.g.,

    sre.compile("\u0045") == sre.compile(u"\u0045")

too?  If so, that doesn't make any sense to me (interpreting \u in 8-bit
strings is even more confused than interpreting \x in Unicode strings).  But
if you didn't mean to include this case, then the equality doesn't actually
hold now, so there's nothing to preserve <wink>.
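
For what it's worth, the mismatch shows up just by looking at the two pattern strings before any regexp compiler gets involved. A small sketch in present-day Python, assuming bytes as a stand-in for the 8-bit string type; the backslash is doubled by hand to spell out what an 8-bit literal holds, since \u isn't an escape there:

```python
# bytes stands in for the 1.5.2-style 8-bit string (an assumption made for
# this sketch). The backslash is doubled to spell out the six characters
# such a literal holds, because \u is not an escape in 8-bit strings.
eight_bit = b"\\u0045"

# In a Unicode string, \u0045 is a single character: the letter E.
unicode_str = "\u0045"

assert len(eight_bit) == 6
assert len(unicode_str) == 1 and unicode_str == "E"
```

So the two compiles start from different text unless the compiler itself interprets \u in the 8-bit case.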

> anyway, I suspect it's too late to change this in 2.0b1.  if enough
> people complain about this, we can always label it a "critical bug",
> and do something about it in b2.

I think the real problem here was MAL's generalization of \x to 2-byte stuff
in Unicode strings.  If Unicode strings *have* to support \x, then

    \x0123456789abcdef

in Unicode strings should act like

    \u00ef

in Unicode strings, and SRE should play along with that too.  \x was broken
to begin with; better to wipe it out than try to generalize it.
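
To make that rule concrete, here's a minimal sketch of the reading of \x I have in mind: swallow every hex digit that follows, then keep only the low 8 bits. It's written in present-day Python purely as an illustration; the helper name old_style_x_escape is made up, and this is not what SRE or the tokenizer actually does.

```python
import re

# Hypothetical helper, for illustration only: emulates the reading of \x
# that consumes every hex digit that follows and keeps only the low
# 8 bits of the resulting value.
_X_ESCAPE = re.compile(r"\\x([0-9a-fA-F]+)")

def old_style_x_escape(text):
    """Replace each \\x<hexdigits> run with the character for its low byte."""
    def low_byte(match):
        return chr(int(match.group(1), 16) & 0xFF)
    return _X_ESCAPE.sub(low_byte, text)

# The input is a raw string so the escape reaches the helper untouched.
result = old_style_x_escape(r"\x0123456789abcdef")
assert result == "\u00ef"
```

Note that under this reading a trailing hex digit gets swallowed too, so r"\x41b" comes out as the single character 0x1b rather than "A" followed by "b", which is a fair sample of why I call \x broken.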

OTOH, I didn't get much sleep last night <0.8 wink>.