[Python-Dev] Go \x yourself

Thu, 3 Aug 2000 04:05:31 -0400

Offline, Guido and /F and I had a mighty battle about the meaning of \x
escapes in Python.  In the end we agreed to change the meaning of \x in a
backward-*in*compatible way.  Here's the scoop:

In 1.5.2 and before, the Reference Manual implies that an \x escape takes
two or more hex digits following, and has the value of the last byte.  In
reality it also accepted just one hex digit, or even none:

>>> "\x123465"  # same as "\x65"
'e'
>>> "\x65"
'e'
>>> "\x1"
'\001'
>>> "\x\x"
'\\x\\x'
>>>
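
For anyone who wants to poke at the old behavior without digging up a 1.5.2 build, here is a rough emulation of the gobbling rule as described above. It is only a sketch: decode_x_old is a made-up helper, it works on raw source text, and it ignores every escape other than \x (so it is not what the real tokenizer does).

```python
import string

HEX_DIGITS = set(string.hexdigits)

def decode_x_old(source):
    """Hypothetical helper: emulate the 1.5.2-style open-ended \\x rule."""
    out = []
    i = 0
    while i < len(source):
        if source.startswith("\\x", i):
            # Gobble every hex digit that follows, however many there are.
            j = i + 2
            while j < len(source) and source[j] in HEX_DIGITS:
                j += 1
            digits = source[i + 2:j]
            if digits:
                # Only the low byte survives; the rest is silently dropped.
                out.append(chr(int(digits, 16) & 0xFF))
            else:
                # No digits at all: the backslash and the x are left as-is.
                out.append("\\x")
            i = j
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Feeding it the raw text of the four examples above (e.g. decode_x_old(r"\x123465")) should reproduce the results shown in the session.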

I found no instances of the 0- or 1-digit forms in the CVS tree or in any of
the Python packages on my laptop.  Do you have any in your code?

And, apart from some deliberate abuse in the test suite, I found no
instances of more-than-two-hex-digits \x escapes either.  Similarly, do you
have any?  As Guido said and all agreed, it's probably a bug if you do.

The new rule is the same as Perl uses for \x escapes in -w mode, except that
Python will raise ValueError at compile-time for an invalid \x escape:  an
\x escape is of the form

    \xhh

where h is a hex digit.  That's it.  Guido reports that the O'Reilly books
(probably due to their Perl editing heritage!) already say Python works this
way.  It's the same rule for 8-bit and Unicode strings (in Perl too, at
least wrt the syntax).  In a Unicode string \xij has the same meaning as
\u00ij, i.e. it's the obvious Latin-1 character.  Playing back the above
pretending the new rule is in place:

>>> "\x123465" # \x12 -> \022, "3465" left alone
'\0223465'
>>> "\x65"
'e'
>>> "\x1"
ValueError
>>> "\x\x"
ValueError
>>>
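
The new rule is small enough to sketch in a few lines. Again, decode_x_new is a hypothetical helper for illustration, operating on raw source text and looking only at \x; the real check would live in the compiler, not in a function like this.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def decode_x_new(source):
    """Hypothetical helper: apply the strict \\xhh rule to raw source text."""
    out = []
    i = 0
    while i < len(source):
        if source.startswith("\\x", i):
            # Exactly two hex digits must follow; anything else is an error.
            digits = source[i + 2:i + 4]
            if len(digits) != 2 or not all(c in HEX_DIGITS for c in digits):
                raise ValueError("invalid \\x escape at offset %d" % i)
            out.append(chr(int(digits, 16)))
            i += 4
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

With this, r"\x123465" decodes to the byte 0x12 followed by the literal text "3465", while r"\x1" and r"\x\x" both raise ValueError.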

We all support this:  the open-ended gobbling that \x used to do lost
information without warning, and had no benefit whatsoever.  While there was some
attraction to generalizing \x in Unicode strings, \u1234 is already
perfectly adequate for specifying Unicode characters in hex form, and the
new rule for \x at least makes consistent Unicode sense now (and in a way
JPython should be able to adopt easily too).  The new rule gets rid of the
unPythonic TMTOWTDI introduced by generalizing Unicode \x to "the last 4
hex digits".  That generalization also didn't make sense in light of the desire
to add \U12345678 escapes too (i.e., so then how many trailing hex digits
should a generalized \x suck up?  2?  4?  8?).  The only actual use for \x
in 8-bit strings (i.e., a way to specify a byte in hex) is still supported
with the same meaning as in 1.5.2, and \x in a Unicode string means
something as close to that as is possible.
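
A quick way to see the "\xij means \u00ij" equivalence, assuming an interpreter where the rule is in place. The sketch below is written for Python 3 (an assumption on my part: there str is the Unicode type and bytes is the 8-bit type), and the variable names are just for illustration.

```python
# Assumes Python 3: str literals are Unicode, bytes literals are 8-bit.
from_x = "\xe9"        # \x with exactly two hex digits
from_u = "\u00e9"      # the same code point spelled with \u
eight_bit = b"\xe9"    # a single byte with value 0xE9

# In a Unicode string, \xij and \u00ij name the same character.
assert from_x == from_u
assert ord(from_x) == 0xE9

# Encoding that character as Latin-1 gives back the 8-bit meaning of \xe9.
assert from_x.encode("latin-1") == eight_bit
```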

Sure feels right to me.  Gripe quick if it doesn't to you.

as-simple-as-possible-is-a-nice-place-to-rest-ly y'rs  - tim