stoopid question: why the heck is xmllib using
"RuntimeError" to flag XML syntax errors?
raise RuntimeError, 'Syntax error at line %d: %s' % (self.lineno, message)
what's wrong with "SyntaxError"?
An HTML version of the attached can be viewed at
This will be adopted for 2.0 unless there's an uproar. Note that it *does*
have potential for breaking existing code -- although no real-life instance
of incompatibility has yet been reported. This is explained in detail in
the PEP; check your code now.
although-if-i-were-you-i-wouldn't-bother<0.5-wink>-ly y'rs - tim
Title: Change the Meaning of \x Escapes
Version: $Revision: 1.4 $
Author: tpeters(a)beopen.com (Tim Peters)
Type: Standards Track
Change \x escapes, in both 8-bit and Unicode strings, to consume
exactly the two hex digits following. The proposal views this as
correcting an original design flaw, leading to clearer expression
in all flavors of string, a cleaner Unicode story, better
compatibility with Perl regular expressions, and with minimal risk
to existing code.
The syntax of \x escapes, in all flavors of non-raw strings, becomes
    \xhh
where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is
not clearly specified in the Reference Manual; it says
    \xhh...
implying "two or more" hex digits, but one-digit forms are also
accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
itself (i.e., a backslash followed by the letter x). It's unclear
whether the Reference Manual intended either of the 1-digit or
0-digit behaviors.
In an 8-bit non-raw string,
    \xij
expands to the character
    chr(int(ij, 16))
Note that this is the same as in 1.6 and before.
In a Unicode string,
    \xij
acts the same as
    \u00ij
i.e. it expands to the obvious Latin-1 character from the initial
segment of the Unicode space.
An \x not followed by at least two hex digits is a compile-time error,
specifically ValueError in 8-bit strings, and UnicodeError (a subclass
of ValueError) in Unicode strings. Note that if an \x is followed by
more than two hex digits, only the first two are "consumed". In 1.6
and before all but the *last* two were silently ignored.
In 1.5.2:
>>> "\x123465" # same as "\x65"
In 2.0:
>>> "\x123465" # \x12 -> \022, "3465" left alone
>>> "\x1"
[ValueError is raised]
>>> "\x\x"
[ValueError is raised]
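A minimal sketch of the proposed rule, written for a current Python 3 interpreter and not taken from the 2.0 compiler. The helper name expand_x_escapes is hypothetical; for brevity it ignores escaped backslashes and every escape other than \x.

```python
import re

# Matches a backslash-x, optionally followed by exactly two hex digits.
_X_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})?")


def expand_x_escapes(raw):
    """Expand \\xhh escapes in the body of a string literal (hypothetical helper).

    Exactly two hex digits are consumed; anything less is an error.
    """
    def replace(match):
        digits = match.group(1)
        if digits is None:
            # Fewer than two hex digits follow the \x.
            raise ValueError("invalid \\x escape at offset %d" % match.start())
        return chr(int(digits, 16))

    return _X_ESCAPE.sub(replace, raw)


print(repr(expand_x_escapes(r"\x123465")))  # \x12 is consumed, "3465" is left alone
```

Calling expand_x_escapes(r"\x1") or expand_x_escapes(r"\x\x") raises ValueError, matching the one-digit and zero-digit cases described above.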
History and Rationale
\x escapes were introduced in C as a way to specify variable-width
character encodings. Exactly which encodings those were, and how many
hex digits they required, was left up to each implementation. The
language simply stated that \x "consumed" *all* hex digits following,
and left the meaning up to each implementation. So, in effect, \x in C
is a standard hook to supply platform-defined behavior.
Because Python explicitly aims at platform independence, the \x escape
in Python (up to and including 1.6) has been treated the same way
across all platforms: all *except* the last two hex digits were
silently ignored. So the only actual use for \x escapes in Python was
to specify a single byte using hex notation.
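For comparison, here is a rough sketch of the old behaviour as this PEP describes it: every hex digit is swallowed and only the low byte survives. It runs on a current Python 3 interpreter; legacy_expand_x is a hypothetical name, and the handling of a bare \x follows the 1.5.2 description given earlier, not the old compiler source.

```python
import re

# Matches a backslash-x followed by any number of hex digits (possibly none).
_HEX_RUN = re.compile(r"\\x([0-9a-fA-F]*)")


def legacy_expand_x(raw):
    """Emulate the pre-2.0 rule for 8-bit strings (hypothetical helper)."""
    def replace(match):
        digits = match.group(1)
        if not digits:
            # A plain \x was "expanded" to itself.
            return match.group(0)
        # All but the last two hex digits are silently ignored.
        return chr(int(digits, 16) & 0xFF)

    return _HEX_RUN.sub(replace, raw)


print(repr(legacy_expand_x(r"\x123465")))  # only the trailing 65 matters
```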
Larry Wall appears to have realized that this was the only real use for
\x escapes in a platform-independent language, as the proposed rule for
Python 2.0 is in fact what Perl has done from the start (although you
need to run in Perl -w mode to get warned about \x escapes with fewer
than 2 hex digits following -- it's clearly more Pythonic to insist on
2 all the time).
When Unicode strings were introduced to Python, \x was generalized so
as to ignore all but the last *four* hex digits in Unicode strings.
This caused a technical difficulty for the new regular expression
engine: SRE tries very hard to allow mixing 8-bit and Unicode patterns and
strings in intuitive ways, and it no longer had any way to guess what,
for example, r"\x123456" should mean as a pattern: is it asking to
match the 8-bit character \x56 or the Unicode character \u3456?
There are hacky ways to guess, but it doesn't end there. The ISO C99
standard also introduces 8-digit \U12345678 escapes to cover the entire
ISO 10646 character space, and it's also desired that Python 2 support
that from the start. But then what are \x escapes supposed to mean?
Do they ignore all but the last *eight* hex digits then? And if less
than 8 following in a Unicode string, all but the last 4? And if less
than 4, all but the last 2?
This was getting messier by the minute, and the proposal cuts the
Gordian knot by making \x simpler instead of more complicated. Note
that the 4-digit generalization to \xijkl in Unicode strings was also
redundant, because it meant exactly the same thing as \uijkl in Unicode
strings. It's more Pythonic to have just one obvious way to specify a
Unicode character via hex notation.
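A short check of how the two-digit rule removes the ambiguity, assuming a current Python 3 interpreter, where the re module reads \x in a pattern as exactly two hex digits. The variable names are only illustrative.

```python
import re

# Under the two-digit rule the pattern means \x12 followed by the text "3456".
pattern = re.compile(r"\x123456")
matches_two_digit_reading = pattern.fullmatch("\x12" + "3456") is not None
matches_last_two_reading = pattern.fullmatch("\x56") is not None

# \xij and \u00ij name the same character, so \x adds no second
# spelling for characters beyond Latin-1.
same_character = "\xe9" == "\u00e9"

print(matches_two_digit_reading, matches_last_two_reading, same_character)
```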
Development and Discussion
The proposal was worked out among Guido van Rossum, Fredrik Lundh and
Tim Peters in email. It was subsequently explained and discussed on
Python-Dev under subject "Go \x yourself", starting 2000-08-03.
Response was overwhelmingly positive; no objections were raised.
Changing the meaning of \x escapes does carry risk of breaking existing
code, although no instances of incompatibility have yet been discovered.
The risk is believed to be minimal.
Tim Peters verified that, except for pieces of the standard test suite
deliberately provoking end cases, there are no instances of \xabcdef...
with fewer or more than 2 hex digits following, in either the Python
CVS development tree, or in assorted Python packages sitting on his
machine.
It's unlikely there are any with fewer than 2, because the Reference
Manual implied they weren't legal (although this is debatable!). If
there are any with more than 2, Guido is ready to argue they were buggy
anyway <0.9 wink>.
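Anyone who wants to repeat that kind of check on their own code could start from something like the sketch below. It is not the script Tim used; find_suspect_x_escapes is a hypothetical name, and the scan is purely textual, so it will also flag raw strings and comments, which then need a look by hand.

```python
import re

# A backslash-x preceded by an even number of backslashes (so the
# backslash itself is not escaped), followed by a run of hex digits.
_X_RUN = re.compile(r"(?<!\\)(?:\\\\)*\\x([0-9a-fA-F]*)")


def find_suspect_x_escapes(source_text):
    """Return (line number, escape) pairs where \\x is not followed by exactly 2 hex digits."""
    suspects = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for match in _X_RUN.finditer(line):
            digits = match.group(1)
            if len(digits) != 2:
                suspects.append((lineno, "\\x" + digits))
    return suspects


sample = 'a = "\\x41"\nb = "\\x123465"\nc = "\\x1"\n'
for lineno, escape in find_suspect_x_escapes(sample):
    print(lineno, escape)
```

To scan a file, read it as text and pass the contents in; the function only looks at one string at a time.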
Guido reported that the O'Reilly Python books *already* document that
Python works the proposed way, likely due to their Perl editing
heritage (as above, Perl worked (very close to) the proposed way from
its start).
Finn Bock reported that what JPython does with \x escapes is
unpredictable today. This proposal gives a clear meaning that can be
consistently and easily implemented across all Python implementations.
Effects on Other Tools
Believed to be none. The candidates for breakage would mostly be
parsing tools, but the author knows of none that worry about the
internal structure of Python strings beyond the approximation "when
there's a backslash, swallow the next character". Tim Peters checked
python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
coloring subsystem, and believes there's no need to change any of
them. Tools like tabnanny.py and checkappend.py inherit their immunity
from tokenize.py.
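The approximation "when there's a backslash, swallow the next character" can be sketched as follows. This is an illustration only, not code from any of the tools named above; find_string_end is a hypothetical helper that handles a single-quoted or double-quoted string and ignores triple quotes.

```python
def find_string_end(text, start):
    """Return the index of the quote that closes the string opened at text[start].

    Returns -1 if the string is never closed. The scanner never looks
    inside an escape, so the number of hex digits after \\x is irrelevant.
    """
    quote = text[start]
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Swallow the backslash and whatever character follows it.
            i += 2
            continue
        if ch == quote:
            return i
        i += 1
    return -1


line = '"a\\x41\\"b" + c'  # the source text  "a\x41\"b" + c
print(find_string_end(line, 0))
```

Because a scanner like this only needs to skip one character after each backslash, changing how many hex digits \x consumes cannot move the point where it thinks the string ends.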
The code changes are so simple that a separate patch will not be
produced.
Fredrik Lundh is writing the code, is an expert in the area, and will
simply check the changes in before 2.0b1 is released.
Yes, ValueError, not SyntaxError. "Problems with literal interpretations
traditionally raise 'runtime' exceptions rather than syntax errors."
This document has been placed in the public domain.
FYI: This misdefinition with LONG_BIT was due to a bug in glibc's limits.h. It
has been fixed in glibc 2.96.
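For anyone curious what a sane LONG_BIT looks like on their box, here is a rough Python-level analogue of the sanity check, assuming 8-bit bytes and a current Python 3 with ctypes. The real guard is a preprocessor test in pyport.h; long_bit and check_long_bit are hypothetical names used only for this illustration.

```python
import ctypes


def long_bit():
    """Width of a C long in bits on this platform (assumes 8-bit bytes)."""
    return 8 * ctypes.sizeof(ctypes.c_long)


def check_long_bit(claimed):
    """Reject a claimed LONG_BIT that disagrees with the size of a C long."""
    expected = long_bit()
    if claimed != expected:
        raise ValueError(
            "LONG_BIT is %d but a C long has %d bits" % (claimed, expected)
        )
    return claimed


print(long_bit())
```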
On Wed, Oct 04, 2000 at 06:42:32PM -0700, Tim Peters wrote:
> Update of /cvsroot/python/python/dist/src/Include
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv5758/python/dist/src/Include
> Modified Files:
> Log Message:
> Move LONG_BIT from intobject.c to pyport.h. #error if it's already been
> #define'd to an unreasonable value (several recent gcc systems have
> misdefined it, causing bogus overflows in integer multiplication). Nuke
> CHAR_BIT entirely.
> The 2.0 docs clearly state 'The optional else clause is executed when no
> exception occurs in the try clause.' This makes it sound as though it
> gets executed on the 'way out'.
Of course. That's not what the docs meant, though, and Guido is not going
to change the implementation now because that would break code that relies
on how Python has always *worked* in these cases. The way Python works is
also the way Guido intended it to work (I'm allowed to channel him when he's
on vacation <0.9 wink>).
Indeed, that's why I suggested a specific doc change. If your friend would
also be confused by that, then we still have a problem; else we don't.
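A small sketch of the behaviour under discussion, runnable on a current Python 3: the else clause runs only when control falls off the end of the try clause, not when the try clause is left by a return. The function name run is just for illustration.

```python
def run(exit_early):
    """Record which clauses execute, with and without an early return."""
    events = []

    def body():
        try:
            events.append("try")
            if exit_early:
                # Leaving the try clause this way skips the else clause.
                return "returned from try"
        except ValueError:
            events.append("except")
        else:
            events.append("else")
        return "fell off the end"

    result = body()
    return result, events


print(run(False))
print(run(True))
```

The same applies to leaving the try clause with break or continue inside a loop: no exception occurred, yet the else clause is not executed.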
> If curses is a core facility now, the default build should treat it
> as one.
> IMO ssl isn't an issue because it's not documented as being in the
> standard module set.
> 3. Documented as being in the core but not built in by default.
> My more general claim is that the existence of class 3 is a problem
In the case of curses, I believe there is a documentation error in the
2.0 documentation. The curses package is listed under "Generic
Operating System Services". I believe this is wrong; it should be listed
under "Unix Specific Services".
Unless I'm mistaken, the curses module is not available on the Mac and
on Windows. With that change, the curses module would then fall into
Eric's category 2 (Not documented as being in the core and not built
in by default).
That documentation change should be carried out even if curses is
autoconfigured; autoconf, too, is used on Unix only.
P.S. The "Python Library Reference" content page does not mention the
word "core" at all, except as part of asyncore...
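Whatever category curses ends up in, portable code cannot assume it is there. A minimal sketch of the usual guard, assuming only that a missing module raises ImportError; have_curses is a hypothetical helper name.

```python
try:
    import curses
except ImportError:
    # Not built, or not available on this platform.
    curses = None


def have_curses():
    """Report whether the curses module could be imported."""
    return curses is not None


print("curses available:", have_curses())
```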
The current FAQ is horribly out of date. I think the FAQ-Wizard method
has proven itself not very efficient (for example, apparently no one
noticed until now that it's not working <0.2 wink>). Is there any
hope of putting the FAQ in Misc/, having a script which scp's it
to the SF page, and making that the official FAQ?
On a related note, what is the current status of the PSA? Is it officially
Moshe Zadka <sig(a)zadka.site.co.il>
This is a signature anti-virus.
Please stop the spread of signature viruses!
[resent since python.org ran out of disk space]
> My only problem with it is your copyright notice. AFAIK, patches to
> the Python core cannot contain copyright notices without proper
> license information. OTOH, I don't think that these minor changes
> really warrant adding a complete license paragraph.
I'd like to get an "official" clarification on this question. Is it
the case that patches containing copyright notices are only accepted
if they are accompanied by license information?
I agree that the changes are minor, I also believe that I hold the
copyright to the changes whether I attach a notice or not (at least
according to our local copyright law).
What concerns me is that without such a notice, gencodec.py looks as if
CNRI holds the copyright to it. I'm not willing to assign the
copyright of my changes to CNRI, and I'd like to avoid the impression
of doing so.
What is even more concerning is that CNRI also holds the copyright to
the generated files, even though they are derived from information
made available by the Unicode consortium!
Add this error to the pot:
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /cgi-bin/moinmoin.
Reason: Document contains no data
Apache/1.3.9 Server at www.python.org Port 80
Also, as far as I can tell:
+ news->mail for c.l.py hasn't delivered anything for well over 24 hours.
+ No mail to Python-Dev has shown up in the archives (let alone been
delivered) since Fri, 29 Dec 2000 16:42:44 +0200 (IST).
+ The other Python mailing lists appear equally dead.
time-for-a-new-year!-ly y'rs - tim
A bug was filed on SF contending that the default linkage for bsddb should
be shared instead of static because some Linux systems ship multiple
versions of libdb.
Would those of you who can and do build bsddb (probably only unixoids of
some variety) please give this simple test a try? Uncomment the *shared*
line in Modules/Setup.config.in, re-run configure, build Python and then
run:
import bsddb
db = bsddb.btopen("/tmp/dbtest.db", "c")
db["1"] = "1"
If this doesn't fail for anyone I'll check the change in and close the bug
report, otherwise I'll add a(nother) comment to the bug report that *shared*
breaks bsddb for others and close the bug report.