stoopid question: why the heck is xmllib using
"RuntimeError" to flag XML syntax errors?
raise RuntimeError, 'Syntax error at line %d: %s' % (self.lineno, message)
what's wrong with "SyntaxError"?
An HTML version of the attached can be viewed at
This will be adopted for 2.0 unless there's an uproar. Note that it *does*
have potential for breaking existing code -- although no real-life instance
of incompatibility has yet been reported. This is explained in detail in
the PEP; check your code now.
although-if-i-were-you-i-wouldn't-bother<0.5-wink>-ly y'rs - tim
Title: Change the Meaning of \x Escapes
Version: $Revision: 1.4 $
Author: tpeters(a)beopen.com (Tim Peters)
Type: Standards Track
Change \x escapes, in both 8-bit and Unicode strings, to consume
exactly the two hex digits following. The proposal views this as
correcting an original design flaw, leading to clearer expression
in all flavors of string, a cleaner Unicode story, better
compatibility with Perl regular expressions, and with minimal risk
to existing code.
The syntax of \x escapes, in all flavors of non-raw strings, becomes

    \xhh

where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is
not clearly specified in the Reference Manual; it says

    \xhh...

implying "two or more" hex digits, but one-digit forms are also
accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
itself (i.e., a backslash followed by the letter x). It's unclear
whether the Reference Manual intended either of the 1-digit or
0-digit behaviors.
In an 8-bit non-raw string,

    \xij

expands to the character

    chr(int(ij, 16))

Note that this is the same as in 1.6 and before.

In a Unicode string,

    \xij

acts the same as

    \u00ij

i.e. it expands to the obvious Latin-1 character from the initial
segment of the Unicode space.
An \x not followed by at least two hex digits is a compile-time error,
specifically ValueError in 8-bit strings, and UnicodeError (a subclass
of ValueError) in Unicode strings. Note that if an \x is followed by
more than two hex digits, only the first two are "consumed". In 1.6
and before all but the *last* two were silently ignored.
In 1.5.2:

    >>> "\x123465"  # same as "\x65"
    'e'

In 2.0:

    >>> "\x123465"  # \x12 -> \022, "3465" left alone
    '\x123465'
    >>> "\x1"
    [ValueError is raised]
    >>> "\x"
    [ValueError is raised]
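The two-digit rule is easy to verify interactively. A small sketch (note that in current CPython the compile-time complaint surfaces as a SyntaxError, while 2.0 raised ValueError, so both are caught below):

```python
# Exactly two hex digits are consumed; anything after them is literal text.
assert "\x4142" == "A42"
assert len("\x4142") == 3

# An \x followed by fewer than two hex digits is rejected at compile time.
try:
    eval(r'"\x4"')
except (SyntaxError, ValueError):
    pass
else:
    raise AssertionError("truncated \\x escape was accepted")
```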
History and Rationale
\x escapes were introduced in C as a way to specify variable-width
character encodings. Exactly which encodings those were, and how many
hex digits they required, was left up to each implementation. The
language simply stated that \x "consumed" *all* hex digits following,
and left the meaning up to each implementation. So, in effect, \x in C
is a standard hook to supply platform-defined behavior.
Because Python explicitly aims at platform independence, the \x escape
in Python (up to and including 1.6) has been treated the same way
across all platforms: all *except* the last two hex digits were
silently ignored. So the only actual use for \x escapes in Python was
to specify a single byte using hex notation.
Larry Wall appears to have realized that this was the only real use for
\x escapes in a platform-independent language, as the proposed rule for
Python 2.0 is in fact what Perl has done from the start (although you
need to run in Perl -w mode to get warned about \x escapes with fewer
than 2 hex digits following -- it's clearly more Pythonic to insist on
2 all the time).
When Unicode strings were introduced to Python, \x was generalized so
as to ignore all but the last *four* hex digits in Unicode strings.
This caused a technical difficulty for the new regular expression
engine: SRE tries very hard to allow mixing 8-bit and Unicode patterns
and strings in intuitive ways, and it no longer had any way to guess
what, for example, r"\x123456" should mean as a pattern: is it asking
to match the 8-bit character \x56 or the Unicode character \u3456?
There are hacky ways to guess, but it doesn't end there. The ISO C99
standard also introduces 8-digit \U12345678 escapes to cover the entire
ISO 10646 character space, and it's also desired that Python 2 support
that from the start. But then what are \x escapes supposed to mean?
Do they ignore all but the last *eight* hex digits then? And if fewer
than 8 follow in a Unicode string, all but the last 4? And if fewer
than 4, all but the last 2?
This was getting messier by the minute, and the proposal cuts the
Gordian knot by making \x simpler instead of more complicated. Note
that the 4-digit generalization to \xijkl in Unicode strings was also
redundant, because it meant exactly the same thing as \uijkl in Unicode
strings. It's more Pythonic to have just one obvious way to specify a
Unicode character via hex notation.
Development and Discussion
The proposal was worked out among Guido van Rossum, Fredrik Lundh and
Tim Peters in email. It was subsequently explained and discussed on
Python-Dev under subject "Go \x yourself", starting 2000-08-03.
Response was overwhelmingly positive; no objections were raised.
Changing the meaning of \x escapes does carry risk of breaking existing
code, although no instances of incompatibility have yet been discovered.
The risk is believed to be minimal.
Tim Peters verified that, except for pieces of the standard test suite
deliberately provoking end cases, there are no instances of \xabcdef...
with fewer or more than 2 hex digits following, in either the Python
CVS development tree, or in assorted Python packages sitting on his machine.
It's unlikely there are any with fewer than 2, because the Reference
Manual implied they weren't legal (although this is debatable!). If
there are any with more than 2, Guido is ready to argue they were buggy
anyway <0.9 wink>.
Guido reported that the O'Reilly Python books *already* document that
Python works the proposed way, likely due to their Perl editing
heritage (as above, Perl worked (very close to) the proposed way from the start).
Finn Bock reported that what JPython does with \x escapes is
unpredictable today. This proposal gives a clear meaning that can be
consistently and easily implemented across all Python implementations.
Effects on Other Tools
Believed to be none. The candidates for breakage would mostly be
parsing tools, but the author knows of none that worry about the
internal structure of Python strings beyond the approximation "when
there's a backslash, swallow the next character". Tim Peters checked
python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
coloring subsystem, and believes there's no need to change any of
them. Tools like tabnanny.py and checkappend.py inherit their immunity
from tokenize.py.
The code changes are so simple that a separate patch will not be produced.
Fredrik Lundh is writing the code, is an expert in the area, and will
simply check the changes in before 2.0b1 is released.
Yes, ValueError, not SyntaxError. "Problems with literal interpretations
traditionally raise 'runtime' exceptions rather than syntax errors."
This document has been placed in the public domain.
Since most Python users on Windows don't have any use for them, I trimmed
the Python 2.0b2 installer by leaving out the debug-build .lib, .pyd, .exe
and .dll files. If you want them, they're available in a separate zip
archive; read the Windows Users notes at
for info and a download link. If you don't already know *why* you might
want them, trust me: you don't want them <wink>.
they-don't-even-make-good-paperweights-ly y'rs - tim
May I have developer status on the SourceForge CVS, please? I maintain
two standard-library modules (shlex and netrc) and have been involved
with the development of several others (including Cmd, smtp, httplib, and
My only immediate plan for what to do with developer access is to add
the browser-launch capability previously discussed on this list. My
general interest is in improving the standard class library,
especially in the areas of Internet-protocol support (urllib, ftp,
telnet, pop, imap, smtp, nntplib, etc.) and mini-language toolkits and
frameworks (shlex, netrc, Cmd, ConfigParser).
If the Internet-protocol support in the library were broken out as a
development category, I would be willing to fill the patch-handler
slot for it.
<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>
See, when the GOVERNMENT spends money, it creates jobs; whereas when the money
is left in the hands of TAXPAYERS, God only knows what they do with it. Bake
it into pies, probably. Anything to avoid creating jobs.
-- Dave Barry
>I would be happy to! Although I am happy to report that I believe it
>safe - I have been very careful of this from the time I wrote it.
>What is the process? How formal should it be?
Not sure how formal it should be, but I would recommend you review uses of
strcpy and convince yourself that the source string is never longer than the
target buffer. I am not convinced. For example, in calculate_path(), char
*pythonhome is initialized from an environment variable and thus has unknown
length. Later it is used in a strcpy(prefix, pythonhome), where prefix has a
fixed length. This looks like a vulnerability that could be closed by using
strncpy(prefix, pythonhome, MAXPATHLEN).
The Unix version of this code had three or four vulnerabilities of this
sort. So I imagine the Windows version has those too. I was imagining that
the registry offered a whole new opportunity to provide unexpectedly long
strings that could overflow buffers.
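A minimal sketch of the kind of fix being suggested, assuming a MAXPATHLEN of 256 for illustration (the real value is platform-defined). The subtlety worth a comment is that strncpy alone does not NUL-terminate on truncation:

```c
#include <string.h>

#define MAXPATHLEN 256  /* illustrative; the real value is platform-defined */

/* Copy an untrusted, possibly overlong string (e.g. the result of
 * getenv("PYTHONHOME")) into a fixed-size buffer without overflowing it.
 * strncpy() pads but does not NUL-terminate when the source is too long,
 * so the terminator is written explicitly. */
static void safe_path_copy(char prefix[MAXPATHLEN], const char *pythonhome)
{
    strncpy(prefix, pythonhome, MAXPATHLEN - 1);
    prefix[MAXPATHLEN - 1] = '\0';
}
```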
I'd like some feedback on a patch assigned to me. It is designed to
prevent Python extensions built for an earlier version of Python from
crashing the new version.
I haven't actually tested the patch, but I am sure it works as advertised
(who is db31 anyway?).
My question relates more to the "style" - the patch locates the new .pyd's
address in memory, and parses through the MS PE/COFF format, locating the
import table. It then scans the import table looking for Pythonxx.dll, and
compares any found entries with the current version.
Quite clever - a definite plus is that it should work for all old and
future versions (of Python - dunno about Windows ;-) - but do we want this
sort of code in Python? Is this sort of hack, however clever, going to
come back and bite us?
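For what it's worth, the first step of such a check (locating the PE signature that precedes the import table) is simple enough to sketch in Python; the offsets come from the published PE/COFF layout, and the actual import-table walk is omitted:

```python
import struct

def pe_signature_offset(data: bytes) -> int:
    """Offset of the 'PE\\0\\0' signature in a PE/COFF image, or -1."""
    if data[:2] != b"MZ":                               # IMAGE_DOS_HEADER.e_magic
        return -1
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)  # IMAGE_DOS_HEADER.e_lfanew
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return -1
    return e_lfanew
```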
Second related question: if people like it, is this feature something we
can squeeze in for 2.0?
If there are no objections to any of this, I am happy to test it and check
it in - but am not confident of doing so without some feedback.
> Unfortunately, I can't see what "encoding" I should use if I want
> to read & write Unicode string objects to it. ;( (Marc-Andre,
> please tell me I've missed something!)
It depends on the output you want to have. One option would be
Then, s.write(u'\251') prints a string in Python quoting notation.
Plain print, however, won't work, since print *first* tries to convert the argument to a
string, and then prints the string onto the stream.
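A sketch of the sort of wrapper being discussed, in modern spelling, using codecs.getwriter; the BytesIO here is just a stand-in for a byte-oriented sys.stdout:

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for a byte-oriented stdout
s = codecs.getwriter("unicode_escape")(raw)

s.write("\xa9")  # u'\251', the copyright sign
# The writer encodes on the way through, emitting Python quoting notation.
assert raw.getvalue() == b"\\xa9"
```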
> On the other hand, it's annoying that I can't create a file-object
> that takes Unicode strings from "print", and doesn't seem intuitive.
Since you are asking for a hack :-) How about having an additional
letter of 'u' in the "mode" attribute of a file object?
Then, print would do something like:

    if type(string) == UnicodeType:
        if 'u' in stream.mode:
            stream.write(string)
        else:
            stream.write(string.encode())

The stream readers and writers would then need to have a mode of 'ru'
or 'wu', respectively.
Any other protocol to signal unicode-awareness in a stream might do as well.
P.S. Is there some function to retrieve the UCN names from ucnhash.c?
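On the P.S.: in later releases the standard unicodedata module exposes exactly this lookup, in both directions:

```python
import unicodedata

# Character to name, and name back to character.
assert unicodedata.name("\xa9") == "COPYRIGHT SIGN"
assert unicodedata.lookup("COPYRIGHT SIGN") == "\xa9"
```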
Jeremy was just playing with the xml.sax package, and decided to
print the string returned from parsing "©" (the copyright
symbol). Sure enough, he got a traceback:
>>> print u'\251'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
and asked me about it. I was a little surprised myself. First, that
anyone would use "print" in a SAX handler to start with, and second,
that it was so painful.
Now, I can chalk this up to not using a reasonable stdout that
understands that Unicode needs to be translated to Latin-1 given my
font selection. So I looked at the codecs module to provide a usable
output stream. The EncodedFile class provides a nice wrapper around
another file object, and supports encoding both ways.
Unfortunately, I can't see what "encoding" I should use if I want to
read & write Unicode string objects to it. ;( (Marc-Andre, please
tell me I've missed something!) I also don't think I
can use it with "print", extended or otherwise.
The PRINT_ITEM opcode calls PyFile_WriteObject() with whatever it
gets, so that's fine. Then it converts the object using
PyObject_Str() or PyObject_Repr(). For Unicode objects, the tp_str
handler attempts conversion to the default encoding ("ascii" in this
case), and raises the traceback we see above.
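That failing conversion can be reproduced directly by asking the ascii codec to encode a non-ASCII character (modern Python raises UnicodeEncodeError, a UnicodeError subclass):

```python
try:
    "\xa9".encode("ascii")  # the same conversion the tp_str handler attempts
except UnicodeError as exc:
    # The familiar complaint from the traceback above.
    assert "ordinal not in range(128)" in str(exc)
else:
    raise AssertionError("ascii codec unexpectedly accepted \\xa9")
```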
Perhaps a little extra work is needed in PyFile_WriteObject() to
allow Unicode objects to pass through if the file is merely file-like,
and let the next layer handle the conversion? This would probably
break code, and therefore not be acceptable.
On the other hand, it's annoying that I can't create a file-object
that takes Unicode strings from "print", and doesn't seem intuitive.
Fred L. Drake, Jr. <fdrake at beopen.com>
BeOpen PythonLabs Team Member
I was playing with a different SourceForge project and I screwed up my
CVSROOT (used Python's instead). Sorry, sorry!
How do I undo this cleanly? I could 'cvs remove' the README.txt file but that
would still leave the top-level 'black/' turd right? Do the SourceForge admin
guys have to manually kill the 'black' directory in the repository?
On Wed, Sep 27, 2000 at 12:06:06AM -0700, Trent Mick wrote:
> Update of /cvsroot/python/black
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv20977
> Log Message:
> first import into CVS
> Vendor Tag: vendor
> Release Tags: start
> N black/README.txt
> No conflicts created by this import
> ***** Bogus filespec: -
> ***** Bogus filespec: Imported
> ***** Bogus filespec: sources