stoopid question: why the heck is xmllib using
"RuntimeError" to flag XML syntax errors?
raise RuntimeError, 'Syntax error at line %d: %s' % (self.lineno, message)
what's wrong with "SyntaxError"?
</F>
An HTML version of the attached can be viewed at
http://python.sourceforge.net/peps/pep-0223.html
This will be adopted for 2.0 unless there's an uproar. Note that it *does*
have potential for breaking existing code -- although no real-life instance
of incompatibility has yet been reported. This is explained in detail in
the PEP; check your code now.
although-if-i-were-you-i-wouldn't-bother<0.5-wink>-ly y'rs - tim
PEP: 223
Title: Change the Meaning of \x Escapes
Version: $Revision: 1.4 $
Author: tpeters(a)beopen.com (Tim Peters)
Status: Active
Type: Standards Track
Python-Version: 2.0
Created: 20-Aug-2000
Post-History: 23-Aug-2000
Abstract
Change \x escapes, in both 8-bit and Unicode strings, to consume
exactly the two hex digits following. The proposal views this as
correcting an original design flaw, leading to clearer expression
in all flavors of string, a cleaner Unicode story, better
compatibility with Perl regular expressions, and with minimal risk
to existing code.
Syntax
The syntax of \x escapes, in all flavors of non-raw strings, becomes
\xhh
where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is
not clearly specified in the Reference Manual; it says
\xhh...
implying "two or more" hex digits, but one-digit forms are also
accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
itself (i.e., a backslash followed by the letter x). It's unclear
whether the Reference Manual intended either of the 1-digit or
0-digit behaviors.
Semantics
In an 8-bit non-raw string,
\xij
expands to the character
chr(int(ij, 16))
Note that this is the same as in 1.6 and before.
In a Unicode string,
\xij
acts the same as
\u00ij
i.e. it expands to the obvious Latin-1 character from the initial
segment of the Unicode space.
An \x not followed by at least two hex digits is a compile-time error,
specifically ValueError in 8-bit strings, and UnicodeError (a subclass
of ValueError) in Unicode strings. Note that if an \x is followed by
more than two hex digits, only the first two are "consumed". In 1.6
and before all but the *last* two were silently ignored.
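
As a rough illustration of the rule just stated, here is a minimal sketch in present-day Python 3. It is not the compiler change itself: the helper name expand_x_escapes is hypothetical, it works on the raw text between the quotes, and it ignores every other escape (including an escaped backslash before an x).

```python
import re

# A backslash-x followed by at most two hex digits (possibly none).
_X_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{0,2})")

def expand_x_escapes(source):
    # Hypothetical helper: expand backslash-x escapes with the two-digit rule.
    def replace(match):
        digits = match.group(1)
        if len(digits) < 2:
            # Fewer than two hex digits is an error under the proposal.
            raise ValueError("invalid \\x escape")
        # Exactly two digits are consumed; later digits stay ordinary text.
        return chr(int(digits, 16))
    return _X_ESCAPE.sub(replace, source)

print(repr(expand_x_escapes(r"\x65")))
print(repr(expand_x_escapes(r"\x123465")))
```

Raw string literals are used for the inputs so that Python 3 does not expand the escapes before the helper sees them.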
Example
In 1.5.2:
>>> "\x123465" # same as "\x65"
'e'
>>> "\x65"
'e'
>>> "\x1"
'\001'
>>> "\x\x"
'\\x\\x'
>>>
In 2.0:
>>> "\x123465" # \x12 -> \022, "3465" left alone
'\0223465'
>>> "\x65"
'e'
>>> "\x1"
[ValueError is raised]
>>> "\x\x"
[ValueError is raised]
>>>
History and Rationale
\x escapes were introduced in C as a way to specify variable-width
character encodings. Exactly which encodings those were, and how many
hex digits they required, was left up to each implementation. The
language simply stated that \x "consumed" *all* hex digits following,
and left the meaning up to each implementation. So, in effect, \x in C
is a standard hook to supply platform-defined behavior.
Because Python explicitly aims at platform independence, the \x escape
in Python (up to and including 1.6) has been treated the same way
across all platforms: all *except* the last two hex digits were
silently ignored. So the only actual use for \x escapes in Python was
to specify a single byte using hex notation.
Larry Wall appears to have realized that this was the only real use for
\x escapes in a platform-independent language, as the proposed rule for
Python 2.0 is in fact what Perl has done from the start (although you
need to run in Perl -w mode to get warned about \x escapes with fewer
than 2 hex digits following -- it's clearly more Pythonic to insist on
2 all the time).
When Unicode strings were introduced to Python, \x was generalized so
as to ignore all but the last *four* hex digits in Unicode strings.
This caused a technical difficulty for the new regular expression
engine:
SRE tries very hard to allow mixing 8-bit and Unicode patterns and
strings in intuitive ways, and it no longer had any way to guess what,
for example, r"\x123456" should mean as a pattern: is it asking to match
the 8-bit character \x56 or the Unicode character \u3456?
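
The ambiguity can be made concrete with a small sketch in present-day Python 3. The two function names are hypothetical; they only model the "keep the last N digits" rules described above, not the actual tokenizer code.

```python
def old_rule_8bit(hex_digits):
    # Pre-2.0 8-bit strings: only the last two hex digits count.
    return int(hex_digits[-2:], 16)

def old_rule_unicode(hex_digits):
    # Pre-2.0 Unicode strings: only the last four hex digits count.
    return int(hex_digits[-4:], 16)

digits = "123456"
print(hex(old_rule_8bit(digits)))     # 0x56
print(hex(old_rule_unicode(digits)))  # 0x3456
```

The same six digits name two different characters depending on the string type, which is exactly what a pattern compiler cannot know in advance.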
There are hacky ways to guess, but it doesn't end there. The ISO C99
standard also introduces 8-digit \U12345678 escapes to cover the entire
ISO 10646 character space, and it's also desired that Python 2 support
that from the start. But then what are \x escapes supposed to mean?
Do they ignore all but the last *eight* hex digits then? And if less
than 8 following in a Unicode string, all but the last 4? And if less
than 4, all but the last 2?
This was getting messier by the minute, and the proposal cuts the
Gordian knot by making \x simpler instead of more complicated. Note
that the 4-digit generalization to \xijkl in Unicode strings was also
redundant, because it meant exactly the same thing as \uijkl in Unicode
strings. It's more Pythonic to have just one obvious way to specify a
Unicode character via hex notation.
Development and Discussion
The proposal was worked out among Guido van Rossum, Fredrik Lundh and
Tim Peters in email. It was subsequently explained and discussed on
Python-Dev under subject "Go \x yourself", starting 2000-08-03.
Response was overwhelmingly positive; no objections were raised.
Backward Compatibility
Changing the meaning of \x escapes does carry risk of breaking existing
code, although no instances of incompatibility have yet been discovered.
The risk is believed to be minimal.
Tim Peters verified that, except for pieces of the standard test suite
deliberately provoking end cases, there are no instances of \xabcdef...
with fewer or more than 2 hex digits following, in either the Python
CVS development tree, or in assorted Python packages sitting on his
machine.
It's unlikely there are any with fewer than 2, because the Reference
Manual implied they weren't legal (although this is debatable!). If
there are any with more than 2, Guido is ready to argue they were buggy
anyway <0.9 wink>.
Guido reported that the O'Reilly Python books *already* document that
Python works the proposed way, likely due to their Perl editing
heritage (as above, Perl worked (very close to) the proposed way from
its start).
Finn Bock reported that what JPython does with \x escapes is
unpredictable today. This proposal gives a clear meaning that can be
consistently and easily implemented across all Python implementations.
Effects on Other Tools
Believed to be none. The candidates for breakage would mostly be
parsing tools, but the author knows of none that worry about the
internal structure of Python strings beyond the approximation "when
there's a backslash, swallow the next character". Tim Peters checked
python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
coloring subsystem, and believes there's no need to change any of
them. Tools like tabnanny.py and checkappend.py inherit their immunity
from tokenize.py.
Reference Implementation
The code changes are so simple that a separate patch will not be
produced.
Fredrik Lundh is writing the code, is an expert in the area, and will
simply check the changes in before 2.0b1 is released.
BDFL Pronouncements
Yes, ValueError, not SyntaxError. "Problems with literal interpretations
traditionally raise 'runtime' exceptions rather than syntax errors."
Copyright
This document has been placed in the public domain.
[Tim Peters]
> From BeOpen.com's POV, so long as they were paying major bills, they
> would rather have download traffic tickle their ad banners than SF's
> ad banners.
Even though this should have been clear to me all the time, stating it
explicitly triggers alarms for me.
I just checked, and it appears that Python 2.0 is not available for
download via ftp. In particular, it is not available via ftp from
ftp.python.org! If it was there, mirrors all over the world would pick
it up and bring it in a location near me (ftp.fu-berlin.de would be
nearest) (*).
So while making it available on SF may indeed give no advantage,
making it available on python.org would provide users with alternative
download locations, so that I don't feel the bandwidth limitation that
Web download from pythonlabs.com produces. That would be a clear
advantage to me, at least.
Of course, having ftp mirrors would mean that many downloads do not
tickle anybody's ad banners - which would probably be in the interest
of other users as well, just not in the interest of BeOpen. So I'm
curious how this conflict of interest is resolved...
Regards,
Martin
(*) ftp://ftp.python.org/python/src/README does not even mention
Python 2.0, and ftp://ftp.python.org/python/src/README.ftp says
Note that some mirrors only collect the this directory (src), and the
doc and contrib siblings, while the ftp master site,
<URL:ftp://ftp.python.org/>, has much more.
"Much more" may be true, just nothing recent... I probably should ask
ftpmaster.python.org to put Python-2.0.tar.gz in that directory.
> I don't have access to a Solaris machine, so I can't do anything to
> help these users.
The patch in 117606 looks right to me: gcc on Solaris (and on any
other platform) needs -shared to build shared library; configure
currently passes -G. I haven't actually tried the patch, since it is a
pain to extract it from the SF bug report page. What happens is that
gcc passes -G to the linker, which correctly produces a shared
library. However, gcc also links crt1/crti into the library, which
causes the reference to main.
117508 looks like a user error to me. On its own, configure would not
try to link -ldb, unless it detects the presence of db.h. My guess is
that there is a libdb in /usr/local, so a gcc configure finds it,
whereas the native compiler doesn't. Later, the linker even finds a
-ldb library, but somehow this doesn't have dbopen. So it could be
that the BSDDB installation on that system is screwed.
Regards,
Martin
Jeremy Hylton wrote:
>
> Update of /cvsroot/python/python/dist/src/Python
> In directory slayer.i.sourceforge.net:/tmp/cvs-serv32349/Python
>
> Modified Files:
> ceval.c
> ...
> N.B. The CALL_FUNCTION implementation is getting really hairy; should
> review it to see if it can be simplified.
How about a complete redesign of the whole call mechanism ?!
I have an old patch somewhere which reorganizes the Python
calling mechanism into something more sane than what we
have in ceval.c now.
Should I try to find it for someone to use as basis for the
reorg (don't have time for this myself) ?
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
> which actually will look something like this in real code:
>
> def trade(yours, mine):
> print _('if you give me %(yours)s, i will give you %(mine)s') % {
> 'yours': yours,
> 'mine' : mine,
> }
>
> The string wrapped in _() is what gets translated here.
>
> Okay, we all know that's a pain, right?
What's wrong with
def trade(yours, mine):
print _('if you give me %(yours)s, i will give you %(mine)s') % vars()
Look Ma, no magic!
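
For anyone who wants to try it, a minimal sketch in present-day Python 3 follows. The _ function here is a hypothetical stand-in that returns its argument unchanged; a real setup would install a gettext-style lookup instead, and the function returns the string rather than printing it.

```python
def _(message):
    # Hypothetical stand-in for a translation lookup; returns the text as-is.
    return message

def trade(yours, mine):
    # vars() with no argument returns the local namespace as a dict, so
    # %(yours)s and %(mine)s are filled in from the function's arguments.
    return _('if you give me %(yours)s, i will give you %(mine)s') % vars()

print(trade('spam', 'eggs'))
```

One thing to keep in mind: every local name becomes available to the format string, so the translated text can refer to any of them.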
Regards,
Martin
Do people want to be notified when the development copy of the
documentation is updated? There was a comment from someone that they
didn't know where the online copy was. I can send the update notice
to python-dev instead of just to myself if everyone wants it, or I can
send it to some other (appropriate) list.
-Fred
--
Fred L. Drake, Jr. <fdrake at acm.org>
PythonLabs Team Member
Python 3K. It is the repository for our hopes and dreams. We tend to
invoke it in three different situations:
1. in delaying discussions of gee-whiz features (e.g. static type
checking)
2. in delaying hairy implementation re-factoring that we don't
want to undertake right now.
3. in delaying painful backwards-compatibility breakage
I think it is somewhat debatable whether we really need to or should do
these three things all at once but that's a separate discussion for
another day. (the other experiment may inform our decision)
I want to focus on "3" -- breakage. Rather than delaying painful
backwards-compatibility breakage I think we should make it less painful.
We should decide where we are going long before we ask our users to move
there. I think we should start including alternatives to syntaxes and
features that we know are broken. Once people move to the alternatives
we can "reassign" or remove the original syntax with much less pain.
In other words, rather than telling people "here's a new version, your
code is broken, sorry," we should tell them: "we're going to break code.
Here's an alternative syntax that you can use that will be interpreted
the same in the old and new versions -- please move to this new syntax
as quickly as possible."
I'll outline some examples of the strategy. You may or may not like
details of the particular proposals but you might still agree with the
strategy. Also, I don't claim that all of the proposals are fully
fleshed-out...again, it's the strategy I'm most interested in. I don't
even agree with every feature change I describe below -- they are just
some I've heard of for Python 3000. In other words ** this is not a
design document for Python 3000! **
Separating byte arrays from strings:
1. immediately introduce two new functions:
binopen("foo").read() -> byte array
stropen("foo","UTF-8").read() -> u"...."
2. warn about string literals that have embedded non-Unicode characters
3. deprecate extension modules that return "old fashioned" string
arrays
4. after a period where all strings have been restricted to
Unicode-compatibility, merge the Unicode and string types.
5. deprecate the special Unicode u"" syntax as an imperialist
anachronism
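
To show what step 1 might look like, here is a rough sketch in present-day Python 3, where the bytes/text split already exists. The names binopen and stropen come from the list above; their signatures and behaviour here are assumptions, not a worked-out design.

```python
import io

def binopen(path):
    # Binary mode: read() returns raw bytes, with no decoding applied.
    return io.open(path, "rb")

def stropen(path, encoding):
    # Text mode: read() returns text decoded with the named encoding.
    return io.open(path, "r", encoding=encoding)
```

With these, binopen("foo").read() gives a byte sequence and stropen("foo", "UTF-8").read() gives decoded text, which is the distinction the steps above are trying to phase in.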
Reclaiming the division operator from truncating integer division:
1. immediately introduce a new function, div(), that does division as we
wish it was.
2. add a warning mode to Python (long overdue)
3. with the warning mode on, old-fashioned division triggers a
deprecation warning.
4. after three years as a warning situation we expect all in-use Python
scripts to have been upgraded to use the new operations and to
explicitly truncate integer division when that is what is wanted.
5. at that point we can re-assign the division operator to be a
floating point division if we wish.
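
Steps 1 to 3 could be sketched roughly as follows in present-day Python 3, using the standard warnings module. The name div comes from the list above; old_div is a hypothetical model of what the / operator would do during the warning period, since an operator cannot be patched from Python code.

```python
import warnings

def div(a, b):
    # Division "as we wish it was": always a floating point quotient.
    return float(a) / float(b)

def old_div(a, b):
    # Hypothetical model of old-fashioned division during the warning period.
    if isinstance(a, int) and isinstance(b, int):
        warnings.warn("integer division is deprecated; use div()",
                      DeprecationWarning, stacklevel=2)
        # // floors the result, which is what classic integer / did.
        return a // b
    return a / b

print(div(7, 2))  # 3.5
```

Code that really wants the integer result would be rewritten to ask for it explicitly, which is the upgrade step 4 expects.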
Case insensitivity:
1. Warn whenever a module or class has two __dict__ entries that differ
only by case
2. Eventually, disallow that form of name-clash altogether
3. After a period, allow case variations to be equivalent.
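
The check in step 1 is easy to prototype. The sketch below is present-day Python 3, and the function name case_clashes is hypothetical; it only finds the clashes and leaves the warning itself to the caller.

```python
from collections import defaultdict

def case_clashes(namespace):
    # Group the names by their lowercased form and keep groups of two or more.
    groups = defaultdict(list)
    for name in namespace:
        groups[name.lower()].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)

class Example:
    value = 1
    Value = 2
    other = 3

print(case_clashes(vars(Example)))  # [['Value', 'value']]
```

The same function works on a module by passing vars(module).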
Unifying types and classes (more vague):
1. Add something like extension class to make type subclassing easier.
2. Informally deprecate modules that do not incorporate it.
3. Replace or fix details of the language and implementation that
behave differently for types and classes. (e.g. the type() operator)
--
Paul Prescod - Not encumbered by ActiveState consensus
Simplicity does not precede complexity, but follows it.
- http://www.cs.yale.edu/homes/perlis-alan/quotes.html