> > > To reinforce Fredrik's point here, note that XML only supports
> > > encodings at the level of an entire file (or external entity). You
> > > can't tell an XML parser that a file is in UTF-8, except for this one
> > > element whose contents are in Latin1.
> > Hmm, this would mean that someone who writes:
> > """
> > #pragma script-encoding utf-8
> > u = u"\u1234"
> > print u
> > """
> > would suddenly see "\u1234" as output.
> not necessarily. consider this XML snippet:
> <?xml version='1.0' encoding='utf-8'?>
> if I run this through an XML parser and write it
> out as UTF-8, I get:
> in other words, the parser processes "&#x" after
> decoding to unicode, not before.
> I see no reason why Python cannot do the same.
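The behaviour Fredrik describes is easy to check with today's Python, using the xml.etree module (which postdates this thread; a sketch, not the parser he used):

```python
import xml.etree.ElementTree as ET

# The "&#x1234;" character reference is resolved after the byte stream
# has been decoded to Unicode -- not against the raw encoded bytes.
doc = b"<?xml version='1.0' encoding='utf-8'?><e>caf\xc3\xa9 &#x1234;</e>"
elem = ET.fromstring(doc)
print(elem.text)   # "café" followed by the single character U+1234
```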
Sure, and this is what I meant when I said that the compiler
has to deal with several different encodings. Unicode escape
sequences are currently handled by a special codec, the
unicode-escape codec which reads all characters with ordinal
< 256 as-is (meaning Latin-1, since the first 256 Unicode
ordinals map to Latin-1 characters (*)) except a few escape sequences
which it processes much like the Python parser does for 8-bit
strings and the new \uXXXX escape.
Perhaps we should make this processing use two levels...
the escape codecs would need some rewriting to process Unicode->
Unicode instead of 8-bit->Unicode as they do now.
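The current codec behaviour described above can be observed directly in today's Python (a sketch; bytes below 256 pass through as Latin-1, while escape sequences are interpreted):

```python
# Bytes < 256 are taken as-is, i.e. as Latin-1...
latin1_part = b"caf\xe9".decode("unicode_escape")
# ...while escape sequences are processed much like in source code:
escaped_part = b"\\u1234".decode("unicode_escape")
print(latin1_part)    # café
print(escaped_part)   # the single character U+1234
```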
To move toward what Fredrik is proposing I would suggest
(for Python 1.7) to introduce a preprocessor step which gets executed
even before the tokenizer. The preprocessor step would then
translate char* input into Py_UNICODE* (using an encoding hint which
would have to appear in the first few lines of input using some special
format). The tokenizer could then work on Py_UNICODE* buffer and
the parser would then take care of the conversion from Py_UNICODE*
back to char* for Python's 8-bit strings. It should shout out loud
in case it sees input data outside Unicode range(256) in what is
supposed to be an 8-bit string.
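A minimal sketch of such a preprocessor step, written in Python rather than C (the encoding-hint format and the function names here are invented for illustration):

```python
import re

# Hypothetical hint format: "# pragma script-encoding: <name>" somewhere
# in the first two lines of the source; defaults to latin-1 as discussed.
HINT = re.compile(rb"#\s*pragma\s+script-encoding:\s*([-\w]+)")

def preprocess(raw: bytes) -> str:
    encoding = "latin-1"
    for line in raw.splitlines()[:2]:
        m = HINT.search(line)
        if m:
            encoding = m.group(1).decode("ascii")
            break
    # Decode the whole buffer; the tokenizer would then work on the
    # resulting Unicode text instead of on raw bytes.
    return raw.decode(encoding)

src = b"# pragma script-encoding: utf-8\nprint('caf\xc3\xa9')\n"
print(preprocess(src))
```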
To make this fully functional we would have to change the 8-bit
string to Unicode coercion mechanism, though. It would have to
make a Latin-1 assumption instead of the current UTF-8 assumption.
In contrast to the current scheme, this assumption would be correct
for all constant strings appearing in source code given the above
preprocessor logic. For strings constructed from file or user input
the programmer would have to ensure proper encoding or do the
Unicode conversion himself.
The UTF-8->Latin-1 change would probably also have to be propagated
to all other Unicode in/output logic -- perhaps Latin-1 is the better
default encoding after all...
A programmer could then write a Python script completely in UTF-8,
UTF-16 or Shift-JIS and the above logic would convert the input
data to Unicode or Latin-1 (which is 8-bit Unicode) as appropriate
and it would warn about impossible conversions to Latin-1 in the
compile step. The programmer would still have to make sure that file
and user input gets converted using the proper encoding, but this
can easily be done using the stream wrappers in the standard library.
Note that in this discussion we need to be very careful not
to confuse the encodings used for source code with the ones used
when reading/writing to files or other streams (including the
standard streams).
BTW, to experiment with all this you can use the codecs.EncodedFile
stream wrapper. It allows specifying both data and stream side
encodings, e.g. you can redirect a UTF-8 stdin stream to a Latin-1
returning file object which can then be used as a source of data.
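A sketch of that experiment, using io.BytesIO in place of a real stdin:

```python
import codecs
import io

# The underlying "stream" contains UTF-8 bytes...
utf8_stream = io.BytesIO("café\n".encode("utf-8"))

# ...but the wrapper hands the application Latin-1 bytes instead:
# data side is latin-1, stream side is utf-8.
wrapped = codecs.EncodedFile(utf8_stream, "latin-1", "utf-8")
data = wrapped.read()
print(data)   # b'caf\xe9\n'
```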
(*) The conversion from Unicode to Latin-1 is similar to converting
a 2-byte unsigned short to an unsigned byte with some extra logic
to catch data loss. Latin-1 is comparable to 8-bit Unicode...
this is where all this talk about Latin-1 originates from :-)
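The footnote's "unsigned short to unsigned byte with a data-loss check" can be sketched as (function name invented for illustration):

```python
def unicode_to_latin1(s: str) -> bytes:
    # Each Unicode ordinal is a 16-bit (or wider) value; Latin-1 simply
    # keeps the low byte -- but only if nothing would be lost.
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp > 0xFF:
            raise ValueError("ordinal %d not in range(256)" % cp)
        out.append(cp)
    return bytes(out)

print(unicode_to_latin1("café"))   # b'caf\xe9'
```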
Python Pages: http://www.lemburg.com/python/
[Cc'ed to python-dev from the zope-dev mailing list; trim your
replies accordingly]
R. David Murray writes:
>So it looks like there is a problem using Zope with a large database
>no matter what the platform. Has anyone figured out how to fix this?
>But given the number of people who have said "use FreeBSD if you want
>big files", I'm really wondering about this. What if later I
>have an application where I really need a >2GB database?
Different system calls are used for large files, because you can no
longer use 32-bit ints to store file position. There's a
HAVE_LARGEFILE_SUPPORT #define that turns on the use of these
alternate system calls; see Python's configure.in for the test used to
detect when it should be turned on. You could just hack the generated
config.h to turn on large file support and recompile your copy of
Python, but if the configure.in test is incorrect, that should be fixed.
The test is:
AC_MSG_CHECKING(whether to enable large file support)
if test "$have_long_long" = yes -a \
"$ac_cv_sizeof_off_t" -gt "$ac_cv_sizeof_long" -a \
"$ac_cv_sizeof_long_long" -ge "$ac_cv_sizeof_off_t"; then
I thought you have to use the loff_t type instead of off_t; maybe this
test should check for it instead?  Anyone know anything about large
file support?
A.M. Kuchling http://starship.python.net/crew/amk/
When I dream, sometimes I remember how to fly. You just lift one leg, then you
lift the other leg, and you're not standing on anything, and you can fly.
-- Chloe Russell, in SANDMAN #43: "Brief Lives:3"
Greg Ward writes:
> ! # Not many Unices required ranlib anymore -- SunOS 4.x is, I
> ! # think the only major Unix that does. Maybe we need some
You're saying that SunOS 4.x *is* a major Unix????  Not for a while now!
Fred L. Drake, Jr. <fdrake at acm.org>
Corporation for National Research Initiatives
I try to keep up-to-date with the cvs-tree at cvs.python.org and receive
the python-checkins(a)python.org mailing-list.
Just now I discovered that the cvs-server and the checkins-list are out of
sync. For example: according to the checkins-list the latest version of
src/Python/sysmodule.c is 2.62 and according to the cvs-server the latest
version is 2.59
Am I missing something, or is there some kind of problem?
Fred Gansevles <mailto:Fred.Gansevles@cs.utwente.nl> Phone: +31 53 489 4613
>>> Your one-stop-shop for Linux/WinNT/NetWare <<<
Org.: Twente University, Fac. of CS, Box 217, 7500 AE Enschede, Netherlands
"Bill needs more time to learn Linux" - Steve B.
It's great that you made this change! I hadn't got through my mail, but
was going to recommend it... :-)
On Thu, 13 Apr 2000, Fred Drake wrote:
> --- 409,433 ----
> v = PyInt_FromLong(PY_VERSION_HEX));
> + /*
> + * These release level checks are mutually exclusive and cover
> + * the field, so don't get too fancy with the pre-processor!
> + */
> + #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_ALPHA
> + v = PyString_FromString("alpha");
> + #endif
> + #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_BETA
> + v = PyString_FromString("beta");
> + #endif
> + #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_GAMMA
> + v = PyString_FromString("candidate");
> + #endif
> #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_FINAL
> ! v = PyString_FromString("final");
> ! #endif
> PyDict_SetItemString(sysdict, "version_info",
> ! v = Py_BuildValue("iiiNi", PY_MAJOR_VERSION,
> ! PY_MICRO_VERSION, v,
> ! PY_RELEASE_SERIAL));
> PyDict_SetItemString(sysdict, "copyright",
I would recommend using the "s" format code in Py_BuildValue. It
simplifies the code, and it is quite a bit easier for a human to process.
When I first saw the code, I thought "the level string leaks!" Then I saw
the "N" code, went and looked it up, and realized what is going on.
So... to avoid that, the "s" code would be great.
Greg Stein, http://www.lyra.org/
> Modified Files:
> Log Message:
> Define version_info to be a tuple (major, minor, micro, level); level
> is a string "a2", "b1", "c1", or '' for a final release.
maybe level should be chosen so that version_info for a final
release compares larger than version_info for the corresponding beta?
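Tuple comparison makes the problem concrete; a sketch (the exact tuple layout here is illustrative):

```python
beta  = (1, 6, 0, "b1", 1)
final = (1, 6, 0, "",   0)

# Python compares tuples element by element, and "" sorts before "b1",
# so a final release would wrongly compare *smaller* than its own beta:
print(final < beta)   # True
```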
There is currently a discussion on the i18n list about how to write
Python source code in different encodings. The (experimental)
solution so far has been to add a command line switch to
Python which tells the compiler which encoding to expect
for u"...strings..." ("...8-bit strings..." will still be used
as is -- it's the user's responsibility to use the right
encoding; the Unicode implementation will still assume them
to be UTF-8 encoded in automatic conversions).
In the end, a #pragma should be usable to tell the compiler
which encoding to use for decoding the u"..." strings.
What we need now, is a good proposal for handling these
#pragmas... does anyone have experience with these? Any ideas?
Here's a simple strawman for the syntax:
# pragma key: value
parser = re.compile(r'#\s*pragma\s+(?P<key>\w+)\s*:\s*(?P<value>\S+)')
For the encoding this would be something like:
# pragma encoding: unicode-escape
The compiler would scan these pragma defs, add them to an
internal temporary dictionary and use them for all subsequent
code it finds during the compilation process. The dictionary
would have to stay around until the original compile() call has
completed (spanning recursive calls).
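A strawman implementation of that scan (function name invented for illustration):

```python
import re

PRAGMA = re.compile(r"#\s*pragma\s+(\w+)\s*:\s*(\S+)")

def scan_pragmas(source):
    # Collect all "# pragma key: value" lines into a dictionary which
    # the compiler could consult for the rest of the compilation.
    pragmas = {}
    for line in source.splitlines():
        m = PRAGMA.match(line.strip())
        if m:
            pragmas[m.group(1)] = m.group(2)
    return pragmas

src = "# pragma encoding: unicode-escape\nu = u'abc'\n"
print(scan_pragmas(src))   # {'encoding': 'unicode-escape'}
```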
now that we have the sq_contains slot, would it make
sense to add support for "key in dict" ?
if key in dict:
is a bit more elegant than:
if dict.has_key(key):
and much faster than:
if key in dict.keys():
(the drawback is that once we add this, some people might expect
dictionaries to behave like sequences in other ways too...)
(and yes, this might break code that looks for tp_as_sequence
before looking for tp_as_mapping. haven't found any code like
that, but I might have missed something).
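For comparison, here is what this would enable (this is standard behaviour in today's Python, where dict membership is a single hash lookup):

```python
d = {"spam": 1, "eggs": 2}

# Direct containment test -- one hash lookup:
print("spam" in d)                 # True

# The equivalent linear scan over a materialised key list -- same
# answer, but O(n) and it builds a throwaway list first:
print("spam" in list(d.keys()))    # True
```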
[I'm re-sending as the attachment caused this to be held up for
administrative approval. I've forwarded the attachment to Chris -
anyone else, just mail me for it]
I've struck a crash in the new trashcan mechanism (so I guess Chris
is gunna pay the most attention here). Although I can only provoke
this reliably in debug builds, I believe it also exists in release
builds, but is just far more insidious.
Unfortunately, I also can not create a simple crash case. But I
_can_ provide info on how you can reliably cause the crash.
Obviously only tested on Windows...
* Go to http://lima.mudlib.org/~rassilon/p2c/, and grab the
download, and unzip it.
* Replace "transformer.py" with the attached version (multi-arg
append bites :-)
* Ensure you have a Windows "debug" build available, built from CVS.
* From the p2c directory, Run "python_d.exe gencode.py gencode.py"
You will get a crash, and the debugger will show you are destructing
a list, with an invalid object. The crash occurs about 1000 times
after this code is first hit, and I can't narrow the crash condition
down any further.
changing the "xx", as the comments suggest) then it runs fine.
Hope this helps someone - I'm afraid I haven't a clue :-(