Mailman 3 April 2000 - Python-Dev

Python source code encoding
by M.-A. Lemburg 16 Apr '00

[Fredrik]:
> [MAL]:
> > > To reinforce Fredrik's point here, note that XML only supports
> > > encodings at the level of an entire file (or external entity). You
> > > can't tell an XML parser that a file is in UTF-8, except for this one
> > > element whose contents are in Latin1.
> >
> > Hmm, this would mean that someone who writes:
> >
> > """
> > #pragma script-encoding utf-8
> >
> > u = u"\u1234"
> > print u
> > """
> >
> > would suddenly see "\u1234" as output.
>
> not necessarily. consider this XML snippet:
>
>     <?xml version='1.0' encoding='utf-8'?>
>     <body>&#x1234;</body>
>
> if I run this through an XML parser and write it
> out as UTF-8, I get:
>
>     <body>á^´</body>
>
> in other words, the parser processes "&#x" after
> decoding to unicode, not before.
>
> I see no reason why Python cannot do the same.

Sure, and this is what I meant when I said that the compiler
has to deal with several different encodings. Unicode escape
sequences are currently handled by a special codec, the
unicode-escape codec, which reads all characters with ordinal < 256
as-is (meaning Latin-1, since the first 256 Unicode ordinals map to
Latin-1 characters (*)) except a few escape sequences which it
processes much like the Python parser does for 8-bit strings, and
the new \uXXXX escape.

Perhaps we should make this processing use two levels... the
escape codecs would need some rewriting to process Unicode->
Unicode instead of 8-bit->Unicode as they do now.

--

To move along with the method Fredrik is proposing, I would suggest
(for Python 1.7) introducing a preprocessor step which gets
executed even before the tokenizer. The preprocessor step
would then translate char* input into Py_UNICODE* (using an
encoding hint which would have to appear in the first few lines
of input using some special format).

The tokenizer could then work on a Py_UNICODE* buffer and the
parser would then take care of the conversion from Py_UNICODE* back
to char* for Python's 8-bit strings. It should shout out loud
in case it sees input data outside Unicode range(256)
in what is supposed to be an 8-bit string.

To make this fully functional we would have to change
the 8-bit string to Unicode coercion mechanism, though. It would
have to make a Latin-1 assumption instead of the current
UTF-8 assumption. In contrast to the current scheme, this
assumption would be correct for all constant strings appearing
in source code given the above preprocessor logic. For strings
constructed from file or user input the programmer would have
to ensure proper encoding or do the Unicode conversion
himself.

Sidenote: The UTF-8->Latin-1 change would probably also have to
be propagated to all other Unicode in/output logic -- perhaps
Latin-1 is the better default encoding after all...

A programmer could then write a Python script completely
in UTF-8, UTF-16 or Shift-JIS and the above logic would convert
the input data to Unicode or Latin-1 (which is 8-bit Unicode)
as appropriate and it would warn about impossible conversions
to Latin-1 in the compile step.

The programmer would still have to make sure that file and
user input gets converted using the proper encoding, but this
can easily be done using the stream wrappers in the standard
codecs module.

Note that in this discussion we need to be very careful
not to mix up encodings used for source code and ones used
when reading/writing to files or other streams (including
stdin/stdout).

BTW, to experiment with all this you can use the
codecs.EncodedFile stream wrapper. It allows specifying both
data and stream side encodings, e.g. you can redirect a UTF-8
stdin stream to a file object that returns Latin-1, which can
then be used as the source of data input.

(*) The conversion from Unicode to Latin-1 is similar to
converting a 2-byte unsigned short to an unsigned byte with
some extra logic to catch data loss. Latin-1 is comparable
to 8-bit Unicode... this is where all this talk about Latin-1
originates from :-)

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
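
The Latin-1 footnote and the codecs.EncodedFile suggestion above can be tried out directly. What follows is only a minimal sketch in present-day Python 3, not the 1.6-era interpreter under discussion; the in-memory io.BytesIO stream is a stand-in for stdin and the sample text is made up for illustration.

```python
import codecs
import io

# Latin-1 maps each byte value 0..255 to the Unicode code point with the
# same ordinal, which is why it behaves like "8-bit Unicode".
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
same_ordinals = [ord(ch) for ch in latin1_text] == list(range(256))

# Narrowing Unicode to Latin-1 fails loudly when data would be lost,
# much like the "extra logic to catch data loss" in the footnote.
try:
    "\u1234".encode("latin-1")
    lossy_detected = False
except UnicodeEncodeError:
    lossy_detected = True

# codecs.EncodedFile: the underlying stream holds UTF-8 (file_encoding),
# while reads from the wrapper return Latin-1 bytes (data_encoding).
utf8_stream = io.BytesIO("caf\u00e9".encode("utf-8"))  # stand-in for stdin
recoded = codecs.EncodedFile(utf8_stream, data_encoding="latin-1",
                             file_encoding="utf-8")
latin1_bytes = recoded.read()
```

The wrapper keeps the two encodings separate, which is the distinction the message asks readers to preserve between source-code encodings and stream encodings.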

>2GB Data.fs files on FreeBSD
by Andrew M. Kuchling 15 Apr '00

[Cc'ed to python-dev from the zope-dev mailing list; trim your
follow-ups appropriately]

R. David Murray writes:
>So it looks like there is a problem using Zope with a large database
>no matter what the platform. Has anyone figured out how to fix this?
...
>But given the number of people who have said "use FreeBSD if you want
>big files", I'm really wondering about this. What if later I
>have an application where I really need a >2GB database?

Different system calls are used for large files, because you can no
longer use 32-bit ints to store file position. There's a
HAVE_LARGEFILE_SUPPORT #define that turns on the use of these
alternate system calls; see Python's configure.in for the test used to
detect when it should be turned on. You could just hack the generated
config.h to turn on large file support and recompile your copy of
Python, but if the configure.in test is incorrect, that should be
fixed.

The test is:

AC_MSG_CHECKING(whether to enable large file support)
if test "$have_long_long" = yes -a \
        "$ac_cv_sizeof_off_t" -gt "$ac_cv_sizeof_long" -a \
        "$ac_cv_sizeof_long_long" -ge "$ac_cv_sizeof_off_t"; then
  AC_DEFINE(HAVE_LARGEFILE_SUPPORT)
  AC_MSG_RESULT(yes)
else
  AC_MSG_RESULT(no)
fi

I thought you have to use the loff_t type instead of off_t; maybe this
test should check for it instead? Anyone know anything about large
file support?

--
A.M. Kuchling			http://starship.python.net/crew/amk/
When I dream, sometimes I remember how to fly. You just lift one leg, then you
lift the other leg, and you're not standing on anything, and you can fly.
    -- Chloe Russell, in SANDMAN #43: "Brief Lives:3"
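
To make the three conditions of the quoted configure.in test easier to reason about, here is a rough Python restatement. The function name and the two sets of type sizes are hypothetical, chosen only for illustration; they are not measurements of FreeBSD or any other platform.

```python
def wants_largefile_support(have_long_long, sizeof_off_t, sizeof_long,
                            sizeof_long_long):
    """Mirror the three conditions of the quoted configure.in test."""
    return (have_long_long
            and sizeof_off_t > sizeof_long             # off_t wider than long
            and sizeof_long_long >= sizeof_off_t)      # long long can hold off_t

# Largest file position a signed 32-bit integer can hold: 2 GB minus one byte.
MAX_32BIT_OFFSET = 2**31 - 1

# Hypothetical platform A: 4-byte long, 8-byte off_t, 8-byte long long.
platform_a = wants_largefile_support(True, 8, 4, 8)

# Hypothetical platform B: long is already as wide as off_t, so the
# "-gt" comparison fails and the define is not turned on.
platform_b = wants_largefile_support(True, 8, 8, 8)
```

Under these assumptions the test only enables HAVE_LARGEFILE_SUPPORT when off_t is strictly wider than long, which is one place to look if the define is unexpectedly missing.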

Re: Comparison of cyclic objects (was RE: [Python-Dev] trashcan and PR#7)
by Jeremy Hylton 15 Apr '00

I did one more round of work on this idea, and I'm satisfied with the
results. Most of the performance hit can be eliminated by doing
nothing until there are at least N recursive calls to
PyObject_Compare, where N is fairly large. (I picked 25000.)
Non-circular objects that are not deeply nested only pay for an
integer increment, a decrement, and a compare.

Background for patches-only readers: This patch appears to fix PR#7.

Comments and suggestions solicited. I think this is worth checking
in.

Jeremy

Index: Include/object.h
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Include/object.h,v
retrieving revision 2.52
diff -r2.52 object.h
286a287,289
> /* tstate dict key for PyObject_Compare helper */
> extern PyObject *_PyCompareState_Key;
>
Index: Python/pythonrun.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Python/pythonrun.c,v
retrieving revision 2.91
diff -r2.91 pythonrun.c
151a152,153
>     _PyCompareState_Key = PyString_InternFromString("cmp_state");
>
Index: Objects/object.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Objects/object.c,v
retrieving revision 2.67
diff -r2.67 object.c
300a301,306
> PyObject *_PyCompareState_Key;
>
> int _PyCompareState_nesting = 0;
> int _PyCompareState_flag = 0;
> #define NESTING_LIMIT 25000
>
305a312,313
>     int result;
>
372c380
<     if (vtp->tp_compare == NULL)
---
>     if (vtp->tp_compare == NULL) {
374c382,440
<     return (*vtp->tp_compare)(v, w);
---
>     }
>     ++_PyCompareState_nesting;
>     if (_PyCompareState_nesting > NESTING_LIMIT)
>         _PyCompareState_flag = 1;
>     if (_PyCompareState_flag &&
>         (vtp->tp_as_mapping || (vtp->tp_as_sequence &&
>                                 !PyString_Check(v))))
>     {
>         PyObject *tstate_dict, *cmp_dict, *pair;
>
>         tstate_dict = PyThreadState_GetDict();
>         if (tstate_dict == NULL) {
>             PyErr_BadInternalCall();
>             return -1;
>         }
>         cmp_dict = PyDict_GetItem(tstate_dict, _PyCompareState_Key);
>         if (cmp_dict == NULL) {
>             cmp_dict = PyDict_New();
>             if (cmp_dict == NULL)
>                 return -1;
>             PyDict_SetItem(tstate_dict,
>                            _PyCompareState_Key,
>                            cmp_dict);
>         }
>
>         pair = PyTuple_New(2);
>         if (pair == NULL) {
>             return -1;
>         }
>         if ((long)v <= (long)w) {
>             PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)v));
>             PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)w));
>         } else {
>             PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)w));
>             PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)v));
>         }
>         if (PyDict_GetItem(cmp_dict, pair)) {
>             /* already comparing these objects.  assume
>                they're equal until shown otherwise
>             */
>             Py_DECREF(pair);
>             --_PyCompareState_nesting;
>             if (_PyCompareState_nesting == 0)
>                 _PyCompareState_flag = 0;
>             return 0;
>         }
>         if (PyDict_SetItem(cmp_dict, pair, pair) == -1) {
>             return -1;
>         }
>         result = (*vtp->tp_compare)(v, w);
>         PyDict_DelItem(cmp_dict, pair);
>         Py_DECREF(pair);
>     } else {
>         result = (*vtp->tp_compare)(v, w);
>     }
>     --_PyCompareState_nesting;
>     if (_PyCompareState_nesting == 0)
>         _PyCompareState_flag = 0;
>     return result;
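
The core idea of the patch (remember which pairs of objects are currently being compared, and treat a pair that is already in progress as equal until shown otherwise) can be sketched in Python. This is only an illustration under simplifying assumptions, not the patch itself: it handles lists only, tests equality rather than full ordering, keeps the in-progress pairs in a plain set instead of the thread-state dict, and leaves out the NESTING_LIMIT deferral. The function name is hypothetical.

```python
def cyclic_equal(v, w, _in_progress=None):
    """Equality test that terminates on self-referencing lists (sketch)."""
    if _in_progress is None:
        _in_progress = set()
    if not (isinstance(v, list) and isinstance(w, list)):
        return v == w
    # Order the two identities so (v, w) and (w, v) share one key,
    # like the (long)v <= (long)w test in the patch.
    pair = (min(id(v), id(w)), max(id(v), id(w)))
    if pair in _in_progress:
        # Already comparing these objects: assume equal until shown otherwise.
        return True
    _in_progress.add(pair)
    try:
        if len(v) != len(w):
            return False
        return all(cyclic_equal(x, y, _in_progress) for x, y in zip(v, w))
    finally:
        _in_progress.discard(pair)

a = [1]
a.append(a)          # a == [1, a]
b = [1]
b.append(b)          # b == [1, b]
c = [2]
c.append(c)          # c == [2, c]

a_equals_b = cyclic_equal(a, b)
a_equals_c = cyclic_equal(a, c)
```

The `finally` clause removes the pair on every exit path, so the bookkeeping stays balanced even when a comparison returns early.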

Re: [Python-checkins] CVS: distutils/distutils unixccompiler.py,1.21,1.22
by Fred L. Drake, Jr. 14 Apr '00

Greg Ward writes:
 > !         # Not many Unices required ranlib anymore -- SunOS 4.x is, I
 > !         # think the only major Unix that does.  Maybe we need some

  You're saying that SunOS 4.x *is* a major Unix????  Not for a while,
now....

  -Fred

--
Fred L. Drake, Jr.	  <fdrake at acm.org>
Corporation for National Research Initiatives

cvs-server out of sync with mailing-list ?
by Fred Gansevles 14 Apr '00

I try to keep up-to-date with the cvs-tree at cvs.python.org and receive
the python-checkins(a)python.org mailing-list.

Just now I discovered that the cvs-server and the checkins-list are out of
sync. For example: according to the checkins-list the latest version of
src/Python/sysmodule.c is 2.62 and according to the cvs-server the latest
version is 2.59

Am I missing something or is there some kind of a problem ?

____________________________________________________________________________
Fred Gansevles <mailto:Fred.Gansevles@cs.utwente.nl>  Phone: +31 53 489 4613
>>> Your one-stop-shop for Linux/WinNT/NetWare <<<
Org.: Twente University, Fac. of CS, Box 217, 7500 AE Enschede, Netherlands
"Bill needs more time to learn Linux" - Steve B.

Re: [Python-checkins] CVS: python/dist/src/Python sysmodule.c,2.60,2.61
by Greg Stein 14 Apr '00

It's great that you made this change! I hadn't got through my mail, but
was going to recommend it... :-)

One comment:

On Thu, 13 Apr 2000, Fred Drake wrote:
>...
> --- 409,433 ----
>                                  v = PyInt_FromLong(PY_VERSION_HEX));
>       Py_XDECREF(v);
> +     /*
> +      * These release level checks are mutually exclusive and cover
> +      * the field, so don't get too fancy with the pre-processor!
> +      */
> + #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_ALPHA
> +     v = PyString_FromString("alpha");
> + #endif
> + #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_BETA
> +     v = PyString_FromString("beta");
> + #endif
> + #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_GAMMA
> +     v = PyString_FromString("candidate");
> + #endif
>   #if PY_RELEASE_LEVEL == PY_RELEASE_LEVEL_FINAL
> !     v = PyString_FromString("final");
> ! #endif
>       PyDict_SetItemString(sysdict, "version_info",
> !                          v = Py_BuildValue("iiiNi", PY_MAJOR_VERSION,
>                                              PY_MINOR_VERSION,
> !                                            PY_MICRO_VERSION, v,
> !                                            PY_RELEASE_SERIAL));
>       Py_XDECREF(v);
>       PyDict_SetItemString(sysdict, "copyright",

I would recommend using the "s" format code in Py_BuildValue. It
simplifies the code, and it is quite a bit easier for a human to process.
When I first saw the code, I thought "the level string leaks!" Then I saw
the "N" code, went and looked it up, and realized what is going on. So...
to avoid that, the "s" code would be great.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

Re: [Python-checkins] CVS: python/dist/src/Python sysmodule.c,2.59,2.60
by Fredrik Lundh 14 Apr '00

> Modified Files:
>       sysmodule.c
> Log Message:
>
> Define version_info to be a tuple (major, minor, micro, level); level
> is a string "a2", "b1", "c1", or '' for a final release.

maybe level should be chosen so that version_info for a final
release is larger than version_info for the corresponding beta ?

</F>
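
The ordering concern can be seen with plain tuple comparison. A small sketch in present-day Python 3; the version number 1.6.0 is chosen only for illustration, and the level names in the second half are the ones from the follow-up 2.61 check-in quoted elsewhere in this archive.

```python
# Level strings as described in the quoted log message: '' marks a final
# release, so the final tuple compares *lower* than the beta tuple.
beta = (1, 6, 0, "b1")
final = (1, 6, 0, "")
final_sorts_before_beta = final < beta

# Level names whose alphabetical order happens to match release order.
named_levels = ["alpha", "beta", "candidate", "final"]
names_already_ordered = named_levels == sorted(named_levels)
named_final_is_largest = (1, 6, 0, "final") > (1, 6, 0, "beta")
```

Tuples compare element by element, so whichever string is used for the level decides the result whenever the three numeric fields are equal.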

#pragmas in Python source code
by M.-A. Lemburg 14 Apr '00

There currently is a discussion about how to write Python
source code in different encodings on i18n.

The (experimental) solution so far has been to add a command
line switch to Python which tells the compiler which encoding
to expect for u"...strings..." ("...8-bit strings..." will
still be used as is -- it's the user's responsibility to
use the right encoding; the Unicode implementation will still
assume them to be UTF-8 encoded in automatic conversions).

In the end, a #pragma should be usable to tell the compiler
which encoding to use for decoding the u"..." strings.

What we need now, is a good proposal for handling these
#pragmas... does anyone have experience with these ? Any
ideas ?

Here's a simple strawman for the syntax:

    # pragma key: value

    parser = re.compile(
        '^#\s*pragma\s+'
        '([a-zA-Z_][a-zA-Z0-9_]*):\s*'
        '(.+)'
        )

For the encoding this would be something like:

    # pragma encoding: unicode-escape

The compiler would scan these pragma defs, add them to an
internal temporary dictionary and use them for all subsequent
code it finds during the compilation process. The dictionary
would have to stay around until the original compile() call has
completed (spanning recursive calls).

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
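
As a rough illustration of the strawman, the sketch below (present-day Python 3) applies the proposed regular expression to each line of a source string and collects the results in a dictionary, roughly as the message describes for the compiler. The function name collect_pragmas and the sample source are hypothetical; nothing here reflects how the compiler was actually changed.

```python
import re

# The strawman pattern from the message, written with raw strings.
PRAGMA = re.compile(
    r'^#\s*pragma\s+'
    r'([a-zA-Z_][a-zA-Z0-9_]*):\s*'
    r'(.+)'
)

def collect_pragmas(source):
    """Return a dict of all '# pragma key: value' lines found in source."""
    pragmas = {}
    for line in source.splitlines():
        match = PRAGMA.match(line)
        if match:
            key, value = match.groups()
            pragmas[key] = value.strip()   # drop trailing whitespace
    return pragmas

sample = '# pragma encoding: unicode-escape\nu = u"\\u1234"\nprint(u)\n'
found = collect_pragmas(sample)
```

A later pragma with the same key overwrites the earlier one in this sketch; whether that is the desired behaviour is one of the questions the proposal leaves open.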

if key in dict?
by Fredrik Lundh 13 Apr '00

now that we have the sq_contains slot, would it make
sense to add support for "key in dict" ?

after all,

    if key in dict:
        ...

is a bit more elegant than:

    if dict.has_key(key):
        ...

and much faster than:

    if key in dict.keys():
        ...

(the drawback is that once we add this, some people might
expect dictionaries to behave like sequences in other ways
too...)

(and yes, this might break code that looks for tp_as_sequence
before looking for tp_as_mapping. haven't found any code like
that, but I might have missed something).

whaddyathink?

</F>
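
For comparison, this is how the two membership tests look in present-day Python 3, where "key in dict" is supported. The CountingDict subclass is a hypothetical helper that only counts how often the containment hook (`__contains__`, the Python-level counterpart of the sq_contains slot) is invoked; list(d.keys()) stands in for the list that keys() returned at the time of the message.

```python
class CountingDict(dict):
    """dict subclass that counts membership tests (illustration only)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.contains_calls = 0

    def __contains__(self, key):
        self.contains_calls += 1
        return super().__contains__(key)

d = CountingDict(spam=1, eggs=2)

direct = "spam" in d                  # one hashed lookup via __contains__
via_list = "spam" in list(d.keys())   # builds a list, then scans it linearly
missing = "ham" in d                  # second call to __contains__
```

Both forms give the same answer; the difference is that the direct test never materialises or scans the list of keys.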

Crash in new "trashcan" mechanism.
by Mark Hammond 13 Apr '00

[I'm re-sending as the attachment caused this to be held up for
administrative approval. I've forwarded the attachment to Chris -
anyone else just mail me for it]

I've struck a crash in the new trashcan mechanism (so I guess Chris
is gunna pay the most attention here). Although I can only provoke
this reliably in debug builds, I believe it also exists in release
builds, but is just far more insidious.

Unfortunately, I also cannot create a simple crash case. But I
_can_ provide info on how you can reliably cause the crash.
Obviously only tested on Windows...

* Go to http://lima.mudlib.org/~rassilon/p2c/, and grab the
  download, and unzip.
* Replace "transformer.py" with the attached version (multi-arg
  append bites :-)
* Ensure you have a Windows "debug" build available, built from CVS.
* From the p2c directory, run "python_d.exe gencode.py gencode.py"

You will get a crash, and the debugger will show you are destructing
a list, with an invalid object. The crash occurs about 1000 times
after this code is first hit, and I can't narrow the crash condition
down :-(

If you open object.h, and disable the trashcan mechanism (by
changing the "xx", as the comments suggest) then it runs fine.

Hope this helps someone - I'm afraid I haven't a clue :-(

Mark.
