From guido at python.org  Fri Sep  1 00:04:50 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 31 Aug 2006 15:04:50 -0700
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <1cb725390608311313h4eac0f98x85a0690d3082b533@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu> <ed1q7r$v4s$2@sea.gmane.org>
	<20060829102307.1B0F.JCARLSON@uci.edu> <ed1uds$iog$1@sea.gmane.org>
	<ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<1cb725390608311313h4eac0f98x85a0690d3082b533@mail.gmail.com>
Message-ID: <ca471dc20608311504u1d3a804en1a31527d87bacc69@mail.gmail.com>

(Adding back py3k list assuming you just forgot it)

On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> On 8/31/06, Guido van Rossum <guido at python.org> wrote:
>
> > > (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
> > > per character, and doesn't support the supplemental characters above
> > > 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
> >
> > I think we should also support UTF-16, since Java and .NET (and
> > Win32?) appear to be using it effectively; making surrogate handling an
> > application issue doesn't seem *too* big of a burden for many apps.
>
> I think that the reason that UTF-16 seems "not too big of a burden" is
> because people just ignore the UTF-16-ness of the data and hope that nobody
> uses those characters. In effect they trade correctness and
> internationalization for simplicity and performance. It seems like it may
> become a bigger issue as time goes by.

Well, there's a large class of apps that don't do anything for which
surrogates matter, since they just copy strings around and only split
them at specific characters.  E.g. parsing XML would often fall in
this category.

> Plus, it sounds like you're proposing that the encodings of the underlying
> data would leak through to the application. As I understood Fredrik's
> model, the intention was to treat the encoding as an implementation detail.
> If it works well, this could be an important differentiator for Python
> (versus Java) as Unicode already is (versus Ruby).

*Only* for UTF-16, which I consider a necessary evil since we can't
rewrite the Java and .NET standards.

> So my basic feeling is that if we're going to hide UTF-8 from the programmer
> then we might as well go the extra mile and hide UTF-16 as well.

I don't think the issues are the same.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From brett at python.org  Fri Sep  1 00:18:17 2006
From: brett at python.org (Brett Cannon)
Date: Thu, 31 Aug 2006 15:18:17 -0700
Subject: [Python-3000] Exception Expressions
In-Reply-To: <76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
	<bbaeab100608311120v67b23b79p15c2d46fe86cbed9@mail.gmail.com>
	<76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com>
Message-ID: <bbaeab100608311518g29c1b4a5x38834d4f5582e4f1@mail.gmail.com>

On 8/31/06, Calvin Spealman <ironfroggy at gmail.com> wrote:
>
> On 8/31/06, Brett Cannon <brett at python.org> wrote:
> > So this feels like the Perl idiom of using die: ``open(file) or die``
> (or
> > something like that; I have never been a Perl guy so I could be off).
> >
> > > ...
> >
> > The problem I have with this whole proposal is that catching exceptions
> > should be very obvious in the source code.  This proposal does not help
> with
> > that ideal.  So I am -1 on the whole idea.
> >
> > -Brett
>
> "Ouch" on the associated my idea with perl!


=)  The truth hurts.

> Although I agree that it is good to be obvious about exceptions, there
> are some cases when they are simply less than exceptional. For
> example, you can do d.get(key, default) if you know something is a
> dictionary, but for general mappings you can't rely on that, and may
> often use exceptions as a kind of logic control. No, that doesn't sync
> with the purity of exceptions, but sometimes practicality and
> real-world usage trumps theory.


Practicality most definitely beats purity, but I don't see the practicality
of this over what we already have.

> Only allowing a single expression, it shouldn't be able to get ugly.


Famous last words.  Remember a big argument against the 'if' expressions was
about them getting too unwieldy in terms of length and obscuring the fact
that it is a conditional.  I have used 'if' expressions and they have been
hard to keep readable unless you are willing to use parentheses, which makes
them unreadable in a different way.  I would be afraid of this happening
here, but to an
even more important construct that should always be easy to spot in source
code.

-Brett

From walter at livinglogic.de  Fri Sep  1 00:24:35 2006
From: walter at livinglogic.de (Walter Dörwald)
Date: Fri, 01 Sep 2006 00:24:35 +0200
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
Message-ID: <44F761A3.5060009@livinglogic.de>

tomer filiba wrote:

> [...]
> besides, encoding suffers from many issues. suppose you have a
> damaged UTF8 file, which you read char-by-char. when we reach the
> damaged part, you'll never be able to "skip" it, as we'll just keep
> read()ing bytes, hoping to make a character out of it, until we
> reach EOF, i.e.:
> 
> def read_char(self):
>     buf = ""
>     while not self._stream.eof:
>         buf += self._stream.read(1)
>         try:
>             return buf.decode("utf8")
>         except ValueError:
>             pass
> 
> which leads me to the following thought: maybe we should have
> an "enhanced" encoding library for py3k, which would report
> *incomplete* data differently from *invalid* data. today it's just a
> ValueError: suppose decode() would raise IncompleteDataError
> when the given data is not sufficient to be decoded successfully,
> and ValueError when the data is just corrupted.
> 
> that could aid iostack greatly.

We *do* have that functionality in Python 2.5: incremental decoders can
retain incomplete byte sequences on the call to the decode() method
until the next call. Only when final=True is passed in the decode() call
will it treat incomplete and invalid data in the same way: by raising an
exception.

Incomplete input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1")
u''
>>> d.decode("\x88")
u''
>>> d.decode("\xb4")
u'\u1234'

Invalid input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\x80")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
unexpected code byte

Incomplete input with final=True:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1", final=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0:
unexpected end of data

Servus,
   Walter


From greg.ewing at canterbury.ac.nz  Fri Sep  1 04:39:37 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 01 Sep 2006 14:39:37 +1200
Subject: [Python-3000] Exception Expressions
In-Reply-To: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
Message-ID: <44F79D69.6090909@canterbury.ac.nz>

Calvin Spealman wrote:

> Other example use cases:
> 
>     # Fallback on an alternative path
> 
>     # Handle divide-by-zero

or get by with index() instead of find():

    s.index("foo") except -1 if IndexError # :-)

>     open(filename) except open(filename2) if IOError

One problem is that it doesn't seem to chain
all that well. Suppose you had three files to
try opening:

   open(name1) except (open(name2) except open(name3) if IOError) if IOError

Maybe it would be better if the exception type
and alternative expression were swapped over.
Then you could write

   open(name1) except IOError then open(name2) except IOError then open(name3)

Still rather unwieldy though. -0.7j, I think
(the j to acknowledge that this is an imaginary
proposal.:-)
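
In today's Python you can get the chaining with a small helper; a rough
sketch (the name 'attempt' is invented for illustration):

    def attempt(thunk, exc_type, fallback):
        # evaluate thunk(); on the given exception, defer to fallback()
        try:
            return thunk()
        except exc_type:
            return fallback()

    f = attempt(lambda: open(name1), IOError,
                lambda: attempt(lambda: open(name2), IOError,
                                lambda: open(name3)))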

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  1 04:53:21 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 01 Sep 2006 03:53:21 +0100
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
Message-ID: <44F7A0A1.30300@blueyonder.co.uk>

tomer filiba wrote:
> [Talin]
> 
>>Well, as far as readline goes: In order to split the text into lines,
>>you have to decode the text first anyway, which is a layer 3 operation.
>>You can't just read bytes until you get a \n, because the file you are
>>reading might be encoded in UCS2 or something.
> 
> well, the LineBufferedLayer can be "configured" to split on any
> "marker", i.e.: LineBufferedLayer(stream, marker = "\x00\x0a")
> and of course layer 3, which creates layer 2, can set this marker
> to any byte sequence. note it's a *byte* sequence, not chars,
> since this passes down to layer 1 transparently.

That isn't what is required; for big-endian UCS-2 or UTF-16, "\x00\x0a"
should only be recognized as LF if it is at an even byte position.
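
A sketch of the extra check that implies (assuming big-endian UTF-16
bytes in 'data'; the helper name is invented):

    def find_utf16be_lf(data, start=0):
        # "\x00\x0a" only encodes LF when it starts at an even offset
        i = data.find("\x00\x0a", start)
        while i != -1 and i % 2 != 0:
            i = data.find("\x00\x0a", i + 1)
        return i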

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From talin at acm.org  Fri Sep  1 05:13:27 2006
From: talin at acm.org (Talin)
Date: Thu, 31 Aug 2006 20:13:27 -0700
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org>	
	<20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org>	 <ed3iq2$9iv$1@sea.gmane.org>	
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	
	<20060831044354.GH6257@performancedrivers.com>	
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
Message-ID: <44F7A557.2010002@acm.org>

Guido van Rossum wrote:
> On 8/31/06, Talin <talin at acm.org> wrote:
>> One way to handle this efficiently would be to only support the
>> encodings which have a constant character size: ASCII, Latin-1, UCS-2
>> and UTF-32. In other words, if the content of your text is plain ASCII,
>> use an 8-bit-per-character string; If the content is limited to the
>> Unicode BMF (Basic Multilingual Plane) use UCS-2; And if you are using
>> Unicode supplementary characters, use UTF-32.
>>
>> (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
>> per character, and doesn't support the supplemental characters above
>> 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
> 
> I think we should also support UTF-16, since Java and .NET (and
> Win32?) appear to be using it effectively; making surrogate handling an
> application issue doesn't seem *too* big of a burden for many apps.

I see that I misspoke - what I meant was that we would "support" all 
of the available encodings in the sense that we could translate string 
objects to and from those encodings. But the internal representations of 
the string objects themselves would only use those encodings which 
represented a character in a fixed number of bytes.

Moreover, this internal representation should be opaque to users of the 
string - if you want to write out a string as UTF-8 to a file, go for 
it, it shouldn't matter what the internal type of the string is.

(Although Jython and IronPython should probably use whatever string 
representation is defined by the underlying VM.)

>> By avoiding UTF-8, UTF-16 and other variable-character-length formats,
>> you can always ensure that character index operations are done in
>> constant time. Index operations would simply require scaling the index
>> by the character size, rather than having to scan through the string and
>> count characters.
>>
>> The drawback of this method is that you may be forced to transform the
>> entire string into a wider encoding if you add a single character that
>> won't fit into the current encoding.
> 
> A way to handle UTF-8 strings and other variable-length encodings
> would be to maintain a small cache of index positions with the string
> object.

Actually, I realized that this drawback isn't really much of an issue at 
all. For virtually all string operations in Python, it is possible to 
predict ahead of time what string width will be required - thus you can 
allocate the proper width object up front, and not have to "widen" the 
string in mid-operation.

So for example, any string operation which produces a subset of the 
string (such as partition, split, index, slice, etc.) will produce a 
string of the same width as the original string.

Any string operation that involves combining two strings will produce a 
string that is the same type as the wider of the two strings. Thus, if I 
say something like:

    "Hello World" + chr( 0x8000 )

This will produce a 16-bit-wide string, because 'chr( 0x8000 )' can't 
be represented in ASCII and thus produces a 16-bit-wide string on its 
own. Since the first string is plain ASCII (8 bits) and the second is 
16 bits, the result of the concatenation is a 16-bit string.
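
A sketch of that promotion rule (helper names invented, not a concrete
API proposal):

    def char_width(s):
        # narrowest fixed width that can hold every code point in s
        m = max(map(ord, s)) if s else 0
        return 1 if m < 0x100 else (2 if m < 0x10000 else 4)

    def concat_width(a, b):
        # combining two strings yields the wider of the two widths
        return max(char_width(a), char_width(b))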

Similarly, transformations on strings such as upper / lower yield a 
string that is the same width as the original.

The only case I can think of where you might need to "promote" an entire 
string is where you are concatenating to a string buffer, in other words 
you are dealing with a mutable string type. And this case is easily 
handled by simply making the mutable string buffer type always use 
UTF-32, and then narrowing the result when str() is called to the 
narrowest possible representation that can hold the result.

So essentially what I am proposing is this:

-- That the Python 3000 "str" type can consist of 8-bit, 16-bit, or 
32-bit characters, where all characters within a string are the same 
number of bytes.

-- That all 3 types of strings appear identical to Python programmers, 
such that they need not know what type of string they are using.

-- Any operation that returns a string result has the responsibility to 
ensure that the resulting string is wide enough to contain all of the 
characters produced by the operation.

-- That string index operations will always be constant time, with no 
auxiliary data structures required.

-- That all 3 string types can be converted into all of the available 
encodings, including variable-character-width formats, however the 
result is a "bytes" object, not a string.

An additional, but separate part of the proposal is that for str 
objects, the contents of the string are always defined in terms of 
Unicode code points. So if you want to convert to ISO-Latin-1, you can, 
but the result is a bytes object, not a string. The advantage of this is 
that it means that you always know what the value of 'ord()' is for a 
given character. It also means that two strings can always be compared 
for equality without having to decode them first.
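
As a sketch of the intended semantics (illustration only, not working
code in today's Python):

    s = "Hello " + chr(0x8000)   # promoted to a 16-bit-wide str
    b = s.encode("utf-8")        # encoding always yields a bytes object
    assert ord(s[6]) == 0x8000   # ord() is always a Unicode code point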

>> (Another option is to simply make all strings UTF-32 -- which is not
>> that unreasonable, considering that text strings normally make up only a
>> small fraction of a program's memory footprint. I am sure that there are
>> applications that don't conform to this generalization, however. )
> 
> Here you are effectively voting against polymorphic strings. I believe
> Fredrik has good reasons to doubt this assertion.

Yes, that is correct. I'm just throwing it out there as a possibility, 
as it is by far the simplest solution. It's a question of trading memory 
use for simplicity of implementation. Having a single, flat, internal 
representation for all strings would be much less complex than having 
different string types.

-- Talin

From ironfroggy at gmail.com  Fri Sep  1 05:21:08 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Thu, 31 Aug 2006 23:21:08 -0400
Subject: [Python-3000] Exception Expressions
In-Reply-To: <44F79D69.6090909@canterbury.ac.nz>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
	<44F79D69.6090909@canterbury.ac.nz>
Message-ID: <76fd5acf0608312021w1e0cf0f3md00ee5232f3ef9f4@mail.gmail.com>

On 8/31/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> One problem is that it doesn't seem to chain
> all that well. Suppose you had three files to
> try opening:
>
>    open(name1) except (open(name2) except open(name3) if IOError) if IOError
>
> Maybe it would be better if the exception type
> and alternative expression were swapped over.
> Then you could write
>
>    open(name1) except IOError then open(name2) except IOError then open(name3)
>
> Still rather unwieldy though. -0.7j, I think
> (the j to acknowledge that this is an imaginary
> proposal.:-)
>
> --
> Greg Ewing, Computer Science Dept, +--------------------------------------+
> University of Canterbury,          | Carpe post meridiem!                 |
> Christchurch, New Zealand          | (I'm not a morning person.)          |
> greg.ewing at canterbury.ac.nz        +--------------------------------------+

I considered the expr1 except exc_type then expr2 syntax, but it adds
a keyword without much need to do so. But, I suppose that isn't a
problem now that conditional expressions are in and then is already a
keyword.

I hereby upgrade this from imaginary proposal to real proposal status!

From paul at prescod.net  Fri Sep  1 05:32:32 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 31 Aug 2006 20:32:32 -0700
Subject: [Python-3000] UTF-16
Message-ID: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>

On 8/31/06, Guido van Rossum <guido at python.org> wrote:
>
> (Adding back py3k list assuming you just forgot it)


Yes, thanks. Gmail's UI really optimizes the "Reply To" operation over "Reply
To All".

> Plus, it sounds like you're proposing that the encodings of the underlying
> > data would leak through to the application. As I understood Fredrick's
> > model, the intention was to treat the encoding as an implementation
> detail.
> > If it works well, this could be an important differentiator for Python
> > (versus Java) as Unicode already is (versus Ruby).
>
> *Only* for UTF-16, which I consider a necessary evil since we can't
> rewrite the Java and .NET standards.


I see what you're getting at.

I'd say that decoding UTF-16 data in CPython and PyPy should (by default)
create true Unicode characters. Jython and IronPython could create
surrogates and characters when necessary. When you run the program in
CPython you'll get better behaviour than in Jython/IronPython. Maybe there
could be a way to make CPython run like Jython and IronPython if you wanted
100% absolute compatibility between the environments. I think that we agree
that it would be unfortunate if CPython copied Java and .NET to its own
detriment. It's also not inconceivable that Java and .NET might evolve a
4-byte mode in the long term.

 Paul Prescod

From guido at python.org  Fri Sep  1 05:46:55 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 31 Aug 2006 20:46:55 -0700
Subject: [Python-3000] UTF-16
In-Reply-To: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
Message-ID: <ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>

On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> On 8/31/06, Guido van Rossum <guido at python.org> wrote:
> > (Adding back py3k list assuming you just forgot it)
>
> Yes, thanks. Gmail's UI really optimizes the "Reply To" operation of "Reply
> To All."
>
> > > Plus, it sounds like you're proposing that the encodings of the
> underlying
> > > data would leak through to the application. As I understood Fredrick's
> > > model, the intention was to treat the encoding as an implementation
> detail.
> > > If it works well, this could be an important differentiator for Python
> > > (versus Java) as Unicode already is (versus Ruby).
> >
> > *Only* for UTF-16, which I consider a necessary evil since we can't
> > rewrite the Java and .NET standards.
>
> I see what you're getting at.
>
> I'd say that decoding UTF-16 data in CPython and PyPy should (by default)
> create true Unicode characters. Jython and IronPython could create
> surrogates and characters when necessary. When you run the program in
> CPython you'll get better behaviour than in Jython/IronPython. Maybe there
> could be a way to make CPython run like Jython and IronPython if you wanted
> 100% absolute compatibility between the environments. I think that we agree
> that it would be unfortunate if CPython copied Java and .NET to its own
> detriment. It's also not inconceivable that Java and .NET might evolve a
> 4-byte mode in the long term.

I think it would be best to do this as a CPython configuration option
just like it's done today. You can choose 4-byte or 2-byte Unicode
(essentially UCS-4 or UTF-16) in order to be compatible with other
packages on the platform. Yes, 4-byte gives better Unicode support.
But 2-bytes may be more compatible with other stuff on the platform.
Too bad .NET and Java don't have this option. :-)
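
(The build-time choice is visible today through sys.maxunicode:)

    >>> import sys
    >>> sys.maxunicode  # 65535 on a 2-byte build, 1114111 on a 4-byte build
    1114111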

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Fri Sep  1 06:13:29 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 31 Aug 2006 21:13:29 -0700
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F7A557.2010002@acm.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu> <ed1q7r$v4s$2@sea.gmane.org>
	<20060829102307.1B0F.JCARLSON@uci.edu> <ed1uds$iog$1@sea.gmane.org>
	<ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org>
Message-ID: <ca471dc20608312113i74c7cc98t79255b62aeb22816@mail.gmail.com>

On 8/31/06, Talin <talin at acm.org> wrote:
> > Here you are effectively voting against polymorphic strings. I believe
> > Fredrik has good reasons to doubt this assertion.
>
> Yes, that is correct. I'm just throwing it out there as a possibility,
> as it is by far the simplest solution. It's a question of trading memory
> use for simplicity of implementation. Having a single, flat, internal
> representation for all strings would be much less complex than having
> different string types.

I think you don't realize the significance of the immediate
enthusiastic +1 votes from several OSX developers.

These people are quite familiar with ObjectiveC. ObjectiveC has true
polymorphic strings, and the internal representation *can* be UTF-8.
These developers love that.

For most practical purposes the internal representation is abstracted
away from the application; *however* it is possible to go below this
level, especially for I/O (I believe). The net effect, if I understand
correctly, is that you can save yourself a lot of copying if you are
mostly just moving whole strings around and doing relatively little
slicing and dicing -- it avoids converting from UTF-8 (which is by far
the most common external representation) to UCS-2 or UCS-4 and back
again.

I don't think these advantages are maintained by your "narrowest
constant-width encoding that fits all the characters" proposal.

I'm not saying that we should definitely adopt this -- it may well be
that the ObjectiveC string API is significantly different from
Python's (e.g. it could have less emphasis on character indices and
character counts) so that the benefits would be lost in translation --
but I'm not sure that the added complexity of your proposal is
warranted if it still requires encoding and decoding on most I/O
operations.

BTW, in some sense Python 2.x *has* polymorphic strings -- str and
unicode have the same API (99% anyway) but different implementations,
and there's even a common abstract base class (basestring). But this
clearly isn't what the ObjectiveC folks want to see!

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From paul at prescod.net  Fri Sep  1 06:24:19 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 31 Aug 2006 21:24:19 -0700
Subject: [Python-3000] UTF-16
In-Reply-To: <ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
Message-ID: <1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>

On 8/31/06, Guido van Rossum <guido at python.org> wrote:
>
> On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> > On 8/31/06, Guido van Rossum <guido at python.org> wrote:
> > > (Adding back py3k list assuming you just forgot it)
> >
> > Yes, thanks. Gmail's UI really optimizes the "Reply To" operation of
> "Reply
> > To All."
> >
> > > > Plus, it sounds like you're proposing that the encodings of the
> > underlying
> > > > data would leak through to the application. As I understood
> Fredrick's
> > > > model, the intention was to treat the encoding as an implementation
> > detail.
> > > > If it works well, this could be an important differentiator for
> Python
> > > > (versus Java) as Unicode already is (versus Ruby).
> > >
> > > *Only* for UTF-16, which I consider a necessary evil since we can't
> > > rewrite the Java and .NET standards.
> >
> > I see what you're getting at.
> >
> > I'd say that decoding UTF-16 data in CPython and PyPy should (by
> default)
> > create true Unicode characters. Jython and IronPython could create
> > surrogates and characters when necessary. When you run the program in
> > CPython you'll get better behaviour than in Jython/IronPython. Maybe
> there
> > could be a way to make CPython run like Jython and IronPython if you
> wanted
> > 100% absolute compatibility between the environments. I think that we
> agree
> > that it would be unfortunate if CPython copied Java and .NET to its own
> > detriment. It's also not inconceivable that Java and .NET might evolve a
> > 4-byte mode in the long term.
>
> I think it would be best to do this as a CPython configuration option
> just like it's done today. You can choose 4-byte or 2-byte Unicode
> (essentially UCS-4 or UTF-16) in order to be compatible with other
> packages on the platform. Yes, 4-byte gives better Unicode support.
> But 2-bytes may be more compatible with other stuff on the platform.
> Too bad .NET and Java don't have this option. :-)


The current model is a hack (and I wrote the PEP!).

If you decide to go to all of the effort and expense of polymorphic strings,
I cannot understand why a user should be forced to choose between 16 and 32
bit strings AT BUILD TIME. PEP 261 says the reason for the build-time
solution is:

"[The alternate solutions] ... would require a much more
complex implementation than the accepted solution. ...
Guido is not willing to undertake the implementation right
now. ...This PEP represents least-effort solution."

Fair enough. A world of finite resources. But I would be very annoyed if my
ISP had installed a Python version that could magically handle 8-bit and
16-bit strings efficiently but I had to ask them to install a special
version to handle 32 bit strings at all. Obviously build-time configuration
is the least flexible of all available options.

 Paul Prescod

From fredrik at pythonware.com  Fri Sep  1 07:57:06 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 07:57:06 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F7A557.2010002@acm.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu>	<ed1q7r$v4s$2@sea.gmane.org>		<20060829102307.1B0F.JCARLSON@uci.edu>	<ed1uds$iog$1@sea.gmane.org>	
	<ed3iq2$9iv$1@sea.gmane.org>		<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>		<20060831044354.GH6257@performancedrivers.com>		<44F72E75.2050204@acm.org>	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org>
Message-ID: <ed8i3i$at0$1@sea.gmane.org>

Talin wrote:

> So essentially what I am proposing is this:

"look at me! look at me!"

</F>


From fredrik at pythonware.com  Fri Sep  1 08:05:18 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:05:18 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ca471dc20608312113i74c7cc98t79255b62aeb22816@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org>	<ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	<20060831044354.GH6257@performancedrivers.com>	<44F72E75.2050204@acm.org>	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>	<44F7A557.2010002@acm.org>
	<ca471dc20608312113i74c7cc98t79255b62aeb22816@mail.gmail.com>
Message-ID: <ed8iiu$cba$1@sea.gmane.org>

Guido van Rossum wrote:

> BTW, in some sense Python 2.x *has* polymorphic strings -- str and
> unicde have the same API (99% anyway) but different implementations,
> and there's even a common abstract base class (basestring). But this
> clearly isn't what the ObjectiveC folks want to see!

on the Python level, absolutely.  the "use 8-bit strings for ASCII, 
Unicode strings for everything else" approach works perfectly well.
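
(for example, in 2.X:)

    >>> "hello" == u"hello"    # 8-bit ascii compares equal to unicode
    True
    >>> "hello, " + u"world"   # and coerces on concatenation
    u'hello, world'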

I'm still a bit worried about C API complexities, but as I mentioned, in 
today's Python, only 8-bit strings are really simple.  and there are 
standard ways to deal with backing stores; if that's good enough for 
apple hackers, it should be good enough for pythoneers.

most of this can be prototyped and benchmarked under 2.X, and parts of 
it can be directly useful also for 2.X developers; I think I'll start 
tinkering.

> These people are quite familiar with ObjectiveC. ObjectiveC has true
> polymorphic strings, and the internal representation *can* be UTF-8.
> These developers love that.

you are aware that Objective C does provide B-tree strings under the 
hood too, I hope ;-)

</F>


From fredrik at pythonware.com  Fri Sep  1 08:22:54 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:22:54 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ed7iii$psn$1@sea.gmane.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu>	<ed1q7r$v4s$2@sea.gmane.org><20060829102307.1B0F.JCARLSON@uci.edu>	<ed1uds$iog$1@sea.gmane.org><ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<ed7iii$psn$1@sea.gmane.org>
Message-ID: <ed8jju$e52$1@sea.gmane.org>

tjreedy wrote:

> These two similar features would be enough, to me, to make Py3 more than 
> just 2.x with cruft removed.

well, it's really only C API issues that keep us from implementing this 
in 2.x... (too much code uses PyString_Check and/or PyUnicode_Check and 
then happily digs into the associated buffers).

</F>


From fredrik at pythonware.com  Fri Sep  1 08:46:23 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:46:23 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org>	<ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	<20060831044354.GH6257@performancedrivers.com>	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
Message-ID: <ed8kvv$j02$1@sea.gmane.org>

Guido van Rossum wrote:

> A way to handle UTF-8 strings and other variable-length encodings
> would be to maintain a small cache of index positions with the string
> object.

I think just delaying decoding would take us most of the way.  the big 
advantage of storage polymorphism is that you can avoid decoding and 
encoding (and having to pay for the cycles and bytes needed for that) if 
you don't have to.  the XML case you mentioned is a typical example; 
just compare the behaviour of a library that does some extra work to 
keep things small under the hood with more straightforward implementations:

     http://effbot.org/zone/celementtree.htm#benchmarks

(cElementTree uses the "8-bit ascii mixes well with unicode" approach)
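
a minimal sketch of what delayed decoding could look like (names
invented, nothing like a real proposal):

    class lazytext:
        # hold raw bytes; decode only when character data is needed
        def __init__(self, raw, encoding="utf-8"):
            self.raw = raw
            self.encoding = encoding
            self._text = None
        def text(self):
            if self._text is None:
                self._text = self.raw.decode(self.encoding)
            return self._text
        def startswith(self, prefix):
            # a complete utf-8 prefix can be tested at the byte level
            return self.raw.startswith(prefix.encode(self.encoding))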

there are plenty of optimizations you can do when accessing the 
beginning and end of a string (startswith, endswith, comparisons, 
slicing, etc), but I think we can deal with that when we get there.
I think the NFS sprint showed that you get better results by working 
with real use cases, rather than spending that theorizing.  it also 
showed that the bottlenecks aren't always where you think they are.

</F>


From fredrik at pythonware.com  Fri Sep  1 08:49:38 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:49:38 +0200
Subject: [Python-3000] UTF-16
In-Reply-To: <ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
Message-ID: <ed8l62$j02$2@sea.gmane.org>

Guido van Rossum wrote:

> I think it would be best to do this as a CPython configuration option
> just like it's done today. You can choose 4-byte or 2-byte Unicode
> (essentially UCS-4 or UTF-16) in order to be compatible with other
> packages on the platform. Yes, 4-byte gives better Unicode support.
> But 2-bytes may be more compatible with other stuff on the platform.
> Too bad .NET and Java don't have this option. :-)

the UCS2/UCS4 linking problem is a minor pain in the ass, though. 
maybe this is best done via a run-time setting?

</F>


From fredrik at pythonware.com  Fri Sep  1 09:56:52 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 09:56:52 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
References: <20060827184941.1AE8.JCARLSON@uci.edu><ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu>	<ed1uds$iog$1@sea.gmane.org><ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com><20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
Message-ID: <ed8p44$v5a$1@sea.gmane.org>

Talin wrote:

> (Another option is to simply make all strings UTF-32 -- which is not
> that unreasonable, considering that text strings normally make up only a
> small fraction of a program's memory footprint. I am sure that there are
> applications that don't conform to this generalization, however. )

performance is more than just memory use, though.  for some string operations,
memory bandwidth is the bottleneck, not memory use.  it simply takes more time
to process four times as much data.

(running the stringbench.py script in the sandbox on a recent 2.5 should give you
some idea of this)

</F> 




From fredrik at pythonware.com  Fri Sep  1 10:01:45 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 10:01:45 +0200
Subject: [Python-3000] locale-aware strings ?
Message-ID: <ed8pd9$ch$1@sea.gmane.org>

today's Python supports "locale aware" 8-bit strings; e.g.

    >>> import locale
    >>> "åäö".isalpha()
    False
    >>> locale.setlocale(locale.LC_ALL, "sv_SE")
    'sv_SE'
    >>> "åäö".isalpha()
    True

to what extent should this be supported by Python 3000?

</F> 




From tomerfiliba at gmail.com  Fri Sep  1 10:05:10 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Fri, 1 Sep 2006 10:05:10 +0200
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <44F761A3.5060009@livinglogic.de>
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
	<44F761A3.5060009@livinglogic.de>
Message-ID: <1d85506f0609010105n69e8cdcbw989f861e05ca7a24@mail.gmail.com>

very well, i'll use it. thanks.

On 9/1/06, Walter Dörwald <walter at livinglogic.de> wrote:
> tomer filiba wrote:
>
> > [...]
> > besides, encoding suffers from many issues. suppose you have a
> > damaged UTF8 file, which you read char-by-char. when we reach the
> > damaged part, you'll never be able to "skip" it, as we'll just keep
> > read()ing bytes, hoping to make a character out of it, until we
> > reach EOF, i.e.:
> >
> > def read_char(self):
> >     buf = ""
> >     while not self._stream.eof:
> >         buf += self._stream.read(1)
> >         try:
> >             return buf.decode("utf8")
> >         except ValueError:
> >             pass
> >
> > which leads me to the following thought: maybe we should have
> > an "enhanced" encoding library for py3k, which would report
> > *incomplete* data differently from *invalid* data. today it's just a
> > ValueError: suppose decode() would raise IncompleteDataError
> > when the given data is not sufficient to be decoded successfully,
> > and ValueError when the data is just corrupted.
> >
> > that could aid iostack greatly.
>
> We *do* have that functionality in Python 2.5: incremental decoders can
> retain incomplete byte sequences on the call to the decode() method
> until the next call. Only when final=True is passed in the decode() call
> will it treat incomplete and invalid data in the same way: by raising an
> exception.
>
> Incomplete input:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\xe1")
> u''
> >>> d.decode("\x88")
> u''
> >>> d.decode("\xb4")
> u'\u1234'
>
> Invalid input:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\x80")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
> in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
> unexpected code byte
>
> Incomplete input with final=True:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\xe1", final=True)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
> in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0:
> unexpected end of data
>
> Servus,
>    Walter
>
>

From fredrik at pythonware.com  Fri Sep  1 13:14:13 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 13:14:13 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
References: <20060827184941.1AE8.JCARLSON@uci.edu><ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu><ed1uds$iog$1@sea.gmane.org>	<ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	<20060831044354.GH6257@performancedrivers.com>	<44F72E75.2050204@acm.org><ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<ed8kvv$j02$1@sea.gmane.org>
Message-ID: <ed94m6$7na$1@sea.gmane.org>

> spending that theorizing.

make that "spending that time theorizing about what you could, in theory, do."

</F> 




From qrczak at knm.org.pl  Fri Sep  1 13:34:42 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 01 Sep 2006 13:34:42 +0200
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
	(tomer filiba's message of "Thu, 31 Aug 2006 23:43:44 +0200")
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
Message-ID: <87u03r6crx.fsf@qrnik.zagroda>

"tomer filiba" <tomerfiliba at gmail.com> writes:

>> Encoding conversion and newline conversion should be performed a
>> block at a time, below buffering, so not only I/O syscalls, but
>> also invocations of the recoding machinery are amortized by
>> buffering.
>
> you have a good point, which i also stumbled upon when implementing
> the TextInterface. but how would you suggest to solve it?

I've designed and implemented this for my language, but I'm not sure
that you will like it because it's quite different from the Python
tradition.

The interface of block reading appends data to the end of the supplied
buffer, up to the specified size (or infinity), and also it tells
whether it reached end of data. The interface of block writing removes
data from the beginning of the supplied buffer, up to the supplied
size (or the whole buffer), and is told how to flush, which includes
information about whether this is the end of data. Both functions are
allowed to read/write less than requested.

The recoding engine moves data from the beginning of an input buffer
to the end of an output buffer. The block recoding function has
similar size parameters as above, and a flushing parameter. It returns
True on output overflow, i.e. when it stopped because it needs more
room in the output rather than because it needs more input. It leaves
unconverted data at the end of the input buffer if data looks incomplete,
unless it is told that this is the last block - in this case it fails.

Both decoding input streams and encoding output streams have a
persistent buffer in the format corresponding to their low end,
i.e. a byte buffer when this is the boundary between bytes and
characters.

This design allows everything to be plugged together, including the cases
where recoding changes sizes significantly (compression/decompression).

It also allows the reading/writing process to be interrupted without
breaking the consistency of the state of buffers, as long as each
primitive reading/writing operation is atomic, i.e. anything it
removes from the input buffer is converted and put in the output
buffer. Data not yet processed by the remaining layers remains in
their respective buffers.

For example reading a block from a decoding stream:
1. If there was no overflow previously, read more data from the
   underlying stream to the internal buffer, up to the supplied
   maximum size.
2. Decode data from the internal buffer to the supplied output buffer,
   up to the supplied maximum size. Tell the recoding engine that this
   is the last piece if there was no overflow previously and reading
   from the underlying stream reached the end.
3. Return True (i.e. end of input) if there was no overflow now and
   reading from the underlying stream reached the end.

Writing a block to an encoding stream is simpler:
1. Encode data from the supplied input buffer to the internal buffer.
2. Write data from the internal buffer to the output stream.

Buffered streams are typically put on the top of the stack. They
support reading a line at a time, unlimited lookahead and unlimited
unreading, and writing which guarantees that it won't leave anything
in the buffer it is writing from.

Newlines are converted by a separate layer. The buffered stream
assumes "\n" endings.
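
A rough sketch of the read path described above (all names are
invented):

    def read_block(self, out_buf, max_size):
        # 1. refill the internal byte buffer, unless the last decode
        #    stopped for lack of output room rather than input
        if not self.overflow:
            self.buf += self.underlying.read(max_size)
        final = not self.overflow and self.underlying.eof
        # 2. decode into the caller's buffer; flag this as the last
        #    piece only when the underlying stream is exhausted
        self.overflow = self.decoder.decode(self.buf, out_buf,
                                            max_size, final)
        # 3. end of input only when nothing is pending anywhere
        return final and not self.overflow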

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From fredrik at pythonware.com  Fri Sep  1 13:41:00 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 13:41:00 +0200
Subject: [Python-3000] string C API
Message-ID: <ed968c$d03$1@sea.gmane.org>

just noticed that PEP 3100 says that PyString_AsEncodedString and
PyString_AsDecodedString are to be removed, but it doesn't mention
any other PyString (or PyUnicode) functions.

how large are the changes we can make here, really?

(I'm not going to sketch out a concrete proposal here; I'm more interested
in general guidelines.  the details are best fleshed out in code)

</F> 




From barry at python.org  Fri Sep  1 14:14:46 2006
From: barry at python.org (Barry Warsaw)
Date: Fri, 1 Sep 2006 08:14:46 -0400
Subject: [Python-3000] UTF-16
In-Reply-To: <ed8l62$j02$2@sea.gmane.org>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
	<ed8l62$j02$2@sea.gmane.org>
Message-ID: <188FAEC3-875D-4AA8-8C66-A1DF6F8A96C6@python.org>


On Sep 1, 2006, at 2:49 AM, Fredrik Lundh wrote:

> Guido van Rossum wrote:
>
>> I think it would be best to do this as a CPython configuration option
>> just like it's done today. You can choose 4-byte or 2-byte Unicode
>> (essentially UCS-4 or UTF-16) in order to be compatible with other
>> packages on the platform. Yes, 4-byte gives better Unicode support.
>> But 2-bytes may be more compatible with other stuff on the platform.
>> Too bad .NET and Java don't have this option. :-)
>
> the UCS2/UCS4 linking problems is a minor pain in the ass, though.
> maybe this is best done via a run-time setting?

Yes, the linking problem does crop up from time to time.  Recent  
example: Gentoo Linux is heavily dependent on Python and I recently  
emerged in several packages.  I don't remember the exact details, but  
there was a conflict between UCS2 and UCS4 where two different  
upstream packages required two different linkages, and the wrapping  
Python modules were thus incompatible.  I basically had to decide  
which one I cared about most and delete the other to resolve the  
conflict.  The problem was confusing the hell out of several  
Gentooers until we tracked down all the resources and figured out the  
(suboptimal) fix.

-Barry


From fredrik at pythonware.com  Fri Sep  1 14:23:10 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 14:23:10 +0200
Subject: [Python-3000] UTF-16
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com><ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com><ed8l62$j02$2@sea.gmane.org>
	<188FAEC3-875D-4AA8-8C66-A1DF6F8A96C6@python.org>
Message-ID: <ed98nf$m3b$1@sea.gmane.org>

Barry Warsaw wrote:

> I recently emerged in several packages.

good thing dictionary.com includes wikipedia articles, or I'd never have
figured out whether that was a typo or a rather odd spiritual phenomenon.

</F> 




From paul at prescod.net  Fri Sep  1 16:11:35 2006
From: paul at prescod.net (Paul Prescod)
Date: Fri, 1 Sep 2006 07:11:35 -0700
Subject: [Python-3000] Character Set Independence
Message-ID: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>

I thought that others might find this reference interesting. It is Matz (the
inventor of Ruby) talking about why he thinks that Unicode is good for what
it does but not sufficient in general, along with some hints of what he
plans for multinationalization in Ruby. The translation is rough and is
lifted from this email:

http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html

I think that the gist of it is that Unicode will be "just one character set"
supported by Ruby. This idea has been kicked around for Python before but
you quickly run into questions about how you compare character strings from
multiple character sets, to say nothing of the complexity of a
character-encoding- and character-set-agnostic regular expression engine.

I guess Matz is the right guy to experiment with that stuff. Maybe it could
be copied in Python 4K.

What are your complaints towards Unicode?
* it's thoroughly used, isn't it.
* resentment towards Han unification?
* inferiority complex of Japanese people?
--
What are your complaints towards Unicode?
* no, no I do not have any complaints about Unicode
* in the domains where Unicode is adequate
--
Then, why CSI?

In most applications, UCS is enough thanks to Unicode.
However, there are also applications for which this is not the case.
--
Fields for which Unicode is not enough
Big character sets
* Konjaku-Mojikyo (Japanese encoding which includes many more than Unicode)
* TRON code
* GB18030
--
Fields for which Unicode is not fitted
Legacy encodings
* conversion to UCS is useless
* big conversion tables
* round-trip problem
--
If a language chooses the UCS system
* you cannot write non-UCS applications
* you can't handle text that can't be expressed with Unicode
--
If a language chooses the CSI system
* CSI is a superset of UCS
* Unicode just has to be handled in CSI
--
... is what we can say but
* CSI is difficult
* can it really be implemented?
--
That's where comes out Japan's traditional arts

Adaptation for the Japanese language of applications
* Modification of English language applications to be able to process Japanese
--
Adaptation for the Japanese language of applications

* What engineers of long ago experienced for sure
  - Emacs (NEmacs)
  - Perl (JPerl)
  - Bash
--
Accumulation of know-how

In Japan, the know-how of adaptation for the Japanese language
(multi-byte text processing)
has been accumulated.
--
Accumulation of know-how

in the first place, just for local use,
text using 3 encodings circulate
(4 if including UTF-8)
--
Based on this know-how
* multibyte text encodings
* switching between encodings at the string level
* processing them at practical speed
is finished
--
Available encodings

euc_tw   euc_jp   iso8859_*  utf-8     utf-32le
ascii    euc_kr   koi8       utf-16le  utf-32be
big5     gb2312   sjis       utf-16be

...and many others
If it's a stateless encodings, in principle it can be available.
--
It means
For applications using only one encoding, code conversion is not needed
--
Moreover
Applications wanting to handle multiple encodings can choose an
internal encoding (generally Unicode) that includes all others
--
If you want to
* you can also handle multiple encodings without conversion, letting
characters as they are
* but this is difficult so I do not recommend it
--
However,
only the basic part is done,
it's far from being ready for practical use
* code conversion
* guessing encoding
* etc.
--
For the time being, today
I want to tell everyone:
* UCS is practical
* but not all-purpose
* CSI is not impossible
--
The reason I'm saying that
They may add CSI in Perl6 as they had added
* Methods called by "."
* Continuations
from Ruby.
Basically, they hate losing.
--
Thank you

From jimjjewett at gmail.com  Fri Sep  1 16:24:42 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 1 Sep 2006 10:24:42 -0400
Subject: [Python-3000] Exception Expressions
In-Reply-To: <bbaeab100608311518g29c1b4a5x38834d4f5582e4f1@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
	<bbaeab100608311120v67b23b79p15c2d46fe86cbed9@mail.gmail.com>
	<76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com>
	<bbaeab100608311518g29c1b4a5x38834d4f5582e4f1@mail.gmail.com>
Message-ID: <fb6fbf560609010724n11e3191i5d4a38a81c5a54c3@mail.gmail.com>

On 8/31/06, Brett Cannon <brett at python.org> wrote:
> On 8/31/06, Calvin Spealman <ironfroggy at gmail.com> wrote:
> > On 8/31/06, Brett Cannon <brett at python.org> wrote:
> > > So this feels like the Perl idiom of using die: ``open(file) or die``

> > "Ouch" on the associated my idea with perl!

> =)  The truth hurts.

Isn't this almost the opposite of "or die"?  Unless I'm having a very
bad day, the die idiom is more like a SystemExit, but this proposal is
a way to recover from expected Exceptions.

    func(args) || die(msg)

means

    >>> if not func(args):
    ...     raise SystemExit(msg)

This proposal, with the "a non-dict mapping might not have get" use case:

    >>> ((mymap[k] except KeyError then default) for k in source)

means

    >>> def __temp():
    ...     for k in source:
    ...         try:
    ...             v = mymap[k]
    ...         except KeyError:
    ...             v = default
    ...         yield v
    >>> __temp()

-jJ

From guido at python.org  Fri Sep  1 16:59:47 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 07:59:47 -0700
Subject: [Python-3000] Character Set Independence
In-Reply-To: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
References: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
Message-ID: <ca471dc20609010759r48e40cb4t50386888eadd62ca@mail.gmail.com>

I think in a sense Python *will* continue to support multiple
character sets -- as byte streams. IMO that's the only reasonable
approach. Unlike Matz, apparently, I've never heard complaints that
Python 2 doesn't have enough support for character sets larger than
Unicode, and that is effectively what it supports: encoded strings and
Unicode strings.

--Guido

On 9/1/06, Paul Prescod <paul at prescod.net> wrote:
> I thought that others might find this reference interesting. It is Matz (the
> inventor of Ruby) talking about why he thinks that Unicode is good for what
> it does but not sufficient in general, along with some hints of what he
> plans for multinationalization in Ruby. The translation is rough and is
> lifted from this email:
>
> http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html
>
> I think that the gist of it is that Unicode will be "just one character set"
> supported by Ruby. This idea has been kicked around for Python before but
> you quickly run into questions about how you compare character strings from
> multiple character sets, to say nothing of the complexity of an character
> encoding and character set agnostic regular expression engine.
>
> I guess Matz is the right guy to experiment with that stuff. Maybe it could
> be copied in Python 4K.
> What are your complaints towards Unicode?
> * it's thoroughly used, isn't it.
> * resentment towards Han unification?
>
> * inferiority complex of Japanese people?
> --
> What are your complaints towards Unicode?
> * no, no I do not have any complaints about Unicode
> * in the domains where Unicode is adequate
> --
> Then, why CSI?
>
>
> In most applications, UCS is enough thanks to Unicode.
> However, there are also applications for which this is not the case.
> --
> Fields for which Unicode is not enough
> Big character sets
> * Konjaku-Mojikyo (Japanese encoding which includes many more than Unicode)
>
> * TRON code
> * GB18030
> --
> Fields for which Unicode is not fitted
> Legacy encodings
> * conversion to UCS is useless
> * big conversion tables
> * round-trip problem
> --
> If a language chooses the UCS system
>
> * you cannot write non-UCS applications
> * you can't handle text that can't be expressed with Unicode
> --
> If a language chooses the CSI system
> * CSI is a superset of UCS
> * Unicode just has to be handled in CSI
>
> --
> ... is what we can say but
> * CSI is difficult
> * can it really be implemented?
> --
> That's where comes out Japan's traditional arts
>
> Adaptation for the Japanese language of applications
> * Modification of English language applications to be able to process
> Japanese
>
> --
> Adaptation for the Japanese language of applications
>
> * What engineers of long ago experienced for sure
>  - Emacs (NEmacs)
>  - Perl (JPerl)
>  - Bash
> --
> Accumulation of know-how
>
> In Japan, the know-how of adaptation for the Japanese language
>
> (multi-byte text processing)
> has been accumulated.
> --
> Accumulation of know-how
>
> in the first place, just for local use,
> text using 3 encodings circulate
> (4 if including UTF-8)
> --
> Based on this know-how
>
> * multibyte text encodings
> * switching between encodings at the string level
> * processing them at practical speed
> is finished
> --
> Available encodings
>
> euc_tw euc_jp iso8859_* utf-8 utf-32le
>
> ascii euc_kr koi8 utf-16le utf-32be
> big5 gb2312 sjis utf-16be
>
> ...and many others
> If it's a stateless encoding, in principle it can be made available.
> --
> It means
> For applications using only one encoding, code conversion is not needed
>
> --
> Moreover
> Applications wanting to handle multiple encodings can choose an
> internal encoding (generally Unicode) that includes all others
> --
> If you want to
> * you can also handle multiple encodings without conversion, letting
>
> characters as they are
> * but this is difficult so I do not recommend it
> --
> However,
> only the basic part is done,
> it's far from being ready for practical use
> * code conversion
> * guessing encoding
>
> * etc.
> --
> For the time being, today
> I want to tell everyone:
> * UCS is practical
> * but not all-purpose
> * CSI is not impossible
> --
> The reason I'm saying that
> They may add CSI in Perl6 as they had added
>
> * Methods called by "."
> * Continuations
> from Ruby.
> Basically, they hate losing.
> --
> Thank you
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mcherm at mcherm.com  Fri Sep  1 17:03:59 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Fri, 01 Sep 2006 08:03:59 -0700
Subject: [Python-3000] Exception Expressions
Message-ID: <20060901080359.zsxl30h7bpwswc40@login.werra.lunarpages.com>

Calvin Spealman writes:
> I thought I felt in the mood for some abuse today, so I'm proposing
> something sure to give me plenty of crap, but maybe someone will enjoy
> the idea, anyway.
        [...]
>     expr1 except expr2 if exc_type

This is wonderful!

In combination with conditional expressions, list comprehensions, and
lambda, I think this would make it possible to write full-powered Python
programs on a single line. Actually, putting it on a single line in your
text editor would just make things unreadable, but if you wrap
parentheses around it, then the entire program can be a single
expression, something like this:

     for entry in entryList:
         if entry.status() == 'open':
             try:
                 entry.display()
             except DisplayError:
                 entry.setStatus('error')
                 entry.hide()
         else:
             entry.hide()

would become this:

     (   (   (   entry.display()
                 except (
                     entry.setStatus('error'),
                     entry.hide()
                 )
                 if DisplayError
             )
             if entry.status() == 'open'
             else entry.hide()
         )
         for entry in entryList
     )

(Or you *could* choose to compress it as follows:)

     (((entry.display()except(entry.setStatus('error'
     ),entry.hide())if DisplayError)if entry.status()
     =='open' else entry.hide())for entry in entryList)

Now, I wouldn't try to claim that this single-expression
version is *more* readable than the original, but it
has a significant advantage: it makes the language no
longer dependent on significant whitespace for demarking
lines and blocks! There are places where significant
whitespace is a problem, most notably when trying to
embed Python code within other documents.

Just imagine using this new form to embed Python
within HTML to create a new and more powerful form of
dynamic page generation:

    <div class="entrydiv">
       <p class="item-title"><*entry.title()*></p>
       <ul>
          <* "<li>Valid</li>" if entry.isvalid() else "" *>
          <* "<li>Active</li>" if entry.active else "<li>Inactive</li>" *>
       </ul>
       <p class="item-content">
          <* entry.showContent() except "No Data Available" if Exception *>
       </p>
    </div>

Isn't it amazing?

.
.
.

Okay... *everything* above comes with a HUGE wink. It's
a joke. Calvin's idea is clever, and readable once you get
used to conditional expressions, but I'm still a solid -1
on the proposal. But thanks for giving me something fun to
think about.

-- Michael Chermside


From nnorwitz at gmail.com  Fri Sep  1 18:58:49 2006
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Fri, 1 Sep 2006 09:58:49 -0700
Subject: [Python-3000] string C API
In-Reply-To: <ed968c$d03$1@sea.gmane.org>
References: <ed968c$d03$1@sea.gmane.org>
Message-ID: <ee2a432c0609010958x167a61a0w8cb75522d885717c@mail.gmail.com>

On 9/1/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
> just noticed that PEP 3100 says that PyString_AsEncodedString and
> PyString_AsDecodedString are to be removed, but it doesn't mention
> any other PyString (or PyUnicode) functions.
>
> how large changes can we make here, really ?

I don't know if it was the case here or not, but I added a bunch of
APIs to the PEP that were labeled as deprecated or only for backwards
compatibility.  The sources were the doc, header files, and source
files.  (There's no single place to look.)

n

From guido at python.org  Fri Sep  1 19:17:39 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 10:17:39 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ed8pd9$ch$1@sea.gmane.org>
References: <ed8pd9$ch$1@sea.gmane.org>
Message-ID: <ca471dc20609011017k17837255qa8774a335b8d4ed5@mail.gmail.com>

I say not at all.

On 9/1/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
> today's Python supports "locale aware" 8-bit strings; e.g.
>
>     >>> import locale
>     >>> "åäö".isalpha()
>     False
>     >>> locale.setlocale(locale.LC_ALL, "sv_SE")
>     'sv_SE'
>     >>> "åäö".isalpha()
>     True
>
> to what extent should this be supported by Python 3000 ?
>
> </F>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Fri Sep  1 20:34:09 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Sep 2006 20:34:09 +0200
Subject: [Python-3000] Ripping out exec
Message-ID: <ed9uf1$76g$1@sea.gmane.org>

Hi,

in the process of ripping out the exec statement, I stumbled over the
following function in symtable.c (line 468ff):

------------------------------------------------------------------------------------
/* Check for illegal statements in unoptimized namespaces */
static int
check_unoptimized(const PySTEntryObject* ste) {
	char buf[300];
	const char* trailer;

	if (ste->ste_type != FunctionBlock || !ste->ste_unoptimized
	    || !(ste->ste_free || ste->ste_child_free))
		return 1;

	trailer = (ste->ste_child_free ?
		       "contains a nested function with free variables" :
			       "is a nested function");

	switch (ste->ste_unoptimized) {
	case OPT_TOPLEVEL: /* exec / import * at top-level is fine */
	case OPT_EXEC: /* qualified exec is fine */
		return 1;
	case OPT_IMPORT_STAR:
		PyOS_snprintf(buf, sizeof(buf),
			      "import * is not allowed in function '%.100s' "
			      "because it is %s",
			      PyString_AS_STRING(ste->ste_name), trailer);
		break;
	case OPT_BARE_EXEC:
		PyOS_snprintf(buf, sizeof(buf),
			      "unqualified exec is not allowed in function "
			      "'%.100s' it %s",
			      PyString_AS_STRING(ste->ste_name), trailer);
		break;
	default:
		PyOS_snprintf(buf, sizeof(buf),
			      "function '%.100s' uses import * and bare exec, "
			      "which are illegal because it %s",
			      PyString_AS_STRING(ste->ste_name), trailer);
		break;
	}

	PyErr_SetString(PyExc_SyntaxError, buf);
	PyErr_SyntaxLocation(ste->ste_table->st_filename,
			     ste->ste_opt_lineno);
	return 0;
}
--------------------------------------------------------------------------------------

Of course, this check can't be made at compile time if exec() is a function.
(You can even outsmart it currently by giving explicit None arguments to the
exec statement)

So my question is: is this check required, and can it be done at execution time
instead?

Comparing the exec code to execfile(), only this can be the cause for the
extra precaution:
(from Python/ceval.c, function exec_statement)

	if (plain)
		PyFrame_LocalsToFast(f, 0);

Georg


From guido at python.org  Fri Sep  1 20:37:55 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 11:37:55 -0700
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ed9uf1$76g$1@sea.gmane.org>
References: <ed9uf1$76g$1@sea.gmane.org>
Message-ID: <ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>

I would just rip it out.

On 9/1/06, Georg Brandl <g.brandl at gmx.net> wrote:
> Hi,
>
> in the process of ripping out the exec statement, I stumbled over the
> following function in symtable.c (line 468ff):
>
> ------------------------------------------------------------------------------------
> /* Check for illegal statements in unoptimized namespaces */
> static int
> check_unoptimized(const PySTEntryObject* ste) {
>         char buf[300];
>         const char* trailer;
>
>         if (ste->ste_type != FunctionBlock || !ste->ste_unoptimized
>             || !(ste->ste_free || ste->ste_child_free))
>                 return 1;
>
>         trailer = (ste->ste_child_free ?
>                        "contains a nested function with free variables" :
>                                "is a nested function");
>
>         switch (ste->ste_unoptimized) {
>         case OPT_TOPLEVEL: /* exec / import * at top-level is fine */
>         case OPT_EXEC: /* qualified exec is fine */
>                 return 1;
>         case OPT_IMPORT_STAR:
>                 PyOS_snprintf(buf, sizeof(buf),
>                               "import * is not allowed in function '%.100s' "
>                               "because it is %s",
>                               PyString_AS_STRING(ste->ste_name), trailer);
>                 break;
>         case OPT_BARE_EXEC:
>                 PyOS_snprintf(buf, sizeof(buf),
>                               "unqualified exec is not allowed in function "
>                               "'%.100s' because it %s",
>                               PyString_AS_STRING(ste->ste_name), trailer);
>                 break;
>         default:
>                 PyOS_snprintf(buf, sizeof(buf),
>                               "function '%.100s' uses import * and bare exec, "
>                               "which are illegal because it %s",
>                               PyString_AS_STRING(ste->ste_name), trailer);
>                 break;
>         }
>
>         PyErr_SetString(PyExc_SyntaxError, buf);
>         PyErr_SyntaxLocation(ste->ste_table->st_filename,
>                              ste->ste_opt_lineno);
>         return 0;
> }
> --------------------------------------------------------------------------------------
>
> Of course, this check can't be made at compile time if exec() is a function.
> (You can even outsmart it currently by giving explicit None arguments to the
> exec statement)
>
> So my question is: is this check required, and can it be done at execution time
> instead?
>
> Comparing the exec code to execfile(), only this can be the cause for the
> extra precaution:
> (from Python/ceval.c, function exec_statement)
>
>         if (plain)
>                 PyFrame_LocalsToFast(f, 0);
>
> Georg
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jcarlson at uci.edu  Fri Sep  1 21:20:21 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 01 Sep 2006 12:20:21 -0700
Subject: [Python-3000] "string" views
Message-ID: <20060901120313.1B5F.JCARLSON@uci.edu>


Attached you will find a zip file containing the implementation of a
'stringview' object written against Python 2.3 and Pyrex 0.9.3.  I
didn't implement center, decode, encode, ljust, rjust, splitlines, title,
translate, zfill, __[r]mod__, or slicing with a step != 1; my optimization
for view.join(...) doesn't seem to work, and view.split('') is also not
implemented.

I'm stopping for right now because I'm a bit burnt out on this
particular project.  If it seems hacked together, it is because it is
hacked together.  Whenever possible it returns views.  It also will
generally take anything that supports the buffer protocol as an argument
where a string or view would have also made sense.

Please remember that this is just a proof-of-concept implementation; I
would imagine that an actual view object would likely need to be written
in pure C, and though I have tested each method by hand, there may be
bugs.

I have also included the output file "stringview.c" for those without a
working Pyrex installation, which should compile against Python 2.3
headers, and perhaps even 2.4 headers.
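
For anyone who just wants the flavor of it without compiling anything:
the core trick is nothing more than carrying (base, start, stop) around
and deferring the copy.  A rough pure-Python sketch (illustrative only --
this is not the attached implementation):

    class strview(object):
        """Sketch of a non-copying view over a host string."""
        def __init__(self, base, start=0, stop=None):
            self.base = base
            self.start = start
            self.stop = len(base) if stop is None else stop

        def __len__(self):
            return self.stop - self.start

        def __str__(self):
            return self.base[self.start:self.stop]   # the only copy made

        def find(self, sub):
            i = self.base.find(sub, self.start, self.stop)
            return i if i < 0 else i - self.start

        def partition(self, sep):
            i = self.base.find(sep, self.start, self.stop)
            if i < 0:
                return self, strview('', 0, 0), strview('', 0, 0)
            j = i + len(sep)
            return (strview(self.base, self.start, i),
                    strview(self.base, i, j),
                    strview(self.base, j, self.stop))

e.g. strview("key=value").partition("=") slices without copying anything
until you call str() on one of the pieces.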


 - Josiah
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stringview.zip
Type: application/x-zip-compressed
Size: 27685 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20060901/46f8ee58/attachment-0001.bin 

From g.brandl at gmx.net  Fri Sep  1 23:28:15 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Sep 2006 23:28:15 +0200
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
Message-ID: <eda8lg$80n$1@sea.gmane.org>

Guido van Rossum wrote:
> I would just rip it out.

It turns out that it's not so easy. The exec statement currently can
modify the locals, which means that

def f():
     exec "a=1"
     print a

succeeds. To make that possible, the compiler flags scopes containing
exec statements as unoptimized and does not assume unbound names to
be global.

With exec being a function, currently the above function won't work
because "a" is assumed to be global.

I can see only two resolutions:

* change exec() semantics so that it cannot modify the locals
* do not make exec a function

Georg


From guido at python.org  Fri Sep  1 23:57:18 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 14:57:18 -0700
Subject: [Python-3000] Ripping out exec
In-Reply-To: <eda8lg$80n$1@sea.gmane.org>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
	<eda8lg$80n$1@sea.gmane.org>
Message-ID: <ca471dc20609011457x247cee4ch9fb164745e24e33d@mail.gmail.com>

On 9/1/06, Georg Brandl <g.brandl at gmx.net> wrote:
> Guido van Rossum wrote:
> > I would just rip it out.
>
> It turns out that it's not so easy. The exec statement currently can
> modify the locals, which means that
>
> def f():
>      exec "a=1"
>      print a
>
> succeeds. To make that possible, the compiler flags scopes containing
> exec statements as unoptimized and does not assume unbound names to
> be global.
>
> With exec being a function, currently the above function won't work
> because "a" is assumed to be global.
>
> I can see only two resolutions:
>
> * change exec() semantics so that it cannot modify the locals
> * do not make exec a function

Make it so it can't modify the locals. execfile() has the same limitation.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Sat Sep  2 00:37:20 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 02 Sep 2006 00:37:20 +0200
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ca471dc20609011457x247cee4ch9fb164745e24e33d@mail.gmail.com>
References: <ed9uf1$76g$1@sea.gmane.org>	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>	<eda8lg$80n$1@sea.gmane.org>
	<ca471dc20609011457x247cee4ch9fb164745e24e33d@mail.gmail.com>
Message-ID: <edacn0$j49$1@sea.gmane.org>

Guido van Rossum wrote:
> On 9/1/06, Georg Brandl <g.brandl at gmx.net> wrote:
>> Guido van Rossum wrote:
>> > I would just rip it out.
>>
>> It turns out that it's not so easy. The exec statement currently can
>> modify the locals, which means that
>>
>> def f():
>>      exec "a=1"
>>      print a
>>
>> succeeds. To make that possible, the compiler flags scopes containing
>> exec statements as unoptimized and does not assume unbound names to
>> be global.
>>
>> With exec being a function, currently the above function won't work
>> because "a" is assumed to be global.
>>
>> I can see only two resolutions:
>>
>> * change exec() semantics so that it cannot modify the locals
>> * do not make exec a function
> 
> Make it so it can't modify the locals. execfile() has the same limitation.
> 

Good. Patch is at python.org/sf/1550800.

There's another one at python.org/sf/1550786 implementing the Ellipsis literal.

cheers,
Georg


From greg.ewing at canterbury.ac.nz  Sat Sep  2 02:10:55 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 02 Sep 2006 12:10:55 +1200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F7A557.2010002@acm.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu> <ed1q7r$v4s$2@sea.gmane.org>
	<20060829102307.1B0F.JCARLSON@uci.edu> <ed1uds$iog$1@sea.gmane.org>
	<ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org>
Message-ID: <44F8CC0F.2020004@canterbury.ac.nz>

Talin wrote:

> So for example, any string operation which produces a subset of the 
> string (such as partition, split, index, slice, etc.) will produce a 
> string of the same width as the original string.

It might be possible to represent it in a narrower format,
however. Perhaps there should be an explicit operation for
re-packing a string into the narrowest possible format?
Or should one simply encode it as UTF-8 or something and
then decode it again to get the same effect?

--
Greg

From greg.ewing at canterbury.ac.nz  Sat Sep  2 02:37:02 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 02 Sep 2006 12:37:02 +1200
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
Message-ID: <44F8D22E.70202@canterbury.ac.nz>

Guido van Rossum wrote:
> I would just rip it out.

I don't understand this business about ripping out
exec. I thought that exec had to be a statement so
the compiler can tell whether to use fast locals.
Do you have a different way of handling that in mind
for Py3k?

--
Greg

From guido at python.org  Sat Sep  2 04:26:46 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 19:26:46 -0700
Subject: [Python-3000] Ripping out exec
In-Reply-To: <44F8D22E.70202@canterbury.ac.nz>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
	<44F8D22E.70202@canterbury.ac.nz>
Message-ID: <ca471dc20609011926s44e8a0f8v5e2db668556e7e44@mail.gmail.com>

On 9/1/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Guido van Rossum wrote:
> > I would just rip it out.
>
> I don't understand this business about ripping out
> exec. I thought that exec had to be a statement so
> the compiler can tell whether to use fast locals.
> Do you have a different way of handling that in mind
> for Py3k?

Yes. If we implement the module-level analysis it should be easy
enough to track whether 'exec' refers to the built-in function. (We're
already planning to add some kind of prohibition against outside
modules poking new globals into a module that shadow built-ins.)

But I also see no problem in requiring the use of a dict arg if you want
to observe the side effects of the exec'ed code. So instead of

  def f(s):
    exec s
    print a # presumably s must contain an assignment to a

you'd have to write

  def f(s):
    ns = {}
    exec(s, ns)
    print ns['a']

This makes it a lot clearer what happens IMO.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From ncoghlan at gmail.com  Sat Sep  2 05:42:45 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 02 Sep 2006 13:42:45 +1000
Subject: [Python-3000] Exception Expressions
In-Reply-To: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
Message-ID: <44F8FDB5.6000808@gmail.com>

An interesting idea, although I suspect a leading try keyword would make 
things clearer.

   (try expr1 except expr2 if exc_type)

print (try letters[7] except "N/A" if IndexError)
f = (try open(filename) except open(filename2) if IOError)
print (try eval(expr) except "Can not divide by zero!" if ZeroDivisionError)
val = (try db.get(key) except cache.get(key) if TimeoutError)

This wouldn't help the chaining problem that Greg pointed out, though:

try open(name1) except (try open(name2) except open(name3) if IOError) if IOError

Using a different keyword or a comma so expr2 comes last as Greg suggested 
would fix that:

try open(name1) except IOError, (try open(name2) except IOError, open(name3))

I'd be somewhere between -1 and -0 at this point in time. A review of the
standard library turning up actual use cases that this would make easier
to read might be enough to get me to a +0.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sat Sep  2 05:47:57 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 02 Sep 2006 13:47:57 +1000
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ed8pd9$ch$1@sea.gmane.org>
References: <ed8pd9$ch$1@sea.gmane.org>
Message-ID: <44F8FEED.9000600@gmail.com>

Fredrik Lundh wrote:
> today's Python supports "locale aware" 8-bit strings; e.g.
> 
>     >>> import locale
> >     >>> "åäö".isalpha()
>     False
>     >>> locale.setlocale(locale.LC_ALL, "sv_SE")
>     'sv_SE'
> >     >>> "åäö".isalpha()
>     True
> 
> to what extent should this be supported by Python 3000 ?

Since all strings will be Unicode by then:

 >>> u"åäö".isalpha()
True

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Sat Sep  2 09:57:11 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 02 Sep 2006 09:57:11 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F8CC0F.2020004@canterbury.ac.nz> (Greg Ewing's message of
	"Sat, 02 Sep 2006 12:10:55 +1200")
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org> <20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org> <ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org> <44F8CC0F.2020004@canterbury.ac.nz>
Message-ID: <87mz9izoo8.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> It might be possible to represent it in a narrower format,
> however. Perhaps there should be an explicit operation for
> re-packing a string into the narrowest possible format?

I suppose it's better to always normalize a polymorphic string
representation. And always normalize bignums to fixnums (long->int).

It increases chances of using the more compact representation.
It doesn't add any asymptotic cost, it's done when the whole
object is to be allocated anyway (these are immutable objects).
It simplifies equality comparison.

The narrow formats should be statistically more common than wide
formats anyway.

Programmers should not be expected to care about explicitly calling
a normalization function.
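
The check itself is trivial and O(n) -- a sketch, with the width names
purely illustrative:

    def narrowest_width(codepoints):
        # one pass over the data, done while the immutable string
        # object is being allocated anyway
        m = max(codepoints) if codepoints else 0
        if m <= 0xFF:
            return 1    # fits in Latin-1: one byte per character
        if m <= 0xFFFF:
            return 2    # fits in the BMP: two bytes per character
        return 4        # needs full UCS-4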

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From tomerfiliba at gmail.com  Sat Sep  2 17:53:59 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sat, 2 Sep 2006 17:53:59 +0200
Subject: [Python-3000] encoding hell
Message-ID: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>

i'm quite finished with the base of iostack (streams and layers), and
have moved to implementing the adapters layer (especially the dreaded
TextAdapter).

as was discussed earlier, streams and layers work with bytes, while
adapters may work with arbitrary objects (be it struct-style records,
serialized objects, characters and whatnot).

the question that arises is -- how far should we stretch this abstraction?
for example, the TextAdapter reads and writes characters to the
stream, after they go through encoding or decoding, so from the programmer's
point of view, he's working with *characters*, not *bytes*.
that means the programmer need not be aware of how the characters
are "physically" stored in the underlying stream.

that's all very nice, but what do we do when it comes to seek()ing?
do you want to seek by character position or by byte position?
logically you are working with characters, but it would be impossible
to implement without first decoding the entire stream in-memory...
which is unacceptable of course.

and if seek()ing is byte-oriented, then you must somehow seek
only to the beginning of a multibyte character sequence... how
would you do that?

my solution would be completely leaving seek() and tell() out of the
3rd layer -- it's a byte-level operation.

anyone thinks differently? if so, what's your solution?

- - - -

you can find the latest sources here (note: i haven't tested it yet,
many things are likely to be broken, it's still being redesigned):
http://sebulbasvn.googlecode.com/svn/trunk/iostack/
http://sebulbasvn.googlecode.com/svn/trunk/sock2/


-tomer

From g.brandl at gmx.net  Sat Sep  2 18:36:37 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 02 Sep 2006 18:36:37 +0200
Subject: [Python-3000] The future of exceptions
Message-ID: <edcbqh$63c$1@sea.gmane.org>

While looking at the changes necessary to implement the exception
related syntax changes (except ... as ..., raise without type),
I came across some more substantial things that I think must be discussed.

* How should exceptions be represented in C code? Should there still
  be a (type, value, traceback) triple?

* Could the traceback be made an attribute of the exception?

* What about exception chaining?

Something like this comes to mind::

    try:
        whatever
    except ValueError as err:
        raise CustomException("Something went wrong", prev=err)

With tracebacks becoming part of the exception, that could be::

    raise CustomException(*args, prev=err, tb=traceback)

(`prev` and `tb` would be keyword-only arguments)

With that, all exception info would be contained in one object,
so sys.exc_info() could be renamed to sys.last_exc().
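
As a rough sketch of the resulting data model (pure Python, with `prev`
and `tb` as above; keyword-only arguments are emulated with **kwds, and
nothing here is meant as the final API)::

    class Exception3k(Exception):
        def __init__(self, *args, **kwds):
            Exception.__init__(self, *args)
            self.prev = kwds.pop('prev', None)   # chained predecessor
            self.tb = kwds.pop('tb', None)       # owned traceback

    def format_chain(exc):
        # walk the prev links, printing the oldest exception first
        chain = []
        while exc is not None:
            chain.append(exc)
            exc = getattr(exc, 'prev', None)
        return '\n'.join(repr(e) for e in reversed(chain))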

cheers,
Georg


From qrczak at knm.org.pl  Sat Sep  2 20:04:08 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 02 Sep 2006 20:04:08 +0200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <edcbqh$63c$1@sea.gmane.org> (Georg Brandl's message of "Sat,
	02 Sep 2006 18:36:37 +0200")
References: <edcbqh$63c$1@sea.gmane.org>
Message-ID: <87pseew3fr.fsf@qrnik.zagroda>

Georg Brandl <g.brandl at gmx.net> writes:

> * Could the traceback be made an attribute of the exception?
>
> * What about exception chaining?
>
> Something like this comes to mind::
>
>     try:
>         whatever
>     except ValueError as err:
>         raise CustomException("Something went wrong", prev=err)

In my language the traceback is materialized from the stack only
if needed (typically when an exception escapes from the toplevel),
and it includes the history of other exceptions thrown from exception
handlers, intermingled with source locations. The stack is not
physically unwound until an exception handler completes successfully,
so the data is available until then.

For example the above (without storing prev) would include:
- locations of active functions leading to whatever
- the location of whatever when the value error is raised
- exception: the ValueError instance
- the location of raise CustomException
- exception: the CustomException instance

Printing the stack trace recognizes when the same exception object is
reraised again, and prints this as a propagation instead of repeating
the exception description.

Of course this design is suitable only if the previous exception
is used merely for printing the stack trace, not for unpacking and
examining by the program.

I don't know how Python stack traces are implemented, so I have no
idea whether this would be practical for Python, assuming it would be
desirable at all.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From talin at acm.org  Sat Sep  2 22:23:32 2006
From: talin at acm.org (Talin)
Date: Sat, 02 Sep 2006 13:23:32 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
Message-ID: <44F9E844.2020603@acm.org>

tomer filiba wrote:
> i'm quite finished with the base of iostack (streams and layers), and
> have moved to implementing the adapters layer (especially the dreaded
> TextAdapter).
> 
> as was discussed earlier, streams and layers work with bytes, while
> adapters may work with arbitrary objects (be it struct-style records,
> serialized objects, characters and whatnot).
> 
> the question that arises is -- how far should we stretch this abstraction?
> for example, the TextAdapter reads and writes characters to the
> stream, after they go through encoding or decoding, so from the programmer's
> point of view, he's working with *characters*, not *bytes*.
> that means the programmer need not be aware of how the characters
> are "physically" stored in the underlying stream.
> 
> that's all very nice, but what do we do when it comes to seek()ing?
> do you want to seek by character position or by byte position?
> logically you are working with characters, but it would be impossible
> to implement without first decoding the entire stream in-memory...
> which is unacceptable of course.
> 
> and if seek()ing is byte-oriented, then you must somehow seek
> only to the beginning of a multibyte character sequence... how
> would you do that?
> 
> my solution would be completely leaving seek() and tell() out of the
> 3rd layer -- it's a byte-level operation.
> 
> anyone thinks differently? if so, what's your solution?

Well, for comparison with other APIs:

The .Net equivalent, System.IO.TextReader, does not have a "seek" method 
at all.

The Java version, Java.io.BufferedReader, has a "skip()" method which 
only allows seeking forward.

Sounds to me like copying the Java model would work.

-- Talin

From brett at python.org  Sat Sep  2 22:44:00 2006
From: brett at python.org (Brett Cannon)
Date: Sat, 2 Sep 2006 13:44:00 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <edcbqh$63c$1@sea.gmane.org>
References: <edcbqh$63c$1@sea.gmane.org>
Message-ID: <bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>

On 9/2/06, Georg Brandl <g.brandl at gmx.net> wrote:
>
> While looking at the changes necessary to implement the exception
> related syntax changes (except ... as ..., raise without type),
> I came across some more substantial things that I think must be discussed.


You have read Ping's PEP 344, right?

> * How should exceptions be represented in C code? Should there still
>   be a (type, value, traceback) triple?
>
> * Could the traceback be made an attribute of the exception?


The problem with this is that it keeps the frame alive.  This is why this
and exception chaining were considered a design issue in Ping's PEP since
that is a lot of stuff to keep alive.

> * What about exception chaining?
>
> Something like this comes to mind::
>
>     try:
>         whatever
>     except ValueError as err:
>         raise CustomException("Something went wrong", prev=err)
>
> With tracebacks becoming part of the exception, that could be::
>
>     raise CustomException(*args, prev=err, tb=traceback)
>
> (`prev` and `tb` would be keyword-only arguments)
>
> With that, all exception info would be contained in one object,
> so sys.exc_info() could be renamed to sys.last_exc().


Right, which is why the original suggestion came up in the first place.  It
would be nice to compartmentalize exceptions entirely, but the worry of
keeping a ton of memory alive for it needs to be addressed, especially if
exceptions are to be kept lightweight and usable for things other than
flagging errors.

-Brett

From tomerfiliba at gmail.com  Sun Sep  3 00:29:25 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sun, 3 Sep 2006 00:29:25 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44F9E844.2020603@acm.org>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44F9E844.2020603@acm.org>
Message-ID: <1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com>

[Talin]
> The Java version, Java.io.BufferedReader, has a "skip()" method which
> only allows seeking forward.
> Sounds to me like copying the Java model would work.

then there's no need for it at all... just read() and discard the return value.
we don't need a special API for that.

on the other hand, the .NET version has a BaseStream attribute holding
the underlying stream over which the StreamReader operates... this
means you *can* change the position if the underlying stream supports
seeking.

i read through the msdn but found no explicit definition of what happens
when seeking in text-encoded streams; they noted somewhere that they
use a "best fit" decoder, which, to the best of my understanding, may
skip some bytes until it's in sync with the stream.

that's a *horrible* design, imho, but that's microsoft. i say let's leave it
below layer 3, at the byte level. if users find seeking very important,
we can come up with a layer-2 ReSyncLayer, which will attempt to
come back into sync with a specified encoding.

for example:

f = TextAdapter(
    ReSyncLayer(
        BufferedLayer(
            FileStream("blah", "r")
        ),
        encoding = "utf8"
    ),
    encoding = "utf8"
)

# read 3 UTF8 *characters*
f.read(3)

# this will seek by AT LEAST 7 *bytes*, until resynched
f.substream.seekby(7)

# we can resume reading of UTF8 *characters*
f.read(3)

heck, i even like this idea :)
thanks for the pointers.
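
p.s. -- resynching is especially cheap for utf8, since continuation
bytes are self-marking. a sketch of the core of such a ReSyncLayer
(assuming the unread() the buffering layer already provides):

    def resync_utf8(stream):
        # skip continuation bytes (0b10xxxxxx) until the next byte
        # starts a character; sketch only
        while True:
            b = stream.read(1)
            if not b:
                return                   # hit EOF while resynching
            if ord(b) & 0xC0 != 0x80:
                stream.unread(b)         # lead byte: hand it back
                return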


-tomer

On 9/2/06, Talin <talin at acm.org> wrote:
> tomer filiba wrote:
> > i'm quite finished with the base of iostack (streams and layers), and
> > have moved to implementing the adapters layer (especially the dreaded
> > TextAdapter).
> >
> > as was discussed earlier, streams and layers work with bytes, while
> > adapters may work with arbitrary objects (be it struct-style records,
> > serialized objects, characters and whatnot).
> >
> > the question that arises is -- how far should we stretch this abstraction?
> > for example, the TextAdapter reads and writes characters to the
> > stream, after they go through encoding or decoding, so from the programmer's
> > point of view, he's working with *characters*, not *bytes*.
> > that means the programmer need not be aware of how the characters
> > are "physically" stored in the underlying stream.
> >
> > that's all very nice, but what do we do when it comes to seek()ing?
> > do you want to seek by character position or by byte position?
> > logically you are working with characters, but it would be impossible
> > to implement without first decoding the entire stream in-memory...
> > which is unacceptable of course.
> >
> > and if seek()ing is byte-oriented, then you must somehow seek
> > only to the beginning of a multibyte character sequence... how
> > would you do that?
> >
> > my solution would be completely leaving seek() and tell() out of the
> > 3rd layer -- it's a byte-level operation.
> >
> > anyone thinks differently? if so, what's your solution?
>
> Well, for comparison with other APIs:
>
> The .Net equivalent, System.IO.TextReader, does not have a "seek" method
> at all.
>
> The Java version, Java.io.BufferedReader, has a "skip()" method which
> only allows seeking forward.
>
> Sounds to me like copying the Java model would work.
>
> -- Talin
>

From greg.ewing at canterbury.ac.nz  Sun Sep  3 01:06:01 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 03 Sep 2006 11:06:01 +1200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
Message-ID: <44FA0E59.9010302@canterbury.ac.nz>

tomer filiba wrote:

> my solution would be completely leaving seek() and tell() out of the
> 3rd layer -- it's a byte-level operation.

That's what I'd recommend, too. Seeking doesn't make
sense when the underlying units aren't fixed-length.

The best you could do would be to return some kind
of opaque object from tell() that could be passed
back to seek(). But I'm far from convinced that
would be worth the trouble.
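
Concretely, tell() would have to return something like this
(a sketch; the fields are illustrative):

    class Position(object):
        # a bookmark only tell() can create and only seek() can use
        def __init__(self, byte_offset, decoder_state):
            self._byte_offset = byte_offset      # where the bytes resume
            self._decoder_state = decoder_state  # pending multibyte state

    # pos = stream.tell()   ->  a Position
    # stream.seek(pos)      ->  restores offset *and* decoder state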

--
Greg


From ironfroggy at gmail.com  Sun Sep  3 02:24:22 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Sat, 2 Sep 2006 20:24:22 -0400
Subject: [Python-3000] The future of exceptions
In-Reply-To: <bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
Message-ID: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>

On 9/2/06, Brett Cannon <brett at python.org> wrote:
> Right, which is why the original suggestion came up in the first place.  It
> would be nice to compartmentalize exceptions entirely, but the worry of
> keeping a ton of memory alive for it needs to be addressed, especially if
> exceptions are to be kept lightweight and usable for things other than
> flagging errors.
>
> -Brett

So, at issue is that attaching tracebacks to exceptions keeps too much
alive and thus makes exceptions too heavy? If the traceback were passed
to the exception constructor and then held as an attribute of the
exception, any exception meant for "light" work (i.e., not normal error
flagging) could simply decide not to include the traceback, and so it
would be destroyed, removing the weight from the exception. Similarly,
tracebacks could have some lean() method to drop references to the
frames.

From brett at python.org  Sun Sep  3 03:34:47 2006
From: brett at python.org (Brett Cannon)
Date: Sat, 2 Sep 2006 18:34:47 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
Message-ID: <bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>

On 9/2/06, Calvin Spealman <ironfroggy at gmail.com> wrote:
>
> On 9/2/06, Brett Cannon <brett at python.org> wrote:
> > Right, which is why the original suggestion came up in the first place.  It
> > would be nice to compartmentalize exceptions entirely, but the worry of
> > keeping a ton of memory alive for it needs to be addressed, especially if
> > exceptions are to be kept lightweight and usable for things other than
> > flagging errors.
> >
> > -Brett
>
> So, at issue is that attaching tracebacks to exceptions keeps too much
> alive and thus makes exceptions too heavy?


Basically.  Memory usage goes up if you do this as it stands now.

> If the traceback were passed
> to the exception constructor and then held as an attribute of the
> exception, any exception meant for "light" work (i.e., not normal error
> flagging) could simply decide not to include the traceback, and so it
> would be destroyed, removing the weight from the exception. Similarly,
> tracebacks could have some lean() method to drop references to the
> frames.
>


Problem with that is you then lose any API guarantees of the traceback being
there, which would mean you would still need to keep around sys.exc_info().

-Brett

From fredrik at pythonware.com  Sun Sep  3 11:19:06 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Sun, 03 Sep 2006 11:19:06 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44FA0E59.9010302@canterbury.ac.nz>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FA0E59.9010302@canterbury.ac.nz>
Message-ID: <ede6m9$c9g$1@sea.gmane.org>

Greg Ewing wrote:

> The best you could do would be to return some kind
> of opaque object from tell() that could be passed
> back to seek().

that's how seek/tell works on text files in today's Python, of course. 
if you're writing portable code, you can only seek to the beginning or 
end of the file, or to a position returned to you by tell.
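
i.e. the portable subset is just:

    pos = f.tell()    # opaque cookie; only good for passing back
    f.seek(0)         # portable: beginning of file
    f.seek(0, 2)      # portable: end of file
    f.seek(pos)       # portable: a position tell() gave you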

</F>


From 2006 at jmunch.dk  Sun Sep  3 19:11:27 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Sun, 03 Sep 2006 19:11:27 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
Message-ID: <44FB0CBF.7070102@jmunch.dk>

tomer filiba wrote:
 > my solution would be completely leaving seek() and tell() out of the
 > 3rd layer -- it's a byte-level operation.
 >
 > anyone thinks differently? if so, what's your solution?

seek and tell are a poor man's sequence.  I would have nothing by those
names.

I would have input streams, output streams and sequences, and I
wouldn't mix the three.  FileReader would be an InputStream,
FileWriter would be an OutputStream.  FileBytes would support the
sequence protocol, mimicking bytes objects.  It would support
random-access read and write using __getitem__ and __setitem__,
allowing slice assignment for slices of equal size.  And there would
be append() to extend the file, and partial __delitem__ support for
truncating.
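
A sketch of the surface I have in mind (positioned I/O underneath;
slices and error handling omitted):

    import os

    class FileBytes(object):
        # sketch: a file presented as a mutable byte sequence
        def __init__(self, path):
            self.fd = os.open(path, os.O_RDWR | os.O_CREAT)

        def __len__(self):
            return os.fstat(self.fd).st_size

        def __getitem__(self, i):
            os.lseek(self.fd, i, 0)        # 0 == SEEK_SET
            return os.read(self.fd, 1)

        def __setitem__(self, i, byte):
            os.lseek(self.fd, i, 0)
            os.write(self.fd, byte)

        def append(self, data):
            os.lseek(self.fd, 0, 2)        # 2 == SEEK_END
            os.write(self.fd, data)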

Looking at your iostack2 Stream class, no sooner do you introduce the
key methods read and write, than you supplement them with capability
queries readable and writable that check whether these methods may
even be called.  IMO this is a clear indication that these methods
really want to be refactored into separate classes.

I think you'll find that separating input, output and random access
into three separate ADTs will much simplify BufferingLayer (even
though you'll need three of them).  At least if you intend to take
interactions between reads and writes into account.

regards,
Anders


From tomerfiliba at gmail.com  Sun Sep  3 20:17:39 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sun, 3 Sep 2006 20:17:39 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44FB0CBF.7070102@jmunch.dk>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
Message-ID: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>

> FileReader would be an InputStream,
> FileWriter would be an OutputStream

yes, this has been discussed, but that's too java-ish by nature.
besides, how would this model handle a simple operation, such as
file("foo", "w+") ?

opening TWO file descriptors for that purpose, one for reading and
another for writing, is a complete waste of resources: handles are not
cheap. not to mention that opening the same file multiple times may
run you into platform-specific pits, like read-after-write bugs, etc.

so the obvious solution is having an underlying "file-like object",
which is basically like today's file (supports read() AND write()),
over which InputStream and OutputStream just expose different
views:

f = file(...)
fr = FileReader(f)
fw = FileWriter(f)
fr.read()
fw.write()

now, this means you start with a "capable" object like file, with all of
the desired operations, and you intentionally CRIPPLE it down into
separate reading and writing front-ends.

so what sense does that make? if you want an InputStream, just be
sure you only call read() or readall(); if you want an OutputStream,
limit yourself to calling write(). input-only/output-only streams are
just silly and artificial overhead -- we don't need them.

the java/.NET world relies on interfaces so much that it might make
sense in that context. but that's not the python way.

> no sooner do you introduce the
> key methods read and write, than you supplement them with capability
> queries readable and writable that check whether these methods may
> even be called. IMO this is a clear indication that these methods
> really want to be refactored into separate classes.

the reason is that some streams, like pipes or partially shutdown()ed
sockets, may be unidirectional; some (e.g., sockets) may not support
seeking -- but the 2nd layer may augment that. for example, the
BufferingLayer may add seeking (it already supports unreading).

that's why streams are queriable -- iostack has a layered structure
that allows each layer to add more functionality to the underlying
layer. in other words, all stream are NOT born equal, but they can
be made equal later :)

that way, when your function accepts a stream as an argument,
it would just check s.readable or s.seekable, without regard to the
*type* of s itself, or the underlying storage --

it may be a file, it may be a buffered socket, but as long as you can
read from it/seek in it,  your code would work just fine. kinda like
duck-typing.
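
e.g. a function that cares only about capabilities, never about
concrete classes (a sketch; readable/writable are the queriable
flags mentioned above):

    def copy_stream(src, dst, chunksize=16384):
        # works for files, buffered sockets, pipes -- anything
        # that says it can read and write
        if not (src.readable and dst.writable):
            raise ValueError("need a readable src and a writable dst")
        while True:
            chunk = src.read(chunksize)
            if not chunk:
                break
            dst.write(chunk)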

> FileBytes would support the
> sequence protocol, mimicking bytes objects.  It would support
> random-access read and write using __getitem__ and __setitem__,
> allowing slice assignment for slices of equal size.

this may be a good direction. i'll try to see how it fits in.


-tomer

On 9/3/06, Anders J. Munch <2006 at jmunch.dk> wrote:
> tomer filiba wrote:
>  > my solution would be completely leaving seek() and tell() out of the
>  > 3rd layer -- it's a byte-level operation.
>  >
>  > anyone thinks differently? if so, what's your solution?
>
> seek and tell are a poor mans sequence.  I would have nothing by those
> names.
>
> I would have input streams, output streams and sequences, and I
> wouldn't mix the three.  FileReader would be an InputStream,
> FileWriter would be an OutputStream.  FileBytes would support the
> sequence protocol, mimicking bytes objects.  It would support
> random-access read and write using __getitem__ and __setitem__,
> allowing slice assignment for slices of equal size.  And there would
> be append() to extend the file, and partial __delitem__ support for
> truncating.
>
> Looking at your iostack2 Stream class, no sooner do you introduce the
> key methods read and write, than you supplement them with capability
> queries readable and writable that check whether these methods may
> even be called.  IMO this is a clear indication that these methods
> really want to be refactored into separate classes.
>
> I think you'll find that separating input, output and random access
> into three separate ADTs will much simplify BufferingLayer (even
> though you'll need three of them).  At least if you intend to take
> interactions between reads and writes into account.
>
> regards,
> Anders
>
>

From qrczak at knm.org.pl  Sun Sep  3 22:23:23 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 03 Sep 2006 22:23:23 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	(tomer filiba's message of "Sun, 3 Sep 2006 20:17:39 +0200")
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
Message-ID: <87lkp0bsxw.fsf@qrnik.zagroda>

"tomer filiba" <tomerfiliba at gmail.com> writes:

>> FileReader would be an InputStream,
>> FileWriter would be an OutputStream
>
> yes, this has been discussed, but that's too java-ish by nature.
> besides, how would this model handle a simple operation, such as
> file("foo", "w+") ?

What is the rationale for this operation on a text file?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From aahz at pythoncraft.com  Sun Sep  3 22:45:28 2006
From: aahz at pythoncraft.com (Aahz)
Date: Sun, 3 Sep 2006 13:45:28 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <87lkp0bsxw.fsf@qrnik.zagroda>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda>
Message-ID: <20060903204528.GA3950@panix.com>

On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote:
> "tomer filiba" <tomerfiliba at gmail.com> writes:
>>
>> file("foo", "w+") ?
> 
> What is the rationale for this operation on a text file?

You want to be able to read the file and write data to it.  That argues
in favor of seek(0) and seek(-1) being the only supported behaviors,
though.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

I support the RKAB

From 2006 at jmunch.dk  Mon Sep  4 00:29:43 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Mon, 04 Sep 2006 00:29:43 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>	
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
Message-ID: <44FB5757.6070209@jmunch.dk>

tomer filiba wrote:
 >> FileReader would be an InputStream,
 >> FileWriter would be an OutputStream
 >
 > yes, this has been discussed, but that's too java-ish by nature.
 > besides, how would this model handle a simple operation, such as
 > file("foo", "w+") ?

You mean, with the intent of both reading and writing to the file in
the same go?  That's what I meant FileBytes for.  Do you have a
requirement for drop-in compatibility with the current I/O?

In all my programming days I don't believe I've written to and read from
the same file handle even once.  Use cases exist, like if you're
implementing a DBMS, or adding to a zip file in-place, but they're the
exception, and by separating that functionality out in a dedicated
class like FileBytes, you avoid having the complexities of mixed input
and output affect your typical use cases.

 > the reason is some streams, like pipes or partially shutdown()ed-
 > sockets may be unidirectional; some (i.e., sockets) may not support
 > seeking -- but the 2nd layer may augment that. for example, the
 > BufferingLayer may add seeking (it already supports unreading).

Watch out!  There's an essential difference between files and
bidirectional communications channels that you need to take into
account.  For a TCP connection, input and output can be seen as
isolated from one another, with each their own stream position, and
each their own contents.  For read/write files, it's a whole different
ballgame, because stream position and data are shared.

That means you cannot use the same buffering code for both cases.  For
files, whenever you write something, you need to take into account
that that may overlap your read buffer or change read position.  You
should take another look at layer.BufferingLayer with that in mind.
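
The bluntest rule that stays coherent is to let any write discard the
read-ahead.  A sketch (real code would handle partial overlaps less
wastefully):

    class RWBuffer(object):
        def __init__(self, raw):
            self.raw = raw
            self.pending = b''    # read-ahead not yet handed out
            self.pos = 0          # file offset where pending starts

        def read(self, n):
            if not self.pending:
                self.pos = self.raw.tell()
                self.pending = self.raw.read(max(n, 8192))
            data, self.pending = self.pending[:n], self.pending[n:]
            self.pos += len(data)
            return data

        def write(self, data):
            # the write may touch what we read ahead: drop it and
            # move the raw stream back to the logical position first
            self.raw.seek(self.pos)
            self.pending = b''
            self.raw.write(data)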

regards, Anders


From talin at acm.org  Mon Sep  4 01:04:34 2006
From: talin at acm.org (Talin)
Date: Sun, 03 Sep 2006 16:04:34 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <44FB5757.6070209@jmunch.dk>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>		<44FB0CBF.7070102@jmunch.dk>	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<44FB5757.6070209@jmunch.dk>
Message-ID: <44FB5F82.3070809@acm.org>

Anders J. Munch wrote:

> Watch out!  There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account.  For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents.  For read/write files, it's a whole different
> ballgame, because stream position and data are shared.
> 
> That means you cannot use the same buffering code for both cases.  For
> files, whenever you write something, you need to take into account
> that that may overlap your read buffer or change read position.  You
> should take another look at layer.BufferingLayer with that in mind.
> 
> regards, Anders

This is a better explanation of some of the comments I was raising 
earlier: The choice of buffering strategy depends on a number of factors 
related to how the stream is going to be used, as well as the internal 
implementation of the stream. A buffering strategy that works well for a 
socket won't work very well for a DBMS.

When I stated earlier that 'the OS can do a better job of buffering than 
we can', what I meant to say was somewhat broader than that - which is 
that each layer is, in many cases, a better judge of what *kind* of 
buffering it needs than the person assembling the layers.

This doesn't mean that each layer has to implement its own buffering 
algorithm. The common buffering algorithms can be factored out into 
their own objects -- but what I'd suggest is that the choice of buffer 
algorithm not *normally* be exposed to the person constructing the io stack.

Thus, when creating a standard "line reader", instead of having the user 
call:

	fh = TextReader( Buffer( File( ... ) ) )

Instead, let the TextReader choose the kind of buffer it wants and 
supply that part automatically. There are several reasons why I think 
this would work better:

1) You can't simply stick just any buffer object in the middle there and 
expect it to work. Different buffer strategies have different 
interfaces, and trying to meld them all into one uber-interface would 
make for a very complex interface.

2) The TextReader knows perfectly well what kind of buffer it needs. 
Depending on how TextReader is implemented, it might want a serial, 
read-only buffer that allows a limited degree of look-ahead buffering so 
that it can find the line breaks. Or it might want a pair of buffers - 
one decoded, one encoded. There's no way that the user can know what 
kind of buffer to use without knowing the implementation details of 
TextReader.

3) TextReader can be optimized even more if it is allowed to 'peek'
inside the internals of the buffer - something that would not be allowed
if it had to conform to calling the buffer through a standard interface.


More generally, the choice of buffer depends on the usage pattern for 
reading / writing to the file - and that usage pattern is embodied in 
the definition of "TextReader". By creating a "TextReader" object, the 
user is stating their intention to read the file a certain way, in a 
certain order, with certain performance characteristics. The choice of 
buffering derives directly from those usage patterns. So the two go hand 
in hand.
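
To make that concrete, the construction could look like this (a sketch;
the class names are illustrative and the decoding deliberately naive):

    class LookaheadBuffer(object):
        # the limited look-ahead buffer this TextReader happens to want
        def __init__(self, raw, size=8192):
            self.raw, self.size, self.pending = raw, size, b''

        def read(self, n):
            while len(self.pending) < n:
                chunk = self.raw.read(self.size)
                if not chunk:
                    break
                self.pending += chunk
            data, self.pending = self.pending[:n], self.pending[n:]
            return data

    class TextReader(object):
        # the user writes TextReader(raw); the buffer is an
        # implementation detail chosen here, not assembled by hand
        def __init__(self, raw, encoding='ascii'):
            self._buf = LookaheadBuffer(raw)
            self._encoding = encoding

        def read(self, nchars):
            # one byte per character only holds for fixed-width
            # encodings; a real reader would use an incremental decoder
            return self._buf.read(nchars).decode(self._encoding)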

Now, I'm not saying that you can't stick additional layers in-between 
TextReader and FileStream if you want to. An example might be the 
"resync" layer that you mentioned, or a journaling layer that insures 
that all writes are recoverable. I'm merely saying that for the specific 
issue of buffering, I think that the choice of buffer type is 
complicated, and requires knowledge that might not be accessible to the 
person assembling the stack.

-- Talin

From greg.ewing at canterbury.ac.nz  Mon Sep  4 01:04:25 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Mon, 04 Sep 2006 11:04:25 +1200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
	<bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
Message-ID: <44FB5F79.6060507@canterbury.ac.nz>

Brett Cannon wrote:

> Basically.  Memory usage goes up if you do this as it stands now.

I'm not sure I follow that. The traceback gets created anyway,
so how is it going to use more memory if it's attached to a
throwaway exception instead of kept in a sys variable?

If you keep the exception around, that would keep the
traceback too, but how often are exceptions kept for long
periods after being caught?

--
Greg

From greg.ewing at canterbury.ac.nz  Mon Sep  4 01:11:34 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Mon, 04 Sep 2006 11:11:34 +1200
Subject: [Python-3000] encoding hell
In-Reply-To: <ede6m9$c9g$1@sea.gmane.org>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FA0E59.9010302@canterbury.ac.nz> <ede6m9$c9g$1@sea.gmane.org>
Message-ID: <44FB6126.8030706@canterbury.ac.nz>

Fredrik Lundh wrote:

> that's how seek/tell works on text files in today's Python, of course. 
> if you're writing portable code, you can only seek to the beginning or 
> end of the file, or to a position returned to you by tell.

True, but with arbitrary stacks of stream-transforming
objects the value might need to be even more opaque,
since it might need to encapsulate internal states of
decoders, etc. Could be very messy.
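
For instance -- purely a sketch, with invented names -- tell() on such
a stack might have to return a token like this:

import codecs

class Position:
    """Opaque token: a byte offset plus captured decoder state."""
    def __init__(self, byte_offset, decoder_state):
        self.byte_offset = byte_offset
        self.decoder_state = decoder_state

class DecodingStream:
    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def tell(self):
        # Assumes a decoder that can snapshot its own state;
        # not every codec can do this cheaply.
        return Position(self._raw.tell(), self._decoder.getstate())

    def seek(self, pos):
        self._raw.seek(pos.byte_offset)
        self._decoder.setstate(pos.decoder_state)

And that is with only one transforming layer; with a whole stack of
them the token would have to capture the state of every layer.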

--
Greg

From brett at python.org  Mon Sep  4 01:19:55 2006
From: brett at python.org (Brett Cannon)
Date: Sun, 3 Sep 2006 16:19:55 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <44FB5F79.6060507@canterbury.ac.nz>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
	<bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
	<44FB5F79.6060507@canterbury.ac.nz>
Message-ID: <bbaeab100609031619p48b2ac40tfcf33de5e83e5ea5@mail.gmail.com>

On 9/3/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>
> Brett Cannon wrote:
>
> > Basically.  Memory usage goes up if you do this as it stands now.
>
> I'm not sure I follow that. The traceback gets created anyway,
> so how is it going to use more memory if it's attached to a
> throwaway exception instead of kept in a sys variable?


It won't.

> If you keep the exception around, that would keep the
> traceback too, but how often are exceptions kept for long
> periods after being caught?


Not very often, but I didn't make this argument to begin with; other people did.
It was a sticking point when the idea was first put forth.  I personally
supported adding the attributes, but people kept pushing against it.

-Brett

From jimjjewett at gmail.com  Mon Sep  4 01:22:18 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Sep 2006 19:22:18 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44F8FEED.9000600@gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
Message-ID: <fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>

On 9/1/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Fredrik Lundh wrote:
> > today's Python supports "locale aware" 8-bit strings ...
> > to what extent should this be supported by Python 3000 ?

> Since all strings will be Unicode by then:

>  >>> u"???".isalpha()
> True

Two followup questions, then ...

(1)  To what extent should python support files (including stdin,
stdout) in local (non-unicode) encodings?  (not at all, per-file,
settable global default?)

(2)  To what extent will strings have an opaque (or at least
on-demand) backing store, so that decoding/encoding could be delayed?
(For example, Swedish text could be stored in single-byte characters,
and only converted to standard unicode on the rare occasions when it
met strings in an incompatible encoding.)

-jJ

From jimjjewett at gmail.com  Mon Sep  4 02:57:35 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Sep 2006 20:57:35 -0400
Subject: [Python-3000] The future of exceptions
In-Reply-To: <bbaeab100609031619p48b2ac40tfcf33de5e83e5ea5@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
	<bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
	<44FB5F79.6060507@canterbury.ac.nz>
	<bbaeab100609031619p48b2ac40tfcf33de5e83e5ea5@mail.gmail.com>
Message-ID: <fb6fbf560609031757i10044085md83a1d397b461fa@mail.gmail.com>

On 9/3/06, Brett Cannon <brett at python.org> wrote:
> On 9/3/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

> > The traceback gets created anyway, so how
> > is it going to use more memory if it's attached to a
> > throwaway exception instead of kept in a sys variable?

> > ...  how often are exceptions kept for long
> > periods after being caught?

> It was a sticking point when the idea was first put forth.

I think people were really objecting to cyclic garbage in general.
Both the garbage collector and weak references have improved since the
original discussion.

Even today, if a StopIteration() participates in a reference cycle,
then it won't be reclaimed until the next gc run.  I'm not quite sure
which direction should be a weakref, but I think it would be
reasonable for the cycle to get broken when a catching except block
exits without reraising.
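
For the record, the cycle is easy to produce once the traceback hangs
off the exception (the attribute name here is invented):

import sys

def make_cycle():
    try:
        raise StopIteration
    except StopIteration as err:
        # The traceback references this frame, and the frame's local
        # 'err' references the exception -- the cycle in question.
        # Breaking it when the except block exits (as suggested
        # above) would spare the cyclic GC.
        err.traceback = sys.exc_info()[2]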

-jJ

From paul at prescod.net  Mon Sep  4 03:55:20 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 3 Sep 2006 18:55:20 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
Message-ID: <1cb725390609031855r7258a2e9q2ce2877b45075744@mail.gmail.com>

On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> On 9/1/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > Fredrik Lundh wrote:
> > > today's Python supports "locale aware" 8-bit strings ...
> > > to what extent should this be supported by Python 3000 ?
>
> > Since all strings will be Unicode by then:
>
> >  >>> u"???".isalpha()
> > True
>
> Two followup questions, then ...
>
> (1)  To what extent should python support files (including stdin,
> stdout) in local (non-unicode) encodings?  (not at all, per-file,
> settable global default?)


I presume that Python's support of these will not change from today's. I
don't think that locale changes file decoding today, nor should it. After
all, files are emailed from place to place all the time.

> (2)  To what extent will strings have an opaque (or at least
> on-demand) backing store, so that decoding/encoding could be delayed?
> (For example, Swedish text could be stored in single-byte characters,
> and only converted to standard unicode on the rare occasions when it
> met strings in an incompatible encoding.)


I don't see this as particularly related to the locale issue either. It is
being discussed in other threads under the name "Polymorphic strings."
Fredrik Lundh said:

"I think just delaying decoding would take us most of the way.  the big
advantage of storage polymorphism is that you can avoid decoding and
encoding (and having to pay for the cycles and bytes needed for that) if
you don't have to."

I believe he is working on a prototype.

 Paul Prescod

From guido at python.org  Mon Sep  4 04:11:02 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 3 Sep 2006 19:11:02 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
Message-ID: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>

On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 9/1/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > Fredrik Lundh wrote:
> > > today's Python supports "locale aware" 8-bit strings ...
> > > to what extent should this be supported by Python 3000 ?
>
> > Since all strings will be Unicode by then:
>
> >  >>> u"???".isalpha()
> > True
>
> Two followup questions, then ...
>
> (1)  To what extent should python support files (including stdin,
> stdout) in local (non-unicode) encodings?  (not at all, per-file,
> settable global default?)

I've always said (can someone find a quote perhaps?) that there ought
to be a sensible default encoding for files (including but not limited
to stdin/out/err), perhaps influenced by personalized settings,
environment variables, the OS, etc.

> (2)  To what extent will strings have an opaque (or at least
> on-demand) backing store, so that decoding/encoding could be delayed?
> (For example, Swedish text could be stored in single-byte characters,
> and only converted to standard unicode on the rare occasions when it
> met strings in an incompatible encoding.)

That seems to be a bit of a leading question. Talin is currently
championing strings with different fixed-width storage, and others
have proposed even more flexible "polymorphic strings". You might want
to learn about the NSString type in Apple's Objective-C.

BTW the term "backing store" is typically used for *disk-based*
storage of large amounts of data -- but (even though your first
question is about files) I don't believe this is what you're referring
to.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com  Mon Sep  4 05:14:18 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Sep 2006 23:14:18 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
Message-ID: <fb6fbf560609032014l4119842fs2af78e572d84a431@mail.gmail.com>

On 9/3/06, Guido van Rossum <guido at python.org> wrote:
> On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:

> > (2)  To what extent will strings have an opaque
> > (or at least on-demand) backing store, so that
> > decoding/encoding could be delayed?

> That seems to be a bit of a leading question.

Yes; I (mis-?)read the original question as asking whether non-English
users would still be able to use (faster) 8-bit representations.

> BTW the term "backing store" is typically used for
> *disk-based* storage of large amounts of data --
> but (even though your first question is about files)
> I don't believe this is what you're referring to.

You are correct; I had forgotten that meaning, and was taking my usage
from the CFString (~= NSString) documentation suggested earlier.
There it refers to the underlying (private) real storage, rather than
to a disk.

Today, Python unicode characters are limited to a specific fixed width
at compile time, because C extensions can operate directly on the data
buffer.  If C extensions were required to go through the unicode
methods -- or at least to explicitly request a buffer -- then the
underlying storage could (often) be far more efficient.

This privatization would, however, be a major change to the API.
Smaller and faster localized strings are one of the compensatory
benefits.
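
At the Python level the idea is something like this sketch (the real
proposal is about the C representation; the names are invented):

class PolyString:
    """Keep the raw bytes plus their encoding; decode on demand."""

    def __init__(self, raw, encoding):
        self._raw = raw
        self._encoding = encoding
        self._decoded = None

    def as_unicode(self):
        if self._decoded is None:       # decode lazily, at most once
            self._decoded = self._raw.decode(self._encoding)
        return self._decoded

# Swedish text stays in its single-byte form until it actually meets
# a string in an incompatible encoding:
#     s = PolyString(raw_bytes, "iso-8859-1")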

-jJ

From jack at psynchronous.com  Mon Sep  4 09:21:29 2006
From: jack at psynchronous.com (Jack Diederich)
Date: Mon, 4 Sep 2006 03:21:29 -0400
Subject: [Python-3000] The future of exceptions
In-Reply-To: <edcbqh$63c$1@sea.gmane.org>
References: <edcbqh$63c$1@sea.gmane.org>
Message-ID: <20060904072129.GC5707@performancedrivers.com>

On Sat, Sep 02, 2006 at 06:36:37PM +0200, Georg Brandl wrote:
> While looking at the changes necessary to implement the exception
> related syntax changes (except ... as ..., raise without type),
> I came across some more substantial things that I think must be discussed.
> 
> * How should exceptions be represented in C code? Should there still
>   be a (type, value, traceback) triple?
> 
> * Could the traceback be made an attribute of the exception?
> 
> * What about exception chaining?
> 
The last time this came up everyone's eyes glazed over and the conversation
stopped.  That doesn't mean it isn't worth talking about; it just means that
exceptions are hard and potentially make GC miserable.

> Something like this comes to mind::
> 
>     try:
>         whatever
>     except ValueError as err:
>         raise CustomException("Something went wrong", prev=err)
> 
> With tracebacks becoming part of the exception, that could be::
> 
>     raise CustomException(*args, prev=err, tb=traceback)
> 
> (`prev` and `tb` would be keyword-only arguments)
> 
> With that, all exception info would be contained in one object,
> so sys.exc_info() could be renamed to sys.last_exc().
> 

The current system is awkward if you want to do fancy things with
exceptions and tracebacks.  I've never had to do fancy things with
exceptions and tracebacks so I'm OK with it.  "raise" as a bare word
covers all the cases where I need to catch, inspect, and potentially
reraise the original.  In the above example you are just annotating
and reraising an error so a KISS suggestion might go

try:
  whatever
except ValueError as err:
  err.also_squawk += 'Kilroy was here'
  raise

Where 'also_squawk' would be renamed to something more intuitive and much
more international.

-Jack
 

From phd at mail2.phd.pp.ru  Mon Sep  4 12:24:13 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Mon, 4 Sep 2006 14:24:13 +0400
Subject: [Python-3000] encoding hell
In-Reply-To: <20060903204528.GA3950@panix.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
Message-ID: <20060904102413.GC21049@phd.pp.ru>

On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote:
> On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote:
> > "tomer filiba" <tomerfiliba at gmail.com> writes:
> >>
> >> file("foo", "w+") ?
> > 
> > What is a rationale of this operation for a text file?
> 
> You want to be able to read the file and write data to it.  That argues
> in favor of seek(0) and seek(-1) being the only supported behaviors,
> though.

   Sometimes programs need tell() + seek(). Two examples (very similar,
really).

   Example 1. I have a program, an email robot that receives email(s) and
marks email addresses in a "database" that is actually a text file:

--- email database file ---
 phd at phd.pp.ru
 phd at oper.med.ru
--- / ---

   The program opens the file in "r+" mode, reads it line by line and
stores the position of the first character of every line using tell().
When it needs to mark an email it seek()s to the stored position and
writes a '+' mark so the file looks like

--- email database file ---
+phd at phd.pp.ru
 phd at oper.med.ru
--- / ---


   Example 2. INN (the NNTP daemon) stores (or at least stored when I was
using it) information about newsgroups in a text-file database. It uses
another approach - it stores info using lines of equal length:

--- newsgroups ---
comp.lang.python                          000001234567
comp.lang.python.announce                 000000abcdef
--- / ---

   Probably INN doesn't use tell() - it just calculates the position from the
line length. But a Python program needs tell() and seek() for such a file.
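
For the record, Example 1 boils down to something like this sketch
(the function name is mine):

def mark_address(path, address):
    f = open(path, "r+")
    positions = {}
    while True:
        pos = f.tell()                   # start of the current line
        line = f.readline()
        if not line:
            break
        positions[line[1:].strip()] = pos
    f.seek(positions[address])           # back to the stored position
    f.write("+")                         # overwrite the leading space
    f.close()

# mark_address("database.txt", "someone@example.com")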

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From aahz at pythoncraft.com  Mon Sep  4 15:39:52 2006
From: aahz at pythoncraft.com (Aahz)
Date: Mon, 4 Sep 2006 06:39:52 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <20060904102413.GC21049@phd.pp.ru>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
	<20060904102413.GC21049@phd.pp.ru>
Message-ID: <20060904133951.GA10810@panix.com>

On Mon, Sep 04, 2006, Oleg Broytmann wrote:
> On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote:
>> 
>> You want to be able to read the file and write data to it.  That argues
>> in favor of seek(0) and seek(-1) being the only supported behaviors,
>> though.
> 
>    Sometimes programs need tell() + seek(). Two examples (very similar,
> really).
> 
>    Example 1. I have a program, an email robot that receives email(s) and
> marks email addresses in a "database" that is actually a text file:

[snip examples of file with email addresses and INN control files]

My understanding is that those are in fact binary files that are being
treated as line-oriented "text" files.  I would agree that there needs
to be a way to do line-oriented processing on binary files, but anyone
who attempts to process these as text files is foolish at best.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

I support the RKAB

From david.nospam.hopwood at blueyonder.co.uk  Mon Sep  4 17:50:51 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Mon, 04 Sep 2006 16:50:51 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
Message-ID: <44FC4B5B.9010508@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> 
>>Two followup questions, then ...
>>
>>(1)  To what extent should python support files (including stdin,
>>stdout) in local (non-unicode) encodings?  (not at all, per-file,
>>settable global default?)

Per-file, I hope.

> I've always said (can someone find a quote perhaps?) that there ought
> to be a sensible default encoding for files (including but not limited
> to stdin/out/err), perhaps influenced by personalized settings,
> environment variables, the OS, etc.

While it should be possible to find out what the OS believes to be
the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows;
LC_CHARSET environment variable on Unix), that does not mean that it
is this charset that Python programs should normally use. When defining
a new text-based file type, it is simpler to define it to be always UTF-8.

>>(2)  To what extent will strings have an opaque (or at least
>>on-demand) backing store, so that decoding/encoding could be delayed?
>>(For example, Swedish text could be stored in single-byte characters,
>>and only converted to standard unicode on the rare occasions when it
>>met strings in an incompatible encoding.)
> 
> That seems to be a bit of a leading question. Talin is currently
> championing strings with different fixed-width storage, and others
> have proposed even more flexible "polymorphic strings". You might want
> to learn about the NSString type in Apple's Objective-C.

Operating on encoded constant strings, and decoding each character on the
fly, works fine when the charset is stateless and each character has a 1-1
correspondence with a Unicode character (i.e. code point). In that case
the program can operate on the string essentially as if it were Unicode.
It still works fine for variable-width charsets (including UTF-8 and
UTF-16); that just means that the program has to avoid assuming that a
position in the string is the same thing as a character count.

For charsets like ISCII and ISO 2022, which are stateful and/or have
a different encoding model to Unicode, I don't believe this approach
would work very well. But it is fine to support this for some charsets
and not others.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From guido at python.org  Mon Sep  4 23:32:12 2006
From: guido at python.org (Guido van Rossum)
Date: Mon, 4 Sep 2006 14:32:12 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FC4B5B.9010508@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
Message-ID: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>

On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Guido van Rossum wrote:
> > I've always said (can someone find a quote perhaps?) that there ought
> > to be a sensible default encoding for files (including but not limited
> > to stdin/out/err), perhaps influenced by personalized settings,
> > environment variables, the OS, etc.
>
> While it should be possible to find out what the OS believes to be
> the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows;
> LC_CHARSET environment variable on Unix), that does not mean that it
> is this charset that Python programs should normally use. When defining
> a new text-based file type, it is simpler to define it to be always UTF-8.

In this particular case I don't care what's simpler to implement, but
what's most likely to do what the user expects. If on a particular box
most files are encoded in encoding X, and the user did whatever is
necessary to tell the tools that that's their preferred encoding, I
want Python to honor that encoding when opening text files, unless the
program makes other arrangements explicitly (such as specifying an
explicit encoding as a parameter to open()).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rasky at develer.com  Tue Sep  5 15:17:49 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 5 Sep 2006 15:17:49 +0200
Subject: [Python-3000] have zip() raise exception for sequences
	of	different lengths
References: <d11dcfba0608301440u34f00311x714d3c1fe94f699a@mail.gmail.com>	<44F608B6.5010209@ewtllc.com><d11dcfba0608301656kb599177t548e25e098de3c47@mail.gmail.com>
	<44F62745.60006@ewtllc.com>
Message-ID: <017401c6d0ed$af903d10$b803030a@trilan>

Raymond Hettinger wrote:

> It's a PITA because it precludes all of the use cases where the
> inputs ARE intentionally of different length (like when one argument
> supplies an infinite iterator):
>
>    for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):
>        print 'Line %d, Time %s:  %s' % (lineno, ts, line)

which is a much more complicated way of writing:

for lineno, line in enumerate(sys.stdin):
    ts = time.time()
    ...

[assuming your "timestamp()" is what I think it is, never heard of it
before].

I double-checked my own uses of zip() and they seem to follow the trend of
those in the Python stdlib: most of the cases are really programming errors if
the two sequences do not match in length. I reckon the usage of infinite
iterators is generally much less common.
-- 
Giovanni Bajo


From paul at prescod.net  Tue Sep  5 18:08:47 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 09:08:47 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
Message-ID: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>

On 9/4/06, Guido van Rossum <guido at python.org> wrote:
>
> In this particular case I don't care what's simpler to implement, but
> what's most likely to do what the user expects. If on a particular box
> most files are encoded in encoding X, and the user did whatever is
> necessary to tell the tools that that's their preferred encoding, I
> want Python to honor that encoding when opening text files, unless the
> program makes other arrangements explicitly (such as specifying an
> explicit encoding as a parameter to open()).


It does not strike me as accurate that on a modern computer system, a
Swedish person's computer is full of ISO/Swedish-encoded files and a Chinese
person's computer is full of files in a specific Chinese encoding, etc. Maybe
that was true before the notion of variant encodings became so popular.

But now Europeans are just as likely to use UTF-8 as a national encoding and
Asians each have MANY different encodings to select from (some defined by
Unicode, some national). I doubt you'll frequently guess correctly except in
specialized apps where a user has very explicit control over their file
encodings and doesn't depend on applications to choose. The direction over
the lifetime of Python 3000 will be AWAY from national, local,
locale-predictable encodings and TOWARDS global, standard encodings. Once we
get to a place where Unicode encodings are dominant, a local-encodings
feature will be useless. In the transition period, it will actually be
harmful.

Also, only a portion of the text data on a computer is in "documents" where
the end-user has control over the encoding. There are also  many, many
configuration files, emails, saved web pages, chat logs etc. where the
encoding was selected by someone else with a potentially different
nationality.

I would guess that "most" text files on "most" computers in any particular
locale are in ASCII/utf-8. Japanese people also have hosts files and
.htaccess files and INI files and log files and ... Python can't know
whether it is dealing with one of these files or an end-user document.

Of the subset of documents that actually have their encoding controlled by
the local user's preferences, an increasing portion will be XML, and XML
documents describe their encoding explicitly. It would be wrong to use the
locale to override that.

Beyond all of that: It just seems wrong to me that I could send someone a
bunch of files and a Python program and their results processing them would
be different from mine, despite the fact that we run the same version of
Python on the same operating system.

 Paul Prescod

From jimjjewett at gmail.com  Tue Sep  5 18:35:35 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Sep 2006 12:35:35 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
Message-ID: <fb6fbf560609050935j4f6ba898j68ba7a827534d0b9@mail.gmail.com>

On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> On 9/4/06, Guido van Rossum <guido at python.org> wrote:

> > In this particular case I don't care what's simpler to implement, but
> > what's most likely to do what the user expects.

Good.

> But now Europeans are just as likely to use UTF-8 as a national encoding

fine; then that will be the locale.

> and Asians each have MANY different encodings to select from (some defined by
> Unicode, some national).

and the one they typically use will be the locale.

If notepad (or vi/emacs/less/cat) agree on what a text file is, and
Python doesn't, it is Python that will lose.

>The direction over
> the lifetype of Python 3000 will be AWAY from national, local,
> locale-predictable encodings and TOWARDS global, standard encodings.

Ruby is not wedding itself to Unicode precisely because they have seen
the opposite in Japan.  It sounded like the "unicode doesn't quite
work" problem will be permanent, because there are fundamental
differences over which glyphs should be unified when.  It isn't just a
matter of using a larger set; there are glyphs which should be unified
in some contexts but not others.

> Also, only a portion of the text data on a computer is in "documents" where
> the end-user has control over the encoding. There are also  many, many
> configuration files, emails, saved web pages, chat logs etc. where the
> encoding was selected by someone else with a potentially different
> nationality.

Typically, these either list the encoding explicitly, or stick to
something close to ASCII, which is included in most national
encodings.

> Beyond all of that: It just seems wrong to me that I could send someone a
> bunch of files and a Python program and their results processing them would
> be different from mine, despite the fact that we run the same version of
> Python on the same operating system.

So include the charset header.

-jJ

From guido at python.org  Tue Sep  5 18:52:59 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 09:52:59 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
Message-ID: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>

On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> Beyond all of that: It just seems wrong to me that I could send someone a
> bunch of files and a Python program and their results processing them would
> be different from mine, despite the fact that we run the same version of
> Python on the same operating system.

And it seems just as wrong if Python doesn't do what the user expects.
If I were a beginning Python user, I'd hate it if I had prepared a
simple data file in vi or notepad and my Python program wouldn't read
it right because Python's idea of encoding differs from my editor's.

Sorry Paul, I appreciate your standards-driven perspective, but in
this area I'd rather build in more flexibility than strictly needed,
than too little. If it turns out that on a particular platform all
files are in UTF-8, making Python *on that platform* always choose
UTF-8 is simple enough. OTOH, if on a particular platform UTF-8 is
*not* the norm, Python should not insist on using it anyway. We can
remove this feature once everybody uses UTF-8. I don't believe we're
there yet, and "it just seems wrong" doesn't count as proof. :-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Tue Sep  5 19:03:32 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 05 Sep 2006 19:03:32 +0200
Subject: [Python-3000] have zip() raise exception for sequences of
	different lengths
In-Reply-To: <017401c6d0ed$af903d10$b803030a@trilan>
References: <d11dcfba0608301440u34f00311x714d3c1fe94f699a@mail.gmail.com>	<44F608B6.5010209@ewtllc.com><d11dcfba0608301656kb599177t548e25e098de3c47@mail.gmail.com>	<44F62745.60006@ewtllc.com>
	<017401c6d0ed$af903d10$b803030a@trilan>
Message-ID: <edkal4$7b0$1@sea.gmane.org>

Giovanni Bajo wrote:
> Raymond Hettinger wrote:
> 
>> It's a PITA because it precludes all of the use cases where the
>> inputs ARE intentionally of different length (like when one argument
>> supplies an infinite iterator):
>>
>>    for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):
>>        print 'Line %d, Time %s:  %s' % (lineno, ts, line)
> 
> which is a much more complicated way of writing:
> 
> for lineno, line in enumerate(sys.stdin):
>     ts = time.time()
>     ...

enumerate() starts at 0, count(1) at 1, so you'd have to do a

   lineno += 1

in the body too.

Whether

   for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):

is more complicated than

   for lineno, line in enumerate(sys.stdin):
       ts = time.time()
       lineno += 1

is a stylistic question.

(However, enumerate() could grow a second argument specifying the
starting index).
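
A sketch of what that could look like, while enumerate() itself has no
such argument:

def enumerate_from(iterable, start=0):
    n = start
    for item in iterable:
        yield n, item
        n += 1

# for lineno, line in enumerate_from(sys.stdin, 1):
#     ...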

Georg


From brian at sweetapp.com  Tue Sep  5 20:12:15 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue, 05 Sep 2006 20:12:15 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
Message-ID: <44FDBDFF.7090505@sweetapp.com>

Guido van Rossum wrote:
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.

As a user, I don't have any expectations regarding non-ASCII text files.

I'm using a US-English version of Windows XP (very common) and I haven't 
changed the default encoding (very common). Python claims that my system 
encoding is CP437 (from sys.stdin/stdout.encoding). I can assure you 
that most of the documents that I work with are not in CP437 - they are 
a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that 
this is true of many Windows XP (US-English) users. So, for me and users 
like me, Python is going to silently misinterpret my data.

How about using ASCII as the default encoding and raising an exception 
if non-ASCII text is encountered?
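
In today's terms, the behaviour I mean is simply this (a sketch):

data = open("data.txt", "rb").read()
text = data.decode("ascii")   # raises UnicodeDecodeError at the first
                              # non-ASCII byte instead of silently
                              # misinterpreting it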

Cheers,
Brian

From guido at python.org  Tue Sep  5 21:13:46 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 12:13:46 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FDBDFF.7090505@sweetapp.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
Message-ID: <ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>

On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
> Guido van Rossum wrote:
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
> As a user, I don't have any expectations regarding non-ASCII text files.

What tools do you use to edit or view those files? How do those tools
know the encoding to use?

(Auto-detection from sniffing the data is a perfectly valid answer BTW
-- I see no reason why that couldn't be one option, as long as there's
a way to disable it.)
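
A crude version of such a sniffer -- a sketch that only looks for
Unicode byte-order marks:

import codecs

def sniff_encoding(path, default=None):
    head = open(path, "rb").read(4)
    for bom, name in [(codecs.BOM_UTF8, "utf-8"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")]:
        if head.startswith(bom):
            return name
    return default    # fall back to whatever the policy says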

> I'm using a US-English version of Windows XP (very common) and I haven't
> changed the default encoding (very common). Python claims that my system
> encoding is CP437 (from sys.stdin/stdout.encoding). I can assure you
> that most of the documents that I work with are not in CP437 - they are
> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that
> this is true of many Windows XP (US-English) users. So, for me and users
> like me, Python is going to silently misinterpret my data.

Not to any greater extent than Notepad or whatever other tool you are using.

> How about using ASCII as the default encoding and raising an exception
> if non-ASCII text is encountered?

That would not be doing what the user wants. We have extensive
experience with defaulting to ASCII in Python 2.x and it's mostly bad.
There should definitely be a way to force ASCII as the default
encoding (if only as a debugging aid), both in the program code and in
the environment; but it shouldn't be the only default. There should
also be a way to force UTF-8 as the default, or ISO-8859-1. But if
CP437 is the default encoding set by the OS I don't see why Python
shouldn't use that as the default *in the absence of any other
preferences*.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From paul at prescod.net  Tue Sep  5 22:17:47 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 13:17:47 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
Message-ID: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>

On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > Beyond all of that: It just seems wrong to me that I could send someone
> a
> > bunch of files and a Python program and their results processing them
> would
> > be different from mine, despite the fact that we run the same version of
> > Python on the same operating system.
>
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.


My point is that most textual content in the world is NOT produced in vi or
notepad or other applications that read the system encoding. Most content is
produced in Word (future Word files will be zipped Unicode, not opaque
binary), OpenOffice, DreamWeaver, web services, gmail, Thunderbird, phpbb,
etc.

I haven't created locale-relevant content in a generic text editor in a
very, very long time.

Applications like vi and emacs that "help" you to create content that other
people can't consume are not really helping at all. After all, we (now!)
live in a networked era and people don't just create documents and then
print them out on their local printers. Most of the time when I use text
editors I am editing HTML, XML or Python and using the default of CP437 is
wrong for all of those.

Even Python will puke if you take a naive approach to text encodings in
creating a Python program.

sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file
c:\temp\testencoding.py on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details

Are you going to change the Python interpreter so that it will "just work"
with content created in vi and notepad? Otherwise you're saying that Python
will take a modern collaboration-oriented approach to text processing but
encourage Python programmers to take a naive obsolete approach.

It also isn't just a question of flexibility. I think that Brian Quinlan
made the good point that most English Windows users do not know what
encoding their computer is using. If this represents 25% of the world's
Python users, and these users run into UTF-8 data more often than CP437 then
Python will guess wrong more often than it will guess right for 25% of its
users. This is really dangerous because CP437 will happily read and munge
UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default
for that 25%.

But it's worse than even that. GUI applications on Windows use a different
encoding than command line ones. So on the same box, Python-in-Tk and
Python-on-command line will answer that the system encoding is "cp437"
versus "cp1252". I just tested it.

http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx

Were it not for these issues I would say that it "isn't a big deal" because
modern Linux distributions are moving to UTF-8 default anyhow, and the Mac
seems to use ASCII. So we're moving to international standards regardless.
But default encoding on Windows is totally broken.

The Mac is not totally consistent either. The console decodes UTF-8 for
display. Textedit and vim munge the display in different ways (same GUI
versus command-line issue again, I guess)

A question: what happens when Python is reading data from a socket or other
file-like object? Will that data also be decoded as if it came from the
user's locale?

I don't think that this discussion really has anything to do with being
compatible with "most of the files on a computer". It is about being
compatible with a certain set of Unix text processing applications.

 Paul Prescod

From paul at prescod.net  Tue Sep  5 22:21:25 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 13:21:25 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
Message-ID: <1cb725390609051321i518d7b4cm607cbae361a55d7d@mail.gmail.com>

On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> > So, for me and users
> > like me, Python is going to silently misinterpret my data.
>
> Not to any greater extent than Notepad or whatever other tool you are
> using.


Yes. Unicode was invented in large part because people got sick of crappy
tools that silently misinterpreted their data. "I see a Euro character here,
a happy face there, a stack trace in a third place and my friend says he
sees an accented character." Not only do we not want to emulate that (PEP
263 explicitly chooses not to), we don't want to encourage other programmers
to do so either.

 Paul Prescod

From guido at python.org  Tue Sep  5 22:48:27 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 13:48:27 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
Message-ID: <ca471dc20609051348l2329c611xcd4d30868a9bcd03@mail.gmail.com>

I have no desire to continue this discussion in every detail. I
believe we've both made our point, eloquently enough. The designers of
the I/O library will have to come up with the specific rules for
deciding on the default encoding. The only thing I'm saying is that
hardcoding the default encoding in the language standard (like we did
for str<-->unicode in 2.0) would be a mistake. I'm trusting that
building the more basic facilities (such as being able to pass an
explicit encoding to open()) first will enable us to experiment with
different ways of determining a default encoding. That makes more
sense to me than trying to settle this argument by raising our voices.
(And yes, I am building in the possibility that I'm wrong. But
he-said-she-said won't convince me; only actual usage experience.)

--Guido

On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > > Beyond all of that: It just seems wrong to me that I could send someone
> a
> > > bunch of files and a Python program and their results processing them
> would
> > > be different from mine, despite the fact that we run the same version of
> > > Python on the same operating system.
> >
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
>
> My point is that most textual content in the world is NOT produced in vi or
> notepad or other applications that read the system encoding. Most content is
> produced in Word (future Word files will be zipped Unicode, not opaque
> binary), OpenOffice, DreamWeaver, web services, gmail, Thunderbird, phpbb,
> etc.
>
> I haven't created locale-relevant content in a generic text editor in a
> very, very long time.
>
> Applications like vi and emacs that "help" you to create content that other
> people can't consume are not really helping at all. After all, we (now!)
> live in a networked era and people don't just create documents and then
> print them out on their local printers. Most of the time when I use text
> editors I am editing HTML, XML or Python and using the default of CP437 is
> wrong for all of those.
>
> Even Python will puke if you take a naive approach to text encodings in
> creating a Python program.
>
> sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file
> c:\temp\testencoding.py on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>
> Are you going to change the Python interpreter so that it will "just work"
> with content created in vi and notepad? Otherwise you're saying that Python
> will take a modern collaboration-oriented approach to text processing but
> encourage Python programmers to take a naive obsolete approach.
>
> It also isn't just a question of flexibility. I think that Brian Quinlan
> made the good point that most English Windows users do not know what
> encoding their computer is using. If this represents 25% of the world's
> Python users, and these users run into UTF-8 data more often than CP437 then
> Python will guess wrong more often than it will guess right for 25% of its
> users. This is really dangerous because CP437 will happily read and munge
> UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default
> for that 25%.
>
> But it's worse than even that. GUI applications on Windows use a different
> encoding than command line ones. So on the same box, Python-in-Tk and
> Python-on-command line will answer that the system encoding is "cp437"
> versus "cp1252". I just tested it.
>
> http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
>
> Were it not for these issues I would say that it "isn't a big deal" because
> modern Linux distributions are moving to UTF-8 default anyhow, and the Mac
> seems to use ASCII. So we're moving to international standards regardless.
> But default encoding on Windows is totally broken.
>
> The Mac is not totally consistent either. The console decodes UTF-8 for
> display. Textedit and vim munge the display in different ways (same GUI
> versus command-line issue again, I guess)
>
> A question: what happens when Python is reading data from a socket or other
> file-like object? Will that data also be decoded as if it came from the
> user's locale?
>
> I don't think that this discussion really has anything to do with being
> compatible with "most of the files on a computer". It is about being
> compatible with a certain set of Unix text processing applications.
>
>  Paul Prescod
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From oliphant.travis at ieee.org  Wed Sep  6 00:17:49 2006
From: oliphant.travis at ieee.org (Travis Oliphant)
Date: Tue, 05 Sep 2006 16:17:49 -0600
Subject: [Python-3000] long/int unification
In-Reply-To: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
Message-ID: <edkt2e$bhd$1@sea.gmane.org>

martin at v.loewis.de wrote:
> Here is a quick status of the int_unification branch,
> summarizing what I did at the Google sprint in NYC.
> 
> - the int type has been dropped; the builtins int and long
>   now both refer to long type
> - all PyInt_* API is forwarded to the PyLong_* API. Little
>   changes to the C code are necessary; the most common offender
>   is PyInt_AS_LONG((PyIntObject*)v) since I completely removed
>   PyIntObject.
> - Much of the test suite passes, although it still has a number
>   of bugs.
> - There are timing tests for allocation and for addition.
>   On allocation, the current implementation is about a factor
>   of 2 slower; the integer addition is about 1.5 times slower;
>   the initial slowdowns was by a factor of 3. The pystones
>   dropped about 10% (pybench fails to run on p3yk).

What impact is this long/int unification going to have on C-based 
sub-types of the old int-type?  Will you be able to sub-class the 
integer-type in C without carrying around all the extra backage of the 
Python long?

NumPy has a scalar-type that inherits from the current int-type which 
allows it to participate in many Python optimizations.  Will the ability 
to do this disappear?

I'm just wondering about the C-side view of the int/long unification.  I 
can see benefit to the notion of integer unification, but wonder if 
strictly throwing out the small integer type on the C-level is actually 
going too far.  In NumPy, we have 10 different integer data-types 
corresponding to what can be contained in an array.  This direction was 
chosen after years of frustration of trying to fit a square peg (the 
item from the NumPy array) into a round hole (the limited Python scalar 
types).

-Travis


From guido at python.org  Wed Sep  6 01:05:22 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 16:05:22 -0700
Subject: [Python-3000] long/int unification
In-Reply-To: <edkt2e$bhd$1@sea.gmane.org>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
	<edkt2e$bhd$1@sea.gmane.org>
Message-ID: <ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>

On 9/5/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> What impact is this long/int unification going to have on C-based
> sub-types of the old int-type?  Will you be able to sub-class the
> integer-type in C without carrying around all the extra baggage of the
> Python long?

This seems unlikely given that the PyInt *type* will go away (though
the PyInt *API* methods may well continue to exist). You can subclass
the PyLong type just as easily. What baggage are you thinking of?

> NumPy has a scalar-type that inherits from the current int-type which
> allows it to participate in many Python optimizations.  Will the ability
> to do this disappear?

What kind of optimizations are you thinking of?

If you're thinking of the current special-casing for e.g. list[int] in
ceval.c, that code will likely disappear (although something
equivalent will eventually be added).

See my message about premature optimization on the Py3k list from about 10 days ago.

> I'm just wondering about the C-side view of the int/long unification.  I
> can see benefit to the notion of integer unification, but wonder if
> strictly throwing out the small integer type on the C-level is actually
> going too far.  In NumPy, we have 10 different integer data-types
> corresponding to what can be contained in an array.  This direction was
> chosen after years of frustration of trying to fit a square peg (the
> item from the NumPy array) into a round hole (the limited Python scalar
> types).

But now that we have __index__, of course, there's less reason to
subclass PyInt in the first place -- you can write your own 32-bit
integer *without* inheriting from PyInt or PyLong, and it should be
usable perfectly whenever an integer is expected. I'd rather make sure
*this* property is provided without compromise than attempting to keep
random older optimizations alive for nostalgia's sake.
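
For the record, such a type is just a few lines (a sketch):

class Int32:
    """A 32-bit integer that inherits from neither PyInt nor PyLong."""

    def __init__(self, value):
        self._value = value & 0xFFFFFFFF   # wrap to 32 bits, unsigned

    def __index__(self):
        return self._value

# "abcdef"[Int32(2)] == "c" -- usable wherever an index is expected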

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Wed Sep  6 01:33:19 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 16:33:19 -0700
Subject: [Python-3000] long/int unification
In-Reply-To: <44FE0752.9020903@ee.byu.edu>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
	<edkt2e$bhd$1@sea.gmane.org>
	<ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>
	<44FE0752.9020903@ee.byu.edu>
Message-ID: <ca471dc20609051633i51c29ea9qc604f137602fd705@mail.gmail.com>

On 9/5/06, Travis Oliphant <oliphant at ee.byu.edu> wrote:
> Guido van Rossum wrote:
>
> > On 9/5/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> >
> >> What impact is this long/int unification going to have on C-based
> >> sub-types of the old int-type?  Will you be able to sub-class the
> >> integer-type in C without carrying around all the extra baggage of the
> >> Python long?
> >
> >
> > This seems unlikely given that the PyInt *type* will go away (though
> > the PyInt *API* methods may well continue to exist). You can subclass
> > the PyLong type just as easily. What baggage are you thinking of?
>
> Just the extra stuff in the C-structure needed to handle the
> arbitrary-length integer.

That's just an int length plus an encoding that stores 15 bits of the actual value in each 16-bit digit.

> > If you're thinking of the current special-casing for e.g. list[int] in
> > ceval.c, that code will likely disappear (although something
> > equivalent will eventually be added).
>
> Yes, that's what I'm thinking of.   It would be nice if the "something
> equivalent" could be extended to other objects.  I suppose the
> discussion can be held off until then.
>
> >
> > But now that we have __index__, of course, there's less reason to
> > subclass PyInt in the first place -- you can write your own 32-bit
> > integer *without* inheriting from PyInt or PyLong, and it should be
> > usable perfectly whenever an integer is expected. Id rather make sure
> > *this* property is provided without compromise than attempting to keep
> > random older optimizations alive for nostalgia's sake.
>
>
> Of course, I agree entirely, so I doubt it will matter at all (except in
> optimizations).  There is probably going to be an increasing need to
> tell whether or not something can handle one of these interfaces.  I
> know this was already discussed on this list, but was a decision reached
> about how to tell if something exposes a specific interface? (I think
> the relevant discussion took place under the name "callable").
>
> I see a lot of
>
> isinstance(obj, int)
>
> in scientific Python code where testing for __index__ would be more
> appropriate.

I wouldn't rip this out just yet. 'int' may become an abstract type
yet -- the int/long unification branch isn't the final word (if only
because it doesn't pass all the unit tests yet).

> Thanks for easing my mind.

You're welcome. And how's that PEP coming? :-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From oliphant at ee.byu.edu  Wed Sep  6 01:25:06 2006
From: oliphant at ee.byu.edu (Travis Oliphant)
Date: Tue, 05 Sep 2006 17:25:06 -0600
Subject: [Python-3000] long/int unification
In-Reply-To: <ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>	
	<edkt2e$bhd$1@sea.gmane.org>
	<ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>
Message-ID: <44FE0752.9020903@ee.byu.edu>

Guido van Rossum wrote:

> On 9/5/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
>
>> What impact is this long/int unification going to have on C-based
>> sub-types of the old int-type?  Will you be able to sub-class the
>> integer-type in C without carrying around all the extra baggage of the
>> Python long?
>
>
> This seems unlikely given that the PyInt *type* will go away (though
> the PyInt *API* methods may well continue to exist). You can subclass
> the PyLong type just as easily. What baggage are you thinking of?

Just the extra stuff in the C-structure needed to handle the 
arbitrary-length integer.

>
> If you're thinking of the current special-casing for e.g. list[int] in
> ceval.c, that code will likely disappear (although something
> equivalent will eventually be added).

Yes, that's what I'm thinking of.   It would be nice if the "something 
equivalent" could be extended to other objects.  I suppose the 
discussion can be held off until then.

>
> But now that we have __index__, of course, there's less reason to
> subclass PyInt in the first place -- you can write your own 32-bit
> integer *without* inheriting from PyInt or PyLong, and it should be
> perfectly usable whenever an integer is expected. I'd rather make sure
> *this* property is provided without compromise than attempt to keep
> random older optimizations alive for nostalgia's sake.


Of course, I agree entirely, so I doubt it will matter at all (except in 
optimizations).  There is probably going to be an increasing need to 
tell whether or not something can handle one of these interfaces.  I 
know this was already discussed on this list, but was a decision reached 
about how to tell if something exposes a specific interface? (I think 
the relevant discussion took place under the name "callable").  

I see a lot of

isinstance(obj, int)

in scientific Python code where testing for __index__ would be more 
appropriate.

Thanks for easing my mind.

-Travis


From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 02:32:28 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 01:32:28 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
Message-ID: <44FE171C.1090101@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> 
>> Beyond all of that: It just seems wrong to me that I could send someone a
>> bunch of files and a Python program and their results processing them
>> would be different from mine, despite the fact that we run the same version of
>> Python on the same operating system.
> 
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.

I don't know about vi, but notepad will open and save files that are not in
the system ("ANSI") encoding just fine. On opening it checks for a BOM and
auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
"Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
Encoding drop-down box.

This is exactly the behaviour that most users would expect of a well-behaved
Unicode-aware app. It should be as easy as possible to match this behaviour
in a Python program.
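
A minimal sketch of that BOM sniffing in Python (the cp1252 fallback here
stands in for the system "ANSI" encoding and is only an example):

import codecs

_BOMS = [(codecs.BOM_UTF8, 'utf-8-sig'),     # utf-8-sig strips the BOM
         (codecs.BOM_UTF16_LE, 'utf-16'),    # the utf-16 codec consumes
         (codecs.BOM_UTF16_BE, 'utf-16')]    # the BOM itself

def sniff_encoding(path, fallback='cp1252'):
    # Check the first bytes for a BOM, as notepad does on open.
    head = open(path, 'rb').read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return fallback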

> Sorry Paul, I appreciate your standards-driven perspective, but in
> this area I'd rather build in more flexibility than strictly needed,
> than too little. If it turns out that on a particular platform all
> files are in UTF-8, making Python *on that platform* always choose
> UTF-8 is simple enough.

The problem is not the systems where all files are UTF-8, or all files are
another known charset. The problem is the platforms where half of the files
are UTF-8 and half are in some other charset, determined either by type or by
presence of a UTF-8 BOM. This is a *very* common situation, especially for
European users.

Such a user cannot set the locale to UTF-8, because that will break all of
their non-Unicode-aware applications. The Unicode-aware applications typically
have much better support for reading and writing files in charsets that are
not the system default. So in practice the locale has to be set to the "old"
charset during a migration to UTF-8.

(Setting different locales for different applications is far too much hassle.
On Windows, although I believe it is technically possible to do the equivalent
of selecting a UTF-8 locale, most users don't know how to do it, even if they
want to use UTF-8 exclusively.)

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 02:36:10 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 01:36:10 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE171C.1090101@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>		<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>		<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>		<44FC4B5B.9010508@blueyonder.co.uk>		<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>		<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
Message-ID: <44FE17FA.6030103@blueyonder.co.uk>

David Hopwood wrote:
> I don't know about vi, but notepad will open and save files that are not in
> the system ("ANSI") encoding just fine. On opening it checks for a BOM and
> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
> "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> Encoding drop-down box.

... and it also helpfully prompts you to select a Unicode encoding, if you
forget and the file contains characters that are not representable in the ANSI
encoding.

> This is exactly the behaviour that most users would expect of a well-behaved
> Unicode-aware app. It should be as easy as possible to match this behaviour
> in a Python program.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From guido at python.org  Wed Sep  6 02:44:37 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 17:44:37 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE171C.1090101@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
Message-ID: <ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>

On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Guido van Rossum wrote:
> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> >
> >> Beyond all of that: It just seems wrong to me that I could send someone a
> >> bunch of files and a Python program and their results processing them
> >> would be different from mine, despite the fact that we run the same version of
> >> Python on the same operating system.
> >
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
> I don't know about vi, but notepad will open and save files that are not in
> the system ("ANSI") encoding just fine. On opening it checks for a BOM and
> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
> "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> Encoding drop-down box.
>
> This is exactly the behaviour that most users would expect of a well-behaved
> Unicode-aware app. It should be as easy as possible to match this behaviour
> in a Python program.

And this is exactly why I want the determination of the default
encoding (i.e. the encoding to be used when opening a file when no
explicit encoding is specified by the Python code that does the
opening) to be open-ended, rather than picking some standard default
like UTF-8 and saying (like Paul seems to want to say) "this is it".

> > Sorry Paul, I appreciate your standards-driven perspective, but in
> > this area I'd rather build in more flexibility than strictly needed,
> > than too little. If it turns out that on a particular platform all
> > files are in UTF-8, making Python *on that platform* always choose
> > UTF-8 is simple enough.
>
> The problem is not the systems where all files are UTF-8, or all files are
> another known charset. The problem is the platforms where half of the files
> are UTF-8 and half are in some other charset, determined either by type or by
> presence of a UTF-8 BOM. This is a *very* common situation, especially for
> European users.

Right. (And Paul appears to be ignorant of this.)

> Such a user cannot set the locale to UTF-8, because that will break all of
> their non-Unicode-aware applications. The Unicode-aware applications typically
> have much better support for reading and writing files in charsets that are
> not the system default. So in practice the locale has to be set to the "old"
> charset during a migration to UTF-8.
>
> (Setting different locales for different applications is far too much hassle.
> On Windows, although I believe it is technically possible to do the equivalent
> of selecting a UTF-8 locale, most users don't know how to do it, even if they
> want to use UTF-8 exclusively.)

Right. Of course, "locale" and "encoding" are somewhat orthogonal
issues; the encoding may be UTF-8 but that doesn't determine other
aspects of the locale (such as language-specific collation order, or
culture-specific formatting of numbers, dates and money). Now, some
platforms may equate the two somehow, and on those platforms we would
have to inspect the locale to tell the encoding; but other platforms
may specify the encoding separate from the locale...

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 02:46:29 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 01:46:29 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	<44FDBDFF.7090505@sweetapp.com>
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
Message-ID: <44FE1A65.7020900@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
> [...]
> 
> That would not be doing what the user wants. We have extensive
> experience with defaulting to ASCII in Python 2.x and it's mostly bad.
> There should definitely be a way to force ASCII as the default
> encoding (if only as a debugging aid), both in the program code and in
> the environment; but it shouldn't be the only default. There should
> also be a way to force UTF-8 as the default, or ISO-8859-1. But if
> CP436 is the default encoding set by the OS I don't see why Python
> shouldn't use that as the default *in the absence of any other
> preferences*.

Cp436 is almost certainly *not* the encoding set by the OS; Python
has got it wrong. If Brian is using an English-language variant of
Windows XP and has not changed the defaults, the system ("ANSI")
encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
if C1 control characters are not used).

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From guido at python.org  Wed Sep  6 03:09:21 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 18:09:21 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <20060904102413.GC21049@phd.pp.ru>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
	<20060904102413.GC21049@phd.pp.ru>
Message-ID: <ca471dc20609051809iae9cddcxd13b3a7c35c6c4ae@mail.gmail.com>

On 9/4/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote:
> > On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote:
> > > "tomer filiba" <tomerfiliba at gmail.com> writes:
> > >>
> > >> file("foo", "w+") ?
> > >
> > > What is a rationale of this operation for a text file?
> >
> > You want to be able to read the file and write data to it.  That argues
> > in favor of seek(0) and seek(-1) being the only supported behaviors,
> > though.

Umm, where he wrote seek(-1) he probably meant seek(0, 2) which is how
one seeks to EOF.
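
That is (the filename is a placeholder):

f = open('log.txt', 'r+')
f.seek(0, 2)    # whence=2, i.e. os.SEEK_END: position at end-of-file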

>    Sometimes programs need tell() + seek(). Two examples (very similar,
> really).
>
>    Example 1. I have a program, an email robot that receives email(s) and
> marks email addresses in a "database" that is actually a text file:
>
> --- email database file ---
>  phd at phd.pp.ru
>  phd at oper.med.ru
> --- / ---
>
>    The program opens the file in "r+" mode, reads it line by line and
> stores the position of the first character of every line using tell().
> When it needs to mark an email it seek()'s to the stored position and
> writes a '+' mark so the file looks like
>
> --- email database file ---
> +phd at phd.pp.ru
>  phd at oper.med.ru
> --- / ---

I don't understand how it can insert a character into the file without
rewriting everything after that point.

But it does remind me of a use case for tell+seek on a read-only text
file. An email-reading program may have a text-based multi-message
mailbox format (e.g. UNIX mailbox format) and build an in-memory index
of seek positions using a quick initial scan (or scanning as it goes).
Once it has computed the position of a message it can quickly seek to
its start and display that message.

Granted, typical mailbox formats tend to use ASCII only. But one could
easily imagine a similar use case for encoded text files containing
multiple application-specific sections.

As long as the state of the decoder is "neutral" at the start of a
line, it should be possible to do this. I like the idea that tell()
returns a "cookie" which is really a byte offset. If one wants to be
able to seek to positions with a non-neutral decoder state, the cookie
would have to be more abstract. It shouldn't matter; text apps should
not do arithmetic on seek/tell positions.
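
A rough sketch of that mailbox indexing, treating tell() results as opaque
cookies (the filename is a placeholder):

def index_mbox(f):
    # Record a cookie for the start of each message ('From ' line).
    offsets = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith('From '):
            offsets.append(pos)
    return offsets

f = open('mail.mbox', 'r')
starts = index_mbox(f)
f.seek(starts[-1])    # jump straight to the last message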

>    Example 2. INN (the NNTP daemon) stores (at least stored when I was
> using it) information about newsgroups in a text file database. It uses
> another approach - it stores info using lines of equal length:
>
> --- newsgroups ---
> comp.lang.python                          000001234567
> comp.lang.python.announce                 000000abcdef
> --- / ---
>
>    Probably INN doesn't use tell() - it just calculates the position using
> line length. But a Python program needs tell() and seek() for such a file.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 03:28:31 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 02:28:31 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
Message-ID: <44FE243F.80203@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
>> >
>> >> Beyond all of that: It just seems wrong to me that I could send
>> >> someone a bunch of files and a Python program and their results
>> >> processing them would be different from mine, despite the fact that
>> >> we run the same version of Python on the same operating system.
>> >
>> > And it seems just as wrong if Python doesn't do what the user expects.
>> > If I were a beginning Python user, I'd hate it if I had prepared a
>> > simple data file in vi or notepad and my Python program wouldn't read
>> > it right because Python's idea of encoding differs from my editor's.
>>
>> I don't know about vi, but notepad will open and save files that are
>> not in the system ("ANSI") encoding just fine. On opening it checks for
>> a BOM and auto-detects UTF-8 and UTF-16; on saving it will write a BOM
>> if you choose "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or
>> UTF-8 in the Encoding drop-down box.
>>
>> This is exactly the behaviour that most users would expect of a
>> well-behaved Unicode-aware app. It should be as easy as possible to
>> match this behaviour in a Python program.
> 
> And this is exactly why I want the determination of the default
> encoding (i.e. the encoding to be used when opening a file when no
> explicit encoding is specified by the Python code that does the
> opening) to be open-ended, rather than picking some standard default
> like UTF-8 and saying (like Paul seems to want to say) "this is it".

The point I was making is that the system encoding *should not* be
treated as (or called) a "default" encoding. I can't speak for Paul, but
that seemed to also be what he was saying.

The whole idea of a default encoding is flawed. Ideally there would be
no default; programmers should be forced to think about the issue
on a case-by-case basis. In some cases they might choose to open a file
with the system encoding, but that should be an explicit decision.

>> (Setting different locales for different applications is far too much
>> hassle. On Windows, although I believe it is technically possible to
>> do the equivalent of selecting a UTF-8 locale, most users don't know
>> how to do it, even if they want to use UTF-8 exclusively.)
> 
> Right. Of course, "locale" and "encoding" are somewhat orthogonal
> issues; the encoding may be UTF-8 but that doesn't determine other
> aspects of the locale (such as language-specific collation order, or
> culture-specific formatting of numbers, dates and money).

The encoding is usually an attribute of the locale. This is certainly
the case on POSIX and Windows platforms.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From paul at prescod.net  Wed Sep  6 03:53:53 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 18:53:53 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
Message-ID: <1cb725390609051853p59574772q16ca26d17b52c76f@mail.gmail.com>

On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> > Guido van Rossum wrote:
> > > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > >
> > >> Beyond all of that: It just seems wrong to me that I could send
> someone a
> > >> bunch of files and a Python program and their results processing them
> > >> would be different from mine, despite the fact that we run the same
> version of
> > >> Python on the same operating system.
> > >
> > > And it seems just as wrong if Python doesn't do what the user expects.
> > > If I were a beginning Python user, I'd hate it if I had prepared a
> > > simple data file in vi or notepad and my Python program wouldn't read
> > > it right because Python's idea of encoding differs from my editor's.
> >
> > I don't know about vi, but notepad will open and save files that are not
> in
> > the system ("ANSI") encoding just fine. On opening it checks for a BOM
> and
> > auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you
> choose
> > "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> > Encoding drop-down box.
> >
> > This is exactly the behaviour that most users would expect of a
> well-behaved
> > Unicode-aware app. It should be as easy as possible to match this
> behaviour
> > in a Python program.
>
> And this is exactly why I want the determination of the default
> encoding (i.e. the encoding to be used when opening a file when no
> explicit encoding is specified by the Python code that does the
> opening) to be open-ended, rather than picking some standard default
> like UTF-8 and saying (like Paul seems to want to say) "this is it".


I never suggested that UTF-8 should be the default. In fact, I think it was
very wise of Python 2.x to make ASCII the default and I'm astounded to hear
that you regret that decision. "In the face of ambiguity, refuse the
temptation to guess."

Python 2.x provided an option to allow users to change the default
system-wide and ever since then we've (almost unanimously) counselled users
against changing it.

> > Sorry Paul, I appreciate your standards-driven perspective, but in
> > > this area I'd rather build in more flexibility than strictly needed,
> > > than too little. If it turns out that on a particular platform all
> > > files are in UTF-8, making Python *on that platform* always choose
> > > UTF-8 is simple enough.
> >
> > The problem is not the systems where all files are UTF-8, or all files
> are
> > another known charset. The problem is the platforms where half of the
> files
> > are UTF-8 and half are in some other charset, determined either by type
> or by
> > presence of a UTF-8 BOM. This is a *very* common situation, especially
> for
> > European users.
>
> Right. (And Paul appears to be ignorant of this.)


I don't see how the fact that an individual system can have half of the
files in one encoding and half in another could argue IN FAVOUR of a
system-global default. I would have thought it strengthens my argument
AGAINST trying to apply a random encoding to files.

You said:

"If on a particular box
most files are encoded in encoding X, and the user did whatever is
necessary to tell the tools that that's their preferred encoding, I
want Python to honor that encoding when opening text files, unless the
program makes other arrangements explicitly (such as specifying an
explicit encoding as a parameter to open())."

But there is no such thing that "most users do" to tell tools what their
preferred encoding is. Most users use some random (to them) operating
system default which on Windows is usually wrong and is different (for no
particular reason) on the Macintosh than on Linux. Long-time Windows users
in this thread cannot even agree on what the default is for US English
Windows, because there is no single default. There are two.

Can we at least agree that if LC_CHARSET is demonstrably wrong most of the
time on Windows, we should not use it (at least on Windows)?

 Paul Prescod

From paul at prescod.net  Wed Sep  6 04:00:06 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 19:00:06 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE1A65.7020900@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
	<44FE1A65.7020900@blueyonder.co.uk>
Message-ID: <1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>

On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>
> Guido van Rossum wrote:
> > On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
> > [...]
> >
> > That would not be doing what the user wants. We have extensive
> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
> > There should definitely be a way to force ASCII as the default
> > encoding (if only as a debugging aid), both in the program code and in
> > the environment; but it shouldn't be the only default. There should
> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
> > CP436 is the default encoding set by the OS I don't see why Python
> > shouldn't use that as the default *in the absence of any other
> > preferences*.
>
> Cp436 is almost certainly *not* the encoding set by the OS; Python
> has got it wrong. If Brian is using an English-language variant of
> Windows XP and has not changed the defaults, the system ("ANSI")
> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
> if C1 control characters are not used).


http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbdaen9/00000376.htm

"There are at least two code pages in use on most PCs. Applications using
the Windows graphical user interface use the Windows code pages. These code
pages are compatible with ISO character sets, and also with ANSI character
sets. They are often referred to as *ANSI code pages*.

Character-mode applications (those using the console or command prompt
window) in Windows 95/98/Me and Windows NT/2000/XP, use code pages that were
used in DOS. These are called *OEM code pages* (Original Equipment
Manufacturer) for historical reasons.

...
 Example

Consider the following situation:

   - A PC is running a Windows operating system with ANSI code page 1252.
   - The code page for character-mode applications is OEM code page 437.
   - Text is held in a database created using the collation UTF8.

An upper case A grave in the database is stored as hex bytes C3 80. In a
Windows application, the same character is represented as hex C0. In a DOS
application, it is represented as hex B7."
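
(The hex values are easy to check from Python:

ch = u'\u00c0'
print(repr(ch.encode('utf-8')))    # '\xc3\x80' -> hex C3 80
print(repr(ch.encode('cp1252')))   # '\xc0'     -> hex C0
print(repr(ch.encode('cp850')))    # '\xb7'     -> hex B7

Note that B7 is cp850's encoding of A-grave; cp437 itself has no
upper-case A grave, so the manual's last figure appears to reflect the
cp850 OEM page rather than cp437.)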

Now notice that when we introduce Unicode (and all Python 3K strings are
Unicode), we aren't talking about DISPLAY of characters. We're talking about
INTERPRETATION of characters. So if I read a file and then merge it with
some XML data, an application that uses the Windows default encoding will
create different output when the Python script is run from the command line
versus from the Windows desktop. Same app. Same data. Different default
encodings. Different output.

Of course we could arbitrarily choose one of these two encodings as the
"true" one, but the fact that they are ALMOST ALWAYS inconsistent indicates
something about how likely either one is to be correct for a particular
user's goals.

 Paul Prescod

From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 04:52:28 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 03:52:28 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	
	<44FDBDFF.7090505@sweetapp.com>	
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>	
	<44FE1A65.7020900@blueyonder.co.uk>
	<1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>
Message-ID: <44FE37EC.2050504@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
>> > [...]
>> >
>> > That would not be doing what the user wants. We have extensive
>> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
>> > There should definitely be a way to force ASCII as the default
>> > encoding (if only as a debugging aid), both in the program code and in
>> > the environment; but it shouldn't be the only default. There should
>> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
>> > CP436 is the default encoding set by the OS I don't see why Python
>> > shouldn't use that as the default *in the absence of any other
>> > preferences*.
>>
>> Cp436 is almost certainly *not* the encoding set by the OS; Python
>> has got it wrong. If Brian is using an English-language variant of
>> Windows XP and has not changed the defaults, the system ("ANSI")
>> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
>> if C1 control characters are not used).
> 
> http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbdaen9/00000376.htm
> 
> "There are at least two code pages in use on most PCs. Applications using
> the Windows graphical user interface use the Windows code pages. These code
> pages are compatible with ISO character sets, and also with ANSI character
> sets. They are often referred to as *ANSI code pages*.
> 
> Character-mode applications (those using the console or command prompt
> window) in Windows 95/98/Me and Windows NT/200/XP, use code pages that were
> used in DOS. These are called *OEM code pages* (Original Equipment
> Manufacturer) for historical reasons.

True, I oversimplified.

In practice, each text file on a Windows system is somewhat more likely to be
encoded in the ANSI charset than in the OEM charset (unless the user still
commonly uses DOS-era applications). The OEM charset only exists at all as a
compatibility hack.

> Of course we could arbitrarily choose one of these two encodings as the
> "true" one, but the fact that they are ALMOST ALWAYS inconsistent indicates
> something about how likely either one is to be correct for a particular
> user's goals.

Right -- it's impossible to make a clear distinction between "files used by
console applications" and "files used by graphical applications", since any
text file can be used by both. This just supports my assertion that there
should not be a "default" encoding at all.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From qrczak at knm.org.pl  Wed Sep  6 08:10:44 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 06 Sep 2006 08:10:44 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE243F.80203@blueyonder.co.uk> (David Hopwood's message of
	"Wed, 06 Sep 2006 02:28:31 +0100")
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk>
Message-ID: <87bqpttti3.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

> The whole idea of a default encoding is flawed. Ideally there would be
> no default; programmers should be forced to think about the issue
> on a case-by-case basis. In some cases they might choose to open a file
> with the system encoding, but that should be an explicit decision.

Perhaps this shows a difference between Unix and Windows culture.

On Unix there is definitely a default encoding; this is what most good
programs operating on text files assume by default. It would be insane
to have to tell each program separately about the encoding. Locale is
the OS mechanism used to provide this information in a uniform way.
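
For instance (Unix-only; locale.nl_langinfo is not available on Windows):

import locale
locale.setlocale(locale.LC_ALL, '')          # adopt the user's locale
print(locale.nl_langinfo(locale.CODESET))    # e.g. 'ISO-8859-2' or 'UTF-8'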

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Wed Sep  6 12:08:21 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 03:08:21 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87bqpttti3.fsf@qrnik.zagroda>
References: <ed8pd9$ch$1@sea.gmane.org>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
Message-ID: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>

On 9/5/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:
>
> > The whole idea of a default encoding is flawed. Ideally there would be
> > no default; programmers should be forced to think about the issue
> > on a case-by-case basis. In some cases they might choose to open a file
> > with the system encoding, but that should be an explicit decision.
>
> Perhaps this shows a difference between Unix and Windows culture.
>
> On Unix there is definitely a default encoding; this is what most good
> programs operating on text files assume by default. It would be insane
> to have to tell each program separately about the encoding. Locale is
> the OS mechanism used to provide this information in a uniform way.

Windows users do not "tell each program separately about the
encoding." The encoding varies by file type. It makes no more sense to
have a global variable that says "all of my files are Shift-JIS" than
it does to say "all of my files are PowerPoint files." Because someday
somebody is going to email you a Big-5 file (or a zipfile) and that
setting will be wrong. Once you know that a file is of type Zip then
you know that the "encoding" is zipped binary. Once you know that it
is an Office 2007 file, then you know that the encoding is Zipped XML
and that the XML will have its own encoding declaration. Once you know
that it is HTML, then you look for meta tags.

This is how real-world programs work. They shouldn't guess based on
system global variables.

May I ask an empirical question? In your experience, what percentage of
Macintosh users change the default encoding from US-ASCII to something
specific to their culture? What percentage of Ubuntu users change it
from UTF-8 to something specific?

If the answers are "few", then we are talking about a feature that
will break Windows programs and offer little value to Unix and
Macintosh users.

If "many" users change the global system encoding on their modern Unix
distributions then I propose the following. There should be a property
called something like "encodings.recommendedEncoding". On Windows it
should be ASCII. On Unix-like platforms it can be inferred from the
locale. Programmers who know what it means and want to take advantage
of it can do so like this:

opentext(filename, "r", encoding=encodings.recommendedEncoding)

This is almost exactly how C# does it, though it uses the confusing
term "default encoding", which implies a default behaviour.

If no encoding argument is given, the default should be ASCII or perhaps
UTF-8 (either one is relatively safe against silently processing data
incorrectly).
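
A hypothetical sketch of such a helper (recommended_encoding and the
win32 special case merely illustrate the proposal; they are not an
existing API):

import locale, sys, codecs

def recommended_encoding():
    # ASCII on Windows, the locale's charset elsewhere, per the
    # proposal above.
    if sys.platform == 'win32':
        return 'ascii'
    return locale.getpreferredencoding() or 'ascii'

f = codecs.open('notes.txt', 'r', encoding=recommended_encoding())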

 Paul Prescod

From phd at mail2.phd.pp.ru  Wed Sep  6 12:37:51 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 14:37:51 +0400
Subject: [Python-3000] encoding hell
In-Reply-To: <ca471dc20609051809iae9cddcxd13b3a7c35c6c4ae@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
	<20060904102413.GC21049@phd.pp.ru>
	<ca471dc20609051809iae9cddcxd13b3a7c35c6c4ae@mail.gmail.com>
Message-ID: <20060906103751.GD30635@phd.pp.ru>

On Tue, Sep 05, 2006 at 06:09:21PM -0700, Guido van Rossum wrote:
> On 9/4/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> >--- email database file ---
> > phd at phd.pp.ru
> > phd at oper.med.ru
> >--- / ---
> >
> >   The program opens the file in "r+" mode, reads it line by line and
> >stores the position of the first character of every line using tell().
> >When it needs to mark an email it seek()'s to the stored position and
> >writes a '+' mark so the file looks like
> >
> >--- email database file ---
> >+phd at phd.pp.ru
> > phd at oper.med.ru
> >--- / ---
> 
> I don't understand how it can insert a character into the file without
> rewriting everything after that point.

   The essential part of the program is:

from email.Utils import getaddresses  # 'email.utils' from Python 2.5 on

results = open("results", "r+")
name, email = getaddresses([to])[0]   # 'to' holds the To: header value

while 1:
    pos = results.tell()
    line = results.readline()
    if not line:
        break

    if line.strip() == email:
        results.seek(pos)
        results.write('+')
        break

results.close()

   Open the "database" file in "r+" mode, find the email, seek to the
beginning of the line, replace the space with '+'.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From phd at mail2.phd.pp.ru  Wed Sep  6 12:48:39 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 14:48:39 +0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
References: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
Message-ID: <20060906104839.GE30635@phd.pp.ru>

On Wed, Sep 06, 2006 at 03:08:21AM -0700, Paul Prescod wrote:
> Windows users do not "tell each program separately about the
> encoding." The encoding varies by file type. It makes no more sense to
> have a global variable that says "all of my files are Shift-JIS" than
> it does to say "all of my files are PowerPoint files." Because someday
> somebody is going to email you a Big-5 file (or a zipfile) and that
> setting will be wrong. Once you know that a file is of type Zip then
> you know that the "encoding" is zipped binary. Once you know that it
> is an Office 2007 file, then you know that the encoding is Zipped XML
> and that the XML will have its own encoding declaration. Once you know
> that it is HTML, then you look for meta tags.
> 
> This is how real-world programs work. They shouldn't guess based on
> system global variables.

   Unfortunately, the real world is a bit worse than that. There are many
protocols and file formats that carry textual information and still don't
provide a hint about the encoding.
   First, there are text files. Really, there are still text files. A user
can dump a README file onto his/her personal FTP server, and the file is
usually in the local encoding.
   MP3 tags. Real nightmare. Nobody follows the standard - tag editors
write tags in the local encoding, and mp3 players interpret them in the
local encoding.
   FTP and other dumb protocols that transfer file names in the encoding
local to the server without announcing that encoding in the metadata.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From qrczak at knm.org.pl  Wed Sep  6 12:51:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 06 Sep 2006 12:51:55 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com> (Paul
	Prescod's message of "Wed, 6 Sep 2006 03:08:21 -0700")
References: <ed8pd9$ch$1@sea.gmane.org>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
Message-ID: <87k64hl12s.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> Windows users do not "tell each program separately about the
> encoding." The encoding varies by file type.

There are lots of Unix file types which are based on text files
and their encoding is not specified explicitly.

> It makes no more sense to have a global variable that says "all of
> my files are Shift-JIS" than it does to say "all of my files are
> PowerPoint files."

Not all: it's just the default for text files.

> This is how real-world programs work. They shouldn't guess based on
> system global variables.

But they do. It's a fact which is impossible to change with a decree.
There is no place, other than the locale, which would suggest which
encoding is used in /etc files, or in the contents of environment
variables, or on the terminal. You might say that it's unfortunate,
but it's true.

At most you could advocate specifying new file formats with the
encoding in mind, like XML does. This doesn't enrich existing file
formats with that information.

Of course technically these formats are just sequences of bytes,
and most programs pass non-ASCII fragments around without looking
into them deeper. But as long as one tries to treat them as natural
language text, search them case-insensitively, embed text taken from
them in HTML files, then the encoding begins to matter, and there is
a general shift among programming languages to translate it on I/O
to a common format instead of dealing with encoded text on all levels.

> May I ask an empircal question? In your experience, what percentage
> of Macintosh users change the default encoding from US-ASCII to
> something specific to their culture?

I have no experience with Macintoshes at all.

> What percentage of Ubuntu users change it froom UTF-8 to something
> specific?

Why would it matter? If most of their programs use UTF-8, and it's
specified by the locale, then fine. My system uses mostly ISO-8859-2,
and it's also fine, as long as there is a way for the program to get
that information.

If a program can't read my text files or filenames or environment
variables or program invocation arguments, while they are encoded
according to the locale, then the program is broken.

If a file is not encoded using the encoding specified by the locale,
and I don't tell the program explicitly about the encoding, then it's
not the program's fault when it can't read that.

If a language requires extra steps in order to make the locale
encoding work, then it's unhelpful. Most programmers won't bother,
and their programs will work most of the time when they test it,
assuming they use it with English texts. Such programs suddenly break
when used in a non-English speaking country.

> If the answers are "few", then we are talking about a feature that
> will break Windows programs and offer little value to Unix and
> Macintosh users.

How does it break more programs than assuming ASCII does? All
encodings suitable as a system encoding are ASCII supersets, so if
a file can't be read using the locale encoding, it can't be read
in ASCII either.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Wed Sep  6 12:55:04 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 03:55:04 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906104839.GE30635@phd.pp.ru>
References: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
Message-ID: <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>

But how would a system-wide default encoding help with any of these
situations? These situations are IN FACT caused by system-wide default
encodings used by naive programmers. Python should be part of the
solution, not part of the problem.

On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> On Wed, Sep 06, 2006 at 03:08:21AM -0700, Paul Prescod wrote:
> ...
>    Unfortunately, the real world is a bit worse than that. There are many
> protocols and file formats that carry textual information and still don't
> provide a hint about the encoding.
>    First, there are text files. Really, there are still text files. A user
> can dump a README file onto his/her personal FTP server, and the file is
> usually in the local encoding.
>    MP3 tags. Real nightmare. Nobody follows the standard - tag editors
> write tags in the local encoding, and mp3 players interpret them in the
> local encoding.
>    FTP and other dumb protocols that transfer file names in the encoding
> local to the server without announcing that encoding in the metadata.
>
> Oleg.
> --
>      Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
>            Programmers don't die, they just GOSUB without RETURN.

From phd at mail2.phd.pp.ru  Wed Sep  6 13:16:43 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 15:16:43 +0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
Message-ID: <20060906111643.GA4412@phd.pp.ru>

On Wed, Sep 06, 2006 at 03:55:04AM -0700, Paul Prescod wrote:
> But how would a system-wide default encoding help with any of these
> situations? These situations are IN FACT caused by system-wide default
> encodings used by naive programmers. Python should be part of the
> solution, not part of the problem.
> 
> On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> >    First, there are text files. Really, there are still text files. A user
> > can dump a README file onto his/her personal FTP server, and the file is
> > usually in the local encoding.
> >    MP3 tags. Real nightmare. Nobody follows the standard - tag editors
> > write tags in the local encoding, and mp3 players interpret them in the
> > local encoding.
> >    FTP and other dumb protocols that transfer file names in the encoding
> > local to the server without announcing that encoding in the metadata.

   These situations are caused by the lack of metadata or clear
encoding-friendly standards. Ogg, for example, is encoding-friendly - it
clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
files I have seen really were in UTF-8, and all tag editors and players
write/use UTF-8. XML is encoding-friendly - every file specifies its
encoding. The HTTP protocol is mostly encoding-friendly with its
Content-Type header. HTML is partially encoding-friendly, but only
partially - if one saves an HTML page to a file it may lack encoding
information.
   But text files and the FTP protocol don't have any metadata, and ID3v2
doesn't specify a universal encoding or encoding metadata. In these cases
programs can either guess the encoding based on the file content or use
the system global encoding.
   I fail to see how Python can help here.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From brian at sweetapp.com  Wed Sep  6 13:33:43 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Wed, 06 Sep 2006 13:33:43 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87k64hl12s.fsf@qrnik.zagroda>
References: <ed8pd9$ch$1@sea.gmane.org>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	<44FE171C.1090101@blueyonder.co.uk>	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>	<44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<87k64hl12s.fsf@qrnik.zagroda>
Message-ID: <44FEB217.6040507@sweetapp.com>

Marcin 'Qrczak' Kowalczyk wrote:
> Why would it matter? If most of their programs use UTF-8, and it's
> specified by the locale, then fine. My system uses mostly ISO-8859-2,
> and it's also fine, as long as there is a way for the program to get
> that information.

The problem is that blindly using the system encoding is error prone.

For example, I would imagine that when you type:

% less /usr/lib/python2.4/getopt.py

you see "Peter ?strand" rather than "Peter ?strand".

That happens because getopt.py is encoded in ISO-8859-1 and you are 
using ISO-8859-2 as your default encoding. Maybe you don't care about 
the display glitch but there are applications where it would be a big 
deal, e.g. populating a database based on the content of text files.
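
The mix-up is easy to reproduce (a self-contained sketch of the
Latin-1/Latin-2 confusion):

name = u'Peter \xc5strand'              # A-ring, as the author wrote it
data = name.encode('iso-8859-1')        # bytes as stored in getopt.py
print(repr(data.decode('iso-8859-2')))  # u'Peter \u0139strand' (L-acute)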

> If a program can't read my text files or filenames or environment
> variables or program invocation arguments, while they are encoded
> according to the locale, then the program is broken.

How can the program know if the file is encoded according to your 
locale? Do you think that all of the text files on your system are 
encoded using ISO-8859-2? Should Python really just guess for you?

> If a file is not encoded using the encoding specified by the locale,
> and I don't tell the program explicitly about the encoding, then it's
> not the program's fault when it can't read that.
> 
> If a language requires extra steps in order to make the locale
> encoding work, then it's unhelpful.

No, it's favoring caution and trying to avoid letting errors slip 
through. If programmers believe they understand the issues and want to
use the locale encoding setting, it will cost them <20 characters of
typing per file open to do so.

> Most programmers won't bother,
> and their programs will work most of the time when they test it,
> assuming they use it with English texts. Such programs suddenly break
> when used in a non-English speaking country.

And that is a great thing! Their program will break in a nice clean 
understandable way, instead of proceeding and generating incorrect results.

Cheers,
Brian

From murman at gmail.com  Wed Sep  6 15:18:19 2006
From: murman at gmail.com (Michael Urman)
Date: Wed, 6 Sep 2006 08:18:19 -0500
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906111643.GA4412@phd.pp.ru>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
Message-ID: <dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>

On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
>    These situations are caused by the lack of metadata or clear
> encoding-friendly standards. Ogg, for example, is encoding-friendly - it
> clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
> files I have seen really were in UTF-8, and all tag editors and players
> write/use UTF-8.

And yet I've run across Vorbis comments encoded in Latin-1. It screws
everyone else up, but there are always going to be applications that
do not play along.

> XML is encoding-friendly - every file specifies its encoding.

And plenty of people use methods to read and write it which cannot
cope with non ascii files.

> The HTTP protocol is mostly encoding-friendly with its Content-Type
> header. HTML is partially encoding-friendly, but only partially - if one
> saves an HTML page to a file it may lack encoding information.

Right; HTTP has the means to indicate the encoding, but rarely does it
have the means to acquire it.

>    But text files and the FTP protocol don't have any metadata, and ID3v2
> doesn't specify a universal encoding or encoding metadata. In these cases
> programs can either guess the encoding based on the file content or use
> the system global encoding.

Actually, ID3v2 offers exactly four encodings: latin1, UTF16,
UTF16-BE, and UTF8. However, UTF16 isn't endian-determined, and latin1
has been abused and holds Windows-ACP-encoded text more often than
not, so it's a poor indicator. Another case of applications ignoring
the spec and doing what's easy. (I don't recall exactly when the
Unicode encoding options were added, so they may have had little
choice; more likely they were too lazy to use UTF16, or it wouldn't
work on their portable device.)
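
For reference, a sketch of the ID3v2.4 text-encoding byte mapped to Python
codec names (v2.3 defines only the first two):

ID3_TEXT_ENCODINGS = {0: 'latin-1',    # in practice often the Windows ACP
                      1: 'utf-16',     # with BOM
                      2: 'utf-16-be',  # v2.4 only
                      3: 'utf-8'}      # v2.4 only

def decode_id3_text(enc_byte, payload):
    return payload.decode(ID3_TEXT_ENCODINGS[enc_byte])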

>    I fail to see how Python can help here.

Absolutely agreed. I suspect the best option is some sort of TextFile
constructor that defaults to ASCII (or has no default) but accepts an
easy way to use the "recommended" or system encoding, or any explicit
one. And for more complicated formats, the code will just have to use
a bytestream layer and decode as necessary. This may be a pain for
mbox files, but unless there's a way to switch encodings on the fly, a
seemingly textual file will have to be treated as binary (newlines
excepted, I hope).
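
Such a TextFile constructor might look like this (purely a sketch; the
TextFile name and the "system" marker are illustrative, not a real API):

f = TextFile("notes.txt")                     # strict default (ASCII), or an error
f = TextFile("notes.txt", encoding="system")  # explicit opt-in to the locale encoding
f = TextFile("notes.txt", encoding="koi8-r")  # any explicit encoding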

I also hope that, if the "recommended" encoding uses a heuristic on
the file's contents, the file has enough data in the encoding to make
a good guess. Music metadata rarely does. :)

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk  Tue Sep  5 02:28:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 05 Sep 2006 01:28:54 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
Message-ID: <44FCC4C6.9030500@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>> Guido van Rossum wrote:
>>
>> > I've always said (can someone find a quote perhaps?) that there ought
>> > to be a sensible default encoding for files (including but not limited
>> > to stdin/out/err), perhaps influenced by personalized settings,
>> > environment variables, the OS, etc.
>>
>> While it should be possible to find out what the OS believes to be
>> the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows;
>> LC_CHARSET environment variable on Unix), that does not mean that it
>> is this charset that Python programs should normally use. When defining
>> a new text-based file type, it is simpler to define it to be always
>> UTF-8.
> 
> In this particular case I don't care what's simpler to implement,

The issue is not simplicity of implementation; it is what will provide
the simplest usage model in the long term. If new files are encoded in X
just because most of a user's existing files are encoded in X, then how is
the user supposed to migrate to a different encoding? Language specifications
can have a significant effect in helping migration to Unicode.

> but what's most likely to do what the user expects.

In practice, the system charset is often set to the charset that should
be used as a fallback *for applications that do not support Unicode*. This
is especially true on Windows systems.

Using UTF-8 by default for new file types is not only simpler, it's more
functional. If a BOM is written at the start of the file, and if the user
edits files with a text editor that recognizes this, then everything,
including writing text in multiple scripts, will Just Work from the user's
point of view.
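
Concretely (Python 2; codecs.BOM_UTF8 is the three-byte UTF-8 signature):

import codecs

f = open("notes.txt", "wb")
f.write(codecs.BOM_UTF8)                        # UTF-8 signature
f.write(u"text in any script".encode("utf-8"))
f.close()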

> If on a particular box
> most files are encoded in encoding X, and the user did whatever is
> necessary to tell the tools that that's their preferred encoding, I
> want Python to honor that encoding when opening text files, unless the
> program makes other arrangements explicitly (such as specifying an
> explicit encoding as a parameter to open()).

I would prefer that there is no default. But since that is incompatible
with the existing API for open(), I accept that I'm not likely to win
that argument.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From jimjjewett at gmail.com  Wed Sep  6 18:50:24 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Sep 2006 12:50:24 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FCC4C6.9030500@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<44FCC4C6.9030500@blueyonder.co.uk>
Message-ID: <fb6fbf560609060950r3e0634cfsd32385fb902c6bb3@mail.gmail.com>

On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:

> The issue is not simplicity of implementation; it is what will provide
> the simplest usage model in the long term. If new files are encoded in X
> just because most of a user's existing files are encoded in X, then how is
> the user supposed to migrate to a different encoding? ...

> In practice, the system charset is often set to the charset that should
> be used as a fallback *for applications that do not support Unicode*.

Are you assuming that most uses of open will be for new files, *and*
that these files will not also be read by such unicode-ignorant
applications?

Since we're only talking about text files that do not have an explicit
encoding, I can barely imagine *either* of these conditions being
true.

-jJ

From paul at prescod.net  Wed Sep  6 19:15:44 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:15:44 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87k64hl12s.fsf@qrnik.zagroda>
References: <ed8pd9$ch$1@sea.gmane.org>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<87k64hl12s.fsf@qrnik.zagroda>
Message-ID: <1cb725390609061015h1c953b7l765e42cacdff2a71@mail.gmail.com>

On 9/6/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Paul Prescod" <paul at prescod.net> writes:
>
> > Windows users do not "tell each program separately about the
> > encoding." The encoding varies by file type.
>
> There are lots of Unix file types which are based on text files
> and their encoding is not specified explicitly.

Of course. But you asserted that the Windows world was insane and I
made the point that it is not. They've just consciously and explicitly
moved away from the situation where the encoding is inferred from the
environment instead of from the file's context. I'm not starting a
Windows versus Unix debate. I'm talking about the direction in which
the world is moving.

Python need not move forward in that direction, but it should not move
backwards. Today, Python does not use the locale in inferring a file's
type. Python also explicitly chose not to use the locale in inferring
string encodings when Unicode was added.

I'm not saying that Python programmers should be disallowed from using
the system locale. I'm saying that Python itself should "resist the
urge to guess" encodings. Python programmers who want to guess could
have an easy, one-line way, as C# programmers do.

> But they do. It's a fact which is impossible to change with a
> decree.

I'm not trying to change tools. I'm asking that Python not emulate
their broken behaviour. If a Python programmer wants to do so, then
they should add one line of code.

> > What percentage of Ubuntu users change it from UTF-8 to something
> > specific?
>
> Why would it matter?

I said explicitly why it matters in my first message. If most Unix
users just accept system defaults, then the feature is of no value to
them, while it actively hurts Windows programmers. So you have
decreasing value on one side and a steady amount of pain on the other.

> If a program can't read my text files or filenames or environment
> variables or program invocation arguments, while they are encoded
> according to the locale, then the program is broken.

Either you are saying that Python is broken today, or you are saying
that Python should allow people to write programs that are "not
broken" according to your definition. In the former case, I disagree.
In the latter case, I agree. The only thing we could disagree on is
whether Python's default behaviour should be to guess the encodings
based upon locale, despite Python's long history of avoiding guessing
in general and guessing encodings in particular.

>...
> If a language requires extra steps in order to make the locale
> encoding work, then it's unhelpful. Most programmers won't bother,
> and their programs will work most of the time when they test it,
> assuming they use it with English texts. Such programs suddenly break
> when used in a non-English speaking country.

Loudly and suddenly breaking is better than silently munging data.
There are vast application classes where using the system encoding is
the wrong thing. For example, an FTP server. An application working
with data from a remote socket. An application working with a file
from a remote server. An application working with incoming email.
Python cannot know whether you are building a client/server
application or a script for working with local files. It can't even
really know whether a file that it opens is truly local. So it
shouldn't guess.

> > If the answers are "few", then we are talking about a feature that
> > will break Windows programs and offer little value to Unix and
> > Macintosh users.
>
> How does it break more programs than assuming ASCII does? All
> encodings suitable as a system encoding are ASCII supersets, so if
> a file can't be read using the locale encoding, it can't be read
> in ASCII either.

If a program expecting ASCII sees an unknown character then it can
throw an exception and say: "You haven't thought through the
internationalization aspects properly. Read the Python docs for more
information." Silently munging data is worse. "In the face of
ambiguity, refuse the temptation to guess."
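
For example, strict ASCII decoding already fails loudly today (Python 2):

>>> "r\xc3\xa9sum\xc3\xa9".decode("ascii")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)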

 Paul Prescod

From paul at prescod.net  Wed Sep  6 19:21:33 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:21:33 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906111643.GA4412@phd.pp.ru>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
Message-ID: <1cb725390609061021y11e37727kb6c94668392a36f7@mail.gmail.com>

On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> On Wed, Sep 06, 2006 at 03:55:04AM -0700, Paul Prescod wrote:
>    These situations are caused by the lack of metadata or clear
> encoding-friendly standards. Ogg, for example, is encoding-friendly - it
> clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
> files I have seen were really in UTF-8, and all tag editors and players
> write/use UTF-8.

Michael Urman disagrees with you. He says that he sometimes sees
Latin-1 encoded files. Let's trace back how that could have happened.

1. The end-user must have had Latin-1 as their system encoding.

2. The programmer of the ID tagging app had not thought through encoding issues.

3. The programming language either implicitly encoded the data
according to the locale or treated it as binary data. (unless the
programmer did this on purpose, which would imply that he was VERY
confused and not just lazy)

>    I fail to see how Python can help here.

Python can refuse to be the programming language in Step 3 that
guesses the appropriate encoding without consulting the programmer or
end-user.

 Paul Prescod

From paul at prescod.net  Wed Sep  6 19:23:37 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:23:37 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
	<dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>
Message-ID: <1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com>

On 9/6/06, Michael Urman <murman at gmail.com> wrote:
> ... I suspect the best option is some sort of TextFile
> constructor that defaults to ASCII (or has no default) but accepts an
> easy way to use the "recommended" or system encoding, or any explicit
> one.

That's exactly what I'm asking for.

 Paul Prescod

From paul at prescod.net  Wed Sep  6 19:28:12 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:28:12 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FCC4C6.9030500@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<44FCC4C6.9030500@blueyonder.co.uk>
Message-ID: <1cb725390609061028g285565dasd03dd58e80602dd9@mail.gmail.com>

On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>...
> I would prefer that there is no default. But since that is incompatible
> with the existing API for open(), I accept that I'm not likely to win
> that argument.

First, can you outline how the proposal of no default is incompatible
with the existing API for open?

print open("Documents/foo.py").encoding

Second: the whole IO library is being overhauled. How can backwards
compatibility be an issue?

 Paul Prescod

From murman at gmail.com  Wed Sep  6 20:28:09 2006
From: murman at gmail.com (Michael Urman)
Date: Wed, 6 Sep 2006 13:28:09 -0500
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
	<dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>
	<1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com>
Message-ID: <dcbbbb410609061128v233e98e5y3d2fd13323e80c76@mail.gmail.com>

On 9/6/06, Paul Prescod <paul at prescod.net> wrote:
> On 9/6/06, Michael Urman <murman at gmail.com> wrote:
> > ... I suspect the best option is some sort of TextFile
> > constructor that defaults to ASCII (or has no default) but accepts an
> > easy way to use the "recommended" or system encoding, or any explicit
> > one.
>
> That's exactly what I'm asking for.

I suspect the difference in attitudes between us and those who don't
want explicit encodings is that we've dealt with the mess of
extracting information from various sources that use arbitrary
encodings, either indicated incorrectly or not at all, and we want
Python to help break that cycle. Those who want the ease of a TextFile
constructor which magically supplies the "recommended" (local?)
encoding might only deal with data in their local encoding, and aren't
aware that code like theirs provides the problem case for those who
deal with more. Not because a text file in the local encoding is a
problem, but because if they're not thinking of encoding there, they
won't think of it where it matters.

I have to learn more about the Japanese distaste for the Unicode
system, but I don't see how that could influence me into accepting,
e.g.,  ms932 as a silently-requested encoding. Do you have any clue if
or where that fits in?

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk  Thu Sep  7 02:46:11 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 07 Sep 2006 01:46:11 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <fb6fbf560609060950r3e0634cfsd32385fb902c6bb3@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<44FCC4C6.9030500@blueyonder.co.uk>
	<fb6fbf560609060950r3e0634cfsd32385fb902c6bb3@mail.gmail.com>
Message-ID: <44FF6BD3.6060409@blueyonder.co.uk>

Jim Jewett wrote:
> On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>> The issue is not simplicity of implementation; it is what will provide
>> the simplest usage model in the long term. If new files are encoded in X
>> just because most of a user's existing files are encoded in X, then
>> how is the user supposed to migrate to a different encoding? ...
> 
>> In practice, the system charset is often set to the charset that should
>> be used as a fallback *for applications that do not support Unicode*.
> 
> Are you assuming that most uses of open will be for new files,

No, I'm refusing to make the assumption that all uses will be for old
files.

My position is that there should be no default encoding (not ASCII either,
although I may differ with Paul Prescod on that point). Note that Py3K is
the only opportunity to remove the idea of a default encoding -- Python
2.5 by default opens text files as US-ASCII, so this would be an incompatible
API change.

If a programmer explicitly chooses to open files with the system encoding
(by adding an "encoding=sys.get_file_content_encoding()" argument to a
file open call), that's absolutely fine. In that case they must have
considered encoding issues for at least a few seconds. That is the best
we can do.

APIs that open files should also be designed to allow auto-detection of
the encoding based on content. This requires that the detected encoding
be returned from the file open call, so that if the file needs to be
rewritten, that can be done in the same encoding that was detected (which
is the behaviour least likely to break existing applications that may read
the same file).
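
A sketch of that round trip (the "detect" marker and the .encoding
attribute are illustrative only, not a settled API):

f = open("config.txt", encoding="detect")
text = f.read()
detected = f.encoding            # the open call reports what it found
f.close()

# Rewrite in the encoding that was detected, so existing readers of
# the file keep working.
out = open("config.txt", "w", encoding=detected)
out.write(text)
out.close()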

> *and* that these files will not also be read by such unicode-ignorant
> applications?

I'm not making that assumption either.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From tomerfiliba at gmail.com  Thu Sep  7 19:30:45 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Thu, 7 Sep 2006 19:30:45 +0200
Subject: [Python-3000] iostack, second revision
Message-ID: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>

[Guido]
> As long as the state of the decoder is "neutral" at the start of a
> line, it should be possible to do this. I like the idea that tell()
> returns a "cookie" which is really a byte offset. If one wants to be
> able to seek to positions with a non-neutral decoder state, the cookie
> would have to be more abstract. It shouldn't matter; text apps should
> not do arithmetic on seek/tell positions.

[Andres]
> In all my programming days I don't believe I've written to and read from
> the same file handle even once.  Use cases exist, like if you're
> implementing a DBMS, or adding to a zip file in-place, but they're the
> exception, and by separating that functionality out in a dedicated
> class like FileBytes, you avoid having the complexities of mixed input
> and output affect your typical use cases.
[...]
> Watch out!  There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account.  For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents.  For read/write files, it's a whole different
> ballgame, because stream position and data are shared.

[Talin]
> Now, I'm not saying that you can't stick additional layers in-between
> TextReader and FileStream if you want to. An example might be the
> "resync" layer that you mentioned, or a journaling layer that ensures
> that all writes are recoverable. I'm merely saying that for the specific
> issue of buffering, I think that the choice of buffer type is
> complicated, and requires knowledge that might not be accessible to the
> person assembling the stack.

---

lots of things have been discussed, lots of new ideas came:
it's time to rethink the design of iostack; i'll try to see into it.

there are several key issues:
* splitting streams to separate reading and writing sides.
* the underlying OS resource can be separated into some very low
level abstraction layer, over which streams would operate.
* stateful-seek-cookies sound like the perfect solution

issues with seeking:
being opaque, there's no sense in having the long-debated
position property (although i really liked it :)). i.e., there's no sense
in doing s.position += some_opaque_cookie

on the other hand, since streams are byte-oriented, over which the
data abstraction layer (text, etc.) is placed, maybe there's sense in
splitting these into two distinct APIs:

* tell()/seek() for the byte-level stream position: a stream is just a
sequence of bytes in which you can seek.
* data-abstraction-layer "pointers": pointers will be stateful stream
locations of encoded *objects*.

you will not be able to "forge" pointers: you'll first have to come across
a valid object location, and only then can you get a "pointer" pointing to it.
of course these pointers should be kept cheap, and for most situations,
plain integers would suffice.

example:

f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
f.write("hello world")
p = f.get_pointer()
f.write("wide web")
f.set_pointer(p)

or using a property:
p = f.pointer
f.pointer = p

something like that... though i would like to receive comments on
that first, before i go into deeper meditation :)


-tomer

From paul at prescod.net  Thu Sep  7 21:21:12 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 7 Sep 2006 12:21:12 -0700
Subject: [Python-3000] Help on text editors
Message-ID: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>

Guido has asked me to do some research in aid of a file encoding
detection/defaulting PEP.

I only have access to a small number of operating systems and language
variants so I need help.

If you have access to "German Windows XP", "Japanese Windows XP", "Spanish
OS X",  "Japanese OS X", "German Ubuntu" etc., I would appreciate answers to
the following questions.

1. On US English Windows, Notepad defaults to an encoding called "ANSI".
"ANSI" is not a real encoding at all (and certainly not one from the
American National Standards Institute -- they should sue!). ANSI is just
the default Windows character set for your localization set. What does
"ANSI" map to in European and Asian versions of Windows?

2. On my English Mac, the default character set for textedit is "Mac OS
Roman". What is it for foreign language macs? What API does an application
use to query this default character set? What setting is it derived from?
The Unix-level locale (seems not!) or some GUI-level setting (which one)?

3. In general, how do modern versions of Linux and other Unix handle this
issue? In particular: what is your default encoding and how did your
operating system determine it? Did you install a locale-specific version?
Did the installer ask you? Did you edit a configuration file? Did you change
a GUI setting? What is the relationship between your localization of
Gnome/KDE and your default encoding?

 Paul Prescod

From solipsis at pitrou.net  Thu Sep  7 22:13:56 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 07 Sep 2006 22:13:56 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <1157660036.8533.18.camel@fsol>


Hi,

On Thursday 7 September 2006 at 12:21 -0700, Paul Prescod wrote:
> If you have access to "German Windows XP", "Japanese Windows XP",
> "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would
> appreciate answers to the following questions. 

French Mandriva (up-to-date development version).

> In particular: what is your default encoding and how did your
> operating system determine it?

My locale is named "fr_FR" and the encoding is iso-8859-15.

> Did you install a locale-specific version? Did the installer ask you?

No, it's the built-in config. I don't remember the installer asking me
anything except the language and keyboard layout.

> What is the relationship between your localization of Gnome/KDE and
> your default encoding? 

Ok, I hexdump'ed a few .mo files (the gettext-compatible files which
contain translation strings) and the result is a bit funny:
Gnome/KDE .mo files use utf-8, while .mo files for various command-line
tools (e.g. aspell) use iso-8859-15.

Also, it is interesting to know that Gnome tools like gedit (the Gnome
text editor) normally default to utf-8; however, gedit was patched by
Mandriva to use the system encoding by default (which breaks character
set auto-detection because the Mandriva patch is awful:
http://qa.mandriva.com/show_bug.cgi?id=20277).


By the way, you should be aware that filesystems have their own
encodings which can differ from the default system encoding
(depending on how it's declared in /etc/fstab). I don't know of a simple
way to retrieve the encoding for a given directory (except trying to
find out the filesystem mounting point and parsing /etc/fstab...
*sigh*). This can be annoying when handling non-ASCII filenames.

Regards

Antoine.



From qrczak at knm.org.pl  Thu Sep  7 23:22:31 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 07 Sep 2006 23:22:31 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1157660036.8533.18.camel@fsol> (Antoine Pitrou's message of
	"Thu, 07 Sep 2006 22:13:56 +0200")
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
Message-ID: <8764fzgync.fsf@qrnik.zagroda>

Antoine Pitrou <solipsis at pitrou.net> writes:

> By the way, you should be aware that filesystems have their own
> encodings which can differ from the default system encoding
> (depending on how it's declared in /etc/fstab). I don't know of a
> simple way to retrieve the encoding for a given directory (except
> trying to find out the filesystem mounting point and parsing
> /etc/fstab... *sigh*). This can be annoying when handling non-ASCII
> filenames.

I believe the intent is to set up all filesystems to use the same
encoding externally. The encoding setting exists only for some
filesystems, especially those which use UTF-16 internally, where
it would be impossible to physically store filenames in the default
system encoding, or where the filesystem is likely to be created
on a different system with a different encoding.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Fri Sep  8 00:41:30 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 7 Sep 2006 15:41:30 -0700
Subject: [Python-3000] Help on text editors
In-Reply-To: <1157660036.8533.18.camel@fsol>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
Message-ID: <1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>

Are you plugged into the Mandriva community? Is there any debate about the
continued use of iso8859-15? Obviously it has the benefit of backwards
compatibility and slightly smaller file sizes. But it also has very severe
limitations and interoperability problems as you describe below.

On 9/7/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
>
> Hi,
>
> On Thursday 7 September 2006 at 12:21 -0700, Paul Prescod wrote:
> > If you have access to "German Windows XP", "Japanese Windows XP",
> > "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would
> > appreciate answers to the following questions.
>
> French Mandriva (up-to-date development version).
>
> > In particular: what is your default encoding and how did your
> > operating system determine it?
>
> My locale is named "fr_FR" and the encoding is iso-8859-15.
>
> > Did you install a locale-specific version? Did the installer ask you?
>
> No, it's the built-in config. I don't remember the installer asking me
> anything except the language and keyboard layout.
>
> > What is the relationship between your localization of Gnome/KDE and
> > your default encoding?
>
> Ok, I hexdump'ed a few .mo files (the gettext-compatible files which
> contain translation strings) and the result is a bit funny:
> Gnome/KDE .mo files use utf-8, while .mo files for various command-line
> tools (e.g. aspell) use iso-8859-15.
>
> Also, it is interesting to know that Gnome tools like gedit (the Gnome
> text editor) normally default to utf-8, however gedit was patched by
> Mandriva to use the system encoding by default (which breaks character
> set auto-detection because the Mandriva patch is awful :
> http://qa.mandriva.com/show_bug.cgi?id=20277).
>
>
> By the way, you should be aware that filesystems have their own
> encodings which can differ from the default system encoding
> (depending on how it's declared in /etc/fstab). I don't know of a simple
> way to retrieve the encoding for a given directory (except trying to
> find out the filesystem mounting point and parsing /etc/fstab...
> *sigh*). This can be annoying when handling non-ASCII filenames.
>
> Regards
>
> Antoine.
>
>
>

From guido at python.org  Fri Sep  8 01:33:43 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Sep 2006 16:33:43 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
Message-ID: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>

On 9/7/06, tomer filiba <tomerfiliba at gmail.com> wrote:
> lots of things have been discussed, lots of new ideas came:
> it's time to rethink the design of iostack; i'll try to see into it.
>
> there are several key issues:
> * splitting streams to separate reading and writing sides.
> * the underlying OS resource can be separated into some very low
> level abstraction layer, over which streams would operate.
> * stateful-seek-cookies sound like the perfect solution
>
> issues with seeking:
> being opaque, there's no sense in having the long debated
> position property (although i really liked it :)). i.e., there's no sense
> in doing s.position += some_opaque_cookie
>
> on the other hand, since streams are byte-oriented, over which the
> data abstraction layer (text, etc.) is placed, maybe there's sense in
> splitting these into two distinct APIs:
>
> * tell()/seek() for the byte-level stream position: a stream is just a
> sequence of bytes in which you can seek.
> * data-abstraction-layer "pointers": pointers will be stateful stream
> locations of encoded *objects*.
>
> > you will not be able to "forge" pointers: you'll first have to come across
> > a valid object location, and only then can you get a "pointer" pointing to it.
> of course these pointers should be kept cheap, and for most situations,
> plain integers would suffice.

Using plain ints makes them trivially forgeable though. Not sure I
mind, just noticing.

> example:
>
> f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
> f.write("hello world")
> p = f.get_pointer()
> f.write("wide web")
> f.set_pointer(p)

Why not use tell() and seek() instead of get_pointer() and
set_pointer()? Seek should also support several special cases:
f.seek(0) seeks to the start of the file no matter what type is
otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
no-op, f.seek(0, 2) seeks to EOF.

> or using a property:
> p = f.pointer
> f.pointer = p

Since the creation of a seek cookie may be relatively expensive (since
it may have to ask the decoder a rather personal question :-) it
should be a method, not a property.

> something like that... though i would like to receive comments on
> that first, before i go into deeper meditation :)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From hasan.diwan at gmail.com  Fri Sep  8 02:41:30 2006
From: hasan.diwan at gmail.com (Hasan Diwan)
Date: Thu, 7 Sep 2006 17:41:30 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
Message-ID: <2cda2fc90609071741t1b7fc4a9gff48836d187367da@mail.gmail.com>

I was thinking about the new IOStack and could not come up with a use case
requiring both line-oriented and record-oriented read/write
functionality -- the general case is record-oriented; lines are just
newline-terminated records. Perhaps this has already been dropped, but I
seem to recall the original spec having a readrec/writerec?  Similarly,
readline/writeline aren't needed. For example...

import sys

class Stream(object):
    def read(self):
        raise Exception('cannot read')

    def readrec(self, terminator):
        # Read one byte at a time until the record ends with the
        # terminator (or EOF is reached).
        ret = ''
        while not ret.endswith(terminator):
            ch = self.read()
            if not ch:  # EOF
                break
            ret = ret + ch
        return ret

    def write(self, data):
        raise Exception('cannot write')

    def writerec(self, data, terminator):
        # Write one terminator-delimited record.
        self.write(data + terminator)

class InputStream(Stream):
    def read(self):  # reads 1 byte
        return sys.stdin.read(1)

    def readline(self):
        return self.readrec('\n')  # or whatever constant represents the EOL
-- 
Cheers,
Hasan Diwan <hasan.diwan at gmail.com>

From murman at gmail.com  Fri Sep  8 03:05:09 2006
From: murman at gmail.com (Michael Urman)
Date: Thu, 7 Sep 2006 20:05:09 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>

On 9/7/06, Paul Prescod <paul at prescod.net> wrote:
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> What does "ANSI" map to in European and Asian versions of Windows?

On most Western European configurations, the ANSI Code Page is
historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may
be something different now for supporting the EURO symbol. Japanese
machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or
close enough). I don't know exactly which ACPs match other languages
off the top of my head.

I expect notepad will default to the ACP encoding whenever a file is
detected as such, or a new file contains only characters representable
via that code page. Otherwise I expect it will default to "Unicode"
(UTF-16 / UCS-2). When editing an existing file, it will default to
the detected encoding, unless "Unicode" is required to save the
changes. It uses BOMs to mark all unicode encodings, but doesn't
require them to be present in order to detect "Unicode."
http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

> 3. In general, how do modern versions of Linux and other Unix handle this
> issue?

I use en-US.UTF-8, after many years of C or en-US.ISO-8859-1. Due to
the age of my install, this was not the default, but now I use it as
pervasively as possible. I set it via GDM these days, but via my shell
rc file originally.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  8 04:03:55 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:03:55 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <4500CF8B.6040003@blueyonder.co.uk>

Paul Prescod wrote:
> Guido has asked me to do some research in aid of a file encoding
> detection/defaulting PEP.
> 
> I only have access to a small number of operating systems and language
> variants so I need help.
> 
> If you have access to "German Windows XP", "Japanese Windows XP",

Since Win2K there is actually no such thing, from a technical point of view --
just Win2K or WinXP with a German or Japanese "language group" installed,
and a corresponding locale selected as the interface locale for a given user
account. The links below should make this clearer.

> "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would appreciate
> answers to the following questions.
> 
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> "ANSI" is not a real encoding at all (and certainly not one from the
> American National Standards Institute -- they should sue!). ANSI is just
> the default Windows character set for your localization set. What does
> "ANSI" map to in European and Asian versions of Windows?

See <http://www.microsoft.com/globaldev/DrIntl/faqs/Locales.mspx>,
<http://www.microsoft.com/globaldev/reference/WinCP.mspx>, and
<http://www.microsoft.com/globaldev/reference/win2k/setup/localsupport.mspx>.

Each "language group" maps to a similarly named "ANSI" code page (and also
an "OEM" code page) in the obvious way.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  8 04:12:27 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:12:27 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <4500CF8B.6040003@blueyonder.co.uk>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<4500CF8B.6040003@blueyonder.co.uk>
Message-ID: <4500D18B.1040404@blueyonder.co.uk>

David Hopwood wrote:
> Paul Prescod wrote:
> 
>>Guido has asked me to do some research in aid of a file encoding
>>detection/defaulting PEP.
>>
>>I only have access to a small number of operating systems and language
>>variants so I need help.
>>
>>If you have access to "German Windows XP", "Japanese Windows XP",
> 
> Since Win2K there is actually no such thing, from a technical point of view --
> just Win2K or WinXP with a German or Japanese "language group" installed,

This is right...

> and a corresponding locale selected as the interface locale for a given user account.

Correction: the "System Locale" is what determines the ANSI and OEM codepages,
and this is *not* dependent on the user account. Changing it requires a reboot,
so you can assume that it stays constant for the lifetime of a Python process.

> The links below should make this clearer.

I obviously should have read them more thoroughly myself! :-(

> See <http://www.microsoft.com/globaldev/DrIntl/faqs/Locales.mspx>,
> <http://www.microsoft.com/globaldev/reference/WinCP.mspx>, and
> <http://www.microsoft.com/globaldev/reference/win2k/setup/localsupport.mspx>.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  8 04:46:40 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:46:40 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
Message-ID: <4500D990.3000707@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, Paul Prescod <paul at prescod.net> wrote:
> 
>>1. On US English Windows, Notepad defaults to an encoding called "ANSI".
>>What does "ANSI" map to in European and Asian versions of Windows?
> 
> On most Western European configurations, the ANSI Code Page is
> historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may
> be something different now for supporting the EURO symbol.

None of the Windows-125x code page numbers changed when '€' was added. These
are "open" encodings in the Unicode and ISO terminology; i.e. there is an
authority (Microsoft) who can assign any previously unassigned code point at
any time.

> Japanese machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or
> close enough).

Not close enough, actually. Cp932 is a superset of US-ASCII, whereas Shift-JIS
isn't: 0x5C represents '\' and '¥' respectively. If you think about how
important '\' is as an escaping metacharacter, this is quite a big deal
(there are other differences, but they are less important). Actual practice
in Japan is that 0x5C *can* be used as an escaping metacharacter with the
semantics of '\' (even if it is sometimes displayed as '¥'), and so Cp932 is
the encoding that should be used, even on non-Microsoft OSes.

> I expect notepad will default to the ACP encoding whenever a file is
> detected as such, or a new file contains only characters representable
> via that code page. Otherwise I expect it will default to "Unicode"
> (UTF-16 / UCS-2). When editing an existing file, it will default to
> the detected encoding, unless "Unicode" is required to save the
> changes. It uses BOMs to mark all unicode encodings, but doesn't
> require them to be present in order to detect "Unicode."
> http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

Yes. However, this is not a good idea for precisely the reason described
on that page (false detection of Unicode), and so any Unicode detection
algorithm in Python should only be based on detecting a BOM, IMHO.
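
A BOM-only detector along those lines is straightforward (a sketch; the
UTF-32 BOMs must be tested before the UTF-16 ones, since a UTF-32-LE BOM
begins with the UTF-16-LE one):

import codecs

def detect_bom(data):
    # Return the codec name indicated by a leading BOM, or None.
    for bom, name in [(codecs.BOM_UTF8, 'utf-8'),
                      (codecs.BOM_UTF32_LE, 'utf-32-le'),
                      (codecs.BOM_UTF32_BE, 'utf-32-be'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')]:
        if data.startswith(bom):
            return name
    return None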

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>





From jeff at soft.fujitsu.com  Fri Sep  8 05:09:26 2006
From: jeff at soft.fujitsu.com (Jeff Wilcox)
Date: Fri, 8 Sep 2006 12:09:26 +0900
Subject: [Python-3000] Help on text editors
In-Reply-To: <mailman.528.1157676092.5278.python-3000@python.org>
Message-ID: <LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>

> From: "Paul Prescod" <paul at prescod.net>
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> "ANSI" is not a real encoding at all (and certainly not one from the
On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
version.  It actually writes Shift JIS though.

> 2. On my English Mac, the default character set for textedit is "Mac OS
> Roman". What is it for foreign language macs? What API does an application
> use to query this default character set? What setting is it derived from?
> The Unix-level locale (seems not!) or some GUI-level setting (which one)?

Mac OS X actually doesn't have different language versions of the operating
system.  If you change the language setting, the Japanese version *becomes*
the English version and vice versa. (Several of the English speakers that I
work with have purchased Japanese Macs and switched them over to English,
they're indistinguishable from English Macs afterwards.  Similarly, several
Macs purchased in the US have been successfully switched to Japanese, and
become indistinguishable from Macs bought in Japan.)

> 3. In general, how do modern versions of Linux and other Unix handle this
> issue? In particular: what is your default encoding and how did your

On Vine Linux (popular in Japan), the default text encoding is EUC with no
configuration changes.



From solipsis at pitrou.net  Fri Sep  8 09:02:48 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 09:02:48 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
Message-ID: <1157698968.4636.3.camel@fsol>

On Thursday 7 September 2006 at 16:33 -0700, Guido van Rossum wrote:
> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()? Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.

Perhaps it would be good to drop those magic numbers (0, 1, 2) for
seek()? They don't really help readability except perhaps for people
who still do a lot of C ;)

Regards

Antoine.



From solipsis at pitrou.net  Fri Sep  8 09:08:41 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 09:08:41 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
	<1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>
Message-ID: <1157699321.4636.9.camel@fsol>


On Thursday 7 September 2006 at 15:41 -0700, Paul Prescod wrote:
> Are you plugged into the Mandriva community?

Not much. I only participate in bug reports ;)

> Is there any debate about the continued use of iso8859-15?

I think there has been some for years. Some people in the community push
for UTF-8, but I guess the problem is related to Mandriva company
management or priority-setting policies.

Regards

Antoine.



From hasan.diwan at gmail.com  Fri Sep  8 09:26:55 2006
From: hasan.diwan at gmail.com (Hasan Diwan)
Date: Fri, 8 Sep 2006 00:26:55 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157698968.4636.3.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1157698968.4636.3.camel@fsol>
Message-ID: <2cda2fc90609080026s7184815fh19345ba764d03c90@mail.gmail.com>

On 08/09/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek()? They don't really help readability except perhaps for people
> who still do a lot of C ;)
>

+1
If we can't or don't want to eliminate the "magic numbers" entirely, perhaps we
could assign symbolic constants to them? fileobj.seek(fileobj.START), for
instance?
-- 
Cheers,
Hasan Diwan <hasan.diwan at gmail.com>

From tomerfiliba at gmail.com  Fri Sep  8 10:53:33 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Fri, 8 Sep 2006 10:53:33 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
Message-ID: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>

> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()?

because, at least the way i see it, seek and tell are byte-oriented,
while the upper layers of the stack may be object-oriented
(including, for instance, characters, struct records, or pickled objects),
so pointers would be a vector of (byte-position, stateful object-layer info).

pointers are different from mere byte-positions, so i thought streams
should have a byte-level API, while the upper layers are more likely
to work with "pointers".
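
for illustration, such a pointer could be as simple as this sketch
(the names are made up, not a real API):

class Pointer(object):
    # an opaque cookie pairing a byte offset with whatever state
    # the object layer (decoder, unpickler, ...) needs to resume there
    def __init__(self, byte_pos, layer_state):
        self.byte_pos = byte_pos
        self.layer_state = layer_state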

[Guido]
> Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.

[Antoine]
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek()? They don't really help readability except perhaps for people
> who still do a lot of C ;)

yes, this was discussed some time ago. we concluded that the new
position property should behave similarly to negative indexes:

f.position = 5  --  absolute seek, from the beginning of the stream
f.position += 3  --  relative seek (*)
f.position = -1  --  absolute seek, back from the end (**)

(*) it requires two syscalls, so we'll also have a seekby() method
(**) like "hello"[-1]

imho it's much simpler and more intuitive than these magic consts,
and it feels more like object indexing.


-tomer

On 9/8/06, Guido van Rossum <guido at python.org> wrote:
> On 9/7/06, tomer filiba <tomerfiliba at gmail.com> wrote:
> > lots of things have been discussed, lots of new ideas came:
> > it's time to rethink the design of iostack; i'll try to see into it.
> >
> > there are several key issues:
> > * splitting streams to separate reading and writing sides.
> > * the underlying OS resource can be separated into some very low
> > level abstraction layer, over which streams would operate.
> > * stateful-seek-cookies sound like the perfect solution
> >
> > issues with seeking:
> > being opaque, there's no sense in having the long debated
> > position property (although i really liked it :)). i.e., there's no sense
> > in doing s.position += some_opaque_cookie
> >
> > on the other hand, since streams are byte-oriented, over which the
> > data abstraction layer (text, etc.) is placed, maybe there's sense in
> > splitting these into two distinct APIs:
> >
> > * tell()/seek() for the byte-level stream position: a stream is just a
> > sequence of bytes in which you can seek.
> > * data-abstraction-layer "pointers": pointers will be stateful stream
> > locations of encoded *objects*.
> >
> > you will not be able to "forge" pointers: you'll first have to come across
> > a valid object location, and only then can you get a "pointer" pointing to it.
> > of course these pointers should be kept cheap, and for most situations,
> > plain integers would suffice.
>
> Using plain ints makes them trivially forgeable though. Not sure I
> mind, just noticing.
>
> > example:
> >
> > f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
> > f.write("hello world")
> > p = f.get_pointer()
> > f.write("wide web")
> > f.set_pointer(p)
>
> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()? Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.
>
> > or using a property:
> > p = f.pointer
> > f.pointer = p
>
> Since the creation of a seek cookie may be relatively expensive (since
> it may have to ask the decoder a rather personal question :-) it
> should be a method, not a property.
>
> > something like that... though i would like to receive comments on
> > that first, before i go into deeper meditation :)
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>

From qrczak at knm.org.pl  Fri Sep  8 11:17:36 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 08 Sep 2006 11:17:36 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
	(tomer filiba's message of "Fri, 8 Sep 2006 10:53:33 +0200")
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
Message-ID: <87zmda1zv3.fsf@qrnik.zagroda>

"tomer filiba" <tomerfiliba at gmail.com> writes:

> yes, this was discussed some time ago. we concluded that the new
> position property should behave similarly to negative indexes:
>
> f.position = 5  --  absolute seek, from the beginning of the stream
> f.position += 3  --  relative seek (*)
> f.position = -1  --  absolute seek, back from the end (**)

Seeking to the very end requires a special constant, otherwise
it's off by 1.

I don't understand such a strong desire to push that syntax despite
its problems with implementing += in one syscall and with specifying
the end point. If it doesn't work well, don't do it that way.

Of course magic constants are bad. My language Kogut has three
separate functions for seeking; it's simpler than interpreting
non-negative and negative numbers differently (and it can even seek
past the end if the OS supports that). I can't imagine a case where
the origin of seeking is not known statically.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at gmail.com  Fri Sep  8 12:31:48 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 08 Sep 2006 20:31:48 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
Message-ID: <45014694.2030609@gmail.com>

[Guido]
>> Why not use tell() and seek() instead of get_pointer() and
>> set_pointer()?

[tomer]
> because, at least the way i see it, seek and tell are byte-oriented,
> while the upper layers of the stack may be object-oriented
> (including, for instance, characters, struct records, or pickled objects),
> so pointers would be a vector of (byte-position, stateful object-layer info).
> 
> pointers are different from mere byte-positions, so i thought streams
> should have a byte-level API, while the upper layers are more likely
> to work with "pointers".

seek() & tell() aren't necessarily byte-oriented, and a program can get itself 
in trouble by treating them as if they are. Seeking to an arbitrary byte 
position on a Windows text file can be a very bad idea :)

So -1 on using different names, but +1 on permitting different IO layers to 
assign a different meaning to exactly what it is that seek() and tell() are 
indexing.

With the IO layer doing a translation, I suggest that the seek/tell cookies 
should be plain integers, so that doing f.seek(20) on a text file will seek to 
the 20th character instead of the 20th byte. This approach is backwards 
compatible with the current rule of 'for text files, arguments to seek() must 
be previously returned from tell()' and Guido's desire that f.seek(0) always 
seek to the beginning of the file.
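
A slow-but-correct sketch of that translation (illustrative only, not a
worked-out design; seeking simply re-decodes from the start, so it works
even for stateful codecs):

import codecs

class TextLayer(object):
    def __init__(self, fileobj, encoding):
        self._file = fileobj
        self._encoding = encoding
        self._reader = codecs.getreader(encoding)(fileobj)
        self._pos = 0                      # position in characters

    def read(self, n=-1):
        text = self._reader.read(chars=n)
        self._pos += len(text)
        return text

    def tell(self):
        return self._pos                   # a plain integer cookie

    def seek(self, n):
        self._file.seek(0)                 # rewind the byte stream
        self._reader = codecs.getreader(self._encoding)(self._file)
        self._reader.read(chars=n)         # skip the first n characters
        self._pos = n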

> 
> [Guido]
>> Seek should also support several special cases:
>> f.seek(0) seeks to the start of the file no matter what type is
>> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
>> no-op, f.seek(0, 2) seeks to EOF.
> 
> [Antoine]
>> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
>> seek() ? They don't really help readibility except perhaps for people
>> who still do a lot of C ;)

Since I've been playing with string methods lately, I believe a natural name 
for the 'seek from the end' version is f.rseek(0). And someone else suggested 
f.seekby(0) as a reasonable name for relative seeking.

f.seek(0)   # Go to beginning
f.seekby(0) # Stay at current position
f.rseek(0)  # Go to end

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From exarkun at divmod.com  Fri Sep  8 14:29:57 2006
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 8 Sep 2006 08:29:57 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <2cda2fc90609080026s7184815fh19345ba764d03c90@mail.gmail.com>
Message-ID: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm>

On Fri, 8 Sep 2006 00:26:55 -0700, Hasan Diwan <hasan.diwan at gmail.com> wrote:
>On 08/09/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>>Perhaps it would be good to drop those magic numbers (0, 1, 2) for
>>seek() ? They don't really help readability except perhaps for people
>>who still do a lot of C ;)
>
>+1
>If we can't or don't want to eliminate the "magic numbers" entirely, perhaps we
>could assign symbolic constants to them? fileobj.seek(fileobj.START) for
>instance?

Note that Python is _worse_ than C here.  C has named constants for these,
Python does not expose them.

Jean-Paul

From ronaldoussoren at mac.com  Fri Sep  8 15:37:00 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Fri, 08 Sep 2006 15:37:00 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm>
References: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm>
Message-ID: <10397427.1157722620731.JavaMail.ronaldoussoren@mac.com>

 
On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:

>On Fri, 8 Sep 2006 00:26:55 -0700, Hasan Diwan <hasan.diwan at gmail.com> wrote:
>>On 08/09/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>>
>>>Perhaps it would be good to drop those magic numbers (0, 1, 2) for
>>>seek() ? They don't really help readability except perhaps for people
>>>who still do a lot of C ;)
>>
>>+1
>>If we can't or don't want to eliminate the "magic numbers" entirely, perhaps we
>>could assign symbolic constants to them? fileobj.seek(fileobj.START) for
>>instance?
>
>Note that Python is _worse_ than C here.  C has named constants for these,
>Python does not expose them.

What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location.
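
For example (assuming a file 'data.bin' that exists and is at least ten
bytes long):

    import os

    f = open('data.bin', 'rb')
    f.seek(0, os.SEEK_END)     # same as f.seek(0, 2): jump to EOF
    size = f.tell()
    f.seek(-10, os.SEEK_END)   # ten bytes before the end
    f.seek(5, os.SEEK_CUR)     # same as f.seek(5, 1): relative seek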

Ronald
>
>Jean-Paul

From exarkun at divmod.com  Fri Sep  8 15:40:42 2006
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 8 Sep 2006 09:40:42 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <10397427.1157722620731.JavaMail.ronaldoussoren@mac.com>
Message-ID: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm>

On Fri, 08 Sep 2006 15:37:00 +0200, Ronald Oussoren <ronaldoussoren at mac.com> wrote:
>
>On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:
>>
>>Note that Python is _worse_ than C here.  C has named constants for these,
>>Python does not expose them.
>
>What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location.

New in Python 2.5, so Python will finally be caught up with C when 2.5
final is released :)

Jean-Paul

From ronaldoussoren at mac.com  Fri Sep  8 16:06:30 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Fri, 08 Sep 2006 16:06:30 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm>
References: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm>
Message-ID: <5355804.1157724390887.JavaMail.ronaldoussoren@mac.com>

 
On Friday, September 08, 2006, at 03:41PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:

>On Fri, 08 Sep 2006 15:37:00 +0200, Ronald Oussoren <ronaldoussoren at mac.com> wrote:
>>
>>On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:
>>>
>>>Note that Python is _worse_ than C here.  C has named constants for these,
>>>Python does not expose them.
>>
>>What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location.
>
>New in Python 2.5, so Python will finally be caught up with C when 2.5
>final is released :)

The same constants are also defined in posixfile, which the Python 2.3 documentation already calls deprecated. Sigh...

Ronald


From guido at python.org  Fri Sep  8 18:37:13 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 8 Sep 2006 09:37:13 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157698968.4636.3.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1157698968.4636.3.camel@fsol>
Message-ID: <ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>

On 9/8/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thursday 07 September 2006 at 16:33 -0700, Guido van Rossum wrote:
> > Why not use tell() and seek() instead of get_pointer() and
> > set_pointer()? Seek should also support several special cases:
> > f.seek(0) seeks to the start of the file no matter what type is
> > otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> > no-op, f.seek(0, 2) seeks to EOF.
>
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek() ? They don't really help readability except perhaps for people
> who still do a lot of C ;)

Maybe (since I fall in that category it doesn't bother me :-), but we
shouldn't replace them with symbolic constants. Having to import
another module to import names like SEEK_CUR and SEEK_END is not
Pythonic. Perhaps the seek() method can grow keyword arguments to
indicate the different types of seekage, or there should be three
separate methods.
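
E.g. something like this rough, untested sketch wrapping today's file
objects (the keyword names are made up on the spot):

    import os

    class KwSeekFile(object):
        def __init__(self, f):
            self._f = f
        def seek(self, pos=0, relative=False, from_end=False):
            if relative:
                self._f.seek(pos, os.SEEK_CUR)
            elif from_end:
                self._f.seek(pos, os.SEEK_END)
            else:
                self._f.seek(pos, os.SEEK_SET)

That way f.seek(0) keeps its current meaning, while
f.seek(0, from_end=True) replaces f.seek(0, 2).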

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mcherm at mcherm.com  Fri Sep  8 18:45:50 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Fri, 08 Sep 2006 09:45:50 -0700
Subject: [Python-3000] The future of exceptions
Message-ID: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>

Marcin Kowalczyk writes:
> In my language the traceback is materialized from the stack only
> if needed [...] The stack is not
> physically unwound until an exception handler completes successfully,
> so the data is available until then.

Jim Jewett writes:
> Even today, if a StopIteration() participates in a reference cycle,
> then it won't be reclaimed until the next gc run.  I'm not quite sure
> which direction should be a weakref, but I think it would be
> reasonable for the cycle to get broken when an catching except block
> exits without reraising.

When thinking about these things, don't forget that in Python an
exception handler can perform complicated actions, including invoking
new functions and possibly raising new exceptions. Any solution should
allow the following code to work "properly":

   # -- WARNING: demo code, not tested

   def logError(msg):
       try:
           errorChannel.write(msg)
       except IOError:
           pass

   try:
       callSomeCode()
   except SomeException as err:
       msg = str(err)
       logError(msg)
       raise msg


By "properly" I mean that that when callSomeCode() raises SomeException
the uncaught exception will cause the program should print a stacktrace
which should correcly show the stack frame of callSomeCode(). This
should happen regardless of whether errorChannel raised an IOError.
In the process, though, we (1) added new frames to the stack, and (2)
successfully exited an error handler (the one for IOError).

It is work to provide this feature, but without it Python programmers
cannot freely use any code they like within exception handlers, which
I think is an important feature. It doesn't necessarily imply that
the traceback be materialized immediately upon exception creation
(which is undesirable because we want exceptions lightweight enough
to use for things like for loop control!)... but it might mean that
pieces of the stack frame need to hang around as long as the exception
itself does.

-- Michael Chermside


From ncoghlan at gmail.com  Fri Sep  8 19:00:33 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Sep 2006 03:00:33 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>	<1157698968.4636.3.camel@fsol>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
Message-ID: <4501A1B1.5050707@gmail.com>

Guido van Rossum wrote:
> Maybe (since I fall in that category it doesn't bother me :-), but we
> shouldn't replace them with symbolic constants. Having to import
> another module to import names like SEEK_CUR and SEEK_END is not
> Pythonic. Perhaps the seek() method can grow keyword arguments to
> indicate the different types of seekage, or there should be three
> separate methods.

As I mentioned in a different part of the thread, I believe seek(), seekby() 
and rseek() would work as names for the 3 different method approach.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From talin at acm.org  Fri Sep  8 19:12:13 2006
From: talin at acm.org (Talin)
Date: Fri, 08 Sep 2006 10:12:13 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4501A1B1.5050707@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>	<1157698968.4636.3.camel@fsol>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com>
Message-ID: <4501A46D.20007@acm.org>

Nick Coghlan wrote:
> Guido van Rossum wrote:
>> Maybe (since I fall in that category it doesn't bother me :-), but we
>> shouldn't replace them with symbolic constants. Having to import
>> another module to import names like SEEK_CUR and SEEK_END is not
>> Pythonic. Perhaps the seek() method can grow keyword arguments to
>> indicate the different types of seekage, or there should be three
>> separate methods.
> 
> As I mentioned in a different part of the thread, I believe seek(), seekby() 
> and rseek() would work as names for the 3 different method approach.
> 
> Cheers,
> Nick.
> 

One advantage of that approach is that layers which don't support a
particular operation could omit one or more of those functions, or have
differently-named functions that represent what the layer is capable of.
For example, if a layer is only capable of seeking forward, you could
use 'skip' like the Java stream does; if a layer can rewind the stream
back to zero, but not to any intermediate position, you could have a
'reset' method.

By taking this approach, you can come up with an API for a given layer
that fits naturally into the behavior model of that layer, without
trying to cram it into a generic model for seeking that attempts to
cover all cases. For text streams, come up with a model that makes sense
for what kinds of things you want to do with text, and don't try to
make it look like the API for the underlying byte stream.
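
A minimal sketch of what such layer-specific navigation might look like
(all the names here are illustrative only):

    class ForwardOnlyLayer(object):
        def __init__(self, stream):
            self._stream = stream
        def skip(self, n):
            self._stream.read(n)    # discard n units, like Java's skip()

    class RewindableLayer(ForwardOnlyLayer):
        def reset(self):
            self._stream.seek(0)    # back to the start, but nowhere else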

-- Talin

From aahz at pythoncraft.com  Fri Sep  8 19:21:51 2006
From: aahz at pythoncraft.com (Aahz)
Date: Fri, 8 Sep 2006 10:21:51 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
Message-ID: <20060908172151.GA9911@panix.com>

On Fri, Sep 08, 2006, Michael Chermside wrote:
>
>    def logError(msg):
>        try:
>            errorChannel.write(msg)
>        except IOError:
>            pass
> 
>    try:
>        callSomeCode()
>    except SomeException as err:
>        msg = str(err)
>        logError(msg)
>        raise msg

This code is guaranteed to fail in Python 3.0, of course, because string
exceptions aren't allowed.  But your point is taken, I think.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"LL YR VWL R BLNG T S"  -- www.nancybuttons.com

From fdrake at acm.org  Fri Sep  8 20:03:17 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 8 Sep 2006 14:03:17 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4501A1B1.5050707@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com>
Message-ID: <200609081403.18350.fdrake@acm.org>

On Friday 08 September 2006 13:00, Nick Coghlan wrote:
 > As I mentioned in a different part of the thread, I believe seek(),
 > seekby() and rseek() would work as names for the 3 different method
 > approach.

+1, for the reasons discussed.


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From guido at python.org  Fri Sep  8 20:06:41 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 8 Sep 2006 11:06:41 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <200609081403.18350.fdrake@acm.org>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
Message-ID: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>

-1 on those particular cryptic names. Which one of seekby() and
rseek() is the relative seek? Where's the seek relative to EOF?

On 9/8/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> On Friday 08 September 2006 13:00, Nick Coghlan wrote:
>  > As I mentioned in a different part of the thread, I believe seek(),
>  > seekby() and rseek() would work as names for the 3 different method
>  > approach.
>
> +1, for the reasons discussed.
>
>
>   -Fred
>
> --
> Fred L. Drake, Jr.   <fdrake at acm.org>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From solipsis at pitrou.net  Fri Sep  8 20:41:13 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 20:41:13 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
Message-ID: <1157740873.4979.10.camel@fsol>

On Friday 08 September 2006 at 11:06 -0700, Guido van Rossum wrote:
> -1 on those particular cryptic names. Which one of seekby() and
> rseek() is the relative seek? Where's the seek relative to EOF?

What about seek(), seek_relative() and seek_reverse() ?

"rseek" also looks like "relative seek" to me (having be used to move /
rmove for graphic primitives a long time ago).

Regards

Antoine.



From jimjjewett at gmail.com  Fri Sep  8 21:04:50 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 8 Sep 2006 15:04:50 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157740873.4979.10.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
Message-ID: <fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>

On 9/8/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Friday 08 September 2006 at 11:06 -0700, Guido van Rossum wrote:
> > -1 on those particular cryptic names. Which one of seekby() and
> > rseek() is the relative seek? Where's the seek relative to EOF?

> What about seek(), seek_relative() and seek_reverse() ?

Why not just borrow the standard symbolic names of cur and end?

    seek(pos=0)
    seek_cur(pos=0)
    seek_end(pos=0)

    seek_end(-1000)  <==>  1000 units (bytes, chars, records, ...) before the end
    seek_cur(50)     <==>  50 units beyond current
    seek()           <==>  beginning
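
An untested sketch of how these would map onto the existing protocol:

    import os

    class SeekableWrapper(object):
        def __init__(self, f):
            self._f = f
        def seek(self, pos=0):
            self._f.seek(pos, os.SEEK_SET)
        def seek_cur(self, pos=0):
            self._f.seek(pos, os.SEEK_CUR)
        def seek_end(self, pos=0):
            self._f.seek(pos, os.SEEK_END)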

-jJ

From qrczak at knm.org.pl  Fri Sep  8 21:21:22 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 08 Sep 2006 21:21:22 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	(Guido van Rossum's message of "Fri, 8 Sep 2006 11:06:41 -0700")
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
Message-ID: <87bqpqqi4t.fsf@qrnik.zagroda>

"Guido van Rossum" <guido at python.org> writes:

> -1 on those particular cryptic names. Which one of seekby() and
> rseek() is the relative seek? Where's the seek relative to EOF?

I propose seek, seek_by, seek_end.

I suppose in 99% of cases seek_end is used to seek to the very end,
rather than some offset from the end, so it makes sense for the offset
to be optional.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From fdrake at acm.org  Sat Sep  9 00:06:08 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 8 Sep 2006 18:06:08 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
Message-ID: <200609081806.09032.fdrake@acm.org>

On Friday 08 September 2006 14:06, Guido van Rossum wrote:
 > -1 on those particular cryptic names. Which one of seekby() and
 > rseek() is the relative seek? Where's the seek relative to EOF?

My reading was seekby() as relative, and rseek() was relative to the end.  It 
could be something like seekposition(), seekforward(), seekfromend().  Long, 
but unambiguous.


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From solipsis at pitrou.net  Sat Sep  9 00:24:10 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 09 Sep 2006 00:24:10 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
Message-ID: <1157754250.8948.1.camel@fsol>

On Friday 08 September 2006 at 15:04 -0400, Jim Jewett wrote:
> > What about seek(), seek_relative() and seek_reverse() ?
> 
> Why not just borrow the standard symbolic names of cur and end?
> 
>     seek(pos=0)
>     seek_cur(pos=0)
>     seek_end(pos=0)

You are right, it's clear and shorter than my proposal.




From jackdied at jackdied.com  Sat Sep  9 01:26:04 2006
From: jackdied at jackdied.com (Jack Diederich)
Date: Fri, 8 Sep 2006 19:26:04 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157754250.8948.1.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<1157754250.8948.1.camel@fsol>
Message-ID: <20060908232604.GC6250@performancedrivers.com>

On Sat, Sep 09, 2006 at 12:24:10AM +0200, Antoine Pitrou wrote:
> On Friday 08 September 2006 at 15:04 -0400, Jim Jewett wrote:
> > > What about seek(), seek_relative() and seek_reverse() ?
> > 
> > Why not just borrow the standard symbolic names of cur and end?
> > 
> >     seek(pos=0)
> >     seek_cur(pos=0)
> >     seek_end(pos=0)

I like the C-ish style because I'm used to it.  These are OK so
long as seek(n, 2) raises an informative exception.  I was initially
going to suggest seek_abs() for the absolute seek, but if it remains
plain seek(), old users won't have to go searching the docs, and help()
would be .. helpful.

-Jack

From murman at gmail.com  Sat Sep  9 06:32:10 2006
From: murman at gmail.com (Michael Urman)
Date: Fri, 8 Sep 2006 23:32:10 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <4500D990.3000707@blueyonder.co.uk>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
	<4500D990.3000707@blueyonder.co.uk>
Message-ID: <dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>

On 9/7/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Yes. However, this is not a good idea for precisely the reason described
> on that page (false detection of Unicode), and so any Unicode detection
> algorithm in Python should only be based on detecting a BOM, IMHO.

Right, except BOMs break tons of Unix applications (and even
occasional Windows ones) which do not expect them, which leaves
Python nearly unable to detect Unicode on Unix. This is quite
unfortunate for those of us rooting for UTF-8. Perhaps there are
better heuristics that are worth considering. Perhaps not. It
certainly shouldn't be the default behaviour of a TextFile
constructor.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From murman at gmail.com  Sat Sep  9 06:39:25 2006
From: murman at gmail.com (Michael Urman)
Date: Fri, 8 Sep 2006 23:39:25 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>
	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
Message-ID: <dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>

On 9/7/06, Jeff Wilcox <jeff at soft.fujitsu.com> wrote:
> > From: "Paul Prescod" <paul at prescod.net>
> > 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> > "ANSI" is not a real encoding at all (and certainly not one from the
> On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
> version.  It actually writes Shift JIS though.

ANSI is not an encoding; it is a collective name for various multibyte
encodings, each corresponding to a particular default language of the
machine. Thus ANSI corresponds to cp1252 on English and cp932 on
Japanese machines.

As for whether cp932 is the same as Shift JIS, David and I seem to
disagree. While I lack hard data, the string '\\' round trips through
either on my box.
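
Concretely, a round trip of that kind looks like this with CPython's
codec names (other implementations may map things differently):

    >>> u'\\'.encode('cp932')
    '\\'
    >>> '\\'.decode('shift_jis')
    u'\\'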
-- 
Michael Urman  http://www.tortall.net/mu/blog

From ncoghlan at gmail.com  Sat Sep  9 07:44:59 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Sep 2006 15:44:59 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: <fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>	<4501A1B1.5050707@gmail.com>
	<200609081403.18350.fdrake@acm.org>	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
Message-ID: <450254DB.3020502@gmail.com>

Jim Jewett wrote:
> On 9/8/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> On Friday 08 September 2006 at 11:06 -0700, Guido van Rossum wrote:
>>> -1 on those particular cryptic names. Which one of seekby() and
>>> rseek() is the relative seek? Where's the seek relative to EOF?
> 
>> What about seek(), seek_relative() and seek_reverse() ?
> 
> Why not just borrow the standard symbolic names of cur and end?
> 
>     seek(pos=0)
>     seek_cur(pos=0)
>     seek_end(pos=0)
> 
>     seek_end(-1000)  <==>  1000 units (bytes, chars, records, ...) before the end
>     seek_cur(50)     <==>  50 units beyond current
>     seek()           <==>  beginning

+1 here. Short, to the point, and easy to remember for anyone already familiar 
with seek().

Cheers,
Nick.

P.S. on a slightly different topic, it would be nice if f.seek(-1) raised 
ValueError instead of IOError. Passing a negative absolute seek value is a 
program bug, not an environment problem.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 16:39:17 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 15:39:17 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>	<4500D990.3000707@blueyonder.co.uk>
	<dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>
Message-ID: <4502D215.1080407@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>>Yes. However, this is not a good idea for precisely the reason described
>>on that page (false detection of Unicode), and so any Unicode detection
>>algorithm in Python should only be based on detecting a BOM, IMHO.
> 
> Right, except BOMs break tons of Unix applications (and even
> occasional Windows ones) which do not expect them.

This problem is overstated. A BOM anywhere in a text causes no problem with
display, and *should* be treated as an ignorable character for searching,
etc. Note that there are plenty of other characters that should be treated
as ignorable, so the applications that are broken for BOMs are broken more
generally.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 17:04:44 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 16:04:44 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
	<dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
Message-ID: <4502D80C.9030908@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, Jeff Wilcox <jeff at soft.fujitsu.com> wrote:
> 
>>>From: "Paul Prescod" <paul at prescod.net>
>>>1. On US English Windows, Notepad defaults to an encoding called "ANSI".
>>>"ANSI" is not a real encoding at all (and certainly not one from the
>>
>>On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
>>version.  It actually writes Shift JIS though.
> 
> ANSI is not an encoding; it is a collective name for various multibyte
> encodings, each corresponding to a particular default language of the
> machine. Thus ANSI corresponds to cp1252 on English and cp932 on
> Japanese machines.
> 
> As for whether cp932 is the same as Shift JIS, David and I seem to
> disagree. While I lack hard data, the string '\\' round trips through
> either on my box.

You may have an implementation that uses Cp932 or similar, but calls it
"Shift-JIS". <http://en.wikipedia.org/wiki/Shift_jis> agrees with me, FWIW.

Here is a pretty complete mapping table for Shift-JIS + common extensions
(as opposed to Cp932):

<http://wakaba-web.hp.infoseek.co.jp/table/sjis-0208-1997-std.txt>

although there is quite a bit of variation in mappings:

<http://www.haible.de/bruno/charsets/conversion-tables/Shift_JIS.html>

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 17:10:38 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 16:10:38 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
	<dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
Message-ID: <4502D96E.7090607@blueyonder.co.uk>

Michael Urman wrote:
> As for whether cp932 is the same as Shift JIS, David and I seem to
> disagree. While I lack hard data, the string '\\' round trips through
> either on my box.

I missed this part. On any single implementation, '\\' will usually round-trip
from Unicode -> Shift-JIS -> Unicode; the issue is whether it is encoded as
0x5C, or something else like 0x815F. It may very well not round-trip if you
use different implementations for encoding and decoding.
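
For instance, with CPython's own codecs (other implementations make
different choices):

    >>> u'\\'.encode('shift_jis')        # CPython encodes U+005C as 0x5C
    '\\'
    >>> '\x81\x5f'.decode('shift_jis')   # 0x815F is a different character
    u'\uff3c'

so a 0x815F produced by another implementation's encoder comes back as
U+FF3C (FULLWIDTH REVERSE SOLIDUS), not as '\\'.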

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From qrczak at knm.org.pl  Sat Sep  9 17:43:13 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 09 Sep 2006 17:43:13 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <4502D215.1080407@blueyonder.co.uk> (David Hopwood's message of
	"Sat, 09 Sep 2006 15:39:17 +0100")
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
	<4500D990.3000707@blueyonder.co.uk>
	<dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>
	<4502D215.1080407@blueyonder.co.uk>
Message-ID: <87bqpp82r2.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

>> Right, except BOMs break tons of Unix applications (and even
>> occasional Windows ones) which do not expect them.
>
> This problem is overstated. A BOM anywhere in a text causes no
> problem with display, and *should* be treated as an ignorable
> character for searching, etc.

It is not ignorable in most file formats, and it is not automatically
ignored by reading functions of most programming languages.

> Note that there are plenty of other characters that should be
> treated as ignorable, so the applications that are broken for BOMs
> are broken more generally.

I disagree. UTF-8 BOM should not be used on Unix. It's not a reliable
method of encoding detection in general (applies only to Unicode),
and it breaks the simplicity of text streams.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Sat Sep  9 19:41:51 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 10:41:51 -0700
Subject: [Python-3000] Offtopic: declaring encoding
Message-ID: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>

On 9/9/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>
> > Note that there are plenty of other characters that should be
> > treated as ignorable, so the applications that are broken for BOMs
> > are broken more generally.
>
> I disagree. UTF-8 BOM should not be used on Unix. It's not a reliable
> method of encoding detection in general (applies only to Unicode),
> and it breaks the simplicity of text streams.


We're offtopic, but treating these decisions as operating-system-specific is
a big part of what caused the current mess, e.g. with Japanese Windows users
and Japanese Unix users using different encodings. The Unicode consortium
should address the issue of auto-encoding and make a recommendation for how
"raw" text files can have their encoding detected. A combination of BOM,
coding declaration and fall-back to UTF-8 would cover the vast majority of
the world's languages and incorporate many national encodings.

Are you defending the status quo, wherein text data cannot even be reliably
processed on the desktop on which it was created (yes, even on Unix: look
back in this thread)? Do you have a positive prescription?

 Paul Prescod

From paul at prescod.net  Sat Sep  9 19:58:22 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 10:58:22 -0700
Subject: [Python-3000] Help on text editors
In-Reply-To: <LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>
	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
Message-ID: <1cb725390609091058j49ffcdc6h61ce7eb80700f011@mail.gmail.com>

On 9/7/06, Jeff Wilcox <jeff at soft.fujitsu.com > wrote:
>
> > From: "Paul Prescod" < paul at prescod.net>
> > 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> > "ANSI" is not a real encoding at all (and certainly not one from the
> On Japanese Windows 2000, Notepad defaults to ANSI as it does in the
> English
> version.  It actually writes Shift JIS though.
>
> > 2. On my English Mac, the default character set for textedit is "Mac OS
> > Roman". What is it for foreign language macs? What API does an
> application
> > use to query this default character set? What setting is it derived
> from?
> > The Unix-level locale (seems not!) or some GUI-level setting (which
> one)?
>
> Mac OS X actually doesn't have different language versions of the
> operating
> system.  If you change the language setting, the Japanese version
> *becomes*
> the English version and vice versa. (Several of the English speakers that
> I
> work with have purchased Japanese Macs and switched them over to English,
> they're indistinguishable from English Macs afterwards.  Similarly,
> several
> Macs purchased in the US have been successfully switched to Japanese, and
> become indistinguishable from Macs bought in Japan.)


Great: but what is the default TextEdit encoding on a Japanized version of
the Mac?

 Paul Prescod

From qrczak at knm.org.pl  Sat Sep  9 22:00:34 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 09 Sep 2006 22:00:34 +0200
Subject: [Python-3000] Offtopic: declaring encoding
In-Reply-To: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>
	(Paul Prescod's message of "Sat, 9 Sep 2006 10:41:51 -0700")
References: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>
Message-ID: <87wt8c4xp9.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> text data cannot even be reliably processed on the desktop on which
> it was created (yes, even on Unix: look back in this thread).

Where?

> Do you have a positive prescription?

New communication protocols and newly created file formats designed
for interchange will either specify the text encoding in metadata
(if files are expected to be edited by hand, at least in the near
future) or use UTF-8 exclusively. Simple file formats expected to be
used only locally will continue to have the encoding implicit.

The system encoding of Unix boxes will more commonly be UTF-8 as time
passes.

I'm not using UTF-8 on my desktop by default because there are still
some applications which don't work with UTF-8 terminals. The situation
is much better than it used to be 10 years ago: most applications
didn't support UTF-8 back then, now most do.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Sun Sep 10 01:26:15 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 16:26:15 -0700
Subject: [Python-3000] Offtopic: declaring encoding
In-Reply-To: <87wt8c4xp9.fsf@qrnik.zagroda>
References: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>
	<87wt8c4xp9.fsf@qrnik.zagroda>
Message-ID: <1cb725390609091626r6104a8a2k7604be7b560e7f2f@mail.gmail.com>

On 9/9/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>
> "Paul Prescod" <paul at prescod.net> writes:
>
> > text data cannot even be reliably processed on the desktop on which
> > it was created (yes, even on Unix: look back in this thread).
>
> Where?


http://mail.python.org/pipermail/python-3000/2006-September/003492.html

New communication protocols and newly created file formats designed
> for interchange will either specify the text encoding in metadata
> (if files are expected to be edited by hand and it's still a near future),
> or use UTF-8 exclusively. Simple file formats expected to be used only
> locally will continue to have the encoding implicit.
>
> The system encoding of Unix boxes will more commonly be UTF-8 as time
> passes.


Okay, thanks for your view of where things are going. I think that it is
clear that UTF-8 will replace iso8859-* on Unix over the next few years. It
isn't as clear if it (or any other global encoding) will replace EUC.

 Paul Prescod

From lists at janc.be  Sun Sep 10 01:35:32 2006
From: lists at janc.be (Jan Claeys)
Date: Sun, 10 Sep 2006 01:35:32 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <1157844932.5109.198.camel@bedsa>

On Thu, 07-09-2006 at 12:21 -0700, Paul Prescod wrote:
> Guido has asked me to do some research in aid of a file encoding
> detection/defaulting PEP.
> 
> I only have access to a small number of operating systems and language
> variants so I need help.
> 
> If you have access to "German Windows XP", "Japanese Windows XP",
> "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would
> appreciate answers to the following questions. 
[...]
> 3. In general, how do modern versions of Linux and other Unix handle
> this issue? In particular: what is your default encoding and how did
> your operating system determine it? Did you install a locale-specific
> version? Did the installer ask you? Did you edit a configuration file?
> Did you change a GUI setting? What is the relationship between your
> localization of Gnome/KDE and your default encoding? 

AFAIK Ubuntu has used UTF-8 as the default encoding for all languages
since the 'hoary' release (version 5.04, which was the 2nd Ubuntu
release).


-- 
Jan Claeys


From paul at prescod.net  Sun Sep 10 05:29:05 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 20:29:05 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
Message-ID: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>

PEP: XXX
Title: Easy Text File Decoding
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Sep-2006
Post-History: 09-Sep-2006
Python-Version: 3.0

Abstract
========

Python 3000 will use Unicode as the standard string type. This means that
text files read from disk will be "decoded" into Unicode code points just as
binary files might be decoded into integers and structures. This change
brings a few issues to the fore that were
previously ignorable.

For example, in Python 2.x, it was possible to open a text file, read the
data into a Python string, filter some lines and print the remaining lines
to the console without ever considering what "encoding" the text was in. In
Python 3000, the programmer will only get access to
Python's powerful string manipulation functions after decoding the data to
Unicode code points. This means that either the programmer or the Python
runtime must select a decoding algorithm (by naming the encoding algorithm
that was used to encode the data in the first place).

Often the programmer can do so based upon out-of-band knowledge ("this file
format is always UCS-2" or "the protocol header says that this data is
latin-1"). In other cases, the programmer may be more naive or simply wish
to avoid thinking about it and would rather defer the issue to Python.

This document presents a proposal for algorithms and APIs that Python can
use to simplify the programmer's life.

Issues outside the scope of this PEP
=====================================

Any programmer who wishes to take direct control of the encoding selection
may of course ignore the features described in this PEP and choose a
decoding explicitly. The PEP is not intended to constrain them in any way.

Bytes received through means other than the file system are not addressed by
this PEP. For example, the PEP does not address data directly read from a
socket or returned from marshal functions.

Rationale
==========

The simplest possible use case for Python text processing involves a user
maintaining some form of simple database (e.g. an address book) as a text
file and processing it with Python. Unfortunately, this use case is not as
simple as it should be because of the variety of encodings in the universe.
For example, the file might be UTF-8, ISO-8859-1 or ISO-8859-2.

Professional programmers making widely distributed programs probably have no
alternative but to deal with this variability head-on. But programmers
working with data that originates and resides primarily on their own
computer might wish to avoid dealing with it. They would like Python to just
"try to do the right" thing with respect to the file. They would like to
think about encodings if and only if Python failed to guess appropriately.

Proposal
========

The function to open a text file will tentatively be called textfile(),
though the function name is not an integral part of this PEP. The function
takes three arguments: the filename, the mode ("r", "w", "r+", etc.) and the
type.

The type could be a true encoding or one of a small set of additional
symbolic values. The two main symbolic values are:

* "site" -- the default value, which invokes a site-specific alogrithm. For
example, a Japanese school teacher using Windows might default "site" to
Shift-JIS. An organization dealing with a small number of encodings might
default "site" to be equivalent to "guess". An organization with a strict
internationalization policy might default "site" to "UTF-8". An important
open issue is what Python's out-of-box interpretation of "site" should be.
This is key because "site" is the default value so Python's out-of-box
behaviour is the "default default".

* "guess" -- the value to be used by encoding-inexpert programmers and
experts who feel confident that Python's guessing algorithm will produce
sufficient results for their purposes. The guessing algorithm will
necessarily be complicated and may change over time. It will take into
account the following factors:

   - the conventions dominant on the operating system of choice

   - any localization-relevant settings available

   - a certain number of bytes at the start of the file (perhaps start and
end?). This sample will likely be on the order of thousands of bytes.

   - filesystem metadata attached to the file (in strong preference to the
above).

* "locale" -- the encoding suggested by the operating system's locale
concept

Other symbolic values might allow the programmer to suggest specific
encoding detection algorithms like XML [#XML-encoding-detection]_, HTML
[#HTML-encoding-detection]_ and the "coding:" comment convention. These
would be specified in separate PEPs.

The Site Decoding Hook
========================

The "sys" module could have a function called "setdefaultfileencoding". The
encoding specified could be a true encoding name or one of the encoding
detection scheme names (e.g. "guess" or "XML").

In addition, it should be possible to register new encoding detection
schemes using a method like "sys.registerencodingdetector". This function
would take two arguments, a string and a callable. The callable would accept
a byte stream argument and return a text stream. The contract for these
detection scheme implementations must allow them to peek ahead some bytes to
use the content as a hint to the encoding.
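
For illustration only, a detection scheme obeying this contract might
look like the sketch below; the registration call is the API proposed
above (it does not exist today), and the ability to peek at the byte
stream is assumed as part of the contract:

    import codecs, sys

    def bom_detector(byte_stream):
        head = byte_stream.peek(4)      # peek ahead, per the contract
        if head.startswith(codecs.BOM_UTF8):
            encoding = 'utf-8'
        elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            encoding = 'utf-16'
        else:
            encoding = 'ascii'          # strict fallback; see Open Issues
        return codecs.getreader(encoding)(byte_stream)

    sys.registerencodingdetector('bom-only', bom_detector)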

Alternatives and Open Issues
==============================

1. Guido proposes that the function be called merely "open". His proposal is
that the binary open should be the alternative and should be invoked
explicitly with a "b" mode switch. The PEP author feels, first, that changing
the behaviour of an existing function is more confusing and disruptive than
creating another. Backporting a change to the "open" function would be
difficult and therefore it would be unnecessarily difficult to create
file-manipulating libraries that work both on Python 2.x and 3.x.

Second, the author feels that "open" is an unnecessarily cryptic name
rooted only in Unix/C history. For a programmer coming from (for example)
Javascript, open() would tend to imply "open window". The PEP author
believes that factory functions should say what they are creating.

2. There is substantial disagreement on the behaviour of the function when
there is no encoding argument passed and no site override (i.e the
out-of-box default). Current proposals include ASCII (on the basis that it
is a nearly universal subset of popular encodings), UTF-8 (on the basis that
it is the dominant global standard encompassing all of Unicode), a
locale-derived encoding (on the basis that this is what a naive user will
generate in a text editor) or the guessing algorithm (on the basis that it
is by definition designed to guess right more often than any more specific
encoding name).

The PEP author strongly advocates a strict encoding like ASCII, UTF-8 or no
default at all (in which case the lack of an encoding would raise an
exception). A default like iso-8859-1 (even inferred from the environment)
will result in encodings like UTF-8, UCS-2 and even binary files being
"interpreted" as gibberish strings. This could result in document or
database corruption. An encoding with a "guess" default will encourage the
widespread creation of very unreliable code.

The current proposal is to have no out-of-box default until some point in
the future when a small set of auto-detectable encodings are globally
dominant. UTF-8 has gradually been gaining popularity through W3C and other
standards so it is possible that five years from now it will be the
"no-brainer" default. Until we can guess with substantial confidence,
absence of both an encoding declaration and a site override should result in
a thrown exception.

References
==========

.. [#XML-encoding-detection] XML Encoding Detection algorithm:
http://www.w3.org/TR/REC-xml/#sec-guessing
.. [#HTML-encoding-detection] HTML Encoding Detection algorithm:
http://www.w3.org/TR/REC-xml/#sec-guessing

Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

From greg.ewing at canterbury.ac.nz  Sun Sep 10 08:11:05 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 10 Sep 2006 18:11:05 +1200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
Message-ID: <4503AC79.4090601@canterbury.ac.nz>

Michael Chermside wrote:
> It doesn't necessarily imply that
> the traceback be materialized immediately upon exception creation
> (which is undesirable because we want exceptions lightweight enough
> to use for things like for loop control!)... but it might mean that
> pieces of the stack frame need to hang around as long as the exception
> itself does.

With the current implementation, "materialising the traceback"
and "keeping parts of the stack frame hanging around" are
pretty much the same thing, since the traceback is mostly just
a linked list of frames encountered while unwinding the stack
looking for a handler. So if there's a possibility you might
want a traceback at all at any point, it's hard to see how
the process could be made any more lightweight.
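
The linked list is visible from pure Python, for what it's worth (each
traceback object holds a frame plus a tb_next pointer):

    import sys

    try:
        1/0
    except ZeroDivisionError:
        tb = sys.exc_info()[2]
        while tb is not None:
            print(tb.tb_frame)    # each node: a frame...
            tb = tb.tb_next       # ...and a link to the next one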

However, I'm wondering whether it might be worth distinguishing
two different kinds of exceptions: "flow control" exceptions
which are used something like a non-local goto, and full-
blown exceptions. Flow control exceptions typically don't
need most of the exception machinery -- they don't carry
data of their own, so you don't need to instantiate a class
every time, and you're not usually interested in a traceback.
So maybe there should be a different form of raise statement
for these that doesn't bother making provision for them.

A problem is that if a flow control exception *doesn't* get
caught by something that's expecting it, you probably do
want a traceback in order to debug the problem.

Maybe try-statements could maintain a stack of handlers,
so the raise-control-flow-exception statement could quickly
tell whether there is a handler, and if not, raise an
ordinary exception with a traceback.

Or maybe there should be a different mechanism altogether
for non-local gotos. I'd like to see some kind of "longjmp"
object that could be invoked to cause a jump back to
a specific place. That would help alleviate the problem
that exceptions used for control flow can get caught by
the wrong handler. Sometimes you really want something
that's targeted to a specific handler, not just the next
enclosing one of some type.

--
Greg

From qrczak at knm.org.pl  Sun Sep 10 11:11:31 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 10 Sep 2006 11:11:31 +0200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <4503AC79.4090601@canterbury.ac.nz> (Greg Ewing's message of
	"Sun, 10 Sep 2006 18:11:05 +1200")
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz>
Message-ID: <874pvgozlo.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> Flow control exceptions typically don't need most of the exception
> machinery -- they don't carry data of their own, so you don't need
> to instantiate a class every time,

It's lazily instantiated today (see PyErr_NormalizeException).

> Or maybe there should be a different mechanism altogether
> for non-local gotos. I'd like to see some kind of "longjmp"
> object that could be invoked to cause a jump back to
> a specific place.

Any non-local exit should be hookable by active function calls between
the raising point and the catching point, especially by things like
try...finally.

> Sometimes you really want something that's targeted to a specific
> handler, not just the next enclosing one of some type.

Indeed, but this can still use an exception internally. My language
Kogut has a function for that ('?' is lambda, the whole thing is an
argument of 'WithExit'):

WithExit ?exit {
   some code
   which can at some point
   call the 'exit' function introduced above,
   even from another function,
   and the control flow will return to this WithExit call
};

I think it can be exposed as something used with 'with' in Python.
'WithExit' constructs a unique exception object and catches precisely
this object.
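
A rough, untested Python sketch of that idea (all the Python names here
are invented):

    from contextlib import contextmanager

    class _Exit(Exception):
        pass

    @contextmanager
    def with_exit():
        token = _Exit()              # a unique exception object per call
        def exit(value=None):
            token.value = value
            raise token
        try:
            yield exit
        except _Exit as caught:
            if caught is not token:  # someone else's exit: keep unwinding
                raise

    with with_exit() as exit:
        print("before")
        exit()                       # control resumes after the with block
        print("never reached")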

Implementing it with an exception makes the semantics of expression
evaluation more uniform: an expression either evaluates to a value,
or fails with an exception, and there is no other possibility which
would have to be accounted for in generic wrappers which call unknown
code (e.g. my bridge between two languages, or running a computation
by another thread).

There are other kinds of non-local exits, like exiting the program
or thread cancellation, which can be implemented with exceptions and
I think it's better than inventing a separate mechanism for each.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From solipsis at pitrou.net  Sun Sep 10 12:31:24 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 12:31:24 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <1157884285.4246.41.camel@fsol>

On Saturday 09 September 2006 at 20:29 -0700, Paul Prescod wrote:
> The type could be a true encoding or one of a small set of additional
> symbolic values. The two main symbolic values are:

Actually your proposal has three ;)

> For example, a Japanese school teacher using Windows might default
> "site" to Shift-JIS.

I think a Japanese school teacher using Windows shouldn't have to
configure anything specifically in Python, encoding-wise. 
I've never seen a tool (e.g. text editor) refuse to work before you had
explicitly configured an encoding *for the tool*. Those tools either
choose the system-wide default, aka "locale" (if they want to play fair
with other apps), or their own (if they think utf-8 is the future).

I see two cases where refusing to use a default is even more unhelpful:
- on the growing number of systems which have utf-8 as default
- when the programmer simply wants to open a pure-ascii text file (e.g.
configuration file), and opening it as text allows him to read it
line-by-line, or use whatever other facilities text files provide that
binary files don't


So, here is an alternative proposal :
Make it so that textfile() doesn't recognize system-wide defaults (as in
your proposal), but also provide autotextfile() which would recognize
those defaults (with a by_content=False optional argument to enable
content-based guessing).

textfile() being clearly marked for use by large well thought-out
applications, and autotextfile() for small scripts and the like.
Different names make it clear that they are for different uses, and
make them easy to spot when looking at source code (either by a human
reader or a quality measurement tool).
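
In rough code (an untested sketch, using codecs.open as a stand-in for
the real py3k text factory and leaving out the "site" hook and the
guessing machinery):

    import codecs
    import locale

    def textfile(name, mode='r', encoding=None):
        if encoding is None:         # strict: no guessing, no defaults
            raise ValueError("textfile() requires an explicit encoding")
        return codecs.open(name, mode, encoding)

    def autotextfile(name, mode='r', by_content=False):
        if by_content:
            raise NotImplementedError("content guessing not sketched here")
        return codecs.open(name, mode, locale.getpreferredencoding())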

Regards

Antoine.



From phd at mail2.phd.pp.ru  Sun Sep 10 12:35:00 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Sun, 10 Sep 2006 14:35:00 +0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <20060910103500.GA13412@phd.pp.ru>

On Sat, Sep 09, 2006 at 08:29:05PM -0700, Paul Prescod wrote:
> "the protocol header says that this data is latin-1").

   "Protocol metadata" if you allow me to suggest the word.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From solipsis at pitrou.net  Sun Sep 10 13:02:57 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 13:02:57 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <1157886177.4246.59.camel@fsol>


> The Site Decoding Hook
> ======================== 
> 
> The "sys" module could have a function called
> "setdefaultfileencoding". The encoding specified could be a true
> encoding name or one of the encoding detection scheme names ( e.g.
> "guess" or "XML").

Isn't it more intuitive to gather functions based on what their
high-level purpose is ("text" or "textfile") than on implementation
details of where the information comes from ("sys", "locale") ?

That function could be "textfile.set_default_encoding" (with
underscores), or even "text.textfile.set_default_encoding" (if all this
resides in a "text" module).

Regards

Antoine.



From ncoghlan at gmail.com  Sun Sep 10 13:58:00 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 21:58:00 +1000
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157884285.4246.41.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>
Message-ID: <4503FDC8.2030608@gmail.com>

Antoine Pitrou wrote:
> So, here is an alternative proposal :
> Make it so that textfile() doesn't recognize system-wide defaults (as in
> your proposal), but also provide autotextfile() which would recognize
> those defaults (with a by_content=False optional argument to enable
> content-based guessing).
> 
> textfile() being clearly marked for use by large, well-thought-out
> applications, and autotextfile() for small scripts and the like.
> Different names make it clear that they are for different uses, and
> make it easy to spot them when looking at source code (either by a human
> reader or a quality measurement tool).

How does your "autotextfile('myfile.txt')" differ from Paul's 
"textfile('myfile.txt', encoding='guess')"?

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sun Sep 10 14:05:35 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 22:05:35 +1000
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <4503FF8F.6070801@gmail.com>

Paul Prescod wrote:
> The function to open a text file will tentatively be called textfile(), 
> though the function name is not an integral part of this PEP. The 
> function takes three arguments, the filename, the mode ("r", "w", "r+", 
> etc.) and the type.
> 
> The type could be a true encoding or one of a small set of additional 
> symbolic values.

The 'additional symbolic values' should be implemented as true encodings 
(i.e., it should be possible to look up 'site', 'guess' and 'locale' in the 
codecs registry, and replace them there as well).

I also agree with Guido that the right spelling for the factory function is to 
incorporate this into the existing open() builtin. The signature of open() is 
already going to change to accept an encoding argument in Py3k, and the 
special encodings proposed in the PEP are just that: special encodings that 
happen to take environmental information into account when deciding how to 
decode or encode data.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From solipsis at pitrou.net  Sun Sep 10 14:47:15 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 14:47:15 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FDC8.2030608@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>  <4503FDC8.2030608@gmail.com>
Message-ID: <1157892435.4246.107.camel@fsol>

On Sunday 10 September 2006 at 21:58 +1000, Nick Coghlan wrote:
> Antoine Pitrou wrote:
> > So, here is an alternative proposal :
> > Make it so that textfile() doesn't recognize system-wide defaults (as in
> > your proposal), but also provide autotextfile() which would recognize
> > those defaults (with a by_content=False optional argument to enable
> > content-based guessing).
> > 
> > textfile() being clearly marked for use by large, well-thought-out
> > applications, and autotextfile() for small scripts and the like.
> > Different names make it clear that they are for different uses, and
> > make it easy to spot them when looking at source code (either by a human
> > reader or a quality measurement tool).
> 
> How does your "autotextfile('myfile.txt')" differ from Paul's 
> "textfile('myfile.txt', encoding='guess')"?

Paul's "encoding='guess'" specifies a complicated and dangerous guessing
algorithm.

However, autotextfile('myfile.txt') would mean :
- use Paul's "site" if such a thing is defined
- otherwise, use Paul's "locale"
(no content-based guessing)

On the other hand "autotextfile('myfile.txt', by_content=True)" would
enable content-based guessing, thus be equivalent to Paul's
"encoding='guess'".

To sum up the API:
 - textfile("filename.txt", mode, encoding=None): fails without  an
explicit "encoding" argument if no "site" algorithm has been explicitly
configured.
 - autotextfile("filename.txt", mode, by_content=False): selects either
the "site"-configured encoding or the locale fallback, unless
"by_content" is True in which case it tries to detect based on actual
content.
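
To make the shape of this API concrete, here is a rough sketch; the
bodies, the _site_encoding holder and the BOM-only _sniff() helper are
illustrative assumptions, not part of the proposal:

import codecs
import locale

_site_encoding = None   # set once, application-wide, if at all

def textfile(filename, mode="r", encoding=None):
    # "Clean" variant: refuses to guess.
    enc = encoding or _site_encoding
    if enc is None:
        raise ValueError("no encoding given and no site default set")
    return codecs.open(filename, mode, encoding=enc)

def autotextfile(filename, mode="r", by_content=False):
    # Quick-and-dirty variant: site default, then locale fallback.
    if by_content:
        enc = _sniff(filename)
    else:
        enc = _site_encoding or locale.getpreferredencoding()
    return codecs.open(filename, mode, encoding=enc)

def _sniff(filename):
    # Placeholder content check: BOMs only.
    with open(filename, "rb") as f:
        head = f.read(4)
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    return locale.getpreferredencoding()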


In short, my proposal is just a naming proposal to achieve the following
goals :
- the textfile() function is "clean", and satisfies the ideal that it
is Wrong to not specify an encoding when retrieving text from on-disk
bytes
- the autotextfile() function makes it easy to write simple scripts,
with an easy-to-remember function with an explicit name (instead of a
magic value in an optional string argument)
- the autotextfile() function makes it easy to spot those abusive uses
of the quick-and-dirty way in apps which strive for interoperability and
portability

(in French we say "ne pas mélanger les torchons et les serviettes" :
don't mix towels and rags :-))

All this can be in a module, no need to pollute the top-level
namespace :
from text import textfile
from text import autotextfile


> The 'additional symbolic values' should be implemented as true
> encodings (i.e., it should be possible to look up 'site', 'guess' and
> 'locale' in the codecs registry, and replace them there as well).

Treating different things as "true encodings" does not help
understandability IMHO. "guess", "site" and "locale" are not encodings
in themselves, they are decision algorithms. In particular, "guess" has
to look at big chunks of existing text contents before deciding (which
may or may not have side-effects such as unexpected buffering).

Really, while "iso-8859-1" or "utf-8" is always the same encoding,
"guess" will not always result in the same encoding being used: it
depends on actual data fed to it. "guess" will not even allow the same
set of characters to be used: if "guess" results in "iso-8859-1", then I
can't use all the (Unicode) characters that I can use when "guess"
results in "utf-8".
This variability/unpredictability is a fundamental difference in
behaviour compared to a "true encoding", for which you can always be
sure what set of (textual) data can be represented.

Regards

Antoine.



From solipsis at pitrou.net  Sun Sep 10 15:21:14 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 15:21:14 +0200
Subject: [Python-3000] encoding='guess' ?
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>  <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
Message-ID: <1157894475.4246.130.camel@fsol>


Hi,

Let me add that 'guess' should probably be forbidden as an encoding
parameter (instead, a separate function argument should be used as in my
proposal).

Here is a schematic example to show why :

def append_text(filename, encoding):
    src = textfile(filename, "r", encoding)
    my_text = src.read()
    src.close()
    dst = textfile("textlist.txt", "r+", encoding)
    dst.seek_end(0)
    dst.write(my_text + "\n")
    dst.close()

With Paul's current proposal three cases can arise :
 - "encoding" is a real encoding name like iso-8859-1 or utf-8. There
should be no problems, since we assume this encoding has been configured
once and for all in the application.
 - "encoding" is either "site" or "locale". This should result in the
same value run after run, since we assume the site or locale encoding
value has been configured once and for all.
 - "encoding" is "guess". In this case anything can happen. A possible
occurrence is that for the first file, it will result in utf-8 being
detected (or Shift-JIS, or whatever), and for the second file it will be
iso-8859-1. This will lead to a crash in the likely case that some
characters in the source file can't be represented using the character
encoding auto-detected for the destination file.

Yet the append_text() function does look correct, doesn't it?

We shouldn't hide a contextual encoding-detection algorithm under an
encoding name. It leads to semantic uncertainty.

Regards

Antoine.



From ncoghlan at gmail.com  Sun Sep 10 15:44:16 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 23:44:16 +1000
Subject: [Python-3000] encoding='guess' ?
In-Reply-To: <1157894475.4246.130.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	<1157884285.4246.41.camel@fsol>
	<4503FDC8.2030608@gmail.com>	<1157892435.4246.107.camel@fsol>
	<1157894475.4246.130.camel@fsol>
Message-ID: <450416B0.4050109@gmail.com>

Antoine Pitrou wrote:
> Hi,
> 
> Let me add that 'guess' should probably be forbidden as an encoding
> parameter (instead, a separate function argument should be used as in my
> proposal).
> 
> Here is a schematic example to show why :
> 
> def append_text(filename, encoding):
>     src = textfile(filename, "r", encoding)
>     my_text = src.read()
>     src.close()
>     dst = textfile("textlist.txt", "r+", encoding)
>     dst.seek_end(0)
>     dst.write(my_text + "\n")
>     dst.close()
> 
> With Paul's current proposal three cases can arise :
>  - "encoding" is a real encoding name like iso-8859-1 or utf-8. There
> should be no problems, since we assume this encoding has been configured
> once and for all in the application.
>  - "encoding" is either "site" or "locale". This should result in the
> same value run after run, since we assume the site or locale encoding
> value has been configured once and for all.
>  - "encoding" is "guess". In this case anything can happen. A possible
> occurrence is that for the first file, it will result in utf-8 being
> detected (or Shift-JIS, or whatever), and for the second file it will be
> iso-8859-1. This will lead to a crash in the likely case that some
> characters in the source file can't be represented using the character
> encoding auto-detected for the destination file.
> 
> Yet the append_text() function does look correct, doesn't it?
> 
> We shouldn't hide a contextual encoding-detection algorithm under an
> encoding name. It leads to semantic uncertainty.

Interesting. This goes back more towards the model of "no default encoding, 
but provide the right tools to make it easy for a program to choose one in the 
absence of any metadata".

So perhaps there should just be an explicit function "guessencoding()" that 
accepts a filename and returns a codec name. So if you want to guess, you 
would do something like:

f = open(fname, 'r', string.guessencoding(fname))

The PEP's other suggestions would then be spelled something like:

f = open(fname, 'r', string.getlocaleencoding())
f = open(fname, 'r', string.getsiteencoding())

Cheers,
Nick.



-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From david.nospam.hopwood at blueyonder.co.uk  Sun Sep 10 15:52:44 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sun, 10 Sep 2006 14:52:44 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	<1157884285.4246.41.camel@fsol>
	<4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol>
Message-ID: <450418AC.2010400@blueyonder.co.uk>

Antoine Pitrou wrote:
On Sunday 10 September 2006 at 21:58 +1000, Nick Coghlan wrote:
>>Antoine Pitrou wrote:
>>
>>>So, here is an alternative proposal :
>>>Make it so that textfile() doesn't recognize system-wide defaults (as in
>>>your proposal), but also provide autotextfile() which would recognize
>>>those defaults (with a by_content=False optional argument to enable
>>>content-based guessing).
>>>
>>>textfile() being clearly marked for use by large, well-thought-out
>>>applications, and autotextfile() for small scripts and the like.
>>>Different names make it clear that they are for different uses, and
>>>make it easy to spot them when looking at source code (either by a human
>>>reader or a quality measurement tool).
>>
>>How does your "autotextfile('myfile.txt')" differ from Paul's 
>>"textfile('myfile.txt', encoding='guess')"?
> 
> Paul's "encoding='guess'" specifies a complicated and dangerous guessing
> algorithm.

Indeed, to the extent that it specifies anything. However, guessing algorithms
can differ greatly in how complicated and dangerous they are.

Here is a very simple, reasonably (although not completely) safe, and much
more predictable guessing algorithm, based on a generalization of
<http://www.w3.org/TR/REC-xml/#sec-guessing>:

   Let A, B, C, and D be the first 4 bytes of the stream, or None if the
     corresponding byte is past end-of-stream.

   Let other be any encoding which is to be used as a default if no specific
     UTF is detected.

   if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
   if B == None: return other
   if A == 0 and B == 0 and D != None: return UTF32BE
   if C == 0 and D == 0: return UTF32LE
   if A == 0xFE and B == 0xFF: return UTF16BE
   if A == 0xFF and B == 0xFE: return UTF16LE
   if A != 0 and B != 0: return other
   if A == 0: return UTF16BE
   return UTF16LE

This would normally be used with 'other' as the system encoding, as an alternative
to just assuming that the file is in the system encoding.

There is very little chance of this algorithm misdetecting a file in a non-Unicode
encoding as Unicode. For that to happen, either the first two or three bytes would
have to be encoded in exactly the same way as a UTF-16 or UTF-8 BOM, or one of the
first three characters would have to be NUL.

However, if the file *is* Unicode and it starts with a BOM, then its UTF will
always be correctly detected.

Furthermore, UTF-16 and UTF-32 will be correctly detected if the file starts with
a character from U+0001 to U+00FF (i.e. non-NUL and in the ISO-8859-1 range).

Another advantage of this algorithm is that it always reads only 4 bytes.
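
For reference, a direct Python transcription of the above (in
modern-Python terms, where indexing bytes yields integers); the codec
names returned are an assumption, and the all-NUL corner case discussed
later in the thread is reproduced faithfully:

def detect_utf(head, other):
    # head: the first (up to) 4 bytes of the stream.
    A, B, C, D = (list(head[:4]) + [None] * 4)[:4]
    if (A, B, C) == (0xEF, 0xBB, 0xBF):
        return "utf-8-sig"          # UTF-8; -sig strips the BOM
    if B is None:
        return other
    if A == 0 and B == 0 and D is not None:
        return "utf-32-be"
    if C == 0 and D == 0:
        return "utf-32-le"
    if (A, B) == (0xFE, 0xFF):
        return "utf-16-be"
    if (A, B) == (0xFF, 0xFE):
        return "utf-16-le"
    if A != 0 and B != 0:
        return other
    if A == 0:
        return "utf-16-be"
    return "utf-16-le"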

> However, autotextfile('myfile.txt') would mean :
> - use Paul's "site" if such a thing is defined
> - otherwise, use Paul's "locale"
> (no content-based guessing)
> 
> On the other hand "autotextfile('myfile.txt', by_content=True)" would
> enable content-based guessing, thus be equivalent to Paul's
> "encoding='guess'".

As I pointed out earlier, any file open function that guesses the encoding
should return which encoding has been guessed. Alternatively, it could be
possible to allow the encoding to be set after the file has been opened,
in which case a separate function could do the guessing.

>>The 'additional symbolic values' should be implemented as true
>>encodings (i.e., it should be possible to look up 'site', 'guess' and
>>'locale' in the codecs registry, and replace them there as well).
> 
> Treating different things as "true encodings" does not help
> understandability IMHO. "guess", "site" and "locale" are not encodings
> in themselves, they are decision algorithms.

+1.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From solipsis at pitrou.net  Sun Sep 10 16:00:06 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:00:06 +0200
Subject: [Python-3000] encoding='guess' ?
In-Reply-To: <450416B0.4050109@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>  <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <1157894475.4246.130.camel@fsol>
	<450416B0.4050109@gmail.com>
Message-ID: <1157896806.4246.138.camel@fsol>


On Sunday 10 September 2006 at 23:44 +1000, Nick Coghlan wrote:
> Interesting. This goes back more towards the model of "no default encoding, 
> but provide the right tools to make it easy for a program to choose one in the 
> absence of any metadata".

In the "clean" API yes.
But it would be nice to also have an easy API for small scripts, hence
my "autotextfile" proposal.
(and, it would also avoid making life too hard for beginners trying to
learn the language)

> f = open(fname, 'r', string.guessencoding(fname))

This one is inefficient because it results in opening the file twice:
once in string.guessencoding(), and once in open().
This does not happen if there is a special argument instead, like
"by_content=True" in my proposal.

Cheers 

Antoine.



From solipsis at pitrou.net  Sun Sep 10 16:04:47 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:04:47 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>  <450418AC.2010400@blueyonder.co.uk>
Message-ID: <1157897087.4246.143.camel@fsol>

On Sunday 10 September 2006 at 14:52 +0100, David Hopwood wrote:
> > On the other hand "autotextfile('myfile.txt', by_content=True)" would
> > enable content-based guessing, thus be equivalent to Paul's
> > "encoding='guess'".
> 
> As I pointed out earlier, any file open function that guesses the encoding
> should return which encoding has been guessed.

Since open files are objects, the encoding can just be a read-only
property:

# replace autotextfile by whatever API is finally chosen ;)
f = autotextfile('myfile.txt', by_content=True)
enc = f.encoding

> Alternatively, it could be possible to allow the encoding to be set
> after the file has been opened, in which case a separate function
> could do the guessing.

Yes, sounds like a nice alternative.

Regards

Antoine.



From solipsis at pitrou.net  Sun Sep 10 16:27:12 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:27:12 +0200
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
Message-ID: <1157898432.4246.161.camel@fsol>


Hi,

Another aspect of the textfile discussion.
sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
at least under Unix). Yet it must be possible to read/write text to and
from them.

So two questions:
 - Is there a builtin text.stdin / text.stdout counterpart to
sys.stdin / sys.stdout (the former being text versions, the latter raw
bytes versions) ?
Or a way to write: my_input_file = textfile(sys.stdin) ?
 - How is the default encoding handled?
Does Python mandate setting an encoding before calling print() or
raw_input() ?

Also, consider a "script.py" beginning with:

import sys, text
if len(sys.argv) > 1:
    f = text.textfile(sys.argv[1], "r")
else:
    f = text.stdin

Should encoding policy be chosen differently depending on whether the
script is called with:
    python script.py in.txt
or with:
    python script.py < in.txt
?

Regards

Antoine.



From qrczak at knm.org.pl  Sun Sep 10 18:08:14 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 10 Sep 2006 18:08:14 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	(Paul Prescod's message of "Sat, 9 Sep 2006 20:29:05 -0700")
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <87d5a366xd.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> The type could be a true encoding or one of a small set of additional
> symbolic values. The two main symbolic values are:

Here is a counter-proposal.

There is a variable sys.default_encoding. It's used by file opening
functions when the encoding is not specified explicitly, among others.
Its initial value is set in site.py with a site-specific algorithm.

Two variants of the proposal:

1. The default site-specific algorithm queries the locale on Unix,
   uses "mbcs" on Windows (which is a special encoding which causes
   to use MultiByteToWideChar as the decoding function), and something
   appropriate on other systems.

2. The default initial value is "locale" (or "system" or "default" or
   whatever, but the spelling is fixed), which is a special encoding
   name which means to use the system-specific encoding, as above.

I prefer variant 1: it's simpler and it allows programs to examine the
choice on Unix.

A Python-specific environment variable could be defined to override
the system-specific choice.
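
A minimal sketch of variant 1 as it might run from site.py; the
PYTHONENCODING override variable is purely hypothetical:

import os
import sys
import locale

def _default_encoding():
    # Hypothetical Python-specific override, as suggested above.
    override = os.environ.get("PYTHONENCODING")
    if override:
        return override
    if sys.platform == "win32":
        # Special encoding decoded via MultiByteToWideChar.
        return "mbcs"
    # Unix: query the locale, e.g. en_US.UTF-8 -> "UTF-8".
    return locale.getpreferredencoding() or "ascii"

sys.default_encoding = _default_encoding()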

If MultiByteToWideChar on Windows doesn't handle UTF-8 even with a BOM
(I don't know whether it does), then the Windows default could be an
encoding which assumes UTF-8 when a UTF-8 BOM is present, and uses
MultiByteToWideChar otherwise. This applies only to Windows; Unix
rarely uses a BOM, OTOH on Unix you can have UTF-8 locales which
Windows doesn't have as far as I know.

Other than that, guessing the encoding from the contents of the text
stream, especially statistical guessing based on well-formed UTF-8
non-ASCII characters, shouldn't be encouraged, because its effect is
not predictable. There can be a separate function which guesses the
encoding for those who really want to do this.

If Python ever has dynamically-scoped variables, sys.default_encoding
should be dynamically scoped, so it's possible to set it for the context
of a block of code.

sys.default_encoding also applies to filenames, to names and values of
environment variables, to program invocation parameters (both sys.argv
and os.exec*), to pwd.struct_passwd.pw_gecos, etc. There are a number
of Unix interfaces which don't specify the encoding of the text they
exchange (and of course pw_gecos doesn't contain a BOM if it's UTF-8).


Antoine Pitrou <solipsis at pitrou.net> writes:

> sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
> at least under Unix). Yet it must be possible to read/write text to and
> from them.

Here is how my language Kogut does this:

RawStdIn etc. are the underlying raw files (thin wrappers over file
descriptors). StdIn etc. are text files with encoding, buffering etc.
They are initialized the first time they are used, i.e. the first time
the StdIn variable is read. They are constructed with the default
encoding in effect at that time.

This allows a script to set the default encoding before accessing
standard text streams.

I don't know whether Python typically accesses stdin/stdout during
initialization, before the first line of the script is executed.
If it does, this design can't be used until this is changed.
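
In Python terms, the lazy construction could look roughly like this;
sys.default_encoding and the wrapper class are assumptions of the
sketch, not an existing API:

import io
import sys

class LazyTextStream:
    # Wraps a raw byte stream; the encoding is chosen on first use,
    # so a script may change the default before touching the stream.
    def __init__(self, raw):
        self._raw = raw
        self._text = None

    def __getattr__(self, name):
        if self._text is None:
            enc = getattr(sys, "default_encoding", "utf-8")
            self._text = io.TextIOWrapper(self._raw, encoding=enc)
        return getattr(self._text, name)

stdin = LazyTextStream(sys.stdin.buffer)   # RawStdIn -> StdIn, lazily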

> Also, consider a "script.py" beginning with:
>
> import sys, text
> if len(sys.argv) > 1:
>     f = text.textfile(sys.argv[1], "r")
> else:
>     f = text.stdin
>
> Should encoding policy be chosen differently depending on whether the
> script is called with:
>     python script.py in.txt
> or with:
>     python script.py < in.txt
> ?

With my design it's the same. It's also the same if the script does
sys.default_encoding = 'ISO-8859-1' at the beginning.

Note: in my design sys.argv is also initialized lazily (in fact each
time it is accessed, until it's assigned to, at which point it starts
to behave as a normal variable).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From guido at python.org  Sun Sep 10 19:04:56 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:04:56 -0700
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
In-Reply-To: <1157898432.4246.161.camel@fsol>
References: <1157898432.4246.161.camel@fsol>
Message-ID: <ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>

On 9/10/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Another aspect of the textfile discussion.
> sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
> at least under Unix).

No, they are conceptually text streams, because that's what they are
on Windows, which is the only remaining platform where you can currently
experience the difference between text and byte streams.

> Yet it must be possible to read/write text to and
> from them.

I'd turn it around. If you want to read bytes from stdin (sometimes a
useful thing for filters), in Py3k you better dig out the underlying
byte stream and use that.

> So two questions:
>  - Is there a builtin text.stdin / text.stdout counterpart to
> sys.stdin / sys.stdout (the former being text versions, the latter raw
> bytes versions) ?

You've got it backwards.

> Or a way to write: my_input_file = textfile(sys.stdin) ?
>  - How is the default encoding handled?
> Does Python mandate setting an encoding before calling print() or
> raw_input() ?

Not in my view of the future. :-)

> Also, consider a "script.py" beginning with:
>
> import sys, text
> if len(sys.argv) > 1:
>     f = text.textfile(sys.argv[1], "r")
> else:
>     f = text.stdin
>
> Should encoding policy be chosen differently depending on whether the
> script is called with:
>     python script.py in.txt
> or with:
>     python script.py < in.txt
> ?

All sorts of things are different when reading stdin vs. opening a
filename. e.g. stdin may be a pipe.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sun Sep 10 19:09:17 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:09:17 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FF8F.6070801@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
Message-ID: <ca471dc20609101009x3ebe0482x5ceb786e748558b@mail.gmail.com>

On 9/10/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> The 'additional symbolic values' should be implemented as true encodings
> (i.e., it should be possible to look up 'site', 'guess' and 'locale' in the
> codecs registry, and replace them there as well).

That's hard to do since guessing, at least, may require inspection of
a large portion of the input data before settling upon a specific
choice. The decoding API doesn't have a way to do this AFAIK. And for
encoding (output) it's even more iffy -- if possible I'd like the
guessing function to have access to what was in the file before it was
emptied by the "create" function, or what's at the start before
appending to the end.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sun Sep 10 19:11:46 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:11:46 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <87d5a366xd.fsf@qrnik.zagroda>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
Message-ID: <ca471dc20609101011n3c8d57e4w1931462097d246b5@mail.gmail.com>

On 9/10/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> Here is a counter-proposal.
>
> There is a variable sys.default_encoding. It's used by file opening
> functions when the encoding is not specified explicitly, among others.
> Its initial value is set in site.py with a site-specific algorithm.

This doesn't seem to allow guessing based on the file's contents. That
seems intentional on your part, but I believe it makes for way too
many disappointing user experiences.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From 2006 at jmunch.dk  Sun Sep 10 19:53:09 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Sun, 10 Sep 2006 19:53:09 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <450254DB.3020502@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>	<4501A1B1.5050707@gmail.com>	<200609081403.18350.fdrake@acm.org>	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>	<1157740873.4979.10.camel@fsol>	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<450254DB.3020502@gmail.com>
Message-ID: <45045105.5040209@jmunch.dk>

Nick Coghlan wrote:
 > Jim Jewett wrote:
 >> Why not just borrow the standard symbolic names of cur and end?
 >>
 >>     seek(pos=0)
 >>     seek_cur(pos=0)
 >>     seek_end(pos=0)

I say drop seek_cur and seek_end altogether, and keep only absolute
seek.

The C library caters for archaic file systems that are record-based
or otherwise not well modelled as an array of bytes.  That's where the
ftell/fseek/fpos_t system comes from: An fpos_t might be a composite
data type containing a record number and a within-record offset; but as
long as it's used as an opaque token, you'd never notice.

That was a nice design for backward-compatibility back in the early
1970's.  Thirty years later do we still need it?  POSIX and Win32 have
array-of-bytes files.  Does CPython even run on any OS where binary
files are not seen as arrays of bytes?  I'm saying _binary_ files
because a gander through the standard library shows that seeking is
never done on text files.  Even mailbox.py opens Unix mailbox files as
binary.

The majority of f.seek(.., 2) calls in the library use it for computing
the length of the file.  How's that for an "opaque token": f.tell() is
taken to be the length of the file after f.seek(0,2).

As for seeking to the end with only an absolute .seek available:
Surely, any file that supports seeking to the end will also support
reporting the file size.  Thus
  f.seek(f.length)
should suffice, and what could be clearer?  Also, there's the "a+"
mode for appending, no seeks required.
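
For reference, the idiom in question, with the proposed attribute shown
as a hypothetical alternative:

with open("data.bin", "rb") as f:
    f.seek(0, 2)        # 2 = SEEK_END: position at the end of file
    size = f.tell()     # the offset there is the file's length
    # proposed instead: size = f.length (hypothetical attribute),
    # and f.seek(f.length) to reach the end with absolute seek only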

Having just a single method/mode will not only ease file-protocol
implementation, but IMO client code will be easier to read as well.

- Anders

PS: I'm working on that FileBytes object, Tomer, as a wrapper over an
object that supports seek to absolute position, with integrated
buffering.


From jcarlson at uci.edu  Sun Sep 10 20:08:41 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 11:08:41 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <4503AC79.4090601@canterbury.ac.nz>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz>
Message-ID: <20060910110221.F8EA.JCARLSON@uci.edu>


Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Or maybe there should be a different mechanism altogether
> for non-local gotos. I'd like to see some kind of "longjmp"
> object that could be invoked to cause a jump back to
> a specific place. That would help alleviate the problem
> that exceptions used for control flow can get caught by
> the wrong handler. Sometimes you really want something
> that's targeted to a specific handler, not just the next
> enclosing one of some type.

I imagine you mean something like this...

    try:
        for ....:
            try:
                dosomething()
            except Exception:
                ...
    except FlowException1:
        ...

And the answer I've always heard is:

    try:
        for ....:
            try:
                dosomething()
            except FlowException1:
                raise
            except Exception:
                ...
    except FlowException1:
        ...


That really only works if you have control over the entire stack of
possible exception handlers, but it is also the only way it
makes sense, unless I'm misunderstanding what you are asking for.  If I
am misunderstanding, please provide some sample code showing what needs
to be done now, and what you would like to be possible.

 - Josiah


From paul at prescod.net  Sun Sep 10 20:09:39 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:09:39 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060910103500.GA13412@phd.pp.ru>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<20060910103500.GA13412@phd.pp.ru>
Message-ID: <1cb725390609101109q52578a84o5382fd4e97e40ba0@mail.gmail.com>

Suggestion accepted.

On 9/10/06, Oleg Broytmann <phd at mail2.phd.pp.ru> wrote:
>
> On Sat, Sep 09, 2006 at 08:29:05PM -0700, Paul Prescod wrote:
> > "the protocol header says that this data is latin-1").
>
>    "Protocol metadata" if you allow me to suggest the word.
>
> Oleg.
> --
>      Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
>            Programmers don't die, they just GOSUB without RETURN.
>

From paul at prescod.net  Sun Sep 10 20:14:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:14:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157886177.4246.59.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
Message-ID: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>

I went based on the current setdefaultencoding. But it seems that we will
accumulate 3 or 4 related functions so I'm persuaded that there should be a
module.

encodingdetection.setdefaultfileencoding
encodingdetection.registerencodingdetector
encodingdetection.guessfileencoding(filename)
encodingdetection.guessfileencoding(bytestream)

Suggestion accepted.
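
A sketch of how that module's surface could hang together; the bodies
and the detector registry are illustrative assumptions only:

_default_encoding = None
_detectors = {}

def setdefaultfileencoding(name):
    # 'name' may be a real encoding or a detection scheme name.
    global _default_encoding
    _default_encoding = name

def registerencodingdetector(name, detector):
    # detector: callable taking leading bytes, returning an encoding.
    _detectors[name] = detector

def guessfileencoding(source):
    # Accepts a filename or a byte stream, per the two forms above.
    if isinstance(source, str):
        with open(source, "rb") as f:
            head = f.read(1024)
    else:
        head = source.read(1024)
    return _detectors["guess"](head)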

On 9/10/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
>
> > The Site Decoding Hook
> > ========================
> >
> > The "sys" module could have a function called
> > "setdefaultfileencoding". The encoding specified could be a true
> > encoding name or one of the encoding detection scheme names ( e.g.
> > "guess" or "XML").
>
> Isn't it more intuitive to gather functions based on what their
> high-level purpose is ("text" or "textfile") rather than on implementation
> details of where the information comes from ("sys", "locale")?
>
> That function could be "textfile.set_default_encoding" (with
> underscores), or even "text.textfile.set_default_encoding" (if all this
> resides in a "text" module).
>
> Regards
>
> Antoine.
>
>

From jcarlson at uci.edu  Sun Sep 10 20:25:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 11:25:43 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
Message-ID: <20060910111814.F8ED.JCARLSON@uci.edu>


David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:
> 
>    Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>      corresponding byte is past end-of-stream.
> 
>    Let other be any encoding which is to be used as a default if no specific
>      UTF is detected.
> 
>    if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>    if B == None: return other
>    if A == 0 and B == 0 and D != None: return UTF32BE
>    if C == 0 and D == 0: return UTF32LE
>    if A == 0xFE and B == 0xFF: return UTF16BE
>    if A == 0xFF and B == 0xFE: return UTF16LE
>    if A != 0 and B != 0: return other
>    if A == 0: return UTF16BE
>    return UTF16LE
> 
> This would normally be used with 'other' as the system encoding, as an alternative
> to just assuming that the file is in the system encoding.

Using the XML guessing mechanism is fine, as long as you get it right. 
A first pass with BOM detection and a second pass to "guess" based on
content in the case that a BOM isn't detected seems to make sense.

Note that the above algorithm returns UTF32BE for files beginning with
4 null bytes.

 - Josiah


From paul at prescod.net  Sun Sep 10 20:25:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:25:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FF8F.6070801@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
Message-ID: <1cb725390609101125q26fba051ya086e5ed005e08c5@mail.gmail.com>

On 9/10/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
> Paul Prescod wrote:
> > The function to open a text file will tentatively be called textfile(),
> > though the function name is not an integral part of this PEP. The
> > function takes three arguments, the filename, the mode ("r", "w", "r+",
> > etc.) and the type.
> >
> > The type could be a true encoding or one of a small set of additional
> > symbolic values.
>
> The 'additional symbolic values' should be implemented as true encodings
> (i.e., it should be possible to look up 'site', 'guess' and 'locale' in the
> codecs registry, and replace them there as well).


I don't believe that these are "true" encodings because when you query a
stream for its encoding you will never find these names nor an alias for
them.

> I also agree with Guido that the right spelling for the factory function
> is to incorporate this into the existing open() builtin. The signature of
> open() is already going to change to accept an encoding argument in Py3k,
> and the special encodings proposed in the PEP are just that: special
> encodings that happen to take environmental information into account when
> deciding how to decode or encode data.


Yes, well I disagree that the open function should get a new argument. I
think it should either be deprecated or used to open byte streams. The
function name is a holdover from Unix/C which has no resonance with a Java,
C#, or JavaScript programmer.

Plus I would like to ease the writing of code that is both valid Python 2.x and
3.x. I'd advocate the strategy that we should try to have a large enough
behavioural overlap that modules can be written to run on both. Subtle
changes in semantics make this difficult. To the extent that this is
unavoidable (e.g. behaviour of very core syntax) I guess we'll have to live
with it. But we can easily add a function called textfile() to both Python
2.x and Python 3.x and ease the transition.

 Paul Prescod

From paul at prescod.net  Sun Sep 10 20:30:14 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:30:14 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
Message-ID: <1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com>

I don't mind your name of autotextfile but I think that your by_content
argument defeats the goal of having a very simple API for quick and dirty
stuff. If content detection is a good idea (usually right) then we should do
it. If it isn't, we shouldn't. I don't see a need for an argument to turn it
on and off. The programmer is not likely to have a lot more understanding
than we do of whether it is effective or not.

Also, there are two different levels of content detection (as someone later
in the thread pointed out). There is looking at BOMs, and there is a
statistical approach of looking for high characters and inferring their
encoding. I can't see an argument for ever turning off the BOM detection.

 Paul Prescod

From paul at prescod.net  Sun Sep 10 21:02:44 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 12:02:44 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
Message-ID: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>

On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:


Your algorithm is more predictable but will confuse BOM-less UTF-8 with the
system encoding frequently. I haven't decided in my own mind whether that
trade-off is worth making. It will work well for:

 * Windows users, who will often find a BOM in their UTF-8

 * Western Unix/Linux users who will increasingly use UTF-8 as their system
encoding

It will not work well for:

 * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as"
UTF-8

 * Mac users using UTF-8 apps or saving as UTF-8.

I still haven't decided how I feel about that trade-off.

Maybe the guessing algorithm should read the WHOLE FILE. After all, we've
said repeatedly that it isn't for production use so making it a bit
inefficient is not a big problem and might even emphasize that point.

Modern I/O is astonishingly fast anyhow. On my computer it takes five
seconds to decode a quarter gigabyte of UTF-8 text through Python. That
would be a totally unacceptable waste for a production program, but for a
quick hack it wouldn't be bad. And it would guarantee that you would never
get an exception half-way through your parsing because of a bad character.

 Paul Prescod

From solipsis at pitrou.net  Sun Sep 10 21:36:48 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 21:36:48 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
Message-ID: <1157917008.4257.8.camel@fsol>


On Sunday 10 September 2006 at 12:02 -0700, Paul Prescod wrote:
> Your algorithm is more predictable but will confuse BOM-less UTF-8
> with the system encoding frequently.

I don't think it is desirable to acknowledge only some kinds of UTF-8.
It will confuse the hell out of programmers, and users.

I'm not sure full-blown statistical analysis is necessary anyway. There
should be an ordered list of detectable encodings, which realistically
would be [all Unicode variants, system default]. Then if you have a file
which is syntactically valid UTF-8, it most likely /is/ UTF-8 and not
ISO-8859-1 (for example).
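
That ordered-list idea is cheap to express; a rough sketch, with the
candidate order and the final fallback as assumptions:

import locale

def detect(data, candidates=("utf-8",)):
    # Try strict decodes in priority order; first success wins.
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return locale.getpreferredencoding()   # the "system default" entry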

> Modern I/O is astonishingly fast anyhow. On my computer it takes five
> seconds to decode a quarter gigabyte of UTF-8 text through Python.

Maybe we shouldn't be that presumptuous. Modern I/O is fast but memory
is not infinite. That quarter gigabyte will have swapped out other
data/code in order to make room in the filesystem cache.
Also, Python is often used on more modest hardware.

Regards

Antoine.



From tjd at sfu.ca  Sun Sep 10 21:46:46 2006
From: tjd at sfu.ca (Toby Donaldson)
Date: Sun, 10 Sep 2006 12:46:46 -0700
Subject: [Python-3000] educational aspects of Python 3000
Message-ID: <a2f565170609101246s5d2e6fd1x7fac42826c0583e5@mail.gmail.com>

Hello,

There's been an explosion of discussion on the EDU-SIG list recently
about the removal of raw_input and input from Python 3000.

For teaching purposes, many educators report that they like raw_input
(and input). The basic argument is that, for beginners, code like

     name = raw_input('Morbo demands your name! ')

is clearer and easier than using sys.stdin.readline().

Some fear that a big mistake is being made here. Others just fear
getting bogged down in EDU-SIG discussions. :-)

Any suggestions for how educators interested in the
educational/learning aspects of Python 3000 could more fruitfully
participate?

For instance, would there be interest in the inclusion of a standard
educational library, a la the Java ACM library
(http://www-cs-faculty.stanford.edu/~eroberts//jtf/index.html)?

Toby
-- 
Dr. Toby Donaldson
School of Computing Science
Simon Fraser University (Surrey)

From solipsis at pitrou.net  Sun Sep 10 21:57:56 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 21:57:56 +0200
Subject: [Python-3000] content-based detection
In-Reply-To: <1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
	<1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com>
Message-ID: <1157918276.4257.30.camel@fsol>


On Sunday 10 September 2006 at 11:30 -0700, Paul Prescod wrote:
> I don't mind your name of autotextfile but I think that your
> by_content argument defeats the goal of having a very simple API for
> quick and dirty stuff. If content detection is a good idea (usually
> right) then we should do it.

Using the system or locale default is trustworthy and reproducible.
Content-based detection is wilder, especially if the algorithm isn't
fully refined in the first Py3k releases.

> I can't see an argument for ever turning off the BOM detection. 

Perhaps, but having a subset of it still running behind your back after
you have disabled it is misleading.

Also, I think having BOM detection as the only test in content-based
detection would be uninteresting. The common use case for encoding
detection is to guess between one of the Unicode variants (mostly UTF-8
*with or without BOM*) and the non-Unicode encoding which is popular for
a given language (e.g. ISO-8859-15).

I doubt many people have to discriminate between UTF-16LE, UCS-4 and
UTF-8. Are there real cases like that for text files?

Regards

Antoine.



From david.nospam.hopwood at blueyonder.co.uk  Sun Sep 10 22:01:10 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sun, 10 Sep 2006 21:01:10 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060910111814.F8ED.JCARLSON@uci.edu>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<20060910111814.F8ED.JCARLSON@uci.edu>
Message-ID: <45046F06.5090502@blueyonder.co.uk>

Josiah Carlson wrote:
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>>Here is a very simple, reasonably (although not completely) safe, and much
>>more predictable guessing algorithm, based on a generalization of
>><http://www.w3.org/TR/REC-xml/#sec-guessing>:
>>
>>   Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>>     corresponding byte is past end-of-stream.
>>
>>   Let other be any encoding which is to be used as a default if no specific
>>     UTF is detected.
>>
>>   if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>>   if B == None: return other
>>   if A == 0 and B == 0 and D != None: return UTF32BE
>>   if C == 0 and D == 0: return UTF32LE
>>   if A == 0xFE and B == 0xFF: return UTF16BE
>>   if A == 0xFF and B == 0xFE: return UTF16LE
>>   if A != 0 and B != 0: return other
>>   if A == 0: return UTF16BE
>>   return UTF16LE
>>
>>This would normally be used with 'other' as the system encoding, as an alternative
>>to just assuming that the file is in the system encoding.
> 
> Using the XML guessing mechanism is fine, as long as you get it right. 
> A first pass with BOM detection and a second pass to "guess" based on
> content in the case that a BOM isn't detected seems to make sense.

... if you think that guessing based on content is a good idea -- I don't.
In any case, such guessing necessarily depends on the expected file format,
so it should be done by the application itself, or by a library that knows
more about the format.

If the encoding of a text stream were settable after it had been opened,
then it would be easy for anyone to implement whatever guessing algorithm
they needed, without having to write an encoding implementation or include
any other support for guessing in the I/O library itself.

(This also requires the ability to seek back to the beginning of the stream
after reading the data needed for the guess.)
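
A minimal sketch of that pattern, assuming the caller supplies the
guessing function:

import io

def open_guessed(filename, guess):
    raw = open(filename, "rb")
    head = raw.read(4)        # the data needed for the guess
    raw.seek(0)               # "push back" by rewinding
    return io.TextIOWrapper(raw, encoding=guess(head))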

> Note that the above algorithm returns UTF32BE for files beginning with
> 4 null bytes.

Yes. But such a thing probably isn't a text file at all -- in which case
there will be subsequent decoding errors when most of the code units are
not in the range 0 to 0x10FFFF.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Sun Sep 10 23:12:34 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sun, 10 Sep 2006 22:12:34 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>	
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
Message-ID: <45047FC2.40904@blueyonder.co.uk>

Paul Prescod wrote:
> Maybe the guessing algorithm should read the WHOLE FILE.

That wouldn't work for streams (e.g. stdin). The algorithm I gave
does work for streams, provided that they have a push-back buffer of
at least 4 bytes.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From jcarlson at uci.edu  Sun Sep 10 23:47:13 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 14:47:13 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45046F06.5090502@blueyonder.co.uk>
References: <20060910111814.F8ED.JCARLSON@uci.edu>
	<45046F06.5090502@blueyonder.co.uk>
Message-ID: <20060910143817.F8F9.JCARLSON@uci.edu>


David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Josiah Carlson wrote:
[snip]
> > Using the XML guessing mechanism is fine, as long as you get it right. 
> > A first pass with BOM detection and a second pass to "guess" based on
> > content in the case that a BOM isn't detected seems to make sense.
> 
> ... if you think that guessing based on content is a good idea -- I don't.
> In any case, such guessing necessarily depends on the expected file format,
> so it should be done by the application itself, or by a library that knows
> more about the format.

I'm keeping my hat out of the ring for whether guessing is a good idea. 
However, if one is going to have a guessing mechanism, starting with UTF
BOMs is a good start, which is what I was trying to say.


> If the encoding of a text stream were settable after it had been opened,
> then it would be easy for anyone to implement whatever guessing algorithm
> they needed, without having to write an encoding implementation or include
> any other support for guessing in the I/O library itself.

That is true.  But the fact that you, presumably a programmer
experienced with Unicode, have provided an algorithm with an
obvious hole that I was able to discover in a few moments suggests that
guessing algorithms are not easy to write.

> (This also requires the ability to seek back to the beginning of the stream
> after reading the data needed for the guess.)
> 
> > Note that the above algorithm returns UTF32BE for files beginning with
> > 4 null bytes.
> 
> Yes. But such a thing probably isn't a text file at all -- in which case
> there will be subsequent decoding errors when most of the code units are
> not in the range 0 to 0x10FFFF.

A file starting with 4 nulls most likely implies a non-text file
of some kind, but presuming that "most" code points would not be in the
0...0x10FFFF range is a bit of an assumption about the content of a file. I
thought you didn't want to guess.


 - Josiah


From paul at prescod.net  Mon Sep 11 06:09:47 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:09:47 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157917008.4257.8.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<1157917008.4257.8.camel@fsol>
Message-ID: <1cb725390609102109j6e4c1e7bk18087b5319928abf@mail.gmail.com>

On 9/10/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> ...
> > Modern I/O is astonishingly fast anyhow. On my computer it takes five
> > seconds to decode a quarter gigabyte of UTF-8 text through Python.
>
> Maybe we shouldn't be that presumptuous. Modern I/O is fast but memory
> is not infinite. That quarter gigabyte will have swapped out other
> data/code in order to make room in the filesystem cache.

Not really. It works in 16k chunks.

> Also, Python is often used on more modest hardware.

People writing programs to deal with vast amounts of data on modest
computers are trying to do something advanced and should not use the
quick and dirty guessing algorithms. We're not trying to hit 100% of
programmers and situations. Not even close. The PEP was very explicit
about that fact.

 Paul Prescod

From paul at prescod.net  Mon Sep 11 06:11:11 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:11:11 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45047FC2.40904@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<45047FC2.40904@blueyonder.co.uk>
Message-ID: <1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com>

The PEP doesn't deal with streams. It is about files.

On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Paul Prescod wrote:
> > Maybe the guessing algorithm should read the WHOLE FILE.
>
> That wouldn't work for streams (e.g. stdin). The algorithm I gave
> does work for streams, provided that they have a push-back buffer of
> at least 4 bytes.
>
> --
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk>
>
>

From paul at prescod.net  Mon Sep 11 06:31:00 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:31:00 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45046F06.5090502@blueyonder.co.uk>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<20060910111814.F8ED.JCARLSON@uci.edu>
	<45046F06.5090502@blueyonder.co.uk>
Message-ID: <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>

On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Josiah Carlson wrote:
> ... if you think that guessing based on content is a good idea -- I don't.
> In any case, such guessing necessarily depends on the expected file format,
> so it should be done by the application itself, or by a library that knows
> more about the format.

I disagree. If a non-trivial file can be decoded as a UTF-* encoding
it probably is that encoding. I don't see how it matters whether the
file represents LaTeX or an .htaccess file. XML is a special case
because it is specially designed to make encoding detection (not
guessing, but detection) easy.

> If the encoding of a text stream were settable after it had been opened,
> then it would be easy for anyone to implement whatever guessing algorithm
> they needed, without having to write an encoding implementation or include
> any other support for guessing in the I/O library itself.

But this defeats the whole purpose of the PEP which is to accelerate
the writing of quick and dirty text processing scripts.

 Paul Prescod

From paul at prescod.net  Mon Sep 11 06:42:01 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:42:01 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <87d5a366xd.fsf@qrnik.zagroda>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
Message-ID: <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com>

On 9/10/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>...
> Other than that, guessing the encoding from the contents of the text
> stream, especially statistical guessing based on well-formed UTF-8
> non-ASCII characters, shouldn't be encouraged, because its effect is
> not predictable.

My thinking has evolved. The "guess" mode should "virtually" try
different decodings until one succeeds. In the worst case this might
involve decoding the whole file twice (once for detection and once for
application processing).
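
Roughly, a sketch of the mechanism (the function name and the
candidate list are only illustrative here, not what the PEP would
specify):

    def guess_encoding(data, candidates=('utf-8', 'utf-16', 'latin-1')):
        # try each candidate against the WHOLE byte string; the first
        # one that decodes cleanly wins
        for enc in candidates:
            try:
                data.decode(enc)    # detection pass; the application
                return enc          # then decodes a second time
            except UnicodeError:
                continue
        raise UnicodeError('no candidate encoding fits')

    # Note that 'latin-1' never fails, so placed last it acts as a
    # catch-all; the order of the candidates is significant.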

In general, your proposal is too far from the goals that were given to
me by Guido for me to really evaluate it as an alternative. Guido's
goal was that quick and dirty text processing should "just work" for
newbies and encoding-disinterested expert programmers. I don't think
that your proposal achieves that.

 Paul Prescod

From jeff at soft.fujitsu.com  Mon Sep 11 06:54:03 2006
From: jeff at soft.fujitsu.com (Jeff Wilcox)
Date: Mon, 11 Sep 2006 13:54:03 +0900
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609091058j49ffcdc6h61ce7eb80700f011@mail.gmail.com>
Message-ID: <LEEEILEJNIIMMBJHAKAPCEKFCCAA.jeff@soft.fujitsu.com>

> Great: but what is the default Textedit encoding on a Japanized version of
> the Mac?
>  Paul Prescod

I'm fairly sure that the settings on the computer I looked at this on are
default, but I borrowed the machine so I can't guarantee it.
In TextEdit with OS X set to Japanese there were three choices of encoding:
EUC-JP, ISO-2022-JP and Shift_JIS.  The dropdown defaulted to Shift_JIS.

The (reversible) procedure that I used to change the language back and forth
is:

System Preferences > International > Language
Drag the language you wish to use to the top of the list. Log out, then back
in again and it should be in the language you chose.
If only one language is listed, then the language pack(s) are most likely
not installed. They can be installed from the original OS X install CD/DVD.



From walter at livinglogic.de  Mon Sep 11 12:00:38 2006
From: walter at livinglogic.de (Walter =?iso-8859-1?Q?D=F6rwald?=)
Date: Mon, 11 Sep 2006 12:00:38 +0200 (CEST)
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
	<1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
Message-ID: <61353.89.54.51.133.1157968838.squirrel@isar.livinglogic.de>

Paul Prescod wrote:

> I went based on the current setdefaultencoding. But it seems that we will
> accumulate 3 or 4 related functions so I'm persuaded that there should be a
> module.
>
> encodingdetection.setdefaultfileencoding
> encodingdetection.registerencodingdetector
> encodingdetection.guessfileencoding(filename)
> encodingdetection.guessfileencoding(bytestream)
>
> Suggestion accepted.

There's no need for implementing a separate infrastructure for encoding detection. This can be implemented as a "meta codec".
See http://styx.livinglogic.de/~walter/xml_codec/xml_codec.py for a codec that autodetects the XML encoding.
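
As a stripped-down sketch of the approach (this one only sniffs BOMs,
unlike xml_codec, which inspects the XML declaration itself; the codec
name "autodetect" is made up):

    import codecs

    def _sniff(data):
        # check the longer UTF-32 BOMs before UTF-16 (shared prefix)
        for bom, name in ((codecs.BOM_UTF32_LE, 'utf-32'),
                          (codecs.BOM_UTF32_BE, 'utf-32'),
                          (codecs.BOM_UTF8, 'utf-8-sig'),
                          (codecs.BOM_UTF16_LE, 'utf-16'),
                          (codecs.BOM_UTF16_BE, 'utf-16')):
            if data.startswith(bom):
                return name
        return 'utf-8'

    def _search(name):
        if name != 'autodetect':
            return None
        def decode(data, errors='strict'):
            data = bytes(data)
            return codecs.decode(data, _sniff(data), errors), len(data)
        utf8 = codecs.lookup('utf-8')
        return codecs.CodecInfo(encode=utf8.encode, decode=decode,
                                name='autodetect')

    codecs.register(_search)
    # b'\xef\xbb\xbfhello'.decode('autodetect')  ->  'hello'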

Servus,
   Walter




From phd at phd.pp.ru  Mon Sep 11 12:30:31 2006
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Mon, 11 Sep 2006 14:30:31 +0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
Message-ID: <20060911103031.GD29600@phd.pp.ru>

On Sun, Sep 10, 2006 at 12:02:44PM -0700, Paul Prescod wrote:
> * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as"
> UTF-8

   Finally I've got the definitive answer for "is Russia Europe or Asia?"
It is an Eastern country! At last! ;)

> Maybe the guessing algorithm should read the WHOLE FILE.

   Zen: "In the face of ambiguity, refuse the temptation to guess."

   Unfortunately this contradicts not just the question of how much to read,
but the whole idea of guessing the encoding. So maybe we are going in the
wrong direction. IMHO the right direction is to include a guessing script
in the Tools directory.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From qrczak at knm.org.pl  Mon Sep 11 12:38:49 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 11 Sep 2006 12:38:49 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com> (Paul
	Prescod's message of "Sun, 10 Sep 2006 21:42:01 -0700")
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
	<1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com>
Message-ID: <8764fu3cxy.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> Guido's goal was that quick and dirty text processing should "just
> work" for newbies and encoding-disintererested expert programmers.

What does 'guess' mean for creating files?

Consider a program which reads one file and writes data extracted
from it (e.g. with lines beginning with '#' removed) to another file.

With my proposal it will work if the encoding of the file is the same
as the locale encoding (or if they can be harmlessly confused).
It will just work most of the time.

It will not work in general if the encodings are different. In this
case the user of the script can override the encoding assumption
by temporarily changing the locale or by changing an environment
variable.



OTOH when the encoding is guessed from file contents, what happens
depends on how it's designed. If the locale is ISO-8859-x:

1. Files are created in the locale encoding.

   Then some UTF-8 files will be silently recoded to a different
   encoding, and for other UTF-8 files writing will fail (if they
   contain characters not expressible in the locale encoding).

2. Files are created in UTF-8.

   Then files encoded with the locale encoding will be silently
   recoded to UTF-8, causing trouble for further work with the file
   (it can't be even typed to the terminal).

If the locale is UTF-8, but the reader assumes e.g. ISO-8859-1 when
it can't decode as UTF-8, there will be a silent recoding for these
files. If the file is in fact encoded in ISO-8859-2, the result will
be nonsensical: looking like UTF-8 but with characters substituted
according to ISO-8859-2/1 differences.

In either case it's not clear what the user of the script can do
to preserve the encoding in the output file.

I claim that in my design the result is more easily predictable
and easier to fix when it goes wrong.



I've implemented a hack which allows simple programs to "just work" in
case of UTF-8. It's a modified encoder/decoder which escapes malformed
UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
sequences to round-trip UTF-8 decoding and encoding. It's not used by
default and it's never used when "UTF-8" is specified explicitly,
because it's not the true UTF-8, but I have an environment variable
which says "if the locale is UTF-8, use the modified UTF-8 as the
default encoding".

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From barry at python.org  Mon Sep 11 13:35:51 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 11 Sep 2006 07:35:51 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <45045105.5040209@jmunch.dk>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>	<4501A1B1.5050707@gmail.com>	<200609081403.18350.fdrake@acm.org>	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>	<1157740873.4979.10.camel@fsol>	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<450254DB.3020502@gmail.com> <45045105.5040209@jmunch.dk>
Message-ID: <0D784B1A-DB20-4D27-A11E-4AED4B76152B@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sep 10, 2006, at 1:53 PM, Anders J. Munch wrote:

> I say drop seek_cur and seek_end altogether, and keep only absolute
> seek.

I was just looking through some of our elf/dwarf parsing code and we  
use seek_cur quite a bit.  Not that it couldn't be rewritten to use  
absolute seek, but it's also not the most natural interface.  I'd opt  
for keeping those interfaces for binary files since there are use- 
cases where they are useful.

- -Barry


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRQVKGHEjvBPtnXfVAQL0GwP+KG8NflbbSoUxHLIkCyMd+NFj2fR1GAU5
dfu7cIc/oJpx25VxqgcDqM3IdKqp5CyJLG7AjtPXm8SuWGba3YmunHAcvnPPmP6Z
qdxAI8KD+Sf/imEuB7te29AUGlFteh+6IGKJKBMjxiXSjjqw2lwhDQphyhVPKuHp
3j+oly6uZ8E=
=/1N6
-----END PGP SIGNATURE-----

From paul at prescod.net  Mon Sep 11 15:58:42 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 06:58:42 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060911103031.GD29600@phd.pp.ru>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<20060911103031.GD29600@phd.pp.ru>
Message-ID: <1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com>

On 9/11/06, Oleg Broytmann <phd at phd.pp.ru> wrote:
>
> On Sun, Sep 10, 2006 at 12:02:44PM -0700, Paul Prescod wrote:
> > * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving
> as"
> > UTF-8
>
>    Finally I've got the definitive answer for "is Russia Europe or Asia?"
> It is an Eastern country! At last! ;)


For these purposes, Russia is European, isn't it? Russian text can be
subsumed by UTF-8 with relatively minor expansion, right? If so, then I
would guess that UTF-8 would replace KOI8-R and iso8859-? for Russian
eventually.

> Maybe the guessing algorithm should read the WHOLE FILE.
>
>    Zen: "In the face of ambiguity, refuse the temptation to guess."
>
>    Unfortunately this contradicts not just the question of how much to read,
> but the whole idea of guessing the encoding. So maybe we are going in the
> wrong direction. IMHO the right direction is to include a guessing script
> in the Tools directory.


That was the position I started with. Guido wanted a guessing mode. So I
designed what seemed to me to be the least dangerous guessing mode possible:

 1. Off by default.
 2. Turned on by the keyword "guess".
 3. Decodes the full text to check for encoding correctness.

Given these safeguards, I think that the feature is not only safe enough but
also helpful.

Moving it to a script would not meet the central goal that it be easily
usable by people who do not know much about encodings or Python.

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/9b38a31b/attachment.html 

From paul at prescod.net  Mon Sep 11 16:15:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 07:15:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <8764fu3cxy.fsf@qrnik.zagroda>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
	<1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com>
	<8764fu3cxy.fsf@qrnik.zagroda>
Message-ID: <1cb725390609110715t4caca46bya5f9b2508d216c7@mail.gmail.com>

On 9/11/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>
> "Paul Prescod" <paul at prescod.net> writes:
>
> > Guido's goal was that quick and dirty text processing should "just
> > work" for newbies and encoding-disintererested expert programmers.
>
> What does 'guess' mean for creating files?


I wasn't sure about this one. But on Windows and Mac it seems safe to
generate UTF-8-with-BOM. Textedit, VIM and notepad all auto-detect the UTF-8
BOM and do the right thing.

> 2. Files are created in UTF-8.
>
>    Then files encoded with the locale encoding will be silently
>    recoded to UTF-8, causing trouble for further work with the file
>    (it can't be even typed to the terminal).


It can on the terminal on the Mac. And on the increasing number of
UTF-8-defaulted Linux distributions. Perhaps it should by default use the
Unix locale for output, but only on Unix and not on Mac/Windows.

> I've implemented a hack which allows simple programs to "just work" in
> case of UTF-8. It's a modified encoder/decoder which escapes malformed
> UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
> sequences to round-trip UTF-8 decoding and encoding. It's not used by
> default and it's never used when "UTF-8" is specified explicitly,
> because it's not the true UTF-8, but I have an environment variable
> which says "if the locale is UTF-8, use the modified UTF-8 as the
> default encoding".


That's an interesting idea. I'm not sure if you are proposing it as being
applicable to this PEP or not...

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/f5e90075/attachment.htm 

From phd at phd.pp.ru  Mon Sep 11 16:23:04 2006
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Mon, 11 Sep 2006 18:23:04 +0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<20060911103031.GD29600@phd.pp.ru>
	<1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com>
Message-ID: <20060911142303.GA12119@phd.pp.ru>

On Mon, Sep 11, 2006 at 06:58:42AM -0700, Paul Prescod wrote:
> For these purposes, Russia is European, isn't it?

   If the test is "a BOM in UTF-8 text files on Unices" - then no. :)

> Russian text can be subsumed by UTF-8 with relatively minor expansion, right?

   Sorry, what do you mean? That Russian encodings can be converted to
UTF-8? Yes, they can. But the most popular encoding here is cp1251, not
UTF-8. Even on Unices there are people who use cp1251 as their main
encoding (locale, fonts, keyboard mapping) because they often switch
between a number of platforms.

> If so, then I
> would guess that UTF-8 would replace KOI8-R and iso8859-? for Russian
> eventually.

   On Unix? Probably yes, but not in the near future. There are some
popular tools (for me the most notable is Midnight Commander) that still
have problems with UTF-8 locales.

> Given these safeguards, I think that the feature is not only safe enough but
> also helpful.

   Ok then.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From mcherm at mcherm.com  Mon Sep 11 18:44:47 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Mon, 11 Sep 2006 09:44:47 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
Message-ID: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>

Paul Prescod writes:
     [... Pre-PEP proposal ...]

Quick thoughts:

  * I like it. Good work.
  * I agree with Guido: "open" is the right spelling for this.
  * I agree with Paul: mandatory specification is the way to go.  
10,000 different blog entries, tutorials, and cookbook recipes can  
recommend using "guess" if you don't know what you're doing. Or they  
can all recommend using "site". Or they can all recommend using "utf-8".  
I'm not sure what they'll all recommend, and that's enough reason for  
me to require the user to say. If we later decide that one is an  
acceptable default, we could make it optional in release 3.1, 3.2, or  
3.3... but if we make it optional from the start then it can never  
become required.

Other thoughts after reading everyone else's replies:

  * Guessing. Hmm. Broad topic. If the option for guessing were  
spelled "guess" (rather than, say "autodetect") then I would have been  
scared off from using it in "production code" but I would still feel  
free to use it in quick-and-dirty scripting. On the other hand, I'm not  
sure I'm a good "typical programmer". Fortunately, your PEP works fine  
whether or not "guess" is allowed, so I can support your PEP without  
having to commit on the idea of having a "guess" option.

-- Michael Chermside


From mcherm at mcherm.com  Mon Sep 11 20:22:15 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Mon, 11 Sep 2006 11:22:15 -0700
Subject: [Python-3000] educational aspects of Python 3000
Message-ID: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>

Toby Donaldson writes:
> Any suggestions for how educators interested in the
> educational/learning aspects of Python 3000 could more fruitfully
> participate?

You're doing pretty well so far! Seriously... just speak up: Pythonistas
(including, in particular, Guido) value the fact that Python is an
excellent language for beginners, and we'll go out of our way to keep
it so. But you might need to speak up.

Elsewhere:
> For teaching purposes, many educators report that they like raw_input
> (and input). The basic argument is that, for beginners, code like
>
>      name = raw_input('Morbo demands your name! ')
>
> is clearer and easier than using sys.stdin.readline().
       [...]
> For instance, would there be interest in the inclusion of a standard
> educational library...

Personally, I think input() should never have existed and must go
no matter what. I think raw_input() is worth discussing -- I wouldn't
need it, but it's little more than a convenience function.

The idea of a standard edu library though is a GREAT one. That would
provide a standard place for things like raw_input() (with a better
name) as well as lots of other "helper functions" useful to beginners
and/or students -- and all it would cost is a single line of boilerplate
at the top of each program ("from beginnerlib import *" or something
like that).

I suspect that such a library would be enthusiastically welcomed into
the Python core distribution *IF* there was clear consensus about
what it should contain. So if the EDU-SIG could do the hard work of
obtaining the consensus (and mark my words... it IS hard work), I
think you'd be 90% of the way there.

-- Michael Chermside


From brett at python.org  Mon Sep 11 20:26:51 2006
From: brett at python.org (Brett Cannon)
Date: Mon, 11 Sep 2006 11:26:51 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
Message-ID: <bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>

On 9/11/06, Michael Chermside <mcherm at mcherm.com> wrote:
>
> Toby Donaldson writes:
> > Any suggestions for how educators interested in the
> > educational/learning aspects of Python 3000 could more fruitfully
> > participate?
>
> You're doing pretty well so far! Seriously... just speak up: Pythonistas
> (including, in particular, Guido) value the fact that Python is an
> excellent language for beginners, and we'll go out of our way to keep
> it so. But you might need to speak up.
>
> Elsewhere:
> > For teaching purposes, many educators report that they like raw_input
> > (and input). The basic argument is that, for beginners, code like
> >
> >      name = raw_input('Morbo demands your name! ')
> >
> > is clearer and easier than using sys.stdin.readline().
>        [...]
> > For instance, would there be interest in the inclusion of a standard
> > educational library...
>
> Personally, I think input() should never have existed and must go
> no matter what.


Agreed.  Teach the folks eval() quick if you want something like that.

> I think raw_input() is worth discussing -- I wouldn't
> need it, but it's little more than a convenience function.


Yeah, but when you are learning it's cool to take input easily.  I loved
raw_input() when I started out.

> The idea of a standard edu library though is a GREAT one. That would
> provide a standard place for things like raw_input() (with a better
> name) as well as lots of other "helper functions" useful to beginners
> and/or students -- and all it would cost is a single line of boilerplate
> at the top of each program ("from beginnerlib import *" or something
> like that).
>
> I suspect that such a library would be enthusiastically welcomed into
> the Python core distribution *IF* there was clear consensus about
> what it should contain. So if the EDU-SIG could do the hard work of
> obtaining the consensus (and mark my words... it IS hard work), I
> think you'd be 90% of the way there.


Yeah.  Stuff that normally trips up beginners could be put in here with
pointers to how to do it properly when they get more advanced.  And making
the name seem very newbie will (hopefully) discourage people from using it
beyond their learning code.

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/5656a63b/attachment.html 

From solipsis at pitrou.net  Mon Sep 11 20:42:46 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Mon, 11 Sep 2006 20:42:46 +0200
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
Message-ID: <1158000166.4672.33.camel@fsol>


On Monday 11 September 2006 at 11:22 -0700, Michael Chermside wrote:
> The idea of a standard edu library though is a GREAT one. That would
> provide a standard place for things like raw_input() (with a better
> name) as well as lots of other "helper functions" useful to beginners
> and/or students -- and all it would cost is a single line of boilerplate
> at the top of each program ("from beginnerlib import *" or something
> like that).

There is a risk with a beginner-specific library: it's the same problem as
with user interfaces which have "simple" and "advanced" modes. Often
the "simple" mode becomes an excuse for lazy developers to turn the
"advanced" mode into a painful mess (under the flawed pretext that
advanced users can suffer the pain anyway).

And if the helper functions are genuinely useful, why would they be only
for beginners and students?

IMHO, it would be better to label the module "scripting" rather than
"beginnerlib" (and why append "lib" at the end of module names
anyway? :-)).
It might even contain stuff such as encoding guessing.

>>> from scripting import raw_input, autotextfile

Regards

Antoine.



From p.f.moore at gmail.com  Mon Sep 11 23:49:34 2006
From: p.f.moore at gmail.com (Paul Moore)
Date: Mon, 11 Sep 2006 22:49:34 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
Message-ID: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>

On 9/11/06, Michael Chermside <mcherm at mcherm.com> wrote:
> Paul Prescod writes:
>      [... Pre-PEP proposal ...]
>
> Quick thoughts:

My quick thoughts on this whole subject:

* Yes, it should be "open". Anything else feels like gratuitous breakage.
* There should be a default encoding, and it should be the system
default one. If I don't take special steps, most tools I use save in
the system default encoding, so Python should follow this approach as
well.
* I don't mind corrupted characters for unusual cases. Really, I don't.
* The bizarre Windows behaviour of using different encodings for
console and GUI programs doesn't bother me either. Really. I promise.

99.99% of the time I simply don't care about i18n. All I want is
something that runs on the machine(s) I'm using. Using the system
locale is fine for that.

In the rare cases where I *do* care about international characters, I
have no problem doing work and research to get things right. And when
I've done that, detecting encodings and specifying the right thing in
an open() call is entirely OK.

Of course, I'm in the useful position of having an OS default
character set which contains ASCII as a subset. I don't know what
issues someone with Greek/Russian/Japanese or whatever as an OS
default would have (one thought - if your default character set
doesn't contain ASCII as a subset, how do you deal with the hosts
file? OTOH, I had a real struggle to find an example of an encoding
which didn't have ASCII as a subset!)

Paul.

From jimjjewett at gmail.com  Mon Sep 11 23:53:40 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Sep 2006 17:53:40 -0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
	<1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
Message-ID: <fb6fbf560609111453t7509985ayff501ff4666189a0@mail.gmail.com>

On 9/10/06, Paul Prescod <paul at prescod.net> wrote:

> encodingdetection.setdefaultfileencoding
> encodingdetection.registerencodingdetector
> encodingdetection.guessfileencoding(filename)
> encodingdetection.guessfileencoding(bytestream)

This demonstrates two of the problems with requiring an explicit decision.

(1)  You still won't actually get one; you'll just lose the
information that it wasn't even considered.

(2)  You'll add so much boilerplate that you invite other bugs.


Suddenly,

    >>> f=open("runlist.txt")

turns into something more like

    >>> import encodingdetection
    ...
    >>> f=open("runlist.txt",
encoding=encodingdetection.guessfileencoding("runlast.txt"))

I certainly wouldn't read a line like that without a good reason; I
wouldn't even notice that the encoding guess was based on a different
file.

It will be an annoying amount of typing though, during which time I'll
be thinking:

"It doesn't really matter what encoding is used; if there is anything
outside of ASCII, it is because the user put it there, and all I have
to do is copy it around unchanged."

For situations like that, if there were *ever* a reason to specify a
particular encoding, I *still* wouldn't get it right, because it is
something that hasn't occurred to me.  I guess the explicitness means
that the error is now my fault instead of python's, but the error is
still there, and someone else is more reluctant to fix it.  (Well,
this *was* an explicit choice -- maybe I had a reason?)

But since the encoding is mandatory, I do still have to deal with it,
by making my code longer and uglier.  In the end, packages will end up
distributing their own non-standard convenience wrappers, so that the
equivalent of

>>> f=open("runlist.txt")

can still be used -- but you'll have to read the whole module and the
imports (and the star-imports) to figure out what it means/whether it
is shadowed.  You can't even scan for "open" because someone may have
named their convenience wrapper get_file.  If packages A and B
disagree about the default encoding, it will be even harder to find
and fix than it is today.

-jJ

From paul at prescod.net  Tue Sep 12 00:09:02 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 15:09:02 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <fb6fbf560609111453t7509985ayff501ff4666189a0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
	<1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
	<fb6fbf560609111453t7509985ayff501ff4666189a0@mail.gmail.com>
Message-ID: <1cb725390609111509u39725d5al72294b0d89009eb6@mail.gmail.com>

I think that the basis of your concern is a misunderstanding of the
proposal (at least as documented in the PEP).

On 9/11/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 9/10/06, Paul Prescod <paul at prescod.net> wrote:
>
> > encodingdetection.setdefaultfileencoding
> > encodingdetection.registerencodingdetector
> > encodingdetection.guessfileencoding(filename)
> > encodingdetection.guessfileencoding(bytestream)

Those last two are helper functions exposing the functionality of the
"guess" keyword through a different means.

> This demonstrates two of problems with requiring an explicit decision.
>
> (1)  You still won't actually get one; you'll just lose the
> information that it wasn't even considered.

I frankly don't think that that makes any sense. If there is a default,
then how can I know whether someone thought about it and decided to
use the default, or did not think it through and decided to use the
default?

> (2)  You'll add so much boilerplate that you invite other bugs.
>
>
> Suddenly,
>
>     >>> f=open("runlist.txt")
>
> turns into something more like
>
>     >>> import encodingdetection
>     ...
>     >>> f=open("runlist.txt",
> encoding=encodingdetection.guessfileencoding("runlast.txt"))

No, that was never the proposal. The proposal is:

f = open("runlist.txt", "guess")

> "It doesn't really matter what encoding is used; if there is anything
> outside of ASCII, it is because the user put it there, and all I have
> to do is copy it around unchanged."

Yes, if you are doing something utterly trivial with the text as
opposed to the normal case where you are comparing it with some other
input, combining it with some other input, putting it in a database,
serving it up over the Web etc. Even Unix "cat" would need to be
encoding aware if it were created today and designed to be i18n
friendly.

> For situations like that, if there were *ever* a reason to specify a
> particular encoding, I *still* wouldn't get it right, because it is
> something that hasn't occurred to me. I guess the explicitness means
> that the error is now my fault instead of python's, but the error is
> still there, and someone else is more reluctant to fix it.  (Well,
> this *was* an explicit choice -- maybe I had a reason?)

The documentation for the "guess" keyword will be clear that it is
NEVER the correct choice for production-quality software. That's one
of the virtues of having an explicit keyword for the quick and dirty
mode (as opposed to making it the default as you seem to wish).

> But since the encoding is mandatory, I do still have to deal with it,
> by making my code longer and uglier.  In the end, packages will end up
> distributing their own non-standard convenience wrappers, so that the
> equivalent of
>
> >>> f=open("runlist.txt")

No, I don't think they'll do that to avoid typing 7 extra characters.

 Paul Prescod

From guido at python.org  Tue Sep 12 00:18:32 2006
From: guido at python.org (Guido van Rossum)
Date: Mon, 11 Sep 2006 15:18:32 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <1158000166.4672.33.camel@fsol>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<1158000166.4672.33.camel@fsol>
Message-ID: <ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>

On 9/11/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Monday 11 September 2006 at 11:22 -0700, Michael Chermside wrote:
> > The idea of a standard edu library though is a GREAT one. That would
> > provide a standard place for things like raw_input() (with a better
> > name) as well as lots of other "helper functions" useful to beginners
> > and/or students -- and all it would cost is a single line of boilerplate
> > at the top of each program ("from beginnerlib import *" or something
> > like that).
>
> There is a risk with a beginner-specific library: it's the same problem as
> with user interfaces which have "simple" and "advanced" modes. Often
> the "simple" mode becomes an excuse for lazy developers to turn the
> "advanced" mode into a painful mess (under the flawed pretext that
> advanced users can suffer the pain anyway).

Please give us more credit than that.

> And if the helper functions are genuinely useful, why would they be only
> for beginners and students?

DrScheme has several levels for beginners and experts and in between,
so they think it is really useful to have different levels. I'm torn;
I wish a single level would apply to all but I know that many
educators provide some kind of "training wheels" library for their
students.

> IMHO, it would be better to label the module "scripting" rather than
> "beginnerlib" (and why append "lib" at the end of module names
> anyway? :-)).
> It might even contain stuff such as encoding guessing.
>
> >>> from scripting import raw_input, autotextfile

I'm not so keen on 'scripting' as the name either, but I'm sure we can
come up with something. Perhaps easyio, simpleio or basicio? (Not to
be confused with vbio. :-)

I'm also not completely against revising the decision on killing
raw_input(). While input() must definitely go, raw_input() might
survive under a new name. Too bad calling it input() would be too
confusing from a Python 2.x POV, and I don't want to call it
readline() because it strips the trailing newline and raises EOF on
error. Unless the educators can live with having to use
readline().strip() instead of raw_input()...?

Perhaps the educators (less Art :-) can get together and write a PEP?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From paul at prescod.net  Tue Sep 12 00:30:15 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 15:30:15 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
Message-ID: <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>

On 9/11/06, Paul Moore <p.f.moore at gmail.com> wrote:
> On 9/11/06, Michael Chermside <mcherm at mcherm.com> wrote:
> > Paul Prescod writes:
> >      [... Pre-PEP proposal ...]
> >
> > Quick thoughts:
>
> My quick thoughts on this whole subject:
>
> * Yes, it should be "open". Anything else feels like gratuitous breakage.
> * There should be a default encoding, and it should be the system
> default one. If I don't take special steps, most tools I use save in
> the system default encoding, so Python should follow this approach as
> well.

So just to be clear: you want to keep the function name "open" but
change its behaviour. For example, the ord() of high characters
returned by open will be completely different than today. And the
syntax for "open" of binary files will be different (in fact, whether
it reads the file or throws an exception will depend on your locale).

> The bizarre Windows behavious of using different
> encodings for console and GUI programs doesn't
> bother me either. Really. I promise.

So according to this philosophy, Windows and Mac users will probably
never be able to open UTF-8 documents by default even if every
Microsoft app generates and consumes UTF-8 by default, because
Microsoft and Apple will probably _never change the default locale_
for backwards compatibility reasons. Their philosophy seems to be that
the locale is irrelevant in the age of Unicode and therefore there is
no reason to upgrade it at the risk of "breaking" applications that were
hard-coded to expect a specific locale.

 Paul Prescod

From qrczak at knm.org.pl  Tue Sep 12 01:20:28 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 12 Sep 2006 01:20:28 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	(Paul Moore's message of "Mon, 11 Sep 2006 22:49:34 +0100")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
Message-ID: <87d5a2f0sj.fsf@qrnik.zagroda>

"Paul Moore" <p.f.moore at gmail.com> writes:

> Of course, I'm in the useful position of having an OS default
> character set which contains ASCII as a subset. I don't know what
> issues someone with Greek/Russian/Japanese or whatever as an OS
> default would have (one thought - if your default character set
> doesn't contain ASCII as a subset, how do you deal with the hosts
> file? OTOH, I had a real struggle to find an example of an encoding
> which didn't have ASCII as a subset!)

AFAIK the only encoding which might be used today which is not based
on ASCII is EBCDIC. Perl supports it (and it supports Unicode at the
same time, via UTF-EBCDIC).

Other than that, there are some Japanese encodings with a confusion
between \ and the Yen sign, otherwise being ASCII. They are used today.

There used to be national ASCII variants with accented letters instead
of [\]^{|}~. I don't think they are still used today.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From david.nospam.hopwood at blueyonder.co.uk  Tue Sep 12 01:25:15 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 12 Sep 2006 00:25:15 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>
References: <1157892435.4246.107.camel@fsol>	
	<450418AC.2010400@blueyonder.co.uk>	
	<20060910111814.F8ED.JCARLSON@uci.edu>	
	<45046F06.5090502@blueyonder.co.uk>
	<1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>
Message-ID: <4505F05B.8070503@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>> ... if you think that guessing based on content is a good idea -- I
>> don't. In any case, such guessing necessarily depends on the expected file
>> format, so it should be done by the application itself, or by a library that
>> knows more about the format.
> 
> I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> it probably is that encoding.

That is quite false for UTF-16, at least. It is also false for short UTF-8
files.

> I don't see how it matters whether the
> file represents Latex or an .htaccess file. XML is a special case
> because it is specially designed to make encoding detection (not
> guessing, but detection) easy.

Many other frequently used formats also necessarily start with an ASCII
character and do not contain NULs, which is at least sufficient to reliably
detect UTF-16 and UTF-32.
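
For formats with that property the check is mechanical; a sketch (the
helper name is made up):

    def sniff_utf16_32(head):
        # head: the first four bytes of a file known to start with an
        # ASCII character if it is in an ASCII-superset encoding
        if len(head) >= 4:
            if head[0] == 0 and head[1] == 0 and head[2] == 0 and head[3] != 0:
                return 'utf-32-be'   # 00 00 00 41 for 'A'
            if head[0] != 0 and head[1] == 0 and head[2] == 0 and head[3] == 0:
                return 'utf-32-le'   # 41 00 00 00
        if len(head) >= 2:
            if head[0] == 0 and head[1] != 0:
                return 'utf-16-be'   # 00 41
            if head[0] != 0 and head[1] == 0:
                return 'utf-16-le'   # 41 00
        return None                  # an ASCII superset (UTF-8, Latin-1, ...)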

>> If the encoding of a text stream were settable after it had been opened,
>> then it would be easy for anyone to implement whatever guessing algorithm
>> they needed, without having to write an encoding implementation or
>> include any other support for guessing in the I/O library itself.
> 
> But this defeats the whole purpose of the PEP which is to accelerate
> the writing of quick and dirty text processing scripts.

That doesn't justify making the behaviour of those scripts "dirtier" than
necessary.

I think that the focus should be on solving a set of well-defined problems,
for which BOM detection can definitely help:

Suppose we have a system in which some of the files are in a potentially
non-Unicode 'system' encoding, and some are Unicode. The user of the system
needs a reliable way of marking the Unicode files so that the encoding of
*those* files can be distinguished. In addition, a provider of portable
software or documentation needs a way to encode files for distribution that
is independent of the system encoding, since (before run-time) they don't
know what encoding that will be on any given system. Use and detection of
Byte Order Marks solves both of these problems.

You appear to be arguing for the common use of much more ambitious heuristic
guessing, which *cannot* be made reliable. I am not opposed to providing
support for such guessing in the Python standard library, but only if its
limitations are thoroughly documented, and only if it is not the default.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Tue Sep 12 01:29:13 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 12 Sep 2006 00:29:13 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>	
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>	
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>	
	<45047FC2.40904@blueyonder.co.uk>
	<1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com>
Message-ID: <4505F149.8030509@blueyonder.co.uk>

Paul Prescod wrote:
> The PEP doesn't deal with streams. It is about files.

An important part of the Unix design philosophy (partially adopted by Windows)
is to make streams and files behave as similarly as possible. It is quite
feasible to make *some* detection algorithms work for streams, and this is
an advantage over algorithms that don't work for streams.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From qrczak at knm.org.pl  Tue Sep 12 01:44:13 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 12 Sep 2006 01:44:13 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	(Paul Prescod's message of "Mon, 11 Sep 2006 15:30:15 -0700")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
Message-ID: <87r6yij7ea.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

>> The bizarre Windows behaviour of using different
>> encodings for console and GUI programs doesn't
>> bother me either. Really. I promise.
>
> So according to this philosophy, Windows and Mac users will probably
> never be able to open UTF-8 documents by default even if every
> Microsoft app generates and consumes UTF-8 by default, because
> Microsoft and Apple will probably _never change the default locale_
> for backwards compatibility reasons.

This can be solved for file reading by making a "Windows locale"
always consider UTF-8 BOM and switch to UTF-8 in this case.
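
A sketch (the helper name is made up):

    def effective_encoding(head, locale_encoding):
        # head: the first few bytes of the file
        if head.startswith(b'\xef\xbb\xbf'):    # UTF-8 BOM
            return 'utf-8-sig'                  # decodes and drops the BOM
        return locale_encoding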

It's still unclear what to do for writing on Windows.

I have no idea what the Mac does (does it typically use UTF-8 locales?
and does it typically use a BOM in UTF-8?).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Tue Sep 12 02:41:59 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 17:41:59 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4505F05B.8070503@blueyonder.co.uk>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<20060910111814.F8ED.JCARLSON@uci.edu>
	<45046F06.5090502@blueyonder.co.uk>
	<1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>
	<4505F05B.8070503@blueyonder.co.uk>
Message-ID: <1cb725390609111741v3b7ef92aufec5aa960711768c@mail.gmail.com>

On 9/11/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> > I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> > it probably is that encoding.
>
> That is quite false for UTF-16, at least. It is also false for short UTF-8
> files.

True UTF-16 (as opposed to UTF-16 BE/UTF-16 LE) files have a BOM.
Also, you can recognize incorrect ones through misuse of surrogates.

> > I don't see how it matters whether the
> > file represents Latex or an .htaccess file. XML is a special case
> > because it is specially designed to make encoding detection (not
> > guessing, but detection) easy.
>
> Many other frequently used formats also necessarily start with an ASCII
> character and do not contain NULs, which is at least sufficient to reliably
> detect UTF-16 and UTF-32.

Yes, but these are the two easiest ones.

> > But this defeats the whole purpose of the PEP which is to accelerate
> > the writing of quick and dirty text processing scripts.
>
> That doesn't justify making the behaviour of those scripts "dirtier" than
> necessary.
>
> I think that the focus should be on solving a set of well-defined problems,
> for which BOM detection can definitely help:
>
> Suppose we have a system in which some of the files are in a potentially
> non-Unicode 'system' encoding, and some are Unicode. The user of the system
> needs a reliable way of marking the Unicode files so that the encoding of
> *those* files can be distinguished.

If the user understands the problem and is willing to go to this level
of effort then they are not the target user of the feature.

> ... In addition, a provider of portable
> software or documentation needs a way to encode files for distribution that
> is independent of the system encoding, since (before run-time) they don't
> know what encoding that will be on any given system. Use and detection of
> Byte Order Marks solves both of these problems.

Sure, that's great.

> You appear to be arguing for the common use of much more ambitious heuristic
> guessing, which *cannot* be made reliable.

First, the word "guess" necessarily implies unreliability. Guido
started this whole chain of discussion when he said:

"(Auto-detection from sniffing the data is a perfectly valid answer
BTW -- I see no reason why that couldn't be one option, as long as
there's a way to disable it.)"

> ... I am not opposed to providing
> support for such guessing in the Python standard library, but only if its
> limitations are thoroughly documented, and only if it is not the default.

Those are both characteristics of the proposal that started this
thread so what are we arguing about?

Since writing the PEP, I've noticed that the strategy of trying to
decode as UTF-* and falling back to an 8-bit character set is actually
pretty common in text editors, which implies that Python's behaviour
here can be highly similar to text editors. This was the key
requirement Guido gave me in an off-list email for the guessing mode.

VIM: "fileencodings: This is a list of character encodings considered
when starting to edit a file.  When a file is read, Vim tries to use
the first mentioned character encoding.  If an error is detected, the
next one in the list is tried.  When an encoding is found that works,
'fileencoding' is set to it.	"

Reading the docs, one can infer that this feature is specifically
designed to support UTF-8 sniffing. I would guess that the default
configuration has it do UTF-8 sniffing.
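
For reference, a typical value looks like

    set fileencodings=ucs-bom,utf-8,default,latin1

(I believe the exact default depends on the 'encoding' option): BOM
detection runs first, then a full UTF-8 decode is attempted, and latin1
is the catch-all because decoding as latin1 never fails.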

BBEdit: "If the file contains no other cues to indicate its text
encoding, and its contents appear to be valid UTF-8, BBEdit will open
it as UTF-8 (No BOM) without recourse to the preferences option."

 Paul Prescod

From paul at prescod.net  Tue Sep 12 03:16:15 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 18:16:15 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <87r6yij7ea.fsf@qrnik.zagroda>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
Message-ID: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>

On 9/11/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Paul Prescod" <paul at prescod.net> writes:
>
> >> The bizarre Windows behaviour of using different
> >> encodings for console and GUI programs doesn't
> >> bother me either. Really. I promise.
> >
> > So according to this philosophy, Windows and Mac users will probably
> > never be able to open UTF-8 documents by default even if every
> > Microsoft app generates and consumes UTF-8 by default, because
> > Microsoft and Apple will probably _never change the default locale_
> > for backwards compatibility reasons.
>
> This can be solved for file reading by making a "Windows locale"
> always consider UTF-8 BOM and switch to UTF-8 in this case.

That's fine but I don't see why we would turn that feature off for any
platform. Do you have a bunch of files hanging around starting with
zero-width non-breaking spaces?

> It's still unclear what to do for writing on Windows.

UTF-8 with BOM is the Microsoft preferred format. Maybe after
experimentation we'll find that there are still apps out there that
choke on it, but we should start out trying to be compatible with
other apps on the platform.

> I have no idea what the Mac does (does it typically use UTF-8 locales?
> and does it typically use a BOM in UTF-8?).

Like Windows, the Mac has backwards-compatible behaviours in some
places (textedit defaults to a proprietary encoding called Mac Roman)
and UTF-8 behaviours in other places (e.g. cut and paste). In some
places (on my configuration) it claims its locale is US ASCII.

Textedit can read files with a BOM and auto-detect Unicode with a BOM.
It always saves without a BOM, which results in the unfortunate
situation that Textedit will recognize a file's encoding, then save
it, then forget its encoding when you reopen it. :(

But again, this implies that at least on these two platforms UTF-8
w/BOM is a good default output encoding.

On Unix, VIM is also set up to auto-detect UTF-8 (using the BOM or
a full decoding attempt). According to Google, XEmacs also has some
kind of UTF-8/BOM detector but I don't know the details. GNU Emacs:
According to "Emacs wiki": "Auto-detection of UTF-8 is effectively
disabled by default in GNU Emacs 21.3 and below."

So the situation on Unix is not as clear.

 Paul Prescod

From talin at acm.org  Tue Sep 12 04:48:39 2006
From: talin at acm.org (Talin)
Date: Mon, 11 Sep 2006 19:48:39 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<1158000166.4672.33.camel@fsol>
	<ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
Message-ID: <45062007.1040909@acm.org>

Guido van Rossum wrote:
>>>>> from scripting import raw_input, autotextfile
> 
> I'm not so keen on 'scripting' as the name either, but I'm sure we can
> come up with something. Perhaps easyio, simpleio or basicio? (Not to
> be confused with vbio. :-)
> 
> I'm also not completely against revising the decision on killing
> raw_input(). While input() must definitely go, raw_input() might
> survive under a new name. Too bad calling it input() would be too
> confusing from a Python 2.x POV, and I don't want to call it
> readline() because it strips the trailing newline and raises EOF on
> error. Unless the educators can line with having to use
> readline().strip() instead of raw_input()...?

How about calling it 'ask'?

 >>> s = ask( "How are you today?" )
--> Fine
 >>> s
"Fine"

And as far as the name of a library goes, how about "quickstart"? Other 
possibilities are: quickstudy, kickstart, simplestart, etc.

"With the Python quickstart module, programming is as easy as 
one...two...five!"

-- Talin


From greg.ewing at canterbury.ac.nz  Tue Sep 12 05:20:50 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 15:20:50 +1200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <874pvgozlo.fsf@qrnik.zagroda>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz> <874pvgozlo.fsf@qrnik.zagroda>
Message-ID: <45062792.3040207@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> It's lazily instantiated today (see PyErr_NormalizeException).

Only in C code, though, not Python. And if the
separate type/value specification when raising
goes away, it might not be possible any more
even in C.

> 'WithExit' constructs a unique exception object and catches precisely
> this object.

That would fill the bill, yes.

So it's really just a matter of making sure
we keep the ability to be as lazy as possible
with exception processing.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 05:34:42 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 15:34:42 +1200
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
In-Reply-To: <ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>
References: <1157898432.4246.161.camel@fsol>
	<ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>
Message-ID: <45062AD2.1090207@canterbury.ac.nz>

Guido van Rossum wrote:

> All sorts of things are different when reading stdin vs. opening a
> filename. e.g. stdin may be a pipe.

Which suggests that if anything is going to try
to guess the encoding, it would be better for it
to start reading from the actual stream you're
going to use and buffer the result, rather than
rely on being able to open it separately.
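
For instance (a sketch; the buffered-stream names are an assumption
borrowed from the io discussions, and peek() looks ahead without
consuming anything):

    import codecs, io

    def guessing_reader(buffered):
        # buffered: a binary stream with peek(), e.g. sys.stdin.buffer
        head = buffered.peek(4)[:4]
        if head.startswith(codecs.BOM_UTF8):
            enc = 'utf-8-sig'     # decode and drop the BOM
        elif head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
            enc = 'utf-16'        # NB: a UTF-32-LE BOM also begins FF FE
        else:
            enc = 'utf-8'         # optimistic default
        return io.TextIOWrapper(buffered, encoding=enc)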

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 05:37:55 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 15:37:55 +1200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <ca471dc20609101009x3ebe0482x5ceb786e748558b@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
	<ca471dc20609101009x3ebe0482x5ceb786e748558b@mail.gmail.com>
Message-ID: <45062B93.3080207@canterbury.ac.nz>

Guido van Rossum wrote:
> if possible I'd like the
> guessing function to have access to what was in the file before it was
> emptied by the "create" function, or what's at the start before
> appending to the end,

Which further suggests that the encoding-guesser
needs to be fairly intimately built into some
layer of the i/o stack, and not require calling
a separate function (although it could be provided
as such in case you want to use it that way).

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 06:18:37 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 16:18:37 +1200
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
Message-ID: <4506351D.4040109@canterbury.ac.nz>

Michael Chermside wrote:

> The idea of a standard edu library though is a GREAT one. That would
> provide a standard place for things like raw_input() (with a better
> name) as well as lots of other "helper functions" useful to beginners
> and/or students -- and all it would cost is a single line of boilerplate
> at the top of each program ("from beginnerlib import *" or something
> like that).

I disagree for two reasons:

1) Even a single line of boilerplate is too much
when you're trying to pare things down to the
bare minimum for a beginner.

2) It teaches a bad habit right from the
beginning (i.e. using 'import *'). This is the
wrong foot to start a beginner off on.


-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 06:36:01 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 16:36:01 +1200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <45045105.5040209@jmunch.dk>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<450254DB.3020502@gmail.com> <45045105.5040209@jmunch.dk>
Message-ID: <45063931.3050904@canterbury.ac.nz>

Anders J. Munch wrote:
> any file that supports seeking to the end will also support
> reporting the file size.  Thus
>   f.seek(f.length)
> should suffice,

Although the micro-optimisation circuit in my
brain complains that it will take 2 system
calls when it could be done with 1...

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From solipsis at pitrou.net  Tue Sep 12 08:37:58 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 12 Sep 2006 08:37:58 +0200
Subject: [Python-3000] text editors
In-Reply-To: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
Message-ID: <1158043078.4276.6.camel@fsol>

On Monday, 11 September 2006 at 18:16 -0700, Paul Prescod wrote:
> On Unix, VIM is also set up to auto-detect UTF-8 (using the BOM or
> full decoding attempt). According to Google, XEmacs also has some
> kind of UTF-8/BOM detector but I don't know the details. GNU Emacs:
> According to "Emacs wiki": "Auto-detection of UTF-8 is effectively
> disabled by default in GNU Emacs 21.3 and below."
> 
> So the situation on Unix is not as clear.

gedit has an ordered list of encodings to test for when it opens a file,
and it chooses the first encoding which succeeds in decoding the file.

The encoding list is stored as a gconf key named "auto_detected"
in /apps/gedit-2/preferences/encodings, and its default value is
[UTF-8, CURRENT, ISO-8859-15]
("CURRENT" being interpreted as the current locale).

I suppose the explicit fallback to iso-8859-15 is for the common case
where the user has a Western European language, his user locale is utf-8
and he has some non-Unicode files hanging around...
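
(A rough Python equivalent of that strategy, for illustration; the
candidate list mirrors the gconf default above, with "CURRENT" resolved
via the locale module:)

import locale

def gedit_style_decode(data, candidates=("utf-8", "CURRENT", "iso-8859-15")):
    for name in candidates:
        if name == "CURRENT":
            name = locale.getpreferredencoding()
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue   # try the next candidate
    raise UnicodeError("no candidate encoding matched")

Note that iso-8859-15 assigns a character to every byte, so in practice
the last candidate always succeeds.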

Regards

Antoine.



From ajm at flonidan.dk  Tue Sep 12 09:01:01 2006
From: ajm at flonidan.dk (Anders J. Munch)
Date: Tue, 12 Sep 2006 09:01:01 +0200
Subject: [Python-3000] iostack, second revision
Message-ID: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net>

Greg Ewing wrote:
> Anders J. Munch wrote:
> > any file that supports seeking to the end will also support
> > reporting the file size.  Thus
> >   f.seek(f.length)
> > should suffice,
> 
> Although the micro-optimisation circuit in my
> brain complains that it will take 2 system
> calls when it could be done with 1...

I don't expect file methods and system calls to map one to one, but
you're right, the first time the length is needed, that's an extra
system call.

My micro-optimisation circuitry blew a fuse when I discovered that
seek always implies flush.  You won't get good performance out of code
that does a lot of seeks, whatever you do.  Use my upcoming FileBytes
class :)

- Anders

From tony at PageDNA.com  Tue Sep 12 04:27:09 2006
From: tony at PageDNA.com (Tony Lownds)
Date: Mon, 11 Sep 2006 19:27:09 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<1158000166.4672.33.camel@fsol>
	<ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
Message-ID: <0CC6FE11-6DD0-4923-B417-21273679A15A@PageDNA.com>

>> IMHO, it would be better to label the module "scripting" rather than
>> "beginnerlib" (and why append "lib" at the end of module names
>> anyway? :-)).
>> It might even contain stuff such as encoding guessing.
>>
>>>>> from scripting import raw_input, autotextfile
>
> I'm not so keen on 'scripting' as the name either, but I'm sure we can
> come up with something. Perhaps easyio, simpleio or basicio? (Not to
> be confused with vbio. :-)
>

How about simpleui? This is a user interface routine.

> I'm also not completely against revising the decision on killing
> raw_input(). While input() must definitely go, raw_input() might
> survive under a new name. Too bad calling it input() would be too
> confusing from a Python 2.x POV, and I don't want to call it
> readline() because it strips the trailing newline and raises EOF on
> error. Unless the educators can live with having to use
> readline().strip() instead of raw_input()...?
>

Javascript provides prompt, confirm, alert. They are all very useful
as user interface routines and would be just as useful in Python.
Maybe raw_input could survive as "prompt".

Alternatively, here is a way to keep the "input" name with a useful
extension to the semantics of input(). Add a "converter" argument that
defaults to eval in Python 2.6. "eval" is deprecated as an argument in
2.7. In Python 3.0 the default gets changed to "str". The user's input
is passed through the converter function. Any exceptions from the
converter cause input() to prompt the user again. Code is below.

sys.stdin.readline() doesn't use the readline library, whereas the
current raw_input() and input() builtins do. That is a nice feature,
and I don't think it can even be emulated with the current readline
module.

-Tony Lownds

import sys
def input(prompt='', converter=eval):
   while 1:
     sys.stdout.write(prompt)
     sys.stdout.flush()
     # strip the trailing newline, like raw_input() does
     line = sys.stdin.readline().rstrip('\n\r')
     try:
       return converter(line)
     except (KeyboardInterrupt, SystemExit):
       # never swallow Ctrl-C or sys.exit()
       raise
     except Exception, e:
       # conversion failed: report the error and prompt again
       print str(e)

if __name__ == '__main__':
   print "Result: %s" % input("Enter string:", str)
   print "Result: %d" % input("Enter integer:", int)
   print "Result: %r" % input("Enter expression:")

Here's how it looks when run:

Enter string:12
Result: 12
Enter integer:1a
invalid literal for int(): 1a
Enter integer:
invalid literal for int():
Enter integer:12
Result: 12
Enter expression:a b c
invalid syntax (line 1)
Enter expression:abc
name 'abc' is not defined
Enter expression:'abc'
Result: 'abc'




From ncoghlan at gmail.com  Tue Sep 12 15:45:44 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 12 Sep 2006 23:45:44 +1000
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
Message-ID: <4506BA08.7090905@gmail.com>

Brett Cannon wrote:
> On 9/11/06, *Michael Chermside* <mcherm at mcherm.com 
>     Personally, I think input() should never have existed and must go
>     no matter what.
> 
> 
> Agreed.  Teach the folks eval() quick if you want something like that.

The world would probably be a happier place if you taught them int() and 
float() instead, though :)

>     I think raw_input() is worth discussing -- I wouldn't
>     need it, but it's little more than a convenience function.
> 
> 
> Yeah, but when you are learning it's cool to take input easily.  I loved 
> raw_input() when I started out.

We could always rename raw_input() to input(). Just a thought. . .

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Tue Sep 12 15:47:15 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 12 Sep 2006 23:47:15 +1000
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506BA08.7090905@gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com>
Message-ID: <4506BA63.7040201@gmail.com>

Nick Coghlan wrote:
> We could always rename raw_input() to input(). Just a thought. . .

D'oh. Guido already said he doesn't like that idea :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rhettinger at ewtllc.com  Tue Sep 12 16:25:06 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Tue, 12 Sep 2006 07:25:06 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506BA63.7040201@gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>	<4506BA08.7090905@gmail.com>
	<4506BA63.7040201@gmail.com>
Message-ID: <4506C342.6010202@ewtllc.com>

Nick Coghlan wrote:
>> We could always rename raw_input() to input(). Just a thought. . .
>
> D'oh. Guido already said he doesn't like that idea :)

FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0
confusion, so be it.   The use of input() function is already somewhat
rare (both because of infrequent use cases and because of the stern
warnings about eval's security issues).  It is better to bite the bullet
and move on than it would be to avoid the most obvious name.

Raymond

From jcarlson at uci.edu  Tue Sep 12 18:05:50 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Tue, 12 Sep 2006 09:05:50 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net>
References: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net>
Message-ID: <20060912085752.F915.JCARLSON@uci.edu>


"Anders J. Munch" <ajm at flonidan.dk> wrote:
> 
> Greg Ewing wrote:
> > Anders J. Munch wrote:
> > > any file that supports seeking to the end will also support
> > > reporting the file size.  Thus
> > >   f.seek(f.length)
> > > should suffice,
> > 
> > Although the micro-optimisation circuit in my
> > brain complains that it will take 2 system
> > calls when it could be done with 1...
> 
> I don't expect file methods and systems calls to map one to one, but
> you're right, the first time the length is needed, that's an extra
> system call.

Every time the length is needed, a system call is required (you can have
multiple writers of the same file)...

>>> import os
>>> a = open('test.txt', 'a')
>>> b = open('test.txt', 'a')
>>> a.write('hello')
>>> b.write('whee!!')
>>> a.flush()
>>> os.fstat(a.fileno()).st_size
5L
>>> b.flush()
>>> os.fstat(b.fileno()).st_size
11L
>>>


> My micro-optimisation circuitry blew a fuse when I discovered that
> seek always implies flush.  You won't get good performance out of code
> that does a lot of seeks, whatever you do.  Use my upcoming FileBytes
> class :)

Flushing during seek is important.  By not flushing during seek in your
FileBytes object, you are unnecessarily delaying writes, which could
cause file corruption.

 - Josiah


From tjd at sfu.ca  Tue Sep 12 18:51:18 2006
From: tjd at sfu.ca (Toby Donaldson)
Date: Tue, 12 Sep 2006 09:51:18 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506C342.6010202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com>
Message-ID: <a2f565170609120951r79da32d6j694de125452c508@mail.gmail.com>

On 9/12/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
>
>  We could always rename raw_input() to input(). Just a thought. . .
>
>  D'oh. Guido already said he doesn't like that idea :)
>
>
>
>  FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0
> confusion, so be it.   The use of input() function is already somewhat rare
> (both because of infrequent use cases and because of the stern warnings
> about eval's security issues).  It is better to bite the bullet and move on
> than it would be to avoid the most obvious name.

I agree ... "input" is perhaps the best name from a beginner's point of
view, and it is only a minor inconvenience for experienced
programmers.

Toby
-- 
Dr. Toby Donaldson
School of Computing Science
Simon Fraser University (Surrey)

From tjd at sfu.ca  Tue Sep 12 19:11:29 2006
From: tjd at sfu.ca (Toby Donaldson)
Date: Tue, 12 Sep 2006 10:11:29 -0700
Subject: [Python-3000] educational aspects of Python 3000
Message-ID: <a2f565170609121011l28a69241n97817905decfb0@mail.gmail.com>

> How about calling it 'ask'?
>
>  >>> s = ask( "How are you today?" )
> --> Fine
>  >>> s
> "Fine"
>
> And as far as the name of a library goes how about "quickstart"? Other
> possibilities are: quickstudy, kickstart, simplestart, etc.
>
> "With the Python quickstart module, programming is as easy as
> one...two...five!"

:-)

Actually, some educators (not me so much, but I see where they are
coming from) have negative reactions to library names like "teach" or
"edu" because they feel it sends a message to students that they are
not learning "real" Python.

A positive-sounding name like "quickstart" would avoid this problem.

Toby
-- 
Dr. Toby Donaldson
School of Computing Science
Simon Fraser University (Surrey)

From talin at acm.org  Tue Sep 12 19:35:39 2006
From: talin at acm.org (Talin)
Date: Tue, 12 Sep 2006 10:35:39 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <a2f565170609121011l28a69241n97817905decfb0@mail.gmail.com>
References: <a2f565170609121011l28a69241n97817905decfb0@mail.gmail.com>
Message-ID: <4506EFEB.6040001@acm.org>

Toby Donaldson wrote:
>> How about calling it 'ask'?
>>
>>  >>> s = ask( "How are you today?" )
>> --> Fine
>>  >>> s
>> "Fine"
>>
>> And as far as the name of a library goes how about "quickstart"? Other
>> possibilities are: quickstudy, kickstart, simplestart, etc.
>>
>> "With the Python quickstart module, programming is as easy as
>> one...two...five!"
> 
> :-)
> 
> Actually, some educators (not me so much, but I see where they are
> coming from) have negative reactions to library names like "teach" or
> "edu" because they feel it sends a message to students that they are
> not learning "real" Python.
> 
> A positive-sounding name like "quickstart" would avoid this problem.

That was exactly my thinking.

-- Talin

From nnorwitz at gmail.com  Tue Sep 12 19:53:13 2006
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Tue, 12 Sep 2006 10:53:13 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506C342.6010202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com>
Message-ID: <ee2a432c0609121053p33c98f7dm419314ec3a6b2fe2@mail.gmail.com>

On 9/12/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
>
>  We could always rename raw_input() to input(). Just a thought. . .
>
>  D'oh. Guido already said he doesn't like that idea :)
>
>  FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0
> confusion, so be it.   The use of input() function is already somewhat rare
> (both because of infrequent use cases and because of the stern warnings
> about eval's security issues).  It is better to bite the bullet and move on
> than it would be to avoid the most obvious name.

I agree.  Plus we are already doing something similar for {}.keys()
etc. by changing them in a somewhat subtle way.  I also recall
something weird when I ripped out input wrt readline or something.  I
don't recall if I checked in the removal of {raw_,}input or not.

This is also something easy to look for and flag.  pychecker already
catches uses of input and warns about it.

n

From rrr at ronadam.com  Tue Sep 12 23:03:30 2006
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 12 Sep 2006 16:03:30 -0500
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506C342.6010202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>	<4506BA08.7090905@gmail.com>	<4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com>
Message-ID: <ee77gu$3sa$1@sea.gmane.org>

Raymond Hettinger wrote:
> 
>>> We could always rename raw_input() to input(). Just a thought. . .
>>>     
>>
>> D'oh. Guido already said he doesn't like that idea :)
>>
>>   
> 
> FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0 
> confusion, so be it.   The use of input() function is already somewhat 
> rare (both because of infrequent use cases and because of the stern 
> warnings about eval's security issues).  It is better to bite the bullet 
> and move on than it would be to avoid the most obvious name.
> 
> Raymond

Maybe "input" can be depreciated in 2.x with a messages to use eval(raw_input()) 
instead.  That would limit some of the confusion.




From rhettinger at ewtllc.com  Tue Sep 12 23:58:19 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Tue, 12 Sep 2006 14:58:19 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <ee77gu$3sa$1@sea.gmane.org>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>	<4506BA08.7090905@gmail.com>	<4506BA63.7040201@gmail.com>	<4506C342.6010202@ewtllc.com>
	<ee77gu$3sa$1@sea.gmane.org>
Message-ID: <45072D7B.4020202@ewtllc.com>

Ron Adam wrote:

>Maybe "input" can be depreciated in 2.x with a messages to use eval(raw_input()) 
>instead.  That would limit some of the confusion.
>
>  
>

Let me take this opportunity to articulate a principle that I hope this 
group will adopt, "Thou shalt not muck-up Py2.x in the name of Py3k."

Given that Py3k will not be backwards compatible in many ways, we may 
expect that tons of code will remain in the 2.x world and it behooves us 
not to burden that massive codebase with Py3k oriented deprecations, 
warnings, etc.  It's okay to backport compatible feature additions and I 
expect that a number of people will author third-party transition tools, 
but let's not gum-up the current, wildly successful strain of Python.  
Expect that 2.x will continue to live side-by-side with Py3k for a long 
time.  It is a bit premature to read the will and auction off the estate ;-)

Any ideas for Py3k that are part of the natural evolution of the 2.x 
series can of course be done in parallel, but each 2.x proposal needs to 
be evaluated on its own merits.  IOW, "limiting 2.x vs 3k confusion" is 
NOT a sufficient reason to change 2.x.


Raymond





From bjourne at gmail.com  Wed Sep 13 00:56:27 2006
From: bjourne at gmail.com (=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=)
Date: Wed, 13 Sep 2006 00:56:27 +0200
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506351D.4040109@canterbury.ac.nz>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<4506351D.4040109@canterbury.ac.nz>
Message-ID: <740c3aec0609121556x12586796wa0888b8284eb94f@mail.gmail.com>

> > The idea of a standard edu library though is a GREAT one. That would
> > provide a standard place for things like raw_input() (with a better
> > name) as well as lots of other "helper functions" useful to beginners
> > and/or students -- and all it would cost is a single line of boilerplate
> > at the top of each program ("from beginnerlib import *" or something
> > like that).
>
> I disagree for two reasons:
>
> 1) Even a single line of boilerplate is too much
> when you're trying to pare things down to the
> bare minimum for a beginner.
>
> 2) It teaches a bad habit right from the
> beginning (i.e. using 'import *'). This is the
> wrong foot to start a beginner off on.

I agree. For an absolute newbie, Python's import semantics are way, WAY
down the road, long after variables, numbers, strings, comments,
control statements, functions etc. A third reason is that if these
functions are packaged in a beginnerlib module, then you would have to
type "from beginnerlib import *" each and every time you want to use
raw_input() from the Python console.

-- 
mvh Björn

From steven.bethard at gmail.com  Wed Sep 13 04:47:02 2006
From: steven.bethard at gmail.com (Steven Bethard)
Date: Tue, 12 Sep 2006 20:47:02 -0600
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <45072D7B.4020202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com> <ee77gu$3sa$1@sea.gmane.org>
	<45072D7B.4020202@ewtllc.com>
Message-ID: <d11dcfba0609121947g752b4f03wab49b7a09bddc8fd@mail.gmail.com>

On 9/12/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
> Ron Adam wrote:
> >Maybe "input" can be depreciated in 2.x with a messages to use eval(raw_input())
> >instead.  That would limit some of the confusion.
>
> Let me take this opportunity to articulate a principle that I hope this
> group will adopt, "Thou shalt not muck-up Py2.x in the name of Py3k."

I agree 100% with this principle.  But "input" could definitely get a
warning when the Python 2.x --warn-me-about-python-3-incompatibilities
switch is given.  Guido's already suggested that, for example, using
the result of dict.items() for anything other than iteration should
issue such a warning.

STeVe
-- 
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a
tiny blip on the distant coast of sanity.
        --- Bucky Katt, Get Fuzzy

From martin at v.loewis.de  Wed Sep 13 06:56:33 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:56:33 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
Message-ID: <45078F81.4010506@v.loewis.de>

Paul Prescod wrote:
> I haven't created locale-relevant content in a generic text editor in a
> very, very long time.

You are an atypical user, then. I use plain text files all the time, and
I know other people do as well.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:38:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:38:30 +0200
Subject: [Python-3000] string C API
In-Reply-To: <ed968c$d03$1@sea.gmane.org>
References: <ed968c$d03$1@sea.gmane.org>
Message-ID: <45078B46.90408@v.loewis.de>

Fredrik Lundh wrote:
> just noticed that PEP 3100 says that PyString_AsEncodedString and
> PyString_AsDecodedString is to be removed, but it doesn't mention
> any other PyString (or PyUnicode) functions.
> 
> how large changes can we make here, really ?

All API that refers to the internal representation should be
changed or removed; in theory, that could be all API that has
char* arguments.

For example, PyString_From{String[AndSize]|Format} would either:
- have to grow an encoding argument
- assume a default encoding (either ASCII or UTF-8)
- change its signature to operate on Py_UNICODE* (although
  we don't have literals for these) or
- be removed

Likewise, PyString_AsString either goes away or changes its
return type.

String APIs that operate on PyObject* likely can stay as-is.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:10:38 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:10:38 +0200
Subject: [Python-3000] Character Set Indepencence
In-Reply-To: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
References: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
Message-ID: <450784BE.7050802@v.loewis.de>

Paul Prescod wrote:
> I think that the gist of it is that Unicode will be "just one character
> set" supported by Ruby. This idea has been kicked around for Python
> before but you quickly run into questions about how you compare
> character strings from multiple character sets, to say nothing of the
> complexity of a character encoding and character set agnostic regular
> expression engine.

As Guido says, the arguments for "CSI (character set independence)"
are hardly convincing. Yes, there are cases where Unicode doesn't
"round-trip", but they are so obscure that they (IMO) can be ignored
safely.

There are two problems in this respect with Unicode:
- in some cases, a character set may contain characters that are
  not included in Unicode. This was a serious problem for Chinese
  for quite some time, but I believe this is now fixed, with the
  plane-2 additions. If just round-tripping is the goal, then it
  is always possible for a codec to map characters to the
  private-use areas of Unicode. This is not optimal, since a
  different codec may give a different meaning to the same PUA
  characters, but there should rarely be a need to use them in
  the first place.

- in some cases, the input encoding has multiple representations
  for what becomes the same character in Unicode. For example,
  in ISO-2022-jp, there are three ways to encode the latin
  letters (either in ASCII, or in the romaji part of
  either JIS X 0208-1978 or JIS X 0208-1983). You can switch
  between these in a single string; if you go back and forth
  through Unicode, you get a normalized version that
  .encode gives you. While I have seen people bringing it
  up now and then, I don't recall anybody claiming that this
  is a real, practical problem.

There is a third problem that people often associate with
Unicode: due to the Han unification, you don't know whether
a certain Han character originates from Chinese, Japanese,
or Korean. This is a problem when rendering Unicode: you
don't know what glyphs to use (as you should use different
glyphs depending on the natural language). With CSI, you
can use a "language-aware encoding": you use a Japanese
encoding for Japanese text, and so on, then use the encoding
to determine what the language is.

For Unicode, there are several ways to deal with it:
- you could carry language information along with the
  original text. This is what is commonly done in the
  web: you put language information into the HTML,
  and then use that to render the text correctly.
- you could embed language information into the Unicode
  string, using the plane-14 tag characters (see the
  sketch after this list). This
  should work fairly nicely, since you only need
  a single piece of information, but has some drawbacks:
  * you need four-byte Unicode, or surrogates
  * if you slice such a string, the slices won't
    carry the language tag
  * applications today typically don't know how to
    deal with tag characters
- you could guess the language from the content, based
  on the frequency of characters (e.g. presence
  of katakana/hiragana would indicate that it is
  Japanese). As with all guessing, there are
  cases where it fails. I believe that web browsers
  commonly apply that approach, anyway.
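
(The sketch of the plane-14 tagging mentioned above: the language tag
is U+E0001 followed by tag letters at U+E0020 plus each ASCII letter's
code. It assumes a wide (UCS-4) Python build, since unichr() cannot
reach plane 14 on a narrow build:)

def language_tag(code):
    # U+E0001 LANGUAGE TAG, then one tag letter per ASCII letter
    return u'\U000E0001' + u''.join(unichr(0xE0020 + ord(c)) for c in code)

tagged = language_tag('ja') + u'\u6f22\u5b57'   # Han text tagged as Japanese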

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:44:54 2006
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:44:54 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ed8pd9$ch$1@sea.gmane.org>
References: <ed8pd9$ch$1@sea.gmane.org>
Message-ID: <45078CC6.7050505@v.loewis.de>

Fredrik Lundh wrote:
> today's Python supports "locale aware" 8-bit strings; e.g.
> 
>     >>> import locale
>     >>> "åäö".isalpha()
>     False
>     >>> locale.setlocale(locale.LC_ALL, "sv_SE")
>     'sv_SE'
>     >>> "åäö".isalpha()
>     True
> 
> to what extent should this be supported by Python 3000 ?

I would like to see locale-aware operations, but with an
explicit locale, e.g.

import locale
l = locale.load(locale.LC_ALL, "sv_SE")
print l.isalpha("åäö")

(i.e. character properties become locale methods,
not string methods).

To implement that, we would have to incorporate ICU,
which would be a tough decision to make (or have our own
implementation based on the tables that ICU uses).

Alternatively, we could try to get such locale objects
from system APIs where available (e.g. <xlocale.h>
in glibc), and not provide them on systems that don't
have locale objects in their APIs.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:51:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:51:30 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FDBDFF.7090505@sweetapp.com>
References: <ed8pd9$ch$1@sea.gmane.org>	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
Message-ID: <45078E52.9080402@v.loewis.de>

Brian Quinlan wrote:
> As a user, I don't have any expectations regarding non-ASCII text files.
> 
> I'm using a US-English version of Windows XP (very common) and I haven't 
> changed the default encoding (very common). Python claims that my system 
> encoding is CP437 (from sys.stdin/stdout.encoding).

You are misinterpreting the data you see. Python makes no claims about
your system encoding in sys.stdout.encoding. Instead, it makes a claim
about your terminal's encoding, and that is indeed CP437 (just do
"type foo.txt" with a document that contains non-ASCII characters,
and watch the characters in the terminal look different from the
ones in notepad).

It is an unfortunate fact that Windows has *two* system encodings: one
used for "Windows", and one used for the "OEM". The terminal uses the
OEM code page (by default, unless you run chcp.exe).
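
(The two can be queried directly with ctypes on Windows; a quick,
Windows-only sketch:)

import ctypes
kernel32 = ctypes.windll.kernel32
print "ANSI ('Windows') code page:", kernel32.GetACP()    # e.g. 1252
print "OEM (terminal) code page:  ", kernel32.GetOEMCP()  # e.g. 437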

> I can assure you
> that most of the documents that I work with are not in CP437 - they are 
> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that 
> this is true of many Windows XP (US-English) users. So, for me and users 
> like me, Python is going to silently misinterpret my data.

No. It will use a different API to determine the system encoding, and
it will guess correctly.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:20:12 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:20:12 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>	<44F9E844.2020603@acm.org>
	<1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com>
Message-ID: <450786FC.2020808@v.loewis.de>

tomer filiba wrote:
> # read 3 UTF8 *characters*
> f.read(3)
> 
> # this will seek by AT LEAST 7 *bytes*, until resynched
> f.substream.seekby(7)
> 
> # we can resume reading of UTF8 *characters*
> f.read(3)
> 
> heck, i even like this idea :)

Notice that resyncing is a really tricky operation, and
should not be expected to work for all encodings. For
example, for the iso-2022 encodings, you have to know
what character set you are "in", and you have to read
forward/backward until you find a character-code switching
escape sequence.

There is an RFC-imposed requirement that each line
of input is "neutral" wrt. character set switching,
so you can typically synchronize at a line break. Still,
this could require skipping an arbitrary amount of text.
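
(Python's own codec makes the switching visible; note the ESC $ B and
ESC ( B escapes around the JIS-encoded bytes:)

>>> u'abc\u3042'.encode('iso2022_jp')
'abc\x1b$B$"\x1b(B'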

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:53:52 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:53:52 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE1A65.7020900@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org>	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	<44FDBDFF.7090505@sweetapp.com>	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
	<44FE1A65.7020900@blueyonder.co.uk>
Message-ID: <45078EE0.8090906@v.loewis.de>

David Hopwood wrote:
> Cp437 is almost certainly *not* the encoding set by the OS; Python
> has got it wrong.

Just to repeat myself: Python is *not* wrong, the terminal *indeed*
uses CP 437.

> If Brian is using an English-language variant of
> Windows XP and has not changed the defaults, the system ("ANSI")
> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
> if C1 control characters are not used).

Yes, and the OEM encoding will be CP 437. It is common to interpret
CP_ACP as the system encoding, yet Windows has two of them, and Python
knows very well which one to use in which place.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:22:16 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:22:16 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <ede6m9$c9g$1@sea.gmane.org>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>	<44FA0E59.9010302@canterbury.ac.nz>
	<ede6m9$c9g$1@sea.gmane.org>
Message-ID: <45078778.6000407@v.loewis.de>

Fredrik Lundh wrote:
>> The best you could do would be to return some kind
>> of opaque object from tell() that could be passed
>> back to seek().
> 
> that's how seek/tell works on text files in today's Python, of course. 
> if you're writing portable code, you can only seek to the beginning or 
> end of the file, or to a position returned to you by tell.

The problem is that for character-oriented streams, that position
should also incorporate the "shift state" of the codec. To support
that, the codec API would need to grow a way to export and import
its state into such "tell objects".
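
(A sketch of what such a "tell object" might look like, assuming an
incremental decoder that can export and restore its state;
getstate()/setstate() follow the codecs.IncrementalDecoder interface,
the rest is hypothetical:)

import codecs

class TextPos(object):
    # Opaque position: callers can only hand it back to seek().
    def __init__(self, byte_pos, decoder_state):
        self.byte_pos = byte_pos
        self.decoder_state = decoder_state

def text_tell(raw, decoder):
    # Capture the byte offset *and* the codec's shift state together.
    return TextPos(raw.tell(), decoder.getstate())

def text_seek(raw, decoder, pos):
    raw.seek(pos.byte_pos)
    decoder.setstate(pos.decoder_state)

# e.g. decoder = codecs.getincrementaldecoder('iso2022_jp')()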

Regards,
Martin

From paul at prescod.net  Wed Sep 13 08:10:29 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 12 Sep 2006 23:10:29 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <45078E52.9080402@v.loewis.de>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com> <45078E52.9080402@v.loewis.de>
Message-ID: <1cb725390609122310k44b99f9eqb12aec7c5fadf886@mail.gmail.com>

On 9/12/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
>
>
> > I can assure you
> > that most of the documents that I work with are not in CP437 - they are
> > a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that
> > this is true of many Windows XP (US-English) users. So, for me and users
> > like me, Python is going to silently misinterpret my data.
>
> No. It will use a different API to determine the system encoding, and
> it will guess correctly.


If Python reports "cp1252" as I expect it to, then it has not "guessed
correctly" for Brian's documents as described above. The mistake will be
harmless for the ASCII files and often for the ISO8859-1 files, but would be
dangerous for the UTF-8 ones.

 Paul Prescod

From brian at sweetapp.com  Wed Sep 13 10:06:41 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Wed, 13 Sep 2006 10:06:41 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <45078E52.9080402@v.loewis.de>
References: <ed8pd9$ch$1@sea.gmane.org>	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com> <45078E52.9080402@v.loewis.de>
Message-ID: <4507BC11.8040901@sweetapp.com>

Martin v. Löwis wrote:
>> I can assure you
>> that most of the documents that I work with are not in CP437 - they are 
>> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that 
>> this is true of many Windows XP (US-English) users. So, for me and users 
>> like me, Python is going to silently misinterpret my data.
> 
> No. It will use a different API to determine the system encoding, and
> it will guess correctly.

You are addressing a completely different issue. I am saying that Python 
is going to silently misinterpret my *data* and you are saying that it 
is going to correctly determine the *system encoding*.

As a user, I don't directly care if Python guesses my system encoding 
correctly or not, but I do care if it tries to interpret my UTF-8 
documents as Windows-1252 (which will succeed) and I end up 
transmitting/storing/displaying incorrect data.

Cheers,
Brian


From ajm at flonidan.dk  Wed Sep 13 10:27:31 2006
From: ajm at flonidan.dk (Anders J. Munch)
Date: Wed, 13 Sep 2006 10:27:31 +0200
Subject: [Python-3000] iostack, second revision
Message-ID: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>

Josiah Carlson wrote:
> "Anders J. Munch" <ajm at flonidan.dk> wrote:
> > I don't expect file methods and system calls to map one to one, but
> > you're right, the first time the length is needed, that's an extra
> > system call.
> 
> Every time the length is needed, a system call is required 
> (you can have
> multiple writers of the same file)...

Point taken.  It's very rarely a good idea to do so, but the
possibility of multiple writers shouldn't be ignored.  Still there is
no real performance issue.  If anything, replacing
f.seek(0,2);f.tell() with f.length in various places might save a few
system calls.

> 
> Flushing during seek is important.  By not flushing during 
> seek in your
> FileBytes object, you are unnecessarily delaying writes, which could
> cause file corruption.

That's what the flush method is for.  The real reason seek implies
flush is to save the library author the bother of getting the
interactions between input and output buffering right.
Anyway, FileBytes has no seek and no concept of current file position,
so I really don't know what you're talking about :)

- Anders

From john at yates-sheets.org  Wed Sep 13 15:24:00 2006
From: john at yates-sheets.org (John S. Yates, Jr.)
Date: Wed, 13 Sep 2006 09:24:00 -0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
Message-ID: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>

On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:

> UTF-8 with BOM is the Microsoft preferred format.

I believe this is a gloss.  Microsoft uses UTF-16.  Because
the basic character unit is larger than one byte it is crucial
for interoperability to prefix a string of UTF-16 text with an
indication of the order of bytes in each two byte unit.  This
is the role of the BOM.  The BOM is not part of the text.  It
is a wrapper or envelope.

It is a mistake on Microsoft's part to fail to strip the BOM
during conversion to UTF-8.  There is no MEANINGFUL definition
of BOM in a UTF-8 string.  But instead of stripping the wrapper
and converting only the text payload Microsoft lazily treats
both the wrapper and its payload as text.

You can see the logical fallacy if you imagine emitting UTF-16
text in an environment of one byte sex, reducing that text to
UTF-8, carrying it to an environment of the other byte sex and
raising it back to UTF-16.  The Unicode.org assumption is that
on generation one organizes the bytes of UTF-16 or UTF-32 units
according to what is most convenient for a given environment.
One prefixes a BOM to text objects to be persisted or passed
to differing byte-sex environments.  Such an object is not a
string but a means of inter-operation.

If the BOMs are not stripped during reduction to UTF-8 and are
reconstituted during raising to UTF-16 or UTF-32 then raising
must honor the BOM and the Unicode.org efficiency objective is
subverted.

You can take this further and imagine concatenating two UTF-8
strings, one originally UTF-16 generated in a little-endian
environment, the other originally UTF-16 generated in a big-
endian environment.  If the BOMs are not pre-stripped then
during raising of the concatenated result to UTF-16 you will
get an object with embedded BOMs.  This is not meaningful.
What does it mean within a UTF-16 string to encounter a BOM
that contradicts the wrapper/envelope?  Does this mean that
any correct UTF-16 utility must cope with a hybrid object whose
byte order potentially changes mid-stride?

/john, who has written a database loader that has to contend
with (and clearly diagnoses) BOM in UTF-8 strings.





From jimjjewett at gmail.com  Wed Sep 13 15:34:47 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 13 Sep 2006 09:34:47 -0400
Subject: [Python-3000] string C API
In-Reply-To: <45078B46.90408@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
Message-ID: <fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>

On 9/13/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Fredrik Lundh schrieb:
> > just noticed that PEP 3100 says that PyString_AsEncodedString and
> > PyString_AsDecodedString is to be removed, but it doesn't mention
> > any other PyString (or PyUnicode) functions.

> > how large changes can we make here, really ?

> All API that refers to the internal representation should be
> changed or removed; in theory, that could be all API that has
> char* arguments.

This is sufficient to allow polymorphic strings -- including strings
whose data is implemented as a view into some other object.

> For example, PyString_From{String[AndSize]|Format} would either:
> - have to grow an encoding argument
> - assume a default encoding (either ASCII or UTF-8)
> - change its signature to operate on Py_UNICODE* (although
>   we don't have literals for these) or
> - be removed

Should encoding be an attribute of the string?

If so, should recoding require the creation of a new string (in the
new encoding)?

-jJ

From qrczak at knm.org.pl  Wed Sep 13 15:37:05 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 13 Sep 2006 15:37:05 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com> (John S. Yates,
	Jr.'s message of "Wed, 13 Sep 2006 09:24:00 -0400")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <87u03bq45a.fsf@qrnik.zagroda>

"John S. Yates, Jr." <john at yates-sheets.org> writes:

> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.  But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.

The Unicode standard is at fault too.

It specifies UTF-16 and UTF-32 in variants:

- UTF-{16,32} with an optional BOM (defaulting to big endian if the
  BOM is not present), where the BOM is mandatory if the first
  character of the contents is U+FEFF (otherwise it would be mistaken
  for a BOM).

- UTF-{16,32}{LE,BE} with a fixed endianness and without a BOM;
  a U+FEFF in UTF-16BE must not be interpreted as a BOM, it's always
  a part of the text.

The problem is that it's not clear in the case of UTF-8. Formally it
doesn't have a BOM, but the standard includes some ambiguous wording to
the effect that various software uses a UTF-8 BOM and that its presence
should not affect the interpretation. It should clearly distinguish two
interpretations of UTF-8: one without the concept of a BOM, and one
which permits the BOM (and in fact makes it mandatory if the stream
begins with U+FEFF).
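
(Python already models both readings: the plain utf-8 codec treats a
leading U+FEFF as text, while utf-8-sig treats it as a signature:)

>>> '\xef\xbb\xbfabc'.decode('utf-8')
u'\ufeffabc'
>>> '\xef\xbb\xbfabc'.decode('utf-8-sig')
u'abc'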

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Wed Sep 13 17:15:47 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 17:15:47 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
Message-ID: <450820A3.4000302@v.loewis.de>

Jim Jewett wrote:
>> For example, PyString_From{String[AndSize]|Format} would either:
>> - have to grow an encoding argument
>> - assume a default encoding (either ASCII or UTF-8)
>> - change its signature to operate on Py_UNICODE* (although
>>   we don't have literals for these) or
>> - be removed
> 
> Should encoding be an attribute of the string?

No. A Python string is a sequence of Unicode characters.
Even if it was created by converting from some other encoding,
that original encoding gets lost when doing the conversion
(just like integers don't remember which base they were originally
represented in).

Regards,
Martin

From jcarlson at uci.edu  Wed Sep 13 18:21:52 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 13 Sep 2006 09:21:52 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
Message-ID: <20060913084256.F930.JCARLSON@uci.edu>


"Anders J. Munch" <ajm at flonidan.dk> wrote:
> Josiah Carlson wrote:
> > "Anders J. Munch" <ajm at flonidan.dk> wrote:
> > > I don't expect file methods and system calls to map one to one, but
> > > you're right, the first time the length is needed, that's an extra
> > > system call.
> > 
> > Every time the length is needed, a system call is required 
> > (you can have
> > multiple writers of the same file)...
> 
> Point taken.  It's very rarely a good idea to do so, but the
> possibility of multiple writers shouldn't be ignored.  Still there is
> no real performance issue.  If anything, replacing
> f.seek(0,2);f.tell() with f.length in various places might save a few
> system calls.

Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless
they want to seek to the end of the file for later writing or expected
reading of data yet-to-be-written.  Interesting that both of these cases
basically read and write to the same file at the same time (perhaps even
in the same process), something you yourself said, "In all my
programming days I don't believe I written to and read from the same
file handle even once. Use cases exist, like if you're implementing a
DBMS..."


> > Flushing during seek is important.  By not flushing during 
> > seek in your
> > FileBytes object, you are unnecessarily delaying writes, which could
> > cause file corruption.
> 
> That's what the flush method is for.  The real reason seek implies
> flush is to save the library author the bother of getting the
> interactions between input and output buffering right.
> Anyway, FileBytes has no seek and no concept of current file position,
> so I really don't know what you're talking about :)

I was talking about your earlier statement, which I quoted in my earlier
reply to you:

> My micro-optimisation circuitry blew a fuse when I discovered that
> seek always implies flush.  You won't get good performance out of code
> that does a lot of seeks, whatever you do.  Use my upcoming FileBytes
> class :)

And with the context of a previous message from you:

> FileBytes would support the sequence protocol, mimicking bytes objects.
> It would support random-access read and write using __getitem__ and
> __setitem__, allowing slice assignment for slices of equal size.  And
> there would be append() to extend the file, and partial __delitem__
> support for truncating.

While it doesn't have the methods seek or tell, the underlying
implementation needs to use seek and tell (or a memory-mapped file, mmap). 
You were also talking about buffering writes to reduce the overhead of
the underlying seeks and tells because of apparent "optimizations" you
wanted to make. Here is a data integrity optimization you can make for
me: flush when accessing the file non-sequentially; any other behavior
could corrupt the data of users who have been relying on "seek implies
flush".


I would also mention that your FileBytes class is essentially a fake
memory-mapped file, and while I also have implemented an equivalent
class (for low-memory testing purposes in a DBMS-like situation), I find
using an mmap to be far faster and generally more reliable (and
usable with buffer()) than my FileBytes equivalent. Never mind that the
vast majority of users don't want a sequence interface to a file, they
want a stream interface, which is why you don't see many FileBytes-like
objects out in the wild, or really anyone suggesting such a wrapper
object be in the standard library.

With that said, I'm not sure your FileBytes object is really necessary
or desired for the future io library.  If people want that kind of an
interface, they can use mmap (and push for the various mmap bugs/feature
requests to be fixed), otherwise they should be using readable /
writable / both streams, something that Tomer has been working towards.
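
(For comparison, a minimal sketch of the mmap route on an ordinary
writable file; 'test.db' is a placeholder, and resizing and locking are
glossed over:)

import mmap, os

f = open('test.db', 'r+b')
m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size)
m[0:4] = 'hdr!'     # random-access write, as if it were a mutable bytes
data = m[4:16]      # random-access read, no seek/tell in sight
m.flush()           # push the changes back to the file
m.close()
f.close()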


 - Josiah


From jcarlson at uci.edu  Wed Sep 13 18:41:01 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 13 Sep 2006 09:41:01 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <20060913092509.F933.JCARLSON@uci.edu>


"John S. Yates, Jr." <john at yates-sheets.org> wrote:
> 
> On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
> 
> > UTF-8 with BOM is the Microsoft preferred format.
> 
> I believe this is a gloss.  Microsoft uses UTF-16.  Because
> the basic character unit is larger than one byte it is crucial
> for interoperability to prefix a string of UTF-16 text with an
> indication of the order of bytes in each two byte unit.  This
> is the role of the BOM.  The BOM is not part of the text.  It
> is a wrapper or envelope.
> 
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.  But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.

I have actually had a variant of this particular discussion with Walter
Dörwald.  He brought up RFC 3629...

[Walter Dörwald]
I don't think it does. RFC 3629 isn't that clear about whether an
initial 0xEF 0xBB 0xBF sequence is to be interpreted as an encoding
signature or a ZWNBSP. But I think the following part of RFC 3629
applies here for Python source code:

   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol provides
      character encoding identification mechanisms, when it is expected
      that implementations of the protocol will be in a position to
      always use the mechanisms properly.  This will be the case when
      the protocol elements are maintained tightly under the control of
      the implementation from the time of their creation to the time of
      their (properly labeled) transmission.

[My reply, slightly altered for this context]
Because not all tools that may manipulate data consumed and/or produced
by Python follow the coding: directive, "the protocol elements" are
not 'tightly maintained', so the inclusion of a "BOM" for utf-8 is a
necessary "protocol element", at least for .py files, and certainly
suggested for other file types that _may not have_ the equivalent of a
Python coding: directive.


Explicit is better than implicit, and in this case we have the
opportunity to be explicit about the "envelope" or "the protocol
elements", which will guarantee proper interpretation by non-braindead
software.  Braindead software that doesn't understand a utf-* BOM should
be fixed by the developer or eschewed.


> You can take this further and imagine concatenating two UTF-8
> strings, one originally UTF-16 generated in a little-endian
> environment, the other originally UTF-16 generated in a big-
> endian environment.  If the BOMs are not pre-stripped then
> during raising of the concatenated result to UTF-16 you will
> get an object with embedded BOMs.  This is not meaningful.

And is generally ignored, as per the Unicode spec; it's a "zero width
non-breaking space" - an invisible character with no effect on wrapping
or otherwise.

> What does it mean within a UTF-16 string to encounter a BOM
> that contradicts the wrapper/envelope?  Does this mean that
> any correct UTF-16 utility much cope with hybrid object whose
> byte order potentially changes mid-stride?

Unless you are doing something wrong (like literally concatenating the
byte representations of a utf-16be and utf-16le encoded text), this
won't happen.


> /john, who has written a database loader that has to contend
> with (and clearly diagnoses) BOM in UTF-8 strings.

Given that a BOM is only supposed to be treated as a BOM when it is
literally the first few bytes of a string, I certainly hope you didn't
spend too much time on that support.

 - Josiah (who has written an editor with support for all UTF variants
with BOM, and UTF-8 + all other localized encodings using coding:
directives)


From paul at prescod.net  Wed Sep 13 18:44:18 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 13 Sep 2006 09:44:18 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <1cb725390609130944k20c315c1k482ff1bd7cc5a85a@mail.gmail.com>

On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
>
> On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
>
> > UTF-8 with BOM is the Microsoft preferred format.
>
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.


That is not true.

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?
If yes, then can I still assume the remaining UTF-8 bytes are in
big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes *no* difference as
to the endianness of the byte stream. UTF-8 always has the same byte
order. An initial BOM is *only* used as a signature -- an indication
that an otherwise unmarked text file is in UTF-8.

This is a very valuable function and applications like Microsoft's Notepad,
Apple's TextEdit and VIM take good advantage of it.

"""

Vim will try to detect what kind of file you are editing.  It uses the
encoding names in the 'fileencodings' option.  When using Unicode, the
default value is: "ucs-bom,utf-8,latin1".  This means that Vim checks the
file to see if it's one of these encodings:

	ucs-bom		File must start with a Byte Order Mark (BOM).  This
			allows detection of 16-bit, 32-bit and utf-8 Unicode
			encodings.
	utf-8		utf-8 Unicode.  This is rejected when a sequence of
			bytes is illegal in utf-8.
	latin1		The good old 8-bit encoding.
"""

I'm pretty much proposing this same algorithm for Python's encoding
guessing.
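
In rough (and untested) Python terms, the sketch would be something like
the following; guess_decode and the BOM table are just illustrative
names, and I've left out the 32-bit BOMs since Python currently has no
utf-32 codec:

    import codecs

    _BOMS = [
        (codecs.BOM_UTF8, 'utf-8'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ]

    def guess_decode(data):
        # 1. ucs-bom: the file starts with a Byte Order Mark
        for bom, name in _BOMS:
            if data.startswith(bom):
                return data[len(bom):].decode(name)
        # 2. utf-8: rejected when a byte sequence is illegal in utf-8
        try:
            return data.decode('utf-8')
        except UnicodeDecodeError:
            pass
        # 3. latin1: the good old 8-bit encoding - never fails
        return data.decode('latin-1')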

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060913/87ce70e9/attachment.html 

From jimjjewett at gmail.com  Wed Sep 13 19:09:27 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 13 Sep 2006 13:09:27 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450820A3.4000302@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
Message-ID: <fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>

On 9/13/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > Should encoding be an attribute of the string?

> No. A Python string is a sequence of Unicode characters.
> Even if it was created by converting from some other encoding,
> that original encoding gets lost when doing the conversion
> (just like integers don't remember which base they were originally
> represented in).

Theoretically, it is a sequence of code points.

Today, in Python 2.x, these are always represented by a specific
(wide, fixed-width) concrete encoding, chosen at compile time.  This
is required so long as outside code can access the data buffer
directly.

It would no longer be required if all access were through unicode
methods.  (And it would probably make sense to have a
"get-me-the-buffer-in-this-encoding" method.)

Several people seem to want more efficient representations when possible.

Several people seem to want UTF-8, which makes sense if the rest of
the system is UTF8, but complicates the implementation.

Simply not encoding/decoding until required would save quite a bit of
time and space -- but then the object would need some way of
indicating which encoding it is in.
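
A rough sketch of what I mean (all names invented, untested):

    class LazyText(object):
        """Bytes plus the encoding they claim to be in; decode on demand."""
        def __init__(self, raw, encoding):
            self._raw = raw            # original, undecoded bytes
            self._encoding = encoding  # the encoding they arrived in
            self._text = None          # cache for the decoded form
        def _decoded(self):
            if self._text is None:
                self._text = self._raw.decode(self._encoding)
            return self._text
        def __len__(self):
            return len(self._decoded())
        def buffer_in(self, encoding):
            # the "get-me-the-buffer-in-this-encoding" method
            if encoding == self._encoding:
                return self._raw       # no recoding needed
            return self._decoded().encode(encoding)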

-jJ

From martin at v.loewis.de  Wed Sep 13 19:14:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 19:14:30 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>	
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>	
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
Message-ID: <45083C76.8010302@v.loewis.de>

Jim Jewett schrieb:
> Simply not encoding/decoding until required would save quite a bit of
> time and space -- but then the object would need some way of
> indicating which encoding it is in.

Try implementing that some time. You'll find it will be incredibly
complex and unmaintainable. Start with implementing len(s).

Regards,
Martin


From jimjjewett at gmail.com  Wed Sep 13 19:27:28 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 13 Sep 2006 13:27:28 -0400
Subject: [Python-3000] string C API
In-Reply-To: <45083C76.8010302@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
Message-ID: <fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>

On 9/13/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Jim Jewett schrieb:
> > Simply not encoding/decoding until required would save quite a bit of
> > time and space -- but then the object would need some way of
> > indicating which encoding it is in.

> Try implementing that some time. You'll find it will be incredibly
> complex and unmaintainable. Start with implementing len(s).

Simply delegate such methods to a hidden per-encoding subclass.

The UTF-8 methods will indeed be complex, unless the solution is
simply "someone called indexing/slicing/len, so I have to recode after
all."

The Latin-1 encoding will have no such problem.

-jJ

From guido at python.org  Wed Sep 13 20:06:05 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Sep 2006 11:06:05 -0700
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
In-Reply-To: <45062AD2.1090207@canterbury.ac.nz>
References: <1157898432.4246.161.camel@fsol>
	<ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>
	<45062AD2.1090207@canterbury.ac.nz>
Message-ID: <ca471dc20609131106n23158790pc1255ddb97abaf0c@mail.gmail.com>

On 9/11/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Guido van Rossum wrote:
>
> > All sorts of things are different when reading stdin vs. opening a
> > filename. e.g. stdin may be a pipe.
>
> Which suggests that if anything is going to try
> to guess the encoding, it would be better for it
> to start reading from the actual stream you're
> going to use and buffer the result, rather than
> rely on being able to open it separately.

Right. The filename is useless. The stream may or may not be seekable
(sometimes even stdin is!). Having a buffering layer in between would
make it possible to peek ahead in the buffer.
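
Something along these lines (an illustrative sketch only; the class name
is made up and this isn't a proposal for the real I/O stack):

    class PeekableStream(object):
        """Wrap a possibly unseekable byte stream and allow peeking ahead."""
        def __init__(self, fileobj):
            self._f = fileobj
            self._buf = ''
        def peek(self, n):
            # Read ahead without consuming, even if the stream can't seek.
            while len(self._buf) < n:
                chunk = self._f.read(n - len(self._buf))
                if not chunk:
                    break
                self._buf += chunk
            return self._buf[:n]
        def read(self, n):
            data = self._buf[:n]
            self._buf = self._buf[n:]
            if len(data) < n:
                data += self._f.read(n - len(data))
            return data

An encoding guesser could then peek() at the first few bytes to sniff a
BOM before any real reads happen.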

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Wed Sep 13 20:09:42 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 20:09:42 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>	
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>	
	<450820A3.4000302@v.loewis.de>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
Message-ID: <45084966.3000608@v.loewis.de>

Jim Jewett schrieb:
> Simply delegate such methods to a hidden per-encoding subclass.
> 
> The UTF-8 methods will indeed be complex, unless the solution is
> simply "someone called indexing/slicing/len, so I have to recode after
> all."
> 
> The Latin-1 encoding will have no such problem.

I'm not so much worried about UTF-8 or Latin-1; they are fairly trivial.
Such methods would be dramatically slow for other multi-byte encodings.

Regards,
Martin

From jason.orendorff at gmail.com  Wed Sep 13 20:23:33 2006
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Wed, 13 Sep 2006 14:23:33 -0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>

On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.

John, you're mistaken about the reason this BOM is here.

In Notepad at least, the BOM is intentionally generated when writing
the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
admit the BOM was not originally invented for this purpose.)

> There is no MEANINGFUL definition of BOM in a UTF-8
> string.

This thread is about files, not strings.  At the start of a file, a
UTF-8 BOM is meaningful.  It means the file is UTF-8.

On Windows, there's a system default encoding, and it's never UTF-8.
Notepad writes the BOM so that later, when you open the file in
Notepad again, it can identify the file as UTF-8.

> You can see the logical fallacy if you imagine emitting UTF-16
> text in an environment of one byte sex, reducing that text to
> UTF-8, carrying it to an environment of the other byte sex and
> raising it back to UTF-16.

It sounds as if you think this will corrupt the BOM, but it works fine:

 >>> import codecs
 # "Emitting UTF-16 text" in little-endian environment
 >>> s1 = codecs.BOM_UTF16_LE + u'hello world'.encode('utf-16-le')
 # "Reducing that text to UTF-8"
 >>> s2 = s1.decode('utf-16-le').encode('utf-8')
 >>> s2
 '\xef\xbb\xbfhello world'
 # "Raising it back to UTF-16" in big-endian environment
 >>> s3 = s2.decode('utf-8').encode('utf-16-be')
 >>> s3[:2] == codecs.BOM_UTF16_BE
 True

The BOM is still correct: the data is UTF-16-BE, and the BOM agrees.

A UTF-8 string or file will contain exactly the same bytes (including
the BOM, if any) whether it is generated from UTF-16-BE or -LE.  All
three are lossless representations in bytes of the same abstract
ideal, which is a sequence of Unicode codepoints.

-j

From rasky at develer.com  Wed Sep 13 22:09:38 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Wed, 13 Sep 2006 22:09:38 +0200
Subject: [Python-3000] educational aspects of Python 3000
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com><4506351D.4040109@canterbury.ac.nz>
	<740c3aec0609121556x12586796wa0888b8284eb94f@mail.gmail.com>
Message-ID: <05df01c6d770$8a97c8f0$43492597@bagio>

Björn Lindqvist <bjourne at gmail.com> wrote:

>>> The idea of a standard edu library though is a GREAT one.
>>> [...]

>> I disagree for two reasons:
>>
>> 1) Even a single line of boilerplate is too much
>> when you're trying to pare things down to the
>> bare minimum for a beginner.
>>
>> 2) It teaches a bad habit right from the
>> beginning (i.e. using 'import *'). This is the
>> wrong foot to start a beginner off on.
>
> I agree. For an absolute newbie, Python's import semantics are way, WAY
> down the road, long after variables, numbers, strings, comments,
> control statements, functions etc. A third reason is that if these
> functions are packaged in a beginnerlib module, then you would have to
> type "from beginnerlib import *" each and every time you want to use
> raw_input() from the Python console.

Another solution would be to have a special "python --edu" command line
option which automatically star-imports the beginnerlib before the
interactive mode starts. Or a PYTHONEDU=1 env var. Or a custom site.py
which patches __builtins__.
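
For the site.py route, a minimal sketch (beginnerlib is hypothetical, as
in this thread):

    # sitecustomize.py - a sketch; beginnerlib is the hypothetical edu module
    import __builtin__
    import beginnerlib
    for _name in getattr(beginnerlib, '__all__', []):
        setattr(__builtin__, _name, getattr(beginnerlib, _name))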

Giovanni Bajo


From solipsis at pitrou.net  Wed Sep 13 22:33:22 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 13 Sep 2006 22:33:22 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <20060913092509.F933.JCARLSON@uci.edu>
References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<20060913092509.F933.JCARLSON@uci.edu>
Message-ID: <1158179602.4721.24.camel@fsol>


On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> And is generally ignored, as per unicode spec; it's a "zero width
> non-breaking space" - an invisible character with no effect on wrapping
> or otherwise.

Well it would be better if Py3K (with all strings unicode) makes things
easy for the programmer and abstracts away those "invisible characters
with no textual meaning". Currently it's not the case:

>>> a = "hello".decode("utf-8")
>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False

>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
>>> a
u'hello'
>>> b
u'\ufeffhello'
>>> print a
hello
>>> print b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>


Regards

Antoine.



From g.brandl at gmx.net  Wed Sep 13 22:45:27 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Wed, 13 Sep 2006 22:45:27 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <1158179602.4721.24.camel@fsol>
References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>	<20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol>
Message-ID: <ee9ql8$6e2$1@sea.gmane.org>

Antoine Pitrou wrote:
> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
>> And is generally ignored, as per unicode spec; it's a "zero width
>> non-breaking space" - an invisible character with no effect on wrapping
>> or otherwise.
> 
> Well it would be better if Py3K (with all strings unicode) makes things
> easy for the programmer and abstracts away those "invisible characters
> with no textual meaning". Currently it's not the case:
> 
>>>> a = "hello".decode("utf-8")
>>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>>> len(a)
> 5
>>>> len(b)
> 6
>>>> a == b
> False

This behavior is questionable...

>>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>>> len(a)
> 5
>>>> len(b)
> 6

... while this is IMHO not. UTF-16LE does not have a BOM as byte order is already
specified by the encoding. The correct example is

b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16")

b then equals u"hello", as it should.

"hello".encode("utf-16") prepends a BOM itself.

Georg


From walter at livinglogic.de  Thu Sep 14 00:05:31 2006
From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Thu, 14 Sep 2006 00:05:31 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>	<87r6yij7ea.fsf@qrnik.zagroda>	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
Message-ID: <450880AB.1040104@livinglogic.de>

Jason Orendorff wrote:
> On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
>> It is a mistake on Microsoft's part to fail to strip the BOM
>> during conversion to UTF-8.
> 
> John, you're mistaken about the reason this BOM is here.
> 
> In Notepad at least, the BOM is intentionally generated when writing
> the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
> admit the BOM was not originally invented for this purpose.)

In theory it's only metadata if external information says that it is; in 
practice it's unlikely that a charmap-encoded file begins with these 
three bytes. Nevertheless it's only a hint.

>> There is no MEANINGFUL definition of BOM in a UTF-8
>> string.
> 
> This thread is about files, not strings.  At the start of a file, a
> UTF-8 BOM is meaningful.  It means the file is UTF-8.

... and the first "character" in the file is U+FEFF. If you want the 
codec to drop the BOM on reading, use the UTF-8-Sig codec.

> [...]

Servus,
    Walter


From jcarlson at uci.edu  Thu Sep 14 01:14:29 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 13 Sep 2006 16:14:29 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <1158179602.4721.24.camel@fsol>
References: <20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol>
Message-ID: <20060913153900.F936.JCARLSON@uci.edu>


Antoine Pitrou <solipsis at pitrou.net> wrote:
> 
> 
> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> > And is generally ignored, as per unicode spec; it's a "zero width
> > non-breaking space" - an invisible character with no effect on wrapping
> > or otherwise.
> 
> Well it would be better if Py3K (with all strings unicode) makes things
> easy for the programmer and abstracts away those "invisible characters
> with no textual meaning". Currently it's not the case:

> >>> a = "hello".decode("utf-8")
> >>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
> >>> len(a)
> 5
> >>> len(b)
> 6
> >>> a == b
> False

I had also had this particular discussion with another individual
previously (but I can't seem to find it in my archive), and one point
brought up was that apparently Python 2.5 was supposed to have a variant
codec for utf-8 that automatically stripped at most one \ufeff character
from the beginning of decoded output and added it during encoding,
similar to how the generic 'utf-16' and 'utf-32' codecs add and strip:

>>> u'hello'.encode('utf-16')
'\xff\xfeh\x00e\x00l\x00l\x00o\x00'
>>> len(u'hello'.encode('utf-16').decode('utf-16'))
5
>>> 

I'm unable to find that particular utf-8 codec in the version of Python
2.5 I have installed, but I may not be looking in the right places, or
spelling it the right way.

In any case, I believe that the above behavior is correct for the
context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
'utf-16-le' decoding spellings, two of which don't strip.


> >>> a = "hello".encode("utf-16le").decode("utf-16le")
> >>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
> >>> len(a)
> 5
> >>> len(b)
> 6
> >>> a == b
> False

Georg Brandl responded to this example already.


> >>> a
> u'hello'
> >>> b
> u'\ufeffhello'
> >>> print a
> hello
> >>> print b
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

There are two answers to this particular "problem".  Either that is
expected and desirable behavior for all non-utf encodings, or all
non-utf encodings need to gain a mapping of the feff code point to the
empty string.  I think the behavior is expected and desirable.  Why?
Because none of the non-utf encodings have a valid and round-trip-able
representation for the feff code point.

Also, if you want to print possibly arbitrary unicode strings to the
console, you may consider encoding the unicode string first, offering
either 'ignore' or 'replace' as the second argument.
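
For example, with 'ignore' the unencodable BOM is simply dropped:

>>> u'\ufeffhello'.encode('iso8859-15', 'ignore')
'hello'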

 - Josiah


From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 14 01:36:50 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 14 Sep 2006 00:36:50 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>	<87r6yij7ea.fsf@qrnik.zagroda>	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
Message-ID: <45089612.4060007@blueyonder.co.uk>

Jason Orendorff wrote:
> On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
> 
>>It is a mistake on Microsoft's part to fail to strip the BOM
>>during conversion to UTF-8.
> 
> John, you're mistaken about the reason this BOM is here.
> 
> In Notepad at least, the BOM is intentionally generated when writing
> the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
> admit the BOM was not originally invented for this purpose.)
> 
>>There is no MEANINGFUL definition of BOM in a UTF-8
>>string.
> 
> This thread is about files, not strings.  At the start of a file, a
> UTF-8 BOM is meaningful.  It means the file is UTF-8.
> 
> On Windows, there's a system default encoding, and it's never UTF-8.

The Windows system encoding can be UTF-8, but only for some locales
recently added in Windows 2000/XP, where there was no compatibility
constraint to use a non-Unicode encoding.

You're correct about the use of a BOM as a signature. All Unicode-conformant
applications should accept this use of a BOM in UTF-8 (although they need
not generate it); the standard is quite clear on that.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From solipsis at pitrou.net  Thu Sep 14 08:19:00 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 14 Sep 2006 08:19:00 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <20060913153900.F936.JCARLSON@uci.edu>
References: <20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu>
Message-ID: <1158214740.5863.19.camel@fsol>


Hi,

On Wednesday 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
> In any case, I believe that the above behavior is correct for the
> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
> 'utf-16-le' decoding spellings, two of which don't strip.

Your opinion is probably valid from a theoretical point of view. You are
more knowledgeable than I am.

My point was different: most programmers are not at your level (or
Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
is supposed to be an abstracted textual type to make it easy to write
unicode-friendly applications (isn't it?).
Therefore it should hide the messy issue of superfluous BOMs, unwanted
BOMs, etc. Telling the programmer to use a specific UTF-8 variant
specialized in BOM-stripping will make eyes roll... "why doesn't the
standard UTF-8 do it for me?"

Regards

Antoine.



From walter at livinglogic.de  Thu Sep 14 09:12:21 2006
From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Thu, 14 Sep 2006 09:12:21 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <20060913153900.F936.JCARLSON@uci.edu>
References: <20060913092509.F933.JCARLSON@uci.edu>	<1158179602.4721.24.camel@fsol>
	<20060913153900.F936.JCARLSON@uci.edu>
Message-ID: <450900D5.6050606@livinglogic.de>

Josiah Carlson wrote:
> Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
>>> And is generally ignored, as per unicode spec; it's a "zero width
>>> non-breaking space" - an invisible character with no effect on wrapping
>>> or otherwise.
>> Well it would be better if Py3K (with all strings unicode) makes things
>> easy for the programmer and abstracts away those "invisible characters
>> with no textual meaning". Currently it's not the case:
> 
>>>>> a = "hello".decode("utf-8")
>>>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>>>> len(a)
>> 5
>>>>> len(b)
>> 6
>>>>> a == b
>> False
> 
> I had also had this particular discussion with another individual
> previously (but I can't seem to find it in my archive), and one point
> brought up was that apparently Python 2.5 was supposed to have a variant
> codec for utf-8 that automatically stripped at most one \ufeff character
> from the beginning of decoded output and added it during encoding,
> similar to how the generic 'utf-16' and 'utf-32' codecs add and strip:
> 
>>>> u'hello'.encode('utf-16')
> '\xff\xfeh\x00e\x00l\x00l\x00o\x00'
>>>> len(u'hello'.encode('utf-16').decode('utf-16'))
> 5
> 
> I'm unable to find that particular utf-8 codec in the version of Python
> 2.5 I have installed, but I may not be looking in the right places, or
> spelling it the right way.

It's called "utf-8-sig".
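
For example, it strips at most one BOM on decoding and prepends one on
encoding:

>>> import codecs
>>> (codecs.BOM_UTF8 + 'hello').decode('utf-8-sig')
u'hello'
>>> u'hello'.encode('utf-8-sig')
'\xef\xbb\xbfhello'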

> In any case, I believe that the above behavior is correct for the
> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
> 'utf-16-le' decoding spellings, two of which don't strip.

Servus,
    Walter


From talin at acm.org  Thu Sep 14 10:04:33 2006
From: talin at acm.org (Talin)
Date: Thu, 14 Sep 2006 01:04:33 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <1158214740.5863.19.camel@fsol>
References: <20060913092509.F933.JCARLSON@uci.edu>	<1158179602.4721.24.camel@fsol>
	<20060913153900.F936.JCARLSON@uci.edu>
	<1158214740.5863.19.camel@fsol>
Message-ID: <45090D11.3060908@acm.org>

Antoine Pitrou wrote:
> Hi,
> 
> On Wednesday 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
>> In any case, I believe that the above behavior is correct for the
>> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
>> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
>> 'utf-16-le' decoding spellings, two of which don't strip.
> 
> Your opinion is probably valid from a theoretical point of view. You are
> more knowledgeable than I am.
> 
> My point was different: most programmers are not at your level (or
> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
> is supposed to be an abstracted textual type to make it easy to write
> unicode-friendly applications (isn't it?).
> Therefore it should hide the messy issue of superfluous BOMs, unwanted
> BOMs, etc. Telling the programmer to use a specific UTF-8 variant
> specialized in BOM-stripping will make eyes roll... "why doesn't the
> standard UTF-8 do it for me?"

I've been reading this thread (and the ones that spawned it), and 
there's something about it that's been nagging at me for a while, which 
I am going to attempt to articulate.

The basic controversy centers around the various ways in which Python 
should attempt to deal with character encodings on various platforms, 
but my question is "for what use cases?" To my mind, trying to ask "how 
should we handle character encoding" without indicating what we want to 
use the characters *for* is a meaningless question.

 From the standpoint of a programmer writing code to process file 
contents, there's really no such thing as a "text file" - there are only 
various text-based file formats. There are XML files, .ini files, email 
messages and Python source code, all of which need to be processed 
differently.

So when one asks "how do I handle text files", my response is "there 
ain't no such thing" -- and when you ask "well, ok, how do I handle 
text-based file formats", my response is "well it depends on the format".

Yes, there are some operations which can operate on textual data 
regardless of file format (e.g. grep), but these generic operations are 
so basic and uninteresting that one generally doesn't need to write 
Python code to do them. And even in the case of simple unix utilities 
such as 'cat', *some* a priori knowledge of the file's encoded meaning 
is required - you can't just concatenate two XML files and get anything 
meaningful or valid. Running 'sort' on Python source code is unlikely to 
increase shareholder value or otherwise hold back the tide of entropy.

Any given Python program that I write is going to know *something* about 
the format of the files that it is supposed to read/write, and the most 
important consideration is knowledge of what kinds of other programs are 
going to produce or consume that file. If the file that I am working 
with conforms to a standard (so that the number of producer/consumer 
programs can be large without me having to know the specific details of 
each one) then I need to understand that standard and constraints of 
what is legal within it.

For files with any kind of structure in them, common practice is that we 
don't treat them as streams of characters, rather we generally have some 
abstraction layer that sits on top of the character stream and allows us 
to work with the structure directly. Thus, when dealing with XML one 
generally uses something like ElementTree, and in fact manipulating XML 
files as straight text is actively discouraged.

So my whole approach to the problem of reading and writing is to come up 
with a collection of APIs that reflect the common use patterns for the 
various popular file types. The benefit of doing this is that you don't 
waste time thinking about all of the various file operations that don't 
apply to a particular file format. For example, using the ElementTree 
interface, I don't care whether the underlying file stream supports 
seek() or not - generally one doesn't seek into the middle of an XML, so 
there's no need to support that feature. On the other hand, if one is 
reading a bdb file, one needs to seek to the location of a record in 
order to read it - but in such a case, the result of the seek operation 
is well-defined. I don't have to spend time discussing what will happen 
if I seek into the middle of an encoded multi-byte character, because 
with a bdb file, that can't happen.

It seems to me that a lot of the conundrums that have been discussed in 
this thread have to do with hypothetical use cases - 'Well, what if I 
use operation X on a file of format Y, for which the result is 
undefined?' My answer is "Don't do that."

-- Talin

From ncoghlan at gmail.com  Thu Sep 14 12:19:46 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 14 Sep 2006 20:19:46 +1000
Subject: [Python-3000] string C API
In-Reply-To: <45084966.3000608@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de>
Message-ID: <45092CC2.4070700@gmail.com>

Martin v. Löwis wrote:
> Jim Jewett schrieb:
>> Simply delegate such methods to a hidden per-encoding subclass.
>>
>> The UTF-8 methods will indeed be complex, unless the solution is
>> simply "someone called indexing/slicing/len, so I have to recode after
>> all."
>>
>> The Latin-1 encoding will have no such problem.
> 
> I'm not so much worried about UTF-8 or Latin-1; they are fairly trivial.
> Efficiency of such methods for multi-byte encodings would be
> dramatically slow.

Only the first such call on a given string, though - the idea is to use lazy 
decoding, not to avoid decoding altogether. Most manipulations (len, indexing, 
slicing, concatenation, etc) would require decoding to at least UCS-2 (or 
perhaps UCS-4).

It's applications that are just schlepping bits around that would benefit from 
the lazy decoding behaviour.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Thu Sep 14 14:44:28 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 14 Sep 2006 14:44:28 +0200
Subject: [Python-3000] string C API
In-Reply-To: <45092CC2.4070700@gmail.com> (Nick Coghlan's message of "Thu,
	14 Sep 2006 20:19:46 +1000")
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
Message-ID: <8764fq8vo3.fsf@qrnik.zagroda>

Nick Coghlan <ncoghlan at gmail.com> writes:

> Only the first such call on a given string, though - the idea
> is to use lazy decoding, not to avoid decoding altogether.
> Most manipulations (len, indexing, slicing, concatenation, etc)
> would require decoding to at least UCS-2 (or perhaps UCS-4).

Silently optimizing string recoding might change the way recoding
errors are reported. i.e. they might not be reported at all even
if the string is malformed. Optimizations which change the semantics
are bad.

I imagine only a few cases where lazy decoding would be beneficial:

1. A whole input stream is copied to an output stream which uses the
   same encoding.

   Here the application might choose to copy binary streams instead.

2. A file name, user name, or similar token is obtained from the OS
   in one place and used in another place. Especially on Unix where
   they use byte encodings (Windows prefers UTF-16).

   These cases can be optimized by other means:

   - Sometimes representing the token as a Python string can be
     avoided. For example executing an action in a different directory
     and then returning to the original directory might choose to
     represent the saved directory as a byte array.

   - Under the assumption that the system encoding is ASCII-compatible,
     calling the recoding machinery can be omitted for ASCII-only strings.
     This applies only to strings exchanged with the OS etc., not to
     stream contents which can use non-ASCII-compatible encodings.

My language implementation has only two string representations:
ISO-8859-1 and UTF-32 (the narrow representation is used for all
strings where it's possible). This is completely transparent to the
high level semantics, like the fixnum/bignum split. I'm happy with
this choice.

My text I/O buffers and recoding buffers use UTF-32 exclusively.
It would be too complicated to try to use a narrow representation
when the string is not processed as a whole. This makes the ASCII-only
optimization significant I believe (but I haven't measured it).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From solipsis at pitrou.net  Thu Sep 14 14:48:56 2006
From: solipsis at pitrou.net (Antoine)
Date: Thu, 14 Sep 2006 14:48:56 +0200 (CEST)
Subject: [Python-3000] string C API
In-Reply-To: <45092CC2.4070700@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
Message-ID: <51134.62.39.9.251.1158238136.squirrel@webmail.nerim.net>


> Only the first such call on a given string, though - the idea is to use
> lazy
> decoding, not to avoid decoding altogether. Most manipulations (len,
> indexing,
> slicing, concatenation, etc) would require decoding to at least UCS-2 (or
> perhaps UCS-4).

My two cents:

For len() you can compute the length at string construction and store it
in the string object (which is immutable). For example if the string is
constructed by concatenation then computing the resulting length should be
trivial. Even when real computation is needed, it plays nicer with the CPU
cache since the data has to be there anyway.

As for concatenation, recoding can be avoided if the strings to be
concatenated use the same internal encoding (assuming it does not hold
internal state). Given that in many cases the strings will come from
similar sources (thus use the same internal encoding), it may be an
interesting optimization.
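
A tiny sketch of the idea (purely illustrative):

    class EncodedString(object):
        def __init__(self, raw, encoding, length):
            self.raw = raw            # bytes in the internal encoding
            self.encoding = encoding
            self.length = length      # character count, computed once
        def __len__(self):
            return self.length        # O(1), no decoding needed
        def __add__(self, other):
            if other.encoding == self.encoding:
                # same internal encoding: no recoding, length is trivial
                return EncodedString(self.raw + other.raw, self.encoding,
                                     self.length + other.length)
            raise NotImplementedError("recode one operand first")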

Regards

Antoine.



From p.f.moore at gmail.com  Thu Sep 14 14:50:49 2006
From: p.f.moore at gmail.com (Paul Moore)
Date: Thu, 14 Sep 2006 13:50:49 +0100
Subject: [Python-3000] BOM handling
In-Reply-To: <45090D11.3060908@acm.org>
References: <20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu>
	<1158214740.5863.19.camel@fsol> <45090D11.3060908@acm.org>
Message-ID: <79990c6b0609140550j287792ex468ff93407a6d4ac@mail.gmail.com>

On 9/14/06, Talin <talin at acm.org> wrote:
> I've been reading this thread (and the ones that spawned it), and
> there's something about it that's been nagging at me for a while, which
> I am going to attempt to articulate.
[...]
> Any given Python program that I write is going to know *something* about
> the format of the files that it is supposed to read/write, and the most
> important consideration is knowledge of what kinds of other programs are
> going to produce or consume that file. If the file that I am working
> with conforms to a standard (so that the number of producer/consumer
> programs can be large without me having to know the specific details of
> each one) then I need to understand that standard and constraints of
> what is legal within it.

Well said!

There *is* still an issue, which is that Python needs to supply tools
to cater for naive users writing naive programs to parse/produce
ad-hoc text based file formats. For example, someone sent me this file
of data, and I want to parse it and convert it into some other format
(load it into a database, generate XML, whaterver). In my experience,
in these cases:

1. Nobody tells me the character encoding used.
2. 99.9% of the data is ASCII - so there's very little basis for guessing.
3. The whole process isn't an exact science - I *expect* to have to do
a bit of manual tidying up.

Or it's all ASCII and it *really* doesn't matter.

Those are the bulk of my use cases. For them, I'd be happy with the
"system code page" (even though Windows has two, one for console and
one for GUI, that wouldn't bother me if it was visible to me). I
wouldn't mind UTF-8, or latin-1, or anything much. It's only that 0.1%
of cases where I expect to need to check and possibly intervene, so no
problem.

On the other hand, getting an error *would* bother me. In Python 2.x,
I get no error because I don't convert to Unicode. In Python 3.x, I
fear that I might, because someone expects me to care about that 0.1%.
And no, it's not good enough for me to be able to set a global option
- that's boilerplate I'd rather do without.

Parochially y'rs
Paul.

From qrczak at knm.org.pl  Thu Sep 14 15:01:23 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 14 Sep 2006 15:01:23 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45089612.4060007@blueyonder.co.uk> (David Hopwood's message of
	"Thu, 14 Sep 2006 00:36:50 +0100")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
	<45089612.4060007@blueyonder.co.uk>
Message-ID: <871wqe8uvw.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

> You're correct about the use of a BOM as a signature. All
> Unicode-conformant applications should accept this use of a BOM in
> UTF-8 (although they need not generate it); the standard is quite
> clear on that.

When a program generates a list of filenames in a file, and I do
   xargs -i cp {} some-dir/ <filenames-file
and one file is not found because a UTF-8 BOM has been inserted before
its name, I won't blame xargs. I will blame the program which generated
the filenames. Or the language it is written in, if it didn't create
the BOM explicitly.

                          *       *       *

A tricky issue is handling filenames which can't be decoded.

I'm willing to blame myself when the list of filenames contains names
which can't be decoded using the locale encoding, because I know no
good solution to the problem of representing arbitrary Linux filenames
as Unicode strings.

Some people would blame the program or the language.

OTOH there exist libraries which believe that all filenames should be
UTF-8, irrespective of the locale. In particular Gnome used to require
setting the environment variable G_BROKEN_FILENAMES when filenames are
not UTF-8 (now G_FILENAME_ENCODING can be set). I disagree with them.

This applies to Linux. I think MacOS uses UTF-8 filenames, so the
story is different there.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From bwinton at latte.ca  Thu Sep 14 15:13:19 2006
From: bwinton at latte.ca (Blake Winton)
Date: Thu, 14 Sep 2006 09:13:19 -0400
Subject: [Python-3000] BOM handling
In-Reply-To: <45090D11.3060908@acm.org>
References: <20060913092509.F933.JCARLSON@uci.edu>	<1158179602.4721.24.camel@fsol>	<20060913153900.F936.JCARLSON@uci.edu>	<1158214740.5863.19.camel@fsol>
	<45090D11.3060908@acm.org>
Message-ID: <4509556F.4030508@latte.ca>

Talin wrote:
>> My point was different: most programmers are not at your level (or
>> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
>> is supposed to be an abstracted textual type to make it easy to write
>> unicode-friendly applications (isn't it?).
> 
> The basic controversy centers around the various ways in which Python 
> should attempt to deal with character encodings on various platforms, 
> but my question is "for what use cases?" To my mind, trying to ask "how 
> should we handle character encoding" without indicating what we want to 
> use the characters *for* is a meaningless question.

Contrary to all expectations, this thread has helped me in my day job 
already.  I'm about to start writing a program (in Python, natch) which 
will take a set of files, and perform simple token substitution on them, 
replacing tokens of the form %STUFF.format% with the value of the STUFF 
token looked up in another (XML, thus Unicode by the time it gets to me) 
file.

The files I'll be substituting in will be in various encodings, and I'll 
be creating new files which must have the same encoding.  Sadly, I don't 
know what all the encodings are.  (The Windows Resource Compiler takes 
in .rc files, but I can't find any suggestion of what encoding those 
use.  Anyone here know?)

The first version of the spec naively mentioned nothing about encodings, 
and so I raised a red flag about that, seeing that we would have 
problems, and that the right thing to do in this case isn't clear.

Um, what more data do we need for this use-case?  I'm not going to 
suggest an API, other than it would be nice if I didn't have to manually 
figure out/hard code all the encodings.  (It's my belief that I will 
currently have to do that, or at least special-case XML, to read the 
encoding attribute.)  Oh, and it would be particularly horrible if I 
output a shell script in UTF-8, and it included the BOM, since I believe 
that would break the "magic number" of "#!".

(To test it in vim, set the following options:
:set encoding=utf-8
:set bomb
)

Jennifer:~ bwinton$ xxd test
0000000: efbb bf23 2120 2f62 696e 2f62 6173 680a  ...#! /bin/bash.
0000010: 6563 686f 204a 7573 7420 7465 7374 696e  echo Just testin
0000020: 672e 2e2e 0a                             g....
Jennifer:~ bwinton$ ./test
-bash: ./test: cannot execute binary file

Jennifer:~ bwinton$ xxd test
0000000: 2321 202f 6269 6e2f 6261 7368 0a65 6368  #! /bin/bash.ech
0000010: 6f20 4a75 7374 2074 6573 7469 6e67 2e2e  o Just testing..
0000020: 2e0a                                     ..
Jennifer:~ bwinton$ ./test
Just testing...

>  From the standpoint of a programmer writing code to process file 
> contents, there's really no such thing as a "text file" - there are only 
> various text-based file formats. There are XML files, .ini files, email 
> messages and Python source code, all of which need to be processed 
> differently.

Yeah, see, at a business level, I really need to process those all in 
the same way, and it would be annoying to have to write code to handle 
them all differently.

> For files with any kind of structure in them, common practice is that we 
> don't treat them as streams of characters, rather we generally have some 
> abstraction layer that sits on top of the character stream and allows us 
> to work with the structure directly.

Your common practice, perhaps.  I find myself treating them as streams 
of characters as often as not, because I neither need nor care to 
process the structure.  Heck, even in my source code, I grep more often 
than I use the fancy "Find Usages" button (if only because PyDev in 
Eclipse doesn't let me search for all the usages of a function).

> So my whole approach to the problem of reading and writing is to come up 
> with a collection of APIs that reflect the common use patterns for the 
> various popular file types.

That sounds great.  Can you also come up with an API for the files that 
you don't consider to be in common use?  And if so, that's the one that 
everyone is going to use.  (I'm not saying that to be contrary, but 
because I honestly believe that that's what's going to happen.  If 
there's a choice between using one API for all your files, and using n 
APIs for all your files, my money is always going to be on the one. 
Maybe XML will have enough traction to make it two, but certainly no 
more than that.)

Later,
Blake.

From jcarlson at uci.edu  Thu Sep 14 18:28:39 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 09:28:39 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <4509556F.4030508@latte.ca>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
Message-ID: <20060914092020.F954.JCARLSON@uci.edu>


Blake Winton <bwinton at latte.ca> wrote:
[snip]
> Um, what more data do we need for this use-case?  I'm not going to 
> suggest an API, other than it would be nice if I didn't have to manually 
> figure out/hard code all the encodings.  (It's my belief that I will 
> currently have to do that, or at least special-case XML, to read the 
> encoding attribute.)  Oh, and it would be particularly horrible if I 
> output a shell script in UTF-8, and it included the BOM, since I believe 
> that would break the "magic number" of "#!".

Use the XML declaration <?xml ... encoding="..." ?> to discover the
encoding, and assume utf-8 otherwise, as per the spec:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl
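
A minimal sketch of pulling out the declared encoding (the regex and the
function name are just illustrative; a real version would sniff BOMs and
UTF-16 declarations first):

    import re

    _decl = re.compile(r'<\?xml[^>]*?encoding=["\']([A-Za-z][-\w.]*)["\']')

    def xml_declared_encoding(data, default='utf-8'):
        # The declaration, if present, must be at the very start of the file.
        m = _decl.match(data)
        if m:
            return m.group(1)
        return default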

Does bash natively support utf-8?  Is there a bash equivalent to Python
coding: directives?  You may be attempting to fix a problem that doesn't
exist.


> Yeah, see, at a business level, I really need to process those all in 
> the same way, and it would be annoying to have to write code to handle 
> them all differently.

So you, or anyone else, can write a module for discovering the encoding
used for a particular file based on XML tags, Python coding: directives,
etc. It could include an extensible registry, and if it is used enough,
could be included in the Python standard library.


 - Josiah


From jcarlson at uci.edu  Thu Sep 14 18:46:06 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 09:46:06 -0700
Subject: [Python-3000] string C API
In-Reply-To: <8764fq8vo3.fsf@qrnik.zagroda>
References: <45092CC2.4070700@gmail.com> <8764fq8vo3.fsf@qrnik.zagroda>
Message-ID: <20060914093036.F957.JCARLSON@uci.edu>


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> Nick Coghlan <ncoghlan at gmail.com> writes:
> 
> > Only the first such call on a given string, though - the idea
> > is to use lazy decoding, not to avoid decoding altogether.
> > Most manipulations (len, indexing, slicing, concatenation, etc)
> > would require decoding to at least UCS-2 (or perhaps UCS-4).
> 
> Silently optimizing string recoding might change the way recoding
> errors are reported. i.e. they might not be reported at all even
> if the string is malformed. Optimizations which change the semantics
> are bad.

This is not a problem.  During construction of the string, you would
either be recoding the original string to the standard 'compressed'
format, or if they had the same format, you would attempt a decoding,
and on failure, claim that the input wasn't in the encoding originally
specified.


Personally though, I'm not terribly inclined to believe that using a
'compressed' representation of utf-8 is desirable.  Why not use latin-1
when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
isn't enough?  You get a fixed-width character encoding, and aside from
the (annoying) need to write variants of each string function for each
width (macros would help here), or generic versions of each, you never
need to recode the initial string after it has been created.

Even better, with a slightly modified buffer interface, these characters
can be exposed to C extensions in a somewhat transparent manner (if
desired).
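
To illustrate just the width-selection step, a rough sketch (the
function name is invented):

    def narrowest_width(codepoints):
        # Pick the smallest fixed-width storage able to hold every code point.
        widest = max(codepoints or [0])
        if widest <= 0xFF:
            return 1    # latin-1: one byte per character
        elif widest <= 0xFFFF:
            return 2    # ucs-2: two bytes per character
        return 4        # ucs-4: four bytes per character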


 - Josiah


From bob at redivi.com  Thu Sep 14 18:47:17 2006
From: bob at redivi.com (Bob Ippolito)
Date: Thu, 14 Sep 2006 09:47:17 -0700
Subject: [Python-3000] string C API
In-Reply-To: <20060914093036.F957.JCARLSON@uci.edu>
References: <45092CC2.4070700@gmail.com> <8764fq8vo3.fsf@qrnik.zagroda>
	<20060914093036.F957.JCARLSON@uci.edu>
Message-ID: <6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> > Nick Coghlan <ncoghlan at gmail.com> writes:
> >
> > > Only the first such call on a given string, though - the idea
> > > is to use lazy decoding, not to avoid decoding altogether.
> > > Most manipulations (len, indexing, slicing, concatenation, etc)
> > > would require decoding to at least UCS-2 (or perhaps UCS-4).
> >
> > Silently optimizing string recoding might change the way recoding
> > errors are reported. i.e. they might not be reported at all even
> > if the string is malformed. Optimizations which change the semantics
> > are bad.
>
> This is not a problem.  During construction of the string, you would
> either be recoding the original string to the standard 'compressed'
> format, or if they had the same format, you would attempt a decoding,
> and on failure, claim that the input wasn't in the encoding originally
> specified.
>
>
> Personally though, I'm not terribly inclined to believe that using a
> 'compressed' representation of utf-8 is desirable.  Why not use latin-1
> when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
> isn't enough?  You get a fixed-width character encoding, and aside from
> the (annoying) need to write variants of each string function for each
> width (macros would help here), or generic versions of each, you never
> need to recode the initial string after it has been created.
>
> Even better, with a slightly modified buffer interface, these characters
> can be exposed to C extensions in a somewhat transparent manner (if
> desired).

The argument for UTF-8 is probably interop efficiency. Lots of C
libraries, file formats, and wire protocols use UTF-8 for interchange.
Verifying the validity of UTF-8 during string creation isn't that big
of a deal.

-bob

From bwinton at latte.ca  Thu Sep 14 19:56:11 2006
From: bwinton at latte.ca (Blake Winton)
Date: Thu, 14 Sep 2006 13:56:11 -0400
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914092020.F954.JCARLSON@uci.edu>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
	<20060914092020.F954.JCARLSON@uci.edu>
Message-ID: <450997BB.6020703@latte.ca>

Josiah Carlson wrote:
> Blake Winton <bwinton at latte.ca> wrote:
>> I'm not going to 
>> suggest an API, other than it would be nice if I didn't have to manually 
>> figure out/hard code all the encodings.  (It's my belief that I will 
>> currently have to do that, or at least special-case XML, to read the 
>> encoding attribute.)
> Use the XML tag/attribute "<?xml ... encoding="..." ?> to discover the
> encoding and assume utf-8 otherwise as per spec:
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl

Yeah, but now you're requiring me to read and understand the file's 
contents, which is something I (as someone who doesn't particularly care 
about all this "encoding" stuff) am trying very hard not to do.  Does 
no-one write generic text processing programs anymore?

If I were to write a program which rotated an image using PIL, I 
wouldn't have to care whether it was a png or a jpeg.  (At least, I'm 
pretty sure I wouldn't.  I haven't tried recently.)

>> Oh, and it would be particularly horrible if I 
>> output a shell script in UTF-8, and it included the BOM, since I believe 
>> that would break the "magic number" of "#!".
> Does bash natively support utf-8?

A quick Google gives me:
-------------------------
About bash utf-8:
Bash is the shell, or command language interpreter, that will appear in 
the GNU operating system. It is default shell for BeOS.

By default, GNU bash assumes that every character is one byte long and 
one column wide. It may cause several problems for all non-english BeOS 
users, especially with file names using national characters. A patch for 
bash 2.04, by Marcin 'Qrczak' Kowalczyk and Ricardas Cepas, teaches bash 
about multibyte characters in UTF-8 encoding, and fixes those problems.
Double-width characters, combining characters and bidi are not supported 
by this patch.
-------------------------
which I'm mainly posting here because of the reference to Marcin 
'Qrczak' Kowalczyk.  Small world, but I wouldn't want to paint it.

 > Is there a bash equivalent to Python coding: directives?  You may be
 > attempting to fix a problem that doesn't exist.

I don't know if the magic number stuff to determine whether a file is 
executable or not is bash-specific.  Either way, when I save the file in 
UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails.

>> Yeah, see, at a business level, I really need to process those all in 
>> the same way, and it would be annoying to have to write code to handle 
>> them all differently.
> So you, or anyone else, can write a module for discovering the encoding
> used for a particular file based on XML tags, Python coding: directives,
> etc. It could include an extensible registry, and if it is used enough,
> could be included in the Python standard library.

Okay, so what will happen for file types which aren't in the registry, 
like that Windows .rc files?

I was lying up above when I said that I don't care about this sort of 
thing.  I do care, but I also believe that I am, and should be, in the 
minority, and that if we can't ship something that will work for people 
who don't care about this stuff, then we've failed both them and Python.

Later,
Blake.

From paul at prescod.net  Thu Sep 14 20:12:10 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 14 Sep 2006 11:12:10 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914092020.F954.JCARLSON@uci.edu>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
	<20060914092020.F954.JCARLSON@uci.edu>
Message-ID: <1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com>

As a somewhat aside: for XML encoding detection:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060914/99111ff6/attachment.htm 

From jcarlson at uci.edu  Thu Sep 14 20:58:47 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 11:58:47 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <450997BB.6020703@latte.ca>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca>
Message-ID: <20060914112926.F95D.JCARLSON@uci.edu>


Blake Winton <bwinton at latte.ca> wrote:
> Josiah Carlson wrote:
> > Blake Winton <bwinton at latte.ca> wrote:
> >> I'm not going to 
> >> suggest an API, other than it would be nice if I didn't have to manually 
> >> figure out/hard code all the encodings.  (It's my belief that I will 
> >> currently have to do that, or at least special-case XML, to read the 
> >> encoding attribute.)
> > Use the XML tag/attribute "<?xml ... encoding="..." ?> to discover the
> > encoding and assume utf-8 otherwise as per spec:
> > http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl
> 
> Yeah, but now you're requiring me to read and understand the file's 
> contents, which is something I (as someone who doesn't particularly care 
> about all this "encoding" stuff) am trying very hard not to do.  Does 
> no-one write generic text processing programs anymore?

Not too long ago, "generic text processing programs" only had to deal
with one of ASCII, EBCDIC, etc., or were written specifically for text
encoded for a particular locale.  Times have changed, but the tools
really haven't.  If you want to easily deal with such things, write the
module.


> If I were to write a program which rotated an image using PIL, I 
> wouldn't have to care whether it was a png or a jpeg.  (At least, I'm 
> pretty sure I wouldn't.  I haven't tried recently.)

Right, but gif, png, jpeg, bmp, and scores of other multimedia formats
contain the equivalent of a Python coding: directive. Examine the first
dozen or so bytes of basically any kind of image, sound (not mp3s
though), or movie, and you will notice an ascii specifier for the type
of file.

By writing the registry module I described, one would be, in essence,
writing a library that understands what kind of media it has been handed,
at least as much as the equivalent of "this is a bmp" or "this is a gif".
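
(A minimal sketch of what I mean - every name here is illustrative,
not a proposed API:)

    _detectors = []

    def register(detector):
        # detector(data) -> encoding name, or None if it doesn't apply
        _detectors.append(detector)

    def guess_encoding(data, default='latin-1'):
        for detector in _detectors:
            enc = detector(data)
            if enc:
                return enc
        return default

    def _bom_detector(data):
        # the same trick the image formats use: look at leading bytes
        if data.startswith(b'\xef\xbb\xbf'):
            return 'utf-8'
        if data[:2] in (b'\xff\xfe', b'\xfe\xff'):
            return 'utf-16'
        return None

    register(_bom_detector)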

>  > Is there a bash equivalent to Python coding: directives?  You may be
>  > attempting to fix a problem that doesn't exist.
> 
> I don't know if the magic number stuff to determine whether a file is 
> executable or not is bash-specific.  Either way, when I save the file in 
> UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails.

So don't save it with a BOM and add a Python coding: directive to the
second line.  Python and bash comments just happen to have the same #
delimiter, and if your editor doesn't suck, then it should understand
such a directive.  With luck, your editor should also allow for the
non-writing of the BOM on utf-8 save (given certain conditions).  If not,
contact the author(s) and request that feature.


> > So you, or anyone else, can write a module for discovering the encoding
> > used for a particular file based on XML tags, Python coding: directives,
> > etc. It could include an extensible registry, and if it is used enough,
> > could be included in the Python standard library.
> 
> Okay, so what will happen for file types which aren't in the registry, 
> like those Windows .rc files?

I'm not writing the encoding registry, but if I were, and if no known
encoding was found, I'd claim latin-1, if only because it 'succeeds'
when decoding character values 128-255.
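
(Concretely - and this is just a demonstration, not a proposal - latin-1
maps every byte 0-255 to the code point of the same value, so decoding
can never raise:)

    data = bytes(range(256))
    text = data.decode('latin-1')                  # never raises
    assert [ord(c) for c in text] == list(data)    # byte value == code point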

> I was lying up above when I said that I don't care about this sort of 
> thing.  I do care, but I also believe that I am, and should be, in the 
> minority, and that if we can't ship something that will work for people 
> who don't care about this stuff, then we've failed both them and Python.

Indeed, which is why people who do care should write a registry so that
their users don't need to care.

 - Josiah


From p.f.moore at gmail.com  Thu Sep 14 22:15:34 2006
From: p.f.moore at gmail.com (Paul Moore)
Date: Thu, 14 Sep 2006 21:15:34 +0100
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914112926.F95D.JCARLSON@uci.edu>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca>
	<20060914112926.F95D.JCARLSON@uci.edu>
Message-ID: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> So don't save it with a BOM and add a Python coding: directive to the
> second line.  Python and bash comments just happen to have the same #
> delimiter, and if your editor doesn't suck, then it should understand
> such a directive.

However, vim and emacs use *different* coding directive formats.
Python understands both, but (AFAIK) they don't understand each
other's. So which editor sucks? :-) :-) :-) (3 smileys is a
get-out-of-flamewar-free card :-))

I'm not trying to contradict you - just pointing out that the world
isn't as perfect as people here seem to want it to be.

Paul.

From jcarlson at uci.edu  Thu Sep 14 22:19:03 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 13:19:03 -0700
Subject: [Python-3000] string C API
In-Reply-To: <6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>
References: <20060914093036.F957.JCARLSON@uci.edu>
	<6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>
Message-ID: <20060914104921.F95A.JCARLSON@uci.edu>


"Bob Ippolito" <bob at redivi.com> wrote:
> The argument for UTF-8 is probably interop efficiency. Lots of C
> libraries, file formats, and wire protocols use UTF-8 for interchange.
> Verifying the validity of UTF-8 during string creation isn't that big
> of a deal.

Indeed, UTF-8 validation/creation isn't a big deal.  But that wasn't my
concern.  My concern was Python-only operation efficiency, for which a
fixed-length-per-character encoding generally wins (at least for
operations involving two strings with the same internal encoding).


 - Josiah


From bob at redivi.com  Thu Sep 14 22:34:38 2006
From: bob at redivi.com (Bob Ippolito)
Date: Thu, 14 Sep 2006 13:34:38 -0700
Subject: [Python-3000] string C API
In-Reply-To: <20060914104921.F95A.JCARLSON@uci.edu>
References: <20060914093036.F957.JCARLSON@uci.edu>
	<6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>
	<20060914104921.F95A.JCARLSON@uci.edu>
Message-ID: <6a36e7290609141334x344cf42fpa561275c123c290b@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Bob Ippolito" <bob at redivi.com> wrote:
> > The argument for UTF-8 is probably interop efficiency. Lots of C
> > libraries, file formats, and wire protocols use UTF-8 for interchange.
> > Verifying the validity of UTF-8 during string creation isn't that big
> > of a deal.
>
> Indeed, UTF-8 validation/creation isn't a big deal.  But that wasn't my
> concern.  My concern was Python-only operation efficiency, for which a
> fixed-length-per-character encoding generally wins (at least for
> operations involving two strings with the same internal encoding).

If you need to know the number of characters often you can calculate
that when the string's contents are validated. Slice ops may become
slower though... but versus UCS-4 the memory and memory bandwidth
savings might actually be a net performance win overall for many
applications.
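
(For instance - a sketch - the count can fall out of the same pass that
validates:)

    def utf8_char_count(data):
        # code points = bytes that are not continuation bytes (10xxxxxx)
        return sum(1 for byte in data if byte & 0xC0 != 0x80)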

-bob

From jason.orendorff at gmail.com  Thu Sep 14 22:53:58 2006
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Thu, 14 Sep 2006 16:53:58 -0400
Subject: [Python-3000] BOM handling
In-Reply-To: <1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
	<20060914092020.F954.JCARLSON@uci.edu>
	<1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com>
Message-ID: <bb8868b90609141353u3eb3846pb3f2726e41140705@mail.gmail.com>

For what it's worth:  in .NET, everything defaults to UTF-8, whether
reading or writing.  No BOM is generated when creating a new file.
  http://msdn2.microsoft.com/en-us/library/system.io.file.createtext.aspx

Java defaults to a "default character encoding", which on Windows is
the system's ANSI encoding.
  http://java.sun.com/j2se/1.4.2/docs/api/java/io/OutputStreamWriter.html

Neither correctly reads the other's output.  Pick your poison.

-j

From martin at v.loewis.de  Thu Sep 14 23:34:34 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 14 Sep 2006 23:34:34 +0200
Subject: [Python-3000] string C API
In-Reply-To: <45092CC2.4070700@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
Message-ID: <4509CAEA.3040108@v.loewis.de>

Nick Coghlan schrieb:
> Only the first such call on a given string, though - the idea is to use
> lazy decoding, not to avoid decoding altogether. Most manipulations
> (len, indexing, slicing, concatenation, etc) would require decoding to
> at least UCS-2 (or perhaps UCS-4).

Ok. Then my objection is this: What about errors that occur in decoding?
What happens if the bytes are not meaningful in the presumed encoding?

ISTM that raising the exception lazily (which seems to be necessary)
would be very confusing.

Regards,
Martin

From 2006 at jmunch.dk  Fri Sep 15 01:05:28 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Fri, 15 Sep 2006 01:05:28 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <20060913084256.F930.JCARLSON@uci.edu>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
	<20060913084256.F930.JCARLSON@uci.edu>
Message-ID: <4509E038.2030808@jmunch.dk>

Josiah Carlson wrote:
 > Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless
 > they want to seek to the end of the file for later writing or expected
 > reading of data yet-to-be-written.

os.fstat(f.fileno()).st_size doesn't work for file-like objects.
Goodbye unit testing with StringIOs.  f.seek(0,2);f.tell() is faster,
too.  I think the lunatics have a point.
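
(A sketch of the idiom as a helper - the name is mine, not iostack's:)

    def stream_size(f):
        # works for any seekable file-like object, StringIO included
        pos = f.tell()
        f.seek(0, 2)        # 2 == os.SEEK_END
        size = f.tell()
        f.seek(pos)         # restore the original position
        return size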

 > You were also talking about buffering writes to reduce the overhead of
 > the underlying seeks and tells because of apparent "optimizations" you
 > wanted to make. Here is a data integrity optimization you can make for
 > me: flush when accessing the file non-sequentially, any other behavior
 > could corrupt the data of users who have been relying on "seek implies
 > flush".

Again, that's what explicit calls to flush are for.  And you can't
violate expectations as to what the seek method does, when there's no
seek method and no concept of a file pointer.
Sprinkling extra flushes out here and there does not help data
integrity: only a flush that is part of a well-thought-out plan to
recover partially written data in case of a crash will help you do
that.  Anything less, and you're just a power failure and a disk that
reorders writes away from unrecoverable corruption.

My class consolidates writes, but doesn't reorder them.  That means
that to the extent that the system call for writing is transactional,
writes are not reordered.  I put the code up at
http://pastecode.com/4818.  As is, extending and truncating have bugs.

If you really want it, it's three lines changed to disable buffering
for non-sequential writes.  And an equivalent class completely without
buffering is pretty trivial.

 > With that said, I'm not sure your FileBytes object is really necessary
 > or desired for the future io library.  If people want that kind of an
 > interface, they can use mmap (and push for the various mmap bugs/feature
 > requests to be fixed), otherwise they should be using readable /
 > writable / both streams, something that Tomer has been working towards.

mmap has limitations that cannot be fixed.  It takes up virtual
memory, limiting the size of files you can work with.  You need to
specify the size in advance (note the potential race condition in
f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).  To what extent does it
work over networked file systems?  If you map a file on a file system
that is subsequently unmounted, a core dump may be the result.  All
this assuming the operating system supports mmap at all.

mmap is for use where speed is paramount, and pretty much only then.
The reason people don't use sequence-based file interfaces as much is
that robust, portable, practical sequence-based file interfaces aren't
available.  Probably most people who would have liked a sequence
interface do what I do: slurp up the whole file in one read and deal
with the string.  Or use mmap and live with the fragility.

- Anders


From murman at gmail.com  Fri Sep 15 01:30:09 2006
From: murman at gmail.com (Michael Urman)
Date: Thu, 14 Sep 2006 18:30:09 -0500
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914112926.F95D.JCARLSON@uci.edu>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca>
	<20060914112926.F95D.JCARLSON@uci.edu>
Message-ID: <dcbbbb410609141630m6aa946a8q55b6d339a5e71003@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> With luck, your editor should also allow for the
> non-writing of the BOM on utf-8 save (given certain conditions).  If not,
> contact the author(s) and request that feature.

And hope they didn't write it in a language that doesn't let them
control when to use a BOM.

-- 
Michael Urman  http://www.tortall.net/mu/blog

From jcarlson at uci.edu  Fri Sep 15 02:02:01 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 17:02:01 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
References: <20060914112926.F95D.JCARLSON@uci.edu>
	<79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
Message-ID: <20060914153932.F967.JCARLSON@uci.edu>


"Paul Moore" <p.f.moore at gmail.com> wrote:
> On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > So don't save it with a BOM and add a Python coding: directive to the
> > second line.  Python and bash comments just happen to have the same #
> > delimiter, and if your editor doesn't suck, then it should understand
> > such a directive.
> 
> However, vim and emacs use *different* coding directive formats.
> Python understands both, but (AFAIK) they don't understand each
> other's. So which editor sucks? :-) :-) :-) (3 smileys is a
> get-out-of-flamewar-free card :-))

Single users will be choosing a single tool.  Multiple users will likely
use a source repository.  Good source repositories will allow for pre or
post processing.  Or heck, I'm sure that Emacs or Vim can even be
tweaked to understand the other's encoding declarations.  If not, there
are more than a dozen source editors that support both, and even some
that offer the features I describe.


 - Josiah


From david.nospam.hopwood at blueyonder.co.uk  Fri Sep 15 02:00:19 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 15 Sep 2006 01:00:19 +0100
Subject: [Python-3000] BOM handling
In-Reply-To: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
References: <20060914092020.F954.JCARLSON@uci.edu>
	<450997BB.6020703@latte.ca>	<20060914112926.F95D.JCARLSON@uci.edu>
	<79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
Message-ID: <4509ED13.9040409@blueyonder.co.uk>

Paul Moore wrote:
> On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> 
>>So don't save it with a BOM and add a Python coding: directive to the
>>second line.  Python and bash comments just happen to have the same #
>>delimiter, and if your editor doesn't suck, then it should understand
>>such a directive.
> 
> However, vim and emacs use *different* coding directive formats.
> Python understands both, but (AFAIK) they don't understand each
> other's. So which editor sucks?

Both, obviously. It would not have been beyond the wit of those editor
developers to talk to each other, or to just unilaterally support the
other editor's format as well as their own.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From greg.ewing at canterbury.ac.nz  Fri Sep 15 03:32:00 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 15 Sep 2006 13:32:00 +1200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4509E038.2030808@jmunch.dk>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
	<20060913084256.F930.JCARLSON@uci.edu> <4509E038.2030808@jmunch.dk>
Message-ID: <450A0290.9080204@canterbury.ac.nz>

Anders J. Munch wrote:
> (note the potential race condition in
> f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).

Not sure anything could be done about that. Even if
there were an mmap-this-file-however-big-it-is call,
the size of the file could still change *after*
you'd mapped it.

--
Greg

From jcarlson at uci.edu  Fri Sep 15 05:01:39 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 20:01:39 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4509E038.2030808@jmunch.dk>
References: <20060913084256.F930.JCARLSON@uci.edu> <4509E038.2030808@jmunch.dk>
Message-ID: <20060914191243.F972.JCARLSON@uci.edu>


"Anders J. Munch" <2006 at jmunch.dk> wrote:
> Josiah Carlson wrote:
>  > You were also talking about buffering writes to reduce the overhead of
>  > the underlying seeks and tells because of apparent "optimizations" you
>  > wanted to make. Here is a data integrity optimization you can make for
>  > me: flush when accessing the file non-sequentially, any other behavior
>  > could corrupt the data of users who have been relying on "seek implies
>  > flush".
> 
> Again, that's what explicit calls to flush are for.  And you can't
> violate expectations as to what the seek method does, when there's no
> seek method and no concept of a file pointer.

People who have experience using Python 2.x file objects and/or
underlying platform file handles may have come to expect "seek implies
flush".  Since you claim that offering an unbuffered version is easy,
I'll pretend that such would be offered to the user as an option.

> Sprinkling extra flushes out here and there does not help data
> integrity: only a flush that is part of a well-thought-out plan to
> recover partially written data in case of a crash will help you do
> that.  Anything less, and you're just a power failure and a disk that
> reorders writes away from unrecoverable corruption.

Indeed, whether or not extra flushes help data integrity depends on the
file structure.  But for those who have the know-how to properly deal
with recovery of structured data files post power outage, not flushing
due to optimization is a larger sin than actively flushing - as data may
very well have a better chance to get to disk when you are flushing more
often.


>  > With that said, I'm not sure your FileBytes object is really necessary
>  > or desired for the future io library.  If people want that kind of an
>  > interface, they can use mmap (and push for the various mmap bugs/feature
>  > requests to be fixed), otherwise they should be using readable /
>  > writable / both streams, something that Tomer has been working towards.
> 
> mmap has limitations that cannot be fixed.  It takes up virtual
> memory, limiting the size of files you can work with.  You need to
> specify the size in advance (note the potential race condition in
> f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).  To what extent does it
> work over networked file systems?  If you map a file on a file system
> that is subsequently unmounted, a core dump may be the result.  All
> this assuming the operating system supports mmap at all.

Some of your concerns can be addressed with mmap plus a starting offset
and a length parameter of -1 (map through the end of the file).  This
results in being able to map arbitrary
portions of the file, as well as a Python-level race-free construction
of an mmap.  Then the FileBytes interface essentially becomes...

import mmap

class FileBytes(object):
    def __init__(self, fname, mode='r+b'):
        self.f = open(fname, mode)
    def __getitem__(self, key):
        start, stop = self._parseposition(key)
        # map just the requested region (real mmaps require the offset
        # to be aligned to mmap.ALLOCATIONGRANULARITY)
        return mmap.mmap(self.f.fileno(), stop - start, offset=start)
    def __setitem__(self, key, value):
        # write through a fresh mapping of the same region
        self[key][:] = value
    #_parseposition as you specify

With a non-broken platform mmap implementation, multiple identical calls
to __getitem__ will return identical data pointers, or at least the
underlying OS will make sure that the two pointers actually point to the
same physical memory region.


NFS issues are a pain.  This and the non-support of mmaps on smaller or
less developed platforms may be the only situations where not using
mmaps could offer superior failure conditions.


> mmap is for use where speed is paramount, and pretty much only then.
> The reason people don't use sequence-based file interfaces as much is
> that robust, portable, practical sequence-based file interfaces aren't
> available.  Probably most people who would have liked a sequence
> interface do what I do: slurp up the whole file in one read and deal
> with the string.  Or use mmap and live with the fragility.

I've found the opposite to be true.  Every time where I've wanted a
sequence-based file interface, I use an mmap: because it is faster and far
more reliable for all use-cases I've been confronted with (if your
process crashes, all of your writes are flushed).  But I suppose I spend
time with 512M and 1G mmaps, for which constant slicing of strings
and/or a file-based interface is about 100 times too slow (and useless
when a C extension wants to write to the file - mmaps do this for free).


 - Josiah


From ncoghlan at gmail.com  Fri Sep 15 15:29:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 15 Sep 2006 23:29:58 +1000
Subject: [Python-3000] string C API
In-Reply-To: <4509CAEA.3040108@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de>
Message-ID: <450AAAD6.5030901@gmail.com>

Martin v. Löwis wrote:
> Nick Coghlan schrieb:
>> Only the first such call on a given string, though - the idea is to use
>> lazy decoding, not to avoid decoding altogether. Most manipulations
>> (len, indexing, slicing, concatenation, etc) would require decoding to
>> at least UCS-2 (or perhaps UCS-4).
> 
> Ok. Then my objection is this: What about errors that occur in decoding?
> What happens if the bytes are not meaningful in the presumed encoding?
> 
> ISTM that raising the exception lazily (which seems to be necessary)
> would be very confusing.

Yeah, it appears it would be necessary to at least *scan* the string when it 
was first created in order to ensure it can be decoded without errors later on.

I also realised there is another issue with an internal representation that 
can change over the life of a string, which is that of thread-safety.

Since strings don't currently have any mutable internal state, it's possible 
to freely share them between threads (without this property, the interning 
behaviour would be doomed).

If strings could change the encoding of their internal buffers then they'd 
have to use a read/write lock internally on all operations that might be 
affected when the internal representation changes. Blech.

Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as 
internal representations, and choosing which one to use when the string is 
created.
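
(A sketch of the creation-time choice - the function name is
illustrative:)

    def minimal_width(codepoints):
        # 1 -> latin-1, 2 -> UCS-2, 4 -> UCS-4
        widest = max(codepoints, default=0)
        if widest <= 0xFF:
            return 1
        if widest <= 0xFFFF:
            return 2
        return 4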

Sure certain applications that are just copying from one data stream to 
another (both in the same encoding) may needlessly decode and then re-encode 
the data, but if the application *knows* that this might happen (and has 
reason to care about optimising the performance of this case), then the 
application is free to decouple the "reading" and "decoding" steps, and just 
transfer raw bytes between the streams.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Fri Sep 15 16:25:08 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 10:25:08 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450AAAD6.5030901@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
Message-ID: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>

On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Martin v. Löwis wrote:
> > Nick Coghlan schrieb:
> >> Only the first such call on a given string, though - the idea is to use
> >> lazy decoding, not to avoid decoding altogether. Most manipulations
> >> (len, indexing, slicing, concatenation, etc) would require decoding to
> >> at least UCS-2 (or perhaps UCS-4).

Or other workarounds.

> > Ok. Then my objection is this: What about errors that occur in decoding?
> > What happens if the bytes are not meaningful in the presumed encoding?

> > ISTM that raising the exception lazily (which seems to be necessary)
> > would be very confusing.

> Yeah, it appears it would be necessary to at least *scan* the string when it
> was first created in order to ensure it can be decoded without errors later on.

What happens today with strings?  I think the answer is:
     "Nothing.
      They print something odd when printed.
      They may raise errors when explicitly recoded to unicode."
Why is this a problem?

I see nothing wrong with an explicit .validate() method.

I see nothing wrong with a program choosing to recode everything into
a known encoding, which would validate as a side-effect.  This would
be the moral equivalent of today's unicode() call.

I'm not so happy about the efficiency implication of the idea that
*all* strings *must* be validated (let alone recoded).

> I also realised there is another issue with an internal representation that
> can change over the life of a string, which is that of thread-safety.

> Since strings don't currently have any mutable internal state, it's possible
> to freely share them between threads (without this property, the interning
> behaviour would be doomed).

Interning may get awkward if multiple encodings are allowed within a
program, regardless of whether they're allowed for single strings.  It
might make sense to intern only strings that are in the same encoding
as the source code.  (Or whose values are limited to ASCII?)

> If strings could change the encoding of their internal buffers then they'd
> have to use a read/write lock internally on all operations that might be
> affected when the internal representation changes. Blech.

Why?

There should be only one reference to a string until it is constructed,
and after that, its data should be immutable.  Recoding that results
in different bytes should not be in-place.  Either it returns a new
string (no problem) or it doesn't change the databuffer-and-encoding
pointer until the new databuffer is fully constructed.

Anything keeping its own reference to the old databuffer (and old
encoding) will continue to work, so immutability ==> the two buffers
really are equivalent.

> Sure certain applications that are just copying from one data stream to
> another (both in the same encoding) may needlessly decode and then re-encode
> the data,

Other than text editors, "certain" includes almost any application I
have ever used, let alone written.

> but if the application *knows* that this might happen (and has
> reason to care about optimising the performance of this case), then the
> application is free to decouple the "reading" and "decoding" steps, and just
> transfer raw bytes between the streams.

So adding boilerplate to treat text as bytes "for efficiency" may
become a standard recipe?  Not so good.

-jJ

From ncoghlan at gmail.com  Fri Sep 15 17:15:27 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 01:15:27 +1000
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>	
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>	
	<450820A3.4000302@v.loewis.de>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>	
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>	
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
Message-ID: <450AC38F.4080005@gmail.com>

Jim Jewett wrote:
>> > ISTM that raising the exception lazily (which seems to be necessary)
>> > would be very confusing.
> 
>> Yeah, it appears it would be necessary to at least *scan* the string 
>> when it
>> was first created in order to ensure it can be decoded without errors 
>> later on.
> 
> What happens today with strings?  I think the answer is:
>     "Nothing.
>      They print something odd when printed.
>      They may raise errors when explicitly recoded to unicode."
> Why is this a problem?

We don't have 8-bit strings lying around in Py3k. To convert bytes to 
characters, they *must* be converted to unicode code points.

> I'm not so happy about the efficiency implication of the idea that
> *all* strings *must* be validated (let alone recoded).

Then always define latin-1 as the source encoding for your files - it will 
just pass the bytes straight through.
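
(That pass-through is lossless both ways, e.g.:)

    raw = bytes(range(256))
    assert raw.decode('latin-1').encode('latin-1') == raw   # round trip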

>> Since strings don't currently have any mutable internal state, it's 
>> possible
>> to freely share them between threads (without this property, the 
>> interning
>> behaviour would be doomed).
> 
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings.  It
> might make sense to intern only strings that are in the same encoding
> as the source code.  (Or whose values are limited to ASCII?)

Unicode strings don't have an encoding - they only store code points.

>> If strings could change the encoding of their internal buffers then 
>> they'd
>> have to use a read/write lock internally on all operations that might be
>> affected when the internal representation changes. Blech.
> 
> Why?
> 
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.
> 
> Anything keeping its own reference to the old databuffer (and old
> encoding) will continue to work, so immutability ==> the two buffers
> really are equivalent.

I admit that by using a separate Python object for the data buffer instead of 
a pointer to raw memory, the incref/decref in the processing code becomes the 
moral equivalent of a read lock, but consider the case where Thread A performs 
an operation and decides "I need to recode the buffer to UCS-4" at the same 
time that Thread B performs an operation and decides "I need to recode the 
buffer to UCS-4".

To deal with that you would still want to be very careful with the incref 
new/reassign/decref old step for switching in the new data buffer (probably
by using some form of atomic reassignment operation).

And this style has some very serious overhead implications, as each string 
would now require:
   The string object, with a 32 or 64 bit pointer to the data buffer object
   The data buffer object

String memory overhead would double, with an additional 32 or 64 bits 
depending on platform. This is a pretty significant increase when it comes to 
identifier-length strings.

So still blech, even if you make the data buffer a separate Python object to 
avoid the need for an actual read/write lock.

>> Sure certain applications that are just copying from one data stream to
>> another (both in the same encoding) may needlessly decode and then 
>> re-encode
>> the data,
> 
> Other than text editors, "certain" includes almost any application I
> have ever used, let alone written.

If you're reading text and you *know* it is ASCII data, then you can just set 
the encoding to latin-1 (since that can just copy the original bytes to the 
string's internal buffer - the actual ascii codec needs to check each byte to 
see whether or not the high bit is set, so it would be slower, and blow up
with a UnicodeDecodeError if the high bit was ever set).
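
(Spelled out, the per-byte check the ascii codec can't avoid - a
sketch, not the codec's actual code:)

    def ascii_ok(data):
        # ascii must reject any byte with the high bit set;
        # latin-1 can skip this scan entirely
        return all(byte < 0x80 for byte in data)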

I suspect an awful lot of quick-and-dirty scripts written by native English 
speakers will do exactly that.

>> but if the application *knows* that this might happen (and has
>> reason to care about optimising the performance of this case), then the
>> application is free to decouple the "reading" and "decoding" steps, 
>> and just
>> transfer raw bytes between the streams.
> 
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

No, the standard recipe becomes "handle bytes as bytes and text as 
characters". If you know your source data is 8-bit text (or are happy to treat 
it that way, even if it isn't), then use the latin-1 codec to decode the 
original bytes directly to 8-bit characters.

Or just open the file in binary and read the data in as bytes instead of 
characters.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jason.orendorff at gmail.com  Fri Sep 15 18:22:30 2006
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Fri, 15 Sep 2006 12:22:30 -0400
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
Message-ID: <bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>

On 9/15/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.

Yes, but then having, say, a Latin-1 string, and repeatedly using it
in places where UTF-16 is needed, causes you to repeat the decoding
operation.  The optimization becomes a pessimization.

Here I'm imagining things like taking len(s) of a UTF-8 string, or
s==u where u happens to be UTF-16.  You only have to do this once or
twice per string to start losing.

Also, having two different classes of strings means fewer felicitous
cases of x==y, where the result is True, being just a pointer
comparison.  This might matter in dictionaries: imagine a dictionary
created as a literal and then used to look up key strings read from a
file.

> [Nick Coghlan wrote:]
> > [...] the
> > application is free to decouple the "reading" and "decoding" steps, and just
> > transfer raw bytes between the streams.
>
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

I'm sure this will happen to the same degree that it's become a
standard recipe in Java and C# (both of which lack polymorphic
whatzits).  Which is to say, not at all.

-j

From paul at prescod.net  Fri Sep 15 18:33:49 2006
From: paul at prescod.net (Paul Prescod)
Date: Fri, 15 Sep 2006 09:33:49 -0700
Subject: [Python-3000] string C API
In-Reply-To: <bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
Message-ID: <1cb725390609150933q43b444f5ne788e9d222a5dcd1@mail.gmail.com>

On 9/15/06, Jason Orendorff <jason.orendorff at gmail.com> wrote:
>
> I'm sure this will happen to the same degree that it's become a
> standard recipe in Java and C# (both of which lack polymorphic
> whatzits).  Which is to say, not at all.


I think Jason's point is key. This is probably premature optimization and
should not be done if it will complicate the Python user's experience at all
(e.g. by delaying exceptions). Polymorphism is interesting to me primarily
to support 4-byte characters and therefore go beyond Java and C# in
functionality without slowing everything else down. If we gain some speed on
them for 8-bit strings, that would be a nice bonus.

But delaying UTF-8 decoding has not proven necessary for good performance in
the other Unicode-based languages. It just seems like extra complexity for
little benefit.

 Paul Prescod

From jimjjewett at gmail.com  Fri Sep 15 19:04:08 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 13:04:08 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450AC38F.4080005@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<450AC38F.4080005@gmail.com>
Message-ID: <fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>

On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Jim Jewett wrote:

> >> ... would be necessary to at least *scan* the string when it
> >> was first created in order to ensure it can be decoded without errors

> > What happens today with strings?  I think the answer is:
> >     "Nothing.
> >      They print something odd when printed.
> >      They may raise errors when explicitly recoded to unicode."
> > Why is this a problem?

> We don't have 8-bit strings lying around in Py3k.

Right.  But we do in Py 2.x, and the equivalent delayed errors have
not been a serious problem.  I suppose that might change if everyone
were actually using unicode, so that more stuff got converted
eventually.  On the other hand, I'm not sure how many strings will
*ever* need recoding, if we don't do it on construction.

> To convert bytes to
> characters, they *must* be converted to unicode code points.

A "code point" doesn't exist in actual code; it has to be represented
by some concrete encoding.  The most common encodings are UTF-8
and the various UTF-16 and UTF-32 forms, but they are still concrete
encodings, rather than the "real" code point.  A bytestream in latin-1
(with meta-knowledge that it is in latin-1) represents the abstract
code points just as much as a bytestream in UTF-8 would.  For some
purposes (including error detection) it is less efficient, but it is
just as valid.

> > I'm not so happy about the efficiency implication of the idea that
> > *all* strings *must* be validated (let alone recoded).

> Then always define latin-1 as the source encoding for your files - it will
> just pass the bytes straight through.

That would work for skipping validation.  It won't work if Python
insists on recoding everything to an internally privileged encoding.

> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)

> Unicode strings don't have an encoding - they only store code points.

But these code points are stored somehow.  In Py 2.x, the decision was
to always use a specific privileged encoding, and to choose that
encoding at compile time.  This decision was not required by unicode;
it was chosen for implementation reasons.

> I admit that by using a separate Python object for the data buffer instead of
> a pointer to raw memory, the incref/decref in the processing code becomes the
> moral equivalent of a read lock, but consider the case where Thread A performs
> an operation and decides "I need to recode the buffer to UCS-4" at the same
> time that Thread B performs an operation and decides "I need to recode the
> buffer to UCS-4".

Then you end up doing it twice, and wasting even more space.   I
expect "never need to change the encoding" will be far more common
than
        (1)  Application is multithreaded
and     (2)  Multiple threads happen to be using the same string
and     (3)  Multiple threads need to recode it to the same new
encoding at the same time
and     (4)  This recoding need was in some way conditional, so the
programmer felt it was sensible to request it both places, instead of
just recoding once on creation.

> And this style has some very serious overhead implications, as each string
> would now require:
>    The string object, with a 32 or 64 bit pointer to the data buffer object
>    The data buffer object

> String memory overhead would double, with an additional 32 or 64 bits
> depending on platform. This is a pretty significant increase when it comes to
> identifier-length strings.

dicts already have to deal with this.  The workaround there was to
have a smalltable fastened to the dict, and to waste that smalltable
if the dictionary grows too large.  strings could do something
similar.  (Either all strings, keeping the original encoding, or just
small strings, so that not too much will ever be wasted.)

> >> Sure certain applications that are just copying from one data stream to
> >> another (both in the same encoding) may needlessly decode and then
> >> re-encode the data,

> > Other than text editors, "certain" includes almost any application I
> > have ever used, let alone written.

> If you're reading text and you *know* it is ASCII data, then you can just set
> the encoding to latin-1

Only if latin-1 is a valid encoding for the internal implementation.
If it is, then python does have to allow multiple internal
implementations, and some way of marking which was used.  (Obviously,
I think this is the right answer, but this is a change from 2.x, and
would require some changes to the C API.)

-jJ

From jcarlson at uci.edu  Fri Sep 15 19:46:52 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 10:46:52 -0700
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
References: <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
Message-ID: <20060915102433.F980.JCARLSON@uci.edu>


"Jim Jewett" <jimjjewett at gmail.com> wrote:
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings.  It
> might make sense to intern only strings that are in the same encoding
> as the source code.  (Or whose values are limited to ASCII?)

Why?  If the text hash function is defined on *code points*, then
interning, or really any arbitrary dictionary lookup is the same as it
has always been.


> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.

What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
method I previously described is that each of the encodings offers a
minimal representation of the code points that the text object contains. 
Certain operations would require a bit of work to handle the comparison
of code points stored in an x-bit-wide representation with code points
stored in a y-bit-wide representation.


> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

Presumably there is going to be a mechanism to open files as bytes
(reads return bytes), and for things like web servers, file servers, etc.,
serving the content up as just a bunch of bytes is really the only thing
that makes sense.

 - Josiah


From jcarlson at uci.edu  Fri Sep 15 19:48:06 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 10:48:06 -0700
Subject: [Python-3000] string C API
In-Reply-To: <bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
Message-ID: <20060915102555.F983.JCARLSON@uci.edu>


"Jason Orendorff" <jason.orendorff at gmail.com> wrote:
> 
> On 9/15/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> > There should be only one reference to a string until it is constructed,
> > and after that, its data should be immutable.  Recoding that results
> > in different bytes should not be in-place.  Either it returns a new
> > string (no problem) or it doesn't change the databuffer-and-encoding
> > pointer until the new databuffer is fully constructed.
> 
> Yes, but then having, say, a Latin-1 string, and repeatedly using it
> in places where UTF-16 is needed, causes you to repeat the decoding
> operation.  The optimization becomes a pessimization.
> 
> Here I'm imagining things like taking len(s) of a UTF-8 string, or
> s==u where u happens to be UTF-16.  You only have to do this once or
> twice per string to start losing.

This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:

If I have a text object X whose internal representation is in UCS-2, and
I have a another text object Y whose internal representation is in UCS-4,
then I know X != Y.  Why?  Because X and Y were created with the minimal
width necessary to support the code points they contain. Because Y must
have a code point that X doesn't have, then X != Y.

When one wants to do things like Y.startswith(X), then you actually
compare the code points.
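
(A sketch of that shortcut; the .width and .codepoints attributes are
hypothetical:)

    def text_equal(x, y):
        if x.width != y.width:          # minimal widths differ =>
            return False                # contents must differ
        return x.codepoints == y.codepoints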


 - Josiah


From solipsis at pitrou.net  Fri Sep 15 20:04:33 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 15 Sep 2006 20:04:33 +0200
Subject: [Python-3000] string C API
In-Reply-To: <20060915102555.F983.JCARLSON@uci.edu>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu>
Message-ID: <1158343473.4292.14.camel@fsol>

On Friday, 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:

You could replace "latin-1" with "one-byte system encoding chosen at
interpreter startup depending on locale".
There are lots of 8-bit encodings other than iso-8859-1.
(for example, my current locale uses iso-8859-15)

The algorithm for choosing the one-byte encoding could be:
- if the current locale uses a one-byte encoding, use that encoding
- otherwise, if the current locale's language has a popular one-byte encoding
(for many languages this would mean iso-8859-<X>), use that encoding
- otherwise, no one-byte encoding

This would ensure that, for example, Russian text on a system configured
with a Russian locale does not always end up using two bytes per
character internally.
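
(A rough sketch of that selection rule; both tables are illustrative,
not exhaustive:)

    import locale

    ONE_BYTE = {'iso8859-1', 'iso8859-15', 'koi8-r', 'cp1251', 'cp1252'}
    POPULAR = {'ru': 'koi8-r', 'fr': 'iso8859-15', 'de': 'iso8859-15'}

    def choose_one_byte_encoding():
        lang, enc = locale.getdefaultlocale()
        if enc and enc.lower() in ONE_BYTE:
            return enc                        # locale is already one-byte
        if lang:
            return POPULAR.get(lang.split('_')[0])  # popular fallback
        return None                           # stick with a wide encoding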

Regards

Antoine.



From qrczak at knm.org.pl  Fri Sep 15 20:29:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 15 Sep 2006 20:29:55 +0200
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol> (Antoine Pitrou's message of
	"Fri, 15 Sep 2006 20:04:33 +0200")
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
Message-ID: <871wqdhtjw.fsf@qrnik.zagroda>

Antoine Pitrou <solipsis at pitrou.net> writes:

>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".

Latin-1 has the advantage of being trivially decodable to a sequence
of code points.

This is convenient for operations like string concatenation, or string
comparison, or taking substrings.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Fri Sep 15 20:36:21 2006
From: paul at prescod.net (Paul Prescod)
Date: Fri, 15 Sep 2006 11:36:21 -0700
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
Message-ID: <1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>

On 9/15/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> On Friday, 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
> > This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)
>
> The algorithm for choosing the one-byte encoding could be:
> - if the current locale uses a one-byte encoding, use that encoding
> - otherwise, if the current locale's language has a popular one-byte encoding
> (for many languages this would mean iso-8859-<X>), use that encoding
> - otherwise, no one-byte encoding
>
> This would ensure that, for example, Russian text on a system configured
> with a Russian locale does not always end up using two bytes per
> character internally.


I do not believe that this extra complexity will be valuable in the
long-term because most Europeans will switch to UTF-8 locales over the next
five years. The current situation makes no sense. Think about it from the
end-user's point of view:

"You can use KOI8-R/ISO-8859-? or UTF-8.

Pro for KOI8-R:

1. text files will use 0.8% instead of 1% of your hard disk space.
2. backwards compatibility

Pro for UTF-8:

1. Better compatibility with new software
2. Easier to share files across geographic boundaries
3. Ability to encode characters from other character sets
4. Access to characters like smart quotes, wingdings, fractions and so
forth.
"

The result seems obvious to me...8-bit-fixed encodings are a terrible idea
and need to just go away. Let's not build them into Python's core on the
basis of a minor and fleeting performance improvement.

 Paul Prescod

From jcarlson at uci.edu  Fri Sep 15 23:16:57 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 14:16:57 -0700
Subject: [Python-3000] string C API
In-Reply-To: <1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>
References: <1158343473.4292.14.camel@fsol>
	<1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>
Message-ID: <20060915133827.F98C.JCARLSON@uci.edu>


"Paul Prescod" <paul at prescod.net> wrote:
[snip]
> The result seems obvious to me...8-bit-fixed encodings are a terrible idea
> and need to just go away. Let's not build them into Python's core on the
> basis of a minor and fleeting performance improvement.

Variable-width encodings make many operations difficult, not the least
of which being "what is the code point for the ith character?"  The
benefit of going with a fixed-width encoding (like Python currently does
for unicode objects with UCS-2) is that so many computations are merely
an iteration over a sequence of chars/shorts/ints.  No need to recode
for complicated operations, no need to understand utf-8 for string
operations, etc.
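
(The contrast in code - with a fixed-width array, s[i] is a single
index operation; with UTF-8 it is a scan:)

    def utf8_index(data, i):
        # byte offset of the i-th code point in UTF-8 bytes: O(n),
        # since characters occupy 1-4 bytes each
        seen = -1
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:    # not a continuation byte
                seen += 1
                if seen == i:
                    return pos
        raise IndexError(i)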


 - Josiah


From jimjjewett at gmail.com  Fri Sep 15 23:37:41 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 17:37:41 -0400
Subject: [Python-3000] string C API
In-Reply-To: <20060915102433.F980.JCARLSON@uci.edu>
References: <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<20060915102433.F980.JCARLSON@uci.edu>
Message-ID: <fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>

On 9/15/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)

> Why?  If the text hash function is defined on *code points*, then
> interning, or really any arbitrary dictionary lookup is the same as it
> has always been.

The problem isn't the hash; it is the equality.  Which encoding do you
keep interned?

> What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> method I previously described is that each of the encodings offers a
> minimal representation of the code points that the text object contains.

There may be some thrashing as

    s+= (larger char)
    s[:6]

The three options might well be a sensible choice, but I think it
would already have much of the disadvantage of multiple internal
encodings, and we might eventually regret any specific limits.  (Why
not the local 8-bit?  Why not UTF-8, if that is the system encoding?)
It is easy enough to answer why not for each specific case, but I'm
not *certain* that it is the right answer -- so why not leave it up to
implementors if they want to do more than the basic three?

> Presumably there is going to be a mechanism to open files as bytes
> (reads return bytes), and for things like web servers, file servers, etc.,
> serving the content up as just a bunch of bytes is really the only thing
> that makes sense.

If someone has to recognize that their document is "text" when they
edit it, but "bytes" when they serve it over the web, and then "text"
again when they view it in the browser ... that is a recipe for
misunderstandings.

-jJ

From jcarlson at uci.edu  Sat Sep 16 02:13:33 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 17:13:33 -0700
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
References: <20060915102433.F980.JCARLSON@uci.edu>
	<fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
Message-ID: <20060915153702.F98F.JCARLSON@uci.edu>


"Jim Jewett" <jimjjewett at gmail.com> wrote:
> On 9/15/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > > Interning may get awkward if multiple encodings are allowed within a
> > > program, regardless of whether they're allowed for single strings.  It
> > > might make sense to intern only strings that are in the same encoding
> > > as the source code.  (Or whose values are limited to ASCII?)
> 
> > Why?  If the text hash function is defined on *code points*, then
> > interning, or really any arbitrary dictionary lookup is the same as it
> > has always been.
> 
> The problem isn't the hash; it is the equality.  Which encoding do you
> keep interned?

There is one minimal 'encoding' for any unicode string (in one of
latin-1, ucs-2, or ucs-4), really being an array of minimal-width
char/short/int code points. Because all text objects are internally
represented in their minimal 'encoding', equal text objects will always be
in the same encoding.


> > What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> > method I previously described is that each of the encodings offers a
> > minimal representation of the code points that the text object contains.
> 
> There may be some thrashing as
> 
>     s+= (larger char)
>     s[:6]

So there may be thrashing.  I don't see this as a problem.  String
addition and slicing are known to be linear in the length of the string being
produced for all nontrivial cases.  It's still linear.  What's the
problem?


> The three options might well be a sensible choice, but I think it
> would already have much of the disadvantage of multiple internal
> encodings, and we might eventually regret any specific limits.  (Why
> not the local 8-bit?  Why not UTF-8, if that is the system encoding?)
> It is easy enough to answer why not for each specific case, but I'm
> not *certain* that it is the right answer -- so why not leave it up to
> implementors if they want to do more than the basic three?

By "basic three" I presume you mean latin-1, ucs-2, and ucs-4.  I'm not
advocating for anything beyond those, in fact, I'm specifically
discouraging using anything other than those three, and I'm specifically
discouraging the idea of recoding internal representations.  Once a
text object is created, its internal state is fixed until it is
destroyed.


> > Presumably there is going to be a mechanism to open files as bytes
> > (reads return bytes), and for things like web servers, file servers, etc.,
> > serving the content up as just a bunch of bytes is really the only thing
> > that makes sense.
> 
> If someone has to recognize that their document is "text" when they
> edit it, but "bytes" when they serve it over the web, and then "text"
> again when they view it in the browser ... that is a recipe for
> misunderstandings.

They don't need to recognize anything when it is served onto the web. 
Just like they don't need to recognize anything right now.  The file is
served verbatim off of disk, which is then understood by the browser
because of encoding information built into the format.  If the format
doesn't have encoding information built into it, then the user isn't
going to be able to edit it.

 - Josiah


From and at doxdesk.com  Sat Sep 16 03:08:06 2006
From: and at doxdesk.com (Andrew Clover)
Date: Sat, 16 Sep 2006 02:08:06 +0100
Subject: [Python-3000] UTF-16
In-Reply-To: <1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
	<1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>
Message-ID: <450B4E76.1010501@doxdesk.com>

On 2006-09-01, Paul Prescod wrote:

> I cannot understand why a user should be forced to choose between 16 and 32
> bit strings AT BUILD TIME.

I strongly agree. This has been troublesome for many: not just for
people trying to install binary libs, but also for Python code that
actually needs to know the difference between unicode and wide-unicode
characters.

Ideally, implementation work notwithstanding, I would *love* to be able 
to have both types at a literal level (as unicode subclasses), along 
with retained byte string literals.

     ucs2string= u'\U00010000'  # 2 chars, \ud800\udc00
     ucs4string= w'\U00010000'  # 1 char
     bytestring= b'abc'
     string= 'abc'              # byte in 2.x, ucs2 in 3.0

If these were all subclasses of basestring, and other string type 
subclasses could be defined taking advantage of basic string methods, 
that could also allow the CSI stuff from Matz that you posted a mention of.
Although I'm personally not at all a fan of non-Unicode string types and 
would rather die than put i-mode emoji in a character set :-)

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From greg.ewing at canterbury.ac.nz  Sat Sep 16 03:07:06 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 16 Sep 2006 13:07:06 +1200
Subject: [Python-3000] string C API
In-Reply-To: <20060915153702.F98F.JCARLSON@uci.edu>
References: <20060915102433.F980.JCARLSON@uci.edu>
	<fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
	<20060915153702.F98F.JCARLSON@uci.edu>
Message-ID: <450B4E3A.7000005@canterbury.ac.nz>

Josiah Carlson wrote:
> Because all text objects are internally
> represented in their minimal 'encoding', equal text objects will always be
> in the same encoding.

That places a burden on all creators of strings to ensure
that they are in the minimal format, which could be
inconvenient for some operations, e.g. taking a substring
could require making an extra pass to re-code the data.
It would also preclude the possibility of representing
a substring as a view.

I don't see any great advantage given by this restriction
anyway. So you could tell two strings were unequal in
some cases if they happened to have different storage
formats, but there would still be plenty of cases
where you did have to compare them. Doesn't look like
a big deal to me.

--
Greg

From ncoghlan at gmail.com  Sat Sep 16 05:14:49 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 13:14:49 +1000
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>	
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>	
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>	
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
Message-ID: <450B6C29.8060107@gmail.com>

Jim Jewett wrote:
> On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> If you're reading text and you *know* it is ASCII data, then you can 
>> just set
>> the encoding to latin-1
> 
> Only if latin-1 is a valid encoding for the internal implementation.

I think the possible internal encodings should be latin-1, UCS-2 and UCS-4, 
with the size for a given string dictated by the largest codepoint in the 
string at creation time.

That way the internal representation of a string would only need to grow one 
extra field (the one saying how many bytes there are per character), and the 
internal state would remain immutable.

For 8-bit source data, 'latin-1' would then be the most efficient encoding, in 
that it would be a simple memcpy from the bytes object's internal buffer to 
the string object's internal buffer. Other encodings like 'koi8-r' would be 
decoded to either latin-1, UCS-2 or UCS-4 depending on the largest code point 
in the source data.

[Jim]
> If it is, then python does have to allow multiple internal
> implementations, and some way of marking which was used.  (Obviously,
> I think this is the right answer, but this is a change from 2.x, and
> would require some changes to the C API.)

One of the paragraphs you cut when replying to my message:

[Nick]
>> Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as 
>> internal representations, and choosing which one to use when the string is 
>> created.

I think we might be violently agreeing :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sat Sep 16 05:46:36 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 13:46:36 +1000
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>
	<1158343473.4292.14.camel@fsol>
Message-ID: <450B739C.20607@gmail.com>

Antoine Pitrou wrote:
On Friday, 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
> 
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)

The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:

 >>> x = range(256)
 >>> xs = ''.join(map(chr, x))
 >>> xu = xs.decode('latin-1')
 >>> all(ord(s)==ord(u) for s, u in zip(xs, xu))
True

In effect, when creating the string, you would be doing something like this:

   if encoding == 'latin-1':
       bytes_per_char = 1
       code_points = eight_bit_data
   else:
       code_points, max_code_point = decode_to_UCS4(eight_bit_data, encoding)
       if max_code_point < 256:
           bytes_per_char = 1
       elif max_code_point < 65536:
           bytes_per_char = 2
       else:
           bytes_per_char = 4
   # A width argument to the bytes constructor would be very convenient
   # for being able to consistently deal with endianness issues
   self.internal_buffer = bytes(code_points, width=bytes_per_char)
   self.bytes_per_char = bytes_per_char

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ronaldoussoren at mac.com  Sat Sep 16 07:59:45 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Sat, 16 Sep 2006 07:59:45 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
Message-ID: <58B2A910-3360-4480-A8CF-2D4C95F56981@mac.com>


On Sep 15, 2006, at 7:04 PM, Jim Jewett wrote:

> On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> Jim Jewett wrote:
>
>>>> ... would be necessary to at least *scan* the string when it
>>>> was first created in order to ensure it can be decoded without  
>>>> errors
>
>>> What happens today with strings?  I think the answer is:
>>>     "Nothing.
>>>      They print something odd when printed.
>>>      They may raise errors when explicitly recoded to unicde."
>>> Why is this a problem?
>
>> We don't have 8-bit strings lying around in Py3k.
>
> Right.  But we do in Py 2.x, and the equivalent delayed errors have
> not been a serious problem.  I suppose that might change if everyone
> were actually using unicode, so that more stuff got converted
> eventually.  On the other hand, I'm not sure how many strings will
> *ever* need recoding, if we don't do it on construction.

Automatic conversion from str to unicode in Py2.x is annoying at
times, mostly because it is easy to miss at development time. Using
unicode throughout (explicit conversion to unicode at the application
boundary) solves that, but that problem would reappear if
unicode(somestr, someencoding) returned a value that might cause a
UnicodeError when you try to access its value.

Another reason for disliking your idea is that unicode/py3k-str is a  
sequence of unicode code points and should always behave like one to  
the user. A polymorphic string type is an optimization (and an  
unproven one at that) and shouldn't complicate the Python-level  
string API.

Ronald

From martin at v.loewis.de  Sat Sep 16 08:32:37 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 08:32:37 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B6C29.8060107@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>	
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>	
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>	
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
	<450B6C29.8060107@gmail.com>
Message-ID: <450B9A85.6040904@v.loewis.de>

Nick Coghlan wrote:
> That way the internal representation of a string would only need to grow
> one extra field (the one saying how many bytes there are per character),
> and the internal state would remain immutable.

You could play tricks with ob_size to save this field:

- ob_size < 0: 8-bit data; length is abs(ob_size)
- ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
- ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2

The first representation constrains the length of an 8-bit
representation to max_ssize_t, which is also the limit today.
For 16-bit strings, the limit is max_ssize_t/2, which means
max_ssize_t bytes; this is technically more constraining, but
such a string would still consume half of the address space,
and is unlikely to get created (*). For 32-bit strings, the
limit is also max_ssize_t/2, yet the maximum string would
require more than 2*max_ssize_t (==max_size_t) bytes, so
this isn't a real limitation.
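
In Python terms, decoding such an ob_size would look something like
this (a sketch of the trick, not actual CPython code):

    def decode_ob_size(ob_size):
        # negative        -> 8-bit data,  length abs(ob_size)
        # positive, even  -> 16-bit data, length ob_size/2
        # positive, odd   -> 32-bit data, length ob_size/2
        if ob_size < 0:
            return 1, -ob_size
        elif ob_size & 1 == 0:
            return 2, ob_size // 2
        else:
            return 4, ob_size // 2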

> For 8-bit source data, 'latin-1' would then be the most efficient
> encoding, in that it would be a simple memcpy from the bytes object's
> internal buffer to the string object's internal buffer. Other encodings
> like 'koi8-r' would be decoded to either latin-1, UCS-2 or UCS-4
> depending on the largest code point in the source data.

This might somewhat slow down codecs, which would have to scan the input
string first to find out what the maximum code point is, where they
currently can decode in a single pass. Of course, for multi-byte codecs,
such scanning is a good idea, anyway (some currently overallocate just
to avoid the second pass).

Regards,
Martin

(*) Many systems don't allow such large memory blocks, anyway.
E.g. on 32-bit Windows, in the standard configuration, the
address space is "only" 2GB.

From jcarlson at uci.edu  Sat Sep 16 10:22:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 16 Sep 2006 01:22:43 -0700
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz>
References: <20060915153702.F98F.JCARLSON@uci.edu>
	<450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <20060915183617.F995.JCARLSON@uci.edu>


Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> 
> Josiah Carlson wrote:
> > Because all text objects are internally
> > represented in their minimal 'encoding', equal text objects will always be
> > in the same encoding.
> 
> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

If Martin says it's not a big deal, I'm not really all that concerned.


> It would also preclude the possibility of representing
> a substring as a view.

It doesn't preclude views.  Every operation works as before, only now
one would need to compare contents even on unequal-width code points.


> I don't see any great advantage given by this restriction
> anyway. So you could tell two strings were unequal in
> some cases if they happened to have different storage
> formats, but there would still be plenty of cases
> where you did have to compare them. Doesn't look like
> a big deal to me.

It is ultimately about space savings, and in the case of names (since
all will be 8-bit), perhaps even a bit faster to look up in the
interning table (I believe it is easier to hash 8 chars than 8 shorts).

 - Josiah


From qrczak at knm.org.pl  Sat Sep 16 11:53:51 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 11:53:51 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz> (Greg Ewing's message of
	"Sat, 16 Sep 2006 13:07:06 +1200")
References: <20060915102433.F980.JCARLSON@uci.edu>
	<fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
	<20060915153702.F98F.JCARLSON@uci.edu>
	<450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <874pv8i1cg.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

Yes, but taking a substring already requires linear time wrt. the
length of the substring.

Allocating a string from a C array of wide characters (which
determines the format from the contents) will be written once and
called as a function.

Most strings are ASCII, so most of the time there is no need to check
whether the substring could become even narrower.

> It would also preclude the possibility of representing
> a substring as a view.

If views were implemented on the level of C pointers, then views would
not have the property of being in the canonical representation wrt.
character width. It's still valuable I think to use a more compact
representation if it would affect most strings.

> I don't see any great advantage given by this restriction
> anyway.

Keeping the canonical representation is not very important. It just
ensures that the advantage of having a more compact representation is
taken as often as possible, even if the string has been cut from
another string which contained a wide character.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Sat Sep 16 12:02:37 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 12:02:37 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B9A85.6040904@v.loewis.de> (Martin v.
	Löwis's message of "Sat, 16 Sep 2006 08:32:37 +0200")
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
	<450B6C29.8060107@gmail.com> <450B9A85.6040904@v.loewis.de>
Message-ID: <87zmd0gmde.fsf@qrnik.zagroda>

"Martin v. L?wis" <martin at v.loewis.de> writes:

> You could play tricks with ob_size to save this field:
>
> - ob_size < 0: 8-bit data; length is abs(ob_size)
> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2

I wonder whether strings with characters outside ISO-8859-1 are common
enough that having a 16-bit representation is worth the trouble.

CLISP does have it. My language doesn't.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Sat Sep 16 15:43:36 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 15:43:36 +0200
Subject: [Python-3000] string C API
In-Reply-To: <20060915183617.F995.JCARLSON@uci.edu>
References: <20060915153702.F98F.JCARLSON@uci.edu>	<450B4E3A.7000005@canterbury.ac.nz>
	<20060915183617.F995.JCARLSON@uci.edu>
Message-ID: <450BFF88.3060902@v.loewis.de>

Josiah Carlson wrote:
>> That places a burden on all creators of strings to ensure
>> that they are in the minimal format, which could be
>> inconvenient for some operations, e.g. taking a substring
>> could require making an extra pass to re-code the data.
> 
> If Martin says it's not a big deal, I'm not really all that concerned.

I was thinking about codecs specifically: they often need to make
multiple passes anyway.

In general, only measurements can tell the performance impacts of
some design decision (e.g. it's non-obvious how often the various
string operations occur, and what the performance impact is).

There is also an issue of convenience here; however, with three
different representations, library functions could be provided
to support all cases.

> It is ultimately about space savings, and in the case of names (since
> all will be 8-bit), perhaps even a bit faster to look up in the
> interning table (I believe it is easier to hash 8 chars than 8 shorts).

That you need to demonstrate through profiling. First, strings likely
continue to keep their hash, and then it seems plausible that the cost
for hashing is in the computation and the loop, not in the memory
access, and that the computation is carried out in 32-bit registers
regardless of character width.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 16 15:49:29 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 15:49:29 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B739C.20607@gmail.com>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>	<1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com>
Message-ID: <450C00E9.6070008@v.loewis.de>

Nick Coghlan wrote:
> The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:

That's true, but it doesn't follow that this makes it a good choice for
a special case. Instead, it is the frequency of occurrence of the
special case that makes it a good choice.

> In effect, when creating the string, you would be doing something like this:
> 
>    if encoding == 'latin-1':
>        bytes_per_char = 1
>        code_points = 8_bit_data
>    else:
>        code_points, max_code_point = decode_to_UCS4(8_bit_data, encoding)
>        if max_code_point < 256:
>            bytes_per_char = 1
>        elif max_code_point < 65536:
>            bytes_per_char = 2
>        else:
>            bytes_per_char = 4

Hardly. Instead, the codec would have to create the string of the right
width; a codec written in C would make two passes, rather than
temporarily allocating memory to actually represent the UCS-4 codes.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 16 15:55:47 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 15:55:47 +0200
Subject: [Python-3000] string C API
In-Reply-To: <87zmd0gmde.fsf@qrnik.zagroda>
References: <ed968c$d03$1@sea.gmane.org>	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	<45084966.3000608@v.loewis.de>
	<45092CC2.4070700@gmail.com>	<4509CAEA.3040108@v.loewis.de>
	<450AAAD6.5030901@gmail.com>	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<450AC38F.4080005@gmail.com>	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>	<450B6C29.8060107@gmail.com>
	<450B9A85.6040904@v.loewis.de> <87zmd0gmde.fsf@qrnik.zagroda>
Message-ID: <450C0263.3090506@v.loewis.de>

Marcin 'Qrczak' Kowalczyk wrote:
>> You could play tricks with ob_size to save this field:
>>
>> - ob_size < 0: 8-bit data; length is abs(ob_size)
>> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
>> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2
> 
> I wonder whether strings with characters outside ISO-8859-1 are common
> enough that having a 16-bit representation is worth the trouble.
> 
> CLISP does have it. My language doesn't.

The design of Unicode is such that all "living" scripts are encoded
within the BMP. So four-byte characters would be extremely rare, and one
may argue that encoding them with UTF-16 is good enough.

So if there is flexibility in the internal representation of strings,
I think a two-byte representation should definitely be one of the
options; I'd rather debate about the necessity of one-byte and
four-byte representations.

Regards,
Martin


From ncoghlan at gmail.com  Sat Sep 16 18:49:36 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 17 Sep 2006 02:49:36 +1000
Subject: [Python-3000] string C API
In-Reply-To: <450C00E9.6070008@v.loewis.de>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>	<1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
Message-ID: <450C2B20.4090504@gmail.com>

Martin v. Löwis wrote:
> Nick Coghlan wrote:
>> The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
>> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:
> 
> That's true, but that this makes a good choice for a special case
> doesn't follow. Instead, frequency of occurrence of the special case
> makes it a good choice.

If an 8-bit encoding other than latin-1 is used for the internal buffer, then 
every comparison operation would have to decode the string to Unicode in order 
to compare code points.

It seems much simpler to me to ensure that what is stored internally is 
*always* the Unicode code points, with the width (1, 2 or 4 bytes) determined 
by the largest code point in the string. The latter two are the UCS-2 and 
UCS-4 formats that are compile-time selectable for unicode strings in Python 
2.x, but I'm not aware of any name other than 'latin-1' for the case where all 
of the code points are less than 256.

> Hardly. Instead, the codec would have to create the string of the right
> width; a codec written in C would make two passes, rather than
> temporarily allocating memory to actually represent the UCS-4 codes.

Indeed, that does make more sense - one pass to figure out the number of 
characters and the largest code point, and a second to copy the characters to 
the allocated buffer.
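
A rough sketch of that two-pass decode in Python terms (illustrative
only; decode_one is a hypothetical helper returning one code point and
the next input position):

    def scan_pass(raw, decode_one):
        # Pass 1: count the characters and find the largest code point.
        count, widest, pos = 0, 0, 0
        while pos < len(raw):
            cp, pos = decode_one(raw, pos)
            count += 1
            widest = max(widest, cp)
        if widest < 0x100:
            width = 1
        elif widest < 0x10000:
            width = 2
        else:
            width = 4
        # Pass 2 would allocate count * width bytes and decode again,
        # storing each code point into the new buffer.
        return count, width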

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From martin at v.loewis.de  Sat Sep 16 20:01:28 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 20:01:28 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450C2B20.4090504@gmail.com>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>	<1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
	<450C2B20.4090504@gmail.com>
Message-ID: <450C3BF8.2030901@v.loewis.de>

Nick Coghlan wrote:
> If an 8-bit encoding other than latin-1 is used for the internal buffer,
> then every comparison operation would have to decode the string to
> Unicode in order to compare code points.
> 
> It seems much simpler to me to ensure that what is stored internally is
> *always* the Unicode code points, with the width (1, 2 or 4 bytes)
> determined by the largest code point in the string.

Just try implementing comparison some time. You can end up implementing
the same algorithm six times at least, once for each pair (1,1), (1,2),
(1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
you can't reduce (2,1) to (1,2)), you need 9 different versions of the
algorithm. That sounds more complicated than always decoding.

Regards,
Martin

From jcarlson at uci.edu  Sat Sep 16 20:51:33 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 16 Sep 2006 11:51:33 -0700
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de>
References: <450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <20060916114123.F99E.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> 
> Nick Coghlan wrote:
> > If an 8-bit encoding other than latin-1 is used for the internal buffer,
> > then every comparison operation would have to decode the string to
> > Unicode in order to compare code points.
> > 
> > It seems much simpler to me to ensure that what is stored internally is
> > *always* the Unicode code points, with the width (1, 2 or 4 bytes)
> > determined by the largest code point in the string.
> 
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)), you need 9 different versions of the
> algorithm. That sounds more complicated than always decoding.

One algorithm.  Each character can be "decoded" during runtime.

long expand(void* buffer, Py_ssize_t posn, int shift) {
    /* shift is log2(bytes per code point):
       0 -> 1-byte, 1 -> 2-byte, 2 -> 4-byte units */
    char* p = ((char*)buffer) + (posn << shift);
    switch (shift) {
    case 0:  return ((unsigned char*)p)[0];
    case 1:  return ((unsigned short*)p)[0];
    case 2:  return ((unsigned int*)p)[0];  /* assuming 32-bit int */
    default: return -1;
    }
}

Alternatively, with a little work, the 9 variants can be defined with a
prototype system, using macros or otherwise.


 - Josiah


From qrczak at knm.org.pl  Sat Sep 16 23:20:44 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 23:20:44 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de> (Martin v.
	Löwis's message of "Sat, 16 Sep 2006 20:01:28 +0200")
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
	<450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <87psdvik43.fsf@qrnik.zagroda>

"Martin v. L?wis" <martin at v.loewis.de> writes:

> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)), you need 9 different versions of the
> algorithm. That sounds more complicated than always decoding.

That's why I'm proposing only two variants, ISO-8859-1 and UCS-4.

String equality: two variants. Two others are trivial if the
representation is always canonical.

String < and <=: 8 variants in total, all generated from a single
20-line piece of C code, parametrized by preprocessor macros.

String !=, >, >=: defined in terms of the above.

String concatenation:
   if both strings are narrow:
      allocate a narrow result
      copy narrow from str1 to result
      copy narrow from str2 to result
   else:
      allocate a wide result
      if str1 is narrow:
         copy narrow->wide from str1 to result
      else:
         copy wide from str1 to result
      if str2 is narrow:
         copy narrow->wide from str2 to result
      else:
         copy wide from str2 to result

__contains__, startswith, index: three variants, one other is trivial.

Seems simple enough for me.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From greg.ewing at canterbury.ac.nz  Sun Sep 17 01:17:35 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 17 Sep 2006 11:17:35 +1200
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
	<450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <450C860F.9010605@canterbury.ac.nz>

Martin v. Löwis wrote:
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4).

#define UnicodeStringComparisonFunction(TYPE1, TYPE2) \
   /* code to implement it here */

UnicodeStringComparisonFunction(UCS1, UCS1)
UnicodeStringComparisonFunction(UCS1, UCS2)
UnicodeStringComparisonFunction(UCS1, UCS4)
UnicodeStringComparisonFunction(UCS2, UCS2)
UnicodeStringComparisonFunction(UCS2, UCS4)
UnicodeStringComparisonFunction(UCS4, UCS4)

--
Greg

From meyer at acm.org  Sun Sep 17 14:28:08 2006
From: meyer at acm.org (Andre Meyer)
Date: Sun, 17 Sep 2006 14:28:08 +0200
Subject: [Python-3000] Kill GIL?
Message-ID: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>

Dear Python experts

As a heavy user of multi-threading in Python and following the current
discussions about Python on multi-processor systems on the python-list I
wonder what the plans are for improving MP performance in Py3k. MP systems
become more and more common as most modern processors have multiple
processing units that could be used in parallel by distributing threads.
Unfortunately, the GIL in CPython prevents to use this mechanism. As far as
I understand IronPython, Jython and PyPy do not suffer from this.

While I understand the difficulties in removing the GIL and the potential
negative effect on single-threaded applications I would very much encourage
discussion to seriously consider removing the GIL (maybe optionally) in
Py3k. If not, what alternatives would you suggest?

thanks a lot for your thoughts
Andre

From ncoghlan at gmail.com  Sun Sep 17 15:16:30 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 17 Sep 2006 23:16:30 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <450D4AAE.2000805@gmail.com>

Andre Meyer wrote:
> While I understand the difficulties in removing the GIL and the 
> potential negative effect on single-threaded applications I would very 
> much encourage discussion to seriously consider removing the GIL (maybe 
> optionally) in Py3k. If not, what alternatives would you suggest?

Brett Cannon's sandboxing work (which aims to provide first-class support for 
multiple interpreters in the same process for security reasons) also seems 
like a potentially fruitful approach to distributing processing to multiple cores:
   - use threads to perform blocking I/O in parallel
   - use multiple interpreters to perform Python execution in parallel

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Sun Sep 17 15:43:50 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 17 Sep 2006 15:43:50 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	(Andre Meyer's message of "Sun, 17 Sep 2006 14:28:08 +0200")
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <87irjmk3qh.fsf@qrnik.zagroda>

"Andre Meyer" <meyer at acm.org> writes:

> While I understand the difficulties in removing the GIL and the
> potential negative effect on single-threaded applications I would
> very much encourage discussion to seriously consider removing the
> GIL (maybe optionally) in Py3k.

I suppose this would require either fundamentally changing the garbage
collection algorithm (lots of work and breaking all C extensions),
or accompanying all reference count adjustments with memory barriers
(a significant performance hit even if a particular object is not
shared between threads; many objects like None will be shared anyway).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From solipsis at pitrou.net  Sun Sep 17 16:51:12 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 17 Sep 2006 16:51:12 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450D4AAE.2000805@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
Message-ID: <1158504672.28528.82.camel@fsol>

On Sunday, 17 September 2006 at 23:16 +1000, Nick Coghlan wrote:
> Brett Cannon's sandboxing work (which aims to provide first-class support for 
> multiple interpreters in the same process for security reasons) also seems 
> like a potentially fruitful approach to distributing processing to multiple cores:
>    - use threads to perform blocking I/O in parallel

OTOH, the Twisted approach avoids all the delicate synchronization
issues that arise when using threads to perform concurrent IO tasks.

Also, IO is by definition not CPU-intensive, so there is no point in
distributing IO to multiple cores (and it could even cause a small
decrease in performance because of inter-CPU communication overhead).

Regards

Antoine.



From jcarlson at uci.edu  Sun Sep 17 19:56:15 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 10:56:15 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <20060917103103.F9A4.JCARLSON@uci.edu>


"Andre Meyer" <meyer at acm.org> wrote:
> Dear Python experts
> 
> As a heavy user of multi-threading in Python and following the current
> discussions about Python on multi-processor systems on the python-list I
> wonder what the plans are for improving MP performance in Py3k. MP systems
> become more and more common as most modern processors have multiple
> processing units that could be used in parallel by distributing threads.
> Unfortunately, the GIL in CPython prevents to use this mechanism. As far as
> I understand IronPython, Jython and PyPy do not suffer from this.
> 
> While I understand the difficulties in removing the GIL and the potential
> negative effect on single-threaded applications I would very much encourage
> discussion to seriously consider removing the GIL (maybe optionally) in
> Py3k. If not, what alternatives would you suggest?

Search for 'Python free threading' without quotes in Google to find the
discussions about this topic over the years.

Personally, I think that the effort to remove the GIL in Py3k (or
otherwise) is quite a bit of trouble that we don't want to have to go
through, both from an internal-redesign and a C-extension perspective.

It would be substantially easier if there were a distributed RPC
mechanism that auto distributed to the "least-working" process in a set
of potential working processes on a single machine.  Something with the
simplicity of XML-RPC calling (but without servers and clients) and the
distribution properties of Linda. Of course then we run into a situation
where we need to "pickle" the callable arguments across a connection of
some kind. There is a solution to this on a single machine; copying the
internal representation of every object in the arguments of a function
call to memory shared between all processes (mmap).  With such a
semantic, only mutable portions need to be copied out into non-mmap
memory.
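
A toy sketch of that shared-memory piece (my illustration; POSIX-only,
using an anonymous shared mmap across a fork):

    import mmap, os

    shared = mmap.mmap(-1, 4096)   # anonymous mapping, MAP_SHARED
    if os.fork() == 0:
        shared[0:5] = "hello"      # child writes into shared memory
        os._exit(0)
    os.wait()
    print shared[0:5]              # parent sees the child's write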

With that RPC mechanism and file handle migration (available on BSDs
natively, linux with minor work, and Windows via pywin32), most
operations would *just work*, would be reasonably fast, and Python could
keep its GIL - which would be substantially less work for everyone
involved.


The details are cumbersome (specifically the copying of Python objects
to/from memory), but they can be made less cumbersome if one only allows
builtin objects to be transferred.

 - Josiah


From brett at python.org  Sun Sep 17 20:03:34 2006
From: brett at python.org (Brett Cannon)
Date: Sun, 17 Sep 2006 11:03:34 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450D4AAE.2000805@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
Message-ID: <bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>

On 9/17/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
> Andre Meyer wrote:
> > While I understand the difficulties in removing the GIL and the
> > potential negative effect on single-threaded applications I would very
> > much encourage discussion to seriously consider removing the GIL (maybe
> > optionally) in Py3k. If not, what alternatives would you suggest?
>
> Brett Cannon's sandboxing work (which aims to provide first-class support
> for
> multiple interpreters in the same process for security reasons) also seems
> like a potentially fruitful approach to distributing processing to
> multiple cores:
>    - use threads to perform blocking I/O in parallel
>    - use multiple interpreters to perform Python execution in parallel


Possibly, but as it stands now interpreters just execute in their own Python
thread, so there is no real performance boost.  Without the GIL shifting
over to per interpreter instead of per process there are going to be the same
performance problems as with Python threads.  And changing that would be
hard since objects can be shared between multiple interpreters.

-Brett

From ronaldoussoren at mac.com  Sun Sep 17 20:36:40 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Sun, 17 Sep 2006 20:36:40 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450D4AAE.2000805@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
Message-ID: <A8560AE1-801E-4F04-B7EF-61F489090354@mac.com>


On Sep 17, 2006, at 3:16 PM, Nick Coghlan wrote:

> Andre Meyer wrote:
>> While I understand the difficulties in removing the GIL and the
>> potential negative effect on single-threaded applications I would  
>> very
>> much encourage discussion to seriously consider removing the GIL  
>> (maybe
>> optionally) in Py3k. If not, what alternatives would you suggest?
>
> Brett Cannon's sandboxing work (which aims to provide first-class  
> support for
> multiple interpreters in the same process for security reasons)  
> also seems
> like a potentially fruitful approach to distributing processing to  
> multiple cores:
>    - use threads to perform blocking I/O in parallel
>    - use multiple interpreters to perform Python execution in parallel

... except when you use extensions that use the PyGILState APIs,  
those don't work with multiple interpreters :-(.

Ronald


From rasky at develer.com  Sun Sep 17 23:58:57 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 17 Sep 2006 23:58:57 +0200
Subject: [Python-3000] Kill GIL?
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<20060917103103.F9A4.JCARLSON@uci.edu>
Message-ID: <007501c6daa4$79768520$a14c2597@bagio>

Josiah Carlson <jcarlson at uci.edu> wrote:

> It would be substantially easier if there were a distributed RPC
> mechanism that auto distributed to the "least-working" process in a
> set
> of potential working processes on a single machine.  [...]

I'm not sure I follow you. Would you mind providing an example of a plausible
API for this mechanism (aka what the code would look like, compared to the
current Python threading classes)?

Giovanni Bajo


From jcarlson at uci.edu  Mon Sep 18 03:18:32 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 18:18:32 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <007501c6daa4$79768520$a14c2597@bagio>
References: <20060917103103.F9A4.JCARLSON@uci.edu>
	<007501c6daa4$79768520$a14c2597@bagio>
Message-ID: <20060917180402.07DB.JCARLSON@uci.edu>


"Giovanni Bajo" <rasky at develer.com> wrote:
> Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > It would be substantially easier if there were a distributed RPC
> > mechanism that auto distributed to the "least-working" process in a
> > set
> > of potential working processes on a single machine.  [...]
> 
> I'm not sure I follow you. Would you mind providing an example of a plausible
> API for this mechanism (aka what the code would look like, compared to the
> current Python threading classes)?

    import autorpc
    caller = autorpc.init_processes(autorpc.num_processors())

    import callables
    caller.register_module(callables)

    result = caller.fcn1(arg1, arg2, arg3)

The point is not to compare the API, etc., with threading, but to compare
it with XMLRPC.  Because ultimately, what I would like to see is a
mechanism similar to XMLRPC: call a method on an instance that is
automatically executed, perhaps in some other thread in some other
process, or maybe even in the same thread on the same process (depending
on load, etc.), and which returns the result in-place.

It's just much easier to handle (IMO).  The above shows a single
call/return.  What if you don't care about getting a result back before
continuing, or perhaps you have a bunch of things you want to get done?

    ...
    q = Queue.Queue()

    caller.delayed(q.put).fcn1(arg1, arg2, arg3)
    r = q.get() #will be delayed until q gets something

What to do about exceptions happening in fcn1 remotely?  A fellow over
in the wxPython mailing list brought up the idea of exception objects;
perhaps not stackframes, etc., but perhaps an object with information
like exception type and traceback, used for both delayed and non-delayed
tracebacks.
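
A tiny sketch of that exception-object idea (purely hypothetical names):

    import traceback

    class RemoteError(object):
        # Describes an exception that happened in another process:
        # the exception class plus the formatted remote traceback.
        def __init__(self, exc_class, tb_text):
            self.exc_class = exc_class
            self.tb_text = tb_text

    def call_guarded(fcn, *args):
        # Run fcn in the worker; return either its result or a
        # RemoteError describing what went wrong.
        try:
            return fcn(*args)
        except Exception, e:
            return RemoteError(e.__class__, traceback.format_exc())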


 - Josiah


From ross at sourcelabs.com  Mon Sep 18 06:06:34 2006
From: ross at sourcelabs.com (Ross Jekel)
Date: Sun, 17 Sep 2006 21:06:34 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <mailman.530.1158530744.10490.python-3000@python.org>
References: <mailman.530.1158530744.10490.python-3000@python.org>
Message-ID: <courier.00000000450E1B4A.00002758@mail-1.colo.sourcelabs.com>

I know it is a bit old, but would Python Object Sharing (POSH) 
http://poshmodule.sourceforge.net help you? Also, if you think you like the 
current state-of-the-art threading model, you might not after reading this: 

http://tinyurl.com/qvcbr 

This goes to an article on http://www.computer.org with a long URL entitled 
"The Problem with Threads." 

After some initial surprise when I learned about it, I'm now okay with a GIL 
or even single threaded python (with async I/O if necessary). In my opinion 
threaded programs with one big shared data space (like CPython's) are 
fundamentally untestable and unverifiable, and the GIL was the best solution 
to reduce risk in that area. I am happy the GIL exists because it forces me 
to come up with designs for programs and systems that are easier to write, more
predictable both in terms of correctness and performance, and easier to 
maintain and scale. I think there would be significant backlash in the 
Python development community the first time an intermittent race condition 
or a deadlock occurs in the CPython interpreter after years of relying on it
as a predictable, reliable platform. 

I'm also happy the GIL exists because it forces alternative ideas like 
Twisted and stackless to be developed and tried. 

If you have shared data that really benefits from synchronized access and 
updates, write an extension, release the GIL at the appropriate places, and 
do whatever you want in a C data structure. I've done this when necessary 
and think it is the best of both worlds. I guess I'm assuming this will 
still be possible in Python 3000 (I haven't been on the list that long, 
sorry). 

There has to be a better concurrency model than threads. Let's design for 
the future with useful packages that implement the best ideas of today for 
scaling well without threads. 

Ross

From ironfroggy at gmail.com  Mon Sep 18 06:50:34 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Mon, 18 Sep 2006 00:50:34 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <20060917180402.07DB.JCARLSON@uci.edu>
References: <20060917103103.F9A4.JCARLSON@uci.edu>
	<007501c6daa4$79768520$a14c2597@bagio>
	<20060917180402.07DB.JCARLSON@uci.edu>
Message-ID: <76fd5acf0609172150o55e79fddta141e348bffb342@mail.gmail.com>

On 9/17/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Giovanni Bajo" <rasky at develer.com> wrote:
> > Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > > It would be substantially easier if there were a distributed RPC
> > > mechanism that auto distributed to the "least-working" process in a
> > > set
> > > of potential working processes on a single machine.  [...]
> >
> > I'm not sure I follow you. Would you mind providing an example of a plausible
> > API for this mechanism (aka what the code would look like, compared to the
> > current Python threading classes)?
>
>     import autorpc
>     caller = autorpc.init_processes(autorpc.num_processors())
>
>     import callables
>     caller.register_module(callables)
>
>     result = caller.fcn1(arg1, arg2, arg3)
>
> The point is not to compare API/etc., with threading, but to compare it
> with XMLRPC.  Because ultimately, what I would like to see, is a
> mechanic similar to XMLRPC; call a method on an instance, that is
> automatically executed perhaps in some other thread in some other
> process, or maybe even in the same thread on the same process (depending
> on load, etc.), and which returns the result in-place.
>
> It's just much easier to handle (IMO).  The above example highlights an
> example of single call/return.  What if you don't care about getting a
> result back before continuing, or perhaps you have a bunch of things you
> want to get done?
>
>     ...
>     q = Queue.Queue()
>
>     caller.delayed(q.put).fcn1(arg1, arg2, arg3)
>     r = q.get() #will be delayed until q gets something
>
> What to do about exceptions happening in fcn1 remotely?  A fellow over
> in the wxPython mailing list brought up the idea of exception objects;
> perhaps not stackframes, etc., but perhaps an object with information
> like exception type and traceback, used for both delayed and non-delayed
> tracebacks.
>
>
>  - Josiah

I would be thrilled to see this kind of api brought into python. It
could very likely be implemented in time for Python 2.6, which would
be spawning processes to handle the load. At the very least, a Python
2.4 or older compatible module could be released to test the waters and
see what works and doesn't in this idea. I tried to wrap my head around
different options on how the GIL might go away, but in the end you
just realize you would hate to see it go away.

I'm sure Twisted would have a field day with such a facility.

If this kind of thing gets brought into Python, it would almost
require some form of MapReduce to come along with it. Of course, with the
existing talks about removing map as a built-in in favor of list
comprehensions, it makes one consider if listcomps and genexps might
have some way to utilize a distributed model natively. Some
consideration toward that end would be valuable.

From jcarlson at uci.edu  Mon Sep 18 07:14:26 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 22:14:26 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <courier.00000000450E1B4A.00002758@mail-1.colo.sourcelabs.com>
References: <mailman.530.1158530744.10490.python-3000@python.org>
	<courier.00000000450E1B4A.00002758@mail-1.colo.sourcelabs.com>
Message-ID: <20060917212744.07DE.JCARLSON@uci.edu>


"Ross Jekel" <ross at sourcelabs.com> wrote:
> I know it is a bit old, but would Python Object Sharing (POSH) 
> http://poshmodule.sourceforge.net help you? Also, if you think you like the 
> current state-of-the-art threading model, you might not after reading this: 

The RPC-like mechanism I described earlier could be implemented on top
of POSH if desired, though I believe that some of the potential issues
that POSH has yet to fix (see its TODO file) aren't as much of a concern
(if at all) when only using shared memory as IPC and not as an object
store.

> "The Problem with Threads."

Getting *concurrency* right is hard.  One way of making it not so hard
is to use simple abstractions, like producer/consumer, deferred results,
etc.  But not everything fits into these abstractions, and sometimes
there are data structure manipulations that require locking. In that
sense, it's not so much that *concurrency* is hard to get right, as much
as locking is hard to get right.

But with Python 2.5, we get the 'with' statement and context managers. 
Add context managers to locks, always use RLocks (so that you can
.acquire() a lock multiple times), and while it hasn't gotten easy (one
needs to be careful with lock acquisition order to prevent deadlocks,
especially when mixing locks with queues), more concurrency tasks have
gotten *easier*.


Essentially the article points out that using abstractions like
producer/consumer, deferreds, etc., can make concurrent programming not
so hard, and you have to be basically insane to use threads in your
concurrent programming (I've been doing it for about 7 years, and am
thoroughly insane), but unless I'm missing something (I only skimmed the
article when it first came out, so this is quite possible), it's not
really saying anything new to the concurrent programmer (of nontrivial
systems).


With the API and RPC mechanism I sketched out earlier, threads are a
possible underlying implementation detail.  Essentially, it tries to
force everything into a producer/consumer abstraction; the function I
call is a consumer of the arguments I pass, and it produces a result
that I (or someone else) later consume.  This somewhat limits what kinds
of things can be done 'natively', but you can't get everything.


 - Josiah


From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 07:55:59 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 01:55:59 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <450E34EF.3090202@solarsail.hcs.harvard.edu>

Andre Meyer wrote:
> As a heavy user of multi-threading in Python and following the current
> discussions about Python on multi-processor systems on the python-list I
> wonder what the plans are for improving MP performance in Py3k. 

I have four aborted e-mails in my 'Drafts' folder that are asking the
same question; each time, I decided that the almost inevitably ensuing
"threads suck!" flamewar just isn't worth it. Now that someone else has
taken the plunge...

At present, the Python approach to multi-processing sounds a bit like
"let's stick our collective hands in the sand and pretend there's no
problem". In particular, one oft-parroted argument says that it's not
worth changing or optimizing the language for the few people who can
afford SMP hardware. In the meantime, dual-core laptops are becoming the
standard, with Intel predicting quad-core will become mainstream in the
next few years, and the number of server orders for single-core, UP
machines is plummeting.

From this, it's obvious to me that we need to do *something* to
introduce stronger multi-processing support. Our current abilities are
rather bad: we offer no microthreads, which is making elegant
concurrency primitives such as Erlang's, ported to Python by the
Candygram project [0], unnecessarily expensive. Instead, we only offer
heavy threads that each allocate a full-size stack, and there's no
actual ability to parallelize thread execution across CPUs. There's also
no way to simply fork and coordinate between the forked processes,
depending on the nature of the problem being solved, since there's no
shared memory primitive in the stdlib (this because shared memory
semantics are notoriously different across platforms). On top of it all,
any adopted solution needs to be implementable across all the major
Python interpreters, which makes finding a solution that much harder.

The way I see it, we have several options:

* Bite the bullet; write and support a stdlib SHM primitive that works
wherever possible, and simply doesn't work on completely broken
platforms (I understand Windows falls into this category). Utilize it in
a lightweight fork-and-coordinate wrapper provided in the stdlib.

* Bite the mortar shell, and remove the GIL.

* Introduce microthreads, declare that Python endorses Erlang's
no-sharing approach to concurrency, and incorporate something like
candygram into the stdlib.

* Introduce a fork-and-coordinate wrapper in the stdlib, and declare
that we're simply not going to support the use case that requires
sharing (as opposed to merely passing) objects between processes.

The first option is a Pareto optimization, but having stdlib
functionality flat out unavailable on some platforms might be out of the
question. It'd be good to hear Guido's longer-term view on concurrency
in Python. That discussion might be more appropriate on python-dev, though.

Cheers,


[0] http://candygram.sourceforge.net/

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From bob at redivi.com  Mon Sep 18 08:29:44 2006
From: bob at redivi.com (Bob Ippolito)
Date: Sun, 17 Sep 2006 23:29:44 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
Message-ID: <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>

On 9/17/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> Andre Meyer wrote:
> > As a heavy user of multi-threading in Python and following the current
> > discussions about Python on multi-processor systems on the python-list I
> > wonder what the plans are for improving MP performance in Py3k.
>
> I have four aborted e-mails in my 'Drafts' folder that are asking the
> same question; each time, I decided that the almost inevitably ensuing
> "threads suck!" flamewar just isn't worth it. Now that someone else has
> taken the plunge...
>
> At present, the Python approach to multi-processing sounds a bit like
> "let's stick our collective hands in the sand and pretend there's no
> problem". In particular, one oft-parroted argument says that it's not
> worth changing or optimizing the language for the few people who can
> afford SMP hardware. In the meantime, dual-core laptops are becoming the
> standard, with Intel predicting quad-core will become mainstream in the
> next few years, and the number of server orders for single-core, UP
> machines is plummeting.
>
> From this, it's obvious to me that we need to do *something* to
> introduce stronger multi-processing support. Our current abilities are
> rather bad: we offer no microthreads, which is making elegant
> concurrency primitives such as Erlang's, ported to Python by the
> Candygram project [0], unnecessarily expensive. Instead, we only offer
> heavy threads that each allocate a full-size stack, and there's no
> actual ability to parallelize thread execution across CPUs. There's also
> no way to simply fork and coordinate between the forked processes,
> depending on the nature of the problem being solved, since there's no
> shared memory primitive in the stdlib (this because shared memory
> semantics are notoriously different across platforms). On top of it all,
> any adopted solution needs to be implementable across all the major
> Python interpreters, which makes finding a solution that much harder.

Candygram is heavyweight by trade-off, not because it has to be.
Candygram could absolutely be implemented efficiently in current
Python if a Twisted-like style were used. An API that exploits Python
2.5's with-statement blocks and enhanced iterators would make it less
verbose than a traditional Twisted app and potentially easier to learn.
Stackless or greenlets could be used for an even lighter weight API,
though not as portably.
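
For instance, the enhanced-iterator flavor of that idea might look
roughly like this (a toy mailbox-and-scheduler sketch, nowhere near a
complete Candygram; each "process" costs a generator, not a stack):

import collections

class Process(object):
    def __init__(self, gen):
        self.gen = gen
        self.gen.next()          # prime the generator to its first yield
        self.mailbox = collections.deque()

def schedule(procs):
    runnable = collections.deque(procs)
    while runnable:
        proc = runnable.popleft()
        if not proc.mailbox:
            continue             # toy policy: starved processes drop out
        try:
            proc.gen.send(proc.mailbox.popleft())   # deliver a message
            runnable.append(proc)
        except StopIteration:
            pass

def echo():
    while True:
        msg = yield              # 'receive'
        print 'echo got:', msg

p = Process(echo())
p.mailbox.append('hello')
schedule([p])                    # prints 'echo got: hello'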

> The way I see it, we have several options:
>
> * Bite the bullet; write and support a stdlib SHM primitive that works
> wherever possible, and simply doesn't work on completely broken
> platforms (I understand Windows falls into this category). Utilize it in
> a lightweight fork-and-coordinate wrapper provided in the stdlib.

I really don't think that's the right approach. If we're going to
bother supporting distributed processing, we might as well support it
in a portable way that can scale across machines.

> * Bite the mortar shell, and remove the GIL.

This really isn't even an option because we're not throwing away the
current C Python implementation. The C API would have to change quite
a bit for that.

> * Introduce microthreads, declare that Python endorses Erlang's
> no-sharing approach to concurrency, and incorporate something like
> candygram into the stdlib.

We have cooperatively scheduled microthreads with ugly syntax (yield),
or more platform-specific and much less debuggable microthreads with
stackless or greenlets.

The missing part is the async message passing API and the libraries to
go with it. Erlang uses something a lot like pickle for this, but
Erlang only has about 8 types that are all immutable (IIRC: function,
binary, list, tuple, pid, atom, integer, float). Communication between
Erlang nodes requires a cookie (shared secret), which skirts around
security issues. You can definitely kill an Erlang node if you have
its cookie by flooding the atom table (atoms are like interned
strings), but that's not considered to be a problem in most deployment
scenarios.

> * Introduce a fork-and-coordinate wrapper in the stdlib, and declare
> that we're simply not going to support the use case that requires
> sharing (as opposed to merely passing) objects between processes.

What use case *requires* sharing? In a message passing system, usage
of shared memory is an optimization that you shouldn't care much about
as a user. Also, sockets are generally very fast over loopback.

IIRC, Erlang only does this with binaries > 64 bytes long across
processes on the same node (same pid, but not necessarily the same
pthread in an SMP build). HiPE might do some more aggressive
communication optimizations... but I think the general idea is that
sending a really big message to another process is probably the wrong
thing to do anyway.

-bob

From martin at v.loewis.de  Mon Sep 18 08:44:41 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 08:44:41 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <450E4059.1000806@v.loewis.de>

Andre Meyer schrieb:
> While I understand the difficulties in removing the GIL and the
> potential negative effect on single-threaded applications I would very
> much encourage discussion to seriously consider removing the GIL (maybe
> optionally) in Py3k. If not, what alternatives would you suggest?

Encouraging "very much" is probably not good enough to make anything
happen. Actual code contributions may, as may offering a bounty
(although it probably depends on the size of the bounty whether anybody
 wants to collect it).

The alternatives are very straight-forward:
1. use Python the same way as you did for Python 2.x. I.e. create
   many threads, and have only one of them run. Use the other processors
   for something else, or don't use them at all.
2. use Python the same way as many other people do. Don't use threads,
   instead use multiple processors, and some sort of IPC.
3. don't use Python, at least not for the activities that need to
   run on multiple processors.
If you want to fully use your multiple processors, depending on the
application, I'd typically go with option 2 or 3. Option 2 if the code
to parallelize is written in Python, option 3 if it is written in C
(yes, you can use multiple truly concurrent threads in Python: just
 release the GIL on the C level; you can't make any calls into Python
 until you reacquire the GIL).
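
For what it's worth, that last point is visible from pure Python: C
functions that release the GIL really do overlap. A small sketch
(assuming CPython's zlib, which drops the GIL while compressing):

import threading, time, zlib

data = 'x' * 10000000

def work():
    # zlib.compress runs in C with the GIL released, so two of these
    # can genuinely execute in parallel on two CPUs.
    zlib.compress(data)

start = time.time()
threads = [threading.Thread(target=work) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print 'elapsed:', time.time() - start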

Regards,
Martin

From jcarlson at uci.edu  Mon Sep 18 09:25:57 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Mon, 18 Sep 2006 00:25:57 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
Message-ID: <20060918001556.07E4.JCARLSON@uci.edu>


Ivan Krstic <krstic at solarsail.hcs.harvard.edu> wrote:
> * Bite the bullet; write and support a stdlib SHM primitive that works
> wherever possible, and simply doesn't work on completely broken
> platforms (I understand Windows falls into this category). Utilize it in
> a lightweight fork-and-coordinate wrapper provided in the stdlib.

Shared memory as an object store, or as IPC?  Either way, shared mmaps
offer shared memory on most platforms.  Which ones?  Windows, Linux,
OS X, Solaris, the BSDs, ... I would be surprised if IRIX, AIX, HP-UX and
other "big iron" OSes /didn't/ support shared mmaps.  Sure, you don't
get it on little embedded machines, but I'm not sure if we want to worry
about concurrency libraries there.
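
On the POSIX side the basic trick is only a few lines (a bare-bones
sketch; real use needs locking and an actual protocol, and Windows
would want a named map instead of fork):

import mmap, os, struct, time

# An anonymous shared map created before fork() is visible to both
# parent and child on POSIX systems; its pages start out zero-filled.
shm = mmap.mmap(-1, 4096)

pid = os.fork()
if pid == 0:
    # Child: poll offset 0 until the parent stores a nonzero value.
    while struct.unpack('i', shm[:4])[0] == 0:
        time.sleep(0.01)
    print 'child sees', struct.unpack('i', shm[:4])[0]
    os._exit(0)
else:
    shm[:4] = struct.pack('i', 42)   # parent pokes a value into the map
    os.waitpid(pid, 0)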


Alternatively, for platforms that support them, I have found that
synchronous Unix domain sockets can push about 3x as much as the
loopback interface: about 1 GByte/second on a 3 GHz Xeon, vs. around
350 MBytes/second for loopback TCP/IP.  I haven't tried using domain or
TCP/IP sockets purely for synchronization ("check the mmap at offset X,
length Y" messages), but I would imagine that would be competitive.


 - Josiah


From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 09:38:38 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 03:38:38 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>
Message-ID: <450E4CFE.8080209@solarsail.hcs.harvard.edu>

Bob Ippolito wrote:
> Candygram is heavyweight by trade-off, not because it has to be.
> Candygram could absolutely be implemented efficiently in current
> Python if a Twisted-like style was used. 

Specifically?

>> * Bite the bullet; write and support a stdlib SHM primitive that works [..]
>> a lightweight fork-and-coordinate wrapper provided in the stdlib.
> 
> I really don't think that's the right approach. If we're going to
> bother supporting distributed processing, we might as well support it
> in a portable way that can scale across machines.

Fork-and-coordinate is a specialized case of distribute-and-coordinate.
Other d-a-c mechanisms can be provided, including those that utilize
some form of RPC as a transport. SHM is orthogonal to all of this.

Note that scaling across machines is only equivalent to scaling across
CPUs in the simple case; in more complicated cases, there's a lot of
glue involved that grid frameworks like Boinc provide. If we end up
shipping any cross-machine abilities in the stdlib, we'd have to make
sure it's clear that we're not attempting to provide a grid framework,
just the plumbing that someone could use to build one.

>> * Bite the mortar shell, and remove the GIL.
> 
> This really isn't even an option because we're not throwing away the
> current C Python implementation. The C API would have to change quite
> a bit for that.

Hence 'mortar shell'. It can be done, but I think Guido's been pretty
clear on it not happening anytime soon.

> We have cooperatively scheduled microthreads with ugly syntax (yield),
> or more platform-specific and much less debuggable microthreads with
> stackless or greenlets.

Right. This is why I'm not sure we want to be recommending either as
`the` Python way to do concurrency.

> What use case *requires* sharing? 

Strictly speaking, it's always avoidable. But in setup-heavy systems,
avoiding SHM is a massive and costly pain. Consider web applications;
ideally, you can preload one copy of all of your translations, database
information, and other static information, into RAM -- and have worker
threads do reads from this table as they're processing individual
requests. Without SHM, you'd have to either duplicate the static set in
memory for each CPU, or make individual requests for each desired piece
of information to the master process that keeps the static set in RAM.

I've seen a number of computationally-bound systems that require an
authoritative copy of a (large) dataset in RAM, and are OK with paying
the cost of a read waiting on a lock during a write (and since writes
only happen at the completion of complex calculations, they generally
want to use locking like that provided by brlocks in the Linux kernel).
All of this is workable without SHM, but some of it gets really unwieldy.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From ironfroggy at gmail.com  Mon Sep 18 09:45:05 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Mon, 18 Sep 2006 03:45:05 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E4CFE.8080209@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>
	<450E4CFE.8080209@solarsail.hcs.harvard.edu>
Message-ID: <76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com>

On 9/18/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> > What use case *requires* sharing?
>
> Strictly speaking, it's always avoidable. But in setup-heavy systems,
> avoiding SHM is a massive and costly pain. Consider web applications;
> ideally, you can preload one copy of all of your translations, database
> information, and other static information, into RAM -- and have worker
> threads do reads from this table as they're processing individual
> requests. Without SHM, you'd have to either duplicate the static set in
> memory for each CPU, or make individual requests for each desired piece
> of information to the master process that keeps the static set in RAM.
>
> I've seen a number of computationally-bound systems that require an
> authoritative copy of a (large) dataset in RAM, and are OK with paying
> the cost of a read waiting on a lock during a write (and since writes
> only happen at the completion of complex calculations, they generally
> want to use locking like that provided by brlocks in the Linux kernel).
> All of this is workable without SHM, but some of it gets really unwieldy.

So reload the information you want available to worker tasks and pass
that information along to them, or provide a mechanism for them to
request it from its preloaded housing. Shared memory isn't the only or
best way to share resources between the tasks involved. Very rarely
would any worker task need more than a few rows of any large preloaded
table.

Alternatively, one could say you don't usually want any preloaded data,
because there is simply too much information to preload, and reusable
worker tasks can provide their own, more effectively targeted caches.
One might even consider some setup whereby worker threads report
their cache contents to a controller, which distributes tasks to
workers it knows already have the required information cached.

But in the end, you have to realize this is all at a higher level than
we would really need to even consider for the discussion at hand.

From dialtone at divmod.com  Mon Sep 18 09:50:26 2006
From: dialtone at divmod.com (Valentino Volonghi aka Dialtone)
Date: Mon, 18 Sep 2006 09:50:26 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <20060917180402.07DB.JCARLSON@uci.edu>
Message-ID: <20060918075026.1717.707665485.divmod.quotient.52675@ohm>

On Sun, 17 Sep 2006 18:18:32 -0700, Josiah Carlson <jcarlson at uci.edu> wrote:
>    import autorpc
>    caller = autorpc.init_processes(autorpc.num_processors())
>
>    import callables
>    caller.register_module(callables)
>
>    result = caller.fcn1(arg1, arg2, arg3)
>
>The point is not to compare API/etc., with threading, but to compare it
>with XMLRPC.  Because ultimately, what I would like to see, is a
>mechanic similar to XMLRPC; call a method on an instance, that is
>automatically executed perhaps in some other thread in some other
>process, or maybe even in the same thread on the same process (depending
>on load, etc.), and which returns the result in-place.

I've written something similar, taking inspiration from axiom.batch (from divmod.org). The result is the following code:

import sys, os

from twisted.internet import reactor
from twisted.protocols import amp
from twisted.internet import protocol
from twisted.python import log
from epsilon import process

# These are the Commands, they are needed to call remote methods safely.
class Sum(amp.Command):
    arguments = [('a', amp.Integer()),
                 ('b', amp.Integer())]
    response = [('total', amp.Integer())]

class StopReactor(amp.Command):
    arguments = [('delay', amp.Integer())]
    response = [('status', amp.String())]

# This is the class that tells the RPC exposed methods and their
# implementation
class JustSum(amp.AMP):
    def sum(self, a, b):
        total = a + b
        log.msg('Did a sum: %d + %d = %d' % (a, b, total))
        return {'total': total}
    Sum.responder(sum)

    def stop(self, delay):
        reactor.callLater(delay, reactor.stop)
        return {'status': 'scheduled'}
    StopReactor.responder(stop)

# Various stuff needed to use AMP over stdin/stdout/stderr with a child 
# process
class AMPConnector(protocol.ProcessProtocol):
    def __init__(self, proto, controller):
        self.amp = proto
        self.controller = controller

    def connectionMade(self):
        log.msg("Subprocess started.")
        self.amp.makeConnection(self)
        self.controller.childProcessCreated()

    # Transport
    disconnecting = False

    def write(self, data):
        self.transport.write(data)

    def writeSequence(self, data):
        self.transport.writeSequence(data)

    def loseConnection(self):
        self.transport.loseConnection()

    def getPeer(self):
        return ('omfg what are you talking about',)

    def getHost(self):
        return ('seriously it is a process this makes no sense',)

    def inConnectionLost(self):
        log.msg("Standard in closed")
        protocol.ProcessProtocol.inConnectionLost(self)

    def outConnectionLost(self):
        log.msg("Standard out closed")
        protocol.ProcessProtocol.outConnectionLost(self)

    def errConnectionLost(self):
        log.msg("Standard err closed")
        protocol.ProcessProtocol.errConnectionLost(self)

    def outReceived(self, data):
        self.amp.dataReceived(data)

    def errReceived(self, data):
        log.msg("Received stderr from subprocess: " + repr(data))

    def processEnded(self, status):
        log.msg("Process ended")
        self.amp.connectionLost(status)
        self.controller.childProcessTerminated(status)

# Here you write the code that uses the commands above.
class ProcessController(object):
    def childProcessCreated(self):
        def _cb(result):
            print result
            d = self.child.callRemote(StopReactor, delay=0)
            d.addErrback(lambda _: reactor.stop())
        def _eb(error):
            print error
        d = self.child.callRemote(Sum, a=4, b=5)
        d.addCallback(_cb)
        d.addErrback(_eb)
        
    def childProcessTerminated(self, status):
        print status

    def startProcess(self):
        executable = '/Library/Frameworks/Python.framework/Versions/2.4/bin/python'
        env = os.environ
        env['PYTHONPATH'] = os.pathsep.join(sys.path)
        self.child = JustSum()
        self.connector = AMPConnector(self.child, self)
        
        args = (
            executable,
            '/usr/bin/twistd',
            '--logfile=/Users/dialtone/Projects/python/twist/sub.log',
            '--pidfile=/Users/dialtone/Projects/python/twist/sub.pid',
            '-noy',
            '/Users/dialtone/Projects/python/twist/sub.tac')
        self.process = process.spawnProcess(self.connector, executable, args, env=env)

if __name__ == '__main__':
    p = ProcessController()
    reactor.callWhenRunning(p.startProcess)
    reactor.run()

If you exclude the 'boilerplate' code, you end up with a ProcessController
class and the three other classes that define the RPC. Since the parent
process uses self.child = JustSum(), it also exposes the same API to the
child, which will be able to call any of the methods.
The style above may not be immediately obvious to people not used to Twisted,
but other than that it's not really hard to abstract a bit more and provide
an API similar to what you described.

HTH

From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 10:06:59 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 04:06:59 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450E34EF.3090202@solarsail.hcs.harvard.edu>	
	<6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>	
	<450E4CFE.8080209@solarsail.hcs.harvard.edu>
	<76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com>
Message-ID: <450E53A3.6050309@solarsail.hcs.harvard.edu>

Calvin Spealman wrote:
> So reload the information you want available to worker tasks and pass
> that information along to them, or provide a mechanism for them to
> request it from its preloaded housing. 

With large sets, you can't afford duplicate copies in memory, so there's
nothing to reload. I specifically mentioned providing a mechanism for
retrieving individual pieces of information from the master process, but
if you're doing lots of reads, this introduces complexity and overhead
that's best avoided. Maybe it doesn't matter; with an appropriately nice
interface for it in the distribute-and-coordinate wrapper, we might be
able to hide the complexity from the programmer and use the best
available IPC mechanism in the background to ferry the requests. Sync
domain sockets are certainly fast enough, even though you're again
unnecessarily duplicating parts of memory for each of your workers.

> Alternatively, one could say you dont usually want any preloaded data
> because there is simply too much information to preload and reusable
> worker tasks can provide their own, more effectively targetted caches.

I'm talking about real-world problems, where this most often doesn't work.

> But in the end, you have to realize this is all at a higher level than
> we would really need to even consider for the discussion at hand.

I was answering a direct question.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From paul at prescod.net  Mon Sep 18 10:19:40 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 18 Sep 2006 01:19:40 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
Message-ID: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>

On 9/17/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:

>
> At present, the Python approach to multi-processing sounds a bit like
> "let's stick our collective hands in the sand and pretend there's no
> problem". In particular, one oft-parroted argument says that it's not
> worth changing or optimizing the language for the few people who can
> afford SMP hardware. In the meantime, dual-core laptops are becoming the
> standard, with Intel predicting quad-core will become mainstream in the
> next few years, and the number of server orders for single-core, UP
> machines is plummeting.


I agree with you Ivan.

Even if I won't contribute code or even a design to the solution (because it
isn't an area of expertise and I'm still working on encodings stuff) I think
that there would be value in saying: "There's a big problem here and we
intend to fix it in Python 3000."

When you state baldly that something is a problem you encourage the
community to experiment with and debate solutions. But I have been in the
audience at Python conferences where the majority opinion was that Python
had no problem around multi-processor apps because you could just roll your
own IPC on top of processes.

If you have to roll your own, that's a problem. If you have to select
between five solutions with really subtle tradeoffs, that's a problem too.

Ivan: why don't you write a PEP about this?



> * Bite the bullet; write and support a stdlib SHM primitive that works
> wherever possible, and simply doesn't work on completely broken
> platforms (I understand Windows falls into this category). Utilize it in
> a lightweight fork-and-coordinate wrapper provided in the stdlib.


Such a low-level approach will not fly. Not just because of Windows but also
because of Jython and IronPython. But maybe I misunderstand it in general.
Python does not really have an abstraction as low-level as "memory" and I don't
see why we would want to add it.

> * Introduce microthreads, declare that Python endorses Erlang's
> no-sharing approach to concurrency, and incorporate something like
> candygram into the stdlib.
>
> * Introduce a fork-and-coordinate wrapper in the stdlib, and declare
> that we're simply not going to support the use case that requires
> sharing (as opposed to merely passing) objects between processes.


I'm confused on a few levels.

 1. "No sharing" seems to be a feature of both of these options, but the
wording you use to describe it is very different.

 2. You're conflating API and implementation in a manner that is unclear to
me. Why are microthreads important to the Erlang model and what would the
API for fork-and-coordinate look like?

Since you are fired up about this now, would you consider writing a PEP at
least outlining the problem persuasively and championing one of the
(feasible) options? This issue has been discussed for more than a decade and
the artifacts of previous discussions can be quite hard to find.

 Paul Prescod

From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 11:07:56 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 05:07:56 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
Message-ID: <450E61EC.40507@solarsail.hcs.harvard.edu>

Paul Prescod wrote:
> I think that there would be value in saying: "There's a
> big problem here and we intend to fix it in Python 3000."

I'm not at all convinced that this is something to be addressed in 3.0.
Py3k is about removing cruft, not adding features; a proper MP system
represents a feature addition that might be more appropriate for 2.6 or
post-3.0 if it gets horribly drawn out.

> Ivan: why don't you write a PEP about this?

I'd like to hear Guido's overarching thoughts on the matter, if any, and
would afterwards be happy to write a PEP.

> Python does not really have an abstraction as low-level as
> "memory" and I don't see why we would want to add it.

You don't need a special abstraction; a library adding primitives like
SHMlist and SHMdict would be fully adequate. Arbitrary objects could
decide to react to specific getattr/setattr calls by peeking and
poking at the primitives. A SHMpickle mechanism could be used to stuff
existing objects into SHM, and then create relevant proxies.
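
The proxies could stay quite small. A hypothetical shape, with a plain
dict standing in for the (nonexistent) SHMdict primitive:

import cPickle as pickle

class SHMProxy(object):
    # React to attribute access by peeking and poking at the primitive.
    def __init__(self, store):
        object.__setattr__(self, '_store', store)

    def __getattr__(self, name):
        return pickle.loads(self._store[name])     # peek

    def __setattr__(self, name, value):
        self._store[name] = pickle.dumps(value)    # poke

store = {}                 # would really be an SHMdict in shared memory
cfg = SHMProxy(store)
cfg.timeout = 30
print cfg.timeout          # 30, round-tripped through the 'SHM' store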

>  1. "No sharing" seems to be a feature of both of these options, but the
> wording you use to describe it is very different.

Erlang's shared-nothing is a conviction: "you don't need to share things
to get good concurrent operation". The alternative I mention is to
declare "well, we recognize the need for shared-something, but are
pointedly not providing the functionality". An irrelevant difference for
the most part.

>  2. You're conflating API and implementation in a manner that is unclear
> to me. Why are microthreads important to the Erlang model

They're not; Candygram proves as much. But the Erlang model was designed
with the idea that threads ("processes") cost almost nothing, and if
threads instead cost at least a full stack allocation, it's easy to get
into hot water.

> and what would
> the API for fork-and-coordinate look like?

I'm not going to try and design an API at 5:05AM. I'll think about this
in the next few days, and stick it in the PEP after Guido chimes in.

> Since you are fired up about this now, would you consider writing a PEP
> at least outlining the problem persuasively and championing one of the
> (feasible) options? 

Sure.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From ncoghlan at gmail.com  Mon Sep 18 12:47:08 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 18 Sep 2006 20:47:08 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450D4AAE.2000805@gmail.com>
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
Message-ID: <450E792C.1070105@gmail.com>

Brett Cannon wrote:
> On 9/17/06, *Nick Coghlan* <ncoghlan at gmail.com 
>        - use threads to perform blocking I/O in parallel
>        - use multiple interpreters to perform Python execution in parallel
> 
> 
> Possibly, but as it stands now interpreters just execute in their own 
> Python thread, so there is no real performance boost.  Without the GIL 
> shifting over to per interpreter instead of per process there is going 
> to be the same performance problems as with Python threads.  And 
> changing that would be  hard since objects can be shared between 
> multiple interpreters.

I was thinking it would be easier to split the locking into a Global
Interpreter Lock plus a per-interpreter Local Interpreter Lock, rather than
trying to go to a full free-threading model. Anyone sharing other objects
between interpreters would
still need their own synchronisation mechanism, but something like 
threading.Queue should suffice for that.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Mon Sep 18 13:04:57 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 18 Sep 2006 21:04:57 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <1158504672.28528.82.camel@fsol>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	<450D4AAE.2000805@gmail.com>
	<1158504672.28528.82.camel@fsol>
Message-ID: <450E7D59.90706@gmail.com>

Antoine Pitrou wrote:
> On Sunday 17 September 2006 at 23:16 +1000, Nick Coghlan wrote:
>> Brett Cannon's sandboxing work (which aims to provide first-class support for 
>> multiple interpreters in the same process for security reasons) also seems 
>> like a potentially fruitful approach to distributing processing to multiple cores:
>>    - use threads to perform blocking I/O in parallel
> 
> OTOH, the Twisted approach avoids all the delicate synchronization
> issues that arise when using threads to perform concurrent IO tasks.
> 
> Also, IO is by definition not CPU-intensive, so there is no point in
> distributing IO to multiple cores (and it could even cause a small
> decrease in performance because of inter-CPU communication overhead).

Yeah, I made a mistake. The distinction is whether the CPU-intensive
task is written in C/C++ or in Python - threads already work fine for the former,
but something new is needed for the latter (either better IPC support or 
better in-process multi-interpreter support that uses an interpreter-specific 
lock for interpreter-specific data structures, reserving the GIL for shared 
state).

Cheers,
Nick.



-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From talin at acm.org  Mon Sep 18 15:38:35 2006
From: talin at acm.org (Talin)
Date: Mon, 18 Sep 2006 06:38:35 -0700
Subject: [Python-3000] Ruminations on switch, constants, imports, etc.
Message-ID: <450EA15B.1050302@acm.org>

"Ruminating" is the best word I can think of here - I've been slowly 
digesting the ideas and discussions over the last couple months. Part of 
the reason why all this is relevant to me is that I'm working on a 
couple of side projects, some of which involve "mini-languages" that 
have similar issues to what has been discussed on the list.

Bear in mind that none of what I say here is a recommendation of where 
Python *should* go - rather, it's a description of where I have 
*already* gone in some of the work that I am doing. It is merely one 
possible answer out of many to the suggestions that have been put 
forward in this forum.

(And if this is merely a rehash of something that was discussed long 
ago, I apologize.)

I'll start with the 'switch' discussion (you'll note that the ideas here 
cut across a bunch of different threads.) The controversy, for which 
there seemed to be no final resolution, had to do with when the case 
values should be evaluated (I won't repeat the descriptions of the 
various schools - look in the PEP.)

As some people pointed out, the truly fundamental issue has to do with 
the nature of constants in Python, which is to say that currently, there 
are none - other than literals.

What would be entailed in adding a const type? Without fundamentally 
changing the language, the best you can do is early binding, in other 
words the const value isn't known at compilation time, but is a variable 
that is frozen at some point - perhaps at module load time.
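
Python can already fake that kind of load-time freezing with the
default-argument trick, for whatever that's worth:

TIMEOUT = 30

def wait_time(_timeout=TIMEOUT):
    # _timeout was evaluated exactly once, when 'def' executed at
    # module load time; rebinding TIMEOUT afterwards is not seen here.
    return _timeout

TIMEOUT = 60
print wait_time()    # still 30: the value was frozen at load time

But that is a per-function idiom, not a language-level 'const'.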

Adding a true const - that is, a variable whose value is known to the 
compiler, and can be optimized as such - is somewhat more involved. For 
one thing, the compiler knows nothing about external modules, or 
anything outside the current compilation unit. Without the ability to 
import external definitions, a compile-time 'const' is quite useless.

One way around this (which is a little kludgey) would be to add a second 
type of 'import' - a compile-time import, which could be called 
something like 'include' or 'using'.

An import of this type would act as if it had been textually included in 
the current source file. It would become part of the current compilation 
unit, and it would have the same restrictions - such as the inability to 
access variables imported via 'import' at compile time.

Include files can of course include other files - but they can also 
'import' as well. The effect of importing from an include is the
same as importing from the primary source file (because of the rule
which states that 'include' is equivalent to textual inclusion.)

Conversely, imported files can include - however the effect of the 
inclusion is limited to the imported file only, and does not affect the 
primary source file (because it's a different compilation unit.)

This implies that you can't access constants that are within an imported 
module (because the constant definitions exist only within the compiler 
- they are transformed into literals before code generation occurs.)

If a source file and an included module need to share constant values, 
they must each include the definitions of those constants. This can lead 
to problems if the include files are changing - one file might be 
compiled with a different version of the include file than another. 
OTOH, many potential uses for constants would be for things like 
operating system error codes and flags, which are fairly stable and 
unchanging -- so even a restricted use of the facility (i.e. don't use 
it for values which are in flux) might be worth while. Another 
possibility is to embed include checksums or other version info within 
the compiled file.

So far, it seems like a lot of added complexity for fairly little 
benefit. However, where it gets interesting is when you realize that 
once you've given the compiler knowledge of the world outside a single 
compilation unit, a number of interesting possibilities arise.

The one I've been experimenting with in my mini-language is macros. Not 
macros in the C sense, but in the lisp sense - a function which takes 
unevaluated arguments. Actually, they more closely resemble Dylan 
macros, in that they add production rules to the parser. Internally a 
macro is a small snippet of an AST, which gets spliced into the AST of 
the calling function at compile time. Macro arguments can be 
identifiers, expressions, and statements, all of which get substituted 
into the appropriate point in the AST.
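
In today's CPython one can at least approximate the flavor of this
with the ast helpers (available in later releases): parse a template,
substitute a placeholder Name node, and compile the spliced result.
A sketch:

import ast

template = ast.parse("RESULT = A + A")       # the macro body, as an AST

class Splice(ast.NodeTransformer):
    # Replace the placeholder identifier 'A' with a literal argument.
    def visit_Name(self, node):
        if node.id == 'A':
            return ast.copy_location(ast.Num(21), node)
        return node

tree = Splice().visit(template)
ast.fix_missing_locations(tree)
ns = {}
exec compile(tree, '<macro>', 'exec') in ns
print ns['RESULT']                           # 42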

This is of course going way past Guido's "Programmable Syntax" 
prohibition. Good thing I am only talking hypothetically!

It seems to me, however, (getting back to 'switch') that supporting a 
proper 'switch' statement has to address these issues in *some* fashion 
- the issue of constants isn't going to go away completely under any of 
the proposed approaches.

In fact, I'm actually leaning towards the position of *not* adding a 
switch statement to Python, simply because I'm not sure that Python 
*should* deal with all of these issues. It seems to me that adding 
'const' to the language opens up a Pandora's box, containing both chaos 
and hope - and I think that for some language X, it may be a good idea 
to open that box, but I don't know if X includes Python.

-- Talin


From rhamph at gmail.com  Mon Sep 18 17:48:56 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 09:48:56 -0600
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>

I think all the attempts to expose GIL-less semantics to python code
miss the point.  Reference counting turns all references into
modifications.  You can't avoid the GIL without first changing
reference counting.

There are a few ways to approach this:
* atomic INCREF/DECREF using CPU instructions.  This would be very
expensive, considering how often we do it.
* Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
by the C standards and changes cache characteristics that CPython has
been designed with for years, likely with a very large performance
penalty.
* Tracing GC within C.  Would require rewriting every API in CPython,
as well as the code that uses them.  Alternative implementations
(PyPy, et al) can try this, but I think it's clear that it's not worth
the effort for CPython, especially given the performance risks.
* Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a
buffer, then flush them all at once).  In theory, it would retain the
cache locality while amortizing locking needed for SMP machines.

For the most part delayed reference counting should require no
changes, since it would use the existing INCREF/DECREF API.  Some code
does circumvent that API, and would need to be changed.
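
In Python-as-pseudocode, the buffering amounts to something like this
(one buffer per thread is assumed; the real thing would of course be
C, keyed off the object address):

from __future__ import with_statement
import threading

class DelayedRefcounts(object):
    FLUSH_AT = 20

    def __init__(self):
        self.counts = {}             # id(obj) -> shared reference count
        self.pending = []            # buffered (delta, id(obj)) ops
        self.lock = threading.Lock() # taken only when a batch flushes

    def incref(self, obj):
        self._buffer(+1, obj)

    def decref(self, obj):
        self._buffer(-1, obj)

    def _buffer(self, delta, obj):
        self.pending.append((delta, id(obj)))
        if len(self.pending) >= self.FLUSH_AT:
            self.flush()

    def flush(self):
        # One lock acquisition amortized over FLUSH_AT operations.
        # Summing deltas first means a matched +1/-1 pair never
        # touches the shared counts at all.
        totals = {}
        for delta, key in self.pending:
            totals[key] = totals.get(key, 0) + delta
        del self.pending[:]
        with self.lock:
            for key, delta in totals.items():
                if delta:
                    self.counts[key] = self.counts.get(key, 0) + delta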

Anyway, my point is that, for those of you out there who want to
remove the GIL, here is something you really can experiment with.
Even if there were a 20% performance drop on real-world tests, you could
still make it a configure option, enabled only for people who need
many CPUs.  (I've tried it myself, but never got past the weird
crashes; I probably missed something silly.)

-- 
Adam Olsen, aka Rhamphoryncus

From jimjjewett at gmail.com  Mon Sep 18 17:56:51 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 18 Sep 2006 11:56:51 -0400
Subject: [Python-3000] Kill GIL? - to PEP 3099?
Message-ID: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>

Guido -- If I'm not mis-stating, this might be a candidate for PEP 3099.

On 9/18/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> Paul Prescod wrote:

> > Ivan: why don't you write a PEP about this?

> I'd like to hear Guido's overarching thoughts on the matter, if any, and
> would afterwards be happy to write a PEP.

IIRC, his most recent statements boiled down to:

(1)  The GIL works well enough, most of the time.
(2)  Taking it out is harder than people realize.
(3)  Therefore, he won't spend too much time rethinking unless/until
there is code to evaluate.

-jJ

From phd at phd.pp.ru  Mon Sep 18 18:02:33 2006
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Mon, 18 Sep 2006 20:02:33 +0400
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
Message-ID: <20060918160232.GB30336@phd.pp.ru>

On Mon, Sep 18, 2006 at 11:56:51AM -0400, Jim Jewett wrote:
> IIRC, his most recent statements boiled down to:
> 
> (1)  The GIL works well enough, most of the time.

1a. On multiprocessor/multicore systems use processes, not threads.

> (2)  Taking it out is harder than people realize.
> (3)  Therefore, he won't spend too much time rethinking unless/until
> there is code to evaluate.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From qrczak at knm.org.pl  Mon Sep 18 18:27:12 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 18 Sep 2006 18:27:12 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	(Adam Olsen's message of "Mon, 18 Sep 2006 09:48:56 -0600")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
Message-ID: <8764fl87j3.fsf@qrnik.zagroda>

"Adam Olsen" <rhamph at gmail.com> writes:

> * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> by the C standards and changes cache characteristics that CPython has
> been designed with for years, likely with a very large performance
> penalty.

Last time I did some GC benchmarks (unrelated to Python), Boehm GC
came up surprisingly fast. I suppose it's faster than malloc +
reference counting (not sure how much amortizing malloc calls helps).

I don't like the idea of a conservative GC at all in general, but
Boehm GC seems to have very good quality, and it's easy to use from
the point of view of a C API.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Mon Sep 18 18:45:52 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 18 Sep 2006 09:45:52 -0700
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
Message-ID: <1cb725390609180945r64f2cfafv33a801abe42452b4@mail.gmail.com>

The thread subject notwithstanding, the majority of the discussion was about
ways to work around the GIL, not remove it. Therefore the thing you might
put in PEP 3099 is not the thing under active discussion.

On 9/18/06, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> Guido -- If I'm not mis-stating, this might be a candidate for PEP 3099.
>
> On 9/18/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> > Paul Prescod wrote:
>
> > > Ivan: why don't you write a PEP about this?
>
> > I'd like to hear Guido's overarching thoughts on the matter, if any, and
> > would afterwards be happy to write a PEP.
>
> IIRC, his most recent statements boiled down to:
>
> (1)  The GIL works well enough, most of the time.
> (2)  Taking it out is harder than people realize.
> (3)  Therefore, he won't spend too much time rethinking unless/until
> there is code to evaluate.
>
> -jJ
>

From rhamph at gmail.com  Mon Sep 18 19:11:51 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 11:11:51 -0600
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <8764fl87j3.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
Message-ID: <aac2c7cb0609181011x28c1b639weef05f356f75d9b6@mail.gmail.com>

On 9/18/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Adam Olsen" <rhamph at gmail.com> writes:
>
> > * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> > by the C standards and changes cache characteristics that CPython has
> > been designed with for years, likely with a very large performance
> > penalty.
>
> Last time I did some GC benchmarks (unrelated to Python), Boehm GC
> came up surprisingly fast. I suppose it's faster than malloc +
> reference counting (not sure how much amortizing malloc calls helps).

I expect Boehm would do very well in applications suited for it.  I
just don't think that includes CPython, especially with all the
third-party C libraries.


-- 
Adam Olsen, aka Rhamphoryncus

From barry at python.org  Mon Sep 18 19:40:19 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 18 Sep 2006 13:40:19 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <8764fl87j3.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
Message-ID: <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>

On Sep 18, 2006, at 12:27 PM, Marcin 'Qrczak' Kowalczyk wrote:

> "Adam Olsen" <rhamph at gmail.com> writes:
>
>> * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally  
>> unsupported
>> by the C standards and changes cache characteristics that CPython has
>> been designed with for years, likely with a very large performance
>> penalty.
>
> Last time I did some GC benchmarks (unrelated to Python), Boehm GC
> came up surprisingly fast. I suppose it's faster than malloc +
> reference counting (not sure how much amortizing malloc calls helps).
>
> I don't like the idea of a conservative GC at all in general, but
> Boehm GC seems to have very good quality, and it's easy to use from
> the point of view of a C API.

What worries me is the unpredictability of gc vs. refcounting.  For  
some class of Python applications it's important that when an object's
last reference is dropped it really goes away right then.  I /like/ reference
counting!

-Barry


From solipsis at pitrou.net  Mon Sep 18 20:38:00 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Mon, 18 Sep 2006 20:38:00 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
Message-ID: <1158604680.4726.9.camel@fsol>


On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote:
> * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> by the C standards and changes cache characteristics that CPython has
> been designed with for years, likely with a very large performance
> penalty.

Has anyone measured what cache effects reference counting entails?

With reference counting, each object is mutable from the point of view
of the CPU cache (refcnt is always incremented and later decremented).
This means almost every cache line containing Python objects - including
functions, modules... - has to be written back when it is evicted, even
if those objects are "constant".

> * Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a
> buffer, then flush them all at once).  In theory, it would retain the
> cache locality while amortizing locking needed for SMP machines.

You would have to lock the buffer, wouldn't you?
Unless you use per-CPU buffers.




From meyer at acm.org  Mon Sep 18 21:18:29 2006
From: meyer at acm.org (Andre Meyer)
Date: Mon, 18 Sep 2006 21:18:29 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E4059.1000806@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E4059.1000806@v.loewis.de>
Message-ID: <7008329d0609181218v64ca1465s14cefe3cc91b67a8@mail.gmail.com>

I would love to contribute code for this problem. Unfortunately, I am not
able to do so, but see the problem for myself and others. Therefore, I
wanted to raise the question in time for Py3k. The number of responses
indicates that it is not just me who struggles and there are people who know
how to improve the situation. So far, this seems like a fruitful discussion.

Thanks
Andre


On 9/18/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
>
> Andre Meyer schrieb:
> > While I understand the difficulties in removing the GIL and the
> > potential negative effect on single-threaded applications I would very
> > much encourage discussion to seriously consider removing the GIL (maybe
> > optionally) in Py3k. If not, what alternatives would you suggest?
>
> Encouraging "very much" is probably not good enough to make anything
> happen. Actual code contributions may, as may offering a bounty
> (although it probably depends on the size of the bounty whether anybody
> wants to collect it).
>
> The alternatives are very straight-forward:
> 1. use Python the same way as you did for Python 2.x. I.e. create
>    many threads, and have only one of them run. Use the other processors
>    for something else, or don't use them at all.
> 2. use Python the same way as many other people do. Don't use threads,
>    instead use multiple processors, and some sort of IPC.
> 3. don't use Python, at least not for the activities that need to
>    run on multiple processors.
> If you want to fully use your multiple processors, depending on the
> application, I'd typically go with option 2 or 3. Option 2 if the code
> to parallelize is written in Python, option 3 if it is written in C
> (yes, you can use multiple truly concurrent threads in Python: just
> release the GIL on the C level; you can't make any calls into Python
> until you reacquire the GIL).
>
> Regards,
> Martin
>



-- 
Dr. Andre P. Meyer                        http://python.openspace.nl/meyer
TNO Defence, Security and Safety          http://www.tno.nl/
Delft Cooperation on Intelligent Systems  http://www.decis.nl/

Ah, this is obviously some strange usage of the word 'safe' that I wasn't
previously aware of. - Douglas Adams
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060918/32b0dfae/attachment.htm 

From jimjjewett at gmail.com  Mon Sep 18 21:27:02 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 18 Sep 2006 15:27:02 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <1158604680.4726.9.camel@fsol>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<1158604680.4726.9.camel@fsol>
Message-ID: <fb6fbf560609181227w2066144r7c60f436ad48ff0a@mail.gmail.com>

On 9/18/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote:
> > * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> > by the C standards and changes cache characteristics that CPython has
> > been designed with for years, likely with a very large performance
> > penalty.

> Has anyone measured what cache effects reference counting entails?

Probably not recently.

> With reference counting, each object is mutable from the point of view
> of the CPU cache (refcnt is always incremented and later decremented).

But each object access touches only one piece of memory, not two (the
object and a separate header).

Just a reminder about Neil Schemenauer's (old) patch to use Boehm-Demers

http://mail.python.org/pipermail/python-list/1999-July/thread.html#7638
http://arctrix.com/nas/python/gc/
http://people.csail.mit.edu/gregs/ll1-discuss-archive-html/threads.html#00056

According to http://codespeak.net/pypy/dist/pypy/doc/getting-started.html
PyPy sometimes translates to the use of BDW.

I also seem to remember (but can't find a reference) that someone
tried using a separate immortal namespace for basic objects like None,
but the hassle of deciding what to do on each object ate up the
savings.

-jJ

From rhettinger at ewtllc.com  Mon Sep 18 21:56:27 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Mon, 18 Sep 2006 12:56:27 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>

[Marcin 'Qrczak' Kowalczyk]
> I don't like the idea of a conservative GC at all in general, but
> Boehm GC seems to have very good quality, and it's easy to use from
> the point of view of a C API.

Several thoughts:

* An easier C API would significantly benefit the language in terms of
more extensions being available and in terms of increased reliability
for those extensions.  The current refcount scheme results in pervasive
refleak bugs and subsequent, interminable bughunts.  It adds to code
verbosity/complexity and makes it tricky for beginning extension writers
to get their first apps done correctly.  IOW, I agree that GC without
refcounts will make it easier to write good C code.

* I doubt the anecdotal comments about Boehm GC with respect to
performance.  It may be better or it may be worse.  While I think the
latter is more likely, only an implementation patch will tell the tale.

* At my company, we write real-time apps that benefit from the current
refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
can be implemented without periodically killing responsiveness.



[Barry Warsaw]
> What worries me is the unpredictability of gc vs. refcounting.
> For some class of Python applications it's important that when
> an object is dereferenced it really goes away right then.  
> I /like/ reference counting!

No doubt that those exist; however, that sort of design is somewhat
fragile and bug-prone, leading to endless sessions to find out who or what
is keeping an object alive.  This situation can only get worse when
new-style classes become the norm.  Also, IIRC, __del__ has been one of
the more complex, bug-ridden, and dark corners of the
language.  Statistics incontrovertibly prove that people who habitually
avoid __del__ lead happier lives and spend fewer hours in therapy ;-)


Raymond 

From rhamph at gmail.com  Mon Sep 18 21:59:35 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 13:59:35 -0600
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <1158604680.4726.9.camel@fsol>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<1158604680.4726.9.camel@fsol>
Message-ID: <aac2c7cb0609181259s21468f0ev88a8f78815340264@mail.gmail.com>

On 9/18/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote:
> > * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> > by the C standards and changes cache characteristics that CPython has
> > been designed with for years, likely with a very large performance
> > penalty.
>
> Has anyone measured what cache effects reference counting entails?
>
> With reference counting, each object is mutable from the point of view
> of the CPU cache (refcnt is always incremented and later decremented).
> This means almost every cache line containing Python objects - including
> functions, modules... - has to be written back when it is evicted, even
> if those objects are "constant".

I don't think there's ever been any measuring, just theorizing based
on some general benchmarks.  For example, the cache line containing the
refcount is likely already loaded when the type pointer is loaded.

However, delayed reference counting could allow you to remove
incref/decref pairs, thereby avoiding the write entirely in some
cases.


>
> > * Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a
> > buffer, then flush them all at once).  In theory, it would retain the
> > cache locality while amortizing locking needed for SMP machines.
>
> You would have to lock the buffer, wouldn't you?
> Unless you use per-CPU buffers.

I'm assuming per-CPU buffers.  You'd need a global lock to flush them.
There are probably more creative schemes, but they couldn't be
implemented quite so simply.


-- 
Adam Olsen, aka Rhamphoryncus

From barry at python.org  Mon Sep 18 22:15:47 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 18 Sep 2006 16:15:47 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>

On Sep 18, 2006, at 3:56 PM, Raymond Hettinger wrote:

> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.  It adds to code
> verbosity/complexity and makes it tricky for beginning extension writers
> to get their first apps done correctly.  IOW, I agree that GC without
> refcounts will make it easier to write good C code.
>
> * I doubt the anecdotal comments about Boehm GC with respect to
> performance.  It may be better or it may be worse.  While I think the
> latter is more likely, only an implementation patch will tell the tale.
>
> * At my company, we write real-time apps that benefit from the current
> refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
> can be implemented without periodically killing responsiveness.

We'd be in the same boat.  While I agree with Raymond that it can be  
quite difficult to get C code to be refcount-correct, I wonder if  
there aren't tools or other debugging aids we can develop that will  
at least help debug when problems occur.  Not that I have any bright  
ideas here, but as an example, one of the things we do when our app  
exits (it's potentially long running, but never daemonic) is to  
stroll through the list of all live objects, checking their refcounts  
against expected values.  Of course we only do this in debug builds,  
but right now in our dev tree I'm looking at an issue where a central  
object has a few hundred more refcounts than expected at program exit.

The really tricky thing about refcounting is making sure all the exit  
conditions out of a function are refcount correct.  Usually these  
involve error or exception conditions, and they can be a bear to get  
right.  Makes you want to write the goto-considered-useful rant all
over again. :)  Would a garbage collection interface make this easier
(because you could ignore all that) or would you be trading that off
for things like gcpro in Emacs, which can be just as harmful if you
screw them up?
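
For the archives, the usual defensive shape is a single exit label,
something like this (a toy example against the real C API, not code
from our tree):

    static PyObject *
    add_and_pack(PyObject *a, PyObject *b)
    {
        PyObject *sum = NULL;
        PyObject *result = NULL;

        sum = PyNumber_Add(a, b);
        if (sum == NULL)
            goto done;              /* error exit: nothing owned yet */
        result = PyTuple_Pack(2, a, sum);
        /* fall through: sum must be released on success *and* failure */
    done:
        Py_XDECREF(sum);
        return result;
    }

It's manageable at this size; the bughunts start when a function has
ten exits and six owned references.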

-Barry

From jimjjewett at gmail.com  Mon Sep 18 22:33:09 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 18 Sep 2006 16:33:09 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
Message-ID: <fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>

On 9/18/06, Barry Warsaw <barry at python.org> wrote:

>  ... I agree with Raymond that it can be quite difficult to get
> C code to be refcount-correct, ...

How much of this (particularly for beginners) is remembering the
refcount effects of standard functions?  Could this be avoided by just
always using the more abstract interface?  (Sequence instead of List,
Mapping instead of Dict)

> The really tricky thing about refcounting is making sure all the exit
> conditions out of a function are refcount correct.  Usually these
> involve error or exception conditions, and they can be a bear to get
> right.

Would it solve this problem if there were a PyTEMPREF that magically
treated the refcount as an automatic variable?  (It increfed
immediately, and decrefed whenever the function exited, without the
user having to track this manually.)

Would it help enough to justify a pre-processing requirement?

-jJ

From rhamph at gmail.com  Mon Sep 18 22:34:44 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 14:34:44 -0600
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <aac2c7cb0609181334n36758748t9cfb4864bf7d9611@mail.gmail.com>

On 9/18/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
> [Adam Olsen]
> > I don't like the idea of a conservative GC at all in general, but
> > Boehm GC seems to have very good quality, and it's easy to use from
> > the point of view of a C API.

This was Marcin, not me ;)


> Several thoughts:
>
> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.  It adds to code
> verbosity/complexity and makes it tricky for beginning extension writers
> to get their first apps done correctly.  IOW, I agree that GC without
> refcounts will make it easier to write good C code.
>
> * I doubt the anecdotal comments about Boehm GC with respect to
> performance.  It may be better or it may be worse.  While I think the
> latter is more likely, only an implementation patch will tell the tale.

I have played with it before, on the CPython codebase.  I really can't
imagine it getting more than a minor speed boost, or else we'd already
be finding that refcounting was taking up a large portion of our CPU
time.  (Anybody have actual numbers on the time spent in malloc/free?)

The real advantage of Boehm is with threading.  Avoiding the locking
means you don't get the giant penalty you'd otherwise get.  Still not
inherently faster than a single-threaded program (which needs no
locking).

I discount Boehm because of its complexity and non-standardness,
though.  I'd never want to maintain it, especially since it would
affect all the libraries we link to as well.  Although, with suitable
proxying, it may be possible to limit it to just Python objects...

If I were to seriously consider a Python implementation with a tracing
GC, I'd want it to be a moving GC, to fix the high-water mark problem
of malloc.  That seems incompatible with conservative GCs such as
Boehm, although, come to think of it, I could do it using
standard-conforming C (if any API rewrite were permissible).


> * At my company, we write real-time apps that benefit from the current
> refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
> can be implemented without periodically killing responsiveness.

Boehm does have options for incremental GC.


> [Barry Warsaw]
> > What worries me is the unpredictability of gc vs. refcounting.
> > For some class of Python applications it's important that when
> > an object is dereferenced it really goes away right then.
> > I /like/ reference counting!
>
> No doubt that those exist; however, that sort of design is somewhat
> fragile and bug-prone, leading to endless sessions to find out who or what
> is keeping an object alive.  This situation can only get worse when
> new-style classes become the norm.  Also, IIRC, bugs involving __del__
> have been one of the more complex, buggy, and dark corners of the
> language.  Statistics incontrovertibly prove that people who habitually
> avoid __del__ lead happier lives and spend fewer hours in therapy ;-)

I agree here.  I think an executor approach is much better; kill the
object, then make a weakref callback do any further cleanups using
copies it made in advance.


-- 
Adam Olsen, aka Rhamphoryncus

From ronaldoussoren at mac.com  Mon Sep 18 22:45:35 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Mon, 18 Sep 2006 22:45:35 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>


On Sep 18, 2006, at 9:56 PM, Raymond Hettinger wrote:

>
> * I doubt the anecdotal comments about Boehm GC with respect to
> performance.  It may be better or it may be worse.  While I think the
> latter is more likely, only an implementation patch will tell the tale.

hear, hear ;-). Other anecdotal evidence says that a GC can be
significantly faster than manual allocation, especially a copying
collector where allocation can be really, really cheap. Boehm's GC
isn't a copying collector, but I wouldn't count it out just because
"everybody knows that GC is slow".

I'd be more worried about changes in semantics. It's pretty convenient
to write 'open(somefile, 'r').read()' to read a file in bulk; currently
this immediately closes the file, but with a GC system it may be a long
time before the file is actually closed.

Another reason to be scared of GC is some bad experience I've had
with Java's GC: it's rather annoying if you're a sysadmin, get a Java
app thrown over the wall, and then have to tweak obscure GC-related
parameters to get decent performance (or rather, an application that
doesn't crash after running for a couple of days). That may have been  
bad code in the application, but I'm not entirely convinced that  
Java's GC doesn't deserve to get some of the blame.

Ronald

From barry at python.org  Mon Sep 18 22:56:19 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 18 Sep 2006 16:56:19 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
Message-ID: <F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>

On Sep 18, 2006, at 4:33 PM, Jim Jewett wrote:

> On 9/18/06, Barry Warsaw <barry at python.org> wrote:
>
>>  ... I agree with Raymond that it can be quite difficult to get
>> C code to be refcount-correct, ...
>
> How much of this (particularly for beginners) is remembering the
> refcount affects of standard functions?  Could this be avoided by just
> always using the more abstract interface?  (Sequence instead of List,
> Mapping instead of Dict)

I think that may be part of it (I've mentioned this before), but our  
C API code wasn't written by beginners, and while we don't have any  
known refcounting problems in production code, during development one  
or two can slip through.  I don't think that the above is the major  
contributor.

>> The really tricky thing about refcounting is making sure all the exit
>> conditions out of a function are refcount correct.  Usually these
>> involve error or exception conditions, and they can be a bear to get
>> right.
>
> Would it solve this problem if there were a PyTEMPREF that magically
> treated the refcount as an automatic variable?  (It increfed
> immediately, and decrefed whenever the function exited, without the
> user having to track this manually.)
>
> Would it help enough to justify a pre-processing requirement?

I don't know, I hate macros. :)

<talking from="my ass">
It's been a long while since I programmed on the NeXT, so Mac folks  
here please chime in, but isn't there some Foundation idiom where  
temporary Objective-C objects didn't need to be explicitly released  
if their lifetime was exactly the duration of the function in which  
they were created?  ISTR something like the main event loop tracking  
such refcount=1 objects and deleting them automatically the next time  
through the loop.  Since Python has a main loop, I wonder if the same  
kind of trick couldn't be done here.

IOW, if you're just creating an object temporarily, you never need to  
explicitly decref it because the main eval loop would do it for you.   
Dang I wish I could remember the details.
</talking>

Something like that, where you didn't have to track all objects  
through all exit conditions would probably help.
-Barry

From martin at v.loewis.de  Mon Sep 18 23:11:03 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 23:11:03 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <1158604680.4726.9.camel@fsol>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<1158604680.4726.9.camel@fsol>
Message-ID: <450F0B67.2030405@v.loewis.de>

Antoine Pitrou wrote:
> Has anyone measured what cache effects reference counting entails?

I don't think so.

> With reference counting, each object is mutable from the point of view
> of the CPU cache (refcnt is always incremented and later decremented).
> This means almost every cache line containing Python objects - including
> functions, modules... - has to be written back when it is evicted, even
> if those objects are "constant".

Yes, though this is likely negligible compared to the overhead that
locking operations on refcount changes would have.

Regards,
Martin

From martin at v.loewis.de  Mon Sep 18 23:16:01 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 23:16:01 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <450F0C91.3090608@v.loewis.de>

Raymond Hettinger wrote:
> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.  It adds to code
> verbosity/complexity and makes it tricky for beginning extension writers
> to get their first apps done correctly.  IOW, I agree that GC without
> refcounts will make it easier to write good C code.

I don't think this will be the case. A garbage collector will likely
need to find out which local and global variables are pointers,
as well as the pointers hidden in C structures (at least if the
collector is going to be "precise").

So I think a Python with "true" GC will be much more error-prone
on the C level, with authors not getting the declarations of
variables right, and endless bug hunts because a referenced object
is already collected, and its memory overwritten.
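
To illustrate the burden, an extension function might end up looking
something like this, where GC_PROTECT2/GC_UNPROTECT are invented
macros standing in for whatever root registration a precise collector
would demand (compare Emacs's gcpro; error checks elided):

    static PyObject *
    build_list(void)
    {
        PyObject *list, *item;
        GC_PROTECT2(list, item);   /* register locals as GC roots */

        list = PyList_New(0);
        item = PyInt_FromLong(42);
        PyList_Append(list, item);

        GC_UNPROTECT();            /* forget this, and the collector may
                                      free or move objects this frame
                                      still uses */
        return list;
    }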

Regards,
Martin

From martin at v.loewis.de  Mon Sep 18 23:19:05 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 23:19:05 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
Message-ID: <450F0D49.4090606@v.loewis.de>

Ronald Oussoren wrote:
> hear, hear ;-). Other anecdotal evidence says that a GC can be
> significantly faster than manual allocation, especially a copying
> collector where allocation can be really, really cheap.

OTOH, it isn't typically faster than obmalloc (which also allocates
in constant time "on average").

Regards,
Martin

From rhettinger at ewtllc.com  Tue Sep 19 03:21:44 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Mon, 18 Sep 2006 18:21:44 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4696@hammer.office.bhtrader.com>

[Raymond Hettinger]
>> * At my company, we write real-time apps that benefit from the current
>> refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
>> can be implemented without periodically killing responsiveness.

[Jim Jewett]
> Do you effectively turn off cyclic collection (because refcounting
> reclaims enough), or is the current cyclic collector fast enough?

We turn off GC and code carefully.



Raymond

From greg.ewing at canterbury.ac.nz  Tue Sep 19 06:01:10 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 16:01:10 +1200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E792C.1070105@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com>
Message-ID: <450F6B86.5050902@canterbury.ac.nz>

Nick Coghlan wrote:

> I was thinking it would be easier to split out the Global Interpreter Lock and 
> a per-interpreter Local Interpreter Lock, rather than trying to go to a full 
> free-threading model. Anyone sharing other objects between interpreters would 
> still need their own synchronisation mechanism, but something like 
> threading.Queue should suffice for that.

I don't think that using an ordinary Queue object would
suffice for that, because it's designed on the assumption
that basic refcounting etc. is already protected by a GIL.

If nothing else, you'd need some kind of extra locking
mechanism to manage the refcount of the Queue object itself.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 06:52:22 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 16:52:22 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
Message-ID: <450F7786.3060501@canterbury.ac.nz>

Adam Olsen wrote:
> Reference counting turns all references into
> modifications.
> 
> There's a few ways to approach this:

I've just thought of another one: Instead of a single
refcount per object, each thread has its own set of
refcounts. Then the object has a count of the number
of threads that currently have nonzero refcounts for
it.

Most refcount operations would only affect the
thread's local refcount for the object. Only when that
reached zero would you need to lock the object and
update the global refcount.

Not sure what kind of data structure you'd use for
this, though...
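
One possible shape for that data, with the hard parts waved away
behind invented helper names:

    typedef struct {
        PyObject  *obj;       /* key: the object's address */
        Py_ssize_t count;     /* this thread's private refcount */
    } local_ref;              /* entry in a per-thread hash table */

    static void
    local_incref(PyObject *o)
    {
        local_ref *r = lookup_or_insert(current_thread_table(), o);
        if (r->count++ == 0)
            lock_and_inc_owners(o);   /* first reference in this thread */
    }

    static void
    local_decref(PyObject *o)
    {
        local_ref *r = lookup_or_insert(current_thread_table(), o);
        if (--r->count == 0) {
            /* last reference in this thread: lock the object, drop its
               count of owning threads, reclaim when that hits zero */
            if (lock_and_dec_owners(o) == 0)
                _Py_Dealloc(o);
        }
    }

The lookup is the worrying part: a hash probe on every incref could
easily cost more than the single write it replaces.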

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:20:00 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:20:00 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <450F7E00.7000304@canterbury.ac.nz>

Raymond Hettinger wrote:

> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.

It's not clear that a different scheme would be much
different, though. If it's not refcounting, there will
be some other set of rules that must be followed, with
equally obscure bugs if you slip up.

Also, at least half of the boilerplate is due to the
necessity of checking for errors at each step. A
different GC scheme wouldn't help with that.

IMO the only way to make writing C extensions truly
straightforward and non-error-prone is to use some
kind of code generation tool like Pyrex. And then
it doesn't matter how complicated the rules are.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:27:18 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:27:18 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
Message-ID: <450F7FB6.90709@canterbury.ac.nz>

Jim Jewett wrote:

> Would it solve this problem if there were a PyTEMPREF that magically
> treated the refcount as an automatic variable?  (It increfed
> immediately, and decrefed whenever the function exited, without the
> user having to track this manually.)

This would be wrong, because most functions return
new references, which should *not* be increfed when
assigned to a variable.

How would you implement that in C anyway? (C++ could
do it, but we're not going there, as far as I know.)

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:31:43 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:31:43 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609181334n36758748t9cfb4864bf7d9611@mail.gmail.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<aac2c7cb0609181334n36758748t9cfb4864bf7d9611@mail.gmail.com>
Message-ID: <450F80BF.5050203@canterbury.ac.nz>

We seem to have a situation where we have refcounting,
which incurs a small penalty many times, but which we're
willing to pay for the benefits it brings, and locking,
which in theory should also have only a small penalty
most of the time, not much bigger than refcounting.
But it seems we're not willing to pay both of these
penalties at once.

I'm wondering whether there's some way we could merge
the two -- i.e. somehow make the one mechanism serve
as both a refcounting *and* a locking mechanism at
the same time. A refcount is a count, and a semaphore
also has a count... is there some way we can make
use of that?

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:36:26 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:36:26 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
Message-ID: <450F81DA.6010204@canterbury.ac.nz>

Ronald Oussoren wrote:

> I'd be more worried about changes in semantics. It's pretty convenient
> to write 'open(somefile, 'r').read()' to read a file in bulk; currently
> this immediately closes the file, but with a GC system it may be a long
> time before the file is actually closed.

Another data point in favour of deterministic memory
management: I was working on a game recently involving
OpenGL and animation, and I found that I couldn't get
a smooth frame rate until I disabled cyclic GC, after
which everything was fine.

So I'd be unhappy if refcounting were removed and not
replaced with something equally unobtrusive in the case
where you don't create a lot of cycles.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From ronaldoussoren at mac.com  Tue Sep 19 07:46:42 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Tue, 19 Sep 2006 07:46:42 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
	<F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
Message-ID: <12CCD752-F18B-4DE0-91FC-D490C2E85421@mac.com>


On Sep 18, 2006, at 10:56 PM, Barry Warsaw wrote:

>
> I don't know, I hate macros. :)
>
> <talking from="my ass">
> It's been a long while since I programmed on the NeXT, so Mac folks
> here please chime in, but isn't there some Foundation idiom where
> temporary Objective-C objects didn't need to be explicitly released
> if their lifetime was exactly the duration of the function in which
> they were created?  ISTR something like the main event loop tracking
> such refcount=1 objects and deleting them automatically the next time
> through the loop.  Since Python has a main loop, I wonder if the same
> kind of trick couldn't be done here.

Objective-C, or rather Cocoa, uses reference counting but with a
twist. Cocoa has autorelease pools (class NSAutoreleasePool); any
object that is inserted into an autorelease pool gets its refcount
decreased when the pool is deleted. Furthermore, the main event loop
creates a new pool at the start of the loop and removes it at the
end, cleaning up all autoreleased objects.

Most Cocoa methods return borrowed references (which they can do
because of autorelease pools). When you know you won't hang onto an
object until after the current iteration of the event loop, you can
safely ignore reference counting. Only when you store a reference to
an object somewhere (such as in an instance variable) do you have to
worry about the refcount.

<rant class="mini">
The annoying part of Cocoa's refcounting scheme is that, unlike
Python, it doesn't have a GC to clean up loops. This causes several
parts of the Cocoa framework to ignore refcounts to avoid creating
loops, which is rather annoying when you write a bridge to Cocoa and
want to hide reference counting details.
</rant>

Ronald

P.S. Apple is switching to a non-reference counting GC scheme in OSX  
10.5 (http://www.apple.com/macosx/leopard/xcode.html). 

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:49:52 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:49:52 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
	<F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
Message-ID: <450F8500.5080102@canterbury.ac.nz>

Barry Warsaw wrote:

> It's been a long while since I programmed on the NeXT, so Mac folks  
> here please chime in, but isn't there some Foundation idiom where  
> temporary Objective-C objects didn't need to be explicitly released  
> if their lifetime was exactly the duration of the function in which  
> they were created?

I think you're talking about the autorelease mechanism.
It's a kind of delayed decref, the delay being until
execution reaches some safe place, usually the main event
loop of the application.

It exists because Cocoa mostly manages refcounts on a much
coarser-grained scale than Python. You don't normally
count all the temporary references created by parameters
and local variables, only "major" ones such as references
stored in an instance variable of an object. The problem
then is that an object might get released while in the
middle of executing one or more of its methods, and there
are still references to it in active stack frames. By
delaying the decref until returning to the main loop, all
these references have hopefully gone away by the time
the object gets freed.

You couldn't translate this scheme directly into Python,
because there are various differences in the way refcounts
are used. There's also not really any safe place to do
the delayed decrefs. The interpreter loop is *not* a safe
place, because there can be nested invocations of it,
with C stack frames outside the current one holding
references.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From mcherm at mcherm.com  Tue Sep 19 14:36:09 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Tue, 19 Sep 2006 05:36:09 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>

The following comments got me thinking:

Raymond:
> Statistics incontrovertibly prove that people who habitually
> avoid __del__ lead happier lives and spend fewer hours in therapy ;-)

Adam Olsen:
> I agree here.  I think an executor approach is much better; kill the
> object, then make a weakref callback do any further cleanups using
> copies it made in advance.

And of course similar sentiments have been expressed in many Python
discussions by many people over several years.

Since we're apparently still in "propose wild ideas" mode for Py3K
I'd like to propose that for Py3K we remove __del__. Not "fix" it,
not "tweak" it, just remove it and perhaps add a note in the manual
pointing people to the weakref module.

What'cha think folks? I'd love to hear an opinion from someone who
is a current user of __del__ -- I'm not.

-- Michael Chermside

From qrczak at knm.org.pl  Tue Sep 19 16:33:52 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 16:33:52 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	(Michael Chermside's message of "Tue, 19 Sep 2006 05:36:09 -0700")
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <87irjk7wof.fsf@qrnik.zagroda>

Michael Chermside <mcherm at mcherm.com> writes:

> Adam Olsen:
>> I agree here.  I think an executor approach is much better; kill the
>> object, then make a weakref callback do any further cleanups using
>> copies it made in advance.

I agree. Objects with finalizers with the semantics of __del__ are
inherently unsafe: they can be finalized when they are still in use,
namely when they are used by a finalizer of another object.

The correct way is to register a finalizer from outside the object,
such that it's invoked asynchronously when the associated object has
been garbage collected. Everything reachable from a finalizer is
considered live.

As far as I understand it, Python's weakrefs have mostly correct
semantics.

Finalizers must be invoked from a separate thread:
http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html

The finalizer should not access the associated object itself (or it
will never be invoked), but it should only access the parts of the
object and other objects that it needs.

Sometimes it is necessary to split an object into an outer part which
triggers finalization, and an inner part which is accessed by the
finalizer. Even though this looks inconvenient, this design is
necessary for building rock solid finalizable objects.

This design allows for the presence of a finalizer to be a private
implementation detail. __del__ methods don't have this property
because objects with finalizers are unsafe to use from other
finalizers.

Python documentation contains the following snippet:

"Starting with version 1.5, Python guarantees that globals whose name
begins with a single underscore are deleted from their module before
other globals are deleted; if no other references to such globals
exist, this may help in assuring that imported modules are still
available at the time when the __del__() method is called."

This is clearly a hack which just increases the likelihood that the
code works. A correct design allows one to write code which works in
100% of the cases.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 16:42:18 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 16:42:18 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> (Barry
	Warsaw's message of "Mon, 18 Sep 2006 13:40:19 -0400")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
Message-ID: <87eju79aut.fsf@qrnik.zagroda>

Barry Warsaw <barry at python.org> writes:

> What worries me is the unpredictability of gc vs. refcounting.  For
> some class of Python applications it's important that when an object
> is dereferenced it really goes away right then.  I /like/ reference
> counting!

This can be solved by explicit freeing of objects whose cleanup must
be performed deterministically.

Lisp has UNWIND-PROTECT and type-specific macros like WITH-OPEN-FILE.
C# has the 'using' keyword. Python has 'with', which can be used for that.

Reference counting is inefficient, doesn't by itself handle cycles,
and is impractical to combine with threads which run in parallel. The
general consensus of modern language implementations is that a tracing
GC is the future.

I admit that implementing a good GC is hard. It's quite hard to make
it incremental, and it's hard to avoid stopping all threads during GC
(but it's easier to allow threads to run in parallel between GCs, with
no need of forced synchronization each time a reference to an object
is created).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From barry at python.org  Tue Sep 19 16:53:23 2006
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Sep 2006 10:53:23 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87eju79aut.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
Message-ID: <CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>

On Sep 19, 2006, at 10:42 AM, Marcin 'Qrczak' Kowalczyk wrote:

> Barry Warsaw <barry at python.org> writes:
>
>> What worries me is the unpredictability of gc vs. refcounting.  For
>> some class of Python applications it's important that when an object
>> is dereferenced it really goes away right then.  I /like/ reference
>> counting!
>
> This can be solved by explicit freeing of objects whose cleanup must
> be performed deterministically.
>
> Lisp has UNWIND-PROTECT and type-specific macros like WITH-OPEN-FILE.
> C# has the 'using' keyword. Python has 'with', which can be used for that.

I don't see how that helps.  I can remove all references to the  
object but I still have to wait until gc runs to free it.  Can you  
explain your idea in more detail?

> Reference counting is inefficient, doesn't by itself handle cycles,
> and is impractical to combine with threads which run in parallel. The
> general consensus of modern language implementations is that a tracing
> GC is the future.
>
> I admit that implementing a good GC is hard. It's quite hard to make
> it incremental, and it's hard to avoid stopping all threads during GC
> (but it's easier to allow threads to run in parallel between GCs, with
> no need of forced synchronization each time a reference to an object
> is created).

I just think that it's important to remember that there are use cases  
that reference counting solves.  GC and refcounting both have their  
pros and cons.  I tend to think that Python's current refcounting +  
cyclic gc is the devil we know, so unless there is a clear, proven  
better way I'm not eager to change it.

-Barry

From brian at sweetapp.com  Tue Sep 19 17:29:00 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue, 19 Sep 2006 17:29:00 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87eju79aut.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
Message-ID: <45100CBC.2060304@sweetapp.com>

Marcin 'Qrczak' Kowalczyk wrote:
> Reference counting is inefficient, doesn't by itself handle cycles,
> and is impractical to combine with threads which run in parallel. The
> general consensus of modern language implementations is that a tracing
> GC is the future.

How is reference counting inefficient?

Cheers,
Brian

From qrczak at knm.org.pl  Tue Sep 19 17:29:12 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 17:29:12 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org> (Barry
	Warsaw's message of "Tue, 19 Sep 2006 10:53:23 -0400")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
Message-ID: <878xkfc1tj.fsf@qrnik.zagroda>

Barry Warsaw <barry at python.org> writes:

> I don't see how that helps.  I can remove all references to the
> object but I still have to wait until gc runs to free it.  Can you
> explain your idea in more detail?

Objects which should be closed deterministically have the closing
action decoupled from the lifetime of the object. They are closed
explicitly; the object in a "closed" state doesn't take up any
sensitive resources.

> I just think that it's important to remember that there are use
> cases that reference counting solves. GC and refcounting both have
> their pros and cons.

Unfortunately it's hard to mix the two styles. Counting all reference
operations in the presence of a real GC would imply paying the costs
of both schemes together.

> I tend to think that Python's current refcounting + cyclic gc is the
> devil we know, so unless there is a clear, proven better way I'm not
> eager to change it.

They are different sets of tradeoffs; neither is universally better.
I claim that a tracing GC is usually better, or better overall,
but it can't be proven to be better in all respects.

Changing an existing system creates more compatibility obstacles than
designing a system from scratch. I'm not convinced that it's practical
to change the Python GC now. I only wish it had a tracing GC instead.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 17:50:29 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 17:50:29 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45100CBC.2060304@sweetapp.com> (Brian Quinlan's message of
	"Tue, 19 Sep 2006 17:29:00 +0200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
Message-ID: <87mz8vg8je.fsf@qrnik.zagroda>

Brian Quinlan <brian at sweetapp.com> writes:

>> Reference counting is inefficient, doesn't by itself handle cycles,
>> and is impractical to combine with threads which run in parallel. The
>> general consensus of modern language implementations is that a tracing
>> GC is the future.
>
> How is reference counting inefficient?

It involves operations every time an object is merely passed around,
as references to the object are created or destroyed.

It doesn't move objects in memory, and thus free memory is fragmented.
Memory allocation can't just chop from a single area of free memory.
It can't allocate several objects with the cost of one allocation either.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From jimjjewett at gmail.com  Tue Sep 19 17:54:30 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 19 Sep 2006 11:54:30 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <fb6fbf560609190854n25a3ce75u7458a7d64c830048@mail.gmail.com>

On 9/19/06, Michael Chermside <mcherm at mcherm.com> wrote:
> The following comments got me thinking:

> Raymond:
> > Statistics incontrovertibly prove that people who habitually
> > avoid __del__ lead happier lives and spend fewer hours in therapy ;-)

> Adam Olsen:
> > I agree here.  I think an executor approach is much better; kill the
> > object, then make a weakref callback do any further cleanups using
> > copies it made in advance.

> Since we're apparently still in "propose wild ideas" mode for Py3K
> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
> not "tweak" it, just remove it and perhaps add a note in the manual
> pointing people to the weakref module.

The various "create a separate closer object instead" recipescall seem
to cause a jump in complexity, particularly if you try for a general
solution.

I do think we should split __del__ into the (rare, problematic)
general case and a "special-purpose" lightweight __close__ version
that does a better job in the normal case.

For the general case, Python refuses to guess which order to
call __del__ methods in within a cycle; this has the unfortunate side
effect of making such cycles immortal.

Almost all actual __del__ uses are effectively a call to self.close().
The call might be required (Tk would leak if tkinter didn't notify
it), or it might just be good housekeeping.  The key point is that
order doesn't matter.

In practice they all seem to already be written defensively, so that
they can be called multiple times, or even after teardown has started.

So the semantics of __close__ would be just like those of __del__ except that
(1)  It would be called at least once if the process terminates normally.
(2)  Call order for linked objects would be arbitrary.

FWIW, I couldn't find a single example in the stdlib (outside of
tests) that wouldn't work at least as well if converted to a __close__
method.  (subprocess and popen2 would be harder if __close__ were a
once-only method, like I think generator close ended up becoming.)

-jJ

From barry at python.org  Tue Sep 19 18:01:53 2006
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Sep 2006 12:01:53 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <878xkfc1tj.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda>
Message-ID: <2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org>

On Sep 19, 2006, at 11:29 AM, Marcin 'Qrczak' Kowalczyk wrote:

> Barry Warsaw <barry at python.org> writes:
>
>> I don't see how that helps.  I can remove all references to the
>> object but I still have to wait until gc runs to free it.  Can you
>> explain your idea in more detail?
>
> Objects which should be closed deterministically have the closing
> action decoupled from the lifetime of the object. They are closed
> explicitly; the object in a "closed" state doesn't take up any
> sensitive resources.

It's not external resources I'm concerned about, it's the physical  
memory consumed in process by objects which are unreachable but not  
reclaimed.

-Barry

From brian at sweetapp.com  Tue Sep 19 18:05:46 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue, 19 Sep 2006 18:05:46 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>	<87eju79aut.fsf@qrnik.zagroda>
	<45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <4510155A.5010308@sweetapp.com>

Marcin 'Qrczak' Kowalczyk wrote:
> Brian Quinlan <brian at sweetapp.com> writes:
> 
>>> Reference counting is inefficient, doesn't by itself handle cycles,
>>> and is impractical to combine with threads which run in parallel. The
>>> general consensus of modern language implementations is that a tracing
>>> GC is the future.
>> How is reference counting inefficient?

Do you somehow know that a tracing GC would be more efficient for
typical Python programs, or are you just speculating?

> It involves operations every time an object is merely passed around,
> as references to the object are created or destroyed.

But if the lifetime of most objects is confined to a single function 
call, isn't reference counting going to be quite efficient?

> It doesn't move objects in memory, and thus free memory is fragmented.

OK. Have you had memory fragmentation problems with Python?

> Memory allocation can't just chop from a single area of free memory.
> It can't allocate several objects with the cost of one allocation either.

I'm not sure what you mean here.

Cheers,
Brian

From barry at python.org  Tue Sep 19 18:10:35 2006
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Sep 2006 12:10:35 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <4510155A.5010308@sweetapp.com>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>	<87eju79aut.fsf@qrnik.zagroda>
	<45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
	<4510155A.5010308@sweetapp.com>
Message-ID: <13D26562-CF05-44F3-A855-0CD280BA3140@python.org>

On Sep 19, 2006, at 12:05 PM, Brian Quinlan wrote:

> Marcin 'Qrczak' Kowalczyk wrote:
>> Brian Quinlan <brian at sweetapp.com> writes:
>>
>>>> Reference counting is inefficient, doesn't by itself handle cycles,
>>>> and is impractical to combine with threads which run in  
>>>> parallel. The
>>>> general consensus of modern language implementations is that a  
>>>> tracing
>>>> GC is the future.
>>> How is reference counting inefficient?
>
> Do somehow know that tracing GC would be more efficient for typical
> python programs or are you just speculating?

Also, what does "efficient" mean here?  Overall program run time?  No  
user-discernible pauses in operation?  Stinginess in overall memory use?

There are a lot of different efficiency parameters to consider, and  
of course different applications will care more about some than  
others.  A u/i-based tool doesn't want noticeable pauses.  A long  
running daemon wants manageable and predictable memory utilization.   
Etc.

-Barry

From jcarlson at uci.edu  Tue Sep 19 18:23:01 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Tue, 19 Sep 2006 09:23:01 -0700
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <20060919091147.07F3.JCARLSON@uci.edu>


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> 
> Brian Quinlan <brian at sweetapp.com> writes:
> 
> >> Reference counting is inefficient, doesn't by itself handle cycles,
> >> and is impractical to combine with threads which run in parallel. The
> >> general consensus of modern language implementations is that a tracing
> >> GC is the future.
> >
> > How is reference counting inefficient?
> 
> It involves operations every time an object is merely passed around,
> as references to the object are created or destroyed.

Redefine the INC/DECREF macros to assign something like 2**30 as the
reference count in INCREF, and make DECREF do nothing.  A write of a
constant should be measurably faster than an increment.  Run some
relatively small test program (be concerned about memory!), and compare
the results to see if there is a substantial difference in performance.
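
Concretely, the experiment might amount to editing the macros in
Include/object.h of a scratch build to something like this (untested,
and it leaks everything by design):

    /* every "incref" becomes a plain store of a huge constant, and
       "decref" disappears entirely, so nothing is ever freed */
    #define Py_INCREF(op)  ((op)->ob_refcnt = ((Py_ssize_t)1 << 30))
    #define Py_DECREF(op)  ((void)(op))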

> It doesn't move objects in memory, and thus free memory is fragmented.
> Memory allocation can't just chop from a single area of free memory.
> It can't allocate several objects with the cost of one allocation either.

It can certainly allocate several objects with the cost of one
allocation, but it can't *deallocate* those objects individually.  See
the various freelists for examples where this is used successfully in
Python now.
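
For anyone who hasn't read that code, the idiom is roughly this
(a simplified sketch, not CPython's actual code):

    #include <stdlib.h>

    typedef struct node {
        struct node *next;
        /* ... object payload ... */
    } node;

    #define BLOCK 256
    static node *free_list = NULL;

    static node *
    node_alloc(void)              /* one malloc serves BLOCK objects */
    {
        node *n;
        if (free_list == NULL) {
            int i;
            node *block = malloc(BLOCK * sizeof(node));
            if (block == NULL)
                return NULL;
            for (i = 0; i < BLOCK - 1; i++)
                block[i].next = &block[i + 1];
            block[BLOCK - 1].next = NULL;
            free_list = block;
        }
        n = free_list;
        free_list = n->next;
        return n;
    }

    static void
    node_free(node *n)            /* cheap, but the block itself is
                                     never returned to the system */
    {
        n->next = free_list;
        free_list = n;
    }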

 - Josiah


From qrczak at knm.org.pl  Tue Sep 19 18:37:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 18:37:55 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org> (Barry
	Warsaw's message of "Tue, 19 Sep 2006 12:01:53 -0400")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda>
	<2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org>
Message-ID: <87r6y7ke1o.fsf@qrnik.zagroda>

Barry Warsaw <barry at python.org> writes:

> It's not external resources I'm concerned about, it's the physical
> memory consumed in process by objects which are unreachable but not
> reclaimed.

The rate of garbage collection depends on the rate of allocation.
While objects are not freed at the earliest possible moment,
they are freed when the memory is needed for other objects.


Brian Quinlan <brian at sweetapp.com> writes:

> Do somehow know that tracing GC would be more efficient for typical 
> python programs or are you just speculating?

I'm mostly speculating. It's hard to measure the difference between
garbage collection schemes because most language runtimes are tied
to a particular GC implementation, and thus you can't substitute a
different GC leaving everything else the same.

I've done some experiments with C++-based GCs incl. reference counting,
but they were inconclusive. The effects strongly depend on the kind of
the program and the amount of memory it uses, and various GC schemes
are better or worse in different scenarios.

>> It involves operations every time an object is merely passed around,
>> as references to the object are created or destroyed.
>
> But if the lifetime of most objects is confined to a single function 
> call, isn't reference counting going to be quite efficient?

Even if an object begins and ends its lifetime within a particular
function call, it's usually passed down to other functions in the
meantime.

Every time a Python function returns an object, the reference count
on the result is incremented, and it's decremented at some time by
its caller. Every time a function implemented in Python is called,
reference counts of its parameters are incremented, and they are
decremented when it returns. Every time a None is stored in a data
structure or returned from a function, its reference count is
incremented. Every time a list is freed, reference counts of objects
it refers to is decremented. Every time two ints are added, the
reference count of the result is incremented even if that integer
was preallocated. Every time a field is assigned to, two reference
counts are manipulated.
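
To spell out that last point in C (the struct here is invented, but
the incref-before-decref ordering is the standard recommended
pattern):

    typedef struct {
        PyObject_HEAD
        PyObject *attr;
    } Thing;

    static void
    thing_set_attr(Thing *self, PyObject *v)
    {
        PyObject *old = self->attr;
        Py_INCREF(v);       /* count #1: the struct takes a reference */
        self->attr = v;
        Py_XDECREF(old);    /* count #2: release the previous value */
    }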

>> It doesn't move objects in memory, and thus free memory is fragmented.
>
> OK. Have you had memory fragmentation problems with Python?

Indirectly: memory allocation can't be as fast as in some GC schemes.

>> Memory allocation can't just chop from a single area of free memory.
>> It can't allocate several objects with the cost of one allocation either.
>
> I'm not sure what you mean here.

There are various GCs (including OCaml, and probably Sun's Java and
Microsoft's .NET implementations, and my language implementation,
and surely others) where the fast path of memory allocation looks
like stack allocation with overflow checking.
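
In code, that fast path is roughly the following (a generic sketch,
not any particular collector):

    #include <stddef.h>

    static char *alloc_ptr;     /* next free byte in the nursery */
    static char *alloc_limit;   /* end of the nursery */

    static void *
    gc_alloc(size_t size)
    {
        void *result;
        if (alloc_ptr + size > alloc_limit)
            minor_collection();  /* hypothetical: evacuates survivors
                                    and resets alloc_ptr */
        result = alloc_ptr;
        alloc_ptr += size;
        return result;
    }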

Moreover, if several objects are to be allocated at once (which I
admit is more likely in compiled code), the cost is still the same
as allocation of one object of the size being the sum of sizes of the
objects (not counting filling the objects with contents). There are no
per-allocated-object data to fill besides a header which points to a
static structure which describes the layout.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 18:55:26 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 18:55:26 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda> (Marcin Kowalczyk's message of
	"Tue, 19 Sep 2006 17:50:29 +0200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
	<87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <87wt7zzthd.fsf@qrnik.zagroda>

"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

> It involves operations every time an object is merely passed around,
> as references to the object are created or destroyed.

And it does something when it frees an object. In some GCs there is
a cost associated with keeping an object alive, but there is no
per-object cost when a group of objects die.

Most objects die young. This is what I've measured myself. When my
compiler runs, the average lifetime of an object is about 1/5 of a GC
cycle.
This means that 80% of objects have only an allocation cost, while
freeing is free. And with a generational GC most of others are copied
only once: major GCs are less frequent than minor GCs.

It is true that a given long-living object has a larger cost, but such
objects are a minority, and I believe this scheme pays off. Especially
if it was implemented better than I did it; this is the only GC I've
implemented so far, I'm sure that experienced people can tune it better.
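
(Incidentally, CPython's cycle detector is generational too, which you
can observe from the gc module; collections of the youngest generation
are far more frequent than full ones:)

    import gc

    print(gc.get_threshold())   # e.g. (700, 10, 10): gen0 is examined
                                # after ~700 net allocations, gen1 after
                                # 10 gen0 passes, gen2 after 10 gen1 passes
    print(gc.get_count())       # current per-generation counts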

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 20:48:11 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 20:48:11 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609190854n25a3ce75u7458a7d64c830048@mail.gmail.com> (Jim
	Jewett's message of "Tue, 19 Sep 2006 11:54:30 -0400")
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<fb6fbf560609190854n25a3ce75u7458a7d64c830048@mail.gmail.com>
Message-ID: <87lkofn15g.fsf@qrnik.zagroda>

"Jim Jewett" <jimjjewett at gmail.com> writes:

> I do think we should split __del__ into the (rare, problematic)
> general case and a "special-purpose" lightweight __close__ version
> that does a better job in the normal case.

A synchronous finalizer which doesn't keep the objects it refers to alive,
like Python's __del__, is sufficient when the finalizer doesn't use
other finalizable objects, and doesn't conflict with the rest of the
program in terms of potentially concurrent operations on shared data
(read/write or write/write).

Note that the concurrency conflict can manifest even in a
single-threaded program, because __del__ finalizers are in fact
semi-asynchronous: they are invoked when a reference count is
decremented and causes the relevant object to become dead, which
can happen in lots of places, even on a seemingly innocent variable
assignment.
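
A tiny illustration of that semi-asynchrony (CPython behaviour):

    class Noisy:
        def __del__(self):
            print('finalized')

    x = Noisy()
    x = None    # an innocent-looking rebinding: the count drops to
                # zero and __del__ runs right here, in mid-assignment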

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From mcherm at mcherm.com  Tue Sep 19 21:54:15 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Tue, 19 Sep 2006 12:54:15 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>

Speaking on the speed of GC implementations, Marcin writes:
> I'm mostly speculating. It's hard to measure the difference between
> garbage collection schemes because most language runtimes are tied
> to a particular GC implementation, and thus you can't substitute a
> different GC leaving everything else the same.

Interestingly, one of the original goals of PyPy was to create a
test bed in which it was easy to experiment and answer just this
kind of question. Unfortunately, although they have an architecture
allowing pluggable GC algorithms (what an incredible concept!), I
don't believe that any reliable conclusions can be drawn from things
as they now stand.

For more details see
http://codespeak.net/pypy/dist/pypy/doc/garbage_collection.html

-- Michael Chermside

From martin at v.loewis.de  Tue Sep 19 23:32:07 2006
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 19 Sep 2006 23:32:07 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
Message-ID: <451061D7.4010105@v.loewis.de>

Paul Prescod schrieb:
> Even if I won't contribute code or even a design to the solution
> (because it isn't an area of expertise and I'm still working on
> encodings stuff) I think that there would be value in saying: "There's a
> big problem here and we intend to fix it in Python 3000."

This is of value only if "we" really intend to fix it. I don't,
and apparently, you don't either. It would be very bad to claim
that "we" will fix it, and then not do it. It's much much much much better
to acknowledge that "we" aren't going to fix it, not with Python
3.0, and likely not with any release in the foreseeable future.
The only exception would be if somebody offered a reasonable
solution, which "we" would just have to incorporate (and possibly
maintain, although it would be good if the original author would
be around for a year or so).

Regards,
Martin


From martin at v.loewis.de  Tue Sep 19 23:34:53 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 19 Sep 2006 23:34:53 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E792C.1070105@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>		<450D4AAE.2000805@gmail.com>	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com>
Message-ID: <4510627D.2080704@v.loewis.de>

Nick Coghlan schrieb:
> I was thinking it would be easier to split out the Global Interpreter Lock and 
> a per-interpreter Local Interpreter Lock, rather than trying to go to a full 
> free-threading model. Anyone sharing other objects between interpreters would 
> still need their own synchronisation mechanism, but something like 
> threading.Queue should suffice for that.

The challenge with that is "global" (i.e. across-interpreter) objects.
There are several of these: the obvious singletons (None, True, False),
the non-obvious singletons ((), -2..100 or so), and the extension module
globals (types, and in particular exceptions).

Do you want them still to be global, or per-interpreter?
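
To make the non-obvious ones concrete (a CPython implementation detail;
the exact cached range varies by version):

    >>> a = 100
    >>> b = 100
    >>> a is b            # small ints are shared interpreter-wide
    True
    >>> a = 100000
    >>> b = 100000
    >>> a is b            # larger ints generally are not
    False
    >>> () is ()          # the empty tuple is shared too
    True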

Regards,
Martin

From martin at v.loewis.de  Tue Sep 19 23:41:05 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 19 Sep 2006 23:41:05 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>	<87eju79aut.fsf@qrnik.zagroda>
	<45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <451063F1.9050207@v.loewis.de>

Marcin 'Qrczak' Kowalczyk schrieb:
> It doesn't move objects in memory, and thus free memory is fragmented.

That's true, but not a problem.

> Memory allocation can't just chop from a single area of free memory.

That's not true. Python does it all the time. Allocation is in constant
time most of the time (in some applications, it's always constant).

Regards,
Martin

From rasky at develer.com  Tue Sep 19 23:42:43 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 19 Sep 2006 23:42:43 +0200
Subject: [Python-3000] Removing __del__
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <023701c6dc34$8a79dc50$a14c2597@bagio>

Michael Chermside <mcherm at mcherm.com> wrote:

> Since we're apparently still in "propose wild ideas" mode for Py3K
> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
> not "tweak" it, just remove it and perhaps add a note in the manual
> pointing people to the weakref module.


I don't use __del__ much. I use it only in leaf classes, where it surely can't
be part of loops. In those rare cases, it's very useful to me. For instance, I
have a small class which wraps an existing handle-based C API exported to
Python. Something along the lines of:

class Wrapper:
    def __init__(self, *args):
           self.handle = CAPI.init(*args)

    def __del__(self, *args):
            CAPI.close(self.handle)

    def foo(self):
            CAPI.foo(self.handle)

The real class isn't much longer than this (really). How do you propose to
write this same code without __del__?

Notice that I'd be perfectly fine with the __close__ semantics proposed in this
thread (it might be called more than once, and order within the loop doesn't matter).

Giovanni Bajo


From rasky at develer.com  Tue Sep 19 23:50:34 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 19 Sep 2006 23:50:34 +0200
Subject: [Python-3000] Delayed reference counting idea
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<450F7E00.7000304@canterbury.ac.nz>
Message-ID: <026e01c6dc35$a4ee83f0$a14c2597@bagio>

Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

>> * An easier C API would significantly benefit the language in terms
>> of more extensions being available and in terms of increased
>> reliability for those extensions.  The current refcount scheme
>> results in pervasive refleak bugs and subsequent, interminable
>> bughunts.
>
> It's not clear that a different scheme would be much
> different, though. If it's not refcounting, there will
> be some other set of rules that must be followed, with
> equally obscure bugs if you slip up.

Agreed.

> Also, at least half of the boilerplate is due to the
> necessity of checking for errors at each step. A
> different GC scheme wouldn't help with that.

Given that C handles errors (manual checks needed at every step) and
finalizers (manual refcounting and de-refcounting) in an equally bad fashion,
maybe a C++ Python is in order? ATL helped somewhat with COM refcounting,
after all.

Giovanni Bajo


From rhamph at gmail.com  Wed Sep 20 00:31:36 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Tue, 19 Sep 2006 16:31:36 -0600
Subject: [Python-3000] Removing __del__
In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<023701c6dc34$8a79dc50$a14c2597@bagio>
Message-ID: <aac2c7cb0609191531r425c6626n2630c4c611e190b8@mail.gmail.com>

On 9/19/06, Giovanni Bajo <rasky at develer.com> wrote:
> Michael Chermside <mcherm at mcherm.com> wrote:
>
> > Since we're apparently still in "propose wild ideas" mode for Py3K
> > I'd like to propose that for Py3K we remove __del__. Not "fix" it,
> > not "tweak" it, just remove it and perhaps add a note in the manual
> > pointing people to the weakref module.
>
>
> I don't use __del__ much. I use it only in leaf classes, where it surely can't
> be part of loops. In those rare cases, it's very useful to me. For instance, I
> have a small classes which wraps an existing handle-based C API exported to
> Python. Something along the lines of:
>
> class Wrapper:
>     def __init__(self, *args):
>            self.handle = CAPI.init(*args)
>
>     def __del__(self, *args):
>             CAPI.close(self.handle)
>
>     def foo(self):
>             CAPI.foo(self.handle)
>
> The real class isn't much longer than this (really). How do you propose to
> write this same code without __del__?

I've experimented with using metaclasses to do some fun things here.  It
could look something like this:

class Wrapper(Core):
    def __init__(self, *args):
        Core.__init__(self)
        self.core.handle = CAPI.init(*args)

    @coremethod
    def __coredel__(core):
        CAPI.close(core.handle)

    def foo(self):
        CAPI.foo(self.core.handle)

Works just fine in 2.x.
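
(Core and coremethod aren't stdlib, of course.  A rough sketch of one
way to write them -- the essential trick is that the weakref callback
closes over the separate "core" attribute bag, never over self:)

    import weakref

    _refs = set()    # keeps the weakrefs themselves alive

    def coremethod(func):
        # expose the finalizer as a plain function on the class
        return staticmethod(func)

    class Core(object):
        def __init__(self):
            self.core = type('CoreState', (), {})()   # attribute bag
            fin = getattr(type(self), '__coredel__', None)
            if fin is not None:
                core = self.core
                def callback(ref):
                    _refs.discard(ref)
                    fin(core)        # self is already gone by now
                _refs.add(weakref.ref(self, callback))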

-- 
Adam Olsen, aka Rhamphoryncus

From exarkun at divmod.com  Wed Sep 20 00:40:48 2006
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Tue, 19 Sep 2006 18:40:48 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio>
Message-ID: <20060919224048.1717.886353737.divmod.quotient.54336@ohm>

On Tue, 19 Sep 2006 23:42:43 +0200, Giovanni Bajo <rasky at develer.com> wrote:
>Michael Chermside <mcherm at mcherm.com> wrote:
>
>> Since we're apparently still in "propose wild ideas" mode for Py3K
>> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
>> not "tweak" it, just remove it and perhaps add a note in the manual
>> pointing people to the weakref module.
>
>
>I don't use __del__ much. I use it only in leaf classes, where it surely can't
>be part of loops. In those rare cases, it's very useful to me. For instance, I
>have a small classes which wraps an existing handle-based C API exported to
>Python. Something along the lines of:
>
>class Wrapper:
>    def __init__(self, *args):
>           self.handle = CAPI.init(*args)
>
>    def __del__(self, *args):
>            CAPI.close(self.handle)
>
>    def foo(self):
>            CAPI.foo(self.handle)
>
>The real class isn't much longer than this (really). How do you propose to
>write this same code without __del__?

Untested, but roughly:

    import weakref

    _weakrefs = []

    def _cleanup(ref, handle):
        _weakrefs.remove(ref)
        CAPI.close(handle)

    class BetterWrapper:
        def __init__(self, *args):
            handle = self.handle = CAPI.init(*args)
            # the callback closes over handle, not self, so it does
            # not keep the instance alive
            _weakrefs.append(
                weakref.ref(self,
                    lambda ref: _cleanup(ref, handle)))

        def foo(self):
            CAPI.foo(self.handle)

There are probably even better ways too, this is just the first that comes
to mind.

Jean-Paul

From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:22:47 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:22:47 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <878xkfc1tj.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda>
Message-ID: <451089D7.6060204@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> Objects which should be closed deterministically have the closing
> action decoupled from the lifetime of the object.

That doesn't cover the case where the "closing" action
you want includes freeing the memory occupied by the
object. The game I mentioned earlier is one of those --
I don't need anything "closed", I just want the memory
back.

> They are closed
> explicitly; the object in a "closed" state doesn't take up any
> sensitive resources.
> 
> 
>>I just think that it's important to remember that there are use
>>cases that reference counting solves. GC and refcounting both have
>>their pros and cons.
> 
> 
> Unfortunately it's hard to mix the two styles. Counting all reference
> operations in the presence of a real GC would imply paying the costs
> of both schemes together.
> 
> 
>>I tend to think that Python's current refcounting + cyclic gc is the
>>devil we know, so unless there is a clear, proven better way I'm not
>>eager to change it.
> 
> 
> They are different sets of tradeoffs; neither is universally better.
> I claim that a tracing GC is usually better, or better in overall,
> but it can't be proven to be better in all respects.
> 
> Changing an existing system creates more compatibility obstacles than
> designing a system from scratch. I'm not convinced that it's practical
> to change the Python GC now. I only wish it had a tracing GC instead.
> 


From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:25:07 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:25:07 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
	<87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <45108A63.9090506@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> It doesn't move objects in memory, and thus free memory is fragmented.

That's not a feature of refcounting as such. With
sufficient indirection, moveable refcounted memory
blocks are possible (early Smalltalks worked that
way, I believe).

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:34:06 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:34:06 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
Message-ID: <45108C7E.1080104@canterbury.ac.nz>

Michael Chermside wrote:

> Interestingly, one of the original goals of PyPy was to create a
> test bed in which it was easy to experiment and answer just this
> kind of question.

A worry about that is whether the architecture required to
allow pluggable GC implementations introduces inefficiencies
of its own that would skew the results.

--
Greg

From bob at redivi.com  Wed Sep 20 02:47:05 2006
From: bob at redivi.com (Bob Ippolito)
Date: Tue, 19 Sep 2006 17:47:05 -0700
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45108C7E.1080104@canterbury.ac.nz>
References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
	<45108C7E.1080104@canterbury.ac.nz>
Message-ID: <6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com>

On 9/19/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Michael Chermside wrote:
>
> > Interestingly, one of the original goals of PyPy was to create a
> > test bed in which it was easy to experiment and answer just this
> > kind of question.
>
> A worry about that is whether the architecture required to
> allow pluggable GC implementations introduces inefficiencies
> of its own that would skew the results.
>

There's no need to worry about that in the case of PyPy. Those kinds
of choices are made way before runtime, so there's no required
indirection.

-bob

From qrczak at knm.org.pl  Wed Sep 20 03:01:33 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 20 Sep 2006 03:01:33 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45108A63.9090506@canterbury.ac.nz> (Greg Ewing's message of
	"Wed, 20 Sep 2006 12:25:07 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
	<87mz8vg8je.fsf@qrnik.zagroda> <45108A63.9090506@canterbury.ac.nz>
Message-ID: <8764fjnyfm.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

>> It doesn't move objects in memory, and thus free memory is fragmented.
>
> That's not a feature of refcounting as such. With sufficient
> indirection, moveable refcounted memory blocks are possible
> (early Smalltalks worked that way, I believe).

Yes, but the indirection is a cost in itself. A tracing GC can move
objects without such indirection, because it can update all pointers
to the given object.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:59:14 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:59:14 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com>
References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
	<45108C7E.1080104@canterbury.ac.nz>
	<6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com>
Message-ID: <45109262.5020308@canterbury.ac.nz>

Bob Ippolito wrote:

> There's no need to worry about that in the case of PyPy. Those kinds
> of choices are made way before runtime, so there's no required
> indirection.

Even so, we're talking about machine-generated code rather
than the sort of hand-crafting you need to get the best
out of something critical like GC. There could still be
room for inefficiencies.

--
Greg

From ironfroggy at gmail.com  Wed Sep 20 04:54:07 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Tue, 19 Sep 2006 22:54:07 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <4510627D.2080704@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de>
Message-ID: <76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com>

On 9/19/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Nick Coghlan schrieb:
> > I was thinking it would be easier to split out the Global Interpreter Lock and
> > a per-interpreter Local Interpreter Lock, rather than trying to go to a full
> > free-threading model. Anyone sharing other objects between interpreters would
> > still need their own synchronisation mechanism, but something like
> > threading.Queue should suffice for that.
>
> The challenge with that is "global" (i.e. across-interpreter) objects.
> There are several of these: the obvious singletons (None, True, False),
> the non-obvious singletons ((), -2..100 or so), and the extension module
> globals (types, and in particular exceptions).
>
> Do you want them still to be global, or per-interpreter?
>
> Regards,
> Martin

It is one fixable problem among many, but fixable nonetheless. Any
solution is going to break the API, but that should be allowed,
especially for something as important as this. The obvious and
non-obvious singletons don't represent much of a real problem once
you realize that you'll have to change the locking API anyway, at
least to specify the interpreter whose Local Interpreter Lock is being
operated on. Should you check every object to see what it is? No -- so
either don't have cross-interpreter globals (which doesn't save you
much anyway), or add a lock pointer to all PyObject structs, which can
point to a single GIL, a LIL, or something else down the road. The API
will need an extra parameter for locking anyway, so the door is already
open and the singletons aren't getting in the way.

From martin at v.loewis.de  Wed Sep 20 08:02:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 20 Sep 2006 08:02:30 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450D4AAE.2000805@gmail.com>	
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>	
	<450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de>
	<76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com>
Message-ID: <4510D976.1060600@v.loewis.de>

Calvin Spealman schrieb:
>> The challenge with that is "global" (i.e. across-interpreter) objects.
>> There are several of these: the obvious singletons (None, True, False),
>> the non-obvious singletons ((), -2..100 or so), and the extension module
>> globals (types, and in particular exceptions).
>>
>> Do you want them still to be global, or per-interpreter?
>>
>> Regards,
>> Martin
> 
> It is one fixable problem among many, but fixable none-the-less.
[...]

Your message didn't really answer the question, did it?

Regards,
Martin

From qrczak at knm.org.pl  Wed Sep 20 08:25:52 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 20 Sep 2006 08:25:52 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <451089D7.6060204@canterbury.ac.nz> (Greg Ewing's message of
	"Wed, 20 Sep 2006 12:22:47 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
Message-ID: <87hcz38367.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> That doesn't cover the case where the "closing" action
> you want includes freeing the memory occupied by the
> object. The game I mentioned earlier is one of those --
> I don't need anything "closed", I just want the memory

Why do you want to free memory at a particular point of time?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From greg.ewing at canterbury.ac.nz  Wed Sep 20 10:53:20 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 20:53:20 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87hcz38367.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda>
Message-ID: <45110180.4070807@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> Why do you want to free memory at a particular point of time?

I don't. However, I *do* want it freed by the time I
need it again, and I *don't* want unpredictable pauses
to catch up on backed-up memory-freeing, so that my
animations run smoothly.

--
Greg

From ncoghlan at gmail.com  Wed Sep 20 12:12:01 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 20 Sep 2006 20:12:01 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <4510627D.2080704@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>		<450D4AAE.2000805@gmail.com>	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de>
Message-ID: <451113F1.2030302@gmail.com>

Martin v. Löwis wrote:
> Nick Coghlan schrieb:
>> I was thinking it would be easier to split out the Global Interpreter Lock and 
>> a per-interpreter Local Interpreter Lock, rather than trying to go to a full 
>> free-threading model. Anyone sharing other objects between interpreters would 
>> still need their own synchronisation mechanism, but something like 
>> threading.Queue should suffice for that.
> 
> The challenge with that is "global" (i.e. across-interpreter) objects.
> There are several of these: the obvious singletons (None, True, False),
> the non-obvious singletons ((), -2..100 or so), and the extension module
> globals (types, and in particular exceptions).
> 
> Do you want them still to be global, or per-interpreter?

The GIL would still exist - the idea would be that most threads would be 
spending most of their time holding only their local interpreter lock.

Only when reading or writing the state shared between interpreters would a 
thread need to acquire the GIL. Alternatively, the GIL might be turned into 
a read/write lock instead of a basic mutex, with threads normally holding a 
read lock which they periodically release & reacquire (in case any other 
threads are waiting to acquire it).

The latter approach would probably give better performance (since you wouldn't 
need to be dropping and reacquiring the GIL in order to access the singleton 
objects).
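
A minimal pure-Python sketch of the read/write idea (illustrative only;
a real version would live in C, and this one lets writers starve):

    import threading

    class ReadWriteLock:
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0

        def acquire_read(self):
            with self._cond:
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if not self._readers:
                    self._cond.notify_all()

        def acquire_write(self):
            self._cond.acquire()         # new readers now block
            while self._readers:
                self._cond.wait()        # (briefly releases the lock)

        def release_write(self):
            self._cond.release()

    # A thread's inner loop would periodically do
    #     rw.release_read(); rw.acquire_read()
    # to give any waiting writer a chance to get in.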

Cheers,
Nick.

P.S. Just to be clear, I don't think doing this would be *easy*, but unlike 
full free-threading, I think it is at least potentially workable.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Wed Sep 20 12:57:59 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 20 Sep 2006 12:57:59 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45110180.4070807@canterbury.ac.nz> (Greg Ewing's message of
	"Wed, 20 Sep 2006 20:53:20 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
Message-ID: <87psdqztxk.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

>> Why do you want to free memory at a particular point of time?
>
> I don't. However, I *do* want it freed by the time I need it again,

As I said, the rate of GC depends on the rate of allocation.
Unreachable objects are collected when memory is needed for
allocation.

> and I *don't* want unpredictable pauses to catch up on backed-up
> memory-freeing,

Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
memory at once, but distributes the work among GC cycles.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at gmail.com  Wed Sep 20 13:18:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 20 Sep 2006 21:18:58 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <451123A2.7040701@gmail.com>

Michael Chermside wrote:
> The following comments got me thinking:
> 
> Raymond:
>> Statistics incontrovertibly prove that people who habitually
>> avoid __del__ lead happier lives and spend fewer hours in therapy ;-)
> 
> Adam Olsen:
>> I agree here.  I think an executor approach is much better; kill the
>> object, then make a weakref callback do any further cleanups using
>> copies it made in advance.
 >
> What'cha think folks? I'd love to hear an opinion from someone who
> is a current user of __del__ -- I'm not.

How about an API change and a tweak to type.__call__, rather than complete 
removal?

I've re-used __del__ as the method name below, but a different name would 
obviously work too.

1. __del__ would become an automatic static method (like __new__)

2. Make an addition to the end of type.__call__ along the lines of (stealing 
from Jean-Paul's example):

     # sys.finalizers would just be a new global set in the sys module
     # that keeps the weakrefs alive until they are needed

     # In definition of type.__call__, after invoking __init__
     if hasattr(cls, '__del__'):
         finalizer = cls.__del__
         if hasattr(self, '__del_arg__'):
             finalizer_arg = self.__del_arg__
         else:
             # Create a class with the same instance attributes
             # as the original
             class attr_holder(object):
                 pass
             finalizer_arg = attr_holder()
             finalizer_arg.__dict__ = self.__dict__
         def call_finalizer(ref):
             sys.finalizers.remove(ref)
             finalizer(finalizer_arg)
         sys.finalizers.add(weakref.ref(self, call_finalizer))

3. The __init__ method then simply needs to make sure that the right argument 
is passed to __del__. For example, if the object holds a reference to a file 
that needs to be closed when the object goes away:

   class CloseFileOnDel(object):
       def __init__(self, fname):
           self.f = self.__del_arg__ = open(fname)
       def __del__(f):
           f.close()

Alternatively, the class could rely on the pseudo-self that is passed if 
__del_arg__ isn't defined:

   class CloseFileOnDel(object):
       def __init__(self, fname):
           self.f = open(fname)
       def __del__(self_attrs):
           self_attrs.f.close()

The only way for __del__ to receive a reference to self is if the finalizer 
argument had a reference to it - but that would mean the object itself was not 
collectable, so __del__ wouldn't be called in the first place.

That all seems too simple, though. Since we're talking about gc and that's 
never simple, there has to be something wrong with the idea :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rhamph at gmail.com  Wed Sep 20 14:55:56 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 06:55:56 -0600
Subject: [Python-3000] How will unicode get used?
Message-ID: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>

Before we can decide on the internal representation of our unicode
objects, we need to decide on their external interface.  My thoughts
so far:

* Most transformation and testing methods (.lower(), .islower(), etc)
can be copied directly from 2.x.  They require no special
implementation to perform reasonably.
* Indexing and slicing is the big issue.  Do we need constant-time
integer slicing?  .find() could be changed to return a token that
could be used as a constant-time offset.  Incrementing the token would
have linear costs, but that's no big deal if the offsets are always
small.
* Grapheme clusters, words, lines, other groupings, do we need/want
ways to slice based on them too?
* Cheap slicing and concatenation (between O(1) and O(log(n))), do we
want to support them?  Now would be the time.

-- 
Adam Olsen, aka Rhamphoryncus

From krstic at solarsail.hcs.harvard.edu  Wed Sep 20 11:18:22 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Wed, 20 Sep 2006 05:18:22 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <451061D7.4010105@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
	<451061D7.4010105@v.loewis.de>
Message-ID: <4511075E.6010101@solarsail.hcs.harvard.edu>

Martin v. Löwis wrote:
> The only exception would be if somebody offered a reasonable
> solution, which "we" would just have to incorporate (and possibly
> maintain, although it would be good if the original author would
> be around for a year or so).

I am interested in doing just this. I'm loath to spend time on it,
however, if it turns out that Guido still doesn't think multiprocessing
is a problem, or has a particular solution in mind. So once that clears
up, I'm happy to commit to a PEP, a reference implementation (if it can
be done purely in Python; if it involves diving into CPython, I'll
require assistance), and ongoing maintenance of the same for the
foreseeable future.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D


From mcherm at mcherm.com  Wed Sep 20 15:24:01 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 06:24:01 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>

Nick Coghlan writes:
    [...proposes revision of __del__ rather than removal...]
> The only way for __del__ to receive a reference to self is if the  
> finalizer argument had a reference to it - but that would mean the  
> object itself was not
> collectable, so __del__ wouldn't be called in the first place.
>
> That all seems too simple, though. Since we're talking about gc and  
> that's never simple, there has to be something wrong with the idea :)

Unfortunately you're right... this is all too simple. The existing
mechanism doesn't have a problem with __del__ methods that do not
participate in loops. For those that DO participate in loops I
think it's perfectly plausible for your __del__ to receive a reference
to the actual object being finalized.

Another problem (but less important as it's trivially fixable) is that
you're storing away the values that the object had when it was created,
perhaps missing out on things that got added or initialized later.

-- Michael Chermside


From krstic at solarsail.hcs.harvard.edu  Wed Sep 20 15:32:47 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Wed, 20 Sep 2006 21:32:47 +0800
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
Message-ID: <451142FF.20203@solarsail.hcs.harvard.edu>

Jim Jewett wrote:
>> > Ivan: why don't you write a PEP about this?
> 
>> I'd like to hear Guido's overarching thoughts on the matter, if any, and
>> would afterwards be happy to write a PEP.

The `this` and `the matter` referred not to removing the GIL, but
providing some form of sane multiprocessing support that doesn't require
everyone interested in MP to reinvent the wheel. The GIL situation and
Guido's position on it seem pretty clear to me, as I've tried to
indicate in prior messages.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From rasky at develer.com  Wed Sep 20 15:36:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Wed, 20 Sep 2006 15:36:48 +0200
Subject: [Python-3000] Removing __del__
References: <023701c6dc34$8a79dc50$a14c2597@bagio>
	<20060919224048.1717.886353737.divmod.quotient.54336@ohm>
Message-ID: <016801c6dcb9$d2915c40$e303030a@trilan>

Jean-Paul Calderone wrote:

>>> Since we're apparently still in "propose wild ideas" mode for Py3K
>>> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
>>> not "tweak" it, just remove it and perhaps add a note in the manual
>>> pointing people to the weakref module.
>>
>>
>> I don't use __del__ much. I use it only in leaf classes, where it
>> surely can't be part of loops. In those rare cases, it's very useful
>> to me. For instance, I have a small classes which wraps an existing
>> handle-based C API exported to Python. Something along the lines of:
>>
>> class Wrapper:
>>    def __init__(self, *args):
>>           self.handle = CAPI.init(*args)
>>
>>    def __del__(self, *args):
>>            CAPI.close(self.handle)
>>
>>    def foo(self):
>>            CAPI.foo(self.handle)
>>
>> The real class isn't much longer than this (really). How do you
>> propose to write this same code without __del__?
>
> Untested, but roughly:
>
>     _weakrefs = []
>
>     def _cleanup(ref, handle):
>         _weakrefs.remove(ref)
>         CAPI.close(handle)
>
>     class BetterWrapper:
>         def __init__(self, *args):
>             handle = self.handle = CAPI.init(*args)
>             _weakrefs.append(
>                 weakref.ref(self,
>                     lambda ref: _cleanup(ref, handle)))
>
>     def foo(self):
>         CAPI.foo(self.handle)
>
> There are probably even better ways too, this is just the first that
> comes to mind.

Thanks for the example.

Thus, I believe my example is a good use case for __del__ with no good
enough workaround, which was requested by Micheal in the original post. I
believe that it would be a mistake to remove __del__ unless we provide a
graceful alternative (and I don't consider the code above a graceful
alternative). I still like the __close__ method being proposed. I'd love to
see a PEP for it.
-- 
Giovanni Bajo


From fredrik at pythonware.com  Wed Sep 20 15:38:59 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 20 Sep 2006 15:38:59 +0200
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <451142FF.20203@solarsail.hcs.harvard.edu>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
	<451142FF.20203@solarsail.hcs.harvard.edu>
Message-ID: <eerg9j$pj8$1@sea.gmane.org>

Ivan Krstić wrote:

> The `this` and `the matter` referred not to removing the GIL, but
> providing some form of sane multiprocessing support that doesn't require
> everyone interested in MP to reinvent the wheel.

no need to wait for Guido for this: adding library support for shared-
memory dictionaries/lists is a no-brainer.  if you have experience in 
this field, start hacking.  I'll take care of the rest ;-)

</F>


From fredrik at pythonware.com  Wed Sep 20 16:02:28 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 20 Sep 2006 16:02:28 +0200
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <eerg9j$pj8$1@sea.gmane.org>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>	<451142FF.20203@solarsail.hcs.harvard.edu>
	<eerg9j$pj8$1@sea.gmane.org>
Message-ID: <451149F4.2040501@pythonware.com>

Fredrik Lundh wrote:

> no need to wait for Guido for this: adding library support for shared-
> memory dictionaries/lists is a no-brainer.  if you have experience in 
> this field, start hacking.  I'll take care of the rest ;-)

and no need to wait for Python 3000 either, of course -- I see no reason 
why this cannot go into some 2.X release.

</F>



From jcarlson at uci.edu  Wed Sep 20 17:50:25 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 08:50:25 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
Message-ID: <20060920083244.0817.JCARLSON@uci.edu>


"Adam Olsen" <rhamph at gmail.com> wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface.  My thoughts
> so far:

I believe the only option actually up for decision is what the internal
representation of a unicode object will be.  UTF-8 that is never changed?
UTF-8 that is converted to UCS-2/4 on certain kinds of accesses?
Latin-1/UCS-2/UCS-4 depending on code point content?  Always UCS-2/4,
depending on a compiler switch?


> * Most transformation and testing methods (.lower(), .islower(), etc)
> can be copied directly from 2.x.  They require no special
> implementation to perform reasonably.

A decoding variant of these would be required if the underlying
representation of a particular string is not latin-1, ucs-2, or ucs-4.

Further, any rstrip/split/etc. methods need to scan/parse the entire
string in order to discover code point starts/ends when using a utf-*
variant as an internal encoding (except for utf-32, which has a constant
width per character).

Whether or not we choose to go with a varying internal representation 
(the latin-1/ucs-2/ucs-4 variant I have been suggesting), 


> * Indexing and slicing is the big issue.  Do we need constant-time
> integer slicing?  .find() could be changed to return a token that
> could be used as a constant-time offset.  Incrementing the token would
> have linear costs, but that's no big deal if the offsets are always
> small.

If by "constant-time integer slicing" you mean "find the start and end
memory offsets of a slice in constant time", I would say yes.

Generally, I think tokens (in unicode strings) are a waste of time and
implementation.  Giving each string a fixed-width per character allows
methods on those unicode strings to be far simpler in implementation.


> * Grapheme clusters, words, lines, other groupings, do we need/want
> ways to slice based on them too?

No.


> * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> want to support them?  Now would be the time.

This would imply a tree-based string, which Guido has specifically
stated would not happen.  Never mind that it would be a beast to
implement and maintain, or that it would exclude the possibility of
offering the single-segment buffer interface without reprocessing.

 - Josiah


From mcherm at mcherm.com  Wed Sep 20 18:27:56 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 09:27:56 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>

Giovanni Bajo writes:
> I believe my example is a good use case for __del__ with no good
enough workaround, which was requested by Michael in the original post. I
> believe that it would be a mistake to remove __del__ unless we provide a
> graceful alternative (and I don't consider the code above a graceful
> alternative). I still like the __close__ method being proposed.

Thank you!

This is exactly the kind of discussion that I was hoping to engender.
Let me see if I can make the case a little more effectively. First of
all, let clean up Jean-Paul's solution a little bit so it looks prettier
when used. Let's put the following code into a module:

----- deletions.py -----
import weakref

# Maintain a separate list so the weakrefs themselves
# won't be garbage collected.
on_del_callbacks = []


def on_del_invoke(obj, func, *args, **kwargs):
     """This sets up a callback to be executed when an object
     is finalized. It is similar to the old __del__ method but
     without some of the risks and limitations of that method.

     The first argument is an object to watch; the second is a
     callable. After the object being watched gets finalized,
     the callable will be invoked; arguments for this call can
     be provided after the callable.

     Please note that the callable must not be a bound method
     of the object being watched, and the object being watched
     must not be (or be referred to by) one of the arguments
     or else the object will never be garbage collected."""

     def callback(ref):
         on_del_callbacks.remove(ref)
         func(*args, **kwargs)

     on_del_callbacks.append(
         weakref.ref(obj, callback))
--- end deletions.py ---

Performance could be improved in minor ways (avoiding the O(n)
lookup cost in the remove() call; avoiding the need for a
separate function object for each callback; catching obvious
loops and raising an exception immediately to make it more
newbie-friendly), but this will do for discussion.
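
(For instance, swapping the list for a set makes the removal O(1);
the rest of this message sticks with the list version above:)

    on_del_callbacks = set()

    def on_del_invoke(obj, func, *args, **kwargs):
        def callback(ref):
            on_del_callbacks.discard(ref)   # O(1), unlike list.remove()
            func(*args, **kwargs)
        on_del_callbacks.add(weakref.ref(obj, callback))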

Using this, your original code:

> class Wrapper:
>     def __init__(self, *args):
>          self.handle = CAPI.init(*args)
>
>     def __del__(self, *args):
>           CAPI.close(self.handle)
>
>     def foo(self):
>           CAPI.foo(self.handle)

becomes this code:

   from deletions import on_del_invoke

   class Wrapper:
       def __init__(self, *args):
           self.handle = CAPI.init(*args)
           on_del_invoke(self, CAPI.close, self.handle)

       def foo(self):
           CAPI.foo(self.handle)

It's actually *fewer* lines this way, and I find it quite
readable. Furthermore, unlike the __del__ version it doesn't
break as soon as someone accidentally puts a Wrapper object
into a loop.

Working from this example, I'm not convinced that the price
of giving up __del__ is really all that high. (But please,
find another example to convince me!)

On the other side of the scales, here are some benefits that
we gain if we get rid of __del__:

   * Simpler GC code which is less likely to have obscure
     bugs that are incredibly difficult to track down. Less
     core developer time spent maintaining complex, fragile
     code.

   * No need to explain about keeping __del__ objects[1] out
     of reference loops. In exchange, we choose to explain
     about not passing the object being monitored or
     anything that links to it as arguments to on_del_invoke.
     I find that preferable because: (1) it seems more
     intuitive to me that the callback musn't reference the
     object being finalized, (2) it requires reasoning about
     the call-site, not about all future uses of the object,
     and (3) if the programmer violates this rule then the
     disadvantage is that the objects become immortal -- which
     is true for ALL __del__ objects in loops today.

   * Programmers no longer have the ability to allow __del__
     to resurrect the object being finalized. Technically,
     that's a disadvantage, not an advantage, but I honestly
     don't think anyone believes it's a good idea to write
     __del__ methods that resurrect the object, so I'm happy
     to lose that ability.

-- Michael Chermside


[1] - I'm using "__del__ object" to mean an object that has
    a __del__ method.


From guido at python.org  Wed Sep 20 18:40:49 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 09:40:49 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
Message-ID: <ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface.  My thoughts
> so far:

Let me cut this short. The external string API in Py3k should not
change or only very marginally so (like removing rarely used useless
APIs or adding a few new conveniences). The plan is to keep the 2.x
API that is supported (in 2.x) by both str and unicode, but merge the
two string types into one. Anything else could be done just as easily
before or after Py3k.

OTOH, if you want to start to gather requirements for the bytes API,
now is the time.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mcherm at mcherm.com  Wed Sep 20 18:48:10 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 09:48:10 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <20060920094810.2lpvw6n4pk4k44wc@login.werra.lunarpages.com>

Greg Ewing writes:
> A worry about that is whether the architecture required to
> allow pluggable GC implementations introduces inefficiencies
> of its own that would skew the results.

Bob Ippolito
> There's no need to worry about that in the case of PyPy. Those kinds
> of choices are made way before runtime, so there's no required
> indirection.


Someone who knows PyPy better than me should feel free to chime in if I
get things wrong, but I *think* that it happens well before runtime --
before compile-time, even; more like "the time at which the interpreter
itself is compiled". So if you have PyPy set up to compile to C and use
reference-counting GC, then it generates calls to INCR and DECR before
and after variable accesses, but if you have it set up to compile to
LLVM, which has its own tracing GC, then it doesn't generate anything
before and after variable accesses.


Greg again:
> Even so, we're talking about machine-generated code rather
> than the sort of hand-crafting you need to get the best
> out of something critical like GC. There could still be
> room for inefficiencies.

Quite true. As further illustration Python's GC is written in
C and thus you can't get the kind of efficiency you might out
of hand-crafted assembly. Unless of course the machine generating
the code is actually smarter about optimization than the hand
that's crafting it, or if the two are close enough in performance
that we don't mind.

I don't think PyPy has anything to teach us about GC performance
*yet*, but I think their approach is quite promising as a
platform for running this kind of experiment.

-- Michael Chermside


From jimjjewett at gmail.com  Wed Sep 20 19:09:14 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 13:09:14 -0400
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920083244.0817.JCARLSON@uci.edu>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<20060920083244.0817.JCARLSON@uci.edu>
Message-ID: <fb6fbf560609201009x121f261dsdbcee088a9255bbf@mail.gmail.com>

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:

> "Adam Olsen" <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:

> I believe the only options up for actual decision is what the internal
> representation of a unicode object will be.

If I request string[4:7], what (format of string) will come back?

The same format as the original string?
A canonical format?
The narrowest possible for the new string?

When a recoding occurs, is that in addition to the original format, or
instead of?  (I think "in addition" would be useful, as we're likely
to need that original format back for output -- but it does waste
space when we don't need the original again.)

> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

No.  That is true of some encodings, but not the UTF variants.  A byte
(or double-byte, for UTF-16) is unambiguous.

Within a specific encoding, each possible (byte or double-byte) value
represents at most one of

    a complete value
    the start of a multi-position value
    the continuation of a multi-position value
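
In UTF-8 terms, for instance (a sketch; it ignores byte values that can
never appear in valid UTF-8):

    def utf8_kind(byte_value):
        if byte_value < 0x80:
            return 'complete'        # 0xxxxxxx: a whole (ASCII) code point
        if byte_value < 0xC0:
            return 'continuation'    # 10xxxxxx
        return 'start'               # 11xxxxxx: leads a multi-byte sequence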

That said, string[47:-34] may need to parse the whole string, just to
count double-position characters.  (To be honest, I'm not sure even
then; for UTF-16 it might make sense to treat surrogates as
double-width characters.  Even for UTF-8, there might be a workaround
that speeds up the majority of strings.)

> Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

Which is why that was done in Py 2K.  The question for Py3K is

    Should we *commit* to this particular representation and allow
direct access to the internals?

    Or should we treat the internals as opaque, and allow more
efficient representations if someone wants to write one.

Today, I can go ahead and write my own string representation, but if I
change the internal storage, I can't actually use it with most
compiled extensions.

> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?

> No.

I assume that you don't really mean strings will stop supporting split()

> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them?  Now would be the time.

> This would imply a tree-based string,

Cheap slicing wouldn't.
Cheap concatenation in *all* cases would.
Cheap concatenation in a few lucky cases wouldn't.

> it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

I'm not sure exactly what you mean here.  If you just mean "C code
can't get at the internals without warning", then that is true.

It is also true that any function requesting the internals would need
to either get the encoding along with it, or work with bytes.

If the C code wants that buffer in a specific encoding, it will have
to request that, which might well require reprocessing.  But if so,
then this recoding already happens today -- it is just that today, we
do it for every string, instead of only the ones that need it.  (But
today, the recoding happens earlier, which can be better for
debugging.)

-jJ

From rhamph at gmail.com  Wed Sep 20 19:47:39 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 11:47:39 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920083244.0817.JCARLSON@uci.edu>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<20060920083244.0817.JCARLSON@uci.edu>
Message-ID: <aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Adam Olsen" <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:
>
> I believe the only options up for actual decision is what the internal
> representation of a unicode object will be.  Utf-8 that is never changed?
> Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
> depending on compiler switch?

Just a minor nit.  I doubt we could accept UCS-2, we'd want UTF-16
instead, with all the variable-width goodness that brings in.

Or maybe not so minor.  Old versions of windows used UCS-2, new
versions use UTF-16.  The former should get errors if too high of a
character is used, the latter will need conversion if we're not using
UTF-16.
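
(Concretely, "too high of a character" means anything above U+FFFF,
which UTF-16 spreads across two code units -- a sketch of the
arithmetic:)

    def to_surrogates(cp):
        assert cp > 0xFFFF               # a supplementary code point
        cp -= 0x10000
        return (0xD800 + (cp >> 10),     # high (lead) surrogate
                0xDC00 + (cp & 0x3FF))   # low (trail) surrogate

    # to_surrogates(0x1D11E) == (0xD834, 0xDD1E), MUSICAL SYMBOL G CLEF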


> > * Most transformation and testing methods (.lower(), .islower(), etc)
> > can be copied directly from 2.x.  They require no special
> > implementation to perform reasonably.
>
> A decoding variant of these would be required if the underlying
> representation of a particular string is not latin-1, ucs-2, or ucs-4.

That makes no sense.  They can operate on any encoding we design them
to.  The cost is always O(n) with the length of the string.


> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

See below.


> Whether or not we choose to go with a varying internal representation
> (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
>
>
> > * Indexing and slicing is the big issue.  Do we need constant-time
> > integer slicing?  .find() could be changed to return a token that
> > could be used as a constant-time offset.  Incrementing the token would
> > have linear costs, but that's no big deal if the offsets are always
> > small.
>
> If by "constant-time integer slicing" you mean "find the start and end
> memory offsets of a slice in constant time", I would say yes.
>
> Generally, I think tokens (in unicode strings) are a waste of time and
> implementation.  Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

s = 'foobar'
p = (s[s.find('bar'):] == 'bar')

Even if .find() is made to return a token, rather than an integer, the
behavior and performance of this example are unchanged.

However, I can imagine there might be use cases, such as the .find()
output on one string being used to slice a different string, which
tokens wouldn't support.  I haven't been able to dream up any sane
examples, which is why I asked about it here.  I want to see specific
examples showing that tokens won't work.

Using only utf-8 would be simpler than three distinct representations.
And if memory usage is an issue (which it seems to be, albeit in a
vague way), we could make a custom encoding that's even simpler and
more space-efficient than utf-8.


> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?
>
> No.

Can you explain your reasoning?


> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them?  Now would be the time.
>
> This would imply a tree-based string, which Guido has specifically
> stated would not happen.  Never mind that it would be a beast to
> implement and maintain or that it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

The only reference I found was this:
http://mail.python.org/pipermail/python-3000/2006-August/003334.html

I interpret that as him being very sceptical, not an outright refusal.

Allowing external code to operate on a python string in-place seems
tenuous at best.  Even with three types (Latin-1, UCS-2, UCS-4) you
would need to automatically copy and convert if the wrong type is
given.

-- 
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com  Wed Sep 20 20:20:13 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 12:20:13 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
Message-ID: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:
>
> Let me cut this short. The external string API in Py3k should not
> change or only very marginally so (like removing rarely used useless
> APIs or adding a few new conveniences). The plan is to keep the 2.x
> API that is supported (in 2.x) by both str and unicode, but merge the
two string types into one. Anything else could be done just as easily
> before or after Py3k.

Thanks, but one thing remains unclear: is the indexing intended to
represent bytes, code points, or code units?  Note that C code
operating on UTF-16 would use code units for slicing of UTF-16, which
splits surrogate pairs.

As far as I can tell, CPython on Windows uses UTF-16 with code units.
Perhaps not intentionally, but by default (not throwing an error on
surrogates).

For those trying to make sense of this, a Code Point is anything in the 0
to 0x10FFFF range.  A Code Unit goes up to 0xFF for UTF-8, 0xFFFF for
UTF-16, and 0xFFFFFFFF for UTF-32.  One or more code units may be
needed to form a single code point.  Obviously code units expose our
internal implementation choice.
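
For example (illustration only; any code point above U+FFFF will do):

    s = u'\U00010143'   # a single code point outside the BMP
    # narrow (UTF-16) build: len(s) == 2 -- a surrogate pair of code units
    # wide (UCS-4) build:    len(s) == 1 -- one code unit per code point
    print(len(s))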

-- 
Adam Olsen, aka Rhamphoryncus

From brett at python.org  Wed Sep 20 20:30:28 2006
From: brett at python.org (Brett Cannon)
Date: Wed, 20 Sep 2006 11:30:28 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
Message-ID: <bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>
> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > Let me cut this short. The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units?  Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.


Assuming my Unicode lingo is right and code point represents a
letter/character/digraph/whatever, then it will be a code point.  Doing one
of my rare channelings of Guido, I *really* doubt he wants to expose the
technical details of Unicode to the point of having people need to realize
that UTF-8 takes two bytes to represent "é".  If you want that kind of
exposure, use the bytes type.  Otherwise assume the usage will be by people
ignorant of Unicode who thus want something that will work the way they are
used to from working in ASCII.

-Brett

From guido at python.org  Wed Sep 20 20:32:04 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 11:32:04 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
Message-ID: <ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > Let me cut this short. The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> > two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units?

I don't see what's unclear -- the existing unicode object does what it does.

> Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.

I thought we were discussing the Python API.

C code will likely have the same access to unicode objects as it has in 2.x.

> As far as I can tell, CPython on windows uses UTF-16 with code units.
> Perhaps not intentionally, but by default (not throwing an error on
> surrogates).

This is intentional, to be compatible with the rest of that platform.
Jython and IronPython do this too I believe.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Wed Sep 20 20:43:03 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 12:43:03 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
Message-ID: <aac2c7cb0609201143u29398cbbr247de08db5d1fcbb@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > > Before we can decide on the internal representation of our unicode
> > > > objects, we need to decide on their external interface.  My thoughts
> > > > so far:
> > >
> > > Let me cut this short. The external string API in Py3k should not
> > > change or only very marginally so (like removing rarely used useless
> > > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > > API that is supported (in 2.x) by both str and unicode, but merge the
> > > two string types into one. Anything else could be done just as easily
> > > before or after Py3k.
> >
> > Thanks, but one thing remains unclear: is the indexing intended to
> > represent bytes, code points, or code units?
>
> I don't see what's unclear -- the existing unicode object does what it does.

The existing unicode object doesn't expose the difference between them
except when UTF-16 is used and surrogates exist.


> > Note that C code
> > operating on UTF-16 would use code units for slicing of UTF-16, which
> > splits surrogate pairs.
>
> I thought we were discussing the Python API.
>
> C code will likely have the same access to unicode objects as it has in 2.x.

I only mentioned it because C doesn't mind exposing the internal
details for performance benefits, whereas python usually does mind.


> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

So you're saying we should use code units?!  Or are you referring to
the choice of UTF-16?

I would expect us to use code points in 3.x, but that's not how it is in 2.x.

-- 
Adam Olsen, aka Rhamphoryncus

From jimjjewett at gmail.com  Wed Sep 20 21:04:23 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 15:04:23 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>

On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> Giovanni Bajo writes:
> > I believe my example is a good use case for __del__ with no good
> > enough workaround, ... I still like the __close__ method being proposed.

[Michael asks about this alternative]
...
> def on_del_invoke(obj, func, *args, **kwargs):
...
>      Please note that the callable must not be a bound method
>      of the object being watched, and the object being watched
>      must not be (or be referred to by) one of the arguments
>      or else the object will never be garbage collected."""

By far the most frequently desired callable is self.close.

You can work around this with a wrapper, by setting self.f=open(...)
and then passing self.f.close -- but with this API, I'll be wondering
why I can't just register self.f as the object in the first place.

If bound methods did not increment the refcount, this would work, but
I imagine it would break various GUI and event-processing idioms.

A special rebind-this-method-weakly builtin would work, but I'm not
sure that is any simpler than __close__.  (~= __del__ but cycles can
be broken in an arbitrary order)

> Using this, your original code:

> > class Wrapper:
> >     def __init__(self, *args):
> >          self.handle = CAPI.init(*args)

> >     def __del__(self, *args):
> >           CAPI.close(self.handle)

> >     def foo(self):
> >           CAPI.foo(self.handle)

> becomes this code:

>    from deletions import on_del_invoke

>    class Wrapper:
>        def __init__(self, *args):
>            self.handle = CAPI.init(*args)
>            on_del_invoke(self, CAPI.close, self.handle)

>        def foo(self):
>            CAPI.foo(self.handle)

Note that the wrapper (as posted) does nothing except store a pointer
to the CAPI object and then delegate to it.  With a __close__ method,
this class could reduce to (at most)

    class MyCAPI(CAPI):
        __close__ = close

Since the CAPI class could use the __close__ convention directly, the
wrapper could be eliminated entirely.  (In real life, his class might
do more ... but if so, then *these* lines are still boilerplate that
it would be good to remove).

> On the other side of the scales, here are some benefits that
> we gain if we get rid of __del__:

>    * No need to explain about keeping __del__ objects[1] out
>      of reference loops. In exchange, we choose to explain
>      about not passing the object being monitored or
>      anything that links to it as arguments to on_del_invoke.

Adding an extra wrapper just to avoid passing self isn't really any
better than adding an extra cleanup object hanging off an attribute to
avoid loops.  So the explanation might be better, but the resulting
code would end up using the same workarounds that are recommended (but
often not used) today.

>     (3) if the programmer violates this rule then the
>      disadvantage is that the objects become immortal -- which
>      is true for ALL __del__ objects in loops today.

But most objects are not in a __del__ loop.  By passing a bound
method, the user makes the object immortal even if it is the only
object that needs cleanup.

>    * Programmers no longer have the ability to allow __del__
>      to resurrect the object being finalized. Technically,
>      that's a disadvantage, not an advantage, but I honestly
>      don't think anyone believes it's a good idea to write
>      __del__ methods that resurrect the object, so I'm happy
>      to lose that ability.

How do you feel about the __del__ in stdlib subprocess.Popen (about line 615)?

This resurrects itself, in order to finish waiting for the child
process.  If the child isn't done yet, then it will check again the
next time a new Popen is created (or at final closedown).  Without
this ability to reschedule itself, it would have to do a blocking
wait, which might put some odd pressures on concurrency.

(And note that if it needed to revive (not recreate, revive)
subobjects, it would need the full immortal-cycle power of today's
__del__.  It may be valid not to support this case, but it isn't
automatically bad usage.)
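
A toy sketch of that pattern (not the actual subprocess code; _active
here is a hypothetical module-level registry):

    _active = []   # objects still awaiting cleanup

    class Handle(object):
        def _poll(self):
            # stand-in for a non-blocking wait; True once the child is done
            return False

        def __del__(self):
            if not self._poll():
                # resurrect: keep a new reference so cleanup can be
                # retried later, instead of blocking here.  (2.x will
                # call __del__ again when the refcount next drops.)
                _active.append(self)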

-jJ

From jimjjewett at gmail.com  Wed Sep 20 22:59:22 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 16:59:22 -0400
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
Message-ID: <fb6fbf560609201359v368ac15ey189165db957e9dff@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:

> > > Let me cut this short. The external string API in Py3k should not
> > > change or only very marginally so (like removing rarely used useless
> > > APIs or adding a few new conveniences).

...

> I thought we were discussing the Python API.

I don't think anyone has proposed much change to strings *as seen from
python*.

At most, there has been an implicit suggestion that the
bytes.decode().encode() dance be shortened.

> C code will likely have the same access to unicode objects as it has in 2.x.

Can C code still assume that

     (1)  the data buffer will always be available for any sort of
direct manipulation (including mutation)

     (2)  in a specific canonical encoding

     (3)  directly from the memory layout, without calling a "prepare"
or "recode" or "encode" method first?

Today, that canonical encoding is a compile-time choice, and any
specific choice causes integration hassles.

Unless the choice matches the system default for text, it also
requires many decode/encode round trips that might otherwise be
avoided.

The proposed changes mostly boil down to removing the third
assumption, and agreeing that some implementations might delay the
decode-to-canonical-format until it was needed.


Rough Summary of new C API restrictions:

Replace
    ((PyStringObject *)string)->ob_sval   /* supported today */
with
    PyString_AsString(string)                 /* already recommended */

or replace
    ((PyUnicodeObject *)string)->str       /* supported today */
and
    ((PyUnicodeObject *)string)->defenc    /* supported today */

with
    PyUnicode_AsEncodedString(PyObject *unicode,   /* already recommended */
                              const char *encoding,
                              const char *errors)
and
    PyUnicode_AsAnyString(PyObject *unicode,      /* new */
                          char **encoding,   /* return the actual encoding */
                          const char *errors)

Also note that some macros would need to become functions.  The most
prominent is

    PyUnicode_AS_DATA(string)         /* supports mutation */

-jJ

From jcarlson at uci.edu  Wed Sep 20 23:20:22 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 14:20:22 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <fb6fbf560609201009x121f261dsdbcee088a9255bbf@mail.gmail.com>
References: <20060920083244.0817.JCARLSON@uci.edu>
	<fb6fbf560609201009x121f261dsdbcee088a9255bbf@mail.gmail.com>
Message-ID: <20060920135601.0822.JCARLSON@uci.edu>


"Jim Jewett" <jimjjewett at gmail.com> wrote:
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> 
> > I believe the only options up for actual decision is what the internal
> > representation of a unicode object will be.
> 
> If I request string[4:7], what (format of string) will come back?
> 
> The same format as the original string?
> A canonical format?
> The narrowest possible for the new string?

Which of the three depends on the choice of internal representation.  If
the internal representation is always canonical, narrowest, or same as
the original string, then it would be one of those.


> When a recoding occurs, is that in addition to the original format, or
> instead of?  (I think "in addition" would be useful, as we're likely
> to need that original format back for output -- but it does waste
> space when we don't need the original again.)

The current implementation, I believe, uses "in addition", unless I'm
misreading the unicode string struct.


> > Further, any rstrip/split/etc. methods need to scan/parse the entire
> > string in order to discover code point starts/ends when using a utf-*
> > variant as an internal encoding (except for utf-32, which has a constant
> > width per character).
> 
> No.  That is true of some encodings, but not the UTF variants.  A byte
> (or double-byte, for UTF-16) is unambiguous.

I was under the impression that utf-8 was a particular kind of prefix
encoding.  Looking at the actual output of utf-8, I notice that the
encodings are such that bytes with value >= 0xc0 mark the beginning of
the multi-byte sequences, so handling 'from the front' or 'from the
back' is equally reasonable.
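
A quick sketch of that "from the back" scan (hypothetical helper):

    def codepoint_start(buf, pos):
        # continuation bytes look like 10xxxxxx; anything else (ASCII,
        # or a lead byte >= 0xc0) starts a code point
        while pos > 0 and (ord(buf[pos:pos+1]) & 0xC0) == 0x80:
            pos -= 1
        return pos

    buf = u'na\xefve'.encode('utf-8')   # \xef encodes to two bytes in utf-8
    assert codepoint_start(buf, 3) == 2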


> That said, string[47:-34] may need to parse the whole string, just to
> count double-position characters.  (To be honest, I'm not sure even
> then; for UTF-16 it might make sense to treat surrogates as
> double-width characters.  Even for UTF-8, there might be a workaround
> that speeds up the majority of strings.)

It would involve keeping some sort of cache of indices/offset values. 
This may not be worthwhile.


> > Giving each string a fixed-width per character allows
> > methods on those unicode strings to be far simpler in implementation.
> 
> Which is why that was done in Py 2K.  The question for Py3K is
> 
>     Should we *commit* to this particular representation and allow
> direct access to the internals?

Why not?

>     Or should we treat the internals as opaque, and allow more
> efficient representations if someone wants to write one?

I'm not sure that the efficiencies are necessarily desireable.

> Today, I can go ahead and write my own string representation, but if I
> change the internal storage, I can't actually use it with most
> compiled extensions.

Right, but extensions that are used *right now* would need to be
rewritten to handle these "more efficient" representations.


> > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > ways to slice based on them too?
> 
> > No.
> 
> I assume that you don't really mean strings will stop supporting split().

That would be silly.  What I meant was that text.word[7], text.line[3],
etc., shouldn't mean anything on the base implementation.


> > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > > want to support them?  Now would be the time.
> 
> > This would imply a tree-based string,
> 
> Cheap slicing wouldn't.

O(log n) would imply a tree-based string.  O(1) would imply slicing on
text returning views (which I'm not even advocating, and I'm a view
proponent).

> Cheap concatenation in *all* cases would.
> Cheap concatenation in a few lucky cases wouldn't.

Presumably one would need to copy data from one to the other, so that
would be O(n) with a non-tree version.


> > it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
> 
> I'm not sure exactly what you mean here.  If you just mean "C code
> can't get at the internals without warning", then that is true.

The single-segment buffer interface is, not uncommonly, how C extensions
get at the content of strings, unicode, array, mmap, etc.  Technically
speaking, the current implementations of str and unicode use an internal
variant to gain access to their own internals for processing.


> It is also true that any function requesting the internals would need
> to either get the encoding along with it, or work with bytes.

Or code points...  The point of specifying the character width as 1, 2,
or 4 bytes would be that one can iterate over chars, shorts, or ints.


> If the C code wants that buffer in a specific encoding, it will have
> to request that, which might well require reprocessing.  But if so,
> then this recoding already happens today -- it is just that today, we
> do it for every string, instead of only the ones that need it.  (But
> today, the recoding happens earlier, which can be better for
> debugging.)

Indeed.  But it's not just for C extensions, it's for Python's own
string/unicode internals.  Simple is better than complex.  Having a flat
array-based implementation is simple, and allows us to re-use the vast
majority of code we already have.

 - Josiah


From jcarlson at uci.edu  Wed Sep 20 23:59:22 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 14:59:22 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
References: <20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
Message-ID: <20060920142332.0825.JCARLSON@uci.edu>


"Adam Olsen" <rhamph at gmail.com> wrote:
> 
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > I believe the only options up for actual decision is what the internal
> > representation of a unicode object will be.  Utf-8 that is never changed?
> > Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> > Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
> > depending on compiler switch?
> 
> Just a minor nit.  I doubt we could accept UCS-2; we'd want UTF-16
> instead, with all the variable-width goodness that brings in.

If we are opting for a *single* internal representation, then UTF-16 or
UTF-32 are really the only options.

> > > * Most transformation and testing methods (.lower(), .islower(), etc)
> > > can be copied directly from 2.x.  They require no special
> > > implementation to perform reasonably.
> >
> > A decoding variant of these would be required if the underlying
> > representation of a particular string is not latin-1, ucs-2, or ucs-4.
> 
> That makes no sense.  They can operate on any encoding we design them
> to.  The cost is always O(n) with the length of the string.

I was thinking .startswith() and .endswith(), but assuming *some*
canonical representation (UTF-16, UTF-32, etc.) this is trivial to
implement.  I take back my concerns on this particular point.


> > Whether or not we choose to go with a varying internal representation
> > (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
> >
> >
> > > * Indexing and slicing is the big issue.  Do we need constant-time
> > > integer slicing?  .find() could be changed to return a token that
> > > could be used as a constant-time offset.  Incrementing the token would
> > > have linear costs, but that's no big deal if the offsets are always
> > > small.
> >
> > If by "constant-time integer slicing" you mean "find the start and end
> > memory offsets of a slice in constant time", I would say yes.
> >
> > Generally, I think tokens (in unicode strings) are a waste of time and
> > implementation.  Giving each string a fixed-width per character allows
> > methods on those unicode strings to be far simpler in implementation.
> 
> However, I can imagine there might be use cases, such as the .find()
> output on one string being used to slice a different string, which
> tokens wouldn't support.  I haven't been able to dream up any sane
> examples, which is why I asked about it here.  I want to see specific
> examples showing that tokens won't work.

    p = s[6:-6]

Or even in actual code I use today:

    p = s.lstrip()
    lil = len(s) - len(p)
    si = s[:lil]
    lil += si.count('\t')*(self.GetTabWidth()-1)
    
    #s is the original line
    #p is the line without leading indentation
    #si is the line indentation characters
    #lil is the indentation of the line in columns

If I can't slice based on character index, then we end up with a similar
situation that the wxPython StyledTextCtrl runs into right now: the
content is encoded via utf-8 internally, so users have to use the fairly
annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
where characters start/end.  While it is possible to handle everything
this way, it is *damn annoying*, and some users have gone so far as to
say that it *doesn't work* for Europeans.

While I won't make the claim that it *doesn't work*, it is a pain in the
ass.


> Using only utf-8 would be simpler than three distinct representations.
>  And if memory usage is an issue (which it seems to be, albeit in a
> vague way), we could make a custom encoding that's even simpler and
> more space efficient than utf-8.

One of the reasons I've been pushing for the 3 representations is
because it is (arguably) optimal for any particular string.


> > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > ways to slice based on them too?
> >
> > No.
> 
> Can you explain your reasoning?

We can already split based on words, lines, etc., using split() and
re.split().  Building additional functionality for text.word[4] seems to
be a waste of time.


> > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > > want to support them?  Now would be the time.
> >
> > This would imply a tree-based string, which Guido has specifically
> > stated would not happen.  Never mind that it would be a beast to
> > implement and maintain or that it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
> 
> The only reference I found was this:
> http://mail.python.org/pipermail/python-3000/2006-August/003334.html
> 
> I interpret that as him being very sceptical, not an outright refusal.
> 
> Allowing external code to operate on a python string in-place seems
> tenuous at best.  Even with three types (Latin-1, UCS-2, UCS-4) you
> would need to automatically copy and convert if the wrong type is
> given.

The only benefits that utf-8 gains over any other internal
representation are that it is an arguably minimal-sized representation
and that it is commonly used among other C libraries.

The benefits gained by using the three internal representations are
primarily from a simplicity standpoint.  That is to say, when
manipulating any one of the three representations, you know that the
value at offset X represents the code point of character X in the string.
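
As a sketch of the selection rule (a hypothetical helper, not part of
any concrete proposal):

    def choose_width(text):
        # pick the narrowest fixed width that holds every code point
        highest = max([ord(c) for c in text] + [0])
        if highest < 0x100:
            return 1   # latin-1
        elif highest < 0x10000:
            return 2   # ucs-2
        else:
            return 4   # ucs-4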

Further, with a slight change in how the single-segment buffer interface
is defined (returns the width of the character), C extensions that want
to deal with unicode strings in *native* format (due to concerns about
speed), could do so without having to worry about reencoding,
variable-width characters, etc.

You can get this same behavior by always using UTF-32 (aka UCS-4), but
at least 1/4 of the underlying data is always going to be nulls (code
points are limited to 0x0010ffff), and for many people (in Europe, the
US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
underlying data is going to be nulls.

While I would imagine that people could deal with UTF-16 as an
underlying representation (from a data waste perspective), the potential
for varying-width characters in such an encoding is a pain in the ass
(like it is for UTF-8).

Regardless of our choice, *some platform* is going to be angry.  Why?
GTK takes utf-8 encoded strings.  (I don't know what Qt or Linux system
calls take.)  Windows takes utf-16.  Whatever the underlying
representation, *someone* is going to have to recode when dealing with
GUI or OS-level operations.


 - Josiah


From qrczak at knm.org.pl  Thu Sep 21 00:34:40 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 00:34:40 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920142332.0825.JCARLSON@uci.edu> (Josiah Carlson's
	message of "Wed, 20 Sep 2006 14:59:22 -0700")
References: <20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
	<20060920142332.0825.JCARLSON@uci.edu>
Message-ID: <87ac4ui2v3.fsf@qrnik.zagroda>

Josiah Carlson <jcarlson at uci.edu> writes:

> Regardless of our choice, *some platform* is going to be angry.  Why? 
> GTK takes utf-8 encoded strings.  (I don't know what Qt or linux system
> calls take) Windows takes utf-16.

The representation of QChar in Qt-3.3.5:

    ushort ucs;
#if defined(QT_QSTRING_UCS_4)
    ushort grp;
#endif

The representation of QStringData in Qt-3.3.5:

    QChar *unicode;
    char *ascii;
#ifdef Q_OS_MAC9
    uint len;
#else
    uint len : 30;
#endif
    uint issimpletext : 1;
#ifdef Q_OS_MAC9
    uint maxl;
#else
    uint maxl : 30;
#endif
   uint islatin1 : 1;

I would say that it's silly. It seems the transition from UCS-2 to UCS-4
in Qt is incomplete. Almost no code is prepared for QT_QSTRING_UCS_4.
For example the implementation of a function which explains what
issimpletext means:

void QString::checkSimpleText() const
{
    QChar *p = d->unicode;
    QChar *end = p + d->len;
    while ( p < end ) {
        ushort uc = p->unicode();
        // sort out regions of complex text formatting
        if ( uc > 0x058f && ( uc < 0x1100 || uc > 0xfb0f ) ) {
            d->issimpletext = FALSE;
            return;
        }
        p++;
    }
    d->issimpletext = TRUE;
}

QChar documentation says:

   Unicode  characters are (so far) 16-bit entities without any markup or
   structure. This class represents such an entity. It is lightweight, so
   it can be used everywhere. Most compilers treat it like a "short int".
   (In  a  few  years  it may be necessary to make QChar 32-bit when more
   than 65536 Unicode code points have been defined and come into use.)

Bleh...

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From rasky at develer.com  Thu Sep 21 00:39:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 00:39:48 +0200
Subject: [Python-3000] Removing __del__
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <000f01c6dd05$ae08b8e0$7b4b2597@bagio>

Michael Chermside <mcherm at mcherm.com> wrote:

>    from deletions import on_del_invoke
>
>    class Wrapper:
>        def __init__(self, *args):
>            self.handle = CAPI.init(*args)
>            on_del_invoke(self, CAPI.close, self.handle)
>
>        def foo(self):
>            CAPI.foo(self.handle)
>
> It's actually *fewer* lines this way, and I find it quite
> readable.

It's fewer lines, but *less* readable than a simple plain method call. It's
still an indirection.

> Furthermore, unlike the __del__ version it doesn't
> break as soon as someone accidentally puts a Wrapper object
> into a loop.

Yes, but I'm an adult and I know that it won't. I'm not even touching __del__
with a hundred-foot pole if it's a class which has a 1% chance of getting
into a loop, really. I know it will always be a "leaf" class, if you know what
I mean.

> Working from this example, I'm not convinced that the price
> of giving up __del__ is really all that high.

If you ask me, I don't think I can find any library solution to finalization
acceptable. Finalization is really something that ought to be easy. If the
cyclic GC and __del__ don't get along well together, let's substitute __del__
with another finalization feature in the core, with an easy syntax and
semantics, which can cope better with the cyclic GC. Again, I vote for the
__close__ method (which is: just fix the semantics).

> (But please,
> find another example to convince me!)

Let's say:

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)

    def close(self):
        if self.handle is not None:
            CAPI.close(self.handle)
            self.handle = None
    __del__ = close

Now what, remove_on_del_invocation()?

> On the other side of the scales, here are some benefits that
> we gain if we get rid of __del__:
>
>    * Simpler GC code which is less likely to have obscure
>      bugs that are incredibly difficult to track down. Less
>      core developer time spent maintaining complex, fragile
>      code.

This is an argument against the current semantic of __del__, not against any
finalization method which is invoked during the cyclic GC. I believe that
__close__ fixes these problems as well.

>    * No need to explain about keeping __del__ objects[1] out
>      of reference loops. In exchange, we choose to explain
>      about not passing the object being monitored or
>      anything that links to it as arguments to on_del_invoke.
>      I find that preferable because: [...]

I think you are right in that the latter is preferable, but I think it's even
easier to just "avoid __del__ when coding, unless you are dramatically sure of
what you're doing". This way, you don't have to keep mental reference counts.

In fact, I believe we're missing a valuable tool for Python 2. Wouldn't it be
possible to have a debug mode where, between each statement (or very often at
least), Python looks for cycles with __del__ in them, and aborts execution? It
would be very useful to detect uncollectable cycles early, at the moment they
are created, instead of doing long sessions trying to parse gc.garbage.
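
A crude approximation is already possible with today's gc module (a
sketch; in 2.x, cycles whose objects define __del__ are parked in
gc.garbage rather than freed):

    import gc

    def assert_no_del_cycles():
        gc.collect()
        if gc.garbage:
            raise AssertionError('uncollectable: %r' % gc.garbage)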

Giovanni Bajo


From mcherm at mcherm.com  Thu Sep 21 00:41:15 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 15:41:15 -0700
Subject: [Python-3000] How will unicode get used?
Message-ID: <20060920154115.4wy2hnw6cnc4gkgw@login.werra.lunarpages.com>

Guido writes:
> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

The following code illustrates this:

>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>> msg[35:-18]
u'"\U00010143"'
>>> greek_five = msg[36:-19]
>>> len(greek_five)
2
>>> greek_five[0]
u'\ud800'
>>> greek_five[1]
u'\udd43'

The single unicode character greek_five, when expressed as a string
in CPython, has a length of 2 and can be sliced into two separate
characters. In Jython, the code above will not work because Jython
doesn't currently support \U or extended unicode (but someday that
may change). I'm not sure about IronPython.

So if I understand Guido's point, he's saying that it is on purpose
that len(greek_five) == 2. That's useful for compatibility today
with the Java and Microsoft VM platforms. But it's not particularly
compatible with extended Unicode. (Technically it doesn't violate
any rules so long as it's clearly defined that a character in Python
is NOT the same as a unicode code point.)

I wonder if it would be better to say that len(greek_five) is
undefined in Python. (And obviously slicing behavior follows from
len behavior.) There are excellent reasons for CPython to return
2 in the near future, but the far future is less clear. And
Jython and IronPython will be constrained by common sense to do
whatever their underlying platforms do, even if that changes in
the future.

Designing these things would be a lot easier if we had a time
machine so we could go see how extended Unicode is used in practice
a decade or two from now.

Oh, wait....

-- Michael Chermside


From mcherm at mcherm.com  Thu Sep 21 00:46:55 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 15:46:55 -0700
Subject: [Python-3000] How will unicode get used?
Message-ID: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>

I wrote:
>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>> msg[35:-18]
u'"\U00010143"'
>>> greek_five = msg[36:-19]
>>> len(greek_five)
2


After posting, I realized that it's worse than that. I suspect that if
I tried this on a CPython compiled with wide characters, then
len(greek_five) would be 1.

What should it be? 2? 1? Implementation-dependent?

-- Michael Chermside


From guido at python.org  Thu Sep 21 00:52:16 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 15:52:16 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>

On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> I wrote:
> >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> >>> msg[35:-18]
> u'"\U00010143"'
> >>> greek_five = msg[36:-19]
> >>> len(greek_five)
> 2
>
>
> After posting, I realized that it's worse than that. I suspect that if
> I tried this on a CPython compiled with wide characters, then
> len(greek_five) would be 1.
>
> What should it be? 2? 1? Implementation-dependent?

This has all been rehashed endlessly. It's implementation (and
platform- and compilation options-) dependent because there are good
reasons for both choices. Even if CPython 3.0 supports a dynamic
choice (which some are proposing) then the *language* will still make
it implementation dependent because of Jython and IronPython, where
the only choice is UTF-16 (or UCS-2, depending on the attitude towards
surrogates).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Thu Sep 21 00:52:38 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 16:52:38 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920142332.0825.JCARLSON@uci.edu>
References: <20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
	<20060920142332.0825.JCARLSON@uci.edu>
Message-ID: <aac2c7cb0609201552q605c20e0od3df500d4a13dff0@mail.gmail.com>

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Adam Olsen" <rhamph at gmail.com> wrote:
> >
> > On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > >
> > > "Adam Olsen" <rhamph at gmail.com> wrote:

[snip token stuff]

Withdrawn.  Blake Winston pointed me to some problems in private as well.


> If I can't slice based on character index, then we end up with a similar
> situation that the wxPython StyledTextCtrl runs into right now: the
> content is encoded via utf-8 internally, so users have to use the fairly
> annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> where characters start/end.  While it is possible to handle everything
> this way, it is *damn annoying*, and some users have gone so far as to
> say that it *doesn't work* for Europeans.
>
> While I won't make the claim that it *doesn't work*, it is a pain in the
> ass.

I'm going to agree with you.  That's also why I'm going to assume
Guido meant to use Code Points, not Code Units (which would be bytes
in the case of UTF-8).


> > Using only utf-8 would be simpler than three distinct representations.
> >  And if memory usage is an issue (which it seems to be, albeit in a
> > vague way), we could make a custom encoding that's even simpler and
> > more space efficient than utf-8.
>
> One of the reasons I've been pushing for the 3 representations is
> because it is (arguably) optimal for any particular string.

It bothers me that adding a single character would cause it to double
or quadruple in size.  May be the best compromise though.


> > > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > > ways to slice based on them too?
> > >
> > > No.
> >
> > Can you explain your reasoning?
>
> We can already split based on words, lines, etc., using split() and
> re.split().  Building additional functionality for text.word[4] seems to
> be a waste of time.

I'm not entirely convinced, but I'll leave it for now.  Maybe it'll be
a 3.1 feature.


> The benefits gained by using the three internal representations are
> primarily from a simplicity standpoint.  That is to say, when
> manipulating any one of the three representations, you know that the
> value at offset X represents the code point of character X in the string.
>
> Further, with a slight change in how the single-segment buffer interface
> is defined (returns the width of the character), C extensions that want
> to deal with unicode strings in *native* format (due to concerns about
> speed), could do so without having to worry about reencoding,
> variable-width characters, etc.

Is it really worthwhile if there's three different formats they'd have
to handle?


> You can get this same behavior by always using UTF-32 (aka UCS-4), but
> at least 1/4 of the underlying data is always going to be nulls (code
> points are limited to 0x0010ffff), and for many people (in Europe, the
> US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
> underlying data is going to be nulls.
>
> While I would imagine that people could deal with UTF-16 as an
> underlying representation (from a data waste perspective), the potential
> for varying-width characters in such an encoding is a pain in the ass
> (like it is for UTF-8).
>
> Regardless of our choice, *some platform* is going to be angry.  Why?
> GTK takes utf-8 encoded strings.  (I don't know what Qt or linux system
> calls take) Windows takes utf-16. Whatever underlying representation,
> *someone* is going to have to recode when dealing with GUI or OS-level
> operations.

Indeed, it seems like all our options are lose-lose.

Just to summarize, our requirements are:
* Full unicode range (0 through 0x10FFFF)
* Constant-time slicing using integer offsets
* Basic unit is a Code Point
* Contiguous in memory

The best idea I've had so far for making UTF-8 have constant-time
slicing is to use a two-level table, with the second level having one
byte per code point.  However, that brings up the minimum size to
(more than) 2 bytes per code point, ruining any space advantage that
utf-8 had.

UTF-16 is in the same boat, but it's (more than) 3 bytes per code point.

I think the only viable options (without changing the requirements)
are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4).  The size
variability of three-way doesn't seem so important when its only
competitor is straight UCS-4.

The deciding factor is what we want to expose to third-party interfaces.

Sane interface (not bytes/code units), good efficiency, C-accessible: pick two.

-- 
Adam Olsen, aka Rhamphoryncus

From rasky at develer.com  Thu Sep 21 01:00:32 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 01:00:32 +0200
Subject: [Python-3000] Removing __del__
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
	<fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>
Message-ID: <001501c6dd08$93c73260$7b4b2597@bagio>

Jim Jewett <jimjjewett at gmail.com> wrote:

>>> I believe my example is a good use case for __del__ with no good
>>> enough workaround, ... I still like the __close__ method being
>>> proposed.
>
> [Michael asks about this alternative]
> ...
>> def on_del_invoke(obj, func, *args, **kwargs):
> ...
>>      Please note that the callable must not be a bound method
>>      of the object being watched, and the object being watched
>>      must not be (or be referred to by) one of the arguments
>>      or else the object will never be garbage collected."""
>
> By far the most frequently desired callable is self.close.
>
> You can work around this with a wrapper, by setting self.f=open(...)
> and then passing self.f.close -- but with this API, I'll be wondering
> why I can't just register self.f as the object in the first place.
>
> If bound methods did not increment the refcount, this would work, but
> I imagine it would break various GUI and event-processing idioms.
>
> A special rebind-this-method-weakly builtin would work, but I'm not
> sure that is any simpler than __close__.  (~= __del__ but cycles can
> be broken in an arbitrary order)

I once wrote a simple weakref wrapper, which binds methods weakly (it's pretty
easy to write). I thought it would prove dramatically useful one day, and I
have yet to use it even once :) And yes, I agree that __close__ is a much
easier solution to the problem.
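
Something along these lines (a sketch; __self__/__func__ are spelled
im_self/im_func in 2.x):

    import weakref

    class WeakMethod(object):
        """Call a bound method without keeping its instance alive."""
        def __init__(self, bound_method):
            self._ref = weakref.ref(bound_method.__self__)
            self._func = bound_method.__func__

        def __call__(self, *args, **kwargs):
            obj = self._ref()
            if obj is None:
                raise ReferenceError('instance has been collected')
            return self._func(obj, *args, **kwargs)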

> Note that the wrapper (as posted) does nothing except store a pointer
> to the CAPI object and then delegate to it.  With a __close__ method,
> this class could reduce to (at most)
>
>     class MyCAPI(CAPI):
>         __close__ = close

Ehm, can a class be derived from a module?

Giovanni Bajo


From rhamph at gmail.com  Thu Sep 21 01:02:49 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 17:02:49 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
Message-ID: <aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> > I wrote:
> > >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> > >>> msg[35:-18]
> > u'"\U00010143"'
> > >>> greek_five = msg[36:-19]
> > >>> len(greek_five)
> > 2
> >
> >
> > After posting, I realized that it's worse than that. I suspect that if
> > I tried this on a CPython compiled with wide characters, then
> > len(greek_five) would be 1.
> >
> > What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending on the attitude towards
> surrogates).

Wow, you really did mean code units.  In that case I'm very tempted to
support UTF-8, with byte indexing (which is what code units are in its
case).  It's ugly, but it technically works fine, and it's the de
facto standard on Linux.  No more ugly than UTF-16 code units IMO,
just more obvious.

-- 
Adam Olsen, aka Rhamphoryncus

From guido at python.org  Thu Sep 21 01:08:01 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 16:08:01 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
Message-ID: <ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> Wow, you really did mean code units.  In that case I'm very tempted to
> support UTF-8, with byte indexing (which is what code units are in its
> case).  It's ugly, but it technically works fine, and it's the de
> facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> just more obvious.

Who charged you with designing the string implementation?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From qrczak at knm.org.pl  Thu Sep 21 01:13:56 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 01:13:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	(Guido van Rossum's message of "Wed, 20 Sep 2006 15:52:16 -0700")
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
Message-ID: <87mz8u9lmz.fsf@qrnik.zagroda>

"Guido van Rossum" <guido at python.org> writes:

> Even if CPython 3.0 supports a dynamic choice (which some are
> proposing) then the *language* will still make it implementation
> dependent because of Jython and IronPython, where the only choice
> is UTF-16 (or UCS-2, depending on the attitude towards surrogates).

Jython and IronPython could use a dual UCS-2 / UTF-32 encoding
(with some work and interoperability overhead I admit).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From rhamph at gmail.com  Thu Sep 21 01:20:29 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 17:20:29 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
Message-ID: <aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > Wow, you really did mean code units.  In that case I'm very tempted to
> > support UTF-8, with byte indexing (which is what code units are in its
> > case).  It's ugly, but it technically works fine, and it's the de
> > facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> > just more obvious.
>
> Who charged you with designing the string implementation?

Last I checked, the point of mailing lists such as these was to allow
input from the community at large.

In any case, my reaction was simply because I misunderstood your intentions.

-- 
Adam Olsen, aka Rhamphoryncus

From jcarlson at uci.edu  Thu Sep 21 02:29:32 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 17:29:32 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201552q605c20e0od3df500d4a13dff0@mail.gmail.com>
References: <20060920142332.0825.JCARLSON@uci.edu>
	<aac2c7cb0609201552q605c20e0od3df500d4a13dff0@mail.gmail.com>
Message-ID: <20060920165545.082B.JCARLSON@uci.edu>


"Adam Olsen" <rhamph at gmail.com> wrote:
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > >
> > > On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > >
> > > > "Adam Olsen" <rhamph at gmail.com> wrote:
> 
> [snip token stuff]
> 
> Withdrawn.  Blake Winston pointed me to some problems in private as well.
> 
> 
> > If I can't slice based on character index, then we end up with a similar
> > situation that the wxPython StyledTextCtrl runs into right now: the
> > content is encoded via utf-8 internally, so users have to use the fairly
> > annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> > where characters start/end.  While it is possible to handle everything
> > this way, it is *damn annoying*, and some users have gone so far as to
> > say that it *doesn't work* for Europeans.
> >
> > While I won't make the claim that it *doesn't work*, it is a pain in the
> > ass.
> 
> I'm going to agree with you.  That's also why I'm going to assume
> Guido meant to use Code Points, not Code Units (which would be bytes
> in the case of UTF-8).




> > > Using only utf-8 would be simpler than three distinct representations.
> > >  And if memory usage is an issue (which it seems to be, albeit in a
> > > vague way), we could make a custom encoding that's even simpler and
> > > more space efficient than utf-8.
> >
> > One of the reasons I've been pushing for the 3 representations is
> > because it is (arguably) optimal for any particular string.
> 
> It bothers me that adding a single character would cause it to double
> or quadruple in size.  May be the best compromise though.

Ahh, but the crucial observation is that the string would have been
two or four times as large initially.


> I'm not entierly convinced, but I'll leave it for now.  Maybe it'll be
> a 3.1 feature.

I'll just say, "you ain't gonna need it".  Why?  In my experience, I
rarely, if ever, say "give me the ith word" or "give me the ith line".
What I really do is, "give me the first ..., and the remaining ...".  With
partition (with or without views), you can do these things quite easily.
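
For example:

    line = 'GET /index.html HTTP/1.0'
    first, _, rest = line.partition(' ')
    # first == 'GET', rest == '/index.html HTTP/1.0'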


> > The benefits gained by using the three internal representations are
> > primarily from a simplicity standpoint.  That is to say, when
> > manipulating any one of the three representations, you know that the
> > value at offset X represents the code point of character X in the string.
> >
> > Further, with a slight change in how the single-segment buffer interface
> > is defined (returning the width of the character), C extensions that want
> > to deal with unicode strings in *native* format (due to concerns about
> > speed) could do so without having to worry about reencoding,
> > variable-width characters, etc.
> 
> Is it really worthwhile if there's three different formats they'd have
> to handle?

It would depend, but any application that currently handles unicode
strings on both utf-16 and UCS-4 builds of Python would require only
slight modification to handle Latin-1, and could be simplified to handle
UCS-2 instead of UTF-16.


> Indeed, it seems like all our options are lose-lose.
> 
> Just to summarize, our requirements are:
> * Full unicode range (0 through 0x10FFFF)
> * Constant-time slicing using integer offsets
> * Basic unit is a Code Point
> * Continuous in memory
> 
> The best idea I've had so far for making UTF-8 have constant-time
> slicing is to use a two-level table, with the second level having one
> byte per code point.  However, that brings up the minimum size to
> (more than) 2 bytes per code point, ruining any space advantage that
> utf-8 had.

(I'm not advocating the following, just expressing that it could be done)

Another way of doing it would be to have the underlying string in
utf-8 (or even utf-16), but layer a specially crafted tree structure
over the top of it, sized in a particular manner so that (bytes used)/
(bytes required) is somewhat small.  It could offer log-time discovery
of offsets (for slicing) and a memory-contiguous representation.  This
tree could also be generated after some k slices, avoiding the overhead
of tree creation unless we have determined it to be reasonable to
amortize.

If one chooses one node per k*log(n) characters, then we get O(n) tree
construction time with the same big-O index discovery time, using
roughly 24*n/(k*log(n)) additional bytes per string of length n
(assuming a 64-bit Python).  Choose k=24 (or k=12 on a 32-bit Python),
and we get a used/required ratio of 1 + 1/log(n).  Not bad.
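To make the idea concrete, here is a minimal Python 2-style sketch of
the checkpointing scheme (my illustration, not a finished design: a
flat list of byte offsets sampled every step characters, so finding
character i costs one list lookup plus at most step-1 forward scans):

def _is_lead(byte):
    # true for bytes that start a UTF-8 encoded character
    return (ord(byte) & 0xC0) != 0x80

def build_index(buf, step):
    """Byte offsets of characters 0, step, 2*step, ... in UTF-8 buf."""
    index, chars = [], 0
    for pos in range(len(buf)):
        if _is_lead(buf[pos]):
            if chars % step == 0:
                index.append(pos)
            chars += 1
    return index

def char_to_byte(buf, index, step, i):
    """Byte offset of character i: O(1) lookup plus an O(step) scan."""
    pos = index[i // step]
    for _ in range(i % step):
        pos += 1                    # step past the current lead byte,
        while pos < len(buf) and not _is_lead(buf[pos]):
            pos += 1                # then skip its continuation bytes
    return pos

Sizing step proportionally to k*log(n) gives the space/time trade-off
described above.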


> I think the only viable options (without changing the requirements)
> are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4).  The size
> variability of three-way doesn't seem so important when its only
> competitor is straight UCS-4.
> 
> The deciding factor is what we want to expose to third-party interfaces.
> 
> Sane interface (not bytes/code units), good efficiency, C-accessible: pick two.

I would say that both options are C-accessible, though perhaps not
optimally in either case.  Note that we can always recode for the
third-party interfaces; that's what is already done for PyGTK, wxPython
on Linux, 32-bit characters on Windows, etc.

 - Josiah


From greg.ewing at canterbury.ac.nz  Thu Sep 21 02:54:29 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 12:54:29 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87psdqztxk.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda>
Message-ID: <4511E2C5.2020500@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
> memory at once, but distributes the work among GC cycles.

Can it be made to guarantee that no pause will
be longer than some small amount, such as 20ms?
Because that's what is needed to ensure smooth
animation.

--
Greg

From greg.ewing at canterbury.ac.nz  Thu Sep 21 03:14:23 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 13:14:23 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <4511E76F.9000706@canterbury.ac.nz>

Michael Chermside wrote:

> >    * Programmers no longer have the ability to allow __del__
> >      to resurrect the object being finalized.

I've never even considered trying to write such code,
and can't think of any reason why I ever would, so
I wouldn't miss this ability at all.

--
Greg

From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 21 03:09:24 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 21 Sep 2006 02:09:24 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
Message-ID: <4511E644.2030306@blueyonder.co.uk>

Brett Cannon wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
>> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>> > >
>> > > Before we can decide on the internal representation of our unicode
>> > > objects, we need to decide on their external interface.  My thoughts
>> > > so far:
>> >
>> > Let me cut this short. The external string API in Py3k should not
>> > change or only very marginally so (like removing rarely used useless
>> > APIs or adding a few new conveniences). The plan is to keep the 2.x
>> > API that is supported (in 2.x) by both str and unicode, but merge the
>> > two string types into one. Anything else could be done just as easily
>> > before or after Py3k.
>>
>> Thanks, but one thing remains unclear: is the indexing intended to
>> represent bytes, code points, or code units?  Note that C code
>> operating on UTF-16 would use code units for slicing of UTF-16, which
>> splits surrogate pairs.
> 
> Assuming my Unicode lingo is right and code point represents a
> letter/character/digraph/whatever, then it will be a code point.  Doing one
> of my rare channels of Guido, I *really* doubt he wants to expose the
> technical details of Unicode to the point of having people need to realize
> that UTF-8 takes two bytes to represent "ö".

The argument used here is not valid. People do need to realize that *all*
Unicode encodings are variable-length, in the sense that abstract characters
can be represented by multiple code points.

For example, "?" can be represented either as the precomposed character U+00F6,
or as "o" followed by a combining diaeresis (U+006F U+0308). Programs must
avoid splitting sequences of code points that represent a single abstract
character. A program that does that correctly will automatically also avoid
splitting within the representation of a code point, whatever UTF is used.
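A quick interactive illustration, using the stdlib unicodedata module:

    >>> import unicodedata
    >>> precomposed = u'\u00f6'            # "ö" as one code point
    >>> combining = u'o\u0308'             # "o" + combining diaeresis
    >>> precomposed == combining           # same abstract character...
    False
    >>> len(combining)                     # ...but two code points
    2
    >>> unicodedata.normalize('NFC', combining) == precomposed
    True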

> If you want that kind of
> exposure, use the bytes type.  Otherwise assume the usage will be by people
> ignorant of Unicode and thus want something that will work the way they are
> used to when compared to working in ASCII.

It simply is not possible to do correct string processing in Unicode that
will "work the way [programmers] are used to when compared to working in ASCII".

The Unicode standard is on-line at www.unicode.org, and is quite well written,
with lots of motivation and explanation of how processing international texts
necessarily differs from working with ASCII. There is no excuse for any
programmer doing text processing not to have read it.

Should we nevertheless try to avoid making the use of Unicode strings
unnecessarily difficult for people who have minimal knowledge of Unicode?
Absolutely, but not at the expense of making basic operations on strings
asymptotically less efficient. O(1) indexing and slicing is a basic
requirement, even if it has to be done using code units.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From greg.ewing at canterbury.ac.nz  Thu Sep 21 03:34:14 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 13:34:14 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
	<fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>
Message-ID: <4511EC16.4030307@canterbury.ac.nz>

Jim Jewett wrote:

> How do you feel about the __del__ in stdlib subprocess.Popen (about line 615)?
> 
> This resurrects itself, in order to finish waiting for the child
> process.

I don't see a need for resurrection here. Why can't it
create another object holding the necessary info for
doing the waiting?
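For instance, a weakref-based sketch (names invented for illustration;
a real version would poll with WNOHANG the way subprocess does):

import os, weakref

_waiters = {}                      # keeps the weakrefs themselves alive

def watch(popen):
    """Reap popen's child after the Popen object itself is collected."""
    def _reap(ref, pid=popen.pid):
        del _waiters[ref]
        try:
            os.waitpid(pid, os.WNOHANG)  # holds only the pid, not the Popen
        except OSError:
            pass
    _waiters[weakref.ref(popen, _reap)] = popen.pid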

> (And note that if it needed to revive (not recreate, revive)
> subobjects, it would need the full immortal-cycle power of today's
> __del__.

Any subobjects which may need to be preserved can be
passed as arguments to the finalizer, which can then
prevent them from dying in the first place if it wants.

I'm far from convinced that there's ever a *need* for
resurrection.

--
Greg

From jcarlson at uci.edu  Thu Sep 21 03:58:30 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 18:58:30 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <20060920184147.082F.JCARLSON@uci.edu>


David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Brett Cannon wrote:
[snip]
> > If you want that kind of
> > exposure, use the bytes type.  Otherwise assume the usage will be by people
> > ignorant of Unicode and thus want something that will work the way they are
> > used to when compared to working in ASCII.
> 
> It simply is not possible to do correct string processing in Unicode that
> will "work the way [programmers] are used to when compared to working in ASCII".
> 
> The Unicode standard is on-line at www.unicode.org, and is quite well written,
> with lots of motivation and explanation of how processing international texts
> necessarily differs from working with ASCII. There is no excuse for any
> programmer doing text processing not to have read it.

Since basically everyone using Python today performs "text processing"
in one way or another, you are saying that basically everyone should be
reading the Unicode spec before using Python.  Never mind that the
document is larger than most people want to read, and that you didn't
provide a link to the most applicable section (with regard to *using*
unicode).  I will also mention that in the Unicode 4.0 spec, Chapter 5
"Implementation Guidelines" starts with:

'''
It is possible to implement a substantial subset of the Unicode Standard
as "wide ASCII" with little change to existing programming practice. ...
'''

It later goes on to explain where "wide ASCII" is not a reasonable
strategy, but I'm not sure that users of Python necessarily need to know
all of that.


> Should we nevertheless try to avoid making the use of Unicode strings
> unnecessarily difficult for people who have minimal knowledge of Unicode?
> Absolutely, but not at the expense of making basic operations on strings
> asymptotically less efficient. O(1) indexing and slicing is a basic
> requirement, even if it has to be done using code units.

I believe you mean "code points"; "code units" imply non-O(1) indexing
and slicing (variable-width characters).


 - Josiah


From guido at python.org  Thu Sep 21 03:55:24 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 18:55:24 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
	<aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
Message-ID: <ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > Wow, you really did mean code units.  In that case I'm very tempted to
> > > support UTF-8, with byte indexing (which is what code units are in its
> > > case).  It's ugly, but it technically works fine, and it's the de
> > > facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> > > just more obvious.
> >
> > Who charged you with designing the string implementation?
>
> Last I checked, the point of mailing lists such as these was to allow
> input from the community at large.
>
> In any case, my reaction was simply because I misunderstood your intentions.

I was specifically reacting to your use of the phrasing "I'm very
tempted to support UTF-8"; this wording suggests that it would be your
choice to make. I could have pointed out the obvious (that equating
the difficulty of using UTF-8 with that of using UTF-16 doesn't make
it so) but I figured the other readers are also tired of your attempts
to move this into an entirely different direction, and based on a
thorough lack of understanding of the status quo no less.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Thu Sep 21 04:12:35 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 20:12:35 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
	<aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
Message-ID: <aac2c7cb0609201912j7c77f24dp88c7aabd614a056@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > > Wow, you really did mean code units.  In that case I'm very tempted to
> > > > support UTF-8, with byte indexing (which is what code units are in its
> > > > case).  It's ugly, but it technically works fine, and it's the de
> > > > facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> > > > just more obvious.
> > >
> > > Who charged you with designing the string implementation?
> >
> > Last I checked, the point of mailing lists such as these was to allow
> > input from the community at large.
> >
> > In any case, my reaction was simply because I misunderstood your intentions.
>
> I was specifically reacting to your use of the phrasing "I'm very
> tempted to support UTF-8"; this wording suggests that it would be your
> choice to make. I could have pointed out the obvious (that equating
> the difficulty of using UTF-8 with that of using UTF-16 doesn't make
> it so) but I figured the other readers are also tired of your attempts
> to move this into an entirely different direction, and based on a
> thorough lack of understanding of the status quo no less.

It was poor wording then.  I never intended to imply that it was my
choice.  Instead, I was referring to the input I have as a member of
the community.

I am not attempting to move this in a different direction.  I (and
apparently several other people) thought it always was a different
direction.  It is obvious now that it wasn't your intent to use code
points, and I can accept that code units are the best (most efficient)
choice.

-- 
Adam Olsen, aka Rhamphoryncus

From fredrik at pythonware.com  Thu Sep 21 09:00:56 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:00:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
Message-ID: <eetdb7$dov$1@sea.gmane.org>

Adam Olsen wrote:

> Wow, you really did mean code units.  In that case I'm very tempted to
> support UTF-8, with byte indexing (which is what code units are in its
> case).  It's ugly, but it technically works fine, and it's the de
> facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> just more obvious.

*plonk*

</F>


From fredrik at pythonware.com  Thu Sep 21 09:16:01 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:16:01 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>	<aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
Message-ID: <eete7i$gr4$1@sea.gmane.org>

Guido van Rossum wrote:

> based on a thorough lack of understanding of the status quo no less.

that's, unfortunately, a bit too common on this list.

(as the author of Python's Unicode type and cElementTree, I especially 
like arguments along the lines of "using separate buffers to hold the 
actual character data is not feasible" and "that narrow storage would 
have any advantages over wide storage is far from proven".  nice try, 
guys ;-)

</F>


From fredrik at pythonware.com  Thu Sep 21 09:21:34 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:21:34 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <eetehu$gr4$2@sea.gmane.org>

David Hopwood wrote:

> For example, "?" can be represented either as the precomposed character U+00F6,
> or as "o" followed by a combining diaeresis (U+006F U+0308).

normalization is a good thing, though:

     http://www.w3.org/TR/charmod-norm/

(it would probably be a good idea to turn unicodedata.normalize into a 
method for the new unicode string type).
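a rough sketch of what that could look like; "text" here is just a
hypothetical stand-in for the new string type:

import unicodedata

class text(unicode):
    def normalize(self, form='NFC'):
        return type(self)(unicodedata.normalize(form, self))

with text(u'o\u0308').normalize() == text(u'\xf6') evaluating true.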

</F>


From qrczak at knm.org.pl  Thu Sep 21 09:51:02 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 09:51:02 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <4511E2C5.2020500@canterbury.ac.nz> (Greg Ewing's message of
	"Thu, 21 Sep 2006 12:54:29 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
Message-ID: <87slil1wux.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

>> Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
>> memory at once, but distributes the work among GC cycles.
>
> Can it be made to guarantee that no pause will
> be longer than some small amount, such as 20ms?

It's not hard realtime. There are no strict guarantees, and a single
large object is processed in whole.

Python also processes large objects in whole.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Thu Sep 21 11:22:29 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 11:22:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk> (David Hopwood's message of
	"Thu, 21 Sep 2006 02:09:24 +0100")
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <87venh60bu.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

> People do need to realize that *all* Unicode encodings are
> variable-length, in the sense that abstract characters can be
> represented by multiple code points.

Unicode algorithms for case mapping, word splitting, collation etc.
are generally defined in terms of code points.  The character database
is keyed by code points, which are the largest practical text unit with
a finite domain.

Even if on the high level there are some other units, any algorithm
which determines these high level text boundaries is easier to
implement in terms of code points than in terms of even lower-level
UTF-x code units.
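For instance, the character database is directly available from the
stdlib, so a crude "don't split before a combining mark" boundary check
is only a few lines:

    >>> import unicodedata
    >>> s = u'o\u0308n'                    # "ö" (combining form) + "n"
    >>> [i for i in range(len(s)) if not unicodedata.combining(s[i])]
    [0, 2]

(positions 0 and 2 are the only places this string may be split).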

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at iinet.net.au  Thu Sep 21 12:26:18 2006
From: ncoghlan at iinet.net.au (Nick Coghlan)
Date: Thu, 21 Sep 2006 20:26:18 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609200543l70f16862p750da26b18eb66da@mail.gmail.com>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>	
	<451123A2.7040701@gmail.com>
	<fb6fbf560609200543l70f16862p750da26b18eb66da@mail.gmail.com>
Message-ID: <451268CA.3030307@iinet.net.au>

Second attempt, this time to the right list :)

Jim Jewett wrote:
> On 9/20/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>              # Create a class with the same instance attributes
>>              # as the original
>>              class attr_holder(object):
>>                  pass
>>              finalizer_arg = attr_holder()
>>              finalizer_arg.__dict__ = self.__dict__
> 
> Does this really work?

It works for normal user-defined classes at least:

>>> class C1(object):
...     pass
...
>>> class C2(object):
...     pass
...
>>> a = C1()
>>> b = C2()
>>> b.__dict__ = a.__dict__
>>> a.x = 1
>>> b.x
1

> (1)  for classes with a dictproxy of some sort, you might get either a
> copy (which isn't updated)

Classes that change the way __dict__ is handled would probably need to define
their own __del_arg__.

> (2)  for other classes, self might be added to the dict later

Yeah, that's the strongest argument I know of against having that default
fallback - it can easily lead to a strong reference from sys.finalizers into
an otherwise unreachable cycle. I believe it currently takes two __del__
methods to prevent a cycle from being collected, whereas in this setup it
would only take one.

OTOH, fixing it would be much easier than it is now (by setting __del_arg__
to something that holds only the subset of attributes that require finalization).

> and of course, if it isn't added later, then it doesn't have the full
> power of current finalizers -- just the __close__ subset.

True, but most finalizers I've seen don't really *need* the full power of the
current __del__. They only need to get at a couple of their internal members
in order to explicitly release external resources.

And more sophisticated usage is still possible by assigning an appropriate
value to __del_arg__.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rasky at develer.com  Thu Sep 21 12:28:51 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 12:28:51 +0200
Subject: [Python-3000] How will unicode get used?
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com><ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com><aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com><ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com><aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
Message-ID: <015c01c6dd68$bb6bcfa0$e303030a@trilan>

Guido van Rossum wrote:

> I was specifically reacting to your use of the phrasing "I'm very
> tempted to support UTF-8"; this wording suggests that it would be your
> choice to make. I could have pointed out the obvious (that equating
> the difficulty of using UTF-8 with that of using UTF-16 doesn't make
> it so) but I figured the other readers are also tired of your attempts
> to move this into an entirely different direction, and based on a
> thorough lack of understanding of the status quo no less.

Is there a design document explaining the rationale of the unicode
type, the status quo?  Any time this subject is raised on the mailing
list, the net result is "you guys don't understand unicode".  Well, let
us know what is good and what is bad about the current unicode type;
what is by design and what is an implementation detail; what you want to
absolutely keep, and what you want to absolutely change.  I am *really*
confused about the status quo of the unicode type (which is why I keep
myself out of technical discussions on the matter, of course).  Is there
any desire to let people understand and join the discussion?

Or otherwise, let's decide that the unicode type in Py3k will not be
publicly discussed and will be handled only by the experts. This would
save us from these "attempts" as well.
-- 
Giovanni Bajo


From ncoghlan at gmail.com  Thu Sep 21 12:31:25 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Sep 2006 20:31:25 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
Message-ID: <451269FD.30508@gmail.com>

Michael Chermside wrote:
> Nick Coghlan writes:
>    [...proposes revision of __del__ rather than removal...]
>> The only way for __del__ to receive a reference to self is if the 
>> finalizer argument had a reference to it - but that would mean the 
>> object itself was not
>> collectable, so __del__ wouldn't be called in the first place.
>>
>> That all seems too simple, though. Since we're talking about gc and 
>> that's never simple, there has to be something wrong with the idea :)
> 
> Unfortunately you're right... this is all too simple. The existing
> mechanism doesn't have a problem with __del__ methods that do not
> participate in loops. For those that DO participate in loops I
> think it's perfectly plausible for your __del__ to receive a reference
> to the actual object being finalized.

Nope. If the argument to __del__ has a strong reference to the object, that 
object simply won't get finalized at all because it's not in an unreachable 
cycle. sys.finalizers would act as a global root for all objects reachable 
from finalizers (with those refcounts only being decremented when the callback 
removes the weakref object from the finalizer set).

> Another problem (but less important as it's trivially fixable) is that
> you're storing away the values that the object had when it was created,
> perhaps missing out on things that got added or initialized later.

The default fallback doesn't do that - it stores a reference to the instance 
dictionary of the object so it sees later modifications and additions.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From fredrik at pythonware.com  Thu Sep 21 12:41:08 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 12:41:08 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <015c01c6dd68$bb6bcfa0$e303030a@trilan>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com><ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com><aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com><ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com><aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
	<015c01c6dd68$bb6bcfa0$e303030a@trilan>
Message-ID: <eetq84$orl$1@sea.gmane.org>

Giovanni Bajo wrote:

> Is there a design document explaining the rationale of unicode type, the
> status quo? 

Guido isn't complaining about people who don't understand the rationale 
behind the design, he's complaining about people who HAVEN'T EVEN LOOKED 
AT THE CURRENT DESIGN before spouting off random proposals.

</F>


From gabor at nekomancer.net  Thu Sep 21 12:50:30 2006
From: gabor at nekomancer.net (Gábor Farkas)
Date: Thu, 21 Sep 2006 12:50:30 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
Message-ID: <45126E76.9020600@nekomancer.net>

Guido van Rossum wrote:
> On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
>> I wrote:
>>>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>>>> msg[35:-18]
>> u'"\U00010143"'
>>>>> greek_five = msg[36:-19]
>>>>> len(greek_five)
>> 2
>>
>>
>> After posting, I realized that it's worse than that. I suspect that if
>> I tried this on a CPython compiled with wide characters, then
>> len(greek_five) would be 1.
>>
>> What should it be? 2? 1? Implementation-dependent?
> 
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. 

while i understand the constraints, i think it's not a good decision to
leave this implementation-dependent.

strings seem to me such a basic piece of functionality that their
behaviour should not depend on the platform.

for example, how is an application developer then supposed to write
his applications?

should he write his own slicing/whatever functions to get consistent
behaviour on linux/windows?

i think this is not just a 'theoretical' issue, it's a very practical
one.  the only reason it does not seem important is that currently
not many of the non-16-bit unicode characters are used.

(and this situation seems quite similar to the one when only
8-bit characters were used :-)

btw. an idea:

==============
maybe this 'problem' should be separated into 2 issues:

1. representation of the unicode string (utf-16 or utf-32)
2. behaviour of the unicode strings in python-3000

of course there are some dependencies between them. (mostly the 
performance of #2)

so why don't we make the *behaviour* cross-platform, and the 
*performance characteristics* and the *representation* platform-dependent?

(meaning that jython/ironpython could use utf-16, but would slice
strings more slowly (because of the surrogate issues))
================

> Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).
> 

i don't see why utf-16 should be the only choice. it's the
obvious/most convenient choice for jython/ironpython, that's correct.
but (correct me if i'm wrong) ironpython or jython could support utf-32
characters. it would of course mean that they could not use the
platform's string type for their string handling.

but in the same way i could say that, because most of the unix world is
utf-8, for those pythons the best way is to handle it internally as
utf-8, couldn't i?

it simply seems strange to me to make compromises that make the life of
cpython users harder, just to make life easier for the
jython/ironpython developers (i mean the 'creators').

gabor

From rasky at develer.com  Thu Sep 21 14:00:05 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 14:00:05 +0200
Subject: [Python-3000] Removing __del__
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
	<451269FD.30508@gmail.com>
Message-ID: <042501c6dd75$79f2df70$e303030a@trilan>

Nick Coghlan wrote:

>> Unfortunately you're right... this is all too simple. The existing
>> mechanism doesn't have a problem with __del__ methods that do not
>> participate in loops. For those that DO participate in loops I
>> think it's perfectly plausible for your __del__ to receive a
>> reference to the actual object being finalized.
>
> Nope. If the argument to __del__ has a strong reference to the
> object, that object simply won't get finalized at all because it's
> not in an unreachable cycle.

What if the "self" passed to __del__ was instead a weakref.proxy, or a
similar wrapper object which does not give you access to the object itself
but lets you access its attributes? The object could have been already
collected for what I care, what I really need is to be able to say
"self.foo" to access what used to be a "foo" member of self. You can create
a totally different object of any type but with the same __dict__. Ok, it's
not that easy (properties, etc.), but you get the idea.

Am I missing something?
-- 
Giovanni Bajo


From qrczak at knm.org.pl  Thu Sep 21 14:54:26 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 14:54:26 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <042501c6dd75$79f2df70$e303030a@trilan> (Giovanni Bajo's
	message of "Thu, 21 Sep 2006 14:00:05 +0200")
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
	<451269FD.30508@gmail.com> <042501c6dd75$79f2df70$e303030a@trilan>
Message-ID: <87u031xtvh.fsf@qrnik.zagroda>

"Giovanni Bajo" <rasky at develer.com> writes:

> What if the "self" passed to __del__ was instead a weakref.proxy,
> or a similar wrapper object which does not give you access to the
> object itself but lets you access its attributes?

weakref.proxy will find the object already dead.

I doubt this can be done fully automatically.

The basic design is splitting the object into an outer part handed to
clients, which is watched to become unreachable, and a private inner
part used to physically access the resource, including releasing it.
I see no good way around it.

Often the inner part is a single field which is already separated.
In other cases it might require an extra indirection, in particular
if it's a mutable field.

This design distinguishes between related objects which are needed
during finalization (fields of the inner object) and related objects
which are not (fields of the outer object).

Cycles involving only outer objects are harmless, they can be safely
freed together, triggering finalization of all associated objects.
Inner objects may also refer to most other objects, ensuring that
they are not finalized earlier. But a path from an inner object to its
associated outer object prevents it from being finalized and is a bug
in the program (unless it is broken before the object loses all other
references).
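
A minimal weakref sketch of that split (my names and illustration, not
a concrete proposal):

import weakref

_live = set()                      # roots the weakrefs themselves

class _Inner(object):
    """Inner part: owns the raw handle and knows how to release it."""
    def __init__(self, handle):
        self.handle = handle
    def release(self):
        print 'releasing %r' % self.handle   # stand-in for real cleanup

class Resource(object):
    """Outer part handed to clients; watched for unreachability."""
    def __init__(self, handle):
        self._inner = _Inner(handle)
        def _finalize(ref, inner=self._inner):
            _live.discard(ref)
            inner.release()        # no path back to the outer object
        _live.add(weakref.ref(self, _finalize))

Cycles among Resource objects stay collectable, because the finalizer
closes over only the _Inner.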

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From theller at python.net  Thu Sep 21 15:24:52 2006
From: theller at python.net (Thomas Heller)
Date: Thu, 21 Sep 2006 15:24:52 +0200
Subject: [Python-3000] Small Py3k task: fix modulefinder.py
In-Reply-To: <ca471dc20608291442p3d92790ema7aa35f85d38156a@mail.gmail.com>
References: <ca471dc20608291442p3d92790ema7aa35f85d38156a@mail.gmail.com>
Message-ID: <eeu3r3$rd8$2@sea.gmane.org>

Guido van Rossum schrieb:
> Is anyone familiar enough with modulefinder.py to fix its breakage in
> Py3k? It chokes in a nasty way (exceeding the recursion limit) on the
> relative import syntax. I suspect this is also a problem for 2.5, when
> people use that syntax; hence the cross-post. There's no unittest for
> modulefinder.py, but I believe py2exe depends on it (and of course
> freeze.py, but who uses that still?)
> 

I'm not (yet) using relative imports in 2.5 or Py3k, but have not been able
to reproduce the recursion limit problem.  Can you describe the package
that fails?

Thanks,
Thomas


From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 21 21:41:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 21 Sep 2006 20:41:54 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <eetehu$gr4$2@sea.gmane.org>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>	<4511E644.2030306@blueyonder.co.uk>
	<eetehu$gr4$2@sea.gmane.org>
Message-ID: <4512EB02.6@blueyonder.co.uk>

Fredrik Lundh wrote:
> David Hopwood wrote:
> 
>>For example, "?" can be represented either as the precomposed character U+00F6,
>>or as "o" followed by a combining diaeresis (U+006F U+0308).
> 
> normalization is a good thing, though:
> 
>      http://www.w3.org/TR/charmod-norm/
> 
> (it would probably be a good idea to turn unicodedata.normalize into a 
> method for the new unicode string type).

Normalization is certainly a good thing to support. But that's orthogonal to
my point above -- that some abstract characters are representable by sequences
of more than one code point, which must not be split, and that avoidance of such
splitting automatically also avoids splitting within a code point representation.

Note that some abstract characters needed for living languages are representable
*only* by combining sequences.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From greg.ewing at canterbury.ac.nz  Fri Sep 22 01:57:08 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 22 Sep 2006 11:57:08 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87slil1wux.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
	<87slil1wux.fsf@qrnik.zagroda>
Message-ID: <451326D4.5030401@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> It's not hard realtime. There are no strict guarantees, and a single
> large object is processed in whole.

I know. What I mean to say, I think, is can it be
designed so that there cannot be any pauses longer
than there would have been if freeing had been
performed as early as possible by refcounting.

--
Greg

From michel at dialnetwork.com  Fri Sep 22 04:22:55 2006
From: michel at dialnetwork.com (Michel Pelletier)
Date: Thu, 21 Sep 2006 19:22:55 -0700
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <mailman.888.1158774462.10490.python-3000@python.org>
References: <mailman.888.1158774462.10490.python-3000@python.org>
Message-ID: <1158891775.14240.7.camel@amdy>


> Fredrik Lundh wrote:
> 
>  > no need to wait for Guido for this: adding library support for shared-
>  > memory dictionaries/lists is a no-brainer.  if you have experience in
>  > this field, start hacking.  I'll take care of the rest ;-)
> 
> and you don't need to wait for Python 3000 either, of course -- if done 
> right, this would certainly fit into some future 2.X release.

Here's a straight wrapper around the OSSP mm shared memory library:

http://70.103.91.130/~michel/pymm-0.1.tgz

I've only minimally tested it on AMD64 linux.  It exposes mm shared
memory regions as buffers and only wraps the "Standard" mm API.  Hacking
the Python memory system to put objects in shared memory is too deep for
me.  Included is a test that uses cPickle to share object state between
forked processes.  It needs a lot more testing and tweaking but it works
as a proof of concept.
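
For anyone who just wants the flavor of it, here is a stdlib-only toy
(an anonymous shared mmap plus pickle across a fork; no OSSP mm
involved, and unix-only):

import mmap, os, pickle

buf = mmap.mmap(-1, 4096)          # anonymous map, shared across fork
pid = os.fork()
if pid == 0:                       # child: pickle some state into the region
    data = pickle.dumps({'answer': 42})
    buf[:len(data)] = data
    os._exit(0)
os.waitpid(pid, 0)
print pickle.loads(buf[:])         # parent sees {'answer': 42}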

-Michel


From ncoghlan at gmail.com  Fri Sep 22 13:02:38 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 22 Sep 2006 21:02:38 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <87u031xtvh.fsf@qrnik.zagroda>
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>	<451269FD.30508@gmail.com>
	<042501c6dd75$79f2df70$e303030a@trilan>
	<87u031xtvh.fsf@qrnik.zagroda>
Message-ID: <4513C2CE.4050506@gmail.com>

Marcin 'Qrczak' Kowalczyk wrote:
> "Giovanni Bajo" <rasky at develer.com> writes:
> 
>> What if the "self" passed to __del__ was instead a weakref.proxy,
>> or a similar wrapper object which does not give you access to the
>> object itself but lets you access its attributes?
> 
> weakref.proxy will find the object already dead.
> 
> I doubt this can be done fully automatically.
> 
> The basic design is splitting the object into an outer part handed to
> clients, which is watched to become unreachable, and a private inner
> part used to physically access the resource, including releasing it.
> I see no good way around it.
> 
> Often the inner part is a single field which is already separated.
> In other cases it might require an extra indirection, in particular
> if it's a mutable field.
> 
> This design distinguishes between related objects which are needed
> during finalization (fields of the inner object) and related objects
> which are not (fields of the outer object).
> 
> Cycles involving only outer objects are harmless, they can be safely
> freed together, triggering finalization of all associated objects.
> Inner objects may also refer to most other objects, ensuring that
> they are not finalized earlier. But a path from an inner object to its
> associated outer object prevents it from being finalized and is a bug
> in the program (unless it is broken before the object loses all other
> references).

Exactly. My strawman design made the default inner object a simple class with 
the same instance dictionary as the outer object so that most current __del__ 
implementations would 'just work', but it poses the problem of making it easy 
to inadvertently create an immortal cycle.

OTOH, that can already happen today, and the __del_arg__ mechanism provides an 
easy way of ensuring it doesn't happen.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Fri Sep 22 13:54:19 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 22 Sep 2006 13:54:19 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <451326D4.5030401@canterbury.ac.nz> (Greg Ewing's message of
	"Fri, 22 Sep 2006 11:57:08 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
	<87slil1wux.fsf@qrnik.zagroda> <451326D4.5030401@canterbury.ac.nz>
Message-ID: <87lkocxgk4.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> I know. What I mean to say, I think, is can it be designed so that
> there cannot be any pauses longer than there would have been if
> freeing had been performed as early as possible by refcounting.

The question is misleading: refcounting also causes pauses, but at
different times and with different length distribution. An incremental
GC generally has pauses which are incomparable to pauses of refcounting,
i.e. it has longer pauses where refcounting had shorter pauses and
vice versa.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From barry at python.org  Fri Sep 22 15:01:59 2006
From: barry at python.org (Barry Warsaw)
Date: Fri, 22 Sep 2006 09:01:59 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87lkocxgk4.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
	<87slil1wux.fsf@qrnik.zagroda> <451326D4.5030401@canterbury.ac.nz>
	<87lkocxgk4.fsf@qrnik.zagroda>
Message-ID: <BE9ECBDA-BA82-406A-AC61-3183C93DCF16@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sep 22, 2006, at 7:54 AM, Marcin 'Qrczak' Kowalczyk wrote:

> Greg Ewing <greg.ewing at canterbury.ac.nz> writes:
>
>> I know. What I mean to say, I think, is can it be designed so that
>> there cannot be any pauses longer than there would have been if
>> freeing had been performed as early as possible by refcounting.
>
> The question is misleading: refcounting also causes pauses, but at
> different times and with different length distribution. An incremental
> GC generally has pauses which are incomparable to pauses of  
> refcounting,
> i.e. it has longer pauses where refcounting had shorter pauses and
> vice versa.

Python's cyclic gc can also cause long pauses if you end up with a  
ton of objects in say generation 2, because it takes time just to  
traverse them even if they can't yet be collected.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRRPe2nEjvBPtnXfVAQLspwQAtFhE1UJprBg8Cf/jH6obpaP+4+T2GHsP
ZW2IUcp41jZanfVcrOOEVERyR5saQtsCRoZtaN8XTxOJ1P1ZBCXZId0kGc39MQBW
9J4RDoQ4WTdXQFIaN+15OHIkKDIaLFakX0/smdwjHfAm8QI8D8EbEoetbsu5q0nq
MbfLHc7kk2U=
=Dz8F
-----END PGP SIGNATURE-----

From mchermside at ingdirect.com  Fri Sep 22 15:04:49 2006
From: mchermside at ingdirect.com (Chermside, Michael)
Date: Fri, 22 Sep 2006 09:04:49 -0400
Subject: [Python-3000] Removing __del__
Message-ID: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>

I don't seem to have gotten anyone on board with the bold proposal
to just rip __del__ out and tell people to learn to use weakrefs.
But I'm hearing general agreement (at least among those contributing
to this thread) that it might be wise to change the status quo.

The two kinds of solutions I'm hearing are (1) those that are based
around making a helper object that gets stored as an attribute in
the object, or a list of weakrefs, or something like that, and (2)
the __close__ proposal (or perhaps keeping the name __del__ but
changing the semantics).

The difficulties with (1) that have been acknowledged so far are 
that the way you code things becomes somewhat less obvious, and
that there is the possibility of accidentally creating immortal
objects through reference loops.

I would like to hear someone address the weaknesses of (2). The
first I know of is that the code in your __close__ method (or 
__del__) must assume that it might have been in a reference loop
which was broken in some arbitrary place. As a result, it cannot
assume that all references it holds are still valid. To avoid
crashing the system, we'd probably have to set the broken
references to None (is that the best choice?), but can people
really write code that has to run assuming that its references
might be invalid?

A second problem I know of is, what if the code stores a reference
to self someplace? The ability for __del__ methods to resurrect
the object being finalized is one of the major sources of
complexity in the GC module, and changing the semantics to
__close__ doesn't fix this.

Does anyone defending __close__ want to address these issues?



-------- examples only below this line --------


Just in case it isn't clear enough, I wanted to put together
some examples. First, I'll do the kind of problem that __close__
handles well:

class MyClass(object):
    def __init__(self, resource1_name, resource2_name):
        self.resource1 = acquire_resource(resource1_name)
        self.resource2 = acquire_resource(resource2_name)
    def close(self):
        self.resource1.release()
        self.resource2.release()
    def __close__(self):
        self.close()

This is the simplest example I could think of for an object
which needs to call self.close() when it is freed in order to
release resources.

Now let's imagine creating a loop with such an object.

x = MyClass('db1', 'db2')
y = MyClass('db3', 'db4')
x.next = y
y.next = x

In today's world, with __del__ instead of __close__ such a
loop would be immortal (and the resources would never be
released). And it would work fine with __close__ semantics
because the __close__ method doesn't use self.next. So this
one is just fine.

The danger in __close__ is when something used (if only
indirectly) by the __close__ method participates in the loop.
We will modify the original example by adding a flush()
method which flushes the resources and calling it in close():

class MyClass2(object):
    def __init__(self, resource1_name, resource2_name):
        self.resource1 = acquire_resource(resource1_name)
        self.resource2 = acquire_resource(resource2_name)
    def flush(self):
        self.resource1.flush()
        self.resource2.flush()
        if hasattr(self, 'next'):
            self.next.flush()
    def close(self):
        self.resource1.release()
        self.resource2.release()
    def __close__(self):
        self.flush()
        self.close()

x = MyClass2('db1', 'db2')
y = MyClass2('db3', 'db4')
x.next = y
y.next = x

This version will encounter a problem. When the GC sees
the x <--> y loop it will break it somewhere... without
loss of generality, let us say it breaks the y -> x link
by setting y.next to None. Now y will be freed, so
__close__ will be called. __close__ will invoke self.flush()
which will then try to invoke self.next.flush(). But
self.next is None, so we'll get an exception and never
make it to invoking self.close().

------

The other problem I discussed is illustrated by the following
malicious code:

evil_list = []

class MyEvilClass(object):
    def __close__(self):
        evil_list.append(self)



Do the proponents of __close__ propose a way of prohibiting
this behavior? Or do we continue to include complicated
logic in the GC module to support it? I don't think anyone
cares how this code behaves so long as it doesn't segfault.

-- Michael Chermside







From jimjjewett at gmail.com  Fri Sep 22 15:53:27 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 22 Sep 2006 09:53:27 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <fb6fbf560609220653o372e2610he97013dda3e4b0f2@mail.gmail.com>

On 9/22/06, Chermside, Michael <mchermside at ingdirect.com> wrote:
> the code in your __close__ method (or
> __del__) must assume that it might have been in a reference loop
> which was broken in some arbitrary place. As a result, it cannot
> assume that all references it holds are still valid.

Most close methods already assume this; how well they defend against it varies.

> A second problem I know of is, what if the code stores a reference
> to self someplace? The ability for __del__ methods to resurrect
> the object being finalized is one of the major sources of
> complexity in the GC module, and changing the semantics to
> __close__ doesn't fix this.

Even if this were forbidden, __close__ could still create a new object
that revived some otherwise-dead subobjects.  Needing those exact
subobjects (as opposed to a newly created equivalent) is the only
justification I've seen for keeping the original __del__ semantics.
(And even then, I think we should have __close__ as well, for the
normal case.)


> We will modify the original example by adding a flush()
> method which flushes the resources and calling it in close():

The more careful close methods already either check a flag attribute
or use try-except.

> class MyClass2(object):
>     def __init__(self, resource1_name, resource2_name):
>         self.resource1 = acquire_resource(resource1_name)
>         self.resource2 = acquire_resource(resource2_name)
>     def flush(self):
>         self.resource1.flush()
>         self.resource2.flush()
>         if hasattr(self, 'next'):
>             self.next.flush()

Do the two resources need to be as correct as possible, or as in-sync
as possible?

If they need to be as correct as possible, this would be

    def flush(self):
        try:
            self.resource1.flush()
        except Exception:
            pass
        try:
            self.resource2.flush()
        except Exception:
            pass
        try:
            # no need to check for self.next -- just eat the exception
            self.next.flush()
        except Exception:
            pass

Note that this is an additional motivation for exception expressions.
 (Or, at least, some way to write "This may fail -- I don't care" in
less than four lines.)

>     def close(self):
>         self.resource1.release()
>         self.resource2.release()
>     def __close__(self):
>         self.flush()
>         self.close()

If the resources instead need to be as in-sync as possible, then keep
the original flush, but replace __close__ with

    def __close__(self):
        try:
            self.flush()
        except Exception:
            pass
        self.close()     # exceptions here will be swallowed anyhow

> The other problem I discussed is illustrated by the following
> malicious code:

> evil_list = []

> class MyEvilClass(object):
>     def __close__(self):
>         evil_list.append(self)

> Do the proponents of __close__ propose a way of prohibiting
> this behavior? Or do we continue to include complicated
> logic the GC module to support it? I don't think anyone
> cares how this code behaves so long as it doesn't segfault.

I'll again point to the standard library module subprocess, where

    MyEvilClass ~= subprocess.Popen
    MyEvilClass.__close__  ~= subprocess.Popen.__del__
    evil_list ~= subprocess._active

It does the append only conditionally -- if it is still waiting for
the subprocess *and* python as a whole is not shutting down.

People do care how that code behaves.  If the decision is not to
support it (or to require that it be written in a more complicated
way), that may be a reasonable tradeoff, but there would be a cost.

-jJ

From rhettinger at ewtllc.com  Fri Sep 22 18:26:17 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Fri, 22 Sep 2006 09:26:17 -0700
Subject: [Python-3000] Removing __var
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>

I propose dropping the __var private-name mangling trick for
double underscores.

It is rarely used; it smells like a hack; it complicates introspection
tools; it's not beautiful; and it is not in line with Python's spirit of
"we're all consenting adults".


Raymond

From rhettinger at ewtllc.com  Fri Sep 22 18:26:23 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Fri, 22 Sep 2006 09:26:23 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>

[Michael Chermside]
> I don't seem to have gotten anyone on board with the bold 
> proposal to just rip __del__ out and tell people to learn 
> to use weakrefs.  But I'm hearing general agreement (at least
> among those contributing to this thread) that it might be 
> wise to change the status quo.

I'm on-board for just ripping out __del__.

Is there anything vital that could be done with a __close__ method that
can't already be done with a weakref callback?  We aren't going to need
it.

FWIW, don't despair on your original bold proposal.  While it's fun to
free associate and generate ideas for new atrocities, I think most of
your respondents are just kicking ideas around.

In the spirit of Py3k development, I recommend being quick to remove and
slow to add.  Let 3.0 emerge without __del__ and if strong use cases
emerge, there can be a 3.1 PEP for a new magic method.  I think Py3k
should be as lean as possible and then build-up very slowly afterwards,
emphasizing cruft-removal instead of cruft-substitution.


Raymond

From krstic at solarsail.hcs.harvard.edu  Fri Sep 22 18:28:56 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Sat, 23 Sep 2006 00:28:56 +0800
Subject: [Python-3000] Removing __var
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>
Message-ID: <45140F48.7000305@solarsail.hcs.harvard.edu>

Raymond Hettinger wrote:
> I propose dropping the __var private name mangling trick for
> double-underscores.

+1.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From krstic at solarsail.hcs.harvard.edu  Fri Sep 22 18:29:58 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Sat, 23 Sep 2006 00:29:58 +0800
Subject: [Python-3000] Removing __del__
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <45140F86.6050004@solarsail.hcs.harvard.edu>

Raymond Hettinger wrote:
> I'm on-board for just ripping out __del__. [...]
> In the spirit of Py3k development, I recommend being quick to remove and
> slow to add.  Let 3.0 emerge without __del__ and if strong use cases
> emerge, there can be a 3.1 PEP for a new magic method.  I think Py3k
> should be as lean as possible and then build-up very slowly afterwards,
> emphasizing cruft-removal instead of cruft-substitution.

+1, on all counts.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From ncoghlan at gmail.com  Fri Sep 22 18:49:55 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 23 Sep 2006 02:49:55 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <45141433.9070302@gmail.com>

Raymond Hettinger wrote:
> FWIW, don't despair on your original bold proposal.  While it's fun to
> free associate and generate ideas for new atrocities, I think most of
> your respondents are just kicking ideas around.

Who, us? ;)

> In the spirit of Py3k development, I recommend being quick to remove and
> slow to add.  Let 3.0 emerge without __del__ and if strong use cases
> emerge, there can be a 3.1 PEP for a new magic method.  I think Py3k
> should be as lean as possible and then build-up very slowly afterwards,
> emphasizing cruft-removal instead of cruft-substitution.

I'd be fine with this too (my suggestion for updated __del__ semantics was 
pure syntactic sugar for a weakref based solution), but I don't think I use 
__del__ enough for my vote on this particular topic to mean anything :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From cfbolz at gmx.de  Fri Sep 22 19:00:36 2006
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Fri, 22 Sep 2006 19:00:36 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <451416B4.1060201@gmx.de>

Chermside, Michael wrote:
[snip]
> The other problem I discussed is illustrated by the following
> malicious code:
>
> evil_list = []
>
> class MyEvilClass(object):
>     def __close__(self):
>         evil_list.append(self)
>
>
>
> Do the proponents of __close__ propose a way of prohibiting
> this behavior? Or do we continue to include complicated
> logic in the GC module to support it? I don't think anyone
> cares how this code behaves so long as it doesn't segfault.

I still think a rather nice solution would be to guarantee to call
__del__ (or __close__ or whatever) only once, as was discussed earlier:

http://mail.python.org/pipermail/python-dev/2005-August/055251.html

It solves all sorts of nasty problems with resurrection and cyclic GC
and it is the semantics you already get when using Jython and PyPy
(maybe IronPython too, I don't know how GC is handled in the CLR).

Now the implementation side of this is more messy, especially with
refcounting. You would need a way to store whether the object was
already finalized. I think you could steal one bit of the refcounting
field to store this information (and still have a very fast check for
whether the rest of the refcounting field is really zero, if the correct
bit is chosen).
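
(Illustrating the bookkeeping with plain integers -- only a sketch of the
idea, not CPython code:)

    FINALIZED = 1 << 31               # the hypothetical stolen bit

    def mark_finalized(refcnt):
        return refcnt | FINALIZED

    def was_finalized(refcnt):
        return refcnt & FINALIZED != 0

    def is_really_zero(refcnt):
        # fast check: everything except the stolen bit must be zero
        return refcnt & ~FINALIZED == 0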

Cheers,

Carl Friedrich Bolz


From tanzer at swing.co.at  Fri Sep 22 19:05:51 2006
From: tanzer at swing.co.at (Christian Tanzer)
Date: Fri, 22 Sep 2006 19:05:51 +0200
Subject: [Python-3000] Removing __var
In-Reply-To: Your message of "Fri, 22 Sep 2006 09:26:17 PDT."
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>
Message-ID: <E1GQoTH-0003Zp-F7@swing.co.at>


"Raymond Hettinger" <rhettinger at ewtllc.com> wrote:

> I propose dropping the __var private name mangling trick for
> double-underscores.
>
> It is rarely used; it smells like a hack; it complicates introspection
> tools; it's not beautiful; and it is not in line with Python's spirit of
> "we're all consenting adults".

It is useful in some situations, though. In particular, I use a
metaclass that sets `__super` to the right value. This wouldn't work
without name mangling.

-- 
Christian Tanzer                                    http://www.c-tanzer.at/


From fdrake at acm.org  Fri Sep 22 19:29:17 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 22 Sep 2006 13:29:17 -0400
Subject: [Python-3000] Removing __var
In-Reply-To: <E1GQoTH-0003Zp-F7@swing.co.at>
References: <E1GQoTH-0003Zp-F7@swing.co.at>
Message-ID: <200609221329.17334.fdrake@acm.org>

On Friday 22 September 2006 13:05, Christian Tanzer wrote:
 > It is useful in some situations, though. In particular, I use a
 > metaclass that sets `__super` to the right value. This wouldn't work
 > without name mangling.

This also doesn't work if two classes in the inheritance hierarchy have the 
same __name__, if I understand how you're using this.  My guess is that 
you're using calls like

    def doSomething(self, arg):
        self.__super.doSomething(arg + 1)
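
backed by a metaclass roughly like the classic autosuper recipe (a sketch,
not necessarily Christian's actual code):

    class autosuper(type):
        def __init__(cls, name, bases, ns):
            super(autosuper, cls).__init__(name, bases, ns)
            # stored under the mangled name _<ClassName>__super;
            # this is what breaks if two classes share a __name__
            setattr(cls, '_%s__super' % name, super(cls))

    class A(object):
        __metaclass__ = autosuper
        def doSomething(self, arg):
            return arg

    class B(A):
        def doSomething(self, arg):
            return self.__super.doSomething(arg + 1)  # -> A.doSomething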


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From jimjjewett at gmail.com  Fri Sep 22 19:52:17 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 22 Sep 2006 13:52:17 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <451416B4.1060201@gmx.de>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<451416B4.1060201@gmx.de>
Message-ID: <fb6fbf560609221052p3edcd120p4136b6ff4b7eb595@mail.gmail.com>

On 9/22/06, Carl Friedrich Bolz <cfbolz at gmx.de> wrote:
> I still think a rather nice solution would be to guarantee to call
> __del__ (or __close__ or whatever) only once, as was discussed earlier:

How does this help?

It doesn't say how to resolve cycles.  This cycle problem is the cause
of much implementation complexity and most user frustration (because
the method doesn't get called).

Once-only does prevent objects from usefully reviving them*selves*,
but it doesn't prevent them from creating a revived copy.  Since you
still have to start at the top of a tree, they can even reuse
otherwise-dead subobjects -- which keeps most of the rest of the
complexity.

And to be honest, I'm not sure you *can* remove the complexity, so
much as you can move it.  Enforcing no-revival-even-of-subobjects is
the same tricky maze in reverse.  Saying "We don't make any promises
regarding revival" just leads to inconsistency, and make the bugs
subtler.

The advantage of the __close__ semantics is that it greatly reduces
the number of unbreakable cycles; this still doesn't avoid corner
cases, but it simplifies the average case, and therefore the typical
user experience.

-jJ

From bob at redivi.com  Fri Sep 22 20:02:19 2006
From: bob at redivi.com (Bob Ippolito)
Date: Fri, 22 Sep 2006 11:02:19 -0700
Subject: [Python-3000] Removing __var
In-Reply-To: <200609221329.17334.fdrake@acm.org>
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
Message-ID: <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>

On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
>  > It is useful in some situations, though. In particular, I use a
>  > metaclass that sets `__super` to the right value. This wouldn't work
>  > without name mangling.
>
> This also doesn't work if two classes in the inheritance hierarchy have the
> same __name__, if I understand how you're using this.  My guess is that
> you're using calls like
>
>     def doSomething(self, arg):
>         self.__super.doSomething(arg + 1)

In the one or two situations where it "is useful" you could always
write out what it would've done.

self._ThisClass__super.doSomething(arg + 1)

-bob

From theller at python.net  Fri Sep 22 20:19:52 2006
From: theller at python.net (Thomas Heller)
Date: Fri, 22 Sep 2006 20:19:52 +0200
Subject: [Python-3000] Removing __var
In-Reply-To: <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
	<6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
Message-ID: <ef19g8$c88$2@sea.gmane.org>

Bob Ippolito schrieb:
> On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
>> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
>>  > It is useful in some situations, though. In particular, I use a
>>  > metaclass that sets `__super` to the right value. This wouldn't work
>>  > without name mangling.
>>
>> This also doesn't work if two classes in the inheritance hierarchy have the
>> same __name__, if I understand how you're using this.  My guess is that
>> you're using calls like
>>
>>     def doSomething(self, arg):
>>         self.__super.doSomething(arg + 1)
> 
> In the one or two situations where it "is useful" you could always
> write out what it would've done.
> 
> self._ThisClass__super.doSomething(arg + 1)

It is much more verbose, though.  The question is are you writing
this more often, or are you introspecting more often?

Thomas


From bob at redivi.com  Fri Sep 22 20:31:16 2006
From: bob at redivi.com (Bob Ippolito)
Date: Fri, 22 Sep 2006 11:31:16 -0700
Subject: [Python-3000] Removing __var
In-Reply-To: <ef19g8$c88$2@sea.gmane.org>
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
	<6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
	<ef19g8$c88$2@sea.gmane.org>
Message-ID: <6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>

On 9/22/06, Thomas Heller <theller at python.net> wrote:
> Bob Ippolito schrieb:
> > On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> >> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
> >>  > It is useful in some situations, though. In particular, I use a
> >>  > metaclass that sets `__super` to the right value. This wouldn't work
> >>  > without name mangling.
> >>
> >> This also doesn't work if two classes in the inheritance hierarchy have the
> >> same __name__, if I understand how you're using this.  My guess is that
> >> you're using calls like
> >>
> >>     def doSomething(self, arg):
> >>         self.__super.doSomething(arg + 1)
> >
> > In the one or two situations where it "is useful" you could always
> > write out what it would've done.
> >
> > self._ThisClass__super.doSomething(arg + 1)
>
> It is much more verbose, though.  The question is are you writing
> this more often, or are you introspecting more often?

The point is that legitimate __ usage is supposedly so rare that this
verbosity doesn't matter. If it's verbose, people definitely won't use
it until they need to, where right now people do it all the time cause
it's "private".

-bob

From brett at python.org  Fri Sep 22 21:06:33 2006
From: brett at python.org (Brett Cannon)
Date: Fri, 22 Sep 2006 12:06:33 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <bbaeab100609221206p271d981cr7fc637c4b9fc1193@mail.gmail.com>

On 9/22/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
>
> [Michael Chermside]
> > I don't seem to have gotten anyone on board with the bold
> > proposal to just rip __del__ out and tell people to learn
> > to use weakrefs.  But I'm hearing general agreement (at least
> > among those contributing to this thread) that it might be
> > wise to change the status quo.
>
> I'm on-board for just ripping out __del__.



Same here.  I have just been too busy with other stuff to make this thread a
priority, partially because I still remember when Tim proposed this and said
there was something slightly off with the way weakrefs worked for it to be
the perfect solution.

-Brett

From rasky at develer.com  Sat Sep 23 00:16:58 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 00:16:58 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <008901c6de94$d2072ed0$4bbd2997@bagio>

Raymond Hettinger wrote:

> Is there anything vital that could be done with a __close__ method
> that can't already be done with a weakref callback?  We aren't going
> to need it.

It can't be done with the same cleanness and ease. It will require more
convoluted and complex code. It will require people to understand weakrefs in
the first place.

Did you actually read my posts where I have shown some legitimate use cases of
__del__ which can't be substituted with short and elegant enough code?

Giovanni Bajo


From cfbolz at gmx.de  Fri Sep 22 20:45:27 2006
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Fri, 22 Sep 2006 20:45:27 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609221052p3edcd120p4136b6ff4b7eb595@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	
	<451416B4.1060201@gmx.de>
	<fb6fbf560609221052p3edcd120p4136b6ff4b7eb595@mail.gmail.com>
Message-ID: <45142F47.4050600@gmx.de>

Jim Jewett wrote:
> On 9/22/06, Carl Friedrich Bolz <cfbolz at gmx.de> wrote:
>> I still think a rather nice solution would be to guarantee to call
>> __del__ (or __close__ or whatever) only once, as was discussed earlier:
>
> How does this help?

It helps by removing many corner cases in the GC that come from objects
reviving themselves (and putting themselves into a cycle, for example).
It makes reviving an object perfectly ok, since the strange things start
to happen when an object continuously revives itself again and again.

> It doesn't say how to resolve cycles.  This cycle problem is the cause
> of much implementation complexity and most user frustration (because
> the method doesn't get called).

But the above proposal is independent of the question of how cycles with
finalizers get resolved. We could still say that this happens in an
arbitrary order. My point is more that just allowing objects to be
finalized in arbitrary order does not solve the problem of objects
continuously reviving themselves.

> Once-only does prevent objects from usefully reviving them*selves*,
> but it doesn't prevent them from creating a revived copy.  Since you
> still have to start at the top of a tree, they can even reuse
> otherwise-dead subobjects -- which keeps most of the rest of the
> complexity.
>
> And to be honest, I'm not sure you *can* remove the complexity, so
> much as you can move it.  Enforcing no-revival-even-of-subobjects is
> the same tricky maze in reverse.  Saying "We don't make any promises
> regarding revival" just leads to inconsistency, and make the bugs
> subtler.
>
> The advantage of the __close__ semantics is that it greatly reduces
> the number of unbreakable cycles; this still doesn't avoid corner
> cases, but it simplifies the average case, and therefore the typical
> user experience.

See above.  Calling __del__ only once is an issue independent of how to
break cycles.

Cheers,

Carl Friedrich Bolz

From rhettinger at ewtllc.com  Sat Sep 23 01:24:48 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Fri, 22 Sep 2006 16:24:48 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<023701c6dc34$8a79dc50$a14c2597@bagio>
Message-ID: <451470C0.8060903@ewtllc.com>

Giovanni Bajo wrote:

>I don't use __del__ much. I use it only in leaf classes, where it surely can't
>be part of loops. In those rare cases, it's very useful to me. For instance, I
>have a small class which wraps an existing handle-based C API exported to
>Python. Something along the lines of:
>
>class Wrapper:
>    def __init__(self, *args):
>           self.handle = CAPI.init(*args)
>
>    def __del__(self, *args):
>            CAPI.close(self.handle)
>
>    def foo(self):
>            CAPI.foo(self.handle)
>
>The real class isn't much longer than this (really). How do you propose to
>write this same code without __del__?
>  
>
Use weakref and apply the usual idioms for the callbacks:

import weakref

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        # the callback closes the handle once the wrapper is collected
        self._wr = weakref.ref(self, lambda wr, h=self.handle: CAPI.close(h))

    def foo(self):
        CAPI.foo(self.handle)



Raymond


From aahz at pythoncraft.com  Sat Sep 23 01:56:02 2006
From: aahz at pythoncraft.com (Aahz)
Date: Fri, 22 Sep 2006 16:56:02 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <008901c6de94$d2072ed0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
Message-ID: <20060922235602.GA3427@panix.com>

On Sat, Sep 23, 2006, Giovanni Bajo wrote:
>
> Did you actually read my posts where I have shown some legitimate use
> cases of __del__ which can't be substituted with short and elegant
> enough code?

The question is whether those use cases are frequent enough -- especially
for less-than-wizard programmers -- to warrant keeping __del__ around.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"LL YR VWL R BLNG T S"  -- www.nancybuttons.com

From bob at redivi.com  Sat Sep 23 02:35:50 2006
From: bob at redivi.com (Bob Ippolito)
Date: Fri, 22 Sep 2006 17:35:50 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060922235602.GA3427@panix.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
Message-ID: <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>

On 9/22/06, Aahz <aahz at pythoncraft.com> wrote:
> On Sat, Sep 23, 2006, Giovanni Bajo wrote:
> >
> > Did you actually read my posts where I have shown some legitimate use
> > cases of __del__ which can't be substituted with short and elegant
> > enough code?
>
> The question is whether those use cases are frequent enough -- especially
> for less-than-wizard programmers -- to warrant keeping __del__ around.

I still haven't seen one that can't be done pretty trivially with a
weakref. Perhaps the solution is to make doing cleanup-by-weakref
easier or more obvious? Something like this maybe:

import weakref

class GarbageDisposal:
    def __init__(self):
        self.refs = set()

    def __call__(self, object, func, *args, **kw):
        def cleanup(ref):
            self.refs.remove(ref)
            func(*args, **kw)
        self.refs.add(weakref.ref(object, cleanup))

on_cleanup = GarbageDisposal()

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        on_cleanup(self, CAPI.close, self.handle)

    def foo(self):
        CAPI.foo(self.handle)

-bob

From greg.ewing at canterbury.ac.nz  Sat Sep 23 04:08:11 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 23 Sep 2006 14:08:11 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <4514970B.7070806@canterbury.ac.nz>

Chermside, Michael wrote:
> I don't seem to have gotten anyone on board with the bold proposal
> to just rip __del__ out and tell people to learn to use weakrefs.

Well, I'd be in favour of it. I've argued something
similar in the past, without much success then either.

--
Greg

From greg.ewing at canterbury.ac.nz  Sat Sep 23 04:20:53 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 23 Sep 2006 14:20:53 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <bbaeab100609221206p271d981cr7fc637c4b9fc1193@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<bbaeab100609221206p271d981cr7fc637c4b9fc1193@mail.gmail.com>
Message-ID: <45149A05.4030409@canterbury.ac.nz>

Brett Cannon wrote:
> I still remember when Tim proposed 
> this and said there was something slightly off with the way weakrefs 
> worked for it to be the perfect solution.

If that's true, it might be better to concentrate on
fixing this problem so that weakrefs can be used,
rather than trying to patch up __del__.

--
Greg

From rasky at develer.com  Sat Sep 23 10:01:32 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 10:01:32 +0200
Subject: [Python-3000] Removing __del__
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<023701c6dc34$8a79dc50$a14c2597@bagio>
	<451470C0.8060903@ewtllc.com>
Message-ID: <02ac01c6dee6$7bbee430$4bbd2997@bagio>

Raymond Hettinger wrote:

>> I don't use __del__ much. I use it only in leaf classes, where it
>> surely can't be part of loops. In those rare cases, it's very useful
>> to me. For instance, I have a small class which wraps an existing
>> handle-based C API exported to Python. Something along the lines of:
>>
>> class Wrapper:
>>    def __init__(self, *args):
>>           self.handle = CAPI.init(*args)
>>
>>    def __del__(self, *args):
>>            CAPI.close(self.handle)
>>
>>    def foo(self):
>>            CAPI.foo(self.handle)
>>
>> The real class isn't much longer than this (really). How do you
>> propose to write this same code without __del__?
>>
>>
> Use weakref and apply the usual idioms for the callbacks:
>
> class Wrapper:
>     def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>         self._wr = weakref.ref(self,
>                                lambda wr, h=self.handle: CAPI.close(h))
>
>     def foo(self):
>         CAPI.foo(self.handle)

What happens if self.handle changes? Or if it's closed, so that the weakref
should be destroyed? You will have to bookkeep _wr everywhere across the
class code.

You're proposing to remove a simple method that is easy to use and explain, but
that can cause complex problems in some cases (cycles). The alternative is a
complex finalization system, which uses weakrefs, delayed function calls, and
must be written smartly to avoid keeping references to "self". I don't see this
as progress. On the other hand, __close__ is easy to understand, maintain,
and would solve one problem of __del__.

I think what Python 2.x really needs is a better way (library + interpreter
support) to debug cycles (both collectable and uncollectable, as the former
can be just as bad as the latter in real-time applications). Removing
__del__ just complicates real-world use cases without providing a
comprehensive solution to the problem.

Giovanni Bajo


From rasky at develer.com  Sat Sep 23 11:13:41 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 11:13:41 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com><B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com><008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
Message-ID: <035f01c6def0$900594c0$4bbd2997@bagio>

Aahz wrote:

>> Did you actually read my posts where I have shown some legitimate use
>> cases of __del__ which can't be substituted with short and elegant
>> enough code?
>
> The question is whether those use cases are frequent enough --
> especially for less-than-wizard programmers -- to warrant keeping
> __del__ around.

What I am basically against is removing an easy syntax which can
have problematic side effects if you are not adult enough, in favor of a
complicated library workaround which requires deeper knowledge of Python
(weakrefs, lambdas, early binding of default arguments, just to name three),
and can cause side effects just as bad if you are not adult enough. Where's the
trade-off?

On the other hand, __close__ (an out-of-order, recallable __del__) fixes some
issues of __del__; it is easy to teach and understand, and it is easy to write.

And, if we could (optionally) raise a RuntimeError as soon as an object with
__del__ enters a loop, would your opinion about it be different?

Giovanni Bajo


From rasky at develer.com  Sat Sep 23 11:18:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 11:18:48 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com><B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com><008901c6de94$d2072ed0$4bbd2997@bagio><20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
Message-ID: <039d01c6def1$46df1ef0$4bbd2997@bagio>

Bob Ippolito wrote:

> import weakref
> 
> class GarbageDisposal:
>     def __init__(self):
>         self.refs = set()
> 
>     def __call__(self, object, func, *args, **kw):
>         def cleanup(ref):
>             self.refs.remove(ref)
>             func(*args, **kw)
>         self.refs.add(weakref.ref(object, cleanup))
> 
> on_cleanup = GarbageDisposal()
> 
> class Wrapper:
>     def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>         on_cleanup(self, CAPI.close, self.handle)
> 
>     def foo(self):
>         CAPI.foo(self.handle)

Try with this:

class Wrapper2:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)

    def foo(self):
        CAPI.foo(self.handle)

    def restart(self):
        self.handle = CAPI.restart(self.handle)

    def close(self):
        CAPI.close(self.handle)
        self.handle = None

    def __del__(self):
        if self.handle is not None:
            self.close()


Giovanni Bajo


From bob at redivi.com  Sat Sep 23 11:22:07 2006
From: bob at redivi.com (Bob Ippolito)
Date: Sat, 23 Sep 2006 02:22:07 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <039d01c6def1$46df1ef0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
Message-ID: <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>

On 9/23/06, Giovanni Bajo <rasky at develer.com> wrote:
> Bob Ippolito wrote:
>
> > import weakref
> >
> > class GarbageDisposal:
> >     def __init__(self):
> >         self.refs = set()
> >
> >     def __call__(self, object, func, *args, **kw):
> >         def cleanup(ref):
> >             self.refs.remove(ref)
> >             func(*args, **kw)
> >         self.refs.add(weakref.ref(object, cleanup))
> >
> > on_cleanup = GarbageDisposal()
> >
> > class Wrapper:
> >     def __init__(self, *args):
> >         self.handle = CAPI.init(*args)
> >         on_cleanup(self, CAPI.close, self.handle)
> >
> >     def foo(self):
> >         CAPI.foo(self.handle)
>
> Try with this:
>
> class Wrapper2:
>    def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>
>    def foo(self):
>         CAPI.foo(self.handle)
>
>    def restart(self):
>         self.handle = CAPI.restart(self.handle)
>
>    def close(self):
>         CAPI.close(self.handle)
>         self.handle = None
>
>    def __del__(self):
>          if self.handle is not None:
>                 self.close()

I've never seen an API that works like that. Have you?

-bob

From rasky at develer.com  Sat Sep 23 11:39:20 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 11:39:20 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
Message-ID: <03bb01c6def4$257b6c70$4bbd2997@bagio>

Bob Ippolito wrote:

>> class Wrapper2:
>>    def __init__(self, *args):
>>         self.handle = CAPI.init(*args)
>>
>>    def foo(self):
>>         CAPI.foo(self.handle)
>>
>>    def restart(self):
>>         self.handle = CAPI.restart(self.handle)
>>
>>    def close(self):
>>         CAPI.close(self.handle)
>>         self.handle = None
>>
>>    def __del__(self):
>>          if self.handle is not None:
>>                 self.close()
>
> I've never seen an API that works like that. Have you?

The class above shows a case where:

1) There's a way to destroy the handle BEFORE __del__ is called, which would
require killing the weakref / deregistering the finalization hook. I believe
you agree that this is pretty common (I have around 10 uses of this pattern,
__del__ with a separate explicit closure method, in one Python codebase of
mine).

2) The objects required in the destructor can be mutated / changed during the
lifetime of the instance. For instance, a class that wraps Win32
FindFirstFile/FindNextFile and supports transparent directory recursion needs
something similar. Or CreateToolhelp32Snapshot() with the Module32First/Next
stuff. Another example is a class which creates named temporary files and needs
to remove them on finalization. It might need to create several different
temporary files (say, self.handle is the filename in that case)[1], so the
filename needed in the destructor changes during the lifetime of the instance.

#2 is admittedly more convoluted (and probably rarer) than #1, but it's
still a reasonable use case which you really can't easily do with a simple
finalization API like the one you were proposing. Python is Turing-complete
without __del__, but in some cases the alternatives are *really* worse.

Giovanni Bajo

[1] tempfile.NamedTemporaryFile can't always be used because it does not
guarantee that the file can be reopened; for instance, zipfile.ZipFile() wants
a filename, so if you want to create a temporary ZipFile you can't use
tempfile.NamedTemporaryFile.


From martin at v.loewis.de  Sat Sep 23 13:33:14 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 13:33:14 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
Message-ID: <45151B7A.50000@v.loewis.de>

Adam Olsen schrieb:
> Just a minor nit.  I doubt we could accept UCS-2, we'd want UTF-16
> instead, with all the variable-width goodness that brings in.

Sure we could; we can currently.

> Or maybe not so minor.  Old versions of windows used UCS-2, new
> versions use UTF-16.  The former should get errors if too high of a
> character is used, the latter will need conversion if we're not using
> UTF-16.

Define "used". Surrogate pairs work well in the NTFS of Windows NT 3.1;
no errors are reported.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 13:38:02 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 13:38:02 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
Message-ID: <45151C9A.3090908@v.loewis.de>

Adam Olsen schrieb:
> As far as I can tell, CPython on windows uses UTF-16 with code units.
> Perhaps not intentionally, but by default (not throwing an error on
> surrogates).

It's intentionally; that's what PEP 261 specifies.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 13:50:36 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 13:50:36 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <015c01c6dd68$bb6bcfa0$e303030a@trilan>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com><ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com><aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com><ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com><aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
	<015c01c6dd68$bb6bcfa0$e303030a@trilan>
Message-ID: <45151F8C.2070304@v.loewis.de>

Giovanni Bajo schrieb:
> Is there a design document explaining the rationale of unicode type, the
> status quo?

There is a document documenting the status quo: the source code.
Contributors to this thread (or, for that matter, to this mailing
list) should really familiarize themselves with the source code before
posting - nobody is willing to answer questions that can be answered just
by looking at the source code.

Now, there might be questions like "why is this or that done that way?"
People are more open to answer questions like that if the poster
demonstrates that he knows what the way is, and can suggest theories
as to why it might be the way it is.

> Any time this subject is raised on the mailing list, the net
> result is "you guys don't understand unicode". Well, let us know what is
> good and what is bad of the current unicode type; what is by design and what
> is an implementation detail; what you want to absolutely keep, and what you
> want to absolutely change. I am *really* confused about the status quo of
> the unicode type (which is why I keep myself out of technical discussions on
> the matter of course). Is there any desire to let people understand and join
> the discussion?

It's clear that there should be only a single character string type, and
that should be close to the current Unicode type, in semantics and
implementation.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 14:01:56 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 14:01:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45126E76.9020600@nekomancer.net>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<45126E76.9020600@nekomancer.net>
Message-ID: <45152234.1090303@v.loewis.de>

Gábor Farkas schrieb:
> while i understand the constraints, i think it's not a good decision to 
> leave this to be implementation-dependent.
> 
> the strings seem to me as such a basic functionality, that its 
> behaviour should not depend on the platform.
> 
> for example, how is an application developer then supposed to write 
> their applications?

An application developer should always know what the target platforms
are. For example, does the code need to work with IronPython or not?
Python is not aiming at 100% portability at all costs. Many aspects
are platform dependent, and while this has complicated some
applications, it has simplified others (which could make use of
platform details that otherwise would not have been exposed to the
Python programmer).

> should he write his own slicing/whatever functions to get consistent 
> behaviour on linux/windows?

Depends on the application, and the specific slicing operations.
If the slicing appears in the processing of .ini files (say),
no platform-dependent slicing should be necessary.

> i think this is not just a 'theoretical' issue. it's a very practical 
> issue. the only reason why it does not seem to be important is because 
> currently not many of the non-16-bit unicode characters are used.

No, there is a deeper reason. A typical program only performs substring
operations on selected boundaries (such as whitespace, or punctuation).
Those are typically in the BMP (not sure whether *any* punctuation
is outside the BMP).

> but the same way i could say, that because most of the unix-world is 
> utf-8, for those pythons the best way is to handle it internally as 
> utf-8, couldn't i?

I think you live in a free country: you can certainly say that.
I think you would be wrong. The common on-disk/on-wire representation
of text should not influence the design of an in-memory representation.

> it simply seems to me strange to make compromises that make the life of 
> the cpython-users harder, just to make the life for the 
> jython/ironpython developers (i mean the 'creators') easier.

Guido didn't say that the life of the CPython user needs to be hard.
He said it will be implementation-dependent, referring to Jython
and IronPython. Whether or not CPython uses a consistent representation
or consistent python-level experience across platforms is a different
issue. CPython could behave absolutely consistently, and use four-byte
Unicode on all systems, and the length of a non-BMP string would
still be implementation-defined.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 14:09:00 2006
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sat, 23 Sep 2006 14:09:00 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <451523DC.2050901@v.loewis.de>

David Hopwood schrieb:
>> Assuming my Unicode lingo is right and code point represents a
>> letter/character/digraph/whatever, then it will be a code point.  Doing one
>> of my rare channels of Guido, I *really* doubt he wants to expose the
>> technical details of Unicode to the point of having people need to realize
>> that UTF-8 takes two bytes to represent "ö".
> 
> The argument used here is not valid. People do need to realize that *all*
> Unicode encodings are variable-length, in the sense that abstract characters
> can be represented by multiple code points.

Brett did not make such an argument. He made an argument that users
should not need to care that "ö" in UTF-8 is two bytes. And I agree:
users should not have to worry about this wrt. internal representation.

> For example, "ö" can be represented either as the precomposed character U+00F6,
> or as "o" followed by a combining diaeresis (U+006F U+0308). Programs must
> avoid splitting sequences of code points that represent a single abstract
> character.

Why is that? Many programs never encounter cases where this would
matter, so why do such programs have to operate correctly if that case
is encountered?

> It simply is not possible to do correct string processing in Unicode that
> will "work the way [programmers] are used to when compared to working in ASCII".

Brett didn't say that this was a goal.

> Should we nevertheless try to avoid making the use of Unicode strings
> unnecessarily difficult for people who have minimal knowledge of Unicode?
> Absolutely, but not at the expense of making basic operations on strings
> asymptotically less efficient. O(1) indexing and slicing is a basic
> requirement, even if it has to be done using code units.

It's not possible to implement slicing in constant time, unless string
views are introduced. Currently, slicing takes time linear with the
length of the result string.

Regards,
Martin


From nas at arctrix.com  Sat Sep 23 17:45:33 2006
From: nas at arctrix.com (Neil Schemenauer)
Date: Sat, 23 Sep 2006 15:45:33 +0000 (UTC)
Subject: [Python-3000] Removing __var
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
	<6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
	<ef19g8$c88$2@sea.gmane.org>
	<6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
Message-ID: <ef3kqt$8jj$1@sea.gmane.org>

Bob Ippolito <bob at redivi.com> wrote:
> The point is that legitimate __ usage is supposedly so rare that this
> verbosity doesn't matter. If it's verbose, people definitely won't use
> it until they need to, where right now people do it all the time cause
> it's "private".

It's very rare, in my experience.  I vote to rip it out.

  Neil


From qrczak at knm.org.pl  Sat Sep 23 18:34:20 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 23 Sep 2006 18:34:20 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio> (Giovanni Bajo's message
	of "Sat, 23 Sep 2006 11:39:20 +0200")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
Message-ID: <8764fesfsj.fsf@qrnik.zagroda>

"Giovanni Bajo" <rasky at develer.com> writes:

> 1) There's a way to destroy the handle BEFORE __del__ is called,
> which would require killing the weakref / deregistering the
> finalization hook.

Weakrefs should have a method which runs their callback and
unregisters them.
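
Something along these lines, for instance (a sketch; the names are made up,
and the caller must keep the Finalizer object alive somewhere, e.g. in a
module-level set):

    import weakref

    class Finalizer(object):
        # a weakref whose callback can also be run early, exactly once
        def __init__(self, obj, func, *args):
            self._pending = (func, args)
            self._ref = weakref.ref(obj, self._run)

        def _run(self, ref=None):
            if self._pending is not None:
                (func, args), self._pending = self._pending, None
                func(*args)

        def run_now(self):
            # run the callback early and effectively unregister it
            self._run()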

> 2) The objects required in the destructor can be mutated / changed
> during the lifetime of the instance. For instance, a class that
> wraps Win32 FindFirstFile/FindNextFile and supports transparent
> directory recursion needs something similar.

Listing files with transparent directory recursion can be implemented
in terms of listing files of a given directory, such that a finalizer
is only used with the low level object.

> Another example is a class which creates named temporary files
> and needs to remove them on finalization. It might need to create
> several different temporary files (say, self.handle is the filename
> in that case)[1], so the filename needed in the destructor changes
> during the lifetime of the instance.

Again: move the finalizer to a single temporary-file object, and refer
to that object instead of a raw handle.
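
E.g. (a sketch, reusing the registry idiom shown earlier in the thread):

    import os, tempfile, weakref

    _cleanups = set()

    class OwnedTempFile(object):
        # owns exactly one named temporary file; deletes it when collected
        def __init__(self):
            fd, self.name = tempfile.mkstemp()
            os.close(fd)
            def cleanup(ref, name=self.name):
                _cleanups.discard(ref)
                os.remove(name)
            _cleanups.add(weakref.ref(self, cleanup))

The outer class then just rebinds an attribute to a fresh OwnedTempFile
whenever it rotates files; each file carries its own finalizer.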

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From bob at redivi.com  Sat Sep 23 19:18:25 2006
From: bob at redivi.com (Bob Ippolito)
Date: Sat, 23 Sep 2006 10:18:25 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
Message-ID: <6a36e7290609231018m65bed7edq116ab807aa5cf6f5@mail.gmail.com>

On 9/23/06, Giovanni Bajo <rasky at develer.com> wrote:
> Bob Ippolito wrote:
>
> >> class Wrapper2:
> >>    def __init__(self, *args):
> >>         self.handle = CAPI.init(*args)
> >>
> >>    def foo(self):
> >>         CAPI.foo(self.handle)
> >>
> >>    def restart(self):
> >>         self.handle = CAPI.restart(self.handle)
> >>
> >>    def close(self):
> >>         CAPI.close(self.handle)
> >>         self.handle = None
> >>
> >>    def __del__(self):
> >>          if self.handle is not None:
> >>                 self.close()
> >
> > I've never seen an API that works like that. Have you?
>
> The class above shows a case where:
>
> 1) There's a way to destroy the handle BEFORE __del__ is called, which would
> require killing the weakref / deregistering the finalization hook. I believe
> you agree that this is pretty common (I have around 10 uses of this pattern,
> __del__ with a separate explicit closure method, in one Python codebase of
> mine).

Easy enough, that would be a second function and the dict would change a bit.

> 2) The objects required in the destructor can be mutated / changed during the
> lifetime of the instance. For instance, a class that wraps Win32
> FindFirstFile/FindNextFile and supports transparent directory recursion needs
> something similar. Or CreateToolhelp32Snapshot() with the Module32First/Next
> stuff. Another example is a class which creates named temporary files and needs
> to remove them on finalization. It might need to create several different
> temporary files (say, self.handle is the filename in that case)[1], so the
> filename needed in the destructor changes during the lifetime of the instance.
>
> #2 is admittedly more convoluted (and probably rarer) than #1, but it's
> still a reasonable use case which you really can't easily do with a simple
> finalization API like the one you were proposing. Python is Turing-complete
> without __del__, but in some cases the alternatives are *really* worse.

You can of course easily do this with a simple finalization API.
Supporting this simply requires that multiple cleanup functions be
allowed per object.
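
For instance, the earlier GarbageDisposal sketch could keep a list of hooks
per object (still only a sketch; it assumes the tracked objects are
hashable):

    import weakref

    class GarbageDisposal:
        def __init__(self):
            self.hooks = {}           # weakref -> list of (func, args, kw)

        def __call__(self, obj, func, *args, **kw):
            # weakrefs to the same object hash and compare alike, so all
            # hooks for one object end up in a single list
            def run_all(ref):
                for f, a, k in self.hooks.pop(ref, ()):
                    f(*a, **k)
            ref = weakref.ref(obj, run_all)
            self.hooks.setdefault(ref, []).append((func, args, kw))

    on_cleanup = GarbageDisposal()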

-bob

From jcarlson at uci.edu  Sat Sep 23 20:03:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 23 Sep 2006 11:03:43 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <451523DC.2050901@v.loewis.de>
References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de>
Message-ID: <20060923104310.0863.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> David Hopwood schrieb:
[snip]
> > Should we nevertheless try to avoid making the use of Unicode strings
> > unnecessarily difficult for people who have minimal knowledge of Unicode?
> > Absolutely, but not at the expense of making basic operations on strings
> > asymptotically less efficient. O(1) indexing and slicing is a basic
> > requirement, even if it has to be done using code units.
> 
> It's not possible to implement slicing in constant time, unless string
> views are introduced. Currently, slicing takes time linear with the
> length of the result string.

I believe he was referring to discovering the memory address where
slicing should begin.  In the case of Latin-1, UCS-2, or UCS-4, given a
starting address and some position i, it is trivial to discover the
memory position of character i.  In the case of UTF-8, given a starting
address and some position i, one needs to somewhat parse the UTF-8
representation to discover the memory position of character i.
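
In code, the difference is roughly this (a sketch over byte strings,
assuming well-formed UTF-8):

    def fixed_width_offset(i, width):
        # Latin-1 / UCS-2 / UCS-4: one multiplication, O(1)
        return i * width

    def utf8_offset(data, i):
        # UTF-8: walk the lead bytes up to character i, O(i)
        pos = 0
        for _ in xrange(i):
            lead = ord(data[pos])
            if lead < 0x80:
                pos += 1
            elif lead < 0xE0:
                pos += 2
            elif lead < 0xF0:
                pos += 3
            else:
                pos += 4
        return pos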


For me, having recently remembered what was in a unicode string, and
verifying it by checking the source, the question in my mind is whether
we want to stick with the same 2-representation implementation (default
encoding and UTF-16 or UCS-4 depending on build), or go with more or
fewer representations.

We can reduce memory consumption by using a single representation,
whether it be constant or variable based on content, though in some
cases (utf-16, ucs-4) we would lose the 'native' single-segment char (C
char) buffer interface.

Using multiple representations, and choosing those representations
carefully based on platform (always keep utf-8 as one of the
representations on linux, always keep utf-16 as one of the
representations in Windows), we may be able to increase platform API
calling speed, if such is desirable.

After re-reading the source, and thinking a bit more, about my only
real concern is memory use of Python 3.x.  The current implementation
works, so I'm +1 on keeping it "as is", but I'm also +0 on some
implementation that would reduce memory use (with limited, if any
slowdown) for as many platforms as possible, not any higher because
changing the underlying implementation would be a PITA.


 - Josiah


From martin at v.loewis.de  Sat Sep 23 21:17:04 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 21:17:04 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060923104310.0863.JCARLSON@uci.edu>
References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de>
	<20060923104310.0863.JCARLSON@uci.edu>
Message-ID: <45158830.8020908@v.loewis.de>

Josiah Carlson schrieb:
> For me, having recently remembered what was in a unicode string, and
> verifying it by checking the source, the question in my mind is whether
> we want to stick with the same 2-representation implementation (default
> encoding and UTF-16 or UCS-4 depending on build), or go with more or
> fewer representations.

I would personally like to see a Python API that operates on code
points, with support for 17 planes. I also think that efficient indexing
is important.

> We can reduce memory consumption by using a single representation,
> whether it be constant or variable based on content, though in some
> cases (utf-16, ucs-4) we would lose the 'native' single-segment char (C
> char) buffer interface.

I don't think reducing memory consumption is that important, for current
hardware. Java and .NET have demonstrated that you can do "real"
applications with that approach.

There are trade-offs, of course. I personally think the best trade-off
would be to have a two-byte representation, along with a flag telling
whether there are any surrogate pairs in the string. Indexing and
length would be constant-time if there are no surrogates, and linear
time if there are.
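
A sketch of that fast-path/slow-path split, over a sequence of 16-bit code
units:

    def code_point_at(units, i, has_surrogates):
        if not has_surrogates:
            return units[i]                    # O(1) fast path
        k = 0                                  # code-unit position
        for _ in xrange(i):                    # O(n) slow path
            if 0xD800 <= units[k] < 0xDC00:    # high surrogate
                k += 2                         # skip the whole pair
            else:
                k += 1
        u = units[k]
        if 0xD800 <= u < 0xDC00:
            return (0x10000 + ((u - 0xD800) << 10)
                    + (units[k + 1] - 0xDC00))
        return u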

> After re-reading the source, and thinking a bit more, about my only
> real concern is memory use of Python 3.x.  The current implementation
> works, so I'm +1 on keeping it "as is", but I'm also +0 on some
> implementation that would reduce memory use (with limited, if any
> slowdown) for as many platforms as possible, not any higher because
> changing the underlying implementation would be a PITA.

I think supporting multiple representations at run-time would really
be terrible. Any API of the "give me the data" kind would either have
to expose the choice of representations, or perform a copy. Either
alternative would produce many programming errors in extension modules.

Regards,
Martin

From rasky at develer.com  Sun Sep 24 02:04:36 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 24 Sep 2006 02:04:36 +0200
Subject: [Python-3000] __close__ method
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <07f701c6df6d$05b47660$4bbd2997@bagio>

Michael,

many thanks for your interesting mail, which pointed out the outcome of the
previous thread. Let me try to answer some questions of yours about
__close__.

> But I'm hearing general agreement (at least among those contributing
> to this thread) that it might be wise to change the status quo.

Status quo of __del__:

Pros:
- Easy syntax: very simple to use in easy situations.
- Easy semantics: familiar to beginners (similarity with other programming
languages), and being the "opposite" of __init__ makes it easy to teach.

Cons:
- Makes reference loops uncollectable -> people learn fast to avoid it in most
classes
- Allows resurrection, which is a headache for Python developers


> The two kinds of solutions I'm hearing are (1) those that are based
> around making a helper object that gets stored as an attribute in
> the object, or a list of weakrefs, or something like that, and (2)
> the __close__ proposal (or perhaps keep the name __del__ but change
> the semantics).
>
> The difficulties with (1) that have been acknowledged so far are
> that the way you code things becomes somewhat less obvious, and
> that there is the possibility of accidentally creating immortal
> objects through reference loops.

Exactly. To code these finalizers correctly, you need to be much more
Python-savvy than you need to be to use __del__, because you need to
understand and somehow master:

- weakrefs
- early binding of default arguments of functions

which are not exactly the two brightest areas of Python.

[ (2) the __close__ proposal ]
> I would like to hear someone address the weaknesses of (2).
> The first I know of is that the code in your __close__ method (or
> __del__) must assume that it might have been in a reference loop
> which was broken in some arbitrary place. As a result, it cannot
> assume that all references it holds are still valid. To avoid
> crashing the system, we'd probably have to set the broken
> references to None (is that the best choice?), but can people
> really write code that has to run assuming that its references
> might be invalid?

I might be wrong, but given the constraint that __close__ could be called
multiple times for the same object, I don't see how this situation might
appear. The cyclic GC could:

1) call __close__ on the instances *BEFORE* dropping the references. The code
in __close__ could break the cycle itself.
2) only after that, assume that __close__ did not dispose anything related to
the loop itself, and thus drop a random reference in the chain. This would
cause other calls to __close__ on the instances, which should result in
basically no-ops since they have already been executed.

BTW: would it be possible to "nullify" the __close__ method after it has been
executed once somehow, so that it won't get executed twice on the same
instance? A single bit in the instance (with the meaning of "already closed")
should be sufficient. If this is possible, then the above algorithm is easier
to implement, and it also makes __close__ methods easier to implement.
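
In pure Python, the effect can be sketched with nothing more than an
instance flag (names hypothetical):

    class MyClass3(object):
        _already_closed = False

        def __close__(self):
            if self._already_closed:
                return                 # second and later calls: no-ops
            self._already_closed = True
            self.flush()
            self.close()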

> A second problem I know of is, what if the code stores a reference
> to self someplace? The ability for __del__ methods to resurrect
> the object being finalized is one of the major sources of
> complexity in the GC module, and changing the semantics to
> __close__ doesn't fix this.

I don't think __close__ can solve this problem, in fact. I don't specifically
consider it a weakness of __close__, strictly speaking, though.


> -------- examples only below this line --------
>
> class MyClass2(object):
>     def __init__(self, resource1_name, resource2_name):
>         self.resource1 = acquire_resource(resource1_name)
>         self.resource2 = acquire_resource(resource2_name)
>     def flush(self):
>         self.resource1.flush()
>         self.resource2.flush()
>         if hasattr(self, 'next'):
>             self.next.flush()
>     def close(self):
>         self.resource1.release()
>         self.resource2.release()
>     def __close__(self):
>         self.flush()
>         self.close()
>
> x = MyClass2('db1', 'db2')
> y = MyClass2('db3', 'db4')
> x.next = y
> y.next = x
>
> This version will encounter a problem. When the GC sees
> the x <--> y loop it will break it somewhere... without
> loss of generality, let us say it breaks the y -> x link
> by setting y.next to None. Now y will be freed, so
> __close__ will be called. __close__ will invoke self.flush()
> which will then try to invoke self.next.flush(). But
> self.next is None, so we'll get an exception and never
> make it to invoking self.close().

With my algorithm, the following things will happen:

0) I assume that the resources can be flushed() even after having been
released() without causing weird exceptions... Otherwise the code should be
more defensive and delete the references to the resources after disposal (a
defensive variant is sketched after the walkthrough).
1) The GC first calls __close__ on either instance (say x). This flushes x
(invoking y.flush() via x.next) and releases x's resources; x is marked as
"already closed".
2) The GC then calls __close__ on y. This invokes x.flush() (via y.next) and
releases y's resources. x.flush() either has no side effects, or is
defensively coded against resource1/resource2 being None (since x's resources
were already disposed of in step 1).
3) The loop was not broken, so the GC drops a random reference. Let's say it
breaks the y -> x link. This causes x to be disposed. x is marked as "already
closed", so __close__ is not invoked. During disposal, the reference to y held
in x.next is dropped.
4) y is disposed. It's marked as "already closed", so __close__ is not invoked.
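
The defensive variant mentioned in step 0 could look roughly like this
(reusing stand-ins for the resources; __close__ remains a hypothetical
hook):

class _Res(object):               # stand-in resource
    def flush(self): pass
    def release(self): pass

def acquire_resource(name):
    return _Res()

class MyClass3(object):
    def __init__(self, n1, n2):
        self.resource1 = acquire_resource(n1)
        self.resource2 = acquire_resource(n2)
        self.next = None
    def flush(self):
        for r in (self.resource1, self.resource2):
            if r is not None:
                r.flush()
        if self.next is not None:
            self.next.flush()
    def __close__(self):
        self.flush()
        for r in (self.resource1, self.resource2):
            if r is not None:
                r.release()
        # drop the references so later flush()/__close__ calls are no-ops
        self.resource1 = self.resource2 = None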



> ------
>
> The other problem I discussed is illustrated by the following
> malicious code:
>
> evil_list = []
>
> class MyEvilClass(object):
>     def __close__(self):
>         evil_list.append(self)
>
> Do the proponents of __close__ propose a way of prohibiting
> this behavior? Or do we continue to include complicated
> logic in the GC module to support it? I don't think anyone
> cares how this code behaves so long as it doesn't segfault.

I can see how this can confuse the GC, but I really don't know the details. I
don't have any proposal for how to avoid this situation.

Giovanni Bajo


From jimjjewett at gmail.com  Sun Sep 24 02:38:40 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sat, 23 Sep 2006 20:38:40 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
Message-ID: <fb6fbf560609231738u486edcb9m83c7e152ec7544fb@mail.gmail.com>

On 9/22/06, Bob Ippolito <bob at redivi.com> wrote:

> I still haven't seen one that can't be done pretty trivially
> with a weakref.  Perhaps the solution is to make
> doing cleanup-by-weakref easier or more obvious?

Possibly, but I've tried, and *I* couldn't come up with any way to use
them that was

(1) generic enough to put in a module, rather than a recipe
(2) easy enough to still be an improvement, and
(3) correct.

>     def __call__(self, object, func, *args, **kw):
>         def cleanup(ref):
>             self.refs.remove(ref)
>             func(*args, **kw)
>         self.refs.add(weakref.ref(object, cleanup))

Now remember something like Michael Chermside's "simplest" example,
where you need to flush before closing.

The obvious way is to pass self.close, but it doesn't actually work.
Because it is a bound method, it silently makes the object effectively
immortal.

The "correct" way is to write another function which is basically an
awkward copy of self.close.  At the moment, I can't think of any
*good* way to ensure access to self.resource1 and self.resource2, but
not to self.  All the workarounds I can come up with make __del__ look
pretty good, from a maintenance standpoint.
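
To make the pitfall concrete, here is a sketch (the Cleaner class fleshes
out the quoted helper; everything else is hypothetical):

import weakref

class Cleaner(object):
    def __init__(self):
        self.refs = set()         # keeps the weakrefs and callbacks alive
    def __call__(self, obj, func, *args, **kw):
        def cleanup(ref):
            self.refs.remove(ref)
            func(*args, **kw)
        self.refs.add(weakref.ref(obj, cleanup))

cleaner = Cleaner()

class _Res(object):
    def flush(self): pass
    def release(self): pass

class Broken(object):
    def __init__(self):
        self.res = _Res()
        # BUG: self.close is a bound method, so cleaner.refs now reaches
        # self through the callback and the object can never be collected.
        cleaner(self, self.close)
    def close(self):
        self.res.flush()
        self.res.release()

class Works(object):
    def __init__(self):
        self.res = _Res()
        # Capture the resource, never self -- but note how cleanup() is
        # an awkward near-copy of close(), which is the complaint above.
        def cleanup(res=self.res):
            res.flush()
            res.release()
        cleaner(self, cleanup)
    def close(self):
        self.res.flush()
        self.res.release()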

-jJ

From greg.ewing at canterbury.ac.nz  Sun Sep 24 03:12:05 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 24 Sep 2006 13:12:05 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <035f01c6def0$900594c0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<035f01c6def0$900594c0$4bbd2997@bagio>
Message-ID: <4515DB65.2090505@canterbury.ac.nz>

Giovanni Bajo wrote:

> What I am basically against is the need of removing an easy syntax which can
> have problematic side effects if you are not adult enough,

So what are you saying, that people who aren't adult enough
should be given a tool that's nice and easy but leads them
to write buggy code? That doesn't seem like a responsible
thing to do.

> complicated library workaround which requires deeper knowledge of Python
> (weakrefs, lambdas, early binding of default arguments, just to name three),

I don't see why it needs to be anywhere near that complicated.
All use of weakrefs can be hidden behind a call such as

   register_finalizer(self, func, *args, **kwds)

and we just need to say that func should be a plain function,
not a bound method of self, and self shouldn't appear anywhere
in the arguments.

Anyone who's not intelligent enough to understand and follow
those guidelines is not intelligent enough to avoid the
pitfalls of using __del__ either, IMO.
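
One way such a call could be implemented, as a sketch (not an existing
API):

import weakref

_live = []      # module-level list keeps the weakrefs and callbacks alive

def register_finalizer(obj, func, *args, **kwds):
    # Per the guidelines above, func must be a plain function that does
    # not reference obj; otherwise obj becomes effectively immortal.
    def callback(ref):
        _live.remove(ref)
        func(*args, **kwds)
    _live.append(weakref.ref(obj, callback))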

--
Greg

From tanzer at swing.co.at  Sun Sep 24 14:04:54 2006
From: tanzer at swing.co.at (Christian Tanzer)
Date: Sun, 24 Sep 2006 14:04:54 +0200
Subject: [Python-3000] Removing __var
In-Reply-To: Your message of "Fri, 22 Sep 2006 11:31:16 PDT."
	<6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
Message-ID: <E1GRSj8-0006VK-K5@swing.co.at>


"Bob Ippolito" <bob at redivi.com> wrote:

> On 9/22/06, Thomas Heller <theller at python.net> wrote:
> > Bob Ippolito schrieb:
> > > On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> > >> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
> > >>  > It is useful in some situations, though. In particular, I use a
> > >>  > metaclass that sets `__super` to the right value. This wouldn't work
> > >>  > without name mangling.
> > >>
> > >> This also doesn't work if two classes in the inheritance hierarchy have the
> > >> same __name__, if I understand how you're using this.  My guess is that
> > >> you're using calls like
> > >>
> > >>     def doSomething(self, arg):
> > >>         self.__super.doSomething(arg + 1)
> > >
> > > In the one or two situations where it "is useful" you could always
> > > write out what it would've done.
> > >
> > > self._ThisClass__super.doSomething(arg + 1)
> >
> > It is much more verbose, though.  The question is are you writing
> > this more often, or are you introspecting more often?
>
> The point is that legitimate __ usage is supposedly so rare that this
> verbosity doesn't matter. If it's verbose, people definitely won't use
> it until they need to, where right now people do it all the time cause
> it's "private".

How can you say that?

I don't use __ for `private`, I use it for making cooperative super
calls (and `__super` occurs 1397 times in my sandbox). I definitely don't
*want* to put the name of the class into a cooperative call. Compare

    self.__super.doSomething(arg + 1)

with

    super(SomeClass, self).doSomething(arg + 1)

The literal class name is verbose, error prone, and hostile to
refactoring.
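
The metaclass trick can be sketched roughly like this (a variant of the
well-known "autosuper" recipe; the metaclass name is mine, and as the
quoted text notes, it breaks if two classes in the hierarchy share a
__name__):

class AutoSuper(type):
    def __init__(cls, name, bases, d):
        super(AutoSuper, cls).__init__(name, bases, d)
        # Store an unbound super object under the mangled name, so
        # self.__super inside the class body resolves to it.
        setattr(cls, '_%s__super' % name, super(cls))

class A(object):
    __metaclass__ = AutoSuper
    def doSomething(self, arg):
        return arg

class B(A):                       # inherits the metaclass from A
    def doSomething(self, arg):
        # No literal class name: mangling turns this into _B__super.
        return self.__super.doSomething(arg + 1)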

I don't care about people supposedly abusing __ to define `private`
attributes -- we are all consenting adults here. (And people trying to
restrict visibility probably commit all sorts of blunders. Trying to
stop that might mean taking away most of Python's features).

-- 
Christian Tanzer                                    http://www.c-tanzer.at/


From fredrik at pythonware.com  Sun Sep 24 14:42:55 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Sun, 24 Sep 2006 14:42:55 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45158830.8020908@v.loewis.de>
References: <4511E644.2030306@blueyonder.co.uk>
	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
Message-ID: <ef5ugf$d9a$1@sea.gmane.org>

Martin v. Löwis wrote:

> I don't think reducing memory consumption is that important, for current
> hardware. Java and .NET have demonstrated that you can do "real"
> applications with that approach.

I've spent more time optimizing Python's string types than most, and 
that doesn't match my experiences at all.  Operations on wide chars are 
often faster than one might think, but any processor can copy X bytes of 
data faster than it can copy X*4 bytes of data, and I doubt that's going 
to change soon.

> I think supporting multiple representations at run-time would really
> be terrible. Any API of the "give me the data" kind would either have
> to expose the choice of representations, or perform a copy.

Unless you can guarantee that *all* external APIs that a Python 
extension might want to use will use exactly the same internal 
representation as Python, that's something that we have to deal with anyway.

> Either alternative would produce many programming errors in extension
 > modules.

And even if that was true (which I don't believe), "many" would still
be "very small" compared to the problems that reference counting and 
error handling are causing.

</F>


From martin at v.loewis.de  Sun Sep 24 18:31:12 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 24 Sep 2006 18:31:12 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ef5ugf$d9a$1@sea.gmane.org>
References: <4511E644.2030306@blueyonder.co.uk>	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>	<45158830.8020908@v.loewis.de>
	<ef5ugf$d9a$1@sea.gmane.org>
Message-ID: <4516B2D0.9020109@v.loewis.de>

Fredrik Lundh schrieb:
>> I don't think reducing memory consumption is that important, for current
>> hardware. Java and .NET have demonstrated that you can do "real"
>> applications with that approach.
> 
> I've spent more time optimizing Python's string types than most, and 
> that doesn't match my experiences at all.  Operations on wide chars are 
> often faster than one might think, but any processor can copy X bytes of 
> data faster than it can copy X*4 bytes of data, and I doubt that's going 
> to change soon.

These statements don't contradict each other. You are saying that there is a
measurable, perhaps significant difference between copying single-byte
vs. double-byte strings. I can believe this.

My claim is that this still isn't that important, and that it will
be "fast enough", anyway. In many cases, the application will be
IO-bound, so the cost of string operations might be negligible,
either way.

Of course, both statements generalize across an unspecified set of
applications, so it is a matter of personal preferences.

>> I think supporting multiple representations at run-time would really
>> be terrible. Any API of the "give me the data" kind would either have
>> to expose the choice of representations, or perform a copy.
> 
> Unless you can guarantee that *all* external APIs that a Python 
> extension might want to use will use exactly the same internal 
> representation as Python, that's something that we have to deal with anyway.

APIs will certainly allow different kinds of memory buffers to
create a Python string object. Creation is a fairly small part
of the API; I believe it would noticeably simplify the
implementation if there is only a single internal representation.


>> Either alternative would produce many programming errors in extension
>  > modules.
> 
> And even if that was true (which I don't believe), "many" would still
> be "very small" compared to the problems that reference counting and 
> error handling are causing.

We will see. We need a specification or implementation first to see,
of course.

Regards,
Martin

From talin at acm.org  Sun Sep 24 20:55:47 2006
From: talin at acm.org (Talin)
Date: Sun, 24 Sep 2006 11:55:47 -0700
Subject: [Python-3000] Transitional GC?
Message-ID: <4516D4B3.50905@acm.org>

I wonder if there is a way to create an API for extension modules that 
would allow a gradual phase-out of reference counting, towards a 'pure' GC.

(Let's leave aside the merits of reference counting vs. non-reference 
counting for another thread - please.)

Most of the discussion up to this point has assumed that there's a sharp 
line between the two GC schemes - in other words, once you switch over, 
you have to migrate every extension module all at once.

I've been wondering, however, if there isn't some way for both schemes 
to coexist within the same interpreter, for some transitional period. 
You would have some modules that use the RC API, while other modules 
would use the 'tracing' API. Modules could gradually be ported to the 
new API until there were none left, at which point you could throw the 
switch and remove the RC support entirely.

I'm assuming two things here:
   1) That such a transitional scheme would have to be as efficient (or 
nearly so) as the existing scheme in terms of memory and speed.
   2) That we're talking source-level compatibility only - there's no 
expectation that you would be able to link with modules compiled under 
the old API.

I see two basic approaches to this. The first is to have 
reference-counting modules live in a predominantly trace-based world; 
the other is to allow tracing modules to live in a predominantly 
reference-counted world.

The first approach is relatively straightforward - you simply add any 
object with a non-zero refcount to the root set. Objects whose refcounts 
fall to zero are not immediately deleted, but instead get placed into 
the youngest generation to be traced and collected.

The second approach requires that an object be able to manage refcounts 
via its trace function.

Consider what an extension module looks like under a tracing regime. 
Each extension class is required to provide a 'trace' function that 
iterates through all references held by an object.

The 'trace' function need not know the purpose of the trace - in other 
words, it need not know *why* the references are being iterated; its 
only concern is to provide access to each reference. This is most 
easily accomplished by passing a callback function to the trace 
function. The trace function iterates through the object's references 
and calls the callback once for each one.

Because the extension module doesn't know why the references are being 
traced, we have the freedom to redefine what a 'trace' means at 
various points in the transition.
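
In Python terms, the pattern looks something like this (a sketch only;
real extension types would implement the trace function in C, and all
names here are hypothetical):

class Node(object):
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
    def trace(self, visit):
        # Report each held reference; we neither know nor care why.
        for ref in (self.left, self.right):
            if ref is not None:
                visit(ref)

# The same trace function can serve two different purposes:
reachable = set()
def mark(obj):                    # marking pass of a tracing collector
    if id(obj) not in reachable:
        reachable.add(id(obj))
        obj.trace(mark)

dropped = []
def drop(obj):                    # teardown pass: release each reference
    dropped.append(obj)

root = Node(Node(), Node(Node()))
mark(root)                        # trace transitively
root.trace(drop)                  # same trace, different meaning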

So one scheme would allow a 'traced' object to exist in a 
reference-counted world by using the trace function to release 
references. When an object is destroyed, the trace function is called, 
and the callback releases each reference.

Dealing with mutation of references is trickier - there are a couple of 
approaches I've thought of, but none are particularly efficient. I guess 
the traced object will have to call the old refcounting functions, but 
via macros which can be no-op'd later.

-- Talin


From rasky at develer.com  Sun Sep 24 21:50:01 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 24 Sep 2006 21:50:01 +0200
Subject: [Python-3000] Removing __var
References: <6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
	<E1GRSj8-0006VK-K5@swing.co.at>
Message-ID: <0d2101c6e012$9f88be90$4bbd2997@bagio>

Christian Tanzer wrote:

> I don't use __ for `private`, I use it for making cooperative super
> calls (and `__super` occurs 1397 times in my sandbox).

I think you might be mistaking the symptom for the disease. To me, your mail
means that Py3k should grow some syntactic sugar for super calls. I guess if
that happens, you won't be missing __.

Giovanni Bajo


From martin at v.loewis.de  Sun Sep 24 22:00:35 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 24 Sep 2006 22:00:35 +0200
Subject: [Python-3000] Transitional GC?
In-Reply-To: <4516D4B3.50905@acm.org>
References: <4516D4B3.50905@acm.org>
Message-ID: <4516E3E3.6000005@v.loewis.de>

Talin schrieb:
> I wonder if there is a way to create an API for extension modules that 
> would allow a gradual phase-out of reference counting, towards a 'pure' GC.
> 
> (Let's leave aside the merits of reference counting vs. non-reference 
> counting for another thread - please.)
> 
> Most of the discussion up to this point has assumed that there's a sharp 
> line between the two GC schemes - in other words, once you switch over, 
> you have to migrate every extension module all at once.

I think this is a minor issue. Your approach assumes that moving to
a tracing GC will require module authors to change their code. Perhaps
that isn't necessary. It is difficult to tell, in the abstract, whether
your proposal works or not.

Regards,
Martin

From jcarlson at uci.edu  Sun Sep 24 23:45:36 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 14:45:36 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45158830.8020908@v.loewis.de>
References: <20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
Message-ID: <20060924144006.086D.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> Josiah Carlson schrieb:
> > For me, having recently remembered what was in a unicode string, and
> > verifying it by checking the source, the question in my mind is whether
> > we want to stick with the same 2-representation implementation (default
> > encoding and UTF-16 or UCS-4 depending on build), or go with more or
> > fewer representations.
> 
> I would personally like to see a Python API that operates on code
> points, with support for 17 planes. I also think that efficient indexing
> is important.

Fully-featured unicode would be nice.


> There are trade-offs, of course. I personally think the best trade-off
> would be to have a two-byte representation, along with a flag telling
> whether there are any surrogate pairs in the string. Indexing and
> length would be constant-time if there are no surrogates, and linear
> time if there are.

What about a tree structure over the top of the string as I described in
another post?  If there are no surrogate pairs, the pointer to the tree
is null.  If there are surrogate pairs, we could either use the
structure as I described, or even modify it so that we get even better
memory utilization/performance (choose tree nodes based on where
surrogate pairs are, up to some limit).
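
For illustration, here is a flattened variant of that idea as a sketch
(a sorted array of pair positions instead of a tree, in pure Python,
assuming a list of UTF-16 code units):

def pair_starts(units):
    # unit positions of high surrogates (each starts a 2-unit pair)
    return [i for i, u in enumerate(units) if 0xD800 <= u <= 0xDBFF]

def unit_index(starts, cp):
    # Map a code-point index to a code-unit index in O(log n): count
    # the pairs lying before cp (the k-th pair sits at code point
    # starts[k] - k, since each earlier pair used 2 units for 1 point).
    lo, hi = 0, len(starts)
    while lo < hi:
        mid = (lo + hi) // 2
        if starts[mid] - mid < cp:
            lo = mid + 1
        else:
            hi = mid
    return cp + lo

units = [0x41, 0xD835, 0xDD04, 0x42]    # 'A', one non-BMP char, 'B'
starts = pair_starts(units)             # empty list -> plain O(1) indexing
assert unit_index(starts, 2) == 3       # 'B' is the third code point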

 - Josiah


From jcarlson at uci.edu  Sun Sep 24 23:54:21 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 14:54:21 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ef5ugf$d9a$1@sea.gmane.org>
References: <45158830.8020908@v.loewis.de> <ef5ugf$d9a$1@sea.gmane.org>
Message-ID: <20060924144621.0870.JCARLSON@uci.edu>


Fredrik Lundh <fredrik at pythonware.com> wrote:
> Martin v. Löwis wrote:
> > I think supporting multiple representations at run-time would really
> > be terrible. Any API of the "give me the data" kind would either have
> > to expose the choice of representations, or perform a copy.
> 
> Unless you can guarantee that *all* external APIs that a Python 
> extension might want to use will use exactly the same internal 
> representation as Python, that's something that we have to deal with anyway.

I think Martin meant with regards to, for example, choosing an internal
Latin-1, UCS-2, or UCS-4 representation based on the code points of the
string.

I stated earlier that with a buffer interface that returned the *size*
of elements, users could program based on internal representation, but I
agree that it would be error prone.


What if we just chose UTF-16 as an internal representation?  No
default system encoding version attached (as there is right now). Extension
writers could write for the single representation, and convert if it
isn't what they want (and where is the default system encoding ever what
is desired?)


 - Josiah


From gabor at nekomancer.net  Mon Sep 25 01:48:29 2006
From: gabor at nekomancer.net (gabor)
Date: Mon, 25 Sep 2006 01:48:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45152234.1090303@v.loewis.de>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<45126E76.9020600@nekomancer.net> <45152234.1090303@v.loewis.de>
Message-ID: <4517194D.1030908@nekomancer.net>

Martin v. Löwis wrote:
> Gábor Farkas schrieb:
>> while i understand the constraints, i think it's not a good decision to 
>> leave this to be implementation-dependent.
>>
>> the strings seem to me as such a basic functionality, that it's 
>> behaviour should not depend on the platform.
>>
>> for example, how is an application developer then supposed to write 
>> their applications?
> 
> An application developer should always know what the target platforms
> are. For example, does the code need to work with IronPython or not?


i think if IronPython claims to be a python implementation, then at 
least a simple hello-world style string manipulation program should 
behave the same way on IronPython and on CPython.

(of course when it's a 'bigger' program that uses some python libraries, 
then yes, he should know. but we are talking about a builtin type here)

> Python is not aiming at 100% portability at all costs. Many aspects
> are platform dependent, and while this has complicated some
> applications, is has simplified others (which could make use of
> platform details that otherwise would not have been exposed to the
> Python programmer).

hmmm.. i thought that all those 'platform dependent' aspects are in the 
libraries (win32/sys/posix/os/whatever), and not in the "core" part.

so, are there any in the "core" (stupid naming i know. i mean 
not-in-libraries) part?


> 
>> should he write his own slicing/whatever functions to get consistent 
>> behaviour on linux/windows?
> 
> Depends on the application, and the specific slicing operations.
> If the slicing appears in the processing of .ini files (say),
> no platform-dependent slicing should be necessary.

why?

or do you simply assume that an ini file cannot contain non-bmp unicode 
characters?

but if you'd like to have an example then:

let's say in an application i only want to display the first 70 
characters of a string.

now, for this to behave correctly on non-bmp characters, i will need to 
write a custom function, correct?

> 
>> but the same way i could say, that because most of the unix-world is 
>> utf-8, for those pythons the best way is to handle it internally as 
>> utf-8, couldn't i?
> 
> I think you live in a free country: you can certainly say that
> I think you would be wrong. The common on-disk/on-wire representation
> of text should not influence the design of an in-memory representation.

sorry, i should have clarified this more.

i simply reacted to the situation that for example cpython-win32 and 
IronPython use 16bit unicode-strings, which makes it easy for them to 
communicate with the (afaik) mostly 16bit-unicode win32 API.

on the other hand, for example GTK uses utf8-encoded strings...so when 
on linux the python-GTK bindings want to transfer strings, they will 
have to do charset-conversion.

but this was only an example.

> 
>> it simply seems to me strange to make compromises that makes the life of 
>> the cpython-users harder, just to make the life for the 
>> jython/ironpython developers (i mean the 'creators') easier.
> 
> Guido didn't say that the life of the CPython user needs to be hard.

hmmm.. for me having to worry about string-handling differences in the 
programming language i use qualifies as 'harder'.

> He said it will be implementation-dependent, referring to Jython
> and IronPython.
> Whether or not CPython uses a consistent representation
> or consistent python-level experience across platforms is a different
> issue. CPython could behave absolutely consistently, and use four-byte
> Unicode on all systems, and the length of a non-BMP string would
> still be implementation-defined.
> 


i understand that difference.

(i just find it hard to believe that string-handling does not seem 
important enough to make it truly cross-platform (or cross-implementation))

gabor

From jcarlson at uci.edu  Mon Sep 25 06:34:12 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 21:34:12 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4517194D.1030908@nekomancer.net>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
Message-ID: <20060924210217.0873.JCARLSON@uci.edu>


gabor <gabor at nekomancer.net> wrote:
> Martin v. Löwis wrote:
> > Gábor Farkas schrieb:
[snip]
> > Python is not aiming at 100% portability at all costs. Many aspects
> > are platform dependent, and while this has complicated some
> > applications, it has simplified others (which could make use of
> > platform details that otherwise would not have been exposed to the
> > Python programmer).
> 
> hmmm.. i thought that all those 'platform dependent' aspects are in the 
> libraries (win32/sys/posix/os/whatever), and not in the "core" part.
> 
> so, are there any in the "core" (stupid naming i know. i mean 
> not-in-libraries) part?

import sys
sys.setrecursionlimit(10000)

def foo():
    foo()

foo()

Run that in Windows, and you get a MemoryError.  Run it in Linux, and
you get a segfault.  Blame linux malloc.


> >> should he write his own slicing/whatever functions to get consistent 
> >> behaviour on linux/windows?
> > 
> > Depends on the application, and the specific slicing operations.
> > If the slicing appears in the processing of .ini files (say),
> > no platform-dependent slicing should be necessary.
[snip]
> let's say in an application i only want to display the first 70 
> characters of a string.
> 
> now, for this to behave correctly on non-bmp characters, i will need to 
> write a custom function, correct?

That depends on what you mean by "now," and on the Python compile option.
If you mean that "today ... i would need to write a custom function",
then you would be correct on a utf-16 compiled Python for all characters
with a code point > 65535, but not so on a ucs-4 build (but perhaps both
when there are surrogate pairs). In the future, the plan, I believe, is
to attempt to make utf-16 behave like ucs-4 eith regards to all
operations available from Python, at least for all characters
represented with a single code point.
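
The kind of custom function gabor mentions would, on a utf-16 build,
look something like this sketch (pure Python; it avoids splitting a
surrogate pair when taking the first n characters):

def first_n(u, n):
    count = i = 0
    while i < len(u) and count < n:
        if 0xD800 <= ord(u[i]) <= 0xDBFF and i + 1 < len(u):
            i += 2          # surrogate pair: one character, two units
        else:
            i += 1
        count += 1
    return u[:i]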


> >> but the same way i could say, that because most of the unix-world is 
> >> utf-8, for those pythons the best way is to handle it internally as 
> >> utf-8, couldn't i?
> > 
> > I think you live in a free country: you can certainly say that
> > I think you would be wrong. The common on-disk/on-wire representation
> > of text should not influence the design of an in-memory representation.
> 
> sorry, i should have clarified this more.
> 
> i simply reacted to the situation that for example cpython-win32 and 
> IronPython use 16bit unicode-strings, which makes it easy for them to 
> communicate with the (afaik) mostly 16bit-unicode win32 API.
> 
> on the other hand, for example GTK uses utf8-encoded strings...so when 
> on linux the python-GTK bindings want to transfer strings, they will 
> have to do charset-conversion.
> 
> but this was only an example.

The current CPython implementation keeps two representations of unicode
strings in memory: the utf-16 or ucs-4 representation (depending on
compile-time options) and a default system encoding representation.  If
you set your default system encoding to be utf-8, Python doesn't need to
do anything more to hand unicode strings off to GTK, aside from
recognizing that it has what it wants already.


[snip]
> hmmm.. for me having to worry about string-handling differences in the 
> programming language i use qualifies as 'harder'.

With what Martin and Fredrik have been saying recently, I don't believe
that you have anything significant to worry about when it comes to
string behavior on CPython vs. IronPython, Jython, or even PyPy.


> > He said it will be implementation-dependent, referring to Jython
> > and IronPython.
> > Whether or not CPython uses a consistent representation
> > or consistent python-level experience across platforms is a different
> > issue. CPython could behave absolutely consistently, and use four-byte
> > Unicode on all systems, and the length of a non-BMP string would
> > still be implementation-defined.
> 
> i understand that difference.
> 
> (i just find it hard to believe, that string-handling does not seem 
> important enough to make it truly cross-platform (or cross-implementation))

It is important, arguably one of the most important pieces.  But there
are three parts: 1) code points not currently defined within the unicode
spec, but which have specific encodings (based on the code point value), 2)
in the case of UTF-16 representations, Python's handling of characters >
65535, 3) surrogates.

I believe #1 is handled "correctly" today, Martin sounds like he wants
#2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and
#3 could be fixed while fixing #2 with a little more work (if desired).


 - Josiah


From martin at v.loewis.de  Mon Sep 25 07:26:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 25 Sep 2006 07:26:30 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060924144006.086D.JCARLSON@uci.edu>
References: <20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
	<20060924144006.086D.JCARLSON@uci.edu>
Message-ID: <45176886.2090201@v.loewis.de>

Josiah Carlson schrieb:
> What about a tree structure over the top of the string as I described in
> another post?  If there are no surrogate pairs, the pointer to the tree
> is null.  If there are surrogate pairs, we could either use the
> structure as I described, or even modify it so that we get even better
> memory utilization/performance (choose tree nodes based on where
> surrogate pairs are, up to some limit).

As always, it's a time-vs-space tradeoff. People tend to resolve these
in favor of time, accepting an increase in space. I'm not so sure this
is always the right answer. In the specific case, I'm also worried about
the increase in complexity.

That said, it is always good to have a prototype implementation to
analyse the consequences better.

Regards,
Martin

From qrczak at knm.org.pl  Mon Sep 25 11:57:10 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 25 Sep 2006 11:57:10 +0200
Subject: [Python-3000] Transitional GC?
In-Reply-To: <4516D4B3.50905@acm.org> (talin@acm.org's message of "Sun, 24
	Sep 2006 11:55:47 -0700")
References: <4516D4B3.50905@acm.org>
Message-ID: <87psdkfevd.fsf@qrnik.zagroda>

Talin <talin at acm.org> writes:

> I wonder if there is a way to create an API for extension modules that 
> would allow a gradual phase-out of reference counting, towards a 'pure' GC.

I believe this is possible when C code doesn't access addresses of
Python objects directly, but via handles.

http://srfi.schemers.org/srfi-50/mail-archive/msg00295.html

See "Minor" link there, and the whole SRFI-50 discussion about
FFI styles.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Mon Sep 25 13:02:14 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 25 Sep 2006 13:02:14 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <4515DB65.2090505@canterbury.ac.nz> (Greg Ewing's message of
	"Sun, 24 Sep 2006 13:12:05 +1200")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<035f01c6def0$900594c0$4bbd2997@bagio>
	<4515DB65.2090505@canterbury.ac.nz>
Message-ID: <87hcywgqfd.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> All use of weakrefs can be hidden behind a call such as
>
>    register_finalizer(self, func, *args, **kwds)

It should be possible to finalize the object explicitly, given a
handle returned by this function, and possibly to kill the finalizer
without executing it.

The former is useful to implement close(). The latter is useful for
weak dictionaries: when an entry is removed because it's overwritten,
there is no need to keep a finalizer which will remove the old entry
when the key dies.

IMHO a weak reference can conveniently play the role of such handle.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From jimjjewett at gmail.com  Mon Sep 25 16:33:26 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 25 Sep 2006 10:33:26 -0400
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060924210217.0873.JCARLSON@uci.edu>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
Message-ID: <fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>

On 9/25/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> gabor <gabor at nekomancer.net> wrote:
> > Martin v. Löwis wrote:
> > > Gábor Farkas schrieb:

> > >> should he write his own slicing/whatever functions to get consistent
> > >> behaviour on linux/windows?

> > now, for this to behave correctly on non-bmp characters, i will need to
> > write a custom function, correct?

As David Hopwood pointed out, to be fully correct, you already have to
create a custom function even with bmp characters, because of
decomposed characters.  (Example:  Representing a c-cedilla as a c and
a combining cedilla, rather than as a single code point.)  Separating
those two would be wrong.  Counting them as two characters for slicing
purposes would usually be wrong.

Even 32-bit representations are permitted to use surrogate pairs; it
just doesn't often make sense.

These are problems inherent to unicode (or at least to non-normalized
unicode).  Different python implementations may expose the problem in
different places, but the problem is always there.

We *could* specify that slicing and indexing act as though the
underlying representation were normalized (and this would typically
require normalization as part of construction), but I'm not sure that
is the right answer.  Even if it were trivial, there are reasons not
to normalize.

> It is important, arguably one of the most important pieces.  But there
> are three parts; 1) code points not currently defined within the unicode
> spec, but who have specific encodings (based on the code point value), 2)
> in the case of UTF-16 representations, Python's handling of characters >
> 65535, 3) surrogates.

> I believe #1 is handled "correctly" today, Martin sounds like he wants
> #2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and
> #3 could be fixed while fixing #2 with a little more work (if desired).

You also left out (4), decomposed characters, which is a more complex
version of surrogates.

Guido just stated that #2 is intentional,  though he didn't pronounce
that it should stay that way.  There are sound arguments both ways.
In particular, fixing it without fixing decomposed characters might
incur the cost without the benefit.

-jJ

From paul at prescod.net  Mon Sep 25 17:50:16 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 25 Sep 2006 08:50:16 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
Message-ID: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>

On 9/25/06, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> As David Hopwood pointed out, to be fully correct, you already have to
> create a custom function even with bmp characters, because of
> decomposed characters.  (Example:  Representing a c-cedilla as a c and
> a combining cedilla, rather than as a single code point.)  Separating
> those two would be wrong.  Counting them as two characters for slicing
> purposes would usually be wrong.


> Even 32-bit representations are permitted to use surrogate pairs; it
> just doesn't often make sense.

There is at least one big difference between surrogate pairs and decomposed
characters. The user can typically normalize away decompositions. How do you
normalize away surrogate pairs in a language that only supports 16-bit
representations?

Paul Prescod

From fredrik at pythonware.com  Mon Sep 25 18:01:21 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Mon, 25 Sep 2006 18:01:21 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4516B2D0.9020109@v.loewis.de>
References: <4511E644.2030306@blueyonder.co.uk>	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>	<45158830.8020908@v.loewis.de>	<ef5ugf$d9a$1@sea.gmane.org>
	<4516B2D0.9020109@v.loewis.de>
Message-ID: <ef8ugh$t6o$1@sea.gmane.org>

Martin v. Löwis wrote:

>>> I think supporting multiple representations at run-time would really
>>> be terrible. Any API of the "give me the data" kind would either have
>>> to expose the choice of representations, or perform a copy.
 >>
>> Unless you can guarantee that *all* external APIs that a Python 
>> extension might want to use will use exactly the same internal 
>> representation as Python, that's something that we have to deal with anyway.
> 
> APIs will certainly allow different kinds of memory buffers to
> create a Python string object. Creation is a fairly small part
> of the API

creation is not the problem; it's the "give me the data" API that's the 
problem.  or rather, the "give me the data in a form that's compatible 
with the 3rd party API that I'm about to call" API.

> I believe it would noticeably simplify the implementation if there is
 > only a single internal representation.

and I, wearing my string algorithm implementor hat, tend to disagree 
with that.  writing source code that can be compiled into efficient code 
for multiple representations is mostly trivial, even in C.

</F>


From david.nospam.hopwood at blueyonder.co.uk  Tue Sep 26 01:19:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 26 Sep 2006 00:19:54 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
References: <45152234.1090303@v.loewis.de>
	<4517194D.1030908@nekomancer.net>	<20060924210217.0873.JCARLSON@uci.edu>	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
Message-ID: <4518641A.4070500@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/25/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> 
>> As David Hopwood pointed out, to be fully correct, you already have to
>> create a custom function even with bmp characters, because of
>> decomposed characters.  (Example:  Representing a c-cedilla as a c and
>> a combining cedilla, rather than as a single code point.)  Separating
>> those two would be wrong.  Counting them as two characters for slicing
>> purposes would usually be wrong.
> 
> Even 32-bit representations are permitted to use surrogate pairs; it
> just doesn't often make sense.
> 
> There is at least one big difference between surrogate pairs and decomposed
> characters. The user can typically normalize away decompositions.

That depends what script they're using. For some scripts, they can't.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From rhettinger at ewtllc.com  Tue Sep 26 01:41:51 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Mon, 25 Sep 2006 16:41:51 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
Message-ID: <4518693F.1050500@ewtllc.com>


>>I've never seen an API that works like that. Have you?
>
>The class above shows a case where:
>
>1) There's a way to destruct the handle BEFORE __del__ is called, which would
>require killing the weakref / deregistering the finalization hook. I believe
>you agree that this is pretty common (I've around 10 usages of this pattern,
>__del__ with a separate explicit closure method, in one Python base-code of
>mine).
>
ISTM, you've adopted __del__ as your best friend, learned to avoid its 
pitfalls, employed it throughout your code, and forsaken weakref-based 
approaches, which is understandable because weakrefs came along rather 
late in the game. I congratulate you on that level of accomplishment.

I support the original suggestion to remove __del__ because I think that 
most programmers would be better off without it, that weakref-based 
alternatives are possible (though not necessarily easier or more 
succinct), and that explicit finalization is preferable to implicit 
(i.e. there's a reason for the advice to wrap file access in a try/finally 
to make sure an explicit close() occurs). 

In a world dominated by new-style classes, it is a strong plus that 
weakrefs reliably avoid creating cycles which subtly block or delay 
finalization.  Eliminating __del__ will also mean an end to 
implementation headaches relating to issues stemming from arbitrary 
finalization code running while an object is still alive. The __del__ 
special method has long been a dark corner of Python, a rarely used and 
error-prone tool.  Just having it around creates a suggestion that it 
would be a good idea to design code relying on implicit finalization and 
the fragile hope that you or some future maintainer doesn't accidentally 
keep a reference to an object you had intended to vanish of its own accord.

In short, __del__ should disappear not because it is useless but because 
it is hazardous.  The consenting adults philosophy means that we don't 
put-up artificial barriers to intentional hacks, but it does not mean 
that we bait the hook and leave error-prone traps for the unwary.  In 
Py3k, I would like to see explicit finalization as a preferred approach 
and for weakrefs be the one-way-to-do-it for designs with implicit 
finalization.


Raymond

From rasky at develer.com  Tue Sep 26 10:59:33 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 10:59:33 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
Message-ID: <120f01c6e14a$15edbfd0$4bbd2997@bagio>

Raymond Hettinger wrote:

> In short, __del__ should disappear not because it is useless but
> because
> it is hazardous.  The consenting adults philosophy means that we don't
> put-up artificial barriers to intentional hacks, but it does not mean
> that we bait the hook and leave error-prone traps for the unwary.  In
> Py3k, I would like to see explicit finalization as a preferred
> approach
> and for weakrefs be the one-way-to-do-it for designs with implicit
> finalization.

Raymond, there is one thing I don't understand in your line of reasoning. You
say that you prefer explicit finalization, but that implicit finalization still
needs to be supported. And for that, you'd rather drop __del__ and use
weakrefs. But why? You say that __del__ is hazardous, but I can't see how
weakrefs are less hazardous. As an implicit finalization method, they live on
the fragile assumption that the callback won't hold a reference to the object:
an assumption which cannot be enforced in any way but cautious programming and
scrupulous auditing of the code. I assert that they hide bugs much better than
__del__ does (it's pretty easy to find an offending __del__ by looking at
gc.garbage, while it's harder to notice a missing finalization because the
cycle loop involving the weakref callback was broken at the wrong point).

I guess there's something escaping me. If we have to drop one, why is it
__del__? And if __del__ could be fixed to reliably work in reference cycles,
would you still want to drop it?

Giovanni Bajo


From greg.ewing at canterbury.ac.nz  Tue Sep 26 11:57:13 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 26 Sep 2006 21:57:13 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
Message-ID: <4518F979.50902@canterbury.ac.nz>

Giovanni Bajo wrote:
> I assert that they hide bugs much better than
> __del__ does (it's pretty easy to find an offending __del__ by looking at
> gc.garbage,

It should be feasible to modify the cyclic GC to
detect groups of objects that are only being kept
alive by references from the finalizer list. These
could be treated the same way as __del__-containing
cycles are now, and moved to a garbage list.

--
Greg

From tim.peters at gmail.com  Tue Sep 26 13:01:23 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Tue, 26 Sep 2006 07:01:23 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
Message-ID: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>

[Giovanni Bajo]
> Raymond, there is one thing I don't understand in your line of reasoning. You
> say that you prefer explicit finalization, but that implicit finalization still
> needs to be supported. And for that, you'd rather drop __del__ and use
> weakrefs. But why? You say that __del__ is hazardous, but I can't see how
> weakrefs are less hazardous. As an implicit finalization method, they live on
> the fragile assumption that the callback won't hold a reference to the object:
> an assumption which cannot be enforced in any way but cautious programming and
> scrupulous auditing of the code.

Nope, not so.  Read Modules/gc_weakref.txt for the gory details.  In
outline, there are three objects of interest here:  the weakly
referenced object (WO), the weakref (WR) to the WO, and the callback
(CB) callable attached to the WR.

/Normally/ the CB is reachable (== not trash).  If a reachable CB has
a strong reference to the WO, then that keeps the WO reachable too,
and of course the CB won't be invoked so long as its strong reference
keeps the WO alive.  The CB can't become trash either so long as the
WR is reachable, since the WR holds a strong reference to the CB.  If
the WR becomes trash while the WO is reachable, the WR clears its
reference to the CB, and then the CB will never be invoked period.

OTOH, if the CB has a weak reference to the WO, then when the WO goes
away and the CB is invoked, the CB's weak reference returns None
instead of the WO.

So in no case can a reachable CB actually get at the WO via the CB's
own strong or weak reference to the WO.  More, this is true even if
the WO is just strongly reachable via any path /from/ a reachable CB:
the fact that the CB is reachable guarantees the WO is reachable then
too.

Skipping details, things get muddier only when all three of these
objects end up in cyclic trash (CT) "at the same time".  The dodge
Python currently takes is that, when a WR is part of CT, and the WR's
referent is also part of CT, the WR's CB (if any) is never invoked.
This is defensible since the order in which trash objects are
finalized isn't defined, so it's legitimate to kill the WR first.
It's unclear whether that's entirely desirable behavior, though.
There were excruciating discussions about this earlier, but nobody had
a concrete use case favoring a specific position.

From rasky at develer.com  Tue Sep 26 14:19:34 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 14:19:34 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <01d301c6e166$070bde90$e303030a@trilan>

Tim Peters wrote:

> [Giovanni Bajo]
>> Raymond, there is one thing I don't understand in your line of
>> reasoning. You say that you prefer explicit finalization, but that
>> implicit finalization still needs to be supported. And for that,
>> you'd rather drop __del__ and use weakrefs. But why? You say that
>> __del__ is hazardous, but I can't see how weakrefs are less
>> hazardous. As an implicit finalization method, they live on the
>> fragile assumption that the callback won't hold a reference to the
>> object: an assumption which cannot be enforced in any way but
>> cautious programming and scrupulous auditing of the code.
>
> Nope, not so.  Read Modules/gc_weakref.txt for the gory details.
> [...]
> The dodge
> Python currently takes is that, when a WR is part of CT, and the WR's
> referent is also part of CT, the WR's CB (if any) is never invoked.
> This is defensible since the order in which trash objects are
> finalized isn't defined, so it's legitimate to kill the WR first.
> It's unclear whether that's entirely desirable behavior, though.
> There were excruciating discussions about this earlier, but nobody had
> a concrete use case favoring a specific position.

Thanks for the explanation, and I believe you are confirming my position.
You are saying that the CB of a WR which is part of CT is never invoked. In
the above quote, I'm saying that if the user makes a mistake and writes a CB
(as an implicit finalizer) which holds a reference to the WO, they create a
CT, so the CB will never be invoked. For instance:

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        # BUG HERE: the callback references self
        self._wr = weakref.ref(self, lambda ref: CAPI.close(self.handle))

In this case, we have a CT: a Wrapper instance is the WO, which holds a
strong reference to the WR (self._wr), which holds a strong reference to the
CB (the lambda), which holds a strong reference to the WO again (through the
implicit usage of nested scopes). Thus, in this case, the CB will never be
called. Is that right? I have tried this variant to verify it myself:

>>> import weakref
>>> class Wrapper:
...     def __init__(self):
...             def test(ref):
...                     print "finalizer called", self.a
...             self.a = 1234
...             self._wr = weakref.ref(self, test)
...
>>> w = Wrapper()
>>> del w
>>> import gc
>>> gc.collect()
6
>>> gc.collect()
0
>>> gc.garbage
[]


Given these examples, I still can't see why weakrefs are thought to be a
preferable solution for implicit finalization, when compared to __del__.
They mostly share the same problems when it comes to cyclic trash, but
__del__ is far easier to teach, explain and understand. I can quickly teach
people not to use cycles with __del__, and I can verify whether there's a
mistake by looking at gc.garbage; teaching how to properly use weakrefs,
callbacks, and how to avoid reference loops with nested scopes is much harder
in the first place, and does not seem to provide any advantage.

===============================

Tim, I sort of hoped you would jump into this discussion. I had this link
around that I wanted to show you:
http://mail.python.org/pipermail/python-dev/2000-March/002526.html

I re-read most threads in those weeks about finalization issues with cyclic
trash. Guido was proposing a solution with __del__ and CT, which
approximately worked this way:

- When a CT is detected, any __del__ method is invoked once per instance, in
random order.
- We make sure that each __del__ method is called once and only once per
instance (by using some sort of flag; Guido was proposing to set
self.__dict__["__del__"] = None, but that predates new-style classes as far
as I can tell).
- After all __del__ methods in the CT have been called exactly once, we
collect the trash as usual (break links by reclaiming the __dict__ of the
instances, or whatever); a rough sketch follows.
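
As a toy model of that rule (this is not CPython's collector, and
__close__ is the proposed hook):

def collect_cycle(trash):
    # First pass: run each finalizer exactly once, in arbitrary order,
    # while every object in the cycle is still fully intact.
    for obj in trash:
        if not getattr(obj, '_finalized', False):
            obj._finalized = True           # the once-only flag
            close = getattr(obj, '__close__', None)
            if close is not None:
                close()
    # Second pass: only now break the links.
    for obj in trash:
        obj.__dict__.clear()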

Since we are discussing Py3k here, I believe it is the right time to revive
this discussion. The __close__ proposal I'm backing (summed up in this mail:
http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is
pretty similar to how Guido was proposing to modify __del__. If there are
technical grounds for this (and my opinion does not matter much, but Guido
was proposing the same thing, which kind of gives me hope in this regard),
I believe it would be a far superior solution for the problem of implicit
finalization in the presence of CT in Py3k.

I think the idea is that, if you make sure that a __close__ method is called
exactly once (and before __dict__ is reclaimed), it really does not matter
much in which order you call __close__ methods within the CT. I mean, it
*might* matter for already-written in-the-wild __del__ methods of course,
but it sounds a *very* reasonable constraint for Py3k's __close__ methods. I
would like to see real-world examples where calling __close__ in random
order breaks things.

In the message linked above, you reply with:

[Tim]
> I would have no objection to "__del__ called only once" if it weren't
> for that Python currently does something different.  I don't know
> whether people rely on that now; if they do, it's a much more
> dangerous thing to change than adding a new keyword.

Would you still hold the same position? Do you consider this "only once"
rule as a possible way to solve implicit finalization in GC?
-- 
Giovanni Bajo


From qrczak at knm.org.pl  Tue Sep 26 14:24:27 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 26 Sep 2006 14:24:27 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> (Tim
	Peters's message of "Tue, 26 Sep 2006 07:01:23 -0400")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <87lko63jes.fsf@qrnik.zagroda>

"Tim Peters" <tim.peters at gmail.com> writes:

> Read Modules/gc_weakref.txt for the gory details.

"It's a feature of Python's weakrefs too that when a weakref goes
away, the callback (if any) associated with it is thrown away too,
unexecuted."

I disagree with this choice. Doesn't it prevent weakrefs from being used as
finalizers?

Here is my semantics:

The core weakref constructor has three arguments: a key, a value,
and a finalizer.

(The finalizer is conceptually a function with no parameters.
In Python it's more convenient to make it a function with any arity,
along with the associated arguments.)

It's often the case that the key and the value are the same object.
The simplified form of the weakref constructor makes this assumption
and takes only a single object and a finalizer. The generic form is
needed for dictionaries with weak keys.

Creating a weak reference establishes a relationship:
- The key keeps the value alive.
- The weak reference and the finalizer are alive.

When the key dies, the relationship ends, and the finalizer is added
to a queue of finalizers to be executed.

Given a weak reference, you can obtain the value, which may instead report
that the weakref is dead (None). You can also invoke the finalizer
explicitly, which also ends the relationship (the calling thread is
suspended if the finalizer is currently executing). And you can kill the
weak reference, ending the relationship.

I believe this is a sufficient design for most practical purposes.
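
Most of these semantics can be sketched on top of today's weakref module.
Here is an illustrative, untested approximation of the simplified
(key == value) form, with all names invented for the example; the generic
key/value form needs ephemeron-style GC support that CPython's weakrefs
cannot express directly:

import weakref

_live = set()          # keeps each relationship (and its finalizer) alive
_finalizer_queue = []  # finalizers of dead keys, waiting to be executed

class WeakFinal(object):
    """Sketch of the simplified weakref-with-finalizer described above."""

    def __init__(self, obj, finalizer, *args):
        self._pending = (finalizer, args)
        # The weakref holds its callback (a bound method) strongly; this
        # forms a small self-cycle that the cyclic GC reclaims once the
        # relationship ends.
        self._ref = weakref.ref(obj, self._key_died)
        _live.add(self)

    def _key_died(self, _wr):
        # Key died: queue the finalizer instead of throwing it away.
        _live.discard(self)
        if self._pending is not None:
            _finalizer_queue.append(self._pending)
            self._pending = None

    def value(self):
        """Return the object, or None if the weakref is dead."""
        return self._ref()

    def finalize(self):
        """Invoke the finalizer explicitly; this too ends the relationship."""
        _live.discard(self)
        if self._pending is not None:
            fn, args = self._pending
            self._pending = None
            fn(*args)

    def kill(self):
        """Kill the weak reference: end the relationship, run no finalizer."""
        _live.discard(self)
        self._pending = None

In this sketch run-time support would drain _finalizer_queue at safe points;
here that is left to the application.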

See also
http://www.haible.de/bruno/papers/cs/weak/WeakDatastructures-writeup.html
but I disagree with the section about finalizers.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 26 15:15:17 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 26 Sep 2006 15:15:17 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <01d301c6e166$070bde90$e303030a@trilan> (Giovanni Bajo's
	message of "Tue, 26 Sep 2006 14:19:34 +0200")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<01d301c6e166$070bde90$e303030a@trilan>
Message-ID: <87irjahiqi.fsf@qrnik.zagroda>

"Giovanni Bajo" <rasky at develer.com> writes:

> Guido was proposing a solution with __del__ and CT, which
> approximately worked this way:
>
> - When a CT is detected, any __del__ method is invoked once per
> instance, in random order.

This means that __del__ may attempt to use an object which has already
had its __del__ called.

> Since we are discussing Py3k here, I believe it is the right time to revive
> this discussion. The __close__ proposal I'm backing (sumed up in this mail:
> http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is
> pretty similar to how Guido was proposing to modify __del__.

"1) call __close__ on the instances *BEFORE* dropping the references.
The code in __close__ could break the cycle itself."

Same problem as above.

Note that the problem is solvable when the subset of links in these
objects which is needed during finalization doesn't contain cycles.
But the language implementation can't know *which* links those are.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From jimjjewett at gmail.com  Tue Sep 26 15:22:21 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 09:22:21 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <4518F979.50902@canterbury.ac.nz>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<4518F979.50902@canterbury.ac.nz>
Message-ID: <fb6fbf560609260622r43bcb1c0uc887b72e901a0701@mail.gmail.com>

On 9/26/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Giovanni Bajo wrote:
> > I assert that they hide bugs much better than
> > __del__ does (it's pretty easy to find an offending __del__ by looking at
> > gc.garbage,

> It should be feasible to modify the cyclic GC to
> detect groups of objects that are only being kept
> alive by references from the finalizer list.

This would let you use a bound method again, but ...

Given this complexity, what advantage would it have over __del__, let
alone __close__?

-jJ

From jimjjewett at gmail.com  Tue Sep 26 15:30:01 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 09:30:01 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <fb6fbf560609260630k48130c98h28de5190653fba31@mail.gmail.com>

On 9/26/06, Tim Peters <tim.peters at gmail.com> wrote:
> [Giovanni Bajo]
> > You say that __del__ is hazardous, but I can't see how
> > weakrefs are less hazardous. As an implicit finalization method, they live on
> > the fragile assumption that the callback won't hold a reference to the object:

> Nope, not so.

I think you read "live" as "not trash", but in this particular
sentence, he meant it as "be useful".

> Read Modules/gc_weakref.txt for the gory details.  In
> outline, there are three objects of interest here:  the weakly
> referenced object (WO), the weakref (WR) to the WO, and the callback
> (CB) callable attached to the WR.

> /Normally/ the CB is reachable (== not trash).

(Otherwise it can't act as a finalizer, because it isn't around)

> If a reachable CB has
> a strong reference to the WO, then that keeps the WO reachable too,

So it doesn't act as a finalizer; it acts as an immortalizer.  All the
pain of __del__, and it takes only one to make a loop.  (Bound methods
are in this category.)

> OTOH, if the CB has a weak reference to the WO, then when the WO goes
> away and the CB is invoked, the CB's weak reference returns None
> instead of the WO.

So it still can't act as a proper finalizer, if only because it isn't
fast enough.

-jJ

From rasky at develer.com  Tue Sep 26 15:32:10 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 15:32:10 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<fb6fbf560609260630k48130c98h28de5190653fba31@mail.gmail.com>
Message-ID: <033901c6e170$2b8103e0$e303030a@trilan>

Jim Jewett wrote:

>>> You say that __del__ is hazardous, but I can't see how
>>> weakrefs are less hazardous. As an implicit finalization method,
>>> they live on the fragile assumption that the callback won't hold a
>>> reference to the object: 
> 
>> Nope, not so.
> 
> I think you read "live" as "not trash", but in this particular
> sentence, he meant it as "be useful".

Yes. Sorry for my bad English...
-- 
Giovanni Bajo

From ncoghlan at gmail.com  Tue Sep 26 16:12:10 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 00:12:10 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
Message-ID: <4519353A.2030103@gmail.com>

Giovanni Bajo wrote:
> Raymond Hettinger wrote:
> 
>> In short, __del__ should disappear not because it is useless but
>> because
>> it is hazardous.  The consenting adults philosophy means that we don't
>> put up artificial barriers to intentional hacks, but it does not mean
>> that we bait the hook and leave error-prone traps for the unwary.  In
>> Py3k, I would like to see explicit finalization as a preferred
>> approach
>> and for weakrefs be the one-way-to-do-it for designs with implicit
>> finalization.
> 
> Raymond, there is one thing I don't understand in your line of reasoning. You
> say that you prefer explicit finalization, but that implicit finalization still
> needs to be supported. And for that, you'd rather drop __del__ and use
> weakrefs. But why? You say that __del__ is hazardous, but I can't see how
> weakrefs are less hazardous.

As I see it, __del__ is more hazardous because it's an attractive nuisance - 
it *looks* like it should be easy to use, but I'm willing to bet that a lot of 
the __del__ methods implemented in the wild are either actual or potential 
bugs. For example, it would be easy for a maintenance programmer to make a 
change to include a reference in a data structure from a child node back to 
its parent node to address a problem, and suddenly the application's memory 
usage goes through the roof due to uncollectable cycles. Even the initial 
implementation of the generator __del__ slot in the *Python 2.5 core* was 
buggy, leading to such cycles - if the developers of the Python interpreter 
find it hard to get __del__ right, then there's something seriously wrong with 
it in its current form.

Explicitly stating that __del__ will go away in Py3k, with the current
intent being to replace it with explicit finalization (via with statements)
and the implicit finalization offered by weakref callbacks, encourages
people to look for ways to make the API for the latter easier to use.

For example, a "finalizer" factory function could be added to weakref:

import weakref

_finalizer_refs = set()
def finalizer(*args, **kwds):
     """Create a finalizer from an object, callback and keyword dictionary"""
     # Use positional args and a closure to avoid namespace collisions
     obj, callback = args
     def _finalizer(_ref=None):
         """Callable that invokes the finalization callback"""
         # Use closure to get at weakref to allow direct invocation
         # This creates a cycle, so this approach relies on cyclic GC
         # to clean up the finalizer objects!
         try:
             _finalizer_refs.remove(ref)
         except KeyError:
             pass
         else:
             callback(_finalizer)
     # Give callback access to keyword arguments
     _finalizer.__dict__ = kwds
     ref = weakref.ref(obj, _finalizer)
     _finalizer_refs.add(ref)
     return _finalizer

Example usage:

from weakref import finalizer

class Wrapper(object):
     def __init__(self, x=1):
         self._data = finalizer(self, self.finalize, x=x)
     @staticmethod
     def finalize(data):
         print "Finalizing: value=%s!" % data.x
     def get_value(self):
         return self._data.x
     def increment(self, by=1):
         self._data.x += by
     def close(self):
         self._data()  # Explicitly invoke the finalizer
         self._data = None

 >>> test = Wrapper()
 >>> test.get_value()
1
 >>> test.increment(2)
 >>> test.get_value()
3
 >>> del test
Finalizing: value=3!
 >>> test = Wrapper()
 >>> test.get_value()
1
 >>> test.increment(2)
 >>> test.get_value()
3
 >>> test.close()
Finalizing: value=3!
 >>> del test

For comparison, here's the __del__ based version (which has the downside of 
potentially giving the cyclic GC fits if other attributes are added to the 
object):

class Wrapper(object):
     def __init__(self, x=1):
         self._x = x
     def __del__(self):
         if self._x is not None:
             print "Finalizing: value=%s!" % self._x
     def get_value(self):
         return self._x
     def increment(self, by=1):
         self._x += by
     def close(self):
         self.__del__()
         self._x = None

Not counting the import line, both versions are 13 lines long (granted, the 
weakref version would be a bit longer if the finalizer needed access to public 
attributes - in that case, the weakref version would need to use properties to 
hide the existence of the finalizer object).

Cheers,
Nick.

P.S. the central finalizers list also works a treat for debugging why objects 
aren't getting finalized as expected - a simple loop like "for wr in 
weakref.finalizers: print gc.get_referrers(wr)" after a gc.collect() call 
works pretty well.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Tue Sep 26 16:12:15 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 10:12:15 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <87irjahiqi.fsf@qrnik.zagroda>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<01d301c6e166$070bde90$e303030a@trilan> <87irjahiqi.fsf@qrnik.zagroda>
Message-ID: <fb6fbf560609260712o244e8589sfac901ab892ecd6@mail.gmail.com>

On 9/26/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Giovanni Bajo" <rasky at develer.com> writes:

> > Guido was proposing a solution with __del__ and CT, which
> > approximately worked this way:

> > - When a CT is detected, any __del__ method is invoked once per
> > instance, in random order.

[Note that this "__del__" is closer to what we've been calling
__close__ than to the existing __del__.]

Note that the "at most" part of "once" is already a stronger promise
than __close__.  That's OK (maybe even helpful) for users, it just
makes the implementation harder.

> This means that __del__ [~= __close__] may attempt to use an object which
> has already had its __del__ called.

Yes; this is the most important change between today's __del__
and the proposed __close__.

Today's __del__ doesn't have to defend against messed up subobjects,
because it immortalizes them.  A __close__ method would need to defend
against this, because of the arbitrary ordering.

In practice, close methods already defend against this anyhow, largely
because they know that they might be called by __del__ even after
being called explicitly.

-jJ

From rasky at develer.com  Tue Sep 26 16:41:52 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 16:41:52 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
Message-ID: <047f01c6e179$e7b74d40$e303030a@trilan>

Nick Coghlan wrote:

>> Raymond, there is one thing I don't understand in your line of
>> reasoning. You say that you prefer explicit finalization, but that
>> implicit finalization still needs to be supported. And for that,
>> you'd rather drop __del__ and use weakrefs. But why? You say that
>> __del__ is hazardous, but I can't see how weakrefs are less
>> hazardous.
>
> As I see it, __del__ is more hazardous because it's an attractive
> nuisance - it *looks* like it should be easy to use, but I'm willing
> to bet that a lot of the __del__ methods implemented in the wild are
> either actual or potential bugs. For example, it would be easy for a
> maintenance programmer to make a change to include a reference in a
> data structure from a child node back to its parent node to address a
> problem, and suddenly the application's memory usage goes through the
> roof due to uncollectable cycles.

Is it easier or harder to detect such a cycle, compared to accidentally
adding a reference to self (through implicit nested scopes, or bound
methods) in the finalizer callback? You have to admit that, at best, they
are equally hazardous.

As things stand *now* (in Python 2.5, I mean), __del__ is easier to
understand/teach, easier to debug (gc.garbage vs finalizers silently
ignored), and easier to use (no boilerplate in user code, no additional
finalization API, which does not even exist yet). I saw numerous proposals
to address these weakref "defects" by adding some kind of finalizer API, by
modifying the GC to put uncollectable loops with weakref finalizers in
gc.garbage, and so on. Most finalization APIs (including yours) create
cycles just by being used, which also means that you *must* wait for the GC
to kick in before the object is finalized, making them useless for several
situations where you want implicit finalization to happen immediately
(file.close(), just to name one). [And we are speaking of implicit
finalization now; I know about 'with'.]

It would require some effort to make weakref finalizers even *barely* as
usable as __del__, and it would absolutely not solve the problem per se: the
user will still have to pay attention and understand the hoops (a different
kind of hoops, but still hoops). So, why do we not spend this same time
trying to *fix* __del__ instead? If somebody comes up with a sane way to
define the semantics of a new finalizer method (like the __close__
proposal), which can be invoked *even* in the case of cycles, would you
still prefer to go the weakref way?


> Even the initial implementation of
> the generator __del__ slot in the *Python 2.5 core* was buggy,
> leading to such cycles - if the developers of the Python interpreter
> find it hard to get __del__ right, then there's something seriously
> wrong with it in its current form.

I don't think it's a fair comparison: the generator is a pretty complex
class, compared to the average class developed in Python which might need a
__del__ method. I would also bet that you would get your first attempt at
finalizing generators through weakrefs wrong.


> By explicitly stating that __del__ will go away in Py3k, with the
> current intent being to replace it with explicit finalization (via
> with statements) and the implicit finalization offered by weakref
> callbacks, it encourages people to look for ways to make the API for
> the latter easier to use.
>
> For example, a "finalizer" factory function could be added to weakref:
>
> _finalizer_refs = set()
> def finalizer(*args, **kwds):
>      """Create a finalizer from an object, callback and keyword
>      dictionary""" # Use positional args and a closure to avoid
>      namespace collisions obj, callback = args
>      def _finalizer(_ref=None):
>          """Callable that invokes the finalization callback"""
>          # Use closure to get at weakref to allow direct invocation
>          # This creates a cycle, so this approach relies on cyclic GC
>          # to clean up the finalizer objects!
>          try:
>              _finalizer_refs.remove(ref)
>          except KeyError:
>              pass
>          else:
>              callback(_finalizer)
>      # Give callback access to keyword arguments
>      _finalizer.__dict__ = kwds
>      ref = weakref.ref(obj, _finalizer)
>      _finalizer_refs.add(ref)
>      return _finalizer

So uhm, am I reading it wrong, or does your implementation (like any other
similar API I have seen till now) create a cycle *just* by being used? This
finalizer API obfuscates user code by forcing the use of a separate _data
object to hold (part of) the context for apparently no good reason, and
makes the object collectable *only* through the cyclic GC (while __del__
would happily be invoked in simple cases when the object goes out of scope).

> P.S. the central finalizers list also works a treat for debugging why
> objects aren't getting finalized as expected - a simple loop like
> "for wr in weakref.finalizers: print gc.get_referrers(wr)" after a
> gc.collect() call works pretty well.

Yes, this is indeed interesting. One step closer to getting the __del__
feature set :)
-- 
Giovanni Bajo


From rrr at ronadam.com  Tue Sep 26 16:45:03 2006
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 26 Sep 2006 09:45:03 -0500
Subject: [Python-3000] Removing __del__
In-Reply-To: <01d301c6e166$070bde90$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>	<120f01c6e14a$15edbfd0$4bbd2997@bagio>	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<01d301c6e166$070bde90$e303030a@trilan>
Message-ID: <efbejo$j7$1@sea.gmane.org>

Giovanni Bajo wrote:

> Since we are discussing Py3k here, I believe it is the right time to revive
> this discussion. The __close__ proposal I'm backing (summed up in this mail:
> http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is
> pretty similar to how Guido was proposing to modify __del__. If there are
> technical grounds for this (and my opinion does not matter much, but Guido
> was proposing the same thing, which kind of gives me hope in this regard),
> I believe it would be a far superior solution for the problem of implicit
> finalization in the presence of CT in Py3k.
> 
> I think the idea is that, if you make sure that a __close__ method is called
> exactly once (and before __dict__ is reclaimed), it really does not matter
> much in which order you call __close__ methods within the CT. I mean, it
> *might* matter for already-written in-the-wild __del__ methods of course,
> but it sounds like a *very* reasonable constraint for Py3k's __close__ methods. I
> would like to see real-world examples where calling __close__ in random
> order breaks things.

How about...?  (This isn't an area I'm really familiar with.)


Replace __del__ with:

    a __final__ method and a __finalized__ flag.  (or other equivalent names)

    Insist on explicit finalizing by raising an exception if an object's
    __finalized__ flag is still False when it loses its last reference.


Would this be difficult to do in a timely way so the traceback is meaningful?

Would this avoid the problems being discussed with both __del__ and weak refs?
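
One rough way to approximate the idea with today's machinery might look like
the sketch below (names invented; it reuses __del__ itself as the detection
hook, which is exactly what this thread wants to avoid, so it only
illustrates the behaviour):

import sys

class MustFinalize(object):
    """Objects that must be explicitly finalized before dying."""
    __finalized__ = False

    def __final__(self):
        # ... release resources here, then record that we did ...
        self.__finalized__ = True

    def __del__(self):
        if not self.__finalized__:
            # An exception raised in a finalizer is only printed anyway,
            # so report the omission directly.
            print >> sys.stderr, "%r lost its last reference unfinalized" % self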


    Ron



From ncoghlan at gmail.com  Tue Sep 26 17:41:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 01:41:58 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <047f01c6e179$e7b74d40$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan>
Message-ID: <45194A46.90406@gmail.com>

Giovanni Bajo wrote:
> It would require some effort to make weakref finalizers *barely* as usable
> as __del__, and will absolutely not solve the problem per-se: the user will
> still have to pay attention and understand the hoops (different kind of
> hoops, but still hoops). So, why do we not spend this same time trying to
> *fix* __del__ instead? If somebody comes up with a sane way to define the
> semantic for a new finalizer method (like the __close__ proposal), which can
> be invoked *even* in the case of cycles, would you still prefer to go the
> weakref way?

Yes. I believe any replacement for __del__ should be syntactic sugar for some
form of weak reference callback. At the moment, we have two finalization
methods (__del__ and weakref callbacks). Py3k gives us the opportunity to get
rid of one of them. Since weakref callbacks are strictly more powerful,
__del__ should be the one to go.

Having first made the decision to reduce the number of finalization mechanisms 
to exactly one, I then have no problem with the idea of developing an easy to 
use weakref-based approach to replace the current __del__ (which may or may 
not be a magic method).

> So uhm, am I reading it bad or your implementation (like any other similar
> API I have seen till now) create a cycle *just* by using it?

To use Tim's terminology, the weakref (WR) and the callback (CB) are in a 
cycle with each other, so even after CB is invoked and removes WR from the 
global list of finalizers, the two objects won't go away until the next GC 
collection cycle. The weakly referenced object (WO) itself isn't part of the 
cycle and gets finalized at the first opportunity after its reference count 
goes to zero (as shown in my example - the finalizer ran without having to 
call gc.collect() first).

And don't forget that in non-refcounting implementations like Jython, 
IronPython and some flavours of PyPy, even non-cyclic garbage is collected 
through the GC mechanism at an arbitrary time after the last reference is 
released. If you need prompt finalization (for activities such as closing file 
handles or database connections), that's the whole reason the 'with' statement 
was added in Python 2.5.

All that aside, my example finalizer API only took an hour or two to write, 
compared to the significant amount of effort that has gone into the current 
__del__ implementation. There are actually a number of ways to write weakref 
based finalization that avoid that WR-CB cycle I used, but considering the 
trade-offs between those approaches is a lot more than a two-hour project 
(and, not the least bit incidentally, not an assessment I would really want to 
make on my own ;).

> This finalizer
> API ofhuscates user code by forcing to use a separate _data object to hold
> (part of) the context for apparently no good reason, and make the object
> collectable *only* through the cyclic GC (while __del__ would happily be
> invoked in simple cases when the object goes out of context).

It stores part of the context in a separate object for an *excellent* reason - 
it identifies clearly to the Python interpreter *which* parts of the object 
the finalizer can access. The biggest problem with __del__ is that it 
*doesn't* make that distinction, so the interpreter is forced to assume the 
finalizer might touch any part of the object (including the object itself), 
leading to all of the insanity with self-resurrection and the need to relegate 
things to gc.garbage. With a weakref-based approach, you only end up with two 
possible scenarios:

1. Object gets trashed and finalized
2. Object is kept immortal by a strong reference from the callback in the list 
of finalizers

By not teaching people who care about finalization the important
distinction between "things the finalizer can get at" and "things the object
can get at but the finalizer can't", you aren't doing them any favours,
because maintaining that distinction is the easiest way to avoid creating
uncollectable cycles (i.e. by making sure the finalizer can't get at the
other objects that might reference back to the current one).

Cheers,
Nick.


-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rrr at ronadam.com  Tue Sep 26 17:32:27 2006
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 26 Sep 2006 10:32:27 -0500
Subject: [Python-3000] Removing __del__
In-Reply-To: <efbejo$j7$1@sea.gmane.org>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>	<120f01c6e14a$15edbfd0$4bbd2997@bagio>	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>	<01d301c6e166$070bde90$e303030a@trilan>
	<efbejo$j7$1@sea.gmane.org>
Message-ID: <efbhcj$cb6$1@sea.gmane.org>

This was a bit too brief, I think...

Ron Adam wrote:

> How about...?  (This isn't an area I'm real familiar with.)
> 
> 
> Replace __del__ with:
> 
>     a __final__ method and a __finalized__ flag.  (or other equivalent names)

The __final__ method would need to be explicitly called, and the __finalized__ 
flag could be set either by the interpreter or the __final__ method when 
__final__ is called.  __final__ would never be called implicitly by the interpreter.

>     Insist on explicit finalizing by raising an exception if an object's
>     __finalized__ flag is still False when it loses its last reference.
> 
> 
> Would this be difficult to do in a timely way so the traceback is meaningful?
> 
> Would this avoid the problems being discussed with both __del__ and weak refs?
> 
> 
>     Ron

Maybe just adding an optional __finalized__ flag, which when False forces an
exception if an object loses its last reference, might be enough.

I think....

It's not the actual closing/finishing/etc... that is the problem; it's
detecting when the closing/finishing/etc... has not been done that is the
problem.

Cheers,
    Ron

From martin at v.loewis.de  Tue Sep 26 21:14:29 2006
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 26 Sep 2006 21:14:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
References: <45152234.1090303@v.loewis.de>
	<4517194D.1030908@nekomancer.net>	<20060924210217.0873.JCARLSON@uci.edu>	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
Message-ID: <45197C15.9040005@v.loewis.de>

Paul Prescod schrieb:
>  There is at least one big difference between surrogate pairs and
> decomposed characters. The user can typically normalize away
> decompositions. How do you normalize away decompositions in a language
> that only supports 16-bit representations?

I don't see the problem: You use UTF-16; all normal forms (NFC, NFD,
NFKC, NFKD) can be represented in UTF-16 just fine.

It is somewhat tricky to implement a normalization algorithm in
UTF-16, since you must combine surrogate pairs first in order to
find out what the canonical decomposition of the code point is;
but it's just more code, and no problem in principle.
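
(For reference, the combining step itself is simple arithmetic; an
illustrative helper:)

def combine_surrogates(hi, lo):
    # hi in U+D800..U+DBFF, lo in U+DC00..U+DFFF, per the UTF-16 definition.
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)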

Regards,
Martin

From qrczak at knm.org.pl  Tue Sep 26 21:20:24 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 26 Sep 2006 21:20:24 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45197C15.9040005@v.loewis.de> (Martin v.
	=?iso-8859-2?q?L=F6wis's?= message of "Tue, 26 Sep 2006 21:14:29 +0200")
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
	<45197C15.9040005@v.loewis.de>
Message-ID: <87ejtyjuyv.fsf@qrnik.zagroda>

"Martin v. L?wis" <martin at v.loewis.de> writes:

> It is somewhat tricky to implement a normalization algorithm in
> UTF-16, since you must combine surrogate pairs first in order to
> find out what the canonical decomposition of the code point is;
> but it's just more code, and no problem in principle.

The same issue arises with virtually any algorithm: more code, and more
complex code, is needed with UTF-16 than with UTF-32.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Tue Sep 26 21:25:08 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 26 Sep 2006 21:25:08 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ef8ugh$t6o$1@sea.gmane.org>
References: <4511E644.2030306@blueyonder.co.uk>	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>	<45158830.8020908@v.loewis.de>	<ef5ugf$d9a$1@sea.gmane.org>	<4516B2D0.9020109@v.loewis.de>
	<ef8ugh$t6o$1@sea.gmane.org>
Message-ID: <45197E94.3010502@v.loewis.de>

Fredrik Lundh schrieb:
>> I believe it would noticeably simplify the implementation if there is
>> only a single internal representation.
> 
> and I, wearing my string algorithm implementor hat, tend to disagree 
> with that.  writing source code that can be compiled into efficient code 
> for multiple representations is mostly trivial, even in C.

I wouldn't call SRE's macro trickeries "trivial", though.

Regards,
Martin

From paul at prescod.net  Tue Sep 26 22:44:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 26 Sep 2006 13:44:07 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45197C15.9040005@v.loewis.de>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
	<45197C15.9040005@v.loewis.de>
Message-ID: <1cb725390609261344m51297926tac13968f33eaee82@mail.gmail.com>

I misspoke. I meant to ask: "How do you normalize away surrogate pairs in
UTF-16?" It was a rhetorical question. The point was just that decomposed
characters can be handled by implicit or explicit normalization. Surrogate
pairs can only be similarly normalized away if your model allows you to
represent their normalized forms. A UTF-16 character model would not.

On 9/26/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
>
> Paul Prescod schrieb:
> >  There is at least one big difference between surrogate pairs and
> > decomposed characters. The user can typically normalize away
> > decompositions. How do you normalize away decompositions in a language
> > that only supports 16-bit representations?
>
> I don't see the problem: You use UTF-16; all normal forms (NFC, NFD,
> NFKC, NFKD) can be represented in UTF-16 just fine.
>
> It is somewhat tricky to implement a normalization algorithm in
> UTF-16, since you must combine surrogate pairs first in order to
> find out what the canonical decomposition of the code point is;
> but it's just more code, and no problem in principle.
>
> Regards,
> Martin
>

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:14 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:14 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <87lko63jes.fsf@qrnik.zagroda>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<87lko63jes.fsf@qrnik.zagroda>
Message-ID: <4519C77E.6020503@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> "It's a feature of Python's weakrefs too that when a weakref goes
> away, the callback (if any) associated with it is thrown away too,
> unexecuted."
> 
> I disagree with this choice. Doesn't it prevent weakrefs from being used
> as finalizers?

No, it's quite possible to build a finalization mechanism
on top of weakrefs.

To register a finalizer F for an object O, you create a
weak reference W to O and store it in a global list.
You give W a callback that invokes F and then removes
W from the global list.

Now there's no way that W can go away before its callback
is invoked, since that's the only thing that removes it
from the global list.
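
A minimal sketch of that scheme (names invented, untested):

import weakref

_live_finalizers = []   # the global list: keeps each W (and its callback) alive

def register_finalizer(obj, func, *args):
    """Register finalizer func(*args) to run after obj is collected."""
    def callback(wr):
        # This is the only thing that removes W from the global list,
        # so W cannot go away before the callback has run.
        _live_finalizers.remove(wr)
        func(*args)          # invoke F once O is gone
    w = weakref.ref(obj, callback)
    _live_finalizers.append(w)
    return w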

Furthermore, if the user makes a mistake and registers
a function F that references its own object O, directly
or indirectly, then eventually we will be left with a
cycle that's only being kept alive from the global list
via W and its callback. The cyclic GC can detect this
situation and move the cycle to a garbage list or
otherwise alert the user.

I don't believe that this mechanism would be any
harder to use *correctly* than __del__ methods
currently are, and mistakes made in using it would
be no harder to debug.

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:21 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:21 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609260622r43bcb1c0uc887b72e901a0701@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<4518F979.50902@canterbury.ac.nz>
	<fb6fbf560609260622r43bcb1c0uc887b72e901a0701@mail.gmail.com>
Message-ID: <4519C785.4010707@canterbury.ac.nz>

Jim Jewett wrote:

> Given this complexity, what advantage would it have over __del__, let
> alone __close__?

It wouldn't constitute an attractive nuisance, since
it would force you to think about which pieces of
information the finalizer really needs. This is
something you need to do anyway if you're to ensure
you don't get into trouble using __del__.

The supposed "easiness" of __del__ is really just
sloppiness that will turn around and bite you
eventually (if you'll excuse the mixed metaphor).

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:35 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:35 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <047f01c6e179$e7b74d40$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan>
Message-ID: <4519C793.2050800@canterbury.ac.nz>

Giovanni Bajo wrote:

> Is it easier or harder to detect such a cycle, compared to accidentally
> adding a reference to self (through implicit nested scopes, or bound
> methods) in the finalizer callback?

I would put a notice in the docs strongly recommending that
only global functions be registered as finalizers, not
nested functions or bound methods.

While not strictly necessary (or sufficient) for safety,
following this guideline would greatly reduce the chance
of accidentally creating troublesome cycles, I think.

And if you did accidentally create such a cycle, it seems
to me it would be much easier to fix than if you were
using __del__, since you only need to make an adjustment
to the parameter list of the finalizer.

With __del__, you need to refactor your whole finalization
strategy and create another object to do the finalization,
which is a much bigger upheaval.

> Most finalization APIs (including yours) create
> cycles just by being used, which also means that you *must* wait for the GC
> to kick in before the object is finalized

No, a weakref-based finalizer will kick in just as soon
as __del__ would. I don't know what makes you think
otherwise.

> will absolutely not solve the problem per se: the user will
> still have to pay attention and understand the hoops

Certainly, but it will make it much more obvious
that the hoops are there in the first place, and
exactly where and what shape they are.

> So, why do we not spend this same time trying to
> *fix* __del__ instead?

So far nobody has found a *way* to fix __del__
(really fix it, that is, not just paper over the
cracks). And a lot of smart people have given it
a lot of thought over the years.

If someone comes up with a way some day, we can
always put __del__ back in. But I don't feel like
holding my breath waiting for that to happen, when
we have something else that we know will work.

>>         # Use closure to get at weakref to allow direct invocation
>>         # This creates a cycle, so this approach relies on cyclic GC
>>         # to clean up the finalizer objects!

This implementation is broken. There's no need
to create any such cycle.

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:41 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:41 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <45194A46.90406@gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan> <45194A46.90406@gmail.com>
Message-ID: <4519C799.8000603@canterbury.ac.nz>

Nick Coghlan wrote:
> the weakref (WR) and the callback (CB) are in a 
> cycle with each other, so even after CB is invoked and removes WR from the 
> global list of finalizers, the two objects won't go away until the next GC 
> collection cycle.

The CB can drop its reference to the WR when it's invoked.

--
Greg


From ncoghlan at gmail.com  Wed Sep 27 03:36:15 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 11:36:15 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <4519C793.2050800@canterbury.ac.nz>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan>
	<4519C793.2050800@canterbury.ac.nz>
Message-ID: <4519D58F.5070103@gmail.com>

Greg Ewing wrote:
>>>         # Use closure to get at weakref to allow direct invocation
>>>         # This creates a cycle, so this approach relies on cyclic GC
>>>         # to clean up the finalizer objects!
> 
> This implementation is broken. There's no need
> to create any such cycle.

I know, but it was late and my brain wasn't up to the job of getting rid of it :)

Here's a pretty easy way to fix it to avoid relying on the cyclic GC (actually 
based on your other message about explicitly breaking the cycle when the 
finalizer is invoked):

import weakref

_finalizer_refs = set()
def finalizer(*args, **kwds):
      """Create a finalizer from an object, callback and keyword dictionary"""
      # Use positional args and a closure to avoid namespace collisions
      obj, callback = args
      def _finalizer(_ref=None):
          """Callable that invokes the finalization callback"""
          # Use closure to get at weakref to allow direct invocation
          try:
              ref = boxed_ref.pop()
          except IndexError:
              pass
          else:
              _finalizer_refs.remove(ref)
              callback(_finalizer)
      # Give callback access to keyword arguments
      _finalizer.__dict__ = kwds
      boxed_ref = [weakref.ref(obj, _finalizer)]
      _finalizer_refs.add(boxed_ref[0])
      return _finalizer

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jcarlson at uci.edu  Thu Sep 28 01:32:33 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 27 Sep 2006 16:32:33 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45176886.2090201@v.loewis.de>
References: <20060924144006.086D.JCARLSON@uci.edu>
	<45176886.2090201@v.loewis.de>
Message-ID: <20060927153914.089D.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson schrieb:
> > What about a tree structure over the top of the string as I described in
> > another post?  If there are no surrogate pairs, the pointer to the tree
> > is null.  If there are surrogate pairs, we could either use the
> > structure as I described, or even modify it so that we get even better
> > memory utilization/performance (choose tree nodes based on where
> > surrogate pairs are, up to some limit).
> 
> As always, it's a time-vs-space tradeoff. People tend to resolve these
> in favor of time, accepting an increase in space. I'm not so sure this
> is always the right answer. In the specific case, I'm also worried about
> the increase in complexness.
> 
> That said, it is always good to have a prototype implementation to
> analyse the consequences better.

I'm away from my main machine at the moment, so I am unable to test my
implementation, but I do have a sample.

There are two main functions in this implementation: one constructs a tree
for O(log n) worst-case access to character addresses, and one traverses the
tree to discover the character address.  For strings without surrogates,
character address discovery is O(1).  The implementation of surrogate
discovery is very simple, using sections 3.8 and 5.4 of the Unicode 4.0
standard.

If there are no surrogates, it takes a single pass over the input, and
constructs a single node (12 or 24 bytes, depending on the build, need
to replace long with Py_ssize_t).  If there are surrogates, it creates a
block of nodes, adjusts pointers to create a tree, and returns a pointer
to the root.  The tree will have at most O(n/logn) nodes, though it will
tend to create long blocks of non-surrogates, so that if you have a
single surrogate in the middle of a huge string, it will be conceptually
viewed as 3 blocks.
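
For illustration only, the lookup side of this can be sketched in pure
Python as a sorted table of block starts plus a binary search (the
attachment implements an actual tree in C; all names here are invented):

from bisect import bisect_right

def build_index(units):
    # units: a sequence of UTF-16 code units.  Record the (character,
    # code-unit) positions at which blocks start; a new block begins
    # right after every surrogate pair.  With no surrogates the table
    # keeps a single entry and lookups stay O(1).
    char_starts, unit_starts = [0], [0]
    char = unit = 0
    n = len(units)
    while unit < n:
        if (0xD800 <= units[unit] <= 0xDBFF and unit + 1 < n
                and 0xDC00 <= units[unit + 1] <= 0xDFFF):
            unit += 2   # a surrogate pair: one character, two code units
            char += 1
            char_starts.append(char)
            unit_starts.append(unit)
        else:
            unit += 1
            char += 1
    return char_starts, unit_starts

def char_to_unit(index, i):
    # O(log n) worst case; inside a block there are no pairs, so the
    # remaining offset maps one-to-one.
    char_starts, unit_starts = index
    k = bisect_right(char_starts, i) - 1
    return unit_starts[k] + (i - char_starts[k])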


Attached is my untested sample implementation (I'm away for the next
week or so, and can't test), that should give an idea of what I was
talking about.

 - Josiah
-------------- next part --------------
A non-text attachment was scrubbed...
Name: surrogate_tree.c
Type: application/octet-stream
Size: 4688 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20060927/8a293362/attachment.obj 

From martin at v.loewis.de  Thu Sep 28 05:21:52 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 28 Sep 2006 05:21:52 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060927153914.089D.JCARLSON@uci.edu>
References: <20060924144006.086D.JCARLSON@uci.edu>
	<45176886.2090201@v.loewis.de>
	<20060927153914.089D.JCARLSON@uci.edu>
Message-ID: <451B3FD0.9030600@v.loewis.de>

Josiah Carlson schrieb:
> Attached is my untested sample implementation (I'm away for the next
> week or so, and can't test), that should give an idea of what I was
> talking about.

Thanks. It is hard to tell what the impact on the implementation is.
For example, ISTM that you have to regenerate the tree each time
a new string is created. E.g. if you slice a string, you would
have to regenerate the tree for the slice. Right?

As for the implementation: If you are using an array-based heap,
couldn't you just drop the left and right child pointers, and
instead use indices 2*k and 2*k+1 to find the child nodes?
This would cut memory overhead significantly; you'd only
need the length of the array to determine what a leaf node is.
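
(A sketch of that indexing, using the conventional 1-based layout:)

def children(k):
    # Implicit binary tree stored in an array (1-based): node k's
    # children live at fixed slots, so no pointers are stored.
    return 2 * k, 2 * k + 1

def is_leaf(k, n):
    # With n nodes, k is a leaf iff its first child index falls
    # past the end of the array.
    return 2 * k > n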

Regards,
Martin

From jcarlson at uci.edu  Thu Sep 28 05:49:38 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 27 Sep 2006 20:49:38 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <451B3FD0.9030600@v.loewis.de>
References: <20060927153914.089D.JCARLSON@uci.edu>
	<451B3FD0.9030600@v.loewis.de>
Message-ID: <20060927204323.08A4.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson schrieb:
> > Attached is my untested sample implementation (I'm away for the next
> > week or so, and can't test), that should give an idea of what I was
> > talking about.
> 
> Thanks. It is hard to tell what the impact on the implementation is.
> For example, ISTM that you have to regenerate the tree each time
> a new string is created. E.g. if you slice a string, you would
> have to regenerate the tree for the slice. Right?

Generally, yes.  We could use the pre-existing tree information, but
it would probably be simpler (and faster) to scan the string and
re-create it.

Really, one would create the tree when someone wants to access an index
for the first time (or during creation, for fewer surprises), then use
the index finding function to return the address of character i.

> As for the implementation: If you are using a array-based heap,
> couldn't you just drop the left and right child pointers, and
> instead use indices 2*k and 2*k+1 to find the child nodes?
> This would get down memory overhead significantly; you'd only
> need the length of the array to determine what a leaf node is.

Good point.  I had originally malloced each node individually, but I
overlooked the heap optimization when I went with that style of construction.

 - Josiah